linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Re: [RFC][PATCH] new timeofday core subsystem (v.A0)
       [not found] ` <1094224071.431.7758.camel@cube>
@ 2004-09-06  6:08   ` Ulrich Windl
  2004-09-12 17:11     ` Albert Cahalan
  0 siblings, 1 reply; 60+ messages in thread
From: Ulrich Windl @ 2004-09-06  6:08 UTC (permalink / raw)
  To: Albert Cahalan; +Cc: john stultz, linux-kernel

On 3 Sep 2004 at 11:07, Albert Cahalan wrote:

> On Fri, 2004-09-03 at 05:08, Ulrich Windl wrote:
> > On 3 Sep 2004 at 2:42, Albert Cahalan wrote:
> 
> > > HZ not being HZ is the source of many foul problems.
> > > 
> > > NTP should be able to correct for the error. For systems
> > > not running NTP, provide a fake NTP to make corrections
> > > based on the expected frequency error.
> > > 
> > > Based on that, skip or double-up on the ticks to make
> > > them be exactly HZ over long periods of time.
> > 
> > I think nobody wants a sawtooth-like timing. Time should
> > proceed as smoothly as possible.
> 
> Of course, and all hardware should have ideal clocks.
> Now, back to the real world...

Albert,

I am serious.

> 
> The kernel is broken if:
> 
> a. HZ is not really HZ

HZ is an integer however.

> b. Timekeeping is via 2 unrelated clocks. (jiffies+offset)

What do you do if one clock lacks resolution, and the other clock lacks digits? 
The interrupt clock may suck, but the TSC sucks even more.

> c. HZ is not an integer

HZ is HZ, but the true interrupt frequency is something completely different.

> 
> So on box using only clock ticks, steer jiffies
> toward HZ using NTP (or default frequency error value).

Are you saying you are happy with a lcok that less than 1 HZ off? That would be -- 
in a extreme case -- about 86000 seconds off per day.

> On a box with high-res time, use that instead, and make
> jiffies follow it to satisfy various kernel-internal
> uses of jiffies.

Make jiffies follow variable CPU clock? Are you serious?

> 
> Look, if HZ won't be HZ then you can just remove
> the "HZ" define from the kernel. It's useless.

It's not useless, it's a historic standard.

> 
> I think we're all sick of the recent time-related
> bugs. I could go for ripping out all the fancy and

Yes.

> broken stuff that was added recently, replacing it
> with the simple Linux 2.4.xx or 2.2.xx code. Swiping

It depends on what you want. The kernel really needs a working framework for 
nanoseconds; at least regarding the variables' precision.

> code from DragonflyBSD would be worth investigating.
> 
> 
> 

Regards,
Ulrich




^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [RFC][PATCH] new timeofday core subsystem (v.A0)
  2004-09-06  6:08   ` [RFC][PATCH] new timeofday core subsystem (v.A0) Ulrich Windl
@ 2004-09-12 17:11     ` Albert Cahalan
  0 siblings, 0 replies; 60+ messages in thread
From: Albert Cahalan @ 2004-09-12 17:11 UTC (permalink / raw)
  To: Ulrich Windl; +Cc: Albert Cahalan, john stultz, linux-kernel mailing list

On Mon, 2004-09-06 at 02:08, Ulrich Windl wrote:
> On 3 Sep 2004 at 11:07, Albert Cahalan wrote:

> > The kernel is broken if:
> > 
> > a. HZ is not really HZ
> 
> HZ is an integer however.

Yes, and this is the rate at which jiffies must tick.
At no time should jiffies be more than one tick away
from the value it would have according to HZ.

(where "tick" is the slower of the two)

> > b. Timekeeping is via 2 unrelated clocks. (jiffies+offset)
> 
> What do you do if one clock lacks resolution, and the other clock lacks digits? 
> The interrupt clock may suck, but the TSC sucks even more.

You choose one. If you have sucky hardware, that's the
best that can be done.

If by "lacks digits" you mean something like the 24-bit
cycle counter on the Alpha, faking a 64-bit counter is
not terribly hard. I've seen it done for the SHARC DSP.
You'll just need an interrupt before roll-over.

> > c. HZ is not an integer
> 
> HZ is HZ, but the true interrupt frequency is something completely different.

I could run the interrupts at 4152.0625 Hz and still
produce a jiffies tick at 1000.0000 Hz over the long
term. At any given point in time, jiffies won't be
off by more than 1 tick.

I could run the interrupts at 666 Hz too, and not be
off by more than 2 jiffies ticks.

Having the interrupt frequency be _almost_ equal to
the HZ value is kind of bad, because it tempts people to
equate them. When the frequencies are really different,
the errors and solutions become more obvious.

> > So on box using only clock ticks, steer jiffies
> > toward HZ using NTP (or default frequency error value).
> 
> Are you saying you are happy with a lcok that less than 1 HZ off? That would be -- 
> in a extreme case -- about 86000 seconds off per day.

Less than 1/HZ off, yes. That's 1 millisecond max error
over any time period from nanoseconds to years.

> > On a box with high-res time, use that instead, and make
> > jiffies follow it to satisfy various kernel-internal
> > uses of jiffies.
> 
> Make jiffies follow variable CPU clock? Are you serious?

You can not use a variable clock. End of story.

Well, if you do, you pay the price. (time goes crazy every
now and then, until a very-forgiving ntpd adjusts things)
This would not be a sane default.

If you want high-res time on a PC, your choices:

a. read the PIT registers (slow)
b. have a PC with the HPET
c. have a PC with a stable TSC
d. some ACPI PM thing?

If your hardware isn't suited for high-res time,
nothing will save you. Linux got along OK for a
decade without high-res time at all; you'll live.



^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [RFC][PATCH] new timeofday core subsystem (v.A0)
  2004-09-15 18:48                                                   ` john stultz
  2004-09-15 19:58                                                     ` George Anzinger
  2004-09-15 20:20                                                     ` Christoph Lameter
@ 2004-09-16  7:02                                                     ` Ulrich Windl
  2 siblings, 0 replies; 60+ messages in thread
From: Ulrich Windl @ 2004-09-16  7:02 UTC (permalink / raw)
  To: john stultz
  Cc: george anzinger, Albert Cahalan, lkml, tim, Ulrich.Windl,
	Len Brown, linux, David Mosberger, Andi Kleen, paulus,
	schwidefsky, jimix, keith maanthey, greg kh, Patricia Gaughen,
	Chris McDermott

On 15 Sep 2004 at 11:48, john stultz wrote:

> NTP adjustments need to be applied smoothly and consistently as time
> progresses. Adjustment to the NTP parameters occur at do_adjtimex and at

Actually I think (constant) NTP adjustments need to be applied any time. That's 
why I even tried to correct the TSC interpolation between ticks (If you visit the 
PPSkit patch, that's the "nanoscale" thing). The "constant adjustment values" are 
updated regularly also (per tick, per second, per NTP update).

The basic thought was: If the time between two ticks is not 1/HZ the TSC 
interpolation (that is calibrated against the timer ticks) also will not 
interpolate between 1/HZ ticks. I assumed the TSC interpolations is wrong by the 
same percentage the timer interrupts are. Thus I tried to reversely apply the 
frequency error learnt through NTP even to the TSC interpolation (in addition to 
the time increments every tick). My code however has the problem that I ran out of 
significant bits during several muliplies and divides, while supporting that 
maximum imaginable frequency range for TSC. Even with the best oscillator, it 
won't reach single nanosecond precision as I don't have enough bits for them.

Regards,
Ulrich



^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [RFC][PATCH] new timeofday core subsystem (v.A0)
  2004-09-15 18:48                                                   ` john stultz
  2004-09-15 19:58                                                     ` George Anzinger
@ 2004-09-15 20:20                                                     ` Christoph Lameter
  2004-09-16  7:02                                                     ` Ulrich Windl
  2 siblings, 0 replies; 60+ messages in thread
From: Christoph Lameter @ 2004-09-15 20:20 UTC (permalink / raw)
  To: john stultz
  Cc: george anzinger, Albert Cahalan, lkml, tim, Ulrich.Windl,
	Len Brown, linux, David Mosberger, Andi Kleen, paulus,
	schwidefsky, jimix, keith maanthey, greg kh, Patricia Gaughen,
	Chris McDermott

On Wed, 15 Sep 2004, john stultz wrote:

> On Wed, 2004-09-15 at 10:30, Christoph Lameter wrote:
> > On Wed, 15 Sep 2004, john stultz wrote:
> >
> > > Well, not all time sources have that feature. Both TSC, and cyclone
> > > don't. You'd have to do something else for those. This is why my
> > > proposal has absolutely nothing to do with interrupt generation. It has
> > > a single hook that needs to be called only every once in a while, which
> > > can be done however any architecture wants.
> > >
> > > Interrupt generation has more to do with soft-timers and scheduling then
> > > time of day anyway.
> >
> > Hmm... I think this is a serious issue. The ability to exactly time an
> > interrupt and therefore specific actions is important. Maybe we can have a
> > fall back to soft interrupts if interrupt_at == NULL?
>
> Oh, its an issue, but one the time of day subsystem shouldn't solve. The
> code that manages interval (be it timer or whatever) interrupts should
> use the time of day subsystem to ensure they are triggering accurately,
> and manipulate the intervals as needed.

Huh? These are time related events. They cannot be managed effectively
outside of the time subsystem. You need at least a regularly interrupting
time source even for the current tick based approach.

> I think the functionality is essential, but that it doesn't belong in the time of day code.

Hmm. I was thinking all of this was about managing
the various time sources not only the time of day code.

> Basically we have two things we're trying to do:
>
> 1. Keep accurate time
> 2. Generate hardware interrupts accurately
>
> While frequently the same hardware can do both, not all hardware is
> usable for both functions. Thus I believe we should cleanly split these
> two subsystems. My proposal only provided the keep accurate time part,
> however one could using that functionality, to then manipulate hardware
> interrupts to ensure accuracy in the timer subsystem.

These two are intrisincally related and all modern hardware that I know
of does both. In a sense the regularly occuring  timer tick is also a
yet another form of time source but not does provide a counter (unless
simulated by jiffies). The timer subsystem needs to manage all
the available time sources of the system effectively and provide time
services to other subsystems.


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [RFC][PATCH] new timeofday core subsystem (v.A0)
  2004-09-15 18:48                                                   ` john stultz
@ 2004-09-15 19:58                                                     ` George Anzinger
  2004-09-15 20:20                                                     ` Christoph Lameter
  2004-09-16  7:02                                                     ` Ulrich Windl
  2 siblings, 0 replies; 60+ messages in thread
From: George Anzinger @ 2004-09-15 19:58 UTC (permalink / raw)
  To: john stultz
  Cc: Christoph Lameter, Albert Cahalan, lkml, tim, Ulrich.Windl,
	Len Brown, linux, David Mosberger, Andi Kleen, paulus,
	schwidefsky, jimix, keith maanthey, greg kh, Patricia Gaughen,
	Chris McDermott

john stultz wrote:
~

> 
> 
>>Real time features such as posix-timer's also depend on the ability to
>>deliver a signal at an exact point in time. Soft timers can only give a
>>very rough approximation in these cases.
>>
>>So I think this feature is essential.
> 
> 
> I think the functionality is essential, but that it doesn't belong in the time of day code.
> 
> Basically we have two things we're trying to do: 
> 
> 1. Keep accurate time 
> 2. Generate hardware interrupts accurately
> 
> While frequently the same hardware can do both, not all hardware is
> usable for both functions. Thus I believe we should cleanly split these
> two subsystems. My proposal only provided the keep accurate time part,
> however one could using that functionality, to then manipulate hardware
> interrupts to ensure accuracy in the timer subsystem.
> 
The thing I think is missing in all this is that, in some platforms, the 
hardware to provide the interrupts is more accurate.  We have, IMHO, three cases 
here:
a) The interrupts are accurate but the clock info (e.g. TSC) is not.
b) The clock in accurate but the timer is not, and
c) The clock and interrupt come from the same accurate hardware source.

The X86 is in class a) in that the PIT is accurate and the TSC is not.  The 
muddy part here is the pm-timer which is accurate but takes a _long_ time to access.

PPCs are in class c) as are some MIPS, ARM, and PARISC.  I am not sure about the 
rest and can not lay my hands on one of class b).

For the class a) machines, I think the best approach is to use a clock that has 
reasonable short term accuracy and to use the timer to, from time to time, 
correct it.  This should be on the order of an internal NTP sort of correction.

For the class a) and b) machines, I think it would be wise to not seperated the 
timer and the clock so much as to make it "hard" to use one to correct the other.
-- 
George Anzinger   george@mvista.com
High-res-timers:  http://sourceforge.net/projects/high-res-timers/
Preemption patch: http://www.kernel.org/pub/linux/kernel/people/rml


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [RFC][PATCH] new timeofday core subsystem (v.A0)
  2004-09-15 17:30                                                 ` Christoph Lameter
@ 2004-09-15 18:48                                                   ` john stultz
  2004-09-15 19:58                                                     ` George Anzinger
                                                                       ` (2 more replies)
  0 siblings, 3 replies; 60+ messages in thread
From: john stultz @ 2004-09-15 18:48 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: george anzinger, Albert Cahalan, lkml, tim, Ulrich.Windl,
	Len Brown, linux, David Mosberger, Andi Kleen, paulus,
	schwidefsky, jimix, keith maanthey, greg kh, Patricia Gaughen,
	Chris McDermott

On Wed, 2004-09-15 at 10:30, Christoph Lameter wrote:
> On Wed, 15 Sep 2004, john stultz wrote:
> 
> > Well, not all time sources have that feature. Both TSC, and cyclone
> > don't. You'd have to do something else for those. This is why my
> > proposal has absolutely nothing to do with interrupt generation. It has
> > a single hook that needs to be called only every once in a while, which
> > can be done however any architecture wants.
> >
> > Interrupt generation has more to do with soft-timers and scheduling then
> > time of day anyway.
> 
> Hmm... I think this is a serious issue. The ability to exactly time an
> interrupt and therefore specific actions is important. Maybe we can have a
> fall back to soft interrupts if interrupt_at == NULL?

Oh, its an issue, but one the time of day subsystem shouldn't solve. The
code that manages interval (be it timer or whatever) interrupts should
use the time of day subsystem to ensure they are triggering accurately,
and manipulate the intervals as needed. 

> If the scheduler could assign dynamic time slices to processes then new
> more effective designs of process scheduling become possible. F.e. on an
> SMP system if a single process is running and no other contention exists
> then the scheduler can simply let the process run without interruption.
> 
> On the other hand the system may reduce the time slices assigned to a
> process that causes significant load on the system in order to insure the
> responsiveness of other applications despite the load.

Yep, all of this is outside what the time of day subsystem should do.
Interval interrupt scheduling should be done by other code (I suggest
the timer code do it).


> If one can schedule a tick dynamically then also NTP time correction can
> be "scheduled" when it is likely that the clock has sufficiently deviated
> from standard time. As soon as the time source has been sufficiently tamed
> by scaling it correctly then the periods of checkup on the time source
> could be gradually expanded.

NTP adjustments need to be applied smoothly and consistently as time
progresses. Adjustment to the NTP parameters occur at do_adjtimex and at
interval boundaries (ie: the interrupt_hook in my code, or look at
update_wall_time and second_overflow in the existing code). So in my
understanding its not really a schedulable function. 


> Real time features such as posix-timer's also depend on the ability to
> deliver a signal at an exact point in time. Soft timers can only give a
> very rough approximation in these cases.
> 
> So I think this feature is essential.

I think the functionality is essential, but that it doesn't belong in the time of day code.

Basically we have two things we're trying to do: 

1. Keep accurate time 
2. Generate hardware interrupts accurately

While frequently the same hardware can do both, not all hardware is
usable for both functions. Thus I believe we should cleanly split these
two subsystems. My proposal only provided the keep accurate time part,
however one could using that functionality, to then manipulate hardware
interrupts to ensure accuracy in the timer subsystem.

thanks
-john


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [RFC][PATCH] new timeofday core subsystem (v.A0)
  2004-09-15 18:00                                               ` George Anzinger
@ 2004-09-15 18:28                                                 ` Christoph Lameter
  0 siblings, 0 replies; 60+ messages in thread
From: Christoph Lameter @ 2004-09-15 18:28 UTC (permalink / raw)
  To: George Anzinger
  Cc: john stultz, Albert Cahalan, lkml, tim, Ulrich.Windl, Len Brown,
	linux, David Mosberger, Andi Kleen, paulus, schwidefsky, jimix,
	keith maanthey, greg kh, Patricia Gaughen, Chris McDermott


On Wed, 15 Sep 2004, George Anzinger wrote:
> > I am not following you here. Why does the context switch overhead
> > increase? Because there are multiple interrupts for different tasks done
> > in the tick?
> >
> Each task has several timers, i.e. time slice, time limit, and possibly itimer
> profile.  Granted only one of these needs to be sent to the timer code, but that
> takes a bit of time, not much, but enough to increase the context switch
> overhead such that a system with a modest amount of context switching will incur
> more timer management overhead than the periodic tick generates.

I thought that the timer handling could be separate from the time slice
handling of the scheduler. With the ability schedule events as needed the
scheduler could generate its own independent timing from the necessary
clock adjustments. The tick would be disassembled into its various
components.


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [RFC][PATCH] new timeofday core subsystem (v.A0)
  2004-09-15 15:46                                             ` Christoph Lameter
@ 2004-09-15 18:00                                               ` George Anzinger
  2004-09-15 18:28                                                 ` Christoph Lameter
  0 siblings, 1 reply; 60+ messages in thread
From: George Anzinger @ 2004-09-15 18:00 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: john stultz, Albert Cahalan, lkml, tim, Ulrich.Windl, Len Brown,
	linux, David Mosberger, Andi Kleen, paulus, schwidefsky, jimix,
	keith maanthey, greg kh, Patricia Gaughen, Chris McDermott

Christoph Lameter wrote:
~

>>>One could do this but we want to have a tickless system. The tick is only
>>>necessary if the time needs to be adjusted.
>>
>>I really think a tickless system, for other than UML systems, is a loosing
>>thing.  The accounting overhead on context switch (which increases as the number
>>of switchs per second) will cause more overhead than a periodic accounting tick
>>once a respectable load appears.  The periodic accounting tick has a flat
>>overhead that does not depend on load.
> 
> 
> I am not following you here. Why does the context switch overhead
> increase? Because there are multiple interrupts for different tasks done
> in the tick?
> 
Each task has several timers, i.e. time slice, time limit, and possibly itimer 
profile.  Granted only one of these needs to be sent to the timer code, but that 
takes a bit of time, not much, but enough to increase the context switch 
overhead such that a system with a modest amount of context switching will incur 
more timer management overhead than the periodic tick generates.

-- 
George Anzinger   george@mvista.com
High-res-timers:  http://sourceforge.net/projects/high-res-timers/
Preemption patch: http://www.kernel.org/pub/linux/kernel/people/rml


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [RFC][PATCH] new timeofday core subsystem (v.A0)
  2004-09-15  8:54                                             ` Dominik Brodowski
@ 2004-09-15 17:54                                               ` George Anzinger
  0 siblings, 0 replies; 60+ messages in thread
From: George Anzinger @ 2004-09-15 17:54 UTC (permalink / raw)
  To: Dominik Brodowski
  Cc: Christoph Lameter, john stultz, Albert Cahalan, lkml, tim,
	Ulrich.Windl, Len Brown, David Mosberger, Andi Kleen, paulus,
	schwidefsky, jimix, keith maanthey, greg kh, Patricia Gaughen,
	Chris McDermott

Dominik Brodowski wrote:
> On Wed, Sep 15, 2004 at 01:04:04AM -0700, George Anzinger wrote:
> 
>>>One could do this but we want to have a tickless system. The tick is only
>>>necessary if the time needs to be adjusted.
>>
>>I really think a tickless system, for other than UML systems, is a loosing 
>>thing.  The accounting overhead on context switch (which increases as the 
>>number of switchs per second) will cause more overhead than a periodic 
>>accounting tick once a respectable load appears.
> 
> 			 ^^^^^^^^^^^^^^^^
> 
> On a largely idle system (like notebooks on battery power in typical use)
> the accounting overhead will be less a problem. However, the CPU being 
> woken up each millisecond will cause an increased battery usage. So if 
> the load is less than a certain threshold, tickless systems do make much 
> sense.

At MontaVista I have been working on a thing we call VST which looks ahead in 
the timer list and, finding nothing for N ticks, turns off the ticker until that 
time.  It is not tickless, unless the system is idle, but then it can go 
tickless for as long as the max value that can be programmed on the wakeup 
timer.  Interrupts prior to that time will, of course, also wake the system.

Seems like the best of both worlds to me.

An early version of this is on the HRT sourceforge site.


-- 
George Anzinger   george@mvista.com
High-res-timers:  http://sourceforge.net/projects/high-res-timers/
Preemption patch: http://www.kernel.org/pub/linux/kernel/people/rml


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [RFC][PATCH] new timeofday core subsystem (v.A0)
  2004-09-15 17:13                                               ` john stultz
@ 2004-09-15 17:30                                                 ` Christoph Lameter
  2004-09-15 18:48                                                   ` john stultz
  0 siblings, 1 reply; 60+ messages in thread
From: Christoph Lameter @ 2004-09-15 17:30 UTC (permalink / raw)
  To: john stultz
  Cc: george anzinger, Albert Cahalan, lkml, tim, Ulrich.Windl,
	Len Brown, linux, David Mosberger, Andi Kleen, paulus,
	schwidefsky, jimix, keith maanthey, greg kh, Patricia Gaughen,
	Chris McDermott

On Wed, 15 Sep 2004, john stultz wrote:

> On Wed, 2004-09-15 at 09:46, Christoph Lameter wrote:
> > On Wed, 15 Sep 2004, john stultz wrote:
> >
> > > > struct time_source_t {
> > > > 	u8 type;
> > > > 	u8 shift;
> > > > 	u32 multiply;
> > > > 	void *address;
> > > > 	u64 mask;
> > > > 	int (*interrupt_at)(u64 counter_value);
> > > > };
> > >
> > > Could you explain the interrupt_at function further?
> >
> > Generates a timer interrupt when a certain counter value has been reached.
> > This is a common feature of most clock chips that I am aware of.
>
> Well, not all time sources have that feature. Both TSC, and cyclone
> don't. You'd have to do something else for those. This is why my
> proposal has absolutely nothing to do with interrupt generation. It has
> a single hook that needs to be called only every once in a while, which
> can be done however any architecture wants.
>
> Interrupt generation has more to do with soft-timers and scheduling then
> time of day anyway.

Hmm... I think this is a serious issue. The ability to exactly time an
interrupt and therefore specific actions is important. Maybe we can have a
fall back to soft interrupts if interrupt_at == NULL?

If the scheduler could assign dynamic time slices to processes then new
more effective designs of process scheduling become possible. F.e. on an
SMP system if a single process is running and no other contention exists
then the scheduler can simply let the process run without interruption.

On the other hand the system may reduce the time slices assigned to a
process that causes significant load on the system in order to insure the
responsiveness of other applications despite the load.

If one can schedule a tick dynamically then also NTP time correction can
be "scheduled" when it is likely that the clock has sufficiently deviated
from standard time. As soon as the time source has been sufficiently tamed
by scaling it correctly then the periods of checkup on the time source
could be gradually expanded.

Real time features such as posix-timer's also depend on the ability to
deliver a signal at an exact point in time. Soft timers can only give a
very rough approximation in these cases.

So I think this feature is essential.


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [RFC][PATCH] new timeofday core subsystem (v.A0)
  2004-09-15 16:46                                             ` Christoph Lameter
@ 2004-09-15 17:13                                               ` john stultz
  2004-09-15 17:30                                                 ` Christoph Lameter
  0 siblings, 1 reply; 60+ messages in thread
From: john stultz @ 2004-09-15 17:13 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: george anzinger, Albert Cahalan, lkml, tim, Ulrich.Windl,
	Len Brown, linux, David Mosberger, Andi Kleen, paulus,
	schwidefsky, jimix, keith maanthey, greg kh, Patricia Gaughen,
	Chris McDermott

On Wed, 2004-09-15 at 09:46, Christoph Lameter wrote:
> On Wed, 15 Sep 2004, john stultz wrote:
> 
> > > struct time_source_t {
> > > 	u8 type;
> > > 	u8 shift;
> > > 	u32 multiply;
> > > 	void *address;
> > > 	u64 mask;
> > > 	int (*interrupt_at)(u64 counter_value);
> > > };
> >
> > Could you explain the interrupt_at function further?
> 
> Generates a timer interrupt when a certain counter value has been reached.
> This is a common feature of most clock chips that I am aware of.

Well, not all time sources have that feature. Both TSC, and cyclone
don't. You'd have to do something else for those. This is why my
proposal has absolutely nothing to do with interrupt generation. It has
a single hook that needs to be called only every once in a while, which
can be done however any architecture wants. 

Interrupt generation has more to do with soft-timers and scheduling then
time of day anyway. 

> > > 	make_time_source(1200000000, TIME_SOURCE_FUNCTION, cpu_time_filtered, 44, itm_setup);
> >
> > Out of curiosity, what happens in your fsyscall code in these cases
> > where TIME_SOURCE_FUNCTION is used?
> 
> It backs out of the fast call track and does a regular call instead.

Ok, just curious. 

> > > /* simulation of the old tick behavior */
> > > tick(u64 when) {
> > > 	/* do tick stuff */
> > >
> > > 	/* time base update .... */
> > > 	time_source_adjust(0);
> > > 	/* Schedule next timer tick */
> > > 	event_new(when + NSEC_PER_SEC / HZ , tick,  when + NSEC_PER_SEC / HZ);
> > > }
> >
> > Hmm. This is getting somewhat tangled in my mind. Who is calling tick()
> > originally? How is event_run being called?
> 
> This is just some toying around with the concept. Maybe tick could be
> called by the regular timer interrupt?

Ok, no worries. Its just one of the big reasons I started my earlier
proposal is to lessen the dependencies and cleanup the interactions
between the interrupt generation, soft-timer and the time of day code. 


> > > /* New tick would be scheduled by the ntp logic when a correction is needed.
> > >  * ntp logic needs to decide when to skip a few nanosecond or slow down the clock or
> > >  * make the clock run faster.
> > >  * One way to do this is to accumulate a time difference to real time.
> > >  * if this time difference is small and positive then skip time forward a bit.
> > >  * if the time difference is negative then slow down the clock.
> > >  * if the time difference is way too high then accelerate the clock
> > >  */
> >
> > Well, this is still a bit vague.  do_adjtimex gives us 4 values we need
> > to use in adjusting time. The parts-per-million (ppm - all of which are
> > signed) tick adjustment value, the ppm frequency adjustment, the ppm
> > offset adjustment, and the singleshot offset adjustment length (# of
> > nsecs in which we apply the maximum ppm singleshot offset adjustment).
> >
> > How do these 4 values get translated to adjusting the time source?
> 
> I agree this is vague and I have no clue here. I hoped that you
> could come up with some may to use these functions for adjtimex etc? I
> would need some time to figure out how the adjusting works in order to
> come up with a solution. I thought we agreed you do the NTP and time
> adjustment things ;-) I have never touched that code....

Heh. Well, I don't think anyone wants to do the NTP work, but I did have
to go through it to implement my proposal. Unfortunately it isn't
something that can easily be tacked on to any design, so you're going to
have to understand some aspects of it if you want to do your own
design.  That said, if you have any questions, hopefully I can explain
my interpretation of the existing code as well as my own implementation.

> > You might want to just swipe my timeofday-core patch and re-work the
> > timesource.h interface to your liking (make it like time_source_t). That
> > way you get all the NTP details as well as the interrupt separation for
> > free.
> 
> Will have a look at it and put it together when I have some time.

Sure thing, and I'll try to take a stab at it myself after next week.


thanks
-john


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [RFC][PATCH] new timeofday core subsystem (v.A0)
  2004-09-15 16:32                                           ` john stultz
@ 2004-09-15 16:46                                             ` Christoph Lameter
  2004-09-15 17:13                                               ` john stultz
  0 siblings, 1 reply; 60+ messages in thread
From: Christoph Lameter @ 2004-09-15 16:46 UTC (permalink / raw)
  To: john stultz
  Cc: george anzinger, Albert Cahalan, lkml, tim, Ulrich.Windl,
	Len Brown, linux, David Mosberger, Andi Kleen, paulus,
	schwidefsky, jimix, keith maanthey, greg kh, Patricia Gaughen,
	Chris McDermott

On Wed, 15 Sep 2004, john stultz wrote:

> > struct time_source_t {
> > 	u8 type;
> > 	u8 shift;
> > 	u32 multiply;
> > 	void *address;
> > 	u64 mask;
> > 	int (*interrupt_at)(u64 counter_value);
> > };
>
> Could you explain the interrupt_at function further?

Generates a timer interrupt when a certain counter value has been reached.
This is a common feature of most clock chips that I am aware of.

> > 	make_time_source(1200000000, TIME_SOURCE_FUNCTION, cpu_time_filtered, 44, itm_setup);
>
> Out of curiosity, what happens in your fsyscall code in these cases
> where TIME_SOURCE_FUNCTION is used?

It backs out of the fast call track and does a regular call instead.

> > /* simulation of the old tick behavior */
> > tick(u64 when) {
> > 	/* do tick stuff */
> >
> > 	/* time base update .... */
> > 	time_source_adjust(0);
> > 	/* Schedule next timer tick */
> > 	event_new(when + NSEC_PER_SEC / HZ , tick,  when + NSEC_PER_SEC / HZ);
> > }
>
> Hmm. This is getting somewhat tangled in my mind. Who is calling tick()
> originally? How is event_run being called?

This is just some toying around with the concept. Maybe tick could be
called by the regular timer interrupt?

>
> > /* New tick would be scheduled by the ntp logic when a correction is needed.
> >  * ntp logic needs to decide when to skip a few nanosecond or slow down the clock or
> >  * make the clock run faster.
> >  * One way to do this is to accumulate a time difference to real time.
> >  * if this time difference is small and positive then skip time forward a bit.
> >  * if the time difference is negative then slow down the clock.
> >  * if the time difference is way too high then accelerate the clock
> >  */
>
> Well, this is still a bit vague.  do_adjtimex gives us 4 values we need
> to use in adjusting time. The parts-per-million (ppm - all of which are
> signed) tick adjustment value, the ppm frequency adjustment, the ppm
> offset adjustment, and the singleshot offset adjustment length (# of
> nsecs in which we apply the maximum ppm singleshot offset adjustment).
>
> How do these 4 values get translated to adjusting the time source?

I agree this is vague and I have no clue here. I hoped that you
could come up with some may to use these functions for adjtimex etc? I
would need some time to figure out how the adjusting works in order to
come up with a solution. I thought we agreed you do the NTP and time
adjustment things ;-) I have never touched that code....

> You might want to just swipe my timeofday-core patch and re-work the
> timesource.h interface to your liking (make it like time_source_t). That
> way you get all the NTP details as well as the interrupt separation for
> free.

Will have a look at it and put it together when I have some time.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [RFC][PATCH] new timeofday core subsystem (v.A0)
  2004-09-15  6:46                                         ` Christoph Lameter
@ 2004-09-15 16:32                                           ` john stultz
  2004-09-15 16:46                                             ` Christoph Lameter
  0 siblings, 1 reply; 60+ messages in thread
From: john stultz @ 2004-09-15 16:32 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: george anzinger, Albert Cahalan, lkml, tim, Ulrich.Windl,
	Len Brown, linux, David Mosberger, Andi Kleen, paulus,
	schwidefsky, jimix, keith maanthey, greg kh, Patricia Gaughen,
	Chris McDermott

On Tue, 2004-09-14 at 23:46, Christoph Lameter wrote:
> V3 of the concept for a time subsystem.
>  * add a function to the time_source_t to be able to trigger the timer
>    interrupt at a certain counter value instead of at regular intervals
>  * function to map back from nanoseconds to counter units
>  * add primitive scheduler that wakes up for timer events and
>    figures out when to schedule the next event.
>  * Some info at the end on how maybe have the NTP code interact with
>    this stuff.
> 
> The deferral of the time_base updates actually is beneficial since it
> allows a more accurate scaling of the counter and thus does not loose
> remaining nanoseconds as often.
> 
> /****************************************************************
>  * A concept for a new timer module
>  * V3
>  *
>  * Christoph Lameter, September 14, 2003
>  ***************************************************************/
> 
> struct time_source_t {
> 	u8 type;
> 	u8 shift;
> 	u32 multiply;
> 	void *address;
> 	u64 mask;
> 	int (*interrupt_at)(u64 counter_value);
> };

Could you explain the interrupt_at function further?



> /* Exampe of a CPU time time source */
> 	int itm_setup(u64 counter) {
> 		set_itm(counter);
> 		return 1;	/* Success */
> 	}
> 
> 	make_time_source(1200000000, TIME_SOURCE_CPU, NULL, 44,itm_setup);
> 
> /* A memory based time source at 20Mhz 55 bits wide, no ability to fire an interrupt */
> 
> 	make_time_source(20000000, TIME_SOURCE_MMIO64, &my_timer_address, 55, NULL);
> 
> /* Example of handling a difficult time source. In SMP systems the CPU time sources are notoriously difficult to
>  * sync. If we have such a problem  then insure at least sure that the time source never goes backward.
>  */
> 
> u64 time_source_last;
> 
> u64 get_cpu_time_filtered() {
> 	u64 x;
> 	u64 l;
> 
> 	do {
> 		l = time_source_last;
> 		x = cycles();
> 		if (x<l) return l;
> 		/* the cmpxchg is going to hurt in terms of scalability ! */
> 	} while (cmpxchg(&time_source_last, l, x) != l);
> 	return x;
> }
> 
> Generate the time_source with
> 
> 	make_time_source(1200000000, TIME_SOURCE_FUNCTION, cpu_time_filtered, 44, itm_setup);

Out of curiosity, what happens in your fsyscall code in these cases
where TIME_SOURCE_FUNCTION is used?


> /* event scheduler to be able to run events at specific times */
> 
> struct event {
> 	u64 when;
> 	u64 param;
> 	void (*run)(u64);
> 	struct event next;
> }
> 
> struct event *event_queue;
> 
> void event_run(void) {
> 	u64 t;
> 
> 	spin_lock(event_handler);
> 	t = now();
> redo:
> 	/* Run all events scheduled before now */
> 	while (event_queue && event_queue->when <= t) {
> 		struct event *e=event_queue;
> 
> 		e->run(e->param);
> 		event_queue = e->next;
> 		kfree(e);
> 	}
> 	if (event_queue = NULL) {
> 		spin_unlock(event_handler);
> 		return;
> 	}
> 	t = now();
> 	d = t - event_queue->when;
> 	if (d<= 0) goto retry;
> 
> 	time_source->interrupt_at(time_source_at_base + ns_to_time_source(event_queue->when - time_base));
> 	spin_unlock(event_handler);
> }
> 
> void event_new(u64 w,void (*r)(u64), u64 p) {
> 	struct event *e=kmalloc(sizeof(struct event));
> 
> 	e->when =w;
> 	e->param = p;
> 	e->run = r;
> 	spin_lock(event_handler);
> 	if (event_queue) {
> 		/* Insert time event at the appropriate time */
> 		struct event *p = event_queue;
> 		while (p->next && p->next->w < w) p = p->next;
> 		e->next = p->next;
> 		p->next = e;
> 	} else
> 		event_queue=e;
> 	spin_unlock(event_handler);
> 	event_run();
> }
> 
> void scheduled_timer_interrupt(void) {
> 	event_run();
> }
> 
> /* simulation of the old tick behavior */
> tick(u64 when) {
> 	/* do tick stuff */
> 
> 	/* time base update .... */
> 	time_source_adjust(0);
> 	/* Schedule next timer tick */
> 	event_new(when + NSEC_PER_SEC / HZ , tick,  when + NSEC_PER_SEC / HZ);
> }

Hmm. This is getting somewhat tangled in my mind. Who is calling tick()
originally? How is event_run being called?

> /* New tick would be scheduled by the ntp logic when a correction is needed.
>  * ntp logic needs to decide when to skip a few nanosecond or slow down the clock or
>  * make the clock run faster.
>  * One way to do this is to accumulate a time difference to real time.
>  * if this time difference is small and positive then skip time forward a bit.
>  * if the time difference is negative then slow down the clock.
>  * if the time difference is way too high then accelerate the clock
>  */

Well, this is still a bit vague.  do_adjtimex gives us 4 values we need
to use in adjusting time. The parts-per-million (ppm - all of which are
signed) tick adjustment value, the ppm frequency adjustment, the ppm
offset adjustment, and the singleshot offset adjustment length (# of
nsecs in which we apply the maximum ppm singleshot offset adjustment). 

How do these 4 values get translated to adjusting the time source?

You might want to just swipe my timeofday-core patch and re-work the
timesource.h interface to your liking (make it like time_source_t). That
way you get all the NTP details as well as the interrupt separation for
free.



^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [RFC][PATCH] new timeofday core subsystem (v.A0)
  2004-09-15  8:04                                           ` George Anzinger
  2004-09-15  8:54                                             ` Dominik Brodowski
  2004-09-15  9:12                                             ` Andi Kleen
@ 2004-09-15 15:46                                             ` Christoph Lameter
  2004-09-15 18:00                                               ` George Anzinger
  2 siblings, 1 reply; 60+ messages in thread
From: Christoph Lameter @ 2004-09-15 15:46 UTC (permalink / raw)
  To: George Anzinger
  Cc: john stultz, Albert Cahalan, lkml, tim, Ulrich.Windl, Len Brown,
	linux, David Mosberger, Andi Kleen, paulus, schwidefsky, jimix,
	keith maanthey, greg kh, Patricia Gaughen, Chris McDermott

On Wed, 15 Sep 2004, George Anzinger wrote:

> Lets assume the pm counter which has 24 bits.  This means your shift is 40 bits.
>   In "s->multiply = (NSEC_PER_SEC << s->shift) / freq;" you will have an overflow.
> Here you need to keep (NSEC_PER_SEC << s->shift) in 64 bits AND multiply must
> also be 32 bits or less.  I really don't think you can choose the scale so easily.

Thanks for pointing that out. Will fix that. I sure wish one could
have a 128 bit intermediate result.

> > Nope. time_source_last is the global. l is just a copy of
> > time_source_last.
>
> Right, I miss read the function.  cycles() should be now() if I am reading this
> right.

Sortof. now() is the time in nanoseconds whereas cycles() is a counter
value of the counter in the cpu.

> > One could do this but we want to have a tickless system. The tick is only
> > necessary if the time needs to be adjusted.
>
> I really think a tickless system, for other than UML systems, is a loosing
> thing.  The accounting overhead on context switch (which increases as the number
> of switchs per second) will cause more overhead than a periodic accounting tick
> once a respectable load appears.  The periodic accounting tick has a flat
> overhead that does not depend on load.

I am not following you here. Why does the context switch overhead
increase? Because there are multiple interrupts for different tasks done
in the tick?


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [RFC][PATCH] new timeofday core subsystem (v.A0)
  2004-09-15  8:04                                           ` George Anzinger
  2004-09-15  8:54                                             ` Dominik Brodowski
@ 2004-09-15  9:12                                             ` Andi Kleen
  2004-09-15 15:46                                             ` Christoph Lameter
  2 siblings, 0 replies; 60+ messages in thread
From: Andi Kleen @ 2004-09-15  9:12 UTC (permalink / raw)
  To: George Anzinger
  Cc: Christoph Lameter, john stultz, Albert Cahalan, lkml, tim,
	Ulrich.Windl, Len Brown, linux, David Mosberger, Andi Kleen,
	paulus, schwidefsky, jimix, keith maanthey, greg kh,
	Patricia Gaughen, Chris McDermott

> I really think a tickless system, for other than UML systems, is a loosing 
> thing.  The accounting overhead on context switch (which increases as the 

You can run the accounting independently at much lower frequency (10ms is
perfectly fine as 2.4 has proven - i suspect even lower would be ok too) 
IMHO it should be a sysctl so that users can do a trade off between
power consumption and accounting accuracy.

And there is one important special that doesn't need any accounting
for user space at all: the idle loop. 
It still needs some way to account interrupts, but that could be done
in do_IRQ or also do with rather low frequency. 

-Andi

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [RFC][PATCH] new timeofday core subsystem (v.A0)
  2004-09-15  8:04                                           ` George Anzinger
@ 2004-09-15  8:54                                             ` Dominik Brodowski
  2004-09-15 17:54                                               ` George Anzinger
  2004-09-15  9:12                                             ` Andi Kleen
  2004-09-15 15:46                                             ` Christoph Lameter
  2 siblings, 1 reply; 60+ messages in thread
From: Dominik Brodowski @ 2004-09-15  8:54 UTC (permalink / raw)
  To: George Anzinger
  Cc: Christoph Lameter, john stultz, Albert Cahalan, lkml, tim,
	Ulrich.Windl, Len Brown, David Mosberger, Andi Kleen, paulus,
	schwidefsky, jimix, keith maanthey, greg kh, Patricia Gaughen,
	Chris McDermott

On Wed, Sep 15, 2004 at 01:04:04AM -0700, George Anzinger wrote:
> >One could do this but we want to have a tickless system. The tick is only
> >necessary if the time needs to be adjusted.
> 
> I really think a tickless system, for other than UML systems, is a loosing 
> thing.  The accounting overhead on context switch (which increases as the 
> number of switchs per second) will cause more overhead than a periodic 
> accounting tick once a respectable load appears.
			 ^^^^^^^^^^^^^^^^

On a largely idle system (like notebooks on battery power in typical use)
the accounting overhead will be less a problem. However, the CPU being 
woken up each millisecond will cause an increased battery usage. So if 
the load is less than a certain threshold, tickless systems do make much 
sense.

	Dominik

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [RFC][PATCH] new timeofday core subsystem (v.A0)
  2004-09-15  3:32                                         ` Christoph Lameter
@ 2004-09-15  8:04                                           ` George Anzinger
  2004-09-15  8:54                                             ` Dominik Brodowski
                                                               ` (2 more replies)
  0 siblings, 3 replies; 60+ messages in thread
From: George Anzinger @ 2004-09-15  8:04 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: john stultz, Albert Cahalan, lkml, tim, Ulrich.Windl, Len Brown,
	linux, David Mosberger, Andi Kleen, paulus, schwidefsky, jimix,
	keith maanthey, greg kh, Patricia Gaughen, Chris McDermott

Christoph Lameter wrote:
> On Tue, 14 Sep 2004, George Anzinger wrote:
> 
> 
>>>u64 time_source_to_ns(u64 x) {
>>>	return (((x-time_source_at_base) & time_source->mask)*time_source->multiply) >> time_source->shift;
>>>}
>>
>>This seems to assume that the time souce is incrementing.  On some archs, I
>>think, it decrements...
> 
> 
> This could be handled by a function that transforms the value read from
> the counter into an incrementing value. I.e.
> 
> u64 get_rev_timerval(void) {
> 	return  1<< 55 - readq(TIMER_PORT);
> }
> 
> 
>>So we would do "time_adjust_skip(0);" to update time_source_at_base?
> 
> 
> There is no reason to update time_source_at_base unless adjustments need
> to be done or a danger exists of the counter wrapping around (16 bit
> counter?)

Yes, for example the pm counter is 24 bits.  A lot of platforms have 32 bit 
counters...
> 
> 
>>If we do a "good" job of choosing <multiply> and <shift> this will be a "very"
>>small change.  Might be better to pass in a "delta" to change it by.  Then you
>>would only need one function.
> 
> 
> These are the raw routines. Higher level function could translate a delta
> into the appropriate adjustments.
> 
> 
>>The mask and the shift value are not really related.  The mask is a function of
>>the number of bits the hardware provides.  The shift is related to the value of
>>freq.  Me thinks they should not be tied together here.
> 
> 
> They are related because the maximum shift for a 64 bit value without
> loosing bits is 64 - number of significant bits. This basically insures
> maximum precision when scaling the counter.

Lets assume the pm counter which has 24 bits.  This means your shift is 40 bits. 
  In "s->multiply = (NSEC_PER_SEC << s->shift) / freq;" you will have an overflow.
Here you need to keep (NSEC_PER_SEC << s->shift) in 64 bits AND multiply must 
also be 32 bits or less.  I really don't think you can choose the scale so easily.
> 
> 
> 
>>>/* Values in use in the kernel and how they may be derived from xtime */
>>>#define jiffies (now()/1000000)
>>
>>This assumes HZ=1000.  (Assuming there is an HZ any more, that is.)  Not all
>>archs will want this value.  Possibly:
>>#define jiffies ((now() * HZ) / 1000000000)
> 
> 
> Right. I just thought of the standard case HZ=1000.
> 
> 
>>>u64 get_cpu_time_filtered() {
>>>	u64 x;
>>>	u64 l;
>>
>>This will need to be "static";
> 
> 
> Nope. time_source_last is the global. l is just a copy of
> time_source_last.

Right, I miss read the function.  cycles() should be now() if I am reading this 
right.
> 
> 
>>Ok, so now lets hook this up with interval timers:
>>
>>#define ns_per_jiffie (NSEC_PER_SEC / HZ)
>>#define jiffies_to_ns(jiff) (jiff * ns_per_jiffie)
>>
>>This function is a request to interrupt at the next jiffie after the passed
>>reference jiffie.  If that time is passed return true, else false.
> 
> 
> One could do this but we want to have a tickless system. The tick is only
> necessary if the time needs to be adjusted.

I really think a tickless system, for other than UML systems, is a loosing 
thing.  The accounting overhead on context switch (which increases as the number 
of switchs per second) will cause more overhead than a periodic accounting tick 
once a respectable load appears.  The periodic accounting tick has a flat 
overhead that does not depend on load.
> 
> But you are right there is the need for timer event scheduling that is
> not included yet. This should be a method of the time source.
> 
I am not sure that is the right thing to do here.  For example, on SMP systems 
today we have a timer event interrupt per cpu.  This is much more scaleable and 
not so easy to do if we tie it to the time source.  All we need is a reasonably 
accurate short term counter.
-- 
George Anzinger   george@mvista.com
High-res-timers:  http://sourceforge.net/projects/high-res-timers/
Preemption patch: http://www.kernel.org/pub/linux/kernel/people/rml


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [RFC][PATCH] new timeofday core subsystem (v.A0)
  2004-09-15  0:57                                       ` George Anzinger
  2004-09-15  3:32                                         ` Christoph Lameter
@ 2004-09-15  6:46                                         ` Christoph Lameter
  2004-09-15 16:32                                           ` john stultz
  1 sibling, 1 reply; 60+ messages in thread
From: Christoph Lameter @ 2004-09-15  6:46 UTC (permalink / raw)
  To: George Anzinger
  Cc: john stultz, Albert Cahalan, lkml, tim, Ulrich.Windl, Len Brown,
	linux, David Mosberger, Andi Kleen, paulus, schwidefsky, jimix,
	keith maanthey, greg kh, Patricia Gaughen, Chris McDermott

V3 of the concept for a time subsystem.
 * add a function to the time_source_t to be able to trigger the timer
   interrupt at a certain counter value instead of at regular intervals
 * function to map back from nanoseconds to counter units
 * add primitive scheduler that wakes up for timer events and
   figures out when to schedule the next event.
 * Some info at the end on how maybe have the NTP code interact with
   this stuff.

The deferral of the time_base updates actually is beneficial since it
allows a more accurate scaling of the counter and thus does not loose
remaining nanoseconds as often.

/****************************************************************
 * A concept for a new timer module
 * V3
 *
 * Christoph Lameter, September 14, 2003
 ***************************************************************/

struct time_source_t {
	u8 type;
	u8 shift;
	u32 multiply;
	void *address;
	u64 mask;
	int (*interrupt_at)(u64 counter_value);
};

#define TIME_SOURCE_CPU 0
#define TIME_SOURCE_MMIO32 1
#define TIME_SOURCE_MMIO64 2
#define TIME_SOURCE_FUNCTION 15

struct seqlock_t time_lock;	/* Protects access to time_base time_source time_mono_offset and time_source_at_base */
u64 time_base;
u64 time_mono_offset;
struct time_source_t *time_source;
u64 time_source_at_base;	/* time source value at time time_base */

u64 inline time_source_get(struct time_source_t *s) {
	switch (s->type) {
		case TIME_SOURCE_CPU	: return cycles();
		case TIME_SOURCE_MMIO32	: return readl(s->address);
		case TIME_SOURCE_MMIO64	: return readq(s->address);
		case TIME_SOURCE_FUNCTION: {
			u64 (*x)(void);
			x = s->address;
			return x();
		}
}

/* Convert from time source units to ns */
u64 time_source_to_ns(u64 x) {
	return ((x & time_source->mask)*time_source->multiply) >> time_source->shift;
}

/* Convert from ns time to time source units */
u64 ns_to_time_source(u64 x) {
	return (x << time_source->shift ) / time_source->multiply;
}
inline u64 now(void) {
	u64 t;
	unsigned long seq;

	do {
		seq = read_seqbegin(&time_lock);
		t = time_base + time_source_to_ns(time_source_get(time_source) - time_source_at_base);
	} while (unlikely(seq_retry(&time_lock,seq)));
	return t;
}

#define now_mono() (now()+time_mono_offset)

/* Time adjustment (only possible forward, to go backwards adjust time_source->multiply and wait ...) */
void time_adjust_skip(u64 ns) {
	u64 c;

	write_seqlock(&time_lock);
	c = time_source_get(time_source);
	time_base += ns + time_source_to_ns(c - time_source_at_base);
	time_source_at_base = c;
	write_sequnlock(&time_lock);
}

void time_adjust_slower(void) {
	u64 c;

	write_seqlock(&time_lock);
	c = time_source_get(time_source);
	time_base += time_source_to_ns(c - time_source_at_base);
	time_source_at_base = c;
	time_source->multiply--;
	write_sequnlock(&time_lock);
}

void time_adjust_faster(void) {
	u64 c;

	write_seqlock(&time_lock);
	c = time_source_get(time_source);
	time_base += time_source_to_ns(c - time_source_at_base);
	time_source_at_base = c;
	time_source->multiply++;
	write_sequnlock(&time_unlock);
}

/* Switch to another time source */
void new_time_source(struct time_source_t *s) {
	u64 c_old;
	u64 c_new;

	write_seqlock(&time_lock);
	c_old = time_source_get(time_source);
	c_new = time_source_get(s);

	time_base += time_source_to_ns(c_old - time_source_at_base);

	time_source_at_base = c_new;
	time_source = s;

	write_sequnlock(&time_lock);
}

struct time_source_t *make_time_source(u64 freq, int t,void *a, int bits, void *i) {
	struct time_source_t *s = kmalloc(sizeof(struct time_source_t));

	s->shift = 64 - bits;
	s->multiply = (NSEC_PER_SEC << s->shift) / freq;
	s->address = a;
	s->type = t;
	s->mask = 1 << bits -1;
	s->interrupt_at = i;
}

void time_set(u64 ns) {
	u64 c;

	write_seqlock(&time_lock);
	c = time_source_get(time_source);
	/* Adjust monotonic time base */
	time_mono_offset += time_base + time_source_to_ns(c - time_source_at_base) - ns;
	/* Setup new time base */
	time_base = ns;
	time_source_at_base = c;
	write_sequnlock(&time_lock);
}

void time_init(struct time_source_t *s) {
{
	write_seqlock(&time_lock);
	time_base = 0;
	time_mono_offset = 0;
	time_source_at_base = time_source_get(s);
	time_source = s;
	write_sequnlock(&time_lock);
}

/* Values in use in the kernel and how they may be derived from xtime */
#define jiffies (now() * HZ /1000000000)
#define seconds(x) ((x)/1000000000)
#define nanoseconds_within_second(x) ((x)%1000000000)
#define microseconds_within_second(x) (nanoseconds_within_second(x) / 1000)

u32 time(void) {
	return seconds(now());
}

void gettimeofday(struct timeval *p) {
	u64 t=now();

	p->tv_sec = seconds(t);
	p->tv_usec = microseconds_within_second(t);
}

void clock_gettime(int clock, struct timespec *p) {
	u64 t=now();

	p->tv_sec = seconds(t);
	p->tv_nsec = nanoseconds_within_second(t);
}

/* Exampe of a CPU time time source */
	int itm_setup(u64 counter) {
		set_itm(counter);
		return 1;	/* Success */
	}

	make_time_source(1200000000, TIME_SOURCE_CPU, NULL, 44,itm_setup);

/* A memory based time source at 20Mhz 55 bits wide, no ability to fire an interrupt */

	make_time_source(20000000, TIME_SOURCE_MMIO64, &my_timer_address, 55, NULL);

/* Example of handling a difficult time source. In SMP systems the CPU time sources are notoriously difficult to
 * sync. If we have such a problem  then insure at least sure that the time source never goes backward.
 */

u64 time_source_last;

u64 get_cpu_time_filtered() {
	u64 x;
	u64 l;

	do {
		l = time_source_last;
		x = cycles();
		if (x<l) return l;
		/* the cmpxchg is going to hurt in terms of scalability ! */
	} while (cmpxchg(&time_source_last, l, x) != l);
	return x;
}

Generate the time_source with

	make_time_source(1200000000, TIME_SOURCE_FUNCTION, cpu_time_filtered, 44, itm_setup);


/* event scheduler to be able to run events at specific times */

struct event {
	u64 when;
	u64 param;
	void (*run)(u64);
	struct event next;
}

struct event *event_queue;

void event_run(void) {
	u64 t;

	spin_lock(event_handler);
	t = now();
redo:
	/* Run all events scheduled before now */
	while (event_queue && event_queue->when <= t) {
		struct event *e=event_queue;

		e->run(e->param);
		event_queue = e->next;
		kfree(e);
	}
	if (event_queue = NULL) {
		spin_unlock(event_handler);
		return;
	}
	t = now();
	d = t - event_queue->when;
	if (d<= 0) goto retry;

	time_source->interrupt_at(time_source_at_base + ns_to_time_source(event_queue->when - time_base));
	spin_unlock(event_handler);
}

void event_new(u64 w,void (*r)(u64), u64 p) {
	struct event *e=kmalloc(sizeof(struct event));

	e->when =w;
	e->param = p;
	e->run = r;
	spin_lock(event_handler);
	if (event_queue) {
		/* Insert time event at the appropriate time */
		struct event *p = event_queue;
		while (p->next && p->next->w < w) p = p->next;
		e->next = p->next;
		p->next = e;
	} else
		event_queue=e;
	spin_unlock(event_handler);
	event_run();
}

void scheduled_timer_interrupt(void) {
	event_run();
}

/* simulation of the old tick behavior */
tick(u64 when) {
	/* do tick stuff */

	/* time base update .... */
	time_source_adjust(0);
	/* Schedule next timer tick */
	event_new(when + NSEC_PER_SEC / HZ , tick,  when + NSEC_PER_SEC / HZ);
}

/* New tick would be scheduled by the ntp logic when a correction is needed.
 * ntp logic needs to decide when to skip a few nanosecond or slow down the clock or
 * make the clock run faster.
 * One way to do this is to accumulate a time difference to real time.
 * if this time difference is small and positive then skip time forward a bit.
 * if the time difference is negative then slow down the clock.
 * if the time difference is way too high then accelerate the clock
 */



^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [RFC][PATCH] new timeofday core subsystem (v.A0)
  2004-09-15  0:57                                       ` George Anzinger
@ 2004-09-15  3:32                                         ` Christoph Lameter
  2004-09-15  8:04                                           ` George Anzinger
  2004-09-15  6:46                                         ` Christoph Lameter
  1 sibling, 1 reply; 60+ messages in thread
From: Christoph Lameter @ 2004-09-15  3:32 UTC (permalink / raw)
  To: George Anzinger
  Cc: john stultz, Albert Cahalan, lkml, tim, Ulrich.Windl, Len Brown,
	linux, David Mosberger, Andi Kleen, paulus, schwidefsky, jimix,
	keith maanthey, greg kh, Patricia Gaughen, Chris McDermott

On Tue, 14 Sep 2004, George Anzinger wrote:

> > u64 time_source_to_ns(u64 x) {
> > 	return (((x-time_source_at_base) & time_source->mask)*time_source->multiply) >> time_source->shift;
> > }
>
> This seems to assume that the time souce is incrementing.  On some archs, I
> think, it decrements...

This could be handled by a function that transforms the value read from
the counter into an incrementing value. I.e.

u64 get_rev_timerval(void) {
	return  1<< 55 - readq(TIMER_PORT);
}

> So we would do "time_adjust_skip(0);" to update time_source_at_base?

There is no reason to update time_source_at_base unless adjustments need
to be done or a danger exists of the counter wrapping around (16 bit
counter?)

> If we do a "good" job of choosing <multiply> and <shift> this will be a "very"
> small change.  Might be better to pass in a "delta" to change it by.  Then you
> would only need one function.

These are the raw routines. Higher level function could translate a delta
into the appropriate adjustments.

> The mask and the shift value are not really related.  The mask is a function of
> the number of bits the hardware provides.  The shift is related to the value of
> freq.  Me thinks they should not be tied together here.

They are related because the maximum shift for a 64 bit value without
loosing bits is 64 - number of significant bits. This basically insures
maximum precision when scaling the counter.


> > /* Values in use in the kernel and how they may be derived from xtime */
> > #define jiffies (now()/1000000)
>
> This assumes HZ=1000.  (Assuming there is an HZ any more, that is.)  Not all
> archs will want this value.  Possibly:
> #define jiffies ((now() * HZ) / 1000000000)

Right. I just thought of the standard case HZ=1000.

> > u64 get_cpu_time_filtered() {
> > 	u64 x;
> > 	u64 l;
> This will need to be "static";

Nope. time_source_last is the global. l is just a copy of
time_source_last.

> Ok, so now lets hook this up with interval timers:
>
> #define ns_per_jiffie (NSEC_PER_SEC / HZ)
> #define jiffies_to_ns(jiff) (jiff * ns_per_jiffie)
>
> This function is a request to interrupt at the next jiffie after the passed
> reference jiffie.  If that time is passed return true, else false.

One could do this but we want to have a tickless system. The tick is only
necessary if the time needs to be adjusted.

But you are right there is the need for timer event scheduling that is
not included yet. This should be a method of the time source.


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [RFC][PATCH] new timeofday core subsystem (v.A0)
  2004-09-14 17:49                                     ` Christoph Lameter
@ 2004-09-15  0:57                                       ` George Anzinger
  2004-09-15  3:32                                         ` Christoph Lameter
  2004-09-15  6:46                                         ` Christoph Lameter
  0 siblings, 2 replies; 60+ messages in thread
From: George Anzinger @ 2004-09-15  0:57 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: john stultz, Albert Cahalan, lkml, tim, Ulrich.Windl, Len Brown,
	linux, David Mosberger, Andi Kleen, paulus, schwidefsky, jimix,
	keith maanthey, greg kh, Patricia Gaughen, Chris McDermott

Christoph Lameter wrote:
> Here is a rev on the concept of a time source subsystem taking into
> account some of the feedback:
> 
> 1. add mask
> 2. rename time source so that its different from xtime
> 3. provide set_time functionality
> 4. add monotonic time support
> 5. Give some example of time sources including a function time source.
> 
> note that the original concept already includes the ability to switch time
> sources. I would think that this would be necessary especially if the CPU
> time sources frequency switches or if the system goes into sleep mode.
> 
> /****************************************************************
>  * A concept for a new timer module
>  * V2
>  *
>  * Christoph Lameter, September 14, 2003
>  ***************************************************************/
> 
> struct time_source_t {
> 	u8 type;
> 	u8 shift;
> 	u32 multiply;
> 	void *address;
> 	u64 mask;
> };
> 
> #define TIME_SOURCE_CPU 0
> #define TIME_SOURCE_MMIO32 1
> #define TIME_SOURCE_MMIO64 2
> #define TIME_SOURCE_FUNCTION 15
> 
> struct seqlock_t time_lock;	/* Protects access to time_base time_source time_mono_offset and time_source_at_base */
> u64 time_base;
> u64 time_mono_offset;
> struct time_source_t *time_source;
> u64 time_source_at_base;	/* time source value at time time_base */
> 
> u64 inline time_source_get(struct time_source_t *s) {
> 	switch (s->type) {
> 		case TIME_SOURCE_CPU	: return cycles();
> 		case TIME_SOURCE_MMIO32	: return readl(s->address);
> 		case TIME_SOURCE_MMIO64	: return readq(s->address);
> 		case TIME_SOURCE_FUNCTION: {
> 			u64 (*x)(void);
> 			x = s->address;
> 			return x();
> 		}
> }
> 
> u64 time_source_to_ns(u64 x) {
> 	return (((x-time_source_at_base) & time_source->mask)*time_source->multiply) >> time_source->shift;
> }

This seems to assume that the time souce is incrementing.  On some archs, I 
think, it decrements...
> 
> inline u64 now(void) {
> 	u64 t;
> 	unsigned long seq;
> 
> 	do {
> 		seq = read_seqbegin(&time_lock);
> 		t = time_base + time_source_to_ns(time_source_get(time_source));
> 	} while (unlikely(seq_retry(&time_lock,seq)));
> 	return t;
> }
> 
> #define now_mono() (now()+time_mono_offset)
> 
> /* Time adjustment (only possible forward, to go backwards adjust time_source->multiply and wait ...) */
> void time_adjust_skip(u64 ns) {
> 	u64 c;
> 
> 	write_seqlock(&time_lock);
> 	c = time_source_get(time_source);
> 	time_base += ns + time_source_to_ns(c);
> 	time_source_at_base = c;
> 	write_sequnlock(&time_lock);
> }

So we would do "time_adjust_skip(0);" to update time_source_at_base?
> 
> void time_adjust_slower(void) {
> 	u64 c;
> 
> 	write_seqlock(&time_lock);
> 	c = time_source_get(time_source);
> 	time_base += time_source_to_ns(c);
> 	time_source_at_base = c;
> 	time_source->multiply--;
> 	write_sequnlock(&time_lock);
> }
> 
If we do a "good" job of choosing <multiply> and <shift> this will be a "very" 
small change.  Might be better to pass in a "delta" to change it by.  Then you 
would only need one function.

> void time_adjust_faster(void) {
> 	u64 c;
> 
> 	write_seqlock(&time_lock);
> 	c = time_source_get(time_source);
> 	time_base += time_source_to_ns(c);
> 	time_source_at_base = c;
> 	time_source->multiply++;
> 	write_sequnlock(&time_unlock);
> }
> 
> /* Switch to another time source */
> void new_time_source(struct time_source_t *s) {
> 	u64 c_old;
> 	u64 c_new;
> 
> 	write_seqlock(&time_lock);
> 	c_old = time_source_get(time_source);
> 	c_new = time_source_get(s);
> 
> 	time_base += time_source_to_ns(c_old);
> 
> 	time_source_at_base = c_new;
> 	time_source = s;
> 
> 	write_sequnlock(&time_lock);
> }
> 
> struct time_source_t *make_time_source(u64 freq, int t,void *a, int bits) {
> 	struct time_source_t *s = kmalloc(sizeof(struct time_source_t));
> 
> 	s->shift = 64 - bits;
> 	s->multiply = (NSEC_PER_SEC << s->shift) / freq;
> 	s->address = a;
> 	s->type = t;
> 	s->mask = 1 << bits -1;
> }

The mask and the shift value are not really related.  The mask is a function of 
the number of bits the hardware provides.  The shift is related to the value of 
freq.  Me thinks they should not be tied together here.
> 
> void time_set(u64 ns) {
> 	u64 c;
> 
> 	write_seqlock(&time_lock);
> 	c = time_source_get(time_source);
> 	/* Adjust monotonic time base */
> 	time_mono_offset += time_base + time_source_to_ns(c) - ns;
> 	/* Setup new time base */
> 	time_base = ns;
> 	time_source_at_base = c;
> 	write_sequnlock(&time_lock);
> }
> 
> void time_init(struct time_source_t *s) {
> {
> 	write_seqlock(&time_lock);
> 	time_base = 0;
> 	time_mono_offset = 0;
> 	time_source_at_base = time_source_get(s);
> 	time_source = s;
> 	write_sequnlock(&time_lock);
> }
> 
> /* Values in use in the kernel and how they may be derived from xtime */
> #define jiffies (now()/1000000)

This assumes HZ=1000.  (Assuming there is an HZ any more, that is.)  Not all 
archs will want this value.  Possibly:
#define jiffies ((now() * HZ) / 1000000000)

> #define seconds(x) ((x)/1000000000)
> #define nanoseconds_within_second(x) ((x)%1000000000)
> #define microseconds_within_second(x) (nanoseconds_within_second(x) / 1000
> 
> u32 time(void) {
> 	return seconds(now());
> }
> 
> void gettimeofday(struct timeval *p) {
> 	u64 t=now();
> 
> 	p->tv_sec = seconds(t);
> 	p->tv_usec = microseconds_within_second(t);
> }
> 
> void clock_gettime(int clock, struct timespec *p) {
> 	u64 t=now();
> 
> 	p->tv_sec = seconds(t);
> 	p->tv_nsec = nanoseconds_within_second(t);
> }
> 
> /* Exampe of a CPU time time source */
> 
> 	make_time_source(1200000000, TIME_SOURCE_CPU, NULL, 44);
> 
> /* A memory based time source at 20Mhz 55 bits wide */
> 
> 	make_time_source(20000000, TIME_SOURCE_MMIO64, &my_timer_address, 55);
> 
> /* Example of handling a difficult time source. In SMP systems the CPU time sources are notoriously difficult to
>  * sync. If we have such a problem  then insure at least sure that the time source never goes backward.
>  */
> 
> u64 time_source_last;
> 
> u64 get_cpu_time_filtered() {
> 	u64 x;
> 	u64 l;
This will need to be "static";
> 
> 	do {
> 		l = time_source_last;
> 		x = cycles();
> 		if (x<l) return l;
> 		/* the cmpxchg is going to hurt in terms of scalability ! */
> 	} while (cmpxchg(&time_source_last, l, x) != l);
> 	return x;
> }
> 
> Generate the time_source with
> 
> 	make_time_source(1200000000, TIME_SOURCE_FUNCTION, cpu_time_filtered, 44);
> -

Ok, so now lets hook this up with interval timers:

#define ns_per_jiffie (NSEC_PER_SEC / HZ)
#define jiffies_to_ns(jiff) (jiff * ns_per_jiffie)

This function is a request to interrupt at the next jiffie after the passed 
reference jiffie.  If that time is passed return true, else false.

int schedule_next_jiffie(unsigned long ref_jiffie)
{
	u64 remains;
	u64 now_frozen = now();
	u32 jiffies_frozen = ((now_frozen * HZ) / 1000000000);

	if (jiffies_frozen > ref_jiffies)
		return 1;
	
	remains = now_frozen - jiffies_to_ns(ref_jiffie + 1);
         /*
  	 * set_timer_to_interrupt(u64 ns) is provided by the arch layer
	 */
	set_timer_to_interrupt_in(remains);
	return 0;
}

On interrupt:

	time_adjust_skip(0);
	do_timer();

Or, possibly do_timer() does the time_adjust_skip(0);

do_timer() will set up the softirq to run the timer list.  The timer list could 
could call schedule_next_jiffie, but there are SMP issues here.






-- 
George Anzinger   george@mvista.com
High-res-timers:  http://sourceforge.net/projects/high-res-timers/
Preemption patch: http://www.kernel.org/pub/linux/kernel/people/rml


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [RFC][PATCH] new timeofday core subsystem (v.A0)
  2004-09-13 22:25                                   ` john stultz
  2004-09-13 22:45                                     ` Christoph Lameter
@ 2004-09-14 17:49                                     ` Christoph Lameter
  2004-09-15  0:57                                       ` George Anzinger
  1 sibling, 1 reply; 60+ messages in thread
From: Christoph Lameter @ 2004-09-14 17:49 UTC (permalink / raw)
  To: john stultz
  Cc: george anzinger, Albert Cahalan, lkml, tim, Ulrich.Windl,
	Len Brown, linux, David Mosberger, Andi Kleen, paulus,
	schwidefsky, jimix, keith maanthey, greg kh, Patricia Gaughen,
	Chris McDermott

Here is a rev on the concept of a time source subsystem taking into
account some of the feedback:

1. add mask
2. rename time source so that its different from xtime
3. provide set_time functionality
4. add monotonic time support
5. Give some example of time sources including a function time source.

note that the original concept already includes the ability to switch time
sources. I would think that this would be necessary especially if the CPU
time sources frequency switches or if the system goes into sleep mode.

/****************************************************************
 * A concept for a new timer module
 * V2
 *
 * Christoph Lameter, September 14, 2003
 ***************************************************************/

struct time_source_t {
	u8 type;
	u8 shift;
	u32 multiply;
	void *address;
	u64 mask;
};

#define TIME_SOURCE_CPU 0
#define TIME_SOURCE_MMIO32 1
#define TIME_SOURCE_MMIO64 2
#define TIME_SOURCE_FUNCTION 15

struct seqlock_t time_lock;	/* Protects access to time_base time_source time_mono_offset and time_source_at_base */
u64 time_base;
u64 time_mono_offset;
struct time_source_t *time_source;
u64 time_source_at_base;	/* time source value at time time_base */

u64 inline time_source_get(struct time_source_t *s) {
	switch (s->type) {
		case TIME_SOURCE_CPU	: return cycles();
		case TIME_SOURCE_MMIO32	: return readl(s->address);
		case TIME_SOURCE_MMIO64	: return readq(s->address);
		case TIME_SOURCE_FUNCTION: {
			u64 (*x)(void);
			x = s->address;
			return x();
		}
}

u64 time_source_to_ns(u64 x) {
	return (((x-time_source_at_base) & time_source->mask)*time_source->multiply) >> time_source->shift;
}

inline u64 now(void) {
	u64 t;
	unsigned long seq;

	do {
		seq = read_seqbegin(&time_lock);
		t = time_base + time_source_to_ns(time_source_get(time_source));
	} while (unlikely(seq_retry(&time_lock,seq)));
	return t;
}

#define now_mono() (now()+time_mono_offset)

/* Time adjustment (only possible forward, to go backwards adjust time_source->multiply and wait ...) */
void time_adjust_skip(u64 ns) {
	u64 c;

	write_seqlock(&time_lock);
	c = time_source_get(time_source);
	time_base += ns + time_source_to_ns(c);
	time_source_at_base = c;
	write_sequnlock(&time_lock);
}

void time_adjust_slower(void) {
	u64 c;

	write_seqlock(&time_lock);
	c = time_source_get(time_source);
	time_base += time_source_to_ns(c);
	time_source_at_base = c;
	time_source->multiply--;
	write_sequnlock(&time_lock);
}

void time_adjust_faster(void) {
	u64 c;

	write_seqlock(&time_lock);
	c = time_source_get(time_source);
	time_base += time_source_to_ns(c);
	time_source_at_base = c;
	time_source->multiply++;
	write_sequnlock(&time_unlock);
}

/* Switch to another time source */
void new_time_source(struct time_source_t *s) {
	u64 c_old;
	u64 c_new;

	write_seqlock(&time_lock);
	c_old = time_source_get(time_source);
	c_new = time_source_get(s);

	time_base += time_source_to_ns(c_old);

	time_source_at_base = c_new;
	time_source = s;

	write_sequnlock(&time_lock);
}

struct time_source_t *make_time_source(u64 freq, int t,void *a, int bits) {
	struct time_source_t *s = kmalloc(sizeof(struct time_source_t));

	s->shift = 64 - bits;
	s->multiply = (NSEC_PER_SEC << s->shift) / freq;
	s->address = a;
	s->type = t;
	s->mask = 1 << bits -1;
}

void time_set(u64 ns) {
	u64 c;

	write_seqlock(&time_lock);
	c = time_source_get(time_source);
	/* Adjust monotonic time base */
	time_mono_offset += time_base + time_source_to_ns(c) - ns;
	/* Setup new time base */
	time_base = ns;
	time_source_at_base = c;
	write_sequnlock(&time_lock);
}

void time_init(struct time_source_t *s) {
{
	write_seqlock(&time_lock);
	time_base = 0;
	time_mono_offset = 0;
	time_source_at_base = time_source_get(s);
	time_source = s;
	write_sequnlock(&time_lock);
}

/* Values in use in the kernel and how they may be derived from xtime */
#define jiffies (now()/1000000)
#define seconds(x) ((x)/1000000000)
#define nanoseconds_within_second(x) ((x)%1000000000)
#define microseconds_within_second(x) (nanoseconds_within_second(x) / 1000)

u32 time(void) {
	return seconds(now());
}

void gettimeofday(struct timeval *p) {
	u64 t=now();

	p->tv_sec = seconds(t);
	p->tv_usec = microseconds_within_second(t);
}

void clock_gettime(int clock, struct timespec *p) {
	u64 t=now();

	p->tv_sec = seconds(t);
	p->tv_nsec = nanoseconds_within_second(t);
}

/* Exampe of a CPU time time source */

	make_time_source(1200000000, TIME_SOURCE_CPU, NULL, 44);

/* A memory based time source at 20Mhz 55 bits wide */

	make_time_source(20000000, TIME_SOURCE_MMIO64, &my_timer_address, 55);

/* Example of handling a difficult time source. In SMP systems the CPU time sources are notoriously difficult to
 * sync. If we have such a problem  then insure at least sure that the time source never goes backward.
 */

u64 time_source_last;

u64 get_cpu_time_filtered() {
	u64 x;
	u64 l;

	do {
		l = time_source_last;
		x = cycles();
		if (x<l) return l;
		/* the cmpxchg is going to hurt in terms of scalability ! */
	} while (cmpxchg(&time_source_last, l, x) != l);
	return x;
}

Generate the time_source with

	make_time_source(1200000000, TIME_SOURCE_FUNCTION, cpu_time_filtered, 44);

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [RFC][PATCH] new timeofday core subsystem (v.A0)
  2004-09-13 22:45                                     ` Christoph Lameter
@ 2004-09-14  6:53                                       ` Ulrich Windl
  0 siblings, 0 replies; 60+ messages in thread
From: Ulrich Windl @ 2004-09-14  6:53 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: george anzinger, Albert Cahalan, lkml, tim, Ulrich.Windl,
	Len Brown, linux, David Mosberger, Andi Kleen, paulus,
	schwidefsky, jimix, keith maanthey, greg kh, Patricia Gaughen,
	Chris McDermott

On 13 Sep 2004 at 15:45, Christoph Lameter wrote:

> > o My only other nit is that you use a different name then xtime. If
> > you're changing the type, you might as well use a meaningful name.
> 
> xtime is the traditional name. Maybe renaming it to intentionally break
> old code would be good but I thought it would be good to understand
> the approach.
> 

I think direct access to xtime should vanish. Provide an inlinable function to get 
the current time coarsely and quickly (that's what reading xtime is).

Ulrich



^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [RFC][PATCH] new timeofday core subsystem (v.A0)
  2004-09-13 22:25                                   ` john stultz
@ 2004-09-13 22:45                                     ` Christoph Lameter
  2004-09-14  6:53                                       ` Ulrich Windl
  2004-09-14 17:49                                     ` Christoph Lameter
  1 sibling, 1 reply; 60+ messages in thread
From: Christoph Lameter @ 2004-09-13 22:45 UTC (permalink / raw)
  To: john stultz
  Cc: george anzinger, Albert Cahalan, lkml, tim, Ulrich.Windl,
	Len Brown, linux, David Mosberger, Andi Kleen, paulus,
	schwidefsky, jimix, keith maanthey, greg kh, Patricia Gaughen,
	Chris McDermott

On Mon, 13 Sep 2004, john stultz wrote:
> My first pass comments:
> o How would settimeofday work? Would it be forced to use
> time_adjust_XYZ? Do we still need the wall_to_monotonic uglyness to
> manage posix CLOCK_MONOTONIC timers?

You could simply set xtime and time_source_last = time_source_get(time_source);

I think the existing wall_to_monotonic stuff is okay. Monotonic time is
rarely used and thus its best to keep xtime on CLOCK_REALTIME to simplify
things for general time access.

> My suggestion: split xtime into two values (system_time,
> wall_time_offset) like in my proposal. This allows you to keep a true
> monotonically increasing time base (system_time) and wall_time_offset
> can then be used to adjust to wall time. settimeofday() would only
> change wall_time_offset.

xtime is also truly monotonic increasing in my proposal. Setting the time
is an extraordinary event which I would not count against monotoneity. The
clock corrections that I proposed all preserve the monotoneity of xtime.

> o time_source_to_ns() needs some method of masking the subtraction in
> order to handle timesources that overflow under 64 bits.

Yes. My interpolator logic limits begin to show....

> o How would your time_adjust_XYZ interfaces merge w/ adjtimex and the
> NTP wall_time_update()/second_overflow() code? I spent quite a bit of
> time on my ntp.c implementation and I'm not 100% confident in it. Could
> you explain in further detail how yours would work?

I have no idea how adjtimex works. Would have to look at that in detail
first. The timer interpolator stuff also does time adjustments and I am
sure that something like that is in my mind. The time interpolator only
ever corrects the clock forward. Otherwise it adjusts the scaling if the
clock is running too fast. Thats also what I provided in the proposal.

> o My only other nit is that you use a different name then xtime. If
> you're changing the type, you might as well use a meaningful name.

xtime is the traditional name. Maybe renaming it to intentionally break
old code would be good but I thought it would be good to understand
the approach.

> Unfortunately I'm off working on other things for the next two weeks,
> but once that is over I look forward to trying to integrate some of your
> design ideas into my own.  Keeping my timeofday and ntp core code, but
> using your timesource interface looks to be quite promising.

That sounds very promising. I may also have to make time for
different projects but I will try to do as much as I can.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [RFC][PATCH] new timeofday core subsystem (v.A0)
  2004-09-13 21:29                                 ` Christoph Lameter
@ 2004-09-13 22:25                                   ` john stultz
  2004-09-13 22:45                                     ` Christoph Lameter
  2004-09-14 17:49                                     ` Christoph Lameter
  0 siblings, 2 replies; 60+ messages in thread
From: john stultz @ 2004-09-13 22:25 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: george anzinger, Albert Cahalan, lkml, tim, Ulrich.Windl,
	Len Brown, linux, David Mosberger, Andi Kleen, paulus,
	schwidefsky, jimix, keith maanthey, greg kh, Patricia Gaughen,
	Chris McDermott

On Mon, 2004-09-13 at 14:29, Christoph Lameter wrote:
> Hmm... I am still thinking that making xtime an u64 could help simplify a
> lot of things. Together with some other parameters it would allow a
> tickless system where there is no need to update xtime regularly. It would
> also simplify gettimeofday and jiffies calculation.

Looks interesting! Regardless of our earlier butting heads, both this
and my earlier proposal are indeed very similar. 

My first pass comments:
o How would settimeofday work? Would it be forced to use
time_adjust_XYZ? Do we still need the wall_to_monotonic uglyness to
manage posix CLOCK_MONOTONIC timers? 

My suggestion: split xtime into two values (system_time,
wall_time_offset) like in my proposal. This allows you to keep a true
monotonically increasing time base (system_time) and wall_time_offset
can then be used to adjust to wall time. settimeofday() would only
change wall_time_offset. 

o time_source_to_ns() needs some method of masking the subtraction in
order to handle timesources that overflow under 64 bits. 

o How would your time_adjust_XYZ interfaces merge w/ adjtimex and the
NTP wall_time_update()/second_overflow() code? I spent quite a bit of
time on my ntp.c implementation and I'm not 100% confident in it. Could
you explain in further detail how yours would work?

o My only other nit is that you use a different name then xtime. If
you're changing the type, you might as well use a meaningful name.

Unfortunately I'm off working on other things for the next two weeks,
but once that is over I look forward to trying to integrate some of your
design ideas into my own.  Keeping my timeofday and ntp core code, but
using your timesource interface looks to be quite promising. 

thanks
-john


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [RFC][PATCH] new timeofday core subsystem (v.A0)
  2004-09-09 20:49                               ` George Anzinger
@ 2004-09-13 21:29                                 ` Christoph Lameter
  2004-09-13 22:25                                   ` john stultz
  0 siblings, 1 reply; 60+ messages in thread
From: Christoph Lameter @ 2004-09-13 21:29 UTC (permalink / raw)
  To: George Anzinger
  Cc: john stultz, Albert Cahalan, lkml, tim, Ulrich.Windl, Len Brown,
	linux, David Mosberger, Andi Kleen, paulus, schwidefsky, jimix,
	keith maanthey, greg kh, Patricia Gaughen, Chris McDermott

Hmm... I am still thinking that making xtime an u64 could help simplify a
lot of things. Together with some other parameters it would allow a
tickless system where there is no need to update xtime regularly. It would
also simplify gettimeofday and jiffies calculation.

In order to calculate the current time, one then has to use a counter that
started running at a certain point. The counter can be used to calculate
an offset to xtime after scaling it. It is then only necessary to update
xtime if the value calculated by xtime and the counter deviates
significantly from the real value. Then time keeping may be corrected
(after xtime has been updated, we might have handed out time before
this event and we cannot change time keeping without a new starting point
for the calculation of the offset!) by

1. Adding a number to xtime resulting in a time jump forward. This
needs to be a small number.
2. Slowing down the clock by reducing the scaling (time cannot
jump backward. This is the only way to correct the clock if its
running too fast).
3. Speeding up the clock by increasing the scaling. This would presumably
be done if the time jumps become too large.

All of this is based a lot on the work done for the time interpolator.

/****************************************************************
 * A concept for a new timer module
 *
 ***************************************************************/

struct time_source_t {
	u8 type;
	u8 shift;
	u32 multiply;
	void *address;
};

#define TIME_SOURCE_CPU 0
#define TIME_SOURCE_MMIO32 1
#define TIME_SOURCE_MMIO64 2
#define TIME_SOURCE_FUNCTION 15

struct seqlock_t xtime_lock;	/* Protects access to xtime time_source and time_source_last, may be used by
				time retrieval routines if 64 bit atomic operations are not available */
u64 xtime;
struct time_source_t *time_source;
u64 time_source_last;

u64 inline time_source_get(struct time_source_t *s) {
	switch (s->type) {
		case TIME_SOURCE_CPU	: return cycles();
		case TIME_SOURCE_MMIO32	: return readl(s->address);
		case TIME_SOURCE_MMIO64	: return readq(s->address);
		case TIME_SOURCE_FUNCTION: {
			u64 (*x)(void);
			x = s->address;
			return x();
		}
		default : return 0;
	}
}

u64 time_source_to_ns(u64 x) {
	return ((x-time_source_last)*time_source->multiply) >> time_source->shift;
}

inline u64 now(void) {
	u64 t;
	unsigned long seq;

	do {
		seq = read_seqbegin(&xtime_lock);
		t = xtime + time_source_to_ns(time_source_get(time_source));
	} while (unlikely(seq_retry(&xtime_lock,seq)));
	return t;
}

/* Time adjustment (only possible forward, to go backwards adjust time_source->multiply and wait ...) */
void time_adjust_skip(u64 ns) {
	u64 c;

	write_seqlock(&xtime_lock);
	c = time_source_get(time_source);
	xtime += ns + time_source_to_ns(c);
	time_source_last = c;
	write_sequnlock(&xtime_lock);
}

void time_adjust_slower(void) {
	u64 c;

	write_seqlock(&xtime_lock);
	c = time_source_get(time_source);
	xtime += time_source_to_ns(c);
	time_source_last = c;
	time_source->multiply--;
	write_sequnlock(&xtime_lock);
}

void time_adjust_faster(void) {
	u64 c;

	write_seqlock(&xtime_lock);
	c = time_source_get(time_source);
	xtime += time_source_to_ns(c);
	time_source_last = c;
	time_source->multiply++;
	write_sequnlock(&xtime_unlock);
}

/* Switch to another time source */
void new_time_source(struct time_source_t *s) {
	u64 c_old;
	u64 c_new;

	write_seqlock(&xtime_lock);
	c_old = time_source_get(time_source);
	c_new = time_source_get(s);

	xtime += time_source_to_ns(c_old);

	time_source_last = c_new;
	time_source = s;

	write_sequnlock(&xtime_lock);
}

struct time_source_t *make_time_source(u64 freq, int t,void *a, int bits) {
	struct time_source_t *s = kmalloc(sizeof(struct time_source_t));

	s->shift = 64 - bits;
	s->multiply = (NSEC_PER_SEC << s->shift) / freq;
	s->address = a;
	s->type = t;
}

/* Values in use in the kernel and how they may be derived from xtime */

#define jiffies (now()/1000000)
#define seconds(x) ((x)/1000000000)
#define nanoseconds_within_second(x) ((x)%1000000000)
#define microseconds_within_second(x) (nanoseconds_within_second(x) / 1000)

u32 time(void) {
	return seconds(now());
}

void gettimeofday(struct timeval *p) {
	u64 t=now();

	p->tv_sec = seconds(t);
	p->tv_usec = microseconds_within_second(t);
}

void clock_gettime(int clock, struct timespec *p) {
	u64 t=now();

	p->tv_sec = seconds(t);
	p->tv_nsec = nanoseconds_within_second(t);
}


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [RFC][PATCH] new timeofday core subsystem (v.A0)
  2004-09-09 19:07                             ` john stultz
@ 2004-09-09 20:49                               ` George Anzinger
  2004-09-13 21:29                                 ` Christoph Lameter
  0 siblings, 1 reply; 60+ messages in thread
From: George Anzinger @ 2004-09-09 20:49 UTC (permalink / raw)
  To: john stultz
  Cc: Christoph Lameter, Albert Cahalan, lkml, tim, Ulrich.Windl,
	Len Brown, linux, David Mosberger, Andi Kleen, paulus,
	schwidefsky, jimix, keith maanthey, greg kh, Patricia Gaughen,
	Chris McDermott

john stultz wrote:
> On Wed, 2004-09-08 at 21:31, George Anzinger wrote:
> 
>>john stultz wrote:
>>
>>>I'm not sure about the busy wait bit, but yes, at some point I'd like to
>>>see the timer subsystem use the timeofday subsystem instead of jiffies
>>>for its timekeeping. 
>>>
>>
>>Yes, I think this is the way we want to go.  Here are the "rubs" I see:
>>
>>a.) resolution.  If you don't put a limit on this you will invite timer storms. 
>>  Currently, by using 1/HZ resolution, all timer "line up" on ticks and reduce 
>>the interrupt overhead that would occur if we actually tried to give "exactly" 
>>what was asked for.  This is a matter of math and can be handled (assuming we 
>>resist the urge to go shopping :))
> 
> 
> Well, basically we have the same resolution as we do today. Right now
> you specify the time in jiffies which has different resolutions
> depending on configuration, so I'm not sure I see the problem. I mean, I
> guess someone could be off put when they ask for a 1 ns expiration time
> and it returns in 1 ms, but I thought timers followed sleep() semantics,
> so it shouldn't be that much of an issue.  

What we get down to here is setting up a tick source that depends on the clock. 
  This issue of resolution, as I said, is easy to solve.
> 
> 
>>b.) For those platforms with repeating timers that can not "hit" our desired 
>>resolution (i.e. 1/HZ) there is an additional overhead to program the timer each 
>>interrupt.   In principle we do this in the HRT patch, but there we only do it 
>>for high resolution timers, which we assume are rather rare.  It is good to have 
>>a low res timer that is also accurate.  Even better if we can keep the overhead 
>>low by not having to reprogram a timer each tick.
>>
>>In the ideal world we would have a hardware repeating timer that is reasonably 
>>accurate (we might want to correct it every second or so) to generate the low 
>>res timing interrupts and a high res timer that we can program quickly for high 
>>resolution interrupts.
> 
> 
> Even in this case, inaccurate hardware timers can only cause a single
> tick jitter. The current problem we're seeing is that the timer
> inaccuracy accumulates, so if we're 1% off, a sleep(600) returns 6
> seconds late. Using the time source instead of jiffies for expiration
> would avoid the accumulation, so we'd be at most a single tick late.
> 
> Or is the problem more subtle?

No, I think you have it.  The x86 problem, in this regard, is that the "clock" 
xtal is only connected to the pm timer and the PIT, both of which are slow to 
access (not to mention all the hardware errors).  The x86 TSC is driven by 
another xtal which, to the best of my knowledge, is not a "clock" grade xtal. 
We also don't know what the actual TSC frequency is, thus the calibration loop.

What we get down to, and what others have said, is that the TSC is great for 
interpolation but bad as a long term clock.

I suspect that a time solution that required us to reprogram the PIT each tick 
would not fly, just because of the access overhead.  One that required 
reprogramming the PIT every second or so _might_ be ok.

-- 
George Anzinger   george@mvista.com
High-res-timers:  http://sourceforge.net/projects/high-res-timers/
Preemption patch: http://www.kernel.org/pub/linux/kernel/people/rml


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [RFC][PATCH] new timeofday core subsystem (v.A0)
  2004-09-09  4:31                           ` George Anzinger
  2004-09-09  6:37                             ` Jesse Barnes
@ 2004-09-09 19:07                             ` john stultz
  2004-09-09 20:49                               ` George Anzinger
  1 sibling, 1 reply; 60+ messages in thread
From: john stultz @ 2004-09-09 19:07 UTC (permalink / raw)
  To: george anzinger
  Cc: Christoph Lameter, Albert Cahalan, lkml, tim, Ulrich.Windl,
	Len Brown, linux, David Mosberger, Andi Kleen, paulus,
	schwidefsky, jimix, keith maanthey, greg kh, Patricia Gaughen,
	Chris McDermott

On Wed, 2004-09-08 at 21:31, George Anzinger wrote:
> john stultz wrote:
> > I'm not sure about the busy wait bit, but yes, at some point I'd like to
> > see the timer subsystem use the timeofday subsystem instead of jiffies
> > for its timekeeping. 
> > 
> Yes, I think this is the way we want to go.  Here are the "rubs" I see:
> 
> a.) resolution.  If you don't put a limit on this you will invite timer storms. 
>   Currently, by useing 1/HZ resolution, all timer "line up" on ticks and reduce 
> the interrupt overhead that would occure if we actually tried to give "exactly" 
> what was asked for.  This is a matter of math and can be handled (assuming we 
> resist the urge to go shopping :))

Well, basically we have the same resolution as we do today. Right now
you specify the time in jiffies which has different resolutions
depending on configuration, so I'm not sure I see the problem. I mean, I
guess someone could be off put when they ask for a 1 ns expiration time
and it returns in 1 ms, but I thought timers followed sleep() semantics,
so it shouldn't be that much of an issue.  

> b.) For those platforms with repeating timers that can not "hit" our desired 
> resolution (i.e. 1/HZ) there is an additional overhead to program the timer each 
> interrupt.   In principle we do this in the HRT patch, but there we only do it 
> for high resolution timers, which we assume are rather rare.  It is good to have 
> a low res timer that is also accurate.  Even better if we can keep the overhead 
> low by not having to reprogram a timer each tick.
> 
> In the ideal world we would have a hardware repeating timer that is reasonably 
> accurate (we might want to correct it every second or so) to generate the low 
> res timing interrupts and a high res timer that we can program quickly for high 
> resolution interrupts.

Even in this case, inaccurate hardware timers can only cause a single
tick jitter. The current problem we're seeing is that the timer
inaccuracy accumulates, so if we're 1% off, a sleep(600) returns 6
seconds late. Using the time source instead of jiffies for expiration
would avoid the accumulation, so we'd be at most a single tick late. 

Or is the problem more subtle?

thanks
-john



^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [RFC][PATCH] new timeofday core subsystem (v.A0)
  2004-09-09  6:37                             ` Jesse Barnes
@ 2004-09-09  8:09                               ` George Anzinger
  0 siblings, 0 replies; 60+ messages in thread
From: George Anzinger @ 2004-09-09  8:09 UTC (permalink / raw)
  To: Jesse Barnes
  Cc: john stultz, Christoph Lameter, Albert Cahalan, lkml, tim,
	Ulrich.Windl, Len Brown, linux, David Mosberger, Andi Kleen,
	paulus, schwidefsky, jimix, keith maanthey, greg kh,
	Patricia Gaughen, Chris McDermott

Jesse Barnes wrote:
> On Wednesday, September 8, 2004 9:31 pm, George Anzinger wrote:
> 
>>a.) resolution.  If you don't put a limit on this you will invite timer
>>storms. Currently, by useing 1/HZ resolution, all timer "line up" on ticks
>>and reduce the interrupt overhead that would occure if we actually tried to
>>give "exactly" what was asked for.  This is a matter of math and can be
>>handled (assuming we resist the urge to go shopping :))
> 
> 
> This can be bad though if lots of CPUs hit it at the same time or nearly so if 
> they're all trying to write the same cacheline or two.

I think that most of the SMP issues your talking about have gone away with 2.6 
where we have seperate timer lists for each cpu.


-- 
George Anzinger   george@mvista.com
High-res-timers:  http://sourceforge.net/projects/high-res-timers/
Preemption patch: http://www.kernel.org/pub/linux/kernel/people/rml


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [RFC][PATCH] new timeofday core subsystem (v.A0)
  2004-09-09  4:31                           ` George Anzinger
@ 2004-09-09  6:37                             ` Jesse Barnes
  2004-09-09  8:09                               ` George Anzinger
  2004-09-09 19:07                             ` john stultz
  1 sibling, 1 reply; 60+ messages in thread
From: Jesse Barnes @ 2004-09-09  6:37 UTC (permalink / raw)
  To: george
  Cc: john stultz, Christoph Lameter, Albert Cahalan, lkml, tim,
	Ulrich.Windl, Len Brown, linux, David Mosberger, Andi Kleen,
	paulus, schwidefsky, jimix, keith maanthey, greg kh,
	Patricia Gaughen, Chris McDermott

On Wednesday, September 8, 2004 9:31 pm, George Anzinger wrote:
> a.) resolution.  If you don't put a limit on this you will invite timer
> storms. Currently, by useing 1/HZ resolution, all timer "line up" on ticks
> and reduce the interrupt overhead that would occure if we actually tried to
> give "exactly" what was asked for.  This is a matter of math and can be
> handled (assuming we resist the urge to go shopping :))

This can be bad though if lots of CPUs hit it at the same time or nearly so if 
they're all trying to write the same cacheline or two.

Jesse

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [RFC][PATCH] new timeofday core subsystem (v.A0)
  2004-09-09  3:32                         ` john stultz
@ 2004-09-09  4:31                           ` George Anzinger
  2004-09-09  6:37                             ` Jesse Barnes
  2004-09-09 19:07                             ` john stultz
  0 siblings, 2 replies; 60+ messages in thread
From: George Anzinger @ 2004-09-09  4:31 UTC (permalink / raw)
  To: john stultz
  Cc: Christoph Lameter, Albert Cahalan, lkml, tim, Ulrich.Windl,
	Len Brown, linux, David Mosberger, Andi Kleen, paulus,
	schwidefsky, jimix, keith maanthey, greg kh, Patricia Gaughen,
	Chris McDermott

john stultz wrote:
> On Wed, 2004-09-08 at 20:14, Christoph Lameter wrote:
> 
>>On Wed, 8 Sep 2004, john stultz wrote:
>>
>>
>>>Why must we use jiffies to tell when a timer expires? Honestly I'd like
>>>to see xtime and jiffies both disappear, but I'm not very familiar w/
>>>the soft-timer code, so forgive me if I'm misunderstanding.
>>>
>>>So instead of calculating delta_jiffies, just mark the timer to expire
>>>at B. Then each interrupt, you use get_fast_timestamp() to decide if now
>>>is greater then B. If so, expire it.
>>>
>>>Then we can look at being able to program timer interrupts to occur as
>>>close as possible to the next soft-timer's expiration time.
>>
>>Would it not be best to have some means to determine the time in
>>nanoseconds since the epoch and then use that for long waits? 
> 
> 
> The proposal has get_lowres_timeofday() which does just that, although
> for timer stuff, I would guess monotonic_clock() or
> get_lowres_timestamp(), which returns the number of (ntp adjusted)
> nanoseconds the system has been running, would be better.  
> 
> 
> 
>>One can then calculate the wait time in nanoseconds which may then be
>>passed to another timer routine which may take the appropriate action
>>depending on the time frame involved. I.e. for a few hundred nsecs do busy
>>wait. If longer reschedule and if even longer queue the task on some
>>event queue that is handled by the timer tick or something else.
> 
> 
> I'm not sure about the busy wait bit, but yes, at some point I'd like to
> see the timer subsystem use the timeofday subsystem instead of jiffies
> for its timekeeping. 
> 
Yes, I think this is the way we want to go.  Here are the "rubs" I see:

a.) resolution.  If you don't put a limit on this you will invite timer storms. 
  Currently, by useing 1/HZ resolution, all timer "line up" on ticks and reduce 
the interrupt overhead that would occure if we actually tried to give "exactly" 
what was asked for.  This is a matter of math and can be handled (assuming we 
resist the urge to go shopping :))

b.) For those platforms with repeating timers that can not "hit" our desired 
resolution (i.e. 1/HZ) there is an additional overhead to program the timer each 
interrupt.   In principle we do this in the HRT patch, but there we only do it 
for high resolution timers, which we assume are rather rare.  It is good to have 
a low res timer that is also accurate.  Even better if we can keep the overhead 
low by not having to reprogram a timer each tick.

In the ideal world we would have a hardware repeating timer that is reasonably 
accurate (we might want to correct it every second or so) to generate the low 
res timing interrupts and a high res timer that we can program quickly for high 
resolution interrupts.

-- 
George Anzinger   george@mvista.com
High-res-timers:  http://sourceforge.net/projects/high-res-timers/
Preemption patch: http://www.kernel.org/pub/linux/kernel/people/rml


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [RFC][PATCH] new timeofday core subsystem (v.A0)
  2004-09-09  3:14                       ` Christoph Lameter
@ 2004-09-09  3:32                         ` john stultz
  2004-09-09  4:31                           ` George Anzinger
  0 siblings, 1 reply; 60+ messages in thread
From: john stultz @ 2004-09-09  3:32 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: george anzinger, Albert Cahalan, lkml, tim, Ulrich.Windl,
	Len Brown, linux, David Mosberger, Andi Kleen, paulus,
	schwidefsky, jimix, keith maanthey, greg kh, Patricia Gaughen,
	Chris McDermott

On Wed, 2004-09-08 at 20:14, Christoph Lameter wrote:
> On Wed, 8 Sep 2004, john stultz wrote:
> 
> > Why must we use jiffies to tell when a timer expires? Honestly I'd like
> > to see xtime and jiffies both disappear, but I'm not very familiar w/
> > the soft-timer code, so forgive me if I'm misunderstanding.
> >
> > So instead of calculating delta_jiffies, just mark the timer to expire
> > at B. Then each interrupt, you use get_fast_timestamp() to decide if now
> > is greater then B. If so, expire it.
> >
> > Then we can look at being able to program timer interrupts to occur as
> > close as possible to the next soft-timer's expiration time.
> 
> Would it not be best to have some means to determine the time in
> nanoseconds since the epoch and then use that for long waits? 

The proposal has get_lowres_timeofday() which does just that, although
for timer stuff, I would guess monotonic_clock() or
get_lowres_timestamp(), which returns the number of (ntp adjusted)
nanoseconds the system has been running, would be better.  


> One can then calculate the wait time in nanoseconds which may then be
> passed to another timer routine which may take the appropriate action
> depending on the time frame involved. I.e. for a few hundred nsecs do busy
> wait. If longer reschedule and if even longer queue the task on some
> event queue that is handled by the timer tick or something else.

I'm not sure about the busy wait bit, but yes, at some point I'd like to
see the timer subsystem use the timeofday subsystem instead of jiffies
for its timekeeping. 

thanks
-john


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [RFC][PATCH] new timeofday core subsystem (v.A0)
  2004-09-09  0:51                     ` john stultz
@ 2004-09-09  3:14                       ` Christoph Lameter
  2004-09-09  3:32                         ` john stultz
  0 siblings, 1 reply; 60+ messages in thread
From: Christoph Lameter @ 2004-09-09  3:14 UTC (permalink / raw)
  To: john stultz
  Cc: george anzinger, Albert Cahalan, lkml, tim, Ulrich.Windl,
	Len Brown, linux, David Mosberger, Andi Kleen, paulus,
	schwidefsky, jimix, keith maanthey, greg kh, Patricia Gaughen,
	Chris McDermott

On Wed, 8 Sep 2004, john stultz wrote:

> Why must we use jiffies to tell when a timer expires? Honestly I'd like
> to see xtime and jiffies both disappear, but I'm not very familiar w/
> the soft-timer code, so forgive me if I'm misunderstanding.
>
> So instead of calculating delta_jiffies, just mark the timer to expire
> at B. Then each interrupt, you use get_fast_timestamp() to decide if now
> is greater then B. If so, expire it.
>
> Then we can look at being able to program timer interrupts to occur as
> close as possible to the next soft-timer's expiration time.

Would it not be best to have some means to determine the time in
nanoseconds since the epoch and then use that for long waits? That cuts
out the timer dependencies. Could we have a generic interface for kernel
time that simply provides nanoseconds for everything and hides all the
details of time scaling?

Maybe as simple as

u64 now(void);

?

One can then calculate the wait time in nanoseconds which may then be
passed to another timer routine which may take the appropriate action
depending on the time frame involved. I.e. for a few hundred nsecs do busy
wait. If longer reschedule and if even longer queue the task on some
event queue that is handled by the timer tick or something else.


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [RFC][PATCH] new timeofday core subsystem (v.A0)
  2004-09-09  0:08                   ` George Anzinger
@ 2004-09-09  0:51                     ` john stultz
  2004-09-09  3:14                       ` Christoph Lameter
  0 siblings, 1 reply; 60+ messages in thread
From: john stultz @ 2004-09-09  0:51 UTC (permalink / raw)
  To: george anzinger
  Cc: Albert Cahalan, lkml, tim, Ulrich.Windl, clameter, Len Brown,
	linux, David Mosberger, Andi Kleen, paulus, schwidefsky, jimix,
	keith maanthey, greg kh, Patricia Gaughen, Chris McDermott

On Wed, 2004-09-08 at 17:08, George Anzinger wrote:
> john stultz wrote:
> > On Fri, 2004-09-03 at 17:02, George Anzinger wrote:
> > 
> >>>Again, monotonic_clock() and friends are NTP adjusted, so drift caused
> >>>by inaccurate calibration shouldn't be a problem the interval timer code
> >>>should need to worry about (outside of maybe adjusting its interval time
> >>>if its always arriving late/early). If possible the timesource
> >>>calibration code should be improved, but that's icing on the cake and
> >>>isn't critical.
> >>>
> >>
> >>Are you providing a way to predict what clock count provide a given time offset 
> >>INCLUDING ntp?  If so, cool.  If not we need to get this conversion right.  We 
> >>will go into this more on your return.
> > 
> > 
> > Sorry, I'm not sure what you mean. Mind expanding on the idea while my
> > brain warms back up?
> 
> The issue is this:  A user wants a timer to fire at exactly time B which is 
> several hours later than now (time A).  We need to know how to measure this time 
> with the timer resources (not the clock as you are talking about it).  Currently 
> we do something like delta_jiffies = timespec_to_jiffies(B - A) and set up a 
> jiffies timer to expire in delta_jiffies.  At this time we "assume" in 
> timespec_to_jiffies() that we _know_ how long a jiffie is in terms of wall clock 
> nanoseconds.  We also assume (possibly incorrectly) that this number is "good 
> enough" even with ntp messing things up.  I think this means that we assume 
> that, on the average, we have the right conversion and that any drift will be a) 
> small and b) on the average 0 (or real close to it).

Why must we use jiffies to tell when a timer expires? Honestly I'd like
to see xtime and jiffies both disappear, but I'm not very familiar w/
the soft-timer code, so forgive me if I'm misunderstanding. 

So instead of calculating delta_jiffies, just mark the timer to expire
at B. Then each interrupt, you use get_fast_timestamp() to decide if now
is greater then B. If so, expire it. 

Then we can look at being able to program timer interrupts to occur as
close as possible to the next soft-timer's expiration time.  

-john


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [RFC][PATCH] new timeofday core subsystem (v.A0)
  2004-09-08 18:07                 ` john stultz
@ 2004-09-09  0:08                   ` George Anzinger
  2004-09-09  0:51                     ` john stultz
  0 siblings, 1 reply; 60+ messages in thread
From: George Anzinger @ 2004-09-09  0:08 UTC (permalink / raw)
  To: john stultz
  Cc: Albert Cahalan, lkml, tim, Ulrich.Windl, clameter, Len Brown,
	linux, David Mosberger, Andi Kleen, paulus, schwidefsky, jimix,
	keith maanthey, greg kh, Patricia Gaughen, Chris McDermott

john stultz wrote:
> On Fri, 2004-09-03 at 17:02, George Anzinger wrote:
> 
>>>Again, monotonic_clock() and friends are NTP adjusted, so drift caused
>>>by inaccurate calibration shouldn't be a problem the interval timer code
>>>should need to worry about (outside of maybe adjusting its interval time
>>>if its always arriving late/early). If possible the timesource
>>>calibration code should be improved, but that's icing on the cake and
>>>isn't critical.
>>>
>>
>>Are you providing a way to predict what clock count provide a given time offset 
>>INCLUDING ntp?  If so, cool.  If not we need to get this conversion right.  We 
>>will go into this more on your return.
> 
> 
> Sorry, I'm not sure what you mean. Mind expanding on the idea while my
> brain warms back up?

The issue is this:  A user wants a timer to fire at exactly time B which is 
several hours later than now (time A).  We need to know how to measure this time 
with the timer resources (not the clock as you are talking about it).  Currently 
we do something like delta_jiffies = timespec_to_jiffies(B - A) and set up a 
jiffies timer to expire in delta_jiffies.  At this time we "assume" in 
timespec_to_jiffies() that we _know_ how long a jiffie is in terms of wall clock 
nanoseconds.  We also assume (possibly incorrectly) that this number is "good 
enough" even with ntp messing things up.  I think this means that we assume 
that, on the average, we have the right conversion and that any drift will be a) 
small and b) on the average 0 (or real close to it).

To my mind this is a few too many assumptions, but this is what we do today.  If 
you are proposing that ntp corrections take care of systemic constant drift 
(which by the way, linux use to do WRT HZ=1024 division errors), we will need a 
way of working this back into the timesepec_to_jiffies() code or we will not be 
able to do timers with any real accuracy.

An alternative, and one that I prefer, but, I think, runs counter to your 
proposal, is to derive the timer interrupts from the clock after the ntp 
correction.  Some hardware will have a problem with this (x86) but other 
hardware is rather constrained to do things this way.  The PPC counter interrupt 
generator, for example, counts the same pulses as the clock counter.
> 

-- 
George Anzinger   george@mvista.com
High-res-timers:  http://sourceforge.net/projects/high-res-timers/
Preemption patch: http://www.kernel.org/pub/linux/kernel/people/rml


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [RFC][PATCH] new timeofday core subsystem (v.A0)
  2004-09-04  0:02               ` George Anzinger
@ 2004-09-08 18:07                 ` john stultz
  2004-09-09  0:08                   ` George Anzinger
  0 siblings, 1 reply; 60+ messages in thread
From: john stultz @ 2004-09-08 18:07 UTC (permalink / raw)
  To: george anzinger
  Cc: Albert Cahalan, lkml, tim, Ulrich.Windl, clameter, Len Brown,
	linux, David Mosberger, Andi Kleen, paulus, schwidefsky, jimix,
	keith maanthey, greg kh, Patricia Gaughen, Chris McDermott

On Fri, 2004-09-03 at 17:02, George Anzinger wrote:
> > Again, monotonic_clock() and friends are NTP adjusted, so drift caused
> > by inaccurate calibration shouldn't be a problem the interval timer code
> > should need to worry about (outside of maybe adjusting its interval time
> > if its always arriving late/early). If possible the timesource
> > calibration code should be improved, but that's icing on the cake and
> > isn't critical.
> > 
> Are you providing a way to predict what clock count provide a given time offset 
> INCLUDING ntp?  If so, cool.  If not we need to get this conversion right.  We 
> will go into this more on your return.

Sorry, I'm not sure what you mean. Mind expanding on the idea while my
brain warms back up?

thanks
-john



^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [RFC][PATCH] new timeofday core subsystem (v.A0)
  2004-09-03 23:00                         ` john stultz
@ 2004-09-04  0:11                           ` Christoph Lameter
  0 siblings, 0 replies; 60+ messages in thread
From: Christoph Lameter @ 2004-09-04  0:11 UTC (permalink / raw)
  To: john stultz
  Cc: lkml, tim, george anzinger, albert, Ulrich.Windl, Len Brown,
	linux, David Mosberger, Andi Kleen, paulus, schwidefsky, jimix,
	keith maanthey, greg kh, Patricia Gaughen, Chris McDermott

On Fri, 3 Sep 2004, john stultz wrote:

> > True it is the applications problem if it breaks but we have applications
> > for SGI systems that demand access to a time source with the least
> > operating system interference possible.
>
> Yep, totally fine with that. If application developers will take the
> risk that's fine. Personally, I'd like the kernel's timeofday function
> to be fast enough that developers don't feel the need to attempt these
> special case hacks, but there will always be folks who need to live on
> the edge.

Yeah. I tuned the gettimeofday function on IA64 to be faster than the
glibc's use of the ITC but there are still some voices that demand a memory mapped
timer.

> > Are there any examples where the flexibility is needed?
>
> I'm a fan of Barbie's law: "Math is hard, let's go shopping!", so
> forgive my inability to immediately see through the details. In dealing
> with both high and low res timesources (some with microsecond
> frequencies, some run in picosecond frequencies), the math can be tuned
> for each case. Low res time sources are mainly a multiply, and possibly
> a divide only if it has a non-integer nanosecond per cycle value. For
> high-res timesources, we have to divide, so we would want to aproximate
> the frequency and use the multiply and shift as you suggested earlier.

A division can always be avoided. This is the equation

nanoseconds = x / y * timer_value

where x/y = scaling factor

and one can always set Y = 2^py to substitute a shift instead of
doing division. Just scale  x also appropriately by 2^px. This works to an
arbitrary accuracy if one has a long enough integer type. That is why I
suggested at least the intermediate use of 128bit. 64bit is fine for the
current cases but I would expect a need for 128bit to arise in the
future.

Math is sometimes simple and may simplify a lot of things.

> Maybe this is completely able to be generalized and I'm just not sharp
> enough to see it just yet, but it seems to me that having a timesource
> specific cyc2ns allows for further optimization then without.

I have not encountered a case that could not be handled by the above
mentioned transformation. Mathematically I would think I could say it is
impossible to find such a case.

> > Are you sure about that readability argument? A single
> > statement to calculate the delta is much more readable and verifyable
> > to be correct than a gazillion of small functions that (one  hopes and
> > prays) all do the same correctly.
>
> Your point is valid, and in moving the arch specific code into arch
> independent code, I'm largely trying to reduce the tangle of functions.
> Having 3 hardware specific functions isn't all that crazy.

I would rather have only one (the read timer function) and that function
would be optional for non-well-behaved timer sources.

> Worse case if we cannot come to some agreement on this, there's nothing
> stopping architectures from keeping their own timeofday subsystem. I'm
> just tired of implementing the same feature or fixing the same bug over
> and over and over for each arch.

These are all good reasons for centralizing the functionality instead of
dispersing it in many functions.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [RFC][PATCH] new timeofday core subsystem (v.A0)
  2004-09-03 23:32             ` john stultz
@ 2004-09-04  0:02               ` George Anzinger
  2004-09-08 18:07                 ` john stultz
  0 siblings, 1 reply; 60+ messages in thread
From: George Anzinger @ 2004-09-04  0:02 UTC (permalink / raw)
  To: john stultz
  Cc: Albert Cahalan, lkml, tim, Ulrich.Windl, clameter, Len Brown,
	linux, David Mosberger, Andi Kleen, paulus, schwidefsky, jimix,
	keith maanthey, greg kh, Patricia Gaughen, Chris McDermott

john stultz wrote:
> Really quick: I'm off on vacation until Weds. Hopefully I've addressed
> in some way everyone's comments and my email box won't be stuffed when I
> return. I might peek in every once in awhile, though. 
> 
> Just one last response...
> 
> On Fri, 2004-09-03 at 15:10, George Anzinger wrote:
> 
>>john stultz wrote:
>>
>>>I feel trying to keep two notions of time, one in the timeofday code and
>>>one in the timer code is the real issue. Trying to better keep them
>>>synced will just lead to madness. Instead the timer subsystem needs to
>>>move to using monotonic_clock(), or get_lowres_timestamp() instead of
>>>using jiffies for its notion of time. Jiffies is just an interrupt
>>>counter. 
>>
>>Well, there may be a better way.  Suppose we change all the accounting code to 
>>work with nanoseconds.  That way when we do an accounting pass we would just add 
>>what has elapsed since the last pass.  
> 
> 
> Yep, that'd work too. 
> 
> 
>>Still need some way to do user timers that are tightly pegged to the clock.
> 
> 
> I see that as just a problem of programming the timer interrupt
> generator. If you have a nanosecond (or as high as the best timesource
> on your system provides) resolution notion of time, then there is
> nothing to peg or tie the timer with. You simply just program it to go
> off every X nanoseconds. Should it be so inaccurate that it does not go
> off right at X nanoseconds, then you've hit a limit of the hardware.
> Pick a more accurate interval length, dynamically change your interval,
> or live with it.  If the soft-timers use monotonic-clock() to determine
> when they expire, then you won't get accumulating drift (as
> monotonic-clock()is NTP frequency adjusted), only the minor jitter
> caused latency caused by the interrupt programming.
> 
> 
>>A thought I had along these lines was to program a timer for each tick.  The PIT 
>>is much too slow for this, but the APIC timers seem to be rather easy to 
>>program.  I am not sure how fast the access is but it can not be anything like 
>>the PIT.  Under this scheme we would use the monotonic clock to figure out just 
>>how far out the next tick should be and program that.
> 
> 
> Yep, tickless systems also start being possible. We just have interrupts
> for scheduled events. 
> 
> 
>>This could be modified to use repeating hardware if it is available.  (What does 
>>the HPET provide?  Does it interrupt?)  Since the APIC clock is not the 
>>reference clock (the PIT & pm timer clock is) we would have to correct these 
>>from time to time but that is rather easy (I do it today in HRT code).
> 
> 
> Don't know the details of interrupt generation, so I can't tell ya. I'm
> not sure I followed the correction bit?
> 
> 
>>This brings up another issue.  We know what the PIT clock frequency is but not 
>>what the TSC clock frequency.  Currently we do a calibration run at boot to 
>>figure this but I can easily show that this run consistently gives the wrong 
>>answer (I am sure this has to do with the I/O access delays).  If we are going 
>>to use the TSC (or any other "clock" that is not derived from the PITs 
>>14.3181818MHZ ital) we need both a way to get the correct value AND a way to 
>>adjust it for drift over time.  (Possibly this is the same thing.)  Of course 
>>this is all just x86 stuff.  Other archs will have their own issues.
> 
> 
> Again, monotonic_clock() and friends are NTP adjusted, so drift caused
> by inaccurate calibration shouldn't be a problem the interval timer code
> should need to worry about (outside of maybe adjusting its interval time
> if its always arriving late/early). If possible the timesource
> calibration code should be improved, but that's icing on the cake and
> isn't critical.
> 
Are you providing a way to predict what clock count provide a given time offset 
INCLUDING ntp?  If so, cool.  If not we need to get this conversion right.  We 
will go into this more on your return.

Have fun.

> Again, thanks to everyone for the great feedback and discussion!
> -john
> 

-- 
George Anzinger   george@mvista.com
High-res-timers:  http://sourceforge.net/projects/high-res-timers/
Preemption patch: http://www.kernel.org/pub/linux/kernel/people/rml


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [RFC][PATCH] new timeofday core subsystem (v.A0)
  2004-09-03 22:10           ` George Anzinger
@ 2004-09-03 23:32             ` john stultz
  2004-09-04  0:02               ` George Anzinger
  0 siblings, 1 reply; 60+ messages in thread
From: john stultz @ 2004-09-03 23:32 UTC (permalink / raw)
  To: george anzinger
  Cc: Albert Cahalan, lkml, tim, Ulrich.Windl, clameter, Len Brown,
	linux, David Mosberger, Andi Kleen, paulus, schwidefsky, jimix,
	keith maanthey, greg kh, Patricia Gaughen, Chris McDermott

Really quick: I'm off on vacation until Weds. Hopefully I've addressed
in some way everyone's comments and my email box won't be stuffed when I
return. I might peek in every once in awhile, though. 

Just one last response...

On Fri, 2004-09-03 at 15:10, George Anzinger wrote:
> john stultz wrote:
> > I feel trying to keep two notions of time, one in the timeofday code and
> > one in the timer code is the real issue. Trying to better keep them
> > synced will just lead to madness. Instead the timer subsystem needs to
> > move to using monotonic_clock(), or get_lowres_timestamp() instead of
> > using jiffies for its notion of time. Jiffies is just an interrupt
> > counter. 
> 
> Well, there may be a better way.  Suppose we change all the accounting code to 
> work with nanoseconds.  That way when we do an accounting pass we would just add 
> what has elapsed since the last pass.  

Yep, that'd work too. 

> Still need some way to do user timers that are tightly pegged to the clock.

I see that as just a problem of programming the timer interrupt
generator. If you have a nanosecond (or as high as the best timesource
on your system provides) resolution notion of time, then there is
nothing to peg or tie the timer with. You simply just program it to go
off every X nanoseconds. Should it be so inaccurate that it does not go
off right at X nanoseconds, then you've hit a limit of the hardware.
Pick a more accurate interval length, dynamically change your interval,
or live with it.  If the soft-timers use monotonic-clock() to determine
when they expire, then you won't get accumulating drift (as
monotonic-clock()is NTP frequency adjusted), only the minor jitter
caused latency caused by the interrupt programming.

> A thought I had along these lines was to program a timer for each tick.  The PIT 
> is much too slow for this, but the APIC timers seem to be rather easy to 
> program.  I am not sure how fast the access is but it can not be anything like 
> the PIT.  Under this scheme we would use the monotonic clock to figure out just 
> how far out the next tick should be and program that.

Yep, tickless systems also start being possible. We just have interrupts
for scheduled events. 

> This could be modified to use repeating hardware if it is available.  (What does 
> the HPET provide?  Does it interrupt?)  Since the APIC clock is not the 
> reference clock (the PIT & pm timer clock is) we would have to correct these 
> from time to time but that is rather easy (I do it today in HRT code).

Don't know the details of interrupt generation, so I can't tell ya. I'm
not sure I followed the correction bit?

> This brings up another issue.  We know what the PIT clock frequency is but not 
> what the TSC clock frequency.  Currently we do a calibration run at boot to 
> figure this but I can easily show that this run consistently gives the wrong 
> answer (I am sure this has to do with the I/O access delays).  If we are going 
> to use the TSC (or any other "clock" that is not derived from the PITs 
> 14.3181818MHZ ital) we need both a way to get the correct value AND a way to 
> adjust it for drift over time.  (Possibly this is the same thing.)  Of course 
> this is all just x86 stuff.  Other archs will have their own issues.

Again, monotonic_clock() and friends are NTP adjusted, so drift caused
by inaccurate calibration shouldn't be a problem the interval timer code
should need to worry about (outside of maybe adjusting its interval time
if its always arriving late/early). If possible the timesource
calibration code should be improved, but that's icing on the cake and
isn't critical.

Again, thanks to everyone for the great feedback and discussion!
-john


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [RFC][PATCH] new timeofday core subsystem (v.A0)
  2004-09-03 22:04                       ` Christoph Lameter
@ 2004-09-03 23:00                         ` john stultz
  2004-09-04  0:11                           ` Christoph Lameter
  0 siblings, 1 reply; 60+ messages in thread
From: john stultz @ 2004-09-03 23:00 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: lkml, tim, george anzinger, albert, Ulrich.Windl, Len Brown,
	linux, David Mosberger, Andi Kleen, paulus, schwidefsky, jimix,
	keith maanthey, greg kh, Patricia Gaughen, Chris McDermott

On Fri, 2004-09-03 at 15:04, Christoph Lameter wrote:
> On Fri, 3 Sep 2004, john stultz wrote:
> > > The raw values in user space can be used for specialized purposes
> > > (somewhat like a generalized form of HPET) by applicationbs. This is not
> > > intended for real system time.
> >
> > I worry these specialized interfaces would become troubling to maintain.
> > The kernel (via regular syscall, fsyscall, or vsyscall) should provide
> > accurate time and should be the final authority for that.
> >
> > If specialized applications want to read the cycle counter (TSC, ITC),
> > /dev/HPET or /dev/mem or whatever, and try to infer time that's fine but
> > its at their own risk. When those specialized assumptions break (TSC
> > being fixed freq, for example) its the applications' problem. The kernel
> > shouldn't codify or encourage those sorts of risky interfaces by
> > exporting frequency information in a standardized interface.
> 
> How can a timer have a variable frequency and being used for
> timing purposes. Maybe I just do not understand the point on TSC having a
> fixed frequency (I certainly hope so but is that true in the power save
> modes?).

TSC frequency changes are a reality i386 has been living with for awhile
now. In some cases, there isn't anything else that is performant enough
to use as a timesource, so people want to be able to generate time as
accurately as possible from a variable frequency counter. Me, I think
its somewhat crazy, but I've got to make it work.

> True it is the applications problem if it breaks but we have applications
> for SGI systems that demand access to a time source with the least
> operating system interference possible.

Yep, totally fine with that. If application developers will take the
risk that's fine. Personally, I'd like the kernel's timeofday function
to be fast enough that developers don't feel the need to attempt these
special case hacks, but there will always be folks who need to live on
the edge.


> > > > cyc2ns(): In this conversion we can optimize the math depending on the
> > > > timesource. If the timesource freq is a power of 2, we can just use
> > > > shift! However if its a weird value and we have to be very precise, we
> > > > do a full 64bit divide. We're not stuck with one equation given a freq
> > > > value.
> > >
> > > I have never seeen a timesource freq with the power of 2. Division is not
> > > necessary since one can realize this by multiplying with a certain factor
> > > and then shifting the result right instead of dividing.
> >
> > Well, it was a theoretical example, but I agree with your point. The
> > math tricks you suggest are usable depending on the frequency of the
> > timesource. Attaching that sort of logic to the timesource
> > implementation gives us more flexibility. It could be that the
> > flexibility is unnecessary, and if so I'm fine w/ changing the design,
> > but I'm not confident of that just yet.
> 
> s/depending on/independent of/ ???
> 
> Are there any examples where the flexibility is needed?

I'm a fan of Barbie's law: "Math is hard, let's go shopping!", so
forgive my inability to immediately see through the details. In dealing
with both high and low res timesources (some with microsecond
frequencies, some run in picosecond frequencies), the math can be tuned
for each case. Low res time sources are mainly a multiply, and possibly
a divide only if it has a non-integer nanosecond per cycle value. For
high-res timesources, we have to divide, so we would want to aproximate
the frequency and use the multiply and shift as you suggested earlier. 

Maybe this is completely able to be generalized and I'm just not sharp
enough to see it just yet, but it seems to me that having a timesource
specific cyc2ns allows for further optimization then without.

> > > > delta(): Some counters don't fill 32 or 64 bits. ACPI PM time source is
> > > > 24 bits, and the cyclone is 40. Thus to do proper twos complement
> > > > subtraction without overflow worries you need to mask the subtraction.
> > > > This can be done by exporting a mask value w/ the freq value, but was
> > > > cleaner when moved into the timesource.
> > >
> > > I would suggest it is better to add this mask and avoid another call to
> > > a tiny function that does simply mask and subtract. Note that compilers
> > > are more efficient if they get a sufficiently large chunk of code. This is
> > > in particular necessary on IA64 and other processors given the inherent
> > > parallelism in their internal CPU. A function call is typically much
> > > slower than a subtract and and operation.
> >
> > Yep. As I said before, you've got very good points. The delta() function
> > is the one I'm least attached to, but for now I'm going to hang on to it
> > for readability reasons. It will be simple to change to a mask value
> > later after everyone has reviewed the code and is confident of its
> > correctness.
> 
> Are you sure about that readability argument? A single
> statement to calculate the delta is much more readable and verifyable
> to be correct than a gazillion of small functions that (one  hopes and
> prays) all do the same correctly.

Your point is valid, and in moving the arch specific code into arch
independent code, I'm largely trying to reduce the tangle of functions.
Having 3 hardware specific functions isn't all that crazy.

> > > > read(): Rather then just giving the address of the register, the read
> > > > call allows for timesource specific logic. This lets us use jiffies as a
> > > > timesource, or in cases like the ACPI PM timesource, where the register
> > > > must be read 3 times in order to ensure a correct value is latched, we
> > > > can avoid having to include that logic into the generic code, so it does
> > > > not affect systems that do not use or have that timesource.
> > >
> > > I think it is essential to have the capability to use a function. But
> > > nevertheless it is quite inefficient to have yet another small function
> > > that simply reads a value from a memory location. What I proposed is to be
> > > able to specify a memory location in the time source structure. That way
> > > this function call can be avoided for most timer source. Any specialized
> > > time source will be able to use a function call but will then not be as
> > > fast as the time sources that can simply setup a memory address.
> >
> > I really do like the time_interpolator changes you've submitted in
> > earlier threads. The core generalizing of timesources changes are very
> > similar between our two implementations, so your comments are very
> > important to me. I am open to adding these direct memory accessors to
> > the time counters, but again its a performance optimization, and right
> > now clarity is king.
> 
> Clarity is gained by centralizing these things. The design results in
> numerous functions that are defined and which all mostly do the same. For
> clarities sake please change the spec....

I disagree, but that doesn't mean I'm right. If more people complain as
the patch develops, I'll rectify the code so it is less confusing.
There's no reason for such a hard line at this point. The code is in no
way close to finished, or entering the kernel, so give me a chance to
roll with it for a bit. I'm just wary of premature optimization traps. 

Worse case if we cannot come to some agreement on this, there's nothing
stopping architectures from keeping their own timeofday subsystem. I'm
just tired of implementing the same feature or fixing the same bug over
and over and over for each arch.

thanks
-john


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [RFC][PATCH] new timeofday core subsystem (v.A0)
  2004-09-03 19:27         ` john stultz
@ 2004-09-03 22:10           ` George Anzinger
  2004-09-03 23:32             ` john stultz
  0 siblings, 1 reply; 60+ messages in thread
From: George Anzinger @ 2004-09-03 22:10 UTC (permalink / raw)
  To: john stultz
  Cc: Albert Cahalan, lkml, tim, Ulrich.Windl, clameter, Len Brown,
	linux, David Mosberger, Andi Kleen, paulus, schwidefsky, jimix,
	keith maanthey, greg kh, Patricia Gaughen, Chris McDermott

john stultz wrote:
> On Fri, 2004-09-03 at 00:24, George Anzinger wrote:
> 
>>Albert Cahalan wrote:
>>
>>>On Thu, 2004-09-02 at 21:39, George Anzinger wrote:
>>>
>>>
>>>>john stultz wrote:
>>>
>>>
>>>>>+
>>>>>+static nsec_t jiffies_cyc2ns(cycle_t cyc, cycle_t* remainder)
>>>>>+{
>>>>>+
>>>>>+	cyc *= NSEC_PER_SEC/HZ;
>>>>
>>>>Hm... This assumes that 1/HZ is what is needed here.  Today this value is 
>>>>999898.  Not exactly reachable by NSEC_PER_SEC/HZ.  Or did I miss something, 
>>>>like the relationship of jiffie to 1/HZ and to real time.
>>>
>>>
>>>HZ not being HZ is the source of many foul problems.
>>>
>>>NTP should be able to correct for the error. For systems
>>>not running NTP, provide a fake NTP to make corrections
>>>based on the expected frequency error.
>>>
>>>Based on that, skip or double-up on the ticks to make
>>>them be exactly HZ over long periods of time.
>>
>>There are several problems here.  First, to make this possible, you will have to 
>>outlaw several values for HZ (1024 comes to mind).  Second, like it or not, 1/HZ 
>>or something close to it, is the timer resolution.  I think we need to try and 
>>keep this a power of 10, mostly because there are a lot of folks who just don't 
>>understand it if it is not.  And third, you need to get real close to it with 
>>the hardware timer.  If you introduce NTP or some such to fix things, well, 
>>things just break another place.  For example, we started 2.6 with HZ=1000 and 
>>had problems like:
>> > time sleep 10
>>sleeping for 9.9 seconds.  This will just not do.  Any corrections made to the 
>>wall clock really need to be made to the timer system as well.  As it is we 
>>assume that, by correctly choosing the tick value, the wall clock will be 
>>correct on average, even under NTP.  I.e. that the NTP correction will be, over 
>>time, very small.  We really do want code that is much more accurate than "time 
>>sleep 10" to be right, i.e. if we sleep for x nanoseconds, the wall clock will 
>>have changed by x nanoseconds while we did so.
> 
> 
> I feel trying to keep two notions of time, one in the timeofday code and
> one in the timer code is the real issue. Trying to better keep them
> synced will just lead to madness. Instead the timer subsystem needs to
> move to using monotonic_clock(), or get_lowres_timestamp() instead of
> using jiffies for its notion of time. Jiffies is just an interrupt
> counter. 

Well, there may be a better way.  Suppose we change all the accounting code to 
work with nanoseconds.  That way when we do an accounting pass we would just add 
what has elapsed since the last pass.  Still need some way to do user timers 
that are tightly pegged to the clock.

A thought I had along these lines was to program a timer for each tick.  The PIT 
is much too slow for this, but the APIC timers seem to be rather easy to 
program.  I am not sure how fast the access is but it can not be anything like 
the PIT.  Under this scheme we would use the monotonic clock to figure out just 
how far out the next tick should be and program that.

This could be modified to use repeating hardware if it is available.  (What does 
the HPET provide?  Does it interrupt?)  Since the APIC clock is not the 
reference clock (the PIT & pm timer clock is) we would have to correct these 
from time to time but that is rather easy (I do it today in HRT code).

This brings up another issue.  We know what the PIT clock frequency is but not 
what the TSC clock frequency.  Currently we do a calibration run at boot to 
figure this but I can easily show that this run consistently gives the wrong 
answer (I am sure this has to do with the I/O access delays).  If we are going 
to use the TSC (or any other "clock" that is not derived from the PITs 
14.3181818MHZ ital) we need both a way to get the correct value AND a way to 
adjust it for drift over time.  (Possibly this is the same thing.)  Of course 
this is all just x86 stuff.  Other archs will have their own issues.

For what its worth, the current 2.6 HRT patch uses the PIT as the 1/HZ time 
interrupt.  It uses the APIC, if present, for the high-res interrupt source.



-- 
George Anzinger   george@mvista.com
High-res-timers:  http://sourceforge.net/projects/high-res-timers/
Preemption patch: http://www.kernel.org/pub/linux/kernel/people/rml


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [RFC][PATCH] new timeofday core subsystem (v.A0)
  2004-09-03 21:00                     ` john stultz
@ 2004-09-03 22:04                       ` Christoph Lameter
  2004-09-03 23:00                         ` john stultz
  0 siblings, 1 reply; 60+ messages in thread
From: Christoph Lameter @ 2004-09-03 22:04 UTC (permalink / raw)
  To: john stultz
  Cc: lkml, tim, george anzinger, albert, Ulrich.Windl, Len Brown,
	linux, David Mosberger, Andi Kleen, paulus, schwidefsky, jimix,
	keith maanthey, greg kh, Patricia Gaughen, Chris McDermott

On Fri, 3 Sep 2004, john stultz wrote:

> > The raw values in user space can be used for specialized purposes
> > (somewhat like a generalized form of HPET) by applicationbs. This is not
> > intended for real system time.
>
> I worry these specialized interfaces would become troubling to maintain.
> The kernel (via regular syscall, fsyscall, or vsyscall) should provide
> accurate time and should be the final authority for that.
>
> If specialized applications want to read the cycle counter (TSC, ITC),
> /dev/HPET or /dev/mem or whatever, and try to infer time that's fine but
> its at their own risk. When those specialized assumptions break (TSC
> being fixed freq, for example) its the applications' problem. The kernel
> shouldn't codify or encourage those sorts of risky interfaces by
> exporting frequency information in a standardized interface.

How can a timer have a variable frequency and being used for
timing purposes. Maybe I just do not understand the point on TSC having a
fixed frequency (I certainly hope so but is that true in the power save
modes?).

True it is the applications problem if it breaks but we have applications
for SGI systems that demand access to a time source with the least
operating system interference possible.

> > > cyc2ns(): In this conversion we can optimize the math depending on the
> > > timesource. If the timesource freq is a power of 2, we can just use
> > > shift! However if its a weird value and we have to be very precise, we
> > > do a full 64bit divide. We're not stuck with one equation given a freq
> > > value.
> >
> > I have never seeen a timesource freq with the power of 2. Division is not
> > necessary since one can realize this by multiplying with a certain factor
> > and then shifting the result right instead of dividing.
>
> Well, it was a theoretical example, but I agree with your point. The
> math tricks you suggest are usable depending on the frequency of the
> timesource. Attaching that sort of logic to the timesource
> implementation gives us more flexibility. It could be that the
> flexibility is unnecessary, and if so I'm fine w/ changing the design,
> but I'm not confident of that just yet.

s/depending on/independent of/ ???

Are there any examples where the flexibility is needed?

> > > delta(): Some counters don't fill 32 or 64 bits. ACPI PM time source is
> > > 24 bits, and the cyclone is 40. Thus to do proper twos complement
> > > subtraction without overflow worries you need to mask the subtraction.
> > > This can be done by exporting a mask value w/ the freq value, but was
> > > cleaner when moved into the timesource.
> >
> > I would suggest it is better to add this mask and avoid another call to
> > a tiny function that does simply mask and subtract. Note that compilers
> > are more efficient if they get a sufficiently large chunk of code. This is
> > in particular necessary on IA64 and other processors given the inherent
> > parallelism in their internal CPU. A function call is typically much
> > slower than a subtract and and operation.
>
> Yep. As I said before, you've got very good points. The delta() function
> is the one I'm least attached to, but for now I'm going to hang on to it
> for readability reasons. It will be simple to change to a mask value
> later after everyone has reviewed the code and is confident of its
> correctness.

Are you sure about that readability argument? A single
statement to calculate the delta is much more readable and verifyable
to be correct than a gazillion of small functions that (one  hopes and
prays) all do the same correctly.

> > > read(): Rather then just giving the address of the register, the read
> > > call allows for timesource specific logic. This lets us use jiffies as a
> > > timesource, or in cases like the ACPI PM timesource, where the register
> > > must be read 3 times in order to ensure a correct value is latched, we
> > > can avoid having to include that logic into the generic code, so it does
> > > not affect systems that do not use or have that timesource.
> >
> > I think it is essential to have the capability to use a function. But
> > nevertheless it is quite inefficient to have yet another small function
> > that simply reads a value from a memory location. What I proposed is to be
> > able to specify a memory location in the time source structure. That way
> > this function call can be avoided for most timer source. Any specialized
> > time source will be able to use a function call but will then not be as
> > fast as the time sources that can simply setup a memory address.
>
> I really do like the time_interpolator changes you've submitted in
> earlier threads. The core generalizing of timesources changes are very
> similar between our two implementations, so your comments are very
> important to me. I am open to adding these direct memory accessors to
> the time counters, but again its a performance optimization, and right
> now clarity is king.

Clarity is gained by centralizing these things. The design results in
numerous functions that are defined and which all mostly do the same. For
clarities sake please change the spec....

> Yep, performance and scalability is *very* important. However
> correctness is even more important, and right now we have enough trouble
> with just that. All of the arch specific performance hacks have
> complicated the current code such that minor changes in other subsystems
> (sometimes not so) subtly break things. Untangling these quirks have
> been a large part of what I and many others have worked on for the last
> few years. So for correctness sake we do have to balance clarity and
> performance. Minimizing trade-offs is my goal right now.

Performance as well as clarify as well as arch independence really
demand that these small functions be given up.

> > Of course I am skimming by some additional detail like the mask. But this
> > would cut down significantly on the code piece to be maintained. From what
> > I can see my approach saves a lot of duplicated code. The duplication that
> > may exist comes about because of architecture specific time access
> > optimizations.
>
> Well, I realize this is just a simple sketch, but how do you then handle
> cpu freq changes? Suddenly generic code has to deal with that as well?
> And then there's those buggy time sources that have to be read multiple
> times, or drifty ITCs that (may or maynot) need cmpxchgs to ensure they
> do not go backwards. Making the generic code have to deal with all of
> these cases will be ugly (and may even hurt performance).

The reading from memory only works for well behaved time sources. If
something unusual comes along then a function needs to be called and thus
a performance hit is taken. If its important enough (like the drifty
logic on IA64) then one may consider putting that logic into the
generic code. I did not not say not to have functions at all. I think a
function call must be possible at least to retrieve the timer value for
special cases.

If the CPU freq cases is to be handled by the generic code or a special
function is a design decision that you probably have to make. Over time
other logic will develop that may first show up as new time retrieval
function but that is then recognized to be of universal importance so that
it will then be moved into the generic code.

> Clearly, you've said that function calls should be available in those
> cases, so I don't want to miss-characterize you comments. However I do
> want to point it out so that others keep in mind that you can't always
> just give a pointer and a freq value.

Exactly.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [RFC][PATCH] new timeofday core subsystem (v.A0)
  2004-09-03 16:18                   ` Christoph Lameter
@ 2004-09-03 21:00                     ` john stultz
  2004-09-03 22:04                       ` Christoph Lameter
  0 siblings, 1 reply; 60+ messages in thread
From: john stultz @ 2004-09-03 21:00 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: lkml, tim, george anzinger, albert, Ulrich.Windl, Len Brown,
	linux, David Mosberger, Andi Kleen, paulus, schwidefsky, jimix,
	keith maanthey, greg kh, Patricia Gaughen, Chris McDermott

On Fri, 2004-09-03 at 09:18, Christoph Lameter wrote:
> On Thu, 2 Sep 2004, john stultz wrote:
> > > The simplest thins is to provide a data structure without any functions
> > > attached that can simply be copied into userspace if wanted. If an arch
> > > needs special time access then that is depending on the arch specific
> > > methods available and such a data structure as I have proposed will
> > > include all the info necessary to implement such user mode time access.
> >
> > Ehhh.. I really don't like the idea of giving all the raw values to
> > userspace and letting user-code do the timeofday calculation. Fixing
> > bugs in each arches timeofday code is hard enough. Imagine if we have to
> > go through and fix userspace too! It would also make a user/kernel data
> > interface that we'd have to preserve. I'd like to avoid that and instead
> > use the vsyscall method to give us greater flexibility. Plus I doubt
> > anyone would want to implement the NTP adjustments in userspace? eek!
> 
> The raw values in user space can be used for specialized purposes
> (somewhat like a generalized form of HPET) by applicationbs. This is not
> intended for real system time.

I worry these specialized interfaces would become troubling to maintain.
The kernel (via regular syscall, fsyscall, or vsyscall) should provide
accurate time and should be the final authority for that. 

If specialized applications want to read the cycle counter (TSC, ITC),
/dev/HPET or /dev/mem or whatever, and try to infer time that's fine but
its at their own risk. When those specialized assumptions break (TSC
being fixed freq, for example) its the applications' problem. The kernel
shouldn't codify or encourage those sorts of risky interfaces by
exporting frequency information in a standardized interface.


> > cyc2ns(): In this conversion we can optimize the math depending on the
> > timesource. If the timesource freq is a power of 2, we can just use
> > shift! However if its a weird value and we have to be very precise, we
> > do a full 64bit divide. We're not stuck with one equation given a freq
> > value.
> 
> I have never seeen a timesource freq with the power of 2. Division is not
> necessary since one can realize this by multiplying with a certain factor
> and then shifting the result right instead of dividing.

Well, it was a theoretical example, but I agree with your point. The
math tricks you suggest are usable depending on the frequency of the
timesource. Attaching that sort of logic to the timesource
implementation gives us more flexibility. It could be that the
flexibility is unnecessary, and if so I'm fine w/ changing the design,
but I'm not confident of that just yet.

> > delta(): Some counters don't fill 32 or 64 bits. ACPI PM time source is
> > 24 bits, and the cyclone is 40. Thus to do proper twos complement
> > subtraction without overflow worries you need to mask the subtraction.
> > This can be done by exporting a mask value w/ the freq value, but was
> > cleaner when moved into the timesource.
> 
> I would suggest it is better to add this mask and avoid another call to
> a tiny function that does simply mask and subtract. Note that compilers
> are more efficient if they get a sufficiently large chunk of code. This is
> in particular necessary on IA64 and other processors given the inherent
> parallelism in their internal CPU. A function call is typically much
> slower than a subtract and and operation.

Yep. As I said before, you've got very good points. The delta() function
is the one I'm least attached to, but for now I'm going to hang on to it
for readability reasons. It will be simple to change to a mask value
later after everyone has reviewed the code and is confident of its
correctness. 

> > read(): Rather then just giving the address of the register, the read
> > call allows for timesource specific logic. This lets us use jiffies as a
> > timesource, or in cases like the ACPI PM timesource, where the register
> > must be read 3 times in order to ensure a correct value is latched, we
> > can avoid having to include that logic into the generic code, so it does
> > not affect systems that do not use or have that timesource.
> 
> I think it is essential to have the capability to use a function. But
> nevertheless it is quite inefficient to have yet another small function
> that simply reads a value from a memory location. What I proposed is to be
> able to specify a memory location in the time source structure. That way
> this function call can be avoided for most timer source. Any specialized
> time source will be able to use a function call but will then not be as
> fast as the time sources that can simply setup a memory address.

I really do like the time_interpolator changes you've submitted in
earlier threads. The core generalizing of timesources changes are very
similar between our two implementations, so your comments are very
important to me. I am open to adding these direct memory accessors to
the time counters, but again its a performance optimization, and right
now clarity is king.

> > But I doubt I'll convince you with words, so let me work on it a bit and
> > see if I can code around your concerns and put you at ease. You've
> > brought up some good issues, and I'll definitely work to resolve them!
> 
> Oh. I think this is good in terms of clarifying the issues. I did not
> realize that these capabilities that you proposed to use existed. My main
> concern here is efficient and scalable access to time. The ideal solution
> is one routine that can run straight through with minimal locking and
> simply do it all for most cases. The seqlock approach takes care of the
> locking.

Yep, performance and scalability is *very* important. However
correctness is even more important, and right now we have enough trouble
with just that. All of the arch specific performance hacks have
complicated the current code such that minor changes in other subsystems
(sometimes not so) subtly break things. Untangling these quirks have
been a large part of what I and many others have worked on for the last
few years. So for correctness sake we do have to balance clarity and
performance. Minimizing trade-offs is my goal right now. 

I'm starting to feel like I'm just being repetitious, so I'll try to
move on now. Sorry for the long-winded emails.

> Also with the ability to specify parameter instead of functions, one could
> easily setup a timer with a single function call like f.e.
> 
> 
> setup_timer(TIME_SOURCE_CPU,NULL, 1500000)
> 
> or
> 
> setup_timer(TIME_SOURCE_MMIO64, &timer, 4000000)
>
> Of course I am skimming by some additional detail like the mask. But this
> would cut down significantly on the code piece to be maintained. From what
> I can see my approach saves a lot of duplicated code. The duplication that
> may exist comes about because of architecture specific time access
> optimizations.

Well, I realize this is just a simple sketch, but how do you then handle
cpu freq changes? Suddenly generic code has to deal with that as well?
And then there's those buggy time sources that have to be read multiple
times, or drifty ITCs that (may or maynot) need cmpxchgs to ensure they
do not go backwards. Making the generic code have to deal with all of
these cases will be ugly (and may even hurt performance). 

Clearly, you've said that function calls should be available in those
cases, so I don't want to miss-characterize you comments. However I do
want to point it out so that others keep in mind that you can't always
just give a pointer and a freq value.

Once again, I thank you and everyone else for the great comments!
Hopefully I'll get more arch hooks in soon so you and others can
actually start playing with (and hopefully improving) the code. 

thanks
-john


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [RFC][PATCH] new timeofday core subsystem (v.A0)
  2004-09-03  7:43                   ` George Anzinger
@ 2004-09-03 19:32                     ` john stultz
  0 siblings, 0 replies; 60+ messages in thread
From: john stultz @ 2004-09-03 19:32 UTC (permalink / raw)
  To: george anzinger
  Cc: Christoph Lameter, lkml, tim, albert, Ulrich.Windl, Len Brown,
	linux, David Mosberger, Andi Kleen, paulus, schwidefsky, jimix,
	keith maanthey, greg kh, Patricia Gaughen, Chris McDermott

On Fri, 2004-09-03 at 00:43, George Anzinger wrote:
> john stultz wrote:
> > On Thu, 2004-09-02 at 17:47, Christoph Lameter wrote:
> > 
> >>On Thu, 2 Sep 2004, john stultz wrote:
> > read(): Rather then just giving the address of the register, the read
> > call allows for timesource specific logic. This lets us use jiffies as a
> > timesource, or in cases like the ACPI PM timesource, where the register
> > must be read 3 times in order to ensure a correct value is latched, we
> > can avoid having to include that logic into the generic code, so it does
> > not affect systems that do not use or have that timesource.
> 
> I am not convince that 3 reads are in fact needed.  In fact, I am amost certain 
> that two is more than enough.  In fact, it takes so long to read it that I just 
> use one read and a sanity check in the HRT code.  Here is the code I use:

[snip]

I'd have to defer to Dominik (acpi-pm author) for that. But the case
remains, to get a "good" value, you have to do more then just give
access to the raw register.

thanks
-john


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [RFC][PATCH] new timeofday core subsystem (v.A0)
  2004-09-03  7:24       ` George Anzinger
@ 2004-09-03 19:27         ` john stultz
  2004-09-03 22:10           ` George Anzinger
  0 siblings, 1 reply; 60+ messages in thread
From: john stultz @ 2004-09-03 19:27 UTC (permalink / raw)
  To: george anzinger
  Cc: Albert Cahalan, lkml, tim, Ulrich.Windl, clameter, Len Brown,
	linux, David Mosberger, Andi Kleen, paulus, schwidefsky, jimix,
	keith maanthey, greg kh, Patricia Gaughen, Chris McDermott

On Fri, 2004-09-03 at 00:24, George Anzinger wrote:
> Albert Cahalan wrote:
> > On Thu, 2004-09-02 at 21:39, George Anzinger wrote:
> > 
> >>john stultz wrote:
> > 
> > 
> >>>+
> >>>+static nsec_t jiffies_cyc2ns(cycle_t cyc, cycle_t* remainder)
> >>>+{
> >>>+
> >>>+	cyc *= NSEC_PER_SEC/HZ;
> >>
> >>Hm... This assumes that 1/HZ is what is needed here.  Today this value is 
> >>999898.  Not exactly reachable by NSEC_PER_SEC/HZ.  Or did I miss something, 
> >>like the relationship of jiffie to 1/HZ and to real time.
> > 
> > 
> > HZ not being HZ is the source of many foul problems.
> > 
> > NTP should be able to correct for the error. For systems
> > not running NTP, provide a fake NTP to make corrections
> > based on the expected frequency error.
> > 
> > Based on that, skip or double-up on the ticks to make
> > them be exactly HZ over long periods of time.
> 
> There are several problems here.  First, to make this possible, you will have to 
> outlaw several values for HZ (1024 comes to mind).  Second, like it or not, 1/HZ 
> or something close to it, is the timer resolution.  I think we need to try and 
> keep this a power of 10, mostly because there are a lot of folks who just don't 
> understand it if it is not.  And third, you need to get real close to it with 
> the hardware timer.  If you introduce NTP or some such to fix things, well, 
> things just break another place.  For example, we started 2.6 with HZ=1000 and 
> had problems like:
>  > time sleep 10
> sleeping for 9.9 seconds.  This will just not do.  Any corrections made to the 
> wall clock really need to be made to the timer system as well.  As it is we 
> assume that, by correctly choosing the tick value, the wall clock will be 
> correct on average, even under NTP.  I.e. that the NTP correction will be, over 
> time, very small.  We really do want code that is much more accurate than "time 
> sleep 10" to be right, i.e. if we sleep for x nanoseconds, the wall clock will 
> have changed by x nanoseconds while we did so.

I feel trying to keep two notions of time, one in the timeofday code and
one in the timer code is the real issue. Trying to better keep them
synced will just lead to madness. Instead the timer subsystem needs to
move to using monotonic_clock(), or get_lowres_timestamp() instead of
using jiffies for its notion of time. Jiffies is just an interrupt
counter. 

This somewhat ignores the fact that we're discussing a "jiffies
timesource", which I only implemented as a lowest common denominator /
if you really don't have anything better / thanks the academy for its'
worst time source ever award/ simple example of the timesource code.

More and more I think I need a t-shirt that reads "jiffies ain't time".

thanks
-john


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [RFC][PATCH] new timeofday core subsystem (v.A0)
  2004-09-03  6:42     ` Albert Cahalan
  2004-09-03  7:24       ` George Anzinger
@ 2004-09-03 19:18       ` john stultz
  1 sibling, 0 replies; 60+ messages in thread
From: john stultz @ 2004-09-03 19:18 UTC (permalink / raw)
  To: Albert Cahalan
  Cc: george anzinger, lkml, tim, Ulrich.Windl, clameter, Len Brown,
	linux, David Mosberger, Andi Kleen, paulus, schwidefsky, jimix,
	keith maanthey, greg kh, Patricia Gaughen, Chris McDermott

On Thu, 2004-09-02 at 23:42, Albert Cahalan wrote:
> > > +int ntp_leapsecond(struct timespec now)
> > > +{
> > > +	/*
> > > +	 * Leap second processing. If in leap-insert state at
> > > +	 * the end of the day, the system clock is set back one
> > > +	 * second; if in leap-delete state, the system clock is
> > > +	 * set ahead one second. The microtime() routine or
> > > +	 * external clock driver will insure that reported time
> > > +	 * is always monotonic. The ugly divides should be
> > > +	 * replaced.
> 
> Don't optimize until the patch is in and stable.
> The divides can be removed much later. Wait months,
> if not forever, before making the code less readable.
> 
> The same goes for arch-specific non-syscall hacks.

Yep. Code readability is crucial, although performance is also a concern
that *has* to be addressed. 

As much as I'm probably digging myself a hole in doing all of this, I
really don't want to work on time for the rest of my life, so making the
code clear and readable is my only hope for passing this on.  :)

-john



^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [RFC][PATCH] new timeofday core subsystem (v.A0)
  2004-09-03  1:30                 ` john stultz
  2004-09-03  7:43                   ` George Anzinger
@ 2004-09-03 16:18                   ` Christoph Lameter
  2004-09-03 21:00                     ` john stultz
  1 sibling, 1 reply; 60+ messages in thread
From: Christoph Lameter @ 2004-09-03 16:18 UTC (permalink / raw)
  To: john stultz
  Cc: lkml, tim, george anzinger, albert, Ulrich.Windl, Len Brown,
	linux, David Mosberger, Andi Kleen, paulus, schwidefsky, jimix,
	keith maanthey, greg kh, Patricia Gaughen, Chris McDermott

On Thu, 2 Sep 2004, john stultz wrote:

> > The only way curent way to enter the kernel from glibc with a fastcall is
> > the EPC.
>
> Hmm. I must be explaining myself poorly, or not understanding you. I
> apologize for not understanding this EPC/fastcall business well enough.
> I'd like to use EPC from a user-executable kernel page to escalate
> privileges to access the hardware counter. I don't care if I have to use
> the the current fastcall (fsys.S) interface or not. However you're
> making sounds like this isn't possible, so I'll need to do some
> research.

The mechanism you envision could probably be created on IA64. Its
certainly not available on all platforms and potentially cannot be available.

Also if you want to pursue the approach with functions you need to enter
the kernel at least 3 times(!) in order to call the 3 different functions
you defined in order to determine time. You need to retrieve time, then do
the time diff call then the nanosecond conversion.

> > The EPC call already does do a *secure* transfer like this on IA64 and
> > will execute kernel code without user space mapping. This idea raises all sorts
> > of concerns....
>
> Yes, but its not portable. Reducing duplicate code so timeofday
> maintenance isn't a nightmare is the first goal here. It may not be
> completely achievable, and when I hit that point I'll have to rework the
> design, but at this point I'm not convinced that it cannot be done.

It does not need to be portable since it is an architecture specific
optimization which could be accomplished in variety of ways on other
platforms. The time source information structure could be platform
independent. The architectures can optimize the way they interpret the
timer source information structure without additional calls to kernel
functions.

I would expect the logic for time retrieval not to be subject to much
change. The current ASM for IA64 f.e. would probalby work without change
for your new approach if the time source information would not require
function calls.

However, the proposal makes these optimizations impossible.

> > The simplest thins is to provide a data structure without any functions
> > attached that can simply be copied into userspace if wanted. If an arch
> > needs special time access then that is depending on the arch specific
> > methods available and such a data structure as I have proposed will
> > include all the info necessary to implement such user mode time access.
>
> Ehhh.. I really don't like the idea of giving all the raw values to
> userspace and letting user-code do the timeofday calculation. Fixing
> bugs in each arches timeofday code is hard enough. Imagine if we have to
> go through and fix userspace too! It would also make a user/kernel data
> interface that we'd have to preserve. I'd like to avoid that and instead
> use the vsyscall method to give us greater flexibility. Plus I doubt
> anyone would want to implement the NTP adjustments in userspace? eek!

The raw values in user space can be used for specialized purposes
(somewhat like a generalized form of HPET) by applicationbs. This is not
intended for real system time.

> cyc2ns(): In this conversion we can optimize the math depending on the
> timesource. If the timesource freq is a power of 2, we can just use
> shift! However if its a weird value and we have to be very precise, we
> do a full 64bit divide. We're not stuck with one equation given a freq
> value.

I have never seeen a timesource freq with the power of 2. Division is not
necessary since one can realize this by multiplying with a certain factor
and then shifting the result right instead of dividing.

> delta(): Some counters don't fill 32 or 64 bits. ACPI PM time source is
> 24 bits, and the cyclone is 40. Thus to do proper twos complement
> subtraction without overflow worries you need to mask the subtraction.
> This can be done by exporting a mask value w/ the freq value, but was
> cleaner when moved into the timesource.

I would suggest it is better to add this mask and avoid another call to
a tiny function that does simply mask and subtract. Note that compilers
are more efficient if they get a sufficiently large chunk of code. This is
in particular necessary on IA64 and other processors given the inherent
parallelism in their internal CPU. A function call is typically much
slower than a subtract and and operation.

> read(): Rather then just giving the address of the register, the read
> call allows for timesource specific logic. This lets us use jiffies as a
> timesource, or in cases like the ACPI PM timesource, where the register
> must be read 3 times in order to ensure a correct value is latched, we
> can avoid having to include that logic into the generic code, so it does
> not affect systems that do not use or have that timesource.

I think it is essential to have the capability to use a function. But
nevertheless it is quite inefficient to have yet another small function
that simply reads a value from a memory location. What I proposed is to be
able to specify a memory location in the time source structure. That way
this function call can be avoided for most timer source. Any specialized
time source will be able to use a function call but will then not be as
fast as the time sources that can simply setup a memory address.

> But I doubt I'll convince you with words, so let me work on it a bit and
> see if I can code around your concerns and put you at ease. You've
> brought up some good issues, and I'll definitely work to resolve them!

Oh. I think this is good in terms of clarifying the issues. I did not
realize that these capabilities that you proposed to use existed. My main
concern here is efficient and scalable access to time. The ideal solution
is one routine that can run straight through with minimal locking and
simply do it all for most cases. The seqlock approach takes care of the
locking.

Also with the ability to specify parameter instead of functions, one could
easily setup a timer with a single function call like f.e.


setup_timer(TIME_SOURCE_CPU,NULL, 1500000)

or

setup_timer(TIME_SOURCE_MMIO64, &timer, 4000000)

Of course I am skimming by some additional detail like the mask. But this
would cut down significantly on the code piece to be maintained. From what
I can see my approach saves a lot of duplicated code. The duplication that
may exist comes about because of architecture specific time access
optimizations.


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [RFC][PATCH] new timeofday core subsystem (v.A0)
  2004-09-03  1:30                 ` john stultz
@ 2004-09-03  7:43                   ` George Anzinger
  2004-09-03 19:32                     ` john stultz
  2004-09-03 16:18                   ` Christoph Lameter
  1 sibling, 1 reply; 60+ messages in thread
From: George Anzinger @ 2004-09-03  7:43 UTC (permalink / raw)
  To: john stultz
  Cc: Christoph Lameter, lkml, tim, albert, Ulrich.Windl, Len Brown,
	linux, David Mosberger, Andi Kleen, paulus, schwidefsky, jimix,
	keith maanthey, greg kh, Patricia Gaughen, Chris McDermott

john stultz wrote:
> On Thu, 2004-09-02 at 17:47, Christoph Lameter wrote:
> 
>>On Thu, 2 Sep 2004, john stultz wrote:
>>
>>
>>>>Of course but its not a generic way of timer acccess and
>>>>requires a fastcall for each timer. You still have the problem of
>>>>exporting the frequency and the time offsets to user space (those also
>>>>need to be kept current in order for one to be able to calculate a timer
>>>>value!). The syscall/fastcall API then needs to be extended to allow for a
>>>>system call to access each of the individual timers supported.
>>>
>>>Indeed, it would require a fastcall accessor for each timesource, but
>>>maybe fastcall is the wrong way to describe it. Could it just be a
>>>function that uses the epc to escalate its privileges? As for freq and
>>>offsets, those would already be user-visible (see below for more
>>>details)
>>
>>The only way curent way to enter the kernel from glibc with a fastcall is
>>the EPC.
> 
> 
> Hmm. I must be explaining myself poorly, or not understanding you. I
> apologize for not understanding this EPC/fastcall business well enough.
> I'd like to use EPC from a user-executable kernel page to escalate
> privileges to access the hardware counter. I don't care if I have to use
> the the current fastcall (fsys.S) interface or not. However you're
> making sounds like this isn't possible, so I'll need to do some
> research. 
> 
> 
>>>The plan at the moment is that the timeofday core gettimeofday code path
>>>as well as any timesource that supports it adds a _vsyscall linker
>>>attribute. Then the linker will place all the code on a special page or
>>>set of pages. Those pages would then be mapped in as user-readable. Then
>>>just like the x86-64's vsyscall gettimeofday, glibc would re-direct
>>>gettimeofday calls to the user-mapped kernel pages, where it would
>>>execute in user mode (with maybe the epc privilege escalation for ia64
>>>time sources that required it).
>>
>>The EPC call already does do a *secure* transfer like this on IA64 and
>>will execute kernel code without user space mapping. This idea raises all sorts
>>of concerns....
> 
> 
> Yes, but its not portable. Reducing duplicate code so timeofday
> maintenance isn't a nightmare is the first goal here. It may not be
> completely achievable, and when I hit that point I'll have to rework the
> design, but at this point I'm not convinced that it cannot be done.
> 
> As for security concerns, all your mapping out to userspace are the time
> variables and the functions needed to convert those to a accurate time
> of day. This is almost what you suggest below, but with the additional
> math from the kernel. The method I'm suggesting is very similar to
> x86-64's arch/x86-64/kernel/vsyscall.c.
> 
> 
> 
>>>I had to do most of this for the i386 vsyscall-gettimeofday port, but I
>>>was unhappy with the duplication of code (and bugs), as well as the fact
>>>that I was then being pushed to re-do it for each architecture. While
>>>its not currently implemented (I was hoping to keep the discussion
>>>focused on the correctness of what's been implemented), I feel the plan
>>>for user-mode access won't be too complex. I'm still open for further
>>>discussion if you'd like, obviously performance is quite important, and
>>>I want to calm any fears you have, but I'm sure the new ntp code plenty
>>>of performance issues to look at before we start digging into usermode
>>>access, so maybe we can come back to this discussion later?
>>
>>The functions in the timer source structure is a central problem for IA64. We
>>cannot take that out right now.
> 
> 
> Don't worry I'm not submitting this code just yet. :) I'll need all (or
> at least most of the important) architecture maintainers to sign on
> before I try to push this code in.
> 
> Right now I'm trying to shake out possible problems with the design and
> the first pass implementation of the code. You're helping me do that,
> and I thank you for it. Concessions will be made, but for now I'm going
> to try to preserve the current design and work around the problems as
> they arise. 
> 
> 
> 
>>The full parameterization of timer access as I have suggested also allows
>>user page access to timers with simply mapping a page to access the timer.
>>No other gimmicks are needed since the timer source structure contains all
>>information to calculate a timer offset.
>>
>>
>>>Yes, but x86-64 has one way, and ia64 does it another, and i know ppc
>>>folks have talked about their own user mode time access. Chasing down a
>>>time bug across arches gets to be fairly hairy, so I'm trying to
>>>simplify that.
>>
>>The simplest thins is to provide a data structure without any functions
>>attached that can simply be copied into userspace if wanted. If an arch
>>needs special time access then that is depending on the arch specific
>>methods available and such a data structure as I have proposed will
>>include all the info necessary to implement such user mode time access.
> 
> 
> Ehhh.. I really don't like the idea of giving all the raw values to
> userspace and letting user-code do the timeofday calculation. Fixing
> bugs in each arches timeofday code is hard enough. Imagine if we have to
> go through and fix userspace too! It would also make a user/kernel data
> interface that we'd have to preserve. I'd like to avoid that and instead
> use the vsyscall method to give us greater flexibility. Plus I doubt
> anyone would want to implement the NTP adjustments in userspace? eek!
> 
> 
>>A requirement to call functions in the kernel to determine time makes
>>these solution impossible. And its getting extremely complex if one has to
>>invoke different functions for each timer supported.
> 
> 
> The struct timesource interface of read()/delta()/cyc2ns() was the best
> generalization I could come up with. I feel they're necessary for the
> following reasons:
> 	
> cyc2ns(): In this conversion we can optimize the math depending on the
> timesource. If the timesource freq is a power of 2, we can just use
> shift! However if its a weird value and we have to be very precise, we
> do a full 64bit divide. We're not stuck ith one equation given a freq
> value.
> 
> delta(): Some counters don't fill 32 or 64 bits. ACPI PM time source is
> 24 bits, and the cyclone is 40. Thus to do proper twos complement
> subtraction without overflow worries you need to mask the subtraction.
> This can be done by exporting a mask value w/ the freq value, but was
> cleaner when moved into the timesource. 
> 
> read(): Rather then just giving the address of the register, the read
> call allows for timesource specific logic. This lets us use jiffies as a
> timesource, or in cases like the ACPI PM timesource, where the register
> must be read 3 times in order to ensure a correct value is latched, we
> can avoid having to include that logic into the generic code, so it does
> not affect systems that do not use or have that timesource.

I am not convince that 3 reads are in fact needed.  In fact, I am amost certain 
that two is more than enough.  In fact, it takes so long to read it that I just 
use one read and a sanity check in the HRT code.  Here is the code I use:

unsigned long
quick_get_cpuctr(void)
{
	static  unsigned long last_read = 0;
	static  int qgc_max = 0;
	int i;

	unsigned long rd_delta, rd_ans, rd = inl(acpi_pm_tmr_address);

	/*
	 * This will be REALLY big if ever we move backward in time...
	 */
	rd_delta = (rd - last_read) & SIZE_MASK;
	last_read = rd;

	rd_ans =  (rd - last_update) & SIZE_MASK;

	if (likely((rd_ans < (arch_cycles_per_jiffy << 1)) &&
		   (rd_delta < (arch_cycles_per_jiffy << 1))))
		return rd_ans;

	for (i = 0; i < 10; i++) {
		rd = inl(acpi_pm_tmr_address);
		rd_delta = (rd - last_read) & SIZE_MASK;
		last_read = rd;
		if (unlikely(i > qgc_max))
			qgc_max = i;
		/*
		 * On my test machine (800MHZ dual PIII) this is always
		 * seven.  Seems long, but we will give it some slack...
		 * We note that rd_delta (and all the vars) unsigned so
		 * a backward movement will show as a really big number.
		 */
		if (likely(rd_delta < 20))
			return (rd - last_update) & SIZE_MASK;
	}
	return (rd - last_update) & SIZE_MASK;
}


-- 
George Anzinger   george@mvista.com
High-res-timers:  http://sourceforge.net/projects/high-res-timers/
Preemption patch: http://www.kernel.org/pub/linux/kernel/people/rml


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [RFC][PATCH] new timeofday core subsystem (v.A0)
  2004-09-03  6:42     ` Albert Cahalan
@ 2004-09-03  7:24       ` George Anzinger
  2004-09-03 19:27         ` john stultz
  2004-09-03 19:18       ` john stultz
  1 sibling, 1 reply; 60+ messages in thread
From: George Anzinger @ 2004-09-03  7:24 UTC (permalink / raw)
  To: Albert Cahalan
  Cc: john stultz, lkml, tim, Ulrich.Windl, clameter, Len Brown, linux,
	David Mosberger, Andi Kleen, paulus, schwidefsky, jimix,
	keith maanthey, greg kh, Patricia Gaughen, Chris McDermott

Albert Cahalan wrote:
> On Thu, 2004-09-02 at 21:39, George Anzinger wrote:
> 
>>john stultz wrote:
> 
> 
>>>+
>>>+static nsec_t jiffies_cyc2ns(cycle_t cyc, cycle_t* remainder)
>>>+{
>>>+
>>>+	cyc *= NSEC_PER_SEC/HZ;
>>
>>Hm... This assumes that 1/HZ is what is needed here.  Today this value is 
>>999898.  Not exactly reachable by NSEC_PER_SEC/HZ.  Or did I miss something, 
>>like the relationship of jiffie to 1/HZ and to real time.
> 
> 
> HZ not being HZ is the source of many foul problems.
> 
> NTP should be able to correct for the error. For systems
> not running NTP, provide a fake NTP to make corrections
> based on the expected frequency error.
> 
> Based on that, skip or double-up on the ticks to make
> them be exactly HZ over long periods of time.

There are several problems here.  First, to make this possible, you will have to 
outlaw several values for HZ (1024 comes to mind).  Second, like it or not, 1/HZ 
or something close to it, is the timer resolution.  I think we need to try and 
keep this a power of 10, mostly because there are a lot of folks who just don't 
understand it if it is not.  And third, you need to get real close to it with 
the hardware timer.  If you introduce NTP or some such to fix things, well, 
things just break another place.  For example, we started 2.6 with HZ=1000 and 
had problems like:
 > time sleep 10
sleeping for 9.9 seconds.  This will just not do.  Any corrections made to the 
wall clock really need to be made to the timer system as well.  As it is we 
assume that, by correctly choosing the tick value, the wall clock will be 
correct on average, even under NTP.  I.e. that the NTP correction will be, over 
time, very small.  We really do want code that is much more accurate than "time 
sleep 10" to be right, i.e. if we sleep for x nanoseconds, the wall clock will 
have changed by x nanoseconds while we did so.
> 
>

~

-- 
George Anzinger   george@mvista.com
High-res-timers:  http://sourceforge.net/projects/high-res-timers/
Preemption patch: http://www.kernel.org/pub/linux/kernel/people/rml


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [RFC][PATCH] new timeofday core subsystem (v.A0)
  2004-09-03  1:39   ` George Anzinger
  2004-09-03  1:58     ` john stultz
@ 2004-09-03  6:42     ` Albert Cahalan
  2004-09-03  7:24       ` George Anzinger
  2004-09-03 19:18       ` john stultz
  1 sibling, 2 replies; 60+ messages in thread
From: Albert Cahalan @ 2004-09-03  6:42 UTC (permalink / raw)
  To: george
  Cc: john stultz, lkml, tim, albert, Ulrich.Windl, clameter,
	Len Brown, linux, David Mosberger, Andi Kleen, paulus,
	schwidefsky, jimix, keith maanthey, greg kh, Patricia Gaughen,
	Chris McDermott

On Thu, 2004-09-02 at 21:39, George Anzinger wrote:
> john stultz wrote:

> > +
> > +static nsec_t jiffies_cyc2ns(cycle_t cyc, cycle_t* remainder)
> > +{
> > +
> > +	cyc *= NSEC_PER_SEC/HZ;
> 
> Hm... This assumes that 1/HZ is what is needed here.  Today this value is 
> 999898.  Not exactly reachable by NSEC_PER_SEC/HZ.  Or did I miss something, 
> like the relationship of jiffie to 1/HZ and to real time.

HZ not being HZ is the source of many foul problems.

NTP should be able to correct for the error. For systems
not running NTP, provide a fake NTP to make corrections
based on the expected frequency error.

Based on that, skip or double-up on the ticks to make
them be exactly HZ over long periods of time.

> > +int ntp_leapsecond(struct timespec now)
> > +{
> > +	/*
> > +	 * Leap second processing. If in leap-insert state at
> > +	 * the end of the day, the system clock is set back one
> > +	 * second; if in leap-delete state, the system clock is
> > +	 * set ahead one second. The microtime() routine or
> > +	 * external clock driver will insure that reported time
> > +	 * is always monotonic. The ugly divides should be
> > +	 * replaced.

Don't optimize until the patch is in and stable.
The divides can be removed much later. Wait months,
if not forever, before making the code less readable.

The same goes for arch-specific non-syscall hacks.



^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [RFC][PATCH] new timeofday core subsystem (v.A0)
  2004-09-03  1:39   ` George Anzinger
@ 2004-09-03  1:58     ` john stultz
  2004-09-03  6:42     ` Albert Cahalan
  1 sibling, 0 replies; 60+ messages in thread
From: john stultz @ 2004-09-03  1:58 UTC (permalink / raw)
  To: george anzinger
  Cc: lkml, tim, albert, Ulrich.Windl, clameter, Len Brown, linux,
	David Mosberger, Andi Kleen, paulus, schwidefsky, jimix,
	keith maanthey, greg kh, Patricia Gaughen, Chris McDermott

On Thu, 2004-09-02 at 18:39, George Anzinger wrote:
> john stultz wrote:
> > +static cycle_t jiffies_read(void)
> > +{
> > +	cycle_t ret = get_jiffies_64();
> > +	return ret;
> > +}
> > +
> > +static cycle_t jiffies_delta(cycle_t now, cycle_t then)
> > +{
> > +	/* simple subtraction, no need to mask */
> > +	return now - then;
> > +}
> 
> This should be inline...

Unfortunately they cannot be, as they are used as function pointers
through the struct timesource interface. 

> > +
> > +static nsec_t jiffies_cyc2ns(cycle_t cyc, cycle_t* remainder)
> > +{
> > +
> > +	cyc *= NSEC_PER_SEC/HZ;
> 
> Hm... This assumes that 1/HZ is what is needed here.  Today this value is 
> 999898.  Not exactly reachable by NSEC_PER_SEC/HZ.  Or did I miss something, 
> like the relationship of jiffie to 1/HZ and to real time.

You're right. You'd think with the recent discussions I would have
already fixed that, but it slipped by. I'll fix it to use ACTHZ in the
next release. Good catch. 


> > +int ntp_leapsecond(struct timespec now)
> > +{
> > +	/*
> > +	 * Leap second processing. If in leap-insert state at
> > +	 * the end of the day, the system clock is set back one
> > +	 * second; if in leap-delete state, the system clock is
> > +	 * set ahead one second. The microtime() routine or
> > +	 * external clock driver will insure that reported time
> > +	 * is always monotonic. The ugly divides should be
> > +	 * replaced.
> 
> ??  Is this really to be hidden?  I rather though this was just the same as a 
> user driven clock setting.  In any case, it is a slew of the wall clock and 
> should be presented to abs timer code so that the abs timers can be corrected. 
> This means a call to "clock_was_set()" AND an update of the mono_to_wall value.

Hmm. Looking at that comment, it seems a bit outdated. I'll see if I can
update it to be more clear. 

And yes, clock_was_set() is on my list. I realized I dropped it, but I
wanted to make sure I was using it properly before I just threw it in.


> > +void do_gettimeofday(struct timeval *tv)
> > +{
> > +	nsec_t wall, sys;
> > +	unsigned long seq;
> > +
> > +	/* atomically read wall and sys time */
> > +	do {
> > +		seq = read_seqbegin(&system_time_lock);
> > +
> > +		wall = wall_time_offset;
> > +		sys = __monotonic_clock();
> > +
> > +	} while (read_seqretry(&system_time_lock, seq));
> > +
> > +	/* add them and convert to timeval */
> > +	*tv = ns2timeval(wall+sys);
> > +}
> I am not sure you don't want to seperate the locking from the clock read.   This 
> so one lock can be put around larger bits of code.  For example, in 
> posix-times.c we need to get all three clocks under the same lock (that being 
> monotonic, wall_time_offset, and jiffies (and possibly as sub jiffie value)).

Well, all the values in timeofday.c are protected by the
system_time_lock, so I'm trying to minimize the locks. I do want to kill
off xtime and thus the xtime lock, so jiffies will be by itself and will
need a jiffies_lock of some sort. However part of this effort is to ween
folks off of using jiffies as a time value, thus in the future it should
only be used inside the timer subsystem as a interrupt counter. 

Thinking about it more, the NTP code could probably drop the ntp_lock
and just use the system_time_lock, but I think I'll push that
optimization off for a bit. 

> Something like:
> void get_tod_parts(nsec_t *wall, nsec_t *mon)
> {
> 	*wall = wall_time_offset;
> 	*mon  = __monotonic_clock();
> }
> 
> This could then be used in a larger function with out the double locking.

I understood your point above, but I'm not sure I see double locking.
__monotonic_clock() is lock free for exactly this reason.


> > +int do_settimeofday(struct timespec *tv)
> > +{
> > +	/* convert timespec to ns */
> > +	nsec_t newtime = timespec2ns(tv);
> > +
> > +	/* atomically adjust wall_time_offset to the desired value */
> > +	write_seqlock_irq(&system_time_lock);
> > +
> > +	wall_time_offset = newtime - __monotonic_clock();
> > +
> > +	/* clear NTP settings */
> > +	ntp_clear();
> > +
> > +	write_sequnlock_irq(&system_time_lock);
> 
> Also need a clock_was_set() call here.

Yep! 

Thanks for the feedback, George!
-john


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [RFC][PATCH] new timeofday core subsystem (v.A0)
  2004-09-02 21:09 ` [RFC][PATCH] new timeofday core subsystem (v.A0) john stultz
  2004-09-02 22:19   ` Christoph Lameter
@ 2004-09-03  1:39   ` George Anzinger
  2004-09-03  1:58     ` john stultz
  2004-09-03  6:42     ` Albert Cahalan
  1 sibling, 2 replies; 60+ messages in thread
From: George Anzinger @ 2004-09-03  1:39 UTC (permalink / raw)
  To: john stultz
  Cc: lkml, tim, albert, Ulrich.Windl, clameter, Len Brown, linux,
	David Mosberger, Andi Kleen, paulus, schwidefsky, jimix,
	keith maanthey, greg kh, Patricia Gaughen, Chris McDermott

john stultz wrote:
> All,
> 	This patch implements the architecture independent portion of the time
> of day subsystem. Included is timeofday.c (which includes all the time
> of day management and accessor functions), ntp.c (which includes the ntp
> scaling code, leap second processing, and ntp kernel state machine
> code), interface definition .h files, the example jiffies timesource
> (lowest common denominator time source, mainly for use as example code)
> and minimal hooks into arch independent code.
> 
> The patch does not function without minimal architecture specific hooks
> (i386 example to follow), and it can be applied to a tree without
> affecting the code.
> 
> I look forward to your comments and feedback.
> 
> thanks
> -john
~

> +static cycle_t jiffies_read(void)
> +{
> +	cycle_t ret = get_jiffies_64();
> +	return ret;
> +}
> +
> +static cycle_t jiffies_delta(cycle_t now, cycle_t then)
> +{
> +	/* simple subtraction, no need to mask */
> +	return now - then;
> +}

This should be inline...

> +
> +static nsec_t jiffies_cyc2ns(cycle_t cyc, cycle_t* remainder)
> +{
> +
> +	cyc *= NSEC_PER_SEC/HZ;

Hm... This assumes that 1/HZ is what is needed here.  Today this value is 
999898.  Not exactly reachable by NSEC_PER_SEC/HZ.  Or did I miss something, 
like the relationship of jiffie to 1/HZ and to real time.

> +
~

> +int ntp_leapsecond(struct timespec now)
> +{
> +	/*
> +	 * Leap second processing. If in leap-insert state at
> +	 * the end of the day, the system clock is set back one
> +	 * second; if in leap-delete state, the system clock is
> +	 * set ahead one second. The microtime() routine or
> +	 * external clock driver will insure that reported time
> +	 * is always monotonic. The ugly divides should be
> +	 * replaced.

??  Is this really to be hidden?  I rather though this was just the same as a 
user driven clock setting.  In any case, it is a slew of the wall clock and 
should be presented to abs timer code so that the abs timers can be corrected. 
This means a call to "clock_was_set()" AND an update of the mono_to_wall value.
> +	 */
> +	static time_t leaptime = 0;
> +
> +	switch (ntp_state) {
> +	case TIME_OK:
> +		if (ntp_status & STA_INS) {
> +			ntp_state = TIME_INS;
> +			/* calculate end of today (23:59:59)*/
> +			leaptime = now.tv_sec + SEC_PER_DAY - (now.tv_sec % SEC_PER_DAY) - 1;
> +		}
> +		else if (ntp_status & STA_DEL) {
> +			ntp_state = TIME_DEL;
> +			/* calculate end of today (23:59:59)*/
> +			leaptime = now.tv_sec + SEC_PER_DAY - (now.tv_sec % SEC_PER_DAY) - 1;
> +		}
> +		break;
> +
> +	case TIME_INS:
> +		/* Once we are at (or past) leaptime, insert the second */
> +		if (now.tv_sec > leaptime) {
> +			ntp_state = TIME_OOP;
> +			printk(KERN_NOTICE "Clock: inserting leap second 23:59:60 UTC\n");
> +
> +			return -1;
> +		}
> +		break;
> +
> +	case TIME_DEL:
> +		/* Once we are at (or past) leaptime, delete the second */
> +		if (now.tv_sec >= leaptime) {
> +			ntp_state = TIME_WAIT;
> +			printk(KERN_NOTICE "Clock: deleting leap second 23:59:59 UTC\n");
> +
> +			return 1;
> +		}
~

> +/* do_gettimeofday():
> + *		Returns the time of day
> + */
> +void do_gettimeofday(struct timeval *tv)
> +{
> +	nsec_t wall, sys;
> +	unsigned long seq;
> +
> +	/* atomically read wall and sys time */
> +	do {
> +		seq = read_seqbegin(&system_time_lock);
> +
> +		wall = wall_time_offset;
> +		sys = __monotonic_clock();
> +
> +	} while (read_seqretry(&system_time_lock, seq));
> +
> +	/* add them and convert to timeval */
> +	*tv = ns2timeval(wall+sys);
> +}
I am not sure you don't want to seperate the locking from the clock read.   This 
so one lock can be put around larger bits of code.  For example, in 
posix-times.c we need to get all three clocks under the same lock (that being 
monotonic, wall_time_offset, and jiffies (and possibly as sub jiffie value)).

Something like:
void get_tod_parts(nsec_t *wall, nsec_t *mon)
{
	*wall = wall_time_offset;
	*mon  = __monotonic_clock();
}

This could then be used in a larger function with out the double locking.

> +
> +
> +/* do_settimeofday():
> + *		Sets the time of day
> + */
> +int do_settimeofday(struct timespec *tv)
> +{
> +	/* convert timespec to ns */
> +	nsec_t newtime = timespec2ns(tv);
> +
> +	/* atomically adjust wall_time_offset to the desired value */
> +	write_seqlock_irq(&system_time_lock);
> +
> +	wall_time_offset = newtime - __monotonic_clock();
> +
> +	/* clear NTP settings */
> +	ntp_clear();
> +
> +	write_sequnlock_irq(&system_time_lock);

Also need a clock_was_set() call here.
> +
~
-- 
George Anzinger   george@mvista.com
High-res-timers:  http://sourceforge.net/projects/high-res-timers/
Preemption patch: http://www.kernel.org/pub/linux/kernel/people/rml


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [RFC][PATCH] new timeofday core subsystem (v.A0)
  2004-09-03  0:47               ` Christoph Lameter
@ 2004-09-03  1:30                 ` john stultz
  2004-09-03  7:43                   ` George Anzinger
  2004-09-03 16:18                   ` Christoph Lameter
  0 siblings, 2 replies; 60+ messages in thread
From: john stultz @ 2004-09-03  1:30 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: lkml, tim, george anzinger, albert, Ulrich.Windl, Len Brown,
	linux, David Mosberger, Andi Kleen, paulus, schwidefsky, jimix,
	keith maanthey, greg kh, Patricia Gaughen, Chris McDermott

On Thu, 2004-09-02 at 17:47, Christoph Lameter wrote:
> On Thu, 2 Sep 2004, john stultz wrote:
> 
> > > Of course but its not a generic way of timer acccess and
> > > requires a fastcall for each timer. You still have the problem of
> > > exporting the frequency and the time offsets to user space (those also
> > > need to be kept current in order for one to be able to calculate a timer
> > > value!). The syscall/fastcall API then needs to be extended to allow for a
> > > system call to access each of the individual timers supported.
> >
> > Indeed, it would require a fastcall accessor for each timesource, but
> > maybe fastcall is the wrong way to describe it. Could it just be a
> > function that uses the epc to escalate its privileges? As for freq and
> > offsets, those would already be user-visible (see below for more
> > details)
> 
> The only way curent way to enter the kernel from glibc with a fastcall is
> the EPC.

Hmm. I must be explaining myself poorly, or not understanding you. I
apologize for not understanding this EPC/fastcall business well enough.
I'd like to use EPC from a user-executable kernel page to escalate
privileges to access the hardware counter. I don't care if I have to use
the the current fastcall (fsys.S) interface or not. However you're
making sounds like this isn't possible, so I'll need to do some
research. 

> > The plan at the moment is that the timeofday core gettimeofday code path
> > as well as any timesource that supports it adds a _vsyscall linker
> > attribute. Then the linker will place all the code on a special page or
> > set of pages. Those pages would then be mapped in as user-readable. Then
> > just like the x86-64's vsyscall gettimeofday, glibc would re-direct
> > gettimeofday calls to the user-mapped kernel pages, where it would
> > execute in user mode (with maybe the epc privilege escalation for ia64
> > time sources that required it).
> 
> The EPC call already does do a *secure* transfer like this on IA64 and
> will execute kernel code without user space mapping. This idea raises all sorts
> of concerns....

Yes, but its not portable. Reducing duplicate code so timeofday
maintenance isn't a nightmare is the first goal here. It may not be
completely achievable, and when I hit that point I'll have to rework the
design, but at this point I'm not convinced that it cannot be done.

As for security concerns, all your mapping out to userspace are the time
variables and the functions needed to convert those to a accurate time
of day. This is almost what you suggest below, but with the additional
math from the kernel. The method I'm suggesting is very similar to
x86-64's arch/x86-64/kernel/vsyscall.c.


> > I had to do most of this for the i386 vsyscall-gettimeofday port, but I
> > was unhappy with the duplication of code (and bugs), as well as the fact
> > that I was then being pushed to re-do it for each architecture. While
> > its not currently implemented (I was hoping to keep the discussion
> > focused on the correctness of what's been implemented), I feel the plan
> > for user-mode access won't be too complex. I'm still open for further
> > discussion if you'd like, obviously performance is quite important, and
> > I want to calm any fears you have, but I'm sure the new ntp code plenty
> > of performance issues to look at before we start digging into usermode
> > access, so maybe we can come back to this discussion later?
> 
> The functions in the timer source structure is a central problem for IA64. We
> cannot take that out right now.

Don't worry I'm not submitting this code just yet. :) I'll need all (or
at least most of the important) architecture maintainers to sign on
before I try to push this code in.

Right now I'm trying to shake out possible problems with the design and
the first pass implementation of the code. You're helping me do that,
and I thank you for it. Concessions will be made, but for now I'm going
to try to preserve the current design and work around the problems as
they arise. 


> The full parameterization of timer access as I have suggested also allows
> user page access to timers with simply mapping a page to access the timer.
> No other gimmicks are needed since the timer source structure contains all
> information to calculate a timer offset.
> 
> > Yes, but x86-64 has one way, and ia64 does it another, and i know ppc
> > folks have talked about their own user mode time access. Chasing down a
> > time bug across arches gets to be fairly hairy, so I'm trying to
> > simplify that.
> 
> The simplest thins is to provide a data structure without any functions
> attached that can simply be copied into userspace if wanted. If an arch
> needs special time access then that is depending on the arch specific
> methods available and such a data structure as I have proposed will
> include all the info necessary to implement such user mode time access.

Ehhh.. I really don't like the idea of giving all the raw values to
userspace and letting user-code do the timeofday calculation. Fixing
bugs in each arches timeofday code is hard enough. Imagine if we have to
go through and fix userspace too! It would also make a user/kernel data
interface that we'd have to preserve. I'd like to avoid that and instead
use the vsyscall method to give us greater flexibility. Plus I doubt
anyone would want to implement the NTP adjustments in userspace? eek!

> A requirement to call functions in the kernel to determine time makes
> these solution impossible. And its getting extremely complex if one has to
> invoke different functions for each timer supported.

The struct timesource interface of read()/delta()/cyc2ns() was the best
generalization I could come up with. I feel they're necessary for the
following reasons:
	
cyc2ns(): In this conversion we can optimize the math depending on the
timesource. If the timesource freq is a power of 2, we can just use
shift! However if its a weird value and we have to be very precise, we
do a full 64bit divide. We're not stuck with one equation given a freq
value.

delta(): Some counters don't fill 32 or 64 bits. ACPI PM time source is
24 bits, and the cyclone is 40. Thus to do proper twos complement
subtraction without overflow worries you need to mask the subtraction.
This can be done by exporting a mask value w/ the freq value, but was
cleaner when moved into the timesource. 

read(): Rather then just giving the address of the register, the read
call allows for timesource specific logic. This lets us use jiffies as a
timesource, or in cases like the ACPI PM timesource, where the register
must be read 3 times in order to ensure a correct value is latched, we
can avoid having to include that logic into the generic code, so it does
not affect systems that do not use or have that timesource.

But I doubt I'll convince you with words, so let me work on it a bit and
see if I can code around your concerns and put you at ease. You've
brought up some good issues, and I'll definitely work to resolve them! 

Thanks again!
-john



^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [RFC][PATCH] new timeofday core subsystem (v.A0)
  2004-09-03  0:07             ` john stultz
@ 2004-09-03  0:47               ` Christoph Lameter
  2004-09-03  1:30                 ` john stultz
  0 siblings, 1 reply; 60+ messages in thread
From: Christoph Lameter @ 2004-09-03  0:47 UTC (permalink / raw)
  To: john stultz
  Cc: lkml, tim, george anzinger, albert, Ulrich.Windl, Len Brown,
	linux, David Mosberger, Andi Kleen, paulus, schwidefsky, jimix,
	keith maanthey, greg kh, Patricia Gaughen, Chris McDermott

On Thu, 2 Sep 2004, john stultz wrote:

> > Of course but its not a generic way of timer acccess and
> > requires a fastcall for each timer. You still have the problem of
> > exporting the frequency and the time offsets to user space (those also
> > need to be kept current in order for one to be able to calculate a timer
> > value!). The syscall/fastcall API then needs to be extended to allow for a
> > system call to access each of the individual timers supported.
>
> Indeed, it would require a fastcall accessor for each timesource, but
> maybe fastcall is the wrong way to describe it. Could it just be a
> function that uses the epc to escalate its privileges? As for freq and
> offsets, those would already be user-visible (see below for more
> details)

The only way curent way to enter the kernel from glibc with a fastcall is
the EPC.

> The plan at the moment is that the timeofday core gettimeofday code path
> as well as any timesource that supports it adds a _vsyscall linker
> attribute. Then the linker will place all the code on a special page or
> set of pages. Those pages would then be mapped in as user-readable. Then
> just like the x86-64's vsyscall gettimeofday, glibc would re-direct
> gettimeofday calls to the user-mapped kernel pages, where it would
> execute in user mode (with maybe the epc privilege escalation for ia64
> time sources that required it).

The EPC call already does do a *secure* transfer like this on IA64 and
will execute kernel code without user space mapping. This idea raises all sorts
of concerns....

> I had to do most of this for the i386 vsyscall-gettimeofday port, but I
> was unhappy with the duplication of code (and bugs), as well as the fact
> that I was then being pushed to re-do it for each architecture. While
> its not currently implemented (I was hoping to keep the discussion
> focused on the correctness of what's been implemented), I feel the plan
> for user-mode access won't be too complex. I'm still open for further
> discussion if you'd like, obviously performance is quite important, and
> I want to calm any fears you have, but I'm sure the new ntp code plenty
> of performance issues to look at before we start digging into usermode
> access, so maybe we can come back to this discussion later?

The functions in the timer source structure is a central problem for IA64. We
cannot take that out right now.

The full parameterization of timer access as I have suggested also allows
user page access to timers with simply mapping a page to access the timer.
No other gimmicks are needed since the timer source structure contains all
information to calculate a timer offset.

> Yes, but x86-64 has one way, and ia64 does it another, and i know ppc
> folks have talked about their own user mode time access. Chasing down a
> time bug across arches gets to be fairly hairy, so I'm trying to
> simplify that.

The simplest thins is to provide a data structure without any functions
attached that can simply be copied into userspace if wanted. If an arch
needs special time access then that is depending on the arch specific
methods available and such a data structure as I have proposed will
include all the info necessary to implement such user mode time access.

A requirement to call functions in the kernel to determine time makes
these solution impossible. And its getting extremely complex if one has to
invoke different functions for each timer supported.


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [RFC][PATCH] new timeofday core subsystem (v.A0)
  2004-09-02 23:39           ` Christoph Lameter
@ 2004-09-03  0:07             ` john stultz
  2004-09-03  0:47               ` Christoph Lameter
  0 siblings, 1 reply; 60+ messages in thread
From: john stultz @ 2004-09-03  0:07 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: lkml, tim, george anzinger, albert, Ulrich.Windl, Len Brown,
	linux, David Mosberger, Andi Kleen, paulus, schwidefsky, jimix,
	keith maanthey, greg kh, Patricia Gaughen, Chris McDermott

On Thu, 2004-09-02 at 16:39, Christoph Lameter wrote:
> On Thu, 2 Sep 2004, john stultz wrote:
> 
> > Then you implement a fastcall for fastcall_readcyclonecounter(), which
> > in crazy ia64 asm would do something like:
> >
> > ENTRY(fastcall_readcyclonecounter)
> > blip	// magic privledge escalation
> > blop	// load cyclone counter into memory
> > bloop	// copy cyclone counter back into return register
> > ;;
> > END(fastcall_readcyclonecounter)
> >
> >
> > This avoids 150+ lines of asm needed to re-implement the gettimeofday
> > math.
> >
> > However, I could be mistaken. Is something like this possible?
> 
> Of course but its not a generic way of timer acccess and
> requires a fastcall for each timer. You still have the problem of
> exporting the frequency and the time offsets to user space (those also
> need to be kept current in order for one to be able to calculate a timer
> value!). The syscall/fastcall API then needs to be extended to allow for a
> system call to access each of the individual timers supported.

Indeed, it would require a fastcall accessor for each timesource, but
maybe fastcall is the wrong way to describe it. Could it just be a
function that uses the epc to escalate its privileges? As for freq and
offsets, those would already be user-visible (see below for more
details)

> And how would this be supported from user space? We link in special
> libraries to support for cyclone and any other supported timers into
> each program? I thought a kernel would provide hardware abstraction?

The plan at the moment is that the timeofday core gettimeofday code path
as well as any timesource that supports it adds a _vsyscall linker
attribute. Then the linker will place all the code on a special page or
set of pages. Those pages would then be mapped in as user-readable. Then
just like the x86-64's vsyscall gettimeofday, glibc would re-direct
gettimeofday calls to the user-mapped kernel pages, where it would
execute in user mode (with maybe the epc privilege escalation for ia64
time sources that required it).

I had to do most of this for the i386 vsyscall-gettimeofday port, but I
was unhappy with the duplication of code (and bugs), as well as the fact
that I was then being pushed to re-do it for each architecture. While
its not currently implemented (I was hoping to keep the discussion
focused on the correctness of what's been implemented), I feel the plan
for user-mode access won't be too complex. I'm still open for further
discussion if you'd like, obviously performance is quite important, and
I want to calm any fears you have, but I'm sure the new ntp code plenty
of performance issues to look at before we start digging into usermode
access, so maybe we can come back to this discussion later?

> The current IA64 fastcall timer interface is generic and is bound into the
> clock_gettime interface of glibc.

Yes, but x86-64 has one way, and ia64 does it another, and i know ppc
folks have talked about their own user mode time access. Chasing down a
time bug across arches gets to be fairly hairy, so I'm trying to
simplify that.

Again, I really appreciate your thoughts and feedback. Let me see if in
the next release I can have more code to illustrate the user-mode access
plan.

thanks!
-john


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [RFC][PATCH] new timeofday core subsystem (v.A0)
  2004-09-02 23:14         ` john stultz
@ 2004-09-02 23:39           ` Christoph Lameter
  2004-09-03  0:07             ` john stultz
  0 siblings, 1 reply; 60+ messages in thread
From: Christoph Lameter @ 2004-09-02 23:39 UTC (permalink / raw)
  To: john stultz
  Cc: lkml, tim, george anzinger, albert, Ulrich.Windl, Len Brown,
	linux, David Mosberger, Andi Kleen, paulus, schwidefsky, jimix,
	keith maanthey, greg kh, Patricia Gaughen, Chris McDermott

On Thu, 2 Sep 2004, john stultz wrote:

> Then you implement a fastcall for fastcall_readcyclonecounter(), which
> in crazy ia64 asm would do something like:
>
> ENTRY(fastcall_readcyclonecounter)
> blip	// magic privledge escalation
> blop	// load cyclone counter into memory
> bloop	// copy cyclone counter back into return register
> ;;
> END(fastcall_readcyclonecounter)
>
>
> This avoids 150+ lines of asm needed to re-implement the gettimeofday
> math.
>
> However, I could be mistaken. Is something like this possible?

Of course but its not a generic way of timer acccess and
requires a fastcall for each timer. You still have the problem of
exporting the frequency and the time offsets to user space (those also
need to be kept current in order for one to be able to calculate a timer
value!). The syscall/fastcall API then needs to be extended to allow for a
system call to access each of the individual timers supported.

And how would this be supported from user space? We link in special
libraries to support for cyclone and any other supported timers into
each program? I thought a kernel would provide hardware abstraction?

The current IA64 fastcall timer interface is generic and is bound into the
clock_gettime interface of glibc.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [RFC][PATCH] new timeofday core subsystem (v.A0)
  2004-09-02 22:42       ` Christoph Lameter
@ 2004-09-02 23:14         ` john stultz
  2004-09-02 23:39           ` Christoph Lameter
  0 siblings, 1 reply; 60+ messages in thread
From: john stultz @ 2004-09-02 23:14 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: lkml, tim, george anzinger, albert, Ulrich.Windl, Len Brown,
	linux, David Mosberger, Andi Kleen, paulus, schwidefsky, jimix,
	keith maanthey, greg kh, Patricia Gaughen, Chris McDermott

On Thu, 2004-09-02 at 15:42, Christoph Lameter wrote:
> On Thu, 2 Sep 2004, john stultz wrote:
> 
> > What about my idea from yesterday of inverting the fastcall
> > relationship? Instead of creating a structure that exports values and
> > pointers the fastcall can use to create a time of day, why not use the
> > fast call to read the raw time and return it back to the time of day
> > code (which may be running in user context)? This avoids the duplication
> > of having to re-implement the timeofday/clock_gettime functions in
> > fastcall asm code.
> 
> "Read raw time"? How can you read the raw time in a fast call if the
> fast call needs to do additional function calls (as defined in the
> proposed time structure) in the kernel context in order to retrieve time?
> 
> A fast call cannot do any function calls in the kernel context or
> otherwise.
> 
> The overhead of the function calls will reduce the performance of time
> access significantly.

Forgive me if I mis-understand the fastcall method, but this is the
concept. Instead of having a fastcall function that implmements the
gettimofday/clock_gettime + ntp scaling + etc all in asm we do the
following:

When creating a ia64 timesource (say the cyclone just for a specific
example) we do something like the following.

/* trivial delta (nothing new here) */
static cycle_t cyclone_fastcall_delta(cycle_t now, cycle_t then)
{
	return (now - then);
}
/* trivial cyc2ns (nothing new here) */
static nsec_t cyclone_fastcall_cyc2ns(cycle_t cyc, cycle_t* remainder)
{
	u64 rem;
	cyc *= freq_multiplier;
	if (remainder)
		*remainder = 0;
	return (nsec_t)cyc;
}

/* fastcall read, this is where it gets interesting */
static cycle_t cyclone_fasatcall_read(void)
{
	u64 ret;

	ret = fastcall_readcyclonecounter();

	return (cycle_t)ret;
}

struct timesource_t timesource_cyclone_fastcall = {
	.name = "cyclone_fastcall",
	.priority = 100,
	.read = cyclone_fastcall_read,
	.delta = cyclone_fastcall_delta,
	.cyc2ns = cyclone_fastcall_cyc2ns,
};

Then you implement a fastcall for fastcall_readcyclonecounter(), which
in crazy ia64 asm would do something like:

ENTRY(fastcall_readcyclonecounter)
blip	// magic privledge escalation
blop	// load cyclone counter into memory
bloop	// copy cyclone counter back into return register
;;
END(fastcall_readcyclonecounter)


This avoids 150+ lines of asm needed to re-implement the gettimeofday
math.

However, I could be mistaken. Is something like this possible?

thanks
-john





^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [RFC][PATCH] new timeofday core subsystem (v.A0)
  2004-09-02 22:28     ` john stultz
@ 2004-09-02 22:42       ` Christoph Lameter
  2004-09-02 23:14         ` john stultz
  0 siblings, 1 reply; 60+ messages in thread
From: Christoph Lameter @ 2004-09-02 22:42 UTC (permalink / raw)
  To: john stultz
  Cc: lkml, tim, george anzinger, albert, Ulrich.Windl, Len Brown,
	linux, David Mosberger, Andi Kleen, paulus, schwidefsky, jimix,
	keith maanthey, greg kh, Patricia Gaughen, Chris McDermott

On Thu, 2 Sep 2004, john stultz wrote:

> What about my idea from yesterday of inverting the fastcall
> relationship? Instead of creating a structure that exports values and
> pointers the fastcall can use to create a time of day, why not use the
> fast call to read the raw time and return it back to the time of day
> code (which may be running in user context)? This avoids the duplication
> of having to re-implement the timeofday/clock_gettime functions in
> fastcall asm code.

"Read raw time"? How can you read the raw time in a fast call if the
fast call needs to do additional function calls (as defined in the
proposed time structure) in the kernel context in order to retrieve time?

A fast call cannot do any function calls in the kernel context or
otherwise.

The overhead of the function calls will reduce the performance of time
access significantly.



^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [RFC][PATCH] new timeofday core subsystem (v.A0)
  2004-09-02 22:19   ` Christoph Lameter
@ 2004-09-02 22:28     ` john stultz
  2004-09-02 22:42       ` Christoph Lameter
  0 siblings, 1 reply; 60+ messages in thread
From: john stultz @ 2004-09-02 22:28 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: lkml, tim, george anzinger, albert, Ulrich.Windl, Len Brown,
	linux, David Mosberger, Andi Kleen, paulus, schwidefsky, jimix,
	keith maanthey, greg kh, Patricia Gaughen, Chris McDermott

On Thu, 2004-09-02 at 15:19, Christoph Lameter wrote:
> On Thu, 2 Sep 2004, john stultz wrote:
> 
> > +struct timesource_t {
> > +	char* name;
> > +	int priority;
> > +	cycle_t (*read)(void);
> > +	cycle_t (*delta)(cycle_t now, cycle_t then);
> > +	nsec_t (*cyc2ns)(cycle_t cycles, cycle_t* remainder);
> > +};
> 
> This is defining functions to be called for a timesource. Which means that
> the functionality for clock_gettime cannot be implemented as a fastcall
> anymore on IA64. The delta and the cyc2ns function are always the same.
> The compensation for jitter etc is always the same and can also be
> implemented generically.
> 
> Could we change this to avoid the function calls.
> 
> For example.
> 
> struct timesource_t {
> 	char *name;
> 	int priority;
> 	int type;
> 	void *address;
> 	long long int frequency;
> }
> 
> The functions are always
> 
> delta = (now-then)
> cycle2ns = cycles * (NSEC_PER_SEC / frequency) + remainder
> 
> types of time sources
> 
> #define TIMESOURCE_CPU 0
> #define TIMESOURCE_MMIO32 1
> #define TIMESOURCE_MMIO64 2
> #define TIMESOURCE_FUNCTION 3 	/* To catch odd cases */


What about my idea from yesterday of inverting the fastcall
relationship? Instead of creating a structure that exports values and
pointers the fastcall can use to create a time of day, why not use the
fast call to read the raw time and return it back to the time of day
code (which may be running in user context)? This avoids the duplication
of having to re-implement the timeofday/clock_gettime functions in
fastcall asm code.

thanks
-john


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [RFC][PATCH] new timeofday core subsystem (v.A0)
  2004-09-02 21:09 ` [RFC][PATCH] new timeofday core subsystem (v.A0) john stultz
@ 2004-09-02 22:19   ` Christoph Lameter
  2004-09-02 22:28     ` john stultz
  2004-09-03  1:39   ` George Anzinger
  1 sibling, 1 reply; 60+ messages in thread
From: Christoph Lameter @ 2004-09-02 22:19 UTC (permalink / raw)
  To: john stultz
  Cc: lkml, tim, george anzinger, albert, Ulrich.Windl, Len Brown,
	linux, David Mosberger, Andi Kleen, paulus, schwidefsky, jimix,
	keith maanthey, greg kh, Patricia Gaughen, Chris McDermott

On Thu, 2 Sep 2004, john stultz wrote:

> +struct timesource_t {
> +	char* name;
> +	int priority;
> +	cycle_t (*read)(void);
> +	cycle_t (*delta)(cycle_t now, cycle_t then);
> +	nsec_t (*cyc2ns)(cycle_t cycles, cycle_t* remainder);
> +};

This is defining functions to be called for a timesource. Which means that
the functionality for clock_gettime cannot be implemented as a fastcall
anymore on IA64. The delta and the cyc2ns function are always the same.
The compensation for jitter etc is always the same and can also be
implemented generically.

Could we change this to avoid the function calls.

For example.

struct timesource_t {
	char *name;
	int priority;
	int type;
	void *address;
	long long int frequency;
}

The functions are always

delta = (now-then)
cycle2ns = cycles * (NSEC_PER_SEC / frequency) + remainder

types of time sources

#define TIMESOURCE_CPU 0
#define TIMESOURCE_MMIO32 1
#define TIMESOURCE_MMIO64 2
#define TIMESOURCE_FUNCTION 3 	/* To catch odd cases */

^ permalink raw reply	[flat|nested] 60+ messages in thread

* [RFC][PATCH] new timeofday core subsystem (v.A0)
  2004-09-02 21:07 [RFC] New Time of day proposal (updated 9/2/04) john stultz
@ 2004-09-02 21:09 ` john stultz
  2004-09-02 22:19   ` Christoph Lameter
  2004-09-03  1:39   ` George Anzinger
  0 siblings, 2 replies; 60+ messages in thread
From: john stultz @ 2004-09-02 21:09 UTC (permalink / raw)
  To: lkml
  Cc: tim, george anzinger, albert, Ulrich.Windl, clameter, Len Brown,
	linux, David Mosberger, Andi Kleen, paulus, schwidefsky, jimix,
	keith maanthey, greg kh, Patricia Gaughen, Chris McDermott

All,
	This patch implements the architecture independent portion of the time
of day subsystem. Included is timeofday.c (which includes all the time
of day management and accessor functions), ntp.c (which includes the ntp
scaling code, leap second processing, and ntp kernel state machine
code), interface definition .h files, the example jiffies timesource
(lowest common denominator time source, mainly for use as example code)
and minimal hooks into arch independent code.

The patch does not function without minimal architecture specific hooks
(i386 example to follow), and it can be applied to a tree without
affecting the code.

I look forward to your comments and feedback.

thanks
-john

linux-2.6.9-rc1_timeofday-core_A0.patch
=======================================
diff -Nru a/drivers/Makefile b/drivers/Makefile
--- a/drivers/Makefile	2004-09-02 13:27:00 -07:00
+++ b/drivers/Makefile	2004-09-02 13:27:00 -07:00
@@ -52,3 +52,4 @@
 obj-$(CONFIG_CPU_FREQ)		+= cpufreq/
 obj-$(CONFIG_MMC)		+= mmc/
 obj-y				+= firmware/
+obj-$(CONFIG_NEWTOD)		+= timesource/
diff -Nru a/drivers/timesource/Makefile b/drivers/timesource/Makefile
--- /dev/null	Wed Dec 31 16:00:00 196900
+++ b/drivers/timesource/Makefile	2004-09-02 13:27:00 -07:00
@@ -0,0 +1 @@
+obj-y += jiffies.o
diff -Nru a/drivers/timesource/jiffies.c b/drivers/timesource/jiffies.c
--- /dev/null	Wed Dec 31 16:00:00 196900
+++ b/drivers/timesource/jiffies.c	2004-09-02 13:27:00 -07:00
@@ -0,0 +1,61 @@
+/*
+ * linux/drivers/timesource/jiffies.c
+ *
+ * Copyright (C) 2004 IBM
+ *
+ * This file contains the jiffies based time source.
+ *
+ */
+#include <linux/timesource.h>
+#include <linux/jiffies.h>
+#include <linux/init.h>
+
+/* The Jiffies based timesource is the lowest common
+ * denominator time source which should function on
+ * all systems. It has the same course resolution as
+ * the timer interrupt frequency HZ and it suffers
+ * inaccuracies caused by missed or lost timer
+ * interrupts and the inability for the timer
+ * interrupt hardware to accuratly tick at the
+ * requested HZ value. It is also not reccomended
+ * for "tick-less" systems.
+ */
+
+static cycle_t jiffies_read(void)
+{
+	cycle_t ret = get_jiffies_64();
+	return ret;
+}
+
+static cycle_t jiffies_delta(cycle_t now, cycle_t then)
+{
+	/* simple subtraction, no need to mask */
+	return now - then;
+}
+
+static nsec_t jiffies_cyc2ns(cycle_t cyc, cycle_t* remainder)
+{
+
+	cyc *= NSEC_PER_SEC/HZ;
+
+	/* no remainder, so set it to zero */
+	if (remainder)
+		*remainder = 0;
+
+	return (nsec_t)cyc;
+}
+
+struct timesource_t timesource_jiffies = {
+	.name = "jiffies",
+	.priority = 0, /* lowest priority*/
+	.read = jiffies_read,
+	.delta = jiffies_delta,
+	.cyc2ns = jiffies_cyc2ns,
+};
+
+static int init_jiffies_timesource(void)
+{
+	register_timesource(&timesource_jiffies);
+	return 0;
+}
+module_init(init_jiffies_timesource);
diff -Nru a/include/linux/ntp.h b/include/linux/ntp.h
--- /dev/null	Wed Dec 31 16:00:00 196900
+++ b/include/linux/ntp.h	2004-09-02 13:27:00 -07:00
@@ -0,0 +1,22 @@
+/*	linux/include/linux/ntp.h
+ *
+ *	Copyright (C) 2003, 2004 IBM, John Stultz (johnstul@us.ibm.com)
+ *
+ *	This file contains time of day helper functions
+ */
+
+#ifndef _LINUX_NTP_H
+#define _LINUX_NTP_H
+#include <linux/types.h>
+#include <linux/time.h>
+#include <linux/timex.h>
+
+/* timeofday interfaces */
+nsec_t ntp_scale(nsec_t value);
+void ntp_advance(nsec_t value);
+int ntp_adjtimex(struct timex*);
+int ntp_leapsecond(struct timespec now);
+void ntp_clear(void);
+int get_ntp_status(void);
+
+#endif
diff -Nru a/include/linux/time.h b/include/linux/time.h
--- a/include/linux/time.h	2004-09-02 13:27:00 -07:00
+++ b/include/linux/time.h	2004-09-02 13:27:00 -07:00
@@ -24,6 +24,10 @@
 
 #ifdef __KERNEL__
 
+/* timeofday base types */
+typedef u64 nsec_t;
+typedef u64 cycle_t;
+
 #include <linux/spinlock.h>
 #include <linux/seqlock.h>
 #include <linux/timex.h>
diff -Nru a/include/linux/timeofday.h b/include/linux/timeofday.h
--- /dev/null	Wed Dec 31 16:00:00 196900
+++ b/include/linux/timeofday.h	2004-09-02 13:27:00 -07:00
@@ -0,0 +1,61 @@
+/*	linux/include/linux/timeofday.h
+ *
+ *	Copyright (C) 2003, 2004 IBM, John Stultz (johnstul@us.ibm.com)
+ *
+ *	This file contains the interface to the time of day subsystem
+ */
+#ifndef _LINUX_TIMEOFDAY_H
+#define _LINUX_TIMEOFDAY_H
+#include <linux/types.h>
+#include <linux/time.h>
+
+#ifdef CONFIG_NEWTOD
+nsec_t get_lowres_timestamp(void);
+nsec_t get_lowres_timeofday(void);
+nsec_t do_monotonic_clock(void);
+
+
+void do_gettimeofday(struct timeval *tv);
+int do_settimeofday(struct timespec *tv);
+int do_adjtimex(struct timex *tx);
+
+void timeofday_interrupt_hook(void);
+void timeofday_init(void);
+
+
+/* Helper functions */
+static inline struct timeval ns2timeval(nsec_t ns)
+{
+	struct timeval tv;
+	tv.tv_sec = div_long_long_rem(ns, NSEC_PER_SEC, &tv.tv_usec);
+	tv.tv_usec /= NSEC_PER_USEC;
+	return tv;
+}
+
+static inline struct timespec ns2timespec(nsec_t ns)
+{
+	struct timespec ts;
+	ts.tv_sec = div_long_long_rem(ns, NSEC_PER_SEC, &ts.tv_nsec);
+	return ts;
+}
+
+static inline u64 timespec2ns(struct timespec* ts)
+{
+	nsec_t ret;
+	ret = ((nsec_t)ts->tv_sec) * NSEC_PER_SEC;
+	ret += ts->tv_nsec;
+	return ret;
+}
+
+static inline nsec_t timeval2ns(struct timeval* tv)
+{
+	nsec_t ret;
+	ret = ((nsec_t)tv->tv_sec) * NSEC_PER_SEC;
+	ret += tv->tv_usec*NSEC_PER_USEC;
+	return ret;
+}
+#else /* CONFIG_NEWTOD */
+#define timeofday_interrupt_hook()
+#define timeofday_init()
+#endif /* CONFIG_NEWTOD */
+#endif /* _LINUX_TIMEOFDAY_H */
diff -Nru a/include/linux/timesource.h b/include/linux/timesource.h
--- /dev/null	Wed Dec 31 16:00:00 196900
+++ b/include/linux/timesource.h	2004-09-02 13:27:00 -07:00
@@ -0,0 +1,36 @@
+/*	linux/include/linux/timesource.h
+ *
+ *	Copyright (C) 2003, 2004 IBM, John Stultz (johnstul@us.ibm.com)
+ *
+ *	This file contains the structure definitions for timesources.
+ *
+ *	If you are not a timesource, or the time of day code, you should
+ *	not be including this file.
+ */
+#ifndef _LINUX_TIMESORUCE_H
+#define _LINUX_TIMESORUCE_H
+
+#include <linux/types.h>
+#include <linux/time.h>
+
+/* struct timesource_t:
+ *		Provides mostly state-free accessors to the underlying
+ *		hardware.
+ * name:	ptr to timesource name
+ * priority:priority value (higher is better)
+ * @read:	returns a cycle value
+ * @delta:	calculates the difference between two cycle values
+ * @cyc2ns:	converts a cycle value to ns (expected to be expensive)
+ */
+struct timesource_t {
+	char* name;
+	int priority;
+	cycle_t (*read)(void);
+	cycle_t (*delta)(cycle_t now, cycle_t then);
+	nsec_t (*cyc2ns)(cycle_t cycles, cycle_t* remainder);
+};
+
+/* used to install a new time source */
+void register_timesource(struct timesource_t*);
+
+#endif
diff -Nru a/init/main.c b/init/main.c
--- a/init/main.c	2004-09-02 13:27:00 -07:00
+++ b/init/main.c	2004-09-02 13:27:00 -07:00
@@ -45,6 +45,7 @@
 #include <linux/unistd.h>
 #include <linux/rmap.h>
 #include <linux/mempolicy.h>
+#include <linux/timeofday.h>
 
 #include <asm/io.h>
 #include <asm/bugs.h>
@@ -514,6 +515,7 @@
 	pidhash_init();
 	init_timers();
 	softirq_init();
+	timeofday_init();
 	time_init();
 
 	/*
diff -Nru a/kernel/Makefile b/kernel/Makefile
--- a/kernel/Makefile	2004-09-02 13:27:00 -07:00
+++ b/kernel/Makefile	2004-09-02 13:27:00 -07:00
@@ -9,6 +9,7 @@
 	    rcupdate.o intermodule.o extable.o params.o posix-timers.o \
 	    kthread.o
 
+obj-$(CONFIG_NEWTOD) += timeofday.o ntp.o
 obj-$(CONFIG_FUTEX) += futex.o
 obj-$(CONFIG_GENERIC_ISA_DMA) += dma.o
 obj-$(CONFIG_SMP) += cpu.o
diff -Nru a/kernel/ntp.c b/kernel/ntp.c
--- /dev/null	Wed Dec 31 16:00:00 196900
+++ b/kernel/ntp.c	2004-09-02 13:27:00 -07:00
@@ -0,0 +1,554 @@
+/********************************************************************
+* linux/kernel/ntp.c
+*
+* NTP state machine and time scaling code.
+*
+* Copyright (C) 2004 IBM, John Stultz (johnstul@us.ibm.com)
+*
+* Portions rewritten from kernel/time.c and kernel/timer.c
+* Please see those files for original copyrights.
+*
+* Hopefully you should never have to understand or touch
+* any of the code below. but don't let that keep you from trying!
+*
+* This code is loosely based on David Mills' RFC 1589 and its
+* updates. Please see the following for more details:
+*  http://www.eecis.udel.edu/~mills/database/rfc/rfc1589.txt
+*  http://www.eecis.udel.edu/~mills/database/reports/kern/kernb.pdf
+*
+* NOTE:	To simplify the code, we do not implement any of
+* the PPS code, as the code that uses it never was merged.
+*					-johnstul@us.ibm.com
+*
+* 	Revision History:
+*		2004-09-02:	A0
+*			o First pass sent to lkml for review.
+*
+*
+* TODO List:
+*	o More documentation
+*	o More testing
+*	o More optimization
+*********************************************************************/
+
+#include <linux/ntp.h>
+#include <linux/errno.h>
+#include <linux/sched.h> /* Needed for capable() */
+
+/* NTP scaling code
+ * Functions:
+ * ----------
+ * nsec_t ntp_scale(nsec_t value):
+ *		Scales the nsec_t vale using ntp kernel state
+ * void ntp_advance(nsec_t interval):
+ *		Increments the NTP state machine by interval time
+ * static int ntp_hardupdate(long offset, struct timeval tv)
+ *		ntp_adjtimex helper function
+ * int ntp_adjtimex(struct timex* tx):
+ *		Interface to adjust NTP state machine
+ * int ntp_leapsecond(struct timespec now)
+ *		Does NTP leapsecond processing. Returns number of
+ *		seconds current time should be adjusted by.
+ * void ntp_clear(void):
+ *		Clears the ntp kernel state
+ * int get_ntp_status(void):
+ *		returns ntp_status value
+ *
+ * Variables:
+ * ----------
+ * ntp kernel state variables:
+ *		See below for full list.
+ * ntp_lock:
+ *		Protects ntp kernel state variables
+ */
+
+
+
+/* Chapter 5: Kernel Variables [RFC 1589 pg. 28] */
+/* 5.1 Interface Variables */
+static int ntp_status		= STA_UNSYNC;		/* status */
+static long ntp_offset;					/* usec */
+static long ntp_constant	= 2;			/* ntp magic? */
+static long ntp_maxerror	= NTP_PHASE_LIMIT;	/* usec */
+static long ntp_esterror	= NTP_PHASE_LIMIT;	/* usec */
+static const long ntp_tolerance	= MAXFREQ;		/* shifted ppm */
+static const long ntp_precision	= 1;			/* constant */
+
+/* 5.2 Phase-Lock Loop Variables */
+static long ntp_freq;					/* shifted ppm */
+static long ntp_reftime;				/* sec */
+
+/* Extra values */
+static int ntp_state	= TIME_OK;		/* leapsecond state */
+static long ntp_tick	= USEC_PER_SEC/USER_HZ;	/* tick length */
+
+static s64 ss_offset_len;	/* SINGLESHOT offset adj interval (nsec)*/
+static long singleshot_adj;	/* +/- MAX_SINGLESHOT_ADJ (ppm)*/
+static long tick_adj; 		/* tx->tick adjustment (ppm) */
+static long offset_adj;		/* offset adjustment (ppm) */
+
+
+/* lock for the above variables */
+static seqlock_t ntp_lock = SEQLOCK_UNLOCKED;
+
+#define MILLION 1000000
+#define MAX_SINGLESHOT_ADJ 500 /* (ppm) */
+#define SEC_PER_DAY 86400
+
+/* Required to safely shift negative values */
+#define shiftR(x,s) (x < 0) ? (-((-x) >> (s))) : ((x) >> (s))
+
+/* nsec_t ntp_scale(nsec_t):
+ *	Scale the raw time interval to an NTP scaled value
+ *
+ *	This is done in three steps:
+ *	1. Tick adjustment.
+ *	2. Frequency adjustment.
+ *	3. Singleshot offset adjustment
+ *
+ *	The first two are done over the entire raw interval. While
+ *	the third is only done as long as the raw interval is less
+ *	then the singleshot offset length.
+ *
+ *	The basic equasion is:
+ *	raw = raw + ppm scaling
+ *
+ *	So while raw < ss_offset_len:
+ *	raw = raw + (raw*(tick_adj+freq_adj+ss_offset_adj))/MILLION
+ *
+ *	But when raw is > ss_offset_len:
+ *	raw = raw
+ *		+ (ss_offset_len*(tick_adj+freq_adj+ss_offset_adj))/MILLION
+ *		+ ((raw - ss_offset_len)*(tick_adj+freq_adj))/MILLION
+ *
+ * XXX - TODO:
+ *	o Get rid of nasty 64bit divides (try shifts)
+ *	o more comments!
+ */
+nsec_t ntp_scale(nsec_t raw)
+{
+	u64 ret = (u64)raw;
+	unsigned long seq;
+	u64 offset_len;
+	long freq_ppm, tick_ppm, ss_ppm, offset_ppm;
+
+	/* save off current kernel state */
+	do {
+		seq = read_seqbegin(&ntp_lock);
+
+		tick_ppm = tick_adj;
+		offset_ppm = offset_adj;
+		freq_ppm = shiftR(ntp_freq,SHIFT_USEC);
+		ss_ppm = singleshot_adj;
+		offset_len = (u64)ss_offset_len;
+
+	} while (read_seqretry(&ntp_lock, seq));
+
+
+	/* Due to sign issues, this is a bit ugly */
+	if (raw <= offset_len) {
+		/* calcluate total ppm */
+		long ppm = tick_ppm + freq_ppm + offset_ppm + ss_ppm;
+
+		/* use abs(ppm) to avoid sign issues w/ do_div */
+		u64 tmp = raw * abs(ppm);
+
+		/* XXX - try to replace this w/ a 64bit shift */
+		do_div(tmp, MILLION);
+
+		/* reapply the sign lost by using abs(ppm) */
+		if(ppm < 0)
+			ret -= tmp;
+		else
+			ret += tmp;
+
+	} else {
+		/* Only apply MAX_SINGLESHOT_ADJ for the ss_offset_len */
+		s64 v1 = offset_len*(tick_ppm + freq_ppm + offset_ppm + ss_ppm);
+		s64 v2 = (s64)(raw - offset_len)*(tick_ppm + freq_ppm+ offset_ppm);
+
+		s64 tmp = v1+v2;
+
+		/* use abs(tmp) to avoid sign issues w/ do_div */
+		u64 tmp2 = abs(tmp);
+
+		/* XXX - try to replace this w/ a 64bit shift */
+		do_div(tmp2,MILLION);
+
+		/* reapply the sign lost by using abs(tmp2) */
+		if(tmp < 0)
+			ret -= tmp2;
+		else
+			ret += tmp2;
+	}
+
+	return (nsec_t)ret;
+}
+
+/* void ntp_advance(nsec_t interval):
+ *	Periodic hook which increments NTP state machine by interval
+ *	This is ntp_hardclock in the RFC.
+ */
+void ntp_advance(nsec_t interval)
+{
+	static u64 interval_sum=0;
+
+	/* inc interval sum */
+	interval_sum += interval;
+
+	write_seqlock_irq(&ntp_lock);
+
+	/* decrement singleshot offset interval */
+	ss_offset_len =- interval;
+	if(ss_offset_len < 0) /* make sure it doesn't go negative */
+		ss_offset_len=0;
+
+	/* Do second overflow code */
+	while (interval_sum > NSEC_PER_SEC) {
+		/* XXX - I'd prefer to smoothly apply this math
+		 * at each call to ntp_advance() rather then each
+		 * second.
+		 */
+		long tmp;
+
+		/* Bump maxerror by ntp_tolerance */
+		ntp_maxerror += shiftR(ntp_tolerance, SHIFT_USEC);
+		if (ntp_maxerror > NTP_PHASE_LIMIT) {
+			ntp_maxerror = NTP_PHASE_LIMIT;
+			ntp_status |= STA_UNSYNC;
+		}
+
+		/* Calculate offset_adj for the next second */
+		tmp = ntp_offset;
+		if (!(ntp_status & STA_FLL))
+		    tmp = shiftR(tmp, SHIFT_KG + ntp_constant);
+
+		/* bound the adjustment to MAXPHASE/MINSEC */
+		if (tmp > (MAXPHASE / MINSEC) << SHIFT_UPDATE)
+		    tmp = (MAXPHASE / MINSEC) << SHIFT_UPDATE;
+		if (tmp < -(MAXPHASE / MINSEC) << SHIFT_UPDATE)
+		    tmp = -(MAXPHASE / MINSEC) << SHIFT_UPDATE;
+
+		offset_adj = shiftR(tmp, SHIFT_UPDATE); /* (usec/sec) = ppm */
+		ntp_offset -= tmp;
+
+		interval_sum -= NSEC_PER_SEC;
+	}
+
+/* XXX - if still needed do equiv code to switching time_next_adjust */
+
+	write_sequnlock_irq(&ntp_lock);
+
+}
+
+
+/* called only by ntp_adjtimex while holding ntp_lock */
+static int ntp_hardupdate(long offset, struct timeval tv)
+{
+	int ret;
+	long tmp, interval;
+
+	ret = 0;
+	if (!(ntp_status & STA_PLL))
+		return ret;
+
+	tmp = offset;
+	/* Make sure offset is bounded by MAXPHASE */
+	if (tmp > MAXPHASE)
+		tmp = MAXPHASE;
+	if (tmp < -MAXPHASE)
+		tmp = -MAXPHASE;
+
+	ntp_offset = tmp << SHIFT_UPDATE;
+
+    if ((ntp_status & STA_FREQHOLD) || (ntp_reftime == 0))
+		ntp_reftime = tv.tv_sec;
+
+	/* calculate seconds since last call to hardupdate */
+	interval = tv.tv_sec - ntp_reftime;
+	ntp_reftime = tv.tv_sec;
+
+	if ((ntp_status & STA_FLL) && (interval >= MINSEC)) {
+		long damping;
+		tmp = (offset / interval); /* ppm (usec/sec)*/
+
+		/* convert to shifted ppm, then apply damping factor */
+
+		/* calculate damping factor - XXX bigger comment!*/
+		damping = SHIFT_KH - SHIFT_USEC;
+
+		/* apply damping factor */
+		ntp_freq += shiftR(tmp,damping);
+
+		printk("ntp->freq change: %ld\n",shiftR(tmp,damping));
+
+	} else if ((ntp_status & STA_PLL) && (interval < MAXSEC)) {
+		long damping;
+		tmp = offset * interval; /* ppm XXX - not quite*/
+
+		/* calculate damping factor - XXX bigger comment!*/
+		damping = (2 * ntp_constant) + SHIFT_KF - SHIFT_USEC;
+
+		/* apply damping factor */
+		ntp_freq += shiftR(tmp,damping);
+
+		printk("ntp->freq change: %ld\n", shiftR(tmp,damping));
+
+	} else { /* interval out of bounds */
+		printk("interval out of bounds: %ld\n", interval);
+		ret = -1; /* TIME_ERROR */
+	}
+
+	/* bound ntp_freq */
+	if (ntp_freq > ntp_tolerance)
+		ntp_freq = ntp_tolerance;
+	if (ntp_freq < -ntp_tolerance)
+		ntp_freq = -ntp_tolerance;
+
+	return ret;
+}
+
+/* int ntp_adjtimex(struct timex* tx)
+ *	Interface to change NTP state machine
+ */
+int ntp_adjtimex(struct timex* tx)
+{
+	long save_offset;
+	int result;
+
+/*=[Sanity checking]===============================*/
+	/* Check capabilities if we're trying to modify something */
+	if (tx->modes && !capable(CAP_SYS_TIME))
+		return -EPERM;
+
+	/* frequency adjustment limited to +/- MAXFREQ */
+	if ((tx->modes & ADJ_FREQUENCY)
+			&& (abs(tx->freq) > MAXFREQ))
+		return -EINVAL;
+
+	/* maxerror adjustment limited to NTP_PHASE_LIMIT */
+	if ((tx->modes & ADJ_MAXERROR)
+			&& (tx->maxerror < 0
+				|| tx->maxerror >= NTP_PHASE_LIMIT))
+		return -EINVAL;
+
+	/* esterror adjustment limited to NTP_PHASE_LIMIT */
+	if ((tx->modes & ADJ_ESTERROR)
+			&& (tx->esterror < 0
+				|| tx->esterror >= NTP_PHASE_LIMIT))
+		return -EINVAL;
+
+	/* constant adjustment must be positive */
+	if ((tx->modes & ADJ_TIMECONST)
+			&& (tx->constant < 0))
+		return -EINVAL;
+
+	/* Single shot mode can only be used by itself */
+	if (((tx->modes & ADJ_OFFSET_SINGLESHOT) == ADJ_OFFSET_SINGLESHOT)
+			&& (tx->modes != ADJ_OFFSET_SINGLESHOT))
+		return -EINVAL;
+
+	/* offset adjustment limited to +/- MAXPHASE */
+	if ((tx->modes != ADJ_OFFSET_SINGLESHOT)
+			&& (tx->modes & ADJ_OFFSET)
+			&& (abs(tx->offset)>= MAXPHASE))
+		return -EINVAL;
+
+	/* tick adjustment limited to 10% */
+	if ((tx->modes & ADJ_TICK)
+			&& ((tx->tick < 900000/USER_HZ)
+				||(tx->tick > 11000000/USER_HZ)))
+		return -EINVAL;
+
+	/* dbg output XXX - yank me! */
+	if(tx->modes) {
+		printk("adjtimex: tx->offset: %ld    tx->freq: %ld\n",
+				tx->offset, tx->freq);
+	}
+
+/*=[Kernel input bits]==========================*/
+	write_seqlock_irq(&ntp_lock);
+
+	result = ntp_state;
+
+	/* For ADJ_OFFSET_SINGLESHOT we must return the old offset */
+	save_offset = shiftR(ntp_offset, SHIFT_UPDATE);
+
+	/* Process input parameters */
+	if (tx->modes & ADJ_STATUS) {
+		ntp_status &=  STA_RONLY;
+		ntp_status |= tx->status & ~STA_RONLY;
+	}
+
+	if (tx->modes & ADJ_FREQUENCY)
+		ntp_freq = tx->freq;
+
+	if (tx->modes & ADJ_MAXERROR)
+		ntp_maxerror = tx->maxerror;
+
+	if (tx->modes & ADJ_ESTERROR)
+		ntp_esterror = tx->esterror;
+
+	if (tx->modes & ADJ_TIMECONST)
+		ntp_constant = tx->constant;
+
+	if (tx->modes & ADJ_OFFSET) {
+		if (tx->modes == ADJ_OFFSET_SINGLESHOT) {
+			if (tx->offset < 0)
+				singleshot_adj = -MAX_SINGLESHOT_ADJ;
+			else
+				singleshot_adj = MAX_SINGLESHOT_ADJ;
+			/* Calculate single shot offset interval:
+			 *
+			 *	len = offset/(+/-)MAX_SINGESHOT_ADJ
+			 *	len = abs(offset)/MAX_SINGESHOT_ADJ
+			 *	len = (abs(offset)*NSEC_PER_USEC)
+			 *			/(MAX_SINGESHOT_ADJ /NSEC_PER_SEC)
+			 *	len = (abs(offset)*NSEC_PER_USEC*NSEC_PER_SEC)/MAX_SINGESHOT_ADJ
+			 *	len = abs(offset)*NSEC_PER_SEC*(NSEC_PER_USEC/MAX_SINGESHOT_ADJ)
+			 */
+			ss_offset_len = abs(tx->offset);
+			ss_offset_len *= NSEC_PER_SEC * (NSEC_PER_USEC/MAX_SINGLESHOT_ADJ);
+		}
+		/* call hardupdate() */
+		if (ntp_hardupdate(tx->offset, tx->time))
+			result = TIME_ERROR;
+	}
+
+	if (tx->modes & ADJ_TICK) {
+		/* first calculate usec/user_tick offset */
+		tick_adj = (USEC_PER_SEC/USER_HZ) - tx->tick;
+		/* multiply by user_hz to get usec/sec => ppm */
+		tick_adj *= USER_HZ;
+		/* save tx->tick for future calls to adjtimex */
+		ntp_tick = tx->tick;
+	}
+
+	if ((ntp_status & (STA_UNSYNC|STA_CLOCKERR)) != 0 )
+		result = TIME_ERROR;
+
+/*=[Kernel output bits]================================*/
+	/* write kernel state to user timex values*/
+	if ((tx->modes & ADJ_OFFSET_SINGLESHOT) == ADJ_OFFSET_SINGLESHOT)
+		tx->offset = save_offset;
+	else
+		tx->offset = shiftR(ntp_offset, SHIFT_UPDATE);
+
+	tx->freq = ntp_freq;
+	tx->maxerror = ntp_maxerror;
+	tx->esterror = ntp_esterror;
+	tx->status = ntp_status;
+	tx->constant = ntp_constant;
+	tx->precision = ntp_precision;
+	tx->tolerance = ntp_tolerance;
+
+	/* PPS is not implemented, so these are zero */
+	tx->ppsfreq	= /*XXX - Not Implemented!*/ 0;
+	tx->jitter	= /*XXX - Not Implemented!*/ 0;
+	tx->shift	= /*XXX - Not Implemented!*/ 0;
+	tx->stabil	= /*XXX - Not Implemented!*/ 0;
+	tx->jitcnt	= /*XXX - Not Implemented!*/ 0;
+	tx->calcnt	= /*XXX - Not Implemented!*/ 0;
+	tx->errcnt	= /*XXX - Not Implemented!*/ 0;
+	tx->stbcnt	= /*XXX - Not Implemented!*/ 0;
+
+	write_sequnlock_irq(&ntp_lock);
+
+	return result;
+}
+
+
+/* void ntp_leapsecond(struct timespec now):
+ *	NTP Leapsecnod processing code. Returns the number of
+ *	seconds (-1, 0, or 1) that should be added to the current
+ *	time to properly adjust for leapseconds.
+ */
+int ntp_leapsecond(struct timespec now)
+{
+	/*
+	 * Leap second processing. If in leap-insert state at
+	 * the end of the day, the system clock is set back one
+	 * second; if in leap-delete state, the system clock is
+	 * set ahead one second. The microtime() routine or
+	 * external clock driver will insure that reported time
+	 * is always monotonic. The ugly divides should be
+	 * replaced.
+	 */
+	static time_t leaptime = 0;
+
+	switch (ntp_state) {
+	case TIME_OK:
+		if (ntp_status & STA_INS) {
+			ntp_state = TIME_INS;
+			/* calculate end of today (23:59:59)*/
+			leaptime = now.tv_sec + SEC_PER_DAY - (now.tv_sec % SEC_PER_DAY) - 1;
+		}
+		else if (ntp_status & STA_DEL) {
+			ntp_state = TIME_DEL;
+			/* calculate end of today (23:59:59)*/
+			leaptime = now.tv_sec + SEC_PER_DAY - (now.tv_sec % SEC_PER_DAY) - 1;
+		}
+		break;
+
+	case TIME_INS:
+		/* Once we are at (or past) leaptime, insert the second */
+		if (now.tv_sec > leaptime) {
+			ntp_state = TIME_OOP;
+			printk(KERN_NOTICE "Clock: inserting leap second 23:59:60 UTC\n");
+
+			return -1;
+		}
+		break;
+
+	case TIME_DEL:
+		/* Once we are at (or past) leaptime, delete the second */
+		if (now.tv_sec >= leaptime) {
+			ntp_state = TIME_WAIT;
+			printk(KERN_NOTICE "Clock: deleting leap second 23:59:59 UTC\n");
+
+			return 1;
+		}
+		break;
+
+	case TIME_OOP:
+		/*  Wait for the end of the leap second*/
+		if (now.tv_sec > (leaptime + 1))
+			ntp_state = TIME_WAIT;
+		break;
+
+	case TIME_WAIT:
+		if (!(ntp_status & (STA_INS | STA_DEL)))
+			ntp_state = TIME_OK;
+	}
+
+	return 0;
+}
+
+/* void ntp_clear(void):
+ *	Clears the NTP state machine.
+ */
+void ntp_clear(void)
+{
+	write_seqlock_irq(&ntp_lock);
+
+	/* clear everything */
+	ntp_status |= STA_UNSYNC;
+	ntp_maxerror = NTP_PHASE_LIMIT;
+	ntp_esterror = NTP_PHASE_LIMIT;
+	ss_offset_len=0;
+	singleshot_adj=0;
+	tick_adj=0;
+	offset_adj =0;
+
+	write_sequnlock_irq(&ntp_lock);
+
+}
+
+/* int get_ntp_status(void):
+ *  Returns the NTP status.
+ */
+int get_ntp_status(void)
+{
+	return ntp_status;
+}
+
diff -Nru a/kernel/time.c b/kernel/time.c
--- a/kernel/time.c	2004-09-02 13:27:00 -07:00
+++ b/kernel/time.c	2004-09-02 13:27:00 -07:00
@@ -33,6 +33,7 @@
 #include <linux/smp_lock.h>
 #include <asm/uaccess.h>
 #include <asm/unistd.h>
+#include <linux/timeofday.h>
 
 /* 
  * The timezone where the local system is located.  Used as a default by some
@@ -210,6 +211,7 @@
 /* adjtimex mainly allows reading (and writing, if superuser) of
  * kernel time-keeping variables. used by xntpd.
  */
+#ifndef CONFIG_NEWTOD
 int do_adjtimex(struct timex *txc)
 {
         long ltemp, mtemp, save_adjust;
@@ -392,6 +394,7 @@
 	do_gettimeofday(&txc->time);
 	return(result);
 }
+#endif
 
 asmlinkage long sys_adjtimex(struct timex __user *txc_p)
 {
diff -Nru a/kernel/timeofday.c b/kernel/timeofday.c
--- /dev/null	Wed Dec 31 16:00:00 196900
+++ b/kernel/timeofday.c	2004-09-02 13:27:00 -07:00
@@ -0,0 +1,330 @@
+/*********************************************************************
+*	linux/kernel/timeofday.c
+*
+*	Copyright (C) 2003, 2004 IBM, John Stultz (johnstul@us.ibm.com)
+*
+*  This file contains the functions which access and manage
+*	the system's time of day functionality.
+*
+* 	Revision History:
+*		2004-09-02:	A0
+*			o First pass sent to lkml for review.
+*
+* TODO List:
+*		o More testing
+**********************************************************************/
+
+#include <linux/timeofday.h>
+#include <linux/timesource.h>
+#include <linux/ntp.h>
+#include <linux/timex.h>
+
+/*XXX - remove later */
+#define TIME_DBG 1
+#define TIME_DBG_FREQ 120000
+
+/*[Nanosecond based variables]----------------
+ * system_time:
+ *		Monotonically increasing counter of the number of nanoseconds
+ *		since boot.
+ * wall_time_offset:
+ *		Offset added to system_time to provide accurate time-of-day
+ */
+static nsec_t system_time;
+static nsec_t wall_time_offset;
+
+
+/*[Cycle based variables]----------------
+ * offset_base:
+ *		Value of the timesource at the last clock_interrupt_hook()
+ *		(adjusted only minorly to account for rounded off cycles)
+ */
+static cycle_t offset_base;
+
+/*[Time source data]-------------------
+ * timesource
+ *		current timesource pointer (initialized to timesource_jiffies)
+ * next_timesource:
+ *		pointer to the timesource that will be installed at the next hook
+ */
+extern struct timesource_t timesource_jiffies;
+static struct timesource_t *timesource = &timesource_jiffies;
+static struct timesource_t *next_timesource;
+
+/*[Locks]----------------------------
+ * system_time_lock:
+ *		generic lock for all locally scoped time values
+ */
+static seqlock_t system_time_lock = SEQLOCK_UNLOCKED;
+
+
+/* [XXX - Hacks]--------------------
+ *			Makes stuff compile
+ */
+extern unsigned long get_cmos_time(void);
+extern void sync_persistant_clock(struct timespec ts);
+
+/* get_lowres_timestamp():
+ *		Returns a low res timestamp.
+ *		(ie: the value of system_time as  calculated at
+ *			the last invocation of clock_interrupt_hook() )
+ */
+nsec_t get_lowres_timestamp(void)
+{
+	nsec_t ret;
+	unsigned long seq;
+	do {
+		seq = read_seqbegin(&system_time_lock);
+
+		/* quickly grab system_time*/
+		ret = system_time;
+
+	} while (read_seqretry(&system_time_lock, seq));
+
+	return ret;
+}
+
+/* get_lowres_timeofday():
+ *		Returns a low res time of day, as calculated at the
+ *		last invocation of clock_interrupt_hook()
+ */
+nsec_t get_lowres_timeofday(void)
+{
+	nsec_t ret;
+	unsigned long seq;
+	do {
+		seq = read_seqbegin(&system_time_lock);
+
+		/* quickly calculate low-res time of day */
+		ret = system_time + wall_time_offset;
+
+	} while (read_seqretry(&system_time_lock, seq));
+
+	return ret;
+}
+
+
+/* __monotonic_clock():
+ *		private function, must hold system_time_lock lock when being
+ *		called. Returns the monotonically increasing number of
+ *		nanoseconds	since the system booted (adjusted by NTP scaling)
+ */
+static nsec_t __monotonic_clock(void)
+{
+	nsec_t ret, ns_offset;
+	cycle_t now, delta;
+
+	/* read timesource */
+	now = timesource->read();
+
+	/* calculate the delta since the last clock_interrupt */
+	delta = timesource->delta(now, offset_base);
+
+	/* convert to nanoseconds */
+	ns_offset = timesource->cyc2ns(delta, 0);
+
+	/* apply the NTP scaling */
+	ns_offset = ntp_scale(ns_offset);
+
+	/* add result to system time */
+	ret = system_time + ns_offset;
+
+	return ret;
+}
+
+
+/* do_monotonic_clock():
+ *		Returns the monotonically increasing number of nanoseconds
+ *		since the system booted via __monotonic_clock()
+ */
+nsec_t do_monotonic_clock(void)
+{
+	nsec_t ret;
+	unsigned long seq;
+
+	/* atomically read __monotonic_clock() */
+	do {
+		seq = read_seqbegin(&system_time_lock);
+
+		ret = __monotonic_clock();
+
+	} while (read_seqretry(&system_time_lock, seq));
+
+	return ret;
+}
+
+
+/* do_gettimeofday():
+ *		Returns the time of day
+ */
+void do_gettimeofday(struct timeval *tv)
+{
+	nsec_t wall, sys;
+	unsigned long seq;
+
+	/* atomically read wall and sys time */
+	do {
+		seq = read_seqbegin(&system_time_lock);
+
+		wall = wall_time_offset;
+		sys = __monotonic_clock();
+
+	} while (read_seqretry(&system_time_lock, seq));
+
+	/* add them and convert to timeval */
+	*tv = ns2timeval(wall+sys);
+}
+
+
+/* do_settimeofday():
+ *		Sets the time of day
+ */
+int do_settimeofday(struct timespec *tv)
+{
+	/* convert timespec to ns */
+	nsec_t newtime = timespec2ns(tv);
+
+	/* atomically adjust wall_time_offset to the desired value */
+	write_seqlock_irq(&system_time_lock);
+
+	wall_time_offset = newtime - __monotonic_clock();
+
+	/* clear NTP settings */
+	ntp_clear();
+
+	write_sequnlock_irq(&system_time_lock);
+
+	return 0;
+}
+
+/* do_adjtimex:
+ *		Userspace NTP daemon's interface to the kernel NTP variables
+ */
+int do_adjtimex(struct timex *tx)
+{
+	do_gettimeofday(&tx->time); /* set timex->time*/
+								/* Note: We set tx->time first, */
+								/* because ntp_adjtimex uses it */
+
+	return ntp_adjtimex(tx);			/* call out to NTP code */
+}
+
+
+/* timeofday_interrupt_hook:
+ *		calculates the delta since the last interrupt,
+ *		updates system time and clears the offset.
+ *		likely called by timer_interrupt()
+ */
+void timeofday_interrupt_hook(void)
+{
+	cycle_t now, delta, remainder;
+	nsec_t ns, ntp_ns;
+	long leapsecond;
+
+	write_seqlock(&system_time_lock);
+
+	/* read time source */
+	now = timesource->read();
+
+	/* calculate cycle delta */
+	delta = timesource->delta(now, offset_base);
+
+	/* convert cycles to ns  and save remainder */
+	ns = timesource->cyc2ns(delta, &remainder);
+
+	/* apply NTP scaling factor for this tick */
+	ntp_ns = ntp_scale(ns);
+
+#if TIME_DBG /* XXX - remove later*/
+{
+	static int dbg=0;
+	if(!(dbg++%TIME_DBG_FREQ)){
+		printk("now: %lluc - then: %lluc = delta: %lluc -> %llu ns + %llu cyc (ntp: %lluc)\n",
+			now, offset_base, delta, ns, remainder, ntp_ns);
+	}
+}
+#endif
+	/* update system_time */
+	system_time += ntp_ns;
+
+	/* reset the offset_base */
+	offset_base = now;
+
+	/* subtract remainder to account for rounded off cycles */
+	offset_base = timesource->delta(offset_base,remainder);
+
+	/* advance the ntp state machine by ns*/
+	ntp_advance(ns);
+
+	/* do ntp leap second processing*/
+	leapsecond = ntp_leapsecond(ns2timespec(system_time+wall_time_offset));
+	if (leapsecond == 1) /* XXX could this be cleaner? */
+		wall_time_offset += 1;
+	if (leapsecond == -1)
+		wall_time_offset -= 1;
+
+	/* sync the persistant clock */
+	if (!(get_ntp_status() & STA_UNSYNC))
+		sync_persistant_clock(ns2timespec(system_time + wall_time_offset));
+
+	/* if necessary, switch timesources */
+	if (next_timesource) {
+		/* immediately set new offset_base */
+		offset_base = next_timesource->read();
+		/* swap timesources */
+		timesource = next_timesource;
+		next_timesource = 0;
+
+		printk(KERN_INFO "Time: %s timesource has been installed\n",
+					timesource->name);
+	}
+
+
+	/* update legacy time values */
+	write_seqlock(&xtime_lock);
+	xtime = ns2timespec(system_time + wall_time_offset);
+	wall_to_monotonic = ns2timespec(wall_time_offset);
+	wall_to_monotonic.tv_sec = -wall_to_monotonic.tv_sec;
+	wall_to_monotonic.tv_nsec = -wall_to_monotonic.tv_nsec;
+	write_sequnlock(&xtime_lock);
+
+	write_sequnlock(&system_time_lock);
+}
+
+/* register_timesource():
+ *		Used to install a new timesource
+ */
+void register_timesource(struct timesource_t* t)
+{
+	write_seqlock(&system_time_lock);
+
+	/* XXX - check override */
+
+	/* if next_timesource has been set, make sure we beat that one too */
+	if (next_timesource && (t->priority > next_timesource->priority))
+		next_timesource = t;
+	else if(t->priority > timesource->priority)
+		next_timesource = t;
+
+	write_sequnlock(&system_time_lock);
+}
+
+
+/* timeofday_init():
+ *		Initializes time variables
+ */
+void timeofday_init(void)
+{
+	write_seqlock(&system_time_lock);
+
+	/* clear and initialize offsets*/
+	offset_base = timesource->read();
+	wall_time_offset = ((u64)get_cmos_time()) * NSEC_PER_SEC;
+
+	/* clear NTP scaling factor*/
+	ntp_clear();
+
+	write_sequnlock(&system_time_lock);
+
+	return;
+}
diff -Nru a/kernel/timer.c b/kernel/timer.c
--- a/kernel/timer.c	2004-09-02 13:27:00 -07:00
+++ b/kernel/timer.c	2004-09-02 13:27:00 -07:00
@@ -559,6 +559,7 @@
 int tickadj = 500/HZ ? : 1;		/* microsecs */
 
 
+#ifndef CONFIG_NEWTOD
 /*
  * phase-lock loop variables
  */
@@ -786,6 +787,9 @@
 	    second_overflow();
 	}
 }
+#else /* CONFIG_NEWTOD */
+#define update_wall_time(x)
+#endif /* CONFIG_NEWTOD */
 
 static inline void do_process_times(struct task_struct *p,
 	unsigned long user, unsigned long system)



^ permalink raw reply	[flat|nested] 60+ messages in thread

end of thread, other threads:[~2004-09-16  7:14 UTC | newest]

Thread overview: 60+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <413850B9.15119.BA95FD@rkdvmks1.ngate.uni-regensburg.de>
     [not found] ` <1094224071.431.7758.camel@cube>
2004-09-06  6:08   ` [RFC][PATCH] new timeofday core subsystem (v.A0) Ulrich Windl
2004-09-12 17:11     ` Albert Cahalan
2004-09-02 21:07 [RFC] New Time of day proposal (updated 9/2/04) john stultz
2004-09-02 21:09 ` [RFC][PATCH] new timeofday core subsystem (v.A0) john stultz
2004-09-02 22:19   ` Christoph Lameter
2004-09-02 22:28     ` john stultz
2004-09-02 22:42       ` Christoph Lameter
2004-09-02 23:14         ` john stultz
2004-09-02 23:39           ` Christoph Lameter
2004-09-03  0:07             ` john stultz
2004-09-03  0:47               ` Christoph Lameter
2004-09-03  1:30                 ` john stultz
2004-09-03  7:43                   ` George Anzinger
2004-09-03 19:32                     ` john stultz
2004-09-03 16:18                   ` Christoph Lameter
2004-09-03 21:00                     ` john stultz
2004-09-03 22:04                       ` Christoph Lameter
2004-09-03 23:00                         ` john stultz
2004-09-04  0:11                           ` Christoph Lameter
2004-09-03  1:39   ` George Anzinger
2004-09-03  1:58     ` john stultz
2004-09-03  6:42     ` Albert Cahalan
2004-09-03  7:24       ` George Anzinger
2004-09-03 19:27         ` john stultz
2004-09-03 22:10           ` George Anzinger
2004-09-03 23:32             ` john stultz
2004-09-04  0:02               ` George Anzinger
2004-09-08 18:07                 ` john stultz
2004-09-09  0:08                   ` George Anzinger
2004-09-09  0:51                     ` john stultz
2004-09-09  3:14                       ` Christoph Lameter
2004-09-09  3:32                         ` john stultz
2004-09-09  4:31                           ` George Anzinger
2004-09-09  6:37                             ` Jesse Barnes
2004-09-09  8:09                               ` George Anzinger
2004-09-09 19:07                             ` john stultz
2004-09-09 20:49                               ` George Anzinger
2004-09-13 21:29                                 ` Christoph Lameter
2004-09-13 22:25                                   ` john stultz
2004-09-13 22:45                                     ` Christoph Lameter
2004-09-14  6:53                                       ` Ulrich Windl
2004-09-14 17:49                                     ` Christoph Lameter
2004-09-15  0:57                                       ` George Anzinger
2004-09-15  3:32                                         ` Christoph Lameter
2004-09-15  8:04                                           ` George Anzinger
2004-09-15  8:54                                             ` Dominik Brodowski
2004-09-15 17:54                                               ` George Anzinger
2004-09-15  9:12                                             ` Andi Kleen
2004-09-15 15:46                                             ` Christoph Lameter
2004-09-15 18:00                                               ` George Anzinger
2004-09-15 18:28                                                 ` Christoph Lameter
2004-09-15  6:46                                         ` Christoph Lameter
2004-09-15 16:32                                           ` john stultz
2004-09-15 16:46                                             ` Christoph Lameter
2004-09-15 17:13                                               ` john stultz
2004-09-15 17:30                                                 ` Christoph Lameter
2004-09-15 18:48                                                   ` john stultz
2004-09-15 19:58                                                     ` George Anzinger
2004-09-15 20:20                                                     ` Christoph Lameter
2004-09-16  7:02                                                     ` Ulrich Windl
2004-09-03 19:18       ` john stultz

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).