* [Xenomai] RFC: slow tsc optimization
@ 2012-09-06  7:24 Gilles Chanteperdrix
  2012-09-07 11:29 ` Gilles Chanteperdrix
  0 siblings, 1 reply; 3+ messages in thread
From: Gilles Chanteperdrix @ 2012-09-06  7:24 UTC (permalink / raw)
  To: Xenomai


Hi,

For the last few days, I have been working on getting the "rdtsc"
instruction replaced with a call to a tsc emulation function dynamically
at run time. It turned out to be easy with the Linux "alternative"
mechanism, since it implements instruction replacements based on CPU
capabilities, and the TSC is such a capability. This modification makes
it possible to compile a Xenomai-enabled kernel that will run on any
x86_32 platform.
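
As an illustration, here is a minimal sketch of what such a patched
reader could look like, using the ALTERNATIVE() inline-asm macro; the
emulation helper name and the register handling are assumptions of mine,
not the actual I-pipe code:

/* By default we call a (hypothetical) emulation helper; the alternatives
 * framework patches the call into a plain rdtsc at boot when the CPU
 * advertises X86_FEATURE_TSC. */
#include <asm/alternative.h>
#include <asm/cpufeature.h>

unsigned long long ipipe_emulated_tsc(void);   /* slow PIT-based path */

static inline unsigned long long ipipe_read_tsc(void)
{
        unsigned long long t;

        asm volatile(ALTERNATIVE("call ipipe_emulated_tsc", "rdtsc",
                                 X86_FEATURE_TSC)
                     : "=A" (t)                 /* edx:eax on x86_32 */
                     : : "ecx", "memory");
        return t;
}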

Now, when running a kernel on a TSC-less CPU with the PIT-based tsc
emulation, I found out something pretty obvious: this PIT-based emulation
is slow, it takes about 4us every time we call it. And the nucleus reads
the tsc a number of times between the timer interrupt and the wake-up of
the latency user-space task:
- at the very beginning of the timer interrupt
- after the execution of the latency thread timer
- in the timer programming function, to compute the timer delay
- in the middle of the context switch, if the "statistics collection"
feature is enabled
- in the xnpod_wait_thread_periodic function, after the context switch,
in order to compute the number of timer overruns.

That is five reads, about 20us in total, and the thread is not yet
running in user-space.

So, I have been thinking about reducing the number of calls to the PIT.
Unfortunately, keeping the last tsc value around in the nucleus and
reusing it is a bit heavy, and implies modifications which are completely
useless for the non-PIT case (which should be the vast majority). On the
other hand, the tsc emulation code already has to keep the last read
value, since it is required to extend clocksources narrower than 64 bits
to a 64-bit value. So I propose the following approach:

The I-pipe core will provide two tsc reading functions:
- ipipe_read_tsc, which reads the counter;
- ipipe_read_tsc_fast, which reads the tsc if the CPU has one, or returns
the last value read if the tsc is emulated.

This would result in lighter modifications of the nucleus.
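
To make the caching idea concrete, here is a minimal sketch of what the
emulation side could look like, assuming a 16-bit up-counting hardware
counter and a hypothetical read_pit_counter() helper, and ignoring the
irq protection the real code would need:

#include <linux/types.h>

static u64 ipipe_tsc_cache;          /* last extended value */

extern u16 read_pit_counter(void);   /* hypothetical hardware read, ~4us */

u64 ipipe_read_tsc(void)             /* slow path: touches the hardware */
{
        u16 now = read_pit_counter();
        u16 last = (u16)ipipe_tsc_cache;

        /* Extend the 16-bit counter to 64 bits: the unsigned
         * subtraction yields the elapsed ticks even across a wrap. */
        ipipe_tsc_cache += (u16)(now - last);
        return ipipe_tsc_cache;
}

u64 ipipe_read_tsc_fast(void)        /* fast path: no hardware access */
{
        /* Return the value cached by the last real read; its accuracy
         * depends on how recently that read happened. */
        return ipipe_tsc_cache;
}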

Or is this problem in fact not worth addressing, because nobody uses
Xenomai without a tsc?

Regards.

-- 
                                                                Gilles.



* Re: [Xenomai] RFC: slow tsc optimization
  2012-09-06  7:24 [Xenomai] RFC: slow tsc optimization Gilles Chanteperdrix
@ 2012-09-07 11:29 ` Gilles Chanteperdrix
  2012-09-08 19:30   ` Gilles Chanteperdrix
  0 siblings, 1 reply; 3+ messages in thread
From: Gilles Chanteperdrix @ 2012-09-07 11:29 UTC (permalink / raw)
  To: Xenomai

On 09/06/2012 09:24 AM, Gilles Chanteperdrix wrote:

> [...]


An update on this work. In fact, the "fast" read should be the most
common operation, and actually reading the PIT counter should not. The
slow read should be done at a few critical points, so that the cached
value stays reasonably accurate. So, in the end I implemented:
- ipipe_read_tsc, which returns the last tsc value, and is replaced with
rdtsc when available;
- ipipe_read_slow_tsc, which reads the emulated tsc, and is also replaced
with rdtsc when available;
- ipipe_touch_tsc, which reads the emulated tsc, but is replaced with a
nop when rdtsc is available.
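
The only variant that really differs from the sketch in the first
message above is the "touch" helper; something along these lines (again
illustrative only, ipipe_emulated_tsc is a hypothetical name):

static inline void ipipe_touch_tsc(void)
{
        /* Refresh the cached tsc on TSC-less CPUs; when the CPU has
         * X86_FEATURE_TSC, the call is patched out (replaced by nops). */
        asm volatile(ALTERNATIVE("call ipipe_emulated_tsc", "",
                                 X86_FEATURE_TSC)
                     : : : "eax", "ecx", "edx", "memory");
}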

Now, the real remaining problem is where to use
ipipe_touch_tsc/ipipe_read_slow_tsc, so as to get reasonable accuracy
without reading the hardware counter too often.

I implemented the following approach: the tsc is read at every entry
point of the nucleus, that is, interrupts, Xenomai syscalls, and events
for Xenomai tasks. We also need to re-read the tsc before programming the
next shot, in order to avoid programming delays that are too long (with a
restart of xntimer_tick_aperiodic if we find out that the delay is
already too short, instead of going through another irq). All in all,
these are fairly lightweight modifications, and the latency test looks
reasonable, even on a kernel with statistics collection enabled. I
suspect the statistics are a bit off, but at least they are there.
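
In pseudo-code, the shot programming could look roughly like this (just
a sketch of the idea, with hypothetical helpers; the real logic sits
around xntimer_tick_aperiodic):

for (;;) {
        xnticks_t now = ipipe_read_slow_tsc();   /* refresh the cache */
        xnsticks_t delay = next_shot - now;

        if (delay > 0) {
                program_next_shot(delay);        /* hypothetical helper */
                break;
        }
        /* The deadline already passed while we were reading the tsc:
         * run the expired timers again rather than programming a
         * too-short delay and eating another interrupt. */
        next_shot = run_expired_timers(now);     /* hypothetical helper */
}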

Since we read the tsc twice per interrupt, and each read takes 4us, the
minimum latency is around 8us. I thought about including the tsc read
latency (twice) in the nktimerlat value, but this results in negative
latencies, and anyway we should leave the user the choice to do that
through /proc/xenomai/latency.

Now the remaining issues are:
- kernel-space code. We can trap insmod/rmmod in losyscall, but if an
RTDM driver ioctl method takes a long time to execute, or a kernel-space
thread runs long tasks before calling Xenomai services, it may use stale
clock data;
- the time of a syscall is always at least 4us. That is a bit stupid
when, say, you just want to lock a mutex: we read the tsc, lock the
mutex, then return to user-space. Working around this seems complicated.
We could for instance add a "NOTSC" syscall flag to indicate that the tsc
should not be read before a syscall callback (see the sketch below), but
correctly modifying the syscall tables to add this flag to the proper
syscalls is probably not so easy. For instance, when statistics
collection is enabled, we want to read the tsc before locking the mutex,
since if a context switch happens, we will need the value to update the
statistics.
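
Just to illustrate the flag idea (all names below are hypothetical, this
is not a proposed patch):

/* Hypothetical per-syscall flag: the handler does not need a fresh tsc. */
#define __xn_exec_notsc   0x1000

struct sample_sysent {
        int (*svc)(void);
        unsigned long flags;
};

static void dispatch(struct sample_sysent *entry)
{
        if (!(entry->flags & __xn_exec_notsc))
                ipipe_touch_tsc();   /* refresh the cached tsc first */
        entry->svc();
}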

-- 
                                                                Gilles.



* Re: [Xenomai] RFC: slow tsc optimization
  2012-09-07 11:29 ` Gilles Chanteperdrix
@ 2012-09-08 19:30   ` Gilles Chanteperdrix
  0 siblings, 0 replies; 3+ messages in thread
From: Gilles Chanteperdrix @ 2012-09-08 19:30 UTC (permalink / raw)
  To: Xenomai

On 09/07/2012 01:29 PM, Gilles Chanteperdrix wrote:

> On 09/06/2012 09:24 AM, Gilles Chanteperdrix wrote:
> 
>>
>> Hi,
>>
>> The last few days, I have been working on getting the "rdtsc"
>> instruction replaced with a call to a tsc emulation function dynamically
>> at run time. It turned out to be easy with the Linux "alternative"
>> mechanism, since it implements replacements based on CPU capabilities,
>> and the TSC is such a capability. This modification allows to compile a
>> kernel with Xenomai that will run on any x86_32 platform.
>>
>> Now, when running kernels without tsc using the PIT based tsc emulation,
>> I found out something pretty obvious, this PIT based tsc emulation is
>> slow, it takes 4us every time we call it. And the nucleus reads the tsc
>> a number of times when a timer interrupt happens up to the wake up of
>> the latency user-space task:
>> - at the very beginning of the timer interrupt
>> - after the execution of the latency thread timer
>> - in the timer programming function, to compute the timer delay
>> - in the middle of the context switch, if the "statistics collection"
>> feature is enabled
>> - in the xnpod_wait_thread_periodic function, after the context switch,
>> in order to compute the number of timer overruns.
>>
>> That is 20us, and the thread is not yet running in user-space.
>>
>> So, I have been thinking about reducing the number of calls to the PIT,
>> unfortunately keeping the last tsc value around and reusing it is a bit
>> heavy, and implies modifications which are completely useless for the
>> non PIT case (which should be the vast majority), and in fact, the tsc
>> emulation code has to keep the last read value, since it is required to
>> convert clocksources with less than 64 bits to a 64 bits value. So, I
>> propose the following approach:
>>
>> The I-pipe core will provide two tsc reading functions:
>> ipipe_read_tsc which reads the counter
>> ipipe_read_tsc_fast which will read the tsc if the cpu has a tsc, or
>> return the last value read if the tsc is emulated.
> 
> 
> An update on this work. In fact the "read_tsc_fast" should be the most
> common operation, and really reading the PIT counter is not. And reading
> the slow tsc should be made at some critical points so that the fast_tsc
> is reasonably accurate. So, in fact I implemented:
> ipipe_read_tsc, which returns the last tsc value, and is replaced with
> rdtsc when available
> ipipe_read_slow_tsc, which reads the emulated tsc, and is also replaced
> with rdtsc when available,
> ipipe_touch_tsc, which reads the emulated tsc, but is replaced with a
> nop when rdtsc is available.
> 
> Now, the real remaining problem is where to use
> ipipe_touch_tsc/ipipe_read_slow_tsc, to have a rasonable accuracy, but
> not read the hardware tsc too often.
> 
> I implemented the following approach: the tsc is read at every entry
> point of the nucleus, that is: interrupts, xenomai syscalls, events for
> xenomai tasks. We also need to reread the tsc before programming the
> next shot, in order to avoid programming too long delays (with a restart
> of xntimer_tick_aperiodic if we find out that the delay is too short,
> instead of going through another irq). All in all, these are fairly
> lightweight modifications, and the latency test seems reasonable. Even
> on a kernel with statistics collection enabled. I suspect the statistics
> are a bit off, but at least they are there.
> 
> Since we read the tsc twice per interrupt, and reading it takes 4us, the
> minimum latency is around 8us, I thought about including the tsc latency
> (twice) into the nktimerlat latency, but this results in negative
> latencies, and anyway, we should leave the choice to the user to do that
> with /proc/xenomai/latency if he wants.
> 
> Now the remaining issues are:
> - kernel-space code. We can trap insmod/rmmod in losyscall, but if an
> RTDM driver ioctl method takes a long time to execute, or when a
> kernel-space thread runs long tasks before calling xenomai services, it
> may use old clock data
> - the time of a syscall is always at least 4us. That is a bit stupid
> when, say, for instance you want to lock a mutex, to read the tsc, lock
> the mutex, then return to user space. Working this around seems
> complicated. We could for instance add a "NOTSC" syscall flag to
> indicate that the tsc should not be read before a syscall callback, but
> modifying correctly the syscall tables to add this flag to the proper
> syscalls is probably not so easy. For instance, when statistics
> collection is enabled, we want to read the tsc before locking the mutex,
> since if there is a context switch, we will need the value for updating
> the statistics.
> 


Some benchmarks on an Atom. In the second run, "pit, one read", we do
not re-read the emulated tsc before programming the timer; this avoids
losing 4us, at the expense of the precision of the timer tick.

http://sisyphus.hd.free.fr/~gilles/core-3.4-latencies/atom2.png

-- 
                                                                Gilles.


