From: Gilles Chanteperdrix <gilles.chanteperdrix@xenomai.org>
To: Nero Fernandez <grimlynch@domain.hid>
Cc: xenomai@xenomai.org
Subject: Re: [Xenomai-core] co-kernel benchmarking on arm926
Date: Mon, 28 Jun 2010 23:50:41 +0200
Message-ID: <4C291931.7010402@domain.hid>
In-Reply-To: <AANLkTilOt6PsWdCc7faKOBDy1MdCZp8Lgib1d5Dzy3cz@domain.hid>

Nero Fernandez wrote:
> On Fri, Jun 25, 2010 at 8:30 PM, Philippe Gerum <rpm@xenomai.org> wrote:
> 
>> On Thu, 2010-06-24 at 17:05 +0530, Nero Fernandez wrote:
>>> Thanks for your response, Philippe.
>>>
>>> The concerns while carrying out my experiments were to:
>>>
>>>  - compare Xenomai co-kernel overheads (timer and context-switch
>>>    latencies) in Xenomai space vs similar native-Linux overheads.
>>>    These are presented in the first two sheets.
>>>
>>>  - find out how the addition of Xenomai and Xenomai+Adeos affects
>>>    the native kernel's performance. Here, lmbench was used on the
>>>    native Linux side to estimate the changes to standard Linux
>>>    services.
>> How can you reasonably estimate the overhead of co-kernel services
>> without running any co-kernel services? Interrupt pipelining is not a
>> co-kernel service. You do nothing with interrupt pipelining except
>> enable co-kernel services to be implemented with a real-time response
>> guarantee.
>>
> 
> Repeating myself, sheets 1 and 2 contain the results of running
> co-kernel services (real-time pthreads, message queues, semaphores
> and clock_nanosleep) and taking measurements of the scheduling and
> timer-base functionality provided by the co-kernel via the POSIX skin.
> 
> The same code was then built against native POSIX instead of the
> Xenomai POSIX skin, and similar measurements were taken for the Linux
> scheduler and timer base. This is something I cannot do with Xenomai's
> native-skin tests (i.e., use them for native Linux benchmarking).
> The point here is to demonstrate what kind of benefits may be drawn
> from Xenomai space without any code change.
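
For reference, here is a minimal sketch of the kind of measurement loop
described above (the period, loop count, priority and names are
illustrative, not taken from the spreadsheet). Built against the Xenomai
POSIX skin (using the wrapping flags reported by xeno-config), it
exercises the co-kernel scheduler and timers; built against plain glibc
with -lpthread -lrt, the very same code measures the native Linux
scheduler and timer base:

#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define NSEC_PER_SEC 1000000000LL
#define PERIOD_NS    1000000LL	/* 1 ms period -- illustrative */
#define LOOPS        10000	/* illustrative */

static long long ts_to_ns(const struct timespec *ts)
{
	return (long long)ts->tv_sec * NSEC_PER_SEC + ts->tv_nsec;
}

int main(void)
{
	struct sched_param p = { .sched_priority = 99 };
	struct timespec next, now;
	long long lat, max = 0;
	int i;

	/* Become a SCHED_FIFO thread; under the Xenomai POSIX skin
	 * this also puts the thread under the co-kernel scheduler. */
	pthread_setschedparam(pthread_self(), SCHED_FIFO, &p);

	clock_gettime(CLOCK_MONOTONIC, &next);
	for (i = 0; i < LOOPS; i++) {
		next.tv_nsec += PERIOD_NS;
		while (next.tv_nsec >= NSEC_PER_SEC) {
			next.tv_nsec -= NSEC_PER_SEC;
			next.tv_sec++;
		}
		/* Sleep until the absolute deadline... */
		clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);
		/* ...then record how late the wakeup actually was. */
		clock_gettime(CLOCK_MONOTONIC, &now);
		lat = ts_to_ns(&now) - ts_to_ns(&next);
		if (lat > max)
			max = lat;
	}
	printf("max wakeup latency: %lld ns\n", max);
	return 0;
}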
> 
> 
> 
>>> Regarding the addition of latency measurements in the sys-timer
>>> handler, I performed a similar measurement from
>>> xnintr_clock_handler(), and the results were similar to the ones
>>> reported from the sys-timer handler in Xenomai-enabled Linux.
>> If your benchmark is about Xenomai, then at least make sure to provide
>> results for Xenomai services, used in a relevant application and
>> platform context. Pretending that you instrumented
>> xnintr_clock_handler() at some point and got some results, but
>> eventually decided to illustrate your benchmark with other "similar"
>> results obtained from totally unrelated instrumentation code does not
>> help establish the figures as relevant.
>>
>> Btw, hooking xnintr_clock_handler() is not correct. Again, benchmarking
>> interrupt latency with Xenomai has to measure the entire code path, from
>> the moment the interrupt is taken by the CPU, until it is delivered to
>> the Xenomai service user. By instrumenting directly in
>> xnintr_clock_handler(), your test bypasses the Xenomai timer handling
>> code which delivers the timer tick to the user code, and the
>> rescheduling procedure as well, so your figures are optimistically wrong
>> for any normal use case based on real-time tasks.
>>
> 
> Regarding hooking a measurement into the sys-timer itself: it serves
> to observe the changes that Xenomai's aperiodic handling of the system
> timer brings. This measurement does not attempt to measure the
> co-kernel services in any manner.
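
To make concrete what this sys-timer instrumentation does and does not
capture, here is a rough sketch (hypothetical names throughout:
read_tsc() stands in for reading __ipipe_mach_tsc as in the patch, and
the other helpers are placeholders, not actual kernel or patch code):

#include <linux/interrupt.h>

static unsigned long long tsc_at_entry;

static irqreturn_t instrumented_timer_irq(int irq, void *dev_id)
{
	/* Captured here: only the time from the hardware IRQ to
	 * handler entry (stand-in for reading __ipipe_mach_tsc). */
	tsc_at_entry = read_tsc();

	handle_timer_tick();	/* the normal tick processing */

	/* Logging the delta at handler exit still excludes everything
	 * downstream: Xenomai timer dispatch, rescheduling and the
	 * switch to the waiting task, i.e. the path a real-time
	 * application actually waits for -- which is Philippe's point
	 * about the figures being optimistic. */
	log_delta(tsc_at_entry - programmed_expiry());

	return IRQ_HANDLED;
}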
> 
> 
> 
>>> While trying to make both these measurements, I tried to take care
>>> that delay-value logging is done at the end of the handler routines,
>>> but the __ipipe_mach_tsc value is recorded at the beginning of the
>>> routine (a patch for this is included in the worksheet itself).
>> This patch is hopelessly useless and misleading. Unless your intent is
>> to have your application directly embodied into low-level interrupt
>> handlers, you are not measuring the actual overhead.
>>
>> Latency is not solely a matter of interrupt masking, but also a matter
>> of I/D cache misses, particularly on ARM - you have to traverse the
>> actual code until delivery to exhibit the latter.
>>
>> This is exactly what the latency tests shipped with Xenomai are for:
>> - /usr/xenomai/bin/latency -t0/1/2
>> - /usr/xenomai/bin/klatency
>> - /usr/xenomai/bin/irqbench
>>
>> If your system involves user-space tasks, then you should benchmark
>> user-space response time using latency [-t0]. If you plan to use
>> kernel-based tasks such as RTDM tasks, then latency -t1 and klatency
>> tests will provide correct results for your benchmark.
>> If you are interested only in interrupt latency, then latency -t2 will
>> help.
>>
>> If you do think that those tests do not measure what you seem to be
>> interested in, then you may want to explain why on this list, so that we
>> eventually understand what you are after.
>>
>>> Regarding the system, changing the kernel version would invalidate
>>> my results, as the system is a released CE device and there are no
>>> plans to upgrade the kernel.
>> Ok. But that makes your benchmark 100% irrelevant with respect to
>> assessing the real performance of a decent co-kernel on your setup.
>>
>>> AFAIK, enabling FCSE would limit the number of concurrent processes,
>>> hence becoming unviable in my scenario.
>> Ditto. Besides, FCSE as implemented in recent I-pipe patches has a
>> best-effort mode which lifts those limitations, at the expense of
>> voiding the latency guarantee, but on average that would still be
>> much better than always suffering the VIVT cache insanity without
>> FCSE.
>>
> 
> Thanks for mentioning this. I will try to enable this option for
> re-measurements.
> 
> 
>> Quoting a previous mail of yours, regarding your target:
>>> Processor       : ARM926EJ-S rev 5 (v5l)
>> The latency hit induced by VIVT caching on arm926 is typically in the
>> 180-200 us range under load in user-space, and 100-120 us in kernel
>> space. So, without FCSE, this would bite at each Xenomai __and__ linux
>> process context switch. Since your application requires that more than
>> 95 processes be available in the system, you will likely get quite a few
>> switches in any given period of time, unless most of them always sleep,
>> of course.
>>
>> Ok, so let me make some wild guesses here: you told us this is a CE-based
>> application; maybe it exists already? maybe it has to be put on steroids
>> for gaining decent real-time guarantees it doesn't have yet? and perhaps
>> the design of that application involves many processes undergoing
>> periodic activities, so lots of context switches with address space
>> changes during normal operations?
>>
>> And, you want that to run on arm926, with no FCSE, and likely not a huge
>> amount of RAM either, with more than 95 different address spaces? Don't
>> you think there might be a problem? If so, don't you think implementing
>> a benchmark based on those assumptions might be irrelevant at some
>> point?
>>
>>> As far as the adeos patch is concerned, i took a recent one (2.6.32)
>> I guess you meant 2.6.33?
>>
> 
> Correction, 2.6.30.

Ok. If you are interested in the FCSE code, you may want to use FCSE v4.
See the comparison on the hackbench test here:
http://sisyphus.hd.free.fr/~gilles/pub/fcse/hackbench-fcse-v4.png

I did not rebase the I-pipe patch for 2.6.30 on this new FCSE, but you
can find it in the patches for 2.6.31 and 2.6.33, or as standalone trees
in my Adeos git tree:
http://git.xenomai.org/?p=ipipe-gch.git;a=summary

Also note, since we are re-hashing things tonight: as Philippe told
you, 95 processes is actually a lot on a low-end ARM platform, so you
had better be sure that you really need more than 95 processes
(beware, we are talking about processes here, i.e. memory spaces, not
threads; a process may have as many threads as it wants) before
deciding not to use the FCSE guaranteed mode. Thinking that the number
of processes is unlimited on a low-end/embedded ARM system is an error:
it is limited by the available resources (RAM, CPU) on your system. The
lower the resources, the lower the practical limit, and I bet this
practical limit is much lower than you would like.
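
To make the processes-vs-threads distinction concrete, a small
illustrative sketch (not from this thread): all of the workers below
share their parent's address space, so the whole program consumes a
single FCSE translation slot, whereas fork()ing one process per worker
would consume one slot each:

#include <pthread.h>
#include <stdio.h>

#define NR_WORKERS 8	/* illustrative */

static void *worker(void *arg)
{
	/* All workers run in the same memory space: one process, one
	 * FCSE PID, no address-space switch when switching between
	 * them. */
	printf("worker %ld running\n", (long)arg);
	return NULL;
}

int main(void)
{
	pthread_t tid[NR_WORKERS];
	long i;

	for (i = 0; i < NR_WORKERS; i++)
		pthread_create(&tid[i], NULL, worker, (void *)i);
	/* A fork() per worker here would instead create NR_WORKERS
	 * distinct address spaces, each counting against the FCSE
	 * limit. */
	for (i = 0; i < NR_WORKERS; i++)
		pthread_join(tid[i], NULL);
	return 0;
}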

-- 
					    Gilles.


