Date: Fri, 17 May 2019 02:13:41 -0500 (CDT)
From: Per Oberg
Message-ID: <1874557544.12596976.1558077221209.JavaMail.zimbra@wolfram.com>
References: <1823257552.1557117297954.94.KebiMail.daeyoungsong@cnu.ac.kr> <4c39d8ca-3c98-4b35-6b06-2013dc80134b@xenomai.org> <1907767262.9872848.1557212015078.JavaMail.zimbra@wolfram.com>
Subject: Re: Xenomai Mercury and PREEMPT_RT
To: Philippe Gerum
Cc: 송대영, xenomai

----- On 16 May 2019, at 17:51, Philippe Gerum rpm@xenomai.org wrote:

> On 5/7/19 8:53 AM, Per Oberg wrote:
> > ----- On 6 May 2019, at 10:07, xenomai xenomai@xenomai.org wrote:
> >> On 5/6/19 6:34 AM, 송대영 via Xenomai wrote:
> >>> Hello, expert.
> >>> I have a question about Xenomai Mercury and PREEMPT_RT.
> >>> Following "Xenomai 3 – An Overview of the Real-Time Framework for
> >>> Linux", Mercury is based on PREEMPT_RT and only offers API emulation.
> >>> Is Mercury's latency performance better than PREEMPT_RT?
> >> No, it merely uses what native preemption provides for.
> > So, just to be clear: are you saying that it should be "just as good" as
> > PREEMPT_RT when the kernel is fully patched? (Whatever that means...)
> > Now, I don't want to start a flame war here, that would be stupid. But I
> > really want to know your opinion on this (see for example the claims made
> > in [1]). May we assume that the Mercury performance would be quite close
> > to Cobalt performance?
> The so-called "Mercury" layer allows running the Xenomai APIs (e.g.
> alchemy, vxworks, pSOS) on a single-kernel system, i.e. without the
> assistance of any co-kernel like Cobalt. It is a mediating interface
> library which gets the real-time POSIX services it needs to run those
> APIs from the plain glibc, instead of libcobalt. Therefore, this purely
> user-space layer cannot bring more real-time guarantees than the
> underlying native kernel is able to deliver.
> Regarding this presentation at ELCE 2016, one of the many claims was
> that native preemption delivers about the same, if not better, worst-case
> latency figures as a dual kernel configuration like Xenomai/Cobalt does
> on Altera's SoCFPGA Cyclone V. This does not involve Mercury.
> As mentioned in slide #25 of the presentation [1], the benchmark leading
> to this conclusion ran on an Altera SoCFPGA Cyclone V, with two Cortex-A9
> cores. However, there is no mention of the kernel, I-pipe or Xenomai
> releases being tested, which is unfortunate. The test was about
> measuring the response time to an external device interrupt from a
> user-space task, but we don't know what the Xenomai test application
> looked like, nor do we know which driver code was involved in
> performing real-time I/O on the input and output pins operated by
> Xenomai, which is again unfortunate since this is key to timely
> behavior. From an engineering standpoint, when 431,999,999 samples are
> below 65 µs and only 1 sample is above 95 µs over a 12-hour test, it
> seems legitimate to wonder whether some bug might be hiding under the
> sofa.
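> (As a sanity check on that sample count, assuming the 10 kHz sampling
> loop presented in the slides: 12 hours at 10,000 samples/s gives
> 12 × 3600 × 10,000 = 432,000,000 samples, i.e. 431,999,999 + 1.)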
> Also, I'm unsure why CPU isolation (isolcpus=) was omitted from the
> Xenomai test although it is present for the best-performing native
> preemption test. On this particular hardware, which exhibits a
> not-that-snappy outer L2 cache with the write-allocate policy enabled,
> this is an unfortunate choice too. It means that for such a test, the
> native preemption test benefited from hot cache conditions most of the
> time, since the stress load was always running on a separate CPU.
> Conversely, the Xenomai test application was continuously battered by
> costly cache misses as the stress threads kept causing I/D cache line
> evictions, pushing away the real-time code and data each time the
> sampling thread slept waiting for the next measurement period on the
> shared CPU.
> In addition, the 10 kHz sampling loop which is presented may be too fast
> to uncover the actual cost of cache misses. The faster the real-time
> loop, the fewer the opportunities for the non-real-time work to disturb
> the shared environment both run in. Running a slower, 1 kHz loop seems
> more appropriate to lower the odds of cache evictions, at least if one
> is looking for real-world runtime conditions where a system may have to
> execute a significant amount of code concurrently, some of which
> requires low jitter in wake-up times, but not necessarily for pacing
> high-frequency loops.
> Although there is not enough information to exactly reproduce this
> benchmark configuration, we can easily run a simple timer-based test
> scenario with what is at hand, which is the same SoCFPGA hardware and
> the latency test developed by the native preemption team, which Xenomai
> can run too. To confirm whether CPU isolation may have played a role in
> these results, the Xenomai test should run twice, once on an isolated
> CPU, next without this optimization. I'll try to give an exhaustive
> description of the test recipe I used, so that it can be reproduced
> easily in your kitchen:
> * native preemption setup
> - download kernel 5.0.14 in source form, apply the -rt8 patch from [2].
> - enable maximum preemption (CONFIG_PREEMPT_RT_FULL).
> - boot the kernel with "isolcpus=1 threadirqs".
> - check that no threaded IRQ can compete with SCHED_FIFO,98. Normally
>   there should be none of them, but you may want to double-check if you
>   have custom IRQ settings at boot time.
> - switch to a non-serial terminal (ssh, telnet); significant output to
>   a serial device might affect the worst-case latency on some
>   platforms with native preemption because of implementation issues in
>   console drivers, so the console should be kept quiet. You could also
>   add the "quiet" option to the boot args as an additional precaution.
> - for good measure, turn off SCHED_FIFO throttling by setting
>   /proc/sys/kernel/sched_rt_runtime_us to -1. This should definitely
>   not be needed for the kind of test we are about to run, but let's
>   move this away for peace of mind.
> - wake-up events are produced by a timer, so there is no IRQ threading
>   for these; they are fully handled from a so-called hard IRQ context,
>   so there is no IRQ (software) priority issue. Since the TWD timer we
>   use on the Cortex-A9 is a per-core beast, there won't be any CPU
>   affinity issue either: a tick on CPU1 will wake up the sampling
>   thread on CPU1.
> * Xenomai setup
> - git clone the kernel code from [3], which is kernel 4.19.33 including
>   the I-pipe patch.
> - git clone the Xenomai code from [4], which is the base of the
>   upcoming 3.1 release.
> - boot the kernel with "isolcpus=1".
> - run the "autotune" utility to best calibrate the core timer gravity
>   values (more info at [5]). This is not strictly required here though,
>   since the default values are close enough for this SoC.
> Although the base kernel releases are not identical, they are still
> close enough, and experience shows that the figures are fairly stable
> regardless of the kernel release under test, at least with a working
> Xenomai port.
> * for both kernel setups
> - turn off all debug features and tracers in the kernel configuration.
> - ensure that all CPUs keep running at maximum frequency by enabling
>   the "performance" CPU_FREQ governor, or by disabling CPU_FREQ entirely.
> - turn CPU_IDLE on but disable the ARM_CPUIDLE driver, so that the idle
>   state is entered via a basic WFI. This said, you could alternatively
>   enable ARM_CPUIDLE on this particular platform; this should not
>   increase the worst-case latency.
> - disable graphics support to rule out any weird GPU driver issue.
> * stress load
> For both tests, part of the stress load is generated by the 'hackbench'
> program mentioned in the presentation, which is available from [6],
> linked against the plain glibc. Running it with 40 groups is enough to
> bring the system down to a crawl. The command started at the beginning
> of both runs is:
> # while :; do hackbench -g 40 -l 30000 >/dev/null 2>&1; sleep 1; done&
> However, running 'hackbench' is not enough to observe the worst latency
> peaks on most platforms. Since we want to estimate such a worst case,
> let's tickle the dragon's tail by adding a plain simple 'dd' loop in the
> background, reading a large enough bulk of memory from /dev/zero
> repeatedly. In effect, this pounds the memory subsystem badly by
> continuously clearing RAM, putting more pressure on the data caches. The
> command used is:
> # dd if=/dev/zero of=/dev/null bs=32M &
> * test application
> We can use the 'cyclictest' latency measurement code also available from
> [6], linking it against the plain glibc for the native preemption test,
> or against Xenomai's POSIX libcobalt interface for the dual kernel
> setup. The source code of the pre-compiled version of 'cyclictest' for
> Xenomai is available at [7]. The command starting the measurement for
> 12 hours is:
> # cyclictest -l43200000 -a 1 -m -n -p98 -i1000 -h200 -q > results.lat&
> This test application is pinned to the isolated CPU #1, which has been
> excluded from the load balancing scheme, and process memory is locked.
> The latency sampling thread is set to SCHED_FIFO,98 in both tests, since
> this is the highest priority level we may use for a user-space
> application without interfering with critical kernel activities in the
> native preemption case. Actually, we could have used priority 1 for
> Xenomai with the same latency results, since the Xenomai scheduler
> always has precedence over the regular kernel scheduler. The sampling
> frequency is set to 1 kHz. The so-called "clock_nanosleep mode" is used
> (sketched below, after the results), which is definitely the best case
> for native preemption.
> --
> The results obtained running this test can be downloaded from [8]. They
> don't match the figures presented at ELCE on the very same hardware;
> they should have followed the same trend, but did not. In isolated mode,
> Xenomai achieved a 50 µs worst-case latency, bumping to 83 µs without
> CPU isolation, while native preemption went to 117 µs with CPU
> isolation.
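> To make the "clock_nanosleep mode" concrete, here is a minimal sketch of
> the kind of periodic sampling loop 'cyclictest -n' is built around. This
> is an illustration of the technique only, not the actual cyclictest
> source; the priority and period merely mirror the -p98/-i1000 options
> above, and the build command is an assumption for a recent glibc:
>
> /* sampler.c - minimal clock_nanosleep()-based latency sampler sketch.
>  * Build (assumed): gcc -O2 -Wall -o sampler sampler.c
>  * Needs root (or CAP_SYS_NICE/CAP_IPC_LOCK) for SCHED_FIFO and mlockall().
>  */
> #include <sched.h>
> #include <stdint.h>
> #include <stdio.h>
> #include <sys/mman.h>
> #include <time.h>
>
> #define PERIOD_NS 1000000L /* 1 kHz sampling period, as in the test */
>
> static int64_t ts_diff_ns(const struct timespec *a, const struct timespec *b)
> {
>         return (int64_t)(a->tv_sec - b->tv_sec) * 1000000000LL
>                 + (a->tv_nsec - b->tv_nsec);
> }
>
> int main(void)
> {
>         struct sched_param prm = { .sched_priority = 98 };
>         struct timespec next, now;
>         int64_t lat, max_lat = 0;
>
>         mlockall(MCL_CURRENT | MCL_FUTURE);      /* lock process memory (-m) */
>         sched_setscheduler(0, SCHED_FIFO, &prm); /* SCHED_FIFO,98 (-p98) */
>
>         clock_gettime(CLOCK_MONOTONIC, &next);
>         for (;;) {
>                 /* compute the next absolute wake-up date */
>                 next.tv_nsec += PERIOD_NS;
>                 if (next.tv_nsec >= 1000000000L) {
>                         next.tv_nsec -= 1000000000L;
>                         next.tv_sec++;
>                 }
>                 clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);
>                 clock_gettime(CLOCK_MONOTONIC, &now);
>                 /* wake-up latency = actual wake-up time - requested time */
>                 lat = ts_diff_ns(&now, &next);
>                 if (lat > max_lat) {
>                         max_lat = lat;
>                         printf("new max latency: %lld ns\n", (long long)max_lat);
>                 }
>         }
>         return 0;
> }
>
> Linked against the plain glibc, this runs on native preemption; built
> against Xenomai's POSIX libcobalt interface instead, the very same calls
> are served by the Cobalt core, which is what the dual kernel test
> measures.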
> I gave as many details as possible regarding the settings I used for
> native preemption on this SoC, so that anything I might have overlooked
> could be quickly spotted by an expert in this field. Just let me know.
> The Xenomai part should be ok though.
> > And how about communication with hardware in a PREEMPT_RT setup?
> > Without special RT drivers, what can we expect? For me it's not about
> > when something can be computed but rather when it can be communicated.
> I don't think there is a definitive answer to this. It would depend on
> several aspects, such as the particular implementation of the driver and
> the locking constructs used there for instance, whether there is some
> kernel layer above the VFS between your application and the driver
> handling the actual I/O requests to the hardware, what the worst runtime
> conditions for latency would be, what the runtime settings such as IRQ
> thread priorities would be, and so on.
> > [1] http://linuxgizmos.com/real-time-linux-explained/
> > Quote: "While Xenomai performed better on most tests, and offered far
> > less jitter, the differences were not as great as the 300 to 400 percent
> > latency superiority claimed by some Xenomai boosters, said Altenberg.
> > When tests were performed on userspace tasks — which Altenberg says is
> > the most real-world, and therefore the most important, test — the
> > worst-case reaction was about 90 to 95 microseconds for both Xenomai
> > and RTL/PREEMPT_RT, he claimed."
> I must admit that I don't have any formal explanation for such a
> difference between those results and the ones I just obtained, not
> having access to the test material presented at ELCE. Maybe we managed
> to lower the Xenomai worst-case latency by 49% on this hardware since
> 2016, which would be great news, and native preemption got worse by
> about 32% during the same period of time, which would be sad.
> I'm reassured by the fact that the most recent results are consistent
> with those I have been seeing over the years on all the architectures I
> came across running a variety of tests, which is a relief.
> This said, I would refrain from generalizing the results of either my
> benchmark or any benchmark, especially those supporting a PR stunt:
> there is a multitude of combinations of real-time use cases, platforms,
> runtime conditions and requirements. How meaningful those results are
> depends on such a combination, not to speak of the implementation of
> the application itself. The devil is in the detail.
> [1] https://events.static.linuxfound.org/sites/events/files/slides/praesentation_1.pdf
> [2] https://mirrors.edge.kernel.org/pub/linux/kernel/projects/rt/5.0/older/patches-5.0.14-rt8.tar.xz
> [3] https://gitlab.denx.de/Xenomai/ipipe-arm (branch master)
> [4] https://gitlab.denx.de/Xenomai/xenomai (branch next)
> [5] https://xenomai.org/documentation/xenomai-3/html/man1/autotune/index.html
> [6] git://git.kernel.org/pub/scm/linux/kernel/git/clrkwllms/rt-tests.git
> [7] https://gitlab.denx.de/Xenomai/xenomai/tree/next/demo/posix/cyclictest
> [8] https://xenomai.org/downloads/xenomai/benchmarks/cyclone-v/socfpga/2019/
> --
> Philippe.

Thanks!

I really appreciate your detailed and serious answer. In fact this was
probably the most well-written mailing list answer I have ever gotten.

Per Öberg