All of lore.kernel.org
 help / color / mirror / Atom feed
* Debugging system freeze, SIGXCPU
@ 2019-01-24 16:35 Ari Mozes
  2019-02-25 13:32 ` Fwd: " Ari Mozes
  0 siblings, 1 reply; 10+ messages in thread
From: Ari Mozes @ 2019-01-24 16:35 UTC (permalink / raw)
  To: xenomai

I am new to Xenomai and am trying to troubleshoot a repeatable Linux
freeze (nothing I could find in the system logs, no response to any
keystrokes such as REISUB, etc.).  I switched from Cobalt 3.0.7 running
on Linux 4.4.71 to Cobalt 3.0.8 on 4.14.89, but the problem remained.
I then turned on all Xenomai debugging options when building the
kernel, and I am now seeing SIGXCPU.  I assume this is the same issue
that was causing the freeze before I enabled the debugging options, but
I am not certain.  Below is the info from my system as well as a
reduced test case which reproduces the SIGXCPU on my system.  Most
important (to me) is gaining a better understanding of what to expect
from Xenomai and how to identify problematic code as I transition a
much larger code base to an RTOS (specifically Xenomai).  Any help and
insight would be appreciated.  I am happy to provide any missing
information and/or run experiments to help troubleshoot further.

- Ari

Result from running /usr/xenomai/bin/xeno-config --info:
Xenomai version: Xenomai/cobalt v3.0.7
Linux fire 4.14.89 #3 SMP PREEMPT Sat Jan 19 15:34:03 EST 2019 x86_64
x86_64 x86_64 GNU/Linux
Kernel parameters: BOOT_IMAGE=/boot/vmlinuz-4.14.89
root=UUID=2ffdfd76-c81d-4252-88df-b51b6b0fcc9b ro quiet splash
i915.enable_rc6=0 i915.enable_dc=0 xeno_nucleus.xenomai_gid=129 nosmap
i915.modset=0 noapic intremap=off xenomai.allowed_group=129
crashkernel=64M@16M vt.handoff=7
I-pipe release #2 detected
Cobalt core 3.0.8 detected
Compiler: gcc version 6.5.0 20181026 (Ubuntu 6.5.0-2ubuntu1~16.04)
Build args: --with-pic --with-core=cobalt --enable-smp --disable-tls
--enable-dlopen-libs --disable-clock-monotonic-raw

Test case file xenapp.cpp:

#include <stdio.h>
#include <string.h>   // strerror()
#include <errno.h>    // ETIMEDOUT
#include <sys/mman.h>
#include <alchemy/task.h>
#include <alchemy/timer.h>
#include <chrono>
#include <unistd.h>

// All times are in nanoseconds
#define TIME_10ms  10000000
#define TIME_100us   100000

#define TASK_PERIOD TIME_10ms
#define TASK_TICKS (rt_timer_ns2ticks(SRTIME(TASK_PERIOD)))

#define SLEEP_TIME TIME_100us
#define SLEEP_TICKS (rt_timer_ns2ticks(SRTIME(SLEEP_TIME)))

#define TASK_BUSY (TASK_PERIOD / 4) // "work" for about 1/4 of the period
#define LOOP_COUNT ((int)(TASK_BUSY / SLEEP_TIME))

void xenTaskFunc(void *cookie)
{
    int err;   // a local named "errno" would shadow the C library macro
    std::chrono::high_resolution_clock::time_point curTime;
    unsigned long overruns = 0;

    err = rt_task_set_periodic(NULL, TM_NOW, TASK_TICKS);
    if (err) {
        printf("rt_task_set_periodic error: %d %s\n", err, strerror(-err));
    }

    while (true) {
        // If I alternate between the std::chrono call and rt_task_sleep for
        // part of the task period, SIGXCPU is thrown.
        for (int i = 0; i < LOOP_COUNT; i++) {
            curTime = std::chrono::high_resolution_clock::now();
            err = rt_task_sleep(SLEEP_TICKS);
            if (err) {
                printf("rt_task_sleep error: %d %s\n", err, strerror(-err));
            }
        }
        err = rt_task_wait_period(&overruns);
        if (err) {
            if (err == -ETIMEDOUT) {
                printf("rt_task_wait_period overruns: %ld\n", overruns);
            }
            else {
                printf("rt_task_wait_period error: %d %s\n", err,
                    strerror(-err));
                break;
            }
        }
    }
}

int main(int argc, char **argv)
{
    mlockall(MCL_CURRENT | MCL_FUTURE);

    int err;
    RT_TASK xenTask;
    err = rt_task_create(&xenTask, "xenTestTask", 0, 50, 0);
    if (err) {
        printf("rt_task_create error: %d %s\n", err, strerror(-err));
    }
    err = rt_task_start(&xenTask, &xenTaskFunc, NULL);
    if (err) {
        printf("rt_task_start error: %d %s\n", err, strerror(-err));
    }

    printf("TASK_TICKS %lld, SLEEP_TICKS %lld, LOOP_COUNT %d\n",
        TASK_TICKS, SLEEP_TICKS, LOOP_COUNT);

    // wait for signal
    pause();
}


Test case Makefile:

XENO_CONFIG := /usr/xenomai/bin/xeno-config
CFLAGS := $(shell $(XENO_CONFIG) --alchemy --cflags)
LDFLAGS := $(shell $(XENO_CONFIG) --alchemy --ldflags)
CC := $(shell $(XENO_CONFIG) --cc)

all: xenapp

xenapp: xenapp.cpp
    $(CC) -o $@ $< $(CFLAGS) $(LDFLAGS) -lstdc++


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Fwd: Debugging system freeze, SIGXCPU
  2019-01-24 16:35 Debugging system freeze, SIGXCPU Ari Mozes
@ 2019-02-25 13:32 ` Ari Mozes
  2019-02-25 16:08   ` Philippe Gerum
  0 siblings, 1 reply; 10+ messages in thread
From: Ari Mozes @ 2019-02-25 13:32 UTC (permalink / raw)
  To: xenomai

Resending this question with the test case.
Can someone give the test case a try to see if it reproduces the problem I
am seeing?  Is more information needed?
It takes a couple of minutes before I see the issue occur.

Thanks,
Ari

---------- Forwarded message ---------
From: Ari Mozes <arimozes@neocisinc.com>
Date: Thu, Jan 24, 2019 at 11:35 AM
Subject: Debugging system freeze, SIGXCPU
To: <xenomai@xenomai.org>


I am new to Xenomai and am trying to troubleshoot a repeatable Linux
freeze (nothing I could find in the system logs, no response to any
keystrokes such as REISUB, etc.).  I switched from Cobalt 3.0.7 running
on Linux 4.4.71 to Cobalt 3.0.8 on 4.14.89, but the problem remained.
I then turned on all Xenomai debugging options when building the
kernel, and I am now seeing SIGXCPU.  I assume this is the same issue
that was causing the freeze before I enabled the debugging options, but
I am not certain.  Below is the info from my system as well as a
reduced test case which reproduces the SIGXCPU on my system.  Most
important (to me) is gaining a better understanding of what to expect
from Xenomai and how to identify problematic code as I transition a
much larger code base to an RTOS (specifically Xenomai).  Any help and
insight would be appreciated.  I am happy to provide any missing
information and/or run experiments to help troubleshoot further.

- Ari

Result from running /usr/xenomai/bin/xeno-config --info:
Xenomai version: Xenomai/cobalt v3.0.7
Linux fire 4.14.89 #3 SMP PREEMPT Sat Jan 19 15:34:03 EST 2019 x86_64
x86_64 x86_64 GNU/Linux
Kernel parameters: BOOT_IMAGE=/boot/vmlinuz-4.14.89
root=UUID=2ffdfd76-c81d-4252-88df-b51b6b0fcc9b ro quiet splash
i915.enable_rc6=0 i915.enable_dc=0 xeno_nucleus.xenomai_gid=129 nosmap
i915.modset=0 noapic intremap=off xenomai.allowed_group=129
crashkernel=64M@16M vt.handoff=7
I-pipe release #2 detected
Cobalt core 3.0.8 detected
Compiler: gcc version 6.5.0 20181026 (Ubuntu 6.5.0-2ubuntu1~16.04)
Build args: --with-pic --with-core=cobalt --enable-smp --disable-tls
--enable-dlopen-libs --disable-clock-monotonic-raw

Test case file xenapp.cpp:

#include <stdio.h>
#include <string.h>   // strerror()
#include <errno.h>    // ETIMEDOUT
#include <sys/mman.h>
#include <alchemy/task.h>
#include <alchemy/timer.h>
#include <chrono>
#include <unistd.h>

// All times are in nanoseconds
#define TIME_10ms  10000000
#define TIME_100us   100000

#define TASK_PERIOD TIME_10ms
#define TASK_TICKS (rt_timer_ns2ticks(SRTIME(TASK_PERIOD)))

#define SLEEP_TIME TIME_100us
#define SLEEP_TICKS (rt_timer_ns2ticks(SRTIME(SLEEP_TIME)))

#define TASK_BUSY (TASK_PERIOD / 4) // "work" for about 1/4 of the period
#define LOOP_COUNT ((int)(TASK_BUSY / SLEEP_TIME))

void xenTaskFunc(void *cookie)
{
    int err;   // a local named "errno" would shadow the C library macro
    std::chrono::high_resolution_clock::time_point curTime;
    unsigned long overruns = 0;

    err = rt_task_set_periodic(NULL, TM_NOW, TASK_TICKS);
    if (err) {
        printf("rt_task_set_periodic error: %d %s\n", err, strerror(-err));
    }

    while (true) {
        // If I alternate between the std::chrono call and rt_task_sleep for
        // part of the task period, SIGXCPU is thrown.
        for (int i = 0; i < LOOP_COUNT; i++) {
            curTime = std::chrono::high_resolution_clock::now();
            err = rt_task_sleep(SLEEP_TICKS);
            if (err) {
                printf("rt_task_sleep error: %d %s\n", err, strerror(-err));
            }
        }
        err = rt_task_wait_period(&overruns);
        if (err) {
            if (err == -ETIMEDOUT) {
                printf("rt_task_wait_period overruns: %ld\n", overruns);
            }
            else {
                printf("rt_task_wait_period error: %d %s\n", err,
                    strerror(-err));
                break;
            }
        }
    }
}

int main(int argc, char **argv)
{
    mlockall(MCL_CURRENT | MCL_FUTURE);

    int err;
    RT_TASK xenTask;
    err = rt_task_create(&xenTask, "xenTestTask", 0, 50, 0);
    if (err) {
        printf("rt_task_create error: %d %s\n", err, strerror(-err));
    }
    err = rt_task_start(&xenTask, &xenTaskFunc, NULL);
    if (err) {
        printf("rt_task_start error: %d %s\n", err, strerror(-err));
    }

    printf("TASK_TICKS %lld, SLEEP_TICKS %lld, LOOP_COUNT %d\n",
        TASK_TICKS, SLEEP_TICKS, LOOP_COUNT);

    // wait for signal
    pause();
}


Test case Makefile:

XENO_CONFIG := /usr/xenomai/bin/xeno-config
CFLAGS := $(shell $(XENO_CONFIG) --alchemy --cflags)
LDFLAGS := $(shell $(XENO_CONFIG) --alchemy --ldflags)
CC := $(shell $(XENO_CONFIG) --cc)

all: xenapp

xenapp: xenapp.cpp
    $(CC) -o $@ $< $(CFLAGS) $(LDFLAGS) -lstdc++


-- 

Ari Mozes
Staff Software Engineer

Neocis, Inc.

Mobile: 781.266.6553

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Fwd: Debugging system freeze, SIGXCPU
  2019-02-25 13:32 ` Fwd: " Ari Mozes
@ 2019-02-25 16:08   ` Philippe Gerum
  2019-02-25 16:57     ` Ari Mozes
  0 siblings, 1 reply; 10+ messages in thread
From: Philippe Gerum @ 2019-02-25 16:08 UTC (permalink / raw)
  To: Ari Mozes, xenomai

On 2/25/19 2:32 PM, Ari Mozes via Xenomai wrote:
> Resending this question with testcase.
> Can someone give the testcase a try to see if it reproduces the problem I
> am seeing?  Is more information needed?
> It takes a couple of minutes before I see the issue occur.

The random lockup is due to std::chrono::high_resolution_clock::now()
invoking the vDSO form of clock_gettime().

SIGXCPU aka Xenomai's SIGDEBUG may be sent by the core in various
situations, but since the code does not set the T_WARNSW flag for any
task, the only explanation is a Xenomai watchdog notification. See the
help text for CONFIG_XENO_OPT_WATCHDOG in your kernel configuration.

After a few seconds spent spinning in the vDSO code, which must not be
called from a real-time context, the Xenomai core pulls the brake and
sends SIGXCPU to the offending process, unless the system locks up
before the watchdog can even trigger.

Solution: use clock_gettime(CLOCK_HOST_REALTIME) instead of
std::chrono::high_resolution_clock::now() for getting timestamps.

A related discussion is available at this URL:
https://www.xenomai.org/pipermail/xenomai/2018-December/040133.html

-- 
Philippe.


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Fwd: Debugging system freeze, SIGXCPU
  2019-02-25 16:08   ` Philippe Gerum
@ 2019-02-25 16:57     ` Ari Mozes
  2019-02-25 17:28       ` Jan Kiszka
  2019-02-26  8:40       ` Philippe Gerum
  0 siblings, 2 replies; 10+ messages in thread
From: Ari Mozes @ 2019-02-25 16:57 UTC (permalink / raw)
  To: xenomai

Philippe,
Thank you for the information and the URL.
I read through the thread, and I agree with comments that it would be
helpful to be able to identify/blacklist/etc problematic calls when
porting over existing code to a true RT scenario.  In our case the
original code was written with "RT-like" behavior in mind, but as
there is a lot of code already in place, approaches to identify
existing problematic calls would be helpful.
I will continue to familiarize myself with the nitty-gritty details,
but anything that makes the process easier is always welcome :-)

Ari





-- 

Ari Mozes
Staff Software Engineer

Neocis, Inc.

Mobile: 781.266.6553





^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Fwd: Debugging system freeze, SIGXCPU
  2019-02-25 16:57     ` Ari Mozes
@ 2019-02-25 17:28       ` Jan Kiszka
  2019-02-25 17:59         ` Ari Mozes
  2019-02-26  8:40       ` Philippe Gerum
  1 sibling, 1 reply; 10+ messages in thread
From: Jan Kiszka @ 2019-02-25 17:28 UTC (permalink / raw)
  To: Ari Mozes, xenomai

On 25.02.19 17:57, Ari Mozes via Xenomai wrote:
> Philippe,
> Thank you for the information and the URL.
> I read through the thread, and I agree with comments that it would be
> helpful to be able to identify/blacklist/etc problematic calls when
> porting over existing code to a true RT scenario.  In our case the
> original code was written with "RT-like" behavior in mind, but as
> there is a lot of code already in place, approaches to identify
> existing problematic calls would be helpful.

You could wrap such calls like we do for malloc/free in libcobalt. But wrapping 
only works if the direct caller is processed that way - and is not some 
pre-built external library.

Therefore: Do not use libraries that you didn't validate from within 
time-sensitive code paths. Also libstdc++ may contain more surprises.

Jan

> I will continue to familiarize myself with the nitty-gritty details,
> but anything that makes the process easier is always welcome :-)
> 
> Ari
> 

-- 
Siemens AG, Corporate Technology, CT RDA IOT SES-DE
Corporate Competence Center Embedded Linux


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Fwd: Debugging system freeze, SIGXCPU
  2019-02-25 17:28       ` Jan Kiszka
@ 2019-02-25 17:59         ` Ari Mozes
  0 siblings, 0 replies; 10+ messages in thread
From: Ari Mozes @ 2019-02-25 17:59 UTC (permalink / raw)
  To: xenomai

On Mon, Feb 25, 2019 at 12:28 PM Jan Kiszka <jan.kiszka@siemens.com> wrote:
>
> On 25.02.19 17:57, Ari Mozes via Xenomai wrote:
> > Philippe,
> > Thank you for the information and the URL.
> > I read through the thread, and I agree with comments that it would be
> > helpful to be able to identify/blacklist/etc problematic calls when
> > porting over existing code to a true RT scenario.  In our case the
> > original code was written with "RT-like" behavior in mind, but as
> > there is a lot of code already in place, approaches to identify
> > existing problematic calls would be helpful.
>
> You could wrap such calls like we do for malloc/free in libcobalt. But wrapping
> only works if the direct caller is processed that way - and is not some
> pre-built external library.

Sure - makes sense - IMO just knowing which calls are potentially problematic is
the difficult part here.  I expect I will just continue to stumble through them
and learn more as I go.

> Therefore: Do not use libraries that you didn't validate from within
> time-sensitive code paths. Also libstdc++ may contain more surprises.
>
> Jan


-- 

Ari Mozes
Staff Software Engineer

Neocis, Inc.

Mobile: 781.266.6553





^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Fwd: Debugging system freeze, SIGXCPU
  2019-02-25 16:57     ` Ari Mozes
  2019-02-25 17:28       ` Jan Kiszka
@ 2019-02-26  8:40       ` Philippe Gerum
  2019-02-26 13:52         ` Ari Mozes
  1 sibling, 1 reply; 10+ messages in thread
From: Philippe Gerum @ 2019-02-26  8:40 UTC (permalink / raw)
  To: Ari Mozes, xenomai

On 2/25/19 5:57 PM, Ari Mozes via Xenomai wrote:
> Philippe,
> Thank you for the information and the URL.
> I read through the thread, and I agree with comments that it would be
> helpful to be able to identify/blacklist/etc problematic calls when
> porting over existing code to a true RT scenario.  In our case the
> original code was written with "RT-like" behavior in mind, but as
> there is a lot of code already in place, approaches to identify
> existing problematic calls would be helpful.
> I will continue to familiarize myself with the nitty-gritty details,
> but anything that makes the process easier is always welcome :-)
> 

User-oriented documentation is lacking for Xenomai; that is a fact.
Until somebody tackles the task of contributing it gradually, the
situation won't change.  This being said, the following may help as a
survival kit for programming with Xenomai.

This is a dual kernel system, so we have two competing cores: the
regular kernel and cobalt. The latter can preempt the former for running
its own tasks at almost any point in time, including within its critical
sections.

With that in mind, it becomes clear that calling regular kernel routines
from the runtime context of the cobalt core may cause severe re-entry bugs.

To mitigate this issue, cobalt detects when one of its tasks issues a
regular kernel system call from a real-time context, transferring
control over it to the regular kernel when this happens. The cobalt task
is demoted to non real-time mode during this process, which incurs
unbounded latency down the road, but that is still better than breaking
the whole kernel system.

Because such detection happens when a task transitions between user and
kernel space due to a syscall, vDSO-based services and intra-kernel
function calls escape it, since there is no intervening syscall. In
these particular cases, the real-time core most often breaks basic
assumptions of the non real-time linux kernel with respect to locking
rules and interrupt-free sections by running code it should not, and
things start to fall apart.

C++ libraries may call into standard glibc services such as malloc/free
or POSIX mutex support, which in turn may issue regular linux syscalls
in some cases (e.g. taking a non-contended mutex won't, but the
contended case will definitely ask the kernel to put the caller to
sleep until the lock is available).  This is going to be the major
issue to solve when porting a large C++ code base to a dual kernel
system such as Xenomai: figuring out which C++ abstractions are
real-time safe in such an environment, and which are not.

Typical solutions may involve overloading the new/delete operators so
that an allocator which does not rely on regular system calls is picked
instead of malloc/free, possibly staying away from C++ exception
handling too if it implicitly allocates memory the same way.

To help you in detecting the situations where your application is being
demoted to non real-time mode (aka "secondary" mode) by cobalt in order
to process a regular syscall, you can trap the SIGDEBUG signal. This is
a regular linux signal (SIGXCPU in disguise) which is sent to the thread
crossing the domain boundaries from rt to non-rt. For this to happen,
the thread should arm the "warn on mode switch" flag using a Xenomai
system call. The application should catch the SIGDEBUG signal, which
comes with some bits of information detailing which action specifically
triggered the mode switch.

With the "alchemy" API, rt_task_set_mode(0, T_WARNSW, NULL) can be used,
or the task can be created with such init flag as illustrated in
demo/alchemy/altency.c. With the POSIX API, one can use
pthread_setmode_np(0, PTHREAD_WARNSW, NULL) as illustrated in
testsuite/latency/latency.c.

These particular services are described there:

https://xenomai.org/documentation/xenomai-3/html/xeno3prm/group__alchemy__task.html#ga915e7edfb0aaddb643794d7abc7093bf
https://xenomai.org/documentation/xenomai-3/html/xeno3prm/group__cobalt__api__thread.html#gae3b7df7f77c04253ed19fb6346f0f9b2

In the Xenomai documentation, the "api-tags" information mentions
"switch-primary" for any call that forces the caller to switch to
real-time mode. Conversely, "switch-secondary" tags services which
demote the caller to non-rt mode.

As a rule of thumb, most glibc calls should be considered potentially
rt-unsafe in a dual kernel environment, because they may rely on
regular system calls for performing their work.  Specifically, any
service which in essence allocates memory, synchronizes threads, does
messaging, or affects the scheduling state of POSIX threads may have to
call into the regular kernel for doing so.

It is fine to use them during the initialization/cleanup stages of any
Xenomai application, but you certainly want to keep them out of the
time-critical work loop.

-- 
Philippe.


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Fwd: Debugging system freeze, SIGXCPU
  2019-02-26  8:40       ` Philippe Gerum
@ 2019-02-26 13:52         ` Ari Mozes
  0 siblings, 0 replies; 10+ messages in thread
From: Ari Mozes @ 2019-02-26 13:52 UTC (permalink / raw)
  To: xenomai


Thank you Philippe.
Much appreciated, and it will help as I re-read the existing doc/examples/etc.
I had previously looked at Mercury, but comments such as
https://www.xenomai.org/pipermail/xenomai/2018-October/039733.html
made the choice a bit murky.
In any case there is clearly a lot of existing information to absorb, but
thanks again for this aptly named cut at a "survival kit."

Ari




^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Fwd: Debugging system freeze, SIGXCPU
  2019-02-25 23:10 Norbert Lange
@ 2019-02-26 13:34 ` Ari Mozes
  0 siblings, 0 replies; 10+ messages in thread
From: Ari Mozes @ 2019-02-26 13:34 UTC (permalink / raw)
  To: xenomai

On Mon, Feb 25, 2019 at 6:10 PM Norbert Lange <nolange79@gmail.com> wrote:
>
> > Sure - make sense - IMO just knowing which calls are potentially problematic is
> > the difficult part here.  I expect I will just continue to stumble through them
> > and learn more as I go.
>
> I wrote some checkers that should be able to catch those calls
> (had pretty much the same issue, legacy code...).
>
> https://github.com/nolange/preload_checkers
>
> Guinea pigs welcome.
>
> Norbert

I will definitely give this a try - thanks much!

Ari


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Fwd: Debugging system freeze, SIGXCPU
@ 2019-02-25 23:10 Norbert Lange
  2019-02-26 13:34 ` Ari Mozes
  0 siblings, 1 reply; 10+ messages in thread
From: Norbert Lange @ 2019-02-25 23:10 UTC (permalink / raw)
  To: xenomai, arimozes

> Sure - makes sense - IMO just knowing which calls are potentially problematic is
> the difficult part here.  I expect I will just continue to stumble through them
> and learn more as I go.

I wrote some checkers that should be able to catch those calls
(had pretty much the same issue, legacy code...).

https://github.com/nolange/preload_checkers

Guinea pigs welcome.

Norbert

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2019-02-26 13:52 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-01-24 16:35 Debugging system freeze, SIGXCPU Ari Mozes
2019-02-25 13:32 ` Fwd: " Ari Mozes
2019-02-25 16:08   ` Philippe Gerum
2019-02-25 16:57     ` Ari Mozes
2019-02-25 17:28       ` Jan Kiszka
2019-02-25 17:59         ` Ari Mozes
2019-02-26  8:40       ` Philippe Gerum
2019-02-26 13:52         ` Ari Mozes
2019-02-25 23:10 Norbert Lange
2019-02-26 13:34 ` Ari Mozes

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.