* Re: [Xenomai] non-blocking rt_task_suspend(NULL)
@ 2014-04-16 14:20 Petr Cervenka
  2014-04-16 14:28 ` Gilles Chanteperdrix
  0 siblings, 1 reply; 27+ messages in thread
From: Petr Cervenka @ 2014-04-16 14:20 UTC (permalink / raw)
  To: Gilles Chanteperdrix; +Cc: Xenomai

> Od: Gilles Chanteperdrix <gilles.chanteperdrix@xenomai.org>
>
> CC: "Xenomai" <xenomai@xenomai.org>
>On 04/16/2014 02:22 PM, Petr Cervenka wrote:
>>> Od: Gilles Chanteperdrix <gilles.chanteperdrix@xenomai.org>
>>>
>>> CC: "Xenomai" <xenomai@xenomai.org> On 04/15/2014 02:42 PM, Petr
>>> Cervenka wrote:
>>>> Hello, I have a problem with the rt_task_suspend(NULL) call. I'm
>>>> using it to synchronize two (producer / consumer like) tasks.
>>>> 1) When the consumer task has no work to do, it suspends itself
>>>> by calling rt_task_suspend(NULL). 2) When the producer creates
>>>> new work for the consumer, it wakes it up by calling
>>>> rt_task_resume(&consumerTask). The problem is that the consumer
>>>> occasionally switches to a state in which it no longer sleeps in
>>>> rt_task_suspend, and the task then takes all the CPU time. The
>>>> return code is 0, but I have also seen a couple of -4 (-EINTR)
>>>> values in the past. The consumer task status was 00300380 before
>>>> and 00300184 (if a small safety sleep is present). I could use an
>>>> RT_EVENT variable instead, but I'm curious whether you happen to
>>>> know what is happening. Xenomai 2.6.3, Linux 3.5.7
>>>
>>> Could you post the example of code you are using to get this
>>> issue?
>>>
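
For reference, a minimal sketch of the suspend/resume pattern described above (Xenomai 2.x native skin; the entry points and the work-queue helpers are illustrative stand-ins, not the original application code):

#include <native/task.h>

RT_TASK consumerTask;

/* Hypothetical stand-ins for the real work queue. */
extern int  work_available(void);
extern void process_work(void);
extern void enqueue_work(void);

static void consumer(void *cookie)
{
        (void)cookie;
        for (;;) {
                while (!work_available())
                        rt_task_suspend(NULL);  /* 1) no work: suspend ourselves */
                process_work();
        }
}

static void producer_notify(void)
{
        enqueue_work();
        rt_task_resume(&consumerTask);          /* 2) new work: wake the consumer */
}
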
>>
>> It's an application with many threads, mutexes and other objects,
>> and it also depends on special measuring hardware. I can post a
>> simplified example here, but I don't think it would be easy to
>> reproduce the same behavior. In my configuration it happens only
>> about once per day, and very unpredictably. But I have more
>> details: I replaced rt_task_suspend / rt_task_resume with
>> rt_event_wait / rt_event_signal. It failed in a similar way, but
>> this time the result of the wait was -4 (-EINTR), and (after
>> several million invocations) it recovered by itself.
>
>-EINTR is a valid return value for both rt_event_wait and 
>rt_task_suspend. In case you get this error, you should loop to call 
>rt_event_wait again, and not call rt_event_clear, as you risk clearing 
>an event which has been signaled afterwards.
>
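
For illustration, the kind of loop being suggested (the event bit and the function name are made up for the example):

#include <errno.h>
#include <native/event.h>

#define WORK_PENDING 0x1        /* illustrative event bit */

/* Wait for the producer's signal, retrying on -EINTR as suggested above.
 * The event is deliberately not cleared on that path: after a spurious
 * -EINTR we simply wait again, so a bit signaled in the meantime cannot
 * be thrown away. */
static int wait_for_work(RT_EVENT *evt)
{
        unsigned long mask;
        int err;

        do
                err = rt_event_wait(evt, WORK_PENDING, &mask, EV_ANY, TM_INFINITE);
        while (err == -EINTR);

        return err;     /* 0: WORK_PENDING is set; clear it only after the
                           pending work has actually been consumed */
}
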
You are right. It was just a very quick replacement of the waiting and wake-up functions. But I'm checking the "work queue" anyway, and exact timing is not needed here. My problem is that the slow consumer task seems to be "interrupted by a signal" (or whatever) for several minutes. I mean that it no longer waits for the event and always returns immediately (with the -EINTR return code). I also hit one such situation half an hour ago, but the return code was 0 that time. Could you give me some advice on what to check when such a situation happens again?

Petr



* Re: [Xenomai] non-blocking rt_task_suspend(NULL)
  2014-04-16 14:20 [Xenomai] non-blocking rt_task_suspend(NULL) Petr Cervenka
@ 2014-04-16 14:28 ` Gilles Chanteperdrix
  0 siblings, 0 replies; 27+ messages in thread
From: Gilles Chanteperdrix @ 2014-04-16 14:28 UTC (permalink / raw)
  To: Petr Cervenka; +Cc: Xenomai

On 04/16/2014 04:20 PM, Petr Cervenka wrote:
>> Od: Gilles Chanteperdrix <gilles.chanteperdrix@xenomai.org>
>>
>> CC: "Xenomai" <xenomai@xenomai.org> On 04/16/2014 02:22 PM, Petr
>> Cervenka wrote:
>>>> Od: Gilles Chanteperdrix <gilles.chanteperdrix@xenomai.org>
>>>>
>>>> CC: "Xenomai" <xenomai@xenomai.org> On 04/15/2014 02:42 PM,
>>>> Petr Cervenka wrote:
>>>>> Hello, I have a problem with the rt_task_suspend(NULL) call.
>>>>> I'm using it to synchronize two (producer / consumer like)
>>>>> tasks. 1) When the consumer task has no work to do, it suspends
>>>>> itself by calling rt_task_suspend(NULL). 2) When the producer
>>>>> creates new work for the consumer, it wakes it up by calling
>>>>> rt_task_resume(&consumerTask). The problem is that the consumer
>>>>> occasionally switches to a state in which it no longer sleeps
>>>>> in rt_task_suspend, and the task then takes all the CPU time.
>>>>> The return code is 0, but I have also seen a couple of -4
>>>>> (-EINTR) values in the past. The consumer task status was
>>>>> 00300380 before and 00300184 (if a small safety sleep is
>>>>> present). I could use an RT_EVENT variable instead, but I'm
>>>>> curious whether you happen to know what is happening.
>>>>> Xenomai 2.6.3, Linux 3.5.7
>>>>
>>>> Could you post the example of code you are using to get this
>>>> issue?
>>>>
>>>
>>> It's an application with many threads, mutexes and other objects,
>>> and it also depends on special measuring hardware. I can post a
>>> simplified example here, but I don't think it would be easy to
>>> reproduce the same behavior. In my configuration it happens only
>>> about once per day, and very unpredictably. But I have more
>>> details: I replaced rt_task_suspend / rt_task_resume with
>>> rt_event_wait / rt_event_signal. It failed in a similar way, but
>>> this time the result of the wait was -4 (-EINTR), and (after
>>> several million invocations) it recovered by itself.
>>
>> -EINTR is a valid return value for both rt_event_wait and
>> rt_task_suspend. In case you get this error, you should loop to
>> call rt_event_wait again, and not call rt_event_clear, as you risk
>> clearing an event which has been signaled afterwards.
>>
> You are right. It was just a very quick replacement of the waiting and
> wake-up functions. But I'm checking the "work queue" anyway, and exact
> timing is not needed here. My problem is that the slow consumer task
> seems to be "interrupted by a signal" (or whatever) for several
> minutes. I mean that it no longer waits for the event and always
> returns immediately (with the -EINTR return code).

Are you running inside gdb? Does the task receive the SIGDEBUG signal? 
Do you have the XNWARNSW bit armed?
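
For reference, this is roughly how that check could look from the application side (the usual Xenomai 2.x pattern; T_WARNSW is the user-space name of the XNWARNSW bit, and the SIGDEBUG definition should be verified against the 2.6.3 headers):

#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <native/task.h>

#ifndef SIGDEBUG
#define SIGDEBUG SIGXCPU        /* SIGDEBUG maps to SIGXCPU in Xenomai 2.x */
#endif

/* Report why the nucleus sent SIGDEBUG (e.g. a forced migration to
 * secondary mode); the reason code travels in si_value. */
static void sigdebug_handler(int sig, siginfo_t *si, void *ctx)
{
        (void)sig; (void)ctx;
        fprintf(stderr, "SIGDEBUG received, reason %d\n", si->si_value.sival_int);
}

/* Call this from the task to be monitored. */
static void arm_mode_switch_warning(void)
{
        struct sigaction sa;

        memset(&sa, 0, sizeof(sa));
        sa.sa_sigaction = sigdebug_handler;
        sa.sa_flags = SA_SIGINFO;
        sigaction(SIGDEBUG, &sa, NULL);

        /* Arm T_WARNSW for the calling task: any unexpected switch to
         * secondary mode now raises SIGDEBUG. */
        rt_task_set_mode(0, T_WARNSW, NULL);
}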

> I also hit
> one such situation half an hour ago, but the return code was 0 that
> time. Could you give me some advice on what to check when such a
> situation happens again?

Well the task status should help.
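
(A small sketch of grabbing that status from inside the task, assuming the 2.x native skin field names:)

#include <stdio.h>
#include <native/task.h>

static void dump_self_status(void)
{
        RT_TASK_INFO info;

        if (rt_task_inquire(NULL, &info) == 0)  /* NULL: inquire about self */
                printf("task %s status 0x%lx\n",
                       info.name, (unsigned long)info.status);
}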

-- 
					    Gilles.



* Re: [Xenomai] non-blocking rt_task_suspend(NULL)
  2014-05-20 12:27                   ` Petr Cervenka
@ 2014-05-20 12:54                     ` Philippe Gerum
  0 siblings, 0 replies; 27+ messages in thread
From: Philippe Gerum @ 2014-05-20 12:54 UTC (permalink / raw)
  To: Petr Cervenka; +Cc: Xenomai

On 05/20/2014 02:27 PM, Petr Cervenka wrote:
>> Od: Philippe Gerum
>>
>>>>
>>>> Thanks. Could you drop the previous instrumentation patches, and give a try at this one? It fixes a flaw in the logic for maintaining the thread information bits, which may have caused the issue you observed:
>>>>
>>>> diff --git a/ksrc/nucleus/pod.c b/ksrc/nucleus/pod.c
>>>> index 0a2ee19..22fa91d 100644
>>>> --- a/ksrc/nucleus/pod.c
>>>> +++ b/ksrc/nucleus/pod.c
>>>> @@ -1391,7 +1391,8 @@ void xnpod_suspend_thread(xnthread_t *thread, xnflags_t mask,
>>>> 		}
>>>> #endif /* CONFIG_XENO_OPT_PERVASIVE */
>>>>
>>>> -		xnthread_clear_info(thread, XNRMID | XNTIMEO | XNBREAK | XNWAKEN | XNROBBED);
>>>> +		xnthread_clear_info(thread, XNRMID | XNTIMEO | XNBREAK | \
>>>> +				    XNWAKEN | XNROBBED | XNKICKED);
>>>> 	}
>>>>
>>>> 	/* Don't start the timer for a thread indefinitely delayed by
>>>>
>>>
>>> I applied the new patch and have not been able to reproduce the -EINTR problem so far. But one of the machines I used for a long-term test over the weekend froze. That could also have an unrelated cause, though. What do you think?
>>
>> Could you enable all debug features (CONFIG_XENO_OPT_DEBUG), and
>> particularly CONFIG_XENO_OPT_WATCHDOG? The system will be significantly
>> slowed down due to the consistency checks on all internal lists (i.e.
>> DEBUG_QUEUES), but this may give us some hints about the freeze.
>>
>
> I have enabled almost all debug features and tortured several PCs for a week without a single problem. Thank you all for your help.
>

Ok, thanks for testing this extensively. The suggested patch has been 
pushed to the maintenance tree.

-- 
Philippe.



* Re: [Xenomai] non-blocking rt_task_suspend(NULL)
  2014-05-12 13:09                 ` Philippe Gerum
@ 2014-05-20 12:27                   ` Petr Cervenka
  2014-05-20 12:54                     ` Philippe Gerum
  0 siblings, 1 reply; 27+ messages in thread
From: Petr Cervenka @ 2014-05-20 12:27 UTC (permalink / raw)
  To: Philippe Gerum; +Cc: Xenomai

> Od: Philippe Gerum
>
>>>
>>> Thanks. Could you drop the previous instrumentation patches, and give a try at this one? It fixes a flaw in the logic for maintaining the thread information bits, which may have caused the issue you observed:
>>>
>>> diff --git a/ksrc/nucleus/pod.c b/ksrc/nucleus/pod.c
>>> index 0a2ee19..22fa91d 100644
>>> --- a/ksrc/nucleus/pod.c
>>> +++ b/ksrc/nucleus/pod.c
>>> @@ -1391,7 +1391,8 @@ void xnpod_suspend_thread(xnthread_t *thread, xnflags_t mask,
>>> 		}
>>> #endif /* CONFIG_XENO_OPT_PERVASIVE */
>>>
>>> -		xnthread_clear_info(thread, XNRMID | XNTIMEO | XNBREAK | XNWAKEN | XNROBBED);
>>> +		xnthread_clear_info(thread, XNRMID | XNTIMEO | XNBREAK | \
>>> +				    XNWAKEN | XNROBBED | XNKICKED);
>>> 	}
>>>
>>> 	/* Don't start the timer for a thread indefinitely delayed by
>>>
>>
>> I applied the new patch and have not been able to reproduce the -EINTR problem so far. But one of the machines I used for a long-term test over the weekend froze. That could also have an unrelated cause, though. What do you think?
>
>Could you enable all debug features (CONFIG_XENO_OPT_DEBUG), and 
>particularly CONFIG_XENO_OPT_WATCHDOG? The system will be significantly 
>slowed down due to the consistency checks on all internal lists (i.e. 
>DEBUG_QUEUES), but this may give us some hints about the freeze.
>

I have enabled almost all debug features and tortured several PCs for a week without a single problem. Thank you all for your help.

Petr



* Re: [Xenomai] non-blocking rt_task_suspend(NULL)
  2014-05-12 12:37               ` Petr Cervenka
@ 2014-05-12 13:09                 ` Philippe Gerum
  2014-05-20 12:27                   ` Petr Cervenka
  0 siblings, 1 reply; 27+ messages in thread
From: Philippe Gerum @ 2014-05-12 13:09 UTC (permalink / raw)
  To: Petr Cervenka; +Cc: Xenomai

On 05/12/2014 02:37 PM, Petr Cervenka wrote:
>> Od: Philippe Gerum <rpm@xenomai.org>
>>
>>
>> Thanks. Could you drop the previous instrumentation patches, and give a try at this one? It fixes a flaw in the logic for maintaining the thread information bits, which may have caused the issue you observed:
>>
>> diff --git a/ksrc/nucleus/pod.c b/ksrc/nucleus/pod.c
>> index 0a2ee19..22fa91d 100644
>> --- a/ksrc/nucleus/pod.c
>> +++ b/ksrc/nucleus/pod.c
>> @@ -1391,7 +1391,8 @@ void xnpod_suspend_thread(xnthread_t *thread, xnflags_t mask,
>> 		}
>> #endif /* CONFIG_XENO_OPT_PERVASIVE */
>>
>> -		xnthread_clear_info(thread, XNRMID | XNTIMEO | XNBREAK | XNWAKEN | XNROBBED);
>> +		xnthread_clear_info(thread, XNRMID | XNTIMEO | XNBREAK | \
>> +				    XNWAKEN | XNROBBED | XNKICKED);
>> 	}
>>
>> 	/* Don't start the timer for a thread indefinitely delayed by
>>
>
> I applied the new patch and have not been able to reproduce the -EINTR problem so far. But one of the machines I used for a long-term test over the weekend froze. That could also have an unrelated cause, though. What do you think?

Could you enable all debug features (CONFIG_XENO_OPT_DEBUG), and 
particularly CONFIG_XENO_OPT_WATCHDOG? The system will be significantly 
slowed down due to the consistency checks on all internal lists (i.e. 
DEBUG_QUEUES), but this may give us some hints about the freeze.
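
For reference, the options in question would look roughly like this in the kernel .config (names as used by the Xenomai 2.6 nucleus Kconfig; the exact list varies between versions, so double-check against the tree):

CONFIG_XENO_OPT_DEBUG=y
CONFIG_XENO_OPT_DEBUG_QUEUES=y
CONFIG_XENO_OPT_WATCHDOG=y
CONFIG_XENO_OPT_WATCHDOG_TIMEOUT=4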

-- 
Philippe.



* Re: [Xenomai] non-blocking rt_task_suspend(NULL)
  2014-05-08 15:53             ` Philippe Gerum
@ 2014-05-12 12:37               ` Petr Cervenka
  2014-05-12 13:09                 ` Philippe Gerum
  0 siblings, 1 reply; 27+ messages in thread
From: Petr Cervenka @ 2014-05-12 12:37 UTC (permalink / raw)
  To: Philippe Gerum; +Cc: Xenomai

> Od: Philippe Gerum <rpm@xenomai.org>
>
>
>Thanks. Could you drop the previous instrumentation patches, and give a try at this one? It fixes a flaw in the logic for maintaining the thread information bits, which may have caused the issue you observed: 
>
>diff --git a/ksrc/nucleus/pod.c b/ksrc/nucleus/pod.c
>index 0a2ee19..22fa91d 100644
>--- a/ksrc/nucleus/pod.c
>+++ b/ksrc/nucleus/pod.c
>@@ -1391,7 +1391,8 @@ void xnpod_suspend_thread(xnthread_t *thread, xnflags_t mask,
> 		}
> #endif /* CONFIG_XENO_OPT_PERVASIVE */
> 
>-		xnthread_clear_info(thread, XNRMID | XNTIMEO | XNBREAK | XNWAKEN | XNROBBED);
>+		xnthread_clear_info(thread, XNRMID | XNTIMEO | XNBREAK | \
>+				    XNWAKEN | XNROBBED | XNKICKED);
> 	}
> 
> 	/* Don't start the timer for a thread indefinitely delayed by
>

I applied the new patch and have not been able to reproduce the -EINTR problem so far. But one of the machines I used for a long-term test over the weekend froze. That could also have an unrelated cause, though. What do you think?

Petr



* Re: [Xenomai] non-blocking rt_task_suspend(NULL)
  2014-05-07 13:13           ` Petr Cervenka
@ 2014-05-08 15:53             ` Philippe Gerum
  2014-05-12 12:37               ` Petr Cervenka
  0 siblings, 1 reply; 27+ messages in thread
From: Philippe Gerum @ 2014-05-08 15:53 UTC (permalink / raw)
  To: Petr Cervenka, Gilles Chanteperdrix; +Cc: Xenomai

On 05/07/2014 03:13 PM, Petr Cervenka wrote:
>> Od: Philippe Gerum <rpm@xenomai.org>
>>>
>>> Here it is. It's full of modules and other perhaps not so real-time 
>>> settings, because it was derived from Kubuntu kernel config file.
>>>
>>
>> Thanks. The issue seems to happen as a result of a relax -> harden 
>> transition racing with a signal receipt. To help me rule out 
>> some assumptions, could you please apply the patch below, and 
>> confirm that no job control is involved (SIGCONT/SIGSTOP) in your 
>> application?
>>
>> Knowing whether either of the two warnings added by this patch is 
>> issued when the bug happens would also help solve the issue, TIA.
>>
>> diff --git a/ksrc/nucleus/pod.c b/ksrc/nucleus/pod.c
>> index 0a2ee19..9fb797f 100644
>> --- a/ksrc/nucleus/pod.c
>> +++ b/ksrc/nucleus/pod.c
>> @@ -1379,15 +1379,17 @@ void xnpod_suspend_thread(xnthread_t *thread, 
>> xnflags_t mask,
>>          * context, to collect and act upon the pending Linux
>>          * signal.
>>          */
>> -        if ((mask & XNRELAX) == 0 &&
>> -            xnthread_test_info(thread, XNKICKED)) {
>> -            if (wchan) {
>> -                thread->wchan = wchan;
>> -                xnsynch_forget_sleeper(thread);
>> +        if (xnthread_test_info(thread, XNKICKED)) {
>> +            if ((mask & XNRELAX) == 0) {
>> +                if (wchan) {
>> +                    thread->wchan = wchan;
>> +                    xnsynch_forget_sleeper(thread);
>> +                }
>> +                xnthread_clear_info(thread, XNRMID | XNTIMEO);
>> +                xnthread_set_info(thread, XNBREAK);
>> +                goto unlock_and_exit;
>>             }
>> -            xnthread_clear_info(thread, XNRMID | XNTIMEO);
>> -            xnthread_set_info(thread, XNBREAK);
>> -            goto unlock_and_exit;
>> +            WARN_ON(1);
>>         }
>> #endif /* CONFIG_XENO_OPT_PERVASIVE */
>>
>> diff --git a/ksrc/nucleus/shadow.c b/ksrc/nucleus/shadow.c
>> index 38c1423..fc592a2 100644
>> --- a/ksrc/nucleus/shadow.c
>> +++ b/ksrc/nucleus/shadow.c
>> @@ -2696,6 +2696,8 @@ static inline void do_sigwake_event(struct 
>> task_struct *p)
>>         }
>>     }
>>
>> +    WARN_ON(!signal_pending(p));
>> +
>>     /*
>>      * If a relaxed thread is getting a signal while running, we
>>      * force it out of RPI, so that it won't keep a boosted
>> -- 
> 
> Finally, I was able to catch the warning(s). I got 38 of these before 
> the task ended:
> 
> [ 1109.336726] ------------[ cut here ]------------
> [ 1109.336737] WARNING: at kernel/xenomai/nucleus/pod.c:1392 
> xnpod_suspend_thread+0x17d/0x5a0()
> [ 1109.336746] Hardware name: X7SBA
> [ 1109.336755] Modules linked in: fr01_rtdm(O) netconsole configfs igb 
> dca e1000e e1000 r8169 rt_e1000(O) rt_r8169(O) rtpacket(O) rtnet(O) 
> coretemp psmouse microcode serio_raw shpchp lpc_ich i3200_edac video 
> edac_core floppy
> [ 1109.336908] Pid: 786, comm: ASYNC_TASK_1869 Tainted: G           O 
> 3.5.7-debug #38
> [ 1109.336917] Call Trace:
> [ 1109.336933]  [<ffffffff8103cdcf>] warn_slowpath_common+0x7f/0xc0
> [ 1109.336949]  [<ffffffff8103ce2a>] warn_slowpath_null+0x1a/0x20
> [ 1109.336965]  [<ffffffff8110b08d>] xnpod_suspend_thread+0x17d/0x5a0
> [ 1109.336981]  [<ffffffff81117054>] xnshadow_relax+0xf4/0x250
> [ 1109.336997]  [<ffffffff810cc73c>] ? __ipipe_restore_head+0x7c/0x100
> [ 1109.337013]  [<ffffffff811181db>] xnshadow_harden+0x30b/0x340
> [ 1109.337029]  [<ffffffff811183e0>] losyscall_event+0xb0/0x2f0
> [ 1109.337045]  [<ffffffff810d0829>] ipipe_syscall_hook+0x89/0xd0
> [ 1109.337061]  [<ffffffff810cbc68>] __ipipe_notify_syscall+0x158/0x340
> [ 1109.337076]  [<ffffffff8102167a>] __ipipe_syscall_root+0x4a/0x1f0
> [ 1109.337092]  [<ffffffff81367baa>] __ipipe_syscall_root_thunk+0x35/0x67
> [ 1109.337108]  [<ffffffff81682217>] ? system_call_after_swapgs+0x54/0x6d
> [ 1109.337117] ---[ end trace 9fc5fa66a7479311 ]---
> 
> The trace log is in the attachment.
> 

Thanks. Could you drop the previous instrumentation patches, and give a try at this one? It fixes a flaw in the logic for maintaining the thread information bits, which may have caused the issue you observed: 

diff --git a/ksrc/nucleus/pod.c b/ksrc/nucleus/pod.c
index 0a2ee19..22fa91d 100644
--- a/ksrc/nucleus/pod.c
+++ b/ksrc/nucleus/pod.c
@@ -1391,7 +1391,8 @@ void xnpod_suspend_thread(xnthread_t *thread, xnflags_t mask,
 		}
 #endif /* CONFIG_XENO_OPT_PERVASIVE */
 
-		xnthread_clear_info(thread, XNRMID | XNTIMEO | XNBREAK | XNWAKEN | XNROBBED);
+		xnthread_clear_info(thread, XNRMID | XNTIMEO | XNBREAK | \
+				    XNWAKEN | XNROBBED | XNKICKED);
 	}
 
 	/* Don't start the timer for a thread indefinitely delayed by

TIA,

-- 
Philippe.



* Re: [Xenomai] non-blocking rt_task_suspend(NULL)
  2014-05-06 12:57         ` Philippe Gerum
@ 2014-05-07 13:13           ` Petr Cervenka
  2014-05-08 15:53             ` Philippe Gerum
  0 siblings, 1 reply; 27+ messages in thread
From: Petr Cervenka @ 2014-05-07 13:13 UTC (permalink / raw)
  To: Philippe Gerum, Gilles Chanteperdrix; +Cc: Xenomai

> Od: Philippe Gerum <rpm@xenomai.org>
>> 
>> Here it is. It's full of modules and other perhaps not so real-time 
>> settings, because it was derived from Kubuntu kernel config file.
>> 
>
>Thanks. The issue seems to happen as a result of a relax -> harden transition racing with a signal receipt. To help me rule out some assumptions, could you please apply the patch below, and confirm that no job control is involved (SIGCONT/SIGSTOP) in your application?
>
>Knowing whether either of the two warnings added by this patch is issued when the bug happens would also help solve the issue, TIA.
>
>diff --git a/ksrc/nucleus/pod.c b/ksrc/nucleus/pod.c
>index 0a2ee19..9fb797f 100644
>--- a/ksrc/nucleus/pod.c
>+++ b/ksrc/nucleus/pod.c
>@@ -1379,15 +1379,17 @@ void xnpod_suspend_thread(xnthread_t *thread, xnflags_t mask,
> 		 * context, to collect and act upon the pending Linux
> 		 * signal.
> 		 */
>-		if ((mask & XNRELAX) == 0 &&
>-		    xnthread_test_info(thread, XNKICKED)) {
>-			if (wchan) {
>-				thread->wchan = wchan;
>-				xnsynch_forget_sleeper(thread);
>+		if (xnthread_test_info(thread, XNKICKED)) {
>+			if ((mask & XNRELAX) == 0) {
>+				if (wchan) {
>+					thread->wchan = wchan;
>+					xnsynch_forget_sleeper(thread);
>+				}
>+				xnthread_clear_info(thread, XNRMID | XNTIMEO);
>+				xnthread_set_info(thread, XNBREAK);
>+				goto unlock_and_exit;
> 			}
>-			xnthread_clear_info(thread, XNRMID | XNTIMEO);
>-			xnthread_set_info(thread, XNBREAK);
>-			goto unlock_and_exit;
>+			WARN_ON(1);
> 		}
> #endif /* CONFIG_XENO_OPT_PERVASIVE */
> 
>diff --git a/ksrc/nucleus/shadow.c b/ksrc/nucleus/shadow.c
>index 38c1423..fc592a2 100644
>--- a/ksrc/nucleus/shadow.c
>+++ b/ksrc/nucleus/shadow.c
>@@ -2696,6 +2696,8 @@ static inline void do_sigwake_event(struct task_struct *p)
> 		}
> 	}
> 
>+	WARN_ON(!signal_pending(p));
>+
> 	/*
> 	 * If a relaxed thread is getting a signal while running, we
> 	 * force it out of RPI, so that it won't keep a boosted
>-- 

Finally, I was able to catch the warning(s). I got 38 of these before the task ended:

[ 1109.336726] ------------[ cut here ]------------
[ 1109.336737] WARNING: at kernel/xenomai/nucleus/pod.c:1392 xnpod_suspend_thread+0x17d/0x5a0()
[ 1109.336746] Hardware name: X7SBA
[ 1109.336755] Modules linked in: fr01_rtdm(O) netconsole configfs igb dca e1000e e1000 r8169 rt_e1000(O) rt_r8169(O) rtpacket(O) rtnet(O) coretemp psmouse microcode serio_raw shpchp lpc_ich i3200_edac video edac_core floppy
[ 1109.336908] Pid: 786, comm: ASYNC_TASK_1869 Tainted: G           O 3.5.7-debug #38
[ 1109.336917] Call Trace:
[ 1109.336933]  [<ffffffff8103cdcf>] warn_slowpath_common+0x7f/0xc0
[ 1109.336949]  [<ffffffff8103ce2a>] warn_slowpath_null+0x1a/0x20
[ 1109.336965]  [<ffffffff8110b08d>] xnpod_suspend_thread+0x17d/0x5a0
[ 1109.336981]  [<ffffffff81117054>] xnshadow_relax+0xf4/0x250
[ 1109.336997]  [<ffffffff810cc73c>] ? __ipipe_restore_head+0x7c/0x100
[ 1109.337013]  [<ffffffff811181db>] xnshadow_harden+0x30b/0x340
[ 1109.337029]  [<ffffffff811183e0>] losyscall_event+0xb0/0x2f0
[ 1109.337045]  [<ffffffff810d0829>] ipipe_syscall_hook+0x89/0xd0
[ 1109.337061]  [<ffffffff810cbc68>] __ipipe_notify_syscall+0x158/0x340
[ 1109.337076]  [<ffffffff8102167a>] __ipipe_syscall_root+0x4a/0x1f0
[ 1109.337092]  [<ffffffff81367baa>] __ipipe_syscall_root_thunk+0x35/0x67
[ 1109.337108]  [<ffffffff81682217>] ? system_call_after_swapgs+0x54/0x6d
[ 1109.337117] ---[ end trace 9fc5fa66a7479311 ]---

The trace log is in the attachment.

Petr

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: freeze14.txt
URL: <http://www.xenomai.org/pipermail/xenomai/attachments/20140507/37a07957/attachment.txt>


* Re: [Xenomai] non-blocking rt_task_suspend(NULL)
  2014-05-06  9:29       ` Petr Cervenka
@ 2014-05-06 12:57         ` Philippe Gerum
  2014-05-07 13:13           ` Petr Cervenka
  0 siblings, 1 reply; 27+ messages in thread
From: Philippe Gerum @ 2014-05-06 12:57 UTC (permalink / raw)
  To: Petr Cervenka, Gilles Chanteperdrix; +Cc: Xenomai

On 05/06/2014 11:29 AM, Petr Cervenka wrote:

> 
> Here it is. It's full of modules and other perhaps not so real-time 
> settings, because it was derived from Kubuntu kernel config file.
> 

Thanks. The issue seems to happen as a result of a relax -> harden transition racing with a signal receipt. To help me rule out some assumptions, could you please apply the patch below, and confirm that no job control is involved (SIGCONT/SIGSTOP) in your application?

Knowing whether either of the two warnings added by this patch is issued when the bug happens would also help solve the issue, TIA.

diff --git a/ksrc/nucleus/pod.c b/ksrc/nucleus/pod.c
index 0a2ee19..9fb797f 100644
--- a/ksrc/nucleus/pod.c
+++ b/ksrc/nucleus/pod.c
@@ -1379,15 +1379,17 @@ void xnpod_suspend_thread(xnthread_t *thread, xnflags_t mask,
 		 * context, to collect and act upon the pending Linux
 		 * signal.
 		 */
-		if ((mask & XNRELAX) == 0 &&
-		    xnthread_test_info(thread, XNKICKED)) {
-			if (wchan) {
-				thread->wchan = wchan;
-				xnsynch_forget_sleeper(thread);
+		if (xnthread_test_info(thread, XNKICKED)) {
+			if ((mask & XNRELAX) == 0) {
+				if (wchan) {
+					thread->wchan = wchan;
+					xnsynch_forget_sleeper(thread);
+				}
+				xnthread_clear_info(thread, XNRMID | XNTIMEO);
+				xnthread_set_info(thread, XNBREAK);
+				goto unlock_and_exit;
 			}
-			xnthread_clear_info(thread, XNRMID | XNTIMEO);
-			xnthread_set_info(thread, XNBREAK);
-			goto unlock_and_exit;
+			WARN_ON(1);
 		}
 #endif /* CONFIG_XENO_OPT_PERVASIVE */
 
diff --git a/ksrc/nucleus/shadow.c b/ksrc/nucleus/shadow.c
index 38c1423..fc592a2 100644
--- a/ksrc/nucleus/shadow.c
+++ b/ksrc/nucleus/shadow.c
@@ -2696,6 +2696,8 @@ static inline void do_sigwake_event(struct task_struct *p)
 		}
 	}
 
+	WARN_ON(!signal_pending(p));
+
 	/*
 	 * If a relaxed thread is getting a signal while running, we
 	 * force it out of RPI, so that it won't keep a boosted
-- 
Philippe.



* Re: [Xenomai] non-blocking rt_task_suspend(NULL)
  2014-05-06  8:56     ` Philippe Gerum
@ 2014-05-06  9:29       ` Petr Cervenka
  2014-05-06 12:57         ` Philippe Gerum
  0 siblings, 1 reply; 27+ messages in thread
From: Petr Cervenka @ 2014-05-06  9:29 UTC (permalink / raw)
  To: Philippe Gerum, Gilles Chanteperdrix; +Cc: Xenomai

> Od: Philippe Gerum <rpm@xenomai.org>
>
> CC: "Xenomai" <xenomai@xenomai.org>
>On 05/06/2014 10:17 AM, Petr Cervenka wrote:
>>> Od: Philippe Gerum <rpm@xenomai.org>
>>>
>>> CC: "Xenomai" <xenomai@xenomai.org>
>>> On 05/02/2014 02:13 PM, Petr Cervenka wrote:
>>>>> Od: "Petr Cervenka" <grugh@centrum.cz>
>>>>>
>>>>> CC: "Xenomai" <xenomai@xenomai.org>
>>>>>> Od: Gilles Chanteperdrix <gilles.chanteperdrix@xenomai.org>
>>>>>>
>>>>>> CC: "Xenomai" <xenomai@xenomai.org>
>>>>>> On 04/24/2014 05:06 PM, Petr Cervenka wrote:
>>>>>>>> Od: Gilles Chanteperdrix <gilles.chanteperdrix@xenomai.org>
>>>>>>>>
>>>>>>>>> SIGDEBUG signal was not received. Task status from
>>>>>>>>> rt_task_inquire() was 0x300180 or 0x300380 (depends where it is
>>>>>>>>> placed) When the task is in the "wrong" state, also the call of
>>>>>>>>> rt_task_sleep(100000) is returning permanently -EINTR code. Do
>>>>>>>>> you have any other idea what to check or what can cause perhaps
>>>>>>>>> every xenomai call fail with -EINTR in one task?
>>>>>>>>
>>>>>>>> If I had to debug this issue, I would enable the I-pipe tracer and
>>>>>>>> trigger a trace freeze when the -EINTR code is received. With
>>>>>>>> enough trace points, it should be possible to understand what
>>>>>>>> happens.
>>>>>>>>
>>>>>>> I called a xntrace_user_freeze() immediately when the issue occurs,
>>>>>>> but I simply don't understand what is happening there. The trace
>>>>>>> output is in the attachment. Could you please help me to understand
>>>>>>> it?
>>>>>>>
>>>>>>> I also got some minor problem with xntrace_user_freeze, because the
>>>>>>> linker was not able to find it: asyncwriter.cpp:(.text+0x843):
>>>>>>> undefined reference to `xntrace_user_freeze(unsigned long, int)' It
>>>>>>> is defined in src/skins/common/trace.c and (should be) contained in
>>>>>>> libxenomai.so. But I was not successful and I had to define it myself
>>>>>>> (under different name). Version of xenomai is 2.6.3.
>>>>>>
>>>>>> We see that xnpod_suspend_thread returns immediately, likely
>>>>>> because it
>>>>>> has the XNKICKED bit set. Could you add more back trace points? So
>>>>>> that
>>>>>> we see what is setting the XNKICKED bit?
>>>>>>
>>>>>
>>>>> I have added (maybe too much) back trace points. But as last time,
>>>>> I'm not able to see (almost) anything in it ;-)
>>>>> Previous xnpod_suspend_thread (on line 3713, probably caused by
>>>>> rt_mutex_acquire) seems to be fine (for me;-) ).
>>>>>
>>>>
>>>> Could you please help me analyze what is happening in the
>>>> trace log?
>>>> At the end I only see pieces of the log, which are already
>>>> contained somewhere before.
>>>> For example lines 3832-3993 are the same as 2901-3062.
>>>> Also 3697-3861 and 3065-3229.
>>>> Also 3596-3692 and 2935-3058
>>>> Also 3065-3550 and 2402-2887
>>>>
>>>> There are also suspicious lines with "device_not_available" and
>>>> "ipipe_handle_exception", but they seem to be regularly appearing.
>>>>
>>>
>>> Please add these bits to your trace setup:
>>>
>>> diff --git a/ksrc/nucleus/shadow.c b/ksrc/nucleus/shadow.c
>>> index 13865a3..ccc54a8 100644
>>> --- a/ksrc/nucleus/shadow.c
>>> +++ b/ksrc/nucleus/shadow.c
>>> @@ -2516,6 +2516,10 @@ int do_losyscall_event(unsigned event,
>>> rthal_pipeline_stage_t *stage,
>>>           * We may have gained a shadow TCB from the syscall we
>>>           * just invoked, so make sure to fetch it.
>>>           */
>>> +        if (__xn_interrupted_p(regs)) {
>>> +            xntrace_special(0x44, thread->info);
>>> +            xntrace_special(0x55, signal_pending(current));
>>> +        }
>>>          thread = xnshadow_thread(current);
>>>          if (signal_pending(current)) {
>>>              sigs = 1;
>>>
>>>
>>> TIA,
>>
>> I have applied the patch and here is the trace log. It took a while
>> because some change I made caused lesser probability of the problem.
>> Short version:
>> :   + (0x44)    0x0000000c    -3      0.097  losyscall_event+0x270
>> (ipipe_syscall_hook+0x89)
>> :   + (0x55)    0x00000000    -3      0.119  losyscall_event+0x292
>> (ipipe_syscall_hook+0x89)
>> Long version is in the attachment.
>
>Could you post your kernel config?
>

Here it is. It's full of modules and other perhaps not so real-time settings, because it was derived from Kubuntu kernel config file.

Petr
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: kernel_config_debug.txt
URL: <http://www.xenomai.org/pipermail/xenomai/attachments/20140506/38f80d30/attachment.txt>


* Re: [Xenomai] non-blocking rt_task_suspend(NULL)
  2014-05-06  8:17   ` Petr Cervenka
  2014-05-06  8:39     ` Philippe Gerum
@ 2014-05-06  8:56     ` Philippe Gerum
  2014-05-06  9:29       ` Petr Cervenka
  1 sibling, 1 reply; 27+ messages in thread
From: Philippe Gerum @ 2014-05-06  8:56 UTC (permalink / raw)
  To: Petr Cervenka, Gilles Chanteperdrix; +Cc: Xenomai

On 05/06/2014 10:17 AM, Petr Cervenka wrote:
>> Od: Philippe Gerum <rpm@xenomai.org>
>>
>> CC: "Xenomai" <xenomai@xenomai.org>
>> On 05/02/2014 02:13 PM, Petr Cervenka wrote:
>>>> Od: "Petr Cervenka" <grugh@centrum.cz>
>>>>
>>>> CC: "Xenomai" <xenomai@xenomai.org>
>>>>> Od: Gilles Chanteperdrix <gilles.chanteperdrix@xenomai.org>
>>>>>
>>>>> CC: "Xenomai" <xenomai@xenomai.org>
>>>>> On 04/24/2014 05:06 PM, Petr Cervenka wrote:
>>>>>>> Od: Gilles Chanteperdrix <gilles.chanteperdrix@xenomai.org>
>>>>>>>
>>>>>>>> SIGDEBUG signal was not received. Task status from
>>>>>>>> rt_task_inquire() was 0x300180 or 0x300380 (depends where it is
>>>>>>>> placed) When the task is in the "wrong" state, also the call of
>>>>>>>> rt_task_sleep(100000) is returning permanently -EINTR code. Do
>>>>>>>> you have any other idea what to check or what can cause perhaps
>>>>>>>> every xenomai call fail with -EINTR in one task?
>>>>>>>
>>>>>>> If I had to debug this issue, I would enable the I-pipe tracer and
>>>>>>> trigger a trace freeze when the -EINTR code is received. With
>>>>>>> enough trace points, it should be possible to understand what
>>>>>>> happens.
>>>>>>>
>>>>>> I called a xntrace_user_freeze() immediately when the issue occurs,
>>>>>> but I simply don't understand what is happening there. The trace
>>>>>> output is in the attachment. Could you please help me to understand
>>>>>> it?
>>>>>>
>>>>>> I also got some minor problem with xntrace_user_freeze, because the
>>>>>> linker was not able to find it: asyncwriter.cpp:(.text+0x843):
>>>>>> undefined reference to `xntrace_user_freeze(unsigned long, int)' It
>>>>>> is defined in src/skins/common/trace.c and (should be) contained in
>>>>>> libxenomai.so. But I was not successful and I had to define it myself
>>>>>> (under different name). Version of xenomai is 2.6.3.
>>>>>
>>>>> We see that xnpod_suspend_thread returns immediately, likely
>>>>> because it
>>>>> has the XNKICKED bit set. Could you add more back trace points? So
>>>>> that
>>>>> we see what is setting the XNKICKED bit?
>>>>>
>>>>
>>>> I have added (maybe too much) back trace points. But as last time,
>>>> I'm not able to see (almost) anything in it ;-)
>>>> Previous xnpod_suspend_thread (on line 3713, probably caused by
>>>> rt_mutex_acquire) seems to be fine (for me;-) ).
>>>>
>>>
>>> Could you please help me analyze what is happening in the
>>> trace log?
>>> At the end I only see pieces of the log, which are already
>>> contained somewhere before.
>>> For example lines 3832-3993 are the same as 2901-3062.
>>> Also 3697-3861 and 3065-3229.
>>> Also 3596-3692 and 2935-3058
>>> Also 3065-3550 and 2402-2887
>>>
>>> There are also suspicious lines with "device_not_available" and
>>> "ipipe_handle_exception", but they seem to be regularly appearing.
>>>
>>
>> Please add these bits to your trace setup:
>>
>> diff --git a/ksrc/nucleus/shadow.c b/ksrc/nucleus/shadow.c
>> index 13865a3..ccc54a8 100644
>> --- a/ksrc/nucleus/shadow.c
>> +++ b/ksrc/nucleus/shadow.c
>> @@ -2516,6 +2516,10 @@ int do_losyscall_event(unsigned event,
>> rthal_pipeline_stage_t *stage,
>>           * We may have gained a shadow TCB from the syscall we
>>           * just invoked, so make sure to fetch it.
>>           */
>> +        if (__xn_interrupted_p(regs)) {
>> +            xntrace_special(0x44, thread->info);
>> +            xntrace_special(0x55, signal_pending(current));
>> +        }
>>          thread = xnshadow_thread(current);
>>          if (signal_pending(current)) {
>>              sigs = 1;
>>
>>
>> TIA,
>
> I have applied the patch and here is the trace log. It took a while
> because some change I made caused lesser probability of the problem.
> Short version:
> :   + (0x44)    0x0000000c    -3      0.097  losyscall_event+0x270
> (ipipe_syscall_hook+0x89)
> :   + (0x55)    0x00000000    -3      0.119  losyscall_event+0x292
> (ipipe_syscall_hook+0x89)
> Long version is in the attachment.

Could you post your kernel config?

TIA,

-- 
Philippe.



* Re: [Xenomai] non-blocking rt_task_suspend(NULL)
  2014-05-06  8:17   ` Petr Cervenka
@ 2014-05-06  8:39     ` Philippe Gerum
  2014-05-06  8:56     ` Philippe Gerum
  1 sibling, 0 replies; 27+ messages in thread
From: Philippe Gerum @ 2014-05-06  8:39 UTC (permalink / raw)
  To: Petr Cervenka, Gilles Chanteperdrix; +Cc: Xenomai

On 05/06/2014 10:17 AM, Petr Cervenka wrote:
>> Od: Philippe Gerum <rpm@xenomai.org>
>>
>> CC: "Xenomai" <xenomai@xenomai.org>
>> On 05/02/2014 02:13 PM, Petr Cervenka wrote:
>>>> Od: "Petr Cervenka" <grugh@centrum.cz>
>>>>
>>>> CC: "Xenomai" <xenomai@xenomai.org>
>>>>> Od: Gilles Chanteperdrix <gilles.chanteperdrix@xenomai.org>
>>>>>
>>>>> CC: "Xenomai" <xenomai@xenomai.org>
>>>>> On 04/24/2014 05:06 PM, Petr Cervenka wrote:
>>>>>>> Od: Gilles Chanteperdrix <gilles.chanteperdrix@xenomai.org>
>>>>>>>
>>>>>>>> SIGDEBUG signal was not received. Task status from
>>>>>>>> rt_task_inquire() was 0x300180 or 0x300380 (depends where it is
>>>>>>>> placed) When the task is in the "wrong" state, also the call of
>>>>>>>> rt_task_sleep(100000) is returning permanently -EINTR code. Do
>>>>>>>> you have any other idea what to check or what can cause perhaps
>>>>>>>> every xenomai call fail with -EINTR in one task?
>>>>>>>
>>>>>>> If I had to debug this issue, I would enable the I-pipe tracer and
>>>>>>> trigger a trace freeze when the -EINTR code is received. With
>>>>>>> enough trace points, it should be possible to understand what
>>>>>>> happens.
>>>>>>>
>>>>>> I called a xntrace_user_freeze() immediately when the issue occurs,
>>>>>> but I simply don't understand what is happening there. The trace
>>>>>> output is in the attachment. Could you please help me to understand
>>>>>> it?
>>>>>>
>>>>>> I also got some minor problem with xntrace_user_freeze, because the
>>>>>> linker was not able to find it: asyncwriter.cpp:(.text+0x843):
>>>>>> undefined reference to `xntrace_user_freeze(unsigned long, int)' It
>>>>>> is defined in src/skins/common/trace.c and (should be) contained in
>>>>>> libxenomai.so. But I was not successful and I had to define it myself
>>>>>> (under different name). Version of xenomai is 2.6.3.
>>>>>
>>>>> We see that xnpod_suspend_thread returns immediately, likely
>>>>> because it
>>>>> has the XNKICKED bit set. Could you add more back trace points? So
>>>>> that
>>>>> we see what is setting the XNKICKED bit?
>>>>>
>>>>
>>>> I have added (maybe too much) back trace points. But as last time,
>>>> I'm not able to see (almost) anything in it ;-)
>>>> Previous xnpod_suspend_thread (on line 3713, probably caused by
>>>> rt_mutex_acquire) seems to be fine (for me;-) ).
>>>>
>>>
>>> Could you please help me analyze what is happening in the
>>> trace log?
>>> At the end I only see pieces of the log, which are already
>>> contained somewhere before.
>>> For example lines 3832-3993 are the same as 2901-3062.
>>> Also 3697-3861 and 3065-3229.
>>> Also 3596-3692 and 2935-3058
>>> Also 3065-3550 and 2402-2887
>>>
>>> There are also suspicious lines with "device_not_available" and
>>> "ipipe_handle_exception", but they seem to be regularly appearing.
>>>
>>
>> Please add these bits to your trace setup:
>>
>> diff --git a/ksrc/nucleus/shadow.c b/ksrc/nucleus/shadow.c
>> index 13865a3..ccc54a8 100644
>> --- a/ksrc/nucleus/shadow.c
>> +++ b/ksrc/nucleus/shadow.c
>> @@ -2516,6 +2516,10 @@ int do_losyscall_event(unsigned event,
>> rthal_pipeline_stage_t *stage,
>>           * We may have gained a shadow TCB from the syscall we
>>           * just invoked, so make sure to fetch it.
>>           */
>> +        if (__xn_interrupted_p(regs)) {
>> +            xntrace_special(0x44, thread->info);
>> +            xntrace_special(0x55, signal_pending(current));
>> +        }
>>          thread = xnshadow_thread(current);
>>          if (signal_pending(current)) {
>>              sigs = 1;
>>
>>
>> TIA,
>
> I have applied the patch and here is the trace log. It took a while
> because some change I made caused lesser probability of the problem.
> Short version:
> :   + (0x44)    0x0000000c    -3      0.097  losyscall_event+0x270
> (ipipe_syscall_hook+0x89)
> :   + (0x55)    0x00000000    -3      0.119  losyscall_event+0x292
> (ipipe_syscall_hook+0x89)
> Long version is in the attachment.
>

Ok, thanks. Things are getting clearer: the real-time core believes it 
has kicked a thread out of sleep to receive a Linux signal, but Linux 
does not have any signal to process for this thread. So the real-time 
core ends up propagating EINTR to userland, as nobody clears the kick 
notification Xenomai-wise. Something is definitely wrong in the 
Xenomai core logic here.

And this bug is likely present in Xenomai 3/Cobalt as well.

-- 
Philippe.



* Re: [Xenomai] non-blocking rt_task_suspend(NULL)
  2014-05-02 13:16 ` Philippe Gerum
@ 2014-05-06  8:17   ` Petr Cervenka
  2014-05-06  8:39     ` Philippe Gerum
  2014-05-06  8:56     ` Philippe Gerum
  0 siblings, 2 replies; 27+ messages in thread
From: Petr Cervenka @ 2014-05-06  8:17 UTC (permalink / raw)
  To: Philippe Gerum, Gilles Chanteperdrix; +Cc: Xenomai

> Od: Philippe Gerum <rpm@xenomai.org>
>
> CC: "Xenomai" <xenomai@xenomai.org>
>On 05/02/2014 02:13 PM, Petr Cervenka wrote:
>>> Od: "Petr Cervenka" <grugh@centrum.cz>
>>>
>>> CC: "Xenomai" <xenomai@xenomai.org>
>>>> Od: Gilles Chanteperdrix <gilles.chanteperdrix@xenomai.org>
>>>>
>>>> CC: "Xenomai" <xenomai@xenomai.org>
>>>> On 04/24/2014 05:06 PM, Petr Cervenka wrote:
>>>>>> Od: Gilles Chanteperdrix <gilles.chanteperdrix@xenomai.org>
>>>>>>
>>>>>>> SIGDEBUG signal was not received. Task status from
>>>>>>> rt_task_inquire() was 0x300180 or 0x300380 (depends where it is
>>>>>>> placed) When the task is in the "wrong" state, also the call of
>>>>>>> rt_task_sleep(100000) is returning permanently -EINTR code. Do
>>>>>>> you have any other idea what to check or what can cause perhaps
>>>>>>> every xenomai call fail with -EINTR in one task?
>>>>>>
>>>>>> If I had to debug this issue, I would enable the I-pipe tracer and
>>>>>> trigger a trace freeze when the -EINTR code is received. With
>>>>>> enough trace points, it should be possible to understand what
>>>>>> happens.
>>>>>>
>>>>> I called a xntrace_user_freeze() immediately when the issue occurs,
>>>>> but I simply don't understand what is happening there. The trace
>>>>> output is in the attachment. Could you please help me to understand
>>>>> it?
>>>>>
>>>>> I also got some minor problem with xntrace_user_freeze, because the
>>>>> linker was not able to find it: asyncwriter.cpp:(.text+0x843):
>>>>> undefined reference to `xntrace_user_freeze(unsigned long, int)' It
>>>>> is defined in src/skins/common/trace.c and (should be) contained in
>>>>> libxenomai.so. But I was not successful and I had to define it myself
>>>>> (under different name). Version of xenomai is 2.6.3.
>>>>
>>>> We see that xnpod_suspend_thread returns immediately, likely because it
>>>> has the XNKICKED bit set. Could you add more back trace points? So that
>>>> we see what is setting the XNKICKED bit?
>>>>
>>>
>>> I have added (maybe too much) back trace points. But as last time, I'm not able to see (almost) anything in it ;-)
>>> Previous xnpod_suspend_thread (on line 3713, probably caused by rt_mutex_acquire) seems to be fine (for me;-) ).
>>>
>>
>> Could you please help me analyze what is happening in the trace log?
>> At the end I only see pieces of the log, which are already contained somewhere before.
>> For example lines 3832-3993 are the same as 2901-3062.
>> Also 3697-3861 and 3065-3229.
>> Also 3596-3692 and 2935-3058
>> Also 3065-3550 and 2402-2887
>>
>> There are also suspicious lines with "device_not_available" and "ipipe_handle_exception", but they seem to be regularly appearing.
>>
>
>Please add these bits to your trace setup:
>
>diff --git a/ksrc/nucleus/shadow.c b/ksrc/nucleus/shadow.c
>index 13865a3..ccc54a8 100644
>--- a/ksrc/nucleus/shadow.c
>+++ b/ksrc/nucleus/shadow.c
>@@ -2516,6 +2516,10 @@ int do_losyscall_event(unsigned event, 
>rthal_pipeline_stage_t *stage,
>  		 * We may have gained a shadow TCB from the syscall we
>  		 * just invoked, so make sure to fetch it.
>  		 */
>+		if (__xn_interrupted_p(regs)) {
>+			xntrace_special(0x44, thread->info);
>+			xntrace_special(0x55, signal_pending(current));
>+		}
>  		thread = xnshadow_thread(current);
>  		if (signal_pending(current)) {
>  			sigs = 1;
>
>
>TIA,

I have applied the patch and here is the trace log. It took a while because some change I made caused lesser probability of the problem.
Short version:
:   + (0x44)	0x0000000c    -3	  0.097  losyscall_event+0x270 (ipipe_syscall_hook+0x89)
:   + (0x55)	0x00000000    -3	  0.119  losyscall_event+0x292 (ipipe_syscall_hook+0x89)
Long version is in the attachment.

Petr
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: freeze13.txt
URL: <http://www.xenomai.org/pipermail/xenomai/attachments/20140506/aad6f1af/attachment.txt>


* Re: [Xenomai] non-blocking rt_task_suspend(NULL)
  2014-05-02 12:13 Petr Cervenka
  2014-05-02 12:30 ` Gilles Chanteperdrix
@ 2014-05-02 13:16 ` Philippe Gerum
  2014-05-06  8:17   ` Petr Cervenka
  1 sibling, 1 reply; 27+ messages in thread
From: Philippe Gerum @ 2014-05-02 13:16 UTC (permalink / raw)
  To: Petr Cervenka, Gilles Chanteperdrix; +Cc: Xenomai

On 05/02/2014 02:13 PM, Petr Cervenka wrote:
>> Od: "Petr Cervenka" <grugh@centrum.cz>
>>
>> CC: "Xenomai" <xenomai@xenomai.org>
>>> Od: Gilles Chanteperdrix <gilles.chanteperdrix@xenomai.org>
>>>
>>> CC: "Xenomai" <xenomai@xenomai.org>
>>> On 04/24/2014 05:06 PM, Petr Cervenka wrote:
>>>>> Od: Gilles Chanteperdrix <gilles.chanteperdrix@xenomai.org>
>>>>>
>>>>>> SIGDEBUG signal was not received. Task status from
>>>>>> rt_task_inquire() was 0x300180 or 0x300380 (depends where it is
>>>>>> placed) When the task is in the "wrong" state, also the call of
>>>>>> rt_task_sleep(100000) is returning permanently -EINTR code. Do
>>>>>> you have any other idea what to check or what can cause perhaps
>>>>>> every xenomai call fail with -EINTR in one task?
>>>>>
>>>>> If I had to debug this issue, I would enable the I-pipe tracer and
>>>>> trigger a trace freeze when the -EINTR code is received. With
>>>>> enough trace points, it should be possible to understand what
>>>>> happens.
>>>>>
>>>> I called a xntrace_user_freeze() immediately when the issue occurs,
>>>> but I simply don't understand what is happening there. The trace
>>>> output is in the attachment. Could you please help me to understand
>>>> it?
>>>>
>>>> I also got some minor problem with xntrace_user_freeze, because the
>>>> linker was not able to find it: asyncwriter.cpp:(.text+0x843):
>>>> undefined reference to `xntrace_user_freeze(unsigned long, int)' It
>>>> is defined in src/skins/common/trace.c and (should be) contained in
>>>> libxenomai.so. But I was not successful and I had to define it myself
>>>> (under different name). Version of xenomai is 2.6.3.
>>>
>>> We see that xnpod_suspend_thread returns immediately, likely because it
>>> has the XNKICKED bit set. Could you add more back trace points? So that
>>> we see what is setting the XNKICKED bit?
>>>
>>
>> I have added (maybe too much) back trace points. But as last time, I'm not able to see (almost) anything in it ;-)
>> Previous xnpod_suspend_thread (on line 3713, probably caused by rt_mutex_acquire) seems to be fine (for me;-) ).
>>
>
> Could you please help me analyze what is happening in the trace log?
> At the end I only see pieces of the log, which are already contained somewhere before.
> For example lines 3832-3993 are the same as 2901-3062.
> Also 3697-3861 and 3065-3229.
> Also 3596-3692 and 2935-3058
> Also 3065-3550 and 2402-2887
>
> There are also suspicious lines with "device_not_available" and "ipipe_handle_exception", but they seem to be regularly appearing.
>

Please add these bits to your trace setup:

diff --git a/ksrc/nucleus/shadow.c b/ksrc/nucleus/shadow.c
index 13865a3..ccc54a8 100644
--- a/ksrc/nucleus/shadow.c
+++ b/ksrc/nucleus/shadow.c
@@ -2516,6 +2516,10 @@ int do_losyscall_event(unsigned event, 
rthal_pipeline_stage_t *stage,
  		 * We may have gained a shadow TCB from the syscall we
  		 * just invoked, so make sure to fetch it.
  		 */
+		if (__xn_interrupted_p(regs)) {
+			xntrace_special(0x44, thread->info);
+			xntrace_special(0x55, signal_pending(current));
+		}
  		thread = xnshadow_thread(current);
  		if (signal_pending(current)) {
  			sigs = 1;


TIA,

-- 
Philippe.



* Re: [Xenomai] non-blocking rt_task_suspend(NULL)
  2014-05-02 12:13 Petr Cervenka
@ 2014-05-02 12:30 ` Gilles Chanteperdrix
  2014-05-02 13:16 ` Philippe Gerum
  1 sibling, 0 replies; 27+ messages in thread
From: Gilles Chanteperdrix @ 2014-05-02 12:30 UTC (permalink / raw)
  To: Petr Cervenka; +Cc: Xenomai

On 05/02/2014 02:13 PM, Petr Cervenka wrote:
>> Od: "Petr Cervenka" <grugh@centrum.cz>
>>
>> CC: "Xenomai" <xenomai@xenomai.org>
>>> Od: Gilles Chanteperdrix <gilles.chanteperdrix@xenomai.org>
>>>
>>> CC: "Xenomai" <xenomai@xenomai.org>
>>> On 04/24/2014 05:06 PM, Petr Cervenka wrote:
>>>>> Od: Gilles Chanteperdrix <gilles.chanteperdrix@xenomai.org>
>>>>>
>>>>>> SIGDEBUG signal was not received. Task status from
>>>>>> rt_task_inquire() was 0x300180 or 0x300380 (depends where it is
>>>>>> placed) When the task is in the "wrong" state, also the call of
>>>>>> rt_task_sleep(100000) is returning permanently -EINTR code. Do
>>>>>> you have any other idea what to check or what can cause perhaps
>>>>>> every xenomai call fail with -EINTR in one task?
>>>>>
>>>>> If I had to debug this issue, I would enable the I-pipe tracer and
>>>>> trigger a trace freeze when the -EINTR code is received. With
>>>>> enough trace points, it should be possible to understand what
>>>>> happens.
>>>>>
>>>> I called a xntrace_user_freeze() immediately when the issue occurs,
>>>> but I simply don't understand what is happening there. The trace
>>>> output is in the attachment. Could you please help me to understand
>>>> it?
>>>>
>>>> I also got some minor problem with xntrace_user_freeze, because the
>>>> linker was not able to find it: asyncwriter.cpp:(.text+0x843):
>>>> undefined reference to `xntrace_user_freeze(unsigned long, int)' It
>>>> is defined in src/skins/common/trace.c and (should be) contained in
>>>> libxenomai.so. But I was not successful and I had to define it myself
>>>> (under different name). Version of xenomai is 2.6.3.
>>>
>>> We see that xnpod_suspend_thread returns immediately, likely because it 
>>> has the XNKICKED bit set. Could you add more back trace points? So that 
>>> we see what is setting the XNKICKED bit?
>>>
>>
>> I have added (maybe too much) back trace points. But as last time, I'm not able to see (almost) anything in it ;-)
>> Previous xnpod_suspend_thread (on line 3713, probably caused by rt_mutex_acquire) seems to be fine (for me;-) ).
>>
> 
> Could you please help me analyze what is happening in the trace log?
> At the end I only see pieces of the log, which are already contained somewhere before.
> For example lines 3832-3993 are the same as 2901-3062.
> Also 3697-3861 and 3065-3229.
> Also 3596-3692 and 2935-3058
> Also 3065-3550 and 2402-2887
> 
> There are also suspicious lines with "device_not_available" and "ipipe_handle_exception", but they seem to be regularly appearing.

I will probably not have time to have a look at it before the end of the
week-end. I had a quick glance and did not find what I expected. We will
have to add some calls to ipipe_trace_special to get the values of
the task state or info flags in the trace.

-- 
Gilles.



* Re: [Xenomai] non-blocking rt_task_suspend(NULL)
@ 2014-05-02 12:13 Petr Cervenka
  2014-05-02 12:30 ` Gilles Chanteperdrix
  2014-05-02 13:16 ` Philippe Gerum
  0 siblings, 2 replies; 27+ messages in thread
From: Petr Cervenka @ 2014-05-02 12:13 UTC (permalink / raw)
  To: Gilles Chanteperdrix; +Cc: Xenomai

> Od: "Petr Cervenka" <grugh@centrum.cz>
>
> CC: "Xenomai" <xenomai@xenomai.org>
>> Od: Gilles Chanteperdrix <gilles.chanteperdrix@xenomai.org>
>>
>> CC: "Xenomai" <xenomai@xenomai.org>
>>On 04/24/2014 05:06 PM, Petr Cervenka wrote:
>>>> Od: Gilles Chanteperdrix <gilles.chanteperdrix@xenomai.org>
>>>>
>>>>> SIGDEBUG signal was not received. Task status from
>>>>> rt_task_inquire() was 0x300180 or 0x300380 (depends where it is
>>>>> placed) When the task is in the "wrong" state, also the call of
>>>>> rt_task_sleep(100000) is returning permanently -EINTR code. Do
>>>>> you have any other idea what to check or what can cause perhaps
>>>>> every xenomai call fail with -EINTR in one task?
>>>>
>>>> If I had to debug this issue, I would enable the I-pipe tracer and
>>>> trigger a trace freeze when the -EINTR code is received. With
>>>> enough trace points, it should be possible to understand what
>>>> happens.
>>>>
>>> I called a xntrace_user_freeze() immediately when the issue occurs,
>>> but I simply don't understand what is happening there. The trace
>>> output is in the attachment. Could you please help me to understand
>>> it?
>>>
>>> I also got some minor problem with xntrace_user_freeze, because the
>>> linker was not able to find it: asyncwriter.cpp:(.text+0x843):
>>> undefined reference to `xntrace_user_freeze(unsigned long, int)' It
>>> is defined in src/skins/common/trace.c and (should be) contained in
>>> libxenomai.so. But I was not successful and I had to define it myself
>>> (under different name). Version of xenomai is 2.6.3.
>>
>>We see that xnpod_suspend_thread returns immediately, likely because it 
>>has the XNKICKED bit set. Could you add more back trace points? So that 
>>we see what is setting the XNKICKED bit?
>>
>
>I have added (maybe too much) back trace points. But as last time, I'm not able to see (almost) anything in it ;-)
>Previous xnpod_suspend_thread (on line 3713, probably caused by rt_mutex_acquire) seems to be fine (for me;-) ).
>

Could you please help me analyze what is happening in the trace log?
At the end I only see pieces of the log, which are already contained somewhere before.
For example lines 3832-3993 are the same as 2901-3062.
Also 3697-3861 and 3065-3229.
Also 3596-3692 and 2935-3058
Also 3065-3550 and 2402-2887

There are also suspicious lines with "device_not_available" and "ipipe_handle_exception", but they seem to be regularly appearing.

Thank you in advance.
Petr



* Re: [Xenomai] non-blocking rt_task_suspend(NULL)
  2014-04-24 17:53         ` Gilles Chanteperdrix
@ 2014-04-25  8:38           ` Petr Cervenka
  0 siblings, 0 replies; 27+ messages in thread
From: Petr Cervenka @ 2014-04-25  8:38 UTC (permalink / raw)
  To: Gilles Chanteperdrix; +Cc: Xenomai


> Od: Gilles Chanteperdrix <gilles.chanteperdrix@xenomai.org>
>
> CC: "Xenomai" <xenomai@xenomai.org>
>On 04/24/2014 05:06 PM, Petr Cervenka wrote:
>>> Od: Gilles Chanteperdrix <gilles.chanteperdrix@xenomai.org>
>>>
>>>> SIGDEBUG signal was not received. Task status from
>>>> rt_task_inquire() was 0x300180 or 0x300380 (depends where it is
>>>> placed) When the task is in the "wrong" state, also the call of
>>>> rt_task_sleep(100000) is returning permanently -EINTR code. Do
>>>> you have any other idea what to check or what can cause perhaps
>>>> every xenomai call fail with -EINTR in one task?
>>>
>>> If I had to debug this issue, I would enable the I-pipe tracer and
>>> trigger a trace freeze when the -EINTR code is received. With
>>> enough trace points, it should be possible to understand what
>>> happens.
>>>
>> I called a xntrace_user_freeze() immediately when the issue occurs,
>> but I simply don't understand what is happening there. The trace
>> output is in the attachment. Could you please help me to understand
>> it?
>>
>> I also got some minor problem with xntrace_user_freeze, because the
>> linker was not able to find it: asyncwriter.cpp:(.text+0x843):
>> undefined reference to `xntrace_user_freeze(unsigned long, int)' It
>> is defined in src/skins/common/trace.c and (should be) contained in
>> libxenomai.so. But I was not successful and I had to define it myself
>> (under different name). Version of xenomai is 2.6.3.
>
>We see that xnpod_suspend_thread returns immediately, likely because it 
>has the XNKICKED bit set. Could you add more back trace points? So that 
>we see what is setting the XNKICKED bit?
>

I have added (maybe too much) back trace points. But as last time, I'm not able to see (almost) anything in it ;-)
Previous xnpod_suspend_thread (on line 3713, probably caused by rt_mutex_acquire) seems to be fine (for me;-) ).

Petr
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: freeze7.txt
URL: <http://www.xenomai.org/pipermail/xenomai/attachments/20140425/4288c679/attachment.txt>


* Re: [Xenomai] non-blocking rt_task_suspend(NULL)
  2014-04-24 15:06       ` Petr Cervenka
@ 2014-04-24 17:53         ` Gilles Chanteperdrix
  2014-04-25  8:38           ` Petr Cervenka
  0 siblings, 1 reply; 27+ messages in thread
From: Gilles Chanteperdrix @ 2014-04-24 17:53 UTC (permalink / raw)
  To: Petr Cervenka; +Cc: Xenomai

On 04/24/2014 05:06 PM, Petr Cervenka wrote:
>> Od: Gilles Chanteperdrix <gilles.chanteperdrix@xenomai.org>
>>
>>> SIGDEBUG signal was not received. Task status from
>>> rt_task_inquire() was 0x300180 or 0x300380 (depends where it is
>>> placed) When the task is in the "wrong" state, also the call of
>>> rt_task_sleep(100000) is returning permanently -EINTR code. Do
>>> you have any other idea what to check or what can cause perhaps
>>> every xenomai call fail with -EINTR in one task?
>>
>> If I had to debug this issue, I would enable the I-pipe tracer and
>> trigger a trace freeze when the -EINTR code is received. With
>> enough trace points, it should be possible to understand what
>> happens.
>>
> I called a xntrace_user_freeze() immediately when the issue occurs,
> but I simply don't understand what is happening there. The trace
> output is in the attachment. Could you please help me to understand
> it?
>
> I also got some minor problem with xntrace_user_freeze, because the
> linker was not able to find it: asyncwriter.cpp:(.text+0x843):
> undefined reference to `xntrace_user_freeze(unsigned long, int)' It
> is defined in src/skins/common/trace.c and (should be) contained in
> libxenomai.so. But I was not successful and I had to define it myself
> (under different name). Version of xenomai is 2.6.3.

We see that xnpod_suspend_thread returns immediately, likely because it 
has the XNKICKED bit set. Could you add more back trace points, so that 
we can see what is setting the XNKICKED bit?
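
In case it helps, one way to add such points on a kernel built with the
I-pipe tracer is a tiny helper dropped next to the suspected spots; the
call below is only a sketch of the idea, not something tested against
2.6.3, and the header/ids are assumptions:

#include <linux/ipipe_trace.h>

/* Hypothetical helper: call it right next to each place in ksrc/nucleus/
 * that sets XNKICKED, with a distinct site_id per call site, so the
 * culprit shows up around the freeze point in the trace. */
static inline void trace_kicked_site(unsigned char site_id)
{
	ipipe_trace_special(site_id, 0);
}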


-- 
					    Gilles.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Xenomai] non-blocking rt_task_suspend(NULL)
  2014-04-22 17:20     ` Gilles Chanteperdrix
@ 2014-04-24 15:06       ` Petr Cervenka
  2014-04-24 17:53         ` Gilles Chanteperdrix
  0 siblings, 1 reply; 27+ messages in thread
From: Petr Cervenka @ 2014-04-24 15:06 UTC (permalink / raw)
  To: Gilles Chanteperdrix; +Cc: Xenomai

> Od: Gilles Chanteperdrix <gilles.chanteperdrix@xenomai.org>
>
>> SIGDEBUG signal was not received. Task status from rt_task_inquire()
>> was 0x300180 or 0x300380 (depends where it is placed) When the task
>> is in the "wrong" state, also the call of rt_task_sleep(100000) is
>> returning permanently -EINTR code. Do you have any other idea what to
>> check or what can cause perhaps every xenomai call fail with -EINTR
>> in one task?
>
>If I had to debug this issue, I would enable the I-pipe tracer and 
>trigger a trace freeze when the -EINTR code is received. With enough 
>trace points, it should be possible to understand what happens.
>
I called xntrace_user_freeze() immediately when the issue occurred, but I simply don't understand what is happening there. The trace output is in the attachment. Could you please help me understand it?

I also ran into a minor problem with xntrace_user_freeze, because the linker was not able to find it:
asyncwriter.cpp:(.text+0x843): undefined reference to `xntrace_user_freeze(unsigned long, int)'
It is defined in src/skins/common/trace.c and should be contained in libxenomai.so, but I was not successful and had to define it myself (under a different name). The Xenomai version is 2.6.3.
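I suspect the mangled name in that error just means the prototype was not seen as a C symbol from the C++ unit; a declaration along these lines would presumably have been enough (an untested guess, not something I found in the 2.6.3 headers):

/* asyncwriter.cpp -- hypothetical placement; the prototype matches the
 * symbol exported from src/skins/common/trace.c into libxenomai.so. */
extern "C" int xntrace_user_freeze(unsigned long v, int once);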

Petr

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: freeze1.txt
URL: <http://www.xenomai.org/pipermail/xenomai/attachments/20140424/09c18c23/attachment.txt>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Xenomai] non-blocking rt_task_suspend(NULL)
  2014-04-18  8:51   ` Petr Cervenka
@ 2014-04-22 17:20     ` Gilles Chanteperdrix
  2014-04-24 15:06       ` Petr Cervenka
  0 siblings, 1 reply; 27+ messages in thread
From: Gilles Chanteperdrix @ 2014-04-22 17:20 UTC (permalink / raw)
  To: Petr Cervenka; +Cc: Xenomai

On 04/18/2014 10:51 AM, Petr Cervenka wrote:
>> Od: Gilles Chanteperdrix <gilles.chanteperdrix@xenomai.org>
>>
>> CC: "Xenomai" <xenomai@xenomai.org> On 04/16/2014 06:02 PM, Petr
>> Cervenka wrote:
>>>> Od: Gilles Chanteperdrix <gilles.chanteperdrix@xenomai.org>
>>>>
>>>> CC: "Xenomai" <xenomai@xenomai.org> On 04/16/2014 04:20 PM,
>>>> Petr Cervenka wrote:
>>>>>> Od: Gilles Chanteperdrix
>>>>>> <gilles.chanteperdrix@xenomai.org>
>>>>>>
>>>>>> CC: "Xenomai" <xenomai@xenomai.org> On 04/16/2014 02:22 PM,
>>>>>> Petr Cervenka wrote:
>>>>>>>> Od: Gilles Chanteperdrix
>>>>>>>> <gilles.chanteperdrix@xenomai.org>
>>>>>>>>
>>>>>>>> CC: "Xenomai" <xenomai@xenomai.org> On 04/15/2014 02:42
>>>>>>>> PM, Petr Cervenka wrote:
>>>>>>>>> Hello I have a problem with the rt_task_suspend(NULL)
>>>>>>>>> call. I'm using it for synchronization of two
>>>>>>>>> (producer / consumer like) tasks. 1) When the
>>>>>>>>> consumer task has no work to do, it stops itself by
>>>>>>>>> calling of the rt_task_suspend(NULL). 2) When the
>>>>>>>>> producer creates new work for consumer, it wakes it
>>>>>>>>> up by calling of rt_task_resume(&consumerTask). The
>>>>>>>>> problem is, that consumer seldom switches to a state,
>>>>>>>>> that it sleeps by rt_task_suspend no more. And the
>>>>>>>>> task then takes all the CPU time. The return code is
>>>>>>>>> 0. But I already have seen couple of -4 (-EINTR)
>>>>>>>>> values in the past also. Consumer task status was
>>>>>>>>> 00300380 before and 00300184 (if there is small
>>>>>>>>> safety sleep present). I can use for example RT_EVENT
>>>>>>>>> variable instead, but I'm curious if you by chance
>>>>>>>>> don't know, what is happening? Xenomai 2.6.3, Linux
>>>>>>>>> 3.5.7
>>>>>>>>
>>>>>>>> Could you post the example of code you are using to get
>>>>>>>> this issue?
>>>>>>>>
>>>>>>>
>>>>>>> It's and application with many threads, mutexes and
>>>>>>> others. It's also special measuring HW dependent. I can
>>>>>>> post here some simplified example. But I don't think it
>>>>>>> would be possible to reproduce the same behavior easily.
>>>>>>> It happens in my configuration only probably once per day
>>>>>>> and very unpredictably. But I have more details. I
>>>>>>> replaced rt_task_suspend / rt_task_resume by
>>>>>>> rt_event_wait / rt_event_signal. It failed similar way,
>>>>>>> but this time the result of wait was -4 (-EINTR). And
>>>>>>> (after several millions of invocations) it recovered
>>>>>>> itself.
>>>>>>
>>>>>> -EINTR is a valid return value for both rt_event_wait and
>>>>>> rt_task_suspend. In case you get this error, you should
>>>>>> loop to call rt_event_wait again, and not call
>>>>>> rt_event_clear, as you risk clearing an event which has
>>>>>> been signaled afterwards.
>>>>>>
>>>>> You are right. It was just very quick replace of waiting and
>>>>> waking-up functions. But I'm checking the "work queue" anyway
>>>>> and it also doesn't need exact timing here. My problem it
>>>>> that the slow consumer task seems to be "interrupted by
>>>>> signal" (or whatever) for several minutes. I mean, that it
>>>>> doesn't wait for the event anymore and it always returns
>>>>> immediately (with -EINTR return code).
>>>>
>>>> Are you running inside gdb? Does the task receive the SIGDEBUG
>>>> signal? Do you have the XNWARNSW bit armed?
>>>>
>>>
>>> gdb: No. SIGDEBUG, XNWARNSW: I don't even know what it is ;-).
>>
>> See examples/native/sigdebug.c in xenomai sources. Try installing
>> the same signal handler in your application to be notified upon
>> reception of this signal.
>>
> SIGDEBUG signal was not received. Task status from rt_task_inquire()
> was 0x300180 or 0x300380 (depends where it is placed) When the task
> is in the "wrong" state, also the call of rt_task_sleep(100000) is
> returning permanently -EINTR code. Do you have any other idea what to
> check or what can cause perhaps every xenomai call fail with -EINTR
> in one task?

If I had to debug this issue, I would enable the I-pipe tracer and 
trigger a trace freeze when the -EINTR code is received. With enough 
trace points, it should be possible to understand what happens.
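
For the user-space side, the freeze could be triggered right where the
unexpected -EINTR is detected, roughly like this (a sketch only; the
prototype is the one from src/skins/common/trace.c, and if I remember
the arguments correctly, the first is a value recorded with the freeze
and the second asks for a one-shot freeze):

#include <errno.h>

#ifdef __cplusplus
extern "C"
#endif
int xntrace_user_freeze(unsigned long v, int once);

/* Freeze the I-pipe trace the first time an unexpected -EINTR shows up. */
static void freeze_on_eintr(int res)
{
	if (res == -EINTR)
		xntrace_user_freeze(0, 1);	/* one-shot freeze at the failure */
}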

-- 
					    Gilles.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Xenomai] non-blocking rt_task_suspend(NULL)
  2014-04-16 16:17 ` Gilles Chanteperdrix
@ 2014-04-18  8:51   ` Petr Cervenka
  2014-04-22 17:20     ` Gilles Chanteperdrix
  0 siblings, 1 reply; 27+ messages in thread
From: Petr Cervenka @ 2014-04-18  8:51 UTC (permalink / raw)
  To: Gilles Chanteperdrix; +Cc: Xenomai

> Od: Gilles Chanteperdrix <gilles.chanteperdrix@xenomai.org>
>
> CC: "Xenomai" <xenomai@xenomai.org>
>On 04/16/2014 06:02 PM, Petr Cervenka wrote:
>>> Od: Gilles Chanteperdrix <gilles.chanteperdrix@xenomai.org>
>>>
>>> CC: "Xenomai" <xenomai@xenomai.org>
>>> On 04/16/2014 04:20 PM, Petr Cervenka wrote:
>>>>> Od: Gilles Chanteperdrix <gilles.chanteperdrix@xenomai.org>
>>>>>
>>>>> CC: "Xenomai" <xenomai@xenomai.org> On 04/16/2014 02:22 PM, Petr
>>>>> Cervenka wrote:
>>>>>>> Od: Gilles Chanteperdrix <gilles.chanteperdrix@xenomai.org>
>>>>>>>
>>>>>>> CC: "Xenomai" <xenomai@xenomai.org> On 04/15/2014 02:42 PM,
>>>>>>> Petr Cervenka wrote:
>>>>>>>> Hello I have a problem with the rt_task_suspend(NULL) call.
>>>>>>>> I'm using it for synchronization of two (producer / consumer
>>>>>>>> like) tasks. 1) When the consumer task has no work to do, it
>>>>>>>> stops itself by calling of the rt_task_suspend(NULL). 2) When
>>>>>>>> the producer creates new work for consumer, it wakes it up by
>>>>>>>> calling of rt_task_resume(&consumerTask). The problem is,
>>>>>>>> that consumer seldom switches to a state, that it sleeps by
>>>>>>>> rt_task_suspend no more. And the task then takes all the CPU
>>>>>>>> time. The return code is 0. But I already have seen couple of
>>>>>>>> -4 (-EINTR) values in the past also. Consumer task status was
>>>>>>>> 00300380 before and 00300184 (if there is small safety sleep
>>>>>>>> present). I can use for example RT_EVENT variable instead,
>>>>>>>> but I'm curious if you by chance don't know, what is
>>>>>>>> happening? Xenomai 2.6.3, Linux 3.5.7
>>>>>>>
>>>>>>> Could you post the example of code you are using to get this
>>>>>>> issue?
>>>>>>>
>>>>>>
>>>>>> It's and application with many threads, mutexes and others. It's
>>>>>> also special measuring HW dependent. I can post here some
>>>>>> simplified example. But I don't think it would be possible to
>>>>>> reproduce the same behavior easily. It happens in my
>>>>>> configuration only probably once per day and very unpredictably.
>>>>>> But I have more details. I replaced rt_task_suspend /
>>>>>> rt_task_resume by rt_event_wait / rt_event_signal. It failed
>>>>>> similar way, but this time the result of wait was -4 (-EINTR).
>>>>>> And (after several millions of invocations) it recovered itself.
>>>>>
>>>>> -EINTR is a valid return value for both rt_event_wait and
>>>>> rt_task_suspend. In case you get this error, you should loop to
>>>>> call rt_event_wait again, and not call rt_event_clear, as you risk
>>>>> clearing an event which has been signaled afterwards.
>>>>>
>>>> You are right. It was just very quick replace of waiting and
>>>> waking-up functions. But I'm checking the "work queue" anyway and it
>>>> also doesn't need exact timing here. My problem it that the slow
>>>> consumer task seems to be "interrupted by signal" (or whatever) for
>>>> several minutes. I mean, that it doesn't wait for the event anymore
>>>> and it always returns immediately (with -EINTR return code).
>>>
>>> Are you running inside gdb? Does the task receive the SIGDEBUG signal?
>>> Do you have the XNWARNSW bit armed?
>>>
>>
>> gdb: No.
>> SIGDEBUG, XNWARNSW: I don't even know what it is ;-).
>
>See examples/native/sigdebug.c in xenomai sources. Try installing the 
>same signal handler in your application to be notified upon reception of 
>this signal.
>
The SIGDEBUG signal was not received.
The task status from rt_task_inquire() was 0x300180 or 0x300380 (depending on where the call is placed).
When the task is in the "wrong" state, the call to rt_task_sleep(100000) also permanently returns -EINTR.
Do you have any other idea what to check, or what could cause almost every Xenomai call in one task to fail with -EINTR?

Petr


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Xenomai] non-blocking rt_task_suspend(NULL)
  2014-04-16 16:02 Petr Cervenka
@ 2014-04-16 16:17 ` Gilles Chanteperdrix
  2014-04-18  8:51   ` Petr Cervenka
  0 siblings, 1 reply; 27+ messages in thread
From: Gilles Chanteperdrix @ 2014-04-16 16:17 UTC (permalink / raw)
  To: Petr Cervenka; +Cc: Xenomai

On 04/16/2014 06:02 PM, Petr Cervenka wrote:
>> Od: Gilles Chanteperdrix <gilles.chanteperdrix@xenomai.org>
>>
>> CC: "Xenomai" <xenomai@xenomai.org>
>> On 04/16/2014 04:20 PM, Petr Cervenka wrote:
>>>> Od: Gilles Chanteperdrix <gilles.chanteperdrix@xenomai.org>
>>>>
>>>> CC: "Xenomai" <xenomai@xenomai.org> On 04/16/2014 02:22 PM, Petr
>>>> Cervenka wrote:
>>>>>> Od: Gilles Chanteperdrix <gilles.chanteperdrix@xenomai.org>
>>>>>>
>>>>>> CC: "Xenomai" <xenomai@xenomai.org> On 04/15/2014 02:42 PM,
>>>>>> Petr Cervenka wrote:
>>>>>>> Hello I have a problem with the rt_task_suspend(NULL) call.
>>>>>>> I'm using it for synchronization of two (producer / consumer
>>>>>>> like) tasks. 1) When the consumer task has no work to do, it
>>>>>>> stops itself by calling of the rt_task_suspend(NULL). 2) When
>>>>>>> the producer creates new work for consumer, it wakes it up by
>>>>>>> calling of rt_task_resume(&consumerTask). The problem is,
>>>>>>> that consumer seldom switches to a state, that it sleeps by
>>>>>>> rt_task_suspend no more. And the task then takes all the CPU
>>>>>>> time. The return code is 0. But I already have seen couple of
>>>>>>> -4 (-EINTR) values in the past also. Consumer task status was
>>>>>>> 00300380 before and 00300184 (if there is small safety sleep
>>>>>>> present). I can use for example RT_EVENT variable instead,
>>>>>>> but I'm curious if you by chance don't know, what is
>>>>>>> happening? Xenomai 2.6.3, Linux 3.5.7
>>>>>>
>>>>>> Could you post the example of code you are using to get this
>>>>>> issue?
>>>>>>
>>>>>
>>>>> It's and application with many threads, mutexes and others. It's
>>>>> also special measuring HW dependent. I can post here some
>>>>> simplified example. But I don't think it would be possible to
>>>>> reproduce the same behavior easily. It happens in my
>>>>> configuration only probably once per day and very unpredictably.
>>>>> But I have more details. I replaced rt_task_suspend /
>>>>> rt_task_resume by rt_event_wait / rt_event_signal. It failed
>>>>> similar way, but this time the result of wait was -4 (-EINTR).
>>>>> And (after several millions of invocations) it recovered itself.
>>>>
>>>> -EINTR is a valid return value for both rt_event_wait and
>>>> rt_task_suspend. In case you get this error, you should loop to
>>>> call rt_event_wait again, and not call rt_event_clear, as you risk
>>>> clearing an event which has been signaled afterwards.
>>>>
>>> You are right. It was just very quick replace of waiting and
>>> waking-up functions. But I'm checking the "work queue" anyway and it
>>> also doesn't need exact timing here. My problem it that the slow
>>> consumer task seems to be "interrupted by signal" (or whatever) for
>>> several minutes. I mean, that it doesn't wait for the event anymore
>>> and it always returns immediately (with -EINTR return code).
>>
>> Are you running inside gdb? Does the task receive the SIGDEBUG signal?
>> Do you have the XNWARNSW bit armed?
>>
>
> gdb: No.
> SIGDEBUG, XNWARNSW: I don't even know what it is ;-).

See examples/native/sigdebug.c in xenomai sources. Try installing the 
same signal handler in your application to be notified upon reception of 
this signal.
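
Roughly, the relevant part of that example boils down to something like
this (a from-memory sketch for 2.6, not a verbatim copy of sigdebug.c;
printing from a signal handler is only acceptable for debugging):

#include <signal.h>
#include <stdio.h>
#include <native/task.h>	/* pulls in SIGDEBUG and the reason codes on 2.6 */

static void sigdebug_handler(int sig, siginfo_t *si, void *context)
{
	/* Reason codes include SIGDEBUG_MIGRATE_SIGNAL, SIGDEBUG_MIGRATE_SYSCALL,
	   SIGDEBUG_MIGRATE_FAULT, SIGDEBUG_MIGRATE_PRIOINV and SIGDEBUG_WATCHDOG. */
	fprintf(stderr, "SIGDEBUG received, reason %d\n", si->si_value.sival_int);
}

static void install_sigdebug_handler(void)
{
	struct sigaction sa;

	sigemptyset(&sa.sa_mask);
	sa.sa_sigaction = sigdebug_handler;
	sa.sa_flags = SA_SIGINFO;
	sigaction(SIGDEBUG, &sa, NULL);
}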


-- 
					    Gilles.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Xenomai] non-blocking rt_task_suspend(NULL)
@ 2014-04-16 16:02 Petr Cervenka
  2014-04-16 16:17 ` Gilles Chanteperdrix
  0 siblings, 1 reply; 27+ messages in thread
From: Petr Cervenka @ 2014-04-16 16:02 UTC (permalink / raw)
  To: Gilles Chanteperdrix; +Cc: Xenomai

> Od: Gilles Chanteperdrix <gilles.chanteperdrix@xenomai.org>
>
> CC: "Xenomai" <xenomai@xenomai.org>
>On 04/16/2014 04:20 PM, Petr Cervenka wrote:
>>> Od: Gilles Chanteperdrix <gilles.chanteperdrix@xenomai.org>
>>>
>>> CC: "Xenomai" <xenomai@xenomai.org> On 04/16/2014 02:22 PM, Petr
>>> Cervenka wrote:
>>>>> Od: Gilles Chanteperdrix <gilles.chanteperdrix@xenomai.org>
>>>>>
>>>>> CC: "Xenomai" <xenomai@xenomai.org> On 04/15/2014 02:42 PM,
>>>>> Petr Cervenka wrote:
>>>>>> Hello I have a problem with the rt_task_suspend(NULL) call.
>>>>>> I'm using it for synchronization of two (producer / consumer
>>>>>> like) tasks. 1) When the consumer task has no work to do, it
>>>>>> stops itself by calling of the rt_task_suspend(NULL). 2) When
>>>>>> the producer creates new work for consumer, it wakes it up by
>>>>>> calling of rt_task_resume(&consumerTask). The problem is,
>>>>>> that consumer seldom switches to a state, that it sleeps by
>>>>>> rt_task_suspend no more. And the task then takes all the CPU
>>>>>> time. The return code is 0. But I already have seen couple of
>>>>>> -4 (-EINTR) values in the past also. Consumer task status was
>>>>>> 00300380 before and 00300184 (if there is small safety sleep
>>>>>> present). I can use for example RT_EVENT variable instead,
>>>>>> but I'm curious if you by chance don't know, what is
>>>>>> happening? Xenomai 2.6.3, Linux 3.5.7
>>>>>
>>>>> Could you post the example of code you are using to get this
>>>>> issue?
>>>>>
>>>>
>>>> It's and application with many threads, mutexes and others. It's
>>>> also special measuring HW dependent. I can post here some
>>>> simplified example. But I don't think it would be possible to
>>>> reproduce the same behavior easily. It happens in my
>>>> configuration only probably once per day and very unpredictably.
>>>> But I have more details. I replaced rt_task_suspend /
>>>> rt_task_resume by rt_event_wait / rt_event_signal. It failed
>>>> similar way, but this time the result of wait was -4 (-EINTR).
>>>> And (after several millions of invocations) it recovered itself.
>>>
>>> -EINTR is a valid return value for both rt_event_wait and
>>> rt_task_suspend. In case you get this error, you should loop to
>>> call rt_event_wait again, and not call rt_event_clear, as you risk
>>> clearing an event which has been signaled afterwards.
>>>
>> You are right. It was just very quick replace of waiting and
>> waking-up functions. But I'm checking the "work queue" anyway and it
>> also doesn't need exact timing here. My problem it that the slow
>> consumer task seems to be "interrupted by signal" (or whatever) for
>> several minutes. I mean, that it doesn't wait for the event anymore
>> and it always returns immediately (with -EINTR return code).
>
>Are you running inside gdb? Does the task receive the SIGDEBUG signal? 
>Do you have the XNWARNSW bit armed?
>

gdb: No.
SIGDEBUG, XNWARNSW: I don't even know what it is ;-).

>> I also
>> already got one such situation half an hour ago. But the return code
>> was 0 that time. Could you give me some advice what to check when
>> such situation happens again?
>
>Well the task status should help.
>
Normally the task status (from /proc/xenomai/stat) is 00300182
(XNFPU | XNSHADOW | XNMAPPED | XNSTARTED | XNPEND - waiting for an event).
The task status during the last issue (from /proc/xenomai/stat) was 00300380
(XNFPU | XNSHADOW | XNRELAX | XNMAPPED | XNSTARTED).
CPU load of the task was 23% (with more than 4 million MSW/CSW). Perhaps the sending of UDP packets (used for debugging) introduced enough sleep to prevent the computer from freezing completely.

After the next occurrence I will have more precise information from rt_task_inquire.
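For reference, I plan to grab it with something like this (field names as I read them from the 2.6 native skin headers; an untested sketch, and the plain printf will of course relax the task):

#include <stdio.h>
#include <native/task.h>

/* Dump the status word of the consumer task when the issue is detected. */
static void dump_task_status(RT_TASK *task)
{
	RT_TASK_INFO info;

	if (rt_task_inquire(task, &info) == 0)
		printf("task %s status=0x%08x\n", info.name, info.status);
}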

Petr


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Xenomai] non-blocking rt_task_suspend(NULL)
  2014-04-16 12:22   ` Petr Cervenka
@ 2014-04-16 12:26     ` Gilles Chanteperdrix
  0 siblings, 0 replies; 27+ messages in thread
From: Gilles Chanteperdrix @ 2014-04-16 12:26 UTC (permalink / raw)
  To: Petr Cervenka; +Cc: Xenomai

On 04/16/2014 02:22 PM, Petr Cervenka wrote:
>> Od: Gilles Chanteperdrix <gilles.chanteperdrix@xenomai.org>
>>
>> CC: "Xenomai" <xenomai@xenomai.org> On 04/15/2014 02:42 PM, Petr
>> Cervenka wrote:
>>> Hello I have a problem with the rt_task_suspend(NULL) call. I'm
>>> using it for synchronization of two (producer / consumer like)
>>> tasks. 1) When the consumer task has no work to do, it stops
>>> itself by calling of the rt_task_suspend(NULL). 2) When the
>>> producer creates new work for consumer, it wakes it up by calling
>>> of rt_task_resume(&consumerTask). The problem is, that consumer
>>> seldom switches to a state, that it sleeps by rt_task_suspend no
>>> more. And the task then takes all the CPU time. The return code
>>> is 0. But I already have seen couple of -4 (-EINTR) values in the
>>> past also. Consumer task status was 00300380 before and 00300184
>>> (if there is small safety sleep present). I can use for example
>>> RT_EVENT variable instead, but I'm curious if you by chance don't
>>> know, what is happening? Xenomai 2.6.3, Linux 3.5.7
>>
>> Could you post the example of code you are using to get this
>> issue?
>>
>
> It's and application with many threads, mutexes and others. It's also
> special measuring HW dependent. I can post here some simplified
> example. But I don't think it would be possible to reproduce the same
> behavior easily. It happens in my configuration only probably once
> per day and very unpredictably. But I have more details. I replaced
> rt_task_suspend / rt_task_resume by rt_event_wait / rt_event_signal.
> It failed similar way, but this time the result of wait was -4
> (-EINTR). And (after several millions of invocations) it recovered
> itself.

-EINTR is a valid return value for both rt_event_wait and 
rt_task_suspend. In case you get this error, you should loop to call 
rt_event_wait again, and not call rt_event_clear, as you risk clearing 
an event which has been signaled afterwards.
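
In other words, something along these lines (a minimal sketch reusing the
RT_EVENT and the EVENT_* mask bits from the code posted in this thread;
everything else is illustrative):

#include <errno.h>
#include <native/event.h>

/* Wait again whenever the call is merely interrupted; only a zero return
 * (or a real error) leaves the loop, and nothing is cleared here. */
static int wait_for_work(RT_EVENT *event, unsigned long *mask_r)
{
	int res;

	do {
		res = rt_event_wait(event, EVENT_READ | EVENT_WRITE | EVENT_END,
				    mask_r, EV_ANY, TM_INFINITE);
	} while (res == -EINTR);

	return res;	/* 0 on success, negative error code otherwise */
}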

-- 
					    Gilles.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Xenomai] non-blocking rt_task_suspend(NULL)
  2014-04-16  9:08 ` Gilles Chanteperdrix
@ 2014-04-16 12:22   ` Petr Cervenka
  2014-04-16 12:26     ` Gilles Chanteperdrix
  0 siblings, 1 reply; 27+ messages in thread
From: Petr Cervenka @ 2014-04-16 12:22 UTC (permalink / raw)
  To: Gilles Chanteperdrix; +Cc: Xenomai

> Od: Gilles Chanteperdrix <gilles.chanteperdrix@xenomai.org>
>
> CC: "Xenomai" <xenomai@xenomai.org>
>On 04/15/2014 02:42 PM, Petr Cervenka wrote:
>> Hello I have a problem with the rt_task_suspend(NULL) call. I'm using
>> it for synchronization of two (producer / consumer like) tasks. 1)
>> When the consumer task has no work to do, it stops itself by calling
>> of the rt_task_suspend(NULL). 2) When the producer creates new work
>> for consumer, it wakes it up by calling of
>> rt_task_resume(&consumerTask). The problem is, that consumer seldom
>> switches to a state, that it sleeps by rt_task_suspend no more. And
>> the task then takes all the CPU time. The return code is 0. But I
>> already have seen couple of -4 (-EINTR) values in the past also.
>> Consumer task status was 00300380 before and 00300184 (if there is
>> small safety sleep present). I can use for example RT_EVENT variable
>> instead, but I'm curious if you by chance don't know, what is
>> happening? Xenomai 2.6.3, Linux 3.5.7
>
>Could you post the example of code you are using to get this issue?
>

It's an application with many threads, mutexes and other objects, and it also depends on special measuring hardware. I can post a simplified example here, but I don't think it would be possible to reproduce the same behavior easily; in my configuration it happens only about once per day and very unpredictably.
But I have more details. I replaced rt_task_suspend / rt_task_resume with rt_event_wait / rt_event_signal. It failed in a similar way, but this time the result of the wait was -4 (-EINTR). And (after several million invocations) it recovered by itself.

// consumer task -----------------------------------------------------------------------------
void CAsyncReaderWriter::taskCycle() {
    unsigned long eventMask;
    while (!terminated) {
        // wait for new data !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
        //int res = rt_task_suspend(NULL);
        int res = rt_event_wait(&event, EVENT_READ | EVENT_WRITE | EVENT_END, &eventMask, EV_ANY, TM_INFINITE);

        if (res < 0) {
            LOG_PRINT("wait result: %d (%s)\n", res, strerror(-res));
        }

        rt_event_clear(&event, EVENT_READ | EVENT_WRITE | EVENT_END, NULL);

        // synchronized data processing with checking of the "terminated" flag
        rt_mutex_acquire(&mutex, TM_INFINITE);
        while (!terminated && !dataQueue.empty()) {
            TDataElement *dataElement = dataQueue.pop_front();
            rt_mutex_release(&mutex);

            // slow work including hard disk usage
            ...

            rt_mutex_acquire(&mutex, TM_INFINITE);
            // return free data element
            freeData.push_back(dataElement);
        }
        rt_mutex_release(&mutex);
        
        // safety sleep added after some freezes caused by 100% CPU load
        rt_task_sleep(rt_timer_ns2ticks(100000));
    }
}

// called from producer -------------------------------------------------------------------------
void CAsyncReaderWriter::write(const TParams &params) {
    TDataElement *dataElement = NULL;

    // get free element
    rt_mutex_acquire(&mutex, TM_INFINITE);
    if (freeData.empty()) {
        rt_mutex_release(&mutex);
        throw std::runtime_error(ERRORMSG("no free data elements"));
    }
    dataElement = freeData.back();
    freeData.pop_back();
    rt_mutex_release(&mutex);

    // copy params
    dataElement->params = params;

    // add to circle buffer
    rt_mutex_acquire(&mutex, TM_INFINITE);
    if (dataQueue.full()) {
        freeData.push_back(dataElement);
        rt_mutex_release(&mutex);
        throw std::runtime_error(ERRORMSG("data queue is full"));
    }
    dataQueue.push_back(dataElement);
    rt_mutex_release(&mutex);

    // wake-up task !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
    //rt_task_resume(&task);
    rt_event_signal(&event, EVENT_WRITE);
}


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Xenomai] non-blocking rt_task_suspend(NULL)
  2014-04-15 12:42 Petr Cervenka
@ 2014-04-16  9:08 ` Gilles Chanteperdrix
  2014-04-16 12:22   ` Petr Cervenka
  0 siblings, 1 reply; 27+ messages in thread
From: Gilles Chanteperdrix @ 2014-04-16  9:08 UTC (permalink / raw)
  To: Petr Cervenka; +Cc: Xenomai

On 04/15/2014 02:42 PM, Petr Cervenka wrote:
> Hello I have a problem with the rt_task_suspend(NULL) call. I'm using
> it for synchronization of two (producer / consumer like) tasks. 1)
> When the consumer task has no work to do, it stops itself by calling
> of the rt_task_suspend(NULL). 2) When the producer creates new work
> for consumer, it wakes it up by calling of
> rt_task_resume(&consumerTask). The problem is, that consumer seldom
> switches to a state, that it sleeps by rt_task_suspend no more. And
> the task then takes all the CPU time. The return code is 0. But I
> already have seen couple of -4 (-EINTR) values in the past also.
> Consumer task status was 00300380 before and 00300184 (if there is
> small safety sleep present). I can use for example RT_EVENT variable
> instead, but I'm curious if you by chance don't know, what is
> happening? Xenomai 2.6.3, Linux 3.5.7

Could you post the example of code you are using to get this issue?

-- 
					    Gilles.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* [Xenomai] non-blocking rt_task_suspend(NULL)
@ 2014-04-15 12:42 Petr Cervenka
  2014-04-16  9:08 ` Gilles Chanteperdrix
  0 siblings, 1 reply; 27+ messages in thread
From: Petr Cervenka @ 2014-04-15 12:42 UTC (permalink / raw)
  To: Xenomai

Hello, I have a problem with the rt_task_suspend(NULL) call.
I'm using it to synchronize two (producer / consumer like) tasks.
1) When the consumer task has no work to do, it stops itself by calling rt_task_suspend(NULL).
2) When the producer creates new work for the consumer, it wakes it up by calling rt_task_resume(&consumerTask).
The problem is that the consumer occasionally switches to a state in which it no longer sleeps in rt_task_suspend, and the task then takes all the CPU time.
The return code is 0, but I have also seen a couple of -4 (-EINTR) values in the past.
The consumer task status was 00300380 before, and 00300184 (if a small safety sleep is present).
I could use, for example, an RT_EVENT instead, but I'm curious whether you happen to know what is going on.
Xenomai 2.6.3, Linux 3.5.7

Petr Cervenka


^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2014-05-20 12:54 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-04-16 14:20 [Xenomai] non-blocking rt_task_suspend(NULL) Petr Cervenka
2014-04-16 14:28 ` Gilles Chanteperdrix
  -- strict thread matches above, loose matches on Subject: below --
2014-05-02 12:13 Petr Cervenka
2014-05-02 12:30 ` Gilles Chanteperdrix
2014-05-02 13:16 ` Philippe Gerum
2014-05-06  8:17   ` Petr Cervenka
2014-05-06  8:39     ` Philippe Gerum
2014-05-06  8:56     ` Philippe Gerum
2014-05-06  9:29       ` Petr Cervenka
2014-05-06 12:57         ` Philippe Gerum
2014-05-07 13:13           ` Petr Cervenka
2014-05-08 15:53             ` Philippe Gerum
2014-05-12 12:37               ` Petr Cervenka
2014-05-12 13:09                 ` Philippe Gerum
2014-05-20 12:27                   ` Petr Cervenka
2014-05-20 12:54                     ` Philippe Gerum
2014-04-16 16:02 Petr Cervenka
2014-04-16 16:17 ` Gilles Chanteperdrix
2014-04-18  8:51   ` Petr Cervenka
2014-04-22 17:20     ` Gilles Chanteperdrix
2014-04-24 15:06       ` Petr Cervenka
2014-04-24 17:53         ` Gilles Chanteperdrix
2014-04-25  8:38           ` Petr Cervenka
2014-04-15 12:42 Petr Cervenka
2014-04-16  9:08 ` Gilles Chanteperdrix
2014-04-16 12:22   ` Petr Cervenka
2014-04-16 12:26     ` Gilles Chanteperdrix
