* [Xenomai] Mercury / pSOS: messages lost due to bug in q_receive
@ 2016-02-05  8:27 Ronny Meeus
  2016-02-09 11:23 ` Philippe Gerum
  0 siblings, 1 reply; 7+ messages in thread
From: Ronny Meeus @ 2016-02-05  8:27 UTC (permalink / raw)
  To: xenomai

Hello

We are using the pSOS interface on top of the Mercury core.
Under heavy stress conditions we sporadically see that messages are
getting lost.

Attached you will find a test application (multi_queue.c) that helped me
find the issue and verify that the attached patch actually solves it.
To reproduce the issue I start the test application several times in
parallel; the CPU load of the board then reaches 100%.
Starting several applications in parallel may not be strictly needed,
but when I do so I can reproduce the issue in a matter of seconds.
As soon as one message is lost, the test stops itself (exit). After
some time I observe that all applications have stopped.

What the test basically does is send messages between a manager and
2 worker threads and verify, using sequence numbers in the messages,
that no message is lost.
The test might look overly complex, but it is basically a simulation
of a scenario from our real application.
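
Schematically, the stress pattern boils down to something like the
sketch below (a simplified illustration, not the attached
multi_queue.c itself; the queue name, task priority, message layout
and the 1-tick receive timeout are assumptions on my side, and the
build details for the pSOS skin are omitted):

#include <stdio.h>
#include <stdlib.h>
#include <psos.h>

static u_long qid;

static void worker(u_long a0, u_long a1, u_long a2, u_long a3)
{
	u_long msg[4], expected = 0;

	for (;;) {
		/* Short timed wait, so the timeout path keeps racing
		   against the sender under heavy load. */
		if (q_receive(qid, Q_WAIT, 1, msg) != 0)
			continue; /* e.g. ERR_TIMEOUT: just retry */
		if (msg[0] != expected) {
			printf("lost message: got %lu, expected %lu\n",
			       msg[0], expected);
			exit(1); /* stop as soon as a gap is seen */
		}
		expected++;
	}
}

int main(void)
{
	u_long tid, args[4] = { 0 }, msg[4] = { 0 };

	q_create("MSGQ", 16, Q_FIFO | Q_LIMIT, &qid);
	t_create("WORK", 10, 8192, 0, 0, &tid);
	t_start(tid, 0, worker, args);

	for (;;) {
		/* Busy-retry on a full queue to keep the CPU load high. */
		while (q_send(qid, msg) != 0)
			;
		msg[0]++; /* next sequence number */
	}

	return 0;
}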

The patch is ported from an older version of Xenomai.
I did not actually run the test with the latest version, but by
looking at the code I can see that the issue is still present there
as well.

Best regards,
Ronny
-------------- next part --------------
A non-text attachment was scrubbed...
Name: multi_queue.c
Type: text/x-csrc
Size: 4747 bytes
Desc: not available
URL: <http://xenomai.org/pipermail/xenomai/attachments/20160205/5a4434b5/attachment.c>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: psos_queue_correction.patch
Type: application/octet-stream
Size: 2106 bytes
Desc: not available
URL: <http://xenomai.org/pipermail/xenomai/attachments/20160205/5a4434b5/attachment.obj>


* Re: [Xenomai] Mercury / pSOS: messages lost due to bug in q_receive
  2016-02-05  8:27 [Xenomai] Mercury / pSOS: messages lost due to bug in q_receive Ronny Meeus
@ 2016-02-09 11:23 ` Philippe Gerum
  2016-02-09 20:23   ` Ronny Meeus
  0 siblings, 1 reply; 7+ messages in thread
From: Philippe Gerum @ 2016-02-09 11:23 UTC (permalink / raw)
  To: Ronny Meeus, xenomai

On 02/05/2016 09:27 AM, Ronny Meeus wrote:
> Hello
> 
> We are using the pSOS interface on top of the Mercury core.
> Under heavy stress conditions we sporadically see that messages are
> getting lost.
>
> Attached you will find a test application (multi_queue.c) that helped
> me find the issue and verify that the attached patch actually solves
> it.
> To reproduce the issue I start the test application several times in
> parallel; the CPU load of the board then reaches 100%.
> Starting several applications in parallel may not be strictly needed,
> but when I do so I can reproduce the issue in a matter of seconds.
> As soon as one message is lost, the test stops itself (exit). After
> some time I observe that all applications have stopped.
>
> What the test basically does is send messages between a manager and
> 2 worker threads and verify, using sequence numbers in the messages,
> that no message is lost.
> The test might look overly complex, but it is basically a simulation
> of a scenario from our real application.
>
> The patch is ported from an older version of Xenomai.
> I did not actually run the test with the latest version, but by
> looking at the code I can see that the issue is still present there
> as well.
> 

Your analysis is right, but the fix only addresses the symptom, not the root cause of this issue. The fact that a message might be squeezed in while the receiver resumes from a timeout condition reveals a more general issue with the syncobj_wait_grant/drain() logic in this area.

This is what has to be fixed, since any code checking for the -ETIMEDOUT/-EINTR status from those copperplate calls to determine whether the condition was eventually satisfied is virtually affected by the very same problem.
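
Schematically, an affected caller currently looks like this
(simplified and illustrative, not verbatim from the tree; the error
mapping is only an example):

	ret = syncobj_wait_grant(&q->sobj, timeout, &syns);
	if (ret == -ETIMEDOUT || ret == -EINTR)
		/*
		 * Assumed to mean "nothing happened", but a granter may
		 * have slipped a message in during the race window;
		 * that message is then silently dropped.
		 */
		return ret == -ETIMEDOUT ? ERR_TIMEOUT : -EINTR;
	/* Success path: consume the data handed over by the granter. */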

Could you test the following patch and let me know if that fixes the issue for you as well?

diff --git a/lib/copperplate/syncobj.c b/lib/copperplate/syncobj.c
index 1b3d545..bf04b29 100644
--- a/lib/copperplate/syncobj.c
+++ b/lib/copperplate/syncobj.c
@@ -453,14 +453,55 @@ struct threadobj *syncobj_peek_drain(struct syncobj *sobj)
 
 static int wait_epilogue(struct syncobj *sobj,
 			 struct syncstate *syns,
-			 struct threadobj *current)
+			 struct threadobj *current,
+			 int ret)
 {
 	current->run_state = __THREAD_S_RUNNING;
 
+	/*
+	 * Fixup a potential race upon return from grant/drain_wait
+	 * operations, e.g. given two threads A and B:
+	 *
+	 * A:enqueue_waiter(self)
+	 * A:monitor_wait
+	 *    A:monitor_unlock
+	 *    A:[timed] sleep
+	 *    A:wakeup on timeout/interrupt
+	 *       B:monitor_lock
+	 *       B:look_for_queued_waiter
+	 *          (found A, update A's state)
+	 *       B:monitor_unlock
+	 *    A:dequeue_waiter(self)
+	 *    A:return -ETIMEDOUT/-EINTR
+	 *
+	 * The race may happen anytime between the timeout/interrupt
+	 * event is received by A, and the moment it grabs back the
+	 * monitor lock before unqueuing. When the race happens, B can
+	 * squeeze in a signal before A unqueues after resumption on
+	 * error.
+	 *
+	 * Problem: A's internal state has been updated (e.g. some
+	 * data transferred to it), but it will receive
+	 * -ETIMEDOUT/-EINTR, causing it to miss the update
+	 * eventually.
+	 *
+	 * Solution: fixup the status code upon return from
+	 * wait_grant/drain operations, so that -ETIMEDOUT/-EINTR is
+	 * never returned to the caller if the syncobj was actually
+	 * signaled. We still allow the SYNCOBJ_FLUSHED condition to
+	 * override that success code though.
+	 *
+	 * Whether a condition should be deemed satisfied if it is
+	 * signaled during the race window described above is
+	 * debatable, but this is a simple and straightforward way to
+	 * handle such grey area.
+	 */
+
 	if (current->wait_sobj) {
 		dequeue_waiter(sobj, current);
 		current->wait_sobj = NULL;
-	}
+	} else if (ret == -ETIMEDOUT || ret == -EINTR)
+		ret = 0;
 
 	sobj->wait_count--;
 	assert(sobj->wait_count >= 0);
@@ -477,7 +518,7 @@ static int wait_epilogue(struct syncobj *sobj,
 	if (current->wait_status & SYNCOBJ_FLUSHED)
 		return -EINTR;
 
-	return 0;
+	return ret;
 }
 
 int syncobj_wait_grant(struct syncobj *sobj, const struct timespec *timeout,
@@ -515,7 +556,7 @@ int syncobj_wait_grant(struct syncobj *sobj, const struct timespec *timeout,
 
 	pthread_setcancelstate(state, NULL);
 
-	return wait_epilogue(sobj, syns, current) ?: ret;
+	return wait_epilogue(sobj, syns, current, ret);
 }
 
 int syncobj_wait_drain(struct syncobj *sobj, const struct timespec *timeout,
@@ -553,7 +594,7 @@ int syncobj_wait_drain(struct syncobj *sobj, const struct timespec *timeout,
 
 	pthread_setcancelstate(state, NULL);
 
-	return wait_epilogue(sobj, syns, current) ?: ret;
+	return wait_epilogue(sobj, syns, current, ret);
 }
 
 int syncobj_destroy(struct syncobj *sobj, struct syncstate *syns)

-- 
Philippe.



* Re: [Xenomai] Mercury / pSOS: messages lost due to bug in q_receive
  2016-02-09 11:23 ` Philippe Gerum
@ 2016-02-09 20:23   ` Ronny Meeus
  2016-02-10  6:06     ` Ronny Meeus
  0 siblings, 1 reply; 7+ messages in thread
From: Ronny Meeus @ 2016-02-09 20:23 UTC (permalink / raw)
  To: Philippe Gerum; +Cc: xenomai

On Tue, Feb 9, 2016 at 12:23 PM, Philippe Gerum <rpm@xenomai.org> wrote:
> On 02/05/2016 09:27 AM, Ronny Meeus wrote:
>> Hello
>>
>> We are using the pSOS interface on top of the Mercury core.
>> Under heavy stress conditions we sporadically see that messages are
>> getting lost.
>>
>> Attached you will find a test application (multi_queue.c) that helped
>> me find the issue and verify that the attached patch actually solves
>> it.
>> To reproduce the issue I start the test application several times in
>> parallel; the CPU load of the board then reaches 100%.
>> Starting several applications in parallel may not be strictly needed,
>> but when I do so I can reproduce the issue in a matter of seconds.
>> As soon as one message is lost, the test stops itself (exit). After
>> some time I observe that all applications have stopped.
>>
>> What the test basically does is send messages between a manager and
>> 2 worker threads and verify, using sequence numbers in the messages,
>> that no message is lost.
>> The test might look overly complex, but it is basically a simulation
>> of a scenario from our real application.
>>
>> The patch is ported from an older version of Xenomai.
>> I did not actually run the test with the latest version, but by
>> looking at the code I can see that the issue is still present there
>> as well.
>>
>
> Your analysis is right, but the fix only addresses the symptom, not the root cause of this issue. The fact that a message might be squeezed in while the receiver resumes from a timeout condition reveals a more general issue with the syncobj_wait_grant/drain() logic in this area.
>
> This is what has to be fixed, since any code checking for the -ETIMEDOUT/-EINTR status from those copperplate calls to determine whether the condition was eventually satisfied is virtually affected by the very same problem.
>
> Could you test the following patch and let me know if that fixes the issue for you as well?

Philippe

I have ported the patch to our version of Xenomai and started the test.
It has already been running for several minutes without issues, so it
looks like the patch below also solves the problem.
I will let it run overnight and update you tomorrow morning.

Thanks.

Best regards,
Ronny

>
> diff --git a/lib/copperplate/syncobj.c b/lib/copperplate/syncobj.c
> index 1b3d545..bf04b29 100644
> --- a/lib/copperplate/syncobj.c
> +++ b/lib/copperplate/syncobj.c
> @@ -453,14 +453,55 @@ struct threadobj *syncobj_peek_drain(struct syncobj *sobj)
>
>  static int wait_epilogue(struct syncobj *sobj,
>                          struct syncstate *syns,
> -                        struct threadobj *current)
> +                        struct threadobj *current,
> +                        int ret)
>  {
>         current->run_state = __THREAD_S_RUNNING;
>
> +       /*
> +        * Fixup a potential race upon return from grant/drain_wait
> +        * operations, e.g. given two threads A and B:
> +        *
> +        * A:enqueue_waiter(self)
> +        * A:monitor_wait
> +        *    A:monitor_unlock
> +        *    A:[timed] sleep
> +        *    A:wakeup on timeout/interrupt
> +        *       B:monitor_lock
> +        *       B:look_for_queued_waiter
> +        *          (found A, update A's state)
> +        *       B:monitor_unlock
> +        *    A:dequeue_waiter(self)
> +        *    A:return -ETIMEDOUT/-EINTR
> +        *
> +        * The race may happen anytime between the timeout/interrupt
> +        * event is received by A, and the moment it grabs back the
> +        * monitor lock before unqueuing. When the race happens, B can
> +        * squeeze in a signal before A unqueues after resumption on
> +        * error.
> +        *
> +        * Problem: A's internal state has been updated (e.g. some
> +        * data transferred to it), but it will receive
> +        * -ETIMEDOUT/-EINTR, causing it to miss the update
> +        * eventually.
> +        *
> +        * Solution: fixup the status code upon return from
> +        * wait_grant/drain operations, so that -ETIMEDOUT/-EINTR is
> +        * never returned to the caller if the syncobj was actually
> +        * signaled. We still allow the SYNCOBJ_FLUSHED condition to
> +        * override that success code though.
> +        *
> +        * Whether a condition should be deemed satisfied if it is
> +        * signaled during the race window described above is
> +        * debatable, but this is a simple and straightforward way to
> +        * handle such grey area.
> +        */
> +
>         if (current->wait_sobj) {
>                 dequeue_waiter(sobj, current);
>                 current->wait_sobj = NULL;
> -       }
> +       } else if (ret == -ETIMEDOUT || ret == -EINTR)
> +               ret = 0;
>
>         sobj->wait_count--;
>         assert(sobj->wait_count >= 0);
> @@ -477,7 +518,7 @@ static int wait_epilogue(struct syncobj *sobj,
>         if (current->wait_status & SYNCOBJ_FLUSHED)
>                 return -EINTR;
>
> -       return 0;
> +       return ret;
>  }
>
>  int syncobj_wait_grant(struct syncobj *sobj, const struct timespec *timeout,
> @@ -515,7 +556,7 @@ int syncobj_wait_grant(struct syncobj *sobj, const struct timespec *timeout,
>
>         pthread_setcancelstate(state, NULL);
>
> -       return wait_epilogue(sobj, syns, current) ?: ret;
> +       return wait_epilogue(sobj, syns, current, ret);
>  }
>
>  int syncobj_wait_drain(struct syncobj *sobj, const struct timespec *timeout,
> @@ -553,7 +594,7 @@ int syncobj_wait_drain(struct syncobj *sobj, const struct timespec *timeout,
>
>         pthread_setcancelstate(state, NULL);
>
> -       return wait_epilogue(sobj, syns, current) ?: ret;
> +       return wait_epilogue(sobj, syns, current, ret);
>  }
>
>  int syncobj_destroy(struct syncobj *sobj, struct syncstate *syns)
>
> --
> Philippe.



* Re: [Xenomai] Mercury / pSOS: messages lost due to bug in q_receive
  2016-02-09 20:23   ` Ronny Meeus
@ 2016-02-10  6:06     ` Ronny Meeus
  2016-02-10  8:13       ` Ronny Meeus
  2016-02-10  8:49       ` Philippe Gerum
  0 siblings, 2 replies; 7+ messages in thread
From: Ronny Meeus @ 2016-02-10  6:06 UTC (permalink / raw)
  To: Philippe Gerum; +Cc: xenomai

On Tue, Feb 9, 2016 at 9:23 PM, Ronny Meeus <ronny.meeus@gmail.com> wrote:

> Philippe
>
> I have ported the patch to our version of Xenomai and started the test.
> It has already been running for several minutes without issues, so it
> looks like the patch below also solves the problem.
> I will let it run overnight and update you tomorrow morning.
>

Test is still running, no issues seen.
So the conclusion is clear: your patch solves the issue.
Thanks

Ronny



* Re: [Xenomai] Mercury / pSOS: messages lost due to bug in q_receive
  2016-02-10  6:06     ` Ronny Meeus
@ 2016-02-10  8:13       ` Ronny Meeus
  2016-02-10  8:44         ` Philippe Gerum
  2016-02-10  8:49       ` Philippe Gerum
  1 sibling, 1 reply; 7+ messages in thread
From: Ronny Meeus @ 2016-02-10  8:13 UTC (permalink / raw)
  To: Philippe Gerum; +Cc: xenomai

On Wed, Feb 10, 2016 at 7:06 AM, Ronny Meeus <ronny.meeus@gmail.com> wrote:
> On Tue, Feb 9, 2016 at 9:23 PM, Ronny Meeus <ronny.meeus@gmail.com> wrote:
>
>> Philippe
>>
>> I have ported the patch to our version of Xenomai and started the test.
>> It has already been running for several minutes without issues, so it
>> looks like the patch below also solves the problem.
>> I will let it run overnight and update you tomorrow morning.
>>
>
> Test is still running, no issues seen.
> So the conclusion is clear: your patch solves the issue.
> Thanks
>
> Ronny

Philippe,

You state in one of the comments in the patch:
"
 * Whether a condition should be deemed satisfied if it is
 * signaled during the race window described above is
 * debatable, but this is a simple and straightforward way to
 * handle such grey area.
"

I understand this; it would also be possible to return a timeout to
the caller in this case, but it is not acceptable for the message to
actually be lost.
If you implement it in such a way that a timeout error is returned, the message
that has been copied into the receiver's buffer should be put back in the queue
so that it is received on the next call to q_receive.
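
Just to illustrate the idea, the receive path would then need
something along these lines (purely hypothetical sketch; the
requeue_msg_head() helper and the wait->msg field are invented names
that do not exist in the tree):

	ret = syncobj_wait_grant(&q->sobj, timeout, &syns);
	if (ret == -ETIMEDOUT && wait->msg != NULL) {
		/* A sender slipped a message in during the race window:
		   put it back at the head of the queue so the next
		   q_receive() picks it up, then still report the timeout. */
		requeue_msg_head(q, wait->msg);
		return ERR_TIMEOUT;
	}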

Ronny



* Re: [Xenomai] Mercury / pSOS: messages lost due to bug in q_receive
  2016-02-10  8:13       ` Ronny Meeus
@ 2016-02-10  8:44         ` Philippe Gerum
  0 siblings, 0 replies; 7+ messages in thread
From: Philippe Gerum @ 2016-02-10  8:44 UTC (permalink / raw)
  To: Ronny Meeus; +Cc: xenomai

On 02/10/2016 09:13 AM, Ronny Meeus wrote:
> On Wed, Feb 10, 2016 at 7:06 AM, Ronny Meeus <ronny.meeus@gmail.com> wrote:
>> On Tue, Feb 9, 2016 at 9:23 PM, Ronny Meeus <ronny.meeus@gmail.com> wrote:
>>
>>> Philippe
>>>
>>> I have ported the patch to our version of Xenomai and started the test.
>>> It has already been running for several minutes without issues, so it
>>> looks like the patch below also solves the problem.
>>> I will let it run overnight and update you tomorrow morning.
>>>
>>
>> Test is still running, no issues seen.
>> So the conclusion is clear: your patch solves the issue.
>> Thanks
>>
>> Ronny
> 
> Philippe,
> 
> You state in one of the comments in the patch:
> "
>  * Whether a condition should be deemed satisfied if it is
>  * signaled during the race window described above is
>  * debatable, but this is a simple and straightforward way to
>  * handle such grey area.
> "
> 
> I understand this; it would also be possible to return a timeout to
> the caller in this case, but it is not acceptable for the message to
> actually be lost.
> If you implement it in such a way that a timeout error is returned, the message
> that has been copied into the receiver's buffer should be put back in the queue
> so that it is received on the next call to q_receive.
> 

Clearly, yes. But that approach would invalidate the current assumption
that receiving any error status from the wait_grant/drain services
(which includes -ETIMEDOUT/-EINTR) means that the operation has failed
entirely and the object state was left unchanged. Instead, it would
mean that the operation might have failed or succeeded; we would not
know until some extra state is considered (e.g. the wait_struct
relaying the data).

Not only would this require all current users to be fixed up to look at
that extra piece of information in their own state, it is also not
acceptable logically speaking, and it would make the implementation
needlessly complex in the upper layers.
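
To make that concrete: today a caller may simply write

	ret = syncobj_wait_grant(&q->sobj, timeout, &syns);
	if (ret)
		return ret; /* nothing happened, just propagate */

whereas with the alternative it would have to become something like
this (wait->msg being an illustrative name for that extra piece of
state):

	ret = syncobj_wait_grant(&q->sobj, timeout, &syns);
	if (ret && wait->msg == NULL)
		return ret;
	/* error status, yet data was transferred: handle it anyway */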

-- 
Philippe.



* Re: [Xenomai] Mercury / pSOS: messages lost due to bug in q_receive
  2016-02-10  6:06     ` Ronny Meeus
  2016-02-10  8:13       ` Ronny Meeus
@ 2016-02-10  8:49       ` Philippe Gerum
  1 sibling, 0 replies; 7+ messages in thread
From: Philippe Gerum @ 2016-02-10  8:49 UTC (permalink / raw)
  To: Ronny Meeus; +Cc: xenomai

On 02/10/2016 07:06 AM, Ronny Meeus wrote:
> On Tue, Feb 9, 2016 at 9:23 PM, Ronny Meeus <ronny.meeus@gmail.com> wrote:
> 
>> Philippe
>>
>> I have ported the patch to our version of Xenomai and started the test.
>> It has already been running for several minutes without issues, so it
>> looks like the patch below also solves the problem.
>> I will let it run overnight and update you tomorrow morning.
>>
> 
> Test is still running, no issues seen.
> So the conclusion is clear: your patch solves the issue.
> Thanks
> 

Ok, this patch made it to the -stable and -next branches. Thanks for the
update.

-- 
Philippe.



end of thread, other threads:[~2016-02-10  8:49 UTC | newest]

Thread overview: 7+ messages
2016-02-05  8:27 [Xenomai] Mercury / pSOS: messages lost due to bug in q_receive Ronny Meeus
2016-02-09 11:23 ` Philippe Gerum
2016-02-09 20:23   ` Ronny Meeus
2016-02-10  6:06     ` Ronny Meeus
2016-02-10  8:13       ` Ronny Meeus
2016-02-10  8:44         ` Philippe Gerum
2016-02-10  8:49       ` Philippe Gerum
