* Cobalt deadlock for no apparent reason
From: Lange Norbert @ 2020-01-20 18:03 UTC
To: Xenomai (xenomai@xenomai.org)
Hello,
I hit a deadlock while running under gdbserver. The code is an implementation of a synchronized queue:
the Fup side waits on a condition variable, main wants to push data, but main fails to acquire the mutex.
The mutex is of the error-checking type, without priority inheritance, and is not used elsewhere.
The tasks are as follows:
CPU  PID   CLASS  TYPE    PRI  TIMEOUT  STAT  NAME
  1  1686  rt     cobalt    4  -        Wt    main
  3  1690  rt     cobalt    2  -        Wt    fup.medium
main is stuck in this function:

int mutex_lock(struct mutex_data *pData)
{
    pthread_t threadId = pthread_self();
    // assert(pthread_equal(threadId, pData->m_LockId) == 0);
->  int r = pthread_mutex_lock(&pData->m_Mutex);
    assert(r == 0);
    pData->m_LockId = threadId;
    return r;
}

In libcobalt:

do
    ret = XENOMAI_SYSCALL1(sc_cobalt_mutex_lock, _mutex);
while (ret == -EINTR);

fup.medium is stuck in:

int conditionvar_wait(struct conditionvar_data *pData, struct mutex_data *pMutex)
{
    pthread_t sid = pthread_self();
    assert(pthread_equal(sid, pMutex->m_LockId) != 0);
    pMutex->m_LockId = 0;
->  int r = pthread_cond_wait(&pData->m_CondVar, &pMutex->m_Mutex);
    assert(r == 0);
    pMutex->m_LockId = sid;
    return r;
}

In libcobalt:

while (err == -EINTR)
    err = XENOMAI_SYSCALL2(sc_cobalt_cond_wait_epilogue, _cnd, _mx);
Kind regards
NORBERT LANGE
AT-RD3
ANDRITZ HYDRO GmbH
Eibesbrunnergasse 20
1120 Vienna / AUSTRIA
p: +43 50805 56684
norbert.lange@andritz.com
http://www.andritz.com/
________________________________
This message and any attachments are solely for the use of the intended recipients. They may contain privileged and/or confidential information or other information protected from disclosure. If you are not an intended recipient, you are hereby notified that you received this email in error and that any review, dissemination, distribution or copying of this email and any attachment is strictly prohibited. If you have received this email in error, please contact the sender and delete the message and any attachment from your system.
ANDRITZ HYDRO GmbH
Rechtsform/ Legal form: Gesellschaft mit beschränkter Haftung / Corporation
Firmensitz/ Registered seat: Wien
Firmenbuchgericht/ Court of registry: Handelsgericht Wien
Firmenbuchnummer/ Company registration: FN 61833 g
DVR: 0605077
UID-Nr.: ATU14756806
Thank You
________________________________
* Re: Cobalt deadlock for no apparent reason
From: Jan Kiszka @ 2020-01-21 17:45 UTC
To: Lange Norbert, Xenomai (xenomai@xenomai.org)
On 20.01.20 19:03, Lange Norbert via Xenomai wrote:
> Hello,
>
> I got a deadlock while running through gdbserver, this is an implementation of a synchronized queue,
> Fup side waits via condition variable, main wants to push data, but main fails to acquire the mutex.
> The mutex is an errorchecking type, without priority inheritance, and not used elsewhere.
>
>
> The task are as following:
>
> CPU PID CLASS TYPE PRI TIMEOUT STAT NAME
> 1 1686 rt cobalt 4 - Wt main
> 3 1690 rt cobalt 2 - Wt fup.medium
>
> main is stuck in this function
>
> int mutex_lock(struct mutex_data *pData)
> {
> pthread_t threadId = pthread_self();
> // assert(pthread_equal(threadId, pData->m_LockId) == 0);
> -> int r = pthread_mutex_lock(&pData->m_Mutex);
> assert(r == 0);
> pData->m_LockId = threadId;
> return r;
> }
>
> In libcobalt:
> do
> ret = XENOMAI_SYSCALL1(sc_cobalt_mutex_lock, _mutex);
> while (ret == -EINTR);
>
> fup.medium is stuck in:
>
> int conditionvar_wait(struct conditionvar_data *pData, struct mutex_data *pMutex)
> {
> pthread_t sid = pthread_self();
> assert(pthread_equal(sid, pMutex->m_LockId) != 0);
> pMutex->m_LockId = 0;
> -> int r = pthread_cond_wait(&pData->m_CondVar, &pMutex->m_Mutex);
> assert(r == 0);
> pMutex->m_LockId = sid;
> return r;
> }
>
> In libcobalt:
> while (err == -EINTR)
> err = XENOMAI_SYSCALL2(sc_cobalt_cond_wait_epilogue, _cnd, _mx);
>
This is likely tricky to debug by just looking at things. Can you factor
out a reproducer?
Jan
--
Siemens AG, Corporate Technology, CT RDA IOT SES-DE
Corporate Competence Center Embedded Linux
* RE: Cobalt deadlock for no apparent reason
From: Lange Norbert @ 2020-01-22 10:11 UTC
To: Jan Kiszka, Xenomai (xenomai@xenomai.org)
> This is likely tricky to debug by just looking at things. Can you factor out a
> reproducer?
Well, the "no apparent reason" is key here; it is not easily reproducible either.
It might help if you could tell me how I could end up in this situation. AFAIK the pthread_cond_wait function got interrupted;
when can this occur, for example?
It seems limited to running under a debugger (or it just happens a lot less without one), for example when the process dlopens libraries, which pauses
the process. At that time fup.medium is supposed to stay in pthread_cond_wait (which means the dlopens might cause spurious wakeups)
and to be notified once everything is ready to run.
So, if you have any idea how I could narrow it down, I might be able to build a reproducer. Right now I am going to use a timed mutex to at least detect the issue.
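A minimal sketch of that timed-lock detection idea, assuming plain POSIX calls (under Cobalt the same calls would presumably go through the libcobalt wrappers). The names mutex_init_errorcheck, mutex_lock_timed and lock_probe are hypothetical, the 100 ms probe timeout is arbitrary, and struct mutex_data is reduced to the two fields visible in the snippets above:

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdint.h>
#include <time.h>

/* Reduced to the two fields that appear in the reported code. */
struct mutex_data {
    pthread_mutex_t m_Mutex;
    pthread_t m_LockId;
};

/* Error-checking mutex without priority inheritance, as described. */
int mutex_init_errorcheck(struct mutex_data *pData)
{
    pthread_mutexattr_t attr;
    int r = pthread_mutexattr_init(&attr);
    if (r != 0)
        return r;
    r = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
    if (r == 0)
        r = pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_NONE);
    if (r == 0)
        r = pthread_mutex_init(&pData->m_Mutex, &attr);
    pthread_mutexattr_destroy(&attr);
    return r;
}

/* Hypothetical helper: lock with a relative timeout in milliseconds.
 * Returns 0 on success, ETIMEDOUT if the mutex stayed busy, or
 * another error number (e.g. EDEADLK on a relock by the owner). */
int mutex_lock_timed(struct mutex_data *pData, long timeout_ms)
{
    struct timespec abs;
    clock_gettime(CLOCK_REALTIME, &abs); /* timedlock takes an absolute CLOCK_REALTIME time */
    abs.tv_sec += timeout_ms / 1000;
    abs.tv_nsec += (timeout_ms % 1000) * 1000000L;
    if (abs.tv_nsec >= 1000000000L) {
        abs.tv_sec += 1;
        abs.tv_nsec -= 1000000000L;
    }
    int r = pthread_mutex_timedlock(&pData->m_Mutex, &abs);
    if (r == 0)
        pData->m_LockId = pthread_self();
    return r;
}

/* Thread entry that probes the lock from a second context and
 * reports the result of the attempt through its return value. */
void *lock_probe(void *arg)
{
    struct mutex_data *pData = arg;
    int r = mutex_lock_timed(pData, 100);
    if (r == 0)
        pthread_mutex_unlock(&pData->m_Mutex);
    return (void *)(intptr_t)r;
}
```

An ETIMEDOUT result only shows that the mutex stayed busy for the whole timeout; it does not say who holds it, so it would be a trigger for dumping state rather than a diagnosis.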
Norbert
* Re: Cobalt deadlock for no apparent reason
From: Jan Kiszka @ 2020-01-22 17:19 UTC
To: Lange Norbert, Xenomai (xenomai@xenomai.org)
On 22.01.20 11:11, Lange Norbert wrote:
>> This is likely tricky to debug by just looking at things. Can you factor out a
>> reproducer?
>
> Well, the "no apparent reason" is key here; it is not easily reproducible either.
> It might help if you could tell me how I could end up in this situation. AFAIK the pthread_cond_wait function got interrupted;
> when can this occur, for example?
Well, we have the cond_wait apparently being signaled (condition met)
and on its way back, "just" trying to reacquire the mutex it dropped
while waiting. On the other hand, that mutex is also not available for
the other context trying to call mutex_lock. Now, we either have one of
those two instances actually holding the lock while not noticing it - or
there is a third instance in possession of the lock. From the code and
information you sent, this is impossible to guess.
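To illustrate the handshake described above, here is a minimal sketch of a synchronized queue in plain POSIX terms. It is not the code from the report: struct sync_queue, queue_init, queue_push, queue_pop and QUEUE_CAP are hypothetical names. The point it shows is that pthread_cond_wait drops the mutex while sleeping and holds it again when it returns, so every path out of the consumer has to unlock, otherwise the producer blocks in pthread_mutex_lock:

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>

#define QUEUE_CAP 8 /* hypothetical capacity, for illustration only */

struct sync_queue {
    pthread_mutex_t lock;
    pthread_cond_t nonempty;
    int items[QUEUE_CAP];
    int head;  /* index of the oldest element */
    int count; /* number of stored elements */
};

int queue_init(struct sync_queue *q)
{
    q->head = 0;
    q->count = 0;
    int r = pthread_mutex_init(&q->lock, NULL);
    if (r != 0)
        return r;
    return pthread_cond_init(&q->nonempty, NULL);
}

/* Producer side. Returns 0, or ENOSPC if the queue is full. */
int queue_push(struct sync_queue *q, int value)
{
    int r = pthread_mutex_lock(&q->lock);
    if (r != 0)
        return r;
    if (q->count == QUEUE_CAP) {
        pthread_mutex_unlock(&q->lock);
        return ENOSPC;
    }
    q->items[(q->head + q->count) % QUEUE_CAP] = value;
    q->count++;
    pthread_cond_signal(&q->nonempty);
    return pthread_mutex_unlock(&q->lock);
}

/* Consumer side. Blocks until an element is available. */
int queue_pop(struct sync_queue *q, int *out)
{
    int r = pthread_mutex_lock(&q->lock);
    if (r != 0)
        return r;
    /* Re-check the predicate in a loop: wakeups may be spurious. */
    while (q->count == 0) {
        /* Releases q->lock while sleeping, re-acquires it before returning. */
        r = pthread_cond_wait(&q->nonempty, &q->lock);
        if (r != 0) {
            pthread_mutex_unlock(&q->lock);
            return r;
        }
    }
    *out = q->items[q->head];
    q->head = (q->head + 1) % QUEUE_CAP;
    q->count--;
    return pthread_mutex_unlock(&q->lock); /* the mutex is held here, so release it */
}
```

In this shape a waiter that has been signaled but cannot get the mutex back, as in the report, implies that some other context still holds it.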
> It seems limited to running under a debugger (or it just happens a lot less without one), for example when the process dlopens libraries, which pauses
> the process. At that time fup.medium is supposed to stay in pthread_cond_wait (which means the dlopens might cause spurious wakeups)
> and to be notified once everything is ready to run.
The debugger might be the catalyst for the issue, or it is actually
causing it. Again, impossible to guess from the given information:
>
> So, if you have any idea how I could narrow it down, I might be able to build a reproducer. Right now I am going to use a timed mutex to at least detect the issue.
- identify the owner of the lock at the time of the deadlock, based on
the data structures - maybe that is already telling the story
- take an ftrace of the situation so that the flow of context switches
and debugger interceptions can be reconstructed
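For the first point, one application-level option is to keep an owner record next to the mutex so it can be read from a debugger or a watchdog at the time of the hang. This is only a sketch under assumptions: it does not read Cobalt's own mutex state, and struct tracked_mutex, the tracked_* functions and the integer task ids are hypothetical names chosen for illustration:

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

#define OWNER_NONE 0 /* hypothetical: real task ids are nonzero values chosen by the application */

struct tracked_mutex {
    pthread_mutex_t lock;
    atomic_int owner; /* id of the task currently recorded as holder, or OWNER_NONE */
};

int tracked_init(struct tracked_mutex *m)
{
    atomic_init(&m->owner, OWNER_NONE);
    return pthread_mutex_init(&m->lock, NULL);
}

int tracked_lock(struct tracked_mutex *m, int task_id)
{
    int r = pthread_mutex_lock(&m->lock);
    if (r == 0)
        atomic_store(&m->owner, task_id); /* record only after the lock is really held */
    return r;
}

int tracked_unlock(struct tracked_mutex *m)
{
    atomic_store(&m->owner, OWNER_NONE); /* clear before releasing */
    return pthread_mutex_unlock(&m->lock);
}

int tracked_cond_wait(pthread_cond_t *cv, struct tracked_mutex *m, int task_id)
{
    atomic_store(&m->owner, OWNER_NONE); /* the wait releases the mutex */
    int r = pthread_cond_wait(cv, &m->lock);
    if (r == 0)
        atomic_store(&m->owner, task_id); /* held again after a successful return */
    return r;
}

/* Reads the record without taking the lock, so it can be called
 * from any thread or evaluated from gdb while the program hangs. */
int tracked_owner(struct tracked_mutex *m)
{
    return atomic_load(&m->owner);
}
```

The record briefly lags the real state between acquiring the mutex and storing the id, so OWNER_NONE on a mutex that cannot be locked would itself be a hint that the holder is stuck in that window or is not going through these wrappers.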
Jan
--
Siemens AG, Corporate Technology, CT RDA IOT SES-DE
Corporate Competence Center Embedded Linux