* Re: Deadlock in call_rcu_thread when destroy rculfhash node with nested rculfhash
       [not found] <CAO6Ho0ea+Mhdoea4b3EL-J4z2wifHPWohq1ps74LQBU+b0-OOQ@mail.gmail.com>
@ 2016-10-19 10:03 ` Evgeniy Ivanov
       [not found] ` <CAO6Ho0e27-JTmUPrTRZfr95Lm=ri+mOSRNr7Av5Ma0t1ucuH6g@mail.gmail.com>
  1 sibling, 0 replies; 6+ messages in thread
From: Evgeniy Ivanov @ 2016-10-19 10:03 UTC (permalink / raw)
  To: lttng-dev



Sorry, I found a partial answer in the docs, which state that cds_lfht_destroy
should not be called from a call_rcu thread context. Why does this
limitation exist?

On Wed, Oct 19, 2016 at 12:56 PM, Evgeniy Ivanov <i@eivanov.com> wrote:

> Hi,
>
> Each node of the top-level rculfhash has a nested rculfhash. Some thread clears
> the top-level map and then uses rcu_barrier() to wait until everything is
> destroyed (this is done to check for leaks). Recently it started to deadlock
> sometimes with the following stacks:
>
> Thread1:
>
> __poll
> cds_lfht_destroy    <---- nested map
> ...
> free_Node(rcu_head*)  <----- node of top level map
> call_rcu_thread
>
> Thread2:
>
> syscall
> rcu_barrier_qsbr
> destroy_all
> main
>
>
> Did call_rcu_thread deadlock with the barrier thread? Or is it some kind of
> internal deadlock because of nested maps?
>
>
> --
> Cheers,
> Evgeniy
>



-- 
Cheers,
Evgeniy


* Re: Deadlock in call_rcu_thread when destroy rculfhash node with nested rculfhash
       [not found] ` <CAO6Ho0e27-JTmUPrTRZfr95Lm=ri+mOSRNr7Av5Ma0t1ucuH6g@mail.gmail.com>
@ 2016-10-19 15:03   ` Mathieu Desnoyers
       [not found]   ` <85898265.59079.1476889412542.JavaMail.zimbra@efficios.com>
  1 sibling, 0 replies; 6+ messages in thread
From: Mathieu Desnoyers @ 2016-10-19 15:03 UTC (permalink / raw)
  To: Evgeniy Ivanov; +Cc: lttng-dev



This is because we use call_rcu internally to trigger the hash table 
resize. 

In cds_lfht_destroy, we start by waiting for "in-flight" resizes to complete.
Unfortunately, this requires that the call_rcu worker thread makes progress. If
cds_lfht_destroy is called from the call_rcu worker thread, it will wait
forever.
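
Minimal sketch of the problematic pattern, with made-up type and function
names (only the cds_lfht/call_rcu calls are the real liburcu API):

#include <stdlib.h>
#include <urcu.h>                 /* or the urcu-qsbr flavour, as in the report */
#include <urcu/compiler.h>        /* caa_container_of() */
#include <urcu/rculfhash.h>

/* Hypothetical wrapper queued with call_rcu() to tear down a nested table. */
struct ht_free_work {
        struct rcu_head rcu_head;
        struct cds_lfht *nested;
};

/*
 * Runs in the call_rcu worker thread. cds_lfht_destroy() waits for the
 * pending resize work that is queued on that very same worker thread,
 * so this call never returns.
 */
static void free_nested(struct rcu_head *head)
{
        struct ht_free_work *w =
                caa_container_of(head, struct ht_free_work, rcu_head);

        (void) cds_lfht_destroy(w->nested, NULL);
        free(w);
}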

One alternative would be to implement our own worker thread scheme 
for the rcu HT resize rather than use the call_rcu worker thread. This 
would simplify cds_lfht_destroy requirements a lot. 

Ideally I'd like to re-use all the call_rcu work dispatch/worker handling 
scheme, just as a separate work queue. 
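
Conceptually, something along these lines (an illustrative pthread sketch
with hypothetical names, not an actual implementation):

#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical work item: a resize request would embed one of these. */
struct ht_work {
        struct ht_work *next;
        void (*fn)(struct ht_work *work);
};

/* Hypothetical queue served by its own worker thread, independent of call_rcu. */
struct ht_workqueue {
        pthread_mutex_t lock;
        pthread_cond_t cond;
        struct ht_work *head, *tail;
        bool shutdown;
};

static void *ht_worker(void *arg)
{
        struct ht_workqueue *wq = arg;

        pthread_mutex_lock(&wq->lock);
        for (;;) {
                while (!wq->head && !wq->shutdown)
                        pthread_cond_wait(&wq->cond, &wq->lock);
                if (!wq->head)
                        break;          /* shutdown requested, queue drained */
                struct ht_work *work = wq->head;
                wq->head = work->next;
                if (!wq->head)
                        wq->tail = NULL;
                pthread_mutex_unlock(&wq->lock);
                work->fn(work);         /* run the resize outside the lock */
                pthread_mutex_lock(&wq->lock);
        }
        pthread_mutex_unlock(&wq->lock);
        return NULL;
}

static void ht_queue_work(struct ht_workqueue *wq, struct ht_work *work)
{
        work->next = NULL;
        pthread_mutex_lock(&wq->lock);
        if (wq->tail)
                wq->tail->next = work;
        else
                wq->head = work;
        wq->tail = work;
        pthread_cond_signal(&wq->cond);
        pthread_mutex_unlock(&wq->lock);
}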

Thoughts?

Thanks, 

Mathieu 

----- On Oct 19, 2016, at 6:03 AM, Evgeniy Ivanov <i@eivanov.com> wrote: 

> Sorry, found partial answer in docs which state that cds_lfht_destroy should not
> be called from a call_rcu thread context. Why does this limitation exist?

> On Wed, Oct 19, 2016 at 12:56 PM, Evgeniy Ivanov <i@eivanov.com> wrote:

>> Hi,

>> Each node of top level rculfhash has nested rculfhash. Some thread clears the
>> top level map and then uses rcu_barrier() to wait until everything is destroyed
>> (it is done to check leaks). Recently it started to deadlock sometimes with
>> the following stacks:

>> Thread1:

>> __poll
>> cds_lfht_destroy <---- nested map
>> ...
>> free_Node(rcu_head*) <----- node of top level map
>> call_rcu_thread

>> Thread2:

>> syscall
>> rcu_barrier_qsbr
>> destroy_all
>> main

>> Did call_rcu_thread deadlock with the barrier thread? Or is it some kind of
>> internal deadlock because of nested maps?

>> --
>> Cheers,
>> Evgeniy

> --
> Cheers,
> Evgeniy


-- 
Mathieu Desnoyers 
EfficiOS Inc. 
http://www.efficios.com 


* Re: Deadlock in call_rcu_thread when destroy rculfhash node with nested rculfhash
       [not found]   ` <85898265.59079.1476889412542.JavaMail.zimbra@efficios.com>
@ 2016-10-21  8:19     ` Evgeniy Ivanov
       [not found]     ` <CAO6Ho0d+LkJi_2ebomx13D42CApEG-bakyTbtzvz+Jvxc4wy9A@mail.gmail.com>
  1 sibling, 0 replies; 6+ messages in thread
From: Evgeniy Ivanov @ 2016-10-21  8:19 UTC (permalink / raw)
  To: Mathieu Desnoyers; +Cc: lttng-dev



On Wed, Oct 19, 2016 at 6:03 PM, Mathieu Desnoyers <
mathieu.desnoyers@efficios.com> wrote:

> This is because we use call_rcu internally to trigger the hash table
> resize.
>
> In cds_lfht_destroy, we start by waiting for "in-flight" resize to
> complete.
> Unfortunately, this requires that call_rcu worker thread progresses. If
> cds_lfht_destroy is called from the call_rcu worker thread, it will wait
> forever.
>
> One alternative would be to implement our own worker thread scheme
> for the rcu HT resize rather than use the call_rcu worker thread. This
> would simplify cds_lfht_destroy requirements a lot.
>
> Ideally I'd like to re-use all the call_rcu work dispatch/worker handling
> scheme, just as a separate work queue.
>
> Thoughts ?
>

Thank you for explaining. Sounds like a plan: in our prod there is no issue
with having an extra thread for table resizes. And nested tables are an
important feature.



>
> Thanks,
>
> Mathieu
>
> ----- On Oct 19, 2016, at 6:03 AM, Evgeniy Ivanov <i@eivanov.com> wrote:
>
> Sorry, found partial answer in docs which state that cds_lfht_destroy
> should not be called from a call_rcu thread context. Why does this
> limitation exist?
>
> On Wed, Oct 19, 2016 at 12:56 PM, Evgeniy Ivanov <i@eivanov.com> wrote:
>
>> Hi,
>>
>> Each node of top level rculfhash has nested rculfhash. Some thread clears
>> the top level map and then uses rcu_barrier() to wait until everything is
>> destroyed (it is done to check leaks). Recently it started to deadlock
>> sometimes with the following stacks:
>>
>> Thread1:
>>
>> __poll
>> cds_lfht_destroy    <---- nested map
>> ...
>> free_Node(rcu_head*)  <----- node of top level map
>> call_rcu_thread
>>
>> Thread2:
>>
>> syscall
>> rcu_barrier_qsbr
>> destroy_all
>> main
>>
>>
>> Did call_rcu_thread deadlock with the barrier thread? Or is it some kind of
>> internal deadlock because of nested maps?
>>
>>
>> --
>> Cheers,
>> Evgeniy
>>
>
>
>
> --
> Cheers,
> Evgeniy
>
>
>
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com
>
>
>


-- 
Cheers,
Evgeniy


* Re: Deadlock in call_rcu_thread when destroy rculfhash node with nested rculfhash
       [not found]     ` <CAO6Ho0d+LkJi_2ebomx13D42CApEG-bakyTbtzvz+Jvxc4wy9A@mail.gmail.com>
@ 2017-06-01 13:01       ` Mathieu Desnoyers
       [not found]       ` <5908376.2670.1496322081960.JavaMail.zimbra@efficios.com>
  1 sibling, 0 replies; 6+ messages in thread
From: Mathieu Desnoyers @ 2017-06-01 13:01 UTC (permalink / raw)
  To: Evgeniy Ivanov; +Cc: lttng-dev



----- On Oct 21, 2016, at 4:19 AM, Evgeniy Ivanov <i@eivanov.com> wrote: 

> On Wed, Oct 19, 2016 at 6:03 PM, Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:

>> This is because we use call_rcu internally to trigger the hash table
>> resize.

>> In cds_lfht_destroy, we start by waiting for "in-flight" resize to complete.
>> Unfortunately, this requires that call_rcu worker thread progresses. If
>> cds_lfht_destroy is called from the call_rcu worker thread, it will wait
>> forever.

>> One alternative would be to implement our own worker thread scheme
>> for the rcu HT resize rather than use the call_rcu worker thread. This
>> would simplify cds_lfht_destroy requirements a lot.

>> Ideally I'd like to re-use all the call_rcu work dispatch/worker handling
>> scheme, just as a separate work queue.

>> Thoughts ?

> Thank you for explaining. Sounds like a plan: in our prod there is no issue with
> having an extra thread for table resizes. And nested tables are an important feature.

I finally managed to find some time to implement a solution, feedback 
would be welcome! 

Here are the RFC patches: 

https://lists.lttng.org/pipermail/lttng-dev/2017-May/027183.html 
https://lists.lttng.org/pipermail/lttng-dev/2017-May/027184.html 

Thanks, 

Mathieu 

>> Thanks,

>> Mathieu

>> ----- On Oct 19, 2016, at 6:03 AM, Evgeniy Ivanov <i@eivanov.com> wrote:

>>> Sorry, found partial answer in docs which state that cds_lfht_destroy should not
>>> be called from a call_rcu thread context. Why does this limitation exist?

>>> On Wed, Oct 19, 2016 at 12:56 PM, Evgeniy Ivanov <i@eivanov.com> wrote:

>>>> Hi,

>>>> Each node of top level rculfhash has nested rculfhash. Some thread clears the
>>>> top level map and then uses rcu_barrier() to wait until everything is destroyed
>>>> (it is done to check leaks). Recently it started to deadlock sometimes with
>>>> the following stacks:

>>>> Thread1:

>>>> __poll
>>>> cds_lfht_destroy <---- nested map
>>>> ...
>>>> free_Node(rcu_head*) <----- node of top level map
>>>> call_rcu_thread

>>>> Thread2:

>>>> syscall
>>>> rcu_barrier_qsbr
>>>> destroy_all
>>>> main

>>>> Did call_rcu_thread deadlock with the barrier thread? Or is it some kind of
>>>> internal deadlock because of nested maps?

>>>> --
>>>> Cheers,
>>>> Evgeniy

>>> --
>>> Cheers,
>>> Evgeniy


>> --
>> Mathieu Desnoyers
>> EfficiOS Inc.
>> http://www.efficios.com


> --
> Cheers,
> Evgeniy


-- 
Mathieu Desnoyers 
EfficiOS Inc. 
http://www.efficios.com 


* Re: Deadlock in call_rcu_thread when destroy rculfhash node with nested rculfhash
       [not found]       ` <5908376.2670.1496322081960.JavaMail.zimbra@efficios.com>
@ 2017-06-07 22:05         ` Mathieu Desnoyers
  0 siblings, 0 replies; 6+ messages in thread
From: Mathieu Desnoyers @ 2017-06-07 22:05 UTC (permalink / raw)
  To: Evgeniy Ivanov; +Cc: lttng-dev



----- On Jun 1, 2017, at 9:01 AM, Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote: 

> ----- On Oct 21, 2016, at 4:19 AM, Evgeniy Ivanov <i@eivanov.com> wrote:

>> On Wed, Oct 19, 2016 at 6:03 PM, Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:

>>> This is because we use call_rcu internally to trigger the hash table
>>> resize.

>>> In cds_lfht_destroy, we start by waiting for "in-flight" resize to complete.
>>> Unfortunately, this requires that call_rcu worker thread progresses. If
>>> cds_lfht_destroy is called from the call_rcu worker thread, it will wait
>>> forever.

>>> One alternative would be to implement our own worker thread scheme
>>> for the rcu HT resize rather than use the call_rcu worker thread. This
>>> would simplify cds_lfht_destroy requirements a lot.

>>> Ideally I'd like to re-use all the call_rcu work dispatch/worker handling
>>> scheme, just as a separate work queue.

>>> Thoughts ?

>> Thank you for explaining. Sounds like a plan: in our prod there is no issue with
>> having an extra thread for table resizes. And nested tables are an important feature.

> I finally managed to find some time to implement a solution, feedback
> would be welcome!

> Here are the RFC patches:

> https://lists.lttng.org/pipermail/lttng-dev/2017-May/027183.html
> https://lists.lttng.org/pipermail/lttng-dev/2017-May/027184.html

Just merged commits derived from those patches into the liburcu master branch.

Thanks, 

Mathieu 

> Thanks,

> Mathieu

>>> Thanks,

>>> Mathieu

>>> ----- On Oct 19, 2016, at 6:03 AM, Evgeniy Ivanov <i@eivanov.com> wrote:

>>>> Sorry, found partial answer in docs which state that cds_lfht_destroy should not
>>>> be called from a call_rcu thread context. Why does this limitation exist?

>>>> On Wed, Oct 19, 2016 at 12:56 PM, Evgeniy Ivanov <i@eivanov.com> wrote:

>>>>> Hi,

>>>>> Each node of top level rculfhash has nested rculfhash. Some thread clears the
>>>>> top level map and then uses rcu_barrier() to wait until everything is destroyed
>>>>> (it is done to check leaks). Recently it started to deadlock sometimes with
>>>>> the following stacks:

>>>>> Thread1:

>>>>> __poll
>>>>> cds_lfht_destroy <---- nested map
>>>>> ...
>>>>> free_Node(rcu_head*) <----- node of top level map
>>>>> call_rcu_thread

>>>>> Thread2:

>>>>> syscall
>>>>> rcu_barrier_qsbr
>>>>> destroy_all
>>>>> main

>>>>> Did call_rcu_thread deadlock with the barrier thread? Or is it some kind of
>>>>> internal deadlock because of nested maps?

>>>>> --
>>>>> Cheers,
>>>>> Evgeniy

>>>> --
>>>> Cheers,
>>>> Evgeniy


>>> --
>>> Mathieu Desnoyers
>>> EfficiOS Inc.
>>> http://www.efficios.com


>> --
>> Cheers,
>> Evgeniy


> --
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com


-- 
Mathieu Desnoyers 
EfficiOS Inc. 
http://www.efficios.com 


* Deadlock in call_rcu_thread when destroy rculfhash node with nested rculfhash
@ 2016-10-19  9:56 Evgeniy Ivanov
  0 siblings, 0 replies; 6+ messages in thread
From: Evgeniy Ivanov @ 2016-10-19  9:56 UTC (permalink / raw)
  To: lttng-dev



Hi,

Each node of the top-level rculfhash has a nested rculfhash. Some thread clears
the top-level map and then uses rcu_barrier() to wait until everything is
destroyed (this is done to check for leaks). Recently it started to deadlock
sometimes with the following stacks:

Thread1:

__poll
cds_lfht_destroy    <---- nested map
...
free_Node(rcu_head*)  <----- node of top level map
call_rcu_thread

Thread2:

syscall
rcu_barrier_qsbr
destroy_all
main


Did call_rcu_thread deadlock with the barrier thread? Or is it some kind of
internal deadlock because of nested maps?
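
The teardown roughly looks like this (a simplified, illustrative sketch; type
and function names are not the exact production code):

#include <stdlib.h>
#include <urcu-qsbr.h>            /* QSBR flavour, per rcu_barrier_qsbr above */
#include <urcu/compiler.h>        /* caa_container_of() */
#include <urcu/rculfhash.h>

/* Illustrative top-level entry: each one owns a nested hash table. */
struct top_node {
        struct cds_lfht_node lfht_node;
        struct cds_lfht *nested;
        struct rcu_head rcu_head;
};

/* call_rcu() callback: destroys the nested table, then frees the node. */
static void free_Node(struct rcu_head *head)
{
        struct top_node *node =
                caa_container_of(head, struct top_node, rcu_head);

        (void) cds_lfht_destroy(node->nested, NULL);   /* Thread1 hangs here */
        free(node);
}

/* Clears the top-level map, then waits for all deferred frees (leak check). */
static void destroy_all(struct cds_lfht *top)
{
        struct cds_lfht_iter iter;
        struct top_node *node;

        rcu_read_lock();
        cds_lfht_for_each_entry(top, &iter, node, lfht_node) {
                if (!cds_lfht_del(top, &node->lfht_node))
                        call_rcu(&node->rcu_head, free_Node);
        }
        rcu_read_unlock();

        rcu_barrier();             /* Thread2 blocks here while Thread1 hangs */
        (void) cds_lfht_destroy(top, NULL);
}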


-- 
Cheers,
Evgeniy

