All of lore.kernel.org
 help / color / mirror / Atom feed
* Re: Lttng Soft Hang issue
       [not found] <CACd+_b7pakrUPzfzj=kXYYtkezpwxOLUgGs_Hy7yMYOh-f-PqQ@mail.gmail.com>
@ 2015-05-27 10:19 ` Aravind HT
       [not found] ` <CACd+_b4MtdbAVjHCeQ6EnK5J3KRYzMK7Niq_c18pcKcdpnq07w@mail.gmail.com>
  1 sibling, 0 replies; 11+ messages in thread
From: Aravind HT @ 2015-05-27 10:19 UTC (permalink / raw)
  To: lttng-dev, jdesfossez


[-- Attachment #1.1: Type: text/plain, Size: 2603 bytes --]

Request someone to kindly help with this. Blocked at this point, unable to
continue as any application crash leads to lttng not working.

Thanks,
Aravind.

On Thu, May 21, 2015 at 12:18 AM, Aravind HT <aravind.ht@gmail.com> wrote:

> Hi,
>
>
>
> I have recently started trying lttng 2.6 for a few applications on my
> Ubuntu 12.04 and I noticed that the health check on sessiond and consumerd
> failed soon after starting the session.
>
>
> On investigating, I found that thread_manage_consumer() had exited causing
> an overall health check failure.
>
>
> Here are the sequence of steps that I found contributed to
> *thread_manage_consumer()* exiting.
>
>
> 1.       In *thread_manage_apps() : 1558 , ust_app_unregister(pollfd)* is
> being called. This happens when there is an error detected by *revents =
> LTTNG_POLL_GETEV(&events, i)*
>
> My initial guess here is that as one of my apps has crashed, producing a *LPOLLERR
> | LPOLLHUP | LPOLLRDHUP *to be generated for *epoll()* causing
> *ust_app_unregister()* to be called for that app.
>
>
>
> 2.       In *ust_app_unregister():3154 , close_metadata(registry,
> ua_sess->consumer)  *in which* registry->metadata_closed = 1 *is set*.*
>
>
>
> (2.a) Note:* close_metadata() *also calls* consumer_close_metadata() *which
> sends* LTTNG_CONSUMER_CLOSE_METADATA *and* metadata_key *to the consumerd
> to stop it from further dealing with the concerned app. Somehow this
> doesn’t seem to help*.*
>
>
>
> 3.       Next, I see that the *thread_manage_consumer():1353* for some
> reason has ignored the above 2.a and gets to do request/reply for that app
> by calling*ust_consumer_metadata_request():491 ->
> ust_app_push_metadat(ust_reg, socket,1) *which at line 460 checks for
> *registry->metadata_closed* and returns an *–EPIPE*
>
>
>
> 4.       This *–EPIPE* error cascades all the way back up to
> *thread_manage_consumer():1353* at which point *thread_manage_consumer()* decides
> to exit causing health_check() to fail.
>
>
>
>
>
> So it looks like under some scenario, an application crash could cause the
> lttng some problems.
>
>
>
> I think a possible fix for this scenario, is to instead of 4, send an
> *ERROR* message back to *consumerd()* . This could be done from
> *ust_consumer_metadata_request()* call. Can someone please let me know if
> this is correct and shed more light on the issue ?
>
>
>
>
>
> Please forgive if there are any guideline omissions for posting here from
> my part. This is my first post.
>
>
> Regards,
>
> Aravind.
>

[-- Attachment #1.2: Type: text/html, Size: 5038 bytes --]

[-- Attachment #2: Type: text/plain, Size: 155 bytes --]

_______________________________________________
lttng-dev mailing list
lttng-dev@lists.lttng.org
http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Lttng Soft Hang issue
       [not found] ` <CACd+_b4MtdbAVjHCeQ6EnK5J3KRYzMK7Niq_c18pcKcdpnq07w@mail.gmail.com>
@ 2015-07-06 15:23   ` Jérémie Galarneau
       [not found]   ` <CA+jJMxuHWhy1n9Ty1Q6=eGd7bRe-brh3Mqog1uzw5EXbKDPJ0w@mail.gmail.com>
  1 sibling, 0 replies; 11+ messages in thread
From: Jérémie Galarneau @ 2015-07-06 15:23 UTC (permalink / raw)
  To: Aravind HT; +Cc: lttng-dev, Julien Desfossez


[-- Attachment #1.1: Type: text/plain, Size: 3127 bytes --]

Hi,

What do you observe when this happens? Does the session daemon become
unresponsive?

Jérémie

On Wed, May 27, 2015 at 6:19 AM, Aravind HT <aravind.ht@gmail.com> wrote:

> Request someone to kindly help with this. Blocked at this point, unable to
> continue as any application crash leads to lttng not working.
>
> Thanks,
> Aravind.
>
> On Thu, May 21, 2015 at 12:18 AM, Aravind HT <aravind.ht@gmail.com> wrote:
>
>> Hi,
>>
>>
>>
>> I have recently started trying lttng 2.6 for a few applications on my
>> Ubuntu 12.04 and I noticed that the health check on sessiond and consumerd
>> failed soon after starting the session.
>>
>>
>> On investigating, I found that thread_manage_consumer() had exited
>> causing an overall health check failure.
>>
>>
>> Here are the sequence of steps that I found contributed to
>> *thread_manage_consumer()* exiting.
>>
>>
>> 1.       In *thread_manage_apps() : 1558 , ust_app_unregister(pollfd)* is
>> being called. This happens when there is an error detected by *revents =
>> LTTNG_POLL_GETEV(&events, i)*
>>
>> My initial guess here is that as one of my apps has crashed, producing a *LPOLLERR
>> | LPOLLHUP | LPOLLRDHUP *to be generated for *epoll()* causing
>> *ust_app_unregister()* to be called for that app.
>>
>>
>>
>> 2.       In *ust_app_unregister():3154 , close_metadata(registry,
>> ua_sess->consumer)  *in which* registry->metadata_closed = 1 *is set*.*
>>
>>
>>
>> (2.a) Note:* close_metadata() *also calls* consumer_close_metadata() *which
>> sends* LTTNG_CONSUMER_CLOSE_METADATA *and* metadata_key *to the
>> consumerd to stop it from further dealing with the concerned app. Somehow
>> this doesn’t seem to help*.*
>>
>>
>>
>> 3.       Next, I see that the *thread_manage_consumer():1353* for some
>> reason has ignored the above 2.a and gets to do request/reply for that app
>> by calling*ust_consumer_metadata_request():491 ->
>> ust_app_push_metadat(ust_reg, socket,1) *which at line 460 checks for
>> *registry->metadata_closed* and returns an *–EPIPE*
>>
>>
>>
>> 4.       This *–EPIPE* error cascades all the way back up to
>> *thread_manage_consumer():1353* at which point *thread_manage_consumer()* decides
>> to exit causing health_check() to fail.
>>
>>
>>
>>
>>
>> So it looks like under some scenario, an application crash could cause
>> the lttng some problems.
>>
>>
>>
>> I think a possible fix for this scenario, is to instead of 4, send an
>> *ERROR* message back to *consumerd()* . This could be done from
>> *ust_consumer_metadata_request()* call. Can someone please let me know
>> if this is correct and shed more light on the issue ?
>>
>>
>>
>>
>>
>> Please forgive if there are any guideline omissions for posting here from
>> my part. This is my first post.
>>
>>
>> Regards,
>>
>> Aravind.
>>
>
>
> _______________________________________________
> lttng-dev mailing list
> lttng-dev@lists.lttng.org
> http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev
>
>


-- 
Jérémie Galarneau
EfficiOS Inc.
http://www.efficios.com

[-- Attachment #1.2: Type: text/html, Size: 6132 bytes --]

[-- Attachment #2: Type: text/plain, Size: 155 bytes --]

_______________________________________________
lttng-dev mailing list
lttng-dev@lists.lttng.org
http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Lttng Soft Hang issue
       [not found]   ` <CA+jJMxuHWhy1n9Ty1Q6=eGd7bRe-brh3Mqog1uzw5EXbKDPJ0w@mail.gmail.com>
@ 2015-07-06 16:42     ` Jérémie Galarneau
       [not found]     ` <CA+jJMxt2Dph0yhuE3tQ3aYKyAmX8t1DnmT6kDbnWVSgNN=rmUA@mail.gmail.com>
  1 sibling, 0 replies; 11+ messages in thread
From: Jérémie Galarneau @ 2015-07-06 16:42 UTC (permalink / raw)
  To: Aravind HT; +Cc: lttng-dev, Julien Desfossez


[-- Attachment #1.1: Type: text/plain, Size: 3596 bytes --]

Would you mind testing Mathieu's patch?

I have rebased it on stable-2.6:
git clone https://github.com/jgalar/lttng-tools.git -b hang-fix

Thanks,
Jérémie


On Mon, Jul 6, 2015 at 11:23 AM, Jérémie Galarneau <
jeremie.galarneau@efficios.com> wrote:

> Hi,
>
> What do you observe when this happens? Does the session daemon become
> unresponsive?
>
> Jérémie
>
> On Wed, May 27, 2015 at 6:19 AM, Aravind HT <aravind.ht@gmail.com> wrote:
>
>> Request someone to kindly help with this. Blocked at this point, unable
>> to continue as any application crash leads to lttng not working.
>>
>> Thanks,
>> Aravind.
>>
>> On Thu, May 21, 2015 at 12:18 AM, Aravind HT <aravind.ht@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>>
>>>
>>> I have recently started trying lttng 2.6 for a few applications on my
>>> Ubuntu 12.04 and I noticed that the health check on sessiond and consumerd
>>> failed soon after starting the session.
>>>
>>>
>>> On investigating, I found that thread_manage_consumer() had exited
>>> causing an overall health check failure.
>>>
>>>
>>> Here are the sequence of steps that I found contributed to
>>> *thread_manage_consumer()* exiting.
>>>
>>>
>>> 1.       In *thread_manage_apps() : 1558 , ust_app_unregister(pollfd)* is
>>> being called. This happens when there is an error detected by *revents
>>> = LTTNG_POLL_GETEV(&events, i)*
>>>
>>> My initial guess here is that as one of my apps has crashed, producing a *LPOLLERR
>>> | LPOLLHUP | LPOLLRDHUP *to be generated for *epoll()* causing
>>> *ust_app_unregister()* to be called for that app.
>>>
>>>
>>>
>>> 2.       In *ust_app_unregister():3154 , close_metadata(registry,
>>> ua_sess->consumer)  *in which* registry->metadata_closed = 1 *is set*.*
>>>
>>>
>>>
>>> (2.a) Note:* close_metadata() *also calls* consumer_close_metadata() *which
>>> sends* LTTNG_CONSUMER_CLOSE_METADATA *and* metadata_key *to the
>>> consumerd to stop it from further dealing with the concerned app. Somehow
>>> this doesn’t seem to help*.*
>>>
>>>
>>>
>>> 3.       Next, I see that the *thread_manage_consumer():1353* for some
>>> reason has ignored the above 2.a and gets to do request/reply for that app
>>> by calling*ust_consumer_metadata_request():491 ->
>>> ust_app_push_metadat(ust_reg, socket,1) *which at line 460 checks for
>>> *registry->metadata_closed* and returns an *–EPIPE*
>>>
>>>
>>>
>>> 4.       This *–EPIPE* error cascades all the way back up to
>>> *thread_manage_consumer():1353* at which point
>>> *thread_manage_consumer()* decides to exit causing health_check() to
>>> fail.
>>>
>>>
>>>
>>>
>>>
>>> So it looks like under some scenario, an application crash could cause
>>> the lttng some problems.
>>>
>>>
>>>
>>> I think a possible fix for this scenario, is to instead of 4, send an
>>> *ERROR* message back to *consumerd()* . This could be done from
>>> *ust_consumer_metadata_request()* call. Can someone please let me know
>>> if this is correct and shed more light on the issue ?
>>>
>>>
>>>
>>>
>>>
>>> Please forgive if there are any guideline omissions for posting here
>>> from my part. This is my first post.
>>>
>>>
>>> Regards,
>>>
>>> Aravind.
>>>
>>
>>
>> _______________________________________________
>> lttng-dev mailing list
>> lttng-dev@lists.lttng.org
>> http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev
>>
>>
>
>
> --
> Jérémie Galarneau
> EfficiOS Inc.
> http://www.efficios.com
>



-- 
Jérémie Galarneau
EfficiOS Inc.
http://www.efficios.com

[-- Attachment #1.2: Type: text/html, Size: 7142 bytes --]

[-- Attachment #2: Type: text/plain, Size: 155 bytes --]

_______________________________________________
lttng-dev mailing list
lttng-dev@lists.lttng.org
http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Lttng Soft Hang issue
       [not found]     ` <CA+jJMxt2Dph0yhuE3tQ3aYKyAmX8t1DnmT6kDbnWVSgNN=rmUA@mail.gmail.com>
@ 2015-07-08  4:19       ` Aravind HT
       [not found]       ` <CACd+_b5v5mpfBqNRQNHiJnA51niK0BchdAgf90VRpSQQpNn_7w@mail.gmail.com>
  1 sibling, 0 replies; 11+ messages in thread
From: Aravind HT @ 2015-07-08  4:19 UTC (permalink / raw)
  To: Jérémie Galarneau; +Cc: lttng-dev, Julien Desfossez


[-- Attachment #1.1: Type: text/plain, Size: 3878 bytes --]

Sure, I will try the fix and update.

On Mon, Jul 6, 2015 at 10:12 PM, Jérémie Galarneau <
jeremie.galarneau@efficios.com> wrote:

> Would you mind testing Mathieu's patch?
>
> I have rebased it on stable-2.6:
> git clone https://github.com/jgalar/lttng-tools.git -b hang-fix
>
> Thanks,
> Jérémie
>
>
> On Mon, Jul 6, 2015 at 11:23 AM, Jérémie Galarneau <
> jeremie.galarneau@efficios.com> wrote:
>
>> Hi,
>>
>> What do you observe when this happens? Does the session daemon become
>> unresponsive?
>>
>> Jérémie
>>
>> On Wed, May 27, 2015 at 6:19 AM, Aravind HT <aravind.ht@gmail.com> wrote:
>>
>>> Request someone to kindly help with this. Blocked at this point, unable
>>> to continue as any application crash leads to lttng not working.
>>>
>>> Thanks,
>>> Aravind.
>>>
>>> On Thu, May 21, 2015 at 12:18 AM, Aravind HT <aravind.ht@gmail.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>>
>>>>
>>>> I have recently started trying lttng 2.6 for a few applications on my
>>>> Ubuntu 12.04 and I noticed that the health check on sessiond and consumerd
>>>> failed soon after starting the session.
>>>>
>>>>
>>>> On investigating, I found that thread_manage_consumer() had exited
>>>> causing an overall health check failure.
>>>>
>>>>
>>>> Here are the sequence of steps that I found contributed to
>>>> *thread_manage_consumer()* exiting.
>>>>
>>>>
>>>> 1.       In *thread_manage_apps() : 1558 , ust_app_unregister(pollfd)* is
>>>> being called. This happens when there is an error detected by *revents
>>>> = LTTNG_POLL_GETEV(&events, i)*
>>>>
>>>> My initial guess here is that as one of my apps has crashed, producing
>>>> a *LPOLLERR | LPOLLHUP | LPOLLRDHUP *to be generated for *epoll()*
>>>>  causing *ust_app_unregister()* to be called for that app.
>>>>
>>>>
>>>>
>>>> 2.       In *ust_app_unregister():3154 , close_metadata(registry,
>>>> ua_sess->consumer)  *in which* registry->metadata_closed = 1 *is set*.*
>>>>
>>>>
>>>>
>>>> (2.a) Note:* close_metadata() *also calls* consumer_close_metadata() *which
>>>> sends* LTTNG_CONSUMER_CLOSE_METADATA *and* metadata_key *to the
>>>> consumerd to stop it from further dealing with the concerned app. Somehow
>>>> this doesn’t seem to help*.*
>>>>
>>>>
>>>>
>>>> 3.       Next, I see that the *thread_manage_consumer():1353* for some
>>>> reason has ignored the above 2.a and gets to do request/reply for that app
>>>> by calling*ust_consumer_metadata_request():491 ->
>>>> ust_app_push_metadat(ust_reg, socket,1) *which at line 460 checks for
>>>> *registry->metadata_closed* and returns an *–EPIPE*
>>>>
>>>>
>>>>
>>>> 4.       This *–EPIPE* error cascades all the way back up to
>>>> *thread_manage_consumer():1353* at which point
>>>> *thread_manage_consumer()* decides to exit causing health_check() to
>>>> fail.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> So it looks like under some scenario, an application crash could cause
>>>> the lttng some problems.
>>>>
>>>>
>>>>
>>>> I think a possible fix for this scenario, is to instead of 4, send an
>>>> *ERROR* message back to *consumerd()* . This could be done from
>>>> *ust_consumer_metadata_request()* call. Can someone please let me know
>>>> if this is correct and shed more light on the issue ?
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> Please forgive if there are any guideline omissions for posting here
>>>> from my part. This is my first post.
>>>>
>>>>
>>>> Regards,
>>>>
>>>> Aravind.
>>>>
>>>
>>>
>>> _______________________________________________
>>> lttng-dev mailing list
>>> lttng-dev@lists.lttng.org
>>> http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev
>>>
>>>
>>
>>
>> --
>> Jérémie Galarneau
>> EfficiOS Inc.
>> http://www.efficios.com
>>
>
>
>
> --
> Jérémie Galarneau
> EfficiOS Inc.
> http://www.efficios.com
>

[-- Attachment #1.2: Type: text/html, Size: 7582 bytes --]

[-- Attachment #2: Type: text/plain, Size: 155 bytes --]

_______________________________________________
lttng-dev mailing list
lttng-dev@lists.lttng.org
http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Lttng Soft Hang issue
       [not found]       ` <CACd+_b5v5mpfBqNRQNHiJnA51niK0BchdAgf90VRpSQQpNn_7w@mail.gmail.com>
@ 2015-07-11  7:26         ` Aravind HT
       [not found]         ` <CACd+_b7Gz6be-+0cc4gnQi92hSoPDDuu7fWw6t=Qn19WdNc2jA@mail.gmail.com>
  1 sibling, 0 replies; 11+ messages in thread
From: Aravind HT @ 2015-07-11  7:26 UTC (permalink / raw)
  To: Jérémie Galarneau; +Cc: lttng-dev, Julien Desfossez


[-- Attachment #1.1: Type: text/plain, Size: 4388 bytes --]

To try this fix, I also needed the changes talked about in
http://lists.lttng.org/pipermail/lttng-dev/2015-July/024689.html ,
otherwise, I would see relayd coring.
Once I took them I could see that the soft hang issue was no longer
reproducible.

Thanks for helping.

On Wed, Jul 8, 2015 at 9:49 AM, Aravind HT <aravind.ht@gmail.com> wrote:

> Sure, I will try the fix and update.
>
> On Mon, Jul 6, 2015 at 10:12 PM, Jérémie Galarneau <
> jeremie.galarneau@efficios.com> wrote:
>
>> Would you mind testing Mathieu's patch?
>>
>> I have rebased it on stable-2.6:
>> git clone https://github.com/jgalar/lttng-tools.git -b hang-fix
>>
>> Thanks,
>> Jérémie
>>
>>
>> On Mon, Jul 6, 2015 at 11:23 AM, Jérémie Galarneau <
>> jeremie.galarneau@efficios.com> wrote:
>>
>>> Hi,
>>>
>>> What do you observe when this happens? Does the session daemon become
>>> unresponsive?
>>>
>>> Jérémie
>>>
>>> On Wed, May 27, 2015 at 6:19 AM, Aravind HT <aravind.ht@gmail.com>
>>> wrote:
>>>
>>>> Request someone to kindly help with this. Blocked at this point, unable
>>>> to continue as any application crash leads to lttng not working.
>>>>
>>>> Thanks,
>>>> Aravind.
>>>>
>>>> On Thu, May 21, 2015 at 12:18 AM, Aravind HT <aravind.ht@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>>
>>>>>
>>>>> I have recently started trying lttng 2.6 for a few applications on my
>>>>> Ubuntu 12.04 and I noticed that the health check on sessiond and consumerd
>>>>> failed soon after starting the session.
>>>>>
>>>>>
>>>>> On investigating, I found that thread_manage_consumer() had exited
>>>>> causing an overall health check failure.
>>>>>
>>>>>
>>>>> Here are the sequence of steps that I found contributed to
>>>>> *thread_manage_consumer()* exiting.
>>>>>
>>>>>
>>>>> 1.       In *thread_manage_apps() : 1558 , ust_app_unregister(pollfd)* is
>>>>> being called. This happens when there is an error detected by *revents
>>>>> = LTTNG_POLL_GETEV(&events, i)*
>>>>>
>>>>> My initial guess here is that as one of my apps has crashed, producing
>>>>> a *LPOLLERR | LPOLLHUP | LPOLLRDHUP *to be generated for *epoll()*
>>>>>  causing *ust_app_unregister()* to be called for that app.
>>>>>
>>>>>
>>>>>
>>>>> 2.       In *ust_app_unregister():3154 , close_metadata(registry,
>>>>> ua_sess->consumer)  *in which* registry->metadata_closed = 1 *is set
>>>>> *.*
>>>>>
>>>>>
>>>>>
>>>>> (2.a) Note:* close_metadata() *also calls* consumer_close_metadata() *which
>>>>> sends* LTTNG_CONSUMER_CLOSE_METADATA *and* metadata_key *to the
>>>>> consumerd to stop it from further dealing with the concerned app. Somehow
>>>>> this doesn’t seem to help*.*
>>>>>
>>>>>
>>>>>
>>>>> 3.       Next, I see that the *thread_manage_consumer():1353* for
>>>>> some reason has ignored the above 2.a and gets to do request/reply for that
>>>>> app by calling*ust_consumer_metadata_request():491 ->
>>>>> ust_app_push_metadat(ust_reg, socket,1) *which at line 460 checks for
>>>>> *registry->metadata_closed* and returns an *–EPIPE*
>>>>>
>>>>>
>>>>>
>>>>> 4.       This *–EPIPE* error cascades all the way back up to
>>>>> *thread_manage_consumer():1353* at which point
>>>>> *thread_manage_consumer()* decides to exit causing health_check() to
>>>>> fail.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> So it looks like under some scenario, an application crash could cause
>>>>> the lttng some problems.
>>>>>
>>>>>
>>>>>
>>>>> I think a possible fix for this scenario, is to instead of 4, send an
>>>>> *ERROR* message back to *consumerd()* . This could be done from
>>>>> *ust_consumer_metadata_request()* call. Can someone please let me
>>>>> know if this is correct and shed more light on the issue ?
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Please forgive if there are any guideline omissions for posting here
>>>>> from my part. This is my first post.
>>>>>
>>>>>
>>>>> Regards,
>>>>>
>>>>> Aravind.
>>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> lttng-dev mailing list
>>>> lttng-dev@lists.lttng.org
>>>> http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev
>>>>
>>>>
>>>
>>>
>>> --
>>> Jérémie Galarneau
>>> EfficiOS Inc.
>>> http://www.efficios.com
>>>
>>
>>
>>
>> --
>> Jérémie Galarneau
>> EfficiOS Inc.
>> http://www.efficios.com
>>
>
>

[-- Attachment #1.2: Type: text/html, Size: 8357 bytes --]

[-- Attachment #2: Type: text/plain, Size: 155 bytes --]

_______________________________________________
lttng-dev mailing list
lttng-dev@lists.lttng.org
http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Lttng Soft Hang issue
       [not found]         ` <CACd+_b7Gz6be-+0cc4gnQi92hSoPDDuu7fWw6t=Qn19WdNc2jA@mail.gmail.com>
@ 2015-07-13 15:34           ` Jérémie Galarneau
       [not found]           ` <CA+jJMxtyOqTDSh3iphDAE4qPx0+C8jJ3Txi1j+ouxHx0guuXng@mail.gmail.com>
  1 sibling, 0 replies; 11+ messages in thread
From: Jérémie Galarneau @ 2015-07-13 15:34 UTC (permalink / raw)
  To: Aravind HT; +Cc: lttng-dev, Julien Desfossez


[-- Attachment #1.1: Type: text/plain, Size: 4966 bytes --]

Do you have a specific changeset in mind when you say you applied the
changes mentioned in that thread?

Do you simply revert the following commit?
https://github.com/lttng/lttng-tools/commit/1dc0526df43f2b5f86ef451e4c0331445346b15f

Thanks,
Jérémie

On Sat, Jul 11, 2015 at 3:26 AM, Aravind HT <aravind.ht@gmail.com> wrote:

> To try this fix, I also needed the changes talked about in
> http://lists.lttng.org/pipermail/lttng-dev/2015-July/024689.html ,
> otherwise, I would see relayd coring.
> Once I took them I could see that the soft hang issue was no longer
> reproducible.
>
> Thanks for helping.
>
> On Wed, Jul 8, 2015 at 9:49 AM, Aravind HT <aravind.ht@gmail.com> wrote:
>
>> Sure, I will try the fix and update.
>>
>> On Mon, Jul 6, 2015 at 10:12 PM, Jérémie Galarneau <
>> jeremie.galarneau@efficios.com> wrote:
>>
>>> Would you mind testing Mathieu's patch?
>>>
>>> I have rebased it on stable-2.6:
>>> git clone https://github.com/jgalar/lttng-tools.git -b hang-fix
>>>
>>> Thanks,
>>> Jérémie
>>>
>>>
>>> On Mon, Jul 6, 2015 at 11:23 AM, Jérémie Galarneau <
>>> jeremie.galarneau@efficios.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> What do you observe when this happens? Does the session daemon become
>>>> unresponsive?
>>>>
>>>> Jérémie
>>>>
>>>> On Wed, May 27, 2015 at 6:19 AM, Aravind HT <aravind.ht@gmail.com>
>>>> wrote:
>>>>
>>>>> Request someone to kindly help with this. Blocked at this point,
>>>>> unable to continue as any application crash leads to lttng not working.
>>>>>
>>>>> Thanks,
>>>>> Aravind.
>>>>>
>>>>> On Thu, May 21, 2015 at 12:18 AM, Aravind HT <aravind.ht@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>>
>>>>>>
>>>>>> I have recently started trying lttng 2.6 for a few applications on my
>>>>>> Ubuntu 12.04 and I noticed that the health check on sessiond and consumerd
>>>>>> failed soon after starting the session.
>>>>>>
>>>>>>
>>>>>> On investigating, I found that thread_manage_consumer() had exited
>>>>>> causing an overall health check failure.
>>>>>>
>>>>>>
>>>>>> Here are the sequence of steps that I found contributed to
>>>>>> *thread_manage_consumer()* exiting.
>>>>>>
>>>>>>
>>>>>> 1.       In *thread_manage_apps() : 1558 ,
>>>>>> ust_app_unregister(pollfd)* is being called. This happens when there
>>>>>> is an error detected by *revents = LTTNG_POLL_GETEV(&events, i)*
>>>>>>
>>>>>> My initial guess here is that as one of my apps has crashed,
>>>>>> producing a *LPOLLERR | LPOLLHUP | LPOLLRDHUP *to be generated for
>>>>>> *epoll()* causing *ust_app_unregister()* to be called for that app.
>>>>>>
>>>>>>
>>>>>>
>>>>>> 2.       In *ust_app_unregister():3154 , close_metadata(registry,
>>>>>> ua_sess->consumer)  *in which* registry->metadata_closed = 1 *is set
>>>>>> *.*
>>>>>>
>>>>>>
>>>>>>
>>>>>> (2.a) Note:* close_metadata() *also calls
>>>>>> * consumer_close_metadata() *which sends
>>>>>> * LTTNG_CONSUMER_CLOSE_METADATA *and* metadata_key *to the consumerd
>>>>>> to stop it from further dealing with the concerned app. Somehow this
>>>>>> doesn’t seem to help*.*
>>>>>>
>>>>>>
>>>>>>
>>>>>> 3.       Next, I see that the *thread_manage_consumer():1353* for
>>>>>> some reason has ignored the above 2.a and gets to do request/reply for that
>>>>>> app by calling*ust_consumer_metadata_request():491 ->
>>>>>> ust_app_push_metadat(ust_reg, socket,1) *which at line 460 checks
>>>>>> for *registry->metadata_closed* and returns an *–EPIPE*
>>>>>>
>>>>>>
>>>>>>
>>>>>> 4.       This *–EPIPE* error cascades all the way back up to
>>>>>> *thread_manage_consumer():1353* at which point
>>>>>> *thread_manage_consumer()* decides to exit causing health_check() to
>>>>>> fail.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> So it looks like under some scenario, an application crash could
>>>>>> cause the lttng some problems.
>>>>>>
>>>>>>
>>>>>>
>>>>>> I think a possible fix for this scenario, is to instead of 4, send an
>>>>>> *ERROR* message back to *consumerd()* . This could be done from
>>>>>> *ust_consumer_metadata_request()* call. Can someone please let me
>>>>>> know if this is correct and shed more light on the issue ?
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> Please forgive if there are any guideline omissions for posting here
>>>>>> from my part. This is my first post.
>>>>>>
>>>>>>
>>>>>> Regards,
>>>>>>
>>>>>> Aravind.
>>>>>>
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> lttng-dev mailing list
>>>>> lttng-dev@lists.lttng.org
>>>>> http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Jérémie Galarneau
>>>> EfficiOS Inc.
>>>> http://www.efficios.com
>>>>
>>>
>>>
>>>
>>> --
>>> Jérémie Galarneau
>>> EfficiOS Inc.
>>> http://www.efficios.com
>>>
>>
>>
>


-- 
Jérémie Galarneau
EfficiOS Inc.
http://www.efficios.com

[-- Attachment #1.2: Type: text/html, Size: 9697 bytes --]

[-- Attachment #2: Type: text/plain, Size: 155 bytes --]

_______________________________________________
lttng-dev mailing list
lttng-dev@lists.lttng.org
http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Lttng Soft Hang issue
       [not found]           ` <CA+jJMxtyOqTDSh3iphDAE4qPx0+C8jJ3Txi1j+ouxHx0guuXng@mail.gmail.com>
@ 2015-07-13 18:47             ` Aravind HT
       [not found]             ` <CACd+_b5dwMRdR8wjzG0S5wpwtuB+qx_Q4Y0RHA2GDw7ZZgny+A@mail.gmail.com>
  1 sibling, 0 replies; 11+ messages in thread
From: Aravind HT @ 2015-07-13 18:47 UTC (permalink / raw)
  To: Jérémie Galarneau; +Cc: lttng-dev, Julien Desfossez


[-- Attachment #1.1: Type: text/plain, Size: 7742 bytes --]

The specific change in
http://git.lttng.org/?p=lttng-tools.git;a=commitdiff;h=cd2ef1ef1d54ced9e4d0d03b865bb7fc6a905f80
that I need is for cds_list_del(&stream->recv_list); to be called.
Below is the patch extract.

diff --git a/src/bin/lttng-relayd/main.c
<http://git.lttng.org/?p=lttng-tools.git;a=blob;f=src/bin/lttng-relayd/main.c;h=a93151ac47f560875f07c9cbf113225f9dd892cc>
b/src/bin/lttng-relayd/main.c
<http://git.lttng.org/?p=lttng-tools.git;a=blob;f=src/bin/lttng-relayd/main.c;h=d82a3412d7b8578be8f5b39abf0a9fd949ff07bc;hb=cd2ef1ef1d54ced9e4d0d03b865bb7fc6a905f80>

index a93151a
<http://git.lttng.org/?p=lttng-tools.git;a=blob;f=src/bin/lttng-relayd/main.c;h=a93151ac47f560875f07c9cbf113225f9dd892cc>
..d82a341
<http://git.lttng.org/?p=lttng-tools.git;a=blob;f=src/bin/lttng-relayd/main.c;h=d82a3412d7b8578be8f5b39abf0a9fd949ff07bc;hb=cd2ef1ef1d54ced9e4d0d03b865bb7fc6a905f80>
100644 (file)
--- a/src/bin/lttng-relayd/main.c
<http://git.lttng.org/?p=lttng-tools.git;a=blob;f=src/bin/lttng-relayd/main.c;h=a93151ac47f560875f07c9cbf113225f9dd892cc>
+++ b/src/bin/lttng-relayd/main.c
<http://git.lttng.org/?p=lttng-tools.git;a=blob;f=src/bin/lttng-relayd/main.c;h=d82a3412d7b8578be8f5b39abf0a9fd949ff07bc;hb=cd2ef1ef1d54ced9e4d0d03b865bb7fc6a905f80>
@@ -1340,6
<http://git.lttng.org/?p=lttng-tools.git;a=blob;f=src/bin/lttng-relayd/main.c;h=a93151ac47f560875f07c9cbf113225f9dd892cc#l1340>
+1340,18
<http://git.lttng.org/?p=lttng-tools.git;a=blob;f=src/bin/lttng-relayd/main.c;h=d82a3412d7b8578be8f5b39abf0a9fd949ff07bc;hb=cd2ef1ef1d54ced9e4d0d03b865bb7fc6a905f80#l1340>
@@ int relay_close_stream(struct lttcomm_relayd_hdr *recv_hdr,
        session->stream_count--;
        assert(session->stream_count >= 0);

+       /*
+        * Remove the stream from the connection recv list since we are about to
+        * flag it invalid and thus might be freed. This has to be
done here since
+        * only the control thread can do actions on that list.
+        *
+        * Note that this stream might NOT be in the list but we have to try to
+        * remove it here else this can race with the stream destruction freeing
+        * the object and the connection destroy doing a use after free when
+        * deleting the remaining nodes in this list.
+        */
+       cds_list_del(&stream->recv_list);
+
        /* Check if we can close it or else the data will do it. */
        try_close_stream(session, stream);






On Mon, Jul 13, 2015 at 9:04 PM, Jérémie Galarneau <
jeremie.galarneau@efficios.com> wrote:

> Do you have a specific changeset in mind when you say you applied the
> changes mentioned in that thread?
>
> Do you simply revert the following commit?
>
> https://github.com/lttng/lttng-tools/commit/1dc0526df43f2b5f86ef451e4c0331445346b15f
>
> Thanks,
> Jérémie
>
>
> On Sat, Jul 11, 2015 at 3:26 AM, Aravind HT <aravind.ht@gmail.com> wrote:
>
>> To try this fix, I also needed the changes talked about in
>> http://lists.lttng.org/pipermail/lttng-dev/2015-July/024689.html ,
>> otherwise, I would see relayd coring.
>> Once I took them I could see that the soft hang issue was no longer
>> reproducible.
>>
>> Thanks for helping.
>>
>> On Wed, Jul 8, 2015 at 9:49 AM, Aravind HT <aravind.ht@gmail.com> wrote:
>>
>>> Sure, I will try the fix and update.
>>>
>>> On Mon, Jul 6, 2015 at 10:12 PM, Jérémie Galarneau <
>>> jeremie.galarneau@efficios.com> wrote:
>>>
>>>> Would you mind testing Mathieu's patch?
>>>>
>>>> I have rebased it on stable-2.6:
>>>> git clone https://github.com/jgalar/lttng-tools.git -b hang-fix
>>>>
>>>> Thanks,
>>>> Jérémie
>>>>
>>>>
>>>> On Mon, Jul 6, 2015 at 11:23 AM, Jérémie Galarneau <
>>>> jeremie.galarneau@efficios.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> What do you observe when this happens? Does the session daemon become
>>>>> unresponsive?
>>>>>
>>>>> Jérémie
>>>>>
>>>>> On Wed, May 27, 2015 at 6:19 AM, Aravind HT <aravind.ht@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Request someone to kindly help with this. Blocked at this point,
>>>>>> unable to continue as any application crash leads to lttng not working.
>>>>>>
>>>>>> Thanks,
>>>>>> Aravind.
>>>>>>
>>>>>> On Thu, May 21, 2015 at 12:18 AM, Aravind HT <aravind.ht@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> I have recently started trying lttng 2.6 for a few applications on
>>>>>>> my Ubuntu 12.04 and I noticed that the health check on sessiond and
>>>>>>> consumerd failed soon after starting the session.
>>>>>>>
>>>>>>>
>>>>>>> On investigating, I found that thread_manage_consumer() had exited
>>>>>>> causing an overall health check failure.
>>>>>>>
>>>>>>>
>>>>>>> Here are the sequence of steps that I found contributed to
>>>>>>> *thread_manage_consumer()* exiting.
>>>>>>>
>>>>>>>
>>>>>>> 1.       In *thread_manage_apps() : 1558 ,
>>>>>>> ust_app_unregister(pollfd)* is being called. This happens when
>>>>>>> there is an error detected by *revents = LTTNG_POLL_GETEV(&events,
>>>>>>> i)*
>>>>>>>
>>>>>>> My initial guess here is that as one of my apps has crashed,
>>>>>>> producing a *LPOLLERR | LPOLLHUP | LPOLLRDHUP *to be generated for
>>>>>>> *epoll()* causing *ust_app_unregister()* to be called for that app.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> 2.       In *ust_app_unregister():3154 , close_metadata(registry,
>>>>>>> ua_sess->consumer)  *in which* registry->metadata_closed = 1 *is set
>>>>>>> *.*
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> (2.a) Note:* close_metadata() *also calls
>>>>>>> * consumer_close_metadata() *which sends
>>>>>>> * LTTNG_CONSUMER_CLOSE_METADATA *and* metadata_key *to the
>>>>>>> consumerd to stop it from further dealing with the concerned app. Somehow
>>>>>>> this doesn’t seem to help*.*
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> 3.       Next, I see that the *thread_manage_consumer():1353* for
>>>>>>> some reason has ignored the above 2.a and gets to do request/reply for that
>>>>>>> app by calling*ust_consumer_metadata_request():491 ->
>>>>>>> ust_app_push_metadat(ust_reg, socket,1) *which at line 460 checks
>>>>>>> for *registry->metadata_closed* and returns an *–EPIPE*
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> 4.       This *–EPIPE* error cascades all the way back up to
>>>>>>> *thread_manage_consumer():1353* at which point
>>>>>>> *thread_manage_consumer()* decides to exit causing health_check()
>>>>>>> to fail.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> So it looks like under some scenario, an application crash could
>>>>>>> cause the lttng some problems.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> I think a possible fix for this scenario, is to instead of 4, send
>>>>>>> an *ERROR* message back to *consumerd()* . This could be done from
>>>>>>> *ust_consumer_metadata_request()* call. Can someone please let me
>>>>>>> know if this is correct and shed more light on the issue ?
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Please forgive if there are any guideline omissions for posting here
>>>>>>> from my part. This is my first post.
>>>>>>>
>>>>>>>
>>>>>>> Regards,
>>>>>>>
>>>>>>> Aravind.
>>>>>>>
>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> lttng-dev mailing list
>>>>>> lttng-dev@lists.lttng.org
>>>>>> http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Jérémie Galarneau
>>>>> EfficiOS Inc.
>>>>> http://www.efficios.com
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Jérémie Galarneau
>>>> EfficiOS Inc.
>>>> http://www.efficios.com
>>>>
>>>
>>>
>>
>
>
> --
> Jérémie Galarneau
> EfficiOS Inc.
> http://www.efficios.com
>

[-- Attachment #1.2: Type: text/html, Size: 23195 bytes --]

[-- Attachment #2: Type: text/plain, Size: 155 bytes --]

_______________________________________________
lttng-dev mailing list
lttng-dev@lists.lttng.org
http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Lttng Soft Hang issue
       [not found]             ` <CACd+_b5dwMRdR8wjzG0S5wpwtuB+qx_Q4Y0RHA2GDw7ZZgny+A@mail.gmail.com>
@ 2015-07-13 19:29               ` Aravind HT
       [not found]               ` <CACd+_b6Dz8eriOnPm9n5Ht8f1DJy-Cy3DE80NBCd27WxtDZdUQ@mail.gmail.com>
  1 sibling, 0 replies; 11+ messages in thread
From: Aravind HT @ 2015-07-13 19:29 UTC (permalink / raw)
  To: Jérémie Galarneau; +Cc: lttng-dev, Julien Desfossez


[-- Attachment #1.1: Type: text/plain, Size: 9352 bytes --]

cds_list_del(&stream->recv_list) definitely needs to be called from
relay_close_stream() else conn->recv_head will continue to have a stale
pointer(for streams that got closed) until
the entire connection is destroyed, and conversely,
connection_destruction() means destroying the conn->recv_head list which
means iterating/removing entries of streams ( some of them could be stale,
if relay_close_stream() does not do cds_list_del() ) and in which case,
cds_list_del(&stream->recv_list) needs to be called also.

So removing either of them, I think, is not the good and in this case,
https://github.com/lttng/lttng-tools/commit/1dc0526df43f2b5f86ef451e4c0331445346b15f
could be reverted


 *** However *** ,
there could also be a synchronization issue if relay_close_stream() and
connection_destroy() can happen concurrently if these calls are not
serialized ( stream->recv_list being accessed from two contexts ), need to
verify this before reverting
https://github.com/lttng/lttng-tools/commit/1dc0526df43f2b5f86ef451e4c0331445346b15f


I also do not know the kind of issues being seen/details to prompt the
reversal of
http://git.lttng.org/?p=lttng-tools.git;a=commitdiff;h=cd2ef1ef1d54ced9e4d0d03b865bb7fc6a905f80



On Tue, Jul 14, 2015 at 12:17 AM, Aravind HT <aravind.ht@gmail.com> wrote:

> The specific change in
> http://git.lttng.org/?p=lttng-tools.git;a=commitdiff;h=cd2ef1ef1d54ced9e4d0d03b865bb7fc6a905f80
> that I need is for cds_list_del(&stream->recv_list); to be called.
> Below is the patch extract.
>
> diff --git a/src/bin/lttng-relayd/main.c
> <http://git.lttng.org/?p=lttng-tools.git;a=blob;f=src/bin/lttng-relayd/main.c;h=a93151ac47f560875f07c9cbf113225f9dd892cc>
> b/src/bin/lttng-relayd/main.c
> <http://git.lttng.org/?p=lttng-tools.git;a=blob;f=src/bin/lttng-relayd/main.c;h=d82a3412d7b8578be8f5b39abf0a9fd949ff07bc;hb=cd2ef1ef1d54ced9e4d0d03b865bb7fc6a905f80>
>
> index a93151a
> <http://git.lttng.org/?p=lttng-tools.git;a=blob;f=src/bin/lttng-relayd/main.c;h=a93151ac47f560875f07c9cbf113225f9dd892cc>
> ..d82a341
> <http://git.lttng.org/?p=lttng-tools.git;a=blob;f=src/bin/lttng-relayd/main.c;h=d82a3412d7b8578be8f5b39abf0a9fd949ff07bc;hb=cd2ef1ef1d54ced9e4d0d03b865bb7fc6a905f80>
> 100644 (file)
> --- a/src/bin/lttng-relayd/main.c
> <http://git.lttng.org/?p=lttng-tools.git;a=blob;f=src/bin/lttng-relayd/main.c;h=a93151ac47f560875f07c9cbf113225f9dd892cc>
> +++ b/src/bin/lttng-relayd/main.c
> <http://git.lttng.org/?p=lttng-tools.git;a=blob;f=src/bin/lttng-relayd/main.c;h=d82a3412d7b8578be8f5b39abf0a9fd949ff07bc;hb=cd2ef1ef1d54ced9e4d0d03b865bb7fc6a905f80>
> @@ -1340,6
> <http://git.lttng.org/?p=lttng-tools.git;a=blob;f=src/bin/lttng-relayd/main.c;h=a93151ac47f560875f07c9cbf113225f9dd892cc#l1340>
> +1340,18
> <http://git.lttng.org/?p=lttng-tools.git;a=blob;f=src/bin/lttng-relayd/main.c;h=d82a3412d7b8578be8f5b39abf0a9fd949ff07bc;hb=cd2ef1ef1d54ced9e4d0d03b865bb7fc6a905f80#l1340>
> @@ int relay_close_stream(struct lttcomm_relayd_hdr *recv_hdr,
>         session->stream_count--;
>         assert(session->stream_count >= 0);
>
> +       /*
>
> +        * Remove the stream from the connection recv list since we are about to
>
> +        * flag it invalid and thus might be freed. This has to be done here since
> +        * only the control thread can do actions on that list.
> +        *
>
> +        * Note that this stream might NOT be in the list but we have to try to
>
> +        * remove it here else this can race with the stream destruction freeing
>
> +        * the object and the connection destroy doing a use after free when
> +        * deleting the remaining nodes in this list.
> +        */
> +       cds_list_del(&stream->recv_list);
> +
>         /* Check if we can close it or else the data will do it. */
>         try_close_stream(session, stream);
>
>
>
>
>
>
> On Mon, Jul 13, 2015 at 9:04 PM, Jérémie Galarneau <
> jeremie.galarneau@efficios.com> wrote:
>
>> Do you have a specific changeset in mind when you say you applied the
>> changes mentioned in that thread?
>>
>> Do you simply revert the following commit?
>>
>> https://github.com/lttng/lttng-tools/commit/1dc0526df43f2b5f86ef451e4c0331445346b15f
>>
>> Thanks,
>> Jérémie
>>
>>
>> On Sat, Jul 11, 2015 at 3:26 AM, Aravind HT <aravind.ht@gmail.com> wrote:
>>
>>> To try this fix, I also needed the changes talked about in
>>> http://lists.lttng.org/pipermail/lttng-dev/2015-July/024689.html ,
>>> otherwise, I would see relayd coring.
>>> Once I took them I could see that the soft hang issue was no longer
>>> reproducible.
>>>
>>> Thanks for helping.
>>>
>>> On Wed, Jul 8, 2015 at 9:49 AM, Aravind HT <aravind.ht@gmail.com> wrote:
>>>
>>>> Sure, I will try the fix and update.
>>>>
>>>> On Mon, Jul 6, 2015 at 10:12 PM, Jérémie Galarneau <
>>>> jeremie.galarneau@efficios.com> wrote:
>>>>
>>>>> Would you mind testing Mathieu's patch?
>>>>>
>>>>> I have rebased it on stable-2.6:
>>>>> git clone https://github.com/jgalar/lttng-tools.git -b hang-fix
>>>>>
>>>>> Thanks,
>>>>> Jérémie
>>>>>
>>>>>
>>>>> On Mon, Jul 6, 2015 at 11:23 AM, Jérémie Galarneau <
>>>>> jeremie.galarneau@efficios.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> What do you observe when this happens? Does the session daemon become
>>>>>> unresponsive?
>>>>>>
>>>>>> Jérémie
>>>>>>
>>>>>> On Wed, May 27, 2015 at 6:19 AM, Aravind HT <aravind.ht@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Request someone to kindly help with this. Blocked at this point,
>>>>>>> unable to continue as any application crash leads to lttng not working.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Aravind.
>>>>>>>
>>>>>>> On Thu, May 21, 2015 at 12:18 AM, Aravind HT <aravind.ht@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> I have recently started trying lttng 2.6 for a few applications on
>>>>>>>> my Ubuntu 12.04 and I noticed that the health check on sessiond and
>>>>>>>> consumerd failed soon after starting the session.
>>>>>>>>
>>>>>>>>
>>>>>>>> On investigating, I found that thread_manage_consumer() had exited
>>>>>>>> causing an overall health check failure.
>>>>>>>>
>>>>>>>>
>>>>>>>> Here are the sequence of steps that I found contributed to
>>>>>>>> *thread_manage_consumer()* exiting.
>>>>>>>>
>>>>>>>>
>>>>>>>> 1.       In *thread_manage_apps() : 1558 ,
>>>>>>>> ust_app_unregister(pollfd)* is being called. This happens when
>>>>>>>> there is an error detected by *revents = LTTNG_POLL_GETEV(&events,
>>>>>>>> i)*
>>>>>>>>
>>>>>>>> My initial guess here is that as one of my apps has crashed,
>>>>>>>> producing a *LPOLLERR | LPOLLHUP | LPOLLRDHUP *to be generated for
>>>>>>>> *epoll()* causing *ust_app_unregister()* to be called for that app.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> 2.       In *ust_app_unregister():3154 , close_metadata(registry,
>>>>>>>> ua_sess->consumer)  *in which* registry->metadata_closed = 1 *is
>>>>>>>> set*.*
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> (2.a) Note:* close_metadata() *also calls
>>>>>>>> * consumer_close_metadata() *which sends
>>>>>>>> * LTTNG_CONSUMER_CLOSE_METADATA *and* metadata_key *to the
>>>>>>>> consumerd to stop it from further dealing with the concerned app. Somehow
>>>>>>>> this doesn’t seem to help*.*
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> 3.       Next, I see that the *thread_manage_consumer():1353* for
>>>>>>>> some reason has ignored the above 2.a and gets to do request/reply for that
>>>>>>>> app by calling*ust_consumer_metadata_request():491 ->
>>>>>>>> ust_app_push_metadat(ust_reg, socket,1) *which at line 460 checks
>>>>>>>> for *registry->metadata_closed* and returns an *–EPIPE*
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> 4.       This *–EPIPE* error cascades all the way back up to
>>>>>>>> *thread_manage_consumer():1353* at which point
>>>>>>>> *thread_manage_consumer()* decides to exit causing health_check()
>>>>>>>> to fail.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> So it looks like under some scenario, an application crash could
>>>>>>>> cause the lttng some problems.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> I think a possible fix for this scenario, is to instead of 4, send
>>>>>>>> an *ERROR* message back to *consumerd()* . This could be done from
>>>>>>>> *ust_consumer_metadata_request()* call. Can someone please let me
>>>>>>>> know if this is correct and shed more light on the issue ?
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Please forgive if there are any guideline omissions for posting
>>>>>>>> here from my part. This is my first post.
>>>>>>>>
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>>
>>>>>>>> Aravind.
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> lttng-dev mailing list
>>>>>>> lttng-dev@lists.lttng.org
>>>>>>> http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Jérémie Galarneau
>>>>>> EfficiOS Inc.
>>>>>> http://www.efficios.com
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Jérémie Galarneau
>>>>> EfficiOS Inc.
>>>>> http://www.efficios.com
>>>>>
>>>>
>>>>
>>>
>>
>>
>> --
>> Jérémie Galarneau
>> EfficiOS Inc.
>> http://www.efficios.com
>>
>
>

[-- Attachment #1.2: Type: text/html, Size: 27020 bytes --]

[-- Attachment #2: Type: text/plain, Size: 155 bytes --]

_______________________________________________
lttng-dev mailing list
lttng-dev@lists.lttng.org
http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Lttng Soft Hang issue
       [not found]               ` <CACd+_b6Dz8eriOnPm9n5Ht8f1DJy-Cy3DE80NBCd27WxtDZdUQ@mail.gmail.com>
@ 2015-07-15 14:00                 ` Aravind HT
  0 siblings, 0 replies; 11+ messages in thread
From: Aravind HT @ 2015-07-15 14:00 UTC (permalink / raw)
  To: Jérémie Galarneau; +Cc: lttng-dev, Julien Desfossez


[-- Attachment #1.1: Type: text/plain, Size: 16034 bytes --]

I have gone through a few scenarios and this is what I think of the problem
and my solution is at the bottom.

Please let me know if this is correct.


The brief analysis is:

1.       Whenever a new stream is added

a.      a stream object having path_name, channel_name, recv_list and other
things are allocated and populated.

b.      The address of recv_list, which is a member of the stream object is
added to conn->recv_head list. conn represents a connection that is made up
of multiple streams.



2.       One of the scenarios is that when a stream is closed
RELAYD_CLOSE_STREAM, the memory allocated in 1.a is freed through a call to
ctf_trace_destroy() -> stream_delete() and stream_destroy() . However, the
entry in the conn->recv_head is not removed. This now is a dangling
pointer. The closing of stream object can happen in multiple cases and each
will cause the same dangling pointer problem. The attached .xps document
shows quite a few code paths in which this can happen, raising the
complexity.

3.       Among the use cases that access this conn->recv_head list are
set_view_ready_flag() and addition of a new stream through
relay_add_stream(). Both iterate and try to modify the list. There may be
other cases as well.



In short, there is a linked list which is made of nodes allocated in memory
regions that get freed, but the nodes in this list are not removed at that
point. Any access of this list causes the problems.




Fixes attempted so far:

https://github.com/lttng/lttng-tools/commit/cd2ef1ef1d54ced9e4d0d03b865bb7fc6a905f80

https://github.com/lttng/lttng-tools/commit/1dc0526df43f2b5f86ef451e4c0331445346b15f


The above fixes of removing the node from conn->recv_head using
cds_list_del() attempts to do so from multiple places


1. relay_close_stream()


Nothing wrong in removing it here using cds_list_del()





2. connection_destroy()


This is where the problem of corruption is not getting solved. This is
because it gets called from destroy_connection()


static void destroy_connection(struct lttng_ht *relay_connections_ht,
struct relay_connection *conn)


{


                assert(relay_connections_ht);


                assert(conn);





                connection_delete(relay_connections_ht, conn);





                /* For the control socket, we try to destroy the session. */


                if (conn->type == RELAY_CONTROL && conn->session) {


                                destroy_session(conn->session,
conn->sessions_ht);


                }





                connection_destroy(conn);


}


But, destroy_session() above calls ctf_trace_destroy() which frees up the
memory allocated for the streams, thus a later attempt by
connection_destroy() to iterate over the conn->recv_head list will be a
problem as it now contains dangling pointers.



3. From my side, I thought that just calling cds_list_del() from
relay_close_stream() should fix it, however, when set_viewer_ready_flag()
is called


static void set_viewer_ready_flag(struct relay_connection *conn)


{


                struct relay_stream *stream, *tmp_stream;


                pthread_mutex_lock(&conn->session->viewer_ready_lock);


                cds_list_for_each_entry_safe(stream, tmp_stream,
&conn->recv_head,


                                                recv_list) {


                                stream->viewer_ready = 1;


                                cds_list_del(&stream->recv_list);


                }


                pthread_mutex_unlock(&conn->session->viewer_ready_lock);


                return;


}


It iterates over the entire conn->recv_head and removes all the entries. So
my previous solution would try to remove an entry from the list which is
already removed causing some other corruption ( the removal from the list
using cds_list_del() does not have any safety checks )



The solution that I am thinking of is below:


1.  Wherever cds_list_del(&stream->recv_list) gets called, reset the
pointers to NULL as

    stream->recv_list.next = stream->recv_list.prev = NULL;


2. Remove the entry from conn->recv_head list from only stream_delete()
after checking for above NULLs and stream->viewer_ready flag.

   The NULL set/check may not be needed as we could manage with the
viewer_ready flag alone, but in case we have missed any other scenario,
NULL would help us identity/fix it faster.




/---------------------------------------------------------------------/

The patch with the fix :

/---------------------------------------------------------------------/


diff --git a/src/bin/lttng-relayd/main.c b/src/bin/lttng-relayd/main.c


index cc68410..da1b2e2 100644


--- a/src/bin/lttng-relayd/main.c


+++ b/src/bin/lttng-relayd/main.c


@@ -1153,6 +1153,8 @@ void set_viewer_ready_flag(struct relay_connection
*conn)


                        recv_list) {


                stream->viewer_ready = 1;


                cds_list_del(&stream->recv_list);


+               /* Reset the removed node memory, we shall check in other
places for this */


+               memset(&stream->recv_list, 0, sizeof(struct cds_list_head));


        }


        pthread_mutex_unlock(&conn->session->viewer_ready_lock);


        return;








diff --git a/src/bin/lttng-relayd/stream.c b/src/bin/lttng-relayd/stream.c


index 410fae8..7043797 100644


--- a/src/bin/lttng-relayd/stream.c


+++ b/src/bin/lttng-relayd/stream.c


@@ -144,6 +144,13 @@ void stream_delete(struct lttng_ht *ht, struct
relay_stream *stream)


        assert(!ret);





        cds_list_del(&stream->trace_list);


+/* Before we attempt to remove the stream from conn->recv_head list, need
to check if it was already removed from set_viewer_ready_flag() */


+       if(stream->viewer_ready == 0) {


+               cds_list_del(&stream->recv_list);


+               /* Reset the removed node memory, we shall check in other
places for this */


+               memset(&stream->recv_list, 0, sizeof(struct cds_list_head));


+       }


+


 }



On Tue, Jul 14, 2015 at 12:59 AM, Aravind HT <aravind.ht@gmail.com> wrote:

> cds_list_del(&stream->recv_list) definitely needs to be called from
> relay_close_stream() else conn->recv_head will continue to have a stale
> pointer(for streams that got closed) until
> the entire connection is destroyed, and conversely,
> connection_destruction() means destroying the conn->recv_head list which
> means iterating/removing entries of streams ( some of them could be stale,
> if relay_close_stream() does not do cds_list_del() ) and in which case,
> cds_list_del(&stream->recv_list) needs to be called also.
>
> So removing either of them, I think, is not the good and in this case,
> https://github.com/lttng/lttng-tools/commit/1dc0526df43f2b5f86ef451e4c0331445346b15f
> could be reverted
>
>
>  *** However *** ,
> there could also be a synchronization issue if relay_close_stream() and
> connection_destroy() can happen concurrently if these calls are not
> serialized ( stream->recv_list being accessed from two contexts ), need to
> verify this before reverting
> https://github.com/lttng/lttng-tools/commit/1dc0526df43f2b5f86ef451e4c0331445346b15f
>
>
> I also do not know the kind of issues being seen/details to prompt the
> reversal of
> http://git.lttng.org/?p=lttng-tools.git;a=commitdiff;h=cd2ef1ef1d54ced9e4d0d03b865bb7fc6a905f80
>
>
>
> On Tue, Jul 14, 2015 at 12:17 AM, Aravind HT <aravind.ht@gmail.com> wrote:
>
>> The specific change in
>> http://git.lttng.org/?p=lttng-tools.git;a=commitdiff;h=cd2ef1ef1d54ced9e4d0d03b865bb7fc6a905f80
>> that I need is for cds_list_del(&stream->recv_list); to be called.
>> Below is the patch extract.
>>
>> diff --git a/src/bin/lttng-relayd/main.c
>> <http://git.lttng.org/?p=lttng-tools.git;a=blob;f=src/bin/lttng-relayd/main.c;h=a93151ac47f560875f07c9cbf113225f9dd892cc>
>> b/src/bin/lttng-relayd/main.c
>> <http://git.lttng.org/?p=lttng-tools.git;a=blob;f=src/bin/lttng-relayd/main.c;h=d82a3412d7b8578be8f5b39abf0a9fd949ff07bc;hb=cd2ef1ef1d54ced9e4d0d03b865bb7fc6a905f80>
>>
>> index a93151a
>> <http://git.lttng.org/?p=lttng-tools.git;a=blob;f=src/bin/lttng-relayd/main.c;h=a93151ac47f560875f07c9cbf113225f9dd892cc>
>> ..d82a341
>> <http://git.lttng.org/?p=lttng-tools.git;a=blob;f=src/bin/lttng-relayd/main.c;h=d82a3412d7b8578be8f5b39abf0a9fd949ff07bc;hb=cd2ef1ef1d54ced9e4d0d03b865bb7fc6a905f80>
>> 100644 (file)
>> --- a/src/bin/lttng-relayd/main.c
>> <http://git.lttng.org/?p=lttng-tools.git;a=blob;f=src/bin/lttng-relayd/main.c;h=a93151ac47f560875f07c9cbf113225f9dd892cc>
>> +++ b/src/bin/lttng-relayd/main.c
>> <http://git.lttng.org/?p=lttng-tools.git;a=blob;f=src/bin/lttng-relayd/main.c;h=d82a3412d7b8578be8f5b39abf0a9fd949ff07bc;hb=cd2ef1ef1d54ced9e4d0d03b865bb7fc6a905f80>
>> @@ -1340,6
>> <http://git.lttng.org/?p=lttng-tools.git;a=blob;f=src/bin/lttng-relayd/main.c;h=a93151ac47f560875f07c9cbf113225f9dd892cc#l1340>
>> +1340,18
>> <http://git.lttng.org/?p=lttng-tools.git;a=blob;f=src/bin/lttng-relayd/main.c;h=d82a3412d7b8578be8f5b39abf0a9fd949ff07bc;hb=cd2ef1ef1d54ced9e4d0d03b865bb7fc6a905f80#l1340>
>> @@ int relay_close_stream(struct lttcomm_relayd_hdr *recv_hdr,
>>         session->stream_count--;
>>         assert(session->stream_count >= 0);
>>
>> +       /*
>>
>> +        * Remove the stream from the connection recv list since we are about to
>>
>> +        * flag it invalid and thus might be freed. This has to be done here since
>> +        * only the control thread can do actions on that list.
>> +        *
>>
>> +        * Note that this stream might NOT be in the list but we have to try to
>>
>> +        * remove it here else this can race with the stream destruction freeing
>>
>> +        * the object and the connection destroy doing a use after free when
>> +        * deleting the remaining nodes in this list.
>> +        */
>> +       cds_list_del(&stream->recv_list);
>> +
>>         /* Check if we can close it or else the data will do it. */
>>         try_close_stream(session, stream);
>>
>>
>>
>>
>>
>>
>> On Mon, Jul 13, 2015 at 9:04 PM, Jérémie Galarneau <
>> jeremie.galarneau@efficios.com> wrote:
>>
>>> Do you have a specific changeset in mind when you say you applied the
>>> changes mentioned in that thread?
>>>
>>> Do you simply revert the following commit?
>>>
>>> https://github.com/lttng/lttng-tools/commit/1dc0526df43f2b5f86ef451e4c0331445346b15f
>>>
>>> Thanks,
>>> Jérémie
>>>
>>>
>>> On Sat, Jul 11, 2015 at 3:26 AM, Aravind HT <aravind.ht@gmail.com>
>>> wrote:
>>>
>>>> To try this fix, I also needed the changes talked about in
>>>> http://lists.lttng.org/pipermail/lttng-dev/2015-July/024689.html ,
>>>> otherwise, I would see relayd coring.
>>>> Once I took them I could see that the soft hang issue was no longer
>>>> reproducible.
>>>>
>>>> Thanks for helping.
>>>>
>>>> On Wed, Jul 8, 2015 at 9:49 AM, Aravind HT <aravind.ht@gmail.com>
>>>> wrote:
>>>>
>>>>> Sure, I will try the fix and update.
>>>>>
>>>>> On Mon, Jul 6, 2015 at 10:12 PM, Jérémie Galarneau <
>>>>> jeremie.galarneau@efficios.com> wrote:
>>>>>
>>>>>> Would you mind testing Mathieu's patch?
>>>>>>
>>>>>> I have rebased it on stable-2.6:
>>>>>> git clone https://github.com/jgalar/lttng-tools.git -b hang-fix
>>>>>>
>>>>>> Thanks,
>>>>>> Jérémie
>>>>>>
>>>>>>
>>>>>> On Mon, Jul 6, 2015 at 11:23 AM, Jérémie Galarneau <
>>>>>> jeremie.galarneau@efficios.com> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> What do you observe when this happens? Does the session daemon
>>>>>>> become unresponsive?
>>>>>>>
>>>>>>> Jérémie
>>>>>>>
>>>>>>> On Wed, May 27, 2015 at 6:19 AM, Aravind HT <aravind.ht@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Request someone to kindly help with this. Blocked at this point,
>>>>>>>> unable to continue as any application crash leads to lttng not working.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Aravind.
>>>>>>>>
>>>>>>>> On Thu, May 21, 2015 at 12:18 AM, Aravind HT <aravind.ht@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I have recently started trying lttng 2.6 for a few applications on
>>>>>>>>> my Ubuntu 12.04 and I noticed that the health check on sessiond and
>>>>>>>>> consumerd failed soon after starting the session.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On investigating, I found that thread_manage_consumer() had exited
>>>>>>>>> causing an overall health check failure.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Here are the sequence of steps that I found contributed to
>>>>>>>>> *thread_manage_consumer()* exiting.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> 1.       In *thread_manage_apps() : 1558 ,
>>>>>>>>> ust_app_unregister(pollfd)* is being called. This happens when
>>>>>>>>> there is an error detected by *revents =
>>>>>>>>> LTTNG_POLL_GETEV(&events, i)*
>>>>>>>>>
>>>>>>>>> My initial guess here is that as one of my apps has crashed,
>>>>>>>>> producing a *LPOLLERR | LPOLLHUP | LPOLLRDHUP *to be generated
>>>>>>>>> for *epoll()* causing *ust_app_unregister()* to be called for
>>>>>>>>> that app.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> 2.       In *ust_app_unregister():3154 , close_metadata(registry,
>>>>>>>>> ua_sess->consumer)  *in which* registry->metadata_closed = 1 *is
>>>>>>>>> set*.*
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> (2.a) Note:* close_metadata() *also calls
>>>>>>>>> * consumer_close_metadata() *which sends
>>>>>>>>> * LTTNG_CONSUMER_CLOSE_METADATA *and* metadata_key *to the
>>>>>>>>> consumerd to stop it from further dealing with the concerned app. Somehow
>>>>>>>>> this doesn’t seem to help*.*
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> 3.       Next, I see that the *thread_manage_consumer():1353* for
>>>>>>>>> some reason has ignored the above 2.a and gets to do request/reply for that
>>>>>>>>> app by calling*ust_consumer_metadata_request():491 ->
>>>>>>>>> ust_app_push_metadat(ust_reg, socket,1) *which at line 460 checks
>>>>>>>>> for *registry->metadata_closed* and returns an *–EPIPE*
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> 4.       This *–EPIPE* error cascades all the way back up to
>>>>>>>>> *thread_manage_consumer():1353* at which point
>>>>>>>>> *thread_manage_consumer()* decides to exit causing health_check()
>>>>>>>>> to fail.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> So it looks like under some scenario, an application crash could
>>>>>>>>> cause the lttng some problems.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I think a possible fix for this scenario, is to instead of 4, send
>>>>>>>>> an *ERROR* message back to *consumerd()* . This could be done
>>>>>>>>> from *ust_consumer_metadata_request()* call. Can someone please
>>>>>>>>> let me know if this is correct and shed more light on the issue ?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Please forgive if there are any guideline omissions for posting
>>>>>>>>> here from my part. This is my first post.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>>
>>>>>>>>> Aravind.
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> lttng-dev mailing list
>>>>>>>> lttng-dev@lists.lttng.org
>>>>>>>> http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Jérémie Galarneau
>>>>>>> EfficiOS Inc.
>>>>>>> http://www.efficios.com
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Jérémie Galarneau
>>>>>> EfficiOS Inc.
>>>>>> http://www.efficios.com
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>> --
>>> Jérémie Galarneau
>>> EfficiOS Inc.
>>> http://www.efficios.com
>>>
>>
>>
>

[-- Attachment #1.2: Type: text/html, Size: 40800 bytes --]

[-- Attachment #2: Type: text/plain, Size: 155 bytes --]

_______________________________________________
lttng-dev mailing list
lttng-dev@lists.lttng.org
http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Lttng Soft Hang issue
@ 2015-05-20 18:48 Aravind HT
  0 siblings, 0 replies; 11+ messages in thread
From: Aravind HT @ 2015-05-20 18:48 UTC (permalink / raw)
  To: lttng-dev


[-- Attachment #1.1: Type: text/plain, Size: 2248 bytes --]

Hi,



I have recently started trying lttng 2.6 for a few applications on my
Ubuntu 12.04 and I noticed that the health check on sessiond and consumerd
failed soon after starting the session.


On investigating, I found that thread_manage_consumer() had exited causing
an overall health check failure.


Here are the sequence of steps that I found contributed to
*thread_manage_consumer()* exiting.


1.       In *thread_manage_apps() : 1558 , ust_app_unregister(pollfd)* is
being called. This happens when there is an error detected by *revents =
LTTNG_POLL_GETEV(&events, i)*

My initial guess here is that as one of my apps has crashed, producing
a *LPOLLERR
| LPOLLHUP | LPOLLRDHUP *to be generated for *epoll()* causing
*ust_app_unregister()* to be called for that app.



2.       In *ust_app_unregister():3154 , close_metadata(registry,
ua_sess->consumer)  *in which* registry->metadata_closed = 1 *is set*.*



(2.a) Note:* close_metadata() *also calls* consumer_close_metadata() *which
sends* LTTNG_CONSUMER_CLOSE_METADATA *and* metadata_key *to the consumerd
to stop it from further dealing with the concerned app. Somehow this
doesn’t seem to help*.*



3.       Next, I see that the *thread_manage_consumer():1353* for some
reason has ignored the above 2.a and gets to do request/reply for that app
by calling*ust_consumer_metadata_request():491 ->
ust_app_push_metadat(ust_reg, socket,1) *which at line 460 checks for
*registry->metadata_closed* and returns an *–EPIPE*



4.       This *–EPIPE* error cascades all the way back up to
*thread_manage_consumer():1353* at which point
*thread_manage_consumer()* decides
to exit causing health_check() to fail.





So it looks like under some scenario, an application crash could cause the
lttng some problems.



I think a possible fix for this scenario, is to instead of 4, send an
*ERROR* message back to *consumerd()* . This could be done from
*ust_consumer_metadata_request()* call. Can someone please let me know if
this is correct and shed more light on the issue ?





Please forgive if there are any guideline omissions for posting here from
my part. This is my first post.


Regards,

Aravind.

[-- Attachment #1.2: Type: text/html, Size: 4477 bytes --]

[-- Attachment #2: Type: text/plain, Size: 155 bytes --]

_______________________________________________
lttng-dev mailing list
lttng-dev@lists.lttng.org
http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev

^ permalink raw reply	[flat|nested] 11+ messages in thread

* lttng soft hang issue
@ 2015-05-19 11:06 Aravind HT
  0 siblings, 0 replies; 11+ messages in thread
From: Aravind HT @ 2015-05-19 11:06 UTC (permalink / raw)
  To: lttng-dev


[-- Attachment #1.1: Type: text/plain, Size: 2250 bytes --]

Hi,



I have recently started trying lttng 2.6 for a few applications on my
Ubuntu 12.04 and I noticed that the health check on sessiond and consumerd
failed soon after starting the session.


On investigating, I found that thread_manage_consumer() had exited causing
an overall health check failure.


Here are the sequence of steps that I found contributed to
*thread_manage_consumer()* exiting.


1.       In *thread_manage_apps() : 1558 , ust_app_unregister(pollfd)* is
being called. This happens when there is an error detected by *revents =
LTTNG_POLL_GETEV(&events, i) *

My initial guess here is that as one of my apps has crashed, producing
a *LPOLLERR
| LPOLLHUP | LPOLLRDHUP *to be generated for *epoll()* causing
*ust_app_unregister()* to be called for that app.



2.       In *ust_app_unregister():3154 , close_metadata(registry,
ua_sess->consumer)  *in which* registry->metadata_closed = 1 *is set*. *



(2.a) Note:* close_metadata() *also calls* consumer_close_metadata() *which
sends* LTTNG_CONSUMER_CLOSE_METADATA *and* metadata_key *to the consumerd
to stop it from further dealing with the concerned app. Somehow this
doesn’t seem to help*.*



3.       Next, I see that the *thread_manage_consumer():1353* for some
reason has ignored the above 2.a and gets to do request/reply for that app
by calling *ust_consumer_metadata_request():491 ->
ust_app_push_metadat(ust_reg, socket,1) *which at line 460 checks for
*registry->metadata_closed* and returns an *–EPIPE*



4.       This *–EPIPE* error cascades all the way back up to
*thread_manage_consumer():1353* at which point *thread_manage_consumer()*
decides to exit causing health_check() to fail.





So it looks like under some scenario, an application crash could cause the
lttng some problems.



I think a possible fix for this scenario, is to instead of 4, send an
*ERROR* message back to *consumerd()* . This could be done from
*ust_consumer_metadata_request()* call. Can someone please let me know if
this is correct and shed more light on the issue ?





Please forgive if there are any guideline omissions for posting here from
my part. This is my first post.


Regards,

Aravind.

[-- Attachment #1.2: Type: text/html, Size: 3574 bytes --]

[-- Attachment #2: Type: text/plain, Size: 155 bytes --]

_______________________________________________
lttng-dev mailing list
lttng-dev@lists.lttng.org
http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2015-07-15 14:01 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <CACd+_b7pakrUPzfzj=kXYYtkezpwxOLUgGs_Hy7yMYOh-f-PqQ@mail.gmail.com>
2015-05-27 10:19 ` Lttng Soft Hang issue Aravind HT
     [not found] ` <CACd+_b4MtdbAVjHCeQ6EnK5J3KRYzMK7Niq_c18pcKcdpnq07w@mail.gmail.com>
2015-07-06 15:23   ` Jérémie Galarneau
     [not found]   ` <CA+jJMxuHWhy1n9Ty1Q6=eGd7bRe-brh3Mqog1uzw5EXbKDPJ0w@mail.gmail.com>
2015-07-06 16:42     ` Jérémie Galarneau
     [not found]     ` <CA+jJMxt2Dph0yhuE3tQ3aYKyAmX8t1DnmT6kDbnWVSgNN=rmUA@mail.gmail.com>
2015-07-08  4:19       ` Aravind HT
     [not found]       ` <CACd+_b5v5mpfBqNRQNHiJnA51niK0BchdAgf90VRpSQQpNn_7w@mail.gmail.com>
2015-07-11  7:26         ` Aravind HT
     [not found]         ` <CACd+_b7Gz6be-+0cc4gnQi92hSoPDDuu7fWw6t=Qn19WdNc2jA@mail.gmail.com>
2015-07-13 15:34           ` Jérémie Galarneau
     [not found]           ` <CA+jJMxtyOqTDSh3iphDAE4qPx0+C8jJ3Txi1j+ouxHx0guuXng@mail.gmail.com>
2015-07-13 18:47             ` Aravind HT
     [not found]             ` <CACd+_b5dwMRdR8wjzG0S5wpwtuB+qx_Q4Y0RHA2GDw7ZZgny+A@mail.gmail.com>
2015-07-13 19:29               ` Aravind HT
     [not found]               ` <CACd+_b6Dz8eriOnPm9n5Ht8f1DJy-Cy3DE80NBCd27WxtDZdUQ@mail.gmail.com>
2015-07-15 14:00                 ` Aravind HT
2015-05-20 18:48 Aravind HT
  -- strict thread matches above, loose matches on Subject: below --
2015-05-19 11:06 lttng soft hang issue Aravind HT

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.