* Re: Relayd trace drops
       [not found] <CACd+_b7qEeW1XD1H_0MT4dDqwLv68rfOURfNDRh6SO=xS4k-Og@mail.gmail.com>
@ 2015-12-04 12:41 ` Jérémie Galarneau
  2015-12-04 14:06 ` Jonathan Rajotte Julien
       [not found] ` <CA+jJMxsuF+Ywvvhe3L4OqBHTsohyFffPoF4mZWFDqPvBC-P80w@mail.gmail.com>
  2 siblings, 0 replies; 18+ messages in thread
From: Jérémie Galarneau @ 2015-12-04 12:41 UTC (permalink / raw)
  To: Aravind HT; +Cc: lttng-dev

Hi Aravind,

Can't say I have looked at everything you sent yet, but as a
preemptive question, which version are we talking about here? 2.6.0 or
2.6.1? 2.6.1 contains a lot of relay daemon fixes.

Thanks,
Jérémie

On Thu, Dec 3, 2015 at 7:01 AM, Aravind HT <aravind.ht@gmail.com> wrote:
> Hi,
>
> I am trying to obtain the performance characteristics of lttng using test
> applications. Traces are produced on a local node and delivered to relayd,
> which runs on a separate node, for storage.
>
> An lttng session is started with the test applications producing an initial
> bit rate of 10 kb/s and run for about 30 seconds. The starting sub-buffer
> size is kept at 128 kb and the sub-buffer count at 4. The session is then
> stopped and destroyed, and the traces are analyzed to see if there are any
> drops. This is done in a loop, with every subsequent session increasing the
> bit rate by 2 kb/s as long as there are no drops. If there are drops, I
> double the buffer size without incrementing the bit rate.
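
A minimal Python sketch of the ramp-up rule described above (next_config and
its arguments are illustrative, not part of the actual test harness):

    def next_config(rate_kbps, subbuf_kb, dropped):
        """Return the next (bit rate, sub-buffer size) per the rule above."""
        if dropped:
            return rate_kbps, subbuf_kb * 2  # drops seen: double the buffers, keep the rate
        return rate_kbps + 2, subbuf_kb      # clean run: bump the rate by 2 kb/s

    # Example: starting at 10 kb/s with 128 kb x 4 sub-buffers.
    print(next_config(10, 128, dropped=False))  # -> (12, 128)
    print(next_config(40, 128, dropped=True))   # -> (40, 256)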
>
> I see trace drops happening consistently with test apps producing traces at
> less than 40 kb/s; it doesn't seem to help even if I start with 1 MB x 4
> sub-buffers.
>
> Analysis :
>
> I have attached the lttng_relayd and lttng_consumerd_64 logs and the entire
> trace directory; I hope you will be able to view them.
> I have modified the lttng_relayd code to dump the traces being captured into
> the lttng_relayd logs along with debug info.
>
> Each test app is producing logs of the form:
> "TraceApp PID - 31940 THID - 31970 @threadRate - 1032 b/s appRate - 2079 b/s
> threadTraceNum - 9 appTraceNum - 18  sleepTime - 192120"
>
> The test application PID, test application thread id, thread bit rate, test
> app bit rate, thread trace number and application trace number are part of
> the trace. So in the above trace, the thread is producing at 1 kb/s and the
> whole test app is producing at 2 kb/s.
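
As a small aid, a Python sketch for parsing a line of this form (the field
names mirror the example above; the exact spacing in the real logs is an
assumption):

    import re

    LINE_RE = re.compile(
        r"TraceApp PID - (?P<pid>\d+) THID - (?P<tid>\d+) "
        r"@threadRate - (?P<thread_rate>\d+) b/s appRate - (?P<app_rate>\d+) b/s "
        r"threadTraceNum - (?P<thread_num>\d+) appTraceNum - (?P<app_num>\d+)\s+"
        r"sleepTime - (?P<sleep>\d+)")

    line = ("TraceApp PID - 31940 THID - 31970 @threadRate - 1032 b/s appRate - "
            "2079 b/s threadTraceNum - 9 appTraceNum - 18  sleepTime - 192120")
    m = LINE_RE.search(line)
    print(int(m.group("pid")), int(m.group("app_num")))  # -> 31940 18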
>
> If we look at the babeltrace output, we see that the trace with TraceApp
> PID - 31940 appTraceNum 2 is missing, while 1, 3, 4, 5 and so on are
> successfully captured.
> I looked at the lttng_relayd logs and found that the trace for "appTraceNum
> 2" is not delivered/generated by the consumerd to the relayd in sequence
> with the other traces. To rule out a test application problem, you can look
> at line 12778 of the lttng_relayd log and see that traces from appTraceNum -
> 1 to appTraceNum - 18, including appTraceNum 2, are "re-delivered" by the
> consumerd to the relayd.
> Essentially, I see appTraceNum 1 through appTraceNum 18 being delivered
> twice: once individually, where appTraceNum 2 is missing, and once as a
> group at line 12778, where it is present.
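
This kind of sequence check can be scripted; below is a hedged Python sketch
(check_sequences is an illustrative helper, not part of the attached scripts)
that flags missing and duplicated appTraceNum values per PID, given
(PID, appTraceNum) pairs extracted from the babeltrace output:

    from collections import defaultdict

    def check_sequences(records):
        """records: iterable of (pid, appTraceNum) pairs.
        Returns {pid: (missing, duplicated)}, judged against the max number seen."""
        seen = defaultdict(list)
        for pid, num in records:
            seen[pid].append(num)
        report = {}
        for pid, nums in seen.items():
            expected = set(range(1, max(nums) + 1))
            missing = sorted(expected - set(nums))
            duplicated = sorted({n for n in nums if nums.count(n) > 1})
            report[pid] = (missing, duplicated)
        return report

    # First delivery described above: 1..18 with appTraceNum 2 absent.
    print(check_sequences([(31940, n) for n in range(1, 19) if n != 2]))
    # -> {31940: ([2], [])}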
>
>
> Request help with:
> 1. why traces are delivered twice; is it by design or a genuine problem?
> 2. how to avoid traces being dropped even though the buffers are
> sufficiently large?
>
>
> Regards,
> Aravind.
>
> _______________________________________________
> lttng-dev mailing list
> lttng-dev@lists.lttng.org
> http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev
>



-- 
Jérémie Galarneau
EfficiOS Inc.
http://www.efficios.com

_______________________________________________
lttng-dev mailing list
lttng-dev@lists.lttng.org
http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev


* Re: Relayd trace drops
       [not found] <CACd+_b7qEeW1XD1H_0MT4dDqwLv68rfOURfNDRh6SO=xS4k-Og@mail.gmail.com>
  2015-12-04 12:41 ` Relayd trace drops Jérémie Galarneau
@ 2015-12-04 14:06 ` Jonathan Rajotte Julien
       [not found] ` <CA+jJMxsuF+Ywvvhe3L4OqBHTsohyFffPoF4mZWFDqPvBC-P80w@mail.gmail.com>
  2 siblings, 0 replies; 18+ messages in thread
From: Jonathan Rajotte Julien @ 2015-12-04 14:06 UTC (permalink / raw)
  To: lttng-dev

Hi Aravind,

Could you share with us the code you are using to obtain the *performance
characteristics* so we can try to reproduce the problem? Could you push it
to a git repo (github/bitbucket) somewhere?

Cheers!

On 2015-12-03 07:01 AM, Aravind HT wrote:
> Hi,
>
> I am trying to obtain the performance characteristics of lttng with 
> the use of test applications. Traces are being produced on a local 
> node and delivered to relayd that is running on a separate node for 
> storage.
>
> An lttng session with the test applications producing an initial bit 
> rate of 10 kb/s is started and run for about 30 seconds. The starting 
> sub-buffer size is kept at 128 kb and sub-buf count at 4. The session 
> is then stopped and destroyed and traces are analyzed to see if there 
> are any drops. This is being done in a loop with every subsequent 
> session having an increment of 2 kb/s as long as there are no drops. 
> If there are drops, I increase the buffer size by a factor of x2 
> without incrementing the bit rate.
>
> I see trace drops happening consistently with test apps producing 
> traces at less than 40 kb/s, it doesnt seem to help even if I started 
> with 1mb x 4 sub-buffers.
>
> Analysis :
>
> I have attached the lttng_relayd , lttng_consumerd_64 logs and the 
> entire trace directory, hope you will be able to view it.
> I have modified lttng_relayd code to dump the traces being captured in 
> the lttng_relayd logs along with debug info.
>
> Each test app is producing logs in the form of  :
> "TraceApp PID - 31940 THID - 31970 @threadRate - 1032 b/s appRate - 
> 2079 b/s threadTraceNum - 9 appTraceNum - 18  sleepTime - 192120"
>
> The test application PID, test application thread id, thread bit rate, 
> test app bit rate, thread trace number and application trace number s 
> are part of the trace. So in the above trace, the thread is producing 
> at 1 kb/s and the whole test app is producing at 2 kb/s.
>
> If we look at the babeltrace out put, we see that the Trace with 
> TraceApp PID - 31940 appTraceNum 2 is missing , with 1, 3, 4, 5 and so 
> on being successfully captured.
> I looked at the lttng_relayd logs and found that trace of "appTraceNum 
> 2" is not delivered/generated by the consumerd to the relayd in 
> sequence with other traces. To rule out that this is not a test 
> application problem, you can look at line ltttng_relayd log : 12778 
> and see traces from appTraceNum - 1 to appTraceNum - 18 including the 
> appTraceNum 2 are "re-delivered" by the consumerd to the relayd.
> Essentially, I see appTraceNum 1 through appTraceNum 18 being 
> delivered twice, once individually where appTraceNum 2 is missing and 
> once as a group at line 12778 where its present.
>
>
> Request help with
> 1. why traces are delivered twice, is it by design or a genuine problem ?
> 2. how to avoid traces being dropped even though buffers are 
> sufficiently large enough ?
>
>
> Regards,
> Aravind.
>
>
> _______________________________________________
> lttng-dev mailing list
> lttng-dev@lists.lttng.org
> http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev

-- 
Jonathan R. Julien
Efficios


* Re: Relayd trace drops
       [not found] ` <CA+jJMxsuF+Ywvvhe3L4OqBHTsohyFffPoF4mZWFDqPvBC-P80w@mail.gmail.com>
@ 2015-12-05  4:06   ` Aravind HT
       [not found]   ` <CACd+_b5AGJKWH58Kq3QP8qW8R7Q9JH3fqbEwU=z8pyg5-DL=KQ@mail.gmail.com>
  1 sibling, 0 replies; 18+ messages in thread
From: Aravind HT @ 2015-12-05  4:06 UTC (permalink / raw)
  To: Jérémie Galarneau; +Cc: lttng-dev


[-- Attachment #1.1: Type: text/plain, Size: 3838 bytes --]

I am using 2.6.0. I will try to share the code that I'm using here in some
time. If there are any specific fixes that are relevant to this issue, please
provide a link to them if you can. I would ideally like to try them out
before attempting a full upgrade to the latest version.

On Fri, Dec 4, 2015 at 6:11 PM, Jérémie Galarneau <
jeremie.galarneau@efficios.com> wrote:

> Hi Aravind,
>
> Can't say I have looked at everything you sent yet, but as a
> preemptive question, which version are we talking about here? 2.6.0 or
> 2.6.1? 2.6.1 contains a lot of relay daemon fixes.
>
> Thanks,
> Jérémie
>
> On Thu, Dec 3, 2015 at 7:01 AM, Aravind HT <aravind.ht@gmail.com> wrote:
> > Hi,
> >
> > I am trying to obtain the performance characteristics of lttng with the
> use
> > of test applications. Traces are being produced on a local node and
> > delivered to relayd that is running on a separate node for storage.
> >
> > An lttng session with the test applications producing an initial bit
> rate of
> > 10 kb/s is started and run for about 30 seconds. The starting sub-buffer
> > size is kept at 128 kb and sub-buf count at 4. The session is then
> stopped
> > and destroyed and traces are analyzed to see if there are any drops.
> This is
> > being done in a loop with every subsequent session having an increment
> of 2
> > kb/s as long as there are no drops. If there are drops, I increase the
> > buffer size by a factor of x2 without incrementing the bit rate.
> >
> > I see trace drops happening consistently with test apps producing traces
> at
> > less than 40 kb/s, it doesnt seem to help even if I started with 1mb x 4
> > sub-buffers.
> >
> > Analysis :
> >
> > I have attached the lttng_relayd , lttng_consumerd_64 logs and the entire
> > trace directory, hope you will be able to view it.
> > I have modified lttng_relayd code to dump the traces being captured in
> the
> > lttng_relayd logs along with debug info.
> >
> > Each test app is producing logs in the form of  :
> > "TraceApp PID - 31940 THID - 31970 @threadRate - 1032 b/s appRate - 2079
> b/s
> > threadTraceNum - 9 appTraceNum - 18  sleepTime - 192120"
> >
> > The test application PID, test application thread id, thread bit rate,
> test
> > app bit rate, thread trace number and application trace number s are
> part of
> > the trace. So in the above trace, the thread is producing at 1 kb/s and
> the
> > whole test app is producing at 2 kb/s.
> >
> > If we look at the babeltrace out put, we see that the Trace with TraceApp
> > PID - 31940 appTraceNum 2 is missing , with 1, 3, 4, 5 and so on being
> > successfully captured.
> > I looked at the lttng_relayd logs and found that trace of "appTraceNum
> 2" is
> > not delivered/generated by the consumerd to the relayd in sequence with
> > other traces. To rule out that this is not a test application problem,
> you
> > can look at line ltttng_relayd log : 12778 and see traces from
> appTraceNum -
> > 1 to appTraceNum - 18 including the appTraceNum 2 are "re-delivered" by
> the
> > consumerd to the relayd.
> > Essentially, I see appTraceNum 1 through appTraceNum 18 being delivered
> > twice, once individually where appTraceNum 2 is missing and once as a
> group
> > at line 12778 where its present.
> >
> >
> > Request help with
> > 1. why traces are delivered twice, is it by design or a genuine problem ?
> > 2. how to avoid traces being dropped even though buffers are sufficiently
> > large enough ?
> >
> >
> > Regards,
> > Aravind.
> >
> > _______________________________________________
> > lttng-dev mailing list
> > lttng-dev@lists.lttng.org
> > http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev
> >
>
>
>
> --
> Jérémie Galarneau
> EfficiOS Inc.
> http://www.efficios.com
>

[-- Attachment #1.2: Type: text/html, Size: 4834 bytes --]

[-- Attachment #2: Type: text/plain, Size: 155 bytes --]

_______________________________________________
lttng-dev mailing list
lttng-dev@lists.lttng.org
http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev


* Re: Relayd trace drops
       [not found]   ` <CACd+_b5AGJKWH58Kq3QP8qW8R7Q9JH3fqbEwU=z8pyg5-DL=KQ@mail.gmail.com>
@ 2015-12-05 11:53     ` Jérémie Galarneau
       [not found]     ` <CA+jJMxsJReP4ZO2bSJP-rJe+UoJzT09Pix1LhEypbsww_nwrsA@mail.gmail.com>
  1 sibling, 0 replies; 18+ messages in thread
From: Jérémie Galarneau @ 2015-12-05 11:53 UTC (permalink / raw)
  To: Aravind HT; +Cc: lttng-dev

On Fri, Dec 4, 2015 at 11:06 PM, Aravind HT <aravind.ht@gmail.com> wrote:
> I am using 2.6.0 .I will try to share the code that I'm using here in some
> time. If there are any specific fixes that are relevant to this issue, see
> if you can provide a link to them. I would ideally like to try them out
> before trying a full upgrade to the latest versions.

Hi,

Can you provide more information on the system? Which distribution,
architecture, kernel version?

The verbose sessiond logs might help pinpoint any unexpected behaviour
here (are all applications registering as expected?).

Jérémie

>
> On Fri, Dec 4, 2015 at 6:11 PM, Jérémie Galarneau
> <jeremie.galarneau@efficios.com> wrote:
>>
>> Hi Aravind,
>>
>> Can't say I have looked at everything you sent yet, but as a
>> preemptive question, which version are we talking about here? 2.6.0 or
>> 2.6.1? 2.6.1 contains a lot of relay daemon fixes.
>>
>> Thanks,
>> Jérémie
>>
>> On Thu, Dec 3, 2015 at 7:01 AM, Aravind HT <aravind.ht@gmail.com> wrote:
>> > Hi,
>> >
>> > I am trying to obtain the performance characteristics of lttng with the
>> > use
>> > of test applications. Traces are being produced on a local node and
>> > delivered to relayd that is running on a separate node for storage.
>> >
>> > An lttng session with the test applications producing an initial bit
>> > rate of
>> > 10 kb/s is started and run for about 30 seconds. The starting sub-buffer
>> > size is kept at 128 kb and sub-buf count at 4. The session is then
>> > stopped
>> > and destroyed and traces are analyzed to see if there are any drops.
>> > This is
>> > being done in a loop with every subsequent session having an increment
>> > of 2
>> > kb/s as long as there are no drops. If there are drops, I increase the
>> > buffer size by a factor of x2 without incrementing the bit rate.
>> >
>> > I see trace drops happening consistently with test apps producing traces
>> > at
>> > less than 40 kb/s, it doesnt seem to help even if I started with 1mb x 4
>> > sub-buffers.
>> >
>> > Analysis :
>> >
>> > I have attached the lttng_relayd , lttng_consumerd_64 logs and the
>> > entire
>> > trace directory, hope you will be able to view it.
>> > I have modified lttng_relayd code to dump the traces being captured in
>> > the
>> > lttng_relayd logs along with debug info.
>> >
>> > Each test app is producing logs in the form of  :
>> > "TraceApp PID - 31940 THID - 31970 @threadRate - 1032 b/s appRate - 2079
>> > b/s
>> > threadTraceNum - 9 appTraceNum - 18  sleepTime - 192120"
>> >
>> > The test application PID, test application thread id, thread bit rate,
>> > test
>> > app bit rate, thread trace number and application trace number s are
>> > part of
>> > the trace. So in the above trace, the thread is producing at 1 kb/s and
>> > the
>> > whole test app is producing at 2 kb/s.
>> >
>> > If we look at the babeltrace out put, we see that the Trace with
>> > TraceApp
>> > PID - 31940 appTraceNum 2 is missing , with 1, 3, 4, 5 and so on being
>> > successfully captured.
>> > I looked at the lttng_relayd logs and found that trace of "appTraceNum
>> > 2" is
>> > not delivered/generated by the consumerd to the relayd in sequence with
>> > other traces. To rule out that this is not a test application problem,
>> > you
>> > can look at line ltttng_relayd log : 12778 and see traces from
>> > appTraceNum -
>> > 1 to appTraceNum - 18 including the appTraceNum 2 are "re-delivered" by
>> > the
>> > consumerd to the relayd.
>> > Essentially, I see appTraceNum 1 through appTraceNum 18 being delivered
>> > twice, once individually where appTraceNum 2 is missing and once as a
>> > group
>> > at line 12778 where its present.
>> >
>> >
>> > Request help with
>> > 1. why traces are delivered twice, is it by design or a genuine problem
>> > ?
>> > 2. how to avoid traces being dropped even though buffers are
>> > sufficiently
>> > large enough ?
>> >
>> >
>> > Regards,
>> > Aravind.
>> >
>> > _______________________________________________
>> > lttng-dev mailing list
>> > lttng-dev@lists.lttng.org
>> > http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev
>> >
>>
>>
>>
>> --
>> Jérémie Galarneau
>> EfficiOS Inc.
>> http://www.efficios.com
>
>



-- 
Jérémie Galarneau
EfficiOS Inc.
http://www.efficios.com

_______________________________________________
lttng-dev mailing list
lttng-dev@lists.lttng.org
http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev


* Re: Relayd trace drops
       [not found]     ` <CA+jJMxsJReP4ZO2bSJP-rJe+UoJzT09Pix1LhEypbsww_nwrsA@mail.gmail.com>
@ 2015-12-07 18:07       ` Aravind HT
       [not found]       ` <CACd+_b7ScRDo9uL+neOnA2pqVVwexsPsbBd2mBS--yy7Amxbmg@mail.gmail.com>
  1 sibling, 0 replies; 18+ messages in thread
From: Aravind HT @ 2015-12-07 18:07 UTC (permalink / raw)
  To: Jérémie Galarneau; +Cc: lttng-dev


[-- Attachment #1.1: Type: text/plain, Size: 5224 bytes --]

Hi,

I have attached the complete profiling scripts here; it's a bit shabby, as
I'm new to Python.

There is a README which has the details on how to execute it.
I'm using Yocto 1.6 on x86_64 platforms on both nodes.


Running this script while there are other sessions running seems to
reproduce this problem easily.
Please try it and let me know if you have any issues reproducing the
problem.

Regards,
Aravind.

On Sat, Dec 5, 2015 at 5:23 PM, Jérémie Galarneau <
jeremie.galarneau@efficios.com> wrote:

> On Fri, Dec 4, 2015 at 11:06 PM, Aravind HT <aravind.ht@gmail.com> wrote:
> > I am using 2.6.0 .I will try to share the code that I'm using here in
> some
> > time. If there are any specific fixes that are relevant to this issue,
> see
> > if you can provide a link to them. I would ideally like to try them out
> > before trying a full upgrade to the latest versions.
>
> Hi,
>
> Can you provide more information on the system? Which distribution,
> architecture, kernel version?
>
> The verbose sessiond logs might help pinpoint any unexpected behaviour
> here (are all applications registering as expected?).
>
> Jérémie
>
> >
> > On Fri, Dec 4, 2015 at 6:11 PM, Jérémie Galarneau
> > <jeremie.galarneau@efficios.com> wrote:
> >>
> >> Hi Aravind,
> >>
> >> Can't say I have looked at everything you sent yet, but as a
> >> preemptive question, which version are we talking about here? 2.6.0 or
> >> 2.6.1? 2.6.1 contains a lot of relay daemon fixes.
> >>
> >> Thanks,
> >> Jérémie
> >>
> >> On Thu, Dec 3, 2015 at 7:01 AM, Aravind HT <aravind.ht@gmail.com>
> wrote:
> >> > Hi,
> >> >
> >> > I am trying to obtain the performance characteristics of lttng with
> the
> >> > use
> >> > of test applications. Traces are being produced on a local node and
> >> > delivered to relayd that is running on a separate node for storage.
> >> >
> >> > An lttng session with the test applications producing an initial bit
> >> > rate of
> >> > 10 kb/s is started and run for about 30 seconds. The starting
> sub-buffer
> >> > size is kept at 128 kb and sub-buf count at 4. The session is then
> >> > stopped
> >> > and destroyed and traces are analyzed to see if there are any drops.
> >> > This is
> >> > being done in a loop with every subsequent session having an increment
> >> > of 2
> >> > kb/s as long as there are no drops. If there are drops, I increase the
> >> > buffer size by a factor of x2 without incrementing the bit rate.
> >> >
> >> > I see trace drops happening consistently with test apps producing
> traces
> >> > at
> >> > less than 40 kb/s, it doesnt seem to help even if I started with 1mb
> x 4
> >> > sub-buffers.
> >> >
> >> > Analysis :
> >> >
> >> > I have attached the lttng_relayd , lttng_consumerd_64 logs and the
> >> > entire
> >> > trace directory, hope you will be able to view it.
> >> > I have modified lttng_relayd code to dump the traces being captured in
> >> > the
> >> > lttng_relayd logs along with debug info.
> >> >
> >> > Each test app is producing logs in the form of  :
> >> > "TraceApp PID - 31940 THID - 31970 @threadRate - 1032 b/s appRate -
> 2079
> >> > b/s
> >> > threadTraceNum - 9 appTraceNum - 18  sleepTime - 192120"
> >> >
> >> > The test application PID, test application thread id, thread bit rate,
> >> > test
> >> > app bit rate, thread trace number and application trace number s are
> >> > part of
> >> > the trace. So in the above trace, the thread is producing at 1 kb/s
> and
> >> > the
> >> > whole test app is producing at 2 kb/s.
> >> >
> >> > If we look at the babeltrace out put, we see that the Trace with
> >> > TraceApp
> >> > PID - 31940 appTraceNum 2 is missing , with 1, 3, 4, 5 and so on being
> >> > successfully captured.
> >> > I looked at the lttng_relayd logs and found that trace of "appTraceNum
> >> > 2" is
> >> > not delivered/generated by the consumerd to the relayd in sequence
> with
> >> > other traces. To rule out that this is not a test application problem,
> >> > you
> >> > can look at line ltttng_relayd log : 12778 and see traces from
> >> > appTraceNum -
> >> > 1 to appTraceNum - 18 including the appTraceNum 2 are "re-delivered"
> by
> >> > the
> >> > consumerd to the relayd.
> >> > Essentially, I see appTraceNum 1 through appTraceNum 18 being
> delivered
> >> > twice, once individually where appTraceNum 2 is missing and once as a
> >> > group
> >> > at line 12778 where its present.
> >> >
> >> >
> >> > Request help with
> >> > 1. why traces are delivered twice, is it by design or a genuine
> problem
> >> > ?
> >> > 2. how to avoid traces being dropped even though buffers are
> >> > sufficiently
> >> > large enough ?
> >> >
> >> >
> >> > Regards,
> >> > Aravind.
> >> >
> >> > _______________________________________________
> >> > lttng-dev mailing list
> >> > lttng-dev@lists.lttng.org
> >> > http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev
> >> >
> >>
> >>
> >>
> >> --
> >> Jérémie Galarneau
> >> EfficiOS Inc.
> >> http://www.efficios.com
> >
> >
>
>
>
> --
> Jérémie Galarneau
> EfficiOS Inc.
> http://www.efficios.com
>

[-- Attachment #1.2: Type: text/html, Size: 7366 bytes --]

[-- Attachment #2: Lttng_dev_mailing_scripts.rar --]
[-- Type: application/rar, Size: 5726 bytes --]

[-- Attachment #3: Type: text/plain, Size: 155 bytes --]

_______________________________________________
lttng-dev mailing list
lttng-dev@lists.lttng.org
http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev


* Re: Relayd trace drops
       [not found]       ` <CACd+_b7ScRDo9uL+neOnA2pqVVwexsPsbBd2mBS--yy7Amxbmg@mail.gmail.com>
@ 2015-12-07 19:13         ` Jonathan Rajotte
       [not found]         ` <CAF4oYfF2Li4zau0OLE=JZUtGfPZyFT1FciBr7AG1hvzGdeUycA@mail.gmail.com>
  1 sibling, 0 replies; 18+ messages in thread
From: Jonathan Rajotte @ 2015-12-07 19:13 UTC (permalink / raw)
  To: Aravind HT; +Cc: lttng-dev


[-- Attachment #1.1: Type: text/plain, Size: 5824 bytes --]

Hi Aravind,

Did you have the chance to upgrade to 2.6.1? If so, were you able to
reproduce?

Cheers

On Mon, Dec 7, 2015 at 1:07 PM, Aravind HT <aravind.ht@gmail.com> wrote:

> Hi,
>
> I have attached the complete profiling scripts here, its a bit shabby, im
> new to python.
>
> There is a README which has the details on how to execute it.
> Im using a Yocto 1.6 on x86_64 platforms on both the nodes.
>
>
> Running this script when there are other sessions running seems to
> reproduce this problem easily.
> Please try it and let me know if you have any issues reproducing the
> problem.
>
> Regards,
> Aravind.
>
> On Sat, Dec 5, 2015 at 5:23 PM, Jérémie Galarneau <
> jeremie.galarneau@efficios.com> wrote:
>
>> On Fri, Dec 4, 2015 at 11:06 PM, Aravind HT <aravind.ht@gmail.com> wrote:
>> > I am using 2.6.0 .I will try to share the code that I'm using here in
>> some
>> > time. If there are any specific fixes that are relevant to this issue,
>> see
>> > if you can provide a link to them. I would ideally like to try them out
>> > before trying a full upgrade to the latest versions.
>>
>> Hi,
>>
>> Can you provide more information on the system? Which distribution,
>> architecture, kernel version?
>>
>> The verbose sessiond logs might help pinpoint any unexpected behaviour
>> here (are all applications registering as expected?).
>>
>> Jérémie
>>
>> >
>> > On Fri, Dec 4, 2015 at 6:11 PM, Jérémie Galarneau
>> > <jeremie.galarneau@efficios.com> wrote:
>> >>
>> >> Hi Aravind,
>> >>
>> >> Can't say I have looked at everything you sent yet, but as a
>> >> preemptive question, which version are we talking about here? 2.6.0 or
>> >> 2.6.1? 2.6.1 contains a lot of relay daemon fixes.
>> >>
>> >> Thanks,
>> >> Jérémie
>> >>
>> >> On Thu, Dec 3, 2015 at 7:01 AM, Aravind HT <aravind.ht@gmail.com>
>> wrote:
>> >> > Hi,
>> >> >
>> >> > I am trying to obtain the performance characteristics of lttng with
>> the
>> >> > use
>> >> > of test applications. Traces are being produced on a local node and
>> >> > delivered to relayd that is running on a separate node for storage.
>> >> >
>> >> > An lttng session with the test applications producing an initial bit
>> >> > rate of
>> >> > 10 kb/s is started and run for about 30 seconds. The starting
>> sub-buffer
>> >> > size is kept at 128 kb and sub-buf count at 4. The session is then
>> >> > stopped
>> >> > and destroyed and traces are analyzed to see if there are any drops.
>> >> > This is
>> >> > being done in a loop with every subsequent session having an
>> increment
>> >> > of 2
>> >> > kb/s as long as there are no drops. If there are drops, I increase
>> the
>> >> > buffer size by a factor of x2 without incrementing the bit rate.
>> >> >
>> >> > I see trace drops happening consistently with test apps producing
>> traces
>> >> > at
>> >> > less than 40 kb/s, it doesnt seem to help even if I started with 1mb
>> x 4
>> >> > sub-buffers.
>> >> >
>> >> > Analysis :
>> >> >
>> >> > I have attached the lttng_relayd , lttng_consumerd_64 logs and the
>> >> > entire
>> >> > trace directory, hope you will be able to view it.
>> >> > I have modified lttng_relayd code to dump the traces being captured
>> in
>> >> > the
>> >> > lttng_relayd logs along with debug info.
>> >> >
>> >> > Each test app is producing logs in the form of  :
>> >> > "TraceApp PID - 31940 THID - 31970 @threadRate - 1032 b/s appRate -
>> 2079
>> >> > b/s
>> >> > threadTraceNum - 9 appTraceNum - 18  sleepTime - 192120"
>> >> >
>> >> > The test application PID, test application thread id, thread bit
>> rate,
>> >> > test
>> >> > app bit rate, thread trace number and application trace number s are
>> >> > part of
>> >> > the trace. So in the above trace, the thread is producing at 1 kb/s
>> and
>> >> > the
>> >> > whole test app is producing at 2 kb/s.
>> >> >
>> >> > If we look at the babeltrace out put, we see that the Trace with
>> >> > TraceApp
>> >> > PID - 31940 appTraceNum 2 is missing , with 1, 3, 4, 5 and so on
>> being
>> >> > successfully captured.
>> >> > I looked at the lttng_relayd logs and found that trace of
>> "appTraceNum
>> >> > 2" is
>> >> > not delivered/generated by the consumerd to the relayd in sequence
>> with
>> >> > other traces. To rule out that this is not a test application
>> problem,
>> >> > you
>> >> > can look at line ltttng_relayd log : 12778 and see traces from
>> >> > appTraceNum -
>> >> > 1 to appTraceNum - 18 including the appTraceNum 2 are "re-delivered"
>> by
>> >> > the
>> >> > consumerd to the relayd.
>> >> > Essentially, I see appTraceNum 1 through appTraceNum 18 being
>> delivered
>> >> > twice, once individually where appTraceNum 2 is missing and once as a
>> >> > group
>> >> > at line 12778 where its present.
>> >> >
>> >> >
>> >> > Request help with
>> >> > 1. why traces are delivered twice, is it by design or a genuine
>> problem
>> >> > ?
>> >> > 2. how to avoid traces being dropped even though buffers are
>> >> > sufficiently
>> >> > large enough ?
>> >> >
>> >> >
>> >> > Regards,
>> >> > Aravind.
>> >> >
>> >> > _______________________________________________
>> >> > lttng-dev mailing list
>> >> > lttng-dev@lists.lttng.org
>> >> > http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev
>> >> >
>> >>
>> >>
>> >>
>> >> --
>> >> Jérémie Galarneau
>> >> EfficiOS Inc.
>> >> http://www.efficios.com
>> >
>> >
>>
>>
>>
>> --
>> Jérémie Galarneau
>> EfficiOS Inc.
>> http://www.efficios.com
>>
>
>
> _______________________________________________
> lttng-dev mailing list
> lttng-dev@lists.lttng.org
> http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev
>
>


-- 
Jonathan Rajotte Julien

[-- Attachment #1.2: Type: text/html, Size: 8447 bytes --]

[-- Attachment #2: Type: text/plain, Size: 155 bytes --]

_______________________________________________
lttng-dev mailing list
lttng-dev@lists.lttng.org
http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev


* Re: Relayd trace drops
       [not found]         ` <CAF4oYfF2Li4zau0OLE=JZUtGfPZyFT1FciBr7AG1hvzGdeUycA@mail.gmail.com>
@ 2015-12-08 12:51           ` Aravind HT
       [not found]           ` <CACd+_b6byMnZ_oWBqEk44fJOgpVX-oAGMjfh_WV8AV6cKyK9iw@mail.gmail.com>
  1 sibling, 0 replies; 18+ messages in thread
From: Aravind HT @ 2015-12-08 12:51 UTC (permalink / raw)
  To: Jonathan Rajotte; +Cc: lttng-dev


[-- Attachment #1.1: Type: text/plain, Size: 7126 bytes --]

Hi,

I am trying to upgrade in parallel, but this issue may still be present
after I upgrade, or may be temporarily masked. So I need to find the root
cause for this and then see whether a fix is available in the latest version
before committing to an upgrade.

There is another issue I'm hitting: the lttng list command hangs after
lttng destroy when running the profiling.

I found that the 64-bit consumerd goes into an infinite loop waiting to flush
metadata in lttng_ustconsumer_recv_metadata() ::
while (consumer_metadata_cache_flushed(channel, offset + len, timer)).
In consumer_metadata_cache, channel->metadata_stream->endpoint_status
is CONSUMER_ENDPOINT_ACTIVE and metadata_stream->ust_metadata_pushed is 0,
with offset having some value. This call always returns 1 from the last
else{} block, resulting in an infinite loop. Upon searching the forum, I
found the same issue being reported here:
https://www.mail-archive.com/lttng-dev@lists.lttng.org/msg07982.html
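
For what it's worth, a hedged Python sketch (assuming Python 3.5+ and the
lttng CLI on PATH) of how a profiling script can bound lttng list with a
timeout so this hang is detected instead of blocking the run:

    import subprocess

    def lttng_list(timeout_s=10):
        """Run `lttng list`; return its output, or None if it hangs past the timeout."""
        try:
            result = subprocess.run(["lttng", "list"],
                                    stdout=subprocess.PIPE, stderr=subprocess.PIPE,
                                    universal_newlines=True, timeout=timeout_s)
            return result.stdout
        except subprocess.TimeoutExpired:
            return None  # likely the metadata-flush hang described above

    if lttng_list() is None:
        print("lttng list appears to be hung")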

Regards,
Aravind.


On Tue, Dec 8, 2015 at 12:43 AM, Jonathan Rajotte <
jonathan.r.julien@gmail.com> wrote:

> Hi Aravind,
>
> Did you have the chance to upgrade to 2.6.1.If so where you able to
> reproduce?
>
> Cheers
>
> On Mon, Dec 7, 2015 at 1:07 PM, Aravind HT <aravind.ht@gmail.com> wrote:
>
>> Hi,
>>
>> I have attached the complete profiling scripts here, its a bit shabby, im
>> new to python.
>>
>> There is a README which has the details on how to execute it.
>> Im using a Yocto 1.6 on x86_64 platforms on both the nodes.
>>
>>
>> Running this script when there are other sessions running seems to
>> reproduce this problem easily.
>> Please try it and let me know if you have any issues reproducing the
>> problem.
>>
>> Regards,
>> Aravind.
>>
>> On Sat, Dec 5, 2015 at 5:23 PM, Jérémie Galarneau <
>> jeremie.galarneau@efficios.com> wrote:
>>
>>> On Fri, Dec 4, 2015 at 11:06 PM, Aravind HT <aravind.ht@gmail.com>
>>> wrote:
>>> > I am using 2.6.0 .I will try to share the code that I'm using here in
>>> some
>>> > time. If there are any specific fixes that are relevant to this issue,
>>> see
>>> > if you can provide a link to them. I would ideally like to try them out
>>> > before trying a full upgrade to the latest versions.
>>>
>>> Hi,
>>>
>>> Can you provide more information on the system? Which distribution,
>>> architecture, kernel version?
>>>
>>> The verbose sessiond logs might help pinpoint any unexpected behaviour
>>> here (are all applications registering as expected?).
>>>
>>> Jérémie
>>>
>>> >
>>> > On Fri, Dec 4, 2015 at 6:11 PM, Jérémie Galarneau
>>> > <jeremie.galarneau@efficios.com> wrote:
>>> >>
>>> >> Hi Aravind,
>>> >>
>>> >> Can't say I have looked at everything you sent yet, but as a
>>> >> preemptive question, which version are we talking about here? 2.6.0 or
>>> >> 2.6.1? 2.6.1 contains a lot of relay daemon fixes.
>>> >>
>>> >> Thanks,
>>> >> Jérémie
>>> >>
>>> >> On Thu, Dec 3, 2015 at 7:01 AM, Aravind HT <aravind.ht@gmail.com>
>>> wrote:
>>> >> > Hi,
>>> >> >
>>> >> > I am trying to obtain the performance characteristics of lttng with
>>> the
>>> >> > use
>>> >> > of test applications. Traces are being produced on a local node and
>>> >> > delivered to relayd that is running on a separate node for storage.
>>> >> >
>>> >> > An lttng session with the test applications producing an initial bit
>>> >> > rate of
>>> >> > 10 kb/s is started and run for about 30 seconds. The starting
>>> sub-buffer
>>> >> > size is kept at 128 kb and sub-buf count at 4. The session is then
>>> >> > stopped
>>> >> > and destroyed and traces are analyzed to see if there are any drops.
>>> >> > This is
>>> >> > being done in a loop with every subsequent session having an
>>> increment
>>> >> > of 2
>>> >> > kb/s as long as there are no drops. If there are drops, I increase
>>> the
>>> >> > buffer size by a factor of x2 without incrementing the bit rate.
>>> >> >
>>> >> > I see trace drops happening consistently with test apps producing
>>> traces
>>> >> > at
>>> >> > less than 40 kb/s, it doesnt seem to help even if I started with
>>> 1mb x 4
>>> >> > sub-buffers.
>>> >> >
>>> >> > Analysis :
>>> >> >
>>> >> > I have attached the lttng_relayd , lttng_consumerd_64 logs and the
>>> >> > entire
>>> >> > trace directory, hope you will be able to view it.
>>> >> > I have modified lttng_relayd code to dump the traces being captured
>>> in
>>> >> > the
>>> >> > lttng_relayd logs along with debug info.
>>> >> >
>>> >> > Each test app is producing logs in the form of  :
>>> >> > "TraceApp PID - 31940 THID - 31970 @threadRate - 1032 b/s appRate -
>>> 2079
>>> >> > b/s
>>> >> > threadTraceNum - 9 appTraceNum - 18  sleepTime - 192120"
>>> >> >
>>> >> > The test application PID, test application thread id, thread bit
>>> rate,
>>> >> > test
>>> >> > app bit rate, thread trace number and application trace number s are
>>> >> > part of
>>> >> > the trace. So in the above trace, the thread is producing at 1 kb/s
>>> and
>>> >> > the
>>> >> > whole test app is producing at 2 kb/s.
>>> >> >
>>> >> > If we look at the babeltrace out put, we see that the Trace with
>>> >> > TraceApp
>>> >> > PID - 31940 appTraceNum 2 is missing , with 1, 3, 4, 5 and so on
>>> being
>>> >> > successfully captured.
>>> >> > I looked at the lttng_relayd logs and found that trace of
>>> "appTraceNum
>>> >> > 2" is
>>> >> > not delivered/generated by the consumerd to the relayd in sequence
>>> with
>>> >> > other traces. To rule out that this is not a test application
>>> problem,
>>> >> > you
>>> >> > can look at line ltttng_relayd log : 12778 and see traces from
>>> >> > appTraceNum -
>>> >> > 1 to appTraceNum - 18 including the appTraceNum 2 are
>>> "re-delivered" by
>>> >> > the
>>> >> > consumerd to the relayd.
>>> >> > Essentially, I see appTraceNum 1 through appTraceNum 18 being
>>> delivered
>>> >> > twice, once individually where appTraceNum 2 is missing and once as
>>> a
>>> >> > group
>>> >> > at line 12778 where its present.
>>> >> >
>>> >> >
>>> >> > Request help with
>>> >> > 1. why traces are delivered twice, is it by design or a genuine
>>> problem
>>> >> > ?
>>> >> > 2. how to avoid traces being dropped even though buffers are
>>> >> > sufficiently
>>> >> > large enough ?
>>> >> >
>>> >> >
>>> >> > Regards,
>>> >> > Aravind.
>>> >> >
>>> >> > _______________________________________________
>>> >> > lttng-dev mailing list
>>> >> > lttng-dev@lists.lttng.org
>>> >> > http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev
>>> >> >
>>> >>
>>> >>
>>> >>
>>> >> --
>>> >> Jérémie Galarneau
>>> >> EfficiOS Inc.
>>> >> http://www.efficios.com
>>> >
>>> >
>>>
>>>
>>>
>>> --
>>> Jérémie Galarneau
>>> EfficiOS Inc.
>>> http://www.efficios.com
>>>
>>
>>
>> _______________________________________________
>> lttng-dev mailing list
>> lttng-dev@lists.lttng.org
>> http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev
>>
>>
>
>
> --
> Jonathan Rajotte Julien
>
>

[-- Attachment #1.2: Type: text/html, Size: 10457 bytes --]

[-- Attachment #2: Type: text/plain, Size: 155 bytes --]

_______________________________________________
lttng-dev mailing list
lttng-dev@lists.lttng.org
http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev


* Re: Relayd trace drops
       [not found]           ` <CACd+_b6byMnZ_oWBqEk44fJOgpVX-oAGMjfh_WV8AV6cKyK9iw@mail.gmail.com>
@ 2015-12-08 19:27             ` Jonathan Rajotte Julien
       [not found]             ` <56672F33.3070808@efficios.com>
  1 sibling, 0 replies; 18+ messages in thread
From: Jonathan Rajotte Julien @ 2015-12-08 19:27 UTC (permalink / raw)
  To: Aravind HT, Jonathan Rajotte; +Cc: lttng-dev

Hi Aravind,

There is no README in the archive you sent.

Cheers

On 2015-12-08 07:51 AM, Aravind HT wrote:
> Hi,
>
> I am trying to upgrade in parallel, but this issue may still be 
> present after I upgrade or may be temporarily masked. So I need to 
> find the root cause for this and then see if its available on the 
> latest before committing to upgrade.
>
> There is another issue i'm hitting, the lttng list command hangs after 
> lttng destroy session when running the profiling.
>
> I found that consumerd 64 goes into an infinite loop waiting to flush 
> metadata in lttng_ustconsumer_recv_metadata() :: while 
> (consumer_metadata_cache_flushed(channel, offset + len, timer)) .
> In consumer_metadata_cache, channel->metadata_stream->endpoint_status 
> is CONSUMER_ENDPOINT_ACTIVE, metadata_stream->ust_metadata_pushed is 0 
> with offset having some value. This call always returns a 1 from the 
> last else{} block resulting in an infinite loop. Upon searching the 
> forum, I found the same issue being reported here :
> https://www.mail-archive.com/lttng-dev@lists.lttng.org/msg07982.html
>
> Regards,
> Aravind.
>
>
> On Tue, Dec 8, 2015 at 12:43 AM, Jonathan Rajotte 
> <jonathan.r.julien@gmail.com> wrote:
>
>     Hi Aravind,
>
>     Did you have the chance to upgrade to 2.6.1.If so where you able
>     to reproduce?
>
>     Cheers
>
>     On Mon, Dec 7, 2015 at 1:07 PM, Aravind HT <aravind.ht@gmail.com>
>     wrote:
>
>         Hi,
>
>         I have attached the complete profiling scripts here, its a bit
>         shabby, im new to python.
>
>         There is a README which has the details on how to execute it.
>         Im using a Yocto 1.6 on x86_64 platforms on both the nodes.
>
>
>         Running this script when there are other sessions running
>         seems to reproduce this problem easily.
>         Please try it and let me know if you have any issues
>         reproducing the problem.
>
>         Regards,
>         Aravind.
>
>         On Sat, Dec 5, 2015 at 5:23 PM, Jérémie Galarneau
>         <jeremie.galarneau@efficios.com
>         <mailto:jeremie.galarneau@efficios.com>> wrote:
>
>             On Fri, Dec 4, 2015 at 11:06 PM, Aravind HT
>             <aravind.ht@gmail.com <mailto:aravind.ht@gmail.com>> wrote:
>             > I am using 2.6.0 .I will try to share the code that I'm
>             using here in some
>             > time. If there are any specific fixes that are relevant
>             to this issue, see
>             > if you can provide a link to them. I would ideally like
>             to try them out
>             > before trying a full upgrade to the latest versions.
>
>             Hi,
>
>             Can you provide more information on the system? Which
>             distribution,
>             architecture, kernel version?
>
>             The verbose sessiond logs might help pinpoint any
>             unexpected behaviour
>             here (are all applications registering as expected?).
>
>             Jérémie
>
>             >
>             > On Fri, Dec 4, 2015 at 6:11 PM, Jérémie Galarneau
>             > <jeremie.galarneau@efficios.com
>             <mailto:jeremie.galarneau@efficios.com>> wrote:
>             >>
>             >> Hi Aravind,
>             >>
>             >> Can't say I have looked at everything you sent yet, but
>             as a
>             >> preemptive question, which version are we talking about
>             here? 2.6.0 or
>             >> 2.6.1? 2.6.1 contains a lot of relay daemon fixes.
>             >>
>             >> Thanks,
>             >> Jérémie
>             >>
>             >> On Thu, Dec 3, 2015 at 7:01 AM, Aravind HT
>             <aravind.ht@gmail.com> wrote:
>             >> > Hi,
>             >> >
>             >> > I am trying to obtain the performance characteristics
>             of lttng with the
>             >> > use
>             >> > of test applications. Traces are being produced on a
>             local node and
>             >> > delivered to relayd that is running on a separate
>             node for storage.
>             >> >
>             >> > An lttng session with the test applications producing
>             an initial bit
>             >> > rate of
>             >> > 10 kb/s is started and run for about 30 seconds. The
>             starting sub-buffer
>             >> > size is kept at 128 kb and sub-buf count at 4. The
>             session is then
>             >> > stopped
>             >> > and destroyed and traces are analyzed to see if there
>             are any drops.
>             >> > This is
>             >> > being done in a loop with every subsequent session
>             having an increment
>             >> > of 2
>             >> > kb/s as long as there are no drops. If there are
>             drops, I increase the
>             >> > buffer size by a factor of x2 without incrementing
>             the bit rate.
>             >> >
>             >> > I see trace drops happening consistently with test
>             apps producing traces
>             >> > at
>             >> > less than 40 kb/s, it doesnt seem to help even if I
>             started with 1mb x 4
>             >> > sub-buffers.
>             >> >
>             >> > Analysis :
>             >> >
>             >> > I have attached the lttng_relayd , lttng_consumerd_64
>             logs and the
>             >> > entire
>             >> > trace directory, hope you will be able to view it.
>             >> > I have modified lttng_relayd code to dump the traces
>             being captured in
>             >> > the
>             >> > lttng_relayd logs along with debug info.
>             >> >
>             >> > Each test app is producing logs in the form of  :
>             >> > "TraceApp PID - 31940 THID - 31970 @threadRate - 1032
>             b/s appRate - 2079
>             >> > b/s
>             >> > threadTraceNum - 9 appTraceNum - 18  sleepTime - 192120"
>             >> >
>             >> > The test application PID, test application thread id,
>             thread bit rate,
>             >> > test
>             >> > app bit rate, thread trace number and application
>             trace number s are
>             >> > part of
>             >> > the trace. So in the above trace, the thread is
>             producing at 1 kb/s and
>             >> > the
>             >> > whole test app is producing at 2 kb/s.
>             >> >
>             >> > If we look at the babeltrace out put, we see that the
>             Trace with
>             >> > TraceApp
>             >> > PID - 31940 appTraceNum 2 is missing , with 1, 3, 4,
>             5 and so on being
>             >> > successfully captured.
>             >> > I looked at the lttng_relayd logs and found that
>             trace of "appTraceNum
>             >> > 2" is
>             >> > not delivered/generated by the consumerd to the
>             relayd in sequence with
>             >> > other traces. To rule out that this is not a test
>             application problem,
>             >> > you
>             >> > can look at line ltttng_relayd log : 12778 and see
>             traces from
>             >> > appTraceNum -
>             >> > 1 to appTraceNum - 18 including the appTraceNum 2 are
>             "re-delivered" by
>             >> > the
>             >> > consumerd to the relayd.
>             >> > Essentially, I see appTraceNum 1 through appTraceNum
>             18 being delivered
>             >> > twice, once individually where appTraceNum 2 is
>             missing and once as a
>             >> > group
>             >> > at line 12778 where its present.
>             >> >
>             >> >
>             >> > Request help with
>             >> > 1. why traces are delivered twice, is it by design or
>             a genuine problem
>             >> > ?
>             >> > 2. how to avoid traces being dropped even though
>             buffers are
>             >> > sufficiently
>             >> > large enough ?
>             >> >
>             >> >
>             >> > Regards,
>             >> > Aravind.
>             >> >
>             >> > _______________________________________________
>             >> > lttng-dev mailing list
>             >> > lttng-dev@lists.lttng.org
>             >> > http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev
>             >> >
>             >>
>             >>
>             >>
>             >> --
>             >> Jérémie Galarneau
>             >> EfficiOS Inc.
>             >> http://www.efficios.com
>             >
>             >
>
>
>
>             --
>             Jérémie Galarneau
>             EfficiOS Inc.
>             http://www.efficios.com
>
>
>
>         _______________________________________________
>         lttng-dev mailing list
>         lttng-dev@lists.lttng.org <mailto:lttng-dev@lists.lttng.org>
>         http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev
>
>
>
>
>     -- 
>     Jonathan Rajotte Julien
>
>
>
>
> _______________________________________________
> lttng-dev mailing list
> lttng-dev@lists.lttng.org
> http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev

-- 
Jonathan R. Julien
Efficios


* Re: Relayd trace drops
       [not found]             ` <56672F33.3070808@efficios.com>
@ 2015-12-09  9:34               ` Aravind HT
       [not found]               ` <CACd+_b6xjN8BEcfqypkhGrM2XWwxhW9RmCVgHfN5jmksGmVnPw@mail.gmail.com>
  1 sibling, 0 replies; 18+ messages in thread
From: Aravind HT @ 2015-12-09  9:34 UTC (permalink / raw)
  To: Jonathan Rajotte Julien; +Cc: lttng-dev


[-- Attachment #1.1: Type: text/plain, Size: 12668 bytes --]

Sorry about that, not sure how it got missed.
Here it is.


# This needs a two-node setup (1. local current node, 2. remote node).
# relayd runs on the current node, where traces captured from the remote
# node are stored.
# The remote node runs test applications which generate traces.
# launch_script_RN is executed on the current node and uses ssh to execute
# commands on the remote node. So this part may not work in every case and
# may prompt for a password.
# If experiencing problems with ssh, kindly check
# http://serverfault.com/questions/241588/how-to-automate-ssh-login-with-password

# ====================       To Run        =============================
launch_script_RN.py self_profile -c /tmp/configFile.txt



# configFile.txt is the file which has the configuration params that
# launchScript needs to configure lttng sessions. Below is an explanation
# of the different options.
# ===================     configFile.txt   =============================

[section1]
# final output file path
OutputFile = /tmp/Final_report.txt
# the remote node hostname on which test applications run and the test
# sessions will be created; this should be something that could be used with
# ssh. Traces will be transported from this node to the lttng_relayd running
# on the current node.
Node = MY_REMOTE_NODE
# Sub buffer size to start this with
SubBufSize = 16k
# Sub buffer count
SubBufCount = 4
# per uid buffer
BufferScheme = --buffers-uid
# yes
EnableTracing = yes
# Bit rate of the test applications. Comma-separated; for example "1, 3, 3, 50"
# says 4 test applications producing 1, 3, 3, and 50 kb/s traces.
# So with the below, we just start with 1 test application producing 10 kb/s
TestApps = 10
# session lifetime in seconds
TestTime = 10
# Max number of successive sessions to configure. If n, then n-1 sessions
# are run, e.g. MaxRun = 2 will run 1 session.
MaxRun = 100
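
A minimal sketch of how a script can consume this file with Python's
configparser (assuming Python 3; the way the actual launch_script_RN.py
reads it may differ):

    import configparser

    cfg = configparser.ConfigParser()
    cfg.read("/tmp/configFile.txt")
    s = cfg["section1"]

    node = s["Node"]                   # remote node reachable over ssh
    subbuf_size = s["SubBufSize"]      # e.g. "16k"
    subbuf_count = int(s["SubBufCount"])
    buffer_scheme = s["BufferScheme"]  # e.g. "--buffers-uid"
    rates_kbps = [int(r) for r in s["TestApps"].split(",")]  # one entry per test app
    test_time = int(s["TestTime"])
    max_run = int(s["MaxRun"])
    print(node, subbuf_size, subbuf_count, rates_kbps, test_time, max_run)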


# ====================  Place the following files under ===============

# /tmp on the remote node
clean_RemNode_apps.sh
report_lttng_script.sh

# Rest of the scripts go under /usr/sbin on the current local node on which
# lttng_relayd runs.
# Define a tracepoint MY_TRACE taking a single string arg with
# LOG_TEST_APP_PROFILING as the provider, compile the test app lttng_testApp,
# and place it under /usr/sbin of the remote host.

# In launch_script_RN.py, change currentNodeIP to the IP address on which
# relayd is receiving; default ports are used.

# lttng_relayd is started as
/usr/bin/lttng-relayd -o /var/log/lttng-traces -d

# lttng_sessiond is started as
/usr/bin/lttng-sessiond --consumerd32-path /usr/lib/lttng/libexec/lttng-consumerd \
    --consumerd32-libdir /usr/lib/ \
    --consumerd64-path /usr/lib64/lttng/libexec/lttng-consumerd \
    --consumerd64-libdir /usr/lib64/ -b --no-kernel
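
For reference, a hedged Python/subprocess sketch of the per-session lttng
commands implied by the README and config above. The flag spellings are
standard lttng 2.6 CLI, but the session and channel names and the exact
calls made by launch_script_RN.py are assumptions; in the real setup these
commands run on the remote node (over ssh), with currentNodeIP being the
relayd host.

    import subprocess

    def run(cmd):
        # In the real script these commands are issued over ssh on the remote node.
        print("+", " ".join(cmd))
        subprocess.check_call(cmd)

    session = "profile_0"          # illustrative name
    relayd_host = "currentNodeIP"  # replace with the relayd node's IP

    run(["lttng", "create", session, "--set-url=net://" + relayd_host])
    run(["lttng", "enable-channel", "-u", "chan0", "-s", session,
         "--subbuf-size", "16k", "--num-subbuf", "4", "--buffers-uid"])
    run(["lttng", "enable-event", "-u", "LOG_TEST_APP_PROFILING:*",
         "-c", "chan0", "-s", session])
    run(["lttng", "start", session])
    # ... run the test apps for TestTime seconds ...
    run(["lttng", "stop", session])
    run(["lttng", "destroy", session])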




Regards,
Aravind.

On Wed, Dec 9, 2015 at 12:57 AM, Jonathan Rajotte Julien <
Jonathan.rajotte-julien@efficios.com> wrote:

> Hi Aravind,
>
> There is no README in the archive you sent.
>
> Cheers
>
> On 2015-12-08 07:51 AM, Aravind HT wrote:
>
>> Hi,
>>
>> I am trying to upgrade in parallel, but this issue may still be present
>> after I upgrade or may be temporarily masked. So I need to find the root
>> cause for this and then see if its available on the latest before
>> committing to upgrade.
>>
>> There is another issue i'm hitting, the lttng list command hangs after
>> lttng destroy session when running the profiling.
>>
>> I found that consumerd 64 goes into an infinite loop waiting to flush
>> metadata in lttng_ustconsumer_recv_metadata() :: while
>> (consumer_metadata_cache_flushed(channel, offset + len, timer)) .
>> In consumer_metadata_cache, channel->metadata_stream->endpoint_status is
>> CONSUMER_ENDPOINT_ACTIVE, metadata_stream->ust_metadata_pushed is 0 with
>> offset having some value. This call always returns a 1 from the last else{}
>> block resulting in an infinite loop. Upon searching the forum, I found the
>> same issue being reported here :
>> https://www.mail-archive.com/lttng-dev@lists.lttng.org/msg07982.html
>>
>> Regards,
>> Aravind.
>>
>>
>> On Tue, Dec 8, 2015 at 12:43 AM, Jonathan Rajotte <
>> jonathan.r.julien@gmail.com> wrote:
>>
>>     Hi Aravind,
>>
>>     Did you have the chance to upgrade to 2.6.1.If so where you able
>>     to reproduce?
>>
>>     Cheers
>>
>>     On Mon, Dec 7, 2015 at 1:07 PM, Aravind HT <aravind.ht@gmail.com>
>>     wrote:
>>
>>         Hi,
>>
>>         I have attached the complete profiling scripts here, its a bit
>>         shabby, im new to python.
>>
>>         There is a README which has the details on how to execute it.
>>         Im using a Yocto 1.6 on x86_64 platforms on both the nodes.
>>
>>
>>         Running this script when there are other sessions running
>>         seems to reproduce this problem easily.
>>         Please try it and let me know if you have any issues
>>         reproducing the problem.
>>
>>         Regards,
>>         Aravind.
>>
>>         On Sat, Dec 5, 2015 at 5:23 PM, Jérémie Galarneau
>>         <jeremie.galarneau@efficios.com
>>         <mailto:jeremie.galarneau@efficios.com>> wrote:
>>
>>             On Fri, Dec 4, 2015 at 11:06 PM, Aravind HT
>>             <aravind.ht@gmail.com <mailto:aravind.ht@gmail.com>> wrote:
>>             > I am using 2.6.0 .I will try to share the code that I'm
>>             using here in some
>>             > time. If there are any specific fixes that are relevant
>>             to this issue, see
>>             > if you can provide a link to them. I would ideally like
>>             to try them out
>>             > before trying a full upgrade to the latest versions.
>>
>>             Hi,
>>
>>             Can you provide more information on the system? Which
>>             distribution,
>>             architecture, kernel version?
>>
>>             The verbose sessiond logs might help pinpoint any
>>             unexpected behaviour
>>             here (are all applications registering as expected?).
>>
>>             Jérémie
>>
>>             >
>>             > On Fri, Dec 4, 2015 at 6:11 PM, Jérémie Galarneau
>>             > <jeremie.galarneau@efficios.com
>>             <mailto:jeremie.galarneau@efficios.com>> wrote:
>>             >>
>>             >> Hi Aravind,
>>             >>
>>             >> Can't say I have looked at everything you sent yet, but
>>             as a
>>             >> preemptive question, which version are we talking about
>>             here? 2.6.0 or
>>             >> 2.6.1? 2.6.1 contains a lot of relay daemon fixes.
>>             >>
>>             >> Thanks,
>>             >> Jérémie
>>             >>
>>             >> On Thu, Dec 3, 2015 at 7:01 AM, Aravind HT
>>             <aravind.ht@gmail.com> wrote:
>>             >> > Hi,
>>             >> >
>>             >> > I am trying to obtain the performance characteristics
>>             of lttng with the
>>             >> > use
>>             >> > of test applications. Traces are being produced on a
>>             local node and
>>             >> > delivered to relayd that is running on a separate
>>             node for storage.
>>             >> >
>>             >> > An lttng session with the test applications producing
>>             an initial bit
>>             >> > rate of
>>             >> > 10 kb/s is started and run for about 30 seconds. The
>>             starting sub-buffer
>>             >> > size is kept at 128 kb and sub-buf count at 4. The
>>             session is then
>>             >> > stopped
>>             >> > and destroyed and traces are analyzed to see if there
>>             are any drops.
>>             >> > This is
>>             >> > being done in a loop with every subsequent session
>>             having an increment
>>             >> > of 2
>>             >> > kb/s as long as there are no drops. If there are
>>             drops, I increase the
>>             >> > buffer size by a factor of x2 without incrementing
>>             the bit rate.
>>             >> >
>>             >> > I see trace drops happening consistently with test
>>             apps producing traces
>>             >> > at
>>             >> > less than 40 kb/s, it doesnt seem to help even if I
>>             started with 1mb x 4
>>             >> > sub-buffers.
>>             >> >
>>             >> > Analysis :
>>             >> >
>>             >> > I have attached the lttng_relayd , lttng_consumerd_64
>>             logs and the
>>             >> > entire
>>             >> > trace directory, hope you will be able to view it.
>>             >> > I have modified lttng_relayd code to dump the traces
>>             being captured in
>>             >> > the
>>             >> > lttng_relayd logs along with debug info.
>>             >> >
>>             >> > Each test app is producing logs in the form of  :
>>             >> > "TraceApp PID - 31940 THID - 31970 @threadRate - 1032
>>             b/s appRate - 2079
>>             >> > b/s
>>             >> > threadTraceNum - 9 appTraceNum - 18  sleepTime - 192120"
>>             >> >
>>             >> > The test application PID, test application thread id,
>>             thread bit rate,
>>             >> > test
>>             >> > app bit rate, thread trace number and application
>>             trace number s are
>>             >> > part of
>>             >> > the trace. So in the above trace, the thread is
>>             producing at 1 kb/s and
>>             >> > the
>>             >> > whole test app is producing at 2 kb/s.
>>             >> >
>>             >> > If we look at the babeltrace out put, we see that the
>>             Trace with
>>             >> > TraceApp
>>             >> > PID - 31940 appTraceNum 2 is missing , with 1, 3, 4,
>>             5 and so on being
>>             >> > successfully captured.
>>             >> > I looked at the lttng_relayd logs and found that
>>             trace of "appTraceNum
>>             >> > 2" is
>>             >> > not delivered/generated by the consumerd to the
>>             relayd in sequence with
>>             >> > other traces. To rule out that this is not a test
>>             application problem,
>>             >> > you
>>             >> > can look at line ltttng_relayd log : 12778 and see
>>             traces from
>>             >> > appTraceNum -
>>             >> > 1 to appTraceNum - 18 including the appTraceNum 2 are
>>             "re-delivered" by
>>             >> > the
>>             >> > consumerd to the relayd.
>>             >> > Essentially, I see appTraceNum 1 through appTraceNum
>>             18 being delivered
>>             >> > twice, once individually where appTraceNum 2 is
>>             missing and once as a
>>             >> > group
>>             >> > at line 12778 where its present.
>>             >> >
>>             >> >
>>             >> > Request help with
>>             >> > 1. why traces are delivered twice, is it by design or
>>             a genuine problem
>>             >> > ?
>>             >> > 2. how to avoid traces being dropped even though
>>             buffers are
>>             >> > sufficiently
>>             >> > large enough ?
>>             >> >
>>             >> >
>>             >> > Regards,
>>             >> > Aravind.
>>             >> >
>>             >> > _______________________________________________
>>             >> > lttng-dev mailing list
>>             >> > lttng-dev@lists.lttng.org
>>             >> >
>> http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev
>>             >> >
>>             >>
>>             >>
>>             >>
>>             >> --
>>             >> Jérémie Galarneau
>>             >> EfficiOS Inc.
>>             >> http://www.efficios.com
>>             >
>>             >
>>
>>
>>
>>             --
>>             Jérémie Galarneau
>>             EfficiOS Inc.
>>             http://www.efficios.com
>>
>>
>>
>>         _______________________________________________
>>         lttng-dev mailing list
>>         lttng-dev@lists.lttng.org <mailto:lttng-dev@lists.lttng.org>
>>         http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev
>>
>>
>>
>>
>>     --     Jonathan Rajotte Julien
>>
>>
>>
>>
>> _______________________________________________
>> lttng-dev mailing list
>> lttng-dev@lists.lttng.org
>> http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev
>>
>
> --
> Jonathan R. Julien
> Efficios
>
>

[-- Attachment #1.2: Type: text/html, Size: 17711 bytes --]

[-- Attachment #2: Type: text/plain, Size: 155 bytes --]

_______________________________________________
lttng-dev mailing list
lttng-dev@lists.lttng.org
http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Relayd trace drops
       [not found]               ` <CACd+_b6xjN8BEcfqypkhGrM2XWwxhW9RmCVgHfN5jmksGmVnPw@mail.gmail.com>
@ 2015-12-11 17:16                 ` Aravind HT
       [not found]                 ` <CACd+_b46wkEH3jVOfz=SR2eS+ZXuWoL=f81ahNa4uwPNpWTHbw@mail.gmail.com>
  1 sibling, 0 replies; 18+ messages in thread
From: Aravind HT @ 2015-12-11 17:16 UTC (permalink / raw)
  To: Jonathan Rajotte Julien; +Cc: lttng-dev


[-- Attachment #1.1: Type: text/plain, Size: 13304 bytes --]

Hi,

Did you get a chance to reproduce the problem with the scripts? Let me
know if you need any help running it.

Regards,
Aravind.

On Wed, Dec 9, 2015 at 3:04 PM, Aravind HT <aravind.ht@gmail.com> wrote:

> Sorry about that, not sure how it got missed.
> Here it is.
>
>
> # This needs a two-node setup (1. local current node 2. remote node )
> # relayd runs on the current node where traces are captured from the
> remote node
> # the remote node runs test applications which generate traces.
> # launch_script_RN is executed on the current node and uses ssh to
> execute commands on the remote node. So this part may not work in every
> case and may prompt for a password.
> # if experiencing problems with ssh, kindly check
> http://serverfault.com/questions/241588/how-to-automate-ssh-login-with-password
>
> # ====================       To Run        =============================
> launch_script_RN.py self_profile -c /tmp/configFile.txt
>
>
>
> # configFile.txt is the file which has configuration params that
> launchScript
> # needs to configure lttng sessions. Below is an explanation of the
> different options.
> # ===================     configFile.txt   =============================
>
> [section1]
> # final out put file path
> OutputFile = /tmp/Final_report.txt
> # the remote node hostname on which test applications run and the test
> sessions will be created; this should be something that could be used with
> ssh. Traces will be transported from this node to the lttng_relayd running
> on the current node.
> Node = MY_REMOTE_NODE
> # Sub buffer size to start this with
> SubBufSize = 16k
> # Sub buffer count
> SubBufCount = 4
> # per uid buffer
> BufferScheme = --buffers-uid
> # yes
> EnableTracing = yes
> # Bit rate of the test applications. Comma-separated; for example "1, 3, 3,
> 50" says 4 test applications produce 1, 3, 3, and 50 kb/s of traces.
> # So with the below, we just start with 1 test application producing 10
> kb/s
> TestApps = 10
> # session lifetime in seconds
> TestTime = 10
> # Max number of successive sessions to configure. If n, then n-1 sessions
> are run, e.g. MaxRun = 2 will run 1 session.
> MaxRun = 100
>
>
> # ====================  Place the following files under ===============
>
> # /tmp on the remote node
> clean_RemNode_apps.sh
> report_lttng_script.sh
>
> # rest of the scripts go under /usr/sbin on the current local node on which
> lttng_relayd runs
> # Define a tracepoint MY_TRACE that takes a single string argument, with
> LOG_TEST_APP_PROFILING as the provider; compile the lttng_testApp test app and
> place it under /usr/sbin of the remote host (see the provider sketch after this block)
>
> # in launch_script_RN.py change currentNodeIP to the IP address on which
> relayd is receiving, default ports are used.
>
> # lttng_relayd is started as
> /usr/bin/lttng-relayd -o /var/log/lttng-traces -d
>
> # lttng_sessiond is started as
> /usr/bin/lttng-sessiond --consumerd32-path
> /usr/lib/lttng/libexec/lttng-consumerd --consumerd32-libdir /usr/lib/
> --consumerd64-path /usr/lib64/lttng/libexec/lttng-consumerd
> --consumerd64-libdir /usr/lib64/ -b --no-kernel
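
For reference, a minimal sketch of what such a provider header could look like with lttng-ust 2.x follows. Only the provider name LOG_TEST_APP_PROFILING and the event name MY_TRACE come from the instructions above; the header name (tp.h) and the field name (msg) are illustrative assumptions, not part of the original scripts.

/* tp.h -- illustrative sketch of the MY_TRACE provider */
#undef TRACEPOINT_PROVIDER
#define TRACEPOINT_PROVIDER LOG_TEST_APP_PROFILING

#undef TRACEPOINT_INCLUDE
#define TRACEPOINT_INCLUDE "./tp.h"

#if !defined(_TP_H) || defined(TRACEPOINT_HEADER_MULTI_READ)
#define _TP_H

#include <lttng/tracepoint.h>

/* MY_TRACE carries a single string payload, as described above. */
TRACEPOINT_EVENT(
    LOG_TEST_APP_PROFILING,
    MY_TRACE,
    TP_ARGS(const char *, msg),
    TP_FIELDS(
        ctf_string(msg, msg)
    )
)

#endif /* _TP_H */

#include <lttng/tracepoint-event.h>

A test application would compile this header in one .c file with TRACEPOINT_DEFINE and TRACEPOINT_CREATE_PROBES defined, link with -llttng-ust, and emit each log line with tracepoint(LOG_TEST_APP_PROFILING, MY_TRACE, buf).
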
>
>
>
>
> Regards,
> Aravind.
>
> On Wed, Dec 9, 2015 at 12:57 AM, Jonathan Rajotte Julien <
> Jonathan.rajotte-julien@efficios.com> wrote:
>
>> Hi Aravind,
>>
>> There is no README in the archive you sent.
>>
>> Cheers
>>
>> On 2015-12-08 07:51 AM, Aravind HT wrote:
>>
>>> Hi,
>>>
>>> I am trying to upgrade in parallel, but this issue may still be present
>>> after I upgrade or may be temporarily masked. So I need to find the root
>>> cause for this and then see if its available on the latest before
>>> committing to upgrade.
>>>
>>> There is another issue i'm hitting, the lttng list command hangs after
>>> lttng destroy session when running the profiling.
>>>
>>> I found that consumerd 64 goes into an infinite loop waiting to flush
>>> metadata in lttng_ustconsumer_recv_metadata() :: while
>>> (consumer_metadata_cache_flushed(channel, offset + len, timer)) .
>>> In consumer_metadata_cache, channel->metadata_stream->endpoint_status is
>>> CONSUMER_ENDPOINT_ACTIVE, metadata_stream->ust_metadata_pushed is 0 with
>>> offset having some value. This call always returns a 1 from the last else{}
>>> block resulting in an infinite loop. Upon searching the forum, I found the
>>> same issue being reported here :
>>> https://www.mail-archive.com/lttng-dev@lists.lttng.org/msg07982.html
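
For readability, the loop described above has roughly the following shape; this is a schematic with assumed names, not the actual LTTng 2.6 sources. With the state reported (endpoint ACTIVE, ust_metadata_pushed stuck at 0), the check keeps returning 1, so the caller never leaves its while loop.

/* Schematic of the reported wait condition -- assumed names, not LTTng code. */
#include <stdio.h>

enum endpoint_status { ENDPOINT_ACTIVE, ENDPOINT_INACTIVE };

struct metadata_stream {
    enum endpoint_status endpoint_status;  /* CONSUMER_ENDPOINT_ACTIVE in the report */
    unsigned long ust_metadata_pushed;     /* stays 0 in the report */
};

/* Returns 0 once everything up to 'offset' has been pushed, 1 otherwise. */
static int cache_flushed_up_to(const struct metadata_stream *stream,
                               unsigned long offset)
{
    if (stream->endpoint_status != ENDPOINT_ACTIVE)
        return 0;  /* stream torn down: stop waiting */
    if (stream->ust_metadata_pushed >= offset)
        return 0;  /* flush completed */
    return 1;      /* the "last else{}" case: keep waiting */
}

int main(void)
{
    struct metadata_stream stream = { ENDPOINT_ACTIVE, 0 };
    int i;

    /*
     * Another thread is expected to advance ust_metadata_pushed; if it never
     * does, a loop of the form
     *     while (cache_flushed_up_to(&stream, offset + len)) { wait; }
     * never exits, which matches the reported hang.
     */
    for (i = 0; i < 3; i++)
        printf("attempt %d: still waiting = %d\n", i,
               cache_flushed_up_to(&stream, 4096));
    return 0;
}
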
>>>
>>> Regards,
>>> Aravind.
>>>
>>>
>>> On Tue, Dec 8, 2015 at 12:43 AM, Jonathan Rajotte <
>>> jonathan.r.julien@gmail.com> wrote:
>>>
>>>     Hi Aravind,
>>>
>>>     Did you have the chance to upgrade to 2.6.1.If so where you able
>>>     to reproduce?
>>>
>>>     Cheers
>>>
>>>     On Mon, Dec 7, 2015 at 1:07 PM, Aravind HT <aravind.ht@gmail.com>
>>>     wrote:
>>>
>>>         Hi,
>>>
>>>         I have attached the complete profiling scripts here, its a bit
>>>         shabby, im new to python.
>>>
>>>         There is a README which has the details on how to execute it.
>>>         Im using a Yocto 1.6 on x86_64 platforms on both the nodes.
>>>
>>>
>>>         Running this script when there are other sessions running
>>>         seems to reproduce this problem easily.
>>>         Please try it and let me know if you have any issues
>>>         reproducing the problem.
>>>
>>>         Regards,
>>>         Aravind.
>>>
>>>         On Sat, Dec 5, 2015 at 5:23 PM, Jérémie Galarneau
>>>         <jeremie.galarneau@efficios.com
>>>         <mailto:jeremie.galarneau@efficios.com>> wrote:
>>>
>>>             On Fri, Dec 4, 2015 at 11:06 PM, Aravind HT
>>>             <aravind.ht@gmail.com <mailto:aravind.ht@gmail.com>> wrote:
>>>             > I am using 2.6.0 .I will try to share the code that I'm
>>>             using here in some
>>>             > time. If there are any specific fixes that are relevant
>>>             to this issue, see
>>>             > if you can provide a link to them. I would ideally like
>>>             to try them out
>>>             > before trying a full upgrade to the latest versions.
>>>
>>>             Hi,
>>>
>>>             Can you provide more information on the system? Which
>>>             distribution,
>>>             architecture, kernel version?
>>>
>>>             The verbose sessiond logs might help pinpoint any
>>>             unexpected behaviour
>>>             here (are all applications registering as expected?).
>>>
>>>             Jérémie
>>>
>>>             >
>>>             > On Fri, Dec 4, 2015 at 6:11 PM, Jérémie Galarneau
>>>             > <jeremie.galarneau@efficios.com
>>>             <mailto:jeremie.galarneau@efficios.com>> wrote:
>>>             >>
>>>             >> Hi Aravind,
>>>             >>
>>>             >> Can't say I have looked at everything you sent yet, but
>>>             as a
>>>             >> preemptive question, which version are we talking about
>>>             here? 2.6.0 or
>>>             >> 2.6.1? 2.6.1 contains a lot of relay daemon fixes.
>>>             >>
>>>             >> Thanks,
>>>             >> Jérémie
>>>             >>
>>>             >> On Thu, Dec 3, 2015 at 7:01 AM, Aravind HT
>>>             <aravind.ht@gmail.com> wrote:
>>>             >> > Hi,
>>>             >> >
>>>             >> > I am trying to obtain the performance characteristics
>>>             of lttng with the
>>>             >> > use
>>>             >> > of test applications. Traces are being produced on a
>>>             local node and
>>>             >> > delivered to relayd that is running on a separate
>>>             node for storage.
>>>             >> >
>>>             >> > An lttng session with the test applications producing
>>>             an initial bit
>>>             >> > rate of
>>>             >> > 10 kb/s is started and run for about 30 seconds. The
>>>             starting sub-buffer
>>>             >> > size is kept at 128 kb and sub-buf count at 4. The
>>>             session is then
>>>             >> > stopped
>>>             >> > and destroyed and traces are analyzed to see if there
>>>             are any drops.
>>>             >> > This is
>>>             >> > being done in a loop with every subsequent session
>>>             having an increment
>>>             >> > of 2
>>>             >> > kb/s as long as there are no drops. If there are
>>>             drops, I increase the
>>>             >> > buffer size by a factor of x2 without incrementing
>>>             the bit rate.
>>>             >> >
>>>             >> > I see trace drops happening consistently with test
>>>             apps producing traces
>>>             >> > at
>>>             >> > less than 40 kb/s, it doesnt seem to help even if I
>>>             started with 1mb x 4
>>>             >> > sub-buffers.
>>>             >> >
>>>             >> > Analysis :
>>>             >> >
>>>             >> > I have attached the lttng_relayd , lttng_consumerd_64
>>>             logs and the
>>>             >> > entire
>>>             >> > trace directory, hope you will be able to view it.
>>>             >> > I have modified lttng_relayd code to dump the traces
>>>             being captured in
>>>             >> > the
>>>             >> > lttng_relayd logs along with debug info.
>>>             >> >
>>>             >> > Each test app is producing logs in the form of  :
>>>             >> > "TraceApp PID - 31940 THID - 31970 @threadRate - 1032
>>>             b/s appRate - 2079
>>>             >> > b/s
>>>             >> > threadTraceNum - 9 appTraceNum - 18  sleepTime - 192120"
>>>             >> >
>>>             >> > The test application PID, test application thread id,
>>>             thread bit rate,
>>>             >> > test
>>>             >> > app bit rate, thread trace number and application
>>>             trace number s are
>>>             >> > part of
>>>             >> > the trace. So in the above trace, the thread is
>>>             producing at 1 kb/s and
>>>             >> > the
>>>             >> > whole test app is producing at 2 kb/s.
>>>             >> >
>>>             >> > If we look at the babeltrace out put, we see that the
>>>             Trace with
>>>             >> > TraceApp
>>>             >> > PID - 31940 appTraceNum 2 is missing , with 1, 3, 4,
>>>             5 and so on being
>>>             >> > successfully captured.
>>>             >> > I looked at the lttng_relayd logs and found that
>>>             trace of "appTraceNum
>>>             >> > 2" is
>>>             >> > not delivered/generated by the consumerd to the
>>>             relayd in sequence with
>>>             >> > other traces. To rule out that this is not a test
>>>             application problem,
>>>             >> > you
>>>             >> > can look at line ltttng_relayd log : 12778 and see
>>>             traces from
>>>             >> > appTraceNum -
>>>             >> > 1 to appTraceNum - 18 including the appTraceNum 2 are
>>>             "re-delivered" by
>>>             >> > the
>>>             >> > consumerd to the relayd.
>>>             >> > Essentially, I see appTraceNum 1 through appTraceNum
>>>             18 being delivered
>>>             >> > twice, once individually where appTraceNum 2 is
>>>             missing and once as a
>>>             >> > group
>>>             >> > at line 12778 where its present.
>>>             >> >
>>>             >> >
>>>             >> > Request help with
>>>             >> > 1. why traces are delivered twice, is it by design or
>>>             a genuine problem
>>>             >> > ?
>>>             >> > 2. how to avoid traces being dropped even though
>>>             buffers are
>>>             >> > sufficiently
>>>             >> > large enough ?
>>>             >> >
>>>             >> >
>>>             >> > Regards,
>>>             >> > Aravind.
>>>             >> >
>>>             >> > _______________________________________________
>>>             >> > lttng-dev mailing list
>>>             >> > lttng-dev@lists.lttng.org
>>>             >> >
>>> http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev
>>>             >> >
>>>             >>
>>>             >>
>>>             >>
>>>             >> --
>>>             >> Jérémie Galarneau
>>>             >> EfficiOS Inc.
>>>             >> http://www.efficios.com
>>>             >
>>>             >
>>>
>>>
>>>
>>>             --
>>>             Jérémie Galarneau
>>>             EfficiOS Inc.
>>>             http://www.efficios.com
>>>
>>>
>>>
>>>         _______________________________________________
>>>         lttng-dev mailing list
>>>         lttng-dev@lists.lttng.org <mailto:lttng-dev@lists.lttng.org>
>>>         http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev
>>>
>>>
>>>
>>>
>>>     --     Jonathan Rajotte Julien
>>>
>>>
>>>
>>>
>>> _______________________________________________
>>> lttng-dev mailing list
>>> lttng-dev@lists.lttng.org
>>> http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev
>>>
>>
>> --
>> Jonathan R. Julien
>> Efficios
>>
>>
>

[-- Attachment #1.2: Type: text/html, Size: 18264 bytes --]

[-- Attachment #2: Type: text/plain, Size: 155 bytes --]

_______________________________________________
lttng-dev mailing list
lttng-dev@lists.lttng.org
http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Relayd trace drops
       [not found]                 ` <CACd+_b46wkEH3jVOfz=SR2eS+ZXuWoL=f81ahNa4uwPNpWTHbw@mail.gmail.com>
@ 2015-12-11 18:03                   ` Jonathan Rajotte Julien
       [not found]                   ` <566B0FEA.7040307@efficios.com>
  1 sibling, 0 replies; 18+ messages in thread
From: Jonathan Rajotte Julien @ 2015-12-11 18:03 UTC (permalink / raw)
  To: Aravind HT; +Cc: lttng-dev

Hi Aravind,

On 2015-12-11 12:16 PM, Aravind HT wrote:
> Hi,
>
> Did you get a chance to reproduce the problem with the scripts ? Let 
> me know if you need any help running it.

Time is a scarce resource, unfortunately. Until I have the time to
look at it, you might want to investigate whether you can reproduce it
on 2.6.1/2.7.0/master on two regular (x86_64, Ubuntu/Debian) VMs; this would
really help.

Cheers!

>
> Regards,
> Aravind.
>
> On Wed, Dec 9, 2015 at 3:04 PM, Aravind HT <aravind.ht@gmail.com 
> <mailto:aravind.ht@gmail.com>> wrote:
>
>     Sorry about that, not sure how it got missed.
>     Here it is.
>
>
>     # This needs a two node set up (1. local current node 2. remote node )
>     # relayd runs on the current node where traces are captured from
>     the remote node
>     # remote node runs test applications which generate traces.
>     # the launch_script_RN is executed on the current node and uses
>     ssh  to execute commands on the remote node. So this part may not
>     work in every case and may prompt for a password.
>     # if experiencing problems with ssh , kindly check
>     http://serverfault.com/questions/241588/how-to-automate-ssh-login-with-password
>
>     # ====================       To Run  =============================
>     launch_script_RN.py self_profile -c /tmp/configFile.txt
>
>
>
>     # configFile.txt is the file which has configuration params that
>     launchScript
>     # needs to configure lttng sessions. Below is an explanation of
>     the different options.
>     # ===================     configFile.txt =============================
>
>     [section1]
>     # final out put file path
>     OutputFile = /tmp/Final_report.txt
>     # the remote node hostname on which test applications run and the
>     test sessions will be created; this should be something that could
>     be used with ssh. Traces will be transported from this node to the
>     lttng_relayd running on the current node.
>     Node = MY_REMOTE_NODE
>     # Sub buffer size to start this with
>     SubBufSize = 16k
>     # Sub buffer count
>     SubBufCount = 4
>     # per uid buffer
>     BufferScheme = --buffers-uid
>     # yes
>     EnableTracing = yes
>     # Bit rate of the test applications. Comman seperated example "1,
>     3, 3, 50" sayss 4 test applications producing 1, 3, 3, and 50 Kb/s
>     traces.
>     # So with the below, we just start with 1 test application
>     producing 10 kb/s
>     TestApps = 10
>     # session life time in seonds
>     TestTime = 10
>     # Max number of successive sessions to configure. if n then n-1
>     sessions are run, ex MaxRun = 2 will run 1 session.
>     MaxRun = 100
>
>
>     # ====================  Place the following files under
>     ===============
>
>     # /tmp on the remote node
>     clean_RemNode_apps.sh
>     report_lttng_script.sh
>
>     # rest of the scripts under /usr/sbin on the current local node on
>     which lttng_realyd runs
>     # Define a trace point MY_TRACE to take a single string arg with
>     LOG_TEST_APP_PROFILING as the provider, compile test lttng_testApp
>     and place it under /usr/sbin of the remote host
>
>     # in launch_script_RN.py change currentNodeIP to the IP address on
>     which relayd is receiving, default ports are used.
>
>     # lttng_relayd is started as
>     /usr/bin/lttng-relayd -o /var/log/lttng-traces -d
>
>     # lttng_sessiond is started as
>     /usr/bin/lttng-sessiond --consumerd32-path
>     /usr/lib/lttng/libexec/lttng-consumerd --consumerd32-libdir
>     /usr/lib/ --consumerd64-path
>     /usr/lib64/lttng/libexec/lttng-consumerd --consumerd64-libdir
>     /usr/lib64/ -b --no-kernel
>
>
>
>
>     Regards,
>     Aravind.
>
>     On Wed, Dec 9, 2015 at 12:57 AM, Jonathan Rajotte Julien
>     <Jonathan.rajotte-julien@efficios.com
>     <mailto:Jonathan.rajotte-julien@efficios.com>> wrote:
>
>         Hi Aravind,
>
>         There is no README in the archive you sent.
>
>         Cheers
>
>         On 2015-12-08 07:51 AM, Aravind HT wrote:
>
>             Hi,
>
>             I am trying to upgrade in parallel, but this issue may
>             still be present after I upgrade or may be temporarily
>             masked. So I need to find the root cause for this and then
>             see if its available on the latest before committing to
>             upgrade.
>
>             There is another issue i'm hitting, the lttng list command
>             hangs after lttng destroy session when running the profiling.
>
>             I found that consumerd 64 goes into an infinite loop
>             waiting to flush metadata in
>             lttng_ustconsumer_recv_metadata() :: while
>             (consumer_metadata_cache_flushed(channel, offset + len,
>             timer)) .
>             In consumer_metadata_cache,
>             channel->metadata_stream->endpoint_status is
>             CONSUMER_ENDPOINT_ACTIVE,
>             metadata_stream->ust_metadata_pushed is 0 with offset
>             having some value. This call always returns a 1 from the
>             last else{} block resulting in an infinite loop. Upon
>             searching the forum, I found the same issue being reported
>             here :
>             https://www.mail-archive.com/lttng-dev@lists.lttng.org/msg07982.html
>
>             Regards,
>             Aravind.
>
>
>             On Tue, Dec 8, 2015 at 12:43 AM, Jonathan Rajotte
>             <jonathan.r.julien@gmail.com
>             <mailto:jonathan.r.julien@gmail.com>> wrote:
>
>                 Hi Aravind,
>
>                 Did you have the chance to upgrade to 2.6.1.If so
>             where you able
>                 to reproduce?
>
>                 Cheers
>
>                 On Mon, Dec 7, 2015 at 1:07 PM, Aravind HT
>             <aravind.ht@gmail.com <mailto:aravind.ht@gmail.com>>
>                 wrote:
>
>                     Hi,
>
>                     I have attached the complete profiling scripts
>             here, its a bit
>                     shabby, im new to python.
>
>                     There is a README which has the details on how to
>             execute it.
>                     Im using a Yocto 1.6 on x86_64 platforms on both
>             the nodes.
>
>
>                     Running this script when there are other sessions
>             running
>                     seems to reproduce this problem easily.
>                     Please try it and let me know if you have any issues
>                     reproducing the problem.
>
>                     Regards,
>                     Aravind.
>
>                     On Sat, Dec 5, 2015 at 5:23 PM, Jérémie Galarneau
>                     <jeremie.galarneau@efficios.com
>             <mailto:jeremie.galarneau@efficios.com>
>                     <mailto:jeremie.galarneau@efficios.com
>             <mailto:jeremie.galarneau@efficios.com>>> wrote:
>
>                         On Fri, Dec 4, 2015 at 11:06 PM, Aravind HT
>                         <aravind.ht@gmail.com
>             <mailto:aravind.ht@gmail.com> <mailto:aravind.ht@gmail.com
>             <mailto:aravind.ht@gmail.com>>> wrote:
>                         > I am using 2.6.0 .I will try to share the
>             code that I'm
>                         using here in some
>                         > time. If there are any specific fixes that
>             are relevant
>                         to this issue, see
>                         > if you can provide a link to them. I would
>             ideally like
>                         to try them out
>                         > before trying a full upgrade to the latest
>             versions.
>
>                         Hi,
>
>                         Can you provide more information on the
>             system? Which
>                         distribution,
>                         architecture, kernel version?
>
>                         The verbose sessiond logs might help pinpoint any
>                         unexpected behaviour
>                         here (are all applications registering as
>             expected?).
>
>                         Jérémie
>
>                         >
>                         > On Fri, Dec 4, 2015 at 6:11 PM, Jérémie
>             Galarneau
>                         > <jeremie.galarneau@efficios.com
>             <mailto:jeremie.galarneau@efficios.com>
>                         <mailto:jeremie.galarneau@efficios.com
>             <mailto:jeremie.galarneau@efficios.com>>> wrote:
>                         >>
>                         >> Hi Aravind,
>                         >>
>                         >> Can't say I have looked at everything you
>             sent yet, but
>                         as a
>                         >> preemptive question, which version are we
>             talking about
>                         here? 2.6.0 or
>                         >> 2.6.1? 2.6.1 contains a lot of relay daemon
>             fixes.
>                         >>
>                         >> Thanks,
>                         >> Jérémie
>                         >>
>                         >> On Thu, Dec 3, 2015 at 7:01 AM, Aravind HT
>                         <aravind.ht@gmail.com
>             <mailto:aravind.ht@gmail.com>> wrote:
>                         >> > Hi,
>                         >> >
>                         >> > I am trying to obtain the performance
>             characteristics
>                         of lttng with the
>                         >> > use
>                         >> > of test applications. Traces are being
>             produced on a
>                         local node and
>                         >> > delivered to relayd that is running on a
>             separate
>                         node for storage.
>                         >> >
>                         >> > An lttng session with the test
>             applications producing
>                         an initial bit
>                         >> > rate of
>                         >> > 10 kb/s is started and run for about 30
>             seconds. The
>                         starting sub-buffer
>                         >> > size is kept at 128 kb and sub-buf count
>             at 4. The
>                         session is then
>                         >> > stopped
>                         >> > and destroyed and traces are analyzed to
>             see if there
>                         are any drops.
>                         >> > This is
>                         >> > being done in a loop with every
>             subsequent session
>                         having an increment
>                         >> > of 2
>                         >> > kb/s as long as there are no drops. If
>             there are
>                         drops, I increase the
>                         >> > buffer size by a factor of x2 without
>             incrementing
>                         the bit rate.
>                         >> >
>                         >> > I see trace drops happening consistently
>             with test
>                         apps producing traces
>                         >> > at
>                         >> > less than 40 kb/s, it doesnt seem to help
>             even if I
>                         started with 1mb x 4
>                         >> > sub-buffers.
>                         >> >
>                         >> > Analysis :
>                         >> >
>                         >> > I have attached the lttng_relayd ,
>             lttng_consumerd_64
>                         logs and the
>                         >> > entire
>                         >> > trace directory, hope you will be able to
>             view it.
>                         >> > I have modified lttng_relayd code to dump
>             the traces
>                         being captured in
>                         >> > the
>                         >> > lttng_relayd logs along with debug info.
>                         >> >
>                         >> > Each test app is producing logs in the
>             form of  :
>                         >> > "TraceApp PID - 31940 THID - 31970
>             @threadRate - 1032
>                         b/s appRate - 2079
>                         >> > b/s
>                         >> > threadTraceNum - 9 appTraceNum - 18 
>             sleepTime - 192120"
>                         >> >
>                         >> > The test application PID, test
>             application thread id,
>                         thread bit rate,
>                         >> > test
>                         >> > app bit rate, thread trace number and
>             application
>                         trace number s are
>                         >> > part of
>                         >> > the trace. So in the above trace, the
>             thread is
>                         producing at 1 kb/s and
>                         >> > the
>                         >> > whole test app is producing at 2 kb/s.
>                         >> >
>                         >> > If we look at the babeltrace out put, we
>             see that the
>                         Trace with
>                         >> > TraceApp
>                         >> > PID - 31940 appTraceNum 2 is missing ,
>             with 1, 3, 4,
>                         5 and so on being
>                         >> > successfully captured.
>                         >> > I looked at the lttng_relayd logs and
>             found that
>                         trace of "appTraceNum
>                         >> > 2" is
>                         >> > not delivered/generated by the consumerd
>             to the
>                         relayd in sequence with
>                         >> > other traces. To rule out that this is
>             not a test
>                         application problem,
>                         >> > you
>                         >> > can look at line ltttng_relayd log :
>             12778 and see
>                         traces from
>                         >> > appTraceNum -
>                         >> > 1 to appTraceNum - 18 including the
>             appTraceNum 2 are
>                         "re-delivered" by
>                         >> > the
>                         >> > consumerd to the relayd.
>                         >> > Essentially, I see appTraceNum 1 through
>             appTraceNum
>                         18 being delivered
>                         >> > twice, once individually where
>             appTraceNum 2 is
>                         missing and once as a
>                         >> > group
>                         >> > at line 12778 where its present.
>                         >> >
>                         >> >
>                         >> > Request help with
>                         >> > 1. why traces are delivered twice, is it
>             by design or
>                         a genuine problem
>                         >> > ?
>                         >> > 2. how to avoid traces being dropped even
>             though
>                         buffers are
>                         >> > sufficiently
>                         >> > large enough ?
>                         >> >
>                         >> >
>                         >> > Regards,
>                         >> > Aravind.
>                         >> >
>                         >> >
>             _______________________________________________
>                         >> > lttng-dev mailing list
>                         >> > lttng-dev@lists.lttng.org
>             <mailto:lttng-dev@lists.lttng.org>
>                         >> >
>             http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev
>                         >> >
>                         >>
>                         >>
>                         >>
>                         >> --
>                         >> Jérémie Galarneau
>                         >> EfficiOS Inc.
>                         >> http://www.efficios.com
>                         >
>                         >
>
>
>
>                         --
>                         Jérémie Galarneau
>                         EfficiOS Inc.
>             http://www.efficios.com
>
>
>
>             _______________________________________________
>                     lttng-dev mailing list
>             lttng-dev@lists.lttng.org
>             <mailto:lttng-dev@lists.lttng.org>
>             <mailto:lttng-dev@lists.lttng.org
>             <mailto:lttng-dev@lists.lttng.org>>
>             http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev
>
>
>
>
>                 --     Jonathan Rajotte Julien
>
>
>
>
>             _______________________________________________
>             lttng-dev mailing list
>             lttng-dev@lists.lttng.org <mailto:lttng-dev@lists.lttng.org>
>             http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev
>
>
>         -- 
>         Jonathan R. Julien
>         Efficios
>
>
>

-- 
Jonathan R. Julien
Efficios


_______________________________________________
lttng-dev mailing list
lttng-dev@lists.lttng.org
http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Relayd trace drops
       [not found]                   ` <566B0FEA.7040307@efficios.com>
@ 2015-12-11 18:42                     ` Aravind HT
       [not found]                     ` <CACd+_b6za+gk_fFhFapz-zRpwO8B31UX6+hMJxcjJsxrKVM5ow@mail.gmail.com>
  1 sibling, 0 replies; 18+ messages in thread
From: Aravind HT @ 2015-12-11 18:42 UTC (permalink / raw)
  To: Jonathan Rajotte Julien; +Cc: lttng-dev


[-- Attachment #1.1: Type: text/plain, Size: 19963 bytes --]

No problem, understood. I will continue to investigate this; I got a bit
side-tracked with the sessiond hang issue that I mentioned earlier.
I am making some progress on that and will update as soon as I find the root
cause. It looks like a mutex deadlock between the following threads in the
consumerd daemon:

#1  0x0000000000410b5d in lttcomm_recv_unix_sock (sock=26, buf=buf@entry=0x7f16acca57f0, len=len@entry=4476) at unix.c:175
#2  0x000000000041dc96 in lttng_ustconsumer_request_metadata (ctx=ctx@entry=0x640d10, channel=0x7f16980192b0, timer=timer@entry=0, wait=wait@entry=0) at ust-consumer.c:2331
#3  0x000000000041ff3c in lttng_ustconsumer_sync_metadata (ctx=ctx@entry=0x640d10, metadata=metadata@entry=0x7f169801aa40) at ust-consumer.c:1829
#4  0x000000000040eabf in do_sync_metadata (ctx=<optimized out>, metadata=<optimized out>) at consumer-stream.c:468
#5  consumer_stream_sync_metadata (ctx=ctx@entry=0x640d10, session_id=8) at consumer-stream.c:545
#6  0x000000000041d316 in lttng_ustconsumer_read_subbuffer (stream=stream@entry=0x7f1698018fa0, ctx=ctx@entry=0x640d10) at ust-consumer.c:2063
#7  0x000000000040b613 in lttng_consumer_read_subbuffer (stream=0x7f1698018fa0, ctx=0x640d10) at consumer.c:3203
#8  0x0000000000409608 in consumer_thread_data_poll (data=0x640d10) at consumer.c:2557
#9  0x00007f16af6a6fe3 in start_thread (arg=0x7f16acca7700) at pthread_create.c:312
#10 0x00007f16af3dcafd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

and

the metadata poll thread, where consumer_thread_metadata_poll() calls
lttng_consumer_read_subbuffer() / on_buffer_ready(), which then hangs in
pthread_mutex_lock(stream->lock).
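
To picture the suspected interaction, here is a minimal, self-contained pthread sketch; every name in it is made up, and only the overall shape comes from the backtrace and the description above. One thread blocks on a read while holding a per-stream lock, and a second thread then blocks forever trying to take the same lock. Built with cc -pthread, it hangs the same way.

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t stream_lock = PTHREAD_MUTEX_INITIALIZER;
static int metadata_fd[2];  /* stands in for the sessiond unix socket */

/* Analogue of the data poll thread: takes the stream lock, then blocks
 * waiting for a metadata reply that never arrives. */
static void *data_poll_thread(void *arg)
{
    char c;

    (void) arg;
    pthread_mutex_lock(&stream_lock);
    if (read(metadata_fd[0], &c, 1) < 0)  /* blocks forever: nobody writes */
        perror("read");
    pthread_mutex_unlock(&stream_lock);
    return NULL;
}

/* Analogue of the metadata poll thread: needs the same stream lock before
 * it can do anything, so it blocks as well. */
static void *metadata_poll_thread(void *arg)
{
    (void) arg;
    pthread_mutex_lock(&stream_lock);
    fprintf(stderr, "metadata thread got the lock (never printed)\n");
    pthread_mutex_unlock(&stream_lock);
    return NULL;
}

int main(void)
{
    pthread_t data, metadata;

    if (pipe(metadata_fd) < 0)
        return 1;
    pthread_create(&data, NULL, data_poll_thread, NULL);
    sleep(1);  /* let the data thread take the lock first */
    pthread_create(&metadata, NULL, metadata_poll_thread, NULL);
    pthread_join(data, NULL);  /* hangs: both threads are stuck */
    pthread_join(metadata, NULL);
    return 0;
}
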


This again can be reproduced by the same script, just running consecutive
sessions of short duration (<10 seconds) with traces getting captured on a
remote node.
I hit this every time I run the script.

Regards,
Aravind.



On Fri, Dec 11, 2015 at 11:33 PM, Jonathan Rajotte Julien <
Jonathan.rajotte-julien@efficios.com> wrote:

> Hi Aravind,
>
> On 2015-12-11 12:16 PM, Aravind HT wrote:
>
>> Hi,
>>
>> Did you get a chance to reproduce the problem with the scripts ? Let me
>> know if you need any help running it.
>>
>
> Time is a scarce resource unfortunately. But until I have the time to look
> at it you might want to investigate if you are able to reproduce it on
> 2.6.1/2.7.0/master on 2 regular (x86_64, ubuntu/debian) VM this would
> really help.
>
> Cheers!
>
>
>> Regards,
>> Aravind.
>>
>>
>> On Wed, Dec 9, 2015 at 3:04 PM, Aravind HT <aravind.ht@gmail.com <mailto:
>> aravind.ht@gmail.com>> wrote:
>>
>>     Sorry about that, not sure how it got missed.
>>     Here it is.
>>
>>
>>     # This needs a two node set up (1. local current node 2. remote node )
>>     # relayd runs on the current node where traces are captured from
>>     the remote node
>>     # remote node runs test applications which generate traces.
>>     # the launch_script_RN is executed on the current node and uses
>>     ssh  to execute commands on the remote node. So this part may not
>>     work in every case and may prompt for a password.
>>     # if experiencing problems with ssh , kindly check
>>
>> http://serverfault.com/questions/241588/how-to-automate-ssh-login-with-password
>>
>>     # ====================       To Run  =============================
>>     launch_script_RN.py self_profile -c /tmp/configFile.txt
>>
>>
>>
>>     # configFile.txt is the file which has configuration params that
>>     launchScript
>>     # needs to configure lttng sessions. Below is an explanation of
>>     the different options.
>>     # ===================     configFile.txt =============================
>>
>>     [section1]
>>     # final out put file path
>>     OutputFile = /tmp/Final_report.txt
>>     # the remote node hostname on which test applications run and the
>>     test sessions will be created; this should be something that could
>>     be used with ssh. Traces will be transported from this node to the
>>     lttng_relayd running on the current node.
>>     Node = MY_REMOTE_NODE
>>     # Sub buffer size to start this with
>>     SubBufSize = 16k
>>     # Sub buffer count
>>     SubBufCount = 4
>>     # per uid buffer
>>     BufferScheme = --buffers-uid
>>     # yes
>>     EnableTracing = yes
>>     # Bit rate of the test applications. Comman seperated example "1,
>>     3, 3, 50" sayss 4 test applications producing 1, 3, 3, and 50 Kb/s
>>     traces.
>>     # So with the below, we just start with 1 test application
>>     producing 10 kb/s
>>     TestApps = 10
>>     # session life time in seonds
>>     TestTime = 10
>>     # Max number of successive sessions to configure. if n then n-1
>>     sessions are run, ex MaxRun = 2 will run 1 session.
>>     MaxRun = 100
>>
>>
>>     # ====================  Place the following files under
>>     ===============
>>
>>     # /tmp on the remote node
>>     clean_RemNode_apps.sh
>>     report_lttng_script.sh
>>
>>     # rest of the scripts under /usr/sbin on the current local node on
>>     which lttng_realyd runs
>>     # Define a trace point MY_TRACE to take a single string arg with
>>     LOG_TEST_APP_PROFILING as the provider, compile test lttng_testApp
>>     and place it under /usr/sbin of the remote host
>>
>>     # in launch_script_RN.py change currentNodeIP to the IP address on
>>     which relayd is receiving, default ports are used.
>>
>>     # lttng_relayd is started as
>>     /usr/bin/lttng-relayd -o /var/log/lttng-traces -d
>>
>>     # lttng_sessiond is started as
>>     /usr/bin/lttng-sessiond --consumerd32-path
>>     /usr/lib/lttng/libexec/lttng-consumerd --consumerd32-libdir
>>     /usr/lib/ --consumerd64-path
>>     /usr/lib64/lttng/libexec/lttng-consumerd --consumerd64-libdir
>>     /usr/lib64/ -b --no-kernel
>>
>>
>>
>>
>>     Regards,
>>     Aravind.
>>
>>     On Wed, Dec 9, 2015 at 12:57 AM, Jonathan Rajotte Julien
>>     <Jonathan.rajotte-julien@efficios.com
>>     <mailto:Jonathan.rajotte-julien@efficios.com>> wrote:
>>
>>         Hi Aravind,
>>
>>         There is no README in the archive you sent.
>>
>>         Cheers
>>
>>         On 2015-12-08 07:51 AM, Aravind HT wrote:
>>
>>             Hi,
>>
>>             I am trying to upgrade in parallel, but this issue may
>>             still be present after I upgrade or may be temporarily
>>             masked. So I need to find the root cause for this and then
>>             see if its available on the latest before committing to
>>             upgrade.
>>
>>             There is another issue i'm hitting, the lttng list command
>>             hangs after lttng destroy session when running the profiling.
>>
>>             I found that consumerd 64 goes into an infinite loop
>>             waiting to flush metadata in
>>             lttng_ustconsumer_recv_metadata() :: while
>>             (consumer_metadata_cache_flushed(channel, offset + len,
>>             timer)) .
>>             In consumer_metadata_cache,
>>             channel->metadata_stream->endpoint_status is
>>             CONSUMER_ENDPOINT_ACTIVE,
>>             metadata_stream->ust_metadata_pushed is 0 with offset
>>             having some value. This call always returns a 1 from the
>>             last else{} block resulting in an infinite loop. Upon
>>             searching the forum, I found the same issue being reported
>>             here :
>>
>> https://www.mail-archive.com/lttng-dev@lists.lttng.org/msg07982.html
>>
>>             Regards,
>>             Aravind.
>>
>>
>>             On Tue, Dec 8, 2015 at 12:43 AM, Jonathan Rajotte
>>             <jonathan.r.julien@gmail.com
>>             <mailto:jonathan.r.julien@gmail.com>> wrote:
>>
>>                 Hi Aravind,
>>
>>                 Did you have the chance to upgrade to 2.6.1.If so
>>             where you able
>>                 to reproduce?
>>
>>                 Cheers
>>
>>                 On Mon, Dec 7, 2015 at 1:07 PM, Aravind HT
>>             <aravind.ht@gmail.com <mailto:aravind.ht@gmail.com>>
>>                 wrote:
>>
>>                     Hi,
>>
>>                     I have attached the complete profiling scripts
>>             here, its a bit
>>                     shabby, im new to python.
>>
>>                     There is a README which has the details on how to
>>             execute it.
>>                     Im using a Yocto 1.6 on x86_64 platforms on both
>>             the nodes.
>>
>>
>>                     Running this script when there are other sessions
>>             running
>>                     seems to reproduce this problem easily.
>>                     Please try it and let me know if you have any issues
>>                     reproducing the problem.
>>
>>                     Regards,
>>                     Aravind.
>>
>>                     On Sat, Dec 5, 2015 at 5:23 PM, Jérémie Galarneau
>>                     <jeremie.galarneau@efficios.com
>>             <mailto:jeremie.galarneau@efficios.com>
>>                     <mailto:jeremie.galarneau@efficios.com
>>             <mailto:jeremie.galarneau@efficios.com>>> wrote:
>>
>>                         On Fri, Dec 4, 2015 at 11:06 PM, Aravind HT
>>                         <aravind.ht@gmail.com
>>             <mailto:aravind.ht@gmail.com> <mailto:aravind.ht@gmail.com
>>
>>             <mailto:aravind.ht@gmail.com>>> wrote:
>>                         > I am using 2.6.0 .I will try to share the
>>             code that I'm
>>                         using here in some
>>                         > time. If there are any specific fixes that
>>             are relevant
>>                         to this issue, see
>>                         > if you can provide a link to them. I would
>>             ideally like
>>                         to try them out
>>                         > before trying a full upgrade to the latest
>>             versions.
>>
>>                         Hi,
>>
>>                         Can you provide more information on the
>>             system? Which
>>                         distribution,
>>                         architecture, kernel version?
>>
>>                         The verbose sessiond logs might help pinpoint any
>>                         unexpected behaviour
>>                         here (are all applications registering as
>>             expected?).
>>
>>                         Jérémie
>>
>>                         >
>>                         > On Fri, Dec 4, 2015 at 6:11 PM, Jérémie
>>             Galarneau
>>                         > <jeremie.galarneau@efficios.com
>>             <mailto:jeremie.galarneau@efficios.com>
>>                         <mailto:jeremie.galarneau@efficios.com
>>             <mailto:jeremie.galarneau@efficios.com>>> wrote:
>>                         >>
>>                         >> Hi Aravind,
>>                         >>
>>                         >> Can't say I have looked at everything you
>>             sent yet, but
>>                         as a
>>                         >> preemptive question, which version are we
>>             talking about
>>                         here? 2.6.0 or
>>                         >> 2.6.1? 2.6.1 contains a lot of relay daemon
>>             fixes.
>>                         >>
>>                         >> Thanks,
>>                         >> Jérémie
>>                         >>
>>                         >> On Thu, Dec 3, 2015 at 7:01 AM, Aravind HT
>>                         <aravind.ht@gmail.com
>>             <mailto:aravind.ht@gmail.com>> wrote:
>>                         >> > Hi,
>>                         >> >
>>                         >> > I am trying to obtain the performance
>>             characteristics
>>                         of lttng with the
>>                         >> > use
>>                         >> > of test applications. Traces are being
>>             produced on a
>>                         local node and
>>                         >> > delivered to relayd that is running on a
>>             separate
>>                         node for storage.
>>                         >> >
>>                         >> > An lttng session with the test
>>             applications producing
>>                         an initial bit
>>                         >> > rate of
>>                         >> > 10 kb/s is started and run for about 30
>>             seconds. The
>>                         starting sub-buffer
>>                         >> > size is kept at 128 kb and sub-buf count
>>             at 4. The
>>                         session is then
>>                         >> > stopped
>>                         >> > and destroyed and traces are analyzed to
>>             see if there
>>                         are any drops.
>>                         >> > This is
>>                         >> > being done in a loop with every
>>             subsequent session
>>                         having an increment
>>                         >> > of 2
>>                         >> > kb/s as long as there are no drops. If
>>             there are
>>                         drops, I increase the
>>                         >> > buffer size by a factor of x2 without
>>             incrementing
>>                         the bit rate.
>>                         >> >
>>                         >> > I see trace drops happening consistently
>>             with test
>>                         apps producing traces
>>                         >> > at
>>                         >> > less than 40 kb/s, it doesnt seem to help
>>             even if I
>>                         started with 1mb x 4
>>                         >> > sub-buffers.
>>                         >> >
>>                         >> > Analysis :
>>                         >> >
>>                         >> > I have attached the lttng_relayd ,
>>             lttng_consumerd_64
>>                         logs and the
>>                         >> > entire
>>                         >> > trace directory, hope you will be able to
>>             view it.
>>                         >> > I have modified lttng_relayd code to dump
>>             the traces
>>                         being captured in
>>                         >> > the
>>                         >> > lttng_relayd logs along with debug info.
>>                         >> >
>>                         >> > Each test app is producing logs in the
>>             form of  :
>>                         >> > "TraceApp PID - 31940 THID - 31970
>>             @threadRate - 1032
>>                         b/s appRate - 2079
>>                         >> > b/s
>>                         >> > threadTraceNum - 9 appTraceNum - 18
>>    sleepTime - 192120"
>>                         >> >
>>                         >> > The test application PID, test
>>             application thread id,
>>                         thread bit rate,
>>                         >> > test
>>                         >> > app bit rate, thread trace number and
>>             application
>>                         trace number s are
>>                         >> > part of
>>                         >> > the trace. So in the above trace, the
>>             thread is
>>                         producing at 1 kb/s and
>>                         >> > the
>>                         >> > whole test app is producing at 2 kb/s.
>>                         >> >
>>                         >> > If we look at the babeltrace out put, we
>>             see that the
>>                         Trace with
>>                         >> > TraceApp
>>                         >> > PID - 31940 appTraceNum 2 is missing ,
>>             with 1, 3, 4,
>>                         5 and so on being
>>                         >> > successfully captured.
>>                         >> > I looked at the lttng_relayd logs and
>>             found that
>>                         trace of "appTraceNum
>>                         >> > 2" is
>>                         >> > not delivered/generated by the consumerd
>>             to the
>>                         relayd in sequence with
>>                         >> > other traces. To rule out that this is
>>             not a test
>>                         application problem,
>>                         >> > you
>>                         >> > can look at line ltttng_relayd log :
>>             12778 and see
>>                         traces from
>>                         >> > appTraceNum -
>>                         >> > 1 to appTraceNum - 18 including the
>>             appTraceNum 2 are
>>                         "re-delivered" by
>>                         >> > the
>>                         >> > consumerd to the relayd.
>>                         >> > Essentially, I see appTraceNum 1 through
>>             appTraceNum
>>                         18 being delivered
>>                         >> > twice, once individually where
>>             appTraceNum 2 is
>>                         missing and once as a
>>                         >> > group
>>                         >> > at line 12778 where its present.
>>                         >> >
>>                         >> >
>>                         >> > Request help with
>>                         >> > 1. why traces are delivered twice, is it
>>             by design or
>>                         a genuine problem
>>                         >> > ?
>>                         >> > 2. how to avoid traces being dropped even
>>             though
>>                         buffers are
>>                         >> > sufficiently
>>                         >> > large enough ?
>>                         >> >
>>                         >> >
>>                         >> > Regards,
>>                         >> > Aravind.
>>                         >> >
>>                         >> >
>>             _______________________________________________
>>                         >> > lttng-dev mailing list
>>                         >> > lttng-dev@lists.lttng.org
>>             <mailto:lttng-dev@lists.lttng.org>
>>                         >> >
>>             http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev
>>                         >> >
>>                         >>
>>                         >>
>>                         >>
>>                         >> --
>>                         >> Jérémie Galarneau
>>                         >> EfficiOS Inc.
>>                         >> http://www.efficios.com
>>                         >
>>                         >
>>
>>
>>
>>                         --
>>                         Jérémie Galarneau
>>                         EfficiOS Inc.
>>             http://www.efficios.com
>>
>>
>>
>>             _______________________________________________
>>                     lttng-dev mailing list
>>             lttng-dev@lists.lttng.org
>>             <mailto:lttng-dev@lists.lttng.org>
>>             <mailto:lttng-dev@lists.lttng.org
>>             <mailto:lttng-dev@lists.lttng.org>>
>>             http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev
>>
>>
>>
>>
>>                 --     Jonathan Rajotte Julien
>>
>>
>>
>>
>>             _______________________________________________
>>             lttng-dev mailing list
>>             lttng-dev@lists.lttng.org <mailto:lttng-dev@lists.lttng.org>
>>             http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev
>>
>>
>>         --         Jonathan R. Julien
>>         Efficios
>>
>>
>>
>>
> --
> Jonathan R. Julien
> Efficios
>
>

[-- Attachment #1.2: Type: text/html, Size: 28427 bytes --]

[-- Attachment #2: Type: text/plain, Size: 155 bytes --]

_______________________________________________
lttng-dev mailing list
lttng-dev@lists.lttng.org
http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Relayd trace drops
       [not found]                     ` <CACd+_b6za+gk_fFhFapz-zRpwO8B31UX6+hMJxcjJsxrKVM5ow@mail.gmail.com>
@ 2015-12-11 22:43                       ` Mathieu Desnoyers
       [not found]                       ` <1125287188.233739.1449873787684.JavaMail.zimbra@efficios.com>
  1 sibling, 0 replies; 18+ messages in thread
From: Mathieu Desnoyers @ 2015-12-11 22:43 UTC (permalink / raw)
  To: Aravind HT; +Cc: lttng-dev


[-- Attachment #1.1: Type: text/plain, Size: 14752 bytes --]

Hi, 

Please upgrade to 2.6.1 or 2.7.0; those are old issues. 

Thanks, 

Mathieu 

----- On Dec 11, 2015, at 1:42 PM, Aravind HT <aravind.ht@gmail.com> wrote: 

> NPBS understand it, I will continue to investigate this, got a bit side tracked
> with the sessiond hang issue that I mentioned earlier.
> Making some progress on that, will update as soon as I find the root cause, it
> looks like a mutex deadlock between
> the following threads in the consumerd daemon

> #1 0x0000000000410b5d in lttcomm_recv_unix_sock (sock=26,
> buf=buf@entry=0x7f16acca57f0, len=len@entry=4476) at unix.c:175
> #2 0x000000000041dc96 in lttng_ustconsumer_request_metadata
> (ctx=ctx@entry=0x640d10, channel=0x7f16980192b0, timer=timer@entry=0,
> wait=wait@entry=0) at ust-consumer.c:2331
> #3 0x000000000041ff3c in lttng_ustconsumer_sync_metadata
> (ctx=ctx@entry=0x640d10, metadata=metadata@entry=0x7f169801aa40) at
> ust-consumer.c:1829
> #4 0x000000000040eabf in do_sync_metadata (ctx=<optimized out>,
> metadata=<optimized out>) at consumer-stream.c:468
> #5 consumer_stream_sync_metadata (ctx=ctx@entry=0x640d10, session_id=8) at
> consumer-stream.c:545
> #6 0x000000000041d316 in lttng_ustconsumer_read_subbuffer
> (stream=stream@entry=0x7f1698018fa0, ctx=ctx@entry=0x640d10) at
> ust-consumer.c:2063
> #7 0x000000000040b613 in lttng_consumer_read_subbuffer (stream=0x7f1698018fa0,
> ctx=0x640d10) at consumer.c:3203
> #8 0x0000000000409608 in consumer_thread_data_poll (data=0x640d10) at
> consumer.c:2557
> #9 0x00007f16af6a6fe3 in start_thread (arg=0x7f16acca7700) at
> pthread_create.c:312
> #10 0x00007f16af3dcafd in clone () at
> ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

> and

> consumer_thread_metadata_poll() calling lttng_consumer_read_subbuffer() /
> on_buffer_ready() calling pthread_mutex_lock(stream->lock) and then hanging

> This again can be reproduced by the same script, just running consecutive
> sessions of short duration (<10 seconds) with traces getting captured on a
> remote node.
> I hit this every time I run the script.

> Regards,
> Aravind.

> On Fri, Dec 11, 2015 at 11:33 PM, Jonathan Rajotte Julien <
> Jonathan.rajotte-julien@efficios.com > wrote:

>> Hi Aravind,

>> On 2015-12-11 12:16 PM, Aravind HT wrote:

>>> Hi,

>>> Did you get a chance to reproduce the problem with the scripts ? Let me know if
>>> you need any help running it.

>> Time is a scarce resource unfortunately. But until I have the time to look at it
>> you might want to investigate if you are able to reproduce it on
>> 2.6.1/2.7.0/master on 2 regular (x86_64, ubuntu/debian) VM this would really
>> help.

>> Cheers!

>>> Regards,
>>> Aravind.

>>> On Wed, Dec 9, 2015 at 3:04 PM, Aravind HT < aravind.ht@gmail.com <mailto:
>>> aravind.ht@gmail.com >> wrote:

>>> Sorry about that, not sure how it got missed.
>>> Here it is.

>>> # This needs a two node set up (1. local current node 2. remote node )
>>> # relayd runs on the current node where traces are captured from
>>> the remote node
>>> # remote node runs test applications which generate traces.
>>> # the launch_script_RN is executed on the current node and uses
>>> ssh to execute commands on the remote node. So this part may not
>>> work in every case and may prompt for a password.
>>> # if experiencing problems with ssh , kindly check
>>> http://serverfault.com/questions/241588/how-to-automate-ssh-login-with-password

>>> # ==================== To Run =============================
>>> launch_script_RN.py self_profile -c /tmp/configFile.txt

>>> # configFile.txt is the file which has configuration params that
>>> launchScript
>>> # needs to configure lttng sessions. Below is an explanation of
>>> the different options.
>>> # =================== configFile.txt =============================

>>> [section1]
>>> # final out put file path
>>> OutputFile = /tmp/Final_report.txt
>>> # the remote node hostname on which test applications run and the
>>> test sessions will be created; this should be something that could
>>> be used with ssh. Traces will be transported from this node to the
>>> lttng_relayd running on the current node.
>>> Node = MY_REMOTE_NODE
>>> # Sub buffer size to start this with
>>> SubBufSize = 16k
>>> # Sub buffer count
>>> SubBufCount = 4
>>> # per uid buffer
>>> BufferScheme = --buffers-uid
>>> # yes
>>> EnableTracing = yes
>>> # Bit rate of the test applications. Comman seperated example "1,
>>> 3, 3, 50" sayss 4 test applications producing 1, 3, 3, and 50 Kb/s
>>> traces.
>>> # So with the below, we just start with 1 test application
>>> producing 10 kb/s
>>> TestApps = 10
>>> # session life time in seonds
>>> TestTime = 10
>>> # Max number of successive sessions to configure. if n then n-1
>>> sessions are run, ex MaxRun = 2 will run 1 session.
>>> MaxRun = 100

>>> # ==================== Place the following files under
>>> ===============

>>> # /tmp on the remote node
>>> clean_RemNode_apps.sh
>>> report_lttng_script.sh

>>> # rest of the scripts under /usr/sbin on the current local node on
>>> which lttng_realyd runs
>>> # Define a trace point MY_TRACE to take a single string arg with
>>> LOG_TEST_APP_PROFILING as the provider, compile test lttng_testApp
>>> and place it under /usr/sbin of the remote host

>>> # in launch_script_RN.py change currentNodeIP to the IP address on
>>> which relayd is receiving, default ports are used.

>>> # lttng_relayd is started as
>>> /usr/bin/lttng-relayd -o /var/log/lttng-traces -d

>>> # lttng_sessiond is started as
>>> /usr/bin/lttng-sessiond --consumerd32-path
>>> /usr/lib/lttng/libexec/lttng-consumerd --consumerd32-libdir
>>> /usr/lib/ --consumerd64-path
>>> /usr/lib64/lttng/libexec/lttng-consumerd --consumerd64-libdir
>>> /usr/lib64/ -b --no-kernel

>>> Regards,
>>> Aravind.

>>> On Wed, Dec 9, 2015 at 12:57 AM, Jonathan Rajotte Julien
>>> < Jonathan.rajotte-julien@efficios.com
>>> <mailto: Jonathan.rajotte-julien@efficios.com >> wrote:

>>> Hi Aravind,

>>> There is no README in the archive you sent.

>>> Cheers

>>> On 2015-12-08 07:51 AM, Aravind HT wrote:

>>> Hi,

>>> I am trying to upgrade in parallel, but this issue may
>>> still be present after I upgrade or may be temporarily
>>> masked. So I need to find the root cause for this and then
>>> see if its available on the latest before committing to
>>> upgrade.

>>> There is another issue i'm hitting, the lttng list command
>>> hangs after lttng destroy session when running the profiling.

>>> I found that consumerd 64 goes into an infinite loop
>>> waiting to flush metadata in
>>> lttng_ustconsumer_recv_metadata() :: while
>>> (consumer_metadata_cache_flushed(channel, offset + len,
>>> timer)) .
>>> In consumer_metadata_cache,
>>> channel->metadata_stream->endpoint_status is
>>> CONSUMER_ENDPOINT_ACTIVE,
>>> metadata_stream->ust_metadata_pushed is 0 with offset
>>> having some value. This call always returns a 1 from the
>>> last else{} block resulting in an infinite loop. Upon
>>> searching the forum, I found the same issue being reported
>>> here :
>>> https://www.mail-archive.com/lttng-dev@lists.lttng.org/msg07982.html

>>> Regards,
>>> Aravind.

>>> On Tue, Dec 8, 2015 at 12:43 AM, Jonathan Rajotte
>>> < jonathan.r.julien@gmail.com
>>> <mailto: jonathan.r.julien@gmail.com >> wrote:

>>> Hi Aravind,

>>> Did you have the chance to upgrade to 2.6.1.If so
>>> where you able
>>> to reproduce?

>>> Cheers

>>> On Mon, Dec 7, 2015 at 1:07 PM, Aravind HT
>>> < aravind.ht@gmail.com <mailto: aravind.ht@gmail.com >>
>>> wrote:

>>> Hi,

>>> I have attached the complete profiling scripts
>>> here, its a bit
>>> shabby, im new to python.

>>> There is a README which has the details on how to
>>> execute it.
>>> Im using a Yocto 1.6 on x86_64 platforms on both
>>> the nodes.

>>> Running this script when there are other sessions
>>> running
>>> seems to reproduce this problem easily.
>>> Please try it and let me know if you have any issues
>>> reproducing the problem.

>>> Regards,
>>> Aravind.

>>> On Sat, Dec 5, 2015 at 5:23 PM, Jérémie Galarneau
>>> < jeremie.galarneau@efficios.com
>>> <mailto: jeremie.galarneau@efficios.com >
>>> <mailto: jeremie.galarneau@efficios.com
>>> <mailto: jeremie.galarneau@efficios.com >>> wrote:

>>> On Fri, Dec 4, 2015 at 11:06 PM, Aravind HT
>>> < aravind.ht@gmail.com
>>> <mailto: aravind.ht@gmail.com > <mailto: aravind.ht@gmail.com

>>> <mailto: aravind.ht@gmail.com >>> wrote:
>>> > I am using 2.6.0 .I will try to share the
>>> code that I'm
>>> using here in some
>>> > time. If there are any specific fixes that
>>> are relevant
>>> to this issue, see
>>> > if you can provide a link to them. I would
>>> ideally like
>>> to try them out
>>> > before trying a full upgrade to the latest
>>> versions.

>>> Hi,

>>> Can you provide more information on the
>>> system? Which
>>> distribution,
>>> architecture, kernel version?

>>> The verbose sessiond logs might help pinpoint any
>>> unexpected behaviour
>>> here (are all applications registering as
>>> expected?).

>>> Jérémie


>>> > On Fri, Dec 4, 2015 at 6:11 PM, Jérémie
>>> Galarneau
>>> > < jeremie.galarneau@efficios.com
>>> <mailto: jeremie.galarneau@efficios.com >
>>> <mailto: jeremie.galarneau@efficios.com
>>> <mailto: jeremie.galarneau@efficios.com >>> wrote:

>>> >> Hi Aravind,

>>> >> Can't say I have looked at everything you
>>> sent yet, but
>>> as a
>>> >> preemptive question, which version are we
>>> talking about
>>> here? 2.6.0 or
>>> >> 2.6.1? 2.6.1 contains a lot of relay daemon
>>> fixes.

>>> >> Thanks,
>>> >> Jérémie

>>> >> On Thu, Dec 3, 2015 at 7:01 AM, Aravind HT
>>> < aravind.ht@gmail.com
>>> <mailto: aravind.ht@gmail.com >> wrote:
>>> >> > Hi,

>>> >> > I am trying to obtain the performance
>>> characteristics
>>> of lttng with the
>>> >> > use
>>> >> > of test applications. Traces are being
>>> produced on a
>>> local node and
>>> >> > delivered to relayd that is running on a
>>> separate
>>> node for storage.

>>> >> > An lttng session with the test
>>> applications producing
>>> an initial bit
>>> >> > rate of
>>> >> > 10 kb/s is started and run for about 30
>>> seconds. The
>>> starting sub-buffer
>>> >> > size is kept at 128 kb and sub-buf count
>>> at 4. The
>>> session is then
>>> >> > stopped
>>> >> > and destroyed and traces are analyzed to
>>> see if there
>>> are any drops.
>>> >> > This is
>>> >> > being done in a loop with every
>>> subsequent session
>>> having an increment
>>> >> > of 2
>>> >> > kb/s as long as there are no drops. If
>>> there are
>>> drops, I increase the
>>> >> > buffer size by a factor of x2 without
>>> incrementing
>>> the bit rate.

>>> >> > I see trace drops happening consistently
>>> with test
>>> apps producing traces
>>> >> > at
>>> >> > less than 40 kb/s, it doesnt seem to help
>>> even if I
>>> started with 1mb x 4
>>> >> > sub-buffers.

>>> >> > Analysis :

>>> >> > I have attached the lttng_relayd ,
>>> lttng_consumerd_64
>>> logs and the
>>> >> > entire
>>> >> > trace directory, hope you will be able to
>>> view it.
>>> >> > I have modified lttng_relayd code to dump
>>> the traces
>>> being captured in
>>> >> > the
>>> >> > lttng_relayd logs along with debug info.

>>> >> > Each test app is producing logs in the
>>> form of :
>>> >> > "TraceApp PID - 31940 THID - 31970
>>> @threadRate - 1032
>>> b/s appRate - 2079
>>> >> > b/s
>>> >> > threadTraceNum - 9 appTraceNum - 18 sleepTime - 192120"

>>> >> > The test application PID, test
>>> application thread id,
>>> thread bit rate,
>>> >> > test
>>> >> > app bit rate, thread trace number and
>>> application
>>> trace number s are
>>> >> > part of
>>> >> > the trace. So in the above trace, the
>>> thread is
>>> producing at 1 kb/s and
>>> >> > the
>>> >> > whole test app is producing at 2 kb/s.

>>> >> > If we look at the babeltrace out put, we
>>> see that the
>>> Trace with
>>> >> > TraceApp
>>> >> > PID - 31940 appTraceNum 2 is missing ,
>>> with 1, 3, 4,
>>> 5 and so on being
>>> >> > successfully captured.
>>> >> > I looked at the lttng_relayd logs and
>>> found that
>>> trace of "appTraceNum
>>> >> > 2" is
>>> >> > not delivered/generated by the consumerd
>>> to the
>>> relayd in sequence with
>>> >> > other traces. To rule out that this is
>>> not a test
>>> application problem,
>>> >> > you
>>> >> > can look at line ltttng_relayd log :
>>> 12778 and see
>>> traces from
>>> >> > appTraceNum -
>>> >> > 1 to appTraceNum - 18 including the
>>> appTraceNum 2 are
>>> "re-delivered" by
>>> >> > the
>>> >> > consumerd to the relayd.
>>> >> > Essentially, I see appTraceNum 1 through
>>> appTraceNum
>>> 18 being delivered
>>> >> > twice, once individually where
>>> appTraceNum 2 is
>>> missing and once as a
>>> >> > group
>>> >> > at line 12778 where its present.


>>> >> > Request help with
>>> >> > 1. why traces are delivered twice, is it
>>> by design or
>>> a genuine problem
>>> >> > ?
>>> >> > 2. how to avoid traces being dropped even
>>> though
>>> buffers are
>>> >> > sufficiently
>>> >> > large enough ?


>>> >> > Regards,
>>> >> > Aravind.


>>> _______________________________________________
>>> >> > lttng-dev mailing list
>>> >> > lttng-dev@lists.lttng.org
>>> <mailto: lttng-dev@lists.lttng.org >

>>> http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev




>>> >> --
>>> >> Jérémie Galarneau
>>> >> EfficiOS Inc.
>>> >> http://www.efficios.com



>>> --
>>> Jérémie Galarneau
>>> EfficiOS Inc.
>>> http://www.efficios.com

>>> _______________________________________________
>>> lttng-dev mailing list
>>> lttng-dev@lists.lttng.org
>>> <mailto: lttng-dev@lists.lttng.org >
>>> <mailto: lttng-dev@lists.lttng.org
>>> <mailto: lttng-dev@lists.lttng.org >>
>>> http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev

>>> -- Jonathan Rajotte Julien

>>> _______________________________________________
>>> lttng-dev mailing list
>>> lttng-dev@lists.lttng.org <mailto: lttng-dev@lists.lttng.org >
>>> http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev

>>> -- Jonathan R. Julien
>>> Efficios

>> --
>> Jonathan R. Julien
>> Efficios

> _______________________________________________
> lttng-dev mailing list
> lttng-dev@lists.lttng.org
> http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev

-- 
Mathieu Desnoyers 
EfficiOS Inc. 
http://www.efficios.com 

[-- Attachment #1.2: Type: text/html, Size: 40930 bytes --]

[-- Attachment #2: Type: text/plain, Size: 155 bytes --]

_______________________________________________
lttng-dev mailing list
lttng-dev@lists.lttng.org
http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Relayd trace drops
       [not found]                       ` <1125287188.233739.1449873787684.JavaMail.zimbra@efficios.com>
@ 2015-12-16 13:02                         ` Aravind HT
       [not found]                         ` <CACd+_b7X3+Hzi5HYt3KggWMCVNPyhyDb4aK3-Pik98wcb70kCQ@mail.gmail.com>
  1 sibling, 0 replies; 18+ messages in thread
From: Aravind HT @ 2015-12-16 13:02 UTC (permalink / raw)
  To: Mathieu Desnoyers; +Cc: lttng-dev


[-- Attachment #1.1: Type: text/plain, Size: 22516 bytes --]

Hi Mathieu,

I've been able to analyze this sessiond - consumerd hang scenario, and I see
that the fix is not available in 2.6.1 or in 2.7.0.

Root Cause :

The hang is mainly caused by a race condition on the
consumer_metadata_socket between the following two flows of code:
Flow 1. do_sync_metadata() -> lttng_ustconsumer_sync_metadata() ->
lttng_ustconsumer_request_metadata() -> lttcomm_recv_unix_sock() ->
recvmsg()

and

Flow 2. lttng_consumer_recv_cmd() -> lttng_ustconsumer_recv_cmd() ->
lttng_ustconsumer_recv_metadata()

Flow 1 above initiates a metadata request and then recvmsg()s for the reply
from sessiond on the consumer_metadata_socket (it has taken the
metadata_socket_lock and the stream->lock), but this socket is
simultaneously being polled by Flow 2. Once in a while Flow 2 gets to
receive the reply while Flow 1 just hangs in recvmsg(). Since Flow 1 has
taken the stream->lock, this then introduces a deadlock at a later stage,
in on_buffer_ready(), when the metadata is being flushed.

I think taking the metadata_socket_lock in Flow 2 before attempting to
receive may address this issue, but that is a bit complicated, as the
message would already have been received at the point where that lock is
taken. The other alternative is to remove the recv() from Flow 1 entirely
and let it be handled by Flow 2.

Please let me know if this has been fixed in later builds; maybe I
searched for it in the wrong place.
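
To make the lock ordering easier to see, below is a minimal standalone
sketch of the same shape of hang. This is not LTTng code; all names
(flow1, flow2, stream_lock, sv) are made up for illustration. One thread
blocks in recv() while holding the stream lock after the other thread has
already drained the socket, and that other thread then blocks on the lock.

/*
 * Minimal standalone sketch of the hang described above -- NOT LTTng code.
 * Build with: gcc deadlock_sketch.c -pthread
 */
#include <pthread.h>
#include <sys/socket.h>
#include <unistd.h>

static pthread_mutex_t stream_lock = PTHREAD_MUTEX_INITIALIZER;
static int sv[2];   /* sv[0]: "consumer" end, sv[1]: "sessiond" end */

static void *flow1(void *arg)   /* the request-metadata path */
{
	char reply[16];

	pthread_mutex_lock(&stream_lock);      /* stream->lock */
	write(sv[1], "metadata", 8);           /* "sessiond" sends the reply */
	sleep(1);                              /* let flow2 win the race */
	recv(sv[0], reply, sizeof(reply), 0);  /* hangs: reply already taken */
	pthread_mutex_unlock(&stream_lock);
	return NULL;
}

static void *flow2(void *arg)   /* the metadata poll path */
{
	char reply[16];

	recv(sv[0], reply, sizeof(reply), 0);  /* steals the reply */
	pthread_mutex_lock(&stream_lock);      /* on_buffer_ready(): blocks */
	pthread_mutex_unlock(&stream_lock);
	return NULL;
}

int main(void)
{
	pthread_t t1, t2;

	socketpair(AF_UNIX, SOCK_STREAM, 0, sv);
	pthread_create(&t1, NULL, flow1, NULL);
	pthread_create(&t2, NULL, flow2, NULL);
	pthread_join(t1, NULL);                /* never returns */
	pthread_join(t2, NULL);
	return 0;
}

Both threads end up blocked for good: flow1 in recv() while holding the
lock, flow2 on the mutex, which matches the two consumerd backtraces.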


Regards,
Aravind.

On Sat, Dec 12, 2015 at 4:13 AM, Mathieu Desnoyers <
mathieu.desnoyers@efficios.com> wrote:

> Hi,
>
> Please upgrade to 2.6.1 or 2.7.0, those are old issues.
>
> Thanks,
>
> Mathieu
>
>
> ----- On Dec 11, 2015, at 1:42 PM, Aravind HT <aravind.ht@gmail.com>
> wrote:
>
>
> NPBS understand it, I will continue to investigate this, got a bit side
> tracked with the sessiond hang issue that I mentioned earlier.
> Making some progress on that, will update as soon as I find the root
> cause, it looks like a mutex deadlock between
> the following threads in the consumerd daemon
>
> #1  0x0000000000410b5d in lttcomm_recv_unix_sock (sock=26, buf=buf@entry=0x7f16acca57f0,
> len=len@entry=4476) at unix.c:175
> #2  0x000000000041dc96 in lttng_ustconsumer_request_metadata (ctx=ctx@entry=0x640d10,
> channel=0x7f16980192b0, timer=timer@entry=0, wait=wait@entry=0) at
> ust-consumer.c:2331
> #3  0x000000000041ff3c in lttng_ustconsumer_sync_metadata (ctx=ctx@entry=0x640d10,
> metadata=metadata@entry=0x7f169801aa40) at ust-consumer.c:1829
> #4  0x000000000040eabf in do_sync_metadata (ctx=<optimized out>,
> metadata=<optimized out>) at consumer-stream.c:468
> #5  consumer_stream_sync_metadata (ctx=ctx@entry=0x640d10, session_id=8)
> at consumer-stream.c:545
> #6  0x000000000041d316 in lttng_ustconsumer_read_subbuffer
> (stream=stream@entry=0x7f1698018fa0, ctx=ctx@entry=0x640d10) at
> ust-consumer.c:2063
> #7  0x000000000040b613 in lttng_consumer_read_subbuffer
> (stream=0x7f1698018fa0, ctx=0x640d10) at consumer.c:3203
> #8  0x0000000000409608 in consumer_thread_data_poll (data=0x640d10) at
> consumer.c:2557
> #9  0x00007f16af6a6fe3 in start_thread (arg=0x7f16acca7700) at
> pthread_create.c:312
> #10 0x00007f16af3dcafd in clone () at
> ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
>
> and
>
> consumer_thread_metadata_poll() calling lttng_consumer_read_subbuffer() /
> on_buffer_ready() calling pthread_mutex_lock(stream->lock) and then hanging
>
>
> This again can be reproduced by the same script, just running consecutive
> sessions of short duration (<10 seconds) with traces getting captured on a
> remote node.
> I hit this every time I run the script.
>
> Regards,
> Aravind.
>
>
>
> On Fri, Dec 11, 2015 at 11:33 PM, Jonathan Rajotte Julien <
> Jonathan.rajotte-julien@efficios.com> wrote:
>
>> Hi Aravind,
>>
>> On 2015-12-11 12:16 PM, Aravind HT wrote:
>>
>>> Hi,
>>>
>>> Did you get a chance to reproduce the problem with the scripts ? Let me
>>> know if you need any help running it.
>>>
>>
>> Time is a scarce resource unfortunately. But until I have the time to
>> look at it you might want to investigate if you are able to reproduce it on
>> 2.6.1/2.7.0/master on 2 regular (x86_64, ubuntu/debian) VM this would
>> really help.
>>
>> Cheers!
>>
>>
>>> Regards,
>>> Aravind.
>>>
>>>
>>> On Wed, Dec 9, 2015 at 3:04 PM, Aravind HT <aravind.ht@gmail.com
>>> <mailto:aravind.ht@gmail.com>> wrote:
>>>
>>>     Sorry about that, not sure how it got missed.
>>>     Here it is.
>>>
>>>
>>>     # This needs a two node set up (1. local current node 2. remote node
>>> )
>>>     # relayd runs on the current node where traces are captured from
>>>     the remote node
>>>     # remote node runs test applications which generate traces.
>>>     # the launch_script_RN is executed on the current node and uses
>>>     ssh  to execute commands on the remote node. So this part may not
>>>     work in every case and may prompt for a password.
>>>     # if experiencing problems with ssh , kindly check
>>>
>>> http://serverfault.com/questions/241588/how-to-automate-ssh-login-with-password
>>>
>>>     # ====================       To Run  =============================
>>>     launch_script_RN.py self_profile -c /tmp/configFile.txt
>>>
>>>
>>>
>>>     # configFile.txt is the file which has configuration params that
>>>     launchScript
>>>     # needs to configure lttng sessions. Below is an explanation of
>>>     the different options.
>>>     # ===================     configFile.txt
>>> =============================
>>>
>>>     [section1]
>>>     # final out put file path
>>>     OutputFile = /tmp/Final_report.txt
>>>     # the remote node hostname on which test applications run and the
>>>     test sessions will be created; this should be something that could
>>>     be used with ssh. Traces will be transported from this node to the
>>>     lttng_relayd running on the current node.
>>>     Node = MY_REMOTE_NODE
>>>     # Sub buffer size to start this with
>>>     SubBufSize = 16k
>>>     # Sub buffer count
>>>     SubBufCount = 4
>>>     # per uid buffer
>>>     BufferScheme = --buffers-uid
>>>     # yes
>>>     EnableTracing = yes
>>>     # Bit rate of the test applications. Comman seperated example "1,
>>>     3, 3, 50" sayss 4 test applications producing 1, 3, 3, and 50 Kb/s
>>>     traces.
>>>     # So with the below, we just start with 1 test application
>>>     producing 10 kb/s
>>>     TestApps = 10
>>>     # session life time in seonds
>>>     TestTime = 10
>>>     # Max number of successive sessions to configure. if n then n-1
>>>     sessions are run, ex MaxRun = 2 will run 1 session.
>>>     MaxRun = 100
>>>
>>>
>>>     # ====================  Place the following files under
>>>     ===============
>>>
>>>     # /tmp on the remote node
>>>     clean_RemNode_apps.sh
>>>     report_lttng_script.sh
>>>
>>>     # rest of the scripts under /usr/sbin on the current local node on
>>>     which lttng_realyd runs
>>>     # Define a trace point MY_TRACE to take a single string arg with
>>>     LOG_TEST_APP_PROFILING as the provider, compile test lttng_testApp
>>>     and place it under /usr/sbin of the remote host
>>>
>>>     # in launch_script_RN.py change currentNodeIP to the IP address on
>>>     which relayd is receiving, default ports are used.
>>>
>>>     # lttng_relayd is started as
>>>     /usr/bin/lttng-relayd -o /var/log/lttng-traces -d
>>>
>>>     # lttng_sessiond is started as
>>>     /usr/bin/lttng-sessiond --consumerd32-path
>>>     /usr/lib/lttng/libexec/lttng-consumerd --consumerd32-libdir
>>>     /usr/lib/ --consumerd64-path
>>>     /usr/lib64/lttng/libexec/lttng-consumerd --consumerd64-libdir
>>>     /usr/lib64/ -b --no-kernel
>>>
>>>
>>>
>>>
>>>     Regards,
>>>     Aravind.
>>>
>>>     On Wed, Dec 9, 2015 at 12:57 AM, Jonathan Rajotte Julien
>>>     <Jonathan.rajotte-julien@efficios.com
>>>     <mailto:Jonathan.rajotte-julien@efficios.com>> wrote:
>>>
>>>         Hi Aravind,
>>>
>>>         There is no README in the archive you sent.
>>>
>>>         Cheers
>>>
>>>         On 2015-12-08 07:51 AM, Aravind HT wrote:
>>>
>>>             Hi,
>>>
>>>             I am trying to upgrade in parallel, but this issue may
>>>             still be present after I upgrade or may be temporarily
>>>             masked. So I need to find the root cause for this and then
>>>             see if its available on the latest before committing to
>>>             upgrade.
>>>
>>>             There is another issue i'm hitting, the lttng list command
>>>             hangs after lttng destroy session when running the profiling.
>>>
>>>             I found that consumerd 64 goes into an infinite loop
>>>             waiting to flush metadata in
>>>             lttng_ustconsumer_recv_metadata() :: while
>>>             (consumer_metadata_cache_flushed(channel, offset + len,
>>>             timer)) .
>>>             In consumer_metadata_cache,
>>>             channel->metadata_stream->endpoint_status is
>>>             CONSUMER_ENDPOINT_ACTIVE,
>>>             metadata_stream->ust_metadata_pushed is 0 with offset
>>>             having some value. This call always returns a 1 from the
>>>             last else{} block resulting in an infinite loop. Upon
>>>             searching the forum, I found the same issue being reported
>>>             here :
>>>
>>> https://www.mail-archive.com/lttng-dev@lists.lttng.org/msg07982.html
>>>
>>>             Regards,
>>>             Aravind.
>>>
>>>
>>>             On Tue, Dec 8, 2015 at 12:43 AM, Jonathan Rajotte
>>>             <jonathan.r.julien@gmail.com
>>>             <mailto:jonathan.r.julien@gmail.com>> wrote:
>>>
>>>                 Hi Aravind,
>>>
>>>                 Did you have the chance to upgrade to 2.6.1.If so
>>>             where you able
>>>                 to reproduce?
>>>
>>>                 Cheers
>>>
>>>                 On Mon, Dec 7, 2015 at 1:07 PM, Aravind HT
>>>             <aravind.ht@gmail.com <mailto:aravind.ht@gmail.com>>
>>>                 wrote:
>>>
>>>                     Hi,
>>>
>>>                     I have attached the complete profiling scripts
>>>             here, its a bit
>>>                     shabby, im new to python.
>>>
>>>                     There is a README which has the details on how to
>>>             execute it.
>>>                     Im using a Yocto 1.6 on x86_64 platforms on both
>>>             the nodes.
>>>
>>>
>>>                     Running this script when there are other sessions
>>>             running
>>>                     seems to reproduce this problem easily.
>>>                     Please try it and let me know if you have any issues
>>>                     reproducing the problem.
>>>
>>>                     Regards,
>>>                     Aravind.
>>>
>>>                     On Sat, Dec 5, 2015 at 5:23 PM, Jérémie Galarneau
>>>                     <jeremie.galarneau@efficios.com
>>>             <mailto:jeremie.galarneau@efficios.com>
>>>                     <mailto:jeremie.galarneau@efficios.com
>>>             <mailto:jeremie.galarneau@efficios.com>>> wrote:
>>>
>>>                         On Fri, Dec 4, 2015 at 11:06 PM, Aravind HT
>>>                         <aravind.ht@gmail.com
>>>             <mailto:aravind.ht@gmail.com> <mailto:aravind.ht@gmail.com
>>>
>>>             <mailto:aravind.ht@gmail.com>>> wrote:
>>>                         > I am using 2.6.0 .I will try to share the
>>>             code that I'm
>>>                         using here in some
>>>                         > time. If there are any specific fixes that
>>>             are relevant
>>>                         to this issue, see
>>>                         > if you can provide a link to them. I would
>>>             ideally like
>>>                         to try them out
>>>                         > before trying a full upgrade to the latest
>>>             versions.
>>>
>>>                         Hi,
>>>
>>>                         Can you provide more information on the
>>>             system? Which
>>>                         distribution,
>>>                         architecture, kernel version?
>>>
>>>                         The verbose sessiond logs might help pinpoint any
>>>                         unexpected behaviour
>>>                         here (are all applications registering as
>>>             expected?).
>>>
>>>                         Jérémie
>>>
>>>                         >
>>>                         > On Fri, Dec 4, 2015 at 6:11 PM, Jérémie
>>>             Galarneau
>>>                         > <jeremie.galarneau@efficios.com
>>>             <mailto:jeremie.galarneau@efficios.com>
>>>                         <mailto:jeremie.galarneau@efficios.com
>>>             <mailto:jeremie.galarneau@efficios.com>>> wrote:
>>>                         >>
>>>                         >> Hi Aravind,
>>>                         >>
>>>                         >> Can't say I have looked at everything you
>>>             sent yet, but
>>>                         as a
>>>                         >> preemptive question, which version are we
>>>             talking about
>>>                         here? 2.6.0 or
>>>                         >> 2.6.1? 2.6.1 contains a lot of relay daemon
>>>             fixes.
>>>                         >>
>>>                         >> Thanks,
>>>                         >> Jérémie
>>>                         >>
>>>                         >> On Thu, Dec 3, 2015 at 7:01 AM, Aravind HT
>>>                         <aravind.ht@gmail.com
>>>             <mailto:aravind.ht@gmail.com>> wrote:
>>>                         >> > Hi,
>>>                         >> >
>>>                         >> > I am trying to obtain the performance
>>>             characteristics
>>>                         of lttng with the
>>>                         >> > use
>>>                         >> > of test applications. Traces are being
>>>             produced on a
>>>                         local node and
>>>                         >> > delivered to relayd that is running on a
>>>             separate
>>>                         node for storage.
>>>                         >> >
>>>                         >> > An lttng session with the test
>>>             applications producing
>>>                         an initial bit
>>>                         >> > rate of
>>>                         >> > 10 kb/s is started and run for about 30
>>>             seconds. The
>>>                         starting sub-buffer
>>>                         >> > size is kept at 128 kb and sub-buf count
>>>             at 4. The
>>>                         session is then
>>>                         >> > stopped
>>>                         >> > and destroyed and traces are analyzed to
>>>             see if there
>>>                         are any drops.
>>>                         >> > This is
>>>                         >> > being done in a loop with every
>>>             subsequent session
>>>                         having an increment
>>>                         >> > of 2
>>>                         >> > kb/s as long as there are no drops. If
>>>             there are
>>>                         drops, I increase the
>>>                         >> > buffer size by a factor of x2 without
>>>             incrementing
>>>                         the bit rate.
>>>                         >> >
>>>                         >> > I see trace drops happening consistently
>>>             with test
>>>                         apps producing traces
>>>                         >> > at
>>>                         >> > less than 40 kb/s, it doesnt seem to help
>>>             even if I
>>>                         started with 1mb x 4
>>>                         >> > sub-buffers.
>>>                         >> >
>>>                         >> > Analysis :
>>>                         >> >
>>>                         >> > I have attached the lttng_relayd ,
>>>             lttng_consumerd_64
>>>                         logs and the
>>>                         >> > entire
>>>                         >> > trace directory, hope you will be able to
>>>             view it.
>>>                         >> > I have modified lttng_relayd code to dump
>>>             the traces
>>>                         being captured in
>>>                         >> > the
>>>                         >> > lttng_relayd logs along with debug info.
>>>                         >> >
>>>                         >> > Each test app is producing logs in the
>>>             form of  :
>>>                         >> > "TraceApp PID - 31940 THID - 31970
>>>             @threadRate - 1032
>>>                         b/s appRate - 2079
>>>                         >> > b/s
>>>                         >> > threadTraceNum - 9 appTraceNum - 18
>>>      sleepTime - 192120"
>>>                         >> >
>>>                         >> > The test application PID, test
>>>             application thread id,
>>>                         thread bit rate,
>>>                         >> > test
>>>                         >> > app bit rate, thread trace number and
>>>             application
>>>                         trace number s are
>>>                         >> > part of
>>>                         >> > the trace. So in the above trace, the
>>>             thread is
>>>                         producing at 1 kb/s and
>>>                         >> > the
>>>                         >> > whole test app is producing at 2 kb/s.
>>>                         >> >
>>>                         >> > If we look at the babeltrace out put, we
>>>             see that the
>>>                         Trace with
>>>                         >> > TraceApp
>>>                         >> > PID - 31940 appTraceNum 2 is missing ,
>>>             with 1, 3, 4,
>>>                         5 and so on being
>>>                         >> > successfully captured.
>>>                         >> > I looked at the lttng_relayd logs and
>>>             found that
>>>                         trace of "appTraceNum
>>>                         >> > 2" is
>>>                         >> > not delivered/generated by the consumerd
>>>             to the
>>>                         relayd in sequence with
>>>                         >> > other traces. To rule out that this is
>>>             not a test
>>>                         application problem,
>>>                         >> > you
>>>                         >> > can look at line ltttng_relayd log :
>>>             12778 and see
>>>                         traces from
>>>                         >> > appTraceNum -
>>>                         >> > 1 to appTraceNum - 18 including the
>>>             appTraceNum 2 are
>>>                         "re-delivered" by
>>>                         >> > the
>>>                         >> > consumerd to the relayd.
>>>                         >> > Essentially, I see appTraceNum 1 through
>>>             appTraceNum
>>>                         18 being delivered
>>>                         >> > twice, once individually where
>>>             appTraceNum 2 is
>>>                         missing and once as a
>>>                         >> > group
>>>                         >> > at line 12778 where its present.
>>>                         >> >
>>>                         >> >
>>>                         >> > Request help with
>>>                         >> > 1. why traces are delivered twice, is it
>>>             by design or
>>>                         a genuine problem
>>>                         >> > ?
>>>                         >> > 2. how to avoid traces being dropped even
>>>             though
>>>                         buffers are
>>>                         >> > sufficiently
>>>                         >> > large enough ?
>>>                         >> >
>>>                         >> >
>>>                         >> > Regards,
>>>                         >> > Aravind.
>>>                         >> >
>>>                         >> >
>>>             _______________________________________________
>>>                         >> > lttng-dev mailing list
>>>                         >> > lttng-dev@lists.lttng.org
>>>             <mailto:lttng-dev@lists.lttng.org>
>>>                         >> >
>>>             http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev
>>>                         >> >
>>>                         >>
>>>                         >>
>>>                         >>
>>>                         >> --
>>>                         >> Jérémie Galarneau
>>>                         >> EfficiOS Inc.
>>>                         >> http://www.efficios.com
>>>                         >
>>>                         >
>>>
>>>
>>>
>>>                         --
>>>                         Jérémie Galarneau
>>>                         EfficiOS Inc.
>>>             http://www.efficios.com
>>>
>>>
>>>
>>>             _______________________________________________
>>>                     lttng-dev mailing list
>>>             lttng-dev@lists.lttng.org
>>>             <mailto:lttng-dev@lists.lttng.org>
>>>             <mailto:lttng-dev@lists.lttng.org
>>>             <mailto:lttng-dev@lists.lttng.org>>
>>>             http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev
>>>
>>>
>>>
>>>
>>>                 --     Jonathan Rajotte Julien
>>>
>>>
>>>
>>>
>>>             _______________________________________________
>>>             lttng-dev mailing list
>>>             lttng-dev@lists.lttng.org <mailto:lttng-dev@lists.lttng.org>
>>>             http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev
>>>
>>>
>>>         --         Jonathan R. Julien
>>>         Efficios
>>>
>>>
>>>
>>>
>> --
>> Jonathan R. Julien
>> Efficios
>>
>>
>
> _______________________________________________
> lttng-dev mailing list
> lttng-dev@lists.lttng.org
> http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev
>
>
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com
>

[-- Attachment #1.2: Type: text/html, Size: 31675 bytes --]

[-- Attachment #2: Type: text/plain, Size: 155 bytes --]

_______________________________________________
lttng-dev mailing list
lttng-dev@lists.lttng.org
http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Relayd trace drops
       [not found]                         ` <CACd+_b7X3+Hzi5HYt3KggWMCVNPyhyDb4aK3-Pik98wcb70kCQ@mail.gmail.com>
@ 2015-12-16 14:18                           ` Mathieu Desnoyers
       [not found]                           ` <654480294.257124.1450275539752.JavaMail.zimbra@efficios.com>
  1 sibling, 0 replies; 18+ messages in thread
From: Mathieu Desnoyers @ 2015-12-16 14:18 UTC (permalink / raw)
  To: Aravind HT; +Cc: lttng-dev


[-- Attachment #1.1: Type: text/plain, Size: 17464 bytes --]

----- On Dec 16, 2015, at 8:02 AM, Aravind HT <aravind.ht@gmail.com> wrote: 

> Hi Mathieu,
> Ive been able to analyze this sessiond - consumerd hang scenario and I see that
> the fix is not available on 2.6.1 or on 2.7.0.

Did you actually reproduce the issue with 2.6.1 or 2.7.0? Can we get the 
associated backtrace reproduced on those versions? 

Thanks, 

Mathieu 

> Root Cause :

> The hang is mainly caused because of a race condition for the
> consumer_metadata_socket between the following two flows of code
> Flow 1. do_sync_metadata() -> lttng_ustconsumer_sync_metadat() ->
> lttng_ustconsumer_request_metadata() -> lttncomm_recv_unix_sock() -> recvmsg()

> and

> Flow 2. lttng_consumer_recv_cmd() -> lttng_ust_consumer_recv_cmd() ->
> lttng_ustconsumer_recv_metadata()

> Flow 1 above initiates a metadata requests and recvmsg()s for the reply from
> sessiond on the consumer_metadata_socket ( it has taken the
> metadata_socket_lock and the stream->lock ) , but simultaneously this socket is
> being polled from Flow 2. Once in a while Flow 2 gets to receive while Flow 1
> just hangs on recvmsg() . Since Flow 1 has taken the stream->lock, this then
> introduces a deadlock at a later stage (on_buffer_ready() when the metadata is
> being flushed. )

> Think using the metadata_socket_lock in Flow 2 before attempting to receive may
> address this issue, but is a bit complicated as the message would have already
> been received at the point of using that lock. The other alternative is to
> remove the complete recv() from Flow 1 and let that be taken care by Flow 2.

> Please let me know if this has been fixed in the later builds, may be I searched
> it wrong.

> Regards,
> Aravind.

> On Sat, Dec 12, 2015 at 4:13 AM, Mathieu Desnoyers <
> mathieu.desnoyers@efficios.com > wrote:

>> Hi,

>> Please upgrade to 2.6.1 or 2.7.0, those are old issues.

>> Thanks,

>> Mathieu

>> ----- On Dec 11, 2015, at 1:42 PM, Aravind HT < aravind.ht@gmail.com > wrote:

>>> NPBS understand it, I will continue to investigate this, got a bit side tracked
>>> with the sessiond hang issue that I mentioned earlier.
>>> Making some progress on that, will update as soon as I find the root cause, it
>>> looks like a mutex deadlock between
>>> the following threads in the consumerd daemon

>>> #1 0x0000000000410b5d in lttcomm_recv_unix_sock (sock=26,
>>> buf=buf@entry=0x7f16acca57f0, len=len@entry=4476) at unix.c:175
>>> #2 0x000000000041dc96 in lttng_ustconsumer_request_metadata
>>> (ctx=ctx@entry=0x640d10, channel=0x7f16980192b0, timer=timer@entry=0,
>>> wait=wait@entry=0) at ust-consumer.c:2331
>>> #3 0x000000000041ff3c in lttng_ustconsumer_sync_metadata
>>> (ctx=ctx@entry=0x640d10, metadata=metadata@entry=0x7f169801aa40) at
>>> ust-consumer.c:1829
>>> #4 0x000000000040eabf in do_sync_metadata (ctx=<optimized out>,
>>> metadata=<optimized out>) at consumer-stream.c:468
>>> #5 consumer_stream_sync_metadata (ctx=ctx@entry=0x640d10, session_id=8) at
>>> consumer-stream.c:545
>>> #6 0x000000000041d316 in lttng_ustconsumer_read_subbuffer
>>> (stream=stream@entry=0x7f1698018fa0, ctx=ctx@entry=0x640d10) at
>>> ust-consumer.c:2063
>>> #7 0x000000000040b613 in lttng_consumer_read_subbuffer (stream=0x7f1698018fa0,
>>> ctx=0x640d10) at consumer.c:3203
>>> #8 0x0000000000409608 in consumer_thread_data_poll (data=0x640d10) at
>>> consumer.c:2557
>>> #9 0x00007f16af6a6fe3 in start_thread (arg=0x7f16acca7700) at
>>> pthread_create.c:312
>>> #10 0x00007f16af3dcafd in clone () at
>>> ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

>>> and

>>> consumer_thread_metadata_poll() calling lttng_consumer_read_subbuffer() /
>>> on_buffer_ready() calling pthread_mutex_lock(stream->lock) and then hanging

>>> This again can be reproduced by the same script, just running consecutive
>>> sessions of short duration (<10 seconds) with traces getting captured on a
>>> remote node.
>>> I hit this every time I run the script.

>>> Regards,
>>> Aravind.

>>> On Fri, Dec 11, 2015 at 11:33 PM, Jonathan Rajotte Julien <
>>> Jonathan.rajotte-julien@efficios.com > wrote:

>>>> Hi Aravind,

>>>> On 2015-12-11 12:16 PM, Aravind HT wrote:

>>>>> Hi,

>>>>> Did you get a chance to reproduce the problem with the scripts ? Let me know if
>>>>> you need any help running it.

>>>> Time is a scarce resource unfortunately. But until I have the time to look at it
>>>> you might want to investigate if you are able to reproduce it on
>>>> 2.6.1/2.7.0/master on 2 regular (x86_64, ubuntu/debian) VM this would really
>>>> help.

>>>> Cheers!

>>>>> Regards,
>>>>> Aravind.

>>>>> On Wed, Dec 9, 2015 at 3:04 PM, Aravind HT < aravind.ht@gmail.com <mailto:
>>>>> aravind.ht@gmail.com >> wrote:

>>>>> Sorry about that, not sure how it got missed.
>>>>> Here it is.

>>>>> # This needs a two node set up (1. local current node 2. remote node )
>>>>> # relayd runs on the current node where traces are captured from
>>>>> the remote node
>>>>> # remote node runs test applications which generate traces.
>>>>> # the launch_script_RN is executed on the current node and uses
>>>>> ssh to execute commands on the remote node. So this part may not
>>>>> work in every case and may prompt for a password.
>>>>> # if experiencing problems with ssh , kindly check
>>>>> http://serverfault.com/questions/241588/how-to-automate-ssh-login-with-password

>>>>> # ==================== To Run =============================
>>>>> launch_script_RN.py self_profile -c /tmp/configFile.txt

>>>>> # configFile.txt is the file which has configuration params that
>>>>> launchScript
>>>>> # needs to configure lttng sessions. Below is an explanation of
>>>>> the different options.
>>>>> # =================== configFile.txt =============================

>>>>> [section1]
>>>>> # final out put file path
>>>>> OutputFile = /tmp/Final_report.txt
>>>>> # the remote node hostname on which test applications run and the
>>>>> test sessions will be created; this should be something that could
>>>>> be used with ssh. Traces will be transported from this node to the
>>>>> lttng_relayd running on the current node.
>>>>> Node = MY_REMOTE_NODE
>>>>> # Sub buffer size to start this with
>>>>> SubBufSize = 16k
>>>>> # Sub buffer count
>>>>> SubBufCount = 4
>>>>> # per uid buffer
>>>>> BufferScheme = --buffers-uid
>>>>> # yes
>>>>> EnableTracing = yes
>>>>> # Bit rate of the test applications. Comman seperated example "1,
>>>>> 3, 3, 50" sayss 4 test applications producing 1, 3, 3, and 50 Kb/s
>>>>> traces.
>>>>> # So with the below, we just start with 1 test application
>>>>> producing 10 kb/s
>>>>> TestApps = 10
>>>>> # session life time in seonds
>>>>> TestTime = 10
>>>>> # Max number of successive sessions to configure. if n then n-1
>>>>> sessions are run, ex MaxRun = 2 will run 1 session.
>>>>> MaxRun = 100

>>>>> # ==================== Place the following files under
>>>>> ===============

>>>>> # /tmp on the remote node
>>>>> clean_RemNode_apps.sh
>>>>> report_lttng_script.sh

>>>>> # rest of the scripts under /usr/sbin on the current local node on
>>>>> which lttng_realyd runs
>>>>> # Define a trace point MY_TRACE to take a single string arg with
>>>>> LOG_TEST_APP_PROFILING as the provider, compile test lttng_testApp
>>>>> and place it under /usr/sbin of the remote host

>>>>> # in launch_script_RN.py change currentNodeIP to the IP address on
>>>>> which relayd is receiving, default ports are used.

>>>>> # lttng_relayd is started as
>>>>> /usr/bin/lttng-relayd -o /var/log/lttng-traces -d

>>>>> # lttng_sessiond is started as
>>>>> /usr/bin/lttng-sessiond --consumerd32-path
>>>>> /usr/lib/lttng/libexec/lttng-consumerd --consumerd32-libdir
>>>>> /usr/lib/ --consumerd64-path
>>>>> /usr/lib64/lttng/libexec/lttng-consumerd --consumerd64-libdir
>>>>> /usr/lib64/ -b --no-kernel

>>>>> Regards,
>>>>> Aravind.

>>>>> On Wed, Dec 9, 2015 at 12:57 AM, Jonathan Rajotte Julien
>>>>> < Jonathan.rajotte-julien@efficios.com
>>>>> <mailto: Jonathan.rajotte-julien@efficios.com >> wrote:

>>>>> Hi Aravind,

>>>>> There is no README in the archive you sent.

>>>>> Cheers

>>>>> On 2015-12-08 07:51 AM, Aravind HT wrote:

>>>>> Hi,

>>>>> I am trying to upgrade in parallel, but this issue may
>>>>> still be present after I upgrade or may be temporarily
>>>>> masked. So I need to find the root cause for this and then
>>>>> see if its available on the latest before committing to
>>>>> upgrade.

>>>>> There is another issue i'm hitting, the lttng list command
>>>>> hangs after lttng destroy session when running the profiling.

>>>>> I found that consumerd 64 goes into an infinite loop
>>>>> waiting to flush metadata in
>>>>> lttng_ustconsumer_recv_metadata() :: while
>>>>> (consumer_metadata_cache_flushed(channel, offset + len,
>>>>> timer)) .
>>>>> In consumer_metadata_cache,
>>>>> channel->metadata_stream->endpoint_status is
>>>>> CONSUMER_ENDPOINT_ACTIVE,
>>>>> metadata_stream->ust_metadata_pushed is 0 with offset
>>>>> having some value. This call always returns a 1 from the
>>>>> last else{} block resulting in an infinite loop. Upon
>>>>> searching the forum, I found the same issue being reported
>>>>> here :
>>>>> https://www.mail-archive.com/lttng-dev@lists.lttng.org/msg07982.html

>>>>> Regards,
>>>>> Aravind.

>>>>> On Tue, Dec 8, 2015 at 12:43 AM, Jonathan Rajotte
>>>>> < jonathan.r.julien@gmail.com
>>>>> <mailto: jonathan.r.julien@gmail.com >> wrote:

>>>>> Hi Aravind,

>>>>> Did you have the chance to upgrade to 2.6.1.If so
>>>>> where you able
>>>>> to reproduce?

>>>>> Cheers

>>>>> On Mon, Dec 7, 2015 at 1:07 PM, Aravind HT
>>>>> < aravind.ht@gmail.com <mailto: aravind.ht@gmail.com >>
>>>>> wrote:

>>>>> Hi,

>>>>> I have attached the complete profiling scripts
>>>>> here, its a bit
>>>>> shabby, im new to python.

>>>>> There is a README which has the details on how to
>>>>> execute it.
>>>>> Im using a Yocto 1.6 on x86_64 platforms on both
>>>>> the nodes.

>>>>> Running this script when there are other sessions
>>>>> running
>>>>> seems to reproduce this problem easily.
>>>>> Please try it and let me know if you have any issues
>>>>> reproducing the problem.

>>>>> Regards,
>>>>> Aravind.

>>>>> On Sat, Dec 5, 2015 at 5:23 PM, Jérémie Galarneau
>>>>> < jeremie.galarneau@efficios.com
>>>>> <mailto: jeremie.galarneau@efficios.com >
>>>>> <mailto: jeremie.galarneau@efficios.com
>>>>> <mailto: jeremie.galarneau@efficios.com >>> wrote:

>>>>> On Fri, Dec 4, 2015 at 11:06 PM, Aravind HT
>>>>> < aravind.ht@gmail.com
>>>>> <mailto: aravind.ht@gmail.com > <mailto: aravind.ht@gmail.com

>>>>> <mailto: aravind.ht@gmail.com >>> wrote:
>>>>> > I am using 2.6.0 .I will try to share the
>>>>> code that I'm
>>>>> using here in some
>>>>> > time. If there are any specific fixes that
>>>>> are relevant
>>>>> to this issue, see
>>>>> > if you can provide a link to them. I would
>>>>> ideally like
>>>>> to try them out
>>>>> > before trying a full upgrade to the latest
>>>>> versions.

>>>>> Hi,

>>>>> Can you provide more information on the
>>>>> system? Which
>>>>> distribution,
>>>>> architecture, kernel version?

>>>>> The verbose sessiond logs might help pinpoint any
>>>>> unexpected behaviour
>>>>> here (are all applications registering as
>>>>> expected?).

>>>>> Jérémie


>>>>> > On Fri, Dec 4, 2015 at 6:11 PM, Jérémie
>>>>> Galarneau
>>>>> > < jeremie.galarneau@efficios.com
>>>>> <mailto: jeremie.galarneau@efficios.com >
>>>>> <mailto: jeremie.galarneau@efficios.com
>>>>> <mailto: jeremie.galarneau@efficios.com >>> wrote:

>>>>> >> Hi Aravind,

>>>>> >> Can't say I have looked at everything you
>>>>> sent yet, but
>>>>> as a
>>>>> >> preemptive question, which version are we
>>>>> talking about
>>>>> here? 2.6.0 or
>>>>> >> 2.6.1? 2.6.1 contains a lot of relay daemon
>>>>> fixes.

>>>>> >> Thanks,
>>>>> >> Jérémie

>>>>> >> On Thu, Dec 3, 2015 at 7:01 AM, Aravind HT
>>>>> < aravind.ht@gmail.com
>>>>> <mailto: aravind.ht@gmail.com >> wrote:
>>>>> >> > Hi,

>>>>> >> > I am trying to obtain the performance
>>>>> characteristics
>>>>> of lttng with the
>>>>> >> > use
>>>>> >> > of test applications. Traces are being
>>>>> produced on a
>>>>> local node and
>>>>> >> > delivered to relayd that is running on a
>>>>> separate
>>>>> node for storage.

>>>>> >> > An lttng session with the test
>>>>> applications producing
>>>>> an initial bit
>>>>> >> > rate of
>>>>> >> > 10 kb/s is started and run for about 30
>>>>> seconds. The
>>>>> starting sub-buffer
>>>>> >> > size is kept at 128 kb and sub-buf count
>>>>> at 4. The
>>>>> session is then
>>>>> >> > stopped
>>>>> >> > and destroyed and traces are analyzed to
>>>>> see if there
>>>>> are any drops.
>>>>> >> > This is
>>>>> >> > being done in a loop with every
>>>>> subsequent session
>>>>> having an increment
>>>>> >> > of 2
>>>>> >> > kb/s as long as there are no drops. If
>>>>> there are
>>>>> drops, I increase the
>>>>> >> > buffer size by a factor of x2 without
>>>>> incrementing
>>>>> the bit rate.

>>>>> >> > I see trace drops happening consistently
>>>>> with test
>>>>> apps producing traces
>>>>> >> > at
>>>>> >> > less than 40 kb/s, it doesnt seem to help
>>>>> even if I
>>>>> started with 1mb x 4
>>>>> >> > sub-buffers.

>>>>> >> > Analysis :

>>>>> >> > I have attached the lttng_relayd ,
>>>>> lttng_consumerd_64
>>>>> logs and the
>>>>> >> > entire
>>>>> >> > trace directory, hope you will be able to
>>>>> view it.
>>>>> >> > I have modified lttng_relayd code to dump
>>>>> the traces
>>>>> being captured in
>>>>> >> > the
>>>>> >> > lttng_relayd logs along with debug info.

>>>>> >> > Each test app is producing logs in the
>>>>> form of :
>>>>> >> > "TraceApp PID - 31940 THID - 31970
>>>>> @threadRate - 1032
>>>>> b/s appRate - 2079
>>>>> >> > b/s
>>>>> >> > threadTraceNum - 9 appTraceNum - 18 sleepTime - 192120"

>>>>> >> > The test application PID, test
>>>>> application thread id,
>>>>> thread bit rate,
>>>>> >> > test
>>>>> >> > app bit rate, thread trace number and
>>>>> application
>>>>> trace number s are
>>>>> >> > part of
>>>>> >> > the trace. So in the above trace, the
>>>>> thread is
>>>>> producing at 1 kb/s and
>>>>> >> > the
>>>>> >> > whole test app is producing at 2 kb/s.

>>>>> >> > If we look at the babeltrace out put, we
>>>>> see that the
>>>>> Trace with
>>>>> >> > TraceApp
>>>>> >> > PID - 31940 appTraceNum 2 is missing ,
>>>>> with 1, 3, 4,
>>>>> 5 and so on being
>>>>> >> > successfully captured.
>>>>> >> > I looked at the lttng_relayd logs and
>>>>> found that
>>>>> trace of "appTraceNum
>>>>> >> > 2" is
>>>>> >> > not delivered/generated by the consumerd
>>>>> to the
>>>>> relayd in sequence with
>>>>> >> > other traces. To rule out that this is
>>>>> not a test
>>>>> application problem,
>>>>> >> > you
>>>>> >> > can look at line ltttng_relayd log :
>>>>> 12778 and see
>>>>> traces from
>>>>> >> > appTraceNum -
>>>>> >> > 1 to appTraceNum - 18 including the
>>>>> appTraceNum 2 are
>>>>> "re-delivered" by
>>>>> >> > the
>>>>> >> > consumerd to the relayd.
>>>>> >> > Essentially, I see appTraceNum 1 through
>>>>> appTraceNum
>>>>> 18 being delivered
>>>>> >> > twice, once individually where
>>>>> appTraceNum 2 is
>>>>> missing and once as a
>>>>> >> > group
>>>>> >> > at line 12778 where its present.


>>>>> >> > Request help with
>>>>> >> > 1. why traces are delivered twice, is it
>>>>> by design or
>>>>> a genuine problem
>>>>> >> > ?
>>>>> >> > 2. how to avoid traces being dropped even
>>>>> though
>>>>> buffers are
>>>>> >> > sufficiently
>>>>> >> > large enough ?


>>>>> >> > Regards,
>>>>> >> > Aravind.


>>>>> _______________________________________________
>>>>> >> > lttng-dev mailing list
>>>>> >> > lttng-dev@lists.lttng.org
>>>>> <mailto: lttng-dev@lists.lttng.org >

>>>>> http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev




>>>>> >> --
>>>>> >> Jérémie Galarneau
>>>>> >> EfficiOS Inc.
>>>>> >> http://www.efficios.com



>>>>> --
>>>>> Jérémie Galarneau
>>>>> EfficiOS Inc.
>>>>> http://www.efficios.com

>>>>> _______________________________________________
>>>>> lttng-dev mailing list
>>>>> lttng-dev@lists.lttng.org
>>>>> <mailto: lttng-dev@lists.lttng.org >
>>>>> <mailto: lttng-dev@lists.lttng.org
>>>>> <mailto: lttng-dev@lists.lttng.org >>
>>>>> http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev

>>>>> -- Jonathan Rajotte Julien

>>>>> _______________________________________________
>>>>> lttng-dev mailing list
>>>>> lttng-dev@lists.lttng.org <mailto: lttng-dev@lists.lttng.org >
>>>>> http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev

>>>>> -- Jonathan R. Julien
>>>>> Efficios

>>>> --
>>>> Jonathan R. Julien
>>>> Efficios

>>> _______________________________________________
>>> lttng-dev mailing list
>>> lttng-dev@lists.lttng.org
>>> http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev

>> --
>> Mathieu Desnoyers
>> EfficiOS Inc.
>> http://www.efficios.com

-- 
Mathieu Desnoyers 
EfficiOS Inc. 
http://www.efficios.com 

[-- Attachment #1.2: Type: text/html, Size: 44189 bytes --]

[-- Attachment #2: Type: text/plain, Size: 155 bytes --]

_______________________________________________
lttng-dev mailing list
lttng-dev@lists.lttng.org
http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Relayd trace drops
       [not found]                           ` <654480294.257124.1450275539752.JavaMail.zimbra@efficios.com>
@ 2015-12-28 17:24                             ` Aravind HT
       [not found]                             ` <CACd+_b5cQ__Gqynic40iXw8UBpdd4cU9H2R=_HEqm_22LxApKw@mail.gmail.com>
  1 sibling, 0 replies; 18+ messages in thread
From: Aravind HT @ 2015-12-28 17:24 UTC (permalink / raw)
  To: Mathieu Desnoyers; +Cc: lttng-dev


[-- Attachment #1.1: Type: text/plain, Size: 26984 bytes --]

Hi,

I've not had a chance to try it on 2.6.1 or 2.7.0 yet; I will try and let
you know.
However, I was able to come up with a fix for 2.6.0, and I do not see any
such changes in 2.6.1 or 2.7.0, so I am forwarding you the patch.


diff -Naur ./src/bin/lttng-sessiond/ust-consumer.c /tmp/LTTNG_BKP_28_12_1/bin/lttng-sessiond/ust-consumer.c
--- ./src/bin/lttng-sessiond/ust-consumer.c	2015-12-28 07:45:34.610430830 -0800
+++ /tmp/LTTNG_BKP_28_12_1/bin/lttng-sessiond/ust-consumer.c	2015-12-28 07:58:18.906918555 -0800
@@ -1,3 +1,4 @@
+
 /*
  * Copyright (C) 2011 - David Goulet <david.goulet@polymtl.ca>
  *
@@ -486,17 +487,48 @@
 		ust_reg = reg_pid->registry->reg.ust;
 	}
 	assert(ust_reg);
+	/* ust_reg->lock can also have been taken from push_metadata, making
+	 * this metadata request wait without sending a response to the
+	 * consumer, which gets stuck on the recv() call waiting for a response
+	 * from sessiond. This metadata request from the consumer would have
+	 * taken the stream->lock in do_sync_metadata(), making push_metadata
+	 * wait for the stream->lock to be released, which in turn makes the
+	 * sessiond thread pushing the metadata wait on the recv() for a
+	 * response from the consumer, creating a deadlock. So, to avoid this
+	 * deadlock, make the request-metadata flow fail in case sessiond is
+	 * already pushing metadata on its own. */
+	ret = pthread_mutex_trylock(&ust_reg->lock);
+	if (!ret) {
+		ret_push = ust_app_push_metadata(ust_reg, socket, 1);
+		pthread_mutex_unlock(&ust_reg->lock);
+		if (ret_push < 0) {
+			ERR("Error Pushing metadata");
+			/* A -EPIPE error would be returned if
+			 * appRegistry->metadata_closed was set; this could have
+			 * happened if ust_app_unregister() was called when there
+			 * was a problem with the sock (epoll()) from
+			 * thread_manage_apps(). That would have made the current
+			 * function return an error, upon which the whole
+			 * thread_manage_consumer would have been killed. To avoid
+			 * this, send an error back to consumerd and continue
+			 * business as usual. */
+			if (ret_push == -EPIPE) {
+				DBG("app metadata was closed, so sending ERR msg to Consumerd\n");
+				memset(&msg, 0, sizeof(msg));
+				msg.cmd_type = LTTNG_ERR_UND;
+				(void) consumer_send_msg(socket, &msg);
+				ret = 0;
+			} else {
+				ret = -1;
+			}
+		} else {
+			DBG("UST Consumer metadata pushed successfully");
+			ret = 0;
+		}
+
+	} else {
+		DBG("ust_reg->lock is taken, cannot push metadata for this request at present\n");
+		memset(&msg, 0, sizeof(msg));
+		msg.cmd_type = LTTNG_ERR_UND;
+		(void) consumer_send_msg(socket, &msg);
+		ret = 0;

-	pthread_mutex_lock(&ust_reg->lock);
-	ret_push = ust_app_push_metadata(ust_reg, socket, 1);
-	pthread_mutex_unlock(&ust_reg->lock);
-	if (ret_push < 0) {
-		ERR("Pushing metadata");
-		ret = -1;
-		goto end;
-	}
-	DBG("UST Consumer metadata pushed successfully");
-	ret = 0;
+	}

 end:
 	rcu_read_unlock();

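To make the intent of the patch easier to follow, here is a small self-contained
sketch of the same trylock-and-reply pattern. The names (registry_lock,
push_all_metadata, reply_error, handle_metadata_request) are invented for
illustration and are not LTTng symbols; this is only the locking shape, not the
actual fix.

/*
 * Minimal sketch, invented names, not LTTng source code. If the lock guarding
 * the metadata registry is already held by the push path, the request handler
 * replies with an error instead of blocking, so the two paths cannot end up
 * waiting on each other.
 */
#include <errno.h>
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t registry_lock = PTHREAD_MUTEX_INITIALIZER;

/* Stand-in for sending an LTTNG_ERR_UND-style reply back to the requester. */
static void reply_error(void)
{
	printf("registry busy, replying with an error instead of blocking\n");
}

/* Stand-in for the real metadata push; returns 0 on success. */
static int push_all_metadata(void)
{
	return 0;
}

/* Called when a metadata request arrives; never blocks on the registry. */
static int handle_metadata_request(void)
{
	int ret = pthread_mutex_trylock(&registry_lock);

	if (ret == EBUSY) {
		/* The push path already owns the lock; bail out politely. */
		reply_error();
		return 0;
	}
	if (ret != 0) {
		return -1;	/* Unexpected locking error. */
	}
	ret = push_all_metadata();
	pthread_mutex_unlock(&registry_lock);
	return ret;
}

int main(void)
{
	return handle_metadata_request();
}

Build with cc -pthread. The sketch only shows the non-blocking pattern; the
actual patch additionally distinguishes the -EPIPE case so that the consumer
management thread is not torn down.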

Regards,
Aravind.

On Wed, Dec 16, 2015 at 7:48 PM, Mathieu Desnoyers <
mathieu.desnoyers@efficios.com> wrote:

>
> ----- On Dec 16, 2015, at 8:02 AM, Aravind HT <aravind.ht@gmail.com>
> wrote:
>
> Hi Mathieu,
> Ive been able to analyze this sessiond - consumerd hang scenario and I see
> that the fix is not available on 2.6.1 or on 2.7.0.
>
>
> Did you actually reproduce the issue with 2.6.1 or 2.7.0 ? Can we get the
> associated backtrace reproduced on those versions ?
>
> Thanks,
>
> Mathieu
>
>
> Root Cause :
>
> The hang is mainly caused because of a race condition for the
> consumer_metadata_socket between the following two flows of code
> Flow 1. do_sync_metadata() -> lttng_ustconsumer_sync_metadat() ->
> lttng_ustconsumer_request_metadata() -> lttncomm_recv_unix_sock() ->
> recvmsg()
>
> and
>
> Flow 2. lttng_consumer_recv_cmd() -> lttng_ust_consumer_recv_cmd() ->
> lttng_ustconsumer_recv_metadata()
>
> Flow 1 above initiates a metadata requests and recvmsg()s for the reply
> from sessiond on the consumer_metadata_socket ( it has taken the
> metadata_socket_lock and the stream->lock ) , but simultaneously this
> socket is being polled from Flow 2. Once in a while Flow 2 gets to receive
> while Flow 1 just hangs on recvmsg() . Since Flow 1 has taken the
> stream->lock, this then introduces a deadlock at a later stage
> (on_buffer_ready() when the metadata is being flushed. )
>
> Think using the metadata_socket_lock in Flow 2 before attempting to
> receive may address this issue, but is a bit complicated as the message
> would have already been received at the point of using that lock. The other
> alternative is to remove the complete recv() from Flow 1 and let that be
> taken care by Flow 2.
>
> Please let me know if this has been fixed in the later builds, may be I
> searched it wrong.
>
>
> Regards,
> Aravind.
>
> On Sat, Dec 12, 2015 at 4:13 AM, Mathieu Desnoyers <
> mathieu.desnoyers@efficios.com> wrote:
>
>> Hi,
>>
>> Please upgrade to 2.6.1 or 2.7.0, those are old issues.
>>
>> Thanks,
>>
>> Mathieu
>>
>>
>> ----- On Dec 11, 2015, at 1:42 PM, Aravind HT <aravind.ht@gmail.com>
>> wrote:
>>
>>
>> NPBS understand it, I will continue to investigate this, got a bit side
>> tracked with the sessiond hang issue that I mentioned earlier.
>> Making some progress on that, will update as soon as I find the root
>> cause, it looks like a mutex deadlock between
>> the following threads in the consumerd daemon
>>
>> #1  0x0000000000410b5d in lttcomm_recv_unix_sock (sock=26, buf=buf@entry=0x7f16acca57f0,
>> len=len@entry=4476) at unix.c:175
>> #2  0x000000000041dc96 in lttng_ustconsumer_request_metadata
>> (ctx=ctx@entry=0x640d10, channel=0x7f16980192b0, timer=timer@entry=0,
>> wait=wait@entry=0) at ust-consumer.c:2331
>> #3  0x000000000041ff3c in lttng_ustconsumer_sync_metadata (ctx=ctx@entry=0x640d10,
>> metadata=metadata@entry=0x7f169801aa40) at ust-consumer.c:1829
>> #4  0x000000000040eabf in do_sync_metadata (ctx=<optimized out>,
>> metadata=<optimized out>) at consumer-stream.c:468
>> #5  consumer_stream_sync_metadata (ctx=ctx@entry=0x640d10, session_id=8)
>> at consumer-stream.c:545
>> #6  0x000000000041d316 in lttng_ustconsumer_read_subbuffer
>> (stream=stream@entry=0x7f1698018fa0, ctx=ctx@entry=0x640d10) at
>> ust-consumer.c:2063
>> #7  0x000000000040b613 in lttng_consumer_read_subbuffer
>> (stream=0x7f1698018fa0, ctx=0x640d10) at consumer.c:3203
>> #8  0x0000000000409608 in consumer_thread_data_poll (data=0x640d10) at
>> consumer.c:2557
>> #9  0x00007f16af6a6fe3 in start_thread (arg=0x7f16acca7700) at
>> pthread_create.c:312
>> #10 0x00007f16af3dcafd in clone () at
>> ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
>>
>> and
>>
>> consumer_thread_metadata_poll() calling lttng_consumer_read_subbuffer() /
>> on_buffer_ready() calling pthread_mutex_lock(stream->lock) and then hanging
>>
>>
>> This again can be reproduced by the same script, just running consecutive
>> sessions of short duration (<10 seconds) with traces getting captured on a
>> remote node.
>> I hit this every time I run the script.
>>
>> Regards,
>> Aravind.
>>
>>
>>
>> On Fri, Dec 11, 2015 at 11:33 PM, Jonathan Rajotte Julien <
>> Jonathan.rajotte-julien@efficios.com> wrote:
>>
>>> Hi Aravind,
>>>
>>> On 2015-12-11 12:16 PM, Aravind HT wrote:
>>>
>>>> Hi,
>>>>
>>>> Did you get a chance to reproduce the problem with the scripts ? Let me
>>>> know if you need any help running it.
>>>>
>>>
>>> Time is a scarce resource unfortunately. But until I have the time to
>>> look at it you might want to investigate if you are able to reproduce it on
>>> 2.6.1/2.7.0/master on 2 regular (x86_64, ubuntu/debian) VM this would
>>> really help.
>>>
>>> Cheers!
>>>
>>>
>>>> Regards,
>>>> Aravind.
>>>>
>>>>
>>>> On Wed, Dec 9, 2015 at 3:04 PM, Aravind HT <aravind.ht@gmail.com
>>>> <mailto:aravind.ht@gmail.com>> wrote:
>>>>
>>>>     Sorry about that, not sure how it got missed.
>>>>     Here it is.
>>>>
>>>>
>>>>     # This needs a two node set up (1. local current node 2. remote
>>>> node )
>>>>     # relayd runs on the current node where traces are captured from
>>>>     the remote node
>>>>     # remote node runs test applications which generate traces.
>>>>     # the launch_script_RN is executed on the current node and uses
>>>>     ssh  to execute commands on the remote node. So this part may not
>>>>     work in every case and may prompt for a password.
>>>>     # if experiencing problems with ssh , kindly check
>>>>
>>>> http://serverfault.com/questions/241588/how-to-automate-ssh-login-with-password
>>>>
>>>>     # ====================       To Run  =============================
>>>>     launch_script_RN.py self_profile -c /tmp/configFile.txt
>>>>
>>>>
>>>>
>>>>     # configFile.txt is the file which has configuration params that
>>>>     launchScript
>>>>     # needs to configure lttng sessions. Below is an explanation of
>>>>     the different options.
>>>>     # ===================     configFile.txt
>>>> =============================
>>>>
>>>>     [section1]
>>>>     # final out put file path
>>>>     OutputFile = /tmp/Final_report.txt
>>>>     # the remote node hostname on which test applications run and the
>>>>     test sessions will be created; this should be something that could
>>>>     be used with ssh. Traces will be transported from this node to the
>>>>     lttng_relayd running on the current node.
>>>>     Node = MY_REMOTE_NODE
>>>>     # Sub buffer size to start this with
>>>>     SubBufSize = 16k
>>>>     # Sub buffer count
>>>>     SubBufCount = 4
>>>>     # per uid buffer
>>>>     BufferScheme = --buffers-uid
>>>>     # yes
>>>>     EnableTracing = yes
>>>>     # Bit rate of the test applications. Comman seperated example "1,
>>>>     3, 3, 50" sayss 4 test applications producing 1, 3, 3, and 50 Kb/s
>>>>     traces.
>>>>     # So with the below, we just start with 1 test application
>>>>     producing 10 kb/s
>>>>     TestApps = 10
>>>>     # session life time in seonds
>>>>     TestTime = 10
>>>>     # Max number of successive sessions to configure. if n then n-1
>>>>     sessions are run, ex MaxRun = 2 will run 1 session.
>>>>     MaxRun = 100
>>>>
>>>>
>>>>     # ====================  Place the following files under
>>>>     ===============
>>>>
>>>>     # /tmp on the remote node
>>>>     clean_RemNode_apps.sh
>>>>     report_lttng_script.sh
>>>>
>>>>     # rest of the scripts under /usr/sbin on the current local node on
>>>>     which lttng_realyd runs
>>>>     # Define a trace point MY_TRACE to take a single string arg with
>>>>     LOG_TEST_APP_PROFILING as the provider, compile test lttng_testApp
>>>>     and place it under /usr/sbin of the remote host
>>>>
>>>>     # in launch_script_RN.py change currentNodeIP to the IP address on
>>>>     which relayd is receiving, default ports are used.
>>>>
>>>>     # lttng_relayd is started as
>>>>     /usr/bin/lttng-relayd -o /var/log/lttng-traces -d
>>>>
>>>>     # lttng_sessiond is started as
>>>>     /usr/bin/lttng-sessiond --consumerd32-path
>>>>     /usr/lib/lttng/libexec/lttng-consumerd --consumerd32-libdir
>>>>     /usr/lib/ --consumerd64-path
>>>>     /usr/lib64/lttng/libexec/lttng-consumerd --consumerd64-libdir
>>>>     /usr/lib64/ -b --no-kernel
>>>>
>>>>
>>>>
>>>>
>>>>     Regards,
>>>>     Aravind.
>>>>
>>>>     On Wed, Dec 9, 2015 at 12:57 AM, Jonathan Rajotte Julien
>>>>     <Jonathan.rajotte-julien@efficios.com
>>>>     <mailto:Jonathan.rajotte-julien@efficios.com>> wrote:
>>>>
>>>>         Hi Aravind,
>>>>
>>>>         There is no README in the archive you sent.
>>>>
>>>>         Cheers
>>>>
>>>>         On 2015-12-08 07:51 AM, Aravind HT wrote:
>>>>
>>>>             Hi,
>>>>
>>>>             I am trying to upgrade in parallel, but this issue may
>>>>             still be present after I upgrade or may be temporarily
>>>>             masked. So I need to find the root cause for this and then
>>>>             see if its available on the latest before committing to
>>>>             upgrade.
>>>>
>>>>             There is another issue i'm hitting, the lttng list command
>>>>             hangs after lttng destroy session when running the
>>>> profiling.
>>>>
>>>>             I found that consumerd 64 goes into an infinite loop
>>>>             waiting to flush metadata in
>>>>             lttng_ustconsumer_recv_metadata() :: while
>>>>             (consumer_metadata_cache_flushed(channel, offset + len,
>>>>             timer)) .
>>>>             In consumer_metadata_cache,
>>>>             channel->metadata_stream->endpoint_status is
>>>>             CONSUMER_ENDPOINT_ACTIVE,
>>>>             metadata_stream->ust_metadata_pushed is 0 with offset
>>>>             having some value. This call always returns a 1 from the
>>>>             last else{} block resulting in an infinite loop. Upon
>>>>             searching the forum, I found the same issue being reported
>>>>             here :
>>>> https://www.mail-archive.com/lttng-dev@lists.lttng.org/msg07982.html
>>>>
>>>>             Regards,
>>>>             Aravind.
>>>>
>>>>
>>>>             On Tue, Dec 8, 2015 at 12:43 AM, Jonathan Rajotte
>>>>             <jonathan.r.julien@gmail.com
>>>>             <mailto:jonathan.r.julien@gmail.com>> wrote:
>>>>
>>>>                 Hi Aravind,
>>>>
>>>>                 Did you have the chance to upgrade to 2.6.1.If so
>>>>             where you able
>>>>                 to reproduce?
>>>>
>>>>                 Cheers
>>>>
>>>>                 On Mon, Dec 7, 2015 at 1:07 PM, Aravind HT
>>>>             <aravind.ht@gmail.com <mailto:aravind.ht@gmail.com>>
>>>>                 wrote:
>>>>
>>>>                     Hi,
>>>>
>>>>                     I have attached the complete profiling scripts
>>>>             here, its a bit
>>>>                     shabby, im new to python.
>>>>
>>>>                     There is a README which has the details on how to
>>>>             execute it.
>>>>                     Im using a Yocto 1.6 on x86_64 platforms on both
>>>>             the nodes.
>>>>
>>>>
>>>>                     Running this script when there are other sessions
>>>>             running
>>>>                     seems to reproduce this problem easily.
>>>>                     Please try it and let me know if you have any issues
>>>>                     reproducing the problem.
>>>>
>>>>                     Regards,
>>>>                     Aravind.
>>>>
>>>>                     On Sat, Dec 5, 2015 at 5:23 PM, Jérémie Galarneau
>>>>                     <jeremie.galarneau@efficios.com
>>>>             <mailto:jeremie.galarneau@efficios.com>
>>>>                     <mailto:jeremie.galarneau@efficios.com
>>>>             <mailto:jeremie.galarneau@efficios.com>>> wrote:
>>>>
>>>>                         On Fri, Dec 4, 2015 at 11:06 PM, Aravind HT
>>>>                         <aravind.ht@gmail.com
>>>>             <mailto:aravind.ht@gmail.com> <mailto:aravind.ht@gmail.com
>>>>
>>>>             <mailto:aravind.ht@gmail.com>>> wrote:
>>>>                         > I am using 2.6.0 .I will try to share the
>>>>             code that I'm
>>>>                         using here in some
>>>>                         > time. If there are any specific fixes that
>>>>             are relevant
>>>>                         to this issue, see
>>>>                         > if you can provide a link to them. I would
>>>>             ideally like
>>>>                         to try them out
>>>>                         > before trying a full upgrade to the latest
>>>>             versions.
>>>>
>>>>                         Hi,
>>>>
>>>>                         Can you provide more information on the
>>>>             system? Which
>>>>                         distribution,
>>>>                         architecture, kernel version?
>>>>
>>>>                         The verbose sessiond logs might help pinpoint
>>>> any
>>>>                         unexpected behaviour
>>>>                         here (are all applications registering as
>>>>             expected?).
>>>>
>>>>                         Jérémie
>>>>
>>>>                         >
>>>>                         > On Fri, Dec 4, 2015 at 6:11 PM, Jérémie
>>>>             Galarneau
>>>>                         > <jeremie.galarneau@efficios.com
>>>>             <mailto:jeremie.galarneau@efficios.com>
>>>>                         <mailto:jeremie.galarneau@efficios.com
>>>>             <mailto:jeremie.galarneau@efficios.com>>> wrote:
>>>>                         >>
>>>>                         >> Hi Aravind,
>>>>                         >>
>>>>                         >> Can't say I have looked at everything you
>>>>             sent yet, but
>>>>                         as a
>>>>                         >> preemptive question, which version are we
>>>>             talking about
>>>>                         here? 2.6.0 or
>>>>                         >> 2.6.1? 2.6.1 contains a lot of relay daemon
>>>>             fixes.
>>>>                         >>
>>>>                         >> Thanks,
>>>>                         >> Jérémie
>>>>                         >>
>>>>                         >> On Thu, Dec 3, 2015 at 7:01 AM, Aravind HT
>>>>                         <aravind.ht@gmail.com
>>>>             <mailto:aravind.ht@gmail.com>> wrote:
>>>>                         >> > Hi,
>>>>                         >> >
>>>>                         >> > I am trying to obtain the performance
>>>>             characteristics
>>>>                         of lttng with the
>>>>                         >> > use
>>>>                         >> > of test applications. Traces are being
>>>>             produced on a
>>>>                         local node and
>>>>                         >> > delivered to relayd that is running on a
>>>>             separate
>>>>                         node for storage.
>>>>                         >> >
>>>>                         >> > An lttng session with the test
>>>>             applications producing
>>>>                         an initial bit
>>>>                         >> > rate of
>>>>                         >> > 10 kb/s is started and run for about 30
>>>>             seconds. The
>>>>                         starting sub-buffer
>>>>                         >> > size is kept at 128 kb and sub-buf count
>>>>             at 4. The
>>>>                         session is then
>>>>                         >> > stopped
>>>>                         >> > and destroyed and traces are analyzed to
>>>>             see if there
>>>>                         are any drops.
>>>>                         >> > This is
>>>>                         >> > being done in a loop with every
>>>>             subsequent session
>>>>                         having an increment
>>>>                         >> > of 2
>>>>                         >> > kb/s as long as there are no drops. If
>>>>             there are
>>>>                         drops, I increase the
>>>>                         >> > buffer size by a factor of x2 without
>>>>             incrementing
>>>>                         the bit rate.
>>>>                         >> >
>>>>                         >> > I see trace drops happening consistently
>>>>             with test
>>>>                         apps producing traces
>>>>                         >> > at
>>>>                         >> > less than 40 kb/s, it doesnt seem to help
>>>>             even if I
>>>>                         started with 1mb x 4
>>>>                         >> > sub-buffers.
>>>>                         >> >
>>>>                         >> > Analysis :
>>>>                         >> >
>>>>                         >> > I have attached the lttng_relayd ,
>>>>             lttng_consumerd_64
>>>>                         logs and the
>>>>                         >> > entire
>>>>                         >> > trace directory, hope you will be able to
>>>>             view it.
>>>>                         >> > I have modified lttng_relayd code to dump
>>>>             the traces
>>>>                         being captured in
>>>>                         >> > the
>>>>                         >> > lttng_relayd logs along with debug info.
>>>>                         >> >
>>>>                         >> > Each test app is producing logs in the
>>>>             form of  :
>>>>                         >> > "TraceApp PID - 31940 THID - 31970
>>>>             @threadRate - 1032
>>>>                         b/s appRate - 2079
>>>>                         >> > b/s
>>>>                         >> > threadTraceNum - 9 appTraceNum - 18
>>>>      sleepTime - 192120"
>>>>                         >> >
>>>>                         >> > The test application PID, test
>>>>             application thread id,
>>>>                         thread bit rate,
>>>>                         >> > test
>>>>                         >> > app bit rate, thread trace number and
>>>>             application
>>>>                         trace number s are
>>>>                         >> > part of
>>>>                         >> > the trace. So in the above trace, the
>>>>             thread is
>>>>                         producing at 1 kb/s and
>>>>                         >> > the
>>>>                         >> > whole test app is producing at 2 kb/s.
>>>>                         >> >
>>>>                         >> > If we look at the babeltrace out put, we
>>>>             see that the
>>>>                         Trace with
>>>>                         >> > TraceApp
>>>>                         >> > PID - 31940 appTraceNum 2 is missing ,
>>>>             with 1, 3, 4,
>>>>                         5 and so on being
>>>>                         >> > successfully captured.
>>>>                         >> > I looked at the lttng_relayd logs and
>>>>             found that
>>>>                         trace of "appTraceNum
>>>>                         >> > 2" is
>>>>                         >> > not delivered/generated by the consumerd
>>>>             to the
>>>>                         relayd in sequence with
>>>>                         >> > other traces. To rule out that this is
>>>>             not a test
>>>>                         application problem,
>>>>                         >> > you
>>>>                         >> > can look at line ltttng_relayd log :
>>>>             12778 and see
>>>>                         traces from
>>>>                         >> > appTraceNum -
>>>>                         >> > 1 to appTraceNum - 18 including the
>>>>             appTraceNum 2 are
>>>>                         "re-delivered" by
>>>>                         >> > the
>>>>                         >> > consumerd to the relayd.
>>>>                         >> > Essentially, I see appTraceNum 1 through
>>>>             appTraceNum
>>>>                         18 being delivered
>>>>                         >> > twice, once individually where
>>>>             appTraceNum 2 is
>>>>                         missing and once as a
>>>>                         >> > group
>>>>                         >> > at line 12778 where its present.
>>>>                         >> >
>>>>                         >> >
>>>>                         >> > Request help with
>>>>                         >> > 1. why traces are delivered twice, is it
>>>>             by design or
>>>>                         a genuine problem
>>>>                         >> > ?
>>>>                         >> > 2. how to avoid traces being dropped even
>>>>             though
>>>>                         buffers are
>>>>                         >> > sufficiently
>>>>                         >> > large enough ?
>>>>                         >> >
>>>>                         >> >
>>>>                         >> > Regards,
>>>>                         >> > Aravind.
>>>>                         >> >
>>>>                         >> >
>>>>             _______________________________________________
>>>>                         >> > lttng-dev mailing list
>>>>                         >> > lttng-dev@lists.lttng.org
>>>>             <mailto:lttng-dev@lists.lttng.org>
>>>>                         >> >
>>>>             http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev
>>>>                         >> >
>>>>                         >>
>>>>                         >>
>>>>                         >>
>>>>                         >> --
>>>>                         >> Jérémie Galarneau
>>>>                         >> EfficiOS Inc.
>>>>                         >> http://www.efficios.com
>>>>                         >
>>>>                         >
>>>>
>>>>
>>>>
>>>>                         --
>>>>                         Jérémie Galarneau
>>>>                         EfficiOS Inc.
>>>>             http://www.efficios.com
>>>>
>>>>
>>>>
>>>>             _______________________________________________
>>>>                     lttng-dev mailing list
>>>>             lttng-dev@lists.lttng.org
>>>>             <mailto:lttng-dev@lists.lttng.org>
>>>>             <mailto:lttng-dev@lists.lttng.org
>>>>             <mailto:lttng-dev@lists.lttng.org>>
>>>>             http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev
>>>>
>>>>
>>>>
>>>>
>>>>                 --     Jonathan Rajotte Julien
>>>>
>>>>
>>>>
>>>>
>>>>             _______________________________________________
>>>>             lttng-dev mailing list
>>>>             lttng-dev@lists.lttng.org <mailto:lttng-dev@lists.lttng.org
>>>> >
>>>>             http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev
>>>>
>>>>
>>>>         --         Jonathan R. Julien
>>>>         Efficios
>>>>
>>>>
>>>>
>>>>
>>> --
>>> Jonathan R. Julien
>>> Efficios
>>>
>>>
>>
>> _______________________________________________
>> lttng-dev mailing list
>> lttng-dev@lists.lttng.org
>> http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev
>>
>>
>> --
>> Mathieu Desnoyers
>> EfficiOS Inc.
>> http://www.efficios.com
>>
>
>
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com
>

[-- Attachment #1.2: Type: text/html, Size: 39273 bytes --]

[-- Attachment #2: Type: text/plain, Size: 155 bytes --]

_______________________________________________
lttng-dev mailing list
lttng-dev@lists.lttng.org
http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Relayd trace drops
       [not found]                             ` <CACd+_b5cQ__Gqynic40iXw8UBpdd4cU9H2R=_HEqm_22LxApKw@mail.gmail.com>
@ 2015-12-30  2:05                               ` Mathieu Desnoyers
       [not found]                               ` <1521366980.305833.1451441138158.JavaMail.zimbra@efficios.com>
  1 sibling, 0 replies; 18+ messages in thread
From: Mathieu Desnoyers @ 2015-12-30  2:05 UTC (permalink / raw)
  To: Aravind HT; +Cc: lttng-dev


[-- Attachment #1.1: Type: text/plain, Size: 22810 bytes --]

Hi Aravind, 

You may want to have a look at this commit, which appears 
in 2.6.1 too: 

commit c585821bc78955b3d747fcd733aa1d2b81a3258e 
Author: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> 
Date: Wed Aug 19 14:44:59 2015 -0700 

Fix: sessiond vs consumerd push/get metadata deadlock 

We need to unlock the registry while we push metadata to break a 
circular dependency between the consumerd metadata lock and the sessiond 
registry lock. Indeed, pushing metadata to the consumerd awaits that it 
gets pushed all the way to relayd, but doing so requires grabbing the 
metadata lock. If a concurrent metadata request is being performed by 
consumerd, this can try to grab the registry lock on the sessiond while 
holding the metadata lock on the consumer daemon. Those push and pull 
schemes are performed on two different bidirectionnal communication 
sockets. 

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> 
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com> 
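
For illustration, the shape of that fix looks roughly like the sketch below.
The names (registry_lock, push_to_consumer, push_metadata) are invented here
and are not the actual lttng-tools symbols; it only shows the idea of not
holding the registry lock across the blocking push.

/*
 * Minimal sketch of "unlock the registry while pushing", invented names, not
 * the lttng-tools code. The metadata to send is copied under the lock, then
 * the lock is released before the blocking push, so a concurrent metadata
 * request can still take it.
 */
#include <pthread.h>
#include <string.h>

static pthread_mutex_t registry_lock = PTHREAD_MUTEX_INITIALIZER;
static char metadata[4096];
static size_t metadata_len;

/* Stand-in for the blocking path that pushes data to consumerd/relayd. */
static int push_to_consumer(const char *buf, size_t len)
{
	(void) buf;
	(void) len;
	return 0;
}

static int push_metadata(void)
{
	char snapshot[sizeof(metadata)];
	size_t len;

	pthread_mutex_lock(&registry_lock);
	len = metadata_len;
	memcpy(snapshot, metadata, len);
	/* Drop the lock before any call that may block on the consumer. */
	pthread_mutex_unlock(&registry_lock);

	return push_to_consumer(snapshot, len);
}

int main(void)
{
	metadata_len = 5;
	memcpy(metadata, "hello", metadata_len);
	return push_metadata();
}

The copy-then-unlock step stands in for whatever bookkeeping the real code does
before releasing the registry lock; the point is only that the blocking push
happens without the registry lock held.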

Happy new year! By the way, everyone at EfficiOS is on vacation until January 4th 
(that includes me). ;-) 

Mathieu 

----- On Dec 28, 2015, at 12:24 PM, Aravind HT <aravind.ht@gmail.com> wrote: 

> Hi,
> Ive not had a chance to try it on 2.6.1 or 2.7.0 yet, will try and let you know.
> However, I was able to come up with a fix for 2.6.0 and I do not see any such
> changes on 2.6.1 or 2.7.0, so forwarding you the patch.

> diff -Naur ./src/bin/lttng-sessiond/ust-consumer.c
> /tmp/LTTNG_BKP_28_12_1/bin/lttng-sessiond/ust-consumer.c
> --- ./src/bin/lttng-sessiond/ust-consumer.c 2015-12-28 07:45:34.610430830 -0800
> +++ /tmp/LTTNG_BKP_28_12_1/bin/lttng-sessiond/ust-consumer.c 2015-12-28
> 07:58:18.906918555 -0800
> @@ -1,3 +1,4 @@
> +
> /*
> * Copyright (C) 2011 - David Goulet < david.goulet@polymtl.ca >
> *
> @@ -486,17 +487,48 @@
> ust_reg = reg_pid->registry->reg.ust;
> }
> assert(ust_reg);
> +/* ust_reg->lock can also have been taken from push_metadata making this
> metadata request
> +* wait without sending a response to the consumer which gets stuck on the
> recv() call
> +* waiting for a response from sessiond. This metadata request from the consumer
> would
> +* have taken the stream->lock, do_sync_metadata() making the push_metadata wait
> for the stream->lock to be
> +* released, making the sessiond thread pushing the metadata wait on the recv()
> for a
> +* response from the consumer, creating a deadlock. So to avoid this deadlock,
> making
> +* the request metadata flow to fail incase if sessiond is already pushing
> metadata on
> +* its own. */
> + ret = pthread_mutex_trylock(&ust_reg->lock);
> + if(!ret) {
> + ret_push = ust_app_push_metadata(ust_reg, socket, 1);
> + pthread_mutex_unlock(&ust_reg->lock);
> + if (ret_push < 0) {
> + ERR("Error Pushing metadata");
> + /* -EPIPE error would be thrown if the appRegistry->metada_closed was set,
> + * this could have happened if ust_app_unregister() was called when there
> + * was a problem with the sock ( epoll() ) from thread_manage_apps(). This
> + * would have made the current func return an error, upon which, the whole
> + * thread_manage_consumer would have been killed. To avoid this, send
> + * an error back to consumerd and continue buisness as usual */
> + if( ret_push == -EPIPE ) {
> + DBG("app metadata was closed, so sending ERR msg to Consumerd\n");
> + memset(&msg, 0, sizeof(msg));
> + msg.cmd_type = LTTNG_ERR_UND;
> + (void) consumer_send_msg(socket, &msg);
> + ret = 0;
> + } else {
> + ret = -1;
> + }
> + } else {
> + DBG("UST Consumer metadata pushed successfully");
> + ret = 0;
> + }
> +
> + } else {
> + DBG(" ust_reg->lock is taken, cannot push metadata for this request at present
> \n");
> + memset(&msg, 0, sizeof(msg));
> + msg.cmd_type = LTTNG_ERR_UND;
> + (void) consumer_send_msg(socket, &msg);
> + ret = 0;
> - pthread_mutex_lock(&ust_reg->lock);
> - ret_push = ust_app_push_metadata(ust_reg, socket, 1);
> - pthread_mutex_unlock(&ust_reg->lock);
> - if (ret_push < 0) {
> - ERR("Pushing metadata");
> - ret = -1;
> - goto end;
> - }
> - DBG("UST Consumer metadata pushed successfully");
> - ret = 0;
> + }
> end:
> rcu_read_unlock();

> Regards,
> Aravind.

> On Wed, Dec 16, 2015 at 7:48 PM, Mathieu Desnoyers <
> mathieu.desnoyers@efficios.com > wrote:

>> ----- On Dec 16, 2015, at 8:02 AM, Aravind HT < aravind.ht@gmail.com > wrote:

>>> Hi Mathieu,
>>> Ive been able to analyze this sessiond - consumerd hang scenario and I see that
>>> the fix is not available on 2.6.1 or on 2.7.0.

>> Did you actually reproduce the issue with 2.6.1 or 2.7.0 ? Can we get the
>> associated backtrace reproduced on those versions ?

>> Thanks,

>> Mathieu

>>> Root Cause :

>>> The hang is mainly caused because of a race condition for the
>>> consumer_metadata_socket between the following two flows of code
>>> Flow 1. do_sync_metadata() -> lttng_ustconsumer_sync_metadat() ->
>>> lttng_ustconsumer_request_metadata() -> lttncomm_recv_unix_sock() -> recvmsg()

>>> and

>>> Flow 2. lttng_consumer_recv_cmd() -> lttng_ust_consumer_recv_cmd() ->
>>> lttng_ustconsumer_recv_metadata()

>>> Flow 1 above initiates a metadata requests and recvmsg()s for the reply from
>>> sessiond on the consumer_metadata_socket ( it has taken the
>>> metadata_socket_lock and the stream->lock ) , but simultaneously this socket is
>>> being polled from Flow 2. Once in a while Flow 2 gets to receive while Flow 1
>>> just hangs on recvmsg() . Since Flow 1 has taken the stream->lock, this then
>>> introduces a deadlock at a later stage (on_buffer_ready() when the metadata is
>>> being flushed. )

>>> Think using the metadata_socket_lock in Flow 2 before attempting to receive may
>>> address this issue, but is a bit complicated as the message would have already
>>> been received at the point of using that lock. The other alternative is to
>>> remove the complete recv() from Flow 1 and let that be taken care by Flow 2.

>>> Please let me know if this has been fixed in the later builds, may be I searched
>>> it wrong.

>>> Regards,
>>> Aravind.

>>> On Sat, Dec 12, 2015 at 4:13 AM, Mathieu Desnoyers <
>>> mathieu.desnoyers@efficios.com > wrote:

>>>> Hi,

>>>> Please upgrade to 2.6.1 or 2.7.0, those are old issues.

>>>> Thanks,

>>>> Mathieu

>>>> ----- On Dec 11, 2015, at 1:42 PM, Aravind HT < aravind.ht@gmail.com > wrote:

>>>>> NPBS understand it, I will continue to investigate this, got a bit side tracked
>>>>> with the sessiond hang issue that I mentioned earlier.
>>>>> Making some progress on that, will update as soon as I find the root cause, it
>>>>> looks like a mutex deadlock between
>>>>> the following threads in the consumerd daemon

>>>>> #1 0x0000000000410b5d in lttcomm_recv_unix_sock (sock=26,
>>>>> buf=buf@entry=0x7f16acca57f0, len=len@entry=4476) at unix.c:175
>>>>> #2 0x000000000041dc96 in lttng_ustconsumer_request_metadata
>>>>> (ctx=ctx@entry=0x640d10, channel=0x7f16980192b0, timer=timer@entry=0,
>>>>> wait=wait@entry=0) at ust-consumer.c:2331
>>>>> #3 0x000000000041ff3c in lttng_ustconsumer_sync_metadata
>>>>> (ctx=ctx@entry=0x640d10, metadata=metadata@entry=0x7f169801aa40) at
>>>>> ust-consumer.c:1829
>>>>> #4 0x000000000040eabf in do_sync_metadata (ctx=<optimized out>,
>>>>> metadata=<optimized out>) at consumer-stream.c:468
>>>>> #5 consumer_stream_sync_metadata (ctx=ctx@entry=0x640d10, session_id=8) at
>>>>> consumer-stream.c:545
>>>>> #6 0x000000000041d316 in lttng_ustconsumer_read_subbuffer
>>>>> (stream=stream@entry=0x7f1698018fa0, ctx=ctx@entry=0x640d10) at
>>>>> ust-consumer.c:2063
>>>>> #7 0x000000000040b613 in lttng_consumer_read_subbuffer (stream=0x7f1698018fa0,
>>>>> ctx=0x640d10) at consumer.c:3203
>>>>> #8 0x0000000000409608 in consumer_thread_data_poll (data=0x640d10) at
>>>>> consumer.c:2557
>>>>> #9 0x00007f16af6a6fe3 in start_thread (arg=0x7f16acca7700) at
>>>>> pthread_create.c:312
>>>>> #10 0x00007f16af3dcafd in clone () at
>>>>> ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

>>>>> and

>>>>> consumer_thread_metadata_poll() calling lttng_consumer_read_subbuffer() /
>>>>> on_buffer_ready() calling pthread_mutex_lock(stream->lock) and then hanging

>>>>> This again can be reproduced by the same script, just running consecutive
>>>>> sessions of short duration (<10 seconds) with traces getting captured on a
>>>>> remote node.
>>>>> I hit this every time I run the script.

>>>>> Regards,
>>>>> Aravind.

>>>>> On Fri, Dec 11, 2015 at 11:33 PM, Jonathan Rajotte Julien <
>>>>> Jonathan.rajotte-julien@efficios.com > wrote:

>>>>>> Hi Aravind,

>>>>>> On 2015-12-11 12:16 PM, Aravind HT wrote:

>>>>>>> Hi,

>>>>>>> Did you get a chance to reproduce the problem with the scripts ? Let me know if
>>>>>>> you need any help running it.

>>>>>> Time is a scarce resource unfortunately. But until I have the time to look at it
>>>>>> you might want to investigate if you are able to reproduce it on
>>>>>> 2.6.1/2.7.0/master on 2 regular (x86_64, ubuntu/debian) VM this would really
>>>>>> help.

>>>>>> Cheers!

>>>>>>> Regards,
>>>>>>> Aravind.

>>>>>>> On Wed, Dec 9, 2015 at 3:04 PM, Aravind HT < aravind.ht@gmail.com <mailto:
>>>>>>> aravind.ht@gmail.com >> wrote:

>>>>>>> Sorry about that, not sure how it got missed.
>>>>>>> Here it is.

>>>>>>> # This needs a two node set up (1. local current node 2. remote node )
>>>>>>> # relayd runs on the current node where traces are captured from
>>>>>>> the remote node
>>>>>>> # remote node runs test applications which generate traces.
>>>>>>> # the launch_script_RN is executed on the current node and uses
>>>>>>> ssh to execute commands on the remote node. So this part may not
>>>>>>> work in every case and may prompt for a password.
>>>>>>> # if experiencing problems with ssh , kindly check
>>>>>>> http://serverfault.com/questions/241588/how-to-automate-ssh-login-with-password

>>>>>>> # ==================== To Run =============================
>>>>>>> launch_script_RN.py self_profile -c /tmp/configFile.txt

>>>>>>> # configFile.txt is the file which has configuration params that
>>>>>>> launchScript
>>>>>>> # needs to configure lttng sessions. Below is an explanation of
>>>>>>> the different options.
>>>>>>> # =================== configFile.txt =============================

>>>>>>> [section1]
>>>>>>> # final out put file path
>>>>>>> OutputFile = /tmp/Final_report.txt
>>>>>>> # the remote node hostname on which test applications run and the
>>>>>>> test sessions will be created; this should be something that could
>>>>>>> be used with ssh. Traces will be transported from this node to the
>>>>>>> lttng_relayd running on the current node.
>>>>>>> Node = MY_REMOTE_NODE
>>>>>>> # Sub buffer size to start this with
>>>>>>> SubBufSize = 16k
>>>>>>> # Sub buffer count
>>>>>>> SubBufCount = 4
>>>>>>> # per uid buffer
>>>>>>> BufferScheme = --buffers-uid
>>>>>>> # yes
>>>>>>> EnableTracing = yes
>>>>>>> # Bit rate of the test applications. Comman seperated example "1,
>>>>>>> 3, 3, 50" sayss 4 test applications producing 1, 3, 3, and 50 Kb/s
>>>>>>> traces.
>>>>>>> # So with the below, we just start with 1 test application
>>>>>>> producing 10 kb/s
>>>>>>> TestApps = 10
>>>>>>> # session life time in seonds
>>>>>>> TestTime = 10
>>>>>>> # Max number of successive sessions to configure. if n then n-1
>>>>>>> sessions are run, ex MaxRun = 2 will run 1 session.
>>>>>>> MaxRun = 100

>>>>>>> # ==================== Place the following files under
>>>>>>> ===============

>>>>>>> # /tmp on the remote node
>>>>>>> clean_RemNode_apps.sh
>>>>>>> report_lttng_script.sh

>>>>>>> # rest of the scripts under /usr/sbin on the current local node on
>>>>>>> which lttng_realyd runs
>>>>>>> # Define a trace point MY_TRACE to take a single string arg with
>>>>>>> LOG_TEST_APP_PROFILING as the provider, compile test lttng_testApp
>>>>>>> and place it under /usr/sbin of the remote host

>>>>>>> # in launch_script_RN.py change currentNodeIP to the IP address on
>>>>>>> which relayd is receiving, default ports are used.

>>>>>>> # lttng_relayd is started as
>>>>>>> /usr/bin/lttng-relayd -o /var/log/lttng-traces -d

>>>>>>> # lttng_sessiond is started as
>>>>>>> /usr/bin/lttng-sessiond --consumerd32-path
>>>>>>> /usr/lib/lttng/libexec/lttng-consumerd --consumerd32-libdir
>>>>>>> /usr/lib/ --consumerd64-path
>>>>>>> /usr/lib64/lttng/libexec/lttng-consumerd --consumerd64-libdir
>>>>>>> /usr/lib64/ -b --no-kernel

>>>>>>> Regards,
>>>>>>> Aravind.

>>>>>>> On Wed, Dec 9, 2015 at 12:57 AM, Jonathan Rajotte Julien
>>>>>>> < Jonathan.rajotte-julien@efficios.com
>>>>>>> <mailto: Jonathan.rajotte-julien@efficios.com >> wrote:

>>>>>>> Hi Aravind,

>>>>>>> There is no README in the archive you sent.

>>>>>>> Cheers

>>>>>>> On 2015-12-08 07:51 AM, Aravind HT wrote:

>>>>>>> Hi,

>>>>>>> I am trying to upgrade in parallel, but this issue may
>>>>>>> still be present after I upgrade or may be temporarily
>>>>>>> masked. So I need to find the root cause for this and then
>>>>>>> see if its available on the latest before committing to
>>>>>>> upgrade.

>>>>>>> There is another issue i'm hitting, the lttng list command
>>>>>>> hangs after lttng destroy session when running the profiling.

>>>>>>> I found that consumerd 64 goes into an infinite loop
>>>>>>> waiting to flush metadata in
>>>>>>> lttng_ustconsumer_recv_metadata() :: while
>>>>>>> (consumer_metadata_cache_flushed(channel, offset + len,
>>>>>>> timer)) .
>>>>>>> In consumer_metadata_cache,
>>>>>>> channel->metadata_stream->endpoint_status is
>>>>>>> CONSUMER_ENDPOINT_ACTIVE,
>>>>>>> metadata_stream->ust_metadata_pushed is 0 with offset
>>>>>>> having some value. This call always returns a 1 from the
>>>>>>> last else{} block resulting in an infinite loop. Upon
>>>>>>> searching the forum, I found the same issue being reported
>>>>>>> here :
>>>>>>> https://www.mail-archive.com/lttng-dev@lists.lttng.org/msg07982.html

>>>>>>> Regards,
>>>>>>> Aravind.

>>>>>>> On Tue, Dec 8, 2015 at 12:43 AM, Jonathan Rajotte
>>>>>>> < jonathan.r.julien@gmail.com
>>>>>>> <mailto: jonathan.r.julien@gmail.com >> wrote:

>>>>>>> Hi Aravind,

>>>>>>> Did you have the chance to upgrade to 2.6.1.If so
>>>>>>> where you able
>>>>>>> to reproduce?

>>>>>>> Cheers

>>>>>>> On Mon, Dec 7, 2015 at 1:07 PM, Aravind HT
>>>>>>> < aravind.ht@gmail.com <mailto: aravind.ht@gmail.com >>
>>>>>>> wrote:

>>>>>>> Hi,

>>>>>>> I have attached the complete profiling scripts
>>>>>>> here, its a bit
>>>>>>> shabby, im new to python.

>>>>>>> There is a README which has the details on how to
>>>>>>> execute it.
>>>>>>> Im using a Yocto 1.6 on x86_64 platforms on both
>>>>>>> the nodes.

>>>>>>> Running this script when there are other sessions
>>>>>>> running
>>>>>>> seems to reproduce this problem easily.
>>>>>>> Please try it and let me know if you have any issues
>>>>>>> reproducing the problem.

>>>>>>> Regards,
>>>>>>> Aravind.

>>>>>>> On Sat, Dec 5, 2015 at 5:23 PM, Jérémie Galarneau
>>>>>>> < jeremie.galarneau@efficios.com
>>>>>>> <mailto: jeremie.galarneau@efficios.com >
>>>>>>> <mailto: jeremie.galarneau@efficios.com
>>>>>>> <mailto: jeremie.galarneau@efficios.com >>> wrote:

>>>>>>> On Fri, Dec 4, 2015 at 11:06 PM, Aravind HT
>>>>>>> < aravind.ht@gmail.com
>>>>>>> <mailto: aravind.ht@gmail.com > <mailto: aravind.ht@gmail.com

>>>>>>> <mailto: aravind.ht@gmail.com >>> wrote:
>>>>>>> > I am using 2.6.0 .I will try to share the
>>>>>>> code that I'm
>>>>>>> using here in some
>>>>>>> > time. If there are any specific fixes that
>>>>>>> are relevant
>>>>>>> to this issue, see
>>>>>>> > if you can provide a link to them. I would
>>>>>>> ideally like
>>>>>>> to try them out
>>>>>>> > before trying a full upgrade to the latest
>>>>>>> versions.

>>>>>>> Hi,

>>>>>>> Can you provide more information on the
>>>>>>> system? Which
>>>>>>> distribution,
>>>>>>> architecture, kernel version?

>>>>>>> The verbose sessiond logs might help pinpoint any
>>>>>>> unexpected behaviour
>>>>>>> here (are all applications registering as
>>>>>>> expected?).

>>>>>>> Jérémie


>>>>>>> > On Fri, Dec 4, 2015 at 6:11 PM, Jérémie
>>>>>>> Galarneau
>>>>>>> > < jeremie.galarneau@efficios.com
>>>>>>> <mailto: jeremie.galarneau@efficios.com >
>>>>>>> <mailto: jeremie.galarneau@efficios.com
>>>>>>> <mailto: jeremie.galarneau@efficios.com >>> wrote:

>>>>>>> >> Hi Aravind,

>>>>>>> >> Can't say I have looked at everything you
>>>>>>> sent yet, but
>>>>>>> as a
>>>>>>> >> preemptive question, which version are we
>>>>>>> talking about
>>>>>>> here? 2.6.0 or
>>>>>>> >> 2.6.1? 2.6.1 contains a lot of relay daemon
>>>>>>> fixes.

>>>>>>> >> Thanks,
>>>>>>> >> Jérémie

>>>>>>> >> On Thu, Dec 3, 2015 at 7:01 AM, Aravind HT
>>>>>>> < aravind.ht@gmail.com
>>>>>>> <mailto: aravind.ht@gmail.com >> wrote:
>>>>>>> >> > Hi,

>>>>>>> >> > I am trying to obtain the performance
>>>>>>> characteristics
>>>>>>> of lttng with the
>>>>>>> >> > use
>>>>>>> >> > of test applications. Traces are being
>>>>>>> produced on a
>>>>>>> local node and
>>>>>>> >> > delivered to relayd that is running on a
>>>>>>> separate
>>>>>>> node for storage.

>>>>>>> >> > An lttng session with the test
>>>>>>> applications producing
>>>>>>> an initial bit
>>>>>>> >> > rate of
>>>>>>> >> > 10 kb/s is started and run for about 30
>>>>>>> seconds. The
>>>>>>> starting sub-buffer
>>>>>>> >> > size is kept at 128 kb and sub-buf count
>>>>>>> at 4. The
>>>>>>> session is then
>>>>>>> >> > stopped
>>>>>>> >> > and destroyed and traces are analyzed to
>>>>>>> see if there
>>>>>>> are any drops.
>>>>>>> >> > This is
>>>>>>> >> > being done in a loop with every
>>>>>>> subsequent session
>>>>>>> having an increment
>>>>>>> >> > of 2
>>>>>>> >> > kb/s as long as there are no drops. If
>>>>>>> there are
>>>>>>> drops, I increase the
>>>>>>> >> > buffer size by a factor of x2 without
>>>>>>> incrementing
>>>>>>> the bit rate.

>>>>>>> >> > I see trace drops happening consistently
>>>>>>> with test
>>>>>>> apps producing traces
>>>>>>> >> > at
>>>>>>> >> > less than 40 kb/s, it doesnt seem to help
>>>>>>> even if I
>>>>>>> started with 1mb x 4
>>>>>>> >> > sub-buffers.

>>>>>>> >> > Analysis :

>>>>>>> >> > I have attached the lttng_relayd ,
>>>>>>> lttng_consumerd_64
>>>>>>> logs and the
>>>>>>> >> > entire
>>>>>>> >> > trace directory, hope you will be able to
>>>>>>> view it.
>>>>>>> >> > I have modified lttng_relayd code to dump
>>>>>>> the traces
>>>>>>> being captured in
>>>>>>> >> > the
>>>>>>> >> > lttng_relayd logs along with debug info.

>>>>>>> >> > Each test app is producing logs in the
>>>>>>> form of :
>>>>>>> >> > "TraceApp PID - 31940 THID - 31970
>>>>>>> @threadRate - 1032
>>>>>>> b/s appRate - 2079
>>>>>>> >> > b/s
>>>>>>> >> > threadTraceNum - 9 appTraceNum - 18 sleepTime - 192120"

>>>>>>> >> > The test application PID, test
>>>>>>> application thread id,
>>>>>>> thread bit rate,
>>>>>>> >> > test
>>>>>>> >> > app bit rate, thread trace number and
>>>>>>> application
>>>>>>> trace number s are
>>>>>>> >> > part of
>>>>>>> >> > the trace. So in the above trace, the
>>>>>>> thread is
>>>>>>> producing at 1 kb/s and
>>>>>>> >> > the
>>>>>>> >> > whole test app is producing at 2 kb/s.

>>>>>>> >> > If we look at the babeltrace out put, we
>>>>>>> see that the
>>>>>>> Trace with
>>>>>>> >> > TraceApp
>>>>>>> >> > PID - 31940 appTraceNum 2 is missing ,
>>>>>>> with 1, 3, 4,
>>>>>>> 5 and so on being
>>>>>>> >> > successfully captured.
>>>>>>> >> > I looked at the lttng_relayd logs and
>>>>>>> found that
>>>>>>> trace of "appTraceNum
>>>>>>> >> > 2" is
>>>>>>> >> > not delivered/generated by the consumerd
>>>>>>> to the
>>>>>>> relayd in sequence with
>>>>>>> >> > other traces. To rule out that this is
>>>>>>> not a test
>>>>>>> application problem,
>>>>>>> >> > you
>>>>>>> >> > can look at line ltttng_relayd log :
>>>>>>> 12778 and see
>>>>>>> traces from
>>>>>>> >> > appTraceNum -
>>>>>>> >> > 1 to appTraceNum - 18 including the
>>>>>>> appTraceNum 2 are
>>>>>>> "re-delivered" by
>>>>>>> >> > the
>>>>>>> >> > consumerd to the relayd.
>>>>>>> >> > Essentially, I see appTraceNum 1 through
>>>>>>> appTraceNum
>>>>>>> 18 being delivered
>>>>>>> >> > twice, once individually where
>>>>>>> appTraceNum 2 is
>>>>>>> missing and once as a
>>>>>>> >> > group
>>>>>>> >> > at line 12778 where its present.


>>>>>>> >> > Request help with
>>>>>>> >> > 1. why traces are delivered twice, is it
>>>>>>> by design or
>>>>>>> a genuine problem
>>>>>>> >> > ?
>>>>>>> >> > 2. how to avoid traces being dropped even
>>>>>>> though
>>>>>>> buffers are
>>>>>>> >> > sufficiently
>>>>>>> >> > large enough ?


>>>>>>> >> > Regards,
>>>>>>> >> > Aravind.


>>>>>>> _______________________________________________
>>>>>>> >> > lttng-dev mailing list
>>>>>>> >> > lttng-dev@lists.lttng.org
>>>>>>> <mailto: lttng-dev@lists.lttng.org >

>>>>>>> http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev




>>>>>>> >> --
>>>>>>> >> Jérémie Galarneau
>>>>>>> >> EfficiOS Inc.
>>>>>>> >> http://www.efficios.com



>>>>>>> --
>>>>>>> Jérémie Galarneau
>>>>>>> EfficiOS Inc.
>>>>>>> http://www.efficios.com

>>>>>>> _______________________________________________
>>>>>>> lttng-dev mailing list
>>>>>>> lttng-dev@lists.lttng.org
>>>>>>> <mailto: lttng-dev@lists.lttng.org >
>>>>>>> <mailto: lttng-dev@lists.lttng.org
>>>>>>> <mailto: lttng-dev@lists.lttng.org >>
>>>>>>> http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev

>>>>>>> -- Jonathan Rajotte Julien

>>>>>>> _______________________________________________
>>>>>>> lttng-dev mailing list
>>>>>>> lttng-dev@lists.lttng.org <mailto: lttng-dev@lists.lttng.org >
>>>>>>> http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev

>>>>>>> -- Jonathan R. Julien
>>>>>>> Efficios

>>>>>> --
>>>>>> Jonathan R. Julien
>>>>>> Efficios

>>>>> _______________________________________________
>>>>> lttng-dev mailing list
>>>>> lttng-dev@lists.lttng.org
>>>>> http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev

>>>> --
>>>> Mathieu Desnoyers
>>>> EfficiOS Inc.
>>>> http://www.efficios.com

>> --
>> Mathieu Desnoyers
>> EfficiOS Inc.
>> http://www.efficios.com

-- 
Mathieu Desnoyers 
EfficiOS Inc. 
http://www.efficios.com 

[-- Attachment #1.2: Type: text/html, Size: 54036 bytes --]

[-- Attachment #2: Type: text/plain, Size: 155 bytes --]

_______________________________________________
lttng-dev mailing list
lttng-dev@lists.lttng.org
http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Relayd trace drops
       [not found]                               ` <1521366980.305833.1451441138158.JavaMail.zimbra@efficios.com>
@ 2015-12-30  8:53                                 ` Aravind HT
  0 siblings, 0 replies; 18+ messages in thread
From: Aravind HT @ 2015-12-30  8:53 UTC (permalink / raw)
  To: Mathieu Desnoyers; +Cc: lttng-dev


[-- Attachment #1.1: Type: text/plain, Size: 29438 bytes --]

Hi Mathieu,

I will look into the commit. I appreciate you replying during your vacation;
have a nice one.
Happy new year to all.

Regards,
Aravind.

On Wed, Dec 30, 2015 at 7:35 AM, Mathieu Desnoyers <
mathieu.desnoyers@efficios.com> wrote:

> Hi Aravind,
>
> You may want to have a look at this commit, which appears
> in 2.6.1 too:
>
> commit c585821bc78955b3d747fcd733aa1d2b81a3258e
> Author: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> Date:   Wed Aug 19 14:44:59 2015 -0700
>
>     Fix: sessiond vs consumerd push/get metadata deadlock
>
>     We need to unlock the registry while we push metadata to break a
>     circular dependency between the consumerd metadata lock and the
> sessiond
>     registry lock. Indeed, pushing metadata to the consumerd awaits that it
>     gets pushed all the way to relayd, but doing so requires grabbing the
>     metadata lock. If a concurrent metadata request is being performed by
>     consumerd, this can try to grab the registry lock on the sessiond while
>     holding the metadata lock on the consumer daemon. Those push and pull
>     schemes are performed on two different bidirectionnal communication
>     sockets.
>
>     Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
>     Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
>
> Happy new year! By the way, everyone at EfficiOS is on vacation until
> January 4th
> (that includes me). ;-)
>
> Mathieu
>
> ----- On Dec 28, 2015, at 12:24 PM, Aravind HT <aravind.ht@gmail.com>
> wrote:
>
> Hi,
> Ive not had a chance to try it on 2.6.1 or 2.7.0 yet, will try and let you
> know.
> However, I was able to come up with a fix for 2.6.0 and I do not see any
> such changes on 2.6.1 or 2.7.0, so forwarding you the patch.
>
>
> diff -Naur ./src/bin/lttng-sessiond/ust-consumer.c
> /tmp/LTTNG_BKP_28_12_1/bin/lttng-sessiond/ust-consumer.c
> --- ./src/bin/lttng-sessiond/ust-consumer.c 2015-12-28 07:45:34.610430830
> -0800
> +++ /tmp/LTTNG_BKP_28_12_1/bin/lttng-sessiond/ust-consumer.c 2015-12-28
> 07:58:18.906918555 -0800
> @@ -1,3 +1,4 @@
> +
>  /*
>   * Copyright (C) 2011 - David Goulet <david.goulet@polymtl.ca>
>   *
> @@ -486,17 +487,48 @@
>   ust_reg = reg_pid->registry->reg.ust;
>   }
>   assert(ust_reg);
> +/* ust_reg->lock can also have been taken from push_metadata making this
> metadata request
> +*  wait without sending a response to the consumer which gets stuck on
> the recv() call
> +*  waiting for a response from sessiond. This metadata request from the
> consumer would
> +*  have taken the stream->lock, do_sync_metadata() making the
> push_metadata wait for the stream->lock to be
> +*  released, making the sessiond thread pushing the metadata wait on the
> recv() for a
> +*  response from the consumer, creating a deadlock. So to avoid this
> deadlock, making
> +*  the request metadata flow to fail incase if sessiond is already
> pushing metadata on
> +*  its own. */
> +        ret = pthread_mutex_trylock(&ust_reg->lock);
> +        if(!ret) {
> +                ret_push = ust_app_push_metadata(ust_reg, socket, 1);
> +                pthread_mutex_unlock(&ust_reg->lock);
> +                if (ret_push < 0) {
> + ERR("Error Pushing metadata");
> +                /* -EPIPE error would be thrown if the
> appRegistry->metada_closed was set,
> +                *  this could have happened if ust_app_unregister() was
> called when there
> +                *  was a problem with the sock ( epoll() ) from
> thread_manage_apps(). This
> +                *  would have made the current func return an error,
> upon which, the whole
> +                *  thread_manage_consumer would have been killed. To
> avoid this, send
> +                *  an error back to consumerd and continue buisness as
> usual */
> +                 if( ret_push == -EPIPE ) {
> +                        DBG("app metadata was closed, so sending ERR msg
> to Consumerd\n");
> +                        memset(&msg, 0, sizeof(msg));
> +                         msg.cmd_type = LTTNG_ERR_UND;
> +                        (void) consumer_send_msg(socket, &msg);
> +                        ret = 0;
> +                 } else {
> +                        ret = -1;
> +                 }
> +        } else {
> +                 DBG("UST Consumer metadata pushed successfully");
> +                ret = 0;
> +         }
> +
> +        } else {
> +                DBG(" ust_reg->lock is taken, cannot push metadata for
> this request at present \n");
> +                memset(&msg, 0, sizeof(msg));
> +                msg.cmd_type = LTTNG_ERR_UND;
> +                (void) consumer_send_msg(socket, &msg);
> +                ret = 0;
>
> - pthread_mutex_lock(&ust_reg->lock);
> - ret_push = ust_app_push_metadata(ust_reg, socket, 1);
> - pthread_mutex_unlock(&ust_reg->lock);
> - if (ret_push < 0) {
> - ERR("Pushing metadata");
> - ret = -1;
> - goto end;
> - }
> - DBG("UST Consumer metadata pushed successfully");
> - ret = 0;
> +        }
>
>  end:
>   rcu_read_unlock();
>
>
> Regards,
> Aravind.
>
> On Wed, Dec 16, 2015 at 7:48 PM, Mathieu Desnoyers <
> mathieu.desnoyers@efficios.com> wrote:
>
>>
>> ----- On Dec 16, 2015, at 8:02 AM, Aravind HT <aravind.ht@gmail.com>
>> wrote:
>>
>> Hi Mathieu,
>> Ive been able to analyze this sessiond - consumerd hang scenario and I
>> see that the fix is not available on 2.6.1 or on 2.7.0.
>>
>>
>> Did you actually reproduce the issue with 2.6.1 or 2.7.0 ? Can we get the
>> associated backtrace reproduced on those versions ?
>>
>> Thanks,
>>
>> Mathieu
>>
>>
>> Root Cause :
>>
>> The hang is mainly caused because of a race condition for the
>> consumer_metadata_socket between the following two flows of code
>> Flow 1. do_sync_metadata() -> lttng_ustconsumer_sync_metadat() ->
>> lttng_ustconsumer_request_metadata() -> lttncomm_recv_unix_sock() ->
>> recvmsg()
>>
>> and
>>
>> Flow 2. lttng_consumer_recv_cmd() -> lttng_ust_consumer_recv_cmd() ->
>> lttng_ustconsumer_recv_metadata()
>>
>> Flow 1 above initiates a metadata requests and recvmsg()s for the reply
>> from sessiond on the consumer_metadata_socket ( it has taken the
>> metadata_socket_lock and the stream->lock ) , but simultaneously this
>> socket is being polled from Flow 2. Once in a while Flow 2 gets to receive
>> while Flow 1 just hangs on recvmsg() . Since Flow 1 has taken the
>> stream->lock, this then introduces a deadlock at a later stage
>> (on_buffer_ready() when the metadata is being flushed. )
>>
>> Think using the metadata_socket_lock in Flow 2 before attempting to
>> receive may address this issue, but is a bit complicated as the message
>> would have already been received at the point of using that lock. The other
>> alternative is to remove the complete recv() from Flow 1 and let that be
>> taken care by Flow 2.
>>
>> Please let me know if this has been fixed in the later builds, may be I
>> searched it wrong.
>>
>>
>> Regards,
>> Aravind.
>>
>> On Sat, Dec 12, 2015 at 4:13 AM, Mathieu Desnoyers <
>> mathieu.desnoyers@efficios.com> wrote:
>>
>>> Hi,
>>>
>>> Please upgrade to 2.6.1 or 2.7.0, those are old issues.
>>>
>>> Thanks,
>>>
>>> Mathieu
>>>
>>>
>>> ----- On Dec 11, 2015, at 1:42 PM, Aravind HT <aravind.ht@gmail.com>
>>> wrote:
>>>
>>>
>>> NPBS understand it, I will continue to investigate this, got a bit side
>>> tracked with the sessiond hang issue that I mentioned earlier.
>>> Making some progress on that, will update as soon as I find the root
>>> cause, it looks like a mutex deadlock between
>>> the following threads in the consumerd daemon
>>>
>>> #1  0x0000000000410b5d in lttcomm_recv_unix_sock (sock=26, buf=buf@entry=0x7f16acca57f0,
>>> len=len@entry=4476) at unix.c:175
>>> #2  0x000000000041dc96 in lttng_ustconsumer_request_metadata
>>> (ctx=ctx@entry=0x640d10, channel=0x7f16980192b0, timer=timer@entry=0,
>>> wait=wait@entry=0) at ust-consumer.c:2331
>>> #3  0x000000000041ff3c in lttng_ustconsumer_sync_metadata (ctx=ctx@entry=0x640d10,
>>> metadata=metadata@entry=0x7f169801aa40) at ust-consumer.c:1829
>>> #4  0x000000000040eabf in do_sync_metadata (ctx=<optimized out>,
>>> metadata=<optimized out>) at consumer-stream.c:468
>>> #5  consumer_stream_sync_metadata (ctx=ctx@entry=0x640d10,
>>> session_id=8) at consumer-stream.c:545
>>> #6  0x000000000041d316 in lttng_ustconsumer_read_subbuffer
>>> (stream=stream@entry=0x7f1698018fa0, ctx=ctx@entry=0x640d10) at
>>> ust-consumer.c:2063
>>> #7  0x000000000040b613 in lttng_consumer_read_subbuffer
>>> (stream=0x7f1698018fa0, ctx=0x640d10) at consumer.c:3203
>>> #8  0x0000000000409608 in consumer_thread_data_poll (data=0x640d10) at
>>> consumer.c:2557
>>> #9  0x00007f16af6a6fe3 in start_thread (arg=0x7f16acca7700) at
>>> pthread_create.c:312
>>> #10 0x00007f16af3dcafd in clone () at
>>> ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
>>>
>>> and
>>>
>>> consumer_thread_metadata_poll() calling lttng_consumer_read_subbuffer()
>>> / on_buffer_ready() calling pthread_mutex_lock(stream->lock) and then
>>> hanging
>>>
>>>
>>> This again can be reproduced by the same script, just running
>>> consecutive sessions of short duration (<10 seconds) with traces getting
>>> captured on a remote node.
>>> I hit this every time I run the script.
>>>
>>> Regards,
>>> Aravind.
>>>
>>>
>>>
>>> On Fri, Dec 11, 2015 at 11:33 PM, Jonathan Rajotte Julien <
>>> Jonathan.rajotte-julien@efficios.com> wrote:
>>>
>>>> Hi Aravind,
>>>>
>>>> On 2015-12-11 12:16 PM, Aravind HT wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> Did you get a chance to reproduce the problem with the scripts ? Let
>>>>> me know if you need any help running it.
>>>>>
>>>>
>>>> Time is a scarce resource unfortunately. But until I have the time to
>>>> look at it you might want to investigate if you are able to reproduce it on
>>>> 2.6.1/2.7.0/master on 2 regular (x86_64, ubuntu/debian) VM this would
>>>> really help.
>>>>
>>>> Cheers!
>>>>
>>>>
>>>>> Regards,
>>>>> Aravind.
>>>>>
>>>>>
>>>>> On Wed, Dec 9, 2015 at 3:04 PM, Aravind HT <aravind.ht@gmail.com> wrote:
>>>>>
>>>>>     Sorry about that, not sure how it got missed.
>>>>>     Here it is.
>>>>>
>>>>>
>>>>>     # This needs a two node set up (1. local current node 2. remote
>>>>> node )
>>>>>     # relayd runs on the current node where traces are captured from
>>>>>     the remote node
>>>>>     # remote node runs test applications which generate traces.
>>>>>     # the launch_script_RN is executed on the current node and uses
>>>>>     ssh to execute commands on the remote node, so this part may not
>>>>>     work in every case and may prompt for a password.
>>>>>     # if you experience problems with ssh, see
>>>>>     http://serverfault.com/questions/241588/how-to-automate-ssh-login-with-password
>>>>>
>>>>>     # ====================       To Run  =============================
>>>>>     launch_script_RN.py self_profile -c /tmp/configFile.txt
>>>>>
>>>>>
>>>>>
>>>>>     # configFile.txt is the file which has configuration params that
>>>>>     launchScript
>>>>>     # needs to configure lttng sessions. Below is an explanation of
>>>>>     the different options.
>>>>>     # ===================     configFile.txt
>>>>> =============================
>>>>>
>>>>>     [section1]
>>>>>     # final out put file path
>>>>>     OutputFile = /tmp/Final_report.txt
>>>>>     # the remote node hostname on which test applications run and the
>>>>>     test sessions will be created; this should be something that could
>>>>>     be used with ssh. Traces will be transported from this node to the
>>>>>     lttng_relayd running on the current node.
>>>>>     Node = MY_REMOTE_NODE
>>>>>     # Sub buffer size to start this with
>>>>>     SubBufSize = 16k
>>>>>     # Sub buffer count
>>>>>     SubBufCount = 4
>>>>>     # per uid buffer
>>>>>     BufferScheme = --buffers-uid
>>>>>     # yes
>>>>>     EnableTracing = yes
>>>>>     # Bit rate of the test applications. Comma separated; for example
>>>>>     "1, 3, 3, 50" says 4 test applications producing 1, 3, 3, and
>>>>>     50 kb/s of traces.
>>>>>     # So with the below, we just start with 1 test application
>>>>>     producing 10 kb/s
>>>>>     TestApps = 10
>>>>>     # session lifetime in seconds
>>>>>     TestTime = 10
>>>>>     # Max number of successive sessions to configure; if n, then n-1
>>>>>     sessions are run, e.g. MaxRun = 2 will run 1 session.
>>>>>     MaxRun = 100
>>>>>
>>>>>
>>>>>     # ====================  Place the following files under
>>>>>     ===============
>>>>>
>>>>>     # /tmp on the remote node
>>>>>     clean_RemNode_apps.sh
>>>>>     report_lttng_script.sh
>>>>>
>>>>>     # rest of the scripts go under /usr/sbin on the current local node
>>>>>     on which lttng_relayd runs
>>>>>     # Define a trace point MY_TRACE to take a single string arg with
>>>>>     LOG_TEST_APP_PROFILING as the provider, compile test lttng_testApp
>>>>>     and place it under /usr/sbin of the remote host
>>>>>
>>>>>     # in launch_script_RN.py change currentNodeIP to the IP address on
>>>>>     which relayd is receiving, default ports are used.
>>>>>
>>>>>     # lttng_relayd is started as
>>>>>     /usr/bin/lttng-relayd -o /var/log/lttng-traces -d
>>>>>
>>>>>     # lttng_sessiond is started as
>>>>>     /usr/bin/lttng-sessiond --consumerd32-path
>>>>>     /usr/lib/lttng/libexec/lttng-consumerd --consumerd32-libdir
>>>>>     /usr/lib/ --consumerd64-path
>>>>>     /usr/lib64/lttng/libexec/lttng-consumerd --consumerd64-libdir
>>>>>     /usr/lib64/ -b --no-kernel
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>     Regards,
>>>>>     Aravind.
>>>>>
>>>>>     On Wed, Dec 9, 2015 at 12:57 AM, Jonathan Rajotte Julien
>>>>>     <Jonathan.rajotte-julien@efficios.com> wrote:
>>>>>
>>>>>         Hi Aravind,
>>>>>
>>>>>         There is no README in the archive you sent.
>>>>>
>>>>>         Cheers
>>>>>
>>>>>         On 2015-12-08 07:51 AM, Aravind HT wrote:
>>>>>
>>>>>             Hi,
>>>>>
>>>>>             I am trying to upgrade in parallel, but this issue may
>>>>>             still be present after I upgrade or may be temporarily
>>>>>             masked. So I need to find the root cause for this and then
>>>>>             see if it's available on the latest before committing to
>>>>>             upgrade.
>>>>>
>>>>>             There is another issue I'm hitting: the lttng list
>>>>>             command hangs after lttng destroy of a session when
>>>>>             running the profiling.
>>>>>
>>>>>             I found that the 64-bit consumerd goes into an infinite
>>>>>             loop waiting to flush metadata in
>>>>>             lttng_ustconsumer_recv_metadata(), in
>>>>>             while (consumer_metadata_cache_flushed(channel, offset + len, timer)).
>>>>>             In consumer_metadata_cache,
>>>>>             channel->metadata_stream->endpoint_status is
>>>>>             CONSUMER_ENDPOINT_ACTIVE and
>>>>>             metadata_stream->ust_metadata_pushed is 0 while offset has
>>>>>             some value, so the call always returns 1 from the last
>>>>>             else {} block, resulting in an infinite loop. Searching
>>>>>             the forum, I found the same issue reported here:
>>>>>             https://www.mail-archive.com/lttng-dev@lists.lttng.org/msg07982.html
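>>>>>
>>>>>             In case it helps to see the shape of it, here is a
>>>>>             simplified sketch of the wait loop being described (not
>>>>>             the actual LTTng source; the names are stand-ins for the
>>>>>             symbols mentioned above). If ust_metadata_pushed never
>>>>>             advances while the endpoint stays ACTIVE, the check keeps
>>>>>             returning 1 and the caller never leaves the while loop:
>>>>>
>>>>>             enum endpoint_status { ENDPOINT_ACTIVE, ENDPOINT_INACTIVE };
>>>>>
>>>>>             struct metadata_stream {
>>>>>                     enum endpoint_status endpoint_status;
>>>>>                     unsigned long ust_metadata_pushed;
>>>>>             };
>>>>>
>>>>>             /* Returns 1 while the flush is still pending, 0 once the
>>>>>              * consumer has caught up or the stream is going away. */
>>>>>             static int cache_flushed_up_to(struct metadata_stream *s,
>>>>>                                            unsigned long offset)
>>>>>             {
>>>>>                     if (s->endpoint_status != ENDPOINT_ACTIVE)
>>>>>                             return 0;  /* stream torn down, give up */
>>>>>                     if (s->ust_metadata_pushed >= offset)
>>>>>                             return 0;  /* flushed up to offset */
>>>>>                     return 1;          /* still pending, caller retries */
>>>>>             }
>>>>>
>>>>>             static void wait_metadata_flushed(struct metadata_stream *s,
>>>>>                                               unsigned long offset)
>>>>>             {
>>>>>                     while (cache_flushed_up_to(s, offset)) {
>>>>>                             /* spins if ust_metadata_pushed never moves */
>>>>>                     }
>>>>>             }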
>>>>>
>>>>>             Regards,
>>>>>             Aravind.
>>>>>
>>>>>
>>>>>             On Tue, Dec 8, 2015 at 12:43 AM, Jonathan Rajotte
>>>>>             <jonathan.r.julien@gmail.com> wrote:
>>>>>
>>>>>                 Hi Aravind,
>>>>>
>>>>>                 Did you have the chance to upgrade to 2.6.1? If so,
>>>>>                 were you able to reproduce?
>>>>>
>>>>>                 Cheers
>>>>>
>>>>>                 On Mon, Dec 7, 2015 at 1:07 PM, Aravind HT
>>>>>                 <aravind.ht@gmail.com> wrote:
>>>>>
>>>>>                     Hi,
>>>>>
>>>>>                     I have attached the complete profiling scripts
>>>>>                     here; they are a bit shabby, I'm new to Python.
>>>>>
>>>>>                     There is a README with the details on how to
>>>>>                     execute it.
>>>>>                     I'm using Yocto 1.6 on x86_64 platforms on both
>>>>>                     nodes.
>>>>>
>>>>>
>>>>>                     Running this script when there are other sessions
>>>>>             running
>>>>>                     seems to reproduce this problem easily.
>>>>>                     Please try it and let me know if you have any
>>>>> issues
>>>>>                     reproducing the problem.
>>>>>
>>>>>                     Regards,
>>>>>                     Aravind.
>>>>>
>>>>>                     On Sat, Dec 5, 2015 at 5:23 PM, Jérémie Galarneau
>>>>>                     <jeremie.galarneau@efficios.com> wrote:
>>>>>
>>>>>                         On Fri, Dec 4, 2015 at 11:06 PM, Aravind HT
>>>>>                         <aravind.ht@gmail.com> wrote:
>>>>>                         > I am using 2.6.0 .I will try to share the
>>>>>             code that I'm
>>>>>                         using here in some
>>>>>                         > time. If there are any specific fixes that
>>>>>             are relevant
>>>>>                         to this issue, see
>>>>>                         > if you can provide a link to them. I would
>>>>>             ideally like
>>>>>                         to try them out
>>>>>                         > before trying a full upgrade to the latest
>>>>>             versions.
>>>>>
>>>>>                         Hi,
>>>>>
>>>>>                         Can you provide more information on the
>>>>>             system? Which
>>>>>                         distribution,
>>>>>                         architecture, kernel version?
>>>>>
>>>>>                         The verbose sessiond logs might help pinpoint
>>>>> any
>>>>>                         unexpected behaviour
>>>>>                         here (are all applications registering as
>>>>>             expected?).
>>>>>
>>>>>                         Jérémie
>>>>>
>>>>>                         >
>>>>>                         > On Fri, Dec 4, 2015 at 6:11 PM, Jérémie
>>>>>             Galarneau
>>>>>                         > <jeremie.galarneau@efficios.com
>>>>>             <mailto:jeremie.galarneau@efficios.com>
>>>>>                         <mailto:jeremie.galarneau@efficios.com
>>>>>             <mailto:jeremie.galarneau@efficios.com>>> wrote:
>>>>>                         >>
>>>>>                         >> Hi Aravind,
>>>>>                         >>
>>>>>                         >> Can't say I have looked at everything you
>>>>>             sent yet, but
>>>>>                         as a
>>>>>                         >> preemptive question, which version are we
>>>>>             talking about
>>>>>                         here? 2.6.0 or
>>>>>                         >> 2.6.1? 2.6.1 contains a lot of relay daemon
>>>>>             fixes.
>>>>>                         >>
>>>>>                         >> Thanks,
>>>>>                         >> Jérémie
>>>>>                         >>
>>>>>                         >> On Thu, Dec 3, 2015 at 7:01 AM, Aravind HT
>>>>>                         <aravind.ht@gmail.com
>>>>>             <mailto:aravind.ht@gmail.com>> wrote:
>>>>>                         >> > Hi,
>>>>>                         >> >
>>>>>                         >> > I am trying to obtain the performance
>>>>>             characteristics
>>>>>                         of lttng with the
>>>>>                         >> > use
>>>>>                         >> > of test applications. Traces are being
>>>>>             produced on a
>>>>>                         local node and
>>>>>                         >> > delivered to relayd that is running on a
>>>>>             separate
>>>>>                         node for storage.
>>>>>                         >> >
>>>>>                         >> > An lttng session with the test
>>>>>             applications producing
>>>>>                         an initial bit
>>>>>                         >> > rate of
>>>>>                         >> > 10 kb/s is started and run for about 30
>>>>>             seconds. The
>>>>>                         starting sub-buffer
>>>>>                         >> > size is kept at 128 kb and sub-buf count
>>>>>             at 4. The
>>>>>                         session is then
>>>>>                         >> > stopped
>>>>>                         >> > and destroyed and traces are analyzed to
>>>>>             see if there
>>>>>                         are any drops.
>>>>>                         >> > This is
>>>>>                         >> > being done in a loop with every
>>>>>             subsequent session
>>>>>                         having an increment
>>>>>                         >> > of 2
>>>>>                         >> > kb/s as long as there are no drops. If
>>>>>             there are
>>>>>                         drops, I increase the
>>>>>                         >> > buffer size by a factor of x2 without
>>>>>             incrementing
>>>>>                         the bit rate.
>>>>>                         >> >
>>>>>                         >> > I see trace drops happening consistently
>>>>>             with test
>>>>>                         apps producing traces
>>>>>                         >> > at
>>>>>                         >> > less than 40 kb/s, it doesnt seem to help
>>>>>             even if I
>>>>>                         started with 1mb x 4
>>>>>                         >> > sub-buffers.
>>>>>                         >> >
>>>>>                         >> > Analysis :
>>>>>                         >> >
>>>>>                         >> > I have attached the lttng_relayd ,
>>>>>             lttng_consumerd_64
>>>>>                         logs and the
>>>>>                         >> > entire
>>>>>                         >> > trace directory, hope you will be able to
>>>>>             view it.
>>>>>                         >> > I have modified lttng_relayd code to dump
>>>>>             the traces
>>>>>                         being captured in
>>>>>                         >> > the
>>>>>                         >> > lttng_relayd logs along with debug info.
>>>>>                         >> >
>>>>>                         >> > Each test app is producing logs in the
>>>>>             form of  :
>>>>>                         >> > "TraceApp PID - 31940 THID - 31970
>>>>>             @threadRate - 1032
>>>>>                         b/s appRate - 2079
>>>>>                         >> > b/s
>>>>>                         >> > threadTraceNum - 9 appTraceNum - 18
>>>>>        sleepTime - 192120"
>>>>>                         >> >
>>>>>                         >> > The test application PID, test
>>>>>             application thread id,
>>>>>                         thread bit rate,
>>>>>                         >> > test
>>>>>                         >> > app bit rate, thread trace number and
>>>>>             application
>>>>>                         trace number s are
>>>>>                         >> > part of
>>>>>                         >> > the trace. So in the above trace, the
>>>>>             thread is
>>>>>                         producing at 1 kb/s and
>>>>>                         >> > the
>>>>>                         >> > whole test app is producing at 2 kb/s.
>>>>>                         >> >
>>>>>                         >> > If we look at the babeltrace out put, we
>>>>>             see that the
>>>>>                         Trace with
>>>>>                         >> > TraceApp
>>>>>                         >> > PID - 31940 appTraceNum 2 is missing ,
>>>>>             with 1, 3, 4,
>>>>>                         5 and so on being
>>>>>                         >> > successfully captured.
>>>>>                         >> > I looked at the lttng_relayd logs and
>>>>>             found that
>>>>>                         trace of "appTraceNum
>>>>>                         >> > 2" is
>>>>>                         >> > not delivered/generated by the consumerd
>>>>>             to the
>>>>>                         relayd in sequence with
>>>>>                         >> > other traces. To rule out that this is
>>>>>             not a test
>>>>>                         application problem,
>>>>>                         >> > you
>>>>>                         >> > can look at line ltttng_relayd log :
>>>>>             12778 and see
>>>>>                         traces from
>>>>>                         >> > appTraceNum -
>>>>>                         >> > 1 to appTraceNum - 18 including the
>>>>>             appTraceNum 2 are
>>>>>                         "re-delivered" by
>>>>>                         >> > the
>>>>>                         >> > consumerd to the relayd.
>>>>>                         >> > Essentially, I see appTraceNum 1 through
>>>>>             appTraceNum
>>>>>                         18 being delivered
>>>>>                         >> > twice, once individually where
>>>>>             appTraceNum 2 is
>>>>>                         missing and once as a
>>>>>                         >> > group
>>>>>                         >> > at line 12778 where its present.
>>>>>                         >> >
>>>>>                         >> >
>>>>>                         >> > Request help with
>>>>>                         >> > 1. why traces are delivered twice, is it
>>>>>             by design or
>>>>>                         a genuine problem
>>>>>                         >> > ?
>>>>>                         >> > 2. how to avoid traces being dropped even
>>>>>             though
>>>>>                         buffers are
>>>>>                         >> > sufficiently
>>>>>                         >> > large enough ?
>>>>>                         >> >
>>>>>                         >> >
>>>>>                         >> > Regards,
>>>>>                         >> > Aravind.
>>>>>                         >> >
>>>>>                         >> >
>>>>>             _______________________________________________
>>>>>                         >> > lttng-dev mailing list
>>>>>                         >> > lttng-dev@lists.lttng.org
>>>>>             <mailto:lttng-dev@lists.lttng.org>
>>>>>                         >> >
>>>>>             http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev
>>>>>                         >> >
>>>>>                         >>
>>>>>                         >>
>>>>>                         >>
>>>>>                         >> --
>>>>>                         >> Jérémie Galarneau
>>>>>                         >> EfficiOS Inc.
>>>>>                         >> http://www.efficios.com
>>>>>                         >
>>>>>                         >
>>>>>
>>>>>
>>>>>
>>>>>                         --
>>>>>                         Jérémie Galarneau
>>>>>                         EfficiOS Inc.
>>>>>             http://www.efficios.com
>>>>>
>>>>>
>>>>>
>>>>>             _______________________________________________
>>>>>                     lttng-dev mailing list
>>>>>             lttng-dev@lists.lttng.org
>>>>>             <mailto:lttng-dev@lists.lttng.org>
>>>>>             <mailto:lttng-dev@lists.lttng.org
>>>>>             <mailto:lttng-dev@lists.lttng.org>>
>>>>>             http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>                 --     Jonathan Rajotte Julien
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>             _______________________________________________
>>>>>             lttng-dev mailing list
>>>>>             lttng-dev@lists.lttng.org <mailto:
>>>>> lttng-dev@lists.lttng.org>
>>>>>             http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev
>>>>>
>>>>>
>>>>>         --         Jonathan R. Julien
>>>>>         Efficios
>>>>>
>>>>>
>>>>>
>>>>>
>>>> --
>>>> Jonathan R. Julien
>>>> Efficios
>>>>
>>>>
>>>
>>> _______________________________________________
>>> lttng-dev mailing list
>>> lttng-dev@lists.lttng.org
>>> http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev
>>>
>>>
>>> --
>>> Mathieu Desnoyers
>>> EfficiOS Inc.
>>> http://www.efficios.com
>>>
>>
>>
>> --
>> Mathieu Desnoyers
>> EfficiOS Inc.
>> http://www.efficios.com
>>
>
>
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com
>

[-- Attachment #1.2: Type: text/html, Size: 42003 bytes --]

[-- Attachment #2: Type: text/plain, Size: 155 bytes --]

_______________________________________________
lttng-dev mailing list
lttng-dev@lists.lttng.org
http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev


end of thread, other threads:[~2015-12-30  8:53 UTC | newest]

Thread overview: 18+ messages
     [not found] <CACd+_b7qEeW1XD1H_0MT4dDqwLv68rfOURfNDRh6SO=xS4k-Og@mail.gmail.com>
2015-12-04 12:41 ` Relayd trace drops Jérémie Galarneau
2015-12-04 14:06 ` Jonathan Rajotte Julien
     [not found] ` <CA+jJMxsuF+Ywvvhe3L4OqBHTsohyFffPoF4mZWFDqPvBC-P80w@mail.gmail.com>
2015-12-05  4:06   ` Aravind HT
     [not found]   ` <CACd+_b5AGJKWH58Kq3QP8qW8R7Q9JH3fqbEwU=z8pyg5-DL=KQ@mail.gmail.com>
2015-12-05 11:53     ` Jérémie Galarneau
     [not found]     ` <CA+jJMxsJReP4ZO2bSJP-rJe+UoJzT09Pix1LhEypbsww_nwrsA@mail.gmail.com>
2015-12-07 18:07       ` Aravind HT
     [not found]       ` <CACd+_b7ScRDo9uL+neOnA2pqVVwexsPsbBd2mBS--yy7Amxbmg@mail.gmail.com>
2015-12-07 19:13         ` Jonathan Rajotte
     [not found]         ` <CAF4oYfF2Li4zau0OLE=JZUtGfPZyFT1FciBr7AG1hvzGdeUycA@mail.gmail.com>
2015-12-08 12:51           ` Aravind HT
     [not found]           ` <CACd+_b6byMnZ_oWBqEk44fJOgpVX-oAGMjfh_WV8AV6cKyK9iw@mail.gmail.com>
2015-12-08 19:27             ` Jonathan Rajotte Julien
     [not found]             ` <56672F33.3070808@efficios.com>
2015-12-09  9:34               ` Aravind HT
     [not found]               ` <CACd+_b6xjN8BEcfqypkhGrM2XWwxhW9RmCVgHfN5jmksGmVnPw@mail.gmail.com>
2015-12-11 17:16                 ` Aravind HT
     [not found]                 ` <CACd+_b46wkEH3jVOfz=SR2eS+ZXuWoL=f81ahNa4uwPNpWTHbw@mail.gmail.com>
2015-12-11 18:03                   ` Jonathan Rajotte Julien
     [not found]                   ` <566B0FEA.7040307@efficios.com>
2015-12-11 18:42                     ` Aravind HT
     [not found]                     ` <CACd+_b6za+gk_fFhFapz-zRpwO8B31UX6+hMJxcjJsxrKVM5ow@mail.gmail.com>
2015-12-11 22:43                       ` Mathieu Desnoyers
     [not found]                       ` <1125287188.233739.1449873787684.JavaMail.zimbra@efficios.com>
2015-12-16 13:02                         ` Aravind HT
     [not found]                         ` <CACd+_b7X3+Hzi5HYt3KggWMCVNPyhyDb4aK3-Pik98wcb70kCQ@mail.gmail.com>
2015-12-16 14:18                           ` Mathieu Desnoyers
     [not found]                           ` <654480294.257124.1450275539752.JavaMail.zimbra@efficios.com>
2015-12-28 17:24                             ` Aravind HT
     [not found]                             ` <CACd+_b5cQ__Gqynic40iXw8UBpdd4cU9H2R=_HEqm_22LxApKw@mail.gmail.com>
2015-12-30  2:05                               ` Mathieu Desnoyers
     [not found]                               ` <1521366980.305833.1451441138158.JavaMail.zimbra@efficios.com>
2015-12-30  8:53                                 ` Aravind HT
