All of lore.kernel.org
 help / color / mirror / Atom feed
* [BUG] perf report: ordered events and flushing bug
@ 2015-03-12  3:32 Stephane Eranian
  2015-03-12  7:57 ` Namhyung Kim
  2015-03-12  9:02 ` Adrian Hunter
  0 siblings, 2 replies; 15+ messages in thread
From: Stephane Eranian @ 2015-03-12  3:32 UTC (permalink / raw)
  To: LKML
  Cc: Arnaldo Carvalho de Melo, Namhyung Kim, Jiri Olsa, David Ahern,
	Peter Zijlstra, mingo

Hi,

I am working on the JIT support to improve the flow and have
perf record inject the MMAPs at the end of the collection. For
that I piggyback on the buildid pass. To avoid rewriting the entire perf.data
file, I simply append the the MMAP records at the end of the file. And that
puts them out-of-order in time systematically. But I thought it would be okay
because perf report would sort them by timestamps again.

Well, it does not in all cases! Why?

Because of the round flushing. Based on how far out-of-order an event is,
it may not be processed correctly because of round flushing. I believe this
may only impact MMAP records.  This is a serious issue because
mmaps drive symbolization of samples. If samples are processed without the
proper dso mapping, then samples may not be symbolized or may be wrongly
symbolized.

So far, the workaround I found was to set the oe->next_flush = 0 for
the ROUND mode.
In other words, do not flush anything until FINAL. To me, this is the
only sensible
way of avoiding this kind of problems. I am not sure I understand the
point of flushing
anyway, except to minimize memory footprint, maybe. But it does not
work with vastly
out-of-order mmaps.

Do you have a better solution?
Thanks.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [BUG] perf report: ordered events and flushing bug
  2015-03-12  3:32 [BUG] perf report: ordered events and flushing bug Stephane Eranian
@ 2015-03-12  7:57 ` Namhyung Kim
  2015-03-12  9:02 ` Adrian Hunter
  1 sibling, 0 replies; 15+ messages in thread
From: Namhyung Kim @ 2015-03-12  7:57 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: LKML, Arnaldo Carvalho de Melo, Jiri Olsa, David Ahern,
	Peter Zijlstra, mingo

Hi Stephane,

On Wed, Mar 11, 2015 at 11:32:15PM -0400, Stephane Eranian wrote:
> Hi,
> 
> I am working on the JIT support to improve the flow and have
> perf record inject the MMAPs at the end of the collection. For
> that I piggyback on the buildid pass. To avoid rewriting the entire perf.data
> file, I simply append the the MMAP records at the end of the file. And that
> puts them out-of-order in time systematically. But I thought it would be okay
> because perf report would sort them by timestamps again.
> 
> Well, it does not in all cases! Why?
> 
> Because of the round flushing. Based on how far out-of-order an event is,
> it may not be processed correctly because of round flushing. I believe this
> may only impact MMAP records.  This is a serious issue because
> mmaps drive symbolization of samples. If samples are processed without the
> proper dso mapping, then samples may not be symbolized or may be wrongly
> symbolized.
> 
> So far, the workaround I found was to set the oe->next_flush = 0 for
> the ROUND mode.
> In other words, do not flush anything until FINAL. To me, this is the
> only sensible
> way of avoiding this kind of problems. I am not sure I understand the
> point of flushing
> anyway, except to minimize memory footprint, maybe. But it does not
> work with vastly
> out-of-order mmaps.
> 
> Do you have a better solution?

You may want to see my multi-thread work [1] for this.  It supports
out-of-time processing of samples by splitting metadata and samples
with index.  I think you can just inject the MMAPs to the metadata
area - yes, it requires whole file rewrite anyway but you can skip the
sample data parts.

[1] https://lkml.org/lkml/2015/3/2/988

Thanks,
Namhyung

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [BUG] perf report: ordered events and flushing bug
  2015-03-12  3:32 [BUG] perf report: ordered events and flushing bug Stephane Eranian
  2015-03-12  7:57 ` Namhyung Kim
@ 2015-03-12  9:02 ` Adrian Hunter
  2015-03-12 19:05   ` Stephane Eranian
  1 sibling, 1 reply; 15+ messages in thread
From: Adrian Hunter @ 2015-03-12  9:02 UTC (permalink / raw)
  To: Stephane Eranian, LKML
  Cc: Arnaldo Carvalho de Melo, Namhyung Kim, Jiri Olsa, David Ahern,
	Peter Zijlstra, mingo

On 12/03/15 05:32, Stephane Eranian wrote:
> Hi,
> 
> I am working on the JIT support to improve the flow and have
> perf record inject the MMAPs at the end of the collection. For
> that I piggyback on the buildid pass. To avoid rewriting the entire perf.data
> file, I simply append the the MMAP records at the end of the file. And that
> puts them out-of-order in time systematically. But I thought it would be okay
> because perf report would sort them by timestamps again.
> 
> Well, it does not in all cases! Why?
> 
> Because of the round flushing. Based on how far out-of-order an event is,
> it may not be processed correctly because of round flushing. I believe this
> may only impact MMAP records.  This is a serious issue because
> mmaps drive symbolization of samples. If samples are processed without the
> proper dso mapping, then samples may not be symbolized or may be wrongly
> symbolized.
> 
> So far, the workaround I found was to set the oe->next_flush = 0 for
> the ROUND mode.
> In other words, do not flush anything until FINAL. To me, this is the
> only sensible
> way of avoiding this kind of problems. I am not sure I understand the
> point of flushing
> anyway, except to minimize memory footprint, maybe. But it does not
> work with vastly
> out-of-order mmaps.
> 
> Do you have a better solution?

You could hook the ordered event delivery (see Arnaldo's perf/core branch):

if (tool->ordered_events) {
	inject->deliver = session->ordered_events.deliver;
	session->ordered_events.deliver = jit_mmap_deliver;
}

int jit_mmap_deliver(struct ordered_events *oe, struct ordered_event *event, struct perf_sample *sample)
{
	struct perf_inject *inject = container_of(oe->tool, struct perf_inject, tool);

	if (next_jit_mmap_time < sample->time) {
		<synth mmap event>
		perf_session__deliver_synth_event(...);
	}	
	return inject->deliver(oe, event, sample);
}

Need to get Arnaldo's comment on this approach first though.


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [BUG] perf report: ordered events and flushing bug
  2015-03-12  9:02 ` Adrian Hunter
@ 2015-03-12 19:05   ` Stephane Eranian
  2015-03-12 19:13     ` David Ahern
  0 siblings, 1 reply; 15+ messages in thread
From: Stephane Eranian @ 2015-03-12 19:05 UTC (permalink / raw)
  To: Adrian Hunter
  Cc: LKML, Arnaldo Carvalho de Melo, Namhyung Kim, Jiri Olsa,
	David Ahern, Peter Zijlstra, mingo

On Thu, Mar 12, 2015 at 5:02 AM, Adrian Hunter <adrian.hunter@intel.com> wrote:
> On 12/03/15 05:32, Stephane Eranian wrote:
>> Hi,
>>
>> I am working on the JIT support to improve the flow and have
>> perf record inject the MMAPs at the end of the collection. For
>> that I piggyback on the buildid pass. To avoid rewriting the entire perf.data
>> file, I simply append the the MMAP records at the end of the file. And that
>> puts them out-of-order in time systematically. But I thought it would be okay
>> because perf report would sort them by timestamps again.
>>
>> Well, it does not in all cases! Why?
>>
>> Because of the round flushing. Based on how far out-of-order an event is,
>> it may not be processed correctly because of round flushing. I believe this
>> may only impact MMAP records.  This is a serious issue because
>> mmaps drive symbolization of samples. If samples are processed without the
>> proper dso mapping, then samples may not be symbolized or may be wrongly
>> symbolized.
>>
>> So far, the workaround I found was to set the oe->next_flush = 0 for
>> the ROUND mode.
>> In other words, do not flush anything until FINAL. To me, this is the
>> only sensible
>> way of avoiding this kind of problems. I am not sure I understand the
>> point of flushing
>> anyway, except to minimize memory footprint, maybe. But it does not
>> work with vastly
>> out-of-order mmaps.
>>
>> Do you have a better solution?
>
> You could hook the ordered event delivery (see Arnaldo's perf/core branch):
>
> if (tool->ordered_events) {
>         inject->deliver = session->ordered_events.deliver;
>         session->ordered_events.deliver = jit_mmap_deliver;
> }
>
> int jit_mmap_deliver(struct ordered_events *oe, struct ordered_event *event, struct perf_sample *sample)
> {
>         struct perf_inject *inject = container_of(oe->tool, struct perf_inject, tool);
>
>         if (next_jit_mmap_time < sample->time) {
>                 <synth mmap event>
>                 perf_session__deliver_synth_event(...);
>         }
>         return inject->deliver(oe, event, sample);
> }
>
I don't quite understand what this does?
Are you saying that in perf/core there is a new callback mechanism to
override the deliver phase
when you need ordered event and that callback could be used to avoid
the flushing?

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [BUG] perf report: ordered events and flushing bug
  2015-03-12 19:05   ` Stephane Eranian
@ 2015-03-12 19:13     ` David Ahern
  2015-03-12 19:23       ` Stephane Eranian
  2015-03-12 20:27       ` Adrian Hunter
  0 siblings, 2 replies; 15+ messages in thread
From: David Ahern @ 2015-03-12 19:13 UTC (permalink / raw)
  To: Stephane Eranian, Adrian Hunter
  Cc: LKML, Arnaldo Carvalho de Melo, Namhyung Kim, Jiri Olsa,
	Peter Zijlstra, mingo

On 3/12/15 1:05 PM, Stephane Eranian wrote:
> On Thu, Mar 12, 2015 at 5:02 AM, Adrian Hunter <adrian.hunter@intel.com> wrote:
>> On 12/03/15 05:32, Stephane Eranian wrote:
>>> Hi,
>>>
>>> I am working on the JIT support to improve the flow and have
>>> perf record inject the MMAPs at the end of the collection. For
>>> that I piggyback on the buildid pass. To avoid rewriting the entire perf.data
>>> file, I simply append the the MMAP records at the end of the file. And that
>>> puts them out-of-order in time systematically. But I thought it would be okay
>>> because perf report would sort them by timestamps again.
>>>
>>> Well, it does not in all cases! Why?
>>>
>>> Because of the round flushing. Based on how far out-of-order an event is,
>>> it may not be processed correctly because of round flushing. I believe this
>>> may only impact MMAP records.  This is a serious issue because
>>> mmaps drive symbolization of samples. If samples are processed without the
>>> proper dso mapping, then samples may not be symbolized or may be wrongly
>>> symbolized.
>>>
>>> So far, the workaround I found was to set the oe->next_flush = 0 for
>>> the ROUND mode.
>>> In other words, do not flush anything until FINAL. To me, this is the
>>> only sensible
>>> way of avoiding this kind of problems. I am not sure I understand the
>>> point of flushing
>>> anyway, except to minimize memory footprint, maybe. But it does not
>>> work with vastly
>>> out-of-order mmaps.

Rounds and flushing after them helps with the user experience -- at 
least for some commands. On systems with 1024 cpus perf data files get 
large quickly and the resulting analysis command can appear to hang for 
long periods (e.g., i have done 1 second data collections that generate 
5GB files and take 30-45 minutes before perf shows any output).

I have a patch to perf-record to allow a user to specify the poll 
timeout. That helps define the maximum time between 'rounds' on the 
record side and then commands that show each event (e.g, perf script) 
will show output much more quickly without the need to consume large 
amounts of memory or process the entire file before displaying events.

>>>
>>> Do you have a better solution?
>>
>> You could hook the ordered event delivery (see Arnaldo's perf/core branch):
>>
>> if (tool->ordered_events) {
>>          inject->deliver = session->ordered_events.deliver;
>>          session->ordered_events.deliver = jit_mmap_deliver;
>> }
>>
>> int jit_mmap_deliver(struct ordered_events *oe, struct ordered_event *event, struct perf_sample *sample)
>> {
>>          struct perf_inject *inject = container_of(oe->tool, struct perf_inject, tool);
>>
>>          if (next_jit_mmap_time < sample->time) {
>>                  <synth mmap event>
>>                  perf_session__deliver_synth_event(...);
>>          }
>>          return inject->deliver(oe, event, sample);
>> }
>>
> I don't quite understand what this does?
> Are you saying that in perf/core there is a new callback mechanism to
> override the deliver phase
> when you need ordered event and that callback could be used to avoid
> the flushing?
>

I think the suggestion is to push the mmap events into the stream at the 
appropriate time slot.

David

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [BUG] perf report: ordered events and flushing bug
  2015-03-12 19:13     ` David Ahern
@ 2015-03-12 19:23       ` Stephane Eranian
  2015-03-12 19:34         ` David Ahern
  2015-03-12 20:27       ` Adrian Hunter
  1 sibling, 1 reply; 15+ messages in thread
From: Stephane Eranian @ 2015-03-12 19:23 UTC (permalink / raw)
  To: David Ahern
  Cc: Adrian Hunter, LKML, Arnaldo Carvalho de Melo, Namhyung Kim,
	Jiri Olsa, Peter Zijlstra, mingo

On Thu, Mar 12, 2015 at 3:13 PM, David Ahern <dsahern@gmail.com> wrote:
> On 3/12/15 1:05 PM, Stephane Eranian wrote:
>>
>> On Thu, Mar 12, 2015 at 5:02 AM, Adrian Hunter <adrian.hunter@intel.com>
>> wrote:
>>>
>>> On 12/03/15 05:32, Stephane Eranian wrote:
>>>>
>>>> Hi,
>>>>
>>>> I am working on the JIT support to improve the flow and have
>>>> perf record inject the MMAPs at the end of the collection. For
>>>> that I piggyback on the buildid pass. To avoid rewriting the entire
>>>> perf.data
>>>> file, I simply append the the MMAP records at the end of the file. And
>>>> that
>>>> puts them out-of-order in time systematically. But I thought it would be
>>>> okay
>>>> because perf report would sort them by timestamps again.
>>>>
>>>> Well, it does not in all cases! Why?
>>>>
>>>> Because of the round flushing. Based on how far out-of-order an event
>>>> is,
>>>> it may not be processed correctly because of round flushing. I believe
>>>> this
>>>> may only impact MMAP records.  This is a serious issue because
>>>> mmaps drive symbolization of samples. If samples are processed without
>>>> the
>>>> proper dso mapping, then samples may not be symbolized or may be wrongly
>>>> symbolized.
>>>>
>>>> So far, the workaround I found was to set the oe->next_flush = 0 for
>>>> the ROUND mode.
>>>> In other words, do not flush anything until FINAL. To me, this is the
>>>> only sensible
>>>> way of avoiding this kind of problems. I am not sure I understand the
>>>> point of flushing
>>>> anyway, except to minimize memory footprint, maybe. But it does not
>>>> work with vastly
>>>> out-of-order mmaps.
>
>
> Rounds and flushing after them helps with the user experience -- at least
> for some commands. On systems with 1024 cpus perf data files get large
> quickly and the resulting analysis command can appear to hang for long
> periods (e.g., i have done 1 second data collections that generate 5GB files
> and take 30-45 minutes before perf shows any output).
>
> I have a patch to perf-record to allow a user to specify the poll timeout.
> That helps define the maximum time between 'rounds' on the record side and
> then commands that show each event (e.g, perf script) will show output much
> more quickly without the need to consume large amounts of memory or process
> the entire file before displaying events.
>
I can understand this but not at the cost of correctness. You have no guarantee
that events will appear in the perf.data with a fixed maximum time window, i.e.,
beyond the window, out-of-order events may be discarded.
There ought to be a better way of keeping the user aware of what's going on
with the progress bar.


>>>>
>>>> Do you have a better solution?
>>>
>>>
>>> You could hook the ordered event delivery (see Arnaldo's perf/core
>>> branch):
>>>
>>> if (tool->ordered_events) {
>>>          inject->deliver = session->ordered_events.deliver;
>>>          session->ordered_events.deliver = jit_mmap_deliver;
>>> }
>>>
>>> int jit_mmap_deliver(struct ordered_events *oe, struct ordered_event
>>> *event, struct perf_sample *sample)
>>> {
>>>          struct perf_inject *inject = container_of(oe->tool, struct
>>> perf_inject, tool);
>>>
>>>          if (next_jit_mmap_time < sample->time) {
>>>                  <synth mmap event>
>>>                  perf_session__deliver_synth_event(...);
>>>          }
>>>          return inject->deliver(oe, event, sample);
>>> }
>>>
>> I don't quite understand what this does?
>> Are you saying that in perf/core there is a new callback mechanism to
>> override the deliver phase
>> when you need ordered event and that callback could be used to avoid
>> the flushing?
>>
>
> I think the suggestion is to push the mmap events into the stream at the
> appropriate time slot.
>
> David

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [BUG] perf report: ordered events and flushing bug
  2015-03-12 19:23       ` Stephane Eranian
@ 2015-03-12 19:34         ` David Ahern
  2015-03-12 19:39           ` Stephane Eranian
  0 siblings, 1 reply; 15+ messages in thread
From: David Ahern @ 2015-03-12 19:34 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: Adrian Hunter, LKML, Arnaldo Carvalho de Melo, Namhyung Kim,
	Jiri Olsa, Peter Zijlstra, mingo

On 3/12/15 1:23 PM, Stephane Eranian wrote:
>> Rounds and flushing after them helps with the user experience -- at least
>> for some commands. On systems with 1024 cpus perf data files get large
>> quickly and the resulting analysis command can appear to hang for long
>> periods (e.g., i have done 1 second data collections that generate 5GB files
>> and take 30-45 minutes before perf shows any output).
>>
>> I have a patch to perf-record to allow a user to specify the poll timeout.
>> That helps define the maximum time between 'rounds' on the record side and
>> then commands that show each event (e.g, perf script) will show output much
>> more quickly without the need to consume large amounts of memory or process
>> the entire file before displaying events.
>>
> I can understand this but not at the cost of correctness. You have no guarantee
> that events will appear in the perf.data with a fixed maximum time window, i.e.,
> beyond the window, out-of-order events may be discarded.
> There ought to be a better way of keeping the user aware of what's going on
> with the progress bar.

Theoretically there are no events coming in out of order. Have you seen 
the comment before process_finished_round() in util/session.c?

In practice it did happen occasionally and perf would dump:
     Warning: Timestamp below last timeslice flush

and then terminate. That was removed recently by Jiri I think and events 
out of the window are discarded (??).

In my case I don't want a progress bar, I want output and I don't need 
the entire file processed for output to start. For a while I was 
dropping back to the ftrace interface in debugfs b/c it was faster.

For the JIT case you tried appending events at the end of the file but 
that makes them no where near the real time points so what you really 
need is to insert events into the stream. Perhaps the separate files 
option would help -- jit events in a separate file and inserted based on 
time as the files are processed?

David

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [BUG] perf report: ordered events and flushing bug
  2015-03-12 19:34         ` David Ahern
@ 2015-03-12 19:39           ` Stephane Eranian
  2015-03-12 19:53             ` David Ahern
  0 siblings, 1 reply; 15+ messages in thread
From: Stephane Eranian @ 2015-03-12 19:39 UTC (permalink / raw)
  To: David Ahern
  Cc: Adrian Hunter, LKML, Arnaldo Carvalho de Melo, Namhyung Kim,
	Jiri Olsa, Peter Zijlstra, mingo

On Thu, Mar 12, 2015 at 3:34 PM, David Ahern <dsahern@gmail.com> wrote:
> On 3/12/15 1:23 PM, Stephane Eranian wrote:
>>>
>>> Rounds and flushing after them helps with the user experience -- at least
>>> for some commands. On systems with 1024 cpus perf data files get large
>>> quickly and the resulting analysis command can appear to hang for long
>>> periods (e.g., i have done 1 second data collections that generate 5GB
>>> files
>>> and take 30-45 minutes before perf shows any output).
>>>
>>> I have a patch to perf-record to allow a user to specify the poll
>>> timeout.
>>> That helps define the maximum time between 'rounds' on the record side
>>> and
>>> then commands that show each event (e.g, perf script) will show output
>>> much
>>> more quickly without the need to consume large amounts of memory or
>>> process
>>> the entire file before displaying events.
>>>
>> I can understand this but not at the cost of correctness. You have no
>> guarantee
>> that events will appear in the perf.data with a fixed maximum time window,
>> i.e.,
>> beyond the window, out-of-order events may be discarded.
>> There ought to be a better way of keeping the user aware of what's going
>> on
>> with the progress bar.
>
>
> Theoretically there are no events coming in out of order. Have you seen the
> comment before process_finished_round() in util/session.c?
>
> In practice it did happen occasionally and perf would dump:
>     Warning: Timestamp below last timeslice flush
>
> and then terminate. That was removed recently by Jiri I think and events out
> of the window are discarded (??).
>
> In my case I don't want a progress bar, I want output and I don't need the
> entire file processed for output to start. For a while I was dropping back
> to the ftrace interface in debugfs b/c it was faster.
>
> For the JIT case you tried appending events at the end of the file but that
> makes them no where near the real time points so what you really need is to
> insert events into the stream. Perhaps the separate files option would help
> -- jit events in a separate file and inserted based on time as the files are
> processed?
>
That's how I had it initially but it mean you have to use perf inject or have
perf record inject at the end, i.e., two passes over the perf.data file.  I can
certainly have this in perf inject.

What the point of having all the ordered event logic if you are saying events
must be saved in order. I don't think there is a way to make that guarantee
when monitoring multiple CPUs at the same time.

>From you description it seems to me, you script wants to process data quickly.
Is this an interactive tool? If not, you need to go through all the
samples anyway.

> David

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [BUG] perf report: ordered events and flushing bug
  2015-03-12 19:39           ` Stephane Eranian
@ 2015-03-12 19:53             ` David Ahern
  2015-03-12 20:06               ` Stephane Eranian
  2015-03-12 20:16               ` Arnaldo Carvalho de Melo
  0 siblings, 2 replies; 15+ messages in thread
From: David Ahern @ 2015-03-12 19:53 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: Adrian Hunter, LKML, Arnaldo Carvalho de Melo, Namhyung Kim,
	Jiri Olsa, Peter Zijlstra, mingo

On 3/12/15 1:39 PM, Stephane Eranian wrote:
> What the point of having all the ordered event logic if you are saying events
> must be saved in order. I don't think there is a way to make that guarantee
> when monitoring multiple CPUs at the same time.

The record command does not analyze the events, it just copies from mmap 
to file in lumps per mmap. e.g., on a given round the perf data file has 
events like this:

    111112223344444444555566666F111111111
    |<------- round --------->|^
                               |
         finished round event -|

where 11111 are events read from mmap1, 2222 are events from mmap2, etc. 
F is the finished round event which a pass over all mmaps has been done.

So for mmap1 all of the 11111 events are in time order, then jumping to 
mmap2 events the 2222 times are time sorted relative to mmap2 but not 
relative to mmap1 events.

The ordered events code sorts the clumps into a time based stream:
     123141641445124564234645656...

David

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [BUG] perf report: ordered events and flushing bug
  2015-03-12 19:53             ` David Ahern
@ 2015-03-12 20:06               ` Stephane Eranian
  2015-03-12 20:50                 ` Arnaldo Carvalho de Melo
  2015-03-12 20:16               ` Arnaldo Carvalho de Melo
  1 sibling, 1 reply; 15+ messages in thread
From: Stephane Eranian @ 2015-03-12 20:06 UTC (permalink / raw)
  To: David Ahern
  Cc: Adrian Hunter, LKML, Arnaldo Carvalho de Melo, Namhyung Kim,
	Jiri Olsa, Peter Zijlstra, mingo

On Thu, Mar 12, 2015 at 3:53 PM, David Ahern <dsahern@gmail.com> wrote:
> On 3/12/15 1:39 PM, Stephane Eranian wrote:
>>
>> What the point of having all the ordered event logic if you are saying
>> events
>> must be saved in order. I don't think there is a way to make that
>> guarantee
>> when monitoring multiple CPUs at the same time.
>
>
> The record command does not analyze the events, it just copies from mmap to
> file in lumps per mmap. e.g., on a given round the perf data file has events
> like this:
>
>    111112223344444444555566666F111111111
>    |<------- round --------->|^
>                               |
>         finished round event -|
>
> where 11111 are events read from mmap1, 2222 are events from mmap2, etc. F
> is the finished round event which a pass over all mmaps has been done.
>
> So for mmap1 all of the 11111 events are in time order, then jumping to
> mmap2 events the 2222 times are time sorted relative to mmap2 but not
> relative to mmap1 events.
>
> The ordered events code sorts the clumps into a time based stream:
>     123141641445124564234645656...
>
In my case I care about time ordering the mmap records between themselves
because they overlap on the address range.

t0: 0x100000-0x200000 //anon

t10: sample @ 0x100600
t20: sample @ 0x100250

t1: 0x100600-0x100700 jit2
t2: 0x100200-0x100300 jit3

with t1 < t2 < t2 < t10

I inject t1, t2 mmaps at the end of the perf.data file.

The full ordering should yield:

t0: 0x100000-0x200000 //anon
t1: 0x100600-0x100700 jit2
t2: 0x100200-0x100300 jit3
t10: sample @ 0x100600 -> jit2
t20: sample @ 0x100250 -> jit3

Partial ordering would likely yield both samples pointing to //anon.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [BUG] perf report: ordered events and flushing bug
  2015-03-12 19:53             ` David Ahern
  2015-03-12 20:06               ` Stephane Eranian
@ 2015-03-12 20:16               ` Arnaldo Carvalho de Melo
  2015-03-12 20:24                 ` Stephane Eranian
  1 sibling, 1 reply; 15+ messages in thread
From: Arnaldo Carvalho de Melo @ 2015-03-12 20:16 UTC (permalink / raw)
  To: David Ahern
  Cc: Stephane Eranian, Adrian Hunter, LKML, Namhyung Kim, Jiri Olsa,
	Peter Zijlstra, mingo, acme

Em Thu, Mar 12, 2015 at 01:53:29PM -0600, David Ahern escreveu:
> On 3/12/15 1:39 PM, Stephane Eranian wrote:
> >What the point of having all the ordered event logic if you are saying events
> >must be saved in order. I don't think there is a way to make that guarantee
> >when monitoring multiple CPUs at the same time.
> 
> The record command does not analyze the events, it just copies from
> mmap to file in lumps per mmap. e.g., on a given round the perf data
> file has events like this:
> 
>    111112223344444444555566666F111111111
>    |<------- round --------->|^
>                               |
>         finished round event -|
> 
> where 11111 are events read from mmap1, 2222 are events from mmap2,
> etc. F is the finished round event which a pass over all mmaps has
> been done.
> 
> So for mmap1 all of the 11111 events are in time order, then jumping
> to mmap2 events the 2222 times are time sorted relative to mmap2 but
> not relative to mmap1 events.
> 
> The ordered events code sorts the clumps into a time based stream:
>     123141641445124564234645656...

And it does that because it merges all the mmap buffers into just one
file...

OK, for inserting MMAP events (or any other), I think one could either
use perf inject and merge two perf.data files, both in order, or add a
'perf data merge' subcommand to 'perf data', perhaps the later will be
useful in more cases.

But there is something else here, we should take advantage of the fact
that events in each perf mmap are ordered and keep that in the output of
perf record, i.e. we should start one thread per CPU that will just
write into a .perf.data/cpu-N file

Then, when reading, we will do what I'll do for 'trace' and 'top', i.e.
order the N cpus and go on processing in order, if you need that
(tracing, perf top perhaps).

Or do a first pass, get the lifetime events, aka the PERF_RECORD_
metadata, stash in the struct machine rbtrees, as usual, but keeping a
reference to all threadas, even the dead ones, which I guess is what
Namhyung does in some way in his patchkit, then go wild processing the
samples in parallel.

So, I think for Stephane, right now, the easiest path to follow is to
hack 'perf inject' to insert the MMAP events where he needs, right?

Agreed?

- Arnaldo

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [BUG] perf report: ordered events and flushing bug
  2015-03-12 20:16               ` Arnaldo Carvalho de Melo
@ 2015-03-12 20:24                 ` Stephane Eranian
  0 siblings, 0 replies; 15+ messages in thread
From: Stephane Eranian @ 2015-03-12 20:24 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: David Ahern, Adrian Hunter, LKML, Namhyung Kim, Jiri Olsa,
	Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo

On Thu, Mar 12, 2015 at 4:16 PM, Arnaldo Carvalho de Melo
<acme@redhat.com> wrote:
> Em Thu, Mar 12, 2015 at 01:53:29PM -0600, David Ahern escreveu:
>> On 3/12/15 1:39 PM, Stephane Eranian wrote:
>> >What the point of having all the ordered event logic if you are saying events
>> >must be saved in order. I don't think there is a way to make that guarantee
>> >when monitoring multiple CPUs at the same time.
>>
>> The record command does not analyze the events, it just copies from
>> mmap to file in lumps per mmap. e.g., on a given round the perf data
>> file has events like this:
>>
>>    111112223344444444555566666F111111111
>>    |<------- round --------->|^
>>                               |
>>         finished round event -|
>>
>> where 11111 are events read from mmap1, 2222 are events from mmap2,
>> etc. F is the finished round event which a pass over all mmaps has
>> been done.
>>
>> So for mmap1 all of the 11111 events are in time order, then jumping
>> to mmap2 events the 2222 times are time sorted relative to mmap2 but
>> not relative to mmap1 events.
>>
>> The ordered events code sorts the clumps into a time based stream:
>>     123141641445124564234645656...
>
> And it does that because it merges all the mmap buffers into just one
> file...
>
> OK, for inserting MMAP events (or any other), I think one could either
> use perf inject and merge two perf.data files, both in order, or add a
> 'perf data merge' subcommand to 'perf data', perhaps the later will be
> useful in more cases.
>
> But there is something else here, we should take advantage of the fact
> that events in each perf mmap are ordered and keep that in the output of
> perf record, i.e. we should start one thread per CPU that will just
> write into a .perf.data/cpu-N file
>
> Then, when reading, we will do what I'll do for 'trace' and 'top', i.e.
> order the N cpus and go on processing in order, if you need that
> (tracing, perf top perhaps).
>
> Or do a first pass, get the lifetime events, aka the PERF_RECORD_
> metadata, stash in the struct machine rbtrees, as usual, but keeping a
> reference to all threadas, even the dead ones, which I guess is what
> Namhyung does in some way in his patchkit, then go wild processing the
> samples in parallel.
>
> So, I think for Stephane, right now, the easiest path to follow is to
> hack 'perf inject' to insert the MMAP events where he needs, right?
>
Well, I had that but wanted to avoid the extra step for the user.
I had that last week. I will go back to it and verify that this approach
also works in pipe mode.
Thanks.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [BUG] perf report: ordered events and flushing bug
  2015-03-12 19:13     ` David Ahern
  2015-03-12 19:23       ` Stephane Eranian
@ 2015-03-12 20:27       ` Adrian Hunter
  1 sibling, 0 replies; 15+ messages in thread
From: Adrian Hunter @ 2015-03-12 20:27 UTC (permalink / raw)
  To: David Ahern, Stephane Eranian
  Cc: LKML, Arnaldo Carvalho de Melo, Namhyung Kim, Jiri Olsa,
	Peter Zijlstra, mingo

On 12/03/2015 9:13 p.m., David Ahern wrote:
> On 3/12/15 1:05 PM, Stephane Eranian wrote:
>> On Thu, Mar 12, 2015 at 5:02 AM, Adrian Hunter <adrian.hunter@intel.com> wrote:
>>> On 12/03/15 05:32, Stephane Eranian wrote:
>>>> Hi,
>>>>
>>>> I am working on the JIT support to improve the flow and have
>>>> perf record inject the MMAPs at the end of the collection. For
>>>> that I piggyback on the buildid pass. To avoid rewriting the entire perf.data
>>>> file, I simply append the the MMAP records at the end of the file. And that
>>>> puts them out-of-order in time systematically. But I thought it would be okay
>>>> because perf report would sort them by timestamps again.
>>>>
>>>> Well, it does not in all cases! Why?
>>>>
>>>> Because of the round flushing. Based on how far out-of-order an event is,
>>>> it may not be processed correctly because of round flushing. I believe this
>>>> may only impact MMAP records.  This is a serious issue because
>>>> mmaps drive symbolization of samples. If samples are processed without the
>>>> proper dso mapping, then samples may not be symbolized or may be wrongly
>>>> symbolized.
>>>>
>>>> So far, the workaround I found was to set the oe->next_flush = 0 for
>>>> the ROUND mode.
>>>> In other words, do not flush anything until FINAL. To me, this is the
>>>> only sensible
>>>> way of avoiding this kind of problems. I am not sure I understand the
>>>> point of flushing
>>>> anyway, except to minimize memory footprint, maybe. But it does not
>>>> work with vastly
>>>> out-of-order mmaps.
>
> Rounds and flushing after them helps with the user experience -- at least for some commands. On systems with 1024 cpus perf data files get large quickly and the resulting analysis command can appear to hang for long periods (e.g., i have done 1 second data collections that generate 5GB files and take 30-45 minutes before perf shows any output).
>
> I have a patch to perf-record to allow a user to specify the poll timeout. That helps define the maximum time between 'rounds' on the record side and then commands that show each event (e.g, perf script) will show output much more quickly without the need to consume large amounts of memory or process the entire file before displaying events.
>
>>>>
>>>> Do you have a better solution?
>>>
>>> You could hook the ordered event delivery (see Arnaldo's perf/core branch):
>>>
>>> if (tool->ordered_events) {
>>>          inject->deliver = session->ordered_events.deliver;
>>>          session->ordered_events.deliver = jit_mmap_deliver;
>>> }
>>>
>>> int jit_mmap_deliver(struct ordered_events *oe, struct ordered_event *event, struct perf_sample *sample)
>>> {
>>>          struct perf_inject *inject = container_of(oe->tool, struct perf_inject, tool);
>>>
>>>          if (next_jit_mmap_time < sample->time) {
>>>                  <synth mmap event>
>>>                  perf_session__deliver_synth_event(...);
>>>          }
>>>          return inject->deliver(oe, event, sample);
>>> }
>>>
>> I don't quite understand what this does?
>> Are you saying that in perf/core there is a new callback mechanism to
>> override the deliver phase
>> when you need ordered event and that callback could be used to avoid
>> the flushing?
>>
>
> I think the suggestion is to push the mmap events into the stream at the appropriate time slot.

Yes that was all. Sorry if it was confusing.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [BUG] perf report: ordered events and flushing bug
  2015-03-12 20:06               ` Stephane Eranian
@ 2015-03-12 20:50                 ` Arnaldo Carvalho de Melo
  2015-03-16  1:11                   ` Namhyung Kim
  0 siblings, 1 reply; 15+ messages in thread
From: Arnaldo Carvalho de Melo @ 2015-03-12 20:50 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: David Ahern, Adrian Hunter, LKML, Namhyung Kim, Jiri Olsa,
	Peter Zijlstra, Ingo Molnar, acme

Em Thu, Mar 12, 2015 at 04:06:46PM -0400, Stephane Eranian escreveu:
> On Thu, Mar 12, 2015 at 3:53 PM, David Ahern <dsahern@gmail.com> wrote:
> > On 3/12/15 1:39 PM, Stephane Eranian wrote:
> >>
> >> What the point of having all the ordered event logic if you are saying
> >> events
> >> must be saved in order. I don't think there is a way to make that
> >> guarantee
> >> when monitoring multiple CPUs at the same time.
> >
> >
> > The record command does not analyze the events, it just copies from mmap to
> > file in lumps per mmap. e.g., on a given round the perf data file has events
> > like this:
> >
> >    111112223344444444555566666F111111111
> >    |<------- round --------->|^
> >                               |
> >         finished round event -|
> >
> > where 11111 are events read from mmap1, 2222 are events from mmap2, etc. F
> > is the finished round event which a pass over all mmaps has been done.
> >
> > So for mmap1 all of the 11111 events are in time order, then jumping to
> > mmap2 events the 2222 times are time sorted relative to mmap2 but not
> > relative to mmap1 events.
> >
> > The ordered events code sorts the clumps into a time based stream:
> >     123141641445124564234645656...
> >
> In my case I care about time ordering the mmap records between themselves
> because they overlap on the address range.

Right, Namhyung, do you keep all the MMAP records as well? Looking at
the threaded patchkit I see:

-----------------
[PATCH 23/38] perf tools: Add a test case for timed map groups handling

A test case for verifying thread->mg and ->mg_list handling during
time change and new thread__find_addr_map_time() and friends.
-----------------

You said:

Subject: [PATCH 19/37] perf tools: Introduce thread__find_addr_location_time() and friends

The *_time() variants are for find appropriate map (and symbol) at the
given time.  This is based on the fact that map_groups list is sorted
by time in the previous patch.
----------------

Running out of time here, but I couldn't find code where you keep
'struct map' that overlaps and that could then be looked up using
timestamp in addition to the addr being looked up, that is needed for
parallely process samples having first processed all PERF_RECORD_MMAP
events.




 
> t0: 0x100000-0x200000 //anon
> 
> t10: sample @ 0x100600
> t20: sample @ 0x100250
> 
> t1: 0x100600-0x100700 jit2
> t2: 0x100200-0x100300 jit3
> 
> with t1 < t2 < t2 < t10
> 
> I inject t1, t2 mmaps at the end of the perf.data file.
> 
> The full ordering should yield:
> 
> t0: 0x100000-0x200000 //anon
> t1: 0x100600-0x100700 jit2
> t2: 0x100200-0x100300 jit3
> t10: sample @ 0x100600 -> jit2
> t20: sample @ 0x100250 -> jit3
> 
> Partial ordering would likely yield both samples pointing to //anon.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [BUG] perf report: ordered events and flushing bug
  2015-03-12 20:50                 ` Arnaldo Carvalho de Melo
@ 2015-03-16  1:11                   ` Namhyung Kim
  0 siblings, 0 replies; 15+ messages in thread
From: Namhyung Kim @ 2015-03-16  1:11 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: Stephane Eranian, David Ahern, Adrian Hunter, LKML, Jiri Olsa,
	Peter Zijlstra, Ingo Molnar, acme

Hi Arnaldo,

On Thu, Mar 12, 2015 at 05:50:53PM -0300, Arnaldo Carvalho de Melo wrote:
> Em Thu, Mar 12, 2015 at 04:06:46PM -0400, Stephane Eranian escreveu:
> > On Thu, Mar 12, 2015 at 3:53 PM, David Ahern <dsahern@gmail.com> wrote:
> > > On 3/12/15 1:39 PM, Stephane Eranian wrote:
> > >>
> > >> What the point of having all the ordered event logic if you are saying
> > >> events
> > >> must be saved in order. I don't think there is a way to make that
> > >> guarantee
> > >> when monitoring multiple CPUs at the same time.
> > >
> > >
> > > The record command does not analyze the events, it just copies from mmap to
> > > file in lumps per mmap. e.g., on a given round the perf data file has events
> > > like this:
> > >
> > >    111112223344444444555566666F111111111
> > >    |<------- round --------->|^
> > >                               |
> > >         finished round event -|
> > >
> > > where 11111 are events read from mmap1, 2222 are events from mmap2, etc. F
> > > is the finished round event which a pass over all mmaps has been done.
> > >
> > > So for mmap1 all of the 11111 events are in time order, then jumping to
> > > mmap2 events the 2222 times are time sorted relative to mmap2 but not
> > > relative to mmap1 events.
> > >
> > > The ordered events code sorts the clumps into a time based stream:
> > >     123141641445124564234645656...
> > >
> > In my case I care about time ordering the mmap records between themselves
> > because they overlap on the address range.
> 
> Right, Namhyung, do you keep all the MMAP records as well? Looking at
> the threaded patchkit I see:
> 
> -----------------
> [PATCH 23/38] perf tools: Add a test case for timed map groups handling
> 
> A test case for verifying thread->mg and ->mg_list handling during
> time change and new thread__find_addr_map_time() and friends.
> -----------------
> 
> You said:
> 
> Subject: [PATCH 19/37] perf tools: Introduce thread__find_addr_location_time() and friends
> 
> The *_time() variants are for find appropriate map (and symbol) at the
> given time.  This is based on the fact that map_groups list is sorted
> by time in the previous patch.
> ----------------
> 
> Running out of time here, but I couldn't find code where you keep
> 'struct map' that overlaps and that could then be looked up using
> timestamp in addition to the addr being looked up, that is needed for
> parallely process samples having first processed all PERF_RECORD_MMAP
> events.

It currently keeps only 'map_groups' per fork/exec so overlapped
'struct map' are not treated well assuming it's a rare event.  But
with the JIT map injection, it certainly also needs to keep 'struct
map' by timestamp - I'll work on it for the next spin.

Thanks,
Namhyung

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2015-03-16  1:16 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-03-12  3:32 [BUG] perf report: ordered events and flushing bug Stephane Eranian
2015-03-12  7:57 ` Namhyung Kim
2015-03-12  9:02 ` Adrian Hunter
2015-03-12 19:05   ` Stephane Eranian
2015-03-12 19:13     ` David Ahern
2015-03-12 19:23       ` Stephane Eranian
2015-03-12 19:34         ` David Ahern
2015-03-12 19:39           ` Stephane Eranian
2015-03-12 19:53             ` David Ahern
2015-03-12 20:06               ` Stephane Eranian
2015-03-12 20:50                 ` Arnaldo Carvalho de Melo
2015-03-16  1:11                   ` Namhyung Kim
2015-03-12 20:16               ` Arnaldo Carvalho de Melo
2015-03-12 20:24                 ` Stephane Eranian
2015-03-12 20:27       ` Adrian Hunter

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.