All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] perf inject: Flush ordered events on FINISHED_ROUND
@ 2020-10-02 13:03 Namhyung Kim
  2020-10-04 19:52 ` Jiri Olsa
  0 siblings, 1 reply; 7+ messages in thread
From: Namhyung Kim @ 2020-10-02 13:03 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo, Jiri Olsa
  Cc: Ingo Molnar, Peter Zijlstra, Mark Rutland, Alexander Shishkin,
	LKML, Stephane Eranian, Ian Rogers, Al Grant, Adrian Hunter

Currently perf inject just repipes the event without any flush.  It
makes an issue that it changes the order of events processed.

Normally we want to process events in time order, but events are
recorded from several cpus and they weren't sorted with each other.
So we put them into the ordered event queue, sorted by time, and flush
them when we see a next FINISHED_ROUND event.

But this is for events from kernel, user space events (like the
FINISHED_ROUND) are processed without queueing.  So during the perf
inject, it writes all the FINISHED_ROUND events directly while
SAMPLE (and other) events are kept in the queue.  This makes the user
space events are put before any kernel events.

You can see that with the following command:

  $ perf record -a -o- sleep 1 | perf inject -b | perf script -i- --show-round-events
  PERF_RECORD_FINISHED_ROUND
  PERF_RECORD_FINISHED_ROUND
  PERF_RECORD_FINISHED_ROUND
  ...

Omitting perf inject, you can see the events are placed in the middle
of the data.

You might argue that the whole point of the FINISHED_ROUND event is to
sort (kernel) events.  And after perf inject, all events are written
in a proper time order so we don't care about the FINISHED_ROUND event
anymore.

But the problem is we don't know whether the input data is sorted or
not (maybe we can add a feature bit for this?) so it should use an
ordered event queue when processing the input like in perf report.

Remember all the FINISHED_ROUND events now come before other events so
the tool cannot know when it can flush the data so everything will be
queued until it meets the end of the input.  Actually it's same for
perf inject itself as it doesn't flush the queue.

Below measures time and memory usage during the perf inject and
report using ~190MB data file.

Before:
  perf inject:  11.09 s,  382148 KB
  perf report:   8.05 s,  397440 KB

After:
  perf inject:  16.24 s,   83376 KB
  perf report:   7.96 s,  216184 KB

As you can see, it used 2x memory of the input size.  I guess it's
because it needs to keep the copy for the whole input.  But I don't
understand why processing time of perf inject increased..

I'm not sure how it affects the auxtrace, but it should be fine IMHO.

Cc: Al Grant <al.grant@foss.arm.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
---
 tools/perf/builtin-inject.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/tools/perf/builtin-inject.c b/tools/perf/builtin-inject.c
index 6d2f410d773a..9fa78a9edfc5 100644
--- a/tools/perf/builtin-inject.c
+++ b/tools/perf/builtin-inject.c
@@ -79,6 +79,7 @@ static int perf_event__repipe_oe_synth(struct perf_tool *tool,
 				       union perf_event *event,
 				       struct ordered_events *oe __maybe_unused)
 {
+	ordered_events__flush(oe, OE_FLUSH__ROUND);
 	return perf_event__repipe_synth(tool, event);
 }
 
-- 
2.28.0.806.g8561365e88-goog


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH] perf inject: Flush ordered events on FINISHED_ROUND
  2020-10-02 13:03 [PATCH] perf inject: Flush ordered events on FINISHED_ROUND Namhyung Kim
@ 2020-10-04 19:52 ` Jiri Olsa
  2020-10-06  1:28   ` Namhyung Kim
  0 siblings, 1 reply; 7+ messages in thread
From: Jiri Olsa @ 2020-10-04 19:52 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: Arnaldo Carvalho de Melo, Ingo Molnar, Peter Zijlstra,
	Mark Rutland, Alexander Shishkin, LKML, Stephane Eranian,
	Ian Rogers, Al Grant, Adrian Hunter

On Fri, Oct 02, 2020 at 10:03:17PM +0900, Namhyung Kim wrote:
> Currently perf inject just repipes the event without any flush.  It
> makes an issue that it changes the order of events processed.
> 
> Normally we want to process events in time order, but events are
> recorded from several cpus and they weren't sorted with each other.
> So we put them into the ordered event queue, sorted by time, and flush
> them when we see a next FINISHED_ROUND event.
> 
> But this is for events from kernel, user space events (like the
> FINISHED_ROUND) are processed without queueing.  So during the perf
> inject, it writes all the FINISHED_ROUND events directly while
> SAMPLE (and other) events are kept in the queue.  This makes the user
> space events are put before any kernel events.
> 
> You can see that with the following command:
> 
>   $ perf record -a -o- sleep 1 | perf inject -b | perf script -i- --show-round-events
>   PERF_RECORD_FINISHED_ROUND
>   PERF_RECORD_FINISHED_ROUND
>   PERF_RECORD_FINISHED_ROUND
>   ...
> 
> Omitting perf inject, you can see the events are placed in the middle
> of the data.
> 
> You might argue that the whole point of the FINISHED_ROUND event is to
> sort (kernel) events.  And after perf inject, all events are written
> in a proper time order so we don't care about the FINISHED_ROUND event
> anymore.
> 
> But the problem is we don't know whether the input data is sorted or
> not (maybe we can add a feature bit for this?) so it should use an
> ordered event queue when processing the input like in perf report.

I like the idea of storing the information that the data is sorted,
and when it's there, let's not use ordered_oevets

> 
> Remember all the FINISHED_ROUND events now come before other events so
> the tool cannot know when it can flush the data so everything will be
> queued until it meets the end of the input.  Actually it's same for
> perf inject itself as it doesn't flush the queue.
> 
> Below measures time and memory usage during the perf inject and
> report using ~190MB data file.
> 
> Before:
>   perf inject:  11.09 s,  382148 KB
>   perf report:   8.05 s,  397440 KB
> 
> After:
>   perf inject:  16.24 s,   83376 KB
>   perf report:   7.96 s,  216184 KB
> 
> As you can see, it used 2x memory of the input size.  I guess it's
> because it needs to keep the copy for the whole input.  But I don't
> understand why processing time of perf inject increased..

would be good to find out first

thanks,
jirka

> 
> I'm not sure how it affects the auxtrace, but it should be fine IMHO.
> 
> Cc: Al Grant <al.grant@foss.arm.com>
> Cc: Adrian Hunter <adrian.hunter@intel.com>
> Signed-off-by: Namhyung Kim <namhyung@kernel.org>
> ---
>  tools/perf/builtin-inject.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/tools/perf/builtin-inject.c b/tools/perf/builtin-inject.c
> index 6d2f410d773a..9fa78a9edfc5 100644
> --- a/tools/perf/builtin-inject.c
> +++ b/tools/perf/builtin-inject.c
> @@ -79,6 +79,7 @@ static int perf_event__repipe_oe_synth(struct perf_tool *tool,
>  				       union perf_event *event,
>  				       struct ordered_events *oe __maybe_unused)
>  {
> +	ordered_events__flush(oe, OE_FLUSH__ROUND);
>  	return perf_event__repipe_synth(tool, event);
>  }
>  
> -- 
> 2.28.0.806.g8561365e88-goog
> 


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] perf inject: Flush ordered events on FINISHED_ROUND
  2020-10-04 19:52 ` Jiri Olsa
@ 2020-10-06  1:28   ` Namhyung Kim
  2020-10-06  2:39     ` namhyung
  0 siblings, 1 reply; 7+ messages in thread
From: Namhyung Kim @ 2020-10-06  1:28 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Arnaldo Carvalho de Melo, Ingo Molnar, Peter Zijlstra,
	Mark Rutland, Alexander Shishkin, LKML, Stephane Eranian,
	Ian Rogers, Al Grant, Adrian Hunter

Hi Jiri,

On Mon, Oct 5, 2020 at 4:52 AM Jiri Olsa <jolsa@redhat.com> wrote:
>
> On Fri, Oct 02, 2020 at 10:03:17PM +0900, Namhyung Kim wrote:
> > Currently perf inject just repipes the event without any flush.  It
> > makes an issue that it changes the order of events processed.
> >
> > Normally we want to process events in time order, but events are
> > recorded from several cpus and they weren't sorted with each other.
> > So we put them into the ordered event queue, sorted by time, and flush
> > them when we see a next FINISHED_ROUND event.
> >
> > But this is for events from kernel, user space events (like the
> > FINISHED_ROUND) are processed without queueing.  So during the perf
> > inject, it writes all the FINISHED_ROUND events directly while
> > SAMPLE (and other) events are kept in the queue.  This makes the user
> > space events are put before any kernel events.
> >
> > You can see that with the following command:
> >
> >   $ perf record -a -o- sleep 1 | perf inject -b | perf script -i- --show-round-events
> >   PERF_RECORD_FINISHED_ROUND
> >   PERF_RECORD_FINISHED_ROUND
> >   PERF_RECORD_FINISHED_ROUND
> >   ...
> >
> > Omitting perf inject, you can see the events are placed in the middle
> > of the data.
> >
> > You might argue that the whole point of the FINISHED_ROUND event is to
> > sort (kernel) events.  And after perf inject, all events are written
> > in a proper time order so we don't care about the FINISHED_ROUND event
> > anymore.
> >
> > But the problem is we don't know whether the input data is sorted or
> > not (maybe we can add a feature bit for this?) so it should use an
> > ordered event queue when processing the input like in perf report.
>
> I like the idea of storing the information that the data is sorted,
> and when it's there, let's not use ordered_oevets

Thanks for your input.  Yeah, I think it's better not to use it if possible.

>
> >
> > Remember all the FINISHED_ROUND events now come before other events so
> > the tool cannot know when it can flush the data so everything will be
> > queued until it meets the end of the input.  Actually it's same for
> > perf inject itself as it doesn't flush the queue.
> >
> > Below measures time and memory usage during the perf inject and
> > report using ~190MB data file.
> >
> > Before:
> >   perf inject:  11.09 s,  382148 KB
> >   perf report:   8.05 s,  397440 KB
> >
> > After:
> >   perf inject:  16.24 s,   83376 KB
> >   perf report:   7.96 s,  216184 KB
> >
> > As you can see, it used 2x memory of the input size.  I guess it's
> > because it needs to keep the copy for the whole input.  But I don't
> > understand why processing time of perf inject increased..
>
> would be good to find out first

Will do!

Thanks
Namhyung

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] perf inject: Flush ordered events on FINISHED_ROUND
  2020-10-06  1:28   ` Namhyung Kim
@ 2020-10-06  2:39     ` namhyung
  2020-10-06  5:40       ` namhyung
  0 siblings, 1 reply; 7+ messages in thread
From: namhyung @ 2020-10-06  2:39 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Arnaldo Carvalho de Melo, Ingo Molnar, Peter Zijlstra,
	Mark Rutland, Alexander Shishkin, LKML, Stephane Eranian,
	Ian Rogers, Al Grant, Adrian Hunter

> > On Fri, Oct 02, 2020 at 10:03:17PM +0900, Namhyung Kim wrote:
> > > Below measures time and memory usage during the perf inject and
> > > report using ~190MB data file.
> > >
> > > Before:
> > >   perf inject:  11.09 s,  382148 KB
> > >   perf report:   8.05 s,  397440 KB
> > >
> > > After:
> > >   perf inject:  16.24 s,   83376 KB
> > >   perf report:   7.96 s,  216184 KB
> > >
> > > As you can see, it used 2x memory of the input size.  I guess it's
> > > because it needs to keep the copy for the whole input.  But I don't
> > > understand why processing time of perf inject increased..

Measuring it with time shows:

           before       after
  real    11.309s     17.040s
  user     8.084s     13.940s
  sys      6.535s      6.732s

So it's user space to make the difference.  I've run perf record on
both (with cycles:U) and the dominant function is same: queue_event.
(46.98% vs 65.87%)

It seems the flushing the queue makes more overhead on sorting.

Thanks
Namhyung

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] perf inject: Flush ordered events on FINISHED_ROUND
  2020-10-06  2:39     ` namhyung
@ 2020-10-06  5:40       ` namhyung
  2020-10-08  9:06         ` Jiri Olsa
  0 siblings, 1 reply; 7+ messages in thread
From: namhyung @ 2020-10-06  5:40 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Arnaldo Carvalho de Melo, Ingo Molnar, Peter Zijlstra,
	Mark Rutland, Alexander Shishkin, LKML, Stephane Eranian,
	Ian Rogers, Al Grant, Adrian Hunter

On Tue, Oct 06, 2020 at 11:39:49AM +0900, namhyung@kernel.org wrote:
> > > On Fri, Oct 02, 2020 at 10:03:17PM +0900, Namhyung Kim wrote:
> > > > Below measures time and memory usage during the perf inject and
> > > > report using ~190MB data file.
> > > >
> > > > Before:
> > > >   perf inject:  11.09 s,  382148 KB
> > > >   perf report:   8.05 s,  397440 KB
> > > >
> > > > After:
> > > >   perf inject:  16.24 s,   83376 KB
> > > >   perf report:   7.96 s,  216184 KB
> > > >
> > > > As you can see, it used 2x memory of the input size.  I guess it's
> > > > because it needs to keep the copy for the whole input.  But I don't
> > > > understand why processing time of perf inject increased..
> 
> Measuring it with time shows:
> 
>            before       after
>   real    11.309s     17.040s
>   user     8.084s     13.940s
>   sys      6.535s      6.732s
> 
> So it's user space to make the difference.  I've run perf record on
> both (with cycles:U) and the dominant function is same: queue_event.
> (46.98% vs 65.87%)
> 
> It seems the flushing the queue makes more overhead on sorting.

So I suspect the cache-miss ratio affects the performance.  With
flushing, data is processed in the middle and all the entries are
reused after flush so it would invalidate all the cache lines
occasionally.

This is the perf stat result:

* Before

     7,167,414,019      L1-dcache-loads                                             
       337,471,761      L1-dcache-read-misses     #    4.71% of all L1-dcache hits  

      11.011224671 seconds time elapsed


* After

     7,075,556,792      L1-dcache-loads                                             
       771,810,388      L1-dcache-read-misses     #   10.91% of all L1-dcache hits  

      17.015901863 seconds time elapsed


Hmm.. it's a memory & time trade-off then.  Maybe we need a switch to
select which one?

Thanks
Namhyung

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] perf inject: Flush ordered events on FINISHED_ROUND
  2020-10-06  5:40       ` namhyung
@ 2020-10-08  9:06         ` Jiri Olsa
  2020-10-12  5:04           ` Namhyung Kim
  0 siblings, 1 reply; 7+ messages in thread
From: Jiri Olsa @ 2020-10-08  9:06 UTC (permalink / raw)
  To: namhyung
  Cc: Arnaldo Carvalho de Melo, Ingo Molnar, Peter Zijlstra,
	Mark Rutland, Alexander Shishkin, LKML, Stephane Eranian,
	Ian Rogers, Al Grant, Adrian Hunter

On Tue, Oct 06, 2020 at 02:40:32PM +0900, namhyung@kernel.org wrote:
> On Tue, Oct 06, 2020 at 11:39:49AM +0900, namhyung@kernel.org wrote:
> > > > On Fri, Oct 02, 2020 at 10:03:17PM +0900, Namhyung Kim wrote:
> > > > > Below measures time and memory usage during the perf inject and
> > > > > report using ~190MB data file.
> > > > >
> > > > > Before:
> > > > >   perf inject:  11.09 s,  382148 KB
> > > > >   perf report:   8.05 s,  397440 KB
> > > > >
> > > > > After:
> > > > >   perf inject:  16.24 s,   83376 KB
> > > > >   perf report:   7.96 s,  216184 KB
> > > > >
> > > > > As you can see, it used 2x memory of the input size.  I guess it's
> > > > > because it needs to keep the copy for the whole input.  But I don't
> > > > > understand why processing time of perf inject increased..
> > 
> > Measuring it with time shows:
> > 
> >            before       after
> >   real    11.309s     17.040s
> >   user     8.084s     13.940s
> >   sys      6.535s      6.732s
> > 
> > So it's user space to make the difference.  I've run perf record on
> > both (with cycles:U) and the dominant function is same: queue_event.
> > (46.98% vs 65.87%)
> > 
> > It seems the flushing the queue makes more overhead on sorting.
> 
> So I suspect the cache-miss ratio affects the performance.  With
> flushing, data is processed in the middle and all the entries are
> reused after flush so it would invalidate all the cache lines
> occasionally.
> 
> This is the perf stat result:
> 
> * Before
> 
>      7,167,414,019      L1-dcache-loads                                             
>        337,471,761      L1-dcache-read-misses     #    4.71% of all L1-dcache hits  
> 
>       11.011224671 seconds time elapsed
> 
> 
> * After
> 
>      7,075,556,792      L1-dcache-loads                                             
>        771,810,388      L1-dcache-read-misses     #   10.91% of all L1-dcache hits  
> 
>       17.015901863 seconds time elapsed
> 
> 
> Hmm.. it's a memory & time trade-off then.  Maybe we need a switch to
> select which one?

I'd keep the faster one ;-) so the one before

jirka


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] perf inject: Flush ordered events on FINISHED_ROUND
  2020-10-08  9:06         ` Jiri Olsa
@ 2020-10-12  5:04           ` Namhyung Kim
  0 siblings, 0 replies; 7+ messages in thread
From: Namhyung Kim @ 2020-10-12  5:04 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Arnaldo Carvalho de Melo, Ingo Molnar, Peter Zijlstra,
	Mark Rutland, Alexander Shishkin, LKML, Stephane Eranian,
	Ian Rogers, Al Grant, Adrian Hunter

Hi Jiri,

On Thu, Oct 8, 2020 at 6:07 PM Jiri Olsa <jolsa@redhat.com> wrote:
>
> On Tue, Oct 06, 2020 at 02:40:32PM +0900, namhyung@kernel.org wrote:
> > This is the perf stat result:
> >
> > * Before
> >
> >      7,167,414,019      L1-dcache-loads
> >        337,471,761      L1-dcache-read-misses     #    4.71% of all L1-dcache hits
> >
> >       11.011224671 seconds time elapsed
> >
> >
> > * After
> >
> >      7,075,556,792      L1-dcache-loads
> >        771,810,388      L1-dcache-read-misses     #   10.91% of all L1-dcache hits
> >
> >       17.015901863 seconds time elapsed
> >
> >
> > Hmm.. it's a memory & time trade-off then.  Maybe we need a switch to
> > select which one?
>
> I'd keep the faster one ;-) so the one before

Yeah, it's hard to argue when it slows things down much.
But I'd like to note that memory consumption also can be a
big issue in some environments.  I'll bring this back later..

Thanks
Namhyung

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2020-10-12  5:05 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-10-02 13:03 [PATCH] perf inject: Flush ordered events on FINISHED_ROUND Namhyung Kim
2020-10-04 19:52 ` Jiri Olsa
2020-10-06  1:28   ` Namhyung Kim
2020-10-06  2:39     ` namhyung
2020-10-06  5:40       ` namhyung
2020-10-08  9:06         ` Jiri Olsa
2020-10-12  5:04           ` Namhyung Kim

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.