linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* A concern about overflow ring buffer mode
@ 2018-10-26 17:45 David Miller
  2018-10-26 18:38 ` Arnaldo Carvalho de Melo
  0 siblings, 1 reply; 28+ messages in thread
From: David Miller @ 2018-10-26 17:45 UTC (permalink / raw)
  To: acme; +Cc: linux-kernel


Since the last time I looked deeply into perf I notice that
perf top now uses a new ring buffer mode by default.

Basically, events are written in reverse order, and when fetching
events the tool uses an ioctl to "pause" the ring buffer.

I understand some of the reasons for pursing this kind of scheme but I
think there may be a huge downside to this design.

Yes, if the tool can't keep up with the kernel, we'd rather see newer
rather than older events.

However, pausing the ring buffer during the fetch is going to
virtually guaratee that we lose critical events that impact
interpretation of future events in a non-recoverable way.

The thing is, the new scheme causes events to be lost even if the tool
can keep up with the kernel.

Any event that happens while the tool is fetching the ring entries
will be lost forever.  The kernel simply skips queuing up the event
and increments a lost counter.  During a kernel build, I typically see
9 or so events lost each fetch.

Ok, if this is just a SAMPLE then fine, it's not a big deal.

But what if the lost event is a FORK or an EXEC or the worst one to
lose, an MMAP?

Now we can't even match up events properly and we get tons of those
dreaded "Unknown" symbols and DSOs.  The output looks terrible and the
tool becomes useless.

And yes this happens frequently.

I think the overwrite ring buffer mode should be seriously
reconsidered.  The "I'd rather see new than old events" part is fine,
but the "pause" part is not.  You can't turn event recording off on
the kernel side while you fetch some events, because it means that
critical events that allow us to properly interpret future events will
be lost.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: A concern about overflow ring buffer mode
  2018-10-26 17:45 A concern about overflow ring buffer mode David Miller
@ 2018-10-26 18:38 ` Arnaldo Carvalho de Melo
  2018-10-26 18:42   ` Arnaldo Carvalho de Melo
  0 siblings, 1 reply; 28+ messages in thread
From: Arnaldo Carvalho de Melo @ 2018-10-26 18:38 UTC (permalink / raw)
  To: David Miller; +Cc: linux-kernel, Wang Nan, Jiri Olsa, Namhyung Kim

Addind a few folks to the CC list, Wang implemented the backwards ring
buffer code.

Em Fri, Oct 26, 2018 at 10:45:13AM -0700, David Miller escreveu:
> Since the last time I looked deeply into perf I notice that
> perf top now uses a new ring buffer mode by default.
> 
> Basically, events are written in reverse order, and when fetching
> events the tool uses an ioctl to "pause" the ring buffer.
> 
> I understand some of the reasons for pursing this kind of scheme but I
> think there may be a huge downside to this design.
> 
> Yes, if the tool can't keep up with the kernel, we'd rather see newer
> rather than older events.
> 
> However, pausing the ring buffer during the fetch is going to
> virtually guaratee that we lose critical events that impact
> interpretation of future events in a non-recoverable way.
> 
> The thing is, the new scheme causes events to be lost even if the tool
> can keep up with the kernel.
> 
> Any event that happens while the tool is fetching the ring entries
> will be lost forever.  The kernel simply skips queuing up the event
> and increments a lost counter.  During a kernel build, I typically see
> 9 or so events lost each fetch.
> 
> Ok, if this is just a SAMPLE then fine, it's not a big deal.
> 
> But what if the lost event is a FORK or an EXEC or the worst one to
> lose, an MMAP?

Right, we can't lose those, so for using this, we need something like
the intel_pt tooling code does, i.e. add an extra event to the mix, a
software event, "dummy", that then gets used to track just the
PERF_RECORD_!SAMPLE metadata events and then this one never gets paused.

The intel pt motivation is different, but the technique perhaps will
allow for using the backward code while not losing metadata events.

wdyt? Wang?

- Arnaldo
 
> Now we can't even match up events properly and we get tons of those
> dreaded "Unknown" symbols and DSOs.  The output looks terrible and the
> tool becomes useless.
> 
> And yes this happens frequently.
> 
> I think the overwrite ring buffer mode should be seriously
> reconsidered.  The "I'd rather see new than old events" part is fine,
> but the "pause" part is not.  You can't turn event recording off on

> the kernel side while you fetch some events, because it means that
> critical events that allow us to properly interpret future events will
> be lost.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: A concern about overflow ring buffer mode
  2018-10-26 18:38 ` Arnaldo Carvalho de Melo
@ 2018-10-26 18:42   ` Arnaldo Carvalho de Melo
  2018-10-26 19:02     ` Arnaldo Carvalho de Melo
  0 siblings, 1 reply; 28+ messages in thread
From: Arnaldo Carvalho de Melo @ 2018-10-26 18:42 UTC (permalink / raw)
  To: David Miller
  Cc: linux-kernel, Wang Nan, Jiri Olsa, Namhyung Kim, Kan Liang,
	Andi Kleen, Jin Yao, Peter Zijlstra

Em Fri, Oct 26, 2018 at 03:38:05PM -0300, Arnaldo Carvalho de Melo escreveu:
> Addind a few folks to the CC list, Wang implemented the backwards ring
> buffer code.

Adding a few more, since the patch switching 'perf top' to overwrite
mode and the motivation for doing so is this one:

commit ebebbf082357f86cc84a4d46ce897a5750e41b7a
Author: Kan Liang <kan.liang@intel.com>
Date:   Thu Jan 18 13:26:31 2018 -0800

    perf top: Switch default mode to overwrite mode
    
    perf_top__mmap_read() has a severe performance issue in the Knights
    Landing/Mill platform, when monitoring heavy load systems. It costs
    several minutes to finish, which is unacceptable.

    Currently, 'perf top' uses the non overwrite mode. For non overwrite
    mode, it tries to read everything in the ringbuffer and doesn't pause
    it. Once there are lots of samples delivered persistently, the
    processing time could be very long. Also, the latest samples could be
    lost when the ringbuffer is full.
    
    For overwrite mode, it takes a snapshot for the system by pausing the
    ringbuffer, which could significantly reduce the processing time.  Also,
    the overwrite mode always keep the latest samples.  Considering the real
    time requirement for 'perf top', the overwrite mode is more suitable for
    it.
    
    Actually, 'perf top' was overwrite mode. It is changed to non overwrite
    mode since commit 93fc64f14472 ("perf top: Switch to non overwrite
    mode"). It's better to change it back to overwrite mode by default.
    
    For the kernel which doesn't support overwrite mode, it will fall back
    to non overwrite mode.
    
    There would be some records lost in overwrite mode because of pausing
    the ringbuffer. It has little impact for the accuracy of the snapshot
    and can be tolerated.
    
    For overwrite mode, unconditionally wait 100 ms before each snapshot. It
    also reduces the overhead caused by pausing ringbuffer, especially on
    light load system.
    
    Signed-off-by: Kan Liang <kan.liang@intel.com>
    Acked-by: Jiri Olsa <jolsa@kernel.org>
    Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
    Cc: Andi Kleen <ak@linux.intel.com>
    Cc: Jin Yao <yao.jin@linux.intel.com>
    Cc: Namhyung Kim <namhyung@kernel.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Wang Nan <wangnan0@huawei.com>
    Link: http://lkml.kernel.org/r/1516310792-208685-17-git-send-email-kan.liang@intel.com
    Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>

 
> Em Fri, Oct 26, 2018 at 10:45:13AM -0700, David Miller escreveu:
> > Since the last time I looked deeply into perf I notice that
> > perf top now uses a new ring buffer mode by default.
> > 
> > Basically, events are written in reverse order, and when fetching
> > events the tool uses an ioctl to "pause" the ring buffer.
> > 
> > I understand some of the reasons for pursing this kind of scheme but I
> > think there may be a huge downside to this design.
> > 
> > Yes, if the tool can't keep up with the kernel, we'd rather see newer
> > rather than older events.
> > 
> > However, pausing the ring buffer during the fetch is going to
> > virtually guaratee that we lose critical events that impact
> > interpretation of future events in a non-recoverable way.
> > 
> > The thing is, the new scheme causes events to be lost even if the tool
> > can keep up with the kernel.
> > 
> > Any event that happens while the tool is fetching the ring entries
> > will be lost forever.  The kernel simply skips queuing up the event
> > and increments a lost counter.  During a kernel build, I typically see
> > 9 or so events lost each fetch.
> > 
> > Ok, if this is just a SAMPLE then fine, it's not a big deal.
> > 
> > But what if the lost event is a FORK or an EXEC or the worst one to
> > lose, an MMAP?
> 
> Right, we can't lose those, so for using this, we need something like
> the intel_pt tooling code does, i.e. add an extra event to the mix, a
> software event, "dummy", that then gets used to track just the
> PERF_RECORD_!SAMPLE metadata events and then this one never gets paused.
> 
> The intel pt motivation is different, but the technique perhaps will
> allow for using the backward code while not losing metadata events.
> 
> wdyt? Wang?
> 
> - Arnaldo
>  
> > Now we can't even match up events properly and we get tons of those
> > dreaded "Unknown" symbols and DSOs.  The output looks terrible and the
> > tool becomes useless.
> > 
> > And yes this happens frequently.
> > 
> > I think the overwrite ring buffer mode should be seriously
> > reconsidered.  The "I'd rather see new than old events" part is fine,
> > but the "pause" part is not.  You can't turn event recording off on
> 
> > the kernel side while you fetch some events, because it means that
> > critical events that allow us to properly interpret future events will
> > be lost.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: A concern about overflow ring buffer mode
  2018-10-26 18:42   ` Arnaldo Carvalho de Melo
@ 2018-10-26 19:02     ` Arnaldo Carvalho de Melo
  2018-10-26 19:07       ` Liang, Kan
  0 siblings, 1 reply; 28+ messages in thread
From: Arnaldo Carvalho de Melo @ 2018-10-26 19:02 UTC (permalink / raw)
  To: David Miller
  Cc: linux-kernel, Wang Nan, Jiri Olsa, Namhyung Kim, Kan Liang,
	Andi Kleen, Jin Yao, Peter Zijlstra

So, I'm adding the following to my tree to help in diagnosing problems
with this overwrite mode:


From 40feb09001c7cc2ba8aeaa0a8f03b6d28fa4ca95 Mon Sep 17 00:00:00 2001
From: Arnaldo Carvalho de Melo <acme@redhat.com>
Date: Fri, 26 Oct 2018 15:55:23 -0300
Subject: [PATCH 1/1] perf top: Allow disabling the overwrite mode

In ebebbf082357 ("perf top: Switch default mode to overwrite mode") we
forgot to leave a way to disable that new default, add a --overwrite
option that can be disabled using --no-overwrite, since the code already
in such a way that we can readily disable this mode.

This is useful when investigating bugs with this mode like the recent
report from David Miller where lots of unknown symbols appear due to
disabling the events while processing them which disables all record
types, not just PERF_RECORD_SAMPLE, which makes it impossible to resolve
maps when we lose PERF_RECORD_MMAP records.

This can be easily seen while building a kernel, when there are lots of
short lived processes.

Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: David Miller <davem@davemloft.net>
Cc: Jin Yao <yao.jin@linux.intel.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Wang Nan <wangnan0@huawei.com>
Link: https://lkml.kernel.org/n/tip-oqgsz2bq4kgrnnajrafcdhie@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
---
 tools/perf/Documentation/perf-top.txt | 5 +++++
 tools/perf/builtin-top.c              | 2 ++
 2 files changed, 7 insertions(+)

diff --git a/tools/perf/Documentation/perf-top.txt b/tools/perf/Documentation/perf-top.txt
index 114fda12aa49..d4be6061fe1c 100644
--- a/tools/perf/Documentation/perf-top.txt
+++ b/tools/perf/Documentation/perf-top.txt
@@ -242,6 +242,11 @@ Default is to monitor all CPUS.
 --hierarchy::
 	Enable hierarchy output.
 
+--overwrite::
+	This is the default, but for investigating problems with it or any other strange
+	behaviour like lots of unknown samples, we may want to disable this mode by using
+	--no-overwrite.
+
 --force::
 	Don't do ownership validation.
 
diff --git a/tools/perf/builtin-top.c b/tools/perf/builtin-top.c
index d21d8751e749..214fad747b04 100644
--- a/tools/perf/builtin-top.c
+++ b/tools/perf/builtin-top.c
@@ -1372,6 +1372,8 @@ int cmd_top(int argc, const char **argv)
 		    "Show raw trace event output (do not use print fmt or plugins)"),
 	OPT_BOOLEAN(0, "hierarchy", &symbol_conf.report_hierarchy,
 		    "Show entries in a hierarchy"),
+	OPT_BOOLEAN(0, "overwrite", &top.record_opts.overwrite,
+		    "Use a backward ring buffer, default: yes"),
 	OPT_BOOLEAN(0, "force", &symbol_conf.force, "don't complain, do it"),
 	OPT_UINTEGER(0, "num-thread-synthesize", &top.nr_threads_synthesize,
 			"number of thread to run event synthesize"),
-- 
2.14.4


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: A concern about overflow ring buffer mode
  2018-10-26 19:02     ` Arnaldo Carvalho de Melo
@ 2018-10-26 19:07       ` Liang, Kan
  2018-10-26 19:12         ` Arnaldo Carvalho de Melo
  0 siblings, 1 reply; 28+ messages in thread
From: Liang, Kan @ 2018-10-26 19:07 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo, David Miller
  Cc: linux-kernel, Wang Nan, Jiri Olsa, Namhyung Kim, Kan Liang,
	Andi Kleen, Jin Yao, Peter Zijlstra



On 10/26/2018 3:02 PM, Arnaldo Carvalho de Melo wrote:
> So, I'm adding the following to my tree to help in diagnosing problems
> with this overwrite mode:
>

Actually, you can use per-event overwrite term to disable overwrite mode 
for perf top.

/*
  * Check per-event overwrite term.
  * perf top should support consistent term for all events.
  * - All events don't have per-event term
  *   E.g. "cpu/cpu-cycles/,cpu/instructions/"
  *   Nothing change, return 0.
  * - All events have same per-event term
  *   E.g. "cpu/cpu-cycles,no-overwrite/,cpu/instructions,no-overwrite/
  *   Using the per-event setting to replace the opts->overwrite if
  *   they are different, then return 0.
  * - Events have different per-event term
  *   E.g. "cpu/cpu-cycles,overwrite/,cpu/instructions,no-overwrite/"
  *   Return -1
  * - Some of the event set per-event term, but some not.
  *   E.g. "cpu/cpu-cycles/,cpu/instructions,no-overwrite/"
  *   Return -1
  */
static int perf_top__overwrite_check(struct perf_top *top)
{

Thanks,
Kan

> 
>  From 40feb09001c7cc2ba8aeaa0a8f03b6d28fa4ca95 Mon Sep 17 00:00:00 2001
> From: Arnaldo Carvalho de Melo <acme@redhat.com>
> Date: Fri, 26 Oct 2018 15:55:23 -0300
> Subject: [PATCH 1/1] perf top: Allow disabling the overwrite mode
> 
> In ebebbf082357 ("perf top: Switch default mode to overwrite mode") we
> forgot to leave a way to disable that new default, add a --overwrite
> option that can be disabled using --no-overwrite, since the code already
> in such a way that we can readily disable this mode.
> 
> This is useful when investigating bugs with this mode like the recent
> report from David Miller where lots of unknown symbols appear due to
> disabling the events while processing them which disables all record
> types, not just PERF_RECORD_SAMPLE, which makes it impossible to resolve
> maps when we lose PERF_RECORD_MMAP records.
> 
> This can be easily seen while building a kernel, when there are lots of
> short lived processes.
> 
> Cc: Adrian Hunter <adrian.hunter@intel.com>
> Cc: Andi Kleen <ak@linux.intel.com>
> Cc: David Ahern <dsahern@gmail.com>
> Cc: David Miller <davem@davemloft.net>
> Cc: Jin Yao <yao.jin@linux.intel.com>
> Cc: Jiri Olsa <jolsa@kernel.org>
> Cc: Kan Liang <kan.liang@intel.com>
> Cc: Namhyung Kim <namhyung@kernel.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Wang Nan <wangnan0@huawei.com>
> Link: https://lkml.kernel.org/n/tip-oqgsz2bq4kgrnnajrafcdhie@git.kernel.org
> Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
> ---
>   tools/perf/Documentation/perf-top.txt | 5 +++++
>   tools/perf/builtin-top.c              | 2 ++
>   2 files changed, 7 insertions(+)
> 
> diff --git a/tools/perf/Documentation/perf-top.txt b/tools/perf/Documentation/perf-top.txt
> index 114fda12aa49..d4be6061fe1c 100644
> --- a/tools/perf/Documentation/perf-top.txt
> +++ b/tools/perf/Documentation/perf-top.txt
> @@ -242,6 +242,11 @@ Default is to monitor all CPUS.
>   --hierarchy::
>   	Enable hierarchy output.
>   
> +--overwrite::
> +	This is the default, but for investigating problems with it or any other strange
> +	behaviour like lots of unknown samples, we may want to disable this mode by using
> +	--no-overwrite.
> +
>   --force::
>   	Don't do ownership validation.
>   
> diff --git a/tools/perf/builtin-top.c b/tools/perf/builtin-top.c
> index d21d8751e749..214fad747b04 100644
> --- a/tools/perf/builtin-top.c
> +++ b/tools/perf/builtin-top.c
> @@ -1372,6 +1372,8 @@ int cmd_top(int argc, const char **argv)
>   		    "Show raw trace event output (do not use print fmt or plugins)"),
>   	OPT_BOOLEAN(0, "hierarchy", &symbol_conf.report_hierarchy,
>   		    "Show entries in a hierarchy"),
> +	OPT_BOOLEAN(0, "overwrite", &top.record_opts.overwrite,
> +		    "Use a backward ring buffer, default: yes"),
>   	OPT_BOOLEAN(0, "force", &symbol_conf.force, "don't complain, do it"),
>   	OPT_UINTEGER(0, "num-thread-synthesize", &top.nr_threads_synthesize,
>   			"number of thread to run event synthesize"),
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: A concern about overflow ring buffer mode
  2018-10-26 19:07       ` Liang, Kan
@ 2018-10-26 19:12         ` Arnaldo Carvalho de Melo
  2018-10-26 19:16           ` Liang, Kan
  0 siblings, 1 reply; 28+ messages in thread
From: Arnaldo Carvalho de Melo @ 2018-10-26 19:12 UTC (permalink / raw)
  To: Liang, Kan
  Cc: David Miller, linux-kernel, Wang Nan, Jiri Olsa, Namhyung Kim,
	Kan Liang, Andi Kleen, Jin Yao, Peter Zijlstra

Em Fri, Oct 26, 2018 at 03:07:40PM -0400, Liang, Kan escreveu:
> On 10/26/2018 3:02 PM, Arnaldo Carvalho de Melo wrote:
> > So, I'm adding the following to my tree to help in diagnosing problems
> > with this overwrite mode:
 
> Actually, you can use per-event overwrite term to disable overwrite mode for
> perf top.
 
> /*
>  * Check per-event overwrite term.
>  * perf top should support consistent term for all events.
>  * - All events don't have per-event term
>  *   E.g. "cpu/cpu-cycles/,cpu/instructions/"
>  *   Nothing change, return 0.
>  * - All events have same per-event term
>  *   E.g. "cpu/cpu-cycles,no-overwrite/,cpu/instructions,no-overwrite/
>  *   Using the per-event setting to replace the opts->overwrite if
>  *   they are different, then return 0.
>  * - Events have different per-event term
>  *   E.g. "cpu/cpu-cycles,overwrite/,cpu/instructions,no-overwrite/"
>  *   Return -1
>  * - Some of the event set per-event term, but some not.
>  *   E.g. "cpu/cpu-cycles/,cpu/instructions,no-overwrite/"
>  *   Return -1
>  */
> static int perf_top__overwrite_check(struct perf_top *top)
> {

I see, it will disable that opts->overwrite if it finds the no-overwrite
in the per-event definition, so the equivalent of the option I added
below:

     perf top --no-overwrite

is:

     perf top -e cycles/no-overwrite/

I checked and both have the same result. But I still think there is
value in having the shorter form, ok?

- Arnaldo

> Thanks,
> Kan
> 
> > 
> >  From 40feb09001c7cc2ba8aeaa0a8f03b6d28fa4ca95 Mon Sep 17 00:00:00 2001
> > From: Arnaldo Carvalho de Melo <acme@redhat.com>
> > Date: Fri, 26 Oct 2018 15:55:23 -0300
> > Subject: [PATCH 1/1] perf top: Allow disabling the overwrite mode
> > 
> > In ebebbf082357 ("perf top: Switch default mode to overwrite mode") we
> > forgot to leave a way to disable that new default, add a --overwrite
> > option that can be disabled using --no-overwrite, since the code already
> > in such a way that we can readily disable this mode.
> > 
> > This is useful when investigating bugs with this mode like the recent
> > report from David Miller where lots of unknown symbols appear due to
> > disabling the events while processing them which disables all record
> > types, not just PERF_RECORD_SAMPLE, which makes it impossible to resolve
> > maps when we lose PERF_RECORD_MMAP records.
> > 
> > This can be easily seen while building a kernel, when there are lots of
> > short lived processes.
> > 
> > Cc: Adrian Hunter <adrian.hunter@intel.com>
> > Cc: Andi Kleen <ak@linux.intel.com>
> > Cc: David Ahern <dsahern@gmail.com>
> > Cc: David Miller <davem@davemloft.net>
> > Cc: Jin Yao <yao.jin@linux.intel.com>
> > Cc: Jiri Olsa <jolsa@kernel.org>
> > Cc: Kan Liang <kan.liang@intel.com>
> > Cc: Namhyung Kim <namhyung@kernel.org>
> > Cc: Peter Zijlstra <peterz@infradead.org>
> > Cc: Wang Nan <wangnan0@huawei.com>
> > Link: https://lkml.kernel.org/n/tip-oqgsz2bq4kgrnnajrafcdhie@git.kernel.org
> > Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
> > ---
> >   tools/perf/Documentation/perf-top.txt | 5 +++++
> >   tools/perf/builtin-top.c              | 2 ++
> >   2 files changed, 7 insertions(+)
> > 
> > diff --git a/tools/perf/Documentation/perf-top.txt b/tools/perf/Documentation/perf-top.txt
> > index 114fda12aa49..d4be6061fe1c 100644
> > --- a/tools/perf/Documentation/perf-top.txt
> > +++ b/tools/perf/Documentation/perf-top.txt
> > @@ -242,6 +242,11 @@ Default is to monitor all CPUS.
> >   --hierarchy::
> >   	Enable hierarchy output.
> > +--overwrite::
> > +	This is the default, but for investigating problems with it or any other strange
> > +	behaviour like lots of unknown samples, we may want to disable this mode by using
> > +	--no-overwrite.
> > +
> >   --force::
> >   	Don't do ownership validation.
> > diff --git a/tools/perf/builtin-top.c b/tools/perf/builtin-top.c
> > index d21d8751e749..214fad747b04 100644
> > --- a/tools/perf/builtin-top.c
> > +++ b/tools/perf/builtin-top.c
> > @@ -1372,6 +1372,8 @@ int cmd_top(int argc, const char **argv)
> >   		    "Show raw trace event output (do not use print fmt or plugins)"),
> >   	OPT_BOOLEAN(0, "hierarchy", &symbol_conf.report_hierarchy,
> >   		    "Show entries in a hierarchy"),
> > +	OPT_BOOLEAN(0, "overwrite", &top.record_opts.overwrite,
> > +		    "Use a backward ring buffer, default: yes"),
> >   	OPT_BOOLEAN(0, "force", &symbol_conf.force, "don't complain, do it"),
> >   	OPT_UINTEGER(0, "num-thread-synthesize", &top.nr_threads_synthesize,
> >   			"number of thread to run event synthesize"),
> > 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: A concern about overflow ring buffer mode
  2018-10-26 19:12         ` Arnaldo Carvalho de Melo
@ 2018-10-26 19:16           ` Liang, Kan
  2018-10-26 19:24             ` Arnaldo Carvalho de Melo
  0 siblings, 1 reply; 28+ messages in thread
From: Liang, Kan @ 2018-10-26 19:16 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: David Miller, linux-kernel, Wang Nan, Jiri Olsa, Namhyung Kim,
	Kan Liang, Andi Kleen, Jin Yao, Peter Zijlstra



On 10/26/2018 3:12 PM, Arnaldo Carvalho de Melo wrote:
> Em Fri, Oct 26, 2018 at 03:07:40PM -0400, Liang, Kan escreveu:
>> On 10/26/2018 3:02 PM, Arnaldo Carvalho de Melo wrote:
>>> So, I'm adding the following to my tree to help in diagnosing problems
>>> with this overwrite mode:
>   
>> Actually, you can use per-event overwrite term to disable overwrite mode for
>> perf top.
>   
>> /*
>>   * Check per-event overwrite term.
>>   * perf top should support consistent term for all events.
>>   * - All events don't have per-event term
>>   *   E.g. "cpu/cpu-cycles/,cpu/instructions/"
>>   *   Nothing change, return 0.
>>   * - All events have same per-event term
>>   *   E.g. "cpu/cpu-cycles,no-overwrite/,cpu/instructions,no-overwrite/
>>   *   Using the per-event setting to replace the opts->overwrite if
>>   *   they are different, then return 0.
>>   * - Events have different per-event term
>>   *   E.g. "cpu/cpu-cycles,overwrite/,cpu/instructions,no-overwrite/"
>>   *   Return -1
>>   * - Some of the event set per-event term, but some not.
>>   *   E.g. "cpu/cpu-cycles/,cpu/instructions,no-overwrite/"
>>   *   Return -1
>>   */
>> static int perf_top__overwrite_check(struct perf_top *top)
>> {
> 
> I see, it will disable that opts->overwrite if it finds the no-overwrite
> in the per-event definition, so the equivalent of the option I added
> below:
> 
>       perf top --no-overwrite
> 
> is:
> 
>       perf top -e cycles/no-overwrite/
> 
> I checked and both have the same result. But I still think there is
> value in having the shorter form, ok?
>

Sure.

Thanks,
Kan

> - Arnaldo
> 
>> Thanks,
>> Kan
>>
>>>
>>>   From 40feb09001c7cc2ba8aeaa0a8f03b6d28fa4ca95 Mon Sep 17 00:00:00 2001
>>> From: Arnaldo Carvalho de Melo <acme@redhat.com>
>>> Date: Fri, 26 Oct 2018 15:55:23 -0300
>>> Subject: [PATCH 1/1] perf top: Allow disabling the overwrite mode
>>>
>>> In ebebbf082357 ("perf top: Switch default mode to overwrite mode") we
>>> forgot to leave a way to disable that new default, add a --overwrite
>>> option that can be disabled using --no-overwrite, since the code already
>>> in such a way that we can readily disable this mode.
>>>
>>> This is useful when investigating bugs with this mode like the recent
>>> report from David Miller where lots of unknown symbols appear due to
>>> disabling the events while processing them which disables all record
>>> types, not just PERF_RECORD_SAMPLE, which makes it impossible to resolve
>>> maps when we lose PERF_RECORD_MMAP records.
>>>
>>> This can be easily seen while building a kernel, when there are lots of
>>> short lived processes.
>>>
>>> Cc: Adrian Hunter <adrian.hunter@intel.com>
>>> Cc: Andi Kleen <ak@linux.intel.com>
>>> Cc: David Ahern <dsahern@gmail.com>
>>> Cc: David Miller <davem@davemloft.net>
>>> Cc: Jin Yao <yao.jin@linux.intel.com>
>>> Cc: Jiri Olsa <jolsa@kernel.org>
>>> Cc: Kan Liang <kan.liang@intel.com>
>>> Cc: Namhyung Kim <namhyung@kernel.org>
>>> Cc: Peter Zijlstra <peterz@infradead.org>
>>> Cc: Wang Nan <wangnan0@huawei.com>
>>> Link: https://lkml.kernel.org/n/tip-oqgsz2bq4kgrnnajrafcdhie@git.kernel.org
>>> Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
>>> ---
>>>    tools/perf/Documentation/perf-top.txt | 5 +++++
>>>    tools/perf/builtin-top.c              | 2 ++
>>>    2 files changed, 7 insertions(+)
>>>
>>> diff --git a/tools/perf/Documentation/perf-top.txt b/tools/perf/Documentation/perf-top.txt
>>> index 114fda12aa49..d4be6061fe1c 100644
>>> --- a/tools/perf/Documentation/perf-top.txt
>>> +++ b/tools/perf/Documentation/perf-top.txt
>>> @@ -242,6 +242,11 @@ Default is to monitor all CPUS.
>>>    --hierarchy::
>>>    	Enable hierarchy output.
>>> +--overwrite::
>>> +	This is the default, but for investigating problems with it or any other strange
>>> +	behaviour like lots of unknown samples, we may want to disable this mode by using
>>> +	--no-overwrite.
>>> +
>>>    --force::
>>>    	Don't do ownership validation.
>>> diff --git a/tools/perf/builtin-top.c b/tools/perf/builtin-top.c
>>> index d21d8751e749..214fad747b04 100644
>>> --- a/tools/perf/builtin-top.c
>>> +++ b/tools/perf/builtin-top.c
>>> @@ -1372,6 +1372,8 @@ int cmd_top(int argc, const char **argv)
>>>    		    "Show raw trace event output (do not use print fmt or plugins)"),
>>>    	OPT_BOOLEAN(0, "hierarchy", &symbol_conf.report_hierarchy,
>>>    		    "Show entries in a hierarchy"),
>>> +	OPT_BOOLEAN(0, "overwrite", &top.record_opts.overwrite,
>>> +		    "Use a backward ring buffer, default: yes"),
>>>    	OPT_BOOLEAN(0, "force", &symbol_conf.force, "don't complain, do it"),
>>>    	OPT_UINTEGER(0, "num-thread-synthesize", &top.nr_threads_synthesize,
>>>    			"number of thread to run event synthesize"),
>>>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: A concern about overflow ring buffer mode
  2018-10-26 19:16           ` Liang, Kan
@ 2018-10-26 19:24             ` Arnaldo Carvalho de Melo
  2018-10-26 20:11               ` Liang, Kan
  0 siblings, 1 reply; 28+ messages in thread
From: Arnaldo Carvalho de Melo @ 2018-10-26 19:24 UTC (permalink / raw)
  To: Liang, Kan
  Cc: David Miller, linux-kernel, Wang Nan, Jiri Olsa, Namhyung Kim,
	Kan Liang, Andi Kleen, Jin Yao, Peter Zijlstra

Em Fri, Oct 26, 2018 at 03:16:29PM -0400, Liang, Kan escreveu:
> 
> 
> On 10/26/2018 3:12 PM, Arnaldo Carvalho de Melo wrote:
> > Em Fri, Oct 26, 2018 at 03:07:40PM -0400, Liang, Kan escreveu:
> > > On 10/26/2018 3:02 PM, Arnaldo Carvalho de Melo wrote:
> > > > So, I'm adding the following to my tree to help in diagnosing problems
> > > > with this overwrite mode:
> > > Actually, you can use per-event overwrite term to disable overwrite mode for
> > > perf top.
<SMIP>
> > I see, it will disable that opts->overwrite if it finds the no-overwrite
> > in the per-event definition, so the equivalent of the option I added
> > below:

> >       perf top --no-overwrite

> > is:

> >       perf top -e cycles/no-overwrite/

> > I checked and both have the same result. But I still think there is
> > value in having the shorter form, ok?

> Sure.

Ok.

I think that we should default back to --no-overwrite till we get this
sorted out, as the effect is easily noticeable, as David reported and I
reproduced, when doing kernel builds.

On systems such as Knights Landing/Mill one can use --overwrite, knowing
about this current map resolving limitation, i.e. for workloads where
there are not that many short lived threads or mmap'ing, that could be
possibly tolerable.

Fixing this properly will probably involve using the ordered_events code
and two evlist, one for the PERF_RECORD_!SAMPLE in non-overwrite mode
and the other for PERF_RECORD_SAMPLE in overwrite mode, else someone
comes up with some better solution :-)

wdyt?

- Arnaldo

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: A concern about overflow ring buffer mode
  2018-10-26 19:24             ` Arnaldo Carvalho de Melo
@ 2018-10-26 20:11               ` Liang, Kan
  2018-10-26 20:43                 ` Arnaldo Carvalho de Melo
                                   ` (2 more replies)
  0 siblings, 3 replies; 28+ messages in thread
From: Liang, Kan @ 2018-10-26 20:11 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: David Miller, linux-kernel, Wang Nan, Jiri Olsa, Namhyung Kim,
	Kan Liang, Andi Kleen, Jin Yao, Peter Zijlstra



On 10/26/2018 3:24 PM, Arnaldo Carvalho de Melo wrote:
> Em Fri, Oct 26, 2018 at 03:16:29PM -0400, Liang, Kan escreveu:
>>
>>
>> On 10/26/2018 3:12 PM, Arnaldo Carvalho de Melo wrote:
>>> Em Fri, Oct 26, 2018 at 03:07:40PM -0400, Liang, Kan escreveu:
>>>> On 10/26/2018 3:02 PM, Arnaldo Carvalho de Melo wrote:
>>>>> So, I'm adding the following to my tree to help in diagnosing problems
>>>>> with this overwrite mode:
>>>> Actually, you can use per-event overwrite term to disable overwrite mode for
>>>> perf top.
> <SMIP>
>>> I see, it will disable that opts->overwrite if it finds the no-overwrite
>>> in the per-event definition, so the equivalent of the option I added
>>> below:
> 
>>>        perf top --no-overwrite
> 
>>> is:
> 
>>>        perf top -e cycles/no-overwrite/
> 
>>> I checked and both have the same result. But I still think there is
>>> value in having the shorter form, ok?
> 
>> Sure.
> 
> Ok.
> 
> I think that we should default back to --no-overwrite till we get this
> sorted out, as the effect is easily noticeable, as David reported and I
> reproduced, when doing kernel builds.

It is mainly for performance reason to switch to overwrite mode. The 
impact was very small when I did my test. But now the effect is easily 
noticeable in other tests. Yes, I agree. We may change it back to 
non-overwrite mode until the issue is addressed.


> 
> On systems such as Knights Landing/Mill one can use --overwrite, knowing
> about this current map resolving limitation, i.e. for workloads where
> there are not that many short lived threads or mmap'ing, that could be
> possibly tolerable.

Could you please add this in the description of --overwrite?
It looks like the --overwrite is not default anymore.
+--overwrite::
+	This is the default, but for investigating problems with it or any 
other strange
+	behaviour like lots of unknown samples, we may want to disable this 
mode by using
+	--no-overwrite.



> 
> Fixing this properly will probably involve using the ordered_events code
> and two evlist, one for the PERF_RECORD_!SAMPLE in non-overwrite mode
> and the other for PERF_RECORD_SAMPLE in overwrite mode, else someone
> comes up with some better solution :-)
>

Supporting both overwrite and non-overwrite mode?
I think that needs some changes in kernel. May need to split the ring 
buffer for different mode. I think it should be very complex.
But I don't have a better solution for now. :)


Thanks,
Kan

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: A concern about overflow ring buffer mode
  2018-10-26 20:11               ` Liang, Kan
@ 2018-10-26 20:43                 ` Arnaldo Carvalho de Melo
  2018-10-29 13:03                 ` [PATCHES/RFC] " Arnaldo Carvalho de Melo
  2018-10-31 22:03                 ` [tip:perf/urgent] perf top: Do not use overwrite mode by default tip-bot for Arnaldo Carvalho de Melo
  2 siblings, 0 replies; 28+ messages in thread
From: Arnaldo Carvalho de Melo @ 2018-10-26 20:43 UTC (permalink / raw)
  To: Liang, Kan
  Cc: David Miller, linux-kernel, Wang Nan, Jiri Olsa, Namhyung Kim,
	Kan Liang, Andi Kleen, Jin Yao, Peter Zijlstra

Em Fri, Oct 26, 2018 at 04:11:51PM -0400, Liang, Kan escreveu:
> 
> 
> On 10/26/2018 3:24 PM, Arnaldo Carvalho de Melo wrote:
> > Em Fri, Oct 26, 2018 at 03:16:29PM -0400, Liang, Kan escreveu:
> > > 
> > > 
> > > On 10/26/2018 3:12 PM, Arnaldo Carvalho de Melo wrote:
> > > > Em Fri, Oct 26, 2018 at 03:07:40PM -0400, Liang, Kan escreveu:
> > > > > On 10/26/2018 3:02 PM, Arnaldo Carvalho de Melo wrote:
> > > > > > So, I'm adding the following to my tree to help in diagnosing problems
> > > > > > with this overwrite mode:
> > > > > Actually, you can use per-event overwrite term to disable overwrite mode for
> > > > > perf top.
> > <SMIP>
> > > > I see, it will disable that opts->overwrite if it finds the no-overwrite
> > > > in the per-event definition, so the equivalent of the option I added
> > > > below:
> > 
> > > >        perf top --no-overwrite
> > 
> > > > is:
> > 
> > > >        perf top -e cycles/no-overwrite/
> > 
> > > > I checked and both have the same result. But I still think there is
> > > > value in having the shorter form, ok?
> > 
> > > Sure.
> > 
> > Ok.
> > 
> > I think that we should default back to --no-overwrite till we get this
> > sorted out, as the effect is easily noticeable, as David reported and I
> > reproduced, when doing kernel builds.
> 
> It is mainly for performance reason to switch to overwrite mode. The impact
> was very small when I did my test. But now the effect is easily noticeable
> in other tests. Yes, I agree. We may change it back to non-overwrite mode
> until the issue is addressed.

ok
 
> > On systems such as Knights Landing/Mill one can use --overwrite, knowing
> > about this current map resolving limitation, i.e. for workloads where
> > there are not that many short lived threads or mmap'ing, that could be
> > possibly tolerable.
 
> Could you please add this in the description of --overwrite?
> It looks like the --overwrite is not default anymore.
> +--overwrite::
> +	This is the default, but for investigating problems with it or any other
> strange
> +	behaviour like lots of unknown samples, we may want to disable this mode
> by using
> +	--no-overwrite.

Ok, when I make that change, then I'll change the documentation for the
option.
 
> > Fixing this properly will probably involve using the ordered_events code
> > and two evlist, one for the PERF_RECORD_!SAMPLE in non-overwrite mode
> > and the other for PERF_RECORD_SAMPLE in overwrite mode, else someone
> > comes up with some better solution :-)
> > 
> 
> Supporting both overwrite and non-overwrite mode?

Not on the same ring buffer, two ring buffers, one overwrite, the
other non-overwrite, get events from both and order, then consume, like
perf_session does now when processing perf.data files.

> I think that needs some changes in kernel. May need to split the ring buffer
> for different mode. I think it should be very complex.
> But I don't have a better solution for now. :)

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCHES/RFC] Re: A concern about overflow ring buffer mode
  2018-10-26 20:11               ` Liang, Kan
  2018-10-26 20:43                 ` Arnaldo Carvalho de Melo
@ 2018-10-29 13:03                 ` Arnaldo Carvalho de Melo
  2018-10-29 14:33                   ` Liang, Kan
  2018-10-31 22:03                 ` [tip:perf/urgent] perf top: Do not use overwrite mode by default tip-bot for Arnaldo Carvalho de Melo
  2 siblings, 1 reply; 28+ messages in thread
From: Arnaldo Carvalho de Melo @ 2018-10-29 13:03 UTC (permalink / raw)
  To: Liang, Kan
  Cc: David Miller, linux-kernel, Wang Nan, Jiri Olsa, Namhyung Kim,
	Kan Liang, Andi Kleen, Jin Yao, Peter Zijlstra

Em Fri, Oct 26, 2018 at 04:11:51PM -0400, Liang, Kan escreveu:
> On 10/26/2018 3:24 PM, Arnaldo Carvalho de Melo wrote:
> > Em Fri, Oct 26, 2018 at 03:16:29PM -0400, Liang, Kan escreveu:
> > > On 10/26/2018 3:12 PM, Arnaldo Carvalho de Melo wrote:
> > > > Em Fri, Oct 26, 2018 at 03:07:40PM -0400, Liang, Kan escreveu:
> > > > > On 10/26/2018 3:02 PM, Arnaldo Carvalho de Melo wrote:
> > > > I checked and both have the same result. But I still think there is
> > > > value in having the shorter form, ok?

> > > Sure.

> > Ok.

> > I think that we should default back to --no-overwrite till we get this
> > sorted out, as the effect is easily noticeable, as David reported and I
> > reproduced, when doing kernel builds.
 
> It is mainly for performance reason to switch to overwrite mode. The impact
> was very small when I did my test. But now the effect is easily noticeable
> in other tests. Yes, I agree. We may change it back to non-overwrite mode
> until the issue is addressed.

So, I have these two patches in my perf/core branch, with Fixes tags
that will make them get to the stable kernels, ok?


From f54ef0e7342efb77205b2faaacbcb81cdd31f064 Mon Sep 17 00:00:00 2001
From: Arnaldo Carvalho de Melo <acme@redhat.com>
Date: Fri, 26 Oct 2018 15:55:23 -0300
Subject: [PATCH 1/2] perf top: Allow disabling the overwrite mode

In ebebbf082357 ("perf top: Switch default mode to overwrite mode") we
forgot to leave a way to disable that new default, add a --overwrite
option that can be disabled using --no-overwrite, since the code already
in such a way that we can readily disable this mode.

This is useful when investigating bugs with this mode like the recent
report from David Miller where lots of unknown symbols appear due to
disabling the events while processing them which disables all record
types, not just PERF_RECORD_SAMPLE, which makes it impossible to resolve
maps when we lose PERF_RECORD_MMAP records.

This can be easily seen while building a kernel, when there are lots of
short lived processes.

Reported-by: David Miller <davem@davemloft.net>
Acked-by: Kan Liang <kan.liang@intel.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jin Yao <yao.jin@linux.intel.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Wang Nan <wangnan0@huawei.com>
Fixes: ebebbf082357 ("perf top: Switch default mode to overwrite mode")
Link: https://lkml.kernel.org/n/tip-oqgsz2bq4kgrnnajrafcdhie@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
---
 tools/perf/Documentation/perf-top.txt | 5 +++++
 tools/perf/builtin-top.c              | 2 ++
 2 files changed, 7 insertions(+)

diff --git a/tools/perf/Documentation/perf-top.txt b/tools/perf/Documentation/perf-top.txt
index 114fda12aa49..d4be6061fe1c 100644
--- a/tools/perf/Documentation/perf-top.txt
+++ b/tools/perf/Documentation/perf-top.txt
@@ -242,6 +242,11 @@ Default is to monitor all CPUS.
 --hierarchy::
 	Enable hierarchy output.
 
+--overwrite::
+	This is the default, but for investigating problems with it or any other strange
+	behaviour like lots of unknown samples, we may want to disable this mode by using
+	--no-overwrite.
+
 --force::
 	Don't do ownership validation.
 
diff --git a/tools/perf/builtin-top.c b/tools/perf/builtin-top.c
index d21d8751e749..214fad747b04 100644
--- a/tools/perf/builtin-top.c
+++ b/tools/perf/builtin-top.c
@@ -1372,6 +1372,8 @@ int cmd_top(int argc, const char **argv)
 		    "Show raw trace event output (do not use print fmt or plugins)"),
 	OPT_BOOLEAN(0, "hierarchy", &symbol_conf.report_hierarchy,
 		    "Show entries in a hierarchy"),
+	OPT_BOOLEAN(0, "overwrite", &top.record_opts.overwrite,
+		    "Use a backward ring buffer, default: yes"),
 	OPT_BOOLEAN(0, "force", &symbol_conf.force, "don't complain, do it"),
 	OPT_UINTEGER(0, "num-thread-synthesize", &top.nr_threads_synthesize,
 			"number of thread to run event synthesize"),
-- 
2.14.4

From 5af190cac4ec10c53f1a934e7bbd30da7e616b22 Mon Sep 17 00:00:00 2001
From: Arnaldo Carvalho de Melo <acme@redhat.com>
Date: Mon, 29 Oct 2018 09:47:00 -0300
Subject: [PATCH 2/2] perf top: Do not use overwrite mode by default

Enabling --overwrite mode allows us to to use just the most recent
records, which helps in high core count machines such as Knights
Landing/Mill, but right now is being disabled by default as the pausing
used in this technique is leading to loss of metadata events such as
PERF_RECORD_MMAP which makes 'perf top' unable to resolve samples,
leading to lots of unknown samples appearing on the UI.

Enabling this may be useful if you are in such machines and profiling a
workload that doesn't creates short lived threads and/or doesn't uses
many executable mmap operations.

Work is being planed to solve this situation, till then, this will
remain disabled by default.

Reported-by: David Miller <davem@davemloft.net>
Acked-by: Kan Liang <kan.liang@intel.com>
Link: https://lkml.kernel.org/r/4f84468f-37d9-cf1b-12c1-514ef74b6a48@linux.intel.com
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Wang Nan <wangnan0@huawei.com>
Fixes: ebebbf082357 ("perf top: Switch default mode to overwrite mode")
Link: https://lkml.kernel.org/n/tip-f19zd2uyfimkeon4vt6kzucq@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
---
 tools/perf/Documentation/perf-top.txt | 11 ++++++++---
 tools/perf/builtin-top.c              |  9 ++++++++-
 2 files changed, 16 insertions(+), 4 deletions(-)

diff --git a/tools/perf/Documentation/perf-top.txt b/tools/perf/Documentation/perf-top.txt
index d4be6061fe1c..808b664343c9 100644
--- a/tools/perf/Documentation/perf-top.txt
+++ b/tools/perf/Documentation/perf-top.txt
@@ -243,9 +243,14 @@ Default is to monitor all CPUS.
 	Enable hierarchy output.
 
 --overwrite::
-	This is the default, but for investigating problems with it or any other strange
-	behaviour like lots of unknown samples, we may want to disable this mode by using
-	--no-overwrite.
+	Enable this to use just the most recent records, which helps in high core count
+	machines such as Knights Landing/Mill, but right now is disabled by default as
+	the pausing used in this technique is leading to loss of metadata events such
+	as PERF_RECORD_MMAP which makes 'perf top' unable to resolve samples, leading
+	to lots of unknown samples appearing on the UI. Enable this if you are in such
+	machines and profiling a workload that doesn't creates short lived threads and/or
+	doesn't uses many executable mmap operations. Work is being planed to solve
+	this situation, till then, this will remain disabled by default.
 
 --force::
 	Don't do ownership validation.
diff --git a/tools/perf/builtin-top.c b/tools/perf/builtin-top.c
index 214fad747b04..0891656da6ef 100644
--- a/tools/perf/builtin-top.c
+++ b/tools/perf/builtin-top.c
@@ -1257,7 +1257,14 @@ int cmd_top(int argc, const char **argv)
 				.uses_mmap   = true,
 			},
 			.proc_map_timeout    = 500,
-			.overwrite	= 1,
+			/*
+			 * FIXME: This will lose PERF_RECORD_MMAP and other metadata
+			 * when we pause, fix that and reenable. Probably using a
+			 * separate evlist with a dummy event, i.e. a non-overwrite
+			 * ring buffer just for metadata events, while PERF_RECORD_SAMPLE
+			 * stays in overwrite mode. -acme
+			 * */
+			.overwrite	= 0,
 		},
 		.max_stack	     = sysctl__max_stack(),
 		.annotation_opts     = annotation__default_options,
-- 
2.14.4


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCHES/RFC] Re: A concern about overflow ring buffer mode
  2018-10-29 13:03                 ` [PATCHES/RFC] " Arnaldo Carvalho de Melo
@ 2018-10-29 14:33                   ` Liang, Kan
  2018-10-29 14:35                     ` Arnaldo Carvalho de Melo
                                       ` (2 more replies)
  0 siblings, 3 replies; 28+ messages in thread
From: Liang, Kan @ 2018-10-29 14:33 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: David Miller, linux-kernel, Wang Nan, Jiri Olsa, Namhyung Kim,
	Kan Liang, Andi Kleen, Jin Yao, Peter Zijlstra



On 10/29/2018 9:03 AM, Arnaldo Carvalho de Melo wrote:
> Em Fri, Oct 26, 2018 at 04:11:51PM -0400, Liang, Kan escreveu:
>> On 10/26/2018 3:24 PM, Arnaldo Carvalho de Melo wrote:
>>> Em Fri, Oct 26, 2018 at 03:16:29PM -0400, Liang, Kan escreveu:
>>>> On 10/26/2018 3:12 PM, Arnaldo Carvalho de Melo wrote:
>>>>> Em Fri, Oct 26, 2018 at 03:07:40PM -0400, Liang, Kan escreveu:
>>>>>> On 10/26/2018 3:02 PM, Arnaldo Carvalho de Melo wrote:
>>>>> I checked and both have the same result. But I still think there is
>>>>> value in having the shorter form, ok?
> 
>>>> Sure.
> 
>>> Ok.
> 
>>> I think that we should default back to --no-overwrite till we get this
>>> sorted out, as the effect is easily noticeable, as David reported and I
>>> reproduced, when doing kernel builds.
>   
>> It is mainly for performance reason to switch to overwrite mode. The impact
>> was very small when I did my test. But now the effect is easily noticeable
>> in other tests. Yes, I agree. We may change it back to non-overwrite mode
>> until the issue is addressed.
> 
> So, I have these two patches in my perf/core branch, with Fixes tags
> that will make them get to the stable kernels, ok?

I just realized that the problem in KNL will be back if we switch back 
to non-overwrite mode.
The problem is that users have to wait tens of minutes to see perf top 
results on the screen in KNL. Before that, there is nothing but a black 
screen.

Sorry I didn't notice it last Friday. Because I thought the ui_warning 
in perf_top__mmap_read() can give user a hint. So the user can switch to 
overwrite mode manually.
But unfortunately, the ui_warning doesn't work. Because it is called 
after perf_top__mmap_read(). The processing time of 
perf_top__mmap_read() could be tens of minutes.

Thanks,
Kan

> 
> 
>  From f54ef0e7342efb77205b2faaacbcb81cdd31f064 Mon Sep 17 00:00:00 2001
> From: Arnaldo Carvalho de Melo <acme@redhat.com>
> Date: Fri, 26 Oct 2018 15:55:23 -0300
> Subject: [PATCH 1/2] perf top: Allow disabling the overwrite mode
> 
> In ebebbf082357 ("perf top: Switch default mode to overwrite mode") we
> forgot to leave a way to disable that new default, add a --overwrite
> option that can be disabled using --no-overwrite, since the code already
> in such a way that we can readily disable this mode.
> 
> This is useful when investigating bugs with this mode like the recent
> report from David Miller where lots of unknown symbols appear due to
> disabling the events while processing them which disables all record
> types, not just PERF_RECORD_SAMPLE, which makes it impossible to resolve
> maps when we lose PERF_RECORD_MMAP records.
> 
> This can be easily seen while building a kernel, when there are lots of
> short lived processes.
> 
> Reported-by: David Miller <davem@davemloft.net>
> Acked-by: Kan Liang <kan.liang@intel.com>
> Cc: Adrian Hunter <adrian.hunter@intel.com>
> Cc: Andi Kleen <ak@linux.intel.com>
> Cc: David Ahern <dsahern@gmail.com>
> Cc: Jin Yao <yao.jin@linux.intel.com>
> Cc: Jiri Olsa <jolsa@kernel.org>
> Cc: Namhyung Kim <namhyung@kernel.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Wang Nan <wangnan0@huawei.com>
> Fixes: ebebbf082357 ("perf top: Switch default mode to overwrite mode")
> Link: https://lkml.kernel.org/n/tip-oqgsz2bq4kgrnnajrafcdhie@git.kernel.org
> Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
> ---
>   tools/perf/Documentation/perf-top.txt | 5 +++++
>   tools/perf/builtin-top.c              | 2 ++
>   2 files changed, 7 insertions(+)
> 
> diff --git a/tools/perf/Documentation/perf-top.txt b/tools/perf/Documentation/perf-top.txt
> index 114fda12aa49..d4be6061fe1c 100644
> --- a/tools/perf/Documentation/perf-top.txt
> +++ b/tools/perf/Documentation/perf-top.txt
> @@ -242,6 +242,11 @@ Default is to monitor all CPUS.
>   --hierarchy::
>   	Enable hierarchy output.
>   
> +--overwrite::
> +	This is the default, but for investigating problems with it or any other strange
> +	behaviour like lots of unknown samples, we may want to disable this mode by using
> +	--no-overwrite.
> +
>   --force::
>   	Don't do ownership validation.
>   
> diff --git a/tools/perf/builtin-top.c b/tools/perf/builtin-top.c
> index d21d8751e749..214fad747b04 100644
> --- a/tools/perf/builtin-top.c
> +++ b/tools/perf/builtin-top.c
> @@ -1372,6 +1372,8 @@ int cmd_top(int argc, const char **argv)
>   		    "Show raw trace event output (do not use print fmt or plugins)"),
>   	OPT_BOOLEAN(0, "hierarchy", &symbol_conf.report_hierarchy,
>   		    "Show entries in a hierarchy"),
> +	OPT_BOOLEAN(0, "overwrite", &top.record_opts.overwrite,
> +		    "Use a backward ring buffer, default: yes"),
>   	OPT_BOOLEAN(0, "force", &symbol_conf.force, "don't complain, do it"),
>   	OPT_UINTEGER(0, "num-thread-synthesize", &top.nr_threads_synthesize,
>   			"number of thread to run event synthesize"),
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCHES/RFC] Re: A concern about overflow ring buffer mode
  2018-10-29 14:33                   ` Liang, Kan
@ 2018-10-29 14:35                     ` Arnaldo Carvalho de Melo
  2018-10-29 15:11                       ` Liang, Kan
  2018-10-29 17:40                     ` David Miller
  2018-10-30 19:05                     ` David Miller
  2 siblings, 1 reply; 28+ messages in thread
From: Arnaldo Carvalho de Melo @ 2018-10-29 14:35 UTC (permalink / raw)
  To: Liang, Kan
  Cc: David Miller, linux-kernel, Wang Nan, Jiri Olsa, Namhyung Kim,
	Kan Liang, Andi Kleen, Jin Yao, Peter Zijlstra

Em Mon, Oct 29, 2018 at 10:33:06AM -0400, Liang, Kan escreveu:
> On 10/29/2018 9:03 AM, Arnaldo Carvalho de Melo wrote:
> > Em Fri, Oct 26, 2018 at 04:11:51PM -0400, Liang, Kan escreveu:
> > > On 10/26/2018 3:24 PM, Arnaldo Carvalho de Melo wrote:
> > > > Em Fri, Oct 26, 2018 at 03:16:29PM -0400, Liang, Kan escreveu:
> > > It is mainly for performance reason to switch to overwrite mode. The impact
> > > was very small when I did my test. But now the effect is easily noticeable
> > > in other tests. Yes, I agree. We may change it back to non-overwrite mode
> > > until the issue is addressed.

> > So, I have these two patches in my perf/core branch, with Fixes tags
> > that will make them get to the stable kernels, ok?
 
> I just realized that the problem in KNL will be back if we switch back to
> non-overwrite mode.
> The problem is that users have to wait tens of minutes to see perf top
> results on the screen in KNL. Before that, there is nothing but a black
> screen.
 
> Sorry I didn't notice it last Friday. Because I thought the ui_warning in
> perf_top__mmap_read() can give user a hint. So the user can switch to
> overwrite mode manually.
> But unfortunately, the ui_warning doesn't work. Because it is called after
> perf_top__mmap_read(). The processing time of perf_top__mmap_read() could be
> tens of minutes.

So we need a way to notice that we're in a machine like that and warn
the user before the wait takes place, ideas on how to do that?

- Arnaldo

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCHES/RFC] Re: A concern about overflow ring buffer mode
  2018-10-29 14:35                     ` Arnaldo Carvalho de Melo
@ 2018-10-29 15:11                       ` Liang, Kan
  2018-10-29 17:43                         ` David Miller
  0 siblings, 1 reply; 28+ messages in thread
From: Liang, Kan @ 2018-10-29 15:11 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: David Miller, linux-kernel, Wang Nan, Jiri Olsa, Namhyung Kim,
	Kan Liang, Andi Kleen, Jin Yao, Peter Zijlstra



On 10/29/2018 10:35 AM, Arnaldo Carvalho de Melo wrote:
> Em Mon, Oct 29, 2018 at 10:33:06AM -0400, Liang, Kan escreveu:
>> On 10/29/2018 9:03 AM, Arnaldo Carvalho de Melo wrote:
>>> Em Fri, Oct 26, 2018 at 04:11:51PM -0400, Liang, Kan escreveu:
>>>> On 10/26/2018 3:24 PM, Arnaldo Carvalho de Melo wrote:
>>>>> Em Fri, Oct 26, 2018 at 03:16:29PM -0400, Liang, Kan escreveu:
>>>> It is mainly for performance reason to switch to overwrite mode. The impact
>>>> was very small when I did my test. But now the effect is easily noticeable
>>>> in other tests. Yes, I agree. We may change it back to non-overwrite mode
>>>> until the issue is addressed.
> 
>>> So, I have these two patches in my perf/core branch, with Fixes tags
>>> that will make them get to the stable kernels, ok?
>   
>> I just realized that the problem in KNL will be back if we switch back to
>> non-overwrite mode.
>> The problem is that users have to wait tens of minutes to see perf top
>> results on the screen in KNL. Before that, there is nothing but a black
>> screen.
>   
>> Sorry I didn't notice it last Friday. Because I thought the ui_warning in
>> perf_top__mmap_read() can give user a hint. So the user can switch to
>> overwrite mode manually.
>> But unfortunately, the ui_warning doesn't work. Because it is called after
>> perf_top__mmap_read(). The processing time of perf_top__mmap_read() could be
>> tens of minutes.
> 
> So we need a way to notice that we're in a machine like that and warn
> the user before the wait takes place, ideas on how to do that?
> 
The processing time for each perf_top__mmap_read_idx() should not that 
long. We may check it after each perf_top__mmap_read_idx(). Also change 
the ui_warning to one-time warning. The patch as below can do that (not 
test).


diff --git a/tools/perf/builtin-top.c b/tools/perf/builtin-top.c
index d21d875..5e532e0 100644
--- a/tools/perf/builtin-top.c
+++ b/tools/perf/builtin-top.c
@@ -877,31 +877,40 @@ static void perf_top__mmap_read_idx(struct 
perf_top *top, int idx)
  	perf_mmap__read_done(md);
  }

+static bool check_processing_time = true;
+
  static void perf_top__mmap_read(struct perf_top *top)
  {
  	bool overwrite = top->record_opts.overwrite;
  	struct perf_evlist *evlist = top->evlist;
-	unsigned long long start, end;
+	unsigned long long start, end, tolerance;
  	int i;

-	start = rdclock();
  	if (overwrite)
  		perf_evlist__toggle_bkw_mmap(evlist, BKW_MMAP_DATA_PENDING);

-	for (i = 0; i < top->evlist->nr_mmaps; i++)
+	tolerance = (unsigned long long)top->delay_secs * NSEC_PER_SEC / 
top->evlist->nr_mmaps;
+	start = rdclock();
+	for (i = 0; i < top->evlist->nr_mmaps; i++) {
  		perf_top__mmap_read_idx(top, i);
+		if (check_processing_time) {
+			end = rdclock();
+
+			if ((end - start) > tolerance) {
+				ui__warning("Too slow to read ring buffer.\n"
+					    "Please try increasing the period (-c) or\n"
+					    "decreasing the freq (-F) or\n"
+					    "limiting the number of CPUs (-C)\n");
+				check_processing_time = false;
+			}
+			start = end;
+		}
+	}

  	if (overwrite) {
  		perf_evlist__toggle_bkw_mmap(evlist, BKW_MMAP_EMPTY);
  		perf_evlist__toggle_bkw_mmap(evlist, BKW_MMAP_RUNNING);
  	}
-	end = rdclock();
-
-	if ((end - start) > (unsigned long long)top->delay_secs * NSEC_PER_SEC)
-		ui__warning("Too slow to read ring buffer.\n"
-			    "Please try increasing the period (-c) or\n"
-			    "decreasing the freq (-F) or\n"
-			    "limiting the number of CPUs (-C)\n");
  }

  /*

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCHES/RFC] Re: A concern about overflow ring buffer mode
  2018-10-29 14:33                   ` Liang, Kan
  2018-10-29 14:35                     ` Arnaldo Carvalho de Melo
@ 2018-10-29 17:40                     ` David Miller
  2018-10-29 17:42                       ` Liang, Kan
  2018-10-29 17:55                       ` Arnaldo Carvalho de Melo
  2018-10-30 19:05                     ` David Miller
  2 siblings, 2 replies; 28+ messages in thread
From: David Miller @ 2018-10-29 17:40 UTC (permalink / raw)
  To: kan.liang
  Cc: acme, linux-kernel, wangnan0, jolsa, namhyung, kan.liang, ak,
	yao.jin, peterz

From: "Liang, Kan" <kan.liang@linux.intel.com>
Date: Mon, 29 Oct 2018 10:33:06 -0400

> I just realized that the problem in KNL will be back if we switch
> back to non-overwrite mode.

What is KNL?

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCHES/RFC] Re: A concern about overflow ring buffer mode
  2018-10-29 17:40                     ` David Miller
@ 2018-10-29 17:42                       ` Liang, Kan
  2018-10-29 17:48                         ` David Miller
  2018-10-29 17:55                       ` Arnaldo Carvalho de Melo
  1 sibling, 1 reply; 28+ messages in thread
From: Liang, Kan @ 2018-10-29 17:42 UTC (permalink / raw)
  To: David Miller
  Cc: acme, linux-kernel, wangnan0, jolsa, namhyung, kan.liang, ak,
	yao.jin, peterz



On 10/29/2018 1:40 PM, David Miller wrote:
> From: "Liang, Kan" <kan.liang@linux.intel.com>
> Date: Mon, 29 Oct 2018 10:33:06 -0400
> 
>> I just realized that the problem in KNL will be back if we switch
>> back to non-overwrite mode.
> 
> What is KNL?
>
Intel Xeon Phi Processor, Knights Landing.

Thanks,
Kan


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCHES/RFC] Re: A concern about overflow ring buffer mode
  2018-10-29 15:11                       ` Liang, Kan
@ 2018-10-29 17:43                         ` David Miller
  2018-10-29 17:56                           ` Arnaldo Carvalho de Melo
  0 siblings, 1 reply; 28+ messages in thread
From: David Miller @ 2018-10-29 17:43 UTC (permalink / raw)
  To: kan.liang
  Cc: acme, linux-kernel, wangnan0, jolsa, namhyung, kan.liang, ak,
	yao.jin, peterz

From: "Liang, Kan" <kan.liang@linux.intel.com>
Date: Mon, 29 Oct 2018 11:11:25 -0400

> The processing time for each perf_top__mmap_read_idx() should not that
> long. We may check it after each perf_top__mmap_read_idx(). Also
> change the ui_warning to one-time warning. The patch as below can do
> that (not test).

You cannot make ui__*() calls from the event processing thread.

I tried to make this point strongly over the weekend.

It hangs the event thread, because the ui call waits for a keypress
but the display thread will eat them up and the event thread thus
hangs in select().

So, once more, all ui calls must be avoided in event processing code.

Said another way, it is not legal to call UI interfaces from any
code that could be called by the event thread.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCHES/RFC] Re: A concern about overflow ring buffer mode
  2018-10-29 17:42                       ` Liang, Kan
@ 2018-10-29 17:48                         ` David Miller
  2018-10-29 18:20                           ` Liang, Kan
  0 siblings, 1 reply; 28+ messages in thread
From: David Miller @ 2018-10-29 17:48 UTC (permalink / raw)
  To: kan.liang
  Cc: acme, linux-kernel, wangnan0, jolsa, namhyung, kan.liang, ak,
	yao.jin, peterz

From: "Liang, Kan" <kan.liang@linux.intel.com>
Date: Mon, 29 Oct 2018 13:42:56 -0400

> 
> 
> On 10/29/2018 1:40 PM, David Miller wrote:
>> From: "Liang, Kan" <kan.liang@linux.intel.com>
>> Date: Mon, 29 Oct 2018 10:33:06 -0400
>> 
>>> I just realized that the problem in KNL will be back if we switch
>>> back to non-overwrite mode.
>> What is KNL?
>>
> Intel Xeon Phi Processor, Knights Landing.

I don't understand how a specific piece of hardware directly leads to
ring buffer processing timeouts, or multi-minute thread map processing
times...

You'll have to explain all of the details of your test scenerio, and
the exact problems triggers, which caused you to write these patches
which causes serious regressions for what I consider a core simple use
case of perf top.

And that's running perf top during a parallel kernel build.



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCHES/RFC] Re: A concern about overflow ring buffer mode
  2018-10-29 17:40                     ` David Miller
  2018-10-29 17:42                       ` Liang, Kan
@ 2018-10-29 17:55                       ` Arnaldo Carvalho de Melo
  1 sibling, 0 replies; 28+ messages in thread
From: Arnaldo Carvalho de Melo @ 2018-10-29 17:55 UTC (permalink / raw)
  To: David Miller
  Cc: kan.liang, linux-kernel, wangnan0, jolsa, namhyung, kan.liang,
	ak, yao.jin, peterz

Em Mon, Oct 29, 2018 at 10:40:08AM -0700, David Miller escreveu:
> From: "Liang, Kan" <kan.liang@linux.intel.com>
> Date: Mon, 29 Oct 2018 10:33:06 -0400
> 
> > I just realized that the problem in KNL will be back if we switch
> > back to non-overwrite mode.
> 
>> What is KNL?

I guess its Knights Landing.

https://en.wikipedia.org/wiki/Xeon_Phi

- Arnaldo

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCHES/RFC] Re: A concern about overflow ring buffer mode
  2018-10-29 17:43                         ` David Miller
@ 2018-10-29 17:56                           ` Arnaldo Carvalho de Melo
  0 siblings, 0 replies; 28+ messages in thread
From: Arnaldo Carvalho de Melo @ 2018-10-29 17:56 UTC (permalink / raw)
  To: David Miller
  Cc: kan.liang, linux-kernel, wangnan0, jolsa, namhyung, kan.liang,
	ak, yao.jin, peterz

Em Mon, Oct 29, 2018 at 10:43:21AM -0700, David Miller escreveu:
> From: "Liang, Kan" <kan.liang@linux.intel.com>
> Date: Mon, 29 Oct 2018 11:11:25 -0400
> 
> > The processing time for each perf_top__mmap_read_idx() should not that
> > long. We may check it after each perf_top__mmap_read_idx(). Also
> > change the ui_warning to one-time warning. The patch as below can do
> > that (not test).
> 
> You cannot make ui__*() calls from the event processing thread.
> 
> I tried to make this point strongly over the weekend.
> 
> It hangs the event thread, because the ui call waits for a keypress
> but the display thread will eat them up and the event thread thus
> hangs in select().
> 
> So, once more, all ui calls must be avoided in event processing code.
> 
> Said another way, it is not legal to call UI interfaces from any
> code that could be called by the event thread.

Right, that global in fact needs to be set in the event processing and
checked in the display thread, start it with 0, switch to 1 and then to
2 to say it was processed just once, etc.

- Arnaldo

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCHES/RFC] Re: A concern about overflow ring buffer mode
  2018-10-29 17:48                         ` David Miller
@ 2018-10-29 18:20                           ` Liang, Kan
  2018-10-29 18:32                             ` Arnaldo Carvalho de Melo
  2018-10-29 21:16                             ` David Miller
  0 siblings, 2 replies; 28+ messages in thread
From: Liang, Kan @ 2018-10-29 18:20 UTC (permalink / raw)
  To: David Miller
  Cc: acme, linux-kernel, wangnan0, jolsa, namhyung, kan.liang, ak,
	yao.jin, peterz



On 10/29/2018 1:48 PM, David Miller wrote:
> From: "Liang, Kan" <kan.liang@linux.intel.com>
> Date: Mon, 29 Oct 2018 13:42:56 -0400
> 
>>
>>
>> On 10/29/2018 1:40 PM, David Miller wrote:
>>> From: "Liang, Kan" <kan.liang@linux.intel.com>
>>> Date: Mon, 29 Oct 2018 10:33:06 -0400
>>>
>>>> I just realized that the problem in KNL will be back if we switch
>>>> back to non-overwrite mode.
>>> What is KNL?
>>>
>> Intel Xeon Phi Processor, Knights Landing.
> 
> I don't understand how a specific piece of hardware directly leads to
> ring buffer processing timeouts, or multi-minute thread map processing
> times...

Perf top processes all samples in a serial way. With the number of CPU 
increasing under the heavy load, the number of samples increase 
dramatically. The processing time also increase significantly.
When the processing time is longer than display refresh time, only the 
stale data is shown.

I use KNL as an example. Because the problem is even worse on KNL. There 
is nothing output with perf top.

In theory, it's a problem for all large scale platforms.

> 
> You'll have to explain all of the details of your test scenerio, and
> the exact problems triggers, which

My test was the same as yours, just running a parallel kernel build on KNL.

> caused you to write these patches
> which causes serious regressions for what I consider a core simple use
> case of perf top.

I agree that the warning message is annoying. I will try to find another 
way to deliver the message. But I think we do need the warning message.

You didn't see any warning before the patch. I think it is just because 
perf top hides the problem.

Thanks,
Kan
> 
> And that's running perf top during a parallel kernel build.
> 
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCHES/RFC] Re: A concern about overflow ring buffer mode
  2018-10-29 18:20                           ` Liang, Kan
@ 2018-10-29 18:32                             ` Arnaldo Carvalho de Melo
  2018-10-29 22:32                               ` Liang, Kan
  2018-10-29 21:16                             ` David Miller
  1 sibling, 1 reply; 28+ messages in thread
From: Arnaldo Carvalho de Melo @ 2018-10-29 18:32 UTC (permalink / raw)
  To: Liang, Kan
  Cc: David Miller, linux-kernel, wangnan0, jolsa, namhyung, kan.liang,
	ak, yao.jin, peterz

Em Mon, Oct 29, 2018 at 02:20:15PM -0400, Liang, Kan escreveu:
> 
> 
> On 10/29/2018 1:48 PM, David Miller wrote:
> > From: "Liang, Kan" <kan.liang@linux.intel.com>
> > Date: Mon, 29 Oct 2018 13:42:56 -0400
> > 
> > > 
> > > 
> > > On 10/29/2018 1:40 PM, David Miller wrote:
> > > > From: "Liang, Kan" <kan.liang@linux.intel.com>
> > > > Date: Mon, 29 Oct 2018 10:33:06 -0400
> > > > 
> > > > > I just realized that the problem in KNL will be back if we switch
> > > > > back to non-overwrite mode.
> > > > What is KNL?
> > > > 
> > > Intel Xeon Phi Processor, Knights Landing.
> > 
> > I don't understand how a specific piece of hardware directly leads to
> > ring buffer processing timeouts, or multi-minute thread map processing
> > times...
> 
> Perf top processes all samples in a serial way. With the number of CPU
> increasing under the heavy load, the number of samples increase
> dramatically. The processing time also increase significantly.
> When the processing time is longer than display refresh time, only the stale
> data is shown.
> 
> I use KNL as an example. Because the problem is even worse on KNL. There is
> nothing output with perf top.
> 
> In theory, it's a problem for all large scale platforms.
> 
> > 
> > You'll have to explain all of the details of your test scenerio, and
> > the exact problems triggers, which
> 
> My test was the same as yours, just running a parallel kernel build on KNL.
> 
> > caused you to write these patches
> > which causes serious regressions for what I consider a core simple use
> > case of perf top.
> 
> I agree that the warning message is annoying. I will try to find another way
> to deliver the message. But I think we do need the warning message.

There is no problem with the message, the problem is the thread where
the message is being displayed, just signal the display thread to
display the warning, not doing that in the event processing thread.
 
> You didn't see any warning before the patch. I think it is just because perf
> top hides the problem.
> 
> Thanks,
> Kan
> > 
> > And that's running perf top during a parallel kernel build.
> > 
> > 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCHES/RFC] Re: A concern about overflow ring buffer mode
  2018-10-29 18:20                           ` Liang, Kan
  2018-10-29 18:32                             ` Arnaldo Carvalho de Melo
@ 2018-10-29 21:16                             ` David Miller
  1 sibling, 0 replies; 28+ messages in thread
From: David Miller @ 2018-10-29 21:16 UTC (permalink / raw)
  To: kan.liang
  Cc: acme, linux-kernel, wangnan0, jolsa, namhyung, kan.liang, ak,
	yao.jin, peterz

From: "Liang, Kan" <kan.liang@linux.intel.com>
Date: Mon, 29 Oct 2018 14:20:15 -0400

> You didn't see any warning before the patch. I think it is just
> because perf top hides the problem.

Quite honestly, the last time I played around with this:

1) The new ring buffer mode didn't exist

2) perf started up much more quickly and was much more responsive
   than it is these days

It used to handle a 128-cpu system doing a parallel kernel build
with no problems, no dropped events, nothing.

Something has changed to make perf more bloated and slow, and one by
one I'm trying to identify and deal with these issues rather than
just make perf abort when it can't keep up which is the approach
that has seem to have taken over.  That's to me is just wrong.

One point I want to make clear, dropping things like mmap events will
make perf run more slowly not more fast.

I keep trying to explain this over and over.

If you drop map events, you have no range into which to fit samples.

Therefore samples create a unique histogram entries, and the histogram
table grows to be quite huge.  And this slows down perf significantly
because every new sample and histogram decay walks this table, sorting
it, killing off old entries, finding a place for new ones, etc.

I really think dropping events causes more harm than good in this
case, therefore.

This also happens when perf cannot find a symbol, for example in a
binary or library with no symbols.  I see this as an area for
significant improvement.



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCHES/RFC] Re: A concern about overflow ring buffer mode
  2018-10-29 18:32                             ` Arnaldo Carvalho de Melo
@ 2018-10-29 22:32                               ` Liang, Kan
  2018-10-29 22:42                                 ` David Miller
  0 siblings, 1 reply; 28+ messages in thread
From: Liang, Kan @ 2018-10-29 22:32 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: David Miller, linux-kernel, wangnan0, jolsa, namhyung, kan.liang,
	ak, yao.jin, peterz


> There is no problem with the message, the problem is the thread where
> the message is being displayed, just signal the display thread to
> display the warning, not doing that in the event processing thread.
>   

How about this patch (just did some simple test)? It moves the warning 
to display thread.
I will find a KNL and do more test tomorrow.

 From 78e471c5c153c97f352dca8956ed03af81cb80ea Mon Sep 17 00:00:00 2001
From: Kan Liang <kan.liang@linux.intel.com>
Date: Mon, 29 Oct 2018 15:14:27 -0700
Subject: [PATCH] perf top: Move the timeout warning from event 
processing thread to display thread

The main event processing thread may hang if the ring buffer event
processing timeouts.

Analysis from David Miller:
"It hangs the event thread, because the ui call waits for a keypress
but the display thread will eat them up and the event thread thus
hangs in select()."

The timeout warning is moved to display thread.

The nr_rb_read is introduced to track the times of
perf_top__mmap_read(), which is the main function of event processing.
If the nr_rb_read doesn't increase during the refresh time, the display
thread may output stale data. The timeout warning will be triggered.

The timeout warning can only be triggered one time to avoid the annoying
and duplicated warning message.

The first perf_top__mmap_read() is moved to after display thread create.
Because the perf_top__mmap_read() could cost long time. For example, the
function may cost tens of minutes on Knights Landing platform with
parallel kernel build. There will be nothing displayed on the screent.
The display thread has to be created before perf_top__mmap_read(). But
at that time, the data is not ready. Sleep the refresh time in display
thread.

Fix: 8cc42de736b6 ("perf top: Check the latency of
perf_top__mmap_read()")
Reported-by: David Miller <davem@davemloft.net>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
  tools/perf/builtin-c2c.c       |  4 +--
  tools/perf/builtin-report.c    |  3 ++-
  tools/perf/builtin-top.c       | 39 +++++++++++++++++++---------
  tools/perf/ui/browsers/hists.c | 58 
++++++++++++++++++++++++++++++++++--------
  tools/perf/ui/browsers/hists.h |  2 +-
  tools/perf/util/hist.h         |  6 +++--
  tools/perf/util/top.h          |  1 +
  7 files changed, 85 insertions(+), 28 deletions(-)

diff --git a/tools/perf/builtin-c2c.c b/tools/perf/builtin-c2c.c
index f3aa9d0..1e77515 100644
--- a/tools/perf/builtin-c2c.c
+++ b/tools/perf/builtin-c2c.c
@@ -2371,7 +2371,7 @@ static int perf_c2c__browse_cacheline(struct 
hist_entry *he)
  	c2c_browser__update_nr_entries(browser);

  	while (1) {
-		key = hist_browser__run(browser, "? - help", true);
+		key = hist_browser__run(browser, "? - help", true, NULL);

  		switch (key) {
  		case 's':
@@ -2440,7 +2440,7 @@ static int perf_c2c__hists_browse(struct hists *hists)
  	c2c_browser__update_nr_entries(browser);

  	while (1) {
-		key = hist_browser__run(browser, "? - help", true);
+		key = hist_browser__run(browser, "? - help", true, NULL);

  		switch (key) {
  		case 'q':
diff --git a/tools/perf/builtin-report.c b/tools/perf/builtin-report.c
index 76e12bc..ed908e6 100644
--- a/tools/perf/builtin-report.c
+++ b/tools/perf/builtin-report.c
@@ -562,7 +562,8 @@ static int report__browse_hists(struct report *rep)
  		ret = perf_evlist__tui_browse_hists(evlist, help, NULL,
  						    rep->min_percent,
  						    &session->header.env,
-						    true, &rep->annotation_opts);
+						    true, &rep->annotation_opts,
+						    NULL);
  		/*
  		 * Usually "ret" is the last pressed key, and we only
  		 * care if the key notifies us to switch data file.
diff --git a/tools/perf/builtin-top.c b/tools/perf/builtin-top.c
index d21d875..95409de 100644
--- a/tools/perf/builtin-top.c
+++ b/tools/perf/builtin-top.c
@@ -584,6 +584,8 @@ static void *display_thread_tui(void *arg)
  		.refresh	= top->delay_secs,
  	};

+	sleep(top->delay_secs);
+
  	/* In order to read symbols from other namespaces perf to  needs to call
  	 * setns(2).  This isn't permitted if the struct_fs has multiple users.
  	 * unshare(2) the fs so that we may continue to setns into namespaces
@@ -607,7 +609,8 @@ static void *display_thread_tui(void *arg)
  				      top->min_percent,
  				      &top->session->header.env,
  				      !top->record_opts.overwrite,
-				      &top->annotation_opts);
+				      &top->annotation_opts,
+				      &top->nr_rb_read);

  	done = 1;
  	return NULL;
@@ -633,6 +636,11 @@ static void *display_thread(void *arg)
  	struct termios save;
  	struct perf_top *top = arg;
  	int delay_msecs, c;
+	bool rb_read_timeout_warned = false;
+	bool rb_read_timeout = false;
+	int last_nr_rb_read = 0;
+
+	sleep(top->delay_secs);

  	/* In order to read symbols from other namespaces perf to  needs to call
  	 * setns(2).  This isn't permitted if the struct_fs has multiple users.
@@ -651,12 +659,26 @@ static void *display_thread(void *arg)

  	while (!done) {
  		perf_top__print_sym_table(top);
+
+		if (!rb_read_timeout_warned && rb_read_timeout) {
+			color_fprintf(stdout, PERF_COLOR_RED,
+				      "Too slow to read ring buffer.\n"
+				      "Please try increasing the period (-c) or\n"
+				      "decreasing the freq (-F) or\n"
+				      "limiting the number of CPUs (-C)\n");
+			rb_read_timeout_warned = true;
+		}
  		/*
  		 * Either timeout expired or we got an EINTR due to SIGWINCH,
  		 * refresh screen in both cases.
  		 */
  		switch (poll(&stdin_poll, 1, delay_msecs)) {
  		case 0:
+			if (atomic_read(&top->nr_rb_read) == last_nr_rb_read)
+				rb_read_timeout = true;
+			else
+				rb_read_timeout = false;
+			last_nr_rb_read = atomic_read(&top->nr_rb_read);
  			continue;
  		case -1:
  			if (errno == EINTR)
@@ -881,10 +903,10 @@ static void perf_top__mmap_read(struct perf_top *top)
  {
  	bool overwrite = top->record_opts.overwrite;
  	struct perf_evlist *evlist = top->evlist;
-	unsigned long long start, end;
  	int i;

-	start = rdclock();
+	atomic_inc(&top->nr_rb_read);
+
  	if (overwrite)
  		perf_evlist__toggle_bkw_mmap(evlist, BKW_MMAP_DATA_PENDING);

@@ -895,13 +917,6 @@ static void perf_top__mmap_read(struct perf_top *top)
  		perf_evlist__toggle_bkw_mmap(evlist, BKW_MMAP_EMPTY);
  		perf_evlist__toggle_bkw_mmap(evlist, BKW_MMAP_RUNNING);
  	}
-	end = rdclock();
-
-	if ((end - start) > (unsigned long long)top->delay_secs * NSEC_PER_SEC)
-		ui__warning("Too slow to read ring buffer.\n"
-			    "Please try increasing the period (-c) or\n"
-			    "decreasing the freq (-F) or\n"
-			    "limiting the number of CPUs (-C)\n");
  }

  /*
@@ -1137,8 +1152,6 @@ static int __cmd_top(struct perf_top *top)
  	/* Wait for a minimal set of events before starting the snapshot */
  	perf_evlist__poll(top->evlist, 100);

-	perf_top__mmap_read(top);
-
  	ret = -1;
  	if (pthread_create(&thread, NULL, (use_browser > 0 ? display_thread_tui :
  							    display_thread), top)) {
@@ -1146,6 +1159,8 @@ static int __cmd_top(struct perf_top *top)
  		goto out_delete;
  	}

+	perf_top__mmap_read(top);
+
  	if (top->realtime_prio) {
  		struct sched_param param;

diff --git a/tools/perf/ui/browsers/hists.c b/tools/perf/ui/browsers/hists.c
index a96f62c..bda3e74 100644
--- a/tools/perf/ui/browsers/hists.c
+++ b/tools/perf/ui/browsers/hists.c
@@ -606,18 +606,29 @@ static void ui_browser__warn_lost_events(struct 
ui_browser *browser)
  		"Or reduce the sampling frequency.");
  }

+static void ui_browser__warn_rb_read_timeout(struct ui_browser *browser)
+{
+	ui_browser__warning(browser, 4,
+		"Too slow to read ring buffer.\n\n"
+		"Please try increasing the period (-c) or\n\n"
+		"decreasing the freq (-F) or\n\n"
+		"limiting the number of CPUs (-C)\n\n");
+}
+
  static int hist_browser__title(struct hist_browser *browser, char *bf, 
size_t size)
  {
  	return browser->title ? browser->title(browser, bf, size) : 0;
  }

  int hist_browser__run(struct hist_browser *browser, const char *help,
-		      bool warn_lost_event)
+		      bool warn_lost_event, atomic_t *nr_rb_read)
  {
  	int key;
  	char title[160];
  	struct hist_browser_timer *hbt = browser->hbt;
  	int delay_secs = hbt ? hbt->refresh : 0;
+	int last_nr_rb_read = nr_rb_read ? atomic_read(nr_rb_read) : 0;
+	bool rb_read_timeout_warned = false;

  	browser->b.entries = &browser->hists->entries;
  	browser->b.nr_entries = hist_browser__nr_entries(browser);
@@ -650,6 +661,15 @@ int hist_browser__run(struct hist_browser *browser, 
const char *help,
  				ui_browser__warn_lost_events(&browser->b);
  			}

+			if (nr_rb_read) {
+				if (!rb_read_timeout_warned &&
+				    (last_nr_rb_read == atomic_read(nr_rb_read))) {
+					ui_browser__warn_rb_read_timeout(&browser->b);
+					rb_read_timeout_warned = true;
+				}
+				last_nr_rb_read = atomic_read(nr_rb_read);
+			}
+
  			hist_browser__title(browser, title, sizeof(title));
  			ui_browser__show_title(&browser->b, title);
  			continue;
@@ -2703,7 +2723,8 @@ static int perf_evsel__hists_browse(struct 
perf_evsel *evsel, int nr_events,
  				    float min_pcnt,
  				    struct perf_env *env,
  				    bool warn_lost_event,
-				    struct annotation_options *annotation_opts)
+				    struct annotation_options *annotation_opts,
+				    atomic_t *nr_rb_read)
  {
  	struct hists *hists = evsel__hists(evsel);
  	struct hist_browser *browser = perf_evsel_browser__new(evsel, hbt, 
env, annotation_opts);
@@ -2785,7 +2806,8 @@ static int perf_evsel__hists_browse(struct 
perf_evsel *evsel, int nr_events,
  		nr_options = 0;

  		key = hist_browser__run(browser, helpline,
-					warn_lost_event);
+					warn_lost_event,
+					nr_rb_read);

  		if (browser->he_selection != NULL) {
  			thread = hist_browser__selected_thread(browser);
@@ -3070,6 +3092,8 @@ struct perf_evsel_menu {
  	struct perf_evsel *selection;
  	struct annotation_options *annotation_opts;
  	bool lost_events, lost_events_warned;
+	bool rb_read_timeout_warned;
+	int last_nr_rb_read;
  	float min_pcnt;
  	struct perf_env *env;
  };
@@ -3127,7 +3151,8 @@ static void perf_evsel_menu__write(struct 
ui_browser *browser,
  static int perf_evsel_menu__run(struct perf_evsel_menu *menu,
  				int nr_events, const char *help,
  				struct hist_browser_timer *hbt,
-				bool warn_lost_event)
+				bool warn_lost_event,
+				atomic_t *nr_rb_read)
  {
  	struct perf_evlist *evlist = menu->b.priv;
  	struct perf_evsel *pos;
@@ -3152,6 +3177,14 @@ static int perf_evsel_menu__run(struct 
perf_evsel_menu *menu,
  				ui_browser__warn_lost_events(&menu->b);
  				menu->lost_events_warned = true;
  			}
+			if (nr_rb_read) {
+				if (!menu->rb_read_timeout_warned &&
+				    (menu->last_nr_rb_read == atomic_read(nr_rb_read))) {
+					ui_browser__warn_rb_read_timeout(&menu->b);
+					menu->rb_read_timeout_warned = true;
+				}
+				menu->last_nr_rb_read = atomic_read(nr_rb_read);
+			}
  			continue;
  		case K_RIGHT:
  		case K_ENTER:
@@ -3171,7 +3204,8 @@ static int perf_evsel_menu__run(struct 
perf_evsel_menu *menu,
  						       menu->min_pcnt,
  						       menu->env,
  						       warn_lost_event,
-						       menu->annotation_opts);
+						       menu->annotation_opts,
+						       nr_rb_read);
  			ui_browser__show_title(&menu->b, title);
  			switch (key) {
  			case K_TAB:
@@ -3231,7 +3265,8 @@ static int __perf_evlist__tui_browse_hists(struct 
perf_evlist *evlist,
  					   float min_pcnt,
  					   struct perf_env *env,
  					   bool warn_lost_event,
-					   struct annotation_options *annotation_opts)
+					   struct annotation_options *annotation_opts,
+					   atomic_t *nr_rb_read)
  {
  	struct perf_evsel *pos;
  	struct perf_evsel_menu menu = {
@@ -3260,7 +3295,7 @@ static int __perf_evlist__tui_browse_hists(struct 
perf_evlist *evlist,
  	}

  	return perf_evsel_menu__run(&menu, nr_entries, help,
-				    hbt, warn_lost_event);
+				    hbt, warn_lost_event, nr_rb_read);
  }

  int perf_evlist__tui_browse_hists(struct perf_evlist *evlist, const 
char *help,
@@ -3268,7 +3303,8 @@ int perf_evlist__tui_browse_hists(struct 
perf_evlist *evlist, const char *help,
  				  float min_pcnt,
  				  struct perf_env *env,
  				  bool warn_lost_event,
-				  struct annotation_options *annotation_opts)
+				  struct annotation_options *annotation_opts,
+				  atomic_t *nr_rb_read)
  {
  	int nr_entries = evlist->nr_entries;

@@ -3279,7 +3315,8 @@ int perf_evlist__tui_browse_hists(struct 
perf_evlist *evlist, const char *help,
  		return perf_evsel__hists_browse(first, nr_entries, help,
  						false, hbt, min_pcnt,
  						env, warn_lost_event,
-						annotation_opts);
+						annotation_opts,
+						nr_rb_read);
  	}

  	if (symbol_conf.event_group) {
@@ -3298,5 +3335,6 @@ int perf_evlist__tui_browse_hists(struct 
perf_evlist *evlist, const char *help,
  	return __perf_evlist__tui_browse_hists(evlist, nr_entries, help,
  					       hbt, min_pcnt, env,
  					       warn_lost_event,
-					       annotation_opts);
+					       annotation_opts,
+					       nr_rb_read);
  }
diff --git a/tools/perf/ui/browsers/hists.h b/tools/perf/ui/browsers/hists.h
index 91d3e18..af6c1b9 100644
--- a/tools/perf/ui/browsers/hists.h
+++ b/tools/perf/ui/browsers/hists.h
@@ -32,7 +32,7 @@ struct hist_browser {
  struct hist_browser *hist_browser__new(struct hists *hists);
  void hist_browser__delete(struct hist_browser *browser);
  int hist_browser__run(struct hist_browser *browser, const char *help,
-		      bool warn_lost_event);
+		      bool warn_lost_event, atomic_t *nr_rb_read);
  void hist_browser__init(struct hist_browser *browser,
  			struct hists *hists);
  #endif /* _PERF_UI_BROWSER_HISTS_H_ */
diff --git a/tools/perf/util/hist.h b/tools/perf/util/hist.h
index 3badd7f..67f31163 100644
--- a/tools/perf/util/hist.h
+++ b/tools/perf/util/hist.h
@@ -442,7 +442,8 @@ int perf_evlist__tui_browse_hists(struct perf_evlist 
*evlist, const char *help,
  				  float min_pcnt,
  				  struct perf_env *env,
  				  bool warn_lost_event,
-				  struct annotation_options *annotation_options);
+				  struct annotation_options *annotation_options,
+				  atomic_t *nr_rb_read);
  int script_browse(const char *script_opt);
  #else
  static inline
@@ -452,7 +453,8 @@ int perf_evlist__tui_browse_hists(struct perf_evlist 
*evlist __maybe_unused,
  				  float min_pcnt __maybe_unused,
  				  struct perf_env *env __maybe_unused,
  				  bool warn_lost_event __maybe_unused,
-				  struct annotation_options *annotation_options __maybe_unused)
+				  struct annotation_options *annotation_options __maybe_unused,
+				  atomic_t *nr_rb_read __maybe_unused)
  {
  	return 0;
  }
diff --git a/tools/perf/util/top.h b/tools/perf/util/top.h
index 9add1f7..2869d9d 100644
--- a/tools/perf/util/top.h
+++ b/tools/perf/util/top.h
@@ -40,6 +40,7 @@ struct perf_top {
  	const char	   *sym_filter;
  	float		   min_percent;
  	unsigned int	   nr_threads_synthesize;
+	atomic_t	   nr_rb_read;
  };

  #define CONSOLE_CLEAR ""
-- 
2.7.4



^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCHES/RFC] Re: A concern about overflow ring buffer mode
  2018-10-29 22:32                               ` Liang, Kan
@ 2018-10-29 22:42                                 ` David Miller
  2018-10-30  1:54                                   ` Liang, Kan
  0 siblings, 1 reply; 28+ messages in thread
From: David Miller @ 2018-10-29 22:42 UTC (permalink / raw)
  To: kan.liang
  Cc: acme, linux-kernel, wangnan0, jolsa, namhyung, kan.liang, ak,
	yao.jin, peterz

From: "Liang, Kan" <kan.liang@linux.intel.com>
Date: Mon, 29 Oct 2018 18:32:40 -0400

> -				  struct annotation_options *annotation_options __maybe_unused)
> + struct annotation_options *annotation_options __maybe_unused,
> +				  atomic_t *nr_rb_read __maybe_unused)
>  {

What is going on with the indentations of this patch?

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCHES/RFC] Re: A concern about overflow ring buffer mode
  2018-10-29 22:42                                 ` David Miller
@ 2018-10-30  1:54                                   ` Liang, Kan
  0 siblings, 0 replies; 28+ messages in thread
From: Liang, Kan @ 2018-10-30  1:54 UTC (permalink / raw)
  To: David Miller
  Cc: acme, linux-kernel, wangnan0, jolsa, namhyung, kan.liang, ak,
	yao.jin, peterz



On 10/29/2018 6:42 PM, David Miller wrote:
> From: "Liang, Kan" <kan.liang@linux.intel.com>
> Date: Mon, 29 Oct 2018 18:32:40 -0400
> 
>> -				  struct annotation_options *annotation_options __maybe_unused)
>> + struct annotation_options *annotation_options __maybe_unused,
>> +				  atomic_t *nr_rb_read __maybe_unused)
>>   {
> 
> What is going on with the indentations of this patch?
> 

Sorry, my editor auto wraps the line.
The patch has been sent in a separate email.

Thanks,
Kan

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCHES/RFC] Re: A concern about overflow ring buffer mode
  2018-10-29 14:33                   ` Liang, Kan
  2018-10-29 14:35                     ` Arnaldo Carvalho de Melo
  2018-10-29 17:40                     ` David Miller
@ 2018-10-30 19:05                     ` David Miller
  2 siblings, 0 replies; 28+ messages in thread
From: David Miller @ 2018-10-30 19:05 UTC (permalink / raw)
  To: kan.liang
  Cc: acme, linux-kernel, wangnan0, jolsa, namhyung, kan.liang, ak,
	yao.jin, peterz

From: "Liang, Kan" <kan.liang@linux.intel.com>
Date: Mon, 29 Oct 2018 10:33:06 -0400

> The problem is that users have to wait tens of minutes to see perf
> top results on the screen in KNL. Before that, there is nothing but
> a black screen.

I'm actually curious why so I started digging last night.

First, I made perf top exit when it is done synthesizing existing
threads and profiled this during a full workload.

It takes about 3 seconds on this 128 cpu sparc64 system.

Some curious things in there, first of all we spend a lot of time
looking for the hugetlbfs mount point.

And sure enough the FS ABI code keeps reparsing the entire
/proc/mounts file every time hugetlbfs__mountpoint() is called.  It's
logic is that if the mountpoint wasn't found on a previous call it
rechecks everything.  For perf's usage, this is unnecessary overhead.

Simply mounting hugetlbfs gave me half a second back in startup time,
so I was down to 2.5 seconds from 3 seconds already.

Next I found that perf execution time during thread synthesis is
dominated by sscanf() processing.  So I went into the history books
and found that back in the 3.x days we parsed the file by hand, so I
brought that code back and extended it for what mmap2 events need.

That patch is at the end of this email, ignore the XXX bits as I was
too lazy to remove all of the mmap timeout code but I am absolutely
certain that is the right thing to do.

This gave me another half second or so back, and startup to this end
of thread synthesization is now less than 2 seconds for this workload.

Next, I let the perf top startup continue up until right before the
display thread is started.  This takes a minute or more total, and is
dominated by:

 time   seconds   seconds    calls   s/call   s/call  name
 28.30     12.70    12.70  1927341     0.00     0.00  __hists__add_entry
 12.77     18.43     5.73 27844359     0.00     0.00  sort__dso_cmp
 10.21     23.01     4.58 23074107     0.00     0.00  sort__sym_cmp
  8.29     26.73     3.72  1050281     0.00     0.00  dso__find_symbol
  4.95     28.95     2.22 106607184     0.00     0.00  perf_hpp__is_dynamic_entry

The histogram code is _insanely_ expensive and the algorithmic
complexity is quite high.  It does rbtree walks, doing a dso
comparison and then a symbol comparison at each and every step in the
walk.  The symbol comparison is a full blown strcmp() and the
histogram table in this situation is huge.

This is where the real problems are, not in the simple mmap processing
and other places where we've put timouts and other hacks that really
don't try to address the fundamental problems perf has in these
situations.

Thanks.

diff --git a/tools/perf/util/event.c b/tools/perf/util/event.c
index bc64618..70597fd 100644
--- a/tools/perf/util/event.c
+++ b/tools/perf/util/event.c
@@ -317,6 +318,30 @@ static int perf_event__synthesize_fork(struct perf_tool *tool,
 	return 0;
 }
 
+static int dec(char ch)
+{
+	if ((ch >= '0') && (ch <= '9'))
+		return ch - '0';
+	return -1;
+}
+
+static int dec2u64(const char *ptr, u64 *long_val)
+{
+	const char *p = ptr;
+
+	*long_val = 0;
+	while (*p) {
+		const int dec_val = dec(*p);
+
+		if (dec_val < 0)
+			break;
+
+		*long_val = (*long_val * 10) + dec_val;
+		p++;
+	}
+	return p - ptr;
+}
+
 int perf_event__synthesize_mmap_events(struct perf_tool *tool,
 				       union perf_event *event,
 				       pid_t pid, pid_t tgid,
@@ -327,13 +352,12 @@ int perf_event__synthesize_mmap_events(struct perf_tool *tool,
 {
 	char filename[PATH_MAX];
 	FILE *fp;
-	unsigned long long t;
-	bool truncation = false;
-	unsigned long long timeout = proc_map_timeout * 1000000ULL;
 	int rc = 0;
 	const char *hugetlbfs_mnt = hugetlbfs__mountpoint();
 	int hugetlbfs_mnt_len = hugetlbfs_mnt ? strlen(hugetlbfs_mnt) : 0;
 
+	(void) proc_map_timeout; /* XXX */
+
 	if (machine__is_default_guest(machine))
 		return 0;
 
@@ -350,87 +374,99 @@ int perf_event__synthesize_mmap_events(struct perf_tool *tool,
 	}
 
 	event->header.type = PERF_RECORD_MMAP2;
-	t = rdclock();
 
 	while (1) {
-		char bf[BUFSIZ];
-		char prot[5];
-		char execname[PATH_MAX];
+		char bf[BUFSIZ], *pbf = bf;
 		char anonstr[] = "//anon";
-		unsigned int ino;
+		char *execname;
 		size_t size;
 		ssize_t n;
+		u64 tmp;
 
 		if (fgets(bf, sizeof(bf), fp) == NULL)
 			break;
 
-		if ((rdclock() - t) > timeout) {
-			pr_warning("Reading %s time out. "
-				   "You may want to increase "
-				   "the time limit by --proc-map-timeout\n",
-				   filename);
-			truncation = true;
-			goto out;
-		}
-
-		/* ensure null termination since stack will be reused. */
-		strcpy(execname, "");
-
 		/* 00400000-0040c000 r-xp 00000000 fd:01 41038  /bin/cat */
-		n = sscanf(bf, "%"PRIx64"-%"PRIx64" %s %"PRIx64" %x:%x %u %[^\n]\n",
-		       &event->mmap2.start, &event->mmap2.len, prot,
-		       &event->mmap2.pgoff, &event->mmap2.maj,
-		       &event->mmap2.min,
-		       &ino, execname);
-
-		/*
- 		 * Anon maps don't have the execname.
- 		 */
-		if (n < 7)
+		n = hex2u64(pbf, &event->mmap2.start);
+		if (n < 0)
 			continue;
-
-		event->mmap2.ino = (u64)ino;
-
-		/*
-		 * Just like the kernel, see __perf_event_mmap in kernel/perf_event.c
-		 */
-		if (machine__is_host(machine))
-			event->header.misc = PERF_RECORD_MISC_USER;
-		else
-			event->header.misc = PERF_RECORD_MISC_GUEST_USER;
+		pbf += n + 1;
+		n = hex2u64(pbf, &event->mmap2.len);
+		if (n < 0)
+			continue;
+		pbf += n + 1;
 
 		/* map protection and flags bits */
 		event->mmap2.prot = 0;
 		event->mmap2.flags = 0;
-		if (prot[0] == 'r')
+		if (pbf[0] == 'r')
 			event->mmap2.prot |= PROT_READ;
-		if (prot[1] == 'w')
+		if (pbf[1] == 'w')
 			event->mmap2.prot |= PROT_WRITE;
-		if (prot[2] == 'x')
+		if (pbf[2] == 'x')
 			event->mmap2.prot |= PROT_EXEC;
 
-		if (prot[3] == 's')
+		if (pbf[3] == 's')
 			event->mmap2.flags |= MAP_SHARED;
 		else
 			event->mmap2.flags |= MAP_PRIVATE;
 
-		if (prot[2] != 'x') {
-			if (!mmap_data || prot[0] != 'r')
+		if (pbf[2] != 'x') {
+			if (!mmap_data || pbf[0] != 'r')
 				continue;
 
 			event->header.misc |= PERF_RECORD_MISC_MMAP_DATA;
 		}
 
-out:
-		if (truncation)
-			event->header.misc |= PERF_RECORD_MISC_PROC_MAP_PARSE_TIMEOUT;
+		pbf += 5;
+		n = hex2u64(pbf, &event->mmap2.pgoff);
+		if (n < 0)
+			continue;
+		pbf += n + 1;
+
+		n = hex2u64(pbf, &tmp);
+		if (n < 0)
+			continue;
+		event->mmap2.maj = tmp;
+		pbf += n + 1;
+
+		n = hex2u64(pbf, &tmp);
+		if (n < 0)
+			continue;
+		event->mmap2.min = tmp;
+		pbf += n + 1;
 
-		if (!strcmp(execname, ""))
-			strcpy(execname, anonstr);
+		n = dec2u64(pbf, &event->mmap2.ino);
+		if (n < 0)
+			continue;
+		pbf += n;
+
+		execname = strchr(pbf, '/');
+		if (!execname)
+			execname = strchr(pbf, '[');
+
+		/*
+		 * Anon map, skip.
+		 */
+		if (!execname)
+			continue;
+
+		pbf = strchr(execname, '\n');
+		if (!pbf)
+			continue;
+		*pbf = '\0';
+
+		/*
+		 * Just like the kernel, see __perf_event_mmap in kernel/perf_event.c
+		 */
+		if (machine__is_host(machine))
+			event->header.misc = PERF_RECORD_MISC_USER;
+		else
+			event->header.misc = PERF_RECORD_MISC_GUEST_USER;
 
 		if (hugetlbfs_mnt_len &&
 		    !strncmp(execname, hugetlbfs_mnt, hugetlbfs_mnt_len)) {
-			strcpy(execname, anonstr);
+			execname = anonstr;
 			event->mmap2.flags |= MAP_HUGETLB;
 		}
 
@@ -449,9 +485,6 @@ int perf_event__synthesize_mmap_events(struct perf_tool *tool,
 			rc = -1;
 			break;
 		}
-
-		if (truncation)
-			break;
 	}
 
 	fclose(fp);

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [tip:perf/urgent] perf top: Do not use overwrite mode by default
  2018-10-26 20:11               ` Liang, Kan
  2018-10-26 20:43                 ` Arnaldo Carvalho de Melo
  2018-10-29 13:03                 ` [PATCHES/RFC] " Arnaldo Carvalho de Melo
@ 2018-10-31 22:03                 ` tip-bot for Arnaldo Carvalho de Melo
  2 siblings, 0 replies; 28+ messages in thread
From: tip-bot for Arnaldo Carvalho de Melo @ 2018-10-31 22:03 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: tglx, hpa, linux-kernel, namhyung, dsahern, davem, wangnan0,
	adrian.hunter, kan.liang, acme, jolsa, mingo

Commit-ID:  218d61110f69632974034b6e27686ce482a1c455
Gitweb:     https://git.kernel.org/tip/218d61110f69632974034b6e27686ce482a1c455
Author:     Arnaldo Carvalho de Melo <acme@redhat.com>
AuthorDate: Mon, 29 Oct 2018 09:47:00 -0300
Committer:  Arnaldo Carvalho de Melo <acme@redhat.com>
CommitDate: Wed, 31 Oct 2018 09:57:31 -0300

perf top: Do not use overwrite mode by default

Enabling --overwrite mode allows us to to use just the most recent
records, which helps in high core count machines such as Knights
Landing/Mill, but right now is being disabled by default as the pausing
used in this technique is leading to loss of metadata events such as
PERF_RECORD_MMAP which makes 'perf top' unable to resolve samples,
leading to lots of unknown samples appearing on the UI.

Enabling this may be useful if you are in such machines and profiling a
workload that doesn't creates short lived threads and/or doesn't uses
many executable mmap operations.

Work is being planed to solve this situation, till then, this will
remain disabled by default.

Reported-by: David Miller <davem@davemloft.net>
Acked-by: Kan Liang <kan.liang@intel.com>
Link: https://lkml.kernel.org/r/4f84468f-37d9-cf1b-12c1-514ef74b6a48@linux.intel.com
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Wang Nan <wangnan0@huawei.com>
Fixes: ebebbf082357 ("perf top: Switch default mode to overwrite mode")
Link: https://lkml.kernel.org/n/tip-ehvf77vi1si9409r7p4wx788@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
---
 tools/perf/Documentation/perf-top.txt | 11 ++++++++---
 tools/perf/builtin-top.c              | 11 +++++++++--
 2 files changed, 17 insertions(+), 5 deletions(-)

diff --git a/tools/perf/Documentation/perf-top.txt b/tools/perf/Documentation/perf-top.txt
index d4be6061fe1c..808b664343c9 100644
--- a/tools/perf/Documentation/perf-top.txt
+++ b/tools/perf/Documentation/perf-top.txt
@@ -243,9 +243,14 @@ Default is to monitor all CPUS.
 	Enable hierarchy output.
 
 --overwrite::
-	This is the default, but for investigating problems with it or any other strange
-	behaviour like lots of unknown samples, we may want to disable this mode by using
-	--no-overwrite.
+	Enable this to use just the most recent records, which helps in high core count
+	machines such as Knights Landing/Mill, but right now is disabled by default as
+	the pausing used in this technique is leading to loss of metadata events such
+	as PERF_RECORD_MMAP which makes 'perf top' unable to resolve samples, leading
+	to lots of unknown samples appearing on the UI. Enable this if you are in such
+	machines and profiling a workload that doesn't creates short lived threads and/or
+	doesn't uses many executable mmap operations. Work is being planed to solve
+	this situation, till then, this will remain disabled by default.
 
 --force::
 	Don't do ownership validation.
diff --git a/tools/perf/builtin-top.c b/tools/perf/builtin-top.c
index 214fad747b04..8e29e0cc8626 100644
--- a/tools/perf/builtin-top.c
+++ b/tools/perf/builtin-top.c
@@ -1257,7 +1257,14 @@ int cmd_top(int argc, const char **argv)
 				.uses_mmap   = true,
 			},
 			.proc_map_timeout    = 500,
-			.overwrite	= 1,
+			/*
+			 * FIXME: This will lose PERF_RECORD_MMAP and other metadata
+			 * when we pause, fix that and reenable. Probably using a
+			 * separate evlist with a dummy event, i.e. a non-overwrite
+			 * ring buffer just for metadata events, while PERF_RECORD_SAMPLE
+			 * stays in overwrite mode. -acme
+			 * */
+			.overwrite	= 0,
 		},
 		.max_stack	     = sysctl__max_stack(),
 		.annotation_opts     = annotation__default_options,
@@ -1373,7 +1380,7 @@ int cmd_top(int argc, const char **argv)
 	OPT_BOOLEAN(0, "hierarchy", &symbol_conf.report_hierarchy,
 		    "Show entries in a hierarchy"),
 	OPT_BOOLEAN(0, "overwrite", &top.record_opts.overwrite,
-		    "Use a backward ring buffer, default: yes"),
+		    "Use a backward ring buffer, default: no"),
 	OPT_BOOLEAN(0, "force", &symbol_conf.force, "don't complain, do it"),
 	OPT_UINTEGER(0, "num-thread-synthesize", &top.nr_threads_synthesize,
 			"number of thread to run event synthesize"),

^ permalink raw reply related	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2018-10-31 22:03 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-10-26 17:45 A concern about overflow ring buffer mode David Miller
2018-10-26 18:38 ` Arnaldo Carvalho de Melo
2018-10-26 18:42   ` Arnaldo Carvalho de Melo
2018-10-26 19:02     ` Arnaldo Carvalho de Melo
2018-10-26 19:07       ` Liang, Kan
2018-10-26 19:12         ` Arnaldo Carvalho de Melo
2018-10-26 19:16           ` Liang, Kan
2018-10-26 19:24             ` Arnaldo Carvalho de Melo
2018-10-26 20:11               ` Liang, Kan
2018-10-26 20:43                 ` Arnaldo Carvalho de Melo
2018-10-29 13:03                 ` [PATCHES/RFC] " Arnaldo Carvalho de Melo
2018-10-29 14:33                   ` Liang, Kan
2018-10-29 14:35                     ` Arnaldo Carvalho de Melo
2018-10-29 15:11                       ` Liang, Kan
2018-10-29 17:43                         ` David Miller
2018-10-29 17:56                           ` Arnaldo Carvalho de Melo
2018-10-29 17:40                     ` David Miller
2018-10-29 17:42                       ` Liang, Kan
2018-10-29 17:48                         ` David Miller
2018-10-29 18:20                           ` Liang, Kan
2018-10-29 18:32                             ` Arnaldo Carvalho de Melo
2018-10-29 22:32                               ` Liang, Kan
2018-10-29 22:42                                 ` David Miller
2018-10-30  1:54                                   ` Liang, Kan
2018-10-29 21:16                             ` David Miller
2018-10-29 17:55                       ` Arnaldo Carvalho de Melo
2018-10-30 19:05                     ` David Miller
2018-10-31 22:03                 ` [tip:perf/urgent] perf top: Do not use overwrite mode by default tip-bot for Arnaldo Carvalho de Melo

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).