Re: [PATCH] perf intel-pt: Synthesize cycle events

From: "Steinar H. Gunderson" <sesse@google.com>
To: Adrian Hunter <adrian.hunter@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>,
	Ingo Molnar <mingo@redhat.com>,
	Arnaldo Carvalho de Melo <acme@kernel.org>,
	Alexander Shishkin <alexander.shishkin@linux.intel.com>,
	Jiri Olsa <jolsa@redhat.com>, Namhyung Kim <namhyung@kernel.org>,
	linux-perf-users@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH] perf intel-pt: Synthesize cycle events
Date: Mon, 21 Mar 2022 17:58:08 +0100
Message-ID: <YjiuoEUL6jH32cBi@google.com>
In-Reply-To: <ffa56520-09b5-9c5d-7733-6767d2f8e350@intel.com>

On Mon, Mar 21, 2022 at 03:09:08PM +0200, Adrian Hunter wrote:
> Yes, it can cross calls and returns.  'returns' due to "Return Compression"
> which can be switched off at record time with config term noretcomp, but
> that may cause more overflows / trace data loss.
> 
> To get accurate times for a single function there is Intel PT
> address filtering.
> 
> Otherwise LBRs can have cycle times.

Many interesting points, I'll be sure to look into them.

Meanwhile, should I send a new patch with your latest changes?
It's more your patch than mine now, it seems, but I think we're
converging on something useful.

>> By the way, I noticed that synthesized call stacks do not respect
>> --inline; is that on purpose? The patch seems simple enough (just
>> a call to add_inlines), although it exposes extreme slowness in libbfd
>> when run over large binaries, which I'll have to look into.
>> (10+ ms for each address-to-symbol lookup is rather expensive when you
>> have 4M samples to churn through!)
> No, not on purpose.

The patch appears to be trivial:

--- a/tools/perf/util/machine.c
+++ b/tools/perf/util/machine.c
@@ -44,6 +44,7 @@
 #include <linux/zalloc.h>

 static void __machine__remove_thread(struct machine *machine, struct thread *th, bool lock);
+static int append_inlines(struct callchain_cursor *cursor, struct map_symbol *ms, u64 ip);

 static struct dso *machine__kernel_dso(struct machine *machine)
 {
@@ -2217,6 +2218,10 @@ static int add_callchain_ip(struct thread *thread,
 	ms.maps = al.maps;
 	ms.map = al.map;
 	ms.sym = al.sym;
+
+	if (append_inlines(cursor, &ms, ip) == 0)
+		return 0;
+
 	srcline = callchain_srcline(&ms, al.addr);
 	return callchain_cursor_append(cursor, ip, &ms,
 				       branch, flags, nr_loop_iter,

but I'm seeing problems with “failing to process sample” when I go from
10us to 1us, so I'll have to look into that.
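
To spell out the control flow that the hunk above adds, here is a minimal
standalone sketch. It is not perf code: every toy_* name is hypothetical,
the address 0x1000 and the frame names are made up, and it assumes that the
inline helper returns 0 only when it actually appended frames (and nonzero
when it found nothing or failed), which is what the `== 0` check in the
hunk relies on.

```c
#include <stddef.h>

#define TOY_MAX_FRAMES 16

/* Hypothetical stand-in for a callchain cursor: a flat list of frame names. */
struct toy_cursor {
	const char *frames[TOY_MAX_FRAMES];
	int nr;
};

/* Append one frame; returns 0 on success, -1 if the cursor is full. */
static int toy_cursor_append(struct toy_cursor *c, const char *name)
{
	if (c->nr >= TOY_MAX_FRAMES)
		return -1;
	c->frames[c->nr++] = name;
	return 0;
}

/*
 * Hypothetical inline lookup. Assumed convention: 0 if inline frames
 * were appended, 1 if the address has none, negative on error.
 * Only the made-up address 0x1000 expands to inlined frames here.
 */
static int toy_append_inlines(struct toy_cursor *c, unsigned long ip)
{
	if (ip != 0x1000)
		return 1;
	if (toy_cursor_append(c, "inlined_leaf") ||
	    toy_cursor_append(c, "outer_func"))
		return -1;
	return 0;
}

/*
 * Same shape as the patched function: try the inline expansion first
 * and fall back to appending a single frame for the plain symbol.
 */
static int toy_add_callchain_ip(struct toy_cursor *c, unsigned long ip,
				const char *sym)
{
	if (toy_append_inlines(c, ip) == 0)
		return 0;

	return toy_cursor_append(c, sym);
}
```

With an empty cursor, calling toy_add_callchain_ip() for 0x1000 adds two
frames (the inlined leaf, then its caller), while any other address adds
just the one frame for the symbol passed in.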

I've sent some libbfd patches to try to help with the slowness:

  https://sourceware.org/git/?p=binutils-gdb.git;a=commitdiff;h=30cbd32aec30b4bc13427bbd87c4c63c739d4578
  https://sourceware.org/pipermail/binutils/2022-March/120131.html
  https://sourceware.org/pipermail/binutils/2022-March/120133.html

/* Steinar */