Re: [PATCH] perf intel-pt: Synthesize cycle events

From: Adrian Hunter <adrian.hunter@intel.com>
To: "Steinar H. Gunderson" <sesse@google.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Ingo Molnar <mingo@redhat.com>,
	Arnaldo Carvalho de Melo <acme@kernel.org>,
	Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>, Namhyung Kim <namhyung@kernel.org>,
	linux-perf-users@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH] perf intel-pt: Synthesize cycle events
Date: Fri, 11 Mar 2022 11:10:30 +0200	[thread overview]
Message-ID: <586de5fc-858b-2693-1986-5c77e8c0e3d0@intel.com> (raw)
In-Reply-To: <20220310093844.982656-1-sesse@google.com>

On 10.3.2022 11.38, Steinar H. Gunderson wrote:
> There is no good reason why we cannot synthesize "cycle" events
> from Intel PT just as we can synthesize "instruction" events,
> in particular when CYC packets are available. This enables using
> PT to getting much more accurate cycle profiles than regular sampling
> (record -e cycles) when the work last for very short periods (<10 ms).
> Thus, add support for this, based off of the existing IPC calculation
> framework. The new option to --itrace is "y" (for cYcles), as c was
> taken for calls. Cycle and instruction events can be synthesized
> together, and are by default.
> 
> The only real caveat is that CYC packets are only emitted whenever
> some other packet is, which in practice is when a branch instruction
> is encountered. Thus, even at no subsampling (e.g. --itrace=y0ns),
> it is impossible to get more accuracy than a single basic block, and all
> cycles spent executing that block will get attributed to the branch
> instruction that ends it. Thus, one cannot know whether the cycles came
> from e.g. a specific load, a mispredicted branch, or something else.
> When subsampling (which is the default), the cycle events will get
> smeared out even more, but will still be useful to attribute cycle
> counts to functions.
> 
> Signed-off-by: Steinar H. Gunderson <sesse@google.com>
> ---
>  tools/perf/Documentation/itrace.txt        |  3 +-
>  tools/perf/Documentation/perf-intel-pt.txt | 33 +++++++-----
>  tools/perf/util/auxtrace.c                 |  9 +++-
>  tools/perf/util/auxtrace.h                 |  7 ++-
>  tools/perf/util/intel-pt.c                 | 59 +++++++++++++++++++---
>  5 files changed, 88 insertions(+), 23 deletions(-)
> 

<SNIP>

> -static int intel_pt_synth_instruction_sample(struct intel_pt_queue *ptq)
> +static int intel_pt_synth_instruction_or_cycle_sample(struct intel_pt_queue *ptq)
>  {
>  	struct intel_pt *pt = ptq->pt;
>  	union perf_event *event = ptq->event_buf;
>  	struct perf_sample sample = { .ip = 0, };
> +	int err;
>  
>  	if (intel_pt_skip_event(pt))
>  		return 0;
> @@ -1633,7 +1639,7 @@ static int intel_pt_synth_instruction_sample(struct intel_pt_queue *ptq)
>  	else
>  		sample.period = ptq->state->tot_insn_cnt - ptq->last_insn_cnt;
>  
> -	if (ptq->sample_ipc)
> +	if (ptq->sample_ipc || pt->sample_cycles)

This is not quite right.  ptq->sample_ipc is set to indicate when the
cycle count is accurate for the current instruction.  It can be weakened
by using "Approx IPC" which was introduced for dlfilter-show-cycles.
Probably that approach should be followed for a "cycles" event also.

From perf-intel-pt man page:

dlfilter-show-cycles.so
~~~~~~~~~~~~~~~~~~~~~~~

Cycles can be displayed using dlfilter-show-cycles.so in which case the itrace A
option can be useful to provide higher granularity cycle information:

	perf script --itrace=A --call-trace --dlfilter dlfilter-show-cycles.so

To see a list of dlfilters:

	perf script -v --list-dlfilters

See also linkperf:perf-dlfilters[1]

>  		sample.cyc_cnt = ptq->ipc_cyc_cnt - ptq->last_in_cyc_cnt;
>  	if (sample.cyc_cnt) {
>  		sample.insn_cnt = ptq->ipc_insn_cnt - ptq->last_in_insn_cnt;
> @@ -1643,8 +1649,30 @@ static int intel_pt_synth_instruction_sample(struct intel_pt_queue *ptq)
>  
>  	ptq->last_insn_cnt = ptq->state->tot_insn_cnt;

There are variables here that are specific to the "instructions" event, so
mixing "cycles" with "instructions" means duplicating those, however maybe
it would be better not to allow "y" and "i" options at the same time?

>  
> -	return intel_pt_deliver_synth_event(pt, event, &sample,
> -					    pt->instructions_sample_type);
> +	if (pt->sample_instructions) {
> +		err = intel_pt_deliver_synth_event(pt, event, &sample,
> +						   pt->instructions_sample_type);
> +		if (err)
> +			return err;
> +	}
> +
> +	/*
> +	 * NOTE: If not doing sampling (e.g. itrace=y0us), we will in practice
> +	 * only see cycles being attributed to branches, since CYC packets
> +	 * only are emitted together with other packets are emitted.
> +	 * We should perhaps consider spreading it out over everything since
> +	 * the last CYC packet, ie., since last time sample.cyc_cnt was nonzero.
> +	 */
> +	if (pt->sample_cycles && sample.cyc_cnt) {
> +		sample.id = ptq->pt->cycles_id;
> +		sample.stream_id = ptq->pt->cycles_id;

A "cycles" sample needs to set the sample period to the number of cycles since the
last "cycles" sample.

> +		err = intel_pt_deliver_synth_event(pt, event, &sample,
> +						   pt->cycles_sample_type);
> +		if (err)
> +			return err;
> +	}
> +
> +	return 0;
>  }