* [PATCH] perf intel-pt: Synthesize cycle events
@ 2022-03-10  9:38 Steinar H. Gunderson
  2022-03-11  9:10 ` Adrian Hunter
  0 siblings, 1 reply; 19+ messages in thread
From: Steinar H. Gunderson @ 2022-03-10  9:38 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Adrian Hunter
  Cc: Jiri Olsa, Namhyung Kim, linux-perf-users, linux-kernel,
	Steinar H. Gunderson

There is no good reason why we cannot synthesize "cycle" events
from Intel PT just as we can synthesize "instruction" events,
in particular when CYC packets are available. This enables using
PT to get much more accurate cycle profiles than regular sampling
(record -e cycles) when the work lasts for very short periods (<10 ms).
Thus, add support for this, based on the existing IPC calculation
framework. The new option to --itrace is "y" (for cYcles), as c was
taken for calls. Cycle and instruction events can be synthesized
together, and are by default.

The only real caveat is that CYC packets are only emitted whenever
some other packet is, which in practice is when a branch instruction
is encountered. Thus, even with no subsampling (e.g. --itrace=y0ns),
it is impossible to get more accuracy than a single basic block, and all
cycles spent executing that block will get attributed to the branch
instruction that ends it. Thus, one cannot know whether the cycles came
from e.g. a specific load, a mispredicted branch, or something else.
When subsampling (which is the default), the cycle events will get
smeared out even more, but will still be useful to attribute cycle
counts to functions.
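
For example, to record with CYC packets and report only the synthesized
cycle events (the workload name is illustrative):

	perf record -e intel_pt/cyc/u -- ./my-workload
	perf report --itrace=y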

Signed-off-by: Steinar H. Gunderson <sesse@google.com>
---
 tools/perf/Documentation/itrace.txt        |  3 +-
 tools/perf/Documentation/perf-intel-pt.txt | 33 +++++++-----
 tools/perf/util/auxtrace.c                 |  9 +++-
 tools/perf/util/auxtrace.h                 |  7 ++-
 tools/perf/util/intel-pt.c                 | 59 +++++++++++++++++++---
 5 files changed, 88 insertions(+), 23 deletions(-)

diff --git a/tools/perf/Documentation/itrace.txt b/tools/perf/Documentation/itrace.txt
index c52755481e2f..af69d80a05b7 100644
--- a/tools/perf/Documentation/itrace.txt
+++ b/tools/perf/Documentation/itrace.txt
@@ -1,4 +1,5 @@
 		i	synthesize instructions events
+		y	synthesize cycles events
 		b	synthesize branches events (branch misses for Arm SPE)
 		c	synthesize branches events (calls only)
 		r	synthesize branches events (returns only)
@@ -23,7 +24,7 @@
 		A	approximate IPC
 		Z	prefer to ignore timestamps (so-called "timeless" decoding)
 
-	The default is all events i.e. the same as --itrace=ibxwpe,
+	The default is all events i.e. the same as --itrace=iybxwpe,
 	except for perf script where it is --itrace=ce
 
 	In addition, the period (default 100000, except for perf script where it is 1)
diff --git a/tools/perf/Documentation/perf-intel-pt.txt b/tools/perf/Documentation/perf-intel-pt.txt
index cbb920f5d056..d457facfde03 100644
--- a/tools/perf/Documentation/perf-intel-pt.txt
+++ b/tools/perf/Documentation/perf-intel-pt.txt
@@ -101,12 +101,12 @@ data is available you can use the 'perf script' tool with all itrace sampling
 options, which will list all the samples.
 
 	perf record -e intel_pt//u ls
-	perf script --itrace=ibxwpe
+	perf script --itrace=iybxwpe
 
 An interesting field that is not printed by default is 'flags' which can be
 displayed as follows:
 
-	perf script --itrace=ibxwpe -F+flags
+	perf script --itrace=iybxwpe -F+flags
 
 The flags are "bcrosyiABExgh" which stand for branch, call, return, conditional,
 system, asynchronous, interrupt, transaction abort, trace begin, trace end,
@@ -146,16 +146,17 @@ displayed as follows:
 There are two ways that instructions-per-cycle (IPC) can be calculated depending
 on the recording.
 
-If the 'cyc' config term (see config terms section below) was used, then IPC is
-calculated using the cycle count from CYC packets, otherwise MTC packets are
-used - refer to the 'mtc' config term.  When MTC is used, however, the values
-are less accurate because the timing is less accurate.
+If the 'cyc' config term (see config terms section below) was used, then IPC
+and cycle events are calculated using the cycle count from CYC packets, otherwise
+MTC packets are used - refer to the 'mtc' config term.  When MTC is used, however,
+the values are less accurate because the timing is less accurate.
 
 Because Intel PT does not update the cycle count on every branch or instruction,
 the values will often be zero.  When there are values, they will be the number
 of instructions and number of cycles since the last update, and thus represent
-the average IPC since the last IPC for that event type.  Note IPC for "branches"
-events is calculated separately from IPC for "instructions" events.
+the average IPC and cycle count since the last update for that event type.
+Note IPC for "branches" events is calculated separately from IPC for "instructions"
+events.
 
 Even with the 'cyc' config term, it is possible to produce IPC information for
 every change of timestamp, but at the expense of accuracy.  That is selected by
@@ -865,11 +866,12 @@ Having no option is the same as
 
 which, in turn, is the same as
 
-	--itrace=cepwx
+	--itrace=cepwxy
 
 The letters are:
 
 	i	synthesize "instructions" events
+	y	synthesize "cycles" events
 	b	synthesize "branches" events
 	x	synthesize "transactions" events
 	w	synthesize "ptwrite" events
@@ -890,6 +892,13 @@ The letters are:
 "Instructions" events look like they were recorded by "perf record -e
 instructions".
 
+"Cycles" events look like they were recorded by "perf record -e cycles"
+(ie., the default). Note that even with CYC packets enabled and no sampling,
+these are not fully accurate, since CYC packets are not emitted for each
+instruction, only when some other event (like a branch) happens causes
+a packet to be emitted. Thus, it is more effective for attributing cycles
+to functions and (possibly basic blocks) than to individual instructions.
+
 "Branches" events look like they were recorded by "perf record -e branches". "c"
 and "r" can be combined to get calls and returns.
 
@@ -897,9 +906,9 @@ and "r" can be combined to get calls and returns.
 'flags' field can be used in perf script to determine whether the event is a
 transaction start, commit or abort.
 
-Note that "instructions", "branches" and "transactions" events depend on code
-flow packets which can be disabled by using the config term "branch=0".  Refer
-to the config terms section above.
+Note that "instructions", "cycles", "branches" and "transactions" events
+depend on code flow packets which can be disabled by using the config term
+"branch=0".  Refer to the config terms section above.
 
 "ptwrite" events record the payload of the ptwrite instruction and whether
 "fup_on_ptw" was used.  "ptwrite" events depend on PTWRITE packets which are
diff --git a/tools/perf/util/auxtrace.c b/tools/perf/util/auxtrace.c
index 825336304a37..18e457b80bde 100644
--- a/tools/perf/util/auxtrace.c
+++ b/tools/perf/util/auxtrace.c
@@ -1346,6 +1346,7 @@ void itrace_synth_opts__set_default(struct itrace_synth_opts *synth_opts,
 		synth_opts->calls = true;
 	} else {
 		synth_opts->instructions = true;
+		synth_opts->cycles = true;
 		synth_opts->period_type = PERF_ITRACE_DEFAULT_PERIOD_TYPE;
 		synth_opts->period = PERF_ITRACE_DEFAULT_PERIOD;
 	}
@@ -1424,7 +1425,11 @@ int itrace_do_parse_synth_opts(struct itrace_synth_opts *synth_opts,
 	for (p = str; *p;) {
 		switch (*p++) {
 		case 'i':
-			synth_opts->instructions = true;
+		case 'y':
+			if (p[-1] == 'y')
+				synth_opts->cycles = true;
+			else
+				synth_opts->instructions = true;
 			while (*p == ' ' || *p == ',')
 				p += 1;
 			if (isdigit(*p)) {
@@ -1578,7 +1583,7 @@ int itrace_do_parse_synth_opts(struct itrace_synth_opts *synth_opts,
 		}
 	}
 out:
-	if (synth_opts->instructions) {
+	if (synth_opts->instructions || synth_opts->cycles) {
 		if (!period_type_set)
 			synth_opts->period_type =
 					PERF_ITRACE_DEFAULT_PERIOD_TYPE;
diff --git a/tools/perf/util/auxtrace.h b/tools/perf/util/auxtrace.h
index 19910b9011f3..7cd6bad3e46a 100644
--- a/tools/perf/util/auxtrace.h
+++ b/tools/perf/util/auxtrace.h
@@ -69,6 +69,9 @@ enum itrace_period_type {
  * @inject: indicates the event (not just the sample) must be fully synthesized
  *          because 'perf inject' will write it out
  * @instructions: whether to synthesize 'instructions' events
+ * @cycles: whether to synthesize 'cycles' events
+ *          (not fully accurate, since CYC packets are only emitted
+ *          together with other events, such as branches)
  * @branches: whether to synthesize 'branches' events
  *            (branch misses only for Arm SPE)
  * @transactions: whether to synthesize events for transactions
@@ -115,6 +118,7 @@ struct itrace_synth_opts {
 	bool			default_no_sample;
 	bool			inject;
 	bool			instructions;
+	bool			cycles;
 	bool			branches;
 	bool			transactions;
 	bool			ptwrites;
@@ -628,6 +632,7 @@ bool auxtrace__evsel_is_auxtrace(struct perf_session *session,
 
 #define ITRACE_HELP \
 "				i[period]:    		synthesize instructions events\n" \
+"				y[period]:    		synthesize cycles events (same period as i)\n" \
 "				b:	    		synthesize branches events (branch misses for Arm SPE)\n" \
 "				c:	    		synthesize branches events (calls only)\n"	\
 "				r:	    		synthesize branches events (returns only)\n" \
@@ -657,7 +662,7 @@ bool auxtrace__evsel_is_auxtrace(struct perf_session *session,
 "				A:			approximate IPC\n" \
 "				Z:			prefer to ignore timestamps (so-called \"timeless\" decoding)\n" \
 "				PERIOD[ns|us|ms|i|t]:   specify period to sample stream\n" \
-"				concatenate multiple options. Default is ibxwpe or cewp\n"
+"				concatenate multiple options. Default is iybxwpe or cewp\n"
 
 static inline
 void itrace_synth_opts__set_time_range(struct itrace_synth_opts *opts,
diff --git a/tools/perf/util/intel-pt.c b/tools/perf/util/intel-pt.c
index e8613cbda331..f6597db89962 100644
--- a/tools/perf/util/intel-pt.c
+++ b/tools/perf/util/intel-pt.c
@@ -5,6 +5,7 @@
  */
 
 #include <inttypes.h>
+#include <linux/perf_event.h>
 #include <stdio.h>
 #include <stdbool.h>
 #include <errno.h>
@@ -89,6 +90,10 @@ struct intel_pt {
 	u64 instructions_sample_type;
 	u64 instructions_id;
 
+	bool sample_cycles;
+	u64 cycles_sample_type;
+	u64 cycles_id;
+
 	bool sample_branches;
 	u32 branches_filter;
 	u64 branches_sample_type;
@@ -1217,7 +1222,7 @@ static struct intel_pt_queue *intel_pt_alloc_queue(struct intel_pt *pt,
 	if (pt->filts.cnt > 0)
 		params.pgd_ip = intel_pt_pgd_ip;
 
-	if (pt->synth_opts.instructions) {
+	if (pt->synth_opts.instructions || pt->synth_opts.cycles) {
 		if (pt->synth_opts.period) {
 			switch (pt->synth_opts.period_type) {
 			case PERF_ITRACE_PERIOD_INSTRUCTIONS:
@@ -1615,11 +1620,12 @@ static void intel_pt_prep_sample(struct intel_pt *pt,
 	}
 }
 
-static int intel_pt_synth_instruction_sample(struct intel_pt_queue *ptq)
+static int intel_pt_synth_instruction_or_cycle_sample(struct intel_pt_queue *ptq)
 {
 	struct intel_pt *pt = ptq->pt;
 	union perf_event *event = ptq->event_buf;
 	struct perf_sample sample = { .ip = 0, };
+	int err;
 
 	if (intel_pt_skip_event(pt))
 		return 0;
@@ -1633,7 +1639,7 @@ static int intel_pt_synth_instruction_sample(struct intel_pt_queue *ptq)
 	else
 		sample.period = ptq->state->tot_insn_cnt - ptq->last_insn_cnt;
 
-	if (ptq->sample_ipc)
+	if (ptq->sample_ipc || pt->sample_cycles)
 		sample.cyc_cnt = ptq->ipc_cyc_cnt - ptq->last_in_cyc_cnt;
 	if (sample.cyc_cnt) {
 		sample.insn_cnt = ptq->ipc_insn_cnt - ptq->last_in_insn_cnt;
@@ -1643,8 +1649,30 @@ static int intel_pt_synth_instruction_sample(struct intel_pt_queue *ptq)
 
 	ptq->last_insn_cnt = ptq->state->tot_insn_cnt;
 
-	return intel_pt_deliver_synth_event(pt, event, &sample,
-					    pt->instructions_sample_type);
+	if (pt->sample_instructions) {
+		err = intel_pt_deliver_synth_event(pt, event, &sample,
+						   pt->instructions_sample_type);
+		if (err)
+			return err;
+	}
+
+	/*
+	 * NOTE: If not doing sampling (e.g. itrace=y0us), we will in practice
+	 * only see cycles being attributed to branches, since CYC packets
+	 * are only emitted together with other packets. We should perhaps
+	 * consider spreading the count out over everything since the last
+	 * CYC packet, i.e., since the last time sample.cyc_cnt was nonzero.
+	 */
+	if (pt->sample_cycles && sample.cyc_cnt) {
+		sample.id = ptq->pt->cycles_id;
+		sample.stream_id = ptq->pt->cycles_id;
+		err = intel_pt_deliver_synth_event(pt, event, &sample,
+						   pt->cycles_sample_type);
+		if (err)
+			return err;
+	}
+
+	return 0;
 }
 
 static int intel_pt_synth_transaction_sample(struct intel_pt_queue *ptq)
@@ -2301,8 +2329,9 @@ static int intel_pt_sample(struct intel_pt_queue *ptq)
 		}
 	}
 
-	if (pt->sample_instructions && (state->type & INTEL_PT_INSTRUCTION)) {
-		err = intel_pt_synth_instruction_sample(ptq);
+	if ((pt->sample_instructions || pt->sample_cycles) &&
+	    (state->type & INTEL_PT_INSTRUCTION)) {
+		err = intel_pt_synth_instruction_or_cycle_sample(ptq);
 		if (err)
 			return err;
 	}
@@ -3378,6 +3407,22 @@ static int intel_pt_synth_events(struct intel_pt *pt,
 		id += 1;
 	}
 
+	if (pt->synth_opts.cycles) {
+		attr.config = PERF_COUNT_HW_CPU_CYCLES;
+		if (pt->synth_opts.period_type == PERF_ITRACE_PERIOD_NANOSECS)
+			attr.sample_period =
+				intel_pt_ns_to_ticks(pt, pt->synth_opts.period);
+		else
+			attr.sample_period = pt->synth_opts.period;
+		err = intel_pt_synth_event(session, "cycles", &attr, id);
+		if (err)
+			return err;
+		pt->sample_cycles = true;
+		pt->cycles_sample_type = attr.sample_type;
+		pt->cycles_id = id;
+		id += 1;
+	}
+
 	attr.sample_type &= ~(u64)PERF_SAMPLE_PERIOD;
 	attr.sample_period = 1;
 
-- 
2.35.1



* Re: [PATCH] perf intel-pt: Synthesize cycle events
  2022-03-10  9:38 [PATCH] perf intel-pt: Synthesize cycle events Steinar H. Gunderson
@ 2022-03-11  9:10 ` Adrian Hunter
  2022-03-11 17:42   ` Steinar H. Gunderson
  0 siblings, 1 reply; 19+ messages in thread
From: Adrian Hunter @ 2022-03-11  9:10 UTC (permalink / raw)
  To: Steinar H. Gunderson, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Alexander Shishkin
  Cc: Jiri Olsa, Namhyung Kim, linux-perf-users, linux-kernel

On 10.3.2022 11.38, Steinar H. Gunderson wrote:
> There is no good reason why we cannot synthesize "cycle" events
> from Intel PT just as we can synthesize "instruction" events,
> in particular when CYC packets are available. This enables using
> PT to get much more accurate cycle profiles than regular sampling
> (record -e cycles) when the work lasts for very short periods (<10 ms).
> Thus, add support for this, based on the existing IPC calculation
> framework. The new option to --itrace is "y" (for cYcles), as c was
> taken for calls. Cycle and instruction events can be synthesized
> together, and are by default.
> 
> The only real caveat is that CYC packets are only emitted whenever
> some other packet is, which in practice is when a branch instruction
> is encountered. Thus, even with no subsampling (e.g. --itrace=y0ns),
> it is impossible to get more accuracy than a single basic block, and all
> cycles spent executing that block will get attributed to the branch
> instruction that ends it. Thus, one cannot know whether the cycles came
> from e.g. a specific load, a mispredicted branch, or something else.
> When subsampling (which is the default), the cycle events will get
> smeared out even more, but will still be useful to attribute cycle
> counts to functions.
> 
> Signed-off-by: Steinar H. Gunderson <sesse@google.com>
> ---
>  tools/perf/Documentation/itrace.txt        |  3 +-
>  tools/perf/Documentation/perf-intel-pt.txt | 33 +++++++-----
>  tools/perf/util/auxtrace.c                 |  9 +++-
>  tools/perf/util/auxtrace.h                 |  7 ++-
>  tools/perf/util/intel-pt.c                 | 59 +++++++++++++++++++---
>  5 files changed, 88 insertions(+), 23 deletions(-)
> 

<SNIP>

> -static int intel_pt_synth_instruction_sample(struct intel_pt_queue *ptq)
> +static int intel_pt_synth_instruction_or_cycle_sample(struct intel_pt_queue *ptq)
>  {
>  	struct intel_pt *pt = ptq->pt;
>  	union perf_event *event = ptq->event_buf;
>  	struct perf_sample sample = { .ip = 0, };
> +	int err;
>  
>  	if (intel_pt_skip_event(pt))
>  		return 0;
> @@ -1633,7 +1639,7 @@ static int intel_pt_synth_instruction_sample(struct intel_pt_queue *ptq)
>  	else
>  		sample.period = ptq->state->tot_insn_cnt - ptq->last_insn_cnt;
>  
> -	if (ptq->sample_ipc)
> +	if (ptq->sample_ipc || pt->sample_cycles)

This is not quite right.  ptq->sample_ipc is set to indicate when the
cycle count is accurate for the current instruction.  It can be weakened
by using "Approx IPC" which was introduced for dlfilter-show-cycles.
Probably that approach should be followed for a "cycles" event also.

From perf-intel-pt man page:

dlfilter-show-cycles.so
~~~~~~~~~~~~~~~~~~~~~~~

Cycles can be displayed using dlfilter-show-cycles.so in which case the itrace A
option can be useful to provide higher granularity cycle information:

	perf script --itrace=A --call-trace --dlfilter dlfilter-show-cycles.so

To see a list of dlfilters:

	perf script -v --list-dlfilters

See also linkperf:perf-dlfilters[1]


>  		sample.cyc_cnt = ptq->ipc_cyc_cnt - ptq->last_in_cyc_cnt;
>  	if (sample.cyc_cnt) {
>  		sample.insn_cnt = ptq->ipc_insn_cnt - ptq->last_in_insn_cnt;
> @@ -1643,8 +1649,30 @@ static int intel_pt_synth_instruction_sample(struct intel_pt_queue *ptq)
>  
>  	ptq->last_insn_cnt = ptq->state->tot_insn_cnt;

There are variables here that are specific to the "instructions" event, so
mixing "cycles" with "instructions" means duplicating those; however, maybe
it would be better not to allow the "y" and "i" options at the same time?

>  
> -	return intel_pt_deliver_synth_event(pt, event, &sample,
> -					    pt->instructions_sample_type);
> +	if (pt->sample_instructions) {
> +		err = intel_pt_deliver_synth_event(pt, event, &sample,
> +						   pt->instructions_sample_type);
> +		if (err)
> +			return err;
> +	}
> +
> +	/*
> +	 * NOTE: If not doing sampling (e.g. itrace=y0us), we will in practice
> +	 * only see cycles being attributed to branches, since CYC packets
> +	 * are only emitted together with other packets. We should perhaps
> +	 * consider spreading the count out over everything since the last
> +	 * CYC packet, i.e., since the last time sample.cyc_cnt was nonzero.
> +	 */
> +	if (pt->sample_cycles && sample.cyc_cnt) {
> +		sample.id = ptq->pt->cycles_id;
> +		sample.stream_id = ptq->pt->cycles_id;

A "cycles" sample needs to set the sample period to the number of cycles since the
last "cycles" sample.

> +		err = intel_pt_deliver_synth_event(pt, event, &sample,
> +						   pt->cycles_sample_type);
> +		if (err)
> +			return err;
> +	}
> +
> +	return 0;
>  }


* Re: [PATCH] perf intel-pt: Synthesize cycle events
  2022-03-11  9:10 ` Adrian Hunter
@ 2022-03-11 17:42   ` Steinar H. Gunderson
  2022-03-14 16:24     ` Adrian Hunter
  0 siblings, 1 reply; 19+ messages in thread
From: Steinar H. Gunderson @ 2022-03-11 17:42 UTC (permalink / raw)
  To: Adrian Hunter
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, linux-perf-users,
	linux-kernel

On Fri, Mar 11, 2022 at 11:10:30AM +0200, Adrian Hunter wrote:
>> @@ -1633,7 +1639,7 @@ static int intel_pt_synth_instruction_sample(struct intel_pt_queue *ptq)
>>  	else
>>  		sample.period = ptq->state->tot_insn_cnt - ptq->last_insn_cnt;
>>  
>> -	if (ptq->sample_ipc)
>> +	if (ptq->sample_ipc || pt->sample_cycles)
> This is not quite right.  ptq->sample_ipc is set to indicate when the
> cycle count is accurate for the current instruction.  It can be weakened
> by using "Approx IPC" which was introduced for dlfilter-show-cycles.
> Probably that approach should be followed for a "cycles" event also.

Thanks for the review!

I'm not sure if I understand this entirely. The point of the code here
is to get sample.cyc_cnt computed, even if we're not sampling IPC.
I've seen the approx IPC code, but I'm not entirely sure how it
interacts with this?

>>  		sample.cyc_cnt = ptq->ipc_cyc_cnt - ptq->last_in_cyc_cnt;
>>  	if (sample.cyc_cnt) {
>>  		sample.insn_cnt = ptq->ipc_insn_cnt - ptq->last_in_insn_cnt;
>> @@ -1643,8 +1649,30 @@ static int intel_pt_synth_instruction_sample(struct intel_pt_queue *ptq)
>>  
>>  	ptq->last_insn_cnt = ptq->state->tot_insn_cnt;
> There are variables here that are specific to the "instructions" event, so
> mixing "cycles" with "instructions" means duplicating those, however maybe
> it would be better not to allow "y" and "i" options at the same time?

Given that a decode can easily take an hour, it would be really nice to
be able to keep y and i at the same time :-) (A long-standing pet peeve
of mine in perf is that you can't show two events side-by-side;
oprofile did that back in the day, at least on annotations.)

What specifically do you mean by duplicating? That we need to calculate
them twice in a way I don't do in my current patch, or something else?
It feels like I'm missing some context here.

>> -	return intel_pt_deliver_synth_event(pt, event, &sample,
>> -					    pt->instructions_sample_type);
>> +	if (pt->sample_instructions) {
>> +		err = intel_pt_deliver_synth_event(pt, event, &sample,
>> +						   pt->instructions_sample_type);
>> +		if (err)
>> +			return err;
>> +	}
>> +
>> +	/*
>> +	 * NOTE: If not doing sampling (e.g. itrace=y0us), we will in practice
>> +	 * only see cycles being attributed to branches, since CYC packets
>> +	 * are only emitted together with other packets. We should perhaps
>> +	 * consider spreading the count out over everything since the last
>> +	 * CYC packet, i.e., since the last time sample.cyc_cnt was nonzero.
>> +	 */
>> +	if (pt->sample_cycles && sample.cyc_cnt) {
>> +		sample.id = ptq->pt->cycles_id;
>> +		sample.stream_id = ptq->pt->cycles_id;
> A "cycles" sample needs to set the sample period to the number of cycles since the
> last "cycles" sample.

OK. If I understand you right, is this as simple as sample.period =
sample.cyc_cnt?

/* Steinar */


* Re: [PATCH] perf intel-pt: Synthesize cycle events
  2022-03-11 17:42   ` Steinar H. Gunderson
@ 2022-03-14 16:24     ` Adrian Hunter
  2022-03-15 10:16       ` Steinar H. Gunderson
  0 siblings, 1 reply; 19+ messages in thread
From: Adrian Hunter @ 2022-03-14 16:24 UTC (permalink / raw)
  To: Steinar H. Gunderson
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, linux-perf-users,
	linux-kernel

On 11/03/2022 19:42, Steinar H. Gunderson wrote:
> On Fri, Mar 11, 2022 at 11:10:30AM +0200, Adrian Hunter wrote:
>>> @@ -1633,7 +1639,7 @@ static int intel_pt_synth_instruction_sample(struct intel_pt_queue *ptq)
>>>  	else
>>>  		sample.period = ptq->state->tot_insn_cnt - ptq->last_insn_cnt;
>>>  
>>> -	if (ptq->sample_ipc)
>>> +	if (ptq->sample_ipc || pt->sample_cycles)
>> This is not quite right.  ptq->sample_ipc is set to indicate when the
>> cycle count is accurate for the current instruction.  It can be weakened
>> by using "Approx IPC" which was introduced for dlfilter-show-cycles.
>> Probably that approach should be followed for a "cycles" event also.
> 
> Thanks for the review!
> 
> I'm not sure if I understand this entirely. The point of the code here
> is to get sample.cyc_cnt computed, even if we're not sampling IPC.
> I've seen the approx IPC code, but I'm not entirely sure how it
> interacts with this?
> 
>>>  		sample.cyc_cnt = ptq->ipc_cyc_cnt - ptq->last_in_cyc_cnt;
>>>  	if (sample.cyc_cnt) {
>>>  		sample.insn_cnt = ptq->ipc_insn_cnt - ptq->last_in_insn_cnt;
>>> @@ -1643,8 +1649,30 @@ static int intel_pt_synth_instruction_sample(struct intel_pt_queue *ptq)
>>>  
>>>  	ptq->last_insn_cnt = ptq->state->tot_insn_cnt;
>> There are variables here that are specific to the "instructions" event, so
>> mixing "cycles" with "instructions" means duplicating those, however maybe
>> it would be better not to allow "y" and "i" options at the same time?
> 
> Given that a decode can easily take an hour, it would be really nice to
> be able to keep y and i at the same time :-) (A long-standing pet peeve
> of mine in perf is that you can't show two events side-by-side;
> oprofile did that back in the day, at least on annotations.)
> 
> What specifically do you mean by duplicating? That we need to calculate
> them twice in a way I don't do in my current patch, or something else?
> It feels like I'm missing some context here.
> 
>>> -	return intel_pt_deliver_synth_event(pt, event, &sample,
>>> -					    pt->instructions_sample_type);
>>> +	if (pt->sample_instructions) {
>>> +		err = intel_pt_deliver_synth_event(pt, event, &sample,
>>> +						   pt->instructions_sample_type);
>>> +		if (err)
>>> +			return err;
>>> +	}
>>> +
>>> +	/*
>>> +	 * NOTE: If not doing sampling (e.g. itrace=y0us), we will in practice
>>> +	 * only see cycles being attributed to branches, since CYC packets
>>> +	 * are only emitted together with other packets. We should perhaps
>>> +	 * consider spreading the count out over everything since the last
>>> +	 * CYC packet, i.e., since the last time sample.cyc_cnt was nonzero.
>>> +	 */
>>> +	if (pt->sample_cycles && sample.cyc_cnt) {
>>> +		sample.id = ptq->pt->cycles_id;
>>> +		sample.stream_id = ptq->pt->cycles_id;
>> A "cycles" sample needs to set the sample period to the number of cycles since the
>> last "cycles" sample.
> 
> OK. If I understand you right, is this as simple as sample.period =
> sample.cyc_cnt?
> 
> /* Steinar */

Perhaps change it to something like below.  What do you think?


diff --git a/tools/perf/util/intel-pt.c b/tools/perf/util/intel-pt.c
index 062cd25f7cd7..cec0444277f6 100644
--- a/tools/perf/util/intel-pt.c
+++ b/tools/perf/util/intel-pt.c
@@ -215,6 +215,8 @@ struct intel_pt_queue {
 	u64 ipc_cyc_cnt;
 	u64 last_in_insn_cnt;
 	u64 last_in_cyc_cnt;
+	u64 last_cy_insn_cnt;
+	u64 last_cy_cyc_cnt;
 	u64 last_br_insn_cnt;
 	u64 last_br_cyc_cnt;
 	unsigned int cbr_seen;
@@ -1666,12 +1668,11 @@ static void intel_pt_prep_sample(struct intel_pt *pt,
 	}
 }
 
-static int intel_pt_synth_instruction_or_cycle_sample(struct intel_pt_queue *ptq)
+static int intel_pt_synth_instruction_sample(struct intel_pt_queue *ptq)
 {
 	struct intel_pt *pt = ptq->pt;
 	union perf_event *event = ptq->event_buf;
 	struct perf_sample sample = { .ip = 0, };
-	int err;
 
 	if (intel_pt_skip_event(pt))
 		return 0;
@@ -1685,7 +1686,7 @@ static int intel_pt_synth_instruction_or_cycle_sample(struct intel_pt_queue *ptq
 	else
 		sample.period = ptq->state->tot_insn_cnt - ptq->last_insn_cnt;
 
-	if (ptq->sample_ipc || pt->sample_cycles)
+	if (ptq->sample_ipc)
 		sample.cyc_cnt = ptq->ipc_cyc_cnt - ptq->last_in_cyc_cnt;
 	if (sample.cyc_cnt) {
 		sample.insn_cnt = ptq->ipc_insn_cnt - ptq->last_in_insn_cnt;
@@ -1695,30 +1696,40 @@ static int intel_pt_synth_instruction_or_cycle_sample(struct intel_pt_queue *ptq
 
 	ptq->last_insn_cnt = ptq->state->tot_insn_cnt;
 
-	if (pt->sample_instructions) {
-		err = intel_pt_deliver_synth_event(pt, event, &sample,
-						   pt->instructions_sample_type);
-		if (err)
-			return err;
-	}
+	return intel_pt_deliver_synth_event(pt, event, &sample,
+					    pt->instructions_sample_type);
+}
 
-	/*
-	 * NOTE: If not doing sampling (e.g. itrace=y0us), we will in practice
-	 * only see cycles being attributed to branches, since CYC packets
-	 * are only emitted together with other packets. We should perhaps
-	 * consider spreading the count out over everything since the last
-	 * CYC packet, i.e., since the last time sample.cyc_cnt was nonzero.
-	 */
-	if (pt->sample_cycles && sample.cyc_cnt) {
-		sample.id = ptq->pt->cycles_id;
-		sample.stream_id = ptq->pt->cycles_id;
-		err = intel_pt_deliver_synth_event(pt, event, &sample,
-						   pt->cycles_sample_type);
-		if (err)
-			return err;
+static int intel_pt_synth_cycle_sample(struct intel_pt_queue *ptq)
+{
+	struct intel_pt *pt = ptq->pt;
+	union perf_event *event = ptq->event_buf;
+	struct perf_sample sample = { .ip = 0, };
+	u64 period;
+
+	if (pt->synth_opts.quick)
+		period = 1;
+	else
+		period = ptq->ipc_cyc_cnt - ptq->last_cy_cyc_cnt;
+
+	if (!period || intel_pt_skip_event(pt))
+		return 0;
+
+	intel_pt_prep_sample(pt, ptq, event, &sample);
+
+	sample.id = ptq->pt->cycles_id;
+	sample.stream_id = ptq->pt->cycles_id;
+	sample.period = period;
+
+	if (ptq->sample_ipc)
+		sample.cyc_cnt = sample.period;
+	if (sample.cyc_cnt) {
+		sample.insn_cnt = ptq->ipc_insn_cnt - ptq->last_cy_insn_cnt;
+		ptq->last_cy_insn_cnt = ptq->ipc_insn_cnt;
+		ptq->last_cy_cyc_cnt = ptq->ipc_cyc_cnt;
 	}
 
-	return 0;
+	return intel_pt_deliver_synth_event(pt, event, &sample, pt->cycles_sample_type);
 }
 
 static int intel_pt_synth_transaction_sample(struct intel_pt_queue *ptq)
@@ -2457,11 +2468,17 @@ static int intel_pt_sample(struct intel_pt_queue *ptq)
 		}
 	}
 
-	if ((pt->sample_instructions || pt->sample_cycles) &&
-	    (state->type & INTEL_PT_INSTRUCTION)) {
-		err = intel_pt_synth_instruction_or_cycle_sample(ptq);
-		if (err)
-			return err;
+	if (state->type & INTEL_PT_INSTRUCTION) {
+		if (pt->sample_instructions) {
+			err = intel_pt_synth_instruction_sample(ptq);
+			if (err)
+				return err;
+		}
+		if (pt->sample_cycles) {
+			err = intel_pt_synth_cycle_sample(ptq);
+			if (err)
+				return err;
+		}
 	}
 
 	if (pt->sample_transactions && (state->type & INTEL_PT_TRANSACTION)) {




* Re: [PATCH] perf intel-pt: Synthesize cycle events
  2022-03-14 16:24     ` Adrian Hunter
@ 2022-03-15 10:16       ` Steinar H. Gunderson
  2022-03-15 11:32         ` Adrian Hunter
  0 siblings, 1 reply; 19+ messages in thread
From: Steinar H. Gunderson @ 2022-03-15 10:16 UTC (permalink / raw)
  To: Adrian Hunter
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, linux-perf-users,
	linux-kernel

On Mon, Mar 14, 2022 at 06:24:19PM +0200, Adrian Hunter wrote:
> Perhaps change it to something like below.  What do you think?

I think the structure looks good, but I'm not sure about updating
e.g. ptq->last_cy_insn_cnt in both functions? Does that make sense?

I ran this and found something strange: I've started getting some hits
(very small amounts, e.g. 0.14%) on instructions that are not branches.
How can that happen?

/* Steinar */


* Re: [PATCH] perf intel-pt: Synthesize cycle events
  2022-03-15 10:16       ` Steinar H. Gunderson
@ 2022-03-15 11:32         ` Adrian Hunter
  2022-03-15 18:00           ` Steinar H. Gunderson
  0 siblings, 1 reply; 19+ messages in thread
From: Adrian Hunter @ 2022-03-15 11:32 UTC (permalink / raw)
  To: Steinar H. Gunderson
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, linux-perf-users,
	linux-kernel

On 15/03/2022 12:16, Steinar H. Gunderson wrote:
> On Mon, Mar 14, 2022 at 06:24:19PM +0200, Adrian Hunter wrote:
>> Perhaps change it to something like below.
> 
> I think the structure looks good, but I'm not sure about updating
> e.g. ptq->last_cy_insn_cnt in both functions? Does that make sense?

It should only be updated in the new intel_pt_synth_cycle_sample().
intel_pt_synth_instruction_sample() should be unchanged.

Also it would be better to keep the IPC count separate from the
sample period. i.e. a third new variable ptq->last_cyc_cnt
which works the same as ptq->last_insn_cnt does for intel_pt_synth_instruction_sample()

diff --git a/tools/perf/util/intel-pt.c b/tools/perf/util/intel-pt.c
index c1f261229932..c7a4b5feea2a 100644
--- a/tools/perf/util/intel-pt.c
+++ b/tools/perf/util/intel-pt.c
@@ -211,10 +211,13 @@ struct intel_pt_queue {
 	u32 flags;
 	u16 insn_len;
 	u64 last_insn_cnt;
+	u64 last_cyc_cnt;
 	u64 ipc_insn_cnt;
 	u64 ipc_cyc_cnt;
 	u64 last_in_insn_cnt;
 	u64 last_in_cyc_cnt;
+	u64 last_cy_insn_cnt;
+	u64 last_cy_cyc_cnt;
 	u64 last_br_insn_cnt;
 	u64 last_br_cyc_cnt;
 	unsigned int cbr_seen;

Then the new function becomes:

static int intel_pt_synth_cycle_sample(struct intel_pt_queue *ptq)
{
	struct intel_pt *pt = ptq->pt;
	union perf_event *event = ptq->event_buf;
	struct perf_sample sample = { .ip = 0, };
	u64 period;

	if (pt->synth_opts.quick)
		period = 1;
	else
		period = ptq->ipc_cyc_cnt - ptq->last_cyc_cnt;

	if (!period || intel_pt_skip_event(pt))
		return 0;

	intel_pt_prep_sample(pt, ptq, event, &sample);

	sample.id = ptq->pt->cycles_id;
	sample.stream_id = ptq->pt->cycles_id;
	sample.period = period;

	if (ptq->sample_ipc)
		sample.cyc_cnt = ptq->ipc_cyc_cnt - ptq->last_cy_cyc_cnt;
	if (sample.cyc_cnt) {
		sample.insn_cnt = ptq->ipc_insn_cnt - ptq->last_cy_insn_cnt;
		ptq->last_cy_insn_cnt = ptq->ipc_insn_cnt;
		ptq->last_cy_cyc_cnt = ptq->ipc_cyc_cnt;
	}

	ptq->last_cyc_cnt = ptq->ipc_cyc_cnt;

	return intel_pt_deliver_synth_event(pt, event, &sample, pt->cycles_sample_type);
}


> 
> I ran this and found something strange: I've started getting some hits
> (very small amounts, e.g. 0.14%) on instructions that are not branches.
> How can that happen?

Possibly a PSB packet.

If you have an example and recent perf tools you can see the decoder
log as follows:

perf script --itrace=y0nsed+o --time <start time>,<end time> -C <cpu>

where you select <start time>, <end time> and <cpu> to limit the
decoding to that time range and cpu.





* Re: [PATCH] perf intel-pt: Synthesize cycle events
  2022-03-15 11:32         ` Adrian Hunter
@ 2022-03-15 18:00           ` Steinar H. Gunderson
  2022-03-15 20:11             ` Adrian Hunter
  0 siblings, 1 reply; 19+ messages in thread
From: Steinar H. Gunderson @ 2022-03-15 18:00 UTC (permalink / raw)
  To: Adrian Hunter
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, linux-perf-users,
	linux-kernel

On Tue, Mar 15, 2022 at 01:32:38PM +0200, Adrian Hunter wrote:
>> I think the structure looks good, but I'm not sure about updating
>> e.g. ptq->last_cy_insn_cnt in both functions? Does that make sense?
> It should only be updated in the new intel_pt_synth_cycle_sample().
> intel_pt_synth_instruction_sample() should be unchanged.

Hm, OK. But something definitely changed between my original patch and
your change. (The first patch; I didn't try the last one yet.) With my
patch, I got (on a specific trace, synthing cycles only with perf report
--itrace=y0nse):

Samples: 4M of event 'cycles:uH', Event count (approx.): 4844309

With yours on the same file:

Samples: 2M of event 'cycles:uH', Event count (approx.): 77622449

The relative times between functions are also pretty different (although
none of them are obviously crazy), so one of them has to be wrong.
Is this to be expected, i.e., would you expect your change to fix some
bad bug on cycle-only synth? For reference, “perf script --itrace=i0ns
-F +ipc | grep -c IPC:” (a quick proxy for the number of CYC packets :-) )
yields 4836782, so I'm a bit surprised why there are only 2M events
being emitted from that.

/* Steinar */


* Re: [PATCH] perf intel-pt: Synthesize cycle events
  2022-03-15 18:00           ` Steinar H. Gunderson
@ 2022-03-15 20:11             ` Adrian Hunter
  2022-03-16  8:19               ` Steinar H. Gunderson
  0 siblings, 1 reply; 19+ messages in thread
From: Adrian Hunter @ 2022-03-15 20:11 UTC (permalink / raw)
  To: Steinar H. Gunderson
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, linux-perf-users,
	linux-kernel

On 15.3.2022 20.00, Steinar H. Gunderson wrote:
> On Tue, Mar 15, 2022 at 01:32:38PM +0200, Adrian Hunter wrote:
>>> I think the structure looks good, but I'm not sure about updating
>>> e.g. ptq->last_cy_insn_cnt in both functions? Does that make sense?
>> It should only be updated in the new intel_pt_synth_cycle_sample().
>> intel_pt_synth_instruction_sample() should be unchanged.
> 
> Hm, OK. But something definitely changed between my original patch and
> your change. (The first patch; I didn't try the last one yet.) With my
> patch, I got (on a specific trace, synthing cycles only with perf report
> --itrace=y0nse):
> 
> Samples: 4M of event 'cycles:uH', Event count (approx.): 4844309
> 
> With yours on the same file:
> 
> Samples: 2M of event 'cycles:uH', Event count (approx.): 77622449
> 
> The relative times between functions are also pretty different (although
> none of them are obviously crazy), so one of them has to be wrong.
> Is this to be expected, i.e., would you expect your change to fix some
> bad bug on cycle-only synth? For reference, “perf script --itrace=i0ns
> -F +ipc | grep -c IPC:” (a quick proxy for the number of CYC packets :-) )
> yields 4836782, so I'm a bit surprised why there are only 2M events
> being emitted from that.

Sorry, my first suggestion has issues, but the second is better.


* Re: [PATCH] perf intel-pt: Synthesize cycle events
  2022-03-15 20:11             ` Adrian Hunter
@ 2022-03-16  8:19               ` Steinar H. Gunderson
  2022-03-16 11:19                 ` Adrian Hunter
  0 siblings, 1 reply; 19+ messages in thread
From: Steinar H. Gunderson @ 2022-03-16  8:19 UTC (permalink / raw)
  To: Adrian Hunter
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, linux-perf-users,
	linux-kernel

On Tue, Mar 15, 2022 at 10:11:54PM +0200, Adrian Hunter wrote:
> Sorry, my first suggestion has issues, but the second is better.

I tried your second one, but can't see much difference. The original and
your patched version still differ by a lot, and I still see various
non-branch instructions getting very tiny fractions. (PSB packets
_might_ be it, as there are ~1000 of them in the 165M-cycle trace.)

I guess the good news is that the perf report coming out of your version
looks more likely to me; I have some functions that are around 1% that
shouldn't intuitively be that much (and, if I write some Perl to sum up
the cycles from the IPC lines in perf script, are more around 0.1%).
So perhaps we should stop chasing the difference? I don't know.
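
For reference, the summing was along these lines (a sketch, not the
exact script; the IPC lines print as "IPC: <ipc> (<insn>/<cyc>)"):

	perf script --itrace=i0ns -F +ipc | \
	  perl -ne '$cyc += $2 if m{IPC: \S+ \((\d+)/(\d+)\)}; END { print "$cyc\n" }'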

/* Steinar */


* Re: [PATCH] perf intel-pt: Synthesize cycle events
  2022-03-16  8:19               ` Steinar H. Gunderson
@ 2022-03-16 11:19                 ` Adrian Hunter
  2022-03-16 12:59                   ` Steinar H. Gunderson
  0 siblings, 1 reply; 19+ messages in thread
From: Adrian Hunter @ 2022-03-16 11:19 UTC (permalink / raw)
  To: Steinar H. Gunderson
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, linux-perf-users,
	linux-kernel

On 16.3.2022 10.19, Steinar H. Gunderson wrote:
> On Tue, Mar 15, 2022 at 10:11:54PM +0200, Adrian Hunter wrote:
>> Sorry, my first suggestion has issues, but the second is better.
> 
> I tried your second one, but can't see much difference. The original and
> your patched version still differ by a lot, and I still see various
> non-branch instructions getting very tiny fractions. (PSB packets
> _might_ be it, as there are ~1000 of them in the 165M-cycle trace.)
> 
> I guess the good news is that the perf report coming out of your version
> looks more likely to me; I have some functions that are around 1% that
> shouldn't intuitively be that much (and, if I write some Perl to sum up
> the cycles from the IPC lines in perf script, are more around 0.1%).
> So perhaps we should stop chasing the difference? I don't know.

That doesn't sound right.  I will look at it more closely in the next few days.


* Re: [PATCH] perf intel-pt: Synthesize cycle events
  2022-03-16 11:19                 ` Adrian Hunter
@ 2022-03-16 12:59                   ` Steinar H. Gunderson
  2022-03-21  9:16                     ` Adrian Hunter
  0 siblings, 1 reply; 19+ messages in thread
From: Steinar H. Gunderson @ 2022-03-16 12:59 UTC (permalink / raw)
  To: Adrian Hunter
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, linux-perf-users,
	linux-kernel

On Wed, Mar 16, 2022 at 01:19:46PM +0200, Adrian Hunter wrote:
>> I guess the good news is that the perf report coming out of your version
>> looks more likely to me; I have some functions that are around 1% that
>> shouldn't intuitively be that much (and, if I write some Perl to sum up
>> the cycles from the IPC lines in perf script, are more around 0.1%).
>> So perhaps we should stop chasing the difference? I don't know.
> That doesn't sound right.  I will look at it more closely in the next few days.

If you need, I can supply the perf.data and binaries, but we're talking
a couple of gigabytes of data (and I don't know immediately if there's
an easy way I can package up everything perf.data references) :-)

/* Steinar */


* Re: [PATCH] perf intel-pt: Synthesize cycle events
  2022-03-16 12:59                   ` Steinar H. Gunderson
@ 2022-03-21  9:16                     ` Adrian Hunter
  2022-03-21 10:33                       ` Steinar H. Gunderson
  0 siblings, 1 reply; 19+ messages in thread
From: Adrian Hunter @ 2022-03-21  9:16 UTC (permalink / raw)
  To: Steinar H. Gunderson
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, linux-perf-users,
	linux-kernel

On 16.3.2022 14.59, Steinar H. Gunderson wrote:
> On Wed, Mar 16, 2022 at 01:19:46PM +0200, Adrian Hunter wrote:
>>> I guess the good news is that the perf report coming out of your version
>>> looks more likely to me; I have some functions that are around 1% that
>>> shouldn't intuitively be that much (and, if I write some Perl to sum up
>>> the cycles from the IPC lines in perf script, are more around 0.1%).
>>> So perhaps we should stop chasing the difference? I don't know.
>> That doesn't sound right.  I will look at it more closely in the next few days.
> 
> If you need, I can supply the perf.data and binaries, but we're talking
> a couple of gigabytes of data (and I don't know immediately if there's
> an easy way I can package up everything perf.data references) :-)
> 
> /* Steinar */

I had another look at this and it seemed *mostly* OK for me.  One change
I would make is to subject the cycle period to the logic of the 'A' option
(approximate IPC).

So what does the 'A' option do?

By default, IPC is output only when the exact number of cycles and
instructions is known for the sample.  Decoding walks instructions
to reconstruct the control flow, so the exact number of instructions
is known, but the cycle count (CYC packet) is only produced together
with another packet, so an exact cycle count is available only at
indirect/async branches or at the first conditional branch of a TNT packet.

Reporting exact IPC makes sense when sampling every branch or
instruction, but makes less sense when sampling less often.

For example with:

$ perf record -e intel_pt/cyc/u uname
Linux
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.218 MB perf.data ]

Sampling every 50us, exact IPC is reported only twice:

$ perf script --itrace=i50us -F+ipc
           uname 2007962 [005] 2426597.185314:      91866 instructions:uH:      7f3feb913deb _dl_relocate_object+0x40b (/usr/lib/x86_64-linux-gnu/ld-2.31.so)
           uname 2007962 [005] 2426597.185353:      21959 instructions:uH:      7f3feb91158f do_lookup_x+0xcf (/usr/lib/x86_64-linux-gnu/ld-2.31.so)
           uname 2007962 [005] 2426597.185670:     129834 instructions:uH:      7f3feb72e05a read_alias_file+0x23a (/usr/lib/x86_64-linux-gnu/libc-2.31.so)
           uname 2007962 [005] 2426597.185709:      39373 instructions:uH:      7f3feb72ed52 _nl_explode_name+0x52 (/usr/lib/x86_64-linux-gnu/libc-2.31.so)
           uname 2007962 [005] 2426597.185947:     137486 instructions:uH:      7f3feb87e5f3 __strlen_avx2+0x13 (/usr/lib/x86_64-linux-gnu/libc-2.31.so)         IPC: 0.88 (420518/472789) 
           uname 2007962 [005] 2426597.186026:      79196 instructions:uH:      7f3feb87e5f3 __strlen_avx2+0x13 (/usr/lib/x86_64-linux-gnu/libc-2.31.so)         IPC: 1.34 (79196/59092) 
           uname 2007962 [005] 2426597.186066:      29855 instructions:uH:      7f3feb78dee6 _int_malloc+0x446 (/usr/lib/x86_64-linux-gnu/libc-2.31.so)

But if we relax the requirement and just use the number of cycles
counted so far, whether it is exactly correct or not, we can get
approx IPC for every sample:

$ perf script --itrace=i50usA -F+ipc
           uname 2007962 [005] 2426597.185314:      91866 instructions:uH:      7f3feb913deb _dl_relocate_object+0x40b (/usr/lib/x86_64-linux-gnu/ld-2.31.so)    IPC: 0.74 (91866/122744) 
           uname 2007962 [005] 2426597.185353:      21959 instructions:uH:      7f3feb91158f do_lookup_x+0xcf (/usr/lib/x86_64-linux-gnu/ld-2.31.so)     IPC: 0.92 (21959/23822) 
           uname 2007962 [005] 2426597.185670:     129834 instructions:uH:      7f3feb72e05a read_alias_file+0x23a (/usr/lib/x86_64-linux-gnu/libc-2.31.so)      IPC: 0.77 (129834/167753) 
           uname 2007962 [005] 2426597.185709:      39373 instructions:uH:      7f3feb72ed52 _nl_explode_name+0x52 (/usr/lib/x86_64-linux-gnu/libc-2.31.so)      IPC: 1.01 (39373/38881) 
           uname 2007962 [005] 2426597.185947:     137486 instructions:uH:      7f3feb87e5f3 __strlen_avx2+0x13 (/usr/lib/x86_64-linux-gnu/libc-2.31.so)         IPC: 1.14 (137486/119589) 
           uname 2007962 [005] 2426597.186026:      79196 instructions:uH:      7f3feb87e5f3 __strlen_avx2+0x13 (/usr/lib/x86_64-linux-gnu/libc-2.31.so)         IPC: 1.34 (79196/59092) 
           uname 2007962 [005] 2426597.186066:      29855 instructions:uH:      7f3feb78dee6 _int_malloc+0x446 (/usr/lib/x86_64-linux-gnu/libc-2.31.so)          IPC: 1.33 (29855/22282) 


So the cycle sample function looks like this:

static int intel_pt_synth_cycle_sample(struct intel_pt_queue *ptq)
{
	struct intel_pt *pt = ptq->pt;
	union perf_event *event = ptq->event_buf;
	struct perf_sample sample = { .ip = 0, };
	u64 period = 0;

	if (ptq->sample_ipc)
		period = ptq->ipc_cyc_cnt - ptq->last_cy_cyc_cnt;

	if (!period || intel_pt_skip_event(pt))
		return 0;

	intel_pt_prep_sample(pt, ptq, event, &sample);

	sample.id = ptq->pt->cycles_id;
	sample.stream_id = ptq->pt->cycles_id;
	sample.period = period;

	sample.cyc_cnt = period;
	sample.insn_cnt = ptq->ipc_insn_cnt - ptq->last_cy_insn_cnt;
	ptq->last_cy_insn_cnt = ptq->ipc_insn_cnt;
	ptq->last_cy_cyc_cnt = ptq->ipc_cyc_cnt;

	return intel_pt_deliver_synth_event(pt, event, &sample, pt->cycles_sample_type);
}


With regard to the results you got with perf report, please try:

	perf report --itrace=y0nse --show-total-period --stdio

and see if the percentages and cycle counts for rarely executed
functions make more sense.


* Re: [PATCH] perf intel-pt: Synthesize cycle events
  2022-03-21  9:16                     ` Adrian Hunter
@ 2022-03-21 10:33                       ` Steinar H. Gunderson
  2022-03-21 13:09                         ` Adrian Hunter
  0 siblings, 1 reply; 19+ messages in thread
From: Steinar H. Gunderson @ 2022-03-21 10:33 UTC (permalink / raw)
  To: Adrian Hunter
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, linux-perf-users,
	linux-kernel

On Mon, Mar 21, 2022 at 11:16:56AM +0200, Adrian Hunter wrote:
> I had another look at this and it seemed *mostly* OK for me.  One change
> I would make is to subject the cycle period to the logic of the 'A' option
> (approximate IPC).
> 
> So what does the 'A' option do?
> 
> By default, IPC is output only when the exact number of cycles and
> instructions is known for the sample.  Decoding walks instructions
> to reconstruct the control flow, so the exact number of instructions
> is known, but the cycle count (CYC packet) is only produced together
> with another packet, so an exact cycle count is available only at
> indirect/async branches or at the first conditional branch of a TNT packet.

Ah, I hadn't thought of the fact that you only get the first branch per
packet. It's a bit unfortunate for (exact) cycle counts, since I guess
TNT packets can also easily cross functions?

> So the cycle sample function looks like this:
> 
> static int intel_pt_synth_cycle_sample(struct intel_pt_queue *ptq)
>
> [...]
>
> With regard to the results you got with perf report, please try:
> 
> 	perf report --itrace=y0nse --show-total-period --stdio
> 
> and see if the percentages and cycle counts for rarely executed
> functions make more sense.

I already run mostly with 0ns period, so I don't think that's it.
I tried your new version, and it's very similar to your previous one;
there are some small changes (largest is that one function goes from
2.5% to 2.2% or so), but the general gist of it is the same.
I am increasingly leaning towards my original version being wrong
somehow, though.

By the way, I noticed that synthesized call stacks do not respect
--inline; is that on purpose? The patch seems simple enough (just
a call to add_inlines), although it exposes extreme slowness in libbfd
when run over large binaries, which I'll have to look into.
(10+ ms for each address-to-symbol lookup is rather expensive when you
have 4M samples to churn through!)

/* Steinar */


* Re: [PATCH] perf intel-pt: Synthesize cycle events
  2022-03-21 10:33                       ` Steinar H. Gunderson
@ 2022-03-21 13:09                         ` Adrian Hunter
  2022-03-21 16:58                           ` Steinar H. Gunderson
  0 siblings, 1 reply; 19+ messages in thread
From: Adrian Hunter @ 2022-03-21 13:09 UTC (permalink / raw)
  To: Steinar H. Gunderson
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, linux-perf-users,
	linux-kernel

On 21.3.2022 12.33, Steinar H. Gunderson wrote:
> On Mon, Mar 21, 2022 at 11:16:56AM +0200, Adrian Hunter wrote:
>> I had another look at this and it seemed *mostly* OK for me.  One change
>> I would make is to subject the cycle period to the logic of the 'A' option
>> (approximate IPC).
>>
>> So what does the 'A' option do?
>>
>> By default, IPC is output only when the exact number of cycles and
>> instructions is known for the sample.  Decoding walks instructions
>> to reconstruct the control flow, so the exact number of instructions
>> is known, but the cycle count (CYC packet) is only produced together
>> with another packet, so an exact cycle count is available only at
>> indirect/async branches or at the first conditional branch of a TNT packet.
> 
> Ah, I hadn't thought of the fact that you only get the first branch per
> packet. It's a bit unfortunate for (exact) cycle counts, since I guess
> TNT packets can also easily cross functions?

Yes, it can cross calls and returns; returns because of "Return Compression",
which can be switched off at record time with the config term noretcomp, but
that may cause more overflows / trace data loss.

To get accurate times for a single function there is Intel PT
address filtering.

Otherwise LBRs can have cycle times.
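
For example (binary and function names are illustrative):

	# Record without return compression:
	perf record -e intel_pt/cyc,noretcomp/u -- ./my-workload

	# Restrict the trace to one function with address filtering:
	perf record -e intel_pt/cyc/u --filter 'filter my_func @ ./my-workload' -- ./my-workload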

> 
>> So the cycle sample function looks like this:
>>
>> static int intel_pt_synth_cycle_sample(struct intel_pt_queue *ptq)
>>
>> [...]
>>
>> With regard to the results you got with perf report, please try:
>>
>> 	perf report --itrace=y0nse --show-total-period --stdio
>>
>> and see if the percentages and cycle counts for rarely executed
>> functions make more sense.
> 
> I already run mostly with 0ns period, so I don't think that's it.
> I tried your new version, and it's very similar to your previous one;
> there are some small changes (largest is that one function goes from
> 2.5% to 2.2% or so), but the general gist of it is the same.
> I am increasingly leaning towards my original version being wrong
> somehow, though.
> 
> By the way, I noticed that synthesized call stacks do not respect
> --inline; is that on purpose? The patch seems simple enough (just
> a call to add_inlines), although it exposes extreme slowness in libbfd
> when run over large binaries, which I'll have to look into.
> (10+ ms for each address-to-symbol lookup is rather expensive when you
> have 4M samples to churn through!)

No, not on purpose.


* Re: [PATCH] perf intel-pt: Synthesize cycle events
  2022-03-21 13:09                         ` Adrian Hunter
@ 2022-03-21 16:58                           ` Steinar H. Gunderson
  2022-03-21 17:40                             ` Adrian Hunter
  2022-03-22 11:57                             ` Steinar H. Gunderson
  0 siblings, 2 replies; 19+ messages in thread
From: Steinar H. Gunderson @ 2022-03-21 16:58 UTC (permalink / raw)
  To: Adrian Hunter
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, linux-perf-users,
	linux-kernel

On Mon, Mar 21, 2022 at 03:09:08PM +0200, Adrian Hunter wrote:
> Yes, it can cross calls and returns.  'returns' due to "Return Compression"
> which can be switched off at record time with config term noretcomp, but
> that may cause more overflows / trace data loss.
> 
> To get accurate times for a single function there is Intel PT
> address filtering.
> 
> Otherwise LBRs can have cycle times.

Many interesting points, I'll be sure to look into them.

Meanwhile, should I send a new patch with your latest changes?
It's more your patch than mine now, it seems, but I think we're
converging on something useful.

>> By the way, I noticed that synthesized call stacks do not respect
>> --inline; is that on purpose? The patch seems simple enough (just
>> a call to add_inlines), although it exposes extreme slowness in libbfd
>> when run over large binaries, which I'll have to look into.
>> (10+ ms for each address-to-symbol lookup is rather expensive when you
>> have 4M samples to churn through!)
> No, not on purpose.

The patch appears to be trivial:

--- a/tools/perf/util/machine.c
+++ b/tools/perf/util/machine.c
@@ -44,6 +44,7 @@
 #include <linux/zalloc.h>

 static void __machine__remove_thread(struct machine *machine, struct thread *th, bool lock);
+static int append_inlines(struct callchain_cursor *cursor, struct map_symbol *ms, u64 ip);

 static struct dso *machine__kernel_dso(struct machine *machine)
 {
@@ -2217,6 +2218,10 @@ static int add_callchain_ip(struct thread *thread,
 	ms.maps = al.maps;
 	ms.map = al.map;
 	ms.sym = al.sym;
+
+	if (append_inlines(cursor, &ms, ip) == 0)
+		return 0;
+
 	srcline = callchain_srcline(&ms, al.addr);
 	return callchain_cursor_append(cursor, ip, &ms,
 				       branch, flags, nr_loop_iter,

but I'm seeing problems with “failing to process sample” when I go from
10us to 1us, so I'll have to look into that.

I've sent some libbfd patches to try to help with the slowness:

  https://sourceware.org/git/?p=binutils-gdb.git;a=commitdiff;h=30cbd32aec30b4bc13427bbd87c4c63c739d4578
  https://sourceware.org/pipermail/binutils/2022-March/120131.html
  https://sourceware.org/pipermail/binutils/2022-March/120133.html

/* Steinar */


* Re: [PATCH] perf intel-pt: Synthesize cycle events
  2022-03-21 16:58                           ` Steinar H. Gunderson
@ 2022-03-21 17:40                             ` Adrian Hunter
  2022-03-22 11:57                             ` Steinar H. Gunderson
  1 sibling, 0 replies; 19+ messages in thread
From: Adrian Hunter @ 2022-03-21 17:40 UTC (permalink / raw)
  To: Steinar H. Gunderson
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, linux-perf-users,
	linux-kernel

On 21/03/2022 18:58, Steinar H. Gunderson wrote:
> On Mon, Mar 21, 2022 at 03:09:08PM +0200, Adrian Hunter wrote:
>> Yes, it can cross calls and returns.  'returns' due to "Return Compression"
>> which can be switched off at record time with config term noretcomp, but
>> that may cause more overflows / trace data loss.
>>
>> To get accurate times for a single function there is Intel PT
>> address filtering.
>>
>> Otherwise LBRs can have cycle times.
> 
> Many interesting points, I'll be sure to look into them.
> 
> Meanwhile, should I send a new patch with your latest changes?

Yes please.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] perf intel-pt: Synthesize cycle events
  2022-03-21 16:58                           ` Steinar H. Gunderson
  2022-03-21 17:40                             ` Adrian Hunter
@ 2022-03-22 11:57                             ` Steinar H. Gunderson
  2022-03-29 12:31                               ` Steinar H. Gunderson
  1 sibling, 1 reply; 19+ messages in thread
From: Steinar H. Gunderson @ 2022-03-22 11:57 UTC (permalink / raw)
  To: Adrian Hunter
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, linux-perf-users,
	linux-kernel

On Mon, Mar 21, 2022 at 05:58:08PM +0100, Steinar H. Gunderson wrote:
> but I'm seeing problems with “failing to process sample” when I go from
> 10us to 1us, so I'll have to look into that.

I took a debug log from perf report; the last few lines are (I abridged
some symbol names):

symbol__new: cppgc::internal::MemberBase::GetRaw() const 0x6078a90-0x60792d0
symbol__new: cppgc::internal::BasicMember<...>::Get() const 0x6078a90-0x60792d0
symbol__new: cppgc::internal::BasicMember<...>::operator->() const 0x6078a90-0x60792d0
__symbol__inc_addr_samples: addr=0x5561dc37cbf6
__symbol__inc_addr_samples(896): ERANGE! sym->name=void blink::ElementRuleCollector::CollectMatchingRulesForList<...>(...), start=0x6074310, addr=0x6078bf6, end=0x6078a90
problem adding hist entry, skipping event
Intel PT: failed to deliver event, error -34
0x5739a8 [0x8]: failed to process type: 68 [Numerical result out of range]

and this is evidently fatal. So for whatever reason, the sample
address is attributed to some symbol, that symbol is assumed to have
a single range (is this even necessarily true?), the event is
refused, and then the entire report fails. (It doesn't happen with
--stdio, though!)
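
For what it's worth, plugging the logged values into the kind of range
test that __symbol__inc_addr_samples appears to apply (a paraphrase on
my part, not the actual perf source) shows the sample address falling
past the symbol's end:

  #include <stdio.h>

  int main(void)
  {
          /* values from the debug log above */
          unsigned long start = 0x6074310, addr = 0x6078bf6, end = 0x6078a90;

          /* the symbol's range is treated as [start, end) */
          if (addr < start || addr >= end)
                  printf("ERANGE: addr %#lx is outside [%#lx, %#lx)\n",
                         addr, start, end);
          return 0;
  }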

I'm a bit at a loss as to how to debug this. It doesn't happen without
the inlines being synthesized, but I don't know whether the inline
synthesis itself is wrong, or whether it merely exposes an existing
problem (I've certainly seen “failed to process sample” without PT
before).

/* Steinar */

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] perf intel-pt: Synthesize cycle events
  2022-03-22 11:57                             ` Steinar H. Gunderson
@ 2022-03-29 12:31                               ` Steinar H. Gunderson
  2022-03-29 14:16                                 ` Steinar H. Gunderson
  0 siblings, 1 reply; 19+ messages in thread
From: Steinar H. Gunderson @ 2022-03-29 12:31 UTC (permalink / raw)
  To: Adrian Hunter
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, linux-perf-users,
	linux-kernel, He Kuang

On Tue, Mar 22, 2022 at 12:57:38PM +0100, Steinar H. Gunderson wrote:
> and this is evidently fatal. So for whatever reason, the sample
> address is attributed to some symbol, that symbol is assumed to have
> a single range (is this even necessarily true?), the event is
> refused, and then the entire report fails. (It doesn't happen with
> --stdio, though!)

I think I traced this. The basic issue is this routine, which compares
two symbols (to see if they should be combined for the sake of perf
report):

  int64_t _sort__sym_cmp(struct symbol *sym_l, struct symbol *sym_r)
  {
          if (!sym_l || !sym_r)
                  return cmp_null(sym_l, sym_r);

          if (sym_l == sym_r)
                  return 0;

          if (sym_l->inlined || sym_r->inlined) {
                  int ret = strcmp(sym_l->name, sym_r->name);

                  if (ret)
                          return ret;
                  if ((sym_l->start <= sym_r->end) && (sym_l->end >= sym_r->start))
                          return 0;
          }

          if (sym_l->start != sym_r->start)
                  return (int64_t)(sym_r->start - sym_l->start);

          return (int64_t)(sym_r->end - sym_l->end);
  }

The problem is the part comparing sym_l->start against sym_r->end
(and vice versa). I'm not entirely sure what the logic here is, but
the intention seems to be that if sym_l and sym_r overlap, they are
collapsed into one.
However, it seems to assume that [start,end] is inclusive, whereas it is
generally [start,end) in later code. This means that if you have e.g.
a symbol [0x1000,0x1010) which is then inlined as the very first thing
in the adjacent function [0x1010,0x1030) (say, sym_r is the inlined
symbol [0x1010,0x1020)), _sort__sym_cmp will indeed collapse them.
This causes the ERANGE problem later, as you can have 0x1005 being
mapped to a hist_entry corresponding to the symbol [0x1010,0x1030).
This logic was indeed added in 2019 to fix ERANGE problems
(see 7346195e8), but it appears to be incomplete.
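
Concretely, with the ranges from that example (a toy program; the
addresses are the hypothetical ones above, not from a real binary):

  #include <stdio.h>

  int main(void)
  {
          /* adjacent, non-overlapping half-open ranges:
           * sym_l = [0x1000, 0x1010), sym_r = [0x1010, 0x1020) */
          unsigned long l_start = 0x1000, l_end = 0x1010;
          unsigned long r_start = 0x1010, r_end = 0x1020;

          /* inclusive test from _sort__sym_cmp: claims they overlap */
          printf("inclusive: %d\n", l_start <= r_end && l_end >= r_start);

          /* strict test for half-open intervals: no overlap */
          printf("strict:    %d\n", l_start < r_end && l_end > r_start);
          return 0;
  }

This prints "inclusive: 1" but "strict: 0": the inclusive comparison
collapses two symbols that merely touch.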

If I change <= to < and >= to >, it stops erroring out; however, I'm
still not sure whether this is actually right. Doesn't the collapsing
then depend on what order the entries happen to be added in, i.e.,
whether the larger symbol comes first or last?

/* Steinar */

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] perf intel-pt: Synthesize cycle events
  2022-03-29 12:31                               ` Steinar H. Gunderson
@ 2022-03-29 14:16                                 ` Steinar H. Gunderson
  0 siblings, 0 replies; 19+ messages in thread
From: Steinar H. Gunderson @ 2022-03-29 14:16 UTC (permalink / raw)
  To: Adrian Hunter
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, linux-perf-users,
	linux-kernel

On Tue, Mar 29, 2022 at 02:31:14PM +0200, Steinar H. Gunderson wrote:
>   int64_t _sort__sym_cmp(struct symbol *sym_l, struct symbol *sym_r)
>   {
>           if (!sym_l || !sym_r)
>                   return cmp_null(sym_l, sym_r);
> 
>           if (sym_l == sym_r)
>                   return 0;
> 
>           if (sym_l->inlined || sym_r->inlined) {
>                   int ret = strcmp(sym_l->name, sym_r->name);
> 
>                   if (ret)
>                           return ret;
>                   if ((sym_l->start <= sym_r->end) && (sym_l->end >= sym_r->start))
>                           return 0;
>           }
> 
>           if (sym_l->start != sym_r->start)
>                   return (int64_t)(sym_r->start - sym_l->start);

Even fixing <= to <, I still get nonsensical results, like an inlined
(and very small) destructor taking 50%+ of CPU time and having a huge
call chain under it. This largely goes away (I'm not sure if entirely)
if I remove the whole if (sym_l->inlined || ...) branch, but I guess
that branch is there for a reason.

Thinking about it, I wonder if this code breaks the tree invariant
that the comparison must be transitive. If left _or_ right is inlined,
it compares them by name, but if neither is, it compares them by
address. So you can have three symbols A, B (inlined) and C, where
A < B (on name) and B < C (on name), but C < A (on address; assuming
C has a lower start address than A). That doesn't look good to me.
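
A standalone toy version of that scenario (symbols and addresses made
up; the comparator is condensed from the branch choice above, for the
case where the names differ):

  #include <stdio.h>
  #include <string.h>

  struct sym { const char *name; unsigned long start; int inlined; };

  /* condensed from _sort__sym_cmp: name order if either side is
   * inlined (and the names differ), address order otherwise */
  static long cmp(const struct sym *l, const struct sym *r)
  {
          if (l->inlined || r->inlined)
                  return strcmp(l->name, r->name);
          return (long)(r->start - l->start);
  }

  int main(void)
  {
          struct sym A = { "aaa", 0x1000, 0 };
          struct sym B = { "bbb", 0x3000, 1 };  /* inlined */
          struct sym C = { "ccc", 0x2000, 0 };

          /* name order says A before B and B before C, so transitivity
           * would require cmp(A,C) to have the same sign as the rest */
          printf("cmp(A,B)=%ld cmp(B,C)=%ld cmp(A,C)=%ld\n",
                 cmp(&A, &B), cmp(&B, &C), cmp(&A, &C));
          return 0;
  }

cmp(A,B) and cmp(B,C) come out negative (name order) while cmp(A,C)
comes out positive (address order), which is exactly the kind of
inconsistency a sorted tree cannot tolerate.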

I'm wondering if the right fix would be to replace the entire if
with something like

           if (sym_l->inlined && sym_r->inlined &&
               strcmp(sym_l->name, sym_r->name) == 0 &&
               <keep [start,end) overlap test here>) {
                   return 0;
           }

but I'm not sure.
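
Spelled out in the context of sort.c, that would be something like the
following (just a sketch: the strict half-open overlap test is my guess
at the right condition, and I haven't checked whether requiring both
sides to be inlined regresses the 2019 ERANGE fix):

  int64_t _sort__sym_cmp(struct symbol *sym_l, struct symbol *sym_r)
  {
          if (!sym_l || !sym_r)
                  return cmp_null(sym_l, sym_r);

          if (sym_l == sym_r)
                  return 0;

          /* only collapse two *inlined* copies of the same name that
           * genuinely overlap; everything else falls through to
           * address order */
          if (sym_l->inlined && sym_r->inlined &&
              !strcmp(sym_l->name, sym_r->name) &&
              sym_l->start < sym_r->end && sym_l->end > sym_r->start)
                  return 0;

          if (sym_l->start != sym_r->start)
                  return (int64_t)(sym_r->start - sym_l->start);

          return (int64_t)(sym_r->end - sym_l->end);
  }

(Even this can be non-transitive if three same-name inlined copies
chain-overlap, so it may narrow the window rather than close it.)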

/* Steinar */

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2022-03-29 14:16 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-03-10  9:38 [PATCH] perf intel-pt: Synthesize cycle events Steinar H. Gunderson
2022-03-11  9:10 ` Adrian Hunter
2022-03-11 17:42   ` Steinar H. Gunderson
2022-03-14 16:24     ` Adrian Hunter
2022-03-15 10:16       ` Steinar H. Gunderson
2022-03-15 11:32         ` Adrian Hunter
2022-03-15 18:00           ` Steinar H. Gunderson
2022-03-15 20:11             ` Adrian Hunter
2022-03-16  8:19               ` Steinar H. Gunderson
2022-03-16 11:19                 ` Adrian Hunter
2022-03-16 12:59                   ` Steinar H. Gunderson
2022-03-21  9:16                     ` Adrian Hunter
2022-03-21 10:33                       ` Steinar H. Gunderson
2022-03-21 13:09                         ` Adrian Hunter
2022-03-21 16:58                           ` Steinar H. Gunderson
2022-03-21 17:40                             ` Adrian Hunter
2022-03-22 11:57                             ` Steinar H. Gunderson
2022-03-29 12:31                               ` Steinar H. Gunderson
2022-03-29 14:16                                 ` Steinar H. Gunderson

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).