linux-kernel.vger.kernel.org archive mirror
* [PATCH v2 0/4] perf cs-etm: Fix synthesizing instruction samples
@ 2019-11-01  2:07 Leo Yan
  2019-11-01  2:07 ` [PATCH v2 1/4] perf cs-etm: Continuously record last branches Leo Yan
                   ` (3 more replies)
  0 siblings, 4 replies; 7+ messages in thread
From: Leo Yan @ 2019-11-01  2:07 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo, Mathieu Poirier, Suzuki K Poulose,
	Mike Leach, Peter Zijlstra, Ingo Molnar, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, linux-arm-kernel,
	Linux Kernel Mailing List, Coresight ML, Robert Walker
  Cc: Leo Yan

This patch series addresses an issue with synthesizing instruction
samples: when the instruction sample period is small enough, the
current logic cannot synthesize multiple instruction samples within
one instruction range packet.

To fix this, patch 0001 avoids resetting the last branches for every
instruction sample; if the last branches are reset every time an
instruction sample is generated, the later samples in the same range
packet cannot use the last branches anymore.

Patch 0002 is the main patch; it fixes the logic for synthesizing
instruction samples so that it handles different instruction periods.

Patch 0003 is an optimization for copying last branches; it only copies
last branches once if the instruction samples share the same last
branches.

Patch 0004 is a minor fix for unsigned variable comparison to zero.

To verify my changes for synthesizing instruction samples, I added
some logs in the code and manually reviewed the output log for
instruction samples.  The commands below were tested on a DB410c board:

  # perf script --itrace=i2
  # perf script --itrace=i2il16
  # perf inject --itrace=i2il16 -i perf.data -o perf.data.new
  # perf inject --itrace=i100il16 -i perf.data -o perf.data.new


Changes from v1:
* Rebased patch set on perf/core branch with latest commit 9fec3cd5fa4a
  ("perf map: Check if the map still has some refcounts on exit").


Leo Yan (4):
  perf cs-etm: Continuously record last branches
  perf cs-etm: Correct synthesizing instruction samples
  perf cs-etm: Optimize copying last branches
  perf cs-etm: Fix unsigned variable comparison to zero

 tools/perf/util/cs-etm.c | 137 ++++++++++++++++++++++++++++++++-------
 1 file changed, 115 insertions(+), 22 deletions(-)

-- 
2.17.1



* [PATCH v2 1/4] perf cs-etm: Continuously record last branches
  2019-11-01  2:07 [PATCH v2 0/4] perf cs-etm: Fix synthesizing instruction samples Leo Yan
@ 2019-11-01  2:07 ` Leo Yan
  2019-11-01 15:30   ` Robert Walker
  2019-11-01  2:07 ` [PATCH v2 2/4] perf cs-etm: Correct synthesizing instruction samples Leo Yan
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 7+ messages in thread
From: Leo Yan @ 2019-11-01  2:07 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo, Mathieu Poirier, Suzuki K Poulose,
	Mike Leach, Peter Zijlstra, Ingo Molnar, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, linux-arm-kernel,
	Linux Kernel Mailing List, Coresight ML, Robert Walker
  Cc: Leo Yan

Every time an instruction sample is synthesized, the last branch
recording is reset.  This is fine if the instruction period is big
enough: for example, with the option '--itrace=i100000' the last
branch array is reset for every instruction sample (100000
instructions per period), and before the next instruction sample is
generated enough packets arrive to fill the last branch array.  On the
other hand, with a very small period far fewer packets arrive between
two consecutive instruction samples, so if the last branch array is
reset for the previous instruction sample, it is almost empty for the
next one.

To allow the last branches to work for any instruction period, this
patch avoids resetting the last branches for every instruction sample
and only resets them when flushing the trace data.  The last branches
are then reset in only two cases, at the start of trace and on
discontinuous trace, so they are recorded continuously.
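
As a rough illustration of the difference (not the perf code itself),
the sketch below models the last branch array as a small ring buffer.
The names lbr, lbr_record(), lbr_sample() and LBR_SIZE are hypothetical
and exist only for this example; the real buffer layout in cs-etm.c may
differ.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define LBR_SIZE 4	/* hypothetical capacity, chosen only for the demo */

/* Hypothetical stand-in for the per-thread last branch ring buffer. */
struct lbr {
	unsigned long from[LBR_SIZE];
	size_t pos;	/* next slot to write */
	size_t nr;	/* number of valid entries, saturates at LBR_SIZE */
};

static void lbr_reset(struct lbr *rb)
{
	memset(rb, 0, sizeof(*rb));
}

/* Record one taken branch, overwriting the oldest entry when full. */
static void lbr_record(struct lbr *rb, unsigned long from)
{
	rb->from[rb->pos] = from;
	rb->pos = (rb->pos + 1) % LBR_SIZE;
	if (rb->nr < LBR_SIZE)
		rb->nr++;
}

/*
 * Take one instruction sample and return how many branch entries it
 * would carry.  With reset_after_sample set, the buffer is cleared
 * after every sample (the behaviour this patch removes).
 */
static size_t lbr_sample(struct lbr *rb, int reset_after_sample)
{
	size_t nr = rb->nr;

	assert(nr <= LBR_SIZE);
	if (reset_after_sample)
		lbr_reset(rb);
	return nr;
}
```

With three branches recorded and two samples taken back to back, the
second sample carries no branches when the buffer is reset after each
sample, and carries all three when it is not.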

Signed-off-by: Leo Yan <leo.yan@linaro.org>
---
 tools/perf/util/cs-etm.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/tools/perf/util/cs-etm.c b/tools/perf/util/cs-etm.c
index f5f855fff412..8be6d010ae84 100644
--- a/tools/perf/util/cs-etm.c
+++ b/tools/perf/util/cs-etm.c
@@ -1153,9 +1153,6 @@ static int cs_etm__synth_instruction_sample(struct cs_etm_queue *etmq,
 			"CS ETM Trace: failed to deliver instruction event, error %d\n",
 			ret);
 
-	if (etm->synth_opts.last_branch)
-		cs_etm__reset_last_branch_rb(tidq);
-
 	return ret;
 }
 
@@ -1486,6 +1483,10 @@ static int cs_etm__flush(struct cs_etm_queue *etmq,
 		tidq->prev_packet = tmp;
 	}
 
+	/* Reset last branches after flush the trace */
+	if (etm->synth_opts.last_branch)
+		cs_etm__reset_last_branch_rb(tidq);
+
 	return err;
 }
 
-- 
2.17.1



* [PATCH v2 2/4] perf cs-etm: Correct synthesizing instruction samples
  2019-11-01  2:07 [PATCH v2 0/4] perf cs-etm: Fix synthesizing instruction samples Leo Yan
  2019-11-01  2:07 ` [PATCH v2 1/4] perf cs-etm: Continuously record last branches Leo Yan
@ 2019-11-01  2:07 ` Leo Yan
  2019-11-01  2:07 ` [PATCH v2 3/4] perf cs-etm: Optimize copying last branches Leo Yan
  2019-11-01  2:07 ` [PATCH v2 4/4] perf cs-etm: Fix unsigned variable comparison to zero Leo Yan
  3 siblings, 0 replies; 7+ messages in thread
From: Leo Yan @ 2019-11-01  2:07 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo, Mathieu Poirier, Suzuki K Poulose,
	Mike Leach, Peter Zijlstra, Ingo Molnar, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, linux-arm-kernel,
	Linux Kernel Mailing List, Coresight ML, Robert Walker
  Cc: Leo Yan

When 'etm->instructions_sample_period' is less than
'tidq->period_instructions', the logic in cs_etm__sample() does not
handle the case properly.

Consider the following flow as an example:

- With the itrace option '--itrace=i4', cs_etm__sample() starts with
  these values:

  tidq->period_instructions = 0
  etm->instructions_sample_period = 4

- When the first packet arrives:

  packet->instr_count = 10; the packet contains 10 executed
  instructions, so period_instructions is updated as below:

  tidq->period_instructions = 0 + 10 = 10
  instrs_over = 10 - 4 = 6
  offset = 10 - 6 - 1 = 3
  tidq->period_instructions = instrs_over = 6

- When the second packet arrives:

  packet->instr_count = 10; assume the second packet again contains 10
  instructions:

  tidq->period_instructions = 6 + 10 = 16
  instrs_over = 16 - 4 = 12
  offset = 10 - 12 - 1 = -3  -> negative value
  tidq->period_instructions = instrs_over = 12

After handling these two packets, there are two issues:

The first issue is that cs_etm__instr_addr() returns the address,
within the current trace sample, of the instruction at the given
offset, so the offset is supposed to always be a non-negative value.
But cs_etm__sample() might calculate a negative offset (-3 when
handling the second packet) and pass it to cs_etm__instr_addr() as a
u64, where it becomes a very large positive integer.

The second issue is that only 2 samples are synthesized for sample
period = 4.  In theory, every packet has 10 instructions, so the two
packets have 20 instructions in total and should generate 5 samples
(4 x 5 = 20).  This is because cs_etm__sample() calls
cs_etm__synth_instruction_sample() only once per range packet.
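
The arithmetic above can be reproduced with plain integers.  The
sketch below is a reduction of the pre-patch calculation, not the
actual function: struct old_state and old_sample_offset() are
hypothetical names, and the assert stands in for the threshold check
that the caller is assumed to have done already.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-thread queue; one field only. */
struct old_state {
	uint64_t period_instructions;
};

/*
 * Pre-patch arithmetic reduced to integers.  Returns the offset that
 * would be handed to cs_etm__instr_addr() for the single sample
 * generated from this packet.
 */
static uint64_t old_sample_offset(struct old_state *st, uint64_t period,
				  uint64_t instrs_executed)
{
	uint64_t instrs_over, offset;

	st->period_instructions += instrs_executed;

	/* Assumed precondition: the sample period has been reached. */
	assert(st->period_instructions >= period);

	/* Instructions executed after the sample point */
	instrs_over = st->period_instructions - period;

	/* Wraps around when instrs_over >= instrs_executed */
	offset = instrs_executed - instrs_over - 1;

	/* Carry remaining instructions into the next sample period */
	st->period_instructions = instrs_over;

	return offset;
}
```

Calling it twice with period 4 and 10 instructions per packet gives
offset 3 for the first packet and a wrapped value equal to
(uint64_t)-3 for the second, while only one sample per packet is ever
produced.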

This patch fixes the logic in cs_etm__sample(); the basic idea is to
handle an incoming packet in three parts:

- The first part synthesizes the first instruction sample; it
  combines the instructions from the tail of the previous packet with
  the instructions from the head of the new packet;
- The second part simply generates samples aligned to the sample
  period;
- The third part is the tail of the new packet; the remaining
  instructions are carried over and handled with the next packet.
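
The three parts can be sketched with plain integers as below.  This is
only a simulation of the head/loop/tail split under the same example
(period 4, two packets of 10 instructions); struct sim_state and
new_sample_offsets() are hypothetical names, and the early return
stands in for the threshold check assumed to happen in the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-thread queue; one field only. */
struct sim_state {
	uint64_t period_instructions;
};

/*
 * Store the in-packet index of each sampled instruction in out[] and
 * return how many indices were stored (at most out_len).
 */
static size_t new_sample_offsets(struct sim_state *st, uint64_t period,
				 uint64_t instrs_executed,
				 uint64_t *out, size_t out_len)
{
	uint64_t head, offset = 0;
	uint64_t instrs_over = instrs_executed;
	size_t n = 0;

	assert(period > 0);	/* a zero period would never terminate */

	st->period_instructions += instrs_executed;
	if (st->period_instructions < period)
		return 0;	/* sample period not reached yet */

	/* Part 1: tail of the previous packet plus head of this one */
	head = period - (st->period_instructions - instrs_executed);
	if (head) {
		offset = head;
		if (n < out_len)
			out[n++] = offset - 1;
		instrs_over -= head;
	}

	/* Part 2: samples aligned to the sample period */
	while (instrs_over >= period) {
		offset += period;
		if (n < out_len)
			out[n++] = offset - 1;
		instrs_over -= period;
	}

	/* Part 3: carry the tail of this packet into the next round */
	st->period_instructions = instrs_over;
	return n;
}
```

For the example flow this yields indices 3 and 7 for the first packet
(2 instructions carried over) and 1, 5 and 9 for the second, which is
5 samples for 20 instructions.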

Suggested-by: Mike Leach <mike.leach@linaro.org>
Signed-off-by: Leo Yan <leo.yan@linaro.org>
---
 tools/perf/util/cs-etm.c | 106 ++++++++++++++++++++++++++++++++++-----
 1 file changed, 93 insertions(+), 13 deletions(-)

diff --git a/tools/perf/util/cs-etm.c b/tools/perf/util/cs-etm.c
index 8be6d010ae84..8e9eb7583bcd 100644
--- a/tools/perf/util/cs-etm.c
+++ b/tools/perf/util/cs-etm.c
@@ -1360,23 +1360,103 @@ static int cs_etm__sample(struct cs_etm_queue *etmq,
 		 * TODO: allow period to be defined in cycles and clock time
 		 */
 
-		/* Get number of instructions executed after the sample point */
-		u64 instrs_over = tidq->period_instructions -
-			etm->instructions_sample_period;
+		/*
+		 * Below diagram is used to demonstrate the instruction samples
+		 * generation flows:
+		 *
+		 *    Instrs     Instrs       Instrs       Instrs
+		 *   Sample(n)  Sample(n+1)  Sample(n+2)  Sample(n+3)
+		 *    |            |            |            |
+		 *    V            V            V            V
+		 *   --------------------------------------------------
+		 *            ^                                  ^
+		 *            |                                  |
+		 *         Period                             Period
+		 *    instructions(Pi)                   instructions(Pi')
+		 *
+		 *            |                                  |
+		 *            \---------------- -----------------/
+		 *                             V
+		 *                      instrs_executed
+		 *
+		 * When the new instruction packet is coming, period
+		 * instructions (Pi) contains the number of instructions
+		 * executed after the sample point(n).  So for the next sample
+		 * point(n+1), it is combined the two parts instructions, one
+		 * is the tail of the old packet and another is the head of
+		 * the new coming packet.  So we use 'head' variable to calculate
+		 * the instruction numbers in the new packet for sample(n+1).
+		 *
+		 * For sample(n+2) and sample(n+3), they consume the instruction
+		 * for sample period, so we directly generate samples based on
+		 * the sample period.
+		 *
+		 * After sample(n+3), there still leave some instructions which
+		 * will be used by later packet; so we use 'instrs_over' to
+		 * track the rest instruction number and its final value
+		 * presents the tail of the packet, it will be assigned to
+		 * 'tidq->period_instructions' for next round calculation.
+		 */
+		u64 head, offset = 0;
+		u64 addr;
 
 		/*
-		 * Calculate the address of the sampled instruction (-1 as
-		 * sample is reported as though instruction has just been
-		 * executed, but PC has not advanced to next instruction)
+		 * 'instrs_over' is the number of instructions executed after
+		 * sample points, initialise it to 'instrs_executed' and will
+		 * decrease it for consumed instructions in every synthesized
+		 * instruction sample.
 		 */
-		u64 offset = (instrs_executed - instrs_over - 1);
-		u64 addr = cs_etm__instr_addr(etmq, trace_chan_id,
-					      tidq->packet, offset);
+		u64 instrs_over = instrs_executed;
 
-		ret = cs_etm__synth_instruction_sample(
-			etmq, tidq, addr, etm->instructions_sample_period);
-		if (ret)
-			return ret;
+		/*
+		 * 'head' is the instructions number of the head in the new
+		 * packet, it combines with the tail of previous packet to
+		 * generate a sample.  So 'head' uses the sample period to
+		 * decrease the instruction number introduced by the previous
+		 * packet.
+		 */
+		head = etm->instructions_sample_period -
+				  (tidq->period_instructions - instrs_executed);
+
+		if (head) {
+			offset = head;
+
+			/*
+			 * Calculate the address of the sampled instruction (-1
+			 * as sample is reported as though instruction has just
+			 * been executed, but PC has not advanced to next
+			 * instruction)
+			 */
+			addr = cs_etm__instr_addr(etmq, trace_chan_id,
+						  tidq->packet, offset - 1);
+			ret = cs_etm__synth_instruction_sample(
+				etmq, tidq, addr,
+				etm->instructions_sample_period);
+			if (ret)
+				return ret;
+
+			instrs_over -= head;
+		}
+
+		while (instrs_over >= etm->instructions_sample_period) {
+			offset += etm->instructions_sample_period;
+
+			/*
+			 * Calculate the address of the sampled instruction (-1
+			 * as sample is reported as though instruction has just
+			 * been executed, but PC has not advanced to next
+			 * instruction)
+			 */
+			addr = cs_etm__instr_addr(etmq, trace_chan_id,
+						  tidq->packet, offset - 1);
+			ret = cs_etm__synth_instruction_sample(
+				etmq, tidq, addr,
+				etm->instructions_sample_period);
+			if (ret)
+				return ret;
+
+			instrs_over -= etm->instructions_sample_period;
+		}
 
 		/* Carry remaining instructions into next sample period */
 		tidq->period_instructions = instrs_over;
-- 
2.17.1



* [PATCH v2 3/4] perf cs-etm: Optimize copying last branches
  2019-11-01  2:07 [PATCH v2 0/4] perf cs-etm: Fix synthesizing instruction samples Leo Yan
  2019-11-01  2:07 ` [PATCH v2 1/4] perf cs-etm: Continuously record last branches Leo Yan
  2019-11-01  2:07 ` [PATCH v2 2/4] perf cs-etm: Correct synthesizing instruction samples Leo Yan
@ 2019-11-01  2:07 ` Leo Yan
  2019-11-01  2:07 ` [PATCH v2 4/4] perf cs-etm: Fix unsigned variable comparison to zero Leo Yan
  3 siblings, 0 replies; 7+ messages in thread
From: Leo Yan @ 2019-11-01  2:07 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo, Mathieu Poirier, Suzuki K Poulose,
	Mike Leach, Peter Zijlstra, Ingo Molnar, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, linux-arm-kernel,
	Linux Kernel Mailing List, Coresight ML, Robert Walker
  Cc: Leo Yan

If an instruction range packet can generate multiple instruction
samples, these samples share the same last branches; it's not necessary
to copy the same last branches repeatedly for these samples within the
same packet.

This patch moves the last branch copying out of
cs_etm__synth_instruction_sample() and executes it once, prior to
generating the instruction samples.

Signed-off-by: Leo Yan <leo.yan@linaro.org>
---
 tools/perf/util/cs-etm.c | 22 +++++++++++++++++-----
 1 file changed, 17 insertions(+), 5 deletions(-)

diff --git a/tools/perf/util/cs-etm.c b/tools/perf/util/cs-etm.c
index 8e9eb7583bcd..d9a857abaca8 100644
--- a/tools/perf/util/cs-etm.c
+++ b/tools/perf/util/cs-etm.c
@@ -1134,10 +1134,8 @@ static int cs_etm__synth_instruction_sample(struct cs_etm_queue *etmq,
 
 	cs_etm__copy_insn(etmq, tidq->trace_chan_id, tidq->packet, &sample);
 
-	if (etm->synth_opts.last_branch) {
-		cs_etm__copy_last_branch_rb(etmq, tidq);
+	if (etm->synth_opts.last_branch)
 		sample.branch_stack = tidq->last_branch;
-	}
 
 	if (etm->synth_opts.inject) {
 		ret = cs_etm__inject_event(event, &sample,
@@ -1408,6 +1406,10 @@ static int cs_etm__sample(struct cs_etm_queue *etmq,
 		 */
 		u64 instrs_over = instrs_executed;
 
+		/* Prepare last branches for instruction sample */
+		if (etm->synth_opts.last_branch)
+			cs_etm__copy_last_branch_rb(etmq, tidq);
+
 		/*
 		 * 'head' is the instructions number of the head in the new
 		 * packet, it combines with the tail of previous packet to
@@ -1526,6 +1528,11 @@ static int cs_etm__flush(struct cs_etm_queue *etmq,
 
 	if (etmq->etm->synth_opts.last_branch &&
 	    tidq->prev_packet->sample_type == CS_ETM_RANGE) {
+		u64 addr;
+
+		/* Prepare last branches for instruction sample */
+		cs_etm__copy_last_branch_rb(etmq, tidq);
+
 		/*
 		 * Generate a last branch event for the branches left in the
 		 * circular buffer at the end of the trace.
@@ -1533,7 +1540,7 @@ static int cs_etm__flush(struct cs_etm_queue *etmq,
 		 * Use the address of the end of the last reported execution
 		 * range
 		 */
-		u64 addr = cs_etm__last_executed_instr(tidq->prev_packet);
+		addr = cs_etm__last_executed_instr(tidq->prev_packet);
 
 		err = cs_etm__synth_instruction_sample(
 			etmq, tidq, addr,
@@ -1586,11 +1593,16 @@ static int cs_etm__end_block(struct cs_etm_queue *etmq,
 	 */
 	if (etmq->etm->synth_opts.last_branch &&
 	    tidq->prev_packet->sample_type == CS_ETM_RANGE) {
+		u64 addr;
+
+		/* Prepare last branches for instruction sample */
+		cs_etm__copy_last_branch_rb(etmq, tidq);
+
 		/*
 		 * Use the address of the end of the last reported execution
 		 * range.
 		 */
-		u64 addr = cs_etm__last_executed_instr(tidq->prev_packet);
+		addr = cs_etm__last_executed_instr(tidq->prev_packet);
 
 		err = cs_etm__synth_instruction_sample(
 			etmq, tidq, addr,
-- 
2.17.1



* [PATCH v2 4/4] perf cs-etm: Fix unsigned variable comparison to zero
  2019-11-01  2:07 [PATCH v2 0/4] perf cs-etm: Fix synthesizing instruction samples Leo Yan
                   ` (2 preceding siblings ...)
  2019-11-01  2:07 ` [PATCH v2 3/4] perf cs-etm: Optimize copying last branches Leo Yan
@ 2019-11-01  2:07 ` Leo Yan
  3 siblings, 0 replies; 7+ messages in thread
From: Leo Yan @ 2019-11-01  2:07 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo, Mathieu Poirier, Suzuki K Poulose,
	Mike Leach, Peter Zijlstra, Ingo Molnar, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, linux-arm-kernel,
	Linux Kernel Mailing List, Coresight ML, Robert Walker
  Cc: Leo Yan

The variable 'offset' in function cs_etm__instr_addr() is of type u64,
so checking it with 'while (offset > 0)' is not appropriate; this patch
changes the check to 'while (offset)'.
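
For an unsigned type, 'offset > 0' and 'offset' are the same
condition, so the loop behaves identically after the change.  The
sketch below shows the loop shape in isolation; walk_instrs() and
toy_instr_size() are hypothetical names, and the size rule is made up
for the demo rather than taken from the T32 decoder.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical size rule for the demo: 4 bytes at 8-byte-aligned
 * addresses, 2 bytes everywhere else.
 */
static uint64_t toy_instr_size(uint64_t addr)
{
	return (addr % 8 == 0) ? 4 : 2;
}

/* Step forward over 'offset' variable-length instructions. */
static uint64_t walk_instrs(uint64_t start, uint64_t offset,
			    uint64_t (*instr_size)(uint64_t addr))
{
	uint64_t addr = start;

	assert(instr_size);

	/* 'offset' is unsigned, so this is equivalent to 'offset > 0' */
	while (offset) {
		addr += instr_size(addr);
		offset--;
	}

	return addr;
}
```

Note that a negative offset converted to u64 would make such a loop
run for a very long time, which is why the callers have to pass a
valid in-packet offset.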

Signed-off-by: Leo Yan <leo.yan@linaro.org>
---
 tools/perf/util/cs-etm.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/perf/util/cs-etm.c b/tools/perf/util/cs-etm.c
index d9a857abaca8..52fe7d6d4f29 100644
--- a/tools/perf/util/cs-etm.c
+++ b/tools/perf/util/cs-etm.c
@@ -945,7 +945,7 @@ static inline u64 cs_etm__instr_addr(struct cs_etm_queue *etmq,
 	if (packet->isa == CS_ETM_ISA_T32) {
 		u64 addr = packet->start_addr;
 
-		while (offset > 0) {
+		while (offset) {
 			addr += cs_etm__t32_instr_size(etmq,
 						       trace_chan_id, addr);
 			offset--;
-- 
2.17.1



* Re: [PATCH v2 1/4] perf cs-etm: Continuously record last branches
  2019-11-01  2:07 ` [PATCH v2 1/4] perf cs-etm: Continuously record last branches Leo Yan
@ 2019-11-01 15:30   ` Robert Walker
  2019-11-02  6:11     ` Leo Yan
  0 siblings, 1 reply; 7+ messages in thread
From: Robert Walker @ 2019-11-01 15:30 UTC (permalink / raw)
  To: Leo Yan, Arnaldo Carvalho de Melo, Mathieu Poirier,
	Suzuki K Poulose, Mike Leach, Peter Zijlstra, Ingo Molnar,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	linux-arm-kernel, Linux Kernel Mailing List, Coresight ML

On 01/11/2019 02:07, Leo Yan wrote:
> Every time synthesize instruction sample, the last branches recording
> will be reset.  This would be fine if the instruction period is big
> enough, for example if we use the option '--itrace=i100000', the last
> branch array is reset for every instruction sample (10000 instructions
> per period); before generate the next instruction sample, there has the
> enough packets coming to fill last branch array.  On the other hand,
> if set a very small period, the packets will be significantly reduced
> between two continuous instruction samples, thus if the last branch
> array is reset for the previous instruction sample, it's almost empty
> for the next instruction sample.
>
> To allow the last branches to work for any instruction periods, this
> patch avoids to reset the last branches for every instruction sample
> and only reset it when flush the trace data.  The last branches will
> be reset only for two cases, one is for trace starting, another case
> is for discontinuous trace; thus it can continuously record last
> branches.

Is this the right thing to do?  This would cause profiling tools to 
count the same branch several times if it appears in multiple 
instruction samples, which could result in a biased profile.

The current implementation matches the behavior of intel_pt where the
branch buffer is reset after each sample, so the instruction sample
only includes branches since the previous sample.

However x86 lbr (perf record -b) does appear to repeat the same full 
branch stack on several samples until a new stack is captured.

I'm not sure what the right or wrong answer is here.  For AutoFDO, we're 
likely to use a much bigger period (>10000 instructions) so won't be 
affected, but other tools might be.

Regards


Rob


> Signed-off-by: Leo Yan <leo.yan@linaro.org>
> ---
>   tools/perf/util/cs-etm.c | 7 ++++---
>   1 file changed, 4 insertions(+), 3 deletions(-)
>
> diff --git a/tools/perf/util/cs-etm.c b/tools/perf/util/cs-etm.c
> index f5f855fff412..8be6d010ae84 100644
> --- a/tools/perf/util/cs-etm.c
> +++ b/tools/perf/util/cs-etm.c
> @@ -1153,9 +1153,6 @@ static int cs_etm__synth_instruction_sample(struct cs_etm_queue *etmq,
>   			"CS ETM Trace: failed to deliver instruction event, error %d\n",
>   			ret);
>   
> -	if (etm->synth_opts.last_branch)
> -		cs_etm__reset_last_branch_rb(tidq);
> -
>   	return ret;
>   }
>   
> @@ -1486,6 +1483,10 @@ static int cs_etm__flush(struct cs_etm_queue *etmq,
>   		tidq->prev_packet = tmp;
>   	}
>   
> +	/* Reset last branches after flush the trace */
> +	if (etm->synth_opts.last_branch)
> +		cs_etm__reset_last_branch_rb(tidq);
> +
>   	return err;
>   }
>   


* Re: [PATCH v2 1/4] perf cs-etm: Continuously record last branches
  2019-11-01 15:30   ` Robert Walker
@ 2019-11-02  6:11     ` Leo Yan
  0 siblings, 0 replies; 7+ messages in thread
From: Leo Yan @ 2019-11-02  6:11 UTC (permalink / raw)
  To: Robert Walker, Adrian Hunter
  Cc: Arnaldo Carvalho de Melo, Mathieu Poirier, Suzuki K Poulose,
	Mike Leach, Peter Zijlstra, Ingo Molnar, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, linux-arm-kernel,
	Linux Kernel Mailing List, Coresight ML

Hi Rob,

On Fri, Nov 01, 2019 at 03:30:19PM +0000, Robert Walker wrote:
> On 01/11/2019 02:07, Leo Yan wrote:
> > Every time synthesize instruction sample, the last branches recording
> > will be reset.  This would be fine if the instruction period is big
> > enough, for example if we use the option '--itrace=i100000', the last
> > branch array is reset for every instruction sample (10000 instructions
> > per period); before generate the next instruction sample, there has the
> > enough packets coming to fill last branch array.  On the other hand,
> > if set a very small period, the packets will be significantly reduced
> > between two continuous instruction samples, thus if the last branch
> > array is reset for the previous instruction sample, it's almost empty
> > for the next instruction sample.
> > 
> > To allow the last branches to work for any instruction periods, this
> > patch avoids to reset the last branches for every instruction sample
> > and only reset it when flush the trace data.  The last branches will
> > be reset only for two cases, one is for trace starting, another case
> > is for discontinuous trace; thus it can continuously record last
> > branches.
> 
> Is this the right thing to do?

Thanks for reviewing and bringing up the questions.  To be honest, my
concern was mainly related to AutoFDO and I am not aware of other
potential issues.  So any concern is welcome, in case I missed
anything; I hope we can reach a conclusion with some discussion.
Please see the more detailed explanation below.

> This would cause profiling tools to count
> the same branch several times if it appears in multiple instruction samples,
> which could result in a biased profile.

Let me clarify this.  Firstly, the 'branch' here doesn't refer to a
'branch' sample; it means the last branch recording attached to
instruction samples.  So basically, neither instruction samples nor
branch samples are changed by this patch.

This patch tries to fix the issue shown below:

Before this patch:

  ffff800010083580 <el0_sync>:
  ffff800010083580:  stp     x0, x1, [sp]         -> synthesize instruction sample(n),
                                                     record the last branch,
                                                     reset the last branch.
  ffff800010083584:  stp     x2, x3, [sp,#16]
  ffff800010083588:  stp     x4, x5, [sp,#32]     -> synthesize instruction sample(n+1),
                                                     the last branch is empty because it
                                                     was reset by instruction sample(n).
  ffff80001008358c:  stp     x6, x7, [sp,#48]
  ffff800010083590:  stp     x8, x9, [sp,#64]     -> synthesize instruction sample(n+2),
                                                     the last branch is empty because it
                                                     was reset by instruction sample(n+1).
  [...]


After this patch:

  ffff800010083580 <el0_sync>:
  ffff800010083580:  stp     x0, x1, [sp]         -> synthesize instruction sample(n),
                                                     record the last branch.
  ffff800010083584:  stp     x2, x3, [sp,#16]
  ffff800010083588:  stp     x4, x5, [sp,#32]     -> synthesize instruction sample(n+1),
                                                     record the last branch.
  ffff80001008358c:  stp     x6, x7, [sp,#48]
  ffff800010083590:  stp     x8, x9, [sp,#64]     -> synthesize instruction sample(n+2),
                                                     record the last branch.
  [...]


So from my understanding, the last branch recording serves as
auxiliary info for instruction samples, and it allows us (or tools) to
know the execution flow leading to each instruction sample.  It seems
to me that it doesn't change the value of the instruction sample, but
we get correct last branch info for every instruction sample.

> The current implementation matches the behavior of intel_pt where the branch
> buffer is reset after each sample, so  the instruction sample only includes
> branches since the previous sample.

Exactly.

@Adrian, it would be nice if you could confirm whether intel_pt should
apply a similar fix or not.

> However x86 lbr (perf record -b) does appear to repeat the same full branch
> stack on several samples until a new stack is captured.
> 
> I'm not sure what the right or wrong answer is here.  For AutoFDO, we're
> likely to use a much bigger period (>10000 instructions) so won't be
> affected, but other tools might be.

Agreed, if AutoFDO uses a big period (e.g. --itrace=i10000), this patch
will not change anything.  With a big period, there are enough packets
to fill the branch recording between two instruction samples.

Could you elaborate on what the 'other tools' are?  If it's an
open-source tool, I can try to test it with this patch set.

Thanks,
Leo Yan
