linux-kernel.vger.kernel.org archive mirror
* [PATCH 00/10] Stitch LBR call stack
@ 2019-10-07 17:59 kan.liang
  2019-10-07 17:59 ` [PATCH 01/10] perf/core, x86: Add PERF_SAMPLE_LBR_TOS kan.liang
                   ` (10 more replies)
  0 siblings, 11 replies; 18+ messages in thread
From: kan.liang @ 2019-10-07 17:59 UTC (permalink / raw)
  To: peterz, acme, mingo, linux-kernel
  Cc: jolsa, namhyung, ak, vitaly.slobodskoy, pavel.gerasimov, Kan Liang

From: Kan Liang <kan.liang@linux.intel.com>

Starting from Haswell, Linux perf can utilize the existing Last Branch
Record (LBR) facility to record the call stack. However, the depth of the
reconstructed LBR call stack is limited to the number of LBR registers.
E.g. on Skylake, the depth of the reconstructed LBR call stack is at most
32, because the HW overwrites the oldest LBR registers once they are
full.

However, the overwritten LBRs may still be retrieved from a previous
sample, taken at a moment when the HW had not yet overwritten those LBR
registers. Perf tools can stitch those overwritten LBRs onto the current
call stack to get a more complete call stack.

To determine whether LBRs can be stitched, the physical index of the LBR
registers is required. A new sample type is introduced in patches 1 & 2
to dump the LBR Top-of-Stack (TOS) information for perf tools.
The maximum number of LBRs is required as well. Patches 3 & 4 retrieve
the capabilities information from sysfs and save it in the perf header.
Patches 5 & 6 implement the LBR stitching approach.
Users can use the options introduced in patches 7-10 to enable the LBR
stitching approach for perf report, script, top and c2c.
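
For reference, the capabilities that patches 3 & 4 read are plain sysfs
files. For example, on a CPU with 32 LBR entries (the exact set of caps
files varies by CPU):

  cat /sys/bus/event_source/devices/cpu/caps/branches
  32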

The stitching approach is based on LBR call stack technology. The known
limitations of LBR call stack technology still apply, e.g. exception
handling such as setjmp/longjmp will have calls/returns that do not
match.
The approach is not foolproof. There can be cases where it creates
incorrect call stacks from incorrect matches. There is no attempt to
validate matches in another way, so it is not enabled by default.
However, in many common cases with call stack overflows it can recreate
better call stacks than the default LBR call stack output. So if there
are problems with LBR overflows, this is a possible workaround.

Performance impact:
The processing time may increase with the LBR stitching approach
enabled. The impact depends on the number of samples with stitched LBRs.

For sqlite's tcltest,
perf record --call-graph lbr -- make tcltest
perf report --stitch-lbr

4.11% of the samples have stitched LBRs.
Total number of samples:                        2833728
The number of samples with stitched LBRs        116478

The processing time of perf report increases by 6.8%.
Without --stitch-lbr:                           55906106 usec
With --stitch-lbr:                              59728701 usec

For a simple test case, tchain_edit, with a call stack depth of 43:
perf record --call-graph lbr -- ./tchain_edit
perf report --stitch-lbr

99.9% of the samples have stitched LBRs.
Total number of samples:                        10915
The number of samples with stitched LBRs        10905

The processing time of perf report increases by 67.4%.
Without --stitch-lbr:                           11970508 usec
With --stitch-lbr:                              20036055 usec

The source code of tchain_edit.c looks like the following.
noinline void f43(void)
{
        int i;
        for (i = 0; i < 10000;) {

                if(i%2)
                        i++;
                else
                        i++;
        }
}

noinline void f42(void)
{
        int i;
        for (i = 0; i < 100; i++) {
                f43();
                f43();
                f43();
        }
}

noinline void f41(void)
{
        int i;
        for (i = 0; i < 100; i++) {
                f42();
                f42();
                f42();
        }
}

noinline void f40(void)
{
        f41();
}

... ...

noinline void f32(void)
{
        f33();
}

noinline void f31(void)
{
        int i;

        for (i = 0; i < 10000; i++) {
                if(i%2)
                        i++;
                else
                        i++;
        }

        f32();
}

noinline void f30(void)
{
        f31();
}

... ...

noinline void f1(void)
{
        f2();
}

int main()
{
        f1();
}

Kan Liang (10):
  perf/core, x86: Add PERF_SAMPLE_LBR_TOS
  perf tools: Support PERF_SAMPLE_LBR_TOS
  perf pmu: Add support for PMU capabilities
  perf header: Support CPU PMU capabilities
  perf machine: Refine the function for LBR call stack reconstruction
  perf tools: Stitch LBR call stack
  perf report: Add option to enable the LBR stitching approach
  perf script: Add option to enable the LBR stitching approach
  perf top: Add option to enable the LBR stitching approach
  perf c2c: Add option to enable the LBR stitching approach

 arch/x86/events/intel/lbr.c                   |   9 +
 include/linux/perf_event.h                    |   1 +
 include/uapi/linux/perf_event.h               |   4 +-
 kernel/events/core.c                          |  12 +
 tools/include/uapi/linux/perf_event.h         |   4 +-
 tools/perf/Documentation/perf-c2c.txt         |  11 +
 tools/perf/Documentation/perf-report.txt      |  11 +
 tools/perf/Documentation/perf-script.txt      |  11 +
 tools/perf/Documentation/perf-top.txt         |   9 +
 .../Documentation/perf.data-file-format.txt   |  16 +
 tools/perf/builtin-c2c.c                      |   6 +
 tools/perf/builtin-record.c                   |   3 +
 tools/perf/builtin-report.c                   |   6 +
 tools/perf/builtin-script.c                   |   6 +
 tools/perf/builtin-stat.c                     |   1 +
 tools/perf/builtin-top.c                      |  11 +
 tools/perf/util/branch.h                      |  10 +-
 tools/perf/util/env.h                         |   3 +
 tools/perf/util/event.h                       |   1 +
 tools/perf/util/evsel.c                       |  16 +-
 tools/perf/util/evsel.h                       |   1 +
 tools/perf/util/header.c                      | 110 +++++++
 tools/perf/util/header.h                      |   1 +
 tools/perf/util/machine.c                     | 303 ++++++++++++++----
 tools/perf/util/perf_event_attr_fprintf.c     |   2 +-
 tools/perf/util/pmu.c                         |  87 +++++
 tools/perf/util/pmu.h                         |  12 +
 tools/perf/util/synthetic-events.c            |   8 +
 tools/perf/util/thread.c                      |   3 +
 tools/perf/util/thread.h                      |  18 ++
 tools/perf/util/top.h                         |   1 +
 31 files changed, 626 insertions(+), 71 deletions(-)

-- 
2.17.1



* [PATCH 01/10] perf/core, x86: Add PERF_SAMPLE_LBR_TOS
  2019-10-07 17:59 [PATCH 00/10] Stitch LBR call stack kan.liang
@ 2019-10-07 17:59 ` kan.liang
  2019-10-08  8:31   ` Peter Zijlstra
  2019-10-07 17:59 ` [PATCH 02/10] perf tools: Support PERF_SAMPLE_LBR_TOS kan.liang
                   ` (9 subsequent siblings)
  10 siblings, 1 reply; 18+ messages in thread
From: kan.liang @ 2019-10-07 17:59 UTC (permalink / raw)
  To: peterz, acme, mingo, linux-kernel
  Cc: jolsa, namhyung, ak, vitaly.slobodskoy, pavel.gerasimov, Kan Liang

From: Kan Liang <kan.liang@linux.intel.com>

In LBR call stack mode, the depth of the reconstructed LBR call stack is
limited to the number of LBR registers. With LBR Top-of-Stack (TOS)
information, the perf tool may stitch the stacks of two samples. The
reconstructed LBR call stack can then break the HW limitation.

Add a new sample type for LBR TOS.

The PEBS record doesn't store TOS information. For single PEBS, TOS can
be read directly from the MSR, because the PMI is triggered immediately
after the PEBS record is written, so the TOS MSR is still unchanged.
For large PEBS, the TOS MSR has a stale value. Set -1ULL to indicate that
the TOS information is not available.

Reviewed-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
 arch/x86/events/intel/lbr.c     |  9 +++++++++
 include/linux/perf_event.h      |  1 +
 include/uapi/linux/perf_event.h |  4 +++-
 kernel/events/core.c            | 12 ++++++++++++
 4 files changed, 25 insertions(+), 1 deletion(-)

diff --git a/arch/x86/events/intel/lbr.c b/arch/x86/events/intel/lbr.c
index ea54634eabf3..4640ff1c9ecb 100644
--- a/arch/x86/events/intel/lbr.c
+++ b/arch/x86/events/intel/lbr.c
@@ -562,6 +562,7 @@ static void intel_pmu_lbr_read_32(struct cpu_hw_events *cpuc)
 		cpuc->lbr_entries[i].reserved	= 0;
 	}
 	cpuc->lbr_stack.nr = i;
+	cpuc->lbr_stack.tos = tos;
 }
 
 /*
@@ -657,6 +658,7 @@ static void intel_pmu_lbr_read_64(struct cpu_hw_events *cpuc)
 		out++;
 	}
 	cpuc->lbr_stack.nr = out;
+	cpuc->lbr_stack.tos = tos;
 }
 
 void intel_pmu_lbr_read(void)
@@ -1097,6 +1099,13 @@ void intel_pmu_store_pebs_lbrs(struct pebs_lbr *lbr)
 	int i;
 
 	cpuc->lbr_stack.nr = x86_pmu.lbr_nr;
+
+	/* Cannot get TOS for large PEBS */
+	if (cpuc->n_pebs == cpuc->n_large_pebs)
+		cpuc->lbr_stack.tos = -1ULL;
+	else
+		cpuc->lbr_stack.tos = intel_pmu_lbr_tos();
+
 	for (i = 0; i < x86_pmu.lbr_nr; i++) {
 		u64 info = lbr->lbr[i].info;
 		struct perf_branch_entry *e = &cpuc->lbr_entries[i];
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 61448c19a132..ee9ef0c4cb08 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -100,6 +100,7 @@ struct perf_raw_record {
  */
 struct perf_branch_stack {
 	__u64				nr;
+	__u64				tos;
 	struct perf_branch_entry	entries[0];
 };
 
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index bb7b271397a6..fe36ebb7dc2e 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -141,8 +141,9 @@ enum perf_event_sample_format {
 	PERF_SAMPLE_TRANSACTION			= 1U << 17,
 	PERF_SAMPLE_REGS_INTR			= 1U << 18,
 	PERF_SAMPLE_PHYS_ADDR			= 1U << 19,
+	PERF_SAMPLE_LBR_TOS			= 1U << 20,
 
-	PERF_SAMPLE_MAX = 1U << 20,		/* non-ABI */
+	PERF_SAMPLE_MAX = 1U << 21,		/* non-ABI */
 
 	__PERF_SAMPLE_CALLCHAIN_EARLY		= 1ULL << 63, /* non-ABI; internal use */
 };
@@ -864,6 +865,7 @@ enum perf_event_type {
 	 *	{ u64			abi; # enum perf_sample_regs_abi
 	 *	  u64			regs[weight(mask)]; } && PERF_SAMPLE_REGS_INTR
 	 *	{ u64			phys_addr;} && PERF_SAMPLE_PHYS_ADDR
+	 *	{ u64			tos;} && PERF_SAMPLE_LBR_TOS
 	 * };
 	 */
 	PERF_RECORD_SAMPLE			= 9,
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 275eae05af20..6ab0913c7b36 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -6468,6 +6468,15 @@ void perf_output_sample(struct perf_output_handle *handle,
 	if (sample_type & PERF_SAMPLE_PHYS_ADDR)
 		perf_output_put(handle, data->phys_addr);
 
+	if (sample_type & PERF_SAMPLE_LBR_TOS) {
+		u64 tos = -1ULL;
+
+		if (data->br_stack)
+			tos = data->br_stack->tos;
+
+		perf_output_put(handle, tos);
+	}
+
 	if (!event->attr.watermark) {
 		int wakeup_events = event->attr.wakeup_events;
 
@@ -6656,6 +6665,9 @@ void perf_prepare_sample(struct perf_event_header *header,
 
 	if (sample_type & PERF_SAMPLE_PHYS_ADDR)
 		data->phys_addr = perf_virt_to_phys(data->addr);
+
+	if (sample_type & PERF_SAMPLE_LBR_TOS)
+		header->size += sizeof(u64);
 }
 
 static __always_inline int
-- 
2.17.1



* [PATCH 02/10] perf tools: Support PERF_SAMPLE_LBR_TOS
  2019-10-07 17:59 [PATCH 00/10] Stitch LBR call stack kan.liang
  2019-10-07 17:59 ` [PATCH 01/10] perf/core, x86: Add PERF_SAMPLE_LBR_TOS kan.liang
@ 2019-10-07 17:59 ` kan.liang
  2019-10-07 17:59 ` [PATCH 03/10] perf pmu: Add support for PMU capabilities kan.liang
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 18+ messages in thread
From: kan.liang @ 2019-10-07 17:59 UTC (permalink / raw)
  To: peterz, acme, mingo, linux-kernel
  Cc: jolsa, namhyung, ak, vitaly.slobodskoy, pavel.gerasimov, Kan Liang

From: Kan Liang <kan.liang@linux.intel.com>

Support the new sample type PERF_SAMPLE_LBR_TOS.

Enable LBR_TOS by default in LBR call stack mode.
If the kernel doesn't support the sample type, switch it off.

Reviewed-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
 tools/include/uapi/linux/perf_event.h     |  4 +++-
 tools/perf/util/event.h                   |  1 +
 tools/perf/util/evsel.c                   | 16 +++++++++++++++-
 tools/perf/util/evsel.h                   |  1 +
 tools/perf/util/perf_event_attr_fprintf.c |  2 +-
 tools/perf/util/synthetic-events.c        |  8 ++++++++
 6 files changed, 29 insertions(+), 3 deletions(-)

diff --git a/tools/include/uapi/linux/perf_event.h b/tools/include/uapi/linux/perf_event.h
index bb7b271397a6..fe36ebb7dc2e 100644
--- a/tools/include/uapi/linux/perf_event.h
+++ b/tools/include/uapi/linux/perf_event.h
@@ -141,8 +141,9 @@ enum perf_event_sample_format {
 	PERF_SAMPLE_TRANSACTION			= 1U << 17,
 	PERF_SAMPLE_REGS_INTR			= 1U << 18,
 	PERF_SAMPLE_PHYS_ADDR			= 1U << 19,
+	PERF_SAMPLE_LBR_TOS			= 1U << 20,
 
-	PERF_SAMPLE_MAX = 1U << 20,		/* non-ABI */
+	PERF_SAMPLE_MAX = 1U << 21,		/* non-ABI */
 
 	__PERF_SAMPLE_CALLCHAIN_EARLY		= 1ULL << 63, /* non-ABI; internal use */
 };
@@ -864,6 +865,7 @@ enum perf_event_type {
 	 *	{ u64			abi; # enum perf_sample_regs_abi
 	 *	  u64			regs[weight(mask)]; } && PERF_SAMPLE_REGS_INTR
 	 *	{ u64			phys_addr;} && PERF_SAMPLE_PHYS_ADDR
+	 *	{ u64			tos;} && PERF_SAMPLE_LBR_TOS
 	 * };
 	 */
 	PERF_RECORD_SAMPLE			= 9,
diff --git a/tools/perf/util/event.h b/tools/perf/util/event.h
index a0a0c91cde4a..268c1715e032 100644
--- a/tools/perf/util/event.h
+++ b/tools/perf/util/event.h
@@ -130,6 +130,7 @@ struct perf_sample {
 	u32 raw_size;
 	u64 data_src;
 	u64 phys_addr;
+	u64 tos;
 	u32 flags;
 	u16 insn_len;
 	u8  cpumode;
diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index abc7fda4a0fe..0752417bbc45 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -709,6 +709,7 @@ static void __perf_evsel__config_callchain(struct evsel *evsel,
 					   "Falling back to framepointers.\n");
 			} else {
 				perf_evsel__set_sample_bit(evsel, BRANCH_STACK);
+				perf_evsel__set_sample_bit(evsel, LBR_TOS);
 				attr->branch_sample_type = PERF_SAMPLE_BRANCH_USER |
 							PERF_SAMPLE_BRANCH_CALL_STACK |
 							PERF_SAMPLE_BRANCH_NO_CYCLES |
@@ -762,6 +763,7 @@ perf_evsel__reset_callgraph(struct evsel *evsel,
 	perf_evsel__reset_sample_bit(evsel, CALLCHAIN);
 	if (param->record_mode == CALLCHAIN_LBR) {
 		perf_evsel__reset_sample_bit(evsel, BRANCH_STACK);
+		perf_evsel__reset_sample_bit(evsel, LBR_TOS);
 		attr->branch_sample_type &= ~(PERF_SAMPLE_BRANCH_USER |
 					      PERF_SAMPLE_BRANCH_CALL_STACK);
 	}
@@ -1641,6 +1643,8 @@ int evsel__open(struct evsel *evsel, struct perf_cpu_map *cpus,
 		evsel->core.attr.ksymbol = 0;
 	if (perf_missing_features.bpf)
 		evsel->core.attr.bpf_event = 0;
+	if (perf_missing_features.lbr_tos)
+		perf_evsel__reset_sample_bit(evsel, LBR_TOS);
 retry_sample_id:
 	if (perf_missing_features.sample_id_all)
 		evsel->core.attr.sample_id_all = 0;
@@ -1752,7 +1756,11 @@ int evsel__open(struct evsel *evsel, struct perf_cpu_map *cpus,
 	 * Must probe features in the order they were added to the
 	 * perf_event_attr interface.
 	 */
-	if (!perf_missing_features.aux_output && evsel->core.attr.aux_output) {
+	if (!perf_missing_features.lbr_tos && (evsel->core.attr.sample_type & PERF_SAMPLE_LBR_TOS)) {
+		perf_missing_features.lbr_tos = true;
+		pr_debug2("switching off LBR TOS support\n");
+		goto fallback_missing_features;
+	} else if (!perf_missing_features.aux_output && evsel->core.attr.aux_output) {
 		perf_missing_features.aux_output = true;
 		pr_debug2("Kernel has no attr.aux_output support, bailing out\n");
 		goto out_close;
@@ -2206,6 +2214,12 @@ int perf_evsel__parse_sample(struct evsel *evsel, union perf_event *event,
 		array++;
 	}
 
+	data->tos = -1ULL;
+	if (type & PERF_SAMPLE_LBR_TOS) {
+		data->tos = *array;
+		array++;
+	}
+
 	return 0;
 }
 
diff --git a/tools/perf/util/evsel.h b/tools/perf/util/evsel.h
index ddc5ee6f6592..e4768c60da93 100644
--- a/tools/perf/util/evsel.h
+++ b/tools/perf/util/evsel.h
@@ -115,6 +115,7 @@ struct perf_missing_features {
 	bool ksymbol;
 	bool bpf;
 	bool aux_output;
+	bool lbr_tos;
 };
 
 extern struct perf_missing_features perf_missing_features;
diff --git a/tools/perf/util/perf_event_attr_fprintf.c b/tools/perf/util/perf_event_attr_fprintf.c
index d4ad3f04923a..254f1bf8dcae 100644
--- a/tools/perf/util/perf_event_attr_fprintf.c
+++ b/tools/perf/util/perf_event_attr_fprintf.c
@@ -34,7 +34,7 @@ static void __p_sample_type(char *buf, size_t size, u64 value)
 		bit_name(PERIOD), bit_name(STREAM_ID), bit_name(RAW),
 		bit_name(BRANCH_STACK), bit_name(REGS_USER), bit_name(STACK_USER),
 		bit_name(IDENTIFIER), bit_name(REGS_INTR), bit_name(DATA_SRC),
-		bit_name(WEIGHT), bit_name(PHYS_ADDR),
+		bit_name(WEIGHT), bit_name(PHYS_ADDR), bit_name(LBR_TOS),
 		{ .name = NULL, }
 	};
 #undef bit_name
diff --git a/tools/perf/util/synthetic-events.c b/tools/perf/util/synthetic-events.c
index 807cbca403a7..a7d02e81defe 100644
--- a/tools/perf/util/synthetic-events.c
+++ b/tools/perf/util/synthetic-events.c
@@ -1228,6 +1228,9 @@ size_t perf_event__sample_event_size(const struct perf_sample *sample, u64 type,
 	if (type & PERF_SAMPLE_PHYS_ADDR)
 		result += sizeof(u64);
 
+	if (type & PERF_SAMPLE_LBR_TOS)
+		result += sizeof(u64);
+
 	return result;
 }
 
@@ -1396,6 +1399,11 @@ int perf_event__synthesize_sample(union perf_event *event, u64 type, u64 read_fo
 		array++;
 	}
 
+	if (type & PERF_SAMPLE_LBR_TOS) {
+		*array = sample->tos;
+		array++;
+	}
+
 	return 0;
 }
 
-- 
2.17.1



* [PATCH 03/10] perf pmu: Add support for PMU capabilities
  2019-10-07 17:59 [PATCH 00/10] Stitch LBR call stack kan.liang
  2019-10-07 17:59 ` [PATCH 01/10] perf/core, x86: Add PERF_SAMPLE_LBR_TOS kan.liang
  2019-10-07 17:59 ` [PATCH 02/10] perf tools: Support PERF_SAMPLE_LBR_TOS kan.liang
@ 2019-10-07 17:59 ` kan.liang
  2019-10-07 17:59 ` [PATCH 04/10] perf header: Support CPU " kan.liang
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 18+ messages in thread
From: kan.liang @ 2019-10-07 17:59 UTC (permalink / raw)
  To: peterz, acme, mingo, linux-kernel
  Cc: jolsa, namhyung, ak, vitaly.slobodskoy, pavel.gerasimov, Kan Liang

From: Kan Liang <kan.liang@linux.intel.com>

The PMU capabilities information, which is located at
/sys/bus/event_source/devices/<dev>/caps, is required by the perf tool.
For example, the max LBR information is required to stitch the LBR call
stack.

Add perf_pmu__caps_parse() to parse the PMU capabilities information.
The information is stored in a list.

Add perf_pmu__scan_caps() to scan the capabilities one by one.

The following patch will store the capabilities information in the perf
header.
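
The intended usage is to parse the capabilities once and then iterate
them, as the perf header code in the following patch does. A minimal
sketch (error handling omitted):

	struct perf_pmu *pmu = perf_pmu__find("cpu");
	struct perf_pmu_caps *caps = NULL;

	if (pmu && perf_pmu__caps_parse(pmu) > 0) {
		/* walk the parsed name=value pairs */
		while ((caps = perf_pmu__scan_caps(pmu, caps)))
			printf("%s=%s\n", caps->name, caps->value);
	}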

Reviewed-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
 tools/perf/util/pmu.c | 87 +++++++++++++++++++++++++++++++++++++++++++
 tools/perf/util/pmu.h | 12 ++++++
 2 files changed, 99 insertions(+)

diff --git a/tools/perf/util/pmu.c b/tools/perf/util/pmu.c
index 5608da82ad23..d3dc9d4f9479 100644
--- a/tools/perf/util/pmu.c
+++ b/tools/perf/util/pmu.c
@@ -847,6 +847,7 @@ static struct perf_pmu *pmu_lookup(const char *name)
 
 	INIT_LIST_HEAD(&pmu->format);
 	INIT_LIST_HEAD(&pmu->aliases);
+	INIT_LIST_HEAD(&pmu->caps);
 	list_splice(&format, &pmu->format);
 	list_splice(&aliases, &pmu->aliases);
 	list_add_tail(&pmu->list, &pmus);
@@ -1552,3 +1553,89 @@ int perf_pmu__scan_file(struct perf_pmu *pmu, const char *name, const char *fmt,
 	va_end(args);
 	return ret;
 }
+
+static int perf_pmu__new_caps(struct list_head *list, char *name, char *value)
+{
+	struct perf_pmu_caps *caps;
+
+	caps = zalloc(sizeof(*caps));
+	if (!caps)
+		return -ENOMEM;
+
+	caps->name = strdup(name);
+	caps->value = strndup(value, strlen(value) - 1);
+	list_add_tail(&caps->list, list);
+	return 0;
+}
+
+/*
+ * Reading/parsing the given pmu capabilities, which should be located at:
+ * /sys/bus/event_source/devices/<dev>/caps as sysfs group attributes.
+ * Return the number of capabilities
+ */
+int perf_pmu__caps_parse(struct perf_pmu *pmu)
+{
+	struct stat st;
+	char caps_path[PATH_MAX];
+	const char *sysfs = sysfs__mountpoint();
+	DIR *caps_dir;
+	struct dirent *evt_ent;
+	int nr_caps = 0;
+
+	if (!sysfs)
+		return -1;
+
+	snprintf(caps_path, PATH_MAX,
+		 "%s" EVENT_SOURCE_DEVICE_PATH "%s/caps", sysfs, pmu->name);
+
+	if (stat(caps_path, &st) < 0)
+		return 0;	/* no error if caps does not exist */
+
+	caps_dir = opendir(caps_path);
+	if (!caps_dir)
+		return -EINVAL;
+
+	while ((evt_ent = readdir(caps_dir)) != NULL) {
+		char *name = evt_ent->d_name;
+		char path[PATH_MAX];
+		char value[128];
+		FILE *file;
+
+		if (!strcmp(name, ".") || !strcmp(name, ".."))
+			continue;
+
+		snprintf(path, PATH_MAX, "%s/%s", caps_path, name);
+
+		file = fopen(path, "r");
+		if (!file)
+			break;
+
+		if (!fgets(value, sizeof(value), file) ||
+		    (perf_pmu__new_caps(&pmu->caps, name, value) < 0)) {
+			fclose(file);
+			break;
+		}
+
+		nr_caps++;
+		fclose(file);
+	}
+
+	closedir(caps_dir);
+
+	return nr_caps;
+}
+
+struct perf_pmu_caps *perf_pmu__scan_caps(struct perf_pmu *pmu,
+					  struct perf_pmu_caps *caps)
+{
+	if (!pmu)
+		return NULL;
+
+	if (!caps)
+		caps = list_prepare_entry(caps, &pmu->caps, list);
+
+	list_for_each_entry_continue(caps, &pmu->caps, list)
+		return caps;
+
+	return NULL;
+}
diff --git a/tools/perf/util/pmu.h b/tools/perf/util/pmu.h
index f36ade6df76d..5ded4e3e28e4 100644
--- a/tools/perf/util/pmu.h
+++ b/tools/perf/util/pmu.h
@@ -21,6 +21,12 @@ enum {
 
 struct perf_event_attr;
 
+struct perf_pmu_caps {
+	char *name;
+	char *value;
+	struct list_head list;
+};
+
 struct perf_pmu {
 	char *name;
 	__u32 type;
@@ -31,6 +37,7 @@ struct perf_pmu {
 	struct perf_cpu_map *cpus;
 	struct list_head format;  /* HEAD struct perf_pmu_format -> list */
 	struct list_head aliases; /* HEAD struct perf_pmu_alias -> list */
+	struct list_head caps;    /* HEAD struct perf_pmu_caps -> list */
 	struct list_head list;    /* ELEM */
 };
 
@@ -98,4 +105,9 @@ struct pmu_events_map *perf_pmu__find_map(struct perf_pmu *pmu);
 
 int perf_pmu__convert_scale(const char *scale, char **end, double *sval);
 
+int perf_pmu__caps_parse(struct perf_pmu *pmu);
+
+struct perf_pmu_caps *perf_pmu__scan_caps(struct perf_pmu *pmu,
+					  struct perf_pmu_caps *caps);
+
 #endif /* __PMU_H */
-- 
2.17.1



* [PATCH 04/10] perf header: Support CPU PMU capabilities
  2019-10-07 17:59 [PATCH 00/10] Stitch LBR call stack kan.liang
                   ` (2 preceding siblings ...)
  2019-10-07 17:59 ` [PATCH 03/10] perf pmu: Add support for PMU capabilities kan.liang
@ 2019-10-07 17:59 ` kan.liang
  2019-10-07 17:59 ` [PATCH 05/10] perf machine: Refine the function for LBR call stack reconstruction kan.liang
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 18+ messages in thread
From: kan.liang @ 2019-10-07 17:59 UTC (permalink / raw)
  To: peterz, acme, mingo, linux-kernel
  Cc: jolsa, namhyung, ak, vitaly.slobodskoy, pavel.gerasimov, Kan Liang

From: Kan Liang <kan.liang@linux.intel.com>

To stitch the LBR call stack, the max LBR information is required, so the
CPU PMU capabilities information has to be stored in the perf header.

Add a new feature, HEADER_CPU_PMU_CAPS, for CPU PMU capabilities.
Retrieve all CPU PMU capabilities, not just the max LBR information.

Add a variable max_branches to facilitate future usage.

The CPU PMU capabilities information is only useful for LBR call stack
mode. Clear the feature for perf stat and other perf record modes.

Reviewed-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
 .../Documentation/perf.data-file-format.txt   |  16 +++
 tools/perf/builtin-record.c                   |   3 +
 tools/perf/builtin-stat.c                     |   1 +
 tools/perf/util/env.h                         |   3 +
 tools/perf/util/header.c                      | 110 ++++++++++++++++++
 tools/perf/util/header.h                      |   1 +
 6 files changed, 134 insertions(+)

diff --git a/tools/perf/Documentation/perf.data-file-format.txt b/tools/perf/Documentation/perf.data-file-format.txt
index b0152e1095c5..b6472e463284 100644
--- a/tools/perf/Documentation/perf.data-file-format.txt
+++ b/tools/perf/Documentation/perf.data-file-format.txt
@@ -373,6 +373,22 @@ struct {
 Indicates that trace contains records of PERF_RECORD_COMPRESSED type
 that have perf_events records in compressed form.
 
+	HEADER_CPU_PMU_CAPS = 28,
+
+	A list of cpu PMU capabilities. The format of data is as below.
+
+struct {
+	u32 nr_cpu_pmu_caps;
+	{
+		char	name[];
+		char	value[];
+	} [nr_cpu_pmu_caps]
+};
+
+
+Example:
+ cpu pmu capabilities: branches=32, max_precise=3, pmu_name=icelake
+
 	other bits are reserved and should ignored for now
 	HEADER_FEAT_BITS	= 256,
 
diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 23332861de6e..fbbeb1e625ef 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -1057,6 +1057,9 @@ static void record__init_features(struct record *rec)
 	if (!record__comp_enabled(rec))
 		perf_header__clear_feat(&session->header, HEADER_COMPRESSED);
 
+	if (!callchain_param.enabled || (callchain_param.record_mode != CALLCHAIN_LBR))
+		perf_header__clear_feat(&session->header, HEADER_CPU_PMU_CAPS);
+
 	perf_header__clear_feat(&session->header, HEADER_STAT);
 }
 
diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
index 468fc49420ce..26bb9794e95a 100644
--- a/tools/perf/builtin-stat.c
+++ b/tools/perf/builtin-stat.c
@@ -1418,6 +1418,7 @@ static void init_features(struct perf_session *session)
 	perf_header__clear_feat(&session->header, HEADER_TRACING_DATA);
 	perf_header__clear_feat(&session->header, HEADER_BRANCH_STACK);
 	perf_header__clear_feat(&session->header, HEADER_AUXTRACE);
+	perf_header__clear_feat(&session->header, HEADER_CPU_PMU_CAPS);
 }
 
 static int __cmd_record(int argc, const char **argv)
diff --git a/tools/perf/util/env.h b/tools/perf/util/env.h
index a3059dc1abe5..dae64e20b280 100644
--- a/tools/perf/util/env.h
+++ b/tools/perf/util/env.h
@@ -48,6 +48,7 @@ struct perf_env {
 	char			*cpuid;
 	unsigned long long	total_mem;
 	unsigned int		msr_pmu_type;
+	unsigned int		max_branches;
 
 	int			nr_cmdline;
 	int			nr_sibling_cores;
@@ -57,12 +58,14 @@ struct perf_env {
 	int			nr_memory_nodes;
 	int			nr_pmu_mappings;
 	int			nr_groups;
+	int			nr_cpu_pmu_caps;
 	char			*cmdline;
 	const char		**cmdline_argv;
 	char			*sibling_cores;
 	char			*sibling_dies;
 	char			*sibling_threads;
 	char			*pmu_mappings;
+	char			*cpu_pmu_caps;
 	struct cpu_topology_map	*cpu;
 	struct cpu_cache_level	*caches;
 	int			 caches_cnt;
diff --git a/tools/perf/util/header.c b/tools/perf/util/header.c
index 86d9396cb131..b8d0487f0cea 100644
--- a/tools/perf/util/header.c
+++ b/tools/perf/util/header.c
@@ -1402,6 +1402,39 @@ static int write_compressed(struct feat_fd *ff __maybe_unused,
 	return do_write(ff, &(ff->ph->env.comp_mmap_len), sizeof(ff->ph->env.comp_mmap_len));
 }
 
+static int write_cpu_pmu_caps(struct feat_fd *ff,
+			      struct evlist *evlist __maybe_unused)
+{
+	struct perf_pmu_caps *caps = NULL;
+	struct perf_pmu *cpu_pmu;
+	int nr_caps;
+	int ret;
+
+	cpu_pmu = perf_pmu__find("cpu");
+	if (!cpu_pmu)
+		return -ENOENT;
+
+	nr_caps = perf_pmu__caps_parse(cpu_pmu);
+	if (nr_caps < 0)
+		return nr_caps;
+
+	ret = do_write(ff, &nr_caps, sizeof(nr_caps));
+	if (ret < 0)
+		return ret;
+
+	while ((caps = perf_pmu__scan_caps(cpu_pmu, caps))) {
+		ret = do_write_string(ff, caps->name);
+		if (ret < 0)
+			return ret;
+
+		ret = do_write_string(ff, caps->value);
+		if (ret < 0)
+			return ret;
+	}
+
+	return ret;
+}
+
 static void print_hostname(struct feat_fd *ff, FILE *fp)
 {
 	fprintf(fp, "# hostname : %s\n", ff->ph->env.hostname);
@@ -1779,6 +1812,28 @@ static void print_compressed(struct feat_fd *ff, FILE *fp)
 		ff->ph->env.comp_level, ff->ph->env.comp_ratio);
 }
 
+static void print_cpu_pmu_caps(struct feat_fd *ff, FILE *fp)
+{
+	const char *delimiter = "# cpu pmu capabilities: ";
+	char *str;
+	u32 nr_caps;
+
+	nr_caps = ff->ph->env.nr_cpu_pmu_caps;
+	if (!nr_caps) {
+		fprintf(fp, "# cpu pmu capabilities: not available\n");
+		return;
+	}
+
+	str = ff->ph->env.cpu_pmu_caps;
+	while (nr_caps--) {
+		fprintf(fp, "%s%s", delimiter, str);
+		delimiter = ", ";
+		str += strlen(str) + 1;
+	}
+
+	fprintf(fp, "\n");
+}
+
 static void print_pmu_mappings(struct feat_fd *ff, FILE *fp)
 {
 	const char *delimiter = "# pmu mappings: ";
@@ -2816,6 +2871,60 @@ static int process_compressed(struct feat_fd *ff,
 	return 0;
 }
 
+static int process_cpu_pmu_caps(struct feat_fd *ff,
+				void *data __maybe_unused)
+{
+	char *name, *value;
+	struct strbuf sb;
+	u32 nr_caps;
+
+	if (do_read_u32(ff, &nr_caps))
+		return -1;
+
+	if (!nr_caps) {
+		pr_debug("cpu pmu capabilities not available\n");
+		return 0;
+	}
+
+	ff->ph->env.nr_cpu_pmu_caps = nr_caps;
+
+	if (strbuf_init(&sb, 128) < 0)
+		return -1;
+
+	while (nr_caps--) {
+		name = do_read_string(ff);
+		if (!name)
+			goto error;
+
+		value = do_read_string(ff);
+		if (!value)
+			goto free_name;
+
+		if (strbuf_addf(&sb, "%s=%s", name, value) < 0)
+			goto free_value;
+
+		/* include a NULL character at the end */
+		if (strbuf_add(&sb, "", 1) < 0)
+			goto free_value;
+
+		if (!strcmp(name, "branches"))
+			ff->ph->env.max_branches = atoi(value);
+
+		free(value);
+		free(name);
+	}
+	ff->ph->env.cpu_pmu_caps = strbuf_detach(&sb, NULL);
+	return 0;
+
+free_value:
+	free(value);
+free_name:
+	free(name);
+error:
+	strbuf_release(&sb);
+	return -1;
+}
+
 #define FEAT_OPR(n, func, __full_only) \
 	[HEADER_##n] = {					\
 		.name	    = __stringify(n),			\
@@ -2873,6 +2982,7 @@ const struct perf_header_feature_ops feat_ops[HEADER_LAST_FEATURE] = {
 	FEAT_OPR(BPF_PROG_INFO, bpf_prog_info,  false),
 	FEAT_OPR(BPF_BTF,       bpf_btf,        false),
 	FEAT_OPR(COMPRESSED,	compressed,	false),
+	FEAT_OPR(CPU_PMU_CAPS,	cpu_pmu_caps,	false),
 };
 
 struct header_print_data {
diff --git a/tools/perf/util/header.h b/tools/perf/util/header.h
index ca53a929e9fd..ae8a7108f52b 100644
--- a/tools/perf/util/header.h
+++ b/tools/perf/util/header.h
@@ -43,6 +43,7 @@ enum {
 	HEADER_BPF_PROG_INFO,
 	HEADER_BPF_BTF,
 	HEADER_COMPRESSED,
+	HEADER_CPU_PMU_CAPS,
 	HEADER_LAST_FEATURE,
 	HEADER_FEAT_BITS	= 256,
 };
-- 
2.17.1



* [PATCH 05/10] perf machine: Refine the function for LBR call stack reconstruction
  2019-10-07 17:59 [PATCH 00/10] Stitch LBR call stack kan.liang
                   ` (3 preceding siblings ...)
  2019-10-07 17:59 ` [PATCH 04/10] perf header: Support CPU " kan.liang
@ 2019-10-07 17:59 ` kan.liang
  2019-10-07 17:59 ` [PATCH 06/10] perf tools: Stitch LBR call stack kan.liang
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 18+ messages in thread
From: kan.liang @ 2019-10-07 17:59 UTC (permalink / raw)
  To: peterz, acme, mingo, linux-kernel
  Cc: jolsa, namhyung, ak, vitaly.slobodskoy, pavel.gerasimov, Kan Liang

From: Kan Liang <kan.liang@linux.intel.com>

LBR only collects the user call stack. To reconstruct a call stack, both
the kernel call stack and the user call stack are required. The function
resolve_lbr_callchain_sample() mixes the kernel call stack and the user
call stack. Now, with the help of TOS, the perf tool can reconstruct a
more complete call stack by adding some user call stack from a previous
sample. However, the current implementation is hard to extend to support
that.

Abstract two new functions to resolve the user call stack and the kernel
call stack respectively.

No functional changes.

Reviewed-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
 tools/perf/util/machine.c | 186 ++++++++++++++++++++++++--------------
 1 file changed, 119 insertions(+), 67 deletions(-)

diff --git a/tools/perf/util/machine.c b/tools/perf/util/machine.c
index 70a9f8716a4b..e3e516e30093 100644
--- a/tools/perf/util/machine.c
+++ b/tools/perf/util/machine.c
@@ -2183,6 +2183,96 @@ static int remove_loops(struct branch_entry *l, int nr,
 	return nr;
 }
 
+
+static int lbr_callchain_add_kernel_ip(struct thread *thread,
+				       struct callchain_cursor *cursor,
+				       struct perf_sample *sample,
+				       struct symbol **parent,
+				       struct addr_location *root_al,
+				       bool callee, int end)
+{
+	struct ip_callchain *chain = sample->callchain;
+	u8 cpumode = PERF_RECORD_MISC_USER;
+	int err, i;
+
+	if (callee) {
+		for (i = 0; i < end + 1; i++) {
+			err = add_callchain_ip(thread, cursor, parent,
+					       root_al, &cpumode, chain->ips[i],
+					       false, NULL, NULL, 0);
+			if (err)
+				return err;
+		}
+	} else {
+		for (i = end; i >= 0; i--) {
+			err = add_callchain_ip(thread, cursor, parent,
+					       root_al, &cpumode, chain->ips[i],
+					       false, NULL, NULL, 0);
+			if (err)
+				return err;
+		}
+	}
+
+	return 0;
+}
+
+static int lbr_callchain_add_lbr_ip(struct thread *thread,
+				    struct callchain_cursor *cursor,
+				    struct perf_sample *sample,
+				    struct symbol **parent,
+				    struct addr_location *root_al,
+				    bool callee)
+{
+	struct branch_stack *lbr_stack = sample->branch_stack;
+	u8 cpumode = PERF_RECORD_MISC_USER;
+	int lbr_nr = lbr_stack->nr;
+	struct branch_flags *flags;
+	u64 ip, branch_from = 0;
+	int err, i;
+
+	if (callee) {
+		ip = lbr_stack->entries[0].to;
+		flags = &lbr_stack->entries[0].flags;
+		branch_from = lbr_stack->entries[0].from;
+		err = add_callchain_ip(thread, cursor, parent,
+				       root_al, &cpumode, ip,
+				       true, flags, NULL, branch_from);
+		if (err)
+			return err;
+
+		for (i = 0; i < lbr_nr; i++) {
+			ip = lbr_stack->entries[i].from;
+			flags = &lbr_stack->entries[i].flags;
+			err = add_callchain_ip(thread, cursor, parent,
+					       root_al, &cpumode, ip,
+					       true, flags, NULL, branch_from);
+			if (err)
+				return err;
+		}
+	} else {
+		for (i = lbr_nr - 1; i >= 0; i--) {
+			ip = lbr_stack->entries[i].from;
+			flags = &lbr_stack->entries[i].flags;
+			err = add_callchain_ip(thread, cursor, parent,
+					       root_al, &cpumode, ip,
+					       true, flags, NULL, branch_from);
+			if (err)
+				return err;
+		}
+
+		ip = lbr_stack->entries[0].to;
+		flags = &lbr_stack->entries[0].flags;
+		branch_from = lbr_stack->entries[0].from;
+		err = add_callchain_ip(thread, cursor, parent,
+				       root_al, &cpumode, ip,
+				       true, flags, NULL, branch_from);
+		if (err)
+			return err;
+	}
+
+	return 0;
+}
+
 /*
  * Recolve LBR callstack chain sample
  * Return:
@@ -2198,82 +2288,44 @@ static int resolve_lbr_callchain_sample(struct thread *thread,
 					int max_stack)
 {
 	struct ip_callchain *chain = sample->callchain;
-	int chain_nr = min(max_stack, (int)chain->nr), i;
-	u8 cpumode = PERF_RECORD_MISC_USER;
-	u64 ip, branch_from = 0;
+	int chain_nr = min(max_stack, (int)chain->nr);
+	int i, err;
 
 	for (i = 0; i < chain_nr; i++) {
 		if (chain->ips[i] == PERF_CONTEXT_USER)
 			break;
 	}
 
-	/* LBR only affects the user callchain */
-	if (i != chain_nr) {
-		struct branch_stack *lbr_stack = sample->branch_stack;
-		int lbr_nr = lbr_stack->nr, j, k;
-		bool branch;
-		struct branch_flags *flags;
-		/*
-		 * LBR callstack can only get user call chain.
-		 * The mix_chain_nr is kernel call chain
-		 * number plus LBR user call chain number.
-		 * i is kernel call chain number,
-		 * 1 is PERF_CONTEXT_USER,
-		 * lbr_nr + 1 is the user call chain number.
-		 * For details, please refer to the comments
-		 * in callchain__printf
-		 */
-		int mix_chain_nr = i + 1 + lbr_nr + 1;
-
-		for (j = 0; j < mix_chain_nr; j++) {
-			int err;
-			branch = false;
-			flags = NULL;
-
-			if (callchain_param.order == ORDER_CALLEE) {
-				if (j < i + 1)
-					ip = chain->ips[j];
-				else if (j > i + 1) {
-					k = j - i - 2;
-					ip = lbr_stack->entries[k].from;
-					branch = true;
-					flags = &lbr_stack->entries[k].flags;
-				} else {
-					ip = lbr_stack->entries[0].to;
-					branch = true;
-					flags = &lbr_stack->entries[0].flags;
-					branch_from =
-						lbr_stack->entries[0].from;
-				}
-			} else {
-				if (j < lbr_nr) {
-					k = lbr_nr - j - 1;
-					ip = lbr_stack->entries[k].from;
-					branch = true;
-					flags = &lbr_stack->entries[k].flags;
-				}
-				else if (j > lbr_nr)
-					ip = chain->ips[i + 1 - (j - lbr_nr)];
-				else {
-					ip = lbr_stack->entries[0].to;
-					branch = true;
-					flags = &lbr_stack->entries[0].flags;
-					branch_from =
-						lbr_stack->entries[0].from;
-				}
-			}
+	/*
+	 * LBR only affects the user callchain.
+	 * Fall back if there is no user callchain.
+	 */
+	if (i == chain_nr)
+		return 0;
 
-			err = add_callchain_ip(thread, cursor, parent,
-					       root_al, &cpumode, ip,
-					       branch, flags, NULL,
-					       branch_from);
-			if (err)
-				return (err < 0) ? err : 0;
-		}
-		return 1;
+	if (callchain_param.order == ORDER_CALLEE) {
+		err = lbr_callchain_add_kernel_ip(thread, cursor, sample,
+						  parent, root_al, true, i);
+		if (err)
+			goto error;
+		err = lbr_callchain_add_lbr_ip(thread, cursor, sample,
+					       parent, root_al, true);
+		if (err)
+			goto error;
+	} else {
+		err = lbr_callchain_add_lbr_ip(thread, cursor, sample,
+					       parent, root_al, false);
+		if (err)
+			goto error;
+		err = lbr_callchain_add_kernel_ip(thread, cursor, sample,
+						  parent, root_al, false, i);
+		if (err)
+			goto error;
 	}
 
-	return 0;
+	return 1;
+error:
+	return (err < 0) ? err : 0;
 }
 
 static int find_prev_cpumode(struct ip_callchain *chain, struct thread *thread,
-- 
2.17.1



* [PATCH 06/10] perf tools: Stitch LBR call stack
  2019-10-07 17:59 [PATCH 00/10] Stitch LBR call stack kan.liang
                   ` (4 preceding siblings ...)
  2019-10-07 17:59 ` [PATCH 05/10] perf machine: Refine the function for LBR call stack reconstruction kan.liang
@ 2019-10-07 17:59 ` kan.liang
  2019-10-07 17:59 ` [PATCH 07/10] perf report: Add option to enable the LBR stitching approach kan.liang
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 18+ messages in thread
From: kan.liang @ 2019-10-07 17:59 UTC (permalink / raw)
  To: peterz, acme, mingo, linux-kernel
  Cc: jolsa, namhyung, ak, vitaly.slobodskoy, pavel.gerasimov, Kan Liang

From: Kan Liang <kan.liang@linux.intel.com>

In LBR call stack mode, the depth of the reconstructed LBR call stack is
limited to the number of LBR registers.

  For example, on Skylake, the depth of the reconstructed LBR call stack
  is always <= 32.

  # To display the perf.data header info, please use
  # --header/--header-only options.
  #
  #
  # Total Lost Samples: 0
  #
  # Samples: 6K of event 'cycles'
  # Event count (approx.): 6487119731
  #
  # Children      Self  Command          Shared Object       Symbol
  # ........  ........  ...............  ..................
  # ................................

    99.97%    99.97%  tchain_edit      tchain_edit        [.] f43
            |
             --99.64%--f11
                       f12
                       f13
                       f14
                       f15
                       f16
                       f17
                       f18
                       f19
                       f20
                       f21
                       f22
                       f23
                       f24
                       f25
                       f26
                       f27
                       f28
                       f29
                       f30
                       f31
                       f32
                       f33
                       f34
                       f35
                       f36
                       f37
                       f38
                       f39
                       f40
                       f41
                       f42
                       f43

For a call stack which is deeper than the LBR limit, HW will overwrite
the oldest LBR entries. Only a partial call stack can be reconstructed.

However, the overwritten LBRs may still be retrieved from a previous
sample, taken at a moment when the HW had not yet overwritten those LBR
registers. Perf tools can stitch those overwritten LBRs onto the current
call stack to get a more complete call stack.

To determine whether LBRs can be stitched, perf tools need to compare
the current sample with the previous sample.
- They should have identical LBR records (same from, to and flags
  values, and the same physical index of LBR registers).
- The search starts from the base-of-stack of the current sample.

Add prev_sample in struct thread to save the previous sample.
Add lbr_stitch_lists to save the LBRs that can be used for stitching.

lbr_stitch_enable indicates whether the LBR stitching approach is
enabled; it is disabled by default. The following patch will introduce
a new option to enable the LBR stitching approach.
This is because:
- The stitching approach is based on LBR call stack technology. The known
  limitations of LBR call stack technology still apply to the approach,
  e.g. exception handling such as setjmp/longjmp will have calls/returns
  that do not match.
- The approach is not foolproof. There can be cases where it creates
  incorrect call stacks from incorrect matches. There is no attempt
  to validate matches in another way, so it is not enabled by default.
  However, in many common cases with call stack overflows it can recreate
  better call stacks than the default LBR call stack output. So if there
  are problems with LBR overflows, this is a possible workaround.

Reviewed-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
 tools/perf/util/branch.h  |  10 ++-
 tools/perf/util/machine.c | 125 +++++++++++++++++++++++++++++++++++++-
 tools/perf/util/thread.c  |   3 +
 tools/perf/util/thread.h  |  18 ++++++
 4 files changed, 152 insertions(+), 4 deletions(-)

diff --git a/tools/perf/util/branch.h b/tools/perf/util/branch.h
index 88e00d268f6f..a9c399703281 100644
--- a/tools/perf/util/branch.h
+++ b/tools/perf/util/branch.h
@@ -34,7 +34,15 @@ struct branch_info {
 struct branch_entry {
 	u64			from;
 	u64			to;
-	struct branch_flags	flags;
+	union {
+		struct branch_flags	flags;
+		u64			flags_value;
+	};
+};
+
+struct stitch_list {
+	struct list_head	node;
+	struct branch_entry	br_entry;
 };
 
 struct branch_stack {
diff --git a/tools/perf/util/machine.c b/tools/perf/util/machine.c
index e3e516e30093..02bd1740d547 100644
--- a/tools/perf/util/machine.c
+++ b/tools/perf/util/machine.c
@@ -2273,6 +2273,98 @@ static int lbr_callchain_add_lbr_ip(struct thread *thread,
 	return 0;
 }
 
+static int lbr_callchain_add_stitched_lbr_ip(struct thread *thread,
+					     struct callchain_cursor *cursor,
+					     struct symbol **parent,
+					     struct addr_location *root_al)
+{
+	u8 cpumode = PERF_RECORD_MISC_USER;
+	struct stitch_list *stitch_node;
+	struct branch_flags *flags;
+	int err;
+	u64 ip;
+
+	list_for_each_entry(stitch_node, &thread->lbr_stitch_lists, node) {
+		ip = stitch_node->br_entry.from;
+		flags = &stitch_node->br_entry.flags;
+
+		err = add_callchain_ip(thread, cursor, parent,
+				       root_al, &cpumode, ip,
+				       true, flags, NULL, 0);
+		if (err)
+			return err;
+
+	}
+	return 0;
+}
+
+
+static bool has_stitched_lbr(struct thread *thread,
+			     struct perf_sample *cur,
+			     struct perf_sample *prev,
+			     unsigned int max_lbr,
+			     bool callee)
+{
+	struct branch_stack *cur_stack = cur->branch_stack;
+	struct branch_stack *prev_stack = prev->branch_stack;
+	int i, j, nr_identical_branches = 0;
+	struct stitch_list *stitch_node;
+	u64 cur_base, distance;
+
+	if (!cur_stack || !prev_stack)
+		return false;
+
+	/* Find the physical index of the base-of-stack for current sample. */
+	cur_base = max_lbr - cur_stack->nr + cur->tos + 1;
+
+	distance = (prev->tos > cur_base) ? (prev->tos - cur_base) :
+					    (max_lbr + prev->tos - cur_base);
+	/* Previous sample has shorter stack. Nothing can be stitched. */
+	if (distance + 1 > prev_stack->nr)
+		return false;
+
+	/*
+	 * Check if there are identical LBRs between two samples.
+	 * Identical LBRs must have same from, to and flags values. Also,
+	 * they have to be saved in the same LBR registers (same physical
+	 * index).
+	 *
+	 * Starts from the base-of-stack of current sample.
+	 */
+	for (i = distance, j = cur_stack->nr - 1; (i >= 0) && (j >= 0); i--, j--) {
+		if ((prev_stack->entries[i].from != cur_stack->entries[j].from) ||
+		    (prev_stack->entries[i].to != cur_stack->entries[j].to) ||
+		    (prev_stack->entries[i].flags_value != cur_stack->entries[j].flags_value))
+			break;
+
+		nr_identical_branches++;
+	}
+
+	if (!nr_identical_branches)
+		return false;
+
+	/*
+	 * Save the LBRs between the base-of-stack of previous sample
+	 * and the base-of-stack of current sample into lbr_stitch_lists.
+	 * These LBRs will be stitched later.
+	 */
+	for (i = prev_stack->nr - 1; i > (int)distance; i--) {
+		stitch_node = malloc(sizeof(*stitch_node));
+		if (!stitch_node)
+			return false;
+
+		memcpy(&stitch_node->br_entry, &prev_stack->entries[i],
+		       sizeof(struct branch_entry));
+
+		if (callee)
+			list_add(&stitch_node->node, &thread->lbr_stitch_lists);
+		else
+			list_add_tail(&stitch_node->node, &thread->lbr_stitch_lists);
+	}
+
+	return true;
+}
+
 /*
  * Recolve LBR callstack chain sample
  * Return:
@@ -2285,10 +2377,13 @@ static int resolve_lbr_callchain_sample(struct thread *thread,
 					struct perf_sample *sample,
 					struct symbol **parent,
 					struct addr_location *root_al,
-					int max_stack)
+					int max_stack,
+					unsigned int max_lbr)
 {
 	struct ip_callchain *chain = sample->callchain;
 	int chain_nr = min(max_stack, (int)chain->nr);
+	bool callee = (callchain_param.order == ORDER_CALLEE);
+	bool stitched_lbr = false;
 	int i, err;
 
 	for (i = 0; i < chain_nr; i++) {
@@ -2303,7 +2398,16 @@ static int resolve_lbr_callchain_sample(struct thread *thread,
 	if (i == chain_nr)
 		return 0;
 
-	if (callchain_param.order == ORDER_CALLEE) {
+	if (thread->lbr_stitch_enable && sample->tos != (-1ULL) && (max_lbr > 0)) {
+		stitched_lbr = has_stitched_lbr(thread, sample,
+						&thread->prev_sample,
+						max_lbr, callee);
+		if (!stitched_lbr)
+			thread__free_stitch_list(thread);
+		memcpy(&thread->prev_sample, sample, sizeof(*sample));
+	}
+
+	if (callee) {
 		err = lbr_callchain_add_kernel_ip(thread, cursor, sample,
 						  parent, root_al, true, i);
 		if (err)
@@ -2312,7 +2416,19 @@ static int resolve_lbr_callchain_sample(struct thread *thread,
 					       parent, root_al, true);
 		if (err)
 			goto error;
+		if (stitched_lbr) {
+			err = lbr_callchain_add_stitched_lbr_ip(thread, cursor,
+								parent, root_al);
+			if (err)
+				goto error;
+		}
 	} else {
+		if (stitched_lbr) {
+			err = lbr_callchain_add_stitched_lbr_ip(thread, cursor,
+								parent, root_al);
+			if (err)
+				goto error;
+		}
 		err = lbr_callchain_add_lbr_ip(thread, cursor, sample,
 					       parent, root_al, false);
 		if (err)
@@ -2369,8 +2485,11 @@ static int thread__resolve_callchain_sample(struct thread *thread,
 		chain_nr = chain->nr;
 
 	if (perf_evsel__has_branch_callstack(evsel)) {
+		struct perf_env *env = perf_evsel__env(evsel);
+
 		err = resolve_lbr_callchain_sample(thread, cursor, sample, parent,
-						   root_al, max_stack);
+						   root_al, max_stack,
+						   !env ? 0 : env->max_branches);
 		if (err)
 			return (err < 0) ? err : 0;
 	}
diff --git a/tools/perf/util/thread.c b/tools/perf/util/thread.c
index b64e9e049636..eca53b1c7de3 100644
--- a/tools/perf/util/thread.c
+++ b/tools/perf/util/thread.c
@@ -47,8 +47,10 @@ struct thread *thread__new(pid_t pid, pid_t tid)
 		thread->tid = tid;
 		thread->ppid = -1;
 		thread->cpu = -1;
+		thread->lbr_stitch_enable = false;
 		INIT_LIST_HEAD(&thread->namespaces_list);
 		INIT_LIST_HEAD(&thread->comm_list);
+		INIT_LIST_HEAD(&thread->lbr_stitch_lists);
 		init_rwsem(&thread->namespaces_lock);
 		init_rwsem(&thread->comm_lock);
 
@@ -110,6 +112,7 @@ void thread__delete(struct thread *thread)
 
 	exit_rwsem(&thread->namespaces_lock);
 	exit_rwsem(&thread->comm_lock);
+	thread__free_stitch_list(thread);
 	free(thread);
 }
 
diff --git a/tools/perf/util/thread.h b/tools/perf/util/thread.h
index 51bdb9a7af7f..112eccc6979b 100644
--- a/tools/perf/util/thread.h
+++ b/tools/perf/util/thread.h
@@ -13,6 +13,9 @@
 #include <strlist.h>
 #include <intlist.h>
 #include "rwsem.h"
+#include "event.h"
+#include "map_symbol.h"
+#include "branch.h"
 
 struct addr_location;
 struct map;
@@ -46,6 +49,11 @@ struct thread {
 	struct srccode_state	srccode_state;
 	bool			filter;
 	int			filter_entry_depth;
+
+	/* stitch LBR call stack */
+	bool			lbr_stitch_enable;
+	struct list_head	lbr_stitch_lists;
+	struct perf_sample	prev_sample;
 };
 
 struct machine;
@@ -142,4 +150,14 @@ static inline bool thread__is_filtered(struct thread *thread)
 	return false;
 }
 
+static inline void thread__free_stitch_list(struct thread *thread)
+{
+	struct stitch_list *pos, *tmp;
+
+	list_for_each_entry_safe(pos, tmp, &thread->lbr_stitch_lists, node) {
+		list_del_init(&pos->node);
+		free(pos);
+	}
+}
+
 #endif	/* __PERF_THREAD_H */
-- 
2.17.1



* [PATCH 07/10] perf report: Add option to enable the LBR stitching approach
  2019-10-07 17:59 [PATCH 00/10] Stitch LBR call stack kan.liang
                   ` (5 preceding siblings ...)
  2019-10-07 17:59 ` [PATCH 06/10] perf tools: Stitch LBR call stack kan.liang
@ 2019-10-07 17:59 ` kan.liang
  2019-10-07 17:59 ` [PATCH 08/10] perf script: " kan.liang
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 18+ messages in thread
From: kan.liang @ 2019-10-07 17:59 UTC (permalink / raw)
  To: peterz, acme, mingo, linux-kernel
  Cc: jolsa, namhyung, ak, vitaly.slobodskoy, pavel.gerasimov, Kan Liang

From: Kan Liang <kan.liang@linux.intel.com>

With the LBR stitching approach, the reconstructed LBR call stack
can break the HW limitation. However, it may reconstruct invalid call
stacks in some cases, e.g. exception handling such as setjmp/longjmp.
Also, it may impact the processing time, especially when the number of
samples with stitched LBRs is huge.

Add an option to enable the approach.

  # To display the perf.data header info, please use
  # --header/--header-only options.
  #
  #
  # Total Lost Samples: 0
  #
  # Samples: 6K of event 'cycles'
  # Event count (approx.): 6492797701
  #
  # Children      Self  Command          Shared Object       Symbol
  # ........  ........  ...............  ..................
  # .................................
  #
    99.99%    99.99%  tchain_edit      tchain_edit        [.] f43
            |
            ---main
               f1
               f2
               f3
               f4
               f5
               f6
               f7
               f8
               f9
               f10
               f11
               f12
               f13
               f14
               f15
               f16
               f17
               f18
               f19
               f20
               f21
               f22
               f23
               f24
               f25
               f26
               f27
               f28
               f29
               f30
               f31
               |
                --99.65%--f32
                          f33
                          f34
                          f35
                          f36
                          f37
                          f38
                          f39
                          f40
                          f41
                          f42
                          f43

Reviewed-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
 tools/perf/Documentation/perf-report.txt | 11 +++++++++++
 tools/perf/builtin-report.c              |  6 ++++++
 2 files changed, 17 insertions(+)

diff --git a/tools/perf/Documentation/perf-report.txt b/tools/perf/Documentation/perf-report.txt
index 7315f155803f..c0651b68ed5d 100644
--- a/tools/perf/Documentation/perf-report.txt
+++ b/tools/perf/Documentation/perf-report.txt
@@ -476,6 +476,17 @@ include::itrace.txt[]
 	This option extends the perf report to show reference callgraphs,
 	which collected by reference event, in no callgraph event.
 
+--stitch-lbr::
+	Show callgraph with stitched LBRs, which may have more complete
+	callgraph. The perf.data file must have been obtained using
+	perf record --call-graph lbr.
+	Disabled by default. In common cases with call stack overflows,
+	it can recreate better call stacks than the default lbr call stack
+	output. But this approach is not foolproof. There can be cases
+	where it creates incorrect call stacks from incorrect matches.
+	The known limitations include exception handling such as
+	setjmp/longjmp, which will have calls/returns that do not match.
+
 --socket-filter::
 	Only report the samples on the processor socket that match with this filter
 
diff --git a/tools/perf/builtin-report.c b/tools/perf/builtin-report.c
index aae0e57c60fb..0d4275a46645 100644
--- a/tools/perf/builtin-report.c
+++ b/tools/perf/builtin-report.c
@@ -83,6 +83,7 @@ struct report {
 	bool			header_only;
 	bool			nonany_branch_mode;
 	bool			group_set;
+	bool			stitch_lbr;
 	int			max_stack;
 	struct perf_read_values	show_threads_values;
 	struct annotation_options annotation_opts;
@@ -263,6 +264,9 @@ static int process_sample_event(struct perf_tool *tool,
 		return -1;
 	}
 
+	if (rep->stitch_lbr)
+		al.thread->lbr_stitch_enable = true;
+
 	if (symbol_conf.hide_unresolved && al.sym == NULL)
 		goto out_put;
 
@@ -1183,6 +1187,8 @@ int cmd_report(int argc, const char **argv)
 			"Show full source file name path for source lines"),
 	OPT_BOOLEAN(0, "show-ref-call-graph", &symbol_conf.show_ref_callgraph,
 		    "Show callgraph from reference event"),
+	OPT_BOOLEAN(0, "stitch-lbr", &report.stitch_lbr,
+		    "Enable LBR callgraph stitching approach"),
 	OPT_INTEGER(0, "socket-filter", &report.socket_filter,
 		    "only show processor socket that match with this filter"),
 	OPT_BOOLEAN(0, "raw-trace", &symbol_conf.raw_trace,
-- 
2.17.1



* [PATCH 08/10] perf script: Add option to enable the LBR stitching approach
  2019-10-07 17:59 [PATCH 00/10] Stitch LBR call stack kan.liang
                   ` (6 preceding siblings ...)
  2019-10-07 17:59 ` [PATCH 07/10] perf report: Add option to enable the LBR stitching approach kan.liang
@ 2019-10-07 17:59 ` kan.liang
  2019-10-07 17:59 ` [PATCH 09/10] perf top: " kan.liang
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 18+ messages in thread
From: kan.liang @ 2019-10-07 17:59 UTC (permalink / raw)
  To: peterz, acme, mingo, linux-kernel
  Cc: jolsa, namhyung, ak, vitaly.slobodskoy, pavel.gerasimov, Kan Liang

From: Kan Liang <kan.liang@linux.intel.com>

With the LBR stitching approach, the reconstructed LBR call stack
can break the HW limitation. However, it may reconstruct invalid call
stacks in some cases, e.g. exception handling such as setjmp/longjmp.
Also, it may impact the processing time, especially when the number of
samples with stitched LBRs is huge.

Add an option to enable the approach.
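
For example (./workload is a placeholder for any recorded program):

  perf record --call-graph lbr -- ./workload
  perf script --stitch-lbr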

Reviewed-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
 tools/perf/Documentation/perf-script.txt | 11 +++++++++++
 tools/perf/builtin-script.c              |  6 ++++++
 2 files changed, 17 insertions(+)

diff --git a/tools/perf/Documentation/perf-script.txt b/tools/perf/Documentation/perf-script.txt
index 2599b057e47b..472f20f1e479 100644
--- a/tools/perf/Documentation/perf-script.txt
+++ b/tools/perf/Documentation/perf-script.txt
@@ -426,6 +426,17 @@ include::itrace.txt[]
 --show-on-off-events::
 	Show the --switch-on/off events too.
 
+--stitch-lbr::
+	Show callgraph with stitched LBRs, which may have more complete
+	callgraph. The perf.data file must have been obtained using
+	perf record --call-graph lbr.
+	Disabled by default. In common cases with call stack overflows,
+	it can recreate better call stacks than the default lbr call stack
+	output. But this approach is not foolproof. There can be cases
+	where it creates incorrect call stacks from incorrect matches.
+	The known limitations include exception handling such as
+	setjmp/longjmp, which will have calls/returns that do not match.
+
 SEE ALSO
 --------
 linkperf:perf-record[1], linkperf:perf-script-perl[1],
diff --git a/tools/perf/builtin-script.c b/tools/perf/builtin-script.c
index 67be8d31afab..0fc4d07864d1 100644
--- a/tools/perf/builtin-script.c
+++ b/tools/perf/builtin-script.c
@@ -1641,6 +1641,7 @@ struct perf_script {
 	bool			show_bpf_events;
 	bool			allocated;
 	bool			per_event_dump;
+	bool			stitch_lbr;
 	struct evswitch		evswitch;
 	struct perf_cpu_map	*cpus;
 	struct perf_thread_map *threads;
@@ -1867,6 +1868,9 @@ static void process_event(struct perf_script *script,
 	if (PRINT_FIELD(IP)) {
 		struct callchain_cursor *cursor = NULL;
 
+		if (script->stitch_lbr)
+			al->thread->lbr_stitch_enable = true;
+
 		if (symbol_conf.use_callchain && sample->callchain &&
 		    thread__resolve_callchain(al->thread, &callchain_cursor, evsel,
 					      sample, NULL, NULL, scripting_max_stack) == 0)
@@ -3556,6 +3560,8 @@ int cmd_script(int argc, const char **argv)
 		   "file", "file saving guest os /proc/kallsyms"),
 	OPT_STRING(0, "guestmodules", &symbol_conf.default_guest_modules,
 		   "file", "file saving guest os /proc/modules"),
+	OPT_BOOLEAN('\0', "stitch-lbr", &script.stitch_lbr,
+		    "Enable LBR callgraph stitching approach"),
 	OPTS_EVSWITCH(&script.evswitch),
 	OPT_END()
 	};
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 09/10] perf top: Add option to enable the LBR stitching approach
  2019-10-07 17:59 [PATCH 00/10] Stitch LBR call stack kan.liang
                   ` (7 preceding siblings ...)
  2019-10-07 17:59 ` [PATCH 08/10] perf script: " kan.liang
@ 2019-10-07 17:59 ` kan.liang
  2019-10-07 17:59 ` [PATCH 10/10] perf c2c: " kan.liang
  2019-10-07 18:24 ` [PATCH 00/10] Stitch LBR call stack Ingo Molnar
  10 siblings, 0 replies; 18+ messages in thread
From: kan.liang @ 2019-10-07 17:59 UTC (permalink / raw)
  To: peterz, acme, mingo, linux-kernel
  Cc: jolsa, namhyung, ak, vitaly.slobodskoy, pavel.gerasimov, Kan Liang

From: Kan Liang <kan.liang@linux.intel.com>

With the LBR stitching approach, the reconstructed LBR call stack
can exceed the HW limitation. However, it may reconstruct invalid call
stacks in some cases, e.g. exception handling such as setjmp/longjmp.
Also, it may increase the processing time, especially when the number
of samples with stitched LBRs is huge.

Add an option to enable the approach.
The option must be used with --call-graph lbr.

Reviewed-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
 tools/perf/Documentation/perf-top.txt |  9 +++++++++
 tools/perf/builtin-top.c              | 11 +++++++++++
 tools/perf/util/top.h                 |  1 +
 3 files changed, 21 insertions(+)

diff --git a/tools/perf/Documentation/perf-top.txt b/tools/perf/Documentation/perf-top.txt
index 5596129a71cf..80b57f942a86 100644
--- a/tools/perf/Documentation/perf-top.txt
+++ b/tools/perf/Documentation/perf-top.txt
@@ -304,6 +304,15 @@ Default is to monitor all CPUS.
 	go straight to the histogram browser, just like 'perf top' with no events
 	explicitely specified does.
 
+--stitch-lbr::
+	Show a callgraph with stitched LBRs, which may be more complete than
+	the default callgraph. The option must be used with --call-graph lbr recording.
+	Disabled by default. In common cases with call stack overflows,
+	it can recreate better call stacks than the default lbr call stack
+	output. But this approach is not foolproof. There can be cases
+	where it creates incorrect call stacks from incorrect matches.
+	The known limitations include exception handling such as
+	setjmp/longjmp, which will have mismatched calls/returns.
 
 INTERACTIVE PROMPTING KEYS
 --------------------------
diff --git a/tools/perf/builtin-top.c b/tools/perf/builtin-top.c
index 611d03030abc..539670377e0f 100644
--- a/tools/perf/builtin-top.c
+++ b/tools/perf/builtin-top.c
@@ -33,6 +33,7 @@
 #include "util/map.h"
 #include "util/mmap.h"
 #include "util/session.h"
+#include "util/thread.h"
 #include "util/symbol.h"
 #include "util/synthetic-events.h"
 #include "util/top.h"
@@ -764,6 +765,9 @@ static void perf_event__process_sample(struct perf_tool *tool,
 	if (machine__resolve(machine, &al, sample) < 0)
 		return;
 
+	if (top->stitch_lbr)
+		al.thread->lbr_stitch_enable = true;
+
 	if (!machine->kptr_restrict_warned &&
 	    symbol_conf.kptr_restrict &&
 	    al.cpumode == PERF_RECORD_MISC_KERNEL) {
@@ -1537,6 +1541,8 @@ int cmd_top(int argc, const char **argv)
 			"number of thread to run event synthesize"),
 	OPT_BOOLEAN(0, "namespaces", &opts->record_namespaces,
 		    "Record namespaces events"),
+	OPT_BOOLEAN(0, "stitch-lbr", &top.stitch_lbr,
+		    "Enable LBR callgraph stitching approach"),
 	OPTS_EVSWITCH(&top.evswitch),
 	OPT_END()
 	};
@@ -1599,6 +1605,11 @@ int cmd_top(int argc, const char **argv)
 		}
 	}
 
+	if (top.stitch_lbr && !(callchain_param.record_mode == CALLCHAIN_LBR)) {
+		pr_err("Error: --stitch-lbr must be used with --call-graph lbr\n");
+		goto out_delete_evlist;
+	}
+
 	if (opts->branch_stack && callchain_param.enabled)
 		symbol_conf.show_branchflag_count = true;
 
diff --git a/tools/perf/util/top.h b/tools/perf/util/top.h
index f117d4f4821e..45dc84ddff37 100644
--- a/tools/perf/util/top.h
+++ b/tools/perf/util/top.h
@@ -36,6 +36,7 @@ struct perf_top {
 	bool		   use_tui, use_stdio;
 	bool		   vmlinux_warned;
 	bool		   dump_symtab;
+	bool		   stitch_lbr;
 	struct hist_entry  *sym_filter_entry;
 	struct evsel 	   *sym_evsel;
 	struct perf_session *session;
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 10/10] perf c2c: Add option to enable the LBR stitching approach
  2019-10-07 17:59 [PATCH 00/10] Stitch LBR call stack kan.liang
                   ` (8 preceding siblings ...)
  2019-10-07 17:59 ` [PATCH 09/10] perf top: " kan.liang
@ 2019-10-07 17:59 ` kan.liang
  2019-10-07 18:24 ` [PATCH 00/10] Stitch LBR call stack Ingo Molnar
  10 siblings, 0 replies; 18+ messages in thread
From: kan.liang @ 2019-10-07 17:59 UTC (permalink / raw)
  To: peterz, acme, mingo, linux-kernel
  Cc: jolsa, namhyung, ak, vitaly.slobodskoy, pavel.gerasimov, Kan Liang

From: Kan Liang <kan.liang@linux.intel.com>

With the LBR stitching approach, the reconstructed LBR call stack
can exceed the HW limitation. However, it may reconstruct invalid call
stacks in some cases, e.g. exception handling such as setjmp/longjmp.
Also, it may increase the processing time, especially when the number
of samples with stitched LBRs is huge.

Add an option to enable the approach.

Reviewed-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
 tools/perf/Documentation/perf-c2c.txt | 11 +++++++++++
 tools/perf/builtin-c2c.c              |  6 ++++++
 2 files changed, 17 insertions(+)

diff --git a/tools/perf/Documentation/perf-c2c.txt b/tools/perf/Documentation/perf-c2c.txt
index e6150f21267d..2133eb320cb0 100644
--- a/tools/perf/Documentation/perf-c2c.txt
+++ b/tools/perf/Documentation/perf-c2c.txt
@@ -111,6 +111,17 @@ REPORT OPTIONS
 --display::
 	Switch to HITM type (rmt, lcl) to display and sort on. Total HITMs as default.
 
+--stitch-lbr::
+	Show a callgraph with stitched LBRs, which may be more complete
+	than the default callgraph. The perf.data file must have been
+	obtained using perf c2c record --call-graph lbr.
+	Disabled by default. In common cases with call stack overflows,
+	it can recreate better call stacks than the default lbr call stack
+	output. But this approach is not foolproof. There can be cases
+	where it creates incorrect call stacks from incorrect matches.
+	The known limitations include exception handling such as
+	setjmp/longjmp, which will have mismatched calls/returns.
+
 C2C RECORD
 ----------
 The perf c2c record command setup options related to HITM cacheline analysis
diff --git a/tools/perf/builtin-c2c.c b/tools/perf/builtin-c2c.c
index 3542b6ab9813..c3658986c38a 100644
--- a/tools/perf/builtin-c2c.c
+++ b/tools/perf/builtin-c2c.c
@@ -95,6 +95,7 @@ struct perf_c2c {
 	bool			 use_stdio;
 	bool			 stats_only;
 	bool			 symbol_full;
+	bool			 stitch_lbr;
 
 	/* HITM shared clines stats */
 	struct c2c_stats	hitm_stats;
@@ -273,6 +274,9 @@ static int process_sample_event(struct perf_tool *tool __maybe_unused,
 		return -1;
 	}
 
+	if (c2c.stitch_lbr)
+		al.thread->lbr_stitch_enable = true;
+
 	ret = sample__resolve_callchain(sample, &callchain_cursor, NULL,
 					evsel, &al, sysctl_perf_event_max_stack);
 	if (ret)
@@ -2746,6 +2750,8 @@ static int perf_c2c__report(int argc, const char **argv)
 	OPT_STRING('c', "coalesce", &coalesce, "coalesce fields",
 		   "coalesce fields: pid,tid,iaddr,dso"),
 	OPT_BOOLEAN('f', "force", &symbol_conf.force, "don't complain, do it"),
+	OPT_BOOLEAN(0, "stitch-lbr", &c2c.stitch_lbr,
+		    "Enable LBR callgraph stitching approach"),
 	OPT_PARENT(c2c_options),
 	OPT_END()
 	};
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH 00/10] Stitch LBR call stack
  2019-10-07 17:59 [PATCH 00/10] Stitch LBR call stack kan.liang
                   ` (9 preceding siblings ...)
  2019-10-07 17:59 ` [PATCH 10/10] perf c2c: " kan.liang
@ 2019-10-07 18:24 ` Ingo Molnar
  2019-10-07 20:06   ` Liang, Kan
  10 siblings, 1 reply; 18+ messages in thread
From: Ingo Molnar @ 2019-10-07 18:24 UTC (permalink / raw)
  To: kan.liang
  Cc: peterz, acme, linux-kernel, jolsa, namhyung, ak,
	vitaly.slobodskoy, pavel.gerasimov


* kan.liang@linux.intel.com <kan.liang@linux.intel.com> wrote:

> Performance impact:
> The processing time may increase with the LBR stitching approach
> enabled. The impact depends on the number of samples with stitched LBRs.
> 
> For sqlite's tcltest,
> perf record --call-graph lbr -- make tcltest
> perf report --stitch-lbr
> 
> There are 4.11% samples has stitched LBRs.
> Total number of samples:                        2833728
> The number of samples with stitched LBRs        116478
> 
> The processing time of perf report increases 6.8%
> Without --stitch-lbr:                           55906106 usec
> With --stitch-lbr:                              59728701 usec
> 
> For a simple test case tchain_edit with 43 depth of call stacks.
> perf record --call-graph lbr -- ./tchain_edit
> perf report --stitch-lbr
> 
> There are 99.9% samples has stitched LBRs.
> Total number of samples:                        10915
> The number of samples with stitched LBRs        10905
> 
> The processing time of perf report increases 67.4%
> Without --stitch-lbr:                           11970508 usec
> With --stitch-lbr:                              20036055 usec

That cost seems pretty high, while the feature sounds useful - is there 
any way to speed this up?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 00/10] Stitch LBR call stack
  2019-10-07 18:24 ` [PATCH 00/10] Stitch LBR call stack Ingo Molnar
@ 2019-10-07 20:06   ` Liang, Kan
  0 siblings, 0 replies; 18+ messages in thread
From: Liang, Kan @ 2019-10-07 20:06 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: peterz, acme, linux-kernel, jolsa, namhyung, ak,
	vitaly.slobodskoy, pavel.gerasimov



On 10/7/2019 2:24 PM, Ingo Molnar wrote:
> 
> * kan.liang@linux.intel.com <kan.liang@linux.intel.com> wrote:
> 
>> Performance impact:
>> The processing time may increase with the LBR stitching approach
>> enabled. The impact depends on the number of samples with stitched LBRs.
>>
>> For sqlite's tcltest,
>> perf record --call-graph lbr -- make tcltest
>> perf report --stitch-lbr
>>
>> There are 4.11% samples has stitched LBRs.
>> Total number of samples:                        2833728
>> The number of samples with stitched LBRs        116478
>>
>> The processing time of perf report increases 6.8%
>> Without --stitch-lbr:                           55906106 usec
>> With --stitch-lbr:                              59728701 usec
>>
>> For a simple test case tchain_edit with 43 depth of call stacks.
>> perf record --call-graph lbr -- ./tchain_edit
>> perf report --stitch-lbr
>>
>> There are 99.9% samples has stitched LBRs.
>> Total number of samples:                        10915
>> The number of samples with stitched LBRs        10905
>>
>> The processing time of perf report increases 67.4%
>> Without --stitch-lbr:                           11970508 usec
>> With --stitch-lbr:                              20036055 usec
> 
> That cost seems pretty high, while the feature sounds useful - is there
> any way to speed this up?
> 

For each LBR entry, the perf tool calculates and generates an appended
node for callchain_cursor.
The stitched LBR entries come from the previous sample, so it looks like
we don't need to repeat that calculation for them. That should speed up
the whole process. I will do more tests for it.
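
Roughly, the idea would look something like this (illustrative sketch
only; the structure and helper names below are hypothetical, not the
actual perf code):

	/*
	 * Keep the callchain cursor nodes that were already resolved for
	 * the previous sample's LBR entries, and copy them into the cursor
	 * of the current sample instead of resolving the same branch
	 * entries a second time.
	 */
	struct stitch_cache {
		unsigned int			nr;
		struct callchain_cursor_node	nodes[MAX_LBR_NR]; /* resolved once */
	};

	static void append_cached_nodes(struct callchain_cursor *cursor,
					const struct stitch_cache *cache)
	{
		unsigned int i;

		/* No map/symbol lookup here, just reuse the saved nodes. */
		for (i = 0; i < cache->nr; i++)
			cursor_append_node(cursor, &cache->nodes[i]);
	}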

Thanks,
Kan

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 01/10] perf/core, x86: Add PERF_SAMPLE_LBR_TOS
  2019-10-07 17:59 ` [PATCH 01/10] perf/core, x86: Add PERF_SAMPLE_LBR_TOS kan.liang
@ 2019-10-08  8:31   ` Peter Zijlstra
  2019-10-08 13:53     ` Liang, Kan
  0 siblings, 1 reply; 18+ messages in thread
From: Peter Zijlstra @ 2019-10-08  8:31 UTC (permalink / raw)
  To: kan.liang
  Cc: acme, mingo, linux-kernel, jolsa, namhyung, ak,
	vitaly.slobodskoy, pavel.gerasimov

On Mon, Oct 07, 2019 at 10:59:01AM -0700, kan.liang@linux.intel.com wrote:
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index 61448c19a132..ee9ef0c4cb08 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -100,6 +100,7 @@ struct perf_raw_record {
>   */
>  struct perf_branch_stack {
>  	__u64				nr;
> +	__u64				tos;
>  	struct perf_branch_entry	entries[0];
>  };
>  
> diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
> index bb7b271397a6..fe36ebb7dc2e 100644
> --- a/include/uapi/linux/perf_event.h
> +++ b/include/uapi/linux/perf_event.h
> @@ -141,8 +141,9 @@ enum perf_event_sample_format {
>  	PERF_SAMPLE_TRANSACTION			= 1U << 17,
>  	PERF_SAMPLE_REGS_INTR			= 1U << 18,
>  	PERF_SAMPLE_PHYS_ADDR			= 1U << 19,
> +	PERF_SAMPLE_LBR_TOS			= 1U << 20,
>  
> -	PERF_SAMPLE_MAX = 1U << 20,		/* non-ABI */
> +	PERF_SAMPLE_MAX = 1U << 21,		/* non-ABI */
>  
>  	__PERF_SAMPLE_CALLCHAIN_EARLY		= 1ULL << 63, /* non-ABI; internal use */
>  };
> @@ -864,6 +865,7 @@ enum perf_event_type {
>  	 *	{ u64			abi; # enum perf_sample_regs_abi
>  	 *	  u64			regs[weight(mask)]; } && PERF_SAMPLE_REGS_INTR
>  	 *	{ u64			phys_addr;} && PERF_SAMPLE_PHYS_ADDR
> +	 *	{ u64			tos;} && PERF_SAMPLE_LBR_TOS
>  	 * };
>  	 */
>  	PERF_RECORD_SAMPLE			= 9,

I have problems with the API: you're introducing Intel-specific LBR
naming, and adding a whole new sample type versus extending the existing
BRANCH_STACK (like you really already do with struct perf_branch_stack).

So why not add a bit to PERF_SAMPLE_BRANCH_* to request the presence of
the TOS field in the PERF_SAMPLE_BRANCH_STACK output?
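
Something along these lines, for instance (hypothetical bit name and
layout, not existing ABI):

	/* New request bit in enum perf_branch_sample_type: */
	PERF_SAMPLE_BRANCH_HW_INDEX_SHIFT	= 17,	/* or whichever shift is next */
	PERF_SAMPLE_BRANCH_HW_INDEX		= 1U << PERF_SAMPLE_BRANCH_HW_INDEX_SHIFT,

	/*
	 * The PERF_SAMPLE_BRANCH_STACK output would then carry the index
	 * ahead of the entries:
	 *
	 *	{ u64 nr;
	 *	  u64 hw_idx;	# only if PERF_SAMPLE_BRANCH_HW_INDEX is set
	 *	  { u64 from, to, flags } lbr[nr];
	 *	} && PERF_SAMPLE_BRANCH_STACK
	 */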

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 01/10] perf/core, x86: Add PERF_SAMPLE_LBR_TOS
  2019-10-08  8:31   ` Peter Zijlstra
@ 2019-10-08 13:53     ` Liang, Kan
  2019-10-08 14:38       ` Peter Zijlstra
  0 siblings, 1 reply; 18+ messages in thread
From: Liang, Kan @ 2019-10-08 13:53 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: acme, mingo, linux-kernel, jolsa, namhyung, ak,
	vitaly.slobodskoy, pavel.gerasimov



On 10/8/2019 4:31 AM, Peter Zijlstra wrote:
> On Mon, Oct 07, 2019 at 10:59:01AM -0700, kan.liang@linux.intel.com wrote:
>> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
>> index 61448c19a132..ee9ef0c4cb08 100644
>> --- a/include/linux/perf_event.h
>> +++ b/include/linux/perf_event.h
>> @@ -100,6 +100,7 @@ struct perf_raw_record {
>>    */
>>   struct perf_branch_stack {
>>   	__u64				nr;
>> +	__u64				tos;
>>   	struct perf_branch_entry	entries[0];
>>   };
>>   
>> diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
>> index bb7b271397a6..fe36ebb7dc2e 100644
>> --- a/include/uapi/linux/perf_event.h
>> +++ b/include/uapi/linux/perf_event.h
>> @@ -141,8 +141,9 @@ enum perf_event_sample_format {
>>   	PERF_SAMPLE_TRANSACTION			= 1U << 17,
>>   	PERF_SAMPLE_REGS_INTR			= 1U << 18,
>>   	PERF_SAMPLE_PHYS_ADDR			= 1U << 19,
>> +	PERF_SAMPLE_LBR_TOS			= 1U << 20,
>>   
>> -	PERF_SAMPLE_MAX = 1U << 20,		/* non-ABI */
>> +	PERF_SAMPLE_MAX = 1U << 21,		/* non-ABI */
>>   
>>   	__PERF_SAMPLE_CALLCHAIN_EARLY		= 1ULL << 63, /* non-ABI; internal use */
>>   };
>> @@ -864,6 +865,7 @@ enum perf_event_type {
>>   	 *	{ u64			abi; # enum perf_sample_regs_abi
>>   	 *	  u64			regs[weight(mask)]; } && PERF_SAMPLE_REGS_INTR
>>   	 *	{ u64			phys_addr;} && PERF_SAMPLE_PHYS_ADDR
>> +	 *	{ u64			tos;} && PERF_SAMPLE_LBR_TOS
>>   	 * };
>>   	 */
>>   	PERF_RECORD_SAMPLE			= 9,
> 
> I have problems with the API.. You're introducing the intel specific LBR
> naming, and adding a whole new sample type vs extending the existing
> BRANCH_STACK (like you really already do with struct perf_branch_stack). >
> So why not add a bit to PERF_SAMPLE_BRANCH_* to request the presence of
> the TOS field in the PERF_SAMPLE_BRANCH_STACK output?

We never store PERF_SAMPLE_BRANCH_* in a sample. The perf tool cannot
tell whether the sample includes the TOS field.
There will be a problem when a new perf tool parses data generated
by an old kernel.


Can we rename the new sample type to PERF_SAMPLE_BRANCH_STACK_EXTENSION?

{ u64			version;
  u64			tos;	}	&& PERF_SAMPLE_BRANCH_STACK_EXTENSION

If other platforms want to add their own extensions, we just need to
increase the version number. The perf tool will check the version before
parsing the sample.
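
As a rough sketch of the layout (hypothetical, just to illustrate the
versioning idea, not merged code):

	/*
	 * Versioned extension block appended to the sample when the new
	 * sample type bit is set.  New fields would be added at the end
	 * and the version bumped, so a tool can check the version before
	 * parsing.
	 */
	struct perf_branch_stack_ext {
		__u64	version;	/* format of the extension block */
		__u64	tos;		/* physical index of the most recent LBR */
	};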

Thanks,
Kan

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 01/10] perf/core, x86: Add PERF_SAMPLE_LBR_TOS
  2019-10-08 13:53     ` Liang, Kan
@ 2019-10-08 14:38       ` Peter Zijlstra
  2019-10-08 15:25         ` Liang, Kan
  0 siblings, 1 reply; 18+ messages in thread
From: Peter Zijlstra @ 2019-10-08 14:38 UTC (permalink / raw)
  To: Liang, Kan
  Cc: acme, mingo, linux-kernel, jolsa, namhyung, ak,
	vitaly.slobodskoy, pavel.gerasimov

On Tue, Oct 08, 2019 at 09:53:24AM -0400, Liang, Kan wrote:
> 
> 
> On 10/8/2019 4:31 AM, Peter Zijlstra wrote:
> > On Mon, Oct 07, 2019 at 10:59:01AM -0700, kan.liang@linux.intel.com wrote:
> > > diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> > > index 61448c19a132..ee9ef0c4cb08 100644
> > > --- a/include/linux/perf_event.h
> > > +++ b/include/linux/perf_event.h
> > > @@ -100,6 +100,7 @@ struct perf_raw_record {
> > >    */
> > >   struct perf_branch_stack {
> > >   	__u64				nr;
> > > +	__u64				tos;
> > >   	struct perf_branch_entry	entries[0];
> > >   };
> > > diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
> > > index bb7b271397a6..fe36ebb7dc2e 100644
> > > --- a/include/uapi/linux/perf_event.h
> > > +++ b/include/uapi/linux/perf_event.h
> > > @@ -141,8 +141,9 @@ enum perf_event_sample_format {
> > >   	PERF_SAMPLE_TRANSACTION			= 1U << 17,
> > >   	PERF_SAMPLE_REGS_INTR			= 1U << 18,
> > >   	PERF_SAMPLE_PHYS_ADDR			= 1U << 19,
> > > +	PERF_SAMPLE_LBR_TOS			= 1U << 20,
> > > -	PERF_SAMPLE_MAX = 1U << 20,		/* non-ABI */
> > > +	PERF_SAMPLE_MAX = 1U << 21,		/* non-ABI */
> > >   	__PERF_SAMPLE_CALLCHAIN_EARLY		= 1ULL << 63, /* non-ABI; internal use */
> > >   };
> > > @@ -864,6 +865,7 @@ enum perf_event_type {
> > >   	 *	{ u64			abi; # enum perf_sample_regs_abi
> > >   	 *	  u64			regs[weight(mask)]; } && PERF_SAMPLE_REGS_INTR
> > >   	 *	{ u64			phys_addr;} && PERF_SAMPLE_PHYS_ADDR
> > > +	 *	{ u64			tos;} && PERF_SAMPLE_LBR_TOS
> > >   	 * };
> > >   	 */
> > >   	PERF_RECORD_SAMPLE			= 9,
> > 
> > I have problems with the API.. You're introducing the intel specific LBR
> > naming, and adding a whole new sample type vs extending the existing
> > BRANCH_STACK (like you really already do with struct perf_branch_stack). >
> > So why not add a bit to PERF_SAMPLE_BRANCH_* to request the presence of
> > the TOS field in the PERF_SAMPLE_BRANCH_STACK output?
> 
> We never store PERF_SAMPLE_BRANCH_* in a sample. The perf tool cannot tell
> if the sample includes TOS field.

The perf tool bloody sets the perf_event_attr::branch_sample_type value!
Of course it knows to expect the TOS field when it asks for it in the
first place.

> There will be a problem when a new perf tool parsing the data generated by
> an old kernel.

ISTR perf stores the full perf_event_attr in the .data file these days,
and therefore such confusion should never happen.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 01/10] perf/core, x86: Add PERF_SAMPLE_LBR_TOS
  2019-10-08 14:38       ` Peter Zijlstra
@ 2019-10-08 15:25         ` Liang, Kan
  2019-10-08 16:32           ` Peter Zijlstra
  0 siblings, 1 reply; 18+ messages in thread
From: Liang, Kan @ 2019-10-08 15:25 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: acme, mingo, linux-kernel, jolsa, namhyung, ak,
	vitaly.slobodskoy, pavel.gerasimov



On 10/8/2019 10:38 AM, Peter Zijlstra wrote:
> On Tue, Oct 08, 2019 at 09:53:24AM -0400, Liang, Kan wrote:
>>
>>
>> On 10/8/2019 4:31 AM, Peter Zijlstra wrote:
>>> On Mon, Oct 07, 2019 at 10:59:01AM -0700, kan.liang@linux.intel.com wrote:
>>>> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
>>>> index 61448c19a132..ee9ef0c4cb08 100644
>>>> --- a/include/linux/perf_event.h
>>>> +++ b/include/linux/perf_event.h
>>>> @@ -100,6 +100,7 @@ struct perf_raw_record {
>>>>     */
>>>>    struct perf_branch_stack {
>>>>    	__u64				nr;
>>>> +	__u64				tos;
>>>>    	struct perf_branch_entry	entries[0];
>>>>    };
>>>> diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
>>>> index bb7b271397a6..fe36ebb7dc2e 100644
>>>> --- a/include/uapi/linux/perf_event.h
>>>> +++ b/include/uapi/linux/perf_event.h
>>>> @@ -141,8 +141,9 @@ enum perf_event_sample_format {
>>>>    	PERF_SAMPLE_TRANSACTION			= 1U << 17,
>>>>    	PERF_SAMPLE_REGS_INTR			= 1U << 18,
>>>>    	PERF_SAMPLE_PHYS_ADDR			= 1U << 19,
>>>> +	PERF_SAMPLE_LBR_TOS			= 1U << 20,
>>>> -	PERF_SAMPLE_MAX = 1U << 20,		/* non-ABI */
>>>> +	PERF_SAMPLE_MAX = 1U << 21,		/* non-ABI */
>>>>    	__PERF_SAMPLE_CALLCHAIN_EARLY		= 1ULL << 63, /* non-ABI; internal use */
>>>>    };
>>>> @@ -864,6 +865,7 @@ enum perf_event_type {
>>>>    	 *	{ u64			abi; # enum perf_sample_regs_abi
>>>>    	 *	  u64			regs[weight(mask)]; } && PERF_SAMPLE_REGS_INTR
>>>>    	 *	{ u64			phys_addr;} && PERF_SAMPLE_PHYS_ADDR
>>>> +	 *	{ u64			tos;} && PERF_SAMPLE_LBR_TOS
>>>>    	 * };
>>>>    	 */
>>>>    	PERF_RECORD_SAMPLE			= 9,
>>>
>>> I have problems with the API.. You're introducing the intel specific LBR
>>> naming, and adding a whole new sample type vs extending the existing
>>> BRANCH_STACK (like you really already do with struct perf_branch_stack). >
>>> So why not add a bit to PERF_SAMPLE_BRANCH_* to request the presence of
>>> the TOS field in the PERF_SAMPLE_BRANCH_STACK output?
>>
>> We never store PERF_SAMPLE_BRANCH_* in a sample. The perf tool cannot tell
>> if the sample includes TOS field.
> 
> The perf tool bloody sets the perf_event_attr::branch_sample_type value!
> Of course it knows to expect the TOS field when it asks for it in the
> first place.
>

Users may generate the perf.data on one machine and parse the data on
another machine.
If the perf.data comes from a new kernel with a new perf tool on one
machine, but users parse it with an old perf tool on another machine,
the old perf tool doesn't know about the existence of the TOS field.


Thanks,
Kan

>> There will be a problem when a new perf tool parsing the data generated by
>> an old kernel.
> 
> ISTR perf stores the full perf_event_attr in the .data file these days,
> and therefore such confusion should never happen.
> 

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 01/10] perf/core, x86: Add PERF_SAMPLE_LBR_TOS
  2019-10-08 15:25         ` Liang, Kan
@ 2019-10-08 16:32           ` Peter Zijlstra
  0 siblings, 0 replies; 18+ messages in thread
From: Peter Zijlstra @ 2019-10-08 16:32 UTC (permalink / raw)
  To: Liang, Kan
  Cc: acme, mingo, linux-kernel, jolsa, namhyung, ak,
	vitaly.slobodskoy, pavel.gerasimov

On Tue, Oct 08, 2019 at 11:25:01AM -0400, Liang, Kan wrote:
> > The perf tool bloody sets the perf_event_attr::branch_sample_type value!
> > Of course it knows to expect the TOS field when it asks for it in the
> > first place.
> > 
> 
> Users may generate the perf.data on one machine, and parse the data on
> another machine.
> If the perf.data is from a new kernel with a new perf tool on one machine,
> but users have an old perf tool on another machine to parse it. The old perf
> tool doesn't know the exists of TOS field.

Feh, the tool should check for unknown input bits in attr.
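
Something like the following on the tool side, for example
(illustrative only, not existing perf code):

	/* Reject or warn about sample_type bits this tool cannot parse. */
	u64 known = PERF_SAMPLE_MAX - 1;	/* all bits this build knows about */
	u64 unknown = attr->sample_type & ~known;

	if (unknown)
		pr_warning("unknown sample_type bits %#llx\n",
			   (unsigned long long)unknown);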

Anyway, the proposed API is horrendous crap, that's just not going to
happen.

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2019-10-08 16:32 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-10-07 17:59 [PATCH 00/10] Stitch LBR call stack kan.liang
2019-10-07 17:59 ` [PATCH 01/10] perf/core, x86: Add PERF_SAMPLE_LBR_TOS kan.liang
2019-10-08  8:31   ` Peter Zijlstra
2019-10-08 13:53     ` Liang, Kan
2019-10-08 14:38       ` Peter Zijlstra
2019-10-08 15:25         ` Liang, Kan
2019-10-08 16:32           ` Peter Zijlstra
2019-10-07 17:59 ` [PATCH 02/10] perf tools: Support PERF_SAMPLE_LBR_TOS kan.liang
2019-10-07 17:59 ` [PATCH 03/10] perf pmu: Add support for PMU capabilities kan.liang
2019-10-07 17:59 ` [PATCH 04/10] perf header: Support CPU " kan.liang
2019-10-07 17:59 ` [PATCH 05/10] perf machine: Refine the function for LBR call stack reconstruction kan.liang
2019-10-07 17:59 ` [PATCH 06/10] perf tools: Stitch LBR call stack kan.liang
2019-10-07 17:59 ` [PATCH 07/10] perf report: Add option to enable the LBR stitching approach kan.liang
2019-10-07 17:59 ` [PATCH 08/10] perf script: " kan.liang
2019-10-07 17:59 ` [PATCH 09/10] perf top: " kan.liang
2019-10-07 17:59 ` [PATCH 10/10] perf c2c: " kan.liang
2019-10-07 18:24 ` [PATCH 00/10] Stitch LBR call stack Ingo Molnar
2019-10-07 20:06   ` Liang, Kan
