* [PATCH v1 00/13] perf/x86/amd: Add AMD Fam19h Branch Sampling support
@ 2021-09-09  7:56 Stephane Eranian
  2021-09-09  7:56 ` [PATCH v1 01/13] perf/core: add union to struct perf_branch_entry Stephane Eranian
                   ` (13 more replies)
  0 siblings, 14 replies; 41+ messages in thread
From: Stephane Eranian @ 2021-09-09  7:56 UTC (permalink / raw)
  To: linux-kernel; +Cc: peterz, acme, jolsa, kim.phillips, namhyung, irogers

This patch series adds support for the AMD Fam19h 16-deep branch sampling
feature as described in the AMD PPR Fam19h Model 01h Revision B1 section 2.1.13.
This is a model specific extension. It is not an architected AMD feature.

The Branch Sampling feature (BRS) provides the statistical taken-branch information necessary
to enable autoFDO-style optimizations by compilers, i.e., basic block execution counts.

BRS operates with a 16-deep saturating buffer in MSR registers. There is no
hardware branch type filtering. All control flow changes are captured. BRS
relies on specific programming of the core PMU of Fam19h.  In particular,
the following requirements must be met:
 - the sampling period must be greater than 16 (the BRS depth)
 - the sampling period must use fixed mode, not frequency mode
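
For illustration, a minimal perf_event_attr setup meeting both requirements could look as
follows (the raw event code 0xc4 is the Fam19h retired taken branch event used later in this
series; the other values are placeholders):

	struct perf_event_attr attr = {
		.size               = sizeof(struct perf_event_attr),
		.type               = PERF_TYPE_RAW,
		.config             = 0xc4,     /* retired taken branch instructions */
		.sample_period      = 1000037,  /* fixed period, well above the 16-deep BRS */
		.freq               = 0,        /* frequency mode is not supported with BRS */
		.sample_type        = PERF_SAMPLE_IP | PERF_SAMPLE_BRANCH_STACK,
		.branch_sample_type = PERF_SAMPLE_BRANCH_ANY,
	};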

BRS interacts with the NMI interrupt as well. Because enabling BRS is expensive,
it is only activated after P event occurrences, where P is the desired sampling period.
At P occurrences of the event, the counter overflows, the CPU catches the NMI interrupt,
activates BRS for 16 branches until it saturates, and then delivers the NMI to the kernel.
Between the overflow and the time BRS activates, more branches may be executed, skewing the period.
All along, the sampling event keeps counting. The skid may be attenuated by reducing the sampling
period by 16.

BRS is integrated into perf_events seamlessly via the same PERF_RECORD_BRANCH_STACK sample
format. BRS generates branch perf_branch_entry records in the sampling buffer. There is
no prediction or latency information. The branches are stored in reverse order of
execution: the most recent branch is the first entry in each record.
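
For readers parsing the raw records: with PERF_SAMPLE_BRANCH_STACK, the branch data inside a
PERF_RECORD_SAMPLE follows the usual perf ABI layout (simplified here; the optional hw_idx word
added by PERF_SAMPLE_BRANCH_HW_INDEX is omitted):

	u64                      nr;           /* number of entries, at most 16 with BRS       */
	struct perf_branch_entry entries[nr];  /* entries[0] is the most recently taken branch */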

Because BRS must be stopped when a CPU goes into low power mode, the series includes patches
to add callbacks on low power entry and exit (mwait and halt).

Given that there is no hardware privilege filtering with BRS, the kernel implements the
privilege level filtering in software.

This version adds a few simple modifications to perf record and report.
1. add the branch-brs event as a builtin so that it can be used directly: perf record -e branch-brs ...
2. improve error handling for AMD IBS (contributed by Kim Phillips)
3. reuse the improved error handling to report branch sampling errors more clearly
4. add two new sort dimensions to help display the branch sampling information. Because there is no
   latency information associated with the branch sampling feature, perf report would collapse all
   samples within a function into a single histogram entry. This is expected because the default
   sort mode for PERF_SAMPLE_BRANCH_STACK is symbol_from/symbol_to. This propagates to the annotation.

For a more detailed view of the branch samples, the new sort dimensions addr_from and addr_to can
be used instead, as follows:

$ perf report --sort=overhead,comm,dso,addr_from,addr_to 
# Overhead  Command          Shared Object     Source Address  Target Address
# ........  ...............  ................  ..............  ..............
#
     4.21%  test_prg        test_prg         [.] test_threa+0x3c  [.] test_threa+0x4
     4.14%  test_prg        test_prg         [.] test_threa+0x3e  [.] test_threa+0x2
     4.10%  test_prg        test_prg         [.] test_threa+0x4  [.] test_threa+0x3a
     4.07%  test_prg        test_prg         [.] test_threa+0x2  [.] test_threa+0x3c

Versus the default output:

$ perf report 
# Overhead  Command          Source Shared Object  Source Symbol                        Target Symbol                        Basic Block Cycles
# ........  ...............  ....................  ...................................  ...................................  ..................
#
    99.52%  test_prg        test_prg             [.] test_thread                      [.] test_thread                      -                 

BRS can be used with any sampling event. However, it is recommended to use the RETIRED_BRANCH
event because it matches what the BRS captures. For convenience, a pseudo event matching the
branches captured by BRS is exported by the kernel (branch-brs):

$ perf record -b -e cpu/branch-brs/ -c 1000037 test

$ perf report -D
56531696056126 0x193c000 [0x1a8]: PERF_RECORD_SAMPLE(IP, 0x2): 18122/18230: 0x401d24 period: 1000037 addr: 0
... branch stack: nr:16
.....  0: 0000000000401d24 -> 0000000000401d5a 0 cycles      0
.....  1: 0000000000401d5c -> 0000000000401d24 0 cycles      0
.....  2: 0000000000401d22 -> 0000000000401d5c 0 cycles      0
.....  3: 0000000000401d5e -> 0000000000401d22 0 cycles      0
.....  4: 0000000000401d20 -> 0000000000401d5e 0 cycles      0
.....  5: 0000000000401d3e -> 0000000000401d20 0 cycles      0
.....  6: 0000000000401d42 -> 0000000000401d3e 0 cycles      0
.....  7: 0000000000401d3c -> 0000000000401d42 0 cycles      0
.....  8: 0000000000401d44 -> 0000000000401d3c 0 cycles      0
.....  9: 0000000000401d3a -> 0000000000401d44 0 cycles      0
..... 10: 0000000000401d46 -> 0000000000401d3a 0 cycles      0
..... 11: 0000000000401d38 -> 0000000000401d46 0 cycles      0
..... 12: 0000000000401d48 -> 0000000000401d38 0 cycles      0
..... 13: 0000000000401d36 -> 0000000000401d48 0 cycles      0
..... 14: 0000000000401d4a -> 0000000000401d36 0 cycles      0
..... 15: 0000000000401d34 -> 0000000000401d4a 0 cycles      0
 ... thread: test:18230
 ...... dso: test

Special thanks to Kim Phillips @ AMD for the testing, reviews and contributions.

Kim Phillips (1):
  perf tools: improve IBS error handling

Stephane Eranian (12):
  perf/core: add union to struct perf_branch_entry
  x86/cpufeatures: add AMD Fam19h Branch Sampling feature
  perf/x86/amd: add AMD Fam19h Branch Sampling support
  perf/x86/amd: add branch-brs helper event for Fam19h BRS
  perf/x86/amd: enable branch sampling priv level filtering
  perf/x86/amd: add AMD branch sampling period adjustment
  perf/core: add idle hooks
  perf/x86/core: add idle hooks
  perf/x86/amd: add idle hooks for branch sampling
  perf tools: add branch-brs as a new event
  perf tools: improve error handling of AMD Branch Sampling
  perf report: add addr_from/addr_to sort dimensions

 arch/x86/events/amd/Makefile       |   2 +-
 arch/x86/events/amd/brs.c          | 342 +++++++++++++++++++++++++++++
 arch/x86/events/amd/core.c         | 222 ++++++++++++++++++-
 arch/x86/events/core.c             |  22 +-
 arch/x86/events/intel/lbr.c        |  13 +-
 arch/x86/events/perf_event.h       | 106 +++++++--
 arch/x86/include/asm/cpufeatures.h |   1 +
 arch/x86/include/asm/msr-index.h   |   4 +
 arch/x86/include/asm/mwait.h       |   6 +-
 include/linux/perf_event.h         |   8 +
 include/uapi/linux/perf_event.h    |  19 +-
 kernel/events/core.c               |  58 +++++
 kernel/sched/idle.c                |  19 +-
 tools/perf/util/evsel.c            |  50 +++++
 tools/perf/util/hist.c             |   2 +
 tools/perf/util/hist.h             |   2 +
 tools/perf/util/parse-events.l     |   1 +
 tools/perf/util/sort.c             | 128 +++++++++++
 tools/perf/util/sort.h             |   2 +
 19 files changed, 964 insertions(+), 43 deletions(-)
 create mode 100644 arch/x86/events/amd/brs.c

-- 
2.33.0.153.gba50c8fa24-goog



* [PATCH v1 01/13] perf/core: add union to struct perf_branch_entry
  2021-09-09  7:56 [PATCH v1 00/13] perf/x86/amd: Add AMD Fam19h Branch Sampling support Stephane Eranian
@ 2021-09-09  7:56 ` Stephane Eranian
  2021-09-09 19:03   ` Peter Zijlstra
  2021-09-09  7:56 ` [PATCH v1 02/13] x86/cpufeatures: add AMD Fam19h Branch Sampling feature Stephane Eranian
                   ` (12 subsequent siblings)
  13 siblings, 1 reply; 41+ messages in thread
From: Stephane Eranian @ 2021-09-09  7:56 UTC (permalink / raw)
  To: linux-kernel; +Cc: peterz, acme, jolsa, kim.phillips, namhyung, irogers

To make it simpler to reset all the info fields on the perf_branch_entry in
a single store, we use a union. This avoids missing some of the bitfields and
improves generated code by minimizing the number of stores.

An anonymous struct is used around the bitfields to guarantee field ordering.

A single perf_branch_entry.val = 0 store guarantees that all fields are zeroed on any architecture.
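
As a quick user-space illustration (not part of the patch), a union mirroring the new layout can
be used to check that the single store clears every bitfield:

	#include <assert.h>
	#include <stdint.h>

	union entry_flags {
		uint64_t val;
		struct {
			uint64_t mispred:1, predicted:1, in_tx:1, abort:1,
				 cycles:16, type:4, reserved:40;
		};
	};

	int main(void)
	{
		union entry_flags e = { .val = ~0ULL };      /* set every bit */

		static_assert(sizeof(e) == sizeof(uint64_t), "union must not change the size");

		e.val = 0;                                   /* one 64-bit store ...      */
		assert(!e.mispred && !e.cycles && !e.type);  /* ... clears every bitfield */
		return 0;
	}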

Signed-off-by: Stephane Eranian <eranian@google.com>
---
 arch/x86/events/intel/lbr.c     | 13 +++----------
 include/uapi/linux/perf_event.h | 19 ++++++++++++-------
 2 files changed, 15 insertions(+), 17 deletions(-)

diff --git a/arch/x86/events/intel/lbr.c b/arch/x86/events/intel/lbr.c
index 9e6d6eaeb4cb..27aa48cce3c6 100644
--- a/arch/x86/events/intel/lbr.c
+++ b/arch/x86/events/intel/lbr.c
@@ -801,15 +801,9 @@ void intel_pmu_lbr_read_32(struct cpu_hw_events *cpuc)
 
 		rdmsrl(x86_pmu.lbr_from + lbr_idx, msr_lastbranch.lbr);
 
+		cpuc->lbr_entries[i].val	= 0;
 		cpuc->lbr_entries[i].from	= msr_lastbranch.from;
 		cpuc->lbr_entries[i].to		= msr_lastbranch.to;
-		cpuc->lbr_entries[i].mispred	= 0;
-		cpuc->lbr_entries[i].predicted	= 0;
-		cpuc->lbr_entries[i].in_tx	= 0;
-		cpuc->lbr_entries[i].abort	= 0;
-		cpuc->lbr_entries[i].cycles	= 0;
-		cpuc->lbr_entries[i].type	= 0;
-		cpuc->lbr_entries[i].reserved	= 0;
 	}
 	cpuc->lbr_stack.nr = i;
 	cpuc->lbr_stack.hw_idx = tos;
@@ -896,6 +890,7 @@ void intel_pmu_lbr_read_64(struct cpu_hw_events *cpuc)
 		if (abort && x86_pmu.lbr_double_abort && out > 0)
 			out--;
 
+		cpuc->lbr_entries[out].val	 = 0;
 		cpuc->lbr_entries[out].from	 = from;
 		cpuc->lbr_entries[out].to	 = to;
 		cpuc->lbr_entries[out].mispred	 = mis;
@@ -903,8 +898,6 @@ void intel_pmu_lbr_read_64(struct cpu_hw_events *cpuc)
 		cpuc->lbr_entries[out].in_tx	 = in_tx;
 		cpuc->lbr_entries[out].abort	 = abort;
 		cpuc->lbr_entries[out].cycles	 = cycles;
-		cpuc->lbr_entries[out].type	 = 0;
-		cpuc->lbr_entries[out].reserved	 = 0;
 		out++;
 	}
 	cpuc->lbr_stack.nr = out;
@@ -966,6 +959,7 @@ static void intel_pmu_store_lbr(struct cpu_hw_events *cpuc,
 		to = rdlbr_to(i, lbr);
 		info = rdlbr_info(i, lbr);
 
+		e->val		= 0;
 		e->from		= from;
 		e->to		= to;
 		e->mispred	= get_lbr_mispred(info);
@@ -974,7 +968,6 @@ static void intel_pmu_store_lbr(struct cpu_hw_events *cpuc,
 		e->abort	= !!(info & LBR_INFO_ABORT);
 		e->cycles	= get_lbr_cycles(info);
 		e->type		= get_lbr_br_type(info);
-		e->reserved	= 0;
 	}
 
 	cpuc->lbr_stack.nr = i;
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index f92880a15645..eb11f383f4be 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -1329,13 +1329,18 @@ union perf_mem_data_src {
 struct perf_branch_entry {
 	__u64	from;
 	__u64	to;
-	__u64	mispred:1,  /* target mispredicted */
-		predicted:1,/* target predicted */
-		in_tx:1,    /* in transaction */
-		abort:1,    /* transaction abort */
-		cycles:16,  /* cycle count to last branch */
-		type:4,     /* branch type */
-		reserved:40;
+	union {
+		__u64	val;	    /* to make it easier to clear all fields */
+		struct {
+			__u64	mispred:1,  /* target mispredicted */
+				predicted:1,/* target predicted */
+				in_tx:1,    /* in transaction */
+				abort:1,    /* transaction abort */
+				cycles:16,  /* cycle count to last branch */
+				type:4,     /* branch type */
+				reserved:40;
+		};
+	};
 };
 
 union perf_sample_weight {
-- 
2.33.0.153.gba50c8fa24-goog



* [PATCH v1 02/13] x86/cpufeatures: add AMD Fam19h Branch Sampling feature
  2021-09-09  7:56 [PATCH v1 00/13] perf/x86/amd: Add AMD Fam19h Branch Sampling support Stephane Eranian
  2021-09-09  7:56 ` [PATCH v1 01/13] perf/core: add union to struct perf_branch_entry Stephane Eranian
@ 2021-09-09  7:56 ` Stephane Eranian
  2021-09-09  7:56 ` [PATCH v1 03/13] perf/x86/amd: add AMD Fam19h Branch Sampling support Stephane Eranian
                   ` (11 subsequent siblings)
  13 siblings, 0 replies; 41+ messages in thread
From: Stephane Eranian @ 2021-09-09  7:56 UTC (permalink / raw)
  To: linux-kernel; +Cc: peterz, acme, jolsa, kim.phillips, namhyung, irogers

This patch adds a CPU feature flag for the AMD Fam19h Branch Sampling feature, which is
reported as bit 31 of EBX on CPUID leaf 0x80000008.
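
For illustration only (not part of the patch), the bit can be probed from user space with the
compiler's cpuid helper:

	#include <stdio.h>
	#include <cpuid.h>

	int main(void)
	{
		unsigned int eax, ebx, ecx, edx;

		if (!__get_cpuid(0x80000008, &eax, &ebx, &ecx, &edx))
			return 1;

		printf("BRS %s\n", (ebx & (1u << 31)) ? "supported" : "not supported");
		return 0;
	}

With the patch applied, the new flag should also show up in /proc/cpuinfo.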

Signed-off-by: Stephane Eranian <eranian@google.com>
---
 arch/x86/include/asm/cpufeatures.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index d0ce5cfd3ac1..74fd7ab7d74d 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -313,6 +313,7 @@
 #define X86_FEATURE_AMD_SSBD		(13*32+24) /* "" Speculative Store Bypass Disable */
 #define X86_FEATURE_VIRT_SSBD		(13*32+25) /* Virtualized Speculative Store Bypass Disable */
 #define X86_FEATURE_AMD_SSB_NO		(13*32+26) /* "" Speculative Store Bypass is fixed in hardware. */
+#define X86_FEATURE_AMD_BRS		(13*32+31) /* Branch Sampling available */
 
 /* Thermal and Power Management Leaf, CPUID level 0x00000006 (EAX), word 14 */
 #define X86_FEATURE_DTHERM		(14*32+ 0) /* Digital Thermal Sensor */
-- 
2.33.0.153.gba50c8fa24-goog



* [PATCH v1 03/13] perf/x86/amd: add AMD Fam19h Branch Sampling support
  2021-09-09  7:56 [PATCH v1 00/13] perf/x86/amd: Add AMD Fam19h Branch Sampling support Stephane Eranian
  2021-09-09  7:56 ` [PATCH v1 01/13] perf/core: add union to struct perf_branch_entry Stephane Eranian
  2021-09-09  7:56 ` [PATCH v1 02/13] x86/cpufeatures: add AMD Fam19h Branch Sampling feature Stephane Eranian
@ 2021-09-09  7:56 ` Stephane Eranian
  2021-09-09 10:44   ` kernel test robot
  2021-09-09 15:33   ` kernel test robot
  2021-09-09  7:56 ` [PATCH v1 04/13] perf/x86/amd: add branch-brs helper event for Fam19h BRS Stephane Eranian
                   ` (10 subsequent siblings)
  13 siblings, 2 replies; 41+ messages in thread
From: Stephane Eranian @ 2021-09-09  7:56 UTC (permalink / raw)
  To: linux-kernel; +Cc: peterz, acme, jolsa, kim.phillips, namhyung, irogers

This patch adds support for the AMD Fam19h 16-deep branch sampling
feature as described in the AMD PPR Fam19h Model 01h Revision B1.
This is a model specific extension. It is not an architected AMD feature.

Branch Sampling (BRS) operates with a 16-deep saturating buffer in
MSR registers. There is no branch type filtering. All control flow changes
are captured. BRS relies on specific programming of the core PMU of Fam19h.
In particular, the following requirements must be met:
 - the sampling period must be greater than 16 (the BRS depth)
 - the sampling period must use fixed mode, not frequency mode

BRS interacts with the NMI interrupt as well. Because enabling BRS is expensive,
it is only activated after P event occurrences, where P is the desired sampling period.
At P occurrences of the event, the counter overflows, the CPU catches the interrupt,
activates BRS for 16 branches until it saturates, and then delivers the NMI to the kernel.
Between the overflow and the time BRS activates, more branches may be executed, skewing the period.
All along, the sampling event keeps counting. The skid may be attenuated by reducing the sampling
period by 16 (see a subsequent patch).

BRS is integrated into perf_events seamlessly via the same PERF_RECORD_BRANCH_STACK sample
format. BRS generates perf_branch_entry records in the sampling buffer. No prediction
information is supported. The branches are stored in reverse order of execution.
The most recent branch is the first entry in each record.

No modification to the perf tool is necessary.

BRS can be used with any sampling event. However, it is recommended to use the
RETIRED_BRANCH_INSTRUCTIONS event because it matches what the BRS captures.

$ perf record -b -c 1000037 -e cpu/event=0xc2,name=retired_branch_instructions/ test

$ perf report -D
56531696056126 0x193c000 [0x1a8]: PERF_RECORD_SAMPLE(IP, 0x2): 18122/18230: 0x401d24 period: 1000037 addr: 0
... branch stack: nr:16
.....  0: 0000000000401d24 -> 0000000000401d5a 0 cycles      0
.....  1: 0000000000401d5c -> 0000000000401d24 0 cycles      0
.....  2: 0000000000401d22 -> 0000000000401d5c 0 cycles      0
.....  3: 0000000000401d5e -> 0000000000401d22 0 cycles      0
.....  4: 0000000000401d20 -> 0000000000401d5e 0 cycles      0
.....  5: 0000000000401d3e -> 0000000000401d20 0 cycles      0
.....  6: 0000000000401d42 -> 0000000000401d3e 0 cycles      0
.....  7: 0000000000401d3c -> 0000000000401d42 0 cycles      0
.....  8: 0000000000401d44 -> 0000000000401d3c 0 cycles      0
.....  9: 0000000000401d3a -> 0000000000401d44 0 cycles      0
..... 10: 0000000000401d46 -> 0000000000401d3a 0 cycles      0
..... 11: 0000000000401d38 -> 0000000000401d46 0 cycles      0
..... 12: 0000000000401d48 -> 0000000000401d38 0 cycles      0
..... 13: 0000000000401d36 -> 0000000000401d48 0 cycles      0
..... 14: 0000000000401d4a -> 0000000000401d36 0 cycles      0
..... 15: 0000000000401d34 -> 0000000000401d4a 0 cycles      0
 ... thread: test:18230
 ...... dso: test

Signed-off-by: Stephane Eranian <eranian@google.com>
---
 arch/x86/events/amd/Makefile     |   2 +-
 arch/x86/events/amd/brs.c        | 303 +++++++++++++++++++++++++++++++
 arch/x86/events/amd/core.c       | 202 ++++++++++++++++++++-
 arch/x86/events/core.c           |  15 +-
 arch/x86/events/perf_event.h     |  93 ++++++++--
 arch/x86/include/asm/msr-index.h |   4 +
 6 files changed, 595 insertions(+), 24 deletions(-)
 create mode 100644 arch/x86/events/amd/brs.c

diff --git a/arch/x86/events/amd/Makefile b/arch/x86/events/amd/Makefile
index 6cbe38d5fd9d..cf323ffab5cd 100644
--- a/arch/x86/events/amd/Makefile
+++ b/arch/x86/events/amd/Makefile
@@ -1,5 +1,5 @@
 # SPDX-License-Identifier: GPL-2.0
-obj-$(CONFIG_CPU_SUP_AMD)		+= core.o
+obj-$(CONFIG_CPU_SUP_AMD)		+= core.o brs.o
 obj-$(CONFIG_PERF_EVENTS_AMD_POWER)	+= power.o
 obj-$(CONFIG_X86_LOCAL_APIC)		+= ibs.o
 obj-$(CONFIG_PERF_EVENTS_AMD_UNCORE)	+= amd-uncore.o
diff --git a/arch/x86/events/amd/brs.c b/arch/x86/events/amd/brs.c
new file mode 100644
index 000000000000..86dbc6d06815
--- /dev/null
+++ b/arch/x86/events/amd/brs.c
@@ -0,0 +1,303 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Implement support for AMD Fam19h Branch Sampling feature
+ * Based on specifications published in AMD PPR Fam19 Model 01
+ *
+ * Copyright 2021 Google LLC
+ * Contributed by Stephane Eranian <eranian@google.com>
+ */
+#include <linux/kernel.h>
+#include <asm/msr.h>
+#include <asm/cpufeature.h>
+
+#include "../perf_event.h"
+
+#define BRS_POISON	0xFFFFFFFFFFFFFFFEULL /* mark limit of valid entries */
+
+/* Debug Extension Configuration register layout */
+union amd_debug_extn_cfg {
+	__u64 val;
+	struct {
+		__u64	rsvd0:2,  /* reserved */
+			brsmen:1, /* branch sample enable */
+			rsvd4_3:2,/* reserved - must be 0x3 */
+			vb:1,     /* valid branches recorded */
+			rsvd2:10, /* reserved */
+			msroff:4, /* index of next entry to write */
+			rsvd3:4,  /* reserved */
+			pmc:3,    /* #PMC holding the sampling event */
+			rsvd4:37; /* reserved */
+	};
+};
+
+static inline unsigned int brs_from(int idx)
+{
+	return MSR_AMD_SAMP_BR_FROM + 2 * idx;
+}
+
+static inline unsigned int brs_to(int idx)
+{
+	return MSR_AMD_SAMP_BR_FROM + 2 * idx + 1;
+}
+
+static inline void set_debug_extn_cfg(u64 val)
+{
+	/* bits[4:3] must always be set to 11b */
+	wrmsrl(MSR_AMD_DBG_EXTN_CFG, val | 3ULL << 3);
+}
+
+static inline u64 get_debug_extn_cfg(void)
+{
+	u64 val;
+
+	rdmsrl(MSR_AMD_DBG_EXTN_CFG, val);
+	return val;
+}
+
+static bool __init amd_brs_detect(void)
+{
+	if (!boot_cpu_has(X86_FEATURE_AMD_BRS))
+		return false;
+
+	switch (boot_cpu_data.x86) {
+	case 0x19: /* AMD Fam19h (Zen3) */
+		x86_pmu.lbr_nr = 16;
+
+		/* No hardware filtering supported */
+		x86_pmu.lbr_sel_map = NULL;
+		x86_pmu.lbr_sel_mask = 0;
+		break;
+	default:
+		return false;
+	}
+
+	return true;
+}
+
+/*
+ * Current BRS implementation does not support branch type or privilege level
+ * filtering. Therefore, this function simply enforces these limitations. No need for
+ * a br_sel_map. Software filtering is not supported because it would not correlate well
+ * with a sampling period.
+ */
+int amd_brs_setup_filter(struct perf_event *event)
+{
+	u64 type = event->attr.branch_sample_type;
+
+	/* No BRS support */
+	if (!x86_pmu.lbr_nr)
+		return -EOPNOTSUPP;
+
+	/* Can only capture all branches, i.e., no filtering */
+	if ((type & ~PERF_SAMPLE_BRANCH_PLM_ALL) != PERF_SAMPLE_BRANCH_ANY)
+		return -EINVAL;
+
+	/* can only capture at all priv levels due to the way BRS works */
+	if ((type & PERF_SAMPLE_BRANCH_PLM_ALL) != PERF_SAMPLE_BRANCH_PLM_ALL)
+		return -EINVAL;
+
+	return 0;
+}
+
+/* tos = top of stack, i.e., last valid entry written */
+static inline int amd_brs_get_tos(union amd_debug_extn_cfg *cfg)
+{
+	/*
+	 * msroff: index of next entry to write so top-of-stack is one off
+	 * if BRS is full then msroff is set back to 0.
+	 */
+	return (cfg->msroff ? cfg->msroff : x86_pmu.lbr_nr) - 1;
+}
+
+/*
+ * make sure we have a sane BRS offset to begin with
+ * especially with kexec
+ */
+void amd_brs_reset(void)
+{
+	/*
+	 * Reset config
+	 */
+	set_debug_extn_cfg(0);
+
+	/*
+	 * Mark first entry as poisoned
+	 */
+	wrmsrl(brs_to(0), BRS_POISON);
+}
+
+int __init amd_brs_init(void)
+{
+	if (!amd_brs_detect())
+		return -EOPNOTSUPP;
+
+	pr_cont("%d-deep BRS, ", x86_pmu.lbr_nr);
+
+	return 0;
+}
+
+void amd_brs_enable(void)
+{
+	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+	union amd_debug_extn_cfg cfg;
+
+	/* Activate only on first user */
+	if (++cpuc->brs_active > 1)
+		return;
+
+	cfg.val    = 0; /* reset all fields */
+	cfg.brsmen = 1; /* enable branch sampling */
+
+	/* Set enable bit */
+	set_debug_extn_cfg(cfg.val);
+}
+
+void amd_brs_disable(void)
+{
+	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+	union amd_debug_extn_cfg cfg;
+
+	/* Check if active (could be disabled via x86_pmu_disable_all()) */
+	if (!cpuc->brs_active)
+		return;
+
+	/* Only disable for last user */
+	if (--cpuc->brs_active)
+		return;
+
+	/*
+	 * Clear the brsmen bit but preserve the others as they contain
+	 * useful state such as vb and msroff
+	 */
+	cfg.val = get_debug_extn_cfg();
+
+	/*
+	 * When coming in on interrupt and BRS is full, then hw will have
+	 * already stopped BRS, no need to issue wrmsr again
+	 */
+	if (cfg.brsmen) {
+		cfg.brsmen = 0;
+		set_debug_extn_cfg(cfg.val);
+	}
+}
+
+/*
+ * Caller must ensure amd_brs_inuse() is true before calling
+ * return:
+ */
+void amd_brs_drain(void)
+{
+	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+	struct perf_event *event = cpuc->events[0];
+	union amd_debug_extn_cfg cfg;
+	u32 i, nr = 0, num, tos, start;
+	u32 shift = 64 - boot_cpu_data.x86_virt_bits;
+
+	/*
+	 * BRS event forced on PMC0,
+	 * so check if there is an event.
+	 * It is possible to have lbr_users > 0 but the event
+	 * not yet scheduled due to long latency PMU irq
+	 */
+	if (!event)
+		goto empty;
+
+	cfg.val = get_debug_extn_cfg();
+
+	/* Sanity check [0-x86_pmu.lbr_nr] */
+	if (WARN_ON_ONCE(cfg.msroff >= x86_pmu.lbr_nr))
+		goto empty;
+
+	/* No valid branch */
+	if (cfg.vb == 0)
+		goto empty;
+
+	/*
+	 * msr.off points to next entry to be written
+	 * tos = most recent entry index = msr.off - 1
+	 * BRS register buffer saturates, so we know we have
+	 * start < tos and that we have to read from start to tos
+	 */
+	start = 0;
+	tos = amd_brs_get_tos(&cfg);
+
+	num = tos - start + 1;
+
+	/*
+	 * BRS is only one pass (saturation) from MSROFF to depth-1
+	 * MSROFF wraps to zero when buffer is full
+	 */
+	for (i = 0; i < num; i++) {
+		u32 brs_idx = tos - i;
+		u64 from, to;
+
+		rdmsrl(brs_to(brs_idx), to);
+
+		/* Entry does not belong to us (as marked by kernel) */
+		if (to == BRS_POISON)
+			break;
+
+		rdmsrl(brs_from(brs_idx), from);
+
+		/*
+		 * Sign-extend SAMP_BR_TO to 64 bits, bits 61-63 are reserved.
+		 * Necessary to generate proper virtual addresses suitable for
+		 * symbolization
+		 */
+		to = (u64)(((s64)to << shift) >> shift);
+
+		cpuc->lbr_entries[nr].from = from;
+		cpuc->lbr_entries[nr].to   = to;
+
+		/* Other lbr_entries fields are unsupported, set to zero */
+		cpuc->lbr_entries[nr].val = 0;
+
+		nr++;
+	}
+empty:
+	/* Record number of sampled branches */
+	cpuc->lbr_stack.nr = nr;
+}
+
+/*
+ * Poison most recent entry to prevent reuse by next task
+ * required because BRS entry are not tagged by PID
+ */
+static void amd_brs_poison_buffer(void)
+{
+	union amd_debug_extn_cfg cfg;
+	unsigned int idx;
+
+	/* Get current state */
+	cfg.val = get_debug_extn_cfg();
+
+	/* idx is most recently written entry */
+	idx = amd_brs_get_tos(&cfg);
+
+	/* Poison target of entry */
+	wrmsrl(brs_to(idx), BRS_POISON);
+}
+
+/*
+ * On context switch in, we need to make sure no samples from previous user
+ * are left in the BRS.
+ *
+ * On ctxswin, sched_in = true, called after the PMU has started
+ * On ctxswout, sched_in = false, called before the PMU is stopped
+ */
+void amd_pmu_brs_sched_task(struct perf_event_context *ctx, bool sched_in)
+{
+	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+
+	/* no active users */
+	if (!cpuc->lbr_users)
+		return;
+
+	/*
+	 * On context switch in, we need to ensure we do not use entries
+	 * from previous BRS user on that CPU, so we poison the buffer as
+	 * a faster way compared to resetting all entries.
+	 */
+	if (sched_in)
+		amd_brs_poison_buffer();
+}
diff --git a/arch/x86/events/amd/core.c b/arch/x86/events/amd/core.c
index 9687a8aef01c..86adb0e879c6 100644
--- a/arch/x86/events/amd/core.c
+++ b/arch/x86/events/amd/core.c
@@ -327,6 +327,8 @@ static inline bool amd_is_pair_event_code(struct hw_perf_event *hwc)
 
 static int amd_core_hw_config(struct perf_event *event)
 {
+	int ret = 0;
+
 	if (event->attr.exclude_host && event->attr.exclude_guest)
 		/*
 		 * When HO == GO == 1 the hardware treats that as GO == HO == 0
@@ -343,7 +345,32 @@ static int amd_core_hw_config(struct perf_event *event)
 	if ((x86_pmu.flags & PMU_FL_PAIR) && amd_is_pair_event_code(&event->hw))
 		event->hw.flags |= PERF_X86_EVENT_PAIR;
 
-	return 0;
+	/*
+	 * if branch stack is requested
+	 */
+	if (has_branch_stack(event) && is_sampling_event(event)) {
+		/*
+		 * BRS implementation does not work with frequency mode
+		 * reprogramming of the period.
+		 */
+		if (event->attr.freq)
+			return -EINVAL;
+		/*
+		 * The kernel subtracts BRS depth from period, so it must be big enough
+		 */
+		if (event->attr.sample_period <= x86_pmu.lbr_nr)
+			return -EINVAL;
+
+		/*
+		 * Check if we can allow PERF_SAMPLE_BRANCH_STACK
+		 */
+		ret = amd_brs_setup_filter(event);
+
+		/* only set in case of success */
+		if (!ret)
+			event->hw.flags |= PERF_X86_EVENT_AMD_BRS;
+	}
+	return ret;
 }
 
 static inline int amd_is_nb_event(struct hw_perf_event *hwc)
@@ -366,9 +393,6 @@ static int amd_pmu_hw_config(struct perf_event *event)
 	if (event->attr.precise_ip && get_ibs_caps())
 		return -ENOENT;
 
-	if (has_branch_stack(event))
-		return -EOPNOTSUPP;
-
 	ret = x86_pmu_hw_config(event);
 	if (ret)
 		return ret;
@@ -555,6 +579,8 @@ static void amd_pmu_cpu_starting(int cpu)
 
 	cpuc->amd_nb->nb_id = nb_id;
 	cpuc->amd_nb->refcnt++;
+
+	amd_brs_reset();
 }
 
 static void amd_pmu_cpu_dead(int cpu)
@@ -634,8 +660,38 @@ static void amd_pmu_disable_all(void)
 	}
 }
 
+static void amd_pmu_enable_event(struct perf_event *event)
+{
+	x86_pmu_enable_event(event);
+
+	if (__this_cpu_read(cpu_hw_events.enabled)) {
+		if (is_amd_brs(&event->hw))
+			amd_brs_enable();
+	}
+}
+
+static void amd_pmu_enable_all(int added)
+{
+	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+	struct hw_perf_event *hwc;
+	int idx;
+
+	for (idx = 0; idx < x86_pmu.num_counters; idx++) {
+		hwc = &cpuc->events[idx]->hw;
+
+		/* only activate events which are marked as active */
+		if (!test_bit(idx, cpuc->active_mask))
+			continue;
+
+		amd_pmu_enable_event(cpuc->events[idx]);
+	}
+}
+
 static void amd_pmu_disable_event(struct perf_event *event)
 {
+	if (is_amd_brs(&event->hw))
+		amd_brs_disable();
+
 	x86_pmu_disable_event(event);
 
 	/*
@@ -651,6 +707,18 @@ static void amd_pmu_disable_event(struct perf_event *event)
 	amd_pmu_wait_on_overflow(event->hw.idx);
 }
 
+static void amd_pmu_add_event(struct perf_event *event)
+{
+	if (needs_branch_stack(event))
+		amd_pmu_brs_add(event);
+}
+
+static void amd_pmu_del_event(struct perf_event *event)
+{
+	if (needs_branch_stack(event))
+		amd_pmu_brs_del(event);
+}
+
 /*
  * Because of NMI latency, if multiple PMC counters are active or other sources
  * of NMIs are received, the perf NMI handler can handle one or more overflowed
@@ -671,11 +739,31 @@ static void amd_pmu_disable_event(struct perf_event *event)
  */
 static int amd_pmu_handle_irq(struct pt_regs *regs)
 {
+	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
 	int handled;
+	int pmu_enabled;
+
+	/*
+	 * Save the PMU state.
+	 * It needs to be restored when leaving the handler.
+	 */
+	pmu_enabled = cpuc->enabled;
+	cpuc->enabled = 0;
+
+	/* stop everything (includes BRS) */
+	amd_pmu_disable_all();
+
+	/* Drain BRS if in use (could be inactive) */
+	if (cpuc->lbr_users)
+		amd_brs_drain();
 
 	/* Process any counter overflows */
 	handled = x86_pmu_handle_irq(regs);
 
+	cpuc->enabled = pmu_enabled;
+	if (pmu_enabled)
+		amd_pmu_enable_all(0);
+
 	/*
 	 * If a counter was handled, record a timestamp such that un-handled
 	 * NMIs will be claimed if arriving within that window.
@@ -897,6 +985,51 @@ static void amd_put_event_constraints_f17h(struct cpu_hw_events *cpuc,
 		--cpuc->n_pair;
 }
 
+/*
+ * Because of the way BRS operates with an inactive and active phases, and
+ * the link to one counter, it is not possible to have two events using BRS
+ * scheduled at the same time. There would be an issue with enforcing the
+ * period of each one and given that the BRS saturates, it would not be possible
+ * to guarantee correlated content for all events. Therefore, in situations
+ * where multiple events want to use BRS, the kernel enforces mutual exclusion.
+ * Exclusion is enforced by chosing only one counter for events using BRS.
+ * The event scheduling logic will then automatically multiplex the
+ * events and ensure that at most one event is actively using BRS.
+ *
+ * The BRS counter could be any counter, but there is no constraint on Fam19h,
+ * therefore all counters are equal and thus we pick the first one: PMC0
+ */
+static struct event_constraint amd_fam19h_brs_cntr0_constraint =
+	EVENT_CONSTRAINT(0, 0x1, AMD64_RAW_EVENT_MASK);
+
+static struct event_constraint amd_fam19h_brs_pair_cntr0_constraint =
+	__EVENT_CONSTRAINT(0, 0x1, AMD64_RAW_EVENT_MASK, 1, 0, PERF_X86_EVENT_PAIR);
+
+static struct event_constraint *
+amd_get_event_constraints_f19h(struct cpu_hw_events *cpuc, int idx,
+			  struct perf_event *event)
+{
+	struct hw_perf_event *hwc = &event->hw;
+	bool has_brs = is_amd_brs(hwc);
+
+	/*
+	 * In case BRS is used with an event requiring a counter pair,
+	 * the kernel allows it but only on counter 0 & 1 to enforce
+	 * multiplexing requiring to protect BRS in case of multiple
+	 * BRS users
+	 */
+	if (amd_is_pair_event_code(hwc)) {
+		return has_brs ? &amd_fam19h_brs_pair_cntr0_constraint
+			       : &pair_constraint;
+	}
+
+	if (has_brs)
+		return &amd_fam19h_brs_cntr0_constraint;
+
+	return &unconstrained;
+}
+
+
 static ssize_t amd_event_sysfs_show(char *page, u64 config)
 {
 	u64 event = (config & ARCH_PERFMON_EVENTSEL_EVENT) |
@@ -905,12 +1038,19 @@ static ssize_t amd_event_sysfs_show(char *page, u64 config)
 	return x86_event_sysfs_show(page, config, event);
 }
 
+static void amd_pmu_sched_task(struct perf_event_context *ctx,
+				 bool sched_in)
+{
+	if (sched_in && x86_pmu.lbr_nr)
+		amd_pmu_brs_sched_task(ctx, sched_in);
+}
+
 static __initconst const struct x86_pmu amd_pmu = {
 	.name			= "AMD",
 	.handle_irq		= amd_pmu_handle_irq,
 	.disable_all		= amd_pmu_disable_all,
-	.enable_all		= x86_pmu_enable_all,
-	.enable			= x86_pmu_enable_event,
+	.enable_all		= amd_pmu_enable_all,
+	.enable			= amd_pmu_enable_event,
 	.disable		= amd_pmu_disable_event,
 	.hw_config		= amd_pmu_hw_config,
 	.schedule_events	= x86_schedule_events,
@@ -920,6 +1060,8 @@ static __initconst const struct x86_pmu amd_pmu = {
 	.event_map		= amd_pmu_event_map,
 	.max_events		= ARRAY_SIZE(amd_perfmon_event_map),
 	.num_counters		= AMD64_NUM_COUNTERS,
+	.add			= amd_pmu_add_event,
+	.del			= amd_pmu_del_event,
 	.cntval_bits		= 48,
 	.cntval_mask		= (1ULL << 48) - 1,
 	.apic			= 1,
@@ -938,6 +1080,37 @@ static __initconst const struct x86_pmu amd_pmu = {
 	.amd_nb_constraints	= 1,
 };
 
+static ssize_t branches_show(struct device *cdev,
+			      struct device_attribute *attr,
+			      char *buf)
+{
+	return snprintf(buf, PAGE_SIZE, "%d\n", x86_pmu.lbr_nr);
+}
+
+static DEVICE_ATTR_RO(branches);
+
+static struct attribute *amd_pmu_brs_attrs[] = {
+	&dev_attr_branches.attr,
+	NULL,
+};
+
+static umode_t
+amd_brs_is_visible(struct kobject *kobj, struct attribute *attr, int i)
+{
+	return x86_pmu.lbr_nr ? attr->mode : 0;
+}
+
+static struct attribute_group group_caps_amd_brs = {
+	.name  = "caps",
+	.attrs = amd_pmu_brs_attrs,
+	.is_visible = amd_brs_is_visible,
+};
+
+static const struct attribute_group *amd_attr_update[] = {
+	&group_caps_amd_brs,
+	NULL,
+};
+
 static int __init amd_core_pmu_init(void)
 {
 	u64 even_ctr_mask = 0ULL;
@@ -989,6 +1162,23 @@ static int __init amd_core_pmu_init(void)
 		x86_pmu.flags |= PMU_FL_PAIR;
 	}
 
+	if (boot_cpu_data.x86 >= 0x19) {
+		/*
+		 * On AMD, invoking pmu_disable_all() is very expensive and the function is
+		 * invoked on context-switch in via sched_task_in(), so enable only when necessary
+		 */
+		if (!amd_brs_init()) {
+			x86_pmu.get_event_constraints = amd_get_event_constraints_f19h;
+			x86_pmu.sched_task = amd_pmu_sched_task;
+			/*
+			 * The put_event_constraints callback is shared with
+			 * Fam17h, set above
+			 */
+		}
+	}
+
+	x86_pmu.attr_update = amd_attr_update;
+
 	pr_cont("core perfctr, ");
 	return 0;
 }
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 2a57dbed4894..66a073f21bb7 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -681,11 +681,16 @@ void x86_pmu_disable_all(void)
 
 		if (!test_bit(idx, cpuc->active_mask))
 			continue;
+
+		if (is_amd_brs(hwc))
+			amd_brs_disable();
+
 		rdmsrl(x86_pmu_config_addr(idx), val);
 		if (!(val & ARCH_PERFMON_EVENTSEL_ENABLE))
 			continue;
 		val &= ~ARCH_PERFMON_EVENTSEL_ENABLE;
 		wrmsrl(x86_pmu_config_addr(idx), val);
+
 		if (is_counter_pair(hwc))
 			wrmsrl(x86_pmu_config_addr(idx + 1), 0);
 	}
@@ -1334,6 +1339,10 @@ static void x86_pmu_enable(struct pmu *pmu)
 			if (hwc->state & PERF_HES_ARCH)
 				continue;
 
+			/*
+			 * if cpuc->enabled = 0, then no wrmsr as
+			 * per x86_pmu_enable_event()
+			 */
 			x86_pmu_start(event, PERF_EF_RELOAD);
 		}
 		cpuc->n_added = 0;
@@ -1700,11 +1709,15 @@ int x86_pmu_handle_irq(struct pt_regs *regs)
 		 * event overflow
 		 */
 		handled++;
-		perf_sample_data_init(&data, 0, event->hw.last_period);
 
 		if (!x86_perf_event_set_period(event))
 			continue;
 
+		perf_sample_data_init(&data, 0, event->hw.last_period);
+
+		if (has_branch_stack(event))
+			data.br_stack = &cpuc->lbr_stack;
+
 		if (perf_event_overflow(event, &data, regs))
 			x86_pmu_stop(event, 0);
 	}
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index e3ac05c97b5e..12fb1e68f188 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -66,22 +66,23 @@ static inline bool constraint_match(struct event_constraint *c, u64 ecode)
 /*
  * struct hw_perf_event.flags flags
  */
-#define PERF_X86_EVENT_PEBS_LDLAT	0x0001 /* ld+ldlat data address sampling */
-#define PERF_X86_EVENT_PEBS_ST		0x0002 /* st data address sampling */
-#define PERF_X86_EVENT_PEBS_ST_HSW	0x0004 /* haswell style datala, store */
-#define PERF_X86_EVENT_PEBS_LD_HSW	0x0008 /* haswell style datala, load */
-#define PERF_X86_EVENT_PEBS_NA_HSW	0x0010 /* haswell style datala, unknown */
-#define PERF_X86_EVENT_EXCL		0x0020 /* HT exclusivity on counter */
-#define PERF_X86_EVENT_DYNAMIC		0x0040 /* dynamic alloc'd constraint */
-#define PERF_X86_EVENT_RDPMC_ALLOWED	0x0080 /* grant rdpmc permission */
-#define PERF_X86_EVENT_EXCL_ACCT	0x0100 /* accounted EXCL event */
-#define PERF_X86_EVENT_AUTO_RELOAD	0x0200 /* use PEBS auto-reload */
-#define PERF_X86_EVENT_LARGE_PEBS	0x0400 /* use large PEBS */
-#define PERF_X86_EVENT_PEBS_VIA_PT	0x0800 /* use PT buffer for PEBS */
-#define PERF_X86_EVENT_PAIR		0x1000 /* Large Increment per Cycle */
-#define PERF_X86_EVENT_LBR_SELECT	0x2000 /* Save/Restore MSR_LBR_SELECT */
-#define PERF_X86_EVENT_TOPDOWN		0x4000 /* Count Topdown slots/metrics events */
-#define PERF_X86_EVENT_PEBS_STLAT	0x8000 /* st+stlat data address sampling */
+#define PERF_X86_EVENT_PEBS_LDLAT	0x00001 /* ld+ldlat data address sampling */
+#define PERF_X86_EVENT_PEBS_ST		0x00002 /* st data address sampling */
+#define PERF_X86_EVENT_PEBS_ST_HSW	0x00004 /* haswell style datala, store */
+#define PERF_X86_EVENT_PEBS_LD_HSW	0x00008 /* haswell style datala, load */
+#define PERF_X86_EVENT_PEBS_NA_HSW	0x00010 /* haswell style datala, unknown */
+#define PERF_X86_EVENT_EXCL		0x00020 /* HT exclusivity on counter */
+#define PERF_X86_EVENT_DYNAMIC		0x00040 /* dynamic alloc'd constraint */
+#define PERF_X86_EVENT_RDPMC_ALLOWED	0x00080 /* grant rdpmc permission */
+#define PERF_X86_EVENT_EXCL_ACCT	0x00100 /* accounted EXCL event */
+#define PERF_X86_EVENT_AUTO_RELOAD	0x00200 /* use PEBS auto-reload */
+#define PERF_X86_EVENT_LARGE_PEBS	0x00400 /* use large PEBS */
+#define PERF_X86_EVENT_PEBS_VIA_PT	0x00800 /* use PT buffer for PEBS */
+#define PERF_X86_EVENT_PAIR		0x01000 /* Large Increment per Cycle */
+#define PERF_X86_EVENT_LBR_SELECT	0x02000 /* Save/Restore MSR_LBR_SELECT */
+#define PERF_X86_EVENT_TOPDOWN		0x04000 /* Count Topdown slots/metrics events */
+#define PERF_X86_EVENT_PEBS_STLAT	0x08000 /* st+stlat data address sampling */
+#define PERF_X86_EVENT_AMD_BRS		0x10000 /* AMD Branch Sampling */
 
 static inline bool is_topdown_count(struct perf_event *event)
 {
@@ -323,6 +324,8 @@ struct cpu_hw_events {
 	 * AMD specific bits
 	 */
 	struct amd_nb			*amd_nb;
+	int				brs_active; /* BRS is enabled */
+
 	/* Inverted mask of bits to clear in the perf_ctr ctrl registers */
 	u64				perf_ctr_virt_mask;
 	int				n_pair; /* Large increment events */
@@ -1097,6 +1100,11 @@ int x86_pmu_hw_config(struct perf_event *event);
 
 void x86_pmu_disable_all(void);
 
+static inline bool is_amd_brs(struct hw_perf_event *hwc)
+{
+	return hwc->flags & PERF_X86_EVENT_AMD_BRS;
+}
+
 static inline bool is_counter_pair(struct hw_perf_event *hwc)
 {
 	return hwc->flags & PERF_X86_EVENT_PAIR;
@@ -1137,6 +1145,7 @@ static inline void x86_pmu_disable_event(struct perf_event *event)
 
 	if (is_counter_pair(hwc))
 		wrmsrl(x86_pmu_config_addr(hwc->idx + 1), 0);
+
 }
 
 void x86_pmu_enable_event(struct perf_event *event);
@@ -1202,6 +1211,48 @@ static inline bool fixed_counter_disabled(int i, struct pmu *pmu)
 #ifdef CONFIG_CPU_SUP_AMD
 
 int amd_pmu_init(void);
+int amd_brs_init(void);
+void amd_brs_disable(void);
+void amd_brs_enable(void);
+void amd_brs_drain(void);
+void amd_brs_disable_all(void);
+int amd_brs_setup_filter(struct perf_event *event);
+void amd_brs_reset(void);
+
+static inline void amd_pmu_brs_add(struct perf_event *event)
+{
+	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+
+	/*
+	 * No need to reset BRS because it is reset
+	 * on brs_enable() and it is saturating
+	 */
+	cpuc->lbr_users++;
+	perf_sched_cb_inc(event->ctx->pmu);
+}
+
+static inline void amd_pmu_brs_del(struct perf_event *event)
+{
+	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+
+	cpuc->lbr_users--;
+	WARN_ON_ONCE(cpuc->lbr_users < 0);
+
+	perf_sched_cb_dec(event->ctx->pmu);
+}
+
+void amd_pmu_brs_sched_task(struct perf_event_context *ctx, bool sched_in);
+
+/*
+ * check if BRS is activated on the CPU
+ * active defined as it has non-zero users and DBG_EXT_CFG.BRSEN=1
+ */
+static inline bool amd_brs_active(void)
+{
+	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+
+	return cpuc->brs_active;
+}
 
 #else /* CONFIG_CPU_SUP_AMD */
 
@@ -1210,6 +1261,16 @@ static inline int amd_pmu_init(void)
 	return 0;
 }
 
+static inline int amd_brs_init(void)
+{
+	return 0;
+}
+
+static inline void amd_brs_drain(void)
+{
+	return 0;
+}
+
 #endif /* CONFIG_CPU_SUP_AMD */
 
 static inline int is_pebs_pt(struct perf_event *event)
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index a7c413432b33..ed73325a8499 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -648,6 +648,10 @@
 #define MSR_IA32_PERF_CTL		0x00000199
 #define INTEL_PERF_CTL_MASK		0xffff
 
+/* AMD Branch Sampling configuration */
+#define MSR_AMD_DBG_EXTN_CFG		0xc000010f
+#define MSR_AMD_SAMP_BR_FROM		0xc0010300
+
 #define MSR_IA32_MPERF			0x000000e7
 #define MSR_IA32_APERF			0x000000e8
 
-- 
2.33.0.153.gba50c8fa24-goog



* [PATCH v1 04/13] perf/x86/amd: add branch-brs helper event for Fam19h BRS
  2021-09-09  7:56 [PATCH v1 00/13] perf/x86/amd: Add AMD Fam19h Branch Sampling support Stephane Eranian
                   ` (2 preceding siblings ...)
  2021-09-09  7:56 ` [PATCH v1 03/13] perf/x86/amd: add AMD Fam19h Branch Sampling support Stephane Eranian
@ 2021-09-09  7:56 ` Stephane Eranian
  2021-09-09  7:56 ` [PATCH v1 05/13] perf/x86/amd: enable branch sampling priv level filtering Stephane Eranian
                   ` (9 subsequent siblings)
  13 siblings, 0 replies; 41+ messages in thread
From: Stephane Eranian @ 2021-09-09  7:56 UTC (permalink / raw)
  To: linux-kernel; +Cc: peterz, acme, jolsa, kim.phillips, namhyung, irogers

This patch adds a pseudo event called branch-brs to help use the AMD Fam19h Branch
Sampling feature (BRS). BRS samples taken branches, so it is best used when sampling
on a retired taken branch event (0xc4), which is what BRS captures.
Instead of trying to remember the event code or the actual event name, users can simply do:

$ perf record -b -e cpu/branch-brs/ -c 1000037 .....
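
With the patch applied, the alias should also be visible through the standard sysfs event files,
for example:

$ cat /sys/bus/event_source/devices/cpu/events/branch-brs
event=0xc4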

Signed-off-by: Stephane Eranian <eranian@google.com>
---
 arch/x86/events/amd/core.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/arch/x86/events/amd/core.c b/arch/x86/events/amd/core.c
index 86adb0e879c6..d6d5119260a9 100644
--- a/arch/x86/events/amd/core.c
+++ b/arch/x86/events/amd/core.c
@@ -1106,8 +1106,24 @@ static struct attribute_group group_caps_amd_brs = {
 	.is_visible = amd_brs_is_visible,
 };
 
+#define AMD_FAM19H_BRS_EVENT 0xc4 /* Fam19h RETIRED_TAKEN_BRANCH_INSTRUCTIONS */
+EVENT_ATTR_STR(branch-brs, amd_branch_brs,
+	       "event=" __stringify(AMD_FAM19H_BRS_EVENT)"\n");
+
+static struct attribute *amd_brs_events_attrs[] = {
+	EVENT_PTR(amd_branch_brs),
+	NULL,
+};
+
+static struct attribute_group group_events_amd_brs = {
+	.name       = "events",
+	.attrs      = amd_brs_events_attrs,
+	.is_visible = amd_brs_is_visible,
+};
+
 static const struct attribute_group *amd_attr_update[] = {
 	&group_caps_amd_brs,
+	&group_events_amd_brs,
 	NULL,
 };
 
-- 
2.33.0.153.gba50c8fa24-goog



* [PATCH v1 05/13] perf/x86/amd: enable branch sampling priv level filtering
  2021-09-09  7:56 [PATCH v1 00/13] perf/x86/amd: Add AMD Fam19h Branch Sampling support Stephane Eranian
                   ` (3 preceding siblings ...)
  2021-09-09  7:56 ` [PATCH v1 04/13] perf/x86/amd: add branch-brs helper event for Fam19h BRS Stephane Eranian
@ 2021-09-09  7:56 ` Stephane Eranian
  2021-09-09  7:56 ` [PATCH v1 06/13] perf/x86/amd: add AMD branch sampling period adjustment Stephane Eranian
                   ` (8 subsequent siblings)
  13 siblings, 0 replies; 41+ messages in thread
From: Stephane Eranian @ 2021-09-09  7:56 UTC (permalink / raw)
  To: linux-kernel; +Cc: peterz, acme, jolsa, kim.phillips, namhyung, irogers

The AMD Branch Sampling feature does not provide hardware filtering by
privilege level. The associated PMU counter does, but not the branch
sampling itself. Given how BRS operates, there is a possibility that
BRS captures kernel-level branches even though the event is programmed to
count only at the user level. This patch implements a workaround in software
by removing the branches which belong to the wrong privilege level. The privilege
level is evaluated on the target of the branch and not the source, so as to be
compatible with other architectures. As a consequence of this patch, the number
of entries in the PERF_RECORD_BRANCH_STACK buffer may be less than the maximum (16).
It could even be zero. Another consequence is that consecutive entries in the
branch stack may not reflect the actual code path and may have discontinuities
when kernel branches are suppressed. But this is no different from what
happens on other architectures.
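
With this change, the privilege level of the sampled branches can be restricted from the command
line, for instance via perf record's branch filter option:

$ perf record -j any,u -e cpu/branch-brs/ -c 1000037 test

which keeps only branches whose target is at the user level.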

Signed-off-by: Stephane Eranian <eranian@google.com>
---
 arch/x86/events/amd/brs.c | 26 ++++++++++++++++++++------
 1 file changed, 20 insertions(+), 6 deletions(-)

diff --git a/arch/x86/events/amd/brs.c b/arch/x86/events/amd/brs.c
index 86dbc6d06815..23b969001857 100644
--- a/arch/x86/events/amd/brs.c
+++ b/arch/x86/events/amd/brs.c
@@ -92,10 +92,6 @@ int amd_brs_setup_filter(struct perf_event *event)
 	if ((type & ~PERF_SAMPLE_BRANCH_PLM_ALL) != PERF_SAMPLE_BRANCH_ANY)
 		return -EINVAL;
 
-	/* can only capture at all priv levels due to the way BRS works */
-	if ((type & PERF_SAMPLE_BRANCH_PLM_ALL) != PERF_SAMPLE_BRANCH_PLM_ALL)
-		return -EINVAL;
-
 	return 0;
 }
 
@@ -181,6 +177,21 @@ void amd_brs_disable(void)
 	}
 }
 
+static bool amd_brs_match_plm(struct perf_event *event, u64 to)
+{
+	int type = event->attr.branch_sample_type;
+	int plm_k = PERF_SAMPLE_BRANCH_KERNEL | PERF_SAMPLE_BRANCH_HV;
+	int plm_u = PERF_SAMPLE_BRANCH_USER;
+
+	if (!(type & plm_k) && kernel_ip(to))
+		return 0;
+
+	if (!(type & plm_u) && !kernel_ip(to))
+		return 0;
+
+	return 1;
+}
+
 /*
  * Caller must ensure amd_brs_inuse() is true before calling
  * return:
@@ -237,8 +248,6 @@ void amd_brs_drain(void)
 		if (to == BRS_POISON)
 			break;
 
-		rdmsrl(brs_from(brs_idx), from);
-
 		/*
 		 * Sign-extend SAMP_BR_TO to 64 bits, bits 61-63 are reserved.
 		 * Necessary to generate proper virtual addresses suitable for
@@ -246,6 +255,11 @@ void amd_brs_drain(void)
 		 */
 		to = (u64)(((s64)to << shift) >> shift);
 
+		if (!amd_brs_match_plm(event, to))
+			continue;
+
+		rdmsrl(brs_from(brs_idx), from);
+
 		cpuc->lbr_entries[nr].from = from;
 		cpuc->lbr_entries[nr].to   = to;
 
-- 
2.33.0.153.gba50c8fa24-goog



* [PATCH v1 06/13] perf/x86/amd: add AMD branch sampling period adjustment
  2021-09-09  7:56 [PATCH v1 00/13] perf/x86/amd: Add AMD Fam19h Branch Sampling support Stephane Eranian
                   ` (4 preceding siblings ...)
  2021-09-09  7:56 ` [PATCH v1 05/13] perf/x86/amd: enable branch sampling priv level filtering Stephane Eranian
@ 2021-09-09  7:56 ` Stephane Eranian
  2021-09-09  7:56 ` [PATCH v1 07/13] perf/core: add idle hooks Stephane Eranian
                   ` (7 subsequent siblings)
  13 siblings, 0 replies; 41+ messages in thread
From: Stephane Eranian @ 2021-09-09  7:56 UTC (permalink / raw)
  To: linux-kernel; +Cc: peterz, acme, jolsa, kim.phillips, namhyung, irogers

This supplemental patch adds the code to adjust the sampling event period
when used with the Branch Sampling feature (BRS). Given the depth of the
BRS (16 on Fam19h Zen3), the period is reduced by that depth such that in
the best case scenario, BRS saturates at the desired sampling period.
In practice, though, the processor may execute more branches. Given a desired
period P and a depth D, the kernel programs the actual period at P - D. After
P - D occurrences of the sampling event, the counter overflows. It may then take
X branches (the skid) before the NMI is caught and held by the hardware and BRS
activates. Then, after D branches, BRS saturates and the NMI is delivered. With
no skid, the effective period would be (P - D) + D = P. In practice, however, it
will likely be (P - D) + X + D. There is no way to eliminate or predict X.
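
As a concrete example, using the sampling period from the rest of this series (P = 1000037,
D = 16):

  programmed period : P - D = 1000037 - 16 = 1000021
  no skid           : (P - D) + D     = 1000021 + 16     = 1000037 = P
  with skid X       : (P - D) + X + D = 1000021 + X + 16 = 1000037 + X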

Signed-off-by: Stephane Eranian <eranian@google.com>
---
 arch/x86/events/core.c       |  7 +++++++
 arch/x86/events/perf_event.h | 12 ++++++++++++
 2 files changed, 19 insertions(+)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 66a073f21bb7..e1fb0e5ceaa3 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -1375,6 +1375,13 @@ int x86_perf_event_set_period(struct perf_event *event)
 	    x86_pmu.set_topdown_event_period)
 		return x86_pmu.set_topdown_event_period(event);
 
+	/*
+	 * decrease period by the depth of the BRS feature to get
+	 * the last N taken branches and approximate the desired period
+	 */
+	if (has_branch_stack(event))
+		period = amd_brs_adjust_period(period);
+
 	/*
 	 * If we are way outside a reasonable range then just skip forward:
 	 */
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 12fb1e68f188..a275553e78b9 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -1254,6 +1254,14 @@ static inline bool amd_brs_active(void)
 	return cpuc->brs_active;
 }
 
+static inline s64 amd_brs_adjust_period(s64 period)
+{
+	if (period > x86_pmu.lbr_nr)
+		return period - x86_pmu.lbr_nr;
+
+	return period;
+}
+
 #else /* CONFIG_CPU_SUP_AMD */
 
 static inline int amd_pmu_init(void)
@@ -1271,6 +1279,10 @@ static inline void amd_brs_drain(void)
 	return 0;
 }
 
+static inline s64 amd_brs_adjust_period(s64 period)
+{
+	return period;
+}
 #endif /* CONFIG_CPU_SUP_AMD */
 
 static inline int is_pebs_pt(struct perf_event *event)
-- 
2.33.0.153.gba50c8fa24-goog



* [PATCH v1 07/13] perf/core: add idle hooks
  2021-09-09  7:56 [PATCH v1 00/13] perf/x86/amd: Add AMD Fam19h Branch Sampling support Stephane Eranian
                   ` (5 preceding siblings ...)
  2021-09-09  7:56 ` [PATCH v1 06/13] perf/x86/amd: add AMD branch sampling period adjustment Stephane Eranian
@ 2021-09-09  7:56 ` Stephane Eranian
  2021-09-09  9:15   ` Peter Zijlstra
                     ` (2 more replies)
  2021-09-09  7:56 ` [PATCH v1 08/13] perf/x86/core: " Stephane Eranian
                   ` (6 subsequent siblings)
  13 siblings, 3 replies; 41+ messages in thread
From: Stephane Eranian @ 2021-09-09  7:56 UTC (permalink / raw)
  To: linux-kernel; +Cc: peterz, acme, jolsa, kim.phillips, namhyung, irogers

This patch adds a new set of hooks to connect perf_events with the
idle task. On some PMU models, it may be necessary to flush or stop
the PMU when going to low power. Upon return from low power, the opposite
action, i.e., re-enabling the PMU, may be necessary. The patch adds
perf_register_lopwr_cb() to register a callback invoked on entry to and on return
from low power. The callback takes a boolean argument: true means entering
low power, false means returning from it.

The callback is invoked from the idle code with interrupts already
disabled.
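
As a rough sketch (not part of this patch), an arch PMU driver would be expected to wire the
hooks up along these lines; my_pmu_quiesce()/my_pmu_restart() are hypothetical placeholders:

	static void my_pmu_lopwr_cb(bool lopwr_in)
	{
		/* runs from the idle path with interrupts disabled, must not sleep */
		if (lopwr_in)
			my_pmu_quiesce();   /* hypothetical: stop sampling before low power */
		else
			my_pmu_restart();   /* hypothetical: resume sampling on the way out */
	}

	static void my_pmu_setup(struct pmu *pmu)
	{
		/* typically done once, when the PMU is registered */
		perf_register_lopwr_cb(pmu, my_pmu_lopwr_cb);
	}

Events that actually need the callback bump the per-CPU count with perf_lopwr_active_inc() when
they are scheduled in and drop it with perf_lopwr_active_dec() when removed, so that
perf_lopwr_cb() stays a cheap no-op on CPUs with no such event.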

Signed-off-by: Stephane Eranian <eranian@google.com>
---
 include/linux/perf_event.h |  8 ++++++
 kernel/events/core.c       | 58 ++++++++++++++++++++++++++++++++++++++
 kernel/sched/idle.c        | 15 +++++++++-
 3 files changed, 80 insertions(+), 1 deletion(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 2d510ad750ed..32ffc009b2ec 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -300,6 +300,7 @@ struct pmu {
 	/* number of address filters this PMU can do */
 	unsigned int			nr_addr_filters;
 
+	struct list_head		lopwr_entry;
 	/*
 	 * Fully disable/enable this PMU, can be used to protect from the PMI
 	 * as well as for lazy/batch writing of the MSRs.
@@ -430,6 +431,8 @@ struct pmu {
 	void (*sched_task)		(struct perf_event_context *ctx,
 					bool sched_in);
 
+	void (*lopwr_cb)		(bool lopwr_in);
+
 	/*
 	 * Kmem cache of PMU specific data
 	 */
@@ -1429,6 +1432,11 @@ extern void perf_event_task_tick(void);
 extern int perf_event_account_interrupt(struct perf_event *event);
 extern int perf_event_period(struct perf_event *event, u64 value);
 extern u64 perf_event_pause(struct perf_event *event, bool reset);
+extern void perf_lopwr_cb(bool lopwr_in);
+extern void perf_lopwr_active_inc(void);
+extern void perf_lopwr_active_dec(void);
+extern void perf_register_lopwr_cb(struct pmu *pmu, void (*lowpwr_cb)(bool));
+
 #else /* !CONFIG_PERF_EVENTS: */
 static inline void *
 perf_aux_output_begin(struct perf_output_handle *handle,
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 1cb1f9b8392e..f739fd92e74b 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3521,6 +3521,64 @@ void perf_sched_cb_inc(struct pmu *pmu)
 	this_cpu_inc(perf_sched_cb_usages);
 }
 
+/*
+ * The perf_lopwr_cb() is invoked from the idle task. As such it
+ * cannot grab a mutex that may end up sleeping. The idle task cannot
+ * sleep by construction. Therefore we create a spinlock and a new
+ * list of PMUs to invoke on idle. The list is protected by a spinlock
+ * Normally we would use the pmus_lock and iterate over each PMUs. But
+ * mutex is not possible and we need to iterate only over the PMU which
+ * do require a idle callback.
+ */
+static DEFINE_SPINLOCK(lopwr_cb_lock);
+static LIST_HEAD(lopwr_cb_pmus);
+static DEFINE_PER_CPU(int, lopwr_nr_active);
+
+void perf_lopwr_active_inc(void)
+{
+	__this_cpu_inc(lopwr_nr_active);
+}
+
+void perf_lopwr_active_dec(void)
+{
+	__this_cpu_dec(lopwr_nr_active);
+}
+
+/*
+ * lopwr_in = true means going to low power state
+ * lopwr_in = false means returning from low power state
+ */
+void perf_lopwr_cb(bool lopwr_in)
+{
+	struct pmu *pmu;
+	unsigned long flags;
+
+	if (!__this_cpu_read(lopwr_nr_active))
+		return;
+
+	spin_lock_irqsave(&lopwr_cb_lock, flags);
+
+	list_for_each_entry(pmu, &lopwr_cb_pmus, lopwr_entry) {
+		if (pmu->lopwr_cb)
+			pmu->lopwr_cb(lopwr_in);
+	}
+
+	spin_unlock_irqrestore(&lopwr_cb_lock, flags);
+}
+EXPORT_SYMBOL_GPL(perf_lopwr_cb);
+
+void perf_register_lopwr_cb(struct pmu *pmu, void (*func)(bool))
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&lopwr_cb_lock, flags);
+
+	pmu->lopwr_cb   = func;
+	list_add_tail(&pmu->lopwr_entry, &lopwr_cb_pmus);
+
+	spin_unlock_irqrestore(&lopwr_cb_lock, flags);
+}
+
 /*
  * This function provides the context switch callback to the lower code
  * layer. It is invoked ONLY when the context switch callback is enabled.
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 912b47aa99d8..14ce130aee1b 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -179,6 +179,7 @@ static void cpuidle_idle_call(void)
 	 */
 	if (need_resched()) {
 		local_irq_enable();
+		perf_lopwr_cb(false);
 		return;
 	}
 
@@ -191,7 +192,14 @@ static void cpuidle_idle_call(void)
 	if (cpuidle_not_available(drv, dev)) {
 		tick_nohz_idle_stop_tick();
 
+		if (!cpu_idle_force_poll)
+			perf_lopwr_cb(true);
+
 		default_idle_call();
+
+		if (!cpu_idle_force_poll)
+			perf_lopwr_cb(false);
+
 		goto exit_idle;
 	}
 
@@ -249,8 +257,10 @@ static void cpuidle_idle_call(void)
 	/*
 	 * It is up to the idle functions to reenable local interrupts
 	 */
-	if (WARN_ON_ONCE(irqs_disabled()))
+	if (WARN_ON_ONCE(irqs_disabled())) {
 		local_irq_enable();
+		perf_lopwr_cb(false);
+	}
 }
 
 /*
@@ -279,9 +289,12 @@ static void do_idle(void)
 	__current_set_polling();
 	tick_nohz_idle_enter();
 
+
 	while (!need_resched()) {
 		rmb();
 
+		perf_lopwr_cb(true);
+
 		local_irq_disable();
 
 		if (cpu_is_offline(cpu)) {
-- 
2.33.0.153.gba50c8fa24-goog



* [PATCH v1 08/13] perf/x86/core: add idle hooks
  2021-09-09  7:56 [PATCH v1 00/13] perf/x86/amd: Add AMD Fam19h Branch Sampling support Stephane Eranian
                   ` (6 preceding siblings ...)
  2021-09-09  7:56 ` [PATCH v1 07/13] perf/core: add idle hooks Stephane Eranian
@ 2021-09-09  7:56 ` Stephane Eranian
  2021-09-09  9:16   ` Peter Zijlstra
  2021-09-09  7:56 ` [PATCH v1 09/13] perf/x86/amd: add idle hooks for branch sampling Stephane Eranian
                   ` (5 subsequent siblings)
  13 siblings, 1 reply; 41+ messages in thread
From: Stephane Eranian @ 2021-09-09  7:56 UTC (permalink / raw)
  To: linux-kernel; +Cc: peterz, acme, jolsa, kim.phillips, namhyung, irogers

This patch adds the perf idle hooks to the x86 idle routines.

Signed-off-by: Stephane Eranian <eranian@google.com>
---
 arch/x86/include/asm/mwait.h |  6 +++++-
 kernel/sched/idle.c          | 12 ++++++++----
 2 files changed, 13 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/mwait.h b/arch/x86/include/asm/mwait.h
index 29dd27b5a339..92080fdc1d2a 100644
--- a/arch/x86/include/asm/mwait.h
+++ b/arch/x86/include/asm/mwait.h
@@ -4,6 +4,7 @@
 
 #include <linux/sched.h>
 #include <linux/sched/idle.h>
+#include <linux/perf_event.h>
 
 #include <asm/cpufeature.h>
 #include <asm/nospec-branch.h>
@@ -114,8 +115,11 @@ static inline void mwait_idle_with_hints(unsigned long eax, unsigned long ecx)
 		}
 
 		__monitor((void *)&current_thread_info()->flags, 0, 0);
-		if (!need_resched())
+		if (!need_resched()) {
+			perf_lopwr_cb(true);
 			__mwait(eax, ecx);
+			perf_lopwr_cb(false);
+		}
 	}
 	current_clr_polling();
 }
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 14ce130aee1b..c0ddc3c32a33 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -179,7 +179,6 @@ static void cpuidle_idle_call(void)
 	 */
 	if (need_resched()) {
 		local_irq_enable();
-		perf_lopwr_cb(false);
 		return;
 	}
 
@@ -230,6 +229,9 @@ static void cpuidle_idle_call(void)
 		tick_nohz_idle_stop_tick();
 
 		next_state = cpuidle_find_deepest_state(drv, dev, max_latency_ns);
+		if (!cpu_idle_force_poll)
+			perf_lopwr_cb(true);
+
 		call_cpuidle(drv, dev, next_state);
 	} else {
 		bool stop_tick = true;
@@ -244,12 +246,17 @@ static void cpuidle_idle_call(void)
 		else
 			tick_nohz_idle_retain_tick();
 
+		if (!cpu_idle_force_poll)
+			perf_lopwr_cb(true);
+
 		entered_state = call_cpuidle(drv, dev, next_state);
 		/*
 		 * Give the governor an opportunity to reflect on the outcome
 		 */
 		cpuidle_reflect(dev, entered_state);
 	}
+	if (!cpu_idle_force_poll)
+		perf_lopwr_cb(false);
 
 exit_idle:
 	__current_set_polling();
@@ -259,7 +266,6 @@ static void cpuidle_idle_call(void)
 	 */
 	if (WARN_ON_ONCE(irqs_disabled())) {
 		local_irq_enable();
-		perf_lopwr_cb(false);
 	}
 }
 
@@ -293,8 +299,6 @@ static void do_idle(void)
 	while (!need_resched()) {
 		rmb();
 
-		perf_lopwr_cb(true);
-
 		local_irq_disable();
 
 		if (cpu_is_offline(cpu)) {
-- 
2.33.0.153.gba50c8fa24-goog


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH v1 09/13] perf/x86/amd: add idle hooks for branch sampling
  2021-09-09  7:56 [PATCH v1 00/13] perf/x86/amd: Add AMD Fam19h Branch Sampling support Stephane Eranian
                   ` (7 preceding siblings ...)
  2021-09-09  7:56 ` [PATCH v1 08/13] perf/x86/core: " Stephane Eranian
@ 2021-09-09  7:56 ` Stephane Eranian
  2021-09-09  9:20   ` Peter Zijlstra
  2021-09-09  7:56 ` [PATCH v1 10/13] perf tools: add branch-brs as a new event Stephane Eranian
                   ` (4 subsequent siblings)
  13 siblings, 1 reply; 41+ messages in thread
From: Stephane Eranian @ 2021-09-09  7:56 UTC (permalink / raw)
  To: linux-kernel; +Cc: peterz, acme, jolsa, kim.phillips, namhyung, irogers

On AMD Fam19h Zen3, the branch sampling (BRS) feature must be disabled before entering low power
and re-enabled (if was active) when returning from low power. Otherwise, the NMI interrupt may
be held up for too long and cause problems. Stopping BRS will cause the NMI to be delivered if it
was held up.

This patch connects the branch sampling code to the perf_events idle callbacks.

Signed-off-by: Stephane Eranian <eranian@google.com>
---
 arch/x86/events/amd/brs.c    | 25 +++++++++++++++++++++++++
 arch/x86/events/amd/core.c   |  4 ++++
 arch/x86/events/perf_event.h |  1 +
 3 files changed, 30 insertions(+)

diff --git a/arch/x86/events/amd/brs.c b/arch/x86/events/amd/brs.c
index 23b969001857..7d27591ba537 100644
--- a/arch/x86/events/amd/brs.c
+++ b/arch/x86/events/amd/brs.c
@@ -146,6 +146,7 @@ void amd_brs_enable(void)
 
 	/* Set enable bit */
 	set_debug_extn_cfg(cfg.val);
+	perf_lopwr_active_inc();
 }
 
 void amd_brs_disable(void)
@@ -175,6 +176,7 @@ void amd_brs_disable(void)
 		cfg.brsmen = 0;
 		set_debug_extn_cfg(cfg.val);
 	}
+	perf_lopwr_active_dec();
 }
 
 static bool amd_brs_match_plm(struct perf_event *event, u64 to)
@@ -292,6 +294,29 @@ static void amd_brs_poison_buffer(void)
 	wrmsrl(brs_to(idx), BRS_POISON);
 }
 
+/*
+ * called indirectly with irqs masked from mwait_idle_*()
+ */
+void amd_pmu_brs_lopwr_cb(bool lopwr_in)
+{
+	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+	union amd_debug_extn_cfg cfg;
+
+	/*
+	 * On mwait in, we may end up in a non-C0 state.
+	 * We must disable branch sampling to avoid holding the NMI
+	 * for too long. We disable it in hardware but we
+	 * keep the state in cpuc, so we can re-enable it.
+	 *
+	 * The hardware will deliver the NMI, if needed, once brsmen is cleared.
+	 */
+	if (cpuc->brs_active) {
+		cfg.val = get_debug_extn_cfg();
+		cfg.brsmen = !lopwr_in;
+		set_debug_extn_cfg(cfg.val);
+	}
+}
+
 /*
  * On context switch in, we need to make sure no samples from previous user
  * are left in the BRS.
diff --git a/arch/x86/events/amd/core.c b/arch/x86/events/amd/core.c
index d6d5119260a9..3e1985cd414d 100644
--- a/arch/x86/events/amd/core.c
+++ b/arch/x86/events/amd/core.c
@@ -1184,12 +1184,16 @@ static int __init amd_core_pmu_init(void)
 		 * invoked on context-switch in via sched_task_in(), so enable only when necessary
 		 */
 		if (!amd_brs_init()) {
+			struct pmu *pmu = x86_get_pmu(smp_processor_id());
 			x86_pmu.get_event_constraints = amd_get_event_constraints_f19h;
 			x86_pmu.sched_task = amd_pmu_sched_task;
 			/*
 			 * The put_event_constraints callback is shared with
 			 * Fam17h, set above
 			 */
+
+			/* branch sampling must be stopped when entering low power */
+			perf_register_lopwr_cb(pmu, amd_pmu_brs_lopwr_cb);
 		}
 	}
 
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index a275553e78b9..73eac9d34bd9 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -1242,6 +1242,7 @@ static inline void amd_pmu_brs_del(struct perf_event *event)
 }
 
 void amd_pmu_brs_sched_task(struct perf_event_context *ctx, bool sched_in);
+void amd_pmu_brs_lopwr_cb(bool lopwr_in);
 
 /*
  * check if BRS is activated on the CPU
-- 
2.33.0.153.gba50c8fa24-goog


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH v1 10/13] perf tools: add branch-brs as a new event
  2021-09-09  7:56 [PATCH v1 00/13] perf/x86/amd: Add AMD Fam19h Branch Sampling support Stephane Eranian
                   ` (8 preceding siblings ...)
  2021-09-09  7:56 ` [PATCH v1 09/13] perf/x86/amd: add idle hooks for branch sampling Stephane Eranian
@ 2021-09-09  7:56 ` Stephane Eranian
  2021-09-09  7:56 ` [PATCH v1 11/13] perf tools: improve IBS error handling Stephane Eranian
                   ` (3 subsequent siblings)
  13 siblings, 0 replies; 41+ messages in thread
From: Stephane Eranian @ 2021-09-09  7:56 UTC (permalink / raw)
  To: linux-kernel; +Cc: peterz, acme, jolsa, kim.phillips, namhyung, irogers

Although branch-brs is specific to AMD Zen3, we provide it in the list of events
recognized by perf as a convenience. It allows:
 $ perf stat -e branch-brs
 $ perf list

and enables some test suites to pass.

Signed-off-by: Stephane Eranian <eranian@google.com>
---
 tools/perf/util/parse-events.l | 1 +
 1 file changed, 1 insertion(+)

diff --git a/tools/perf/util/parse-events.l b/tools/perf/util/parse-events.l
index 923849024b15..ae229f5c4758 100644
--- a/tools/perf/util/parse-events.l
+++ b/tools/perf/util/parse-events.l
@@ -361,6 +361,7 @@ mem-loads-aux				|
 mem-stores				|
 topdown-[a-z-]+				|
 tx-capacity-[a-z-]+			|
+branch-brs				|
 el-capacity-[a-z-]+			{ return str(yyscanner, PE_KERNEL_PMU_EVENT); }
 
 L1-dcache|l1-d|l1d|L1-data		|
-- 
2.33.0.153.gba50c8fa24-goog


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH v1 11/13] perf tools: improve IBS error handling
  2021-09-09  7:56 [PATCH v1 00/13] perf/x86/amd: Add AMD Fam19h Branch Sampling support Stephane Eranian
                   ` (9 preceding siblings ...)
  2021-09-09  7:56 ` [PATCH v1 10/13] perf tools: add branch-brs as a new event Stephane Eranian
@ 2021-09-09  7:56 ` Stephane Eranian
  2021-09-13 19:34   ` Arnaldo Carvalho de Melo
  2021-09-09  7:56 ` [PATCH v1 12/13] perf tools: improve error handling of AMD Branch Sampling Stephane Eranian
                   ` (2 subsequent siblings)
  13 siblings, 1 reply; 41+ messages in thread
From: Stephane Eranian @ 2021-09-09  7:56 UTC (permalink / raw)
  To: linux-kernel; +Cc: peterz, acme, jolsa, kim.phillips, namhyung, irogers

From: Kim Phillips <kim.phillips@amd.com>

This patch improves the error message returned on failed perf_event_open() on
AMD when using IBS.

Signed-off-by: Kim Phillips <kim.phillips@amd.com>
---
 tools/perf/util/evsel.c | 42 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 42 insertions(+)

diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index f61e5dd53f5d..f203f178fdb9 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -2684,12 +2684,52 @@ static bool find_process(const char *name)
 	return ret ? false : true;
 }
 
+static bool is_amd;
+
+static char *fgrep(FILE *inf, const char *str)
+{
+	char line[256];
+	int slen = strlen(str);
+
+	while (!feof(inf)) {
+		if (!fgets(line, 256, inf))
+			break;
+		if (strncmp(line, str, slen))
+			continue;
+
+		return strdup(line);
+	}
+
+	return NULL;
+}
+
+static void detect_amd(void)
+{
+	FILE *inf = fopen("/proc/cpuinfo", "r");
+	char *res;
+
+	if (!inf)
+		return;
+
+	res = fgrep(inf, "vendor_id");
+
+	if (res) {
+		char *s = strchr(res, ':');
+
+		is_amd = s && !strcmp(s, ": AuthenticAMD\n");
+		free(res);
+	}
+	fclose(inf);
+}
+
 int evsel__open_strerror(struct evsel *evsel, struct target *target,
 			 int err, char *msg, size_t size)
 {
 	char sbuf[STRERR_BUFSIZE];
 	int printed = 0, enforced = 0;
 
+	detect_amd();
+
 	switch (err) {
 	case EPERM:
 	case EACCES:
@@ -2782,6 +2822,8 @@ int evsel__open_strerror(struct evsel *evsel, struct target *target,
 			return scnprintf(msg, size, "wrong clockid (%d).", clockid);
 		if (perf_missing_features.aux_output)
 			return scnprintf(msg, size, "The 'aux_output' feature is not supported, update the kernel.");
+		if (is_amd && (evsel->core.attr.precise_ip || !strncmp(evsel->pmu_name, "ibs", 3)) && (evsel->core.attr.exclude_kernel))
+			return scnprintf(msg, size, "AMD IBS can't exclude kernel events.  Try running at a higher privilege level.");
 		break;
 	case ENODATA:
 		return scnprintf(msg, size, "Cannot collect data source with the load latency event alone. "
-- 
2.33.0.153.gba50c8fa24-goog


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH v1 12/13] perf tools: improve error handling of AMD Branch Sampling
  2021-09-09  7:56 [PATCH v1 00/13] perf/x86/amd: Add AMD Fam19h Branch Sampling support Stephane Eranian
                   ` (10 preceding siblings ...)
  2021-09-09  7:56 ` [PATCH v1 11/13] perf tools: improve IBS error handling Stephane Eranian
@ 2021-09-09  7:56 ` Stephane Eranian
  2021-10-04 21:57   ` Kim Phillips
  2021-09-09  7:57 ` [PATCH v1 13/13] perf report: add addr_from/addr_to sort dimensions Stephane Eranian
  2021-09-09  8:55 ` [PATCH v1 00/13] perf/x86/amd: Add AMD Fam19h Branch Sampling support Peter Zijlstra
  13 siblings, 1 reply; 41+ messages in thread
From: Stephane Eranian @ 2021-09-09  7:56 UTC (permalink / raw)
  To: linux-kernel; +Cc: peterz, acme, jolsa, kim.phillips, namhyung, irogers

This patch improves the error message printed by perf when perf_event_open()
fails on AMD Zen3 when using the branch sampling feature. In the case of
EINVAL, there are two main reasons: the event uses frequency mode, or the
sampling period is smaller than the depth of the branch sampling buffer (16).
The patch checks the parameters of the call and tries to print a relevant
message explaining the error:

$ perf record -b -e cpu/branch-brs/ -c 10 ls
Error:
AMD Branch Sampling does not support sampling period smaller than what is reported in /sys/devices/cpu/caps/branches.

$ perf record -b -e cpu/branch-brs/ ls
Error:
AMD Branch Sampling does not support frequency mode sampling, must pass a fixed sampling period via -c option or cpu/branch-brs,period=xxxx/.
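
For reference, the minimum period mentioned in the first message can be read
back from sysfs (hypothetical session; the value shown is the 16-deep BRS
buffer the series exposes for Fam19h):

$ cat /sys/devices/cpu/caps/branches
16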

Signed-off-by: Stephane Eranian <eranian@google.com>
---
 tools/perf/util/evsel.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index f203f178fdb9..8945f08f41b2 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -2824,6 +2824,14 @@ int evsel__open_strerror(struct evsel *evsel, struct target *target,
 			return scnprintf(msg, size, "The 'aux_output' feature is not supported, update the kernel.");
 		if (is_amd && (evsel->core.attr.precise_ip || !strncmp(evsel->pmu_name, "ibs", 3)) && (evsel->core.attr.exclude_kernel))
 			return scnprintf(msg, size, "AMD IBS can't exclude kernel events.  Try running at a higher privilege level.");
+
+		if (is_amd && ((evsel->core.attr.config & 0xff) == 0xc4) && (evsel->core.attr.sample_type & PERF_SAMPLE_BRANCH_STACK)) {
+			if (evsel->core.attr.freq)
+				return scnprintf(msg, size, "AMD Branch Sampling does not support frequency mode sampling, must pass a fixed sampling period via -c option or cpu/branch-brs,period=xxxx/.");
+			/* another reason is that the period is too small */
+			return scnprintf(msg, size, "AMD Branch Sampling does not support sampling period smaller than what is reported in /sys/devices/cpu/caps/branches.");
+		}
+
 		break;
 	case ENODATA:
 		return scnprintf(msg, size, "Cannot collect data source with the load latency event alone. "
-- 
2.33.0.153.gba50c8fa24-goog


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH v1 13/13] perf report: add addr_from/addr_to sort dimensions
  2021-09-09  7:56 [PATCH v1 00/13] perf/x86/amd: Add AMD Fam19h Branch Sampling support Stephane Eranian
                   ` (11 preceding siblings ...)
  2021-09-09  7:56 ` [PATCH v1 12/13] perf tools: improve error handling of AMD Branch Sampling Stephane Eranian
@ 2021-09-09  7:57 ` Stephane Eranian
  2021-09-09  8:55 ` [PATCH v1 00/13] perf/x86/amd: Add AMD Fam19h Branch Sampling support Peter Zijlstra
  13 siblings, 0 replies; 41+ messages in thread
From: Stephane Eranian @ 2021-09-09  7:57 UTC (permalink / raw)
  To: linux-kernel; +Cc: peterz, acme, jolsa, kim.phillips, namhyung, irogers

This patch adds a new pair of sorting dimensions used with branch sampling.
With the existing symbol_from/symbol_to, branches captured in the same function
would be collapsed into a single entry if the latency (cycles) was all the same.
That is the case on Intel Broadwell, for instance. Since Intel Skylake, the latency
is captured by hardware and is therefore used to disambiguate branches.

This patch adds addr_from/addr_to sort dimensions that sort branches based on
their addresses and not the function they are in. The output is still the function
name but the offset within the function is provided to uniquely identify each branch.
These new sort dimensions also help with annotate because they create distinct entries
in the histogram which, in turn, produce proper branch annotation.

Here is an example using AMD's branch sampling:

$ perf record -a -b -c 1000037 -e cpu/branch-brs/ test_prg

$ perf report
Samples: 6M of event 'cpu/branch-brs/', Event count (approx.): 6901276
Overhead  Command          Source Shared Object  Source Symbol                                   Target Symbol                                   Basic Block Cycle
  99.65%  test_prg	   test_prg              [.] test_thread                                 [.] test_thread                                 -
   0.02%  test_prg         [kernel.vmlinux]      [k] asm_sysvec_apic_timer_interrupt             [k] error_entry                                 -

$ perf report -F overhead,comm,dso,addr_from,addr_to
Samples: 6M of event 'cpu/branch-brs/', Event count (approx.): 6901276
Overhead  Command          Shared Object     Source Address          Target Address
   4.22%  test_prg         test_prg          [.] test_thread+0x3c    [.] test_thread+0x4
   4.13%  test_prg         test_prg          [.] test_thread+0x4     [.] test_thread+0x3a
   4.09%  test_prg         test_prg          [.] test_thread+0x3a    [.] test_thread+0x6
   4.08%  test_prg         test_prg          [.] test_thread+0x2     [.] test_thread+0x3c
   4.06%  test_prg         test_prg          [.] test_thread+0x3e    [.] test_thread+0x2
   3.87%  test_prg         test_prg          [.] test_thread+0x6     [.] test_thread+0x38
   3.84%  test_prg         test_prg          [.] test_thread         [.] test_thread+0x3e
   3.76%  test_prg         test_prg          [.] test_thread+0x1e    [.] test_thread
   3.76%  test_prg         test_prg          [.] test_thread+0x38    [.] test_thread+0x8
   3.56%  test_prg         test_prg          [.] test_thread+0x22    [.] test_thread+0x1e
   3.54%  test_prg         test_prg          [.] test_thread+0x8     [.] test_thread+0x36
   3.47%  test_prg         test_prg          [.] test_thread+0x1c    [.] test_thread+0x22
   3.45%  test_prg         test_prg          [.] test_thread+0x36    [.] test_thread+0xa
   3.28%  test_prg         test_prg          [.] test_thread+0x24    [.] test_thread+0x1c
   3.25%  test_prg         test_prg          [.] test_thread+0xa     [.] test_thread+0x34
   3.24%  test_prg         test_prg          [.] test_thread+0x1a    [.] test_thread+0x24
   3.20%  test_prg         test_prg          [.] test_thread+0x34    [.] test_thread+0xc
   3.04%  test_prg         test_prg          [.] test_thread+0x26    [.] test_thread+0x1a
   3.01%  test_prg         test_prg          [.] test_thread+0xc     [.] test_thread+0x32
   2.98%  test_prg         test_prg          [.] test_thread+0x18    [.] test_thread+0x26
   2.94%  test_prg         test_prg          [.] test_thread+0x32    [.] test_thread+0xe
   2.76%  test_prg         test_prg          [.] test_thread+0x28    [.] test_thread+0x18
   2.73%  test_prg         test_prg          [.] test_thread+0xe     [.] test_thread+0x30
   2.67%  test_prg         test_prg          [.] test_thread+0x30    [.] test_thread+0x10
   2.67%  test_prg         test_prg          [.] test_thread+0x16    [.] test_thread+0x28
   2.46%  test_prg         test_prg          [.] test_thread+0x10    [.] test_thread+0x2e
   2.44%  test_prg         test_prg          [.] test_thread+0x2a    [.] test_thread+0x16
   2.38%  test_prg         test_prg          [.] test_thread+0x14    [.] test_thread+0x2a
   2.32%  test_prg         test_prg          [.] test_thread+0x2e    [.] test_thread+0x12
   2.28%  test_prg         test_prg          [.] test_thread+0x12    [.] test_thread+0x2c
   2.16%  test_prg         test_prg          [.] test_thread+0x2c    [.] test_thread+0x14
   0.02%  test_prg         [kernel.vmlinux]  [k] asm_sysvec_apic_ti+0x5  [k] error_entry

Signed-off-by: Stephane Eranian <eranian@google.com>
---
 tools/perf/util/hist.c |   2 +
 tools/perf/util/hist.h |   2 +
 tools/perf/util/sort.c | 128 +++++++++++++++++++++++++++++++++++++++++
 tools/perf/util/sort.h |   2 +
 4 files changed, 134 insertions(+)

diff --git a/tools/perf/util/hist.c b/tools/perf/util/hist.c
index 65fe65ba03c2..39924202568f 100644
--- a/tools/perf/util/hist.c
+++ b/tools/perf/util/hist.c
@@ -124,6 +124,7 @@ void hists__calc_col_len(struct hists *hists, struct hist_entry *h)
 		} else {
 			symlen = unresolved_col_width + 4 + 2;
 			hists__new_col_len(hists, HISTC_SYMBOL_FROM, symlen);
+			hists__new_col_len(hists, HISTC_ADDR_FROM, symlen);
 			hists__set_unres_dso_col_len(hists, HISTC_DSO_FROM);
 		}
 
@@ -138,6 +139,7 @@ void hists__calc_col_len(struct hists *hists, struct hist_entry *h)
 		} else {
 			symlen = unresolved_col_width + 4 + 2;
 			hists__new_col_len(hists, HISTC_SYMBOL_TO, symlen);
+			hists__new_col_len(hists, HISTC_ADDR_TO, symlen);
 			hists__set_unres_dso_col_len(hists, HISTC_DSO_TO);
 		}
 
diff --git a/tools/perf/util/hist.h b/tools/perf/util/hist.h
index 5343b62476e6..9592853fab73 100644
--- a/tools/perf/util/hist.h
+++ b/tools/perf/util/hist.h
@@ -76,6 +76,8 @@ enum hist_column {
 	HISTC_LOCAL_INS_LAT,
 	HISTC_GLOBAL_INS_LAT,
 	HISTC_P_STAGE_CYC,
+	HISTC_ADDR_FROM,
+	HISTC_ADDR_TO,
 	HISTC_NR_COLS, /* Last entry */
 };
 
diff --git a/tools/perf/util/sort.c b/tools/perf/util/sort.c
index 568a88c001c6..e5be5e24f9c5 100644
--- a/tools/perf/util/sort.c
+++ b/tools/perf/util/sort.c
@@ -990,6 +990,128 @@ struct sort_entry sort_sym_to = {
 	.se_width_idx	= HISTC_SYMBOL_TO,
 };
 
+static int _hist_entry__addr_snprintf(struct map_symbol *ms,
+				     u64 ip, char level, char *bf, size_t size,
+				     unsigned int width)
+{
+	struct symbol *sym = ms->sym;
+	struct map *map = ms->map;
+	size_t ret = 0, offs;
+
+	ret += repsep_snprintf(bf + ret, size - ret, "[%c] ", level);
+	if (sym && map) {
+		if (sym->type == STT_OBJECT) {
+			ret += repsep_snprintf(bf + ret, size - ret, "%s", sym->name);
+			ret += repsep_snprintf(bf + ret, size - ret, "+0x%llx",
+					ip - map->unmap_ip(map, sym->start));
+		} else {
+			ret += repsep_snprintf(bf + ret, size - ret, "%.*s",
+					       width - ret,
+					       sym->name);
+			offs = ip - sym->start;
+			if (offs)
+				ret += repsep_snprintf(bf + ret, size - ret, "+0x%llx", offs);
+		}
+	} else {
+		size_t len = BITS_PER_LONG / 4;
+		ret += repsep_snprintf(bf + ret, size - ret, "%-#.*llx",
+				       len, ip);
+	}
+
+	return ret;
+}
+
+static int hist_entry__addr_from_snprintf(struct hist_entry *he, char *bf,
+					 size_t size, unsigned int width)
+{
+	if (he->branch_info) {
+		struct addr_map_symbol *from = &he->branch_info->from;
+
+		return _hist_entry__addr_snprintf(&from->ms, from->al_addr,
+						 he->level, bf, size, width);
+	}
+
+	return repsep_snprintf(bf, size, "%-*.*s", width, width, "N/A");
+}
+
+static int hist_entry__addr_to_snprintf(struct hist_entry *he, char *bf,
+				       size_t size, unsigned int width)
+{
+	if (he->branch_info) {
+		struct addr_map_symbol *to = &he->branch_info->to;
+
+		return _hist_entry__addr_snprintf(&to->ms, to->al_addr,
+						 he->level, bf, size, width);
+	}
+
+	return repsep_snprintf(bf, size, "%-*.*s", width, width, "N/A");
+}
+
+static int64_t
+sort__addr_from_cmp(struct hist_entry *left, struct hist_entry *right)
+{
+	struct addr_map_symbol *from_l;
+	struct addr_map_symbol *from_r;
+	int64_t ret;
+
+	if (!left->branch_info || !right->branch_info)
+		return cmp_null(left->branch_info, right->branch_info);
+
+	from_l = &left->branch_info->from;
+	from_r = &right->branch_info->from;
+
+	/*
+	 * comparing symbol address alone is not enough since it's a
+	 * relative address within a dso.
+	 */
+	ret = _sort__dso_cmp(from_l->ms.map, from_r->ms.map);
+	if (ret != 0)
+		return ret;
+
+	return _sort__addr_cmp(from_l->addr, from_r->addr);
+}
+
+static int64_t
+sort__addr_to_cmp(struct hist_entry *left, struct hist_entry *right)
+{
+	struct addr_map_symbol *to_l;
+	struct addr_map_symbol *to_r;
+	int64_t ret;
+
+	if (!left->branch_info || !right->branch_info)
+		return cmp_null(left->branch_info, right->branch_info);
+
+	to_l = &left->branch_info->to;
+	to_r = &right->branch_info->to;
+
+	/*
+	 * comparing symbol address alone is not enough since it's a
+	 * relative address within a dso.
+	 */
+	ret = _sort__dso_cmp(to_l->ms.map, to_r->ms.map);
+	if (ret != 0)
+		return ret;
+
+	return _sort__addr_cmp(to_l->addr, to_r->addr);
+}
+
+struct sort_entry sort_addr_from = {
+	.se_header	= "Source Address",
+	.se_cmp		= sort__addr_from_cmp,
+	.se_snprintf	= hist_entry__addr_from_snprintf,
+	.se_filter	= hist_entry__sym_from_filter, /* shared with sym_from */
+	.se_width_idx	= HISTC_ADDR_FROM,
+};
+
+struct sort_entry sort_addr_to = {
+	.se_header	= "Target Address",
+	.se_cmp		= sort__addr_to_cmp,
+	.se_snprintf	= hist_entry__addr_to_snprintf,
+	.se_filter	= hist_entry__sym_to_filter, /* shared with sym_to */
+	.se_width_idx	= HISTC_ADDR_TO,
+};
+
+
 static int64_t
 sort__mispredict_cmp(struct hist_entry *left, struct hist_entry *right)
 {
@@ -1897,6 +2019,8 @@ static struct sort_dimension bstack_sort_dimensions[] = {
 	DIM(SORT_SRCLINE_FROM, "srcline_from", sort_srcline_from),
 	DIM(SORT_SRCLINE_TO, "srcline_to", sort_srcline_to),
 	DIM(SORT_SYM_IPC, "ipc_lbr", sort_sym_ipc),
+	DIM(SORT_ADDR_FROM, "addr_from", sort_addr_from),
+	DIM(SORT_ADDR_TO, "addr_to", sort_addr_to),
 };
 
 #undef DIM
@@ -3128,6 +3252,10 @@ static bool get_elide(int idx, FILE *output)
 		return __get_elide(symbol_conf.dso_from_list, "dso_from", output);
 	case HISTC_DSO_TO:
 		return __get_elide(symbol_conf.dso_to_list, "dso_to", output);
+	case HISTC_ADDR_FROM:
+		return __get_elide(symbol_conf.sym_from_list, "addr_from", output);
+	case HISTC_ADDR_TO:
+		return __get_elide(symbol_conf.sym_to_list, "addr_to", output);
 	default:
 		break;
 	}
diff --git a/tools/perf/util/sort.h b/tools/perf/util/sort.h
index b67c469aba79..571f6745dced 100644
--- a/tools/perf/util/sort.h
+++ b/tools/perf/util/sort.h
@@ -250,6 +250,8 @@ enum sort_type {
 	SORT_SRCLINE_FROM,
 	SORT_SRCLINE_TO,
 	SORT_SYM_IPC,
+	SORT_ADDR_FROM,
+	SORT_ADDR_TO,
 
 	/* memory mode specific sort keys */
 	__SORT_MEMORY_MODE,
-- 
2.33.0.153.gba50c8fa24-goog


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: [PATCH v1 00/13] perf/x86/amd: Add AMD Fam19h Branch Sampling support
  2021-09-09  7:56 [PATCH v1 00/13] perf/x86/amd: Add AMD Fam19h Branch Sampling support Stephane Eranian
                   ` (12 preceding siblings ...)
  2021-09-09  7:57 ` [PATCH v1 13/13] perf report: add addr_from/addr_to sort dimensions Stephane Eranian
@ 2021-09-09  8:55 ` Peter Zijlstra
  2021-09-15  5:55   ` Stephane Eranian
  13 siblings, 1 reply; 41+ messages in thread
From: Peter Zijlstra @ 2021-09-09  8:55 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: linux-kernel, acme, jolsa, kim.phillips, namhyung, irogers

On Thu, Sep 09, 2021 at 12:56:47AM -0700, Stephane Eranian wrote:
> This patch series adds support for the AMD Fam19h 16-deep branch sampling
> feature as described in the AMD PPR Fam19h Model 01h Revision B1 section 2.1.13.

Yay..

> BRS interacts with the NMI interrupt as well. Because enabling BRS is expensive,
> it is only activated after P event occurrences, where P is the desired sampling period.
> At P occurrences of the event, the counter overflows, the CPU catches the NMI interrupt,
> activates BRS for 16 branches until it saturates, and then delivers the NMI to the kernel.

WTF... ?!? Srsly? You're joking right?

Also, can you please fix your MUA to wrap at 78 chars like normal people?

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v1 07/13] perf/core: add idle hooks
  2021-09-09  7:56 ` [PATCH v1 07/13] perf/core: add idle hooks Stephane Eranian
@ 2021-09-09  9:15   ` Peter Zijlstra
  2021-09-09 10:42   ` kernel test robot
  2021-09-09 11:02   ` kernel test robot
  2 siblings, 0 replies; 41+ messages in thread
From: Peter Zijlstra @ 2021-09-09  9:15 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: linux-kernel, acme, jolsa, kim.phillips, namhyung, irogers

On Thu, Sep 09, 2021 at 12:56:54AM -0700, Stephane Eranian wrote:
> This patch adds a new set of hooks to connect perf_events with the
> idle task. On some PMU models, it may be necessary to flush or stop
> the PMU when going to low power. Upon return from low power, the opposite
> action, i.e., re-enable the PMU, may be necessary. The patch adds
> perf_pmu_register_lopwr_cb() to register a callback on entry or return
> from low power. The callback is invoked with a boolean arg. If true,
> then this is an entry. If false, this is a return.
> 
> The callback is invoked from the idle code with interrupts already
> disabled.

Urghh... we've had this before and always managed to refuse doing this
(in generic code).

Also, it's broken..

> +/*
> + * The perf_lopwr_cb() is invoked from the idle task. As such it
> + * cannot grab a mutex that may end up sleeping. The idle task cannot
> + * sleep by construction. Therefore we create a spinlock and a new
> + * list of PMUs to invoke on idle. The list is protected by a spinlock.
> + * Normally we would use pmus_lock and iterate over each PMU, but the
> + * mutex is not possible here and we need to iterate only over the PMUs
> + * which do require an idle callback.

That rationale is broken, pmus is also SRCU protected and SRCU iteration
would actually work from the idle context, unlike:

> + */
> +static DEFINE_SPINLOCK(lopwr_cb_lock);

which is not allowed from preempt-disable context. If you really need
that lock, you need a raw_spinlock_t.
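
A rough sketch of the SRCU-based iteration being suggested (hypothetical, not
part of the posted patch; it reuses the existing pmus list and pmus_srcu from
kernel/events/core.c instead of the new lock and list):

void perf_lopwr_cb(bool lopwr_in)
{
	struct pmu *pmu;
	int idx;

	if (!__this_cpu_read(lopwr_nr_active))
		return;

	/* SRCU read side never blocks, so it is usable with preemption off */
	idx = srcu_read_lock(&pmus_srcu);
	list_for_each_entry_rcu(pmu, &pmus, entry) {
		if (pmu->lopwr_cb)
			pmu->lopwr_cb(lopwr_in);
	}
	srcu_read_unlock(&pmus_srcu, idx);
}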

> +static LIST_HEAD(lopwr_cb_pmus);
> +static DEFINE_PER_CPU(int, lopwr_nr_active);
> +
> +void perf_lopwr_active_inc(void)
> +{
> +	__this_cpu_inc(lopwr_nr_active);
> +}
> +
> +void perf_lopwr_active_dec(void)
> +{
> +	__this_cpu_dec(lopwr_nr_active);
> +}
> +
> +/*
> + * lopwr_in = true means going to low power state
> + * lopwr_in = false means returning from low power state
> + */
> +void perf_lopwr_cb(bool lopwr_in)
> +{
> +	struct pmu *pmu;
> +	unsigned long flags;
> +
> +	if (!__this_cpu_read(lopwr_nr_active))
> +		return;

We have static_branch and static_call to deal with these things.
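
For illustration, a minimal sketch of that approach (hypothetical code, not
part of the posted series; it assumes a single registered callback and flips
the key at registration time, from a context that is allowed to sleep):

#include <linux/jump_label.h>
#include <linux/static_call.h>

static void perf_lopwr_nop(bool lopwr_in) { }

static DEFINE_STATIC_KEY_FALSE(perf_lopwr_key);
DEFINE_STATIC_CALL(perf_lopwr_call, perf_lopwr_nop);

void perf_lopwr_cb(bool lopwr_in)
{
	/* patched-out branch: no cost on the idle path when unused */
	if (static_branch_unlikely(&perf_lopwr_key))
		static_call(perf_lopwr_call)(lopwr_in);
}

void perf_register_lopwr_cb(struct pmu *pmu, void (*func)(bool))
{
	pmu->lopwr_cb = func;
	static_call_update(perf_lopwr_call, func);
	static_branch_enable(&perf_lopwr_key);	/* may sleep, registration time only */
}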

> +
> +	spin_lock_irqsave(&lopwr_cb_lock, flags);
> +
> +	list_for_each_entry(pmu, &lopwr_cb_pmus, lopwr_entry) {
> +		if (pmu->lopwr_cb)
> +			pmu->lopwr_cb(lopwr_in);
> +	}

I *really* do not like unbound iteration in the idle path.

> +
> +	spin_unlock_irqrestore(&lopwr_cb_lock, flags);
> +}
> +EXPORT_SYMBOL_GPL(perf_lopwr_cb);

Why?

> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
> index 912b47aa99d8..14ce130aee1b 100644
> --- a/kernel/sched/idle.c
> +++ b/kernel/sched/idle.c
> @@ -179,6 +179,7 @@ static void cpuidle_idle_call(void)
>  	 */
>  	if (need_resched()) {
>  		local_irq_enable();
> +		perf_lopwr_cb(false);
>  		return;
>  	}
>  
> @@ -191,7 +192,14 @@ static void cpuidle_idle_call(void)
>  	if (cpuidle_not_available(drv, dev)) {
>  		tick_nohz_idle_stop_tick();
>  
> +		if (!cpu_idle_force_poll)
> +			perf_lopwr_cb(true);
> +
>  		default_idle_call();
> +
> +		if (!cpu_idle_force_poll)
> +			perf_lopwr_cb(false);
> +
>  		goto exit_idle;
>  	}
>  
> @@ -249,8 +257,10 @@ static void cpuidle_idle_call(void)
>  	/*
>  	 * It is up to the idle functions to reenable local interrupts
>  	 */
> -	if (WARN_ON_ONCE(irqs_disabled()))
> +	if (WARN_ON_ONCE(irqs_disabled())) {
>  		local_irq_enable();
> +		perf_lopwr_cb(false);
> +	}
>  }
>  
>  /*
> @@ -279,9 +289,12 @@ static void do_idle(void)
>  	__current_set_polling();
>  	tick_nohz_idle_enter();
>  
> +
>  	while (!need_resched()) {
>  		rmb();
>  
> +		perf_lopwr_cb(true);
> +
>  		local_irq_disable();
>  
>  		if (cpu_is_offline(cpu)) {

I so hate all of this...  but mostly, this doesn't seem right: you're
unconditionally calling this, even for high power idle states.

*IFF* you really have to do this, stuff it in an AMD idle state driver
that knows about the relevant idle states. But really, why can't we just
take the hit while the event is running?

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v1 08/13] perf/x86/core: add idle hooks
  2021-09-09  7:56 ` [PATCH v1 08/13] perf/x86/core: " Stephane Eranian
@ 2021-09-09  9:16   ` Peter Zijlstra
  0 siblings, 0 replies; 41+ messages in thread
From: Peter Zijlstra @ 2021-09-09  9:16 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: linux-kernel, acme, jolsa, kim.phillips, namhyung, irogers

On Thu, Sep 09, 2021 at 12:56:55AM -0700, Stephane Eranian wrote:

> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
> index 14ce130aee1b..c0ddc3c32a33 100644
> --- a/kernel/sched/idle.c
> +++ b/kernel/sched/idle.c
> @@ -179,7 +179,6 @@ static void cpuidle_idle_call(void)
>  	 */
>  	if (need_resched()) {
>  		local_irq_enable();
> -		perf_lopwr_cb(false);
>  		return;
>  	}
>  
> @@ -230,6 +229,9 @@ static void cpuidle_idle_call(void)
>  		tick_nohz_idle_stop_tick();
>  
>  		next_state = cpuidle_find_deepest_state(drv, dev, max_latency_ns);
> +		if (!cpu_idle_force_poll)
> +			perf_lopwr_cb(true);
> +
>  		call_cpuidle(drv, dev, next_state);
>  	} else {
>  		bool stop_tick = true;
> @@ -244,12 +246,17 @@ static void cpuidle_idle_call(void)
>  		else
>  			tick_nohz_idle_retain_tick();
>  
> +		if (!cpu_idle_force_poll)
> +			perf_lopwr_cb(true);
> +
>  		entered_state = call_cpuidle(drv, dev, next_state);
>  		/*
>  		 * Give the governor an opportunity to reflect on the outcome
>  		 */
>  		cpuidle_reflect(dev, entered_state);
>  	}
> +	if (!cpu_idle_force_poll)
> +		perf_lopwr_cb(false);
>  
>  exit_idle:
>  	__current_set_polling();
> @@ -259,7 +266,6 @@ static void cpuidle_idle_call(void)
>  	 */
>  	if (WARN_ON_ONCE(irqs_disabled())) {
>  		local_irq_enable();
> -		perf_lopwr_cb(false);
>  	}
>  }
>  
> @@ -293,8 +299,6 @@ static void do_idle(void)
>  	while (!need_resched()) {
>  		rmb();
>  
> -		perf_lopwr_cb(true);
> -
>  		local_irq_disable();
>  
>  		if (cpu_is_offline(cpu)) {

Why are you sending me a patch that undoes most of what the previous
patch does? That's just bad form.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v1 09/13] perf/x86/amd: add idle hooks for branch sampling
  2021-09-09  7:56 ` [PATCH v1 09/13] perf/x86/amd: add idle hooks for branch sampling Stephane Eranian
@ 2021-09-09  9:20   ` Peter Zijlstra
  0 siblings, 0 replies; 41+ messages in thread
From: Peter Zijlstra @ 2021-09-09  9:20 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: linux-kernel, acme, jolsa, kim.phillips, namhyung, irogers

On Thu, Sep 09, 2021 at 12:56:56AM -0700, Stephane Eranian wrote:
> On AMD Fam19h Zen3, the branch sampling (BRS) feature must be disabled before entering low power
> and re-enabled (if was active) when returning from low power. Otherwise, the NMI interrupt may
> be held up for too long and cause problems. Stopping BRS will cause the NMI to be delivered if it
> was held up.

What you sent me ^^^, vs reflow at 72 chars vvv:

On AMD Fam19h Zen3, the branch sampling (BRS) feature must be disabled
before entering low power and re-enabled (if was active) when returning
from low power. Otherwise, the NMI interrupt may be held up for too long
and cause problems. Stopping BRS will cause the NMI to be delivered if
it was held up.

(because git show has that stupid extra indentation for Changelogs)


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v1 07/13] perf/core: add idle hooks
  2021-09-09  7:56 ` [PATCH v1 07/13] perf/core: add idle hooks Stephane Eranian
  2021-09-09  9:15   ` Peter Zijlstra
@ 2021-09-09 10:42   ` kernel test robot
  2021-09-09 11:02   ` kernel test robot
  2 siblings, 0 replies; 41+ messages in thread
From: kernel test robot @ 2021-09-09 10:42 UTC (permalink / raw)
  To: Stephane Eranian, linux-kernel
  Cc: kbuild-all, peterz, acme, jolsa, kim.phillips, namhyung, irogers

[-- Attachment #1: Type: text/plain, Size: 5233 bytes --]

Hi Stephane,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on tip/perf/core]
[also build test ERROR on next-20210909]
[cannot apply to tip/sched/core tip/x86/core v5.14]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Stephane-Eranian/perf-x86-amd-Add-AMD-Fam19h-Branch-Sampling-support/20210909-160050
base:   https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git 4034fb207e302cc0b1f304084d379640c1fb1436
config: um-x86_64_defconfig (attached as .config)
compiler: gcc-9 (Debian 9.3.0-22) 9.3.0
reproduce (this is a W=1 build):
        # https://github.com/0day-ci/linux/commit/35333fae36d4dee3e2225674fca8d745364482fa
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Stephane-Eranian/perf-x86-amd-Add-AMD-Fam19h-Branch-Sampling-support/20210909-160050
        git checkout 35333fae36d4dee3e2225674fca8d745364482fa
        # save the attached .config to linux build tree
        make W=1 ARCH=um SUBARCH=x86_64

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   kernel/sched/idle.c: In function 'cpuidle_idle_call':
>> kernel/sched/idle.c:182:3: error: implicit declaration of function 'perf_lopwr_cb' [-Werror=implicit-function-declaration]
     182 |   perf_lopwr_cb(false);
         |   ^~~~~~~~~~~~~
   cc1: some warnings being treated as errors


vim +/perf_lopwr_cb +182 kernel/sched/idle.c

   160	
   161	/**
   162	 * cpuidle_idle_call - the main idle function
   163	 *
   164	 * NOTE: no locks or semaphores should be used here
   165	 *
   166	 * On architectures that support TIF_POLLING_NRFLAG, is called with polling
   167	 * set, and it returns with polling set.  If it ever stops polling, it
   168	 * must clear the polling bit.
   169	 */
   170	static void cpuidle_idle_call(void)
   171	{
   172		struct cpuidle_device *dev = cpuidle_get_device();
   173		struct cpuidle_driver *drv = cpuidle_get_cpu_driver(dev);
   174		int next_state, entered_state;
   175	
   176		/*
   177		 * Check if the idle task must be rescheduled. If it is the
   178		 * case, exit the function after re-enabling the local irq.
   179		 */
   180		if (need_resched()) {
   181			local_irq_enable();
 > 182			perf_lopwr_cb(false);
   183			return;
   184		}
   185	
   186		/*
   187		 * The RCU framework needs to be told that we are entering an idle
   188		 * section, so no more rcu read side critical sections and one more
   189		 * step to the grace period
   190		 */
   191	
   192		if (cpuidle_not_available(drv, dev)) {
   193			tick_nohz_idle_stop_tick();
   194	
   195			if (!cpu_idle_force_poll)
   196				perf_lopwr_cb(true);
   197	
   198			default_idle_call();
   199	
   200			if (!cpu_idle_force_poll)
   201				perf_lopwr_cb(false);
   202	
   203			goto exit_idle;
   204		}
   205	
   206		/*
   207		 * Suspend-to-idle ("s2idle") is a system state in which all user space
   208		 * has been frozen, all I/O devices have been suspended and the only
   209		 * activity happens here and in interrupts (if any). In that case bypass
   210		 * the cpuidle governor and go straight for the deepest idle state
   211		 * available.  Possibly also suspend the local tick and the entire
   212		 * timekeeping to prevent timer interrupts from kicking us out of idle
   213		 * until a proper wakeup interrupt happens.
   214		 */
   215	
   216		if (idle_should_enter_s2idle() || dev->forced_idle_latency_limit_ns) {
   217			u64 max_latency_ns;
   218	
   219			if (idle_should_enter_s2idle()) {
   220	
   221				entered_state = call_cpuidle_s2idle(drv, dev);
   222				if (entered_state > 0)
   223					goto exit_idle;
   224	
   225				max_latency_ns = U64_MAX;
   226			} else {
   227				max_latency_ns = dev->forced_idle_latency_limit_ns;
   228			}
   229	
   230			tick_nohz_idle_stop_tick();
   231	
   232			next_state = cpuidle_find_deepest_state(drv, dev, max_latency_ns);
   233			call_cpuidle(drv, dev, next_state);
   234		} else {
   235			bool stop_tick = true;
   236	
   237			/*
   238			 * Ask the cpuidle framework to choose a convenient idle state.
   239			 */
   240			next_state = cpuidle_select(drv, dev, &stop_tick);
   241	
   242			if (stop_tick || tick_nohz_tick_stopped())
   243				tick_nohz_idle_stop_tick();
   244			else
   245				tick_nohz_idle_retain_tick();
   246	
   247			entered_state = call_cpuidle(drv, dev, next_state);
   248			/*
   249			 * Give the governor an opportunity to reflect on the outcome
   250			 */
   251			cpuidle_reflect(dev, entered_state);
   252		}
   253	
   254	exit_idle:
   255		__current_set_polling();
   256	
   257		/*
   258		 * It is up to the idle functions to reenable local interrupts
   259		 */
   260		if (WARN_ON_ONCE(irqs_disabled())) {
   261			local_irq_enable();
   262			perf_lopwr_cb(false);
   263		}
   264	}
   265	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 9530 bytes --]

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v1 03/13] perf/x86/amd: add AMD Fam19h Branch Sampling support
  2021-09-09  7:56 ` [PATCH v1 03/13] perf/x86/amd: add AMD Fam19h Branch Sampling support Stephane Eranian
@ 2021-09-09 10:44   ` kernel test robot
  2021-09-09 15:33   ` kernel test robot
  1 sibling, 0 replies; 41+ messages in thread
From: kernel test robot @ 2021-09-09 10:44 UTC (permalink / raw)
  To: Stephane Eranian, linux-kernel
  Cc: kbuild-all, acme, jolsa, kim.phillips, namhyung, irogers

[-- Attachment #1: Type: text/plain, Size: 2270 bytes --]

Hi Stephane,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on tip/perf/core]
[also build test WARNING on next-20210909]
[cannot apply to tip/sched/core tip/x86/core v5.14]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Stephane-Eranian/perf-x86-amd-Add-AMD-Fam19h-Branch-Sampling-support/20210909-160050
base:   https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git 4034fb207e302cc0b1f304084d379640c1fb1436
config: x86_64-allyesconfig (attached as .config)
compiler: gcc-9 (Debian 9.3.0-22) 9.3.0
reproduce (this is a W=1 build):
        # https://github.com/0day-ci/linux/commit/7ff5f8448eecfc188c2240b2a5a5ea0b00f55662
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Stephane-Eranian/perf-x86-amd-Add-AMD-Fam19h-Branch-Sampling-support/20210909-160050
        git checkout 7ff5f8448eecfc188c2240b2a5a5ea0b00f55662
        # save the attached .config to linux build tree
        make W=1 ARCH=x86_64 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

   arch/x86/events/amd/core.c: In function 'amd_pmu_enable_all':
>> arch/x86/events/amd/core.c:676:24: warning: variable 'hwc' set but not used [-Wunused-but-set-variable]
     676 |  struct hw_perf_event *hwc;
         |                        ^~~


vim +/hwc +676 arch/x86/events/amd/core.c

   672	
   673	static void amd_pmu_enable_all(int added)
   674	{
   675		struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
 > 676		struct hw_perf_event *hwc;
   677		int idx;
   678	
   679		for (idx = 0; idx < x86_pmu.num_counters; idx++) {
   680			hwc = &cpuc->events[idx]->hw;
   681	
   682			/* only activate events which are marked as active */
   683			if (!test_bit(idx, cpuc->active_mask))
   684				continue;
   685	
   686			amd_pmu_enable_event(cpuc->events[idx]);
   687		}
   688	}
   689	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 65977 bytes --]

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v1 07/13] perf/core: add idle hooks
  2021-09-09  7:56 ` [PATCH v1 07/13] perf/core: add idle hooks Stephane Eranian
  2021-09-09  9:15   ` Peter Zijlstra
  2021-09-09 10:42   ` kernel test robot
@ 2021-09-09 11:02   ` kernel test robot
  2 siblings, 0 replies; 41+ messages in thread
From: kernel test robot @ 2021-09-09 11:02 UTC (permalink / raw)
  To: Stephane Eranian, linux-kernel
  Cc: llvm, kbuild-all, peterz, acme, jolsa, kim.phillips, namhyung, irogers

[-- Attachment #1: Type: text/plain, Size: 6114 bytes --]

Hi Stephane,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on tip/perf/core]
[also build test ERROR on next-20210909]
[cannot apply to tip/sched/core tip/x86/core v5.14]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Stephane-Eranian/perf-x86-amd-Add-AMD-Fam19h-Branch-Sampling-support/20210909-160050
base:   https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git 4034fb207e302cc0b1f304084d379640c1fb1436
config: riscv-randconfig-r042-20210908 (attached as .config)
compiler: clang version 14.0.0 (https://github.com/llvm/llvm-project 261cbe98c38f8c1ee1a482fe76511110e790f58a)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # install riscv cross compiling tool for clang build
        # apt-get install binutils-riscv64-linux-gnu
        # https://github.com/0day-ci/linux/commit/35333fae36d4dee3e2225674fca8d745364482fa
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Stephane-Eranian/perf-x86-amd-Add-AMD-Fam19h-Branch-Sampling-support/20210909-160050
        git checkout 35333fae36d4dee3e2225674fca8d745364482fa
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross ARCH=riscv 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

>> kernel/sched/idle.c:182:3: error: implicit declaration of function 'perf_lopwr_cb' [-Werror,-Wimplicit-function-declaration]
                   perf_lopwr_cb(false);
                   ^
   kernel/sched/idle.c:196:4: error: implicit declaration of function 'perf_lopwr_cb' [-Werror,-Wimplicit-function-declaration]
                           perf_lopwr_cb(true);
                           ^
   kernel/sched/idle.c:262:3: error: implicit declaration of function 'perf_lopwr_cb' [-Werror,-Wimplicit-function-declaration]
                   perf_lopwr_cb(false);
                   ^
   kernel/sched/idle.c:296:3: error: implicit declaration of function 'perf_lopwr_cb' [-Werror,-Wimplicit-function-declaration]
                   perf_lopwr_cb(true);
                   ^
   4 errors generated.


vim +/perf_lopwr_cb +182 kernel/sched/idle.c

   160	
   161	/**
   162	 * cpuidle_idle_call - the main idle function
   163	 *
   164	 * NOTE: no locks or semaphores should be used here
   165	 *
   166	 * On architectures that support TIF_POLLING_NRFLAG, is called with polling
   167	 * set, and it returns with polling set.  If it ever stops polling, it
   168	 * must clear the polling bit.
   169	 */
   170	static void cpuidle_idle_call(void)
   171	{
   172		struct cpuidle_device *dev = cpuidle_get_device();
   173		struct cpuidle_driver *drv = cpuidle_get_cpu_driver(dev);
   174		int next_state, entered_state;
   175	
   176		/*
   177		 * Check if the idle task must be rescheduled. If it is the
   178		 * case, exit the function after re-enabling the local irq.
   179		 */
   180		if (need_resched()) {
   181			local_irq_enable();
 > 182			perf_lopwr_cb(false);
   183			return;
   184		}
   185	
   186		/*
   187		 * The RCU framework needs to be told that we are entering an idle
   188		 * section, so no more rcu read side critical sections and one more
   189		 * step to the grace period
   190		 */
   191	
   192		if (cpuidle_not_available(drv, dev)) {
   193			tick_nohz_idle_stop_tick();
   194	
   195			if (!cpu_idle_force_poll)
   196				perf_lopwr_cb(true);
   197	
   198			default_idle_call();
   199	
   200			if (!cpu_idle_force_poll)
   201				perf_lopwr_cb(false);
   202	
   203			goto exit_idle;
   204		}
   205	
   206		/*
   207		 * Suspend-to-idle ("s2idle") is a system state in which all user space
   208		 * has been frozen, all I/O devices have been suspended and the only
   209		 * activity happens here and in interrupts (if any). In that case bypass
   210		 * the cpuidle governor and go straight for the deepest idle state
   211		 * available.  Possibly also suspend the local tick and the entire
   212		 * timekeeping to prevent timer interrupts from kicking us out of idle
   213		 * until a proper wakeup interrupt happens.
   214		 */
   215	
   216		if (idle_should_enter_s2idle() || dev->forced_idle_latency_limit_ns) {
   217			u64 max_latency_ns;
   218	
   219			if (idle_should_enter_s2idle()) {
   220	
   221				entered_state = call_cpuidle_s2idle(drv, dev);
   222				if (entered_state > 0)
   223					goto exit_idle;
   224	
   225				max_latency_ns = U64_MAX;
   226			} else {
   227				max_latency_ns = dev->forced_idle_latency_limit_ns;
   228			}
   229	
   230			tick_nohz_idle_stop_tick();
   231	
   232			next_state = cpuidle_find_deepest_state(drv, dev, max_latency_ns);
   233			call_cpuidle(drv, dev, next_state);
   234		} else {
   235			bool stop_tick = true;
   236	
   237			/*
   238			 * Ask the cpuidle framework to choose a convenient idle state.
   239			 */
   240			next_state = cpuidle_select(drv, dev, &stop_tick);
   241	
   242			if (stop_tick || tick_nohz_tick_stopped())
   243				tick_nohz_idle_stop_tick();
   244			else
   245				tick_nohz_idle_retain_tick();
   246	
   247			entered_state = call_cpuidle(drv, dev, next_state);
   248			/*
   249			 * Give the governor an opportunity to reflect on the outcome
   250			 */
   251			cpuidle_reflect(dev, entered_state);
   252		}
   253	
   254	exit_idle:
   255		__current_set_polling();
   256	
   257		/*
   258		 * It is up to the idle functions to reenable local interrupts
   259		 */
   260		if (WARN_ON_ONCE(irqs_disabled())) {
   261			local_irq_enable();
   262			perf_lopwr_cb(false);
   263		}
   264	}
   265	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 26209 bytes --]

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v1 03/13] perf/x86/amd: add AMD Fam19h Branch Sampling support
  2021-09-09  7:56 ` [PATCH v1 03/13] perf/x86/amd: add AMD Fam19h Branch Sampling support Stephane Eranian
  2021-09-09 10:44   ` kernel test robot
@ 2021-09-09 15:33   ` kernel test robot
  1 sibling, 0 replies; 41+ messages in thread
From: kernel test robot @ 2021-09-09 15:33 UTC (permalink / raw)
  To: Stephane Eranian, linux-kernel
  Cc: llvm, kbuild-all, acme, jolsa, kim.phillips, namhyung, irogers

[-- Attachment #1: Type: text/plain, Size: 2494 bytes --]

Hi Stephane,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on tip/perf/core]
[also build test WARNING on next-20210909]
[cannot apply to tip/sched/core tip/x86/core v5.14]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Stephane-Eranian/perf-x86-amd-Add-AMD-Fam19h-Branch-Sampling-support/20210909-160050
base:   https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git 4034fb207e302cc0b1f304084d379640c1fb1436
config: i386-randconfig-a012-20210908 (attached as .config)
compiler: clang version 14.0.0 (https://github.com/llvm/llvm-project 261cbe98c38f8c1ee1a482fe76511110e790f58a)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/0day-ci/linux/commit/7ff5f8448eecfc188c2240b2a5a5ea0b00f55662
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Stephane-Eranian/perf-x86-amd-Add-AMD-Fam19h-Branch-Sampling-support/20210909-160050
        git checkout 7ff5f8448eecfc188c2240b2a5a5ea0b00f55662
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross ARCH=i386 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

>> arch/x86/events/amd/core.c:676:24: warning: variable 'hwc' set but not used [-Wunused-but-set-variable]
           struct hw_perf_event *hwc;
                                 ^
   1 warning generated.


vim +/hwc +676 arch/x86/events/amd/core.c

   672	
   673	static void amd_pmu_enable_all(int added)
   674	{
   675		struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
 > 676		struct hw_perf_event *hwc;
   677		int idx;
   678	
   679		for (idx = 0; idx < x86_pmu.num_counters; idx++) {
   680			hwc = &cpuc->events[idx]->hw;
   681	
   682			/* only activate events which are marked as active */
   683			if (!test_bit(idx, cpuc->active_mask))
   684				continue;
   685	
   686			amd_pmu_enable_event(cpuc->events[idx]);
   687		}
   688	}
   689	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 35138 bytes --]

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v1 01/13] perf/core: add union to struct perf_branch_entry
  2021-09-09  7:56 ` [PATCH v1 01/13] perf/core: add union to struct perf_branch_entry Stephane Eranian
@ 2021-09-09 19:03   ` Peter Zijlstra
  2021-09-10 12:09     ` Michael Ellerman
  0 siblings, 1 reply; 41+ messages in thread
From: Peter Zijlstra @ 2021-09-09 19:03 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: linux-kernel, acme, jolsa, kim.phillips, namhyung, irogers,
	atrajeev, maddy, mpe

On Thu, Sep 09, 2021 at 12:56:48AM -0700, Stephane Eranian wrote:
> diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
> index f92880a15645..eb11f383f4be 100644
> --- a/include/uapi/linux/perf_event.h
> +++ b/include/uapi/linux/perf_event.h
> @@ -1329,13 +1329,18 @@ union perf_mem_data_src {
>  struct perf_branch_entry {
>  	__u64	from;
>  	__u64	to;
> -	__u64	mispred:1,  /* target mispredicted */
> -		predicted:1,/* target predicted */
> -		in_tx:1,    /* in transaction */
> -		abort:1,    /* transaction abort */
> -		cycles:16,  /* cycle count to last branch */
> -		type:4,     /* branch type */
> -		reserved:40;
> +	union {
> +		__u64	val;	    /* to make it easier to clear all fields */
> +		struct {
> +			__u64	mispred:1,  /* target mispredicted */
> +				predicted:1,/* target predicted */
> +				in_tx:1,    /* in transaction */
> +				abort:1,    /* transaction abort */
> +				cycles:16,  /* cycle count to last branch */
> +				type:4,     /* branch type */
> +				reserved:40;
> +		};
> +	};
>  };


Hurpmh... all other bitfields have ENDIAN_BITFIELD things except this
one. Power folks, could you please have a look?
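
(The "ENDIAN_BITFIELD things" refers to the pattern used for other bitfield
unions in this uapi header, where the field order is reversed under
__BIG_ENDIAN_BITFIELD so that every field keeps a stable bit position in the
u64. Applied to this struct it would look roughly like the sketch below,
although, as discussed in the follow-ups, it cannot be retrofitted without
breaking the existing ABI.)

	union {
		__u64	val;
		struct {
#if defined(__LITTLE_ENDIAN_BITFIELD)
			__u64	mispred:1,
				predicted:1,
				in_tx:1,
				abort:1,
				cycles:16,
				type:4,
				reserved:40;
#elif defined(__BIG_ENDIAN_BITFIELD)
			__u64	reserved:40,
				type:4,
				cycles:16,
				abort:1,
				in_tx:1,
				predicted:1,
				mispred:1;
#endif
		};
	};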

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v1 01/13] perf/core: add union to struct perf_branch_entry
  2021-09-09 19:03   ` Peter Zijlstra
@ 2021-09-10 12:09     ` Michael Ellerman
  2021-09-10 14:16       ` Michael Ellerman
  0 siblings, 1 reply; 41+ messages in thread
From: Michael Ellerman @ 2021-09-10 12:09 UTC (permalink / raw)
  To: Peter Zijlstra, Stephane Eranian
  Cc: linux-kernel, acme, jolsa, kim.phillips, namhyung, irogers,
	atrajeev, maddy

Peter Zijlstra <peterz@infradead.org> writes:
> On Thu, Sep 09, 2021 at 12:56:48AM -0700, Stephane Eranian wrote:
>> diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
>> index f92880a15645..eb11f383f4be 100644
>> --- a/include/uapi/linux/perf_event.h
>> +++ b/include/uapi/linux/perf_event.h
>> @@ -1329,13 +1329,18 @@ union perf_mem_data_src {
>>  struct perf_branch_entry {
>>  	__u64	from;
>>  	__u64	to;
>> -	__u64	mispred:1,  /* target mispredicted */
>> -		predicted:1,/* target predicted */
>> -		in_tx:1,    /* in transaction */
>> -		abort:1,    /* transaction abort */
>> -		cycles:16,  /* cycle count to last branch */
>> -		type:4,     /* branch type */
>> -		reserved:40;
>> +	union {
>> +		__u64	val;	    /* to make it easier to clear all fields */
>> +		struct {
>> +			__u64	mispred:1,  /* target mispredicted */
>> +				predicted:1,/* target predicted */
>> +				in_tx:1,    /* in transaction */
>> +				abort:1,    /* transaction abort */
>> +				cycles:16,  /* cycle count to last branch */
>> +				type:4,     /* branch type */
>> +				reserved:40;
>> +		};
>> +	};
>>  };
>
>
> Hurpmh... all other bitfields have ENDIAN_BITFIELD things except this
> one. Power folks, could you please have a look?

The bit number of each field changes between big and little endian, but
as long as kernel and userspace are the same endian, and both only
access values via the bitfields then it works.

It's preferable to have the ENDIAN_BITFIELD thing, so that the bit
numbers are stable, but we can't change it now without breaking ABI :/

Adding the union risks having code that accesses val and expects to see
mispred in bit 0 for example, which it's not on big endian.

If it's just for clearing easily we could do that with a static inline
that sets all the bitfields. With my compiler here (GCC 10) it's smart
enough to just turn it into a single unsigned long store of 0.

eg:

static inline void clear_perf_branch_entry_flags(struct perf_branch_entry *e)
{
        e->mispred = 0;
        e->predicted = 0;
        e->in_tx = 0;
        e->abort = 0;
        e->cycles = 0;
        e->type = 0;
        e->reserved = 0;
}
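
A hypothetical call site, just to show the intended use (fill_one_entry
is made up for illustration):

static void fill_one_entry(struct perf_branch_entry *e, u64 from, u64 to)
{
	clear_perf_branch_entry_flags(e);	/* zero every flag bitfield */
	e->from = from;
	e->to   = to;
	/* set individual flags afterwards as needed, e.g. e->mispred = 1; */
}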


It does look like we have a bug in perf tool though, if I take a
perf.data from a big endian system to a little endian one I don't see
any of the branch flags decoded. eg:

BE:

2413132652524 0x1db8 [0x2d0]: PERF_RECORD_SAMPLE(IP, 0x1): 5279/5279: 0xc00000000045c028 period: 923003 addr: 0
... branch stack: nr:28
.....  0: c00000000045c028 -> c00000000dce7604 0 cycles  P   0

LE:

2413132652524 0x1db8 [0x2d0]: PERF_RECORD_SAMPLE(IP, 0x1): 5279/5279: 0xc00000000045c028 period: 923003 addr: 0
... branch stack: nr:28
.....  0: c00000000045c028 -> c00000000dce7604 0 cycles      0
                                                         ^
                                                         missing P

I guess we're missing a byte swap somewhere.

cheers

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v1 01/13] perf/core: add union to struct perf_branch_entry
  2021-09-10 12:09     ` Michael Ellerman
@ 2021-09-10 14:16       ` Michael Ellerman
  2021-09-15  6:03         ` Stephane Eranian
  0 siblings, 1 reply; 41+ messages in thread
From: Michael Ellerman @ 2021-09-10 14:16 UTC (permalink / raw)
  To: Peter Zijlstra, Stephane Eranian
  Cc: linux-kernel, acme, jolsa, kim.phillips, namhyung, irogers,
	atrajeev, maddy

Michael Ellerman <mpe@ellerman.id.au> writes:
> Peter Zijlstra <peterz@infradead.org> writes:
>> On Thu, Sep 09, 2021 at 12:56:48AM -0700, Stephane Eranian wrote:
>>> diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
>>> index f92880a15645..eb11f383f4be 100644
>>> --- a/include/uapi/linux/perf_event.h
>>> +++ b/include/uapi/linux/perf_event.h
>>> @@ -1329,13 +1329,18 @@ union perf_mem_data_src {
>>>  struct perf_branch_entry {
>>>  	__u64	from;
>>>  	__u64	to;
>>> -	__u64	mispred:1,  /* target mispredicted */
>>> -		predicted:1,/* target predicted */
>>> -		in_tx:1,    /* in transaction */
>>> -		abort:1,    /* transaction abort */
>>> -		cycles:16,  /* cycle count to last branch */
>>> -		type:4,     /* branch type */
>>> -		reserved:40;
>>> +	union {
>>> +		__u64	val;	    /* to make it easier to clear all fields */
>>> +		struct {
>>> +			__u64	mispred:1,  /* target mispredicted */
>>> +				predicted:1,/* target predicted */
>>> +				in_tx:1,    /* in transaction */
>>> +				abort:1,    /* transaction abort */
>>> +				cycles:16,  /* cycle count to last branch */
>>> +				type:4,     /* branch type */
>>> +				reserved:40;
>>> +		};
>>> +	};
>>>  };
>>
>>
>> Hurpmh... all other bitfields have ENDIAN_BITFIELD things except this
>> one. Power folks, could you please have a look?
>
> The bit number of each field changes between big and little endian, but
> as long as kernel and userspace are the same endian, and both only
> access values via the bitfields then it works.
...
>
> It does look like we have a bug in perf tool though, if I take a
> perf.data from a big endian system to a little endian one I don't see
> any of the branch flags decoded. eg:
>
> BE:
>
> 2413132652524 0x1db8 [0x2d0]: PERF_RECORD_SAMPLE(IP, 0x1): 5279/5279: 0xc00000000045c028 period: 923003 addr: 0
> ... branch stack: nr:28
> .....  0: c00000000045c028 -> c00000000dce7604 0 cycles  P   0
>
> LE:
>
> 2413132652524 0x1db8 [0x2d0]: PERF_RECORD_SAMPLE(IP, 0x1): 5279/5279: 0xc00000000045c028 period: 923003 addr: 0
> ... branch stack: nr:28
> .....  0: c00000000045c028 -> c00000000dce7604 0 cycles      0
>                                                          ^
>                                                          missing P
>
> I guess we're missing a byte swap somewhere.

Ugh. We _do_ have a byte swap, but we also need a bit swap.

That works for the single bit fields, not sure if it will for the
multi-bit fields.

So that's a bit of a mess :/
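
To make the transformation concrete: a minimal sketch of repositioning a
single field, assuming LE bitfield allocation starts at bit 0 and BE at
bit 63 (an illustration of the idea only, not a tested fix):

/* Mirror the `size`-bit field found at bit `pos` to the position the
 * opposite bit allocation would have used. */
static inline unsigned long long
mirror_field(unsigned long long src, int pos, int size)
{
	return ((src >> pos) & ((1ULL << size) - 1)) << (63 - (pos + size - 1));
}

/* e.g. mispred (pos 0, size 1) moves to bit 63,
 * cycles (pos 4, size 16) moves to bits 44..59. */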

cheers

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v1 11/13] perf tools: improve IBS error handling
  2021-09-09  7:56 ` [PATCH v1 11/13] perf tools: improve IBS error handling Stephane Eranian
@ 2021-09-13 19:34   ` Arnaldo Carvalho de Melo
  2021-10-04 21:57     ` Kim Phillips
  0 siblings, 1 reply; 41+ messages in thread
From: Arnaldo Carvalho de Melo @ 2021-09-13 19:34 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: linux-kernel, peterz, acme, jolsa, kim.phillips, namhyung, irogers

Em Thu, Sep 09, 2021 at 12:56:58AM -0700, Stephane Eranian escreveu:
> From: Kim Phillips <kim.phillips@amd.com>
> 
> This patch improves the error message returned on failed perf_event_open() on
> AMD when using IBS.
> 
> Signed-off-by: Kim Phillips <kim.phillips@amd.com>
> ---
>  tools/perf/util/evsel.c | 42 +++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 42 insertions(+)
> 
> diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
> index f61e5dd53f5d..f203f178fdb9 100644
> --- a/tools/perf/util/evsel.c
> +++ b/tools/perf/util/evsel.c
> @@ -2684,12 +2684,52 @@ static bool find_process(const char *name)
>  	return ret ? false : true;
>  }
>  
> +static bool is_amd;
> +
> +static char *fgrep(FILE *inf, const char *str)
> +{
> +	char line[256];
> +	int slen = strlen(str);
> +
> +	while (!feof(inf)) {
> +		if (!fgets(line, 256, inf))
> +			break;
> +		if (strncmp(line, str, slen))
> +			continue;
> +
> +		return strdup(line);
> +	}
> +
> +	return NULL;
> +}
> +
> +static void detect_amd(void)
> +{
> +	FILE *inf = fopen("/proc/cpuinfo", "r");
> +	char *res;
> +
> +	if (!inf)
> +		return;
> +
> +	res = fgrep(inf, "vendor_id");
> +
> +	if (res) {
> +		char *s = strchr(res, ':');
> +
> +		is_amd = s && !strcmp(s, ": AuthenticAMD\n");
> +		free(res);
> +	}
> +	fclose(inf);
> +}
> +

We have perf_env for such details, for instance in
tools/perf/util/sample-raw.c we have:

        const char *arch_pf = perf_env__arch(evlist->env);
        const char *cpuid = perf_env__cpuid(evlist->env);

	        else if (arch_pf && !strcmp("x86", arch_pf) &&
                 cpuid && strstarts(cpuid, "AuthenticAMD") &&
                 evlist__has_amd_ibs(evlist)) {
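
Along those lines, a sketch of how the AMD check could lean on perf_env
instead of re-parsing /proc/cpuinfo every time (assumes evsel->evlist->env
is populated and strstarts() is available, as in sample-raw.c; untested):

static bool evsel__is_amd(struct evsel *evsel)
{
	const char *cpuid = perf_env__cpuid(evsel->evlist->env);

	/* cpuid string as recorded/cached by perf_env */
	return cpuid && strstarts(cpuid, "AuthenticAMD");
}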



>  int evsel__open_strerror(struct evsel *evsel, struct target *target,
>  			 int err, char *msg, size_t size)
>  {
>  	char sbuf[STRERR_BUFSIZE];
>  	int printed = 0, enforced = 0;
>  
> +	detect_amd();
> +
>  	switch (err) {
>  	case EPERM:
>  	case EACCES:
> @@ -2782,6 +2822,8 @@ int evsel__open_strerror(struct evsel *evsel, struct target *target,
>  			return scnprintf(msg, size, "wrong clockid (%d).", clockid);
>  		if (perf_missing_features.aux_output)
>  			return scnprintf(msg, size, "The 'aux_output' feature is not supported, update the kernel.");
> +		if (is_amd && (evsel->core.attr.precise_ip || !strncmp(evsel->pmu_name, "ibs", 3)) && (evsel->core.attr.exclude_kernel))
> +			return scnprintf(msg, size, "AMD IBS can't exclude kernel events.  Try running at a higher privilege level.");
>  		break;
>  	case ENODATA:
>  		return scnprintf(msg, size, "Cannot collect data source with the load latency event alone. "
> -- 
> 2.33.0.153.gba50c8fa24-goog

-- 

- Arnaldo

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v1 00/13] perf/x86/amd: Add AMD Fam19h Branch Sampling support
  2021-09-09  8:55 ` [PATCH v1 00/13] perf/x86/amd: Add AMD Fam19h Branch Sampling support Peter Zijlstra
@ 2021-09-15  5:55   ` Stephane Eranian
  2021-09-15  9:04     ` Peter Zijlstra
  2021-09-27 20:17     ` Song Liu
  0 siblings, 2 replies; 41+ messages in thread
From: Stephane Eranian @ 2021-09-15  5:55 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, acme, jolsa, kim.phillips, namhyung, irogers

On Thu, Sep 9, 2021 at 1:55 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Thu, Sep 09, 2021 at 12:56:47AM -0700, Stephane Eranian wrote:
> > This patch series adds support for the AMD Fam19h 16-deep branch sampling
> > feature as described in the AMD PPR Fam19h Model 01h Revision B1 section 2.1.13.
>
> Yay..
>
> > BRS interacts with the NMI interrupt as well. Because enabling BRS is expensive,
> > it is only activated after P event occurrences, where P is the desired sampling period.
> > At P occurrences of the event, the counter overflows, the CPU catches the NMI interrupt,
> > activates BRS for 16 branches until it saturates, and then delivers the NMI to the kernel.
>
> WTF... ?!? Srsly? You're joking right?
>

As I said, this is because of the cost of running BRS, usually for
millions of branches, only to keep the last 16.
Running branch sampling in general, on any arch, is never totally free.

>
> Also, can you please fix you MUA to wrap at 78 chars like normal people?

Ok, I fixed that now.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v1 01/13] perf/core: add union to struct perf_branch_entry
  2021-09-10 14:16       ` Michael Ellerman
@ 2021-09-15  6:03         ` Stephane Eranian
  2021-09-17  6:37           ` Madhavan Srinivasan
  0 siblings, 1 reply; 41+ messages in thread
From: Stephane Eranian @ 2021-09-15  6:03 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Peter Zijlstra, linux-kernel, acme, jolsa, kim.phillips,
	namhyung, irogers, atrajeev, maddy

Michael,


On Fri, Sep 10, 2021 at 7:16 AM Michael Ellerman <mpe@ellerman.id.au> wrote:
>
> Michael Ellerman <mpe@ellerman.id.au> writes:
> > Peter Zijlstra <peterz@infradead.org> writes:
> >> On Thu, Sep 09, 2021 at 12:56:48AM -0700, Stephane Eranian wrote:
> >>> diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
> >>> index f92880a15645..eb11f383f4be 100644
> >>> --- a/include/uapi/linux/perf_event.h
> >>> +++ b/include/uapi/linux/perf_event.h
> >>> @@ -1329,13 +1329,18 @@ union perf_mem_data_src {
> >>>  struct perf_branch_entry {
> >>>     __u64   from;
> >>>     __u64   to;
> >>> -   __u64   mispred:1,  /* target mispredicted */
> >>> -           predicted:1,/* target predicted */
> >>> -           in_tx:1,    /* in transaction */
> >>> -           abort:1,    /* transaction abort */
> >>> -           cycles:16,  /* cycle count to last branch */
> >>> -           type:4,     /* branch type */
> >>> -           reserved:40;
> >>> +   union {
> >>> +           __u64   val;        /* to make it easier to clear all fields */
> >>> +           struct {
> >>> +                   __u64   mispred:1,  /* target mispredicted */
> >>> +                           predicted:1,/* target predicted */
> >>> +                           in_tx:1,    /* in transaction */
> >>> +                           abort:1,    /* transaction abort */
> >>> +                           cycles:16,  /* cycle count to last branch */
> >>> +                           type:4,     /* branch type */
> >>> +                           reserved:40;
> >>> +           };
> >>> +   };
> >>>  };
> >>
> >>
> >> Hurpmh... all other bitfields have ENDIAN_BITFIELD things except this
> >> one. Power folks, could you please have a look?
> >
> > The bit number of each field changes between big and little endian, but
> > as long as kernel and userspace are the same endian, and both only
> > access values via the bitfields then it works.
> ...
> >
> > It does look like we have a bug in perf tool though, if I take a
> > perf.data from a big endian system to a little endian one I don't see
> > any of the branch flags decoded. eg:
> >
> > BE:
> >
> > 2413132652524 0x1db8 [0x2d0]: PERF_RECORD_SAMPLE(IP, 0x1): 5279/5279: 0xc00000000045c028 period: 923003 addr: 0
> > ... branch stack: nr:28
> > .....  0: c00000000045c028 -> c00000000dce7604 0 cycles  P   0
> >
> > LE:
> >
> > 2413132652524 0x1db8 [0x2d0]: PERF_RECORD_SAMPLE(IP, 0x1): 5279/5279: 0xc00000000045c028 period: 923003 addr: 0
> > ... branch stack: nr:28
> > .....  0: c00000000045c028 -> c00000000dce7604 0 cycles      0
> >                                                          ^
> >                                                          missing P
> >
> > I guess we're missing a byte swap somewhere.
>
> Ugh. We _do_ have a byte swap, but we also need a bit swap.
>
> That works for the single bit fields, not sure if it will for the
> multi-bit fields.
>
> So that's a bit of a mess :/
>
Based on what I see in perf_event.h for other structures, I think I
can make up what you would need for struct branch_entry. But it would
be easier if you could send me a patch that you would have verified on
your systems.
Thanks.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v1 00/13] perf/x86/amd: Add AMD Fam19h Branch Sampling support
  2021-09-15  5:55   ` Stephane Eranian
@ 2021-09-15  9:04     ` Peter Zijlstra
  2021-10-28 18:30       ` Stephane Eranian
  2021-09-27 20:17     ` Song Liu
  1 sibling, 1 reply; 41+ messages in thread
From: Peter Zijlstra @ 2021-09-15  9:04 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: linux-kernel, acme, jolsa, kim.phillips, namhyung, irogers

On Tue, Sep 14, 2021 at 10:55:12PM -0700, Stephane Eranian wrote:
> On Thu, Sep 9, 2021 at 1:55 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Thu, Sep 09, 2021 at 12:56:47AM -0700, Stephane Eranian wrote:
> > > This patch series adds support for the AMD Fam19h 16-deep branch sampling
> > > feature as described in the AMD PPR Fam19h Model 01h Revision B1 section 2.1.13.
> >
> > Yay..
> >
> > > BRS interacts with the NMI interrupt as well. Because enabling BRS is expensive,
> > > it is only activated after P event occurrences, where P is the desired sampling period.
> > > At P occurrences of the event, the counter overflows, the CPU catches the NMI interrupt,
> > > activates BRS for 16 branches until it saturates, and then delivers the NMI to the kernel.
> >
> > WTF... ?!? Srsly? You're joking right?
> >
> 
> As I said, this is because of the cost of running BRS usually for
> millions of branches to keep only the last 16.
> Running branch sampling in general on any arch is  never totally free.

Holding up the NMI will disrupt the sampling of the other events, which
is, IMO, unacceptable and would require this event to be exclusive on the
whole PMU, simply because sharing it doesn't work.

(also, other NMI sources might object)

Also, by only having LBRs post overflow you can't apply LBR based
analysis to other events, which seems quite limiting.

This really seems like a very sub-optimal solution. I mean, it's awesome
AMD gets branch records, but this seems a very poor solution.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v1 01/13] perf/core: add union to struct perf_branch_entry
  2021-09-15  6:03         ` Stephane Eranian
@ 2021-09-17  6:37           ` Madhavan Srinivasan
  2021-09-17  6:48             ` Stephane Eranian
  0 siblings, 1 reply; 41+ messages in thread
From: Madhavan Srinivasan @ 2021-09-17  6:37 UTC (permalink / raw)
  To: Stephane Eranian, Michael Ellerman
  Cc: Peter Zijlstra, linux-kernel, acme, jolsa, kim.phillips,
	namhyung, irogers, atrajeev, Arnaldo Carvalho de Melo


On 9/15/21 11:33 AM, Stephane Eranian wrote:
> Michael,
>
>
> On Fri, Sep 10, 2021 at 7:16 AM Michael Ellerman <mpe@ellerman.id.au> wrote:
>> Michael Ellerman <mpe@ellerman.id.au> writes:
>>> Peter Zijlstra <peterz@infradead.org> writes:
>>>> On Thu, Sep 09, 2021 at 12:56:48AM -0700, Stephane Eranian wrote:
>>>>> diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
>>>>> index f92880a15645..eb11f383f4be 100644
>>>>> --- a/include/uapi/linux/perf_event.h
>>>>> +++ b/include/uapi/linux/perf_event.h
>>>>> @@ -1329,13 +1329,18 @@ union perf_mem_data_src {
>>>>>   struct perf_branch_entry {
>>>>>      __u64   from;
>>>>>      __u64   to;
>>>>> -   __u64   mispred:1,  /* target mispredicted */
>>>>> -           predicted:1,/* target predicted */
>>>>> -           in_tx:1,    /* in transaction */
>>>>> -           abort:1,    /* transaction abort */
>>>>> -           cycles:16,  /* cycle count to last branch */
>>>>> -           type:4,     /* branch type */
>>>>> -           reserved:40;
>>>>> +   union {
>>>>> +           __u64   val;        /* to make it easier to clear all fields */
>>>>> +           struct {
>>>>> +                   __u64   mispred:1,  /* target mispredicted */
>>>>> +                           predicted:1,/* target predicted */
>>>>> +                           in_tx:1,    /* in transaction */
>>>>> +                           abort:1,    /* transaction abort */
>>>>> +                           cycles:16,  /* cycle count to last branch */
>>>>> +                           type:4,     /* branch type */
>>>>> +                           reserved:40;
>>>>> +           };
>>>>> +   };
>>>>>   };
>>>>
>>>> Hurpmh... all other bitfields have ENDIAN_BITFIELD things except this
>>>> one. Power folks, could you please have a look?
>>> The bit number of each field changes between big and little endian, but
>>> as long as kernel and userspace are the same endian, and both only
>>> access values via the bitfields then it works.
>> ...
>>> It does look like we have a bug in perf tool though, if I take a
>>> perf.data from a big endian system to a little endian one I don't see
>>> any of the branch flags decoded. eg:
>>>
>>> BE:
>>>
>>> 2413132652524 0x1db8 [0x2d0]: PERF_RECORD_SAMPLE(IP, 0x1): 5279/5279: 0xc00000000045c028 period: 923003 addr: 0
>>> ... branch stack: nr:28
>>> .....  0: c00000000045c028 -> c00000000dce7604 0 cycles  P   0
>>>
>>> LE:
>>>
>>> 2413132652524 0x1db8 [0x2d0]: PERF_RECORD_SAMPLE(IP, 0x1): 5279/5279: 0xc00000000045c028 period: 923003 addr: 0
>>> ... branch stack: nr:28
>>> .....  0: c00000000045c028 -> c00000000dce7604 0 cycles      0
>>>                                                           ^
>>>                                                           missing P
>>>
>>> I guess we're missing a byte swap somewhere.
>> Ugh. We _do_ have a byte swap, but we also need a bit swap.
>>
>> That works for the single bit fields, not sure if it will for the
>> multi-bit fields.
>>
>> So that's a bit of a mess :/
>>
> Based on what I see in perf_event.h for other structures, I think I
> can make up what you would need for struct branch_entry. But Iit would
> be easier if you could send me a patch that you would have verified on
> your systems.
> Thanks.
Attached patch fixes the issue. Have tested it in both the BE and LE cases.

Maddy

 From f816ba2e6ef8d5975f78442d7ecb50d66c3c4326 Mon Sep 17 00:00:00 2001
From: Madhavan Srinivasan <maddy@linux.ibm.com>
Date: Wed, 15 Sep 2021 22:29:09 +0530
Subject: [RFC PATCH] tools/perf: Add reverse_64b macro

branch_stack struct has bit field definition
producing different bit ordering for big/little endian.
Because of this, when branch_stack sample collected
in a BE system viewed/reported in a LE system,
bit fields of the branch stack are not presented
properly. To address this issue, a reverse_64b
macro is defined and introduced in evsel__parse_sample.

Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
---
  tools/perf/util/evsel.c | 35 +++++++++++++++++++++++++++++++++--
  1 file changed, 33 insertions(+), 2 deletions(-)

diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index dbfeceb2546c..3151606e516e 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -2221,6 +2221,9 @@ void __weak arch_perf_parse_sample_weight(struct 
perf_sample *data,
      data->weight = *array;
  }

+#define reverse_64b(src, pos, size)    \
+    (((src >> pos) & (( 1ull <<size) - 1)) << (63 - (pos + size - 1)))
+
  int evsel__parse_sample(struct evsel *evsel, union perf_event *event,
              struct perf_sample *data)
  {
@@ -2408,6 +2411,8 @@ int evsel__parse_sample(struct evsel *evsel, union 
perf_event *event,
      if (type & PERF_SAMPLE_BRANCH_STACK) {
          const u64 max_branch_nr = UINT64_MAX /
                        sizeof(struct branch_entry);
+        struct branch_entry *e;
+        unsigned i;

          OVERFLOW_CHECK_u64(array);
          data->branch_stack = (struct branch_stack *)array++;
@@ -2416,10 +2421,36 @@ int evsel__parse_sample(struct evsel *evsel, 
union perf_event *event,
              return -EFAULT;

          sz = data->branch_stack->nr * sizeof(struct branch_entry);
-        if (evsel__has_branch_hw_idx(evsel))
+        if (evsel__has_branch_hw_idx(evsel)) {
              sz += sizeof(u64);
-        else
+            e = &data->branch_stack->entries[0];
+        } else {
              data->no_hw_idx = true;
+            e = (struct branch_entry *)&data->branch_stack->hw_idx;
+        }
+
+        if (swapped) {
+            for (i = 0; i < data->branch_stack->nr; i++, e++) {
+                u64 new_val = 0;
+
+                /* mispred:1  target mispredicted */
+                new_val = reverse_64b(e->flags.value, 0, 1);
+                /* predicted:1  target predicted */
+                new_val |= reverse_64b(e->flags.value, 1, 1);
+                /* in_tx:1  in transaction */
+                new_val |= reverse_64b(e->flags.value, 2, 1);
+                /* abort:1  transaction abort */
+                new_val |= reverse_64b(e->flags.value, 3, 1);
+                /* cycles:16  cycle count to last branch */
+                new_val |= reverse_64b(e->flags.value, 4, 16);
+                /* type:4  branch type */
+                new_val |= reverse_64b(e->flags.value, 20, 4);
+                /* reserved:40 */
+                new_val |= reverse_64b(e->flags.value, 24, 40);
+                e->flags.value = new_val;
+            }
+        }
+
          OVERFLOW_CHECK(array, sz, max_size);
          array = (void *)array + sz;
      }
-- 
2.31.1



^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: [PATCH v1 01/13] perf/core: add union to struct perf_branch_entry
  2021-09-17  6:37           ` Madhavan Srinivasan
@ 2021-09-17  6:48             ` Stephane Eranian
  2021-09-17  7:05               ` Michael Ellerman
  0 siblings, 1 reply; 41+ messages in thread
From: Stephane Eranian @ 2021-09-17  6:48 UTC (permalink / raw)
  To: Madhavan Srinivasan
  Cc: Michael Ellerman, Peter Zijlstra, linux-kernel, acme, jolsa,
	kim.phillips, namhyung, irogers, atrajeev,
	Arnaldo Carvalho de Melo

Hi,


Thanks for fixing this in the perf tool. But what about the struct
branch_entry in the header?


On Thu, Sep 16, 2021 at 11:38 PM Madhavan Srinivasan
<maddy@linux.ibm.com> wrote:
>
>
> On 9/15/21 11:33 AM, Stephane Eranian wrote:
> > Michael,
> >
> >
> > On Fri, Sep 10, 2021 at 7:16 AM Michael Ellerman <mpe@ellerman.id.au> wrote:
> >> Michael Ellerman <mpe@ellerman.id.au> writes:
> >>> Peter Zijlstra <peterz@infradead.org> writes:
> >>>> On Thu, Sep 09, 2021 at 12:56:48AM -0700, Stephane Eranian wrote:
> >>>>> diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
> >>>>> index f92880a15645..eb11f383f4be 100644
> >>>>> --- a/include/uapi/linux/perf_event.h
> >>>>> +++ b/include/uapi/linux/perf_event.h
> >>>>> @@ -1329,13 +1329,18 @@ union perf_mem_data_src {
> >>>>>   struct perf_branch_entry {
> >>>>>      __u64   from;
> >>>>>      __u64   to;
> >>>>> -   __u64   mispred:1,  /* target mispredicted */
> >>>>> -           predicted:1,/* target predicted */
> >>>>> -           in_tx:1,    /* in transaction */
> >>>>> -           abort:1,    /* transaction abort */
> >>>>> -           cycles:16,  /* cycle count to last branch */
> >>>>> -           type:4,     /* branch type */
> >>>>> -           reserved:40;
> >>>>> +   union {
> >>>>> +           __u64   val;        /* to make it easier to clear all fields */
> >>>>> +           struct {
> >>>>> +                   __u64   mispred:1,  /* target mispredicted */
> >>>>> +                           predicted:1,/* target predicted */
> >>>>> +                           in_tx:1,    /* in transaction */
> >>>>> +                           abort:1,    /* transaction abort */
> >>>>> +                           cycles:16,  /* cycle count to last branch */
> >>>>> +                           type:4,     /* branch type */
> >>>>> +                           reserved:40;
> >>>>> +           };
> >>>>> +   };
> >>>>>   };
> >>>>
> >>>> Hurpmh... all other bitfields have ENDIAN_BITFIELD things except this
> >>>> one. Power folks, could you please have a look?
> >>> The bit number of each field changes between big and little endian, but
> >>> as long as kernel and userspace are the same endian, and both only
> >>> access values via the bitfields then it works.
> >> ...
> >>> It does look like we have a bug in perf tool though, if I take a
> >>> perf.data from a big endian system to a little endian one I don't see
> >>> any of the branch flags decoded. eg:
> >>>
> >>> BE:
> >>>
> >>> 2413132652524 0x1db8 [0x2d0]: PERF_RECORD_SAMPLE(IP, 0x1): 5279/5279: 0xc00000000045c028 period: 923003 addr: 0
> >>> ... branch stack: nr:28
> >>> .....  0: c00000000045c028 -> c00000000dce7604 0 cycles  P   0
> >>>
> >>> LE:
> >>>
> >>> 2413132652524 0x1db8 [0x2d0]: PERF_RECORD_SAMPLE(IP, 0x1): 5279/5279: 0xc00000000045c028 period: 923003 addr: 0
> >>> ... branch stack: nr:28
> >>> .....  0: c00000000045c028 -> c00000000dce7604 0 cycles      0
> >>>                                                           ^
> >>>                                                           missing P
> >>>
> >>> I guess we're missing a byte swap somewhere.
> >> Ugh. We _do_ have a byte swap, but we also need a bit swap.
> >>
> >> That works for the single bit fields, not sure if it will for the
> >> multi-bit fields.
> >>
> >> So that's a bit of a mess :/
> >>
> > Based on what I see in perf_event.h for other structures, I think I
> > can make up what you would need for struct branch_entry. But Iit would
> > be easier if you could send me a patch that you would have verified on
> > your systems.
> > Thanks.
> Attached patch fixes the issue. Have tested both in both in BE and LE case.
>
> Maddy
>
>  From f816ba2e6ef8d5975f78442d7ecb50d66c3c4326 Mon Sep 17 00:00:00 2001
> From: Madhavan Srinivasan <maddy@linux.ibm.com>
> Date: Wed, 15 Sep 2021 22:29:09 +0530
> Subject: [RFC PATCH] tools/perf: Add reverse_64b macro
>
> branch_stack struct has bit field definition
> producing different bit ordering for big/little endian.
> Because of this, when branch_stack sample collected
> in a BE system viewed/reported in a LE system,
> bit fields of the branch stack are not presented
> properly. To address this issue, a reverse_64b
> macro is defined and introduced in evsel__parse_sample.
>
> Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
> ---
>   tools/perf/util/evsel.c | 35 +++++++++++++++++++++++++++++++++--
>   1 file changed, 33 insertions(+), 2 deletions(-)
>
> diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
> index dbfeceb2546c..3151606e516e 100644
> --- a/tools/perf/util/evsel.c
> +++ b/tools/perf/util/evsel.c
> @@ -2221,6 +2221,9 @@ void __weak arch_perf_parse_sample_weight(struct
> perf_sample *data,
>       data->weight = *array;
>   }
>
> +#define reverse_64b(src, pos, size)    \
> +    (((src >> pos) & (( 1ull <<size) - 1)) << (63 - (pos + size - 1)))
> +
>   int evsel__parse_sample(struct evsel *evsel, union perf_event *event,
>               struct perf_sample *data)
>   {
> @@ -2408,6 +2411,8 @@ int evsel__parse_sample(struct evsel *evsel, union
> perf_event *event,
>       if (type & PERF_SAMPLE_BRANCH_STACK) {
>           const u64 max_branch_nr = UINT64_MAX /
>                         sizeof(struct branch_entry);
> +        struct branch_entry *e;
> +        unsigned i;
>
>           OVERFLOW_CHECK_u64(array);
>           data->branch_stack = (struct branch_stack *)array++;
> @@ -2416,10 +2421,36 @@ int evsel__parse_sample(struct evsel *evsel,
> union perf_event *event,
>               return -EFAULT;
>
>           sz = data->branch_stack->nr * sizeof(struct branch_entry);
> -        if (evsel__has_branch_hw_idx(evsel))
> +        if (evsel__has_branch_hw_idx(evsel)) {
>               sz += sizeof(u64);
> -        else
> +            e = &data->branch_stack->entries[0];
> +        } else {
>               data->no_hw_idx = true;
> +            e = (struct branch_entry *)&data->branch_stack->hw_idx;
> +        }
> +
> +        if (swapped) {
> +            for (i = 0; i < data->branch_stack->nr; i++, e++) {
> +                u64 new_val = 0;
> +
> +                /* mispred:1  target mispredicted */
> +                new_val = reverse_64b(e->flags.value, 0, 1);
> +                /* predicted:1  target predicted */
> +                new_val |= reverse_64b(e->flags.value, 1, 1);
> +                /* in_tx:1  in transaction */
> +                new_val |= reverse_64b(e->flags.value, 2, 1);
> +                /* abort:1  transaction abort */
> +                new_val |= reverse_64b(e->flags.value, 3, 1);
> +                /* cycles:16  cycle count to last branch */
> +                new_val |= reverse_64b(e->flags.value, 4, 16);
> +                /* type:4  branch type */
> +                new_val |= reverse_64b(e->flags.value, 20, 4);
> +                /* reserved:40 */
> +                new_val |= reverse_64b(e->flags.value, 24, 40);
> +                e->flags.value = new_val;
> +            }
> +        }
> +
>           OVERFLOW_CHECK(array, sz, max_size);
>           array = (void *)array + sz;
>       }
> --
> 2.31.1
>
>

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v1 01/13] perf/core: add union to struct perf_branch_entry
  2021-09-17  6:48             ` Stephane Eranian
@ 2021-09-17  7:05               ` Michael Ellerman
  2021-09-17  7:39                 ` Stephane Eranian
  0 siblings, 1 reply; 41+ messages in thread
From: Michael Ellerman @ 2021-09-17  7:05 UTC (permalink / raw)
  To: Stephane Eranian, Madhavan Srinivasan
  Cc: Peter Zijlstra, linux-kernel, acme, jolsa, kim.phillips,
	namhyung, irogers, atrajeev, Arnaldo Carvalho de Melo

Stephane Eranian <eranian@google.com> writes:
> Hi,
>
> Thanks for fixing this in the perf tool. But what about the struct
> branch_entry in the header?

I'm not sure what you mean.

We can't change the order of the fields in the header, without breaking
existing userspace on BE systems.

It's annoying that the bit numbers are different between LE & BE, but I
think it's too late to change that.

So nothing should change in the branch_entry definition in the header.

My comment on your patch was that adding the union with val, makes it
easier to misuse the bitfields, because now the values can be accessed
via the bitfields and also via val, but when using val you have to know
that the bit numbers differ between BE/LE.

Maybe that's over-paranoid on my part, but if we just want val for
clearing the values easily then I think the static inline I suggested is
preferable.

cheers

> On Thu, Sep 16, 2021 at 11:38 PM Madhavan Srinivasan
> <maddy@linux.ibm.com> wrote:
>>
>>
>> On 9/15/21 11:33 AM, Stephane Eranian wrote:
>> > Michael,
>> >
>> >
>> > On Fri, Sep 10, 2021 at 7:16 AM Michael Ellerman <mpe@ellerman.id.au> wrote:
>> >> Michael Ellerman <mpe@ellerman.id.au> writes:
>> >>> Peter Zijlstra <peterz@infradead.org> writes:
>> >>>> On Thu, Sep 09, 2021 at 12:56:48AM -0700, Stephane Eranian wrote:
>> >>>>> diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
>> >>>>> index f92880a15645..eb11f383f4be 100644
>> >>>>> --- a/include/uapi/linux/perf_event.h
>> >>>>> +++ b/include/uapi/linux/perf_event.h
>> >>>>> @@ -1329,13 +1329,18 @@ union perf_mem_data_src {
>> >>>>>   struct perf_branch_entry {
>> >>>>>      __u64   from;
>> >>>>>      __u64   to;
>> >>>>> -   __u64   mispred:1,  /* target mispredicted */
>> >>>>> -           predicted:1,/* target predicted */
>> >>>>> -           in_tx:1,    /* in transaction */
>> >>>>> -           abort:1,    /* transaction abort */
>> >>>>> -           cycles:16,  /* cycle count to last branch */
>> >>>>> -           type:4,     /* branch type */
>> >>>>> -           reserved:40;
>> >>>>> +   union {
>> >>>>> +           __u64   val;        /* to make it easier to clear all fields */
>> >>>>> +           struct {
>> >>>>> +                   __u64   mispred:1,  /* target mispredicted */
>> >>>>> +                           predicted:1,/* target predicted */
>> >>>>> +                           in_tx:1,    /* in transaction */
>> >>>>> +                           abort:1,    /* transaction abort */
>> >>>>> +                           cycles:16,  /* cycle count to last branch */
>> >>>>> +                           type:4,     /* branch type */
>> >>>>> +                           reserved:40;
>> >>>>> +           };
>> >>>>> +   };
>> >>>>>   };
>> >>>>
>> >>>> Hurpmh... all other bitfields have ENDIAN_BITFIELD things except this
>> >>>> one. Power folks, could you please have a look?
>> >>> The bit number of each field changes between big and little endian, but
>> >>> as long as kernel and userspace are the same endian, and both only
>> >>> access values via the bitfields then it works.
>> >> ...
>> >>> It does look like we have a bug in perf tool though, if I take a
>> >>> perf.data from a big endian system to a little endian one I don't see
>> >>> any of the branch flags decoded. eg:
>> >>>
>> >>> BE:
>> >>>
>> >>> 2413132652524 0x1db8 [0x2d0]: PERF_RECORD_SAMPLE(IP, 0x1): 5279/5279: 0xc00000000045c028 period: 923003 addr: 0
>> >>> ... branch stack: nr:28
>> >>> .....  0: c00000000045c028 -> c00000000dce7604 0 cycles  P   0
>> >>>
>> >>> LE:
>> >>>
>> >>> 2413132652524 0x1db8 [0x2d0]: PERF_RECORD_SAMPLE(IP, 0x1): 5279/5279: 0xc00000000045c028 period: 923003 addr: 0
>> >>> ... branch stack: nr:28
>> >>> .....  0: c00000000045c028 -> c00000000dce7604 0 cycles      0
>> >>>                                                           ^
>> >>>                                                           missing P
>> >>>
>> >>> I guess we're missing a byte swap somewhere.
>> >> Ugh. We _do_ have a byte swap, but we also need a bit swap.
>> >>
>> >> That works for the single bit fields, not sure if it will for the
>> >> multi-bit fields.
>> >>
>> >> So that's a bit of a mess :/
>> >>
>> > Based on what I see in perf_event.h for other structures, I think I
>> > can make up what you would need for struct branch_entry. But Iit would
>> > be easier if you could send me a patch that you would have verified on
>> > your systems.
>> > Thanks.
>> Attached patch fixes the issue. Have tested both in both in BE and LE case.
>>
>> Maddy
>>
>>  From f816ba2e6ef8d5975f78442d7ecb50d66c3c4326 Mon Sep 17 00:00:00 2001
>> From: Madhavan Srinivasan <maddy@linux.ibm.com>
>> Date: Wed, 15 Sep 2021 22:29:09 +0530
>> Subject: [RFC PATCH] tools/perf: Add reverse_64b macro
>>
>> branch_stack struct has bit field definition
>> producing different bit ordering for big/little endian.
>> Because of this, when branch_stack sample collected
>> in a BE system viewed/reported in a LE system,
>> bit fields of the branch stack are not presented
>> properly. To address this issue, a reverse_64b
>> macro is defined and introduced in evsel__parse_sample.
>>
>> Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
>> ---
>>   tools/perf/util/evsel.c | 35 +++++++++++++++++++++++++++++++++--
>>   1 file changed, 33 insertions(+), 2 deletions(-)
>>
>> diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
>> index dbfeceb2546c..3151606e516e 100644
>> --- a/tools/perf/util/evsel.c
>> +++ b/tools/perf/util/evsel.c
>> @@ -2221,6 +2221,9 @@ void __weak arch_perf_parse_sample_weight(struct
>> perf_sample *data,
>>       data->weight = *array;
>>   }
>>
>> +#define reverse_64b(src, pos, size)    \
>> +    (((src >> pos) & (( 1ull <<size) - 1)) << (63 - (pos + size - 1)))
>> +
>>   int evsel__parse_sample(struct evsel *evsel, union perf_event *event,
>>               struct perf_sample *data)
>>   {
>> @@ -2408,6 +2411,8 @@ int evsel__parse_sample(struct evsel *evsel, union
>> perf_event *event,
>>       if (type & PERF_SAMPLE_BRANCH_STACK) {
>>           const u64 max_branch_nr = UINT64_MAX /
>>                         sizeof(struct branch_entry);
>> +        struct branch_entry *e;
>> +        unsigned i;
>>
>>           OVERFLOW_CHECK_u64(array);
>>           data->branch_stack = (struct branch_stack *)array++;
>> @@ -2416,10 +2421,36 @@ int evsel__parse_sample(struct evsel *evsel,
>> union perf_event *event,
>>               return -EFAULT;
>>
>>           sz = data->branch_stack->nr * sizeof(struct branch_entry);
>> -        if (evsel__has_branch_hw_idx(evsel))
>> +        if (evsel__has_branch_hw_idx(evsel)) {
>>               sz += sizeof(u64);
>> -        else
>> +            e = &data->branch_stack->entries[0];
>> +        } else {
>>               data->no_hw_idx = true;
>> +            e = (struct branch_entry *)&data->branch_stack->hw_idx;
>> +        }
>> +
>> +        if (swapped) {
>> +            for (i = 0; i < data->branch_stack->nr; i++, e++) {
>> +                u64 new_val = 0;
>> +
>> +                /* mispred:1  target mispredicted */
>> +                new_val = reverse_64b(e->flags.value, 0, 1);
>> +                /* predicted:1  target predicted */
>> +                new_val |= reverse_64b(e->flags.value, 1, 1);
>> +                /* in_tx:1  in transaction */
>> +                new_val |= reverse_64b(e->flags.value, 2, 1);
>> +                /* abort:1  transaction abort */
>> +                new_val |= reverse_64b(e->flags.value, 3, 1);
>> +                /* cycles:16  cycle count to last branch */
>> +                new_val |= reverse_64b(e->flags.value, 4, 16);
>> +                /* type:4  branch type */
>> +                new_val |= reverse_64b(e->flags.value, 20, 4);
>> +                /* reserved:40 */
>> +                new_val |= reverse_64b(e->flags.value, 24, 40);
>> +                e->flags.value = new_val;
>> +            }
>> +        }
>> +
>>           OVERFLOW_CHECK(array, sz, max_size);
>>           array = (void *)array + sz;
>>       }
>> --
>> 2.31.1
>>
>>

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v1 01/13] perf/core: add union to struct perf_branch_entry
  2021-09-17  7:05               ` Michael Ellerman
@ 2021-09-17  7:39                 ` Stephane Eranian
  2021-09-17 12:38                   ` Michael Ellerman
  0 siblings, 1 reply; 41+ messages in thread
From: Stephane Eranian @ 2021-09-17  7:39 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Madhavan Srinivasan, Peter Zijlstra, linux-kernel, acme, jolsa,
	kim.phillips, namhyung, irogers, atrajeev,
	Arnaldo Carvalho de Melo

Hi,

On Fri, Sep 17, 2021 at 12:05 AM Michael Ellerman <mpe@ellerman.id.au> wrote:
>
> Stephane Eranian <eranian@google.com> writes:
> > Hi,
> >
> > Thanks for fixing this in the perf tool. But what about the struct
> > branch_entry in the header?
>
> I'm not sure what you mean.
>
> We can't change the order of the fields in the header, without breaking
> existing userspace on BE systems.
>
Ok, I think I had missed that. You are saying that the
#ifdef __BIG_ENDIAN_BITFIELD vs __LITTLE_ENDIAN_BITFIELD guard
is only added to kernel-only data structures?

> It's annoying that the bit numbers are different between LE & BE, but I
> think it's too late to change that.
>
I agree.

> So nothing should change in the branch_entry definition in the header.
>
> My comment on your patch was that adding the union with val, makes it
> easier to misuse the bitfields, because now the values can be accessed
> via the bitfields and also via val, but when using val you have to know
> that the bit numbers differ between BE/LE.
>
Ok, I get it now. We do not need to expose val to the user. This is
added for kernel code convenience only. But if we keep it in the kernel
only, that may break some other rules about uapi headers.



> Maybe that's over-paranoid on my part, but if we just want val for
> clearing the values easily then I think the static inline I suggested is
> preferable.
>
> cheers
>
> > On Thu, Sep 16, 2021 at 11:38 PM Madhavan Srinivasan
> > <maddy@linux.ibm.com> wrote:
> >>
> >>
> >> On 9/15/21 11:33 AM, Stephane Eranian wrote:
> >> > Michael,
> >> >
> >> >
> >> > On Fri, Sep 10, 2021 at 7:16 AM Michael Ellerman <mpe@ellerman.id.au> wrote:
> >> >> Michael Ellerman <mpe@ellerman.id.au> writes:
> >> >>> Peter Zijlstra <peterz@infradead.org> writes:
> >> >>>> On Thu, Sep 09, 2021 at 12:56:48AM -0700, Stephane Eranian wrote:
> >> >>>>> diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
> >> >>>>> index f92880a15645..eb11f383f4be 100644
> >> >>>>> --- a/include/uapi/linux/perf_event.h
> >> >>>>> +++ b/include/uapi/linux/perf_event.h
> >> >>>>> @@ -1329,13 +1329,18 @@ union perf_mem_data_src {
> >> >>>>>   struct perf_branch_entry {
> >> >>>>>      __u64   from;
> >> >>>>>      __u64   to;
> >> >>>>> -   __u64   mispred:1,  /* target mispredicted */
> >> >>>>> -           predicted:1,/* target predicted */
> >> >>>>> -           in_tx:1,    /* in transaction */
> >> >>>>> -           abort:1,    /* transaction abort */
> >> >>>>> -           cycles:16,  /* cycle count to last branch */
> >> >>>>> -           type:4,     /* branch type */
> >> >>>>> -           reserved:40;
> >> >>>>> +   union {
> >> >>>>> +           __u64   val;        /* to make it easier to clear all fields */
> >> >>>>> +           struct {
> >> >>>>> +                   __u64   mispred:1,  /* target mispredicted */
> >> >>>>> +                           predicted:1,/* target predicted */
> >> >>>>> +                           in_tx:1,    /* in transaction */
> >> >>>>> +                           abort:1,    /* transaction abort */
> >> >>>>> +                           cycles:16,  /* cycle count to last branch */
> >> >>>>> +                           type:4,     /* branch type */
> >> >>>>> +                           reserved:40;
> >> >>>>> +           };
> >> >>>>> +   };
> >> >>>>>   };
> >> >>>>
> >> >>>> Hurpmh... all other bitfields have ENDIAN_BITFIELD things except this
> >> >>>> one. Power folks, could you please have a look?
> >> >>> The bit number of each field changes between big and little endian, but
> >> >>> as long as kernel and userspace are the same endian, and both only
> >> >>> access values via the bitfields then it works.
> >> >> ...
> >> >>> It does look like we have a bug in perf tool though, if I take a
> >> >>> perf.data from a big endian system to a little endian one I don't see
> >> >>> any of the branch flags decoded. eg:
> >> >>>
> >> >>> BE:
> >> >>>
> >> >>> 2413132652524 0x1db8 [0x2d0]: PERF_RECORD_SAMPLE(IP, 0x1): 5279/5279: 0xc00000000045c028 period: 923003 addr: 0
> >> >>> ... branch stack: nr:28
> >> >>> .....  0: c00000000045c028 -> c00000000dce7604 0 cycles  P   0
> >> >>>
> >> >>> LE:
> >> >>>
> >> >>> 2413132652524 0x1db8 [0x2d0]: PERF_RECORD_SAMPLE(IP, 0x1): 5279/5279: 0xc00000000045c028 period: 923003 addr: 0
> >> >>> ... branch stack: nr:28
> >> >>> .....  0: c00000000045c028 -> c00000000dce7604 0 cycles      0
> >> >>>                                                           ^
> >> >>>                                                           missing P
> >> >>>
> >> >>> I guess we're missing a byte swap somewhere.
> >> >> Ugh. We _do_ have a byte swap, but we also need a bit swap.
> >> >>
> >> >> That works for the single bit fields, not sure if it will for the
> >> >> multi-bit fields.
> >> >>
> >> >> So that's a bit of a mess :/
> >> >>
> >> > Based on what I see in perf_event.h for other structures, I think I
> >> > can make up what you would need for struct branch_entry. But Iit would
> >> > be easier if you could send me a patch that you would have verified on
> >> > your systems.
> >> > Thanks.
> >> Attached patch fixes the issue. Have tested both in both in BE and LE case.
> >>
> >> Maddy
> >>
> >>  From f816ba2e6ef8d5975f78442d7ecb50d66c3c4326 Mon Sep 17 00:00:00 2001
> >> From: Madhavan Srinivasan <maddy@linux.ibm.com>
> >> Date: Wed, 15 Sep 2021 22:29:09 +0530
> >> Subject: [RFC PATCH] tools/perf: Add reverse_64b macro
> >>
> >> branch_stack struct has bit field definition
> >> producing different bit ordering for big/little endian.
> >> Because of this, when branch_stack sample collected
> >> in a BE system viewed/reported in a LE system,
> >> bit fields of the branch stack are not presented
> >> properly. To address this issue, a reverse_64b
> >> macro is defined and introduced in evsel__parse_sample.
> >>
> >> Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
> >> ---
> >>   tools/perf/util/evsel.c | 35 +++++++++++++++++++++++++++++++++--
> >>   1 file changed, 33 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
> >> index dbfeceb2546c..3151606e516e 100644
> >> --- a/tools/perf/util/evsel.c
> >> +++ b/tools/perf/util/evsel.c
> >> @@ -2221,6 +2221,9 @@ void __weak arch_perf_parse_sample_weight(struct
> >> perf_sample *data,
> >>       data->weight = *array;
> >>   }
> >>
> >> +#define reverse_64b(src, pos, size)    \
> >> +    (((src >> pos) & (( 1ull <<size) - 1)) << (63 - (pos + size - 1)))
> >> +
> >>   int evsel__parse_sample(struct evsel *evsel, union perf_event *event,
> >>               struct perf_sample *data)
> >>   {
> >> @@ -2408,6 +2411,8 @@ int evsel__parse_sample(struct evsel *evsel, union
> >> perf_event *event,
> >>       if (type & PERF_SAMPLE_BRANCH_STACK) {
> >>           const u64 max_branch_nr = UINT64_MAX /
> >>                         sizeof(struct branch_entry);
> >> +        struct branch_entry *e;
> >> +        unsigned i;
> >>
> >>           OVERFLOW_CHECK_u64(array);
> >>           data->branch_stack = (struct branch_stack *)array++;
> >> @@ -2416,10 +2421,36 @@ int evsel__parse_sample(struct evsel *evsel,
> >> union perf_event *event,
> >>               return -EFAULT;
> >>
> >>           sz = data->branch_stack->nr * sizeof(struct branch_entry);
> >> -        if (evsel__has_branch_hw_idx(evsel))
> >> +        if (evsel__has_branch_hw_idx(evsel)) {
> >>               sz += sizeof(u64);
> >> -        else
> >> +            e = &data->branch_stack->entries[0];
> >> +        } else {
> >>               data->no_hw_idx = true;
> >> +            e = (struct branch_entry *)&data->branch_stack->hw_idx;
> >> +        }
> >> +
> >> +        if (swapped) {
> >> +            for (i = 0; i < data->branch_stack->nr; i++, e++) {
> >> +                u64 new_val = 0;
> >> +
> >> +                /* mispred:1  target mispredicted */
> >> +                new_val = reverse_64b(e->flags.value, 0, 1);
> >> +                /* predicted:1  target predicted */
> >> +                new_val |= reverse_64b(e->flags.value, 1, 1);
> >> +                /* in_tx:1  in transaction */
> >> +                new_val |= reverse_64b(e->flags.value, 2, 1);
> >> +                /* abort:1  transaction abort */
> >> +                new_val |= reverse_64b(e->flags.value, 3, 1);
> >> +                /* cycles:16  cycle count to last branch */
> >> +                new_val |= reverse_64b(e->flags.value, 4, 16);
> >> +                /* type:4  branch type */
> >> +                new_val |= reverse_64b(e->flags.value, 20, 4);
> >> +                /* reserved:40 */
> >> +                new_val |= reverse_64b(e->flags.value, 24, 40);
> >> +                e->flags.value = new_val;
> >> +            }
> >> +        }
> >> +
> >>           OVERFLOW_CHECK(array, sz, max_size);
> >>           array = (void *)array + sz;
> >>       }
> >> --
> >> 2.31.1
> >>
> >>

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v1 01/13] perf/core: add union to struct perf_branch_entry
  2021-09-17  7:39                 ` Stephane Eranian
@ 2021-09-17 12:38                   ` Michael Ellerman
  2021-09-17 16:42                     ` Stephane Eranian
  0 siblings, 1 reply; 41+ messages in thread
From: Michael Ellerman @ 2021-09-17 12:38 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: Madhavan Srinivasan, Peter Zijlstra, linux-kernel, acme, jolsa,
	kim.phillips, namhyung, irogers, atrajeev,
	Arnaldo Carvalho de Melo

Stephane Eranian <eranian@google.com> writes:
> Hi,
>
> On Fri, Sep 17, 2021 at 12:05 AM Michael Ellerman <mpe@ellerman.id.au> wrote:
>>
>> Stephane Eranian <eranian@google.com> writes:
>> > Hi,
>> >
>> > Thanks for fixing this in the perf tool. But what about the struct
>> > branch_entry in the header?
>>
>> I'm not sure what you mean.
>>
>> We can't change the order of the fields in the header, without breaking
>> existing userspace on BE systems.
>>
> Ok, I think I had missed that. You are saying that the
> #ifdef (__BIG_ENDIAN_BITFIELD) vs __LITTLE_ENDIAN_BITFIELD
>
> is only added to kernel-only data structures?

No, we *should* have used __BIG/LITTLE_ENDIAN_BITFIELD for the uapi
definition, but we forgot.

>> It's annoying that the bit numbers are different between LE & BE, but I
>> think it's too late to change that.
>>
> I agree.
>
>> So nothing should change in the branch_entry definition in the header.
>>
>> My comment on your patch was that adding the union with val, makes it
>> easier to misuse the bitfields, because now the values can be accessed
>> via the bitfields and also via val, but when using val you have to know
>> that the bit numbers differ between BE/LE.
>>
> Ok, I get it now. We do not need to expose val to user. This is added
> for kernel code convenience only.

Yeah. Putting the union with val in the uapi encourages userspace to
misuse val to bypass the bitfields, and that risks causing endian bugs.

> But if we keep it in kernel, that may break some other rules about
> uapi headers.

I don't follow what you mean there.

We could use #ifdef __KERNEL__ in the uapi header to make the union
kernel-only, see below, but it's pretty gross.

 struct perf_branch_entry {
	__u64	from;
	__u64	to;
 #ifdef __KERNEL__
	union {
		__u64	val;	    /* to make it easier to clear all fields */
		struct {
 #endif
			__u64	mispred:1,  /* target mispredicted */
				predicted:1,/* target predicted */
				in_tx:1,    /* in transaction */
				abort:1,    /* transaction abort */
				cycles:16,  /* cycle count to last branch */
				type:4,     /* branch type */
				reserved:40;
 #ifdef __KERNEL__
		};
	};
 #endif
 };


If we just do the inline I suggested we can clear the flags in a single
source line, and the generated code seems fine too, eg:

static inline void clear_perf_branch_entry_flags(struct perf_branch_entry *e)
{
	e->mispred = 0;
	e->predicted = 0;
	e->in_tx = 0;
	e->abort = 0;
	e->cycles = 0;
	e->type = 0;
	e->reserved = 0;
}


cheers

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v1 01/13] perf/core: add union to struct perf_branch_entry
  2021-09-17 12:38                   ` Michael Ellerman
@ 2021-09-17 16:42                     ` Stephane Eranian
  2021-09-19 10:27                       ` Michael Ellerman
  0 siblings, 1 reply; 41+ messages in thread
From: Stephane Eranian @ 2021-09-17 16:42 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Madhavan Srinivasan, Peter Zijlstra, linux-kernel, acme, jolsa,
	kim.phillips, namhyung, irogers, atrajeev,
	Arnaldo Carvalho de Melo

On Fri, Sep 17, 2021 at 5:38 AM Michael Ellerman <mpe@ellerman.id.au> wrote:
>
> Stephane Eranian <eranian@google.com> writes:
> > Hi,
> >
> > On Fri, Sep 17, 2021 at 12:05 AM Michael Ellerman <mpe@ellerman.id.au> wrote:
> >>
> >> Stephane Eranian <eranian@google.com> writes:
> >> > Hi,
> >> >
> >> > Thanks for fixing this in the perf tool. But what about the struct
> >> > branch_entry in the header?
> >>
> >> I'm not sure what you mean.
> >>
> >> We can't change the order of the fields in the header, without breaking
> >> existing userspace on BE systems.
> >>
> > Ok, I think I had missed that. You are saying that the
> > #ifdef (__BIG_ENDIAN_BITFIELD) vs __LITTLE_ENDIAN_BITFIELD
> >
> > is only added to kernel-only data structures?
>
> No, we *should* have used __BIG/LITTLE_ENDIAN_BITFIELD for the uapi
> definition, but we forgot.
>
But are you suggesting it cannot be fixed?

> >> It's annoying that the bit numbers are different between LE & BE, but I
> >> think it's too late to change that.
> >>
> > I agree.
> >
> >> So nothing should change in the branch_entry definition in the header.
> >>
> >> My comment on your patch was that adding the union with val, makes it
> >> easier to misuse the bitfields, because now the values can be accessed
> >> via the bitfields and also via val, but when using val you have to know
> >> that the bit numbers differ between BE/LE.
> >>
> > Ok, I get it now. We do not need to expose val to user. This is added
> > for kernel code convenience only.
>
> Yeah. Putting the union with val in the uapi encourages userspace to
> misuse val to bypass the bitfields, and that risks causing endian bugs.
>
> > But if we keep it in kernel, that may break some other rules about
> > uapi headers.
>
> I don't follow what you mean there.
>
> We could use #ifdef __KERNEL__ in the uapi header to make the union
> kernel-only, see below, but it's pretty gross.
>
>  struct perf_branch_entry {
>         __u64   from;
>         __u64   to;
>  #ifdef __KERNEL__
>         union {
>                 __u64   val;        /* to make it easier to clear all fields */
>                 struct {
>  #endif
>                         __u64   mispred:1,  /* target mispredicted */
>                                 predicted:1,/* target predicted */
>                                 in_tx:1,    /* in transaction */
>                                 abort:1,    /* transaction abort */
>                                 cycles:16,  /* cycle count to last branch */
>                                 type:4,     /* branch type */
>                                 reserved:40;
>  #ifdef __KERNEL__
>                 };
>         };
>  #endif
>  };
>
>
> If we just do the inline I suggested we can clear the flags in a single
> source line, and the generated code seems fine too, eg:
>
> static inline void clear_perf_branch_entry_flags(struct perf_branch_entry *e)
> {
>         e->mispred = 0;
>         e->predicted = 0;
>         e->in_tx = 0;
>         e->abort = 0;
>         e->cycles = 0;
>         e->type = 0;
>         e->reserved = 0;
> }
>
Ok, let's do the inline then. That looks like a cleaner solution to me
assuming the compiler does the right thing.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v1 01/13] perf/core: add union to struct perf_branch_entry
  2021-09-17 16:42                     ` Stephane Eranian
@ 2021-09-19 10:27                       ` Michael Ellerman
  0 siblings, 0 replies; 41+ messages in thread
From: Michael Ellerman @ 2021-09-19 10:27 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: Madhavan Srinivasan, Peter Zijlstra, linux-kernel, acme, jolsa,
	kim.phillips, namhyung, irogers, atrajeev,
	Arnaldo Carvalho de Melo

Stephane Eranian <eranian@google.com> writes:
> On Fri, Sep 17, 2021 at 5:38 AM Michael Ellerman <mpe@ellerman.id.au> wrote:
>>
>> Stephane Eranian <eranian@google.com> writes:
>> > Hi,
>> >
>> > On Fri, Sep 17, 2021 at 12:05 AM Michael Ellerman <mpe@ellerman.id.au> wrote:
>> >>
>> >> Stephane Eranian <eranian@google.com> writes:
>> >> > Hi,
>> >> >
>> >> > Thanks for fixing this in the perf tool. But what about the struct
>> >> > branch_entry in the header?
>> >>
>> >> I'm not sure what you mean.
>> >>
>> >> We can't change the order of the fields in the header, without breaking
>> >> existing userspace on BE systems.
>> >>
>> > Ok, I think I had missed that. You are saying that the
>> > #ifdef (__BIG_ENDIAN_BITFIELD) vs __LITTLE_ENDIAN_BITFIELD
>> >
>> > is only added to kernel-only data structures?
>>
>> No, we *should* have used __BIG/LITTLE_ENDIAN_BITFIELD for the uapi
>> definition, but we forgot.
>>
> But are you suggesting it cannot be fixed?

I'm not saying it *cannot* be fixed.

But I don't think it's sufficiently broken to warrant fixing.

Just adding the __BIG/LITTLE_ENDIAN_BITFIELD ifdefs would break the ABI
for existing users.

So we'd have to continue to support the existing bitfield layout, and
then add a flag for userspace to request a new bitfield layout where the
bit numbers are stable across endian.

But that's way too much effort IMHO.

The existing definition works fine, *except* when perf.data files are
moved between machines of different endianness. That is pretty rare
these days, and can be handled in the perf tool easily enough.
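
For illustration only, this is roughly what an endian-stable uapi layout would
have looked like had it been done from the start (the struct name here is
hypothetical, and adopting such a layout now is exactly the ABI break described
above):

#include <asm/byteorder.h>

struct perf_branch_entry_stable {	/* hypothetical, not proposed */
	__u64	from;
	__u64	to;
#if defined(__LITTLE_ENDIAN_BITFIELD)
	__u64	mispred:1, predicted:1, in_tx:1, abort:1,
		cycles:16, type:4, reserved:40;
#elif defined(__BIG_ENDIAN_BITFIELD)
	/* reversed declaration order keeps each flag at the same bit number */
	__u64	reserved:40, type:4, cycles:16,
		abort:1, in_tx:1, predicted:1, mispred:1;
#endif
};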

>> >> It's annoying that the bit numbers are different between LE & BE, but I
>> >> think it's too late to change that.
>> >>
>> > I agree.
>> >
>> >> So nothing should change in the branch_entry definition in the header.
>> >>
>> >> My comment on your patch was that adding the union with val, makes it
>> >> easier to misuse the bitfields, because now the values can be accessed
>> >> via the bitfields and also via val, but when using val you have to know
>> >> that the bit numbers differ between BE/LE.
>> >>
>> > Ok, I get it now. We do not need to expose val to user. This is added
>> > for kernel code convenience only.
>>
>> Yeah. Putting the union with val in the uapi encourages userspace to
>> misuse val to bypass the bitfields, and that risks causing endian bugs.
>>
>> > But if we keep it in kernel, that may break some other rules about
>> > uapi headers.
>>
>> I don't follow what you mean there.
>>
>> We could use #ifdef __KERNEL__ in the uapi header to make the union
>> kernel-only, see below, but it's pretty gross.
>>
>>  struct perf_branch_entry {
>>         __u64   from;
>>         __u64   to;
>>  #ifdef __KERNEL__
>>         union {
>>                 __u64   val;        /* to make it easier to clear all fields */
>>                 struct {
>>  #endif
>>                         __u64   mispred:1,  /* target mispredicted */
>>                                 predicted:1,/* target predicted */
>>                                 in_tx:1,    /* in transaction */
>>                                 abort:1,    /* transaction abort */
>>                                 cycles:16,  /* cycle count to last branch */
>>                                 type:4,     /* branch type */
>>                                 reserved:40;
>>  #ifdef __KERNEL__
>>                 };
>>         };
>>  #endif
>>  };
>>
>>
>> If we just do the inline I suggested we can clear the flags in a single
>> source line, and the generated code seems fine too, eg:
>>
>> static inline void clear_perf_branch_entry_flags(struct perf_branch_entry *e)
>> {
>>         e->mispred = 0;
>>         e->predicted = 0;
>>         e->in_tx = 0;
>>         e->abort = 0;
>>         e->cycles = 0;
>>         e->type = 0;
>>         e->reserved = 0;
>> }
>>
> Ok, let's do the inline then. That looks like a cleaner solution to me
> assuming the compiler does the right thing.

It did when I checked with GCC 10.

cheers

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v1 00/13] perf/x86/amd: Add AMD Fam19h Branch Sampling support
  2021-09-15  5:55   ` Stephane Eranian
  2021-09-15  9:04     ` Peter Zijlstra
@ 2021-09-27 20:17     ` Song Liu
  1 sibling, 0 replies; 41+ messages in thread
From: Song Liu @ 2021-09-27 20:17 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: Peter Zijlstra, open list, Arnaldo Carvalho de Melo, Jiri Olsa,
	kim.phillips, Namhyung Kim, Ian Rogers

Hi Stephane,

On Tue, Sep 14, 2021 at 10:57 PM Stephane Eranian <eranian@google.com> wrote:
>
> On Thu, Sep 9, 2021 at 1:55 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Thu, Sep 09, 2021 at 12:56:47AM -0700, Stephane Eranian wrote:
> > > This patch series adds support for the AMD Fam19h 16-deep branch sampling
> > > feature as described in the AMD PPR Fam19h Model 01h Revision B1 section 2.1.13.
> >
> > Yay..
> >
> > > BRS interacts with the NMI interrupt as well. Because enabling BRS is expensive,
> > > it is only activated after P event occurrences, where P is the desired sampling period.
> > > At P occurrences of the event, the counter overflows, the CPU catches the NMI interrupt,
> > > activates BRS for 16 branches until it saturates, and then delivers the NMI to the kernel.
> >
> > WTF... ?!? Srsly? You're joking right?
> >
>
> As I said, this is because of the cost of running BRS usually for
> millions of branches to keep only the last 16.
> Running branch sampling in general on any arch is never totally free.

Could you please share some data on how expensive the BRS is? We are hoping
to use BRS/LBR without PMI (bpf_get_branch_snapshot). If it is too expensive,
we may need some heuristic to turn it on/off.

Thanks,
Song
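
For context, the helper Song refers to is used along these lines from an
fexit program; this is a rough sketch assuming a kernel and vmlinux.h that
already expose bpf_get_branch_snapshot(), and the attach point is arbitrary:

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

struct perf_branch_entry entries[16] = {};
long snapshot_bytes = 0;

SEC("fexit/__x64_sys_getpgid")	/* attach point chosen only for illustration */
int BPF_PROG(snapshot_branches)
{
	/* copies up to sizeof(entries) bytes of hardware branch records;
	 * flags must currently be 0 */
	snapshot_bytes = bpf_get_branch_snapshot(entries, sizeof(entries), 0);
	return 0;
}

char LICENSE[] SEC("license") = "GPL";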

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v1 11/13] perf tools: improve IBS error handling
  2021-09-13 19:34   ` Arnaldo Carvalho de Melo
@ 2021-10-04 21:57     ` Kim Phillips
  2021-10-04 23:44       ` Arnaldo Carvalho de Melo
  0 siblings, 1 reply; 41+ messages in thread
From: Kim Phillips @ 2021-10-04 21:57 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo, Stephane Eranian
  Cc: linux-kernel, peterz, acme, jolsa, namhyung, irogers

On 9/13/21 2:34 PM, Arnaldo Carvalho de Melo wrote:
> Em Thu, Sep 09, 2021 at 12:56:58AM -0700, Stephane Eranian escreveu:
>> From: Kim Phillips <kim.phillips@amd.com>
>> +static void detect_amd(void)
>> +{
>> +	FILE *inf = fopen("/proc/cpuinfo", "r");
>> +	char *res;
>> +
>> +	if (!inf)
>> +		return;
>> +
>> +	res = fgrep(inf, "vendor_id");
>> +
>> +	if (res) {
>> +		char *s = strchr(res, ':');
>> +
>> +		is_amd = s && !strcmp(s, ": AuthenticAMD\n");
>> +		free(res);
>> +	}
>> +	fclose(inf);
>> +}
>> +
> 
> We have perf_env for such details, for instance in
> tools/perf/util/sample-raw.c we have:
> 
>          const char *arch_pf = perf_env__arch(evlist->env);
>          const char *cpuid = perf_env__cpuid(evlist->env);
> 
> 	        else if (arch_pf && !strcmp("x86", arch_pf) &&
>                   cpuid && strstarts(cpuid, "AuthenticAMD") &&
>                   evlist__has_amd_ibs(evlist)) {

OK, I've rebased this 11/13 patch onto the new perf_env__{arch,cpuid} 
code, and posted it here:

https://lore.kernel.org/lkml/20211004214114.188477-1-kim.phillips@amd.com/T/#mc4c9c582e3816ab31af6d0187e6803de1a98ac84

The following 12/13 patch in this series changes, too, but since it 
depends on prior patches in the series, I'll reply-all to 12/13 with its 
new version.

Arnaldo, would it be ok to apply the two new patches that replace this 
11/13?  They don't depend on any others in this series, and it would 
save Stephane from having to carry them.

Thanks,

Kim

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v1 12/13] perf tools: improve error handling of AMD Branch Sampling
  2021-09-09  7:56 ` [PATCH v1 12/13] perf tools: improve error handling of AMD Branch Sampling Stephane Eranian
@ 2021-10-04 21:57   ` Kim Phillips
  0 siblings, 0 replies; 41+ messages in thread
From: Kim Phillips @ 2021-10-04 21:57 UTC (permalink / raw)
  To: Stephane Eranian, linux-kernel; +Cc: peterz, acme, jolsa, namhyung, irogers

[-- Attachment #1: Type: text/plain, Size: 1163 bytes --]

On 9/9/21 2:56 AM, Stephane Eranian wrote:
> This patch improves the error message printed by perf when perf_event_open()
> fails on AMD Zen3 when using the branch sampling feature. In the case of
> EINVAL, there are two main reasons: the event is in frequency mode, or the
> sampling period is smaller than the depth of the branch sampling buffer (16).
> The patch checks the parameters
> of the call and tries to print a relevant message to explain the error:
> 
> $ perf record -b -e cpu/branch-brs/ -c 10 ls
> Error:
> AMD Branch Sampling does not support sampling period smaller than what is reported in /sys/devices/cpu/caps/branches.
> 
> $ perf record -b -e cpu/branch-brs/ ls
> Error:
> AMD Branch Sampling does not support frequency mode sampling, must pass a fixed sampling period via -c option or cpu/branch-brs,period=xxxx/.
> 
> Signed-off-by: Stephane Eranian <eranian@google.com>
> ---

Hi Stephane,

I've rewritten this patch based on Arnaldo's comments to the previous 
(11/13) patch.  The new version attached depends on this 2-patch series:

https://lore.kernel.org/lkml/20211004214114.188477-1-kim.phillips@amd.com/T/#mc4c9c582e3816ab31af6d0187e6803de1a98ac84

Thanks,

Kim

[-- Attachment #2: 0003-perf-tools-Improve-error-handling-of-AMD-Branch-Samp.patch --]
[-- Type: text/x-patch, Size: 2583 bytes --]

From a4cbab762719b30bddec2e278cf8b8eb82e83865 Mon Sep 17 00:00:00 2001
From: Stephane Eranian <eranian@google.com>
Date: Thu, 9 Sep 2021 00:56:59 -0700
Subject: [PATCH] perf tools: Improve error handling of AMD Branch Sampling

This patch improves the error message printed by perf when
perf_event_open() fails on AMD Zen3 when using the branch sampling
feature. In the case of EINVAL, there are two main reasons: the event is
in frequency mode, or the sampling period is smaller than the depth of the
branch sampling buffer (16). The patch checks the parameters of the call and tries
to print a relevant message to explain the error:

$ perf record -b -e cpu/branch-brs/ -c 10 ls
Error:
AMD Branch Sampling does not support sampling period smaller than what is reported in /sys/devices/cpu/caps/branches.

$ perf record -b -e cpu/branch-brs/ ls
Error:
AMD Branch Sampling does not support frequency mode sampling, must pass a fixed sampling period via -c option or cpu/branch-brs,period=xxxx/.

Signed-off-by: Stephane Eranian <eranian@google.com>
[Rebased on commit 9fe8895a27a84 ("perf env: Add perf_env__cpuid, perf_env__{nr_}pmu_mappings")]
Signed-off-by: Kim Phillips <kim.phillips@amd.com>
---
 tools/perf/util/evsel.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index f8a9cbd99314..e1f5eff07355 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -2753,6 +2753,12 @@ static bool is_amd_ibs(struct evsel *evsel)
 	return evsel->core.attr.precise_ip || !strncmp(evsel->pmu_name, "ibs", 3);
 }
 
+static bool is_amd_brs(struct evsel *evsel)
+{
+	return ((evsel->core.attr.config & 0xff) == 0xc4) &&
+	       (evsel->core.attr.sample_type & PERF_SAMPLE_BRANCH_STACK);
+}
+
 int evsel__open_strerror(struct evsel *evsel, struct target *target,
 			 int err, char *msg, size_t size)
 {
@@ -2863,6 +2869,14 @@ int evsel__open_strerror(struct evsel *evsel, struct target *target,
 					return scnprintf(msg, size,
 	"AMD IBS may only be available in system-wide/per-cpu mode.  Try using -a, or -C and workload affinity");
 			}
+			if (is_amd_brs(evsel)) {
+				if (evsel->core.attr.freq)
+					return scnprintf(msg, size,
+	"AMD Branch Sampling does not support frequency mode sampling, must pass a fixed sampling period via -c option or cpu/branch-brs,period=xxxx/.");
+				/* another reason is that the period is too small */
+				return scnprintf(msg, size,
+	"AMD Branch Sampling does not support sampling period smaller than what is reported in /sys/devices/cpu/caps/branches.");
+			}
 		}
 
 		break;
-- 
2.31.1
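
For reference, a perf_env-based AMD check along the lines Arnaldo suggested
could look roughly like the sketch below; the helper name and exact wiring are
illustrative, see the linked 2-patch series for the code that was actually
posted:

static bool evsel__is_amd(struct evsel *evsel)	/* illustrative name */
{
	struct perf_env *env = evsel->evlist ? evsel->evlist->env : NULL;
	const char *cpuid = env ? perf_env__cpuid(env) : NULL;

	return cpuid && strstarts(cpuid, "AuthenticAMD");
}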


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: [PATCH v1 11/13] perf tools: improve IBS error handling
  2021-10-04 21:57     ` Kim Phillips
@ 2021-10-04 23:44       ` Arnaldo Carvalho de Melo
  0 siblings, 0 replies; 41+ messages in thread
From: Arnaldo Carvalho de Melo @ 2021-10-04 23:44 UTC (permalink / raw)
  To: Kim Phillips, Arnaldo Carvalho de Melo, Stephane Eranian
  Cc: linux-kernel, peterz, acme, jolsa, namhyung, irogers



On October 4, 2021 6:57:09 PM GMT-03:00, Kim Phillips <kim.phillips@amd.com> wrote:
>On 9/13/21 2:34 PM, Arnaldo Carvalho de Melo wrote:
>> Em Thu, Sep 09, 2021 at 12:56:58AM -0700, Stephane Eranian escreveu:
>>> From: Kim Phillips <kim.phillips@amd.com>
>>> +static void detect_amd(void)
>>> +{
>>> +	FILE *inf = fopen("/proc/cpuinfo", "r");
>>> +	char *res;
>>> +
>>> +	if (!inf)
>>> +		return;
>>> +
>>> +	res = fgrep(inf, "vendor_id");
>>> +
>>> +	if (res) {
>>> +		char *s = strchr(res, ':');
>>> +
>>> +		is_amd = s && !strcmp(s, ": AuthenticAMD\n");
>>> +		free(res);
>>> +	}
>>> +	fclose(inf);
>>> +}
>>> +
>> 
>> We have perf_env for such details, for instance in
>> tools/perf/util/sample-raw.c we have:
>> 
>>          const char *arch_pf = perf_env__arch(evlist->env);
>>          const char *cpuid = perf_env__cpuid(evlist->env);
>> 
>> 	        else if (arch_pf && !strcmp("x86", arch_pf) &&
>>                   cpuid && strstarts(cpuid, "AuthenticAMD") &&
>>                   evlist__has_amd_ibs(evlist)) {
>
>OK, I've rebased this 11/13 patch onto the new perf_env__{arch,cpuid} 
>code, and posted it here:
>
>https://lore.kernel.org/lkml/20211004214114.188477-1-kim.phillips@amd.com/T/#mc4c9c582e3816ab31af6d0187e6803de1a98ac84
>
>The following 12/13 patch in this series changes, too, but since it 
>depends on prior patches in the series, I'll reply-all to 12/13 with its 
>new version.
>
>Arnaldo, would it be ok to apply the two new patches that replace this 
>11/13?  They don't depend on any others in this series, and it would 
>save Stephane from having to carry them.


I'll work on this tomorrow, thanks

- Arnaldo
>
>Thanks,
>
>Kim

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v1 00/13] perf/x86/amd: Add AMD Fam19h Branch Sampling support
  2021-09-15  9:04     ` Peter Zijlstra
@ 2021-10-28 18:30       ` Stephane Eranian
  0 siblings, 0 replies; 41+ messages in thread
From: Stephane Eranian @ 2021-10-28 18:30 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, acme, jolsa, kim.phillips, namhyung, irogers

On Wed, Sep 15, 2021 at 2:04 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Tue, Sep 14, 2021 at 10:55:12PM -0700, Stephane Eranian wrote:
> > On Thu, Sep 9, 2021 at 1:55 AM Peter Zijlstra <peterz@infradead.org> wrote:
> > >
> > > On Thu, Sep 09, 2021 at 12:56:47AM -0700, Stephane Eranian wrote:
> > > > This patch series adds support for the AMD Fam19h 16-deep branch sampling
> > > > feature as described in the AMD PPR Fam19h Model 01h Revision B1 section 2.1.13.
> > >
> > > Yay..
> > >
> > > > BRS interacts with the NMI interrupt as well. Because enabling BRS is expensive,
> > > > it is only activated after P event occurrences, where P is the desired sampling period.
> > > > At P occurrences of the event, the counter overflows, the CPU catches the NMI interrupt,
> > > > activates BRS for 16 branches until it saturates, and then delivers the NMI to the kernel.
> > >
> > > WTF... ?!? Srsly? You're joking right?
> > >
> >
> > As I said, this is because of the cost of running BRS usually for
> > millions of branches to keep only the last 16.
> > Running branch sampling in general on any arch is never totally free.
>
> Holding up the NMI will disrupt the sampling of the other events, which
> is, IMO, unacceptable and would require this event to be exclusive on the
> whole PMU, simply because sharing it doesn't work.
>
Sorry for the long delay; I have been very busy.

You are right on this. It would hold the NMI for 16 taken branches.
Making the event exclusive creates a problem with the NMI watchdog.
We can try to hack something in to allow NMI watchdog + the sampling
event and nothing else.

> (also, other NMI sources might object)
>
On AMD, there are also IBS Op and IBS Fetch, both firing on NMI, but that
is less of a concern because the instruction address is captured by IBS
and the interrupted IP is not needed, so the interrupt skid does not matter.

> Also, by only having LBRs post overflow you can't apply LBR based
> analysis to other events, which seems quite limiting.
>
This is a very limited feature, designed to provide basic branch sampling
primarily for autoFDO, where there is only one sampling event.

> This really seems like a very sub-optimal solution. I mean, it's awesome
> AMD gets branch records, but this seems a very poor solution.

For now, this is what we have. It is important to get some basic form of branch
sampling on Zen3, even if it is not perfect, because it enables compiler
optimizations such as autoFDO today. We have verified that autoFDO works well with
branch sampling on Zen3.

I hope it will improve in the future.

^ permalink raw reply	[flat|nested] 41+ messages in thread

end of thread, other threads:[~2021-10-28 18:30 UTC | newest]

Thread overview: 41+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-09-09  7:56 [PATCH v1 00/13] perf/x86/amd: Add AMD Fam19h Branch Sampling support Stephane Eranian
2021-09-09  7:56 ` [PATCH v1 01/13] perf/core: add union to struct perf_branch_entry Stephane Eranian
2021-09-09 19:03   ` Peter Zijlstra
2021-09-10 12:09     ` Michael Ellerman
2021-09-10 14:16       ` Michael Ellerman
2021-09-15  6:03         ` Stephane Eranian
2021-09-17  6:37           ` Madhavan Srinivasan
2021-09-17  6:48             ` Stephane Eranian
2021-09-17  7:05               ` Michael Ellerman
2021-09-17  7:39                 ` Stephane Eranian
2021-09-17 12:38                   ` Michael Ellerman
2021-09-17 16:42                     ` Stephane Eranian
2021-09-19 10:27                       ` Michael Ellerman
2021-09-09  7:56 ` [PATCH v1 02/13] x86/cpufeatures: add AMD Fam19h Branch Sampling feature Stephane Eranian
2021-09-09  7:56 ` [PATCH v1 03/13] perf/x86/amd: add AMD Fam19h Branch Sampling support Stephane Eranian
2021-09-09 10:44   ` kernel test robot
2021-09-09 15:33   ` kernel test robot
2021-09-09  7:56 ` [PATCH v1 04/13] perf/x86/amd: add branch-brs helper event for Fam19h BRS Stephane Eranian
2021-09-09  7:56 ` [PATCH v1 05/13] perf/x86/amd: enable branch sampling priv level filtering Stephane Eranian
2021-09-09  7:56 ` [PATCH v1 06/13] perf/x86/amd: add AMD branch sampling period adjustment Stephane Eranian
2021-09-09  7:56 ` [PATCH v1 07/13] perf/core: add idle hooks Stephane Eranian
2021-09-09  9:15   ` Peter Zijlstra
2021-09-09 10:42   ` kernel test robot
2021-09-09 11:02   ` kernel test robot
2021-09-09  7:56 ` [PATCH v1 08/13] perf/x86/core: " Stephane Eranian
2021-09-09  9:16   ` Peter Zijlstra
2021-09-09  7:56 ` [PATCH v1 09/13] perf/x86/amd: add idle hooks for branch sampling Stephane Eranian
2021-09-09  9:20   ` Peter Zijlstra
2021-09-09  7:56 ` [PATCH v1 10/13] perf tools: add branch-brs as a new event Stephane Eranian
2021-09-09  7:56 ` [PATCH v1 11/13] perf tools: improve IBS error handling Stephane Eranian
2021-09-13 19:34   ` Arnaldo Carvalho de Melo
2021-10-04 21:57     ` Kim Phillips
2021-10-04 23:44       ` Arnaldo Carvalho de Melo
2021-09-09  7:56 ` [PATCH v1 12/13] perf tools: improve error handling of AMD Branch Sampling Stephane Eranian
2021-10-04 21:57   ` Kim Phillips
2021-09-09  7:57 ` [PATCH v1 13/13] perf report: add addr_from/addr_to sort dimensions Stephane Eranian
2021-09-09  8:55 ` [PATCH v1 00/13] perf/x86/amd: Add AMD Fam19h Branch Sampling support Peter Zijlstra
2021-09-15  5:55   ` Stephane Eranian
2021-09-15  9:04     ` Peter Zijlstra
2021-10-28 18:30       ` Stephane Eranian
2021-09-27 20:17     ` Song Liu
