* [PATCH 00/12] perf core PMU support for Sapphire Rapids
From: kan.liang @ 2021-01-19 20:38 UTC
  To: peterz, acme, mingo, linux-kernel
  Cc: eranian, namhyung, jolsa, ak, yao.jin, Kan Liang

From: Kan Liang <kan.liang@linux.intel.com>

The Intel Sapphire Rapids server is the successor of the Intel Ice Lake
server. The enabling code is based on Ice Lake, but several new features
are introduced:
- The event encoding is changed and simplified.
- A new Precise Distribution (PDist) facility.
- Two new data source fields, data block & address block, are added to
  the PEBS Memory Info Record for the load latency event.
- A new Store Latency facility is introduced.
- The layout of the access latency field of the PEBS Memory Info Record
  has been changed. Two latencies, instruction latency and cache access
  latency, are recorded. To support the new latency fields, a new sample
  type, PERF_SAMPLE_WEIGHT_EXT, is introduced.
- The PERF_METRICS MSR is extended to feature TMA method level 2 metrics.

Besides the Sapphire Rapids-specific features, the CPUID 10.ECX
extension is also supported, which is available for all platforms with
Architectural Performance Monitoring Version 5.

The full description of the SPR features can be found in the Intel
Architecture Instruction Set Extensions and Future Features Programming
Reference, 319433-041 (and later).

Both kernel and perf tool patches are included in V1.

Kan Liang (12):
  perf/core: Add PERF_SAMPLE_WEIGHT_EXT
  perf/x86/intel: Factor out intel_update_topdown_event()
  perf/x86/intel: Add perf core PMU support for Sapphire Rapids
  perf/x86/intel: Support CPUID 10.ECX to disable fixed counters
  tools headers uapi: Update tools's copy of linux/perf_event.h
  perf tools: Support data block and addr block
  perf c2c: Support data block and addr block
  perf tools: Support PERF_SAMPLE_WEIGHT_EXT
  perf report: Support instruction latency
  perf test: Support PERF_SAMPLE_WEIGHT_EXT
  perf stat: Support L2 Topdown events
  perf, tools: Update topdown documentation for Sapphire Rapids

 arch/x86/events/core.c                    |   8 +-
 arch/x86/events/intel/core.c              | 383 ++++++++++++++++++++++++++++--
 arch/x86/events/intel/ds.c                | 112 ++++++++-
 arch/x86/events/perf_event.h              |  17 +-
 arch/x86/include/asm/perf_event.h         |  16 +-
 include/linux/perf_event.h                |   1 +
 include/uapi/linux/perf_event.h           |  30 ++-
 kernel/events/core.c                      |   6 +
 tools/include/uapi/linux/perf_event.h     |  30 ++-
 tools/perf/Documentation/perf-report.txt  |   9 +-
 tools/perf/Documentation/perf-stat.txt    |  14 +-
 tools/perf/Documentation/topdown.txt      |  78 +++++-
 tools/perf/arch/x86/util/Build            |   1 +
 tools/perf/arch/x86/util/mem-events.c     |  44 ++++
 tools/perf/builtin-c2c.c                  |   3 +
 tools/perf/builtin-mem.c                  |   2 +-
 tools/perf/builtin-stat.c                 |  34 ++-
 tools/perf/tests/sample-parsing.c         |   3 +-
 tools/perf/util/event.h                   |   1 +
 tools/perf/util/evsel.c                   |  24 +-
 tools/perf/util/evsel.h                   |   1 +
 tools/perf/util/hist.c                    |  13 +-
 tools/perf/util/hist.h                    |   3 +
 tools/perf/util/mem-events.c              |  36 +++
 tools/perf/util/mem-events.h              |   5 +
 tools/perf/util/perf_event_attr_fprintf.c |   2 +-
 tools/perf/util/record.c                  |   4 +-
 tools/perf/util/session.c                 |   3 +
 tools/perf/util/sort.c                    |  83 ++++++-
 tools/perf/util/sort.h                    |   4 +
 tools/perf/util/stat-shadow.c             |  92 +++++++
 tools/perf/util/stat.c                    |   4 +
 tools/perf/util/stat.h                    |   9 +
 tools/perf/util/synthetic-events.c        |   8 +
 34 files changed, 1024 insertions(+), 59 deletions(-)
 create mode 100644 tools/perf/arch/x86/util/mem-events.c

-- 
2.7.4



* [PATCH 01/12] perf/core: Add PERF_SAMPLE_WEIGHT_EXT
From: kan.liang @ 2021-01-19 20:38 UTC
  To: peterz, acme, mingo, linux-kernel
  Cc: eranian, namhyung, jolsa, ak, yao.jin, Kan Liang

From: Kan Liang <kan.liang@linux.intel.com>

The current PERF_SAMPLE_WEIGHT sample type is very useful to express the
cost of an action represented by the sample. This allows the profiler
to scale the samples to be more informative to the programmer. It can
also help to locate a hotspot, e.g., when profiling by memory latencies,
the expensive loads appear higher up in the histograms. But the current
PERF_SAMPLE_WEIGHT sample type is determined by only one factor. This
could be a problem if users want two or more factors to contribute to
the weight. For example, the Golden Cove core PMU can provide both the
instruction latency and the cache latency information as factors for
memory profiling.

Add a new sample type, PERF_SAMPLE_WEIGHT_EXT, as an extension of the
PERF_SAMPLE_WEIGHT sample type.

The low 16 bits are used as the weight value for the instruction
latency, defined as the delay measured between the dispatch of an
instruction for execution and its completion. This is quite generic and
can be extended to other architectures, as long as the hardware
provides suitable values. The other fields are reserved for future use.
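
For illustration only, a minimal sketch of how a user-space consumer of
the new sample field could extract the latency; the union mirrors the
uapi definition added by this patch, and all sample parsing around it is
elided:

  #include <linux/perf_event.h>

  /* Sketch: pull the instruction latency out of a parsed
   * PERF_SAMPLE_WEIGHT_EXT value.
   */
  static inline __u16 sample_instr_latency(__u64 raw_weight_ext)
  {
          union perf_weight_ext w = { .val = raw_weight_ext };

          return w.instr_latency;
  }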

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
 include/linux/perf_event.h      |  1 +
 include/uapi/linux/perf_event.h | 18 +++++++++++++++++-
 kernel/events/core.c            |  6 ++++++
 3 files changed, 24 insertions(+), 1 deletion(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 9a38f57..005b6b8 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -1030,6 +1030,7 @@ struct perf_sample_data {
 	u64				cgroup;
 	u64				data_page_size;
 	u64				code_page_size;
+	union perf_weight_ext		weight_ext;
 } ____cacheline_aligned;
 
 /* default value for data source */
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index b15e344..d0129e5 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -145,8 +145,9 @@ enum perf_event_sample_format {
 	PERF_SAMPLE_CGROUP			= 1U << 21,
 	PERF_SAMPLE_DATA_PAGE_SIZE		= 1U << 22,
 	PERF_SAMPLE_CODE_PAGE_SIZE		= 1U << 23,
+	PERF_SAMPLE_WEIGHT_EXT			= 1U << 24,
 
-	PERF_SAMPLE_MAX = 1U << 24,		/* non-ABI */
+	PERF_SAMPLE_MAX = 1U << 25,		/* non-ABI */
 
 	__PERF_SAMPLE_CALLCHAIN_EARLY		= 1ULL << 63, /* non-ABI; internal use */
 };
@@ -900,6 +901,13 @@ enum perf_event_type {
 	 *	  char			data[size]; } && PERF_SAMPLE_AUX
 	 *	{ u64			data_page_size;} && PERF_SAMPLE_DATA_PAGE_SIZE
 	 *	{ u64			code_page_size;} && PERF_SAMPLE_CODE_PAGE_SIZE
+	 *	{ union {
+	 *		u64		weight_ext;
+	 *		struct {
+	 *			u64	instr_latency:16,
+	 *				reserved:48;
+	 *		};
+	 *	} && PERF_SAMPLE_WEIGHT_EXT
 	 * };
 	 */
 	PERF_RECORD_SAMPLE			= 9,
@@ -1248,4 +1256,12 @@ struct perf_branch_entry {
 		reserved:40;
 };
 
+union perf_weight_ext {
+	__u64		val;
+	struct {
+		__u64	instr_latency:16,
+			reserved:48;
+	};
+};
+
 #endif /* _UAPI_LINUX_PERF_EVENT_H */
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 55d1879..9363d12 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -1903,6 +1903,9 @@ static void __perf_event_header_size(struct perf_event *event, u64 sample_type)
 	if (sample_type & PERF_SAMPLE_CODE_PAGE_SIZE)
 		size += sizeof(data->code_page_size);
 
+	if (sample_type & PERF_SAMPLE_WEIGHT_EXT)
+		size += sizeof(data->weight_ext);
+
 	event->header_size = size;
 }
 
@@ -6952,6 +6955,9 @@ void perf_output_sample(struct perf_output_handle *handle,
 			perf_aux_sample_output(event, handle, data);
 	}
 
+	if (sample_type & PERF_SAMPLE_WEIGHT_EXT)
+		perf_output_put(handle, data->weight_ext);
+
 	if (!event->attr.watermark) {
 		int wakeup_events = event->attr.wakeup_events;
 
-- 
2.7.4



* [PATCH 02/12] perf/x86/intel: Factor out intel_update_topdown_event()
From: kan.liang @ 2021-01-19 20:38 UTC
  To: peterz, acme, mingo, linux-kernel
  Cc: eranian, namhyung, jolsa, ak, yao.jin, Kan Liang

From: Kan Liang <kan.liang@linux.intel.com>

Similar to Ice Lake, the Intel Sapphire Rapids server also supports the
topdown performance metrics feature. The difference is that Sapphire
Rapids extends the PERF_METRICS MSR to feature TMA method level 2
metrics, which introduces eight metrics events. The current
icl_update_topdown_event() only checks the four level 1 metrics events.

Factor out intel_update_topdown_event() to facilitate the code sharing
between Ice Lake and Sapphire Rapids.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
 arch/x86/events/intel/core.c | 20 +++++++++++++-------
 1 file changed, 13 insertions(+), 7 deletions(-)

diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index d4569bf..8eba41b 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -2337,8 +2337,8 @@ static void __icl_update_topdown_event(struct perf_event *event,
 	}
 }
 
-static void update_saved_topdown_regs(struct perf_event *event,
-				      u64 slots, u64 metrics)
+static void update_saved_topdown_regs(struct perf_event *event, u64 slots,
+				      u64 metrics, int metric_end)
 {
 	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
 	struct perf_event *other;
@@ -2347,7 +2347,7 @@ static void update_saved_topdown_regs(struct perf_event *event,
 	event->hw.saved_slots = slots;
 	event->hw.saved_metric = metrics;
 
-	for_each_set_bit(idx, cpuc->active_mask, INTEL_PMC_IDX_TD_BE_BOUND + 1) {
+	for_each_set_bit(idx, cpuc->active_mask, metric_end + 1) {
 		if (!is_topdown_idx(idx))
 			continue;
 		other = cpuc->events[idx];
@@ -2362,7 +2362,8 @@ static void update_saved_topdown_regs(struct perf_event *event,
  * The PERF_METRICS and Fixed counter 3 are read separately. The values may be
  * modified by an NMI. The PMU has to be disabled before calling this function.
  */
-static u64 icl_update_topdown_event(struct perf_event *event)
+
+static u64 intel_update_topdown_event(struct perf_event *event, int metric_end)
 {
 	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
 	struct perf_event *other;
@@ -2378,7 +2379,7 @@ static u64 icl_update_topdown_event(struct perf_event *event)
 	/* read PERF_METRICS */
 	rdpmcl(INTEL_PMC_FIXED_RDPMC_METRICS, metrics);
 
-	for_each_set_bit(idx, cpuc->active_mask, INTEL_PMC_IDX_TD_BE_BOUND + 1) {
+	for_each_set_bit(idx, cpuc->active_mask, metric_end + 1) {
 		if (!is_topdown_idx(idx))
 			continue;
 		other = cpuc->events[idx];
@@ -2404,7 +2405,7 @@ static u64 icl_update_topdown_event(struct perf_event *event)
 		 * Don't need to reset the PERF_METRICS and Fixed counter 3.
 		 * Because the values will be restored in next schedule in.
 		 */
-		update_saved_topdown_regs(event, slots, metrics);
+		update_saved_topdown_regs(event, slots, metrics, metric_end);
 		reset = false;
 	}
 
@@ -2413,12 +2414,17 @@ static u64 icl_update_topdown_event(struct perf_event *event)
 		wrmsrl(MSR_CORE_PERF_FIXED_CTR3, 0);
 		wrmsrl(MSR_PERF_METRICS, 0);
 		if (event)
-			update_saved_topdown_regs(event, 0, 0);
+			update_saved_topdown_regs(event, 0, 0, metric_end);
 	}
 
 	return slots;
 }
 
+static u64 icl_update_topdown_event(struct perf_event *event)
+{
+	return intel_update_topdown_event(event, INTEL_PMC_IDX_TD_BE_BOUND);
+}
+
 static void intel_pmu_read_topdown_event(struct perf_event *event)
 {
 	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
-- 
2.7.4



* [PATCH 03/12] perf/x86/intel: Add perf core PMU support for Sapphire Rapids
From: kan.liang @ 2021-01-19 20:38 UTC
  To: peterz, acme, mingo, linux-kernel
  Cc: eranian, namhyung, jolsa, ak, yao.jin, Kan Liang

From: Kan Liang <kan.liang@linux.intel.com>

Add perf core PMU support for the Intel Sapphire Rapids server, which is
the successor of the Intel Ice Lake server. The enabling code is based
on Ice Lake, but there are several new features introduced.

The event encoding is changed and simplified, e.g., the event codes
below 0x90 are restricted to counters 0-3, while the event codes at or
above 0x90 are likely to have no restrictions. The event constraints,
extra_regs table, and hardware cache events table are changed
accordingly.

A new Precise Distribution (PDist) facility is introduced, which
further minimizes the skid when a precise event is programmed on GP
counter 0. The facility is enabled with the :ppp event qualifier. For
it to work, the period must be initialized with a value larger than
127. Add spr_limit_period() to apply the limit for :ppp events.
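
As a usage illustration (the event and the fixed period below are
arbitrary; a shorter period would be raised to 128 by
spr_limit_period() anyway):

  perf record -e instructions:ppp -c 128 -- <workload>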

Two new data source fields, data block & address block, are added to the
PEBS Memory Info Record for the load latency event. To enable the
feature:
- An auxiliary event has to be enabled together with the load latency
  event on Sapphire Rapids. A new flag, PMU_FL_MEM_LOADS_AUX, is
  introduced to indicate the case. A new event, mem-loads-aux, is
  exposed to sysfs for the user tool.
  Add a check in hw_config(). If the auxiliary event is not detected,
  return a unique error, -ENODATA. (See the sketch after this list for
  the expected group layout.)
- The union perf_mem_data_src is extended to support the new fields.
- Ice Lake and earlier models do not support block information, but the
  fields may be set by HW on some machines. Add pebs_no_block to
  explicitly indicate the previous platforms which don't support the new
  block fields. Accesses to the new block fields are ignored on those
  platforms.
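
For illustration only, a raw perf_event_open() sketch of the required
group layout; the event encodings come from the sysfs strings added
below, while the ldlat placement in config1, the period, and the
precise_ip level are assumptions:

  #include <linux/perf_event.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  int main(void)
  {
          struct perf_event_attr aux = {
                  .type   = PERF_TYPE_RAW,
                  .size   = sizeof(struct perf_event_attr),
                  .config = 0x0203,  /* mem-loads-aux: event=0x03,umask=0x2 */
          };
          struct perf_event_attr ld = {
                  .type          = PERF_TYPE_RAW,
                  .size          = sizeof(struct perf_event_attr),
                  .config        = 0x01cd,  /* mem-loads: event=0xcd,umask=0x1 */
                  .config1       = 3,       /* assumed ldlat threshold */
                  .sample_period = 1000,    /* arbitrary */
                  .sample_type   = PERF_SAMPLE_DATA_SRC | PERF_SAMPLE_WEIGHT,
                  .precise_ip    = 2,
          };
          /* The auxiliary event leads; the load latency event joins its group. */
          int leader = syscall(__NR_perf_event_open, &aux, 0, -1, -1, 0);
          int fd = syscall(__NR_perf_event_open, &ld, 0, -1, leader, 0);

          return (leader < 0 || fd < 0) ? 1 : 0;
  }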

A new Store Latency facility is introduced, which leverages the PEBS
facility to provide additional information about sampled stores. The
additional information includes the data address, memory auxiliary info
(e.g., data source, STLB miss) and the latency of the store access. To
enable the facility, the new event (0x02cd) has to be programmed on GP
counter 0. A new flag, PERF_X86_EVENT_PEBS_STLAT, is introduced to
indicate the event. store_latency_data() is introduced to parse the
memory auxiliary info.

The layout of the access latency field of the PEBS Memory Info Record
has been changed. Two latencies, the instruction latency (bits 15:0) and
the cache access latency (bits 47:32), are recorded; a small sketch of
the bit layout follows this list.
- The cache access latency is similar to the previous memory access
  latency.
  For loads, the latency starts with the actual cache access and lasts
  until the data is returned by the memory subsystem.
  For stores, the latency starts when the demand write accesses the L1
  data cache and lasts until the cacheline write is completed in the
  memory subsystem.
  The cache access latency is stored in the perf record with the sample
  type PERF_SAMPLE_WEIGHT.
- The instruction latency starts with the dispatch of the load operation
  for execution and lasts until completion of the instruction it belongs
  to.
  Add a new flag, PMU_FL_INSTR_LATENCY, to indicate the instruction
  latency support. The instruction latency is stored in the perf record
  with the new sample type PERF_SAMPLE_WEIGHT_EXT.
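
A minimal sketch of the resulting bit layout, using the mask and offset
that the ds.c hunk below defines:

  #define PEBS_LATENCY_MASK               0xffff
  #define PEBS_CACHE_LATENCY_OFFSET       32

  /* Split the 64-bit access latency field of the PEBS Memory Info
   * Record into its two components.
   */
  u64 instr_latency = latency & PEBS_LATENCY_MASK;          /* bits 15:0  */
  u64 cache_latency = (latency >> PEBS_CACHE_LATENCY_OFFSET)
                      & PEBS_LATENCY_MASK;                  /* bits 47:32 */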

Extend the PERF_METRICS MSR to feature TMA method level 2 metrics. The
lower half of the register holds the TMA level 1 metrics (legacy). The
upper half is also divided into four 8-bit fields for the new level 2
metrics. Expose all eight topdown metrics events to user space.
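
For illustration, a sketch of how a consumer could turn one 8-bit
PERF_METRICS field into a slots count; the fraction-of-0xff encoding is
assumed to match the existing level 1 metrics:

  /* Sketch: each of the eight metrics occupies one byte of PERF_METRICS
   * and encodes a fraction (byte / 0xff) of the SLOTS fixed counter.
   * idx 0-3 are the level 1 metrics, idx 4-7 the new level 2 metrics.
   */
  static u64 metric_slots(u64 metrics, u64 slots, int idx)
  {
          u64 frac = (metrics >> (idx * 8)) & 0xff;

          return slots * frac / 0xff;  /* sketch; may overflow for huge counts */
  }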

The full description of the SPR features can be found in the Intel
Architecture Instruction Set Extensions and Future Features
Programming Reference, 319433-041.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
 arch/x86/events/intel/core.c      | 320 +++++++++++++++++++++++++++++++++++++-
 arch/x86/events/intel/ds.c        | 112 ++++++++++++-
 arch/x86/events/perf_event.h      |  12 +-
 arch/x86/include/asm/perf_event.h |  16 +-
 include/uapi/linux/perf_event.h   |  12 +-
 5 files changed, 456 insertions(+), 16 deletions(-)

diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 8eba41b..a54d4a9 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -275,6 +275,55 @@ static struct extra_reg intel_icl_extra_regs[] __read_mostly = {
 	EVENT_EXTRA_END
 };
 
+static struct extra_reg intel_spr_extra_regs[] __read_mostly = {
+	INTEL_UEVENT_EXTRA_REG(0x012a, MSR_OFFCORE_RSP_0, 0x3fffffffffull, RSP_0),
+	INTEL_UEVENT_EXTRA_REG(0x012b, MSR_OFFCORE_RSP_1, 0x3fffffffffull, RSP_1),
+	INTEL_UEVENT_PEBS_LDLAT_EXTRA_REG(0x01cd),
+	INTEL_UEVENT_EXTRA_REG(0x01c6, MSR_PEBS_FRONTEND, 0x7fff17, FE),
+	EVENT_EXTRA_END
+};
+
+static struct event_constraint intel_spr_event_constraints[] = {
+	FIXED_EVENT_CONSTRAINT(0x00c0, 0),	/* INST_RETIRED.ANY */
+	FIXED_EVENT_CONSTRAINT(0x01c0, 0),	/* INST_RETIRED.PREC_DIST */
+	FIXED_EVENT_CONSTRAINT(0x003c, 1),	/* CPU_CLK_UNHALTED.CORE */
+	FIXED_EVENT_CONSTRAINT(0x0300, 2),	/* CPU_CLK_UNHALTED.REF */
+	FIXED_EVENT_CONSTRAINT(0x0400, 3),	/* SLOTS */
+	METRIC_EVENT_CONSTRAINT(INTEL_TD_METRIC_RETIRING, 0),
+	METRIC_EVENT_CONSTRAINT(INTEL_TD_METRIC_BAD_SPEC, 1),
+	METRIC_EVENT_CONSTRAINT(INTEL_TD_METRIC_FE_BOUND, 2),
+	METRIC_EVENT_CONSTRAINT(INTEL_TD_METRIC_BE_BOUND, 3),
+	METRIC_EVENT_CONSTRAINT(INTEL_TD_METRIC_HEAVY_OPS, 4),
+	METRIC_EVENT_CONSTRAINT(INTEL_TD_METRIC_BR_MISPREDICT, 5),
+	METRIC_EVENT_CONSTRAINT(INTEL_TD_METRIC_FETCH_LAT, 6),
+	METRIC_EVENT_CONSTRAINT(INTEL_TD_METRIC_MEM_BOUND, 7),
+
+	INTEL_EVENT_CONSTRAINT(0x2e, 0xff),
+	INTEL_EVENT_CONSTRAINT(0x3c, 0xff),
+	/*
+	 * Generally event codes < 0x90 are restricted to counters 0-3.
+	 * The 0x2E and 0x3C are exceptions, which have no restriction.
+	 */
+	INTEL_EVENT_CONSTRAINT_RANGE(0x01, 0x8f, 0xf),
+
+	INTEL_UEVENT_CONSTRAINT(0x01a3, 0xf),
+	INTEL_UEVENT_CONSTRAINT(0x02a3, 0xf),
+	INTEL_UEVENT_CONSTRAINT(0x08a3, 0xf),
+	INTEL_UEVENT_CONSTRAINT(0x04a4, 0x1),
+	INTEL_UEVENT_CONSTRAINT(0x08a4, 0x1),
+	INTEL_UEVENT_CONSTRAINT(0x02cd, 0x1),
+	INTEL_EVENT_CONSTRAINT(0xce, 0x1),
+	INTEL_EVENT_CONSTRAINT_RANGE(0xd0, 0xdf, 0xf),
+	/*
+	 * Generally event codes >= 0x90 are likely to have no restrictions.
+	 * The exceptions are defined above.
+	 */
+	INTEL_EVENT_CONSTRAINT_RANGE(0x90, 0xfe, 0xff),
+
+	EVENT_CONSTRAINT_END
+};
+
+
 EVENT_ATTR_STR(mem-loads,	mem_ld_nhm,	"event=0x0b,umask=0x10,ldlat=3");
 EVENT_ATTR_STR(mem-loads,	mem_ld_snb,	"event=0xcd,umask=0x1,ldlat=3");
 EVENT_ATTR_STR(mem-stores,	mem_st_snb,	"event=0xcd,umask=0x2");
@@ -314,11 +363,15 @@ EVENT_ATTR_STR_HT(topdown-recovery-bubbles, td_recovery_bubbles,
 EVENT_ATTR_STR_HT(topdown-recovery-bubbles.scale, td_recovery_bubbles_scale,
 	"4", "2");
 
-EVENT_ATTR_STR(slots,			slots,		"event=0x00,umask=0x4");
-EVENT_ATTR_STR(topdown-retiring,	td_retiring,	"event=0x00,umask=0x80");
-EVENT_ATTR_STR(topdown-bad-spec,	td_bad_spec,	"event=0x00,umask=0x81");
-EVENT_ATTR_STR(topdown-fe-bound,	td_fe_bound,	"event=0x00,umask=0x82");
-EVENT_ATTR_STR(topdown-be-bound,	td_be_bound,	"event=0x00,umask=0x83");
+EVENT_ATTR_STR(slots,			slots,			"event=0x00,umask=0x4");
+EVENT_ATTR_STR(topdown-retiring,	td_retiring,		"event=0x00,umask=0x80");
+EVENT_ATTR_STR(topdown-bad-spec,	td_bad_spec,		"event=0x00,umask=0x81");
+EVENT_ATTR_STR(topdown-fe-bound,	td_fe_bound,		"event=0x00,umask=0x82");
+EVENT_ATTR_STR(topdown-be-bound,	td_be_bound,		"event=0x00,umask=0x83");
+EVENT_ATTR_STR(topdown-heavy-ops,	td_heavy_ops,		"event=0x00,umask=0x84");
+EVENT_ATTR_STR(topdown-br-mispredict,	td_br_mispredict,	"event=0x00,umask=0x85");
+EVENT_ATTR_STR(topdown-fetch-lat,	td_fetch_lat,		"event=0x00,umask=0x86");
+EVENT_ATTR_STR(topdown-mem-bound,	td_mem_bound,		"event=0x00,umask=0x87");
 
 static struct attribute *snb_events_attrs[] = {
 	EVENT_PTR(td_slots_issued),
@@ -384,6 +437,108 @@ static u64 intel_pmu_event_map(int hw_event)
 	return intel_perfmon_event_map[hw_event];
 }
 
+static __initconst const u64 spr_hw_cache_event_ids
+				[PERF_COUNT_HW_CACHE_MAX]
+				[PERF_COUNT_HW_CACHE_OP_MAX]
+				[PERF_COUNT_HW_CACHE_RESULT_MAX] =
+{
+ [ C(L1D ) ] = {
+	[ C(OP_READ) ] = {
+		[ C(RESULT_ACCESS) ] = 0x81d0,
+		[ C(RESULT_MISS)   ] = 0xe124,
+	},
+	[ C(OP_WRITE) ] = {
+		[ C(RESULT_ACCESS) ] = 0x82d0,
+	},
+ },
+ [ C(L1I ) ] = {
+	[ C(OP_READ) ] = {
+		[ C(RESULT_MISS)   ] = 0xe424,
+	},
+	[ C(OP_WRITE) ] = {
+		[ C(RESULT_ACCESS) ] = -1,
+		[ C(RESULT_MISS)   ] = -1,
+	},
+ },
+ [ C(LL  ) ] = {
+	[ C(OP_READ) ] = {
+		[ C(RESULT_ACCESS) ] = 0x12a,
+		[ C(RESULT_MISS)   ] = 0x12a,
+	},
+	[ C(OP_WRITE) ] = {
+		[ C(RESULT_ACCESS) ] = 0x12a,
+		[ C(RESULT_MISS)   ] = 0x12a,
+	},
+ },
+ [ C(DTLB) ] = {
+	[ C(OP_READ) ] = {
+		[ C(RESULT_ACCESS) ] = 0x81d0,
+		[ C(RESULT_MISS)   ] = 0xe12,
+	},
+	[ C(OP_WRITE) ] = {
+		[ C(RESULT_ACCESS) ] = 0x82d0,
+		[ C(RESULT_MISS)   ] = 0xe13,
+	},
+ },
+ [ C(ITLB) ] = {
+	[ C(OP_READ) ] = {
+		[ C(RESULT_ACCESS) ] = -1,
+		[ C(RESULT_MISS)   ] = 0xe11,
+	},
+	[ C(OP_WRITE) ] = {
+		[ C(RESULT_ACCESS) ] = -1,
+		[ C(RESULT_MISS)   ] = -1,
+	},
+	[ C(OP_PREFETCH) ] = {
+		[ C(RESULT_ACCESS) ] = -1,
+		[ C(RESULT_MISS)   ] = -1,
+	},
+ },
+ [ C(BPU ) ] = {
+	[ C(OP_READ) ] = {
+		[ C(RESULT_ACCESS) ] = 0x4c4,
+		[ C(RESULT_MISS)   ] = 0x4c5,
+	},
+	[ C(OP_WRITE) ] = {
+		[ C(RESULT_ACCESS) ] = -1,
+		[ C(RESULT_MISS)   ] = -1,
+	},
+	[ C(OP_PREFETCH) ] = {
+		[ C(RESULT_ACCESS) ] = -1,
+		[ C(RESULT_MISS)   ] = -1,
+	},
+ },
+ [ C(NODE) ] = {
+	[ C(OP_READ) ] = {
+		[ C(RESULT_ACCESS) ] = 0x12a,
+		[ C(RESULT_MISS)   ] = 0x12a,
+	},
+ },
+};
+
+static __initconst const u64 spr_hw_cache_extra_regs
+				[PERF_COUNT_HW_CACHE_MAX]
+				[PERF_COUNT_HW_CACHE_OP_MAX]
+				[PERF_COUNT_HW_CACHE_RESULT_MAX] =
+{
+ [ C(LL  ) ] = {
+	[ C(OP_READ) ] = {
+		[ C(RESULT_ACCESS) ] = 0x10001,
+		[ C(RESULT_MISS)   ] = 0x3fbfc00001,
+	},
+	[ C(OP_WRITE) ] = {
+		[ C(RESULT_ACCESS) ] = 0x3f3ffc0002,
+		[ C(RESULT_MISS)   ] = 0x3f3fc00002,
+	},
+ },
+ [ C(NODE) ] = {
+	[ C(OP_READ) ] = {
+		[ C(RESULT_ACCESS) ] = 0x10c000001,
+		[ C(RESULT_MISS)   ] = 0x3fb3000001,
+	},
+ },
+};
+
 /*
  * Notes on the events:
  * - data reads do not include code reads (comparable to earlier tables)
@@ -2319,6 +2474,17 @@ static void __icl_update_topdown_event(struct perf_event *event,
 {
 	u64 delta, last = 0;
 
+	/*
+	 * Although the unsupported topdown events are not exposed to users,
+	 * users may mistakenly use the unsupported events via RAW format.
+	 * For example, using L2 topdown event, cpu/event=0x00,umask=0x84/,
+	 * on Ice Lake. In this case, the scheduler follows the unknown
+	 * event handling and assigns a GP counter to the event.
+	 * Check the case, and avoid updating unsupported events.
+	 */
+	if (event->hw.idx < INTEL_PMC_IDX_FIXED)
+		return;
+
 	delta = icl_get_topdown_value(event, slots, metrics);
 	if (last_slots)
 		last = icl_get_topdown_value(event, last_slots, last_metrics);
@@ -2425,6 +2591,11 @@ static u64 icl_update_topdown_event(struct perf_event *event)
 	return intel_update_topdown_event(event, INTEL_PMC_IDX_TD_BE_BOUND);
 }
 
+static u64 spr_update_topdown_event(struct perf_event *event)
+{
+	return intel_update_topdown_event(event, INTEL_PMC_IDX_TD_MEM_BOUND);
+}
+
 static void intel_pmu_read_topdown_event(struct perf_event *event)
 {
 	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
@@ -3569,6 +3740,17 @@ static int core_pmu_hw_config(struct perf_event *event)
 	return intel_pmu_bts_config(event);
 }
 
+static inline bool is_mem_loads_event(struct perf_event *event)
+{
+	return (event->attr.config & INTEL_ARCH_EVENT_MASK) == X86_CONFIG(.event=0xcd, .umask=0x01);
+}
+
+static inline bool is_mem_loads_aux_event(struct perf_event *event)
+{
+	return (event->attr.config & INTEL_ARCH_EVENT_MASK) == X86_CONFIG(.event=0x03, .umask=0x02);
+}
+
+
 static int intel_pmu_hw_config(struct perf_event *event)
 {
 	int ret = x86_pmu_hw_config(event);
@@ -3671,6 +3853,31 @@ static int intel_pmu_hw_config(struct perf_event *event)
 		}
 	}
 
+	/*
+	 * To retrieve complete Memory Info of the load latency event, an
+	 * auxiliary event has to be enabled simultaneously. Add a check for
+	 * the load latency event.
+	 *
+	 * In a group, the auxiliary event must be in front of the load latency
+	 * event. The rule is to simplify the implementation of the check.
+	 * That's because perf cannot have a complete group at the moment.
+	 */
+	if (x86_pmu.flags & PMU_FL_MEM_LOADS_AUX &&
+	    (event->attr.sample_type & PERF_SAMPLE_DATA_SRC) &&
+	    is_mem_loads_event(event)) {
+		struct perf_event *leader = event->group_leader;
+		struct perf_event *sibling = NULL;
+
+		if (!is_mem_loads_aux_event(leader)) {
+			for_each_sibling_event(sibling, leader) {
+				if (is_mem_loads_aux_event(sibling))
+					break;
+			}
+			if (list_entry_is_head(sibling, &leader->sibling_list, sibling_list))
+				return -ENODATA;
+		}
+	}
+
 	if (!(event->attr.config & ARCH_PERFMON_EVENTSEL_ANY))
 		return 0;
 
@@ -3871,6 +4078,29 @@ icl_get_event_constraints(struct cpu_hw_events *cpuc, int idx,
 }
 
 static struct event_constraint *
+spr_get_event_constraints(struct cpu_hw_events *cpuc, int idx,
+			  struct perf_event *event)
+{
+	struct event_constraint *c;
+
+	c = icl_get_event_constraints(cpuc, idx, event);
+
+	/*
+	 * The :ppp indicates the Precise Distribution (PDist) facility, which
+	 * is only supported on the GP counter 0. If a :ppp event is not
+	 * available on the GP counter 0, error out.
+	 */
+	if (event->attr.precise_ip == 3) {
+		if (c->idxmsk64 & BIT_ULL(0))
+			return &counter0_constraint;
+
+		return &emptyconstraint;
+	}
+
+	return c;
+}
+
+static struct event_constraint *
 glp_get_event_constraints(struct cpu_hw_events *cpuc, int idx,
 			  struct perf_event *event)
 {
@@ -3959,6 +4189,14 @@ static u64 nhm_limit_period(struct perf_event *event, u64 left)
 	return max(left, 32ULL);
 }
 
+static u64 spr_limit_period(struct perf_event *event, u64 left)
+{
+	if (event->attr.precise_ip == 3)
+		return max(left, 128ULL);
+
+	return left;
+}
+
 PMU_FORMAT_ATTR(event,	"config:0-7"	);
 PMU_FORMAT_ATTR(umask,	"config:8-15"	);
 PMU_FORMAT_ATTR(edge,	"config:18"	);
@@ -4709,6 +4947,42 @@ static struct attribute *icl_tsx_events_attrs[] = {
 	NULL,
 };
 
+
+EVENT_ATTR_STR(mem-stores,	mem_st_spr,	"event=0xcd,umask=0x2");
+EVENT_ATTR_STR(mem-loads-aux,	mem_ld_aux,	"event=0x03,umask=0x2");
+
+static struct attribute *spr_events_attrs[] = {
+	EVENT_PTR(mem_ld_hsw),
+	EVENT_PTR(mem_st_spr),
+	EVENT_PTR(mem_ld_aux),
+	NULL,
+};
+
+static struct attribute *spr_td_events_attrs[] = {
+	EVENT_PTR(slots),
+	EVENT_PTR(td_retiring),
+	EVENT_PTR(td_bad_spec),
+	EVENT_PTR(td_fe_bound),
+	EVENT_PTR(td_be_bound),
+	EVENT_PTR(td_heavy_ops),
+	EVENT_PTR(td_br_mispredict),
+	EVENT_PTR(td_fetch_lat),
+	EVENT_PTR(td_mem_bound),
+	NULL,
+};
+
+static struct attribute *spr_tsx_events_attrs[] = {
+	EVENT_PTR(tx_start),
+	EVENT_PTR(tx_abort),
+	EVENT_PTR(tx_commit),
+	EVENT_PTR(tx_capacity_read),
+	EVENT_PTR(tx_capacity_write),
+	EVENT_PTR(tx_conflict),
+	EVENT_PTR(cycles_t),
+	EVENT_PTR(cycles_ct),
+	NULL,
+};
+
 static ssize_t freeze_on_smi_show(struct device *cdev,
 				  struct device_attribute *attr,
 				  char *buf)
@@ -5475,6 +5749,7 @@ __init int intel_pmu_init(void)
 		x86_pmu.extra_regs = intel_icl_extra_regs;
 		x86_pmu.pebs_aliases = NULL;
 		x86_pmu.pebs_prec_dist = true;
+		x86_pmu.pebs_no_block = true;
 		x86_pmu.flags |= PMU_FL_HAS_RSP_1;
 		x86_pmu.flags |= PMU_FL_NO_HT_SHARING;
 
@@ -5495,6 +5770,41 @@ __init int intel_pmu_init(void)
 		name = "icelake";
 		break;
 
+	case INTEL_FAM6_SAPPHIRERAPIDS_X:
+		pmem = true;
+		x86_pmu.late_ack = true;
+		memcpy(hw_cache_event_ids, spr_hw_cache_event_ids, sizeof(hw_cache_event_ids));
+		memcpy(hw_cache_extra_regs, spr_hw_cache_extra_regs, sizeof(hw_cache_extra_regs));
+
+		x86_pmu.event_constraints = intel_spr_event_constraints;
+		x86_pmu.pebs_constraints = intel_spr_pebs_event_constraints;
+		x86_pmu.extra_regs = intel_spr_extra_regs;
+		x86_pmu.limit_period = spr_limit_period;
+		x86_pmu.pebs_aliases = NULL;
+		x86_pmu.pebs_prec_dist = true;
+		x86_pmu.flags |= PMU_FL_HAS_RSP_1;
+		x86_pmu.flags |= PMU_FL_NO_HT_SHARING;
+		x86_pmu.flags |= PMU_FL_PEBS_ALL;
+		x86_pmu.flags |= PMU_FL_INSTR_LATENCY;
+		x86_pmu.flags |= PMU_FL_MEM_LOADS_AUX;
+
+		x86_pmu.hw_config = hsw_hw_config;
+		x86_pmu.get_event_constraints = spr_get_event_constraints;
+		extra_attr = boot_cpu_has(X86_FEATURE_RTM) ?
+			hsw_format_attr : nhm_format_attr;
+		extra_skl_attr = skl_format_attr;
+		mem_attr = spr_events_attrs;
+		td_attr = spr_td_events_attrs;
+		tsx_attr = spr_tsx_events_attrs;
+		x86_pmu.rtm_abort_event = X86_CONFIG(.event=0xc9, .umask=0x04);
+		x86_pmu.lbr_pt_coexist = true;
+		intel_pmu_pebs_data_source_skl(pmem);
+		x86_pmu.update_topdown_event = spr_update_topdown_event;
+		x86_pmu.set_topdown_event_period = icl_set_topdown_event_period;
+		pr_cont("Sapphire Rapids events, ");
+		name = "sapphire_rapids";
+		break;
+
 	default:
 		switch (x86_pmu.version) {
 		case 1:
diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index 67dbc91..d27a30c 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -36,7 +36,9 @@ union intel_x86_pebs_dse {
 		unsigned int ld_dse:4;
 		unsigned int ld_stlb_miss:1;
 		unsigned int ld_locked:1;
-		unsigned int ld_reserved:26;
+		unsigned int ld_data_blk:1;
+		unsigned int ld_addr_blk:1;
+		unsigned int ld_reserved:24;
 	};
 	struct {
 		unsigned int st_l1d_hit:1;
@@ -45,6 +47,12 @@ union intel_x86_pebs_dse {
 		unsigned int st_locked:1;
 		unsigned int st_reserved2:26;
 	};
+	struct {
+		unsigned int st_lat_dse:4;
+		unsigned int st_lat_stlb_miss:1;
+		unsigned int st_lat_locked:1;
+		unsigned int ld_reserved3:26;
+	};
 };
 
 
@@ -198,6 +206,63 @@ static u64 load_latency_data(u64 status)
 	if (dse.ld_locked)
 		val |= P(LOCK, LOCKED);
 
+	/*
+	 * Ice Lake and earlier models do not support block information.
+	 */
+	if (x86_pmu.pebs_no_block) {
+		val |= P(BLK, NA);
+		return val;
+	}
+	/*
+	 * bit 6: load was blocked since its data could not be forwarded
+	 *        from a preceding store
+	 */
+	if (dse.ld_data_blk)
+		val |= P(BLK, DATA);
+
+	/*
+	 * bit 7: load was blocked due to potential address conflict with
+	 *        a preceding store
+	 */
+	if (dse.ld_addr_blk)
+		val |= P(BLK, ADDR);
+
+	if (!dse.ld_data_blk && !dse.ld_addr_blk)
+		val |= P(BLK, NA);
+
+	return val;
+}
+
+static u64 store_latency_data(u64 status)
+{
+	union intel_x86_pebs_dse dse;
+	u64 val;
+
+	dse.val = status;
+
+	/*
+	 * use the mapping table for bit 0-3
+	 */
+	val = pebs_data_source[dse.st_lat_dse];
+
+	/*
+	 * bit 4: TLB access
+	 * 0 = did not miss 2nd level TLB
+	 * 1 = missed 2nd level TLB
+	 */
+	if (dse.st_lat_stlb_miss)
+		val |= P(TLB, MISS) | P(TLB, L2);
+	else
+		val |= P(TLB, HIT) | P(TLB, L1) | P(TLB, L2);
+
+	/*
+	 * bit 5: locked prefix
+	 */
+	if (dse.st_lat_locked)
+		val |= P(LOCK, LOCKED);
+
+	val |= P(BLK, NA);
+
 	return val;
 }
 
@@ -870,6 +935,27 @@ struct event_constraint intel_icl_pebs_event_constraints[] = {
 	EVENT_CONSTRAINT_END
 };
 
+struct event_constraint intel_spr_pebs_event_constraints[] = {
+	INTEL_FLAGS_UEVENT_CONSTRAINT(0x1c0, 0x100000000ULL),
+	INTEL_FLAGS_UEVENT_CONSTRAINT(0x0400, 0x800000000ULL),
+
+	INTEL_PLD_CONSTRAINT(0x1cd, 0xfe),
+	INTEL_PSD_CONSTRAINT(0x2cd, 0x1),
+	INTEL_FLAGS_UEVENT_CONSTRAINT_DATALA_LD(0x1d0, 0xf),
+	INTEL_FLAGS_UEVENT_CONSTRAINT_DATALA_ST(0x2d0, 0xf),
+
+	INTEL_FLAGS_EVENT_CONSTRAINT_DATALA_LD_RANGE(0xd1, 0xd4, 0xf),
+
+	INTEL_FLAGS_EVENT_CONSTRAINT(0xd0, 0xf),
+
+	/*
+	 * Everything else is handled by PMU_FL_PEBS_ALL, because we
+	 * need the full constraints from the main table.
+	 */
+
+	EVENT_CONSTRAINT_END
+};
+
 struct event_constraint *intel_pebs_constraints(struct perf_event *event)
 {
 	struct event_constraint *c;
@@ -1331,6 +1417,8 @@ static u64 get_data_src(struct perf_event *event, u64 aux)
 
 	if (fl & PERF_X86_EVENT_PEBS_LDLAT)
 		val = load_latency_data(aux);
+	else if (fl & PERF_X86_EVENT_PEBS_STLAT)
+		val = store_latency_data(aux);
 	else if (fst && (fl & PERF_X86_EVENT_PEBS_HSW_PREC))
 		val = precise_datala_hsw(event, aux);
 	else if (fst)
@@ -1507,6 +1595,9 @@ static void adaptive_pebs_save_regs(struct pt_regs *regs,
 #endif
 }
 
+#define PEBS_LATENCY_MASK			0xffff
+#define PEBS_CACHE_LATENCY_OFFSET		32
+
 /*
  * With adaptive PEBS the layout depends on what fields are configured.
  */
@@ -1577,9 +1668,20 @@ static void setup_pebs_adaptive_sample_data(struct perf_event *event,
 	}
 
 	if (format_size & PEBS_DATACFG_MEMINFO) {
-		if (sample_type & PERF_SAMPLE_WEIGHT)
-			data->weight = meminfo->latency ?:
+		if (sample_type & PERF_SAMPLE_WEIGHT) {
+			u64 weight = meminfo->latency;
+
+			if (x86_pmu.flags & PMU_FL_INSTR_LATENCY)
+				weight >>= PEBS_CACHE_LATENCY_OFFSET;
+			data->weight = weight & PEBS_LATENCY_MASK ?:
 				intel_get_tsx_weight(meminfo->tsx_tuning);
+		}
+
+		if (sample_type & PERF_SAMPLE_WEIGHT_EXT) {
+			data->weight_ext.val = 0;
+			if (x86_pmu.flags & PMU_FL_INSTR_LATENCY)
+				data->weight_ext.instr_latency = meminfo->latency & PEBS_LATENCY_MASK;
+		}
 
 		if (sample_type & PERF_SAMPLE_DATA_SRC)
 			data->data_src.val = get_data_src(event, meminfo->aux);
@@ -2026,8 +2128,10 @@ void __init intel_ds_init(void)
 	x86_pmu.bts  = boot_cpu_has(X86_FEATURE_BTS);
 	x86_pmu.pebs = boot_cpu_has(X86_FEATURE_PEBS);
 	x86_pmu.pebs_buffer_size = PEBS_BUFFER_SIZE;
-	if (x86_pmu.version <= 4)
+	if (x86_pmu.version <= 4) {
 		x86_pmu.pebs_no_isolation = 1;
+		x86_pmu.pebs_no_block = 1;
+	}
 
 	if (x86_pmu.pebs) {
 		char pebs_type = x86_pmu.intel_cap.pebs_trap ?  '+' : '-';
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 7895cf4..0ae4e50 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -80,6 +80,7 @@ static inline bool constraint_match(struct event_constraint *c, u64 ecode)
 #define PERF_X86_EVENT_PAIR		0x1000 /* Large Increment per Cycle */
 #define PERF_X86_EVENT_LBR_SELECT	0x2000 /* Save/Restore MSR_LBR_SELECT */
 #define PERF_X86_EVENT_TOPDOWN		0x4000 /* Count Topdown slots/metrics events */
+#define PERF_X86_EVENT_PEBS_STLAT	0x8000 /* st+stlat data address sampling */
 
 static inline bool is_topdown_count(struct perf_event *event)
 {
@@ -443,6 +444,10 @@ struct cpu_hw_events {
 	__EVENT_CONSTRAINT(c, n, INTEL_ARCH_EVENT_MASK|X86_ALL_EVENT_FLAGS, \
 			   HWEIGHT(n), 0, PERF_X86_EVENT_PEBS_LDLAT)
 
+#define INTEL_PSD_CONSTRAINT(c, n)	\
+	__EVENT_CONSTRAINT(c, n, INTEL_ARCH_EVENT_MASK|X86_ALL_EVENT_FLAGS, \
+			   HWEIGHT(n), 0, PERF_X86_EVENT_PEBS_STLAT)
+
 #define INTEL_PST_CONSTRAINT(c, n)	\
 	__EVENT_CONSTRAINT(c, n, INTEL_ARCH_EVENT_MASK|X86_ALL_EVENT_FLAGS, \
 			  HWEIGHT(n), 0, PERF_X86_EVENT_PEBS_ST)
@@ -724,7 +729,8 @@ struct x86_pmu {
 			pebs_broken		:1,
 			pebs_prec_dist		:1,
 			pebs_no_tlb		:1,
-			pebs_no_isolation	:1;
+			pebs_no_isolation	:1,
+			pebs_no_block		:1;
 	int		pebs_record_size;
 	int		pebs_buffer_size;
 	int		max_pebs_events;
@@ -871,6 +877,8 @@ do {									\
 #define PMU_FL_PEBS_ALL		0x10 /* all events are valid PEBS events */
 #define PMU_FL_TFA		0x20 /* deal with TSX force abort */
 #define PMU_FL_PAIR		0x40 /* merge counters for large incr. events */
+#define PMU_FL_INSTR_LATENCY	0x80 /* Support Instruction Latency in PEBS Memory Info Record */
+#define PMU_FL_MEM_LOADS_AUX	0x100 /* Require an auxiliary event for the complete memory info */
 
 #define EVENT_VAR(_id)  event_attr_##_id
 #define EVENT_PTR(_id) &event_attr_##_id.attr.attr
@@ -1157,6 +1165,8 @@ extern struct event_constraint intel_skl_pebs_event_constraints[];
 
 extern struct event_constraint intel_icl_pebs_event_constraints[];
 
+extern struct event_constraint intel_spr_pebs_event_constraints[];
+
 struct event_constraint *intel_pebs_constraints(struct perf_event *event);
 
 void intel_pmu_pebs_add(struct perf_event *event);
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index b9a7fd0..f54484b 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -261,8 +261,12 @@ struct x86_pmu_capability {
 #define INTEL_PMC_IDX_TD_BAD_SPEC		(INTEL_PMC_IDX_METRIC_BASE + 1)
 #define INTEL_PMC_IDX_TD_FE_BOUND		(INTEL_PMC_IDX_METRIC_BASE + 2)
 #define INTEL_PMC_IDX_TD_BE_BOUND		(INTEL_PMC_IDX_METRIC_BASE + 3)
-#define INTEL_PMC_IDX_METRIC_END		INTEL_PMC_IDX_TD_BE_BOUND
-#define INTEL_PMC_MSK_TOPDOWN			((0xfull << INTEL_PMC_IDX_METRIC_BASE) | \
+#define INTEL_PMC_IDX_TD_HEAVY_OPS		(INTEL_PMC_IDX_METRIC_BASE + 4)
+#define INTEL_PMC_IDX_TD_BR_MISPREDICT		(INTEL_PMC_IDX_METRIC_BASE + 5)
+#define INTEL_PMC_IDX_TD_FETCH_LAT		(INTEL_PMC_IDX_METRIC_BASE + 6)
+#define INTEL_PMC_IDX_TD_MEM_BOUND		(INTEL_PMC_IDX_METRIC_BASE + 7)
+#define INTEL_PMC_IDX_METRIC_END		INTEL_PMC_IDX_TD_MEM_BOUND
+#define INTEL_PMC_MSK_TOPDOWN			((0xffull << INTEL_PMC_IDX_METRIC_BASE) | \
 						INTEL_PMC_MSK_FIXED_SLOTS)
 
 /*
@@ -280,8 +284,12 @@ struct x86_pmu_capability {
 #define INTEL_TD_METRIC_BAD_SPEC		0x8100	/* Bad speculation metric */
 #define INTEL_TD_METRIC_FE_BOUND		0x8200	/* FE bound metric */
 #define INTEL_TD_METRIC_BE_BOUND		0x8300	/* BE bound metric */
-#define INTEL_TD_METRIC_MAX			INTEL_TD_METRIC_BE_BOUND
-#define INTEL_TD_METRIC_NUM			4
+#define INTEL_TD_METRIC_HEAVY_OPS		0x8400	/* Heavy Operations metric */
+#define INTEL_TD_METRIC_BR_MISPREDICT		0x8500	/* Branch Mispredict metric */
+#define INTEL_TD_METRIC_FETCH_LAT		0x8600	/* Fetch Latency metric */
+#define INTEL_TD_METRIC_MEM_BOUND		0x8700	/* Memory bound metric */
+#define INTEL_TD_METRIC_MAX			INTEL_TD_METRIC_MEM_BOUND
+#define INTEL_TD_METRIC_NUM			8
 
 static inline bool is_metric_idx(int idx)
 {
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index d0129e5..17f19cc 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -1135,14 +1135,16 @@ union perf_mem_data_src {
 			mem_lvl_num:4,	/* memory hierarchy level number */
 			mem_remote:1,   /* remote */
 			mem_snoopx:2,	/* snoop mode, ext */
-			mem_rsvd:24;
+			mem_blk:3,	/* access blocked */
+			mem_rsvd:21;
 	};
 };
 #elif defined(__BIG_ENDIAN_BITFIELD)
 union perf_mem_data_src {
 	__u64 val;
 	struct {
-		__u64	mem_rsvd:24,
+		__u64	mem_rsvd:21,
+			mem_blk:3,	/* access blocked */
 			mem_snoopx:2,	/* snoop mode, ext */
 			mem_remote:1,   /* remote */
 			mem_lvl_num:4,	/* memory hierarchy level number */
@@ -1225,6 +1227,12 @@ union perf_mem_data_src {
 #define PERF_MEM_TLB_OS		0x40 /* OS fault handler */
 #define PERF_MEM_TLB_SHIFT	26
 
+/* Access blocked */
+#define PERF_MEM_BLK_NA		0x01 /* not available */
+#define PERF_MEM_BLK_DATA	0x02 /* data could not be forwarded */
+#define PERF_MEM_BLK_ADDR	0x04 /* address conflict */
+#define PERF_MEM_BLK_SHIFT	40
+
 #define PERF_MEM_S(a, s) \
 	(((__u64)PERF_MEM_##a##_##s) << PERF_MEM_##a##_SHIFT)
 
-- 
2.7.4



* [PATCH 04/12] perf/x86/intel: Support CPUID 10.ECX to disable fixed counters
From: kan.liang @ 2021-01-19 20:38 UTC
  To: peterz, acme, mingo, linux-kernel
  Cc: eranian, namhyung, jolsa, ak, yao.jin, Kan Liang

From: Kan Liang <kan.liang@linux.intel.com>

With Architectural Performance Monitoring Version 5, the CPUID 10.ECX
leaf indicates the fixed counter enumeration. This extends the previous
count to a bitmap, which allows disabling even lower-numbered fixed
counters. It could be used by a hypervisor.

The existing intel_ctrl variable is used to remember the bitmask of the
counters. All code that reads all counters is fixed to check this extra
bitmask.
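
A minimal user-space sketch of the enumeration logic, mirroring the
fallback in intel_pmu_init() below (__cpuid is GCC's <cpuid.h> helper):

  #include <cpuid.h>

  static int num_fixed_counters(void)
  {
          unsigned int eax, ebx, ecx, edx;

          __cpuid(0xa, eax, ebx, ecx, edx);
          if (ecx)                                /* V5+: bitmap of fixed counters */
                  return 32 - __builtin_clz(ecx); /* fls(ecx) */
          return edx & 0x1f;                      /* legacy count in EDX bits 4:0 */
  }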

Originally-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
 arch/x86/events/core.c       |  8 +++++++-
 arch/x86/events/intel/core.c | 43 +++++++++++++++++++++++++++++++++----------
 arch/x86/events/perf_event.h |  5 +++++
 3 files changed, 45 insertions(+), 11 deletions(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index e37de29..3d6fdcf 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -253,6 +253,8 @@ static bool check_hw_exists(void)
 		if (ret)
 			goto msr_fail;
 		for (i = 0; i < x86_pmu.num_counters_fixed; i++) {
+			if (fixed_counter_disabled(i))
+				continue;
 			if (val & (0x03 << i*4)) {
 				bios_fail = 1;
 				val_fail = val;
@@ -1523,6 +1525,8 @@ void perf_event_print_debug(void)
 			cpu, idx, prev_left);
 	}
 	for (idx = 0; idx < x86_pmu.num_counters_fixed; idx++) {
+		if (fixed_counter_disabled(idx))
+			continue;
 		rdmsrl(MSR_ARCH_PERFMON_FIXED_CTR0 + idx, pmc_count);
 
 		pr_info("CPU#%d: fixed-PMC%d count: %016llx\n",
@@ -1995,7 +1999,9 @@ static int __init init_hw_perf_events(void)
 	pr_info("... generic registers:      %d\n",     x86_pmu.num_counters);
 	pr_info("... value mask:             %016Lx\n", x86_pmu.cntval_mask);
 	pr_info("... max period:             %016Lx\n", x86_pmu.max_period);
-	pr_info("... fixed-purpose events:   %d\n",     x86_pmu.num_counters_fixed);
+	pr_info("... fixed-purpose events:   %lu\n",
+			hweight64((((1ULL << x86_pmu.num_counters_fixed) - 1)
+					<< INTEL_PMC_IDX_FIXED) & x86_pmu.intel_ctrl));
 	pr_info("... event mask:             %016Lx\n", x86_pmu.intel_ctrl);
 
 	if (!x86_pmu.read)
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index a54d4a9..21267dc 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -2750,8 +2750,11 @@ static void intel_pmu_reset(void)
 		wrmsrl_safe(x86_pmu_config_addr(idx), 0ull);
 		wrmsrl_safe(x86_pmu_event_addr(idx),  0ull);
 	}
-	for (idx = 0; idx < x86_pmu.num_counters_fixed; idx++)
+	for (idx = 0; idx < x86_pmu.num_counters_fixed; idx++) {
+		if (fixed_counter_disabled(idx))
+			continue;
 		wrmsrl_safe(MSR_ARCH_PERFMON_FIXED_CTR0 + idx, 0ull);
+	}
 
 	if (ds)
 		ds->bts_index = ds->bts_buffer_base;
@@ -5206,7 +5209,7 @@ __init int intel_pmu_init(void)
 	union cpuid10_eax eax;
 	union cpuid10_ebx ebx;
 	struct event_constraint *c;
-	unsigned int unused;
+	unsigned int fixed_mask;
 	struct extra_reg *er;
 	bool pmem = false;
 	int version, i;
@@ -5228,7 +5231,7 @@ __init int intel_pmu_init(void)
 	 * Check whether the Architectural PerfMon supports
 	 * Branch Misses Retired hw_event or not.
 	 */
-	cpuid(10, &eax.full, &ebx.full, &unused, &edx.full);
+	cpuid(10, &eax.full, &ebx.full, &fixed_mask, &edx.full);
 	if (eax.split.mask_length < ARCH_PERFMON_EVENTS_COUNT)
 		return -ENODEV;
 
@@ -5255,8 +5258,16 @@ __init int intel_pmu_init(void)
 	if (version > 1) {
 		int assume = 3 * !boot_cpu_has(X86_FEATURE_HYPERVISOR);
 
-		x86_pmu.num_counters_fixed =
-			max((int)edx.split.num_counters_fixed, assume);
+		if (!fixed_mask) {
+			x86_pmu.num_counters_fixed =
+				max((int)edx.split.num_counters_fixed, assume);
+		} else {
+			/*
+			 * The fixed-purpose counters are enumerated in the ECX
+			 * since V5 perfmon.
+			 */
+			x86_pmu.num_counters_fixed = fls(fixed_mask);
+		}
 	}
 
 	if (version >= 4)
@@ -5847,8 +5858,11 @@ __init int intel_pmu_init(void)
 		x86_pmu.num_counters_fixed = INTEL_PMC_MAX_FIXED;
 	}
 
-	x86_pmu.intel_ctrl |=
-		((1LL << x86_pmu.num_counters_fixed)-1) << INTEL_PMC_IDX_FIXED;
+	if (!fixed_mask) {
+		x86_pmu.intel_ctrl |=
+			((1LL << x86_pmu.num_counters_fixed)-1) << INTEL_PMC_IDX_FIXED;
+	} else
+		x86_pmu.intel_ctrl |= (u64)fixed_mask << INTEL_PMC_IDX_FIXED;
 
 	/* AnyThread may be deprecated on arch perfmon v5 or later */
 	if (x86_pmu.intel_cap.anythread_deprecated)
@@ -5865,13 +5879,22 @@ __init int intel_pmu_init(void)
 			 * events to the generic counters.
 			 */
 			if (c->idxmsk64 & INTEL_PMC_MSK_TOPDOWN) {
+				/*
+				 * Disable topdown slots and metrics events,
+				 * if slots event is not in CPUID.
+				 */
+				if (!(INTEL_PMC_MSK_FIXED_SLOTS & x86_pmu.intel_ctrl))
+					c->idxmsk64 = 0;
 				c->weight = hweight64(c->idxmsk64);
 				continue;
 			}
 
-			if (c->cmask == FIXED_EVENT_FLAGS
-			    && c->idxmsk64 != INTEL_PMC_MSK_FIXED_REF_CYCLES) {
-				c->idxmsk64 |= (1ULL << x86_pmu.num_counters) - 1;
+			if (c->cmask == FIXED_EVENT_FLAGS) {
+				/* Disable fixed counters which are not in CPUID */
+				c->idxmsk64 &= x86_pmu.intel_ctrl;
+
+				if (c->idxmsk64 != INTEL_PMC_MSK_FIXED_REF_CYCLES)
+					c->idxmsk64 |= (1ULL << x86_pmu.num_counters) - 1;
 			}
 			c->idxmsk64 &=
 				~(~0ULL << (INTEL_PMC_IDX_FIXED + x86_pmu.num_counters_fixed));
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 0ae4e50..ffa598c 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -1068,6 +1068,11 @@ ssize_t events_sysfs_show(struct device *dev, struct device_attribute *attr,
 ssize_t events_ht_sysfs_show(struct device *dev, struct device_attribute *attr,
 			  char *page);
 
+static inline bool fixed_counter_disabled(int i)
+{
+	return !(x86_pmu.intel_ctrl >> (i + INTEL_PMC_IDX_FIXED));
+}
+
 #ifdef CONFIG_CPU_SUP_AMD
 
 int amd_pmu_init(void);
-- 
2.7.4



* [PATCH 05/12] tools headers uapi: Update tools's copy of linux/perf_event.h
From: kan.liang @ 2021-01-19 20:38 UTC
  To: peterz, acme, mingo, linux-kernel
  Cc: eranian, namhyung, jolsa, ak, yao.jin, Kan Liang

From: Kan Liang <kan.liang@linux.intel.com>

To get the changes in:

    ("perf/core: Add PERF_SAMPLE_WEIGHT_EXT")

This cures the following warning during perf's build:

        Warning: Kernel ABI header at 'tools/include/uapi/linux/perf_event.h'
        differs from latest version at 'include/uapi/linux/perf_event.h'
        diff -u tools/include/uapi/linux/perf_event.h include/uapi/linux/perf_event.h

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
 tools/include/uapi/linux/perf_event.h | 30 +++++++++++++++++++++++++++---
 1 file changed, 27 insertions(+), 3 deletions(-)

diff --git a/tools/include/uapi/linux/perf_event.h b/tools/include/uapi/linux/perf_event.h
index b15e344..17f19cc 100644
--- a/tools/include/uapi/linux/perf_event.h
+++ b/tools/include/uapi/linux/perf_event.h
@@ -145,8 +145,9 @@ enum perf_event_sample_format {
 	PERF_SAMPLE_CGROUP			= 1U << 21,
 	PERF_SAMPLE_DATA_PAGE_SIZE		= 1U << 22,
 	PERF_SAMPLE_CODE_PAGE_SIZE		= 1U << 23,
+	PERF_SAMPLE_WEIGHT_EXT			= 1U << 24,
 
-	PERF_SAMPLE_MAX = 1U << 24,		/* non-ABI */
+	PERF_SAMPLE_MAX = 1U << 25,		/* non-ABI */
 
 	__PERF_SAMPLE_CALLCHAIN_EARLY		= 1ULL << 63, /* non-ABI; internal use */
 };
@@ -900,6 +901,13 @@ enum perf_event_type {
 	 *	  char			data[size]; } && PERF_SAMPLE_AUX
 	 *	{ u64			data_page_size;} && PERF_SAMPLE_DATA_PAGE_SIZE
 	 *	{ u64			code_page_size;} && PERF_SAMPLE_CODE_PAGE_SIZE
+	 *	{ union {
+	 *		u64		weight_ext;
+	 *		struct {
+	 *			u64	instr_latency:16,
+	 *				reserved:48;
+	 *		};
+	 *	} && PERF_SAMPLE_WEIGHT_EXT
 	 * };
 	 */
 	PERF_RECORD_SAMPLE			= 9,
@@ -1127,14 +1135,16 @@ union perf_mem_data_src {
 			mem_lvl_num:4,	/* memory hierarchy level number */
 			mem_remote:1,   /* remote */
 			mem_snoopx:2,	/* snoop mode, ext */
-			mem_rsvd:24;
+			mem_blk:3,	/* access blocked */
+			mem_rsvd:21;
 	};
 };
 #elif defined(__BIG_ENDIAN_BITFIELD)
 union perf_mem_data_src {
 	__u64 val;
 	struct {
-		__u64	mem_rsvd:24,
+		__u64	mem_rsvd:21,
+			mem_blk:3,	/* access blocked */
 			mem_snoopx:2,	/* snoop mode, ext */
 			mem_remote:1,   /* remote */
 			mem_lvl_num:4,	/* memory hierarchy level number */
@@ -1217,6 +1227,12 @@ union perf_mem_data_src {
 #define PERF_MEM_TLB_OS		0x40 /* OS fault handler */
 #define PERF_MEM_TLB_SHIFT	26
 
+/* Access blocked */
+#define PERF_MEM_BLK_NA		0x01 /* not available */
+#define PERF_MEM_BLK_DATA	0x02 /* data could not be forwarded */
+#define PERF_MEM_BLK_ADDR	0x04 /* address conflict */
+#define PERF_MEM_BLK_SHIFT	40
+
 #define PERF_MEM_S(a, s) \
 	(((__u64)PERF_MEM_##a##_##s) << PERF_MEM_##a##_SHIFT)
 
@@ -1248,4 +1264,12 @@ struct perf_branch_entry {
 		reserved:40;
 };
 
+union perf_weight_ext {
+	__u64		val;
+	struct {
+		__u64	instr_latency:16,
+			reserved:48;
+	};
+};
+
 #endif /* _UAPI_LINUX_PERF_EVENT_H */
-- 
2.7.4



* [PATCH 06/12] perf tools: Support data block and addr block
From: kan.liang @ 2021-01-19 20:38 UTC
  To: peterz, acme, mingo, linux-kernel
  Cc: eranian, namhyung, jolsa, ak, yao.jin, Kan Liang

From: Kan Liang <kan.liang@linux.intel.com>

Two new data source fields, which indicate the block reasons of a load
instruction, are introduced on the Intel Sapphire Rapids server. The
fields can be used for memory profiling.

Add a new sort function, SORT_MEM_BLOCKED, for the two fields.

For previous platforms, or when the block reason is unknown, print "N/A"
for the block reason.

Add blocked as a default mem sort key for perf report and
perf mem report.

An auxiliary event has to be enabled simultaneously with the load
latency event to retrieve the complete Memory Info. Add an x86-specific
perf_mem_events__name() to handle the auxiliary event (see the usage
sketch after this list).
- Users are only interested in the samples of the mem-loads event, so
  sample read is used for the auxiliary event.
- The auxiliary event must be in front of the load latency event in a
  group. Assume the second event is the one to sample if the auxiliary
  event is the leader.
- Add a weak is_mem_loads_aux_event() to check for the auxiliary event
  on x86. For other architectures, it always returns false.
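
For illustration, with the above in place the usual workflow keeps
working; ldlat=30 below is assumed to be the tool's default threshold:

  perf mem record -- <workload>

which, on Sapphire Rapids, expands the load event to roughly:

  {cpu/mem-loads-aux/,cpu/mem-loads,ldlat=30/}:SP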

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
 tools/perf/Documentation/perf-report.txt  |  5 ++--
 tools/perf/arch/x86/util/Build            |  1 +
 tools/perf/arch/x86/util/mem-events.c     | 44 +++++++++++++++++++++++++++++++
 tools/perf/builtin-mem.c                  |  2 +-
 tools/perf/util/evsel.c                   |  2 ++
 tools/perf/util/hist.c                    |  1 +
 tools/perf/util/hist.h                    |  1 +
 tools/perf/util/mem-events.c              | 30 +++++++++++++++++++++
 tools/perf/util/mem-events.h              |  3 +++
 tools/perf/util/perf_event_attr_fprintf.c |  2 +-
 tools/perf/util/record.c                  |  4 ++-
 tools/perf/util/sort.c                    | 38 +++++++++++++++++++++++++-
 tools/perf/util/sort.h                    |  1 +
 13 files changed, 128 insertions(+), 6 deletions(-)
 create mode 100644 tools/perf/arch/x86/util/mem-events.c

diff --git a/tools/perf/Documentation/perf-report.txt b/tools/perf/Documentation/perf-report.txt
index 8f7f4e9..826b5a9 100644
--- a/tools/perf/Documentation/perf-report.txt
+++ b/tools/perf/Documentation/perf-report.txt
@@ -139,7 +139,7 @@ OPTIONS
 
 	If the --mem-mode option is used, the following sort keys are also available
 	(incompatible with --branch-stack):
-	symbol_daddr, dso_daddr, locked, tlb, mem, snoop, dcacheline.
+	symbol_daddr, dso_daddr, locked, tlb, mem, snoop, dcacheline, blocked.
 
 	- symbol_daddr: name of data symbol being executed on at the time of sample
 	- dso_daddr: name of library or module containing the data being executed
@@ -151,9 +151,10 @@ OPTIONS
 	- dcacheline: the cacheline the data address is on at the time of the sample
 	- phys_daddr: physical address of data being executed on at the time of sample
 	- data_page_size: the data page size of data being executed on at the time of sample
+	- blocked: reason of blocked load access for the data at the time of the sample
 
 	And the default sort keys are changed to local_weight, mem, sym, dso,
-	symbol_daddr, dso_daddr, snoop, tlb, locked, see '--mem-mode'.
+	symbol_daddr, dso_daddr, snoop, tlb, locked, blocked, see '--mem-mode'.
 
 	If the data file has tracepoint event(s), following (dynamic) sort keys
 	are also available:
diff --git a/tools/perf/arch/x86/util/Build b/tools/perf/arch/x86/util/Build
index 347c39b..d73f548 100644
--- a/tools/perf/arch/x86/util/Build
+++ b/tools/perf/arch/x86/util/Build
@@ -6,6 +6,7 @@ perf-y += perf_regs.o
 perf-y += topdown.o
 perf-y += machine.o
 perf-y += event.o
+perf-y += mem-events.o
 
 perf-$(CONFIG_DWARF) += dwarf-regs.o
 perf-$(CONFIG_BPF_PROLOGUE) += dwarf-regs.o
diff --git a/tools/perf/arch/x86/util/mem-events.c b/tools/perf/arch/x86/util/mem-events.c
new file mode 100644
index 0000000..c21dce3
--- /dev/null
+++ b/tools/perf/arch/x86/util/mem-events.c
@@ -0,0 +1,44 @@
+// SPDX-License-Identifier: GPL-2.0
+#include "util/pmu.h"
+#include "map_symbol.h"
+#include "mem-events.h"
+
+static char mem_loads_name[100];
+static bool mem_loads_name__init;
+
+#define MEM_LOADS_AUX		0x0203
+#define MEM_LOADS_AUX_NAME	"{cpu/mem-loads-aux/,cpu/mem-loads,ldlat=%u/}:SP"
+
+bool is_mem_loads_aux_event(struct evsel *leader)
+{
+	if (!pmu_have_event("cpu", "mem-loads-aux"))
+		return false;
+
+	return leader->core.attr.config == MEM_LOADS_AUX;
+}
+
+char *perf_mem_events__name(int i)
+{
+	struct perf_mem_event *e = perf_mem_events__ptr(i);
+
+	if (!e)
+		return NULL;
+
+	if (i == PERF_MEM_EVENTS__LOAD) {
+		if (mem_loads_name__init)
+			return mem_loads_name;
+
+		mem_loads_name__init = true;
+
+		if (pmu_have_event("cpu", "mem-loads-aux")) {
+			scnprintf(mem_loads_name, sizeof(mem_loads_name),
+				  MEM_LOADS_AUX_NAME, perf_mem_events__loads_ldlat);
+		} else {
+			scnprintf(mem_loads_name, sizeof(mem_loads_name),
+				  e->name, perf_mem_events__loads_ldlat);
+		}
+		return mem_loads_name;
+	}
+
+	return (char *)e->name;
+}
diff --git a/tools/perf/builtin-mem.c b/tools/perf/builtin-mem.c
index 8237420..e5778aa 100644
--- a/tools/perf/builtin-mem.c
+++ b/tools/perf/builtin-mem.c
@@ -312,7 +312,7 @@ static char *get_sort_order(struct perf_mem *mem)
 			     "dso_daddr,tlb,locked");
 	} else if (has_extra_options) {
 		strcpy(sort, "--sort=local_weight,mem,sym,dso,symbol_daddr,"
-			     "dso_daddr,snoop,tlb,locked");
+			     "dso_daddr,snoop,tlb,locked,blocked");
 	} else
 		return NULL;
 
diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index c26ea822..97acde2 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -2689,6 +2689,8 @@ int evsel__open_strerror(struct evsel *evsel, struct target *target,
 		if (perf_missing_features.aux_output)
 			return scnprintf(msg, size, "The 'aux_output' feature is not supported, update the kernel.");
 		break;
+	case ENODATA:
+		return scnprintf(msg, size, "Cannot collect data source with the load latency event alone. Please add an auxiliary event in front of the load latency event.");
 	default:
 		break;
 	}
diff --git a/tools/perf/util/hist.c b/tools/perf/util/hist.c
index a08fb9e..7cb4cbe 100644
--- a/tools/perf/util/hist.c
+++ b/tools/perf/util/hist.c
@@ -208,6 +208,7 @@ void hists__calc_col_len(struct hists *hists, struct hist_entry *h)
 	hists__new_col_len(hists, HISTC_MEM_LVL, 21 + 3);
 	hists__new_col_len(hists, HISTC_LOCAL_WEIGHT, 12);
 	hists__new_col_len(hists, HISTC_GLOBAL_WEIGHT, 12);
+	hists__new_col_len(hists, HISTC_MEM_BLOCKED, 7);
 	if (symbol_conf.nanosecs)
 		hists__new_col_len(hists, HISTC_TIME, 16);
 	else
diff --git a/tools/perf/util/hist.h b/tools/perf/util/hist.h
index 14f6633..522486b 100644
--- a/tools/perf/util/hist.h
+++ b/tools/perf/util/hist.h
@@ -71,6 +71,7 @@ enum hist_column {
 	HISTC_SYM_SIZE,
 	HISTC_DSO_SIZE,
 	HISTC_SYMBOL_IPC,
+	HISTC_MEM_BLOCKED,
 	HISTC_NR_COLS, /* Last entry */
 };
 
diff --git a/tools/perf/util/mem-events.c b/tools/perf/util/mem-events.c
index 19007e4..890f638 100644
--- a/tools/perf/util/mem-events.c
+++ b/tools/perf/util/mem-events.c
@@ -56,6 +56,11 @@ char * __weak perf_mem_events__name(int i)
 	return (char *)e->name;
 }
 
+__weak bool is_mem_loads_aux_event(struct evsel *leader __maybe_unused)
+{
+	return false;
+}
+
 int perf_mem_events__parse(const char *str)
 {
 	char *tok, *saveptr = NULL;
@@ -332,6 +337,29 @@ int perf_mem__lck_scnprintf(char *out, size_t sz, struct mem_info *mem_info)
 	return l;
 }
 
+int perf_mem__blk_scnprintf(char *out, size_t sz, struct mem_info *mem_info)
+{
+	size_t l = 0;
+	u64 mask = PERF_MEM_BLK_NA;
+
+	sz -= 1; /* -1 for null termination */
+	out[0] = '\0';
+
+	if (mem_info)
+		mask = mem_info->data_src.mem_blk;
+
+	if (!mask || (mask & PERF_MEM_BLK_NA)) {
+		l += scnprintf(out + l, sz - l, " N/A");
+		return l;
+	}
+	if (mask & PERF_MEM_BLK_DATA)
+		l += scnprintf(out + l, sz - l, " Data");
+	if (mask & PERF_MEM_BLK_ADDR)
+		l += scnprintf(out + l, sz - l, " Addr");
+
+	return l;
+}
+
 int perf_script__meminfo_scnprintf(char *out, size_t sz, struct mem_info *mem_info)
 {
 	int i = 0;
@@ -343,6 +371,8 @@ int perf_script__meminfo_scnprintf(char *out, size_t sz, struct mem_info *mem_in
 	i += perf_mem__tlb_scnprintf(out + i, sz - i, mem_info);
 	i += scnprintf(out + i, sz - i, "|LCK ");
 	i += perf_mem__lck_scnprintf(out + i, sz - i, mem_info);
+	i += scnprintf(out + i, sz - i, "|BLK ");
+	i += perf_mem__blk_scnprintf(out + i, sz - i, mem_info);
 
 	return i;
 }
diff --git a/tools/perf/util/mem-events.h b/tools/perf/util/mem-events.h
index 5ef1782..5ddf447 100644
--- a/tools/perf/util/mem-events.h
+++ b/tools/perf/util/mem-events.h
@@ -9,6 +9,7 @@
 #include <linux/refcount.h>
 #include <linux/perf_event.h>
 #include "stat.h"
+#include "evsel.h"
 
 struct perf_mem_event {
 	bool		record;
@@ -39,6 +40,7 @@ int perf_mem_events__init(void);
 
 char *perf_mem_events__name(int i);
 struct perf_mem_event *perf_mem_events__ptr(int i);
+bool is_mem_loads_aux_event(struct evsel *leader);
 
 void perf_mem_events__list(void);
 
@@ -47,6 +49,7 @@ int perf_mem__tlb_scnprintf(char *out, size_t sz, struct mem_info *mem_info);
 int perf_mem__lvl_scnprintf(char *out, size_t sz, struct mem_info *mem_info);
 int perf_mem__snp_scnprintf(char *out, size_t sz, struct mem_info *mem_info);
 int perf_mem__lck_scnprintf(char *out, size_t sz, struct mem_info *mem_info);
+int perf_mem__blk_scnprintf(char *out, size_t sz, struct mem_info *mem_info);
 
 int perf_script__meminfo_scnprintf(char *bf, size_t size, struct mem_info *mem_info);
 
diff --git a/tools/perf/util/perf_event_attr_fprintf.c b/tools/perf/util/perf_event_attr_fprintf.c
index fb0bb66..b393acd 100644
--- a/tools/perf/util/perf_event_attr_fprintf.c
+++ b/tools/perf/util/perf_event_attr_fprintf.c
@@ -35,7 +35,7 @@ static void __p_sample_type(char *buf, size_t size, u64 value)
 		bit_name(BRANCH_STACK), bit_name(REGS_USER), bit_name(STACK_USER),
 		bit_name(IDENTIFIER), bit_name(REGS_INTR), bit_name(DATA_SRC),
 		bit_name(WEIGHT), bit_name(PHYS_ADDR), bit_name(AUX),
-		bit_name(CGROUP), bit_name(DATA_PAGE_SIZE),
+		bit_name(CGROUP), bit_name(DATA_PAGE_SIZE), bit_name(WEIGHT_EXT),
 		{ .name = NULL, }
 	};
 #undef bit_name
diff --git a/tools/perf/util/record.c b/tools/perf/util/record.c
index e70c9dd..9f8c30a 100644
--- a/tools/perf/util/record.c
+++ b/tools/perf/util/record.c
@@ -15,6 +15,8 @@
 #include "record.h"
 #include "../perf-sys.h"
 #include "topdown.h"
+#include "map_symbol.h"
+#include "mem-events.h"
 
 /*
  * evsel__config_leader_sampling() uses special rules for leader sampling.
@@ -25,7 +27,7 @@ static struct evsel *evsel__read_sampler(struct evsel *evsel, struct evlist *evl
 {
 	struct evsel *leader = evsel->leader;
 
-	if (evsel__is_aux_event(leader) || arch_topdown_sample_read(leader)) {
+	if (evsel__is_aux_event(leader) || arch_topdown_sample_read(leader) || is_mem_loads_aux_event(leader)) {
 		evlist__for_each_entry(evlist, evsel) {
 			if (evsel->leader == leader && evsel != evsel->leader)
 				return evsel;
diff --git a/tools/perf/util/sort.c b/tools/perf/util/sort.c
index 80907bc..af7f893 100644
--- a/tools/perf/util/sort.c
+++ b/tools/perf/util/sort.c
@@ -36,7 +36,7 @@ const char	default_parent_pattern[] = "^sys_|^do_page_fault";
 const char	*parent_pattern = default_parent_pattern;
 const char	*default_sort_order = "comm,dso,symbol";
 const char	default_branch_sort_order[] = "comm,dso_from,symbol_from,symbol_to,cycles";
-const char	default_mem_sort_order[] = "local_weight,mem,sym,dso,symbol_daddr,dso_daddr,snoop,tlb,locked";
+const char	default_mem_sort_order[] = "local_weight,mem,sym,dso,symbol_daddr,dso_daddr,snoop,tlb,locked,blocked";
 const char	default_top_sort_order[] = "dso,symbol";
 const char	default_diff_sort_order[] = "dso,symbol";
 const char	default_tracepoint_sort_order[] = "trace";
@@ -1422,6 +1422,41 @@ struct sort_entry sort_mem_dcacheline = {
 };
 
 static int64_t
+sort__blocked_cmp(struct hist_entry *left, struct hist_entry *right)
+{
+	union perf_mem_data_src data_src_l;
+	union perf_mem_data_src data_src_r;
+
+	if (left->mem_info)
+		data_src_l = left->mem_info->data_src;
+	else
+		data_src_l.mem_blk = PERF_MEM_BLK_NA;
+
+	if (right->mem_info)
+		data_src_r = right->mem_info->data_src;
+	else
+		data_src_r.mem_blk = PERF_MEM_BLK_NA;
+
+	return (int64_t)(data_src_r.mem_blk - data_src_l.mem_blk);
+}
+
+static int hist_entry__blocked_snprintf(struct hist_entry *he, char *bf,
+					size_t size, unsigned int width)
+{
+	char out[16];
+
+	perf_mem__blk_scnprintf(out, sizeof(out), he->mem_info);
+	return repsep_snprintf(bf, size, "%.*s", width, out);
+}
+
+struct sort_entry sort_mem_blocked = {
+	.se_header	= "Blocked",
+	.se_cmp		= sort__blocked_cmp,
+	.se_snprintf	= hist_entry__blocked_snprintf,
+	.se_width_idx	= HISTC_MEM_BLOCKED,
+};
+
+static int64_t
 sort__phys_daddr_cmp(struct hist_entry *left, struct hist_entry *right)
 {
 	uint64_t l = 0, r = 0;
@@ -1770,6 +1805,7 @@ static struct sort_dimension memory_sort_dimensions[] = {
 	DIM(SORT_MEM_DCACHELINE, "dcacheline", sort_mem_dcacheline),
 	DIM(SORT_MEM_PHYS_DADDR, "phys_daddr", sort_mem_phys_daddr),
 	DIM(SORT_MEM_DATA_PAGE_SIZE, "data_page_size", sort_mem_data_page_size),
+	DIM(SORT_MEM_BLOCKED, "blocked", sort_mem_blocked),
 };
 
 #undef DIM
diff --git a/tools/perf/util/sort.h b/tools/perf/util/sort.h
index e50f2b6..2b2645b 100644
--- a/tools/perf/util/sort.h
+++ b/tools/perf/util/sort.h
@@ -256,6 +256,7 @@ enum sort_type {
 	SORT_MEM_IADDR_SYMBOL,
 	SORT_MEM_PHYS_DADDR,
 	SORT_MEM_DATA_PAGE_SIZE,
+	SORT_MEM_BLOCKED,
 };
 
 /*
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 07/12] perf c2c: Support data block and addr block
  2021-01-19 20:38 [PATCH 00/12] perf core PMU support for Sapphire Rapids kan.liang
                   ` (5 preceding siblings ...)
  2021-01-19 20:38 ` [PATCH 06/12] perf tools: Support data block and addr block kan.liang
@ 2021-01-19 20:38 ` kan.liang
  2021-01-19 20:38 ` [PATCH 08/12] perf tools: Support PERF_SAMPLE_WEIGHT_EXT kan.liang
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 26+ messages in thread
From: kan.liang @ 2021-01-19 20:38 UTC (permalink / raw)
  To: peterz, acme, mingo, linux-kernel
  Cc: eranian, namhyung, jolsa, ak, yao.jin, Kan Liang

From: Kan Liang <kan.liang@linux.intel.com>

perf c2c is also a memory profiling tool. Apply the two new data source
fields to perf c2c as well.

Extend perf c2c to display the number of loads that were blocked by a
data or address conflict.
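
With the patch applied, the two counts show up in the existing
statistics tables. A hypothetical session ('./workload' is
illustrative):

  $ perf c2c record -- ./workload
  $ perf c2c report --stats

The new "Load access blocked by data/address" lines are then printed
alongside the existing "Locked Access" statistics.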

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
 tools/perf/builtin-c2c.c     | 3 +++
 tools/perf/util/mem-events.c | 6 ++++++
 tools/perf/util/mem-events.h | 2 ++
 3 files changed, 11 insertions(+)

diff --git a/tools/perf/builtin-c2c.c b/tools/perf/builtin-c2c.c
index c5babea..4de49ae 100644
--- a/tools/perf/builtin-c2c.c
+++ b/tools/perf/builtin-c2c.c
@@ -2111,6 +2111,8 @@ static void print_c2c__display_stats(FILE *out)
 	fprintf(out, "  Load MESI State Exclusive         : %10d\n", stats->ld_excl);
 	fprintf(out, "  Load MESI State Shared            : %10d\n", stats->ld_shared);
 	fprintf(out, "  Load LLC Misses                   : %10d\n", llc_misses);
+	fprintf(out, "  Load access blocked by data       : %10d\n", stats->blk_data);
+	fprintf(out, "  Load access blocked by address    : %10d\n", stats->blk_addr);
 	fprintf(out, "  LLC Misses to Local DRAM          : %10.1f%%\n", ((double)stats->lcl_dram/(double)llc_misses) * 100.);
 	fprintf(out, "  LLC Misses to Remote DRAM         : %10.1f%%\n", ((double)stats->rmt_dram/(double)llc_misses) * 100.);
 	fprintf(out, "  LLC Misses to Remote cache (HIT)  : %10.1f%%\n", ((double)stats->rmt_hit /(double)llc_misses) * 100.);
@@ -2139,6 +2141,7 @@ static void print_shared_cacheline_info(FILE *out)
 	fprintf(out, "  L2D hits on shared lines          : %10d\n", stats->ld_l2hit);
 	fprintf(out, "  LLC hits on shared lines          : %10d\n", stats->ld_llchit + stats->lcl_hitm);
 	fprintf(out, "  Locked Access on shared lines     : %10d\n", stats->locks);
+	fprintf(out, "  Blocked Access on shared lines    : %10d\n", stats->blk_data + stats->blk_addr);
 	fprintf(out, "  Store HITs on shared lines        : %10d\n", stats->store);
 	fprintf(out, "  Store L1D hits on shared lines    : %10d\n", stats->st_l1hit);
 	fprintf(out, "  Total Merged records              : %10d\n", hitm_cnt + stats->store);
diff --git a/tools/perf/util/mem-events.c b/tools/perf/util/mem-events.c
index 890f638..f93a852 100644
--- a/tools/perf/util/mem-events.c
+++ b/tools/perf/util/mem-events.c
@@ -385,6 +385,7 @@ int c2c_decode_stats(struct c2c_stats *stats, struct mem_info *mi)
 	u64 lvl    = data_src->mem_lvl;
 	u64 snoop  = data_src->mem_snoop;
 	u64 lock   = data_src->mem_lock;
+	u64 blk    = data_src->mem_blk;
 	/*
 	 * Skylake might report unknown remote level via this
 	 * bit, consider it when evaluating remote HITMs.
@@ -404,6 +405,9 @@ do {				\
 
 	if (lock & P(LOCK, LOCKED)) stats->locks++;
 
+	if (blk & P(BLK, DATA)) stats->blk_data++;
+	if (blk & P(BLK, ADDR)) stats->blk_addr++;
+
 	if (op & P(OP, LOAD)) {
 		/* load */
 		stats->load++;
@@ -515,6 +519,8 @@ void c2c_add_stats(struct c2c_stats *stats, struct c2c_stats *add)
 	stats->rmt_hit		+= add->rmt_hit;
 	stats->lcl_dram		+= add->lcl_dram;
 	stats->rmt_dram		+= add->rmt_dram;
+	stats->blk_data		+= add->blk_data;
+	stats->blk_addr		+= add->blk_addr;
 	stats->nomap		+= add->nomap;
 	stats->noparse		+= add->noparse;
 }
diff --git a/tools/perf/util/mem-events.h b/tools/perf/util/mem-events.h
index 5ddf447..755cef7 100644
--- a/tools/perf/util/mem-events.h
+++ b/tools/perf/util/mem-events.h
@@ -79,6 +79,8 @@ struct c2c_stats {
 	u32	rmt_hit;             /* count of loads with remote hit clean; */
 	u32	lcl_dram;            /* count of loads miss to local DRAM */
 	u32	rmt_dram;            /* count of loads miss to remote DRAM */
+	u32	blk_data;            /* count of loads blocked by data */
+	u32	blk_addr;            /* count of loads blocked by address conflict */
 	u32	nomap;               /* count of load/stores with no phys adrs */
 	u32	noparse;             /* count of unparsable data sources */
 };
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 08/12] perf tools: Support PERF_SAMPLE_WEIGHT_EXT
  2021-01-19 20:38 [PATCH 00/12] perf core PMU support for Sapphire Rapids kan.liang
                   ` (6 preceding siblings ...)
  2021-01-19 20:38 ` [PATCH 07/12] perf c2c: " kan.liang
@ 2021-01-19 20:38 ` kan.liang
  2021-01-19 20:38 ` [PATCH 09/12] perf report: Support instruction latency kan.liang
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 26+ messages in thread
From: kan.liang @ 2021-01-19 20:38 UTC (permalink / raw)
  To: peterz, acme, mingo, linux-kernel
  Cc: eranian, namhyung, jolsa, ak, yao.jin, Kan Liang

From: Kan Liang <kan.liang@linux.intel.com>

The new sample type, PERF_SAMPLE_WEIGHT_EXT, is an extension of the
PERF_SAMPLE_WEIGHT sample type. Enable it whenever the sample-by-weight
option is applied.

Add a weight_ext field to struct perf_sample to record the value of the
new sample type. On older kernels which don't support the new sample
type, clear the sample bit and fall back to PERF_SAMPLE_WEIGHT alone.
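
For reference, the extension is a union (defined in patch 01), so a
consumer can read either the raw 64-bit value or the decoded field.
A minimal sketch of the user-space view:

	struct perf_sample sample;

	/* after evsel__parse_sample() has filled the sample: */
	__u64 raw  = sample.weight_ext.val;            /* whole extension */
	__u16 ilat = sample.weight_ext.instr_latency;  /* low 16 bits */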

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
 tools/perf/util/event.h            |  1 +
 tools/perf/util/evsel.c            | 22 +++++++++++++++++++---
 tools/perf/util/evsel.h            |  1 +
 tools/perf/util/synthetic-events.c |  8 ++++++++
 4 files changed, 29 insertions(+), 3 deletions(-)

diff --git a/tools/perf/util/event.h b/tools/perf/util/event.h
index ff403ea..0852c86 100644
--- a/tools/perf/util/event.h
+++ b/tools/perf/util/event.h
@@ -128,6 +128,7 @@ struct perf_sample {
 	u64 stream_id;
 	u64 period;
 	u64 weight;
+	union perf_weight_ext weight_ext;
 	u64 transaction;
 	u64 insn_cnt;
 	u64 cyc_cnt;
diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index 97acde2..bb05687 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -1165,8 +1165,10 @@ void evsel__config(struct evsel *evsel, struct record_opts *opts,
 		attr->branch_sample_type = opts->branch_stack;
 	}
 
-	if (opts->sample_weight)
+	if (opts->sample_weight) {
 		evsel__set_sample_bit(evsel, WEIGHT);
+		evsel__set_sample_bit(evsel, WEIGHT_EXT);
+	}
 
 	attr->task  = track;
 	attr->mmap  = track;
@@ -1735,6 +1737,8 @@ static int evsel__open_cpu(struct evsel *evsel, struct perf_cpu_map *cpus,
 	}
 
 fallback_missing_features:
+	if (perf_missing_features.weight_ext)
+		evsel__reset_sample_bit(evsel, WEIGHT_EXT);
 	if (perf_missing_features.clockid_wrong)
 		evsel->core.attr.clockid = CLOCK_MONOTONIC; /* should always work */
 	if (perf_missing_features.clockid) {
@@ -1873,8 +1877,13 @@ static int evsel__open_cpu(struct evsel *evsel, struct perf_cpu_map *cpus,
 	 * Must probe features in the order they were added to the
 	 * perf_event_attr interface.
 	 */
-        if (!perf_missing_features.data_page_size &&
-	    (evsel->core.attr.sample_type & PERF_SAMPLE_DATA_PAGE_SIZE)) {
+	if (!perf_missing_features.weight_ext &&
+	    (evsel->core.attr.sample_type & PERF_SAMPLE_WEIGHT_EXT)) {
+		perf_missing_features.weight_ext = true;
+		pr_debug2("switching off weight extension support\n");
+		goto fallback_missing_features;
+	} else if (!perf_missing_features.data_page_size &&
+		   (evsel->core.attr.sample_type & PERF_SAMPLE_DATA_PAGE_SIZE)) {
 		perf_missing_features.data_page_size = true;
 		pr_debug2_peo("Kernel has no PERF_SAMPLE_DATA_PAGE_SIZE support, bailing out\n");
 		goto out_close;
@@ -2382,6 +2391,13 @@ int evsel__parse_sample(struct evsel *evsel, union perf_event *event,
 		array = (void *)array + sz;
 	}
 
+	data->weight_ext.val = 0;
+	if (type & PERF_SAMPLE_WEIGHT_EXT) {
+		OVERFLOW_CHECK_u64(array);
+		data->weight_ext.val = *array;
+		array++;
+	}
+
 	return 0;
 }
 
diff --git a/tools/perf/util/evsel.h b/tools/perf/util/evsel.h
index cd1d8dd..ec598a6 100644
--- a/tools/perf/util/evsel.h
+++ b/tools/perf/util/evsel.h
@@ -145,6 +145,7 @@ struct perf_missing_features {
 	bool branch_hw_idx;
 	bool cgroup;
 	bool data_page_size;
+	bool weight_ext;
 };
 
 extern struct perf_missing_features perf_missing_features;
diff --git a/tools/perf/util/synthetic-events.c b/tools/perf/util/synthetic-events.c
index 2947e3f..69291a9 100644
--- a/tools/perf/util/synthetic-events.c
+++ b/tools/perf/util/synthetic-events.c
@@ -1417,6 +1417,9 @@ size_t perf_event__sample_event_size(const struct perf_sample *sample, u64 type,
 		result += sample->aux_sample.size;
 	}
 
+	if (type & PERF_SAMPLE_WEIGHT_EXT)
+		result += sizeof(u64);
+
 	return result;
 }
 
@@ -1603,6 +1606,11 @@ int perf_event__synthesize_sample(union perf_event *event, u64 type, u64 read_fo
 		array = (void *)array + sz;
 	}
 
+	if (type & PERF_SAMPLE_WEIGHT_EXT) {
+		*array = sample->weight_ext.val;
+		array++;
+	}
+
 	return 0;
 }
 
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 09/12] perf report: Support instruction latency
  2021-01-19 20:38 [PATCH 00/12] perf core PMU support for Sapphire Rapids kan.liang
                   ` (7 preceding siblings ...)
  2021-01-19 20:38 ` [PATCH 08/12] perf tools: Support PERF_SAMPLE_WEIGHT_EXT kan.liang
@ 2021-01-19 20:38 ` kan.liang
  2021-01-19 20:38 ` [PATCH 10/12] perf test: Support PERF_SAMPLE_WEIGHT_EXT kan.liang
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 26+ messages in thread
From: kan.liang @ 2021-01-19 20:38 UTC (permalink / raw)
  To: peterz, acme, mingo, linux-kernel
  Cc: eranian, namhyung, jolsa, ak, yao.jin, Kan Liang

From: Kan Liang <kan.liang@linux.intel.com>

The instruction latency information can be recorded on some platforms,
e.g., the Intel Sapphire Rapids server. With both the memory latency
(weight) and the new instruction latency information, users can easily
locate expensive load instructions and see how the time is distributed
across the pipeline stages, which tells them where to focus their
optimization effort.

Like the weight support, introduce an ins_lat sort key for the global
instruction latency and a local_ins_lat key for the local (per-sample
average) version. Add the corresponding sort entries, INSTR Latency and
Local INSTR Latency.

Add local_ins_lat to the default_mem_sort_order[].
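
The two keys aggregate differently. Roughly, for a histogram entry
(a sketch mirroring the weight/local_weight pair):

	/* he->stat.ins_lat accumulates the per-sample instruction latency */
	global_ins_lat = he->stat.ins_lat;                       /* sum */
	local_ins_lat  = he->stat.ins_lat / he->stat.nr_events;  /* average */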

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
 tools/perf/Documentation/perf-report.txt |  6 +++-
 tools/perf/util/hist.c                   | 12 ++++++--
 tools/perf/util/hist.h                   |  2 ++
 tools/perf/util/session.c                |  3 ++
 tools/perf/util/sort.c                   | 47 +++++++++++++++++++++++++++++++-
 tools/perf/util/sort.h                   |  3 ++
 6 files changed, 68 insertions(+), 5 deletions(-)

diff --git a/tools/perf/Documentation/perf-report.txt b/tools/perf/Documentation/perf-report.txt
index 826b5a9..0565b7c 100644
--- a/tools/perf/Documentation/perf-report.txt
+++ b/tools/perf/Documentation/perf-report.txt
@@ -108,6 +108,9 @@ OPTIONS
 	- period: Raw number of event count of sample
 	- time: Separate the samples by time stamp with the resolution specified by
 	--time-quantum (default 100ms). Specify with overhead and before it.
+	- ins_lat: Instruction latency in core cycles. This is the global
+	instruction latency
+	- local_ins_lat: Local instruction latency version
 
 	By default, comm, dso and symbol keys are used.
 	(i.e. --sort comm,dso,symbol)
@@ -154,7 +157,8 @@ OPTIONS
 	- blocked: reason of blocked load access for the data at the time of the sample
 
 	And the default sort keys are changed to local_weight, mem, sym, dso,
-	symbol_daddr, dso_daddr, snoop, tlb, locked, blocked, see '--mem-mode'.
+	symbol_daddr, dso_daddr, snoop, tlb, locked, blocked, local_ins_lat,
+	see '--mem-mode'.
 
 	If the data file has tracepoint event(s), following (dynamic) sort keys
 	are also available:
diff --git a/tools/perf/util/hist.c b/tools/perf/util/hist.c
index 7cb4cbe..f93a3eb 100644
--- a/tools/perf/util/hist.c
+++ b/tools/perf/util/hist.c
@@ -209,6 +209,8 @@ void hists__calc_col_len(struct hists *hists, struct hist_entry *h)
 	hists__new_col_len(hists, HISTC_LOCAL_WEIGHT, 12);
 	hists__new_col_len(hists, HISTC_GLOBAL_WEIGHT, 12);
 	hists__new_col_len(hists, HISTC_MEM_BLOCKED, 7);
+	hists__new_col_len(hists, HISTC_LOCAL_INS_LAT, 13);
+	hists__new_col_len(hists, HISTC_GLOBAL_INS_LAT, 13);
 	if (symbol_conf.nanosecs)
 		hists__new_col_len(hists, HISTC_TIME, 16);
 	else
@@ -286,12 +288,13 @@ static long hist_time(unsigned long htime)
 }
 
 static void he_stat__add_period(struct he_stat *he_stat, u64 period,
-				u64 weight)
+				u64 weight, u64 ins_lat)
 {
 
 	he_stat->period		+= period;
 	he_stat->weight		+= weight;
 	he_stat->nr_events	+= 1;
+	he_stat->ins_lat	+= ins_lat;
 }
 
 static void he_stat__add_stat(struct he_stat *dest, struct he_stat *src)
@@ -303,6 +306,7 @@ static void he_stat__add_stat(struct he_stat *dest, struct he_stat *src)
 	dest->period_guest_us	+= src->period_guest_us;
 	dest->nr_events		+= src->nr_events;
 	dest->weight		+= src->weight;
+	dest->ins_lat		+= src->ins_lat;
 }
 
 static void he_stat__decay(struct he_stat *he_stat)
@@ -591,6 +595,7 @@ static struct hist_entry *hists__findnew_entry(struct hists *hists,
 	int64_t cmp;
 	u64 period = entry->stat.period;
 	u64 weight = entry->stat.weight;
+	u64 ins_lat = entry->stat.ins_lat;
 	bool leftmost = true;
 
 	p = &hists->entries_in->rb_root.rb_node;
@@ -609,11 +614,11 @@ static struct hist_entry *hists__findnew_entry(struct hists *hists,
 
 		if (!cmp) {
 			if (sample_self) {
-				he_stat__add_period(&he->stat, period, weight);
+				he_stat__add_period(&he->stat, period, weight, ins_lat);
 				hist_entry__add_callchain_period(he, period);
 			}
 			if (symbol_conf.cumulate_callchain)
-				he_stat__add_period(he->stat_acc, period, weight);
+				he_stat__add_period(he->stat_acc, period, weight, ins_lat);
 
 			/*
 			 * This mem info was allocated from sample__resolve_mem
@@ -723,6 +728,7 @@ __hists__add_entry(struct hists *hists,
 			.nr_events = 1,
 			.period	= sample->period,
 			.weight = sample->weight,
+			.ins_lat = sample->weight_ext.instr_latency,
 		},
 		.parent = sym_parent,
 		.filtered = symbol__parent_filter(sym_parent) | al->filtered,
diff --git a/tools/perf/util/hist.h b/tools/perf/util/hist.h
index 522486b..36bca33 100644
--- a/tools/perf/util/hist.h
+++ b/tools/perf/util/hist.h
@@ -72,6 +72,8 @@ enum hist_column {
 	HISTC_DSO_SIZE,
 	HISTC_SYMBOL_IPC,
 	HISTC_MEM_BLOCKED,
+	HISTC_LOCAL_INS_LAT,
+	HISTC_GLOBAL_INS_LAT,
 	HISTC_NR_COLS, /* Last entry */
 };
 
diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index 25adbcc..df28a2f 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -1300,6 +1300,9 @@ static void dump_sample(struct evsel *evsel, union perf_event *event,
 	if (sample_type & PERF_SAMPLE_WEIGHT)
 		printf("... weight: %" PRIu64 "\n", sample->weight);
 
+	if (sample_type & PERF_SAMPLE_WEIGHT_EXT)
+		printf("... weight_ext: 0x%"PRIx64"\n", (u64)sample->weight_ext.val);
+
 	if (sample_type & PERF_SAMPLE_DATA_SRC)
 	printf("... data_src: 0x%"PRIx64"\n", sample->data_src);
 
diff --git a/tools/perf/util/sort.c b/tools/perf/util/sort.c
index af7f893..cd14955 100644
--- a/tools/perf/util/sort.c
+++ b/tools/perf/util/sort.c
@@ -36,7 +36,7 @@ const char	default_parent_pattern[] = "^sys_|^do_page_fault";
 const char	*parent_pattern = default_parent_pattern;
 const char	*default_sort_order = "comm,dso,symbol";
 const char	default_branch_sort_order[] = "comm,dso_from,symbol_from,symbol_to,cycles";
-const char	default_mem_sort_order[] = "local_weight,mem,sym,dso,symbol_daddr,dso_daddr,snoop,tlb,locked,blocked";
+const char	default_mem_sort_order[] = "local_weight,mem,sym,dso,symbol_daddr,dso_daddr,snoop,tlb,locked,blocked,local_ins_lat";
 const char	default_top_sort_order[] = "dso,symbol";
 const char	default_diff_sort_order[] = "dso,symbol";
 const char	default_tracepoint_sort_order[] = "trace";
@@ -1365,6 +1365,49 @@ struct sort_entry sort_global_weight = {
 	.se_width_idx	= HISTC_GLOBAL_WEIGHT,
 };
 
+static u64 he_ins_lat(struct hist_entry *he)
+{
+	return he->stat.nr_events ? he->stat.ins_lat / he->stat.nr_events : 0;
+}
+
+static int64_t
+sort__local_ins_lat_cmp(struct hist_entry *left, struct hist_entry *right)
+{
+	return he_ins_lat(left) - he_ins_lat(right);
+}
+
+static int hist_entry__local_ins_lat_snprintf(struct hist_entry *he, char *bf,
+					      size_t size, unsigned int width)
+{
+	return repsep_snprintf(bf, size, "%-*u", width, he_ins_lat(he));
+}
+
+struct sort_entry sort_local_ins_lat = {
+	.se_header	= "Local INSTR Latency",
+	.se_cmp		= sort__local_ins_lat_cmp,
+	.se_snprintf	= hist_entry__local_ins_lat_snprintf,
+	.se_width_idx	= HISTC_LOCAL_INS_LAT,
+};
+
+static int64_t
+sort__global_ins_lat_cmp(struct hist_entry *left, struct hist_entry *right)
+{
+	return left->stat.ins_lat - right->stat.ins_lat;
+}
+
+static int hist_entry__global_ins_lat_snprintf(struct hist_entry *he, char *bf,
+					       size_t size, unsigned int width)
+{
+	return repsep_snprintf(bf, size, "%-*u", width, he->stat.ins_lat);
+}
+
+struct sort_entry sort_global_ins_lat = {
+	.se_header	= "INSTR Latency",
+	.se_cmp		= sort__global_ins_lat_cmp,
+	.se_snprintf	= hist_entry__global_ins_lat_snprintf,
+	.se_width_idx	= HISTC_GLOBAL_INS_LAT,
+};
+
 struct sort_entry sort_mem_daddr_sym = {
 	.se_header	= "Data Symbol",
 	.se_cmp		= sort__daddr_cmp,
@@ -1770,6 +1813,8 @@ static struct sort_dimension common_sort_dimensions[] = {
 	DIM(SORT_CGROUP_ID, "cgroup_id", sort_cgroup_id),
 	DIM(SORT_SYM_IPC_NULL, "ipc_null", sort_sym_ipc_null),
 	DIM(SORT_TIME, "time", sort_time),
+	DIM(SORT_LOCAL_INS_LAT, "local_ins_lat", sort_local_ins_lat),
+	DIM(SORT_GLOBAL_INS_LAT, "ins_lat", sort_global_ins_lat),
 };
 
 #undef DIM
diff --git a/tools/perf/util/sort.h b/tools/perf/util/sort.h
index 2b2645b..c92ca15 100644
--- a/tools/perf/util/sort.h
+++ b/tools/perf/util/sort.h
@@ -50,6 +50,7 @@ struct he_stat {
 	u64			period_guest_sys;
 	u64			period_guest_us;
 	u64			weight;
+	u64			ins_lat;
 	u32			nr_events;
 };
 
@@ -229,6 +230,8 @@ enum sort_type {
 	SORT_CGROUP_ID,
 	SORT_SYM_IPC_NULL,
 	SORT_TIME,
+	SORT_LOCAL_INS_LAT,
+	SORT_GLOBAL_INS_LAT,
 
 	/* branch stack specific sort keys */
 	__SORT_BRANCH_STACK,
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 10/12] perf test: Support PERF_SAMPLE_WEIGHT_EXT
  2021-01-19 20:38 [PATCH 00/12] perf core PMU support for Sapphire Rapids kan.liang
                   ` (8 preceding siblings ...)
  2021-01-19 20:38 ` [PATCH 09/12] perf report: Support instruction latency kan.liang
@ 2021-01-19 20:38 ` kan.liang
  2021-01-19 20:38 ` [PATCH 11/12] perf stat: Support L2 Topdown events kan.liang
  2021-01-19 20:38 ` [PATCH 12/12] perf, tools: Update topdown documentation for Sapphire Rapids kan.liang
  11 siblings, 0 replies; 26+ messages in thread
From: kan.liang @ 2021-01-19 20:38 UTC (permalink / raw)
  To: peterz, acme, mingo, linux-kernel
  Cc: eranian, namhyung, jolsa, ak, yao.jin, Kan Liang

From: Kan Liang <kan.liang@linux.intel.com>

Support the new sample type in the sample-parsing test case.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
 tools/perf/tests/sample-parsing.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/tools/perf/tests/sample-parsing.c b/tools/perf/tests/sample-parsing.c
index 2393916..7f428b0 100644
--- a/tools/perf/tests/sample-parsing.c
+++ b/tools/perf/tests/sample-parsing.c
@@ -238,6 +238,7 @@ static int do_test(u64 sample_type, u64 sample_regs, u64 read_format)
 		.phys_addr	= 113,
 		.cgroup		= 114,
 		.data_page_size = 115,
+		.weight_ext.val = 116,
 		.aux_sample	= {
 			.size	= sizeof(aux_data),
 			.data	= (void *)aux_data,
@@ -344,7 +345,7 @@ int test__sample_parsing(struct test *test __maybe_unused, int subtest __maybe_u
 	 * were added.  Please actually update the test rather than just change
 	 * the condition below.
 	 */
-	if (PERF_SAMPLE_MAX > PERF_SAMPLE_CODE_PAGE_SIZE << 1) {
+	if (PERF_SAMPLE_MAX > PERF_SAMPLE_WEIGHT_EXT << 1) {
 		pr_debug("sample format has changed, some new PERF_SAMPLE_ bit was introduced - test needs updating\n");
 		return -1;
 	}
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 11/12] perf stat: Support L2 Topdown events
  2021-01-19 20:38 [PATCH 00/12] perf core PMU support for Sapphire Rapids kan.liang
                   ` (9 preceding siblings ...)
  2021-01-19 20:38 ` [PATCH 10/12] perf test: Support PERF_SAMPLE_WEIGHT_EXT kan.liang
@ 2021-01-19 20:38 ` kan.liang
  2021-01-19 20:38 ` [PATCH 12/12] perf, tools: Update topdown documentation for Sapphire Rapids kan.liang
  11 siblings, 0 replies; 26+ messages in thread
From: kan.liang @ 2021-01-19 20:38 UTC (permalink / raw)
  To: peterz, acme, mingo, linux-kernel
  Cc: eranian, namhyung, jolsa, ak, yao.jin, Kan Liang

From: Kan Liang <kan.liang@linux.intel.com>

The TMA method level 2 metrics are supported from the Intel Sapphire
Rapids server, which exposes four L2 Topdown metrics events to user
space. There are eight L2 events in total; the other four L2 Topdown
metrics events are calculated from the corresponding L1 events and the
exposed L2 events.
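
The derived metrics are simple subtractions from the corresponding L1
values (as documented in patch 12):

	Light_Operations = Retiring - Heavy_Operations
	Machine_Clears = Bad_Speculation - Branch_Mispredicts
	Fetch_Bandwidth = Frontend_Bound - Fetch_Latency
	Core_Bound = Backend_Bound - Memory_Bound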

Now, --topdown prints the complete top-down metrics supported by the
CPU. For the Intel Sapphire Rapids server, the 4 L1 events and 8 L2
events are displayed in one line.

Add a new option, --td-level, to display the top-down statistics that
are at or below the input level.

An L2 event is marked only when both it and its L1 parent event cross
the threshold.

Here is an example:

 $perf stat --topdown --td-level=2 --no-metric-only sleep 1
 Topdown accuracy may decrease when measuring long periods.
 Please print the result regularly, e.g. -I1000

 Performance counter stats for 'sleep 1':

        16,734,390      slots
         2,100,001      topdown-retiring          #     12.6% retiring
         2,034,376      topdown-bad-spec          #     12.3% bad speculation
         4,003,128      topdown-fe-bound          #     24.1% frontend bound
           328,125      topdown-heavy-ops         #      2.0% heavy operations         #      10.6% light operations
         1,968,751      topdown-br-mispredict     #     11.9% branch mispredict         #      0.4% machine clears
         2,953,127      topdown-fetch-lat         #     17.8% fetch latency         #      6.3% fetch bandwidth
         5,906,255      topdown-mem-bound         #     35.6% memory bound          #      15.4% core bound

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
 tools/perf/Documentation/perf-stat.txt | 14 +++++-
 tools/perf/builtin-stat.c              | 34 +++++++++++--
 tools/perf/util/stat-shadow.c          | 92 ++++++++++++++++++++++++++++++++++
 tools/perf/util/stat.c                 |  4 ++
 tools/perf/util/stat.h                 |  9 ++++
 5 files changed, 149 insertions(+), 4 deletions(-)

diff --git a/tools/perf/Documentation/perf-stat.txt b/tools/perf/Documentation/perf-stat.txt
index 5d4a673d..796772c 100644
--- a/tools/perf/Documentation/perf-stat.txt
+++ b/tools/perf/Documentation/perf-stat.txt
@@ -358,7 +358,7 @@ See perf list output for the possble metrics and metricgroups.
 Do not aggregate counts across all monitored CPUs.
 
 --topdown::
-Print top down level 1 metrics if supported by the CPU. This allows to
+Print complete top-down metrics supported by the CPU. This allows to
 determine bottle necks in the CPU pipeline for CPU bound workloads,
 by breaking the cycles consumed down into frontend bound, backend bound,
 bad speculation and retiring.
@@ -393,6 +393,18 @@ To interpret the results it is usually needed to know on which
 CPUs the workload runs on. If needed the CPUs can be forced using
 taskset.
 
+--td-level::
+Print the top-down statistics that are at or below the input level.
+It allows users to print only the top-down metrics level of interest
+instead of the complete set of top-down metrics.
+
+The availability of the top-down metrics level depends on the hardware. For
+example, Ice Lake only supports L1 top-down metrics. Sapphire Rapids
+supports both L1 and L2 top-down metrics.
+
+Default: 0 means the max level that the current hardware supports.
+Error out if the input is higher than the supported max level.
+
 --no-merge::
 Do not merge results from same PMUs.
 
diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
index 8cc2496..bc84b31 100644
--- a/tools/perf/builtin-stat.c
+++ b/tools/perf/builtin-stat.c
@@ -137,6 +137,19 @@ static const char *topdown_metric_attrs[] = {
 	NULL,
 };
 
+static const char *topdown_metric_L2_attrs[] = {
+	"slots",
+	"topdown-retiring",
+	"topdown-bad-spec",
+	"topdown-fe-bound",
+	"topdown-be-bound",
+	"topdown-heavy-ops",
+	"topdown-br-mispredict",
+	"topdown-fetch-lat",
+	"topdown-mem-bound",
+	NULL,
+};
+
 static const char *smi_cost_attrs = {
 	"{"
 	"msr/aperf/,"
@@ -1153,7 +1166,9 @@ static struct option stat_options[] = {
 	OPT_BOOLEAN(0, "metric-no-merge", &stat_config.metric_no_merge,
 		       "don't try to share events between metrics in a group"),
 	OPT_BOOLEAN(0, "topdown", &topdown_run,
-			"measure topdown level 1 statistics"),
+			"measure top-down statistics"),
+	OPT_UINTEGER(0, "td-level", &stat_config.topdown_level,
+			"Set the metrics level for the top-down statistics (0: max level)"),
 	OPT_BOOLEAN(0, "smi-cost", &smi_cost,
 			"measure SMI cost"),
 	OPT_CALLBACK('M', "metrics", &evsel_list, "metric/metric group list",
@@ -1706,17 +1721,30 @@ static int add_default_attributes(void)
 	}
 
 	if (topdown_run) {
+		const char **metric_attrs = topdown_metric_attrs;
+		unsigned int max_level = 1;
 		char *str = NULL;
 		bool warn = false;
 
 		if (!force_metric_only)
 			stat_config.metric_only = true;
 
-		if (topdown_filter_events(topdown_metric_attrs, &str, 1) < 0) {
+		if (pmu_have_event("cpu", topdown_metric_L2_attrs[5])) {
+			metric_attrs = topdown_metric_L2_attrs;
+			max_level = 2;
+		}
+
+		if (stat_config.topdown_level > max_level) {
+			pr_err("Invalid top-down metrics level. The max level is %u.\n", max_level);
+			return -1;
+		} else if (!stat_config.topdown_level)
+			stat_config.topdown_level = max_level;
+
+		if (topdown_filter_events(metric_attrs, &str, 1) < 0) {
 			pr_err("Out of memory\n");
 			return -1;
 		}
-		if (topdown_metric_attrs[0] && str) {
+		if (metric_attrs[0] && str) {
 			if (!stat_config.interval && !stat_config.metric_only) {
 				fprintf(stat_config.output,
 					"Topdown accuracy may decrease when measuring long periods.\n"
diff --git a/tools/perf/util/stat-shadow.c b/tools/perf/util/stat-shadow.c
index 12eafd1..6ccf21a 100644
--- a/tools/perf/util/stat-shadow.c
+++ b/tools/perf/util/stat-shadow.c
@@ -273,6 +273,18 @@ void perf_stat__update_shadow_stats(struct evsel *counter, u64 count,
 	else if (perf_stat_evsel__is(counter, TOPDOWN_BE_BOUND))
 		update_runtime_stat(st, STAT_TOPDOWN_BE_BOUND,
 				    cpu, count, &rsd);
+	else if (perf_stat_evsel__is(counter, TOPDOWN_HEAVY_OPS))
+		update_runtime_stat(st, STAT_TOPDOWN_HEAVY_OPS,
+				    cpu, count, &rsd);
+	else if (perf_stat_evsel__is(counter, TOPDOWN_BR_MISPREDICT))
+		update_runtime_stat(st, STAT_TOPDOWN_BR_MISPREDICT,
+				    cpu, count, &rsd);
+	else if (perf_stat_evsel__is(counter, TOPDOWN_FETCH_LAT))
+		update_runtime_stat(st, STAT_TOPDOWN_FETCH_LAT,
+				    cpu, count, &rsd);
+	else if (perf_stat_evsel__is(counter, TOPDOWN_MEM_BOUND))
+		update_runtime_stat(st, STAT_TOPDOWN_MEM_BOUND,
+				    cpu, count, &rsd);
 	else if (evsel__match(counter, HARDWARE, HW_STALLED_CYCLES_FRONTEND))
 		update_runtime_stat(st, STAT_STALLED_CYCLES_FRONT,
 				    cpu, count, &rsd);
@@ -1174,6 +1186,86 @@ void perf_stat__print_shadow_stats(struct perf_stat_config *config,
 			color = PERF_COLOR_RED;
 		print_metric(config, ctxp, color, "%8.1f%%", "bad speculation",
 				bad_spec * 100.);
+	} else if (perf_stat_evsel__is(evsel, TOPDOWN_HEAVY_OPS) &&
+			full_td(cpu, st, &rsd) && (config->topdown_level > 1)) {
+		double retiring = td_metric_ratio(cpu,
+						  STAT_TOPDOWN_RETIRING, st,
+						  &rsd);
+		double heavy_ops = td_metric_ratio(cpu,
+						   STAT_TOPDOWN_HEAVY_OPS, st,
+						   &rsd);
+		double light_ops = retiring - heavy_ops;
+
+		if (retiring > 0.7 && heavy_ops > 0.1)
+			color = PERF_COLOR_GREEN;
+		print_metric(config, ctxp, color, "%8.1f%%", "heavy operations",
+				heavy_ops * 100.);
+		if (retiring > 0.7 && light_ops > 0.6)
+			color = PERF_COLOR_GREEN;
+		else
+			color = NULL;
+		print_metric(config, ctxp, color, "%8.1f%%", "light operations",
+				light_ops * 100.);
+	} else if (perf_stat_evsel__is(evsel, TOPDOWN_BR_MISPREDICT) &&
+			full_td(cpu, st, &rsd) && (config->topdown_level > 1)) {
+		double bad_spec = td_metric_ratio(cpu,
+						  STAT_TOPDOWN_BAD_SPEC, st,
+						  &rsd);
+		double br_mis = td_metric_ratio(cpu,
+						STAT_TOPDOWN_BR_MISPREDICT, st,
+						&rsd);
+		double m_clears = bad_spec - br_mis;
+
+		if (bad_spec > 0.1 && br_mis > 0.05)
+			color = PERF_COLOR_RED;
+		print_metric(config, ctxp, color, "%8.1f%%", "branch mispredict",
+				br_mis * 100.);
+		if (bad_spec > 0.1 && m_clears > 0.05)
+			color = PERF_COLOR_RED;
+		else
+			color = NULL;
+		print_metric(config, ctxp, color, "%8.1f%%", "machine clears",
+				m_clears * 100.);
+	} else if (perf_stat_evsel__is(evsel, TOPDOWN_FETCH_LAT) &&
+			full_td(cpu, st, &rsd) && (config->topdown_level > 1)) {
+		double fe_bound = td_metric_ratio(cpu,
+						  STAT_TOPDOWN_FE_BOUND, st,
+						  &rsd);
+		double fetch_lat = td_metric_ratio(cpu,
+						   STAT_TOPDOWN_FETCH_LAT, st,
+						   &rsd);
+		double fetch_bw = fe_bound - fetch_lat;
+
+		if (fe_bound > 0.2 && fetch_lat > 0.15)
+			color = PERF_COLOR_RED;
+		print_metric(config, ctxp, color, "%8.1f%%", "fetch latency",
+				fetch_lat * 100.);
+		if (fe_bound > 0.2 && fetch_bw > 0.1)
+			color = PERF_COLOR_RED;
+		else
+			color = NULL;
+		print_metric(config, ctxp, color, "%8.1f%%", "fetch bandwidth",
+				fetch_bw * 100.);
+	} else if (perf_stat_evsel__is(evsel, TOPDOWN_MEM_BOUND) &&
+			full_td(cpu, st, &rsd) && (config->topdown_level > 1)) {
+		double be_bound = td_metric_ratio(cpu,
+						  STAT_TOPDOWN_BE_BOUND, st,
+						  &rsd);
+		double mem_bound = td_metric_ratio(cpu,
+						   STAT_TOPDOWN_MEM_BOUND, st,
+						   &rsd);
+		double core_bound = be_bound - mem_bound;
+
+		if (be_bound > 0.2 && mem_bound > 0.2)
+			color = PERF_COLOR_RED;
+		print_metric(config, ctxp, color, "%8.1f%%", "memory bound",
+				mem_bound * 100.);
+		if (be_bound > 0.2 && core_bound > 0.1)
+			color = PERF_COLOR_RED;
+		else
+			color = NULL;
+		print_metric(config, ctxp, color, "%8.1f%%", "Core bound",
+				core_bound * 100.);
 	} else if (evsel->metric_expr) {
 		generic_metric(config, evsel->metric_expr, evsel->metric_events, NULL,
 				evsel->name, evsel->metric_name, NULL, 1, cpu, out, st);
diff --git a/tools/perf/util/stat.c b/tools/perf/util/stat.c
index 8ce1479..82c767b 100644
--- a/tools/perf/util/stat.c
+++ b/tools/perf/util/stat.c
@@ -99,6 +99,10 @@ static const char *id_str[PERF_STAT_EVSEL_ID__MAX] = {
 	ID(TOPDOWN_BAD_SPEC, topdown-bad-spec),
 	ID(TOPDOWN_FE_BOUND, topdown-fe-bound),
 	ID(TOPDOWN_BE_BOUND, topdown-be-bound),
+	ID(TOPDOWN_HEAVY_OPS, topdown-heavy-ops),
+	ID(TOPDOWN_BR_MISPREDICT, topdown-br-mispredict),
+	ID(TOPDOWN_FETCH_LAT, topdown-fetch-lat),
+	ID(TOPDOWN_MEM_BOUND, topdown-mem-bound),
 	ID(SMI_NUM, msr/smi/),
 	ID(APERF, msr/aperf/),
 };
diff --git a/tools/perf/util/stat.h b/tools/perf/util/stat.h
index b536973..d85c292 100644
--- a/tools/perf/util/stat.h
+++ b/tools/perf/util/stat.h
@@ -33,6 +33,10 @@ enum perf_stat_evsel_id {
 	PERF_STAT_EVSEL_ID__TOPDOWN_BAD_SPEC,
 	PERF_STAT_EVSEL_ID__TOPDOWN_FE_BOUND,
 	PERF_STAT_EVSEL_ID__TOPDOWN_BE_BOUND,
+	PERF_STAT_EVSEL_ID__TOPDOWN_HEAVY_OPS,
+	PERF_STAT_EVSEL_ID__TOPDOWN_BR_MISPREDICT,
+	PERF_STAT_EVSEL_ID__TOPDOWN_FETCH_LAT,
+	PERF_STAT_EVSEL_ID__TOPDOWN_MEM_BOUND,
 	PERF_STAT_EVSEL_ID__SMI_NUM,
 	PERF_STAT_EVSEL_ID__APERF,
 	PERF_STAT_EVSEL_ID__MAX,
@@ -91,6 +95,10 @@ enum stat_type {
 	STAT_TOPDOWN_BAD_SPEC,
 	STAT_TOPDOWN_FE_BOUND,
 	STAT_TOPDOWN_BE_BOUND,
+	STAT_TOPDOWN_HEAVY_OPS,
+	STAT_TOPDOWN_BR_MISPREDICT,
+	STAT_TOPDOWN_FETCH_LAT,
+	STAT_TOPDOWN_MEM_BOUND,
 	STAT_SMI_NUM,
 	STAT_APERF,
 	STAT_MAX
@@ -148,6 +156,7 @@ struct perf_stat_config {
 	int			 ctl_fd_ack;
 	bool			 ctl_fd_close;
 	const char		*cgroup_list;
+	unsigned int		topdown_level;
 };
 
 void perf_stat__set_big_num(int set);
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 12/12] perf, tools: Update topdown documentation for Sapphire Rapids
  2021-01-19 20:38 [PATCH 00/12] perf core PMU support for Sapphire Rapids kan.liang
                   ` (10 preceding siblings ...)
  2021-01-19 20:38 ` [PATCH 11/12] perf stat: Support L2 Topdown events kan.liang
@ 2021-01-19 20:38 ` kan.liang
  11 siblings, 0 replies; 26+ messages in thread
From: kan.liang @ 2021-01-19 20:38 UTC (permalink / raw)
  To: peterz, acme, mingo, linux-kernel
  Cc: eranian, namhyung, jolsa, ak, yao.jin, Kan Liang

From: Kan Liang <kan.liang@linux.intel.com>

Document the Topdown extension on Sapphire Rapids and describe how to
collect the L2 events.
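
For example, on Sapphire Rapids the L2 metrics can be collected with the
interface added earlier in this series (see patch 11):

  $ perf stat --topdown --td-level=2 --no-metric-only sleep 1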

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
 tools/perf/Documentation/topdown.txt | 78 ++++++++++++++++++++++++++++++++++--
 1 file changed, 74 insertions(+), 4 deletions(-)

diff --git a/tools/perf/Documentation/topdown.txt b/tools/perf/Documentation/topdown.txt
index 3c39bb3..10f07f9 100644
--- a/tools/perf/Documentation/topdown.txt
+++ b/tools/perf/Documentation/topdown.txt
@@ -121,7 +121,7 @@ to read slots and the topdown metrics at different points of the program:
 #define RDPMC_METRIC	(1 << 29)	/* return metric counters */
 
 #define FIXED_COUNTER_SLOTS		3
-#define METRIC_COUNTER_TOPDOWN_L1	0
+#define METRIC_COUNTER_TOPDOWN_L1_L2	0
 
 static inline uint64_t read_slots(void)
 {
@@ -130,7 +130,7 @@ static inline uint64_t read_slots(void)
 
 static inline uint64_t read_metrics(void)
 {
-	return _rdpmc(RDPMC_METRIC | METRIC_COUNTER_TOPDOWN_L1);
+	return _rdpmc(RDPMC_METRIC | METRIC_COUNTER_TOPDOWN_L1_L2);
 }
 
 Then the program can be instrumented to read these metrics at different
@@ -152,11 +152,21 @@ The binary ratios in the metric value can be converted to float ratios:
 
 #define GET_METRIC(m, i) (((m) >> (i*8)) & 0xff)
 
+/* L1 Topdown metric events */
 #define TOPDOWN_RETIRING(val)	((float)GET_METRIC(val, 0) / 0xff)
 #define TOPDOWN_BAD_SPEC(val)	((float)GET_METRIC(val, 1) / 0xff)
 #define TOPDOWN_FE_BOUND(val)	((float)GET_METRIC(val, 2) / 0xff)
 #define TOPDOWN_BE_BOUND(val)	((float)GET_METRIC(val, 3) / 0xff)
 
+/*
+ * L2 Topdown metric events.
+ * Available on Sapphire Rapids and later platforms.
+ */
+#define TOPDOWN_HEAVY_OPS(val)		((float)GET_METRIC(val, 4) / 0xff)
+#define TOPDOWN_BR_MISPREDICT(val)	((float)GET_METRIC(val, 5) / 0xff)
+#define TOPDOWN_FETCH_LAT(val)		((float)GET_METRIC(val, 6) / 0xff)
+#define TOPDOWN_MEM_BOUND(val)		((float)GET_METRIC(val, 7) / 0xff)
+
 and then converted to percent for printing.
 
 The ratios in the metric accumulate for the time when the counter
@@ -190,8 +200,8 @@ for that time period.
 	fe_bound_slots = GET_METRIC(metric_b, 2) * slots_b - fe_bound_slots_a
 	be_bound_slots = GET_METRIC(metric_b, 3) * slots_b - be_bound_slots_a
 
-Later the individual ratios for the measurement period can be recreated
-from these counts.
+Later the individual ratios of L1 metric events for the measurement period can
+be recreated from these counts.
 
 	slots_delta = slots_b - slots_a
 	retiring_ratio = (float)retiring_slots / slots_delta
@@ -205,6 +215,48 @@ from these counts.
 		fe_bound_ratio * 100.,
 		be_bound_ratio * 100.);
 
+The individual ratios of L2 metric events for the measurement period can be
+recreated from the L1 and L2 metric counters (available on Sapphire Rapids
+and later platforms).
+
+	# compute scaled metrics for measurement a
+	heavy_ops_slots_a = GET_METRIC(metric_a, 4) * slots_a
+	br_mispredict_slots_a = GET_METRIC(metric_a, 5) * slots_a
+	fetch_lat_slots_a = GET_METRIC(metric_a, 6) * slots_a
+	mem_bound_slots_a = GET_METRIC(metric_a, 7) * slots_a
+
+	# compute delta scaled metrics between b and a
+	heavy_ops_slots = GET_METRIC(metric_b, 4) * slots_b - heavy_ops_slots_a
+	br_mispredict_slots = GET_METRIC(metric_b, 5) * slots_b - br_mispredict_slots_a
+	fetch_lat_slots = GET_METRIC(metric_b, 6) * slots_b - fetch_lat_slots_a
+	mem_bound_slots = GET_METRIC(metric_b, 7) * slots_b - mem_bound_slots_a
+
+	slots_delta = slots_b - slots_a
+	heavy_ops_ratio = (float)heavy_ops_slots / slots_delta
+	light_ops_ratio = retiring_ratio - heavy_ops_ratio;
+
+	br_mispredict_ratio = (float)br_mispredict_slots / slots_delta
+	machine_clears_ratio = bad_spec_ratio - br_mispredict_ratio;
+
+	fetch_lat_ratio = (float)fetch_lat_slots / slots_delta
+	fetch_bw_ratio = fe_bound_ratio - fetch_lat_ratio;
+
+	mem_bound_ratio = (float)mem_bound_slots / slots_delta
+	core_bound_ratio = be_bound_ratio - mem_bound_ratio;
+
+	printf("Heavy Operations %.2f%% Light Operations %.2f%% "
+	       "Branch Mispredict %.2f%% Machine Clears %.2f%% "
+	       "Fetch Latency %.2f%% Fetch Bandwidth %.2f%% "
+	       "Mem Bound %.2f%% Core Bound %.2f%%\n",
+		heavy_ops_ratio * 100.,
+		light_ops_ratio * 100.,
+		br_mispredict_ratio * 100.,
+		machine_clears_ratio * 100.,
+		fetch_lat_ratio * 100.,
+		fetch_bw_ratio * 100.,
+		mem_bound_ratio * 100.,
+		core_bound_ratio * 100.);
+
 Resetting metrics counters
 ==========================
 
@@ -248,6 +300,24 @@ a sampling read group. Since the SLOTS event must be the leader of a TopDown
 group, the second event of the group is the sampling event.
 For example, perf record -e '{slots, $sampling_event, topdown-retiring}:S'
 
+Extension on Sapphire Rapids Server
+===================================
+The metrics counter is extended to support TMA method level 2 metrics.
+The lower half of the register is the TMA level 1 metrics (legacy).
+The upper half is also divided into four 8-bit fields for the new level 2
+metrics. Four more TopDown metric events are exposed to end users:
+topdown-heavy-ops, topdown-br-mispredict, topdown-fetch-lat and
+topdown-mem-bound.
+
+Each of the new level 2 metrics in the upper half is a subset of the
+corresponding level 1 metric in the lower half. Software can deduce the
+other four level 2 metrics by subtracting corresponding metrics as below.
+
+    Light_Operations = Retiring - Heavy_Operations
+    Machine_Clears = Bad_Speculation - Branch_Mispredicts
+    Fetch_Bandwidth = Frontend_Bound - Fetch_Latency
+    Core_Bound = Backend_Bound - Memory_Bound
+
 
 [1] https://software.intel.com/en-us/top-down-microarchitecture-analysis-method-win
 [2] https://github.com/andikleen/pmu-tools/wiki/toplev-manual
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH 01/12] perf/core: Add PERF_SAMPLE_WEIGHT_EXT
  2021-01-19 20:38 ` [PATCH 01/12] perf/core: Add PERF_SAMPLE_WEIGHT_EXT kan.liang
@ 2021-01-26 14:42   ` Peter Zijlstra
  2021-01-26 15:33     ` Liang, Kan
  0 siblings, 1 reply; 26+ messages in thread
From: Peter Zijlstra @ 2021-01-26 14:42 UTC (permalink / raw)
  To: kan.liang
  Cc: acme, mingo, linux-kernel, eranian, namhyung, jolsa, ak, yao.jin

On Tue, Jan 19, 2021 at 12:38:20PM -0800, kan.liang@linux.intel.com wrote:

> @@ -900,6 +901,13 @@ enum perf_event_type {
>  	 *	  char			data[size]; } && PERF_SAMPLE_AUX
>  	 *	{ u64			data_page_size;} && PERF_SAMPLE_DATA_PAGE_SIZE
>  	 *	{ u64			code_page_size;} && PERF_SAMPLE_CODE_PAGE_SIZE
> +	 *	{ union {
> +	 *		u64		weight_ext;
> +	 *		struct {
> +	 *			u64	instr_latency:16,
> +	 *				reserved:48;
> +	 *		};
> +	 *	} && PERF_SAMPLE_WEIGHT_EXT
>  	 * };
>  	 */
>  	PERF_RECORD_SAMPLE			= 9,
> @@ -1248,4 +1256,12 @@ struct perf_branch_entry {
>  		reserved:40;
>  };
>  
> +union perf_weight_ext {
> +	__u64		val;
> +	struct {
> +		__u64	instr_latency:16,
> +			reserved:48;
> +	};
> +};
> +
>  #endif /* _UAPI_LINUX_PERF_EVENT_H */
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 55d1879..9363d12 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -1903,6 +1903,9 @@ static void __perf_event_header_size(struct perf_event *event, u64 sample_type)
>  	if (sample_type & PERF_SAMPLE_CODE_PAGE_SIZE)
>  		size += sizeof(data->code_page_size);
>  
> +	if (sample_type & PERF_SAMPLE_WEIGHT_EXT)
> +		size += sizeof(data->weight_ext);
> +
>  	event->header_size = size;
>  }
>  
> @@ -6952,6 +6955,9 @@ void perf_output_sample(struct perf_output_handle *handle,
>  			perf_aux_sample_output(event, handle, data);
>  	}
>  
> +	if (sample_type & PERF_SAMPLE_WEIGHT_EXT)
> +		perf_output_put(handle, data->weight_ext);
> +
>  	if (!event->attr.watermark) {
>  		int wakeup_events = event->attr.wakeup_events;
>  

This patch is broken and will expose uninitialized kernel stack.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 03/12] perf/x86/intel: Add perf core PMU support for Sapphire Rapids
  2021-01-19 20:38 ` [PATCH 03/12] perf/x86/intel: Add perf core PMU support for Sapphire Rapids kan.liang
@ 2021-01-26 14:43   ` Peter Zijlstra
  2021-01-26 15:34     ` Liang, Kan
  2021-01-26 14:44   ` Peter Zijlstra
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 26+ messages in thread
From: Peter Zijlstra @ 2021-01-26 14:43 UTC (permalink / raw)
  To: kan.liang
  Cc: acme, mingo, linux-kernel, eranian, namhyung, jolsa, ak, yao.jin

On Tue, Jan 19, 2021 at 12:38:22PM -0800, kan.liang@linux.intel.com wrote:
> @@ -2319,6 +2474,17 @@ static void __icl_update_topdown_event(struct perf_event *event,
>  {
>  	u64 delta, last = 0;
>  
> +	/*
> +	 * Although the unsupported topdown events are not exposed to users,
> +	 * users may mistakenly use the unsupported events via RAW format.
> +	 * For example, using L2 topdown event, cpu/event=0x00,umask=0x84/,
> +	 * on Ice Lake. In this case, the scheduler follows the unknown
> +	 * event handling and assigns a GP counter to the event.
> +	 * Check the case, and avoid updating unsupported events.
> +	 */
> +	if (event->hw.idx < INTEL_PMC_IDX_FIXED)
> +		return;
> +
>  	delta = icl_get_topdown_value(event, slots, metrics);
>  	if (last_slots)
>  		last = icl_get_topdown_value(event, last_slots, last_metrics);

Is this a separate patch?

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 03/12] perf/x86/intel: Add perf core PMU support for Sapphire Rapids
  2021-01-19 20:38 ` [PATCH 03/12] perf/x86/intel: Add perf core PMU support for Sapphire Rapids kan.liang
  2021-01-26 14:43   ` Peter Zijlstra
@ 2021-01-26 14:44   ` Peter Zijlstra
  2021-01-26 15:44     ` Liang, Kan
  2021-01-26 14:49   ` Peter Zijlstra
  2021-01-26 15:37   ` Peter Zijlstra
  3 siblings, 1 reply; 26+ messages in thread
From: Peter Zijlstra @ 2021-01-26 14:44 UTC (permalink / raw)
  To: kan.liang
  Cc: acme, mingo, linux-kernel, eranian, namhyung, jolsa, ak, yao.jin

On Tue, Jan 19, 2021 at 12:38:22PM -0800, kan.liang@linux.intel.com wrote:
> @@ -3671,6 +3853,31 @@ static int intel_pmu_hw_config(struct perf_event *event)
>  		}
>  	}
>  
> +	/*
> +	 * To retrieve complete Memory Info of the load latency event, an
> +	 * auxiliary event has to be enabled simultaneously. Add a check for
> +	 * the load latency event.
> +	 *
> +	 * In a group, the auxiliary event must be in front of the load latency
> +	 * event. The rule is to simplify the implementation of the check.
> +	 * That's because perf cannot have a complete group at the moment.
> +	 */
> +	if (x86_pmu.flags & PMU_FL_MEM_LOADS_AUX &&
> +	    (event->attr.sample_type & PERF_SAMPLE_DATA_SRC) &&
> +	    is_mem_loads_event(event)) {
> +		struct perf_event *leader = event->group_leader;
> +		struct perf_event *sibling = NULL;
> +
> +		if (!is_mem_loads_aux_event(leader)) {
> +			for_each_sibling_event(sibling, leader) {
> +				if (is_mem_loads_aux_event(sibling))
> +					break;
> +			}
> +			if (list_entry_is_head(sibling, &leader->sibling_list, sibling_list))
> +				return -ENODATA;
> +		}
> +	}
> +
>  	if (!(event->attr.config & ARCH_PERFMON_EVENTSEL_ANY))
>  		return 0;
>  

I have vague memories of this getting mentioned in a call at some point.
Pretend I don't know anything and tell me more.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 03/12] perf/x86/intel: Add perf core PMU support for Sapphire Rapids
  2021-01-19 20:38 ` [PATCH 03/12] perf/x86/intel: Add perf core PMU support for Sapphire Rapids kan.liang
  2021-01-26 14:43   ` Peter Zijlstra
  2021-01-26 14:44   ` Peter Zijlstra
@ 2021-01-26 14:49   ` Peter Zijlstra
  2021-01-26 15:37   ` Peter Zijlstra
  3 siblings, 0 replies; 26+ messages in thread
From: Peter Zijlstra @ 2021-01-26 14:49 UTC (permalink / raw)
  To: kan.liang
  Cc: acme, mingo, linux-kernel, eranian, namhyung, jolsa, ak, yao.jin

On Tue, Jan 19, 2021 at 12:38:22PM -0800, kan.liang@linux.intel.com wrote:

>  Add pebs_no_block to
>   explicitly indicate the previous platforms which don't support the new
>   block fields. Accessing the new block fields are ignored on those
>   platforms.

> @@ -5475,6 +5749,7 @@ __init int intel_pmu_init(void)
>  		x86_pmu.extra_regs = intel_icl_extra_regs;
>  		x86_pmu.pebs_aliases = NULL;
>  		x86_pmu.pebs_prec_dist = true;
> +		x86_pmu.pebs_no_block = true;
>  		x86_pmu.flags |= PMU_FL_HAS_RSP_1;
>  		x86_pmu.flags |= PMU_FL_NO_HT_SHARING;
>  

> @@ -198,6 +206,63 @@ static u64 load_latency_data(u64 status)
>  	if (dse.ld_locked)
>  		val |= P(LOCK, LOCKED);
>  
> +	/*
> +	 * Ice Lake and earlier models do not support block infos.
> +	 */
> +	if (x86_pmu.pebs_no_block) {
> +		val |= P(BLK, NA);
> +		return val;
> +	}

> @@ -2026,8 +2128,10 @@ void __init intel_ds_init(void)
>  	x86_pmu.bts  = boot_cpu_has(X86_FEATURE_BTS);
>  	x86_pmu.pebs = boot_cpu_has(X86_FEATURE_PEBS);
>  	x86_pmu.pebs_buffer_size = PEBS_BUFFER_SIZE;
> -	if (x86_pmu.version <= 4)
> +	if (x86_pmu.version <= 4) {
>  		x86_pmu.pebs_no_isolation = 1;
> +		x86_pmu.pebs_no_block = 1;
> +	}
>  
>  	if (x86_pmu.pebs) {
>  		char pebs_type = x86_pmu.intel_cap.pebs_trap ?  '+' : '-';

> @@ -724,7 +729,8 @@ struct x86_pmu {
>  			pebs_broken		:1,
>  			pebs_prec_dist		:1,
>  			pebs_no_tlb		:1,
> -			pebs_no_isolation	:1;
> +			pebs_no_isolation	:1,
> +			pebs_no_block		:1;
>  	int		pebs_record_size;
>  	int		pebs_buffer_size;
>  	int		max_pebs_events;

I suppose the existing pebs_no_isolation set the bad precedent, but this
is of course a bit backwards. Since we're 0-initialized, new features
should be 1, and not the other way around.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 01/12] perf/core: Add PERF_SAMPLE_WEIGHT_EXT
  2021-01-26 14:42   ` Peter Zijlstra
@ 2021-01-26 15:33     ` Liang, Kan
  2021-01-26 15:55       ` Peter Zijlstra
  0 siblings, 1 reply; 26+ messages in thread
From: Liang, Kan @ 2021-01-26 15:33 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: acme, mingo, linux-kernel, eranian, namhyung, jolsa, ak, yao.jin



On 1/26/2021 9:42 AM, Peter Zijlstra wrote:
> On Tue, Jan 19, 2021 at 12:38:20PM -0800, kan.liang@linux.intel.com wrote:
> 
>> @@ -900,6 +901,13 @@ enum perf_event_type {
>>   	 *	  char			data[size]; } && PERF_SAMPLE_AUX
>>   	 *	{ u64			data_page_size;} && PERF_SAMPLE_DATA_PAGE_SIZE
>>   	 *	{ u64			code_page_size;} && PERF_SAMPLE_CODE_PAGE_SIZE
>> +	 *	{ union {
>> +	 *		u64		weight_ext;
>> +	 *		struct {
>> +	 *			u64	instr_latency:16,
>> +	 *				reserved:48;
>> +	 *		};
>> +	 *	} && PERF_SAMPLE_WEIGHT_EXT
>>   	 * };
>>   	 */
>>   	PERF_RECORD_SAMPLE			= 9,
>> @@ -1248,4 +1256,12 @@ struct perf_branch_entry {
>>   		reserved:40;
>>   };
>>   
>> +union perf_weight_ext {
>> +	__u64		val;
>> +	struct {
>> +		__u64	instr_latency:16,
>> +			reserved:48;
>> +	};
>> +};
>> +
>>   #endif /* _UAPI_LINUX_PERF_EVENT_H */
>> diff --git a/kernel/events/core.c b/kernel/events/core.c
>> index 55d1879..9363d12 100644
>> --- a/kernel/events/core.c
>> +++ b/kernel/events/core.c
>> @@ -1903,6 +1903,9 @@ static void __perf_event_header_size(struct perf_event *event, u64 sample_type)
>>   	if (sample_type & PERF_SAMPLE_CODE_PAGE_SIZE)
>>   		size += sizeof(data->code_page_size);
>>   
>> +	if (sample_type & PERF_SAMPLE_WEIGHT_EXT)
>> +		size += sizeof(data->weight_ext);
>> +
>>   	event->header_size = size;
>>   }
>>   
>> @@ -6952,6 +6955,9 @@ void perf_output_sample(struct perf_output_handle *handle,
>>   			perf_aux_sample_output(event, handle, data);
>>   	}
>>   
>> +	if (sample_type & PERF_SAMPLE_WEIGHT_EXT)
>> +		perf_output_put(handle, data->weight_ext);
>> +
>>   	if (!event->attr.watermark) {
>>   		int wakeup_events = event->attr.wakeup_events;
>>   
> 
> This patch is broken and will expose uninitialized kernel stack.
> 

Could we initialize the 'weight_ext' in perf_sample_data_init()?

I understand that we prefer not to set the field in
perf_sample_data_init() to minimize the number of cachelines touched.
However, perf_sample_data_init() seems the proper place to do the
initialization: 'weight' is already initialized there, and 'weight_ext'
is an extension of it, so I think it should be initialized there as
well.

In perf_prepare_sample() we could only clear the unused fields; the
[0:15] bits may still leak uninitialized data.
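
Something like the below, as a rough sketch against the current helper
(existing fields reproduced from memory and may not match the tree
exactly; only the last assignment is new):

static inline void perf_sample_data_init(struct perf_sample_data *data,
					 u64 addr, u64 period)
{
	/* remaining struct members are zero-initialized on first use */
	data->addr = addr;
	data->raw  = NULL;
	data->br_stack = NULL;
	data->period = period;
	data->weight = 0;
	data->data_src.val = PERF_MEM_NA;
	data->txn = 0;
	/* proposed: zero the extension so no stack bits can leak */
	data->weight_ext.val = 0;
}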

Thanks,
Kan

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 03/12] perf/x86/intel: Add perf core PMU support for Sapphire Rapids
  2021-01-26 14:43   ` Peter Zijlstra
@ 2021-01-26 15:34     ` Liang, Kan
  0 siblings, 0 replies; 26+ messages in thread
From: Liang, Kan @ 2021-01-26 15:34 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: acme, mingo, linux-kernel, eranian, namhyung, jolsa, ak, yao.jin



On 1/26/2021 9:43 AM, Peter Zijlstra wrote:
> On Tue, Jan 19, 2021 at 12:38:22PM -0800, kan.liang@linux.intel.com wrote:
>> @@ -2319,6 +2474,17 @@ static void __icl_update_topdown_event(struct perf_event *event,
>>   {
>>   	u64 delta, last = 0;
>>   
>> +	/*
>> +	 * Although the unsupported topdown events are not exposed to users,
>> +	 * users may mistakenly use the unsupported events via RAW format.
>> +	 * For example, using L2 topdown event, cpu/event=0x00,umask=0x84/,
>> +	 * on Ice Lake. In this case, the scheduler follows the unknown
>> +	 * event handling and assigns a GP counter to the event.
>> +	 * Check the case, and avoid updating unsupported events.
>> +	 */
>> +	if (event->hw.idx < INTEL_PMC_IDX_FIXED)
>> +		return;
>> +
>>   	delta = icl_get_topdown_value(event, slots, metrics);
>>   	if (last_slots)
>>   		last = icl_get_topdown_value(event, last_slots, last_metrics);
> 
> Is this a separate patch?
> 

I will move it to a separate patch.

Thanks,
Kan

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 03/12] perf/x86/intel: Add perf core PMU support for Sapphire Rapids
  2021-01-19 20:38 ` [PATCH 03/12] perf/x86/intel: Add perf core PMU support for Sapphire Rapids kan.liang
                     ` (2 preceding siblings ...)
  2021-01-26 14:49   ` Peter Zijlstra
@ 2021-01-26 15:37   ` Peter Zijlstra
  2021-01-26 16:21     ` Liang, Kan
  3 siblings, 1 reply; 26+ messages in thread
From: Peter Zijlstra @ 2021-01-26 15:37 UTC (permalink / raw)
  To: kan.liang
  Cc: acme, mingo, linux-kernel, eranian, namhyung, jolsa, ak, yao.jin

On Tue, Jan 19, 2021 at 12:38:22PM -0800, kan.liang@linux.intel.com wrote:
> @@ -1577,9 +1668,20 @@ static void setup_pebs_adaptive_sample_data(struct perf_event *event,
>  	}
>  
>  	if (format_size & PEBS_DATACFG_MEMINFO) {
> +		if (sample_type & PERF_SAMPLE_WEIGHT) {
> +			u64 weight = meminfo->latency;
> +
> +			if (x86_pmu.flags & PMU_FL_INSTR_LATENCY)
> +				weight >>= PEBS_CACHE_LATENCY_OFFSET;
> +			data->weight = weight & PEBS_LATENCY_MASK ?:
>  				intel_get_tsx_weight(meminfo->tsx_tuning);
> +		}
> +
> +		if (sample_type & PERF_SAMPLE_WEIGHT_EXT) {
> +			data->weight_ext.val = 0;
> +			if (x86_pmu.flags & PMU_FL_INSTR_LATENCY)
> +				data->weight_ext.instr_latency = meminfo->latency & PEBS_LATENCY_MASK;
> +		}
>  
>  		if (sample_type & PERF_SAMPLE_DATA_SRC)
>  			data->data_src.val = get_data_src(event, meminfo->aux);

Talk to me about that SAMPLE_WEIGHT stuff.... I'm not liking it.

Sure you want multiple dimensions, but urgh.

Also, afaict, as proposed you're wasting 80/128 bits. That is, all data
you want to export fits in a single u64 and yet you're using two, which
is mighty daft.

Sure, pebs::lat / pebs_meminfo::latency is defined as a u64, but you
can't tell me that that is ever actually more than 4G cycles. Even the
TSX block latency is u32.

So how about defining SAMPLE_WEIGHT_STRUCT which uses the exact same
data as SAMPLE_WEIGHT but unions it with a struct. I'm not sure if we
want:

union sample_weight {
	u64 weight;

	struct {
		u32	low_dword;
		u32	high_dword;
	};

	/* or */

	struct {
		u32	low_dword;
		u16	high_word;
		u16	higher_word;
	};
};

Then have the core code enforce SAMPLE_WEIGHT ^ SAMPLE_WEIGHT_STRUCT and
make the existing code never set the high dword.
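
As a sketch, assuming the proposed PERF_SAMPLE_WEIGHT_STRUCT bit, the
enforcement is a one-time check at event creation, in perf_copy_attr()
or similar:

	/* the two weight layouts are mutually exclusive */
	if ((attr->sample_type & PERF_SAMPLE_WEIGHT) &&
	    (attr->sample_type & PERF_SAMPLE_WEIGHT_STRUCT))
		return -EINVAL;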

Hmmm?

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 03/12] perf/x86/intel: Add perf core PMU support for Sapphire Rapids
  2021-01-26 14:44   ` Peter Zijlstra
@ 2021-01-26 15:44     ` Liang, Kan
  2021-01-27 19:16       ` Peter Zijlstra
  0 siblings, 1 reply; 26+ messages in thread
From: Liang, Kan @ 2021-01-26 15:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: acme, mingo, linux-kernel, eranian, namhyung, jolsa, ak, yao.jin



On 1/26/2021 9:44 AM, Peter Zijlstra wrote:
> On Tue, Jan 19, 2021 at 12:38:22PM -0800, kan.liang@linux.intel.com wrote:
>> @@ -3671,6 +3853,31 @@ static int intel_pmu_hw_config(struct perf_event *event)
>>   		}
>>   	}
>>   
>> +	/*
>> +	 * To retrieve complete Memory Info of the load latency event, an
>> +	 * auxiliary event has to be enabled simultaneously. Add a check for
>> +	 * the load latency event.
>> +	 *
>> +	 * In a group, the auxiliary event must be in front of the load latency
>> +	 * event. The rule is to simplify the implementation of the check.
>> +	 * That's because perf cannot have a complete group at the moment.
>> +	 */
>> +	if (x86_pmu.flags & PMU_FL_MEM_LOADS_AUX &&
>> +	    (event->attr.sample_type & PERF_SAMPLE_DATA_SRC) &&
>> +	    is_mem_loads_event(event)) {
>> +		struct perf_event *leader = event->group_leader;
>> +		struct perf_event *sibling = NULL;
>> +
>> +		if (!is_mem_loads_aux_event(leader)) {
>> +			for_each_sibling_event(sibling, leader) {
>> +				if (is_mem_loads_aux_event(sibling))
>> +					break;
>> +			}
>> +			if (list_entry_is_head(sibling, &leader->sibling_list, sibling_list))
>> +				return -ENODATA;
>> +		}
>> +	}
>> +
>>   	if (!(event->attr.config & ARCH_PERFMON_EVENTSEL_ANY))
>>   		return 0;
>>   
> 
> I have vague memories of this getting mentioned in a call at some point.
> Pretend I don't know anything and tell me more.
> 

Adding the auxiliary event is for the new data source fields, data block
& address block. If perf samples only the load latency event, the values
of the data block & address block fields in a sample are not correct. To
get correct values, we have to sample the auxiliary event and the load
latency event together on SPR. So I added the check in the kernel, and
modified perf mem in the perf tool accordingly.
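
For example, with the tool side of this series the load latency sampling
would be issued as a group with the auxiliary event leading, roughly:

  perf record -e '{cpu/mem-loads-aux/,cpu/mem-loads,ldlat=30/}:P' -- <workload>

(event names taken from the mem-events changes in this series; the ldlat
value is illustrative.)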

Thanks,
Kan

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 04/12] perf/x86/intel: Support CPUID 10.ECX to disable fixed counters
  2021-01-19 20:38 ` [PATCH 04/12] perf/x86/intel: Support CPUID 10.ECX to disable fixed counters kan.liang
@ 2021-01-26 15:44   ` Peter Zijlstra
  2021-01-26 15:53   ` Peter Zijlstra
  1 sibling, 0 replies; 26+ messages in thread
From: Peter Zijlstra @ 2021-01-26 15:44 UTC (permalink / raw)
  To: kan.liang
  Cc: acme, mingo, linux-kernel, eranian, namhyung, jolsa, ak, yao.jin

On Tue, Jan 19, 2021 at 12:38:23PM -0800, kan.liang@linux.intel.com wrote:
> @@ -5228,7 +5231,7 @@ __init int intel_pmu_init(void)
>  	 * Check whether the Architectural PerfMon supports
>  	 * Branch Misses Retired hw_event or not.
>  	 */
> -	cpuid(10, &eax.full, &ebx.full, &unused, &edx.full);
> +	cpuid(10, &eax.full, &ebx.full, &fixed_mask, &edx.full);
>  	if (eax.split.mask_length < ARCH_PERFMON_EVENTS_COUNT)
>  		return -ENODEV;
>  
> @@ -5255,8 +5258,16 @@ __init int intel_pmu_init(void)
>  	if (version > 1) {
>  		int assume = 3 * !boot_cpu_has(X86_FEATURE_HYPERVISOR);
>  
> -		x86_pmu.num_counters_fixed =
> -			max((int)edx.split.num_counters_fixed, assume);
> +		if (!fixed_mask) {
> +			x86_pmu.num_counters_fixed =
> +				max((int)edx.split.num_counters_fixed, assume);
> +		} else {
> +			/*
> +			 * The fixed-purpose counters are enumerated in the ECX
> +			 * since V5 perfmon.
> +			 */

But that's not what the code implements.

> +			x86_pmu.num_counters_fixed = fls(fixed_mask);
> +		}
>  	}

What you were looking for is something like this:

diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index fe940082d49a..9ad42cb59606 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -4817,6 +4817,13 @@ __init int intel_pmu_init(void)
 
 		x86_pmu.num_counters_fixed =
 			max((int)edx.split.num_counters_fixed, assume);
+
+		if (version >= 5) {
+			/*
+			 * V5 and later provide a fixed counter mask.
+			 */
+			x86_pmu.num_counters_fixed = fls(fixed_mask);
+		}
 	}
 
 	if (boot_cpu_has(X86_FEATURE_PDCM)) {

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH 04/12] perf/x86/intel: Support CPUID 10.ECX to disable fixed counters
  2021-01-19 20:38 ` [PATCH 04/12] perf/x86/intel: Support CPUID 10.ECX to disable fixed counters kan.liang
  2021-01-26 15:44   ` Peter Zijlstra
@ 2021-01-26 15:53   ` Peter Zijlstra
  1 sibling, 0 replies; 26+ messages in thread
From: Peter Zijlstra @ 2021-01-26 15:53 UTC (permalink / raw)
  To: kan.liang
  Cc: acme, mingo, linux-kernel, eranian, namhyung, jolsa, ak, yao.jin

On Tue, Jan 19, 2021 at 12:38:23PM -0800, kan.liang@linux.intel.com wrote:
> diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
> index a54d4a9..21267dc 100644
> --- a/arch/x86/events/intel/core.c
> +++ b/arch/x86/events/intel/core.c
> @@ -5206,7 +5209,7 @@ __init int intel_pmu_init(void)
>  	union cpuid10_eax eax;
>  	union cpuid10_ebx ebx;
>  	struct event_constraint *c;
> -	unsigned int unused;
> +	unsigned int fixed_mask;
>  	struct extra_reg *er;
>  	bool pmem = false;
>  	int version, i;
> @@ -5228,7 +5231,7 @@ __init int intel_pmu_init(void)
>  	 * Check whether the Architectural PerfMon supports
>  	 * Branch Misses Retired hw_event or not.
>  	 */
> -	cpuid(10, &eax.full, &ebx.full, &unused, &edx.full);
> +	cpuid(10, &eax.full, &ebx.full, &fixed_mask, &edx.full);
>  	if (eax.split.mask_length < ARCH_PERFMON_EVENTS_COUNT)
>  		return -ENODEV;
>  
> @@ -5255,8 +5258,16 @@ __init int intel_pmu_init(void)
>  	if (version > 1) {
>  		int assume = 3 * !boot_cpu_has(X86_FEATURE_HYPERVISOR);
>  
> -		x86_pmu.num_counters_fixed =
> -			max((int)edx.split.num_counters_fixed, assume);
> +		if (!fixed_mask) {
> +			x86_pmu.num_counters_fixed =
> +				max((int)edx.split.num_counters_fixed, assume);
> +		} else {
> +			/*
> +			 * The fixed-purpose counters are enumerated in the ECX
> +			 * since V5 perfmon.
> +			 */
> +			x86_pmu.num_counters_fixed = fls(fixed_mask);
> +		}
>  	}
>  
>  	if (version >= 4)
> @@ -5847,8 +5858,11 @@ __init int intel_pmu_init(void)
>  		x86_pmu.num_counters_fixed = INTEL_PMC_MAX_FIXED;
>  	}
>  
> -	x86_pmu.intel_ctrl |=
> -		((1LL << x86_pmu.num_counters_fixed)-1) << INTEL_PMC_IDX_FIXED;
> +	if (!fixed_mask) {
> +		x86_pmu.intel_ctrl |=
> +			((1LL << x86_pmu.num_counters_fixed)-1) << INTEL_PMC_IDX_FIXED;
> +	} else
> +		x86_pmu.intel_ctrl |= (u64)fixed_mask << INTEL_PMC_IDX_FIXED;
>  
>  	/* AnyThread may be deprecated on arch perfmon v5 or later */
>  	if (x86_pmu.intel_cap.anythread_deprecated)

Maybe like so.

diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index fe940082d49a..274d75d33c14 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -4766,7 +4766,7 @@ __init int intel_pmu_init(void)
 	union cpuid10_eax eax;
 	union cpuid10_ebx ebx;
 	struct event_constraint *c;
-	unsigned int unused;
+	unsigned int fixed_mask;
 	struct extra_reg *er;
 	bool pmem = false;
 	int version, i;
@@ -4788,7 +4788,7 @@ __init int intel_pmu_init(void)
 	 * Check whether the Architectural PerfMon supports
 	 * Branch Misses Retired hw_event or not.
 	 */
-	cpuid(10, &eax.full, &ebx.full, &unused, &edx.full);
+	cpuid(10, &eax.full, &ebx.full, &fixed_mask, &edx.full);
 	if (eax.split.mask_length < ARCH_PERFMON_EVENTS_COUNT)
 		return -ENODEV;
 
@@ -4812,11 +4812,18 @@ __init int intel_pmu_init(void)
 	 * Quirk: v2 perfmon does not report fixed-purpose events, so
 	 * assume at least 3 events, when not running in a hypervisor:
 	 */
-	if (version > 1) {
+	if (version > 1 && version < 5) {
 		int assume = 3 * !boot_cpu_has(X86_FEATURE_HYPERVISOR);
 
 		x86_pmu.num_counters_fixed =
 			max((int)edx.split.num_counters_fixed, assume);
+
+		fixed_mask = (1L << x86_pmu.num_counters_fixed) - 1;
+
+	} else if (version >= 5) {
+
+		x86_pmu.num_counters_fixed = fls(fixed_mask);
+
 	}
 
 	if (boot_cpu_has(X86_FEATURE_PDCM)) {
@@ -5366,8 +5373,7 @@ __init int intel_pmu_init(void)
 		x86_pmu.num_counters_fixed = INTEL_PMC_MAX_FIXED;
 	}
 
-	x86_pmu.intel_ctrl |=
-		((1LL << x86_pmu.num_counters_fixed)-1) << INTEL_PMC_IDX_FIXED;
+	x86_pmu.intel_ctrl |= (u64)fixed_mask << INTEL_PMC_IDX_FIXED;
 
 	/* AnyThread may be deprecated on arch perfmon v5 or later */
 	if (x86_pmu.intel_cap.anythread_deprecated)
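
For reference, a sketch of how the ECX mask maps to the counter count
(kernel fls() returns the 1-based position of the most significant set
bit, so the count is only exact for a mask that is contiguous from bit 0):

	unsigned int fixed_mask = 0x7;	/* CPUID.0AH:ECX, counters 0-2 */
	int nr_fixed = fls(fixed_mask);	/* = 3 */
	/*
	 * A sparse mask such as 0x5 also yields 3, which is why the
	 * enable bits are built from the mask itself, not the count:
	 */
	u64 ctrl = (u64)fixed_mask << INTEL_PMC_IDX_FIXED;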

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH 01/12] perf/core: Add PERF_SAMPLE_WEIGHT_EXT
  2021-01-26 15:33     ` Liang, Kan
@ 2021-01-26 15:55       ` Peter Zijlstra
  0 siblings, 0 replies; 26+ messages in thread
From: Peter Zijlstra @ 2021-01-26 15:55 UTC (permalink / raw)
  To: Liang, Kan
  Cc: acme, mingo, linux-kernel, eranian, namhyung, jolsa, ak, yao.jin

On Tue, Jan 26, 2021 at 10:33:18AM -0500, Liang, Kan wrote:
> 
> 
> On 1/26/2021 9:42 AM, Peter Zijlstra wrote:
> > On Tue, Jan 19, 2021 at 12:38:20PM -0800, kan.liang@linux.intel.com wrote:
> > 
> > > @@ -900,6 +901,13 @@ enum perf_event_type {
> > >   	 *	  char			data[size]; } && PERF_SAMPLE_AUX
> > >   	 *	{ u64			data_page_size;} && PERF_SAMPLE_DATA_PAGE_SIZE
> > >   	 *	{ u64			code_page_size;} && PERF_SAMPLE_CODE_PAGE_SIZE
> > > +	 *	{ union {
> > > +	 *		u64		weight_ext;
> > > +	 *		struct {
> > > +	 *			u64	instr_latency:16,
> > > +	 *				reserved:48;
> > > +	 *		};
> > > +	 *	} && PERF_SAMPLE_WEIGHT_EXT
> > >   	 * };
> > >   	 */
> > >   	PERF_RECORD_SAMPLE			= 9,
> > > @@ -1248,4 +1256,12 @@ struct perf_branch_entry {
> > >   		reserved:40;
> > >   };
> > > +union perf_weight_ext {
> > > +	__u64		val;
> > > +	struct {
> > > +		__u64	instr_latency:16,
> > > +			reserved:48;
> > > +	};
> > > +};
> > > +
> > >   #endif /* _UAPI_LINUX_PERF_EVENT_H */
> > > diff --git a/kernel/events/core.c b/kernel/events/core.c
> > > index 55d1879..9363d12 100644
> > > --- a/kernel/events/core.c
> > > +++ b/kernel/events/core.c
> > > @@ -1903,6 +1903,9 @@ static void __perf_event_header_size(struct perf_event *event, u64 sample_type)
> > >   	if (sample_type & PERF_SAMPLE_CODE_PAGE_SIZE)
> > >   		size += sizeof(data->code_page_size);
> > > +	if (sample_type & PERF_SAMPLE_WEIGHT_EXT)
> > > +		size += sizeof(data->weight_ext);
> > > +
> > >   	event->header_size = size;
> > >   }
> > > @@ -6952,6 +6955,9 @@ void perf_output_sample(struct perf_output_handle *handle,
> > >   			perf_aux_sample_output(event, handle, data);
> > >   	}
> > > +	if (sample_type & PERF_SAMPLE_WEIGHT_EXT)
> > > +		perf_output_put(handle, data->weight_ext);
> > > +
> > >   	if (!event->attr.watermark) {
> > >   		int wakeup_events = event->attr.wakeup_events;
> > 
> > This patch is broken and will expose uninitialized kernel stack.
> > 
> 
> Could we initialize the 'weight_ext' in perf_sample_data_init()?

No. Also see my other mail for why I hate this thing.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 03/12] perf/x86/intel: Add perf core PMU support for Sapphire Rapids
  2021-01-26 15:37   ` Peter Zijlstra
@ 2021-01-26 16:21     ` Liang, Kan
  0 siblings, 0 replies; 26+ messages in thread
From: Liang, Kan @ 2021-01-26 16:21 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: acme, mingo, linux-kernel, eranian, namhyung, jolsa, ak, yao.jin



On 1/26/2021 10:37 AM, Peter Zijlstra wrote:
> On Tue, Jan 19, 2021 at 12:38:22PM -0800, kan.liang@linux.intel.com wrote:
>> @@ -1577,9 +1668,20 @@ static void setup_pebs_adaptive_sample_data(struct perf_event *event,
>>   	}
>>   
>>   	if (format_size & PEBS_DATACFG_MEMINFO) {
>> +		if (sample_type & PERF_SAMPLE_WEIGHT) {
>> +			u64 weight = meminfo->latency;
>> +
>> +			if (x86_pmu.flags & PMU_FL_INSTR_LATENCY)
>> +				weight >>= PEBS_CACHE_LATENCY_OFFSET;
>> +			data->weight = weight & PEBS_LATENCY_MASK ?:
>>   				intel_get_tsx_weight(meminfo->tsx_tuning);
>> +		}
>> +
>> +		if (sample_type & PERF_SAMPLE_WEIGHT_EXT) {
>> +			data->weight_ext.val = 0;
>> +			if (x86_pmu.flags & PMU_FL_INSTR_LATENCY)
>> +				data->weight_ext.instr_latency = meminfo->latency & PEBS_LATENCY_MASK;
>> +		}
>>   
>>   		if (sample_type & PERF_SAMPLE_DATA_SRC)
>>   			data->data_src.val = get_data_src(event, meminfo->aux);
> 
> Talk to me about that SAMPLE_WEIGHT stuff.... I'm not liking it.
> 
> Sure you want multiple dimensions, but urgh.
> 
> Also, afaict, as proposed you're wasting 80/128 bits. That is, all data
> you want to export fits in a single u64 and yet you're using two, which
> is mighty daft.
> 
> Sure, pebs::lat / pebs_meminfo::latency is defined as a u64, but you
> can't tell me that that is ever actually more than 4G cycles. Even the
> TSX block latency is u32.
> 
> So how about defining SAMPLE_WEIGHT_STRUCT which uses the exact same
> data as SAMPLE_WEIGHT but unions it with a struct. I'm not sure if we
> want:
> 
> union sample_weight {
> 	u64 weight;
> 
> 	struct {
> 		u32	low_dword;
> 		u32	high_dword;
> 	};
> 
> 	/* or */
> 
> 	struct {
> 		u32	low_dword;
> 		u16	high_word;
> 		u16	higher_word;
> 	};
> };
> 
> Then have the core code enforce SAMPLE_WEIGHT ^ SAMPLE_WEIGHT_STRUCT and
> make the existing code never set the high dword.

So the kernel will accept only one of the SAMPLE_WEIGHT and
SAMPLE_WEIGHT_STRUCT sample types, and should error out if both are set,
right?

I will check whether u32 is enough for meminfo::latency on the previous
platforms.

Thanks,
Kan

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 03/12] perf/x86/intel: Add perf core PMU support for Sapphire Rapids
  2021-01-26 15:44     ` Liang, Kan
@ 2021-01-27 19:16       ` Peter Zijlstra
  0 siblings, 0 replies; 26+ messages in thread
From: Peter Zijlstra @ 2021-01-27 19:16 UTC (permalink / raw)
  To: Liang, Kan
  Cc: acme, mingo, linux-kernel, eranian, namhyung, jolsa, ak, yao.jin

On Tue, Jan 26, 2021 at 10:44:17AM -0500, Liang, Kan wrote:
> 
> 
> On 1/26/2021 9:44 AM, Peter Zijlstra wrote:
> > On Tue, Jan 19, 2021 at 12:38:22PM -0800, kan.liang@linux.intel.com wrote:
> > > @@ -3671,6 +3853,31 @@ static int intel_pmu_hw_config(struct perf_event *event)
> > >   		}
> > >   	}
> > > +	/*
> > > +	 * To retrieve complete Memory Info of the load latency event, an
> > > +	 * auxiliary event has to be enabled simultaneously. Add a check for
> > > +	 * the load latency event.
> > > +	 *
> > > +	 * In a group, the auxiliary event must be in front of the load latency
> > > +	 * event. The rule is to simplify the implementation of the check.
> > > +	 * That's because perf cannot have a complete group at the moment.
> > > +	 */
> > > +	if (x86_pmu.flags & PMU_FL_MEM_LOADS_AUX &&
> > > +	    (event->attr.sample_type & PERF_SAMPLE_DATA_SRC) &&
> > > +	    is_mem_loads_event(event)) {
> > > +		struct perf_event *leader = event->group_leader;
> > > +		struct perf_event *sibling = NULL;
> > > +
> > > +		if (!is_mem_loads_aux_event(leader)) {
> > > +			for_each_sibling_event(sibling, leader) {
> > > +				if (is_mem_loads_aux_event(sibling))
> > > +					break;
> > > +			}
> > > +			if (list_entry_is_head(sibling, &leader->sibling_list, sibling_list))
> > > +				return -ENODATA;
> > > +		}
> > > +	}
> > > +
> > >   	if (!(event->attr.config & ARCH_PERFMON_EVENTSEL_ANY))
> > >   		return 0;
> > 
> > I have vague memories of this getting mentioned in a call at some point.
> > Pretend I don't know anything and tell me more.
> > 
> 
> Adding the auxiliary event is for the new data source fields, data block &
> address block. If perf samples only the load latency event, the values of
> the data block & address block fields in a sample are not correct. To get
> correct values, we have to sample the auxiliary event and the load latency
> event together on SPR. So I added the check in the kernel, and modified
> perf mem in the perf tool accordingly.

This is an active workaround for a chip defect, right? Something we'd
normally have an erratum for. Can we call it that?

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2021-01-27 19:17 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-01-19 20:38 [PATCH 00/12] perf core PMU support for Sapphire Rapids kan.liang
2021-01-19 20:38 ` [PATCH 01/12] perf/core: Add PERF_SAMPLE_WEIGHT_EXT kan.liang
2021-01-26 14:42   ` Peter Zijlstra
2021-01-26 15:33     ` Liang, Kan
2021-01-26 15:55       ` Peter Zijlstra
2021-01-19 20:38 ` [PATCH 02/12] perf/x86/intel: Factor out intel_update_topdown_event() kan.liang
2021-01-19 20:38 ` [PATCH 03/12] perf/x86/intel: Add perf core PMU support for Sapphire Rapids kan.liang
2021-01-26 14:43   ` Peter Zijlstra
2021-01-26 15:34     ` Liang, Kan
2021-01-26 14:44   ` Peter Zijlstra
2021-01-26 15:44     ` Liang, Kan
2021-01-27 19:16       ` Peter Zijlstra
2021-01-26 14:49   ` Peter Zijlstra
2021-01-26 15:37   ` Peter Zijlstra
2021-01-26 16:21     ` Liang, Kan
2021-01-19 20:38 ` [PATCH 04/12] perf/x86/intel: Support CPUID 10.ECX to disable fixed counters kan.liang
2021-01-26 15:44   ` Peter Zijlstra
2021-01-26 15:53   ` Peter Zijlstra
2021-01-19 20:38 ` [PATCH 05/12] tools headers uapi: Update tools's copy of linux/perf_event.h kan.liang
2021-01-19 20:38 ` [PATCH 06/12] perf tools: Support data block and addr block kan.liang
2021-01-19 20:38 ` [PATCH 07/12] perf c2c: " kan.liang
2021-01-19 20:38 ` [PATCH 08/12] perf tools: Support PERF_SAMPLE_WEIGHT_EXT kan.liang
2021-01-19 20:38 ` [PATCH 09/12] perf report: Support instruction latency kan.liang
2021-01-19 20:38 ` [PATCH 10/12] perf test: Support PERF_SAMPLE_WEIGHT_EXT kan.liang
2021-01-19 20:38 ` [PATCH 11/12] perf stat: Support L2 Topdown events kan.liang
2021-01-19 20:38 ` [PATCH 12/12] perf, tools: Update topdown documentation for Sapphire Rapids kan.liang

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).