linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/9] perf core PMU support for Sapphire Rapids (User tools)
@ 2021-02-02 20:09 kan.liang
  2021-02-02 20:09 ` [PATCH 1/9] tools headers uapi: Update tools's copy of linux/perf_event.h kan.liang
                   ` (8 more replies)
  0 siblings, 9 replies; 34+ messages in thread
From: kan.liang @ 2021-02-02 20:09 UTC (permalink / raw)
  To: acme, mingo, linux-kernel
  Cc: peterz, eranian, namhyung, jolsa, ak, yao.jin, maddy, Kan Liang

From: Kan Liang <kan.liang@linux.intel.com>

The Intel Sapphire Rapids server is the successor of the Intel Ice Lake
server. It introduces several user-visible features.
- Two new data source fields, data block & address block, are added to
  the PEBS Memory Info Record for the load latency event.
  Improve the c2c and mem tools to display the reasons for a memory load
  block.
- Support the new instruction latency in perf mem.
  Support the new sample type, PERF_SAMPLE_WEIGHT_STRUCT.
- Support TMA method level 2 metrics with the perf stat --topdown option.
  Update the topdown documentation accordingly.

The full description for the SPR features can be found at Intel
Architecture Instruction Set Extensions and Future Features Programming
Reference, 319433-041 (and later).

The kernel patches have been merged into the tip:perf/core branch.

Kan Liang (9):
  tools headers uapi: Update tools's copy of linux/perf_event.h
  perf tools: Support the auxiliary event
  perf tools: Support data block and addr block
  perf c2c: Support data block and addr block
  perf tools: Support PERF_SAMPLE_WEIGHT_STRUCT
  perf report: Support instruction latency
  perf test: Support PERF_SAMPLE_WEIGHT_STRUCT
  perf stat: Support L2 Topdown events
  perf, tools: Update topdown documentation for Sapphire Rapids

 tools/include/uapi/linux/perf_event.h     | 54 ++++++++++++++++--
 tools/perf/Documentation/perf-report.txt  |  9 ++-
 tools/perf/Documentation/perf-stat.txt    | 14 ++++-
 tools/perf/Documentation/topdown.txt      | 78 ++++++++++++++++++++++++--
 tools/perf/arch/x86/util/Build            |  2 +
 tools/perf/arch/x86/util/evsel.c          |  8 +++
 tools/perf/arch/x86/util/mem-events.c     | 44 +++++++++++++++
 tools/perf/builtin-c2c.c                  |  3 +
 tools/perf/builtin-mem.c                  |  2 +-
 tools/perf/builtin-stat.c                 | 34 +++++++++++-
 tools/perf/tests/sample-parsing.c         | 14 ++++-
 tools/perf/util/event.h                   |  1 +
 tools/perf/util/evsel.c                   | 33 +++++++++--
 tools/perf/util/evsel.h                   |  3 +
 tools/perf/util/hist.c                    | 13 ++++-
 tools/perf/util/hist.h                    |  3 +
 tools/perf/util/intel-pt.c                | 23 +++++++-
 tools/perf/util/mem-events.c              | 36 ++++++++++++
 tools/perf/util/mem-events.h              |  5 ++
 tools/perf/util/parse-events.l            |  1 +
 tools/perf/util/perf_event_attr_fprintf.c |  2 +-
 tools/perf/util/record.c                  |  5 +-
 tools/perf/util/session.c                 |  8 ++-
 tools/perf/util/sort.c                    | 83 +++++++++++++++++++++++++++-
 tools/perf/util/sort.h                    |  4 ++
 tools/perf/util/stat-shadow.c             | 92 +++++++++++++++++++++++++++++++
 tools/perf/util/stat.c                    |  4 ++
 tools/perf/util/stat.h                    |  9 +++
 tools/perf/util/synthetic-events.c        |  8 ++-
 29 files changed, 560 insertions(+), 35 deletions(-)
 create mode 100644 tools/perf/arch/x86/util/evsel.c
 create mode 100644 tools/perf/arch/x86/util/mem-events.c

-- 
2.7.4


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH 1/9] tools headers uapi: Update tools's copy of linux/perf_event.h
  2021-02-02 20:09 [PATCH 0/9] perf core PMU support for Sapphire Rapids (User tools) kan.liang
@ 2021-02-02 20:09 ` kan.liang
  2021-02-02 20:09 ` [PATCH 2/9] perf tools: Support the auxiliary event kan.liang
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 34+ messages in thread
From: kan.liang @ 2021-02-02 20:09 UTC (permalink / raw)
  To: acme, mingo, linux-kernel
  Cc: peterz, eranian, namhyung, jolsa, ak, yao.jin, maddy, Kan Liang

From: Kan Liang <kan.liang@linux.intel.com>

To get the changes in:

    commit 2a6c6b7d7ad3 ("perf/core: Add PERF_SAMPLE_WEIGHT_STRUCT")

This cures the following warning during perf's build:

        Warning: Kernel ABI header at 'tools/include/uapi/linux/perf_event.h' differs from latest version at 'include/uapi/linux/perf_event.h'
        diff -u tools/include/uapi/linux/perf_event.h include/uapi/linux/perf_event.h

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
 tools/include/uapi/linux/perf_event.h | 54 ++++++++++++++++++++++++++++++++---
 1 file changed, 50 insertions(+), 4 deletions(-)

diff --git a/tools/include/uapi/linux/perf_event.h b/tools/include/uapi/linux/perf_event.h
index b15e344..7d292de5 100644
--- a/tools/include/uapi/linux/perf_event.h
+++ b/tools/include/uapi/linux/perf_event.h
@@ -145,12 +145,14 @@ enum perf_event_sample_format {
 	PERF_SAMPLE_CGROUP			= 1U << 21,
 	PERF_SAMPLE_DATA_PAGE_SIZE		= 1U << 22,
 	PERF_SAMPLE_CODE_PAGE_SIZE		= 1U << 23,
+	PERF_SAMPLE_WEIGHT_STRUCT		= 1U << 24,
 
-	PERF_SAMPLE_MAX = 1U << 24,		/* non-ABI */
+	PERF_SAMPLE_MAX = 1U << 25,		/* non-ABI */
 
 	__PERF_SAMPLE_CALLCHAIN_EARLY		= 1ULL << 63, /* non-ABI; internal use */
 };
 
+#define PERF_SAMPLE_WEIGHT_TYPE	(PERF_SAMPLE_WEIGHT | PERF_SAMPLE_WEIGHT_STRUCT)
 /*
  * values to program into branch_sample_type when PERF_SAMPLE_BRANCH is set
  *
@@ -890,7 +892,24 @@ enum perf_event_type {
 	 * 	  char			data[size];
 	 * 	  u64			dyn_size; } && PERF_SAMPLE_STACK_USER
 	 *
-	 *	{ u64			weight;   } && PERF_SAMPLE_WEIGHT
+	 *	{ union perf_sample_weight
+	 *	 {
+	 *		u64		full; && PERF_SAMPLE_WEIGHT
+	 *	#if defined(__LITTLE_ENDIAN_BITFIELD)
+	 *		struct {
+	 *			u32	var1_dw;
+	 *			u16	var2_w;
+	 *			u16	var3_w;
+	 *		} && PERF_SAMPLE_WEIGHT_STRUCT
+	 *	#elif defined(__BIG_ENDIAN_BITFIELD)
+	 *		struct {
+	 *			u16	var3_w;
+	 *			u16	var2_w;
+	 *			u32	var1_dw;
+	 *		} && PERF_SAMPLE_WEIGHT_STRUCT
+	 *	#endif
+	 *	 }
+	 *	}
 	 *	{ u64			data_src; } && PERF_SAMPLE_DATA_SRC
 	 *	{ u64			transaction; } && PERF_SAMPLE_TRANSACTION
 	 *	{ u64			abi; # enum perf_sample_regs_abi
@@ -1127,14 +1146,16 @@ union perf_mem_data_src {
 			mem_lvl_num:4,	/* memory hierarchy level number */
 			mem_remote:1,   /* remote */
 			mem_snoopx:2,	/* snoop mode, ext */
-			mem_rsvd:24;
+			mem_blk:3,	/* access blocked */
+			mem_rsvd:21;
 	};
 };
 #elif defined(__BIG_ENDIAN_BITFIELD)
 union perf_mem_data_src {
 	__u64 val;
 	struct {
-		__u64	mem_rsvd:24,
+		__u64	mem_rsvd:21,
+			mem_blk:3,	/* access blocked */
 			mem_snoopx:2,	/* snoop mode, ext */
 			mem_remote:1,   /* remote */
 			mem_lvl_num:4,	/* memory hierarchy level number */
@@ -1217,6 +1238,12 @@ union perf_mem_data_src {
 #define PERF_MEM_TLB_OS		0x40 /* OS fault handler */
 #define PERF_MEM_TLB_SHIFT	26
 
+/* Access blocked */
+#define PERF_MEM_BLK_NA		0x01 /* not available */
+#define PERF_MEM_BLK_DATA	0x02 /* data could not be forwarded */
+#define PERF_MEM_BLK_ADDR	0x04 /* address conflict */
+#define PERF_MEM_BLK_SHIFT	40
+
 #define PERF_MEM_S(a, s) \
 	(((__u64)PERF_MEM_##a##_##s) << PERF_MEM_##a##_SHIFT)
 
@@ -1248,4 +1275,23 @@ struct perf_branch_entry {
 		reserved:40;
 };
 
+union perf_sample_weight {
+	__u64		full;
+#if defined(__LITTLE_ENDIAN_BITFIELD)
+	struct {
+		__u32	var1_dw;
+		__u16	var2_w;
+		__u16	var3_w;
+	};
+#elif defined(__BIG_ENDIAN_BITFIELD)
+	struct {
+		__u16	var3_w;
+		__u16	var2_w;
+		__u32	var1_dw;
+	};
+#else
+#error "Unknown endianness"
+#endif
+};
+
 #endif /* _UAPI_LINUX_PERF_EVENT_H */
-- 
2.7.4



* [PATCH 2/9] perf tools: Support the auxiliary event
  2021-02-02 20:09 [PATCH 0/9] perf core PMU support for Sapphire Rapids (User tools) kan.liang
  2021-02-02 20:09 ` [PATCH 1/9] tools headers uapi: Update tools's copy of linux/perf_event.h kan.liang
@ 2021-02-02 20:09 ` kan.liang
  2021-02-03 20:02   ` Arnaldo Carvalho de Melo
  2021-02-05 10:52   ` Namhyung Kim
  2021-02-02 20:09 ` [PATCH 3/9] perf tools: Support data block and addr block kan.liang
                   ` (6 subsequent siblings)
  8 siblings, 2 replies; 34+ messages in thread
From: kan.liang @ 2021-02-02 20:09 UTC (permalink / raw)
  To: acme, mingo, linux-kernel
  Cc: peterz, eranian, namhyung, jolsa, ak, yao.jin, maddy, Kan Liang

From: Kan Liang <kan.liang@linux.intel.com>

On the Intel Sapphire Rapids server, an auxiliary event has to be
enabled simultaneously with the load latency event to retrieve complete
Memory Info.

Add an X86-specific perf_mem_events__name() to handle the auxiliary event.
- Users are only interested in the samples of the mem-loads event, so
  the auxiliary event is read via sample read.
- The auxiliary event must be in front of the load latency event in a
  group. If the auxiliary event is the leader, assume the second event
  is the one to sample.
- Add a weak is_mem_loads_aux_event() to check the auxiliary event for
  X86. For other ARCHs, it always returns false.

Parse the unique event name, mem-loads-aux, for the auxiliary event.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
 tools/perf/arch/x86/util/Build        |  1 +
 tools/perf/arch/x86/util/mem-events.c | 44 +++++++++++++++++++++++++++++++++++
 tools/perf/util/evsel.c               |  3 +++
 tools/perf/util/mem-events.c          |  5 ++++
 tools/perf/util/mem-events.h          |  2 ++
 tools/perf/util/parse-events.l        |  1 +
 tools/perf/util/record.c              |  5 +++-
 7 files changed, 60 insertions(+), 1 deletion(-)
 create mode 100644 tools/perf/arch/x86/util/mem-events.c

diff --git a/tools/perf/arch/x86/util/Build b/tools/perf/arch/x86/util/Build
index 347c39b..d73f548 100644
--- a/tools/perf/arch/x86/util/Build
+++ b/tools/perf/arch/x86/util/Build
@@ -6,6 +6,7 @@ perf-y += perf_regs.o
 perf-y += topdown.o
 perf-y += machine.o
 perf-y += event.o
+perf-y += mem-events.o
 
 perf-$(CONFIG_DWARF) += dwarf-regs.o
 perf-$(CONFIG_BPF_PROLOGUE) += dwarf-regs.o
diff --git a/tools/perf/arch/x86/util/mem-events.c b/tools/perf/arch/x86/util/mem-events.c
new file mode 100644
index 0000000..11b8469
--- /dev/null
+++ b/tools/perf/arch/x86/util/mem-events.c
@@ -0,0 +1,44 @@
+// SPDX-License-Identifier: GPL-2.0
+#include "util/pmu.h"
+#include "map_symbol.h"
+#include "mem-events.h"
+
+static char mem_loads_name[100];
+static bool mem_loads_name__init;
+
+#define MEM_LOADS_AUX		0x8203
+#define MEM_LOADS_AUX_NAME	"{cpu/mem-loads-aux/,cpu/mem-loads,ldlat=%u/pp}:S"
+
+bool is_mem_loads_aux_event(struct evsel *leader)
+{
+	if (!pmu_have_event("cpu", "mem-loads-aux"))
+		return false;
+
+	return leader->core.attr.config == MEM_LOADS_AUX;
+}
+
+char *perf_mem_events__name(int i)
+{
+	struct perf_mem_event *e = perf_mem_events__ptr(i);
+
+	if (!e)
+		return NULL;
+
+	if (i == PERF_MEM_EVENTS__LOAD) {
+		if (mem_loads_name__init)
+			return mem_loads_name;
+
+		mem_loads_name__init = true;
+
+		if (pmu_have_event("cpu", "mem-loads-aux")) {
+			scnprintf(mem_loads_name, sizeof(MEM_LOADS_AUX_NAME),
+				  MEM_LOADS_AUX_NAME, perf_mem_events__loads_ldlat);
+		} else {
+			scnprintf(mem_loads_name, sizeof(mem_loads_name),
+				  e->name, perf_mem_events__loads_ldlat);
+		}
+		return mem_loads_name;
+	}
+
+	return (char *)e->name;
+}
diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index c26ea822..c48f6de 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -2689,6 +2689,9 @@ int evsel__open_strerror(struct evsel *evsel, struct target *target,
 		if (perf_missing_features.aux_output)
 			return scnprintf(msg, size, "The 'aux_output' feature is not supported, update the kernel.");
 		break;
+	case ENODATA:
+		return scnprintf(msg, size, "Cannot collect data source with the load latency event alone. "
+				 "Please add an auxiliary event in front of the load latency event.");
 	default:
 		break;
 	}
diff --git a/tools/perf/util/mem-events.c b/tools/perf/util/mem-events.c
index 19007e4..3edfb88 100644
--- a/tools/perf/util/mem-events.c
+++ b/tools/perf/util/mem-events.c
@@ -56,6 +56,11 @@ char * __weak perf_mem_events__name(int i)
 	return (char *)e->name;
 }
 
+__weak bool is_mem_loads_aux_event(struct evsel *leader __maybe_unused)
+{
+	return false;
+}
+
 int perf_mem_events__parse(const char *str)
 {
 	char *tok, *saveptr = NULL;
diff --git a/tools/perf/util/mem-events.h b/tools/perf/util/mem-events.h
index 5ef1782..045a507 100644
--- a/tools/perf/util/mem-events.h
+++ b/tools/perf/util/mem-events.h
@@ -9,6 +9,7 @@
 #include <linux/refcount.h>
 #include <linux/perf_event.h>
 #include "stat.h"
+#include "evsel.h"
 
 struct perf_mem_event {
 	bool		record;
@@ -39,6 +40,7 @@ int perf_mem_events__init(void);
 
 char *perf_mem_events__name(int i);
 struct perf_mem_event *perf_mem_events__ptr(int i);
+bool is_mem_loads_aux_event(struct evsel *leader);
 
 void perf_mem_events__list(void);
 
diff --git a/tools/perf/util/parse-events.l b/tools/perf/util/parse-events.l
index 9db5097..0b36285 100644
--- a/tools/perf/util/parse-events.l
+++ b/tools/perf/util/parse-events.l
@@ -356,6 +356,7 @@ bpf-output					{ return sym(yyscanner, PERF_TYPE_SOFTWARE, PERF_COUNT_SW_BPF_OUT
 cycles-ct				|
 cycles-t				|
 mem-loads				|
+mem-loads-aux				|
 mem-stores				|
 topdown-[a-z-]+				|
 tx-capacity-[a-z-]+			|
diff --git a/tools/perf/util/record.c b/tools/perf/util/record.c
index e70c9dd..d0735fb 100644
--- a/tools/perf/util/record.c
+++ b/tools/perf/util/record.c
@@ -15,6 +15,8 @@
 #include "record.h"
 #include "../perf-sys.h"
 #include "topdown.h"
+#include "map_symbol.h"
+#include "mem-events.h"
 
 /*
  * evsel__config_leader_sampling() uses special rules for leader sampling.
@@ -25,7 +27,8 @@ static struct evsel *evsel__read_sampler(struct evsel *evsel, struct evlist *evl
 {
 	struct evsel *leader = evsel->leader;
 
-	if (evsel__is_aux_event(leader) || arch_topdown_sample_read(leader)) {
+	if (evsel__is_aux_event(leader) || arch_topdown_sample_read(leader) ||
+	    is_mem_loads_aux_event(leader)) {
 		evlist__for_each_entry(evlist, evsel) {
 			if (evsel->leader == leader && evsel != evsel->leader)
 				return evsel;
-- 
2.7.4



* [PATCH 3/9] perf tools: Support data block and addr block
  2021-02-02 20:09 [PATCH 0/9] perf core PMU support for Sapphire Rapids (User tools) kan.liang
  2021-02-02 20:09 ` [PATCH 1/9] tools headers uapi: Update tools's copy of linux/perf_event.h kan.liang
  2021-02-02 20:09 ` [PATCH 2/9] perf tools: Support the auxiliary event kan.liang
@ 2021-02-02 20:09 ` kan.liang
  2021-02-05 11:02   ` Namhyung Kim
  2021-02-02 20:09 ` [PATCH 4/9] perf c2c: " kan.liang
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 34+ messages in thread
From: kan.liang @ 2021-02-02 20:09 UTC (permalink / raw)
  To: acme, mingo, linux-kernel
  Cc: peterz, eranian, namhyung, jolsa, ak, yao.jin, maddy, Kan Liang

From: Kan Liang <kan.liang@linux.intel.com>

Two new data source fields, which indicate the block reasons of a load
instruction, are introduced on the Intel Sapphire Rapids server. The
fields can be used for memory profiling.

Add a new sort function, SORT_MEM_BLOCKED, for the two fields.

On previous platforms, or when the block reason is unknown, print "N/A"
for the block reason.

Add blocked as a default mem sort key for perf report and
perf mem report.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
 tools/perf/Documentation/perf-report.txt |  5 +++--
 tools/perf/builtin-mem.c                 |  2 +-
 tools/perf/util/hist.c                   |  1 +
 tools/perf/util/hist.h                   |  1 +
 tools/perf/util/mem-events.c             | 25 +++++++++++++++++++++
 tools/perf/util/mem-events.h             |  1 +
 tools/perf/util/sort.c                   | 38 +++++++++++++++++++++++++++++++-
 tools/perf/util/sort.h                   |  1 +
 8 files changed, 70 insertions(+), 4 deletions(-)

diff --git a/tools/perf/Documentation/perf-report.txt b/tools/perf/Documentation/perf-report.txt
index 8f7f4e9..826b5a9 100644
--- a/tools/perf/Documentation/perf-report.txt
+++ b/tools/perf/Documentation/perf-report.txt
@@ -139,7 +139,7 @@ OPTIONS
 
 	If the --mem-mode option is used, the following sort keys are also available
 	(incompatible with --branch-stack):
-	symbol_daddr, dso_daddr, locked, tlb, mem, snoop, dcacheline.
+	symbol_daddr, dso_daddr, locked, tlb, mem, snoop, dcacheline, blocked.
 
 	- symbol_daddr: name of data symbol being executed on at the time of sample
 	- dso_daddr: name of library or module containing the data being executed
@@ -151,9 +151,10 @@ OPTIONS
 	- dcacheline: the cacheline the data address is on at the time of the sample
 	- phys_daddr: physical address of data being executed on at the time of sample
 	- data_page_size: the data page size of data being executed on at the time of sample
+	- blocked: reason of blocked load access for the data at the time of the sample
 
 	And the default sort keys are changed to local_weight, mem, sym, dso,
-	symbol_daddr, dso_daddr, snoop, tlb, locked, see '--mem-mode'.
+	symbol_daddr, dso_daddr, snoop, tlb, locked, blocked, see '--mem-mode'.
 
 	If the data file has tracepoint event(s), following (dynamic) sort keys
 	are also available:
diff --git a/tools/perf/builtin-mem.c b/tools/perf/builtin-mem.c
index 8237420..e5778aa 100644
--- a/tools/perf/builtin-mem.c
+++ b/tools/perf/builtin-mem.c
@@ -312,7 +312,7 @@ static char *get_sort_order(struct perf_mem *mem)
 			     "dso_daddr,tlb,locked");
 	} else if (has_extra_options) {
 		strcpy(sort, "--sort=local_weight,mem,sym,dso,symbol_daddr,"
-			     "dso_daddr,snoop,tlb,locked");
+			     "dso_daddr,snoop,tlb,locked,blocked");
 	} else
 		return NULL;
 
diff --git a/tools/perf/util/hist.c b/tools/perf/util/hist.c
index a08fb9e..6866ab0 100644
--- a/tools/perf/util/hist.c
+++ b/tools/perf/util/hist.c
@@ -208,6 +208,7 @@ void hists__calc_col_len(struct hists *hists, struct hist_entry *h)
 	hists__new_col_len(hists, HISTC_MEM_LVL, 21 + 3);
 	hists__new_col_len(hists, HISTC_LOCAL_WEIGHT, 12);
 	hists__new_col_len(hists, HISTC_GLOBAL_WEIGHT, 12);
+	hists__new_col_len(hists, HISTC_MEM_BLOCKED, 10);
 	if (symbol_conf.nanosecs)
 		hists__new_col_len(hists, HISTC_TIME, 16);
 	else
diff --git a/tools/perf/util/hist.h b/tools/perf/util/hist.h
index 14f6633..522486b 100644
--- a/tools/perf/util/hist.h
+++ b/tools/perf/util/hist.h
@@ -71,6 +71,7 @@ enum hist_column {
 	HISTC_SYM_SIZE,
 	HISTC_DSO_SIZE,
 	HISTC_SYMBOL_IPC,
+	HISTC_MEM_BLOCKED,
 	HISTC_NR_COLS, /* Last entry */
 };
 
diff --git a/tools/perf/util/mem-events.c b/tools/perf/util/mem-events.c
index 3edfb88..890f638 100644
--- a/tools/perf/util/mem-events.c
+++ b/tools/perf/util/mem-events.c
@@ -337,6 +337,29 @@ int perf_mem__lck_scnprintf(char *out, size_t sz, struct mem_info *mem_info)
 	return l;
 }
 
+int perf_mem__blk_scnprintf(char *out, size_t sz, struct mem_info *mem_info)
+{
+	size_t l = 0;
+	u64 mask = PERF_MEM_BLK_NA;
+
+	sz -= 1; /* -1 for null termination */
+	out[0] = '\0';
+
+	if (mem_info)
+		mask = mem_info->data_src.mem_blk;
+
+	if (!mask || (mask & PERF_MEM_BLK_NA)) {
+		l += scnprintf(out + l, sz - l, " N/A");
+		return l;
+	}
+	if (mask & PERF_MEM_BLK_DATA)
+		l += scnprintf(out + l, sz - l, " Data");
+	if (mask & PERF_MEM_BLK_ADDR)
+		l += scnprintf(out + l, sz - l, " Addr");
+
+	return l;
+}
+
 int perf_script__meminfo_scnprintf(char *out, size_t sz, struct mem_info *mem_info)
 {
 	int i = 0;
@@ -348,6 +371,8 @@ int perf_script__meminfo_scnprintf(char *out, size_t sz, struct mem_info *mem_in
 	i += perf_mem__tlb_scnprintf(out + i, sz - i, mem_info);
 	i += scnprintf(out + i, sz - i, "|LCK ");
 	i += perf_mem__lck_scnprintf(out + i, sz - i, mem_info);
+	i += scnprintf(out + i, sz - i, "|BLK ");
+	i += perf_mem__blk_scnprintf(out + i, sz - i, mem_info);
 
 	return i;
 }
diff --git a/tools/perf/util/mem-events.h b/tools/perf/util/mem-events.h
index 045a507..5ddf447 100644
--- a/tools/perf/util/mem-events.h
+++ b/tools/perf/util/mem-events.h
@@ -49,6 +49,7 @@ int perf_mem__tlb_scnprintf(char *out, size_t sz, struct mem_info *mem_info);
 int perf_mem__lvl_scnprintf(char *out, size_t sz, struct mem_info *mem_info);
 int perf_mem__snp_scnprintf(char *out, size_t sz, struct mem_info *mem_info);
 int perf_mem__lck_scnprintf(char *out, size_t sz, struct mem_info *mem_info);
+int perf_mem__blk_scnprintf(char *out, size_t sz, struct mem_info *mem_info);
 
 int perf_script__meminfo_scnprintf(char *bf, size_t size, struct mem_info *mem_info);
 
diff --git a/tools/perf/util/sort.c b/tools/perf/util/sort.c
index 80907bc..249a03c 100644
--- a/tools/perf/util/sort.c
+++ b/tools/perf/util/sort.c
@@ -36,7 +36,7 @@ const char	default_parent_pattern[] = "^sys_|^do_page_fault";
 const char	*parent_pattern = default_parent_pattern;
 const char	*default_sort_order = "comm,dso,symbol";
 const char	default_branch_sort_order[] = "comm,dso_from,symbol_from,symbol_to,cycles";
-const char	default_mem_sort_order[] = "local_weight,mem,sym,dso,symbol_daddr,dso_daddr,snoop,tlb,locked";
+const char	default_mem_sort_order[] = "local_weight,mem,sym,dso,symbol_daddr,dso_daddr,snoop,tlb,locked,blocked";
 const char	default_top_sort_order[] = "dso,symbol";
 const char	default_diff_sort_order[] = "dso,symbol";
 const char	default_tracepoint_sort_order[] = "trace";
@@ -1422,6 +1422,41 @@ struct sort_entry sort_mem_dcacheline = {
 };
 
 static int64_t
+sort__blocked_cmp(struct hist_entry *left, struct hist_entry *right)
+{
+	union perf_mem_data_src data_src_l;
+	union perf_mem_data_src data_src_r;
+
+	if (left->mem_info)
+		data_src_l = left->mem_info->data_src;
+	else
+		data_src_l.mem_blk = PERF_MEM_BLK_NA;
+
+	if (right->mem_info)
+		data_src_r = right->mem_info->data_src;
+	else
+		data_src_r.mem_blk = PERF_MEM_BLK_NA;
+
+	return (int64_t)(data_src_r.mem_blk - data_src_l.mem_blk);
+}
+
+static int hist_entry__blocked_snprintf(struct hist_entry *he, char *bf,
+					size_t size, unsigned int width)
+{
+	char out[16];
+
+	perf_mem__blk_scnprintf(out, sizeof(out), he->mem_info);
+	return repsep_snprintf(bf, size, "%.*s", width, out);
+}
+
+struct sort_entry sort_mem_blocked = {
+	.se_header	= "Blocked",
+	.se_cmp		= sort__blocked_cmp,
+	.se_snprintf	= hist_entry__blocked_snprintf,
+	.se_width_idx	= HISTC_MEM_BLOCKED,
+};
+
+static int64_t
 sort__phys_daddr_cmp(struct hist_entry *left, struct hist_entry *right)
 {
 	uint64_t l = 0, r = 0;
@@ -1770,6 +1805,7 @@ static struct sort_dimension memory_sort_dimensions[] = {
 	DIM(SORT_MEM_DCACHELINE, "dcacheline", sort_mem_dcacheline),
 	DIM(SORT_MEM_PHYS_DADDR, "phys_daddr", sort_mem_phys_daddr),
 	DIM(SORT_MEM_DATA_PAGE_SIZE, "data_page_size", sort_mem_data_page_size),
+	DIM(SORT_MEM_BLOCKED, "blocked", sort_mem_blocked),
 };
 
 #undef DIM
diff --git a/tools/perf/util/sort.h b/tools/perf/util/sort.h
index e50f2b6..2b2645b 100644
--- a/tools/perf/util/sort.h
+++ b/tools/perf/util/sort.h
@@ -256,6 +256,7 @@ enum sort_type {
 	SORT_MEM_IADDR_SYMBOL,
 	SORT_MEM_PHYS_DADDR,
 	SORT_MEM_DATA_PAGE_SIZE,
+	SORT_MEM_BLOCKED,
 };
 
 /*
-- 
2.7.4



* [PATCH 4/9] perf c2c: Support data block and addr block
  2021-02-02 20:09 [PATCH 0/9] perf core PMU support for Sapphire Rapids (User tools) kan.liang
                   ` (2 preceding siblings ...)
  2021-02-02 20:09 ` [PATCH 3/9] perf tools: Support data block and addr block kan.liang
@ 2021-02-02 20:09 ` kan.liang
  2021-02-03 20:21   ` Arnaldo Carvalho de Melo
  2021-02-02 20:09 ` [PATCH 5/9] perf tools: Support PERF_SAMPLE_WEIGHT_STRUCT kan.liang
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 34+ messages in thread
From: kan.liang @ 2021-02-02 20:09 UTC (permalink / raw)
  To: acme, mingo, linux-kernel
  Cc: peterz, eranian, namhyung, jolsa, ak, yao.jin, maddy, Kan Liang

From: Kan Liang <kan.liang@linux.intel.com>

perf c2c is also a memory profiling tool. Apply the two new data source
fields to perf c2c as well.

Extend perf c2c to display the number of loads which were blocked by
data or an address conflict.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
 tools/perf/builtin-c2c.c     | 3 +++
 tools/perf/util/mem-events.c | 6 ++++++
 tools/perf/util/mem-events.h | 2 ++
 3 files changed, 11 insertions(+)

diff --git a/tools/perf/builtin-c2c.c b/tools/perf/builtin-c2c.c
index c5babea..4de49ae 100644
--- a/tools/perf/builtin-c2c.c
+++ b/tools/perf/builtin-c2c.c
@@ -2111,6 +2111,8 @@ static void print_c2c__display_stats(FILE *out)
 	fprintf(out, "  Load MESI State Exclusive         : %10d\n", stats->ld_excl);
 	fprintf(out, "  Load MESI State Shared            : %10d\n", stats->ld_shared);
 	fprintf(out, "  Load LLC Misses                   : %10d\n", llc_misses);
+	fprintf(out, "  Load access blocked by data       : %10d\n", stats->blk_data);
+	fprintf(out, "  Load access blocked by address    : %10d\n", stats->blk_addr);
 	fprintf(out, "  LLC Misses to Local DRAM          : %10.1f%%\n", ((double)stats->lcl_dram/(double)llc_misses) * 100.);
 	fprintf(out, "  LLC Misses to Remote DRAM         : %10.1f%%\n", ((double)stats->rmt_dram/(double)llc_misses) * 100.);
 	fprintf(out, "  LLC Misses to Remote cache (HIT)  : %10.1f%%\n", ((double)stats->rmt_hit /(double)llc_misses) * 100.);
@@ -2139,6 +2141,7 @@ static void print_shared_cacheline_info(FILE *out)
 	fprintf(out, "  L2D hits on shared lines          : %10d\n", stats->ld_l2hit);
 	fprintf(out, "  LLC hits on shared lines          : %10d\n", stats->ld_llchit + stats->lcl_hitm);
 	fprintf(out, "  Locked Access on shared lines     : %10d\n", stats->locks);
+	fprintf(out, "  Blocked Access on shared lines    : %10d\n", stats->blk_data + stats->blk_addr);
 	fprintf(out, "  Store HITs on shared lines        : %10d\n", stats->store);
 	fprintf(out, "  Store L1D hits on shared lines    : %10d\n", stats->st_l1hit);
 	fprintf(out, "  Total Merged records              : %10d\n", hitm_cnt + stats->store);
diff --git a/tools/perf/util/mem-events.c b/tools/perf/util/mem-events.c
index 890f638..f93a852 100644
--- a/tools/perf/util/mem-events.c
+++ b/tools/perf/util/mem-events.c
@@ -385,6 +385,7 @@ int c2c_decode_stats(struct c2c_stats *stats, struct mem_info *mi)
 	u64 lvl    = data_src->mem_lvl;
 	u64 snoop  = data_src->mem_snoop;
 	u64 lock   = data_src->mem_lock;
+	u64 blk    = data_src->mem_blk;
 	/*
 	 * Skylake might report unknown remote level via this
 	 * bit, consider it when evaluating remote HITMs.
@@ -404,6 +405,9 @@ do {				\
 
 	if (lock & P(LOCK, LOCKED)) stats->locks++;
 
+	if (blk & P(BLK, DATA)) stats->blk_data++;
+	if (blk & P(BLK, ADDR)) stats->blk_addr++;
+
 	if (op & P(OP, LOAD)) {
 		/* load */
 		stats->load++;
@@ -515,6 +519,8 @@ void c2c_add_stats(struct c2c_stats *stats, struct c2c_stats *add)
 	stats->rmt_hit		+= add->rmt_hit;
 	stats->lcl_dram		+= add->lcl_dram;
 	stats->rmt_dram		+= add->rmt_dram;
+	stats->blk_data		+= add->blk_data;
+	stats->blk_addr		+= add->blk_addr;
 	stats->nomap		+= add->nomap;
 	stats->noparse		+= add->noparse;
 }
diff --git a/tools/perf/util/mem-events.h b/tools/perf/util/mem-events.h
index 5ddf447..755cef7 100644
--- a/tools/perf/util/mem-events.h
+++ b/tools/perf/util/mem-events.h
@@ -79,6 +79,8 @@ struct c2c_stats {
 	u32	rmt_hit;             /* count of loads with remote hit clean; */
 	u32	lcl_dram;            /* count of loads miss to local DRAM */
 	u32	rmt_dram;            /* count of loads miss to remote DRAM */
+	u32	blk_data;            /* count of loads blocked by data */
+	u32	blk_addr;            /* count of loads blocked by address conflict */
 	u32	nomap;               /* count of load/stores with no phys adrs */
 	u32	noparse;             /* count of unparsable data sources */
 };
-- 
2.7.4



* [PATCH 5/9] perf tools: Support PERF_SAMPLE_WEIGHT_STRUCT
  2021-02-02 20:09 [PATCH 0/9] perf core PMU support for Sapphire Rapids (User tools) kan.liang
                   ` (3 preceding siblings ...)
  2021-02-02 20:09 ` [PATCH 4/9] perf c2c: " kan.liang
@ 2021-02-02 20:09 ` kan.liang
  2021-02-03 20:31   ` Arnaldo Carvalho de Melo
  2021-02-02 20:09 ` [PATCH 6/9] perf report: Support instruction latency kan.liang
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 34+ messages in thread
From: kan.liang @ 2021-02-02 20:09 UTC (permalink / raw)
  To: acme, mingo, linux-kernel
  Cc: peterz, eranian, namhyung, jolsa, ak, yao.jin, maddy, Kan Liang

From: Kan Liang <kan.liang@linux.intel.com>

The new sample type, PERF_SAMPLE_WEIGHT_STRUCT, is an alternative to the
PERF_SAMPLE_WEIGHT sample type. Users can apply either the
PERF_SAMPLE_WEIGHT sample type or the PERF_SAMPLE_WEIGHT_STRUCT sample
type to retrieve the sample weight, but they cannot apply both sample
types simultaneously. The new sample type shares the same space as the
PERF_SAMPLE_WEIGHT sample type. The lower 32 bits are exactly the same
for both sample types. The higher 32 bits may differ between
architectures.

Add an arch-specific arch_evsel__set_sample_weight() to set the new
sample type for X86. Only store the lower 32 bits for sample->weight if
the new sample type is applied. In practice, no memory access could last
longer than 4G cycles, so no data will be lost.

If the kernel doesn't support the new sample type, fall back to the
PERF_SAMPLE_WEIGHT sample type.

There is no impact for other architectures.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
 tools/perf/arch/x86/util/Build            |  1 +
 tools/perf/arch/x86/util/evsel.c          |  8 ++++++++
 tools/perf/util/evsel.c                   | 28 ++++++++++++++++++++++++----
 tools/perf/util/evsel.h                   |  3 +++
 tools/perf/util/intel-pt.c                | 22 +++++++++++++++++++---
 tools/perf/util/perf_event_attr_fprintf.c |  2 +-
 tools/perf/util/session.c                 |  2 +-
 tools/perf/util/synthetic-events.c        |  6 ++++--
 8 files changed, 61 insertions(+), 11 deletions(-)
 create mode 100644 tools/perf/arch/x86/util/evsel.c

diff --git a/tools/perf/arch/x86/util/Build b/tools/perf/arch/x86/util/Build
index d73f548..18848b3 100644
--- a/tools/perf/arch/x86/util/Build
+++ b/tools/perf/arch/x86/util/Build
@@ -7,6 +7,7 @@ perf-y += topdown.o
 perf-y += machine.o
 perf-y += event.o
 perf-y += mem-events.o
+perf-y += evsel.o
 
 perf-$(CONFIG_DWARF) += dwarf-regs.o
 perf-$(CONFIG_BPF_PROLOGUE) += dwarf-regs.o
diff --git a/tools/perf/arch/x86/util/evsel.c b/tools/perf/arch/x86/util/evsel.c
new file mode 100644
index 0000000..2f733cd
--- /dev/null
+++ b/tools/perf/arch/x86/util/evsel.c
@@ -0,0 +1,8 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <stdio.h>
+#include "util/evsel.h"
+
+void arch_evsel__set_sample_weight(struct evsel *evsel)
+{
+	evsel__set_sample_bit(evsel, WEIGHT_STRUCT);
+}
diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index c48f6de..0a2a307 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -1012,6 +1012,11 @@ struct evsel_config_term *__evsel__get_config_term(struct evsel *evsel, enum evs
 	return found_term;
 }
 
+void __weak arch_evsel__set_sample_weight(struct evsel *evsel)
+{
+	evsel__set_sample_bit(evsel, WEIGHT);
+}
+
 /*
  * The enable_on_exec/disabled value strategy:
  *
@@ -1166,7 +1171,7 @@ void evsel__config(struct evsel *evsel, struct record_opts *opts,
 	}
 
 	if (opts->sample_weight)
-		evsel__set_sample_bit(evsel, WEIGHT);
+		arch_evsel__set_sample_weight(evsel);
 
 	attr->task  = track;
 	attr->mmap  = track;
@@ -1735,6 +1740,10 @@ static int evsel__open_cpu(struct evsel *evsel, struct perf_cpu_map *cpus,
 	}
 
 fallback_missing_features:
+	if (perf_missing_features.weight_struct) {
+		evsel__set_sample_bit(evsel, WEIGHT);
+		evsel__reset_sample_bit(evsel, WEIGHT_STRUCT);
+	}
 	if (perf_missing_features.clockid_wrong)
 		evsel->core.attr.clockid = CLOCK_MONOTONIC; /* should always work */
 	if (perf_missing_features.clockid) {
@@ -1873,7 +1882,12 @@ static int evsel__open_cpu(struct evsel *evsel, struct perf_cpu_map *cpus,
 	 * Must probe features in the order they were added to the
 	 * perf_event_attr interface.
 	 */
-        if (!perf_missing_features.data_page_size &&
+	if (!perf_missing_features.weight_struct &&
+	    (evsel->core.attr.sample_type & PERF_SAMPLE_WEIGHT_STRUCT)) {
+		perf_missing_features.weight_struct = true;
+		pr_debug2("switching off weight struct support\n");
+		goto fallback_missing_features;
+	} else if (!perf_missing_features.data_page_size &&
 	    (evsel->core.attr.sample_type & PERF_SAMPLE_DATA_PAGE_SIZE)) {
 		perf_missing_features.data_page_size = true;
 		pr_debug2_peo("Kernel has no PERF_SAMPLE_DATA_PAGE_SIZE support, bailing out\n");
@@ -2316,9 +2330,15 @@ int evsel__parse_sample(struct evsel *evsel, union perf_event *event,
 		}
 	}
 
-	if (type & PERF_SAMPLE_WEIGHT) {
+	if (type & PERF_SAMPLE_WEIGHT_TYPE) {
+		union perf_sample_weight weight;
+
 		OVERFLOW_CHECK_u64(array);
-		data->weight = *array;
+		weight.full = *array;
+		if (type & PERF_SAMPLE_WEIGHT)
+			data->weight = weight.full;
+		else
+			data->weight = weight.var1_dw;
 		array++;
 	}
 
diff --git a/tools/perf/util/evsel.h b/tools/perf/util/evsel.h
index cd1d8dd..5c161a2 100644
--- a/tools/perf/util/evsel.h
+++ b/tools/perf/util/evsel.h
@@ -145,6 +145,7 @@ struct perf_missing_features {
 	bool branch_hw_idx;
 	bool cgroup;
 	bool data_page_size;
+	bool weight_struct;
 };
 
 extern struct perf_missing_features perf_missing_features;
@@ -239,6 +240,8 @@ void __evsel__reset_sample_bit(struct evsel *evsel, enum perf_event_sample_forma
 
 void evsel__set_sample_id(struct evsel *evsel, bool use_sample_identifier);
 
+void arch_evsel__set_sample_weight(struct evsel *evsel);
+
 int evsel__set_filter(struct evsel *evsel, const char *filter);
 int evsel__append_tp_filter(struct evsel *evsel, const char *filter);
 int evsel__append_addr_filter(struct evsel *evsel, const char *filter);
diff --git a/tools/perf/util/intel-pt.c b/tools/perf/util/intel-pt.c
index 60214de..a929f6d 100644
--- a/tools/perf/util/intel-pt.c
+++ b/tools/perf/util/intel-pt.c
@@ -1853,13 +1853,29 @@ static int intel_pt_synth_pebs_sample(struct intel_pt_queue *ptq)
 	if (sample_type & PERF_SAMPLE_ADDR && items->has_mem_access_address)
 		sample.addr = items->mem_access_address;
 
-	if (sample_type & PERF_SAMPLE_WEIGHT) {
+	if (sample_type & PERF_SAMPLE_WEIGHT_TYPE) {
 		/*
 		 * Refer kernel's setup_pebs_adaptive_sample_data() and
 		 * intel_hsw_weight().
 		 */
-		if (items->has_mem_access_latency)
-			sample.weight = items->mem_access_latency;
+		if (items->has_mem_access_latency) {
+			u64 weight = items->mem_access_latency >> 32;
+
+			/*
+			 * Starting from SPR, the mem access latency field
+			 * contains both cache latency [47:32] and instruction
+			 * latency [15:0]. The cache latency is the same as the
+			 * mem access latency on previous platforms.
+			 *
+			 * In practice, no memory access could last longer than 4G
+			 * cycles. Use latency >> 32 to distinguish the
+			 * different format of the mem access latency field.
+			 */
+			if (weight > 0)
+				sample.weight = weight & 0xffff;
+			else
+				sample.weight = items->mem_access_latency;
+		}
 		if (!sample.weight && items->has_tsx_aux_info) {
 			/* Cycles last block */
 			sample.weight = (u32)items->tsx_aux_info;
diff --git a/tools/perf/util/perf_event_attr_fprintf.c b/tools/perf/util/perf_event_attr_fprintf.c
index fb0bb66..e972e63 100644
--- a/tools/perf/util/perf_event_attr_fprintf.c
+++ b/tools/perf/util/perf_event_attr_fprintf.c
@@ -35,7 +35,7 @@ static void __p_sample_type(char *buf, size_t size, u64 value)
 		bit_name(BRANCH_STACK), bit_name(REGS_USER), bit_name(STACK_USER),
 		bit_name(IDENTIFIER), bit_name(REGS_INTR), bit_name(DATA_SRC),
 		bit_name(WEIGHT), bit_name(PHYS_ADDR), bit_name(AUX),
-		bit_name(CGROUP), bit_name(DATA_PAGE_SIZE),
+		bit_name(CGROUP), bit_name(DATA_PAGE_SIZE), bit_name(WEIGHT_STRUCT),
 		{ .name = NULL, }
 	};
 #undef bit_name
diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index 25adbcc..a35d8c2 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -1297,7 +1297,7 @@ static void dump_sample(struct evsel *evsel, union perf_event *event,
 	if (sample_type & PERF_SAMPLE_STACK_USER)
 		stack_user__printf(&sample->user_stack);
 
-	if (sample_type & PERF_SAMPLE_WEIGHT)
+	if (sample_type & PERF_SAMPLE_WEIGHT_TYPE)
 		printf("... weight: %" PRIu64 "\n", sample->weight);
 
 	if (sample_type & PERF_SAMPLE_DATA_SRC)
diff --git a/tools/perf/util/synthetic-events.c b/tools/perf/util/synthetic-events.c
index 2947e3f..bc16268 100644
--- a/tools/perf/util/synthetic-events.c
+++ b/tools/perf/util/synthetic-events.c
@@ -1384,7 +1384,7 @@ size_t perf_event__sample_event_size(const struct perf_sample *sample, u64 type,
 		}
 	}
 
-	if (type & PERF_SAMPLE_WEIGHT)
+	if (type & PERF_SAMPLE_WEIGHT_TYPE)
 		result += sizeof(u64);
 
 	if (type & PERF_SAMPLE_DATA_SRC)
@@ -1555,8 +1555,10 @@ int perf_event__synthesize_sample(union perf_event *event, u64 type, u64 read_fo
 		}
 	}
 
-	if (type & PERF_SAMPLE_WEIGHT) {
+	if (type & PERF_SAMPLE_WEIGHT_TYPE) {
 		*array = sample->weight;
+		if (type & PERF_SAMPLE_WEIGHT_STRUCT)
+			*array &= 0xffffffff;
 		array++;
 	}
 
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 6/9] perf report: Support instruction latency
  2021-02-02 20:09 [PATCH 0/9] perf core PMU support for Sapphire Rapids (User tools) kan.liang
                   ` (4 preceding siblings ...)
  2021-02-02 20:09 ` [PATCH 5/9] perf tools: Support PERF_SAMPLE_WEIGHT_STRUCT kan.liang
@ 2021-02-02 20:09 ` kan.liang
  2021-02-03 20:43   ` Arnaldo Carvalho de Melo
                     ` (2 more replies)
  2021-02-02 20:09 ` [PATCH 7/9] perf test: Support PERF_SAMPLE_WEIGHT_STRUCT kan.liang
                   ` (2 subsequent siblings)
  8 siblings, 3 replies; 34+ messages in thread
From: kan.liang @ 2021-02-02 20:09 UTC (permalink / raw)
  To: acme, mingo, linux-kernel
  Cc: peterz, eranian, namhyung, jolsa, ak, yao.jin, maddy, Kan Liang

From: Kan Liang <kan.liang@linux.intel.com>

The instruction latency information can be recorded on some platforms,
e.g., the Intel Sapphire Rapids server. With both the memory latency
(weight) and the new instruction latency information, users can easily
locate expensive load instructions, understand the time spent in the
different pipeline stages, and optimize their applications accordingly.

The 'weight' field is shared among different architectures. Reusing the
'weight' field may impact other architectures. Add a new field to store
the instruction latency.

Like the 'weight' support, introduce an 'ins_lat' field for the global
instruction latency and a 'local_ins_lat' field for the local version of
the instruction latency.

Add new sort functions, INSTR Latency and Local INSTR Latency,
accordingly.

Add local_ins_lat to the default_mem_sort_order[].

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
 tools/perf/Documentation/perf-report.txt |  6 +++-
 tools/perf/util/event.h                  |  1 +
 tools/perf/util/evsel.c                  |  4 ++-
 tools/perf/util/hist.c                   | 12 ++++++--
 tools/perf/util/hist.h                   |  2 ++
 tools/perf/util/intel-pt.c               |  5 ++--
 tools/perf/util/session.c                |  8 ++++--
 tools/perf/util/sort.c                   | 47 +++++++++++++++++++++++++++++++-
 tools/perf/util/sort.h                   |  3 ++
 tools/perf/util/synthetic-events.c       |  4 ++-
 10 files changed, 81 insertions(+), 11 deletions(-)

diff --git a/tools/perf/Documentation/perf-report.txt b/tools/perf/Documentation/perf-report.txt
index 826b5a9..0565b7c 100644
--- a/tools/perf/Documentation/perf-report.txt
+++ b/tools/perf/Documentation/perf-report.txt
@@ -108,6 +108,9 @@ OPTIONS
 	- period: Raw number of event count of sample
 	- time: Separate the samples by time stamp with the resolution specified by
 	--time-quantum (default 100ms). Specify with overhead and before it.
+	- ins_lat: Instruction latency in core cycles. This is the global
+	instruction latency.
+	- local_ins_lat: Local version of the instruction latency.
 
 	By default, comm, dso and symbol keys are used.
 	(i.e. --sort comm,dso,symbol)
@@ -154,7 +157,8 @@ OPTIONS
 	- blocked: reason of blocked load access for the data at the time of the sample
 
 	And the default sort keys are changed to local_weight, mem, sym, dso,
-	symbol_daddr, dso_daddr, snoop, tlb, locked, blocked, see '--mem-mode'.
+	symbol_daddr, dso_daddr, snoop, tlb, locked, blocked, local_ins_lat,
+	see '--mem-mode'.
 
 	If the data file has tracepoint event(s), following (dynamic) sort keys
 	are also available:
diff --git a/tools/perf/util/event.h b/tools/perf/util/event.h
index ff403ea..5d50a49 100644
--- a/tools/perf/util/event.h
+++ b/tools/perf/util/event.h
@@ -141,6 +141,7 @@ struct perf_sample {
 	u16 insn_len;
 	u8  cpumode;
 	u16 misc;
+	u16 ins_lat;
 	bool no_hw_idx;		/* No hw_idx collected in branch_stack */
 	char insn[MAX_INSN];
 	void *raw_data;
diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index 0a2a307..24c0b59 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -2337,8 +2337,10 @@ int evsel__parse_sample(struct evsel *evsel, union perf_event *event,
 		weight.full = *array;
 		if (type & PERF_SAMPLE_WEIGHT)
 			data->weight = weight.full;
-		else
+		else {
 			data->weight = weight.var1_dw;
+			data->ins_lat = weight.var2_w;
+		}
 		array++;
 	}
 
diff --git a/tools/perf/util/hist.c b/tools/perf/util/hist.c
index 6866ab0..8a432f8 100644
--- a/tools/perf/util/hist.c
+++ b/tools/perf/util/hist.c
@@ -209,6 +209,8 @@ void hists__calc_col_len(struct hists *hists, struct hist_entry *h)
 	hists__new_col_len(hists, HISTC_LOCAL_WEIGHT, 12);
 	hists__new_col_len(hists, HISTC_GLOBAL_WEIGHT, 12);
 	hists__new_col_len(hists, HISTC_MEM_BLOCKED, 10);
+	hists__new_col_len(hists, HISTC_LOCAL_INS_LAT, 13);
+	hists__new_col_len(hists, HISTC_GLOBAL_INS_LAT, 13);
 	if (symbol_conf.nanosecs)
 		hists__new_col_len(hists, HISTC_TIME, 16);
 	else
@@ -286,12 +288,13 @@ static long hist_time(unsigned long htime)
 }
 
 static void he_stat__add_period(struct he_stat *he_stat, u64 period,
-				u64 weight)
+				u64 weight, u64 ins_lat)
 {
 
 	he_stat->period		+= period;
 	he_stat->weight		+= weight;
 	he_stat->nr_events	+= 1;
+	he_stat->ins_lat	+= ins_lat;
 }
 
 static void he_stat__add_stat(struct he_stat *dest, struct he_stat *src)
@@ -303,6 +306,7 @@ static void he_stat__add_stat(struct he_stat *dest, struct he_stat *src)
 	dest->period_guest_us	+= src->period_guest_us;
 	dest->nr_events		+= src->nr_events;
 	dest->weight		+= src->weight;
+	dest->ins_lat		+= src->ins_lat;
 }
 
 static void he_stat__decay(struct he_stat *he_stat)
@@ -591,6 +595,7 @@ static struct hist_entry *hists__findnew_entry(struct hists *hists,
 	int64_t cmp;
 	u64 period = entry->stat.period;
 	u64 weight = entry->stat.weight;
+	u64 ins_lat = entry->stat.ins_lat;
 	bool leftmost = true;
 
 	p = &hists->entries_in->rb_root.rb_node;
@@ -609,11 +614,11 @@ static struct hist_entry *hists__findnew_entry(struct hists *hists,
 
 		if (!cmp) {
 			if (sample_self) {
-				he_stat__add_period(&he->stat, period, weight);
+				he_stat__add_period(&he->stat, period, weight, ins_lat);
 				hist_entry__add_callchain_period(he, period);
 			}
 			if (symbol_conf.cumulate_callchain)
-				he_stat__add_period(he->stat_acc, period, weight);
+				he_stat__add_period(he->stat_acc, period, weight, ins_lat);
 
 			/*
 			 * This mem info was allocated from sample__resolve_mem
@@ -723,6 +728,7 @@ __hists__add_entry(struct hists *hists,
 			.nr_events = 1,
 			.period	= sample->period,
 			.weight = sample->weight,
+			.ins_lat = sample->ins_lat,
 		},
 		.parent = sym_parent,
 		.filtered = symbol__parent_filter(sym_parent) | al->filtered,
diff --git a/tools/perf/util/hist.h b/tools/perf/util/hist.h
index 522486b..36bca33 100644
--- a/tools/perf/util/hist.h
+++ b/tools/perf/util/hist.h
@@ -72,6 +72,8 @@ enum hist_column {
 	HISTC_DSO_SIZE,
 	HISTC_SYMBOL_IPC,
 	HISTC_MEM_BLOCKED,
+	HISTC_LOCAL_INS_LAT,
+	HISTC_GLOBAL_INS_LAT,
 	HISTC_NR_COLS, /* Last entry */
 };
 
diff --git a/tools/perf/util/intel-pt.c b/tools/perf/util/intel-pt.c
index a929f6d..c9477d0 100644
--- a/tools/perf/util/intel-pt.c
+++ b/tools/perf/util/intel-pt.c
@@ -1871,9 +1871,10 @@ static int intel_pt_synth_pebs_sample(struct intel_pt_queue *ptq)
 			 * cycles. Use latency >> 32 to distinguish the
 			 * different format of the mem access latency field.
 			 */
-			if (weight > 0)
+			if (weight > 0) {
 				sample.weight = weight & 0xffff;
-			else
+				sample.ins_lat = items->mem_access_latency & 0xffff;
+			} else
 				sample.weight = items->mem_access_latency;
 		}
 		if (!sample.weight && items->has_tsx_aux_info) {
diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index a35d8c2..330e591 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -1297,8 +1297,12 @@ static void dump_sample(struct evsel *evsel, union perf_event *event,
 	if (sample_type & PERF_SAMPLE_STACK_USER)
 		stack_user__printf(&sample->user_stack);
 
-	if (sample_type & PERF_SAMPLE_WEIGHT_TYPE)
-		printf("... weight: %" PRIu64 "\n", sample->weight);
+	if (sample_type & PERF_SAMPLE_WEIGHT_TYPE) {
+		printf("... weight: %" PRIu64 "", sample->weight);
+		if (sample_type & PERF_SAMPLE_WEIGHT_STRUCT)
+			printf(",0x%"PRIx16"", sample->ins_lat);
+		printf("\n");
+	}
 
 	if (sample_type & PERF_SAMPLE_DATA_SRC)
 		printf(" . data_src: 0x%"PRIx64"\n", sample->data_src);
diff --git a/tools/perf/util/sort.c b/tools/perf/util/sort.c
index 249a03c..e0529f2 100644
--- a/tools/perf/util/sort.c
+++ b/tools/perf/util/sort.c
@@ -36,7 +36,7 @@ const char	default_parent_pattern[] = "^sys_|^do_page_fault";
 const char	*parent_pattern = default_parent_pattern;
 const char	*default_sort_order = "comm,dso,symbol";
 const char	default_branch_sort_order[] = "comm,dso_from,symbol_from,symbol_to,cycles";
-const char	default_mem_sort_order[] = "local_weight,mem,sym,dso,symbol_daddr,dso_daddr,snoop,tlb,locked,blocked";
+const char	default_mem_sort_order[] = "local_weight,mem,sym,dso,symbol_daddr,dso_daddr,snoop,tlb,locked,blocked,local_ins_lat";
 const char	default_top_sort_order[] = "dso,symbol";
 const char	default_diff_sort_order[] = "dso,symbol";
 const char	default_tracepoint_sort_order[] = "trace";
@@ -1365,6 +1365,49 @@ struct sort_entry sort_global_weight = {
 	.se_width_idx	= HISTC_GLOBAL_WEIGHT,
 };
 
+static u64 he_ins_lat(struct hist_entry *he)
+{
+	return he->stat.nr_events ? he->stat.ins_lat / he->stat.nr_events : 0;
+}
+
+static int64_t
+sort__local_ins_lat_cmp(struct hist_entry *left, struct hist_entry *right)
+{
+	return he_ins_lat(left) - he_ins_lat(right);
+}
+
+static int hist_entry__local_ins_lat_snprintf(struct hist_entry *he, char *bf,
+					      size_t size, unsigned int width)
+{
+	return repsep_snprintf(bf, size, "%-*u", width, he_ins_lat(he));
+}
+
+struct sort_entry sort_local_ins_lat = {
+	.se_header	= "Local INSTR Latency",
+	.se_cmp		= sort__local_ins_lat_cmp,
+	.se_snprintf	= hist_entry__local_ins_lat_snprintf,
+	.se_width_idx	= HISTC_LOCAL_INS_LAT,
+};
+
+static int64_t
+sort__global_ins_lat_cmp(struct hist_entry *left, struct hist_entry *right)
+{
+	return left->stat.ins_lat - right->stat.ins_lat;
+}
+
+static int hist_entry__global_ins_lat_snprintf(struct hist_entry *he, char *bf,
+					       size_t size, unsigned int width)
+{
+	return repsep_snprintf(bf, size, "%-*u", width, he->stat.ins_lat);
+}
+
+struct sort_entry sort_global_ins_lat = {
+	.se_header	= "INSTR Latency",
+	.se_cmp		= sort__global_ins_lat_cmp,
+	.se_snprintf	= hist_entry__global_ins_lat_snprintf,
+	.se_width_idx	= HISTC_GLOBAL_INS_LAT,
+};
+
 struct sort_entry sort_mem_daddr_sym = {
 	.se_header	= "Data Symbol",
 	.se_cmp		= sort__daddr_cmp,
@@ -1770,6 +1813,8 @@ static struct sort_dimension common_sort_dimensions[] = {
 	DIM(SORT_CGROUP_ID, "cgroup_id", sort_cgroup_id),
 	DIM(SORT_SYM_IPC_NULL, "ipc_null", sort_sym_ipc_null),
 	DIM(SORT_TIME, "time", sort_time),
+	DIM(SORT_LOCAL_INS_LAT, "local_ins_lat", sort_local_ins_lat),
+	DIM(SORT_GLOBAL_INS_LAT, "ins_lat", sort_global_ins_lat),
 };
 
 #undef DIM
diff --git a/tools/perf/util/sort.h b/tools/perf/util/sort.h
index 2b2645b..c92ca15 100644
--- a/tools/perf/util/sort.h
+++ b/tools/perf/util/sort.h
@@ -50,6 +50,7 @@ struct he_stat {
 	u64			period_guest_sys;
 	u64			period_guest_us;
 	u64			weight;
+	u64			ins_lat;
 	u32			nr_events;
 };
 
@@ -229,6 +230,8 @@ enum sort_type {
 	SORT_CGROUP_ID,
 	SORT_SYM_IPC_NULL,
 	SORT_TIME,
+	SORT_LOCAL_INS_LAT,
+	SORT_GLOBAL_INS_LAT,
 
 	/* branch stack specific sort keys */
 	__SORT_BRANCH_STACK,
diff --git a/tools/perf/util/synthetic-events.c b/tools/perf/util/synthetic-events.c
index bc16268..95401c9 100644
--- a/tools/perf/util/synthetic-events.c
+++ b/tools/perf/util/synthetic-events.c
@@ -1557,8 +1557,10 @@ int perf_event__synthesize_sample(union perf_event *event, u64 type, u64 read_fo
 
 	if (type & PERF_SAMPLE_WEIGHT_TYPE) {
 		*array = sample->weight;
-		if (type & PERF_SAMPLE_WEIGHT_STRUCT)
+		if (type & PERF_SAMPLE_WEIGHT_STRUCT) {
 			*array &= 0xffffffff;
+			*array |= ((u64)sample->ins_lat << 32);
+		}
 		array++;
 	}
 
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 7/9] perf test: Support PERF_SAMPLE_WEIGHT_STRUCT
  2021-02-02 20:09 [PATCH 0/9] perf core PMU support for Sapphire Rapids (User tools) kan.liang
                   ` (5 preceding siblings ...)
  2021-02-02 20:09 ` [PATCH 6/9] perf report: Support instruction latency kan.liang
@ 2021-02-02 20:09 ` kan.liang
  2021-02-03 20:44   ` Arnaldo Carvalho de Melo
  2021-02-02 20:09 ` [PATCH 8/9] perf stat: Support L2 Topdown events kan.liang
  2021-02-02 20:09 ` [PATCH 9/9] perf, tools: Update topdown documentation for Sapphire Rapids kan.liang
  8 siblings, 1 reply; 34+ messages in thread
From: kan.liang @ 2021-02-02 20:09 UTC (permalink / raw)
  To: acme, mingo, linux-kernel
  Cc: peterz, eranian, namhyung, jolsa, ak, yao.jin, maddy, Kan Liang

From: Kan Liang <kan.liang@linux.intel.com>

Support the new sample type in the sample-parsing test case.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
 tools/perf/tests/sample-parsing.c | 14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/tools/perf/tests/sample-parsing.c b/tools/perf/tests/sample-parsing.c
index 2393916..c5739cc 100644
--- a/tools/perf/tests/sample-parsing.c
+++ b/tools/perf/tests/sample-parsing.c
@@ -129,6 +129,9 @@ static bool samples_same(const struct perf_sample *s1,
 	if (type & PERF_SAMPLE_WEIGHT)
 		COMP(weight);
 
+	if (type & PERF_SAMPLE_WEIGHT_STRUCT)
+		COMP(ins_lat);
+
 	if (type & PERF_SAMPLE_DATA_SRC)
 		COMP(data_src);
 
@@ -238,6 +241,7 @@ static int do_test(u64 sample_type, u64 sample_regs, u64 read_format)
 		.phys_addr	= 113,
 		.cgroup		= 114,
 		.data_page_size = 115,
+		.ins_lat	= 116,
 		.aux_sample	= {
 			.size	= sizeof(aux_data),
 			.data	= (void *)aux_data,
@@ -344,7 +348,7 @@ int test__sample_parsing(struct test *test __maybe_unused, int subtest __maybe_u
 	 * were added.  Please actually update the test rather than just change
 	 * the condition below.
 	 */
-	if (PERF_SAMPLE_MAX > PERF_SAMPLE_CODE_PAGE_SIZE << 1) {
+	if (PERF_SAMPLE_MAX > PERF_SAMPLE_WEIGHT_STRUCT << 1) {
 		pr_debug("sample format has changed, some new PERF_SAMPLE_ bit was introduced - test needs updating\n");
 		return -1;
 	}
@@ -374,8 +378,12 @@ int test__sample_parsing(struct test *test __maybe_unused, int subtest __maybe_u
 			return err;
 	}
 
-	/* Test all sample format bits together */
-	sample_type = PERF_SAMPLE_MAX - 1;
+	/*
+	 * Test all sample format bits together
+	 * Note: PERF_SAMPLE_WEIGHT and PERF_SAMPLE_WEIGHT_STRUCT cannot
+	 *       be set simultaneously.
+	 */
+	sample_type = (PERF_SAMPLE_MAX - 1) & ~PERF_SAMPLE_WEIGHT;
 	sample_regs = 0x3fff; /* shared yb intr and user regs */
 	for (i = 0; i < ARRAY_SIZE(rf); i++) {
 		err = do_test(sample_type, sample_regs, rf[i]);
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 8/9] perf stat: Support L2 Topdown events
  2021-02-02 20:09 [PATCH 0/9] perf core PMU support for Sapphire Rapids (User tools) kan.liang
                   ` (6 preceding siblings ...)
  2021-02-02 20:09 ` [PATCH 7/9] perf test: Support PERF_SAMPLE_WEIGHT_STRUCT kan.liang
@ 2021-02-02 20:09 ` kan.liang
  2021-02-02 20:09 ` [PATCH 9/9] perf, tools: Update topdown documentation for Sapphire Rapids kan.liang
  8 siblings, 0 replies; 34+ messages in thread
From: kan.liang @ 2021-02-02 20:09 UTC (permalink / raw)
  To: acme, mingo, linux-kernel
  Cc: peterz, eranian, namhyung, jolsa, ak, yao.jin, maddy, Kan Liang

From: Kan Liang <kan.liang@linux.intel.com>

The TMA method level 2 metrics are supported starting from the Intel
Sapphire Rapids server, which exposes four L2 Topdown metrics events to
user space. There are eight L2 metrics in total; the other four are
calculated from the corresponding L1 metrics and the exposed L2 events.

Now, --topdown prints the complete top-down metrics supported by the
CPU. For the Intel Sapphire Rapids server, there are 4 L1 events and
8 L2 events displayed in one line.

Add a new option, --td-level, to display the top-down statistics that
are equal to or lower than the input level.

An L2 event is highlighted only when both it and its L1 parent event
cross their thresholds.

Here is an example:

 $perf stat --topdown --td-level=2 --no-metric-only sleep 1
 Topdown accuracy may decrease when measuring long periods.
 Please print the result regularly, e.g. -I1000

 Performance counter stats for 'sleep 1':

        16,734,390      slots
         2,100,001      topdown-retiring          #     12.6% retiring
         2,034,376      topdown-bad-spec          #     12.3% bad speculation
         4,003,128      topdown-fe-bound          #     24.1% frontend bound
           328,125      topdown-heavy-ops         #      2.0% heavy operations      #     10.6% light operations
         1,968,751      topdown-br-mispredict     #     11.9% branch mispredict     #      0.4% machine clears
         2,953,127      topdown-fetch-lat         #     17.8% fetch latency         #      6.3% fetch bandwidth
         5,906,255      topdown-mem-bound         #     35.6% memory bound          #     15.4% core bound

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
 tools/perf/Documentation/perf-stat.txt | 14 +++++-
 tools/perf/builtin-stat.c              | 34 +++++++++++--
 tools/perf/util/stat-shadow.c          | 92 ++++++++++++++++++++++++++++++++++
 tools/perf/util/stat.c                 |  4 ++
 tools/perf/util/stat.h                 |  9 ++++
 5 files changed, 149 insertions(+), 4 deletions(-)

diff --git a/tools/perf/Documentation/perf-stat.txt b/tools/perf/Documentation/perf-stat.txt
index 5d4a673d..796772c 100644
--- a/tools/perf/Documentation/perf-stat.txt
+++ b/tools/perf/Documentation/perf-stat.txt
@@ -358,7 +358,7 @@ See perf list output for the possble metrics and metricgroups.
 Do not aggregate counts across all monitored CPUs.
 
 --topdown::
-Print top down level 1 metrics if supported by the CPU. This allows to
+Print complete top-down metrics supported by the CPU. This allows to
 determine bottle necks in the CPU pipeline for CPU bound workloads,
 by breaking the cycles consumed down into frontend bound, backend bound,
 bad speculation and retiring.
@@ -393,6 +393,18 @@ To interpret the results it is usually needed to know on which
 CPUs the workload runs on. If needed the CPUs can be forced using
 taskset.
 
+--td-level::
+Print the top-down statistics that are equal to or lower than the input
+level. It allows users to print only the top-down metrics levels of
+interest instead of the complete top-down metrics.
+
+The availability of the top-down metrics levels depends on the hardware.
+For example, Ice Lake only supports L1 top-down metrics, while Sapphire
+Rapids supports both L1 and L2 top-down metrics.
+
+Default: 0 means the max level that the current hardware supports.
+Error out if the input is higher than the supported max level.
+
 --no-merge::
 Do not merge results from same PMUs.
 
diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
index 8cc2496..bc84b31 100644
--- a/tools/perf/builtin-stat.c
+++ b/tools/perf/builtin-stat.c
@@ -137,6 +137,19 @@ static const char *topdown_metric_attrs[] = {
 	NULL,
 };
 
+static const char *topdown_metric_L2_attrs[] = {
+	"slots",
+	"topdown-retiring",
+	"topdown-bad-spec",
+	"topdown-fe-bound",
+	"topdown-be-bound",
+	"topdown-heavy-ops",
+	"topdown-br-mispredict",
+	"topdown-fetch-lat",
+	"topdown-mem-bound",
+	NULL,
+};
+
 static const char *smi_cost_attrs = {
 	"{"
 	"msr/aperf/,"
@@ -1153,7 +1166,9 @@ static struct option stat_options[] = {
 	OPT_BOOLEAN(0, "metric-no-merge", &stat_config.metric_no_merge,
 		       "don't try to share events between metrics in a group"),
 	OPT_BOOLEAN(0, "topdown", &topdown_run,
-			"measure topdown level 1 statistics"),
+			"measure top-down statistics"),
+	OPT_UINTEGER(0, "td-level", &stat_config.topdown_level,
+			"Set the metrics level for the top-down statistics (0: max level)"),
 	OPT_BOOLEAN(0, "smi-cost", &smi_cost,
 			"measure SMI cost"),
 	OPT_CALLBACK('M', "metrics", &evsel_list, "metric/metric group list",
@@ -1706,17 +1721,30 @@ static int add_default_attributes(void)
 	}
 
 	if (topdown_run) {
+		const char **metric_attrs = topdown_metric_attrs;
+		unsigned int max_level = 1;
 		char *str = NULL;
 		bool warn = false;
 
 		if (!force_metric_only)
 			stat_config.metric_only = true;
 
-		if (topdown_filter_events(topdown_metric_attrs, &str, 1) < 0) {
+		if (pmu_have_event("cpu", topdown_metric_L2_attrs[5])) {
+			metric_attrs = topdown_metric_L2_attrs;
+			max_level = 2;
+		}
+
+		if (stat_config.topdown_level > max_level) {
+			pr_err("Invalid top-down metrics level. The max level is %u.\n", max_level);
+			return -1;
+		} else if (!stat_config.topdown_level)
+			stat_config.topdown_level = max_level;
+
+		if (topdown_filter_events(metric_attrs, &str, 1) < 0) {
 			pr_err("Out of memory\n");
 			return -1;
 		}
-		if (topdown_metric_attrs[0] && str) {
+		if (metric_attrs[0] && str) {
 			if (!stat_config.interval && !stat_config.metric_only) {
 				fprintf(stat_config.output,
 					"Topdown accuracy may decrease when measuring long periods.\n"
diff --git a/tools/perf/util/stat-shadow.c b/tools/perf/util/stat-shadow.c
index 12eafd1..6ccf21a 100644
--- a/tools/perf/util/stat-shadow.c
+++ b/tools/perf/util/stat-shadow.c
@@ -273,6 +273,18 @@ void perf_stat__update_shadow_stats(struct evsel *counter, u64 count,
 	else if (perf_stat_evsel__is(counter, TOPDOWN_BE_BOUND))
 		update_runtime_stat(st, STAT_TOPDOWN_BE_BOUND,
 				    cpu, count, &rsd);
+	else if (perf_stat_evsel__is(counter, TOPDOWN_HEAVY_OPS))
+		update_runtime_stat(st, STAT_TOPDOWN_HEAVY_OPS,
+				    cpu, count, &rsd);
+	else if (perf_stat_evsel__is(counter, TOPDOWN_BR_MISPREDICT))
+		update_runtime_stat(st, STAT_TOPDOWN_BR_MISPREDICT,
+				    cpu, count, &rsd);
+	else if (perf_stat_evsel__is(counter, TOPDOWN_FETCH_LAT))
+		update_runtime_stat(st, STAT_TOPDOWN_FETCH_LAT,
+				    cpu, count, &rsd);
+	else if (perf_stat_evsel__is(counter, TOPDOWN_MEM_BOUND))
+		update_runtime_stat(st, STAT_TOPDOWN_MEM_BOUND,
+				    cpu, count, &rsd);
 	else if (evsel__match(counter, HARDWARE, HW_STALLED_CYCLES_FRONTEND))
 		update_runtime_stat(st, STAT_STALLED_CYCLES_FRONT,
 				    cpu, count, &rsd);
@@ -1174,6 +1186,86 @@ void perf_stat__print_shadow_stats(struct perf_stat_config *config,
 			color = PERF_COLOR_RED;
 		print_metric(config, ctxp, color, "%8.1f%%", "bad speculation",
 				bad_spec * 100.);
+	} else if (perf_stat_evsel__is(evsel, TOPDOWN_HEAVY_OPS) &&
+			full_td(cpu, st, &rsd) && (config->topdown_level > 1)) {
+		double retiring = td_metric_ratio(cpu,
+						  STAT_TOPDOWN_RETIRING, st,
+						  &rsd);
+		double heavy_ops = td_metric_ratio(cpu,
+						   STAT_TOPDOWN_HEAVY_OPS, st,
+						   &rsd);
+		double light_ops = retiring - heavy_ops;
+
+		if (retiring > 0.7 && heavy_ops > 0.1)
+			color = PERF_COLOR_GREEN;
+		print_metric(config, ctxp, color, "%8.1f%%", "heavy operations",
+				heavy_ops * 100.);
+		if (retiring > 0.7 && light_ops > 0.6)
+			color = PERF_COLOR_GREEN;
+		else
+			color = NULL;
+		print_metric(config, ctxp, color, "%8.1f%%", "light operations",
+				light_ops * 100.);
+	} else if (perf_stat_evsel__is(evsel, TOPDOWN_BR_MISPREDICT) &&
+			full_td(cpu, st, &rsd) && (config->topdown_level > 1)) {
+		double bad_spec = td_metric_ratio(cpu,
+						  STAT_TOPDOWN_BAD_SPEC, st,
+						  &rsd);
+		double br_mis = td_metric_ratio(cpu,
+						STAT_TOPDOWN_BR_MISPREDICT, st,
+						&rsd);
+		double m_clears = bad_spec - br_mis;
+
+		if (bad_spec > 0.1 && br_mis > 0.05)
+			color = PERF_COLOR_RED;
+		print_metric(config, ctxp, color, "%8.1f%%", "branch mispredict",
+				br_mis * 100.);
+		if (bad_spec > 0.1 && m_clears > 0.05)
+			color = PERF_COLOR_RED;
+		else
+			color = NULL;
+		print_metric(config, ctxp, color, "%8.1f%%", "machine clears",
+				m_clears * 100.);
+	} else if (perf_stat_evsel__is(evsel, TOPDOWN_FETCH_LAT) &&
+			full_td(cpu, st, &rsd) && (config->topdown_level > 1)) {
+		double fe_bound = td_metric_ratio(cpu,
+						  STAT_TOPDOWN_FE_BOUND, st,
+						  &rsd);
+		double fetch_lat = td_metric_ratio(cpu,
+						   STAT_TOPDOWN_FETCH_LAT, st,
+						   &rsd);
+		double fetch_bw = fe_bound - fetch_lat;
+
+		if (fe_bound > 0.2 && fetch_lat > 0.15)
+			color = PERF_COLOR_RED;
+		print_metric(config, ctxp, color, "%8.1f%%", "fetch latency",
+				fetch_lat * 100.);
+		if (fe_bound > 0.2 && fetch_bw > 0.1)
+			color = PERF_COLOR_RED;
+		else
+			color = NULL;
+		print_metric(config, ctxp, color, "%8.1f%%", "fetch bandwidth",
+				fetch_bw * 100.);
+	} else if (perf_stat_evsel__is(evsel, TOPDOWN_MEM_BOUND) &&
+			full_td(cpu, st, &rsd) && (config->topdown_level > 1)) {
+		double be_bound = td_metric_ratio(cpu,
+						  STAT_TOPDOWN_BE_BOUND, st,
+						  &rsd);
+		double mem_bound = td_metric_ratio(cpu,
+						   STAT_TOPDOWN_MEM_BOUND, st,
+						   &rsd);
+		double core_bound = be_bound - mem_bound;
+
+		if (be_bound > 0.2 && mem_bound > 0.2)
+			color = PERF_COLOR_RED;
+		print_metric(config, ctxp, color, "%8.1f%%", "memory bound",
+				mem_bound * 100.);
+		if (be_bound > 0.2 && core_bound > 0.1)
+			color = PERF_COLOR_RED;
+		else
+			color = NULL;
+		print_metric(config, ctxp, color, "%8.1f%%", "Core bound",
+				core_bound * 100.);
 	} else if (evsel->metric_expr) {
 		generic_metric(config, evsel->metric_expr, evsel->metric_events, NULL,
 				evsel->name, evsel->metric_name, NULL, 1, cpu, out, st);
diff --git a/tools/perf/util/stat.c b/tools/perf/util/stat.c
index 8ce1479..82c767b 100644
--- a/tools/perf/util/stat.c
+++ b/tools/perf/util/stat.c
@@ -99,6 +99,10 @@ static const char *id_str[PERF_STAT_EVSEL_ID__MAX] = {
 	ID(TOPDOWN_BAD_SPEC, topdown-bad-spec),
 	ID(TOPDOWN_FE_BOUND, topdown-fe-bound),
 	ID(TOPDOWN_BE_BOUND, topdown-be-bound),
+	ID(TOPDOWN_HEAVY_OPS, topdown-heavy-ops),
+	ID(TOPDOWN_BR_MISPREDICT, topdown-br-mispredict),
+	ID(TOPDOWN_FETCH_LAT, topdown-fetch-lat),
+	ID(TOPDOWN_MEM_BOUND, topdown-mem-bound),
 	ID(SMI_NUM, msr/smi/),
 	ID(APERF, msr/aperf/),
 };
diff --git a/tools/perf/util/stat.h b/tools/perf/util/stat.h
index b536973..d85c292 100644
--- a/tools/perf/util/stat.h
+++ b/tools/perf/util/stat.h
@@ -33,6 +33,10 @@ enum perf_stat_evsel_id {
 	PERF_STAT_EVSEL_ID__TOPDOWN_BAD_SPEC,
 	PERF_STAT_EVSEL_ID__TOPDOWN_FE_BOUND,
 	PERF_STAT_EVSEL_ID__TOPDOWN_BE_BOUND,
+	PERF_STAT_EVSEL_ID__TOPDOWN_HEAVY_OPS,
+	PERF_STAT_EVSEL_ID__TOPDOWN_BR_MISPREDICT,
+	PERF_STAT_EVSEL_ID__TOPDOWN_FETCH_LAT,
+	PERF_STAT_EVSEL_ID__TOPDOWN_MEM_BOUND,
 	PERF_STAT_EVSEL_ID__SMI_NUM,
 	PERF_STAT_EVSEL_ID__APERF,
 	PERF_STAT_EVSEL_ID__MAX,
@@ -91,6 +95,10 @@ enum stat_type {
 	STAT_TOPDOWN_BAD_SPEC,
 	STAT_TOPDOWN_FE_BOUND,
 	STAT_TOPDOWN_BE_BOUND,
+	STAT_TOPDOWN_HEAVY_OPS,
+	STAT_TOPDOWN_BR_MISPREDICT,
+	STAT_TOPDOWN_FETCH_LAT,
+	STAT_TOPDOWN_MEM_BOUND,
 	STAT_SMI_NUM,
 	STAT_APERF,
 	STAT_MAX
@@ -148,6 +156,7 @@ struct perf_stat_config {
 	int			 ctl_fd_ack;
 	bool			 ctl_fd_close;
 	const char		*cgroup_list;
+	unsigned int		topdown_level;
 };
 
 void perf_stat__set_big_num(int set);
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 9/9] perf, tools: Update topdown documentation for Sapphire Rapids
  2021-02-02 20:09 [PATCH 0/9] perf core PMU support for Sapphire Rapids (User tools) kan.liang
                   ` (7 preceding siblings ...)
  2021-02-02 20:09 ` [PATCH 8/9] perf stat: Support L2 Topdown events kan.liang
@ 2021-02-02 20:09 ` kan.liang
  8 siblings, 0 replies; 34+ messages in thread
From: kan.liang @ 2021-02-02 20:09 UTC (permalink / raw)
  To: acme, mingo, linux-kernel
  Cc: peterz, eranian, namhyung, jolsa, ak, yao.jin, maddy, Kan Liang

From: Kan Liang <kan.liang@linux.intel.com>

Document the Topdown extension on Sapphire Rapids and how to collect the
new L2 events.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
 tools/perf/Documentation/topdown.txt | 78 ++++++++++++++++++++++++++++++++++--
 1 file changed, 74 insertions(+), 4 deletions(-)

diff --git a/tools/perf/Documentation/topdown.txt b/tools/perf/Documentation/topdown.txt
index 3c39bb3..10f07f9 100644
--- a/tools/perf/Documentation/topdown.txt
+++ b/tools/perf/Documentation/topdown.txt
@@ -121,7 +121,7 @@ to read slots and the topdown metrics at different points of the program:
 #define RDPMC_METRIC	(1 << 29)	/* return metric counters */
 
 #define FIXED_COUNTER_SLOTS		3
-#define METRIC_COUNTER_TOPDOWN_L1	0
+#define METRIC_COUNTER_TOPDOWN_L1_L2	0
 
 static inline uint64_t read_slots(void)
 {
@@ -130,7 +130,7 @@ static inline uint64_t read_slots(void)
 
 static inline uint64_t read_metrics(void)
 {
-	return _rdpmc(RDPMC_METRIC | METRIC_COUNTER_TOPDOWN_L1);
+	return _rdpmc(RDPMC_METRIC | METRIC_COUNTER_TOPDOWN_L1_L2);
 }
 
 Then the program can be instrumented to read these metrics at different
@@ -152,11 +152,21 @@ The binary ratios in the metric value can be converted to float ratios:
 
 #define GET_METRIC(m, i) (((m) >> (i*8)) & 0xff)
 
+/* L1 Topdown metric events */
 #define TOPDOWN_RETIRING(val)	((float)GET_METRIC(val, 0) / 0xff)
 #define TOPDOWN_BAD_SPEC(val)	((float)GET_METRIC(val, 1) / 0xff)
 #define TOPDOWN_FE_BOUND(val)	((float)GET_METRIC(val, 2) / 0xff)
 #define TOPDOWN_BE_BOUND(val)	((float)GET_METRIC(val, 3) / 0xff)
 
+/*
+ * L2 Topdown metric events.
+ * Available on Sapphire Rapids and later platforms.
+ */
+#define TOPDOWN_HEAVY_OPS(val)		((float)GET_METRIC(val, 4) / 0xff)
+#define TOPDOWN_BR_MISPREDICT(val)	((float)GET_METRIC(val, 5) / 0xff)
+#define TOPDOWN_FETCH_LAT(val)		((float)GET_METRIC(val, 6) / 0xff)
+#define TOPDOWN_MEM_BOUND(val)		((float)GET_METRIC(val, 7) / 0xff)
+
 and then converted to percent for printing.
 
 The ratios in the metric accumulate for the time when the counter
@@ -190,8 +200,8 @@ for that time period.
 	fe_bound_slots = GET_METRIC(metric_b, 2) * slots_b - fe_bound_slots_a
 	be_bound_slots = GET_METRIC(metric_b, 3) * slots_b - be_bound_slots_a
 
-Later the individual ratios for the measurement period can be recreated
-from these counts.
+Later the individual ratios of L1 metric events for the measurement period can
+be recreated from these counts.
 
 	slots_delta = slots_b - slots_a
 	retiring_ratio = (float)retiring_slots / slots_delta
@@ -205,6 +215,48 @@ from these counts.
 		fe_bound_ratio * 100.,
 		be_bound_ratio * 100.);
 
+The individual ratios of the L2 metric events for the measurement period can
+be recreated from the L1 and L2 metric counters (available on Sapphire Rapids
+and later platforms):
+
+	# compute scaled metrics for measurement a
+	heavy_ops_slots_a = GET_METRIC(metric_a, 4) * slots_a
+	br_mispredict_slots_a = GET_METRIC(metric_a, 5) * slots_a
+	fetch_lat_slots_a = GET_METRIC(metric_a, 6) * slots_a
+	mem_bound_slots_a = GET_METRIC(metric_a, 7) * slots_a
+
+	# compute delta scaled metrics between b and a
+	heavy_ops_slots = GET_METRIC(metric_b, 4) * slots_b - heavy_ops_slots_a
+	br_mispredict_slots = GET_METRIC(metric_b, 5) * slots_b - br_mispredict_slots_a
+	fetch_lat_slots = GET_METRIC(metric_b, 6) * slots_b - fetch_lat_slots_a
+	mem_bound_slots = GET_METRIC(metric_b, 7) * slots_b - mem_bound_slots_a
+
+	slots_delta = slots_b - slots_a
+	heavy_ops_ratio = (float)heavy_ops_slots / slots_delta
+	light_ops_ratio = retiring_ratio - heavy_ops_ratio
+
+	br_mispredict_ratio = (float)br_mispredict_slots / slots_delta
+	machine_clears_ratio = bad_spec_ratio - br_mispredict_ratio
+
+	fetch_lat_ratio = (float)fetch_lat_slots / slots_delta
+	fetch_bw_ratio = fe_bound_ratio - fetch_lat_ratio
+
+	mem_bound_ratio = (float)mem_bound_slots / slots_delta
+	core_bound_ratio = be_bound_ratio - mem_bound_ratio
+
+	printf("Heavy Operations %.2f%% Light Operations %.2f%% "
+	       "Branch Mispredict %.2f%% Machine Clears %.2f%% "
+	       "Fetch Latency %.2f%% Fetch Bandwidth %.2f%% "
+	       "Mem Bound %.2f%% Core Bound %.2f%%\n",
+		heavy_ops_ratio * 100.,
+		light_ops_ratio * 100.,
+		br_mispredict_ratio * 100.,
+		machine_clears_ratio * 100.,
+		fetch_lat_ratio * 100.,
+		fetch_bw_ratio * 100.,
+		mem_bound_ratio * 100.,
+		core_bound_ratio * 100.);
+
 Resetting metrics counters
 ==========================
 
@@ -248,6 +300,24 @@ a sampling read group. Since the SLOTS event must be the leader of a TopDown
 group, the second event of the group is the sampling event.
 For example, perf record -e '{slots, $sampling_event, topdown-retiring}:S'
 
+Extension on Sapphire Rapids Server
+===================================
+The metrics counter is extended to support TMA method level 2 metrics.
+The lower half of the register is the TMA level 1 metrics (legacy).
+The upper half is also divided into four 8-bit fields for the new level 2
+metrics. Four more TopDown metric events are exposed to end users:
+topdown-heavy-ops, topdown-br-mispredict, topdown-fetch-lat and
+topdown-mem-bound.
+
+Each of the new level 2 metrics in the upper half is a subset of the
+corresponding level 1 metric in the lower half. Software can deduce the
+other four level 2 metrics by subtracting the corresponding metrics as below.
+
+    Light_Operations = Retiring - Heavy_Operations
+    Machine_Clears = Bad_Speculation - Branch_Mispredicts
+    Fetch_Bandwidth = Frontend_Bound - Fetch_Latency
+    Core_Bound = Backend_Bound - Memory_Bound
+
 
 [1] https://software.intel.com/en-us/top-down-microarchitecture-analysis-method-win
 [2] https://github.com/andikleen/pmu-tools/wiki/toplev-manual
-- 
2.7.4
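
The decode-and-subtract flow documented in the patch above can be sketched as
a standalone C snippet. This is only a sketch, not part of the patch: the
GET_METRIC macro and the field indices follow the layout quoted in the diff,
and the struct/function names are invented for illustration.

```c
#include <stdint.h>

/* Each 8-bit field of the metrics value is a fraction scaled to 0xff. */
#define GET_METRIC(m, i)	(((m) >> ((i) * 8)) & 0xff)

struct topdown_l2 {
	float heavy_ops, light_ops;
	float br_mispredict, machine_clears;
	float fetch_lat, fetch_bw;
	float mem_bound, core_bound;
};

static float td_ratio(uint64_t metrics, int i)
{
	return (float)GET_METRIC(metrics, i) / 0xff;
}

static struct topdown_l2 topdown_decode_l2(uint64_t metrics)
{
	struct topdown_l2 r;

	/* The L2 fields (4..7) are subsets of the matching L1 fields
	 * (0..3), so each sibling metric falls out by subtraction. */
	r.heavy_ops	 = td_ratio(metrics, 4);
	r.light_ops	 = td_ratio(metrics, 0) - r.heavy_ops;
	r.br_mispredict	 = td_ratio(metrics, 5);
	r.machine_clears = td_ratio(metrics, 1) - r.br_mispredict;
	r.fetch_lat	 = td_ratio(metrics, 6);
	r.fetch_bw	 = td_ratio(metrics, 2) - r.fetch_lat;
	r.mem_bound	 = td_ratio(metrics, 7);
	r.core_bound	 = td_ratio(metrics, 3) - r.mem_bound;
	return r;
}
```

On real hardware the input would come from the rdpmc path shown earlier in
topdown.txt; here it is just a packed 64-bit value.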



* Re: [PATCH 2/9] perf tools: Support the auxiliary event
  2021-02-02 20:09 ` [PATCH 2/9] perf tools: Support the auxiliary event kan.liang
@ 2021-02-03 20:02   ` Arnaldo Carvalho de Melo
  2021-02-03 21:20     ` Liang, Kan
  2021-02-05 10:52   ` Namhyung Kim
  1 sibling, 1 reply; 34+ messages in thread
From: Arnaldo Carvalho de Melo @ 2021-02-03 20:02 UTC (permalink / raw)
  To: kan.liang
  Cc: mingo, linux-kernel, peterz, eranian, namhyung, jolsa, ak,
	yao.jin, maddy

Em Tue, Feb 02, 2021 at 12:09:06PM -0800, kan.liang@linux.intel.com escreveu:
> From: Kan Liang <kan.liang@linux.intel.com>
> diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
> index c26ea822..c48f6de 100644
> --- a/tools/perf/util/evsel.c
> +++ b/tools/perf/util/evsel.c
> @@ -2689,6 +2689,9 @@ int evsel__open_strerror(struct evsel *evsel, struct target *target,
>  		if (perf_missing_features.aux_output)
>  			return scnprintf(msg, size, "The 'aux_output' feature is not supported, update the kernel.");
>  		break;
> +	case ENODATA:
> +		return scnprintf(msg, size, "Cannot collect data source with the load latency event alone. "
> +				 "Please add an auxiliary event in front of the load latency event.");

Are you sure this is the only case where ENODATA comes out from
perf_event_open()? Well, according to your comment in:

  61b985e3e775a3a7 ("perf/x86/intel: Add perf core PMU support for Sapphire Rapids")

It should be at that point in time, so its safe to merge as-is, but then
I think this is fragile, what if someone else, in the future, not
knowing that ENODATA is supposed to be used only with that ancient CPU,
Sapphire Rapids, uses it? :-)

Please consider adding a check before assuming ENODATA is for this
specific case.
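
A guard along those lines might look like the following sketch. Everything
here is hypothetical: evsel__is_load_latency() and the stub types are
invented for illustration and are not existing perf APIs.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Stand-in for struct evsel; only what the sketch needs. */
struct evsel_stub {
	const char *name;
};

/* Invented helper: check that the failing event really is the load
 * latency event before blaming a missing auxiliary event. */
static bool evsel__is_load_latency(const struct evsel_stub *evsel)
{
	return evsel->name && strstr(evsel->name, "mem-loads");
}

static int open_strerror_stub(const struct evsel_stub *evsel, int err,
			      char *msg, size_t size)
{
	if (err == ENODATA && evsel__is_load_latency(evsel))
		return snprintf(msg, size,
				"Cannot collect data source with the load latency event alone. "
				"Please add an auxiliary event in front of the load latency event.");
	return snprintf(msg, size, "unexpected error %d", err);
}
```

With such a check, a future ENODATA from an unrelated event would fall
through to the generic message instead of the load-latency hint.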

Back to processing the other patches.

- Arnaldo

>  	default:
>  		break;
>  	}
> diff --git a/tools/perf/util/mem-events.c b/tools/perf/util/mem-events.c
> index 19007e4..3edfb88 100644
> --- a/tools/perf/util/mem-events.c
> +++ b/tools/perf/util/mem-events.c
> @@ -56,6 +56,11 @@ char * __weak perf_mem_events__name(int i)
>  	return (char *)e->name;
>  }
>  
> +__weak bool is_mem_loads_aux_event(struct evsel *leader __maybe_unused)
> +{
> +	return false;
> +}
> +
>  int perf_mem_events__parse(const char *str)
>  {
>  	char *tok, *saveptr = NULL;
> diff --git a/tools/perf/util/mem-events.h b/tools/perf/util/mem-events.h
> index 5ef1782..045a507 100644
> --- a/tools/perf/util/mem-events.h
> +++ b/tools/perf/util/mem-events.h
> @@ -9,6 +9,7 @@
>  #include <linux/refcount.h>
>  #include <linux/perf_event.h>
>  #include "stat.h"
> +#include "evsel.h"
>  
>  struct perf_mem_event {
>  	bool		record;
> @@ -39,6 +40,7 @@ int perf_mem_events__init(void);
>  
>  char *perf_mem_events__name(int i);
>  struct perf_mem_event *perf_mem_events__ptr(int i);
> +bool is_mem_loads_aux_event(struct evsel *leader);
>  
>  void perf_mem_events__list(void);
>  
> diff --git a/tools/perf/util/parse-events.l b/tools/perf/util/parse-events.l
> index 9db5097..0b36285 100644
> --- a/tools/perf/util/parse-events.l
> +++ b/tools/perf/util/parse-events.l
> @@ -356,6 +356,7 @@ bpf-output					{ return sym(yyscanner, PERF_TYPE_SOFTWARE, PERF_COUNT_SW_BPF_OUT
>  cycles-ct				|
>  cycles-t				|
>  mem-loads				|
> +mem-loads-aux				|
>  mem-stores				|
>  topdown-[a-z-]+				|
>  tx-capacity-[a-z-]+			|
> diff --git a/tools/perf/util/record.c b/tools/perf/util/record.c
> index e70c9dd..d0735fb 100644
> --- a/tools/perf/util/record.c
> +++ b/tools/perf/util/record.c
> @@ -15,6 +15,8 @@
>  #include "record.h"
>  #include "../perf-sys.h"
>  #include "topdown.h"
> +#include "map_symbol.h"
> +#include "mem-events.h"
>  
>  /*
>   * evsel__config_leader_sampling() uses special rules for leader sampling.
> @@ -25,7 +27,8 @@ static struct evsel *evsel__read_sampler(struct evsel *evsel, struct evlist *evl
>  {
>  	struct evsel *leader = evsel->leader;
>  
> -	if (evsel__is_aux_event(leader) || arch_topdown_sample_read(leader)) {
> +	if (evsel__is_aux_event(leader) || arch_topdown_sample_read(leader) ||
> +	    is_mem_loads_aux_event(leader)) {
>  		evlist__for_each_entry(evlist, evsel) {
>  			if (evsel->leader == leader && evsel != evsel->leader)
>  				return evsel;
> -- 
> 2.7.4
> 

-- 

- Arnaldo


* Re: [PATCH 4/9] perf c2c: Support data block and addr block
  2021-02-02 20:09 ` [PATCH 4/9] perf c2c: " kan.liang
@ 2021-02-03 20:21   ` Arnaldo Carvalho de Melo
  0 siblings, 0 replies; 34+ messages in thread
From: Arnaldo Carvalho de Melo @ 2021-02-03 20:21 UTC (permalink / raw)
  To: Joe Mario
  Cc: Don Zickus, kan.liang, mingo, linux-kernel, peterz, eranian,
	namhyung, jolsa, ak, yao.jin, maddy

Em Tue, Feb 02, 2021 at 12:09:08PM -0800, kan.liang@linux.intel.com escreveu:
> From: Kan Liang <kan.liang@linux.intel.com>

Hey Joe, heads up on these new fields.

The whole thread is at:

  https://lore.kernel.org/lkml/1612296553-21962-1-git-send-email-kan.liang@linux.intel.com/

The kernel cset introducing the support:

  "perf/x86/intel: Add perf core PMU support for Sapphire Rapids"
  https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/commit/?h=perf/core&id=61b985e3e775a3a75fda04ce7ef1b1aefc4758bc

Cheers,

- Arnaldo
 
> perf c2c is also a memory profiling tool. Apply the two new data source
> fields to perf c2c as well.
> 
> Extend perf c2c to display the number of loads which blocked by data or
> address conflict.
> 
> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
> ---
>  tools/perf/builtin-c2c.c     | 3 +++
>  tools/perf/util/mem-events.c | 6 ++++++
>  tools/perf/util/mem-events.h | 2 ++
>  3 files changed, 11 insertions(+)
> 
> diff --git a/tools/perf/builtin-c2c.c b/tools/perf/builtin-c2c.c
> index c5babea..4de49ae 100644
> --- a/tools/perf/builtin-c2c.c
> +++ b/tools/perf/builtin-c2c.c
> @@ -2111,6 +2111,8 @@ static void print_c2c__display_stats(FILE *out)
>  	fprintf(out, "  Load MESI State Exclusive         : %10d\n", stats->ld_excl);
>  	fprintf(out, "  Load MESI State Shared            : %10d\n", stats->ld_shared);
>  	fprintf(out, "  Load LLC Misses                   : %10d\n", llc_misses);
> +	fprintf(out, "  Load access blocked by data       : %10d\n", stats->blk_data);
> +	fprintf(out, "  Load access blocked by address    : %10d\n", stats->blk_addr);
>  	fprintf(out, "  LLC Misses to Local DRAM          : %10.1f%%\n", ((double)stats->lcl_dram/(double)llc_misses) * 100.);
>  	fprintf(out, "  LLC Misses to Remote DRAM         : %10.1f%%\n", ((double)stats->rmt_dram/(double)llc_misses) * 100.);
>  	fprintf(out, "  LLC Misses to Remote cache (HIT)  : %10.1f%%\n", ((double)stats->rmt_hit /(double)llc_misses) * 100.);
> @@ -2139,6 +2141,7 @@ static void print_shared_cacheline_info(FILE *out)
>  	fprintf(out, "  L2D hits on shared lines          : %10d\n", stats->ld_l2hit);
>  	fprintf(out, "  LLC hits on shared lines          : %10d\n", stats->ld_llchit + stats->lcl_hitm);
>  	fprintf(out, "  Locked Access on shared lines     : %10d\n", stats->locks);
> +	fprintf(out, "  Blocked Access on shared lines    : %10d\n", stats->blk_data + stats->blk_addr);
>  	fprintf(out, "  Store HITs on shared lines        : %10d\n", stats->store);
>  	fprintf(out, "  Store L1D hits on shared lines    : %10d\n", stats->st_l1hit);
>  	fprintf(out, "  Total Merged records              : %10d\n", hitm_cnt + stats->store);
> diff --git a/tools/perf/util/mem-events.c b/tools/perf/util/mem-events.c
> index 890f638..f93a852 100644
> --- a/tools/perf/util/mem-events.c
> +++ b/tools/perf/util/mem-events.c
> @@ -385,6 +385,7 @@ int c2c_decode_stats(struct c2c_stats *stats, struct mem_info *mi)
>  	u64 lvl    = data_src->mem_lvl;
>  	u64 snoop  = data_src->mem_snoop;
>  	u64 lock   = data_src->mem_lock;
> +	u64 blk    = data_src->mem_blk;
>  	/*
>  	 * Skylake might report unknown remote level via this
>  	 * bit, consider it when evaluating remote HITMs.
> @@ -404,6 +405,9 @@ do {				\
>  
>  	if (lock & P(LOCK, LOCKED)) stats->locks++;
>  
> +	if (blk & P(BLK, DATA)) stats->blk_data++;
> +	if (blk & P(BLK, ADDR)) stats->blk_addr++;
> +
>  	if (op & P(OP, LOAD)) {
>  		/* load */
>  		stats->load++;
> @@ -515,6 +519,8 @@ void c2c_add_stats(struct c2c_stats *stats, struct c2c_stats *add)
>  	stats->rmt_hit		+= add->rmt_hit;
>  	stats->lcl_dram		+= add->lcl_dram;
>  	stats->rmt_dram		+= add->rmt_dram;
> +	stats->blk_data		+= add->blk_data;
> +	stats->blk_addr		+= add->blk_addr;
>  	stats->nomap		+= add->nomap;
>  	stats->noparse		+= add->noparse;
>  }
> diff --git a/tools/perf/util/mem-events.h b/tools/perf/util/mem-events.h
> index 5ddf447..755cef7 100644
> --- a/tools/perf/util/mem-events.h
> +++ b/tools/perf/util/mem-events.h
> @@ -79,6 +79,8 @@ struct c2c_stats {
>  	u32	rmt_hit;             /* count of loads with remote hit clean; */
>  	u32	lcl_dram;            /* count of loads miss to local DRAM */
>  	u32	rmt_dram;            /* count of loads miss to remote DRAM */
> +	u32	blk_data;            /* count of loads blocked by data */
> +	u32	blk_addr;            /* count of loads blocked by address conflict */
>  	u32	nomap;               /* count of load/stores with no phys adrs */
>  	u32	noparse;             /* count of unparsable data sources */
>  };
> -- 
> 2.7.4
> 

-- 

- Arnaldo
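
The accounting this patch adds boils down to testing two bits of the sample's
data source. A minimal sketch, assuming the PERF_MEM_BLK_* encoding added
earlier in this series (mem_blk at bit 40 of PERF_SAMPLE_DATA_SRC); the stub
struct is illustrative, not perf's own:

```c
#include <stdint.h>

/* Values follow the uapi additions in this series (linux/perf_event.h). */
#define PERF_MEM_BLK_NA		0x01	/* not available */
#define PERF_MEM_BLK_DATA	0x02	/* data could not be forwarded */
#define PERF_MEM_BLK_ADDR	0x04	/* address conflict */
#define PERF_MEM_BLK_SHIFT	40

struct blk_stats {
	uint32_t blk_data;	/* loads blocked by data */
	uint32_t blk_addr;	/* loads blocked by address conflict */
};

static void blk_decode(struct blk_stats *stats, uint64_t data_src)
{
	/* mem_blk is a 3-bit field at PERF_MEM_BLK_SHIFT. */
	uint64_t blk = (data_src >> PERF_MEM_BLK_SHIFT) & 0x7;

	if (blk & PERF_MEM_BLK_DATA)
		stats->blk_data++;
	if (blk & PERF_MEM_BLK_ADDR)
		stats->blk_addr++;
}
```

A sample can have both bits set, so one load may bump both counters, matching
the summed "Blocked Access on shared lines" line in the c2c output.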


* Re: [PATCH 5/9] perf tools: Support PERF_SAMPLE_WEIGHT_STRUCT
  2021-02-02 20:09 ` [PATCH 5/9] perf tools: Support PERF_SAMPLE_WEIGHT_STRUCT kan.liang
@ 2021-02-03 20:31   ` Arnaldo Carvalho de Melo
  2021-02-03 21:19     ` Liang, Kan
  0 siblings, 1 reply; 34+ messages in thread
From: Arnaldo Carvalho de Melo @ 2021-02-03 20:31 UTC (permalink / raw)
  To: kan.liang
  Cc: mingo, linux-kernel, peterz, eranian, namhyung, jolsa, ak,
	yao.jin, maddy

Em Tue, Feb 02, 2021 at 12:09:09PM -0800, kan.liang@linux.intel.com escreveu:
> From: Kan Liang <kan.liang@linux.intel.com>
> 
> The new sample type, PERF_SAMPLE_WEIGHT_STRUCT, is an alternative to the
> PERF_SAMPLE_WEIGHT sample type. Users can apply either the
> PERF_SAMPLE_WEIGHT sample type or the PERF_SAMPLE_WEIGHT_STRUCT sample
> type to retrieve the sample weight, but they cannot apply both sample
> types simultaneously. The new sample type shares the same space as the
> PERF_SAMPLE_WEIGHT sample type. The lower 32 bits are exactly the same
> for both sample types. The higher 32 bits may differ across
> architectures.
> 
> Add an arch-specific arch_evsel__set_sample_weight() to set the new sample
> type for X86. Only store the lower 32 bits of the sample->weight if the
> new sample type is applied. In practice, no memory access could last
> longer than 4G cycles, so no data will be lost.
> 
> If the kernel doesn't support the new sample type, fall back to the
> PERF_SAMPLE_WEIGHT sample type.
> 
> There is no impact for other architectures.
> 
> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
> ---
>  tools/perf/arch/x86/util/Build            |  1 +
>  tools/perf/arch/x86/util/evsel.c          |  8 ++++++++
>  tools/perf/util/evsel.c                   | 28 ++++++++++++++++++++++++----
>  tools/perf/util/evsel.h                   |  3 +++
>  tools/perf/util/intel-pt.c                | 22 +++++++++++++++++++---
>  tools/perf/util/perf_event_attr_fprintf.c |  2 +-
>  tools/perf/util/session.c                 |  2 +-
>  tools/perf/util/synthetic-events.c        |  6 ++++--
>  8 files changed, 61 insertions(+), 11 deletions(-)
>  create mode 100644 tools/perf/arch/x86/util/evsel.c
> 
> diff --git a/tools/perf/arch/x86/util/Build b/tools/perf/arch/x86/util/Build
> index d73f548..18848b3 100644
> --- a/tools/perf/arch/x86/util/Build
> +++ b/tools/perf/arch/x86/util/Build
> @@ -7,6 +7,7 @@ perf-y += topdown.o
>  perf-y += machine.o
>  perf-y += event.o
>  perf-y += mem-events.o
> +perf-y += evsel.o
>  
>  perf-$(CONFIG_DWARF) += dwarf-regs.o
>  perf-$(CONFIG_BPF_PROLOGUE) += dwarf-regs.o
> diff --git a/tools/perf/arch/x86/util/evsel.c b/tools/perf/arch/x86/util/evsel.c
> new file mode 100644
> index 0000000..2f733cd
> --- /dev/null
> +++ b/tools/perf/arch/x86/util/evsel.c
> @@ -0,0 +1,8 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include <stdio.h>
> +#include "util/evsel.h"
> +
> +void arch_evsel__set_sample_weight(struct evsel *evsel)
> +{
> +	evsel__set_sample_bit(evsel, WEIGHT_STRUCT);
> +}
> diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
> index c48f6de..0a2a307 100644
> --- a/tools/perf/util/evsel.c
> +++ b/tools/perf/util/evsel.c
> @@ -1012,6 +1012,11 @@ struct evsel_config_term *__evsel__get_config_term(struct evsel *evsel, enum evs
>  	return found_term;
>  }
>  
> +void __weak arch_evsel__set_sample_weight(struct evsel *evsel)
> +{
> +	evsel__set_sample_bit(evsel, WEIGHT);
> +}
> +
>  /*
>   * The enable_on_exec/disabled value strategy:
>   *
> @@ -1166,7 +1171,7 @@ void evsel__config(struct evsel *evsel, struct record_opts *opts,
>  	}
>  
>  	if (opts->sample_weight)
> -		evsel__set_sample_bit(evsel, WEIGHT);
> +		arch_evsel__set_sample_weight(evsel);
>  
>  	attr->task  = track;
>  	attr->mmap  = track;
> @@ -1735,6 +1740,10 @@ static int evsel__open_cpu(struct evsel *evsel, struct perf_cpu_map *cpus,
>  	}
>  
>  fallback_missing_features:
> +	if (perf_missing_features.weight_struct) {
> +		evsel__set_sample_bit(evsel, WEIGHT);
> +		evsel__reset_sample_bit(evsel, WEIGHT_STRUCT);
> +	}
>  	if (perf_missing_features.clockid_wrong)
>  		evsel->core.attr.clockid = CLOCK_MONOTONIC; /* should always work */
>  	if (perf_missing_features.clockid) {
> @@ -1873,7 +1882,12 @@ static int evsel__open_cpu(struct evsel *evsel, struct perf_cpu_map *cpus,
>  	 * Must probe features in the order they were added to the
>  	 * perf_event_attr interface.
>  	 */
> -        if (!perf_missing_features.data_page_size &&
> +	if (!perf_missing_features.weight_struct &&
> +	    (evsel->core.attr.sample_type & PERF_SAMPLE_WEIGHT_STRUCT)) {
> +		perf_missing_features.weight_struct = true;
> +		pr_debug2("switching off weight struct support\n");
> +		goto fallback_missing_features;
> +	} else if (!perf_missing_features.data_page_size &&
>  	    (evsel->core.attr.sample_type & PERF_SAMPLE_DATA_PAGE_SIZE)) {
>  		perf_missing_features.data_page_size = true;
>  		pr_debug2_peo("Kernel has no PERF_SAMPLE_DATA_PAGE_SIZE support, bailing out\n");
> @@ -2316,9 +2330,15 @@ int evsel__parse_sample(struct evsel *evsel, union perf_event *event,
>  		}
>  	}
>  
> -	if (type & PERF_SAMPLE_WEIGHT) {
> +	if (type & PERF_SAMPLE_WEIGHT_TYPE) {
> +		union perf_sample_weight weight;
> +
>  		OVERFLOW_CHECK_u64(array);
> -		data->weight = *array;
> +		weight.full = *array;
> +		if (type & PERF_SAMPLE_WEIGHT)
> +			data->weight = weight.full;
> +		else
> +			data->weight = weight.var1_dw;
>  		array++;
>  	}
>  
> diff --git a/tools/perf/util/evsel.h b/tools/perf/util/evsel.h
> index cd1d8dd..5c161a2 100644
> --- a/tools/perf/util/evsel.h
> +++ b/tools/perf/util/evsel.h
> @@ -145,6 +145,7 @@ struct perf_missing_features {
>  	bool branch_hw_idx;
>  	bool cgroup;
>  	bool data_page_size;
> +	bool weight_struct;
>  };
>  
>  extern struct perf_missing_features perf_missing_features;
> @@ -239,6 +240,8 @@ void __evsel__reset_sample_bit(struct evsel *evsel, enum perf_event_sample_forma
>  
>  void evsel__set_sample_id(struct evsel *evsel, bool use_sample_identifier);
>  
> +void arch_evsel__set_sample_weight(struct evsel *evsel);
> +
>  int evsel__set_filter(struct evsel *evsel, const char *filter);
>  int evsel__append_tp_filter(struct evsel *evsel, const char *filter);
>  int evsel__append_addr_filter(struct evsel *evsel, const char *filter);
> diff --git a/tools/perf/util/intel-pt.c b/tools/perf/util/intel-pt.c
> index 60214de..a929f6d 100644
> --- a/tools/perf/util/intel-pt.c
> +++ b/tools/perf/util/intel-pt.c
> @@ -1853,13 +1853,29 @@ static int intel_pt_synth_pebs_sample(struct intel_pt_queue *ptq)
>  	if (sample_type & PERF_SAMPLE_ADDR && items->has_mem_access_address)
>  		sample.addr = items->mem_access_address;
>  
> -	if (sample_type & PERF_SAMPLE_WEIGHT) {
> +	if (sample_type & PERF_SAMPLE_WEIGHT_TYPE) {
>  		/*
>  		 * Refer kernel's setup_pebs_adaptive_sample_data() and
>  		 * intel_hsw_weight().
>  		 */
> -		if (items->has_mem_access_latency)
> -			sample.weight = items->mem_access_latency;
> +		if (items->has_mem_access_latency) {
> +			u64 weight = items->mem_access_latency >> 32;
> +
> +			/*
> +			 * Starting from SPR, the mem access latency field
> +			 * contains both the cache latency [47:32] and the
> +			 * instruction latency [15:0]. The cache latency is the
> +			 * same as the mem access latency on previous platforms.
> +			 *
> +			 * In practice, no memory access could last longer than
> +			 * 4G cycles. Use latency >> 32 to distinguish the
> +			 * different formats of the mem access latency field.
> +			 */
> +			if (weight > 0)
> +				sample.weight = weight & 0xffff;
> +			else
> +				sample.weight = items->mem_access_latency;
> +		}
>  		if (!sample.weight && items->has_tsx_aux_info) {
>  			/* Cycles last block */
>  			sample.weight = (u32)items->tsx_aux_info;
> diff --git a/tools/perf/util/perf_event_attr_fprintf.c b/tools/perf/util/perf_event_attr_fprintf.c
> index fb0bb66..e972e63 100644
> --- a/tools/perf/util/perf_event_attr_fprintf.c
> +++ b/tools/perf/util/perf_event_attr_fprintf.c
> @@ -35,7 +35,7 @@ static void __p_sample_type(char *buf, size_t size, u64 value)
>  		bit_name(BRANCH_STACK), bit_name(REGS_USER), bit_name(STACK_USER),
>  		bit_name(IDENTIFIER), bit_name(REGS_INTR), bit_name(DATA_SRC),
>  		bit_name(WEIGHT), bit_name(PHYS_ADDR), bit_name(AUX),
> -		bit_name(CGROUP), bit_name(DATA_PAGE_SIZE),
> +		bit_name(CGROUP), bit_name(DATA_PAGE_SIZE), bit_name(WEIGHT_STRUCT),

I have CODE_PAGE_SIZE in my perf/core branch, was this somehow removed?

https://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git/commit/?h=perf/core&id=c1de7f3d84ca324c7cda85c3ce27b11741af2124

I see, you did this patchkit on top of upstream, that has just
DATA_PAGE_SIZE, while my perf/core branch has CODE_PAGE_SIZE. I'm
adjusting it, please next time do tooling development on acme/perf/core.

- Arnaldo

>  		{ .name = NULL, }
>  	};
>  #undef bit_name
> diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
> index 25adbcc..a35d8c2 100644
> --- a/tools/perf/util/session.c
> +++ b/tools/perf/util/session.c
> @@ -1297,7 +1297,7 @@ static void dump_sample(struct evsel *evsel, union perf_event *event,
>  	if (sample_type & PERF_SAMPLE_STACK_USER)
>  		stack_user__printf(&sample->user_stack);
>  
> -	if (sample_type & PERF_SAMPLE_WEIGHT)
> +	if (sample_type & PERF_SAMPLE_WEIGHT_TYPE)
>  		printf("... weight: %" PRIu64 "\n", sample->weight);
>  
>  	if (sample_type & PERF_SAMPLE_DATA_SRC)
> diff --git a/tools/perf/util/synthetic-events.c b/tools/perf/util/synthetic-events.c
> index 2947e3f..bc16268 100644
> --- a/tools/perf/util/synthetic-events.c
> +++ b/tools/perf/util/synthetic-events.c
> @@ -1384,7 +1384,7 @@ size_t perf_event__sample_event_size(const struct perf_sample *sample, u64 type,
>  		}
>  	}
>  
> -	if (type & PERF_SAMPLE_WEIGHT)
> +	if (type & PERF_SAMPLE_WEIGHT_TYPE)
>  		result += sizeof(u64);
>  
>  	if (type & PERF_SAMPLE_DATA_SRC)
> @@ -1555,8 +1555,10 @@ int perf_event__synthesize_sample(union perf_event *event, u64 type, u64 read_fo
>  		}
>  	}
>  
> -	if (type & PERF_SAMPLE_WEIGHT) {
> +	if (type & PERF_SAMPLE_WEIGHT_TYPE) {
>  		*array = sample->weight;
> +		if (type & PERF_SAMPLE_WEIGHT_STRUCT)
> +			*array &= 0xffffffff;
>  		array++;
>  	}
>  
> -- 
> 2.7.4
> 

-- 

- Arnaldo
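
The split layout described in the commit message can be sketched with a
little-endian union mirroring the uapi perf_sample_weight union this series
adds; the function and variable names below are illustrative, not perf's own:

```c
#include <stdint.h>

/* Little-endian view: the low 32 bits carry what PERF_SAMPLE_WEIGHT used
 * to carry, and on Sapphire Rapids var2_w carries the instruction
 * latency. */
union sample_weight {
	uint64_t full;
	struct {
		uint32_t var1_dw;	/* memory access latency */
		uint16_t var2_w;	/* instruction latency (x86/SPR) */
		uint16_t var3_w;	/* reserved */
	};
};

static void parse_weight(uint64_t raw, int has_weight_struct,
			 uint64_t *weight, uint16_t *ins_lat)
{
	union sample_weight w = { .full = raw };

	if (has_weight_struct) {
		*weight = w.var1_dw;	/* keep only the lower 32 bits */
		*ins_lat = w.var2_w;
	} else {
		*weight = w.full;	/* legacy PERF_SAMPLE_WEIGHT */
		*ins_lat = 0;
	}
}
```

This is also why the fallback is safe: a kernel that only knows
PERF_SAMPLE_WEIGHT simply yields the full 64-bit value with no instruction
latency.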


* Re: [PATCH 6/9] perf report: Support instruction latency
  2021-02-02 20:09 ` [PATCH 6/9] perf report: Support instruction latency kan.liang
@ 2021-02-03 20:43   ` Arnaldo Carvalho de Melo
  2021-02-04 13:11   ` Athira Rajeev
  2021-02-05 11:08   ` Namhyung Kim
  2 siblings, 0 replies; 34+ messages in thread
From: Arnaldo Carvalho de Melo @ 2021-02-03 20:43 UTC (permalink / raw)
  To: kan.liang
  Cc: mingo, linux-kernel, peterz, eranian, namhyung, jolsa, ak,
	yao.jin, maddy

Em Tue, Feb 02, 2021 at 12:09:10PM -0800, kan.liang@linux.intel.com escreveu:
> From: Kan Liang <kan.liang@linux.intel.com>
> 
> The instruction latency information can be recorded on some platforms,
> e.g., the Intel Sapphire Rapids server. With both the memory latency
> (weight) and the new instruction latency information, users can easily
> locate expensive load instructions, understand the time spent in the
> different pipeline stages, and optimize their applications accordingly.
> 
> The 'weight' field is shared among different architectures. Reusing the
> 'weight' field may impact other architectures. Add a new field to store
> the instruction latency.
> 
> Like the 'weight' support, introduce an 'ins_lat' field for the global
> instruction latency, and a 'local_ins_lat' field for the local
> instruction latency version.

This one also had to be fixed up wrt code page size.
 
> Add new sort functions, INSTR Latency and Local INSTR Latency,
> accordingly.
> 
> Add local_ins_lat to the default_mem_sort_order[].
> 
> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
> ---
>  tools/perf/Documentation/perf-report.txt |  6 +++-
>  tools/perf/util/event.h                  |  1 +
>  tools/perf/util/evsel.c                  |  4 ++-
>  tools/perf/util/hist.c                   | 12 ++++++--
>  tools/perf/util/hist.h                   |  2 ++
>  tools/perf/util/intel-pt.c               |  5 ++--
>  tools/perf/util/session.c                |  8 ++++--
>  tools/perf/util/sort.c                   | 47 +++++++++++++++++++++++++++++++-
>  tools/perf/util/sort.h                   |  3 ++
>  tools/perf/util/synthetic-events.c       |  4 ++-
>  10 files changed, 81 insertions(+), 11 deletions(-)
> 
> diff --git a/tools/perf/Documentation/perf-report.txt b/tools/perf/Documentation/perf-report.txt
> index 826b5a9..0565b7c 100644
> --- a/tools/perf/Documentation/perf-report.txt
> +++ b/tools/perf/Documentation/perf-report.txt
> @@ -108,6 +108,9 @@ OPTIONS
>  	- period: Raw number of event count of sample
>  	- time: Separate the samples by time stamp with the resolution specified by
>  	--time-quantum (default 100ms). Specify with overhead and before it.
> +	- ins_lat: Instruction latency in core cycles. This is the global
> +	instruction latency
> +	- local_ins_lat: Local instruction latency version
>  
>  	By default, comm, dso and symbol keys are used.
>  	(i.e. --sort comm,dso,symbol)
> @@ -154,7 +157,8 @@ OPTIONS
>  	- blocked: reason of blocked load access for the data at the time of the sample
>  
>  	And the default sort keys are changed to local_weight, mem, sym, dso,
> -	symbol_daddr, dso_daddr, snoop, tlb, locked, blocked, see '--mem-mode'.
> +	symbol_daddr, dso_daddr, snoop, tlb, locked, blocked, local_ins_lat,
> +	see '--mem-mode'.
>  
>  	If the data file has tracepoint event(s), following (dynamic) sort keys
>  	are also available:
> diff --git a/tools/perf/util/event.h b/tools/perf/util/event.h
> index ff403ea..5d50a49 100644
> --- a/tools/perf/util/event.h
> +++ b/tools/perf/util/event.h
> @@ -141,6 +141,7 @@ struct perf_sample {
>  	u16 insn_len;
>  	u8  cpumode;
>  	u16 misc;
> +	u16 ins_lat;
>  	bool no_hw_idx;		/* No hw_idx collected in branch_stack */
>  	char insn[MAX_INSN];
>  	void *raw_data;
> diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
> index 0a2a307..24c0b59 100644
> --- a/tools/perf/util/evsel.c
> +++ b/tools/perf/util/evsel.c
> @@ -2337,8 +2337,10 @@ int evsel__parse_sample(struct evsel *evsel, union perf_event *event,
>  		weight.full = *array;
>  		if (type & PERF_SAMPLE_WEIGHT)
>  			data->weight = weight.full;
> -		else
> +		else {
>  			data->weight = weight.var1_dw;
> +			data->ins_lat = weight.var2_w;
> +		}
>  		array++;
>  	}
>  
> diff --git a/tools/perf/util/hist.c b/tools/perf/util/hist.c
> index 6866ab0..8a432f8 100644
> --- a/tools/perf/util/hist.c
> +++ b/tools/perf/util/hist.c
> @@ -209,6 +209,8 @@ void hists__calc_col_len(struct hists *hists, struct hist_entry *h)
>  	hists__new_col_len(hists, HISTC_LOCAL_WEIGHT, 12);
>  	hists__new_col_len(hists, HISTC_GLOBAL_WEIGHT, 12);
>  	hists__new_col_len(hists, HISTC_MEM_BLOCKED, 10);
> +	hists__new_col_len(hists, HISTC_LOCAL_INS_LAT, 13);
> +	hists__new_col_len(hists, HISTC_GLOBAL_INS_LAT, 13);
>  	if (symbol_conf.nanosecs)
>  		hists__new_col_len(hists, HISTC_TIME, 16);
>  	else
> @@ -286,12 +288,13 @@ static long hist_time(unsigned long htime)
>  }
>  
>  static void he_stat__add_period(struct he_stat *he_stat, u64 period,
> -				u64 weight)
> +				u64 weight, u64 ins_lat)
>  {
>  
>  	he_stat->period		+= period;
>  	he_stat->weight		+= weight;
>  	he_stat->nr_events	+= 1;
> +	he_stat->ins_lat	+= ins_lat;
>  }
>  
>  static void he_stat__add_stat(struct he_stat *dest, struct he_stat *src)
> @@ -303,6 +306,7 @@ static void he_stat__add_stat(struct he_stat *dest, struct he_stat *src)
>  	dest->period_guest_us	+= src->period_guest_us;
>  	dest->nr_events		+= src->nr_events;
>  	dest->weight		+= src->weight;
> +	dest->ins_lat		+= src->ins_lat;
>  }
>  
>  static void he_stat__decay(struct he_stat *he_stat)
> @@ -591,6 +595,7 @@ static struct hist_entry *hists__findnew_entry(struct hists *hists,
>  	int64_t cmp;
>  	u64 period = entry->stat.period;
>  	u64 weight = entry->stat.weight;
> +	u64 ins_lat = entry->stat.ins_lat;
>  	bool leftmost = true;
>  
>  	p = &hists->entries_in->rb_root.rb_node;
> @@ -609,11 +614,11 @@ static struct hist_entry *hists__findnew_entry(struct hists *hists,
>  
>  		if (!cmp) {
>  			if (sample_self) {
> -				he_stat__add_period(&he->stat, period, weight);
> +				he_stat__add_period(&he->stat, period, weight, ins_lat);
>  				hist_entry__add_callchain_period(he, period);
>  			}
>  			if (symbol_conf.cumulate_callchain)
> -				he_stat__add_period(he->stat_acc, period, weight);
> +				he_stat__add_period(he->stat_acc, period, weight, ins_lat);
>  
>  			/*
>  			 * This mem info was allocated from sample__resolve_mem
> @@ -723,6 +728,7 @@ __hists__add_entry(struct hists *hists,
>  			.nr_events = 1,
>  			.period	= sample->period,
>  			.weight = sample->weight,
> +			.ins_lat = sample->ins_lat,
>  		},
>  		.parent = sym_parent,
>  		.filtered = symbol__parent_filter(sym_parent) | al->filtered,
> diff --git a/tools/perf/util/hist.h b/tools/perf/util/hist.h
> index 522486b..36bca33 100644
> --- a/tools/perf/util/hist.h
> +++ b/tools/perf/util/hist.h
> @@ -72,6 +72,8 @@ enum hist_column {
>  	HISTC_DSO_SIZE,
>  	HISTC_SYMBOL_IPC,
>  	HISTC_MEM_BLOCKED,
> +	HISTC_LOCAL_INS_LAT,
> +	HISTC_GLOBAL_INS_LAT,
>  	HISTC_NR_COLS, /* Last entry */
>  };
>  
> diff --git a/tools/perf/util/intel-pt.c b/tools/perf/util/intel-pt.c
> index a929f6d..c9477d0 100644
> --- a/tools/perf/util/intel-pt.c
> +++ b/tools/perf/util/intel-pt.c
> @@ -1871,9 +1871,10 @@ static int intel_pt_synth_pebs_sample(struct intel_pt_queue *ptq)
>  			 * cycles. Use latency >> 32 to distinguish the
>  			 * different format of the mem access latency field.
>  			 */
> -			if (weight > 0)
> +			if (weight > 0) {
>  				sample.weight = weight & 0xffff;
> -			else
> +				sample.ins_lat = items->mem_access_latency & 0xffff;
> +			} else
>  				sample.weight = items->mem_access_latency;
>  		}
>  		if (!sample.weight && items->has_tsx_aux_info) {
> diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
> index a35d8c2..330e591 100644
> --- a/tools/perf/util/session.c
> +++ b/tools/perf/util/session.c
> @@ -1297,8 +1297,12 @@ static void dump_sample(struct evsel *evsel, union perf_event *event,
>  	if (sample_type & PERF_SAMPLE_STACK_USER)
>  		stack_user__printf(&sample->user_stack);
>  
> -	if (sample_type & PERF_SAMPLE_WEIGHT_TYPE)
> -		printf("... weight: %" PRIu64 "\n", sample->weight);
> +	if (sample_type & PERF_SAMPLE_WEIGHT_TYPE) {
> +		printf("... weight: %" PRIu64 "", sample->weight);
> +			if (sample_type & PERF_SAMPLE_WEIGHT_STRUCT)
> +				printf(",0x%"PRIx16"", sample->ins_lat);
> +		printf("\n");
> +	}
>  
>  	if (sample_type & PERF_SAMPLE_DATA_SRC)
>  		printf(" . data_src: 0x%"PRIx64"\n", sample->data_src);
> diff --git a/tools/perf/util/sort.c b/tools/perf/util/sort.c
> index 249a03c..e0529f2 100644
> --- a/tools/perf/util/sort.c
> +++ b/tools/perf/util/sort.c
> @@ -36,7 +36,7 @@ const char	default_parent_pattern[] = "^sys_|^do_page_fault";
>  const char	*parent_pattern = default_parent_pattern;
>  const char	*default_sort_order = "comm,dso,symbol";
>  const char	default_branch_sort_order[] = "comm,dso_from,symbol_from,symbol_to,cycles";
> -const char	default_mem_sort_order[] = "local_weight,mem,sym,dso,symbol_daddr,dso_daddr,snoop,tlb,locked,blocked";
> +const char	default_mem_sort_order[] = "local_weight,mem,sym,dso,symbol_daddr,dso_daddr,snoop,tlb,locked,blocked,local_ins_lat";
>  const char	default_top_sort_order[] = "dso,symbol";
>  const char	default_diff_sort_order[] = "dso,symbol";
>  const char	default_tracepoint_sort_order[] = "trace";
> @@ -1365,6 +1365,49 @@ struct sort_entry sort_global_weight = {
>  	.se_width_idx	= HISTC_GLOBAL_WEIGHT,
>  };
>  
> +static u64 he_ins_lat(struct hist_entry *he)
> +{
> +		return he->stat.nr_events ? he->stat.ins_lat / he->stat.nr_events : 0;
> +}
> +
> +static int64_t
> +sort__local_ins_lat_cmp(struct hist_entry *left, struct hist_entry *right)
> +{
> +		return he_ins_lat(left) - he_ins_lat(right);
> +}
> +
> +static int hist_entry__local_ins_lat_snprintf(struct hist_entry *he, char *bf,
> +					      size_t size, unsigned int width)
> +{
> +		return repsep_snprintf(bf, size, "%-*u", width, he_ins_lat(he));
> +}
> +
> +struct sort_entry sort_local_ins_lat = {
> +	.se_header	= "Local INSTR Latency",
> +	.se_cmp		= sort__local_ins_lat_cmp,
> +	.se_snprintf	= hist_entry__local_ins_lat_snprintf,
> +	.se_width_idx	= HISTC_LOCAL_INS_LAT,
> +};
> +
> +static int64_t
> +sort__global_ins_lat_cmp(struct hist_entry *left, struct hist_entry *right)
> +{
> +		return left->stat.ins_lat - right->stat.ins_lat;
> +}
> +
> +static int hist_entry__global_ins_lat_snprintf(struct hist_entry *he, char *bf,
> +					       size_t size, unsigned int width)
> +{
> +		return repsep_snprintf(bf, size, "%-*u", width, he->stat.ins_lat);
> +}
> +
> +struct sort_entry sort_global_ins_lat = {
> +	.se_header	= "INSTR Latency",
> +	.se_cmp		= sort__global_ins_lat_cmp,
> +	.se_snprintf	= hist_entry__global_ins_lat_snprintf,
> +	.se_width_idx	= HISTC_GLOBAL_INS_LAT,
> +};
> +
>  struct sort_entry sort_mem_daddr_sym = {
>  	.se_header	= "Data Symbol",
>  	.se_cmp		= sort__daddr_cmp,
> @@ -1770,6 +1813,8 @@ static struct sort_dimension common_sort_dimensions[] = {
>  	DIM(SORT_CGROUP_ID, "cgroup_id", sort_cgroup_id),
>  	DIM(SORT_SYM_IPC_NULL, "ipc_null", sort_sym_ipc_null),
>  	DIM(SORT_TIME, "time", sort_time),
> +	DIM(SORT_LOCAL_INS_LAT, "local_ins_lat", sort_local_ins_lat),
> +	DIM(SORT_GLOBAL_INS_LAT, "ins_lat", sort_global_ins_lat),
>  };
>  
>  #undef DIM
> diff --git a/tools/perf/util/sort.h b/tools/perf/util/sort.h
> index 2b2645b..c92ca15 100644
> --- a/tools/perf/util/sort.h
> +++ b/tools/perf/util/sort.h
> @@ -50,6 +50,7 @@ struct he_stat {
>  	u64			period_guest_sys;
>  	u64			period_guest_us;
>  	u64			weight;
> +	u64			ins_lat;
>  	u32			nr_events;
>  };
>  
> @@ -229,6 +230,8 @@ enum sort_type {
>  	SORT_CGROUP_ID,
>  	SORT_SYM_IPC_NULL,
>  	SORT_TIME,
> +	SORT_LOCAL_INS_LAT,
> +	SORT_GLOBAL_INS_LAT,
>  
>  	/* branch stack specific sort keys */
>  	__SORT_BRANCH_STACK,
> diff --git a/tools/perf/util/synthetic-events.c b/tools/perf/util/synthetic-events.c
> index bc16268..95401c9 100644
> --- a/tools/perf/util/synthetic-events.c
> +++ b/tools/perf/util/synthetic-events.c
> @@ -1557,8 +1557,10 @@ int perf_event__synthesize_sample(union perf_event *event, u64 type, u64 read_fo
>  
>  	if (type & PERF_SAMPLE_WEIGHT_TYPE) {
>  		*array = sample->weight;
> -		if (type & PERF_SAMPLE_WEIGHT_STRUCT)
> +		if (type & PERF_SAMPLE_WEIGHT_STRUCT) {
>  			*array &= 0xffffffff;
> +			*array |= ((u64)sample->ins_lat << 32);
> +		}
>  		array++;
>  	}
>  
> -- 
> 2.7.4
> 

-- 

- Arnaldo

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 7/9] perf test: Support PERF_SAMPLE_WEIGHT_STRUCT
  2021-02-02 20:09 ` [PATCH 7/9] perf test: Support PERF_SAMPLE_WEIGHT_STRUCT kan.liang
@ 2021-02-03 20:44   ` Arnaldo Carvalho de Melo
  0 siblings, 0 replies; 34+ messages in thread
From: Arnaldo Carvalho de Melo @ 2021-02-03 20:44 UTC (permalink / raw)
  To: kan.liang
  Cc: mingo, linux-kernel, peterz, eranian, namhyung, jolsa, ak,
	yao.jin, maddy

Em Tue, Feb 02, 2021 at 12:09:11PM -0800, kan.liang@linux.intel.com escreveu:
> From: Kan Liang <kan.liang@linux.intel.com>
> 
> Support the new sample type for sample-parsing test case.

ditto wrt code page size
 
> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
> ---
>  tools/perf/tests/sample-parsing.c | 14 +++++++++++---
>  1 file changed, 11 insertions(+), 3 deletions(-)
> 
> diff --git a/tools/perf/tests/sample-parsing.c b/tools/perf/tests/sample-parsing.c
> index 2393916..c5739cc 100644
> --- a/tools/perf/tests/sample-parsing.c
> +++ b/tools/perf/tests/sample-parsing.c
> @@ -129,6 +129,9 @@ static bool samples_same(const struct perf_sample *s1,
>  	if (type & PERF_SAMPLE_WEIGHT)
>  		COMP(weight);
>  
> +	if (type & PERF_SAMPLE_WEIGHT_STRUCT)
> +		COMP(ins_lat);
> +
>  	if (type & PERF_SAMPLE_DATA_SRC)
>  		COMP(data_src);
>  
> @@ -238,6 +241,7 @@ static int do_test(u64 sample_type, u64 sample_regs, u64 read_format)
>  		.phys_addr	= 113,
>  		.cgroup		= 114,
>  		.data_page_size = 115,
> +		.ins_lat	= 116,
>  		.aux_sample	= {
>  			.size	= sizeof(aux_data),
>  			.data	= (void *)aux_data,
> @@ -344,7 +348,7 @@ int test__sample_parsing(struct test *test __maybe_unused, int subtest __maybe_u
>  	 * were added.  Please actually update the test rather than just change
>  	 * the condition below.
>  	 */
> -	if (PERF_SAMPLE_MAX > PERF_SAMPLE_CODE_PAGE_SIZE << 1) {
> +	if (PERF_SAMPLE_MAX > PERF_SAMPLE_WEIGHT_STRUCT << 1) {
>  		pr_debug("sample format has changed, some new PERF_SAMPLE_ bit was introduced - test needs updating\n");
>  		return -1;
>  	}
> @@ -374,8 +378,12 @@ int test__sample_parsing(struct test *test __maybe_unused, int subtest __maybe_u
>  			return err;
>  	}
>  
> -	/* Test all sample format bits together */
> -	sample_type = PERF_SAMPLE_MAX - 1;
> +	/*
> +	 * Test all sample format bits together
> +	 * Note: PERF_SAMPLE_WEIGHT and PERF_SAMPLE_WEIGHT_STRUCT cannot
> +	 *       be set simultaneously.
> +	 */
> +	sample_type = (PERF_SAMPLE_MAX - 1) & ~PERF_SAMPLE_WEIGHT;
>  	sample_regs = 0x3fff; /* shared yb intr and user regs */
>  	for (i = 0; i < ARRAY_SIZE(rf); i++) {
>  		err = do_test(sample_type, sample_regs, rf[i]);
> -- 
> 2.7.4
> 

-- 

- Arnaldo

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 5/9] perf tools: Support PERF_SAMPLE_WEIGHT_STRUCT
  2021-02-03 20:31   ` Arnaldo Carvalho de Melo
@ 2021-02-03 21:19     ` Liang, Kan
  2021-02-03 21:29       ` Arnaldo Carvalho de Melo
  0 siblings, 1 reply; 34+ messages in thread
From: Liang, Kan @ 2021-02-03 21:19 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: mingo, linux-kernel, peterz, eranian, namhyung, jolsa, ak,
	yao.jin, maddy



On 2/3/2021 3:31 PM, Arnaldo Carvalho de Melo wrote:
>> --- a/tools/perf/util/perf_event_attr_fprintf.c
>> +++ b/tools/perf/util/perf_event_attr_fprintf.c
>> @@ -35,7 +35,7 @@ static void __p_sample_type(char *buf, size_t size, u64 value)
>>   		bit_name(BRANCH_STACK), bit_name(REGS_USER), bit_name(STACK_USER),
>>   		bit_name(IDENTIFIER), bit_name(REGS_INTR), bit_name(DATA_SRC),
>>   		bit_name(WEIGHT), bit_name(PHYS_ADDR), bit_name(AUX),
>> -		bit_name(CGROUP), bit_name(DATA_PAGE_SIZE),
>> +		bit_name(CGROUP), bit_name(DATA_PAGE_SIZE), bit_name(WEIGHT_STRUCT),
> I have CODE_PAGE_SIZE in my perf/core branch, was this somehow removed?
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git/commit/?h=perf/core&id=c1de7f3d84ca324c7cda85c3ce27b11741af2124
> 
> I see, you did this patchkit on top of upstream, that has just
> DATA_PAGE_SIZE, while my perf/core branch has CODE_PAGE_SIZE. I'm
> adjusting it, please next time do tooling development on acme/perf/core.

Sorry, I will rebase the patchset on acme/perf/core.

Thanks,
Kan

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/9] perf tools: Support the auxiliary event
  2021-02-03 20:02   ` Arnaldo Carvalho de Melo
@ 2021-02-03 21:20     ` Liang, Kan
  2021-02-03 21:30       ` Arnaldo Carvalho de Melo
  0 siblings, 1 reply; 34+ messages in thread
From: Liang, Kan @ 2021-02-03 21:20 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: mingo, linux-kernel, peterz, eranian, namhyung, jolsa, ak,
	yao.jin, maddy



On 2/3/2021 3:02 PM, Arnaldo Carvalho de Melo wrote:
> Em Tue, Feb 02, 2021 at 12:09:06PM -0800,kan.liang@linux.intel.com  escreveu:
>> From: Kan Liang<kan.liang@linux.intel.com>
>> diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
>> index c26ea822..c48f6de 100644
>> --- a/tools/perf/util/evsel.c
>> +++ b/tools/perf/util/evsel.c
>> @@ -2689,6 +2689,9 @@ int evsel__open_strerror(struct evsel *evsel, struct target *target,
>>   		if (perf_missing_features.aux_output)
>>   			return scnprintf(msg, size, "The 'aux_output' feature is not supported, update the kernel.");
>>   		break;
>> +	case ENODATA:
>> +		return scnprintf(msg, size, "Cannot collect data source with the load latency event alone. "
>> +				 "Please add an auxiliary event in front of the load latency event.");
> Are you sure this is the only case where ENODATA comes out from
> perf_event_open()? Well, according to your comment in:
> 
>    61b985e3e775a3a7 ("perf/x86/intel: Add perf core PMU support for Sapphire Rapids")
> 
> It should be at that point in time, so its safe to merge as-is, but then
> I think this is fragile, what if someone else, in the future, not
> knowing that ENODATA is supposed to be used only with that ancient CPU,
> Sapphire Rapids, uses it?:-)
> 
> Please consider adding a check before assuming ENODATA is for this
> specific case.

Sure, I will add a check in V2.

Thanks,
Kan

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 5/9] perf tools: Support PERF_SAMPLE_WEIGHT_STRUCT
  2021-02-03 21:19     ` Liang, Kan
@ 2021-02-03 21:29       ` Arnaldo Carvalho de Melo
  0 siblings, 0 replies; 34+ messages in thread
From: Arnaldo Carvalho de Melo @ 2021-02-03 21:29 UTC (permalink / raw)
  To: Liang, Kan
  Cc: mingo, linux-kernel, peterz, eranian, namhyung, jolsa, ak,
	yao.jin, maddy

Em Wed, Feb 03, 2021 at 04:19:34PM -0500, Liang, Kan escreveu:
> 
> 
> On 2/3/2021 3:31 PM, Arnaldo Carvalho de Melo wrote:
> > > --- a/tools/perf/util/perf_event_attr_fprintf.c
> > > +++ b/tools/perf/util/perf_event_attr_fprintf.c
> > > @@ -35,7 +35,7 @@ static void __p_sample_type(char *buf, size_t size, u64 value)
> > >   		bit_name(BRANCH_STACK), bit_name(REGS_USER), bit_name(STACK_USER),
> > >   		bit_name(IDENTIFIER), bit_name(REGS_INTR), bit_name(DATA_SRC),
> > >   		bit_name(WEIGHT), bit_name(PHYS_ADDR), bit_name(AUX),
> > > -		bit_name(CGROUP), bit_name(DATA_PAGE_SIZE),
> > > +		bit_name(CGROUP), bit_name(DATA_PAGE_SIZE), bit_name(WEIGHT_STRUCT),
> > I have CODE_PAGE_SIZE in my perf/core branch, was this somehow removed?
> > 
> > https://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git/commit/?h=perf/core&id=c1de7f3d84ca324c7cda85c3ce27b11741af2124
> > 
> > I see, you did this patchkit on top of upstream, that has just
> > DATA_PAGE_SIZE, while my perf/core branch has CODE_PAGE_SIZE. I'm
> > adjusting it, please next time do tooling development on acme/perf/core.
> 
> Sorry, I will rebase the patchset on acme/perf/core.

No need. Look at tmp.perf/core and please test it; it's all already adjusted
there.


- Arnaldo

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/9] perf tools: Support the auxiliary event
  2021-02-03 21:20     ` Liang, Kan
@ 2021-02-03 21:30       ` Arnaldo Carvalho de Melo
  0 siblings, 0 replies; 34+ messages in thread
From: Arnaldo Carvalho de Melo @ 2021-02-03 21:30 UTC (permalink / raw)
  To: Liang, Kan
  Cc: mingo, linux-kernel, peterz, eranian, namhyung, jolsa, ak,
	yao.jin, maddy

Em Wed, Feb 03, 2021 at 04:20:38PM -0500, Liang, Kan escreveu:
> 
> 
> On 2/3/2021 3:02 PM, Arnaldo Carvalho de Melo wrote:
> > Em Tue, Feb 02, 2021 at 12:09:06PM -0800,kan.liang@linux.intel.com  escreveu:
> > > From: Kan Liang<kan.liang@linux.intel.com>
> > > diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
> > > index c26ea822..c48f6de 100644
> > > --- a/tools/perf/util/evsel.c
> > > +++ b/tools/perf/util/evsel.c
> > > @@ -2689,6 +2689,9 @@ int evsel__open_strerror(struct evsel *evsel, struct target *target,
> > >   		if (perf_missing_features.aux_output)
> > >   			return scnprintf(msg, size, "The 'aux_output' feature is not supported, update the kernel.");
> > >   		break;
> > > +	case ENODATA:
> > > +		return scnprintf(msg, size, "Cannot collect data source with the load latency event alone. "
> > > +				 "Please add an auxiliary event in front of the load latency event.");
> > Are you sure this is the only case where ENODATA comes out from
> > perf_event_open()? Well, according to your comment in:
> > 
> >    61b985e3e775a3a7 ("perf/x86/intel: Add perf core PMU support for Sapphire Rapids")
> > 
> > It should be at that point in time, so its safe to merge as-is, but then
> > I think this is fragile, what if someone else, in the future, not
> > knowing that ENODATA is supposed to be used only with that ancient CPU,
> > Sapphire Rapids, uses it?:-)
> > 
> > Please consider adding a check before assuming ENODATA is for this
> > specific case.
> 
> Sure, I will add a check in V2.

Do it as a separate patch, on top of what is now in tmp.perf/core.

https://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git/log/?h=tmp.perf/core

- Arnaldo

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 6/9] perf report: Support instruction latency
  2021-02-02 20:09 ` [PATCH 6/9] perf report: Support instruction latency kan.liang
  2021-02-03 20:43   ` Arnaldo Carvalho de Melo
@ 2021-02-04 13:11   ` Athira Rajeev
  2021-02-04 15:19     ` Liang, Kan
  2021-02-05 11:08   ` Namhyung Kim
  2 siblings, 1 reply; 34+ messages in thread
From: Athira Rajeev @ 2021-02-04 13:11 UTC (permalink / raw)
  To: Liang, Kan
  Cc: Arnaldo Carvalho de Melo, mingo, linux-kernel, Peter Zijlstra,
	eranian, namhyung, jolsa, ak, yao.jin, maddy



> On 03-Feb-2021, at 1:39 AM, kan.liang@linux.intel.com wrote:
> 
> From: Kan Liang <kan.liang@linux.intel.com>
> 
> The instruction latency information can be recorded on some platforms,
> e.g., the Intel Sapphire Rapids server. With both memory latency
> (weight) and the new instruction latency information, users can easily
> locate the expensive load instructions, and also understand the time
> spent in different stages. The users can optimize their applications
> in different pipeline stages.
> 
> The 'weight' field is shared among different architectures. Reusing the
> 'weight' field may impacts other architectures. Add a new field to store
> the instruction latency.
> 
> Like the 'weight' support, introduce a 'ins_lat' for the global
> instruction latency, and a 'local_ins_lat' for the local instruction
> latency version.
> 
> Add new sort functions, INSTR Latency and Local INSTR Latency,
> accordingly.
> 
> Add local_ins_lat to the default_mem_sort_order[].
> 
> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
> ---
> tools/perf/Documentation/perf-report.txt |  6 +++-
> tools/perf/util/event.h                  |  1 +
> tools/perf/util/evsel.c                  |  4 ++-
> tools/perf/util/hist.c                   | 12 ++++++--
> tools/perf/util/hist.h                   |  2 ++
> tools/perf/util/intel-pt.c               |  5 ++--
> tools/perf/util/session.c                |  8 ++++--
> tools/perf/util/sort.c                   | 47 +++++++++++++++++++++++++++++++-
> tools/perf/util/sort.h                   |  3 ++
> tools/perf/util/synthetic-events.c       |  4 ++-
> 10 files changed, 81 insertions(+), 11 deletions(-)
> 
> diff --git a/tools/perf/Documentation/perf-report.txt b/tools/perf/Documentation/perf-report.txt
> index 826b5a9..0565b7c 100644
> --- a/tools/perf/Documentation/perf-report.txt
> +++ b/tools/perf/Documentation/perf-report.txt
> @@ -108,6 +108,9 @@ OPTIONS
> 	- period: Raw number of event count of sample
> 	- time: Separate the samples by time stamp with the resolution specified by
> 	--time-quantum (default 100ms). Specify with overhead and before it.
> +	- ins_lat: Instruction latency in core cycles. This is the global
> +	instruction latency
> +	- local_ins_lat: Local instruction latency version
> 
> 	By default, comm, dso and symbol keys are used.
> 	(i.e. --sort comm,dso,symbol)
> @@ -154,7 +157,8 @@ OPTIONS
> 	- blocked: reason of blocked load access for the data at the time of the sample
> 
> 	And the default sort keys are changed to local_weight, mem, sym, dso,
> -	symbol_daddr, dso_daddr, snoop, tlb, locked, blocked, see '--mem-mode'.
> +	symbol_daddr, dso_daddr, snoop, tlb, locked, blocked, local_ins_lat,
> +	see '--mem-mode'.
> 
> 	If the data file has tracepoint event(s), following (dynamic) sort keys
> 	are also available:
> diff --git a/tools/perf/util/event.h b/tools/perf/util/event.h
> index ff403ea..5d50a49 100644
> --- a/tools/perf/util/event.h
> +++ b/tools/perf/util/event.h
> @@ -141,6 +141,7 @@ struct perf_sample {
> 	u16 insn_len;
> 	u8  cpumode;
> 	u16 misc;
> +	u16 ins_lat;
> 	bool no_hw_idx;		/* No hw_idx collected in branch_stack */
> 	char insn[MAX_INSN];
> 	void *raw_data;
> diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
> index 0a2a307..24c0b59 100644
> --- a/tools/perf/util/evsel.c
> +++ b/tools/perf/util/evsel.c
> @@ -2337,8 +2337,10 @@ int evsel__parse_sample(struct evsel *evsel, union perf_event *event,
> 		weight.full = *array;
> 		if (type & PERF_SAMPLE_WEIGHT)
> 			data->weight = weight.full;
> -		else
> +		else {
> 			data->weight = weight.var1_dw;
> +			data->ins_lat = weight.var2_w;
> +		}
> 		array++;
> 	}
> 
> diff --git a/tools/perf/util/hist.c b/tools/perf/util/hist.c
> index 6866ab0..8a432f8 100644
> --- a/tools/perf/util/hist.c
> +++ b/tools/perf/util/hist.c
> @@ -209,6 +209,8 @@ void hists__calc_col_len(struct hists *hists, struct hist_entry *h)
> 	hists__new_col_len(hists, HISTC_LOCAL_WEIGHT, 12);
> 	hists__new_col_len(hists, HISTC_GLOBAL_WEIGHT, 12);
> 	hists__new_col_len(hists, HISTC_MEM_BLOCKED, 10);
> +	hists__new_col_len(hists, HISTC_LOCAL_INS_LAT, 13);
> +	hists__new_col_len(hists, HISTC_GLOBAL_INS_LAT, 13);
> 	if (symbol_conf.nanosecs)
> 		hists__new_col_len(hists, HISTC_TIME, 16);
> 	else
> @@ -286,12 +288,13 @@ static long hist_time(unsigned long htime)
> }
> 
> static void he_stat__add_period(struct he_stat *he_stat, u64 period,
> -				u64 weight)
> +				u64 weight, u64 ins_lat)
> {
> 
> 	he_stat->period		+= period;
> 	he_stat->weight		+= weight;
> 	he_stat->nr_events	+= 1;
> +	he_stat->ins_lat	+= ins_lat;
> }
> 
> static void he_stat__add_stat(struct he_stat *dest, struct he_stat *src)
> @@ -303,6 +306,7 @@ static void he_stat__add_stat(struct he_stat *dest, struct he_stat *src)
> 	dest->period_guest_us	+= src->period_guest_us;
> 	dest->nr_events		+= src->nr_events;
> 	dest->weight		+= src->weight;
> +	dest->ins_lat		+= src->ins_lat;
> }
> 
> static void he_stat__decay(struct he_stat *he_stat)
> @@ -591,6 +595,7 @@ static struct hist_entry *hists__findnew_entry(struct hists *hists,
> 	int64_t cmp;
> 	u64 period = entry->stat.period;
> 	u64 weight = entry->stat.weight;
> +	u64 ins_lat = entry->stat.ins_lat;
> 	bool leftmost = true;
> 
> 	p = &hists->entries_in->rb_root.rb_node;
> @@ -609,11 +614,11 @@ static struct hist_entry *hists__findnew_entry(struct hists *hists,
> 
> 		if (!cmp) {
> 			if (sample_self) {
> -				he_stat__add_period(&he->stat, period, weight);
> +				he_stat__add_period(&he->stat, period, weight, ins_lat);
> 				hist_entry__add_callchain_period(he, period);
> 			}
> 			if (symbol_conf.cumulate_callchain)
> -				he_stat__add_period(he->stat_acc, period, weight);
> +				he_stat__add_period(he->stat_acc, period, weight, ins_lat);
> 
> 			/*
> 			 * This mem info was allocated from sample__resolve_mem
> @@ -723,6 +728,7 @@ __hists__add_entry(struct hists *hists,
> 			.nr_events = 1,
> 			.period	= sample->period,
> 			.weight = sample->weight,
> +			.ins_lat = sample->ins_lat,
> 		},
> 		.parent = sym_parent,
> 		.filtered = symbol__parent_filter(sym_parent) | al->filtered,
> diff --git a/tools/perf/util/hist.h b/tools/perf/util/hist.h
> index 522486b..36bca33 100644
> --- a/tools/perf/util/hist.h
> +++ b/tools/perf/util/hist.h
> @@ -72,6 +72,8 @@ enum hist_column {
> 	HISTC_DSO_SIZE,
> 	HISTC_SYMBOL_IPC,
> 	HISTC_MEM_BLOCKED,
> +	HISTC_LOCAL_INS_LAT,
> +	HISTC_GLOBAL_INS_LAT,
> 	HISTC_NR_COLS, /* Last entry */
> };
> 
> diff --git a/tools/perf/util/intel-pt.c b/tools/perf/util/intel-pt.c
> index a929f6d..c9477d0 100644
> --- a/tools/perf/util/intel-pt.c
> +++ b/tools/perf/util/intel-pt.c
> @@ -1871,9 +1871,10 @@ static int intel_pt_synth_pebs_sample(struct intel_pt_queue *ptq)
> 			 * cycles. Use latency >> 32 to distinguish the
> 			 * different format of the mem access latency field.
> 			 */
> -			if (weight > 0)
> +			if (weight > 0) {
> 				sample.weight = weight & 0xffff;
> -			else
> +				sample.ins_lat = items->mem_access_latency & 0xffff;
> +			} else
> 				sample.weight = items->mem_access_latency;
> 		}
> 		if (!sample.weight && items->has_tsx_aux_info) {
> diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
> index a35d8c2..330e591 100644
> --- a/tools/perf/util/session.c
> +++ b/tools/perf/util/session.c
> @@ -1297,8 +1297,12 @@ static void dump_sample(struct evsel *evsel, union perf_event *event,
> 	if (sample_type & PERF_SAMPLE_STACK_USER)
> 		stack_user__printf(&sample->user_stack);
> 
> -	if (sample_type & PERF_SAMPLE_WEIGHT_TYPE)
> -		printf("... weight: %" PRIu64 "\n", sample->weight);
> +	if (sample_type & PERF_SAMPLE_WEIGHT_TYPE) {
> +		printf("... weight: %" PRIu64 "", sample->weight);
> +			if (sample_type & PERF_SAMPLE_WEIGHT_STRUCT)
> +				printf(",0x%"PRIx16"", sample->ins_lat);
> +		printf("\n");
> +	}
> 
> 	if (sample_type & PERF_SAMPLE_DATA_SRC)
> 		printf(" . data_src: 0x%"PRIx64"\n", sample->data_src);
> diff --git a/tools/perf/util/sort.c b/tools/perf/util/sort.c
> index 249a03c..e0529f2 100644
> --- a/tools/perf/util/sort.c
> +++ b/tools/perf/util/sort.c
> @@ -36,7 +36,7 @@ const char	default_parent_pattern[] = "^sys_|^do_page_fault";
> const char	*parent_pattern = default_parent_pattern;
> const char	*default_sort_order = "comm,dso,symbol";
> const char	default_branch_sort_order[] = "comm,dso_from,symbol_from,symbol_to,cycles";
> -const char	default_mem_sort_order[] = "local_weight,mem,sym,dso,symbol_daddr,dso_daddr,snoop,tlb,locked,blocked";
> +const char	default_mem_sort_order[] = "local_weight,mem,sym,dso,symbol_daddr,dso_daddr,snoop,tlb,locked,blocked,local_ins_lat";
> const char	default_top_sort_order[] = "dso,symbol";
> const char	default_diff_sort_order[] = "dso,symbol";
> const char	default_tracepoint_sort_order[] = "trace";
> @@ -1365,6 +1365,49 @@ struct sort_entry sort_global_weight = {
> 	.se_width_idx	= HISTC_GLOBAL_WEIGHT,
> };
> 
> +static u64 he_ins_lat(struct hist_entry *he)
> +{
> +		return he->stat.nr_events ? he->stat.ins_lat / he->stat.nr_events : 0;
> +}
> +
> +static int64_t
> +sort__local_ins_lat_cmp(struct hist_entry *left, struct hist_entry *right)
> +{
> +		return he_ins_lat(left) - he_ins_lat(right);
> +}
> +
> +static int hist_entry__local_ins_lat_snprintf(struct hist_entry *he, char *bf,
> +					      size_t size, unsigned int width)
> +{
> +		return repsep_snprintf(bf, size, "%-*u", width, he_ins_lat(he));
> +}
> +
> +struct sort_entry sort_local_ins_lat = {
> +	.se_header	= "Local INSTR Latency",
> +	.se_cmp		= sort__local_ins_lat_cmp,
> +	.se_snprintf	= hist_entry__local_ins_lat_snprintf,
> +	.se_width_idx	= HISTC_LOCAL_INS_LAT,
> +};

Hi Kan Liang,

Currently, with these changes, perf will display "Local INSTR Latency" for the new column in 'perf mem report'.

Can we make this header string architecture-specific?
On other archs, the var2_w field of 'perf_sample_weight' could be used to capture something other than the local instruction latency.
Can we have some weak function to populate the header string?


Thanks
Athira Rajeev
> +
> +static int64_t
> +sort__global_ins_lat_cmp(struct hist_entry *left, struct hist_entry *right)
> +{
> +		return left->stat.ins_lat - right->stat.ins_lat;
> +}
> +
> +static int hist_entry__global_ins_lat_snprintf(struct hist_entry *he, char *bf,
> +					       size_t size, unsigned int width)
> +{
> +		return repsep_snprintf(bf, size, "%-*u", width, he->stat.ins_lat);
> +}
> +
> +struct sort_entry sort_global_ins_lat = {
> +	.se_header	= "INSTR Latency",
> +	.se_cmp		= sort__global_ins_lat_cmp,
> +	.se_snprintf	= hist_entry__global_ins_lat_snprintf,
> +	.se_width_idx	= HISTC_GLOBAL_INS_LAT,

> +};
> +
> struct sort_entry sort_mem_daddr_sym = {
> 	.se_header	= "Data Symbol",
> 	.se_cmp		= sort__daddr_cmp,
> @@ -1770,6 +1813,8 @@ static struct sort_dimension common_sort_dimensions[] = {
> 	DIM(SORT_CGROUP_ID, "cgroup_id", sort_cgroup_id),
> 	DIM(SORT_SYM_IPC_NULL, "ipc_null", sort_sym_ipc_null),
> 	DIM(SORT_TIME, "time", sort_time),
> +	DIM(SORT_LOCAL_INS_LAT, "local_ins_lat", sort_local_ins_lat),
> +	DIM(SORT_GLOBAL_INS_LAT, "ins_lat", sort_global_ins_lat),
> };
> 
> #undef DIM
> diff --git a/tools/perf/util/sort.h b/tools/perf/util/sort.h
> index 2b2645b..c92ca15 100644
> --- a/tools/perf/util/sort.h
> +++ b/tools/perf/util/sort.h
> @@ -50,6 +50,7 @@ struct he_stat {
> 	u64			period_guest_sys;
> 	u64			period_guest_us;
> 	u64			weight;
> +	u64			ins_lat;
> 	u32			nr_events;
> };
> 
> @@ -229,6 +230,8 @@ enum sort_type {
> 	SORT_CGROUP_ID,
> 	SORT_SYM_IPC_NULL,
> 	SORT_TIME,
> +	SORT_LOCAL_INS_LAT,
> +	SORT_GLOBAL_INS_LAT,
> 
> 	/* branch stack specific sort keys */
> 	__SORT_BRANCH_STACK,
> diff --git a/tools/perf/util/synthetic-events.c b/tools/perf/util/synthetic-events.c
> index bc16268..95401c9 100644
> --- a/tools/perf/util/synthetic-events.c
> +++ b/tools/perf/util/synthetic-events.c
> @@ -1557,8 +1557,10 @@ int perf_event__synthesize_sample(union perf_event *event, u64 type, u64 read_fo
> 
> 	if (type & PERF_SAMPLE_WEIGHT_TYPE) {
> 		*array = sample->weight;
> -		if (type & PERF_SAMPLE_WEIGHT_STRUCT)
> +		if (type & PERF_SAMPLE_WEIGHT_STRUCT) {
> 			*array &= 0xffffffff;
> +			*array |= ((u64)sample->ins_lat << 32);
> +		}
> 		array++;
> 	}
> 
> -- 
> 2.7.4
> 
> 
> 


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 6/9] perf report: Support instruction latency
  2021-02-04 13:11   ` Athira Rajeev
@ 2021-02-04 15:19     ` Liang, Kan
  2021-02-05 12:55       ` Athira Rajeev
  0 siblings, 1 reply; 34+ messages in thread
From: Liang, Kan @ 2021-02-04 15:19 UTC (permalink / raw)
  To: Athira Rajeev
  Cc: Arnaldo Carvalho de Melo, mingo, linux-kernel, Peter Zijlstra,
	eranian, namhyung, jolsa, ak, yao.jin, maddy



On 2/4/2021 8:11 AM, Athira Rajeev wrote:
> 
> 
>> On 03-Feb-2021, at 1:39 AM, kan.liang@linux.intel.com wrote:
>>
>> From: Kan Liang <kan.liang@linux.intel.com>
>>
>> The instruction latency information can be recorded on some platforms,
>> e.g., the Intel Sapphire Rapids server. With both memory latency
>> (weight) and the new instruction latency information, users can easily
>> locate the expensive load instructions, and also understand the time
>> spent in different stages. The users can optimize their applications
>> in different pipeline stages.
>>
>> The 'weight' field is shared among different architectures. Reusing the
>> 'weight' field may impacts other architectures. Add a new field to store
>> the instruction latency.
>>
>> Like the 'weight' support, introduce a 'ins_lat' for the global
>> instruction latency, and a 'local_ins_lat' for the local instruction
>> latency version.
>>
>> Add new sort functions, INSTR Latency and Local INSTR Latency,
>> accordingly.
>>
>> Add local_ins_lat to the default_mem_sort_order[].
>>
>> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
>> ---
>> tools/perf/Documentation/perf-report.txt |  6 +++-
>> tools/perf/util/event.h                  |  1 +
>> tools/perf/util/evsel.c                  |  4 ++-
>> tools/perf/util/hist.c                   | 12 ++++++--
>> tools/perf/util/hist.h                   |  2 ++
>> tools/perf/util/intel-pt.c               |  5 ++--
>> tools/perf/util/session.c                |  8 ++++--
>> tools/perf/util/sort.c                   | 47 +++++++++++++++++++++++++++++++-
>> tools/perf/util/sort.h                   |  3 ++
>> tools/perf/util/synthetic-events.c       |  4 ++-
>> 10 files changed, 81 insertions(+), 11 deletions(-)
>>
>> diff --git a/tools/perf/Documentation/perf-report.txt b/tools/perf/Documentation/perf-report.txt
>> index 826b5a9..0565b7c 100644
>> --- a/tools/perf/Documentation/perf-report.txt
>> +++ b/tools/perf/Documentation/perf-report.txt
>> @@ -108,6 +108,9 @@ OPTIONS
>> 	- period: Raw number of event count of sample
>> 	- time: Separate the samples by time stamp with the resolution specified by
>> 	--time-quantum (default 100ms). Specify with overhead and before it.
>> +	- ins_lat: Instruction latency in core cycles. This is the global
>> +	instruction latency
>> +	- local_ins_lat: Local instruction latency version
>>
>> 	By default, comm, dso and symbol keys are used.
>> 	(i.e. --sort comm,dso,symbol)
>> @@ -154,7 +157,8 @@ OPTIONS
>> 	- blocked: reason of blocked load access for the data at the time of the sample
>>
>> 	And the default sort keys are changed to local_weight, mem, sym, dso,
>> -	symbol_daddr, dso_daddr, snoop, tlb, locked, blocked, see '--mem-mode'.
>> +	symbol_daddr, dso_daddr, snoop, tlb, locked, blocked, local_ins_lat,
>> +	see '--mem-mode'.
>>
>> 	If the data file has tracepoint event(s), following (dynamic) sort keys
>> 	are also available:
>> diff --git a/tools/perf/util/event.h b/tools/perf/util/event.h
>> index ff403ea..5d50a49 100644
>> --- a/tools/perf/util/event.h
>> +++ b/tools/perf/util/event.h
>> @@ -141,6 +141,7 @@ struct perf_sample {
>> 	u16 insn_len;
>> 	u8  cpumode;
>> 	u16 misc;
>> +	u16 ins_lat;
>> 	bool no_hw_idx;		/* No hw_idx collected in branch_stack */
>> 	char insn[MAX_INSN];
>> 	void *raw_data;
>> diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
>> index 0a2a307..24c0b59 100644
>> --- a/tools/perf/util/evsel.c
>> +++ b/tools/perf/util/evsel.c
>> @@ -2337,8 +2337,10 @@ int evsel__parse_sample(struct evsel *evsel, union perf_event *event,
>> 		weight.full = *array;
>> 		if (type & PERF_SAMPLE_WEIGHT)
>> 			data->weight = weight.full;
>> -		else
>> +		else {
>> 			data->weight = weight.var1_dw;
>> +			data->ins_lat = weight.var2_w;
>> +		}
>> 		array++;
>> 	}
>>
>> diff --git a/tools/perf/util/hist.c b/tools/perf/util/hist.c
>> index 6866ab0..8a432f8 100644
>> --- a/tools/perf/util/hist.c
>> +++ b/tools/perf/util/hist.c
>> @@ -209,6 +209,8 @@ void hists__calc_col_len(struct hists *hists, struct hist_entry *h)
>> 	hists__new_col_len(hists, HISTC_LOCAL_WEIGHT, 12);
>> 	hists__new_col_len(hists, HISTC_GLOBAL_WEIGHT, 12);
>> 	hists__new_col_len(hists, HISTC_MEM_BLOCKED, 10);
>> +	hists__new_col_len(hists, HISTC_LOCAL_INS_LAT, 13);
>> +	hists__new_col_len(hists, HISTC_GLOBAL_INS_LAT, 13);
>> 	if (symbol_conf.nanosecs)
>> 		hists__new_col_len(hists, HISTC_TIME, 16);
>> 	else
>> @@ -286,12 +288,13 @@ static long hist_time(unsigned long htime)
>> }
>>
>> static void he_stat__add_period(struct he_stat *he_stat, u64 period,
>> -				u64 weight)
>> +				u64 weight, u64 ins_lat)
>> {
>>
>> 	he_stat->period		+= period;
>> 	he_stat->weight		+= weight;
>> 	he_stat->nr_events	+= 1;
>> +	he_stat->ins_lat	+= ins_lat;
>> }
>>
>> static void he_stat__add_stat(struct he_stat *dest, struct he_stat *src)
>> @@ -303,6 +306,7 @@ static void he_stat__add_stat(struct he_stat *dest, struct he_stat *src)
>> 	dest->period_guest_us	+= src->period_guest_us;
>> 	dest->nr_events		+= src->nr_events;
>> 	dest->weight		+= src->weight;
>> +	dest->ins_lat		+= src->ins_lat;
>> }
>>
>> static void he_stat__decay(struct he_stat *he_stat)
>> @@ -591,6 +595,7 @@ static struct hist_entry *hists__findnew_entry(struct hists *hists,
>> 	int64_t cmp;
>> 	u64 period = entry->stat.period;
>> 	u64 weight = entry->stat.weight;
>> +	u64 ins_lat = entry->stat.ins_lat;
>> 	bool leftmost = true;
>>
>> 	p = &hists->entries_in->rb_root.rb_node;
>> @@ -609,11 +614,11 @@ static struct hist_entry *hists__findnew_entry(struct hists *hists,
>>
>> 		if (!cmp) {
>> 			if (sample_self) {
>> -				he_stat__add_period(&he->stat, period, weight);
>> +				he_stat__add_period(&he->stat, period, weight, ins_lat);
>> 				hist_entry__add_callchain_period(he, period);
>> 			}
>> 			if (symbol_conf.cumulate_callchain)
>> -				he_stat__add_period(he->stat_acc, period, weight);
>> +				he_stat__add_period(he->stat_acc, period, weight, ins_lat);
>>
>> 			/*
>> 			 * This mem info was allocated from sample__resolve_mem
>> @@ -723,6 +728,7 @@ __hists__add_entry(struct hists *hists,
>> 			.nr_events = 1,
>> 			.period	= sample->period,
>> 			.weight = sample->weight,
>> +			.ins_lat = sample->ins_lat,
>> 		},
>> 		.parent = sym_parent,
>> 		.filtered = symbol__parent_filter(sym_parent) | al->filtered,
>> diff --git a/tools/perf/util/hist.h b/tools/perf/util/hist.h
>> index 522486b..36bca33 100644
>> --- a/tools/perf/util/hist.h
>> +++ b/tools/perf/util/hist.h
>> @@ -72,6 +72,8 @@ enum hist_column {
>> 	HISTC_DSO_SIZE,
>> 	HISTC_SYMBOL_IPC,
>> 	HISTC_MEM_BLOCKED,
>> +	HISTC_LOCAL_INS_LAT,
>> +	HISTC_GLOBAL_INS_LAT,
>> 	HISTC_NR_COLS, /* Last entry */
>> };
>>
>> diff --git a/tools/perf/util/intel-pt.c b/tools/perf/util/intel-pt.c
>> index a929f6d..c9477d0 100644
>> --- a/tools/perf/util/intel-pt.c
>> +++ b/tools/perf/util/intel-pt.c
>> @@ -1871,9 +1871,10 @@ static int intel_pt_synth_pebs_sample(struct intel_pt_queue *ptq)
>> 			 * cycles. Use latency >> 32 to distinguish the
>> 			 * different format of the mem access latency field.
>> 			 */
>> -			if (weight > 0)
>> +			if (weight > 0) {
>> 				sample.weight = weight & 0xffff;
>> -			else
>> +				sample.ins_lat = items->mem_access_latency & 0xffff;
>> +			} else
>> 				sample.weight = items->mem_access_latency;
>> 		}
>> 		if (!sample.weight && items->has_tsx_aux_info) {
>> diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
>> index a35d8c2..330e591 100644
>> --- a/tools/perf/util/session.c
>> +++ b/tools/perf/util/session.c
>> @@ -1297,8 +1297,12 @@ static void dump_sample(struct evsel *evsel, union perf_event *event,
>> 	if (sample_type & PERF_SAMPLE_STACK_USER)
>> 		stack_user__printf(&sample->user_stack);
>>
>> -	if (sample_type & PERF_SAMPLE_WEIGHT_TYPE)
>> -		printf("... weight: %" PRIu64 "\n", sample->weight);
>> +	if (sample_type & PERF_SAMPLE_WEIGHT_TYPE) {
>> +		printf("... weight: %" PRIu64 "", sample->weight);
>> +			if (sample_type & PERF_SAMPLE_WEIGHT_STRUCT)
>> +				printf(",0x%"PRIx16"", sample->ins_lat);
>> +		printf("\n");
>> +	}
>>
>> 	if (sample_type & PERF_SAMPLE_DATA_SRC)
>> 		printf(" . data_src: 0x%"PRIx64"\n", sample->data_src);
>> diff --git a/tools/perf/util/sort.c b/tools/perf/util/sort.c
>> index 249a03c..e0529f2 100644
>> --- a/tools/perf/util/sort.c
>> +++ b/tools/perf/util/sort.c
>> @@ -36,7 +36,7 @@ const char	default_parent_pattern[] = "^sys_|^do_page_fault";
>> const char	*parent_pattern = default_parent_pattern;
>> const char	*default_sort_order = "comm,dso,symbol";
>> const char	default_branch_sort_order[] = "comm,dso_from,symbol_from,symbol_to,cycles";
>> -const char	default_mem_sort_order[] = "local_weight,mem,sym,dso,symbol_daddr,dso_daddr,snoop,tlb,locked,blocked";
>> +const char	default_mem_sort_order[] = "local_weight,mem,sym,dso,symbol_daddr,dso_daddr,snoop,tlb,locked,blocked,local_ins_lat";
>> const char	default_top_sort_order[] = "dso,symbol";
>> const char	default_diff_sort_order[] = "dso,symbol";
>> const char	default_tracepoint_sort_order[] = "trace";
>> @@ -1365,6 +1365,49 @@ struct sort_entry sort_global_weight = {
>> 	.se_width_idx	= HISTC_GLOBAL_WEIGHT,
>> };
>>
>> +static u64 he_ins_lat(struct hist_entry *he)
>> +{
>> +		return he->stat.nr_events ? he->stat.ins_lat / he->stat.nr_events : 0;
>> +}
>> +
>> +static int64_t
>> +sort__local_ins_lat_cmp(struct hist_entry *left, struct hist_entry *right)
>> +{
>> +		return he_ins_lat(left) - he_ins_lat(right);
>> +}
>> +
>> +static int hist_entry__local_ins_lat_snprintf(struct hist_entry *he, char *bf,
>> +					      size_t size, unsigned int width)
>> +{
>> +		return repsep_snprintf(bf, size, "%-*u", width, he_ins_lat(he));
>> +}
>> +
>> +struct sort_entry sort_local_ins_lat = {
>> +	.se_header	= "Local INSTR Latency",
>> +	.se_cmp		= sort__local_ins_lat_cmp,
>> +	.se_snprintf	= hist_entry__local_ins_lat_snprintf,
>> +	.se_width_idx	= HISTC_LOCAL_INS_LAT,
>> +};
> 
> Hi Kan Liang,
> 
> Currently with this changes, perf will display "Local INSTR Latency”  for the new column in ‘perf mem report’
> 
> Can we make this header string to be Architecture Specific ?

I don't think the current perf tool supports an arch-specific header 
string. It sounds like a new feature, and I think it's a good one.

Besides the instruction latency, we also introduce a 'block' field for 
perf mem. I'm not sure whether that field applies to PowerPC. You may 
want to check it as well.

Could you please propose a patch to implement the feature? I think we 
can have a further discussion based on it.

> Because in other archs, the var2_w of ‘perf_sample_weight’ could be used to capture something else than the Local INSTR Latency.
> Can we have some weak function to populate the header string ?
> 

I agree that var2_w has different meanings among architectures. We 
should not force it into data->ins_lat.

The patch as below should fix it. Does it work for you?

diff --git a/tools/perf/arch/x86/util/event.c b/tools/perf/arch/x86/util/event.c
index 047dc00..9b31734 100644
--- a/tools/perf/arch/x86/util/event.c
+++ b/tools/perf/arch/x86/util/event.c
@@ -75,3 +75,28 @@ int perf_event__synthesize_extra_kmaps(struct perf_tool *tool,
  }

  #endif
+
+void arch_perf_parse_sample_weight(struct perf_sample *data,
+				   const __u64 *array, u64 type)
+{
+	union perf_sample_weight weight;
+
+	weight.full = *array;
+	if (type & PERF_SAMPLE_WEIGHT)
+		data->weight = weight.full;
+	else {
+		data->weight = weight.var1_dw;
+		data->ins_lat = weight.var2_w;
+	}
+}
+
+void arch_perf_synthesize_sample_weight(const struct perf_sample *data,
+					__u64 *array, u64 type)
+{
+	*array = data->weight;
+
+	if (type & PERF_SAMPLE_WEIGHT_STRUCT) {
+		*array &= 0xffffffff;
+		*array |= ((u64)data->ins_lat << 32);
+	}
+}
diff --git a/tools/perf/util/event.h b/tools/perf/util/event.h
index 60752e4..9f123aa 100644
--- a/tools/perf/util/event.h
+++ b/tools/perf/util/event.h
@@ -414,4 +414,7 @@ extern unsigned int proc_map_timeout;
  #define PAGE_SIZE_NAME_LEN	32
  char *get_page_size_name(u64 size, char *str);

+void arch_perf_parse_sample_weight(struct perf_sample *data, const __u64 *array, u64 type);
+void arch_perf_synthesize_sample_weight(const struct perf_sample *data, __u64 *array, u64 type);
+
  #endif /* __PERF_RECORD_H */
diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index 30b5452..1da8903 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -2106,6 +2106,13 @@ perf_event__check_size(union perf_event *event, unsigned int sample_size)
  	return 0;
  }

+void __weak arch_perf_parse_sample_weight(struct perf_sample *data,
+					  const __u64 *array,
+					  u64 type __maybe_unused)
+{
+	data->weight = *array;
+}
+
  int evsel__parse_sample(struct evsel *evsel, union perf_event *event,
  			struct perf_sample *data)
  {
@@ -2347,16 +2354,8 @@ int evsel__parse_sample(struct evsel *evsel, union perf_event *event,
  	}

  	if (type & PERF_SAMPLE_WEIGHT_TYPE) {
-		union perf_sample_weight weight;
-
  		OVERFLOW_CHECK_u64(array);
-		weight.full = *array;
-		if (type & PERF_SAMPLE_WEIGHT)
-			data->weight = weight.full;
-		else {
-			data->weight = weight.var1_dw;
-			data->ins_lat = weight.var2_w;
-		}
+		arch_perf_parse_sample_weight(data, array, type);
  		array++;
  	}

diff --git a/tools/perf/util/synthetic-events.c b/tools/perf/util/synthetic-events.c
index c6f9db3..412f4c3 100644
--- a/tools/perf/util/synthetic-events.c
+++ b/tools/perf/util/synthetic-events.c
@@ -1507,6 +1507,12 @@ size_t perf_event__sample_event_size(const struct perf_sample *sample, u64 type,
  	return result;
  }

+void __weak arch_perf_synthesize_sample_weight(const struct perf_sample *data,
+					       __u64 *array, u64 type __maybe_unused)
+{
+	*array = data->weight;
+}
+
  int perf_event__synthesize_sample(union perf_event *event, u64 type, u64 read_format,
  				  const struct perf_sample *sample)
  {
@@ -1643,11 +1649,7 @@ int perf_event__synthesize_sample(union perf_event *event, u64 type, u64 read_fo
  	}

  	if (type & PERF_SAMPLE_WEIGHT_TYPE) {
-		*array = sample->weight;
-		if (type & PERF_SAMPLE_WEIGHT_STRUCT) {
-			*array &= 0xffffffff;
-			*array |= ((u64)sample->ins_lat << 32);
-		}
+		arch_perf_synthesize_sample_weight(sample, array, type);
  		array++;
  	}

Thanks,
Kan

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/9] perf tools: Support the auxiliary event
  2021-02-02 20:09 ` [PATCH 2/9] perf tools: Support the auxiliary event kan.liang
  2021-02-03 20:02   ` Arnaldo Carvalho de Melo
@ 2021-02-05 10:52   ` Namhyung Kim
  2021-02-05 14:13     ` Liang, Kan
  1 sibling, 1 reply; 34+ messages in thread
From: Namhyung Kim @ 2021-02-05 10:52 UTC (permalink / raw)
  To: Kan Liang
  Cc: Arnaldo Carvalho de Melo, Ingo Molnar, linux-kernel,
	Peter Zijlstra, Stephane Eranian, Jiri Olsa, Andi Kleen, Yao Jin,
	maddy

On Wed, Feb 3, 2021 at 5:14 AM <kan.liang@linux.intel.com> wrote:
>
> From: Kan Liang <kan.liang@linux.intel.com>
>
> On the Intel Sapphire Rapids server, an auxiliary event has to be
> enabled simultaneously with the load latency event to retrieve complete
> Memory Info.
>
> Add X86 specific perf_mem_events__name() to handle the auxiliary event.
> - Users are only interested in the samples of the mem-loads event.
>   Sample-read the auxiliary event.
> - The auxiliary event must be in front of the load latency event in a
>   group. Assume the second event to sample if the auxiliary event is the
>   leader.
> - Add a weak is_mem_loads_aux_event() to check the auxiliary event for
>   X86. For other ARCHs, it always returns false.
>
> Parse the unique event name, mem-loads-aux, for the auxiliary event.
>
> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
> ---
>  tools/perf/arch/x86/util/Build        |  1 +
>  tools/perf/arch/x86/util/mem-events.c | 44 +++++++++++++++++++++++++++++++++++
>  tools/perf/util/evsel.c               |  3 +++
>  tools/perf/util/mem-events.c          |  5 ++++
>  tools/perf/util/mem-events.h          |  2 ++
>  tools/perf/util/parse-events.l        |  1 +
>  tools/perf/util/record.c              |  5 +++-
>  7 files changed, 60 insertions(+), 1 deletion(-)
>  create mode 100644 tools/perf/arch/x86/util/mem-events.c
>
> diff --git a/tools/perf/arch/x86/util/Build b/tools/perf/arch/x86/util/Build
> index 347c39b..d73f548 100644
> --- a/tools/perf/arch/x86/util/Build
> +++ b/tools/perf/arch/x86/util/Build
> @@ -6,6 +6,7 @@ perf-y += perf_regs.o
>  perf-y += topdown.o
>  perf-y += machine.o
>  perf-y += event.o
> +perf-y += mem-events.o
>
>  perf-$(CONFIG_DWARF) += dwarf-regs.o
>  perf-$(CONFIG_BPF_PROLOGUE) += dwarf-regs.o
> diff --git a/tools/perf/arch/x86/util/mem-events.c b/tools/perf/arch/x86/util/mem-events.c
> new file mode 100644
> index 0000000..11b8469
> --- /dev/null
> +++ b/tools/perf/arch/x86/util/mem-events.c
> @@ -0,0 +1,44 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include "util/pmu.h"
> +#include "map_symbol.h"
> +#include "mem-events.h"
> +
> +static char mem_loads_name[100];
> +static bool mem_loads_name__init;
> +
> +#define MEM_LOADS_AUX          0x8203
> +#define MEM_LOADS_AUX_NAME     "{cpu/mem-loads-aux/,cpu/mem-loads,ldlat=%u/pp}:S"
> +
> +bool is_mem_loads_aux_event(struct evsel *leader)
> +{
> +       if (!pmu_have_event("cpu", "mem-loads-aux"))
> +               return false;
> +
> +       return leader->core.attr.config == MEM_LOADS_AUX;
> +}
> +
> +char *perf_mem_events__name(int i)
> +{
> +       struct perf_mem_event *e = perf_mem_events__ptr(i);
> +
> +       if (!e)
> +               return NULL;
> +
> +       if (i == PERF_MEM_EVENTS__LOAD) {
> +               if (mem_loads_name__init)
> +                       return mem_loads_name;
> +
> +               mem_loads_name__init = true;
> +
> +               if (pmu_have_event("cpu", "mem-loads-aux")) {
> +                       scnprintf(mem_loads_name, sizeof(MEM_LOADS_AUX_NAME),
> +                                 MEM_LOADS_AUX_NAME, perf_mem_events__loads_ldlat);

It changes "%u" to an actual latency value, right?
What if the value takes 3 or more digits?
I'm not sure scnprintf() will handle it properly.

Thanks,
Namhyung


> +               } else {
> +                       scnprintf(mem_loads_name, sizeof(mem_loads_name),
> +                                 e->name, perf_mem_events__loads_ldlat);
> +               }
> +               return mem_loads_name;
> +       }
> +
> +       return (char *)e->name;
> +}

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 3/9] perf tools: Support data block and addr block
  2021-02-02 20:09 ` [PATCH 3/9] perf tools: Support data block and addr block kan.liang
@ 2021-02-05 11:02   ` Namhyung Kim
  2021-02-05 14:17     ` Liang, Kan
  0 siblings, 1 reply; 34+ messages in thread
From: Namhyung Kim @ 2021-02-05 11:02 UTC (permalink / raw)
  To: Kan Liang
  Cc: Arnaldo Carvalho de Melo, Ingo Molnar, linux-kernel,
	Peter Zijlstra, Stephane Eranian, Jiri Olsa, Andi Kleen, Yao Jin,
	maddy

On Wed, Feb 3, 2021 at 5:14 AM <kan.liang@linux.intel.com> wrote:
>
> From: Kan Liang <kan.liang@linux.intel.com>
>
> Two new data source fields, to indicate the block reasons of a load
> instruction, are introduced on the Intel Sapphire Rapids server. The
> fields can be used by the memory profiling.
>
> Add a new sort function, SORT_MEM_BLOCKED, for the two fields.
>
> For the previous platforms or the block reason is unknown, print "N/A"
> for the block reason.
>
> Add blocked as a default mem sort key for perf report and
> perf mem report.
>
> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
> ---
[SNIP]
> +int perf_mem__blk_scnprintf(char *out, size_t sz, struct mem_info *mem_info)
> +{
> +       size_t l = 0;
> +       u64 mask = PERF_MEM_BLK_NA;
> +
> +       sz -= 1; /* -1 for null termination */
> +       out[0] = '\0';
> +
> +       if (mem_info)
> +               mask = mem_info->data_src.mem_blk;
> +
> +       if (!mask || (mask & PERF_MEM_BLK_NA)) {
> +               l += scnprintf(out + l, sz - l, " N/A");
> +               return l;
> +       }
> +       if (mask & PERF_MEM_BLK_DATA)
> +               l += scnprintf(out + l, sz - l, " Data");
> +       if (mask & PERF_MEM_BLK_ADDR)
> +               l += scnprintf(out + l, sz - l, " Addr");

So this means it's possible to have BLK_DATA and BLK_ADDR
together and in that case it'll print "DataAddr", right?

Thanks,
Namhyung


> +
> +       return l;
> +}
> +
>  int perf_script__meminfo_scnprintf(char *out, size_t sz, struct mem_info *mem_info)
>  {
>         int i = 0;
> @@ -348,6 +371,8 @@ int perf_script__meminfo_scnprintf(char *out, size_t sz, struct mem_info *mem_in
>         i += perf_mem__tlb_scnprintf(out + i, sz - i, mem_info);
>         i += scnprintf(out + i, sz - i, "|LCK ");
>         i += perf_mem__lck_scnprintf(out + i, sz - i, mem_info);
> +       i += scnprintf(out + i, sz - i, "|BLK ");
> +       i += perf_mem__blk_scnprintf(out + i, sz - i, mem_info);
>
>         return i;
>  }

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 6/9] perf report: Support instruction latency
  2021-02-02 20:09 ` [PATCH 6/9] perf report: Support instruction latency kan.liang
  2021-02-03 20:43   ` Arnaldo Carvalho de Melo
  2021-02-04 13:11   ` Athira Rajeev
@ 2021-02-05 11:08   ` Namhyung Kim
  2021-02-05 14:38     ` Liang, Kan
  2 siblings, 1 reply; 34+ messages in thread
From: Namhyung Kim @ 2021-02-05 11:08 UTC (permalink / raw)
  To: Kan Liang
  Cc: Arnaldo Carvalho de Melo, Ingo Molnar, linux-kernel,
	Peter Zijlstra, Stephane Eranian, Jiri Olsa, Andi Kleen, Yao Jin,
	maddy

On Wed, Feb 3, 2021 at 5:14 AM <kan.liang@linux.intel.com> wrote:
>
> From: Kan Liang <kan.liang@linux.intel.com>
>
> The instruction latency information can be recorded on some platforms,
> e.g., the Intel Sapphire Rapids server. With both memory latency
> (weight) and the new instruction latency information, users can easily
> locate the expensive load instructions, and also understand the time
> spent in different stages. The users can optimize their applications
> in different pipeline stages.
>
> The 'weight' field is shared among different architectures. Reusing the
> 'weight' field may impact other architectures. Add a new field to store
> the instruction latency.
>
> Like the 'weight' support, introduce a 'ins_lat' for the global
> instruction latency, and a 'local_ins_lat' for the local instruction
> latency version.

Could you please clarify the difference between the global latency
and the local latency?

Thanks,
Namhyung


>
> Add new sort functions, INSTR Latency and Local INSTR Latency,
> accordingly.
>
> Add local_ins_lat to the default_mem_sort_order[].
>
> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 6/9] perf report: Support instruction latency
  2021-02-04 15:19     ` Liang, Kan
@ 2021-02-05 12:55       ` Athira Rajeev
  2021-02-05 14:51         ` Liang, Kan
  0 siblings, 1 reply; 34+ messages in thread
From: Athira Rajeev @ 2021-02-05 12:55 UTC (permalink / raw)
  To: Liang, Kan
  Cc: Arnaldo Carvalho de Melo, mingo, linux-kernel, Peter Zijlstra,
	eranian, namhyung, jolsa, ak, yao.jin, maddy



> On 04-Feb-2021, at 8:49 PM, Liang, Kan <kan.liang@linux.intel.com> wrote:
> 
> 
> 
> On 2/4/2021 8:11 AM, Athira Rajeev wrote:
>>> On 03-Feb-2021, at 1:39 AM, kan.liang@linux.intel.com wrote:
>>> 
>>> From: Kan Liang <kan.liang@linux.intel.com>
>>> 
>>> The instruction latency information can be recorded on some platforms,
>>> e.g., the Intel Sapphire Rapids server. With both memory latency
>>> (weight) and the new instruction latency information, users can easily
>>> locate the expensive load instructions, and also understand the time
>>> spent in different stages. The users can optimize their applications
>>> in different pipeline stages.
>>> 
>>> The 'weight' field is shared among different architectures. Reusing the
>>> 'weight' field may impact other architectures. Add a new field to store
>>> the instruction latency.
>>> 
>>> Like the 'weight' support, introduce a 'ins_lat' for the global
>>> instruction latency, and a 'local_ins_lat' for the local instruction
>>> latency version.
>>> 
>>> Add new sort functions, INSTR Latency and Local INSTR Latency,
>>> accordingly.
>>> 
>>> Add local_ins_lat to the default_mem_sort_order[].
>>> 
>>> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
>>> ---
>>> tools/perf/Documentation/perf-report.txt |  6 +++-
>>> tools/perf/util/event.h                  |  1 +
>>> tools/perf/util/evsel.c                  |  4 ++-
>>> tools/perf/util/hist.c                   | 12 ++++++--
>>> tools/perf/util/hist.h                   |  2 ++
>>> tools/perf/util/intel-pt.c               |  5 ++--
>>> tools/perf/util/session.c                |  8 ++++--
>>> tools/perf/util/sort.c                   | 47 +++++++++++++++++++++++++++++++-
>>> tools/perf/util/sort.h                   |  3 ++
>>> tools/perf/util/synthetic-events.c       |  4 ++-
>>> 10 files changed, 81 insertions(+), 11 deletions(-)
>>> 
>>> diff --git a/tools/perf/Documentation/perf-report.txt b/tools/perf/Documentation/perf-report.txt
>>> index 826b5a9..0565b7c 100644
>>> --- a/tools/perf/Documentation/perf-report.txt
>>> +++ b/tools/perf/Documentation/perf-report.txt
>>> @@ -108,6 +108,9 @@ OPTIONS
>>> 	- period: Raw number of event count of sample
>>> 	- time: Separate the samples by time stamp with the resolution specified by
>>> 	--time-quantum (default 100ms). Specify with overhead and before it.
>>> +	- ins_lat: Instruction latency in core cycles. This is the global
>>> +	instruction latency
>>> +	- local_ins_lat: Local instruction latency version
>>> 
>>> 	By default, comm, dso and symbol keys are used.
>>> 	(i.e. --sort comm,dso,symbol)
>>> @@ -154,7 +157,8 @@ OPTIONS
>>> 	- blocked: reason of blocked load access for the data at the time of the sample
>>> 
>>> 	And the default sort keys are changed to local_weight, mem, sym, dso,
>>> -	symbol_daddr, dso_daddr, snoop, tlb, locked, blocked, see '--mem-mode'.
>>> +	symbol_daddr, dso_daddr, snoop, tlb, locked, blocked, local_ins_lat,
>>> +	see '--mem-mode'.
>>> 
>>> 	If the data file has tracepoint event(s), following (dynamic) sort keys
>>> 	are also available:
>>> diff --git a/tools/perf/util/event.h b/tools/perf/util/event.h
>>> index ff403ea..5d50a49 100644
>>> --- a/tools/perf/util/event.h
>>> +++ b/tools/perf/util/event.h
>>> @@ -141,6 +141,7 @@ struct perf_sample {
>>> 	u16 insn_len;
>>> 	u8  cpumode;
>>> 	u16 misc;
>>> +	u16 ins_lat;
>>> 	bool no_hw_idx;		/* No hw_idx collected in branch_stack */
>>> 	char insn[MAX_INSN];
>>> 	void *raw_data;
>>> diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
>>> index 0a2a307..24c0b59 100644
>>> --- a/tools/perf/util/evsel.c
>>> +++ b/tools/perf/util/evsel.c
>>> @@ -2337,8 +2337,10 @@ int evsel__parse_sample(struct evsel *evsel, union perf_event *event,
>>> 		weight.full = *array;
>>> 		if (type & PERF_SAMPLE_WEIGHT)
>>> 			data->weight = weight.full;
>>> -		else
>>> +		else {
>>> 			data->weight = weight.var1_dw;
>>> +			data->ins_lat = weight.var2_w;
>>> +		}
>>> 		array++;
>>> 	}
>>> 
>>> diff --git a/tools/perf/util/hist.c b/tools/perf/util/hist.c
>>> index 6866ab0..8a432f8 100644
>>> --- a/tools/perf/util/hist.c
>>> +++ b/tools/perf/util/hist.c
>>> @@ -209,6 +209,8 @@ void hists__calc_col_len(struct hists *hists, struct hist_entry *h)
>>> 	hists__new_col_len(hists, HISTC_LOCAL_WEIGHT, 12);
>>> 	hists__new_col_len(hists, HISTC_GLOBAL_WEIGHT, 12);
>>> 	hists__new_col_len(hists, HISTC_MEM_BLOCKED, 10);
>>> +	hists__new_col_len(hists, HISTC_LOCAL_INS_LAT, 13);
>>> +	hists__new_col_len(hists, HISTC_GLOBAL_INS_LAT, 13);
>>> 	if (symbol_conf.nanosecs)
>>> 		hists__new_col_len(hists, HISTC_TIME, 16);
>>> 	else
>>> @@ -286,12 +288,13 @@ static long hist_time(unsigned long htime)
>>> }
>>> 
>>> static void he_stat__add_period(struct he_stat *he_stat, u64 period,
>>> -				u64 weight)
>>> +				u64 weight, u64 ins_lat)
>>> {
>>> 
>>> 	he_stat->period		+= period;
>>> 	he_stat->weight		+= weight;
>>> 	he_stat->nr_events	+= 1;
>>> +	he_stat->ins_lat	+= ins_lat;
>>> }
>>> 
>>> static void he_stat__add_stat(struct he_stat *dest, struct he_stat *src)
>>> @@ -303,6 +306,7 @@ static void he_stat__add_stat(struct he_stat *dest, struct he_stat *src)
>>> 	dest->period_guest_us	+= src->period_guest_us;
>>> 	dest->nr_events		+= src->nr_events;
>>> 	dest->weight		+= src->weight;
>>> +	dest->ins_lat		+= src->ins_lat;
>>> }
>>> 
>>> static void he_stat__decay(struct he_stat *he_stat)
>>> @@ -591,6 +595,7 @@ static struct hist_entry *hists__findnew_entry(struct hists *hists,
>>> 	int64_t cmp;
>>> 	u64 period = entry->stat.period;
>>> 	u64 weight = entry->stat.weight;
>>> +	u64 ins_lat = entry->stat.ins_lat;
>>> 	bool leftmost = true;
>>> 
>>> 	p = &hists->entries_in->rb_root.rb_node;
>>> @@ -609,11 +614,11 @@ static struct hist_entry *hists__findnew_entry(struct hists *hists,
>>> 
>>> 		if (!cmp) {
>>> 			if (sample_self) {
>>> -				he_stat__add_period(&he->stat, period, weight);
>>> +				he_stat__add_period(&he->stat, period, weight, ins_lat);
>>> 				hist_entry__add_callchain_period(he, period);
>>> 			}
>>> 			if (symbol_conf.cumulate_callchain)
>>> -				he_stat__add_period(he->stat_acc, period, weight);
>>> +				he_stat__add_period(he->stat_acc, period, weight, ins_lat);
>>> 
>>> 			/*
>>> 			 * This mem info was allocated from sample__resolve_mem
>>> @@ -723,6 +728,7 @@ __hists__add_entry(struct hists *hists,
>>> 			.nr_events = 1,
>>> 			.period	= sample->period,
>>> 			.weight = sample->weight,
>>> +			.ins_lat = sample->ins_lat,
>>> 		},
>>> 		.parent = sym_parent,
>>> 		.filtered = symbol__parent_filter(sym_parent) | al->filtered,
>>> diff --git a/tools/perf/util/hist.h b/tools/perf/util/hist.h
>>> index 522486b..36bca33 100644
>>> --- a/tools/perf/util/hist.h
>>> +++ b/tools/perf/util/hist.h
>>> @@ -72,6 +72,8 @@ enum hist_column {
>>> 	HISTC_DSO_SIZE,
>>> 	HISTC_SYMBOL_IPC,
>>> 	HISTC_MEM_BLOCKED,
>>> +	HISTC_LOCAL_INS_LAT,
>>> +	HISTC_GLOBAL_INS_LAT,
>>> 	HISTC_NR_COLS, /* Last entry */
>>> };
>>> 
>>> diff --git a/tools/perf/util/intel-pt.c b/tools/perf/util/intel-pt.c
>>> index a929f6d..c9477d0 100644
>>> --- a/tools/perf/util/intel-pt.c
>>> +++ b/tools/perf/util/intel-pt.c
>>> @@ -1871,9 +1871,10 @@ static int intel_pt_synth_pebs_sample(struct intel_pt_queue *ptq)
>>> 			 * cycles. Use latency >> 32 to distinguish the
>>> 			 * different format of the mem access latency field.
>>> 			 */
>>> -			if (weight > 0)
>>> +			if (weight > 0) {
>>> 				sample.weight = weight & 0xffff;
>>> -			else
>>> +				sample.ins_lat = items->mem_access_latency & 0xffff;
>>> +			} else
>>> 				sample.weight = items->mem_access_latency;
>>> 		}
>>> 		if (!sample.weight && items->has_tsx_aux_info) {
>>> diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
>>> index a35d8c2..330e591 100644
>>> --- a/tools/perf/util/session.c
>>> +++ b/tools/perf/util/session.c
>>> @@ -1297,8 +1297,12 @@ static void dump_sample(struct evsel *evsel, union perf_event *event,
>>> 	if (sample_type & PERF_SAMPLE_STACK_USER)
>>> 		stack_user__printf(&sample->user_stack);
>>> 
>>> -	if (sample_type & PERF_SAMPLE_WEIGHT_TYPE)
>>> -		printf("... weight: %" PRIu64 "\n", sample->weight);
>>> +	if (sample_type & PERF_SAMPLE_WEIGHT_TYPE) {
>>> +		printf("... weight: %" PRIu64 "", sample->weight);
>>> +			if (sample_type & PERF_SAMPLE_WEIGHT_STRUCT)
>>> +				printf(",0x%"PRIx16"", sample->ins_lat);
>>> +		printf("\n");
>>> +	}
>>> 
>>> 	if (sample_type & PERF_SAMPLE_DATA_SRC)
>>> 		printf(" . data_src: 0x%"PRIx64"\n", sample->data_src);
>>> diff --git a/tools/perf/util/sort.c b/tools/perf/util/sort.c
>>> index 249a03c..e0529f2 100644
>>> --- a/tools/perf/util/sort.c
>>> +++ b/tools/perf/util/sort.c
>>> @@ -36,7 +36,7 @@ const char	default_parent_pattern[] = "^sys_|^do_page_fault";
>>> const char	*parent_pattern = default_parent_pattern;
>>> const char	*default_sort_order = "comm,dso,symbol";
>>> const char	default_branch_sort_order[] = "comm,dso_from,symbol_from,symbol_to,cycles";
>>> -const char	default_mem_sort_order[] = "local_weight,mem,sym,dso,symbol_daddr,dso_daddr,snoop,tlb,locked,blocked";
>>> +const char	default_mem_sort_order[] = "local_weight,mem,sym,dso,symbol_daddr,dso_daddr,snoop,tlb,locked,blocked,local_ins_lat";
>>> const char	default_top_sort_order[] = "dso,symbol";
>>> const char	default_diff_sort_order[] = "dso,symbol";
>>> const char	default_tracepoint_sort_order[] = "trace";
>>> @@ -1365,6 +1365,49 @@ struct sort_entry sort_global_weight = {
>>> 	.se_width_idx	= HISTC_GLOBAL_WEIGHT,
>>> };
>>> 
>>> +static u64 he_ins_lat(struct hist_entry *he)
>>> +{
>>> +		return he->stat.nr_events ? he->stat.ins_lat / he->stat.nr_events : 0;
>>> +}
>>> +
>>> +static int64_t
>>> +sort__local_ins_lat_cmp(struct hist_entry *left, struct hist_entry *right)
>>> +{
>>> +		return he_ins_lat(left) - he_ins_lat(right);
>>> +}
>>> +
>>> +static int hist_entry__local_ins_lat_snprintf(struct hist_entry *he, char *bf,
>>> +					      size_t size, unsigned int width)
>>> +{
>>> +		return repsep_snprintf(bf, size, "%-*u", width, he_ins_lat(he));
>>> +}
>>> +
>>> +struct sort_entry sort_local_ins_lat = {
>>> +	.se_header	= "Local INSTR Latency",
>>> +	.se_cmp		= sort__local_ins_lat_cmp,
>>> +	.se_snprintf	= hist_entry__local_ins_lat_snprintf,
>>> +	.se_width_idx	= HISTC_LOCAL_INS_LAT,
>>> +};
>> Hi Kan Liang,
>> Currently with this changes, perf will display "Local INSTR Latency”  for the new column in ‘perf mem report’
>> Can we make this header string to be Architecture Specific ?
> 
> I don't think current perf supports an arch-specific header string. That would be a new feature, and I think a good one.
> 
> Besides the instruction latency, we also introduce a 'block' field for perf mem. I'm not sure if the field applies for PowerPC. You may want to check it as well.

Hi Kan Liang,

Sure, we will check this.
> 
> Could you please propose a patch to implement the feature? I think we can have a further discussion based on it.


Ok, We will look into this and come back with an RFC patch.
> 
>> Because in other archs, the var2_w of ‘perf_sample_weight’ could be used to capture something else than the Local INSTR Latency.
>> Can we have some weak function to populate the header string ?
> 
> I agree that the var2_w has different meanings among architectures. We should not force it to data->ins_lat.
> 
> The patch as below should fix it. Does it work for you?

My point about a weak function was actually for the arch-specific header string. But I guess we should not force it to data->ins_lat,
as you mentioned. I checked the below patch defining an ‘arch_perf_parse_sample_weight' for powerpc and it works.

But one observation: for a kernel that supports PERF_SAMPLE_WEIGHT_STRUCT but lacks arch-specific support for ‘arch_perf_parse_sample_weight', perf will wrongly report ‘Local Weight', since the weak function treats the value as a plain 64-bit weight. Not sure if that is a valid case to consider, though.

Thanks
Athira
> 
> diff --git a/tools/perf/arch/x86/util/event.c b/tools/perf/arch/x86/util/event.c
> index 047dc00..9b31734 100644
> --- a/tools/perf/arch/x86/util/event.c
> +++ b/tools/perf/arch/x86/util/event.c
> @@ -75,3 +75,28 @@ int perf_event__synthesize_extra_kmaps(struct perf_tool *tool,
> }
> 
> #endif
> +
> +void arch_perf_parse_sample_weight(struct perf_sample *data,
> +				   const __u64 *array, u64 type)
> +{
> +	union perf_sample_weight weight;
> +
> +	weight.full = *array;
> +	if (type & PERF_SAMPLE_WEIGHT)
> +		data->weight = weight.full;
> +	else {
> +		data->weight = weight.var1_dw;
> +		data->ins_lat = weight.var2_w;
> +	}
> +}
> +
> +void arch_perf_synthesize_sample_weight(const struct perf_sample *data,
> +					__u64 *array, u64 type)
> +{
> +	*array = data->weight;
> +
> +	if (type & PERF_SAMPLE_WEIGHT_STRUCT) {
> +		*array &= 0xffffffff;
> +		*array |= ((u64)data->ins_lat << 32);
> +	}
> +}
> diff --git a/tools/perf/util/event.h b/tools/perf/util/event.h
> index 60752e4..9f123aa 100644
> --- a/tools/perf/util/event.h
> +++ b/tools/perf/util/event.h
> @@ -414,4 +414,7 @@ extern unsigned int proc_map_timeout;
> #define PAGE_SIZE_NAME_LEN	32
> char *get_page_size_name(u64 size, char *str);
> 
> +void arch_perf_parse_sample_weight(struct perf_sample *data, const __u64 *array, u64 type);
> +void arch_perf_synthesize_sample_weight(const struct perf_sample *data, __u64 *array, u64 type);
> +
> #endif /* __PERF_RECORD_H */
> diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
> index 30b5452..1da8903 100644
> --- a/tools/perf/util/evsel.c
> +++ b/tools/perf/util/evsel.c
> @@ -2106,6 +2106,13 @@ perf_event__check_size(union perf_event *event, unsigned int sample_size)
> 	return 0;
> }
> 
> +void __weak arch_perf_parse_sample_weight(struct perf_sample *data,
> +					  const __u64 *array,
> +					  u64 type __maybe_unused)
> +{
> +	data->weight = *array;
> +}
> +
> int evsel__parse_sample(struct evsel *evsel, union perf_event *event,
> 			struct perf_sample *data)
> {
> @@ -2347,16 +2354,8 @@ int evsel__parse_sample(struct evsel *evsel, union perf_event *event,
> 	}
> 
> 	if (type & PERF_SAMPLE_WEIGHT_TYPE) {
> -		union perf_sample_weight weight;
> -
> 		OVERFLOW_CHECK_u64(array);
> -		weight.full = *array;
> -		if (type & PERF_SAMPLE_WEIGHT)
> -			data->weight = weight.full;
> -		else {
> -			data->weight = weight.var1_dw;
> -			data->ins_lat = weight.var2_w;
> -		}
> +		arch_perf_parse_sample_weight(data, array, type);
> 		array++;
> 	}
> 
> diff --git a/tools/perf/util/synthetic-events.c b/tools/perf/util/synthetic-events.c
> index c6f9db3..412f4c3 100644
> --- a/tools/perf/util/synthetic-events.c
> +++ b/tools/perf/util/synthetic-events.c
> @@ -1507,6 +1507,12 @@ size_t perf_event__sample_event_size(const struct perf_sample *sample, u64 type,
> 	return result;
> }
> 
> +void __weak arch_perf_synthesize_sample_weight(const struct perf_sample *data,
> +					       __u64 *array, u64 type __maybe_unused)
> +{
> +	*array = data->weight;
> +}
> +
> int perf_event__synthesize_sample(union perf_event *event, u64 type, u64 read_format,
> 				  const struct perf_sample *sample)
> {
> @@ -1643,11 +1649,7 @@ int perf_event__synthesize_sample(union perf_event *event, u64 type, u64 read_fo
> 	}
> 
> 	if (type & PERF_SAMPLE_WEIGHT_TYPE) {
> -		*array = sample->weight;
> -		if (type & PERF_SAMPLE_WEIGHT_STRUCT) {
> -			*array &= 0xffffffff;
> -			*array |= ((u64)sample->ins_lat << 32);
> -		}
> +		arch_perf_synthesize_sample_weight(sample, array, type);
> 		array++;
> 	}
> 
> Thanks,
> Kan


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/9] perf tools: Support the auxiliary event
  2021-02-05 10:52   ` Namhyung Kim
@ 2021-02-05 14:13     ` Liang, Kan
  2021-02-05 15:26       ` Arnaldo Carvalho de Melo
  0 siblings, 1 reply; 34+ messages in thread
From: Liang, Kan @ 2021-02-05 14:13 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: Arnaldo Carvalho de Melo, Ingo Molnar, linux-kernel,
	Peter Zijlstra, Stephane Eranian, Jiri Olsa, Andi Kleen, Yao Jin,
	maddy



On 2/5/2021 5:52 AM, Namhyung Kim wrote:
> On Wed, Feb 3, 2021 at 5:14 AM <kan.liang@linux.intel.com> wrote:
>>
>> From: Kan Liang <kan.liang@linux.intel.com>
>>
>> On the Intel Sapphire Rapids server, an auxiliary event has to be
>> enabled simultaneously with the load latency event to retrieve complete
>> Memory Info.
>>
>> Add X86 specific perf_mem_events__name() to handle the auxiliary event.
>> - Users are only interested in the samples of the mem-loads event.
>>    Sample read the auxiliary event.
>> - The auxiliary event must be in front of the load latency event in a
>>    group. Assume the second event to sample if the auxiliary event is the
>>    leader.
>> - Add a weak is_mem_loads_aux_event() to check the auxiliary event for
>>    X86. For other ARCHs, it always returns false.
>>
>> Parse the unique event name, mem-loads-aux, for the auxiliary event.
>>
>> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
>> ---
>>   tools/perf/arch/x86/util/Build        |  1 +
>>   tools/perf/arch/x86/util/mem-events.c | 44 +++++++++++++++++++++++++++++++++++
>>   tools/perf/util/evsel.c               |  3 +++
>>   tools/perf/util/mem-events.c          |  5 ++++
>>   tools/perf/util/mem-events.h          |  2 ++
>>   tools/perf/util/parse-events.l        |  1 +
>>   tools/perf/util/record.c              |  5 +++-
>>   7 files changed, 60 insertions(+), 1 deletion(-)
>>   create mode 100644 tools/perf/arch/x86/util/mem-events.c
>>
>> diff --git a/tools/perf/arch/x86/util/Build b/tools/perf/arch/x86/util/Build
>> index 347c39b..d73f548 100644
>> --- a/tools/perf/arch/x86/util/Build
>> +++ b/tools/perf/arch/x86/util/Build
>> @@ -6,6 +6,7 @@ perf-y += perf_regs.o
>>   perf-y += topdown.o
>>   perf-y += machine.o
>>   perf-y += event.o
>> +perf-y += mem-events.o
>>
>>   perf-$(CONFIG_DWARF) += dwarf-regs.o
>>   perf-$(CONFIG_BPF_PROLOGUE) += dwarf-regs.o
>> diff --git a/tools/perf/arch/x86/util/mem-events.c b/tools/perf/arch/x86/util/mem-events.c
>> new file mode 100644
>> index 0000000..11b8469
>> --- /dev/null
>> +++ b/tools/perf/arch/x86/util/mem-events.c
>> @@ -0,0 +1,44 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +#include "util/pmu.h"
>> +#include "map_symbol.h"
>> +#include "mem-events.h"
>> +
>> +static char mem_loads_name[100];
>> +static bool mem_loads_name__init;
>> +
>> +#define MEM_LOADS_AUX          0x8203
>> +#define MEM_LOADS_AUX_NAME     "{cpu/mem-loads-aux/,cpu/mem-loads,ldlat=%u/pp}:S"
>> +
>> +bool is_mem_loads_aux_event(struct evsel *leader)
>> +{
>> +       if (!pmu_have_event("cpu", "mem-loads-aux"))
>> +               return false;
>> +
>> +       return leader->core.attr.config == MEM_LOADS_AUX;
>> +}
>> +
>> +char *perf_mem_events__name(int i)
>> +{
>> +       struct perf_mem_event *e = perf_mem_events__ptr(i);
>> +
>> +       if (!e)
>> +               return NULL;
>> +
>> +       if (i == PERF_MEM_EVENTS__LOAD) {
>> +               if (mem_loads_name__init)
>> +                       return mem_loads_name;
>> +
>> +               mem_loads_name__init = true;
>> +
>> +               if (pmu_have_event("cpu", "mem-loads-aux")) {
>> +                       scnprintf(mem_loads_name, sizeof(MEM_LOADS_AUX_NAME),
>> +                                 MEM_LOADS_AUX_NAME, perf_mem_events__loads_ldlat);
> 
> It changes "%u" to an actual latency value, right?
> What if the value takes 3 or more digits?
> I'm not sure scnprintf() will handle it properly.
>

Yes, you are right. We should use the sizeof(mem_loads_name) as below.
I will submit a patch to fix it.

diff --git a/tools/perf/arch/x86/util/mem-events.c 
b/tools/perf/arch/x86/util/mem-events.c
index 11b8469..588110f 100644
--- a/tools/perf/arch/x86/util/mem-events.c
+++ b/tools/perf/arch/x86/util/mem-events.c
@@ -31,7 +31,7 @@ char *perf_mem_events__name(int i)
  		mem_loads_name__init = true;

  		if (pmu_have_event("cpu", "mem-loads-aux")) {
-			scnprintf(mem_loads_name, sizeof(MEM_LOADS_AUX_NAME),
+			scnprintf(mem_loads_name, sizeof(mem_loads_name),
  				  MEM_LOADS_AUX_NAME, perf_mem_events__loads_ldlat);
  		} else {
  			scnprintf(mem_loads_name, sizeof(mem_loads_name),




Thanks,
Kan



* Re: [PATCH 3/9] perf tools: Support data block and addr block
  2021-02-05 11:02   ` Namhyung Kim
@ 2021-02-05 14:17     ` Liang, Kan
  0 siblings, 0 replies; 34+ messages in thread
From: Liang, Kan @ 2021-02-05 14:17 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: Arnaldo Carvalho de Melo, Ingo Molnar, linux-kernel,
	Peter Zijlstra, Stephane Eranian, Jiri Olsa, Andi Kleen, Yao Jin,
	maddy



On 2/5/2021 6:02 AM, Namhyung Kim wrote:
> On Wed, Feb 3, 2021 at 5:14 AM <kan.liang@linux.intel.com> wrote:
>>
>> From: Kan Liang <kan.liang@linux.intel.com>
>>
>> Two new data source fields, to indicate the block reasons of a load
>> instruction, are introduced on the Intel Sapphire Rapids server. The
>> fields can be used by the memory profiling.
>>
>> Add a new sort function, SORT_MEM_BLOCKED, for the two fields.
>>
>> For the previous platforms or the block reason is unknown, print "N/A"
>> for the block reason.
>>
>> Add blocked as a default mem sort key for perf report and
>> perf mem report.
>>
>> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
>> ---
> [SNIP]
>> +int perf_mem__blk_scnprintf(char *out, size_t sz, struct mem_info *mem_info)
>> +{
>> +       size_t l = 0;
>> +       u64 mask = PERF_MEM_BLK_NA;
>> +
>> +       sz -= 1; /* -1 for null termination */
>> +       out[0] = '\0';
>> +
>> +       if (mem_info)
>> +               mask = mem_info->data_src.mem_blk;
>> +
>> +       if (!mask || (mask & PERF_MEM_BLK_NA)) {
>> +               l += scnprintf(out + l, sz - l, " N/A");
>> +               return l;
>> +       }
>> +       if (mask & PERF_MEM_BLK_DATA)
>> +               l += scnprintf(out + l, sz - l, " Data");
>> +       if (mask & PERF_MEM_BLK_ADDR)
>> +               l += scnprintf(out + l, sz - l, " Addr");
> 
> So this means it's possible to have BLK_DATA and BLK_ADDR
> together and in that case it'll print "DataAddr", right?

Yes, it's possible. If so, it will print "Data Addr".

Thanks,
Kan
> 
> 
>> +
>> +       return l;
>> +}
>> +
>>   int perf_script__meminfo_scnprintf(char *out, size_t sz, struct mem_info *mem_info)
>>   {
>>          int i = 0;
>> @@ -348,6 +371,8 @@ int perf_script__meminfo_scnprintf(char *out, size_t sz, struct mem_info *mem_in
>>          i += perf_mem__tlb_scnprintf(out + i, sz - i, mem_info);
>>          i += scnprintf(out + i, sz - i, "|LCK ");
>>          i += perf_mem__lck_scnprintf(out + i, sz - i, mem_info);
>> +       i += scnprintf(out + i, sz - i, "|BLK ");
>> +       i += perf_mem__blk_scnprintf(out + i, sz - i, mem_info);
>>
>>          return i;
>>   }


* Re: [PATCH 6/9] perf report: Support instruction latency
  2021-02-05 11:08   ` Namhyung Kim
@ 2021-02-05 14:38     ` Liang, Kan
  2021-02-06  8:09       ` Namhyung Kim
  0 siblings, 1 reply; 34+ messages in thread
From: Liang, Kan @ 2021-02-05 14:38 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: Arnaldo Carvalho de Melo, Ingo Molnar, linux-kernel,
	Peter Zijlstra, Stephane Eranian, Jiri Olsa, Andi Kleen, Yao Jin,
	maddy



On 2/5/2021 6:08 AM, Namhyung Kim wrote:
> On Wed, Feb 3, 2021 at 5:14 AM <kan.liang@linux.intel.com> wrote:
>>
>> From: Kan Liang <kan.liang@linux.intel.com>
>>
>> The instruction latency information can be recorded on some platforms,
>> e.g., the Intel Sapphire Rapids server. With both memory latency
>> (weight) and the new instruction latency information, users can easily
>> locate the expensive load instructions, and also understand the time
>> spent in different stages. The users can optimize their applications
>> in different pipeline stages.
>>
>> The 'weight' field is shared among different architectures. Reusing the
>> 'weight' field may impact other architectures. Add a new field to store
>> the instruction latency.
>>
>> Like the 'weight' support, introduce a 'ins_lat' for the global
>> instruction latency, and a 'local_ins_lat' for the local instruction
>> latency version.
> 
> Could you please clarify the difference between the global latency
> and the local latency?
>

The global means the total latency.
The local means average latency, aka total / number of samples.

Thanks,
Kan


* Re: [PATCH 6/9] perf report: Support instruction latency
  2021-02-05 12:55       ` Athira Rajeev
@ 2021-02-05 14:51         ` Liang, Kan
  2021-02-07 16:45           ` Athira Rajeev
  0 siblings, 1 reply; 34+ messages in thread
From: Liang, Kan @ 2021-02-05 14:51 UTC (permalink / raw)
  To: Athira Rajeev
  Cc: Arnaldo Carvalho de Melo, mingo, linux-kernel, Peter Zijlstra,
	eranian, namhyung, jolsa, ak, yao.jin, maddy



On 2/5/2021 7:55 AM, Athira Rajeev wrote:
>>> Because in other archs, the var2_w of ‘perf_sample_weight’ could be used to capture something else than the Local INSTR Latency.
>>> Can we have some weak function to populate the header string ?
>> I agree that the var2_w has different meanings among architectures. We should not force it to data->ins_lat.
>>
>> The patch as below should fix it. Does it work for you?
> My point about weak function was actually for the arch specific header string. But I guess we should not force it to data->ins_lat

Yes, I don't think PowerPC should force var2_w to data->ins_lat. I think 
you can create your own field.

> as you mentioned. I checked the below patch defining an ‘arch_perf_parse_sample_weight' for powerpc and it works.
> 
> But one observation is that, for cases with kernel having support for PERF_SAMPLE_WEIGHT_STRUCT but missing arch specific support for  ‘arch_perf_parse_sample_weight', it will report ‘Local Weight’ wrongly since weak function takes it as 64 bit. Not sure if that is a valid case to consider though.

Currently, the PERF_SAMPLE_WEIGHT_STRUCT is only enabled on X86 by default.
https://lore.kernel.org/lkml/1612296553-21962-6-git-send-email-kan.liang@linux.intel.com/

For PowerPC, the PERF_SAMPLE_WEIGHT is still the default setting. There 
is no way to set PERF_SAMPLE_WEIGHT_STRUCT via perf tool.
I don't think the above case will happen.

Thanks,
Kan


* Re: [PATCH 2/9] perf tools: Support the auxiliary event
  2021-02-05 14:13     ` Liang, Kan
@ 2021-02-05 15:26       ` Arnaldo Carvalho de Melo
  2021-02-05 15:45         ` Liang, Kan
  0 siblings, 1 reply; 34+ messages in thread
From: Arnaldo Carvalho de Melo @ 2021-02-05 15:26 UTC (permalink / raw)
  To: Liang, Kan
  Cc: Namhyung Kim, Ingo Molnar, linux-kernel, Peter Zijlstra,
	Stephane Eranian, Jiri Olsa, Andi Kleen, Yao Jin, maddy

Em Fri, Feb 05, 2021 at 09:13:34AM -0500, Liang, Kan escreveu:
> 
> 
> On 2/5/2021 5:52 AM, Namhyung Kim wrote:
> > On Wed, Feb 3, 2021 at 5:14 AM <kan.liang@linux.intel.com> wrote:
> > > 
> > > From: Kan Liang <kan.liang@linux.intel.com>
> > > 
> > > On the Intel Sapphire Rapids server, an auxiliary event has to be
> > > enabled simultaneously with the load latency event to retrieve complete
> > > Memory Info.
> > > 
> > > Add X86 specific perf_mem_events__name() to handle the auxiliary event.
> > > - Users are only interested in the samples of the mem-loads event.
> > >    Sample read the auxiliary event.
> > > - The auxiliary event must be in front of the load latency event in a
> > >    group. Assume the second event to sample if the auxiliary event is the
> > >    leader.
> > > - Add a weak is_mem_loads_aux_event() to check the auxiliary event for
> > >    X86. For other ARCHs, it always returns false.
> > > 
> > > Parse the unique event name, mem-loads-aux, for the auxiliary event.
> > > 
> > > Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
> > > ---
> > >   tools/perf/arch/x86/util/Build        |  1 +
> > >   tools/perf/arch/x86/util/mem-events.c | 44 +++++++++++++++++++++++++++++++++++
> > >   tools/perf/util/evsel.c               |  3 +++
> > >   tools/perf/util/mem-events.c          |  5 ++++
> > >   tools/perf/util/mem-events.h          |  2 ++
> > >   tools/perf/util/parse-events.l        |  1 +
> > >   tools/perf/util/record.c              |  5 +++-
> > >   7 files changed, 60 insertions(+), 1 deletion(-)
> > >   create mode 100644 tools/perf/arch/x86/util/mem-events.c
> > > 
> > > diff --git a/tools/perf/arch/x86/util/Build b/tools/perf/arch/x86/util/Build
> > > index 347c39b..d73f548 100644
> > > --- a/tools/perf/arch/x86/util/Build
> > > +++ b/tools/perf/arch/x86/util/Build
> > > @@ -6,6 +6,7 @@ perf-y += perf_regs.o
> > >   perf-y += topdown.o
> > >   perf-y += machine.o
> > >   perf-y += event.o
> > > +perf-y += mem-events.o
> > > 
> > >   perf-$(CONFIG_DWARF) += dwarf-regs.o
> > >   perf-$(CONFIG_BPF_PROLOGUE) += dwarf-regs.o
> > > diff --git a/tools/perf/arch/x86/util/mem-events.c b/tools/perf/arch/x86/util/mem-events.c
> > > new file mode 100644
> > > index 0000000..11b8469
> > > --- /dev/null
> > > +++ b/tools/perf/arch/x86/util/mem-events.c
> > > @@ -0,0 +1,44 @@
> > > +// SPDX-License-Identifier: GPL-2.0
> > > +#include "util/pmu.h"
> > > +#include "map_symbol.h"
> > > +#include "mem-events.h"
> > > +
> > > +static char mem_loads_name[100];
> > > +static bool mem_loads_name__init;
> > > +
> > > +#define MEM_LOADS_AUX          0x8203
> > > +#define MEM_LOADS_AUX_NAME     "{cpu/mem-loads-aux/,cpu/mem-loads,ldlat=%u/pp}:S"
> > > +
> > > +bool is_mem_loads_aux_event(struct evsel *leader)
> > > +{
> > > +       if (!pmu_have_event("cpu", "mem-loads-aux"))
> > > +               return false;
> > > +
> > > +       return leader->core.attr.config == MEM_LOADS_AUX;
> > > +}
> > > +
> > > +char *perf_mem_events__name(int i)
> > > +{
> > > +       struct perf_mem_event *e = perf_mem_events__ptr(i);
> > > +
> > > +       if (!e)
> > > +               return NULL;
> > > +
> > > +       if (i == PERF_MEM_EVENTS__LOAD) {
> > > +               if (mem_loads_name__init)
> > > +                       return mem_loads_name;
> > > +
> > > +               mem_loads_name__init = true;
> > > +
> > > +               if (pmu_have_event("cpu", "mem-loads-aux")) {
> > > +                       scnprintf(mem_loads_name, sizeof(MEM_LOADS_AUX_NAME),
> > > +                                 MEM_LOADS_AUX_NAME, perf_mem_events__loads_ldlat);
> > 
> > It changes "%u" to an actual latency value, right?
> > What if the value takes 3 or more digits?
> > I'm not sure scnprintf() will handle it properly.
> > 
> 
> Yes, you are right. We should use the sizeof(mem_loads_name) as below.
> I will submit a patch to fix it.
> 
> diff --git a/tools/perf/arch/x86/util/mem-events.c
> b/tools/perf/arch/x86/util/mem-events.c
> index 11b8469..588110f 100644
> --- a/tools/perf/arch/x86/util/mem-events.c
> +++ b/tools/perf/arch/x86/util/mem-events.c
> @@ -31,7 +31,7 @@ char *perf_mem_events__name(int i)
>  		mem_loads_name__init = true;
> 
>  		if (pmu_have_event("cpu", "mem-loads-aux")) {
> -			scnprintf(mem_loads_name, sizeof(MEM_LOADS_AUX_NAME),
> +			scnprintf(mem_loads_name, sizeof(mem_loads_name),
>  				  MEM_LOADS_AUX_NAME, perf_mem_events__loads_ldlat);
>  		} else {
>  			scnprintf(mem_loads_name, sizeof(mem_loads_name),

I'll fold this in the relevant cset.

- Arnaldo


* Re: [PATCH 2/9] perf tools: Support the auxiliary event
  2021-02-05 15:26       ` Arnaldo Carvalho de Melo
@ 2021-02-05 15:45         ` Liang, Kan
  0 siblings, 0 replies; 34+ messages in thread
From: Liang, Kan @ 2021-02-05 15:45 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: Namhyung Kim, Ingo Molnar, linux-kernel, Peter Zijlstra,
	Stephane Eranian, Jiri Olsa, Andi Kleen, Yao Jin, maddy



On 2/5/2021 10:26 AM, Arnaldo Carvalho de Melo wrote:
> Em Fri, Feb 05, 2021 at 09:13:34AM -0500, Liang, Kan escreveu:
>>
>>
>> On 2/5/2021 5:52 AM, Namhyung Kim wrote:
>>> On Wed, Feb 3, 2021 at 5:14 AM <kan.liang@linux.intel.com> wrote:
>>>>
>>>> From: Kan Liang <kan.liang@linux.intel.com>
>>>>
>>>> On the Intel Sapphire Rapids server, an auxiliary event has to be
>>>> enabled simultaneously with the load latency event to retrieve complete
>>>> Memory Info.
>>>>
>>>> Add X86 specific perf_mem_events__name() to handle the auxiliary event.
>>>> - Users are only interested in the samples of the mem-loads event.
>>>>     Sample read the auxiliary event.
>>>> - The auxiliary event must be in front of the load latency event in a
>>>>     group. Assume the second event to sample if the auxiliary event is the
>>>>     leader.
>>>> - Add a weak is_mem_loads_aux_event() to check the auxiliary event for
>>>>     X86. For other ARCHs, it always returns false.
>>>>
>>>> Parse the unique event name, mem-loads-aux, for the auxiliary event.
>>>>
>>>> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
>>>> ---
>>>>    tools/perf/arch/x86/util/Build        |  1 +
>>>>    tools/perf/arch/x86/util/mem-events.c | 44 +++++++++++++++++++++++++++++++++++
>>>>    tools/perf/util/evsel.c               |  3 +++
>>>>    tools/perf/util/mem-events.c          |  5 ++++
>>>>    tools/perf/util/mem-events.h          |  2 ++
>>>>    tools/perf/util/parse-events.l        |  1 +
>>>>    tools/perf/util/record.c              |  5 +++-
>>>>    7 files changed, 60 insertions(+), 1 deletion(-)
>>>>    create mode 100644 tools/perf/arch/x86/util/mem-events.c
>>>>
>>>> diff --git a/tools/perf/arch/x86/util/Build b/tools/perf/arch/x86/util/Build
>>>> index 347c39b..d73f548 100644
>>>> --- a/tools/perf/arch/x86/util/Build
>>>> +++ b/tools/perf/arch/x86/util/Build
>>>> @@ -6,6 +6,7 @@ perf-y += perf_regs.o
>>>>    perf-y += topdown.o
>>>>    perf-y += machine.o
>>>>    perf-y += event.o
>>>> +perf-y += mem-events.o
>>>>
>>>>    perf-$(CONFIG_DWARF) += dwarf-regs.o
>>>>    perf-$(CONFIG_BPF_PROLOGUE) += dwarf-regs.o
>>>> diff --git a/tools/perf/arch/x86/util/mem-events.c b/tools/perf/arch/x86/util/mem-events.c
>>>> new file mode 100644
>>>> index 0000000..11b8469
>>>> --- /dev/null
>>>> +++ b/tools/perf/arch/x86/util/mem-events.c
>>>> @@ -0,0 +1,44 @@
>>>> +// SPDX-License-Identifier: GPL-2.0
>>>> +#include "util/pmu.h"
>>>> +#include "map_symbol.h"
>>>> +#include "mem-events.h"
>>>> +
>>>> +static char mem_loads_name[100];
>>>> +static bool mem_loads_name__init;
>>>> +
>>>> +#define MEM_LOADS_AUX          0x8203
>>>> +#define MEM_LOADS_AUX_NAME     "{cpu/mem-loads-aux/,cpu/mem-loads,ldlat=%u/pp}:S"
>>>> +
>>>> +bool is_mem_loads_aux_event(struct evsel *leader)
>>>> +{
>>>> +       if (!pmu_have_event("cpu", "mem-loads-aux"))
>>>> +               return false;
>>>> +
>>>> +       return leader->core.attr.config == MEM_LOADS_AUX;
>>>> +}
>>>> +
>>>> +char *perf_mem_events__name(int i)
>>>> +{
>>>> +       struct perf_mem_event *e = perf_mem_events__ptr(i);
>>>> +
>>>> +       if (!e)
>>>> +               return NULL;
>>>> +
>>>> +       if (i == PERF_MEM_EVENTS__LOAD) {
>>>> +               if (mem_loads_name__init)
>>>> +                       return mem_loads_name;
>>>> +
>>>> +               mem_loads_name__init = true;
>>>> +
>>>> +               if (pmu_have_event("cpu", "mem-loads-aux")) {
>>>> +                       scnprintf(mem_loads_name, sizeof(MEM_LOADS_AUX_NAME),
>>>> +                                 MEM_LOADS_AUX_NAME, perf_mem_events__loads_ldlat);
>>>
>>> It changes "%u" to an actual latency value, right?
>>> What if the value takes 3 or more digits?
>>> I'm not sure scnprintf() will handle it properly.
>>>
>>
>> Yes, you are right. We should use the sizeof(mem_loads_name) as below.
>> I will submit a patch to fix it.
>>
>> diff --git a/tools/perf/arch/x86/util/mem-events.c
>> b/tools/perf/arch/x86/util/mem-events.c
>> index 11b8469..588110f 100644
>> --- a/tools/perf/arch/x86/util/mem-events.c
>> +++ b/tools/perf/arch/x86/util/mem-events.c
>> @@ -31,7 +31,7 @@ char *perf_mem_events__name(int i)
>>   		mem_loads_name__init = true;
>>
>>   		if (pmu_have_event("cpu", "mem-loads-aux")) {
>> -			scnprintf(mem_loads_name, sizeof(MEM_LOADS_AUX_NAME),
>> +			scnprintf(mem_loads_name, sizeof(mem_loads_name),
>>   				  MEM_LOADS_AUX_NAME, perf_mem_events__loads_ldlat);
>>   		} else {
>>   			scnprintf(mem_loads_name, sizeof(mem_loads_name),
> 
> I'll fold this in the relevant cset.
> 

Thanks!

Kan


* Re: [PATCH 6/9] perf report: Support instruction latency
  2021-02-05 14:38     ` Liang, Kan
@ 2021-02-06  8:09       ` Namhyung Kim
  2021-02-08 13:50         ` Liang, Kan
  0 siblings, 1 reply; 34+ messages in thread
From: Namhyung Kim @ 2021-02-06  8:09 UTC (permalink / raw)
  To: Liang, Kan
  Cc: Arnaldo Carvalho de Melo, Ingo Molnar, linux-kernel,
	Peter Zijlstra, Stephane Eranian, Jiri Olsa, Andi Kleen, Yao Jin,
	maddy

On Fri, Feb 5, 2021 at 11:38 PM Liang, Kan <kan.liang@linux.intel.com> wrote:
>
> On 2/5/2021 6:08 AM, Namhyung Kim wrote:
> > On Wed, Feb 3, 2021 at 5:14 AM <kan.liang@linux.intel.com> wrote:
> >>
> >> From: Kan Liang <kan.liang@linux.intel.com>
> >>
> >> The instruction latency information can be recorded on some platforms,
> >> e.g., the Intel Sapphire Rapids server. With both memory latency
> >> (weight) and the new instruction latency information, users can easily
> >> locate the expensive load instructions, and also understand the time
> >> spent in different stages. The users can optimize their applications
> >> in different pipeline stages.
> >>
> >> The 'weight' field is shared among different architectures. Reusing the
> >> 'weight' field may impact other architectures. Add a new field to store
> >> the instruction latency.
> >>
> >> Like the 'weight' support, introduce a 'ins_lat' for the global
> >> instruction latency, and a 'local_ins_lat' for the local instruction
> >> latency version.
> >
> > Could you please clarify the difference between the global latency
> > and the local latency?
> >
>
> The global means the total latency.
> The local means average latency, aka total / number of samples.

Thanks for the explanation, but I think it's confusing.
Why not call it just total_latency and avg_latency?

Thanks,
Namhyung


* Re: [PATCH 6/9] perf report: Support instruction latency
  2021-02-05 14:51         ` Liang, Kan
@ 2021-02-07 16:45           ` Athira Rajeev
  0 siblings, 0 replies; 34+ messages in thread
From: Athira Rajeev @ 2021-02-07 16:45 UTC (permalink / raw)
  To: Liang, Kan
  Cc: Arnaldo Carvalho de Melo, mingo, linux-kernel, Peter Zijlstra,
	eranian, namhyung, jolsa, ak, yao.jin, maddy



> On 05-Feb-2021, at 8:21 PM, Liang, Kan <kan.liang@linux.intel.com> wrote:
> 
> 
> 
> On 2/5/2021 7:55 AM, Athira Rajeev wrote:
>>>> Because in other archs, the var2_w of ‘perf_sample_weight’ could be used to capture something else than the Local INSTR Latency.
>>>> Can we have some weak function to populate the header string ?
>>> I agree that the var2_w has different meanings among architectures. We should not force it to data->ins_lat.
>>> 
>>> The patch as below should fix it. Does it work for you?
>> My point about weak function was actually for the arch specific header string. But I guess we should not force it to data->ins_lat
> 
> Yes, I don't think PowerPC should force var2_w to data->ins_lat. I think you can create your own field.
> 
>> as you mentioned. I checked the below patch defining an ‘arch_perf_parse_sample_weight' for powerpc and it works.
>> But one observation is that, for cases with kernel having support for PERF_SAMPLE_WEIGHT_STRUCT but missing arch specific support for  ‘arch_perf_parse_sample_weight', it will report ‘Local Weight’ wrongly since weak function takes it as 64 bit. Not sure if that is a valid case to consider though.
> 
> Currently, the PERF_SAMPLE_WEIGHT_STRUCT is only enabled on X86 by default.
> https://lore.kernel.org/lkml/1612296553-21962-6-git-send-email-kan.liang@linux.intel.com/
> 
> For PowerPC, the PERF_SAMPLE_WEIGHT is still the default setting. There is no way to set PERF_SAMPLE_WEIGHT_STRUCT via perf tool.
> I don't think the above case will happen.

Yes. 

I tested with kernel changes from perf/core branch of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git
And perf tools side changes from tmp.perf/core branch of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git along with the above change. 
The default setting for powerpc works without breaking anything, verified using “perf mem record <workload>”.

Tested-by: Athira Rajeev <atrajeev@linux.vnet.ibm.com>

Thanks
Athira Rajeev
> 
> Thanks,
> Kan



* Re: [PATCH 6/9] perf report: Support instruction latency
  2021-02-06  8:09       ` Namhyung Kim
@ 2021-02-08 13:50         ` Liang, Kan
  0 siblings, 0 replies; 34+ messages in thread
From: Liang, Kan @ 2021-02-08 13:50 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: Arnaldo Carvalho de Melo, Ingo Molnar, linux-kernel,
	Peter Zijlstra, Stephane Eranian, Jiri Olsa, Andi Kleen, Yao Jin,
	maddy



On 2/6/2021 3:09 AM, Namhyung Kim wrote:
> On Fri, Feb 5, 2021 at 11:38 PM Liang, Kan <kan.liang@linux.intel.com> wrote:
>>
>> On 2/5/2021 6:08 AM, Namhyung Kim wrote:
>>> On Wed, Feb 3, 2021 at 5:14 AM <kan.liang@linux.intel.com> wrote:
>>>>
>>>> From: Kan Liang <kan.liang@linux.intel.com>
>>>>
>>>> The instruction latency information can be recorded on some platforms,
>>>> e.g., the Intel Sapphire Rapids server. With both memory latency
>>>> (weight) and the new instruction latency information, users can easily
>>>> locate the expensive load instructions, and also understand the time
>>>> spent in different stages. The users can optimize their applications
>>>> in different pipeline stages.
>>>>
>>>> The 'weight' field is shared among different architectures. Reusing the
>>>> 'weight' field may impact other architectures. Add a new field to store
>>>> the instruction latency.
>>>>
>>>> Like the 'weight' support, introduce a 'ins_lat' for the global
>>>> instruction latency, and a 'local_ins_lat' for the local instruction
>>>> latency version.
>>>
>>> Could you please clarify the difference between the global latency
>>> and the local latency?
>>>
>>
>> The global means the total latency.
>> The local means average latency, aka total / number of samples.
> 
> Thanks for the explanation, but I think it's confusing.
> Why not call it just total_latency and avg_latency?
> 

The instruction latency field is an extension of the weight field, so I 
followed the same naming convention for it. I still think we should keep 
the naming consistent.

To address the confusion, I think we may update the document for both 
the weight and the instruction latency fields.
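To make the total-vs-average distinction concrete, here is an illustrative sketch (not perf's actual implementation; the sample data and symbol names are made up) of how a "global" column (total weight sum) and a "local" column (average weight per sample) relate when samples are aggregated per symbol:

```python
# Illustrative sketch: aggregate per-sample weights (e.g. instruction
# latency in core cycles) into "global" (total sum) and "local"
# (average per sample) values per symbol, mirroring the documented
# meaning of weight/local_weight and ins_lat/local_ins_lat.
from collections import defaultdict


def aggregate_weights(samples):
    """samples: iterable of (symbol, weight) pairs."""
    totals = defaultdict(lambda: [0, 0])  # symbol -> [weight sum, sample count]
    for symbol, weight in samples:
        totals[symbol][0] += weight
        totals[symbol][1] += 1
    # global = total sum; local = average per sample
    return {sym: {"global": s, "local": s / n}
            for sym, (s, n) in totals.items()}


if __name__ == "__main__":
    samples = [("load_a", 100), ("load_a", 300), ("load_b", 50)]
    print(aggregate_weights(samples))
    # load_a: global 400, local 200.0; load_b: global 50, local 50.0
```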

How about the below patch?

 From d5e80f541cb7288b24a7c5661ae5faede4747807 Mon Sep 17 00:00:00 2001
From: Kan Liang <kan.liang@linux.intel.com>
Date: Mon, 8 Feb 2021 05:27:03 -0800
Subject: [PATCH] perf documentation: Add comments to the local/global weight related fields

The current 'local' and 'global' prefixes are confusing for the weight
related fields, e.g., weight, instruction latency.

Add comments to clarify:
'global' means the total weight/instruction latency sum.
'local' means the average weight/instruction latency per sample.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
  tools/perf/Documentation/perf-report.txt | 10 ++++++----
  1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/tools/perf/Documentation/perf-report.txt b/tools/perf/Documentation/perf-report.txt
index f546b5e..acc1c1d 100644
--- a/tools/perf/Documentation/perf-report.txt
+++ b/tools/perf/Documentation/perf-report.txt
@@ -92,8 +92,9 @@ OPTIONS
  	- srcfile: file name of the source file of the samples. Requires dwarf
  	information.
  	- weight: Event specific weight, e.g. memory latency or transaction
-	abort cost. This is the global weight.
-	- local_weight: Local weight version of the weight above.
+	abort cost. This is the global weight (total weight sum).
+	- local_weight: Local weight (average weight per sample) version of the
+	  weight above.
  	- cgroup_id: ID derived from cgroup namespace device and inode numbers.
  	- cgroup: cgroup pathname in the cgroupfs.
  	- transaction: Transaction abort flags.
@@ -110,8 +111,9 @@ OPTIONS
  	--time-quantum (default 100ms). Specify with overhead and before it.
  	- code_page_size: the code page size of sampled code address (ip)
  	- ins_lat: Instruction latency in core cycles. This is the global instruction
-	  latency
-	- local_ins_lat: Local instruction latency version
+	  latency (total instruction latency sum)
+	- local_ins_lat: Local instruction latency (average instruction latency per
+	  sample) version

  	By default, comm, dso and symbol keys are used.
  	(i.e. --sort comm,dso,symbol)
-- 
2.7.4


Thanks,
Kan



^ permalink raw reply related	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2021-02-08 13:52 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-02-02 20:09 [PATCH 0/9] perf core PMU support for Sapphire Rapids (User tools) kan.liang
2021-02-02 20:09 ` [PATCH 1/9] tools headers uapi: Update tools's copy of linux/perf_event.h kan.liang
2021-02-02 20:09 ` [PATCH 2/9] perf tools: Support the auxiliary event kan.liang
2021-02-03 20:02   ` Arnaldo Carvalho de Melo
2021-02-03 21:20     ` Liang, Kan
2021-02-03 21:30       ` Arnaldo Carvalho de Melo
2021-02-05 10:52   ` Namhyung Kim
2021-02-05 14:13     ` Liang, Kan
2021-02-05 15:26       ` Arnaldo Carvalho de Melo
2021-02-05 15:45         ` Liang, Kan
2021-02-02 20:09 ` [PATCH 3/9] perf tools: Support data block and addr block kan.liang
2021-02-05 11:02   ` Namhyung Kim
2021-02-05 14:17     ` Liang, Kan
2021-02-02 20:09 ` [PATCH 4/9] perf c2c: " kan.liang
2021-02-03 20:21   ` Arnaldo Carvalho de Melo
2021-02-02 20:09 ` [PATCH 5/9] perf tools: Support PERF_SAMPLE_WEIGHT_STRUCT kan.liang
2021-02-03 20:31   ` Arnaldo Carvalho de Melo
2021-02-03 21:19     ` Liang, Kan
2021-02-03 21:29       ` Arnaldo Carvalho de Melo
2021-02-02 20:09 ` [PATCH 6/9] perf report: Support instruction latency kan.liang
2021-02-03 20:43   ` Arnaldo Carvalho de Melo
2021-02-04 13:11   ` Athira Rajeev
2021-02-04 15:19     ` Liang, Kan
2021-02-05 12:55       ` Athira Rajeev
2021-02-05 14:51         ` Liang, Kan
2021-02-07 16:45           ` Athira Rajeev
2021-02-05 11:08   ` Namhyung Kim
2021-02-05 14:38     ` Liang, Kan
2021-02-06  8:09       ` Namhyung Kim
2021-02-08 13:50         ` Liang, Kan
2021-02-02 20:09 ` [PATCH 7/9] perf test: Support PERF_SAMPLE_WEIGHT_STRUCT kan.liang
2021-02-03 20:44   ` Arnaldo Carvalho de Melo
2021-02-02 20:09 ` [PATCH 8/9] perf stat: Support L2 Topdown events kan.liang
2021-02-02 20:09 ` [PATCH 9/9] perf, tools: Update topdown documentation for Sapphire Rapids kan.liang

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).