linux-kernel.vger.kernel.org archive mirror
* [PATCH V2 0/5] powerpc/perf: Export processor pipeline stage cycles information
@ 2021-03-22 14:57 Athira Rajeev
  2021-03-22 14:57 ` [PATCH V2 1/5] powerpc/perf: Expose processor pipeline stage cycles using PERF_SAMPLE_WEIGHT_STRUCT Athira Rajeev
                   ` (5 more replies)
  0 siblings, 6 replies; 16+ messages in thread
From: Athira Rajeev @ 2021-03-22 14:57 UTC (permalink / raw)
  To: linuxppc-dev, linux-kernel, linux-perf-users, mpe, acme, jolsa
  Cc: maddy, ravi.bangoria, kjain, kan.liang, peterz

Performance Monitoring Unit (PMU) registers in powerpc export the
number of cycles elapsed between different stages in the pipeline,
for example through the sampling registers in ISA v3.1.

This patchset implements kernel and perf tools support to expose
these pipeline stage cycles using the sample type PERF_SAMPLE_WEIGHT_TYPE.

Patch 1/5 adds kernel side support to store the cycle counter
values as part of 'var2_w' and 'var3_w' fields of perf_sample_weight
structure.

Patch 2/5 makes the perf report column header strings dynamic.
Patch 3/5 adds powerpc support in perf tools for PERF_SAMPLE_WEIGHT_STRUCT
in sample type: PERF_SAMPLE_WEIGHT_TYPE.
Patch 4/5 adds support to present pipeline stage cycles as part of
mem-mode.
Patch 5/5 displays the new sort dimension in perf report columns
only on powerpc.

Sample output on powerpc:

# perf mem record ls
# perf mem report

# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 11  of event 'cpu/mem-loads/'
# Total weight : 1332
# Sort order   : local_weight,mem,sym,dso,symbol_daddr,dso_daddr,snoop,tlb,locked,blocked,local_ins_lat,stall_cyc
#
# Overhead       Samples  Local Weight  Memory access             Symbol                              Shared Object     Data Symbol                                    Data Object            Snoop         TLB access              Locked  Blocked     Finish Cyc     Dispatch Cyc 
# ........  ............  ............  ........................  ..................................  ................  .............................................  .....................  ............  ......................  ......  ..........  .............  .............
#
    44.14%             1  588           L1 hit                    [k] rcu_nmi_exit                    [kernel.vmlinux]  [k] 0xc0000007ffdd21b0                         [unknown]              N/A           N/A                     No       N/A        7              5            
    22.22%             1  296           L1 hit                    [k] copypage_power7                 [kernel.vmlinux]  [k] 0xc0000000ff6a1780                         [unknown]              N/A           N/A                     No       N/A        293            3            
     6.98%             1  93            L1 hit                    [.] _dl_addr                        libc-2.31.so      [.] 0x00007fff86fa5058                         libc-2.31.so           N/A           N/A                     No       N/A        7              1            
     6.61%             1  88            L2 hit                    [.] new_do_write                    libc-2.31.so      [.] _IO_2_1_stdout_+0x0                        libc-2.31.so           N/A           N/A                     No       N/A        84             1            
     5.93%             1  79            L1 hit                    [k] printk_nmi_exit                 [kernel.vmlinux]  [k] 0xc0000006085df6b0                         [unknown]              N/A           N/A                     No       N/A        7              1            
     4.05%             1  54            L2 hit                    [.] __alloc_dir                     libc-2.31.so      [.] 0x00007fffdb70a640                         [stack]                N/A           N/A                     No       N/A        18             1            
     3.60%             1  48            L1 hit                    [.] _init                           ls                [.] 0x000000016ca82118                         [heap]                 N/A           N/A                     No       N/A        7              6            
     2.40%             1  32            L1 hit                    [k] desc_read                       [kernel.vmlinux]  [k] _printk_rb_static_descs+0x1ea10            [kernel.vmlinux].data  N/A           N/A                     No       N/A        7              1            
     1.65%             1  22            L2 hit                    [k] perf_iterate_ctx.constprop.139  [kernel.vmlinux]  [k] 0xc00000064d79e8a8                         [unknown]              N/A           N/A                     No       N/A        16             1            
     1.58%             1  21            L1 hit                    [k] perf_event_interrupt            [kernel.vmlinux]  [k] 0xc0000006085df6b0                         [unknown]              N/A           N/A                     No       N/A        7              1            
     0.83%             1  11            L1 hit                    [k] perf_event_exec                 [kernel.vmlinux]  [k] 0xc0000007ffdd3288                         [unknown]              N/A           N/A                     No       N/A        7              4            


Changelog:
Changes from v1 -> v2
  Addressed Jiri's review comments:
  - Display the new sort dimension 'p_stage_cyc' only
    on supported architecture.
  - Check for arch specific header string for matching
    sort order in patch2.
  
Athira Rajeev (5):
  powerpc/perf: Expose processor pipeline stage cycles using
    PERF_SAMPLE_WEIGHT_STRUCT
  tools/perf: Add dynamic headers for perf report columns
  tools/perf: Add powerpc support for PERF_SAMPLE_WEIGHT_STRUCT
  tools/perf: Support pipeline stage cycles for powerpc
  tools/perf: Display sort dimension p_stage_cyc only on supported archs

 arch/powerpc/include/asm/perf_event_server.h |  2 +-
 arch/powerpc/perf/core-book3s.c              |  4 +-
 arch/powerpc/perf/isa207-common.c            | 29 ++++++++++++--
 arch/powerpc/perf/isa207-common.h            |  6 ++-
 tools/perf/Documentation/perf-report.txt     |  2 +
 tools/perf/arch/powerpc/util/Build           |  2 +
 tools/perf/arch/powerpc/util/event.c         | 53 ++++++++++++++++++++++++
 tools/perf/arch/powerpc/util/evsel.c         |  8 ++++
 tools/perf/util/event.h                      |  3 ++
 tools/perf/util/hist.c                       | 11 +++--
 tools/perf/util/hist.h                       |  1 +
 tools/perf/util/session.c                    |  4 +-
 tools/perf/util/sort.c                       | 60 +++++++++++++++++++++++++++-
 tools/perf/util/sort.h                       |  2 +
 14 files changed, 174 insertions(+), 13 deletions(-)
 create mode 100644 tools/perf/arch/powerpc/util/event.c
 create mode 100644 tools/perf/arch/powerpc/util/evsel.c

-- 
1.8.3.1



* [PATCH V2 1/5] powerpc/perf: Expose processor pipeline stage cycles using PERF_SAMPLE_WEIGHT_STRUCT
  2021-03-22 14:57 [PATCH V2 0/5] powerpc/perf: Export processor pipeline stage cycles information Athira Rajeev
@ 2021-03-22 14:57 ` Athira Rajeev
  2021-03-24  4:35   ` Madhavan Srinivasan
  2021-03-22 14:57 ` [PATCH V2 2/5] tools/perf: Add dynamic headers for perf report columns Athira Rajeev
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 16+ messages in thread
From: Athira Rajeev @ 2021-03-22 14:57 UTC (permalink / raw)
  To: linuxppc-dev, linux-kernel, linux-perf-users, mpe, acme, jolsa
  Cc: maddy, ravi.bangoria, kjain, kan.liang, peterz

Performance Monitoring Unit (PMU) registers in powerpc provide
information on cycles elapsed between different stages in the
pipeline. This can be used for application tuning. On ISA v3.1
platforms, this information is exposed by the sampling registers.
This patch adds kernel support to capture two of the cycle counters
as part of the perf sample using the sample type
PERF_SAMPLE_WEIGHT_STRUCT.

The power PMU function 'get_mem_weight' currently uses the 64-bit weight
field of perf_sample_data to capture memory latency. But following the
introduction of PERF_SAMPLE_WEIGHT_TYPE, the weight field could contain
a 64-bit or 32-bit value depending on the architecture support for
PERF_SAMPLE_WEIGHT_STRUCT. This patch uses WEIGHT_STRUCT to expose the
pipeline stage cycles info. Hence update the ppmu functions to work for
both 64-bit and 32-bit weight values.

If the sample type is PERF_SAMPLE_WEIGHT, use the 64-bit weight field.
If the sample type is PERF_SAMPLE_WEIGHT_STRUCT, memory subsystem
latency is stored in the low 32 bits of the perf_sample_weight structure.
Also, for CPU_FTR_ARCH_31, capture the two cycle counter values in the
two 16-bit fields of the perf_sample_weight structure.
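
The layout described above can be sketched in plain C. This is only an
illustration, not the kernel code: the demo_* names are hypothetical,
and the bit positions (latency in bits 0-31, 'var2_w' in bits 32-47,
'var3_w' in bits 48-63) are assumed from the description above. Shifts
are used instead of the real union so the sketch does not depend on
host endianness.

```c
#include <stdint.h>

/*
 * Hypothetical helpers, not kernel API. Assumed layout of the 64-bit
 * weight word when WEIGHT_STRUCT is in use:
 *   bits  0-31: memory latency        (var1_dw)
 *   bits 32-47: first cycle counter   (var2_w)
 *   bits 48-63: second cycle counter  (var3_w)
 */
static inline uint64_t demo_pack_weight_struct(uint64_t latency,
					       uint16_t var2,
					       uint16_t var3)
{
	/* Latency is truncated to 32 bits, as the (u32) cast in the patch does. */
	return (latency & 0xffffffffULL) |
	       ((uint64_t)var2 << 32) |
	       ((uint64_t)var3 << 48);
}

static inline uint32_t demo_weight_latency(uint64_t w)
{
	return (uint32_t)(w & 0xffffffffULL);
}

static inline uint16_t demo_weight_var2(uint64_t w)
{
	return (uint16_t)((w >> 32) & 0xffff);
}

static inline uint16_t demo_weight_var3(uint64_t w)
{
	return (uint16_t)(w >> 48);
}
```

With the first row of the sample output in the cover letter (weight 588,
finish 7, dispatch 5), the packed word would hold 588 in its low half
and the two counters in the upper two 16-bit slots.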

Signed-off-by: Athira Rajeev <atrajeev@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/perf_event_server.h |  2 +-
 arch/powerpc/perf/core-book3s.c              |  4 ++--
 arch/powerpc/perf/isa207-common.c            | 29 +++++++++++++++++++++++++---
 arch/powerpc/perf/isa207-common.h            |  6 +++++-
 4 files changed, 34 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/include/asm/perf_event_server.h b/arch/powerpc/include/asm/perf_event_server.h
index 00e7e671bb4b..112cf092d7b3 100644
--- a/arch/powerpc/include/asm/perf_event_server.h
+++ b/arch/powerpc/include/asm/perf_event_server.h
@@ -43,7 +43,7 @@ struct power_pmu {
 				u64 alt[]);
 	void		(*get_mem_data_src)(union perf_mem_data_src *dsrc,
 				u32 flags, struct pt_regs *regs);
-	void		(*get_mem_weight)(u64 *weight);
+	void		(*get_mem_weight)(u64 *weight, u64 type);
 	unsigned long	group_constraint_mask;
 	unsigned long	group_constraint_val;
 	u64             (*bhrb_filter_map)(u64 branch_sample_type);
diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c
index 766f064f00fb..6936763246bd 100644
--- a/arch/powerpc/perf/core-book3s.c
+++ b/arch/powerpc/perf/core-book3s.c
@@ -2206,9 +2206,9 @@ static void record_and_restart(struct perf_event *event, unsigned long val,
 						ppmu->get_mem_data_src)
 			ppmu->get_mem_data_src(&data.data_src, ppmu->flags, regs);
 
-		if (event->attr.sample_type & PERF_SAMPLE_WEIGHT &&
+		if (event->attr.sample_type & PERF_SAMPLE_WEIGHT_TYPE &&
 						ppmu->get_mem_weight)
-			ppmu->get_mem_weight(&data.weight.full);
+			ppmu->get_mem_weight(&data.weight.full, event->attr.sample_type);
 
 		if (perf_event_overflow(event, &data, regs))
 			power_pmu_stop(event, 0);
diff --git a/arch/powerpc/perf/isa207-common.c b/arch/powerpc/perf/isa207-common.c
index e4f577da33d8..5dcbdbd54598 100644
--- a/arch/powerpc/perf/isa207-common.c
+++ b/arch/powerpc/perf/isa207-common.c
@@ -284,8 +284,10 @@ void isa207_get_mem_data_src(union perf_mem_data_src *dsrc, u32 flags,
 	}
 }
 
-void isa207_get_mem_weight(u64 *weight)
+void isa207_get_mem_weight(u64 *weight, u64 type)
 {
+	union perf_sample_weight *weight_fields;
+	u64 weight_lat;
 	u64 mmcra = mfspr(SPRN_MMCRA);
 	u64 exp = MMCRA_THR_CTR_EXP(mmcra);
 	u64 mantissa = MMCRA_THR_CTR_MANT(mmcra);
@@ -296,9 +298,30 @@ void isa207_get_mem_weight(u64 *weight)
 		mantissa = P10_MMCRA_THR_CTR_MANT(mmcra);
 
 	if (val == 0 || val == 7)
-		*weight = 0;
+		weight_lat = 0;
 	else
-		*weight = mantissa << (2 * exp);
+		weight_lat = mantissa << (2 * exp);
+
+	/*
+	 * Use 64 bit weight field (full) if sample type is
+	 * WEIGHT.
+	 *
+	 * if sample type is WEIGHT_STRUCT:
+	 * - store memory latency in the lower 32 bits.
+	 * - For ISA v3.1, use remaining two 16 bit fields of
+	 *   perf_sample_weight to store cycle counter values
+	 *   from sier2.
+	 */
+	weight_fields = (union perf_sample_weight *)weight;
+	if (type & PERF_SAMPLE_WEIGHT)
+		weight_fields->full = weight_lat;
+	else {
+		weight_fields->var1_dw = (u32)weight_lat;
+		if (cpu_has_feature(CPU_FTR_ARCH_31)) {
+			weight_fields->var2_w = P10_SIER2_FINISH_CYC(mfspr(SPRN_SIER2));
+			weight_fields->var3_w = P10_SIER2_DISPATCH_CYC(mfspr(SPRN_SIER2));
+		}
+	}
 }
 
 int isa207_get_constraint(u64 event, unsigned long *maskp, unsigned long *valp, u64 event_config1)
diff --git a/arch/powerpc/perf/isa207-common.h b/arch/powerpc/perf/isa207-common.h
index 1af0e8c97ac7..fc30d43c4d0c 100644
--- a/arch/powerpc/perf/isa207-common.h
+++ b/arch/powerpc/perf/isa207-common.h
@@ -265,6 +265,10 @@
 #define ISA207_SIER_DATA_SRC_SHIFT	53
 #define ISA207_SIER_DATA_SRC_MASK	(0x7ull << ISA207_SIER_DATA_SRC_SHIFT)
 
+/* Bits in SIER2/SIER3 for Power10 */
+#define P10_SIER2_FINISH_CYC(sier2)	(((sier2) >> (63 - 37)) & 0x7fful)
+#define P10_SIER2_DISPATCH_CYC(sier2)	(((sier2) >> (63 - 13)) & 0x7fful)
+
 #define P(a, b)				PERF_MEM_S(a, b)
 #define PH(a, b)			(P(LVL, HIT) | P(a, b))
 #define PM(a, b)			(P(LVL, MISS) | P(a, b))
@@ -278,6 +282,6 @@ int isa207_get_alternatives(u64 event, u64 alt[], int size, unsigned int flags,
 					const unsigned int ev_alt[][MAX_ALT]);
 void isa207_get_mem_data_src(union perf_mem_data_src *dsrc, u32 flags,
 							struct pt_regs *regs);
-void isa207_get_mem_weight(u64 *weight);
+void isa207_get_mem_weight(u64 *weight, u64 type);
 
 #endif
-- 
1.8.3.1
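
For readers unfamiliar with the (63 - n) shifts in the two
P10_SIER2_* macros above, here is a small standalone sketch. It assumes
the macros follow IBM bit numbering (bit 0 is the most significant bit),
which the (63 - n) form suggests but the patch does not state. The
demo_* names are hypothetical; the shift amounts and the 0x7ff mask are
taken from the macros.

```c
#include <stdint.h>

/*
 * Hypothetical helper: extract a field whose least significant bit is
 * IBM bit 'last_ibm_bit' (IBM numbering counts from the MSB, so that
 * bit sits at conventional position 63 - last_ibm_bit).
 */
static inline uint64_t demo_ibm_field(uint64_t reg, unsigned int last_ibm_bit,
				      unsigned int width)
{
	uint64_t mask = (width >= 64) ? ~0ULL : ((1ULL << width) - 1);

	return (reg >> (63 - last_ibm_bit)) & mask;
}

/* Same shift and 11-bit mask (0x7ff) as P10_SIER2_FINISH_CYC. */
static inline uint64_t demo_finish_cyc(uint64_t sier2)
{
	return demo_ibm_field(sier2, 37, 11);
}

/* Same shift and 11-bit mask (0x7ff) as P10_SIER2_DISPATCH_CYC. */
static inline uint64_t demo_dispatch_cyc(uint64_t sier2)
{
	return demo_ibm_field(sier2, 13, 11);
}
```

Both fields are 11 bits wide, so each value fits in the 16-bit
'var2_w' / 'var3_w' slots without truncation.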



* [PATCH V2 2/5] tools/perf: Add dynamic headers for perf report columns
  2021-03-22 14:57 [PATCH V2 0/5] powerpc/perf: Export processor pipeline stage cycles information Athira Rajeev
  2021-03-22 14:57 ` [PATCH V2 1/5] powerpc/perf: Expose processor pipeline stage cycles using PERF_SAMPLE_WEIGHT_STRUCT Athira Rajeev
@ 2021-03-22 14:57 ` Athira Rajeev
  2021-03-22 14:57 ` [PATCH V2 3/5] tools/perf: Add powerpc support for PERF_SAMPLE_WEIGHT_STRUCT Athira Rajeev
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 16+ messages in thread
From: Athira Rajeev @ 2021-03-22 14:57 UTC (permalink / raw)
  To: linuxppc-dev, linux-kernel, linux-perf-users, mpe, acme, jolsa
  Cc: maddy, ravi.bangoria, kjain, kan.liang, peterz

Currently the header string for different columns in perf report
is fixed. Some fields of perf sample could have different meaning
for different architectures than the meaning conveyed by the header
string. An example is the new field 'var2_w' of perf_sample_weight
structure. This is presently captured as 'Local INSTR Latency' in
perf mem report. But this could be used to denote a different latency
cycle in another architecture.

Introduce a weak function arch_perf_header_entry() to set
the arch specific header string for the fields which can have a dynamic
header. If the architecture does not provide this function, fall back to
the default header string value.
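
The fallback behaviour can be sketched outside the perf tree as below.
This is a simplified illustration with hypothetical demo_* names; it
uses the GCC/Clang attribute spelling directly rather than the tree's
__weak macro. An architecture that wants its own strings would provide
a strong definition of the same symbol in a separate object file, and
the linker would pick that one.

```c
/*
 * Generic default: returns the header unchanged. Marked weak so that a
 * strong definition in an arch-specific object file overrides it at
 * link time. (GCC/Clang attribute syntax; hypothetical name.)
 */
__attribute__((weak)) const char *demo_arch_header_entry(const char *se_header)
{
	return se_header;
}

/* Minimal stand-in for the sort entry that carries the header string. */
struct demo_sort_entry {
	const char *se_header;
};

/* Mirrors the idea of sort_dimension_add_dynamic_header(). */
static inline void demo_apply_dynamic_header(struct demo_sort_entry *se)
{
	se->se_header = demo_arch_header_entry(se->se_header);
}
```

When no architecture overrides the weak symbol, the call is a no-op and
the generic header is kept.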

Signed-off-by: Athira Rajeev <atrajeev@linux.vnet.ibm.com>
---
 tools/perf/util/event.h |  1 +
 tools/perf/util/sort.c  | 19 ++++++++++++++++++-
 2 files changed, 19 insertions(+), 1 deletion(-)

diff --git a/tools/perf/util/event.h b/tools/perf/util/event.h
index f603edbbbc6f..6106a9c134c9 100644
--- a/tools/perf/util/event.h
+++ b/tools/perf/util/event.h
@@ -427,5 +427,6 @@ void  cpu_map_data__synthesize(struct perf_record_cpu_map_data *data, struct per
 
 void arch_perf_parse_sample_weight(struct perf_sample *data, const __u64 *array, u64 type);
 void arch_perf_synthesize_sample_weight(const struct perf_sample *data, __u64 *array, u64 type);
+const char *arch_perf_header_entry(const char *se_header);
 
 #endif /* __PERF_RECORD_H */
diff --git a/tools/perf/util/sort.c b/tools/perf/util/sort.c
index 552b590485bf..eeb03e749181 100644
--- a/tools/perf/util/sort.c
+++ b/tools/perf/util/sort.c
@@ -25,6 +25,7 @@
 #include <traceevent/event-parse.h>
 #include "mem-events.h"
 #include "annotate.h"
+#include "event.h"
 #include "time-utils.h"
 #include "cgroup.h"
 #include "machine.h"
@@ -45,6 +46,7 @@
 regex_t		ignore_callees_regex;
 int		have_ignore_callees = 0;
 enum sort_mode	sort__mode = SORT_MODE__NORMAL;
+const char	*dynamic_headers[] = {"local_ins_lat"};
 
 /*
  * Replaces all occurrences of a char used with the:
@@ -1816,6 +1818,16 @@ struct sort_dimension {
 	int			taken;
 };
 
+const char * __weak arch_perf_header_entry(const char *se_header)
+{
+	return se_header;
+}
+
+static void sort_dimension_add_dynamic_header(struct sort_dimension *sd)
+{
+	sd->entry->se_header = arch_perf_header_entry(sd->entry->se_header);
+}
+
 #define DIM(d, n, func) [d] = { .name = n, .entry = &(func) }
 
 static struct sort_dimension common_sort_dimensions[] = {
@@ -2739,7 +2751,7 @@ int sort_dimension__add(struct perf_hpp_list *list, const char *tok,
 			struct evlist *evlist,
 			int level)
 {
-	unsigned int i;
+	unsigned int i, j;
 
 	for (i = 0; i < ARRAY_SIZE(common_sort_dimensions); i++) {
 		struct sort_dimension *sd = &common_sort_dimensions[i];
@@ -2747,6 +2759,11 @@ int sort_dimension__add(struct perf_hpp_list *list, const char *tok,
 		if (strncasecmp(tok, sd->name, strlen(tok)))
 			continue;
 
+		for (j = 0; j < ARRAY_SIZE(dynamic_headers); j++) {
+			if (!strcmp(dynamic_headers[j], sd->name))
+				sort_dimension_add_dynamic_header(sd);
+		}
+
 		if (sd->entry == &sort_parent) {
 			int ret = regcomp(&parent_regex, parent_pattern, REG_EXTENDED);
 			if (ret) {
-- 
1.8.3.1



* [PATCH V2 3/5] tools/perf: Add powerpc support for PERF_SAMPLE_WEIGHT_STRUCT
  2021-03-22 14:57 [PATCH V2 0/5] powerpc/perf: Export processor pipeline stage cycles information Athira Rajeev
  2021-03-22 14:57 ` [PATCH V2 1/5] powerpc/perf: Expose processor pipeline stage cycles using PERF_SAMPLE_WEIGHT_STRUCT Athira Rajeev
  2021-03-22 14:57 ` [PATCH V2 2/5] tools/perf: Add dynamic headers for perf report columns Athira Rajeev
@ 2021-03-22 14:57 ` Athira Rajeev
  2021-03-24 19:43   ` Jiri Olsa
  2021-03-22 14:57 ` [PATCH V2 4/5] tools/perf: Support pipeline stage cycles for powerpc Athira Rajeev
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 16+ messages in thread
From: Athira Rajeev @ 2021-03-22 14:57 UTC (permalink / raw)
  To: linuxppc-dev, linux-kernel, linux-perf-users, mpe, acme, jolsa
  Cc: maddy, ravi.bangoria, kjain, kan.liang, peterz

Add arch specific arch_evsel__set_sample_weight() to set the new
sample type for powerpc.

Add arch specific arch_perf_parse_sample_weight() to store the
sample->weight value depending on the sample type applied.
If the new sample type (PERF_SAMPLE_WEIGHT_STRUCT) is applied,
store only the lower 32 bits in sample->weight. If the sample type
is PERF_SAMPLE_WEIGHT, store the full 64-bit value in sample->weight.
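
The parse and synthesize rules described above amount to the following
standalone sketch. The demo_* names are hypothetical, and the two
DEMO_SAMPLE_* constants are local stand-ins for the uapi sample-type
bits; their values are an assumption of this sketch, so check
perf_event.h before relying on them.

```c
#include <stdint.h>

/* Local stand-ins for the sample-type bits (values assumed, see above). */
#define DEMO_SAMPLE_WEIGHT		(1ULL << 14)
#define DEMO_SAMPLE_WEIGHT_STRUCT	(1ULL << 24)

/*
 * Parse: with plain WEIGHT the whole 64-bit word is the weight;
 * otherwise only the low 32 bits carry the memory latency.
 */
static inline uint64_t demo_parse_weight(uint64_t raw, uint64_t type)
{
	if (type & DEMO_SAMPLE_WEIGHT)
		return raw;
	return raw & 0xffffffffULL;
}

/*
 * Synthesize: the inverse direction. For WEIGHT_STRUCT the weight is
 * clipped to the low 32 bits so it cannot spill into the other fields.
 */
static inline uint64_t demo_synthesize_weight(uint64_t weight, uint64_t type)
{
	if (type & DEMO_SAMPLE_WEIGHT_STRUCT)
		return weight & 0xffffffffULL;
	return weight;
}
```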

Signed-off-by: Athira Rajeev <atrajeev@linux.vnet.ibm.com>
---
 tools/perf/arch/powerpc/util/Build   |  2 ++
 tools/perf/arch/powerpc/util/event.c | 32 ++++++++++++++++++++++++++++++++
 tools/perf/arch/powerpc/util/evsel.c |  8 ++++++++
 3 files changed, 42 insertions(+)
 create mode 100644 tools/perf/arch/powerpc/util/event.c
 create mode 100644 tools/perf/arch/powerpc/util/evsel.c

diff --git a/tools/perf/arch/powerpc/util/Build b/tools/perf/arch/powerpc/util/Build
index b7945e5a543b..8a79c4126e5b 100644
--- a/tools/perf/arch/powerpc/util/Build
+++ b/tools/perf/arch/powerpc/util/Build
@@ -4,6 +4,8 @@ perf-y += kvm-stat.o
 perf-y += perf_regs.o
 perf-y += mem-events.o
 perf-y += sym-handling.o
+perf-y += evsel.o
+perf-y += event.o
 
 perf-$(CONFIG_DWARF) += dwarf-regs.o
 perf-$(CONFIG_DWARF) += skip-callchain-idx.o
diff --git a/tools/perf/arch/powerpc/util/event.c b/tools/perf/arch/powerpc/util/event.c
new file mode 100644
index 000000000000..f49d32c2c8ae
--- /dev/null
+++ b/tools/perf/arch/powerpc/util/event.c
@@ -0,0 +1,32 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/types.h>
+#include <linux/string.h>
+#include <linux/zalloc.h>
+
+#include "../../../util/event.h"
+#include "../../../util/synthetic-events.h"
+#include "../../../util/machine.h"
+#include "../../../util/tool.h"
+#include "../../../util/map.h"
+#include "../../../util/debug.h"
+
+void arch_perf_parse_sample_weight(struct perf_sample *data,
+				   const __u64 *array, u64 type)
+{
+	union perf_sample_weight weight;
+
+	weight.full = *array;
+	if (type & PERF_SAMPLE_WEIGHT)
+		data->weight = weight.full;
+	else
+		data->weight = weight.var1_dw;
+}
+
+void arch_perf_synthesize_sample_weight(const struct perf_sample *data,
+					__u64 *array, u64 type)
+{
+	*array = data->weight;
+
+	if (type & PERF_SAMPLE_WEIGHT_STRUCT)
+		*array &= 0xffffffff;
+}
diff --git a/tools/perf/arch/powerpc/util/evsel.c b/tools/perf/arch/powerpc/util/evsel.c
new file mode 100644
index 000000000000..2f733cdc8dbb
--- /dev/null
+++ b/tools/perf/arch/powerpc/util/evsel.c
@@ -0,0 +1,8 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <stdio.h>
+#include "util/evsel.h"
+
+void arch_evsel__set_sample_weight(struct evsel *evsel)
+{
+	evsel__set_sample_bit(evsel, WEIGHT_STRUCT);
+}
-- 
1.8.3.1



* [PATCH V2 4/5] tools/perf: Support pipeline stage cycles for powerpc
  2021-03-22 14:57 [PATCH V2 0/5] powerpc/perf: Export processor pipeline stage cycles information Athira Rajeev
                   ` (2 preceding siblings ...)
  2021-03-22 14:57 ` [PATCH V2 3/5] tools/perf: Add powerpc support for PERF_SAMPLE_WEIGHT_STRUCT Athira Rajeev
@ 2021-03-22 14:57 ` Athira Rajeev
  2021-03-22 14:57 ` [PATCH V2 5/5] tools/perf: Display sort dimension p_stage_cyc only on supported archs Athira Rajeev
  2021-04-21 13:08 ` [PATCH V2 0/5] powerpc/perf: Export processor pipeline stage cycles information Michael Ellerman
  5 siblings, 0 replies; 16+ messages in thread
From: Athira Rajeev @ 2021-03-22 14:57 UTC (permalink / raw)
  To: linuxppc-dev, linux-kernel, linux-perf-users, mpe, acme, jolsa
  Cc: maddy, ravi.bangoria, kjain, kan.liang, peterz

The pipeline stage cycle details can be recorded on powerpc from
the contents of Performance Monitor Unit (PMU) registers. On
ISA v3.1 platforms, the sampling registers expose the cycles spent in
different pipeline stages. This patch adds perf tools support to present
two of the cycle counters along with memory latency (weight).

Re-use the field 'ins_lat' for storing the first pipeline stage cycle.
This is stored in 'var2_w' field of 'perf_sample_weight'.

Add a new field 'p_stage_cyc' to store the second pipeline stage cycle
which is stored in 'var3_w' field of perf_sample_weight.

Add a new sort entry 'Pipeline Stage Cycle' and include its sort key
'p_stage_cyc' in default_mem_sort_order[]. This sort entry may be used
to denote some other pipeline stage in another architecture, so add it
to the list of sort entries that can have a dynamic header string.
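
How the dynamic header list and the powerpc header strings fit together
can be sketched as follows. This is a simplified standalone version with
hypothetical demo_* names; the sort keys and header strings are the ones
used in this series, but the real code goes through the weak
arch_perf_header_entry() hook rather than calling a powerpc function
directly.

```c
#include <stddef.h>
#include <string.h>

/* Sort keys whose column header may be replaced by the architecture. */
static const char *const demo_dynamic_headers[] = {
	"local_ins_lat", "p_stage_cyc"
};

static inline int demo_has_dynamic_header(const char *sort_key)
{
	size_t i;

	for (i = 0; i < sizeof(demo_dynamic_headers) /
			sizeof(demo_dynamic_headers[0]); i++) {
		if (!strcmp(demo_dynamic_headers[i], sort_key))
			return 1;
	}
	return 0;
}

/* powerpc-style mapping from the generic header to the arch header. */
static inline const char *demo_powerpc_header(const char *se_header)
{
	if (!strcmp(se_header, "Local INSTR Latency"))
		return "Finish Cyc";
	if (!strcmp(se_header, "Pipeline Stage Cycle"))
		return "Dispatch Cyc";
	return se_header;
}

/* Only keys on the dynamic list are passed through the arch mapping. */
static inline const char *demo_resolve_header(const char *sort_key,
					      const char *se_header)
{
	if (demo_has_dynamic_header(sort_key))
		return demo_powerpc_header(se_header);
	return se_header;
}
```

This matches the "Finish Cyc" and "Dispatch Cyc" columns shown in the
sample output of the cover letter.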

Signed-off-by: Athira Rajeev <atrajeev@linux.vnet.ibm.com>
---
 tools/perf/Documentation/perf-report.txt |  2 ++
 tools/perf/arch/powerpc/util/event.c     | 18 ++++++++++++++++--
 tools/perf/util/event.h                  |  1 +
 tools/perf/util/hist.c                   | 11 ++++++++---
 tools/perf/util/hist.h                   |  1 +
 tools/perf/util/session.c                |  4 +++-
 tools/perf/util/sort.c                   | 24 ++++++++++++++++++++++--
 tools/perf/util/sort.h                   |  2 ++
 8 files changed, 55 insertions(+), 8 deletions(-)

diff --git a/tools/perf/Documentation/perf-report.txt b/tools/perf/Documentation/perf-report.txt
index f546b5e9db05..563fb01a9b8d 100644
--- a/tools/perf/Documentation/perf-report.txt
+++ b/tools/perf/Documentation/perf-report.txt
@@ -112,6 +112,8 @@ OPTIONS
 	- ins_lat: Instruction latency in core cycles. This is the global instruction
 	  latency
 	- local_ins_lat: Local instruction latency version
+	- p_stage_cyc: On powerpc, this presents the number of cycles spent in a
+	  pipeline stage. And currently supported only on powerpc.
 
 	By default, comm, dso and symbol keys are used.
 	(i.e. --sort comm,dso,symbol)
diff --git a/tools/perf/arch/powerpc/util/event.c b/tools/perf/arch/powerpc/util/event.c
index f49d32c2c8ae..22521bc9481a 100644
--- a/tools/perf/arch/powerpc/util/event.c
+++ b/tools/perf/arch/powerpc/util/event.c
@@ -18,8 +18,11 @@ void arch_perf_parse_sample_weight(struct perf_sample *data,
 	weight.full = *array;
 	if (type & PERF_SAMPLE_WEIGHT)
 		data->weight = weight.full;
-	else
+	else {
 		data->weight = weight.var1_dw;
+		data->ins_lat = weight.var2_w;
+		data->p_stage_cyc = weight.var3_w;
+	}
 }
 
 void arch_perf_synthesize_sample_weight(const struct perf_sample *data,
@@ -27,6 +30,17 @@ void arch_perf_synthesize_sample_weight(const struct perf_sample *data,
 {
 	*array = data->weight;
 
-	if (type & PERF_SAMPLE_WEIGHT_STRUCT)
+	if (type & PERF_SAMPLE_WEIGHT_STRUCT) {
 		*array &= 0xffffffff;
+		*array |= ((u64)data->ins_lat << 32);
+	}
+}
+
+const char *arch_perf_header_entry(const char *se_header)
+{
+	if (!strcmp(se_header, "Local INSTR Latency"))
+		return "Finish Cyc";
+	else if (!strcmp(se_header, "Pipeline Stage Cycle"))
+		return "Dispatch Cyc";
+	return se_header;
 }
diff --git a/tools/perf/util/event.h b/tools/perf/util/event.h
index 6106a9c134c9..e5da4a695ff2 100644
--- a/tools/perf/util/event.h
+++ b/tools/perf/util/event.h
@@ -147,6 +147,7 @@ struct perf_sample {
 	u8  cpumode;
 	u16 misc;
 	u16 ins_lat;
+	u16 p_stage_cyc;
 	bool no_hw_idx;		/* No hw_idx collected in branch_stack */
 	char insn[MAX_INSN];
 	void *raw_data;
diff --git a/tools/perf/util/hist.c b/tools/perf/util/hist.c
index c82f5fc26af8..9299ee535518 100644
--- a/tools/perf/util/hist.c
+++ b/tools/perf/util/hist.c
@@ -211,6 +211,7 @@ void hists__calc_col_len(struct hists *hists, struct hist_entry *h)
 	hists__new_col_len(hists, HISTC_MEM_BLOCKED, 10);
 	hists__new_col_len(hists, HISTC_LOCAL_INS_LAT, 13);
 	hists__new_col_len(hists, HISTC_GLOBAL_INS_LAT, 13);
+	hists__new_col_len(hists, HISTC_P_STAGE_CYC, 13);
 	if (symbol_conf.nanosecs)
 		hists__new_col_len(hists, HISTC_TIME, 16);
 	else
@@ -289,13 +290,14 @@ static long hist_time(unsigned long htime)
 }
 
 static void he_stat__add_period(struct he_stat *he_stat, u64 period,
-				u64 weight, u64 ins_lat)
+				u64 weight, u64 ins_lat, u64 p_stage_cyc)
 {
 
 	he_stat->period		+= period;
 	he_stat->weight		+= weight;
 	he_stat->nr_events	+= 1;
 	he_stat->ins_lat	+= ins_lat;
+	he_stat->p_stage_cyc	+= p_stage_cyc;
 }
 
 static void he_stat__add_stat(struct he_stat *dest, struct he_stat *src)
@@ -308,6 +310,7 @@ static void he_stat__add_stat(struct he_stat *dest, struct he_stat *src)
 	dest->nr_events		+= src->nr_events;
 	dest->weight		+= src->weight;
 	dest->ins_lat		+= src->ins_lat;
+	dest->p_stage_cyc		+= src->p_stage_cyc;
 }
 
 static void he_stat__decay(struct he_stat *he_stat)
@@ -597,6 +600,7 @@ static struct hist_entry *hists__findnew_entry(struct hists *hists,
 	u64 period = entry->stat.period;
 	u64 weight = entry->stat.weight;
 	u64 ins_lat = entry->stat.ins_lat;
+	u64 p_stage_cyc = entry->stat.p_stage_cyc;
 	bool leftmost = true;
 
 	p = &hists->entries_in->rb_root.rb_node;
@@ -615,11 +619,11 @@ static struct hist_entry *hists__findnew_entry(struct hists *hists,
 
 		if (!cmp) {
 			if (sample_self) {
-				he_stat__add_period(&he->stat, period, weight, ins_lat);
+				he_stat__add_period(&he->stat, period, weight, ins_lat, p_stage_cyc);
 				hist_entry__add_callchain_period(he, period);
 			}
 			if (symbol_conf.cumulate_callchain)
-				he_stat__add_period(he->stat_acc, period, weight, ins_lat);
+				he_stat__add_period(he->stat_acc, period, weight, ins_lat, p_stage_cyc);
 
 			/*
 			 * This mem info was allocated from sample__resolve_mem
@@ -731,6 +735,7 @@ static void hists__res_sample(struct hist_entry *he, struct perf_sample *sample)
 			.period	= sample->period,
 			.weight = sample->weight,
 			.ins_lat = sample->ins_lat,
+			.p_stage_cyc = sample->p_stage_cyc,
 		},
 		.parent = sym_parent,
 		.filtered = symbol__parent_filter(sym_parent) | al->filtered,
diff --git a/tools/perf/util/hist.h b/tools/perf/util/hist.h
index 3c537232294b..e2faa745c8d6 100644
--- a/tools/perf/util/hist.h
+++ b/tools/perf/util/hist.h
@@ -75,6 +75,7 @@ enum hist_column {
 	HISTC_MEM_BLOCKED,
 	HISTC_LOCAL_INS_LAT,
 	HISTC_GLOBAL_INS_LAT,
+	HISTC_P_STAGE_CYC,
 	HISTC_NR_COLS, /* Last entry */
 };
 
diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index 859832a82496..a6fed96d783d 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -1302,8 +1302,10 @@ static void dump_sample(struct evsel *evsel, union perf_event *event,
 
 	if (sample_type & PERF_SAMPLE_WEIGHT_TYPE) {
 		printf("... weight: %" PRIu64 "", sample->weight);
-			if (sample_type & PERF_SAMPLE_WEIGHT_STRUCT)
+			if (sample_type & PERF_SAMPLE_WEIGHT_STRUCT) {
 				printf(",0x%"PRIx16"", sample->ins_lat);
+				printf(",0x%"PRIx16"", sample->p_stage_cyc);
+			}
 		printf("\n");
 	}
 
diff --git a/tools/perf/util/sort.c b/tools/perf/util/sort.c
index eeb03e749181..d262261ad1a6 100644
--- a/tools/perf/util/sort.c
+++ b/tools/perf/util/sort.c
@@ -37,7 +37,7 @@
 const char	*parent_pattern = default_parent_pattern;
 const char	*default_sort_order = "comm,dso,symbol";
 const char	default_branch_sort_order[] = "comm,dso_from,symbol_from,symbol_to,cycles";
-const char	default_mem_sort_order[] = "local_weight,mem,sym,dso,symbol_daddr,dso_daddr,snoop,tlb,locked,blocked,local_ins_lat";
+const char	default_mem_sort_order[] = "local_weight,mem,sym,dso,symbol_daddr,dso_daddr,snoop,tlb,locked,blocked,local_ins_lat,p_stage_cyc";
 const char	default_top_sort_order[] = "dso,symbol";
 const char	default_diff_sort_order[] = "dso,symbol";
 const char	default_tracepoint_sort_order[] = "trace";
@@ -46,7 +46,7 @@
 regex_t		ignore_callees_regex;
 int		have_ignore_callees = 0;
 enum sort_mode	sort__mode = SORT_MODE__NORMAL;
-const char	*dynamic_headers[] = {"local_ins_lat"};
+const char	*dynamic_headers[] = {"local_ins_lat", "p_stage_cyc"};
 
 /*
  * Replaces all occurrences of a char used with the:
@@ -1410,6 +1410,25 @@ struct sort_entry sort_global_ins_lat = {
 	.se_width_idx	= HISTC_GLOBAL_INS_LAT,
 };
 
+static int64_t
+sort__global_p_stage_cyc_cmp(struct hist_entry *left, struct hist_entry *right)
+{
+	return left->stat.p_stage_cyc - right->stat.p_stage_cyc;
+}
+
+static int hist_entry__p_stage_cyc_snprintf(struct hist_entry *he, char *bf,
+					size_t size, unsigned int width)
+{
+	return repsep_snprintf(bf, size, "%-*u", width, he->stat.p_stage_cyc);
+}
+
+struct sort_entry sort_p_stage_cyc = {
+	.se_header      = "Pipeline Stage Cycle",
+	.se_cmp         = sort__global_p_stage_cyc_cmp,
+	.se_snprintf	= hist_entry__p_stage_cyc_snprintf,
+	.se_width_idx	= HISTC_P_STAGE_CYC,
+};
+
 struct sort_entry sort_mem_daddr_sym = {
 	.se_header	= "Data Symbol",
 	.se_cmp		= sort__daddr_cmp,
@@ -1853,6 +1872,7 @@ static void sort_dimension_add_dynamic_header(struct sort_dimension *sd)
 	DIM(SORT_CODE_PAGE_SIZE, "code_page_size", sort_code_page_size),
 	DIM(SORT_LOCAL_INS_LAT, "local_ins_lat", sort_local_ins_lat),
 	DIM(SORT_GLOBAL_INS_LAT, "ins_lat", sort_global_ins_lat),
+	DIM(SORT_PIPELINE_STAGE_CYC, "p_stage_cyc", sort_p_stage_cyc),
 };
 
 #undef DIM
diff --git a/tools/perf/util/sort.h b/tools/perf/util/sort.h
index 63f67a3f3630..d9795ca0d676 100644
--- a/tools/perf/util/sort.h
+++ b/tools/perf/util/sort.h
@@ -51,6 +51,7 @@ struct he_stat {
 	u64			period_guest_us;
 	u64			weight;
 	u64			ins_lat;
+	u64			p_stage_cyc;
 	u32			nr_events;
 };
 
@@ -234,6 +235,7 @@ enum sort_type {
 	SORT_CODE_PAGE_SIZE,
 	SORT_LOCAL_INS_LAT,
 	SORT_GLOBAL_INS_LAT,
+	SORT_PIPELINE_STAGE_CYC,
 
 	/* branch stack specific sort keys */
 	__SORT_BRANCH_STACK,
-- 
1.8.3.1



* [PATCH V2 5/5] tools/perf: Display sort dimension p_stage_cyc only on supported archs
  2021-03-22 14:57 [PATCH V2 0/5] powerpc/perf: Export processor pipeline stage cycles information Athira Rajeev
                   ` (3 preceding siblings ...)
  2021-03-22 14:57 ` [PATCH V2 4/5] tools/perf: Support pipeline stage cycles for powerpc Athira Rajeev
@ 2021-03-22 14:57 ` Athira Rajeev
  2021-04-21 13:08 ` [PATCH V2 0/5] powerpc/perf: Export processor pipeline stage cycles information Michael Ellerman
  5 siblings, 0 replies; 16+ messages in thread
From: Athira Rajeev @ 2021-03-22 14:57 UTC (permalink / raw)
  To: linuxppc-dev, linux-kernel, linux-perf-users, mpe, acme, jolsa
  Cc: maddy, ravi.bangoria, kjain, kan.liang, peterz

The sort dimension "p_stage_cyc" is used to represent pipeline
stage cycle information. Presently, this is used only on powerpc.
For unsupported platforms, we don't want to display it
in the perf report output columns. Hence add a check in sort_dimension__add()
and skip the sort key in case it is not applicable for the particular arch.

Signed-off-by: Athira Rajeev <atrajeev@linux.vnet.ibm.com>
---
 tools/perf/arch/powerpc/util/event.c |  7 +++++++
 tools/perf/util/event.h              |  1 +
 tools/perf/util/sort.c               | 19 +++++++++++++++++++
 3 files changed, 27 insertions(+)

diff --git a/tools/perf/arch/powerpc/util/event.c b/tools/perf/arch/powerpc/util/event.c
index 22521bc9481a..3bf441257466 100644
--- a/tools/perf/arch/powerpc/util/event.c
+++ b/tools/perf/arch/powerpc/util/event.c
@@ -44,3 +44,10 @@ const char *arch_perf_header_entry(const char *se_header)
 		return "Dispatch Cyc";
 	return se_header;
 }
+
+int arch_support_sort_key(const char *sort_key)
+{
+	if (!strcmp(sort_key, "p_stage_cyc"))
+		return 1;
+	return 0;
+}
diff --git a/tools/perf/util/event.h b/tools/perf/util/event.h
index e5da4a695ff2..8a62fb39e365 100644
--- a/tools/perf/util/event.h
+++ b/tools/perf/util/event.h
@@ -429,5 +429,6 @@ void  cpu_map_data__synthesize(struct perf_record_cpu_map_data *data, struct per
 void arch_perf_parse_sample_weight(struct perf_sample *data, const __u64 *array, u64 type);
 void arch_perf_synthesize_sample_weight(const struct perf_sample *data, __u64 *array, u64 type);
 const char *arch_perf_header_entry(const char *se_header);
+int arch_support_sort_key(const char *sort_key);
 
 #endif /* __PERF_RECORD_H */
diff --git a/tools/perf/util/sort.c b/tools/perf/util/sort.c
index d262261ad1a6..e8030778ff44 100644
--- a/tools/perf/util/sort.c
+++ b/tools/perf/util/sort.c
@@ -47,6 +47,7 @@
 int		have_ignore_callees = 0;
 enum sort_mode	sort__mode = SORT_MODE__NORMAL;
 const char	*dynamic_headers[] = {"local_ins_lat", "p_stage_cyc"};
+const char	*arch_specific_sort_keys[] = {"p_stage_cyc"};
 
 /*
  * Replaces all occurrences of a char used with the:
@@ -1837,6 +1838,11 @@ struct sort_dimension {
 	int			taken;
 };
 
+int __weak arch_support_sort_key(const char *sort_key __maybe_unused)
+{
+	return 0;
+}
+
 const char * __weak arch_perf_header_entry(const char *se_header)
 {
 	return se_header;
@@ -2773,6 +2779,19 @@ int sort_dimension__add(struct perf_hpp_list *list, const char *tok,
 {
 	unsigned int i, j;
 
+	/*
+	 * Check to see if there are any arch specific
+	 * sort dimensions not applicable for the current
+	 * architecture. If so, skip that sort key since
+	 * we don't want to display it in the output fields.
+	 */
+	for (j = 0; j < ARRAY_SIZE(arch_specific_sort_keys); j++) {
+		if (!strcmp(arch_specific_sort_keys[j], tok) &&
+				!arch_support_sort_key(tok)) {
+			return 0;
+		}
+	}
+
 	for (i = 0; i < ARRAY_SIZE(common_sort_dimensions); i++) {
 		struct sort_dimension *sd = &common_sort_dimensions[i];
 
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 16+ messages in thread
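
Note on patch 5/5 above: the filtering relies on a weak default that an architecture can override with a strong definition. The sketch below reproduces that pattern in a single file, assuming a GCC or Clang toolchain that understands __attribute__((weak)). The helper sort_key_is_skipped() is hypothetical and only isolates the loop that the patch adds to sort_dimension__add().

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Sort keys that only some architectures can populate. */
static const char *arch_specific_sort_keys[] = { "p_stage_cyc" };

/*
 * Weak default: no arch specific key is supported. An architecture that
 * supports a key links in a strong definition that replaces this one.
 */
__attribute__((weak)) int arch_support_sort_key(const char *sort_key)
{
	(void)sort_key;
	return 0;
}

/* Hypothetical helper: 1 if 'tok' should be silently dropped, else 0. */
int sort_key_is_skipped(const char *tok)
{
	size_t j;

	for (j = 0; j < ARRAY_SIZE(arch_specific_sort_keys); j++) {
		if (!strcmp(arch_specific_sort_keys[j], tok) &&
		    !arch_support_sort_key(tok))
			return 1;
	}
	return 0;
}

#ifdef DEMO_MAIN
int main(void)
{
	/* With only the weak default linked in, the key is skipped. */
	assert(sort_key_is_skipped("p_stage_cyc") == 1);
	assert(sort_key_is_skipped("local_weight") == 0);
	printf("p_stage_cyc skipped: %d\n", sort_key_is_skipped("p_stage_cyc"));
	return 0;
}
#endif
```

Linking a second object file that defines a non-weak arch_support_sort_key() returning 1 for "p_stage_cyc" would make the helper return 0 for that key, which is what the powerpc event.c in the patch does.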

* Re: [PATCH V2 1/5] powerpc/perf: Expose processor pipeline stage cycles using PERF_SAMPLE_WEIGHT_STRUCT
  2021-03-22 14:57 ` [PATCH V2 1/5] powerpc/perf: Expose processor pipeline stage cycles using PERF_SAMPLE_WEIGHT_STRUCT Athira Rajeev
@ 2021-03-24  4:35   ` Madhavan Srinivasan
  2021-03-25 13:01     ` Arnaldo Carvalho de Melo
  2021-03-25 13:06     ` Arnaldo Carvalho de Melo
  0 siblings, 2 replies; 16+ messages in thread
From: Madhavan Srinivasan @ 2021-03-24  4:35 UTC (permalink / raw)
  To: Athira Rajeev, linuxppc-dev, linux-kernel, linux-perf-users, mpe,
	acme, jolsa
  Cc: ravi.bangoria, kjain, kan.liang, peterz


On 3/22/21 8:27 PM, Athira Rajeev wrote:
> Performance Monitoring Unit (PMU) registers in powerpc provide
> information on cycles elapsed between different stages in the
> pipeline. This can be used for application tuning. On ISA v3.1
> platforms, this information is exposed by sampling registers.
> This patch adds kernel support to capture two of the cycle counters
> as part of the perf sample using the sample type:
> PERF_SAMPLE_WEIGHT_STRUCT.
>
> The power PMU function 'get_mem_weight' currently uses the 64-bit weight
> field of perf_sample_data to capture memory latency. But following the
> introduction of PERF_SAMPLE_WEIGHT_TYPE, the weight field could contain
> a 64-bit or 32-bit value depending on the architecture support for
> PERF_SAMPLE_WEIGHT_STRUCT. The patches use WEIGHT_STRUCT to expose the
> pipeline stage cycles info. Hence update the ppmu functions to work for
> 64-bit and 32-bit weight values.
>
> If the sample type is PERF_SAMPLE_WEIGHT, use the 64-bit weight field.
> If the sample type is PERF_SAMPLE_WEIGHT_STRUCT, memory subsystem
> latency is stored in the low 32 bits of the perf_sample_weight structure.
> Also for CPU_FTR_ARCH_31, capture the two cycle counter values in
> two 16-bit fields of the perf_sample_weight structure.

Changes look fine to me.

Reviewed-by: Madhavan Srinivasan <maddy@linux.ibm.com>


> Signed-off-by: Athira Rajeev <atrajeev@linux.vnet.ibm.com>
> ---
>   arch/powerpc/include/asm/perf_event_server.h |  2 +-
>   arch/powerpc/perf/core-book3s.c              |  4 ++--
>   arch/powerpc/perf/isa207-common.c            | 29 +++++++++++++++++++++++++---
>   arch/powerpc/perf/isa207-common.h            |  6 +++++-
>   4 files changed, 34 insertions(+), 7 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/perf_event_server.h b/arch/powerpc/include/asm/perf_event_server.h
> index 00e7e671bb4b..112cf092d7b3 100644
> --- a/arch/powerpc/include/asm/perf_event_server.h
> +++ b/arch/powerpc/include/asm/perf_event_server.h
> @@ -43,7 +43,7 @@ struct power_pmu {
>   				u64 alt[]);
>   	void		(*get_mem_data_src)(union perf_mem_data_src *dsrc,
>   				u32 flags, struct pt_regs *regs);
> -	void		(*get_mem_weight)(u64 *weight);
> +	void		(*get_mem_weight)(u64 *weight, u64 type);
>   	unsigned long	group_constraint_mask;
>   	unsigned long	group_constraint_val;
>   	u64             (*bhrb_filter_map)(u64 branch_sample_type);
> diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c
> index 766f064f00fb..6936763246bd 100644
> --- a/arch/powerpc/perf/core-book3s.c
> +++ b/arch/powerpc/perf/core-book3s.c
> @@ -2206,9 +2206,9 @@ static void record_and_restart(struct perf_event *event, unsigned long val,
>   						ppmu->get_mem_data_src)
>   			ppmu->get_mem_data_src(&data.data_src, ppmu->flags, regs);
>   
> -		if (event->attr.sample_type & PERF_SAMPLE_WEIGHT &&
> +		if (event->attr.sample_type & PERF_SAMPLE_WEIGHT_TYPE &&
>   						ppmu->get_mem_weight)
> -			ppmu->get_mem_weight(&data.weight.full);
> +			ppmu->get_mem_weight(&data.weight.full, event->attr.sample_type);
>   
>   		if (perf_event_overflow(event, &data, regs))
>   			power_pmu_stop(event, 0);
> diff --git a/arch/powerpc/perf/isa207-common.c b/arch/powerpc/perf/isa207-common.c
> index e4f577da33d8..5dcbdbd54598 100644
> --- a/arch/powerpc/perf/isa207-common.c
> +++ b/arch/powerpc/perf/isa207-common.c
> @@ -284,8 +284,10 @@ void isa207_get_mem_data_src(union perf_mem_data_src *dsrc, u32 flags,
>   	}
>   }
>   
> -void isa207_get_mem_weight(u64 *weight)
> +void isa207_get_mem_weight(u64 *weight, u64 type)
>   {
> +	union perf_sample_weight *weight_fields;
> +	u64 weight_lat;
>   	u64 mmcra = mfspr(SPRN_MMCRA);
>   	u64 exp = MMCRA_THR_CTR_EXP(mmcra);
>   	u64 mantissa = MMCRA_THR_CTR_MANT(mmcra);
> @@ -296,9 +298,30 @@ void isa207_get_mem_weight(u64 *weight)
>   		mantissa = P10_MMCRA_THR_CTR_MANT(mmcra);
>   
>   	if (val == 0 || val == 7)
> -		*weight = 0;
> +		weight_lat = 0;
>   	else
> -		*weight = mantissa << (2 * exp);
> +		weight_lat = mantissa << (2 * exp);
> +
> +	/*
> +	 * Use 64 bit weight field (full) if sample type is
> +	 * WEIGHT.
> +	 *
> +	 * If the sample type is WEIGHT_STRUCT:
> +	 * - store memory latency in the lower 32 bits.
> +	 * - For ISA v3.1, use remaining two 16 bit fields of
> +	 *   perf_sample_weight to store cycle counter values
> +	 *   from sier2.
> +	 */
> +	weight_fields = (union perf_sample_weight *)weight;
> +	if (type & PERF_SAMPLE_WEIGHT)
> +		weight_fields->full = weight_lat;
> +	else {
> +		weight_fields->var1_dw = (u32)weight_lat;
> +		if (cpu_has_feature(CPU_FTR_ARCH_31)) {
> +			weight_fields->var2_w = P10_SIER2_FINISH_CYC(mfspr(SPRN_SIER2));
> +			weight_fields->var3_w = P10_SIER2_DISPATCH_CYC(mfspr(SPRN_SIER2));
> +		}
> +	}
>   }
>   
>   int isa207_get_constraint(u64 event, unsigned long *maskp, unsigned long *valp, u64 event_config1)
> diff --git a/arch/powerpc/perf/isa207-common.h b/arch/powerpc/perf/isa207-common.h
> index 1af0e8c97ac7..fc30d43c4d0c 100644
> --- a/arch/powerpc/perf/isa207-common.h
> +++ b/arch/powerpc/perf/isa207-common.h
> @@ -265,6 +265,10 @@
>   #define ISA207_SIER_DATA_SRC_SHIFT	53
>   #define ISA207_SIER_DATA_SRC_MASK	(0x7ull << ISA207_SIER_DATA_SRC_SHIFT)
>   
> +/* Bits in SIER2/SIER3 for Power10 */
> +#define P10_SIER2_FINISH_CYC(sier2)	(((sier2) >> (63 - 37)) & 0x7fful)
> +#define P10_SIER2_DISPATCH_CYC(sier2)	(((sier2) >> (63 - 13)) & 0x7fful)
> +
>   #define P(a, b)				PERF_MEM_S(a, b)
>   #define PH(a, b)			(P(LVL, HIT) | P(a, b))
>   #define PM(a, b)			(P(LVL, MISS) | P(a, b))
> @@ -278,6 +282,6 @@ int isa207_get_alternatives(u64 event, u64 alt[], int size, unsigned int flags,
>   					const unsigned int ev_alt[][MAX_ALT]);
>   void isa207_get_mem_data_src(union perf_mem_data_src *dsrc, u32 flags,
>   							struct pt_regs *regs);
> -void isa207_get_mem_weight(u64 *weight);
> +void isa207_get_mem_weight(u64 *weight, u64 type);
>   
>   #endif

^ permalink raw reply	[flat|nested] 16+ messages in thread
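
Note on patch 1/5 quoted above: the two SIER2 macros use IBM bit numbering (bit 0 is the most significant bit), so "63 - 37" is the right shift that brings the 11-bit field ending at IBM bit 37 down to bit 0. The sketch below works through the extraction and the packing into the 64-bit weight in user space. It is an illustration under stated assumptions, not the kernel code: demo_pack_weight() is a hypothetical helper, the DEMO_SAMPLE_* values are assumed to match the uapi perf_event.h bits, and shifts are used instead of the union so the result does not depend on host endianness.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Field extractors as written in the patch (IBM bit numbering, bit 0 = MSB). */
#define P10_SIER2_FINISH_CYC(sier2)	(((sier2) >> (63 - 37)) & 0x7fful)
#define P10_SIER2_DISPATCH_CYC(sier2)	(((sier2) >> (63 - 13)) & 0x7fful)

/* Assumed to mirror PERF_SAMPLE_WEIGHT and PERF_SAMPLE_WEIGHT_STRUCT. */
#define DEMO_SAMPLE_WEIGHT		(1ULL << 14)
#define DEMO_SAMPLE_WEIGHT_STRUCT	(1ULL << 24)

/*
 * Hypothetical helper mirroring the tail of isa207_get_mem_weight():
 * - WEIGHT:        the full 64-bit latency is returned unchanged.
 * - WEIGHT_STRUCT: latency in bits 0-31 (var1_dw), finish cycles in
 *                  bits 32-47 (var2_w), dispatch cycles in bits 48-63 (var3_w).
 * In this sketch the two 16-bit fields stay zero when is_isa31 is 0.
 */
uint64_t demo_pack_weight(uint64_t type, uint64_t weight_lat,
			  uint64_t sier2, int is_isa31)
{
	uint64_t packed;

	if (type & DEMO_SAMPLE_WEIGHT)
		return weight_lat;

	packed = (uint32_t)weight_lat;
	if (is_isa31) {
		packed |= (uint64_t)P10_SIER2_FINISH_CYC(sier2) << 32;
		packed |= (uint64_t)P10_SIER2_DISPATCH_CYC(sier2) << 48;
	}
	return packed;
}

#ifdef DEMO_MAIN
int main(void)
{
	/* Made-up register image: finish = 0x123, dispatch = 0x456. */
	uint64_t sier2 = (0x123ULL << 26) | (0x456ULL << 50);
	uint64_t w = demo_pack_weight(DEMO_SAMPLE_WEIGHT_STRUCT, 0x40, sier2, 1);

	assert(w == 0x0456012300000040ULL);
	printf("packed weight: 0x%016llx\n", (unsigned long long)w);
	return 0;
}
#endif
```

The register value in the self-check is invented for illustration only; real SIER2 contents come from the hardware sample.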

* Re: [PATCH V2 3/5] tools/perf: Add powerpc support for PERF_SAMPLE_WEIGHT_STRUCT
  2021-03-22 14:57 ` [PATCH V2 3/5] tools/perf: Add powerpc support for PERF_SAMPLE_WEIGHT_STRUCT Athira Rajeev
@ 2021-03-24 19:43   ` Jiri Olsa
       [not found]     ` <80EE46ED-9007-4CB7-9A52-A7A2ADC616C6@linux.vnet.ibm.com>
  0 siblings, 1 reply; 16+ messages in thread
From: Jiri Olsa @ 2021-03-24 19:43 UTC (permalink / raw)
  To: Athira Rajeev
  Cc: linuxppc-dev, linux-kernel, linux-perf-users, mpe, acme, jolsa,
	maddy, ravi.bangoria, kjain, kan.liang, peterz

On Mon, Mar 22, 2021 at 10:57:25AM -0400, Athira Rajeev wrote:
> Add arch specific arch_evsel__set_sample_weight() to set the new
> sample type for powerpc.
> 
> Add arch specific arch_perf_parse_sample_weight() to store the
> sample->weight values depending on the sample type applied.
> If the new sample type (PERF_SAMPLE_WEIGHT_STRUCT) is applied,
> store only the lower 32 bits to sample->weight. If the sample type
> is 'PERF_SAMPLE_WEIGHT', store the full 64 bits to sample->weight.
> 
> Signed-off-by: Athira Rajeev <atrajeev@linux.vnet.ibm.com>
> ---
>  tools/perf/arch/powerpc/util/Build   |  2 ++
>  tools/perf/arch/powerpc/util/event.c | 32 ++++++++++++++++++++++++++++++++
>  tools/perf/arch/powerpc/util/evsel.c |  8 ++++++++
>  3 files changed, 42 insertions(+)
>  create mode 100644 tools/perf/arch/powerpc/util/event.c
>  create mode 100644 tools/perf/arch/powerpc/util/evsel.c
> 
> diff --git a/tools/perf/arch/powerpc/util/Build b/tools/perf/arch/powerpc/util/Build
> index b7945e5a543b..8a79c4126e5b 100644
> --- a/tools/perf/arch/powerpc/util/Build
> +++ b/tools/perf/arch/powerpc/util/Build
> @@ -4,6 +4,8 @@ perf-y += kvm-stat.o
>  perf-y += perf_regs.o
>  perf-y += mem-events.o
>  perf-y += sym-handling.o
> +perf-y += evsel.o
> +perf-y += event.o
>  
>  perf-$(CONFIG_DWARF) += dwarf-regs.o
>  perf-$(CONFIG_DWARF) += skip-callchain-idx.o
> diff --git a/tools/perf/arch/powerpc/util/event.c b/tools/perf/arch/powerpc/util/event.c
> new file mode 100644
> index 000000000000..f49d32c2c8ae
> --- /dev/null
> +++ b/tools/perf/arch/powerpc/util/event.c
> @@ -0,0 +1,32 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include <linux/types.h>
> +#include <linux/string.h>
> +#include <linux/zalloc.h>
> +
> +#include "../../../util/event.h"
> +#include "../../../util/synthetic-events.h"
> +#include "../../../util/machine.h"
> +#include "../../../util/tool.h"
> +#include "../../../util/map.h"
> +#include "../../../util/debug.h"

nit: just #include "util/...h" should work, no?

other than that, the patchset looks ok to me

Acked-by: Jiri Olsa <jolsa@redhat.com>

thanks,
jirka

> +
> +void arch_perf_parse_sample_weight(struct perf_sample *data,
> +				   const __u64 *array, u64 type)
> +{
> +	union perf_sample_weight weight;
> +
> +	weight.full = *array;
> +	if (type & PERF_SAMPLE_WEIGHT)
> +		data->weight = weight.full;
> +	else
> +		data->weight = weight.var1_dw;
> +}
> +
> +void arch_perf_synthesize_sample_weight(const struct perf_sample *data,
> +					__u64 *array, u64 type)
> +{
> +	*array = data->weight;
> +
> +	if (type & PERF_SAMPLE_WEIGHT_STRUCT)
> +		*array &= 0xffffffff;
> +}
> diff --git a/tools/perf/arch/powerpc/util/evsel.c b/tools/perf/arch/powerpc/util/evsel.c
> new file mode 100644
> index 000000000000..2f733cdc8dbb
> --- /dev/null
> +++ b/tools/perf/arch/powerpc/util/evsel.c
> @@ -0,0 +1,8 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include <stdio.h>
> +#include "util/evsel.h"
> +
> +void arch_evsel__set_sample_weight(struct evsel *evsel)
> +{
> +	evsel__set_sample_bit(evsel, WEIGHT_STRUCT);
> +}
> -- 
> 1.8.3.1
> 


^ permalink raw reply	[flat|nested] 16+ messages in thread
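
Note on patch 3/5 quoted above: the tool-side parse and synthesize helpers only differ in which sample type bit they test. The sketch below restates that logic on plain integers. The function names are hypothetical, the DEMO_SAMPLE_* values are assumed to match the uapi perf_event.h bits, and a mask replaces the union access so the example behaves the same on any host.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Assumed to mirror PERF_SAMPLE_WEIGHT and PERF_SAMPLE_WEIGHT_STRUCT. */
#define DEMO_SAMPLE_WEIGHT		(1ULL << 14)
#define DEMO_SAMPLE_WEIGHT_STRUCT	(1ULL << 24)

/* Parse: keep all 64 bits for WEIGHT, only the low 32 bits otherwise. */
uint64_t demo_parse_weight(uint64_t raw, uint64_t type)
{
	if (type & DEMO_SAMPLE_WEIGHT)
		return raw;
	return raw & 0xffffffffULL;
}

/* Synthesize: truncate to 32 bits only when WEIGHT_STRUCT is requested. */
uint64_t demo_synthesize_weight(uint64_t weight, uint64_t type)
{
	if (type & DEMO_SAMPLE_WEIGHT_STRUCT)
		return weight & 0xffffffffULL;
	return weight;
}

#ifdef DEMO_MAIN
int main(void)
{
	uint64_t raw = 0x0456012300000040ULL;

	assert(demo_parse_weight(raw, DEMO_SAMPLE_WEIGHT_STRUCT) == 0x40);
	assert(demo_parse_weight(raw, DEMO_SAMPLE_WEIGHT) == raw);
	printf("struct weight: %llu\n",
	       (unsigned long long)demo_parse_weight(raw, DEMO_SAMPLE_WEIGHT_STRUCT));
	return 0;
}
#endif
```

With WEIGHT_STRUCT, the upper 32 bits are not part of the weight value in this sketch; they are simply masked off.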

* Re: [PATCH V2 1/5] powerpc/perf: Expose processor pipeline stage cycles using PERF_SAMPLE_WEIGHT_STRUCT
  2021-03-24  4:35   ` Madhavan Srinivasan
@ 2021-03-25 13:01     ` Arnaldo Carvalho de Melo
  2021-03-25 14:38       ` Peter Zijlstra
  2021-03-25 13:06     ` Arnaldo Carvalho de Melo
  1 sibling, 1 reply; 16+ messages in thread
From: Arnaldo Carvalho de Melo @ 2021-03-25 13:01 UTC (permalink / raw)
  To: Madhavan Srinivasan
  Cc: Athira Rajeev, linuxppc-dev, linux-kernel, linux-perf-users, mpe,
	jolsa, ravi.bangoria, kjain, kan.liang, peterz

Em Wed, Mar 24, 2021 at 10:05:23AM +0530, Madhavan Srinivasan escreveu:
> 
> On 3/22/21 8:27 PM, Athira Rajeev wrote:
> > Performance Monitoring Unit (PMU) registers in powerpc provide
> > information on cycles elapsed between different stages in the
> > pipeline. This can be used for application tuning. On ISA v3.1
> > platforms, this information is exposed by sampling registers.
> > This patch adds kernel support to capture two of the cycle counters
> > as part of the perf sample using the sample type:
> > PERF_SAMPLE_WEIGHT_STRUCT.
> > 
> > The power PMU function 'get_mem_weight' currently uses the 64-bit weight
> > field of perf_sample_data to capture memory latency. But following the
> > introduction of PERF_SAMPLE_WEIGHT_TYPE, the weight field could contain
> > a 64-bit or 32-bit value depending on the architecture support for
> > PERF_SAMPLE_WEIGHT_STRUCT. The patches use WEIGHT_STRUCT to expose the
> > pipeline stage cycles info. Hence update the ppmu functions to work for
> > 64-bit and 32-bit weight values.
> > 
> > If the sample type is PERF_SAMPLE_WEIGHT, use the 64-bit weight field.
> > If the sample type is PERF_SAMPLE_WEIGHT_STRUCT, memory subsystem
> > latency is stored in the low 32 bits of the perf_sample_weight structure.
> > Also for CPU_FTR_ARCH_31, capture the two cycle counter values in
> > two 16-bit fields of the perf_sample_weight structure.
> 
> Changes look fine to me.
> 
> Reviewed-by: Madhavan Srinivasan <maddy@linux.ibm.com>

So who will process the kernel bits? I'm merging the tooling parts.

Thanks,

- Arnaldo
 

-- 

- Arnaldo

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH V2 1/5] powerpc/perf: Expose processor pipeline stage cycles using PERF_SAMPLE_WEIGHT_STRUCT
  2021-03-24  4:35   ` Madhavan Srinivasan
  2021-03-25 13:01     ` Arnaldo Carvalho de Melo
@ 2021-03-25 13:06     ` Arnaldo Carvalho de Melo
  2021-03-26  8:32       ` Madhavan Srinivasan
  1 sibling, 1 reply; 16+ messages in thread
From: Arnaldo Carvalho de Melo @ 2021-03-25 13:06 UTC (permalink / raw)
  To: Madhavan Srinivasan
  Cc: Athira Rajeev, linuxppc-dev, linux-kernel, linux-perf-users, mpe,
	jolsa, ravi.bangoria, kjain, kan.liang, peterz

Em Wed, Mar 24, 2021 at 10:05:23AM +0530, Madhavan Srinivasan escreveu:
> 
> On 3/22/21 8:27 PM, Athira Rajeev wrote:
> > Performance Monitoring Unit (PMU) registers in powerpc provide
> > information on cycles elapsed between different stages in the
> > pipeline. This can be used for application tuning. On ISA v3.1
> > platforms, this information is exposed by sampling registers.
> > This patch adds kernel support to capture two of the cycle counters
> > as part of the perf sample using the sample type:
> > PERF_SAMPLE_WEIGHT_STRUCT.
> > 
> > The power PMU function 'get_mem_weight' currently uses the 64-bit weight
> > field of perf_sample_data to capture memory latency. But following the
> > introduction of PERF_SAMPLE_WEIGHT_TYPE, the weight field could contain
> > a 64-bit or 32-bit value depending on the architecture support for
> > PERF_SAMPLE_WEIGHT_STRUCT. The patches use WEIGHT_STRUCT to expose the
> > pipeline stage cycles info. Hence update the ppmu functions to work for
> > 64-bit and 32-bit weight values.
> > 
> > If the sample type is PERF_SAMPLE_WEIGHT, use the 64-bit weight field.
> > If the sample type is PERF_SAMPLE_WEIGHT_STRUCT, memory subsystem
> > latency is stored in the low 32 bits of the perf_sample_weight structure.
> > Also for CPU_FTR_ARCH_31, capture the two cycle counter values in
> > two 16-bit fields of the perf_sample_weight structure.
> 
> Changes look fine to me.

You mean just the kernel part, or can I add your Reviewed-by to the
whole patchset?
 
> Reviewed-by: Madhavan Srinivasan <maddy@linux.ibm.com>

-- 

- Arnaldo

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH V2 1/5] powerpc/perf: Expose processor pipeline stage cycles using PERF_SAMPLE_WEIGHT_STRUCT
  2021-03-25 13:01     ` Arnaldo Carvalho de Melo
@ 2021-03-25 14:38       ` Peter Zijlstra
  2021-03-25 16:42         ` Arnaldo
  0 siblings, 1 reply; 16+ messages in thread
From: Peter Zijlstra @ 2021-03-25 14:38 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: Madhavan Srinivasan, Athira Rajeev, linuxppc-dev, linux-kernel,
	linux-perf-users, mpe, jolsa, ravi.bangoria, kjain, kan.liang

On Thu, Mar 25, 2021 at 10:01:35AM -0300, Arnaldo Carvalho de Melo wrote:
> Em Wed, Mar 24, 2021 at 10:05:23AM +0530, Madhavan Srinivasan escreveu:
> > 
> > On 3/22/21 8:27 PM, Athira Rajeev wrote:
> > > Performance Monitoring Unit (PMU) registers in powerpc provide
> > > information on cycles elapsed between different stages in the
> > > pipeline. This can be used for application tuning. On ISA v3.1
> > > platforms, this information is exposed by sampling registers.
> > > This patch adds kernel support to capture two of the cycle counters
> > > as part of the perf sample using the sample type:
> > > PERF_SAMPLE_WEIGHT_STRUCT.
> > > 
> > > The power PMU function 'get_mem_weight' currently uses the 64-bit weight
> > > field of perf_sample_data to capture memory latency. But following the
> > > introduction of PERF_SAMPLE_WEIGHT_TYPE, the weight field could contain
> > > a 64-bit or 32-bit value depending on the architecture support for
> > > PERF_SAMPLE_WEIGHT_STRUCT. The patches use WEIGHT_STRUCT to expose the
> > > pipeline stage cycles info. Hence update the ppmu functions to work for
> > > 64-bit and 32-bit weight values.
> > > 
> > > If the sample type is PERF_SAMPLE_WEIGHT, use the 64-bit weight field.
> > > If the sample type is PERF_SAMPLE_WEIGHT_STRUCT, memory subsystem
> > > latency is stored in the low 32 bits of the perf_sample_weight structure.
> > > Also for CPU_FTR_ARCH_31, capture the two cycle counter values in
> > > two 16-bit fields of the perf_sample_weight structure.
> > 
> > Changes look fine to me.
> > 
> > Reviewed-by: Madhavan Srinivasan <maddy@linux.ibm.com>
> 
> So who will process the kernel bits? I'm merging the tooling parts.

I was sorta expecting these to go through the powerpc tree. Let me know
if you want them in tip/perf/core instead.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH V2 1/5] powerpc/perf: Expose processor pipeline stage cycles using PERF_SAMPLE_WEIGHT_STRUCT
  2021-03-25 14:38       ` Peter Zijlstra
@ 2021-03-25 16:42         ` Arnaldo
  2021-03-27 13:14           ` Michael Ellerman
  0 siblings, 1 reply; 16+ messages in thread
From: Arnaldo @ 2021-03-25 16:42 UTC (permalink / raw)
  To: Peter Zijlstra, Arnaldo Carvalho de Melo
  Cc: Madhavan Srinivasan, Athira Rajeev, linuxppc-dev, linux-kernel,
	linux-perf-users, mpe, jolsa, ravi.bangoria, kjain, kan.liang



On March 25, 2021 11:38:01 AM GMT-03:00, Peter Zijlstra <peterz@infradead.org> wrote:
>On Thu, Mar 25, 2021 at 10:01:35AM -0300, Arnaldo Carvalho de Melo wrote:
>> > > Also for CPU_FTR_ARCH_31, capture the two cycle counter values in
>> > > two 16-bit fields of the perf_sample_weight structure.
>> > 
>> > Changes look fine to me.
>> > 
>> > Reviewed-by: Madhavan Srinivasan <maddy@linux.ibm.com>
>> 
>> So who will process the kernel bits? I'm merging the tooling parts.
>
>I was sorta expecting these to go through the powerpc tree. Let me know
>if you want them in tip/perf/core instead.

Shouldn't matter by which tree it gets upstream, as long as it gets picked :-)

- Arnaldo


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH V2 1/5] powerpc/perf: Expose processor pipeline stage cycles using PERF_SAMPLE_WEIGHT_STRUCT
  2021-03-25 13:06     ` Arnaldo Carvalho de Melo
@ 2021-03-26  8:32       ` Madhavan Srinivasan
  0 siblings, 0 replies; 16+ messages in thread
From: Madhavan Srinivasan @ 2021-03-26  8:32 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: Athira Rajeev, linuxppc-dev, linux-kernel, linux-perf-users, mpe,
	jolsa, ravi.bangoria, kjain, kan.liang, peterz


On 3/25/21 6:36 PM, Arnaldo Carvalho de Melo wrote:
> Em Wed, Mar 24, 2021 at 10:05:23AM +0530, Madhavan Srinivasan escreveu:
>> On 3/22/21 8:27 PM, Athira Rajeev wrote:
>>> Performance Monitoring Unit (PMU) registers in powerpc provides
>>> information on cycles elapsed between different stages in the
>>> pipeline. This can be used for application tuning. On ISA v3.1
>>> platform, this information is exposed by sampling registers.
>>> Patch adds kernel support to capture two of the cycle counters
>>> as part of perf sample using the sample type:
>>> PERF_SAMPLE_WEIGHT_STRUCT.
>>>
>>> The power PMU function 'get_mem_weight' currently uses 64 bit weight
>>> field of perf_sample_data to capture memory latency. But following the
>>> introduction of PERF_SAMPLE_WEIGHT_TYPE, weight field could contain
>>> 64-bit or 32-bit value depending on the architecture support for
>>> PERF_SAMPLE_WEIGHT_STRUCT. Patches uses WEIGHT_STRUCT to expose the
>>> pipeline stage cycles info. Hence update the ppmu functions to work for
>>> 64-bit and 32-bit weight values.
>>>
>>> If the sample type is PERF_SAMPLE_WEIGHT, use the 64-bit weight field.
>>> If the sample type is PERF_SAMPLE_WEIGHT_STRUCT, memory subsystem
>>> latency is stored in the low 32 bits of perf_sample_weight structure.
>>> Also for CPU_FTR_ARCH_31, capture the two cycle counter information in
>>> two 16 bit fields of perf_sample_weight structure.
>> Changes look fine to me.
> You mean just the kernel part or can I add your Reviewed-by to all the
> patchset?


Yes, kindly add it, I did review the patchset. My bad, I should have
mentioned it here or should have replied to the cover letter.


Maddy


>   
>> Reviewed-by: Madhavan Srinivasan <maddy@linux.ibm.com>
>>
>>
>>> Signed-off-by: Athira Rajeev <atrajeev@linux.vnet.ibm.com>
>>> ---
>>>    arch/powerpc/include/asm/perf_event_server.h |  2 +-
>>>    arch/powerpc/perf/core-book3s.c              |  4 ++--
>>>    arch/powerpc/perf/isa207-common.c            | 29 +++++++++++++++++++++++++---
>>>    arch/powerpc/perf/isa207-common.h            |  6 +++++-
>>>    4 files changed, 34 insertions(+), 7 deletions(-)
>>>
>>> diff --git a/arch/powerpc/include/asm/perf_event_server.h b/arch/powerpc/include/asm/perf_event_server.h
>>> index 00e7e671bb4b..112cf092d7b3 100644
>>> --- a/arch/powerpc/include/asm/perf_event_server.h
>>> +++ b/arch/powerpc/include/asm/perf_event_server.h
>>> @@ -43,7 +43,7 @@ struct power_pmu {
>>>    				u64 alt[]);
>>>    	void		(*get_mem_data_src)(union perf_mem_data_src *dsrc,
>>>    				u32 flags, struct pt_regs *regs);
>>> -	void		(*get_mem_weight)(u64 *weight);
>>> +	void		(*get_mem_weight)(u64 *weight, u64 type);
>>>    	unsigned long	group_constraint_mask;
>>>    	unsigned long	group_constraint_val;
>>>    	u64             (*bhrb_filter_map)(u64 branch_sample_type);
>>> diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c
>>> index 766f064f00fb..6936763246bd 100644
>>> --- a/arch/powerpc/perf/core-book3s.c
>>> +++ b/arch/powerpc/perf/core-book3s.c
>>> @@ -2206,9 +2206,9 @@ static void record_and_restart(struct perf_event *event, unsigned long val,
>>>    						ppmu->get_mem_data_src)
>>>    			ppmu->get_mem_data_src(&data.data_src, ppmu->flags, regs);
>>> -		if (event->attr.sample_type & PERF_SAMPLE_WEIGHT &&
>>> +		if (event->attr.sample_type & PERF_SAMPLE_WEIGHT_TYPE &&
>>>    						ppmu->get_mem_weight)
>>> -			ppmu->get_mem_weight(&data.weight.full);
>>> +			ppmu->get_mem_weight(&data.weight.full, event->attr.sample_type);
>>>    		if (perf_event_overflow(event, &data, regs))
>>>    			power_pmu_stop(event, 0);
>>> diff --git a/arch/powerpc/perf/isa207-common.c b/arch/powerpc/perf/isa207-common.c
>>> index e4f577da33d8..5dcbdbd54598 100644
>>> --- a/arch/powerpc/perf/isa207-common.c
>>> +++ b/arch/powerpc/perf/isa207-common.c
>>> @@ -284,8 +284,10 @@ void isa207_get_mem_data_src(union perf_mem_data_src *dsrc, u32 flags,
>>>    	}
>>>    }
>>> -void isa207_get_mem_weight(u64 *weight)
>>> +void isa207_get_mem_weight(u64 *weight, u64 type)
>>>    {
>>> +	union perf_sample_weight *weight_fields;
>>> +	u64 weight_lat;
>>>    	u64 mmcra = mfspr(SPRN_MMCRA);
>>>    	u64 exp = MMCRA_THR_CTR_EXP(mmcra);
>>>    	u64 mantissa = MMCRA_THR_CTR_MANT(mmcra);
>>> @@ -296,9 +298,30 @@ void isa207_get_mem_weight(u64 *weight)
>>>    		mantissa = P10_MMCRA_THR_CTR_MANT(mmcra);
>>>    	if (val == 0 || val == 7)
>>> -		*weight = 0;
>>> +		weight_lat = 0;
>>>    	else
>>> -		*weight = mantissa << (2 * exp);
>>> +		weight_lat = mantissa << (2 * exp);
>>> +
>>> +	/*
>>> +	 * Use 64 bit weight field (full) if sample type is
>>> +	 * WEIGHT.
>>> +	 *
>>> +	 * if sample type is WEIGHT_STRUCT:
>>> +	 * - store memory latency in the lower 32 bits.
>>> +	 * - For ISA v3.1, use remaining two 16 bit fields of
>>> +	 *   perf_sample_weight to store cycle counter values
>>> +	 *   from sier2.
>>> +	 */
>>> +	weight_fields = (union perf_sample_weight *)weight;
>>> +	if (type & PERF_SAMPLE_WEIGHT)
>>> +		weight_fields->full = weight_lat;
>>> +	else {
>>> +		weight_fields->var1_dw = (u32)weight_lat;
>>> +		if (cpu_has_feature(CPU_FTR_ARCH_31)) {
>>> +			weight_fields->var2_w = P10_SIER2_FINISH_CYC(mfspr(SPRN_SIER2));
>>> +			weight_fields->var3_w = P10_SIER2_DISPATCH_CYC(mfspr(SPRN_SIER2));
>>> +		}
>>> +	}
>>>    }
>>>    int isa207_get_constraint(u64 event, unsigned long *maskp, unsigned long *valp, u64 event_config1)
>>> diff --git a/arch/powerpc/perf/isa207-common.h b/arch/powerpc/perf/isa207-common.h
>>> index 1af0e8c97ac7..fc30d43c4d0c 100644
>>> --- a/arch/powerpc/perf/isa207-common.h
>>> +++ b/arch/powerpc/perf/isa207-common.h
>>> @@ -265,6 +265,10 @@
>>>    #define ISA207_SIER_DATA_SRC_SHIFT	53
>>>    #define ISA207_SIER_DATA_SRC_MASK	(0x7ull << ISA207_SIER_DATA_SRC_SHIFT)
>>> +/* Bits in SIER2/SIER3 for Power10 */
>>> +#define P10_SIER2_FINISH_CYC(sier2)	(((sier2) >> (63 - 37)) & 0x7fful)
>>> +#define P10_SIER2_DISPATCH_CYC(sier2)	(((sier2) >> (63 - 13)) & 0x7fful)
>>> +
>>>    #define P(a, b)				PERF_MEM_S(a, b)
>>>    #define PH(a, b)			(P(LVL, HIT) | P(a, b))
>>>    #define PM(a, b)			(P(LVL, MISS) | P(a, b))
>>> @@ -278,6 +282,6 @@ int isa207_get_alternatives(u64 event, u64 alt[], int size, unsigned int flags,
>>>    					const unsigned int ev_alt[][MAX_ALT]);
>>>    void isa207_get_mem_data_src(union perf_mem_data_src *dsrc, u32 flags,
>>>    							struct pt_regs *regs);
>>> -void isa207_get_mem_weight(u64 *weight);
>>> +void isa207_get_mem_weight(u64 *weight, u64 type);
>>>    #endif
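
The weight handling in the quoted patch can be sketched in plain user-space C. This is only an illustration, not the kernel code: the two macros copy the shifts and 11-bit masks from the patch (with an `ull` suffix instead of `ul` so the sketch also builds on 32-bit hosts), while `sketch_mem_latency()` and `sketch_pack_weight_struct()` are hypothetical helper names. It assumes that var1_dw, var2_w and var3_w occupy bits 0..31, 32..47 and 48..63 of the 64-bit `full` field, and that the `63 - N` shifts follow the Power convention of numbering bits from the most significant end.

```c
#include <stdint.h>

/*
 * Same shifts and masks as the macros added to isa207-common.h.
 * Assumption: bit 0 is the most significant bit, so a field whose
 * last bit is bit N is right-aligned by shifting (63 - N).
 */
#define P10_SIER2_FINISH_CYC(sier2)	(((sier2) >> (63 - 37)) & 0x7ffull)
#define P10_SIER2_DISPATCH_CYC(sier2)	(((sier2) >> (63 - 13)) & 0x7ffull)

/* Hypothetical helper: decode the threshold counter as the patch does. */
static uint64_t sketch_mem_latency(uint64_t mantissa, uint64_t exp)
{
	return mantissa << (2 * exp);
}

/*
 * Hypothetical helper: build the 64-bit value for the WEIGHT_STRUCT
 * case with explicit shifts instead of the kernel union, so the
 * result does not depend on host endianness.
 */
static uint64_t sketch_pack_weight_struct(uint64_t latency, uint64_t sier2)
{
	uint64_t var1_dw = (uint32_t)latency;			/* bits 0..31: memory latency */
	uint64_t var2_w = P10_SIER2_FINISH_CYC(sier2);		/* bits 32..47: finish cycles */
	uint64_t var3_w = P10_SIER2_DISPATCH_CYC(sier2);	/* bits 48..63: dispatch cycles */

	return var1_dw | (var2_w << 32) | (var3_w << 48);
}
```

For the plain PERF_SAMPLE_WEIGHT case no packing is needed: the decoded latency is stored unmodified in the 64-bit field.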

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH V2 3/5] tools/perf: Add powerpc support for PERF_SAMPLE_WEIGHT_STRUCT
       [not found]     ` <80EE46ED-9007-4CB7-9A52-A7A2ADC616C6@linux.vnet.ibm.com>
@ 2021-03-26 15:50       ` Arnaldo
  0 siblings, 0 replies; 16+ messages in thread
From: Arnaldo @ 2021-03-26 15:50 UTC (permalink / raw)
  To: Athira Rajeev, Jiri Olsa
  Cc: linuxppc-dev, linux-kernel, linux-perf-users, mpe, acme, jolsa,
	Madhavan Srinivasan, ravi.bangoria, kjain, kan.liang, peterz



On March 26, 2021 12:23:04 PM GMT-03:00, Athira Rajeev <atrajeev@linux.vnet.ibm.com> wrote:
>
>
>On 25-Mar-2021, at 1:13 AM, Jiri Olsa <jolsa@redhat.com> wrote:
>
>On Mon, Mar 22, 2021 at 10:57:25AM -0400, Athira Rajeev wrote:
>
>Add arch specific arch_evsel__set_sample_weight() to set the new
>sample type for powerpc.
>
>Add arch specific arch_perf_parse_sample_weight() to store the
>sample->weight values depending on the sample type applied.
>If the new sample type (PERF_SAMPLE_WEIGHT_STRUCT) is applied,
>store only the lower 32 bits to sample->weight. If sample type
>is 'PERF_SAMPLE_WEIGHT', store the full 64-bit to sample->weight.
>
>Signed-off-by: Athira Rajeev <atrajeev@linux.vnet.ibm.com>
>---
>tools/perf/arch/powerpc/util/Build   |  2 ++
>tools/perf/arch/powerpc/util/event.c | 32 ++++++++++++++++++++++++++++++++
>tools/perf/arch/powerpc/util/evsel.c |  8 ++++++++
>3 files changed, 42 insertions(+)
>create mode 100644 tools/perf/arch/powerpc/util/event.c
>create mode 100644 tools/perf/arch/powerpc/util/evsel.c
>
>diff --git a/tools/perf/arch/powerpc/util/Build b/tools/perf/arch/powerpc/util/Build
>index b7945e5a543b..8a79c4126e5b 100644
>--- a/tools/perf/arch/powerpc/util/Build
>+++ b/tools/perf/arch/powerpc/util/Build
>@@ -4,6 +4,8 @@ perf-y += kvm-stat.o
>perf-y += perf_regs.o
>perf-y += mem-events.o
>perf-y += sym-handling.o
>+perf-y += evsel.o
>+perf-y += event.o
>
>perf-$(CONFIG_DWARF) += dwarf-regs.o
>perf-$(CONFIG_DWARF) += skip-callchain-idx.o
>diff --git a/tools/perf/arch/powerpc/util/event.c b/tools/perf/arch/powerpc/util/event.c
>new file mode 100644
>index 000000000000..f49d32c2c8ae
>--- /dev/null
>+++ b/tools/perf/arch/powerpc/util/event.c
>@@ -0,0 +1,32 @@
>+// SPDX-License-Identifier: GPL-2.0
>+#include <linux/types.h>
>+#include <linux/string.h>
>+#include <linux/zalloc.h>
>+
>+#include "../../../util/event.h"
>+#include "../../../util/synthetic-events.h"
>+#include "../../../util/machine.h"
>+#include "../../../util/tool.h"
>+#include "../../../util/map.h"
>+#include "../../../util/debug.h"
>
>
>nit, just #include "utils/...h" should work no?
>
>other than that, the patchset looks ok to me
>
>Acked-by: Jiri Olsa <jolsa@redhat.com>
>
>
>
>Hi Jiri, Arnaldo
>
>Thanks for reviewing the patch set.
>I checked that, just using "utils/...h" also works.
>Below is the change which I verified. Since the patches are presently
>merged in
>https://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git/log/?h=tmp.perf/core,
>can you please suggest how we can go about this change?

I'll fix it up here,

Thanks for the patch.

- Arnaldo

-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH V2 1/5] powerpc/perf: Expose processor pipeline stage cycles using PERF_SAMPLE_WEIGHT_STRUCT
  2021-03-25 16:42         ` Arnaldo
@ 2021-03-27 13:14           ` Michael Ellerman
  0 siblings, 0 replies; 16+ messages in thread
From: Michael Ellerman @ 2021-03-27 13:14 UTC (permalink / raw)
  To: Arnaldo, Peter Zijlstra, Arnaldo Carvalho de Melo
  Cc: Madhavan Srinivasan, Athira Rajeev, linuxppc-dev, linux-kernel,
	linux-perf-users, jolsa, ravi.bangoria, kjain, kan.liang

Arnaldo <arnaldo.melo@gmail.com> writes:
> On March 25, 2021 11:38:01 AM GMT-03:00, Peter Zijlstra <peterz@infradead.org> wrote:
>>On Thu, Mar 25, 2021 at 10:01:35AM -0300, Arnaldo Carvalho de Melo wrote:
>>> > > Also for CPU_FTR_ARCH_31, capture the two cycle counter information in
>>> > > two 16 bit fields of perf_sample_weight structure.
>>> > 
>>> > Changes look fine to me.
>>> > 
>>> > Reviewed-by: Madhavan Srinivasan <maddy@linux.ibm.com>
>>> 
>>> So who will process the kernel bits? I'm merging the tooling parts,
>>
>>I was sorta expecting these to go through the powerpc tree. Let me know
>>if you want them in tip/perf/core instead.
>
> Shouldn't matter by which tree it gets upstream, as long as it gets picked :-)

I plan to take them, just haven't got around to it yet :}

cheers

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH V2 0/5] powerpc/perf: Export processor pipeline stage cycles information
  2021-03-22 14:57 [PATCH V2 0/5] powerpc/perf: Export processor pipeline stage cycles information Athira Rajeev
                   ` (4 preceding siblings ...)
  2021-03-22 14:57 ` [PATCH V2 5/5] tools/perf: Display sort dimension p_stage_cyc only on supported archs Athira Rajeev
@ 2021-04-21 13:08 ` Michael Ellerman
  5 siblings, 0 replies; 16+ messages in thread
From: Michael Ellerman @ 2021-04-21 13:08 UTC (permalink / raw)
  To: Athira Rajeev, acme, linux-kernel, linuxppc-dev, jolsa, mpe,
	linux-perf-users
  Cc: ravi.bangoria, peterz, maddy, kjain, kan.liang

On Mon, 22 Mar 2021 10:57:22 -0400, Athira Rajeev wrote:
> Performance Monitoring Unit (PMU) registers in powerpc exports
> number of cycles elapsed between different stages in the pipeline.
> Example, sampling registers in ISA v3.1.
> 
> This patchset implements kernel and perf tools support to expose
> these pipeline stage cycles using the sample type PERF_SAMPLE_WEIGHT_TYPE.
> 
> [...]

Patch 1 applied to powerpc/next.

[1/5] powerpc/perf: Expose processor pipeline stage cycles using PERF_SAMPLE_WEIGHT_STRUCT
      https://git.kernel.org/powerpc/c/af31fd0c9107e400a8eb89d0eafb40bb78802f79

cheers

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2021-04-21 13:18 UTC | newest]

Thread overview: 16+ messages
2021-03-22 14:57 [PATCH V2 0/5] powerpc/perf: Export processor pipeline stage cycles information Athira Rajeev
2021-03-22 14:57 ` [PATCH V2 1/5] powerpc/perf: Expose processor pipeline stage cycles using PERF_SAMPLE_WEIGHT_STRUCT Athira Rajeev
2021-03-24  4:35   ` Madhavan Srinivasan
2021-03-25 13:01     ` Arnaldo Carvalho de Melo
2021-03-25 14:38       ` Peter Zijlstra
2021-03-25 16:42         ` Arnaldo
2021-03-27 13:14           ` Michael Ellerman
2021-03-25 13:06     ` Arnaldo Carvalho de Melo
2021-03-26  8:32       ` Madhavan Srinivasan
2021-03-22 14:57 ` [PATCH V2 2/5] tools/perf: Add dynamic headers for perf report columns Athira Rajeev
2021-03-22 14:57 ` [PATCH V2 3/5] tools/perf: Add powerpc support for PERF_SAMPLE_WEIGHT_STRUCT Athira Rajeev
2021-03-24 19:43   ` Jiri Olsa
     [not found]     ` <80EE46ED-9007-4CB7-9A52-A7A2ADC616C6@linux.vnet.ibm.com>
2021-03-26 15:50       ` Arnaldo
2021-03-22 14:57 ` [PATCH V2 4/5] tools/perf: Support pipeline stage cycles for powerpc Athira Rajeev
2021-03-22 14:57 ` [PATCH V2 5/5] tools/perf: Display sort dimension p_stage_cyc only on supported archs Athira Rajeev
2021-04-21 13:08 ` [PATCH V2 0/5] powerpc/perf: Export processor pipeline stage cycles information Michael Ellerman
