linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v4 0/4] perf: arm-spe: Decode SPE source and use for perf c2c
@ 2022-04-08 19:53 Ali Saidi
  2022-04-08 19:53 ` [PATCH v5 1/5] tools: arm64: Import cputype.h Ali Saidi
                   ` (4 more replies)
  0 siblings, 5 replies; 28+ messages in thread
From: Ali Saidi @ 2022-04-08 19:53 UTC (permalink / raw)
  To: linux-kernel, linux-perf-users, linux-arm-kernel, german.gomez,
	leo.yan, acme
  Cc: alisaidi, benh, Nick.Forrington, alexander.shishkin,
	andrew.kilroy, james.clark, john.garry, jolsa, kjain, lihuafei1,
	mark.rutland, mathieu.poirier, mingo, namhyung, peterz, will

When synthesizing data from SPE, augment the type with source information
for Arm Neoverse cores so we can detect situtions like cache line
contention and transfers on Arm platforms. 

This changes enables the expected behavior of perf c2c on a system with
SPE where lines that are shared among multiple cores show up in perf c2c
output. 

These changes switch to use mem_lvl_num to encode the level information
instead of mem_lvl which is being deprecated, but I haven't found other
users of mem_lvl_num. 

Changes in v5:
  * Add a new snooping type to disambiguate cache-to-cache transfers where
    we don't know if the data is clean or dirty.
  * Set snoop flags on all the data-source cases
  * Special case stores as we have no information on them

Changes in v4:
  * Bring-in the kernel's arch/arm64/include/asm/cputype.h into tools/ 
  * Add neoverse-v1 to the neoverse cores list

Ali Saidi (4):
  tools: arm64: Import cputype.h
  perf arm-spe: Use SPE data source for neoverse cores
  perf mem: Support mem_lvl_num in c2c command
  perf mem: Support HITM for when mem_lvl_num is any

 tools/arch/arm64/include/asm/cputype.h        | 258 ++++++++++++++++++
 .../util/arm-spe-decoder/arm-spe-decoder.c    |   1 +
 .../util/arm-spe-decoder/arm-spe-decoder.h    |  12 +
 tools/perf/util/arm-spe.c                     | 110 +++++++-
 tools/perf/util/mem-events.c                  |  20 +-
 5 files changed, 383 insertions(+), 18 deletions(-)
 create mode 100644 tools/arch/arm64/include/asm/cputype.h

-- 
2.32.0


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH v5 1/5] tools: arm64: Import cputype.h
  2022-04-08 19:53 [PATCH v4 0/4] perf: arm-spe: Decode SPE source and use for perf c2c Ali Saidi
@ 2022-04-08 19:53 ` Ali Saidi
  2022-04-18 14:57   ` Leo Yan
  2022-04-08 19:53 ` [PATCH v5 2/5] perf: Add SNOOP_PEER flag to perf mem data struct Ali Saidi
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 28+ messages in thread
From: Ali Saidi @ 2022-04-08 19:53 UTC (permalink / raw)
  To: linux-kernel, linux-perf-users, linux-arm-kernel, german.gomez,
	leo.yan, acme
  Cc: alisaidi, benh, Nick.Forrington, alexander.shishkin,
	andrew.kilroy, james.clark, john.garry, jolsa, kjain, lihuafei1,
	mark.rutland, mathieu.poirier, mingo, namhyung, peterz, will

Bring-in the kernel's arch/arm64/include/asm/cputype.h into tools/
for arm64 to make use of all the core-type definitions in perf.

Replace sysreg.h with the version already imported into tools/.

Signed-off-by: Ali Saidi <alisaidi@amazon.com>
---
 tools/arch/arm64/include/asm/cputype.h | 258 +++++++++++++++++++++++++
 1 file changed, 258 insertions(+)
 create mode 100644 tools/arch/arm64/include/asm/cputype.h

diff --git a/tools/arch/arm64/include/asm/cputype.h b/tools/arch/arm64/include/asm/cputype.h
new file mode 100644
index 000000000000..9afcc6467a09
--- /dev/null
+++ b/tools/arch/arm64/include/asm/cputype.h
@@ -0,0 +1,258 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (C) 2012 ARM Ltd.
+ */
+#ifndef __ASM_CPUTYPE_H
+#define __ASM_CPUTYPE_H
+
+#define INVALID_HWID		ULONG_MAX
+
+#define MPIDR_UP_BITMASK	(0x1 << 30)
+#define MPIDR_MT_BITMASK	(0x1 << 24)
+#define MPIDR_HWID_BITMASK	UL(0xff00ffffff)
+
+#define MPIDR_LEVEL_BITS_SHIFT	3
+#define MPIDR_LEVEL_BITS	(1 << MPIDR_LEVEL_BITS_SHIFT)
+#define MPIDR_LEVEL_MASK	((1 << MPIDR_LEVEL_BITS) - 1)
+
+#define MPIDR_LEVEL_SHIFT(level) \
+	(((1 << level) >> 1) << MPIDR_LEVEL_BITS_SHIFT)
+
+#define MPIDR_AFFINITY_LEVEL(mpidr, level) \
+	((mpidr >> MPIDR_LEVEL_SHIFT(level)) & MPIDR_LEVEL_MASK)
+
+#define MIDR_REVISION_MASK	0xf
+#define MIDR_REVISION(midr)	((midr) & MIDR_REVISION_MASK)
+#define MIDR_PARTNUM_SHIFT	4
+#define MIDR_PARTNUM_MASK	(0xfff << MIDR_PARTNUM_SHIFT)
+#define MIDR_PARTNUM(midr)	\
+	(((midr) & MIDR_PARTNUM_MASK) >> MIDR_PARTNUM_SHIFT)
+#define MIDR_ARCHITECTURE_SHIFT	16
+#define MIDR_ARCHITECTURE_MASK	(0xf << MIDR_ARCHITECTURE_SHIFT)
+#define MIDR_ARCHITECTURE(midr)	\
+	(((midr) & MIDR_ARCHITECTURE_MASK) >> MIDR_ARCHITECTURE_SHIFT)
+#define MIDR_VARIANT_SHIFT	20
+#define MIDR_VARIANT_MASK	(0xf << MIDR_VARIANT_SHIFT)
+#define MIDR_VARIANT(midr)	\
+	(((midr) & MIDR_VARIANT_MASK) >> MIDR_VARIANT_SHIFT)
+#define MIDR_IMPLEMENTOR_SHIFT	24
+#define MIDR_IMPLEMENTOR_MASK	(0xff << MIDR_IMPLEMENTOR_SHIFT)
+#define MIDR_IMPLEMENTOR(midr)	\
+	(((midr) & MIDR_IMPLEMENTOR_MASK) >> MIDR_IMPLEMENTOR_SHIFT)
+
+#define MIDR_CPU_MODEL(imp, partnum) \
+	(((imp)			<< MIDR_IMPLEMENTOR_SHIFT) | \
+	(0xf			<< MIDR_ARCHITECTURE_SHIFT) | \
+	((partnum)		<< MIDR_PARTNUM_SHIFT))
+
+#define MIDR_CPU_VAR_REV(var, rev) \
+	(((var)	<< MIDR_VARIANT_SHIFT) | (rev))
+
+#define MIDR_CPU_MODEL_MASK (MIDR_IMPLEMENTOR_MASK | MIDR_PARTNUM_MASK | \
+			     MIDR_ARCHITECTURE_MASK)
+
+#define ARM_CPU_IMP_ARM			0x41
+#define ARM_CPU_IMP_APM			0x50
+#define ARM_CPU_IMP_CAVIUM		0x43
+#define ARM_CPU_IMP_BRCM		0x42
+#define ARM_CPU_IMP_QCOM		0x51
+#define ARM_CPU_IMP_NVIDIA		0x4E
+#define ARM_CPU_IMP_FUJITSU		0x46
+#define ARM_CPU_IMP_HISI		0x48
+#define ARM_CPU_IMP_APPLE		0x61
+
+#define ARM_CPU_PART_AEM_V8		0xD0F
+#define ARM_CPU_PART_FOUNDATION		0xD00
+#define ARM_CPU_PART_CORTEX_A57		0xD07
+#define ARM_CPU_PART_CORTEX_A72		0xD08
+#define ARM_CPU_PART_CORTEX_A53		0xD03
+#define ARM_CPU_PART_CORTEX_A73		0xD09
+#define ARM_CPU_PART_CORTEX_A75		0xD0A
+#define ARM_CPU_PART_CORTEX_A35		0xD04
+#define ARM_CPU_PART_CORTEX_A55		0xD05
+#define ARM_CPU_PART_CORTEX_A76		0xD0B
+#define ARM_CPU_PART_NEOVERSE_N1	0xD0C
+#define ARM_CPU_PART_CORTEX_A77		0xD0D
+#define ARM_CPU_PART_NEOVERSE_V1	0xD40
+#define ARM_CPU_PART_CORTEX_A78		0xD41
+#define ARM_CPU_PART_CORTEX_X1		0xD44
+#define ARM_CPU_PART_CORTEX_A510	0xD46
+#define ARM_CPU_PART_CORTEX_A710	0xD47
+#define ARM_CPU_PART_CORTEX_X2		0xD48
+#define ARM_CPU_PART_NEOVERSE_N2	0xD49
+#define ARM_CPU_PART_CORTEX_A78C	0xD4B
+
+#define APM_CPU_PART_POTENZA		0x000
+
+#define CAVIUM_CPU_PART_THUNDERX	0x0A1
+#define CAVIUM_CPU_PART_THUNDERX_81XX	0x0A2
+#define CAVIUM_CPU_PART_THUNDERX_83XX	0x0A3
+#define CAVIUM_CPU_PART_THUNDERX2	0x0AF
+/* OcteonTx2 series */
+#define CAVIUM_CPU_PART_OCTX2_98XX	0x0B1
+#define CAVIUM_CPU_PART_OCTX2_96XX	0x0B2
+#define CAVIUM_CPU_PART_OCTX2_95XX	0x0B3
+#define CAVIUM_CPU_PART_OCTX2_95XXN	0x0B4
+#define CAVIUM_CPU_PART_OCTX2_95XXMM	0x0B5
+#define CAVIUM_CPU_PART_OCTX2_95XXO	0x0B6
+
+#define BRCM_CPU_PART_BRAHMA_B53	0x100
+#define BRCM_CPU_PART_VULCAN		0x516
+
+#define QCOM_CPU_PART_FALKOR_V1		0x800
+#define QCOM_CPU_PART_FALKOR		0xC00
+#define QCOM_CPU_PART_KRYO		0x200
+#define QCOM_CPU_PART_KRYO_2XX_GOLD	0x800
+#define QCOM_CPU_PART_KRYO_2XX_SILVER	0x801
+#define QCOM_CPU_PART_KRYO_3XX_SILVER	0x803
+#define QCOM_CPU_PART_KRYO_4XX_GOLD	0x804
+#define QCOM_CPU_PART_KRYO_4XX_SILVER	0x805
+
+#define NVIDIA_CPU_PART_DENVER		0x003
+#define NVIDIA_CPU_PART_CARMEL		0x004
+
+#define FUJITSU_CPU_PART_A64FX		0x001
+
+#define HISI_CPU_PART_TSV110		0xD01
+
+#define APPLE_CPU_PART_M1_ICESTORM	0x022
+#define APPLE_CPU_PART_M1_FIRESTORM	0x023
+
+#define MIDR_CORTEX_A53 MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_A53)
+#define MIDR_CORTEX_A57 MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_A57)
+#define MIDR_CORTEX_A72 MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_A72)
+#define MIDR_CORTEX_A73 MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_A73)
+#define MIDR_CORTEX_A75 MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_A75)
+#define MIDR_CORTEX_A35 MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_A35)
+#define MIDR_CORTEX_A55 MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_A55)
+#define MIDR_CORTEX_A76	MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_A76)
+#define MIDR_NEOVERSE_N1 MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_NEOVERSE_N1)
+#define MIDR_CORTEX_A77	MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_A77)
+#define MIDR_NEOVERSE_V1	MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_NEOVERSE_V1)
+#define MIDR_CORTEX_A78	MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_A78)
+#define MIDR_CORTEX_X1	MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_X1)
+#define MIDR_CORTEX_A510 MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_A510)
+#define MIDR_CORTEX_A710 MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_A710)
+#define MIDR_CORTEX_X2 MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_X2)
+#define MIDR_NEOVERSE_N2 MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_NEOVERSE_N2)
+#define MIDR_CORTEX_A78C	MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_A78C)
+#define MIDR_THUNDERX	MIDR_CPU_MODEL(ARM_CPU_IMP_CAVIUM, CAVIUM_CPU_PART_THUNDERX)
+#define MIDR_THUNDERX_81XX MIDR_CPU_MODEL(ARM_CPU_IMP_CAVIUM, CAVIUM_CPU_PART_THUNDERX_81XX)
+#define MIDR_THUNDERX_83XX MIDR_CPU_MODEL(ARM_CPU_IMP_CAVIUM, CAVIUM_CPU_PART_THUNDERX_83XX)
+#define MIDR_OCTX2_98XX MIDR_CPU_MODEL(ARM_CPU_IMP_CAVIUM, CAVIUM_CPU_PART_OCTX2_98XX)
+#define MIDR_OCTX2_96XX MIDR_CPU_MODEL(ARM_CPU_IMP_CAVIUM, CAVIUM_CPU_PART_OCTX2_96XX)
+#define MIDR_OCTX2_95XX MIDR_CPU_MODEL(ARM_CPU_IMP_CAVIUM, CAVIUM_CPU_PART_OCTX2_95XX)
+#define MIDR_OCTX2_95XXN MIDR_CPU_MODEL(ARM_CPU_IMP_CAVIUM, CAVIUM_CPU_PART_OCTX2_95XXN)
+#define MIDR_OCTX2_95XXMM MIDR_CPU_MODEL(ARM_CPU_IMP_CAVIUM, CAVIUM_CPU_PART_OCTX2_95XXMM)
+#define MIDR_OCTX2_95XXO MIDR_CPU_MODEL(ARM_CPU_IMP_CAVIUM, CAVIUM_CPU_PART_OCTX2_95XXO)
+#define MIDR_CAVIUM_THUNDERX2 MIDR_CPU_MODEL(ARM_CPU_IMP_CAVIUM, CAVIUM_CPU_PART_THUNDERX2)
+#define MIDR_BRAHMA_B53 MIDR_CPU_MODEL(ARM_CPU_IMP_BRCM, BRCM_CPU_PART_BRAHMA_B53)
+#define MIDR_BRCM_VULCAN MIDR_CPU_MODEL(ARM_CPU_IMP_BRCM, BRCM_CPU_PART_VULCAN)
+#define MIDR_QCOM_FALKOR_V1 MIDR_CPU_MODEL(ARM_CPU_IMP_QCOM, QCOM_CPU_PART_FALKOR_V1)
+#define MIDR_QCOM_FALKOR MIDR_CPU_MODEL(ARM_CPU_IMP_QCOM, QCOM_CPU_PART_FALKOR)
+#define MIDR_QCOM_KRYO MIDR_CPU_MODEL(ARM_CPU_IMP_QCOM, QCOM_CPU_PART_KRYO)
+#define MIDR_QCOM_KRYO_2XX_GOLD MIDR_CPU_MODEL(ARM_CPU_IMP_QCOM, QCOM_CPU_PART_KRYO_2XX_GOLD)
+#define MIDR_QCOM_KRYO_2XX_SILVER MIDR_CPU_MODEL(ARM_CPU_IMP_QCOM, QCOM_CPU_PART_KRYO_2XX_SILVER)
+#define MIDR_QCOM_KRYO_3XX_SILVER MIDR_CPU_MODEL(ARM_CPU_IMP_QCOM, QCOM_CPU_PART_KRYO_3XX_SILVER)
+#define MIDR_QCOM_KRYO_4XX_GOLD MIDR_CPU_MODEL(ARM_CPU_IMP_QCOM, QCOM_CPU_PART_KRYO_4XX_GOLD)
+#define MIDR_QCOM_KRYO_4XX_SILVER MIDR_CPU_MODEL(ARM_CPU_IMP_QCOM, QCOM_CPU_PART_KRYO_4XX_SILVER)
+#define MIDR_NVIDIA_DENVER MIDR_CPU_MODEL(ARM_CPU_IMP_NVIDIA, NVIDIA_CPU_PART_DENVER)
+#define MIDR_NVIDIA_CARMEL MIDR_CPU_MODEL(ARM_CPU_IMP_NVIDIA, NVIDIA_CPU_PART_CARMEL)
+#define MIDR_FUJITSU_A64FX MIDR_CPU_MODEL(ARM_CPU_IMP_FUJITSU, FUJITSU_CPU_PART_A64FX)
+#define MIDR_HISI_TSV110 MIDR_CPU_MODEL(ARM_CPU_IMP_HISI, HISI_CPU_PART_TSV110)
+#define MIDR_APPLE_M1_ICESTORM MIDR_CPU_MODEL(ARM_CPU_IMP_APPLE, APPLE_CPU_PART_M1_ICESTORM)
+#define MIDR_APPLE_M1_FIRESTORM MIDR_CPU_MODEL(ARM_CPU_IMP_APPLE, APPLE_CPU_PART_M1_FIRESTORM)
+
+/* Fujitsu Erratum 010001 affects A64FX 1.0 and 1.1, (v0r0 and v1r0) */
+#define MIDR_FUJITSU_ERRATUM_010001		MIDR_FUJITSU_A64FX
+#define MIDR_FUJITSU_ERRATUM_010001_MASK	(~MIDR_CPU_VAR_REV(1, 0))
+#define TCR_CLEAR_FUJITSU_ERRATUM_010001	(TCR_NFD1 | TCR_NFD0)
+
+#ifndef __ASSEMBLY__
+
+#include "sysreg.h"
+
+#define read_cpuid(reg)			read_sysreg_s(SYS_ ## reg)
+
+/*
+ * Represent a range of MIDR values for a given CPU model and a
+ * range of variant/revision values.
+ *
+ * @model	- CPU model as defined by MIDR_CPU_MODEL
+ * @rv_min	- Minimum value for the revision/variant as defined by
+ *		  MIDR_CPU_VAR_REV
+ * @rv_max	- Maximum value for the variant/revision for the range.
+ */
+struct midr_range {
+	u32 model;
+	u32 rv_min;
+	u32 rv_max;
+};
+
+#define MIDR_RANGE(m, v_min, r_min, v_max, r_max)		\
+	{							\
+		.model = m,					\
+		.rv_min = MIDR_CPU_VAR_REV(v_min, r_min),	\
+		.rv_max = MIDR_CPU_VAR_REV(v_max, r_max),	\
+	}
+
+#define MIDR_REV_RANGE(m, v, r_min, r_max) MIDR_RANGE(m, v, r_min, v, r_max)
+#define MIDR_REV(m, v, r) MIDR_RANGE(m, v, r, v, r)
+#define MIDR_ALL_VERSIONS(m) MIDR_RANGE(m, 0, 0, 0xf, 0xf)
+
+static inline bool midr_is_cpu_model_range(u32 midr, u32 model, u32 rv_min,
+					   u32 rv_max)
+{
+	u32 _model = midr & MIDR_CPU_MODEL_MASK;
+	u32 rv = midr & (MIDR_REVISION_MASK | MIDR_VARIANT_MASK);
+
+	return _model == model && rv >= rv_min && rv <= rv_max;
+}
+
+static inline bool is_midr_in_range(u32 midr, struct midr_range const *range)
+{
+	return midr_is_cpu_model_range(midr, range->model,
+				       range->rv_min, range->rv_max);
+}
+
+static inline bool
+is_midr_in_range_list(u32 midr, struct midr_range const *ranges)
+{
+	while (ranges->model)
+		if (is_midr_in_range(midr, ranges++))
+			return true;
+	return false;
+}
+
+/*
+ * The CPU ID never changes at run time, so we might as well tell the
+ * compiler that it's constant.  Use this function to read the CPU ID
+ * rather than directly reading processor_id or read_cpuid() directly.
+ */
+static inline u32 __attribute_const__ read_cpuid_id(void)
+{
+	return read_cpuid(MIDR_EL1);
+}
+
+static inline u64 __attribute_const__ read_cpuid_mpidr(void)
+{
+	return read_cpuid(MPIDR_EL1);
+}
+
+static inline unsigned int __attribute_const__ read_cpuid_implementor(void)
+{
+	return MIDR_IMPLEMENTOR(read_cpuid_id());
+}
+
+static inline unsigned int __attribute_const__ read_cpuid_part_number(void)
+{
+	return MIDR_PARTNUM(read_cpuid_id());
+}
+
+static inline u32 __attribute_const__ read_cpuid_cachetype(void)
+{
+	return read_cpuid(CTR_EL0);
+}
+#endif /* __ASSEMBLY__ */
+
+#endif
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v5 2/5] perf: Add SNOOP_PEER flag to perf mem data struct
  2022-04-08 19:53 [PATCH v4 0/4] perf: arm-spe: Decode SPE source and use for perf c2c Ali Saidi
  2022-04-08 19:53 ` [PATCH v5 1/5] tools: arm64: Import cputype.h Ali Saidi
@ 2022-04-08 19:53 ` Ali Saidi
  2022-04-20  8:20   ` Leo Yan
  2022-04-20 18:43   ` Liang, Kan
  2022-04-08 19:53 ` [PATCH v5 3/5] perf tools: sync addition of PERF_MEM_SNOOPX_PEER Ali Saidi
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 28+ messages in thread
From: Ali Saidi @ 2022-04-08 19:53 UTC (permalink / raw)
  To: linux-kernel, linux-perf-users, linux-arm-kernel, german.gomez,
	leo.yan, acme
  Cc: alisaidi, benh, Nick.Forrington, alexander.shishkin,
	andrew.kilroy, james.clark, john.garry, jolsa, kjain, lihuafei1,
	mark.rutland, mathieu.poirier, mingo, namhyung, peterz, will

Add a flag to the perf mem data struct to signal that a request caused a
cache-to-cache transfer of a line from a peer of the requestor and
wasn't sourced from a lower cache level.  The line being moved from one
peer cache to another has latency and performance implications. On Arm64
Neoverse systems the data source can indicate a cache-to-cache transfer
but not if the line is dirty or clean, so instead of overloading HITM
define a new flag that indicates this type of transfer.

Signed-off-by: Ali Saidi <alisaidi@amazon.com>
---
 include/uapi/linux/perf_event.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 82858b697c05..c9e58c79f3e5 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -1308,7 +1308,7 @@ union perf_mem_data_src {
 #define PERF_MEM_SNOOP_SHIFT	19
 
 #define PERF_MEM_SNOOPX_FWD	0x01 /* forward */
-/* 1 free */
+#define PERF_MEM_SNOOPX_PEER	0x02 /* xfer from peer */
 #define PERF_MEM_SNOOPX_SHIFT  38
 
 /* locked instruction */
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v5 3/5] perf tools: sync addition of PERF_MEM_SNOOPX_PEER
  2022-04-08 19:53 [PATCH v4 0/4] perf: arm-spe: Decode SPE source and use for perf c2c Ali Saidi
  2022-04-08 19:53 ` [PATCH v5 1/5] tools: arm64: Import cputype.h Ali Saidi
  2022-04-08 19:53 ` [PATCH v5 2/5] perf: Add SNOOP_PEER flag to perf mem data struct Ali Saidi
@ 2022-04-08 19:53 ` Ali Saidi
  2022-04-11 10:26   ` German Gomez
  2022-04-08 19:53 ` [PATCH v5 4/5] perf arm-spe: Use SPE data source for neoverse cores Ali Saidi
  2022-04-08 19:53 ` [PATCH v5 5/5] perf mem: Support mem_lvl_num in c2c command Ali Saidi
  4 siblings, 1 reply; 28+ messages in thread
From: Ali Saidi @ 2022-04-08 19:53 UTC (permalink / raw)
  To: linux-kernel, linux-perf-users, linux-arm-kernel, german.gomez,
	leo.yan, acme
  Cc: alisaidi, benh, Nick.Forrington, alexander.shishkin,
	andrew.kilroy, james.clark, john.garry, jolsa, kjain, lihuafei1,
	mark.rutland, mathieu.poirier, mingo, namhyung, peterz, will

Add a flag to the perf mem data struct to signal that a request caused a
cache-to-cache transfer of a line from a peer of the requestor and
wasn't sourced from a lower cache level.  The line being moved from one
peer cache to another has latency and performance implications. On Arm64
Neoverse systems the data source can indicate a cache-to-cache transfer
but not if the line is dirty or clean, so instead of overloading HITM
define a new flag that indicates this type of transfer.

Signed-off-by: Ali Saidi <alisaidi@amazon.com>
---
 tools/include/uapi/linux/perf_event.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/include/uapi/linux/perf_event.h b/tools/include/uapi/linux/perf_event.h
index 82858b697c05..c9e58c79f3e5 100644
--- a/tools/include/uapi/linux/perf_event.h
+++ b/tools/include/uapi/linux/perf_event.h
@@ -1308,7 +1308,7 @@ union perf_mem_data_src {
 #define PERF_MEM_SNOOP_SHIFT	19
 
 #define PERF_MEM_SNOOPX_FWD	0x01 /* forward */
-/* 1 free */
+#define PERF_MEM_SNOOPX_PEER	0x02 /* xfer from peer */
 #define PERF_MEM_SNOOPX_SHIFT  38
 
 /* locked instruction */
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v5 4/5] perf arm-spe: Use SPE data source for neoverse cores
  2022-04-08 19:53 [PATCH v4 0/4] perf: arm-spe: Decode SPE source and use for perf c2c Ali Saidi
                   ` (2 preceding siblings ...)
  2022-04-08 19:53 ` [PATCH v5 3/5] perf tools: sync addition of PERF_MEM_SNOOPX_PEER Ali Saidi
@ 2022-04-08 19:53 ` Ali Saidi
  2022-04-20  8:30   ` Leo Yan
  2022-04-20  8:42   ` Leo Yan
  2022-04-08 19:53 ` [PATCH v5 5/5] perf mem: Support mem_lvl_num in c2c command Ali Saidi
  4 siblings, 2 replies; 28+ messages in thread
From: Ali Saidi @ 2022-04-08 19:53 UTC (permalink / raw)
  To: linux-kernel, linux-perf-users, linux-arm-kernel, german.gomez,
	leo.yan, acme
  Cc: alisaidi, benh, Nick.Forrington, alexander.shishkin,
	andrew.kilroy, james.clark, john.garry, jolsa, kjain, lihuafei1,
	mark.rutland, mathieu.poirier, mingo, namhyung, peterz, will

When synthesizing data from SPE, augment the type with source information
for Arm Neoverse cores. The field is IMPLDEF but the Neoverse cores all use
the same encoding. I can't find encoding information for any other SPE
implementations to unify their choices with Arm's thus that is left for
future work.

This change populates the mem_lvl_num for Neoverse cores instead of the
deprecated mem_lvl namespace.

Signed-off-by: Ali Saidi <alisaidi@amazon.com>
---
 .../util/arm-spe-decoder/arm-spe-decoder.c    |   1 +
 .../util/arm-spe-decoder/arm-spe-decoder.h    |  12 ++
 tools/perf/util/arm-spe.c                     | 127 ++++++++++++++++--
 3 files changed, 126 insertions(+), 14 deletions(-)

diff --git a/tools/perf/util/arm-spe-decoder/arm-spe-decoder.c b/tools/perf/util/arm-spe-decoder/arm-spe-decoder.c
index 5e390a1a79ab..091987dd3966 100644
--- a/tools/perf/util/arm-spe-decoder/arm-spe-decoder.c
+++ b/tools/perf/util/arm-spe-decoder/arm-spe-decoder.c
@@ -220,6 +220,7 @@ static int arm_spe_read_record(struct arm_spe_decoder *decoder)
 
 			break;
 		case ARM_SPE_DATA_SOURCE:
+			decoder->record.source = payload;
 			break;
 		case ARM_SPE_BAD:
 			break;
diff --git a/tools/perf/util/arm-spe-decoder/arm-spe-decoder.h b/tools/perf/util/arm-spe-decoder/arm-spe-decoder.h
index 69b31084d6be..46a61df1145b 100644
--- a/tools/perf/util/arm-spe-decoder/arm-spe-decoder.h
+++ b/tools/perf/util/arm-spe-decoder/arm-spe-decoder.h
@@ -29,6 +29,17 @@ enum arm_spe_op_type {
 	ARM_SPE_ST		= 1 << 1,
 };
 
+enum arm_spe_neoverse_data_source {
+	ARM_SPE_NV_L1D		 = 0x0,
+	ARM_SPE_NV_L2		 = 0x8,
+	ARM_SPE_NV_PEER_CORE	 = 0x9,
+	ARM_SPE_NV_LOCAL_CLUSTER = 0xa,
+	ARM_SPE_NV_SYS_CACHE	 = 0xb,
+	ARM_SPE_NV_PEER_CLUSTER	 = 0xc,
+	ARM_SPE_NV_REMOTE	 = 0xd,
+	ARM_SPE_NV_DRAM		 = 0xe,
+};
+
 struct arm_spe_record {
 	enum arm_spe_sample_type type;
 	int err;
@@ -40,6 +51,7 @@ struct arm_spe_record {
 	u64 virt_addr;
 	u64 phys_addr;
 	u64 context_id;
+	u16 source;
 };
 
 struct arm_spe_insn;
diff --git a/tools/perf/util/arm-spe.c b/tools/perf/util/arm-spe.c
index d2b64e3f588b..a20285cf98e3 100644
--- a/tools/perf/util/arm-spe.c
+++ b/tools/perf/util/arm-spe.c
@@ -34,6 +34,7 @@
 #include "arm-spe-decoder/arm-spe-decoder.h"
 #include "arm-spe-decoder/arm-spe-pkt-decoder.h"
 
+#include "../../arch/arm64/include/asm/cputype.h"
 #define MAX_TIMESTAMP (~0ULL)
 
 struct arm_spe {
@@ -45,6 +46,7 @@ struct arm_spe {
 	struct perf_session		*session;
 	struct machine			*machine;
 	u32				pmu_type;
+	u64				midr;
 
 	struct perf_tsc_conversion	tc;
 
@@ -399,33 +401,127 @@ static bool arm_spe__is_memory_event(enum arm_spe_sample_type type)
 	return false;
 }
 
-static u64 arm_spe__synth_data_source(const struct arm_spe_record *record)
+static const struct midr_range neoverse_spe[] = {
+	MIDR_ALL_VERSIONS(MIDR_NEOVERSE_N1),
+	MIDR_ALL_VERSIONS(MIDR_NEOVERSE_N2),
+	MIDR_ALL_VERSIONS(MIDR_NEOVERSE_V1),
+	{},
+};
+
+
+static void arm_spe__synth_data_source_neoverse(const struct arm_spe_record *record,
+						union perf_mem_data_src *data_src)
 {
-	union perf_mem_data_src	data_src = { 0 };
+	/*
+	 * Even though four levels of cache hierarchy are possible, no known
+	 * production Neoverse systems currently include more than three levels
+	 * so for the time being we assume three exist. If a production system
+	 * is built with four the this function would have to be changed to
+	 * detect the number of levels for reporting.
+	 */
 
-	if (record->op == ARM_SPE_LD)
-		data_src.mem_op = PERF_MEM_OP_LOAD;
-	else
-		data_src.mem_op = PERF_MEM_OP_STORE;
+	/*
+	 * We have no data on the hit level or data source for stores in the
+	 * Neoverse SPE records.
+	 */
+	if (record->op & ARM_SPE_ST) {
+		data_src->mem_lvl = PERF_MEM_LVL_NA;
+		data_src->mem_lvl_num = PERF_MEM_LVLNUM_ANY_CACHE;
+		data_src->mem_snoop = PERF_MEM_SNOOP_NA;
+		return;
+	}
+
+
+	switch (record->source) {
+	case ARM_SPE_NV_L1D:
+		data_src->mem_lvl = PERF_MEM_LVL_HIT;
+		data_src->mem_lvl_num = PERF_MEM_LVLNUM_L1;
+		data_src->mem_snoop = PERF_MEM_SNOOP_NONE;
+		break;
+	case ARM_SPE_NV_L2:
+		data_src->mem_lvl = PERF_MEM_LVL_HIT;
+		data_src->mem_lvl_num = PERF_MEM_LVLNUM_L2;
+		data_src->mem_snoop = PERF_MEM_SNOOP_NONE;
+		break;
+	case ARM_SPE_NV_PEER_CORE:
+		data_src->mem_lvl = PERF_MEM_LVL_HIT;
+		data_src->mem_snoopx = PERF_MEM_SNOOPX_PEER;
+		data_src->mem_lvl_num = PERF_MEM_LVLNUM_L2;
+		break;
+	/*
+	 * We don't know if this is L1, L2 but we do know it was a cache-2-cache
+	 * transfer, so set SNOOPX_PEER
+	 */
+	case ARM_SPE_NV_LOCAL_CLUSTER:
+	case ARM_SPE_NV_PEER_CLUSTER:
+		data_src->mem_lvl = PERF_MEM_LVL_HIT;
+		data_src->mem_snoopx = PERF_MEM_SNOOPX_PEER;
+		data_src->mem_lvl_num = PERF_MEM_LVLNUM_L3;
+		break;
+	/*
+	 * System cache is assumed to be L3
+	 */
+	case ARM_SPE_NV_SYS_CACHE:
+		data_src->mem_lvl = PERF_MEM_LVL_HIT;
+		data_src->mem_lvl_num = PERF_MEM_LVLNUM_L3;
+		data_src->mem_snoop = PERF_MEM_SNOOP_HIT;
+		break;
+	/*
+	 * We don't know what level it hit in, except it came from the other
+	 * socket
+	 */
+	case ARM_SPE_NV_REMOTE:
+		data_src->mem_snoop = PERF_MEM_SNOOP_HIT;
+		data_src->mem_remote = PERF_MEM_REMOTE_REMOTE;
+		data_src->mem_snoop = PERF_MEM_SNOOP_NA;
+		break;
+	case ARM_SPE_NV_DRAM:
+		data_src->mem_lvl = PERF_MEM_LVL_HIT;
+		data_src->mem_lvl_num = PERF_MEM_LVLNUM_RAM;
+		data_src->mem_snoop = PERF_MEM_SNOOP_NONE;
+		break;
+	default:
+		break;
+	}
+}
 
+static void arm_spe__synth_data_source_generic(const struct arm_spe_record *record,
+						union perf_mem_data_src *data_src)
+{
 	if (record->type & (ARM_SPE_LLC_ACCESS | ARM_SPE_LLC_MISS)) {
-		data_src.mem_lvl = PERF_MEM_LVL_L3;
+		data_src->mem_lvl = PERF_MEM_LVL_L3;
 
 		if (record->type & ARM_SPE_LLC_MISS)
-			data_src.mem_lvl |= PERF_MEM_LVL_MISS;
+			data_src->mem_lvl |= PERF_MEM_LVL_MISS;
 		else
-			data_src.mem_lvl |= PERF_MEM_LVL_HIT;
+			data_src->mem_lvl |= PERF_MEM_LVL_HIT;
 	} else if (record->type & (ARM_SPE_L1D_ACCESS | ARM_SPE_L1D_MISS)) {
-		data_src.mem_lvl = PERF_MEM_LVL_L1;
+		data_src->mem_lvl = PERF_MEM_LVL_L1;
 
 		if (record->type & ARM_SPE_L1D_MISS)
-			data_src.mem_lvl |= PERF_MEM_LVL_MISS;
+			data_src->mem_lvl |= PERF_MEM_LVL_MISS;
 		else
-			data_src.mem_lvl |= PERF_MEM_LVL_HIT;
+			data_src->mem_lvl |= PERF_MEM_LVL_HIT;
 	}
 
 	if (record->type & ARM_SPE_REMOTE_ACCESS)
-		data_src.mem_lvl |= PERF_MEM_LVL_REM_CCE1;
+		data_src->mem_lvl |= PERF_MEM_LVL_REM_CCE1;
+}
+
+static u64 arm_spe__synth_data_source(const struct arm_spe_record *record, u64 midr)
+{
+	union perf_mem_data_src	data_src = { 0 };
+	bool is_neoverse = is_midr_in_range(midr, neoverse_spe);
+
+	if (record->op & ARM_SPE_LD)
+		data_src.mem_op = PERF_MEM_OP_LOAD;
+	else
+		data_src.mem_op = PERF_MEM_OP_STORE;
+
+	if (is_neoverse)
+		arm_spe__synth_data_source_neoverse(record, &data_src);
+	else
+		arm_spe__synth_data_source_generic(record, &data_src);
 
 	if (record->type & (ARM_SPE_TLB_ACCESS | ARM_SPE_TLB_MISS)) {
 		data_src.mem_dtlb = PERF_MEM_TLB_WK;
@@ -446,7 +542,7 @@ static int arm_spe_sample(struct arm_spe_queue *speq)
 	u64 data_src;
 	int err;
 
-	data_src = arm_spe__synth_data_source(record);
+	data_src = arm_spe__synth_data_source(record, spe->midr);
 
 	if (spe->sample_flc) {
 		if (record->type & ARM_SPE_L1D_MISS) {
@@ -1183,6 +1279,8 @@ int arm_spe_process_auxtrace_info(union perf_event *event,
 	struct perf_record_auxtrace_info *auxtrace_info = &event->auxtrace_info;
 	size_t min_sz = sizeof(u64) * ARM_SPE_AUXTRACE_PRIV_MAX;
 	struct perf_record_time_conv *tc = &session->time_conv;
+	const char *cpuid = perf_env__cpuid(session->evlist->env);
+	u64 midr = strtol(cpuid, NULL, 16);
 	struct arm_spe *spe;
 	int err;
 
@@ -1202,6 +1300,7 @@ int arm_spe_process_auxtrace_info(union perf_event *event,
 	spe->machine = &session->machines.host; /* No kvm support */
 	spe->auxtrace_type = auxtrace_info->type;
 	spe->pmu_type = auxtrace_info->priv[ARM_SPE_PMU_TYPE];
+	spe->midr = midr;
 
 	spe->timeless_decoding = arm_spe__is_timeless_decoding(spe);
 
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v5 5/5] perf mem: Support mem_lvl_num in c2c command
  2022-04-08 19:53 [PATCH v4 0/4] perf: arm-spe: Decode SPE source and use for perf c2c Ali Saidi
                   ` (3 preceding siblings ...)
  2022-04-08 19:53 ` [PATCH v5 4/5] perf arm-spe: Use SPE data source for neoverse cores Ali Saidi
@ 2022-04-08 19:53 ` Ali Saidi
  2022-04-11 10:04   ` German Gomez
  2022-04-20 19:02   ` Liang, Kan
  4 siblings, 2 replies; 28+ messages in thread
From: Ali Saidi @ 2022-04-08 19:53 UTC (permalink / raw)
  To: linux-kernel, linux-perf-users, linux-arm-kernel, german.gomez,
	leo.yan, acme
  Cc: alisaidi, benh, Nick.Forrington, alexander.shishkin,
	andrew.kilroy, james.clark, john.garry, jolsa, kjain, lihuafei1,
	mark.rutland, mathieu.poirier, mingo, namhyung, peterz, will

In addition to summarizing data encoded in mem_lvl also support data
encoded in mem_lvl_num.

Since other architectures don't seem to populate the mem_lvl_num field
here there shouldn't be a change in functionality.

Signed-off-by: Ali Saidi <alisaidi@amazon.com>
---
 tools/perf/util/mem-events.c | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/tools/perf/util/mem-events.c b/tools/perf/util/mem-events.c
index ed0ab838bcc5..e5e405185498 100644
--- a/tools/perf/util/mem-events.c
+++ b/tools/perf/util/mem-events.c
@@ -485,6 +485,7 @@ int c2c_decode_stats(struct c2c_stats *stats, struct mem_info *mi)
 	u64 daddr  = mi->daddr.addr;
 	u64 op     = data_src->mem_op;
 	u64 lvl    = data_src->mem_lvl;
+	u64 lnum   = data_src->mem_lvl_num;
 	u64 snoop  = data_src->mem_snoop;
 	u64 lock   = data_src->mem_lock;
 	u64 blk    = data_src->mem_blk;
@@ -527,16 +528,18 @@ do {				\
 			if (lvl & P(LVL, UNC)) stats->ld_uncache++;
 			if (lvl & P(LVL, IO))  stats->ld_io++;
 			if (lvl & P(LVL, LFB)) stats->ld_fbhit++;
-			if (lvl & P(LVL, L1 )) stats->ld_l1hit++;
-			if (lvl & P(LVL, L2 )) stats->ld_l2hit++;
-			if (lvl & P(LVL, L3 )) {
+			if (lvl & P(LVL, L1) || lnum == P(LVLNUM, L1))
+				stats->ld_l1hit++;
+			if (lvl & P(LVL, L2) || lnum == P(LVLNUM, L2))
+				stats->ld_l2hit++;
+			if (lvl & P(LVL, L3) || lnum == P(LVLNUM, L3)) {
 				if (snoop & P(SNOOP, HITM))
 					HITM_INC(lcl_hitm);
 				else
 					stats->ld_llchit++;
 			}
 
-			if (lvl & P(LVL, LOC_RAM)) {
+			if (lvl & P(LVL, LOC_RAM) || lnum == P(LVLNUM, RAM)) {
 				stats->lcl_dram++;
 				if (snoop & P(SNOOP, HIT))
 					stats->ld_shared++;
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH v5 5/5] perf mem: Support mem_lvl_num in c2c command
  2022-04-08 19:53 ` [PATCH v5 5/5] perf mem: Support mem_lvl_num in c2c command Ali Saidi
@ 2022-04-11 10:04   ` German Gomez
  2022-04-20  8:48     ` Leo Yan
  2022-04-20 19:02   ` Liang, Kan
  1 sibling, 1 reply; 28+ messages in thread
From: German Gomez @ 2022-04-11 10:04 UTC (permalink / raw)
  To: Ali Saidi, linux-kernel, linux-perf-users, linux-arm-kernel,
	leo.yan, acme
  Cc: benh, Nick.Forrington, alexander.shishkin, andrew.kilroy,
	james.clark, john.garry, jolsa, kjain, lihuafei1, mark.rutland,
	mathieu.poirier, mingo, namhyung, peterz, will


On 08/04/2022 20:53, Ali Saidi wrote:
> In addition to summarizing data encoded in mem_lvl also support data
> encoded in mem_lvl_num.
>
> Since other architectures don't seem to populate the mem_lvl_num field
> here there shouldn't be a change in functionality.
>
> Signed-off-by: Ali Saidi <alisaidi@amazon.com>
> ---
>  tools/perf/util/mem-events.c | 11 +++++++----
>  1 file changed, 7 insertions(+), 4 deletions(-)
>
> diff --git a/tools/perf/util/mem-events.c b/tools/perf/util/mem-events.c
> index ed0ab838bcc5..e5e405185498 100644
> --- a/tools/perf/util/mem-events.c
> +++ b/tools/perf/util/mem-events.c
> @@ -485,6 +485,7 @@ int c2c_decode_stats(struct c2c_stats *stats, struct mem_info *mi)
>  	u64 daddr  = mi->daddr.addr;
>  	u64 op     = data_src->mem_op;
>  	u64 lvl    = data_src->mem_lvl;
> +	u64 lnum   = data_src->mem_lvl_num;
>  	u64 snoop  = data_src->mem_snoop;
>  	u64 lock   = data_src->mem_lock;
>  	u64 blk    = data_src->mem_blk;
> @@ -527,16 +528,18 @@ do {				\
>  			if (lvl & P(LVL, UNC)) stats->ld_uncache++;
>  			if (lvl & P(LVL, IO))  stats->ld_io++;
>  			if (lvl & P(LVL, LFB)) stats->ld_fbhit++;

Just for completion, can we also handle LFB as it seems to be being set
in "/arch/x86/events/intel/ds.c"? (Sorry I missed this in the v4)

> -			if (lvl & P(LVL, L1 )) stats->ld_l1hit++;
> -			if (lvl & P(LVL, L2 )) stats->ld_l2hit++;
> -			if (lvl & P(LVL, L3 )) {
> +			if (lvl & P(LVL, L1) || lnum == P(LVLNUM, L1))
> +				stats->ld_l1hit++;
> +			if (lvl & P(LVL, L2) || lnum == P(LVLNUM, L2))
> +				stats->ld_l2hit++;
> +			if (lvl & P(LVL, L3) || lnum == P(LVLNUM, L3)) {
>  				if (snoop & P(SNOOP, HITM))
>  					HITM_INC(lcl_hitm);
>  				else
>  					stats->ld_llchit++;
>  			}
>  
> -			if (lvl & P(LVL, LOC_RAM)) {
> +			if (lvl & P(LVL, LOC_RAM) || lnum == P(LVLNUM, RAM)) {
>  				stats->lcl_dram++;
>  				if (snoop & P(SNOOP, HIT))
>  					stats->ld_shared++;

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v5 3/5] perf tools: sync addition of PERF_MEM_SNOOPX_PEER
  2022-04-08 19:53 ` [PATCH v5 3/5] perf tools: sync addition of PERF_MEM_SNOOPX_PEER Ali Saidi
@ 2022-04-11 10:26   ` German Gomez
  2022-04-11 14:35     ` German Gomez
  0 siblings, 1 reply; 28+ messages in thread
From: German Gomez @ 2022-04-11 10:26 UTC (permalink / raw)
  To: Ali Saidi, linux-kernel, linux-perf-users, linux-arm-kernel,
	leo.yan, acme
  Cc: benh, Nick.Forrington, alexander.shishkin, andrew.kilroy,
	james.clark, john.garry, jolsa, kjain, lihuafei1, mark.rutland,
	mathieu.poirier, mingo, namhyung, peterz, will


On 08/04/2022 20:53, Ali Saidi wrote:
> Add a flag to the perf mem data struct to signal that a request caused a
> cache-to-cache transfer of a line from a peer of the requestor and
> wasn't sourced from a lower cache level.  The line being moved from one
> peer cache to another has latency and performance implications. On Arm64
> Neoverse systems the data source can indicate a cache-to-cache transfer
> but not if the line is dirty or clean, so instead of overloading HITM
> define a new flag that indicates this type of transfer.

I think it's fine to merge this and the previous commit rather than have
two commits with the same msg.

>
> Signed-off-by: Ali Saidi <alisaidi@amazon.com>
> ---
>  tools/include/uapi/linux/perf_event.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/tools/include/uapi/linux/perf_event.h b/tools/include/uapi/linux/perf_event.h
> index 82858b697c05..c9e58c79f3e5 100644
> --- a/tools/include/uapi/linux/perf_event.h
> +++ b/tools/include/uapi/linux/perf_event.h
> @@ -1308,7 +1308,7 @@ union perf_mem_data_src {
>  #define PERF_MEM_SNOOP_SHIFT	19
>  
>  #define PERF_MEM_SNOOPX_FWD	0x01 /* forward */
> -/* 1 free */
> +#define PERF_MEM_SNOOPX_PEER	0x02 /* xfer from peer */
>  #define PERF_MEM_SNOOPX_SHIFT  38
>  
>  /* locked instruction */

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v5 3/5] perf tools: sync addition of PERF_MEM_SNOOPX_PEER
  2022-04-11 10:26   ` German Gomez
@ 2022-04-11 14:35     ` German Gomez
  2022-04-20  8:23       ` Leo Yan
  0 siblings, 1 reply; 28+ messages in thread
From: German Gomez @ 2022-04-11 14:35 UTC (permalink / raw)
  To: Ali Saidi, linux-kernel, linux-perf-users, linux-arm-kernel,
	leo.yan, acme
  Cc: benh, Nick.Forrington, alexander.shishkin, andrew.kilroy,
	james.clark, john.garry, jolsa, kjain, lihuafei1, mark.rutland,
	mathieu.poirier, mingo, namhyung, peterz, will


On 11/04/2022 11:26, German Gomez wrote:
> On 08/04/2022 20:53, Ali Saidi wrote:
>> Add a flag to the perf mem data struct to signal that a request caused a
>> cache-to-cache transfer of a line from a peer of the requestor and
>> wasn't sourced from a lower cache level.  The line being moved from one
>> peer cache to another has latency and performance implications. On Arm64
>> Neoverse systems the data source can indicate a cache-to-cache transfer
>> but not if the line is dirty or clean, so instead of overloading HITM
>> define a new flag that indicates this type of transfer.
> I think it's fine to merge this and the previous commit rather than have
> two commits with the same msg.
>

I take it back. It has been pointed out to me that having the separation
is normal/ok. So my comment doesn't apply.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v5 1/5] tools: arm64: Import cputype.h
  2022-04-08 19:53 ` [PATCH v5 1/5] tools: arm64: Import cputype.h Ali Saidi
@ 2022-04-18 14:57   ` Leo Yan
  0 siblings, 0 replies; 28+ messages in thread
From: Leo Yan @ 2022-04-18 14:57 UTC (permalink / raw)
  To: Ali Saidi
  Cc: linux-kernel, linux-perf-users, linux-arm-kernel, german.gomez,
	acme, benh, Nick.Forrington, alexander.shishkin, andrew.kilroy,
	james.clark, john.garry, jolsa, kjain, lihuafei1, mark.rutland,
	mathieu.poirier, mingo, namhyung, peterz, will

On Fri, Apr 08, 2022 at 07:53:40PM +0000, Ali Saidi wrote:
> Bring-in the kernel's arch/arm64/include/asm/cputype.h into tools/
> for arm64 to make use of all the core-type definitions in perf.
> 
> Replace sysreg.h with the version already imported into tools/.
> 
> Signed-off-by: Ali Saidi <alisaidi@amazon.com>

Please drop this patch in next spin, it has been merged in the
mainline kernel.

Thanks,
Leo

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v5 2/5] perf: Add SNOOP_PEER flag to perf mem data struct
  2022-04-08 19:53 ` [PATCH v5 2/5] perf: Add SNOOP_PEER flag to perf mem data struct Ali Saidi
@ 2022-04-20  8:20   ` Leo Yan
  2022-04-20 18:43   ` Liang, Kan
  1 sibling, 0 replies; 28+ messages in thread
From: Leo Yan @ 2022-04-20  8:20 UTC (permalink / raw)
  To: Ali Saidi
  Cc: linux-kernel, linux-perf-users, linux-arm-kernel, german.gomez,
	acme, benh, Nick.Forrington, alexander.shishkin, andrew.kilroy,
	james.clark, john.garry, jolsa, kjain, lihuafei1, mark.rutland,
	mathieu.poirier, mingo, namhyung, peterz, will

On Fri, Apr 08, 2022 at 07:53:41PM +0000, Ali Saidi wrote:
> Add a flag to the perf mem data struct to signal that a request caused a
> cache-to-cache transfer of a line from a peer of the requestor and
> wasn't sourced from a lower cache level.  The line being moved from one
> peer cache to another has latency and performance implications. On Arm64
> Neoverse systems the data source can indicate a cache-to-cache transfer
> but not if the line is dirty or clean, so instead of overloading HITM
> define a new flag that indicates this type of transfer.
> 
> Signed-off-by: Ali Saidi <alisaidi@amazon.com>

The patch looks good to me:
Reviewed-by: Leo Yan <leo.yan@linaro.org>

Sine this is a common flag, it's better if x86 or PowerPC maintainers
could take a look for this new snooping type.  Thanks!

Leo

> ---
>  include/uapi/linux/perf_event.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
> index 82858b697c05..c9e58c79f3e5 100644
> --- a/include/uapi/linux/perf_event.h
> +++ b/include/uapi/linux/perf_event.h
> @@ -1308,7 +1308,7 @@ union perf_mem_data_src {
>  #define PERF_MEM_SNOOP_SHIFT	19
>  
>  #define PERF_MEM_SNOOPX_FWD	0x01 /* forward */
> -/* 1 free */
> +#define PERF_MEM_SNOOPX_PEER	0x02 /* xfer from peer */
>  #define PERF_MEM_SNOOPX_SHIFT  38
>  
>  /* locked instruction */
> -- 
> 2.32.0
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v5 3/5] perf tools: sync addition of PERF_MEM_SNOOPX_PEER
  2022-04-11 14:35     ` German Gomez
@ 2022-04-20  8:23       ` Leo Yan
  0 siblings, 0 replies; 28+ messages in thread
From: Leo Yan @ 2022-04-20  8:23 UTC (permalink / raw)
  To: German Gomez
  Cc: Ali Saidi, linux-kernel, linux-perf-users, linux-arm-kernel,
	acme, benh, Nick.Forrington, alexander.shishkin, andrew.kilroy,
	james.clark, john.garry, jolsa, kjain, lihuafei1, mark.rutland,
	mathieu.poirier, mingo, namhyung, peterz, will

On Mon, Apr 11, 2022 at 03:35:52PM +0100, German Gomez wrote:
> 
> On 11/04/2022 11:26, German Gomez wrote:
> > On 08/04/2022 20:53, Ali Saidi wrote:
> >> Add a flag to the perf mem data struct to signal that a request caused a
> >> cache-to-cache transfer of a line from a peer of the requestor and
> >> wasn't sourced from a lower cache level.  The line being moved from one
> >> peer cache to another has latency and performance implications. On Arm64
> >> Neoverse systems the data source can indicate a cache-to-cache transfer
> >> but not if the line is dirty or clean, so instead of overloading HITM
> >> define a new flag that indicates this type of transfer.
> > I think it's fine to merge this and the previous commit rather than have
> > two commits with the same msg.
> >
> 
> I take it back. It has been pointed out to me that having the separation
> is normal/ok. So my comment doesn't apply.

Yeah, it's good that we split these two patches since it's easier
for maintainer to pick up separately :)

Thanks,
Leo

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v5 4/5] perf arm-spe: Use SPE data source for neoverse cores
  2022-04-08 19:53 ` [PATCH v5 4/5] perf arm-spe: Use SPE data source for neoverse cores Ali Saidi
@ 2022-04-20  8:30   ` Leo Yan
  2022-04-20  8:42   ` Leo Yan
  1 sibling, 0 replies; 28+ messages in thread
From: Leo Yan @ 2022-04-20  8:30 UTC (permalink / raw)
  To: Ali Saidi
  Cc: linux-kernel, linux-perf-users, linux-arm-kernel, german.gomez,
	acme, benh, Nick.Forrington, alexander.shishkin, andrew.kilroy,
	james.clark, john.garry, jolsa, kjain, lihuafei1, mark.rutland,
	mathieu.poirier, mingo, namhyung, peterz, will

On Fri, Apr 08, 2022 at 07:53:43PM +0000, Ali Saidi wrote:
> When synthesizing data from SPE, augment the type with source information
> for Arm Neoverse cores. The field is IMPLDEF but the Neoverse cores all use
> the same encoding. I can't find encoding information for any other SPE
> implementations to unify their choices with Arm's thus that is left for
> future work.
> 
> This change populates the mem_lvl_num for Neoverse cores instead of the
> deprecated mem_lvl namespace.
> 
> Signed-off-by: Ali Saidi <alisaidi@amazon.com>
> ---
>  .../util/arm-spe-decoder/arm-spe-decoder.c    |   1 +
>  .../util/arm-spe-decoder/arm-spe-decoder.h    |  12 ++
>  tools/perf/util/arm-spe.c                     | 127 ++++++++++++++++--
>  3 files changed, 126 insertions(+), 14 deletions(-)
> 
> diff --git a/tools/perf/util/arm-spe-decoder/arm-spe-decoder.c b/tools/perf/util/arm-spe-decoder/arm-spe-decoder.c
> index 5e390a1a79ab..091987dd3966 100644
> --- a/tools/perf/util/arm-spe-decoder/arm-spe-decoder.c
> +++ b/tools/perf/util/arm-spe-decoder/arm-spe-decoder.c
> @@ -220,6 +220,7 @@ static int arm_spe_read_record(struct arm_spe_decoder *decoder)
>  
>  			break;
>  		case ARM_SPE_DATA_SOURCE:
> +			decoder->record.source = payload;
>  			break;
>  		case ARM_SPE_BAD:
>  			break;
> diff --git a/tools/perf/util/arm-spe-decoder/arm-spe-decoder.h b/tools/perf/util/arm-spe-decoder/arm-spe-decoder.h
> index 69b31084d6be..46a61df1145b 100644
> --- a/tools/perf/util/arm-spe-decoder/arm-spe-decoder.h
> +++ b/tools/perf/util/arm-spe-decoder/arm-spe-decoder.h
> @@ -29,6 +29,17 @@ enum arm_spe_op_type {
>  	ARM_SPE_ST		= 1 << 1,
>  };
>  
> +enum arm_spe_neoverse_data_source {
> +	ARM_SPE_NV_L1D		 = 0x0,
> +	ARM_SPE_NV_L2		 = 0x8,
> +	ARM_SPE_NV_PEER_CORE	 = 0x9,
> +	ARM_SPE_NV_LOCAL_CLUSTER = 0xa,
> +	ARM_SPE_NV_SYS_CACHE	 = 0xb,
> +	ARM_SPE_NV_PEER_CLUSTER	 = 0xc,
> +	ARM_SPE_NV_REMOTE	 = 0xd,
> +	ARM_SPE_NV_DRAM		 = 0xe,
> +};
> +
>  struct arm_spe_record {
>  	enum arm_spe_sample_type type;
>  	int err;
> @@ -40,6 +51,7 @@ struct arm_spe_record {
>  	u64 virt_addr;
>  	u64 phys_addr;
>  	u64 context_id;
> +	u16 source;
>  };
>  
>  struct arm_spe_insn;
> diff --git a/tools/perf/util/arm-spe.c b/tools/perf/util/arm-spe.c
> index d2b64e3f588b..a20285cf98e3 100644
> --- a/tools/perf/util/arm-spe.c
> +++ b/tools/perf/util/arm-spe.c
> @@ -34,6 +34,7 @@
>  #include "arm-spe-decoder/arm-spe-decoder.h"
>  #include "arm-spe-decoder/arm-spe-pkt-decoder.h"
>  
> +#include "../../arch/arm64/include/asm/cputype.h"
>  #define MAX_TIMESTAMP (~0ULL)
>  
>  struct arm_spe {
> @@ -45,6 +46,7 @@ struct arm_spe {
>  	struct perf_session		*session;
>  	struct machine			*machine;
>  	u32				pmu_type;
> +	u64				midr;
>  
>  	struct perf_tsc_conversion	tc;
>  
> @@ -399,33 +401,127 @@ static bool arm_spe__is_memory_event(enum arm_spe_sample_type type)
>  	return false;
>  }
>  
> -static u64 arm_spe__synth_data_source(const struct arm_spe_record *record)
> +static const struct midr_range neoverse_spe[] = {
> +	MIDR_ALL_VERSIONS(MIDR_NEOVERSE_N1),
> +	MIDR_ALL_VERSIONS(MIDR_NEOVERSE_N2),
> +	MIDR_ALL_VERSIONS(MIDR_NEOVERSE_V1),
> +	{},
> +};
> +
> +
> +static void arm_spe__synth_data_source_neoverse(const struct arm_spe_record *record,
> +						union perf_mem_data_src *data_src)
>  {
> -	union perf_mem_data_src	data_src = { 0 };
> +	/*
> +	 * Even though four levels of cache hierarchy are possible, no known
> +	 * production Neoverse systems currently include more than three levels
> +	 * so for the time being we assume three exist. If a production system
> +	 * is built with four the this function would have to be changed to
> +	 * detect the number of levels for reporting.
> +	 */
>  
> -	if (record->op == ARM_SPE_LD)
> -		data_src.mem_op = PERF_MEM_OP_LOAD;
> -	else
> -		data_src.mem_op = PERF_MEM_OP_STORE;
> +	/*
> +	 * We have no data on the hit level or data source for stores in the
> +	 * Neoverse SPE records.
> +	 */
> +	if (record->op & ARM_SPE_ST) {
> +		data_src->mem_lvl = PERF_MEM_LVL_NA;
> +		data_src->mem_lvl_num = PERF_MEM_LVLNUM_ANY_CACHE;
> +		data_src->mem_snoop = PERF_MEM_SNOOP_NA;
> +		return;
> +	}

For the store operation, I found we need to use more strictly criteria
to check memory operations, otherwise, we might wrongly synthesize
memory sample even for other types of operations.

To fix the issue, I think we need to add below patch; if this is okay
for you, please consider to include it in the next patch set version.

Thanks,
Leo

From 7f8499d4f44b400d217c01d42059f00e8a1697b0 Mon Sep 17 00:00:00 2001
From: Leo Yan <leo.yan@linaro.org>
Date: Wed, 20 Apr 2022 15:46:21 +0800
Subject: [PATCH] perf arm-spe: Don't set data source if it's not a memory
 operation

Except memory load and store operations, Arm SPE records also can
support other operation types, bug when set the data source field the
current code assumes a record is a either load operation or store
operation, this leads to wrongly synthesize memory samples.

This patch strictly checks the record operation type, it only sets data
source only for the operation types ARM_SPE_LD and ARM_SPE_ST,
otherwise, returns zero for data source.  Therefore, we can synthesize
memory samples only when data source is a non-zero value, the function
arm_spe__is_memory_event() is useless and removed.

Signed-off-by: Leo Yan <leo.yan@linaro.org>
---
 tools/perf/util/arm-spe.c | 22 ++++++++--------------
 1 file changed, 8 insertions(+), 14 deletions(-)

diff --git a/tools/perf/util/arm-spe.c b/tools/perf/util/arm-spe.c
index d2b64e3f588b..76251825c01d 100644
--- a/tools/perf/util/arm-spe.c
+++ b/tools/perf/util/arm-spe.c
@@ -387,26 +387,16 @@ static int arm_spe__synth_instruction_sample(struct arm_spe_queue *speq,
 	return arm_spe_deliver_synth_event(spe, speq, event, &sample);
 }
 
-#define SPE_MEM_TYPE	(ARM_SPE_L1D_ACCESS | ARM_SPE_L1D_MISS | \
-			 ARM_SPE_LLC_ACCESS | ARM_SPE_LLC_MISS | \
-			 ARM_SPE_REMOTE_ACCESS)
-
-static bool arm_spe__is_memory_event(enum arm_spe_sample_type type)
-{
-	if (type & SPE_MEM_TYPE)
-		return true;
-
-	return false;
-}
-
 static u64 arm_spe__synth_data_source(const struct arm_spe_record *record)
 {
 	union perf_mem_data_src	data_src = { 0 };
 
 	if (record->op == ARM_SPE_LD)
 		data_src.mem_op = PERF_MEM_OP_LOAD;
-	else
+	else if (record->op & ARM_SPE_ST)
 		data_src.mem_op = PERF_MEM_OP_STORE;
+	else
+		return 0;
 
 	if (record->type & (ARM_SPE_LLC_ACCESS | ARM_SPE_LLC_MISS)) {
 		data_src.mem_lvl = PERF_MEM_LVL_L3;
@@ -510,7 +500,11 @@ static int arm_spe_sample(struct arm_spe_queue *speq)
 			return err;
 	}
 
-	if (spe->sample_memory && arm_spe__is_memory_event(record->type)) {
+	/*
+	 * When data_src is zero it means the record is not a memory operation,
+	 * skip to synthesize memory sample for this case.
+	 */
+	if (spe->sample_memory && data_src) {
 		err = arm_spe__synth_mem_sample(speq, spe->memory_id, data_src);
 		if (err)
 			return err;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH v5 4/5] perf arm-spe: Use SPE data source for neoverse cores
  2022-04-08 19:53 ` [PATCH v5 4/5] perf arm-spe: Use SPE data source for neoverse cores Ali Saidi
  2022-04-20  8:30   ` Leo Yan
@ 2022-04-20  8:42   ` Leo Yan
  1 sibling, 0 replies; 28+ messages in thread
From: Leo Yan @ 2022-04-20  8:42 UTC (permalink / raw)
  To: Ali Saidi
  Cc: linux-kernel, linux-perf-users, linux-arm-kernel, german.gomez,
	acme, benh, Nick.Forrington, alexander.shishkin, andrew.kilroy,
	james.clark, john.garry, jolsa, kjain, lihuafei1, mark.rutland,
	mathieu.poirier, mingo, namhyung, peterz, will

On Fri, Apr 08, 2022 at 07:53:43PM +0000, Ali Saidi wrote:
> When synthesizing data from SPE, augment the type with source information
> for Arm Neoverse cores. The field is IMPLDEF but the Neoverse cores all use
> the same encoding. I can't find encoding information for any other SPE
> implementations to unify their choices with Arm's thus that is left for
> future work.
> 
> This change populates the mem_lvl_num for Neoverse cores instead of the
> deprecated mem_lvl namespace.
> 
> Signed-off-by: Ali Saidi <alisaidi@amazon.com>
> ---
>  .../util/arm-spe-decoder/arm-spe-decoder.c    |   1 +
>  .../util/arm-spe-decoder/arm-spe-decoder.h    |  12 ++
>  tools/perf/util/arm-spe.c                     | 127 ++++++++++++++++--
>  3 files changed, 126 insertions(+), 14 deletions(-)
> 
> diff --git a/tools/perf/util/arm-spe-decoder/arm-spe-decoder.c b/tools/perf/util/arm-spe-decoder/arm-spe-decoder.c
> index 5e390a1a79ab..091987dd3966 100644
> --- a/tools/perf/util/arm-spe-decoder/arm-spe-decoder.c
> +++ b/tools/perf/util/arm-spe-decoder/arm-spe-decoder.c
> @@ -220,6 +220,7 @@ static int arm_spe_read_record(struct arm_spe_decoder *decoder)
>  
>  			break;
>  		case ARM_SPE_DATA_SOURCE:
> +			decoder->record.source = payload;
>  			break;
>  		case ARM_SPE_BAD:
>  			break;
> diff --git a/tools/perf/util/arm-spe-decoder/arm-spe-decoder.h b/tools/perf/util/arm-spe-decoder/arm-spe-decoder.h
> index 69b31084d6be..46a61df1145b 100644
> --- a/tools/perf/util/arm-spe-decoder/arm-spe-decoder.h
> +++ b/tools/perf/util/arm-spe-decoder/arm-spe-decoder.h
> @@ -29,6 +29,17 @@ enum arm_spe_op_type {
>  	ARM_SPE_ST		= 1 << 1,
>  };
>  
> +enum arm_spe_neoverse_data_source {
> +	ARM_SPE_NV_L1D		 = 0x0,
> +	ARM_SPE_NV_L2		 = 0x8,
> +	ARM_SPE_NV_PEER_CORE	 = 0x9,
> +	ARM_SPE_NV_LOCAL_CLUSTER = 0xa,
> +	ARM_SPE_NV_SYS_CACHE	 = 0xb,
> +	ARM_SPE_NV_PEER_CLUSTER	 = 0xc,
> +	ARM_SPE_NV_REMOTE	 = 0xd,
> +	ARM_SPE_NV_DRAM		 = 0xe,
> +};
> +
>  struct arm_spe_record {
>  	enum arm_spe_sample_type type;
>  	int err;
> @@ -40,6 +51,7 @@ struct arm_spe_record {
>  	u64 virt_addr;
>  	u64 phys_addr;
>  	u64 context_id;
> +	u16 source;
>  };
>  
>  struct arm_spe_insn;
> diff --git a/tools/perf/util/arm-spe.c b/tools/perf/util/arm-spe.c
> index d2b64e3f588b..a20285cf98e3 100644
> --- a/tools/perf/util/arm-spe.c
> +++ b/tools/perf/util/arm-spe.c
> @@ -34,6 +34,7 @@
>  #include "arm-spe-decoder/arm-spe-decoder.h"
>  #include "arm-spe-decoder/arm-spe-pkt-decoder.h"
>  
> +#include "../../arch/arm64/include/asm/cputype.h"
>  #define MAX_TIMESTAMP (~0ULL)
>  
>  struct arm_spe {
> @@ -45,6 +46,7 @@ struct arm_spe {
>  	struct perf_session		*session;
>  	struct machine			*machine;
>  	u32				pmu_type;
> +	u64				midr;
>  
>  	struct perf_tsc_conversion	tc;
>  
> @@ -399,33 +401,127 @@ static bool arm_spe__is_memory_event(enum arm_spe_sample_type type)
>  	return false;
>  }
>  
> -static u64 arm_spe__synth_data_source(const struct arm_spe_record *record)
> +static const struct midr_range neoverse_spe[] = {
> +	MIDR_ALL_VERSIONS(MIDR_NEOVERSE_N1),
> +	MIDR_ALL_VERSIONS(MIDR_NEOVERSE_N2),
> +	MIDR_ALL_VERSIONS(MIDR_NEOVERSE_V1),
> +	{},
> +};
> +
> +
> +static void arm_spe__synth_data_source_neoverse(const struct arm_spe_record *record,
> +						union perf_mem_data_src *data_src)
>  {
> -	union perf_mem_data_src	data_src = { 0 };
> +	/*
> +	 * Even though four levels of cache hierarchy are possible, no known
> +	 * production Neoverse systems currently include more than three levels
> +	 * so for the time being we assume three exist. If a production system
> +	 * is built with four the this function would have to be changed to
> +	 * detect the number of levels for reporting.
> +	 */
>  
> -	if (record->op == ARM_SPE_LD)
> -		data_src.mem_op = PERF_MEM_OP_LOAD;
> -	else
> -		data_src.mem_op = PERF_MEM_OP_STORE;
> +	/*
> +	 * We have no data on the hit level or data source for stores in the
> +	 * Neoverse SPE records.
> +	 */
> +	if (record->op & ARM_SPE_ST) {
> +		data_src->mem_lvl = PERF_MEM_LVL_NA;
> +		data_src->mem_lvl_num = PERF_MEM_LVLNUM_ANY_CACHE;
> +		data_src->mem_snoop = PERF_MEM_SNOOP_NA;
> +		return;
> +	}
> +
> +

Redundant new line.

> +	switch (record->source) {
> +	case ARM_SPE_NV_L1D:
> +		data_src->mem_lvl = PERF_MEM_LVL_HIT;
> +		data_src->mem_lvl_num = PERF_MEM_LVLNUM_L1;
> +		data_src->mem_snoop = PERF_MEM_SNOOP_NONE;
> +		break;
> +	case ARM_SPE_NV_L2:
> +		data_src->mem_lvl = PERF_MEM_LVL_HIT;
> +		data_src->mem_lvl_num = PERF_MEM_LVLNUM_L2;
> +		data_src->mem_snoop = PERF_MEM_SNOOP_NONE;
> +		break;
> +	case ARM_SPE_NV_PEER_CORE:
> +		data_src->mem_lvl = PERF_MEM_LVL_HIT;
> +		data_src->mem_snoopx = PERF_MEM_SNOOPX_PEER;
> +		data_src->mem_lvl_num = PERF_MEM_LVLNUM_L2;
> +		break;
> +	/*
> +	 * We don't know if this is L1, L2 but we do know it was a cache-2-cache
> +	 * transfer, so set SNOOPX_PEER
> +	 */
> +	case ARM_SPE_NV_LOCAL_CLUSTER:
> +	case ARM_SPE_NV_PEER_CLUSTER:
> +		data_src->mem_lvl = PERF_MEM_LVL_HIT;
> +		data_src->mem_snoopx = PERF_MEM_SNOOPX_PEER;

As a side topic, it's better to use a new patch to dump snooping flag
PERF_MEM_SNOOPX_PEER, some code like below:

diff --git a/tools/perf/util/mem-events.c b/tools/perf/util/mem-events.c
index f8f234251f92..66d44280a4ea 100644
--- a/tools/perf/util/mem-events.c
+++ b/tools/perf/util/mem-events.c
@@ -410,6 +410,11 @@ static const char * const snoop_access[] = {
 	"HitM",
 };
 
+static const char * const snoopx_access[] = {
+	"Fwd",
+	"Peer",
+};
+
 int perf_mem__snp_scnprintf(char *out, size_t sz, struct mem_info *mem_info)
 {
 	size_t i, l = 0;
@@ -430,13 +435,18 @@ int perf_mem__snp_scnprintf(char *out, size_t sz, struct mem_info *mem_info)
 		}
 		l += scnprintf(out + l, sz - l, snoop_access[i]);
 	}
-	if (mem_info &&
-	     (mem_info->data_src.mem_snoopx & PERF_MEM_SNOOPX_FWD)) {
+
+	if (mem_info)
+		m = mem_info->data_src.mem_snoopx;
+
+	for (i = 0; m && i < ARRAY_SIZE(snoopx_access); i++, m >>= 1) {
+		if (!(m & 0x1))
+			continue;
 		if (l) {
 			strcat(out, " or ");
 			l += 4;
 		}
-		l += scnprintf(out + l, sz - l, "Fwd");
+		l += scnprintf(out + l, sz - l, snoopx_access[i]);
 	}
 

> +		data_src->mem_lvl_num = PERF_MEM_LVLNUM_L3;
> +		break;
> +	/*
> +	 * System cache is assumed to be L3
> +	 */
> +	case ARM_SPE_NV_SYS_CACHE:
> +		data_src->mem_lvl = PERF_MEM_LVL_HIT;
> +		data_src->mem_lvl_num = PERF_MEM_LVLNUM_L3;
> +		data_src->mem_snoop = PERF_MEM_SNOOP_HIT;
> +		break;
> +	/*
> +	 * We don't know what level it hit in, except it came from the other
> +	 * socket
> +	 */
> +	case ARM_SPE_NV_REMOTE:
> +		data_src->mem_snoop = PERF_MEM_SNOOP_HIT;
> +		data_src->mem_remote = PERF_MEM_REMOTE_REMOTE;
> +		data_src->mem_snoop = PERF_MEM_SNOOP_NA;
> +		break;
> +	case ARM_SPE_NV_DRAM:
> +		data_src->mem_lvl = PERF_MEM_LVL_HIT;
> +		data_src->mem_lvl_num = PERF_MEM_LVLNUM_RAM;
> +		data_src->mem_snoop = PERF_MEM_SNOOP_NONE;
> +		break;
> +	default:
> +		break;
> +	}
> +}
>
> +static void arm_spe__synth_data_source_generic(const struct arm_spe_record *record,
> +						union perf_mem_data_src *data_src)
> +{
>  	if (record->type & (ARM_SPE_LLC_ACCESS | ARM_SPE_LLC_MISS)) {
> -		data_src.mem_lvl = PERF_MEM_LVL_L3;
> +		data_src->mem_lvl = PERF_MEM_LVL_L3;
>  
>  		if (record->type & ARM_SPE_LLC_MISS)
> -			data_src.mem_lvl |= PERF_MEM_LVL_MISS;
> +			data_src->mem_lvl |= PERF_MEM_LVL_MISS;
>  		else
> -			data_src.mem_lvl |= PERF_MEM_LVL_HIT;
> +			data_src->mem_lvl |= PERF_MEM_LVL_HIT;
>  	} else if (record->type & (ARM_SPE_L1D_ACCESS | ARM_SPE_L1D_MISS)) {
> -		data_src.mem_lvl = PERF_MEM_LVL_L1;
> +		data_src->mem_lvl = PERF_MEM_LVL_L1;
>  
>  		if (record->type & ARM_SPE_L1D_MISS)
> -			data_src.mem_lvl |= PERF_MEM_LVL_MISS;
> +			data_src->mem_lvl |= PERF_MEM_LVL_MISS;
>  		else
> -			data_src.mem_lvl |= PERF_MEM_LVL_HIT;
> +			data_src->mem_lvl |= PERF_MEM_LVL_HIT;
>  	}
>  
>  	if (record->type & ARM_SPE_REMOTE_ACCESS)
> -		data_src.mem_lvl |= PERF_MEM_LVL_REM_CCE1;
> +		data_src->mem_lvl |= PERF_MEM_LVL_REM_CCE1;
> +}
> +
> +static u64 arm_spe__synth_data_source(const struct arm_spe_record *record, u64 midr)
> +{
> +	union perf_mem_data_src	data_src = { 0 };
> +	bool is_neoverse = is_midr_in_range(midr, neoverse_spe);
> +
> +	if (record->op & ARM_SPE_LD)
> +		data_src.mem_op = PERF_MEM_OP_LOAD;
> +	else
> +		data_src.mem_op = PERF_MEM_OP_STORE;
> +
> +	if (is_neoverse)
> +		arm_spe__synth_data_source_neoverse(record, &data_src);
> +	else
> +		arm_spe__synth_data_source_generic(record, &data_src);
>  
>  	if (record->type & (ARM_SPE_TLB_ACCESS | ARM_SPE_TLB_MISS)) {
>  		data_src.mem_dtlb = PERF_MEM_TLB_WK;
> @@ -446,7 +542,7 @@ static int arm_spe_sample(struct arm_spe_queue *speq)
>  	u64 data_src;
>  	int err;
>  
> -	data_src = arm_spe__synth_data_source(record);
> +	data_src = arm_spe__synth_data_source(record, spe->midr);
>  
>  	if (spe->sample_flc) {
>  		if (record->type & ARM_SPE_L1D_MISS) {
> @@ -1183,6 +1279,8 @@ int arm_spe_process_auxtrace_info(union perf_event *event,
>  	struct perf_record_auxtrace_info *auxtrace_info = &event->auxtrace_info;
>  	size_t min_sz = sizeof(u64) * ARM_SPE_AUXTRACE_PRIV_MAX;
>  	struct perf_record_time_conv *tc = &session->time_conv;
> +	const char *cpuid = perf_env__cpuid(session->evlist->env);
> +	u64 midr = strtol(cpuid, NULL, 16);
>  	struct arm_spe *spe;
>  	int err;
>  
> @@ -1202,6 +1300,7 @@ int arm_spe_process_auxtrace_info(union perf_event *event,
>  	spe->machine = &session->machines.host; /* No kvm support */
>  	spe->auxtrace_type = auxtrace_info->type;
>  	spe->pmu_type = auxtrace_info->priv[ARM_SPE_PMU_TYPE];
> +	spe->midr = midr;

Except the redundant line, the patch is good for me and I tested it at
my side:

Reviewed-by: Leo Yan <leo.yan@linaro.org>
Tested-by: Leo Yan <leo.yan@linaro.org>

>  
>  	spe->timeless_decoding = arm_spe__is_timeless_decoding(spe);
>  
> -- 
> 2.32.0
> 

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH v5 5/5] perf mem: Support mem_lvl_num in c2c command
  2022-04-11 10:04   ` German Gomez
@ 2022-04-20  8:48     ` Leo Yan
  2022-04-20 16:47       ` Arnaldo Carvalho de Melo
  0 siblings, 1 reply; 28+ messages in thread
From: Leo Yan @ 2022-04-20  8:48 UTC (permalink / raw)
  To: German Gomez
  Cc: Ali Saidi, linux-kernel, linux-perf-users, linux-arm-kernel,
	acme, benh, Nick.Forrington, alexander.shishkin, andrew.kilroy,
	james.clark, john.garry, jolsa, kjain, lihuafei1, mark.rutland,
	mathieu.poirier, mingo, namhyung, peterz, will

On Mon, Apr 11, 2022 at 11:04:28AM +0100, German Gomez wrote:
> 
> On 08/04/2022 20:53, Ali Saidi wrote:
> > In addition to summarizing data encoded in mem_lvl also support data
> > encoded in mem_lvl_num.
> >
> > Since other architectures don't seem to populate the mem_lvl_num field
> > here there shouldn't be a change in functionality.
> >
> > Signed-off-by: Ali Saidi <alisaidi@amazon.com>
> > ---
> >  tools/perf/util/mem-events.c | 11 +++++++----
> >  1 file changed, 7 insertions(+), 4 deletions(-)
> >
> > diff --git a/tools/perf/util/mem-events.c b/tools/perf/util/mem-events.c
> > index ed0ab838bcc5..e5e405185498 100644
> > --- a/tools/perf/util/mem-events.c
> > +++ b/tools/perf/util/mem-events.c
> > @@ -485,6 +485,7 @@ int c2c_decode_stats(struct c2c_stats *stats, struct mem_info *mi)
> >  	u64 daddr  = mi->daddr.addr;
> >  	u64 op     = data_src->mem_op;
> >  	u64 lvl    = data_src->mem_lvl;
> > +	u64 lnum   = data_src->mem_lvl_num;
> >  	u64 snoop  = data_src->mem_snoop;
> >  	u64 lock   = data_src->mem_lock;
> >  	u64 blk    = data_src->mem_blk;
> > @@ -527,16 +528,18 @@ do {				\
> >  			if (lvl & P(LVL, UNC)) stats->ld_uncache++;
> >  			if (lvl & P(LVL, IO))  stats->ld_io++;
> >  			if (lvl & P(LVL, LFB)) stats->ld_fbhit++;
> 
> Just for completion, can we also handle LFB as it seems to be being set
> in "/arch/x86/events/intel/ds.c"? (Sorry I missed this in the v4)

With fixing LFB issue pointed by German, the change looks good to me:

Reviewed-by: Leo Yan <leo.yan@linaro.org>

It would be appreciate if x86 or PowerPC maintainers could take a look
for this patch.  Thanks!

Leo

> > -			if (lvl & P(LVL, L1 )) stats->ld_l1hit++;
> > -			if (lvl & P(LVL, L2 )) stats->ld_l2hit++;
> > -			if (lvl & P(LVL, L3 )) {
> > +			if (lvl & P(LVL, L1) || lnum == P(LVLNUM, L1))
> > +				stats->ld_l1hit++;
> > +			if (lvl & P(LVL, L2) || lnum == P(LVLNUM, L2))
> > +				stats->ld_l2hit++;
> > +			if (lvl & P(LVL, L3) || lnum == P(LVLNUM, L3)) {
> >  				if (snoop & P(SNOOP, HITM))
> >  					HITM_INC(lcl_hitm);
> >  				else
> >  					stats->ld_llchit++;
> >  			}
> >  
> > -			if (lvl & P(LVL, LOC_RAM)) {
> > +			if (lvl & P(LVL, LOC_RAM) || lnum == P(LVLNUM, RAM)) {
> >  				stats->lcl_dram++;
> >  				if (snoop & P(SNOOP, HIT))
> >  					stats->ld_shared++;

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v5 5/5] perf mem: Support mem_lvl_num in c2c command
  2022-04-20  8:48     ` Leo Yan
@ 2022-04-20 16:47       ` Arnaldo Carvalho de Melo
  0 siblings, 0 replies; 28+ messages in thread
From: Arnaldo Carvalho de Melo @ 2022-04-20 16:47 UTC (permalink / raw)
  To: Leo Yan
  Cc: German Gomez, Ali Saidi, linux-kernel, linux-perf-users,
	linux-arm-kernel, benh, Nick.Forrington, alexander.shishkin,
	andrew.kilroy, james.clark, john.garry, jolsa, kjain, lihuafei1,
	mark.rutland, mathieu.poirier, mingo, namhyung, peterz, will

Em Wed, Apr 20, 2022 at 04:48:23PM +0800, Leo Yan escreveu:
> On Mon, Apr 11, 2022 at 11:04:28AM +0100, German Gomez wrote:
> > 
> > On 08/04/2022 20:53, Ali Saidi wrote:
> > > In addition to summarizing data encoded in mem_lvl also support data
> > > encoded in mem_lvl_num.
> > >
> > > Since other architectures don't seem to populate the mem_lvl_num field
> > > here there shouldn't be a change in functionality.
> > >
> > > Signed-off-by: Ali Saidi <alisaidi@amazon.com>
> > > ---
> > >  tools/perf/util/mem-events.c | 11 +++++++----
> > >  1 file changed, 7 insertions(+), 4 deletions(-)
> > >
> > > diff --git a/tools/perf/util/mem-events.c b/tools/perf/util/mem-events.c
> > > index ed0ab838bcc5..e5e405185498 100644
> > > --- a/tools/perf/util/mem-events.c
> > > +++ b/tools/perf/util/mem-events.c
> > > @@ -485,6 +485,7 @@ int c2c_decode_stats(struct c2c_stats *stats, struct mem_info *mi)
> > >  	u64 daddr  = mi->daddr.addr;
> > >  	u64 op     = data_src->mem_op;
> > >  	u64 lvl    = data_src->mem_lvl;
> > > +	u64 lnum   = data_src->mem_lvl_num;
> > >  	u64 snoop  = data_src->mem_snoop;
> > >  	u64 lock   = data_src->mem_lock;
> > >  	u64 blk    = data_src->mem_blk;
> > > @@ -527,16 +528,18 @@ do {				\
> > >  			if (lvl & P(LVL, UNC)) stats->ld_uncache++;
> > >  			if (lvl & P(LVL, IO))  stats->ld_io++;
> > >  			if (lvl & P(LVL, LFB)) stats->ld_fbhit++;
> > 
> > Just for completion, can we also handle LFB as it seems to be being set
> > in "/arch/x86/events/intel/ds.c"? (Sorry I missed this in the v4)
> 
> With fixing LFB issue pointed by German, the change looks good to me:

Waiting for a v6 then, please collect Leo's reviewed-by tag when
submitting it.

- Arnaldo
 
> Reviewed-by: Leo Yan <leo.yan@linaro.org>
> 
> It would be appreciate if x86 or PowerPC maintainers could take a look
> for this patch.  Thanks!
 

> Leo
> 
> > > -			if (lvl & P(LVL, L1 )) stats->ld_l1hit++;
> > > -			if (lvl & P(LVL, L2 )) stats->ld_l2hit++;
> > > -			if (lvl & P(LVL, L3 )) {
> > > +			if (lvl & P(LVL, L1) || lnum == P(LVLNUM, L1))
> > > +				stats->ld_l1hit++;
> > > +			if (lvl & P(LVL, L2) || lnum == P(LVLNUM, L2))
> > > +				stats->ld_l2hit++;
> > > +			if (lvl & P(LVL, L3) || lnum == P(LVLNUM, L3)) {
> > >  				if (snoop & P(SNOOP, HITM))
> > >  					HITM_INC(lcl_hitm);
> > >  				else
> > >  					stats->ld_llchit++;
> > >  			}
> > >  
> > > -			if (lvl & P(LVL, LOC_RAM)) {
> > > +			if (lvl & P(LVL, LOC_RAM) || lnum == P(LVLNUM, RAM)) {
> > >  				stats->lcl_dram++;
> > >  				if (snoop & P(SNOOP, HIT))
> > >  					stats->ld_shared++;

-- 

- Arnaldo

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v5 2/5] perf: Add SNOOP_PEER flag to perf mem data struct
  2022-04-08 19:53 ` [PATCH v5 2/5] perf: Add SNOOP_PEER flag to perf mem data struct Ali Saidi
  2022-04-20  8:20   ` Leo Yan
@ 2022-04-20 18:43   ` Liang, Kan
  2022-04-22 18:49     ` Ali Saidi
  1 sibling, 1 reply; 28+ messages in thread
From: Liang, Kan @ 2022-04-20 18:43 UTC (permalink / raw)
  To: Ali Saidi, linux-kernel, linux-perf-users, linux-arm-kernel,
	german.gomez, leo.yan, acme
  Cc: benh, Nick.Forrington, alexander.shishkin, andrew.kilroy,
	james.clark, john.garry, jolsa, kjain, lihuafei1, mark.rutland,
	mathieu.poirier, mingo, namhyung, peterz, will



On 4/8/2022 3:53 PM, Ali Saidi wrote:
> Add a flag to the perf mem data struct to signal that a request caused a
> cache-to-cache transfer of a line from a peer of the requestor and
> wasn't sourced from a lower cache level.

It sounds similar to the Forward state. Why can't the 
PERF_MEM_SNOOPX_FWD be reused?

Thanks,
Kan

> The line being moved from one
> peer cache to another has latency and performance implications. On Arm64
> Neoverse systems the data source can indicate a cache-to-cache transfer
> but not if the line is dirty or clean, so instead of overloading HITM
> define a new flag that indicates this type of transfer.
> 
> Signed-off-by: Ali Saidi <alisaidi@amazon.com>
> ---
>   include/uapi/linux/perf_event.h | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
> index 82858b697c05..c9e58c79f3e5 100644
> --- a/include/uapi/linux/perf_event.h
> +++ b/include/uapi/linux/perf_event.h
> @@ -1308,7 +1308,7 @@ union perf_mem_data_src {
>   #define PERF_MEM_SNOOP_SHIFT	19
>   
>   #define PERF_MEM_SNOOPX_FWD	0x01 /* forward */
> -/* 1 free */
> +#define PERF_MEM_SNOOPX_PEER	0x02 /* xfer from peer */
>   #define PERF_MEM_SNOOPX_SHIFT  38
>   
>   /* locked instruction */

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v5 5/5] perf mem: Support mem_lvl_num in c2c command
  2022-04-08 19:53 ` [PATCH v5 5/5] perf mem: Support mem_lvl_num in c2c command Ali Saidi
  2022-04-11 10:04   ` German Gomez
@ 2022-04-20 19:02   ` Liang, Kan
  1 sibling, 0 replies; 28+ messages in thread
From: Liang, Kan @ 2022-04-20 19:02 UTC (permalink / raw)
  To: Ali Saidi, linux-kernel, linux-perf-users, linux-arm-kernel,
	german.gomez, leo.yan, acme
  Cc: benh, Nick.Forrington, alexander.shishkin, andrew.kilroy,
	james.clark, john.garry, jolsa, kjain, lihuafei1, mark.rutland,
	mathieu.poirier, mingo, namhyung, peterz, will



On 4/8/2022 3:53 PM, Ali Saidi wrote:
> In addition to summarizing data encoded in mem_lvl also support data
> encoded in mem_lvl_num.
> 
> Since other architectures don't seem to populate the mem_lvl_num field
> here there shouldn't be a change in functionality.
> 
> Signed-off-by: Ali Saidi <alisaidi@amazon.com>
> ---
>   tools/perf/util/mem-events.c | 11 +++++++----
>   1 file changed, 7 insertions(+), 4 deletions(-)
> 
> diff --git a/tools/perf/util/mem-events.c b/tools/perf/util/mem-events.c
> index ed0ab838bcc5..e5e405185498 100644
> --- a/tools/perf/util/mem-events.c
> +++ b/tools/perf/util/mem-events.c
> @@ -485,6 +485,7 @@ int c2c_decode_stats(struct c2c_stats *stats, struct mem_info *mi)
>   	u64 daddr  = mi->daddr.addr;
>   	u64 op     = data_src->mem_op;
>   	u64 lvl    = data_src->mem_lvl;
> +	u64 lnum   = data_src->mem_lvl_num;
>   	u64 snoop  = data_src->mem_snoop;
>   	u64 lock   = data_src->mem_lock;
>   	u64 blk    = data_src->mem_blk;
> @@ -527,16 +528,18 @@ do {				\
>   			if (lvl & P(LVL, UNC)) stats->ld_uncache++;
>   			if (lvl & P(LVL, IO))  stats->ld_io++;
>   			if (lvl & P(LVL, LFB)) stats->ld_fbhit++;
> -			if (lvl & P(LVL, L1 )) stats->ld_l1hit++;
> -			if (lvl & P(LVL, L2 )) stats->ld_l2hit++;
> -			if (lvl & P(LVL, L3 )) {
> +			if (lvl & P(LVL, L1) || lnum == P(LVLNUM, L1))
> +				stats->ld_l1hit++;
> +			if (lvl & P(LVL, L2) || lnum == P(LVLNUM, L2))
> +				stats->ld_l2hit++;
> +			if (lvl & P(LVL, L3) || lnum == P(LVLNUM, L3)) {
>   				if (snoop & P(SNOOP, HITM))
>   					HITM_INC(lcl_hitm);
>   				else
>   					stats->ld_llchit++;
>   			}
>   
> -			if (lvl & P(LVL, LOC_RAM)) {
> +			if (lvl & P(LVL, LOC_RAM) || lnum == P(LVLNUM, RAM)) {

I think the PERF_MEM_LVLNUM_RAM only means it's a DRAM.
It doesn't contain the location information. To distinguish the local 
and remote dram, X86 uses PERF_MEM_REMOTE_REMOTE.
Here the remote dram will be mistakenly calculated if you only check the 
PERF_MEM_LVLNUM_RAM.

Actually, it looks like the mem_lvl_num fields supported in this patch 
are also supported by the PERF_MEM_LVL*. Why don't you set both 
PERF_MEM_LVLNUM_* and PERF_MEM_LVL* in your previous patch 4?
Then you can drop this patch.

Thanks,
Kan
>   				stats->lcl_dram++;
>   				if (snoop & P(SNOOP, HIT))
>   					stats->ld_shared++;

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v5 2/5] perf: Add SNOOP_PEER flag to perf mem data struct
  2022-04-20 18:43   ` Liang, Kan
@ 2022-04-22 18:49     ` Ali Saidi
  2022-04-22 21:08       ` Liang, Kan
  0 siblings, 1 reply; 28+ messages in thread
From: Ali Saidi @ 2022-04-22 18:49 UTC (permalink / raw)
  To: kan.liang
  Cc: Nick.Forrington, acme, alexander.shishkin, alisaidi,
	andrew.kilroy, benh, german.gomez, james.clark, john.garry,
	jolsa, kjain, leo.yan, lihuafei1, linux-arm-kernel, linux-kernel,
	linux-perf-users, mark.rutland, mathieu.poirier, mingo, namhyung,
	peterz, will

On Wed, 20 Apr 2022 18:43:28, Kan Liang wrote:
> On 4/8/2022 3:53 PM, Ali Saidi wrote:
> > Add a flag to the perf mem data struct to signal that a request caused a
> > cache-to-cache transfer of a line from a peer of the requestor and
> > wasn't sourced from a lower cache level.
> 
> It sounds similar to the Forward state. Why can't the 
> PERF_MEM_SNOOPX_FWD be reused?

Is there a definition of SNOOPX_FWD i can refer to? Happy to use this instead if
the semantics align between architectures.

Thanks,

Ali


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v5 2/5] perf: Add SNOOP_PEER flag to perf mem data struct
  2022-04-22 18:49     ` Ali Saidi
@ 2022-04-22 21:08       ` Liang, Kan
  2022-04-22 21:22         ` Ali Saidi
  0 siblings, 1 reply; 28+ messages in thread
From: Liang, Kan @ 2022-04-22 21:08 UTC (permalink / raw)
  To: Ali Saidi
  Cc: Nick.Forrington, acme, alexander.shishkin, andrew.kilroy, benh,
	german.gomez, james.clark, john.garry, jolsa, kjain, leo.yan,
	lihuafei1, linux-arm-kernel, linux-kernel, linux-perf-users,
	mark.rutland, mathieu.poirier, mingo, namhyung, peterz, will,
	Andi Kleen



On 4/22/2022 2:49 PM, Ali Saidi wrote:
> On Wed, 20 Apr 2022 18:43:28, Kan Liang wrote:
>> On 4/8/2022 3:53 PM, Ali Saidi wrote:
>>> Add a flag to the perf mem data struct to signal that a request caused a
>>> cache-to-cache transfer of a line from a peer of the requestor and
>>> wasn't sourced from a lower cache level.
>>
>> It sounds similar to the Forward state. Why can't the
>> PERF_MEM_SNOOPX_FWD be reused?
> 
> Is there a definition of SNOOPX_FWD i can refer to? Happy to use this instead if
> the semantics align between architectures.
> 

+ Andi

As my understanding, the SNOOPX_FWD means the Forward state, which is a 
non-modified (clean) cache-to-cache copy.
https://en.wikipedia.org/wiki/MESIF_protocol

Thanks,
Kan

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v5 2/5] perf: Add SNOOP_PEER flag to perf mem data struct
  2022-04-22 21:08       ` Liang, Kan
@ 2022-04-22 21:22         ` Ali Saidi
  2022-04-23  6:38           ` Leo Yan
  0 siblings, 1 reply; 28+ messages in thread
From: Ali Saidi @ 2022-04-22 21:22 UTC (permalink / raw)
  To: kan.liang
  Cc: Nick.Forrington, acme, ak, alexander.shishkin, alisaidi,
	andrew.kilroy, benh, german.gomez, james.clark, john.garry,
	jolsa, kjain, leo.yan, lihuafei1, linux-arm-kernel, linux-kernel,
	linux-perf-users, mark.rutland, mathieu.poirier, mingo, namhyung,
	peterz, will


On Fri, 22 Apr 2022 21:43:28, Kan Liang wrote:
> On 4/22/2022 2:49 PM, Ali Saidi wrote:
> > On Wed, 20 Apr 2022 18:43:28, Kan Liang wrote:
> >> On 4/8/2022 3:53 PM, Ali Saidi wrote:
> >>> Add a flag to the perf mem data struct to signal that a request caused a
> >>> cache-to-cache transfer of a line from a peer of the requestor and
> >>> wasn't sourced from a lower cache level.
> >>
> >> It sounds similar to the Forward state. Why can't the
> >> PERF_MEM_SNOOPX_FWD be reused?
> > 
> > Is there a definition of SNOOPX_FWD i can refer to? Happy to use this instead if
> > the semantics align between architectures.
> > 
> 
> + Andi
> 
> As my understanding, the SNOOPX_FWD means the Forward state, which is a 
> non-modified (clean) cache-to-cache copy.
> https://en.wikipedia.org/wiki/MESIF_protocol
  
In this case the semantics are different. We know the line was transferred from
another peer cache, but don't know if it was clean, dirty, or if the receiving core
now has exclusive ownership of it.

Thanks,

Ali

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v5 2/5] perf: Add SNOOP_PEER flag to perf mem data struct
  2022-04-22 21:22         ` Ali Saidi
@ 2022-04-23  6:38           ` Leo Yan
  2022-04-23 12:53             ` Andi Kleen
  0 siblings, 1 reply; 28+ messages in thread
From: Leo Yan @ 2022-04-23  6:38 UTC (permalink / raw)
  To: Ali Saidi
  Cc: kan.liang, Nick.Forrington, acme, ak, alexander.shishkin,
	andrew.kilroy, benh, german.gomez, james.clark, john.garry,
	jolsa, kjain, lihuafei1, linux-arm-kernel, linux-kernel,
	linux-perf-users, mark.rutland, mathieu.poirier, mingo, namhyung,
	peterz, will

On Fri, Apr 22, 2022 at 09:22:49PM +0000, Ali Saidi wrote:
> 
> On Fri, 22 Apr 2022 21:43:28, Kan Liang wrote:
> > On 4/22/2022 2:49 PM, Ali Saidi wrote:
> > > On Wed, 20 Apr 2022 18:43:28, Kan Liang wrote:
> > >> On 4/8/2022 3:53 PM, Ali Saidi wrote:
> > >>> Add a flag to the perf mem data struct to signal that a request caused a
> > >>> cache-to-cache transfer of a line from a peer of the requestor and
> > >>> wasn't sourced from a lower cache level.
> > >>
> > >> It sounds similar to the Forward state. Why can't the
> > >> PERF_MEM_SNOOPX_FWD be reused?
> > > 
> > > Is there a definition of SNOOPX_FWD i can refer to? Happy to use this instead if
> > > the semantics align between architectures.
> > > 
> > 
> > + Andi
> > 
> > As my understanding, the SNOOPX_FWD means the Forward state, which is a 
> > non-modified (clean) cache-to-cache copy.
> > https://en.wikipedia.org/wiki/MESIF_protocol
>   
> In this case the semantics are different. We know the line was transferred from
> another peer cache, but don't know if it was clean, dirty, or if the receiving core
> now has exclusive ownership of it.

In the spec "Intel 64 and IA-32 Architectures Software Developer's Manual,
Volume 3B: System Programming Guide, Part 2", section "18.8.1.3 Off-core
Response Performance Monitoring in the Processor Core", it defines the
REMOTE_CACHE_FWD as:

"L3 Miss: local homed requests that missed the L3 cache and was serviced
by forwarded data following a cross package snoop where no modified copies
found. (Remote home requests are not counted)".

Except SNOOPX_FWD means a no modified cache snooping, it also means it's
a cache conherency from *remote* socket.  This is quite different from we
define SNOOPX_PEER, which only snoop from peer CPU or clusters.

If no objection, I prefer we could keep the new snoop type SNOOPX_PEER,
this would be easier for us to distinguish the semantics and support the
statistics for SNOOPX_FWD and SNOOPX_PEER separately.

I overlooked the flag SNOOPX_FWD, thanks a lot for Kan's reminding.

Thanks,
Leo

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v5 2/5] perf: Add SNOOP_PEER flag to perf mem data struct
  2022-04-23  6:38           ` Leo Yan
@ 2022-04-23 12:53             ` Andi Kleen
  2022-04-24 11:43               ` Leo Yan
  0 siblings, 1 reply; 28+ messages in thread
From: Andi Kleen @ 2022-04-23 12:53 UTC (permalink / raw)
  To: Leo Yan, Ali Saidi
  Cc: kan.liang, Nick.Forrington, acme, alexander.shishkin,
	andrew.kilroy, benh, german.gomez, james.clark, john.garry,
	jolsa, kjain, lihuafei1, linux-arm-kernel, linux-kernel,
	linux-perf-users, mark.rutland, mathieu.poirier, mingo, namhyung,
	peterz, will


> Except SNOOPX_FWD means a no modified cache snooping, it also means it's
> a cache conherency from *remote* socket.  This is quite different from we
> define SNOOPX_PEER, which only snoop from peer CPU or clusters.
>
> If no objection, I prefer we could keep the new snoop type SNOOPX_PEER,
> this would be easier for us to distinguish the semantics and support the
> statistics for SNOOPX_FWD and SNOOPX_PEER separately.
>
> I overlooked the flag SNOOPX_FWD, thanks a lot for Kan's reminding.

Yes seems better to keep using a separate flag if they don't exactly match.

It's not that we're short on flags anyways.

-Andi


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v5 2/5] perf: Add SNOOP_PEER flag to perf mem data struct
  2022-04-23 12:53             ` Andi Kleen
@ 2022-04-24 11:43               ` Leo Yan
  2022-04-25 17:01                 ` Liang, Kan
  0 siblings, 1 reply; 28+ messages in thread
From: Leo Yan @ 2022-04-24 11:43 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Ali Saidi, kan.liang, Nick.Forrington, acme, alexander.shishkin,
	andrew.kilroy, benh, german.gomez, james.clark, john.garry,
	jolsa, kjain, lihuafei1, linux-arm-kernel, linux-kernel,
	linux-perf-users, mark.rutland, mathieu.poirier, mingo, namhyung,
	peterz, will

On Sat, Apr 23, 2022 at 05:53:28AM -0700, Andi Kleen wrote:
> 
> > Except SNOOPX_FWD means a no modified cache snooping, it also means it's
> > a cache conherency from *remote* socket.  This is quite different from we
> > define SNOOPX_PEER, which only snoop from peer CPU or clusters.
> > 
> > If no objection, I prefer we could keep the new snoop type SNOOPX_PEER,
> > this would be easier for us to distinguish the semantics and support the
> > statistics for SNOOPX_FWD and SNOOPX_PEER separately.
> > 
> > I overlooked the flag SNOOPX_FWD, thanks a lot for Kan's reminding.
> 
> Yes seems better to keep using a separate flag if they don't exactly match.
> 
> It's not that we're short on flags anyways.

Thanks for confirmation.

Leo

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v5 2/5] perf: Add SNOOP_PEER flag to perf mem data struct
  2022-04-24 11:43               ` Leo Yan
@ 2022-04-25 17:01                 ` Liang, Kan
  2022-04-27 16:19                   ` Leo Yan
  0 siblings, 1 reply; 28+ messages in thread
From: Liang, Kan @ 2022-04-25 17:01 UTC (permalink / raw)
  To: Leo Yan, Andi Kleen
  Cc: Ali Saidi, Nick.Forrington, acme, alexander.shishkin,
	andrew.kilroy, benh, german.gomez, james.clark, john.garry,
	jolsa, kjain, lihuafei1, linux-arm-kernel, linux-kernel,
	linux-perf-users, mark.rutland, mathieu.poirier, mingo, namhyung,
	peterz, will



On 4/24/2022 7:43 AM, Leo Yan wrote:
> On Sat, Apr 23, 2022 at 05:53:28AM -0700, Andi Kleen wrote:
>>
>>> Except SNOOPX_FWD means a no modified cache snooping, it also means it's
>>> a cache conherency from *remote* socket.  This is quite different from we
>>> define SNOOPX_PEER, which only snoop from peer CPU or clusters.
>>>

The FWD doesn't have to be *remote*. The definition you quoted is just 
for the "L3 Miss", which is indeed a remote forward. But we still have 
cross-core FWD. See Table 19-101.

Actually, X86 uses the PERF_MEM_REMOTE_REMOTE + PERF_MEM_SNOOPX_FWD to 
indicate the remote FWD, not just SNOOPX_FWD.

>>> If no objection, I prefer we could keep the new snoop type SNOOPX_PEER,
>>> this would be easier for us to distinguish the semantics and support the
>>> statistics for SNOOPX_FWD and SNOOPX_PEER separately.
>>>
>>> I overlooked the flag SNOOPX_FWD, thanks a lot for Kan's reminding.
>>
>> Yes seems better to keep using a separate flag if they don't exactly match.
>>

Yes, I agree with Andi. If you still think the existing flag combination 
doesn't match your requirement, a new separate flag should be 
introduced. I'm not familiar with ARM. I think I will leave it to you 
and the maintainer to decide.

Thanks,
Kan

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v5 2/5] perf: Add SNOOP_PEER flag to perf mem data struct
  2022-04-25 17:01                 ` Liang, Kan
@ 2022-04-27 16:19                   ` Leo Yan
  2022-04-27 19:29                     ` Liang, Kan
  0 siblings, 1 reply; 28+ messages in thread
From: Leo Yan @ 2022-04-27 16:19 UTC (permalink / raw)
  To: Liang, Kan
  Cc: Andi Kleen, Ali Saidi, Nick.Forrington, acme, alexander.shishkin,
	andrew.kilroy, benh, german.gomez, james.clark, john.garry,
	jolsa, kjain, lihuafei1, linux-arm-kernel, linux-kernel,
	linux-perf-users, mark.rutland, mathieu.poirier, mingo, namhyung,
	peterz, will

Hi Kan,

On Mon, Apr 25, 2022 at 01:01:40PM -0400, Liang, Kan wrote:
> 
> 
> On 4/24/2022 7:43 AM, Leo Yan wrote:
> > On Sat, Apr 23, 2022 at 05:53:28AM -0700, Andi Kleen wrote:
> > > 
> > > > Except SNOOPX_FWD means a no modified cache snooping, it also means it's
> > > > a cache conherency from *remote* socket.  This is quite different from we
> > > > define SNOOPX_PEER, which only snoop from peer CPU or clusters.
> > > > 
> 
> The FWD doesn't have to be *remote*. The definition you quoted is just for
> the "L3 Miss", which is indeed a remote forward. But we still have
> cross-core FWD. See Table 19-101.
> 
> Actually, X86 uses the PERF_MEM_REMOTE_REMOTE + PERF_MEM_SNOOPX_FWD to
> indicate the remote FWD, not just SNOOPX_FWD.

Thanks a lot for the info.

> > > > If no objection, I prefer we could keep the new snoop type SNOOPX_PEER,
> > > > this would be easier for us to distinguish the semantics and support the
> > > > statistics for SNOOPX_FWD and SNOOPX_PEER separately.
> > > > 
> > > > I overlooked the flag SNOOPX_FWD, thanks a lot for Kan's reminding.
> > > 
> > > Yes seems better to keep using a separate flag if they don't exactly match.
> > > 
> 
> Yes, I agree with Andi. If you still think the existing flag combination
> doesn't match your requirement, a new separate flag should be introduced.
> I'm not familiar with ARM. I think I will leave it to you and the maintainer
> to decide.

It's a bit difficult for me to make decision is because now SNOOPX_FWD
is not used in the file util/mem-events.c, so I am not very sure if
SNOOPX_FWD has the consistent usage across different arches.

On the other hand, I sent a patch for 'peer' flag statistics [1], you
could review it and it only stats for L2 and L3 cache level for local
node.

The main purpose for my sending this email is if you think the FWD can
be the consistent for both arches, and even the new added display mode
is also useful for x86 arch (we can rename it as 'fwd' display mode),
then I am very glad to unify the flag.

Thanks,
Leo

[1] https://lore.kernel.org/lkml/20220427155013.1833222-5-leo.yan@linaro.org/

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v5 2/5] perf: Add SNOOP_PEER flag to perf mem data struct
  2022-04-27 16:19                   ` Leo Yan
@ 2022-04-27 19:29                     ` Liang, Kan
  2022-04-29  9:28                       ` Leo Yan
  0 siblings, 1 reply; 28+ messages in thread
From: Liang, Kan @ 2022-04-27 19:29 UTC (permalink / raw)
  To: Leo Yan
  Cc: Andi Kleen, Ali Saidi, Nick.Forrington, acme, alexander.shishkin,
	andrew.kilroy, benh, german.gomez, james.clark, john.garry,
	jolsa, kjain, lihuafei1, linux-arm-kernel, linux-kernel,
	linux-perf-users, mark.rutland, mathieu.poirier, mingo, namhyung,
	peterz, will



On 4/27/2022 12:19 PM, Leo Yan wrote:
> Hi Kan,
> 
> On Mon, Apr 25, 2022 at 01:01:40PM -0400, Liang, Kan wrote:
>>
>>
>> On 4/24/2022 7:43 AM, Leo Yan wrote:
>>> On Sat, Apr 23, 2022 at 05:53:28AM -0700, Andi Kleen wrote:
>>>>
>>>>> Except SNOOPX_FWD means a no modified cache snooping, it also means it's
>>>>> a cache conherency from *remote* socket.  This is quite different from we
>>>>> define SNOOPX_PEER, which only snoop from peer CPU or clusters.
>>>>>
>>
>> The FWD doesn't have to be *remote*. The definition you quoted is just for
>> the "L3 Miss", which is indeed a remote forward. But we still have
>> cross-core FWD. See Table 19-101.
>>
>> Actually, X86 uses the PERF_MEM_REMOTE_REMOTE + PERF_MEM_SNOOPX_FWD to
>> indicate the remote FWD, not just SNOOPX_FWD.
> 
> Thanks a lot for the info.
> 
>>>>> If no objection, I prefer we could keep the new snoop type SNOOPX_PEER,
>>>>> this would be easier for us to distinguish the semantics and support the
>>>>> statistics for SNOOPX_FWD and SNOOPX_PEER separately.
>>>>>
>>>>> I overlooked the flag SNOOPX_FWD, thanks a lot for Kan's reminding.
>>>>
>>>> Yes seems better to keep using a separate flag if they don't exactly match.
>>>>
>>
>> Yes, I agree with Andi. If you still think the existing flag combination
>> doesn't match your requirement, a new separate flag should be introduced.
>> I'm not familiar with ARM. I think I will leave it to you and the maintainer
>> to decide.
> 
> It's a bit difficult for me to make decision is because now SNOOPX_FWD
> is not used in the file util/mem-events.c, so I am not very sure if
> SNOOPX_FWD has the consistent usage across different arches.

No, it's used in the file util/mem-events.c
See perf_mem__snp_scnprintf().

> 
> On the other hand, I sent a patch for 'peer' flag statistics [1], you
> could review it and it only stats for L2 and L3 cache level for local
> node.

If it's for the local node, why don't you use the hop level which is 
introduced recently by Power? The below seems a good fit.

PERF_MEM_LVLNUM_ANY_CACHE | PERF_MEM_HOPS_0?

/* hop level */
#define PERF_MEM_HOPS_0		0x01 /* remote core, same node */
#define PERF_MEM_HOPS_1		0x02 /* remote node, same socket */
#define PERF_MEM_HOPS_2		0x03 /* remote socket, same board */
#define PERF_MEM_HOPS_3		0x04 /* remote board */
/* 5-7 available */
#define PERF_MEM_HOPS_SHIFT	43

Thanks,
Kan

> 
> The main purpose for my sending this email is if you think the FWD can
> be the consistent for both arches, and even the new added display mode
> is also useful for x86 arch (we can rename it as 'fwd' display mode),
> then I am very glad to unify the flag.
> 
> Thanks,
> Leo
> 
> [1] https://lore.kernel.org/lkml/20220427155013.1833222-5-leo.yan@linaro.org/




^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v5 2/5] perf: Add SNOOP_PEER flag to perf mem data struct
  2022-04-27 19:29                     ` Liang, Kan
@ 2022-04-29  9:28                       ` Leo Yan
  0 siblings, 0 replies; 28+ messages in thread
From: Leo Yan @ 2022-04-29  9:28 UTC (permalink / raw)
  To: Liang, Kan
  Cc: Andi Kleen, Ali Saidi, Nick.Forrington, acme, alexander.shishkin,
	andrew.kilroy, benh, german.gomez, james.clark, john.garry,
	jolsa, kjain, lihuafei1, linux-arm-kernel, linux-kernel,
	linux-perf-users, mark.rutland, mathieu.poirier, mingo, namhyung,
	peterz, will

On Wed, Apr 27, 2022 at 03:29:31PM -0400, Liang, Kan wrote:

[...]

> > It's a bit difficult for me to make decision is because now SNOOPX_FWD
> > is not used in the file util/mem-events.c, so I am not very sure if
> > SNOOPX_FWD has the consistent usage across different arches.
> 
> No, it's used in the file util/mem-events.c
> See perf_mem__snp_scnprintf().

Right.  Actually I mean FWD flag is not for statistics.

> > On the other hand, I sent a patch for 'peer' flag statistics [1], you
> > could review it and it only stats for L2 and L3 cache level for local
> > node.
> 
> If it's for the local node, why don't you use the hop level which is
> introduced recently by Power? The below seems a good fit.
> 
> PERF_MEM_LVLNUM_ANY_CACHE | PERF_MEM_HOPS_0?
> 
> /* hop level */
> #define PERF_MEM_HOPS_0		0x01 /* remote core, same node */
> #define PERF_MEM_HOPS_1		0x02 /* remote node, same socket */
> #define PERF_MEM_HOPS_2		0x03 /* remote socket, same board */
> #define PERF_MEM_HOPS_3		0x04 /* remote board */
> /* 5-7 available */
> #define PERF_MEM_HOPS_SHIFT	43

Thanks for reminding.  I have considered HOPS flags during the
discussion with Ali for the flags, you could see PERF_MEM_HOPS_0 is for
"remote core, same node", this could introduce confusion for Arm
Neoverse CPUs.  Another thinking is how we consume the flags in perf c2c
tool, perf c2c tool uses snoop flag to find out the costly cache
conherency operations, if we use PERF_MEM_HOPS_0 that means we need to
extend perf c2c tool to support two kinds of flags: snoop flag and hop
flag, so it would introduce complexity for perf c2c code.

If we step back to review current flags, we can see different arches
have different memory model (and implementations), it is a bit painful
when we try to unify the flags.  So at current stage, I prefer to use
PEER flag for Arm arches, essentially it's not too bad that now we
introduce one bit, and later we may consider to consolidate memory
flags in more general way.

Thanks,
Leo

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2022-04-29  9:29 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-04-08 19:53 [PATCH v4 0/4] perf: arm-spe: Decode SPE source and use for perf c2c Ali Saidi
2022-04-08 19:53 ` [PATCH v5 1/5] tools: arm64: Import cputype.h Ali Saidi
2022-04-18 14:57   ` Leo Yan
2022-04-08 19:53 ` [PATCH v5 2/5] perf: Add SNOOP_PEER flag to perf mem data struct Ali Saidi
2022-04-20  8:20   ` Leo Yan
2022-04-20 18:43   ` Liang, Kan
2022-04-22 18:49     ` Ali Saidi
2022-04-22 21:08       ` Liang, Kan
2022-04-22 21:22         ` Ali Saidi
2022-04-23  6:38           ` Leo Yan
2022-04-23 12:53             ` Andi Kleen
2022-04-24 11:43               ` Leo Yan
2022-04-25 17:01                 ` Liang, Kan
2022-04-27 16:19                   ` Leo Yan
2022-04-27 19:29                     ` Liang, Kan
2022-04-29  9:28                       ` Leo Yan
2022-04-08 19:53 ` [PATCH v5 3/5] perf tools: sync addition of PERF_MEM_SNOOPX_PEER Ali Saidi
2022-04-11 10:26   ` German Gomez
2022-04-11 14:35     ` German Gomez
2022-04-20  8:23       ` Leo Yan
2022-04-08 19:53 ` [PATCH v5 4/5] perf arm-spe: Use SPE data source for neoverse cores Ali Saidi
2022-04-20  8:30   ` Leo Yan
2022-04-20  8:42   ` Leo Yan
2022-04-08 19:53 ` [PATCH v5 5/5] perf mem: Support mem_lvl_num in c2c command Ali Saidi
2022-04-11 10:04   ` German Gomez
2022-04-20  8:48     ` Leo Yan
2022-04-20 16:47       ` Arnaldo Carvalho de Melo
2022-04-20 19:02   ` Liang, Kan

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).