* [PATCH v4 0/4] perf: arm-spe: Decode SPE source and use for perf c2c
@ 2022-03-24 18:33 ` Ali Saidi
  0 siblings, 0 replies; 66+ messages in thread
From: Ali Saidi @ 2022-03-24 18:33 UTC (permalink / raw)
  To: linux-kernel, linux-perf-users, linux-arm-kernel, german.gomez,
	leo.yan, acme
  Cc: alisaidi, benh, Nick.Forrington, alexander.shishkin,
	andrew.kilroy, james.clark, john.garry, jolsa, kjain, lihuafei1,
	mark.rutland, mathieu.poirier, mingo, namhyung, peterz, will

When synthesizing data from SPE, augment the type with source information
for Arm Neoverse cores so we can detect situations like cache line
contention and transfers on Arm platforms.

This change enables the expected behavior of perf c2c on a system with
SPE: cache lines that are shared among multiple cores show up in the
perf c2c output.

These changes switch to using mem_lvl_num to encode the level information
instead of mem_lvl, which is being deprecated; however, I haven't found
other users of mem_lvl_num.

Changes in v4:
  * Bring in the kernel's arch/arm64/include/asm/cputype.h into tools/
  * Add neoverse-v1 to the neoverse cores list

Ali Saidi (4):
  tools: arm64: Import cputype.h
  perf arm-spe: Use SPE data source for neoverse cores
  perf mem: Support mem_lvl_num in c2c command
  perf mem: Support HITM for when mem_lvl_num is any

 tools/arch/arm64/include/asm/cputype.h        | 258 ++++++++++++++++++
 .../util/arm-spe-decoder/arm-spe-decoder.c    |   1 +
 .../util/arm-spe-decoder/arm-spe-decoder.h    |  12 +
 tools/perf/util/arm-spe.c                     | 110 +++++++-
 tools/perf/util/mem-events.c                  |  20 +-
 5 files changed, 383 insertions(+), 18 deletions(-)
 create mode 100644 tools/arch/arm64/include/asm/cputype.h

-- 
2.32.0



_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel


* [PATCH v4 1/4] tools: arm64: Import cputype.h
  2022-03-24 18:33 ` Ali Saidi
@ 2022-03-24 18:33   ` Ali Saidi
  -1 siblings, 0 replies; 66+ messages in thread
From: Ali Saidi @ 2022-03-24 18:33 UTC (permalink / raw)
  To: linux-kernel, linux-perf-users, linux-arm-kernel, german.gomez,
	leo.yan, acme
  Cc: alisaidi, benh, Nick.Forrington, alexander.shishkin,
	andrew.kilroy, james.clark, john.garry, jolsa, kjain, lihuafei1,
	mark.rutland, mathieu.poirier, mingo, namhyung, peterz, will

Bring in the kernel's arch/arm64/include/asm/cputype.h into tools/
for arm64 to make use of all the core-type definitions in perf.

Replace sysreg.h with the version already imported into tools/.

Signed-off-by: Ali Saidi <alisaidi@amazon.com>
---
 tools/arch/arm64/include/asm/cputype.h | 258 +++++++++++++++++++++++++
 1 file changed, 258 insertions(+)
 create mode 100644 tools/arch/arm64/include/asm/cputype.h

diff --git a/tools/arch/arm64/include/asm/cputype.h b/tools/arch/arm64/include/asm/cputype.h
new file mode 100644
index 000000000000..9afcc6467a09
--- /dev/null
+++ b/tools/arch/arm64/include/asm/cputype.h
@@ -0,0 +1,258 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (C) 2012 ARM Ltd.
+ */
+#ifndef __ASM_CPUTYPE_H
+#define __ASM_CPUTYPE_H
+
+#define INVALID_HWID		ULONG_MAX
+
+#define MPIDR_UP_BITMASK	(0x1 << 30)
+#define MPIDR_MT_BITMASK	(0x1 << 24)
+#define MPIDR_HWID_BITMASK	UL(0xff00ffffff)
+
+#define MPIDR_LEVEL_BITS_SHIFT	3
+#define MPIDR_LEVEL_BITS	(1 << MPIDR_LEVEL_BITS_SHIFT)
+#define MPIDR_LEVEL_MASK	((1 << MPIDR_LEVEL_BITS) - 1)
+
+#define MPIDR_LEVEL_SHIFT(level) \
+	(((1 << level) >> 1) << MPIDR_LEVEL_BITS_SHIFT)
+
+#define MPIDR_AFFINITY_LEVEL(mpidr, level) \
+	((mpidr >> MPIDR_LEVEL_SHIFT(level)) & MPIDR_LEVEL_MASK)
+
+#define MIDR_REVISION_MASK	0xf
+#define MIDR_REVISION(midr)	((midr) & MIDR_REVISION_MASK)
+#define MIDR_PARTNUM_SHIFT	4
+#define MIDR_PARTNUM_MASK	(0xfff << MIDR_PARTNUM_SHIFT)
+#define MIDR_PARTNUM(midr)	\
+	(((midr) & MIDR_PARTNUM_MASK) >> MIDR_PARTNUM_SHIFT)
+#define MIDR_ARCHITECTURE_SHIFT	16
+#define MIDR_ARCHITECTURE_MASK	(0xf << MIDR_ARCHITECTURE_SHIFT)
+#define MIDR_ARCHITECTURE(midr)	\
+	(((midr) & MIDR_ARCHITECTURE_MASK) >> MIDR_ARCHITECTURE_SHIFT)
+#define MIDR_VARIANT_SHIFT	20
+#define MIDR_VARIANT_MASK	(0xf << MIDR_VARIANT_SHIFT)
+#define MIDR_VARIANT(midr)	\
+	(((midr) & MIDR_VARIANT_MASK) >> MIDR_VARIANT_SHIFT)
+#define MIDR_IMPLEMENTOR_SHIFT	24
+#define MIDR_IMPLEMENTOR_MASK	(0xff << MIDR_IMPLEMENTOR_SHIFT)
+#define MIDR_IMPLEMENTOR(midr)	\
+	(((midr) & MIDR_IMPLEMENTOR_MASK) >> MIDR_IMPLEMENTOR_SHIFT)
+
+#define MIDR_CPU_MODEL(imp, partnum) \
+	(((imp)			<< MIDR_IMPLEMENTOR_SHIFT) | \
+	(0xf			<< MIDR_ARCHITECTURE_SHIFT) | \
+	((partnum)		<< MIDR_PARTNUM_SHIFT))
+
+#define MIDR_CPU_VAR_REV(var, rev) \
+	(((var)	<< MIDR_VARIANT_SHIFT) | (rev))
+
+#define MIDR_CPU_MODEL_MASK (MIDR_IMPLEMENTOR_MASK | MIDR_PARTNUM_MASK | \
+			     MIDR_ARCHITECTURE_MASK)
+
+#define ARM_CPU_IMP_ARM			0x41
+#define ARM_CPU_IMP_APM			0x50
+#define ARM_CPU_IMP_CAVIUM		0x43
+#define ARM_CPU_IMP_BRCM		0x42
+#define ARM_CPU_IMP_QCOM		0x51
+#define ARM_CPU_IMP_NVIDIA		0x4E
+#define ARM_CPU_IMP_FUJITSU		0x46
+#define ARM_CPU_IMP_HISI		0x48
+#define ARM_CPU_IMP_APPLE		0x61
+
+#define ARM_CPU_PART_AEM_V8		0xD0F
+#define ARM_CPU_PART_FOUNDATION		0xD00
+#define ARM_CPU_PART_CORTEX_A57		0xD07
+#define ARM_CPU_PART_CORTEX_A72		0xD08
+#define ARM_CPU_PART_CORTEX_A53		0xD03
+#define ARM_CPU_PART_CORTEX_A73		0xD09
+#define ARM_CPU_PART_CORTEX_A75		0xD0A
+#define ARM_CPU_PART_CORTEX_A35		0xD04
+#define ARM_CPU_PART_CORTEX_A55		0xD05
+#define ARM_CPU_PART_CORTEX_A76		0xD0B
+#define ARM_CPU_PART_NEOVERSE_N1	0xD0C
+#define ARM_CPU_PART_CORTEX_A77		0xD0D
+#define ARM_CPU_PART_NEOVERSE_V1	0xD40
+#define ARM_CPU_PART_CORTEX_A78		0xD41
+#define ARM_CPU_PART_CORTEX_X1		0xD44
+#define ARM_CPU_PART_CORTEX_A510	0xD46
+#define ARM_CPU_PART_CORTEX_A710	0xD47
+#define ARM_CPU_PART_CORTEX_X2		0xD48
+#define ARM_CPU_PART_NEOVERSE_N2	0xD49
+#define ARM_CPU_PART_CORTEX_A78C	0xD4B
+
+#define APM_CPU_PART_POTENZA		0x000
+
+#define CAVIUM_CPU_PART_THUNDERX	0x0A1
+#define CAVIUM_CPU_PART_THUNDERX_81XX	0x0A2
+#define CAVIUM_CPU_PART_THUNDERX_83XX	0x0A3
+#define CAVIUM_CPU_PART_THUNDERX2	0x0AF
+/* OcteonTx2 series */
+#define CAVIUM_CPU_PART_OCTX2_98XX	0x0B1
+#define CAVIUM_CPU_PART_OCTX2_96XX	0x0B2
+#define CAVIUM_CPU_PART_OCTX2_95XX	0x0B3
+#define CAVIUM_CPU_PART_OCTX2_95XXN	0x0B4
+#define CAVIUM_CPU_PART_OCTX2_95XXMM	0x0B5
+#define CAVIUM_CPU_PART_OCTX2_95XXO	0x0B6
+
+#define BRCM_CPU_PART_BRAHMA_B53	0x100
+#define BRCM_CPU_PART_VULCAN		0x516
+
+#define QCOM_CPU_PART_FALKOR_V1		0x800
+#define QCOM_CPU_PART_FALKOR		0xC00
+#define QCOM_CPU_PART_KRYO		0x200
+#define QCOM_CPU_PART_KRYO_2XX_GOLD	0x800
+#define QCOM_CPU_PART_KRYO_2XX_SILVER	0x801
+#define QCOM_CPU_PART_KRYO_3XX_SILVER	0x803
+#define QCOM_CPU_PART_KRYO_4XX_GOLD	0x804
+#define QCOM_CPU_PART_KRYO_4XX_SILVER	0x805
+
+#define NVIDIA_CPU_PART_DENVER		0x003
+#define NVIDIA_CPU_PART_CARMEL		0x004
+
+#define FUJITSU_CPU_PART_A64FX		0x001
+
+#define HISI_CPU_PART_TSV110		0xD01
+
+#define APPLE_CPU_PART_M1_ICESTORM	0x022
+#define APPLE_CPU_PART_M1_FIRESTORM	0x023
+
+#define MIDR_CORTEX_A53 MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_A53)
+#define MIDR_CORTEX_A57 MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_A57)
+#define MIDR_CORTEX_A72 MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_A72)
+#define MIDR_CORTEX_A73 MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_A73)
+#define MIDR_CORTEX_A75 MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_A75)
+#define MIDR_CORTEX_A35 MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_A35)
+#define MIDR_CORTEX_A55 MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_A55)
+#define MIDR_CORTEX_A76	MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_A76)
+#define MIDR_NEOVERSE_N1 MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_NEOVERSE_N1)
+#define MIDR_CORTEX_A77	MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_A77)
+#define MIDR_NEOVERSE_V1	MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_NEOVERSE_V1)
+#define MIDR_CORTEX_A78	MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_A78)
+#define MIDR_CORTEX_X1	MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_X1)
+#define MIDR_CORTEX_A510 MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_A510)
+#define MIDR_CORTEX_A710 MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_A710)
+#define MIDR_CORTEX_X2 MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_X2)
+#define MIDR_NEOVERSE_N2 MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_NEOVERSE_N2)
+#define MIDR_CORTEX_A78C	MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_A78C)
+#define MIDR_THUNDERX	MIDR_CPU_MODEL(ARM_CPU_IMP_CAVIUM, CAVIUM_CPU_PART_THUNDERX)
+#define MIDR_THUNDERX_81XX MIDR_CPU_MODEL(ARM_CPU_IMP_CAVIUM, CAVIUM_CPU_PART_THUNDERX_81XX)
+#define MIDR_THUNDERX_83XX MIDR_CPU_MODEL(ARM_CPU_IMP_CAVIUM, CAVIUM_CPU_PART_THUNDERX_83XX)
+#define MIDR_OCTX2_98XX MIDR_CPU_MODEL(ARM_CPU_IMP_CAVIUM, CAVIUM_CPU_PART_OCTX2_98XX)
+#define MIDR_OCTX2_96XX MIDR_CPU_MODEL(ARM_CPU_IMP_CAVIUM, CAVIUM_CPU_PART_OCTX2_96XX)
+#define MIDR_OCTX2_95XX MIDR_CPU_MODEL(ARM_CPU_IMP_CAVIUM, CAVIUM_CPU_PART_OCTX2_95XX)
+#define MIDR_OCTX2_95XXN MIDR_CPU_MODEL(ARM_CPU_IMP_CAVIUM, CAVIUM_CPU_PART_OCTX2_95XXN)
+#define MIDR_OCTX2_95XXMM MIDR_CPU_MODEL(ARM_CPU_IMP_CAVIUM, CAVIUM_CPU_PART_OCTX2_95XXMM)
+#define MIDR_OCTX2_95XXO MIDR_CPU_MODEL(ARM_CPU_IMP_CAVIUM, CAVIUM_CPU_PART_OCTX2_95XXO)
+#define MIDR_CAVIUM_THUNDERX2 MIDR_CPU_MODEL(ARM_CPU_IMP_CAVIUM, CAVIUM_CPU_PART_THUNDERX2)
+#define MIDR_BRAHMA_B53 MIDR_CPU_MODEL(ARM_CPU_IMP_BRCM, BRCM_CPU_PART_BRAHMA_B53)
+#define MIDR_BRCM_VULCAN MIDR_CPU_MODEL(ARM_CPU_IMP_BRCM, BRCM_CPU_PART_VULCAN)
+#define MIDR_QCOM_FALKOR_V1 MIDR_CPU_MODEL(ARM_CPU_IMP_QCOM, QCOM_CPU_PART_FALKOR_V1)
+#define MIDR_QCOM_FALKOR MIDR_CPU_MODEL(ARM_CPU_IMP_QCOM, QCOM_CPU_PART_FALKOR)
+#define MIDR_QCOM_KRYO MIDR_CPU_MODEL(ARM_CPU_IMP_QCOM, QCOM_CPU_PART_KRYO)
+#define MIDR_QCOM_KRYO_2XX_GOLD MIDR_CPU_MODEL(ARM_CPU_IMP_QCOM, QCOM_CPU_PART_KRYO_2XX_GOLD)
+#define MIDR_QCOM_KRYO_2XX_SILVER MIDR_CPU_MODEL(ARM_CPU_IMP_QCOM, QCOM_CPU_PART_KRYO_2XX_SILVER)
+#define MIDR_QCOM_KRYO_3XX_SILVER MIDR_CPU_MODEL(ARM_CPU_IMP_QCOM, QCOM_CPU_PART_KRYO_3XX_SILVER)
+#define MIDR_QCOM_KRYO_4XX_GOLD MIDR_CPU_MODEL(ARM_CPU_IMP_QCOM, QCOM_CPU_PART_KRYO_4XX_GOLD)
+#define MIDR_QCOM_KRYO_4XX_SILVER MIDR_CPU_MODEL(ARM_CPU_IMP_QCOM, QCOM_CPU_PART_KRYO_4XX_SILVER)
+#define MIDR_NVIDIA_DENVER MIDR_CPU_MODEL(ARM_CPU_IMP_NVIDIA, NVIDIA_CPU_PART_DENVER)
+#define MIDR_NVIDIA_CARMEL MIDR_CPU_MODEL(ARM_CPU_IMP_NVIDIA, NVIDIA_CPU_PART_CARMEL)
+#define MIDR_FUJITSU_A64FX MIDR_CPU_MODEL(ARM_CPU_IMP_FUJITSU, FUJITSU_CPU_PART_A64FX)
+#define MIDR_HISI_TSV110 MIDR_CPU_MODEL(ARM_CPU_IMP_HISI, HISI_CPU_PART_TSV110)
+#define MIDR_APPLE_M1_ICESTORM MIDR_CPU_MODEL(ARM_CPU_IMP_APPLE, APPLE_CPU_PART_M1_ICESTORM)
+#define MIDR_APPLE_M1_FIRESTORM MIDR_CPU_MODEL(ARM_CPU_IMP_APPLE, APPLE_CPU_PART_M1_FIRESTORM)
+
+/* Fujitsu Erratum 010001 affects A64FX 1.0 and 1.1 (v0r0 and v1r0) */
+#define MIDR_FUJITSU_ERRATUM_010001		MIDR_FUJITSU_A64FX
+#define MIDR_FUJITSU_ERRATUM_010001_MASK	(~MIDR_CPU_VAR_REV(1, 0))
+#define TCR_CLEAR_FUJITSU_ERRATUM_010001	(TCR_NFD1 | TCR_NFD0)
+
+#ifndef __ASSEMBLY__
+
+#include "sysreg.h"
+
+#define read_cpuid(reg)			read_sysreg_s(SYS_ ## reg)
+
+/*
+ * Represent a range of MIDR values for a given CPU model and a
+ * range of variant/revision values.
+ *
+ * @model	- CPU model as defined by MIDR_CPU_MODEL
+ * @rv_min	- Minimum value for the revision/variant as defined by
+ *		  MIDR_CPU_VAR_REV
+ * @rv_max	- Maximum value for the variant/revision for the range.
+ */
+struct midr_range {
+	u32 model;
+	u32 rv_min;
+	u32 rv_max;
+};
+
+#define MIDR_RANGE(m, v_min, r_min, v_max, r_max)		\
+	{							\
+		.model = m,					\
+		.rv_min = MIDR_CPU_VAR_REV(v_min, r_min),	\
+		.rv_max = MIDR_CPU_VAR_REV(v_max, r_max),	\
+	}
+
+#define MIDR_REV_RANGE(m, v, r_min, r_max) MIDR_RANGE(m, v, r_min, v, r_max)
+#define MIDR_REV(m, v, r) MIDR_RANGE(m, v, r, v, r)
+#define MIDR_ALL_VERSIONS(m) MIDR_RANGE(m, 0, 0, 0xf, 0xf)
+
+static inline bool midr_is_cpu_model_range(u32 midr, u32 model, u32 rv_min,
+					   u32 rv_max)
+{
+	u32 _model = midr & MIDR_CPU_MODEL_MASK;
+	u32 rv = midr & (MIDR_REVISION_MASK | MIDR_VARIANT_MASK);
+
+	return _model == model && rv >= rv_min && rv <= rv_max;
+}
+
+static inline bool is_midr_in_range(u32 midr, struct midr_range const *range)
+{
+	return midr_is_cpu_model_range(midr, range->model,
+				       range->rv_min, range->rv_max);
+}
+
+static inline bool
+is_midr_in_range_list(u32 midr, struct midr_range const *ranges)
+{
+	while (ranges->model)
+		if (is_midr_in_range(midr, ranges++))
+			return true;
+	return false;
+}
+
+/*
+ * The CPU ID never changes at run time, so we might as well tell the
+ * compiler that it's constant. Use this function to read the CPU ID
+ * rather than reading processor_id or calling read_cpuid() directly.
+ */
+static inline u32 __attribute_const__ read_cpuid_id(void)
+{
+	return read_cpuid(MIDR_EL1);
+}
+
+static inline u64 __attribute_const__ read_cpuid_mpidr(void)
+{
+	return read_cpuid(MPIDR_EL1);
+}
+
+static inline unsigned int __attribute_const__ read_cpuid_implementor(void)
+{
+	return MIDR_IMPLEMENTOR(read_cpuid_id());
+}
+
+static inline unsigned int __attribute_const__ read_cpuid_part_number(void)
+{
+	return MIDR_PARTNUM(read_cpuid_id());
+}
+
+static inline u32 __attribute_const__ read_cpuid_cachetype(void)
+{
+	return read_cpuid(CTR_EL0);
+}
+#endif /* __ASSEMBLY__ */
+
+#endif
-- 
2.32.0




* [PATCH v4 2/4] perf arm-spe: Use SPE data source for neoverse cores
  2022-03-24 18:33 ` Ali Saidi
@ 2022-03-24 18:33   ` Ali Saidi
  -1 siblings, 0 replies; 66+ messages in thread
From: Ali Saidi @ 2022-03-24 18:33 UTC (permalink / raw)
  To: linux-kernel, linux-perf-users, linux-arm-kernel, german.gomez,
	leo.yan, acme
  Cc: alisaidi, benh, Nick.Forrington, alexander.shishkin,
	andrew.kilroy, james.clark, john.garry, jolsa, kjain, lihuafei1,
	mark.rutland, mathieu.poirier, mingo, namhyung, peterz, will

When synthesizing data from SPE, augment the type with source information
for Arm Neoverse cores. The field is IMPLDEF, but the Neoverse cores all
use the same encoding. I can't find encoding information for any other
SPE implementations to unify their choices with Arm's, so that is left
for future work.

This change populates the mem_lvl_num for Neoverse cores instead of the
deprecated mem_lvl namespace.

Signed-off-by: Ali Saidi <alisaidi@amazon.com>
Tested-by: German Gomez <german.gomez@arm.com>
Reviewed-by: German Gomez <german.gomez@arm.com>
---
 .../util/arm-spe-decoder/arm-spe-decoder.c    |   1 +
 .../util/arm-spe-decoder/arm-spe-decoder.h    |  12 ++
 tools/perf/util/arm-spe.c                     | 110 +++++++++++++++---
 3 files changed, 109 insertions(+), 14 deletions(-)

diff --git a/tools/perf/util/arm-spe-decoder/arm-spe-decoder.c b/tools/perf/util/arm-spe-decoder/arm-spe-decoder.c
index 5e390a1a79ab..091987dd3966 100644
--- a/tools/perf/util/arm-spe-decoder/arm-spe-decoder.c
+++ b/tools/perf/util/arm-spe-decoder/arm-spe-decoder.c
@@ -220,6 +220,7 @@ static int arm_spe_read_record(struct arm_spe_decoder *decoder)
 
 			break;
 		case ARM_SPE_DATA_SOURCE:
+			decoder->record.source = payload;
 			break;
 		case ARM_SPE_BAD:
 			break;
diff --git a/tools/perf/util/arm-spe-decoder/arm-spe-decoder.h b/tools/perf/util/arm-spe-decoder/arm-spe-decoder.h
index 69b31084d6be..c81bf90c0996 100644
--- a/tools/perf/util/arm-spe-decoder/arm-spe-decoder.h
+++ b/tools/perf/util/arm-spe-decoder/arm-spe-decoder.h
@@ -29,6 +29,17 @@ enum arm_spe_op_type {
 	ARM_SPE_ST		= 1 << 1,
 };
 
+enum arm_spe_neoverse_data_source {
+	ARM_SPE_NV_L1D        = 0x0,
+	ARM_SPE_NV_L2         = 0x8,
+	ARM_SPE_NV_PEER_CORE  = 0x9,
+	ARM_SPE_NV_LCL_CLSTR  = 0xa,
+	ARM_SPE_NV_SYS_CACHE  = 0xb,
+	ARM_SPE_NV_PEER_CLSTR = 0xc,
+	ARM_SPE_NV_REMOTE     = 0xd,
+	ARM_SPE_NV_DRAM       = 0xe,
+};
+
 struct arm_spe_record {
 	enum arm_spe_sample_type type;
 	int err;
@@ -40,6 +51,7 @@ struct arm_spe_record {
 	u64 virt_addr;
 	u64 phys_addr;
 	u64 context_id;
+	u16 source;
 };
 
 struct arm_spe_insn;
diff --git a/tools/perf/util/arm-spe.c b/tools/perf/util/arm-spe.c
index d2b64e3f588b..f92ebce88c6a 100644
--- a/tools/perf/util/arm-spe.c
+++ b/tools/perf/util/arm-spe.c
@@ -34,6 +34,7 @@
 #include "arm-spe-decoder/arm-spe-decoder.h"
 #include "arm-spe-decoder/arm-spe-pkt-decoder.h"
 
+#include "../../arch/arm64/include/asm/cputype.h"
 #define MAX_TIMESTAMP (~0ULL)
 
 struct arm_spe {
@@ -45,6 +46,7 @@ struct arm_spe {
 	struct perf_session		*session;
 	struct machine			*machine;
 	u32				pmu_type;
+	u64				midr;
 
 	struct perf_tsc_conversion	tc;
 
@@ -399,33 +401,110 @@ static bool arm_spe__is_memory_event(enum arm_spe_sample_type type)
 	return false;
 }
 
-static u64 arm_spe__synth_data_source(const struct arm_spe_record *record)
+static const struct midr_range neoverse_spe[] = {
+	MIDR_ALL_VERSIONS(MIDR_NEOVERSE_N1),
+	MIDR_ALL_VERSIONS(MIDR_NEOVERSE_N2),
+	MIDR_ALL_VERSIONS(MIDR_NEOVERSE_V1),
+	{},
+};
+
+
+static void arm_spe__synth_data_source_neoverse(const struct arm_spe_record *record,
+						union perf_mem_data_src *data_src)
 {
-	union perf_mem_data_src	data_src = { 0 };
+	/*
+	 * Even though four levels of cache hierarchy are possible, no known
+	 * production Neoverse systems currently include more than three levels,
+	 * so for the time being we assume three exist. If a production system
+	 * is built with four, this function would have to be changed to
+	 * detect the number of levels for reporting.
+	 */
 
-	if (record->op == ARM_SPE_LD)
-		data_src.mem_op = PERF_MEM_OP_LOAD;
-	else
-		data_src.mem_op = PERF_MEM_OP_STORE;
+	switch (record->source) {
+	case ARM_SPE_NV_L1D:
+		data_src->mem_lvl = PERF_MEM_LVL_HIT;
+		data_src->mem_lvl_num = PERF_MEM_LVLNUM_L1;
+		break;
+	case ARM_SPE_NV_L2:
+		data_src->mem_lvl = PERF_MEM_LVL_HIT;
+		data_src->mem_lvl_num = PERF_MEM_LVLNUM_L2;
+		break;
+	case ARM_SPE_NV_PEER_CORE:
+		data_src->mem_lvl = PERF_MEM_LVL_HIT;
+		data_src->mem_snoop = PERF_MEM_SNOOP_HITM;
+		data_src->mem_lvl_num = PERF_MEM_LVLNUM_ANY_CACHE;
+		break;
+	/*
+	 * We don't know if this is L1, L2 but we do know it was a cache-2-cache
+	 * transfer, so set SNOOP_HITM
+	 */
+	case ARM_SPE_NV_LCL_CLSTR:
+	case ARM_SPE_NV_PEER_CLSTR:
+		data_src->mem_lvl = PERF_MEM_LVL_HIT;
+		data_src->mem_snoop = PERF_MEM_SNOOP_HITM;
+		data_src->mem_lvl_num = PERF_MEM_LVLNUM_ANY_CACHE;
+		break;
+	/*
+	 * System cache is assumed to be L3
+	 */
+	case ARM_SPE_NV_SYS_CACHE:
+		data_src->mem_lvl = PERF_MEM_LVL_HIT;
+		data_src->mem_lvl_num = PERF_MEM_LVLNUM_L3;
+		break;
+	/*
+	 * We don't know what level it hit in, except it came from the other
+	 * socket
+	 */
+	case ARM_SPE_NV_REMOTE:
+		data_src->mem_snoop = PERF_MEM_SNOOP_HITM;
+		data_src->mem_remote = PERF_MEM_REMOTE_REMOTE;
+		break;
+	case ARM_SPE_NV_DRAM:
+		data_src->mem_lvl = PERF_MEM_LVL_HIT;
+		data_src->mem_lvl_num = PERF_MEM_LVLNUM_RAM;
+		break;
+	default:
+		break;
+	}
+}
 
+static void arm_spe__synth_data_source_generic(const struct arm_spe_record *record,
+						union perf_mem_data_src *data_src)
+{
 	if (record->type & (ARM_SPE_LLC_ACCESS | ARM_SPE_LLC_MISS)) {
-		data_src.mem_lvl = PERF_MEM_LVL_L3;
+		data_src->mem_lvl = PERF_MEM_LVL_L3;
 
 		if (record->type & ARM_SPE_LLC_MISS)
-			data_src.mem_lvl |= PERF_MEM_LVL_MISS;
+			data_src->mem_lvl |= PERF_MEM_LVL_MISS;
 		else
-			data_src.mem_lvl |= PERF_MEM_LVL_HIT;
+			data_src->mem_lvl |= PERF_MEM_LVL_HIT;
 	} else if (record->type & (ARM_SPE_L1D_ACCESS | ARM_SPE_L1D_MISS)) {
-		data_src.mem_lvl = PERF_MEM_LVL_L1;
+		data_src->mem_lvl = PERF_MEM_LVL_L1;
 
 		if (record->type & ARM_SPE_L1D_MISS)
-			data_src.mem_lvl |= PERF_MEM_LVL_MISS;
+			data_src->mem_lvl |= PERF_MEM_LVL_MISS;
 		else
-			data_src.mem_lvl |= PERF_MEM_LVL_HIT;
+			data_src->mem_lvl |= PERF_MEM_LVL_HIT;
 	}
 
 	if (record->type & ARM_SPE_REMOTE_ACCESS)
-		data_src.mem_lvl |= PERF_MEM_LVL_REM_CCE1;
+		data_src->mem_lvl |= PERF_MEM_LVL_REM_CCE1;
+}
+
+static u64 arm_spe__synth_data_source(const struct arm_spe_record *record, u64 midr)
+{
+	union perf_mem_data_src	data_src = { 0 };
+	bool is_neoverse = is_midr_in_range(midr, neoverse_spe);
+
+	if (record->op & ARM_SPE_LD)
+		data_src.mem_op = PERF_MEM_OP_LOAD;
+	else
+		data_src.mem_op = PERF_MEM_OP_STORE;
+
+	if (is_neoverse)
+		arm_spe__synth_data_source_neoverse(record, &data_src);
+	else
+		arm_spe__synth_data_source_generic(record, &data_src);
 
 	if (record->type & (ARM_SPE_TLB_ACCESS | ARM_SPE_TLB_MISS)) {
 		data_src.mem_dtlb = PERF_MEM_TLB_WK;
@@ -446,7 +525,7 @@ static int arm_spe_sample(struct arm_spe_queue *speq)
 	u64 data_src;
 	int err;
 
-	data_src = arm_spe__synth_data_source(record);
+	data_src = arm_spe__synth_data_source(record, spe->midr);
 
 	if (spe->sample_flc) {
 		if (record->type & ARM_SPE_L1D_MISS) {
@@ -1183,6 +1262,8 @@ int arm_spe_process_auxtrace_info(union perf_event *event,
 	struct perf_record_auxtrace_info *auxtrace_info = &event->auxtrace_info;
 	size_t min_sz = sizeof(u64) * ARM_SPE_AUXTRACE_PRIV_MAX;
 	struct perf_record_time_conv *tc = &session->time_conv;
+	const char *cpuid = perf_env__cpuid(session->evlist->env);
+	u64 midr = strtol(cpuid, NULL, 16);
 	struct arm_spe *spe;
 	int err;
 
@@ -1202,6 +1283,7 @@ int arm_spe_process_auxtrace_info(union perf_event *event,
 	spe->machine = &session->machines.host; /* No kvm support */
 	spe->auxtrace_type = auxtrace_info->type;
 	spe->pmu_type = auxtrace_info->priv[ARM_SPE_PMU_TYPE];
+	spe->midr = midr;
 
 	spe->timeless_decoding = arm_spe__is_timeless_decoding(spe);
 
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [PATCH v4 2/4] perf arm-spe: Use SPE data source for neoverse cores
@ 2022-03-24 18:33   ` Ali Saidi
  0 siblings, 0 replies; 66+ messages in thread
From: Ali Saidi @ 2022-03-24 18:33 UTC (permalink / raw)
  To: linux-kernel, linux-perf-users, linux-arm-kernel, german.gomez,
	leo.yan, acme
  Cc: alisaidi, benh, Nick.Forrington, alexander.shishkin,
	andrew.kilroy, james.clark, john.garry, jolsa, kjain, lihuafei1,
	mark.rutland, mathieu.poirier, mingo, namhyung, peterz, will

When synthesizing data from SPE, augment the type with source information
for Arm Neoverse cores. The field is IMPLDEF, but the Neoverse cores all
use the same encoding. I can't find encoding information for any other SPE
implementations to unify their choices with Arm's, so that is left for
future work.

This change populates the mem_lvl_num for Neoverse cores instead of the
deprecated mem_lvl namespace.
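
As an illustration only (not part of the patch), the mapping that the new
arm_spe__synth_data_source_neoverse() implements can be sketched in
Python. The hex encodings are the IMPLDEF values from the patch; the
tuple fields are stand-ins for the perf_mem_data_src mem_lvl_num,
mem_snoop and mem_remote bits:

```python
# IMPLDEF Neoverse SPE data-source encodings, as in the patch's enum.
NV_L1D, NV_L2, NV_PEER_CORE = 0x0, 0x8, 0x9
NV_LCL_CLSTR, NV_SYS_CACHE, NV_PEER_CLSTR = 0xa, 0xb, 0xc
NV_REMOTE, NV_DRAM = 0xd, 0xe

def synth_neoverse(source):
    """Return (lvl_num, snoop, remote) for a Neoverse SPE data source.

    The string values stand in for the PERF_MEM_LVLNUM_*/SNOOP_*/REMOTE_*
    constants; this is a model of the mapping, not perf's actual code.
    """
    table = {
        NV_L1D:        ("L1", None, False),
        NV_L2:         ("L2", None, False),
        # Peer-core and cluster hits are cache-to-cache transfers of an
        # unknown level, so they map to ANY_CACHE with SNOOP_HITM.
        NV_PEER_CORE:  ("ANY_CACHE", "HITM", False),
        NV_LCL_CLSTR:  ("ANY_CACHE", "HITM", False),
        NV_PEER_CLSTR: ("ANY_CACHE", "HITM", False),
        NV_SYS_CACHE:  ("L3", None, False),   # system cache assumed L3
        NV_REMOTE:     (None, "HITM", True),  # other socket, level unknown
        NV_DRAM:       ("RAM", None, False),
    }
    return table.get(source, (None, None, False))
```

Unknown encodings fall through to the default case, matching the empty
default branch in the C switch.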

Signed-off-by: Ali Saidi <alisaidi@amazon.com>
Tested-by: German Gomez <german.gomez@arm.com>
Reviewed-by: German Gomez <german.gomez@arm.com>
---
 .../util/arm-spe-decoder/arm-spe-decoder.c    |   1 +
 .../util/arm-spe-decoder/arm-spe-decoder.h    |  12 ++
 tools/perf/util/arm-spe.c                     | 110 +++++++++++++++---
 3 files changed, 109 insertions(+), 14 deletions(-)

diff --git a/tools/perf/util/arm-spe-decoder/arm-spe-decoder.c b/tools/perf/util/arm-spe-decoder/arm-spe-decoder.c
index 5e390a1a79ab..091987dd3966 100644
--- a/tools/perf/util/arm-spe-decoder/arm-spe-decoder.c
+++ b/tools/perf/util/arm-spe-decoder/arm-spe-decoder.c
@@ -220,6 +220,7 @@ static int arm_spe_read_record(struct arm_spe_decoder *decoder)
 
 			break;
 		case ARM_SPE_DATA_SOURCE:
+			decoder->record.source = payload;
 			break;
 		case ARM_SPE_BAD:
 			break;
diff --git a/tools/perf/util/arm-spe-decoder/arm-spe-decoder.h b/tools/perf/util/arm-spe-decoder/arm-spe-decoder.h
index 69b31084d6be..c81bf90c0996 100644
--- a/tools/perf/util/arm-spe-decoder/arm-spe-decoder.h
+++ b/tools/perf/util/arm-spe-decoder/arm-spe-decoder.h
@@ -29,6 +29,17 @@ enum arm_spe_op_type {
 	ARM_SPE_ST		= 1 << 1,
 };
 
+enum arm_spe_neoverse_data_source {
+	ARM_SPE_NV_L1D        = 0x0,
+	ARM_SPE_NV_L2         = 0x8,
+	ARM_SPE_NV_PEER_CORE  = 0x9,
+	ARM_SPE_NV_LCL_CLSTR  = 0xa,
+	ARM_SPE_NV_SYS_CACHE  = 0xb,
+	ARM_SPE_NV_PEER_CLSTR = 0xc,
+	ARM_SPE_NV_REMOTE     = 0xd,
+	ARM_SPE_NV_DRAM       = 0xe,
+};
+
 struct arm_spe_record {
 	enum arm_spe_sample_type type;
 	int err;
@@ -40,6 +51,7 @@ struct arm_spe_record {
 	u64 virt_addr;
 	u64 phys_addr;
 	u64 context_id;
+	u16 source;
 };
 
 struct arm_spe_insn;
diff --git a/tools/perf/util/arm-spe.c b/tools/perf/util/arm-spe.c
index d2b64e3f588b..f92ebce88c6a 100644
--- a/tools/perf/util/arm-spe.c
+++ b/tools/perf/util/arm-spe.c
@@ -34,6 +34,7 @@
 #include "arm-spe-decoder/arm-spe-decoder.h"
 #include "arm-spe-decoder/arm-spe-pkt-decoder.h"
 
+#include "../../arch/arm64/include/asm/cputype.h"
 #define MAX_TIMESTAMP (~0ULL)
 
 struct arm_spe {
@@ -45,6 +46,7 @@ struct arm_spe {
 	struct perf_session		*session;
 	struct machine			*machine;
 	u32				pmu_type;
+	u64				midr;
 
 	struct perf_tsc_conversion	tc;
 
@@ -399,33 +401,110 @@ static bool arm_spe__is_memory_event(enum arm_spe_sample_type type)
 	return false;
 }
 
-static u64 arm_spe__synth_data_source(const struct arm_spe_record *record)
+static const struct midr_range neoverse_spe[] = {
+	MIDR_ALL_VERSIONS(MIDR_NEOVERSE_N1),
+	MIDR_ALL_VERSIONS(MIDR_NEOVERSE_N2),
+	MIDR_ALL_VERSIONS(MIDR_NEOVERSE_V1),
+	{},
+};
+
+
+static void arm_spe__synth_data_source_neoverse(const struct arm_spe_record *record,
+						union perf_mem_data_src *data_src)
 {
-	union perf_mem_data_src	data_src = { 0 };
+	/*
+	 * Even though four levels of cache hierarchy are possible, no known
+	 * production Neoverse systems currently include more than three levels,
+	 * so for the time being we assume three exist. If a production system
+	 * is built with four, then this function would have to be changed to
+	 * detect the number of levels for reporting.
+	 */
 
-	if (record->op == ARM_SPE_LD)
-		data_src.mem_op = PERF_MEM_OP_LOAD;
-	else
-		data_src.mem_op = PERF_MEM_OP_STORE;
+	switch (record->source) {
+	case ARM_SPE_NV_L1D:
+		data_src->mem_lvl = PERF_MEM_LVL_HIT;
+		data_src->mem_lvl_num = PERF_MEM_LVLNUM_L1;
+		break;
+	case ARM_SPE_NV_L2:
+		data_src->mem_lvl = PERF_MEM_LVL_HIT;
+		data_src->mem_lvl_num = PERF_MEM_LVLNUM_L2;
+		break;
+	case ARM_SPE_NV_PEER_CORE:
+		data_src->mem_lvl = PERF_MEM_LVL_HIT;
+		data_src->mem_snoop = PERF_MEM_SNOOP_HITM;
+		data_src->mem_lvl_num = PERF_MEM_LVLNUM_ANY_CACHE;
+		break;
+	/*
+	 * We don't know if this is L1 or L2, but we do know it was a
+	 * cache-to-cache transfer, so set SNOOP_HITM
+	 */
+	case ARM_SPE_NV_LCL_CLSTR:
+	case ARM_SPE_NV_PEER_CLSTR:
+		data_src->mem_lvl = PERF_MEM_LVL_HIT;
+		data_src->mem_snoop = PERF_MEM_SNOOP_HITM;
+		data_src->mem_lvl_num = PERF_MEM_LVLNUM_ANY_CACHE;
+		break;
+	/*
+	 * System cache is assumed to be L3
+	 */
+	case ARM_SPE_NV_SYS_CACHE:
+		data_src->mem_lvl = PERF_MEM_LVL_HIT;
+		data_src->mem_lvl_num = PERF_MEM_LVLNUM_L3;
+		break;
+	/*
+	 * We don't know which level it hit in, only that it came from the
+	 * other socket
+	 */
+	case ARM_SPE_NV_REMOTE:
+		data_src->mem_snoop = PERF_MEM_SNOOP_HITM;
+		data_src->mem_remote = PERF_MEM_REMOTE_REMOTE;
+		break;
+	case ARM_SPE_NV_DRAM:
+		data_src->mem_lvl = PERF_MEM_LVL_HIT;
+		data_src->mem_lvl_num = PERF_MEM_LVLNUM_RAM;
+		break;
+	default:
+		break;
+	}
+}
 
+static void arm_spe__synth_data_source_generic(const struct arm_spe_record *record,
+						union perf_mem_data_src *data_src)
+{
 	if (record->type & (ARM_SPE_LLC_ACCESS | ARM_SPE_LLC_MISS)) {
-		data_src.mem_lvl = PERF_MEM_LVL_L3;
+		data_src->mem_lvl = PERF_MEM_LVL_L3;
 
 		if (record->type & ARM_SPE_LLC_MISS)
-			data_src.mem_lvl |= PERF_MEM_LVL_MISS;
+			data_src->mem_lvl |= PERF_MEM_LVL_MISS;
 		else
-			data_src.mem_lvl |= PERF_MEM_LVL_HIT;
+			data_src->mem_lvl |= PERF_MEM_LVL_HIT;
 	} else if (record->type & (ARM_SPE_L1D_ACCESS | ARM_SPE_L1D_MISS)) {
-		data_src.mem_lvl = PERF_MEM_LVL_L1;
+		data_src->mem_lvl = PERF_MEM_LVL_L1;
 
 		if (record->type & ARM_SPE_L1D_MISS)
-			data_src.mem_lvl |= PERF_MEM_LVL_MISS;
+			data_src->mem_lvl |= PERF_MEM_LVL_MISS;
 		else
-			data_src.mem_lvl |= PERF_MEM_LVL_HIT;
+			data_src->mem_lvl |= PERF_MEM_LVL_HIT;
 	}
 
 	if (record->type & ARM_SPE_REMOTE_ACCESS)
-		data_src.mem_lvl |= PERF_MEM_LVL_REM_CCE1;
+		data_src->mem_lvl |= PERF_MEM_LVL_REM_CCE1;
+}
+
+static u64 arm_spe__synth_data_source(const struct arm_spe_record *record, u64 midr)
+{
+	union perf_mem_data_src	data_src = { 0 };
+	bool is_neoverse = is_midr_in_range(midr, neoverse_spe);
+
+	if (record->op & ARM_SPE_LD)
+		data_src.mem_op = PERF_MEM_OP_LOAD;
+	else
+		data_src.mem_op = PERF_MEM_OP_STORE;
+
+	if (is_neoverse)
+		arm_spe__synth_data_source_neoverse(record, &data_src);
+	else
+		arm_spe__synth_data_source_generic(record, &data_src);
 
 	if (record->type & (ARM_SPE_TLB_ACCESS | ARM_SPE_TLB_MISS)) {
 		data_src.mem_dtlb = PERF_MEM_TLB_WK;
@@ -446,7 +525,7 @@ static int arm_spe_sample(struct arm_spe_queue *speq)
 	u64 data_src;
 	int err;
 
-	data_src = arm_spe__synth_data_source(record);
+	data_src = arm_spe__synth_data_source(record, spe->midr);
 
 	if (spe->sample_flc) {
 		if (record->type & ARM_SPE_L1D_MISS) {
@@ -1183,6 +1262,8 @@ int arm_spe_process_auxtrace_info(union perf_event *event,
 	struct perf_record_auxtrace_info *auxtrace_info = &event->auxtrace_info;
 	size_t min_sz = sizeof(u64) * ARM_SPE_AUXTRACE_PRIV_MAX;
 	struct perf_record_time_conv *tc = &session->time_conv;
+	const char *cpuid = perf_env__cpuid(session->evlist->env);
+	u64 midr = strtol(cpuid, NULL, 16);
 	struct arm_spe *spe;
 	int err;
 
@@ -1202,6 +1283,7 @@ int arm_spe_process_auxtrace_info(union perf_event *event,
 	spe->machine = &session->machines.host; /* No kvm support */
 	spe->auxtrace_type = auxtrace_info->type;
 	spe->pmu_type = auxtrace_info->priv[ARM_SPE_PMU_TYPE];
+	spe->midr = midr;
 
 	spe->timeless_decoding = arm_spe__is_timeless_decoding(spe);
 
-- 
2.32.0


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [PATCH v4 3/4] perf mem: Support mem_lvl_num in c2c command
  2022-03-24 18:33 ` Ali Saidi
@ 2022-03-24 18:33   ` Ali Saidi
  -1 siblings, 0 replies; 66+ messages in thread
From: Ali Saidi @ 2022-03-24 18:33 UTC (permalink / raw)
  To: linux-kernel, linux-perf-users, linux-arm-kernel, german.gomez,
	leo.yan, acme
  Cc: alisaidi, benh, Nick.Forrington, alexander.shishkin,
	andrew.kilroy, james.clark, john.garry, jolsa, kjain, lihuafei1,
	mark.rutland, mathieu.poirier, mingo, namhyung, peterz, will

In addition to summarizing data encoded in mem_lvl, also support data
encoded in mem_lvl_num.

Since other architectures don't seem to populate the mem_lvl_num field
here, there shouldn't be a change in functionality.
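
For illustration (a hedged sketch, not perf's actual code or constant
values), the decode change amounts to accepting either encoding when
bucketing a load hit; LVL_* model the deprecated mem_lvl bitmask and
LVLNUM_* the newer mem_lvl_num numeric field:

```python
# Stand-in constants: illustrative values only, not the uapi encodings.
LVL_L1, LVL_L2, LVL_L3 = 1 << 0, 1 << 1, 1 << 2
LVLNUM_L1, LVLNUM_L2, LVLNUM_L3 = 1, 2, 3

def bucket_load(lvl, lvl_num):
    """Return which c2c counters a load hit would increment, checking
    both the deprecated mem_lvl bitmask and the mem_lvl_num field."""
    hits = []
    if (lvl & LVL_L1) or lvl_num == LVLNUM_L1:
        hits.append("ld_l1hit")
    if (lvl & LVL_L2) or lvl_num == LVLNUM_L2:
        hits.append("ld_l2hit")
    if (lvl & LVL_L3) or lvl_num == LVLNUM_L3:
        hits.append("ld_llchit")
    return hits
```

A sample carrying only mem_lvl_num (as the Neoverse SPE path now does)
lands in the same bucket as one carrying the legacy mem_lvl bit.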

Signed-off-by: Ali Saidi <alisaidi@amazon.com>
Tested-by: German Gomez <german.gomez@arm.com>
Reviewed-by: German Gomez <german.gomez@arm.com>
---
 tools/perf/util/mem-events.c | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/tools/perf/util/mem-events.c b/tools/perf/util/mem-events.c
index ed0ab838bcc5..e5e405185498 100644
--- a/tools/perf/util/mem-events.c
+++ b/tools/perf/util/mem-events.c
@@ -485,6 +485,7 @@ int c2c_decode_stats(struct c2c_stats *stats, struct mem_info *mi)
 	u64 daddr  = mi->daddr.addr;
 	u64 op     = data_src->mem_op;
 	u64 lvl    = data_src->mem_lvl;
+	u64 lnum   = data_src->mem_lvl_num;
 	u64 snoop  = data_src->mem_snoop;
 	u64 lock   = data_src->mem_lock;
 	u64 blk    = data_src->mem_blk;
@@ -527,16 +528,18 @@ do {				\
 			if (lvl & P(LVL, UNC)) stats->ld_uncache++;
 			if (lvl & P(LVL, IO))  stats->ld_io++;
 			if (lvl & P(LVL, LFB)) stats->ld_fbhit++;
-			if (lvl & P(LVL, L1 )) stats->ld_l1hit++;
-			if (lvl & P(LVL, L2 )) stats->ld_l2hit++;
-			if (lvl & P(LVL, L3 )) {
+			if (lvl & P(LVL, L1) || lnum == P(LVLNUM, L1))
+				stats->ld_l1hit++;
+			if (lvl & P(LVL, L2) || lnum == P(LVLNUM, L2))
+				stats->ld_l2hit++;
+			if (lvl & P(LVL, L3) || lnum == P(LVLNUM, L3)) {
 				if (snoop & P(SNOOP, HITM))
 					HITM_INC(lcl_hitm);
 				else
 					stats->ld_llchit++;
 			}
 
-			if (lvl & P(LVL, LOC_RAM)) {
+			if (lvl & P(LVL, LOC_RAM) || lnum == P(LVLNUM, RAM)) {
 				stats->lcl_dram++;
 				if (snoop & P(SNOOP, HIT))
 					stats->ld_shared++;
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [PATCH v4 4/4] perf mem: Support HITM for when mem_lvl_num is any
  2022-03-24 18:33 ` Ali Saidi
@ 2022-03-24 18:33   ` Ali Saidi
  -1 siblings, 0 replies; 66+ messages in thread
From: Ali Saidi @ 2022-03-24 18:33 UTC (permalink / raw)
  To: linux-kernel, linux-perf-users, linux-arm-kernel, german.gomez,
	leo.yan, acme
  Cc: alisaidi, benh, Nick.Forrington, alexander.shishkin,
	andrew.kilroy, james.clark, john.garry, jolsa, kjain, lihuafei1,
	mark.rutland, mathieu.poirier, mingo, namhyung, peterz, will

For loads that hit in the LLC snoop filter and are fulfilled from a
higher-level cache on arm64 Neoverse cores, it's not usually clear which
cache level the data truly came from (i.e. a transfer from another core
could come from its L1 or L2). Instead of assuming where the line came
from, add support for incrementing HITM if the source is ANY_CACHE.

Since other architectures don't seem to populate the mem_lvl_num field
here, there shouldn't be a change in functionality.
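
A minimal sketch of the new accounting (illustrative stand-in values,
not perf's actual constants): a sample whose level is only known to be
ANY_CACHE but whose snoop result is HITM counts as a local HITM, since
the line must have been supplied dirty from another core's cache:

```python
LVLNUM_ANY_CACHE = 0x0b  # stand-in for PERF_MEM_LVLNUM_ANY_CACHE
SNOOP_HITM = 1 << 2      # stand-in bit for PERF_MEM_SNOOP_HITM
SNOOP_HIT = 1 << 1       # stand-in bit for PERF_MEM_SNOOP_HIT

def counts_as_local_hitm(lvl_num, snoop):
    """True when a load sample should increment the lcl_hitm counter:
    the exact cache level is unknown (ANY_CACHE) but the snoop response
    says the line was modified in another cache."""
    return lvl_num == LVLNUM_ANY_CACHE and bool(snoop & SNOOP_HITM)
```

A plain snoop hit (no HITM) at ANY_CACHE deliberately does not bump the
HITM counter, mirroring the guarded increment in the patch.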

Signed-off-by: Ali Saidi <alisaidi@amazon.com>
Tested-by: German Gomez <german.gomez@arm.com>
Reviewed-by: German Gomez <german.gomez@arm.com>
---
 tools/perf/util/mem-events.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/tools/perf/util/mem-events.c b/tools/perf/util/mem-events.c
index e5e405185498..084977cfebef 100644
--- a/tools/perf/util/mem-events.c
+++ b/tools/perf/util/mem-events.c
@@ -539,6 +539,15 @@ do {				\
 					stats->ld_llchit++;
 			}
 
+			/*
+			 * A hit in another core's cache must mean an LLC
+			 * snoop filter hit
+			 */
+			if (lnum == P(LVLNUM, ANY_CACHE)) {
+				if (snoop & P(SNOOP, HITM))
+					HITM_INC(lcl_hitm);
+			}
+
 			if (lvl & P(LVL, LOC_RAM) || lnum == P(LVLNUM, RAM)) {
 				stats->lcl_dram++;
 				if (snoop & P(SNOOP, HIT))
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 66+ messages in thread

* Re: [PATCH v4 1/4] tools: arm64: Import cputype.h
  2022-03-24 18:33   ` Ali Saidi
@ 2022-03-25 18:39     ` Arnaldo Carvalho de Melo
  -1 siblings, 0 replies; 66+ messages in thread
From: Arnaldo Carvalho de Melo @ 2022-03-25 18:39 UTC (permalink / raw)
  To: Ali Saidi
  Cc: linux-kernel, linux-perf-users, linux-arm-kernel, german.gomez,
	leo.yan, benh, Nick.Forrington, alexander.shishkin,
	andrew.kilroy, james.clark, john.garry, jolsa, kjain, lihuafei1,
	mark.rutland, mathieu.poirier, mingo, namhyung, peterz, will

Em Thu, Mar 24, 2022 at 06:33:20PM +0000, Ali Saidi escreveu:
> Bring-in the kernel's arch/arm64/include/asm/cputype.h into tools/
> for arm64 to make use of all the core-type definitions in perf.
> 
> Replace sysreg.h with the version already imported into tools/.

You forgot to add it to tools/perf/check-headers.sh, so that we get
notified when the original file in the kernel sources gets updated and
can check whether this needs any tooling adjustments.

⬢[acme@toolbox perf]$ diff -u tools/arch/arm64/include/asm/cputype.h arch/arm64/include/asm/cputype.h
--- tools/arch/arm64/include/asm/cputype.h	2022-03-25 15:29:41.185173403 -0300
+++ arch/arm64/include/asm/cputype.h	2022-03-22 17:52:10.881311839 -0300
@@ -170,7 +170,7 @@

 #ifndef __ASSEMBLY__

-#include "sysreg.h"
+#include <asm/sysreg.h>

 #define read_cpuid(reg)			read_sysreg_s(SYS_ ## reg)

⬢[acme@toolbox perf]$
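
The manual diff above is essentially the drift check that
check-headers.sh automates. A hedged sketch of the idea (the real
script is shell and its exact logic differs):

```python
# Illustrative only: warn when a header copied into tools/ has drifted
# from the kernel original, as tools/perf/check-headers.sh does at
# build time (the real script compares files on disk with diff).
def check_header(kernel_text, tools_text, name):
    """Return a warning string when the tools/ copy differs from the
    kernel original, or None when the two are still in sync."""
    if kernel_text != tools_text:
        return "Warning: tools copy of %s differs from the kernel original" % name
    return None
```

Known-intentional differences (like the "sysreg.h" include here) would
need a waiver so the warning doesn't fire on every build.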


I'll add the entry together with the waiver for this specific
difference.

- Arnaldo
 
> Signed-off-by: Ali Saidi <alisaidi@amazon.com>
> ---
>  tools/arch/arm64/include/asm/cputype.h | 258 +++++++++++++++++++++++++
>  1 file changed, 258 insertions(+)
>  create mode 100644 tools/arch/arm64/include/asm/cputype.h
> 
> diff --git a/tools/arch/arm64/include/asm/cputype.h b/tools/arch/arm64/include/asm/cputype.h
> new file mode 100644
> index 000000000000..9afcc6467a09
> --- /dev/null
> +++ b/tools/arch/arm64/include/asm/cputype.h
> @@ -0,0 +1,258 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +/*
> + * Copyright (C) 2012 ARM Ltd.
> + */
> +#ifndef __ASM_CPUTYPE_H
> +#define __ASM_CPUTYPE_H
> +
> +#define INVALID_HWID		ULONG_MAX
> +
> +#define MPIDR_UP_BITMASK	(0x1 << 30)
> +#define MPIDR_MT_BITMASK	(0x1 << 24)
> +#define MPIDR_HWID_BITMASK	UL(0xff00ffffff)
> +
> +#define MPIDR_LEVEL_BITS_SHIFT	3
> +#define MPIDR_LEVEL_BITS	(1 << MPIDR_LEVEL_BITS_SHIFT)
> +#define MPIDR_LEVEL_MASK	((1 << MPIDR_LEVEL_BITS) - 1)
> +
> +#define MPIDR_LEVEL_SHIFT(level) \
> +	(((1 << level) >> 1) << MPIDR_LEVEL_BITS_SHIFT)
> +
> +#define MPIDR_AFFINITY_LEVEL(mpidr, level) \
> +	((mpidr >> MPIDR_LEVEL_SHIFT(level)) & MPIDR_LEVEL_MASK)
> +
> +#define MIDR_REVISION_MASK	0xf
> +#define MIDR_REVISION(midr)	((midr) & MIDR_REVISION_MASK)
> +#define MIDR_PARTNUM_SHIFT	4
> +#define MIDR_PARTNUM_MASK	(0xfff << MIDR_PARTNUM_SHIFT)
> +#define MIDR_PARTNUM(midr)	\
> +	(((midr) & MIDR_PARTNUM_MASK) >> MIDR_PARTNUM_SHIFT)
> +#define MIDR_ARCHITECTURE_SHIFT	16
> +#define MIDR_ARCHITECTURE_MASK	(0xf << MIDR_ARCHITECTURE_SHIFT)
> +#define MIDR_ARCHITECTURE(midr)	\
> +	(((midr) & MIDR_ARCHITECTURE_MASK) >> MIDR_ARCHITECTURE_SHIFT)
> +#define MIDR_VARIANT_SHIFT	20
> +#define MIDR_VARIANT_MASK	(0xf << MIDR_VARIANT_SHIFT)
> +#define MIDR_VARIANT(midr)	\
> +	(((midr) & MIDR_VARIANT_MASK) >> MIDR_VARIANT_SHIFT)
> +#define MIDR_IMPLEMENTOR_SHIFT	24
> +#define MIDR_IMPLEMENTOR_MASK	(0xff << MIDR_IMPLEMENTOR_SHIFT)
> +#define MIDR_IMPLEMENTOR(midr)	\
> +	(((midr) & MIDR_IMPLEMENTOR_MASK) >> MIDR_IMPLEMENTOR_SHIFT)
> +
> +#define MIDR_CPU_MODEL(imp, partnum) \
> +	(((imp)			<< MIDR_IMPLEMENTOR_SHIFT) | \
> +	(0xf			<< MIDR_ARCHITECTURE_SHIFT) | \
> +	((partnum)		<< MIDR_PARTNUM_SHIFT))
> +
> +#define MIDR_CPU_VAR_REV(var, rev) \
> +	(((var)	<< MIDR_VARIANT_SHIFT) | (rev))
> +
> +#define MIDR_CPU_MODEL_MASK (MIDR_IMPLEMENTOR_MASK | MIDR_PARTNUM_MASK | \
> +			     MIDR_ARCHITECTURE_MASK)
> +
> +#define ARM_CPU_IMP_ARM			0x41
> +#define ARM_CPU_IMP_APM			0x50
> +#define ARM_CPU_IMP_CAVIUM		0x43
> +#define ARM_CPU_IMP_BRCM		0x42
> +#define ARM_CPU_IMP_QCOM		0x51
> +#define ARM_CPU_IMP_NVIDIA		0x4E
> +#define ARM_CPU_IMP_FUJITSU		0x46
> +#define ARM_CPU_IMP_HISI		0x48
> +#define ARM_CPU_IMP_APPLE		0x61
> +
> +#define ARM_CPU_PART_AEM_V8		0xD0F
> +#define ARM_CPU_PART_FOUNDATION		0xD00
> +#define ARM_CPU_PART_CORTEX_A57		0xD07
> +#define ARM_CPU_PART_CORTEX_A72		0xD08
> +#define ARM_CPU_PART_CORTEX_A53		0xD03
> +#define ARM_CPU_PART_CORTEX_A73		0xD09
> +#define ARM_CPU_PART_CORTEX_A75		0xD0A
> +#define ARM_CPU_PART_CORTEX_A35		0xD04
> +#define ARM_CPU_PART_CORTEX_A55		0xD05
> +#define ARM_CPU_PART_CORTEX_A76		0xD0B
> +#define ARM_CPU_PART_NEOVERSE_N1	0xD0C
> +#define ARM_CPU_PART_CORTEX_A77		0xD0D
> +#define ARM_CPU_PART_NEOVERSE_V1	0xD40
> +#define ARM_CPU_PART_CORTEX_A78		0xD41
> +#define ARM_CPU_PART_CORTEX_X1		0xD44
> +#define ARM_CPU_PART_CORTEX_A510	0xD46
> +#define ARM_CPU_PART_CORTEX_A710	0xD47
> +#define ARM_CPU_PART_CORTEX_X2		0xD48
> +#define ARM_CPU_PART_NEOVERSE_N2	0xD49
> +#define ARM_CPU_PART_CORTEX_A78C	0xD4B
> +
> +#define APM_CPU_PART_POTENZA		0x000
> +
> +#define CAVIUM_CPU_PART_THUNDERX	0x0A1
> +#define CAVIUM_CPU_PART_THUNDERX_81XX	0x0A2
> +#define CAVIUM_CPU_PART_THUNDERX_83XX	0x0A3
> +#define CAVIUM_CPU_PART_THUNDERX2	0x0AF
> +/* OcteonTx2 series */
> +#define CAVIUM_CPU_PART_OCTX2_98XX	0x0B1
> +#define CAVIUM_CPU_PART_OCTX2_96XX	0x0B2
> +#define CAVIUM_CPU_PART_OCTX2_95XX	0x0B3
> +#define CAVIUM_CPU_PART_OCTX2_95XXN	0x0B4
> +#define CAVIUM_CPU_PART_OCTX2_95XXMM	0x0B5
> +#define CAVIUM_CPU_PART_OCTX2_95XXO	0x0B6
> +
> +#define BRCM_CPU_PART_BRAHMA_B53	0x100
> +#define BRCM_CPU_PART_VULCAN		0x516
> +
> +#define QCOM_CPU_PART_FALKOR_V1		0x800
> +#define QCOM_CPU_PART_FALKOR		0xC00
> +#define QCOM_CPU_PART_KRYO		0x200
> +#define QCOM_CPU_PART_KRYO_2XX_GOLD	0x800
> +#define QCOM_CPU_PART_KRYO_2XX_SILVER	0x801
> +#define QCOM_CPU_PART_KRYO_3XX_SILVER	0x803
> +#define QCOM_CPU_PART_KRYO_4XX_GOLD	0x804
> +#define QCOM_CPU_PART_KRYO_4XX_SILVER	0x805
> +
> +#define NVIDIA_CPU_PART_DENVER		0x003
> +#define NVIDIA_CPU_PART_CARMEL		0x004
> +
> +#define FUJITSU_CPU_PART_A64FX		0x001
> +
> +#define HISI_CPU_PART_TSV110		0xD01
> +
> +#define APPLE_CPU_PART_M1_ICESTORM	0x022
> +#define APPLE_CPU_PART_M1_FIRESTORM	0x023
> +
> +#define MIDR_CORTEX_A53 MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_A53)
> +#define MIDR_CORTEX_A57 MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_A57)
> +#define MIDR_CORTEX_A72 MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_A72)
> +#define MIDR_CORTEX_A73 MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_A73)
> +#define MIDR_CORTEX_A75 MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_A75)
> +#define MIDR_CORTEX_A35 MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_A35)
> +#define MIDR_CORTEX_A55 MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_A55)
> +#define MIDR_CORTEX_A76	MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_A76)
> +#define MIDR_NEOVERSE_N1 MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_NEOVERSE_N1)
> +#define MIDR_CORTEX_A77	MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_A77)
> +#define MIDR_NEOVERSE_V1	MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_NEOVERSE_V1)
> +#define MIDR_CORTEX_A78	MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_A78)
> +#define MIDR_CORTEX_X1	MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_X1)
> +#define MIDR_CORTEX_A510 MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_A510)
> +#define MIDR_CORTEX_A710 MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_A710)
> +#define MIDR_CORTEX_X2 MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_X2)
> +#define MIDR_NEOVERSE_N2 MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_NEOVERSE_N2)
> +#define MIDR_CORTEX_A78C	MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_A78C)
> +#define MIDR_THUNDERX	MIDR_CPU_MODEL(ARM_CPU_IMP_CAVIUM, CAVIUM_CPU_PART_THUNDERX)
> +#define MIDR_THUNDERX_81XX MIDR_CPU_MODEL(ARM_CPU_IMP_CAVIUM, CAVIUM_CPU_PART_THUNDERX_81XX)
> +#define MIDR_THUNDERX_83XX MIDR_CPU_MODEL(ARM_CPU_IMP_CAVIUM, CAVIUM_CPU_PART_THUNDERX_83XX)
> +#define MIDR_OCTX2_98XX MIDR_CPU_MODEL(ARM_CPU_IMP_CAVIUM, CAVIUM_CPU_PART_OCTX2_98XX)
> +#define MIDR_OCTX2_96XX MIDR_CPU_MODEL(ARM_CPU_IMP_CAVIUM, CAVIUM_CPU_PART_OCTX2_96XX)
> +#define MIDR_OCTX2_95XX MIDR_CPU_MODEL(ARM_CPU_IMP_CAVIUM, CAVIUM_CPU_PART_OCTX2_95XX)
> +#define MIDR_OCTX2_95XXN MIDR_CPU_MODEL(ARM_CPU_IMP_CAVIUM, CAVIUM_CPU_PART_OCTX2_95XXN)
> +#define MIDR_OCTX2_95XXMM MIDR_CPU_MODEL(ARM_CPU_IMP_CAVIUM, CAVIUM_CPU_PART_OCTX2_95XXMM)
> +#define MIDR_OCTX2_95XXO MIDR_CPU_MODEL(ARM_CPU_IMP_CAVIUM, CAVIUM_CPU_PART_OCTX2_95XXO)
> +#define MIDR_CAVIUM_THUNDERX2 MIDR_CPU_MODEL(ARM_CPU_IMP_CAVIUM, CAVIUM_CPU_PART_THUNDERX2)
> +#define MIDR_BRAHMA_B53 MIDR_CPU_MODEL(ARM_CPU_IMP_BRCM, BRCM_CPU_PART_BRAHMA_B53)
> +#define MIDR_BRCM_VULCAN MIDR_CPU_MODEL(ARM_CPU_IMP_BRCM, BRCM_CPU_PART_VULCAN)
> +#define MIDR_QCOM_FALKOR_V1 MIDR_CPU_MODEL(ARM_CPU_IMP_QCOM, QCOM_CPU_PART_FALKOR_V1)
> +#define MIDR_QCOM_FALKOR MIDR_CPU_MODEL(ARM_CPU_IMP_QCOM, QCOM_CPU_PART_FALKOR)
> +#define MIDR_QCOM_KRYO MIDR_CPU_MODEL(ARM_CPU_IMP_QCOM, QCOM_CPU_PART_KRYO)
> +#define MIDR_QCOM_KRYO_2XX_GOLD MIDR_CPU_MODEL(ARM_CPU_IMP_QCOM, QCOM_CPU_PART_KRYO_2XX_GOLD)
> +#define MIDR_QCOM_KRYO_2XX_SILVER MIDR_CPU_MODEL(ARM_CPU_IMP_QCOM, QCOM_CPU_PART_KRYO_2XX_SILVER)

* Re: [PATCH v4 1/4] tools: arm64: Import cputype.h
@ 2022-03-25 18:39     ` Arnaldo Carvalho de Melo
  0 siblings, 0 replies; 66+ messages in thread
From: Arnaldo Carvalho de Melo @ 2022-03-25 18:39 UTC (permalink / raw)
  To: Ali Saidi
  Cc: linux-kernel, linux-perf-users, linux-arm-kernel, german.gomez,
	leo.yan, benh, Nick.Forrington, alexander.shishkin,
	andrew.kilroy, james.clark, john.garry, jolsa, kjain, lihuafei1,
	mark.rutland, mathieu.poirier, mingo, namhyung, peterz, will

Em Thu, Mar 24, 2022 at 06:33:20PM +0000, Ali Saidi escreveu:
> Bring-in the kernel's arch/arm64/include/asm/cputype.h into tools/
> for arm64 to make use of all the core-type definitions in perf.
> 
> Replace sysreg.h with the version already imported into tools/.

You forgot to add it to tools/perf/check-headers.sh so that we get
notified when the original file in the kernel sources gets updated, so
that we can check if this needs any tooling adjustments.

⬢[acme@toolbox perf]$ diff -u tools/arch/arm64/include/asm/cputype.h arch/arm64/include/asm/cputype.h
--- tools/arch/arm64/include/asm/cputype.h	2022-03-25 15:29:41.185173403 -0300
+++ arch/arm64/include/asm/cputype.h	2022-03-22 17:52:10.881311839 -0300
@@ -170,7 +170,7 @@

 #ifndef __ASSEMBLY__

-#include "sysreg.h"
+#include <asm/sysreg.h>

 #define read_cpuid(reg)			read_sysreg_s(SYS_ ## reg)

⬢[acme@toolbox perf]$


I'll add the entry together with the waiver for this specific
difference.

- Arnaldo
 
> Signed-off-by: Ali Saidi <alisaidi@amazon.com>
> ---
>  tools/arch/arm64/include/asm/cputype.h | 258 +++++++++++++++++++++++++
>  1 file changed, 258 insertions(+)
>  create mode 100644 tools/arch/arm64/include/asm/cputype.h
> 
> diff --git a/tools/arch/arm64/include/asm/cputype.h b/tools/arch/arm64/include/asm/cputype.h
> new file mode 100644
> index 000000000000..9afcc6467a09
> --- /dev/null
> +++ b/tools/arch/arm64/include/asm/cputype.h
> @@ -0,0 +1,258 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +/*
> + * Copyright (C) 2012 ARM Ltd.
> + */
> +#ifndef __ASM_CPUTYPE_H
> +#define __ASM_CPUTYPE_H
> +
> +#define INVALID_HWID		ULONG_MAX
> +
> +#define MPIDR_UP_BITMASK	(0x1 << 30)
> +#define MPIDR_MT_BITMASK	(0x1 << 24)
> +#define MPIDR_HWID_BITMASK	UL(0xff00ffffff)
> +
> +#define MPIDR_LEVEL_BITS_SHIFT	3
> +#define MPIDR_LEVEL_BITS	(1 << MPIDR_LEVEL_BITS_SHIFT)
> +#define MPIDR_LEVEL_MASK	((1 << MPIDR_LEVEL_BITS) - 1)
> +
> +#define MPIDR_LEVEL_SHIFT(level) \
> +	(((1 << level) >> 1) << MPIDR_LEVEL_BITS_SHIFT)
> +
> +#define MPIDR_AFFINITY_LEVEL(mpidr, level) \
> +	((mpidr >> MPIDR_LEVEL_SHIFT(level)) & MPIDR_LEVEL_MASK)
> +
> +#define MIDR_REVISION_MASK	0xf
> +#define MIDR_REVISION(midr)	((midr) & MIDR_REVISION_MASK)
> +#define MIDR_PARTNUM_SHIFT	4
> +#define MIDR_PARTNUM_MASK	(0xfff << MIDR_PARTNUM_SHIFT)
> +#define MIDR_PARTNUM(midr)	\
> +	(((midr) & MIDR_PARTNUM_MASK) >> MIDR_PARTNUM_SHIFT)
> +#define MIDR_ARCHITECTURE_SHIFT	16
> +#define MIDR_ARCHITECTURE_MASK	(0xf << MIDR_ARCHITECTURE_SHIFT)
> +#define MIDR_ARCHITECTURE(midr)	\
> +	(((midr) & MIDR_ARCHITECTURE_MASK) >> MIDR_ARCHITECTURE_SHIFT)
> +#define MIDR_VARIANT_SHIFT	20
> +#define MIDR_VARIANT_MASK	(0xf << MIDR_VARIANT_SHIFT)
> +#define MIDR_VARIANT(midr)	\
> +	(((midr) & MIDR_VARIANT_MASK) >> MIDR_VARIANT_SHIFT)
> +#define MIDR_IMPLEMENTOR_SHIFT	24
> +#define MIDR_IMPLEMENTOR_MASK	(0xff << MIDR_IMPLEMENTOR_SHIFT)
> +#define MIDR_IMPLEMENTOR(midr)	\
> +	(((midr) & MIDR_IMPLEMENTOR_MASK) >> MIDR_IMPLEMENTOR_SHIFT)
> +
> +#define MIDR_CPU_MODEL(imp, partnum) \
> +	(((imp)			<< MIDR_IMPLEMENTOR_SHIFT) | \
> +	(0xf			<< MIDR_ARCHITECTURE_SHIFT) | \
> +	((partnum)		<< MIDR_PARTNUM_SHIFT))
> +
> +#define MIDR_CPU_VAR_REV(var, rev) \
> +	(((var)	<< MIDR_VARIANT_SHIFT) | (rev))
> +
> +#define MIDR_CPU_MODEL_MASK (MIDR_IMPLEMENTOR_MASK | MIDR_PARTNUM_MASK | \
> +			     MIDR_ARCHITECTURE_MASK)
> +
> +#define ARM_CPU_IMP_ARM			0x41
> +#define ARM_CPU_IMP_APM			0x50
> +#define ARM_CPU_IMP_CAVIUM		0x43
> +#define ARM_CPU_IMP_BRCM		0x42
> +#define ARM_CPU_IMP_QCOM		0x51
> +#define ARM_CPU_IMP_NVIDIA		0x4E
> +#define ARM_CPU_IMP_FUJITSU		0x46
> +#define ARM_CPU_IMP_HISI		0x48
> +#define ARM_CPU_IMP_APPLE		0x61
> +
> +#define ARM_CPU_PART_AEM_V8		0xD0F
> +#define ARM_CPU_PART_FOUNDATION		0xD00
> +#define ARM_CPU_PART_CORTEX_A57		0xD07
> +#define ARM_CPU_PART_CORTEX_A72		0xD08
> +#define ARM_CPU_PART_CORTEX_A53		0xD03
> +#define ARM_CPU_PART_CORTEX_A73		0xD09
> +#define ARM_CPU_PART_CORTEX_A75		0xD0A
> +#define ARM_CPU_PART_CORTEX_A35		0xD04
> +#define ARM_CPU_PART_CORTEX_A55		0xD05
> +#define ARM_CPU_PART_CORTEX_A76		0xD0B
> +#define ARM_CPU_PART_NEOVERSE_N1	0xD0C
> +#define ARM_CPU_PART_CORTEX_A77		0xD0D
> +#define ARM_CPU_PART_NEOVERSE_V1	0xD40
> +#define ARM_CPU_PART_CORTEX_A78		0xD41
> +#define ARM_CPU_PART_CORTEX_X1		0xD44
> +#define ARM_CPU_PART_CORTEX_A510	0xD46
> +#define ARM_CPU_PART_CORTEX_A710	0xD47
> +#define ARM_CPU_PART_CORTEX_X2		0xD48
> +#define ARM_CPU_PART_NEOVERSE_N2	0xD49
> +#define ARM_CPU_PART_CORTEX_A78C	0xD4B
> +
> +#define APM_CPU_PART_POTENZA		0x000
> +
> +#define CAVIUM_CPU_PART_THUNDERX	0x0A1
> +#define CAVIUM_CPU_PART_THUNDERX_81XX	0x0A2
> +#define CAVIUM_CPU_PART_THUNDERX_83XX	0x0A3
> +#define CAVIUM_CPU_PART_THUNDERX2	0x0AF
> +/* OcteonTx2 series */
> +#define CAVIUM_CPU_PART_OCTX2_98XX	0x0B1
> +#define CAVIUM_CPU_PART_OCTX2_96XX	0x0B2
> +#define CAVIUM_CPU_PART_OCTX2_95XX	0x0B3
> +#define CAVIUM_CPU_PART_OCTX2_95XXN	0x0B4
> +#define CAVIUM_CPU_PART_OCTX2_95XXMM	0x0B5
> +#define CAVIUM_CPU_PART_OCTX2_95XXO	0x0B6
> +
> +#define BRCM_CPU_PART_BRAHMA_B53	0x100
> +#define BRCM_CPU_PART_VULCAN		0x516
> +
> +#define QCOM_CPU_PART_FALKOR_V1		0x800
> +#define QCOM_CPU_PART_FALKOR		0xC00
> +#define QCOM_CPU_PART_KRYO		0x200
> +#define QCOM_CPU_PART_KRYO_2XX_GOLD	0x800
> +#define QCOM_CPU_PART_KRYO_2XX_SILVER	0x801
> +#define QCOM_CPU_PART_KRYO_3XX_SILVER	0x803
> +#define QCOM_CPU_PART_KRYO_4XX_GOLD	0x804
> +#define QCOM_CPU_PART_KRYO_4XX_SILVER	0x805
> +
> +#define NVIDIA_CPU_PART_DENVER		0x003
> +#define NVIDIA_CPU_PART_CARMEL		0x004
> +
> +#define FUJITSU_CPU_PART_A64FX		0x001
> +
> +#define HISI_CPU_PART_TSV110		0xD01
> +
> +#define APPLE_CPU_PART_M1_ICESTORM	0x022
> +#define APPLE_CPU_PART_M1_FIRESTORM	0x023
> +
> +#define MIDR_CORTEX_A53 MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_A53)
> +#define MIDR_CORTEX_A57 MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_A57)
> +#define MIDR_CORTEX_A72 MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_A72)
> +#define MIDR_CORTEX_A73 MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_A73)
> +#define MIDR_CORTEX_A75 MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_A75)
> +#define MIDR_CORTEX_A35 MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_A35)
> +#define MIDR_CORTEX_A55 MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_A55)
> +#define MIDR_CORTEX_A76	MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_A76)
> +#define MIDR_NEOVERSE_N1 MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_NEOVERSE_N1)
> +#define MIDR_CORTEX_A77	MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_A77)
> +#define MIDR_NEOVERSE_V1	MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_NEOVERSE_V1)
> +#define MIDR_CORTEX_A78	MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_A78)
> +#define MIDR_CORTEX_X1	MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_X1)
> +#define MIDR_CORTEX_A510 MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_A510)
> +#define MIDR_CORTEX_A710 MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_A710)
> +#define MIDR_CORTEX_X2 MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_X2)
> +#define MIDR_NEOVERSE_N2 MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_NEOVERSE_N2)
> +#define MIDR_CORTEX_A78C	MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_A78C)
> +#define MIDR_THUNDERX	MIDR_CPU_MODEL(ARM_CPU_IMP_CAVIUM, CAVIUM_CPU_PART_THUNDERX)
> +#define MIDR_THUNDERX_81XX MIDR_CPU_MODEL(ARM_CPU_IMP_CAVIUM, CAVIUM_CPU_PART_THUNDERX_81XX)
> +#define MIDR_THUNDERX_83XX MIDR_CPU_MODEL(ARM_CPU_IMP_CAVIUM, CAVIUM_CPU_PART_THUNDERX_83XX)
> +#define MIDR_OCTX2_98XX MIDR_CPU_MODEL(ARM_CPU_IMP_CAVIUM, CAVIUM_CPU_PART_OCTX2_98XX)
> +#define MIDR_OCTX2_96XX MIDR_CPU_MODEL(ARM_CPU_IMP_CAVIUM, CAVIUM_CPU_PART_OCTX2_96XX)
> +#define MIDR_OCTX2_95XX MIDR_CPU_MODEL(ARM_CPU_IMP_CAVIUM, CAVIUM_CPU_PART_OCTX2_95XX)
> +#define MIDR_OCTX2_95XXN MIDR_CPU_MODEL(ARM_CPU_IMP_CAVIUM, CAVIUM_CPU_PART_OCTX2_95XXN)
> +#define MIDR_OCTX2_95XXMM MIDR_CPU_MODEL(ARM_CPU_IMP_CAVIUM, CAVIUM_CPU_PART_OCTX2_95XXMM)
> +#define MIDR_OCTX2_95XXO MIDR_CPU_MODEL(ARM_CPU_IMP_CAVIUM, CAVIUM_CPU_PART_OCTX2_95XXO)
> +#define MIDR_CAVIUM_THUNDERX2 MIDR_CPU_MODEL(ARM_CPU_IMP_CAVIUM, CAVIUM_CPU_PART_THUNDERX2)
> +#define MIDR_BRAHMA_B53 MIDR_CPU_MODEL(ARM_CPU_IMP_BRCM, BRCM_CPU_PART_BRAHMA_B53)
> +#define MIDR_BRCM_VULCAN MIDR_CPU_MODEL(ARM_CPU_IMP_BRCM, BRCM_CPU_PART_VULCAN)
> +#define MIDR_QCOM_FALKOR_V1 MIDR_CPU_MODEL(ARM_CPU_IMP_QCOM, QCOM_CPU_PART_FALKOR_V1)
> +#define MIDR_QCOM_FALKOR MIDR_CPU_MODEL(ARM_CPU_IMP_QCOM, QCOM_CPU_PART_FALKOR)
> +#define MIDR_QCOM_KRYO MIDR_CPU_MODEL(ARM_CPU_IMP_QCOM, QCOM_CPU_PART_KRYO)
> +#define MIDR_QCOM_KRYO_2XX_GOLD MIDR_CPU_MODEL(ARM_CPU_IMP_QCOM, QCOM_CPU_PART_KRYO_2XX_GOLD)
> +#define MIDR_QCOM_KRYO_2XX_SILVER MIDR_CPU_MODEL(ARM_CPU_IMP_QCOM, QCOM_CPU_PART_KRYO_2XX_SILVER)
> +#define MIDR_QCOM_KRYO_3XX_SILVER MIDR_CPU_MODEL(ARM_CPU_IMP_QCOM, QCOM_CPU_PART_KRYO_3XX_SILVER)
> +#define MIDR_QCOM_KRYO_4XX_GOLD MIDR_CPU_MODEL(ARM_CPU_IMP_QCOM, QCOM_CPU_PART_KRYO_4XX_GOLD)
> +#define MIDR_QCOM_KRYO_4XX_SILVER MIDR_CPU_MODEL(ARM_CPU_IMP_QCOM, QCOM_CPU_PART_KRYO_4XX_SILVER)
> +#define MIDR_NVIDIA_DENVER MIDR_CPU_MODEL(ARM_CPU_IMP_NVIDIA, NVIDIA_CPU_PART_DENVER)
> +#define MIDR_NVIDIA_CARMEL MIDR_CPU_MODEL(ARM_CPU_IMP_NVIDIA, NVIDIA_CPU_PART_CARMEL)
> +#define MIDR_FUJITSU_A64FX MIDR_CPU_MODEL(ARM_CPU_IMP_FUJITSU, FUJITSU_CPU_PART_A64FX)
> +#define MIDR_HISI_TSV110 MIDR_CPU_MODEL(ARM_CPU_IMP_HISI, HISI_CPU_PART_TSV110)
> +#define MIDR_APPLE_M1_ICESTORM MIDR_CPU_MODEL(ARM_CPU_IMP_APPLE, APPLE_CPU_PART_M1_ICESTORM)
> +#define MIDR_APPLE_M1_FIRESTORM MIDR_CPU_MODEL(ARM_CPU_IMP_APPLE, APPLE_CPU_PART_M1_FIRESTORM)
> +
> +/* Fujitsu Erratum 010001 affects A64FX 1.0 and 1.1, (v0r0 and v1r0) */
> +#define MIDR_FUJITSU_ERRATUM_010001		MIDR_FUJITSU_A64FX
> +#define MIDR_FUJITSU_ERRATUM_010001_MASK	(~MIDR_CPU_VAR_REV(1, 0))
> +#define TCR_CLEAR_FUJITSU_ERRATUM_010001	(TCR_NFD1 | TCR_NFD0)
> +
> +#ifndef __ASSEMBLY__
> +
> +#include "sysreg.h"
> +
> +#define read_cpuid(reg)			read_sysreg_s(SYS_ ## reg)
> +
> +/*
> + * Represent a range of MIDR values for a given CPU model and a
> + * range of variant/revision values.
> + *
> + * @model	- CPU model as defined by MIDR_CPU_MODEL
> + * @rv_min	- Minimum value for the revision/variant as defined by
> + *		  MIDR_CPU_VAR_REV
> + * @rv_max	- Maximum value for the variant/revision for the range.
> + */
> +struct midr_range {
> +	u32 model;
> +	u32 rv_min;
> +	u32 rv_max;
> +};
> +
> +#define MIDR_RANGE(m, v_min, r_min, v_max, r_max)		\
> +	{							\
> +		.model = m,					\
> +		.rv_min = MIDR_CPU_VAR_REV(v_min, r_min),	\
> +		.rv_max = MIDR_CPU_VAR_REV(v_max, r_max),	\
> +	}
> +
> +#define MIDR_REV_RANGE(m, v, r_min, r_max) MIDR_RANGE(m, v, r_min, v, r_max)
> +#define MIDR_REV(m, v, r) MIDR_RANGE(m, v, r, v, r)
> +#define MIDR_ALL_VERSIONS(m) MIDR_RANGE(m, 0, 0, 0xf, 0xf)
> +
> +static inline bool midr_is_cpu_model_range(u32 midr, u32 model, u32 rv_min,
> +					   u32 rv_max)
> +{
> +	u32 _model = midr & MIDR_CPU_MODEL_MASK;
> +	u32 rv = midr & (MIDR_REVISION_MASK | MIDR_VARIANT_MASK);
> +
> +	return _model == model && rv >= rv_min && rv <= rv_max;
> +}
> +
> +static inline bool is_midr_in_range(u32 midr, struct midr_range const *range)
> +{
> +	return midr_is_cpu_model_range(midr, range->model,
> +				       range->rv_min, range->rv_max);
> +}
> +
> +static inline bool
> +is_midr_in_range_list(u32 midr, struct midr_range const *ranges)
> +{
> +	while (ranges->model)
> +		if (is_midr_in_range(midr, ranges++))
> +			return true;
> +	return false;
> +}
> +
> +/*
> + * The CPU ID never changes at run time, so we might as well tell the
> + * compiler that it's constant.  Use this function to read the CPU ID
> + * rather than directly reading processor_id or read_cpuid() directly.
> + */
> +static inline u32 __attribute_const__ read_cpuid_id(void)
> +{
> +	return read_cpuid(MIDR_EL1);
> +}
> +
> +static inline u64 __attribute_const__ read_cpuid_mpidr(void)
> +{
> +	return read_cpuid(MPIDR_EL1);
> +}
> +
> +static inline unsigned int __attribute_const__ read_cpuid_implementor(void)
> +{
> +	return MIDR_IMPLEMENTOR(read_cpuid_id());
> +}
> +
> +static inline unsigned int __attribute_const__ read_cpuid_part_number(void)
> +{
> +	return MIDR_PARTNUM(read_cpuid_id());
> +}
> +
> +static inline u32 __attribute_const__ read_cpuid_cachetype(void)
> +{
> +	return read_cpuid(CTR_EL0);
> +}
> +#endif /* __ASSEMBLY__ */
> +
> +#endif
> -- 
> 2.32.0

-- 

- Arnaldo

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v4 1/4] tools: arm64: Import cputype.h
  2022-03-25 18:39     ` Arnaldo Carvalho de Melo
@ 2022-03-25 18:58       ` Ali Saidi
  -1 siblings, 0 replies; 66+ messages in thread
From: Ali Saidi @ 2022-03-25 18:58 UTC (permalink / raw)
  To: acme
  Cc: Nick.Forrington, alexander.shishkin, alisaidi, andrew.kilroy,
	benh, german.gomez, james.clark, john.garry, jolsa, kjain,
	leo.yan, lihuafei1, linux-arm-kernel, linux-kernel,
	linux-perf-users, mark.rutland, mathieu.poirier, mingo, namhyung,
	peterz, will


Hi Arnaldo,

On Fri, 25 Mar 2022 18:39:44 -0000, Arnaldo Carvalho de Melo wrote:
> Em Thu, Mar 24, 2022 at 06:33:20PM +0000, Ali Saidi escreveu:
> > Bring-in the kernel's arch/arm64/include/asm/cputype.h into tools/
> > for arm64 to make use of all the core-type definitions in perf.
> >
> > Replace sysreg.h with the version already imported into tools/.
> 
> You forgot to add it to tools/perf/check-headers.sh so that we get
> notificed when the original file in the kernel sources gets updated, so
> that we can check if this needs any tooling adjustments.

Sorry.

> ⬢[acme@toolbox perf]$ diff -u tools/arch/arm64/include/asm/cputype.h arch/arm64/include/asm/cputype.h
> --- tools/arch/arm64/include/asm/cputype.h	2022-03-25 15:29:41.185173403 -0300
> +++ arch/arm64/include/asm/cputype.h	2022-03-22 17:52:10.881311839 -0300
> @@ -170,7 +170,7 @@
> 
>  #ifndef __ASSEMBLY__
> 
> -#include "sysreg.h"
> +#include <asm/sysreg.h>
> 
>  #define read_cpuid(reg)			read_sysreg_s(SYS_ ## reg)
> 
> ⬢[acme@toolbox perf]$
> 
> 
> I'll add the entry together with the waiver for this specific
> difference.

Thank you! 

It looks like it's been missed several times:
% find  tools/arch/arm64 -type f
tools/arch/arm64/include/uapi/asm/unistd.h
tools/arch/arm64/include/uapi/asm/bpf_perf_event.h
tools/arch/arm64/include/uapi/asm/kvm.h
tools/arch/arm64/include/uapi/asm/mman.h
tools/arch/arm64/include/uapi/asm/perf_regs.h
tools/arch/arm64/include/uapi/asm/bitsperlong.h
tools/arch/arm64/include/asm/barrier.h
tools/arch/arm64/include/asm/cputype.h
tools/arch/arm64/include/asm/sysreg.h

% grep arm64 tools/perf/check-headers.sh
arch/arm64/include/uapi/asm/perf_regs.h
arch/arm64/include/uapi/asm/kvm.h
arch/arm64/include/uapi/asm/unistd.h
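
The manual find/grep comparison above can be scripted. A hedged sketch — it
uses stand-in files and a stand-in check-headers.sh rather than the real tree,
so the paths are illustrative only:

```shell
# List header copies under tools/ that check-headers.sh does not track.
set -e
tmp=$(mktemp -d)
mkdir -p "$tmp/tools/arch/arm64/include/asm"
touch "$tmp/tools/arch/arm64/include/asm/cputype.h" \
      "$tmp/tools/arch/arm64/include/asm/barrier.h"
# Stand-in check-headers.sh that tracks only barrier.h
printf 'check arch/arm64/include/asm/barrier.h\n' > "$tmp/check-headers.sh"
cd "$tmp"
untracked=""
for f in $(find tools/arch/arm64 -type f -name '*.h' | sort); do
	rel=${f#tools/}
	grep -q "$rel" check-headers.sh || untracked="$untracked $rel"
done
echo "untracked:$untracked"
```

Run against a real tree (with the grep pointed at tools/perf/check-headers.sh),
this would flag cputype.h, barrier.h, sysreg.h and the other copies listed
above.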


Thanks,
Ali


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v4 1/4] tools: arm64: Import cputype.h
  2022-03-25 18:39     ` Arnaldo Carvalho de Melo
@ 2022-03-25 19:42       ` Arnaldo Carvalho de Melo
  -1 siblings, 0 replies; 66+ messages in thread
From: Arnaldo Carvalho de Melo @ 2022-03-25 19:42 UTC (permalink / raw)
  To: Ali Saidi
  Cc: linux-kernel, linux-perf-users, linux-arm-kernel, german.gomez,
	leo.yan, benh, Nick.Forrington, alexander.shishkin,
	andrew.kilroy, james.clark, john.garry, jolsa, kjain, lihuafei1,
	mark.rutland, mathieu.poirier, mingo, namhyung, peterz, will

Em Fri, Mar 25, 2022 at 03:39:44PM -0300, Arnaldo Carvalho de Melo escreveu:
> Em Thu, Mar 24, 2022 at 06:33:20PM +0000, Ali Saidi escreveu:
> > Bring-in the kernel's arch/arm64/include/asm/cputype.h into tools/
> > for arm64 to make use of all the core-type definitions in perf.

> > Replace sysreg.h with the version already imported into tools/.
 
> You forgot to add it to tools/perf/check-headers.sh so that we get
> notificed when the original file in the kernel sources gets updated, so
> that we can check if this needs any tooling adjustments.
 
> I'll add the entry together with the waiver for this specific
> difference.

This:

diff --git a/tools/perf/check-headers.sh b/tools/perf/check-headers.sh
index 30ecf3a0f68b6830..6ee44b18c6b57cf1 100755
--- a/tools/perf/check-headers.sh
+++ b/tools/perf/check-headers.sh
@@ -146,6 +146,7 @@ done
 check arch/x86/lib/memcpy_64.S        '-I "^EXPORT_SYMBOL" -I "^#include <asm/export.h>" -I"^SYM_FUNC_START\(_LOCAL\)*(memcpy_\(erms\|orig\))"'
 check arch/x86/lib/memset_64.S        '-I "^EXPORT_SYMBOL" -I "^#include <asm/export.h>" -I"^SYM_FUNC_START\(_LOCAL\)*(memset_\(erms\|orig\))"'
 check arch/x86/include/asm/amd-ibs.h  '-I "^#include [<\"]\(asm/\)*msr-index.h"'
+check arch/arm64/include/asm/cputype.h '-I "^#include [<\"]\(asm/\)*sysreg.h"'
 check include/uapi/asm-generic/mman.h '-I "^#include <\(uapi/\)*asm-generic/mman-common\(-tools\)*.h>"'
 check include/uapi/linux/mman.h       '-I "^#include <\(uapi/\)*asm/mman.h>"'
 check include/linux/build_bug.h       '-I "^#\(ifndef\|endif\)\( \/\/\)* static_assert$"'


Cheers,

- Arnaldo

^ permalink raw reply related	[flat|nested] 66+ messages in thread

* Re: [PATCH v4 1/4] tools: arm64: Import cputype.h
  2022-03-25 19:42       ` Arnaldo Carvalho de Melo
@ 2022-03-26  5:49         ` Leo Yan
  -1 siblings, 0 replies; 66+ messages in thread
From: Leo Yan @ 2022-03-26  5:49 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: Ali Saidi, linux-kernel, linux-perf-users, linux-arm-kernel,
	german.gomez, benh, Nick.Forrington, alexander.shishkin,
	andrew.kilroy, james.clark, john.garry, jolsa, kjain, lihuafei1,
	mark.rutland, mathieu.poirier, mingo, namhyung, peterz, will

Hi Arnaldo, Ali,

On Fri, Mar 25, 2022 at 04:42:32PM -0300, Arnaldo Carvalho de Melo wrote:
> Em Fri, Mar 25, 2022 at 03:39:44PM -0300, Arnaldo Carvalho de Melo escreveu:
> > Em Thu, Mar 24, 2022 at 06:33:20PM +0000, Ali Saidi escreveu:
> > > Bring-in the kernel's arch/arm64/include/asm/cputype.h into tools/
> > > for arm64 to make use of all the core-type definitions in perf.
> 
> > > Replace sysreg.h with the version already imported into tools/.
>  
> > You forgot to add it to tools/perf/check-headers.sh so that we get
> > notificed when the original file in the kernel sources gets updated, so
> > that we can check if this needs any tooling adjustments.
>  
> > I'll add the entry together with the waiver for this specific
> > difference.
> 
> This:
> 
> diff --git a/tools/perf/check-headers.sh b/tools/perf/check-headers.sh
> index 30ecf3a0f68b6830..6ee44b18c6b57cf1 100755
> --- a/tools/perf/check-headers.sh
> +++ b/tools/perf/check-headers.sh
> @@ -146,6 +146,7 @@ done
>  check arch/x86/lib/memcpy_64.S        '-I "^EXPORT_SYMBOL" -I "^#include <asm/export.h>" -I"^SYM_FUNC_START\(_LOCAL\)*(memcpy_\(erms\|orig\))"'
>  check arch/x86/lib/memset_64.S        '-I "^EXPORT_SYMBOL" -I "^#include <asm/export.h>" -I"^SYM_FUNC_START\(_LOCAL\)*(memset_\(erms\|orig\))"'
>  check arch/x86/include/asm/amd-ibs.h  '-I "^#include [<\"]\(asm/\)*msr-index.h"'
> +check arch/arm64/include/asm/cputype.h '-I "^#include [<\"]\(asm/\)*sysreg.h"'
>  check include/uapi/asm-generic/mman.h '-I "^#include <\(uapi/\)*asm-generic/mman-common\(-tools\)*.h>"'
>  check include/uapi/linux/mman.h       '-I "^#include <\(uapi/\)*asm/mman.h>"'
>  check include/linux/build_bug.h       '-I "^#\(ifndef\|endif\)\( \/\/\)* static_assert$"'

LGTM.  I tested this on both my x86 and Arm64 platforms; thanks for
the fix-up.

Thanks,
Leo

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v4 4/4] perf mem: Support HITM for when mem_lvl_num is any
  2022-03-24 18:33   ` Ali Saidi
@ 2022-03-26  6:23     ` Leo Yan
  -1 siblings, 0 replies; 66+ messages in thread
From: Leo Yan @ 2022-03-26  6:23 UTC (permalink / raw)
  To: Ali Saidi
  Cc: linux-kernel, linux-perf-users, linux-arm-kernel, german.gomez,
	acme, benh, Nick.Forrington, alexander.shishkin, andrew.kilroy,
	james.clark, john.garry, jolsa, kjain, lihuafei1, mark.rutland,
	mathieu.poirier, mingo, namhyung, peterz, will

On Thu, Mar 24, 2022 at 06:33:23PM +0000, Ali Saidi wrote:
> For loads that hit in a the LLC snoop filter and are fulfilled from a
> higher level cache on arm64 Neoverse cores, it's not usually clear what
> the true level of the cache the data came from (i.e. a transfer from a
> core could come from it's L1 or L2). Instead of making an assumption of
> where the line came from, add support for incrementing HITM if the
> source is CACHE_ANY.
> 
> Since other architectures don't seem to populate the mem_lvl_num field
> here there shouldn't be a change in functionality.
> 
> Signed-off-by: Ali Saidi <alisaidi@amazon.com>
> Tested-by: German Gomez <german.gomez@arm.com>
> Reviewed-by: German Gomez <german.gomez@arm.com>
> ---
>  tools/perf/util/mem-events.c | 9 +++++++++
>  1 file changed, 9 insertions(+)
> 
> diff --git a/tools/perf/util/mem-events.c b/tools/perf/util/mem-events.c
> index e5e405185498..084977cfebef 100644
> --- a/tools/perf/util/mem-events.c
> +++ b/tools/perf/util/mem-events.c
> @@ -539,6 +539,15 @@ do {				\
>  					stats->ld_llchit++;
>  			}
>  
> +			/*
> +			 * A hit in another cores cache must mean a llc snoop
> +			 * filter hit
> +			 */
> +			if (lnum == P(LVLNUM, ANY_CACHE)) {
> +				if (snoop & P(SNOOP, HITM))
> +					HITM_INC(lcl_hitm);
> +			}

This might break the memory profiling result for x86, see file
arch/x86/events/intel/ds.c:

  97 void __init intel_pmu_pebs_data_source_skl(bool pmem)
  98 {
  99         u64 pmem_or_l4 = pmem ? LEVEL(PMEM) : LEVEL(L4);
  ...
 105         pebs_data_source[0x0d] = OP_LH | LEVEL(ANY_CACHE) | REM | P(SNOOP, HITM);
 106 }

This means an access can be remote while the cache level is still
ANY_CACHE, so it's worth also checking the PERF_MEM_REMOTE_REMOTE
bit:

	u64 remote = data_src->mem_remote;

	/*
	 * A hit in another cores cache must mean a llc snoop
	 * filter hit
	 */
	if (lnum == P(LVLNUM, ANY_CACHE) && remote != P(REMOTE, REMOTE)) {
	        if (snoop & P(SNOOP, HITM))
	                HITM_INC(lcl_hitm);
	}

I appreciate German's reviewing and testing, and sorry I jumped in
very late.

Thanks,
Leo

> +
>  			if (lvl & P(LVL, LOC_RAM) || lnum == P(LVLNUM, RAM)) {
>  				stats->lcl_dram++;
>  				if (snoop & P(SNOOP, HIT))
> -- 
> 2.32.0
> 

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v4 4/4] perf mem: Support HITM for when mem_lvl_num is any
  2022-03-26  6:23     ` Leo Yan
@ 2022-03-26 13:30       ` Arnaldo Carvalho de Melo
  -1 siblings, 0 replies; 66+ messages in thread
From: Arnaldo Carvalho de Melo @ 2022-03-26 13:30 UTC (permalink / raw)
  To: Leo Yan
  Cc: Ali Saidi, linux-kernel, linux-perf-users, linux-arm-kernel,
	german.gomez, benh, Nick.Forrington, alexander.shishkin,
	andrew.kilroy, james.clark, john.garry, jolsa, kjain, lihuafei1,
	mark.rutland, mathieu.poirier, mingo, namhyung, peterz, will

Em Sat, Mar 26, 2022 at 02:23:03PM +0800, Leo Yan escreveu:
> On Thu, Mar 24, 2022 at 06:33:23PM +0000, Ali Saidi wrote:
> > For loads that hit in a the LLC snoop filter and are fulfilled from a
> > higher level cache on arm64 Neoverse cores, it's not usually clear what
> > the true level of the cache the data came from (i.e. a transfer from a
> > core could come from it's L1 or L2). Instead of making an assumption of
> > where the line came from, add support for incrementing HITM if the
> > source is CACHE_ANY.
> > 
> > Since other architectures don't seem to populate the mem_lvl_num field
> > here there shouldn't be a change in functionality.
> > 
> > Signed-off-by: Ali Saidi <alisaidi@amazon.com>
> > Tested-by: German Gomez <german.gomez@arm.com>
> > Reviewed-by: German Gomez <german.gomez@arm.com>
> > ---
> >  tools/perf/util/mem-events.c | 9 +++++++++
> >  1 file changed, 9 insertions(+)
> > 
> > diff --git a/tools/perf/util/mem-events.c b/tools/perf/util/mem-events.c
> > index e5e405185498..084977cfebef 100644
> > --- a/tools/perf/util/mem-events.c
> > +++ b/tools/perf/util/mem-events.c
> > @@ -539,6 +539,15 @@ do {				\
> >  					stats->ld_llchit++;
> >  			}
> >  
> > +			/*
> > +			 * A hit in another cores cache must mean a llc snoop
> > +			 * filter hit
> > +			 */
> > +			if (lnum == P(LVLNUM, ANY_CACHE)) {
> > +				if (snoop & P(SNOOP, HITM))
> > +					HITM_INC(lcl_hitm);
> > +			}
> 
> This might break the memory profiling result for x86, see file
> arch/x86/events/intel/ds.c:
> 
>   97 void __init intel_pmu_pebs_data_source_skl(bool pmem)
>   98 {
>   99         u64 pmem_or_l4 = pmem ? LEVEL(PMEM) : LEVEL(L4);
>   ...
>  105         pebs_data_source[0x0d] = OP_LH | LEVEL(ANY_CACHE) | REM | P(SNOOP, HITM);
>  106 }
> 
> Which means that it's possible that it's a remote access and the cache
> level is ANY_CACHE, it's good to add checking for bit
> PERF_MEM_REMOTE_REMOTE:
> 
> 	u64 remote = data_src->mem_remote;
> 
> 	/*
> 	 * A hit in another cores cache must mean a llc snoop
> 	 * filter hit
> 	 */
> 	if (lnum == P(LVLNUM, ANY_CACHE) && remote != P(REMOTE, REMOTE)) {
> 	        if (snoop & P(SNOOP, HITM))
> 	                HITM_INC(lcl_hitm);
> 	}
> 
> Appreciate German's reviewing and testing, and sorry I jumped in very
> late.

I have not published this on perf/core, it's just in tmp.perf/core while
tests ran, so I'll remove this specific patch and rerun the tests. Thanks
for reviewing.

- Arnaldo

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v4 2/4] perf arm-spe: Use SPE data source for neoverse cores
  2022-03-24 18:33   ` Ali Saidi
@ 2022-03-26 13:47     ` Leo Yan
  -1 siblings, 0 replies; 66+ messages in thread
From: Leo Yan @ 2022-03-26 13:47 UTC (permalink / raw)
  To: Ali Saidi
  Cc: linux-kernel, linux-perf-users, linux-arm-kernel, german.gomez,
	acme, benh, Nick.Forrington, alexander.shishkin, andrew.kilroy,
	james.clark, john.garry, jolsa, kjain, lihuafei1, mark.rutland,
	mathieu.poirier, mingo, namhyung, peterz, will

Hi Ali, German,

On Thu, Mar 24, 2022 at 06:33:21PM +0000, Ali Saidi wrote:

[...]

> +static void arm_spe__synth_data_source_neoverse(const struct arm_spe_record *record,
> +						union perf_mem_data_src *data_src)
>  {
> -	union perf_mem_data_src	data_src = { 0 };
> +	/*
> +	 * Even though four levels of cache hierarchy are possible, no known
> +	 * production Neoverse systems currently include more than three levels
> +	 * so for the time being we assume three exist. If a production system
> +	 * is built with four the this function would have to be changed to
> +	 * detect the number of levels for reporting.
> +	 */
>  
> -	if (record->op == ARM_SPE_LD)
> -		data_src.mem_op = PERF_MEM_OP_LOAD;
> -	else
> -		data_src.mem_op = PERF_MEM_OP_STORE;

Firstly, I apologize that I didn't give clear guidance when Ali sent
patch sets v2 and v3.

IMHO, we need to consider two kinds of information that can guide us
toward a reliable implementation.  The first is to summarize the data
source configuration for x86 PEBS, so we can dive into more detail for
this part; the second is to refer to the AMBA architecture document
ARM IHI 0050E.b, section 11.1.2 'Crossing a chip-to-chip interface' and
its sub-section 'Suggested DataSource values', which would help us
greatly in mapping the cache topology to the Arm SPE data source.

As a result, I summarized the data source configurations for PEBS and
Arm SPE Neoverse in the spreadsheet:
https://docs.google.com/spreadsheets/d/11YmjG0TyRjH7IXgvsREFgTg3AVtxh2dvLloRK1EdNjU/edit?usp=sharing

Please see below comments.

> +	switch (record->source) {
> +	case ARM_SPE_NV_L1D:
> +		data_src->mem_lvl = PERF_MEM_LVL_HIT;
> +		data_src->mem_lvl_num = PERF_MEM_LVLNUM_L1;
> +		break;

I think we need to set the field 'mem_snoop' for L1 cache hit:

        data_src->mem_snoop = PERF_MEM_SNOOP_NONE;

For L1 cache hit, it doesn't involve snooping.

> +	case ARM_SPE_NV_L2:
> +		data_src->mem_lvl = PERF_MEM_LVL_HIT;
> +		data_src->mem_lvl_num = PERF_MEM_LVLNUM_L2;
> +		break;

Ditto:

        data_src->mem_snoop = PERF_MEM_SNOOP_NONE;

> +	case ARM_SPE_NV_PEER_CORE:
> +		data_src->mem_lvl = PERF_MEM_LVL_HIT;
> +		data_src->mem_snoop = PERF_MEM_SNOOP_HITM;
> +		data_src->mem_lvl_num = PERF_MEM_LVLNUM_ANY_CACHE;

A peer core contains its local L1 cache, so I think we can set the
memory level to L1 to indicate this case.

For this data source type and the types below, though they indicate
that snooping happens, it doesn't mean the data in the cache line is
in the 'modified' state.  If we set the flag PERF_MEM_SNOOP_HITM, I
personally think this will mislead users when reporting the result.

I prefer we set below fields for ARM_SPE_NV_PEER_CORE:

        data_src->mem_lvl = PERF_MEM_LVL_HIT | PERF_MEM_LVL_L1;
        data_src->mem_snoop = PERF_MEM_SNOOP_HIT;
        data_src->mem_lvl_num = PERF_MEM_LVLNUM_L1;

> +		break;
> +	/*
> +	 * We don't know if this is L1, L2 but we do know it was a cache-2-cache
> +	 * transfer, so set SNOOP_HITM
> +	 */
> +	case ARM_SPE_NV_LCL_CLSTR:

For ARM_SPE_NV_LCL_CLSTR, the data is fetched from the shared cache at
the cluster level, which should be the L2 cache:

        data_src->mem_lvl = PERF_MEM_LVL_HIT | PERF_MEM_LVL_L2;
        data_src->mem_snoop = PERF_MEM_SNOOP_HIT;
        data_src->mem_lvl_num = PERF_MEM_LVLNUM_L2;

> +	case ARM_SPE_NV_PEER_CLSTR:
> +		data_src->mem_lvl = PERF_MEM_LVL_HIT;
> +		data_src->mem_snoop = PERF_MEM_SNOOP_HITM;
> +		data_src->mem_lvl_num = PERF_MEM_LVLNUM_ANY_CACHE;
> +		break;

This type can snoop from the L1 or L2 cache in the peer cluster, so it
makes sense to set the cache level to PERF_MEM_LVLNUM_ANY_CACHE.  But
here we should use the snoop type PERF_MEM_SNOOP_HIT, so:

        data_src->mem_lvl = PERF_MEM_LVL_HIT;
        data_src->mem_snoop = PERF_MEM_SNOOP_HIT;
        data_src->mem_lvl_num = PERF_MEM_LVLNUM_ANY_CACHE;

> +	/*
> +	 * System cache is assumed to be L3
> +	 */
> +	case ARM_SPE_NV_SYS_CACHE:
> +		data_src->mem_lvl = PERF_MEM_LVL_HIT;
> +		data_src->mem_lvl_num = PERF_MEM_LVLNUM_L3;
> +		break;

        data_src->mem_lvl = PERF_MEM_LVL_HIT | PERF_MEM_LVL_L3;
        data_src->mem_snoop = PERF_MEM_SNOOP_HIT;
        data_src->mem_lvl_num = PERF_MEM_LVLNUM_L3;

> +	/*
> +	 * We don't know what level it hit in, except it came from the other
> +	 * socket
> +	 */
> +	case ARM_SPE_NV_REMOTE:
> +		data_src->mem_snoop = PERF_MEM_SNOOP_HITM;
> +		data_src->mem_remote = PERF_MEM_REMOTE_REMOTE;
> +		break;

The type ARM_SPE_NV_REMOTE is a snooping operation and it can happen
at any cache level in the remote chip:

        data_src->mem_lvl = PERF_MEM_LVL_HIT;
        data_src->mem_snoop = PERF_MEM_SNOOP_HIT;
        data_src->mem_remote = PERF_MEM_REMOTE_REMOTE;
        data_src->mem_lvl_num = PERF_MEM_LVLNUM_ANY_CACHE;

> +	case ARM_SPE_NV_DRAM:
> +		data_src->mem_lvl = PERF_MEM_LVL_HIT;
> +		data_src->mem_lvl_num = PERF_MEM_LVLNUM_RAM;
> +		break;

We can set the snoop type to PERF_MEM_SNOOP_MISS for the DRAM data source:

        data_src->mem_lvl = PERF_MEM_LVL_HIT;
        data_src->mem_snoop = PERF_MEM_SNOOP_MISS;
        data_src->mem_lvl_num = PERF_MEM_LVLNUM_RAM;

The rest of this patch looks good to me.

Thanks,
Leo

> +	default:
> +		break;
> +	}
> +}
>  
> +static void arm_spe__synth_data_source_generic(const struct arm_spe_record *record,
> +						union perf_mem_data_src *data_src)
> +{
>  	if (record->type & (ARM_SPE_LLC_ACCESS | ARM_SPE_LLC_MISS)) {
> -		data_src.mem_lvl = PERF_MEM_LVL_L3;
> +		data_src->mem_lvl = PERF_MEM_LVL_L3;
>  
>  		if (record->type & ARM_SPE_LLC_MISS)
> -			data_src.mem_lvl |= PERF_MEM_LVL_MISS;
> +			data_src->mem_lvl |= PERF_MEM_LVL_MISS;
>  		else
> -			data_src.mem_lvl |= PERF_MEM_LVL_HIT;
> +			data_src->mem_lvl |= PERF_MEM_LVL_HIT;
>  	} else if (record->type & (ARM_SPE_L1D_ACCESS | ARM_SPE_L1D_MISS)) {
> -		data_src.mem_lvl = PERF_MEM_LVL_L1;
> +		data_src->mem_lvl = PERF_MEM_LVL_L1;
>  
>  		if (record->type & ARM_SPE_L1D_MISS)
> -			data_src.mem_lvl |= PERF_MEM_LVL_MISS;
> +			data_src->mem_lvl |= PERF_MEM_LVL_MISS;
>  		else
> -			data_src.mem_lvl |= PERF_MEM_LVL_HIT;
> +			data_src->mem_lvl |= PERF_MEM_LVL_HIT;
>  	}
>  
>  	if (record->type & ARM_SPE_REMOTE_ACCESS)
> -		data_src.mem_lvl |= PERF_MEM_LVL_REM_CCE1;
> +		data_src->mem_lvl |= PERF_MEM_LVL_REM_CCE1;
> +}
> +
> +static u64 arm_spe__synth_data_source(const struct arm_spe_record *record, u64 midr)
> +{
> +	union perf_mem_data_src	data_src = { 0 };
> +	bool is_neoverse = is_midr_in_range(midr, neoverse_spe);
> +
> +	if (record->op & ARM_SPE_LD)
> +		data_src.mem_op = PERF_MEM_OP_LOAD;
> +	else
> +		data_src.mem_op = PERF_MEM_OP_STORE;
> +
> +	if (is_neoverse)
> +		arm_spe__synth_data_source_neoverse(record, &data_src);
> +	else
> +		arm_spe__synth_data_source_generic(record, &data_src);
>  
>  	if (record->type & (ARM_SPE_TLB_ACCESS | ARM_SPE_TLB_MISS)) {
>  		data_src.mem_dtlb = PERF_MEM_TLB_WK;
> @@ -446,7 +525,7 @@ static int arm_spe_sample(struct arm_spe_queue *speq)
>  	u64 data_src;
>  	int err;
>  
> -	data_src = arm_spe__synth_data_source(record);
> +	data_src = arm_spe__synth_data_source(record, spe->midr);
>  
>  	if (spe->sample_flc) {
>  		if (record->type & ARM_SPE_L1D_MISS) {
> @@ -1183,6 +1262,8 @@ int arm_spe_process_auxtrace_info(union perf_event *event,
>  	struct perf_record_auxtrace_info *auxtrace_info = &event->auxtrace_info;
>  	size_t min_sz = sizeof(u64) * ARM_SPE_AUXTRACE_PRIV_MAX;
>  	struct perf_record_time_conv *tc = &session->time_conv;
> +	const char *cpuid = perf_env__cpuid(session->evlist->env);
> +	u64 midr = strtol(cpuid, NULL, 16);
>  	struct arm_spe *spe;
>  	int err;
>  
> @@ -1202,6 +1283,7 @@ int arm_spe_process_auxtrace_info(union perf_event *event,
>  	spe->machine = &session->machines.host; /* No kvm support */
>  	spe->auxtrace_type = auxtrace_info->type;
>  	spe->pmu_type = auxtrace_info->priv[ARM_SPE_PMU_TYPE];
> +	spe->midr = midr;
>  
>  	spe->timeless_decoding = arm_spe__is_timeless_decoding(spe);
>  
> -- 
> 2.32.0
> 

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v4 2/4] perf arm-spe: Use SPE data source for neoverse cores
  2022-03-26 13:47     ` Leo Yan
@ 2022-03-26 13:52       ` Arnaldo Carvalho de Melo
  -1 siblings, 0 replies; 66+ messages in thread
From: Arnaldo Carvalho de Melo @ 2022-03-26 13:52 UTC (permalink / raw)
  To: Leo Yan
  Cc: Ali Saidi, linux-kernel, linux-perf-users, linux-arm-kernel,
	german.gomez, benh, Nick.Forrington, alexander.shishkin,
	andrew.kilroy, james.clark, john.garry, jolsa, kjain, lihuafei1,
	mark.rutland, mathieu.poirier, mingo, namhyung, peterz, will

Em Sat, Mar 26, 2022 at 09:47:54PM +0800, Leo Yan escreveu:
> Hi Ali, German,
> 
> On Thu, Mar 24, 2022 at 06:33:21PM +0000, Ali Saidi wrote:
> 
> [...]
> 
> > +static void arm_spe__synth_data_source_neoverse(const struct arm_spe_record *record,
> > +						union perf_mem_data_src *data_src)
> >  {
> > -	union perf_mem_data_src	data_src = { 0 };
> > +	/*
> > +	 * Even though four levels of cache hierarchy are possible, no known
> > +	 * production Neoverse systems currently include more than three levels
> > +	 * so for the time being we assume three exist. If a production system
> > +	 * is built with four the this function would have to be changed to
> > +	 * detect the number of levels for reporting.
> > +	 */
> >  
> > -	if (record->op == ARM_SPE_LD)
> > -		data_src.mem_op = PERF_MEM_OP_LOAD;
> > -	else
> > -		data_src.mem_op = PERF_MEM_OP_STORE;
> 
> Firstly, apologize that I didn't give clear idea when Ali sent patch sets
> v2 and v3.

Ok, removing this as well.

Thanks for reviewing.

- Arnaldo
 
> IMHO, we need to consider two kinds of information which can guide us
> for a reliable implementation.  The first thing is to summarize the data
> source configuration for x86 PEBS, we can dive in more details for this
> part; the second thing is we can refer to the AMBA architecture document
> ARM IHI 0050E.b, section 11.1.2 'Crossing a chip-to-chip interface' and
> its sub section 'Suggested DataSource values', which would help us
> much for mapping the cache topology to Arm SPE data source.
> 
> As a result, I summarized the data source configurations for PEBS and
> Arm SPE Neoverse in the spreadsheet:
> https://docs.google.com/spreadsheets/d/11YmjG0TyRjH7IXgvsREFgTg3AVtxh2dvLloRK1EdNjU/edit?usp=sharing
> 
> Please see below comments.
> 
> > +	switch (record->source) {
> > +	case ARM_SPE_NV_L1D:
> > +		data_src->mem_lvl = PERF_MEM_LVL_HIT;
> > +		data_src->mem_lvl_num = PERF_MEM_LVLNUM_L1;
> > +		break;
> 
> I think we need to set the field 'mem_snoop' for L1 cache hit:
> 
>         data_src->mem_snoop = PERF_MEM_SNOOP_NONE;
> 
> For L1 cache hit, it doesn't involve snooping.
> 
> > +	case ARM_SPE_NV_L2:
> > +		data_src->mem_lvl = PERF_MEM_LVL_HIT;
> > +		data_src->mem_lvl_num = PERF_MEM_LVLNUM_L2;
> > +		break;
> 
> Ditto:
> 
>         data_src->mem_snoop = PERF_MEM_SNOOP_NONE;
> 
> > +	case ARM_SPE_NV_PEER_CORE:
> > +		data_src->mem_lvl = PERF_MEM_LVL_HIT;
> > +		data_src->mem_snoop = PERF_MEM_SNOOP_HITM;
> > +		data_src->mem_lvl_num = PERF_MEM_LVLNUM_ANY_CACHE;
> 
> Peer core contains its local L1 cache, so I think we can set the
> memory level L1 to indicate this case.
> 
> For this data source type and below types, though they indicate
> the snooping happens, but it doesn't mean the data in the cache line
> is in 'modified' state.  If set flag PERF_MEM_SNOOP_HITM, I personally
> think this will mislead users when report the result.
> 
> I prefer we set below fields for ARM_SPE_NV_PEER_CORE:
> 
>         data_src->mem_lvl = PERF_MEM_LVL_HIT | PERF_MEM_LVL_L1;
>         data_src->mem_snoop = PERF_MEM_SNOOP_HIT;
>         data_src->mem_lvl_num = PERF_MEM_LVLNUM_L1;
> 
> > +		break;
> > +	/*
> > +	 * We don't know if this is L1, L2 but we do know it was a cache-2-cache
> > +	 * transfer, so set SNOOP_HITM
> > +	 */
> > +	case ARM_SPE_NV_LCL_CLSTR:
> 
> For ARM_SPE_NV_LCL_CLSTR, it fetches the data from the shared cache in
> the cluster level, it should happen in L2 cache:
> 
>         data_src->mem_lvl = PERF_MEM_LVL_HIT | PERF_MEM_LVL_L2;
>         data_src->mem_snoop = PERF_MEM_SNOOP_HIT;
>         data_src->mem_lvl_num = PERF_MEM_LVLNUM_L2;
> 
> > +	case ARM_SPE_NV_PEER_CLSTR:
> > +		data_src->mem_lvl = PERF_MEM_LVL_HIT;
> > +		data_src->mem_snoop = PERF_MEM_SNOOP_HITM;
> > +		data_src->mem_lvl_num = PERF_MEM_LVLNUM_ANY_CACHE;
> > +		break;
> 
> This type can snoop from L1 or L2 cache in the peer cluster, so it
> makes sense to set cache level as PERF_MEM_LVLNUM_ANY_CACHE.  But here
> should use the snoop type PERF_MEM_SNOOP_HIT, so:
> 
>         data_src->mem_lvl = PERF_MEM_LVL_HIT
>         data_src->mem_snoop = PERF_MEM_SNOOP_HIT;
>         data_src->mem_lvl_num = PERF_MEM_LVLNUM_ANY_CACHE;
> 
> > +	/*
> > +	 * System cache is assumed to be L3
> > +	 */
> > +	case ARM_SPE_NV_SYS_CACHE:
> > +		data_src->mem_lvl = PERF_MEM_LVL_HIT;
> > +		data_src->mem_lvl_num = PERF_MEM_LVLNUM_L3;
> > +		break;
> 
>         data_src->mem_lvl = PERF_MEM_LVL_HIT | PERF_MEM_LVL_L3;
>         data_src->mem_snoop = PERF_MEM_SNOOP_HIT;
>         data_src->mem_lvl_num = PERF_MEM_LVLNUM_L3;
> 
> > +	/*
> > +	 * We don't know what level it hit in, except it came from the other
> > +	 * socket
> > +	 */
> > +	case ARM_SPE_NV_REMOTE:
> > +		data_src->mem_snoop = PERF_MEM_SNOOP_HITM;
> > +		data_src->mem_remote = PERF_MEM_REMOTE_REMOTE;
> > +		break;
> 
> The type ARM_SPE_NV_REMOTE is a snooping operation and it can happen
> in any cache levels in remote chip:
> 
>         data_src->mem_lvl = PERF_MEM_LVL_HIT;
>         data_src->mem_snoop = PERF_MEM_SNOOP_HIT;
>         data_src->mem_remote = PERF_MEM_REMOTE_REMOTE;
>         data_src->mem_lvl_num = PERF_MEM_LVLNUM_ANY_CACHE;
> 
> > +	case ARM_SPE_NV_DRAM:
> > +		data_src->mem_lvl = PERF_MEM_LVL_HIT;
> > +		data_src->mem_lvl_num = PERF_MEM_LVLNUM_RAM;
> > +		break;
> 
> We can set the snoop type to PERF_MEM_SNOOP_MISS for the DRAM data source:
> 
>         data_src->mem_lvl = PERF_MEM_LVL_HIT;
>         data_src->mem_snoop = PERF_MEM_SNOOP_MISS;
>         data_src->mem_lvl_num = PERF_MEM_LVLNUM_RAM;
> 
> The rest of this patch looks good to me.
> 
> Thanks,
> Leo
> 
> > +	default:
> > +		break;
> > +	}
> > +}
> >  
> > +static void arm_spe__synth_data_source_generic(const struct arm_spe_record *record,
> > +						union perf_mem_data_src *data_src)
> > +{
> >  	if (record->type & (ARM_SPE_LLC_ACCESS | ARM_SPE_LLC_MISS)) {
> > -		data_src.mem_lvl = PERF_MEM_LVL_L3;
> > +		data_src->mem_lvl = PERF_MEM_LVL_L3;
> >  
> >  		if (record->type & ARM_SPE_LLC_MISS)
> > -			data_src.mem_lvl |= PERF_MEM_LVL_MISS;
> > +			data_src->mem_lvl |= PERF_MEM_LVL_MISS;
> >  		else
> > -			data_src.mem_lvl |= PERF_MEM_LVL_HIT;
> > +			data_src->mem_lvl |= PERF_MEM_LVL_HIT;
> >  	} else if (record->type & (ARM_SPE_L1D_ACCESS | ARM_SPE_L1D_MISS)) {
> > -		data_src.mem_lvl = PERF_MEM_LVL_L1;
> > +		data_src->mem_lvl = PERF_MEM_LVL_L1;
> >  
> >  		if (record->type & ARM_SPE_L1D_MISS)
> > -			data_src.mem_lvl |= PERF_MEM_LVL_MISS;
> > +			data_src->mem_lvl |= PERF_MEM_LVL_MISS;
> >  		else
> > -			data_src.mem_lvl |= PERF_MEM_LVL_HIT;
> > +			data_src->mem_lvl |= PERF_MEM_LVL_HIT;
> >  	}
> >  
> >  	if (record->type & ARM_SPE_REMOTE_ACCESS)
> > -		data_src.mem_lvl |= PERF_MEM_LVL_REM_CCE1;
> > +		data_src->mem_lvl |= PERF_MEM_LVL_REM_CCE1;
> > +}
> > +
> > +static u64 arm_spe__synth_data_source(const struct arm_spe_record *record, u64 midr)
> > +{
> > +	union perf_mem_data_src	data_src = { 0 };
> > +	bool is_neoverse = is_midr_in_range(midr, neoverse_spe);
> > +
> > +	if (record->op & ARM_SPE_LD)
> > +		data_src.mem_op = PERF_MEM_OP_LOAD;
> > +	else
> > +		data_src.mem_op = PERF_MEM_OP_STORE;
> > +
> > +	if (is_neoverse)
> > +		arm_spe__synth_data_source_neoverse(record, &data_src);
> > +	else
> > +		arm_spe__synth_data_source_generic(record, &data_src);
> >  
> >  	if (record->type & (ARM_SPE_TLB_ACCESS | ARM_SPE_TLB_MISS)) {
> >  		data_src.mem_dtlb = PERF_MEM_TLB_WK;
> > @@ -446,7 +525,7 @@ static int arm_spe_sample(struct arm_spe_queue *speq)
> >  	u64 data_src;
> >  	int err;
> >  
> > -	data_src = arm_spe__synth_data_source(record);
> > +	data_src = arm_spe__synth_data_source(record, spe->midr);
> >  
> >  	if (spe->sample_flc) {
> >  		if (record->type & ARM_SPE_L1D_MISS) {
> > @@ -1183,6 +1262,8 @@ int arm_spe_process_auxtrace_info(union perf_event *event,
> >  	struct perf_record_auxtrace_info *auxtrace_info = &event->auxtrace_info;
> >  	size_t min_sz = sizeof(u64) * ARM_SPE_AUXTRACE_PRIV_MAX;
> >  	struct perf_record_time_conv *tc = &session->time_conv;
> > +	const char *cpuid = perf_env__cpuid(session->evlist->env);
> > +	u64 midr = strtol(cpuid, NULL, 16);
> >  	struct arm_spe *spe;
> >  	int err;
> >  
> > @@ -1202,6 +1283,7 @@ int arm_spe_process_auxtrace_info(union perf_event *event,
> >  	spe->machine = &session->machines.host; /* No kvm support */
> >  	spe->auxtrace_type = auxtrace_info->type;
> >  	spe->pmu_type = auxtrace_info->priv[ARM_SPE_PMU_TYPE];
> > +	spe->midr = midr;
> >  
> >  	spe->timeless_decoding = arm_spe__is_timeless_decoding(spe);
> >  
> > -- 
> > 2.32.0
> > 

-- 

- Arnaldo

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v4 3/4] perf mem: Support mem_lvl_num in c2c command
  2022-03-24 18:33   ` Ali Saidi
@ 2022-03-26 13:54     ` Arnaldo Carvalho de Melo
  -1 siblings, 0 replies; 66+ messages in thread
From: Arnaldo Carvalho de Melo @ 2022-03-26 13:54 UTC (permalink / raw)
  To: Ali Saidi
  Cc: linux-kernel, linux-perf-users, linux-arm-kernel, german.gomez,
	leo.yan, benh, Nick.Forrington, alexander.shishkin,
	andrew.kilroy, james.clark, john.garry, jolsa, kjain, lihuafei1,
	mark.rutland, mathieu.poirier, mingo, namhyung, peterz, will

Em Thu, Mar 24, 2022 at 06:33:22PM +0000, Ali Saidi escreveu:
> In addition to summarizing data encoded in mem_lvl, also support data
> encoded in mem_lvl_num.
> 
> Since other architectures don't seem to populate the mem_lvl_num field
> here, there shouldn't be a change in functionality.

I'm removing this one as well, will wait for further discussion as the
other two got yanked out as per Leo's review comments.

The first patch is in with Leo's ack.

- Arnaldo
 
> Signed-off-by: Ali Saidi <alisaidi@amazon.com>
> Tested-by: German Gomez <german.gomez@arm.com>
> Reviewed-by: German Gomez <german.gomez@arm.com>
> ---
>  tools/perf/util/mem-events.c | 11 +++++++----
>  1 file changed, 7 insertions(+), 4 deletions(-)
> 
> diff --git a/tools/perf/util/mem-events.c b/tools/perf/util/mem-events.c
> index ed0ab838bcc5..e5e405185498 100644
> --- a/tools/perf/util/mem-events.c
> +++ b/tools/perf/util/mem-events.c
> @@ -485,6 +485,7 @@ int c2c_decode_stats(struct c2c_stats *stats, struct mem_info *mi)
>  	u64 daddr  = mi->daddr.addr;
>  	u64 op     = data_src->mem_op;
>  	u64 lvl    = data_src->mem_lvl;
> +	u64 lnum   = data_src->mem_lvl_num;
>  	u64 snoop  = data_src->mem_snoop;
>  	u64 lock   = data_src->mem_lock;
>  	u64 blk    = data_src->mem_blk;
> @@ -527,16 +528,18 @@ do {				\
>  			if (lvl & P(LVL, UNC)) stats->ld_uncache++;
>  			if (lvl & P(LVL, IO))  stats->ld_io++;
>  			if (lvl & P(LVL, LFB)) stats->ld_fbhit++;
> -			if (lvl & P(LVL, L1 )) stats->ld_l1hit++;
> -			if (lvl & P(LVL, L2 )) stats->ld_l2hit++;
> -			if (lvl & P(LVL, L3 )) {
> +			if (lvl & P(LVL, L1) || lnum == P(LVLNUM, L1))
> +				stats->ld_l1hit++;
> +			if (lvl & P(LVL, L2) || lnum == P(LVLNUM, L2))
> +				stats->ld_l2hit++;
> +			if (lvl & P(LVL, L3) || lnum == P(LVLNUM, L3)) {
>  				if (snoop & P(SNOOP, HITM))
>  					HITM_INC(lcl_hitm);
>  				else
>  					stats->ld_llchit++;
>  			}
>  
> -			if (lvl & P(LVL, LOC_RAM)) {
> +			if (lvl & P(LVL, LOC_RAM) || lnum == P(LVLNUM, RAM)) {
>  				stats->lcl_dram++;
>  				if (snoop & P(SNOOP, HIT))
>  					stats->ld_shared++;
> -- 
> 2.32.0

-- 

- Arnaldo

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v4 2/4] perf arm-spe: Use SPE data source for neoverse cores
  2022-03-26 13:52       ` Arnaldo Carvalho de Melo
@ 2022-03-26 13:56         ` Leo Yan
  -1 siblings, 0 replies; 66+ messages in thread
From: Leo Yan @ 2022-03-26 13:56 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: Ali Saidi, linux-kernel, linux-perf-users, linux-arm-kernel,
	german.gomez, benh, Nick.Forrington, alexander.shishkin,
	andrew.kilroy, james.clark, john.garry, jolsa, kjain, lihuafei1,
	mark.rutland, mathieu.poirier, mingo, namhyung, peterz, will

On Sat, Mar 26, 2022 at 10:52:01AM -0300, Arnaldo Carvalho de Melo wrote:
> Em Sat, Mar 26, 2022 at 09:47:54PM +0800, Leo Yan escreveu:
> > Hi Ali, German,
> > 
> > On Thu, Mar 24, 2022 at 06:33:21PM +0000, Ali Saidi wrote:
> > 
> > [...]
> > 
> > > +static void arm_spe__synth_data_source_neoverse(const struct arm_spe_record *record,
> > > +						union perf_mem_data_src *data_src)
> > >  {
> > > -	union perf_mem_data_src	data_src = { 0 };
> > > +	/*
> > > +	 * Even though four levels of cache hierarchy are possible, no known
> > > +	 * production Neoverse systems currently include more than three levels,
> > > +	 * so for the time being we assume three exist. If a production system
> > > +	 * is built with four, this function would have to be changed to
> > > +	 * detect the number of levels for reporting.
> > > +	 */
> > >  
> > > -	if (record->op == ARM_SPE_LD)
> > > -		data_src.mem_op = PERF_MEM_OP_LOAD;
> > > -	else
> > > -		data_src.mem_op = PERF_MEM_OP_STORE;
> > 
> > Firstly, apologize that I didn't give clear idea when Ali sent patch sets
> > v2 and v3.
> 
> Ok, removing this as well.
> 
> Thanks for reviewing.

Thanks a lot, Arnaldo.  Yeah, it's good to give a bit more time to
dismiss the concerns in this patch.

Sorry again for the inconvenience.

Leo

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v4 1/4] tools: arm64: Import cputype.h
  2022-03-26  5:49         ` Leo Yan
@ 2022-03-26 13:59           ` Arnaldo Carvalho de Melo
  -1 siblings, 0 replies; 66+ messages in thread
From: Arnaldo Carvalho de Melo @ 2022-03-26 13:59 UTC (permalink / raw)
  To: Leo Yan
  Cc: Ali Saidi, linux-kernel, linux-perf-users, linux-arm-kernel,
	german.gomez, benh, Nick.Forrington, alexander.shishkin,
	andrew.kilroy, james.clark, john.garry, jolsa, kjain, lihuafei1,
	mark.rutland, mathieu.poirier, mingo, namhyung, peterz, will

Em Sat, Mar 26, 2022 at 01:49:56PM +0800, Leo Yan escreveu:
> Hi Arnaldo, Ali,
> 
> On Fri, Mar 25, 2022 at 04:42:32PM -0300, Arnaldo Carvalho de Melo wrote:
> > Em Fri, Mar 25, 2022 at 03:39:44PM -0300, Arnaldo Carvalho de Melo escreveu:
> > > Em Thu, Mar 24, 2022 at 06:33:20PM +0000, Ali Saidi escreveu:
> > > > Bring-in the kernel's arch/arm64/include/asm/cputype.h into tools/
> > > > for arm64 to make use of all the core-type definitions in perf.
> > 
> > > > Replace sysreg.h with the version already imported into tools/.
> >  
> > > You forgot to add it to tools/perf/check-headers.sh so that we get
> > > notified when the original file in the kernel sources gets updated, so
> > > that we can check if this needs any tooling adjustments.
> >  
> > > I'll add the entry together with the waiver for this specific
> > > difference.
> > 
> > This:
> > 
> > diff --git a/tools/perf/check-headers.sh b/tools/perf/check-headers.sh
> > index 30ecf3a0f68b6830..6ee44b18c6b57cf1 100755
> > --- a/tools/perf/check-headers.sh
> > +++ b/tools/perf/check-headers.sh
> > @@ -146,6 +146,7 @@ done
> >  check arch/x86/lib/memcpy_64.S        '-I "^EXPORT_SYMBOL" -I "^#include <asm/export.h>" -I"^SYM_FUNC_START\(_LOCAL\)*(memcpy_\(erms\|orig\))"'
> >  check arch/x86/lib/memset_64.S        '-I "^EXPORT_SYMBOL" -I "^#include <asm/export.h>" -I"^SYM_FUNC_START\(_LOCAL\)*(memset_\(erms\|orig\))"'
> >  check arch/x86/include/asm/amd-ibs.h  '-I "^#include [<\"]\(asm/\)*msr-index.h"'
> > +check arch/arm64/include/asm/cputype.h '-I "^#include [<\"]\(asm/\)*sysreg.h"'
> >  check include/uapi/asm-generic/mman.h '-I "^#include <\(uapi/\)*asm-generic/mman-common\(-tools\)*.h>"'
> >  check include/uapi/linux/mman.h       '-I "^#include <\(uapi/\)*asm/mman.h>"'
> >  check include/linux/build_bug.h       '-I "^#\(ifndef\|endif\)\( \/\/\)* static_assert$"'
> 
> LGTM.  I did the testing on both my x86 and Arm64 platforms, thanks for
> the fixing up.

Thanks, adding a:

Tested-by: Leo Yan <leo.yan@linaro.org>

- Arnaldo

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v4 2/4] perf arm-spe: Use SPE data source for neoverse cores
  2022-03-26 13:56         ` Leo Yan
@ 2022-03-26 14:04           ` Arnaldo Carvalho de Melo
  -1 siblings, 0 replies; 66+ messages in thread
From: Arnaldo Carvalho de Melo @ 2022-03-26 14:04 UTC (permalink / raw)
  To: Leo Yan
  Cc: Ali Saidi, linux-kernel, linux-perf-users, linux-arm-kernel,
	german.gomez, benh, Nick.Forrington, alexander.shishkin,
	andrew.kilroy, james.clark, john.garry, jolsa, kjain, lihuafei1,
	mark.rutland, mathieu.poirier, mingo, namhyung, peterz, will

Em Sat, Mar 26, 2022 at 09:56:53PM +0800, Leo Yan escreveu:
> On Sat, Mar 26, 2022 at 10:52:01AM -0300, Arnaldo Carvalho de Melo wrote:
> > Em Sat, Mar 26, 2022 at 09:47:54PM +0800, Leo Yan escreveu:
> > > On Thu, Mar 24, 2022 at 06:33:21PM +0000, Ali Saidi wrote:
> > > > +static void arm_spe__synth_data_source_neoverse(const struct arm_spe_record *record,
> > > > +						union perf_mem_data_src *data_src)
> > > >  {
> > > > -	union perf_mem_data_src	data_src = { 0 };
> > > > +	/*
> > > > +	 * Even though four levels of cache hierarchy are possible, no known
> > > > +	 * production Neoverse systems currently include more than three levels,
> > > > +	 * so for the time being we assume three exist. If a production system
> > > > +	 * is built with four, this function would have to be changed to
> > > > +	 * detect the number of levels for reporting.
> > > > +	 */

> > > > -	if (record->op == ARM_SPE_LD)
> > > > -		data_src.mem_op = PERF_MEM_OP_LOAD;
> > > > -	else
> > > > -		data_src.mem_op = PERF_MEM_OP_STORE;

> > > Firstly, apologize that I didn't give clear idea when Ali sent patch sets
> > > v2 and v3.

> > Ok, removing this as well.

> > Thanks for reviewing.

> Thanks a lot, Arnaldo.  Yeah, it's good to give a bit more time to
> dismiss the concerns in this patch.

Sure, at least it was build tested on many distros/cross compilers and
this part is ok 8-)
 
> Sorry again for the inconvenience.

np.

- Arnaldo

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v4 4/4] perf mem: Support HITM for when mem_lvl_num is any
  2022-03-26  6:23     ` Leo Yan
@ 2022-03-26 19:14       ` Ali Saidi
  -1 siblings, 0 replies; 66+ messages in thread
From: Ali Saidi @ 2022-03-26 19:14 UTC (permalink / raw)
  To: leo.yan
  Cc: Nick.Forrington, acme, alexander.shishkin, alisaidi,
	andrew.kilroy, benh, german.gomez, james.clark, john.garry,
	jolsa, kjain, lihuafei1, linux-arm-kernel, linux-kernel,
	linux-perf-users, mark.rutland, mathieu.poirier, mingo, namhyung,
	peterz, will

On Sat, 26 Mar 2022 22:23:03 +0000, Leo Yan wrote:
> On Thu, Mar 24, 2022 at 06:33:23PM +0000, Ali Saidi wrote:
> > For loads that hit in the LLC snoop filter and are fulfilled from a
> > higher level cache on arm64 Neoverse cores, it's not usually clear what
> > the true level of the cache the data came from (i.e. a transfer from a
> > core could come from its L1 or L2). Instead of making an assumption of
> > where the line came from, add support for incrementing HITM if the
> > source is CACHE_ANY.
[snip]
> 
> This might break the memory profiling result for x86, see file
> arch/x86/events/intel/ds.c:
> 
>   97 void __init intel_pmu_pebs_data_source_skl(bool pmem)
>   98 {
>   99         u64 pmem_or_l4 = pmem ? LEVEL(PMEM) : LEVEL(L4);
>   ...
>  105         pebs_data_source[0x0d] = OP_LH | LEVEL(ANY_CACHE) | REM | P(SNOOP, HITM);
>  106 }
> 

Thanks for catching this Leo, I'll add your fix.

Ali

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v4 2/4] perf arm-spe: Use SPE data source for neoverse cores
  2022-03-26 13:47     ` Leo Yan
@ 2022-03-26 19:43       ` Ali Saidi
  -1 siblings, 0 replies; 66+ messages in thread
From: Ali Saidi @ 2022-03-26 19:43 UTC (permalink / raw)
  To: leo.yan
  Cc: Nick.Forrington, acme, alexander.shishkin, alisaidi,
	andrew.kilroy, benh, german.gomez, james.clark, john.garry,
	jolsa, kjain, lihuafei1, linux-arm-kernel, linux-kernel,
	linux-perf-users, mark.rutland, mathieu.poirier, mingo, namhyung,
	peterz, will

Hi Leo,
On Sat, 26 Mar 2022 21:47:54 +0800, Leo Yan wrote:
> Hi Ali, German,
> 
> On Thu, Mar 24, 2022 at 06:33:21PM +0000, Ali Saidi wrote:
> 
> [...]
> 
> > +static void arm_spe__synth_data_source_neoverse(const struct arm_spe_record *record,
> > +						union perf_mem_data_src *data_src)
> >  {
> > -	union perf_mem_data_src	data_src = { 0 };
> > +	/*
> > +	 * Even though four levels of cache hierarchy are possible, no known
> > +	 * production Neoverse systems currently include more than three levels
> > +	 * so for the time being we assume three exist. If a production system
> > +	 * is built with four then this function would have to be changed to
> > +	 * detect the number of levels for reporting.
> > +	 */
> >  
> > -	if (record->op == ARM_SPE_LD)
> > -		data_src.mem_op = PERF_MEM_OP_LOAD;
> > -	else
> > -		data_src.mem_op = PERF_MEM_OP_STORE;
> 
> Firstly, apologize that I didn't give clear idea when Ali sent patch sets
> v2 and v3.
> 
> IMHO, we need to consider two kinds of information which can guide us
> for a reliable implementation.  The first thing is to summarize the data
> source configuration for x86 PEBS, we can dive in more details for this
> part; the second thing is we can refer to the AMBA architecture document
> ARM IHI 0050E.b, section 11.1.2 'Crossing a chip-to-chip interface' and
> its sub section 'Suggested DataSource values', which would help us
> much for mapping the cache topology to Arm SPE data source.
> 
> As a result, I summarized the data source configurations for PEBS and
> Arm SPE Neoverse in the spreadsheet:
> https://docs.google.com/spreadsheets/d/11YmjG0TyRjH7IXgvsREFgTg3AVtxh2dvLloRK1EdNjU/edit?usp=sharing

Thanks for putting this together and digging into the details, but you're making
assumptions in the Neoverse data sources about the core configurations that aren't
correct. The Neoverse cores all have integrated L1 and L2 caches, so if the
line is coming from a peer core we don't know which level it's actually coming
from.  Similarly, if it's coming from a local cluster, that could mean a cluster
L3, but it's not the L2.


> Please see below comments.
> 
> > +	switch (record->source) {
> > +	case ARM_SPE_NV_L1D:
> > +		data_src->mem_lvl = PERF_MEM_LVL_HIT;
> > +		data_src->mem_lvl_num = PERF_MEM_LVLNUM_L1;
> > +		break;
> 
> I think we need to set the field 'mem_snoop' for L1 cache hit:
> 
>         data_src->mem_snoop = PERF_MEM_SNOOP_NONE;
> 
> For L1 cache hit, it doesn't involve snooping.
I can't find a precise definition for SNOOP_NONE, but it seemed as though
this would be used for cases where a snoop could have occurred but didn't,
not for accesses that by definition don't snoop? I'm happy with either way;
perhaps I just read more into it.

> > +	case ARM_SPE_NV_L2:
> > +		data_src->mem_lvl = PERF_MEM_LVL_HIT;
> > +		data_src->mem_lvl_num = PERF_MEM_LVLNUM_L2;
> > +		break;
> 
> Ditto:
> 
>         data_src->mem_snoop = PERF_MEM_SNOOP_NONE;
Same comment as above.

> > +	case ARM_SPE_NV_PEER_CORE:
> > +		data_src->mem_lvl = PERF_MEM_LVL_HIT;
> > +		data_src->mem_snoop = PERF_MEM_SNOOP_HITM;
> > +		data_src->mem_lvl_num = PERF_MEM_LVLNUM_ANY_CACHE;
> 
> Peer core contains its local L1 cache, so I think we can set the
> memory level L1 to indicate this case.
It could be either the L1 or the L2. All the Neoverse cores have private L2
caches and we don't know which.

> For this data source type and below types, though they indicate
> the snooping happens, but it doesn't mean the data in the cache line
> is in 'modified' state.  If set flag PERF_MEM_SNOOP_HITM, I personally
> think this will mislead users when report the result.

I'm of the opposite opinion. If the data wasn't modified, it will likely be
found in the lower-level shared cache and the transaction wouldn't require a
cache-to-cache transfer of the modified data, so the most common case when we
source a line out of another core's cache will be if it was "modifiable" in that
cache. 

> 
> I prefer we set below fields for ARM_SPE_NV_PEER_CORE:
> 
>         data_src->mem_lvl = PERF_MEM_LVL_HIT | PERF_MEM_LVL_L1;
>         data_src->mem_snoop = PERF_MEM_SNOOP_HIT;
>         data_src->mem_lvl_num = PERF_MEM_LVLNUM_L1;
> 
> > +		break;
> > +	/*
> > +	 * We don't know if this is L1, L2 but we do know it was a cache-2-cache
> > +	 * transfer, so set SNOOP_HITM
> > +	 */
> > +	case ARM_SPE_NV_LCL_CLSTR:
> 
> For ARM_SPE_NV_LCL_CLSTR, it fetches the data from the shared cache in
> the cluster level, it should happen in L2 cache:
> 
>         data_src->mem_lvl = PERF_MEM_LVL_HIT | PERF_MEM_LVL_L2;
>         data_src->mem_snoop = PERF_MEM_SNOOP_HIT;
>         data_src->mem_lvl_num = PERF_MEM_LVLNUM_L2;

We don't know if this is coming from the cluster cache, or the private L1 or L2
core caches. The description above about why we'll be transferring the line from
cache-to-cache applies here too. 

> > +	case ARM_SPE_NV_PEER_CLSTR:
> > +		data_src->mem_lvl = PERF_MEM_LVL_HIT;
> > +		data_src->mem_snoop = PERF_MEM_SNOOP_HITM;
> > +		data_src->mem_lvl_num = PERF_MEM_LVLNUM_ANY_CACHE;
> > +		break;
> 
> This type can snoop from L1 or L2 cache in the peer cluster, so it
> makes sense to set cache level as PERF_MEM_LVLNUM_ANY_CACHE.  But here
> should use the snoop type PERF_MEM_SNOOP_HIT, so:
> 
>         data_src->mem_lvl = PERF_MEM_LVL_HIT
>         data_src->mem_snoop = PERF_MEM_SNOOP_HIT;
>         data_src->mem_lvl_num = PERF_MEM_LVLNUM_ANY_CACHE;

Given that we agreed to only focus on the three levels generally used by
the existing implementations, LCL_CLSTR and PEER_CLSTR should be the same for now.

> > +	/*
> > +	 * System cache is assumed to be L3
> > +	 */
> > +	case ARM_SPE_NV_SYS_CACHE:
> > +		data_src->mem_lvl = PERF_MEM_LVL_HIT;
> > +		data_src->mem_lvl_num = PERF_MEM_LVLNUM_L3;
> > +		break;
> 
>         data_src->mem_lvl = PERF_MEM_LVL_HIT | PERF_MEM_LVL_L3;
>         data_src->mem_snoop = PERF_MEM_SNOOP_HIT;
>         data_src->mem_lvl_num = PERF_MEM_LVLNUM_L3;

I don't think we should set both the deprecated mem_lvl and the mem_lvl_num. 
If we're hitting in the unified L3 cache, we aren't actually snooping anything
which is why I didn't set mem_snoop here.

> > +	/*
> > +	 * We don't know what level it hit in, except it came from the other
> > +	 * socket
> > +	 */
> > +	case ARM_SPE_NV_REMOTE:
> > +		data_src->mem_snoop = PERF_MEM_SNOOP_HITM;
> > +		data_src->mem_remote = PERF_MEM_REMOTE_REMOTE;
> > +		break;
> 
> The type ARM_SPE_NV_REMOTE is a snooping operation and it can happen
> in any cache levels in remote chip:
> 
>         data_src->mem_lvl = PERF_MEM_LVL_HIT;
>         data_src->mem_snoop = PERF_MEM_SNOOP_HIT;
>         data_src->mem_remote = PERF_MEM_REMOTE_REMOTE;
>         data_src->mem_lvl_num = PERF_MEM_LVLNUM_ANY_CACHE;

Ok.

> 
> > +	case ARM_SPE_NV_DRAM:
> > +		data_src->mem_lvl = PERF_MEM_LVL_HIT;
> > +		data_src->mem_lvl_num = PERF_MEM_LVLNUM_RAM;
> > +		break;
> 
> We can set snoop as PERF_MEM_SNOOP_MISS for DRAM data source:
> 
>         data_src->mem_lvl = PERF_MEM_LVL_HIT;
>         data_src->mem_snoop = PERF_MEM_SNOOP_MISS;
>         data_src->mem_lvl_num = PERF_MEM_LVLNUM_RAM;
> 

Ok.

Thanks,
Ali




^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v4 2/4] perf arm-spe: Use SPE data source for neoverse cores
  2022-03-26 19:43       ` Ali Saidi
@ 2022-03-27  9:09         ` Leo Yan
  -1 siblings, 0 replies; 66+ messages in thread
From: Leo Yan @ 2022-03-27  9:09 UTC (permalink / raw)
  To: Ali Saidi
  Cc: Nick.Forrington, acme, alexander.shishkin, andrew.kilroy, benh,
	german.gomez, james.clark, john.garry, jolsa, kjain, lihuafei1,
	linux-arm-kernel, linux-kernel, linux-perf-users, mark.rutland,
	mathieu.poirier, mingo, namhyung, peterz, will

On Sat, Mar 26, 2022 at 07:43:27PM +0000, Ali Saidi wrote:
> Hi Leo,
> On Sat, 26 Mar 2022 21:47:54 +0800, Leo Yan wrote:
> > Hi Ali, German,
> > 
> > On Thu, Mar 24, 2022 at 06:33:21PM +0000, Ali Saidi wrote:
> > 
> > [...]
> > 
> > > +static void arm_spe__synth_data_source_neoverse(const struct arm_spe_record *record,
> > > +						union perf_mem_data_src *data_src)
> > >  {
> > > -	union perf_mem_data_src	data_src = { 0 };
> > > +	/*
> > > +	 * Even though four levels of cache hierarchy are possible, no known
> > > +	 * production Neoverse systems currently include more than three levels
> > > +	 * so for the time being we assume three exist. If a production system
> > > +	 * is built with four then this function would have to be changed to
> > > +	 * detect the number of levels for reporting.
> > > +	 */
> > >  
> > > -	if (record->op == ARM_SPE_LD)
> > > -		data_src.mem_op = PERF_MEM_OP_LOAD;
> > > -	else
> > > -		data_src.mem_op = PERF_MEM_OP_STORE;
> > 
> > Firstly, apologize that I didn't give clear idea when Ali sent patch sets
> > v2 and v3.
> > 
> > IMHO, we need to consider two kinds of information which can guide us
> > for a reliable implementation.  The first thing is to summarize the data
> > source configuration for x86 PEBS, we can dive in more details for this
> > part; the second thing is we can refer to the AMBA architecture document
> > ARM IHI 0050E.b, section 11.1.2 'Crossing a chip-to-chip interface' and
> > its sub section 'Suggested DataSource values', which would help us
> > much for mapping the cache topology to Arm SPE data source.
> > 
> > As a result, I summarized the data source configurations for PEBS and
> > Arm SPE Neoverse in the spreadsheet:
> > https://docs.google.com/spreadsheets/d/11YmjG0TyRjH7IXgvsREFgTg3AVtxh2dvLloRK1EdNjU/edit?usp=sharing
> 
> Thanks for putting this together and digging into the details, but you're making
> assumptions in the Neoverse data sources about the core configurations that aren't
> correct. The Neoverse cores all have integrated L1 and L2 caches, so if the
> line is coming from a peer core we don't know which level it's actually coming
> from.  Similarly, if it's coming from a local cluster, that could mean a cluster
> L3, but it's not the L2. 

I remember you mentioned this before, and yes, these concerns are
valid to me.  Please see the comments below.

> > Please see below comments.
> > 
> > > +	switch (record->source) {
> > > +	case ARM_SPE_NV_L1D:
> > > +		data_src->mem_lvl = PERF_MEM_LVL_HIT;
> > > +		data_src->mem_lvl_num = PERF_MEM_LVLNUM_L1;
> > > +		break;
> > 
> > I think we need to set the field 'mem_snoop' for L1 cache hit:
> > 
> >         data_src->mem_snoop = PERF_MEM_SNOOP_NONE;
> > 
> > For L1 cache hit, it doesn't involve snooping.
>
> I can't find a precise definition for SNOOP_NONE, but it seemed as though
> this would be used for cases where a snoop could have occurred but didn't,
> not for accesses that by definition don't snoop? I'm happy with either way;
> perhaps I just read more into it.

I have the same understanding as you, that "this would be used for
cases where a snoop could have occurred but didn't, not for accesses
that by definition don't snoop".

If we refer to PEBS's data source type 01H: "Minimal latency core cache
hit.  This request was satisfied by the L1 data cache" and x86 sets
SNOOP_NONE as the snoop type for this case.

If we connect this with a snooping protocol, let's use the MOSI
protocol as an example (I simply use MOSI for discussion, but I don't
have any knowledge of actual CPU implementations).  As described in
the wiki page [1], when a cache line is in the Owned (O) or Shared (S)
state, "a processor read (PrRd) does not generate any snooped signal".
In these cases, I think we should use the snoop type
PERF_MEM_SNOOP_NONE.

> > > +	case ARM_SPE_NV_L2:
> > > +		data_src->mem_lvl = PERF_MEM_LVL_HIT;
> > > +		data_src->mem_lvl_num = PERF_MEM_LVLNUM_L2;
> > > +		break;
> > 
> > Ditto:
> > 
> >         data_src->mem_snoop = PERF_MEM_SNOOP_NONE;
> Same comment as above.
> 
> > > +	case ARM_SPE_NV_PEER_CORE:
> > > +		data_src->mem_lvl = PERF_MEM_LVL_HIT;
> > > +		data_src->mem_snoop = PERF_MEM_SNOOP_HITM;
> > > +		data_src->mem_lvl_num = PERF_MEM_LVLNUM_ANY_CACHE;
> > 
> > Peer core contains its local L1 cache, so I think we can set the
> > memory level L1 to indicate this case.
> It could be either the L1 or the L2. All the Neoverse cores have private L2
> caches and we don't know which.

How about setting both L1 and L2 together for this case?

Although 'L1 | L2' cannot tell the exact cache level, I think it's
better than using ANY_CACHE; at least this can help us distinguish
this case from other data source types if we avoid using ANY_CACHE for
all of them.

> > For this data source type and below types, though they indicate
> > the snooping happens, but it doesn't mean the data in the cache line
> > is in 'modified' state.  If set flag PERF_MEM_SNOOP_HITM, I personally
> > think this will mislead users when report the result.
> 
> I'm of the opposite opinion. If the data wasn't modified, it will likely be
> found in the lower-level shared cache and the transaction wouldn't require a
> cache-to-cache transfer of the modified data, so the most common case when we
> source a line out of another core's cache will be if it was "modifiable" in that
> cache. 

Let's still use the MOSI protocol as an example.  I think there are
two cases: one is the peer cache line being in the 'Shared' state, and
the other is the peer cache line being in the 'Owned' state.

Quotes the wiki page for these two cases:

"When the cache block is in the Shared (S) state and there is a
snooped bus read (BusRd) transaction, then the block stays in the same
state and generates no more transactions as all the cache blocks have
the same value including the main memory and it is only being read,
not written into."

"While in the Owner (O) state and there is a snooped read request
(BusRd), the block remains in the same state while flushing (Flush)
the data for the other processor to read from it."

Seems to me, it's reasonable to set the HITM flag when the snooping
happens and the cache line is in the Modified (M) state.

Again, my comment is based on a literal understanding; so please
correct me if I have any misunderstanding here.

> > I prefer we set below fields for ARM_SPE_NV_PEER_CORE:
> > 
> >         data_src->mem_lvl = PERF_MEM_LVL_HIT | PERF_MEM_LVL_L1;
> >         data_src->mem_snoop = PERF_MEM_SNOOP_HIT;
> >         data_src->mem_lvl_num = PERF_MEM_LVLNUM_L1;
> > 
> > > +		break;
> > > +	/*
> > > +	 * We don't know if this is L1, L2 but we do know it was a cache-2-cache
> > > +	 * transfer, so set SNOOP_HITM
> > > +	 */
> > > +	case ARM_SPE_NV_LCL_CLSTR:
> > 
> > For ARM_SPE_NV_LCL_CLSTR, it fetches the data from the shared cache in
> > the cluster level, it should happen in L2 cache:
> > 
> >         data_src->mem_lvl = PERF_MEM_LVL_HIT | PERF_MEM_LVL_L2;
> >         data_src->mem_snoop = PERF_MEM_SNOOP_HIT;
> >         data_src->mem_lvl_num = PERF_MEM_LVLNUM_L2;
> 
> We don't know if this is coming from the cluster cache, or the private L1 or L2
> core caches. The description above about why we'll be transferring the line from
> cache-to-cache applies here too. 

I think a minor difference between PEER_CORE and LCL_CLSTR is: if the
data source is PEER_CORE, the snooping happens on the peer core's
local cache (the core's L1, or its L2 when the core contains an L2);
for the data source LCL_CLSTR, the snooping occurs on the
cluster-level cache (the cluster's L2, or the cluster's L3 when the
cluster contains an L3).

So can we set both 'L2 | L3' for the LCL_CLSTR case?

> > > +	case ARM_SPE_NV_PEER_CLSTR:
> > > +		data_src->mem_lvl = PERF_MEM_LVL_HIT;
> > > +		data_src->mem_snoop = PERF_MEM_SNOOP_HITM;
> > > +		data_src->mem_lvl_num = PERF_MEM_LVLNUM_ANY_CACHE;
> > > +		break;
> > 
> > This type can snoop from L1 or L2 cache in the peer cluster, so it
> > makes sense to set cache level as PERF_MEM_LVLNUM_ANY_CACHE.  But here
> > should use the snoop type PERF_MEM_SNOOP_HIT, so:
> > 
> >         data_src->mem_lvl = PERF_MEM_LVL_HIT
> >         data_src->mem_snoop = PERF_MEM_SNOOP_HIT;
> >         data_src->mem_lvl_num = PERF_MEM_LVLNUM_ANY_CACHE;
> 
> Given that we agreed to only focus on the three levels generally used by
> the existing implementations, LCL_CLSTR and PEER_CLSTR should be the same for now. 

For PEER_CLSTR, we don't know whether the snooping happens on a CPU's
private cache or on the cluster's shared cache, which is why we should
use ANY_CACHE for the cache level.

But LCL_CLSTR is different from PEER_CLSTR: LCL_CLSTR indicates
snooping on the local cluster's shared cache, so we set 'L2 | L3' for
it; therefore, we can distinguish between these two cases.

> > > +	/*
> > > +	 * System cache is assumed to be L3
> > > +	 */
> > > +	case ARM_SPE_NV_SYS_CACHE:
> > > +		data_src->mem_lvl = PERF_MEM_LVL_HIT;
> > > +		data_src->mem_lvl_num = PERF_MEM_LVLNUM_L3;
> > > +		break;
> > 
> >         data_src->mem_lvl = PERF_MEM_LVL_HIT | PERF_MEM_LVL_L3;
> >         data_src->mem_snoop = PERF_MEM_SNOOP_HIT;
> >         data_src->mem_lvl_num = PERF_MEM_LVLNUM_L3;
> 
> I don't think we should set both the deprecated mem_lvl and the mem_lvl_num. 

Is this because the decoding flow has a specific requirement that it
can only set mem_lvl_num and not mem_lvl?

> If we're hitting in the unified L3 cache, we aren't actually snooping anything
> which is why I didn't set mem_snoop here.

I am a bit suspicious of this clarification.  If the system cache is
in the cache coherency domain, then snooping occurs on it; in other
words, if the system cache is connected with a bus (like CCI or CMN),
and the bus helps with data consistency, I prefer to set the SNOOP_HIT
flag.

Could you confirm this?

> > > +	/*
> > > +	 * We don't know what level it hit in, except it came from the other
> > > +	 * socket
> > > +	 */
> > > +	case ARM_SPE_NV_REMOTE:
> > > +		data_src->mem_snoop = PERF_MEM_SNOOP_HITM;
> > > +		data_src->mem_remote = PERF_MEM_REMOTE_REMOTE;
> > > +		break;
> > 
> > The type ARM_SPE_NV_REMOTE is a snooping operation and it can happen
> > in any cache levels in remote chip:
> > 
> >         data_src->mem_lvl = PERF_MEM_LVL_HIT;
> >         data_src->mem_snoop = PERF_MEM_SNOOP_HIT;
> >         data_src->mem_remote = PERF_MEM_REMOTE_REMOTE;
> >         data_src->mem_lvl_num = PERF_MEM_LVLNUM_ANY_CACHE;
> 
> Ok.
> 
> > 
> > > +	case ARM_SPE_NV_DRAM:
> > > +		data_src->mem_lvl = PERF_MEM_LVL_HIT;
> > > +		data_src->mem_lvl_num = PERF_MEM_LVLNUM_RAM;
> > > +		break;
> > 
> > We can set snoop as PERF_MEM_SNOOP_MISS for DRAM data source:
> > 
> >         data_src->mem_lvl = PERF_MEM_LVL_HIT;
> >         data_src->mem_snoop = PERF_MEM_SNOOP_MISS;
> >         data_src->mem_lvl_num = PERF_MEM_LVLNUM_RAM;
> > 
> 
> Ok.

Thanks a lot for your work!

Leo

[1] https://en.wikipedia.org/wiki/MOSI_protocol

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v4 2/4] perf arm-spe: Use SPE data source for neoverse cores
@ 2022-03-27  9:09         ` Leo Yan
  0 siblings, 0 replies; 66+ messages in thread
From: Leo Yan @ 2022-03-27  9:09 UTC (permalink / raw)
  To: Ali Saidi
  Cc: Nick.Forrington, acme, alexander.shishkin, andrew.kilroy, benh,
	german.gomez, james.clark, john.garry, jolsa, kjain, lihuafei1,
	linux-arm-kernel, linux-kernel, linux-perf-users, mark.rutland,
	mathieu.poirier, mingo, namhyung, peterz, will

On Sat, Mar 26, 2022 at 07:43:27PM +0000, Ali Saidi wrote:
> Hi Leo,
> On Sat, 26 Mar 2022 21:47:54 +0800, Leo Yan wrote:
> > Hi Ali, German,
> > 
> > On Thu, Mar 24, 2022 at 06:33:21PM +0000, Ali Saidi wrote:
> > 
> > [...]
> > 
> > > +static void arm_spe__synth_data_source_neoverse(const struct arm_spe_record *record,
> > > +						union perf_mem_data_src *data_src)
> > >  {
> > > -	union perf_mem_data_src	data_src = { 0 };
> > > +	/*
> > > +	 * Even though four levels of cache hierarchy are possible, no known
> > > +	 * production Neoverse systems currently include more than three levels
> > > +	 * so for the time being we assume three exist. If a production system
> > > +	 * is built with four the this function would have to be changed to
> > > +	 * detect the number of levels for reporting.
> > > +	 */
> > >  
> > > -	if (record->op == ARM_SPE_LD)
> > > -		data_src.mem_op = PERF_MEM_OP_LOAD;
> > > -	else
> > > -		data_src.mem_op = PERF_MEM_OP_STORE;
> > 
> > Firstly, apologize that I didn't give clear idea when Ali sent patch sets
> > v2 and v3.
> > 
> > IMHO, we need to consider two kinds of information which can guide us
> > for a reliable implementation.  The first thing is to summarize the data
> > source configuration for x86 PEBS, we can dive in more details for this
> > part; the second thing is we can refer to the AMBA architecture document
> > ARM IHI 0050E.b, section 11.1.2 'Crossing a chip-to-chip interface' and
> > its sub section 'Suggested DataSource values', which would help us
> > much for mapping the cache topology to Arm SPE data source.
> > 
> > As a result, I summarized the data source configurations for PEBS and
> > Arm SPE Neoverse in the spreadsheet:
> > https://docs.google.com/spreadsheets/d/11YmjG0TyRjH7IXgvsREFgTg3AVtxh2dvLloRK1EdNjU/edit?usp=sharing
> 
> Thanks for putting this together and digging into the details, but you're making
> assumptions in neoverse data sources about the core configurations that aren't
> correct. The Neoverse cores all have integrated L1 and L2 caches, so if the
> line is coming from a peer-core we don't know which level it's actually coming
> from.  Similarly, if it's coming from a local cluster, that could mean a cluster
> L3, but it's not the L2. 

I remember you mentioned this before, and yes, these concerns are
valid to me.  Please see my comments below.

> > Please see below comments.
> > 
> > > +	switch (record->source) {
> > > +	case ARM_SPE_NV_L1D:
> > > +		data_src->mem_lvl = PERF_MEM_LVL_HIT;
> > > +		data_src->mem_lvl_num = PERF_MEM_LVLNUM_L1;
> > > +		break;
> > 
> > I think we need to set the field 'mem_snoop' for L1 cache hit:
> > 
> >         data_src->mem_snoop = PERF_MEM_SNOOP_NONE;
> > 
> > For L1 cache hit, it doesn't involve snooping.
>
> I can't find a precise definition for SNOOP_NONE, but it seemed as though
> this would be used for cases where a snoop could have occurred but didn't,
> not for accesses that by definition don't snoop? I'm happy with either way,
> perhaps i just read more into it.

I have the same understanding as you: "this would be used for cases
where a snoop could have occurred but didn't, not for accesses that by
definition don't snoop".

If we refer to PEBS's data source type 01H, "Minimal latency core cache
hit.  This request was satisfied by the L1 data cache", we see that x86
sets SNOOP_NONE as the snoop type for this case.

If we connect this with a snooping protocol, let's use the MOSI protocol
as an example (I simply use the MOSI protocol here for discussion; I
don't have any knowledge of actual CPU implementations), as
described in the wiki page [1], when a cache line is in Owned (O)
state or Shared (S) state, "a processor read (PrRd) does not generate
any snooped signal".  In these cases, I think we should use snoop type
PERF_MEM_SNOOP_NONE.

> > > +	case ARM_SPE_NV_L2:
> > > +		data_src->mem_lvl = PERF_MEM_LVL_HIT;
> > > +		data_src->mem_lvl_num = PERF_MEM_LVLNUM_L2;
> > > +		break;
> > 
> > Ditto:
> > 
> >         data_src->mem_snoop = PERF_MEM_SNOOP_NONE;
> Same comment as above.
> 
> > > +	case ARM_SPE_NV_PEER_CORE:
> > > +		data_src->mem_lvl = PERF_MEM_LVL_HIT;
> > > +		data_src->mem_snoop = PERF_MEM_SNOOP_HITM;
> > > +		data_src->mem_lvl_num = PERF_MEM_LVLNUM_ANY_CACHE;
> > 
> > Peer core contains its local L1 cache, so I think we can set the
> > memory level L1 to indicate this case.
> It could be either the L1 or the L2. All the neoverse cores have private L2
> caches and we don't know. 

How about setting both L1 and L2 together for this case?

Although 'L1 | L2' cannot tell the exact cache level, I think it's
better than using ANY_CACHE; at least this can help us distinguish it
from other data source types if we avoid using ANY_CACHE for all of
them.

> > For this data source type and below types, though they indicate
> > the snooping happens, but it doesn't mean the data in the cache line
> > is in 'modified' state.  If set flag PERF_MEM_SNOOP_HITM, I personally
> > think this will mislead users when report the result.
> 
> I'm of the opposite opinion. If the data wasn't modified, it will likely be
> found in the lower-level shared cache and the transaction wouldn't require a
> cache-to-cache transfer of the modified data, so the most common case when we
> source a line out of another core's cache will be if it was "modifiable" in that
> cache. 

Let's still use the MOSI protocol as an example.  I think there are two
cases: one case is that the peer cache line is in the 'Shared' state,
and the other is that the peer cache line is in the 'Owned' state.

Quotes the wiki page for these two cases:

"When the cache block is in the Shared (S) state and there is a
snooped bus read (BusRd) transaction, then the block stays in the same
state and generates no more transactions as all the cache blocks have
the same value including the main memory and it is only being read,
not written into."

"While in the Owner (O) state and there is a snooped read request
(BusRd), the block remains in the same state while flushing (Flush)
the data for the other processor to read from it."

Seems to me, it's reasonable to set the HITM flag when the snooping
happens while the cache line is in the Modified (M) state.

Again, my comment is based on a literal reading; so please correct me
if I have any misunderstanding here.

> > I prefer we set below fields for ARM_SPE_NV_PEER_CORE:
> > 
> >         data_src->mem_lvl = PERF_MEM_LVL_HIT | PERF_MEM_LVL_L1;
> >         data_src->mem_snoop = PERF_MEM_SNOOP_HIT;
> >         data_src->mem_lvl_num = PERF_MEM_LVLNUM_L1;
> > 
> > > +		break;
> > > +	/*
> > > +	 * We don't know if this is L1, L2 but we do know it was a cache-2-cache
> > > +	 * transfer, so set SNOOP_HITM
> > > +	 */
> > > +	case ARM_SPE_NV_LCL_CLSTR:
> > 
> > For ARM_SPE_NV_LCL_CLSTR, it fetches the data from the shared cache in
> > the cluster level, it should happen in L2 cache:
> > 
> >         data_src->mem_lvl = PERF_MEM_LVL_HIT | PERF_MEM_LVL_L2;
> >         data_src->mem_snoop = PERF_MEM_SNOOP_HIT;
> >         data_src->mem_lvl_num = PERF_MEM_LVLNUM_L2;
> 
> We don't know if this is coming from the cluster cache, or the private L1 or L2
> core caches. The description above about why we'll be transferring the line from
> cache-to-cache applies here too. 

I think a minor difference between PEER_CORE and LCL_CLSTR is:
if data source is PEER_CORE, the snooping happens on the peer core's
local cache (Core's L1 or Core's L2 when core contains L2); for the data
source LCL_CLSTR, the snooping occurs on the cluster level's cache
(cluster's L2 cache or cluster's L3 cache when cluster contains L3).

So can we set both 'L2 | L3' for the LCL_CLSTR case?

> > > +	case ARM_SPE_NV_PEER_CLSTR:
> > > +		data_src->mem_lvl = PERF_MEM_LVL_HIT;
> > > +		data_src->mem_snoop = PERF_MEM_SNOOP_HITM;
> > > +		data_src->mem_lvl_num = PERF_MEM_LVLNUM_ANY_CACHE;
> > > +		break;
> > 
> > This type can snoop from L1 or L2 cache in the peer cluster, so it
> > makes sense to set cache level as PERF_MEM_LVLNUM_ANY_CACHE.  But here
> > should use the snoop type PERF_MEM_SNOOP_HIT, so:
> > 
> >         data_src->mem_lvl = PERF_MEM_LVL_HIT
> >         data_src->mem_snoop = PERF_MEM_SNOOP_HIT;
> >         data_src->mem_lvl_num = PERF_MEM_LVLNUM_ANY_CACHE;
> 
> Given that we agreed to only focus on the three levels generally used by
> the existing implementations LCL and PEER should be the same for now. 

For PEER_CLSTR, we don't know whether the snooping happens on a CPU's
private cache or on a cluster's shared cache; this is why we should use
ANY_CACHE for the cache level.

But LCL_CLSTR is different from PEER_CLSTR: LCL_CLSTR indicates
snooping on the local cluster's shared cache, and we set 'L2 | L3' for
it; therefore, we can distinguish between these two cases.

> > > +	/*
> > > +	 * System cache is assumed to be L3
> > > +	 */
> > > +	case ARM_SPE_NV_SYS_CACHE:
> > > +		data_src->mem_lvl = PERF_MEM_LVL_HIT;
> > > +		data_src->mem_lvl_num = PERF_MEM_LVLNUM_L3;
> > > +		break;
> > 
> >         data_src->mem_lvl = PERF_MEM_LVL_HIT | PERF_MEM_LVL_L3;
> >         data_src->mem_snoop = PERF_MEM_SNOOP_HIT;
> >         data_src->mem_lvl_num = PERF_MEM_LVLNUM_L3;
> 
> I don't think we should set both the deprecated mem_lvl and the mem_lvl_num. 

Is this because the decoding flow has a specific requirement that we
can only set mem_lvl_num and not mem_lvl?

> If we're hitting in the unified L3 cache, we aren't actually snooping anything
> which is why I didn't set mem_snoop here.

I am a bit doubtful about this clarification.  If the system cache is in
the cache coherency domain, then snooping occurs on it; in other words,
if the system cache is connected to a bus (like CCI or CMN) and the bus
helps with data consistency, I prefer to set the SNOOP_HIT flag.

Could you confirm this?

> > > +	/*
> > > +	 * We don't know what level it hit in, except it came from the other
> > > +	 * socket
> > > +	 */
> > > +	case ARM_SPE_NV_REMOTE:
> > > +		data_src->mem_snoop = PERF_MEM_SNOOP_HITM;
> > > +		data_src->mem_remote = PERF_MEM_REMOTE_REMOTE;
> > > +		break;
> > 
> > The type ARM_SPE_NV_REMOTE is a snooping operation and it can happen
> > in any cache levels in remote chip:
> > 
> >         data_src->mem_lvl = PERF_MEM_LVL_HIT;
> >         data_src->mem_snoop = PERF_MEM_SNOOP_HIT;
> >         data_src->mem_remote = PERF_MEM_REMOTE_REMOTE;
> >         data_src->mem_lvl_num = PERF_MEM_LVLNUM_ANY_CACHE;
> 
> Ok.
> 
> > 
> > > +	case ARM_SPE_NV_DRAM:
> > > +		data_src->mem_lvl = PERF_MEM_LVL_HIT;
> > > +		data_src->mem_lvl_num = PERF_MEM_LVLNUM_RAM;
> > > +		break;
> > 
> > We can set snoop as PERF_MEM_SNOOP_MISS for DRAM data source:
> > 
> >         data_src->mem_lvl = PERF_MEM_LVL_HIT;
> >         data_src->mem_snoop = PERF_MEM_SNOOP_MISS;
> >         data_src->mem_lvl_num = PERF_MEM_LVLNUM_RAM;
> > 
> 
> Ok.

Thanks a lot for your work!

Leo

[1] https://en.wikipedia.org/wiki/MOSI_protocol

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v4 2/4] perf arm-spe: Use SPE data source for neoverse cores
  2022-03-26 19:43       ` Ali Saidi
@ 2022-03-28  3:08         ` Ali Saidi
  -1 siblings, 0 replies; 66+ messages in thread
From: Ali Saidi @ 2022-03-28  3:08 UTC (permalink / raw)
  To: alisaidi
  Cc: Nick.Forrington, acme, alexander.shishkin, andrew.kilroy, benh,
	german.gomez, james.clark, john.garry, jolsa, kjain, leo.yan,
	lihuafei1, linux-arm-kernel, linux-kernel, linux-perf-users,
	mark.rutland, mathieu.poirier, mingo, namhyung, peterz, will

Hi Leo,

Thanks for the additional comments.

On Sun, 27 Mar 2022 07:09:19 +0000, Leo Yan wrote:
[snip]
> 
> > > Please see below comments.
> > > 
> > > > +	switch (record->source) {
> > > > +	case ARM_SPE_NV_L1D:
> > > > +		data_src->mem_lvl = PERF_MEM_LVL_HIT;
> > > > +		data_src->mem_lvl_num = PERF_MEM_LVLNUM_L1;
> > > > +		break;
> > > 
> > > I think we need to set the field 'mem_snoop' for L1 cache hit:
> > > 
> > >         data_src->mem_snoop = PERF_MEM_SNOOP_NONE;
> > > 
> > > For L1 cache hit, it doesn't involve snooping.
> >
> > I can't find a precise definition for SNOOP_NONE, but it seemed as though
> > this would be used for cases where a snoop could have occurred but didn't
> > not for accesses that by definition don't snoop? I'm happy with either way,
> > perhaps i just read more into it.
> 
> I have the same understanding with you that "this would be used for
> cases where a snoop could have occurred but didn't not for accesses
> that by definition don't snoop".
> 
> If we refer to PEBS's data source type 01H: "Minimal latency core cache
> hit.  This request was satisfied by the L1 data cache" and x86 sets
> SNOOP_NONE as the snoop type for this case.

I'm happy to set it if everyone believes it's the right thing.

[snip]
> > Same comment as above.
> > 
> > > > +	case ARM_SPE_NV_PEER_CORE:
> > > > +		data_src->mem_lvl = PERF_MEM_LVL_HIT;
> > > > +		data_src->mem_snoop = PERF_MEM_SNOOP_HITM;
> > > > +		data_src->mem_lvl_num = PERF_MEM_LVLNUM_ANY_CACHE;
> > > 
> > > Peer core contains its local L1 cache, so I think we can set the
> > > memory level L1 to indicate this case.
> > It could be either the L1 or the L2. All the neoverse cores have private L2
> > caches and we don't know. 
> 
> How about set both L1 and L2 cache together for this case?
> 
> Although 'L1 | L2' cannot tell the exact cache level, I think it's
> better than use ANY_CACHE, at least this can help us to distinguish
> from other data source types if we avoid to use ANY_CACHE for all of
> them.

This seems much more confusing than the ambiguity of where the line came from
and is only possible with the deprecated mem_lvl encoding.  It will make
perf_mem__lvl_scnprintf() print the wrong thing, and anyone who is trying to
attribute a line to a single cache will find that the sum of the number of hits
is greater than the number of accesses, which also seems terribly confusing.

> 
> > > For this data source type and below types, though they indicate
> > > the snooping happens, but it doesn't mean the data in the cache line
> > > is in 'modified' state.  If set flag PERF_MEM_SNOOP_HITM, I personally
> > > think this will mislead users when report the result.
> > 
> > I'm of the opposite opinion. If the data wasn't modified, it will likely be
> > found in the lower-level shared cache and the transaction wouldn't require a
> > cache-to-cache transfer of the modified data, so the most common case when we
> > source a line out of another cores cache will be if it was "modifiable" in that
> > cache. 
> 
> Let's still use the MOSI protocol as an example.  I think there are two
> cases: one case is the peer cache line is in 'Shared' state and another
> case is the peer cache line is in 'Owned' state.
> 
> Quotes the wiki page for these two cases:
> 
> "When the cache block is in the Shared (S) state and there is a
> snooped bus read (BusRd) transaction, then the block stays in the same
> state and generates no more transactions as all the cache blocks have
> the same value including the main memory and it is only being read,
> not written into."
> 
> "While in the Owner (O) state and there is a snooped read request
> (BusRd), the block remains in the same state while flushing (Flush)
> the data for the other processor to read from it."
> 
> Seems to me, it's reasonable to set the HITM flag when the snooping happens
> while the cache line is in the Modified (M) state.
> 
> Again, my comment is based on the literal understanding; so please
> correct if have any misunderstanding at here.

Per the CMN TRM, "The SLC allocation policy is exclusive for data lines, except
where sharing patterns are detected," so if a line is shared among caches it
will likely also be in the SLC (and thus we'd get an L3 hit). 

If there is one copy of the cache line and that cache line needs to transition
from one core to another core, I don't believe it matters if it was truly
modified or not because the core already had permission to modify it and the
transaction is just as costly (ping ponging between core caches). This is the
one thing I really want to be able to uniquely identify, as any cacheline doing
this that isn't a lock is a place where optimization is likely possible. 

> 
> > > I prefer we set below fields for ARM_SPE_NV_PEER_CORE:
> > > 
> > >         data_src->mem_lvl = PERF_MEM_LVL_HIT | PERF_MEM_LVL_L1;
> > >         data_src->mem_snoop = PERF_MEM_SNOOP_HIT;
> > >         data_src->mem_lvl_num = PERF_MEM_LVLNUM_L1;
> > > 
> > > > +		break;
> > > > +	/*
> > > > +	 * We don't know if this is L1, L2 but we do know it was a cache-2-cache
> > > > +	 * transfer, so set SNOOP_HITM
> > > > +	 */
> > > > +	case ARM_SPE_NV_LCL_CLSTR:
> > > 
> > > For ARM_SPE_NV_LCL_CLSTR, it fetches the data from the shared cache in
> > > the cluster level, it should happen in L2 cache:
> > > 
> > >         data_src->mem_lvl = PERF_MEM_LVL_HIT | PERF_MEM_LVL_L2;
> > >         data_src->mem_snoop = PERF_MEM_SNOOP_HIT;
> > >         data_src->mem_lvl_num = PERF_MEM_LVLNUM_L2;
> > 
> > We don't know if this is coming from the cluster cache, or the private L1 or L2
> > core caches. The description above about why we'll be transferring the line from
> > cache-to-cache applies here too. 
> 
> I think a minor difference between PEER_CORE and LCL_CLSTR is:
> if data source is PEER_CORE, the snooping happens on the peer core's
> local cache (Core's L1 or Core's L2 when core contains L2); for the data
> source LCL_CLSTR, the snooping occurs on the cluster level's cache
> (cluster's L2 cache or cluster's L3 cache when cluster contains L3).
> 
> So can we set both 'L2 | L3' for LCL_CLSTR case?

Just as above, I believe setting two options will lead to more confusion.
Additionally, we agreed in the previous discussion that the system cache shall
be the L3, so I don't see how this helps alleviate any confusion. That said, the
systems I'm most familiar with never set LCL_CLSTR as a source. 


> > > > +	case ARM_SPE_NV_PEER_CLSTR:
> > > > +		data_src->mem_lvl = PERF_MEM_LVL_HIT;
> > > > +		data_src->mem_snoop = PERF_MEM_SNOOP_HITM;
> > > > +		data_src->mem_lvl_num = PERF_MEM_LVLNUM_ANY_CACHE;
> > > > +		break;
> > > 
> > > This type can snoop from L1 or L2 cache in the peer cluster, so it
> > > makes sense to set cache level as PERF_MEM_LVLNUM_ANY_CACHE.  But here
> > > should use the snoop type PERF_MEM_SNOOP_HIT, so:
> > > 
> > >         data_src->mem_lvl = PERF_MEM_LVL_HIT
> > >         data_src->mem_snoop = PERF_MEM_SNOOP_HIT;
> > >         data_src->mem_lvl_num = PERF_MEM_LVLNUM_ANY_CACHE;
> > 
> > Given that we agreed to only focus on the three levels generally used by
> > the existing implementations LCL and PEER should be the same for now. 
> 
> For PEER_CLSTR, we don't know the snooping happening on CPU's private
> cache or cluster's shared cache, this is why we should use ANY_CACHE
> for cache level.
> 
> But LCL_CLSTR is different from PEER_CLSTR, LCL_CLSTR indicates the
> snooping on the local cluster's share cache, we set 'L2 | L3' for it;
> therefore, we can distinguish between these two cases.

For the same reasons above, I believe we should only set a single value. 

> 
> > > > +	/*
> > > > +	 * System cache is assumed to be L3
> > > > +	 */
> > > > +	case ARM_SPE_NV_SYS_CACHE:
> > > > +		data_src->mem_lvl = PERF_MEM_LVL_HIT;
> > > > +		data_src->mem_lvl_num = PERF_MEM_LVLNUM_L3;
> > > > +		break;
> > > 
> > >         data_src->mem_lvl = PERF_MEM_LVL_HIT | PERF_MEM_LVL_L3;
> > >         data_src->mem_snoop = PERF_MEM_SNOOP_HIT;
> > >         data_src->mem_lvl_num = PERF_MEM_LVLNUM_L3;
> > 
> > I don't think we should set both the deprecated mem_lvl and the mem_lvl_num. 
> 
> Is this because the decoding flow has any specific requirement that
> only can set mem_lvl_num and not set mem_lvl?

As one example, perf_mem__lvl_scnprintf() breaks if both are set. 

> 
> > If we're hitting in the unified L3 cache, we aren't actually snooping anything
> > which is why I didn't set mem_snoop here.
> 
> I am a bit doubtful about this clarification.  If the system cache is in
> the cache coherency domain, then snooping occurs on it; in other words,
> if system cache is connected with a bus (like CCI or CMN), and the bus
> can help for data consistency, I prefer to set SNOOP_HIT flag.
> 
> Could you confirm for this?

Thinking about this more, that seems reasonable. 

Thanks,
Ali


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v4 2/4] perf arm-spe: Use SPE data source for neoverse cores
  2022-03-28  3:08         ` Ali Saidi
@ 2022-03-28 13:05           ` Leo Yan
  -1 siblings, 0 replies; 66+ messages in thread
From: Leo Yan @ 2022-03-28 13:05 UTC (permalink / raw)
  To: Ali Saidi
  Cc: Nick.Forrington, acme, alexander.shishkin, andrew.kilroy, benh,
	german.gomez, james.clark, john.garry, jolsa, kjain, lihuafei1,
	linux-arm-kernel, linux-kernel, linux-perf-users, mark.rutland,
	mathieu.poirier, mingo, namhyung, peterz, will

Hi Ali,

On Mon, Mar 28, 2022 at 03:08:05AM +0000, Ali Saidi wrote:

[...]

> > > > > +	case ARM_SPE_NV_PEER_CORE:
> > > > > +		data_src->mem_lvl = PERF_MEM_LVL_HIT;
> > > > > +		data_src->mem_snoop = PERF_MEM_SNOOP_HITM;
> > > > > +		data_src->mem_lvl_num = PERF_MEM_LVLNUM_ANY_CACHE;
> > > > 
> > > > Peer core contains its local L1 cache, so I think we can set the
> > > > memory level L1 to indicate this case.
> > > It could be either the L1 or the L2. All the neoverse cores have private L2
> > > caches and we don't know. 
> > 
> > How about set both L1 and L2 cache together for this case?
> > 
> > Although 'L1 | L2' cannot tell the exact cache level, I think it's
> > better than use ANY_CACHE, at least this can help us to distinguish
> > from other data source types if we avoid to use ANY_CACHE for all of
> > them.
> 
> This seems much more confusing than the ambiguity of where the line came from
> and is only possible with the deprecated mem_lvl encoding.  It will make
> perf_mem__lvl_scnprintf() print the wrong thing and anyone who is trying to
> attribute a line to a single cache will find that the sum of the number of hits
> is greater than the number of accesses which also seems terribly confusing.

Understood.  I considered the potential issue for
perf_mem__lvl_scnprintf(); it actually supports printing multiple cache
levels for 'mem_lvl', composing them by joining the levels with 'or'.
We might need to extend it a bit for the other field, 'mem_lvl_num'.

Agreed that there would be an accuracy issue for the statistics; it's
fine with me to use ANY_CACHE in this patch set.

I still think we should consider extending the memory levels to
describe a clear memory hierarchy on Arm architectures.  I personally
like the definitions "PEER_CORE", "LCL_CLSTR", "PEER_CLSTR" and
"SYS_CACHE"; though these levels are not as precise as L1/L2/L3, they
map very well onto the cache topology of Arm architectures without any
confusion.  We could take this as a later enhancement if you don't want
to hold up the current patch set's upstreaming.

> > > > For this data source type and below types, though they indicate
> > > > the snooping happens, but it doesn't mean the data in the cache line
> > > > is in 'modified' state.  If set flag PERF_MEM_SNOOP_HITM, I personally
> > > > think this will mislead users when report the result.
> > > 
> > > I'm of the opposite opinion. If the data wasn't modified, it will likely be
> > > found in the lower-level shared cache and the transaction wouldn't require a
> > > cache-to-cache transfer of the modified data, so the most common case when we
> > > source a line out of another cores cache will be if it was "modifiable" in that
> > > cache. 
> > 
> > Let's still use MOSI protocol as example.  I think there have two
> > cases: on case is the peer cache line is in 'Shared' state and another
> > case is the peer cache line is in 'Owned' state.
> > 
> > Quotes the wiki page for these two cases:
> > 
> > "When the cache block is in the Shared (S) state and there is a
> > snooped bus read (BusRd) transaction, then the block stays in the same
> > state and generates no more transactions as all the cache blocks have
> > the same value including the main memory and it is only being read,
> > not written into."
> > 
> > "While in the Owner (O) state and there is a snooped read request
> > (BusRd), the block remains in the same state while flushing (Flush)
> > the data for the other processor to read from it."
> > 
> > Seems to me, it's reasonable to set HTIM flag when the snooping happens
> > for the cache line line is in the Modified (M) state.
> > 
> > Again, my comment is based on the literal understanding; so please
> > correct if have any misunderstanding at here.
> 
> Per the CMN TRM, "The SLC allocation policy is exclusive for data lines, except
> where sharing patterns are detected," so if a line is shared among caches it
> will likely also be in the SLC (and thus we'd get an L3 hit). 
> 
> If there is one copy of the cache line and that cache line needs to transition
> from one core to another core, I don't believe it matters if it was truly
> modified or not because the core already had permission to modify it and the
> transaction is just as costly (ping ponging between core caches). This is the
> one thing I really want to be able to uniquely identify as any cacheline doing
> this that isn't a lock is a place where optimization is likely possible. 

I understood that your point that if big amount of transitions within
multiple cores hit the same cache line, it would be very likely caused
by the cache line's Modified state so we set PERF_MEM_SNOOP_HITM flag
for easier reviewing.

Alternatively, I think it's good to pick up the patch series "perf c2c:
Sort cacheline with all loads" [1], rather than relying on HITM tag, the
patch series extends a new option "-d all" for perf c2c, so it displays
the suspecious false sharing cache lines based on load/store ops and
thread infos.  The main reason for holding on th patch set is due to we
cannot verify it with Arm SPE at that time point, as the time being Arm
SPE trace data was absent both store ops and data source packets.

I perfer to set PERF_MEM_SNOOP_HIT flag in this patch set and we can
upstream the patch series "perf c2c: Sort cacheline with all loads"
(only needs upstreaming patches 01, 02, 03, 10, 11, the rest patches
have been merged in the mainline kernel).

If this is fine for you, I can respin the patch series for "perf c2c".
Or any other thoughts?

[1] https://lore.kernel.org/lkml/20201213133850.10070-1-leo.yan@linaro.org/

> > > > I prefer we set below fields for ARM_SPE_NV_PEER_CORE:
> > > > 
> > > >         data_src->mem_lvl = PERF_MEM_LVL_HIT | PERF_MEM_LVL_L1;
> > > >         data_src->mem_snoop = PERF_MEM_SNOOP_HIT;
> > > >         data_src->mem_lvl_num = PERF_MEM_LVLNUM_L1;
> > > > 
> > > > > +		break;
> > > > > +	/*
> > > > > +	 * We don't know if this is L1, L2 but we do know it was a cache-2-cache
> > > > > +	 * transfer, so set SNOOP_HITM
> > > > > +	 */
> > > > > +	case ARM_SPE_NV_LCL_CLSTR:
> > > > 
> > > > For ARM_SPE_NV_LCL_CLSTR, it fetches the data from the shared cache in
> > > > the cluster level, it should happen in L2 cache:
> > > > 
> > > >         data_src->mem_lvl = PERF_MEM_LVL_HIT | PERF_MEM_LVL_L2;
> > > >         data_src->mem_snoop = PERF_MEM_SNOOP_HIT;
> > > >         data_src->mem_lvl_num = PERF_MEM_LVLNUM_L2;
> > > 
> > > We don't know if this is coming from the cluster cache, or the private L1 or L2
> > > core caches. The description above about why we'll be transferring the line from
> > > cache-to-cache applies here too. 
> > 
> > I think a minor difference between PEER_CORE and LCL_CLSTR is:
> > if data source is PEER_CORE, the snooping happens on the peer core's
> > local cache (Core's L1 or Core's L2 when core contains L2); for the data
> > source LCL_CLSTR, the snooping occurs on the cluster level's cache
> > (cluster's L2 cache or cluster's L3 cache when cluster contains L3).
> > 
> > So can we set both 'L2 | L3' for LCL_CLSTR case?
> 
> Just as above, I believe setting two options will lead to more confusion.
> Additionally, we agreed in the previous discussion that the system cache shall
> be the L3, so i don't see how this helps alleviate any confusion. That said, the
> systems I'm most familiar with never set LCL_CLSTR as a source. 

Okay, let's rollback to PERF_MEM_LVLNUM_ANY_CACHE as cache level for
LCL_CLSTR and PEER_CLSTR.

Thanks,
Leo

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v4 2/4] perf arm-spe: Use SPE data source for neoverse cores
@ 2022-03-28 13:05           ` Leo Yan
  0 siblings, 0 replies; 66+ messages in thread
From: Leo Yan @ 2022-03-28 13:05 UTC (permalink / raw)
  To: Ali Saidi
  Cc: Nick.Forrington, acme, alexander.shishkin, andrew.kilroy, benh,
	german.gomez, james.clark, john.garry, jolsa, kjain, lihuafei1,
	linux-arm-kernel, linux-kernel, linux-perf-users, mark.rutland,
	mathieu.poirier, mingo, namhyung, peterz, will

Hi Ali,

On Mon, Mar 28, 2022 at 03:08:05AM +0000, Ali Saidi wrote:

[...]

> > > > > +	case ARM_SPE_NV_PEER_CORE:
> > > > > +		data_src->mem_lvl = PERF_MEM_LVL_HIT;
> > > > > +		data_src->mem_snoop = PERF_MEM_SNOOP_HITM;
> > > > > +		data_src->mem_lvl_num = PERF_MEM_LVLNUM_ANY_CACHE;
> > > > 
> > > > Peer core contains its local L1 cache, so I think we can set the
> > > > memory level L1 to indicate this case.
> > > It could be either the L1 or the L2. All the neoverse cores have private L2
> > > caches and we don't know. 
> > 
> > How about set both L1 and L2 cache together for this case?
> > 
> > Although 'L1 | L2' cannot tell the exact cache level, I think it's
> > better than use ANY_CACHE, at least this can help us to distinguish
> > from other data source types if we avoid to use ANY_CACHE for all of
> > them.
> 
> This seems much more confusing then the ambiguity of where the line came from
> and is only possible with the deprecated mem_lvl enconding.  It will make
> perf_mem__lvl_scnprintf() print the wrong thing and anyone who is trying to
> attribute a line to a single cache will find that the sum of the number of hits
> is greater than the number of accesses which also seems terribly confusing.

Understood.  I considered the potential issue for
perf_mem__lvl_scnprintf(): it actually supports printing multiple cache
levels for 'mem_lvl', composing them by joining the set levels with
'or'.  We might need to extend it a bit for the other field,
'mem_lvl_num'.
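For reference, the 'or' composition described above can be sketched in a
few lines of standalone C.  The flag values and function shape here are
illustrative assumptions, not the real definitions; the actual
implementation is perf_mem__lvl_scnprintf() in
tools/perf/util/mem-events.c, with the flags in uapi/linux/perf_event.h:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Simplified stand-ins for the PERF_MEM_LVL_* bit flags (values assumed
 * for illustration only). */
#define MEM_LVL_L1  (1 << 0)
#define MEM_LVL_L2  (1 << 1)
#define MEM_LVL_L3  (1 << 2)

/* Compose a string such as "L1 or L2" from an or'ed level mask, mimicking
 * how perf_mem__lvl_scnprintf() joins multiple set bits with "or". */
static int lvl_snprintf(char *out, size_t sz, unsigned int lvl)
{
	static const struct { unsigned int bit; const char *name; } tbl[] = {
		{ MEM_LVL_L1, "L1" }, { MEM_LVL_L2, "L2" }, { MEM_LVL_L3, "L3" },
	};
	size_t i, n = 0;

	out[0] = '\0';
	for (i = 0; i < sizeof(tbl) / sizeof(tbl[0]); i++) {
		if (!(lvl & tbl[i].bit))
			continue;
		/* prefix with " or " for every level after the first */
		n += snprintf(out + n, sz - n, "%s%s",
			      n ? " or " : "", tbl[i].name);
	}
	return (int)n;
}
```

With this shape, 'L1 | L2' prints as "L1 or L2", which is exactly the
multi-level output being discussed.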

Agreed that there would be an accuracy issue for the statistics; it's
fine for me to use ANY_CACHE in this patch set.

I still think we should consider extending the memory levels to
describe a clear memory hierarchy on Arm architectures.  I personally
like the definitions "PEER_CORE", "LCL_CLSTR", "PEER_CLSTR" and
"SYS_CACHE"; though these levels are not as precise as L1/L2/L3, they
map well onto the cache topology of Arm systems without any confusion.
We could take this as an enhancement if you don't want to hold up the
current patch set's upstreaming.

> > > > For this data source type and below types, though they indicate
> > > > the snooping happens, but it doesn't mean the data in the cache line
> > > > is in 'modified' state.  If set flag PERF_MEM_SNOOP_HITM, I personally
> > > > think this will mislead users when report the result.
> > > 
> > > I'm of the opposite opinion. If the data wasn't modified, it will likely be
> > > found in the lower-level shared cache and the transaction wouldn't require a
> > > cache-to-cache transfer of the modified data, so the most common case when we
> > > source a line out of another cores cache will be if it was "modifiable" in that
> > > cache. 
> > 
> > Let's still use MOSI protocol as example.  I think there have two
> > cases: on case is the peer cache line is in 'Shared' state and another
> > case is the peer cache line is in 'Owned' state.
> > 
> > Quotes the wiki page for these two cases:
> > 
> > "When the cache block is in the Shared (S) state and there is a
> > snooped bus read (BusRd) transaction, then the block stays in the same
> > state and generates no more transactions as all the cache blocks have
> > the same value including the main memory and it is only being read,
> > not written into."
> > 
> > "While in the Owner (O) state and there is a snooped read request
> > (BusRd), the block remains in the same state while flushing (Flush)
> > the data for the other processor to read from it."
> > 
> > Seems to me, it's reasonable to set HTIM flag when the snooping happens
> > for the cache line line is in the Modified (M) state.
> > 
> > Again, my comment is based on the literal understanding; so please
> > correct if have any misunderstanding at here.
> 
> Per the CMN TRM, "The SLC allocation policy is exclusive for data lines, except
> where sharing patterns are detected," so if a line is shared among caches it
> will likely also be in the SLC (and thus we'd get an L3 hit). 
> 
> If there is one copy of the cache line and that cache line needs to transition
> from one core to another core, I don't believe it matters if it was truly
> modified or not because the core already had permission to modify it and the
> transaction is just as costly (ping ponging between core caches). This is the
> one thing I really want to be able to uniquely identify as any cacheline doing
> this that isn't a lock is a place where optimization is likely possible. 

I understand your point: if a large number of transitions between
multiple cores hit the same cache line, it is very likely caused by the
cache line being in the Modified state, so we set the
PERF_MEM_SNOOP_HITM flag to make reviewing easier.
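The policy under debate can be written down as a tiny decision function.
All names below are illustrative stand-ins, not the real PERF_MEM_SNOOP_*
bits or SPE data-source encodings; it only captures Ali's argument that a
line sourced from a peer core's private cache is flagged HITM because the
peer held it in a writable state and the cache-to-cache transfer costs
the same whether or not it was actually written:

```c
#include <assert.h>

/* Stand-ins for the PERF_MEM_SNOOP_* bits in uapi/linux/perf_event.h. */
enum snoop_flag { SNOOP_HIT = 1, SNOOP_HITM = 2 };

/* Stand-ins for a few SPE data-source categories. */
enum data_source { SRC_LOCAL_CACHE, SRC_PEER_CORE, SRC_SYS_CACHE };

static enum snoop_flag snoop_for_source(enum data_source src)
{
	switch (src) {
	case SRC_PEER_CORE:
		/* cache-to-cache transfer from a peer core: tag as HITM */
		return SNOOP_HITM;
	default:
		/* hit without a peer-cache transfer */
		return SNOOP_HIT;
	}
}
```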

Alternatively, I think it's worth picking up the patch series "perf c2c:
Sort cacheline with all loads" [1].  Rather than relying on the HITM
tag, that series adds a new option "-d all" to perf c2c, so it displays
suspicious false-sharing cache lines based on load/store ops and thread
info.  The main reason for holding off on that patch set was that we
could not verify it with Arm SPE at the time, since Arm SPE trace data
then lacked both store ops and data source packets.

I prefer to set the PERF_MEM_SNOOP_HIT flag in this patch set, and we
can upstream the patch series "perf c2c: Sort cacheline with all loads"
(only patches 01, 02, 03, 10 and 11 need upstreaming; the rest have
been merged in the mainline kernel).

If this is fine for you, I can respin the patch series for "perf c2c".
Or any other thoughts?

[1] https://lore.kernel.org/lkml/20201213133850.10070-1-leo.yan@linaro.org/

> > > > I prefer we set below fields for ARM_SPE_NV_PEER_CORE:
> > > > 
> > > >         data_src->mem_lvl = PERF_MEM_LVL_HIT | PERF_MEM_LVL_L1;
> > > >         data_src->mem_snoop = PERF_MEM_SNOOP_HIT;
> > > >         data_src->mem_lvl_num = PERF_MEM_LVLNUM_L1;
> > > > 
> > > > > +		break;
> > > > > +	/*
> > > > > +	 * We don't know if this is L1, L2 but we do know it was a cache-2-cache
> > > > > +	 * transfer, so set SNOOP_HITM
> > > > > +	 */
> > > > > +	case ARM_SPE_NV_LCL_CLSTR:
> > > > 
> > > > For ARM_SPE_NV_LCL_CLSTR, it fetches the data from the shared cache in
> > > > the cluster level, it should happen in L2 cache:
> > > > 
> > > >         data_src->mem_lvl = PERF_MEM_LVL_HIT | PERF_MEM_LVL_L2;
> > > >         data_src->mem_snoop = PERF_MEM_SNOOP_HIT;
> > > >         data_src->mem_lvl_num = PERF_MEM_LVLNUM_L2;
> > > 
> > > We don't know if this is coming from the cluster cache, or the private L1 or L2
> > > core caches. The description above about why we'll be transferring the line from
> > > cache-to-cache applies here too. 
> > 
> > I think a minor difference between PEER_CORE and LCL_CLSTR is:
> > if data source is PEER_CORE, the snooping happens on the peer core's
> > local cache (Core's L1 or Core's L2 when core contains L2); for the data
> > source LCL_CLSTR, the snooping occurs on the cluster level's cache
> > (cluster's L2 cache or cluster's L3 cache when cluster contains L3).
> > 
> > So can we set both 'L2 | L3' for LCL_CLSTR case?
> 
> Just as above, I believe setting two options will lead to more confusion.
> Additionally, we agreed in the previous discussion that the system cache shall
> be the L3, so i don't see how this helps alleviate any confusion. That said, the
> systems I'm most familiar with never set LCL_CLSTR as a source. 

Okay, let's fall back to PERF_MEM_LVLNUM_ANY_CACHE as the cache level
for LCL_CLSTR and PEER_CLSTR.

Thanks,
Leo

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v4 2/4] perf arm-spe: Use SPE data source for neoverse cores
  2022-03-28 13:05           ` Leo Yan
@ 2022-03-29 13:34             ` Shuai Xue
  -1 siblings, 0 replies; 66+ messages in thread
From: Shuai Xue @ 2022-03-29 13:34 UTC (permalink / raw)
  To: Leo Yan, Ali Saidi
  Cc: Nick.Forrington, acme, alexander.shishkin, andrew.kilroy, benh,
	german.gomez, james.clark, john.garry, jolsa, kjain, lihuafei1,
	linux-arm-kernel, linux-kernel, linux-perf-users, mark.rutland,
	mathieu.poirier, mingo, namhyung, peterz, will

Hi Leo, Ali,

Thank you for your great work and valuable discussion.

On 2022/3/27 3:43 AM, Ali Saidi wrote:
> Hi Leo,
> On Sat, 26 Mar 2022 21:47:54 +0800, Leo Yan wrote:
>> Hi Ali, German,
>>
>> On Thu, Mar 24, 2022 at 06:33:21PM +0000, Ali Saidi wrote:
>>
>> [...]
>>
>>> +static void arm_spe__synth_data_source_neoverse(const struct arm_spe_record *record,
>>> +						union perf_mem_data_src *data_src)
>>>  {
>>> -	union perf_mem_data_src	data_src = { 0 };
>>> +	/*
>>> +	 * Even though four levels of cache hierarchy are possible, no known
>>> +	 * production Neoverse systems currently include more than three levels
>>> +	 * so for the time being we assume three exist. If a production system
>>> +	 * is built with four the this function would have to be changed to
>>> +	 * detect the number of levels for reporting.
>>> +	 */
>>>
>>> -	if (record->op == ARM_SPE_LD)
>>> -		data_src.mem_op = PERF_MEM_OP_LOAD;
>>> -	else
>>> -		data_src.mem_op = PERF_MEM_OP_STORE;
>>
>> Firstly, apologize that I didn't give clear idea when Ali sent patch sets
>> v2 and v3.
>>
>> IMHO, we need to consider two kinds of information which can guide us
>> for a reliable implementation.  The first thing is to summarize the data
>> source configuration for x86 PEBS, we can dive in more details for this
>> part; the second thing is we can refer to the AMBA architecture document
>> ARM IHI 0050E.b, section 11.1.2 'Crossing a chip-to-chip interface' and
>> its sub section 'Suggested DataSource values', which would help us
>> much for mapping the cache topology to Arm SPE data source.
>>
>> As a result, I summarized the data source configurations for PEBS and
>> Arm SPE Neoverse in the spreadsheet:
>> https://docs.google.com/spreadsheets/d/11YmjG0TyRjH7IXgvsREFgTg3AVtxh2dvLloRK1EdNjU/edit?usp=sharing
>
> Thanks for putting this together and digging into the details, but you're making
> assumptions in neoverse data sources about the core configurations that aren't
> correct. The Neoverse cores have all have integrated L1 and L2 cache, so if the
> line is coming from a peer-core we don't know which level it's actually coming
> from.  Similarly, if it's coming from a local cluster, that could mean a cluster
> l3, but it's not the L2.

As far as I know, in the Neoverse N2 microarchitecture the L3 cache is
non-inclusive, while L1 and L2 are strictly inclusive, like Intel
Skylake SP (SKX); i.e., an L2 line may or may not also be in the L3 (no
guarantee is made).  That is to say, we cannot tell whether it came from
the cluster L2 or the L3.  Could you confirm this?

[...]


> I still think we should consider to extend the memory levels to
> demonstrate clear momory hierarchy on Arm archs, I personally like the
> definitions for "PEER_CORE", "LCL_CLSTR", "PEER_CLSTR" and "SYS_CACHE",
> though these cache levels are not precise like L1/L2/L3 levels, they can
> help us to map very well for the cache topology on Arm archs and without
> any confusion.  We could take this as an enhancement if you don't want
> to bother the current patch set's upstreaming.

Agree. In my opinion, imprecise cache levels can lead to wrong conclusions.
"PEER_CORE", "LCL_CLSTR", "PEER_CLSTR" and "SYS_CACHE" are more intuitive.

Best Regards,
Shuai



^ permalink raw reply	[flat|nested] 66+ messages in thread


* Re: [PATCH v4 2/4] perf arm-spe: Use SPE data source for neoverse cores
  2022-03-28 13:05           ` Leo Yan
@ 2022-03-29 14:32             ` Ali Saidi
  -1 siblings, 0 replies; 66+ messages in thread
From: Ali Saidi @ 2022-03-29 14:32 UTC (permalink / raw)
  To: leo.yan
  Cc: Nick.Forrington, acme, alexander.shishkin, alisaidi,
	andrew.kilroy, benh, german.gomez, james.clark, john.garry,
	jolsa, kjain, lihuafei1, linux-arm-kernel, linux-kernel,
	linux-perf-users, mark.rutland, mathieu.poirier, mingo, namhyung,
	peterz, will


Hi Leo,

On Mon, 28 Mar 2022 21:05:47 +0800, Leo Yan wrote:
> Hi Ali,
> 
> On Mon, Mar 28, 2022 at 03:08:05AM +0000, Ali Saidi wrote:
> 
> [...]
> 
> > > > > > +	case ARM_SPE_NV_PEER_CORE:
> > > > > > +		data_src->mem_lvl = PERF_MEM_LVL_HIT;
> > > > > > +		data_src->mem_snoop = PERF_MEM_SNOOP_HITM;
> > > > > > +		data_src->mem_lvl_num = PERF_MEM_LVLNUM_ANY_CACHE;
> > > > > 
> > > > > Peer core contains its local L1 cache, so I think we can set the
> > > > > memory level L1 to indicate this case.
> > > > It could be either the L1 or the L2. All the neoverse cores have private L2
> > > > caches and we don't know. 
> > > 
> > > How about set both L1 and L2 cache together for this case?
> > > 
> > > Although 'L1 | L2' cannot tell the exact cache level, I think it's
> > > better than use ANY_CACHE, at least this can help us to distinguish
> > > from other data source types if we avoid to use ANY_CACHE for all of
> > > them.
> > 
> > This seems much more confusing then the ambiguity of where the line came from
> > and is only possible with the deprecated mem_lvl enconding.  It will make
> > perf_mem__lvl_scnprintf() print the wrong thing and anyone who is trying to
> > attribute a line to a single cache will find that the sum of the number of hits
> > is greater than the number of accesses which also seems terribly confusing.
> 
> Understand.  I considered the potential issue for
> perf_mem__lvl_scnprintf(), actually it supports printing multipl cache
> levels for 'mem_lvl', by conjuncting with 'or' it composes the multiple
> cache levels.  We might need to extend a bit for another field
> 'mem_lvlnum'.
> 
> Agreed that there would have inaccurate issue for statistics, it's fine
> for me to use ANY_CACHE in this patch set.

Thanks!

> 
> I still think we should consider to extend the memory levels to
> demonstrate clear momory hierarchy on Arm archs, I personally like the
> definitions for "PEER_CORE", "LCL_CLSTR", "PEER_CLSTR" and "SYS_CACHE",
> though these cache levels are not precise like L1/L2/L3 levels, they can
> help us to map very well for the cache topology on Arm archs and without
> any confusion.  We could take this as an enhancement if you don't want
> to bother the current patch set's upstreaming.

I'd like to do this in a separate patch, but I have one other proposal.
The Neoverse cores' L2 is strictly inclusive of the L1, so even if a
line is in the L1, it's also in the L2.  Given that the Graviton
systems and, afaik, the Ampere systems don't have any cache between the
L2 and the SLC, anything from PEER_CORE, LCL_CLSTR, or PEER_CLSTR would
hit in the L2, so perhaps we should just set L2 for these cases?
German, are you good with this for now?
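A minimal sketch of that proposal, with stand-in names for the SPE
data-source encodings and the PERF_MEM_LVLNUM_* values (the real
constants live in arm-spe-decoder.h and uapi/linux/perf_event.h; the
mapping itself is just the one proposed above, not merged code):

```c
#include <assert.h>

/* Hypothetical stand-ins for the ARM_SPE_NV_* data-source encodings. */
enum spe_nv_source {
	NV_L1D, NV_PEER_CORE, NV_LCL_CLSTR, NV_PEER_CLSTR,
	NV_SYS_CACHE, NV_DRAM,
};

/* Hypothetical stand-ins for PERF_MEM_LVLNUM_* values. */
enum lvlnum { LVLNUM_L1 = 1, LVLNUM_L2, LVLNUM_L3, LVLNUM_RAM,
	      LVLNUM_ANY_CACHE };

/*
 * Because the Neoverse L2 is strictly inclusive of the L1, and known
 * systems have no cache between the private L2 and the SLC, a line
 * snooped out of a peer core or cluster must also be present in some
 * L2 -- so report level 2 for all three peer cases.
 */
static enum lvlnum nv_source_to_lvl(enum spe_nv_source src)
{
	switch (src) {
	case NV_L1D:        return LVLNUM_L1;
	case NV_PEER_CORE:
	case NV_LCL_CLSTR:
	case NV_PEER_CLSTR: return LVLNUM_L2;
	case NV_SYS_CACHE:  return LVLNUM_L3;  /* SLC reported as L3 */
	case NV_DRAM:       return LVLNUM_RAM;
	default:            return LVLNUM_ANY_CACHE;
	}
}
```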

> > > > > For this data source type and below types, though they indicate
> > > > > the snooping happens, but it doesn't mean the data in the cache line
> > > > > is in 'modified' state.  If set flag PERF_MEM_SNOOP_HITM, I personally
> > > > > think this will mislead users when report the result.
> > > > 
> > > > I'm of the opposite opinion. If the data wasn't modified, it will likely be
> > > > found in the lower-level shared cache and the transaction wouldn't require a
> > > > cache-to-cache transfer of the modified data, so the most common case when we
> > > > source a line out of another cores cache will be if it was "modifiable" in that
> > > > cache. 
> > > 
> > > Let's still use MOSI protocol as example.  I think there have two
> > > cases: on case is the peer cache line is in 'Shared' state and another
> > > case is the peer cache line is in 'Owned' state.
> > > 
> > > Quotes the wiki page for these two cases:
> > > 
> > > "When the cache block is in the Shared (S) state and there is a
> > > snooped bus read (BusRd) transaction, then the block stays in the same
> > > state and generates no more transactions as all the cache blocks have
> > > the same value including the main memory and it is only being read,
> > > not written into."
> > > 
> > > "While in the Owner (O) state and there is a snooped read request
> > > (BusRd), the block remains in the same state while flushing (Flush)
> > > the data for the other processor to read from it."
> > > 
> > > Seems to me, it's reasonable to set HTIM flag when the snooping happens
> > > for the cache line line is in the Modified (M) state.
> > > 
> > > Again, my comment is based on the literal understanding; so please
> > > correct if have any misunderstanding at here.
> > 
> > Per the CMN TRM, "The SLC allocation policy is exclusive for data lines, except
> > where sharing patterns are detected," so if a line is shared among caches it
> > will likely also be in the SLC (and thus we'd get an L3 hit). 
> > 
> > If there is one copy of the cache line and that cache line needs to transition
> > from one core to another core, I don't believe it matters if it was truly
> > modified or not because the core already had permission to modify it and the
> > transaction is just as costly (ping ponging between core caches). This is the
> > one thing I really want to be able to uniquely identify as any cacheline doing
> > this that isn't a lock is a place where optimization is likely possible. 
> 
> I understood that your point that if big amount of transitions within
> multiple cores hit the same cache line, it would be very likely caused
> by the cache line's Modified state so we set PERF_MEM_SNOOP_HITM flag
> for easier reviewing.

And, from a dataflow perspective, whether the line was owned (and could
have been modified) or was actually modified is really the less
interesting bit.

> Alternatively, I think it's good to pick up the patch series "perf c2c:
> Sort cacheline with all loads" [1], rather than relying on HITM tag, the
> patch series extends a new option "-d all" for perf c2c, so it displays
> the suspecious false sharing cache lines based on load/store ops and
> thread infos.  The main reason for holding on th patch set is due to we
> cannot verify it with Arm SPE at that time point, as the time being Arm
> SPE trace data was absent both store ops and data source packets.

Looking at examples, at least on my system, data-source isn't set for
stores, only for loads.

> I perfer to set PERF_MEM_SNOOP_HIT flag in this patch set and we can
> upstream the patch series "perf c2c: Sort cacheline with all loads"
> (only needs upstreaming patches 01, 02, 03, 10, 11, the rest patches
> have been merged in the mainline kernel).
> 
> If this is fine for you, I can respin the patch series for "perf c2c".
> Or any other thoughts?

I think this is a nice option to have in the tool-box, but from my
point of view I'd like someone who is familiar with c2c output on x86
to be able to come to an arm64 system and zero in on a ping-ponging
line as they would otherwise.  Highlighting a line that moves between
cores frequently, and is likely in the exclusive state, by tagging it
as HITM accomplishes this and makes such cases easier to find.  Your
approach also has inaccuracies: it wouldn't be able to differentiate
between core X accessing a line a lot followed by core Y accessing it a
lot vs. the cores ping-ponging.  Yes, I agree that we will "overcount"
HITM, but I don't think this is particularly bad; it specifically
highlights the core-to-core transfers that are likely a performance
issue, and it will make it easier to identify areas of false or true
sharing and improve performance.

Thanks,
Ali

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v4 2/4] perf arm-spe: Use SPE data source for neoverse cores
@ 2022-03-29 14:32             ` Ali Saidi
  0 siblings, 0 replies; 66+ messages in thread
From: Ali Saidi @ 2022-03-29 14:32 UTC (permalink / raw)
  To: leo.yan
  Cc: Nick.Forrington, acme, alexander.shishkin, alisaidi,
	andrew.kilroy, benh, german.gomez, james.clark, john.garry,
	jolsa, kjain, lihuafei1, linux-arm-kernel, linux-kernel,
	linux-perf-users, mark.rutland, mathieu.poirier, mingo, namhyung,
	peterz, will


Hi Leo,

On Mon, 28 Mar 2022 21:05:47 +0800, Leo Yan wrote:
> Hi Ali,
> 
> On Mon, Mar 28, 2022 at 03:08:05AM +0000, Ali Saidi wrote:
> 
> [...]
> 
> > > > > > +	case ARM_SPE_NV_PEER_CORE:
> > > > > > +		data_src->mem_lvl = PERF_MEM_LVL_HIT;
> > > > > > +		data_src->mem_snoop = PERF_MEM_SNOOP_HITM;
> > > > > > +		data_src->mem_lvl_num = PERF_MEM_LVLNUM_ANY_CACHE;
> > > > > 
> > > > > Peer core contains its local L1 cache, so I think we can set the
> > > > > memory level L1 to indicate this case.
> > > > It could be either the L1 or the L2. All the neoverse cores have private L2
> > > > caches and we don't know. 
> > > 
> > > How about set both L1 and L2 cache together for this case?
> > > 
> > > Although 'L1 | L2' cannot tell the exact cache level, I think it's
> > > better than use ANY_CACHE, at least this can help us to distinguish
> > > from other data source types if we avoid to use ANY_CACHE for all of
> > > them.
> > 
> > This seems much more confusing then the ambiguity of where the line came from
> > and is only possible with the deprecated mem_lvl enconding.  It will make
> > perf_mem__lvl_scnprintf() print the wrong thing and anyone who is trying to
> > attribute a line to a single cache will find that the sum of the number of hits
> > is greater than the number of accesses which also seems terribly confusing.
> 
> Understand.  I considered the potential issue for
> perf_mem__lvl_scnprintf(), actually it supports printing multipl cache
> levels for 'mem_lvl', by conjuncting with 'or' it composes the multiple
> cache levels.  We might need to extend a bit for another field
> 'mem_lvlnum'.
> 
> Agreed that there would have inaccurate issue for statistics, it's fine
> for me to use ANY_CACHE in this patch set.

Thanks!

> 
> I still think we should consider to extend the memory levels to
> demonstrate clear momory hierarchy on Arm archs, I personally like the
> definitions for "PEER_CORE", "LCL_CLSTR", "PEER_CLSTR" and "SYS_CACHE",
> though these cache levels are not precise like L1/L2/L3 levels, they can
> help us to map very well for the cache topology on Arm archs and without
> any confusion.  We could take this as an enhancement if you don't want
> to bother the current patch set's upstreaming.

I'd like to do this in a separate patch, but I have one other proposal. The
Neoverse cores L2 is strictly inclusive of the L1, so even if it's in the L1,
it's also in the L2. Given that the Graviton systems and afaik the Ampere
systems don't have any cache between the L2 and the SLC, thus anything from
PEER_CORE, LCL_CLSTR, or PEER_CLSTR would hit in the L2, perhaps we
should just set L2 for these cases? German, are you good with this for now? 

> > > > > For this data source type and the types below, though they indicate
> > > > > that snooping happens, it doesn't mean the data in the cache line
> > > > > is in the 'modified' state.  If we set the PERF_MEM_SNOOP_HITM flag, I
> > > > > personally think this will mislead users when reporting the result.
> > > > 
> > > > I'm of the opposite opinion. If the data wasn't modified, it will likely be
> > > > found in the lower-level shared cache and the transaction wouldn't require a
> > > > cache-to-cache transfer of the modified data, so the most common case when we
> > > > source a line out of another core's cache will be if it was "modifiable" in
> > > > that cache.
> > > 
> > > Let's still use the MOSI protocol as an example.  I think there are two
> > > cases: one case is the peer cache line being in the 'Shared' state, and the
> > > other is the peer cache line being in the 'Owned' state.
> > > 
> > > Quotes the wiki page for these two cases:
> > > 
> > > "When the cache block is in the Shared (S) state and there is a
> > > snooped bus read (BusRd) transaction, then the block stays in the same
> > > state and generates no more transactions as all the cache blocks have
> > > the same value including the main memory and it is only being read,
> > > not written into."
> > > 
> > > "While in the Owner (O) state and there is a snooped read request
> > > (BusRd), the block remains in the same state while flushing (Flush)
> > > the data for the other processor to read from it."
> > > 
> > > Seems to me, it's reasonable to set the HITM flag when the snooping happens
> > > for a cache line that is in the Modified (M) state.
> > > 
> > > Again, my comment is based on a literal understanding, so please
> > > correct me if I have any misunderstanding here.
> > 
> > Per the CMN TRM, "The SLC allocation policy is exclusive for data lines, except
> > where sharing patterns are detected," so if a line is shared among caches it
> > will likely also be in the SLC (and thus we'd get an L3 hit). 
> > 
> > If there is one copy of the cache line and that cache line needs to transition
> > from one core to another core, I don't believe it matters if it was truly
> > modified or not, because the core already had permission to modify it and the
> > transaction is just as costly (ping-ponging between core caches). This is the
> > one thing I really want to be able to uniquely identify, as any cache line
> > doing this that isn't a lock is a place where optimization is likely possible.
> 
> I understand your point: if a large number of transitions within
> multiple cores hit the same cache line, it would very likely be caused
> by the cache line's Modified state, so we set the PERF_MEM_SNOOP_HITM
> flag for easier reviewing.

And, from a dataflow perspective, whether the line was owned (and could be
modified) vs. actually modified is really the less interesting bit.

> Alternatively, I think it's good to pick up the patch series "perf c2c:
> Sort cacheline with all loads" [1]; rather than relying on the HITM tag, the
> patch series adds a new option "-d all" for perf c2c, so it displays
> the suspicious false-sharing cache lines based on load/store ops and
> thread info.  The main reason for holding off on that patch set was that
> we could not verify it with Arm SPE at that point in time, as Arm SPE
> trace data then lacked both store ops and data source packets.

Looking at examples from my system, at least, data source isn't set for
stores, only for loads.

> I prefer to set the PERF_MEM_SNOOP_HIT flag in this patch set, and we can
> upstream the patch series "perf c2c: Sort cacheline with all loads"
> (it only needs upstreaming of patches 01, 02, 03, 10 and 11; the rest
> have been merged in the mainline kernel).
> 
> If this is fine for you, I can respin the patch series for "perf c2c".
> Or any other thoughts?

I think this is a nice option to have in the tool-box, but from my point of
view, I'd like someone who is familiar with c2c output on x86 to come to an
arm64 system and be able to zero in on a ping-ponging line like they would
otherwise. Highlighting a line that is moving between cores frequently, which
is likely in the exclusive state, by tagging it as HITM accomplishes this and
will make it easier to find these cases.  Your approach also has inaccuracies
and wouldn't be able to differentiate between core X accessing a line a lot
followed by core Y accessing the line a lot vs. the cores ping-ponging.  Yes,
I agree that we will "overcount" HITM, but I don't think this is particularly
bad: it specifically highlights the core-to-core transfers that are likely a
performance issue, and it will make it easier to identify areas of false or
true sharing and improve performance.

Thanks,
Ali

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v4 2/4] perf arm-spe: Use SPE data source for neoverse cores
  2022-03-29 14:32             ` Ali Saidi
@ 2022-03-31 12:19               ` Leo Yan
  -1 siblings, 0 replies; 66+ messages in thread
From: Leo Yan @ 2022-03-31 12:19 UTC (permalink / raw)
  To: Ali Saidi
  Cc: Nick.Forrington, acme, alexander.shishkin, andrew.kilroy, benh,
	german.gomez, james.clark, john.garry, jolsa, kjain, lihuafei1,
	linux-arm-kernel, linux-kernel, linux-perf-users, mark.rutland,
	mathieu.poirier, mingo, namhyung, peterz, will

Hi Ali,

On Tue, Mar 29, 2022 at 02:32:14PM +0000, Ali Saidi wrote:

[...]

> > I still think we should consider extending the memory levels to
> > demonstrate a clear memory hierarchy on Arm archs. I personally like the
> > definitions of "PEER_CORE", "LCL_CLSTR", "PEER_CLSTR" and "SYS_CACHE";
> > though these cache levels are not as precise as the L1/L2/L3 levels, they
> > map very well to the cache topology on Arm archs and avoid
> > any confusion.  We could take this as an enhancement if you don't want
> > to hold up the current patch set's upstreaming.
> 
> I'd like to do this in a separate patch, but I have one other proposal. The
> Neoverse cores' L2 is strictly inclusive of the L1, so even if a line is in the
> L1, it's also in the L2. Given that the Graviton systems and, afaik, the Ampere
> systems don't have any cache between the L2 and the SLC, anything from
> PEER_CORE, LCL_CLSTR, or PEER_CLSTR would hit in the L2, so perhaps we
> should just set L2 for these cases? German, are you good with this for now?

If we use a single cache level (no matter whether it's L2 or ANY_CACHE) for
these data sources, it's hard for users to understand the cost of the
memory operations.  So my suggestion for these new cache levels is not
only about the cache level; it's more about conveying the memory
operation's cost.

[...]

> > Alternatively, I think it's good to pick up the patch series "perf c2c:
> > Sort cacheline with all loads" [1]; rather than relying on the HITM tag, the
> > patch series adds a new option "-d all" for perf c2c, so it displays
> > the suspicious false-sharing cache lines based on load/store ops and
> > thread info.  The main reason for holding off on that patch set was that
> > we could not verify it with Arm SPE at that point in time, as Arm SPE
> > trace data then lacked both store ops and data source packets.
> 
> Looking at examples from my system, at least, data source isn't set for
> stores, only for loads.

Ouch ...  If the data source is not set for store operations, then all store
samples will lack cache level info.  Or should we set ANY_CACHE as the
cache level for store operations?
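A minimal sketch of the fallback in question is below. The helper name and the simplified flow are hypothetical (the real synthesis code is in tools/perf/util/arm-spe.c); the level values mirror the PERF_MEM_LVLNUM_* constants from the perf ABI:

```c
#include <assert.h>
#include <stdbool.h>

/* Level numbers mirroring the perf ABI: 0x0b is ANY_CACHE, 0x0f is NA. */
enum { LVLNUM_L1 = 0x01, LVLNUM_ANY_CACHE = 0x0b, LVLNUM_NA = 0x0f };

/*
 * Sketch: a load takes its level from the decoded data source, while a
 * store record, which carries no data source on this hardware, falls
 * back to ANY_CACHE rather than being left without level info.
 * 'decoded_lvl' stands in for the level derived from the data source
 * packet, 0 when the packet is absent.
 */
static int sample_lvlnum(bool is_store, int decoded_lvl)
{
	if (is_store)
		return LVLNUM_ANY_CACHE;
	return decoded_lvl ? decoded_lvl : LVLNUM_NA;
}
```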

> > I prefer to set the PERF_MEM_SNOOP_HIT flag in this patch set, and we can
> > upstream the patch series "perf c2c: Sort cacheline with all loads"
> > (it only needs upstreaming of patches 01, 02, 03, 10 and 11; the rest
> > have been merged in the mainline kernel).
> > 
> > If this is fine for you, I can respin the patch series for "perf c2c".
> > Or any other thoughts?
> 
> I think this is a nice option to have in the tool-box, but from my point of
> view, I'd like someone who is familiar with c2c output on x86 to come to an
> arm64 system and be able to zero in on a ping-ponging line like they would
> otherwise. Highlighting a line that is moving between cores frequently, which
> is likely in the exclusive state, by tagging it as HITM accomplishes this and
> will make it easier to find these cases.  Your approach also has inaccuracies
> and wouldn't be able to differentiate between core X accessing a line a lot
> followed by core Y accessing the line a lot vs. the cores ping-ponging.  Yes,
> I agree that we will "overcount" HITM, but I don't think this is particularly
> bad: it specifically highlights the core-to-core transfers that are likely a
> performance issue, and it will make it easier to identify areas of false or
> true sharing and improve performance.

I don't want to block this patch set on this part, but neither do I
want to introduce any confusion for later users, especially since users
who pick up this tool later can hardly be aware of the assumptions made
in this discussion thread.  So two options would be fine for me:

Option 1: if you and your Arm colleagues can confirm that the inaccuracy
caused by setting HITM is low (e.g. 2%-3% inaccuracy introduced by
directly setting HITM), I think this could be acceptable.  Otherwise,
please consider option 2.

Option 2: by default we set the PERF_MEM_SNOOP_HIT flag, since right now
we actually have no info to support HITM.  Then use a new patch to add an
extra option (say '--coarse-hitm') for the 'perf c2c' tool; a user can
explicitly specify this option for the 'perf c2c' command, and specifying
it means that the user understands and accepts the inaccuracy of forcing
the use of the PERF_MEM_SNOOP_HITM flag.  I think you could refer to the
option '--stitch-lbr' when adding an option to the 'perf c2c' tool.
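The decision in option 2 could look roughly like this. The '--coarse-hitm' name and the helper are hypothetical; the two snoop bit values mirror PERF_MEM_SNOOP_HIT/HITM from include/uapi/linux/perf_event.h:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Snoop bits as in the perf ABI. */
#define SNOOP_HIT  0x04ULL	/* PERF_MEM_SNOOP_HIT */
#define SNOOP_HITM 0x10ULL	/* PERF_MEM_SNOOP_HITM */

/*
 * Sketch of option 2: a peer-core data source is reported as a plain
 * snoop hit by default, and only as HITM when the user opted in to the
 * coarser classification via a hypothetical '--coarse-hitm' flag.
 */
static uint64_t spe_snoop_flag(bool peer_core_source, bool coarse_hitm)
{
	if (!peer_core_source)
		return 0;
	return coarse_hitm ? SNOOP_HITM : SNOOP_HIT;
}
```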

Thanks,
Leo

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v4 2/4] perf arm-spe: Use SPE data source for neoverse cores
  2022-03-29 14:32             ` Ali Saidi
@ 2022-03-31 12:28               ` German Gomez
  -1 siblings, 0 replies; 66+ messages in thread
From: German Gomez @ 2022-03-31 12:28 UTC (permalink / raw)
  To: Ali Saidi, leo.yan
  Cc: Nick.Forrington, acme, alexander.shishkin, andrew.kilroy, benh,
	james.clark, john.garry, jolsa, kjain, lihuafei1,
	linux-arm-kernel, linux-kernel, linux-perf-users, mark.rutland,
	mathieu.poirier, mingo, namhyung, peterz, will

Hi all,

It seems I gave the Review tags a bit too early this time. Apologies for
the inconvenience. Indeed there were more interesting discussions to be
had :)

(Probably best to remove my tags for the next re-spin)

On 29/03/2022 15:32, Ali Saidi wrote:
> [...]
>
>> I still think we should consider extending the memory levels to
>> demonstrate a clear memory hierarchy on Arm archs. I personally like the
>> definitions of "PEER_CORE", "LCL_CLSTR", "PEER_CLSTR" and "SYS_CACHE";
>> though these cache levels are not as precise as the L1/L2/L3 levels, they
>> map very well to the cache topology on Arm archs and avoid
>> any confusion.  We could take this as an enhancement if you don't want
>> to hold up the current patch set's upstreaming.
> I'd like to do this in a separate patch, but I have one other proposal. The
> Neoverse cores' L2 is strictly inclusive of the L1, so even if a line is in the
> L1, it's also in the L2. Given that the Graviton systems and, afaik, the Ampere
> systems don't have any cache between the L2 and the SLC, anything from
> PEER_CORE, LCL_CLSTR, or PEER_CLSTR would hit in the L2, so perhaps we
> should just set L2 for these cases? German, are you good with this for now?

Sorry for the delay. I'd like to also check this with someone. I'll try
to get back asap. In the meantime, if this approach is also OK with Leo,
I think it would be fine by me.

Thanks,
German

>>>>>> For this data source type and the types below, though they indicate
>>>>>> that snooping happens, it doesn't mean the data in the cache line
>>>>>> is in the 'modified' state.  If we set the PERF_MEM_SNOOP_HITM flag, I
>>>>>> personally think this will mislead users when reporting the result.
>>>>> I'm of the opposite opinion. If the data wasn't modified, it will likely be
>>>>> found in the lower-level shared cache and the transaction wouldn't require a
>>>>> cache-to-cache transfer of the modified data, so the most common case when we
>>>>> source a line out of another core's cache will be if it was "modifiable" in
>>>>> that cache.
>>>> Let's still use the MOSI protocol as an example.  I think there are two
>>>> cases: one case is the peer cache line being in the 'Shared' state, and the
>>>> other is the peer cache line being in the 'Owned' state.
>>>>
>>>> Quotes the wiki page for these two cases:
>>>>
>>>> "When the cache block is in the Shared (S) state and there is a
>>>> snooped bus read (BusRd) transaction, then the block stays in the same
>>>> state and generates no more transactions as all the cache blocks have
>>>> the same value including the main memory and it is only being read,
>>>> not written into."
>>>>
>>>> "While in the Owner (O) state and there is a snooped read request
>>>> (BusRd), the block remains in the same state while flushing (Flush)
>>>> the data for the other processor to read from it."
>>>>
>>>> Seems to me, it's reasonable to set the HITM flag when the snooping happens
>>>> for a cache line that is in the Modified (M) state.
>>>>
>>>> Again, my comment is based on a literal understanding, so please
>>>> correct me if I have any misunderstanding here.
>>> Per the CMN TRM, "The SLC allocation policy is exclusive for data lines, except
>>> where sharing patterns are detected," so if a line is shared among caches it
>>> will likely also be in the SLC (and thus we'd get an L3 hit). 
>>>
>>> If there is one copy of the cache line and that cache line needs to transition
>>> from one core to another core, I don't believe it matters if it was truly
>>> modified or not, because the core already had permission to modify it and the
>>> transaction is just as costly (ping-ponging between core caches). This is the
>>> one thing I really want to be able to uniquely identify, as any cache line
>>> doing this that isn't a lock is a place where optimization is likely possible.
>> I understand your point: if a large number of transitions within
>> multiple cores hit the same cache line, it would very likely be caused
>> by the cache line's Modified state, so we set the PERF_MEM_SNOOP_HITM
>> flag for easier reviewing.
> And, from a dataflow perspective, whether the line was owned (and could be
> modified) vs. actually modified is really the less interesting bit.
>
>> Alternatively, I think it's good to pick up the patch series "perf c2c:
>> Sort cacheline with all loads" [1]; rather than relying on the HITM tag, the
>> patch series adds a new option "-d all" for perf c2c, so it displays
>> the suspicious false-sharing cache lines based on load/store ops and
>> thread info.  The main reason for holding off on that patch set was that
>> we could not verify it with Arm SPE at that point in time, as Arm SPE
>> trace data then lacked both store ops and data source packets.
> Looking at examples from my system, at least, data source isn't set for
> stores, only for loads.
>
>> I prefer to set the PERF_MEM_SNOOP_HIT flag in this patch set, and we can
>> upstream the patch series "perf c2c: Sort cacheline with all loads"
>> (it only needs upstreaming of patches 01, 02, 03, 10 and 11; the rest
>> have been merged in the mainline kernel).
>>
>> If this is fine for you, I can respin the patch series for "perf c2c".
>> Or any other thoughts?
> I think this is a nice option to have in the tool-box, but from my point of
> view, I'd like someone who is familiar with c2c output on x86 to come to an
> arm64 system and be able to zero in on a ping-ponging line like they would
> otherwise. Highlighting a line that is moving between cores frequently, which
> is likely in the exclusive state, by tagging it as HITM accomplishes this and
> will make it easier to find these cases.  Your approach also has inaccuracies
> and wouldn't be able to differentiate between core X accessing a line a lot
> followed by core Y accessing the line a lot vs. the cores ping-ponging.  Yes,
> I agree that we will "overcount" HITM, but I don't think this is particularly
> bad: it specifically highlights the core-to-core transfers that are likely a
> performance issue, and it will make it easier to identify areas of false or
> true sharing and improve performance.
>
> Thanks,
> Ali

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v4 2/4] perf arm-spe: Use SPE data source for neoverse cores
  2022-03-31 12:28               ` German Gomez
@ 2022-03-31 12:44                 ` Leo Yan
  -1 siblings, 0 replies; 66+ messages in thread
From: Leo Yan @ 2022-03-31 12:44 UTC (permalink / raw)
  To: German Gomez
  Cc: Ali Saidi, Nick.Forrington, acme, alexander.shishkin,
	andrew.kilroy, benh, james.clark, john.garry, jolsa, kjain,
	lihuafei1, linux-arm-kernel, linux-kernel, linux-perf-users,
	mark.rutland, mathieu.poirier, mingo, namhyung, peterz, will

On Thu, Mar 31, 2022 at 01:28:58PM +0100, German Gomez wrote:
> Hi all,
> 
> It seems I gave the Review tags a bit too early this time. Apologies for
> the inconvenience. Indeed there was more interesting discussions to be
> had :)
> 
> (Probably best to remove by tags for the next re-spin)

No worries, German.  Your review and testing are very helpful :)

> On 29/03/2022 15:32, Ali Saidi wrote:
> > [...]
> >
> >> I still think we should consider extending the memory levels to
> >> demonstrate a clear memory hierarchy on Arm archs. I personally like the
> >> definitions of "PEER_CORE", "LCL_CLSTR", "PEER_CLSTR" and "SYS_CACHE";
> >> though these cache levels are not as precise as the L1/L2/L3 levels, they
> >> map very well to the cache topology on Arm archs and avoid
> >> any confusion.  We could take this as an enhancement if you don't want
> >> to hold up the current patch set's upstreaming.
> > I'd like to do this in a separate patch, but I have one other proposal. The
> > Neoverse cores' L2 is strictly inclusive of the L1, so even if a line is in the
> > L1, it's also in the L2. Given that the Graviton systems and, afaik, the Ampere
> > systems don't have any cache between the L2 and the SLC, anything from
> > PEER_CORE, LCL_CLSTR, or PEER_CLSTR would hit in the L2, so perhaps we
> > should just set L2 for these cases? German, are you good with this for now?
> 
> Sorry for the delay. I'd like to also check this with someone. I'll try
> to get back asap. In the meantime, if this approach is also OK with Leo,
> I think it would be fine by me.

Thanks for checking internally.  Let me bring up another thought (sorry
that my suggestion keeps floating): another choice is to set ANY_CACHE
as the cache level when we are not certain of it, and extend the snoop
field to indicate the snooping logic, like:

  PERF_MEM_SNOOP_PEER_CORE
  PERF_MEM_SNOOP_LCL_CLSTR
  PERF_MEM_SNOOP_PEER_CLSTR

Seems to me, doing this is not only about the cache level; it's more
important for users to know the varying cost of involving different
snooping logic.
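One way to read this proposal is that the snoop field, not the level, carries the cost information. The bit values below are hypothetical (the existing PERF_MEM_SNOOP_* bits end at HITM, 0x10; real assignments would be chosen when extending the uapi header), and the cost ranking is only a sketch of the idea:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical new snoop bits, continuing after PERF_MEM_SNOOP_HITM (0x10). */
#define SNOOP_PEER_CORE  0x20ULL
#define SNOOP_LCL_CLSTR  0x40ULL
#define SNOOP_PEER_CLSTR 0x80ULL

/*
 * With the level left as ANY_CACHE, a tool could still rank the cost
 * of an access by which snooping logic supplied the line: a peer core
 * is closer than the local cluster, which is closer than a peer
 * cluster.
 */
static int snoop_cost_rank(uint64_t snoop)
{
	if (snoop & SNOOP_PEER_CORE)
		return 1;	/* cheapest cache-to-cache transfer */
	if (snoop & SNOOP_LCL_CLSTR)
		return 2;
	if (snoop & SNOOP_PEER_CLSTR)
		return 3;	/* most expensive */
	return 0;		/* no peer snoop involved */
}
```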

Thanks,
Leo

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v4 2/4] perf arm-spe: Use SPE data source for neoverse cores
  2022-03-31 12:44                 ` Leo Yan
@ 2022-04-03 20:33                   ` Ali Saidi
  -1 siblings, 0 replies; 66+ messages in thread
From: Ali Saidi @ 2022-04-03 20:33 UTC (permalink / raw)
  To: leo.yan
  Cc: Nick.Forrington, acme, alexander.shishkin, alisaidi,
	andrew.kilroy, benh, german.gomez, james.clark, john.garry,
	jolsa, kjain, lihuafei1, linux-arm-kernel, linux-kernel,
	linux-perf-users, mark.rutland, mathieu.poirier, mingo, namhyung,
	peterz, will



On Thu, 31 Mar 2022 12:44:3 +0000, Leo Yan wrote:
> 
> On Thu, Mar 31, 2022 at 01:28:58PM +0100, German Gomez wrote:
> > Hi all,
> > 
> > It seems I gave the Review tags a bit too early this time. Apologies for
> > the inconvenience. Indeed there was more interesting discussions to be
> > had :)
> > 
> > (Probably best to remove by tags for the next re-spin)
> 
> Now worries, German.  Your review and testing are very helpful :)
> 
> > On 29/03/2022 15:32, Ali Saidi wrote:
> > > [...]
> > >
> > >> I still think we should consider to extend the memory levels to
> > >> demonstrate clear momory hierarchy on Arm archs, I personally like the
> > >> definitions for "PEER_CORE", "LCL_CLSTR", "PEER_CLSTR" and "SYS_CACHE",
> > >> though these cache levels are not precise like L1/L2/L3 levels, they can
> > >> help us to map very well for the cache topology on Arm archs and without
> > >> any confusion.  We could take this as an enhancement if you don't want
> > >> to bother the current patch set's upstreaming.
> > > I'd like to do this in a separate patch, but I have one other proposal. The
> > > Neoverse cores L2 is strictly inclusive of the L1, so even if it's in the L1,
> > > it's also in the L2. Given that the Graviton systems and afaik the Ampere
> > > systems don't have any cache between the L2 and the SLC, thus anything from
> > > PEER_CORE, LCL_CLSTR, or PEER_CLSTR would hit in the L2, perhaps we
> > > should just set L2 for these cases? German, are you good with this for now? 
> > 
> > Sorry for the delay. I'd like to also check this with someone. I'll try
> > to get back asap. In the meantime, if this approach is also OK with Leo,
> > I think it would be fine by me.
> 
> Thanks for the checking internally.  Let me just bring up my another
> thinking (sorry that my suggestion is float): another choice is we set
> ANY_CACHE as cache level if we are not certain the cache level, and
> extend snoop field to indicate the snooping logics, like:
> 
>   PERF_MEM_SNOOP_PEER_CORE
>   PERF_MEM_SNOOP_LCL_CLSTR
>   PERF_MEM_SNOOP_PEER_CLSTR
> 
> Seems to me, we doing this is not only for cache level, it's more
> important for users to know the variant cost for involving different
> snooping logics.

I think we've come full circle :). Going back to what we want to indicate to
a user about the source of the cache line, I believe there are three things,
with an eye toward helping a user of the data improve the performance of their
application:

1. The level below the originating core in the hierarchy where the line hit
(L1, L2, LLC, local DRAM). Depending on the level, this directly indicates the
expense of the operation.

2. If it came from a peer of theirs on the same socket. I'm still of the
opinion that exactly which peer doesn't matter much; it's a 2nd- or 3rd-order
concern compared to the fact that the line couldn't be sourced from a cache
level below the originating core, had to come from a local peer, and the
request went to those lower levels and was eventually sourced from a peer.
Why it was sourced from the peer is still almost irrelevant to me. If it was
truly modified, or the core it was sourced from merely had permission to
modify it, the snoop filter doesn't necessarily need to know the difference
and the outcome is the same.

3. For multi-socket systems, that it came from a different socket, where it is
probably most interesting whether it came from DRAM on the remote socket or
from a cache.

I'm putting 3 aside for now since we've really been focusing on 1 and 2 in this
discussion and I think the biggest hangup has been the definition of HIT vs
HITM. If someone has a precise definition, that would be great, but AFAIK it
goes back to the P6 bus where HIT was asserted by another core if it had a line
(in any state) and HITM was additionally asserted if a core needed to inhibit
another device (e.g. DDR controller) from providing that line to the requestor. 

The latter logic is why I think it's perfectly acceptable to use HITM to
indicate a peer cache-to-cache transfer. However, since others don't feel that
way, let me propose a single additional snooping type, PERF_MEM_SNOOP_PEER,
that indicates some peer of the hierarchy below the originating core sourced
the data.  This clears up the definition that the line came from a peer and
may or may not have been modified, but it doesn't add a lot of
implementation-dependent functionality to the SNOOP API.

We could use the mem-level to indicate the level of the cache hierarchy we had
to get to before the snoop traveled upward, which seems like what x86 is doing
here.

PEER_CORE -> MEM_SNOOP_PEER + L2
PEER_CLSTR -> MEM_SNOOP_PEER + L3
PEER_LCL_CLSTR -> MEM_SNOOP_PEER + L3 (since newer Neoverse cores don't
support clusters and the existing commercial implementations don't have them).

Thanks,
Ali







^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v4 2/4] perf arm-spe: Use SPE data source for neoverse cores
  2022-04-03 20:33                   ` Ali Saidi
@ 2022-04-04 15:12                     ` Leo Yan
  -1 siblings, 0 replies; 66+ messages in thread
From: Leo Yan @ 2022-04-04 15:12 UTC (permalink / raw)
  To: Ali Saidi
  Cc: Nick.Forrington, acme, alexander.shishkin, andrew.kilroy, benh,
	german.gomez, james.clark, john.garry, jolsa, kjain, lihuafei1,
	linux-arm-kernel, linux-kernel, linux-perf-users, mark.rutland,
	mathieu.poirier, mingo, namhyung, peterz, will

On Sun, Apr 03, 2022 at 08:33:37PM +0000, Ali Saidi wrote:

[...]

> > Let me just bring up my another
> > thinking (sorry that my suggestion is float): another choice is we set
> > ANY_CACHE as cache level if we are not certain the cache level, and
> > extend snoop field to indicate the snooping logics, like:
> > 
> >   PERF_MEM_SNOOP_PEER_CORE
> >   PERF_MEM_SNOOP_LCL_CLSTR
> >   PERF_MEM_SNOOP_PEER_CLSTR
> > 
> > Seems to me, we doing this is not only for cache level, it's more
> > important for users to know the variant cost for involving different
> > snooping logics.
> 
> I think we've come full circle :).

Not too bad, and I learned a lot :)

> Going back to what do we want to indicate to
> a user about the source of the cache line, I believe there are three things with
> an eye toward helping a user of the data improve the performance of their
> application:

Thanks a lot for summary!

> 1. The level below them in the hierarchy it it (L1, L2, LLC, local DRAM).
> Depending on the level this directly indicates the expense of the operation. 
> 
> 2. If it came from a peer of theirs on the same socket. I'm really of the
> opinion still that exactly which peer, doesn't matter much as it's a 2nd or 3rd
> order concern compared to, it it couldn't be sourced from a cache level below
> the originating core, had to come from a local peer and the request went to
> that lower levels and was eventually sourced from a peer.  Why it was sourced
> from the peer is still almost irrelevant to me. If it was truly modified or the
> core it was sourced from only had permission to modify it the snoop filter
> doesn't necessarily need to know the difference and the outcome is the same. 

I think the key information delivered here is:

For peer snooping, you think there is a big cost difference between
L2 cache snooping and L3 cache snooping; for L3 cache snooping, we
don't care whether it's an internal-cluster or external-cluster snoop,
and we don't have enough information to infer the snoop type (HIT vs
HITM).

> 3. For multi-socket systems that it came from a different socket and there it is
> probably most interesting if it came from DRAM on the remote socket or a cache.
>
> I'm putting 3 aside for now since we've really been focusing on 1 and 2 in this
> discussion and I think the biggest hangup has been the definition of HIT vs
> HITM.

Agreed on item 3.

> If someone has a precise definition, that would be great, but AFAIK it
> goes back to the P6 bus where HIT was asserted by another core if it had a line
> (in any state) and HITM was additionally asserted if a core needed to inhibit
> another device (e.g. DDR controller) from providing that line to the requestor. 

Thanks for sharing the info on how the bus implements HIT/HITM.

> The latter logic is why I think it's perfectly acceptable to use HITM to
> indicate a peer cache-to-cache transfer, however since others don't feel that way
> let me propose a single additional snooping type PERF_MEM_SNOOP_PEER that
> indicates some peer of the hierarchy below the originating core sourced the
> data.  This clears up the definition that line came from from a peer and may or
> may not have been modified, but it doesn't add a lot of implementation dependant
> functionality into the SNOOP API. 
> 
> We could use the mem-level to indicate the level of the cache hierarchy we had
> to get to before the snoop traveled upward, which seems like what x86 is doing
> here.

It makes sense to me to use the highest cache level as the mem-level.
Please add comments in the code for this; they would be useful for
understanding the code.

> PEER_CORE -> MEM_SNOOP_PEER + L2
> PEER_CLSTR -> MEM_SNOOP_PEER + L3
> PEER_LCL_CLSTR -> MEM_SNOOP_PEER + L3 (since newer neoverse cores don't support
> the clusters and the existing commercial implementations don't have them).

Generally, this idea is fine for me.

Following your suggestion, if we connect this to the PoC and PoU concepts
in the Arm reference manual, we could extend the snoop modes with
MEM_SNOOP_POU (for PoU) and MEM_SNOOP_POC (for PoC), so:

PEER_CORE -> MEM_SNOOP_POU + L2
PEER_LCL_CLSTR -> MEM_SNOOP_POU + L3
PEER_CLSTR -> MEM_SNOOP_POC + L3

Seems to me we could consider this.  If it is overly complex, or I have
stated any of these concepts incorrectly, please use your method.

Thanks,
Leo

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v4 2/4] perf arm-spe: Use SPE data source for neoverse cores
  2022-04-04 15:12                     ` Leo Yan
@ 2022-04-06 21:00                       ` Ali Saidi
  -1 siblings, 0 replies; 66+ messages in thread
From: Ali Saidi @ 2022-04-06 21:00 UTC (permalink / raw)
  To: leo.yan
  Cc: Nick.Forrington, acme, alexander.shishkin, alisaidi,
	andrew.kilroy, benh, german.gomez, james.clark, john.garry,
	jolsa, kjain, lihuafei1, linux-arm-kernel, linux-kernel,
	linux-perf-users, mark.rutland, mathieu.poirier, mingo, namhyung,
	peterz, will

On Mon, 4 Apr 2022 15:12:18  +0000, Leo Yan wrote:
> On Sun, Apr 03, 2022 at 08:33:37PM +0000, Ali Saidi wrote:

[...]

> > The latter logic is why I think it's perfectly acceptable to use HITM to
> > indicate a peer cache-to-cache transfer, however since others don't feel that way
> > let me propose a single additional snooping type PERF_MEM_SNOOP_PEER that
> > indicates some peer of the hierarchy below the originating core sourced the
> > data.  This clears up the definition that line came from from a peer and may or
> > may not have been modified, but it doesn't add a lot of implementation dependant
> > functionality into the SNOOP API. 
> > 
> > We could use the mem-level to indicate the level of the cache hierarchy we had
> > to get to before the snoop traveled upward, which seems like what x86 is doing
> > here.
> 
> It makes sense to me that to use the highest cache level as mem-level.
> Please add comments in the code for this, this would be useful for
> understanding the code.

Ok.

> > PEER_CORE -> MEM_SNOOP_PEER + L2
> > PEER_CLSTR -> MEM_SNOOP_PEER + L3
> > PEER_LCL_CLSTR -> MEM_SNOOP_PEER + L3 (since newer neoverse cores don't support
> > the clusters and the existing commercial implementations don't have them).
> 
> Generally, this idea is fine for me.

Great.  

Now the next tricky thing: since we're not using HITM for recording the memory
events, the question for the c2c output becomes whether we should report the
SNOOP_PEER events as if they were HITM events, with a clarification in the
perf-c2c man page, or effectively duplicate all the lcl_hitm logic, which is a
fair amount, in perf c2c to add a column and sort option?

> Following your suggestion, if we connect the concepts PoC and PoU in Arm
> reference manual, we can extend the snooping mode with MEM_SNOOP_POU
> (for PoU) and MEM_SNOOP_POC (for PoC), so:
> 
> PEER_CORE -> MEM_SNOOP_POU + L2
> PEER_LCL_CLSTR -> MEM_SNOOP_POU + L3
> PEER_CLSTR -> MEM_SNOOP_POC + L3
> 
> Seems to me, we could consider for this.  If this is over complexity or
> even I said any wrong concepts for this, please use your method.

I think this adds a lot of complexity and reduces clarity. Some systems
implement coherent icaches, and the PoU would be the L1 cache; others don't,
so it would be the L2 (or wherever there is a unified cache). Similarly, with
the point of coherency, some systems would consider that DRAM, but other
systems have transparent LLCs and it would be the LLC.

Thanks,
Ali


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v4 2/4] perf arm-spe: Use SPE data source for neoverse cores
  2022-03-31 12:44                 ` Leo Yan
@ 2022-04-07 15:24                   ` German Gomez
  -1 siblings, 0 replies; 66+ messages in thread
From: German Gomez @ 2022-04-07 15:24 UTC (permalink / raw)
  To: Ali Saidi, Leo Yan
  Cc: Nick.Forrington, acme, alexander.shishkin, andrew.kilroy, benh,
	james.clark, john.garry, jolsa, kjain, lihuafei1,
	linux-arm-kernel, linux-kernel, linux-perf-users, mark.rutland,
	mathieu.poirier, mingo, namhyung, peterz, will

Hi,

On 31/03/2022 13:44, Leo Yan wrote:
> [...]
>>> I'd like to do this in a separate patch, but I have one other proposal. The
>>> Neoverse cores L2 is strictly inclusive of the L1, so even if it's in the L1,
>>> it's also in the L2. Given that the Graviton systems and afaik the Ampere
>>> systems don't have any cache between the L2 and the SLC, thus anything from
>>> PEER_CORE, LCL_CLSTR, or PEER_CLSTR would hit in the L2, perhaps we
>>> should just set L2 for these cases? German, are you good with this for now? 
>> Sorry for the delay. I'd like to also check this with someone. I'll try
>> to get back asap. In the meantime, if this approach is also OK with Leo,
>> I think it would be fine by me.

Sorry for the delay. Yeah, setting it to L2 indeed looks reasonable for
now. Somebody brought up the case of running SPE in a heterogeneous
system, but we also think that might be beyond the scope of this change.

One very minor nit though. Would you be ok with renaming LCL to LOCAL
and CLSTR to CLUSTER? I sometimes misread the former as "LLC".

> Thanks for the checking internally.  Let me just bring up my another
> thinking (sorry that my suggestion is float): another choice is we set
> ANY_CACHE as cache level if we are not certain the cache level, and
> extend snoop field to indicate the snooping logics, like:
>
>   PERF_MEM_SNOOP_PEER_CORE
>   PERF_MEM_SNOOP_LCL_CLSTR
>   PERF_MEM_SNOOP_PEER_CLSTR
>
> Seems to me, we doing this is not only for cache level, it's more
> important for users to know the variant cost for involving different
> snooping logics.
>
> Thanks,
> Leo

I see there have been some more messages I need to catch up with. Is the
intention to extend the PERF_MEM_* flags in this cset, or will it be
left for a later change?

In any case, I'd be keen to take another look at it and try to bring
some more eyes into this.

Thanks,
German

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v4 2/4] perf arm-spe: Use SPE data source for neoverse cores
  2022-04-06 21:00                       ` Ali Saidi
@ 2022-04-08  1:06                         ` Leo Yan
  -1 siblings, 0 replies; 66+ messages in thread
From: Leo Yan @ 2022-04-08  1:06 UTC (permalink / raw)
  To: Ali Saidi
  Cc: Nick.Forrington, acme, alexander.shishkin, andrew.kilroy, benh,
	german.gomez, james.clark, john.garry, jolsa, kjain, lihuafei1,
	linux-arm-kernel, linux-kernel, linux-perf-users, mark.rutland,
	mathieu.poirier, mingo, namhyung, peterz, will

On Wed, Apr 06, 2022 at 09:00:17PM +0000, Ali Saidi wrote:
> On Mon, 4 Apr 2022 15:12:18  +0000, Leo Yan wrote:
> > On Sun, Apr 03, 2022 at 08:33:37PM +0000, Ali Saidi wrote:

[...]

> > > PEER_CORE -> MEM_SNOOP_PEER + L2
> > > PEER_CLSTR -> MEM_SNOOP_PEER + L3
> > > PEER_LCL_CLSTR -> MEM_SNOOP_PEER + L3 (since newer neoverse cores don't support
> > > the clusters and the existing commercial implementations don't have them).
> > 
> > Generally, this idea is fine for me.
> 
> Great.  
> 
> Now the next tricky thing. Since we're not using HITM for recording the memory
> events, the question for the c2c output becomes: should we output the SNOOP_PEER
> events as if they were HITM events, with a clarification in the perf-c2c man
> page, or effectively duplicate all the lcl_hitm logic, which is a fair amount,
> in perf c2c to add a column and sort option?

I think we need to handle both load and store operations in the
'perf c2c' tool.

For load operations, in the 'cache line details' view, we need to
support a 'snoop_peer' column; and since Arm SPE doesn't give any data
source info for store operations, my plan is to add an extra column,
'Other', alongside the two existing columns 'L1 Hit' and 'L1 Miss'.

Could you leave this part for me?  I will respin my patch set to
extend 'perf c2c' for this (and hopefully it can also support old Arm
SPE trace data).

Please note, when you spin the new patch set, you need to take care of
store operations.  The current patch set will wrongly set L1 hit for
all store operations, because the data source field is always zero.
My understanding is that for all store operations we need to set the
cache level to PERF_MEM_LVLNUM_ANY_CACHE and the snoop type to
PERF_MEM_SNOOP_NA.

> > Following your suggestion, if we connect the concepts of PoC and PoU in the Arm
> > reference manual, we can extend the snooping mode with MEM_SNOOP_POU
> > (for PoU) and MEM_SNOOP_POC (for PoC), so:
> > 
> > PEER_CORE -> MEM_SNOOP_POU + L2
> > PEER_LCL_CLSTR -> MEM_SNOOP_POU + L3
> > PEER_CLSTR -> MEM_SNOOP_POC + L3
> > 
> > Seems to me, we could consider this.  If this is overly complex, or
> > if I have stated any wrong concepts here, please use your method.
> 
> I think this adds a lot of complexity and reduces clarity. Some systems
> implement coherent icaches, where the PoU would be the L1 cache; others don't,
> so there it would be the L2 (or wherever there is a unified cache). Similarly,
> with the point of coherency, some systems would consider that DRAM, but other
> systems have transparent LLCs and there it would be the LLC.

Okay, it's fine for me to move forward to use MEM_SNOOP_PEER as the
solution.

Since German is looking into this part, @German, if you have any comment
on this part, just let us know.

Thanks,
Leo

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v4 2/4] perf arm-spe: Use SPE data source for neoverse cores
  2022-04-07 15:24                   ` German Gomez
@ 2022-04-08  1:18                     ` Leo Yan
  -1 siblings, 0 replies; 66+ messages in thread
From: Leo Yan @ 2022-04-08  1:18 UTC (permalink / raw)
  To: German Gomez
  Cc: Ali Saidi, Nick.Forrington, acme, alexander.shishkin,
	andrew.kilroy, benh, james.clark, john.garry, jolsa, kjain,
	lihuafei1, linux-arm-kernel, linux-kernel, linux-perf-users,
	mark.rutland, mathieu.poirier, mingo, namhyung, peterz, will

On Thu, Apr 07, 2022 at 04:24:35PM +0100, German Gomez wrote:
> Hi,
> 
> On 31/03/2022 13:44, Leo Yan wrote:
> > [...]
> >>> I'd like to do this in a separate patch, but I have one other proposal. The
> >>> Neoverse cores' L2 is strictly inclusive of the L1, so even if a line is in the
> >>> L1, it's also in the L2. Given that the Graviton systems and, afaik, the Ampere
> >>> systems don't have any cache between the L2 and the SLC, anything from
> >>> PEER_CORE, LCL_CLSTR, or PEER_CLSTR would hit in the L2, so perhaps we
> >>> should just set L2 for these cases? German, are you good with this for now?
> >> Sorry for the delay. I'd like to also check this with someone. I'll try
> >> to get back asap. In the meantime, if this approach is also OK with Leo,
> >> I think it would be fine by me.
> 
> Sorry for the delay. Yeah, setting it to L2 indeed looks reasonable for
> now. Somebody brought up the case of running SPE on a heterogeneous
> system, but we also think that might be beyond the scope of this change.
> 
> One very minor nit though. Would you be OK with renaming LCL to LOCAL
> and CLSTR to CLUSTER? I sometimes misread the former as "LLC".

Ali's suggestion is to use the format: highest_cache_level | MEM_SNOOP_PEER.

Simply put, the cache level is the highest level at which we snoop the
cache data, and we use an extra snoop op, MEM_SNOOP_PEER, as a flag to
indicate a peer snoop from the local cluster or a peer cluster.

Please see the more detailed discussion in the other email.

> > Thanks for checking internally.  Let me just bring up another
> > thought (sorry that my suggestion keeps shifting): another choice is to
> > set ANY_CACHE as the cache level if we are not certain of the cache
> > level, and extend the snoop field to indicate the snooping logic, like:
> >
> >   PERF_MEM_SNOOP_PEER_CORE
> >   PERF_MEM_SNOOP_LCL_CLSTR
> >   PERF_MEM_SNOOP_PEER_CLSTR
> >
> > Seems to me, we are doing this not only for the cache level; it's more
> > important for users to know the varying cost of involving different
> > snooping logic.
> >
> I see there have been some more messages I need to catch up with. Is the
> intention to extend the PERF_MEM_* flags in this cset, or will it be
> left for a later change?

The plan is to extend the PERF_MEM_* flags in this patch set.

> In any case, I'd be keen to take another look at it and try to bring
> some more eyes into this.

Sure.  Please check on your side, and thanks for the confirmation.

Thanks,
Leo

^ permalink raw reply	[flat|nested] 66+ messages in thread

end of thread, other threads:[~2022-04-08  1:20 UTC | newest]

Thread overview: 66+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-03-24 18:33 [PATCH v4 0/4] perf: arm-spe: Decode SPE source and use for perf c2c Ali Saidi
2022-03-24 18:33 ` Ali Saidi
2022-03-24 18:33 ` [PATCH v4 1/4] tools: arm64: Import cputype.h Ali Saidi
2022-03-24 18:33   ` Ali Saidi
2022-03-25 18:39   ` Arnaldo Carvalho de Melo
2022-03-25 18:39     ` Arnaldo Carvalho de Melo
2022-03-25 18:58     ` Ali Saidi
2022-03-25 18:58       ` Ali Saidi
2022-03-25 19:42     ` Arnaldo Carvalho de Melo
2022-03-25 19:42       ` Arnaldo Carvalho de Melo
2022-03-26  5:49       ` Leo Yan
2022-03-26  5:49         ` Leo Yan
2022-03-26 13:59         ` Arnaldo Carvalho de Melo
2022-03-26 13:59           ` Arnaldo Carvalho de Melo
2022-03-24 18:33 ` [PATCH v4 2/4] perf arm-spe: Use SPE data source for neoverse cores Ali Saidi
2022-03-24 18:33   ` Ali Saidi
2022-03-26 13:47   ` Leo Yan
2022-03-26 13:47     ` Leo Yan
2022-03-26 13:52     ` Arnaldo Carvalho de Melo
2022-03-26 13:52       ` Arnaldo Carvalho de Melo
2022-03-26 13:56       ` Leo Yan
2022-03-26 13:56         ` Leo Yan
2022-03-26 14:04         ` Arnaldo Carvalho de Melo
2022-03-26 14:04           ` Arnaldo Carvalho de Melo
2022-03-26 19:43     ` Ali Saidi
2022-03-26 19:43       ` Ali Saidi
2022-03-27  9:09       ` Leo Yan
2022-03-27  9:09         ` Leo Yan
2022-03-28  3:08       ` Ali Saidi
2022-03-28  3:08         ` Ali Saidi
2022-03-28 13:05         ` Leo Yan
2022-03-28 13:05           ` Leo Yan
2022-03-29 13:34           ` Shuai Xue
2022-03-29 13:34             ` Shuai Xue
2022-03-29 14:32           ` Ali Saidi
2022-03-29 14:32             ` Ali Saidi
2022-03-31 12:19             ` Leo Yan
2022-03-31 12:19               ` Leo Yan
2022-03-31 12:28             ` German Gomez
2022-03-31 12:28               ` German Gomez
2022-03-31 12:44               ` Leo Yan
2022-03-31 12:44                 ` Leo Yan
2022-04-03 20:33                 ` Ali Saidi
2022-04-03 20:33                   ` Ali Saidi
2022-04-04 15:12                   ` Leo Yan
2022-04-04 15:12                     ` Leo Yan
2022-04-06 21:00                     ` Ali Saidi
2022-04-06 21:00                       ` Ali Saidi
2022-04-08  1:06                       ` Leo Yan
2022-04-08  1:06                         ` Leo Yan
2022-04-07 15:24                 ` German Gomez
2022-04-07 15:24                   ` German Gomez
2022-04-08  1:18                   ` Leo Yan
2022-04-08  1:18                     ` Leo Yan
2022-03-24 18:33 ` [PATCH v4 3/4] perf mem: Support mem_lvl_num in c2c command Ali Saidi
2022-03-24 18:33   ` Ali Saidi
2022-03-26 13:54   ` Arnaldo Carvalho de Melo
2022-03-26 13:54     ` Arnaldo Carvalho de Melo
2022-03-24 18:33 ` [PATCH v4 4/4] perf mem: Support HITM for when mem_lvl_num is any Ali Saidi
2022-03-24 18:33   ` Ali Saidi
2022-03-26  6:23   ` Leo Yan
2022-03-26  6:23     ` Leo Yan
2022-03-26 13:30     ` Arnaldo Carvalho de Melo
2022-03-26 13:30       ` Arnaldo Carvalho de Melo
2022-03-26 19:14     ` Ali Saidi
2022-03-26 19:14       ` Ali Saidi
