* [PATCH V11 00/10] arm64/perf: Enable branch stack sampling
@ 2023-05-31  4:04 Anshuman Khandual
  2023-05-31  4:04 ` [PATCH V11 01/10] drivers: perf: arm_pmu: Add new sched_task() callback Anshuman Khandual
                   ` (10 more replies)
  0 siblings, 11 replies; 48+ messages in thread
From: Anshuman Khandual @ 2023-05-31  4:04 UTC (permalink / raw)
  To: linux-arm-kernel, linux-kernel, will, catalin.marinas, mark.rutland
  Cc: Anshuman Khandual, Mark Brown, James Clark, Rob Herring,
	Marc Zyngier, Suzuki Poulose, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, linux-perf-users

This series enables perf branch stack sampling support on the arm64 platform
via a new architecture feature called Branch Record Buffer Extension (BRBE).
All relevant register definitions can be accessed here:

https://developer.arm.com/documentation/ddi0601/2021-12/AArch64-Registers

This series applies on top of v6.4-rc4.
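
For context, here is a minimal userspace sketch of the perf ABI that this
series services (standard perf syscall usage, not part of the series
itself):

  #include <linux/perf_event.h>
  #include <string.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  /* Open a cycles event that asks the kernel for branch records; on
   * arm64 with this series applied, BRBE supplies those records. */
  static int open_branch_stack_event(void)
  {
          struct perf_event_attr attr;

          memset(&attr, 0, sizeof(attr));
          attr.size = sizeof(attr);
          attr.type = PERF_TYPE_HARDWARE;
          attr.config = PERF_COUNT_HW_CPU_CYCLES;
          attr.sample_period = 100000;
          attr.sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_BRANCH_STACK;
          attr.branch_sample_type = PERF_SAMPLE_BRANCH_ANY |
                                    PERF_SAMPLE_BRANCH_USER;

          /* current thread, any CPU */
          return syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
  }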

Changes in V11:

- Fixed the crash for per-cpu events without event->pmu_ctx->task_ctx_data

Changes in V10:

https://lore.kernel.org/all/20230517022410.722287-1-anshuman.khandual@arm.com/

- Rebased the series on v6.4-rc2
- Moved ARMV8 PMUV3 changes inside drivers/perf/arm_pmuv3.c
- Moved BRBE driver changes inside drivers/perf/arm_brbe.[c|h]
- Moved the WARN_ON() inside the if condition in armv8pmu_handle_irq()

Changes in V9:

https://lore.kernel.org/all/20230315051444.1683170-1-anshuman.khandual@arm.com/

- Fixed build problem with has_branch_stack() in arm64 header
- BRBINF_EL1 definition has been changed from 'Sysreg' to 'SysregFields'
- Renamed all BRBINF_EL1 call sites as BRBINFx_EL1 
- Dropped static const char branch_filter_error_msg[]
- Implemented a positive list check for BRBE supported perf branch filters
- Added a comment in armv8pmu_handle_irq()
- Implemented per-cpu allocation for struct branch_record records
- Skipped looping through bank 1 if an invalid record is detected in bank 0
- Added comment in armv8pmu_branch_read() explaining prohibited region etc
- Added comment warning about erroneously marking transactions as aborted
- Replaced the first argument (perf_branch_entry) in capture_brbe_flags()
- Dropped the last argument (idx) in capture_brbe_flags()
- Dropped the brbcr argument from capture_brbe_flags()
- Used perf_sample_save_brstack() to capture branch records for perf_sample_data
- Added comment explaining rationale for setting BRBCR_EL1_FZP for user only traces
- Dropped BRBE prohibited state mechanism while in armv8pmu_branch_read()
- Implemented event task context based branch records save mechanism

Changes in V8:

https://lore.kernel.org/all/20230123125956.1350336-1-anshuman.khandual@arm.com/

- Replaced arm_pmu->features with arm_pmu->has_branch_stack, updated its helper
- Added a comment and line break before arm_pmu->private element 
- Added WARN_ON_ONCE() in helpers i.e armv8pmu_branch_[read|valid|enable|disable]()
- Dropped comments in armv8pmu_enable_event() and armv8pmu_disable_event()
- Replaced open bank encoding in BRBFCR_EL1 with SYS_FIELD_PREP()
- Changed brbe_hw_attr->brbe_version from 'bool' to 'int'
- Updated pr_warn() to pr_warn_once() with values in brbe_get_perf_[type|priv]()
- Replaced all pr_warn_once() with pr_debug_once() in armv8pmu_branch_valid()
- Added a comment in branch_type_to_brbcr() for the BRBCR_EL1 privilege settings
- Modified the comment related to BRBINFx_EL1.LASTFAILED in capture_brbe_flags()   
- Changed brbe_get_perf_entry_type() to brbe_set_perf_entry_type()
- Renamed brbe_valid() as brbe_record_is_complete()
- Renamed brbe_source() as brbe_record_is_source_only()
- Renamed brbe_target() as brbe_record_is_target_only()
- Inverted checks for !brbe_record_is_[target|source]_only() for info capture
- Replaced 'fetch' with 'get' in all helpers that extract field value
- Dropped 'static int brbe_current_bank' optimization in select_brbe_bank()
- Dropped select_brbe_bank_index() completely, added capture_branch_entry()
- Process captured branch entries in two separate loops one for each BRBE bank
- Moved branch_records_alloc() inside armv8pmu_probe_pmu()
- Added a forward declaration for the helper has_branch_stack()
- Added new callbacks armv8pmu_private_alloc() and armv8pmu_private_free()
- Updated armv8pmu_probe_pmu() to allocate the private structure before SMP call

Changes in V7:

https://lore.kernel.org/all/20230105031039.207972-1-anshuman.khandual@arm.com/

- Folded [PATCH 7/7] into [PATCH 3/7] which enables branch stack sampling event
- Defined BRBFCR_EL1_BRANCH_FILTERS, BRBCR_EL1_DEFAULT_CONFIG in the header
- Defined BRBFCR_EL1_DEFAULT_CONFIG in the header
- Updated BRBCR_EL1_DEFAULT_CONFIG with BRBCR_EL1_FZP
- Defined BRBCR_EL1_DEFAULT_TS in the header
- Updated BRBCR_EL1_DEFAULT_CONFIG with BRBCR_EL1_DEFAULT_TS
- Moved BRBCR_EL1_DEFAULT_CONFIG check inside branch_type_to_brbcr()
- Moved down BRBCR_EL1_CC, BRBCR_EL1_MPRED later in branch_type_to_brbcr()
- Also set BRBE in paused state in armv8pmu_branch_disable()
- Dropped brbe_paused(), set_brbe_paused() helpers
- Extracted error string via branch_filter_error_msg[] for armv8pmu_branch_valid()
- Replaced brbe_v1p1 with brbe_version in struct brbe_hw_attr
- Added valid_brbe_[cc, format, version]() helpers
- Split a separate brbe_attributes_probe() from armv8pmu_branch_probe()
- Capture event->attr.branch_sample_type earlier in armv8pmu_branch_valid()
- Defined enum brbe_bank_idx with possible values for BRBE bank indices
- Changed armpmu->hw_attr into armpmu->private
- Added missing space in stub definition for armv8pmu_branch_valid()
- Replaced both kmalloc() with kzalloc()
- Added BRBE_BANK_MAX_ENTRIES
- Updated comment for capture_brbe_flags()
- Updated comment for struct brbe_hw_attr
- Dropped space after type cast in couple of places
- Replaced inverse with negation for testing BRBCR_EL1_FZP in armv8pmu_branch_read()
- Captured cpuc->branches->branch_entries[idx] in a local variable
- Dropped saved_priv from armv8pmu_branch_read()
- Reorganized PERF_SAMPLE_BRANCH_NO_[CYCLES|FLAGS] related configuration
- Replaced with FIELD_GET() and FIELD_PREP() wherever applicable
- Replaced BRBCR_EL1_TS_PHYSICAL with BRBCR_EL1_TS_VIRTUAL
- Moved valid_brbe_nr(), valid_brbe_cc(), valid_brbe_format(), valid_brbe_version()
  select_brbe_bank(), select_brbe_bank_index() helpers inside the C implementation
- Reorganized brbe_valid_nr() and dropped the pr_warn() message
- Changed probe sequence in brbe_attributes_probe()
- Added 'brbcr' argument into capture_brbe_flags() to ascertain correct state
- Disable BRBE before disabling the PMU event counter
- Enable PERF_SAMPLE_BRANCH_HV filters when is_kernel_in_hyp_mode()
- Guard armv8pmu_reset() & armv8pmu_sched_task() with arm_pmu_branch_stack_supported()

Changes in V6:

https://lore.kernel.org/linux-arm-kernel/20221208084402.863310-1-anshuman.khandual@arm.com/

- Restore the exception level privilege after reading the branch records
- Unpause the buffer after reading the branch records
- Decouple BRBCR_EL1_EXCEPTION/ERTN from perf event privilege level
- Reworked BRBE implementation and branch stack sampling support on arm pmu
- BRBE implementation is now part of overall ARMV8 PMU implementation
- BRBE implementation moved from drivers/perf/ to inside arch/arm64/kernel/
- CONFIG_ARM_BRBE_PMU renamed as CONFIG_ARM64_BRBE in arch/arm64/Kconfig
- File moved - drivers/perf/arm_pmu_brbe.c -> arch/arm64/kernel/brbe.c
- File moved - drivers/perf/arm_pmu_brbe.h -> arch/arm64/kernel/brbe.h
- BRBE name has been dropped from struct arm_pmu and struct hw_pmu_events
- BRBE name has been abstracted out as 'branches' in arm_pmu and hw_pmu_events
- BRBE name has been abstracted out as 'branches' in ARMV8 PMU implementation
- Added sched_task() callback into struct arm_pmu
- Added 'hw_attr' into struct arm_pmu encapsulating possible PMU HW attributes
- Dropped explicit attributes brbe_(v1p1, nr, cc, format) from struct arm_pmu
- Dropped brbfcr, brbcr, registers scratch area from struct hw_pmu_events
- Dropped brbe_users, brbe_context tracking in struct hw_pmu_events
- Added 'features' tracking into struct arm_pmu with ARM_PMU_BRANCH_STACK flag
- armpmu->hw_attr maps into 'struct brbe_hw_attr' inside BRBE implementation
- Set ARM_PMU_BRANCH_STACK in 'arm_pmu->features' after successful BRBE probe
- Added armv8pmu_branch_reset() inside armv8pmu_branch_enable()
- Dropped brbe_supported() as events will be rejected via ARM_PMU_BRANCH_STACK
- Dropped set_brbe_disabled() as well
- Reformatted armv8pmu_branch_valid() warnings while rejecting unsupported events

Changes in V5:

https://lore.kernel.org/linux-arm-kernel/20221107062514.2851047-1-anshuman.khandual@arm.com/

- Changed BRBCR_EL1.VIRTUAL from 0b1 to 0b01
- Changed BRBFCR_EL1.EnL into BRBFCR_EL1.EnI
- Changed config ARM_BRBE_PMU from 'tristate' to 'bool'

Changes in V4:

https://lore.kernel.org/all/20221017055713.451092-1-anshuman.khandual@arm.com/

- Changed ../tools/sysreg declarations as suggested
- Set PERF_SAMPLE_BRANCH_STACK in data.sample_flags
- Dropped perfmon_capable() check in armpmu_event_init()
- s/pr_warn_once/pr_info in armpmu_event_init()
- Added brbe_format element into struct pmu_hw_events
- Changed v1p1 as brbe_v1p1 in struct pmu_hw_events
- Dropped pr_info() from arm64_pmu_brbe_probe(), solved LOCKDEP warning

Changes in V3:

https://lore.kernel.org/all/20220929075857.158358-1-anshuman.khandual@arm.com/

- Moved brbe_stack off the stack; it is now dynamically allocated
- Return PERF_BR_PRIV_UNKNOWN instead of -1 in brbe_fetch_perf_priv()
- Moved BRBIDR0, BRBCR, BRBFCR registers and fields into tools/sysreg
- Created dummy BRBINF_EL1 field definitions in tools/sysreg
- Dropped ARMPMU_EVT_PRIV framework which cached perfmon_capable()
- Both exception and exception return branch records are now captured
  only if the event has PERF_SAMPLE_BRANCH_KERNEL, which would already
  have been checked in generic perf via perf_allow_kernel()

Changes in V2:

https://lore.kernel.org/all/20220908051046.465307-1-anshuman.khandual@arm.com/

- Dropped branch sample filter helpers consolidation patch from this series 
- Added new hw_perf_event.flags element ARMPMU_EVT_PRIV to cache perfmon_capable()
- Use cached perfmon_capable() while configuring BRBE branch record filters

Changes in V1:

https://lore.kernel.org/linux-arm-kernel/20220613100119.684673-1-anshuman.khandual@arm.com/

- Added CONFIG_PERF_EVENTS wrapper for all branch sample filter helpers
- Process new perf branch types via PERF_BR_EXTEND_ABI

Changes in RFC V2:

https://lore.kernel.org/linux-arm-kernel/20220412115455.293119-1-anshuman.khandual@arm.com/

- Added branch_sample_priv() while consolidating other branch sample filter helpers
- Changed all SYS_BRBXXXN_EL1 register definition encodings per Marc
- Changed the BRBE driver as per proposed BRBE related perf ABI changes (V5)
- Added documentation for struct arm_pmu changes, updated commit message
- Updated commit message for BRBE detection infrastructure patch
- PERF_SAMPLE_BRANCH_KERNEL gets checked during arm event init (outside the driver)
- Branch privilege state capture mechanism has now moved inside the driver

Changes in RFC V1:

https://lore.kernel.org/all/1642998653-21377-1-git-send-email-anshuman.khandual@arm.com/

Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mark Brown <broonie@kernel.org>
Cc: James Clark <james.clark@arm.com>
Cc: Rob Herring <robh@kernel.org>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Suzuki Poulose <suzuki.poulose@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-perf-users@vger.kernel.org
Cc: linux-kernel@vger.kernel.org

Anshuman Khandual (10):
  drivers: perf: arm_pmu: Add new sched_task() callback
  arm64/perf: Add BRBE registers and fields
  arm64/perf: Add branch stack support in struct arm_pmu
  arm64/perf: Add branch stack support in struct pmu_hw_events
  arm64/perf: Add branch stack support in ARMV8 PMU
  arm64/perf: Enable branch stack events via FEAT_BRBE
  arm64/perf: Add PERF_ATTACH_TASK_DATA to events with has_branch_stack()
  arm64/perf: Add struct brbe_regset helper functions
  arm64/perf: Implement branch records save on task sched out
  arm64/perf: Implement branch records save on PMU IRQ

 arch/arm64/include/asm/perf_event.h |  46 ++
 arch/arm64/include/asm/sysreg.h     | 103 ++++
 arch/arm64/tools/sysreg             | 159 ++++++
 drivers/perf/Kconfig                |  11 +
 drivers/perf/Makefile               |   1 +
 drivers/perf/arm_brbe.c             | 762 ++++++++++++++++++++++++++++
 drivers/perf/arm_brbe.h             | 270 ++++++++++
 drivers/perf/arm_pmu.c              |  12 +-
 drivers/perf/arm_pmuv3.c            | 105 +++-
 include/linux/perf/arm_pmu.h        |  22 +-
 10 files changed, 1465 insertions(+), 26 deletions(-)
 create mode 100644 drivers/perf/arm_brbe.c
 create mode 100644 drivers/perf/arm_brbe.h

-- 
2.25.1



* [PATCH V11 01/10] drivers: perf: arm_pmu: Add new sched_task() callback
  2023-05-31  4:04 [PATCH V11 00/10] arm64/perf: Enable branch stack sampling Anshuman Khandual
@ 2023-05-31  4:04 ` Anshuman Khandual
  2023-06-05  7:26   ` Mark Rutland
  2023-05-31  4:04 ` [PATCH V11 02/10] arm64/perf: Add BRBE registers and fields Anshuman Khandual
                   ` (9 subsequent siblings)
  10 siblings, 1 reply; 48+ messages in thread
From: Anshuman Khandual @ 2023-05-31  4:04 UTC (permalink / raw)
  To: linux-arm-kernel, linux-kernel, will, catalin.marinas, mark.rutland
  Cc: Anshuman Khandual, Mark Brown, James Clark, Rob Herring,
	Marc Zyngier, Suzuki Poulose, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, linux-perf-users

This adds armpmu_sched_task() as the generic pmu's sched_task() override,
which in turn invokes the new arm_pmu.sched_task() callback when the arm_pmu
instance provides one. This new callback will be used while enabling BRBE in
the ARMV8 PMU.

Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
Tested-by: James Clark <james.clark@arm.com>
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
---
 drivers/perf/arm_pmu.c       | 9 +++++++++
 include/linux/perf/arm_pmu.h | 1 +
 2 files changed, 10 insertions(+)

diff --git a/drivers/perf/arm_pmu.c b/drivers/perf/arm_pmu.c
index 15bd1e34a88e..aada47e3b126 100644
--- a/drivers/perf/arm_pmu.c
+++ b/drivers/perf/arm_pmu.c
@@ -517,6 +517,14 @@ static int armpmu_event_init(struct perf_event *event)
 	return __hw_perf_event_init(event);
 }
 
+static void armpmu_sched_task(struct perf_event_pmu_context *pmu_ctx, bool sched_in)
+{
+	struct arm_pmu *armpmu = to_arm_pmu(pmu_ctx->pmu);
+
+	if (armpmu->sched_task)
+		armpmu->sched_task(pmu_ctx, sched_in);
+}
+
 static void armpmu_enable(struct pmu *pmu)
 {
 	struct arm_pmu *armpmu = to_arm_pmu(pmu);
@@ -858,6 +866,7 @@ struct arm_pmu *armpmu_alloc(void)
 	}
 
 	pmu->pmu = (struct pmu) {
+		.sched_task	= armpmu_sched_task,
 		.pmu_enable	= armpmu_enable,
 		.pmu_disable	= armpmu_disable,
 		.event_init	= armpmu_event_init,
diff --git a/include/linux/perf/arm_pmu.h b/include/linux/perf/arm_pmu.h
index 525b5d64e394..f7fbd162ca4c 100644
--- a/include/linux/perf/arm_pmu.h
+++ b/include/linux/perf/arm_pmu.h
@@ -100,6 +100,7 @@ struct arm_pmu {
 	void		(*stop)(struct arm_pmu *);
 	void		(*reset)(void *);
 	int		(*map_event)(struct perf_event *event);
+	void		(*sched_task)(struct perf_event_pmu_context *pmu_ctx, bool sched_in);
 	int		num_events;
 	bool		secure_access; /* 32-bit ARM only */
 #define ARMV8_PMUV3_MAX_COMMON_EVENTS		0x40
-- 
2.25.1



* [PATCH V11 02/10] arm64/perf: Add BRBE registers and fields
  2023-05-31  4:04 [PATCH V11 00/10] arm64/perf: Enable branch stack sampling Anshuman Khandual
  2023-05-31  4:04 ` [PATCH V11 01/10] drivers: perf: arm_pmu: Add new sched_task() callback Anshuman Khandual
@ 2023-05-31  4:04 ` Anshuman Khandual
  2023-06-05  7:55   ` Mark Rutland
  2023-06-13 16:27   ` Mark Rutland
  2023-05-31  4:04 ` [PATCH V11 03/10] arm64/perf: Add branch stack support in struct arm_pmu Anshuman Khandual
                   ` (8 subsequent siblings)
  10 siblings, 2 replies; 48+ messages in thread
From: Anshuman Khandual @ 2023-05-31  4:04 UTC (permalink / raw)
  To: linux-arm-kernel, linux-kernel, will, catalin.marinas, mark.rutland
  Cc: Anshuman Khandual, Mark Brown, James Clark, Rob Herring,
	Marc Zyngier, Suzuki Poulose, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, linux-perf-users

This adds BRBE related register definitions and the various field macros
therein. These will be used subsequently by the BRBE driver, which is added
later in this series.
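
As a reading aid (not part of the patch), the banked encoding used by the
__SYS_BRB*() helpers below works out as follows:

  /*
   * Each of the 32 records n has CRm = (n & 0xf), while op2 selects
   * {INF, SRC, TGT} = {0, 1, 2} for records 0-15 and {4, 5, 6} for
   * records 16-31 via the ((n & 0x10) >> 2) term. For example:
   *
   *   SYS_BRBSRC5_EL1  == sys_reg(2, 1, 8,  5, 1)
   *   SYS_BRBSRC21_EL1 == sys_reg(2, 1, 8,  5, 5)
   *   SYS_BRBTGT30_EL1 == sys_reg(2, 1, 8, 14, 6)
   */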

Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
Tested-by: James Clark <james.clark@arm.com>
Reviewed-by: Mark Brown <broonie@kernel.org>
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
---
 arch/arm64/include/asm/sysreg.h | 103 +++++++++++++++++++++
 arch/arm64/tools/sysreg         | 159 ++++++++++++++++++++++++++++++++
 2 files changed, 262 insertions(+)

diff --git a/arch/arm64/include/asm/sysreg.h b/arch/arm64/include/asm/sysreg.h
index e72d9aaab6b1..12419c55d3b7 100644
--- a/arch/arm64/include/asm/sysreg.h
+++ b/arch/arm64/include/asm/sysreg.h
@@ -165,6 +165,109 @@
 #define SYS_DBGDTRTX_EL0		sys_reg(2, 3, 0, 5, 0)
 #define SYS_DBGVCR32_EL2		sys_reg(2, 4, 0, 7, 0)
 
+#define __SYS_BRBINFO(n)		sys_reg(2, 1, 8, ((n) & 0xf), ((((n) & 0x10) >> 2) + 0))
+#define __SYS_BRBSRC(n)			sys_reg(2, 1, 8, ((n) & 0xf), ((((n) & 0x10) >> 2) + 1))
+#define __SYS_BRBTGT(n)			sys_reg(2, 1, 8, ((n) & 0xf), ((((n) & 0x10) >> 2) + 2))
+
+#define SYS_BRBINF0_EL1			__SYS_BRBINFO(0)
+#define SYS_BRBINF1_EL1			__SYS_BRBINFO(1)
+#define SYS_BRBINF2_EL1			__SYS_BRBINFO(2)
+#define SYS_BRBINF3_EL1			__SYS_BRBINFO(3)
+#define SYS_BRBINF4_EL1			__SYS_BRBINFO(4)
+#define SYS_BRBINF5_EL1			__SYS_BRBINFO(5)
+#define SYS_BRBINF6_EL1			__SYS_BRBINFO(6)
+#define SYS_BRBINF7_EL1			__SYS_BRBINFO(7)
+#define SYS_BRBINF8_EL1			__SYS_BRBINFO(8)
+#define SYS_BRBINF9_EL1			__SYS_BRBINFO(9)
+#define SYS_BRBINF10_EL1		__SYS_BRBINFO(10)
+#define SYS_BRBINF11_EL1		__SYS_BRBINFO(11)
+#define SYS_BRBINF12_EL1		__SYS_BRBINFO(12)
+#define SYS_BRBINF13_EL1		__SYS_BRBINFO(13)
+#define SYS_BRBINF14_EL1		__SYS_BRBINFO(14)
+#define SYS_BRBINF15_EL1		__SYS_BRBINFO(15)
+#define SYS_BRBINF16_EL1		__SYS_BRBINFO(16)
+#define SYS_BRBINF17_EL1		__SYS_BRBINFO(17)
+#define SYS_BRBINF18_EL1		__SYS_BRBINFO(18)
+#define SYS_BRBINF19_EL1		__SYS_BRBINFO(19)
+#define SYS_BRBINF20_EL1		__SYS_BRBINFO(20)
+#define SYS_BRBINF21_EL1		__SYS_BRBINFO(21)
+#define SYS_BRBINF22_EL1		__SYS_BRBINFO(22)
+#define SYS_BRBINF23_EL1		__SYS_BRBINFO(23)
+#define SYS_BRBINF24_EL1		__SYS_BRBINFO(24)
+#define SYS_BRBINF25_EL1		__SYS_BRBINFO(25)
+#define SYS_BRBINF26_EL1		__SYS_BRBINFO(26)
+#define SYS_BRBINF27_EL1		__SYS_BRBINFO(27)
+#define SYS_BRBINF28_EL1		__SYS_BRBINFO(28)
+#define SYS_BRBINF29_EL1		__SYS_BRBINFO(29)
+#define SYS_BRBINF30_EL1		__SYS_BRBINFO(30)
+#define SYS_BRBINF31_EL1		__SYS_BRBINFO(31)
+
+#define SYS_BRBSRC0_EL1			__SYS_BRBSRC(0)
+#define SYS_BRBSRC1_EL1			__SYS_BRBSRC(1)
+#define SYS_BRBSRC2_EL1			__SYS_BRBSRC(2)
+#define SYS_BRBSRC3_EL1			__SYS_BRBSRC(3)
+#define SYS_BRBSRC4_EL1			__SYS_BRBSRC(4)
+#define SYS_BRBSRC5_EL1			__SYS_BRBSRC(5)
+#define SYS_BRBSRC6_EL1			__SYS_BRBSRC(6)
+#define SYS_BRBSRC7_EL1			__SYS_BRBSRC(7)
+#define SYS_BRBSRC8_EL1			__SYS_BRBSRC(8)
+#define SYS_BRBSRC9_EL1			__SYS_BRBSRC(9)
+#define SYS_BRBSRC10_EL1		__SYS_BRBSRC(10)
+#define SYS_BRBSRC11_EL1		__SYS_BRBSRC(11)
+#define SYS_BRBSRC12_EL1		__SYS_BRBSRC(12)
+#define SYS_BRBSRC13_EL1		__SYS_BRBSRC(13)
+#define SYS_BRBSRC14_EL1		__SYS_BRBSRC(14)
+#define SYS_BRBSRC15_EL1		__SYS_BRBSRC(15)
+#define SYS_BRBSRC16_EL1		__SYS_BRBSRC(16)
+#define SYS_BRBSRC17_EL1		__SYS_BRBSRC(17)
+#define SYS_BRBSRC18_EL1		__SYS_BRBSRC(18)
+#define SYS_BRBSRC19_EL1		__SYS_BRBSRC(19)
+#define SYS_BRBSRC20_EL1		__SYS_BRBSRC(20)
+#define SYS_BRBSRC21_EL1		__SYS_BRBSRC(21)
+#define SYS_BRBSRC22_EL1		__SYS_BRBSRC(22)
+#define SYS_BRBSRC23_EL1		__SYS_BRBSRC(23)
+#define SYS_BRBSRC24_EL1		__SYS_BRBSRC(24)
+#define SYS_BRBSRC25_EL1		__SYS_BRBSRC(25)
+#define SYS_BRBSRC26_EL1		__SYS_BRBSRC(26)
+#define SYS_BRBSRC27_EL1		__SYS_BRBSRC(27)
+#define SYS_BRBSRC28_EL1		__SYS_BRBSRC(28)
+#define SYS_BRBSRC29_EL1		__SYS_BRBSRC(29)
+#define SYS_BRBSRC30_EL1		__SYS_BRBSRC(30)
+#define SYS_BRBSRC31_EL1		__SYS_BRBSRC(31)
+
+#define SYS_BRBTGT0_EL1			__SYS_BRBTGT(0)
+#define SYS_BRBTGT1_EL1			__SYS_BRBTGT(1)
+#define SYS_BRBTGT2_EL1			__SYS_BRBTGT(2)
+#define SYS_BRBTGT3_EL1			__SYS_BRBTGT(3)
+#define SYS_BRBTGT4_EL1			__SYS_BRBTGT(4)
+#define SYS_BRBTGT5_EL1			__SYS_BRBTGT(5)
+#define SYS_BRBTGT6_EL1			__SYS_BRBTGT(6)
+#define SYS_BRBTGT7_EL1			__SYS_BRBTGT(7)
+#define SYS_BRBTGT8_EL1			__SYS_BRBTGT(8)
+#define SYS_BRBTGT9_EL1			__SYS_BRBTGT(9)
+#define SYS_BRBTGT10_EL1		__SYS_BRBTGT(10)
+#define SYS_BRBTGT11_EL1		__SYS_BRBTGT(11)
+#define SYS_BRBTGT12_EL1		__SYS_BRBTGT(12)
+#define SYS_BRBTGT13_EL1		__SYS_BRBTGT(13)
+#define SYS_BRBTGT14_EL1		__SYS_BRBTGT(14)
+#define SYS_BRBTGT15_EL1		__SYS_BRBTGT(15)
+#define SYS_BRBTGT16_EL1		__SYS_BRBTGT(16)
+#define SYS_BRBTGT17_EL1		__SYS_BRBTGT(17)
+#define SYS_BRBTGT18_EL1		__SYS_BRBTGT(18)
+#define SYS_BRBTGT19_EL1		__SYS_BRBTGT(19)
+#define SYS_BRBTGT20_EL1		__SYS_BRBTGT(20)
+#define SYS_BRBTGT21_EL1		__SYS_BRBTGT(21)
+#define SYS_BRBTGT22_EL1		__SYS_BRBTGT(22)
+#define SYS_BRBTGT23_EL1		__SYS_BRBTGT(23)
+#define SYS_BRBTGT24_EL1		__SYS_BRBTGT(24)
+#define SYS_BRBTGT25_EL1		__SYS_BRBTGT(25)
+#define SYS_BRBTGT26_EL1		__SYS_BRBTGT(26)
+#define SYS_BRBTGT27_EL1		__SYS_BRBTGT(27)
+#define SYS_BRBTGT28_EL1		__SYS_BRBTGT(28)
+#define SYS_BRBTGT29_EL1		__SYS_BRBTGT(29)
+#define SYS_BRBTGT30_EL1		__SYS_BRBTGT(30)
+#define SYS_BRBTGT31_EL1		__SYS_BRBTGT(31)
+
 #define SYS_MIDR_EL1			sys_reg(3, 0, 0, 0, 0)
 #define SYS_MPIDR_EL1			sys_reg(3, 0, 0, 0, 5)
 #define SYS_REVIDR_EL1			sys_reg(3, 0, 0, 0, 6)
diff --git a/arch/arm64/tools/sysreg b/arch/arm64/tools/sysreg
index c9a0d1fa3209..44745f42262f 100644
--- a/arch/arm64/tools/sysreg
+++ b/arch/arm64/tools/sysreg
@@ -947,6 +947,165 @@ UnsignedEnum	3:0	BT
 EndEnum
 EndSysreg
 
+
+SysregFields BRBINFx_EL1
+Res0	63:47
+Field	46	CCU
+Field	45:32	CC
+Res0	31:18
+Field	17	LASTFAILED
+Field	16	T
+Res0	15:14
+Enum	13:8		TYPE
+	0b000000	UNCOND_DIR
+	0b000001	INDIR
+	0b000010	DIR_LINK
+	0b000011	INDIR_LINK
+	0b000101	RET_SUB
+	0b000111	RET_EXCPT
+	0b001000	COND_DIR
+	0b100001	DEBUG_HALT
+	0b100010	CALL
+	0b100011	TRAP
+	0b100100	SERROR
+	0b100110	INST_DEBUG
+	0b100111	DATA_DEBUG
+	0b101010	ALGN_FAULT
+	0b101011	INST_FAULT
+	0b101100	DATA_FAULT
+	0b101110	IRQ
+	0b101111	FIQ
+	0b111001	DEBUG_EXIT
+EndEnum
+Enum	7:6	EL
+	0b00	EL0
+	0b01	EL1
+	0b10	EL2
+	0b11	EL3
+EndEnum
+Field	5	MPRED
+Res0	4:2
+Enum	1:0	VALID
+	0b00	NONE
+	0b01	TARGET
+	0b10	SOURCE
+	0b11	FULL
+EndEnum
+EndSysregFields
+
+Sysreg	BRBCR_EL1	2	1	9	0	0
+Res0	63:24
+Field	23 	EXCEPTION
+Field	22 	ERTN
+Res0	21:9
+Field	8 	FZP
+Res0	7
+Enum	6:5	TS
+	0b01	VIRTUAL
+	0b10	GST_PHYSICAL
+	0b11	PHYSICAL
+EndEnum
+Field	4	MPRED
+Field	3	CC
+Res0	2
+Field	1	E1BRE
+Field	0	E0BRE
+EndSysreg
+
+Sysreg	BRBFCR_EL1	2	1	9	0	1
+Res0	63:30
+Enum	29:28	BANK
+	0b00	FIRST
+	0b01	SECOND
+EndEnum
+Res0	27:23
+Field	22	CONDDIR
+Field	21	DIRCALL
+Field	20	INDCALL
+Field	19	RTN
+Field	18	INDIRECT
+Field	17	DIRECT
+Field	16	EnI
+Res0	15:8
+Field	7	PAUSED
+Field	6	LASTFAILED
+Res0	5:0
+EndSysreg
+
+Sysreg	BRBTS_EL1	2	1	9	0	2
+Field	63:0	TS
+EndSysreg
+
+Sysreg	BRBINFINJ_EL1	2	1	9	1	0
+Res0	63:47
+Field	46	CCU
+Field	45:32	CC
+Res0	31:18
+Field	17	LASTFAILED
+Field	16	T
+Res0	15:14
+Enum	13:8		TYPE
+	0b000000	UNCOND_DIR
+	0b000001	INDIR
+	0b000010	DIR_LINK
+	0b000011	INDIR_LINK
+	0b000101	RET_SUB
+	0b000111	RET_EXCPT
+	0b001000	COND_DIR
+	0b100001	DEBUG_HALT
+	0b100010	CALL
+	0b100011	TRAP
+	0b100100	SERROR
+	0b100110	INST_DEBUG
+	0b100111	DATA_DEBUG
+	0b101010	ALGN_FAULT
+	0b101011	INST_FAULT
+	0b101100	DATA_FAULT
+	0b101110	IRQ
+	0b101111	FIQ
+	0b111001	DEBUG_EXIT
+EndEnum
+Enum	7:6	EL
+	0b00	EL0
+	0b01	EL1
+	0b10	EL2
+	0b11	EL3
+EndEnum
+Field	5	MPRED
+Res0	4:2
+Enum	1:0	VALID
+	0b00	NONE
+	0b01	TARGET
+	0b10	SOURCE
+	0b11	FULL
+EndEnum
+EndSysreg
+
+Sysreg	BRBSRCINJ_EL1	2	1	9	1	1
+Field	63:0 ADDRESS
+EndSysreg
+
+Sysreg	BRBTGTINJ_EL1	2	1	9	1	2
+Field	63:0 ADDRESS
+EndSysreg
+
+Sysreg	BRBIDR0_EL1	2	1	9	2	0
+Res0	63:16
+Enum	15:12	CC
+	0b0101	20_BIT
+EndEnum
+Enum	11:8	FORMAT
+	0b0000	0
+EndEnum
+Enum	7:0		NUMREC
+	0b00001000	8
+	0b00010000	16
+	0b00100000	32
+	0b01000000	64
+EndEnum
+EndSysreg
+
 Sysreg	ID_AA64ZFR0_EL1	3	0	0	4	4
 Res0	63:60
 UnsignedEnum	59:56	F64MM
-- 
2.25.1



* [PATCH V11 03/10] arm64/perf: Add branch stack support in struct arm_pmu
  2023-05-31  4:04 [PATCH V11 00/10] arm64/perf: Enable branch stack sampling Anshuman Khandual
  2023-05-31  4:04 ` [PATCH V11 01/10] drivers: perf: arm_pmu: Add new sched_task() callback Anshuman Khandual
  2023-05-31  4:04 ` [PATCH V11 02/10] arm64/perf: Add BRBE registers and fields Anshuman Khandual
@ 2023-05-31  4:04 ` Anshuman Khandual
  2023-06-05  7:58   ` Mark Rutland
  2023-05-31  4:04 ` [PATCH V11 04/10] arm64/perf: Add branch stack support in struct pmu_hw_events Anshuman Khandual
                   ` (7 subsequent siblings)
  10 siblings, 1 reply; 48+ messages in thread
From: Anshuman Khandual @ 2023-05-31  4:04 UTC (permalink / raw)
  To: linux-arm-kernel, linux-kernel, will, catalin.marinas, mark.rutland
  Cc: Anshuman Khandual, Mark Brown, James Clark, Rob Herring,
	Marc Zyngier, Suzuki Poulose, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, linux-perf-users

This updates 'struct arm_pmu' for branch stack sampling support, which is
added later. A new 'has_branch_stack' bit in the structure tracks whether
the feature is supported, and a new 'private' element encapsulates
implementation specific attributes on a given 'struct arm_pmu'. This also
adds a helper arm_pmu_branch_stack_supported().

This also enables perf branch stack sampling events on any 'struct arm_pmu'
that supports the feature, by removing the current gate that unconditionally
blocks such events in armpmu_event_init(). Instead, support is now probed
via arm_pmu_branch_stack_supported().
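
A minimal sketch of the resulting flow, with a hypothetical PMU specific
capability check shown purely for illustration:

  /* hypothetical, PMU specific capability check */
  static bool hw_has_branch_records(void)
  {
          return IS_ENABLED(CONFIG_ARM64_BRBE);
  }

  static void example_pmu_probe(struct arm_pmu *armpmu)
  {
          if (hw_has_branch_records())
                  armpmu->has_branch_stack = 1;
  }

armpmu_event_init() then accepts a branch stack sampling event only when
arm_pmu_branch_stack_supported(armpmu) is true, instead of rejecting all
such events with -EOPNOTSUPP as before.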

Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
Tested-by: James Clark <james.clark@arm.com>
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
---
 drivers/perf/arm_pmu.c       |  3 +--
 include/linux/perf/arm_pmu.h | 12 +++++++++++-
 2 files changed, 12 insertions(+), 3 deletions(-)

diff --git a/drivers/perf/arm_pmu.c b/drivers/perf/arm_pmu.c
index aada47e3b126..d4a4f2bd89a5 100644
--- a/drivers/perf/arm_pmu.c
+++ b/drivers/perf/arm_pmu.c
@@ -510,8 +510,7 @@ static int armpmu_event_init(struct perf_event *event)
 		!cpumask_test_cpu(event->cpu, &armpmu->supported_cpus))
 		return -ENOENT;
 
-	/* does not support taken branch sampling */
-	if (has_branch_stack(event))
+	if (has_branch_stack(event) && !arm_pmu_branch_stack_supported(armpmu))
 		return -EOPNOTSUPP;
 
 	return __hw_perf_event_init(event);
diff --git a/include/linux/perf/arm_pmu.h b/include/linux/perf/arm_pmu.h
index f7fbd162ca4c..0da745eaf426 100644
--- a/include/linux/perf/arm_pmu.h
+++ b/include/linux/perf/arm_pmu.h
@@ -102,7 +102,9 @@ struct arm_pmu {
 	int		(*map_event)(struct perf_event *event);
 	void		(*sched_task)(struct perf_event_pmu_context *pmu_ctx, bool sched_in);
 	int		num_events;
-	bool		secure_access; /* 32-bit ARM only */
+	unsigned int	secure_access	: 1, /* 32-bit ARM only */
+			has_branch_stack: 1, /* 64-bit ARM only */
+			reserved	: 30;
 #define ARMV8_PMUV3_MAX_COMMON_EVENTS		0x40
 	DECLARE_BITMAP(pmceid_bitmap, ARMV8_PMUV3_MAX_COMMON_EVENTS);
 #define ARMV8_PMUV3_EXT_COMMON_EVENT_BASE	0x4000
@@ -118,8 +120,16 @@ struct arm_pmu {
 
 	/* Only to be used by ACPI probing code */
 	unsigned long acpi_cpuid;
+
+	/* Implementation specific attributes */
+	void		*private;
 };
 
+static inline bool arm_pmu_branch_stack_supported(struct arm_pmu *armpmu)
+{
+	return armpmu->has_branch_stack;
+}
+
 #define to_arm_pmu(p) (container_of(p, struct arm_pmu, pmu))
 
 u64 armpmu_event_update(struct perf_event *event);
-- 
2.25.1



* [PATCH V11 04/10] arm64/perf: Add branch stack support in struct pmu_hw_events
  2023-05-31  4:04 [PATCH V11 00/10] arm64/perf: Enable branch stack sampling Anshuman Khandual
                   ` (2 preceding siblings ...)
  2023-05-31  4:04 ` [PATCH V11 03/10] arm64/perf: Add branch stack support in struct arm_pmu Anshuman Khandual
@ 2023-05-31  4:04 ` Anshuman Khandual
  2023-06-05  8:00   ` Mark Rutland
  2023-05-31  4:04 ` [PATCH V11 05/10] arm64/perf: Add branch stack support in ARMV8 PMU Anshuman Khandual
                   ` (6 subsequent siblings)
  10 siblings, 1 reply; 48+ messages in thread
From: Anshuman Khandual @ 2023-05-31  4:04 UTC (permalink / raw)
  To: linux-arm-kernel, linux-kernel, will, catalin.marinas, mark.rutland
  Cc: Anshuman Khandual, Mark Brown, James Clark, Rob Herring,
	Marc Zyngier, Suzuki Poulose, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, linux-perf-users

This adds a branch records buffer pointer in 'struct pmu_hw_events', which
can be used to capture branch records during a PMU interrupt. This per-CPU
pointer needs to be allocated before it can be used.
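
A sketch of the intended usage; both fragments appear in full in the later
patches of this series:

  /* at probe time - one buffer per CPU */
  for_each_possible_cpu(cpu) {
          events = per_cpu_ptr(armpmu->hw_events, cpu);
          events->branches = kzalloc(sizeof(struct branch_records),
                                     GFP_KERNEL);
          if (!events->branches)
                  return -ENOMEM;
  }

  /* in the PMU IRQ handler - fill and hand the records to perf core */
  armv8pmu_branch_read(cpuc, event);
  perf_sample_save_brstack(&data, event, &cpuc->branches->branch_stack);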

Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
Tested-by: James Clark <james.clark@arm.com>
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
---
 include/linux/perf/arm_pmu.h | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/include/linux/perf/arm_pmu.h b/include/linux/perf/arm_pmu.h
index 0da745eaf426..694b241e456c 100644
--- a/include/linux/perf/arm_pmu.h
+++ b/include/linux/perf/arm_pmu.h
@@ -44,6 +44,13 @@ static_assert((PERF_EVENT_FLAG_ARCH & ARMPMU_EVT_47BIT) == ARMPMU_EVT_47BIT);
 	},								\
 }
 
+#define MAX_BRANCH_RECORDS 64
+
+struct branch_records {
+	struct perf_branch_stack	branch_stack;
+	struct perf_branch_entry	branch_entries[MAX_BRANCH_RECORDS];
+};
+
 /* The events for a given PMU register set. */
 struct pmu_hw_events {
 	/*
@@ -70,6 +77,8 @@ struct pmu_hw_events {
 	struct arm_pmu		*percpu_pmu;
 
 	int irq;
+
+	struct branch_records	*branches;
 };
 
 enum armpmu_attr_groups {
-- 
2.25.1



* [PATCH V11 05/10] arm64/perf: Add branch stack support in ARMV8 PMU
  2023-05-31  4:04 [PATCH V11 00/10] arm64/perf: Enable branch stack sampling Anshuman Khandual
                   ` (3 preceding siblings ...)
  2023-05-31  4:04 ` [PATCH V11 04/10] arm64/perf: Add branch stack support in struct pmu_hw_events Anshuman Khandual
@ 2023-05-31  4:04 ` Anshuman Khandual
  2023-06-02  2:33   ` Namhyung Kim
  2023-06-05 12:05   ` Mark Rutland
  2023-05-31  4:04 ` [PATCH V11 06/10] arm64/perf: Enable branch stack events via FEAT_BRBE Anshuman Khandual
                   ` (5 subsequent siblings)
  10 siblings, 2 replies; 48+ messages in thread
From: Anshuman Khandual @ 2023-05-31  4:04 UTC (permalink / raw)
  To: linux-arm-kernel, linux-kernel, will, catalin.marinas, mark.rutland
  Cc: Anshuman Khandual, Mark Brown, James Clark, Rob Herring,
	Marc Zyngier, Suzuki Poulose, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, linux-perf-users

This enables support for branch stack sampling events in the ARMV8 PMU by
checking has_branch_stack() on the event inside 'struct arm_pmu' callbacks,
although the branch stack helpers armv8pmu_branch_XXXXX() are just dummy
functions for now. While here, this also defines arm_pmu's sched_task()
callback as armv8pmu_sched_task(), which resets the branch record buffer on
a sched_in.
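
For reference, a sketch of the resulting context switch flow, assembled
from this patch and the earlier sched_task() callback patch:

  /*
   * sched_in
   *   -> perf core calls pmu->sched_task()
   *   -> armpmu_sched_task()                       [PATCH 01/10]
   *   -> arm_pmu->sched_task() == armv8pmu_sched_task()
   *   -> armv8pmu_branch_reset() if branch stack is supported
   *
   * so an incoming task never observes branch records left behind
   * by the previously running task.
   */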

Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
Tested-by: James Clark <james.clark@arm.com>
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
---
 arch/arm64/include/asm/perf_event.h | 33 +++++++++++++
 drivers/perf/arm_pmuv3.c            | 76 ++++++++++++++++++++---------
 2 files changed, 86 insertions(+), 23 deletions(-)

diff --git a/arch/arm64/include/asm/perf_event.h b/arch/arm64/include/asm/perf_event.h
index eb7071c9eb34..7548813783ba 100644
--- a/arch/arm64/include/asm/perf_event.h
+++ b/arch/arm64/include/asm/perf_event.h
@@ -24,4 +24,37 @@ extern unsigned long perf_misc_flags(struct pt_regs *regs);
 	(regs)->pstate = PSR_MODE_EL1h;	\
 }
 
+struct pmu_hw_events;
+struct arm_pmu;
+struct perf_event;
+
+#ifdef CONFIG_PERF_EVENTS
+static inline bool has_branch_stack(struct perf_event *event);
+
+static inline void armv8pmu_branch_read(struct pmu_hw_events *cpuc, struct perf_event *event)
+{
+	WARN_ON_ONCE(!has_branch_stack(event));
+}
+
+static inline bool armv8pmu_branch_valid(struct perf_event *event)
+{
+	WARN_ON_ONCE(!has_branch_stack(event));
+	return false;
+}
+
+static inline void armv8pmu_branch_enable(struct perf_event *event)
+{
+	WARN_ON_ONCE(!has_branch_stack(event));
+}
+
+static inline void armv8pmu_branch_disable(struct perf_event *event)
+{
+	WARN_ON_ONCE(!has_branch_stack(event));
+}
+
+static inline void armv8pmu_branch_probe(struct arm_pmu *arm_pmu) { }
+static inline void armv8pmu_branch_reset(void) { }
+static inline int armv8pmu_private_alloc(struct arm_pmu *arm_pmu) { return 0; }
+static inline void armv8pmu_private_free(struct arm_pmu *arm_pmu) { }
+#endif
 #endif
diff --git a/drivers/perf/arm_pmuv3.c b/drivers/perf/arm_pmuv3.c
index c98e4039386d..86d803ff1ae3 100644
--- a/drivers/perf/arm_pmuv3.c
+++ b/drivers/perf/arm_pmuv3.c
@@ -705,38 +705,21 @@ static void armv8pmu_enable_event(struct perf_event *event)
 	 * Enable counter and interrupt, and set the counter to count
 	 * the event that we're interested in.
 	 */
-
-	/*
-	 * Disable counter
-	 */
 	armv8pmu_disable_event_counter(event);
-
-	/*
-	 * Set event.
-	 */
 	armv8pmu_write_event_type(event);
-
-	/*
-	 * Enable interrupt for this counter
-	 */
 	armv8pmu_enable_event_irq(event);
-
-	/*
-	 * Enable counter
-	 */
 	armv8pmu_enable_event_counter(event);
+
+	if (has_branch_stack(event))
+		armv8pmu_branch_enable(event);
 }
 
 static void armv8pmu_disable_event(struct perf_event *event)
 {
-	/*
-	 * Disable counter
-	 */
-	armv8pmu_disable_event_counter(event);
+	if (has_branch_stack(event))
+		armv8pmu_branch_disable(event);
 
-	/*
-	 * Disable interrupt for this counter
-	 */
+	armv8pmu_disable_event_counter(event);
 	armv8pmu_disable_event_irq(event);
 }
 
@@ -814,6 +797,11 @@ static irqreturn_t armv8pmu_handle_irq(struct arm_pmu *cpu_pmu)
 		if (!armpmu_event_set_period(event))
 			continue;
 
+		if (has_branch_stack(event) && !WARN_ON(!cpuc->branches)) {
+			armv8pmu_branch_read(cpuc, event);
+			perf_sample_save_brstack(&data, event, &cpuc->branches->branch_stack);
+		}
+
 		/*
 		 * Perf event overflow will queue the processing of the event as
 		 * an irq_work which will be taken care of in the handling of
@@ -912,6 +900,14 @@ static int armv8pmu_user_event_idx(struct perf_event *event)
 	return event->hw.idx;
 }
 
+static void armv8pmu_sched_task(struct perf_event_pmu_context *pmu_ctx, bool sched_in)
+{
+	struct arm_pmu *armpmu = to_arm_pmu(pmu_ctx->pmu);
+
+	if (sched_in && arm_pmu_branch_stack_supported(armpmu))
+		armv8pmu_branch_reset();
+}
+
 /*
  * Add an event filter to a given event.
  */
@@ -982,6 +978,9 @@ static void armv8pmu_reset(void *info)
 		pmcr |= ARMV8_PMU_PMCR_LP;
 
 	armv8pmu_pmcr_write(pmcr);
+
+	if (arm_pmu_branch_stack_supported(cpu_pmu))
+		armv8pmu_branch_reset();
 }
 
 static int __armv8_pmuv3_map_event_id(struct arm_pmu *armpmu,
@@ -1019,6 +1018,9 @@ static int __armv8_pmuv3_map_event(struct perf_event *event,
 
 	hw_event_id = __armv8_pmuv3_map_event_id(armpmu, event);
 
+	if (has_branch_stack(event) && !armv8pmu_branch_valid(event))
+		return -EOPNOTSUPP;
+
 	/*
 	 * CHAIN events only work when paired with an adjacent counter, and it
 	 * never makes sense for a user to open one in isolation, as they'll be
@@ -1135,6 +1137,21 @@ static void __armv8pmu_probe_pmu(void *info)
 		cpu_pmu->reg_pmmir = read_pmmir();
 	else
 		cpu_pmu->reg_pmmir = 0;
+	armv8pmu_branch_probe(cpu_pmu);
+}
+
+static int branch_records_alloc(struct arm_pmu *armpmu)
+{
+	struct pmu_hw_events *events;
+	int cpu;
+
+	for_each_possible_cpu(cpu) {
+		events = per_cpu_ptr(armpmu->hw_events, cpu);
+		events->branches = kzalloc(sizeof(struct branch_records), GFP_KERNEL);
+		if (!events->branches)
+			return -ENOMEM;
+	}
+	return 0;
 }
 
 static int armv8pmu_probe_pmu(struct arm_pmu *cpu_pmu)
@@ -1145,12 +1162,24 @@ static int armv8pmu_probe_pmu(struct arm_pmu *cpu_pmu)
 	};
 	int ret;
 
+	ret = armv8pmu_private_alloc(cpu_pmu);
+	if (ret)
+		return ret;
+
 	ret = smp_call_function_any(&cpu_pmu->supported_cpus,
 				    __armv8pmu_probe_pmu,
 				    &probe, 1);
 	if (ret)
 		return ret;
 
+	if (arm_pmu_branch_stack_supported(cpu_pmu)) {
+		ret = branch_records_alloc(cpu_pmu);
+		if (ret)
+			return ret;
+	} else {
+		armv8pmu_private_free(cpu_pmu);
+	}
+
 	return probe.present ? 0 : -ENODEV;
 }
 
@@ -1214,6 +1243,7 @@ static int armv8_pmu_init(struct arm_pmu *cpu_pmu, char *name,
 	cpu_pmu->set_event_filter	= armv8pmu_set_event_filter;
 
 	cpu_pmu->pmu.event_idx		= armv8pmu_user_event_idx;
+	cpu_pmu->sched_task		= armv8pmu_sched_task;
 
 	cpu_pmu->name			= name;
 	cpu_pmu->map_event		= map_event;
-- 
2.25.1



* [PATCH V11 06/10] arm64/perf: Enable branch stack events via FEAT_BRBE
  2023-05-31  4:04 [PATCH V11 00/10] arm64/perf: Enable branch stack sampling Anshuman Khandual
                   ` (4 preceding siblings ...)
  2023-05-31  4:04 ` [PATCH V11 05/10] arm64/perf: Add branch stack support in ARMV8 PMU Anshuman Khandual
@ 2023-05-31  4:04 ` Anshuman Khandual
  2023-06-02  1:45   ` Namhyung Kim
  2023-06-05 13:43   ` Mark Rutland
  2023-05-31  4:04 ` [PATCH V11 07/10] arm64/perf: Add PERF_ATTACH_TASK_DATA to events with has_branch_stack() Anshuman Khandual
                   ` (4 subsequent siblings)
  10 siblings, 2 replies; 48+ messages in thread
From: Anshuman Khandual @ 2023-05-31  4:04 UTC (permalink / raw)
  To: linux-arm-kernel, linux-kernel, will, catalin.marinas, mark.rutland
  Cc: Anshuman Khandual, Mark Brown, James Clark, Rob Herring,
	Marc Zyngier, Suzuki Poulose, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, linux-perf-users

This enables branch stack sampling events in the ARMV8 PMU, via the
architecture feature FEAT_BRBE aka the Branch Record Buffer Extension. This
defines the required branch helper functions armv8pmu_branch_XXXXX() and the
implementation here is wrapped with a new config option CONFIG_ARM64_BRBE.

Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
Tested-by: James Clark <james.clark@arm.com>
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
---
 arch/arm64/include/asm/perf_event.h |  11 +
 drivers/perf/Kconfig                |  11 +
 drivers/perf/Makefile               |   1 +
 drivers/perf/arm_brbe.c             | 571 ++++++++++++++++++++++++++++
 drivers/perf/arm_brbe.h             | 257 +++++++++++++
 drivers/perf/arm_pmuv3.c            |  21 +-
 6 files changed, 869 insertions(+), 3 deletions(-)
 create mode 100644 drivers/perf/arm_brbe.c
 create mode 100644 drivers/perf/arm_brbe.h

diff --git a/arch/arm64/include/asm/perf_event.h b/arch/arm64/include/asm/perf_event.h
index 7548813783ba..f071d629c0cf 100644
--- a/arch/arm64/include/asm/perf_event.h
+++ b/arch/arm64/include/asm/perf_event.h
@@ -31,6 +31,16 @@ struct perf_event;
 #ifdef CONFIG_PERF_EVENTS
 static inline bool has_branch_stack(struct perf_event *event);
 
+#ifdef CONFIG_ARM64_BRBE
+void armv8pmu_branch_read(struct pmu_hw_events *cpuc, struct perf_event *event);
+bool armv8pmu_branch_valid(struct perf_event *event);
+void armv8pmu_branch_enable(struct perf_event *event);
+void armv8pmu_branch_disable(struct perf_event *event);
+void armv8pmu_branch_probe(struct arm_pmu *arm_pmu);
+void armv8pmu_branch_reset(void);
+int armv8pmu_private_alloc(struct arm_pmu *arm_pmu);
+void armv8pmu_private_free(struct arm_pmu *arm_pmu);
+#else
 static inline void armv8pmu_branch_read(struct pmu_hw_events *cpuc, struct perf_event *event)
 {
 	WARN_ON_ONCE(!has_branch_stack(event));
@@ -58,3 +68,4 @@ static inline int armv8pmu_private_alloc(struct arm_pmu *arm_pmu) { return 0; }
 static inline void armv8pmu_private_free(struct arm_pmu *arm_pmu) { }
 #endif
 #endif
+#endif
diff --git a/drivers/perf/Kconfig b/drivers/perf/Kconfig
index 711f82400086..7d07aa79e5b0 100644
--- a/drivers/perf/Kconfig
+++ b/drivers/perf/Kconfig
@@ -172,6 +172,17 @@ config ARM_SPE_PMU
 	  Extension, which provides periodic sampling of operations in
 	  the CPU pipeline and reports this via the perf AUX interface.
 
+config ARM64_BRBE
+	bool "Enable support for Branch Record Buffer Extension (BRBE)"
+	depends on PERF_EVENTS && ARM64 && ARM_PMU
+	default y
+	help
+	  Enable perf support for the Branch Record Buffer Extension (BRBE),
+	  which records all branches taken in an execution path. This supports
+	  some branch types and privilege based filtering. It captures
+	  additional relevant information such as cycle count, misprediction,
+	  branch type and branch privilege level.
+
 config ARM_DMC620_PMU
 	tristate "Enable PMU support for the ARM DMC-620 memory controller"
 	depends on (ARM64 && ACPI) || COMPILE_TEST
diff --git a/drivers/perf/Makefile b/drivers/perf/Makefile
index dabc859540ce..29d256f2deaa 100644
--- a/drivers/perf/Makefile
+++ b/drivers/perf/Makefile
@@ -17,6 +17,7 @@ obj-$(CONFIG_RISCV_PMU_SBI) += riscv_pmu_sbi.o
 obj-$(CONFIG_THUNDERX2_PMU) += thunderx2_pmu.o
 obj-$(CONFIG_XGENE_PMU) += xgene_pmu.o
 obj-$(CONFIG_ARM_SPE_PMU) += arm_spe_pmu.o
+obj-$(CONFIG_ARM64_BRBE) += arm_brbe.o
 obj-$(CONFIG_ARM_DMC620_PMU) += arm_dmc620_pmu.o
 obj-$(CONFIG_MARVELL_CN10K_TAD_PMU) += marvell_cn10k_tad_pmu.o
 obj-$(CONFIG_MARVELL_CN10K_DDR_PMU) += marvell_cn10k_ddr_pmu.o
diff --git a/drivers/perf/arm_brbe.c b/drivers/perf/arm_brbe.c
new file mode 100644
index 000000000000..34547ad750ad
--- /dev/null
+++ b/drivers/perf/arm_brbe.c
@@ -0,0 +1,571 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Branch Record Buffer Extension Driver.
+ *
+ * Copyright (C) 2022 ARM Limited
+ *
+ * Author: Anshuman Khandual <anshuman.khandual@arm.com>
+ */
+#include "arm_brbe.h"
+
+static bool valid_brbe_nr(int brbe_nr)
+{
+	return brbe_nr == BRBIDR0_EL1_NUMREC_8 ||
+	       brbe_nr == BRBIDR0_EL1_NUMREC_16 ||
+	       brbe_nr == BRBIDR0_EL1_NUMREC_32 ||
+	       brbe_nr == BRBIDR0_EL1_NUMREC_64;
+}
+
+static bool valid_brbe_cc(int brbe_cc)
+{
+	return brbe_cc == BRBIDR0_EL1_CC_20_BIT;
+}
+
+static bool valid_brbe_format(int brbe_format)
+{
+	return brbe_format == BRBIDR0_EL1_FORMAT_0;
+}
+
+static bool valid_brbe_version(int brbe_version)
+{
+	return brbe_version == ID_AA64DFR0_EL1_BRBE_IMP ||
+	       brbe_version == ID_AA64DFR0_EL1_BRBE_BRBE_V1P1;
+}
+
+static void select_brbe_bank(int bank)
+{
+	u64 brbfcr;
+
+	WARN_ON(bank > BRBE_BANK_IDX_1);
+	brbfcr = read_sysreg_s(SYS_BRBFCR_EL1);
+	brbfcr &= ~BRBFCR_EL1_BANK_MASK;
+	brbfcr |= SYS_FIELD_PREP(BRBFCR_EL1, BANK, bank);
+	write_sysreg_s(brbfcr, SYS_BRBFCR_EL1);
+	isb();
+}
+
+/*
+ * Generic perf branch filters supported on BRBE
+ *
+ * New branch filters need to be evaluated for whether they could be
+ * supported on BRBE. This ensures that such branch filters are not just
+ * accepted, only to fail silently. PERF_SAMPLE_BRANCH_HV is a special
+ * case that is selectively supported only on platforms where the kernel
+ * is in hyp mode.
+ */
+#define BRBE_EXCLUDE_BRANCH_FILTERS (PERF_SAMPLE_BRANCH_ABORT_TX	| \
+				     PERF_SAMPLE_BRANCH_IN_TX		| \
+				     PERF_SAMPLE_BRANCH_NO_TX		| \
+				     PERF_SAMPLE_BRANCH_CALL_STACK)
+
+#define BRBE_ALLOWED_BRANCH_FILTERS (PERF_SAMPLE_BRANCH_USER		| \
+				     PERF_SAMPLE_BRANCH_KERNEL		| \
+				     PERF_SAMPLE_BRANCH_HV		| \
+				     PERF_SAMPLE_BRANCH_ANY		| \
+				     PERF_SAMPLE_BRANCH_ANY_CALL	| \
+				     PERF_SAMPLE_BRANCH_ANY_RETURN	| \
+				     PERF_SAMPLE_BRANCH_IND_CALL	| \
+				     PERF_SAMPLE_BRANCH_COND		| \
+				     PERF_SAMPLE_BRANCH_IND_JUMP	| \
+				     PERF_SAMPLE_BRANCH_CALL		| \
+				     PERF_SAMPLE_BRANCH_NO_FLAGS	| \
+				     PERF_SAMPLE_BRANCH_NO_CYCLES	| \
+				     PERF_SAMPLE_BRANCH_TYPE_SAVE	| \
+				     PERF_SAMPLE_BRANCH_HW_INDEX	| \
+				     PERF_SAMPLE_BRANCH_PRIV_SAVE)
+
+#define BRBE_PERF_BRANCH_FILTERS    (BRBE_ALLOWED_BRANCH_FILTERS	| \
+				     BRBE_EXCLUDE_BRANCH_FILTERS)
+
+bool armv8pmu_branch_valid(struct perf_event *event)
+{
+	u64 branch_type = event->attr.branch_sample_type;
+
+	/*
+	 * Ensure both perf branch filter allowed and exclude
+	 * masks are always in sync with the generic perf ABI.
+	 */
+	BUILD_BUG_ON(BRBE_PERF_BRANCH_FILTERS != (PERF_SAMPLE_BRANCH_MAX - 1));
+
+	if (branch_type & ~BRBE_ALLOWED_BRANCH_FILTERS) {
+		pr_debug_once("requested branch filter not supported 0x%llx\n", branch_type);
+		return false;
+	}
+
+	/*
+	 * If the event does not have at least one of the privilege
+	 * branch filters as in PERF_SAMPLE_BRANCH_PLM_ALL, the core
+	 * perf will adjust its value based on perf event's existing
+	 * privilege level via attr.exclude_[user|kernel|hv].
+	 *
+	 * As event->attr.branch_sample_type might have been changed
+	 * when the event reaches here, it is not possible to figure
+	 * out whether the event originally had HV privilege request
+	 * or got added via the core perf. Just report this situation
+	 * once and continue ignoring if there are other instances.
+	 */
+	if ((branch_type & PERF_SAMPLE_BRANCH_HV) && !is_kernel_in_hyp_mode())
+		pr_debug_once("hypervisor privilege filter not supported 0x%llx\n", branch_type);
+
+	return true;
+}
+
+int armv8pmu_private_alloc(struct arm_pmu *arm_pmu)
+{
+	struct brbe_hw_attr *brbe_attr = kzalloc(sizeof(struct brbe_hw_attr), GFP_KERNEL);
+
+	if (!brbe_attr)
+		return -ENOMEM;
+
+	arm_pmu->private = brbe_attr;
+	return 0;
+}
+
+void armv8pmu_private_free(struct arm_pmu *arm_pmu)
+{
+	kfree(arm_pmu->private);
+}
+
+static int brbe_attributes_probe(struct arm_pmu *armpmu, u32 brbe)
+{
+	struct brbe_hw_attr *brbe_attr = (struct brbe_hw_attr *)armpmu->private;
+	u64 brbidr = read_sysreg_s(SYS_BRBIDR0_EL1);
+
+	brbe_attr->brbe_version = brbe;
+	brbe_attr->brbe_format = brbe_get_format(brbidr);
+	brbe_attr->brbe_cc = brbe_get_cc_bits(brbidr);
+	brbe_attr->brbe_nr = brbe_get_numrec(brbidr);
+
+	if (!valid_brbe_version(brbe_attr->brbe_version) ||
+	   !valid_brbe_format(brbe_attr->brbe_format) ||
+	   !valid_brbe_cc(brbe_attr->brbe_cc) ||
+	   !valid_brbe_nr(brbe_attr->brbe_nr))
+		return -EOPNOTSUPP;
+
+	return 0;
+}
+
+void armv8pmu_branch_probe(struct arm_pmu *armpmu)
+{
+	u64 aa64dfr0 = read_sysreg_s(SYS_ID_AA64DFR0_EL1);
+	u32 brbe;
+
+	brbe = cpuid_feature_extract_unsigned_field(aa64dfr0, ID_AA64DFR0_EL1_BRBE_SHIFT);
+	if (!brbe)
+		return;
+
+	if (brbe_attributes_probe(armpmu, brbe))
+		return;
+
+	armpmu->has_branch_stack = 1;
+}
+
+static u64 branch_type_to_brbfcr(int branch_type)
+{
+	u64 brbfcr = 0;
+
+	if (branch_type & PERF_SAMPLE_BRANCH_ANY) {
+		brbfcr |= BRBFCR_EL1_BRANCH_FILTERS;
+		return brbfcr;
+	}
+
+	if (branch_type & PERF_SAMPLE_BRANCH_ANY_CALL) {
+		brbfcr |= BRBFCR_EL1_INDCALL;
+		brbfcr |= BRBFCR_EL1_DIRCALL;
+	}
+
+	if (branch_type & PERF_SAMPLE_BRANCH_ANY_RETURN)
+		brbfcr |= BRBFCR_EL1_RTN;
+
+	if (branch_type & PERF_SAMPLE_BRANCH_IND_CALL)
+		brbfcr |= BRBFCR_EL1_INDCALL;
+
+	if (branch_type & PERF_SAMPLE_BRANCH_COND)
+		brbfcr |= BRBFCR_EL1_CONDDIR;
+
+	if (branch_type & PERF_SAMPLE_BRANCH_IND_JUMP)
+		brbfcr |= BRBFCR_EL1_INDIRECT;
+
+	if (branch_type & PERF_SAMPLE_BRANCH_CALL)
+		brbfcr |= BRBFCR_EL1_DIRCALL;
+
+	return brbfcr;
+}
+
+static u64 branch_type_to_brbcr(int branch_type)
+{
+	u64 brbcr = BRBCR_EL1_DEFAULT_TS;
+
+	/*
+	 * BRBE need not be paused on a PMU interrupt while tracing only
+	 * the user space, because it will automatically be inside the
+	 * prohibited region. But even after a PMU overflow occurs, the
+	 * interrupt could still take many more cycles before it gets
+	 * taken, and by that time BRBE will have been overwritten.
+	 * Let's enable the pause on PMU interrupt mechanism even for
+	 * user only traces.
+	 */
+	brbcr |= BRBCR_EL1_FZP;
+
+	/*
+	 * When running in the hyp mode, writing into BRBCR_EL1
+	 * actually writes into BRBCR_EL2 instead. Field E2BRE
+	 * is also at the same position as E1BRE.
+	 */
+	if (branch_type & PERF_SAMPLE_BRANCH_USER)
+		brbcr |= BRBCR_EL1_E0BRE;
+
+	if (branch_type & PERF_SAMPLE_BRANCH_KERNEL)
+		brbcr |= BRBCR_EL1_E1BRE;
+
+	if (branch_type & PERF_SAMPLE_BRANCH_HV) {
+		if (is_kernel_in_hyp_mode())
+			brbcr |= BRBCR_EL1_E1BRE;
+	}
+
+	if (!(branch_type & PERF_SAMPLE_BRANCH_NO_CYCLES))
+		brbcr |= BRBCR_EL1_CC;
+
+	if (!(branch_type & PERF_SAMPLE_BRANCH_NO_FLAGS))
+		brbcr |= BRBCR_EL1_MPRED;
+
+	/*
+	 * The exception and exception return branches could be
+	 * captured, irrespective of the perf event's privilege.
+	 * If the perf event does not have enough privilege for
+	 * a given exception level, then addresses which fall
+	 * under that exception level will be reported as zero
+	 * for the captured branch record, creating source only
+	 * or target only records.
+	 */
+	if (branch_type & PERF_SAMPLE_BRANCH_ANY) {
+		brbcr |= BRBCR_EL1_EXCEPTION;
+		brbcr |= BRBCR_EL1_ERTN;
+	}
+
+	if (branch_type & PERF_SAMPLE_BRANCH_ANY_CALL)
+		brbcr |= BRBCR_EL1_EXCEPTION;
+
+	if (branch_type & PERF_SAMPLE_BRANCH_ANY_RETURN)
+		brbcr |= BRBCR_EL1_ERTN;
+
+	return brbcr & BRBCR_EL1_DEFAULT_CONFIG;
+}
+
+void armv8pmu_branch_enable(struct perf_event *event)
+{
+	u64 branch_type = event->attr.branch_sample_type;
+	u64 brbfcr, brbcr;
+
+	brbfcr = read_sysreg_s(SYS_BRBFCR_EL1);
+	brbfcr &= ~BRBFCR_EL1_DEFAULT_CONFIG;
+	brbfcr |= branch_type_to_brbfcr(branch_type);
+	write_sysreg_s(brbfcr, SYS_BRBFCR_EL1);
+	isb();
+
+	brbcr = read_sysreg_s(SYS_BRBCR_EL1);
+	brbcr &= ~BRBCR_EL1_DEFAULT_CONFIG;
+	brbcr |= branch_type_to_brbcr(branch_type);
+	write_sysreg_s(brbcr, SYS_BRBCR_EL1);
+	isb();
+	armv8pmu_branch_reset();
+}
+
+void armv8pmu_branch_disable(struct perf_event *event)
+{
+	u64 brbfcr = read_sysreg_s(SYS_BRBFCR_EL1);
+	u64 brbcr = read_sysreg_s(SYS_BRBCR_EL1);
+
+	brbcr &= ~(BRBCR_EL1_E0BRE | BRBCR_EL1_E1BRE);
+	brbfcr |= BRBFCR_EL1_PAUSED;
+	write_sysreg_s(brbcr, SYS_BRBCR_EL1);
+	write_sysreg_s(brbfcr, SYS_BRBFCR_EL1);
+	isb();
+}
+
+static void brbe_set_perf_entry_type(struct perf_branch_entry *entry, u64 brbinf)
+{
+	int brbe_type = brbe_get_type(brbinf);
+
+	switch (brbe_type) {
+	case BRBINFx_EL1_TYPE_UNCOND_DIR:
+		entry->type = PERF_BR_UNCOND;
+		break;
+	case BRBINFx_EL1_TYPE_INDIR:
+		entry->type = PERF_BR_IND;
+		break;
+	case BRBINFx_EL1_TYPE_DIR_LINK:
+		entry->type = PERF_BR_CALL;
+		break;
+	case BRBINFx_EL1_TYPE_INDIR_LINK:
+		entry->type = PERF_BR_IND_CALL;
+		break;
+	case BRBINFx_EL1_TYPE_RET_SUB:
+		entry->type = PERF_BR_RET;
+		break;
+	case BRBINFx_EL1_TYPE_COND_DIR:
+		entry->type = PERF_BR_COND;
+		break;
+	case BRBINFx_EL1_TYPE_CALL:
+		entry->type = PERF_BR_CALL;
+		break;
+	case BRBINFx_EL1_TYPE_TRAP:
+		entry->type = PERF_BR_SYSCALL;
+		break;
+	case BRBINFx_EL1_TYPE_RET_EXCPT:
+		entry->type = PERF_BR_ERET;
+		break;
+	case BRBINFx_EL1_TYPE_IRQ:
+		entry->type = PERF_BR_IRQ;
+		break;
+	case BRBINFx_EL1_TYPE_DEBUG_HALT:
+		entry->type = PERF_BR_EXTEND_ABI;
+		entry->new_type = PERF_BR_ARM64_DEBUG_HALT;
+		break;
+	case BRBINFx_EL1_TYPE_SERROR:
+		entry->type = PERF_BR_SERROR;
+		break;
+	case BRBINFx_EL1_TYPE_INST_DEBUG:
+		entry->type = PERF_BR_EXTEND_ABI;
+		entry->new_type = PERF_BR_ARM64_DEBUG_INST;
+		break;
+	case BRBINFx_EL1_TYPE_DATA_DEBUG:
+		entry->type = PERF_BR_EXTEND_ABI;
+		entry->new_type = PERF_BR_ARM64_DEBUG_DATA;
+		break;
+	case BRBINFx_EL1_TYPE_ALGN_FAULT:
+		entry->type = PERF_BR_EXTEND_ABI;
+		entry->new_type = PERF_BR_NEW_FAULT_ALGN;
+		break;
+	case BRBINFx_EL1_TYPE_INST_FAULT:
+		entry->type = PERF_BR_EXTEND_ABI;
+		entry->new_type = PERF_BR_NEW_FAULT_INST;
+		break;
+	case BRBINFx_EL1_TYPE_DATA_FAULT:
+		entry->type = PERF_BR_EXTEND_ABI;
+		entry->new_type = PERF_BR_NEW_FAULT_DATA;
+		break;
+	case BRBINFx_EL1_TYPE_FIQ:
+		entry->type = PERF_BR_EXTEND_ABI;
+		entry->new_type = PERF_BR_ARM64_FIQ;
+		break;
+	case BRBINFx_EL1_TYPE_DEBUG_EXIT:
+		entry->type = PERF_BR_EXTEND_ABI;
+		entry->new_type = PERF_BR_ARM64_DEBUG_EXIT;
+		break;
+	default:
+		pr_warn_once("%d - unknown branch type captured\n", brbe_type);
+		entry->type = PERF_BR_UNKNOWN;
+		break;
+	}
+}
+
+static int brbe_get_perf_priv(u64 brbinf)
+{
+	int brbe_el = brbe_get_el(brbinf);
+
+	switch (brbe_el) {
+	case BRBINFx_EL1_EL_EL0:
+		return PERF_BR_PRIV_USER;
+	case BRBINFx_EL1_EL_EL1:
+		return PERF_BR_PRIV_KERNEL;
+	case BRBINFx_EL1_EL_EL2:
+		if (is_kernel_in_hyp_mode())
+			return PERF_BR_PRIV_KERNEL;
+		return PERF_BR_PRIV_HV;
+	default:
+		pr_warn_once("%d - unknown branch privilege captured\n", brbe_el);
+		return PERF_BR_PRIV_UNKNOWN;
+	}
+}
+
+static void capture_brbe_flags(struct perf_branch_entry *entry, struct perf_event *event,
+			       u64 brbinf)
+{
+	if (branch_sample_type(event))
+		brbe_set_perf_entry_type(entry, brbinf);
+
+	if (!branch_sample_no_cycles(event))
+		entry->cycles = brbe_get_cycles(brbinf);
+
+	if (!branch_sample_no_flags(event)) {
+		/*
+		 * BRBINFx_EL1.LASTFAILED indicates that a TME transaction failed (or
+		 * was cancelled) prior to this record, and some number of records
+		 * prior to this one, may have been generated during an attempt to
+		 * execute the transaction.
+		 *
+		 * We will remove such entries later in process_branch_aborts().
+		 */
+		entry->abort = brbe_get_lastfailed(brbinf);
+
+		/*
+		 * All this information (i.e. transaction state and mispredicts)
+		 * is available only for source only and complete branch records.
+		 */
+		if (brbe_record_is_complete(brbinf) ||
+		    brbe_record_is_source_only(brbinf)) {
+			entry->mispred = brbe_get_mispredict(brbinf);
+			entry->predicted = !entry->mispred;
+			entry->in_tx = brbe_get_in_tx(brbinf);
+		}
+	}
+
+	if (branch_sample_priv(event)) {
+		/*
+		 * All this information (i.e. branch privilege level) is
+		 * available only for target only and complete branch records.
+		 */
+		if (brbe_record_is_complete(brbinf) ||
+		    brbe_record_is_target_only(brbinf))
+			entry->priv = brbe_get_perf_priv(brbinf);
+	}
+}
+
+/*
+ * A branch record with BRBINFx_EL1.LASTFAILED set implies that all
+ * preceding consecutive branch records that were in a transaction
+ * (i.e. their BRBINFx_EL1.TX set) have been aborted.
+ *
+ * Similarly BRBFCR_EL1.LASTFAILED set indicates that all preceding
+ * consecutive branch records up to the last record, which were in a
+ * transaction (i.e. their BRBINFx_EL1.TX set), have been aborted.
+ *
+ * --------------------------------- -------------------
+ * | 00 | BRBSRC | BRBTGT | BRBINF | | TX = 1 | LF = 0 | [TX success]
+ * --------------------------------- -------------------
+ * | 01 | BRBSRC | BRBTGT | BRBINF | | TX = 1 | LF = 0 | [TX success]
+ * --------------------------------- -------------------
+ * | 02 | BRBSRC | BRBTGT | BRBINF | | TX = 0 | LF = 0 |
+ * --------------------------------- -------------------
+ * | 03 | BRBSRC | BRBTGT | BRBINF | | TX = 1 | LF = 0 | [TX failed]
+ * --------------------------------- -------------------
+ * | 04 | BRBSRC | BRBTGT | BRBINF | | TX = 1 | LF = 0 | [TX failed]
+ * --------------------------------- -------------------
+ * | 05 | BRBSRC | BRBTGT | BRBINF | | TX = 0 | LF = 1 |
+ * --------------------------------- -------------------
+ * | .. | BRBSRC | BRBTGT | BRBINF | | TX = 0 | LF = 0 |
+ * --------------------------------- -------------------
+ * | 61 | BRBSRC | BRBTGT | BRBINF | | TX = 1 | LF = 0 | [TX failed]
+ * --------------------------------- -------------------
+ * | 62 | BRBSRC | BRBTGT | BRBINF | | TX = 1 | LF = 0 | [TX failed]
+ * --------------------------------- -------------------
+ * | 63 | BRBSRC | BRBTGT | BRBINF | | TX = 1 | LF = 0 | [TX failed]
+ * --------------------------------- -------------------
+ *
+ * BRBFCR_EL1.LASTFAILED == 1
+ *
+ * BRBFCR_EL1.LASTFAILED marks as failed all those consecutive, in
+ * transaction branch records near the end of the BRBE buffer.
+ *
+ * The architecture does not guarantee a non transaction (TX = 0)
+ * branch record between two different transactions. So it is possible
+ * that a subsequent lastfailed record (TX = 0, LF = 1) might
+ * erroneously mark more transactions as aborted than required.
+ */
+static void process_branch_aborts(struct pmu_hw_events *cpuc)
+{
+	struct brbe_hw_attr *brbe_attr = (struct brbe_hw_attr *)cpuc->percpu_pmu->private;
+	u64 brbfcr = read_sysreg_s(SYS_BRBFCR_EL1);
+	bool lastfailed = !!(brbfcr & BRBFCR_EL1_LASTFAILED);
+	int idx = brbe_attr->brbe_nr - 1;
+	struct perf_branch_entry *entry;
+
+	do {
+		entry = &cpuc->branches->branch_entries[idx];
+		if (entry->in_tx) {
+			entry->abort = lastfailed;
+		} else {
+			lastfailed = entry->abort;
+			entry->abort = false;
+		}
+	} while (idx--, idx >= 0);
+}
+
+void armv8pmu_branch_reset(void)
+{
+	asm volatile(BRB_IALL);
+	isb();
+}
+
+static bool capture_branch_entry(struct pmu_hw_events *cpuc,
+				 struct perf_event *event, int idx)
+{
+	struct perf_branch_entry *entry = &cpuc->branches->branch_entries[idx];
+	u64 brbinf = get_brbinf_reg(idx);
+
+	/*
+	 * There are no valid entries left in the buffer.
+	 * Abort branch record processing to save some
+	 * cycles and also reduce the capture/process load
+	 * for user space as well.
+	 */
+	if (brbe_invalid(brbinf))
+		return false;
+
+	perf_clear_branch_entry_bitfields(entry);
+	if (brbe_record_is_complete(brbinf)) {
+		entry->from = get_brbsrc_reg(idx);
+		entry->to = get_brbtgt_reg(idx);
+	} else if (brbe_record_is_source_only(brbinf)) {
+		entry->from = get_brbsrc_reg(idx);
+		entry->to = 0;
+	} else if (brbe_record_is_target_only(brbinf)) {
+		entry->from = 0;
+		entry->to = get_brbtgt_reg(idx);
+	}
+	capture_brbe_flags(entry, event, brbinf);
+	return true;
+}
+
+void armv8pmu_branch_read(struct pmu_hw_events *cpuc, struct perf_event *event)
+{
+	struct brbe_hw_attr *brbe_attr = (struct brbe_hw_attr *)cpuc->percpu_pmu->private;
+	u64 brbfcr, brbcr;
+	int idx, loop1_idx1, loop1_idx2, loop2_idx1, loop2_idx2, count;
+
+	brbcr = read_sysreg_s(SYS_BRBCR_EL1);
+	brbfcr = read_sysreg_s(SYS_BRBFCR_EL1);
+
+	/* Ensure pause on PMU interrupt is enabled */
+	WARN_ON_ONCE(!(brbcr & BRBCR_EL1_FZP));
+
+	/* Pause the buffer */
+	write_sysreg_s(brbfcr | BRBFCR_EL1_PAUSED, SYS_BRBFCR_EL1);
+	isb();
+
+	/* Determine the indices for each loop */
+	loop1_idx1 = BRBE_BANK0_IDX_MIN;
+	if (brbe_attr->brbe_nr <= BRBE_BANK_MAX_ENTRIES) {
+		loop1_idx2 = brbe_attr->brbe_nr - 1;
+		loop2_idx1 = BRBE_BANK1_IDX_MIN;
+		loop2_idx2 = BRBE_BANK0_IDX_MAX;
+	} else {
+		loop1_idx2 = BRBE_BANK0_IDX_MAX;
+		loop2_idx1 = BRBE_BANK1_IDX_MIN;
+		loop2_idx2 = brbe_attr->brbe_nr - 1;
+	}
+
+	/* Loop through bank 0 */
+	select_brbe_bank(BRBE_BANK_IDX_0);
+	for (idx = 0, count = loop1_idx1; count <= loop1_idx2; idx++, count++) {
+		if (!capture_branch_entry(cpuc, event, idx))
+			goto skip_bank_1;
+	}
+
+	/* Loop through bank 1 */
+	select_brbe_bank(BRBE_BANK_IDX_1);
+	for (count = loop2_idx1; count <= loop2_idx2; idx++, count++) {
+		if (!capture_branch_entry(cpuc, event, idx))
+			break;
+	}
+
+skip_bank_1:
+	cpuc->branches->branch_stack.nr = idx;
+	cpuc->branches->branch_stack.hw_idx = -1ULL;
+	process_branch_aborts(cpuc);
+
+	/* Unpause the buffer */
+	write_sysreg_s(brbfcr & ~BRBFCR_EL1_PAUSED, SYS_BRBFCR_EL1);
+	isb();
+	armv8pmu_branch_reset();
+}
diff --git a/drivers/perf/arm_brbe.h b/drivers/perf/arm_brbe.h
new file mode 100644
index 000000000000..a47480eec070
--- /dev/null
+++ b/drivers/perf/arm_brbe.h
@@ -0,0 +1,257 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Branch Record Buffer Extension Helpers.
+ *
+ * Copyright (C) 2022 ARM Limited
+ *
+ * Author: Anshuman Khandual <anshuman.khandual@arm.com>
+ */
+#define pr_fmt(fmt) "brbe: " fmt
+
+#include <linux/perf/arm_pmu.h>
+
+#define BRBFCR_EL1_BRANCH_FILTERS (BRBFCR_EL1_DIRECT   | \
+				   BRBFCR_EL1_INDIRECT | \
+				   BRBFCR_EL1_RTN      | \
+				   BRBFCR_EL1_INDCALL  | \
+				   BRBFCR_EL1_DIRCALL  | \
+				   BRBFCR_EL1_CONDDIR)
+
+#define BRBFCR_EL1_DEFAULT_CONFIG (BRBFCR_EL1_BANK_MASK | \
+				   BRBFCR_EL1_PAUSED    | \
+				   BRBFCR_EL1_EnI       | \
+				   BRBFCR_EL1_BRANCH_FILTERS)
+
+/*
+ * BRBTS_EL1 is currently not used for branch stack implementation
+ * purposes, but BRBCR_EL1.TS needs to have a valid value from all
+ * available options. BRBCR_EL1_TS_VIRTUAL is selected for this.
+ */
+#define BRBCR_EL1_DEFAULT_TS      FIELD_PREP(BRBCR_EL1_TS_MASK, BRBCR_EL1_TS_VIRTUAL)
+
+#define BRBCR_EL1_DEFAULT_CONFIG  (BRBCR_EL1_EXCEPTION | \
+				   BRBCR_EL1_ERTN      | \
+				   BRBCR_EL1_CC        | \
+				   BRBCR_EL1_MPRED     | \
+				   BRBCR_EL1_E1BRE     | \
+				   BRBCR_EL1_E0BRE     | \
+				   BRBCR_EL1_FZP       | \
+				   BRBCR_EL1_DEFAULT_TS)
+/*
+ * BRBE Instructions
+ *
+ * BRB_IALL : Invalidate the entire buffer
+ * BRB_INJ  : Inject latest branch record derived from [BRBSRCINJ, BRBTGTINJ, BRBINFINJ]
+ */
+#define BRB_IALL __emit_inst(0xD5000000 | sys_insn(1, 1, 7, 2, 4) | (0x1f))
+#define BRB_INJ  __emit_inst(0xD5000000 | sys_insn(1, 1, 7, 2, 5) | (0x1f))
+
+/*
+ * BRBE Buffer Organization
+ *
+ * The BRBE buffer is arranged as multiple banks of 32 branch record
+ * entries each. An individual branch record in a given bank can be
+ * accessed by selecting the bank in BRBFCR_EL1.BANK and then reading
+ * the register set i.e [BRBSRC, BRBTGT, BRBINF] with indices [0..31].
+ *
+ * Bank 0
+ *
+ *	---------------------------------	------
+ *	| 00 | BRBSRC | BRBTGT | BRBINF |	| 00 |
+ *	---------------------------------	------
+ *	| 01 | BRBSRC | BRBTGT | BRBINF |	| 01 |
+ *	---------------------------------	------
+ *	| .. | BRBSRC | BRBTGT | BRBINF |	| .. |
+ *	---------------------------------	------
+ *	| 31 | BRBSRC | BRBTGT | BRBINF |	| 31 |
+ *	---------------------------------	------
+ *
+ * Bank 1
+ *
+ *	---------------------------------	------
+ *	| 32 | BRBSRC | BRBTGT | BRBINF |	| 00 |
+ *	---------------------------------	------
+ *	| 33 | BRBSRC | BRBTGT | BRBINF |	| 01 |
+ *	---------------------------------	------
+ *	| .. | BRBSRC | BRBTGT | BRBINF |	| .. |
+ *	---------------------------------	------
+ *	| 63 | BRBSRC | BRBTGT | BRBINF |	| 31 |
+ *	---------------------------------	------
+ */
+#define BRBE_BANK_MAX_ENTRIES 32
+
+#define BRBE_BANK0_IDX_MIN 0
+#define BRBE_BANK0_IDX_MAX 31
+#define BRBE_BANK1_IDX_MIN 32
+#define BRBE_BANK1_IDX_MAX 63
+
+struct brbe_hw_attr {
+	int	brbe_version;
+	int	brbe_cc;
+	int	brbe_nr;
+	int	brbe_format;
+};
+
+enum brbe_bank_idx {
+	BRBE_BANK_IDX_INVALID = -1,
+	BRBE_BANK_IDX_0,
+	BRBE_BANK_IDX_1,
+	BRBE_BANK_IDX_MAX
+};
+
+#define RETURN_READ_BRBSRCN(n) \
+	read_sysreg_s(SYS_BRBSRC##n##_EL1)
+
+#define RETURN_READ_BRBTGTN(n) \
+	read_sysreg_s(SYS_BRBTGT##n##_EL1)
+
+#define RETURN_READ_BRBINFN(n) \
+	read_sysreg_s(SYS_BRBINF##n##_EL1)
+
+#define BRBE_REGN_CASE(n, case_macro) \
+	case n: return case_macro(n); break
+
+#define BRBE_REGN_SWITCH(x, case_macro)				\
+	do {							\
+		switch (x) {					\
+		BRBE_REGN_CASE(0, case_macro);			\
+		BRBE_REGN_CASE(1, case_macro);			\
+		BRBE_REGN_CASE(2, case_macro);			\
+		BRBE_REGN_CASE(3, case_macro);			\
+		BRBE_REGN_CASE(4, case_macro);			\
+		BRBE_REGN_CASE(5, case_macro);			\
+		BRBE_REGN_CASE(6, case_macro);			\
+		BRBE_REGN_CASE(7, case_macro);			\
+		BRBE_REGN_CASE(8, case_macro);			\
+		BRBE_REGN_CASE(9, case_macro);			\
+		BRBE_REGN_CASE(10, case_macro);			\
+		BRBE_REGN_CASE(11, case_macro);			\
+		BRBE_REGN_CASE(12, case_macro);			\
+		BRBE_REGN_CASE(13, case_macro);			\
+		BRBE_REGN_CASE(14, case_macro);			\
+		BRBE_REGN_CASE(15, case_macro);			\
+		BRBE_REGN_CASE(16, case_macro);			\
+		BRBE_REGN_CASE(17, case_macro);			\
+		BRBE_REGN_CASE(18, case_macro);			\
+		BRBE_REGN_CASE(19, case_macro);			\
+		BRBE_REGN_CASE(20, case_macro);			\
+		BRBE_REGN_CASE(21, case_macro);			\
+		BRBE_REGN_CASE(22, case_macro);			\
+		BRBE_REGN_CASE(23, case_macro);			\
+		BRBE_REGN_CASE(24, case_macro);			\
+		BRBE_REGN_CASE(25, case_macro);			\
+		BRBE_REGN_CASE(26, case_macro);			\
+		BRBE_REGN_CASE(27, case_macro);			\
+		BRBE_REGN_CASE(28, case_macro);			\
+		BRBE_REGN_CASE(29, case_macro);			\
+		BRBE_REGN_CASE(30, case_macro);			\
+		BRBE_REGN_CASE(31, case_macro);			\
+		default:					\
+			pr_warn("unknown register index\n");	\
+			return -1;				\
+		}						\
+	} while (0)
+
+static inline int buffer_to_brbe_idx(int buffer_idx)
+{
+	return buffer_idx % BRBE_BANK_MAX_ENTRIES;
+}
+
+static inline u64 get_brbsrc_reg(int buffer_idx)
+{
+	int brbe_idx = buffer_to_brbe_idx(buffer_idx);
+
+	BRBE_REGN_SWITCH(brbe_idx, RETURN_READ_BRBSRCN);
+}
+
+static inline u64 get_brbtgt_reg(int buffer_idx)
+{
+	int brbe_idx = buffer_to_brbe_idx(buffer_idx);
+
+	BRBE_REGN_SWITCH(brbe_idx, RETURN_READ_BRBTGTN);
+}
+
+static inline u64 get_brbinf_reg(int buffer_idx)
+{
+	int brbe_idx = buffer_to_brbe_idx(buffer_idx);
+
+	BRBE_REGN_SWITCH(brbe_idx, RETURN_READ_BRBINFN);
+}
+
+static inline u64 brbe_record_valid(u64 brbinf)
+{
+	return FIELD_GET(BRBINFx_EL1_VALID_MASK, brbinf);
+}
+
+static inline bool brbe_invalid(u64 brbinf)
+{
+	return brbe_record_valid(brbinf) == BRBINFx_EL1_VALID_NONE;
+}
+
+static inline bool brbe_record_is_complete(u64 brbinf)
+{
+	return brbe_record_valid(brbinf) == BRBINFx_EL1_VALID_FULL;
+}
+
+static inline bool brbe_record_is_source_only(u64 brbinf)
+{
+	return brbe_record_valid(brbinf) == BRBINFx_EL1_VALID_SOURCE;
+}
+
+static inline bool brbe_record_is_target_only(u64 brbinf)
+{
+	return brbe_record_valid(brbinf) == BRBINFx_EL1_VALID_TARGET;
+}
+
+static inline int brbe_get_in_tx(u64 brbinf)
+{
+	return FIELD_GET(BRBINFx_EL1_T_MASK, brbinf);
+}
+
+static inline int brbe_get_mispredict(u64 brbinf)
+{
+	return FIELD_GET(BRBINFx_EL1_MPRED_MASK, brbinf);
+}
+
+static inline int brbe_get_lastfailed(u64 brbinf)
+{
+	return FIELD_GET(BRBINFx_EL1_LASTFAILED_MASK, brbinf);
+}
+
+static inline int brbe_get_cycles(u64 brbinf)
+{
+	/*
+	 * Captured cycle count is unknown and hence
+	 * should not be passed on to the user space.
+	 */
+	if (brbinf & BRBINFx_EL1_CCU)
+		return 0;
+
+	return FIELD_GET(BRBINFx_EL1_CC_MASK, brbinf);
+}
+
+static inline int brbe_get_type(u64 brbinf)
+{
+	return FIELD_GET(BRBINFx_EL1_TYPE_MASK, brbinf);
+}
+
+static inline int brbe_get_el(u64 brbinf)
+{
+	return FIELD_GET(BRBINFx_EL1_EL_MASK, brbinf);
+}
+
+static inline int brbe_get_numrec(u64 brbidr)
+{
+	return FIELD_GET(BRBIDR0_EL1_NUMREC_MASK, brbidr);
+}
+
+static inline int brbe_get_format(u64 brbidr)
+{
+	return FIELD_GET(BRBIDR0_EL1_FORMAT_MASK, brbidr);
+}
+
+static inline int brbe_get_cc_bits(u64 brbidr)
+{
+	return FIELD_GET(BRBIDR0_EL1_CC_MASK, brbidr);
+}
diff --git a/drivers/perf/arm_pmuv3.c b/drivers/perf/arm_pmuv3.c
index 86d803ff1ae3..fef1bc6067cc 100644
--- a/drivers/perf/arm_pmuv3.c
+++ b/drivers/perf/arm_pmuv3.c
@@ -797,6 +797,10 @@ static irqreturn_t armv8pmu_handle_irq(struct arm_pmu *cpu_pmu)
 		if (!armpmu_event_set_period(event))
 			continue;
 
+		/*
+		 * PMU IRQ should remain asserted until all branch records
+		 * are captured and processed into struct perf_sample_data.
+		 */
 		if (has_branch_stack(event) && !WARN_ON(!cpuc->branches)) {
 			armv8pmu_branch_read(cpuc, event);
 			perf_sample_save_brstack(&data, event, &cpuc->branches->branch_stack);
@@ -1142,14 +1146,25 @@ static void __armv8pmu_probe_pmu(void *info)
 
 static int branch_records_alloc(struct arm_pmu *armpmu)
 {
+	struct branch_records __percpu *tmp_alloc_ptr;
+	struct branch_records *records;
 	struct pmu_hw_events *events;
 	int cpu;
 
+	tmp_alloc_ptr = alloc_percpu_gfp(struct branch_records, GFP_KERNEL);
+	if (!tmp_alloc_ptr)
+		return -ENOMEM;
+
+	/*
+	 * FIXME: Memory allocated via tmp_alloc_ptr gets completely
+	 * consumed here and never needs to be freed up later. Hence
+	 * losing access to the on stack 'tmp_alloc_ptr' is acceptable.
+	 * Otherwise this alloc handle would have to be saved somewhere.
+	 */
 	for_each_possible_cpu(cpu) {
 		events = per_cpu_ptr(armpmu->hw_events, cpu);
-		events->branches = kzalloc(sizeof(struct branch_records), GFP_KERNEL);
-		if (!events->branches)
-			return -ENOMEM;
+		records = per_cpu_ptr(tmp_alloc_ptr, cpu);
+		events->branches = records;
 	}
 	return 0;
 }
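
For contrast, the alternative hinted at in the FIXME above would look
roughly like the following sketch, where the 'records_handle' field is
invented purely for illustration:

	/* hypothetical handle retained somewhere, e.g. in the arm_pmu */
	armpmu->records_handle = tmp_alloc_ptr;

	/* ...so that a matching teardown path could later do */
	free_percpu(armpmu->records_handle);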
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH V11 07/10] arm64/perf: Add PERF_ATTACH_TASK_DATA to events with has_branch_stack()
  2023-05-31  4:04 [PATCH V11 00/10] arm64/perf: Enable branch stack sampling Anshuman Khandual
                   ` (5 preceding siblings ...)
  2023-05-31  4:04 ` [PATCH V11 06/10] arm64/perf: Enable branch stack events via FEAT_BRBE Anshuman Khandual
@ 2023-05-31  4:04 ` Anshuman Khandual
  2023-05-31  4:04 ` [PATCH V11 08/10] arm64/perf: Add struct brbe_regset helper functions Anshuman Khandual
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 48+ messages in thread
From: Anshuman Khandual @ 2023-05-31  4:04 UTC (permalink / raw)
  To: linux-arm-kernel, linux-kernel, will, catalin.marinas, mark.rutland
  Cc: Anshuman Khandual, Mark Brown, James Clark, Rob Herring,
	Marc Zyngier, Suzuki Poulose, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, linux-perf-users

Short running processes, i.e those getting very little cpu run time each
time they get scheduled on, might not accumulate many branch records before
a PMU IRQ really happens. This increases the possibility of such processes
losing many of their branch records, while being scheduled in and out of
various cpus on the system.

All branch records that occurred during the cpu run time need to be saved
when the process gets scheduled out. This requires an event context specific
buffer for such storage.

This adds the PERF_ATTACH_TASK_DATA flag unconditionally for all branch stack
sampling events, which causes task_ctx_data to be allocated during event init.
This also creates a platform specific task_ctx_data kmem cache which will
serve such allocation requests.

This adds a new structure 'arm64_perf_task_context' which encapsulates the
brbe register set for the maximum possible BRBE entries on the HW, along with
an element tracking the number of valid records.
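
For context, the perf core side of this contract looks roughly like the
following (a simplified sketch of the task_ctx_data helpers in
kernel/events/core.c, reproduced from memory rather than from this series):

	static void *alloc_task_ctx_data(struct pmu *pmu)
	{
		/* served from the pmu's task_ctx_cache created below */
		if (pmu->task_ctx_cache)
			return kmem_cache_zalloc(pmu->task_ctx_cache, GFP_KERNEL);

		return NULL;
	}

With PERF_ATTACH_TASK_DATA set, the core invokes this during event init and
stores the result in the event's pmu_ctx->task_ctx_data.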

Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
Tested-by: James Clark <james.clark@arm.com>
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
---
 drivers/perf/arm_brbe.c  | 13 +++++++++++++
 drivers/perf/arm_brbe.h  | 13 +++++++++++++
 drivers/perf/arm_pmuv3.c |  8 ++++++--
 3 files changed, 32 insertions(+), 2 deletions(-)

diff --git a/drivers/perf/arm_brbe.c b/drivers/perf/arm_brbe.c
index 34547ad750ad..484842d8cf3e 100644
--- a/drivers/perf/arm_brbe.c
+++ b/drivers/perf/arm_brbe.c
@@ -109,20 +109,33 @@ bool armv8pmu_branch_valid(struct perf_event *event)
 	return true;
 }
 
+static inline struct kmem_cache *
+arm64_create_brbe_task_ctx_kmem_cache(size_t size)
+{
+	return kmem_cache_create("arm64_brbe_task_ctx", size, 0, 0, NULL);
+}
+
 int armv8pmu_private_alloc(struct arm_pmu *arm_pmu)
 {
 	struct brbe_hw_attr *brbe_attr = kzalloc(sizeof(struct brbe_hw_attr), GFP_KERNEL);
+	size_t size = sizeof(struct arm64_perf_task_context);
 
 	if (!brbe_attr)
 		return -ENOMEM;
 
 	arm_pmu->private = brbe_attr;
+	arm_pmu->pmu.task_ctx_cache = arm64_create_brbe_task_ctx_kmem_cache(size);
+	if (!arm_pmu->pmu.task_ctx_cache) {
+		kfree(arm_pmu->private);
+		return -ENOMEM;
+	}
 	return 0;
 }
 
 void armv8pmu_private_free(struct arm_pmu *arm_pmu)
 {
 	kfree(arm_pmu->private);
+	kmem_cache_destroy(arm_pmu->pmu.task_ctx_cache);
 }
 
 static int brbe_attributes_probe(struct arm_pmu *armpmu, u32 brbe)
diff --git a/drivers/perf/arm_brbe.h b/drivers/perf/arm_brbe.h
index a47480eec070..4a72c2ba7140 100644
--- a/drivers/perf/arm_brbe.h
+++ b/drivers/perf/arm_brbe.h
@@ -80,12 +80,25 @@
  *	---------------------------------	------
  */
 #define BRBE_BANK_MAX_ENTRIES 32
+#define BRBE_MAX_BANK 2
+#define BRBE_MAX_ENTRIES (BRBE_BANK_MAX_ENTRIES * BRBE_MAX_BANK)
 
 #define BRBE_BANK0_IDX_MIN 0
 #define BRBE_BANK0_IDX_MAX 31
 #define BRBE_BANK1_IDX_MIN 32
 #define BRBE_BANK1_IDX_MAX 63
 
+struct brbe_regset {
+	unsigned long brbsrc;
+	unsigned long brbtgt;
+	unsigned long brbinf;
+};
+
+struct arm64_perf_task_context {
+	struct brbe_regset store[BRBE_MAX_ENTRIES];
+	int nr_brbe_records;
+};
+
 struct brbe_hw_attr {
 	int	brbe_version;
 	int	brbe_cc;
diff --git a/drivers/perf/arm_pmuv3.c b/drivers/perf/arm_pmuv3.c
index fef1bc6067cc..29672ff20026 100644
--- a/drivers/perf/arm_pmuv3.c
+++ b/drivers/perf/arm_pmuv3.c
@@ -1022,8 +1022,12 @@ static int __armv8_pmuv3_map_event(struct perf_event *event,
 
 	hw_event_id = __armv8_pmuv3_map_event_id(armpmu, event);
 
-	if (has_branch_stack(event) && !armv8pmu_branch_valid(event))
-		return -EOPNOTSUPP;
+	if (has_branch_stack(event)) {
+		if (!armv8pmu_branch_valid(event))
+			return -EOPNOTSUPP;
+
+		event->attach_state |= PERF_ATTACH_TASK_DATA;
+	}
 
 	/*
 	 * CHAIN events only work when paired with an adjacent counter, and it
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH V11 08/10] arm64/perf: Add struct brbe_regset helper functions
  2023-05-31  4:04 [PATCH V11 00/10] arm64/perf: Enable branch stack sampling Anshuman Khandual
                   ` (6 preceding siblings ...)
  2023-05-31  4:04 ` [PATCH V11 07/10] arm64/perf: Add PERF_ATTACH_TASK_DATA to events with has_branch_stack() Anshuman Khandual
@ 2023-05-31  4:04 ` Anshuman Khandual
  2023-06-02  2:40   ` Namhyung Kim
  2023-06-13 17:17   ` Mark Rutland
  2023-05-31  4:04 ` [PATCH V11 09/10] arm64/perf: Implement branch records save on task sched out Anshuman Khandual
                   ` (2 subsequent siblings)
  10 siblings, 2 replies; 48+ messages in thread
From: Anshuman Khandual @ 2023-05-31  4:04 UTC (permalink / raw)
  To: linux-arm-kernel, linux-kernel, will, catalin.marinas, mark.rutland
  Cc: Anshuman Khandual, Mark Brown, James Clark, Rob Herring,
	Marc Zyngier, Suzuki Poulose, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, linux-perf-users

The primary abstraction level for fetching branch records from BRBE HW has
been changed to 'struct brbe_regset', which contains storage for all three
BRBE registers i.e BRBSRC, BRBTGT, BRBINF. Whether branch record processing
happens in the task sched out path, or in the PMU IRQ handling path, these
registers need to be extracted from the HW. Afterwards both live and stored
sets need to be stitched together to create the final branch record set.
This adds the required helper functions for such operations.
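
As a worked example of the stitching arithmetic implemented below (all the
values here are invented for illustration):

	/*
	 * nr_stored = 5, nr_live = 8, nr_max = 10
	 * nr_total  = 13, nr_excess = 3
	 * => only the two newest stored records (S0, S1) survive and the
	 *    returned stitched length is 10, matching the second diagram
	 *    in the code comment below.
	 */
	nr = stitch_stored_live_entries(stored, live, 5, 8, 10);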

Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
Tested-by: James Clark <james.clark@arm.com>
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
---
 drivers/perf/arm_brbe.c | 163 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 163 insertions(+)

diff --git a/drivers/perf/arm_brbe.c b/drivers/perf/arm_brbe.c
index 484842d8cf3e..759db681d673 100644
--- a/drivers/perf/arm_brbe.c
+++ b/drivers/perf/arm_brbe.c
@@ -44,6 +44,169 @@ static void select_brbe_bank(int bank)
 	isb();
 }
 
+/*
+ * This scans over the BRBE register banks and captures individual branch
+ * records [BRBSRC, BRBTGT, BRBINF] into a pre-allocated 'struct brbe_regset'
+ * buffer, until an invalid one gets encountered. The caller for this function
+ * needs to ensure BRBE is in an appropriate state before records are captured.
+ */
+static int capture_brbe_regset(struct brbe_hw_attr *brbe_attr, struct brbe_regset *buf)
+{
+	int loop1_idx1, loop1_idx2, loop2_idx1, loop2_idx2;
+	int idx, count;
+
+	loop1_idx1 = BRBE_BANK0_IDX_MIN;
+	if (brbe_attr->brbe_nr <= BRBE_BANK_MAX_ENTRIES) {
+		loop1_idx2 = brbe_attr->brbe_nr - 1;
+		loop2_idx1 = BRBE_BANK1_IDX_MIN;
+		loop2_idx2 = BRBE_BANK0_IDX_MAX;
+	} else {
+		loop1_idx2 = BRBE_BANK0_IDX_MAX;
+		loop2_idx1 = BRBE_BANK1_IDX_MIN;
+		loop2_idx2 = brbe_attr->brbe_nr - 1;
+	}
+
+	select_brbe_bank(BRBE_BANK_IDX_0);
+	for (idx = 0, count = loop1_idx1; count <= loop1_idx2; idx++, count++) {
+		buf[idx].brbinf = get_brbinf_reg(idx);
+		/*
+		 * There are no valid entries left in the buffer.
+		 * Abort branch record processing to save some
+		 * cycles and also reduce the capture/process load
+		 * for user space as well.
+		 */
+		if (brbe_invalid(buf[idx].brbinf))
+			return idx;
+
+		buf[idx].brbsrc = get_brbsrc_reg(idx);
+		buf[idx].brbtgt = get_brbtgt_reg(idx);
+	}
+
+	select_brbe_bank(BRBE_BANK_IDX_1);
+	for (count = loop2_idx1; count <= loop2_idx2; idx++, count++) {
+		buf[idx].brbinf = get_brbinf_reg(idx);
+		/*
+		 * There are no valid entries left in the buffer.
+		 * Abort branch record processing to save some
+		 * cycles and also reduce the capture/process load
+		 * for user space as well.
+		 */
+		if (brbe_invalid(buf[idx].brbinf))
+			return idx;
+
+		buf[idx].brbsrc = get_brbsrc_reg(idx);
+		buf[idx].brbtgt = get_brbtgt_reg(idx);
+	}
+	return idx;
+}
+
+static inline void copy_brbe_regset(struct brbe_regset *src, int src_idx,
+				    struct brbe_regset *dst, int dst_idx)
+{
+	dst[dst_idx].brbinf = src[src_idx].brbinf;
+	dst[dst_idx].brbsrc = src[src_idx].brbsrc;
+	dst[dst_idx].brbtgt = src[src_idx].brbtgt;
+}
+
+/*
+ * This function concatenates branch records from the stored and live
+ * buffers, up to a maximum of nr_max records, with the stored buffer
+ * holding the result. The concatenated buffer contains all the branch
+ * records from the live buffer, and might contain some from the stored
+ * buffer, provided the combined length does not exceed 'nr_max'.
+ *
+ *	Stored records	Live records
+ *	------------------------------------------------^
+ *	|	S0	|	L0	|	Newest	|
+ *	---------------------------------		|
+ *	|	S1	|	L1	|		|
+ *	---------------------------------		|
+ *	|	S2	|	L2	|		|
+ *	---------------------------------		|
+ *	|	S3	|	L3	|		|
+ *	---------------------------------		|
+ *	|	S4	|	L4	|		nr_max
+ *	---------------------------------		|
+ *	|		|	L5	|		|
+ *	---------------------------------		|
+ *	|		|	L6	|		|
+ *	---------------------------------		|
+ *	|		|	L7	|		|
+ *	---------------------------------		|
+ *	|		|		|		|
+ *	---------------------------------		|
+ *	|		|		|	Oldest	|
+ *	------------------------------------------------V
+ *
+ *
+ * S0 is the newest in the stored records, whereas L7 is the oldest in
+ * the live records. Unless the live buffer is detected as being full,
+ * thus potentially dropping off some older records, the L7 and S0
+ * records are contiguous in time for a user task context. The stitched
+ * buffer here represents the maximum possible branch records,
+ * contiguous in time.
+ *
+ *	Stored records  Live records
+ *	------------------------------------------------^
+ *	|	L0	|	L0	|	Newest	|
+ *	---------------------------------		|
+ *	|	L0	|	L1	|		|
+ *	---------------------------------		|
+ *	|	L2	|	L2	|		|
+ *	---------------------------------		|
+ *	|	L3	|	L3	|		|
+ *	---------------------------------		|
+ *	|	L4	|	L4	|	      nr_max
+ *	---------------------------------		|
+ *	|	L5	|	L5	|		|
+ *	---------------------------------		|
+ *	|	L6	|	L6	|		|
+ *	---------------------------------		|
+ *	|	L7	|	L7	|		|
+ *	---------------------------------		|
+ *	|	S0	|		|		|
+ *	---------------------------------		|
+ *	|	S1	|		|    Oldest	|
+ *	------------------------------------------------V
+ *	|	S2	| <----|
+ *	-----------------      |
+ *	|	S3	| <----| Dropped off after nr_max
+ *	-----------------      |
+ *	|	S4	| <----|
+ *	-----------------
+ */
+static int stitch_stored_live_entries(struct brbe_regset *stored,
+				      struct brbe_regset *live,
+				      int nr_stored, int nr_live,
+				      int nr_max)
+{
+	int nr_total, nr_excess, nr_last, i;
+
+	nr_total = nr_stored + nr_live;
+	nr_excess = nr_total - nr_max;
+
+	/* Stored branch records in stitched buffer */
+	if (nr_live == nr_max)
+		nr_stored = 0;
+	else if (nr_excess > 0)
+		nr_stored -= nr_excess;
+
+	/* Stitched buffer branch records length */
+	if (nr_total > nr_max)
+		nr_last = nr_max;
+	else
+		nr_last = nr_total;
+
+	/* Move stored branch records */
+	for (i = 0; i < nr_stored; i++)
+		copy_brbe_regset(stored, i, stored, nr_last - nr_stored - 1 + i);
+
+	/* Copy live branch records */
+	for (i = 0; i < nr_live; i++)
+		copy_brbe_regset(live, i, stored, i);
+
+	return nr_last;
+}
+
 /*
  * Generic perf branch filters supported on BRBE
  *
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH V11 09/10] arm64/perf: Implement branch records save on task sched out
  2023-05-31  4:04 [PATCH V11 00/10] arm64/perf: Enable branch stack sampling Anshuman Khandual
                   ` (7 preceding siblings ...)
  2023-05-31  4:04 ` [PATCH V11 08/10] arm64/perf: Add struct brbe_regset helper functions Anshuman Khandual
@ 2023-05-31  4:04 ` Anshuman Khandual
  2023-05-31  4:04 ` [PATCH V11 10/10] arm64/perf: Implement branch records save on PMU IRQ Anshuman Khandual
  2023-06-09 11:13 ` [PATCH V11 00/10] arm64/perf: Enable branch stack sampling Anshuman Khandual
  10 siblings, 0 replies; 48+ messages in thread
From: Anshuman Khandual @ 2023-05-31  4:04 UTC (permalink / raw)
  To: linux-arm-kernel, linux-kernel, will, catalin.marinas, mark.rutland
  Cc: Anshuman Khandual, Mark Brown, James Clark, Rob Herring,
	Marc Zyngier, Suzuki Poulose, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, linux-perf-users

This modifies the current armv8pmu_sched_task() to implement a branch records
save mechanism via armv8pmu_branch_save() when a task scheds out of a cpu.
BRBE is paused and disabled for all exception levels before branch records
get captured, which then get concatenated with all existing stored records
present in the task context, maintaining contiguity. The final length of the
concatenated buffer never exceeds the implemented BRBE length.
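
In outline, the sched out path added below does the following (a summary
sketch, not code from the patch):

	/*
	 * armv8pmu_sched_task(pmu_ctx, sched_in = false)
	 *   -> armv8pmu_branch_save(armpmu, pmu_ctx->task_ctx_data)
	 *        -> brbe_branch_save()           : pause BRBE, capture the
	 *                                          live regsets, unpause
	 *        -> stitch_stored_live_entries() : merge live records into
	 *                                          task_ctx->store, capped
	 *                                          at brbe_attr->brbe_nr
	 */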

Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
Tested-by: James Clark <james.clark@arm.com>
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
---
 arch/arm64/include/asm/perf_event.h |  2 ++
 drivers/perf/arm_brbe.c             | 30 +++++++++++++++++++++++++++++
 drivers/perf/arm_pmuv3.c            | 14 ++++++++++++--
 3 files changed, 44 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/include/asm/perf_event.h b/arch/arm64/include/asm/perf_event.h
index f071d629c0cf..c81b768cd172 100644
--- a/arch/arm64/include/asm/perf_event.h
+++ b/arch/arm64/include/asm/perf_event.h
@@ -40,6 +40,7 @@ void armv8pmu_branch_probe(struct arm_pmu *arm_pmu);
 void armv8pmu_branch_reset(void);
 int armv8pmu_private_alloc(struct arm_pmu *arm_pmu);
 void armv8pmu_private_free(struct arm_pmu *arm_pmu);
+void armv8pmu_branch_save(struct arm_pmu *arm_pmu, void *ctx);
 #else
 static inline void armv8pmu_branch_read(struct pmu_hw_events *cpuc, struct perf_event *event)
 {
@@ -66,6 +67,7 @@ static inline void armv8pmu_branch_probe(struct arm_pmu *arm_pmu) { }
 static inline void armv8pmu_branch_reset(void) { }
 static inline int armv8pmu_private_alloc(struct arm_pmu *arm_pmu) { return 0; }
 static inline void armv8pmu_private_free(struct arm_pmu *arm_pmu) { }
+static inline void armv8pmu_branch_save(struct arm_pmu *arm_pmu, void *ctx) { }
 #endif
 #endif
 #endif
diff --git a/drivers/perf/arm_brbe.c b/drivers/perf/arm_brbe.c
index 759db681d673..0678ebf0a896 100644
--- a/drivers/perf/arm_brbe.c
+++ b/drivers/perf/arm_brbe.c
@@ -207,6 +207,36 @@ static int stitch_stored_live_entries(struct brbe_regset *stored,
 	return nr_last;
 }
 
+static int brbe_branch_save(struct brbe_hw_attr *brbe_attr, struct brbe_regset *live)
+{
+	u64 brbfcr = read_sysreg_s(SYS_BRBFCR_EL1);
+	int nr_live;
+
+	write_sysreg_s(brbfcr | BRBFCR_EL1_PAUSED, SYS_BRBFCR_EL1);
+	isb();
+
+	nr_live = capture_brbe_regset(brbe_attr, live);
+
+	write_sysreg_s(brbfcr & ~BRBFCR_EL1_PAUSED, SYS_BRBFCR_EL1);
+	isb();
+
+	return nr_live;
+}
+
+void armv8pmu_branch_save(struct arm_pmu *arm_pmu, void *ctx)
+{
+	struct brbe_hw_attr *brbe_attr = (struct brbe_hw_attr *)arm_pmu->private;
+	struct arm64_perf_task_context *task_ctx = ctx;
+	struct brbe_regset live[BRBE_MAX_ENTRIES];
+	int nr_live, nr_store;
+
+	nr_live = brbe_branch_save(brbe_attr, live);
+	nr_store = task_ctx->nr_brbe_records;
+	nr_store = stitch_stored_live_entries(task_ctx->store, live, nr_store,
+					      nr_live, brbe_attr->brbe_nr);
+	task_ctx->nr_brbe_records = nr_store;
+}
+
 /*
  * Generic perf branch filters supported on BRBE
  *
diff --git a/drivers/perf/arm_pmuv3.c b/drivers/perf/arm_pmuv3.c
index 29672ff20026..9725a53d6799 100644
--- a/drivers/perf/arm_pmuv3.c
+++ b/drivers/perf/arm_pmuv3.c
@@ -907,9 +907,19 @@ static int armv8pmu_user_event_idx(struct perf_event *event)
 static void armv8pmu_sched_task(struct perf_event_pmu_context *pmu_ctx, bool sched_in)
 {
 	struct arm_pmu *armpmu = to_arm_pmu(pmu_ctx->pmu);
+	void *task_ctx = pmu_ctx ? pmu_ctx->task_ctx_data : NULL;
 
-	if (sched_in && arm_pmu_branch_stack_supported(armpmu))
-		armv8pmu_branch_reset();
+	if (arm_pmu_branch_stack_supported(armpmu)) {
+		/* Save branch records in task_ctx on sched out */
+		if (task_ctx && !sched_in) {
+			armv8pmu_branch_save(armpmu, task_ctx);
+			return;
+		}
+
+		/* Reset branch records on sched in */
+		if (sched_in)
+			armv8pmu_branch_reset();
+	}
 }
 
 /*
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH V11 10/10] arm64/perf: Implement branch records save on PMU IRQ
  2023-05-31  4:04 [PATCH V11 00/10] arm64/perf: Enable branch stack sampling Anshuman Khandual
                   ` (8 preceding siblings ...)
  2023-05-31  4:04 ` [PATCH V11 09/10] arm64/perf: Implement branch records save on task sched out Anshuman Khandual
@ 2023-05-31  4:04 ` Anshuman Khandual
  2023-06-09 11:13 ` [PATCH V11 00/10] arm64/perf: Enable branch stack sampling Anshuman Khandual
  10 siblings, 0 replies; 48+ messages in thread
From: Anshuman Khandual @ 2023-05-31  4:04 UTC (permalink / raw)
  To: linux-arm-kernel, linux-kernel, will, catalin.marinas, mark.rutland
  Cc: Anshuman Khandual, Mark Brown, James Clark, Rob Herring,
	Marc Zyngier, Suzuki Poulose, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, linux-perf-users

This modifies armv8pmu_branch_read() to concatenate live entries with the
task context stored entries and then process the resultant buffer to create
the perf branch entry array for perf_sample_data. It follows the same
principle as the task sched out path.
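
In outline (a summary sketch of the rework below, not code from the patch),
per task events stitch the captured live records with the entries stored in
the task context, while per cpu events, which have no task context, consume
the live records directly:

	/*
	 * armv8pmu_branch_read()
	 *   -> capture_brbe_regset(brbe_attr, live)
	 *   -> if (event->ctx->task)
	 *          stitch_stored_live_entries() and process task_ctx->store
	 *      else
	 *          process_branch_entries(cpuc, event, live, nr_live)
	 */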

Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
Tested-by: James Clark <james.clark@arm.com>
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
---
 drivers/perf/arm_brbe.c | 75 +++++++++++++++++------------------------
 1 file changed, 30 insertions(+), 45 deletions(-)

diff --git a/drivers/perf/arm_brbe.c b/drivers/perf/arm_brbe.c
index 0678ebf0a896..e3efc1563111 100644
--- a/drivers/perf/arm_brbe.c
+++ b/drivers/perf/arm_brbe.c
@@ -693,41 +693,45 @@ void armv8pmu_branch_reset(void)
 	isb();
 }
 
-static bool capture_branch_entry(struct pmu_hw_events *cpuc,
-				 struct perf_event *event, int idx)
+static void brbe_regset_branch_entries(struct pmu_hw_events *cpuc, struct perf_event *event,
+				       struct brbe_regset *regset, int idx)
 {
 	struct perf_branch_entry *entry = &cpuc->branches->branch_entries[idx];
-	u64 brbinf = get_brbinf_reg(idx);
-
-	/*
-	 * There are no valid entries left in the buffer.
-	 * Abort branch record processing to save some
-	 * cycles and also reduce the capture/process load
-	 * for user space as well.
-	 */
-	if (brbe_invalid(brbinf))
-		return false;
+	u64 brbinf = regset[idx].brbinf;
 
 	perf_clear_branch_entry_bitfields(entry);
 	if (brbe_record_is_complete(brbinf)) {
-		entry->from = get_brbsrc_reg(idx);
-		entry->to = get_brbtgt_reg(idx);
+		entry->from = regset[idx].brbsrc;
+		entry->to = regset[idx].brbtgt;
 	} else if (brbe_record_is_source_only(brbinf)) {
-		entry->from = get_brbsrc_reg(idx);
+		entry->from = regset[idx].brbsrc;
 		entry->to = 0;
 	} else if (brbe_record_is_target_only(brbinf)) {
 		entry->from = 0;
-		entry->to = get_brbtgt_reg(idx);
+		entry->to = regset[idx].brbtgt;
 	}
 	capture_brbe_flags(entry, event, brbinf);
-	return true;
+}
+
+static void process_branch_entries(struct pmu_hw_events *cpuc, struct perf_event *event,
+				   struct brbe_regset *regset, int nr_regset)
+{
+	int idx;
+
+	for (idx = 0; idx < nr_regset; idx++)
+		brbe_regset_branch_entries(cpuc, event, regset, idx);
+
+	cpuc->branches->branch_stack.nr = nr_regset;
+	cpuc->branches->branch_stack.hw_idx = -1ULL;
 }
 
 void armv8pmu_branch_read(struct pmu_hw_events *cpuc, struct perf_event *event)
 {
 	struct brbe_hw_attr *brbe_attr = (struct brbe_hw_attr *)cpuc->percpu_pmu->private;
+	struct arm64_perf_task_context *task_ctx = event->pmu_ctx->task_ctx_data;
+	struct brbe_regset live[BRBE_MAX_ENTRIES];
+	int nr_live, nr_store;
 	u64 brbfcr, brbcr;
-	int idx, loop1_idx1, loop1_idx2, loop2_idx1, loop2_idx2, count;
 
 	brbcr = read_sysreg_s(SYS_BRBCR_EL1);
 	brbfcr = read_sysreg_s(SYS_BRBFCR_EL1);
@@ -739,35 +743,16 @@ void armv8pmu_branch_read(struct pmu_hw_events *cpuc, struct perf_event *event)
 	write_sysreg_s(brbfcr | BRBFCR_EL1_PAUSED, SYS_BRBFCR_EL1);
 	isb();
 
-	/* Determine the indices for each loop */
-	loop1_idx1 = BRBE_BANK0_IDX_MIN;
-	if (brbe_attr->brbe_nr <= BRBE_BANK_MAX_ENTRIES) {
-		loop1_idx2 = brbe_attr->brbe_nr - 1;
-		loop2_idx1 = BRBE_BANK1_IDX_MIN;
-		loop2_idx2 = BRBE_BANK0_IDX_MAX;
+	nr_live = capture_brbe_regset(brbe_attr, live);
+	if (event->ctx->task) {
+		nr_store = task_ctx->nr_brbe_records;
+		nr_store = stitch_stored_live_entries(task_ctx->store, live, nr_store,
+						      nr_live, brbe_attr->brbe_nr);
+		process_branch_entries(cpuc, event, task_ctx->store, nr_store);
+		task_ctx->nr_brbe_records = 0;
 	} else {
-		loop1_idx2 = BRBE_BANK0_IDX_MAX;
-		loop2_idx1 = BRBE_BANK1_IDX_MIN;
-		loop2_idx2 = brbe_attr->brbe_nr - 1;
-	}
-
-	/* Loop through bank 0 */
-	select_brbe_bank(BRBE_BANK_IDX_0);
-	for (idx = 0, count = loop1_idx1; count <= loop1_idx2; idx++, count++) {
-		if (!capture_branch_entry(cpuc, event, idx))
-			goto skip_bank_1;
-	}
-
-	/* Loop through bank 1 */
-	select_brbe_bank(BRBE_BANK_IDX_1);
-	for (count = loop2_idx1; count <= loop2_idx2; idx++, count++) {
-		if (!capture_branch_entry(cpuc, event, idx))
-			break;
+		process_branch_entries(cpuc, event, live, nr_live);
 	}
-
-skip_bank_1:
-	cpuc->branches->branch_stack.nr = idx;
-	cpuc->branches->branch_stack.hw_idx = -1ULL;
 	process_branch_aborts(cpuc);
 
 	/* Unpause the buffer */
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* Re: [PATCH V11 06/10] arm64/perf: Enable branch stack events via FEAT_BRBE
  2023-05-31  4:04 ` [PATCH V11 06/10] arm64/perf: Enable branch stack events via FEAT_BRBE Anshuman Khandual
@ 2023-06-02  1:45   ` Namhyung Kim
  2023-06-05  3:00     ` Anshuman Khandual
  2023-06-05 13:43   ` Mark Rutland
  1 sibling, 1 reply; 48+ messages in thread
From: Namhyung Kim @ 2023-06-02  1:45 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: linux-arm-kernel, linux-kernel, will, catalin.marinas,
	mark.rutland, Mark Brown, James Clark, Rob Herring, Marc Zyngier,
	Suzuki Poulose, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, linux-perf-users

Hello,

On Tue, May 30, 2023 at 9:21 PM Anshuman Khandual
<anshuman.khandual@arm.com> wrote:
>
> This enables branch stack sampling events in ARMV8 PMU, via an architecture
> feature FEAT_BRBE aka branch record buffer extension. This defines required
> branch helper functions pmuv8pmu_branch_XXXXX() and the implementation here
> is wrapped with a new config option CONFIG_ARM64_BRBE.
>
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Will Deacon <will@kernel.org>
> Cc: Mark Rutland <mark.rutland@arm.com>
> Cc: linux-arm-kernel@lists.infradead.org
> Cc: linux-kernel@vger.kernel.org
> Tested-by: James Clark <james.clark@arm.com>
> Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
> ---

[SNIP]
> +void armv8pmu_branch_read(struct pmu_hw_events *cpuc, struct perf_event *event)
> +{
> +       struct brbe_hw_attr *brbe_attr = (struct brbe_hw_attr *)cpuc->percpu_pmu->private;
> +       u64 brbfcr, brbcr;
> +       int idx, loop1_idx1, loop1_idx2, loop2_idx1, loop2_idx2, count;
> +
> +       brbcr = read_sysreg_s(SYS_BRBCR_EL1);
> +       brbfcr = read_sysreg_s(SYS_BRBFCR_EL1);
> +
> +       /* Ensure pause on PMU interrupt is enabled */
> +       WARN_ON_ONCE(!(brbcr & BRBCR_EL1_FZP));
> +
> +       /* Pause the buffer */
> +       write_sysreg_s(brbfcr | BRBFCR_EL1_PAUSED, SYS_BRBFCR_EL1);
> +       isb();
> +
> +       /* Determine the indices for each loop */
> +       loop1_idx1 = BRBE_BANK0_IDX_MIN;
> +       if (brbe_attr->brbe_nr <= BRBE_BANK_MAX_ENTRIES) {
> +               loop1_idx2 = brbe_attr->brbe_nr - 1;
> +               loop2_idx1 = BRBE_BANK1_IDX_MIN;
> +               loop2_idx2 = BRBE_BANK0_IDX_MAX;

Is this to disable bank 1?  Maybe it needs a comment.


> +       } else {
> +               loop1_idx2 = BRBE_BANK0_IDX_MAX;
> +               loop2_idx1 = BRBE_BANK1_IDX_MIN;
> +               loop2_idx2 = brbe_attr->brbe_nr - 1;
> +       }

The loop2_idx1 is the same for both cases.  Maybe better
to move it out of the if statement.

Thanks,
Namhyung


> +
> +       /* Loop through bank 0 */
> +       select_brbe_bank(BRBE_BANK_IDX_0);
> +       for (idx = 0, count = loop1_idx1; count <= loop1_idx2; idx++, count++) {
> +               if (!capture_branch_entry(cpuc, event, idx))
> +                       goto skip_bank_1;
> +       }
> +
> +       /* Loop through bank 1 */
> +       select_brbe_bank(BRBE_BANK_IDX_1);
> +       for (count = loop2_idx1; count <= loop2_idx2; idx++, count++) {
> +               if (!capture_branch_entry(cpuc, event, idx))
> +                       break;
> +       }
> +
> +skip_bank_1:
> +       cpuc->branches->branch_stack.nr = idx;
> +       cpuc->branches->branch_stack.hw_idx = -1ULL;
> +       process_branch_aborts(cpuc);
> +
> +       /* Unpause the buffer */
> +       write_sysreg_s(brbfcr & ~BRBFCR_EL1_PAUSED, SYS_BRBFCR_EL1);
> +       isb();
> +       armv8pmu_branch_reset();
> +}

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH V11 05/10] arm64/perf: Add branch stack support in ARMV8 PMU
  2023-05-31  4:04 ` [PATCH V11 05/10] arm64/perf: Add branch stack support in ARMV8 PMU Anshuman Khandual
@ 2023-06-02  2:33   ` Namhyung Kim
  2023-06-05  2:43     ` Anshuman Khandual
  2023-06-05 12:05   ` Mark Rutland
  1 sibling, 1 reply; 48+ messages in thread
From: Namhyung Kim @ 2023-06-02  2:33 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: linux-arm-kernel, linux-kernel, will, catalin.marinas,
	mark.rutland, Mark Brown, James Clark, Rob Herring, Marc Zyngier,
	Suzuki Poulose, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, linux-perf-users

On Tue, May 30, 2023 at 9:27 PM Anshuman Khandual
<anshuman.khandual@arm.com> wrote:
>
> This enables support for branch stack sampling event in ARMV8 PMU, checking
> has_branch_stack() on the event inside 'struct arm_pmu' callbacks. Although
> these branch stack helpers armv8pmu_branch_XXXXX() are just dummy functions
> for now. While here, this also defines arm_pmu's sched_task() callback with
> armv8pmu_sched_task(), which resets the branch record buffer on a sched_in.
>
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Will Deacon <will@kernel.org>
> Cc: Mark Rutland <mark.rutland@arm.com>
> Cc: linux-arm-kernel@lists.infradead.org
> Cc: linux-kernel@vger.kernel.org
> Tested-by: James Clark <james.clark@arm.com>
> Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
> ---
>  arch/arm64/include/asm/perf_event.h | 33 +++++++++++++
>  drivers/perf/arm_pmuv3.c            | 76 ++++++++++++++++++++---------
>  2 files changed, 86 insertions(+), 23 deletions(-)
>
> diff --git a/arch/arm64/include/asm/perf_event.h b/arch/arm64/include/asm/perf_event.h
> index eb7071c9eb34..7548813783ba 100644
> --- a/arch/arm64/include/asm/perf_event.h
> +++ b/arch/arm64/include/asm/perf_event.h
> @@ -24,4 +24,37 @@ extern unsigned long perf_misc_flags(struct pt_regs *regs);
>         (regs)->pstate = PSR_MODE_EL1h; \
>  }
>
> +struct pmu_hw_events;
> +struct arm_pmu;
> +struct perf_event;
> +
> +#ifdef CONFIG_PERF_EVENTS
> +static inline bool has_branch_stack(struct perf_event *event);
> +
> +static inline void armv8pmu_branch_read(struct pmu_hw_events *cpuc, struct perf_event *event)
> +{
> +       WARN_ON_ONCE(!has_branch_stack(event));
> +}
> +
> +static inline bool armv8pmu_branch_valid(struct perf_event *event)
> +{
> +       WARN_ON_ONCE(!has_branch_stack(event));
> +       return false;
> +}
> +
> +static inline void armv8pmu_branch_enable(struct perf_event *event)
> +{
> +       WARN_ON_ONCE(!has_branch_stack(event));
> +}
> +
> +static inline void armv8pmu_branch_disable(struct perf_event *event)
> +{
> +       WARN_ON_ONCE(!has_branch_stack(event));
> +}
> +
> +static inline void armv8pmu_branch_probe(struct arm_pmu *arm_pmu) { }
> +static inline void armv8pmu_branch_reset(void) { }
> +static inline int armv8pmu_private_alloc(struct arm_pmu *arm_pmu) { return 0; }
> +static inline void armv8pmu_private_free(struct arm_pmu *arm_pmu) { }
> +#endif
>  #endif
> diff --git a/drivers/perf/arm_pmuv3.c b/drivers/perf/arm_pmuv3.c
> index c98e4039386d..86d803ff1ae3 100644
> --- a/drivers/perf/arm_pmuv3.c
> +++ b/drivers/perf/arm_pmuv3.c
> @@ -705,38 +705,21 @@ static void armv8pmu_enable_event(struct perf_event *event)
>          * Enable counter and interrupt, and set the counter to count
>          * the event that we're interested in.
>          */
> -
> -       /*
> -        * Disable counter
> -        */
>         armv8pmu_disable_event_counter(event);
> -
> -       /*
> -        * Set event.
> -        */
>         armv8pmu_write_event_type(event);
> -
> -       /*
> -        * Enable interrupt for this counter
> -        */
>         armv8pmu_enable_event_irq(event);
> -
> -       /*
> -        * Enable counter
> -        */
>         armv8pmu_enable_event_counter(event);
> +
> +       if (has_branch_stack(event))
> +               armv8pmu_branch_enable(event);
>  }
>
>  static void armv8pmu_disable_event(struct perf_event *event)
>  {
> -       /*
> -        * Disable counter
> -        */
> -       armv8pmu_disable_event_counter(event);
> +       if (has_branch_stack(event))
> +               armv8pmu_branch_disable(event);
>
> -       /*
> -        * Disable interrupt for this counter
> -        */
> +       armv8pmu_disable_event_counter(event);
>         armv8pmu_disable_event_irq(event);
>  }
>
> @@ -814,6 +797,11 @@ static irqreturn_t armv8pmu_handle_irq(struct arm_pmu *cpu_pmu)
>                 if (!armpmu_event_set_period(event))
>                         continue;
>
> +               if (has_branch_stack(event) && !WARN_ON(!cpuc->branches)) {
> +                       armv8pmu_branch_read(cpuc, event);
> +                       perf_sample_save_brstack(&data, event, &cpuc->branches->branch_stack);
> +               }
> +
>                 /*
>                  * Perf event overflow will queue the processing of the event as
>                  * an irq_work which will be taken care of in the handling of
> @@ -912,6 +900,14 @@ static int armv8pmu_user_event_idx(struct perf_event *event)
>         return event->hw.idx;
>  }
>
> +static void armv8pmu_sched_task(struct perf_event_pmu_context *pmu_ctx, bool sched_in)
> +{
> +       struct arm_pmu *armpmu = to_arm_pmu(pmu_ctx->pmu);
> +
> +       if (sched_in && arm_pmu_branch_stack_supported(armpmu))
> +               armv8pmu_branch_reset();
> +}
> +
>  /*
>   * Add an event filter to a given event.
>   */
> @@ -982,6 +978,9 @@ static void armv8pmu_reset(void *info)
>                 pmcr |= ARMV8_PMU_PMCR_LP;
>
>         armv8pmu_pmcr_write(pmcr);
> +
> +       if (arm_pmu_branch_stack_supported(cpu_pmu))
> +               armv8pmu_branch_reset();
>  }
>
>  static int __armv8_pmuv3_map_event_id(struct arm_pmu *armpmu,
> @@ -1019,6 +1018,9 @@ static int __armv8_pmuv3_map_event(struct perf_event *event,
>
>         hw_event_id = __armv8_pmuv3_map_event_id(armpmu, event);
>
> +       if (has_branch_stack(event) && !armv8pmu_branch_valid(event))
> +               return -EOPNOTSUPP;
> +
>         /*
>          * CHAIN events only work when paired with an adjacent counter, and it
>          * never makes sense for a user to open one in isolation, as they'll be
> @@ -1135,6 +1137,21 @@ static void __armv8pmu_probe_pmu(void *info)
>                 cpu_pmu->reg_pmmir = read_pmmir();
>         else
>                 cpu_pmu->reg_pmmir = 0;
> +       armv8pmu_branch_probe(cpu_pmu);
> +}
> +
> +static int branch_records_alloc(struct arm_pmu *armpmu)
> +{
> +       struct pmu_hw_events *events;
> +       int cpu;
> +
> +       for_each_possible_cpu(cpu) {
> +               events = per_cpu_ptr(armpmu->hw_events, cpu);
> +               events->branches = kzalloc(sizeof(struct branch_records), GFP_KERNEL);
> +               if (!events->branches)
> +                       return -ENOMEM;
> +       }
> +       return 0;
>  }
>
>  static int armv8pmu_probe_pmu(struct arm_pmu *cpu_pmu)
> @@ -1145,12 +1162,24 @@ static int armv8pmu_probe_pmu(struct arm_pmu *cpu_pmu)
>         };
>         int ret;
>
> +       ret = armv8pmu_private_alloc(cpu_pmu);
> +       if (ret)
> +               return ret;

Wouldn't it be better to move it under the if statement below
if it's only needed for branch stack?

> +
>         ret = smp_call_function_any(&cpu_pmu->supported_cpus,
>                                     __armv8pmu_probe_pmu,
>                                     &probe, 1);
>         if (ret)
>                 return ret;

Otherwise you might need to free it here.

>
> +       if (arm_pmu_branch_stack_supported(cpu_pmu)) {
> +               ret = branch_records_alloc(cpu_pmu);
> +               if (ret)
> +                       return ret;

And here too.
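
Something like this, for instance (an untested sketch):

	ret = smp_call_function_any(&cpu_pmu->supported_cpus,
				    __armv8pmu_probe_pmu,
				    &probe, 1);
	if (ret) {
		armv8pmu_private_free(cpu_pmu);
		return ret;
	}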

Thanks,
Namhyung


> +       } else {
> +               armv8pmu_private_free(cpu_pmu);
> +       }
> +
>         return probe.present ? 0 : -ENODEV;
>  }
>
> @@ -1214,6 +1243,7 @@ static int armv8_pmu_init(struct arm_pmu *cpu_pmu, char *name,
>         cpu_pmu->set_event_filter       = armv8pmu_set_event_filter;
>
>         cpu_pmu->pmu.event_idx          = armv8pmu_user_event_idx;
> +       cpu_pmu->sched_task             = armv8pmu_sched_task;
>
>         cpu_pmu->name                   = name;
>         cpu_pmu->map_event              = map_event;
> --
> 2.25.1
>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH V11 08/10] arm64/perf: Add struct brbe_regset helper functions
  2023-05-31  4:04 ` [PATCH V11 08/10] arm64/perf: Add struct brbe_regset helper functions Anshuman Khandual
@ 2023-06-02  2:40   ` Namhyung Kim
  2023-06-05  3:14     ` Anshuman Khandual
  2023-06-13 17:17   ` Mark Rutland
  1 sibling, 1 reply; 48+ messages in thread
From: Namhyung Kim @ 2023-06-02  2:40 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: linux-arm-kernel, linux-kernel, will, catalin.marinas,
	mark.rutland, Mark Brown, James Clark, Rob Herring, Marc Zyngier,
	Suzuki Poulose, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, linux-perf-users

On Tue, May 30, 2023 at 9:15 PM Anshuman Khandual
<anshuman.khandual@arm.com> wrote:
>
> The primary abstraction level for fetching branch records from BRBE HW has
> been changed as 'struct brbe_regset', which contains storage for all three
> BRBE registers i.e BRBSRC, BRBTGT, BRBINF. Whether branch record processing
> happens in the task sched out path, or in the PMU IRQ handling path, these
> registers need to be extracted from the HW. Afterwards both live and stored
> sets need to be stitched together to create final branch records set. This
> adds required helper functions for such operations.
>
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Will Deacon <will@kernel.org>
> Cc: Mark Rutland <mark.rutland@arm.com>
> Cc: linux-arm-kernel@lists.infradead.org
> Cc: linux-kernel@vger.kernel.org
> Tested-by: James Clark <james.clark@arm.com>
> Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
> ---

[SNIP]
> +
> +static inline void copy_brbe_regset(struct brbe_regset *src, int src_idx,
> +                                   struct brbe_regset *dst, int dst_idx)
> +{
> +       dst[dst_idx].brbinf = src[src_idx].brbinf;
> +       dst[dst_idx].brbsrc = src[src_idx].brbsrc;
> +       dst[dst_idx].brbtgt = src[src_idx].brbtgt;
> +}
> +
> +/*
> + * This function concatenates branch records from stored and live buffer
> + * up to maximum nr_max records and the stored buffer holds the resultant
> + * buffer. The concatenated buffer contains all the branch records from
> + * the live buffer but might contain some from stored buffer considering
> + * the maximum combined length does not exceed 'nr_max'.
> + *
> + *     Stored records  Live records
> + *     ------------------------------------------------^
> + *     |       S0      |       L0      |       Newest  |
> + *     ---------------------------------               |
> + *     |       S1      |       L1      |               |
> + *     ---------------------------------               |
> + *     |       S2      |       L2      |               |
> + *     ---------------------------------               |
> + *     |       S3      |       L3      |               |
> + *     ---------------------------------               |
> + *     |       S4      |       L4      |               nr_max
> + *     ---------------------------------               |
> + *     |               |       L5      |               |
> + *     ---------------------------------               |
> + *     |               |       L6      |               |
> + *     ---------------------------------               |
> + *     |               |       L7      |               |
> + *     ---------------------------------               |
> + *     |               |               |               |
> + *     ---------------------------------               |
> + *     |               |               |       Oldest  |
> + *     ------------------------------------------------V
> + *
> + *
> + * S0 is the newest in the stored records, where as L7 is the oldest in
> + * the live reocords. Unless the live buffer is detetcted as being full
> + * thus potentially dropping off some older records, L7 and S0 records
> + * are contiguous in time for a user task context. The stitched buffer
> + * here represents maximum possible branch records, contiguous in time.
> + *
> + *     Stored records  Live records
> + *     ------------------------------------------------^
> + *     |       L0      |       L0      |       Newest  |
> + *     ---------------------------------               |
> + *     |       L0      |       L1      |               |
> + *     ---------------------------------               |
> + *     |       L2      |       L2      |               |
> + *     ---------------------------------               |
> + *     |       L3      |       L3      |               |
> + *     ---------------------------------               |
> + *     |       L4      |       L4      |             nr_max
> + *     ---------------------------------               |
> + *     |       L5      |       L5      |               |
> + *     ---------------------------------               |
> + *     |       L6      |       L6      |               |
> + *     ---------------------------------               |
> + *     |       L7      |       L7      |               |
> + *     ---------------------------------               |
> + *     |       S0      |               |               |
> + *     ---------------------------------               |
> + *     |       S1      |               |    Oldest     |
> + *     ------------------------------------------------V
> + *     |       S2      | <----|
> + *     -----------------      |
> + *     |       S3      | <----| Dropped off after nr_max
> + *     -----------------      |
> + *     |       S4      | <----|
> + *     -----------------
> + */
> +static int stitch_stored_live_entries(struct brbe_regset *stored,
> +                                     struct brbe_regset *live,
> +                                     int nr_stored, int nr_live,
> +                                     int nr_max)
> +{
> +       int nr_total, nr_excess, nr_last, i;
> +
> +       nr_total = nr_stored + nr_live;
> +       nr_excess = nr_total - nr_max;
> +
> +       /* Stored branch records in stitched buffer */
> +       if (nr_live == nr_max)
> +               nr_stored = 0;
> +       else if (nr_excess > 0)
> +               nr_stored -= nr_excess;
> +
> +       /* Stitched buffer branch records length */
> +       if (nr_total > nr_max)
> +               nr_last = nr_max;
> +       else
> +               nr_last = nr_total;
> +
> +       /* Move stored branch records */
> +       for (i = 0; i < nr_stored; i++)
> +               copy_brbe_regset(stored, i, stored, nr_last - nr_stored - 1 + i);

I'm afraid it can overwrite some entries if nr_live is small
and nr_stored is big.  Why not use memmove()?

Also I think it'd be simpler if you copy store to live.
It'll save copying live in the IRQ but it will copy the
whole content to store again for the sched switch.
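
For instance, a minimal overlap-safe sketch (untested, assuming the
surviving stored records belong at the tail of the stitched buffer):

	memmove(&stored[nr_last - nr_stored], &stored[0],
		nr_stored * sizeof(*stored));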

Thanks,
Namhyung


> +
> +       /* Copy live branch records */
> +       for (i = 0; i < nr_live; i++)
> +               copy_brbe_regset(live, i, stored, i);
> +
> +       return nr_last;
> +}
> +
>  /*
>   * Generic perf branch filters supported on BRBE
>   *
> --
> 2.25.1
>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH V11 05/10] arm64/perf: Add branch stack support in ARMV8 PMU
  2023-06-02  2:33   ` Namhyung Kim
@ 2023-06-05  2:43     ` Anshuman Khandual
  0 siblings, 0 replies; 48+ messages in thread
From: Anshuman Khandual @ 2023-06-05  2:43 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: linux-arm-kernel, linux-kernel, will, catalin.marinas,
	mark.rutland, Mark Brown, James Clark, Rob Herring, Marc Zyngier,
	Suzuki Poulose, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, linux-perf-users

Hello Namhyung,

On 6/2/23 08:03, Namhyung Kim wrote:
> On Tue, May 30, 2023 at 9:27 PM Anshuman Khandual
> <anshuman.khandual@arm.com> wrote:
>> This enables support for branch stack sampling event in ARMV8 PMU, checking
>> has_branch_stack() on the event inside 'struct arm_pmu' callbacks. Although
>> these branch stack helpers armv8pmu_branch_XXXXX() are just dummy functions
>> for now. While here, this also defines arm_pmu's sched_task() callback with
>> armv8pmu_sched_task(), which resets the branch record buffer on a sched_in.
>>
>> Cc: Catalin Marinas <catalin.marinas@arm.com>
>> Cc: Will Deacon <will@kernel.org>
>> Cc: Mark Rutland <mark.rutland@arm.com>
>> Cc: linux-arm-kernel@lists.infradead.org
>> Cc: linux-kernel@vger.kernel.org
>> Tested-by: James Clark <james.clark@arm.com>
>> Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
>> ---
>>  arch/arm64/include/asm/perf_event.h | 33 +++++++++++++
>>  drivers/perf/arm_pmuv3.c            | 76 ++++++++++++++++++++---------
>>  2 files changed, 86 insertions(+), 23 deletions(-)
>>
>> diff --git a/arch/arm64/include/asm/perf_event.h b/arch/arm64/include/asm/perf_event.h
>> index eb7071c9eb34..7548813783ba 100644
>> --- a/arch/arm64/include/asm/perf_event.h
>> +++ b/arch/arm64/include/asm/perf_event.h
>> @@ -24,4 +24,37 @@ extern unsigned long perf_misc_flags(struct pt_regs *regs);
>>         (regs)->pstate = PSR_MODE_EL1h; \
>>  }
>>
>> +struct pmu_hw_events;
>> +struct arm_pmu;
>> +struct perf_event;
>> +
>> +#ifdef CONFIG_PERF_EVENTS
>> +static inline bool has_branch_stack(struct perf_event *event);
>> +
>> +static inline void armv8pmu_branch_read(struct pmu_hw_events *cpuc, struct perf_event *event)
>> +{
>> +       WARN_ON_ONCE(!has_branch_stack(event));
>> +}
>> +
>> +static inline bool armv8pmu_branch_valid(struct perf_event *event)
>> +{
>> +       WARN_ON_ONCE(!has_branch_stack(event));
>> +       return false;
>> +}
>> +
>> +static inline void armv8pmu_branch_enable(struct perf_event *event)
>> +{
>> +       WARN_ON_ONCE(!has_branch_stack(event));
>> +}
>> +
>> +static inline void armv8pmu_branch_disable(struct perf_event *event)
>> +{
>> +       WARN_ON_ONCE(!has_branch_stack(event));
>> +}
>> +
>> +static inline void armv8pmu_branch_probe(struct arm_pmu *arm_pmu) { }
>> +static inline void armv8pmu_branch_reset(void) { }
>> +static inline int armv8pmu_private_alloc(struct arm_pmu *arm_pmu) { return 0; }
>> +static inline void armv8pmu_private_free(struct arm_pmu *arm_pmu) { }
>> +#endif
>>  #endif
>> diff --git a/drivers/perf/arm_pmuv3.c b/drivers/perf/arm_pmuv3.c
>> index c98e4039386d..86d803ff1ae3 100644
>> --- a/drivers/perf/arm_pmuv3.c
>> +++ b/drivers/perf/arm_pmuv3.c
>> @@ -705,38 +705,21 @@ static void armv8pmu_enable_event(struct perf_event *event)
>>          * Enable counter and interrupt, and set the counter to count
>>          * the event that we're interested in.
>>          */
>> -
>> -       /*
>> -        * Disable counter
>> -        */
>>         armv8pmu_disable_event_counter(event);
>> -
>> -       /*
>> -        * Set event.
>> -        */
>>         armv8pmu_write_event_type(event);
>> -
>> -       /*
>> -        * Enable interrupt for this counter
>> -        */
>>         armv8pmu_enable_event_irq(event);
>> -
>> -       /*
>> -        * Enable counter
>> -        */
>>         armv8pmu_enable_event_counter(event);
>> +
>> +       if (has_branch_stack(event))
>> +               armv8pmu_branch_enable(event);
>>  }
>>
>>  static void armv8pmu_disable_event(struct perf_event *event)
>>  {
>> -       /*
>> -        * Disable counter
>> -        */
>> -       armv8pmu_disable_event_counter(event);
>> +       if (has_branch_stack(event))
>> +               armv8pmu_branch_disable(event);
>>
>> -       /*
>> -        * Disable interrupt for this counter
>> -        */
>> +       armv8pmu_disable_event_counter(event);
>>         armv8pmu_disable_event_irq(event);
>>  }
>>
>> @@ -814,6 +797,11 @@ static irqreturn_t armv8pmu_handle_irq(struct arm_pmu *cpu_pmu)
>>                 if (!armpmu_event_set_period(event))
>>                         continue;
>>
>> +               if (has_branch_stack(event) && !WARN_ON(!cpuc->branches)) {
>> +                       armv8pmu_branch_read(cpuc, event);
>> +                       perf_sample_save_brstack(&data, event, &cpuc->branches->branch_stack);
>> +               }
>> +
>>                 /*
>>                  * Perf event overflow will queue the processing of the event as
>>                  * an irq_work which will be taken care of in the handling of
>> @@ -912,6 +900,14 @@ static int armv8pmu_user_event_idx(struct perf_event *event)
>>         return event->hw.idx;
>>  }
>>
>> +static void armv8pmu_sched_task(struct perf_event_pmu_context *pmu_ctx, bool sched_in)
>> +{
>> +       struct arm_pmu *armpmu = to_arm_pmu(pmu_ctx->pmu);
>> +
>> +       if (sched_in && arm_pmu_branch_stack_supported(armpmu))
>> +               armv8pmu_branch_reset();
>> +}
>> +
>>  /*
>>   * Add an event filter to a given event.
>>   */
>> @@ -982,6 +978,9 @@ static void armv8pmu_reset(void *info)
>>                 pmcr |= ARMV8_PMU_PMCR_LP;
>>
>>         armv8pmu_pmcr_write(pmcr);
>> +
>> +       if (arm_pmu_branch_stack_supported(cpu_pmu))
>> +               armv8pmu_branch_reset();
>>  }
>>
>>  static int __armv8_pmuv3_map_event_id(struct arm_pmu *armpmu,
>> @@ -1019,6 +1018,9 @@ static int __armv8_pmuv3_map_event(struct perf_event *event,
>>
>>         hw_event_id = __armv8_pmuv3_map_event_id(armpmu, event);
>>
>> +       if (has_branch_stack(event) && !armv8pmu_branch_valid(event))
>> +               return -EOPNOTSUPP;
>> +
>>         /*
>>          * CHAIN events only work when paired with an adjacent counter, and it
>>          * never makes sense for a user to open one in isolation, as they'll be
>> @@ -1135,6 +1137,21 @@ static void __armv8pmu_probe_pmu(void *info)
>>                 cpu_pmu->reg_pmmir = read_pmmir();
>>         else
>>                 cpu_pmu->reg_pmmir = 0;
>> +       armv8pmu_branch_probe(cpu_pmu);
>> +}
>> +
>> +static int branch_records_alloc(struct arm_pmu *armpmu)
>> +{
>> +       struct pmu_hw_events *events;
>> +       int cpu;
>> +
>> +       for_each_possible_cpu(cpu) {
>> +               events = per_cpu_ptr(armpmu->hw_events, cpu);
>> +               events->branches = kzalloc(sizeof(struct branch_records), GFP_KERNEL);
>> +               if (!events->branches)
>> +                       return -ENOMEM;
>> +       }
>> +       return 0;
>>  }
>>
>>  static int armv8pmu_probe_pmu(struct arm_pmu *cpu_pmu)
>> @@ -1145,12 +1162,24 @@ static int armv8pmu_probe_pmu(struct arm_pmu *cpu_pmu)
>>         };
>>         int ret;
>>
>> +       ret = armv8pmu_private_alloc(cpu_pmu);
>> +       if (ret)
>> +               return ret;
> Wouldn't it be better to move it under the if statement below
> if it's only needed for branch stack?

armv8pmu_private_alloc() allocates arm_pmu's private structure which stores
the BRBE HW attributes during armv8pmu_branch_probe(), called from this SMP
callback __armv8pmu_probe_pmu(). Hence, without the structure being allocated
and assigned, the following smp_call_function_any() cannot execute successfully.

armv8pmu_private_alloc()
	{
		......
		Allocates arm_pmu->private as single 'struct brbe_hw_attr'
		Allocates arm_pmu->pmu.task_ctx_cache
		......
	}

__armv8pmu_probe_pmu()
	armv8pmu_branch_probe()
		brbe_attributes_probe()
		{
			......
			brbe_attr->brbe_version = brbe;
			brbe_attr->brbe_format = brbe_get_format(brbidr);
        		brbe_attr->brbe_cc = brbe_get_cc_bits(brbidr);
        		brbe_attr->brbe_nr = brbe_get_numrec(brbidr);
			......
		}

armv8pmu_private_alloc() cannot be moved inside armv8pmu_branch_probe(),
because no allocation can be done while in an SMP call context.

> 
>> +
>>         ret = smp_call_function_any(&cpu_pmu->supported_cpus,
>>                                     __armv8pmu_probe_pmu,
>>                                     &probe, 1);
>>         if (ret)
>>                 return ret;
> Otherwise you might need to free it here.
> 
>> +       if (arm_pmu_branch_stack_supported(cpu_pmu)) {
>> +               ret = branch_records_alloc(cpu_pmu);
>> +               if (ret)
>> +                       return ret;
> And here too.

Not freeing the arm_pmu's private data might not be a problem in cases
where either the pmu does not support BRBE or the pmu probe itself fails.
But for completeness, will change as follows.

diff --git a/drivers/perf/arm_pmuv3.c b/drivers/perf/arm_pmuv3.c
index 9725a53d6799..fdbe52913cc7 100644
--- a/drivers/perf/arm_pmuv3.c
+++ b/drivers/perf/arm_pmuv3.c
@@ -1198,13 +1198,17 @@ static int armv8pmu_probe_pmu(struct arm_pmu *cpu_pmu)
        ret = smp_call_function_any(&cpu_pmu->supported_cpus,
                                    __armv8pmu_probe_pmu,
                                    &probe, 1);
-       if (ret)
+       if (ret) {
+               armv8pmu_private_free(cpu_pmu);
                return ret;
+       }
 
        if (arm_pmu_branch_stack_supported(cpu_pmu)) {
                ret = branch_records_alloc(cpu_pmu);
-               if (ret)
+               if (ret) {
+                       armv8pmu_private_free(cpu_pmu);
                        return ret;
+               }
        } else {
                armv8pmu_private_free(cpu_pmu);
        }

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* Re: [PATCH V11 06/10] arm64/perf: Enable branch stack events via FEAT_BRBE
  2023-06-02  1:45   ` Namhyung Kim
@ 2023-06-05  3:00     ` Anshuman Khandual
  0 siblings, 0 replies; 48+ messages in thread
From: Anshuman Khandual @ 2023-06-05  3:00 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: linux-arm-kernel, linux-kernel, will, catalin.marinas,
	mark.rutland, Mark Brown, James Clark, Rob Herring, Marc Zyngier,
	Suzuki Poulose, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, linux-perf-users



On 6/2/23 07:15, Namhyung Kim wrote:
> Hello,
> 
> On Tue, May 30, 2023 at 9:21 PM Anshuman Khandual
> <anshuman.khandual@arm.com> wrote:
>>
>> This enables branch stack sampling events in ARMV8 PMU, via an architecture
>> feature FEAT_BRBE aka branch record buffer extension. This defines required
>> branch helper functions pmuv8pmu_branch_XXXXX() and the implementation here
>> is wrapped with a new config option CONFIG_ARM64_BRBE.
>>
>> Cc: Catalin Marinas <catalin.marinas@arm.com>
>> Cc: Will Deacon <will@kernel.org>
>> Cc: Mark Rutland <mark.rutland@arm.com>
>> Cc: linux-arm-kernel@lists.infradead.org
>> Cc: linux-kernel@vger.kernel.org
>> Tested-by: James Clark <james.clark@arm.com>
>> Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
>> ---
> 
> [SNIP]
>> +void armv8pmu_branch_read(struct pmu_hw_events *cpuc, struct perf_event *event)
>> +{
>> +       struct brbe_hw_attr *brbe_attr = (struct brbe_hw_attr *)cpuc->percpu_pmu->private;
>> +       u64 brbfcr, brbcr;
>> +       int idx, loop1_idx1, loop1_idx2, loop2_idx1, loop2_idx2, count;
>> +
>> +       brbcr = read_sysreg_s(SYS_BRBCR_EL1);
>> +       brbfcr = read_sysreg_s(SYS_BRBFCR_EL1);
>> +
>> +       /* Ensure pause on PMU interrupt is enabled */
>> +       WARN_ON_ONCE(!(brbcr & BRBCR_EL1_FZP));
>> +
>> +       /* Pause the buffer */
>> +       write_sysreg_s(brbfcr | BRBFCR_EL1_PAUSED, SYS_BRBFCR_EL1);
>> +       isb();
>> +
>> +       /* Determine the indices for each loop */
>> +       loop1_idx1 = BRBE_BANK0_IDX_MIN;
>> +       if (brbe_attr->brbe_nr <= BRBE_BANK_MAX_ENTRIES) {
>> +               loop1_idx2 = brbe_attr->brbe_nr - 1;
>> +               loop2_idx1 = BRBE_BANK1_IDX_MIN;
>> +               loop2_idx2 = BRBE_BANK0_IDX_MAX;
> 
> Is this to disable the bank1?  Maybe need a comment.

Sure, will add a comment.

> 
> 
>> +       } else {
>> +               loop1_idx2 = BRBE_BANK0_IDX_MAX;
>> +               loop2_idx1 = BRBE_BANK1_IDX_MIN;
>> +               loop2_idx2 = brbe_attr->brbe_nr - 1;
>> +       }
> 
> The loop2_idx1 is the same for both cases.  Maybe better
> to move it out of the if statement.

Sure, will do the following change as suggested, but wondering whether the
change should be implemented from this patch onwards or in the later patch
that adds capture_brbe_regset().
 
--- a/drivers/perf/arm_brbe.c
+++ b/drivers/perf/arm_brbe.c
@@ -56,13 +56,14 @@ static int capture_brbe_regset(struct brbe_hw_attr *brbe_attr, struct brbe_regse
        int idx, count;
 
        loop1_idx1 = BRBE_BANK0_IDX_MIN;
+       loop2_idx1 = BRBE_BANK1_IDX_MIN;
        if (brbe_attr->brbe_nr <= BRBE_BANK_MAX_ENTRIES) {
                loop1_idx2 = brbe_attr->brbe_nr - 1;
-               loop2_idx1 = BRBE_BANK1_IDX_MIN;
+
+               /* Disable capturing the bank 1 */
                loop2_idx2 = BRBE_BANK0_IDX_MAX;
        } else {
                loop1_idx2 = BRBE_BANK0_IDX_MAX;
-               loop2_idx1 = BRBE_BANK1_IDX_MIN;
                loop2_idx2 = brbe_attr->brbe_nr - 1;
        }

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH V11 08/10] arm64/perf: Add struct brbe_regset helper functions
  2023-06-02  2:40   ` Namhyung Kim
@ 2023-06-05  3:14     ` Anshuman Khandual
  2023-06-05 23:49       ` Namhyung Kim
  0 siblings, 1 reply; 48+ messages in thread
From: Anshuman Khandual @ 2023-06-05  3:14 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: linux-arm-kernel, linux-kernel, will, catalin.marinas,
	mark.rutland, Mark Brown, James Clark, Rob Herring, Marc Zyngier,
	Suzuki Poulose, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, linux-perf-users



On 6/2/23 08:10, Namhyung Kim wrote:
> On Tue, May 30, 2023 at 9:15 PM Anshuman Khandual
> <anshuman.khandual@arm.com> wrote:
>>
>> The primary abstraction level for fetching branch records from BRBE HW has
>> been changed as 'struct brbe_regset', which contains storage for all three
>> BRBE registers i.e BRBSRC, BRBTGT, BRBINF. Whether branch record processing
>> happens in the task sched out path, or in the PMU IRQ handling path, these
>> registers need to be extracted from the HW. Afterwards both live and stored
>> sets need to be stitched together to create final branch records set. This
>> adds required helper functions for such operations.
>>
>> Cc: Catalin Marinas <catalin.marinas@arm.com>
>> Cc: Will Deacon <will@kernel.org>
>> Cc: Mark Rutland <mark.rutland@arm.com>
>> Cc: linux-arm-kernel@lists.infradead.org
>> Cc: linux-kernel@vger.kernel.org
>> Tested-by: James Clark <james.clark@arm.com>
>> Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
>> ---
> 
> [SNIP]
>> +
>> +static inline void copy_brbe_regset(struct brbe_regset *src, int src_idx,
>> +                                   struct brbe_regset *dst, int dst_idx)
>> +{
>> +       dst[dst_idx].brbinf = src[src_idx].brbinf;
>> +       dst[dst_idx].brbsrc = src[src_idx].brbsrc;
>> +       dst[dst_idx].brbtgt = src[src_idx].brbtgt;
>> +}
>> +
>> +/*
>> + * This function concatenates branch records from stored and live buffer
>> + * up to maximum nr_max records and the stored buffer holds the resultant
>> + * buffer. The concatenated buffer contains all the branch records from
>> + * the live buffer but might contain some from stored buffer considering
>> + * the maximum combined length does not exceed 'nr_max'.
>> + *
>> + *     Stored records  Live records
>> + *     ------------------------------------------------^
>> + *     |       S0      |       L0      |       Newest  |
>> + *     ---------------------------------               |
>> + *     |       S1      |       L1      |               |
>> + *     ---------------------------------               |
>> + *     |       S2      |       L2      |               |
>> + *     ---------------------------------               |
>> + *     |       S3      |       L3      |               |
>> + *     ---------------------------------               |
>> + *     |       S4      |       L4      |               nr_max
>> + *     ---------------------------------               |
>> + *     |               |       L5      |               |
>> + *     ---------------------------------               |
>> + *     |               |       L6      |               |
>> + *     ---------------------------------               |
>> + *     |               |       L7      |               |
>> + *     ---------------------------------               |
>> + *     |               |               |               |
>> + *     ---------------------------------               |
>> + *     |               |               |       Oldest  |
>> + *     ------------------------------------------------V
>> + *
>> + *
>> + * S0 is the newest in the stored records, whereas L7 is the oldest in
>> + * the live records. Unless the live buffer is detected as being full
>> + * thus potentially dropping off some older records, L7 and S0 records
>> + * are contiguous in time for a user task context. The stitched buffer
>> + * here represents maximum possible branch records, contiguous in time.
>> + *
>> + *     Stored records  Live records
>> + *     ------------------------------------------------^
>> + *     |       L0      |       L0      |       Newest  |
>> + *     ---------------------------------               |
>> + *     |       L1      |       L1      |               |
>> + *     ---------------------------------               |
>> + *     |       L2      |       L2      |               |
>> + *     ---------------------------------               |
>> + *     |       L3      |       L3      |               |
>> + *     ---------------------------------               |
>> + *     |       L4      |       L4      |             nr_max
>> + *     ---------------------------------               |
>> + *     |       L5      |       L5      |               |
>> + *     ---------------------------------               |
>> + *     |       L6      |       L6      |               |
>> + *     ---------------------------------               |
>> + *     |       L7      |       L7      |               |
>> + *     ---------------------------------               |
>> + *     |       S0      |               |               |
>> + *     ---------------------------------               |
>> + *     |       S1      |               |    Oldest     |
>> + *     ------------------------------------------------V
>> + *     |       S2      | <----|
>> + *     -----------------      |
>> + *     |       S3      | <----| Dropped off after nr_max
>> + *     -----------------      |
>> + *     |       S4      | <----|
>> + *     -----------------
>> + */
>> +static int stitch_stored_live_entries(struct brbe_regset *stored,
>> +                                     struct brbe_regset *live,
>> +                                     int nr_stored, int nr_live,
>> +                                     int nr_max)
>> +{
>> +       int nr_total, nr_excess, nr_last, i;
>> +
>> +       nr_total = nr_stored + nr_live;
>> +       nr_excess = nr_total - nr_max;
>> +
>> +       /* Stored branch records in stitched buffer */
>> +       if (nr_live == nr_max)
>> +               nr_stored = 0;
>> +       else if (nr_excess > 0)
>> +               nr_stored -= nr_excess;
>> +
>> +       /* Stitched buffer branch records length */
>> +       if (nr_total > nr_max)
>> +               nr_last = nr_max;
>> +       else
>> +               nr_last = nr_total;
>> +
>> +       /* Move stored branch records */
>> +       for (i = 0; i < nr_stored; i++)
>> +               copy_brbe_regset(stored, i, stored, nr_last - nr_stored - 1 + i);
> 
> I'm afraid it can overwrite some entries if nr_live is small
> and nr_stored is big.  Why not use memmove()?

nr_stored is first adjusted with nr_excess if both live and stored entries combined
exceed the maximum branch records in the HW. I am wondering how it can overwrite?

> 
> Also I think it'd be simpler if you copy store to live.
> It'll save copying live in the IRQ but it will copy the
> whole content to store again for the sched switch.

But how is that better than the current scheme?

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH V11 01/10] drivers: perf: arm_pmu: Add new sched_task() callback
  2023-05-31  4:04 ` [PATCH V11 01/10] drivers: perf: arm_pmu: Add new sched_task() callback Anshuman Khandual
@ 2023-06-05  7:26   ` Mark Rutland
  0 siblings, 0 replies; 48+ messages in thread
From: Mark Rutland @ 2023-06-05  7:26 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: linux-arm-kernel, linux-kernel, will, catalin.marinas,
	Mark Brown, James Clark, Rob Herring, Marc Zyngier,
	Suzuki Poulose, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, linux-perf-users

On Wed, May 31, 2023 at 09:34:19AM +0530, Anshuman Khandual wrote:
> This adds armpmu_sched_task(), as generic pmu's sched_task() override which
> in turn can utilize a new arm_pmu.sched_task() callback when available from
> the arm_pmu instance. This new callback will be used while enabling BRBE in
> ARMV8 PMU.
> 
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Will Deacon <will@kernel.org>
> Cc: Mark Rutland <mark.rutland@arm.com>
> Cc: linux-arm-kernel@lists.infradead.org
> Cc: linux-kernel@vger.kernel.org
> Tested-by: James Clark <james.clark@arm.com>
> Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>

Acked-by: Mark Rutland <mark.rutland@arm.com>

> ---
>  drivers/perf/arm_pmu.c       | 9 +++++++++
>  include/linux/perf/arm_pmu.h | 1 +
>  2 files changed, 10 insertions(+)
> 
> diff --git a/drivers/perf/arm_pmu.c b/drivers/perf/arm_pmu.c
> index 15bd1e34a88e..aada47e3b126 100644
> --- a/drivers/perf/arm_pmu.c
> +++ b/drivers/perf/arm_pmu.c
> @@ -517,6 +517,14 @@ static int armpmu_event_init(struct perf_event *event)
>  	return __hw_perf_event_init(event);
>  }
>  
> +static void armpmu_sched_task(struct perf_event_pmu_context *pmu_ctx, bool sched_in)
> +{
> +	struct arm_pmu *armpmu = to_arm_pmu(pmu_ctx->pmu);
> +
> +	if (armpmu->sched_task)
> +		armpmu->sched_task(pmu_ctx, sched_in);
> +}
> +
>  static void armpmu_enable(struct pmu *pmu)
>  {
>  	struct arm_pmu *armpmu = to_arm_pmu(pmu);
> @@ -858,6 +866,7 @@ struct arm_pmu *armpmu_alloc(void)
>  	}
>  
>  	pmu->pmu = (struct pmu) {
> +		.sched_task	= armpmu_sched_task,
>  		.pmu_enable	= armpmu_enable,
>  		.pmu_disable	= armpmu_disable,
>  		.event_init	= armpmu_event_init,
> diff --git a/include/linux/perf/arm_pmu.h b/include/linux/perf/arm_pmu.h
> index 525b5d64e394..f7fbd162ca4c 100644
> --- a/include/linux/perf/arm_pmu.h
> +++ b/include/linux/perf/arm_pmu.h
> @@ -100,6 +100,7 @@ struct arm_pmu {
>  	void		(*stop)(struct arm_pmu *);
>  	void		(*reset)(void *);
>  	int		(*map_event)(struct perf_event *event);
> +	void		(*sched_task)(struct perf_event_pmu_context *pmu_ctx, bool sched_in);
>  	int		num_events;
>  	bool		secure_access; /* 32-bit ARM only */
>  #define ARMV8_PMUV3_MAX_COMMON_EVENTS		0x40
> -- 
> 2.25.1
> 

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH V11 02/10] arm64/perf: Add BRBE registers and fields
  2023-05-31  4:04 ` [PATCH V11 02/10] arm64/perf: Add BRBE registers and fields Anshuman Khandual
@ 2023-06-05  7:55   ` Mark Rutland
  2023-06-06  4:27     ` Anshuman Khandual
  2023-06-13 16:27   ` Mark Rutland
  1 sibling, 1 reply; 48+ messages in thread
From: Mark Rutland @ 2023-06-05  7:55 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: linux-arm-kernel, linux-kernel, will, catalin.marinas,
	Mark Brown, James Clark, Rob Herring, Marc Zyngier,
	Suzuki Poulose, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, linux-perf-users

Hi Anshuman,

This looks good to me, with some minor nits on enum value naming and field
formatting.

On Wed, May 31, 2023 at 09:34:20AM +0530, Anshuman Khandual wrote:
> This adds BRBE related register definitions and various other related field
> macros there in. These will be used subsequently in a BRBE driver which is
> being added later on.
> 
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Will Deacon <will@kernel.org>
> Cc: Marc Zyngier <maz@kernel.org>
> Cc: Mark Rutland <mark.rutland@arm.com>
> Cc: linux-arm-kernel@lists.infradead.org
> Cc: linux-kernel@vger.kernel.org
> Tested-by: James Clark <james.clark@arm.com>
> Reviewed-by: Mark Brown <broonie@kernel.org>
> Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
> ---
>  arch/arm64/include/asm/sysreg.h | 103 +++++++++++++++++++++
>  arch/arm64/tools/sysreg         | 159 ++++++++++++++++++++++++++++++++
>  2 files changed, 262 insertions(+)
> 
> diff --git a/arch/arm64/include/asm/sysreg.h b/arch/arm64/include/asm/sysreg.h
> index e72d9aaab6b1..12419c55d3b7 100644
> --- a/arch/arm64/include/asm/sysreg.h
> +++ b/arch/arm64/include/asm/sysreg.h
> @@ -165,6 +165,109 @@
>  #define SYS_DBGDTRTX_EL0		sys_reg(2, 3, 0, 5, 0)
>  #define SYS_DBGVCR32_EL2		sys_reg(2, 4, 0, 7, 0)
>  
> +#define __SYS_BRBINFO(n)		sys_reg(2, 1, 8, ((n) & 0xf), ((((n) & 0x10)) >> 2 + 0))
> +#define __SYS_BRBSRC(n)			sys_reg(2, 1, 8, ((n) & 0xf), ((((n) & 0x10)) >> 2 + 1))
> +#define __SYS_BRBTGT(n)			sys_reg(2, 1, 8, ((n) & 0xf), ((((n) & 0x10)) >> 2 + 2))

These look correct to me per ARM DDI 0487J.a

> diff --git a/arch/arm64/tools/sysreg b/arch/arm64/tools/sysreg
> index c9a0d1fa3209..44745f42262f 100644
> --- a/arch/arm64/tools/sysreg
> +++ b/arch/arm64/tools/sysreg
> @@ -947,6 +947,165 @@ UnsignedEnum	3:0	BT
>  EndEnum
>  EndSysreg
>  
> +
> +SysregFields BRBINFx_EL1
> +Res0	63:47
> +Field	46	CCU
> +Field	45:32	CC
> +Res0	31:18
> +Field	17	LASTFAILED
> +Field	16	T
> +Res0	15:14
> +Enum	13:8		TYPE
> +	0b000000	UNCOND_DIR
> +	0b000001	INDIR
> +	0b000010	DIR_LINK
> +	0b000011	INDIR_LINK

For clarity, I'd prefer that we use "DIRECT" and "INDIRECT" in full for each of
these, i.e.

	0b000000        UNCOND_DIRECT
	0b000001	INDIRECT
	0b000010	DIRECT_LINK
	0b000011	INDIRECT_LINK

> +	0b000101	RET_SUB
> +	0b000111	RET_EXCPT

Similarly, I'm not keen on the suffixes here.

I think these would be clearer as "RET" and "ERET", as those are short and
unambiguous, and I think the alternative of spelling out "RET_SUBROUTINE" and
"RET_EXCEPTION" is overly verbose.

> +	0b001000	COND_DIR

As with above, I'd prefer "COND_DIRECT" here.

> +	0b100001	DEBUG_HALT
> +	0b100010	CALL
> +	0b100011	TRAP
> +	0b100100	SERROR
> +	0b100110	INST_DEBUG

We generally use 'insn' rather than 'inst', so I'd prefer s/INST/INSN/ here.

> +	0b100111	DATA_DEBUG
> +	0b101010	ALGN_FAULT

s/ALGN/ALIGN/

> +	0b101011	INST_FAULT

As above, I'd prefer "INSN_FAULT" here, though I'm confused that the
architecture doesn't use "abort" naming for this.

> +	0b101100	DATA_FAULT
> +	0b101110	IRQ
> +	0b101111	FIQ
> +	0b111001	DEBUG_EXIT
> +EndEnum

[...]

> +Sysreg	BRBCR_EL1	2	1	9	0	0
> +Res0	63:24
> +Field	23 	EXCEPTION
> +Field	22 	ERTN
> +Res0	21:9
> +Field	8 	FZP
> +Res0	7
> +Enum	6:5	TS
> +	0b01	VIRTUAL
> +	0b10	GST_PHYSICAL

s/GST/GUEST/

> +	0b11	PHYSICAL
> +EndEnum
> +Field	4	MPRED
> +Field	3	CC
> +Res0	2
> +Field	1	E1BRE
> +Field	0	E0BRE
> +EndSysreg

[...]

> +Sysreg	BRBINFINJ_EL1	2	1	9	1	0
> +Res0	63:47
> +Field	46	CCU
> +Field	45:32	CC
> +Res0	31:18
> +Field	17	LASTFAILED
> +Field	16	T
> +Res0	15:14
> +Enum	13:8		TYPE
> +	0b000000	UNCOND_DIR
> +	0b000001	INDIR
> +	0b000010	DIR_LINK
> +	0b000011	INDIR_LINK
> +	0b000100	RET_SUB
> +	0b000111	RET_EXCPT
> +	0b001000	COND_DIR
> +	0b100001	DEBUG_HALT
> +	0b100010	CALL
> +	0b100011	TRAP
> +	0b100100	SERROR
> +	0b100110	INST_DEBUG
> +	0b100111	DATA_DEBUG
> +	0b101010	ALGN_FAULT
> +	0b101011	INST_FAULT
> +	0b101100	DATA_FAULT
> +	0b101110	IRQ
> +	0b101111	FIQ
> +	0b111001	DEBUG_EXIT
> +EndEnum

Same comments as for BRBINFx_EL1.TYPE

> +Enum	7:0		NUMREC
> +	0b1000		8
> +	0b10000		16
> +	0b100000	32
> +	0b1000000	64

Could we please pad these to the same width, i.e. have

	0b0001000	8
	0b0010000	16
	0b0100000	32
	0b1000000	64

That way it's much easier to see how these compare to one another, and it
matches the usual style.

Otherwise, I see the ARM ARM lists these in hex, and using that would also be
fine, e.g.

	0x08		8
	0x10		16
	0x20		32
	0x40		64

> +EndEnum
> +EndSysreg

Thanks,
Mark.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH V11 03/10] arm64/perf: Add branch stack support in struct arm_pmu
  2023-05-31  4:04 ` [PATCH V11 03/10] arm64/perf: Add branch stack support in struct arm_pmu Anshuman Khandual
@ 2023-06-05  7:58   ` Mark Rutland
  2023-06-06  4:47     ` Anshuman Khandual
  0 siblings, 1 reply; 48+ messages in thread
From: Mark Rutland @ 2023-06-05  7:58 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: linux-arm-kernel, linux-kernel, will, catalin.marinas,
	Mark Brown, James Clark, Rob Herring, Marc Zyngier,
	Suzuki Poulose, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, linux-perf-users

On Wed, May 31, 2023 at 09:34:21AM +0530, Anshuman Khandual wrote:
> This updates 'struct arm_pmu' for branch stack sampling support later. This
> adds a new 'features' element in the structure to track supported features,
> and another 'private' element to encapsulate implementation attributes on a
> given 'struct arm_pmu'. These updates here will help in tracking any branch
> stack sampling support, which is being added later. This also adds a helper
> arm_pmu_branch_stack_supported().
> 
> This also enables perf branch stack sampling event on all 'struct arm pmu',
> supporting the feature but after removing the current gate that blocks such
> events unconditionally in armpmu_event_init(). Instead a quick probe can be
> initiated via arm_pmu_branch_stack_supported() to ascertain the support.
> 
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Will Deacon <will@kernel.org>
> Cc: Mark Rutland <mark.rutland@arm.com>
> Cc: linux-arm-kernel@lists.infradead.org
> Cc: linux-kernel@vger.kernel.org
> Tested-by: James Clark <james.clark@arm.com>
> Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
> ---
>  drivers/perf/arm_pmu.c       |  3 +--
>  include/linux/perf/arm_pmu.h | 12 +++++++++++-
>  2 files changed, 12 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/perf/arm_pmu.c b/drivers/perf/arm_pmu.c
> index aada47e3b126..d4a4f2bd89a5 100644
> --- a/drivers/perf/arm_pmu.c
> +++ b/drivers/perf/arm_pmu.c
> @@ -510,8 +510,7 @@ static int armpmu_event_init(struct perf_event *event)
>  		!cpumask_test_cpu(event->cpu, &armpmu->supported_cpus))
>  		return -ENOENT;
>  
> -	/* does not support taken branch sampling */
> -	if (has_branch_stack(event))
> +	if (has_branch_stack(event) && !arm_pmu_branch_stack_supported(armpmu))
>  		return -EOPNOTSUPP;
>  
>  	return __hw_perf_event_init(event);
> diff --git a/include/linux/perf/arm_pmu.h b/include/linux/perf/arm_pmu.h
> index f7fbd162ca4c..0da745eaf426 100644
> --- a/include/linux/perf/arm_pmu.h
> +++ b/include/linux/perf/arm_pmu.h
> @@ -102,7 +102,9 @@ struct arm_pmu {
>  	int		(*map_event)(struct perf_event *event);
>  	void		(*sched_task)(struct perf_event_pmu_context *pmu_ctx, bool sched_in);
>  	int		num_events;
> -	bool		secure_access; /* 32-bit ARM only */
> +	unsigned int	secure_access	: 1, /* 32-bit ARM only */
> +			has_branch_stack: 1, /* 64-bit ARM only */
> +			reserved	: 30;
>  #define ARMV8_PMUV3_MAX_COMMON_EVENTS		0x40
>  	DECLARE_BITMAP(pmceid_bitmap, ARMV8_PMUV3_MAX_COMMON_EVENTS);
>  #define ARMV8_PMUV3_EXT_COMMON_EVENT_BASE	0x4000
> @@ -118,8 +120,16 @@ struct arm_pmu {
>  
>  	/* Only to be used by ACPI probing code */
>  	unsigned long acpi_cpuid;
> +
> +	/* Implementation specific attributes */
> +	void		*private;
>  };
>  
> +static inline bool arm_pmu_branch_stack_supported(struct arm_pmu *armpmu)
> +{
> +	return armpmu->has_branch_stack;
> +}

Since this is a trivial test, and we already access the 'secure_access' field
directly, I'd prefer we removed this helper and directly accessesed
arm_pmu::has_branch_stack, e.g. with the logic in armpmu_event_init() being:

	if (has_branch_stack(event) && !armpmu->has_branch_stack)
		return -EOPNOTSUPP;

With that:

Acked-by: Mark Rutland <mark.rutland@arm.com>

Mark.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH V11 04/10] arm64/perf: Add branch stack support in struct pmu_hw_events
  2023-05-31  4:04 ` [PATCH V11 04/10] arm64/perf: Add branch stack support in struct pmu_hw_events Anshuman Khandual
@ 2023-06-05  8:00   ` Mark Rutland
  0 siblings, 0 replies; 48+ messages in thread
From: Mark Rutland @ 2023-06-05  8:00 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: linux-arm-kernel, linux-kernel, will, catalin.marinas,
	Mark Brown, James Clark, Rob Herring, Marc Zyngier,
	Suzuki Poulose, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, linux-perf-users

On Wed, May 31, 2023 at 09:34:22AM +0530, Anshuman Khandual wrote:
> This adds branch records buffer pointer in 'struct pmu_hw_events' which can
> be used to capture branch records during PMU interrupt. This percpu pointer
> here needs to be allocated first before usage.
> 
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Will Deacon <will@kernel.org>
> Cc: Mark Rutland <mark.rutland@arm.com>
> Cc: linux-arm-kernel@lists.infradead.org
> Cc: linux-kernel@vger.kernel.org
> Tested-by: James Clark <james.clark@arm.com>
> Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>

Acked-by: Mark Rutland <mark.rutland@arm.com>

> ---
>  include/linux/perf/arm_pmu.h | 9 +++++++++
>  1 file changed, 9 insertions(+)
> 
> diff --git a/include/linux/perf/arm_pmu.h b/include/linux/perf/arm_pmu.h
> index 0da745eaf426..694b241e456c 100644
> --- a/include/linux/perf/arm_pmu.h
> +++ b/include/linux/perf/arm_pmu.h
> @@ -44,6 +44,13 @@ static_assert((PERF_EVENT_FLAG_ARCH & ARMPMU_EVT_47BIT) == ARMPMU_EVT_47BIT);
>  	},								\
>  }
>  
> +#define MAX_BRANCH_RECORDS 64
> +
> +struct branch_records {
> +	struct perf_branch_stack	branch_stack;
> +	struct perf_branch_entry	branch_entries[MAX_BRANCH_RECORDS];
> +};
> +
>  /* The events for a given PMU register set. */
>  struct pmu_hw_events {
>  	/*
> @@ -70,6 +77,8 @@ struct pmu_hw_events {
>  	struct arm_pmu		*percpu_pmu;
>  
>  	int irq;
> +
> +	struct branch_records	*branches;
>  };
>  
>  enum armpmu_attr_groups {
> -- 
> 2.25.1
> 

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH V11 05/10] arm64/perf: Add branch stack support in ARMV8 PMU
  2023-05-31  4:04 ` [PATCH V11 05/10] arm64/perf: Add branch stack support in ARMV8 PMU Anshuman Khandual
  2023-06-02  2:33   ` Namhyung Kim
@ 2023-06-05 12:05   ` Mark Rutland
  2023-06-06 10:34     ` Anshuman Khandual
  1 sibling, 1 reply; 48+ messages in thread
From: Mark Rutland @ 2023-06-05 12:05 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: linux-arm-kernel, linux-kernel, will, catalin.marinas,
	Mark Brown, James Clark, Rob Herring, Marc Zyngier,
	Suzuki Poulose, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, linux-perf-users

On Wed, May 31, 2023 at 09:34:23AM +0530, Anshuman Khandual wrote:
> This enables support for branch stack sampling event in ARMV8 PMU, checking
> has_branch_stack() on the event inside 'struct arm_pmu' callbacks. Although
> these branch stack helpers armv8pmu_branch_XXXXX() are just dummy functions
> for now. While here, this also defines arm_pmu's sched_task() callback with
> armv8pmu_sched_task(), which resets the branch record buffer on a sched_in.

This generally looks good, but I have a few comments below.

[...]

> +static inline bool armv8pmu_branch_valid(struct perf_event *event)
> +{
> +	WARN_ON_ONCE(!has_branch_stack(event));
> +	return false;
> +}

IIUC this is for validating the attr, so could we please name this
armv8pmu_branch_attr_valid() ?

[...]

> +static int branch_records_alloc(struct arm_pmu *armpmu)
> +{
> +	struct pmu_hw_events *events;
> +	int cpu;
> +
> +	for_each_possible_cpu(cpu) {
> +		events = per_cpu_ptr(armpmu->hw_events, cpu);
> +		events->branches = kzalloc(sizeof(struct branch_records), GFP_KERNEL);
> +		if (!events->branches)
> +			return -ENOMEM;
> +	}
> +	return 0;

This leaks memory if any allocation fails, and the next patch replaces this
code entirely.

Please add this once in a working state. Either use the percpu allocation
trick in the next patch from the start, or have this kzalloc() with a
corresponding kfree() in an error path.
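
For illustration, the kzalloc() option with cleanup might look something
like this (untested sketch; branch_records_free() is a made-up helper name):

| static void branch_records_free(struct arm_pmu *armpmu)
| {
| 	int cpu;
| 
| 	for_each_possible_cpu(cpu) {
| 		struct pmu_hw_events *events = per_cpu_ptr(armpmu->hw_events, cpu);
| 
| 		kfree(events->branches);
| 		events->branches = NULL;
| 	}
| }
| 
| static int branch_records_alloc(struct arm_pmu *armpmu)
| {
| 	int cpu;
| 
| 	for_each_possible_cpu(cpu) {
| 		struct pmu_hw_events *events = per_cpu_ptr(armpmu->hw_events, cpu);
| 
| 		events->branches = kzalloc(sizeof(struct branch_records), GFP_KERNEL);
| 		if (!events->branches) {
| 			branch_records_free(armpmu);
| 			return -ENOMEM;
| 		}
| 	}
| 
| 	return 0;
| }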

>  }
>  
>  static int armv8pmu_probe_pmu(struct arm_pmu *cpu_pmu)
> @@ -1145,12 +1162,24 @@ static int armv8pmu_probe_pmu(struct arm_pmu *cpu_pmu)
>  	};
>  	int ret;
>  
> +	ret = armv8pmu_private_alloc(cpu_pmu);
> +	if (ret)
> +		return ret;
> +
>  	ret = smp_call_function_any(&cpu_pmu->supported_cpus,
>  				    __armv8pmu_probe_pmu,
>  				    &probe, 1);
>  	if (ret)
>  		return ret;
>  
> +	if (arm_pmu_branch_stack_supported(cpu_pmu)) {
> +		ret = branch_records_alloc(cpu_pmu);
> +		if (ret)
> +			return ret;
> +	} else {
> +		armv8pmu_private_free(cpu_pmu);
> +	}

I see from the next patch that "private" is four ints, so please just add that
to struct arm_pmu under an ifdef CONFIG_ARM64_BRBE. That'll simplify this, and
if we end up needing more space in future we can consider factoring it out.
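
i.e. something like the following (illustrative only; the field names are
assumptions taken from the brbe_hw_attr fields in the next patch):

| struct arm_pmu {
| 	...
| #ifdef CONFIG_ARM64_BRBE
| 	/* BRBE attributes probed from BRBIDR0_EL1 */
| 	int	brbe_version;
| 	int	brbe_format;
| 	int	brbe_cc;
| 	int	brbe_nr;
| #endif
| 	...
| };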

> +
>  	return probe.present ? 0 : -ENODEV;
>  }

It also seems odd to check probe.present *after* checking
arm_pmu_branch_stack_supported().

With the allocation removed I think this can be written more clearly as:

| static int armv8pmu_probe_pmu(struct arm_pmu *cpu_pmu)
| {
|         struct armv8pmu_probe_info probe = {
|                 .pmu = cpu_pmu,
|                 .present = false,
|         };   
|         int ret; 
| 
|         ret = smp_call_function_any(&cpu_pmu->supported_cpus,
|                                     __armv8pmu_probe_pmu,
|                                     &probe, 1);
|         if (ret)
|                 return ret; 
| 
|         if (!probe.present)
|                 return -ENODEV;
| 
|         if (arm_pmu_branch_stack_supported(cpu_pmu))
|                 ret = branch_records_alloc(cpu_pmu);
|              
|         return ret; 
| }

Thanks,
Mark.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH V11 06/10] arm64/perf: Enable branch stack events via FEAT_BRBE
  2023-05-31  4:04 ` [PATCH V11 06/10] arm64/perf: Enable branch stack events via FEAT_BRBE Anshuman Khandual
  2023-06-02  1:45   ` Namhyung Kim
@ 2023-06-05 13:43   ` Mark Rutland
  2023-06-09  4:30     ` Anshuman Khandual
                       ` (2 more replies)
  1 sibling, 3 replies; 48+ messages in thread
From: Mark Rutland @ 2023-06-05 13:43 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: linux-arm-kernel, linux-kernel, will, catalin.marinas,
	Mark Brown, James Clark, Rob Herring, Marc Zyngier,
	Suzuki Poulose, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, linux-perf-users

On Wed, May 31, 2023 at 09:34:24AM +0530, Anshuman Khandual wrote:
> This enables branch stack sampling events in ARMV8 PMU, via an architecture
> feature FEAT_BRBE aka branch record buffer extension. This defines required
> branch helper functions pmuv8pmu_branch_XXXXX() and the implementation here
> is wrapped with a new config option CONFIG_ARM64_BRBE.

[...]

> +int armv8pmu_private_alloc(struct arm_pmu *arm_pmu)
> +{
> +	struct brbe_hw_attr *brbe_attr = kzalloc(sizeof(struct brbe_hw_attr), GFP_KERNEL);
> +
> +	if (!brbe_attr)
> +		return -ENOMEM;
> +
> +	arm_pmu->private = brbe_attr;
> +	return 0;
> +}
> +
> +void armv8pmu_private_free(struct arm_pmu *arm_pmu)
> +{
> +	kfree(arm_pmu->private);
> +}

As on the previous patch, I think these should go for now.

[...]

> +static int brbe_attributes_probe(struct arm_pmu *armpmu, u32 brbe)
> +{
> +	struct brbe_hw_attr *brbe_attr = (struct brbe_hw_attr *)armpmu->private;
> +	u64 brbidr = read_sysreg_s(SYS_BRBIDR0_EL1);
> +
> +	brbe_attr->brbe_version = brbe;
> +	brbe_attr->brbe_format = brbe_get_format(brbidr);
> +	brbe_attr->brbe_cc = brbe_get_cc_bits(brbidr);
> +	brbe_attr->brbe_nr = brbe_get_numrec(brbidr);

I think we can store the BRBIDR0_EL1 value directly in arm_pmu as a single
value, and extract the fields as required, like we do for PMMIDR.
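
For example (rough sketch; the 'reg_brbidr' field name is an assumption
mirroring reg_pmmir, and the SYS_FIELD_GET() usage assumes the generated
BRBIDR0_EL1 field masks from the earlier sysreg patch):

| 	/* In __armv8pmu_probe_pmu(), as with reg_pmmir */
| 	cpu_pmu->reg_brbidr = read_sysreg_s(SYS_BRBIDR0_EL1);
| 
| 	/* ... and extract fields where needed, e.g. */
| 	brbe_nr = SYS_FIELD_GET(BRBIDR0_EL1, NUMREC, cpu_pmu->reg_brbidr);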

[...]

> +static u64 branch_type_to_brbcr(int branch_type)
> +{
> +	u64 brbcr = BRBCR_EL1_DEFAULT_TS;
> +
> +	/*
> +	 * BRBE need not be paused on PMU interrupt while tracing only
> +	 * the user space, because it will automatically be inside the
> +	 * prohibited region. But even after PMU overflow occurs, the
> +	 * interrupt could still take many more cycles before it can
> +	 * be taken and by that time BRBE will have been overwritten.
> +	 * Let's enable pause on PMU interrupt mechanism even for user
> +	 * only traces.
> +	 */
> +	brbcr |= BRBCR_EL1_FZP;

I think this is trying to say that we *should* use FZP when sampling the
kernel (due to IRQ latency), and *can* safely use it when sampling userspace,
so it would be good to explain it that way around.

It's a bit unfortunate, because where this matters we'll always be losing some
branches either way, but I guess we don't have much say in the matter.

[...]

> +/*
> + * A branch record with BRBINFx_EL1.LASTFAILED set, implies that all
> + * preceding consecutive branch records, that were in a transaction
> + * (i.e their BRBINFx_EL1.TX set) have been aborted.
> + *
> + * Similarly BRBFCR_EL1.LASTFAILED set, indicate that all preceding
> + * consecutive branch records up to the last record, which were in a
> + * transaction (i.e their BRBINFx_EL1.TX set) have been aborted.
> + *
> + * --------------------------------- -------------------
> + * | 00 | BRBSRC | BRBTGT | BRBINF | | TX = 1 | LF = 0 | [TX success]
> + * --------------------------------- -------------------
> + * | 01 | BRBSRC | BRBTGT | BRBINF | | TX = 1 | LF = 0 | [TX success]
> + * --------------------------------- -------------------
> + * | 02 | BRBSRC | BRBTGT | BRBINF | | TX = 0 | LF = 0 |
> + * --------------------------------- -------------------
> + * | 03 | BRBSRC | BRBTGT | BRBINF | | TX = 1 | LF = 0 | [TX failed]
> + * --------------------------------- -------------------
> + * | 04 | BRBSRC | BRBTGT | BRBINF | | TX = 1 | LF = 0 | [TX failed]
> + * --------------------------------- -------------------
> + * | 05 | BRBSRC | BRBTGT | BRBINF | | TX = 0 | LF = 1 |
> + * --------------------------------- -------------------
> + * | .. | BRBSRC | BRBTGT | BRBINF | | TX = 0 | LF = 0 |
> + * --------------------------------- -------------------
> + * | 61 | BRBSRC | BRBTGT | BRBINF | | TX = 1 | LF = 0 | [TX failed]
> + * --------------------------------- -------------------
> + * | 62 | BRBSRC | BRBTGT | BRBINF | | TX = 1 | LF = 0 | [TX failed]
> + * --------------------------------- -------------------
> + * | 63 | BRBSRC | BRBTGT | BRBINF | | TX = 1 | LF = 0 | [TX failed]
> + * --------------------------------- -------------------
> + *
> + * BRBFCR_EL1.LASTFAILED == 1
> + *
> + * BRBFCR_EL1.LASTFAILED fails all those consecutive, in transaction
> + * branches records near the end of the BRBE buffer.
> + *
> + * Architecture does not guarantee a non transaction (TX = 0) branch
> + * record between two different transactions. So it is possible that
> + * a subsequent lastfailed record (TX = 0, LF = 1) might erroneously
> + * mark more than required transactions as aborted.
> + */

Linux doesn't currently support TME (and IIUC no-one has built it), so can't we
delete the transaction handling for now? We can add a comment with somehing like:

/*
 * TODO: add transaction handling for TME.
 */

Assuming no-one has built TME, we might also be able to get an architectural
fix to disambiguate the boundary between two transactions, and avoid the
problem described above.

[...]

> +void armv8pmu_branch_read(struct pmu_hw_events *cpuc, struct perf_event *event)
> +{
> +	struct brbe_hw_attr *brbe_attr = (struct brbe_hw_attr *)cpuc->percpu_pmu->private;
> +	u64 brbfcr, brbcr;
> +	int idx, loop1_idx1, loop1_idx2, loop2_idx1, loop2_idx2, count;
> +
> +	brbcr = read_sysreg_s(SYS_BRBCR_EL1);
> +	brbfcr = read_sysreg_s(SYS_BRBFCR_EL1);
> +
> +	/* Ensure pause on PMU interrupt is enabled */
> +	WARN_ON_ONCE(!(brbcr & BRBCR_EL1_FZP));
> +
> +	/* Pause the buffer */
> +	write_sysreg_s(brbfcr | BRBFCR_EL1_PAUSED, SYS_BRBFCR_EL1);
> +	isb();
> +
> +	/* Determine the indices for each loop */
> +	loop1_idx1 = BRBE_BANK0_IDX_MIN;
> +	if (brbe_attr->brbe_nr <= BRBE_BANK_MAX_ENTRIES) {
> +		loop1_idx2 = brbe_attr->brbe_nr - 1;
> +		loop2_idx1 = BRBE_BANK1_IDX_MIN;
> +		loop2_idx2 = BRBE_BANK0_IDX_MAX;
> +	} else {
> +		loop1_idx2 = BRBE_BANK0_IDX_MAX;
> +		loop2_idx1 = BRBE_BANK1_IDX_MIN;
> +		loop2_idx2 = brbe_attr->brbe_nr - 1;
> +	}
> +
> +	/* Loop through bank 0 */
> +	select_brbe_bank(BRBE_BANK_IDX_0);
> +	for (idx = 0, count = loop1_idx1; count <= loop1_idx2; idx++, count++) {
> +		if (!capture_branch_entry(cpuc, event, idx))
> +			goto skip_bank_1;
> +	}
> +
> +	/* Loop through bank 1 */
> +	select_brbe_bank(BRBE_BANK_IDX_1);
> +	for (count = loop2_idx1; count <= loop2_idx2; idx++, count++) {
> +		if (!capture_branch_entry(cpuc, event, idx))
> +			break;
> +	}
> +
> +skip_bank_1:
> +	cpuc->branches->branch_stack.nr = idx;
> +	cpuc->branches->branch_stack.hw_idx = -1ULL;
> +	process_branch_aborts(cpuc);
> +
> +	/* Unpause the buffer */
> +	write_sysreg_s(brbfcr & ~BRBFCR_EL1_PAUSED, SYS_BRBFCR_EL1);
> +	isb();
> +	armv8pmu_branch_reset();
> +}

The loop indicies are rather difficult to follow, and I think those can be made
quite a lot simpler if split out, e.g.

| int __armv8pmu_branch_read(struct pmu_hw_events *cpuc, struct perf_event *event)
| {
| 	struct brbe_hw_attr *brbe_attr = (struct brbe_hw_attr *)cpuc->percpu_pmu->private;
| 	int nr_hw_entries = brbe_attr->brbe_nr;
| 	int idx = 0;
| 
| 	select_brbe_bank(BRBE_BANK_IDX_0);
| 	while (idx < nr_hw_entries && idx <= BRBE_BANK0_IDX_MAX) {
| 		if (!capture_branch_entry(cpuc, event, idx))
| 			return idx;
| 		idx++;
| 	}
| 
| 	select_brbe_bank(BRBE_BANK_IDX_1);
| 	while (idx < nr_hw_entries && idx <= BRBE_BANK1_IDX_MAX) {
| 		if (!capture_branch_entry(cpuc, event, idx))
| 			return idx;
| 		idx++;
| 	}
| 
| 	return idx;
| }
| 
| void armv8pmu_branch_read(struct pmu_hw_events *cpuc, struct perf_event *event)
| {
| 	u64 brbfcr, brbcr;
| 	int nr;
| 
| 	brbcr = read_sysreg_s(SYS_BRBCR_EL1);
| 	brbfcr = read_sysreg_s(SYS_BRBFCR_EL1);
| 
| 	/* Ensure pause on PMU interrupt is enabled */
| 	WARN_ON_ONCE(!(brbcr & BRBCR_EL1_FZP));
| 
| 	/* Pause the buffer */
| 	write_sysreg_s(brbfcr | BRBFCR_EL1_PAUSED, SYS_BRBFCR_EL1);
| 	isb();
| 
| 	nr = __armv8pmu_branch_read(cpuc, event);
| 
| 	cpuc->branches->branch_stack.nr = nr;
| 	cpuc->branches->branch_stack.hw_idx = -1ULL;
| 	process_branch_aborts(cpuc);
| 
| 	/* Unpause the buffer */
| 	write_sysreg_s(brbfcr & ~BRBFCR_EL1_PAUSED, SYS_BRBFCR_EL1);
| 	isb();
| 	armv8pmu_branch_reset();
| }

Looking at <linux/perf_event.h> I see:

| /*
|  * branch stack layout:
|  *  nr: number of taken branches stored in entries[]
|  *  hw_idx: The low level index of raw branch records
|  *          for the most recent branch.
|  *          -1ULL means invalid/unknown.
|  *
|  * Note that nr can vary from sample to sample
|  * branches (to, from) are stored from most recent
|  * to least recent, i.e., entries[0] contains the most
|  * recent branch.
|  * The entries[] is an abstraction of raw branch records,
|  * which may not be stored in age order in HW, e.g. Intel LBR.
|  * The hw_idx is to expose the low level index of raw
|  * branch record for the most recent branch aka entries[0].
|  * The hw_idx index is between -1 (unknown) and max depth,
|  * which can be retrieved in /sys/devices/cpu/caps/branches.
|  * For the architectures whose raw branch records are
|  * already stored in age order, the hw_idx should be 0.
|  */
| struct perf_branch_stack {
|         __u64                           nr;  
|         __u64                           hw_idx;
|         struct perf_branch_entry        entries[];
| };

... which seems to indicate we should be setting hw_idx to 0, since IIUC our
records are in age order.

[...]

> @@ -1142,14 +1146,25 @@ static void __armv8pmu_probe_pmu(void *info)
>  
>  static int branch_records_alloc(struct arm_pmu *armpmu)
>  {
> +	struct branch_records __percpu *tmp_alloc_ptr;
> +	struct branch_records *records;
>  	struct pmu_hw_events *events;
>  	int cpu;
>  
> +	tmp_alloc_ptr = alloc_percpu_gfp(struct branch_records, GFP_KERNEL);
> +	if (!tmp_alloc_ptr)
> +		return -ENOMEM;
> +
> +	/*
> +	 * FIXME: Memory allocated via tmp_alloc_ptr gets completely
> +	 * consumed here, never required to be freed up later. Hence
> +	 * losing access to the on-stack 'tmp_alloc_ptr' is acceptable.
> +	 * Otherwise this alloc handle has to be saved somewhere.
> +	 */
>  	for_each_possible_cpu(cpu) {
>  		events = per_cpu_ptr(armpmu->hw_events, cpu);
> -		events->branches = kzalloc(sizeof(struct branch_records), GFP_KERNEL);
> -		if (!events->branches)
> -			return -ENOMEM;
> +		records = per_cpu_ptr(tmp_alloc_ptr, cpu);
> +		events->branches = records;
>  	}
>  	return 0;
>  }

As on a prior patch, I think either this should be the approach from the start,
or we should have cleanup for the kzalloc, and either way this should not be a
part of this patch.

If you use the approach in this patch, please rename "tmp_alloc_ptr" for
clarity, and move the temporaries into the loop, e.g.

| static int branch_records_alloc(struct arm_pmu *armpmu)
| {
| 	struct branch_records __percpu *records;
| 	int cpu;
| 
| 	records = alloc_percpu_gfp(struct branch_records, GFP_KERNEL);
| 	if (!records)
| 		return -ENOMEM;
| 
| 	for_each_possible_cpu(cpu) {
| 		struct pmu_hw_events *events_cpu;
| 		struct branch_records *records_cpu;
| 
| 		events_cpu = per_cpu_ptr(armpmu->hw_events, cpu);
| 		records_cpu = per_cpu_ptr(records, cpu);
| 		events_cpu->branches = records_cpu;
| 	}
|	
| 	return 0;
| }

Thanks,
Mark.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH V11 08/10] arm64/perf: Add struct brbe_regset helper functions
  2023-06-05  3:14     ` Anshuman Khandual
@ 2023-06-05 23:49       ` Namhyung Kim
  0 siblings, 0 replies; 48+ messages in thread
From: Namhyung Kim @ 2023-06-05 23:49 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: linux-arm-kernel, linux-kernel, will, catalin.marinas,
	mark.rutland, Mark Brown, James Clark, Rob Herring, Marc Zyngier,
	Suzuki Poulose, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, linux-perf-users

On Sun, Jun 4, 2023 at 8:15 PM Anshuman Khandual
<anshuman.khandual@arm.com> wrote:
>
>
>
> On 6/2/23 08:10, Namhyung Kim wrote:
> > On Tue, May 30, 2023 at 9:15 PM Anshuman Khandual
> > <anshuman.khandual@arm.com> wrote:
> >>
> >> The primary abstraction level for fetching branch records from BRBE HW has
> >> been changed as 'struct brbe_regset', which contains storage for all three
> >> BRBE registers i.e BRBSRC, BRBTGT, BRBINF. Whether branch record processing
> >> happens in the task sched out path, or in the PMU IRQ handling path, these
> >> registers need to be extracted from the HW. Afterwards both live and stored
> >> sets need to be stitched together to create final branch records set. This
> >> adds required helper functions for such operations.
> >>
> >> Cc: Catalin Marinas <catalin.marinas@arm.com>
> >> Cc: Will Deacon <will@kernel.org>
> >> Cc: Mark Rutland <mark.rutland@arm.com>
> >> Cc: linux-arm-kernel@lists.infradead.org
> >> Cc: linux-kernel@vger.kernel.org
> >> Tested-by: James Clark <james.clark@arm.com>
> >> Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
> >> ---
> >
> > [SNIP]
> >> +
> >> +static inline void copy_brbe_regset(struct brbe_regset *src, int src_idx,
> >> +                                   struct brbe_regset *dst, int dst_idx)
> >> +{
> >> +       dst[dst_idx].brbinf = src[src_idx].brbinf;
> >> +       dst[dst_idx].brbsrc = src[src_idx].brbsrc;
> >> +       dst[dst_idx].brbtgt = src[src_idx].brbtgt;
> >> +}
> >> +
> >> +/*
> >> + * This function concatenates branch records from stored and live buffer
> >> + * up to maximum nr_max records and the stored buffer holds the resultant
> >> + * buffer. The concatenated buffer contains all the branch records from
> >> + * the live buffer but might contain some from stored buffer considering
> >> + * the maximum combined length does not exceed 'nr_max'.
> >> + *
> >> + *     Stored records  Live records
> >> + *     ------------------------------------------------^
> >> + *     |       S0      |       L0      |       Newest  |
> >> + *     ---------------------------------               |
> >> + *     |       S1      |       L1      |               |
> >> + *     ---------------------------------               |
> >> + *     |       S2      |       L2      |               |
> >> + *     ---------------------------------               |
> >> + *     |       S3      |       L3      |               |
> >> + *     ---------------------------------               |
> >> + *     |       S4      |       L4      |               nr_max
> >> + *     ---------------------------------               |
> >> + *     |               |       L5      |               |
> >> + *     ---------------------------------               |
> >> + *     |               |       L6      |               |
> >> + *     ---------------------------------               |
> >> + *     |               |       L7      |               |
> >> + *     ---------------------------------               |
> >> + *     |               |               |               |
> >> + *     ---------------------------------               |
> >> + *     |               |               |       Oldest  |
> >> + *     ------------------------------------------------V
> >> + *
> >> + *
> >> + * S0 is the newest in the stored records, whereas L7 is the oldest in
> >> + * the live records. Unless the live buffer is detected as being full
> >> + * thus potentially dropping off some older records, L7 and S0 records
> >> + * are contiguous in time for a user task context. The stitched buffer
> >> + * here represents maximum possible branch records, contiguous in time.
> >> + *
> >> + *     Stored records  Live records
> >> + *     ------------------------------------------------^
> >> + *     |       L0      |       L0      |       Newest  |
> >> + *     ---------------------------------               |
> >> + *     |       L1      |       L1      |               |
> >> + *     ---------------------------------               |
> >> + *     |       L2      |       L2      |               |
> >> + *     ---------------------------------               |
> >> + *     |       L3      |       L3      |               |
> >> + *     ---------------------------------               |
> >> + *     |       L4      |       L4      |             nr_max
> >> + *     ---------------------------------               |
> >> + *     |       L5      |       L5      |               |
> >> + *     ---------------------------------               |
> >> + *     |       L6      |       L6      |               |
> >> + *     ---------------------------------               |
> >> + *     |       L7      |       L7      |               |
> >> + *     ---------------------------------               |
> >> + *     |       S0      |               |               |
> >> + *     ---------------------------------               |
> >> + *     |       S1      |               |    Oldest     |
> >> + *     ------------------------------------------------V
> >> + *     |       S2      | <----|
> >> + *     -----------------      |
> >> + *     |       S3      | <----| Dropped off after nr_max
> >> + *     -----------------      |
> >> + *     |       S4      | <----|
> >> + *     -----------------
> >> + */
> >> +static int stitch_stored_live_entries(struct brbe_regset *stored,
> >> +                                     struct brbe_regset *live,
> >> +                                     int nr_stored, int nr_live,
> >> +                                     int nr_max)
> >> +{
> >> +       int nr_total, nr_excess, nr_last, i;
> >> +
> >> +       nr_total = nr_stored + nr_live;
> >> +       nr_excess = nr_total - nr_max;
> >> +
> >> +       /* Stored branch records in stitched buffer */
> >> +       if (nr_live == nr_max)
> >> +               nr_stored = 0;
> >> +       else if (nr_excess > 0)
> >> +               nr_stored -= nr_excess;
> >> +
> >> +       /* Stitched buffer branch records length */
> >> +       if (nr_total > nr_max)
> >> +               nr_last = nr_max;
> >> +       else
> >> +               nr_last = nr_total;
> >> +
> >> +       /* Move stored branch records */
> >> +       for (i = 0; i < nr_stored; i++)
> >> +               copy_brbe_regset(stored, i, stored, nr_last - nr_stored - 1 + i);
> >
> > I'm afraid it can overwrite some entries if nr_live is small
> > and nr_stored is big.  Why not use memmove()?
>
> nr_stored is first adjusted with nr_excess if both live and stored entries combined
> exceed the maximum branch records in the HW. I am wondering how it can overwrite?

Say nr_stored = 40 and nr_live = 20, wouldn't it copy stored[0] to stored[20]?
Then stored[20:39] will be lost.  Also I'm not sure "-1" is correct.
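
For illustration, a memmove() based variant might look something like this
(untested sketch; it folds the nr_excess adjustment into 'nr_last - nr_live'):

static int stitch_stored_live_entries(struct brbe_regset *stored,
				      struct brbe_regset *live,
				      int nr_stored, int nr_live,
				      int nr_max)
{
	int nr_last = min(nr_stored + nr_live, nr_max);

	/* Stored records surviving the nr_max cap, placed after live ones */
	nr_stored = nr_last - nr_live;

	/* Shift surviving stored records past the live ones; regions overlap */
	memmove(&stored[nr_live], &stored[0], nr_stored * sizeof(*stored));

	/* Live records are the newest and go first */
	memcpy(&stored[0], &live[0], nr_live * sizeof(*live));

	return nr_last;
}
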

>
> >
> > Also I think it'd be simpler if you copy store to live.
> > It'll save copying live in the IRQ but it will copy the
> > whole content to store again for the sched switch.
>
> But how is that better than the current scheme?

I guess normally the live buffer is full, so it can skip
the copy and use the buffer directly for IRQ, right?

Thanks,
Namhyung

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH V11 02/10] arm64/perf: Add BRBE registers and fields
  2023-06-05  7:55   ` Mark Rutland
@ 2023-06-06  4:27     ` Anshuman Khandual
  0 siblings, 0 replies; 48+ messages in thread
From: Anshuman Khandual @ 2023-06-06  4:27 UTC (permalink / raw)
  To: Mark Rutland
  Cc: linux-arm-kernel, linux-kernel, will, catalin.marinas,
	Mark Brown, James Clark, Rob Herring, Marc Zyngier,
	Suzuki Poulose, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, linux-perf-users



On 6/5/23 13:25, Mark Rutland wrote:
> Hi Anshuman,

Hello Mark,

> 
> This looks good to me, with some minor nits on enum value naming and field
> formatting.

Okay

> 
> On Wed, May 31, 2023 at 09:34:20AM +0530, Anshuman Khandual wrote:
>> This adds BRBE related register definitions and various other related field
>> macros therein. These will be used subsequently in a BRBE driver which is
>> being added later on.
>>
>> Cc: Catalin Marinas <catalin.marinas@arm.com>
>> Cc: Will Deacon <will@kernel.org>
>> Cc: Marc Zyngier <maz@kernel.org>
>> Cc: Mark Rutland <mark.rutland@arm.com>
>> Cc: linux-arm-kernel@lists.infradead.org
>> Cc: linux-kernel@vger.kernel.org
>> Tested-by: James Clark <james.clark@arm.com>
>> Reviewed-by: Mark Brown <broonie@kernel.org>
>> Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
>> ---
>>  arch/arm64/include/asm/sysreg.h | 103 +++++++++++++++++++++
>>  arch/arm64/tools/sysreg         | 159 ++++++++++++++++++++++++++++++++
>>  2 files changed, 262 insertions(+)
>>
>> diff --git a/arch/arm64/include/asm/sysreg.h b/arch/arm64/include/asm/sysreg.h
>> index e72d9aaab6b1..12419c55d3b7 100644
>> --- a/arch/arm64/include/asm/sysreg.h
>> +++ b/arch/arm64/include/asm/sysreg.h
>> @@ -165,6 +165,109 @@
>>  #define SYS_DBGDTRTX_EL0		sys_reg(2, 3, 0, 5, 0)
>>  #define SYS_DBGVCR32_EL2		sys_reg(2, 4, 0, 7, 0)
>>  
>> +#define __SYS_BRBINFO(n)		sys_reg(2, 1, 8, ((n) & 0xf), ((((n) & 0x10) >> 2) + 0))
>> +#define __SYS_BRBSRC(n)			sys_reg(2, 1, 8, ((n) & 0xf), ((((n) & 0x10) >> 2) + 1))
>> +#define __SYS_BRBTGT(n)			sys_reg(2, 1, 8, ((n) & 0xf), ((((n) & 0x10) >> 2) + 2))
> 
> These look correct to me per ARM DDI 0487J.a
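> 
> e.g. a quick sketch of the expansion with n = 17 (assuming the usual
> sys_reg() argument order of op0, op1, CRn, CRm, op2):
> 
> 	__SYS_BRBSRC(17) == sys_reg(2, 1, 8, 1, 5)	/* CRm = n[3:0], op2 = (n[4] << 2) + 1 */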
> 
>> diff --git a/arch/arm64/tools/sysreg b/arch/arm64/tools/sysreg
>> index c9a0d1fa3209..44745f42262f 100644
>> --- a/arch/arm64/tools/sysreg
>> +++ b/arch/arm64/tools/sysreg
>> @@ -947,6 +947,165 @@ UnsignedEnum	3:0	BT
>>  EndEnum
>>  EndSysreg
>>  
>> +
>> +SysregFields BRBINFx_EL1
>> +Res0	63:47
>> +Field	46	CCU
>> +Field	45:32	CC
>> +Res0	31:18
>> +Field	17	LASTFAILED
>> +Field	16	T
>> +Res0	15:14
>> +Enum	13:8		TYPE
>> +	0b000000	UNCOND_DIR
>> +	0b000001	INDIR
>> +	0b000010	DIR_LINK
>> +	0b000011	INDIR_LINK
> 
> For clarity, I'd prefer that we use "DIRECT" and "INDIRECT" in full for each of
> these, i.e.
> 
> 	0b000000        UNCOND_DIRECT
> 	0b000001	INDIRECT
> 	0b000010	DIRECT_LINK
> 	0b000011	INDIRECT_LINK

Okay, will change these as required.

> 
>> +	0b000101	RET_SUB
>> +	0b000111	RET_EXCPT
> 
> Similarly, I'm not keen on the suffixes here.
> 
> I think these would be clearer as "RET" and "ERET", as those are short and
> unambiguous, and I think the alternative of spelling out "RET_SUBROUTINE" and
> "RET_EXCEPTION" is overly verbose.

Sure, will change as RET and ERET.
 
> 
>> +	0b001000	COND_DIR
> 
> As with above, I'd prefer "COND_DIRECT" here.

Okay, will change this as required.

> 
>> +	0b100001	DEBUG_HALT
>> +	0b100010	CALL
>> +	0b100011	TRAP
>> +	0b100100	SERROR
>> +	0b100110	INST_DEBUG
> 
> We generally use 'insn' rather than 'inst', so I'd prefer s/INST/INSN/ here.
> 
>> +	0b100111	DATA_DEBUG
>> +	0b101010	ALGN_FAULT
> 
> s/ALGN/ALIGN/
> 
>> +	0b101011	INST_FAULT
> 
> As above, I'd prefer "INSN_FAULT" here, though I'm confused that the
> architecture doesn't use "abort" naming for this.

Sure, will change as required, but please note that the INST/ALGN suffixes
have also been used to define the generic ABI. Although it should not
be a problem as such.

include/uapi/linux/perf_event.h

enum {
        PERF_BR_NEW_FAULT_ALGN          = 0,    /* Alignment fault */
        PERF_BR_NEW_FAULT_DATA          = 1,    /* Data fault */
        PERF_BR_NEW_FAULT_INST          = 2,    /* Inst fault */
        PERF_BR_NEW_ARCH_1              = 3,    /* Architecture specific */
        PERF_BR_NEW_ARCH_2              = 4,    /* Architecture specific */
        PERF_BR_NEW_ARCH_3              = 5,    /* Architecture specific */
        PERF_BR_NEW_ARCH_4              = 6,    /* Architecture specific */
        PERF_BR_NEW_ARCH_5              = 7,    /* Architecture specific */
        PERF_BR_NEW_MAX,
};

> 
>> +	0b101100	DATA_FAULT
>> +	0b101110	IRQ
>> +	0b101111	FIQ
>> +	0b111001	DEBUG_EXIT
>> +EndEnum
> 
> [...]
> 
>> +Sysreg	BRBCR_EL1	2	1	9	0	0
>> +Res0	63:24
>> +Field	23 	EXCEPTION
>> +Field	22 	ERTN
>> +Res0	21:9
>> +Field	8 	FZP
>> +Res0	7
>> +Enum	6:5	TS
>> +	0b01	VIRTUAL
>> +	0b10	GST_PHYSICAL
> 
> s/GST/GUEST/

Okay

> 
>> +	0b11	PHYSICAL
>> +EndEnum
>> +Field	4	MPRED
>> +Field	3	CC
>> +Res0	2
>> +Field	1	E1BRE
>> +Field	0	E0BRE
>> +EndSysreg
> 
> [...]
> 
>> +Sysreg	BRBINFINJ_EL1	2	1	9	1	0
>> +Res0	63:47
>> +Field	46	CCU
>> +Field	45:32	CC
>> +Res0	31:18
>> +Field	17	LASTFAILED
>> +Field	16	T
>> +Res0	15:14
>> +Enum	13:8		TYPE
>> +	0b000000	UNCOND_DIR
>> +	0b000001	INDIR
>> +	0b000010	DIR_LINK
>> +	0b000011	INDIR_LINK
>> +	0b000101	RET_SUB
>> +	0b000111	RET_EXCPT
>> +	0b001000	COND_DIR
>> +	0b100001	DEBUG_HALT
>> +	0b100010	CALL
>> +	0b100011	TRAP
>> +	0b100100	SERROR
>> +	0b100110	INST_DEBUG
>> +	0b100111	DATA_DEBUG
>> +	0b101010	ALGN_FAULT
>> +	0b101011	INST_FAULT
>> +	0b101100	DATA_FAULT
>> +	0b101110	IRQ
>> +	0b101111	FIQ
>> +	0b111001	DEBUG_EXIT
>> +EndEnum
> 
> Same comments as for BRBINFx_EL1.TYPE

Done.

> 
>> +Enum	7:0		NUMREC
>> +	0b1000		8
>> +	0b10000		16
>> +	0b100000	32
>> +	0b1000000	64
> 
> Could we please pad these to the same width, i.e. have
> 
> 	0b0001000	8
> 	0b0010000	16
> 	0b0100000	32
> 	0b1000000	64
> 
> That way it's much easier to see how these compare to one another, and it
> matches the usual style.

Sure, will add the padding.

> 
> Otherwise, I see the ARM ARM lists these in hex, and using that would also be
> fine, e.g.
> 
> 	0x08		8
> 	0x10		16
> 	0x20		32
> 	0x40		64
> 
>> +EndEnum
>> +EndSysreg
> 
> Thanks,
> Mark.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH V11 03/10] arm64/perf: Add branch stack support in struct arm_pmu
  2023-06-05  7:58   ` Mark Rutland
@ 2023-06-06  4:47     ` Anshuman Khandual
  0 siblings, 0 replies; 48+ messages in thread
From: Anshuman Khandual @ 2023-06-06  4:47 UTC (permalink / raw)
  To: Mark Rutland
  Cc: linux-arm-kernel, linux-kernel, will, catalin.marinas,
	Mark Brown, James Clark, Rob Herring, Marc Zyngier,
	Suzuki Poulose, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, linux-perf-users



On 6/5/23 13:28, Mark Rutland wrote:
> On Wed, May 31, 2023 at 09:34:21AM +0530, Anshuman Khandual wrote:
>> This updates 'struct arm_pmu' for branch stack sampling support later. This
>> adds a new 'features' element in the structure to track supported features,
>> and another 'private' element to encapsulate implementation attributes on a
>> given 'struct arm_pmu'. These updates here will help in tracking any branch
>> stack sampling support, which is being added later. This also adds a helper
>> arm_pmu_branch_stack_supported().
>>
>> This also enables perf branch stack sampling events on every 'struct arm_pmu'
>> that supports the feature, after removing the current gate that blocks such
>> events unconditionally in armpmu_event_init(). Instead a quick probe can be
>> initiated via arm_pmu_branch_stack_supported() to ascertain the support.
>>
>> Cc: Catalin Marinas <catalin.marinas@arm.com>
>> Cc: Will Deacon <will@kernel.org>
>> Cc: Mark Rutland <mark.rutland@arm.com>
>> Cc: linux-arm-kernel@lists.infradead.org
>> Cc: linux-kernel@vger.kernel.org
>> Tested-by: James Clark <james.clark@arm.com>
>> Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
>> ---
>>  drivers/perf/arm_pmu.c       |  3 +--
>>  include/linux/perf/arm_pmu.h | 12 +++++++++++-
>>  2 files changed, 12 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/perf/arm_pmu.c b/drivers/perf/arm_pmu.c
>> index aada47e3b126..d4a4f2bd89a5 100644
>> --- a/drivers/perf/arm_pmu.c
>> +++ b/drivers/perf/arm_pmu.c
>> @@ -510,8 +510,7 @@ static int armpmu_event_init(struct perf_event *event)
>>  		!cpumask_test_cpu(event->cpu, &armpmu->supported_cpus))
>>  		return -ENOENT;
>>  
>> -	/* does not support taken branch sampling */
>> -	if (has_branch_stack(event))
>> +	if (has_branch_stack(event) && !arm_pmu_branch_stack_supported(armpmu))
>>  		return -EOPNOTSUPP;
>>  
>>  	return __hw_perf_event_init(event);
>> diff --git a/include/linux/perf/arm_pmu.h b/include/linux/perf/arm_pmu.h
>> index f7fbd162ca4c..0da745eaf426 100644
>> --- a/include/linux/perf/arm_pmu.h
>> +++ b/include/linux/perf/arm_pmu.h
>> @@ -102,7 +102,9 @@ struct arm_pmu {
>>  	int		(*map_event)(struct perf_event *event);
>>  	void		(*sched_task)(struct perf_event_pmu_context *pmu_ctx, bool sched_in);
>>  	int		num_events;
>> -	bool		secure_access; /* 32-bit ARM only */
>> +	unsigned int	secure_access	: 1, /* 32-bit ARM only */
>> +			has_branch_stack: 1, /* 64-bit ARM only */
>> +			reserved	: 30;
>>  #define ARMV8_PMUV3_MAX_COMMON_EVENTS		0x40
>>  	DECLARE_BITMAP(pmceid_bitmap, ARMV8_PMUV3_MAX_COMMON_EVENTS);
>>  #define ARMV8_PMUV3_EXT_COMMON_EVENT_BASE	0x4000
>> @@ -118,8 +120,16 @@ struct arm_pmu {
>>  
>>  	/* Only to be used by ACPI probing code */
>>  	unsigned long acpi_cpuid;
>> +
>> +	/* Implementation specific attributes */
>> +	void		*private;
>>  };
>>  
>> +static inline bool arm_pmu_branch_stack_supported(struct arm_pmu *armpmu)
>> +{
>> +	return armpmu->has_branch_stack;
>> +}
> 
> Since this is a trivial test, and we already access the 'secure_access' field
> directly, I'd prefer we removed this helper and directly accessesed
> arm_pmu::has_branch_stack, e.g. with the logic in armpmu_event_init() being:
> 
> 	if (has_branch_stack(event) && !armpmu->has_branch_stack)
> 		return -EOPNOTSUPP;

Sure, will drop the helper and change as suggested in all the call sites.

> 
> With that:
> 
> Acked-by: Mark Rutland <mark.rutland@arm.com>
> 
> Mark.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH V11 05/10] arm64/perf: Add branch stack support in ARMV8 PMU
  2023-06-05 12:05   ` Mark Rutland
@ 2023-06-06 10:34     ` Anshuman Khandual
  2023-06-06 10:41       ` Mark Rutland
  2023-06-08 10:13       ` Suzuki K Poulose
  0 siblings, 2 replies; 48+ messages in thread
From: Anshuman Khandual @ 2023-06-06 10:34 UTC (permalink / raw)
  To: Mark Rutland
  Cc: linux-arm-kernel, linux-kernel, will, catalin.marinas,
	Mark Brown, James Clark, Rob Herring, Marc Zyngier,
	Suzuki Poulose, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, linux-perf-users



On 6/5/23 17:35, Mark Rutland wrote:
> On Wed, May 31, 2023 at 09:34:23AM +0530, Anshuman Khandual wrote:
>> This enables support for branch stack sampling event in ARMV8 PMU, checking
>> has_branch_stack() on the event inside 'struct arm_pmu' callbacks. Although
>> these branch stack helpers armv8pmu_branch_XXXXX() are just dummy functions
>> for now. While here, this also defines arm_pmu's sched_task() callback with
>> armv8pmu_sched_task(), which resets the branch record buffer on a sched_in.
> 
> This generally looks good, but I have a few comments below.
> 
> [...]
> 
>> +static inline bool armv8pmu_branch_valid(struct perf_event *event)
>> +{
>> +	WARN_ON_ONCE(!has_branch_stack(event));
>> +	return false;
>> +}
> 
> IIUC this is for validating the attr, so could we please name this
> armv8pmu_branch_attr_valid() ?

Sure, will change the name and update the call sites.
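
i.e. the unchanged stub, just renamed:

static inline bool armv8pmu_branch_attr_valid(struct perf_event *event)
{
	WARN_ON_ONCE(!has_branch_stack(event));
	return false;
}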

> 
> [...]
> 
>> +static int branch_records_alloc(struct arm_pmu *armpmu)
>> +{
>> +	struct pmu_hw_events *events;
>> +	int cpu;
>> +
>> +	for_each_possible_cpu(cpu) {
>> +		events = per_cpu_ptr(armpmu->hw_events, cpu);
>> +		events->branches = kzalloc(sizeof(struct branch_records), GFP_KERNEL);
>> +		if (!events->branches)
>> +			return -ENOMEM;
>> +	}
>> +	return 0;
> 
> This leaks memory if any allocation fails, and the next patch replaces this
> code entirely.

Okay.

> 
> Please add this once in a working state. Either use the percpu allocation
> trick in the next patch from the start, or have this kzalloc() with a
> corresponding kfree() in an error path.

I will change branch_records_alloc() as suggested in the next patch's thread
and fold those changes here in this patch.

> 
>>  }
>>  
>>  static int armv8pmu_probe_pmu(struct arm_pmu *cpu_pmu)
>> @@ -1145,12 +1162,24 @@ static int armv8pmu_probe_pmu(struct arm_pmu *cpu_pmu)
>>  	};
>>  	int ret;
>>  
>> +	ret = armv8pmu_private_alloc(cpu_pmu);
>> +	if (ret)
>> +		return ret;
>> +
>>  	ret = smp_call_function_any(&cpu_pmu->supported_cpus,
>>  				    __armv8pmu_probe_pmu,
>>  				    &probe, 1);
>>  	if (ret)
>>  		return ret;
>>  
>> +	if (arm_pmu_branch_stack_supported(cpu_pmu)) {
>> +		ret = branch_records_alloc(cpu_pmu);
>> +		if (ret)
>> +			return ret;
>> +	} else {
>> +		armv8pmu_private_free(cpu_pmu);
>> +	}
> 
> I see from the next patch that "private" is four ints, so please just add that
> to struct arm_pmu under an ifdef CONFIG_ARM64_BRBE. That'll simplify this, and
> if we end up needing more space in future we can consider factoring it out.

struct arm_pmu {
	........................................
        /* Implementation specific attributes */
        void            *private;
}

private pointer here creates an abstraction for given pmu implementation
to hide attribute details without making it known to core arm pmu layer.
Although adding ifdef CONFIG_ARM64_BRBE solves the problem as mentioned
above, it does break that abstraction. Currently arm_pmu layer is aware
about 'branch records' but not about BRBE in particular which the driver
adds later on. I suggest we should not break that abstraction.

Instead a global 'static struct brbe_hw_attr' in drivers/perf/arm_brbe.c
can be initialized into arm_pmu->private during armv8pmu_branch_probe(),
which will also solve the allocation-free problem. Also similar helpers
armv8pmu_task_ctx_alloc()/free() could be defined to manage task context
cache i.e arm_pmu->pmu.task_ctx_cache independently.

But now armv8pmu_task_ctx_alloc() can be called after pmu probe confirms
to have arm_pmu->has_branch_stack.
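
A rough sketch of that proposal (untested; brbe_hw_attr comes from the BRBE
driver patch, while the task context struct and the cache name here are
just assumed for illustration):

static struct brbe_hw_attr brbe_hw_attr;

void armv8pmu_branch_probe(struct arm_pmu *armpmu)
{
	/* ... probe BRBE and fill in brbe_hw_attr ... */
	armpmu->private = &brbe_hw_attr;
}

static int armv8pmu_task_ctx_alloc(struct arm_pmu *armpmu)
{
	armpmu->pmu.task_ctx_cache = kmem_cache_create("arm64_brbe_task_ctx",
					sizeof(struct arm64_perf_task_context),
					0, 0, NULL);
	if (!armpmu->pmu.task_ctx_cache)
		return -ENOMEM;
	return 0;
}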

> 
>> +
>>  	return probe.present ? 0 : -ENODEV;
>>  }
> 
> It also seems odd to check probe.present *after* checking
> arm_pmu_branch_stack_supported().

I will reorganize as suggested below.

> 
> With the allocation removed I think this can be written more clearly as:
> 
> | static int armv8pmu_probe_pmu(struct arm_pmu *cpu_pmu)
> | {
> |         struct armv8pmu_probe_info probe = {
> |                 .pmu = cpu_pmu,
> |                 .present = false,
> |         };   
> |         int ret; 
> | 
> |         ret = smp_call_function_any(&cpu_pmu->supported_cpus,
> |                                     __armv8pmu_probe_pmu,
> |                                     &probe, 1);
> |         if (ret)
> |                 return ret;
> |
> |         if (!probe.present)
> |                 return -ENODEV;
> | 
> |         if (arm_pmu_branch_stack_supported(cpu_pmu))
> |                 ret = branch_records_alloc(cpu_pmu);
> |              
> |         return ret; 
> | }

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH V11 05/10] arm64/perf: Add branch stack support in ARMV8 PMU
  2023-06-06 10:34     ` Anshuman Khandual
@ 2023-06-06 10:41       ` Mark Rutland
  2023-06-08 10:13       ` Suzuki K Poulose
  1 sibling, 0 replies; 48+ messages in thread
From: Mark Rutland @ 2023-06-06 10:41 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: linux-arm-kernel, linux-kernel, will, catalin.marinas,
	Mark Brown, James Clark, Rob Herring, Marc Zyngier,
	Suzuki Poulose, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, linux-perf-users

On Tue, Jun 06, 2023 at 04:04:25PM +0530, Anshuman Khandual wrote:
> On 6/5/23 17:35, Mark Rutland wrote:
> > On Wed, May 31, 2023 at 09:34:23AM +0530, Anshuman Khandual wrote:
> >>  static int armv8pmu_probe_pmu(struct arm_pmu *cpu_pmu)
> >> @@ -1145,12 +1162,24 @@ static int armv8pmu_probe_pmu(struct arm_pmu *cpu_pmu)
> >>  	};
> >>  	int ret;
> >>  
> >> +	ret = armv8pmu_private_alloc(cpu_pmu);
> >> +	if (ret)
> >> +		return ret;
> >> +
> >>  	ret = smp_call_function_any(&cpu_pmu->supported_cpus,
> >>  				    __armv8pmu_probe_pmu,
> >>  				    &probe, 1);
> >>  	if (ret)
> >>  		return ret;
> >>  
> >> +	if (arm_pmu_branch_stack_supported(cpu_pmu)) {
> >> +		ret = branch_records_alloc(cpu_pmu);
> >> +		if (ret)
> >> +			return ret;
> >> +	} else {
> >> +		armv8pmu_private_free(cpu_pmu);
> >> +	}
> > 
> > I see from the next patch that "private" is four ints, so please just add that
> > to struct arm_pmu under an ifdef CONFIG_ARM64_BRBE. That'll simplify this, and
> > if we end up needing more space in future we can consider factoring it out.
> 
> struct arm_pmu {
> 	........................................
>         /* Implementation specific attributes */
>         void            *private;
> }
> 
> private pointer here creates an abstraction for given pmu implementation
> to hide attribute details without making it known to core arm pmu layer.
> Although adding ifdef CONFIG_ARM64_BRBE solves the problem as mentioned
> above, it does break that abstraction. Currently arm_pmu layer is aware
> about 'branch records' but not about BRBE in particular which the driver
> adds later on. I suggest we should not break that abstraction.

I understand the rationale, but I think it's simpler for now to break that
abstraction. We can always refactor it later.

> Instead a global 'static struct brbe_hw_attr' in drivers/perf/arm_brbe.c
> can be initialized into arm_pmu->private during armv8pmu_branch_probe(),
> which will also solve the allocation-free problem. 

IIUC that's not going to work for big.LITTLE systems where the BRBE support
varies, as we need this data per arm_pmu.

> Also similar helpers armv8pmu_task_ctx_alloc()/free() could be defined to
> manage task context cache i.e arm_pmu->pmu.task_ctx_cache independently.
> 
> But now armv8pmu_task_ctx_alloc() can be called after pmu probe confirms
> to have arm_pmu->has_branch_stack.

I think those are different, and should be kept.

Thanks,
Mark.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH V11 05/10] arm64/perf: Add branch stack support in ARMV8 PMU
  2023-06-06 10:34     ` Anshuman Khandual
  2023-06-06 10:41       ` Mark Rutland
@ 2023-06-08 10:13       ` Suzuki K Poulose
  2023-06-09  4:00         ` Anshuman Khandual
  2023-06-09  7:14         ` Anshuman Khandual
  1 sibling, 2 replies; 48+ messages in thread
From: Suzuki K Poulose @ 2023-06-08 10:13 UTC (permalink / raw)
  To: Anshuman Khandual, Mark Rutland
  Cc: linux-arm-kernel, linux-kernel, will, catalin.marinas,
	Mark Brown, James Clark, Rob Herring, Marc Zyngier,
	Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	linux-perf-users

On 06/06/2023 11:34, Anshuman Khandual wrote:
> 
> 
> On 6/5/23 17:35, Mark Rutland wrote:
>> On Wed, May 31, 2023 at 09:34:23AM +0530, Anshuman Khandual wrote:
>>> This enables support for branch stack sampling event in ARMV8 PMU, checking
>>> has_branch_stack() on the event inside 'struct arm_pmu' callbacks. Although
>>> these branch stack helpers armv8pmu_branch_XXXXX() are just dummy functions
>>> for now. While here, this also defines arm_pmu's sched_task() callback with
>>> armv8pmu_sched_task(), which resets the branch record buffer on a sched_in.
>>
>> This generally looks good, but I have a few comments below.
>>
>> [...]
>>
>>> +static inline bool armv8pmu_branch_valid(struct perf_event *event)
>>> +{
>>> +	WARN_ON_ONCE(!has_branch_stack(event));
>>> +	return false;
>>> +}
>>
>> IIUC this is for validating the attr, so could we please name this
>> armv8pmu_branch_attr_valid() ?
> 
> Sure, will change the name and update the call sites.
> 
>>
>> [...]
>>
>>> +static int branch_records_alloc(struct arm_pmu *armpmu)
>>> +{
>>> +	struct pmu_hw_events *events;
>>> +	int cpu;
>>> +
>>> +	for_each_possible_cpu(cpu) {

Shouldn't this be supported_cpus ? i.e.
	for_each_cpu(cpu, &armpmu->supported_cpus) {


>>> +		events = per_cpu_ptr(armpmu->hw_events, cpu);
>>> +		events->branches = kzalloc(sizeof(struct branch_records), GFP_KERNEL);
>>> +		if (!events->branches)
>>> +			return -ENOMEM;

Do we need to free the allocated branches already ?

>>> +	}


May be:
	int ret = 0;

	for_each_cpu(cpu, &armpmu->supported_cpus) {
		events = per_cpu_ptr(armpmu->hw_events, cpu);
		events->branches = kzalloc(sizeof(struct branch_records), GFP_KERNEL);
		
		if (!events->branches) {
			ret = -ENOMEM;
			break;
		}
	}

	if (!ret)
		return 0;

	for_each_cpu(cpu, &armpmu->supported_cpus) {
		events = per_cpu_ptr(armpmu->hw_events, cpu);
		if (!events->branches)
			break;
		kfree(events->branches);
	}
	return ret;
	
>>> +	return 0;
>>
>> This leaks memory if any allocation fails, and the next patch replaces this
>> code entirely.
> 
> Okay.
> 
>>
>> Please add this once in a working state. Either use the percpu allocation
>> trick in the next patch from the start, or have this kzalloc() with a
>> corresponding kfree() in an error path.
> 
> I will change branch_records_alloc() as suggested in the next patch's thread
> and fold those changes here in this patch.
> 
>>
>>>   }
>>>   
>>>   static int armv8pmu_probe_pmu(struct arm_pmu *cpu_pmu)
>>> @@ -1145,12 +1162,24 @@ static int armv8pmu_probe_pmu(struct arm_pmu *cpu_pmu)
>>>   	};
>>>   	int ret;
>>>   
>>> +	ret = armv8pmu_private_alloc(cpu_pmu);
>>> +	if (ret)
>>> +		return ret;
>>> +
>>>   	ret = smp_call_function_any(&cpu_pmu->supported_cpus,
>>>   				    __armv8pmu_probe_pmu,
>>>   				    &probe, 1);
>>>   	if (ret)
>>>   		return ret;
>>>   
>>> +	if (arm_pmu_branch_stack_supported(cpu_pmu)) {
>>> +		ret = branch_records_alloc(cpu_pmu);
>>> +		if (ret)
>>> +			return ret;
>>> +	} else {
>>> +		armv8pmu_private_free(cpu_pmu);
>>> +	}
>>
>> I see from the next patch that "private" is four ints, so please just add that
>> to struct arm_pmu under an ifdef CONFIG_ARM64_BRBE. That'll simplify this, and
>> if we end up needing more space in future we can consider factoring it out.
> 
> struct arm_pmu {
> 	........................................
>          /* Implementation specific attributes */
>          void            *private;
> }
> 
> private pointer here creates an abstraction for given pmu implementation
> to hide attribute details without making it known to core arm pmu layer.
> Although adding ifdef CONFIG_ARM64_BRBE solves the problem as mentioned
> above, it does break that abstraction. Currently arm_pmu layer is aware
> about 'branch records' but not about BRBE in particular which the driver
> adds later on. I suggest we should not break that abstraction.
> 
> Instead a global 'static struct brbe_hw_attr' in drivers/perf/arm_brbe.c
> can be initialized into arm_pmu->private during armv8pmu_branch_probe(),
> which will also solve the allocation-free problem. Also similar helpers
> armv8pmu_task_ctx_alloc()/free() could be defined to manage task context
> cache i.e arm_pmu->pmu.task_ctx_cache independently.
> 
> But now armv8pmu_task_ctx_alloc() can be called after pmu probe confirms
> to have arm_pmu->has_branch_stack.
> 
>>
>>> +
>>>   	return probe.present ? 0 : -ENODEV;
>>>   }
>>
>> It also seems odd to check probe.present *after* checking
>> arm_pmu_branch_stack_supported().
> 
> I will reorganize as suggested below.
> 
>>
>> With the allocation removed I think this can be written more clearly as:
>>
>> | static int armv8pmu_probe_pmu(struct arm_pmu *cpu_pmu)
>> | {
>> |         struct armv8pmu_probe_info probe = {
>> |                 .pmu = cpu_pmu,
>> |                 .present = false,
>> |         };
>> |         int ret;
>> |
>> |         ret = smp_call_function_any(&cpu_pmu->supported_cpus,
>> |                                     __armv8pmu_probe_pmu,
>> |                                     &probe, 1);
>> |         if (ret)
>> |                 return ret;
>> |
>> |         if (!probe.present)
>> |                 return -ENODEV;
>> |
>> |         if (arm_pmu_branch_stack_supported(cpu_pmu))
>> |                 ret = branch_records_alloc(cpu_pmu);
>> |
>> |         return ret;
>> | }

Could we not simplify this as below and keep the abstraction, since we
already have it ?

 >> | static int armv8pmu_probe_pmu(struct arm_pmu *cpu_pmu)
 >> | {
 >> |         struct armv8pmu_probe_info probe = {
 >> |                 .pmu = cpu_pmu,
 >> |                 .present = false,
 >> |         };
 >> |         int ret;
 >> |
 >> |         ret = smp_call_function_any(&cpu_pmu->supported_cpus,
 >> |                                     __armv8pmu_probe_pmu,
 >> |                                     &probe, 1);
 >> |         if (ret)
 >> |                 return ret;
 >> |         if (!probe.present)
 >> |                 return -ENODEV;
 >> |
 >> |         if (!arm_pmu_branch_stack_supported(cpu_pmu))
 >> |                 return 0;
 >> |
 >> |         ret = armv8pmu_private_alloc(cpu_pmu);
 >> |         if (ret)
 >> |                 return ret;
 >> |
 >> |         ret = branch_records_alloc(cpu_pmu);
 >> |         if (ret)
 >> |                 armv8pmu_private_free(cpu_pmu);
 >> |
 >> |         return ret;
 >> | }


Suzuki


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH V11 05/10] arm64/perf: Add branch stack support in ARMV8 PMU
  2023-06-08 10:13       ` Suzuki K Poulose
@ 2023-06-09  4:00         ` Anshuman Khandual
  2023-06-09  9:54           ` Suzuki K Poulose
  2023-06-09  7:14         ` Anshuman Khandual
  1 sibling, 1 reply; 48+ messages in thread
From: Anshuman Khandual @ 2023-06-09  4:00 UTC (permalink / raw)
  To: Suzuki K Poulose, Mark Rutland
  Cc: linux-arm-kernel, linux-kernel, will, catalin.marinas,
	Mark Brown, James Clark, Rob Herring, Marc Zyngier,
	Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	linux-perf-users



On 6/8/23 15:43, Suzuki K Poulose wrote:
> On 06/06/2023 11:34, Anshuman Khandual wrote:
>>
>>
>> On 6/5/23 17:35, Mark Rutland wrote:
>>> On Wed, May 31, 2023 at 09:34:23AM +0530, Anshuman Khandual wrote:
>>>> This enables support for branch stack sampling event in ARMV8 PMU, checking
>>>> has_branch_stack() on the event inside 'struct arm_pmu' callbacks. Although
>>>> these branch stack helpers armv8pmu_branch_XXXXX() are just dummy functions
>>>> for now. While here, this also defines arm_pmu's sched_task() callback with
>>>> armv8pmu_sched_task(), which resets the branch record buffer on a sched_in.
>>>
>>> This generally looks good, but I have a few comments below.
>>>
>>> [...]
>>>
>>>> +static inline bool armv8pmu_branch_valid(struct perf_event *event)
>>>> +{
>>>> +    WARN_ON_ONCE(!has_branch_stack(event));
>>>> +    return false;
>>>> +}
>>>
>>> IIUC this is for validating the attr, so could we please name this
>>> armv8pmu_branch_attr_valid() ?
>>
>> Sure, will change the name and update the call sites.
>>
>>>
>>> [...]
>>>
>>>> +static int branch_records_alloc(struct arm_pmu *armpmu)
>>>> +{
>>>> +    struct pmu_hw_events *events;
>>>> +    int cpu;
>>>> +
>>>> +    for_each_possible_cpu(cpu) {
> 
> Shouldn't this be supported_cpus ? i.e.
>     for_each_cpu(cpu, &armpmu->supported_cpus) {
> 
> 
>>>> +        events = per_cpu_ptr(armpmu->hw_events, cpu);
>>>> +        events->branches = kzalloc(sizeof(struct branch_records), GFP_KERNEL);
>>>> +        if (!events->branches)
>>>> +            return -ENOMEM;
> 
> Do we need to free the allocated branches already ?

This gets fixed in the next patch via per-cpu allocation. I will
move and fold that code block in here. The updated function will
look like the following.

static int branch_records_alloc(struct arm_pmu *armpmu)
{
        struct branch_records __percpu *records;
        int cpu;

        records = alloc_percpu_gfp(struct branch_records, GFP_KERNEL);
        if (!records)
                return -ENOMEM;

        /*
         * FIXME: Memory allocated via 'records' gets completely
         * consumed here and is never required to be freed up later.
         * Hence losing access to the on stack 'records' handle is
         * acceptable. Otherwise this alloc handle would have to be
         * saved somewhere.
         */
        for_each_possible_cpu(cpu) {
                struct pmu_hw_events *events_cpu;
                struct branch_records *records_cpu;

                events_cpu = per_cpu_ptr(armpmu->hw_events, cpu);
                records_cpu = per_cpu_ptr(records, cpu);
                events_cpu->branches = records_cpu;
        }
        return 0;
}

Regarding the cpumask argument in for_each_cpu().

- hw_events is a __percpu pointer in struct arm_pmu

	- pmu->hw_events = alloc_percpu_gfp(struct pmu_hw_events, GFP_KERNEL)


- 'records' above is being allocated via alloc_percpu_gfp()

	- records = alloc_percpu_gfp(struct branch_records, GFP_KERNEL)

If the 'armpmu->supported_cpus' mask gets used instead of the possible cpu mask,
would not there be some dangling per-cpu branch_record allocated areas
that remain unassigned ? Assigning all of them back into hw_events should
be harmless.

> 
>>>> +    }
> 
> 
> May be:
>     int ret = 0;
> 
>     for_each_cpu(cpu, &armpmu->supported_cpus) {
>         events = per_cpu_ptr(armpmu->hw_events, cpu);
>         events->branches = kzalloc(sizeof(struct branch_records), GFP_KERNEL);
>        
>         if (!events->branches) {
>             ret = -ENOMEM;
>             break;
>         }
>     }
> 
>     if (!ret)
>         return 0;
> 
>     for_each_cpu(cpu, &armpmu->supported_cpus) {
>         events = per_cpu_ptr(armpmu->hw_events, cpu);
>         if (!events->branches)
>             break;
>         kfree(events->branches);
>     }
>     return ret;
>     
>>>> +    return 0;
>>>
>>> This leaks memory if any allocation fails, and the next patch replaces this
>>> code entirely.
>>
>> Okay.
>>
>>>
>>> Please add this once in a working state. Either use the percpu allocation
>>> trick in the next patch from the start, or have this kzalloc() with a
>>> corresponding kfree() in an error path.
>>
>> I will change branch_records_alloc() as suggested in the next patch's thread
>> and fold those changes here in this patch.
>>
>>>
>>>>   }
>>>>     static int armv8pmu_probe_pmu(struct arm_pmu *cpu_pmu)
>>>> @@ -1145,12 +1162,24 @@ static int armv8pmu_probe_pmu(struct arm_pmu *cpu_pmu)
>>>>       };
>>>>       int ret;
>>>>   +    ret = armv8pmu_private_alloc(cpu_pmu);
>>>> +    if (ret)
>>>> +        return ret;
>>>> +
>>>>       ret = smp_call_function_any(&cpu_pmu->supported_cpus,
>>>>                       __armv8pmu_probe_pmu,
>>>>                       &probe, 1);
>>>>       if (ret)
>>>>           return ret;
>>>>   +    if (arm_pmu_branch_stack_supported(cpu_pmu)) {
>>>> +        ret = branch_records_alloc(cpu_pmu);
>>>> +        if (ret)
>>>> +            return ret;
>>>> +    } else {
>>>> +        armv8pmu_private_free(cpu_pmu);
>>>> +    }
>>>
>>> I see from the next patch that "private" is four ints, so please just add that
>>> to struct arm_pmu under an ifdef CONFIG_ARM64_BRBE. That'll simplify this, and
>>> if we end up needing more space in future we can consider factoring it out.
>>
>> struct arm_pmu {
>>     ........................................
>>          /* Implementation specific attributes */
>>          void            *private;
>> }
>>
>> private pointer here creates an abstraction for given pmu implementation
>> to hide attribute details without making it known to core arm pmu layer.
>> Although adding ifdef CONFIG_ARM64_BRBE solves the problem as mentioned
>> above, it does break that abstraction. Currently arm_pmu layer is aware
>> about 'branch records' but not about BRBE in particular which the driver
>> adds later on. I suggest we should not break that abstraction.
>>
>> Instead a global 'static struct brbe_hw_attr' in drivers/perf/arm_brbe.c
>> can be initialized into arm_pmu->private during armv8pmu_branch_probe(),
>> which will also solve the allocation-free problem. Also similar helpers
>> armv8pmu_task_ctx_alloc()/free() could be defined to manage task context
>> cache i.e arm_pmu->pmu.task_ctx_cache independently.
>>
>> But now armv8pmu_task_ctx_alloc() can be called after pmu probe confirms
>> to have arm_pmu->has_branch_stack.
>>
>>>
>>>> +
>>>>       return probe.present ? 0 : -ENODEV;
>>>>   }
>>>
>>> It also seems odd to check probe.present *after* checking
>>> arm_pmu_branch_stack_supported().
>>
>> I will reorganize as suggested below.
>>
>>>
>>> With the allocation removed I think this can be written more clearly as:
>>>
>>> | static int armv8pmu_probe_pmu(struct arm_pmu *cpu_pmu)
>>> | {
>>> |         struct armv8pmu_probe_info probe = {
>>> |                 .pmu = cpu_pmu,
>>> |                 .present = false,
>>> |         };
>>> |         int ret;
>>> |
>>> |         ret = smp_call_function_any(&cpu_pmu->supported_cpus,
>>> |                                     __armv8pmu_probe_pmu,
>>> |                                     &probe, 1);
>>> |         if (ret)
>>> |                 return ret;
>>> |
>>> |         if (!probe.present)
>>> |                 return -ENODEV;
>>> |
>>> |         if (arm_pmu_branch_stack_supported(cpu_pmu))
>>> |                 ret = branch_records_alloc(cpu_pmu);
>>> |
>>> |         return ret;
>>> | }
> 
> Could we not simplify this as below and keep the abstraction, since we
> already have it ?

No, there is an allocation dependency before the smp call context.
 
> 
>>> | static int armv8pmu_probe_pmu(struct arm_pmu *cpu_pmu)
>>> | {
>>> |         struct armv8pmu_probe_info probe = {
>>> |                 .pmu = cpu_pmu,
>>> |                 .present = false,
>>> |         };
>>> |         int ret;
>>> |
>>> |         ret = smp_call_function_any(&cpu_pmu->supported_cpus,
>>> |                                     __armv8pmu_probe_pmu,
>>> |                                     &probe, 1);
>>> |         if (ret)
>>> |                 return ret;
>>> |         if (!probe.present)
>>> |                 return -ENODEV;
>>> |
>>> |         if (!arm_pmu_branch_stack_supported(cpu_pmu))
>>> |                 return 0;
>>> |
>>> |         ret = armv8pmu_private_alloc(cpu_pmu);

This needs to be allocated before each supported PMU gets probed via
__armv8pmu_probe_pmu(), because the smp_call_function_any() callback
runs in a context that unfortunately cannot do memory allocation.
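
To illustrate the ordering constraint (sketch of the probe path as it
stands in this patch):

	/* process context - allocation can sleep */
	ret = armv8pmu_private_alloc(cpu_pmu);
	if (ret)
		return ret;

	/* remote callback runs with interrupts disabled - must not allocate */
	ret = smp_call_function_any(&cpu_pmu->supported_cpus,
				    __armv8pmu_probe_pmu, &probe, 1);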

>>> |         if (ret)
>>> |                 return ret;
>>> |
>>> |         ret = branch_records_alloc(cpu_pmu);
>>> |         if (ret)
>>> |                 armv8pmu_private_free(cpu_pmu);
>>> |
>>> |         return ret;
>>> | }

Changing the abstraction will cause too much code churn this late in the
development phase, which should be avoided IMHO.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH V11 06/10] arm64/perf: Enable branch stack events via FEAT_BRBE
  2023-06-05 13:43   ` Mark Rutland
@ 2023-06-09  4:30     ` Anshuman Khandual
  2023-06-09 12:37       ` Mark Rutland
  2023-06-09  4:47     ` Anshuman Khandual
  2023-06-09  5:22     ` Anshuman Khandual
  2 siblings, 1 reply; 48+ messages in thread
From: Anshuman Khandual @ 2023-06-09  4:30 UTC (permalink / raw)
  To: Mark Rutland
  Cc: linux-arm-kernel, linux-kernel, will, catalin.marinas,
	Mark Brown, James Clark, Rob Herring, Marc Zyngier,
	Suzuki Poulose, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, linux-perf-users



On 6/5/23 19:13, Mark Rutland wrote:
>> +static u64 branch_type_to_brbcr(int branch_type)
>> +{
>> +	u64 brbcr = BRBCR_EL1_DEFAULT_TS;
>> +
>> +	/*
>> +	 * BRBE need not be paused on PMU interrupt while tracing only
>> +	 * the user space, because it will automatically be inside the
>> +	 * prohibited region. But even after PMU overflow occurs, the
>> +	 * interrupt could still take much more cycles, before it can
>> +	 * be taken and by that time BRBE will have been overwritten.
>> +	 * Let's enable pause on PMU interrupt mechanism even for user
>> +	 * only traces.
>> +	 */
>> +	brbcr |= BRBCR_EL1_FZP;
> I think this is trying to say that we *should* use FZP when sampling the
> kernel (due to IRQ latency), and *can* safely use it when sampling userspace,
> so it would be good to explain it that way around.

Agreed. The following updated comment explains why we should enable FZP
when sampling the kernel, where otherwise BRBE would capture unwanted
records. It also explains why we should enable FZP even when sampling
only the user space, due to IRQ latency.

        /*
         * BRBE should be paused on PMU interrupt while tracing kernel
         * space to stop capturing further branch records. Otherwise
         * interrupt handler branch records might get into the samples
         * which is not desired.
         *
         * BRBE need not be paused on PMU interrupt while tracing only
         * the user space, because it will automatically be inside the
         * prohibited region. But even after PMU overflow occurs, the
         * interrupt could still take much more cycles, before it can
         * be taken and by that time BRBE will have been overwritten.
         * Hence enable pause on PMU interrupt mechanism even for user
         * only traces as well.
         */
        brbcr |= BRBCR_EL1_FZP;

> 
> It's a bit unfortunate, because where this matters we'll always be losing some
> branches either way, but I guess we don't have much say in the matter.




^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH V11 06/10] arm64/perf: Enable branch stack events via FEAT_BRBE
  2023-06-05 13:43   ` Mark Rutland
  2023-06-09  4:30     ` Anshuman Khandual
@ 2023-06-09  4:47     ` Anshuman Khandual
  2023-06-09 12:42       ` Mark Rutland
  2023-06-09  5:22     ` Anshuman Khandual
  2 siblings, 1 reply; 48+ messages in thread
From: Anshuman Khandual @ 2023-06-09  4:47 UTC (permalink / raw)
  To: Mark Rutland
  Cc: linux-arm-kernel, linux-kernel, will, catalin.marinas,
	Mark Brown, James Clark, Rob Herring, Marc Zyngier,
	Suzuki Poulose, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, linux-perf-users

On 6/5/23 19:13, Mark Rutland wrote:
>> +/*
>> + * A branch record with BRBINFx_EL1.LASTFAILED set, implies that all
>> + * preceding consecutive branch records, that were in a transaction
>> + * (i.e their BRBINFx_EL1.TX set) have been aborted.
>> + *
>> + * Similarly BRBFCR_EL1.LASTFAILED set, indicate that all preceding
>> + * consecutive branch records up to the last record, which were in a
>> + * transaction (i.e their BRBINFx_EL1.TX set) have been aborted.
>> + *
>> + * --------------------------------- -------------------
>> + * | 00 | BRBSRC | BRBTGT | BRBINF | | TX = 1 | LF = 0 | [TX success]
>> + * --------------------------------- -------------------
>> + * | 01 | BRBSRC | BRBTGT | BRBINF | | TX = 1 | LF = 0 | [TX success]
>> + * --------------------------------- -------------------
>> + * | 02 | BRBSRC | BRBTGT | BRBINF | | TX = 0 | LF = 0 |
>> + * --------------------------------- -------------------
>> + * | 03 | BRBSRC | BRBTGT | BRBINF | | TX = 1 | LF = 0 | [TX failed]
>> + * --------------------------------- -------------------
>> + * | 04 | BRBSRC | BRBTGT | BRBINF | | TX = 1 | LF = 0 | [TX failed]
>> + * --------------------------------- -------------------
>> + * | 05 | BRBSRC | BRBTGT | BRBINF | | TX = 0 | LF = 1 |
>> + * --------------------------------- -------------------
>> + * | .. | BRBSRC | BRBTGT | BRBINF | | TX = 0 | LF = 0 |
>> + * --------------------------------- -------------------
>> + * | 61 | BRBSRC | BRBTGT | BRBINF | | TX = 1 | LF = 0 | [TX failed]
>> + * --------------------------------- -------------------
>> + * | 62 | BRBSRC | BRBTGT | BRBINF | | TX = 1 | LF = 0 | [TX failed]
>> + * --------------------------------- -------------------
>> + * | 63 | BRBSRC | BRBTGT | BRBINF | | TX = 1 | LF = 0 | [TX failed]
>> + * --------------------------------- -------------------
>> + *
>> + * BRBFCR_EL1.LASTFAILED == 1
>> + *
>> + * BRBFCR_EL1.LASTFAILED fails all those consecutive, in transaction
>> + * branches records near the end of the BRBE buffer.
>> + *
>> + * Architecture does not guarantee a non transaction (TX = 0) branch
>> + * record between two different transactions. So it is possible that
>> + * a subsequent lastfailed record (TX = 0, LF = 1) might erroneously
>> + * mark more than required transactions as aborted.
>> + */
> Linux doesn't currently support TME (and IIUC no-one has built it), so can't we
> delete the transaction handling for now? We can add a comment with somehing like:
> 
> /*
>  * TODO: add transaction handling for TME.
>  */
> 
> Assuming no-one has built TME, we might also be able to get an architectural
> fix to disambiguate the boundary between two transactions, and avoid the
> problem described above.
> 
> [...]
> 

Or we can leave this unchanged for now, and then update it if and when the
relevant architectural fix comes in. The current TME branch records handling
here is as per the current architectural specification.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH V11 06/10] arm64/perf: Enable branch stack events via FEAT_BRBE
  2023-06-05 13:43   ` Mark Rutland
  2023-06-09  4:30     ` Anshuman Khandual
  2023-06-09  4:47     ` Anshuman Khandual
@ 2023-06-09  5:22     ` Anshuman Khandual
  2023-06-09 12:47       ` Mark Rutland
  2 siblings, 1 reply; 48+ messages in thread
From: Anshuman Khandual @ 2023-06-09  5:22 UTC (permalink / raw)
  To: Mark Rutland
  Cc: linux-arm-kernel, linux-kernel, will, catalin.marinas,
	Mark Brown, James Clark, Rob Herring, Marc Zyngier,
	Suzuki Poulose, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, linux-perf-users

[...]

On 6/5/23 19:13, Mark Rutland wrote:
>> +void armv8pmu_branch_read(struct pmu_hw_events *cpuc, struct perf_event *event)
>> +{
>> +	struct brbe_hw_attr *brbe_attr = (struct brbe_hw_attr *)cpuc->percpu_pmu->private;
>> +	u64 brbfcr, brbcr;
>> +	int idx, loop1_idx1, loop1_idx2, loop2_idx1, loop2_idx2, count;
>> +
>> +	brbcr = read_sysreg_s(SYS_BRBCR_EL1);
>> +	brbfcr = read_sysreg_s(SYS_BRBFCR_EL1);
>> +
>> +	/* Ensure pause on PMU interrupt is enabled */
>> +	WARN_ON_ONCE(!(brbcr & BRBCR_EL1_FZP));
>> +
>> +	/* Pause the buffer */
>> +	write_sysreg_s(brbfcr | BRBFCR_EL1_PAUSED, SYS_BRBFCR_EL1);
>> +	isb();
>> +
>> +	/* Determine the indices for each loop */
>> +	loop1_idx1 = BRBE_BANK0_IDX_MIN;
>> +	if (brbe_attr->brbe_nr <= BRBE_BANK_MAX_ENTRIES) {
>> +		loop1_idx2 = brbe_attr->brbe_nr - 1;
>> +		loop2_idx1 = BRBE_BANK1_IDX_MIN;
>> +		loop2_idx2 = BRBE_BANK0_IDX_MAX;
>> +	} else {
>> +		loop1_idx2 = BRBE_BANK0_IDX_MAX;
>> +		loop2_idx1 = BRBE_BANK1_IDX_MIN;
>> +		loop2_idx2 = brbe_attr->brbe_nr - 1;
>> +	}
>> +
>> +	/* Loop through bank 0 */
>> +	select_brbe_bank(BRBE_BANK_IDX_0);
>> +	for (idx = 0, count = loop1_idx1; count <= loop1_idx2; idx++, count++) {
>> +		if (!capture_branch_entry(cpuc, event, idx))
>> +			goto skip_bank_1;
>> +	}
>> +
>> +	/* Loop through bank 1 */
>> +	select_brbe_bank(BRBE_BANK_IDX_1);
>> +	for (count = loop2_idx1; count <= loop2_idx2; idx++, count++) {
>> +		if (!capture_branch_entry(cpuc, event, idx))
>> +			break;
>> +	}
>> +
>> +skip_bank_1:
>> +	cpuc->branches->branch_stack.nr = idx;
>> +	cpuc->branches->branch_stack.hw_idx = -1ULL;
>> +	process_branch_aborts(cpuc);
>> +
>> +	/* Unpause the buffer */
>> +	write_sysreg_s(brbfcr & ~BRBFCR_EL1_PAUSED, SYS_BRBFCR_EL1);
>> +	isb();
>> +	armv8pmu_branch_reset();
>> +}
> The loop indicies are rather difficult to follow, and I think those can be made
> quite a lot simpler if split out, e.g.
> 
> | int __armv8pmu_branch_read(struct pmu_hw_events *cpuc, struct perf_event *event)
> | {
> | 	struct brbe_hw_attr *brbe_attr = (struct brbe_hw_attr *)cpuc->percpu_pmu->private;
> | 	int nr_hw_entries = brbe_attr->brbe_nr;
> | 	int idx;

I guess idx needs an init to 0.

> | 
> | 	select_brbe_bank(BRBE_BANK_IDX_0);
> | 	while (idx < nr_hw_entries && idx < BRBE_BANK0_IDX_MAX) {
> | 		if (!capture_branch_entry(cpuc, event, idx))
> | 			return idx;
> | 		idx++;
> | 	}
> | 
> | 	select_brbe_bank(BRBE_BANK_IDX_1);
> | 	while (idx < nr_hw_entries && idx < BRBE_BANK1_IDX_MAX) {
> | 		if (!capture_branch_entry(cpuc, event, idx))
> | 			return idx;
> | 		idx++;
> | 	}
> | 
> | 	return idx;
> | }

These loops are better than the proposed one with indices, will update.

> | 
> | void armv8pmu_branch_read(struct pmu_hw_events *cpuc, struct perf_event *event)
> | {
> | 	u64 brbfcr, brbcr;
> | 	int nr;
> | 
> | 	brbcr = read_sysreg_s(SYS_BRBCR_EL1);
> | 	brbfcr = read_sysreg_s(SYS_BRBFCR_EL1);
> | 
> | 	/* Ensure pause on PMU interrupt is enabled */
> | 	WARN_ON_ONCE(!(brbcr & BRBCR_EL1_FZP));
> | 
> | 	/* Pause the buffer */
> | 	write_sysreg_s(brbfcr | BRBFCR_EL1_PAUSED, SYS_BRBFCR_EL1);
> | 	isb();
> | 
> | 	nr = __armv8pmu_branch_read(cpus, event);
> | 
> | 	cpuc->branches->branch_stack.nr = nr;
> | 	cpuc->branches->branch_stack.hw_idx = -1ULL;
> | 	process_branch_aborts(cpuc);
> | 
> | 	/* Unpause the buffer */
> | 	write_sysreg_s(brbfcr & ~BRBFCR_EL1_PAUSED, SYS_BRBFCR_EL1);
> | 	isb();
> | 	armv8pmu_branch_reset();
> | }
> 
> Looking at <linux/perf_event.h> I see:
> 
> | /*
> |  * branch stack layout:
> |  *  nr: number of taken branches stored in entries[]
> |  *  hw_idx: The low level index of raw branch records
> |  *          for the most recent branch.
> |  *          -1ULL means invalid/unknown.
> |  *
> |  * Note that nr can vary from sample to sample
> |  * branches (to, from) are stored from most recent
> |  * to least recent, i.e., entries[0] contains the most
> |  * recent branch.
> |  * The entries[] is an abstraction of raw branch records,
> |  * which may not be stored in age order in HW, e.g. Intel LBR.
> |  * The hw_idx is to expose the low level index of raw
> |  * branch record for the most recent branch aka entries[0].
> |  * The hw_idx index is between -1 (unknown) and max depth,
> |  * which can be retrieved in /sys/devices/cpu/caps/branches.
> |  * For the architectures whose raw branch records are
> |  * already stored in age order, the hw_idx should be 0.
> |  */
> | struct perf_branch_stack {
> |         __u64                           nr;  
> |         __u64                           hw_idx;
> |         struct perf_branch_entry        entries[];
> | };
> 
> ... which seems to indicate we should be setting hw_idx to 0, since IIUC our
> records are in age order.

Branch records are indeed in age order, so sure, will change hw_idx to 0. I had
earlier figured that there was no need for hw_idx, and hence marked it as -1ULL,
similar to other platforms like powerpc.
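
i.e. the change will simply be:

	cpuc->branches->branch_stack.hw_idx = 0;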

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH V11 05/10] arm64/perf: Add branch stack support in ARMV8 PMU
  2023-06-08 10:13       ` Suzuki K Poulose
  2023-06-09  4:00         ` Anshuman Khandual
@ 2023-06-09  7:14         ` Anshuman Khandual
  1 sibling, 0 replies; 48+ messages in thread
From: Anshuman Khandual @ 2023-06-09  7:14 UTC (permalink / raw)
  To: Suzuki K Poulose, Mark Rutland
  Cc: linux-arm-kernel, linux-kernel, will, catalin.marinas,
	Mark Brown, James Clark, Rob Herring, Marc Zyngier,
	Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	linux-perf-users

[..]

On 6/8/23 15:43, Suzuki K Poulose wrote:
>>> | static int armv8pmu_probe_pmu(struct arm_pmu *cpu_pmu)
>>> | {
>>> |         struct armv8pmu_probe_info probe = {
>>> |                 .pmu = cpu_pmu,
>>> |                 .present = false,
>>> |         };
>>> |         int ret;
>>> |
>>> |         ret = smp_call_function_any(&cpu_pmu->supported_cpus,
>>> |                                     __armv8pmu_probe_pmu,
>>> |                                     &probe, 1);
>>> |         if (ret)
>>> |                 return ret;
>>> |         if (!probe.present)
>>> |                 return -ENODEV;
>>> |
>>> |         if (!arm_pmu_branch_stack_supported(cpu_pmu))
>>> |                 return 0;
>>> |
>>> |         ret = armv8pmu_private_alloc(cpu_pmu);
>>> |         if (ret)
>>> |                 return ret;
>>> |
>>> |         ret = branch_records_alloc(cpu_pmu);
>>> |         if (ret)
>>> |                 armv8pmu_private_free(cpu_pmu);
>>> |
>>> |         return ret;
>>> | }


After splitting the task ctx cache management from the pmu private data
management, the above function will look something like the following,
taking care of all the error path freeing as well.

static int armv8pmu_probe_pmu(struct arm_pmu *cpu_pmu)
{
        struct armv8pmu_probe_info probe = {
                .pmu = cpu_pmu,
                .present = false,
        };
        int ret;

        ret = armv8pmu_private_alloc(cpu_pmu);
        if (ret)
                return ret;

        ret = smp_call_function_any(&cpu_pmu->supported_cpus,
                                    __armv8pmu_probe_pmu,
                                    &probe, 1);
        if (ret)
                goto probe_err;

        if (!probe.present) {
                ret = -ENODEV;
                goto probe_err;
        }

        if (cpu_pmu->has_branch_stack) {
                ret = armv8pmu_task_ctx_cache_alloc(cpu_pmu);
                if (ret)
                        goto probe_err;

                ret = branch_records_alloc(cpu_pmu);
                if (ret) {
                        armv8pmu_task_ctx_cache_free(cpu_pmu);
                        goto probe_err;
                }
                return 0;
        }
        armv8pmu_private_free(cpu_pmu);
        return 0;

probe_err:
        armv8pmu_private_free(cpu_pmu);
        return ret;
}

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH V11 05/10] arm64/perf: Add branch stack support in ARMV8 PMU
  2023-06-09  4:00         ` Anshuman Khandual
@ 2023-06-09  9:54           ` Suzuki K Poulose
  0 siblings, 0 replies; 48+ messages in thread
From: Suzuki K Poulose @ 2023-06-09  9:54 UTC (permalink / raw)
  To: Anshuman Khandual, Mark Rutland
  Cc: linux-arm-kernel, linux-kernel, will, catalin.marinas,
	Mark Brown, James Clark, Rob Herring, Marc Zyngier,
	Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	linux-perf-users

On 09/06/2023 05:00, Anshuman Khandual wrote:
> 
> 
> On 6/8/23 15:43, Suzuki K Poulose wrote:
>> On 06/06/2023 11:34, Anshuman Khandual wrote:
>>>
>>>
>>> On 6/5/23 17:35, Mark Rutland wrote:
>>>> On Wed, May 31, 2023 at 09:34:23AM +0530, Anshuman Khandual wrote:
>>>>> This enables support for branch stack sampling event in ARMV8 PMU, checking
>>>>> has_branch_stack() on the event inside 'struct arm_pmu' callbacks. Although
>>>>> these branch stack helpers armv8pmu_branch_XXXXX() are just dummy functions
>>>>> for now. While here, this also defines arm_pmu's sched_task() callback with
>>>>> armv8pmu_sched_task(), which resets the branch record buffer on a sched_in.
>>>>
>>>> This generally looks good, but I have a few comments below.
>>>>
>>>> [...]
>>>>
>>>>> +static inline bool armv8pmu_branch_valid(struct perf_event *event)
>>>>> +{
>>>>> +    WARN_ON_ONCE(!has_branch_stack(event));
>>>>> +    return false;
>>>>> +}
>>>>
>>>> IIUC this is for validating the attr, so could we please name this
>>>> armv8pmu_branch_attr_valid() ?
>>>
>>> Sure, will change the name and updated call sites.
>>>
>>>>
>>>> [...]
>>>>
>>>>> +static int branch_records_alloc(struct arm_pmu *armpmu)
>>>>> +{
>>>>> +    struct pmu_hw_events *events;
>>>>> +    int cpu;
>>>>> +
>>>>> +    for_each_possible_cpu(cpu) {
>>
>> Shouldn't this be supported_cpus ? i.e.
>>      for_each_cpu(cpu, &armpmu->supported_cpus) {
>>
>>
>>>>> +        events = per_cpu_ptr(armpmu->hw_events, cpu);
>>>>> +        events->branches = kzalloc(sizeof(struct branch_records), GFP_KERNEL);
>>>>> +        if (!events->branches)
>>>>> +            return -ENOMEM;
>>
>> Do we need to free the allocated branches already ?
> 
> This gets fixed in the next patch via per-cpu allocation. I will
> move and fold the code block in here. Updated function will look
> like the following.
> 
> static int branch_records_alloc(struct arm_pmu *armpmu)
> {
>          struct branch_records __percpu *records;
>          int cpu;
> 
>          records = alloc_percpu_gfp(struct branch_records, GFP_KERNEL);
>          if (!records)
>                  return -ENOMEM;
> 
>          /*
>           * FIXME: Memory allocated via 'records' gets completely
>           * consumed here and is never required to be freed up later.
>           * Hence losing access to the on stack 'records' handle is
>           * acceptable. Otherwise this alloc handle would have to be
>           * saved somewhere.
>           */
>          for_each_possible_cpu(cpu) {
>                  struct pmu_hw_events *events_cpu;
>                  struct branch_records *records_cpu;
> 
>                  events_cpu = per_cpu_ptr(armpmu->hw_events, cpu);
>                  records_cpu = per_cpu_ptr(records, cpu);
>                  events_cpu->branches = records_cpu;
>          }
>          return 0;
> }
> 
> Regarding the cpumask argument in for_each_cpu().
> 
> - hw_events is a __percpu pointer in struct arm_pmu
> 
> 	- pmu->hw_events = alloc_percpu_gfp(struct pmu_hw_events, GFP_KERNEL)
> 
> 
> - 'records' above is being allocated via alloc_percpu_gfp()
> 
> 	- records = alloc_percpu_gfp(struct branch_records, GFP_KERNEL)


> 
> If the 'armpmu->supported_cpus' mask gets used instead of the possible cpu mask,
> would not there be some dangling per-cpu branch_record allocated areas
> that remain unassigned ? Assigning all of them back into hw_events should
> be harmless.

That's because you are using alloc_percpu for records ? With the current
proposed code, if there are two different arm_pmus on the system, you
would end up wasting 1xper_cpu branch_records ? And if there are 3,
2xper_cpu gets wasted ?

> 
>>
>>>>> +    }
>>
>>
>> May be:
>>      int ret = 0;
>>
>>      for_each_cpu(cpu, &armpmu->supported_cpus) {
>>          events = per_cpu_ptr(armpmu->hw_events, cpu);
>>          events->branches = kzalloc(sizeof(struct branch_records), GFP_KERNEL);
>>         
>>          if (!events->branches) {
>>              ret = -ENOMEM;
>>              break;
>>          }
>>      }
>>
>>      if (!ret)
>>          return 0;
>>
>>      for_each_cpu(cpu, &armpmu->supported_cpus) {
>>          events = per_cpu_ptr(armpmu->hw_events, cpu);
>>          if (!events->branches)
>>              break;
>>          kfree(events->branches);
>>      }
>>      return ret;
>>      
>>>>> +    return 0;
>>>>
>>>> This leaks memory if any allocation fails, and the next patch replaces this
>>>> code entirely.
>>>
>>> Okay.
>>>
>>>>
>>>> Please add this once in a working state. Either use the percpu allocation
>>>> trick in the next patch from the start, or have this kzalloc() with a
>>>> corresponding kfree() in an error path.
>>>
>>> I will change branch_records_alloc() as suggested in the next patch's thread
>>> and fold those changes here in this patch.
>>>
>>>>
>>>>>    }
>>>>>      static int armv8pmu_probe_pmu(struct arm_pmu *cpu_pmu)
>>>>> @@ -1145,12 +1162,24 @@ static int armv8pmu_probe_pmu(struct arm_pmu *cpu_pmu)
>>>>>        };
>>>>>        int ret;
>>>>>    +    ret = armv8pmu_private_alloc(cpu_pmu);
>>>>> +    if (ret)
>>>>> +        return ret;
>>>>> +
>>>>>        ret = smp_call_function_any(&cpu_pmu->supported_cpus,
>>>>>                        __armv8pmu_probe_pmu,
>>>>>                        &probe, 1);
>>>>>        if (ret)
>>>>>            return ret;
>>>>>    +    if (arm_pmu_branch_stack_supported(cpu_pmu)) {
>>>>> +        ret = branch_records_alloc(cpu_pmu);
>>>>> +        if (ret)
>>>>> +            return ret;
>>>>> +    } else {
>>>>> +        armv8pmu_private_free(cpu_pmu);
>>>>> +    }
>>>>
>>>> I see from the next patch that "private" is four ints, so please just add that
>>>> to struct arm_pmu under an ifdef CONFIG_ARM64_BRBE. That'll simplify this, and
>>>> if we end up needing more space in future we can consider factoring it out.
>>>
>>> struct arm_pmu {
>>>      ........................................
>>>           /* Implementation specific attributes */
>>>           void            *private;
>>> }
>>>
>>> private pointer here creates an abstraction for given pmu implementation
>>> to hide attribute details without making it known to core arm pmu layer.
>>> Although adding ifdef CONFIG_ARM64_BRBE solves the problem as mentioned
>>> above, it does break that abstraction. Currently arm_pmu layer is aware
>>> about 'branch records' but not about BRBE in particular which the driver
>>> adds later on. I suggest we should not break that abstraction.
>>>
>>> Instead a global 'static struct brbe_hw_attr' in drivers/perf/arm_brbe.c
>>> can be initialized into arm_pmu->private during armv8pmu_branch_probe(),
>>> which will also solve the allocation-free problem. Also similar helpers
>>> armv8pmu_task_ctx_alloc()/free() could be defined to manage task context
>>> cache i.e arm_pmu->pmu.task_ctx_cache independently.
>>>
>>> But now armv8pmu_task_ctx_alloc() can be called after pmu probe confirms
>>> to have arm_pmu->has_branch_stack.
>>>
>>>>
>>>>> +
>>>>>        return probe.present ? 0 : -ENODEV;
>>>>>    }
>>>>
>>>> It also seems odd to check probe.present *after* checking
>>>> arm_pmu_branch_stack_supported().
>>>
>>> I will reorganize as suggested below.
>>>
>>>>
>>>> With the allocation removed I think this can be written more clearly as:
>>>>
>>>> | static int armv8pmu_probe_pmu(struct arm_pmu *cpu_pmu)
>>>> | {
>>>> |         struct armv8pmu_probe_info probe = {
>>>> |                 .pmu = cpu_pmu,
>>>> |                 .present = false,
>>>> |         };
>>>> |         int ret;
>>>> |
>>>> |         ret = smp_call_function_any(&cpu_pmu->supported_cpus,
>>>> |                                     __armv8pmu_probe_pmu,
>>>> |                                     &probe, 1);
>>>> |         if (ret)
>>>> |                 return ret;
>>>> |
>>>> |         if (!probe.present)
>>>> |                 return -ENODEV;
>>>> |
>>>> |         if (arm_pmu_branch_stack_supported(cpu_pmu))
>>>> |                 ret = branch_records_alloc(cpu_pmu);
>>>> |
>>>> |         return ret;
>>>> | }
>>
>> Could we not simplify this as below and keep the abstraction, since we
>> already have it ?
> 
> No, there is an allocation dependency before the smp call context.

Ok, I wasn't aware of that. Could we not read whatever we need to know
about the brbe in armv8pmu_probe_info and process it at the caller here?
And then do the private_alloc etc as we need ?

Suzuki
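
A minimal sketch of the percpu allocation approach referred to above, where a
single alloc_percpu() call backs every CPU's pmu_hw_events::branches so a
failure needs no partial unwind. This illustrates the idea only and is not
the patch that was eventually posted:

	static int branch_records_alloc(struct arm_pmu *armpmu)
	{
		struct branch_records __percpu *records;
		int cpu;

		records = alloc_percpu_gfp(struct branch_records, GFP_KERNEL);
		if (!records)
			return -ENOMEM;

		/* One allocation backs all CPUs, so nothing needs unwinding */
		for_each_possible_cpu(cpu) {
			struct pmu_hw_events *events = per_cpu_ptr(armpmu->hw_events, cpu);

			events->branches = per_cpu_ptr(records, cpu);
		}
		return 0;
	}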


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH V11 00/10] arm64/perf: Enable branch stack sampling
  2023-05-31  4:04 [PATCH V11 00/10] arm64/perf: Enable branch stack sampling Anshuman Khandual
                   ` (9 preceding siblings ...)
  2023-05-31  4:04 ` [PATCH V11 10/10] arm64/perf: Implement branch records save on PMU IRQ Anshuman Khandual
@ 2023-06-09 11:13 ` Anshuman Khandual
  10 siblings, 0 replies; 48+ messages in thread
From: Anshuman Khandual @ 2023-06-09 11:13 UTC (permalink / raw)
  To: linux-arm-kernel, linux-kernel, will, catalin.marinas, mark.rutland
  Cc: Mark Brown, James Clark, Rob Herring, Marc Zyngier,
	Suzuki Poulose, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, linux-perf-users



On 5/31/23 09:34, Anshuman Khandual wrote:
> This series enables perf branch stack sampling support on arm64 platform
> via a new arch feature called Branch Record Buffer Extension (BRBE). All
> relevant register definitions could be accessed here.
> 
> https://developer.arm.com/documentation/ddi0601/2021-12/AArch64-Registers
> 
> This series applies on 6.4-rc4.
> 
> [...]
> 
> Anshuman Khandual (10):
>   drivers: perf: arm_pmu: Add new sched_task() callback
>   arm64/perf: Add BRBE registers and fields
>   arm64/perf: Add branch stack support in struct arm_pmu
>   arm64/perf: Add branch stack support in struct pmu_hw_events
>   arm64/perf: Add branch stack support in ARMV8 PMU
>   arm64/perf: Enable branch stack events via FEAT_BRBE
>   arm64/perf: Add PERF_ATTACH_TASK_DATA to events with has_branch_stack()
>   arm64/perf: Add struct brbe_regset helper functions
>   arm64/perf: Implement branch records save on task sched out
>   arm64/perf: Implement branch records save on PMU IRQ

Hello Mark,

I am working on your review comments for the first six patches in this
series, and planning to respin next week. But it would be great if you
could also review the rest of the series [PATCH 7 - 10] which adds the branch
records save-restore mechanism and let me know your thoughts. It will
help in collecting more changes (if required) for the next spin. Thank
you.

- Anshuman

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH V11 06/10] arm64/perf: Enable branch stack events via FEAT_BRBE
  2023-06-09  4:30     ` Anshuman Khandual
@ 2023-06-09 12:37       ` Mark Rutland
  0 siblings, 0 replies; 48+ messages in thread
From: Mark Rutland @ 2023-06-09 12:37 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: linux-arm-kernel, linux-kernel, will, catalin.marinas,
	Mark Brown, James Clark, Rob Herring, Marc Zyngier,
	Suzuki Poulose, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, linux-perf-users

On Fri, Jun 09, 2023 at 10:00:09AM +0530, Anshuman Khandual wrote:
> >> +static u64 branch_type_to_brbcr(int branch_type)
> >> +{
> >> +	u64 brbcr = BRBCR_EL1_DEFAULT_TS;
> >> +
> >> +	/*
> >> +	 * BRBE need not be paused on PMU interrupt while tracing only
> >> +	 * the user space, because it will automatically be inside the
> >> +	 * prohibited region. But even after PMU overflow occurs, the
> >> +	 * interrupt could still take much more cycles, before it can
> >> +	 * be taken and by that time BRBE will have been overwritten.
> >> +	 * Let's enable pause on PMU interrupt mechanism even for user
> >> +	 * only traces.
> >> +	 */
> >> +	brbcr |= BRBCR_EL1_FZP;
> > I think this is trying to say that we *should* use FZP when sampling the
> > kernel (due to IRQ latency), and *can* safely use it when sampling userspace,
> > so it would be good to explain it that way around.
> 
> Agreed, the following updated comment explains why we should enable FZP
> when sampling the kernel, as otherwise BRBE will capture unwanted records.
> It also explains why we should enable FZP even when sampling user
> space, due to IRQ latency.
> 
>         /*
>          * BRBE should be paused on PMU interrupt while tracing kernel
>          * space to stop capturing further branch records. Otherwise
>          * interrupt handler branch records might get into the samples
>          * which is not desired.
>          *
>          * BRBE need not be paused on PMU interrupt while tracing only
>          * the user space, because it will automatically be inside the
>          * prohibited region. But even after PMU overflow occurs, the
>          * interrupt could still take much more cycles, before it can
>          * be taken and by that time BRBE will have been overwritten.
>          * Hence enable pause on PMU interrupt mechanism even for user
>          * only traces as well.
>          */
>         brbcr |= BRBCR_EL1_FZP;

Thanks; I think that's a lot clearer!

Mark.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH V11 06/10] arm64/perf: Enable branch stack events via FEAT_BRBE
  2023-06-09  4:47     ` Anshuman Khandual
@ 2023-06-09 12:42       ` Mark Rutland
  0 siblings, 0 replies; 48+ messages in thread
From: Mark Rutland @ 2023-06-09 12:42 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: linux-arm-kernel, linux-kernel, will, catalin.marinas,
	Mark Brown, James Clark, Rob Herring, Marc Zyngier,
	Suzuki Poulose, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, linux-perf-users

On Fri, Jun 09, 2023 at 10:17:19AM +0530, Anshuman Khandual wrote:
> On 6/5/23 19:13, Mark Rutland wrote:
> >> +/*
> >> + * A branch record with BRBINFx_EL1.LASTFAILED set, implies that all
> >> + * preceding consecutive branch records, that were in a transaction
> >> + * (i.e. their BRBINFx_EL1.TX set) have been aborted.
> >> + *
> >> + * Similarly BRBFCR_EL1.LASTFAILED set, indicates that all preceding
> >> + * consecutive branch records up to the last record, which were in a
> >> + * transaction (i.e. their BRBINFx_EL1.TX set) have been aborted.
> >> + *
> >> + * --------------------------------- -------------------
> >> + * | 00 | BRBSRC | BRBTGT | BRBINF | | TX = 1 | LF = 0 | [TX success]
> >> + * --------------------------------- -------------------
> >> + * | 01 | BRBSRC | BRBTGT | BRBINF | | TX = 1 | LF = 0 | [TX success]
> >> + * --------------------------------- -------------------
> >> + * | 02 | BRBSRC | BRBTGT | BRBINF | | TX = 0 | LF = 0 |
> >> + * --------------------------------- -------------------
> >> + * | 03 | BRBSRC | BRBTGT | BRBINF | | TX = 1 | LF = 0 | [TX failed]
> >> + * --------------------------------- -------------------
> >> + * | 04 | BRBSRC | BRBTGT | BRBINF | | TX = 1 | LF = 0 | [TX failed]
> >> + * --------------------------------- -------------------
> >> + * | 05 | BRBSRC | BRBTGT | BRBINF | | TX = 0 | LF = 1 |
> >> + * --------------------------------- -------------------
> >> + * | .. | BRBSRC | BRBTGT | BRBINF | | TX = 0 | LF = 0 |
> >> + * --------------------------------- -------------------
> >> + * | 61 | BRBSRC | BRBTGT | BRBINF | | TX = 1 | LF = 0 | [TX failed]
> >> + * --------------------------------- -------------------
> >> + * | 62 | BRBSRC | BRBTGT | BRBINF | | TX = 1 | LF = 0 | [TX failed]
> >> + * --------------------------------- -------------------
> >> + * | 63 | BRBSRC | BRBTGT | BRBINF | | TX = 1 | LF = 0 | [TX failed]
> >> + * --------------------------------- -------------------
> >> + *
> >> + * BRBFCR_EL1.LASTFAILED == 1
> >> + *
> >> + * BRBFCR_EL1.LASTFAILED fails all those consecutive, in-transaction
> >> + * branch records near the end of the BRBE buffer.
> >> + *
> >> + * The architecture does not guarantee a non-transaction (TX = 0) branch
> >> + * record between two different transactions. So it is possible that
> >> + * a subsequent lastfailed record (TX = 0, LF = 1) might erroneously
> >> + * mark more than required transactions as aborted.
> >> + */
> > Linux doesn't currently support TME (and IIUC no-one has built it), so can't we
> > delete the transaction handling for now? We can add a comment with somehing like:
> > 
> > /*
> >  * TODO: add transaction handling for TME.
> >  */
> > 
> > Assuming no-one has built TME, we might also be able to get an architectural
> > fix to disambiguate the boundary between two transactions, and avoid the
> > problem described above.
> > 
> > [...]
> > 
> 
> Or we can leave this unchanged for now, and update it if and when the relevant
> architectural fix comes in. The current TME branch record handling here is
> as per the current architectural specification.

My rationale for deleting it is that it cannot be used and cannot be tested,
since the kernel doesn't support TME, and there are no TME implementations out
there. If and when we support TME in the kernel, this is very likely to have
bit-rotted.

I'd be happy to link to the current version, e.g.

/*
 * TODO: add transaction handling for FEAT_TME. See:
 *
 * https://lore.kernel.org/linux-arm-kernel/20230531040428.501523-7-anshuman.khandual@arm.com/
 */

I do appreciate that time and effort has gone into writing this, but IMO it's
more distracting than helpful at present, and deleting it makes this easier to
review and maintain.

Thanks,
Mark.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH V11 06/10] arm64/perf: Enable branch stack events via FEAT_BRBE
  2023-06-09  5:22     ` Anshuman Khandual
@ 2023-06-09 12:47       ` Mark Rutland
  2023-06-09 13:15         ` Mark Rutland
  2023-06-09 13:34         ` James Clark
  0 siblings, 2 replies; 48+ messages in thread
From: Mark Rutland @ 2023-06-09 12:47 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: linux-arm-kernel, linux-kernel, will, catalin.marinas,
	Mark Brown, James Clark, Rob Herring, Marc Zyngier,
	Suzuki Poulose, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, linux-perf-users

On Fri, Jun 09, 2023 at 10:52:37AM +0530, Anshuman Khandual wrote:
> [...]
> 
> On 6/5/23 19:13, Mark Rutland wrote:
> >> +void armv8pmu_branch_read(struct pmu_hw_events *cpuc, struct perf_event *event)
> >> +{
> >> +	struct brbe_hw_attr *brbe_attr = (struct brbe_hw_attr *)cpuc->percpu_pmu->private;
> >> +	u64 brbfcr, brbcr;
> >> +	int idx, loop1_idx1, loop1_idx2, loop2_idx1, loop2_idx2, count;
> >> +
> >> +	brbcr = read_sysreg_s(SYS_BRBCR_EL1);
> >> +	brbfcr = read_sysreg_s(SYS_BRBFCR_EL1);
> >> +
> >> +	/* Ensure pause on PMU interrupt is enabled */
> >> +	WARN_ON_ONCE(!(brbcr & BRBCR_EL1_FZP));
> >> +
> >> +	/* Pause the buffer */
> >> +	write_sysreg_s(brbfcr | BRBFCR_EL1_PAUSED, SYS_BRBFCR_EL1);
> >> +	isb();
> >> +
> >> +	/* Determine the indices for each loop */
> >> +	loop1_idx1 = BRBE_BANK0_IDX_MIN;
> >> +	if (brbe_attr->brbe_nr <= BRBE_BANK_MAX_ENTRIES) {
> >> +		loop1_idx2 = brbe_attr->brbe_nr - 1;
> >> +		loop2_idx1 = BRBE_BANK1_IDX_MIN;
> >> +		loop2_idx2 = BRBE_BANK0_IDX_MAX;
> >> +	} else {
> >> +		loop1_idx2 = BRBE_BANK0_IDX_MAX;
> >> +		loop2_idx1 = BRBE_BANK1_IDX_MIN;
> >> +		loop2_idx2 = brbe_attr->brbe_nr - 1;
> >> +	}
> >> +
> >> +	/* Loop through bank 0 */
> >> +	select_brbe_bank(BRBE_BANK_IDX_0);
> >> +	for (idx = 0, count = loop1_idx1; count <= loop1_idx2; idx++, count++) {
> >> +		if (!capture_branch_entry(cpuc, event, idx))
> >> +			goto skip_bank_1;
> >> +	}
> >> +
> >> +	/* Loop through bank 1 */
> >> +	select_brbe_bank(BRBE_BANK_IDX_1);
> >> +	for (count = loop2_idx1; count <= loop2_idx2; idx++, count++) {
> >> +		if (!capture_branch_entry(cpuc, event, idx))
> >> +			break;
> >> +	}
> >> +
> >> +skip_bank_1:
> >> +	cpuc->branches->branch_stack.nr = idx;
> >> +	cpuc->branches->branch_stack.hw_idx = -1ULL;
> >> +	process_branch_aborts(cpuc);
> >> +
> >> +	/* Unpause the buffer */
> >> +	write_sysreg_s(brbfcr & ~BRBFCR_EL1_PAUSED, SYS_BRBFCR_EL1);
> >> +	isb();
> >> +	armv8pmu_branch_reset();
> >> +}
> > The loop indices are rather difficult to follow, and I think those can be made
> > quite a lot simpler if split out, e.g.
> > 
> > | int __armv8pmu_branch_read(struct pmu_hw_events *cpuc, struct perf_event *event)
> > | {
> > | 	struct brbe_hw_attr *brbe_attr = (struct brbe_hw_attr *)cpuc->percpu_pmu->private;
> > | 	int nr_hw_entries = brbe_attr->brbe_nr;
> > | 	int idx;
> 
> I guess idx needs an init to 0.

Yes, sorry, that should have been:

	int idx = 0;

> > | 
> > | 	select_brbe_bank(BRBE_BANK_IDX_0);
> > | 	while (idx < nr_hw_entries && idx < BRBE_BANK0_IDX_MAX) {
> > | 		if (!capture_branch_entry(cpuc, event, idx))
> > | 			return idx;
> > | 		idx++;
> > | 	}
> > | 
> > | 	select_brbe_bank(BRBE_BANK_IDX_1);
> > | 	while (idx < nr_hw_entries && idx < BRBE_BANK1_IDX_MAX) {
> > | 		if (!capture_branch_entry(cpuc, event, idx))
> > | 			return idx;
> > | 		idx++;
> > | 	}
> > | 
> > | 	return idx;
> > | }
> 
> These loops are better than the proposed one with indices, will update.

Great!

> > | 
> > | void armv8pmu_branch_read(struct pmu_hw_events *cpuc, struct perf_event *event)
> > | {
> > | 	u64 brbfcr, brbcr;
> > | 	int nr;
> > | 
> > | 	brbcr = read_sysreg_s(SYS_BRBCR_EL1);
> > | 	brbfcr = read_sysreg_s(SYS_BRBFCR_EL1);
> > | 
> > | 	/* Ensure pause on PMU interrupt is enabled */
> > | 	WARN_ON_ONCE(!(brbcr & BRBCR_EL1_FZP));
> > | 
> > | 	/* Pause the buffer */
> > | 	write_sysreg_s(brbfcr | BRBFCR_EL1_PAUSED, SYS_BRBFCR_EL1);
> > | 	isb();
> > | 
> > | 	nr = __armv8pmu_branch_read(cpuc, event);
> > | 
> > | 	cpuc->branches->branch_stack.nr = nr;
> > | 	cpuc->branches->branch_stack.hw_idx = -1ULL;
> > | 	process_branch_aborts(cpuc);
> > | 
> > | 	/* Unpause the buffer */
> > | 	write_sysreg_s(brbfcr & ~BRBFCR_EL1_PAUSED, SYS_BRBFCR_EL1);
> > | 	isb();
> > | 	armv8pmu_branch_reset();
> > | }
> > 
> > Looking at <linux/perf_event.h> I see:
> > 
> > | /*
> > |  * branch stack layout:
> > |  *  nr: number of taken branches stored in entries[]
> > |  *  hw_idx: The low level index of raw branch records
> > |  *          for the most recent branch.
> > |  *          -1ULL means invalid/unknown.
> > |  *
> > |  * Note that nr can vary from sample to sample
> > |  * branches (to, from) are stored from most recent
> > |  * to least recent, i.e., entries[0] contains the most
> > |  * recent branch.
> > |  * The entries[] is an abstraction of raw branch records,
> > |  * which may not be stored in age order in HW, e.g. Intel LBR.
> > |  * The hw_idx is to expose the low level index of raw
> > |  * branch record for the most recent branch aka entries[0].
> > |  * The hw_idx index is between -1 (unknown) and max depth,
> > |  * which can be retrieved in /sys/devices/cpu/caps/branches.
> > |  * For the architectures whose raw branch records are
> > |  * already stored in age order, the hw_idx should be 0.
> > |  */
> > | struct perf_branch_stack {
> > |         __u64                           nr;  
> > |         __u64                           hw_idx;
> > |         struct perf_branch_entry        entries[];
> > | };
> > 
> > ... which seems to indicate we should be setting hw_idx to 0, since IIUC our
> > records are in age order.
> Branch records are indeed in age order, sure will change hw_idx as 0. Earlier
> figured that there was no need for hw_idx and hence marked it as -1UL similar
> to other platforms like powerpc.

That's fair enough; looking at power_pmu_bhrb_read() in
arch/powerpc/perf/core-book3s.c, I see a comment:

	Branches are read most recent first (ie. mfbhrb 0 is
	the most recent branch).

... which suggests that should be 0 also, or that the documentation is wrong.

Do you know how the perf tool consumes this?

Thanks,
Mark.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH V11 06/10] arm64/perf: Enable branch stack events via FEAT_BRBE
  2023-06-09 12:47       ` Mark Rutland
@ 2023-06-09 13:15         ` Mark Rutland
  2023-06-12  8:35           ` Anshuman Khandual
  2023-06-09 13:34         ` James Clark
  1 sibling, 1 reply; 48+ messages in thread
From: Mark Rutland @ 2023-06-09 13:15 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: linux-arm-kernel, linux-kernel, will, catalin.marinas,
	Mark Brown, James Clark, Rob Herring, Marc Zyngier,
	Suzuki Poulose, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, linux-perf-users

On Fri, Jun 09, 2023 at 01:47:18PM +0100, Mark Rutland wrote:
> On Fri, Jun 09, 2023 at 10:52:37AM +0530, Anshuman Khandual wrote:
> > On 6/5/23 19:13, Mark Rutland wrote:
> > > Looking at <linux/perf_event.h> I see:
> > > 
> > > | [...]
> > > 
> > > ... which seems to indicate we should be setting hw_idx to 0, since IIUC our
> > > records are in age order.
> > Branch records are indeed in age order, sure will change hw_idx as 0. Earlier
> > figured that there was no need for hw_idx and hence marked it as -1UL similar
> > to other platforms like powerpc.
> 
> That's fair enough; looking at power_pmu_bhrb_read() in
> arch/powerpc/perf/core-book3s.c, I see a comment:
> 
> 	Branches are read most recent first (ie. mfbhrb 0 is
> 	the most recent branch).
> 
> ... which suggests that should be 0 also, or that the documentation is wrong.
> 
> Do you know how the perf tool consumes this?


Thinking about this some more, if what this is saying is that entries[0]
must be strictly the last branch, and we've lost branches due to interrupt
latency, then we clearly don't meet that requirement and must report -1ULL
here.

So while it'd be nice to figure this out, I'm happy using -1ULL, and would be a
bit concerned using 0.

Sorry for flip-flopping on this.

Thanks,
Mark.
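
For context, this is roughly where branch_stack.nr and branch_stack.hw_idx get
consumed: the PMU IRQ handler saves the filled record set into the sample via
perf_sample_save_brstack(). A sketch assuming the series' naming, inside
armv8pmu_handle_irq() where cpuc, event and data are in scope:

	if (has_branch_stack(event)) {
		/* Fills cpuc->branches->branch_stack.nr/.hw_idx/.entries[] */
		armv8pmu_branch_read(cpuc, event);
		perf_sample_save_brstack(&data, event,
					 &cpuc->branches->branch_stack);
	}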

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH V11 06/10] arm64/perf: Enable branch stack events via FEAT_BRBE
  2023-06-09 12:47       ` Mark Rutland
  2023-06-09 13:15         ` Mark Rutland
@ 2023-06-09 13:34         ` James Clark
  2023-06-12 10:12           ` Anshuman Khandual
  1 sibling, 1 reply; 48+ messages in thread
From: James Clark @ 2023-06-09 13:34 UTC (permalink / raw)
  To: Mark Rutland, Anshuman Khandual
  Cc: linux-arm-kernel, linux-kernel, will, catalin.marinas,
	Mark Brown, Rob Herring, Marc Zyngier, Suzuki Poulose,
	Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	linux-perf-users



On 09/06/2023 13:47, Mark Rutland wrote:
> On Fri, Jun 09, 2023 at 10:52:37AM +0530, Anshuman Khandual wrote:
>> [...]
>>> Looking at <linux/perf_event.h> I see:
>>>
>>> | /*
>>> |  * branch stack layout:
>>> |  *  nr: number of taken branches stored in entries[]
>>> |  *  hw_idx: The low level index of raw branch records
>>> |  *          for the most recent branch.
>>> |  *          -1ULL means invalid/unknown.
>>> |  *
>>> |  * Note that nr can vary from sample to sample
>>> |  * branches (to, from) are stored from most recent
>>> |  * to least recent, i.e., entries[0] contains the most
>>> |  * recent branch.
>>> |  * The entries[] is an abstraction of raw branch records,
>>> |  * which may not be stored in age order in HW, e.g. Intel LBR.
>>> |  * The hw_idx is to expose the low level index of raw
>>> |  * branch record for the most recent branch aka entries[0].
>>> |  * The hw_idx index is between -1 (unknown) and max depth,
>>> |  * which can be retrieved in /sys/devices/cpu/caps/branches.
>>> |  * For the architectures whose raw branch records are
>>> |  * already stored in age order, the hw_idx should be 0.
>>> |  */
>>> | struct perf_branch_stack {
>>> |         __u64                           nr;  
>>> |         __u64                           hw_idx;
>>> |         struct perf_branch_entry        entries[];
>>> | };
>>>
>>> ... which seems to indicate we should be setting hw_idx to 0, since IIUC our
>>> records are in age order.
>> Branch records are indeed in age order, sure will change hw_idx as 0. Earlier
>> figured that there was no need for hw_idx and hence marked it as -1UL similar
>> to other platforms like powerpc.
> 
> That's fair enough; looking at power_pmu_bhrb_read() in
> arch/powerpc/perf/core-book3s.c, I see a comment:
> 
> 	Branches are read most recent first (ie. mfbhrb 0 is
> 	the most recent branch).
> 
> ... which suggests that should be 0 also, or that the documentation is wrong.
> 
> Do you know how the perf tool consumes this?
> 
> Thanks,
> Mark.

It looks like it's a unique ID/last position updated in the LBR FIFO and
it's used to stitch callchains together when the stack depth exceeds the
buffer size. Perf takes the previous one that got filled to the limit
and the new one and stitches them together if the hw_idx matches.

There are some options in perf you need to provide to make it happen, so
I think for BRBE it doesn't matter what value is assigned to it. -1
seems to be a 'not used' value which we should probably set in case the
event is opened with PERF_SAMPLE_BRANCH_HW_INDEX.

You could also fail to open the event if PERF_SAMPLE_BRANCH_HW_INDEX is
set, and that would save writing out -1 every time for every branch
stack. Although it's not enabled by default, so maybe that's not necessary.

James
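
If rejecting such events were preferred, a minimal sketch of the check James
describes, assuming it sits in the series' armv8pmu_branch_valid() filter
validation path:

	if (event->attr.branch_sample_type & PERF_SAMPLE_BRANCH_HW_INDEX) {
		pr_debug_once("branch record hw index not supported\n");
		return false;
	}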

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH V11 06/10] arm64/perf: Enable branch stack events via FEAT_BRBE
  2023-06-09 13:15         ` Mark Rutland
@ 2023-06-12  8:35           ` Anshuman Khandual
  0 siblings, 0 replies; 48+ messages in thread
From: Anshuman Khandual @ 2023-06-12  8:35 UTC (permalink / raw)
  To: Mark Rutland
  Cc: linux-arm-kernel, linux-kernel, will, catalin.marinas,
	Mark Brown, James Clark, Rob Herring, Marc Zyngier,
	Suzuki Poulose, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, linux-perf-users



On 6/9/23 18:45, Mark Rutland wrote:
> On Fri, Jun 09, 2023 at 01:47:18PM +0100, Mark Rutland wrote:
>> On Fri, Jun 09, 2023 at 10:52:37AM +0530, Anshuman Khandual wrote:
>>> On 6/5/23 19:13, Mark Rutland wrote:
>>>> Looking at <linux/perf_event.h> I see:
>>>> [...]
>>>> ... which seems to indicate we should be setting hw_idx to 0, since IIUC our
>>>> records are in age order.
>>> Branch records are indeed in age order, sure will change hw_idx as 0. Earlier
>>> figured that there was no need for hw_idx and hence marked it as -1UL similar
>>> to other platforms like powerpc.
>>
>> That's fair enough; looking at power_pmu_bhrb_read() in
>> arch/powerpc/perf/core-book3s.c, I see a comment:
>>
>> 	Branches are read most recent first (ie. mfbhrb 0 is
>> 	the most recent branch).
>>
>> ... which suggests that should be 0 also, or that the documentation is wrong.
>>
>> Do you know how the perf tool consumes this?
> 
> 
> Thinking about this some more, if what this is saying is that entries[0]
> must be strictly the last branch, and we've lost branches due to interrupt
> latency, then we clearly don't meet that requirement and must report -1ULL
> here.

'last branch' means relative to the captured records, not in absolute terms.
Losing records due to interrupt latency does not change the relative age
order of the set either. Hence '0' would suggest a valid relative, not
absolute, age order for the branch record set.

> 
> So while it'd be nice to figure this out, I'm happy using -1ULL, and would be a
> bit concerned using 0.

Sounds reasonable. If tools are not going to use this anyway, I guess there
is no point in suggesting that each record set has a valid age order with
subtle conditions involved.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH V11 06/10] arm64/perf: Enable branch stack events via FEAT_BRBE
  2023-06-09 13:34         ` James Clark
@ 2023-06-12 10:12           ` Anshuman Khandual
  0 siblings, 0 replies; 48+ messages in thread
From: Anshuman Khandual @ 2023-06-12 10:12 UTC (permalink / raw)
  To: James Clark, Mark Rutland
  Cc: linux-arm-kernel, linux-kernel, will, catalin.marinas,
	Mark Brown, Rob Herring, Marc Zyngier, Suzuki Poulose,
	Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	linux-perf-users



On 6/9/23 19:04, James Clark wrote:
> 
> 
> On 09/06/2023 13:47, Mark Rutland wrote:
>> On Fri, Jun 09, 2023 at 10:52:37AM +0530, Anshuman Khandual wrote:
>>> [...]
>>
>> That's fair enough; looking at power_pmu_bhrb_read() in
>> arch/powerpc/perf/core-book3s.c, I see a comment:
>>
>> 	Branches are read most recent first (ie. mfbhrb 0 is
>> 	the most recent branch).
>>
>> ... which suggests that should be 0 also, or that the documentation is wrong.
>>
>> Do you know how the perf tool consumes this?
>>
>> Thanks,
>> Mark.
> 
> It looks like it's a unique ID/last position updated in the LBR FIFO and
> it's used to stitch callchains together when the stack depth exceeds the
> buffer size. Perf takes the previous one that got filled to the limit
> and the new one and stitches them together if the hw_idx matches.

Right.

> 
> There are some options in perf you need to provide to make it happen, so
> I think for BRBE it doesn't matter what value is assigned to it. -1
> seems to be a 'not used' value which we should probably set in case the
> event is opened with PERF_SAMPLE_BRANCH_HW_INDEX.

-1 indeed did seem like a "not used" option rather than an "unknown" option.
 
> 
> You could also fail to open the event if PERF_SAMPLE_BRANCH_HW_INDEX is
> set, and that would save writing out -1 every time for every branch
> stack. Although it's not enabled by default, so maybe that's not necessary.

Yeah blocking events with PERF_SAMPLE_BRANCH_HW_INDEX is not necessary IMHO.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH V11 02/10] arm64/perf: Add BRBE registers and fields
  2023-05-31  4:04 ` [PATCH V11 02/10] arm64/perf: Add BRBE registers and fields Anshuman Khandual
  2023-06-05  7:55   ` Mark Rutland
@ 2023-06-13 16:27   ` Mark Rutland
  2023-06-14  2:59     ` Anshuman Khandual
  1 sibling, 1 reply; 48+ messages in thread
From: Mark Rutland @ 2023-06-13 16:27 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: linux-arm-kernel, linux-kernel, will, catalin.marinas,
	Mark Brown, James Clark, Rob Herring, Marc Zyngier,
	Suzuki Poulose, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, linux-perf-users

On Wed, May 31, 2023 at 09:34:20AM +0530, Anshuman Khandual wrote:
> This adds BRBE related register definitions and various other related field
> macros therein. These will be used subsequently in a BRBE driver which is
> being added later on.
> 
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Will Deacon <will@kernel.org>
> Cc: Marc Zyngier <maz@kernel.org>
> Cc: Mark Rutland <mark.rutland@arm.com>
> Cc: linux-arm-kernel@lists.infradead.org
> Cc: linux-kernel@vger.kernel.org
> Tested-by: James Clark <james.clark@arm.com>
> Reviewed-by: Mark Brown <broonie@kernel.org>
> Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
> ---
>  arch/arm64/include/asm/sysreg.h | 103 +++++++++++++++++++++
>  arch/arm64/tools/sysreg         | 159 ++++++++++++++++++++++++++++++++
>  2 files changed, 262 insertions(+)

> +SysregFields BRBINFx_EL1

> +Enum	1:0	VALID
> +	0b00	NONE
> +	0b01	TARGET
> +	0b10	SOURCE
> +	0b11	FULL
> +EndEnum

This looks correct...

[...]

> +Sysreg	BRBINFINJ_EL1	2	1	9	1	0

> +Enum	1:0	VALID
> +	0b00	NONE
> +	0b01	TARGET
> +	0b10	SOURCE
> +	0b00	FULL
> +EndEnum

... but this clearly isn't.

Mark.
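
For clarity, the BRBINFINJ_EL1 VALID encoding presumably ought to mirror the
BRBINFx_EL1 SysregFields entry quoted above, i.e.:

	Enum	1:0	VALID
		0b00	NONE
		0b01	TARGET
		0b10	SOURCE
		0b11	FULL
	EndEnum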

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH V11 08/10] arm64/perf: Add struct brbe_regset helper functions
  2023-05-31  4:04 ` [PATCH V11 08/10] arm64/perf: Add struct brbe_regset helper functions Anshuman Khandual
  2023-06-02  2:40   ` Namhyung Kim
@ 2023-06-13 17:17   ` Mark Rutland
  2023-06-14  5:14     ` Anshuman Khandual
  1 sibling, 1 reply; 48+ messages in thread
From: Mark Rutland @ 2023-06-13 17:17 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: linux-arm-kernel, linux-kernel, will, catalin.marinas,
	Mark Brown, James Clark, Rob Herring, Marc Zyngier,
	Suzuki Poulose, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, linux-perf-users

On Wed, May 31, 2023 at 09:34:26AM +0530, Anshuman Khandual wrote:
> The primary abstraction level for fetching branch records from BRBE HW has
> been changed to 'struct brbe_regset', which contains storage for all three
> BRBE registers i.e BRBSRC, BRBTGT, BRBINF. Whether branch record processing
> happens in the task sched out path, or in the PMU IRQ handling path, these
> registers need to be extracted from the HW. Afterwards both live and stored
> sets need to be stitched together to create final branch records set. This
> adds required helper functions for such operations.
> 
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Will Deacon <will@kernel.org>
> Cc: Mark Rutland <mark.rutland@arm.com>
> Cc: linux-arm-kernel@lists.infradead.org
> Cc: linux-kernel@vger.kernel.org
> Tested-by: James Clark <james.clark@arm.com>
> Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
> ---
>  drivers/perf/arm_brbe.c | 163 ++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 163 insertions(+)
> 
> diff --git a/drivers/perf/arm_brbe.c b/drivers/perf/arm_brbe.c
> index 484842d8cf3e..759db681d673 100644
> --- a/drivers/perf/arm_brbe.c
> +++ b/drivers/perf/arm_brbe.c
> @@ -44,6 +44,169 @@ static void select_brbe_bank(int bank)
>  	isb();
>  }
>  
> +/*
> + * This scans over BRBE register banks and captures individual branch records
> + * [BRBSRC, BRBTGT, BRBINF] into a pre-allocated 'struct brbe_regset' buffer,
> + * until an invalid one gets encountered. The caller for this function needs
> + * to ensure BRBE is in an appropriate state before the records can be captured.
> + */
> +static int capture_brbe_regset(struct brbe_hw_attr *brbe_attr, struct brbe_regset *buf)
> +{
> +	int loop1_idx1, loop1_idx2, loop2_idx1, loop2_idx2;
> +	int idx, count;
> +
> +	loop1_idx1 = BRBE_BANK0_IDX_MIN;
> +	if (brbe_attr->brbe_nr <= BRBE_BANK_MAX_ENTRIES) {
> +		loop1_idx2 = brbe_attr->brbe_nr - 1;
> +		loop2_idx1 = BRBE_BANK1_IDX_MIN;
> +		loop2_idx2 = BRBE_BANK0_IDX_MAX;
> +	} else {
> +		loop1_idx2 = BRBE_BANK0_IDX_MAX;
> +		loop2_idx1 = BRBE_BANK1_IDX_MIN;
> +		loop2_idx2 = brbe_attr->brbe_nr - 1;
> +	}
> +
> +	select_brbe_bank(BRBE_BANK_IDX_0);
> +	for (idx = 0, count = loop1_idx1; count <= loop1_idx2; idx++, count++) {
> +		buf[idx].brbinf = get_brbinf_reg(idx);
> +		/*
> +		 * There are no valid entries anymore on the buffer.
> +		 * Abort the branch record processing to save some
> +		 * cycles and also reduce the capture/process load
> +		 * for the user space as well.
> +		 */
> +		if (brbe_invalid(buf[idx].brbinf))
> +			return idx;
> +
> +		buf[idx].brbsrc = get_brbsrc_reg(idx);
> +		buf[idx].brbtgt = get_brbtgt_reg(idx);
> +	}
> +
> +	select_brbe_bank(BRBE_BANK_IDX_1);
> +	for (count = loop2_idx1; count <= loop2_idx2; idx++, count++) {
> +		buf[idx].brbinf = get_brbinf_reg(idx);
> +		/*
> +		 * There are no valid entries anymore on the buffer.
> +		 * Abort the branch record processing to save some
> +		 * cycles and also reduce the capture/process load
> +		 * for the user space as well.
> +		 */
> +		if (brbe_invalid(buf[idx].brbinf))
> +			return idx;
> +
> +		buf[idx].brbsrc = get_brbsrc_reg(idx);
> +		buf[idx].brbtgt = get_brbtgt_reg(idx);
> +	}
> +	return idx;
> +}

As with __armv8pmu_branch_read(), the loop conditions are a bit hard to follow,
and I believe that can be rewritten along the lines of the suggestion there.

Looking at this, we now have a couple of places that will try to read the
registers for an individual record, so it probably makes sense to factor that
into a helper, e.g.

| static bool __read_brbe_regset(struct brbe_regset *entry, int idx)
| {
| 	u64 brbinf = get_brbinf_reg(idx);
| 
| 	if (brbe_invalid(brbinf))
| 		return false;
| 	
| 	entry->brbinf = brbinf;
| 	entry->brbsrc = get_brbsrc_reg(idx);
| 	entry->brbtgt = get_brbtgt_reg(idx);
| 
| 	return true;
| }

... which can be used here, e.g.

| /*
|  * Capture all records before the first invalid record, and return the number
|  * of records captured.
|  */
| static int capture_brbe_regset(struct brbe_hw_attr *brbe_attr, struct brbe_regset *buf)
| {
| 
| 	int nr_entries = brbe_attr->brbe_nr;
| 	int idx = 0;
| 	
| 	select_brbe_bank(BRBE_BANK_IDX_0);
| 	while (idx < nr_entries && idx < BRBE_BANK0_IDX_MAX) {
| 		if (!__read_brbe_regset(&buf[idx], idx))
| 			return idx;
| 		idx++;
| 	}
| 
| 	select_brbe_bank(BRBE_BANK_IDX_1);
| 	while (idx < nr_entries && idx < BRBE_BANK1_IDX_MAX) {
| 		if (!__read_brbe_regset(&buf[idx], idx))
| 			return idx;
| 		idx++;
| 	}
| 
| 	return idx;
| }

... and could be used to implement capture_branch_entry() in the patch before
this.
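
As an illustration of that last point, capture_branch_entry() could wrap the
same helper. brbe_regset_to_perf_entry() is a hypothetical name here for
whatever converts one captured regset into a perf_branch_entry:

	static bool capture_branch_entry(struct pmu_hw_events *cpuc,
					 struct perf_event *event, int idx)
	{
		struct brbe_regset regset;

		if (!__read_brbe_regset(&regset, idx))
			return false;

		/* Hypothetical helper filling cpuc->branches->branch_entries[idx] */
		brbe_regset_to_perf_entry(cpuc, event, &regset, idx);
		return true;
	}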

> +static inline void copy_brbe_regset(struct brbe_regset *src, int src_idx,
> +				    struct brbe_regset *dst, int dst_idx)
> +{
> +	dst[dst_idx].brbinf = src[src_idx].brbinf;
> +	dst[dst_idx].brbsrc = src[src_idx].brbsrc;
> +	dst[dst_idx].brbtgt = src[src_idx].brbtgt;
> +}

C can do struct assignment, so this is the same as:

| static inline void copy_brbe_regset(struct brbe_regset *src, int src_idx,
| 				    struct brbe_regset *dst, int dst_idx)
| {
| 	dst[dst_idx] = src[src_idx];
| }

... and given that, it would be simpler and clearer to have that directly in
the caller, so I don't think we need this helper function.

> +/*
> + * This function concatenates branch records from the stored and live buffers,
> + * up to a maximum of nr_max records, with the stored buffer holding the
> + * result. The concatenated buffer contains all the branch records from
> + * the live buffer but might contain some from the stored buffer, provided
> + * the maximum combined length does not exceed 'nr_max'.
> + *
> + *	Stored records	Live records
> + *	------------------------------------------------^
> + *	|	S0	|	L0	|	Newest	|
> + *	---------------------------------		|
> + *	|	S1	|	L1	|		|
> + *	---------------------------------		|
> + *	|	S2	|	L2	|		|
> + *	---------------------------------		|
> + *	|	S3	|	L3	|		|
> + *	---------------------------------		|
> + *	|	S4	|	L4	|		nr_max
> + *	---------------------------------		|
> + *	|		|	L5	|		|
> + *	---------------------------------		|
> + *	|		|	L6	|		|
> + *	---------------------------------		|
> + *	|		|	L7	|		|
> + *	---------------------------------		|
> + *	|		|		|		|
> + *	---------------------------------		|
> + *	|		|		|	Oldest	|
> + *	------------------------------------------------V
> + *
> + *
> + * S0 is the newest in the stored records, whereas L7 is the oldest in
> + * the live reocords. Unless the live buffer is detetcted as being full
> + * thus potentially dropping off some older records, L7 and S0 records
> + * are contiguous in time for a user task context. The stitched buffer
> + * here represents maximum possible branch records, contiguous in time.
> + *
> + *	Stored records  Live records
> + *	------------------------------------------------^
> + *	|	L0	|	L0	|	Newest	|
> + *	---------------------------------		|
> + *	|	L1	|	L1	|		|
> + *	---------------------------------		|
> + *	|	L2	|	L2	|		|
> + *	---------------------------------		|
> + *	|	L3	|	L3	|		|
> + *	---------------------------------		|
> + *	|	L4	|	L4	|	      nr_max
> + *	---------------------------------		|
> + *	|	L5	|	L5	|		|
> + *	---------------------------------		|
> + *	|	L6	|	L6	|		|
> + *	---------------------------------		|
> + *	|	L7	|	L7	|		|
> + *	---------------------------------		|
> + *	|	S0	|		|		|
> + *	---------------------------------		|
> + *	|	S1	|		|    Oldest	|
> + *	------------------------------------------------V
> + *	|	S2	| <----|
> + *	-----------------      |
> + *	|	S3	| <----| Dropped off after nr_max
> + *	-----------------      |
> + *	|	S4	| <----|
> + *	-----------------
> + */
> +static int stitch_stored_live_entries(struct brbe_regset *stored,
> +				      struct brbe_regset *live,
> +				      int nr_stored, int nr_live,
> +				      int nr_max)
> +{
> +	int nr_total, nr_excess, nr_last, i;
> +
> +	nr_total = nr_stored + nr_live;
> +	nr_excess = nr_total - nr_max;
> +
> +	/* Stored branch records in stitched buffer */
> +	if (nr_live == nr_max)
> +		nr_stored = 0;
> +	else if (nr_excess > 0)
> +		nr_stored -= nr_excess;
> +
> +	/* Stitched buffer branch records length */
> +	if (nr_total > nr_max)
> +		nr_last = nr_max;
> +	else
> +		nr_last = nr_total;
> +
> +	/* Move stored branch records */
> +	for (i = 0; i < nr_stored; i++)
> +		copy_brbe_regset(stored, i, stored, nr_last - nr_stored - 1 + i);
> +
> +	/* Copy live branch records */
> +	for (i = 0; i < nr_live; i++)
> +		copy_brbe_regset(live, i, stored, i);
> +
> +	return nr_last;
> +}

I think this can be written more simply as something like:

static int stitch_stored_live_entries(struct brbe_regset *stored,
				      struct brbe_regset *live,
				      int nr_stored, int nr_live,
				      int nr_max)
{	
	int nr_move = max(nr_stored, nr_max - nr_live);

	/* Move the tail of the buffer to make room for the new entries */
	memmove(&stored[nr_live], &stored[0], nr_move * sizeof(*stored));

	/* Copy the new entries into the head of the buffer */
	memcpy(&stored[0], &live[0], nr_live * sizeof(*stored));

	/* Return the number of entries in the stitched buffer */
	return min(nr_live + nr_stored, nr_max);
}

... or if we could save this oldest-first, we could make it a circular buffer
and avoid moving older entries.
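
As a rough sketch of that alternative (nothing like this exists in the
series; the brbe_ring structure and its 64-entry capacity below are
made up for illustration):

| struct brbe_ring {
| 	struct brbe_regset	ent[64];	/* assumed maximum BRBE depth */
| 	int			head;		/* next slot to write */
| 	int			nr;		/* valid entries */
| };
| 
| static void brbe_ring_append(struct brbe_ring *r, struct brbe_regset *live,
| 			     int nr_live, int nr_max)
| {
| 	int i;
| 
| 	/* live[] is newest-first; append in reverse to keep the ring oldest-first */
| 	for (i = nr_live - 1; i >= 0; i--) {
| 		r->ent[r->head] = live[i];
| 		r->head = (r->head + 1) % nr_max;
| 	}
| 	r->nr = min(r->nr + nr_live, nr_max);
| }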

Thanks,
Mark.

* Re: [PATCH V11 02/10] arm64/perf: Add BRBE registers and fields
  2023-06-13 16:27   ` Mark Rutland
@ 2023-06-14  2:59     ` Anshuman Khandual
  0 siblings, 0 replies; 48+ messages in thread
From: Anshuman Khandual @ 2023-06-14  2:59 UTC (permalink / raw)
  To: Mark Rutland
  Cc: linux-arm-kernel, linux-kernel, will, catalin.marinas,
	Mark Brown, James Clark, Rob Herring, Marc Zyngier,
	Suzuki Poulose, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, linux-perf-users



On 6/13/23 21:57, Mark Rutland wrote:
> On Wed, May 31, 2023 at 09:34:20AM +0530, Anshuman Khandual wrote:
>> This adds BRBE related register definitions and various other related field
>> macros therein. These will be used subsequently in a BRBE driver which is
>> being added later on.
>>
>> Cc: Catalin Marinas <catalin.marinas@arm.com>
>> Cc: Will Deacon <will@kernel.org>
>> Cc: Marc Zyngier <maz@kernel.org>
>> Cc: Mark Rutland <mark.rutland@arm.com>
>> Cc: linux-arm-kernel@lists.infradead.org
>> Cc: linux-kernel@vger.kernel.org
>> Tested-by: James Clark <james.clark@arm.com>
>> Reviewed-by: Mark Brown <broonie@kernel.org>
>> Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
>> ---
>>  arch/arm64/include/asm/sysreg.h | 103 +++++++++++++++++++++
>>  arch/arm64/tools/sysreg         | 159 ++++++++++++++++++++++++++++++++
>>  2 files changed, 262 insertions(+)
> 
>> +SysregFields BRBINFx_EL1
> 
>> +Enum	1:0	VALID
>> +	0b00	NONE
>> +	0b01	TARGET
>> +	0b10	SOURCE
>> +	0b11	FULL
>> +EndEnum
> 
> This looks correct...
> 
> [...]
> 
>> +Sysreg	BRBINFINJ_EL1	2	1	9	1	0
> 
>> +Enum	1:0	VALID
>> +	0b00	NONE
>> +	0b01	TARGET
>> +	0b10	SOURCE
>> +	0b00	FULL
>> +EndEnum
> 
> ... but this clearly isn't.

Fixed VALID_FULL as 0b11.
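
i.e. the BRBINFINJ_EL1 VALID field now carries the same encoding as
BRBINFx_EL1:

Enum	1:0	VALID
	0b00	NONE
	0b01	TARGET
	0b10	SOURCE
	0b11	FULL
EndEnum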

* Re: [PATCH V11 08/10] arm64/perf: Add struct brbe_regset helper functions
  2023-06-13 17:17   ` Mark Rutland
@ 2023-06-14  5:14     ` Anshuman Khandual
  2023-06-14 10:59       ` Mark Rutland
  0 siblings, 1 reply; 48+ messages in thread
From: Anshuman Khandual @ 2023-06-14  5:14 UTC (permalink / raw)
  To: Mark Rutland
  Cc: linux-arm-kernel, linux-kernel, will, catalin.marinas,
	Mark Brown, James Clark, Rob Herring, Marc Zyngier,
	Suzuki Poulose, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, linux-perf-users



On 6/13/23 22:47, Mark Rutland wrote:
> On Wed, May 31, 2023 at 09:34:26AM +0530, Anshuman Khandual wrote:
>> The primary abstraction level for fetching branch records from BRBE HW has
>> been changed to 'struct brbe_regset', which contains storage for all three
>> BRBE registers i.e. BRBSRC, BRBTGT, BRBINF. Whether branch record processing
>> happens in the task sched out path, or in the PMU IRQ handling path, these
>> registers need to be extracted from the HW. Afterwards both live and stored
>> sets need to be stitched together to create the final branch record set. This
>> adds the required helper functions for such operations.
>>
>> Cc: Catalin Marinas <catalin.marinas@arm.com>
>> Cc: Will Deacon <will@kernel.org>
>> Cc: Mark Rutland <mark.rutland@arm.com>
>> Cc: linux-arm-kernel@lists.infradead.org
>> Cc: linux-kernel@vger.kernel.org
>> Tested-by: James Clark <james.clark@arm.com>
>> Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
>> ---
>>  drivers/perf/arm_brbe.c | 163 ++++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 163 insertions(+)
>>
>> diff --git a/drivers/perf/arm_brbe.c b/drivers/perf/arm_brbe.c
>> index 484842d8cf3e..759db681d673 100644
>> --- a/drivers/perf/arm_brbe.c
>> +++ b/drivers/perf/arm_brbe.c
>> @@ -44,6 +44,169 @@ static void select_brbe_bank(int bank)
>>  	isb();
>>  }
>>  
>> +/*
>> + * This scans over BRBE register banks and captures individual branch records
>> + * [BRBSRC, BRBTGT, BRBINF] into a pre-allocated 'struct brbe_regset' buffer,
>> + * until an invalid one gets encountered. The caller for this function needs
>> + * to ensure BRBE is in an appropriate state before the records can be captured.
>> + */
>> +static int capture_brbe_regset(struct brbe_hw_attr *brbe_attr, struct brbe_regset *buf)
>> +{
>> +	int loop1_idx1, loop1_idx2, loop2_idx1, loop2_idx2;
>> +	int idx, count;
>> +
>> +	loop1_idx1 = BRBE_BANK0_IDX_MIN;
>> +	if (brbe_attr->brbe_nr <= BRBE_BANK_MAX_ENTRIES) {
>> +		loop1_idx2 = brbe_attr->brbe_nr - 1;
>> +		loop2_idx1 = BRBE_BANK1_IDX_MIN;
>> +		loop2_idx2 = BRBE_BANK0_IDX_MAX;
>> +	} else {
>> +		loop1_idx2 = BRBE_BANK0_IDX_MAX;
>> +		loop2_idx1 = BRBE_BANK1_IDX_MIN;
>> +		loop2_idx2 = brbe_attr->brbe_nr - 1;
>> +	}
>> +
>> +	select_brbe_bank(BRBE_BANK_IDX_0);
>> +	for (idx = 0, count = loop1_idx1; count <= loop1_idx2; idx++, count++) {
>> +		buf[idx].brbinf = get_brbinf_reg(idx);
>> +		/*
>> +		 * There are no valid entries anymore on the buffer.
>> +		 * Abort the branch record processing to save some
>> +		 * cycles and also reduce the capture/process load
>> +		 * for the user space as well.
>> +		 */
>> +		if (brbe_invalid(buf[idx].brbinf))
>> +			return idx;
>> +
>> +		buf[idx].brbsrc = get_brbsrc_reg(idx);
>> +		buf[idx].brbtgt = get_brbtgt_reg(idx);
>> +	}
>> +
>> +	select_brbe_bank(BRBE_BANK_IDX_1);
>> +	for (count = loop2_idx1; count <= loop2_idx2; idx++, count++) {
>> +		buf[idx].brbinf = get_brbinf_reg(idx);
>> +		/*
>> +		 * There are no valid entries anymore on the buffer.
>> +		 * Abort the branch record processing to save some
>> +		 * cycles and also reduce the capture/process load
>> +		 * for the user space as well.
>> +		 */
>> +		if (brbe_invalid(buf[idx].brbinf))
>> +			return idx;
>> +
>> +		buf[idx].brbsrc = get_brbsrc_reg(idx);
>> +		buf[idx].brbtgt = get_brbtgt_reg(idx);
>> +	}
>> +	return idx;
>> +}
> 
> As with __armv8pmu_branch_read(), the loop conditions are a bit hard to follow,
> and I believe that can be rewritten along the lines of the suggestion there.

I have changed both places (in separate patches) to the suggested loop structure.

> 
> Looking at this, we now have a couple of places that will try to read the
> registers for an individual record, so it probably makes sense to factor that
> into a helper, e.g.

There are indeed two places inside capture_brbe_regset() - one for each bank.

> 
> | static bool __read_brbe_regset(struct brbe_regset *entry, int idx)
> | {
> | 	u64 brbinf = get_brbinf_reg(idx);
> | 
> | 	if (brbe_invalid(brbinf))
> | 		return false;
> | 	
> | 	entry->brbinf = brbinf;
> | 	entry->brbsrc = get_brbsrc_reg(idx);
> | 	entry->brbtgt = get_brbtgt_reg(idx);
> | 
> | 	return true;
> | }
> 
> ... which can be used here, e.g.
> 
> | /*
> |  * Capture all records before the first invalid record, and return the number
> |  * of records captured.
> |  */
> | static int capture_brbe_regset(struct brbe_hw_attr *brbe_attr, struct brbe_regset *buf)
> | {
> | 
> | 	int nr_entries = brbe_attr->brbe_nr;
> | 	int idx = 0;
> | 	
> | 	select_brbe_bank(BRBE_BANK_IDX_0);
> | 	while (idx < nr_entries && idx < BRBE_BANK0_IDX_MAX) {
> | 		if (__read_brbe_regset(&buf[idx], idx))

It should test !__read_brbe_regset(&buf[idx], idx) instead as the error
case returns false.

> | 			return idx;
> | 	}
> | 
> | 	select_brbe_bank(BRBE_BANK_IDX_1);
> | 	while (idx < nr_entries && idx < BRBE_BANK1_IDX_MAX) {
> | 		if (__read_brbe_regset(&buf[idx], idx))
> | 			return idx;

Ditto.

> | 	}
> | 
> | 	return idx;
> | }

Will factor out a new helper __read_brbe_regset() from capture_brbe_regset().

> 
> ... and could be used to implement capture_branch_entry() in the patch before
> this.
> 
>> +static inline void copy_brbe_regset(struct brbe_regset *src, int src_idx,
>> +				    struct brbe_regset *dst, int dst_idx)
>> +{
>> +	dst[dst_idx].brbinf = src[src_idx].brbinf;
>> +	dst[dst_idx].brbsrc = src[src_idx].brbsrc;
>> +	dst[dst_idx].brbtgt = src[src_idx].brbtgt;
>> +}
> 
> C can do struct assignment, so this is the same as:
> 
> | static inline void copy_brbe_regset(struct brbe_regset *src, int src_idx,
> | 				    struct brbe_regset *dst, int dst_idx)
> | {
> | 	dst[dst_idx] = src[src_idx];
> | }

Agreed.

> 
> ... and given that, it would be simpler and clearer to have that directly in
> the caller, so I don't think we need this helper function.

Agreed, will drop copy_brbe_regset().

> 
>> +/*
>> + * This function concatenates branch records from the stored and live buffers,
>> + * up to a maximum of nr_max records, with the stored buffer holding the
>> + * result. The concatenated buffer contains all the branch records from
>> + * the live buffer but might contain some from the stored buffer, provided
>> + * the maximum combined length does not exceed 'nr_max'.
>> + *
>> + *	Stored records	Live records
>> + *	------------------------------------------------^
>> + *	|	S0	|	L0	|	Newest	|
>> + *	---------------------------------		|
>> + *	|	S1	|	L1	|		|
>> + *	---------------------------------		|
>> + *	|	S2	|	L2	|		|
>> + *	---------------------------------		|
>> + *	|	S3	|	L3	|		|
>> + *	---------------------------------		|
>> + *	|	S4	|	L4	|		nr_max
>> + *	---------------------------------		|
>> + *	|		|	L5	|		|
>> + *	---------------------------------		|
>> + *	|		|	L6	|		|
>> + *	---------------------------------		|
>> + *	|		|	L7	|		|
>> + *	---------------------------------		|
>> + *	|		|		|		|
>> + *	---------------------------------		|
>> + *	|		|		|	Oldest	|
>> + *	------------------------------------------------V
>> + *
>> + *
>> + * S0 is the newest in the stored records, whereas L7 is the oldest in
>> + * the live reocords. Unless the live buffer is detetcted as being full

Fixed these typos ^^^					^^^

>> + * thus potentially dropping off some older records, L7 and S0 records
>> + * are contiguous in time for a user task context. The stitched buffer
>> + * here represents maximum possible branch records, contiguous in time.
>> + *
>> + *	Stored records  Live records
>> + *	------------------------------------------------^
>> + *	|	L0	|	L0	|	Newest	|
>> + *	---------------------------------		|
>> + *	|	L1	|	L1	|		|
>> + *	---------------------------------		|
>> + *	|	L2	|	L2	|		|
>> + *	---------------------------------		|
>> + *	|	L3	|	L3	|		|
>> + *	---------------------------------		|
>> + *	|	L4	|	L4	|	      nr_max
>> + *	---------------------------------		|
>> + *	|	L5	|	L5	|		|
>> + *	---------------------------------		|
>> + *	|	L6	|	L6	|		|
>> + *	---------------------------------		|
>> + *	|	L7	|	L7	|		|
>> + *	---------------------------------		|
>> + *	|	S0	|		|		|
>> + *	---------------------------------		|
>> + *	|	S1	|		|    Oldest	|
>> + *	------------------------------------------------V
>> + *	|	S2	| <----|
>> + *	-----------------      |
>> + *	|	S3	| <----| Dropped off after nr_max
>> + *	-----------------      |
>> + *	|	S4	| <----|
>> + *	-----------------
>> + */
>> +static int stitch_stored_live_entries(struct brbe_regset *stored,
>> +				      struct brbe_regset *live,
>> +				      int nr_stored, int nr_live,
>> +				      int nr_max)
>> +{
>> +	int nr_total, nr_excess, nr_last, i;
>> +
>> +	nr_total = nr_stored + nr_live;
>> +	nr_excess = nr_total - nr_max;
>> +
>> +	/* Stored branch records in stitched buffer */
>> +	if (nr_live == nr_max)
>> +		nr_stored = 0;
>> +	else if (nr_excess > 0)
>> +		nr_stored -= nr_excess;
>> +
>> +	/* Stitched buffer branch records length */
>> +	if (nr_total > nr_max)
>> +		nr_last = nr_max;
>> +	else
>> +		nr_last = nr_total;
>> +
>> +	/* Move stored branch records */
>> +	for (i = 0; i < nr_stored; i++)
>> +		copy_brbe_regset(stored, i, stored, nr_last - nr_stored - 1 + i);
>> +
>> +	/* Copy live branch records */
>> +	for (i = 0; i < nr_live; i++)
>> +		copy_brbe_regset(live, i, stored, i);
>> +
>> +	return nr_last;
>> +}
> 
> I think this can be written more simply as something like:
> 
> static int stitch_stored_live_entries(struct brbe_regset *stored,
> 				      struct brbe_regset *live,
> 				      int nr_stored, int nr_live,
> 				      int nr_max)
> {	
> 	int nr_move = max(nr_stored, nr_max - nr_live);

Should this compare be min() instead? As all nr_live entries need to
be copied in starting at stored[0], only (nr_max - nr_live) slots are
left over for the initial stored entries movement, irrespective of how
many stored entries are actually present. Hence (nr_max - nr_live)
acts as a cap on nr_stored for this initial move. But if nr_stored is
smaller than (nr_max - nr_live), nr_stored itself gets picked.
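
To double-check the arithmetic, a tiny standalone test with illustrative
sizes (8-entry buffer, 5 stored, 6 live; not real BRBE numbers):

#include <stdio.h>

#define min(a, b)	((a) < (b) ? (a) : (b))
#define max(a, b)	((a) > (b) ? (a) : (b))

int main(void)
{
	int nr_max = 8, nr_stored = 5, nr_live = 6;

	/* max() would move 5 entries into stored[6..10], overrunning the buffer */
	printf("max: move %d entries\n", max(nr_stored, nr_max - nr_live));

	/* min() moves 2 entries into stored[6..7], staying within nr_max */
	printf("min: move %d entries\n", min(nr_stored, nr_max - nr_live));

	return 0;
}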

> 
> 	/* Move the tail of the buffer to make room for the new entries */
> 	memmove(&stored[nr_live], &stored[0], nr_move * sizeof(*stored));
> 
> 	/* Copy the new entries into the head of the buffer */
> 	memcpy(&stored[0], &live[0], nr_live * sizeof(*stored));
> 
> 	/* Return the number of entries in the stitched buffer */
> 	return min(nr_live + nr_stored, nr_max);
> }

Otherwise this makes sense and is simpler, will rework.

> 
> ... or if we could save this oldest-first, we could make it a circular buffer
> and avoid moving older entries.

Storing the youngest entries first is aligned with how perf branch
stack sampling stores the entries in struct perf_sample_data, which
gets copied 'as is' from cpuc->branches->branch_stack. Hence, just
keeping all these buffers in the same age order (youngest first at
index 0) really makes sense. Although the above rework seems fine.

* Re: [PATCH V11 08/10] arm64/perf: Add struct brbe_regset helper functions
  2023-06-14  5:14     ` Anshuman Khandual
@ 2023-06-14 10:59       ` Mark Rutland
  0 siblings, 0 replies; 48+ messages in thread
From: Mark Rutland @ 2023-06-14 10:59 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: linux-arm-kernel, linux-kernel, will, catalin.marinas,
	Mark Brown, James Clark, Rob Herring, Marc Zyngier,
	Suzuki Poulose, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, linux-perf-users

On Wed, Jun 14, 2023 at 10:44:38AM +0530, Anshuman Khandual wrote:
> On 6/13/23 22:47, Mark Rutland wrote:
> >> +/*
> >> + * This scans over BRBE register banks and captures individual branch records
> >> + * [BRBSRC, BRBTGT, BRBINF] into a pre-allocated 'struct brbe_regset' buffer,
> >> + * until an invalid one gets encountered. The caller for this function needs
> >> + * to ensure BRBE is in an appropriate state before the records can be captured.
> >> + */
> >> +static int capture_brbe_regset(struct brbe_hw_attr *brbe_attr, struct brbe_regset *buf)
> >> +{
> >> +	int loop1_idx1, loop1_idx2, loop2_idx1, loop2_idx2;
> >> +	int idx, count;
> >> +
> >> +	loop1_idx1 = BRBE_BANK0_IDX_MIN;
> >> +	if (brbe_attr->brbe_nr <= BRBE_BANK_MAX_ENTRIES) {
> >> +		loop1_idx2 = brbe_attr->brbe_nr - 1;
> >> +		loop2_idx1 = BRBE_BANK1_IDX_MIN;
> >> +		loop2_idx2 = BRBE_BANK0_IDX_MAX;
> >> +	} else {
> >> +		loop1_idx2 = BRBE_BANK0_IDX_MAX;
> >> +		loop2_idx1 = BRBE_BANK1_IDX_MIN;
> >> +		loop2_idx2 = brbe_attr->brbe_nr - 1;
> >> +	}
> >> +
> >> +	select_brbe_bank(BRBE_BANK_IDX_0);
> >> +	for (idx = 0, count = loop1_idx1; count <= loop1_idx2; idx++, count++) {
> >> +		buf[idx].brbinf = get_brbinf_reg(idx);
> >> +		/*
> >> +		 * There are no valid entries anymore on the buffer.
> >> +		 * Abort the branch record processing to save some
> >> +		 * cycles and also reduce the capture/process load
> >> +		 * for the user space as well.
> >> +		 */
> >> +		if (brbe_invalid(buf[idx].brbinf))
> >> +			return idx;
> >> +
> >> +		buf[idx].brbsrc = get_brbsrc_reg(idx);
> >> +		buf[idx].brbtgt = get_brbtgt_reg(idx);
> >> +	}
> >> +
> >> +	select_brbe_bank(BRBE_BANK_IDX_1);
> >> +	for (count = loop2_idx1; count <= loop2_idx2; idx++, count++) {
> >> +		buf[idx].brbinf = get_brbinf_reg(idx);
> >> +		/*
> >> +		 * There are no valid entries anymore on the buffer.
> >> +		 * Abort the branch record processing to save some
> >> +		 * cycles and also reduce the capture/process load
> >> +		 * for the user space as well.
> >> +		 */
> >> +		if (brbe_invalid(buf[idx].brbinf))
> >> +			return idx;
> >> +
> >> +		buf[idx].brbsrc = get_brbsrc_reg(idx);
> >> +		buf[idx].brbtgt = get_brbtgt_reg(idx);
> >> +	}
> >> +	return idx;
> >> +}
> > 
> > As with __armv8pmu_branch_read(), the loop conditions are a bit hard to follow,
> > and I believe that can be rewritten along the lines of the suggestion there.
> 
> I have changed both places (in separate patches) to the suggested loop structure.
> 
> > 
> > Looking at this, we now have a couple of places that will try to read the
> > registers for an individual record, so it probably makes sense to factor that
> > into a helper, e.g.
> 
> There are indeed two places inside capture_brbe_regset() - one for each bank.
> 
> > 
> > | static bool __read_brbe_regset(struct brbe_regset *entry, int idx)
> > | {
> > | 	u64 brbinf = get_brbinf_reg(idx);
> > | 
> > | 	if (brbe_invalid(brbinf))
> > | 		return false;
> > | 	
> > | 	entry->brbinf = brbinf;
> > | 	entry->brbsrc = get_brbsrc_reg(idx);
> > | 	entry->brbtgt = get_brbtgt_reg(idx);
> > | 
> > | 	return true;
> > | }
> > 
> > ... which can be used here, e.g.
> > 
> > | /*
> > |  * Capture all records before the first invalid record, and return the number
> > |  * of records captured.
> > |  */
> > | static int capture_brbe_regset(struct brbe_hw_attr *brbe_attr, struct brbe_regset *buf)
> > | {
> > | 
> > | 	int nr_entries = brbe_attr->brbe_nr;
> > | 	int idx = 0;
> > | 	
> > | 	select_brbe_bank(BRBE_BANK_IDX_0);
> > | 	while (idx < nr_entries && idx < BRBE_BANK0_IDX_MAX) {
> > | 		if (__read_brbe_regset(&buf[idx], idx))
> 
> It should test !__read_brbe_regset(&buf[idx], idx) instead as the error
> case returns false.

Yes, my bad.

> >> +static int stitch_stored_live_entries(struct brbe_regset *stored,
> >> +				      struct brbe_regset *live,
> >> +				      int nr_stored, int nr_live,
> >> +				      int nr_max)
> >> +{
> >> +	int nr_total, nr_excess, nr_last, i;
> >> +
> >> +	nr_total = nr_stored + nr_live;
> >> +	nr_excess = nr_total - nr_max;
> >> +
> >> +	/* Stored branch records in stitched buffer */
> >> +	if (nr_live == nr_max)
> >> +		nr_stored = 0;
> >> +	else if (nr_excess > 0)
> >> +		nr_stored -= nr_excess;
> >> +
> >> +	/* Stitched buffer branch records length */
> >> +	if (nr_total > nr_max)
> >> +		nr_last = nr_max;
> >> +	else
> >> +		nr_last = nr_total;
> >> +
> >> +	/* Move stored branch records */
> >> +	for (i = 0; i < nr_stored; i++)
> >> +		copy_brbe_regset(stored, i, stored, nr_last - nr_stored - 1 + i);
> >> +
> >> +	/* Copy live branch records */
> >> +	for (i = 0; i < nr_live; i++)
> >> +		copy_brbe_regset(live, i, stored, i);
> >> +
> >> +	return nr_last;
> >> +}
> > 
> > I think this can be written more simply as something like:
> > 
> > static int stitch_stored_live_entries(struct brbe_regset *stored,
> > 				      struct brbe_regset *live,
> > 				      int nr_stored, int nr_live,
> > 				      int nr_max)
> > {	
> > 	int nr_move = max(nr_stored, nr_max - nr_live);
> 
> Should this compare be min() instead?

Yup, my bad again. That should be min().

> > 	/* Move the tail of the buffer to make room for the new entries */
> > 	memmove(&stored[nr_live], &stored[0], nr_move * sizeof(*stored));
> > 
> > 	/* Copy the new entries into the head of the buffer */
> > 	memcpy(&stored[0], &live[0], nr_live * sizeof(*stored));
> > 
> > 	/* Return the number of entries in the stitched buffer */
> > 	return min(nr_live + nr_stored, nr_max);
> > }
> 
> Otherwise this makes sense and is simpler, will rework.

Great!

Thanks,
Mark.

end of thread [~2023-06-14 10:59 UTC]

Thread overview: 48+ messages
2023-05-31  4:04 [PATCH V11 00/10] arm64/perf: Enable branch stack sampling Anshuman Khandual
2023-05-31  4:04 ` [PATCH V11 01/10] drivers: perf: arm_pmu: Add new sched_task() callback Anshuman Khandual
2023-06-05  7:26   ` Mark Rutland
2023-05-31  4:04 ` [PATCH V11 02/10] arm64/perf: Add BRBE registers and fields Anshuman Khandual
2023-06-05  7:55   ` Mark Rutland
2023-06-06  4:27     ` Anshuman Khandual
2023-06-13 16:27   ` Mark Rutland
2023-06-14  2:59     ` Anshuman Khandual
2023-05-31  4:04 ` [PATCH V11 03/10] arm64/perf: Add branch stack support in struct arm_pmu Anshuman Khandual
2023-06-05  7:58   ` Mark Rutland
2023-06-06  4:47     ` Anshuman Khandual
2023-05-31  4:04 ` [PATCH V11 04/10] arm64/perf: Add branch stack support in struct pmu_hw_events Anshuman Khandual
2023-06-05  8:00   ` Mark Rutland
2023-05-31  4:04 ` [PATCH V11 05/10] arm64/perf: Add branch stack support in ARMV8 PMU Anshuman Khandual
2023-06-02  2:33   ` Namhyung Kim
2023-06-05  2:43     ` Anshuman Khandual
2023-06-05 12:05   ` Mark Rutland
2023-06-06 10:34     ` Anshuman Khandual
2023-06-06 10:41       ` Mark Rutland
2023-06-08 10:13       ` Suzuki K Poulose
2023-06-09  4:00         ` Anshuman Khandual
2023-06-09  9:54           ` Suzuki K Poulose
2023-06-09  7:14         ` Anshuman Khandual
2023-05-31  4:04 ` [PATCH V11 06/10] arm64/perf: Enable branch stack events via FEAT_BRBE Anshuman Khandual
2023-06-02  1:45   ` Namhyung Kim
2023-06-05  3:00     ` Anshuman Khandual
2023-06-05 13:43   ` Mark Rutland
2023-06-09  4:30     ` Anshuman Khandual
2023-06-09 12:37       ` Mark Rutland
2023-06-09  4:47     ` Anshuman Khandual
2023-06-09 12:42       ` Mark Rutland
2023-06-09  5:22     ` Anshuman Khandual
2023-06-09 12:47       ` Mark Rutland
2023-06-09 13:15         ` Mark Rutland
2023-06-12  8:35           ` Anshuman Khandual
2023-06-09 13:34         ` James Clark
2023-06-12 10:12           ` Anshuman Khandual
2023-05-31  4:04 ` [PATCH V11 07/10] arm64/perf: Add PERF_ATTACH_TASK_DATA to events with has_branch_stack() Anshuman Khandual
2023-05-31  4:04 ` [PATCH V11 08/10] arm64/perf: Add struct brbe_regset helper functions Anshuman Khandual
2023-06-02  2:40   ` Namhyung Kim
2023-06-05  3:14     ` Anshuman Khandual
2023-06-05 23:49       ` Namhyung Kim
2023-06-13 17:17   ` Mark Rutland
2023-06-14  5:14     ` Anshuman Khandual
2023-06-14 10:59       ` Mark Rutland
2023-05-31  4:04 ` [PATCH V11 09/10] arm64/perf: Implement branch records save on task sched out Anshuman Khandual
2023-05-31  4:04 ` [PATCH V11 10/10] arm64/perf: Implement branch records save on PMU IRQ Anshuman Khandual
2023-06-09 11:13 ` [PATCH V11 00/10] arm64/perf: Enable branch stack sampling Anshuman Khandual
