All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v4 0/4] perf: Rework event forwarding logic
@ 2023-05-04 10:59 Ravi Bangoria
  2023-05-04 11:00 ` [PATCH v4 1/4] perf/core: Rework forwarding of {task|cpu}-clock events Ravi Bangoria
                   ` (3 more replies)
  0 siblings, 4 replies; 35+ messages in thread
From: Ravi Bangoria @ 2023-05-04 10:59 UTC (permalink / raw)
  To: peterz
  Cc: ravi.bangoria, namhyung, eranian, acme, mark.rutland, jolsa,
	irogers, bp, kan.liang, adrian.hunter, maddy, x86,
	linux-perf-users, linux-kernel, sandipan.das, ananth.narayan,
	santosh.shukla

Usually, events are opened on the same pmu as requested by user via
perf_event_attr->type argument. But certain special events are
internally redirected to different pmu. Currently such pmus needs to
be treated specially and thus requires some gruesome hacks.

An approach, suggested by Peter Zijlstra, to get rid of these hacks
was to overwrite event attributes with new pmu's attribute values
within the kernel and let perf_event_init() retry opening the event
with overwritten values. This patch series implements it.

v3: https://lore.kernel.org/r/20230425142205.762-1-ravi.bangoria@amd.com
v3->v4:
 - Split pmu linear searching changes into a separate patch with few
   additional changes.
 - Use special pointer value (void *)(~0) instead of introducing new
   variable to skip creating sysfs/dev files.
 - Move AMD IBS unit test under tools/perf/arch/x86/tests/

Patches are prepared on v6.3.

Ravi Bangoria (4):
  perf/core: Rework forwarding of {task|cpu}-clock events
  perf/ibs: Fix interface via core pmu events
  perf/core: Remove pmu linear searching code
  perf test: Add selftest to test IBS invocation via core pmu events

 arch/x86/events/amd/core.c                    |   2 +-
 arch/x86/events/amd/ibs.c                     |  53 ++++----
 arch/x86/include/asm/perf_event.h             |   2 +
 include/linux/perf_event.h                    |  10 ++
 kernel/events/core.c                          | 114 +++++++++---------
 tools/perf/arch/x86/include/arch-tests.h      |   1 +
 tools/perf/arch/x86/tests/Build               |   1 +
 .../arch/x86/tests/amd-ibs-via-core-pmu.c     |  71 +++++++++++
 tools/perf/arch/x86/tests/arch-tests.c        |   2 +
 9 files changed, 168 insertions(+), 88 deletions(-)
 create mode 100644 tools/perf/arch/x86/tests/amd-ibs-via-core-pmu.c

-- 
2.40.0


^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH v4 1/4] perf/core: Rework forwarding of {task|cpu}-clock events
  2023-05-04 10:59 [PATCH v4 0/4] perf: Rework event forwarding logic Ravi Bangoria
@ 2023-05-04 11:00 ` Ravi Bangoria
  2023-05-10 13:33   ` [tip: perf/core] " tip-bot2 for Ravi Bangoria
  2024-02-20  8:41   ` [PATCH v4 1/4] " Pengfei Xu
  2023-05-04 11:00 ` [PATCH v4 2/4] perf/ibs: Fix interface via core pmu events Ravi Bangoria
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 35+ messages in thread
From: Ravi Bangoria @ 2023-05-04 11:00 UTC (permalink / raw)
  To: peterz
  Cc: ravi.bangoria, namhyung, eranian, acme, mark.rutland, jolsa,
	irogers, bp, kan.liang, adrian.hunter, maddy, x86,
	linux-perf-users, linux-kernel, sandipan.das, ananth.narayan,
	santosh.shukla

Currently, PERF_TYPE_SOFTWARE is treated specially since task-clock and
cpu-clock events are interfaced through it but internally gets forwarded
to their own pmus.

Rework this by overwriting event->attr.type in perf_swevent_init() which
will cause perf_init_event() to retry with updated type and event will
automatically get forwarded to right pmu. With the change, SW pmu no
longer needs to be treated specially and can be included in 'pmu_idr'
list.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com>
---
 include/linux/perf_event.h | 10 +++++
 kernel/events/core.c       | 77 ++++++++++++++++++++------------------
 2 files changed, 51 insertions(+), 36 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index d5628a7b5eaa..5c8a748f51ac 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -295,6 +295,8 @@ struct perf_event_pmu_context;
 
 struct perf_output_handle;
 
+#define PMU_NULL_DEV	((void *)(~0))
+
 /**
  * struct pmu - generic performance monitoring unit
  */
@@ -827,6 +829,14 @@ struct perf_event {
 	void *security;
 #endif
 	struct list_head		sb_list;
+
+	/*
+	 * Certain events gets forwarded to another pmu internally by over-
+	 * writing kernel copy of event->attr.type without user being aware
+	 * of it. event->orig_type contains original 'type' requested by
+	 * user.
+	 */
+	__u32				orig_type;
 #endif /* CONFIG_PERF_EVENTS */
 };
 
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 435815d3be3f..0695bb9fbbb6 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -6647,7 +6647,7 @@ static void perf_sigtrap(struct perf_event *event)
 		return;
 
 	send_sig_perf((void __user *)event->pending_addr,
-		      event->attr.type, event->attr.sig_data);
+		      event->orig_type, event->attr.sig_data);
 }
 
 /*
@@ -9951,6 +9951,9 @@ static void sw_perf_event_destroy(struct perf_event *event)
 	swevent_hlist_put();
 }
 
+static struct pmu perf_cpu_clock; /* fwd declaration */
+static struct pmu perf_task_clock;
+
 static int perf_swevent_init(struct perf_event *event)
 {
 	u64 event_id = event->attr.config;
@@ -9966,7 +9969,10 @@ static int perf_swevent_init(struct perf_event *event)
 
 	switch (event_id) {
 	case PERF_COUNT_SW_CPU_CLOCK:
+		event->attr.type = perf_cpu_clock.type;
+		return -ENOENT;
 	case PERF_COUNT_SW_TASK_CLOCK:
+		event->attr.type = perf_task_clock.type;
 		return -ENOENT;
 
 	default:
@@ -11086,7 +11092,7 @@ static void cpu_clock_event_read(struct perf_event *event)
 
 static int cpu_clock_event_init(struct perf_event *event)
 {
-	if (event->attr.type != PERF_TYPE_SOFTWARE)
+	if (event->attr.type != perf_cpu_clock.type)
 		return -ENOENT;
 
 	if (event->attr.config != PERF_COUNT_SW_CPU_CLOCK)
@@ -11107,6 +11113,7 @@ static struct pmu perf_cpu_clock = {
 	.task_ctx_nr	= perf_sw_context,
 
 	.capabilities	= PERF_PMU_CAP_NO_NMI,
+	.dev		= PMU_NULL_DEV,
 
 	.event_init	= cpu_clock_event_init,
 	.add		= cpu_clock_event_add,
@@ -11167,7 +11174,7 @@ static void task_clock_event_read(struct perf_event *event)
 
 static int task_clock_event_init(struct perf_event *event)
 {
-	if (event->attr.type != PERF_TYPE_SOFTWARE)
+	if (event->attr.type != perf_task_clock.type)
 		return -ENOENT;
 
 	if (event->attr.config != PERF_COUNT_SW_TASK_CLOCK)
@@ -11188,6 +11195,7 @@ static struct pmu perf_task_clock = {
 	.task_ctx_nr	= perf_sw_context,
 
 	.capabilities	= PERF_PMU_CAP_NO_NMI,
+	.dev		= PMU_NULL_DEV,
 
 	.event_init	= task_clock_event_init,
 	.add		= task_clock_event_add,
@@ -11415,31 +11423,31 @@ int perf_pmu_register(struct pmu *pmu, const char *name, int type)
 		goto unlock;
 
 	pmu->type = -1;
-	if (!name)
-		goto skip_type;
+	if (WARN_ONCE(!name, "Can not register anonymous pmu.\n")) {
+		ret = -EINVAL;
+		goto free_pdc;
+	}
+
 	pmu->name = name;
 
-	if (type != PERF_TYPE_SOFTWARE) {
-		if (type >= 0)
-			max = type;
+	if (type >= 0)
+		max = type;
 
-		ret = idr_alloc(&pmu_idr, pmu, max, 0, GFP_KERNEL);
-		if (ret < 0)
-			goto free_pdc;
+	ret = idr_alloc(&pmu_idr, pmu, max, 0, GFP_KERNEL);
+	if (ret < 0)
+		goto free_pdc;
 
-		WARN_ON(type >= 0 && ret != type);
+	WARN_ON(type >= 0 && ret != type);
 
-		type = ret;
-	}
+	type = ret;
 	pmu->type = type;
 
-	if (pmu_bus_running) {
+	if (pmu_bus_running && !pmu->dev) {
 		ret = pmu_dev_alloc(pmu);
 		if (ret)
 			goto free_idr;
 	}
 
-skip_type:
 	ret = -ENOMEM;
 	pmu->cpu_pmu_context = alloc_percpu(struct perf_cpu_pmu_context);
 	if (!pmu->cpu_pmu_context)
@@ -11481,16 +11489,7 @@ int perf_pmu_register(struct pmu *pmu, const char *name, int type)
 	if (!pmu->event_idx)
 		pmu->event_idx = perf_event_idx_default;
 
-	/*
-	 * Ensure the TYPE_SOFTWARE PMUs are at the head of the list,
-	 * since these cannot be in the IDR. This way the linear search
-	 * is fast, provided a valid software event is provided.
-	 */
-	if (type == PERF_TYPE_SOFTWARE || !name)
-		list_add_rcu(&pmu->entry, &pmus);
-	else
-		list_add_tail_rcu(&pmu->entry, &pmus);
-
+	list_add_rcu(&pmu->entry, &pmus);
 	atomic_set(&pmu->exclusive_cnt, 0);
 	ret = 0;
 unlock:
@@ -11499,12 +11498,13 @@ int perf_pmu_register(struct pmu *pmu, const char *name, int type)
 	return ret;
 
 free_dev:
-	device_del(pmu->dev);
-	put_device(pmu->dev);
+	if (pmu->dev && pmu->dev != PMU_NULL_DEV) {
+		device_del(pmu->dev);
+		put_device(pmu->dev);
+	}
 
 free_idr:
-	if (pmu->type != PERF_TYPE_SOFTWARE)
-		idr_remove(&pmu_idr, pmu->type);
+	idr_remove(&pmu_idr, pmu->type);
 
 free_pdc:
 	free_percpu(pmu->pmu_disable_count);
@@ -11525,9 +11525,8 @@ void perf_pmu_unregister(struct pmu *pmu)
 	synchronize_rcu();
 
 	free_percpu(pmu->pmu_disable_count);
-	if (pmu->type != PERF_TYPE_SOFTWARE)
-		idr_remove(&pmu_idr, pmu->type);
-	if (pmu_bus_running) {
+	idr_remove(&pmu_idr, pmu->type);
+	if (pmu_bus_running && pmu->dev && pmu->dev != PMU_NULL_DEV) {
 		if (pmu->nr_addr_filters)
 			device_remove_file(pmu->dev, &dev_attr_nr_addr_filters);
 		device_del(pmu->dev);
@@ -11601,6 +11600,12 @@ static struct pmu *perf_init_event(struct perf_event *event)
 
 	idx = srcu_read_lock(&pmus_srcu);
 
+	/*
+	 * Save original type before calling pmu->event_init() since certain
+	 * pmus overwrites event->attr.type to forward event to another pmu.
+	 */
+	event->orig_type = event->attr.type;
+
 	/* Try parent's PMU first: */
 	if (event->parent && event->parent->pmu) {
 		pmu = event->parent->pmu;
@@ -13640,8 +13645,8 @@ void __init perf_event_init(void)
 	perf_event_init_all_cpus();
 	init_srcu_struct(&pmus_srcu);
 	perf_pmu_register(&perf_swevent, "software", PERF_TYPE_SOFTWARE);
-	perf_pmu_register(&perf_cpu_clock, NULL, -1);
-	perf_pmu_register(&perf_task_clock, NULL, -1);
+	perf_pmu_register(&perf_cpu_clock, "cpu_clock", -1);
+	perf_pmu_register(&perf_task_clock, "task_clock", -1);
 	perf_tp_register();
 	perf_event_init_cpu(smp_processor_id());
 	register_reboot_notifier(&perf_reboot_notifier);
@@ -13684,7 +13689,7 @@ static int __init perf_event_sysfs_init(void)
 		goto unlock;
 
 	list_for_each_entry(pmu, &pmus, entry) {
-		if (!pmu->name || pmu->type < 0)
+		if (pmu->dev)
 			continue;
 
 		ret = pmu_dev_alloc(pmu);
-- 
2.40.0


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v4 2/4] perf/ibs: Fix interface via core pmu events
  2023-05-04 10:59 [PATCH v4 0/4] perf: Rework event forwarding logic Ravi Bangoria
  2023-05-04 11:00 ` [PATCH v4 1/4] perf/core: Rework forwarding of {task|cpu}-clock events Ravi Bangoria
@ 2023-05-04 11:00 ` Ravi Bangoria
  2023-05-10 13:33   ` [tip: perf/core] " tip-bot2 for Ravi Bangoria
  2023-05-04 11:00 ` [PATCH v4 3/4] perf/core: Remove pmu linear searching code Ravi Bangoria
  2023-05-04 11:00 ` [PATCH v4 4/4] perf test: Add selftest to test IBS invocation via core pmu events Ravi Bangoria
  3 siblings, 1 reply; 35+ messages in thread
From: Ravi Bangoria @ 2023-05-04 11:00 UTC (permalink / raw)
  To: peterz
  Cc: ravi.bangoria, namhyung, eranian, acme, mark.rutland, jolsa,
	irogers, bp, kan.liang, adrian.hunter, maddy, x86,
	linux-perf-users, linux-kernel, sandipan.das, ananth.narayan,
	santosh.shukla

Although, IBS pmus can be invoked via their own interface, indirect
IBS invocation via core pmu events is also supported with fixed set
of events: cpu-cycles:p, r076:p (same as cpu-cycles:p) and r0C1:p
(micro-ops) for user convenience.

This indirect IBS invocation is broken since commit 66d258c5b048
("perf/core: Optimize perf_init_event()"), which added RAW pmu under
'pmu_idr' list and thus if event_init() fails with RAW pmu, it started
returning error instead of trying other pmus.

Forward precise events from core pmu to IBS by overwriting 'type' and
'config' in the kernel copy of perf_event_attr. Overwriting will cause
perf_init_event() to retry with updated 'type' and 'config', which will
automatically forward event to IBS pmu.

Without patch:
  $ sudo ./perf record -C 0 -e r076:p -- sleep 1
  Error:
  The r076:p event is not supported.

With patch:
  $ sudo ./perf record -C 0 -e r076:p -- sleep 1
  [ perf record: Woken up 1 times to write data ]
  [ perf record: Captured and wrote 0.341 MB perf.data (37 samples) ]

Fixes: 66d258c5b048 ("perf/core: Optimize perf_init_event()")
Reported-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com>
---
 arch/x86/events/amd/core.c        |  2 +-
 arch/x86/events/amd/ibs.c         | 53 +++++++++++++++----------------
 arch/x86/include/asm/perf_event.h |  2 ++
 3 files changed, 29 insertions(+), 28 deletions(-)

diff --git a/arch/x86/events/amd/core.c b/arch/x86/events/amd/core.c
index bccea57dee81..abadd5f23425 100644
--- a/arch/x86/events/amd/core.c
+++ b/arch/x86/events/amd/core.c
@@ -374,7 +374,7 @@ static int amd_pmu_hw_config(struct perf_event *event)
 
 	/* pass precise event sampling to ibs: */
 	if (event->attr.precise_ip && get_ibs_caps())
-		return -ENOENT;
+		return forward_event_to_ibs(event);
 
 	if (has_branch_stack(event) && !x86_pmu.lbr_nr)
 		return -EOPNOTSUPP;
diff --git a/arch/x86/events/amd/ibs.c b/arch/x86/events/amd/ibs.c
index 64582954b5f6..371014802191 100644
--- a/arch/x86/events/amd/ibs.c
+++ b/arch/x86/events/amd/ibs.c
@@ -190,7 +190,7 @@ static struct perf_ibs *get_ibs_pmu(int type)
 }
 
 /*
- * Use IBS for precise event sampling:
+ * core pmu config -> IBS config
  *
  *  perf record -a -e cpu-cycles:p ...    # use ibs op counting cycle count
  *  perf record -a -e r076:p ...          # same as -e cpu-cycles:p
@@ -199,25 +199,9 @@ static struct perf_ibs *get_ibs_pmu(int type)
  * IbsOpCntCtl (bit 19) of IBS Execution Control Register (IbsOpCtl,
  * MSRC001_1033) is used to select either cycle or micro-ops counting
  * mode.
- *
- * The rip of IBS samples has skid 0. Thus, IBS supports precise
- * levels 1 and 2 and the PERF_EFLAGS_EXACT is set. In rare cases the
- * rip is invalid when IBS was not able to record the rip correctly.
- * We clear PERF_EFLAGS_EXACT and take the rip from pt_regs then.
- *
  */
-static int perf_ibs_precise_event(struct perf_event *event, u64 *config)
+static int core_pmu_ibs_config(struct perf_event *event, u64 *config)
 {
-	switch (event->attr.precise_ip) {
-	case 0:
-		return -ENOENT;
-	case 1:
-	case 2:
-		break;
-	default:
-		return -EOPNOTSUPP;
-	}
-
 	switch (event->attr.type) {
 	case PERF_TYPE_HARDWARE:
 		switch (event->attr.config) {
@@ -243,22 +227,37 @@ static int perf_ibs_precise_event(struct perf_event *event, u64 *config)
 	return -EOPNOTSUPP;
 }
 
+/*
+ * The rip of IBS samples has skid 0. Thus, IBS supports precise
+ * levels 1 and 2 and the PERF_EFLAGS_EXACT is set. In rare cases the
+ * rip is invalid when IBS was not able to record the rip correctly.
+ * We clear PERF_EFLAGS_EXACT and take the rip from pt_regs then.
+ */
+int forward_event_to_ibs(struct perf_event *event)
+{
+	u64 config = 0;
+
+	if (!event->attr.precise_ip || event->attr.precise_ip > 2)
+		return -EOPNOTSUPP;
+
+	if (!core_pmu_ibs_config(event, &config)) {
+		event->attr.type = perf_ibs_op.pmu.type;
+		event->attr.config = config;
+	}
+	return -ENOENT;
+}
+
 static int perf_ibs_init(struct perf_event *event)
 {
 	struct hw_perf_event *hwc = &event->hw;
 	struct perf_ibs *perf_ibs;
 	u64 max_cnt, config;
-	int ret;
 
 	perf_ibs = get_ibs_pmu(event->attr.type);
-	if (perf_ibs) {
-		config = event->attr.config;
-	} else {
-		perf_ibs = &perf_ibs_op;
-		ret = perf_ibs_precise_event(event, &config);
-		if (ret)
-			return ret;
-	}
+	if (!perf_ibs)
+		return -ENOENT;
+
+	config = event->attr.config;
 
 	if (event->pmu != &perf_ibs->pmu)
 		return -ENOENT;
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index 8fc15ed5e60b..fc86248215e2 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -475,8 +475,10 @@ struct pebs_xmm {
 
 #ifdef CONFIG_X86_LOCAL_APIC
 extern u32 get_ibs_caps(void);
+extern int forward_event_to_ibs(struct perf_event *event);
 #else
 static inline u32 get_ibs_caps(void) { return 0; }
+static inline int forward_event_to_ibs(struct perf_event *event) { return -ENOENT; }
 #endif
 
 #ifdef CONFIG_PERF_EVENTS
-- 
2.40.0


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v4 3/4] perf/core: Remove pmu linear searching code
  2023-05-04 10:59 [PATCH v4 0/4] perf: Rework event forwarding logic Ravi Bangoria
  2023-05-04 11:00 ` [PATCH v4 1/4] perf/core: Rework forwarding of {task|cpu}-clock events Ravi Bangoria
  2023-05-04 11:00 ` [PATCH v4 2/4] perf/ibs: Fix interface via core pmu events Ravi Bangoria
@ 2023-05-04 11:00 ` Ravi Bangoria
  2023-05-10 13:33   ` [tip: perf/core] " tip-bot2 for Ravi Bangoria
  2023-05-24 21:41   ` [PATCH v4 3/4] " Nathan Chancellor
  2023-05-04 11:00 ` [PATCH v4 4/4] perf test: Add selftest to test IBS invocation via core pmu events Ravi Bangoria
  3 siblings, 2 replies; 35+ messages in thread
From: Ravi Bangoria @ 2023-05-04 11:00 UTC (permalink / raw)
  To: peterz
  Cc: ravi.bangoria, namhyung, eranian, acme, mark.rutland, jolsa,
	irogers, bp, kan.liang, adrian.hunter, maddy, x86,
	linux-perf-users, linux-kernel, sandipan.das, ananth.narayan,
	santosh.shukla

Searching for the right pmu by iterating over all pmus is no longer
required since all pmus now *must* be present in the 'pmu_idr' list.
So, remove linear searching code.

Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com>
---
 kernel/events/core.c | 37 +++++++++++++------------------------
 1 file changed, 13 insertions(+), 24 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 0695bb9fbbb6..eba2b8595115 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -11630,38 +11630,27 @@ static struct pmu *perf_init_event(struct perf_event *event)
 	}
 
 again:
+	ret = -ENOENT;
 	rcu_read_lock();
 	pmu = idr_find(&pmu_idr, type);
 	rcu_read_unlock();
-	if (pmu) {
-		if (event->attr.type != type && type != PERF_TYPE_RAW &&
-		    !(pmu->capabilities & PERF_PMU_CAP_EXTENDED_HW_TYPE))
-			goto fail;
-
-		ret = perf_try_init_event(pmu, event);
-		if (ret == -ENOENT && event->attr.type != type && !extended_type) {
-			type = event->attr.type;
-			goto again;
-		}
+	if (!pmu)
+		goto fail;
 
-		if (ret)
-			pmu = ERR_PTR(ret);
+	if (event->attr.type != type && type != PERF_TYPE_RAW &&
+	    !(pmu->capabilities & PERF_PMU_CAP_EXTENDED_HW_TYPE))
+		goto fail;
 
-		goto unlock;
+	ret = perf_try_init_event(pmu, event);
+	if (ret == -ENOENT && event->attr.type != type && !extended_type) {
+		type = event->attr.type;
+		goto again;
 	}
 
-	list_for_each_entry_rcu(pmu, &pmus, entry, lockdep_is_held(&pmus_srcu)) {
-		ret = perf_try_init_event(pmu, event);
-		if (!ret)
-			goto unlock;
-
-		if (ret != -ENOENT) {
-			pmu = ERR_PTR(ret);
-			goto unlock;
-		}
-	}
 fail:
-	pmu = ERR_PTR(-ENOENT);
+	if (ret)
+		pmu = ERR_PTR(ret);
+
 unlock:
 	srcu_read_unlock(&pmus_srcu, idx);
 
-- 
2.40.0


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v4 4/4] perf test: Add selftest to test IBS invocation via core pmu events
  2023-05-04 10:59 [PATCH v4 0/4] perf: Rework event forwarding logic Ravi Bangoria
                   ` (2 preceding siblings ...)
  2023-05-04 11:00 ` [PATCH v4 3/4] perf/core: Remove pmu linear searching code Ravi Bangoria
@ 2023-05-04 11:00 ` Ravi Bangoria
  2023-05-05  9:16   ` Peter Zijlstra
  2023-05-10 13:33   ` [tip: perf/core] " tip-bot2 for Ravi Bangoria
  3 siblings, 2 replies; 35+ messages in thread
From: Ravi Bangoria @ 2023-05-04 11:00 UTC (permalink / raw)
  To: peterz
  Cc: ravi.bangoria, namhyung, eranian, acme, mark.rutland, jolsa,
	irogers, bp, kan.liang, adrian.hunter, maddy, x86,
	linux-perf-users, linux-kernel, sandipan.das, ananth.narayan,
	santosh.shukla

IBS pmu can be invoked via fixed set of core pmu events with 'precise_ip'
set to 1. Add a simple event open test for all these events.

Without kernel fix:
  $ sudo ./perf test -vv 76
   76: AMD IBS via core pmu                                      :
  --- start ---
  test child forked, pid 6553
  Using CPUID AuthenticAMD-25-1-1
  type: 0x0, config: 0x0, fd: 3  -  Pass
  type: 0x0, config: 0x1, fd: -1  -  Pass
  type: 0x4, config: 0x76, fd: -1  -  Fail
  type: 0x4, config: 0xc1, fd: -1  -  Fail
  type: 0x4, config: 0x12, fd: -1  -  Pass
  test child finished with -1
  ---- end ----
  AMD IBS via core pmu: FAILED!

With kernel fix:
  $ sudo ./perf test -vv 76
   76: AMD IBS via core pmu                                      :
  --- start ---
  test child forked, pid 7526
  Using CPUID AuthenticAMD-25-1-1
  type: 0x0, config: 0x0, fd: 3  -  Pass
  type: 0x0, config: 0x1, fd: -1  -  Pass
  type: 0x4, config: 0x76, fd: 3  -  Pass
  type: 0x4, config: 0xc1, fd: 3  -  Pass
  type: 0x4, config: 0x12, fd: -1  -  Pass
  test child finished with 0
  ---- end ----
  AMD IBS via core pmu: Ok

Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com>
---
 tools/perf/arch/x86/include/arch-tests.h      |  1 +
 tools/perf/arch/x86/tests/Build               |  1 +
 .../arch/x86/tests/amd-ibs-via-core-pmu.c     | 71 +++++++++++++++++++
 tools/perf/arch/x86/tests/arch-tests.c        |  2 +
 4 files changed, 75 insertions(+)
 create mode 100644 tools/perf/arch/x86/tests/amd-ibs-via-core-pmu.c

diff --git a/tools/perf/arch/x86/include/arch-tests.h b/tools/perf/arch/x86/include/arch-tests.h
index 902e9ea9b99e..93d3b8877baa 100644
--- a/tools/perf/arch/x86/include/arch-tests.h
+++ b/tools/perf/arch/x86/include/arch-tests.h
@@ -11,6 +11,7 @@ int test__intel_pt_pkt_decoder(struct test_suite *test, int subtest);
 int test__intel_pt_hybrid_compat(struct test_suite *test, int subtest);
 int test__bp_modify(struct test_suite *test, int subtest);
 int test__x86_sample_parsing(struct test_suite *test, int subtest);
+int test__amd_ibs_via_core_pmu(struct test_suite *test, int subtest);
 
 extern struct test_suite *arch_tests[];
 
diff --git a/tools/perf/arch/x86/tests/Build b/tools/perf/arch/x86/tests/Build
index 6f4e8636c3bf..fd02d8181718 100644
--- a/tools/perf/arch/x86/tests/Build
+++ b/tools/perf/arch/x86/tests/Build
@@ -5,3 +5,4 @@ perf-y += arch-tests.o
 perf-y += sample-parsing.o
 perf-$(CONFIG_AUXTRACE) += insn-x86.o intel-pt-test.o
 perf-$(CONFIG_X86_64) += bp-modify.o
+perf-y += amd-ibs-via-core-pmu.o
diff --git a/tools/perf/arch/x86/tests/amd-ibs-via-core-pmu.c b/tools/perf/arch/x86/tests/amd-ibs-via-core-pmu.c
new file mode 100644
index 000000000000..2902798ca5c1
--- /dev/null
+++ b/tools/perf/arch/x86/tests/amd-ibs-via-core-pmu.c
@@ -0,0 +1,71 @@
+// SPDX-License-Identifier: GPL-2.0
+#include "arch-tests.h"
+#include "linux/perf_event.h"
+#include "tests/tests.h"
+#include "pmu.h"
+#include "pmus.h"
+#include "../perf-sys.h"
+#include "debug.h"
+
+#define NR_SUB_TESTS 5
+
+static struct sub_tests {
+	int type;
+	unsigned long config;
+	bool valid;
+} sub_tests[NR_SUB_TESTS] = {
+	{ PERF_TYPE_HARDWARE, PERF_COUNT_HW_CPU_CYCLES, true },
+	{ PERF_TYPE_HARDWARE, PERF_COUNT_HW_INSTRUCTIONS, false },
+	{ PERF_TYPE_RAW, 0x076, true },
+	{ PERF_TYPE_RAW, 0x0C1, true },
+	{ PERF_TYPE_RAW, 0x012, false },
+};
+
+static int event_open(int type, unsigned long config)
+{
+	struct perf_event_attr attr;
+
+	memset(&attr, 0, sizeof(struct perf_event_attr));
+	attr.type = type;
+	attr.size = sizeof(struct perf_event_attr);
+	attr.config = config;
+	attr.disabled = 1;
+	attr.precise_ip = 1;
+	attr.sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_TID;
+	attr.sample_period = 100000;
+
+	return sys_perf_event_open(&attr, -1, 0, -1, 0);
+}
+
+int test__amd_ibs_via_core_pmu(struct test_suite *test __maybe_unused,
+			       int subtest __maybe_unused)
+{
+	struct perf_pmu *ibs_pmu;
+	int ret = TEST_OK;
+	int fd, i;
+
+	if (list_empty(&pmus))
+		perf_pmu__scan(NULL);
+
+	ibs_pmu = perf_pmu__find("ibs_op");
+	if (!ibs_pmu)
+		return TEST_SKIP;
+
+	for (i = 0; i < NR_SUB_TESTS; i++) {
+		fd = event_open(sub_tests[i].type, sub_tests[i].config);
+		pr_debug("type: 0x%x, config: 0x%lx, fd: %d  -  ", sub_tests[i].type,
+			 sub_tests[i].config, fd);
+		if ((sub_tests[i].valid && fd == -1) ||
+		    (!sub_tests[i].valid && fd > 0)) {
+			pr_debug("Fail\n");
+			ret = TEST_FAIL;
+		} else {
+			pr_debug("Pass\n");
+		}
+
+		if (fd > 0)
+			close(fd);
+	}
+
+	return ret;
+}
diff --git a/tools/perf/arch/x86/tests/arch-tests.c b/tools/perf/arch/x86/tests/arch-tests.c
index aae6ea0fe52b..b5c85ab8d92e 100644
--- a/tools/perf/arch/x86/tests/arch-tests.c
+++ b/tools/perf/arch/x86/tests/arch-tests.c
@@ -22,6 +22,7 @@ struct test_suite suite__intel_pt = {
 DEFINE_SUITE("x86 bp modify", bp_modify);
 #endif
 DEFINE_SUITE("x86 Sample parsing", x86_sample_parsing);
+DEFINE_SUITE("AMD IBS via core pmu", amd_ibs_via_core_pmu);
 
 struct test_suite *arch_tests[] = {
 #ifdef HAVE_DWARF_UNWIND_SUPPORT
@@ -35,5 +36,6 @@ struct test_suite *arch_tests[] = {
 	&suite__bp_modify,
 #endif
 	&suite__x86_sample_parsing,
+	&suite__amd_ibs_via_core_pmu,
 	NULL,
 };
-- 
2.40.0


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH v4 4/4] perf test: Add selftest to test IBS invocation via core pmu events
  2023-05-04 11:00 ` [PATCH v4 4/4] perf test: Add selftest to test IBS invocation via core pmu events Ravi Bangoria
@ 2023-05-05  9:16   ` Peter Zijlstra
  2023-05-15 21:31     ` Arnaldo Carvalho de Melo
  2023-05-10 13:33   ` [tip: perf/core] " tip-bot2 for Ravi Bangoria
  1 sibling, 1 reply; 35+ messages in thread
From: Peter Zijlstra @ 2023-05-05  9:16 UTC (permalink / raw)
  To: Ravi Bangoria
  Cc: namhyung, eranian, acme, mark.rutland, jolsa, irogers, bp,
	kan.liang, adrian.hunter, maddy, x86, linux-perf-users,
	linux-kernel, sandipan.das, ananth.narayan, santosh.shukla

On Thu, May 04, 2023 at 04:30:03PM +0530, Ravi Bangoria wrote:
> IBS pmu can be invoked via fixed set of core pmu events with 'precise_ip'
> set to 1. Add a simple event open test for all these events.
> 
> Without kernel fix:
>   $ sudo ./perf test -vv 76
>    76: AMD IBS via core pmu                                      :
>   --- start ---
>   test child forked, pid 6553
>   Using CPUID AuthenticAMD-25-1-1
>   type: 0x0, config: 0x0, fd: 3  -  Pass
>   type: 0x0, config: 0x1, fd: -1  -  Pass
>   type: 0x4, config: 0x76, fd: -1  -  Fail
>   type: 0x4, config: 0xc1, fd: -1  -  Fail
>   type: 0x4, config: 0x12, fd: -1  -  Pass
>   test child finished with -1
>   ---- end ----
>   AMD IBS via core pmu: FAILED!
> 
> With kernel fix:
>   $ sudo ./perf test -vv 76
>    76: AMD IBS via core pmu                                      :
>   --- start ---
>   test child forked, pid 7526
>   Using CPUID AuthenticAMD-25-1-1
>   type: 0x0, config: 0x0, fd: 3  -  Pass
>   type: 0x0, config: 0x1, fd: -1  -  Pass
>   type: 0x4, config: 0x76, fd: 3  -  Pass
>   type: 0x4, config: 0xc1, fd: 3  -  Pass
>   type: 0x4, config: 0x12, fd: -1  -  Pass
>   test child finished with 0
>   ---- end ----
>   AMD IBS via core pmu: Ok
> 
> Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com>
> ---
>  tools/perf/arch/x86/include/arch-tests.h      |  1 +
>  tools/perf/arch/x86/tests/Build               |  1 +
>  .../arch/x86/tests/amd-ibs-via-core-pmu.c     | 71 +++++++++++++++++++
>  tools/perf/arch/x86/tests/arch-tests.c        |  2 +
>  4 files changed, 75 insertions(+)
>  create mode 100644 tools/perf/arch/x86/tests/amd-ibs-via-core-pmu.c

Arnaldo, ok if I merge this along with the other patches or would you
rather take it?

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [tip: perf/core] perf test: Add selftest to test IBS invocation via core pmu events
  2023-05-04 11:00 ` [PATCH v4 4/4] perf test: Add selftest to test IBS invocation via core pmu events Ravi Bangoria
  2023-05-05  9:16   ` Peter Zijlstra
@ 2023-05-10 13:33   ` tip-bot2 for Ravi Bangoria
  1 sibling, 0 replies; 35+ messages in thread
From: tip-bot2 for Ravi Bangoria @ 2023-05-10 13:33 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Ravi Bangoria, Peter Zijlstra (Intel), x86, linux-kernel

The following commit has been merged into the perf/core branch of tip:

Commit-ID:     78075d947534013b4575687d19ebcbbb6d3addcd
Gitweb:        https://git.kernel.org/tip/78075d947534013b4575687d19ebcbbb6d3addcd
Author:        Ravi Bangoria <ravi.bangoria@amd.com>
AuthorDate:    Thu, 04 May 2023 16:30:03 +05:30
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Mon, 08 May 2023 10:58:31 +02:00

perf test: Add selftest to test IBS invocation via core pmu events

IBS pmu can be invoked via fixed set of core pmu events with 'precise_ip'
set to 1. Add a simple event open test for all these events.

Without kernel fix:
  $ sudo ./perf test -vv 76
   76: AMD IBS via core pmu                                      :
  --- start ---
  test child forked, pid 6553
  Using CPUID AuthenticAMD-25-1-1
  type: 0x0, config: 0x0, fd: 3  -  Pass
  type: 0x0, config: 0x1, fd: -1  -  Pass
  type: 0x4, config: 0x76, fd: -1  -  Fail
  type: 0x4, config: 0xc1, fd: -1  -  Fail
  type: 0x4, config: 0x12, fd: -1  -  Pass
  test child finished with -1
  ---- end ----
  AMD IBS via core pmu: FAILED!

With kernel fix:
  $ sudo ./perf test -vv 76
   76: AMD IBS via core pmu                                      :
  --- start ---
  test child forked, pid 7526
  Using CPUID AuthenticAMD-25-1-1
  type: 0x0, config: 0x0, fd: 3  -  Pass
  type: 0x0, config: 0x1, fd: -1  -  Pass
  type: 0x4, config: 0x76, fd: 3  -  Pass
  type: 0x4, config: 0xc1, fd: 3  -  Pass
  type: 0x4, config: 0x12, fd: -1  -  Pass
  test child finished with 0
  ---- end ----
  AMD IBS via core pmu: Ok

Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20230504110003.2548-5-ravi.bangoria@amd.com
---
 tools/perf/arch/x86/include/arch-tests.h         |  1 +-
 tools/perf/arch/x86/tests/Build                  |  1 +-
 tools/perf/arch/x86/tests/amd-ibs-via-core-pmu.c | 71 +++++++++++++++-
 tools/perf/arch/x86/tests/arch-tests.c           |  2 +-
 4 files changed, 75 insertions(+)
 create mode 100644 tools/perf/arch/x86/tests/amd-ibs-via-core-pmu.c

diff --git a/tools/perf/arch/x86/include/arch-tests.h b/tools/perf/arch/x86/include/arch-tests.h
index 902e9ea..93d3b88 100644
--- a/tools/perf/arch/x86/include/arch-tests.h
+++ b/tools/perf/arch/x86/include/arch-tests.h
@@ -11,6 +11,7 @@ int test__intel_pt_pkt_decoder(struct test_suite *test, int subtest);
 int test__intel_pt_hybrid_compat(struct test_suite *test, int subtest);
 int test__bp_modify(struct test_suite *test, int subtest);
 int test__x86_sample_parsing(struct test_suite *test, int subtest);
+int test__amd_ibs_via_core_pmu(struct test_suite *test, int subtest);
 
 extern struct test_suite *arch_tests[];
 
diff --git a/tools/perf/arch/x86/tests/Build b/tools/perf/arch/x86/tests/Build
index 6f4e863..fd02d81 100644
--- a/tools/perf/arch/x86/tests/Build
+++ b/tools/perf/arch/x86/tests/Build
@@ -5,3 +5,4 @@ perf-y += arch-tests.o
 perf-y += sample-parsing.o
 perf-$(CONFIG_AUXTRACE) += insn-x86.o intel-pt-test.o
 perf-$(CONFIG_X86_64) += bp-modify.o
+perf-y += amd-ibs-via-core-pmu.o
diff --git a/tools/perf/arch/x86/tests/amd-ibs-via-core-pmu.c b/tools/perf/arch/x86/tests/amd-ibs-via-core-pmu.c
new file mode 100644
index 0000000..2902798
--- /dev/null
+++ b/tools/perf/arch/x86/tests/amd-ibs-via-core-pmu.c
@@ -0,0 +1,71 @@
+// SPDX-License-Identifier: GPL-2.0
+#include "arch-tests.h"
+#include "linux/perf_event.h"
+#include "tests/tests.h"
+#include "pmu.h"
+#include "pmus.h"
+#include "../perf-sys.h"
+#include "debug.h"
+
+#define NR_SUB_TESTS 5
+
+static struct sub_tests {
+	int type;
+	unsigned long config;
+	bool valid;
+} sub_tests[NR_SUB_TESTS] = {
+	{ PERF_TYPE_HARDWARE, PERF_COUNT_HW_CPU_CYCLES, true },
+	{ PERF_TYPE_HARDWARE, PERF_COUNT_HW_INSTRUCTIONS, false },
+	{ PERF_TYPE_RAW, 0x076, true },
+	{ PERF_TYPE_RAW, 0x0C1, true },
+	{ PERF_TYPE_RAW, 0x012, false },
+};
+
+static int event_open(int type, unsigned long config)
+{
+	struct perf_event_attr attr;
+
+	memset(&attr, 0, sizeof(struct perf_event_attr));
+	attr.type = type;
+	attr.size = sizeof(struct perf_event_attr);
+	attr.config = config;
+	attr.disabled = 1;
+	attr.precise_ip = 1;
+	attr.sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_TID;
+	attr.sample_period = 100000;
+
+	return sys_perf_event_open(&attr, -1, 0, -1, 0);
+}
+
+int test__amd_ibs_via_core_pmu(struct test_suite *test __maybe_unused,
+			       int subtest __maybe_unused)
+{
+	struct perf_pmu *ibs_pmu;
+	int ret = TEST_OK;
+	int fd, i;
+
+	if (list_empty(&pmus))
+		perf_pmu__scan(NULL);
+
+	ibs_pmu = perf_pmu__find("ibs_op");
+	if (!ibs_pmu)
+		return TEST_SKIP;
+
+	for (i = 0; i < NR_SUB_TESTS; i++) {
+		fd = event_open(sub_tests[i].type, sub_tests[i].config);
+		pr_debug("type: 0x%x, config: 0x%lx, fd: %d  -  ", sub_tests[i].type,
+			 sub_tests[i].config, fd);
+		if ((sub_tests[i].valid && fd == -1) ||
+		    (!sub_tests[i].valid && fd > 0)) {
+			pr_debug("Fail\n");
+			ret = TEST_FAIL;
+		} else {
+			pr_debug("Pass\n");
+		}
+
+		if (fd > 0)
+			close(fd);
+	}
+
+	return ret;
+}
diff --git a/tools/perf/arch/x86/tests/arch-tests.c b/tools/perf/arch/x86/tests/arch-tests.c
index aae6ea0..b5c85ab 100644
--- a/tools/perf/arch/x86/tests/arch-tests.c
+++ b/tools/perf/arch/x86/tests/arch-tests.c
@@ -22,6 +22,7 @@ struct test_suite suite__intel_pt = {
 DEFINE_SUITE("x86 bp modify", bp_modify);
 #endif
 DEFINE_SUITE("x86 Sample parsing", x86_sample_parsing);
+DEFINE_SUITE("AMD IBS via core pmu", amd_ibs_via_core_pmu);
 
 struct test_suite *arch_tests[] = {
 #ifdef HAVE_DWARF_UNWIND_SUPPORT
@@ -35,5 +36,6 @@ struct test_suite *arch_tests[] = {
 	&suite__bp_modify,
 #endif
 	&suite__x86_sample_parsing,
+	&suite__amd_ibs_via_core_pmu,
 	NULL,
 };

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [tip: perf/core] perf/core: Remove pmu linear searching code
  2023-05-04 11:00 ` [PATCH v4 3/4] perf/core: Remove pmu linear searching code Ravi Bangoria
@ 2023-05-10 13:33   ` tip-bot2 for Ravi Bangoria
  2023-05-24 21:41   ` [PATCH v4 3/4] " Nathan Chancellor
  1 sibling, 0 replies; 35+ messages in thread
From: tip-bot2 for Ravi Bangoria @ 2023-05-10 13:33 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Ravi Bangoria, Peter Zijlstra (Intel), x86, linux-kernel

The following commit has been merged into the perf/core branch of tip:

Commit-ID:     9551fbb64d094cc105964716224adeb7765df8fd
Gitweb:        https://git.kernel.org/tip/9551fbb64d094cc105964716224adeb7765df8fd
Author:        Ravi Bangoria <ravi.bangoria@amd.com>
AuthorDate:    Thu, 04 May 2023 16:30:02 +05:30
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Mon, 08 May 2023 10:58:31 +02:00

perf/core: Remove pmu linear searching code

Searching for the right pmu by iterating over all pmus is no longer
required since all pmus now *must* be present in the 'pmu_idr' list.
So, remove linear searching code.

Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20230504110003.2548-4-ravi.bangoria@amd.com
---
 kernel/events/core.c | 37 +++++++++++++------------------------
 1 file changed, 13 insertions(+), 24 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index c01bbe9..231b187 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -11630,38 +11630,27 @@ static struct pmu *perf_init_event(struct perf_event *event)
 	}
 
 again:
+	ret = -ENOENT;
 	rcu_read_lock();
 	pmu = idr_find(&pmu_idr, type);
 	rcu_read_unlock();
-	if (pmu) {
-		if (event->attr.type != type && type != PERF_TYPE_RAW &&
-		    !(pmu->capabilities & PERF_PMU_CAP_EXTENDED_HW_TYPE))
-			goto fail;
-
-		ret = perf_try_init_event(pmu, event);
-		if (ret == -ENOENT && event->attr.type != type && !extended_type) {
-			type = event->attr.type;
-			goto again;
-		}
+	if (!pmu)
+		goto fail;
 
-		if (ret)
-			pmu = ERR_PTR(ret);
+	if (event->attr.type != type && type != PERF_TYPE_RAW &&
+	    !(pmu->capabilities & PERF_PMU_CAP_EXTENDED_HW_TYPE))
+		goto fail;
 
-		goto unlock;
+	ret = perf_try_init_event(pmu, event);
+	if (ret == -ENOENT && event->attr.type != type && !extended_type) {
+		type = event->attr.type;
+		goto again;
 	}
 
-	list_for_each_entry_rcu(pmu, &pmus, entry, lockdep_is_held(&pmus_srcu)) {
-		ret = perf_try_init_event(pmu, event);
-		if (!ret)
-			goto unlock;
-
-		if (ret != -ENOENT) {
-			pmu = ERR_PTR(ret);
-			goto unlock;
-		}
-	}
 fail:
-	pmu = ERR_PTR(-ENOENT);
+	if (ret)
+		pmu = ERR_PTR(ret);
+
 unlock:
 	srcu_read_unlock(&pmus_srcu, idx);
 

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [tip: perf/core] perf/ibs: Fix interface via core pmu events
  2023-05-04 11:00 ` [PATCH v4 2/4] perf/ibs: Fix interface via core pmu events Ravi Bangoria
@ 2023-05-10 13:33   ` tip-bot2 for Ravi Bangoria
  0 siblings, 0 replies; 35+ messages in thread
From: tip-bot2 for Ravi Bangoria @ 2023-05-10 13:33 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Stephane Eranian, Ravi Bangoria, Peter Zijlstra (Intel),
	x86, linux-kernel

The following commit has been merged into the perf/core branch of tip:

Commit-ID:     2fad201fe38ff9a692acedb1990ece2c52a29f95
Gitweb:        https://git.kernel.org/tip/2fad201fe38ff9a692acedb1990ece2c52a29f95
Author:        Ravi Bangoria <ravi.bangoria@amd.com>
AuthorDate:    Thu, 04 May 2023 16:30:01 +05:30
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Mon, 08 May 2023 10:58:30 +02:00

perf/ibs: Fix interface via core pmu events

Although, IBS pmus can be invoked via their own interface, indirect
IBS invocation via core pmu events is also supported with fixed set
of events: cpu-cycles:p, r076:p (same as cpu-cycles:p) and r0C1:p
(micro-ops) for user convenience.

This indirect IBS invocation is broken since commit 66d258c5b048
("perf/core: Optimize perf_init_event()"), which added RAW pmu under
'pmu_idr' list and thus if event_init() fails with RAW pmu, it started
returning error instead of trying other pmus.

Forward precise events from core pmu to IBS by overwriting 'type' and
'config' in the kernel copy of perf_event_attr. Overwriting will cause
perf_init_event() to retry with updated 'type' and 'config', which will
automatically forward event to IBS pmu.

Without patch:
  $ sudo ./perf record -C 0 -e r076:p -- sleep 1
  Error:
  The r076:p event is not supported.

With patch:
  $ sudo ./perf record -C 0 -e r076:p -- sleep 1
  [ perf record: Woken up 1 times to write data ]
  [ perf record: Captured and wrote 0.341 MB perf.data (37 samples) ]

Fixes: 66d258c5b048 ("perf/core: Optimize perf_init_event()")
Reported-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20230504110003.2548-3-ravi.bangoria@amd.com
---
 arch/x86/events/amd/core.c        |  2 +-
 arch/x86/events/amd/ibs.c         | 53 ++++++++++++++----------------
 arch/x86/include/asm/perf_event.h |  2 +-
 3 files changed, 29 insertions(+), 28 deletions(-)

diff --git a/arch/x86/events/amd/core.c b/arch/x86/events/amd/core.c
index bccea57..abadd5f 100644
--- a/arch/x86/events/amd/core.c
+++ b/arch/x86/events/amd/core.c
@@ -374,7 +374,7 @@ static int amd_pmu_hw_config(struct perf_event *event)
 
 	/* pass precise event sampling to ibs: */
 	if (event->attr.precise_ip && get_ibs_caps())
-		return -ENOENT;
+		return forward_event_to_ibs(event);
 
 	if (has_branch_stack(event) && !x86_pmu.lbr_nr)
 		return -EOPNOTSUPP;
diff --git a/arch/x86/events/amd/ibs.c b/arch/x86/events/amd/ibs.c
index 6458295..3710148 100644
--- a/arch/x86/events/amd/ibs.c
+++ b/arch/x86/events/amd/ibs.c
@@ -190,7 +190,7 @@ static struct perf_ibs *get_ibs_pmu(int type)
 }
 
 /*
- * Use IBS for precise event sampling:
+ * core pmu config -> IBS config
  *
  *  perf record -a -e cpu-cycles:p ...    # use ibs op counting cycle count
  *  perf record -a -e r076:p ...          # same as -e cpu-cycles:p
@@ -199,25 +199,9 @@ static struct perf_ibs *get_ibs_pmu(int type)
  * IbsOpCntCtl (bit 19) of IBS Execution Control Register (IbsOpCtl,
  * MSRC001_1033) is used to select either cycle or micro-ops counting
  * mode.
- *
- * The rip of IBS samples has skid 0. Thus, IBS supports precise
- * levels 1 and 2 and the PERF_EFLAGS_EXACT is set. In rare cases the
- * rip is invalid when IBS was not able to record the rip correctly.
- * We clear PERF_EFLAGS_EXACT and take the rip from pt_regs then.
- *
  */
-static int perf_ibs_precise_event(struct perf_event *event, u64 *config)
+static int core_pmu_ibs_config(struct perf_event *event, u64 *config)
 {
-	switch (event->attr.precise_ip) {
-	case 0:
-		return -ENOENT;
-	case 1:
-	case 2:
-		break;
-	default:
-		return -EOPNOTSUPP;
-	}
-
 	switch (event->attr.type) {
 	case PERF_TYPE_HARDWARE:
 		switch (event->attr.config) {
@@ -243,22 +227,37 @@ static int perf_ibs_precise_event(struct perf_event *event, u64 *config)
 	return -EOPNOTSUPP;
 }
 
+/*
+ * The rip of IBS samples has skid 0. Thus, IBS supports precise
+ * levels 1 and 2 and the PERF_EFLAGS_EXACT is set. In rare cases the
+ * rip is invalid when IBS was not able to record the rip correctly.
+ * We clear PERF_EFLAGS_EXACT and take the rip from pt_regs then.
+ */
+int forward_event_to_ibs(struct perf_event *event)
+{
+	u64 config = 0;
+
+	if (!event->attr.precise_ip || event->attr.precise_ip > 2)
+		return -EOPNOTSUPP;
+
+	if (!core_pmu_ibs_config(event, &config)) {
+		event->attr.type = perf_ibs_op.pmu.type;
+		event->attr.config = config;
+	}
+	return -ENOENT;
+}
+
 static int perf_ibs_init(struct perf_event *event)
 {
 	struct hw_perf_event *hwc = &event->hw;
 	struct perf_ibs *perf_ibs;
 	u64 max_cnt, config;
-	int ret;
 
 	perf_ibs = get_ibs_pmu(event->attr.type);
-	if (perf_ibs) {
-		config = event->attr.config;
-	} else {
-		perf_ibs = &perf_ibs_op;
-		ret = perf_ibs_precise_event(event, &config);
-		if (ret)
-			return ret;
-	}
+	if (!perf_ibs)
+		return -ENOENT;
+
+	config = event->attr.config;
 
 	if (event->pmu != &perf_ibs->pmu)
 		return -ENOENT;
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index 8fc15ed..fc86248 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -475,8 +475,10 @@ struct pebs_xmm {
 
 #ifdef CONFIG_X86_LOCAL_APIC
 extern u32 get_ibs_caps(void);
+extern int forward_event_to_ibs(struct perf_event *event);
 #else
 static inline u32 get_ibs_caps(void) { return 0; }
+static inline int forward_event_to_ibs(struct perf_event *event) { return -ENOENT; }
 #endif
 
 #ifdef CONFIG_PERF_EVENTS

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [tip: perf/core] perf/core: Rework forwarding of {task|cpu}-clock events
  2023-05-04 11:00 ` [PATCH v4 1/4] perf/core: Rework forwarding of {task|cpu}-clock events Ravi Bangoria
@ 2023-05-10 13:33   ` tip-bot2 for Ravi Bangoria
  2024-02-20  8:41   ` [PATCH v4 1/4] " Pengfei Xu
  1 sibling, 0 replies; 35+ messages in thread
From: tip-bot2 for Ravi Bangoria @ 2023-05-10 13:33 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: Peter Zijlstra, Ravi Bangoria, x86, linux-kernel

The following commit has been merged into the perf/core branch of tip:

Commit-ID:     0d6d062ca27ec7ef547712d34dcfcfb952bcef53
Gitweb:        https://git.kernel.org/tip/0d6d062ca27ec7ef547712d34dcfcfb952bcef53
Author:        Ravi Bangoria <ravi.bangoria@amd.com>
AuthorDate:    Thu, 04 May 2023 16:30:00 +05:30
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Mon, 08 May 2023 10:58:30 +02:00

perf/core: Rework forwarding of {task|cpu}-clock events

Currently, PERF_TYPE_SOFTWARE is treated specially since task-clock and
cpu-clock events are interfaced through it but internally gets forwarded
to their own pmus.

Rework this by overwriting event->attr.type in perf_swevent_init() which
will cause perf_init_event() to retry with updated type and event will
automatically get forwarded to right pmu. With the change, SW pmu no
longer needs to be treated specially and can be included in 'pmu_idr'
list.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20230504110003.2548-2-ravi.bangoria@amd.com
---
 include/linux/perf_event.h | 10 +++++-
 kernel/events/core.c       | 77 +++++++++++++++++++------------------
 2 files changed, 51 insertions(+), 36 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index d5628a7..bf4f346 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -295,6 +295,8 @@ struct perf_event_pmu_context;
 
 struct perf_output_handle;
 
+#define PMU_NULL_DEV	((void *)(~0UL))
+
 /**
  * struct pmu - generic performance monitoring unit
  */
@@ -827,6 +829,14 @@ struct perf_event {
 	void *security;
 #endif
 	struct list_head		sb_list;
+
+	/*
+	 * Certain events gets forwarded to another pmu internally by over-
+	 * writing kernel copy of event->attr.type without user being aware
+	 * of it. event->orig_type contains original 'type' requested by
+	 * user.
+	 */
+	__u32				orig_type;
 #endif /* CONFIG_PERF_EVENTS */
 };
 
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 68baa81..c01bbe9 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -6647,7 +6647,7 @@ static void perf_sigtrap(struct perf_event *event)
 		return;
 
 	send_sig_perf((void __user *)event->pending_addr,
-		      event->attr.type, event->attr.sig_data);
+		      event->orig_type, event->attr.sig_data);
 }
 
 /*
@@ -9951,6 +9951,9 @@ static void sw_perf_event_destroy(struct perf_event *event)
 	swevent_hlist_put();
 }
 
+static struct pmu perf_cpu_clock; /* fwd declaration */
+static struct pmu perf_task_clock;
+
 static int perf_swevent_init(struct perf_event *event)
 {
 	u64 event_id = event->attr.config;
@@ -9966,7 +9969,10 @@ static int perf_swevent_init(struct perf_event *event)
 
 	switch (event_id) {
 	case PERF_COUNT_SW_CPU_CLOCK:
+		event->attr.type = perf_cpu_clock.type;
+		return -ENOENT;
 	case PERF_COUNT_SW_TASK_CLOCK:
+		event->attr.type = perf_task_clock.type;
 		return -ENOENT;
 
 	default:
@@ -11086,7 +11092,7 @@ static void cpu_clock_event_read(struct perf_event *event)
 
 static int cpu_clock_event_init(struct perf_event *event)
 {
-	if (event->attr.type != PERF_TYPE_SOFTWARE)
+	if (event->attr.type != perf_cpu_clock.type)
 		return -ENOENT;
 
 	if (event->attr.config != PERF_COUNT_SW_CPU_CLOCK)
@@ -11107,6 +11113,7 @@ static struct pmu perf_cpu_clock = {
 	.task_ctx_nr	= perf_sw_context,
 
 	.capabilities	= PERF_PMU_CAP_NO_NMI,
+	.dev		= PMU_NULL_DEV,
 
 	.event_init	= cpu_clock_event_init,
 	.add		= cpu_clock_event_add,
@@ -11167,7 +11174,7 @@ static void task_clock_event_read(struct perf_event *event)
 
 static int task_clock_event_init(struct perf_event *event)
 {
-	if (event->attr.type != PERF_TYPE_SOFTWARE)
+	if (event->attr.type != perf_task_clock.type)
 		return -ENOENT;
 
 	if (event->attr.config != PERF_COUNT_SW_TASK_CLOCK)
@@ -11188,6 +11195,7 @@ static struct pmu perf_task_clock = {
 	.task_ctx_nr	= perf_sw_context,
 
 	.capabilities	= PERF_PMU_CAP_NO_NMI,
+	.dev		= PMU_NULL_DEV,
 
 	.event_init	= task_clock_event_init,
 	.add		= task_clock_event_add,
@@ -11415,31 +11423,31 @@ int perf_pmu_register(struct pmu *pmu, const char *name, int type)
 		goto unlock;
 
 	pmu->type = -1;
-	if (!name)
-		goto skip_type;
+	if (WARN_ONCE(!name, "Can not register anonymous pmu.\n")) {
+		ret = -EINVAL;
+		goto free_pdc;
+	}
+
 	pmu->name = name;
 
-	if (type != PERF_TYPE_SOFTWARE) {
-		if (type >= 0)
-			max = type;
+	if (type >= 0)
+		max = type;
 
-		ret = idr_alloc(&pmu_idr, pmu, max, 0, GFP_KERNEL);
-		if (ret < 0)
-			goto free_pdc;
+	ret = idr_alloc(&pmu_idr, pmu, max, 0, GFP_KERNEL);
+	if (ret < 0)
+		goto free_pdc;
 
-		WARN_ON(type >= 0 && ret != type);
+	WARN_ON(type >= 0 && ret != type);
 
-		type = ret;
-	}
+	type = ret;
 	pmu->type = type;
 
-	if (pmu_bus_running) {
+	if (pmu_bus_running && !pmu->dev) {
 		ret = pmu_dev_alloc(pmu);
 		if (ret)
 			goto free_idr;
 	}
 
-skip_type:
 	ret = -ENOMEM;
 	pmu->cpu_pmu_context = alloc_percpu(struct perf_cpu_pmu_context);
 	if (!pmu->cpu_pmu_context)
@@ -11481,16 +11489,7 @@ skip_type:
 	if (!pmu->event_idx)
 		pmu->event_idx = perf_event_idx_default;
 
-	/*
-	 * Ensure the TYPE_SOFTWARE PMUs are at the head of the list,
-	 * since these cannot be in the IDR. This way the linear search
-	 * is fast, provided a valid software event is provided.
-	 */
-	if (type == PERF_TYPE_SOFTWARE || !name)
-		list_add_rcu(&pmu->entry, &pmus);
-	else
-		list_add_tail_rcu(&pmu->entry, &pmus);
-
+	list_add_rcu(&pmu->entry, &pmus);
 	atomic_set(&pmu->exclusive_cnt, 0);
 	ret = 0;
 unlock:
@@ -11499,12 +11498,13 @@ unlock:
 	return ret;
 
 free_dev:
-	device_del(pmu->dev);
-	put_device(pmu->dev);
+	if (pmu->dev && pmu->dev != PMU_NULL_DEV) {
+		device_del(pmu->dev);
+		put_device(pmu->dev);
+	}
 
 free_idr:
-	if (pmu->type != PERF_TYPE_SOFTWARE)
-		idr_remove(&pmu_idr, pmu->type);
+	idr_remove(&pmu_idr, pmu->type);
 
 free_pdc:
 	free_percpu(pmu->pmu_disable_count);
@@ -11525,9 +11525,8 @@ void perf_pmu_unregister(struct pmu *pmu)
 	synchronize_rcu();
 
 	free_percpu(pmu->pmu_disable_count);
-	if (pmu->type != PERF_TYPE_SOFTWARE)
-		idr_remove(&pmu_idr, pmu->type);
-	if (pmu_bus_running) {
+	idr_remove(&pmu_idr, pmu->type);
+	if (pmu_bus_running && pmu->dev && pmu->dev != PMU_NULL_DEV) {
 		if (pmu->nr_addr_filters)
 			device_remove_file(pmu->dev, &dev_attr_nr_addr_filters);
 		device_del(pmu->dev);
@@ -11601,6 +11600,12 @@ static struct pmu *perf_init_event(struct perf_event *event)
 
 	idx = srcu_read_lock(&pmus_srcu);
 
+	/*
+	 * Save original type before calling pmu->event_init() since certain
+	 * pmus overwrites event->attr.type to forward event to another pmu.
+	 */
+	event->orig_type = event->attr.type;
+
 	/* Try parent's PMU first: */
 	if (event->parent && event->parent->pmu) {
 		pmu = event->parent->pmu;
@@ -13640,8 +13645,8 @@ void __init perf_event_init(void)
 	perf_event_init_all_cpus();
 	init_srcu_struct(&pmus_srcu);
 	perf_pmu_register(&perf_swevent, "software", PERF_TYPE_SOFTWARE);
-	perf_pmu_register(&perf_cpu_clock, NULL, -1);
-	perf_pmu_register(&perf_task_clock, NULL, -1);
+	perf_pmu_register(&perf_cpu_clock, "cpu_clock", -1);
+	perf_pmu_register(&perf_task_clock, "task_clock", -1);
 	perf_tp_register();
 	perf_event_init_cpu(smp_processor_id());
 	register_reboot_notifier(&perf_reboot_notifier);
@@ -13684,7 +13689,7 @@ static int __init perf_event_sysfs_init(void)
 		goto unlock;
 
 	list_for_each_entry(pmu, &pmus, entry) {
-		if (!pmu->name || pmu->type < 0)
+		if (pmu->dev)
 			continue;
 
 		ret = pmu_dev_alloc(pmu);

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH v4 4/4] perf test: Add selftest to test IBS invocation via core pmu events
  2023-05-05  9:16   ` Peter Zijlstra
@ 2023-05-15 21:31     ` Arnaldo Carvalho de Melo
  2023-05-15 21:33       ` Arnaldo Carvalho de Melo
  0 siblings, 1 reply; 35+ messages in thread
From: Arnaldo Carvalho de Melo @ 2023-05-15 21:31 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ravi Bangoria, namhyung, eranian, mark.rutland, jolsa, irogers,
	bp, kan.liang, adrian.hunter, maddy, x86, linux-perf-users,
	linux-kernel, sandipan.das, ananth.narayan, santosh.shukla

Em Fri, May 05, 2023 at 11:16:48AM +0200, Peter Zijlstra escreveu:
> On Thu, May 04, 2023 at 04:30:03PM +0530, Ravi Bangoria wrote:
> > IBS pmu can be invoked via fixed set of core pmu events with 'precise_ip'
> > set to 1. Add a simple event open test for all these events.
> > 
> > Without kernel fix:
> >   $ sudo ./perf test -vv 76
> >    76: AMD IBS via core pmu                                      :
> >   --- start ---
> >   test child forked, pid 6553
> >   Using CPUID AuthenticAMD-25-1-1
> >   type: 0x0, config: 0x0, fd: 3  -  Pass
> >   type: 0x0, config: 0x1, fd: -1  -  Pass
> >   type: 0x4, config: 0x76, fd: -1  -  Fail
> >   type: 0x4, config: 0xc1, fd: -1  -  Fail
> >   type: 0x4, config: 0x12, fd: -1  -  Pass
> >   test child finished with -1
> >   ---- end ----
> >   AMD IBS via core pmu: FAILED!
> > 
> > With kernel fix:
> >   $ sudo ./perf test -vv 76
> >    76: AMD IBS via core pmu                                      :
> >   --- start ---
> >   test child forked, pid 7526
> >   Using CPUID AuthenticAMD-25-1-1
> >   type: 0x0, config: 0x0, fd: 3  -  Pass
> >   type: 0x0, config: 0x1, fd: -1  -  Pass
> >   type: 0x4, config: 0x76, fd: 3  -  Pass
> >   type: 0x4, config: 0xc1, fd: 3  -  Pass
> >   type: 0x4, config: 0x12, fd: -1  -  Pass
> >   test child finished with 0
> >   ---- end ----
> >   AMD IBS via core pmu: Ok
> > 
> > Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com>
> > ---
> >  tools/perf/arch/x86/include/arch-tests.h      |  1 +
> >  tools/perf/arch/x86/tests/Build               |  1 +
> >  .../arch/x86/tests/amd-ibs-via-core-pmu.c     | 71 +++++++++++++++++++
> >  tools/perf/arch/x86/tests/arch-tests.c        |  2 +
> >  4 files changed, 75 insertions(+)
> >  create mode 100644 tools/perf/arch/x86/tests/amd-ibs-via-core-pmu.c
> 
> Arnaldo, ok if I merge this along with the other patches or would you
> rather take it?

That would be ok, checking if you merged already...

- Arnaldo

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v4 4/4] perf test: Add selftest to test IBS invocation via core pmu events
  2023-05-15 21:31     ` Arnaldo Carvalho de Melo
@ 2023-05-15 21:33       ` Arnaldo Carvalho de Melo
  0 siblings, 0 replies; 35+ messages in thread
From: Arnaldo Carvalho de Melo @ 2023-05-15 21:33 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ravi Bangoria, namhyung, eranian, mark.rutland, jolsa, irogers,
	bp, kan.liang, adrian.hunter, maddy, x86, linux-perf-users,
	linux-kernel, sandipan.das, ananth.narayan, santosh.shukla

Em Mon, May 15, 2023 at 06:31:18PM -0300, Arnaldo Carvalho de Melo escreveu:
> Em Fri, May 05, 2023 at 11:16:48AM +0200, Peter Zijlstra escreveu:
> > On Thu, May 04, 2023 at 04:30:03PM +0530, Ravi Bangoria wrote:
> > > IBS pmu can be invoked via fixed set of core pmu events with 'precise_ip'
> > > set to 1. Add a simple event open test for all these events.
> > >   AMD IBS via core pmu: Ok
> > > 
> > > Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com>
> > > ---
> > >  tools/perf/arch/x86/include/arch-tests.h      |  1 +
> > >  tools/perf/arch/x86/tests/Build               |  1 +
> > >  .../arch/x86/tests/amd-ibs-via-core-pmu.c     | 71 +++++++++++++++++++
> > >  tools/perf/arch/x86/tests/arch-tests.c        |  2 +
> > >  4 files changed, 75 insertions(+)
> > >  create mode 100644 tools/perf/arch/x86/tests/amd-ibs-via-core-pmu.c
> > 
> > Arnaldo, ok if I merge this along with the other patches or would you
> > rather take it?
> 
> That would be ok, checking if you merged already...

Yeah, its in tip/master, great.

- Arnaldo

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v4 3/4] perf/core: Remove pmu linear searching code
  2023-05-04 11:00 ` [PATCH v4 3/4] perf/core: Remove pmu linear searching code Ravi Bangoria
  2023-05-10 13:33   ` [tip: perf/core] " tip-bot2 for Ravi Bangoria
@ 2023-05-24 21:41   ` Nathan Chancellor
  2023-05-25  5:16     ` Ravi Bangoria
  1 sibling, 1 reply; 35+ messages in thread
From: Nathan Chancellor @ 2023-05-24 21:41 UTC (permalink / raw)
  To: Ravi Bangoria
  Cc: peterz, namhyung, eranian, acme, mark.rutland, jolsa, irogers,
	bp, kan.liang, adrian.hunter, maddy, x86, linux-perf-users,
	linux-kernel, sandipan.das, ananth.narayan, santosh.shukla, maz,
	oliver.upton, kvmarm

Hi Ravi,

+ arm64 KVM folks

On Thu, May 04, 2023 at 04:30:02PM +0530, Ravi Bangoria wrote:
> Searching for the right pmu by iterating over all pmus is no longer
> required since all pmus now *must* be present in the 'pmu_idr' list.
> So, remove linear searching code.
> 
> Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com>
> ---
>  kernel/events/core.c | 37 +++++++++++++------------------------
>  1 file changed, 13 insertions(+), 24 deletions(-)
> 
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 0695bb9fbbb6..eba2b8595115 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -11630,38 +11630,27 @@ static struct pmu *perf_init_event(struct perf_event *event)
>  	}
>  
>  again:
> +	ret = -ENOENT;
>  	rcu_read_lock();
>  	pmu = idr_find(&pmu_idr, type);
>  	rcu_read_unlock();
> -	if (pmu) {
> -		if (event->attr.type != type && type != PERF_TYPE_RAW &&
> -		    !(pmu->capabilities & PERF_PMU_CAP_EXTENDED_HW_TYPE))
> -			goto fail;
> -
> -		ret = perf_try_init_event(pmu, event);
> -		if (ret == -ENOENT && event->attr.type != type && !extended_type) {
> -			type = event->attr.type;
> -			goto again;
> -		}
> +	if (!pmu)
> +		goto fail;
>  
> -		if (ret)
> -			pmu = ERR_PTR(ret);
> +	if (event->attr.type != type && type != PERF_TYPE_RAW &&
> +	    !(pmu->capabilities & PERF_PMU_CAP_EXTENDED_HW_TYPE))
> +		goto fail;
>  
> -		goto unlock;
> +	ret = perf_try_init_event(pmu, event);
> +	if (ret == -ENOENT && event->attr.type != type && !extended_type) {
> +		type = event->attr.type;
> +		goto again;
>  	}
>  
> -	list_for_each_entry_rcu(pmu, &pmus, entry, lockdep_is_held(&pmus_srcu)) {
> -		ret = perf_try_init_event(pmu, event);
> -		if (!ret)
> -			goto unlock;
> -
> -		if (ret != -ENOENT) {
> -			pmu = ERR_PTR(ret);
> -			goto unlock;
> -		}
> -	}
>  fail:
> -	pmu = ERR_PTR(-ENOENT);
> +	if (ret)
> +		pmu = ERR_PTR(ret);
> +
>  unlock:
>  	srcu_read_unlock(&pmus_srcu, idx);
>  
> -- 
> 2.40.0
> 

My apologies if this has already been reported or fixed already, I did a
search of lore.kernel.org and did not find anything. This patch as
commit 9551fbb64d09 ("perf/core: Remove pmu linear searching code") in
-next breaks starting QEMU with KVM enabled on two of my arm64 machines:

$ qemu-system-aarch64 \
    -display none \
    -nodefaults \
    -machine virt,gic-version=max \
    -append 'console=ttyAMA0 earlycon' \
    -kernel arch/arm64/boot/Image.gz \
    -initrd rootfs.cpio \
    -cpu host \
    -enable-kvm \
    -m 512m \
    -smp 8 \
    -serial mon:stdio
qemu-system-aarch64: PMU: KVM_SET_DEVICE_ATTR: No such device
qemu-system-aarch64: failed to set irq for PMU

In the kernel log, I see

[   42.944952] kvm: pmu event creation failed -2

I am not sure if this issue is unexpected as a result of this change or
if there is something that needs to change on the arm64 KVM side (it
appears the kernel message comes from arch/arm64/kvm/pmu-emul.c).

If there is any further information I can provide or patches I can test,
I am more than happy to do so.

Cheers,
Nathan

# bad: [cf09e328589a2ed7f6c8d90f2edb697fb4f8a96b] Add linux-next specific files for 20230524
# good: [27e462c8fad4bf04ec4f81f8539ce6fa947ead3a] Merge tag 'xtensa-20230523' of https://github.com/jcmvbkbc/linux-xtensa
git bisect start 'cf09e328589a2ed7f6c8d90f2edb697fb4f8a96b' '27e462c8fad4bf04ec4f81f8539ce6fa947ead3a'
# good: [a20d8ab9e26daaeeaf971139b736981cf164ab0a] Merge branch 'for-linux-next' of git://anongit.freedesktop.org/drm/drm-misc
git bisect good a20d8ab9e26daaeeaf971139b736981cf164ab0a
# good: [2714032dfd641b22695e14efd5f9dff08a5e3245] Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input.git
git bisect good 2714032dfd641b22695e14efd5f9dff08a5e3245
# bad: [b2bc2854ec87557033538aa9290f70b9141a6653] Merge branch 'for-leds-next' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/leds.git
git bisect bad b2bc2854ec87557033538aa9290f70b9141a6653
# good: [20d4044f23c7724020b6c7d34ccee9bb929d1078] Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi.git
git bisect good 20d4044f23c7724020b6c7d34ccee9bb929d1078
# bad: [c3cab2fce7b318ee2edf148b1436f3a3864ae773] Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace.git
git bisect bad c3cab2fce7b318ee2edf148b1436f3a3864ae773
# bad: [af75e6871092fc1f9fa039132d5667f1e0a47a0a] Merge branch into tip/master: 'sched/core'
git bisect bad af75e6871092fc1f9fa039132d5667f1e0a47a0a
# good: [c50a7b40a2f50403c3f58f1c27a85e4c5d2e0865] Merge branch into tip/master: 'objtool/core'
git bisect good c50a7b40a2f50403c3f58f1c27a85e4c5d2e0865
# good: [519fabc7aaba3f0847cf37d5f9a5740c370eb777] psi: remove 500ms min window size limitation for triggers
git bisect good 519fabc7aaba3f0847cf37d5f9a5740c370eb777
# bad: [b85c6694924e9f09a40a2e0a3798f3945eaa6fda] Merge branch into tip/master: 'perf/core'
git bisect bad b85c6694924e9f09a40a2e0a3798f3945eaa6fda
# bad: [9551fbb64d094cc105964716224adeb7765df8fd] perf/core: Remove pmu linear searching code
git bisect bad 9551fbb64d094cc105964716224adeb7765df8fd
# good: [2fad201fe38ff9a692acedb1990ece2c52a29f95] perf/ibs: Fix interface via core pmu events
git bisect good 2fad201fe38ff9a692acedb1990ece2c52a29f95
# first bad commit: [9551fbb64d094cc105964716224adeb7765df8fd] perf/core: Remove pmu linear searching code

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v4 3/4] perf/core: Remove pmu linear searching code
  2023-05-24 21:41   ` [PATCH v4 3/4] " Nathan Chancellor
@ 2023-05-25  5:16     ` Ravi Bangoria
  2023-05-25  7:11       ` Oliver Upton
  0 siblings, 1 reply; 35+ messages in thread
From: Ravi Bangoria @ 2023-05-25  5:16 UTC (permalink / raw)
  To: Nathan Chancellor
  Cc: peterz, namhyung, eranian, acme, mark.rutland, jolsa, irogers,
	bp, kan.liang, adrian.hunter, maddy, x86, linux-perf-users,
	linux-kernel, sandipan.das, ananth.narayan, santosh.shukla, maz,
	oliver.upton, kvmarm, Ravi Bangoria

Hi Nathan,

On 25-May-23 3:11 AM, Nathan Chancellor wrote:
> Hi Ravi,
> 
> + arm64 KVM folks
> 
> On Thu, May 04, 2023 at 04:30:02PM +0530, Ravi Bangoria wrote:
>> Searching for the right pmu by iterating over all pmus is no longer
>> required since all pmus now *must* be present in the 'pmu_idr' list.
>> So, remove linear searching code.
>>
>> Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com>
>> ---
>>  kernel/events/core.c | 37 +++++++++++++------------------------
>>  1 file changed, 13 insertions(+), 24 deletions(-)
>>
>> diff --git a/kernel/events/core.c b/kernel/events/core.c
>> index 0695bb9fbbb6..eba2b8595115 100644
>> --- a/kernel/events/core.c
>> +++ b/kernel/events/core.c
>> @@ -11630,38 +11630,27 @@ static struct pmu *perf_init_event(struct perf_event *event)
>>  	}
>>  
>>  again:
>> +	ret = -ENOENT;
>>  	rcu_read_lock();
>>  	pmu = idr_find(&pmu_idr, type);
>>  	rcu_read_unlock();
>> -	if (pmu) {
>> -		if (event->attr.type != type && type != PERF_TYPE_RAW &&
>> -		    !(pmu->capabilities & PERF_PMU_CAP_EXTENDED_HW_TYPE))
>> -			goto fail;
>> -
>> -		ret = perf_try_init_event(pmu, event);
>> -		if (ret == -ENOENT && event->attr.type != type && !extended_type) {
>> -			type = event->attr.type;
>> -			goto again;
>> -		}
>> +	if (!pmu)
>> +		goto fail;
>>  
>> -		if (ret)
>> -			pmu = ERR_PTR(ret);
>> +	if (event->attr.type != type && type != PERF_TYPE_RAW &&
>> +	    !(pmu->capabilities & PERF_PMU_CAP_EXTENDED_HW_TYPE))
>> +		goto fail;
>>  
>> -		goto unlock;
>> +	ret = perf_try_init_event(pmu, event);
>> +	if (ret == -ENOENT && event->attr.type != type && !extended_type) {
>> +		type = event->attr.type;
>> +		goto again;
>>  	}
>>  
>> -	list_for_each_entry_rcu(pmu, &pmus, entry, lockdep_is_held(&pmus_srcu)) {
>> -		ret = perf_try_init_event(pmu, event);
>> -		if (!ret)
>> -			goto unlock;
>> -
>> -		if (ret != -ENOENT) {
>> -			pmu = ERR_PTR(ret);
>> -			goto unlock;
>> -		}
>> -	}
>>  fail:
>> -	pmu = ERR_PTR(-ENOENT);
>> +	if (ret)
>> +		pmu = ERR_PTR(ret);
>> +
>>  unlock:
>>  	srcu_read_unlock(&pmus_srcu, idx);
>>  
>> -- 
>> 2.40.0
>>
> 
> My apologies if this has already been reported or fixed already, I did a
> search of lore.kernel.org and did not find anything. This patch as
> commit 9551fbb64d09 ("perf/core: Remove pmu linear searching code") in
> -next breaks starting QEMU with KVM enabled on two of my arm64 machines:
> 
> $ qemu-system-aarch64 \
>     -display none \
>     -nodefaults \
>     -machine virt,gic-version=max \
>     -append 'console=ttyAMA0 earlycon' \
>     -kernel arch/arm64/boot/Image.gz \
>     -initrd rootfs.cpio \
>     -cpu host \
>     -enable-kvm \
>     -m 512m \
>     -smp 8 \
>     -serial mon:stdio
> qemu-system-aarch64: PMU: KVM_SET_DEVICE_ATTR: No such device
> qemu-system-aarch64: failed to set irq for PMU
> 
> In the kernel log, I see
> 
> [   42.944952] kvm: pmu event creation failed -2
> 
> I am not sure if this issue is unexpected as a result of this change or
> if there is something that needs to change on the arm64 KVM side (it
> appears the kernel message comes from arch/arm64/kvm/pmu-emul.c).

Thanks for reporting it.

Based on these detail, I feel the pmu registration failed in the host,
most probably because pmu driver did not pass pmu name while calling
perf_pmu_register(). Consequently kvm also failed while trying to use
it for guest. Can you please check host kernel logs.

I'm sorry but I neither have Arm board to try myself not I'm familiar
with the Arm architecture. So I'll need your help to diagnose it.

Ravi

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v4 3/4] perf/core: Remove pmu linear searching code
  2023-05-25  5:16     ` Ravi Bangoria
@ 2023-05-25  7:11       ` Oliver Upton
  2023-05-25 14:20         ` Peter Zijlstra
  2023-05-25 15:55         ` Nathan Chancellor
  0 siblings, 2 replies; 35+ messages in thread
From: Oliver Upton @ 2023-05-25  7:11 UTC (permalink / raw)
  To: Ravi Bangoria
  Cc: Nathan Chancellor, peterz, namhyung, eranian, acme, mark.rutland,
	jolsa, irogers, bp, kan.liang, adrian.hunter, maddy, x86,
	linux-perf-users, linux-kernel, sandipan.das, ananth.narayan,
	santosh.shukla, maz, kvmarm

On Thu, May 25, 2023 at 10:46:01AM +0530, Ravi Bangoria wrote:
> On 25-May-23 3:11 AM, Nathan Chancellor wrote:
> > My apologies if this has already been reported or fixed already, I did a
> > search of lore.kernel.org and did not find anything. This patch as
> > commit 9551fbb64d09 ("perf/core: Remove pmu linear searching code") in
> > -next breaks starting QEMU with KVM enabled on two of my arm64 machines:
> > 
> > $ qemu-system-aarch64 \
> >     -display none \
> >     -nodefaults \
> >     -machine virt,gic-version=max \
> >     -append 'console=ttyAMA0 earlycon' \
> >     -kernel arch/arm64/boot/Image.gz \
> >     -initrd rootfs.cpio \
> >     -cpu host \
> >     -enable-kvm \
> >     -m 512m \
> >     -smp 8 \
> >     -serial mon:stdio
> > qemu-system-aarch64: PMU: KVM_SET_DEVICE_ATTR: No such device
> > qemu-system-aarch64: failed to set irq for PMU
> > 
> > In the kernel log, I see
> > 
> > [   42.944952] kvm: pmu event creation failed -2
> > 
> > I am not sure if this issue is unexpected as a result of this change or
> > if there is something that needs to change on the arm64 KVM side (it
> > appears the kernel message comes from arch/arm64/kvm/pmu-emul.c).
> 
> Thanks for reporting it.
> 
> Based on these detail, I feel the pmu registration failed in the host,
> most probably because pmu driver did not pass pmu name while calling
> perf_pmu_register(). Consequently kvm also failed while trying to use
> it for guest. Can you please check host kernel logs.

The PMUv3 driver does pass a name, but it relies on getting back an
allocated pmu id as @type is -1 in the call to perf_pmu_register().

What actually broke is how KVM probes for a default core PMU to use for
a guest. kvm_pmu_probe_armpmu() creates a counter w/ PERF_TYPE_RAW and
reads the pmu from the returned perf_event. The linear search had the
effect of eventually stumbling on the correct core PMU and succeeding.

Perf folks: is this WAI for heterogenous systems?

Either way, the whole KVM end of this scheme is a bit clunky, and I
believe it to be unneccessary at this point as we maintain a list of
core PMU instances that KVM is able to virtualize. We can just walk
that to find a default PMU to use.

Not seeing any issues on -next with the below diff. If this works for
folks I can actually wrap it up in a patch and send it out.

diff --git a/arch/arm64/kvm/pmu-emul.c b/arch/arm64/kvm/pmu-emul.c
index 45727d50d18d..cbc0b662b7f8 100644
--- a/arch/arm64/kvm/pmu-emul.c
+++ b/arch/arm64/kvm/pmu-emul.c
@@ -694,47 +694,26 @@ void kvm_host_pmu_init(struct arm_pmu *pmu)
 
 static struct arm_pmu *kvm_pmu_probe_armpmu(void)
 {
-	struct perf_event_attr attr = { };
-	struct perf_event *event;
-	struct arm_pmu *pmu = NULL;
-
-	/*
-	 * Create a dummy event that only counts user cycles. As we'll never
-	 * leave this function with the event being live, it will never
-	 * count anything. But it allows us to probe some of the PMU
-	 * details. Yes, this is terrible.
-	 */
-	attr.type = PERF_TYPE_RAW;
-	attr.size = sizeof(attr);
-	attr.pinned = 1;
-	attr.disabled = 0;
-	attr.exclude_user = 0;
-	attr.exclude_kernel = 1;
-	attr.exclude_hv = 1;
-	attr.exclude_host = 1;
-	attr.config = ARMV8_PMUV3_PERFCTR_CPU_CYCLES;
-	attr.sample_period = GENMASK(63, 0);
+	struct arm_pmu *arm_pmu = NULL, *tmp;
+	struct arm_pmu_entry *entry;
+	int cpu;
 
-	event = perf_event_create_kernel_counter(&attr, -1, current,
-						 kvm_pmu_perf_overflow, &attr);
+	mutex_lock(&arm_pmus_lock);
+	cpu = get_cpu();
 
-	if (IS_ERR(event)) {
-		pr_err_once("kvm: pmu event creation failed %ld\n",
-			    PTR_ERR(event));
-		return NULL;
-	}
+	list_for_each_entry(entry, &arm_pmus, entry) {
+		tmp = entry->arm_pmu;
 
-	if (event->pmu) {
-		pmu = to_arm_pmu(event->pmu);
-		if (pmu->pmuver == ID_AA64DFR0_EL1_PMUVer_NI ||
-		    pmu->pmuver == ID_AA64DFR0_EL1_PMUVer_IMP_DEF)
-			pmu = NULL;
+		if (cpumask_test_cpu(cpu, &tmp->supported_cpus)) {
+			arm_pmu = tmp;
+			break;
+		}
 	}
 
-	perf_event_disable(event);
-	perf_event_release_kernel(event);
+	put_cpu();
+	mutex_unlock(&arm_pmus_lock);
 
-	return pmu;
+	return arm_pmu;
 }
 
 u64 kvm_pmu_get_pmceid(struct kvm_vcpu *vcpu, bool pmceid1)

-- 
Thanks,
Oliver

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH v4 3/4] perf/core: Remove pmu linear searching code
  2023-05-25  7:11       ` Oliver Upton
@ 2023-05-25 14:20         ` Peter Zijlstra
  2023-05-25 15:56           ` Oliver Upton
  2023-05-25 15:55         ` Nathan Chancellor
  1 sibling, 1 reply; 35+ messages in thread
From: Peter Zijlstra @ 2023-05-25 14:20 UTC (permalink / raw)
  To: Oliver Upton
  Cc: Ravi Bangoria, Nathan Chancellor, namhyung, eranian, acme,
	mark.rutland, jolsa, irogers, bp, kan.liang, adrian.hunter,
	maddy, x86, linux-perf-users, linux-kernel, sandipan.das,
	ananth.narayan, santosh.shukla, maz, kvmarm

On Thu, May 25, 2023 at 07:11:41AM +0000, Oliver Upton wrote:

> The PMUv3 driver does pass a name, but it relies on getting back an
> allocated pmu id as @type is -1 in the call to perf_pmu_register().
> 
> What actually broke is how KVM probes for a default core PMU to use for
> a guest. kvm_pmu_probe_armpmu() creates a counter w/ PERF_TYPE_RAW and
> reads the pmu from the returned perf_event. The linear search had the
> effect of eventually stumbling on the correct core PMU and succeeding.
> 
> Perf folks: is this WAI for heterogenous systems?

TBH, I'm not sure. hetero and virt don't mix very well AFAIK and I'm not
sure what ARM64 does here.

IIRC the only way is to hard affine things; that is, force vCPU of
'type' to the pCPU mask of 'type' CPUs.

If you don't do that; or let userspace 'override' that, things go
sideways *real* fast.

Mark gonna have to look at this.

> Either way, the whole KVM end of this scheme is a bit clunky, and I
> believe it to be unneccessary at this point as we maintain a list of
> core PMU instances that KVM is able to virtualize. We can just walk
> that to find a default PMU to use.
> 
> Not seeing any issues on -next with the below diff. If this works for
> folks I can actually wrap it up in a patch and send it out.
> 
> diff --git a/arch/arm64/kvm/pmu-emul.c b/arch/arm64/kvm/pmu-emul.c
> index 45727d50d18d..cbc0b662b7f8 100644
> --- a/arch/arm64/kvm/pmu-emul.c
> +++ b/arch/arm64/kvm/pmu-emul.c
> @@ -694,47 +694,26 @@ void kvm_host_pmu_init(struct arm_pmu *pmu)
>  
>  static struct arm_pmu *kvm_pmu_probe_armpmu(void)
>  {
> -	struct perf_event_attr attr = { };
> -	struct perf_event *event;
> -	struct arm_pmu *pmu = NULL;
> -
> -	/*
> -	 * Create a dummy event that only counts user cycles. As we'll never
> -	 * leave this function with the event being live, it will never
> -	 * count anything. But it allows us to probe some of the PMU
> -	 * details. Yes, this is terrible.
> -	 */
> -	attr.type = PERF_TYPE_RAW;
> -	attr.size = sizeof(attr);
> -	attr.pinned = 1;
> -	attr.disabled = 0;
> -	attr.exclude_user = 0;
> -	attr.exclude_kernel = 1;
> -	attr.exclude_hv = 1;
> -	attr.exclude_host = 1;
> -	attr.config = ARMV8_PMUV3_PERFCTR_CPU_CYCLES;
> -	attr.sample_period = GENMASK(63, 0);
> +	struct arm_pmu *arm_pmu = NULL, *tmp;
> +	struct arm_pmu_entry *entry;
> +	int cpu;
>  
> -	event = perf_event_create_kernel_counter(&attr, -1, current,
> -						 kvm_pmu_perf_overflow, &attr);
> +	mutex_lock(&arm_pmus_lock);
> +	cpu = get_cpu();
>  
> -	if (IS_ERR(event)) {
> -		pr_err_once("kvm: pmu event creation failed %ld\n",
> -			    PTR_ERR(event));
> -		return NULL;
> -	}
> +	list_for_each_entry(entry, &arm_pmus, entry) {
> +		tmp = entry->arm_pmu;
>  
> -	if (event->pmu) {
> -		pmu = to_arm_pmu(event->pmu);
> -		if (pmu->pmuver == ID_AA64DFR0_EL1_PMUVer_NI ||
> -		    pmu->pmuver == ID_AA64DFR0_EL1_PMUVer_IMP_DEF)
> -			pmu = NULL;
> +		if (cpumask_test_cpu(cpu, &tmp->supported_cpus)) {
> +			arm_pmu = tmp;
> +			break;
> +		}
>  	}
>  
> -	perf_event_disable(event);
> -	perf_event_release_kernel(event);
> +	put_cpu();
> +	mutex_unlock(&arm_pmus_lock);
>  
> -	return pmu;
> +	return arm_pmu;
>  }

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v4 3/4] perf/core: Remove pmu linear searching code
  2023-05-25  7:11       ` Oliver Upton
  2023-05-25 14:20         ` Peter Zijlstra
@ 2023-05-25 15:55         ` Nathan Chancellor
  1 sibling, 0 replies; 35+ messages in thread
From: Nathan Chancellor @ 2023-05-25 15:55 UTC (permalink / raw)
  To: Oliver Upton
  Cc: Ravi Bangoria, peterz, namhyung, eranian, acme, mark.rutland,
	jolsa, irogers, bp, kan.liang, adrian.hunter, maddy, x86,
	linux-perf-users, linux-kernel, sandipan.das, ananth.narayan,
	santosh.shukla, maz, kvmarm

On Thu, May 25, 2023 at 07:11:41AM +0000, Oliver Upton wrote:
> On Thu, May 25, 2023 at 10:46:01AM +0530, Ravi Bangoria wrote:
> > On 25-May-23 3:11 AM, Nathan Chancellor wrote:
> > > My apologies if this has already been reported or fixed already, I did a
> > > search of lore.kernel.org and did not find anything. This patch as
> > > commit 9551fbb64d09 ("perf/core: Remove pmu linear searching code") in
> > > -next breaks starting QEMU with KVM enabled on two of my arm64 machines:
> > > 
> > > $ qemu-system-aarch64 \
> > >     -display none \
> > >     -nodefaults \
> > >     -machine virt,gic-version=max \
> > >     -append 'console=ttyAMA0 earlycon' \
> > >     -kernel arch/arm64/boot/Image.gz \
> > >     -initrd rootfs.cpio \
> > >     -cpu host \
> > >     -enable-kvm \
> > >     -m 512m \
> > >     -smp 8 \
> > >     -serial mon:stdio
> > > qemu-system-aarch64: PMU: KVM_SET_DEVICE_ATTR: No such device
> > > qemu-system-aarch64: failed to set irq for PMU
> > > 
> > > In the kernel log, I see
> > > 
> > > [   42.944952] kvm: pmu event creation failed -2
> > > 
> > > I am not sure if this issue is unexpected as a result of this change or
> > > if there is something that needs to change on the arm64 KVM side (it
> > > appears the kernel message comes from arch/arm64/kvm/pmu-emul.c).
> > 
> > Thanks for reporting it.
> > 
> > Based on these detail, I feel the pmu registration failed in the host,
> > most probably because pmu driver did not pass pmu name while calling
> > perf_pmu_register(). Consequently kvm also failed while trying to use
> > it for guest. Can you please check host kernel logs.
> 
> The PMUv3 driver does pass a name, but it relies on getting back an
> allocated pmu id as @type is -1 in the call to perf_pmu_register().
> 
> What actually broke is how KVM probes for a default core PMU to use for
> a guest. kvm_pmu_probe_armpmu() creates a counter w/ PERF_TYPE_RAW and
> reads the pmu from the returned perf_event. The linear search had the
> effect of eventually stumbling on the correct core PMU and succeeding.
> 
> Perf folks: is this WAI for heterogenous systems?
> 
> Either way, the whole KVM end of this scheme is a bit clunky, and I
> believe it to be unneccessary at this point as we maintain a list of
> core PMU instances that KVM is able to virtualize. We can just walk
> that to find a default PMU to use.
> 
> Not seeing any issues on -next with the below diff. If this works for
> folks I can actually wrap it up in a patch and send it out.

I can start QEMU on both the machines that had issues and my machines
continue to run without any visible issues but I have never done any
profile work within them. If there is any further testing or validation
that I should do, I am more than happy to do so. Until then, consider
it:

Tested-by: Nathan Chancellor <nathan@kernel.org>

> diff --git a/arch/arm64/kvm/pmu-emul.c b/arch/arm64/kvm/pmu-emul.c
> index 45727d50d18d..cbc0b662b7f8 100644
> --- a/arch/arm64/kvm/pmu-emul.c
> +++ b/arch/arm64/kvm/pmu-emul.c
> @@ -694,47 +694,26 @@ void kvm_host_pmu_init(struct arm_pmu *pmu)
>  
>  static struct arm_pmu *kvm_pmu_probe_armpmu(void)
>  {
> -	struct perf_event_attr attr = { };
> -	struct perf_event *event;
> -	struct arm_pmu *pmu = NULL;
> -
> -	/*
> -	 * Create a dummy event that only counts user cycles. As we'll never
> -	 * leave this function with the event being live, it will never
> -	 * count anything. But it allows us to probe some of the PMU
> -	 * details. Yes, this is terrible.
> -	 */
> -	attr.type = PERF_TYPE_RAW;
> -	attr.size = sizeof(attr);
> -	attr.pinned = 1;
> -	attr.disabled = 0;
> -	attr.exclude_user = 0;
> -	attr.exclude_kernel = 1;
> -	attr.exclude_hv = 1;
> -	attr.exclude_host = 1;
> -	attr.config = ARMV8_PMUV3_PERFCTR_CPU_CYCLES;
> -	attr.sample_period = GENMASK(63, 0);
> +	struct arm_pmu *arm_pmu = NULL, *tmp;
> +	struct arm_pmu_entry *entry;
> +	int cpu;
>  
> -	event = perf_event_create_kernel_counter(&attr, -1, current,
> -						 kvm_pmu_perf_overflow, &attr);
> +	mutex_lock(&arm_pmus_lock);
> +	cpu = get_cpu();
>  
> -	if (IS_ERR(event)) {
> -		pr_err_once("kvm: pmu event creation failed %ld\n",
> -			    PTR_ERR(event));
> -		return NULL;
> -	}
> +	list_for_each_entry(entry, &arm_pmus, entry) {
> +		tmp = entry->arm_pmu;
>  
> -	if (event->pmu) {
> -		pmu = to_arm_pmu(event->pmu);
> -		if (pmu->pmuver == ID_AA64DFR0_EL1_PMUVer_NI ||
> -		    pmu->pmuver == ID_AA64DFR0_EL1_PMUVer_IMP_DEF)
> -			pmu = NULL;
> +		if (cpumask_test_cpu(cpu, &tmp->supported_cpus)) {
> +			arm_pmu = tmp;
> +			break;
> +		}
>  	}
>  
> -	perf_event_disable(event);
> -	perf_event_release_kernel(event);
> +	put_cpu();
> +	mutex_unlock(&arm_pmus_lock);
>  
> -	return pmu;
> +	return arm_pmu;
>  }
>  
>  u64 kvm_pmu_get_pmceid(struct kvm_vcpu *vcpu, bool pmceid1)
> 
> -- 
> Thanks,
> Oliver

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v4 3/4] perf/core: Remove pmu linear searching code
  2023-05-25 14:20         ` Peter Zijlstra
@ 2023-05-25 15:56           ` Oliver Upton
  2023-05-26 23:00             ` Ian Rogers
  0 siblings, 1 reply; 35+ messages in thread
From: Oliver Upton @ 2023-05-25 15:56 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ravi Bangoria, Nathan Chancellor, namhyung, eranian, acme,
	mark.rutland, jolsa, irogers, bp, kan.liang, adrian.hunter,
	maddy, x86, linux-perf-users, linux-kernel, sandipan.das,
	ananth.narayan, santosh.shukla, maz, kvmarm

On Thu, May 25, 2023 at 04:20:31PM +0200, Peter Zijlstra wrote:
> On Thu, May 25, 2023 at 07:11:41AM +0000, Oliver Upton wrote:
> 
> > The PMUv3 driver does pass a name, but it relies on getting back an
> > allocated pmu id as @type is -1 in the call to perf_pmu_register().
> > 
> > What actually broke is how KVM probes for a default core PMU to use for
> > a guest. kvm_pmu_probe_armpmu() creates a counter w/ PERF_TYPE_RAW and
> > reads the pmu from the returned perf_event. The linear search had the
> > effect of eventually stumbling on the correct core PMU and succeeding.
> > 
> > Perf folks: is this WAI for heterogenous systems?
> 
> TBH, I'm not sure. hetero and virt don't mix very well AFAIK and I'm not
> sure what ARM64 does here.
> 
> IIRC the only way is to hard affine things; that is, force vCPU of
> 'type' to the pCPU mask of 'type' CPUs.

We provide absolutely no illusion of consistency across implementations.
Userspace can select the PMU type, and then it is a userspace problem
affining vCPUs to the right pCPUs.

And if they get that wrong, we just bail and refuse to run the vCPU.

> If you don't do that; or let userspace 'override' that, things go
> sideways *real* fast.

Oh yeah, and I wish PMUs were the only problem with these hetero
systems...

> Mark gonna have to look at this.

Cool. I'll go ahead with the KVM cleanup regardless of the outcome.

--
Thanks,
Oliver

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v4 3/4] perf/core: Remove pmu linear searching code
  2023-05-25 15:56           ` Oliver Upton
@ 2023-05-26 23:00             ` Ian Rogers
  2023-05-27 13:32               ` Marc Zyngier
  0 siblings, 1 reply; 35+ messages in thread
From: Ian Rogers @ 2023-05-26 23:00 UTC (permalink / raw)
  To: Oliver Upton
  Cc: Peter Zijlstra, Ravi Bangoria, Nathan Chancellor, namhyung,
	eranian, acme, mark.rutland, jolsa, bp, kan.liang, adrian.hunter,
	maddy, x86, linux-perf-users, linux-kernel, sandipan.das,
	ananth.narayan, santosh.shukla, maz, kvmarm

On Thu, May 25, 2023 at 8:56 AM Oliver Upton <oliver.upton@linux.dev> wrote:
>
> On Thu, May 25, 2023 at 04:20:31PM +0200, Peter Zijlstra wrote:
> > On Thu, May 25, 2023 at 07:11:41AM +0000, Oliver Upton wrote:
> >
> > > The PMUv3 driver does pass a name, but it relies on getting back an
> > > allocated pmu id as @type is -1 in the call to perf_pmu_register().
> > >
> > > What actually broke is how KVM probes for a default core PMU to use for
> > > a guest. kvm_pmu_probe_armpmu() creates a counter w/ PERF_TYPE_RAW and
> > > reads the pmu from the returned perf_event. The linear search had the
> > > effect of eventually stumbling on the correct core PMU and succeeding.
> > >
> > > Perf folks: is this WAI for heterogenous systems?
> >
> > TBH, I'm not sure. hetero and virt don't mix very well AFAIK and I'm not
> > sure what ARM64 does here.
> >
> > IIRC the only way is to hard affine things; that is, force vCPU of
> > 'type' to the pCPU mask of 'type' CPUs.
>
> We provide absolutely no illusion of consistency across implementations.
> Userspace can select the PMU type, and then it is a userspace problem
> affining vCPUs to the right pCPUs.
>
> And if they get that wrong, we just bail and refuse to run the vCPU.
>
> > If you don't do that; or let userspace 'override' that, things go
> > sideways *real* fast.
>
> Oh yeah, and I wish PMUs were the only problem with these hetero
> systems...

Just to add some context from what I understand. There are inbuilt
type numbers for PMUs:
https://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git/tree/include/uapi/linux/perf_event.h?h=perf-tools-next#n34
so the PMU generally called /sys/devices/cpu should have type 4 (ARM
give it another name). For heterogeneous ARM there is a single PMU and
the same events are programmed regardless of whether it is a big or a
little core - the cpumask lists all CPUs. On heterogeneous (aka
hybrid) Intel there are two PMUs, the performance cores have a PMU
called /sys/devices/cpu_core and it has type 4, the atom cores have a
PMU of /sys/devices/cpu_atom and on my Alderlake the type number is 8.
The cpu_core and cpu_atom PMUs list the CPUs that are valid for raw
style events, where the config values in perf_event_attr contains all
of the event programming data. There are also legacy events of
PERF_TYPE_HARDWARE and PERF_TYPE_HW_CACHE where to specify the PMU the
type is encoded in the high (and unused) 32-bits of config - so the
type would be something like PERF_TYPE_HARDWARE and then config would
be "value | (4 << 32)" for the performance core or "value | (8 << 32)"
for the atom.

If the vCPU and pCPUs mappings vary then there is a chance to change
the CPU mask on heterogeneous Intel, but it seems if the event is open
and you move from between core types then things are going to break.

Thanks,
Ian

> > Mark gonna have to look at this.
>
> Cool. I'll go ahead with the KVM cleanup regardless of the outcome.
>
> --
> Thanks,
> Oliver

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v4 3/4] perf/core: Remove pmu linear searching code
  2023-05-26 23:00             ` Ian Rogers
@ 2023-05-27 13:32               ` Marc Zyngier
  2023-05-27 17:00                 ` Ian Rogers
  0 siblings, 1 reply; 35+ messages in thread
From: Marc Zyngier @ 2023-05-27 13:32 UTC (permalink / raw)
  To: Ian Rogers
  Cc: Oliver Upton, Peter Zijlstra, Ravi Bangoria, Nathan Chancellor,
	namhyung, eranian, acme, mark.rutland, jolsa, bp, kan.liang,
	adrian.hunter, maddy, x86, linux-perf-users, linux-kernel,
	sandipan.das, ananth.narayan, santosh.shukla, kvmarm

On Sat, 27 May 2023 00:00:47 +0100,
Ian Rogers <irogers@google.com> wrote:
> 
> On Thu, May 25, 2023 at 8:56 AM Oliver Upton <oliver.upton@linux.dev> wrote:
> >
> > On Thu, May 25, 2023 at 04:20:31PM +0200, Peter Zijlstra wrote:
> > > On Thu, May 25, 2023 at 07:11:41AM +0000, Oliver Upton wrote:
> > >
> > > > The PMUv3 driver does pass a name, but it relies on getting back an
> > > > allocated pmu id as @type is -1 in the call to perf_pmu_register().
> > > >
> > > > What actually broke is how KVM probes for a default core PMU to use for
> > > > a guest. kvm_pmu_probe_armpmu() creates a counter w/ PERF_TYPE_RAW and
> > > > reads the pmu from the returned perf_event. The linear search had the
> > > > effect of eventually stumbling on the correct core PMU and succeeding.
> > > >
> > > > Perf folks: is this WAI for heterogenous systems?
> > >
> > > TBH, I'm not sure. hetero and virt don't mix very well AFAIK and I'm not
> > > sure what ARM64 does here.
> > >
> > > IIRC the only way is to hard affine things; that is, force vCPU of
> > > 'type' to the pCPU mask of 'type' CPUs.
> >
> > We provide absolutely no illusion of consistency across implementations.
> > Userspace can select the PMU type, and then it is a userspace problem
> > affining vCPUs to the right pCPUs.
> >
> > And if they get that wrong, we just bail and refuse to run the vCPU.
> >
> > > If you don't do that; or let userspace 'override' that, things go
> > > sideways *real* fast.
> >
> > Oh yeah, and I wish PMUs were the only problem with these hetero
> > systems...
> 
> Just to add some context from what I understand. There are inbuilt
> type numbers for PMUs:
> https://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git/tree/include/uapi/linux/perf_event.h?h=perf-tools-next#n34
> so the PMU generally called /sys/devices/cpu should have type 4 (ARM
> give it another name). For heterogeneous ARM there is a single PMU and
> the same events are programmed regardless of whether it is a big or a
> little core - the cpumask lists all CPUs.

I think you misunderstood the way heterogeneous arm64 systems are
described . Each CPU type gets its own PMU type, and its own event
list. Case in point:

$ grep . /sys/devices/*pmu/{type,cpus}
/sys/devices/apple_avalanche_pmu/type:9
/sys/devices/apple_blizzard_pmu/type:8
/sys/devices/apple_avalanche_pmu/cpus:4-9
/sys/devices/apple_blizzard_pmu/cpus:0-3

Type 4 (aka PERF_EVENT_RAW) is AFAICT just a way to encode the raw
event number, nothing else.

Thanks,

	M.

-- 
Without deviation from the norm, progress is not possible.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v4 3/4] perf/core: Remove pmu linear searching code
  2023-05-27 13:32               ` Marc Zyngier
@ 2023-05-27 17:00                 ` Ian Rogers
  2023-05-27 17:05                   ` Ian Rogers
  2023-05-27 18:38                   ` Marc Zyngier
  0 siblings, 2 replies; 35+ messages in thread
From: Ian Rogers @ 2023-05-27 17:00 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: Oliver Upton, Peter Zijlstra, Ravi Bangoria, Nathan Chancellor,
	namhyung, eranian, acme, mark.rutland, jolsa, bp, kan.liang,
	adrian.hunter, maddy, x86, linux-perf-users, linux-kernel,
	sandipan.das, ananth.narayan, santosh.shukla, kvmarm

On Sat, May 27, 2023 at 6:32 AM Marc Zyngier <maz@kernel.org> wrote:
>
> On Sat, 27 May 2023 00:00:47 +0100,
> Ian Rogers <irogers@google.com> wrote:
> >
> > On Thu, May 25, 2023 at 8:56 AM Oliver Upton <oliver.upton@linux.dev> wrote:
> > >
> > > On Thu, May 25, 2023 at 04:20:31PM +0200, Peter Zijlstra wrote:
> > > > On Thu, May 25, 2023 at 07:11:41AM +0000, Oliver Upton wrote:
> > > >
> > > > > The PMUv3 driver does pass a name, but it relies on getting back an
> > > > > allocated pmu id as @type is -1 in the call to perf_pmu_register().
> > > > >
> > > > > What actually broke is how KVM probes for a default core PMU to use for
> > > > > a guest. kvm_pmu_probe_armpmu() creates a counter w/ PERF_TYPE_RAW and
> > > > > reads the pmu from the returned perf_event. The linear search had the
> > > > > effect of eventually stumbling on the correct core PMU and succeeding.
> > > > >
> > > > > Perf folks: is this WAI for heterogenous systems?
> > > >
> > > > TBH, I'm not sure. hetero and virt don't mix very well AFAIK and I'm not
> > > > sure what ARM64 does here.
> > > >
> > > > IIRC the only way is to hard affine things; that is, force vCPU of
> > > > 'type' to the pCPU mask of 'type' CPUs.
> > >
> > > We provide absolutely no illusion of consistency across implementations.
> > > Userspace can select the PMU type, and then it is a userspace problem
> > > affining vCPUs to the right pCPUs.
> > >
> > > And if they get that wrong, we just bail and refuse to run the vCPU.
> > >
> > > > If you don't do that; or let userspace 'override' that, things go
> > > > sideways *real* fast.
> > >
> > > Oh yeah, and I wish PMUs were the only problem with these hetero
> > > systems...
> >
> > Just to add some context from what I understand. There are inbuilt
> > type numbers for PMUs:
> > https://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git/tree/include/uapi/linux/perf_event.h?h=perf-tools-next#n34
> > so the PMU generally called /sys/devices/cpu should have type 4 (ARM
> > give it another name). For heterogeneous ARM there is a single PMU and
> > the same events are programmed regardless of whether it is a big or a
> > little core - the cpumask lists all CPUs.
>
> I think you misunderstood the way heterogeneous arm64 systems are
> described . Each CPU type gets its own PMU type, and its own event
> list. Case in point:
>
> $ grep . /sys/devices/*pmu/{type,cpus}
> /sys/devices/apple_avalanche_pmu/type:9
> /sys/devices/apple_blizzard_pmu/type:8
> /sys/devices/apple_avalanche_pmu/cpus:4-9
> /sys/devices/apple_blizzard_pmu/cpus:0-3
>
> Type 4 (aka PERF_EVENT_RAW) is AFAICT just a way to encode the raw
> event number, nothing else.

Which PMU will a raw event open on? Note, the raw events don't support
the extended type that is present in PERF_TYPE_HARDWARE and
PERF_TYPE_HW_CACHE:
https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/tree/include/uapi/linux/perf_event.h#n41
as the bits are already in use for being just plain config values. I
suspect not being type 4 is a bug on apple ARM here.

Thanks,
Ian

> Thanks,
>
>         M.
>
> --
> Without deviation from the norm, progress is not possible.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v4 3/4] perf/core: Remove pmu linear searching code
  2023-05-27 17:00                 ` Ian Rogers
@ 2023-05-27 17:05                   ` Ian Rogers
  2023-05-27 18:38                   ` Marc Zyngier
  1 sibling, 0 replies; 35+ messages in thread
From: Ian Rogers @ 2023-05-27 17:05 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: Oliver Upton, Peter Zijlstra, Ravi Bangoria, Nathan Chancellor,
	namhyung, eranian, acme, mark.rutland, jolsa, bp, kan.liang,
	adrian.hunter, maddy, x86, linux-perf-users, linux-kernel,
	sandipan.das, ananth.narayan, santosh.shukla, kvmarm

On Sat, May 27, 2023 at 10:00 AM Ian Rogers <irogers@google.com> wrote:
>
> On Sat, May 27, 2023 at 6:32 AM Marc Zyngier <maz@kernel.org> wrote:
> >
> > On Sat, 27 May 2023 00:00:47 +0100,
> > Ian Rogers <irogers@google.com> wrote:
> > >
> > > On Thu, May 25, 2023 at 8:56 AM Oliver Upton <oliver.upton@linux.dev> wrote:
> > > >
> > > > On Thu, May 25, 2023 at 04:20:31PM +0200, Peter Zijlstra wrote:
> > > > > On Thu, May 25, 2023 at 07:11:41AM +0000, Oliver Upton wrote:
> > > > >
> > > > > > The PMUv3 driver does pass a name, but it relies on getting back an
> > > > > > allocated pmu id as @type is -1 in the call to perf_pmu_register().
> > > > > >
> > > > > > What actually broke is how KVM probes for a default core PMU to use for
> > > > > > a guest. kvm_pmu_probe_armpmu() creates a counter w/ PERF_TYPE_RAW and
> > > > > > reads the pmu from the returned perf_event. The linear search had the
> > > > > > effect of eventually stumbling on the correct core PMU and succeeding.
> > > > > >
> > > > > > Perf folks: is this WAI for heterogenous systems?
> > > > >
> > > > > TBH, I'm not sure. hetero and virt don't mix very well AFAIK and I'm not
> > > > > sure what ARM64 does here.
> > > > >
> > > > > IIRC the only way is to hard affine things; that is, force vCPU of
> > > > > 'type' to the pCPU mask of 'type' CPUs.
> > > >
> > > > We provide absolutely no illusion of consistency across implementations.
> > > > Userspace can select the PMU type, and then it is a userspace problem
> > > > affining vCPUs to the right pCPUs.
> > > >
> > > > And if they get that wrong, we just bail and refuse to run the vCPU.
> > > >
> > > > > If you don't do that; or let userspace 'override' that, things go
> > > > > sideways *real* fast.
> > > >
> > > > Oh yeah, and I wish PMUs were the only problem with these hetero
> > > > systems...
> > >
> > > Just to add some context from what I understand. There are inbuilt
> > > type numbers for PMUs:
> > > https://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git/tree/include/uapi/linux/perf_event.h?h=perf-tools-next#n34
> > > so the PMU generally called /sys/devices/cpu should have type 4 (ARM
> > > give it another name). For heterogeneous ARM there is a single PMU and
> > > the same events are programmed regardless of whether it is a big or a
> > > little core - the cpumask lists all CPUs.
> >
> > I think you misunderstood the way heterogeneous arm64 systems are
> > described . Each CPU type gets its own PMU type, and its own event
> > list. Case in point:
> >
> > $ grep . /sys/devices/*pmu/{type,cpus}
> > /sys/devices/apple_avalanche_pmu/type:9
> > /sys/devices/apple_blizzard_pmu/type:8
> > /sys/devices/apple_avalanche_pmu/cpus:4-9
> > /sys/devices/apple_blizzard_pmu/cpus:0-3
> >
> > Type 4 (aka PERF_EVENT_RAW) is AFAICT just a way to encode the raw
> > event number, nothing else.
>
> Which PMU will a raw event open on? Note, the raw events don't support
> the extended type that is present in PERF_TYPE_HARDWARE and
> PERF_TYPE_HW_CACHE:
> https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/tree/include/uapi/linux/perf_event.h#n41
> as the bits are already in use for being just plain config values. I
> suspect not being type 4 is a bug on apple ARM here.
>
> Thanks,
> Ian

Btw, thanks for correcting me! :-D These assumptions are under
documented/tested and nobody wants to make interfaces/tools that are
broken. One source of information is:
https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/tree/Documentation/admin-guide/perf
but it doesn't capture this.

Thanks,
Ian

> > Thanks,
> >
> >         M.
> >
> > --
> > Without deviation from the norm, progress is not possible.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v4 3/4] perf/core: Remove pmu linear searching code
  2023-05-27 17:00                 ` Ian Rogers
  2023-05-27 17:05                   ` Ian Rogers
@ 2023-05-27 18:38                   ` Marc Zyngier
  2023-05-27 19:50                     ` Ian Rogers
  2023-05-30  7:45                     ` Thomas Richter
  1 sibling, 2 replies; 35+ messages in thread
From: Marc Zyngier @ 2023-05-27 18:38 UTC (permalink / raw)
  To: Ian Rogers
  Cc: Oliver Upton, Peter Zijlstra, Ravi Bangoria, Nathan Chancellor,
	namhyung, eranian, acme, mark.rutland, jolsa, bp, kan.liang,
	adrian.hunter, maddy, x86, linux-perf-users, linux-kernel,
	sandipan.das, ananth.narayan, santosh.shukla, kvmarm

On Sat, 27 May 2023 18:00:13 +0100,
Ian Rogers <irogers@google.com> wrote:
> 
> On Sat, May 27, 2023 at 6:32 AM Marc Zyngier <maz@kernel.org> wrote:
> >
> > On Sat, 27 May 2023 00:00:47 +0100,
> > Ian Rogers <irogers@google.com> wrote:
> > >
> > > On Thu, May 25, 2023 at 8:56 AM Oliver Upton <oliver.upton@linux.dev> wrote:
> > > >
> > > > On Thu, May 25, 2023 at 04:20:31PM +0200, Peter Zijlstra wrote:
> > > > > On Thu, May 25, 2023 at 07:11:41AM +0000, Oliver Upton wrote:
> > > > >
> > > > > > The PMUv3 driver does pass a name, but it relies on getting back an
> > > > > > allocated pmu id as @type is -1 in the call to perf_pmu_register().
> > > > > >
> > > > > > What actually broke is how KVM probes for a default core PMU to use for
> > > > > > a guest. kvm_pmu_probe_armpmu() creates a counter w/ PERF_TYPE_RAW and
> > > > > > reads the pmu from the returned perf_event. The linear search had the
> > > > > > effect of eventually stumbling on the correct core PMU and succeeding.
> > > > > >
> > > > > > Perf folks: is this WAI for heterogenous systems?
> > > > >
> > > > > TBH, I'm not sure. hetero and virt don't mix very well AFAIK and I'm not
> > > > > sure what ARM64 does here.
> > > > >
> > > > > IIRC the only way is to hard affine things; that is, force vCPU of
> > > > > 'type' to the pCPU mask of 'type' CPUs.
> > > >
> > > > We provide absolutely no illusion of consistency across implementations.
> > > > Userspace can select the PMU type, and then it is a userspace problem
> > > > affining vCPUs to the right pCPUs.
> > > >
> > > > And if they get that wrong, we just bail and refuse to run the vCPU.
> > > >
> > > > > If you don't do that; or let userspace 'override' that, things go
> > > > > sideways *real* fast.
> > > >
> > > > Oh yeah, and I wish PMUs were the only problem with these hetero
> > > > systems...
> > >
> > > Just to add some context from what I understand. There are inbuilt
> > > type numbers for PMUs:
> > > https://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git/tree/include/uapi/linux/perf_event.h?h=perf-tools-next#n34
> > > so the PMU generally called /sys/devices/cpu should have type 4 (ARM
> > > give it another name). For heterogeneous ARM there is a single PMU and
> > > the same events are programmed regardless of whether it is a big or a
> > > little core - the cpumask lists all CPUs.
> >
> > I think you misunderstood the way heterogeneous arm64 systems are
> > described . Each CPU type gets its own PMU type, and its own event
> > list. Case in point:
> >
> > $ grep . /sys/devices/*pmu/{type,cpus}
> > /sys/devices/apple_avalanche_pmu/type:9
> > /sys/devices/apple_blizzard_pmu/type:8
> > /sys/devices/apple_avalanche_pmu/cpus:4-9
> > /sys/devices/apple_blizzard_pmu/cpus:0-3
> >
> > Type 4 (aka PERF_EVENT_RAW) is AFAICT just a way to encode the raw
> > event number, nothing else.
> 
> Which PMU will a raw event open on?

On the PMU that matches the current CPU.

> Note, the raw events don't support
> the extended type that is present in PERF_TYPE_HARDWARE and
> PERF_TYPE_HW_CACHE:
> https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/tree/include/uapi/linux/perf_event.h#n41
> as the bits are already in use for being just plain config values.

I'm not sure how relevant this is to the numbering of PMUs on arm64.

> I suspect not being type 4 is a bug on apple ARM here.

If that's a bug on this machine, it's a bug on all machines, at which
point it is the de-facto API:

$ grep . /sys/devices/armv8*/{type,cpus}
/sys/devices/armv8_cortex_a53/type:8
/sys/devices/armv8_cortex_a72/type:9
/sys/devices/armv8_cortex_a53/cpus:0-3
/sys/devices/armv8_cortex_a72/cpus:4-5

See, non-Apple HW. And now for a system with homogeneous CPUs:

$ grep . /sys/devices/armv8*/{type,cpus}
/sys/devices/armv8_pmuv3_0/type:8
/sys/devices/armv8_pmuv3_0/cpus:0-159

Still no type 4. I could go on for hours, I have plenty of HW around
me!

So whatever your source of information is, it doesn't match reality.
Our PMUs are numbered arbitrarily, and have been so for... a very long
time. At least since perf_pmu_register has supported dynamic
registration (see 2e80a82a49c4c).

Thanks,

	M.

-- 
Without deviation from the norm, progress is not possible.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v4 3/4] perf/core: Remove pmu linear searching code
  2023-05-27 18:38                   ` Marc Zyngier
@ 2023-05-27 19:50                     ` Ian Rogers
  2023-05-30  7:45                     ` Thomas Richter
  1 sibling, 0 replies; 35+ messages in thread
From: Ian Rogers @ 2023-05-27 19:50 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: Oliver Upton, Peter Zijlstra, Ravi Bangoria, Nathan Chancellor,
	namhyung, eranian, acme, mark.rutland, jolsa, bp, kan.liang,
	adrian.hunter, maddy, x86, linux-perf-users, linux-kernel,
	sandipan.das, ananth.narayan, santosh.shukla, kvmarm

On Sat, May 27, 2023 at 11:38 AM Marc Zyngier <maz@kernel.org> wrote:
>
> On Sat, 27 May 2023 18:00:13 +0100,
> Ian Rogers <irogers@google.com> wrote:
> >
> > On Sat, May 27, 2023 at 6:32 AM Marc Zyngier <maz@kernel.org> wrote:
> > >
> > > On Sat, 27 May 2023 00:00:47 +0100,
> > > Ian Rogers <irogers@google.com> wrote:
> > > >
> > > > On Thu, May 25, 2023 at 8:56 AM Oliver Upton <oliver.upton@linux.dev> wrote:
> > > > >
> > > > > On Thu, May 25, 2023 at 04:20:31PM +0200, Peter Zijlstra wrote:
> > > > > > On Thu, May 25, 2023 at 07:11:41AM +0000, Oliver Upton wrote:
> > > > > >
> > > > > > > The PMUv3 driver does pass a name, but it relies on getting back an
> > > > > > > allocated pmu id as @type is -1 in the call to perf_pmu_register().
> > > > > > >
> > > > > > > What actually broke is how KVM probes for a default core PMU to use for
> > > > > > > a guest. kvm_pmu_probe_armpmu() creates a counter w/ PERF_TYPE_RAW and
> > > > > > > reads the pmu from the returned perf_event. The linear search had the
> > > > > > > effect of eventually stumbling on the correct core PMU and succeeding.
> > > > > > >
> > > > > > > Perf folks: is this WAI for heterogenous systems?
> > > > > >
> > > > > > TBH, I'm not sure. hetero and virt don't mix very well AFAIK and I'm not
> > > > > > sure what ARM64 does here.
> > > > > >
> > > > > > IIRC the only way is to hard affine things; that is, force vCPU of
> > > > > > 'type' to the pCPU mask of 'type' CPUs.
> > > > >
> > > > > We provide absolutely no illusion of consistency across implementations.
> > > > > Userspace can select the PMU type, and then it is a userspace problem
> > > > > affining vCPUs to the right pCPUs.
> > > > >
> > > > > And if they get that wrong, we just bail and refuse to run the vCPU.
> > > > >
> > > > > > If you don't do that; or let userspace 'override' that, things go
> > > > > > sideways *real* fast.
> > > > >
> > > > > Oh yeah, and I wish PMUs were the only problem with these hetero
> > > > > systems...
> > > >
> > > > Just to add some context from what I understand. There are inbuilt
> > > > type numbers for PMUs:
> > > > https://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git/tree/include/uapi/linux/perf_event.h?h=perf-tools-next#n34
> > > > so the PMU generally called /sys/devices/cpu should have type 4 (ARM
> > > > give it another name). For heterogeneous ARM there is a single PMU and
> > > > the same events are programmed regardless of whether it is a big or a
> > > > little core - the cpumask lists all CPUs.
> > >
> > > I think you misunderstood the way heterogeneous arm64 systems are
> > > described . Each CPU type gets its own PMU type, and its own event
> > > list. Case in point:
> > >
> > > $ grep . /sys/devices/*pmu/{type,cpus}
> > > /sys/devices/apple_avalanche_pmu/type:9
> > > /sys/devices/apple_blizzard_pmu/type:8
> > > /sys/devices/apple_avalanche_pmu/cpus:4-9
> > > /sys/devices/apple_blizzard_pmu/cpus:0-3
> > >
> > > Type 4 (aka PERF_EVENT_RAW) is AFAICT just a way to encode the raw
> > > event number, nothing else.
> >
> > Which PMU will a raw event open on?
>
> On the PMU that matches the current CPU.

Thanks. This should really be captured in the man page. Presumably
that's the behavior with the -1 any CPU, and if a CPU is specified
then the selected PMU will match that CPU?

> > Note, the raw events don't support
> > the extended type that is present in PERF_TYPE_HARDWARE and
> > PERF_TYPE_HW_CACHE:
> > https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/tree/include/uapi/linux/perf_event.h#n41
> > as the bits are already in use for being just plain config values.
>
> I'm not sure how relevant this is to the numbering of PMUs on arm64.

It matters because on the tool side you can group events, for example
if you are measuring IPC then it makes sense that the instructions and
cycles events are both multiplexed as part of a group. The
instructions and cycles events have a type of PERF_TYPE_HARDWARE as
this is hard coded into the event parser in perf. For perf we may
specify this as:
perf stat -e '{cycles,instructions}' ...
When we open the events on a heterogeneous system the expectation is
we get a group of cycles and instructions for each PMU, the extended
type in the attribute for each event will be set to the type of the
PMU the group is targeting. When we parse the events we have cycles
events per PMU then instructions events per PMU, we then resort and
regroup the events as you can't have a group spanning PMUs except in a
few special cases like software events.

Now let's say we want to do a raw event from the json files, on Intel
let's say we choose the INST_RETIRED.ANY event and we want to measure
it with cycles:
perf stat -e '{cycles,inst_retired.any}' ...
The INST_RETIRED.ANY event will be matched on Intel with the PMUs
"cpu_core" and "cpu_atom" on heterogeneous systems. The cycles event
will be opened on two PMUs and the extended type set to each one. So
we have differing perf_event_attr types, but we know it is valid to
group the events because the raw event encoding's PMU type can be
matched to the extended type of the hardware event.

To better support the wildcard opening, sorting and grouping
operations of events I want to have an ability in the perf tool to map
a type number to a list of PMUs. Mapping a hardware/hw_cache type
number to PMUs should give the set of core PMUs for a heterogeneous
system. On heterogeneous Intel mapping a raw event (type 4) will give
"cpu_core" but it sounds like on ARM, if no sysfs PMU has type number
4, then we should map 4 to the set of core PMUs.

Mapping a type number to a set of perf tool's representation of PMUs
isn't current behavior, but hopefully you can see that ARM's behavior
is differing from Intel's, for which I'd been basing the
implementation so that we can support heterogeneous metrics, etc.

I'm also confused what should be the behavior of something like:
perf stat -e r0 ...
The code is going to open the event on every core PMU with the config set to 0:
https://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git/tree/tools/perf/util/parse-events.c?h=perf-tools-next#n1485
But perhaps the intent on ARM is that a single raw type event is
opened? It seems the every core PMU behavior is more useful.

> > I suspect not being type 4 is a bug on apple ARM here.
>
> If that's a bug on this machine, it's a bug on all machines, at which
> point it is the de-facto API:
>
> $ grep . /sys/devices/armv8*/{type,cpus}
> /sys/devices/armv8_cortex_a53/type:8
> /sys/devices/armv8_cortex_a72/type:9
> /sys/devices/armv8_cortex_a53/cpus:0-3
> /sys/devices/armv8_cortex_a72/cpus:4-5
>
> See, non-Apple HW. And now for a system with homogeneous CPUs:
>
> $ grep . /sys/devices/armv8*/{type,cpus}
> /sys/devices/armv8_pmuv3_0/type:8
> /sys/devices/armv8_pmuv3_0/cpus:0-159
>
> Still no type 4. I could go on for hours, I have plenty of HW around
> me!

This is good to know, thanks for doing the research. I agree that it
is a de-facto API as this has happened. I'm hoping in the future we
can compress some /sys/devices directories and use them as test input.
It obviously won't be possible to open the events, etc. but we can
test things like the event/metric parsing matches an expectation.

Thanks,
Ian

> So whatever your source of information is, it doesn't match reality.
> Our PMUs are numbered arbitrarily, and have been so for... a very long
> time. At least since perf_pmu_register has supported dynamic
> registration (see 2e80a82a49c4c).
>
> Thanks,
>
>         M.
>
> --
> Without deviation from the norm, progress is not possible.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v4 3/4] perf/core: Remove pmu linear searching code
  2023-05-27 18:38                   ` Marc Zyngier
  2023-05-27 19:50                     ` Ian Rogers
@ 2023-05-30  7:45                     ` Thomas Richter
  2023-05-30 14:00                       ` Ian Rogers
  1 sibling, 1 reply; 35+ messages in thread
From: Thomas Richter @ 2023-05-30  7:45 UTC (permalink / raw)
  To: Marc Zyngier, Ian Rogers
  Cc: Oliver Upton, Peter Zijlstra, Ravi Bangoria, Nathan Chancellor,
	namhyung, eranian, acme, mark.rutland, jolsa, bp, kan.liang,
	adrian.hunter, maddy, x86, linux-perf-users, linux-kernel,
	sandipan.das, ananth.narayan, santosh.shukla, kvmarm

On 5/27/23 20:38, Marc Zyngier wrote:
> On Sat, 27 May 2023 18:00:13 +0100,
> Ian Rogers <irogers@google.com> wrote:
>>
>> On Sat, May 27, 2023 at 6:32 AM Marc Zyngier <maz@kernel.org> wrote:
>>>
>>> On Sat, 27 May 2023 00:00:47 +0100,
>>> Ian Rogers <irogers@google.com> wrote:
>>>>
>>>> On Thu, May 25, 2023 at 8:56 AM Oliver Upton <oliver.upton@linux.dev> wrote:
>>>>>
>>>>> On Thu, May 25, 2023 at 04:20:31PM +0200, Peter Zijlstra wrote:
>>>>>> On Thu, May 25, 2023 at 07:11:41AM +0000, Oliver Upton wrote:
>>>>>>
>>>>>>> The PMUv3 driver does pass a name, but it relies on getting back an
>>>>>>> allocated pmu id as @type is -1 in the call to perf_pmu_register().
>>>>>>>
>>>>>>> What actually broke is how KVM probes for a default core PMU to use for
>>>>>>> a guest. kvm_pmu_probe_armpmu() creates a counter w/ PERF_TYPE_RAW and
>>>>>>> reads the pmu from the returned perf_event. The linear search had the
>>>>>>> effect of eventually stumbling on the correct core PMU and succeeding.
>>>>>>>
>>>>>>> Perf folks: is this WAI for heterogenous systems?
>>>>>>
>>>>>> TBH, I'm not sure. hetero and virt don't mix very well AFAIK and I'm not
>>>>>> sure what ARM64 does here.
>>>>>>
>>>>>> IIRC the only way is to hard affine things; that is, force vCPU of
>>>>>> 'type' to the pCPU mask of 'type' CPUs.
>>>>>
>>>>> We provide absolutely no illusion of consistency across implementations.
>>>>> Userspace can select the PMU type, and then it is a userspace problem
>>>>> affining vCPUs to the right pCPUs.
>>>>>
>>>>> And if they get that wrong, we just bail and refuse to run the vCPU.
>>>>>
>>>>>> If you don't do that; or let userspace 'override' that, things go
>>>>>> sideways *real* fast.
>>>>>
>>>>> Oh yeah, and I wish PMUs were the only problem with these hetero
>>>>> systems...
>>>>
>>>> Just to add some context from what I understand. There are inbuilt
>>>> type numbers for PMUs:
>>>> https://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git/tree/include/uapi/linux/perf_event.h?h=perf-tools-next#n34
>>>> so the PMU generally called /sys/devices/cpu should have type 4 (ARM
>>>> give it another name). For heterogeneous ARM there is a single PMU and
>>>> the same events are programmed regardless of whether it is a big or a
>>>> little core - the cpumask lists all CPUs.
>>>
>>> I think you misunderstood the way heterogeneous arm64 systems are
>>> described . Each CPU type gets its own PMU type, and its own event
>>> list. Case in point:
>>>
>>> $ grep . /sys/devices/*pmu/{type,cpus}
>>> /sys/devices/apple_avalanche_pmu/type:9
>>> /sys/devices/apple_blizzard_pmu/type:8
>>> /sys/devices/apple_avalanche_pmu/cpus:4-9
>>> /sys/devices/apple_blizzard_pmu/cpus:0-3
>>>
>>> Type 4 (aka PERF_EVENT_RAW) is AFAICT just a way to encode the raw
>>> event number, nothing else.
>>
>> Which PMU will a raw event open on?
> 
> On the PMU that matches the current CPU.
> 
>> Note, the raw events don't support
>> the extended type that is present in PERF_TYPE_HARDWARE and
>> PERF_TYPE_HW_CACHE:
>> https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/tree/include/uapi/linux/perf_event.h#n41
>> as the bits are already in use for being just plain config values.
> 
> I'm not sure how relevant this is to the numbering of PMUs on arm64.
> 
>> I suspect not being type 4 is a bug on apple ARM here.
> 
> If that's a bug on this machine, it's a bug on all machines, at which
> point it is the de-facto API:
> 
> $ grep . /sys/devices/armv8*/{type,cpus}
> /sys/devices/armv8_cortex_a53/type:8
> /sys/devices/armv8_cortex_a72/type:9
> /sys/devices/armv8_cortex_a53/cpus:0-3
> /sys/devices/armv8_cortex_a72/cpus:4-5
> 
> See, non-Apple HW. And now for a system with homogeneous CPUs:
> 
> $ grep . /sys/devices/armv8*/{type,cpus}
> /sys/devices/armv8_pmuv3_0/type:8
> /sys/devices/armv8_pmuv3_0/cpus:0-159
> 
> Still no type 4. I could go on for hours, I have plenty of HW around
> me!
> 
> So whatever your source of information is, it doesn't match reality.
> Our PMUs are numbered arbitrarily, and have been so for... a very long
> time. At least since perf_pmu_register has supported dynamic
> registration (see 2e80a82a49c4c).
> 
> Thanks,
> 
> 	M.
> 


I agree with Marc,
on s390 we have 5 different PMUs and all have arbitrary numbers
and have totally different features:

# ll /sys/devices/{cpum,pai}*/type
-r--r--r-- 1 root root 4096 May 30 09:33 /sys/devices/cpum_cf_diag/type
-r--r--r-- 1 root root 4096 May 30 09:33 /sys/devices/cpum_cf/type
-r--r--r-- 1 root root 4096 May 30 09:33 /sys/devices/cpum_sf/type
-r--r--r-- 1 root root 4096 May 30 09:33 /sys/devices/pai_crypto/type
-r--r--r-- 1 root root 4096 May 30 09:33 /sys/devices/pai_ext/type
# grep . /sys/devices/{cpum,pai}*/type
/sys/devices/cpum_cf_diag/type:9
/sys/devices/cpum_cf/type:8
/sys/devices/cpum_sf/type:4
/sys/devices/pai_crypto/type:10
/sys/devices/pai_ext/type:11
# 

Thanks Thomas
-- 
Thomas Richter, Dept 3303, IBM s390 Linux Development, Boeblingen, Germany
--
Vorsitzender des Aufsichtsrats: Gregor Pillen
Geschäftsführung: David Faller
Sitz der Gesellschaft: Böblingen / Registergericht: Amtsgericht Stuttgart, HRB 243294


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v4 3/4] perf/core: Remove pmu linear searching code
  2023-05-30  7:45                     ` Thomas Richter
@ 2023-05-30 14:00                       ` Ian Rogers
  2023-05-31  9:09                         ` Thomas Richter
  0 siblings, 1 reply; 35+ messages in thread
From: Ian Rogers @ 2023-05-30 14:00 UTC (permalink / raw)
  To: Thomas Richter
  Cc: Marc Zyngier, Oliver Upton, Peter Zijlstra, Ravi Bangoria,
	Nathan Chancellor, namhyung, eranian, acme, mark.rutland, jolsa,
	bp, kan.liang, adrian.hunter, maddy, x86, linux-perf-users,
	linux-kernel, sandipan.das, ananth.narayan, santosh.shukla,
	kvmarm

On Tue, May 30, 2023 at 12:45 AM Thomas Richter <tmricht@linux.ibm.com> wrote:
>
> On 5/27/23 20:38, Marc Zyngier wrote:
> > On Sat, 27 May 2023 18:00:13 +0100,
> > Ian Rogers <irogers@google.com> wrote:
> >>
> >> On Sat, May 27, 2023 at 6:32 AM Marc Zyngier <maz@kernel.org> wrote:
> >>>
> >>> On Sat, 27 May 2023 00:00:47 +0100,
> >>> Ian Rogers <irogers@google.com> wrote:
> >>>>
> >>>> On Thu, May 25, 2023 at 8:56 AM Oliver Upton <oliver.upton@linux.dev> wrote:
> >>>>>
> >>>>> On Thu, May 25, 2023 at 04:20:31PM +0200, Peter Zijlstra wrote:
> >>>>>> On Thu, May 25, 2023 at 07:11:41AM +0000, Oliver Upton wrote:
> >>>>>>
> >>>>>>> The PMUv3 driver does pass a name, but it relies on getting back an
> >>>>>>> allocated pmu id as @type is -1 in the call to perf_pmu_register().
> >>>>>>>
> >>>>>>> What actually broke is how KVM probes for a default core PMU to use for
> >>>>>>> a guest. kvm_pmu_probe_armpmu() creates a counter w/ PERF_TYPE_RAW and
> >>>>>>> reads the pmu from the returned perf_event. The linear search had the
> >>>>>>> effect of eventually stumbling on the correct core PMU and succeeding.
> >>>>>>>
> >>>>>>> Perf folks: is this WAI for heterogenous systems?
> >>>>>>
> >>>>>> TBH, I'm not sure. hetero and virt don't mix very well AFAIK and I'm not
> >>>>>> sure what ARM64 does here.
> >>>>>>
> >>>>>> IIRC the only way is to hard affine things; that is, force vCPU of
> >>>>>> 'type' to the pCPU mask of 'type' CPUs.
> >>>>>
> >>>>> We provide absolutely no illusion of consistency across implementations.
> >>>>> Userspace can select the PMU type, and then it is a userspace problem
> >>>>> affining vCPUs to the right pCPUs.
> >>>>>
> >>>>> And if they get that wrong, we just bail and refuse to run the vCPU.
> >>>>>
> >>>>>> If you don't do that; or let userspace 'override' that, things go
> >>>>>> sideways *real* fast.
> >>>>>
> >>>>> Oh yeah, and I wish PMUs were the only problem with these hetero
> >>>>> systems...
> >>>>
> >>>> Just to add some context from what I understand. There are inbuilt
> >>>> type numbers for PMUs:
> >>>> https://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git/tree/include/uapi/linux/perf_event.h?h=perf-tools-next#n34
> >>>> so the PMU generally called /sys/devices/cpu should have type 4 (ARM
> >>>> give it another name). For heterogeneous ARM there is a single PMU and
> >>>> the same events are programmed regardless of whether it is a big or a
> >>>> little core - the cpumask lists all CPUs.
> >>>
> >>> I think you misunderstood the way heterogeneous arm64 systems are
> >>> described . Each CPU type gets its own PMU type, and its own event
> >>> list. Case in point:
> >>>
> >>> $ grep . /sys/devices/*pmu/{type,cpus}
> >>> /sys/devices/apple_avalanche_pmu/type:9
> >>> /sys/devices/apple_blizzard_pmu/type:8
> >>> /sys/devices/apple_avalanche_pmu/cpus:4-9
> >>> /sys/devices/apple_blizzard_pmu/cpus:0-3
> >>>
> >>> Type 4 (aka PERF_EVENT_RAW) is AFAICT just a way to encode the raw
> >>> event number, nothing else.
> >>
> >> Which PMU will a raw event open on?
> >
> > On the PMU that matches the current CPU.
> >
> >> Note, the raw events don't support
> >> the extended type that is present in PERF_TYPE_HARDWARE and
> >> PERF_TYPE_HW_CACHE:
> >> https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/tree/include/uapi/linux/perf_event.h#n41
> >> as the bits are already in use for being just plain config values.
> >
> > I'm not sure how relevant this is to the numbering of PMUs on arm64.
> >
> >> I suspect not being type 4 is a bug on apple ARM here.
> >
> > If that's a bug on this machine, it's a bug on all machines, at which
> > point it is the de-facto API:
> >
> > $ grep . /sys/devices/armv8*/{type,cpus}
> > /sys/devices/armv8_cortex_a53/type:8
> > /sys/devices/armv8_cortex_a72/type:9
> > /sys/devices/armv8_cortex_a53/cpus:0-3
> > /sys/devices/armv8_cortex_a72/cpus:4-5
> >
> > See, non-Apple HW. And now for a system with homogeneous CPUs:
> >
> > $ grep . /sys/devices/armv8*/{type,cpus}
> > /sys/devices/armv8_pmuv3_0/type:8
> > /sys/devices/armv8_pmuv3_0/cpus:0-159
> >
> > Still no type 4. I could go on for hours, I have plenty of HW around
> > me!
> >
> > So whatever your source of information is, it doesn't match reality.
> > Our PMUs are numbered arbitrarily, and have been so for... a very long
> > time. At least since perf_pmu_register has supported dynamic
> > registration (see 2e80a82a49c4c).
> >
> > Thanks,
> >
> >       M.
> >
>
>
> I agree with Marc,
> on s390 we have 5 different PMUs and all have arbitrary numbers
> and have totally different features:
>
> # ll /sys/devices/{cpum,pai}*/type
> -r--r--r-- 1 root root 4096 May 30 09:33 /sys/devices/cpum_cf_diag/type
> -r--r--r-- 1 root root 4096 May 30 09:33 /sys/devices/cpum_cf/type
> -r--r--r-- 1 root root 4096 May 30 09:33 /sys/devices/cpum_sf/type
> -r--r--r-- 1 root root 4096 May 30 09:33 /sys/devices/pai_crypto/type
> -r--r--r-- 1 root root 4096 May 30 09:33 /sys/devices/pai_ext/type
> # grep . /sys/devices/{cpum,pai}*/type
> /sys/devices/cpum_cf_diag/type:9
> /sys/devices/cpum_cf/type:8
> /sys/devices/cpum_sf/type:4
> /sys/devices/pai_crypto/type:10
> /sys/devices/pai_ext/type:11
> #

Thanks Thomas, could you:
$ ls /sys/devices/*/cpu*
The assumption is that if a file called "cpus" exists then this a
"core" PMU, while "cpumask" would indicate an uncore PMU. The
exception is /sys/devices/cpu where there is no cpus but all online
CPUs are assumed to be in "cpus" cpu map. On Intel one of the core
PMUs has type 4 which aligns with the perf_event.h "ABI" constant
PERF_TYPE_RAW, so opening a type 4 event will open it on that PMU. The
question is I have is what should be the behavior be on non-Intel for
type 4, for the big.little/hybrid case it seems opening it on core
PMUs would make most sense and is what is done for PERF_TYPE_HARDWARE
and PERF_TYPE_HW_CACHE.

Thanks,
Ian

> Thanks Thomas
> --
> Thomas Richter, Dept 3303, IBM s390 Linux Development, Boeblingen, Germany
> --
> Vorsitzender des Aufsichtsrats: Gregor Pillen
> Geschäftsführung: David Faller
> Sitz der Gesellschaft: Böblingen / Registergericht: Amtsgericht Stuttgart, HRB 243294
>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v4 3/4] perf/core: Remove pmu linear searching code
  2023-05-30 14:00                       ` Ian Rogers
@ 2023-05-31  9:09                         ` Thomas Richter
  2023-05-31 20:20                           ` Ian Rogers
  0 siblings, 1 reply; 35+ messages in thread
From: Thomas Richter @ 2023-05-31  9:09 UTC (permalink / raw)
  To: Ian Rogers
  Cc: Marc Zyngier, Oliver Upton, Peter Zijlstra, Ravi Bangoria,
	Nathan Chancellor, namhyung, eranian, acme, mark.rutland, jolsa,
	bp, kan.liang, adrian.hunter, maddy, x86, linux-perf-users,
	linux-kernel, sandipan.das, ananth.narayan, santosh.shukla,
	kvmarm

On 5/30/23 16:00, Ian Rogers wrote:
> ls /sys/devices/*/cpu*

Hi Ian,

here is the output yo requested:

# ls /sys/devices/*/cpu*
cpu0  cpu3  cpu6         hotplug     modalias  possible  smt
cpu1  cpu4  cpu7         isolated    offline   present   uevent
cpu2  cpu5  dispatching  kernel_max  online    rescan    vulnerabilities
#

In fact it is the same as 
# ls  /sys/devices/system/cpu/
cpu0  cpu3  cpu6         hotplug     modalias  possible  smt
cpu1  cpu4  cpu7         isolated    offline   present   uevent
cpu2  cpu5  dispatching  kernel_max  online    rescan    vulnerabilities
# 
This directory tree has nothing to do with events for perf, it is
merely used to support CPU hotplug on s390.

The PMUs on s390 are
# ls -ld /sys/devices/{cpum_,pai_}*
drwxr-xr-x 4 root root 0 May 25 15:45 /sys/devices/cpum_cf
drwxr-xr-x 4 root root 0 May 25 15:45 /sys/devices/cpum_cf_diag
drwxr-xr-x 4 root root 0 May 25 15:45 /sys/devices/cpum_sf
drwxr-xr-x 4 root root 0 May 25 15:45 /sys/devices/pai_crypto
drwxr-xr-x 4 root root 0 May 25 15:45 /sys/devices/pai_ext
# 

I hope his helps.

-- 
Thomas Richter, Dept 3303, IBM s390 Linux Development, Boeblingen, Germany
--
Vorsitzender des Aufsichtsrats: Gregor Pillen
Geschäftsführung: David Faller
Sitz der Gesellschaft: Böblingen / Registergericht: Amtsgericht Stuttgart, HRB 243294


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v4 3/4] perf/core: Remove pmu linear searching code
  2023-05-31  9:09                         ` Thomas Richter
@ 2023-05-31 20:20                           ` Ian Rogers
  2023-06-01 11:02                             ` Thomas Richter
  0 siblings, 1 reply; 35+ messages in thread
From: Ian Rogers @ 2023-05-31 20:20 UTC (permalink / raw)
  To: Thomas Richter
  Cc: Marc Zyngier, Oliver Upton, Peter Zijlstra, Ravi Bangoria,
	Nathan Chancellor, namhyung, eranian, acme, mark.rutland, jolsa,
	bp, kan.liang, adrian.hunter, maddy, x86, linux-perf-users,
	linux-kernel, sandipan.das, ananth.narayan, santosh.shukla,
	kvmarm

On Wed, May 31, 2023 at 2:09 AM Thomas Richter <tmricht@linux.ibm.com> wrote:
>
> On 5/30/23 16:00, Ian Rogers wrote:
> > ls /sys/devices/*/cpu*
>
> Hi Ian,
>
> here is the output yo requested:
>
> # ls /sys/devices/*/cpu*
> cpu0  cpu3  cpu6         hotplug     modalias  possible  smt
> cpu1  cpu4  cpu7         isolated    offline   present   uevent
> cpu2  cpu5  dispatching  kernel_max  online    rescan    vulnerabilities
> #
>
> In fact it is the same as
> # ls  /sys/devices/system/cpu/
> cpu0  cpu3  cpu6         hotplug     modalias  possible  smt
> cpu1  cpu4  cpu7         isolated    offline   present   uevent
> cpu2  cpu5  dispatching  kernel_max  online    rescan    vulnerabilities
> #
> This directory tree has nothing to do with events for perf, it is
> merely used to support CPU hotplug on s390.
>
> The PMUs on s390 are
> # ls -ld /sys/devices/{cpum_,pai_}*
> drwxr-xr-x 4 root root 0 May 25 15:45 /sys/devices/cpum_cf
> drwxr-xr-x 4 root root 0 May 25 15:45 /sys/devices/cpum_cf_diag
> drwxr-xr-x 4 root root 0 May 25 15:45 /sys/devices/cpum_sf
> drwxr-xr-x 4 root root 0 May 25 15:45 /sys/devices/pai_crypto
> drwxr-xr-x 4 root root 0 May 25 15:45 /sys/devices/pai_ext
> #
>
> I hope his helps.

Thanks Thomas,

The perf tool has a notion of "core" and "other" PMUs - other means
uncore and things like interconnect PMUs. The distinction matters as
PMUs may have a list of CPUs with them, for "other" PMUs the CPU list
(known in the tool as the CPU map) is a suggestion of the CPU to open
events on. For example, on my laptop:
```
$ cat /sys/devices/system/cpu/online
0-15
$ cat /sys/devices/uncore_imc_free_running_0/cpumask
0
```
So I have 16 "CPUs" and the memory controller is suggesting opening
the events on CPU 0. However, if I do:
```
$ sudo perf stat -e 'uncore_imc_free_running_0/data_read/' -C 8 -a -A sleep 1

Performance counter stats for 'system wide':

CPU8                 4,617.60 MiB
uncore_imc_free_running_0/data_read/


      1.001094684 seconds time elapsed
```
Then things are good and the event was recorded on CPU 8, even though
the cpumask of the PMU only said CPU 0.

For "core" PMUs the CPU map works differently. A CPU outside of the
map is an error. For example, if I have a heterogeneous system with 2
CPUs, the first CPU on 1 PMU and the second CPU on a different PMU, if
I try to open events for the wrong PMU  type for the CPU then it
should fail. The CPU map should be interpreted as the set of CPUs
events are valid on.

The logic to determine if a PMU is "core" or "other" is here:
https://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git/tree/tools/perf/util/pmu.c?h=perf-tools-next#n1417
https://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git/tree/tools/perf/util/pmu.c?h=perf-tools-next#n609

That is, a PMU is a "core" PMU if it is called "cpu" or if there is a
file called "cpus" inside the PMU's sysfs directory. It seems on s390
none of the PMUs are considered "core" and this is why we're having
issues.

Looking at:
https://lore.kernel.org/lkml/20180416132314.33249-1-tmricht@linux.ibm.com/

It would seem to make sense that "cpu", "cpum_cf" and "cpum_sf" should
all cause "is_pmu_core" to return true, From your output I suspect the
issue is that cpum_cf and cpum_sf both lack the "cpus" file - which is
also true on homogeneous x86's "cpu" PMU. We can fix the code with:

```
diff --git a/tools/perf/util/pmu.c b/tools/perf/util/pmu.c
index 0520aa9fe991..bdc3f3b148fc 100644
--- a/tools/perf/util/pmu.c
+++ b/tools/perf/util/pmu.c
@@ -1414,7 +1414,9 @@ void perf_pmu__del_formats(struct list_head *formats)

bool is_pmu_core(const char *name)
{
-       return !strcmp(name, "cpu") || is_sysfs_pmu_core(name);
+       return !strcmp(name, "cpu") ||
+              !strcmp(name, "cpum_cf") || !strcmp(name, "cpum_sf") ||
+              is_sysfs_pmu_core(name);
}

bool perf_pmu__supports_legacy_cache(const struct perf_pmu *pmu)
```

Wdyt? Thanks,
Ian

> --
> Thomas Richter, Dept 3303, IBM s390 Linux Development, Boeblingen, Germany
> --
> Vorsitzender des Aufsichtsrats: Gregor Pillen
> Geschäftsführung: David Faller
> Sitz der Gesellschaft: Böblingen / Registergericht: Amtsgericht Stuttgart, HRB 243294
>

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH v4 3/4] perf/core: Remove pmu linear searching code
  2023-05-31 20:20                           ` Ian Rogers
@ 2023-06-01 11:02                             ` Thomas Richter
  2023-06-01 11:18                               ` Peter Zijlstra
  0 siblings, 1 reply; 35+ messages in thread
From: Thomas Richter @ 2023-06-01 11:02 UTC (permalink / raw)
  To: Ian Rogers
  Cc: Marc Zyngier, Oliver Upton, Peter Zijlstra, Ravi Bangoria,
	Nathan Chancellor, namhyung, eranian, acme, mark.rutland, jolsa,
	bp, kan.liang, adrian.hunter, maddy, x86, linux-perf-users,
	linux-kernel, sandipan.das, ananth.narayan, santosh.shukla,
	kvmarm

On 5/31/23 22:20, Ian Rogers wrote:
> On Wed, May 31, 2023 at 2:09 AM Thomas Richter <tmricht@linux.ibm.com> wrote:
>>
>> On 5/30/23 16:00, Ian Rogers wrote:
>>> ls /sys/devices/*/cpu*
>>
>> Hi Ian,
>>
>> here is the output yo requested:
>>
>> # ls /sys/devices/*/cpu*
>> cpu0  cpu3  cpu6         hotplug     modalias  possible  smt
>> cpu1  cpu4  cpu7         isolated    offline   present   uevent
>> cpu2  cpu5  dispatching  kernel_max  online    rescan    vulnerabilities
>> #
>>
>> In fact it is the same as
>> # ls  /sys/devices/system/cpu/
>> cpu0  cpu3  cpu6         hotplug     modalias  possible  smt
>> cpu1  cpu4  cpu7         isolated    offline   present   uevent
>> cpu2  cpu5  dispatching  kernel_max  online    rescan    vulnerabilities
>> #
>> This directory tree has nothing to do with events for perf, it is
>> merely used to support CPU hotplug on s390.
>>
>> The PMUs on s390 are
>> # ls -ld /sys/devices/{cpum_,pai_}*
>> drwxr-xr-x 4 root root 0 May 25 15:45 /sys/devices/cpum_cf
>> drwxr-xr-x 4 root root 0 May 25 15:45 /sys/devices/cpum_cf_diag
>> drwxr-xr-x 4 root root 0 May 25 15:45 /sys/devices/cpum_sf
>> drwxr-xr-x 4 root root 0 May 25 15:45 /sys/devices/pai_crypto
>> drwxr-xr-x 4 root root 0 May 25 15:45 /sys/devices/pai_ext
>> #
>>
>> I hope his helps.
> 
> Thanks Thomas,
> 
> The perf tool has a notion of "core" and "other" PMUs - other means
> uncore and things like interconnect PMUs. The distinction matters as
> PMUs may have a list of CPUs with them, for "other" PMUs the CPU list
> (known in the tool as the CPU map) is a suggestion of the CPU to open
> events on. For example, on my laptop:
> ```
> $ cat /sys/devices/system/cpu/online
> 0-15
> $ cat /sys/devices/uncore_imc_free_running_0/cpumask
> 0
> ```
> So I have 16 "CPUs" and the memory controller is suggesting opening
> the events on CPU 0. However, if I do:
> ```
> $ sudo perf stat -e 'uncore_imc_free_running_0/data_read/' -C 8 -a -A sleep 1
> 
> Performance counter stats for 'system wide':
> 
> CPU8                 4,617.60 MiB
> uncore_imc_free_running_0/data_read/
> 
> 
>       1.001094684 seconds time elapsed
> ```
> Then things are good and the event was recorded on CPU 8, even though
> the cpumask of the PMU only said CPU 0.
> 
> For "core" PMUs the CPU map works differently. A CPU outside of the
> map is an error. For example, if I have a heterogeneous system with 2
> CPUs, the first CPU on 1 PMU and the second CPU on a different PMU, if
> I try to open events for the wrong PMU  type for the CPU then it
> should fail. The CPU map should be interpreted as the set of CPUs
> events are valid on.
> 
> The logic to determine if a PMU is "core" or "other" is here:
> https://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git/tree/tools/perf/util/pmu.c?h=perf-tools-next#n1417
> https://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git/tree/tools/perf/util/pmu.c?h=perf-tools-next#n609
> 
> That is, a PMU is a "core" PMU if it is called "cpu" or if there is a
> file called "cpus" inside the PMU's sysfs directory. It seems on s390
> none of the PMUs are considered "core" and this is why we're having
> issues.
> 
> Looking at:
> https://lore.kernel.org/lkml/20180416132314.33249-1-tmricht@linux.ibm.com/
> 
> It would seem to make sense that "cpu", "cpum_cf" and "cpum_sf" should
> all cause "is_pmu_core" to return true, From your output I suspect the
> issue is that cpum_cf and cpum_sf both lack the "cpus" file - which is
> also true on homogeneous x86's "cpu" PMU. We can fix the code with:
> 
> ```
> diff --git a/tools/perf/util/pmu.c b/tools/perf/util/pmu.c
> index 0520aa9fe991..bdc3f3b148fc 100644
> --- a/tools/perf/util/pmu.c
> +++ b/tools/perf/util/pmu.c
> @@ -1414,7 +1414,9 @@ void perf_pmu__del_formats(struct list_head *formats)
> 
> bool is_pmu_core(const char *name)
> {
> -       return !strcmp(name, "cpu") || is_sysfs_pmu_core(name);
> +       return !strcmp(name, "cpu") ||
> +              !strcmp(name, "cpum_cf") || !strcmp(name, "cpum_sf") ||
> +              is_sysfs_pmu_core(name);
> }
> 
> bool perf_pmu__supports_legacy_cache(const struct perf_pmu *pmu)
> ```
> 
> Wdyt? Thanks,
> Ian
> 

Ian, thanks very much for your help. I changed the code as you
suggested and added s390's PMU cpum_cf as in

bool is_pmu_core(const char *name)
{
        return !strcmp(name, "cpu") || !strcmp(name, "cpum_cf") ||
               is_sysfs_pmu_core(name);
}

The result is much better, now the
- test case 6.1 fails only 2 times and for a different reason:
  running test 2 'r1a'
  FAILED tests/parse-events.c:125 Raw PMU not matched
  Event test failure: test 2 'r1a' rc -1
  ...
  running test 14 'r1a:kp'
  FAILED tests/parse-events.c:125 Raw PMU not matched
  Event test failure: test 14 'r1a:kp' rc -1
  
- test case 10.2 fails for an unknown event bp_l1_btb_correct.
   10.2: PMU event map aliases         :
   --- start ---
   Using CPUID IBM,8561,703,T01,3.6,002f
   testing aliases core PMU cpum_cf: no alias, alias_table->name=bp_l1_btb_correct
   testing core PMU cpum_cf aliases: failed
   ---- end ----
   PMU events subtest 2: FAILED!
- test case 68 now succeeds:
  # ./perf test  68
  68: Parse and process metrics        : Ok
  #

Regarding test case 6.1: 
Raw PMU not matched is printed in tests/parse-events.c::125
Debugging revealed that function
  test_event()
  +--> parse_events() which returns 0
  +--> e->check() ---> test__checkevent_raw()
This function has TEST_ASSERT_VAL("Raw PMU not matched", raw_type_match);
triggering -1 because the PMU has type 8 which is not PERF_TYPE_RAW

Maybe make type >= PERF_TYPE_RAW to succeed the test?

Regarding test case 10.2: the event bp_l1_btb_correct does not exist on
s390. It is defined in
- pmu-events/empty-pmu-events.c
- pmu-events/arch/test/test_soc/cpu/branch.json
Not sure which one is used.
But this event is unknown on s390

What is the best way to fix these issues.

PS: I have the feeling, it gets complicated to have multiple hardware PMUs
per platform.

-- 
Thomas Richter, Dept 3303, IBM s390 Linux Development, Boeblingen, Germany
--
Vorsitzender des Aufsichtsrats: Gregor Pillen
Geschäftsführung: David Faller
Sitz der Gesellschaft: Böblingen / Registergericht: Amtsgericht Stuttgart, HRB 243294


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v4 3/4] perf/core: Remove pmu linear searching code
  2023-06-01 11:02                             ` Thomas Richter
@ 2023-06-01 11:18                               ` Peter Zijlstra
  2023-06-01 11:20                                 ` Peter Zijlstra
  0 siblings, 1 reply; 35+ messages in thread
From: Peter Zijlstra @ 2023-06-01 11:18 UTC (permalink / raw)
  To: Thomas Richter
  Cc: Ian Rogers, Marc Zyngier, Oliver Upton, Ravi Bangoria,
	Nathan Chancellor, namhyung, eranian, acme, mark.rutland, jolsa,
	bp, kan.liang, adrian.hunter, maddy, x86, linux-perf-users,
	linux-kernel, sandipan.das, ananth.narayan, santosh.shukla,
	kvmarm

On Thu, Jun 01, 2023 at 01:02:30PM +0200, Thomas Richter wrote:

> PS: I have the feeling, it gets complicated to have multiple hardware PMUs
> per platform.

Recently someone was poking around giving the pmu device a parent, this
would, I think, result in sysfs links, which could be used to decide
what's what.

Core pmus would have the CPU device as their parent, while memory
controller thingies would link to the relevant node or something.

Ofc. all that's future-work/pending.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v4 3/4] perf/core: Remove pmu linear searching code
  2023-06-01 11:18                               ` Peter Zijlstra
@ 2023-06-01 11:20                                 ` Peter Zijlstra
  0 siblings, 0 replies; 35+ messages in thread
From: Peter Zijlstra @ 2023-06-01 11:20 UTC (permalink / raw)
  To: Thomas Richter
  Cc: Ian Rogers, Marc Zyngier, Oliver Upton, Ravi Bangoria,
	Nathan Chancellor, namhyung, eranian, acme, mark.rutland, jolsa,
	bp, kan.liang, adrian.hunter, maddy, x86, linux-perf-users,
	linux-kernel, sandipan.das, ananth.narayan, santosh.shukla,
	kvmarm

On Thu, Jun 01, 2023 at 01:18:56PM +0200, Peter Zijlstra wrote:
> On Thu, Jun 01, 2023 at 01:02:30PM +0200, Thomas Richter wrote:
> 
> > PS: I have the feeling, it gets complicated to have multiple hardware PMUs
> > per platform.
> 
> Recently someone was poking around giving the pmu device a parent, this
> would, I think, result in sysfs links, which could be used to decide
> what's what.
> 
> Core pmus would have the CPU device as their parent, while memory
> controller thingies would link to the relevant node or something.
> 
> Ofc. all that's future-work/pending.

https://lkml.kernel.org/r/20230404134225.13408-1-Jonathan.Cameron%40huawei.com

https://lkml.kernel.org/r/20230531110011.13963-2-Jonathan.Cameron%40huawei.com

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v4 1/4] perf/core: Rework forwarding of {task|cpu}-clock events
  2023-05-04 11:00 ` [PATCH v4 1/4] perf/core: Rework forwarding of {task|cpu}-clock events Ravi Bangoria
  2023-05-10 13:33   ` [tip: perf/core] " tip-bot2 for Ravi Bangoria
@ 2024-02-20  8:41   ` Pengfei Xu
  2024-02-23  5:27     ` Ravi Bangoria
  1 sibling, 1 reply; 35+ messages in thread
From: Pengfei Xu @ 2024-02-20  8:41 UTC (permalink / raw)
  To: Ravi Bangoria
  Cc: peterz, namhyung, eranian, acme, mark.rutland, jolsa, irogers,
	bp, kan.liang, adrian.hunter, maddy, x86, linux-perf-users,
	linux-kernel, sandipan.das, ananth.narayan, santosh.shukla,
	pengfei.xu, lkp

Hi Ravi Bangoria,

On 2023-05-04 at 16:30:00 +0530, Ravi Bangoria wrote:
> Currently, PERF_TYPE_SOFTWARE is treated specially since task-clock and
> cpu-clock events are interfaced through it but internally gets forwarded
> to their own pmus.
> 
> Rework this by overwriting event->attr.type in perf_swevent_init() which
> will cause perf_init_event() to retry with updated type and event will
> automatically get forwarded to right pmu. With the change, SW pmu no
> longer needs to be treated specially and can be included in 'pmu_idr'
> list.
> 
> Suggested-by: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com>
> ---
>  include/linux/perf_event.h | 10 +++++
>  kernel/events/core.c       | 77 ++++++++++++++++++++------------------
>  2 files changed, 51 insertions(+), 36 deletions(-)

Greeting!
There is task hung in perf_tp_event_init in v6.8-rc4 in guest.

All detailed info: https://github.com/xupengfe/syzkaller_logs/tree/main/240219_055325_perf_tp_event_init
Syzkaller reproduced code: https://github.com/xupengfe/syzkaller_logs/blob/main/240219_055325_perf_tp_event_init/repro.c
Syzkaller reproduced syscall steps: https://github.com/xupengfe/syzkaller_logs/blob/main/240219_055325_perf_tp_event_init/repro.prog
Kconfig(make olddefconfig): https://github.com/xupengfe/syzkaller_logs/blob/main/240219_055325_perf_tp_event_init/kconfig_origin
Bisect info: https://github.com/xupengfe/syzkaller_logs/blob/main/240219_055325_perf_tp_event_init/bisect_info.log
v6.8-rc4 issue dmesg: https://github.com/xupengfe/syzkaller_logs/blob/main/240219_055325_perf_tp_event_init/841c35169323cd833294798e58b9bf63fa4fa1de_dmesg.log
v6.8-rc4 bzImage: https://github.com/xupengfe/syzkaller_logs/raw/main/240219_055325_perf_tp_event_init/bzimage_v6.8-rc4.tar.gz

Bisected and found bad commit:
0d6d062ca27e perf/core: Rework forwarding of {task|cpu}-clock events

It could be reproduced in 3600s.
After reverted the above commit on top of v6.8-rc4 kernel, this issue was gone.

Syzkaller repro.report:
"
clocksource: Long readout interval, skipping watchdog check: cs_nsec: 2247936849 wd_nsec: 2247937071
INFO: task syz-executor703:819 blocked for more than 143 seconds.
      Not tainted 6.8.0-rc4-4e6c5b0f4ced+ #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:syz-executor703 state:D stack:0     pid:819   tgid:819   ppid:760    flags:0x00000004
Call Trace:
 <TASK>
 context_switch kernel/sched/core.c:5400 [inline]
 __schedule+0xa7d/0x2810 kernel/sched/core.c:6727
 __schedule_loop kernel/sched/core.c:6802 [inline]
 schedule+0xd0/0x290 kernel/sched/core.c:6817
 schedule_preempt_disabled+0x1c/0x30 kernel/sched/core.c:6874
 __mutex_lock_common kernel/locking/mutex.c:684 [inline]
 __mutex_lock+0xf63/0x1b50 kernel/locking/mutex.c:752
 mutex_lock_nested+0x1f/0x30 kernel/locking/mutex.c:804
 perf_trace_init+0x5c/0x310 kernel/trace/trace_event_perf.c:221
 perf_tp_event_init+0xaf/0x130 kernel/events/core.c:10106
 perf_try_init_event+0x13f/0x5a0 kernel/events/core.c:11672
 perf_init_event kernel/events/core.c:11742 [inline]
 perf_event_alloc kernel/events/core.c:12022 [inline]
 perf_event_alloc+0x1069/0x3e40 kernel/events/core.c:11888
 __do_sys_perf_event_open+0x48e/0x2c40 kernel/events/core.c:12529
 __se_sys_perf_event_open kernel/events/core.c:12420 [inline]
 __x64_sys_perf_event_open+0xc7/0x160 kernel/events/core.c:12420
 do_syscall_x64 arch/x86/entry/common.c:52 [inline]
 do_syscall_64+0x74/0x150 arch/x86/entry/common.c:83
 entry_SYSCALL_64_after_hwframe+0x6e/0x76
RIP: 0033:0x45c36d
RSP: 002b:00007fffec7b5478 EFLAGS: 00000246 ORIG_RAX: 000000000000012a
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 000000000045c36d
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000020000080
RBP: 00000000000b8995 R08: 0000000000000000 R09: 000000000000031a
R10: 00000000ffffffff R11: 0000000000000246 R12: 00000000ffffffff
R13: 0000000000000000 R14: 00000000004f9018 R15: 0000000000000000
 </TASK>
INFO: task syz-executor703:821 blocked for more than 143 seconds.
      Not tainted 6.8.0-rc4-4e6c5b0f4ced+ #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:syz-executor703 state:D stack:0     pid:821   tgid:821   ppid:767    flags:0x00000004
Call Trace:
 <TASK>
 context_switch kernel/sched/core.c:5400 [inline]
 __schedule+0xa7d/0x2810 kernel/sched/core.c:6727
 __schedule_loop kernel/sched/core.c:6802 [inline]
 schedule+0xd0/0x290 kernel/sched/core.c:6817
 schedule_preempt_disabled+0x1c/0x30 kernel/sched/core.c:6874
 __mutex_lock_common kernel/locking/mutex.c:684 [inline]
 __mutex_lock+0xf63/0x1b50 kernel/locking/mutex.c:752
 mutex_lock_nested+0x1f/0x30 kernel/locking/mutex.c:804
 perf_trace_init+0x5c/0x310 kernel/trace/trace_event_perf.c:221
 perf_tp_event_init+0xaf/0x130 kernel/events/core.c:10106
 perf_try_init_event+0x13f/0x5a0 kernel/events/core.c:11672
 perf_init_event kernel/events/core.c:11742 [inline]
 perf_event_alloc kernel/events/core.c:12022 [inline]
 perf_event_alloc+0x1069/0x3e40 kernel/events/core.c:11888
 __do_sys_perf_event_open+0x48e/0x2c40 kernel/events/core.c:12529
 __se_sys_perf_event_open kernel/events/core.c:12420 [inline]
 __x64_sys_perf_event_open+0xc7/0x160 kernel/events/core.c:12420
 do_syscall_x64 arch/x86/entry/common.c:52 [inline]
 do_syscall_64+0x74/0x150 arch/x86/entry/common.c:83
 entry_SYSCALL_64_after_hwframe+0x6e/0x76
RIP: 0033:0x45c36d
RSP: 002b:00007fffec7b5478 EFLAGS: 00000246 ORIG_RAX: 000000000000012a
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 000000000045c36d
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000020000080
RBP: 00000000000b8bea R08: 0000000000000000 R09: 000000000000031e
R10: 00000000ffffffff R11: 0000000000000246 R12: 00000000ffffffff
R13: 0000000000000000 R14: 00000000004f9018 R15: 0000000000000000
"

Thanks!

---

If you don't need the following environment to reproduce the problem or if you
already have one reproduced environment, please ignore the following information.

How to reproduce:
git clone https://gitlab.com/xupengfe/repro_vm_env.git
cd repro_vm_env
tar -xvf repro_vm_env.tar.gz
cd repro_vm_env; ./start3.sh  // it needs qemu-system-x86_64 and I used v7.1.0
  // start3.sh will load bzImage_2241ab53cbb5cdb08a6b2d4688feb13971058f65 v6.2-rc5 kernel
  // You could change the bzImage_xxx as you want
  // Maybe you need to remove line "-drive if=pflash,format=raw,readonly=on,file=./OVMF_CODE.fd \" for different qemu version
You could use below command to log in, there is no password for root.
ssh -p 10023 root@localhost

After login vm(virtual machine) successfully, you could transfer reproduced
binary to the vm by below way, and reproduce the problem in vm:
gcc -pthread -o repro repro.c
scp -P 10023 repro root@localhost:/root/

Get the bzImage for target kernel:
Please use target kconfig and copy it to kernel_src/.config
make olddefconfig
make -jx bzImage           //x should equal or less than cpu num your pc has

Fill the bzImage file into above start3.sh to load the target kernel in vm.


Tips:
If you already have qemu-system-x86_64, please ignore below info.
If you want to install qemu v7.1.0 version:
git clone https://github.com/qemu/qemu.git
cd qemu
git checkout -f v7.1.0
mkdir build
cd build
yum install -y ninja-build.x86_64
yum -y install libslirp-devel.x86_64
../configure --target-list=x86_64-softmmu --enable-kvm --enable-vnc --enable-gtk --enable-sdl --enable-usb-redir --enable-slirp
make
make install

Best Regards,
Thanks!

> 
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index d5628a7b5eaa..5c8a748f51ac 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -295,6 +295,8 @@ struct perf_event_pmu_context;
>  
>  struct perf_output_handle;
>  
> +#define PMU_NULL_DEV	((void *)(~0))
> +
>  /**
>   * struct pmu - generic performance monitoring unit
>   */
> @@ -827,6 +829,14 @@ struct perf_event {
>  	void *security;
>  #endif
>  	struct list_head		sb_list;
> +
> +	/*
> +	 * Certain events gets forwarded to another pmu internally by over-
> +	 * writing kernel copy of event->attr.type without user being aware
> +	 * of it. event->orig_type contains original 'type' requested by
> +	 * user.
> +	 */
> +	__u32				orig_type;
>  #endif /* CONFIG_PERF_EVENTS */
>  };
>  
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 435815d3be3f..0695bb9fbbb6 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -6647,7 +6647,7 @@ static void perf_sigtrap(struct perf_event *event)
>  		return;
>  
>  	send_sig_perf((void __user *)event->pending_addr,
> -		      event->attr.type, event->attr.sig_data);
> +		      event->orig_type, event->attr.sig_data);
>  }
>  
>  /*
> @@ -9951,6 +9951,9 @@ static void sw_perf_event_destroy(struct perf_event *event)
>  	swevent_hlist_put();
>  }
>  
> +static struct pmu perf_cpu_clock; /* fwd declaration */
> +static struct pmu perf_task_clock;
> +
>  static int perf_swevent_init(struct perf_event *event)
>  {
>  	u64 event_id = event->attr.config;
> @@ -9966,7 +9969,10 @@ static int perf_swevent_init(struct perf_event *event)
>  
>  	switch (event_id) {
>  	case PERF_COUNT_SW_CPU_CLOCK:
> +		event->attr.type = perf_cpu_clock.type;
> +		return -ENOENT;
>  	case PERF_COUNT_SW_TASK_CLOCK:
> +		event->attr.type = perf_task_clock.type;
>  		return -ENOENT;
>  
>  	default:
> @@ -11086,7 +11092,7 @@ static void cpu_clock_event_read(struct perf_event *event)
>  
>  static int cpu_clock_event_init(struct perf_event *event)
>  {
> -	if (event->attr.type != PERF_TYPE_SOFTWARE)
> +	if (event->attr.type != perf_cpu_clock.type)
>  		return -ENOENT;
>  
>  	if (event->attr.config != PERF_COUNT_SW_CPU_CLOCK)
> @@ -11107,6 +11113,7 @@ static struct pmu perf_cpu_clock = {
>  	.task_ctx_nr	= perf_sw_context,
>  
>  	.capabilities	= PERF_PMU_CAP_NO_NMI,
> +	.dev		= PMU_NULL_DEV,
>  
>  	.event_init	= cpu_clock_event_init,
>  	.add		= cpu_clock_event_add,
> @@ -11167,7 +11174,7 @@ static void task_clock_event_read(struct perf_event *event)
>  
>  static int task_clock_event_init(struct perf_event *event)
>  {
> -	if (event->attr.type != PERF_TYPE_SOFTWARE)
> +	if (event->attr.type != perf_task_clock.type)
>  		return -ENOENT;
>  
>  	if (event->attr.config != PERF_COUNT_SW_TASK_CLOCK)
> @@ -11188,6 +11195,7 @@ static struct pmu perf_task_clock = {
>  	.task_ctx_nr	= perf_sw_context,
>  
>  	.capabilities	= PERF_PMU_CAP_NO_NMI,
> +	.dev		= PMU_NULL_DEV,
>  
>  	.event_init	= task_clock_event_init,
>  	.add		= task_clock_event_add,
> @@ -11415,31 +11423,31 @@ int perf_pmu_register(struct pmu *pmu, const char *name, int type)
>  		goto unlock;
>  
>  	pmu->type = -1;
> -	if (!name)
> -		goto skip_type;
> +	if (WARN_ONCE(!name, "Can not register anonymous pmu.\n")) {
> +		ret = -EINVAL;
> +		goto free_pdc;
> +	}
> +
>  	pmu->name = name;
>  
> -	if (type != PERF_TYPE_SOFTWARE) {
> -		if (type >= 0)
> -			max = type;
> +	if (type >= 0)
> +		max = type;
>  
> -		ret = idr_alloc(&pmu_idr, pmu, max, 0, GFP_KERNEL);
> -		if (ret < 0)
> -			goto free_pdc;
> +	ret = idr_alloc(&pmu_idr, pmu, max, 0, GFP_KERNEL);
> +	if (ret < 0)
> +		goto free_pdc;
>  
> -		WARN_ON(type >= 0 && ret != type);
> +	WARN_ON(type >= 0 && ret != type);
>  
> -		type = ret;
> -	}
> +	type = ret;
>  	pmu->type = type;
>  
> -	if (pmu_bus_running) {
> +	if (pmu_bus_running && !pmu->dev) {
>  		ret = pmu_dev_alloc(pmu);
>  		if (ret)
>  			goto free_idr;
>  	}
>  
> -skip_type:
>  	ret = -ENOMEM;
>  	pmu->cpu_pmu_context = alloc_percpu(struct perf_cpu_pmu_context);
>  	if (!pmu->cpu_pmu_context)
> @@ -11481,16 +11489,7 @@ int perf_pmu_register(struct pmu *pmu, const char *name, int type)
>  	if (!pmu->event_idx)
>  		pmu->event_idx = perf_event_idx_default;
>  
> -	/*
> -	 * Ensure the TYPE_SOFTWARE PMUs are at the head of the list,
> -	 * since these cannot be in the IDR. This way the linear search
> -	 * is fast, provided a valid software event is provided.
> -	 */
> -	if (type == PERF_TYPE_SOFTWARE || !name)
> -		list_add_rcu(&pmu->entry, &pmus);
> -	else
> -		list_add_tail_rcu(&pmu->entry, &pmus);
> -
> +	list_add_rcu(&pmu->entry, &pmus);
>  	atomic_set(&pmu->exclusive_cnt, 0);
>  	ret = 0;
>  unlock:
> @@ -11499,12 +11498,13 @@ int perf_pmu_register(struct pmu *pmu, const char *name, int type)
>  	return ret;
>  
>  free_dev:
> -	device_del(pmu->dev);
> -	put_device(pmu->dev);
> +	if (pmu->dev && pmu->dev != PMU_NULL_DEV) {
> +		device_del(pmu->dev);
> +		put_device(pmu->dev);
> +	}
>  
>  free_idr:
> -	if (pmu->type != PERF_TYPE_SOFTWARE)
> -		idr_remove(&pmu_idr, pmu->type);
> +	idr_remove(&pmu_idr, pmu->type);
>  
>  free_pdc:
>  	free_percpu(pmu->pmu_disable_count);
> @@ -11525,9 +11525,8 @@ void perf_pmu_unregister(struct pmu *pmu)
>  	synchronize_rcu();
>  
>  	free_percpu(pmu->pmu_disable_count);
> -	if (pmu->type != PERF_TYPE_SOFTWARE)
> -		idr_remove(&pmu_idr, pmu->type);
> -	if (pmu_bus_running) {
> +	idr_remove(&pmu_idr, pmu->type);
> +	if (pmu_bus_running && pmu->dev && pmu->dev != PMU_NULL_DEV) {
>  		if (pmu->nr_addr_filters)
>  			device_remove_file(pmu->dev, &dev_attr_nr_addr_filters);
>  		device_del(pmu->dev);
> @@ -11601,6 +11600,12 @@ static struct pmu *perf_init_event(struct perf_event *event)
>  
>  	idx = srcu_read_lock(&pmus_srcu);
>  
> +	/*
> +	 * Save original type before calling pmu->event_init() since certain
> +	 * pmus overwrites event->attr.type to forward event to another pmu.
> +	 */
> +	event->orig_type = event->attr.type;
> +
>  	/* Try parent's PMU first: */
>  	if (event->parent && event->parent->pmu) {
>  		pmu = event->parent->pmu;
> @@ -13640,8 +13645,8 @@ void __init perf_event_init(void)
>  	perf_event_init_all_cpus();
>  	init_srcu_struct(&pmus_srcu);
>  	perf_pmu_register(&perf_swevent, "software", PERF_TYPE_SOFTWARE);
> -	perf_pmu_register(&perf_cpu_clock, NULL, -1);
> -	perf_pmu_register(&perf_task_clock, NULL, -1);
> +	perf_pmu_register(&perf_cpu_clock, "cpu_clock", -1);
> +	perf_pmu_register(&perf_task_clock, "task_clock", -1);
>  	perf_tp_register();
>  	perf_event_init_cpu(smp_processor_id());
>  	register_reboot_notifier(&perf_reboot_notifier);
> @@ -13684,7 +13689,7 @@ static int __init perf_event_sysfs_init(void)
>  		goto unlock;
>  
>  	list_for_each_entry(pmu, &pmus, entry) {
> -		if (!pmu->name || pmu->type < 0)
> +		if (pmu->dev)
>  			continue;
>  
>  		ret = pmu_dev_alloc(pmu);
> -- 
> 2.40.0
> 

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v4 1/4] perf/core: Rework forwarding of {task|cpu}-clock events
  2024-02-20  8:41   ` [PATCH v4 1/4] " Pengfei Xu
@ 2024-02-23  5:27     ` Ravi Bangoria
  2024-02-28 12:49       ` Ravi Bangoria
  0 siblings, 1 reply; 35+ messages in thread
From: Ravi Bangoria @ 2024-02-23  5:27 UTC (permalink / raw)
  To: Pengfei Xu
  Cc: peterz, namhyung, eranian, acme, mark.rutland, jolsa, irogers,
	bp, kan.liang, adrian.hunter, maddy, x86, linux-perf-users,
	linux-kernel, sandipan.das, ananth.narayan, santosh.shukla, lkp,
	Ravi Bangoria

Hi Pengfei,

On 20-Feb-24 2:11 PM, Pengfei Xu wrote:
> Hi Ravi Bangoria,
> 
> On 2023-05-04 at 16:30:00 +0530, Ravi Bangoria wrote:
>> Currently, PERF_TYPE_SOFTWARE is treated specially since task-clock and
>> cpu-clock events are interfaced through it but internally gets forwarded
>> to their own pmus.
>>
>> Rework this by overwriting event->attr.type in perf_swevent_init() which
>> will cause perf_init_event() to retry with updated type and event will
>> automatically get forwarded to right pmu. With the change, SW pmu no
>> longer needs to be treated specially and can be included in 'pmu_idr'
>> list.
>>
>> Suggested-by: Peter Zijlstra <peterz@infradead.org>
>> Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com>
>> ---
>>  include/linux/perf_event.h | 10 +++++
>>  kernel/events/core.c       | 77 ++++++++++++++++++++------------------
>>  2 files changed, 51 insertions(+), 36 deletions(-)
> 
> Greeting!
> There is task hung in perf_tp_event_init in v6.8-rc4 in guest.

Thanks for the bug report. I'm able to reproduce it. Will try to spend
more time to rootcause it.

Thanks,
Ravi

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v4 1/4] perf/core: Rework forwarding of {task|cpu}-clock events
  2024-02-23  5:27     ` Ravi Bangoria
@ 2024-02-28 12:49       ` Ravi Bangoria
  2024-02-29  3:41         ` Pengfei Xu
  0 siblings, 1 reply; 35+ messages in thread
From: Ravi Bangoria @ 2024-02-28 12:49 UTC (permalink / raw)
  To: Pengfei Xu
  Cc: peterz, namhyung, eranian, acme, mark.rutland, jolsa, irogers,
	bp, kan.liang, adrian.hunter, maddy, x86, linux-perf-users,
	linux-kernel, sandipan.das, ananth.narayan, santosh.shukla, lkp,
	Ravi Bangoria

>>> Currently, PERF_TYPE_SOFTWARE is treated specially since task-clock and
>>> cpu-clock events are interfaced through it but internally gets forwarded
>>> to their own pmus.
>>>
>>> Rework this by overwriting event->attr.type in perf_swevent_init() which
>>> will cause perf_init_event() to retry with updated type and event will
>>> automatically get forwarded to right pmu. With the change, SW pmu no
>>> longer needs to be treated specially and can be included in 'pmu_idr'
>>> list.
>>>
>>> Suggested-by: Peter Zijlstra <peterz@infradead.org>
>>> Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com>
>>> ---
>>>  include/linux/perf_event.h | 10 +++++
>>>  kernel/events/core.c       | 77 ++++++++++++++++++++------------------
>>>  2 files changed, 51 insertions(+), 36 deletions(-)
>>
>> Greeting!
>> There is task hung in perf_tp_event_init in v6.8-rc4 in guest.
> 
> Thanks for the bug report. I'm able to reproduce it. Will try to spend
> more time to rootcause it.

Although the bisect has lead to 0d6d062ca27e as culprit commit, a minor
change (shown below) in the test program can create the same task hang
issue even with 0d6d062ca27e reverted.

-   *(uint32_t*)0x200000c0 = 6;   /* Use cpu-clock pmu type when 0d6d062ca27e is present */
+   *(uint32_t*)0x200000c0 = 1;   /* Use software pmu type when 0d6d062ca27e is absent */

So, 0d6d062ca27e is not the culprit commit.

Additionally,

o I've seen task hang or soft-lockups on a single cpu KVM guest while
  running your test as root and also as normal user with
  perf_event_paranoid=-1. But the same experiment on host, no lockups,
  only task hang. So I feel the bug report is false positive and there
  is no real issue (since the experiment requires special privilege).

o 0d6d062ca27e has inadvertently started allowing cpu-clock and task-
  clock events creation via their own pmu->type in perf_event_open(),
  instead of earlier design where the only interface was through sw
  pmu. Is it harmful? Probably not. But worth to be documented:

----><----

From c7ae1c57e2a23a05eb982524d37bc8542c9c9a34 Mon Sep 17 00:00:00 2001
From: Ravi Bangoria <ravi.bangoria@amd.com>
Date: Wed, 28 Feb 2024 17:29:04 +0530
Subject: [PATCH] perf/core: Document {cpu|task}-clock event open behavior

The standard interface to invoke task-clock and cpu-clock pmu is through
software pmu (see perf_swevent_init()), since these pmus are not exposed
to the user via sysfs and thus user doesn't know their pmu->type. However,
current code allows user to open an event if user has passed correct type
in the perf event attribute. This is not easily apparent from the code and
thus worth to be documented.

Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com>
---
 kernel/events/core.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index f0f0f71213a1..4072bccab3ba 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -11178,6 +11178,13 @@ static void cpu_clock_event_read(struct perf_event *event)
 
 static int cpu_clock_event_init(struct perf_event *event)
 {
+	/*
+	 * The standard interface to invoke task-clock pmu is through software
+	 * pmu(see perf_swevent_init()), since task-clock pmu is not exposed to
+	 * the user via sysfs and thus user doesn't know perf_task_clock.type.
+	 * However, allow user to open an event if user has passed correct type
+	 * in the attribute.
+	 */
 	if (event->attr.type != perf_cpu_clock.type)
 		return -ENOENT;
 
@@ -11260,6 +11267,13 @@ static void task_clock_event_read(struct perf_event *event)
 
 static int task_clock_event_init(struct perf_event *event)
 {
+	/*
+	 * The standard interface to invoke task-clock pmu is through software
+	 * pmu(see perf_swevent_init()), since task-clock pmu is not exposed to
+	 * the user via sysfs and thus user doesn't know perf_task_clock.type.
+	 * However, allow user to open an event if user has passed correct type
+	 * in the attribute.
+	 */
 	if (event->attr.type != perf_task_clock.type)
 		return -ENOENT;
 
-- 
2.34.1

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH v4 1/4] perf/core: Rework forwarding of {task|cpu}-clock events
  2024-02-28 12:49       ` Ravi Bangoria
@ 2024-02-29  3:41         ` Pengfei Xu
  0 siblings, 0 replies; 35+ messages in thread
From: Pengfei Xu @ 2024-02-29  3:41 UTC (permalink / raw)
  To: Ravi Bangoria
  Cc: peterz, namhyung, Eranian, Stephane, acme, mark.rutland, jolsa,
	irogers, bp, kan.liang, Hunter, Adrian, maddy, x86,
	linux-perf-users, linux-kernel, sandipan.das, ananth.narayan,
	santosh.shukla, lkp

On 2024-02-28 at 20:49:59 +0800, Ravi Bangoria wrote:
> >>> Currently, PERF_TYPE_SOFTWARE is treated specially since task-clock and
> >>> cpu-clock events are interfaced through it but internally gets forwarded
> >>> to their own pmus.
> >>>
> >>> Rework this by overwriting event->attr.type in perf_swevent_init() which
> >>> will cause perf_init_event() to retry with updated type and event will
> >>> automatically get forwarded to right pmu. With the change, SW pmu no
> >>> longer needs to be treated specially and can be included in 'pmu_idr'
> >>> list.
> >>>
> >>> Suggested-by: Peter Zijlstra <peterz@infradead.org>
> >>> Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com>
> >>> ---
> >>>  include/linux/perf_event.h | 10 +++++
> >>>  kernel/events/core.c       | 77 ++++++++++++++++++++------------------
> >>>  2 files changed, 51 insertions(+), 36 deletions(-)
> >>
> >> Greeting!
> >> There is task hung in perf_tp_event_init in v6.8-rc4 in guest.
> > 
> > Thanks for the bug report. I'm able to reproduce it. Will try to spend
> > more time to rootcause it.
> 
> Although the bisect has lead to 0d6d062ca27e as culprit commit, a minor
> change (shown below) in the test program can create the same task hang
> issue even with 0d6d062ca27e reverted.
> 
> -   *(uint32_t*)0x200000c0 = 6;   /* Use cpu-clock pmu type when 0d6d062ca27e is present */
> +   *(uint32_t*)0x200000c0 = 1;   /* Use software pmu type when 0d6d062ca27e is absent */
> 
> So, 0d6d062ca27e is not the culprit commit.

Thank you for your analysis and additional verification!

commit 0d6d062ca27e only changes kernel/events/core.c judgement:
PERF_TYPE_SOFTWARE to perf_cpu_clock.type in function cpu_clock_event_init():
"
@@ -11086,7 +11092,7 @@ static void cpu_clock_event_read(struct perf_event *event)
 
  static int cpu_clock_event_init(struct perf_event *event)
   {
   -       if (event->attr.type != PERF_TYPE_SOFTWARE)
   +       if (event->attr.type != perf_cpu_clock.type)
                   return -ENOENT;
"

After reverted the commit, and change test code 0x200000c0 from 6 to 1,
kernel code in kernel/events/core.c: func perf_swevent_init() still keeps
PERF_TYPE_SOFTWARE(1) judgement because above commit doesn't change judgement
in perf_swevent_init():
"
// PERF_TYPE_SOFTWARE = 1;
	if (event->attr.type != PERF_TYPE_SOFTWARE)
			return -ENOENT;
"

So it seems show that attr.type(0x200000c0) = 6 will return -2 and not solve
further action in perf_swevent_init() will not trigger task hang.

And if attr.type(0x200000c0) = 1 will pass the judgement and solve further
action in perf_swevent_init(), and then will trigger task hang.

Thanks for correction if there is something wrong.


> 
> Additionally,
> 
> o I've seen task hang or soft-lockups on a single cpu KVM guest while
>   running your test as root and also as normal user with
>   perf_event_paranoid=-1. But the same experiment on host, no lockups,
>   only task hang. So I feel the bug report is false positive and there
>   is no real issue (since the experiment requires special privilege).

Thanks for your info! Seems some problem will cause cpu soft-lockups.
If there is more clue, will update it.

> 
> o 0d6d062ca27e has inadvertently started allowing cpu-clock and task-
>   clock events creation via their own pmu->type in perf_event_open(),
>   instead of earlier design where the only interface was through sw
>   pmu. Is it harmful? Probably not. But worth to be documented:

Thanks a lot for description, and there is some other way to trigger
the perf_event type change.

Thanks!

> 
> ----><----
> 
> From c7ae1c57e2a23a05eb982524d37bc8542c9c9a34 Mon Sep 17 00:00:00 2001
> From: Ravi Bangoria <ravi.bangoria@amd.com>
> Date: Wed, 28 Feb 2024 17:29:04 +0530
> Subject: [PATCH] perf/core: Document {cpu|task}-clock event open behavior
> 
> The standard interface to invoke task-clock and cpu-clock pmu is through
> software pmu (see perf_swevent_init()), since these pmus are not exposed
> to the user via sysfs and thus user doesn't know their pmu->type. However,
> current code allows user to open an event if user has passed correct type
> in the perf event attribute. This is not easily apparent from the code and
> thus worth to be documented.
> 
> Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com>
> ---
>  kernel/events/core.c | 14 ++++++++++++++
>  1 file changed, 14 insertions(+)
> 
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index f0f0f71213a1..4072bccab3ba 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -11178,6 +11178,13 @@ static void cpu_clock_event_read(struct perf_event *event)
>  
>  static int cpu_clock_event_init(struct perf_event *event)
>  {
> +	/*
> +	 * The standard interface to invoke task-clock pmu is through software
> +	 * pmu(see perf_swevent_init()), since task-clock pmu is not exposed to
> +	 * the user via sysfs and thus user doesn't know perf_task_clock.type.
> +	 * However, allow user to open an event if user has passed correct type
> +	 * in the attribute.
> +	 */
>  	if (event->attr.type != perf_cpu_clock.type)
>  		return -ENOENT;
>  
> @@ -11260,6 +11267,13 @@ static void task_clock_event_read(struct perf_event *event)
>  
>  static int task_clock_event_init(struct perf_event *event)
>  {
> +	/*
> +	 * The standard interface to invoke task-clock pmu is through software
> +	 * pmu(see perf_swevent_init()), since task-clock pmu is not exposed to
> +	 * the user via sysfs and thus user doesn't know perf_task_clock.type.
> +	 * However, allow user to open an event if user has passed correct type
> +	 * in the attribute.
> +	 */
>  	if (event->attr.type != perf_task_clock.type)
>  		return -ENOENT;
>  
> -- 
> 2.34.1

^ permalink raw reply	[flat|nested] 35+ messages in thread

end of thread, other threads:[~2024-02-29  3:46 UTC | newest]

Thread overview: 35+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-05-04 10:59 [PATCH v4 0/4] perf: Rework event forwarding logic Ravi Bangoria
2023-05-04 11:00 ` [PATCH v4 1/4] perf/core: Rework forwarding of {task|cpu}-clock events Ravi Bangoria
2023-05-10 13:33   ` [tip: perf/core] " tip-bot2 for Ravi Bangoria
2024-02-20  8:41   ` [PATCH v4 1/4] " Pengfei Xu
2024-02-23  5:27     ` Ravi Bangoria
2024-02-28 12:49       ` Ravi Bangoria
2024-02-29  3:41         ` Pengfei Xu
2023-05-04 11:00 ` [PATCH v4 2/4] perf/ibs: Fix interface via core pmu events Ravi Bangoria
2023-05-10 13:33   ` [tip: perf/core] " tip-bot2 for Ravi Bangoria
2023-05-04 11:00 ` [PATCH v4 3/4] perf/core: Remove pmu linear searching code Ravi Bangoria
2023-05-10 13:33   ` [tip: perf/core] " tip-bot2 for Ravi Bangoria
2023-05-24 21:41   ` [PATCH v4 3/4] " Nathan Chancellor
2023-05-25  5:16     ` Ravi Bangoria
2023-05-25  7:11       ` Oliver Upton
2023-05-25 14:20         ` Peter Zijlstra
2023-05-25 15:56           ` Oliver Upton
2023-05-26 23:00             ` Ian Rogers
2023-05-27 13:32               ` Marc Zyngier
2023-05-27 17:00                 ` Ian Rogers
2023-05-27 17:05                   ` Ian Rogers
2023-05-27 18:38                   ` Marc Zyngier
2023-05-27 19:50                     ` Ian Rogers
2023-05-30  7:45                     ` Thomas Richter
2023-05-30 14:00                       ` Ian Rogers
2023-05-31  9:09                         ` Thomas Richter
2023-05-31 20:20                           ` Ian Rogers
2023-06-01 11:02                             ` Thomas Richter
2023-06-01 11:18                               ` Peter Zijlstra
2023-06-01 11:20                                 ` Peter Zijlstra
2023-05-25 15:55         ` Nathan Chancellor
2023-05-04 11:00 ` [PATCH v4 4/4] perf test: Add selftest to test IBS invocation via core pmu events Ravi Bangoria
2023-05-05  9:16   ` Peter Zijlstra
2023-05-15 21:31     ` Arnaldo Carvalho de Melo
2023-05-15 21:33       ` Arnaldo Carvalho de Melo
2023-05-10 13:33   ` [tip: perf/core] " tip-bot2 for Ravi Bangoria

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.