linux-kernel.vger.kernel.org archive mirror
* [PATCH RFC 0/8] KVM: x86/pmu: Enable Fixed Counter3 and Topdown Perf Metrics
@ 2022-12-12 12:58 Like Xu
  2022-12-12 12:58 ` [PATCH RFC 1/8] perf/core: Add *group_leader to perf_event_create_kernel_counter() Like Xu
                   ` (7 more replies)
  0 siblings, 8 replies; 13+ messages in thread
From: Like Xu @ 2022-12-12 12:58 UTC (permalink / raw)
  To: Peter Zijlstra, Sean Christopherson; +Cc: Paolo Bonzini, linux-kernel, kvm

Hi,

The Ice Lake core PMU provides built-in support for Top-down u-arch
Analysis (TMA) method level 1 metrics. These metrics are always available
to cross-validate performance observations, freeing general purpose
counters to count other events in high counter utilization scenarios.
For more details about the method, refer to the Top-Down Analysis Method
chapter (Appendix B.1) of the Intel® 64 and IA-32 Architectures
Optimization Reference Manual, and to SDM section 19.3.9.3, "Performance
Metrics".

This patchset enables Intel Topdown for KVM-based guests. The basic
enabling framework remains unchanged: a perf_metrics MSR is introduced, a
group of perf_events (rather than a single one) is created in KVM and bound
to fixed counter 3 to obtain hardware resources, and the guest value of the
perf_metrics MSR is assembled from the counts of the grouped perf_events.

On the KVM side, patches 4-6 may be reviewed independently if KVM only
enables fixed counter 3 as a normal slots event for counting and sampling.
Patch 7 updates the infrastructure for creating grouped events in KVM,
and patch 8 uses grouped events to emulate the guest MSR_PERF_METRICS.

On the perf side, patches 1-3 are awaiting review for tip/perf/core and
could be accepted separately if they make sense. TBH, I don't think our
perf/core is fully prepared to support kernel-space grouped counters,
considering the comments around perf_event_enable()/disable(). But after
much exploration on my part, this is probably the most promising way to
get KVM to create slots plus metrics events. If the addition of
*group_leader messes things up, please shout at me about your needs.

More details in each commit message may answer code-related questions.

A classic perf tool usage on a Linux guest is as follows:
$ perf stat --topdown --td-level=1 -I1000 --no-metric-only sleep 1
#           time             counts unit events
     1.000548528         34,505,682      slots
     1.000548528         14,208,222      topdown-retiring                 #     41.5% Retiring
     1.000548528          1,623,796      topdown-bad-spec                 #      4.7% Bad Speculation
     1.000548528         14,614,171      topdown-fe-bound                 #     42.7% Frontend Bound
     1.000548528          3,788,859      topdown-be-bound                 #     11.1% Backend Bound
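
For reference, the printed percentages correspond to each metric count
divided by the sum of the four topdown counts, e.g.:

    14,208,222 / (14,208,222 + 1,623,796 + 14,614,171 + 3,788,859) ~= 41.5% Retiring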

Related KUT (kvm-unit-tests) cases will follow if there are no blocking
negative comments.

Nit: prerequisite patches include:
https://lore.kernel.org/kvm/20221207071506.15733-2-likexu@tencent.com/
https://lore.kernel.org/kvm/20221205122048.16023-1-likexu@tencent.com/

Please feel free to comment and share your feedback.

Thanks,

Like Xu (8):
  perf/core: Add *group_leader to perf_event_create_kernel_counter()
  perf: x86/core: Expose the available number of the Topdown metrics
  perf: x86/core: Sync PERF_METRICS bit together with fixed counter3
  KVM: x86/pmu: Add Intel CPUID-hinted Topdown Slots event
  KVM: x86/pmu: Add kernel-defined slots event to enable Fixed Counter3
  KVM: x86/pmu: properly use INTEL_PMC_FIXED_RDPMC_BASE macro
  KVM: x86/pmu: Use flex *event arrays to implement grouped events
  KVM: x86/pmu: Add MSR_PERF_METRICS MSR emulation to enable Topdown

 arch/arm64/kvm/pmu-emul.c                 |   4 +-
 arch/x86/events/core.c                    |   1 +
 arch/x86/events/intel/core.c              |   3 +
 arch/x86/include/asm/kvm_host.h           |  14 +-
 arch/x86/include/asm/msr-index.h          |   1 +
 arch/x86/include/asm/perf_event.h         |   1 +
 arch/x86/kernel/cpu/resctrl/pseudo_lock.c |   4 +-
 arch/x86/kvm/pmu.c                        | 149 ++++++++++++++++++++--
 arch/x86/kvm/pmu.h                        |  31 +++--
 arch/x86/kvm/svm/pmu.c                    |   1 +
 arch/x86/kvm/vmx/pmu_intel.c              |  53 +++++++-
 arch/x86/kvm/vmx/vmx.c                    |   3 +
 arch/x86/kvm/x86.c                        |   9 +-
 include/linux/perf_event.h                |   1 +
 kernel/events/core.c                      |   4 +-
 kernel/events/hw_breakpoint.c             |   4 +-
 kernel/events/hw_breakpoint_test.c        |   2 +-
 kernel/watchdog_hld.c                     |   2 +-
 18 files changed, 239 insertions(+), 48 deletions(-)

-- 
2.38.2


^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH RFC 1/8] perf/core: Add *group_leader to perf_event_create_kernel_counter()
  2022-12-12 12:58 [PATCH RFC 0/8] KVM: x86/pmu: Enable Fixed Counter3 and Topdown Perf Metrics Like Xu
@ 2022-12-12 12:58 ` Like Xu
  2022-12-12 13:23   ` Marc Zyngier
  2022-12-14  3:52   ` Ravi Bangoria
  2022-12-12 12:58 ` [PATCH RFC 2/8] perf: x86/core: Expose the available number of the Topdown metrics Like Xu
                   ` (6 subsequent siblings)
  7 siblings, 2 replies; 13+ messages in thread
From: Like Xu @ 2022-12-12 12:58 UTC (permalink / raw)
  To: Peter Zijlstra, Sean Christopherson
  Cc: Paolo Bonzini, linux-kernel, kvm, Marc Zyngier, Fenghua Yu,
	kvmarm, linux-perf-users

From: Like Xu <likexu@tencent.com>

Like syscall users, kernel-space perf_event creators may also use the
grouped-counters abstraction to gain PMU functionality; an in-kernel
counter group behaves much like a normal 'single' counter, following the
group semantics-based behavior.

No functional change at this time. One example is that KVM creates the
Intel slots event as the group leader and the other topdown metric events
as group members to emulate the MSR_PERF_METRICS PMU capability for guests.
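
As an illustration only (slots_attr, metric_attr, overflow_fn and ctx are
hypothetical placeholders, not part of this patch), an in-kernel user would
pass the first event back as the leader when creating the group members:

	struct perf_event *leader, *member;

	/* leader: created as before, with no group */
	leader = perf_event_create_kernel_counter(&slots_attr, -1, current,
						  NULL, overflow_fn, ctx);
	if (IS_ERR(leader))
		return PTR_ERR(leader);

	/* member: scheduled together with the leader as one group */
	member = perf_event_create_kernel_counter(&metric_attr, -1, current,
						  leader, overflow_fn, ctx);
	if (IS_ERR(member))
		return PTR_ERR(member);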

Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: kvmarm@lists.linux.dev
Cc: linux-perf-users@vger.kernel.org
Signed-off-by: Like Xu <likexu@tencent.com>
---
 arch/arm64/kvm/pmu-emul.c                 | 4 ++--
 arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 4 ++--
 arch/x86/kvm/pmu.c                        | 2 +-
 arch/x86/kvm/vmx/pmu_intel.c              | 2 +-
 include/linux/perf_event.h                | 1 +
 kernel/events/core.c                      | 4 +++-
 kernel/events/hw_breakpoint.c             | 4 ++--
 kernel/events/hw_breakpoint_test.c        | 2 +-
 kernel/watchdog_hld.c                     | 2 +-
 9 files changed, 14 insertions(+), 11 deletions(-)

diff --git a/arch/arm64/kvm/pmu-emul.c b/arch/arm64/kvm/pmu-emul.c
index 24908400e190..11c3386bc86b 100644
--- a/arch/arm64/kvm/pmu-emul.c
+++ b/arch/arm64/kvm/pmu-emul.c
@@ -624,7 +624,7 @@ static void kvm_pmu_create_perf_event(struct kvm_pmc *pmc)
 
 	attr.sample_period = compute_period(pmc, kvm_pmu_get_pmc_value(pmc));
 
-	event = perf_event_create_kernel_counter(&attr, -1, current,
+	event = perf_event_create_kernel_counter(&attr, -1, current, NULL,
 						 kvm_pmu_perf_overflow, pmc);
 
 	if (IS_ERR(event)) {
@@ -713,7 +713,7 @@ static struct arm_pmu *kvm_pmu_probe_armpmu(void)
 	attr.config = ARMV8_PMUV3_PERFCTR_CPU_CYCLES;
 	attr.sample_period = GENMASK(63, 0);
 
-	event = perf_event_create_kernel_counter(&attr, -1, current,
+	event = perf_event_create_kernel_counter(&attr, -1, current, NULL,
 						 kvm_pmu_perf_overflow, &attr);
 
 	if (IS_ERR(event)) {
diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
index d961ae3ed96e..43e54bb200cd 100644
--- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
+++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
@@ -952,12 +952,12 @@ static int measure_residency_fn(struct perf_event_attr *miss_attr,
 	u64 tmp;
 
 	miss_event = perf_event_create_kernel_counter(miss_attr, plr->cpu,
-						      NULL, NULL, NULL);
+						      NULL, NULL, NULL, NULL);
 	if (IS_ERR(miss_event))
 		goto out;
 
 	hit_event = perf_event_create_kernel_counter(hit_attr, plr->cpu,
-						     NULL, NULL, NULL);
+						     NULL, NULL, NULL, NULL);
 	if (IS_ERR(hit_event))
 		goto out_miss;
 
diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index eb594620dd75..f6c8180241d7 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -204,7 +204,7 @@ static int pmc_reprogram_counter(struct kvm_pmc *pmc, u32 type, u64 config,
 			attr.precise_ip = 3;
 	}
 
-	event = perf_event_create_kernel_counter(&attr, -1, current,
+	event = perf_event_create_kernel_counter(&attr, -1, current, NULL,
 						 kvm_perf_overflow, pmc);
 	if (IS_ERR(event)) {
 		pr_debug_ratelimited("kvm_pmu: event creation failed %ld for pmc->idx = %d\n",
diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index f951dc756456..b746381307c7 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -299,7 +299,7 @@ int intel_pmu_create_guest_lbr_event(struct kvm_vcpu *vcpu)
 	}
 
 	event = perf_event_create_kernel_counter(&attr, -1,
-						current, NULL, NULL);
+						current, NULL, NULL, NULL);
 	if (IS_ERR(event)) {
 		pr_debug_ratelimited("%s: failed %ld\n",
 					__func__, PTR_ERR(event));
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 0031f7b4d9ab..5f34e1d0bff8 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -1023,6 +1023,7 @@ extern struct perf_event *
 perf_event_create_kernel_counter(struct perf_event_attr *attr,
 				int cpu,
 				struct task_struct *task,
+				struct perf_event *group_leader,
 				perf_overflow_handler_t callback,
 				void *context);
 extern void perf_pmu_migrate_context(struct pmu *pmu,
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 7f04f995c975..f671b1a9a691 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -12674,12 +12674,14 @@ SYSCALL_DEFINE5(perf_event_open,
  * @attr: attributes of the counter to create
  * @cpu: cpu in which the counter is bound
  * @task: task to profile (NULL for percpu)
+ * @group_leader: event group leader
  * @overflow_handler: callback to trigger when we hit the event
  * @context: context data could be used in overflow_handler callback
  */
 struct perf_event *
 perf_event_create_kernel_counter(struct perf_event_attr *attr, int cpu,
 				 struct task_struct *task,
+				 struct perf_event *group_leader,
 				 perf_overflow_handler_t overflow_handler,
 				 void *context)
 {
@@ -12694,7 +12696,7 @@ perf_event_create_kernel_counter(struct perf_event_attr *attr, int cpu,
 	if (attr->aux_output)
 		return ERR_PTR(-EINVAL);
 
-	event = perf_event_alloc(attr, cpu, task, NULL, NULL,
+	event = perf_event_alloc(attr, cpu, task, group_leader, NULL,
 				 overflow_handler, context, -1);
 	if (IS_ERR(event)) {
 		err = PTR_ERR(event);
diff --git a/kernel/events/hw_breakpoint.c b/kernel/events/hw_breakpoint.c
index c3797701339c..65b5b1421e62 100644
--- a/kernel/events/hw_breakpoint.c
+++ b/kernel/events/hw_breakpoint.c
@@ -771,7 +771,7 @@ register_user_hw_breakpoint(struct perf_event_attr *attr,
 			    void *context,
 			    struct task_struct *tsk)
 {
-	return perf_event_create_kernel_counter(attr, -1, tsk, triggered,
+	return perf_event_create_kernel_counter(attr, -1, tsk, NULL, triggered,
 						context);
 }
 EXPORT_SYMBOL_GPL(register_user_hw_breakpoint);
@@ -881,7 +881,7 @@ register_wide_hw_breakpoint(struct perf_event_attr *attr,
 
 	cpus_read_lock();
 	for_each_online_cpu(cpu) {
-		bp = perf_event_create_kernel_counter(attr, cpu, NULL,
+		bp = perf_event_create_kernel_counter(attr, cpu, NULL, NULL,
 						      triggered, context);
 		if (IS_ERR(bp)) {
 			err = PTR_ERR(bp);
diff --git a/kernel/events/hw_breakpoint_test.c b/kernel/events/hw_breakpoint_test.c
index c57610f52bb4..b3597df12284 100644
--- a/kernel/events/hw_breakpoint_test.c
+++ b/kernel/events/hw_breakpoint_test.c
@@ -39,7 +39,7 @@ static struct perf_event *register_test_bp(int cpu, struct task_struct *tsk, int
 	attr.bp_addr = (unsigned long)&break_vars[idx];
 	attr.bp_len = HW_BREAKPOINT_LEN_1;
 	attr.bp_type = HW_BREAKPOINT_RW;
-	return perf_event_create_kernel_counter(&attr, cpu, tsk, NULL, NULL);
+	return perf_event_create_kernel_counter(&attr, cpu, tsk, NULL, NULL, NULL);
 }
 
 static void unregister_test_bp(struct perf_event **bp)
diff --git a/kernel/watchdog_hld.c b/kernel/watchdog_hld.c
index 247bf0b1582c..bb755dadba54 100644
--- a/kernel/watchdog_hld.c
+++ b/kernel/watchdog_hld.c
@@ -173,7 +173,7 @@ static int hardlockup_detector_event_create(void)
 	wd_attr->sample_period = hw_nmi_get_sample_period(watchdog_thresh);
 
 	/* Try to register using hardware perf events */
-	evt = perf_event_create_kernel_counter(wd_attr, cpu, NULL,
+	evt = perf_event_create_kernel_counter(wd_attr, cpu, NULL, NULL,
 					       watchdog_overflow_callback, NULL);
 	if (IS_ERR(evt)) {
 		pr_debug("Perf event create on CPU %d failed with %ld\n", cpu,
-- 
2.38.2


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH RFC 2/8] perf: x86/core: Expose the available number of the Topdown metrics
  2022-12-12 12:58 [PATCH RFC 0/8] KVM: x86/pmu: Enable Fixed Counter3 and Topdown Perf Metrics Like Xu
  2022-12-12 12:58 ` [PATCH RFC 1/8] perf/core: Add *group_leader to perf_event_create_kernel_counter() Like Xu
@ 2022-12-12 12:58 ` Like Xu
  2022-12-12 12:58 ` [PATCH RFC 3/8] perf: x86/core: Sync PERF_METRICS bit together with fixed counter3 Like Xu
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 13+ messages in thread
From: Like Xu @ 2022-12-12 12:58 UTC (permalink / raw)
  To: Peter Zijlstra, Sean Christopherson
  Cc: Paolo Bonzini, linux-kernel, kvm, linux-perf-users

From: Like Xu <likexu@tencent.com>

The Intel Sapphire Rapids server has 8 metrics events, while Intel Ice Lake
only supports 4 metrics events. The available number of Topdown metrics is
model specific, without an architectural hint.

To support guest Topdown metrics, KVM can only rely on the CPU model to
emulate the correct number of metric events on these platforms. It would
be nice to have the perf core tell KVM the available number of Topdown
metrics, just like x86_pmu.num_counters.
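
A minimal sketch of a consumer of the new field (illustrative only; the
pr_debug() usage and naming below are not part of this patch):

	struct x86_pmu_capability cap;

	perf_get_x86_pmu_capability(&cap);
	/* e.g. 4 on Ice Lake, 8 on Sapphire Rapids, 0 without Topdown metrics */
	pr_debug("available topdown events: %d\n", cap.num_topdown_events);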

Cc: Peter Zijlstra <peterz@infradead.org>
Cc: linux-perf-users@vger.kernel.org
Signed-off-by: Like Xu <likexu@tencent.com>
---
 arch/x86/events/core.c            | 1 +
 arch/x86/include/asm/perf_event.h | 1 +
 2 files changed, 2 insertions(+)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index b30b8bbcd1e2..d0d84c7a6876 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -3006,6 +3006,7 @@ void perf_get_x86_pmu_capability(struct x86_pmu_capability *cap)
 	 * which available for all cores.
 	 */
 	cap->num_counters_gp	= x86_pmu.num_counters;
+	cap->num_topdown_events = x86_pmu.num_topdown_events;
 	cap->num_counters_fixed	= x86_pmu.num_counters_fixed;
 	cap->bit_width_gp	= x86_pmu.cntval_bits;
 	cap->bit_width_fixed	= x86_pmu.cntval_bits;
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index 5d0f6891ae61..3e263d291595 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -219,6 +219,7 @@ struct x86_pmu_capability {
 	int		version;
 	int		num_counters_gp;
 	int		num_counters_fixed;
+	int		num_topdown_events;
 	int		bit_width_gp;
 	int		bit_width_fixed;
 	unsigned int	events_mask;
-- 
2.38.2


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH RFC 3/8] perf: x86/core: Sync PERF_METRICS bit together with fixed counter3
  2022-12-12 12:58 [PATCH RFC 0/8] KVM: x86/pmu: Enable Fixed Counter3 and Topdown Perf Metrics Like Xu
  2022-12-12 12:58 ` [PATCH RFC 1/8] perf/core: Add *group_leader to perf_event_create_kernel_counter() Like Xu
  2022-12-12 12:58 ` [PATCH RFC 2/8] perf: x86/core: Expose the available number of the Topdown metrics Like Xu
@ 2022-12-12 12:58 ` Like Xu
  2022-12-12 12:58 ` [PATCH RFC 4/8] KVM: x86/pmu: Add Intel CPUID-hinted Topdown Slots event Like Xu
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 13+ messages in thread
From: Like Xu @ 2022-12-12 12:58 UTC (permalink / raw)
  To: Peter Zijlstra, Sean Christopherson
  Cc: Paolo Bonzini, linux-kernel, kvm, linux-perf-users

From: Like Xu <likexu@tencent.com>

When the guest uses topdown (fixed counter 3 and the PERF_METRICS MSR),
the sharing rule for the PERF_METRICS bit in the GLOBAL_CTRL MSR does not
change; that is, it should be updated synchronously with fixed counter 3.
Considering that the guest topdown feature has only just been enabled,
this is not strictly a bug fix.

Cc: Peter Zijlstra <peterz@infradead.org>
Cc: linux-perf-users@vger.kernel.org
Signed-off-by: Like Xu <likexu@tencent.com>
---
 arch/x86/events/intel/core.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 1b92bf05fd65..e7897fd9f7ab 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -2436,6 +2436,8 @@ static void intel_pmu_disable_fixed(struct perf_event *event)
 		 */
 		if (*(u64 *)cpuc->active_mask & INTEL_PMC_OTHER_TOPDOWN_BITS(idx))
 			return;
+
+		intel_clear_masks(event, GLOBAL_CTRL_EN_PERF_METRICS);
 		idx = INTEL_PMC_IDX_FIXED_SLOTS;
 	}
 
@@ -2729,6 +2731,7 @@ static void intel_pmu_enable_fixed(struct perf_event *event)
 		if (*(u64 *)cpuc->active_mask & INTEL_PMC_OTHER_TOPDOWN_BITS(idx))
 			return;
 
+		intel_set_masks(event, GLOBAL_CTRL_EN_PERF_METRICS);
 		idx = INTEL_PMC_IDX_FIXED_SLOTS;
 	}
 
-- 
2.38.2


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH RFC 4/8] KVM: x86/pmu: Add Intel CPUID-hinted Topdown Slots event
  2022-12-12 12:58 [PATCH RFC 0/8] KVM: x86/pmu: Enable Fixed Counter3 and Topdown Perf Metrics Like Xu
                   ` (2 preceding siblings ...)
  2022-12-12 12:58 ` [PATCH RFC 3/8] perf: x86/core: Sync PERF_METRICS bit together with fixed counter3 Like Xu
@ 2022-12-12 12:58 ` Like Xu
  2022-12-12 12:58 ` [PATCH RFC 5/8] KVM: x86/pmu: Add kernel-defined slots event to enable Fixed Counter3 Like Xu
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 13+ messages in thread
From: Like Xu @ 2022-12-12 12:58 UTC (permalink / raw)
  To: Peter Zijlstra, Sean Christopherson; +Cc: Paolo Bonzini, linux-kernel, kvm

From: Like Xu <likexu@tencent.com>

This event counts the total number of available slots for an unhalted
logical processor. Software can use this event as the denominator for the
top-level metrics of the Top-down Microarchitecture Analysis method.

Although MSR_PERF_METRICS, which is required for the topdown metric events,
is not yet available in the guest, the data provided by the slots event
alone is sufficient for PMU users to perceive differences in CPU pipeline
machine width across microarchitectures.

The standalone slots event, like the instructions event, can be counted
with a GP counter or with fixed counter 3 (if any). As the last CPUID-hinted
Architectural Performance Event, its availability is also controlled by
CPUID.0AH.EBX. On Linux, perf users may encode "-e cpu/event=0xa4,
umask=0x01/" or "-e cpu/slots/" to count slots events.

Signed-off-by: Like Xu <likexu@tencent.com>
---
 arch/x86/kvm/vmx/pmu_intel.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index b746381307c7..d86a6ba8c3f9 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -31,10 +31,11 @@
  * 4 - PERF_COUNT_HW_CACHE_MISSES
  * 5 - PERF_COUNT_HW_BRANCH_INSTRUCTIONS
  * 6 - PERF_COUNT_HW_BRANCH_MISSES
+ * 7 - CPUID-hinted Topdown Slots event (available on gp counter)
  *
  * the second part of hw_events is defined by the generic kernel perf:
  *
- * 7 - PERF_COUNT_HW_REF_CPU_CYCLES
+ * 8 - PERF_COUNT_HW_REF_CPU_CYCLES
  */
 static struct kvm_pmu_hw_event intel_arch_events[] = {
 	[0] = { 0x3c, 0x00 },
@@ -44,12 +45,13 @@ static struct kvm_pmu_hw_event intel_arch_events[] = {
 	[4] = { 0x2e, 0x41 },
 	[5] = { 0xc4, 0x00 },
 	[6] = { 0xc5, 0x00 },
+	[7] = { 0xa4, 0x01 },
 	/* The above index must match CPUID 0x0A.EBX bit vector */
-	[7] = { 0x00, 0x03 },
+	[8] = { 0x00, 0x03 },
 };
 
 /* mapping between fixed pmc index and intel_arch_events array */
-static int fixed_pmc_events[] = {1, 0, 7};
+static int fixed_pmc_events[] = {1, 0, 8};
 
 static void reprogram_fixed_counters(struct kvm_pmu *pmu, u64 data)
 {
@@ -109,7 +111,7 @@ static bool intel_hw_event_available(struct kvm_pmc *pmc)
 			continue;
 
 		/* disable event that reported as not present by cpuid */
-		if ((i < 7) && !(pmu->available_event_types & (1 << i)))
+		if (i < 8 && !(pmu->available_event_types & (1 << i)))
 			return false;
 
 		break;
-- 
2.38.2


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH RFC 5/8] KVM: x86/pmu: Add kernel-defined slots event to enable Fixed Counter3
  2022-12-12 12:58 [PATCH RFC 0/8] KVM: x86/pmu: Enable Fixed Counter3 and Topdown Perf Metrics Like Xu
                   ` (3 preceding siblings ...)
  2022-12-12 12:58 ` [PATCH RFC 4/8] KVM: x86/pmu: Add Intel CPUID-hinted Topdown Slots event Like Xu
@ 2022-12-12 12:58 ` Like Xu
  2022-12-12 12:58 ` [PATCH RFC 6/8] KVM: x86/pmu: properly use INTEL_PMC_FIXED_RDPMC_BASE macro Like Xu
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 13+ messages in thread
From: Like Xu @ 2022-12-12 12:58 UTC (permalink / raw)
  To: Peter Zijlstra, Sean Christopherson; +Cc: Paolo Bonzini, linux-kernel, kvm

From: Like Xu <likexu@tencent.com>

The Topdown Slots event can be enabled on a GP counter or on fixed
counter 3, and it does not differ from the other fixed counters in terms
of counting and sampling modes (except for the hardware logic for event
accumulation).

According to commit 6017608936c1 ("perf/x86/intel: Add Icelake support"),
KVM or any in-kernel perf user needs to reprogram fixed counter 3 via the
kernel-defined Topdown Slots event to get the real fixed counter 3 on the
host.

Opportunistically fix a typo, s/msrs_to_saved_all/msrs_to_save_all/.

Signed-off-by: Like Xu <likexu@tencent.com>
---
 arch/x86/include/asm/kvm_host.h | 2 +-
 arch/x86/kvm/vmx/pmu_intel.c    | 4 +++-
 arch/x86/kvm/x86.c              | 6 +++---
 3 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index aa4eb8cfcd7e..413f2e104543 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -513,7 +513,7 @@ struct kvm_pmc {
 #define KVM_INTEL_PMC_MAX_GENERIC	8
 #define MSR_ARCH_PERFMON_PERFCTR_MAX	(MSR_ARCH_PERFMON_PERFCTR0 + KVM_INTEL_PMC_MAX_GENERIC - 1)
 #define MSR_ARCH_PERFMON_EVENTSEL_MAX	(MSR_ARCH_PERFMON_EVENTSEL0 + KVM_INTEL_PMC_MAX_GENERIC - 1)
-#define KVM_PMC_MAX_FIXED	3
+#define KVM_PMC_MAX_FIXED	4
 #define KVM_AMD_PMC_MAX_GENERIC	6
 struct kvm_pmu {
 	unsigned nr_arch_gp_counters;
diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index d86a6ba8c3f9..637fd709f5f4 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -36,6 +36,7 @@
  * the second part of hw_events is defined by the generic kernel perf:
  *
  * 8 - PERF_COUNT_HW_REF_CPU_CYCLES
+ * 9 - Kernel-defined Topdown Slots event (available on fixed counter 3)
  */
 static struct kvm_pmu_hw_event intel_arch_events[] = {
 	[0] = { 0x3c, 0x00 },
@@ -48,10 +49,11 @@ static struct kvm_pmu_hw_event intel_arch_events[] = {
 	[7] = { 0xa4, 0x01 },
 	/* The above index must match CPUID 0x0A.EBX bit vector */
 	[8] = { 0x00, 0x03 },
+	[9] = { 0x00, 0x04 },
 };
 
 /* mapping between fixed pmc index and intel_arch_events array */
-static int fixed_pmc_events[] = {1, 0, 8};
+static int fixed_pmc_events[] = {1, 0, 8, 9};
 
 static void reprogram_fixed_counters(struct kvm_pmu *pmu, u64 data)
 {
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 312aea1854ae..0b61cb58c877 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1435,7 +1435,7 @@ static const u32 msrs_to_save_all[] = {
 	MSR_IA32_UMWAIT_CONTROL,
 
 	MSR_ARCH_PERFMON_FIXED_CTR0, MSR_ARCH_PERFMON_FIXED_CTR1,
-	MSR_ARCH_PERFMON_FIXED_CTR0 + 2,
+	MSR_ARCH_PERFMON_FIXED_CTR0 + 2, MSR_ARCH_PERFMON_FIXED_CTR0 + 3,
 	MSR_CORE_PERF_FIXED_CTR_CTRL, MSR_CORE_PERF_GLOBAL_STATUS,
 	MSR_CORE_PERF_GLOBAL_CTRL, MSR_CORE_PERF_GLOBAL_OVF_CTRL,
 	MSR_IA32_PEBS_ENABLE, MSR_IA32_DS_AREA, MSR_PEBS_DATA_CFG,
@@ -7001,8 +7001,8 @@ static void kvm_init_msr_list(void)
 	u32 dummy[2];
 	unsigned i;
 
-	BUILD_BUG_ON_MSG(KVM_PMC_MAX_FIXED != 3,
-			 "Please update the fixed PMCs in msrs_to_saved_all[]");
+	BUILD_BUG_ON_MSG(KVM_PMC_MAX_FIXED != 4,
+			 "Please update the fixed PMCs in msrs_to_save_all[]");
 
 	num_msrs_to_save = 0;
 	num_emulated_msrs = 0;
-- 
2.38.2


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH RFC 6/8] KVM: x86/pmu: properly use INTEL_PMC_FIXED_RDPMC_BASE macro
  2022-12-12 12:58 [PATCH RFC 0/8] KVM: x86/pmu: Enable Fixed Counter3 and Topdown Perf Metrics Like Xu
                   ` (4 preceding siblings ...)
  2022-12-12 12:58 ` [PATCH RFC 5/8] KVM: x86/pmu: Add kernel-defined slots event to enable Fixed Counter3 Like Xu
@ 2022-12-12 12:58 ` Like Xu
  2022-12-12 12:58 ` [PATCH RFC 7/8] KVM: x86/pmu: Use flex *event arrays to implement grouped events Like Xu
  2022-12-12 12:58 ` [PATCH RFC 8/8] KVM: x86/pmu: Add MSR_PERF_METRICS MSR emulation to enable Topdown Like Xu
  7 siblings, 0 replies; 13+ messages in thread
From: Like Xu @ 2022-12-12 12:58 UTC (permalink / raw)
  To: Peter Zijlstra, Sean Christopherson; +Cc: Paolo Bonzini, linux-kernel, kvm

From: Like Xu <likexu@tencent.com>

Use the INTEL_PMC_FIXED_RDPMC_BASE macro in the RDPMC context to
improve readability. No functional change intended.

Signed-off-by: Like Xu <likexu@tencent.com>
---
 arch/x86/kvm/vmx/pmu_intel.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index 637fd709f5f4..b69d337d51d9 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -136,7 +136,7 @@ static bool intel_pmc_is_enabled(struct kvm_pmc *pmc)
 static bool intel_is_valid_rdpmc_ecx(struct kvm_vcpu *vcpu, unsigned int idx)
 {
 	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
-	bool fixed = idx & (1u << 30);
+	bool fixed = idx & INTEL_PMC_FIXED_RDPMC_BASE;
 
 	idx &= ~(3u << 30);
 
@@ -148,7 +148,7 @@ static struct kvm_pmc *intel_rdpmc_ecx_to_pmc(struct kvm_vcpu *vcpu,
 					    unsigned int idx, u64 *mask)
 {
 	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
-	bool fixed = idx & (1u << 30);
+	bool fixed = idx & INTEL_PMC_FIXED_RDPMC_BASE;
 	struct kvm_pmc *counters;
 	unsigned int num_counters;
 
-- 
2.38.2


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH RFC 7/8] KVM: x86/pmu: Use flex *event arrays to implement grouped events
  2022-12-12 12:58 [PATCH RFC 0/8] KVM: x86/pmu: Enable Fixed Counter3 and Topdown Perf Metrics Like Xu
                   ` (5 preceding siblings ...)
  2022-12-12 12:58 ` [PATCH RFC 6/8] KVM: x86/pmu: properly use INTEL_PMC_FIXED_RDPMC_BASE macro Like Xu
@ 2022-12-12 12:58 ` Like Xu
  2022-12-12 12:58 ` [PATCH RFC 8/8] KVM: x86/pmu: Add MSR_PERF_METRICS MSR emulation to enable Topdown Like Xu
  7 siblings, 0 replies; 13+ messages in thread
From: Like Xu @ 2022-12-12 12:58 UTC (permalink / raw)
  To: Peter Zijlstra, Sean Christopherson; +Cc: Paolo Bonzini, linux-kernel, kvm

From: Like Xu <likexu@tencent.com>

The current vPMU expects that a pmc represents a single hardware event
and uses pmc->counter to track the event value.

But from the perspective of the counter designer, a hardware counter can
choose to sacrifice its maximum bit width and instead reflect the values
of multiple hardware events in different fields of the same counter.

For example, a 32-bit counter can be divided into four 8-bit wide hardware
counters, with each hardware event incrementing independently in its own
bitfield. Software can then read the counter once to get the latest values
of multiple hardware events, reducing the number of software accesses to
the hardware interface and, in virtualization in particular, the number
of vm-exits.
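
A minimal sketch of the software-side read (the field layout and the
read_counter() helper are purely illustrative; the real layout is defined
by the hardware vendor):

	u32 value = read_counter();	/* one hardware access */
	u8 ev0 = value & 0xff;		/* sub-event 0 */
	u8 ev1 = (value >> 8) & 0xff;	/* sub-event 1 */
	u8 ev2 = (value >> 16) & 0xff;	/* sub-event 2 */
	u8 ev3 = (value >> 24) & 0xff;	/* sub-event 3 */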

The most natural way to emulate this all-in-one hardware counter in KVM is
a 1-to-N mapping, i.e. a pmc can be associated with multiple perf_events;
these events are created as a group and scheduled by host perf as a group
to obtain hardware resources and present them to the guest.

In the implementation, when the guest accesses this all-in-one counter,
its pmc->max_nr_events changes according to the hardware definition,
triggering KVM's group-event creation path: the events are created
centrally and then enabled in order, which eliminates the code differences
of separate enablement. The grouped events are also released as a group.

Which hardware events correspond to each pmc, and how the available
bitfields are divided, is predefined by the hardware vendor; KVM does not
freely combine them. A pmc backed by multiple events is no different from
a one-to-one pmc in many cases. The first pmc->perf_event is always the
source of the pmc->counter value, which implies that most functions that
manipulate the state of pmc->perf_event do not need to be modified.

Note, the specific rules for splicing multiple perf_event->count values
into a single value are defined by the hardware event and do not (and
should not) appear in this generic change.

Signed-off-by: Like Xu <likexu@tencent.com>
---
 arch/x86/include/asm/kvm_host.h |  6 ++++-
 arch/x86/kvm/pmu.c              | 44 ++++++++++++++++++++++++++-------
 arch/x86/kvm/pmu.h              | 15 ++++++++---
 arch/x86/kvm/svm/pmu.c          |  1 +
 arch/x86/kvm/vmx/pmu_intel.c    |  2 ++
 5 files changed, 54 insertions(+), 14 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 413f2e104543..73740aa891d3 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -494,12 +494,12 @@ enum pmc_type {
 struct kvm_pmc {
 	enum pmc_type type;
 	u8 idx;
+	u8 max_nr_events;
 	bool is_paused;
 	bool intr;
 	u64 counter;
 	u64 prev_counter;
 	u64 eventsel;
-	struct perf_event *perf_event;
 	struct kvm_vcpu *vcpu;
 	/*
 	 * only for creating or reusing perf_event,
@@ -507,6 +507,10 @@ struct kvm_pmc {
 	 * ctrl value for fixed counters.
 	 */
 	u64 current_config;
+	union {
+		struct perf_event *perf_event;
+		DECLARE_FLEX_ARRAY(struct perf_event *, perf_events);
+	};
 };
 
 /* More counters may conflict with other existing Architectural MSRs */
diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index f6c8180241d7..ae53a8298dec 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -160,7 +160,7 @@ static int pmc_reprogram_counter(struct kvm_pmc *pmc, u32 type, u64 config,
 				 bool intr)
 {
 	struct kvm_pmu *pmu = pmc_to_pmu(pmc);
-	struct perf_event *event;
+	struct perf_event *event, *group_leader;
 	struct perf_event_attr attr = {
 		.type = type,
 		.size = sizeof(attr),
@@ -172,6 +172,7 @@ static int pmc_reprogram_counter(struct kvm_pmc *pmc, u32 type, u64 config,
 		.config = config,
 	};
 	bool pebs = test_bit(pmc->idx, (unsigned long *)&pmu->pebs_enable);
+	unsigned int i;
 
 	attr.sample_period = get_sample_period(pmc, pmc->counter);
 
@@ -204,18 +205,39 @@ static int pmc_reprogram_counter(struct kvm_pmc *pmc, u32 type, u64 config,
 			attr.precise_ip = 3;
 	}
 
-	event = perf_event_create_kernel_counter(&attr, -1, current, NULL,
-						 kvm_perf_overflow, pmc);
-	if (IS_ERR(event)) {
-		pr_debug_ratelimited("kvm_pmu: event creation failed %ld for pmc->idx = %d\n",
-			    PTR_ERR(event), pmc->idx);
-		return PTR_ERR(event);
+	/*
+	 * To create grouped events, the first created perf_event doesn't
+	 * know it will be the group_leader and may move to an unexpected
+	 * enabling path, thus delay all enablement until after creation,
+	 * not affecting non-grouped events to save one perf interface call.
+	 */
+	if (pmc->max_nr_events > 1)
+		attr.disabled = 1;
+
+	for (i = 0; i < pmc->max_nr_events; i++) {
+		group_leader = i ? pmc->perf_event : NULL;
+		event = perf_event_create_kernel_counter(&attr, -1, current, group_leader,
+							 kvm_perf_overflow, pmc);
+		if (IS_ERR(event)) {
+			pr_debug_ratelimited("kvm_pmu: event creation failed %ld for pmc->idx = %d\n",
+					     PTR_ERR(event), pmc->idx);
+			return PTR_ERR(event);
+		}
+
+		pmc->perf_events[i] = event;
+		pmc_to_pmu(pmc)->event_count++;
 	}
 
-	pmc->perf_event = event;
-	pmc_to_pmu(pmc)->event_count++;
 	pmc->is_paused = false;
 	pmc->intr = intr || pebs;
+
+	if (!attr.disabled)
+		return 0;
+
+	/* Enable grouped events in order. */
+	for (i = 0; i < pmc->max_nr_events; i++)
+		perf_event_enable(pmc->perf_events[i]);
+
 	return 0;
 }
 
@@ -223,6 +245,10 @@ static void pmc_pause_counter(struct kvm_pmc *pmc)
 {
 	u64 counter = pmc->counter;
 
+	/* The perf_event_pause() is not suitable for grouped events. */
+	if (pmc->max_nr_events > 1)
+		pmc_stop_counter(pmc);
+
 	if (!pmc->perf_event || pmc->is_paused)
 		return;
 
diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
index b9e29a199ab8..e4b738b7c208 100644
--- a/arch/x86/kvm/pmu.h
+++ b/arch/x86/kvm/pmu.h
@@ -64,12 +64,19 @@ static inline u64 pmc_read_counter(struct kvm_pmc *pmc)
 
 static inline void pmc_release_perf_event(struct kvm_pmc *pmc)
 {
-	if (pmc->perf_event) {
-		perf_event_release_kernel(pmc->perf_event);
-		pmc->perf_event = NULL;
-		pmc->current_config = 0;
+	unsigned int i;
+
+	for (i = 0; i < pmc->max_nr_events; i++) {
+		if (!pmc->perf_events[i])
+			continue;
+
+		perf_event_release_kernel(pmc->perf_events[i]);
+		pmc->perf_events[i] = NULL;
 		pmc_to_pmu(pmc)->event_count--;
 	}
+
+	pmc->current_config = 0;
+	pmc->max_nr_events = 1;
 }
 
 static inline void pmc_stop_counter(struct kvm_pmc *pmc)
diff --git a/arch/x86/kvm/svm/pmu.c b/arch/x86/kvm/svm/pmu.c
index 0e313fbae055..09e81200f657 100644
--- a/arch/x86/kvm/svm/pmu.c
+++ b/arch/x86/kvm/svm/pmu.c
@@ -200,6 +200,7 @@ static void amd_pmu_init(struct kvm_vcpu *vcpu)
 		pmu->gp_counters[i].vcpu = vcpu;
 		pmu->gp_counters[i].idx = i;
 		pmu->gp_counters[i].current_config = 0;
+		pmu->gp_counters[i].max_nr_events = 1;
 	}
 }
 
diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index b69d337d51d9..8e1f679d4d9d 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -642,6 +642,7 @@ static void intel_pmu_init(struct kvm_vcpu *vcpu)
 		pmu->gp_counters[i].vcpu = vcpu;
 		pmu->gp_counters[i].idx = i;
 		pmu->gp_counters[i].current_config = 0;
+		pmu->gp_counters[i].max_nr_events = 1;
 	}
 
 	for (i = 0; i < KVM_PMC_MAX_FIXED; i++) {
@@ -649,6 +650,7 @@ static void intel_pmu_init(struct kvm_vcpu *vcpu)
 		pmu->fixed_counters[i].vcpu = vcpu;
 		pmu->fixed_counters[i].idx = i + INTEL_PMC_IDX_FIXED;
 		pmu->fixed_counters[i].current_config = 0;
+		pmu->fixed_counters[i].max_nr_events = 1;
 	}
 
 	lbr_desc->records.nr = 0;
-- 
2.38.2


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH RFC 8/8] KVM: x86/pmu: Add MSR_PERF_METRICS MSR emulation to enable Topdown
  2022-12-12 12:58 [PATCH RFC 0/8] KVM: x86/pmu: Enable Fixed Counter3 and Topdown Perf Metrics Like Xu
                   ` (6 preceding siblings ...)
  2022-12-12 12:58 ` [PATCH RFC 7/8] KVM: x86/pmu: Use flex *event arrays to implement grouped events Like Xu
@ 2022-12-12 12:58 ` Like Xu
  7 siblings, 0 replies; 13+ messages in thread
From: Like Xu @ 2022-12-12 12:58 UTC (permalink / raw)
  To: Peter Zijlstra, Sean Christopherson; +Cc: Paolo Bonzini, linux-kernel, kvm

From: Like Xu <likexu@tencent.com>

The Ice Lake core PMU provides built-in support for Top-down Micro-
architecture Analysis (TMA) method level 1 metrics. These metrics are
always available to cross-validate performance observations, freeing
general purpose counters to count other events in high counter
utilization scenarios.

A new MSR called MSR_PERF_METRICS, hinted by PERF_CAPABILITIES.
PERF_METRICS_AVAILABLE (bit 15), reports four (on ICX) TMA Level 1
metrics directly. The lower 32 bits are divided into four 8-bit fields,
each of which is an integer fraction of 255. When the RDPMC index uses
type 2000H (INTEL_PMC_FIXED_RDPMC_METRICS), RDPMC can read the value
of PERF_METRICS.
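
For illustration only (perf_metrics_value and the example value 106 are
invented), decoding one 8-bit field of the low 32 bits would look like:

	u32 metrics = (u32)perf_metrics_value;	/* hypothetical MSR value */
	u8 retiring = metrics & 0xff;		/* first 8-bit field */
	/* each field is an integer fraction of 255, i.e. field/255 of all slots */
	u32 retiring_percent = retiring * 100 / 255;	/* e.g. 106 -> ~41% */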

Bit EN_PERF_METRICS[48] has also been added to the following MSRs:
- MSR_CORE_PERF_GLOBAL_CTRL
- MSR_CORE_PERF_GLOBAL_STATUS
- MSR_CORE_PERF_GLOBAL_OVF_CTRL

In the KVM implementation, topdown mode is only enabled when the guest
starts both registers, PERF_METRICS and fixed counter 3, from zero.

In topdown mode, the vPMU creates a group of zero-sample-period events
for fixed counter 3. The first pmc->perf_event is the group-leader slots
event, and its event->count is used to emulate the counter value, setting
bit 35 when it generates an interrupt, as usual. (Note, fixed counter 3
can still be used independently, for counting or sampling, but not in
topdown mode at the same time.)

The four (or more) sub-events for PERF_METRICS are each emulated with a
metric event (its event attr.config is kernel-defined, starting from
INTEL_TD_METRIC_RETIRING); they increment independently and share the
same bit, INTEL_PERF_METRICS_IDX, on overflow.

When pmu->perf_metrics is read, the vPMU collects event->count from the
metric events, but those values have already been processed by the host
perf core (in icl_get_metrics_event_value(), with a multiplication and a
division). This part of the calculation needs to be reversed in the vPMU
to restore the raw field values, which are then stitched together in turn
to form pmu->perf_metrics.
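
Roughly speaking (a simplification; the authoritative host-side formula is
icl_get_metrics_event_value()), the relationship being inverted is:

	host-reported count  ~=  slots * field / 0xff
	recovered field      ~=  count * 0xff / slots

which is what the mul_u64_u64_div_u64(delta, 0xff, slots) call in
pmu_read_perf_metrics() below implements.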

pmc_read_counter() has been enhanced, moved out of the header file, and
exported with EXPORT_SYMBOL_GPL. A little trick is that the PMCs
corresponding to bit 35 and bit 48 (INTEL_PERF_METRICS_IDX) are both fixed
counter 3, so that all of the grouped events above can be lazily released
together.

Signed-off-by: Like Xu <likexu@tencent.com>
---
 arch/x86/include/asm/kvm_host.h  |   6 ++
 arch/x86/include/asm/msr-index.h |   1 +
 arch/x86/kvm/pmu.c               | 105 ++++++++++++++++++++++++++++++-
 arch/x86/kvm/pmu.h               |  16 ++---
 arch/x86/kvm/vmx/pmu_intel.c     |  33 ++++++++++
 arch/x86/kvm/vmx/vmx.c           |   3 +
 arch/x86/kvm/x86.c               |   3 +
 7 files changed, 153 insertions(+), 14 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 73740aa891d3..4f2e2ede09b6 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -522,6 +522,7 @@ struct kvm_pmc {
 struct kvm_pmu {
 	unsigned nr_arch_gp_counters;
 	unsigned nr_arch_fixed_counters;
+	unsigned int nr_perf_metrics;
 	unsigned available_event_types;
 	u64 fixed_ctr_ctrl;
 	u64 fixed_ctr_ctrl_mask;
@@ -564,6 +565,11 @@ struct kvm_pmu {
 	 */
 	u64 host_cross_mapped_mask;
 
+	/* Intel Topdown Performance Metrics */
+	u64 perf_metrics;
+	u64 perf_metrics_mask;
+	bool metric_event_ovf;
+
 	/*
 	 * The gate to release perf_events not marked in
 	 * pmc_in_use only once in a vcpu time slice.
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 4a2af82553e4..5dde0242ff28 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -246,6 +246,7 @@
 #define PERF_CAP_PEBS_BASELINE         BIT_ULL(14)
 #define PERF_CAP_PEBS_MASK	(PERF_CAP_PEBS_TRAP | PERF_CAP_ARCH_REG | \
 				 PERF_CAP_PEBS_FORMAT | PERF_CAP_PEBS_BASELINE)
+#define MSR_CAP_PERF_METRICS           BIT_ULL(15)
 
 #define MSR_IA32_RTIT_CTL		0x00000570
 #define RTIT_CTL_TRACEEN		BIT(0)
diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index ae53a8298dec..4bc888462b4f 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -96,6 +96,11 @@ static void kvm_pmi_trigger_fn(struct irq_work *irq_work)
 	kvm_pmu_deliver_pmi(vcpu);
 }
 
+static inline bool pmc_support_perf_metrics(struct kvm_pmc *pmc)
+{
+	return pmc->idx == (INTEL_PMC_IDX_FIXED + 3);
+}
+
 static inline void __kvm_perf_overflow(struct kvm_pmc *pmc, bool in_pmi)
 {
 	struct kvm_pmu *pmu = pmc_to_pmu(pmc);
@@ -115,6 +120,10 @@ static inline void __kvm_perf_overflow(struct kvm_pmc *pmc, bool in_pmi)
 			skip_pmi = __test_and_set_bit(GLOBAL_STATUS_BUFFER_OVF_BIT,
 						      (unsigned long *)&pmu->global_status);
 		}
+	} else if (pmu->metric_event_ovf && pmc_support_perf_metrics(pmc)) {
+		/* At least one of PERF_METRICS sub-counter has overflowed */
+		__set_bit(INTEL_PERF_METRICS_IDX, (unsigned long *)&pmu->global_status);
+		pmu->metric_event_ovf = false;
 	} else {
 		__set_bit(pmc->idx, (unsigned long *)&pmu->global_status);
 	}
@@ -155,6 +164,16 @@ static void kvm_perf_overflow(struct perf_event *perf_event,
 	kvm_make_request(KVM_REQ_PMU, pmc->vcpu);
 }
 
+static inline bool perf_metrics_is_enabled(struct kvm_pmc *pmc)
+{
+	struct kvm_pmu *pmu = pmc_to_pmu(pmc);
+
+	return !pmu->perf_metrics && !pmc->counter &&
+		pmc_support_perf_metrics(pmc) && pmc_speculative_in_use(pmc) &&
+		pmc_is_enabled(pmc) && test_bit(GLOBAL_CTRL_EN_PERF_METRICS,
+						(unsigned long *)&pmu->global_ctrl);
+}
+
 static int pmc_reprogram_counter(struct kvm_pmc *pmc, u32 type, u64 config,
 				 bool exclude_user, bool exclude_kernel,
 				 bool intr)
@@ -172,6 +191,7 @@ static int pmc_reprogram_counter(struct kvm_pmc *pmc, u32 type, u64 config,
 		.config = config,
 	};
 	bool pebs = test_bit(pmc->idx, (unsigned long *)&pmu->pebs_enable);
+	bool topdown = perf_metrics_is_enabled(pmc);
 	unsigned int i;
 
 	attr.sample_period = get_sample_period(pmc, pmc->counter);
@@ -205,6 +225,10 @@ static int pmc_reprogram_counter(struct kvm_pmc *pmc, u32 type, u64 config,
 			attr.precise_ip = 3;
 	}
 
+	/* A group of events is needed to emulate each perf_metric. */
+	if (topdown)
+		pmc->max_nr_events = pmu->nr_perf_metrics + 1;
+
 	/*
 	 * To create grouped events, the first created perf_event doesn't
 	 * know it will be the group_leader and may move to an unexpected
@@ -215,6 +239,17 @@ static int pmc_reprogram_counter(struct kvm_pmc *pmc, u32 type, u64 config,
 		attr.disabled = 1;
 
 	for (i = 0; i < pmc->max_nr_events; i++) {
+		if (topdown) {
+			/*
+			 * According to perf core, the group_leader slots event must
+			 * not be a sampling event for topdown metric use, and the
+			 * topdown metric events don't support sampling.
+			 */
+			attr.sample_period = 0;
+			if (i)
+				attr.config = INTEL_TD_METRIC_RETIRING + 0x100 * (i - 1);
+		}
+
 		group_leader = i ? pmc->perf_event : NULL;
 		event = perf_event_create_kernel_counter(&attr, -1, current, group_leader,
 							 kvm_perf_overflow, pmc);
@@ -229,7 +264,7 @@ static int pmc_reprogram_counter(struct kvm_pmc *pmc, u32 type, u64 config,
 	}
 
 	pmc->is_paused = false;
-	pmc->intr = intr || pebs;
+	pmc->intr = intr || pebs || topdown;
 
 	if (!attr.disabled)
 		return 0;
@@ -263,6 +298,9 @@ static bool pmc_resume_counter(struct kvm_pmc *pmc)
 	if (!pmc->perf_event)
 		return false;
 
+	if (perf_metrics_is_enabled(pmc) == (pmc->max_nr_events == 1))
+		return false;
+
 	/* recalibrate sample period and check if it's accepted by perf core */
 	if (is_sampling_event(pmc->perf_event) &&
 	    perf_event_period(pmc->perf_event,
@@ -447,6 +485,60 @@ static int kvm_pmu_rdpmc_vmware(struct kvm_vcpu *vcpu, unsigned idx, u64 *data)
 	return 0;
 }
 
+void pmu_read_perf_metrics(struct kvm_pmu *pmu, u64 slots)
+{
+	struct kvm_pmc *pmc = &pmu->fixed_counters[3];
+	u64 old_counter, enabled, running, delta;
+	union {
+		u8 field[KVM_PERF_METRICS_NUM_MAX];
+		u64	data;
+	} perf_metrics;
+	unsigned int i;
+
+	perf_metrics.data = pmu->perf_metrics;
+	for (i = 1; i < pmc->max_nr_events; i++) {
+		if (!pmc->perf_events[i])
+			continue;
+
+		old_counter = perf_metrics.field[i - 1];
+		delta = perf_event_read_value(pmc->perf_events[i], &enabled, &running);
+
+		/*
+		 * Reverse the actual metric counter value out
+		 * according to icl_get_metrics_event_value().
+		 */
+		delta = mul_u64_u64_div_u64(delta, 0xff, slots) + 1;
+		perf_metrics.field[i - 1] = 0xff & (old_counter + delta);
+
+		/* Check if any metric counter have been overflowed. */
+		if (perf_metrics.field[i - 1] < old_counter)
+			pmu->metric_event_ovf = true;
+	}
+
+	if (pmu->metric_event_ovf)
+		__kvm_perf_overflow(pmc, false);
+
+	pmu->perf_metrics = perf_metrics.data & (~pmu->perf_metrics_mask);
+	__set_bit(INTEL_PERF_METRICS_IDX, pmu->pmc_in_use);
+}
+
+u64 pmc_read_counter(struct kvm_pmc *pmc)
+{
+	u64 counter, enabled, running, delta;
+
+	counter = pmc->counter;
+	if (pmc->perf_event && !pmc->is_paused) {
+		delta = perf_event_read_value(pmc->perf_event, &enabled, &running);
+		if (delta && pmc_support_perf_metrics(pmc))
+			pmu_read_perf_metrics(pmc_to_pmu(pmc), delta);
+		counter += delta;
+	}
+
+	/* FIXME: Scaling needed? */
+	return counter & pmc_bitmask(pmc);
+}
+EXPORT_SYMBOL_GPL(pmc_read_counter);
+
 int kvm_pmu_rdpmc(struct kvm_vcpu *vcpu, unsigned idx, u64 *data)
 {
 	bool fast_mode = idx & (1u << 31);
@@ -469,7 +561,16 @@ int kvm_pmu_rdpmc(struct kvm_vcpu *vcpu, unsigned idx, u64 *data)
 	    (kvm_read_cr0(vcpu) & X86_CR0_PE))
 		return 1;
 
-	*data = pmc_read_counter(pmc) & mask;
+	if (idx & INTEL_PMC_FIXED_RDPMC_METRICS) {
+		if (!(vcpu->arch.perf_capabilities & MSR_CAP_PERF_METRICS))
+			return 1;
+
+		pmc_read_counter(&pmu->fixed_counters[3]);
+		*data = pmu->perf_metrics;
+	} else {
+		*data = pmc_read_counter(pmc) & mask;
+	}
+
 	return 0;
 }
 
diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
index e4b738b7c208..be800a0c5366 100644
--- a/arch/x86/kvm/pmu.h
+++ b/arch/x86/kvm/pmu.h
@@ -10,6 +10,8 @@
 
 #define MSR_IA32_MISC_ENABLE_PMU_RO_MASK (MSR_IA32_MISC_ENABLE_PEBS_UNAVAIL |	\
 					  MSR_IA32_MISC_ENABLE_BTS_UNAVAIL)
+#define INTEL_PERF_METRICS_IDX GLOBAL_CTRL_EN_PERF_METRICS
+#define KVM_PERF_METRICS_NUM_MAX	8
 
 /* retrieve the 4 bits for EN and PMI out of IA32_FIXED_CTR_CTRL */
 #define fixed_ctrl_field(ctrl_reg, idx) (((ctrl_reg) >> ((idx)*4)) & 0xf)
@@ -42,6 +44,8 @@ struct kvm_pmu_ops {
 };
 
 void kvm_pmu_ops_update(const struct kvm_pmu_ops *pmu_ops);
+void pmu_read_perf_metrics(struct kvm_pmu *pmu, u64 slots);
+u64 pmc_read_counter(struct kvm_pmc *pmc);
 
 static inline u64 pmc_bitmask(struct kvm_pmc *pmc)
 {
@@ -50,18 +54,6 @@ static inline u64 pmc_bitmask(struct kvm_pmc *pmc)
 	return pmu->counter_bitmask[pmc->type];
 }
 
-static inline u64 pmc_read_counter(struct kvm_pmc *pmc)
-{
-	u64 counter, enabled, running;
-
-	counter = pmc->counter;
-	if (pmc->perf_event && !pmc->is_paused)
-		counter += perf_event_read_value(pmc->perf_event,
-						 &enabled, &running);
-	/* FIXME: Scaling needed? */
-	return counter & pmc_bitmask(pmc);
-}
-
 static inline void pmc_release_perf_event(struct kvm_pmc *pmc)
 {
 	unsigned int i;
diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index 8e1f679d4d9d..52fef388fdb0 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -81,6 +81,8 @@ static struct kvm_pmc *intel_pmc_idx_to_pmc(struct kvm_pmu *pmu, int pmc_idx)
 	if (pmc_idx < INTEL_PMC_IDX_FIXED) {
 		return get_gp_pmc(pmu, MSR_P6_EVNTSEL0 + pmc_idx,
 				  MSR_P6_EVNTSEL0);
+	} else if (pmc_idx == INTEL_PERF_METRICS_IDX) {
+		return get_fixed_pmc(pmu, MSR_CORE_PERF_FIXED_CTR3);
 	} else {
 		u32 idx = pmc_idx - INTEL_PMC_IDX_FIXED;
 
@@ -160,6 +162,12 @@ static struct kvm_pmc *intel_rdpmc_ecx_to_pmc(struct kvm_vcpu *vcpu,
 		counters = pmu->gp_counters;
 		num_counters = pmu->nr_arch_gp_counters;
 	}
+
+	if (idx & INTEL_PMC_FIXED_RDPMC_METRICS) {
+		fixed = true;
+		idx = 3;
+	}
+
 	if (idx >= num_counters)
 		return NULL;
 	*mask &= pmu->counter_bitmask[fixed ? KVM_PMC_FIXED : KVM_PMC_GP];
@@ -229,6 +237,9 @@ static bool intel_is_valid_msr(struct kvm_vcpu *vcpu, u32 msr)
 		ret = (perf_capabilities & PERF_CAP_PEBS_BASELINE) &&
 			((perf_capabilities & PERF_CAP_PEBS_FORMAT) > 3);
 		break;
+	case MSR_PERF_METRICS:
+		ret = vcpu->arch.perf_capabilities & MSR_CAP_PERF_METRICS;
+		break;
 	default:
 		ret = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0) ||
 			get_gp_pmc(pmu, msr, MSR_P6_EVNTSEL0) ||
@@ -385,6 +396,10 @@ static int intel_pmu_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 	case MSR_PEBS_DATA_CFG:
 		msr_info->data = pmu->pebs_data_cfg;
 		return 0;
+	case MSR_PERF_METRICS:
+		pmc_read_counter(&pmu->fixed_counters[3]);
+		msr_info->data = pmu->perf_metrics;
+		return 0;
 	default:
 		if ((pmc = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0)) ||
 		    (pmc = get_gp_pmc(pmu, msr, MSR_IA32_PMC0))) {
@@ -472,6 +487,14 @@ static int intel_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 			return 0;
 		}
 		break;
+	case MSR_PERF_METRICS:
+		if (pmu->perf_metrics == data)
+			return 0;
+		if (!(data & pmu->perf_metrics_mask)) {
+			pmu->perf_metrics = data;
+			return 0;
+		}
+		break;
 	default:
 		if ((pmc = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0)) ||
 		    (pmc = get_gp_pmc(pmu, msr, MSR_IA32_PMC0))) {
@@ -545,6 +568,7 @@ static void intel_pmu_refresh(struct kvm_vcpu *vcpu)
 	pmu->fixed_ctr_ctrl_mask = ~0ull;
 	pmu->pebs_enable_mask = ~0ull;
 	pmu->pebs_data_cfg_mask = ~0ull;
+	pmu->perf_metrics_mask = ~0ull;
 
 	entry = kvm_find_cpuid_entry(vcpu, 0xa);
 	if (!entry || !vcpu->kvm->arch.enable_pmu)
@@ -584,6 +608,13 @@ static void intel_pmu_refresh(struct kvm_vcpu *vcpu)
 		pmu->fixed_ctr_ctrl_mask &= ~(0xbull << (i * 4));
 	counter_mask = ~(((1ull << pmu->nr_arch_gp_counters) - 1) |
 		(((1ull << pmu->nr_arch_fixed_counters) - 1) << INTEL_PMC_IDX_FIXED));
+	if (vcpu->arch.perf_capabilities & MSR_CAP_PERF_METRICS) {
+		counter_mask &= ~(1ULL << INTEL_PERF_METRICS_IDX);
+		pmu->nr_perf_metrics = min_t(int, KVM_PERF_METRICS_NUM_MAX,
+					     kvm_pmu_cap.num_topdown_events);
+		for (i = 0; i < pmu->nr_perf_metrics; i++)
+			pmu->perf_metrics_mask &= ~(0xffull << (i * 8));
+	}
 	pmu->global_ctrl_mask = counter_mask;
 	pmu->global_ovf_ctrl_mask = pmu->global_ctrl_mask
 			& ~(MSR_CORE_PERF_GLOBAL_OVF_CTRL_OVF_BUF |
@@ -656,6 +687,8 @@ static void intel_pmu_init(struct kvm_vcpu *vcpu)
 	lbr_desc->records.nr = 0;
 	lbr_desc->event = NULL;
 	lbr_desc->msr_passthrough = false;
+
+	BUILD_BUG_ON(KVM_PERF_METRICS_NUM_MAX > INTEL_TD_METRIC_NUM);
 }
 
 static void intel_pmu_reset(struct kvm_vcpu *vcpu)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index fe5615fd8295..57312b5a3d9d 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7721,6 +7721,9 @@ static u64 vmx_get_perf_capabilities(void)
 			perf_cap &= ~PERF_CAP_PEBS_BASELINE;
 	}
 
+	if (kvm_pmu_cap.num_topdown_events)
+		perf_cap |= host_perf_cap & MSR_CAP_PERF_METRICS;
+
 	return perf_cap;
 }
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 0b61cb58c877..a50b3ad7294c 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1439,6 +1439,7 @@ static const u32 msrs_to_save_all[] = {
 	MSR_CORE_PERF_FIXED_CTR_CTRL, MSR_CORE_PERF_GLOBAL_STATUS,
 	MSR_CORE_PERF_GLOBAL_CTRL, MSR_CORE_PERF_GLOBAL_OVF_CTRL,
 	MSR_IA32_PEBS_ENABLE, MSR_IA32_DS_AREA, MSR_PEBS_DATA_CFG,
+	MSR_PERF_METRICS,
 
 	/* This part of MSRs should match KVM_INTEL_PMC_MAX_GENERIC. */
 	MSR_ARCH_PERFMON_PERFCTR0, MSR_ARCH_PERFMON_PERFCTR1,
@@ -3880,6 +3881,7 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 	case MSR_IA32_PEBS_ENABLE:
 	case MSR_IA32_DS_AREA:
 	case MSR_PEBS_DATA_CFG:
+	case MSR_PERF_METRICS:
 	case MSR_F15H_PERF_CTL0 ... MSR_F15H_PERF_CTR5:
 		if (kvm_pmu_is_valid_msr(vcpu, msr))
 			return kvm_pmu_set_msr(vcpu, msr_info);
@@ -3983,6 +3985,7 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 	case MSR_IA32_PEBS_ENABLE:
 	case MSR_IA32_DS_AREA:
 	case MSR_PEBS_DATA_CFG:
+	case MSR_PERF_METRICS:
 	case MSR_F15H_PERF_CTL0 ... MSR_F15H_PERF_CTR5:
 		if (kvm_pmu_is_valid_msr(vcpu, msr_info->index))
 			return kvm_pmu_get_msr(vcpu, msr_info);
-- 
2.38.2


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH RFC 1/8] perf/core: Add *group_leader to perf_event_create_kernel_counter()
  2022-12-12 12:58 ` [PATCH RFC 1/8] perf/core: Add *group_leader to perf_event_create_kernel_counter() Like Xu
@ 2022-12-12 13:23   ` Marc Zyngier
  2022-12-15 13:11     ` Like Xu
  2022-12-14  3:52   ` Ravi Bangoria
  1 sibling, 1 reply; 13+ messages in thread
From: Marc Zyngier @ 2022-12-12 13:23 UTC (permalink / raw)
  To: Like Xu
  Cc: Peter Zijlstra, Sean Christopherson, Paolo Bonzini, linux-kernel,
	kvm, Fenghua Yu, kvmarm, linux-perf-users

On Mon, 12 Dec 2022 12:58:37 +0000,
Like Xu <like.xu.linux@gmail.com> wrote:
> 
> From: Like Xu <likexu@tencent.com>
> 
> Like syscall users, kernel-space perf_event creators may also use the
> grouped-counters abstraction to gain PMU functionality; an in-kernel
> counter group behaves much like a normal 'single' counter, following the
> group semantics-based behavior.
> 
> No functional change at this time. One example is that KVM creates the
> Intel slots event as the group leader and the other topdown metric events
> as group members to emulate the MSR_PERF_METRICS PMU capability for guests.
> 
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Marc Zyngier <maz@kernel.org>
> Cc: Fenghua Yu <fenghua.yu@intel.com>
> Cc: kvmarm@lists.linux.dev
> Cc: linux-perf-users@vger.kernel.org
> Signed-off-by: Like Xu <likexu@tencent.com>
> ---
>  arch/arm64/kvm/pmu-emul.c                 | 4 ++--
>  arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 4 ++--
>  arch/x86/kvm/pmu.c                        | 2 +-
>  arch/x86/kvm/vmx/pmu_intel.c              | 2 +-
>  include/linux/perf_event.h                | 1 +
>  kernel/events/core.c                      | 4 +++-
>  kernel/events/hw_breakpoint.c             | 4 ++--
>  kernel/events/hw_breakpoint_test.c        | 2 +-
>  kernel/watchdog_hld.c                     | 2 +-
>  9 files changed, 14 insertions(+), 11 deletions(-)
> 
> diff --git a/arch/arm64/kvm/pmu-emul.c b/arch/arm64/kvm/pmu-emul.c
> index 24908400e190..11c3386bc86b 100644
> --- a/arch/arm64/kvm/pmu-emul.c
> +++ b/arch/arm64/kvm/pmu-emul.c
> @@ -624,7 +624,7 @@ static void kvm_pmu_create_perf_event(struct kvm_pmc *pmc)
>  
>  	attr.sample_period = compute_period(pmc, kvm_pmu_get_pmc_value(pmc));
>  
> -	event = perf_event_create_kernel_counter(&attr, -1, current,
> +	event = perf_event_create_kernel_counter(&attr, -1, current, NULL,
>  						 kvm_pmu_perf_overflow, pmc);

Wouldn't it be better to have a separate helper that takes the group leader
as a parameter, and reimplement perf_event_create_kernel_counter() in terms
of this helper?
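
Something along these lines, purely as a sketch (the helper name and the
inline wrapper here are illustrative, not a concrete proposal):

	/* new helper carrying the extra parameter */
	struct perf_event *
	perf_event_create_group_kernel_counter(struct perf_event_attr *attr,
					       int cpu, struct task_struct *task,
					       struct perf_event *group_leader,
					       perf_overflow_handler_t handler,
					       void *context);

	/* existing API keeps its signature and its callers untouched */
	static inline struct perf_event *
	perf_event_create_kernel_counter(struct perf_event_attr *attr, int cpu,
					 struct task_struct *task,
					 perf_overflow_handler_t handler,
					 void *context)
	{
		return perf_event_create_group_kernel_counter(attr, cpu, task,
							      NULL, handler,
							      context);
	}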

	M.

-- 
Without deviation from the norm, progress is not possible.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH RFC 1/8] perf/core: Add *group_leader to perf_event_create_kernel_counter()
  2022-12-12 12:58 ` [PATCH RFC 1/8] perf/core: Add *group_leader to perf_event_create_kernel_counter() Like Xu
  2022-12-12 13:23   ` Marc Zyngier
@ 2022-12-14  3:52   ` Ravi Bangoria
  2022-12-15 13:36     ` Like Xu
  1 sibling, 1 reply; 13+ messages in thread
From: Ravi Bangoria @ 2022-12-14  3:52 UTC (permalink / raw)
  To: Like Xu
  Cc: Peter Zijlstra, Sean Christopherson, Paolo Bonzini, linux-kernel,
	kvm, Marc Zyngier, Fenghua Yu, kvmarm, linux-perf-users,
	Ravi Bangoria

> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 7f04f995c975..f671b1a9a691 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -12674,12 +12674,14 @@ SYSCALL_DEFINE5(perf_event_open,
>   * @attr: attributes of the counter to create
>   * @cpu: cpu in which the counter is bound
>   * @task: task to profile (NULL for percpu)
> + * @group_leader: event group leader
>   * @overflow_handler: callback to trigger when we hit the event
>   * @context: context data could be used in overflow_handler callback
>   */
>  struct perf_event *
>  perf_event_create_kernel_counter(struct perf_event_attr *attr, int cpu,
>  				 struct task_struct *task,
> +				 struct perf_event *group_leader,
>  				 perf_overflow_handler_t overflow_handler,
>  				 void *context)
>  {
> @@ -12694,7 +12696,7 @@ perf_event_create_kernel_counter(struct perf_event_attr *attr, int cpu,
>  	if (attr->aux_output)
>  		return ERR_PTR(-EINVAL);
>  
> -	event = perf_event_alloc(attr, cpu, task, NULL, NULL,
> +	event = perf_event_alloc(attr, cpu, task, group_leader, NULL,
>  				 overflow_handler, context, -1);

Grouping involves a lot of complexity. Setting group_leader won't be sufficient.
Please see the perf_event_open() syscall code for more details.

Thanks,
Ravi

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH RFC 1/8] perf/core: Add *group_leader to perf_event_create_kernel_counter()
  2022-12-12 13:23   ` Marc Zyngier
@ 2022-12-15 13:11     ` Like Xu
  0 siblings, 0 replies; 13+ messages in thread
From: Like Xu @ 2022-12-15 13:11 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: Peter Zijlstra, Sean Christopherson, Paolo Bonzini, linux-kernel,
	kvm, Fenghua Yu, kvmarm, linux-perf-users

On 12/12/2022 9:23 pm, Marc Zyngier wrote:
> On Mon, 12 Dec 2022 12:58:37 +0000,
> Like Xu <like.xu.linux@gmail.com> wrote:
>>
>> From: Like Xu <likexu@tencent.com>
>>
>> Like syscall users, kernel-space perf_event creators may also use the
>> grouped-counters abstraction to gain PMU functionality; an in-kernel
>> counter group behaves much like a normal 'single' counter, following the
>> group semantics-based behavior.
>>
>> No functional change at this time. One example is that KVM creates the
>> Intel slots event as the group leader and the other topdown metric events
>> as group members to emulate the MSR_PERF_METRICS PMU capability for guests.
>>
>> Cc: Peter Zijlstra <peterz@infradead.org>
>> Cc: Marc Zyngier <maz@kernel.org>
>> Cc: Fenghua Yu <fenghua.yu@intel.com>
>> Cc: kvmarm@lists.linux.dev
>> Cc: linux-perf-users@vger.kernel.org
>> Signed-off-by: Like Xu <likexu@tencent.com>
>> ---
>>   arch/arm64/kvm/pmu-emul.c                 | 4 ++--
>>   arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 4 ++--
>>   arch/x86/kvm/pmu.c                        | 2 +-
>>   arch/x86/kvm/vmx/pmu_intel.c              | 2 +-
>>   include/linux/perf_event.h                | 1 +
>>   kernel/events/core.c                      | 4 +++-
>>   kernel/events/hw_breakpoint.c             | 4 ++--
>>   kernel/events/hw_breakpoint_test.c        | 2 +-
>>   kernel/watchdog_hld.c                     | 2 +-
>>   9 files changed, 14 insertions(+), 11 deletions(-)
>>
>> diff --git a/arch/arm64/kvm/pmu-emul.c b/arch/arm64/kvm/pmu-emul.c
>> index 24908400e190..11c3386bc86b 100644
>> --- a/arch/arm64/kvm/pmu-emul.c
>> +++ b/arch/arm64/kvm/pmu-emul.c
>> @@ -624,7 +624,7 @@ static void kvm_pmu_create_perf_event(struct kvm_pmc *pmc)
>>   
>>   	attr.sample_period = compute_period(pmc, kvm_pmu_get_pmc_value(pmc));
>>   
>> -	event = perf_event_create_kernel_counter(&attr, -1, current,
>> +	event = perf_event_create_kernel_counter(&attr, -1, current, NULL,
>>   						 kvm_pmu_perf_overflow, pmc);
> 
> Wouldn't it be better to have a separate helper that takes the group leader
> as a parameter, and reimplement perf_event_create_kernel_counter() in terms
> of this helper?
> 
> 	M.
> 

Applied. It makes changes more concise, thank you.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH RFC 1/8] perf/core: Add *group_leader to perf_event_create_kernel_counter()
  2022-12-14  3:52   ` Ravi Bangoria
@ 2022-12-15 13:36     ` Like Xu
  0 siblings, 0 replies; 13+ messages in thread
From: Like Xu @ 2022-12-15 13:36 UTC (permalink / raw)
  To: Ravi Bangoria, Peter Zijlstra
  Cc: Sean Christopherson, Paolo Bonzini, linux-kernel, kvm,
	Marc Zyngier, Fenghua Yu, kvmarm, linux-perf-users

On 14/12/2022 11:52 am, Ravi Bangoria wrote:
>> diff --git a/kernel/events/core.c b/kernel/events/core.c
>> index 7f04f995c975..f671b1a9a691 100644
>> --- a/kernel/events/core.c
>> +++ b/kernel/events/core.c
>> @@ -12674,12 +12674,14 @@ SYSCALL_DEFINE5(perf_event_open,
>>    * @attr: attributes of the counter to create
>>    * @cpu: cpu in which the counter is bound
>>    * @task: task to profile (NULL for percpu)
>> + * @group_leader: event group leader
>>    * @overflow_handler: callback to trigger when we hit the event
>>    * @context: context data could be used in overflow_handler callback
>>    */
>>   struct perf_event *
>>   perf_event_create_kernel_counter(struct perf_event_attr *attr, int cpu,
>>   				 struct task_struct *task,
>> +				 struct perf_event *group_leader,
>>   				 perf_overflow_handler_t overflow_handler,
>>   				 void *context)
>>   {
>> @@ -12694,7 +12696,7 @@ perf_event_create_kernel_counter(struct perf_event_attr *attr, int cpu,
>>   	if (attr->aux_output)
>>   		return ERR_PTR(-EINVAL);
>>   
>> -	event = perf_event_alloc(attr, cpu, task, NULL, NULL,
>> +	event = perf_event_alloc(attr, cpu, task, group_leader, NULL,
>>   				 overflow_handler, context, -1);
> 
> Grouping involves lot of complexities. Setting group_leader won't be sufficient.
> Please see perf_event_open() syscall code for more detail.
> 
> Thanks,
> Ravi

This is the main reason why the RFC tag was added. More detailed
professional review is encouraged.

AFAICT, there are indeed a number of code gaps in supporting grouped events
in the kernel, but there are also opportunities, as other new use cases may
bring innovation.

I have to confirm this idea with the maintainers first; the alternative is
to create yet another special perf_event, similar to PMC_IDX_FIXED_VLBR,
which schedules the perf_metrics MSR for KVM's purposes.

PeterZ, any input?

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2022-12-15 13:36 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-12-12 12:58 [PATCH RFC 0/8] KVM: x86/pmu: Enable Fixed Counter3 and Topdown Perf Metrics Like Xu
2022-12-12 12:58 ` [PATCH RFC 1/8] perf/core: Add *group_leader to perf_event_create_kernel_counter() Like Xu
2022-12-12 13:23   ` Marc Zyngier
2022-12-15 13:11     ` Like Xu
2022-12-14  3:52   ` Ravi Bangoria
2022-12-15 13:36     ` Like Xu
2022-12-12 12:58 ` [PATCH RFC 2/8] perf: x86/core: Expose the available number of the Topdown metrics Like Xu
2022-12-12 12:58 ` [PATCH RFC 3/8] perf: x86/core: Sync PERF_METRICS bit together with fixed counter3 Like Xu
2022-12-12 12:58 ` [PATCH RFC 4/8] KVM: x86/pmu: Add Intel CPUID-hinted Topdown Slots event Like Xu
2022-12-12 12:58 ` [PATCH RFC 5/8] KVM: x86/pmu: Add kernel-defined slots event to enable Fixed Counter3 Like Xu
2022-12-12 12:58 ` [PATCH RFC 6/8] KVM: x86/pmu: properly use INTEL_PMC_FIXED_RDPMC_BASE macro Like Xu
2022-12-12 12:58 ` [PATCH RFC 7/8] KVM: x86/pmu: Use flex *event arrays to implement grouped events Like Xu
2022-12-12 12:58 ` [PATCH RFC 8/8] KVM: x86/pmu: Add MSR_PERF_METRICS MSR emulation to enable Topdown Like Xu
