* [PATCH v1 0/8] Intel Virtual PMU Optimization
@ 2018-11-01 10:04 Wei Wang
  2018-11-01 10:04 ` [PATCH v1 1/8] perf/x86: add support to mask counters from host Wei Wang
                   ` (7 more replies)
  0 siblings, 8 replies; 17+ messages in thread
From: Wei Wang @ 2018-11-01 10:04 UTC (permalink / raw)
  To: linux-kernel, kvm, pbonzini, ak, peterz
  Cc: mingo, rkrcmar, like.xu, wei.w.wang

This patch series optimizes the Intel PMU virtualization by reducing the
PMU virtualization overhead and providing guests with accurate PMU
statistics.

The differences between the traditional approach and the optimized approach
are depicted in the figures here:
https://github.com/weiwangwork/vPMU/blob/master/vPMU%20Optimization.pdf

The traditional approach to PMU virtualization is host perf event oriented.
The KVM vPMU layer sits on top of the host perf subsystem, and each guest
update to the vPMU is translated into a new host perf event. Every
reconfiguration therefore goes through the host perf software stack
(releasing the old perf event, creating a new one and getting it
rescheduled to a hardware perf counter).

With this optimization, we make the virtualization layer register oriented.
The KVM vPMU layer is moved down to sit directly on the hardware perf
counters, so guest accesses to the vPMU registers can be applied directly to
the related hardware counter by the vPMU. This reduces the virtualization
overhead from around 2,500,000 ns to around 400 ns.
(Tested on an Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz, with the host
booting parameter "nowatchdog" added to avoid noise from watchdog_hld.)

The vPMU still needs to obtain ownership of the physical perf counters from
the host perf core. The counters used by the guest are taken from the host
perf core via x86_perf_mask_perf_counters, which in most cases amounts to
setting the bits of the guest mask.
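
For illustration, below is a minimal sketch of the intended call pattern on
the KVM side (it mirrors the intel_pmu_get_pmc/put_pmc helpers added in
patch 3; the helper names here are only illustrative):

	/* Sketch: claim counter idx for the guest on this CPU. */
	static void claim_pmc_for_guest(struct kvm_pmu *pmu, unsigned int idx)
	{
		set_bit(idx, (unsigned long *)&pmu->assigned_pmc_bitmap);
		/* Ask the host perf core not to schedule events on these counters. */
		x86_perf_mask_perf_counters(pmu->assigned_pmc_bitmap);
	}

	/* Sketch: give counter idx back to the host perf core. */
	static void release_pmc_to_host(struct kvm_pmu *pmu, unsigned int idx)
	{
		clear_bit(idx, (unsigned long *)&pmu->assigned_pmc_bitmap);
		x86_perf_mask_perf_counters(pmu->assigned_pmc_bitmap);
	}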

This series currently covers basic perf counter virtualization. Other
features, such as PEBS and LBR, will come after this series.

Wei Wang (8):
  perf/x86: add support to mask counters from host
  perf/x86/intel: add pmi callback support
  KVM/x86/vPMU: optimize intel vPMU
  KVM/x86/vPMU: support msr switch on vmx transitions
  KVM/x86/vPMU: intel_pmu_read_pmc
  KVM/x86/vPMU: remove some unused functions
  KVM/x86/vPMU: save/restore guest perf counters on vCPU switching
  KVM/x86/vPMU: return the counters to host if guest is torn down

 arch/x86/events/core.c            |  47 ++++
 arch/x86/events/intel/core.c      |  65 ++----
 arch/x86/events/perf_event.h      |  10 +-
 arch/x86/include/asm/kvm_host.h   |  13 ++
 arch/x86/include/asm/perf_event.h |  16 +-
 arch/x86/kvm/pmu.c                |  15 ++
 arch/x86/kvm/pmu.h                |   7 +
 arch/x86/kvm/pmu_intel.c          | 448 +++++++++++++++++++++++++-------------
 arch/x86/kvm/vmx.c                |   6 +-
 arch/x86/kvm/x86.c                |   6 +
 include/linux/kvm_host.h          |   1 +
 virt/kvm/kvm_main.c               |   3 +
 12 files changed, 416 insertions(+), 221 deletions(-)

-- 
2.7.4



* [PATCH v1 1/8] perf/x86: add support to mask counters from host
  2018-11-01 10:04 [PATCH v1 0/8] Intel Virtual PMU Optimization Wei Wang
@ 2018-11-01 10:04 ` Wei Wang
  2018-11-01 14:52   ` Peter Zijlstra
  2018-11-01 10:04 ` [PATCH v1 2/8] perf/x86/intel: add pmi callback support Wei Wang
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 17+ messages in thread
From: Wei Wang @ 2018-11-01 10:04 UTC (permalink / raw)
  To: linux-kernel, kvm, pbonzini, ak, peterz
  Cc: mingo, rkrcmar, like.xu, wei.w.wang

Add x86_perf_mask_perf_counters to reserve counters from the host perf
subsystem. The masked counters will not be assigned to any host perf
events. This can be used by the hypervisor to reserve perf counters for
a guest to use.

This function is currently supported on Intel CPUs only, but it is put in
the x86 perf core because the counter assignment is implemented there, and
because the pmu (defined in the x86 perf core) needs to be disabled and
re-enabled when a counter to be masked happens to be in use by the host.

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
---
 arch/x86/events/core.c            | 37 +++++++++++++++++++++++++++++++++++++
 arch/x86/include/asm/perf_event.h |  1 +
 2 files changed, 38 insertions(+)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 106911b..e73135a 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -716,6 +716,7 @@ struct perf_sched {
 static void perf_sched_init(struct perf_sched *sched, struct event_constraint **constraints,
 			    int num, int wmin, int wmax, int gpmax)
 {
+	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
 	int idx;
 
 	memset(sched, 0, sizeof(*sched));
@@ -723,6 +724,9 @@ static void perf_sched_init(struct perf_sched *sched, struct event_constraint **
 	sched->max_weight	= wmax;
 	sched->max_gp		= gpmax;
 	sched->constraints	= constraints;
+#ifdef CONFIG_CPU_SUP_INTEL
+	sched->state.used[0]	= cpuc->intel_ctrl_guest_mask;
+#endif
 
 	for (idx = 0; idx < num; idx++) {
 		if (constraints[idx]->weight == wmin)
@@ -2386,6 +2390,39 @@ perf_callchain_kernel(struct perf_callchain_entry_ctx *entry, struct pt_regs *re
 	}
 }
 
+#ifdef CONFIG_CPU_SUP_INTEL
+/**
+ * x86_perf_mask_perf_counters - mask perf counters
+ * @mask: the bitmask of counters
+ *
+ * Mask the perf counters that are not available to be used by the perf core.
+ * If the counter to be masked has been assigned, it will be taken back and
+ * then the perf core will re-assign usable counters to its events.
+ *
+ * This can be used by a component outside the perf core to reserve counters.
+ * For example, a hypervisor uses it to reserve counters for a guest to use,
+ * and later return the counters by another call with the related bits cleared.
+ */
+void x86_perf_mask_perf_counters(u64 mask)
+{
+	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+
+	/*
+	 * If the counter happens to be used by a host event, take it back
+	 * first, and then restart the pmu after masking that counter as
+	 * reserved.
+	 */
+	if (mask & cpuc->intel_ctrl_host_mask) {
+		perf_pmu_disable(&pmu);
+		cpuc->intel_ctrl_guest_mask = mask;
+		perf_pmu_enable(&pmu);
+	} else {
+		cpuc->intel_ctrl_guest_mask = mask;
+	}
+}
+EXPORT_SYMBOL_GPL(x86_perf_mask_perf_counters);
+#endif
+
 static inline int
 valid_user_frame(const void __user *fp, unsigned long size)
 {
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index 8bdf749..5b4463e 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -297,6 +297,7 @@ static inline void perf_check_microcode(void) { }
 
 #ifdef CONFIG_CPU_SUP_INTEL
  extern void intel_pt_handle_vmx(int on);
+extern void x86_perf_mask_perf_counters(u64 mask);
 #endif
 
 #if defined(CONFIG_PERF_EVENTS) && defined(CONFIG_CPU_SUP_AMD)
-- 
2.7.4



* [PATCH v1 2/8] perf/x86/intel: add pmi callback support
  2018-11-01 10:04 [PATCH v1 0/8] Intel Virtual PMU Optimization Wei Wang
  2018-11-01 10:04 ` [PATCH v1 1/8] perf/x86: add support to mask counters from host Wei Wang
@ 2018-11-01 10:04 ` Wei Wang
  2018-11-01 10:04 ` [PATCH v1 3/8] KVM/x86/vPMU: optimize intel vPMU Wei Wang
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 17+ messages in thread
From: Wei Wang @ 2018-11-01 10:04 UTC (permalink / raw)
  To: linux-kernel, kvm, pbonzini, ak, peterz
  Cc: mingo, rkrcmar, like.xu, wei.w.wang

This patch adds PMI callback support for the counters reserved by
components outside the perf core. For example, a hypervisor may register
such a callback to have the guest notified when a PMI arrives.

The host PMI handling requires active_events to be non-zero, so
x86_perf_mask_perf_counters keeps active_events at least 1 while any
counters are reserved for the guest.
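
For reference, a minimal sketch of how a hypervisor can use this interface
(the handler below is only illustrative; the actual KVM handler and the
registration on vCPU sched-in/out are added in patch 7):

	/* Called from the host PMI path for overflows of guest-owned counters. */
	static void guest_pmi_handler(void *opaque, u64 status)
	{
		struct kvm_vcpu *vcpu = opaque;

		/* 'status' carries the overflow bits of the guest-owned counters. */
		vcpu_to_pmu(vcpu)->global_status |= status;
		kvm_make_request(KVM_REQ_PMI, vcpu);
	}

	/* On vCPU sched-in: */
	x86_perf_register_pmi_callback(guest_pmi_handler, vcpu);
	/* On vCPU sched-out: */
	x86_perf_unregister_pmi_callback();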

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
---
 arch/x86/events/core.c            | 10 ++++++++++
 arch/x86/events/intel/core.c      | 27 +++++++++++++++++++++++++++
 arch/x86/events/perf_event.h      |  4 ++++
 arch/x86/include/asm/perf_event.h |  5 +++++
 4 files changed, 46 insertions(+)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index e73135a..504d102 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -2408,6 +2408,16 @@ void x86_perf_mask_perf_counters(u64 mask)
 	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
 
 	/*
+	 * Increment active_events by 1 if there are counters reserved for
+	 * the guest to use, and no need to do more increments if there are
+	 * already counters taken by the guest.
+	 */
+	if (!cpuc->intel_ctrl_guest_mask && mask)
+		atomic_inc(&active_events);
+	else if (cpuc->intel_ctrl_guest_mask && !mask)
+		atomic_dec(&active_events);
+
+	/*
 	 * If the counter happens to be used by a host event, take it back
 	 * first, and then restart the pmu after masking that counter as
 	 * reserved.
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 0fb8659..f5e5191 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -2283,6 +2283,13 @@ static int handle_pmi_common(struct pt_regs *regs, u64 status)
 	 */
 	status |= cpuc->intel_cp_status;
 
+	if (status & cpuc->intel_ctrl_guest_mask) {
+		cpuc->pmi_callback(cpuc->pmi_opaque,
+				   status & cpuc->intel_ctrl_guest_mask);
+		status &= ~cpuc->intel_ctrl_guest_mask;
+		handled++;
+	}
+
 	for_each_set_bit(bit, (unsigned long *)&status, X86_PMC_IDX_MAX) {
 		struct perf_event *event = cpuc->events[bit];
 
@@ -3162,6 +3169,26 @@ struct perf_guest_switch_msr *perf_guest_get_msrs(int *nr)
 }
 EXPORT_SYMBOL_GPL(perf_guest_get_msrs);
 
+void x86_perf_register_pmi_callback(pmi_callback_t callback, void *opaque)
+{
+	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+
+	cpuc->pmi_callback = callback;
+	cpuc->pmi_opaque = opaque;
+
+	apic_write(APIC_LVTPC, APIC_DM_NMI);
+}
+EXPORT_SYMBOL_GPL(x86_perf_register_pmi_callback);
+
+void x86_perf_unregister_pmi_callback(void)
+{
+	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+
+	cpuc->pmi_callback = NULL;
+	cpuc->pmi_opaque = NULL;
+}
+EXPORT_SYMBOL_GPL(x86_perf_unregister_pmi_callback);
+
 static struct perf_guest_switch_msr *intel_guest_get_msrs(int *nr)
 {
 	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index adae087..06404eb 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -197,6 +197,10 @@ struct cpu_hw_events {
 	unsigned int		txn_flags;
 	int			is_fake;
 
+	/* PMI related fields */
+	pmi_callback_t			pmi_callback;
+	void				*pmi_opaque;
+
 	/*
 	 * Intel DebugStore bits
 	 */
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index 5b4463e..fd33688 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -275,6 +275,8 @@ struct perf_guest_switch_msr {
 	u64 host, guest;
 };
 
+typedef void (*pmi_callback_t)(void *opaque, u64 status);
+
 extern struct perf_guest_switch_msr *perf_guest_get_msrs(int *nr);
 extern void perf_get_x86_pmu_capability(struct x86_pmu_capability *cap);
 extern void perf_check_microcode(void);
@@ -298,6 +300,9 @@ static inline void perf_check_microcode(void) { }
 #ifdef CONFIG_CPU_SUP_INTEL
  extern void intel_pt_handle_vmx(int on);
 extern void x86_perf_mask_perf_counters(u64 mask);
+extern void x86_perf_register_pmi_callback(pmi_callback_t callback,
+					   void *opaque);
+extern void x86_perf_unregister_pmi_callback(void);
 #endif
 
 #if defined(CONFIG_PERF_EVENTS) && defined(CONFIG_CPU_SUP_AMD)
-- 
2.7.4



* [PATCH v1 3/8] KVM/x86/vPMU: optimize intel vPMU
  2018-11-01 10:04 [PATCH v1 0/8] Intel Virtual PMU Optimization Wei Wang
  2018-11-01 10:04 ` [PATCH v1 1/8] perf/x86: add support to mask counters from host Wei Wang
  2018-11-01 10:04 ` [PATCH v1 2/8] perf/x86/intel: add pmi callback support Wei Wang
@ 2018-11-01 10:04 ` Wei Wang
  2018-11-01 10:04 ` [PATCH v1 4/8] KVM/x86/vPMU: support msr switch on vmx transitions Wei Wang
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 17+ messages in thread
From: Wei Wang @ 2018-11-01 10:04 UTC (permalink / raw)
  To: linux-kernel, kvm, pbonzini, ak, peterz
  Cc: mingo, rkrcmar, like.xu, wei.w.wang

The current vPMU relies on the host perf software stack to apply guest
changes to the perf counter MSRs. The whole process includes releasing the
old perf event and re-creating a new one, which results in around
2,500,000 ns of overhead for updating a perf counter control MSR.

This can be avoided by having the vPMU layer sit directly on the hardware
perf counters; guest accesses to the virtual perf counters can then be
applied directly to the related hardware counter by the vPMU. The counters
used by the guest are taken from the host perf core via
x86_perf_mask_perf_counters, which in most cases amounts to setting the
bits of the guest mask.

This patch implements the handling of guest accesses to the perf counter
MSRs. A host perf counter is assigned to the guest when the guest has the
vPMC enabled, and is returned to the host when the vPMC is disabled.
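
For clarity, each fixed counter is controlled by a 4-bit field in
MSR_CORE_PERF_FIXED_CTR_CTRL, which is why the update helper added below
masks and shifts by idx * 4. A brief worked example (the values are only
illustrative), assuming fixed counter 1 is being updated:

	/*
	 * intel_pmu_update_fixed_ctrl_msr(new_ctrl = 0x3, idx = 1):
	 *
	 *   mask       = 0xf << (1 * 4)       = 0xf0
	 *   host_ctrl &= ~mask                  (clear counter 1's 4-bit field)
	 *   host_ctrl |= new_ctrl << (1 * 4)  = ... | 0x30
	 *
	 * Only the field of the targeted fixed counter changes; the control
	 * fields of the other fixed counters are preserved.
	 */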

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
---
 arch/x86/include/asm/kvm_host.h |   1 +
 arch/x86/kvm/pmu_intel.c        | 257 ++++++++++++++++++++++++++++++++--------
 2 files changed, 209 insertions(+), 49 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 55e51ff..f8bc46d 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -463,6 +463,7 @@ struct kvm_pmu {
 	u64 global_ovf_ctrl;
 	u64 counter_bitmask[2];
 	u64 global_ctrl_mask;
+	u64 assigned_pmc_bitmap;
 	u64 reserved_bits;
 	u8 version;
 	struct kvm_pmc gp_counters[INTEL_PMC_MAX_GENERIC];
diff --git a/arch/x86/kvm/pmu_intel.c b/arch/x86/kvm/pmu_intel.c
index 5ab4a36..8c2d37f 100644
--- a/arch/x86/kvm/pmu_intel.c
+++ b/arch/x86/kvm/pmu_intel.c
@@ -185,7 +185,11 @@ static int intel_pmu_get_msr(struct kvm_vcpu *vcpu, u32 msr, u64 *data)
 	default:
 		if ((pmc = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0)) ||
 		    (pmc = get_fixed_pmc(pmu, msr))) {
-			*data = pmc_read_counter(pmc);
+			if (test_bit(pmc->idx,
+				(unsigned long *)&pmu->assigned_pmc_bitmap))
+				rdmsrl(msr, *data);
+			else
+				*data = pmc->counter;
 			return 0;
 		} else if ((pmc = get_gp_pmc(pmu, msr, MSR_P6_EVNTSEL0))) {
 			*data = pmc->eventsel;
@@ -196,59 +200,210 @@ static int intel_pmu_get_msr(struct kvm_vcpu *vcpu, u32 msr, u64 *data)
 	return 1;
 }
 
+static void intel_pmu_update_fixed_ctrl_msr(u64 new_ctrl, u8 idx)
+{
+	u64 host_ctrl, mask;
+
+	rdmsrl(MSR_ARCH_PERFMON_FIXED_CTR_CTRL, host_ctrl);
+	mask = 0xfULL << (idx * 4);
+	host_ctrl &= ~mask;
+	new_ctrl <<= (idx * 4);
+	host_ctrl |= new_ctrl;
+	wrmsrl(MSR_ARCH_PERFMON_FIXED_CTR_CTRL, host_ctrl);
+}
+
+static void intel_pmu_save_pmc_counters(struct kvm_pmu *pmu, u32 idx)
+{
+	struct kvm_pmc *pmc;
+
+	pmc = intel_pmc_idx_to_pmc(pmu, idx);
+	/*
+	 * The control MSRs (pmc->eventsel and pmu->fixed_ctr_ctrl) always
+	 * store the updated value, so we only need to save the counter value.
+	 */
+	if (pmc->type == KVM_PMC_GP)
+		rdmsrl(MSR_IA32_PERFCTR0 + idx, pmc->counter);
+	else
+		rdmsrl(MSR_CORE_PERF_FIXED_CTR0 + idx - INTEL_PMC_IDX_FIXED,
+		       pmc->counter);
+}
+
+static void intel_pmu_restore_pmc_counters(struct kvm_pmu *pmu, u32 idx)
+{
+	struct kvm_pmc *pmc;
+
+	pmc = intel_pmc_idx_to_pmc(pmu, idx);
+
+	if (pmc->type == KVM_PMC_GP) {
+		wrmsrl(MSR_IA32_PERFCTR0 + idx, pmc->counter);
+		wrmsrl(MSR_ARCH_PERFMON_EVENTSEL0 + idx, pmc->eventsel);
+	} else {
+		u8 ctrl;
+
+		idx -= INTEL_PMC_IDX_FIXED;
+		ctrl = fixed_ctrl_field(pmu->fixed_ctr_ctrl, idx);
+
+		wrmsrl(MSR_CORE_PERF_FIXED_CTR0 + idx, pmc->counter);
+		intel_pmu_update_fixed_ctrl_msr(ctrl, idx);
+	}
+}
+
+/* Get the physical PMC from host and restore the vPMC states. */
+static inline void intel_pmu_get_pmc(struct kvm_pmu *pmu, unsigned int idx)
+{
+	/* Already assigned? */
+	if (test_bit(idx, (unsigned long *)&pmu->assigned_pmc_bitmap))
+		return;
+
+	set_bit(idx, (unsigned long *)&pmu->assigned_pmc_bitmap);
+	x86_perf_mask_perf_counters(pmu->assigned_pmc_bitmap);
+	intel_pmu_restore_pmc_counters(pmu, idx);
+}
+
+/* Save the physical PMC state and return it to host. */
+static inline void intel_pmu_put_pmc(struct kvm_pmu *pmu, unsigned int idx)
+{
+	/* Already returned? */
+	if (!test_bit(idx, (unsigned long *)&pmu->assigned_pmc_bitmap))
+		return;
+
+	intel_pmu_save_pmc_counters(pmu, idx);
+	clear_bit(idx, (unsigned long *)&pmu->assigned_pmc_bitmap);
+	x86_perf_mask_perf_counters(pmu->assigned_pmc_bitmap);
+}
+
+static int intel_pmu_set_fixed_ctrl(struct kvm_pmu *pmu,
+				    struct msr_data *msr_info)
+{
+	u8 old_ctrl, new_ctrl, pmc_idx, i;
+	u64 data = msr_info->data;
+
+	if (pmu->fixed_ctr_ctrl == data)
+		return 0;
+	if (unlikely(data & 0xfffffffffffff444ull))
+		return 1;
+
+	for (i = 0; i < INTEL_PMC_MAX_FIXED; i++) {
+		old_ctrl = fixed_ctrl_field(pmu->fixed_ctr_ctrl, i);
+		new_ctrl = fixed_ctrl_field(data, i);
+		if (old_ctrl == new_ctrl)
+			continue;
+
+		pmc_idx = INTEL_PMC_IDX_FIXED + i;
+		if (new_ctrl) {
+			/* Set the control after we own the pmc */
+			intel_pmu_get_pmc(pmu, pmc_idx);
+			intel_pmu_update_fixed_ctrl_msr((u64)new_ctrl, i);
+		} else {
+			/* Zero the control before we return the pmc */
+			intel_pmu_update_fixed_ctrl_msr((u64)new_ctrl, i);
+			intel_pmu_put_pmc(pmu, pmc_idx);
+		}
+	}
+	pmu->fixed_ctr_ctrl = data;
+
+	return 0;
+}
+
+static int intel_pmu_set_global_status(struct kvm_pmu *pmu,
+				       struct msr_data *msr_info)
+{
+	/* RO to the guest */
+	if (!msr_info->host_initiated)
+		return 1;
+
+	pmu->global_status = msr_info->data;
+	return 0;
+}
+
+static int intel_pmu_set_global_ctrl(struct kvm_pmu *pmu,
+				     struct msr_data *msr_info)
+{
+	u64 data = msr_info->data;
+
+	if (unlikely(data & pmu->global_ctrl_mask))
+		return 1;
+
+	pmu->global_ctrl = data;
+
+	return 0;
+}
+
+static int intel_pmu_set_ovf_ctrl(struct kvm_pmu *pmu,
+				  struct msr_data *msr_info)
+{
+	u64 data = msr_info->data;
+
+	if (unlikely(data & (pmu->global_ctrl_mask & ~(3ull<<62))))
+		return 1;
+
+	if (!msr_info->host_initiated)
+		pmu->global_status &= ~data;
+
+	pmu->global_ovf_ctrl = data;
+
+	return 0;
+}
+
+static int intel_pmu_set_gp_eventsel(struct kvm_pmc *pmc,
+				     struct msr_data *msr_info)
+{
+	struct kvm_pmu *pmu = pmc_to_pmu(pmc);
+	u64 data = msr_info->data;
+
+	if (pmc->eventsel == data)
+		return 0;
+
+	pmc->eventsel = data;
+
+	if (data & ARCH_PERFMON_EVENTSEL_ENABLE) {
+		intel_pmu_get_pmc(pmu, pmc->idx);
+		wrmsrl(msr_info->index, pmc->eventsel);
+	} else {
+		wrmsrl(msr_info->index, pmc->eventsel);
+		intel_pmu_put_pmc(pmu, pmc->idx);
+	}
+
+	return 0;
+}
+
+static int intel_pmu_set_pmc_counter(struct kvm_pmc *pmc,
+				     struct msr_data *msr_info)
+{
+	struct kvm_pmu *pmu = pmc_to_pmu(pmc);
+
+	pmc->counter = msr_info->data;
+	if (test_bit(pmc->idx, (unsigned long *)&pmu->assigned_pmc_bitmap))
+		wrmsrl(msr_info->index, pmc->counter);
+
+	return 0;
+}
+
 static int intel_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 {
 	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
 	struct kvm_pmc *pmc;
 	u32 msr = msr_info->index;
-	u64 data = msr_info->data;
 
 	switch (msr) {
 	case MSR_CORE_PERF_FIXED_CTR_CTRL:
-		if (pmu->fixed_ctr_ctrl == data)
-			return 0;
-		if (!(data & 0xfffffffffffff444ull)) {
-			reprogram_fixed_counters(pmu, data);
-			return 0;
-		}
-		break;
+		return intel_pmu_set_fixed_ctrl(pmu, msr_info);
 	case MSR_CORE_PERF_GLOBAL_STATUS:
-		if (msr_info->host_initiated) {
-			pmu->global_status = data;
-			return 0;
-		}
-		break; /* RO MSR */
+		return intel_pmu_set_global_status(pmu, msr_info);
 	case MSR_CORE_PERF_GLOBAL_CTRL:
-		if (pmu->global_ctrl == data)
-			return 0;
-		if (!(data & pmu->global_ctrl_mask)) {
-			global_ctrl_changed(pmu, data);
-			return 0;
-		}
-		break;
+		return intel_pmu_set_global_ctrl(pmu, msr_info);
 	case MSR_CORE_PERF_GLOBAL_OVF_CTRL:
-		if (!(data & (pmu->global_ctrl_mask & ~(3ull<<62)))) {
-			if (!msr_info->host_initiated)
-				pmu->global_status &= ~data;
-			pmu->global_ovf_ctrl = data;
-			return 0;
-		}
-		break;
+		return intel_pmu_set_ovf_ctrl(pmu, msr_info);
 	default:
-		if ((pmc = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0)) ||
-		    (pmc = get_fixed_pmc(pmu, msr))) {
-			if (!msr_info->host_initiated)
-				data = (s64)(s32)data;
-			pmc->counter += data - pmc_read_counter(pmc);
-			return 0;
-		} else if ((pmc = get_gp_pmc(pmu, msr, MSR_P6_EVNTSEL0))) {
-			if (data == pmc->eventsel)
-				return 0;
-			if (!(data & pmu->reserved_bits)) {
-				reprogram_gp_counter(pmc, data);
-				return 0;
-			}
-		}
+		pmc = get_gp_pmc(pmu, msr, MSR_P6_EVNTSEL0);
+		if (pmc)
+			return intel_pmu_set_gp_eventsel(pmc, msr_info);
+		pmc = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0);
+		if (pmc)
+			return intel_pmu_set_pmc_counter(pmc, msr_info);
+		pmc = get_fixed_pmc(pmu, msr);
+		if (pmc)
+			return intel_pmu_set_pmc_counter(pmc, msr_info);
 	}
 
 	return 1;
@@ -326,20 +481,24 @@ static void intel_pmu_init(struct kvm_vcpu *vcpu)
 static void intel_pmu_reset(struct kvm_vcpu *vcpu)
 {
 	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+	struct kvm_pmc *pmc;
 	int i;
 
 	for (i = 0; i < INTEL_PMC_MAX_GENERIC; i++) {
-		struct kvm_pmc *pmc = &pmu->gp_counters[i];
-
-		pmc_stop_counter(pmc);
+		pmc = &pmu->gp_counters[i];
 		pmc->counter = pmc->eventsel = 0;
 	}
 
-	for (i = 0; i < INTEL_PMC_MAX_FIXED; i++)
-		pmc_stop_counter(&pmu->fixed_counters[i]);
+	for (i = 0; i < INTEL_PMC_MAX_FIXED; i++) {
+		pmc = &pmu->fixed_counters[i];
+		pmc->counter = 0;
+	}
 
-	pmu->fixed_ctr_ctrl = pmu->global_ctrl = pmu->global_status =
-		pmu->global_ovf_ctrl = 0;
+	pmu->fixed_ctr_ctrl = 0;
+	pmu->global_ctrl = 0;
+	pmu->global_status = 0;
+	pmu->global_ovf_ctrl = 0;
+	pmu->assigned_pmc_bitmap = 0;
 }
 
 struct kvm_pmu_ops intel_pmu_ops = {
-- 
2.7.4



* [PATCH v1 4/8] KVM/x86/vPMU: support msr switch on vmx transitions
  2018-11-01 10:04 [PATCH v1 0/8] Intel Virtual PMU Optimization Wei Wang
                   ` (2 preceding siblings ...)
  2018-11-01 10:04 ` [PATCH v1 3/8] KVM/x86/vPMU: optimize intel vPMU Wei Wang
@ 2018-11-01 10:04 ` Wei Wang
  2018-11-01 10:04 ` [PATCH v1 5/8] KVM/x86/vPMU: intel_pmu_read_pmc Wei Wang
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 17+ messages in thread
From: Wei Wang @ 2018-11-01 10:04 UTC (permalink / raw)
  To: linux-kernel, kvm, pbonzini, ak, peterz
  Cc: mingo, rkrcmar, like.xu, wei.w.wang

This patch adds support to the intel vPMU for switching MSRs on VMX
transitions. Currently only one MSR (global ctrl) is switched; the number
can be increased on demand in the future (e.g. the PEBS enable MSR). The
old method via the host perf subsystem is removed.
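
For context, a sketch of the loop that consumes the returned array in
atomic_switch_perf_msrs; the rest of that function falls outside the diff
context shown in the vmx.c hunk below and is essentially unchanged (the new
struct keeps the same field names), though the exact signatures of the
existing vmx.c add/clear helpers may differ between kernel versions:

	for (i = 0; i < nr_msrs; i++)
		if (msrs[i].host == msrs[i].guest)
			clear_atomic_switch_msr(vmx, msrs[i].msr);
		else
			add_atomic_switch_msr(vmx, msrs[i].msr,
					      msrs[i].guest, msrs[i].host);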

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/events/intel/core.c      | 60 ---------------------------------------
 arch/x86/events/perf_event.h      |  6 ----
 arch/x86/include/asm/kvm_host.h   | 11 +++++++
 arch/x86/include/asm/perf_event.h | 12 --------
 arch/x86/kvm/pmu.h                |  2 ++
 arch/x86/kvm/pmu_intel.c          | 19 +++++++++++++
 arch/x86/kvm/vmx.c                |  6 ++--
 7 files changed, 35 insertions(+), 81 deletions(-)

diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index f5e5191..33c156f 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -3160,15 +3160,6 @@ static int intel_pmu_hw_config(struct perf_event *event)
 	return 0;
 }
 
-struct perf_guest_switch_msr *perf_guest_get_msrs(int *nr)
-{
-	if (x86_pmu.guest_get_msrs)
-		return x86_pmu.guest_get_msrs(nr);
-	*nr = 0;
-	return NULL;
-}
-EXPORT_SYMBOL_GPL(perf_guest_get_msrs);
-
 void x86_perf_register_pmi_callback(pmi_callback_t callback, void *opaque)
 {
 	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
@@ -3189,55 +3180,6 @@ void x86_perf_unregister_pmi_callback(void)
 }
 EXPORT_SYMBOL_GPL(x86_perf_unregister_pmi_callback);
 
-static struct perf_guest_switch_msr *intel_guest_get_msrs(int *nr)
-{
-	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
-	struct perf_guest_switch_msr *arr = cpuc->guest_switch_msrs;
-
-	arr[0].msr = MSR_CORE_PERF_GLOBAL_CTRL;
-	arr[0].host = x86_pmu.intel_ctrl & ~cpuc->intel_ctrl_guest_mask;
-	arr[0].guest = x86_pmu.intel_ctrl & ~cpuc->intel_ctrl_host_mask;
-	/*
-	 * If PMU counter has PEBS enabled it is not enough to disable counter
-	 * on a guest entry since PEBS memory write can overshoot guest entry
-	 * and corrupt guest memory. Disabling PEBS solves the problem.
-	 */
-	arr[1].msr = MSR_IA32_PEBS_ENABLE;
-	arr[1].host = cpuc->pebs_enabled;
-	arr[1].guest = 0;
-
-	*nr = 2;
-	return arr;
-}
-
-static struct perf_guest_switch_msr *core_guest_get_msrs(int *nr)
-{
-	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
-	struct perf_guest_switch_msr *arr = cpuc->guest_switch_msrs;
-	int idx;
-
-	for (idx = 0; idx < x86_pmu.num_counters; idx++)  {
-		struct perf_event *event = cpuc->events[idx];
-
-		arr[idx].msr = x86_pmu_config_addr(idx);
-		arr[idx].host = arr[idx].guest = 0;
-
-		if (!test_bit(idx, cpuc->active_mask))
-			continue;
-
-		arr[idx].host = arr[idx].guest =
-			event->hw.config | ARCH_PERFMON_EVENTSEL_ENABLE;
-
-		if (event->attr.exclude_host)
-			arr[idx].host &= ~ARCH_PERFMON_EVENTSEL_ENABLE;
-		else if (event->attr.exclude_guest)
-			arr[idx].guest &= ~ARCH_PERFMON_EVENTSEL_ENABLE;
-	}
-
-	*nr = x86_pmu.num_counters;
-	return arr;
-}
-
 static void core_pmu_enable_event(struct perf_event *event)
 {
 	if (!event->attr.exclude_host)
@@ -3641,7 +3583,6 @@ static __initconst const struct x86_pmu core_pmu = {
 	.get_event_constraints	= intel_get_event_constraints,
 	.put_event_constraints	= intel_put_event_constraints,
 	.event_constraints	= intel_core_event_constraints,
-	.guest_get_msrs		= core_guest_get_msrs,
 	.format_attrs		= intel_arch_formats_attr,
 	.events_sysfs_show	= intel_event_sysfs_show,
 
@@ -3694,7 +3635,6 @@ static __initconst const struct x86_pmu intel_pmu = {
 	.cpu_prepare		= intel_pmu_cpu_prepare,
 	.cpu_starting		= intel_pmu_cpu_starting,
 	.cpu_dying		= intel_pmu_cpu_dying,
-	.guest_get_msrs		= intel_guest_get_msrs,
 	.sched_task		= intel_pmu_sched_task,
 };
 
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 06404eb..9f818d5 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -227,7 +227,6 @@ struct cpu_hw_events {
 	 */
 	u64				intel_ctrl_guest_mask;
 	u64				intel_ctrl_host_mask;
-	struct perf_guest_switch_msr	guest_switch_msrs[X86_PMC_IDX_MAX];
 
 	/*
 	 * Intel checkpoint mask
@@ -645,11 +644,6 @@ struct x86_pmu {
 	 */
 	struct extra_reg *extra_regs;
 	unsigned int flags;
-
-	/*
-	 * Intel host/guest support (KVM)
-	 */
-	struct perf_guest_switch_msr *(*guest_get_msrs)(int *nr);
 };
 
 struct x86_perf_task_context {
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index f8bc46d..b66d164 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -453,6 +453,16 @@ struct kvm_pmc {
 	struct kvm_vcpu *vcpu;
 };
 
+/*
+ * Below MSRs are currently switched on VMX transitions:
+ * - MSR_CORE_PERF_GLOBAL_CTRL
+ */
+#define KVM_PERF_SWITCH_MSR_NUM 1
+struct kvm_perf_switch_msr {
+	unsigned int msr;
+	u64 host, guest;
+};
+
 struct kvm_pmu {
 	unsigned nr_arch_gp_counters;
 	unsigned nr_arch_fixed_counters;
@@ -470,6 +480,7 @@ struct kvm_pmu {
 	struct kvm_pmc fixed_counters[INTEL_PMC_MAX_FIXED];
 	struct irq_work irq_work;
 	u64 reprogram_pmi;
+	struct kvm_perf_switch_msr switch_msrs[KVM_PERF_SWITCH_MSR_NUM];
 };
 
 struct kvm_pmu_ops;
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index fd33688..e9cff88 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -270,24 +270,12 @@ extern unsigned long perf_misc_flags(struct pt_regs *regs);
 	);							\
 }
 
-struct perf_guest_switch_msr {
-	unsigned msr;
-	u64 host, guest;
-};
-
 typedef void (*pmi_callback_t)(void *opaque, u64 status);
 
-extern struct perf_guest_switch_msr *perf_guest_get_msrs(int *nr);
 extern void perf_get_x86_pmu_capability(struct x86_pmu_capability *cap);
 extern void perf_check_microcode(void);
 extern int x86_perf_rdpmc_index(struct perf_event *event);
 #else
-static inline struct perf_guest_switch_msr *perf_guest_get_msrs(int *nr)
-{
-	*nr = 0;
-	return NULL;
-}
-
 static inline void perf_get_x86_pmu_capability(struct x86_pmu_capability *cap)
 {
 	memset(cap, 0, sizeof(*cap));
diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
index ba8898e..b3b0238 100644
--- a/arch/x86/kvm/pmu.h
+++ b/arch/x86/kvm/pmu.h
@@ -119,6 +119,8 @@ void kvm_pmu_init(struct kvm_vcpu *vcpu);
 void kvm_pmu_destroy(struct kvm_vcpu *vcpu);
 
 bool is_vmware_backdoor_pmc(u32 pmc_idx);
+struct kvm_perf_switch_msr *intel_pmu_get_switch_msrs(struct kvm_vcpu *vcpu,
+						      u32 *nr_msrs);
 
 extern struct kvm_pmu_ops intel_pmu_ops;
 extern struct kvm_pmu_ops amd_pmu_ops;
diff --git a/arch/x86/kvm/pmu_intel.c b/arch/x86/kvm/pmu_intel.c
index 8c2d37f..2be31ad 100644
--- a/arch/x86/kvm/pmu_intel.c
+++ b/arch/x86/kvm/pmu_intel.c
@@ -501,6 +501,25 @@ static void intel_pmu_reset(struct kvm_vcpu *vcpu)
 	pmu->assigned_pmc_bitmap = 0;
 }
 
+struct kvm_perf_switch_msr *intel_pmu_get_switch_msrs(struct kvm_vcpu *vcpu,
+						      u32 *nr_msrs)
+{
+	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+	struct kvm_perf_switch_msr *arr = pmu->switch_msrs;
+
+	arr[0].msr = MSR_CORE_PERF_GLOBAL_CTRL;
+	rdmsrl(MSR_CORE_PERF_GLOBAL_CTRL, arr[0].host);
+	arr[0].host &= ~pmu->assigned_pmc_bitmap;
+	/*
+	 * The guest value will be written to the hardware msr when entering
+	 * the guest, and the bits of unassigned pmcs are not enabled.
+	 */
+	arr[0].guest = pmu->global_ctrl & pmu->assigned_pmc_bitmap;
+	*nr_msrs = KVM_PERF_SWITCH_MSR_NUM;
+
+	return arr;
+}
+
 struct kvm_pmu_ops intel_pmu_ops = {
 	.find_arch_event = intel_find_arch_event,
 	.find_fixed_event = intel_find_fixed_event,
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 4555077..714d1f3 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -11073,10 +11073,10 @@ static void vmx_cancel_injection(struct kvm_vcpu *vcpu)
 
 static void atomic_switch_perf_msrs(struct vcpu_vmx *vmx)
 {
-	int i, nr_msrs;
-	struct perf_guest_switch_msr *msrs;
+	u32 i, nr_msrs;
+	struct kvm_perf_switch_msr *msrs;
 
-	msrs = perf_guest_get_msrs(&nr_msrs);
+	msrs = intel_pmu_get_switch_msrs(&vmx->vcpu, &nr_msrs);
 
 	if (!msrs)
 		return;
-- 
2.7.4



* [PATCH v1 5/8] KVM/x86/vPMU: intel_pmu_read_pmc
  2018-11-01 10:04 [PATCH v1 0/8] Intel Virtual PMU Optimization Wei Wang
                   ` (3 preceding siblings ...)
  2018-11-01 10:04 ` [PATCH v1 4/8] KVM/x86/vPMU: support msr switch on vmx transitions Wei Wang
@ 2018-11-01 10:04 ` Wei Wang
  2018-11-01 10:04 ` [PATCH v1 6/8] KVM/x86/vPMU: remove some unused functions Wei Wang
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 17+ messages in thread
From: Wei Wang @ 2018-11-01 10:04 UTC (permalink / raw)
  To: linux-kernel, kvm, pbonzini, ak, peterz
  Cc: mingo, rkrcmar, like.xu, wei.w.wang

This patch adds the handling of guest RDPMC. If the physical counter has
been assigned to the guest, the value is read directly from the hardware;
otherwise, the virtual counter value is returned to the guest.
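
For reference, the RDPMC index encoding handled below: bit 30 of the
guest-supplied index selects the fixed-counter space, and the low bits
select the counter within that space. A brief worked example:

	/*
	 * idx = 0x00000002  ->  GP counter 2,    pmc_idx = 2
	 * idx = 0x40000001  ->  fixed counter 1, pmc_idx = INTEL_PMC_IDX_FIXED + 1
	 *
	 * If the selected counter is currently assigned to the guest, the
	 * value is read from hardware with rdpmcl(); otherwise the cached
	 * pmc->counter value is returned.
	 */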

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
---
 arch/x86/kvm/pmu.c       |  3 +++
 arch/x86/kvm/pmu.h       |  1 +
 arch/x86/kvm/pmu_intel.c | 27 +++++++++++++++++++++++++++
 3 files changed, 31 insertions(+)

diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index 58ead7d..7f2f63e 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -284,6 +284,9 @@ int kvm_pmu_rdpmc(struct kvm_vcpu *vcpu, unsigned idx, u64 *data)
 	struct kvm_pmc *pmc;
 	u64 ctr_val;
 
+	if (kvm_x86_ops->pmu_ops->read_pmc)
+		return kvm_x86_ops->pmu_ops->read_pmc(vcpu, idx, data);
+
 	if (is_vmware_backdoor_pmc(idx))
 		return kvm_pmu_rdpmc_vmware(vcpu, idx, data);
 
diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
index b3b0238..7ab85bf 100644
--- a/arch/x86/kvm/pmu.h
+++ b/arch/x86/kvm/pmu.h
@@ -30,6 +30,7 @@ struct kvm_pmu_ops {
 	bool (*is_valid_msr)(struct kvm_vcpu *vcpu, u32 msr);
 	int (*get_msr)(struct kvm_vcpu *vcpu, u32 msr, u64 *data);
 	int (*set_msr)(struct kvm_vcpu *vcpu, struct msr_data *msr_info);
+	int (*read_pmc)(struct kvm_vcpu *vcpu, unsigned int idx, u64 *data);
 	void (*refresh)(struct kvm_vcpu *vcpu);
 	void (*init)(struct kvm_vcpu *vcpu);
 	void (*reset)(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/pmu_intel.c b/arch/x86/kvm/pmu_intel.c
index 2be31ad..4c6183b 100644
--- a/arch/x86/kvm/pmu_intel.c
+++ b/arch/x86/kvm/pmu_intel.c
@@ -20,6 +20,9 @@
 #include "lapic.h"
 #include "pmu.h"
 
+#define INTEL_PMU_RDPMC_FIXED_PMC (1ULL << 30)
+#define INTEL_PMU_RDPMC_COUNTER_MASK (INTEL_PMU_RDPMC_FIXED_PMC - 1)
+
 static struct kvm_event_hw_type_mapping intel_arch_events[] = {
 	/* Index must match CPUID 0x0A.EBX bit vector */
 	[0] = { 0x3c, 0x00, PERF_COUNT_HW_CPU_CYCLES },
@@ -409,6 +412,29 @@ static int intel_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 	return 1;
 }
 
+static int intel_pmu_read_pmc(struct kvm_vcpu *vcpu, unsigned int idx,
+			      u64 *data)
+{
+	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+	unsigned int pmc_idx;
+
+	if (idx & INTEL_PMU_RDPMC_FIXED_PMC)
+		pmc_idx = INTEL_PMC_IDX_FIXED +
+			  (idx & INTEL_PMU_RDPMC_COUNTER_MASK);
+	else
+		pmc_idx = idx & INTEL_PMU_RDPMC_COUNTER_MASK;
+
+	if (test_bit(pmc_idx,
+		(unsigned long *)&pmu->assigned_pmc_bitmap))
+		rdpmcl(idx, *data);
+	else {
+		struct kvm_pmc *pmc = intel_pmc_idx_to_pmc(pmu, pmc_idx);
+		*data = pmc->counter;
+	}
+
+	return 0;
+}
+
 static void intel_pmu_refresh(struct kvm_vcpu *vcpu)
 {
 	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
@@ -530,6 +556,7 @@ struct kvm_pmu_ops intel_pmu_ops = {
 	.is_valid_msr = intel_is_valid_msr,
 	.get_msr = intel_pmu_get_msr,
 	.set_msr = intel_pmu_set_msr,
+	.read_pmc = intel_pmu_read_pmc,
 	.refresh = intel_pmu_refresh,
 	.init = intel_pmu_init,
 	.reset = intel_pmu_reset,
-- 
2.7.4



* [PATCH v1 6/8] KVM/x86/vPMU: remove some unused functions
  2018-11-01 10:04 [PATCH v1 0/8] Intel Virtual PMU Optimization Wei Wang
                   ` (4 preceding siblings ...)
  2018-11-01 10:04 ` [PATCH v1 5/8] KVM/x86/vPMU: intel_pmu_read_pmc Wei Wang
@ 2018-11-01 10:04 ` Wei Wang
  2018-11-01 10:04 ` [PATCH v1 7/8] KVM/x86/vPMU: save/restore guest perf counters on vCPU switching Wei Wang
  2018-11-01 10:04 ` [PATCH v1 8/8] KVM/x86/vPMU: return the counters to host if guest is torn down Wei Wang
  7 siblings, 0 replies; 17+ messages in thread
From: Wei Wang @ 2018-11-01 10:04 UTC (permalink / raw)
  To: linux-kernel, kvm, pbonzini, ak, peterz
  Cc: mingo, rkrcmar, like.xu, wei.w.wang

These functions were used by the old intel vPMU implementation and are no
longer needed.

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
---
 arch/x86/kvm/pmu_intel.c | 103 -----------------------------------------------
 1 file changed, 103 deletions(-)

diff --git a/arch/x86/kvm/pmu_intel.c b/arch/x86/kvm/pmu_intel.c
index 4c6183b..66d7c09 100644
--- a/arch/x86/kvm/pmu_intel.c
+++ b/arch/x86/kvm/pmu_intel.c
@@ -23,87 +23,6 @@
 #define INTEL_PMU_RDPMC_FIXED_PMC (1ULL << 30)
 #define INTEL_PMU_RDPMC_COUNTER_MASK (INTEL_PMU_RDPMC_FIXED_PMC - 1)
 
-static struct kvm_event_hw_type_mapping intel_arch_events[] = {
-	/* Index must match CPUID 0x0A.EBX bit vector */
-	[0] = { 0x3c, 0x00, PERF_COUNT_HW_CPU_CYCLES },
-	[1] = { 0xc0, 0x00, PERF_COUNT_HW_INSTRUCTIONS },
-	[2] = { 0x3c, 0x01, PERF_COUNT_HW_BUS_CYCLES  },
-	[3] = { 0x2e, 0x4f, PERF_COUNT_HW_CACHE_REFERENCES },
-	[4] = { 0x2e, 0x41, PERF_COUNT_HW_CACHE_MISSES },
-	[5] = { 0xc4, 0x00, PERF_COUNT_HW_BRANCH_INSTRUCTIONS },
-	[6] = { 0xc5, 0x00, PERF_COUNT_HW_BRANCH_MISSES },
-	[7] = { 0x00, 0x30, PERF_COUNT_HW_REF_CPU_CYCLES },
-};
-
-/* mapping between fixed pmc index and intel_arch_events array */
-static int fixed_pmc_events[] = {1, 0, 7};
-
-static void reprogram_fixed_counters(struct kvm_pmu *pmu, u64 data)
-{
-	int i;
-
-	for (i = 0; i < pmu->nr_arch_fixed_counters; i++) {
-		u8 new_ctrl = fixed_ctrl_field(data, i);
-		u8 old_ctrl = fixed_ctrl_field(pmu->fixed_ctr_ctrl, i);
-		struct kvm_pmc *pmc;
-
-		pmc = get_fixed_pmc(pmu, MSR_CORE_PERF_FIXED_CTR0 + i);
-
-		if (old_ctrl == new_ctrl)
-			continue;
-
-		reprogram_fixed_counter(pmc, new_ctrl, i);
-	}
-
-	pmu->fixed_ctr_ctrl = data;
-}
-
-/* function is called when global control register has been updated. */
-static void global_ctrl_changed(struct kvm_pmu *pmu, u64 data)
-{
-	int bit;
-	u64 diff = pmu->global_ctrl ^ data;
-
-	pmu->global_ctrl = data;
-
-	for_each_set_bit(bit, (unsigned long *)&diff, X86_PMC_IDX_MAX)
-		reprogram_counter(pmu, bit);
-}
-
-static unsigned intel_find_arch_event(struct kvm_pmu *pmu,
-				      u8 event_select,
-				      u8 unit_mask)
-{
-	int i;
-
-	for (i = 0; i < ARRAY_SIZE(intel_arch_events); i++)
-		if (intel_arch_events[i].eventsel == event_select
-		    && intel_arch_events[i].unit_mask == unit_mask
-		    && (pmu->available_event_types & (1 << i)))
-			break;
-
-	if (i == ARRAY_SIZE(intel_arch_events))
-		return PERF_COUNT_HW_MAX;
-
-	return intel_arch_events[i].event_type;
-}
-
-static unsigned intel_find_fixed_event(int idx)
-{
-	if (idx >= ARRAY_SIZE(fixed_pmc_events))
-		return PERF_COUNT_HW_MAX;
-
-	return intel_arch_events[fixed_pmc_events[idx]].event_type;
-}
-
-/* check if a PMC is enabled by comparing it with globl_ctrl bits. */
-static bool intel_pmc_is_enabled(struct kvm_pmc *pmc)
-{
-	struct kvm_pmu *pmu = pmc_to_pmu(pmc);
-
-	return test_bit(pmc->idx, (unsigned long *)&pmu->global_ctrl);
-}
-
 static struct kvm_pmc *intel_pmc_idx_to_pmc(struct kvm_pmu *pmu, int pmc_idx)
 {
 	if (pmc_idx < INTEL_PMC_IDX_FIXED)
@@ -128,23 +47,6 @@ static int intel_is_valid_msr_idx(struct kvm_vcpu *vcpu, unsigned idx)
 		(fixed && idx >= pmu->nr_arch_fixed_counters);
 }
 
-static struct kvm_pmc *intel_msr_idx_to_pmc(struct kvm_vcpu *vcpu,
-					    unsigned idx)
-{
-	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
-	bool fixed = idx & (1u << 30);
-	struct kvm_pmc *counters;
-
-	idx &= ~(3u << 30);
-	if (!fixed && idx >= pmu->nr_arch_gp_counters)
-		return NULL;
-	if (fixed && idx >= pmu->nr_arch_fixed_counters)
-		return NULL;
-	counters = fixed ? pmu->fixed_counters : pmu->gp_counters;
-
-	return &counters[idx];
-}
-
 static bool intel_is_valid_msr(struct kvm_vcpu *vcpu, u32 msr)
 {
 	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
@@ -547,11 +449,6 @@ struct kvm_perf_switch_msr *intel_pmu_get_switch_msrs(struct kvm_vcpu *vcpu,
 }
 
 struct kvm_pmu_ops intel_pmu_ops = {
-	.find_arch_event = intel_find_arch_event,
-	.find_fixed_event = intel_find_fixed_event,
-	.pmc_is_enabled = intel_pmc_is_enabled,
-	.pmc_idx_to_pmc = intel_pmc_idx_to_pmc,
-	.msr_idx_to_pmc = intel_msr_idx_to_pmc,
 	.is_valid_msr_idx = intel_is_valid_msr_idx,
 	.is_valid_msr = intel_is_valid_msr,
 	.get_msr = intel_pmu_get_msr,
-- 
2.7.4



* [PATCH v1 7/8] KVM/x86/vPMU: save/restore guest perf counters on vCPU switching
  2018-11-01 10:04 [PATCH v1 0/8] Intel Virtual PMU Optimization Wei Wang
                   ` (5 preceding siblings ...)
  2018-11-01 10:04 ` [PATCH v1 6/8] KVM/x86/vPMU: remove some unused functions Wei Wang
@ 2018-11-01 10:04 ` Wei Wang
  2018-11-01 10:04 ` [PATCH v1 8/8] KVM/x86/vPMU: return the counters to host if guest is torn down Wei Wang
  7 siblings, 0 replies; 17+ messages in thread
From: Wei Wang @ 2018-11-01 10:04 UTC (permalink / raw)
  To: linux-kernel, kvm, pbonzini, ak, peterz
  Cc: mingo, rkrcmar, like.xu, wei.w.wang

When the vCPU is scheduled in, restore the assigned perf counter states
and register the guest PMI callback. When the vCPU is scheduled out, save
the assigned perf counter states and unregister the guest PMI callback.

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
---
 arch/x86/include/asm/kvm_host.h |  1 +
 arch/x86/kvm/pmu.c              | 12 ++++++++++++
 arch/x86/kvm/pmu.h              |  4 ++++
 arch/x86/kvm/pmu_intel.c        | 38 ++++++++++++++++++++++++++++++++++++++
 arch/x86/kvm/x86.c              |  6 ++++++
 include/linux/kvm_host.h        |  1 +
 virt/kvm/kvm_main.c             |  3 +++
 7 files changed, 65 insertions(+)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index b66d164..cb1c0bf 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -474,6 +474,7 @@ struct kvm_pmu {
 	u64 counter_bitmask[2];
 	u64 global_ctrl_mask;
 	u64 assigned_pmc_bitmap;
+	u64 restore_pmc_bitmap;
 	u64 reserved_bits;
 	u8 version;
 	struct kvm_pmc gp_counters[INTEL_PMC_MAX_GENERIC];
diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index 7f2f63e..4448a88 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -340,6 +340,18 @@ void kvm_pmu_reset(struct kvm_vcpu *vcpu)
 	kvm_x86_ops->pmu_ops->reset(vcpu);
 }
 
+void kvm_pmu_sched_in(struct kvm_vcpu *vcpu)
+{
+	if (kvm_x86_ops->pmu_ops->sched_in)
+		kvm_x86_ops->pmu_ops->sched_in(vcpu);
+}
+
+void kvm_pmu_sched_out(struct kvm_vcpu *vcpu)
+{
+	if (kvm_x86_ops->pmu_ops->sched_out)
+		kvm_x86_ops->pmu_ops->sched_out(vcpu);
+}
+
 void kvm_pmu_init(struct kvm_vcpu *vcpu)
 {
 	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
index 7ab85bf..77fc973 100644
--- a/arch/x86/kvm/pmu.h
+++ b/arch/x86/kvm/pmu.h
@@ -34,6 +34,8 @@ struct kvm_pmu_ops {
 	void (*refresh)(struct kvm_vcpu *vcpu);
 	void (*init)(struct kvm_vcpu *vcpu);
 	void (*reset)(struct kvm_vcpu *vcpu);
+	void (*sched_in)(struct kvm_vcpu *vcpu);
+	void (*sched_out)(struct kvm_vcpu *vcpu);
 };
 
 static inline u64 pmc_bitmask(struct kvm_pmc *pmc)
@@ -118,6 +120,8 @@ void kvm_pmu_refresh(struct kvm_vcpu *vcpu);
 void kvm_pmu_reset(struct kvm_vcpu *vcpu);
 void kvm_pmu_init(struct kvm_vcpu *vcpu);
 void kvm_pmu_destroy(struct kvm_vcpu *vcpu);
+void kvm_pmu_sched_in(struct kvm_vcpu *vcpu);
+void kvm_pmu_sched_out(struct kvm_vcpu *vcpu);
 
 bool is_vmware_backdoor_pmc(u32 pmc_idx);
 struct kvm_perf_switch_msr *intel_pmu_get_switch_msrs(struct kvm_vcpu *vcpu,
diff --git a/arch/x86/kvm/pmu_intel.c b/arch/x86/kvm/pmu_intel.c
index 66d7c09..9eb5230 100644
--- a/arch/x86/kvm/pmu_intel.c
+++ b/arch/x86/kvm/pmu_intel.c
@@ -427,6 +427,7 @@ static void intel_pmu_reset(struct kvm_vcpu *vcpu)
 	pmu->global_status = 0;
 	pmu->global_ovf_ctrl = 0;
 	pmu->assigned_pmc_bitmap = 0;
+	pmu->restore_pmc_bitmap = 0;
 }
 
 struct kvm_perf_switch_msr *intel_pmu_get_switch_msrs(struct kvm_vcpu *vcpu,
@@ -448,6 +449,41 @@ struct kvm_perf_switch_msr *intel_pmu_get_switch_msrs(struct kvm_vcpu *vcpu,
 	return arr;
 }
 
+static void intel_pmu_inject_guest_pmi(void *opaque, u64 status)
+{
+	struct kvm_vcpu *vcpu = opaque;
+	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+
+	pmu->global_status |= status;
+	kvm_make_request(KVM_REQ_PMI, vcpu);
+}
+
+static void intel_pmu_sched_out(struct kvm_vcpu *vcpu)
+{
+	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+	u32 bit;
+
+	pmu->restore_pmc_bitmap = pmu->assigned_pmc_bitmap;
+	for_each_set_bit(bit, (unsigned long *)&pmu->restore_pmc_bitmap,
+			 X86_PMC_IDX_MAX)
+		intel_pmu_put_pmc(pmu, bit);
+
+	x86_perf_unregister_pmi_callback();
+}
+
+static void intel_pmu_sched_in(struct kvm_vcpu *vcpu)
+{
+	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+	u32 bit;
+
+	for_each_set_bit(bit, (unsigned long *)&pmu->restore_pmc_bitmap,
+			 X86_PMC_IDX_MAX)
+		intel_pmu_get_pmc(pmu, bit);
+	pmu->restore_pmc_bitmap = 0;
+
+	x86_perf_register_pmi_callback(intel_pmu_inject_guest_pmi, vcpu);
+}
+
 struct kvm_pmu_ops intel_pmu_ops = {
 	.is_valid_msr_idx = intel_is_valid_msr_idx,
 	.is_valid_msr = intel_is_valid_msr,
@@ -457,4 +493,6 @@ struct kvm_pmu_ops intel_pmu_ops = {
 	.refresh = intel_pmu_refresh,
 	.init = intel_pmu_init,
 	.reset = intel_pmu_reset,
+	.sched_out = intel_pmu_sched_out,
+	.sched_in = intel_pmu_sched_in,
 };
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 66d66d7..47308bd 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8986,9 +8986,15 @@ void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu)
 void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu)
 {
 	vcpu->arch.l1tf_flush_l1d = true;
+	kvm_pmu_sched_in(vcpu);
 	kvm_x86_ops->sched_in(vcpu, cpu);
 }
 
+void kvm_arch_sched_out(struct kvm_vcpu *vcpu)
+{
+	kvm_pmu_sched_out(vcpu);
+}
+
 int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
 {
 	if (type)
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index c926698..478d602 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -795,6 +795,7 @@ int kvm_arch_vcpu_init(struct kvm_vcpu *vcpu);
 void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu);
 
 void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu);
+void kvm_arch_sched_out(struct kvm_vcpu *vcpu);
 
 void kvm_arch_vcpu_free(struct kvm_vcpu *vcpu);
 void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 2679e47..24c2be2 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -3990,6 +3990,9 @@ static void kvm_sched_out(struct preempt_notifier *pn,
 
 	if (current->state == TASK_RUNNING)
 		vcpu->preempted = true;
+
+	kvm_arch_sched_out(vcpu);
+
 	kvm_arch_vcpu_put(vcpu);
 }
 
-- 
2.7.4



* [PATCH v1 8/8] KVM/x86/vPMU: return the counters to host if guest is torn down
  2018-11-01 10:04 [PATCH v1 0/8] Intel Virtual PMU Optimization Wei Wang
                   ` (6 preceding siblings ...)
  2018-11-01 10:04 ` [PATCH v1 7/8] KVM/x86/vPMU: save/restore guest perf counters on vCPU switching Wei Wang
@ 2018-11-01 10:04 ` Wei Wang
  7 siblings, 0 replies; 17+ messages in thread
From: Wei Wang @ 2018-11-01 10:04 UTC (permalink / raw)
  To: linux-kernel, kvm, pbonzini, ak, peterz
  Cc: mingo, rkrcmar, like.xu, wei.w.wang

Return the assigned counters to the host in case the guest is torn down
unexpectedly.

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
---
 arch/x86/kvm/pmu_intel.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/pmu_intel.c b/arch/x86/kvm/pmu_intel.c
index 9eb5230..30fa8eb 100644
--- a/arch/x86/kvm/pmu_intel.c
+++ b/arch/x86/kvm/pmu_intel.c
@@ -410,7 +410,7 @@ static void intel_pmu_reset(struct kvm_vcpu *vcpu)
 {
 	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
 	struct kvm_pmc *pmc;
-	int i;
+	u32 i, bit;
 
 	for (i = 0; i < INTEL_PMC_MAX_GENERIC; i++) {
 		pmc = &pmu->gp_counters[i];
@@ -422,6 +422,10 @@ static void intel_pmu_reset(struct kvm_vcpu *vcpu)
 		pmc->counter = 0;
 	}
 
+	for_each_set_bit(bit, (unsigned long *)&pmu->assigned_pmc_bitmap,
+			 X86_PMC_IDX_MAX)
+		intel_pmu_put_pmc(pmu, bit);
+
 	pmu->fixed_ctr_ctrl = 0;
 	pmu->global_ctrl = 0;
 	pmu->global_status = 0;
-- 
2.7.4



* Re: [PATCH v1 1/8] perf/x86: add support to mask counters from host
  2018-11-01 10:04 ` [PATCH v1 1/8] perf/x86: add support to mask counters from host Wei Wang
@ 2018-11-01 14:52   ` Peter Zijlstra
  2018-11-02  9:08     ` Wei Wang
  0 siblings, 1 reply; 17+ messages in thread
From: Peter Zijlstra @ 2018-11-01 14:52 UTC (permalink / raw)
  To: Wei Wang; +Cc: linux-kernel, kvm, pbonzini, ak, mingo, rkrcmar, like.xu

On Thu, Nov 01, 2018 at 06:04:01PM +0800, Wei Wang wrote:
> Add x86_perf_mask_perf_counters to reserve counters from the host perf
> subsystem. The masked counters will not be assigned to any host perf
> events. This can be used by the hypervisor to reserve perf counters for
> a guest to use.
> 
> This function is currently supported on Intel CPUs only, but put in x86
> perf core because the counter assignment is implemented here and we need
> to re-enable the pmu which is defined in the x86 perf core in the case
> that a counter to be masked happens to be used by the host.
> 
> Signed-off-by: Wei Wang <wei.w.wang@intel.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Andi Kleen <ak@linux.intel.com>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> ---
>  arch/x86/events/core.c            | 37 +++++++++++++++++++++++++++++++++++++
>  arch/x86/include/asm/perf_event.h |  1 +
>  2 files changed, 38 insertions(+)
> 
> diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
> index 106911b..e73135a 100644
> --- a/arch/x86/events/core.c
> +++ b/arch/x86/events/core.c
> @@ -716,6 +716,7 @@ struct perf_sched {
>  static void perf_sched_init(struct perf_sched *sched, struct event_constraint **constraints,
>  			    int num, int wmin, int wmax, int gpmax)
>  {
> +	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
>  	int idx;
>  
>  	memset(sched, 0, sizeof(*sched));
> @@ -723,6 +724,9 @@ static void perf_sched_init(struct perf_sched *sched, struct event_constraint **
>  	sched->max_weight	= wmax;
>  	sched->max_gp		= gpmax;
>  	sched->constraints	= constraints;
> +#ifdef CONFIG_CPU_SUP_INTEL
> +	sched->state.used[0]	= cpuc->intel_ctrl_guest_mask;
> +#endif

NAK.  This completely undermines the whole purpose of event scheduling.




* Re: [PATCH v1 1/8] perf/x86: add support to mask counters from host
  2018-11-01 14:52   ` Peter Zijlstra
@ 2018-11-02  9:08     ` Wei Wang
  2018-11-05  9:34       ` Peter Zijlstra
  0 siblings, 1 reply; 17+ messages in thread
From: Wei Wang @ 2018-11-02  9:08 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, kvm, pbonzini, ak, mingo, rkrcmar, like.xu

On 11/01/2018 10:52 PM, Peter Zijlstra wrote:
> On Thu, Nov 01, 2018 at 06:04:01PM +0800, Wei Wang wrote:
>> Add x86_perf_mask_perf_counters to reserve counters from the host perf
>> subsystem. The masked counters will not be assigned to any host perf
>> events. This can be used by the hypervisor to reserve perf counters for
>> a guest to use.
>>
>> This function is currently supported on Intel CPUs only, but put in x86
>> perf core because the counter assignment is implemented here and we need
>> to re-enable the pmu which is defined in the x86 perf core in the case
>> that a counter to be masked happens to be used by the host.
>>
>> Signed-off-by: Wei Wang <wei.w.wang@intel.com>
>> Cc: Peter Zijlstra <peterz@infradead.org>
>> Cc: Andi Kleen <ak@linux.intel.com>
>> Cc: Paolo Bonzini <pbonzini@redhat.com>
>> ---
>>   arch/x86/events/core.c            | 37 +++++++++++++++++++++++++++++++++++++
>>   arch/x86/include/asm/perf_event.h |  1 +
>>   2 files changed, 38 insertions(+)
>>
>> diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
>> index 106911b..e73135a 100644
>> --- a/arch/x86/events/core.c
>> +++ b/arch/x86/events/core.c
>> @@ -716,6 +716,7 @@ struct perf_sched {
>>   static void perf_sched_init(struct perf_sched *sched, struct event_constraint **constraints,
>>   			    int num, int wmin, int wmax, int gpmax)
>>   {
>> +	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
>>   	int idx;
>>   
>>   	memset(sched, 0, sizeof(*sched));
>> @@ -723,6 +724,9 @@ static void perf_sched_init(struct perf_sched *sched, struct event_constraint **
>>   	sched->max_weight	= wmax;
>>   	sched->max_gp		= gpmax;
>>   	sched->constraints	= constraints;
>> +#ifdef CONFIG_CPU_SUP_INTEL
>> +	sched->state.used[0]	= cpuc->intel_ctrl_guest_mask;
>> +#endif
> NAK.  This completely undermines the whole purpose of event scheduling.
>

Hi Peter,

Could you share more details on how it would affect the host-side event
scheduling?

We wanted to partition out the perf counters that the guest needs and
dedicate them to guest usage for that period (i.e. the guest occupation
period). The remaining counters are still schedulable for host events
(there won't be many other perf events while the guest runs; watchdog_hld
seems to be the main one that would).

Would you have any suggestions? Thanks.

Best,
Wei


* Re: [PATCH v1 1/8] perf/x86: add support to mask counters from host
  2018-11-02  9:08     ` Wei Wang
@ 2018-11-05  9:34       ` Peter Zijlstra
  2018-11-05 11:19         ` Wei Wang
  0 siblings, 1 reply; 17+ messages in thread
From: Peter Zijlstra @ 2018-11-05  9:34 UTC (permalink / raw)
  To: Wei Wang; +Cc: linux-kernel, kvm, pbonzini, ak, mingo, rkrcmar, like.xu

On Fri, Nov 02, 2018 at 05:08:31PM +0800, Wei Wang wrote:
> On 11/01/2018 10:52 PM, Peter Zijlstra wrote:
> > > @@ -723,6 +724,9 @@ static void perf_sched_init(struct perf_sched *sched, struct event_constraint **
> > >   	sched->max_weight	= wmax;
> > >   	sched->max_gp		= gpmax;
> > >   	sched->constraints	= constraints;
> > > +#ifdef CONFIG_CPU_SUP_INTEL
> > > +	sched->state.used[0]	= cpuc->intel_ctrl_guest_mask;
> > > +#endif
> > NAK.  This completely undermines the whole purpose of event scheduling.
> > 
> 
> Hi Peter,
> 
> Could you share more details how it would affect the host side event
> scheduling?

Not all counters are equal; suppose you have one of those chips that can
only do PEBS on counter 0, and then hand out 0 to the guest for some
silly event. That means nobody can use PEBS anymore.

> Would you have any suggestions?

I would suggest not to use virt in the first place of course ;-)

But whatever you do; you have to keep using host events to emulate the
guest PMU. That doesn't mean you can't improve things; that code is
quite insane from what you told earlier.


* Re: [PATCH v1 1/8] perf/x86: add support to mask counters from host
  2018-11-05  9:34       ` Peter Zijlstra
@ 2018-11-05 11:19         ` Wei Wang
  2018-11-05 12:14           ` Peter Zijlstra
  0 siblings, 1 reply; 17+ messages in thread
From: Wei Wang @ 2018-11-05 11:19 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, kvm, pbonzini, ak, mingo, rkrcmar, like.xu

On 11/05/2018 05:34 PM, Peter Zijlstra wrote:
> On Fri, Nov 02, 2018 at 05:08:31PM +0800, Wei Wang wrote:
>> On 11/01/2018 10:52 PM, Peter Zijlstra wrote:
>>>> @@ -723,6 +724,9 @@ static void perf_sched_init(struct perf_sched *sched, struct event_constraint **
>>>>    	sched->max_weight	= wmax;
>>>>    	sched->max_gp		= gpmax;
>>>>    	sched->constraints	= constraints;
>>>> +#ifdef CONFIG_CPU_SUP_INTEL
>>>> +	sched->state.used[0]	= cpuc->intel_ctrl_guest_mask;
>>>> +#endif
>>> NAK.  This completely undermines the whole purpose of event scheduling.
>>>
>> Hi Peter,
>>
>> Could you share more details how it would affect the host side event
>> scheduling?
> Not all counters are equal; suppose you have one of those chips that can
> only do PEBS on counter 0, and then hand out 0 to the guest for some
> silly event. That means nobody can use PEBS anymore.

Thanks for sharing your point.

In this example (assuming PEBS can only work with counter 0), how would
the existing approach (i.e. using host events to emulate) work?
For example, if the guest wants to use PEBS and the host also wants to use
PEBS (or other features that only counter 0 fits), I think either the guest
or the host will not work then.

With the register-level virtualization approach, we could further
support that case: if the guest requests a counter which the host
happens to be using, we can satisfy both sides by context-switching the
counter on guest/host transitions. In this case, both the guest and the
host can use counter 0. (This is really a policy choice; the current
series is guest-first, and we can change that if necessary.)
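
A minimal sketch of that counter context switch, assuming the guest was
handed GP counter <idx> (the helper names are made up; MSR_P6_EVNTSEL0
and MSR_IA32_PERFCTR0 are the architectural event-select/counter base
MSRs the kernel already defines):

/* Sketch: swap one GP counter between host and guest around VM-entry. */
struct pmc_ctx {
	u64 eventsel;
	u64 count;
};

static void pmc_ctx_save(struct pmc_ctx *c, int idx)
{
	rdmsrl(MSR_P6_EVNTSEL0 + idx, c->eventsel);
	rdmsrl(MSR_IA32_PERFCTR0 + idx, c->count);
}

static void pmc_ctx_load(struct pmc_ctx *c, int idx)
{
	wrmsrl(MSR_P6_EVNTSEL0 + idx, c->eventsel);
	wrmsrl(MSR_IA32_PERFCTR0 + idx, c->count);
}

/* On VM-entry: stash the host's programming, install the guest's;
 * the mirror image runs on VM-exit. */
static void pmc_switch_to_guest(struct pmc_ctx *host, struct pmc_ctx *guest,
				int idx)
{
	pmc_ctx_save(host, idx);
	pmc_ctx_load(guest, idx);
}

(A real implementation would use the full-width MSR_IA32_PMC0 aliases
for the counter writes and quiesce the counters globally around the
swap; this just shows the idea.)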


>> Would you have any suggestions?
> I would suggest not to use virt in the first place of course ;-)
>
> But whatever you do; you have to keep using host events to emulate the
> guest PMU. That doesn't mean you can't improve things; that code is
> quite insane from what you told earlier.

I agree that host event emulation is a functional approach, but it may
not be an efficient one (we have also received complaints about how perf
performs in guests today).
We saw similar problems with network virtualization: the more efficient
approach tends to be the one that bypasses the host network stack. Both
the network stack and the perf stack seem too heavy to be used as part
of the emulation.

I hope we can have a thorough discussion of the two approaches from the
virtualization point of view, and work out why host event emulation has
to be the way forward, or whether the register-level virtualization
model would be the better choice.

Best,
Wei




^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v1 1/8] perf/x86: add support to mask counters from host
  2018-11-05 11:19         ` Wei Wang
@ 2018-11-05 12:14           ` Peter Zijlstra
  2018-11-05 15:37             ` Wang, Wei W
  0 siblings, 1 reply; 17+ messages in thread
From: Peter Zijlstra @ 2018-11-05 12:14 UTC (permalink / raw)
  To: Wei Wang; +Cc: linux-kernel, kvm, pbonzini, ak, mingo, rkrcmar, like.xu

On Mon, Nov 05, 2018 at 07:19:01PM +0800, Wei Wang wrote:
> On 11/05/2018 05:34 PM, Peter Zijlstra wrote:
> > On Fri, Nov 02, 2018 at 05:08:31PM +0800, Wei Wang wrote:
> > > On 11/01/2018 10:52 PM, Peter Zijlstra wrote:
> > > > > @@ -723,6 +724,9 @@ static void perf_sched_init(struct perf_sched *sched, struct event_constraint **
> > > > >    	sched->max_weight	= wmax;
> > > > >    	sched->max_gp		= gpmax;
> > > > >    	sched->constraints	= constraints;
> > > > > +#ifdef CONFIG_CPU_SUP_INTEL
> > > > > +	sched->state.used[0]	= cpuc->intel_ctrl_guest_mask;
> > > > > +#endif
> > > > NAK.  This completely undermines the whole purpose of event scheduling.
> > > > 
> > > Hi Peter,
> > > 
> > > Could you share more details how it would affect the host side event
> > > scheduling?
> > Not all counters are equal; suppose you have one of those chips that can
> > only do PEBS on counter 0, and then hand out 0 to the guest for some
> > silly event. That means nobody can use PEBS anymore.
> 
> Thanks for sharing your point.
> 
> In this example (assume PEBS can only work with counter 0), how would the
> existing approach (i.e. using host event to emulate) work?
> For example, guest wants to use PEBS, host also wants to use PEBS or other
> features that only counter 0 fits, I think either guest or host will not
> work then.

The answer for PEBS is really simple; PEBS does not virtualize (Andi
tried and can tell you why; IIRC it has something to do with how the
hardware asks for a Linear Address instead of a Physical Address). So
the problem will not arise.

But there are certainly constrained events that will result in the same
problem.

The traditional approach of perf on resource contention is to share it;
you get only partial runtime and can scale up the events given the
runtime metrics provided.
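
That sharing is visible from user space: with
PERF_FORMAT_TOTAL_TIME_ENABLED/RUNNING the raw count can be scaled by
the time the event actually spent on a counter. A minimal example
(standard perf_event_open usage; error handling trimmed):

#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

struct read_fmt {
	uint64_t value;
	uint64_t time_enabled;
	uint64_t time_running;
};

int main(void)
{
	struct perf_event_attr attr;
	struct read_fmt rf;
	int fd;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = PERF_TYPE_HARDWARE;
	attr.config = PERF_COUNT_HW_CPU_CYCLES;
	attr.read_format = PERF_FORMAT_TOTAL_TIME_ENABLED |
			   PERF_FORMAT_TOTAL_TIME_RUNNING;

	fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
	sleep(1);
	read(fd, &rf, sizeof(rf));

	/* Scale up: estimate what the count would have been had the
	 * event owned a counter for the whole run. */
	if (rf.time_running)
		printf("cycles ~= %llu\n", (unsigned long long)
		       (rf.value * rf.time_enabled / rf.time_running));
	return 0;
}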

We also have perf_event_attr::pinned, which is normally only available
to root, in which case we'll end up marking any contending event to an
error state.

Neither are ideal for MSR level emulation.

> With the register level virtualization approach, we could further support
> that case: if guest requests to use a counter which host happens to be
> using, we can let host and guest both be satisfied by supporting counter
> context switching on guest/host switching. In this case, both guest and host
> can use counter 0. (I think this is actually a policy selection, the current
> series chooses to be guest first, we can further change it if necessary)

That can only work if the host counter has
perf_event_attr::exclude_guest=1, any counter without that must also
count when the guest is running.

(and, IIRC, normal perf tool events do not have that set by default)
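
For completeness, the flag in question is just a bit in perf_event_attr;
a host-side event that keeps out of the guest's way is opened roughly
like this (sketch, error handling omitted):

#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

static int open_host_only_cycles(void)
{
	struct perf_event_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = PERF_TYPE_HARDWARE;
	attr.config = PERF_COUNT_HW_CPU_CYCLES;
	attr.exclude_guest = 1;	/* don't count while a guest runs */

	/* current task, any CPU */
	return syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
}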

> > > Would you have any suggestions?
> > I would suggest not to use virt in the first place of course ;-)
> > 
> > But whatever you do; you have to keep using host events to emulate the
> > guest PMU. That doesn't mean you can't improve things; that code is
> > quite insane from what you told earlier.
> 
> I agree that the host event emulation is a functional approach, but it may
> not be an effective one (also got complaints from people about today's perf
> in the guest).
> We actually have similar problems when doing network virtualization. The
> more effective approach tends to be the one that bypasses the host network
> stack. Both the network stack and perf stack seem to be too heavy to be used
> as part of the emulation.

The thing is; you cannot do blind pass-through of the PMU, some of its
features simply do not work in a guest. Also, the host perf driver
expects certain functionality that must be respected.

Those are the constraints you have to work with.

Back when we all started down this virt rathole, I proposed people do
paravirt perf, where events would be handed to the host kernel and let
the host kernel do its normal thing. But people wanted to do the MSR
based thing because of !linux guests.

Now I don't care about virt much, but I care about !linux guests even
less.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* RE: [PATCH v1 1/8] perf/x86: add support to mask counters from host
  2018-11-05 12:14           ` Peter Zijlstra
@ 2018-11-05 15:37             ` Wang, Wei W
  2018-11-05 16:56               ` Peter Zijlstra
  2018-11-05 18:20               ` Andi Kleen
  0 siblings, 2 replies; 17+ messages in thread
From: Wang, Wei W @ 2018-11-05 15:37 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, kvm, pbonzini, ak, mingo, rkrcmar, Xu, Like

On Monday, November 5, 2018 8:14 PM, Peter Zijlstra wrote:
> The answer for PEBS is really simple; PEBS does not virtualize (Andi tried and
> can tell you why; IIRC it has something to do with how the hardware asks for
> a Linear Address instead of a Physical Address). So the problem will not arrise.
> 
> But there are certainly constrained events that will result in the same
> problem.

Yes. I just followed the PEBS assumption to discuss the counter contention that you mentioned.

> 
> The traditional approach of perf on resource contention is to share it; you
> get only partial runtime and can scale up the events given the runtime
> metrics provided.
> 
> We also have perf_event_attr::pinned, which is normally only available to
> root, in which case we'll end up marking any contending event to an error
> state.

Yes, that's one of the limitations of the existing host event emulation approach - in that case (i.e. both require counter 0 to work) the existing approach fails to make guest perf function, even when the pinned event has ".exclude_guest" set.

> 
> Neither are ideal for MSR level emulation.
> 
> That can only work if the host counter has perf_event_attr::exclude_guest=1,
> any counter without that must also count when the guest is running.
> 
> (and, IIRC, normal perf tool events do not have that set by default)

Probably no. Please see Line 81 at
https://github.com/torvalds/linux/blob/master/tools/perf/util/util.c
perf_guest by default is false, which makes "attr->exclude_guest = 1"


> The thing is; you cannot do blind pass-through of the PMU, some of its
> features simply do not work in a guest. Also, the host perf driver expects
> certain functionality that must be respected.

Actually we are not blindly assigning the perf counters. Guest works with its own complete perf stack (like the one on the host) which also has its own constraints. The perf counter that the guest is requesting from the hypervisor is the one that comes out of its event constraints (i.e. the one that will work for that feature on the guest). 

The counter is also not passed through to the guest; guest accesses to the assigned counter still exit to the hypervisor, and the hypervisor helps update the counter. 
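
Roughly, the trap-and-assist path for an event-select write looks like this (the handler and lookup names below are made up for illustration; they are not the functions in this series):

/* Guest WRMSR to an event-select MSR exits to KVM, which programs the
 * host counter assigned to this vPMC.  Names are illustrative only. */
static int vpmu_wrmsr_eventsel(struct kvm_vcpu *vcpu, u32 msr, u64 data)
{
	int vidx = msr - MSR_P6_EVNTSEL0;      /* guest's counter index */
	int hidx = to_assigned_host_counter(vcpu, vidx); /* hypothetical */

	if (hidx < 0)
		return 1;                      /* no counter assigned   */

	wrmsrl(MSR_P6_EVNTSEL0 + hidx, data);  /* program the hardware  */
	return 0;
}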

Please correct me if I misunderstood your point.

> 
> Those are the constraints you have to work with.
> 
> Back when we all started down this virt rathole, I proposed people do
> paravirt perf, where events would be handed to the host kernel and let the
> host kernel do its normal thing. But people wanted to do the MSR based
> thing because of !linux guests.

IMHO, it is worthwhile to care more about the real use case. When a user gets a virtual machine from a vendor, all he can do is run perf inside the guest. The above contention concern would not arise, because the user cannot come to the host to run perf on the virtualization software (e.g. ./perf qemu..) while at the same time running perf in the guest to cause the contention.

On the other hand, improving the user experience of running perf inside the guest by reducing the virtualization overhead would bring real benefits to that real use case.

Best,
Wei


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v1 1/8] perf/x86: add support to mask counters from host
  2018-11-05 15:37             ` Wang, Wei W
@ 2018-11-05 16:56               ` Peter Zijlstra
  2018-11-05 18:20               ` Andi Kleen
  1 sibling, 0 replies; 17+ messages in thread
From: Peter Zijlstra @ 2018-11-05 16:56 UTC (permalink / raw)
  To: Wang, Wei W; +Cc: linux-kernel, kvm, pbonzini, ak, mingo, rkrcmar, Xu, Like


Please; don't send malformed emails like this. Lines wrap at 78 chars.

On Mon, Nov 05, 2018 at 03:37:24PM +0000, Wang, Wei W wrote:
> On Monday, November 5, 2018 8:14 PM, Peter Zijlstra wrote:


> > That can only work if the host counter has perf_event_attr::exclude_guest=1,
> > any counter without that must also count when the guest is running.
> > 
> > (and, IIRC, normal perf tool events do not have that set by default)
> 
> Probably no. Please see Line 81 at
> https://github.com/torvalds/linux/blob/master/tools/perf/util/util.c
> perf_guest by default is false, which makes "attr->exclude_guest = 1"

Then you're in luck. But if the host creates an event that has
exclude_guest=0 set, it should still work.

> > The thing is; you cannot do blind pass-through of the PMU, some of its
> > features simply do not work in a guest. Also, the host perf driver expects
> > certain functionality that must be respected.
> 
> Actually we are not blindly assigning the perf counters. Guest works
> with its own complete perf stack (like the one on the host) which also
> has its own constraints. 

But it knows nothing of the host state.

> The counter is also not passed through to the guest, guest accesses to
> the assigned counter will still exit to the hypervisor, and the
> hypervisor helps update the counter. 

Yes, you have to; because the PMU doesn't properly virtualize, also
because the HV -- linux in our case -- already claimed the PMU.

So the network passthrough case you mentioned simply doesn't apply at
all. Don't bother looking at it for inspiration.

> > Those are the constraints you have to work with.
> > 
> > Back when we all started down this virt rathole, I proposed people do
> > paravirt perf, where events would be handed to the host kernel and let the
> > host kernel do its normal thing. But people wanted to do the MSR based
> > thing because of !linux guests.
> 
> IMHO, it is worthwhile to care more about the real use case. When a
> user gets a virtual machine from a vendor, all he can do is to run
> perf inside the guest. The above contention concerns would not happen,
> because the user wouldn't be able to come to the host to run perf on
> the virtualization software (e.g. ./perf qemu..) and in the meantime
> running perf in the guest to cause the contention.

That's your job. Mine is to make sure that whatever you propose fits in
the existing model and doesn't make a giant mess of things.

And for Linux guests on Linux hosts, paravirt perf still makes the most
sense to me; then you get the host scheduling all the events and
providing the guest with the proper counts/runtimes/state.
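
The shape of such a paravirt interface could be as small as one
hypercall carrying a perf_event_attr; everything below is hypothetical,
just to show how little ABI it needs:

/* Hypothetical guest<->host ABI, only to illustrate the shape of it. */
struct pv_perf_request {
	struct perf_event_attr attr;	/* what the guest wants counted */
	__u64 handle;			/* filled in by the host        */
};

enum pv_perf_op {
	PV_PERF_OPEN,	/* host creates a real, scheduled perf event  */
	PV_PERF_READ,	/* host returns count + enabled/running times */
	PV_PERF_CLOSE,	/* host releases the event                    */
};

/* Guest side: one hypercall per op; PV_PERF_NR is a made-up number. */
static long pv_perf_call(enum pv_perf_op op, struct pv_perf_request *req)
{
	return kvm_hypercall2(PV_PERF_NR, op, __pa(req));
}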

> On the other hand, when we improve the user experience of running perf
> inside the guest by reducing the virtualization overhead, that would
> bring real benefits to the real use case.

You can start to improve things by doing a less stupid implementation of
the existing code.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v1 1/8] perf/x86: add support to mask counters from host
  2018-11-05 15:37             ` Wang, Wei W
  2018-11-05 16:56               ` Peter Zijlstra
@ 2018-11-05 18:20               ` Andi Kleen
  1 sibling, 0 replies; 17+ messages in thread
From: Andi Kleen @ 2018-11-05 18:20 UTC (permalink / raw)
  To: Wang, Wei W
  Cc: Peter Zijlstra, linux-kernel, kvm, pbonzini, mingo, rkrcmar, Xu, Like


I ran into a similar problem with my PEBS virtualization patchkit.

My solution was: basically schedule as normal, but tell the scheduler to
force-allocate a counter on a specific index.  It can be done with only
a few lines of change in the scheduler code.
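
In terms of the toy scheduler sketched earlier in the thread, it amounts
to trying the requested index first and failing the schedule if that
index is taken (a sketch, not the actual perf scheduler change):

/*
 * Sketch only, extending the toy model from earlier in the thread:
 * honour a requested ("forced") index before the normal search, and
 * fail the schedule if that index is already taken.
 */
static int toy_assign_forced(uint64_t *used, uint64_t cntr_mask,
			     int forced_idx)
{
	if (forced_idx >= 0) {
		uint64_t bit = 1ULL << forced_idx;

		if (*used & bit)
			return -1;	/* contended: caller reschedules */
		*used |= bit;
		return forced_idx;
	}

	for (int i = 0; i < NR_COUNTERS; i++) {
		uint64_t bit = 1ULL << i;

		if ((cntr_mask & bit) && !(*used & bit)) {
			*used |= bit;
			return i;
		}
	}
	return -1;
}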

-Andi





^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2018-11-05 18:20 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-11-01 10:04 [PATCH v1 0/8] Intel Virtual PMU Optimization Wei Wang
2018-11-01 10:04 ` [PATCH v1 1/8] perf/x86: add support to mask counters from host Wei Wang
2018-11-01 14:52   ` Peter Zijlstra
2018-11-02  9:08     ` Wei Wang
2018-11-05  9:34       ` Peter Zijlstra
2018-11-05 11:19         ` Wei Wang
2018-11-05 12:14           ` Peter Zijlstra
2018-11-05 15:37             ` Wang, Wei W
2018-11-05 16:56               ` Peter Zijlstra
2018-11-05 18:20               ` Andi Kleen
2018-11-01 10:04 ` [PATCH v1 2/8] perf/x86/intel: add pmi callback support Wei Wang
2018-11-01 10:04 ` [PATCH v1 3/8] KVM/x86/vPMU: optimize intel vPMU Wei Wang
2018-11-01 10:04 ` [PATCH v1 4/8] KVM/x86/vPMU: support msr switch on vmx transitions Wei Wang
2018-11-01 10:04 ` [PATCH v1 5/8] KVM/x86/vPMU: intel_pmu_read_pmc Wei Wang
2018-11-01 10:04 ` [PATCH v1 6/8] KVM/x86/vPMU: remove some unused functions Wei Wang
2018-11-01 10:04 ` [PATCH v1 7/8] KVM/x86/vPMU: save/restore guest perf counters on vCPU switching Wei Wang
2018-11-01 10:04 ` [PATCH v1 8/8] KVM/x86/vPMU: return the counters to host if guest is torn down Wei Wang

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).