* [PATCH v7 00/12] Guest LBR Enabling
@ 2019-07-08  1:23 Wei Wang
  2019-07-08  1:23 ` [PATCH v7 01/12] perf/x86: fix the variable type of the LBR MSRs Wei Wang
                   ` (11 more replies)
  0 siblings, 12 replies; 33+ messages in thread
From: Wei Wang @ 2019-07-08  1:23 UTC (permalink / raw)
  To: linux-kernel, kvm, pbonzini, ak, peterz
  Cc: kan.liang, mingo, rkrcmar, like.xu, wei.w.wang, jannh,
	arei.gonglei, jmattson

Last Branch Recording (LBR) is a performance monitoring unit (PMU) feature
on Intel CPUs that captures branch related information. This patch series
enables the feature for KVM guests.

Here is a summary of the fundamental methods that we use:
1) the LBR feature is enabled per guest via QEMU setting of
   KVM_CAP_X86_GUEST_LBR;
2) the LBR stack is passed through to the guest for direct accesses after
   the guest's first access to any of the lbr related MSRs;
3) the host helps save/restore the LBR stack when the vCPU is
   scheduled out/in.

ChangeLog:
    pmu.c:
        - patch 4: remove guest_cpuid_is_intel, which can be avoided by
                   directly checking pmu_ops->lbr_enable.
    pmu_intel.c:
        - patch 10: include MSR_LBR_SELECT in the is_lbr_msr check, and
                    pass this msr through to the guest instead of trapping
                    accesses.
        - patch 12: fix the condition check of setting LBRS_FROZEN and
                    COUNTERS_FROZEN;
                    update the global_ovf_ctrl_mask for LBRS_FROZEN and
                    COUNTERS_FROZEN.

previous:
https://lkml.org/lkml/2019/6/6/133

Like Xu (1):
  KVM/x86/vPMU: Add APIs to support host save/restore the guest lbr
    stack

Wei Wang (11):
  perf/x86: fix the variable type of the LBR MSRs
  perf/x86: add a function to get the lbr stack
  KVM/x86: KVM_CAP_X86_GUEST_LBR
  KVM/x86: intel_pmu_lbr_enable
  KVM/x86/vPMU: tweak kvm_pmu_get_msr
  KVM/x86: expose MSR_IA32_PERF_CAPABILITIES to the guest
  perf/x86: no counter allocation support
  perf/x86: save/restore LBR_SELECT on vCPU switching
  KVM/x86/lbr: lazy save the guest lbr stack
  KVM/x86: remove the common handling of the debugctl msr
  KVM/VMX/vPMU: support to report GLOBAL_STATUS_LBRS_FROZEN

 arch/x86/events/core.c            |  12 ++
 arch/x86/events/intel/lbr.c       |  43 ++++-
 arch/x86/events/perf_event.h      |   6 +-
 arch/x86/include/asm/kvm_host.h   |   5 +
 arch/x86/include/asm/perf_event.h |  16 ++
 arch/x86/kvm/cpuid.c              |   2 +-
 arch/x86/kvm/pmu.c                |  29 ++-
 arch/x86/kvm/pmu.h                |  12 +-
 arch/x86/kvm/pmu_amd.c            |   7 +-
 arch/x86/kvm/vmx/pmu_intel.c      | 393 +++++++++++++++++++++++++++++++++++++-
 arch/x86/kvm/vmx/vmx.c            |   4 +-
 arch/x86/kvm/vmx/vmx.h            |   2 +
 arch/x86/kvm/x86.c                |  32 ++--
 include/linux/perf_event.h        |  13 ++
 include/uapi/linux/kvm.h          |   1 +
 kernel/events/core.c              |  37 ++--
 16 files changed, 562 insertions(+), 52 deletions(-)

-- 
2.7.4



* [PATCH v7 01/12] perf/x86: fix the variable type of the LBR MSRs
  2019-07-08  1:23 [PATCH v7 00/12] Guest LBR Enabling Wei Wang
@ 2019-07-08  1:23 ` Wei Wang
  2019-07-08  1:23 ` [PATCH v7 02/12] perf/x86: add a function to get the lbr stack Wei Wang
                   ` (10 subsequent siblings)
  11 siblings, 0 replies; 33+ messages in thread
From: Wei Wang @ 2019-07-08  1:23 UTC (permalink / raw)
  To: linux-kernel, kvm, pbonzini, ak, peterz
  Cc: kan.liang, mingo, rkrcmar, like.xu, wei.w.wang, jannh,
	arei.gonglei, jmattson

The MSR variables can be "unsigned int", which uses less memory than
the longer "unsigned long". The lbr stack size (lbr_nr) won't be a
negative number, so make it "unsigned int" as well.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/events/perf_event.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index a6ac2f4..186c1c7 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -682,8 +682,8 @@ struct x86_pmu {
 	/*
 	 * Intel LBR
 	 */
-	unsigned long	lbr_tos, lbr_from, lbr_to; /* MSR base regs       */
-	int		lbr_nr;			   /* hardware stack size */
+	unsigned int	lbr_tos, lbr_from, lbr_to,
+			lbr_nr;			   /* lbr stack and size */
 	u64		lbr_sel_mask;		   /* LBR_SELECT valid bits */
 	const int	*lbr_sel_map;		   /* lbr_select mappings */
 	bool		lbr_double_abort;	   /* duplicated lbr aborts */
-- 
2.7.4



* [PATCH v7 02/12] perf/x86: add a function to get the lbr stack
  2019-07-08  1:23 [PATCH v7 00/12] Guest LBR Enabling Wei Wang
  2019-07-08  1:23 ` [PATCH v7 01/12] perf/x86: fix the variable type of the LBR MSRs Wei Wang
@ 2019-07-08  1:23 ` Wei Wang
  2019-07-08  1:23 ` [PATCH v7 03/12] KVM/x86: KVM_CAP_X86_GUEST_LBR Wei Wang
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 33+ messages in thread
From: Wei Wang @ 2019-07-08  1:23 UTC (permalink / raw)
  To: linux-kernel, kvm, pbonzini, ak, peterz
  Cc: kan.liang, mingo, rkrcmar, like.xu, wei.w.wang, jannh,
	arei.gonglei, jmattson

The LBR stack MSRs are architecture specific. The perf subsystem has
already assigned the abstracted MSR values based on the CPU architecture.

This patch enables a caller outside the perf subsystem to get the LBR
stack info. This is useful for hypervisors to prepare the lbr feature
for the guest.
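
For illustration, a hypervisor-side caller could use the new helper
roughly as follows (a minimal sketch, not part of this patch; the error
code chosen here is arbitrary):

	struct x86_perf_lbr_stack stack;

	if (x86_perf_get_lbr_stack(&stack))
		return -ENOENT;	/* the stub below returns -1 without perf support */

	pr_info("lbr: %u entries, tos msr 0x%x, from 0x%x, to 0x%x, info 0x%x\n",
		stack.nr, stack.tos, stack.from, stack.to, stack.info);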

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
---
 arch/x86/events/intel/lbr.c       | 23 +++++++++++++++++++++++
 arch/x86/include/asm/perf_event.h | 14 ++++++++++++++
 2 files changed, 37 insertions(+)

diff --git a/arch/x86/events/intel/lbr.c b/arch/x86/events/intel/lbr.c
index 6f814a2..784642a 100644
--- a/arch/x86/events/intel/lbr.c
+++ b/arch/x86/events/intel/lbr.c
@@ -1311,3 +1311,26 @@ void intel_pmu_lbr_init_knl(void)
 	if (x86_pmu.intel_cap.lbr_format == LBR_FORMAT_LIP)
 		x86_pmu.intel_cap.lbr_format = LBR_FORMAT_EIP_FLAGS;
 }
+
+/**
+ * x86_perf_get_lbr_stack - get the lbr stack related MSRs
+ *
+ * @stack: the caller's memory to get the lbr stack
+ *
+ * Returns: 0 indicates that the lbr stack has been successfully obtained.
+ */
+int x86_perf_get_lbr_stack(struct x86_perf_lbr_stack *stack)
+{
+	stack->nr = x86_pmu.lbr_nr;
+	stack->tos = x86_pmu.lbr_tos;
+	stack->from = x86_pmu.lbr_from;
+	stack->to = x86_pmu.lbr_to;
+
+	if (x86_pmu.intel_cap.lbr_format == LBR_FORMAT_INFO)
+		stack->info = MSR_LBR_INFO_0;
+	else
+		stack->info = 0;
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(x86_perf_get_lbr_stack);
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index 1392d5e..2606100 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -318,7 +318,16 @@ struct perf_guest_switch_msr {
 	u64 host, guest;
 };
 
+struct x86_perf_lbr_stack {
+	unsigned int	nr;
+	unsigned int	tos;
+	unsigned int	from;
+	unsigned int	to;
+	unsigned int	info;
+};
+
 extern struct perf_guest_switch_msr *perf_guest_get_msrs(int *nr);
+extern int x86_perf_get_lbr_stack(struct x86_perf_lbr_stack *stack);
 extern void perf_get_x86_pmu_capability(struct x86_pmu_capability *cap);
 extern void perf_check_microcode(void);
 extern int x86_perf_rdpmc_index(struct perf_event *event);
@@ -329,6 +338,11 @@ static inline struct perf_guest_switch_msr *perf_guest_get_msrs(int *nr)
 	return NULL;
 }
 
+static inline int x86_perf_get_lbr_stack(struct x86_perf_lbr_stack *stack)
+{
+	return -1;
+}
+
 static inline void perf_get_x86_pmu_capability(struct x86_pmu_capability *cap)
 {
 	memset(cap, 0, sizeof(*cap));
-- 
2.7.4



* [PATCH v7 03/12] KVM/x86: KVM_CAP_X86_GUEST_LBR
  2019-07-08  1:23 [PATCH v7 00/12] Guest LBR Enabling Wei Wang
  2019-07-08  1:23 ` [PATCH v7 01/12] perf/x86: fix the variable type of the LBR MSRs Wei Wang
  2019-07-08  1:23 ` [PATCH v7 02/12] perf/x86: add a function to get the lbr stack Wei Wang
@ 2019-07-08  1:23 ` Wei Wang
  2019-07-08  1:23 ` [PATCH v7 04/12] KVM/x86: intel_pmu_lbr_enable Wei Wang
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 33+ messages in thread
From: Wei Wang @ 2019-07-08  1:23 UTC (permalink / raw)
  To: linux-kernel, kvm, pbonzini, ak, peterz
  Cc: kan.liang, mingo, rkrcmar, like.xu, wei.w.wang, jannh,
	arei.gonglei, jmattson

Introduce KVM_CAP_X86_GUEST_LBR to allow per-VM enabling of the guest
lbr feature.
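
From the diff below, the intended userspace usage looks roughly like
this (a hedged sketch; vm_fd and error handling are assumed, and
userspace would carry its own copy of the x86_perf_lbr_stack layout):
args[0] turns the feature on or off, and args[1] is a user pointer that
receives the host's lbr stack layout.

	struct kvm_enable_cap cap = {};
	struct x86_perf_lbr_stack stack;

	cap.cap = KVM_CAP_X86_GUEST_LBR;
	cap.args[0] = 1;				/* enable guest lbr */
	cap.args[1] = (__u64)(unsigned long)&stack;	/* filled by KVM */

	if (ioctl(vm_fd, KVM_ENABLE_CAP, &cap))
		perror("KVM_ENABLE_CAP(X86_GUEST_LBR)");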

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
---
 arch/x86/include/asm/kvm_host.h |  2 ++
 arch/x86/kvm/x86.c              | 14 ++++++++++++++
 include/uapi/linux/kvm.h        |  1 +
 3 files changed, 17 insertions(+)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 26d1eb8..8d80925 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -871,6 +871,7 @@ struct kvm_arch {
 	atomic_t vapics_in_nmi_mode;
 	struct mutex apic_map_lock;
 	struct kvm_apic_map *apic_map;
+	struct x86_perf_lbr_stack lbr_stack;
 
 	bool apic_access_page_done;
 
@@ -879,6 +880,7 @@ struct kvm_arch {
 	bool mwait_in_guest;
 	bool hlt_in_guest;
 	bool pause_in_guest;
+	bool lbr_in_guest;
 
 	unsigned long irq_sources_bitmap;
 	s64 kvmclock_offset;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 9857992..b35a118 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3086,6 +3086,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 	case KVM_CAP_GET_MSR_FEATURES:
 	case KVM_CAP_MSR_PLATFORM_INFO:
 	case KVM_CAP_EXCEPTION_PAYLOAD:
+	case KVM_CAP_X86_GUEST_LBR:
 		r = 1;
 		break;
 	case KVM_CAP_SYNC_REGS:
@@ -4622,6 +4623,19 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
 		kvm->arch.exception_payload_enabled = cap->args[0];
 		r = 0;
 		break;
+	case KVM_CAP_X86_GUEST_LBR:
+		r = -EINVAL;
+		if (cap->args[0] &&
+		    x86_perf_get_lbr_stack(&kvm->arch.lbr_stack))
+			break;
+
+		if (copy_to_user((void __user *)cap->args[1],
+				 &kvm->arch.lbr_stack,
+				 sizeof(struct x86_perf_lbr_stack)))
+			break;
+		kvm->arch.lbr_in_guest = cap->args[0];
+		r = 0;
+		break;
 	default:
 		r = -EINVAL;
 		break;
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 2fe12b4..5391cbc 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -993,6 +993,7 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_ARM_SVE 170
 #define KVM_CAP_ARM_PTRAUTH_ADDRESS 171
 #define KVM_CAP_ARM_PTRAUTH_GENERIC 172
+#define KVM_CAP_X86_GUEST_LBR 173
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
-- 
2.7.4



* [PATCH v7 04/12] KVM/x86: intel_pmu_lbr_enable
  2019-07-08  1:23 [PATCH v7 00/12] Guest LBR Enabling Wei Wang
                   ` (2 preceding siblings ...)
  2019-07-08  1:23 ` [PATCH v7 03/12] KVM/x86: KVM_CAP_X86_GUEST_LBR Wei Wang
@ 2019-07-08  1:23 ` Wei Wang
  2019-07-08  1:23 ` [PATCH v7 05/12] KVM/x86/vPMU: tweak kvm_pmu_get_msr Wei Wang
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 33+ messages in thread
From: Wei Wang @ 2019-07-08  1:23 UTC (permalink / raw)
  To: linux-kernel, kvm, pbonzini, ak, peterz
  Cc: kan.liang, mingo, rkrcmar, like.xu, wei.w.wang, jannh,
	arei.gonglei, jmattson

The lbr stack is model specific: for example, SKX has 32 lbr stack
entries while HSW has 16, so an HSW guest running on an SKX machine may
not get accurate perf results. Currently, we forbid enabling the guest
lbr when the guest and host see a different number of lbr stack entries
or different lbr stack msr indices.

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
---
 arch/x86/kvm/pmu.c           |   8 +++
 arch/x86/kvm/pmu.h           |   2 +
 arch/x86/kvm/vmx/pmu_intel.c | 136 +++++++++++++++++++++++++++++++++++++++++++
 arch/x86/kvm/x86.c           |   3 +-
 4 files changed, 147 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index 132d149..7d7ac18 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -296,6 +296,14 @@ int kvm_pmu_rdpmc(struct kvm_vcpu *vcpu, unsigned idx, u64 *data)
 	return 0;
 }
 
+bool kvm_pmu_lbr_enable(struct kvm_vcpu *vcpu)
+{
+	if (kvm_x86_ops->pmu_ops->lbr_enable)
+		return kvm_x86_ops->pmu_ops->lbr_enable(vcpu);
+
+	return false;
+}
+
 void kvm_pmu_deliver_pmi(struct kvm_vcpu *vcpu)
 {
 	if (lapic_in_kernel(vcpu))
diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
index 22dff66..c099b4b 100644
--- a/arch/x86/kvm/pmu.h
+++ b/arch/x86/kvm/pmu.h
@@ -29,6 +29,7 @@ struct kvm_pmu_ops {
 					  u64 *mask);
 	int (*is_valid_msr_idx)(struct kvm_vcpu *vcpu, unsigned idx);
 	bool (*is_valid_msr)(struct kvm_vcpu *vcpu, u32 msr);
+	bool (*lbr_enable)(struct kvm_vcpu *vcpu);
 	int (*get_msr)(struct kvm_vcpu *vcpu, u32 msr, u64 *data);
 	int (*set_msr)(struct kvm_vcpu *vcpu, struct msr_data *msr_info);
 	void (*refresh)(struct kvm_vcpu *vcpu);
@@ -107,6 +108,7 @@ void reprogram_gp_counter(struct kvm_pmc *pmc, u64 eventsel);
 void reprogram_fixed_counter(struct kvm_pmc *pmc, u8 ctrl, int fixed_idx);
 void reprogram_counter(struct kvm_pmu *pmu, int pmc_idx);
 
+bool kvm_pmu_lbr_enable(struct kvm_vcpu *vcpu);
 void kvm_pmu_deliver_pmi(struct kvm_vcpu *vcpu);
 void kvm_pmu_handle_event(struct kvm_vcpu *vcpu);
 int kvm_pmu_rdpmc(struct kvm_vcpu *vcpu, unsigned pmc, u64 *data);
diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index 68d231d..ef8ebd4 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -12,6 +12,7 @@
 #include <linux/kvm_host.h>
 #include <linux/perf_event.h>
 #include <asm/perf_event.h>
+#include <asm/intel-family.h>
 #include "x86.h"
 #include "cpuid.h"
 #include "lapic.h"
@@ -162,6 +163,140 @@ static bool intel_is_valid_msr(struct kvm_vcpu *vcpu, u32 msr)
 	return ret;
 }
 
+static bool intel_pmu_lbr_enable(struct kvm_vcpu *vcpu)
+{
+	struct kvm *kvm = vcpu->kvm;
+	u8 vcpu_model = guest_cpuid_model(vcpu);
+	unsigned int vcpu_lbr_from, vcpu_lbr_nr;
+
+	if (x86_perf_get_lbr_stack(&kvm->arch.lbr_stack))
+		return false;
+
+	if (guest_cpuid_family(vcpu) != boot_cpu_data.x86)
+		return false;
+
+	/*
+	 * It could be possible that people have vcpus of an old model run on
+	 * physical cpus of a newer model, for example a BDW guest on a SKX
+	 * machine (but not the other way around).
+	 * The BDW guest may not get accurate results on a SKX machine as it
+	 * only reads 16 entries of the lbr stack while 32 entries are
+	 * recorded. We currently forbid the lbr enabling when the vcpu
+	 * and physical cpu see different lbr stack entries or when the guest
+	 * lbr msr indices are not compatible with the host.
+	 */
+	switch (vcpu_model) {
+	case INTEL_FAM6_CORE2_MEROM:
+	case INTEL_FAM6_CORE2_MEROM_L:
+	case INTEL_FAM6_CORE2_PENRYN:
+	case INTEL_FAM6_CORE2_DUNNINGTON:
+		/* intel_pmu_lbr_init_core() */
+		vcpu_lbr_nr = 4;
+		vcpu_lbr_from = MSR_LBR_CORE_FROM;
+		break;
+	case INTEL_FAM6_NEHALEM:
+	case INTEL_FAM6_NEHALEM_EP:
+	case INTEL_FAM6_NEHALEM_EX:
+		/* intel_pmu_lbr_init_nhm() */
+		vcpu_lbr_nr = 16;
+		vcpu_lbr_from = MSR_LBR_NHM_FROM;
+		break;
+	case INTEL_FAM6_ATOM_BONNELL:
+	case INTEL_FAM6_ATOM_BONNELL_MID:
+	case INTEL_FAM6_ATOM_SALTWELL:
+	case INTEL_FAM6_ATOM_SALTWELL_MID:
+	case INTEL_FAM6_ATOM_SALTWELL_TABLET:
+		/* intel_pmu_lbr_init_atom() */
+		vcpu_lbr_nr = 8;
+		vcpu_lbr_from = MSR_LBR_CORE_FROM;
+		break;
+	case INTEL_FAM6_ATOM_SILVERMONT:
+	case INTEL_FAM6_ATOM_SILVERMONT_X:
+	case INTEL_FAM6_ATOM_SILVERMONT_MID:
+	case INTEL_FAM6_ATOM_AIRMONT:
+	case INTEL_FAM6_ATOM_AIRMONT_MID:
+		/* intel_pmu_lbr_init_slm() */
+		vcpu_lbr_nr = 8;
+		vcpu_lbr_from = MSR_LBR_CORE_FROM;
+		break;
+	case INTEL_FAM6_ATOM_GOLDMONT:
+	case INTEL_FAM6_ATOM_GOLDMONT_X:
+		/* intel_pmu_lbr_init_skl() */
+		vcpu_lbr_nr = 32;
+		vcpu_lbr_from = MSR_LBR_NHM_FROM;
+		break;
+	case INTEL_FAM6_ATOM_GOLDMONT_PLUS:
+		/* intel_pmu_lbr_init_skl() */
+		vcpu_lbr_nr = 32;
+		vcpu_lbr_from = MSR_LBR_NHM_FROM;
+		break;
+	case INTEL_FAM6_WESTMERE:
+	case INTEL_FAM6_WESTMERE_EP:
+	case INTEL_FAM6_WESTMERE_EX:
+		/* intel_pmu_lbr_init_nhm() */
+		vcpu_lbr_nr = 16;
+		vcpu_lbr_from = MSR_LBR_NHM_FROM;
+		break;
+	case INTEL_FAM6_SANDYBRIDGE:
+	case INTEL_FAM6_SANDYBRIDGE_X:
+		/* intel_pmu_lbr_init_snb() */
+		vcpu_lbr_nr = 16;
+		vcpu_lbr_from = MSR_LBR_NHM_FROM;
+		break;
+	case INTEL_FAM6_IVYBRIDGE:
+	case INTEL_FAM6_IVYBRIDGE_X:
+		/* intel_pmu_lbr_init_snb() */
+		vcpu_lbr_nr = 16;
+		vcpu_lbr_from = MSR_LBR_NHM_FROM;
+		break;
+	case INTEL_FAM6_HASWELL_CORE:
+	case INTEL_FAM6_HASWELL_X:
+	case INTEL_FAM6_HASWELL_ULT:
+	case INTEL_FAM6_HASWELL_GT3E:
+		/* intel_pmu_lbr_init_hsw() */
+		vcpu_lbr_nr = 16;
+		vcpu_lbr_from = MSR_LBR_NHM_FROM;
+		break;
+	case INTEL_FAM6_BROADWELL_CORE:
+	case INTEL_FAM6_BROADWELL_XEON_D:
+	case INTEL_FAM6_BROADWELL_GT3E:
+	case INTEL_FAM6_BROADWELL_X:
+		/* intel_pmu_lbr_init_hsw() */
+		vcpu_lbr_nr = 16;
+		vcpu_lbr_from = MSR_LBR_NHM_FROM;
+		break;
+	case INTEL_FAM6_XEON_PHI_KNL:
+	case INTEL_FAM6_XEON_PHI_KNM:
+		/* intel_pmu_lbr_init_knl() */
+		vcpu_lbr_nr = 8;
+		vcpu_lbr_from = MSR_LBR_NHM_FROM;
+		break;
+	case INTEL_FAM6_SKYLAKE_MOBILE:
+	case INTEL_FAM6_SKYLAKE_DESKTOP:
+	case INTEL_FAM6_SKYLAKE_X:
+	case INTEL_FAM6_KABYLAKE_MOBILE:
+	case INTEL_FAM6_KABYLAKE_DESKTOP:
+		/* intel_pmu_lbr_init_skl() */
+		vcpu_lbr_nr = 32;
+		vcpu_lbr_from = MSR_LBR_NHM_FROM;
+		break;
+	default:
+		vcpu_lbr_nr = 0;
+		vcpu_lbr_from = 0;
+		pr_warn("%s: vcpu model not supported %d\n", __func__,
+			vcpu_model);
+	}
+
+	if (vcpu_lbr_nr != kvm->arch.lbr_stack.nr ||
+	    vcpu_lbr_from != kvm->arch.lbr_stack.from) {
+		pr_warn("%s: vcpu model %x incompatible with pcpu %x\n",
+			__func__, vcpu_model, boot_cpu_data.x86_model);
+		return false;
+	}
+
+	return true;
+}
+
 static int intel_pmu_get_msr(struct kvm_vcpu *vcpu, u32 msr, u64 *data)
 {
 	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
@@ -361,6 +496,7 @@ struct kvm_pmu_ops intel_pmu_ops = {
 	.msr_idx_to_pmc = intel_msr_idx_to_pmc,
 	.is_valid_msr_idx = intel_is_valid_msr_idx,
 	.is_valid_msr = intel_is_valid_msr,
+	.lbr_enable = intel_pmu_lbr_enable,
 	.get_msr = intel_pmu_get_msr,
 	.set_msr = intel_pmu_set_msr,
 	.refresh = intel_pmu_refresh,
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index b35a118..5ba4e3b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4625,8 +4625,7 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
 		break;
 	case KVM_CAP_X86_GUEST_LBR:
 		r = -EINVAL;
-		if (cap->args[0] &&
-		    x86_perf_get_lbr_stack(&kvm->arch.lbr_stack))
+		if (cap->args[0] && !kvm_pmu_lbr_enable(kvm->vcpus[0]))
 			break;
 
 		if (copy_to_user((void __user *)cap->args[1],
-- 
2.7.4



* [PATCH v7 05/12] KVM/x86/vPMU: tweak kvm_pmu_get_msr
  2019-07-08  1:23 [PATCH v7 00/12] Guest LBR Enabling Wei Wang
                   ` (3 preceding siblings ...)
  2019-07-08  1:23 ` [PATCH v7 04/12] KVM/x86: intel_pmu_lbr_enable Wei Wang
@ 2019-07-08  1:23 ` Wei Wang
  2019-07-08  1:23 ` [PATCH v7 06/12] KVM/x86: expose MSR_IA32_PERF_CAPABILITIES to the guest Wei Wang
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 33+ messages in thread
From: Wei Wang @ 2019-07-08  1:23 UTC (permalink / raw)
  To: linux-kernel, kvm, pbonzini, ak, peterz
  Cc: kan.liang, mingo, rkrcmar, like.xu, wei.w.wang, jannh,
	arei.gonglei, jmattson

This patch changes kvm_pmu_get_msr to take the msr_data struct, because
the host_initiated field of the struct can be used by get_msr. This
also makes the API consistent with kvm_pmu_set_msr.

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/kvm/pmu.c           |  4 ++--
 arch/x86/kvm/pmu.h           |  4 ++--
 arch/x86/kvm/pmu_amd.c       |  7 ++++---
 arch/x86/kvm/vmx/pmu_intel.c | 19 +++++++++++--------
 arch/x86/kvm/x86.c           |  4 ++--
 5 files changed, 21 insertions(+), 17 deletions(-)

diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index 7d7ac18..ee6ed47 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -315,9 +315,9 @@ bool kvm_pmu_is_valid_msr(struct kvm_vcpu *vcpu, u32 msr)
 	return kvm_x86_ops->pmu_ops->is_valid_msr(vcpu, msr);
 }
 
-int kvm_pmu_get_msr(struct kvm_vcpu *vcpu, u32 msr, u64 *data)
+int kvm_pmu_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 {
-	return kvm_x86_ops->pmu_ops->get_msr(vcpu, msr, data);
+	return kvm_x86_ops->pmu_ops->get_msr(vcpu, msr_info);
 }
 
 int kvm_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
index c099b4b..7926b65 100644
--- a/arch/x86/kvm/pmu.h
+++ b/arch/x86/kvm/pmu.h
@@ -30,7 +30,7 @@ struct kvm_pmu_ops {
 	int (*is_valid_msr_idx)(struct kvm_vcpu *vcpu, unsigned idx);
 	bool (*is_valid_msr)(struct kvm_vcpu *vcpu, u32 msr);
 	bool (*lbr_enable)(struct kvm_vcpu *vcpu);
-	int (*get_msr)(struct kvm_vcpu *vcpu, u32 msr, u64 *data);
+	int (*get_msr)(struct kvm_vcpu *vcpu, struct msr_data *msr_info);
 	int (*set_msr)(struct kvm_vcpu *vcpu, struct msr_data *msr_info);
 	void (*refresh)(struct kvm_vcpu *vcpu);
 	void (*init)(struct kvm_vcpu *vcpu);
@@ -114,7 +114,7 @@ void kvm_pmu_handle_event(struct kvm_vcpu *vcpu);
 int kvm_pmu_rdpmc(struct kvm_vcpu *vcpu, unsigned pmc, u64 *data);
 int kvm_pmu_is_valid_msr_idx(struct kvm_vcpu *vcpu, unsigned idx);
 bool kvm_pmu_is_valid_msr(struct kvm_vcpu *vcpu, u32 msr);
-int kvm_pmu_get_msr(struct kvm_vcpu *vcpu, u32 msr, u64 *data);
+int kvm_pmu_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info);
 int kvm_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info);
 void kvm_pmu_refresh(struct kvm_vcpu *vcpu);
 void kvm_pmu_reset(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/pmu_amd.c b/arch/x86/kvm/pmu_amd.c
index c838838..4a64a3f 100644
--- a/arch/x86/kvm/pmu_amd.c
+++ b/arch/x86/kvm/pmu_amd.c
@@ -208,21 +208,22 @@ static bool amd_is_valid_msr(struct kvm_vcpu *vcpu, u32 msr)
 	return ret;
 }
 
-static int amd_pmu_get_msr(struct kvm_vcpu *vcpu, u32 msr, u64 *data)
+static int amd_pmu_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 {
 	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
 	struct kvm_pmc *pmc;
+	u32 msr = msr_info->index;
 
 	/* MSR_PERFCTRn */
 	pmc = get_gp_pmc_amd(pmu, msr, PMU_TYPE_COUNTER);
 	if (pmc) {
-		*data = pmc_read_counter(pmc);
+		msr_info->data = pmc_read_counter(pmc);
 		return 0;
 	}
 	/* MSR_EVNTSELn */
 	pmc = get_gp_pmc_amd(pmu, msr, PMU_TYPE_EVNTSEL);
 	if (pmc) {
-		*data = pmc->eventsel;
+		msr_info->data = pmc->eventsel;
 		return 0;
 	}
 
diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index ef8ebd4..1e19b01 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -297,35 +297,38 @@ static bool intel_pmu_lbr_enable(struct kvm_vcpu *vcpu)
 	return true;
 }
 
-static int intel_pmu_get_msr(struct kvm_vcpu *vcpu, u32 msr, u64 *data)
+static int intel_pmu_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 {
 	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
 	struct kvm_pmc *pmc;
+	u32 msr = msr_info->index;
 
 	switch (msr) {
 	case MSR_CORE_PERF_FIXED_CTR_CTRL:
-		*data = pmu->fixed_ctr_ctrl;
+		msr_info->data = pmu->fixed_ctr_ctrl;
 		return 0;
 	case MSR_CORE_PERF_GLOBAL_STATUS:
-		*data = pmu->global_status;
+		msr_info->data = pmu->global_status;
 		return 0;
 	case MSR_CORE_PERF_GLOBAL_CTRL:
-		*data = pmu->global_ctrl;
+		msr_info->data = pmu->global_ctrl;
 		return 0;
 	case MSR_CORE_PERF_GLOBAL_OVF_CTRL:
-		*data = pmu->global_ovf_ctrl;
+		msr_info->data = pmu->global_ovf_ctrl;
 		return 0;
 	default:
 		if ((pmc = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0))) {
 			u64 val = pmc_read_counter(pmc);
-			*data = val & pmu->counter_bitmask[KVM_PMC_GP];
+			msr_info->data =
+				val & pmu->counter_bitmask[KVM_PMC_GP];
 			return 0;
 		} else if ((pmc = get_fixed_pmc(pmu, msr))) {
 			u64 val = pmc_read_counter(pmc);
-			*data = val & pmu->counter_bitmask[KVM_PMC_FIXED];
+			msr_info->data =
+				val & pmu->counter_bitmask[KVM_PMC_FIXED];
 			return 0;
 		} else if ((pmc = get_gp_pmc(pmu, msr, MSR_P6_EVNTSEL0))) {
-			*data = pmc->eventsel;
+			msr_info->data = pmc->eventsel;
 			return 0;
 		}
 	}
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 5ba4e3b..3e34286 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2790,7 +2790,7 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 	case MSR_P6_PERFCTR0 ... MSR_P6_PERFCTR1:
 	case MSR_P6_EVNTSEL0 ... MSR_P6_EVNTSEL1:
 		if (kvm_pmu_is_valid_msr(vcpu, msr_info->index))
-			return kvm_pmu_get_msr(vcpu, msr_info->index, &msr_info->data);
+			return kvm_pmu_get_msr(vcpu, msr_info);
 		msr_info->data = 0;
 		break;
 	case MSR_IA32_UCODE_REV:
@@ -2942,7 +2942,7 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 		break;
 	default:
 		if (kvm_pmu_is_valid_msr(vcpu, msr_info->index))
-			return kvm_pmu_get_msr(vcpu, msr_info->index, &msr_info->data);
+			return kvm_pmu_get_msr(vcpu, msr_info);
 		if (!ignore_msrs) {
 			vcpu_debug_ratelimited(vcpu, "unhandled rdmsr: 0x%x\n",
 					       msr_info->index);
-- 
2.7.4



* [PATCH v7 06/12] KVM/x86: expose MSR_IA32_PERF_CAPABILITIES to the guest
  2019-07-08  1:23 [PATCH v7 00/12] Guest LBR Enabling Wei Wang
                   ` (4 preceding siblings ...)
  2019-07-08  1:23 ` [PATCH v7 05/12] KVM/x86/vPMU: tweak kvm_pmu_get_msr Wei Wang
@ 2019-07-08  1:23 ` Wei Wang
  2019-07-08  1:23 ` [PATCH v7 07/12] perf/x86: no counter allocation support Wei Wang
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 33+ messages in thread
From: Wei Wang @ 2019-07-08  1:23 UTC (permalink / raw)
  To: linux-kernel, kvm, pbonzini, ak, peterz
  Cc: kan.liang, mingo, rkrcmar, like.xu, wei.w.wang, jannh,
	arei.gonglei, jmattson

Bits [5:0] of MSR_IA32_PERF_CAPABILITIES describe the format of the
addresses stored in the LBR stack. Expose those bits to the guest when
the guest lbr feature is enabled.
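
As a minimal guest-side sketch (using the mask added below), the lbr
format can be extracted like this:

	u64 caps, lbr_format;

	rdmsrl(MSR_IA32_PERF_CAPABILITIES, caps);
	lbr_format = caps & X86_PERF_CAP_MASK_LBR_FMT;	/* bits [5:0] */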

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/include/asm/perf_event.h |  2 ++
 arch/x86/kvm/cpuid.c              |  2 +-
 arch/x86/kvm/vmx/pmu_intel.c      | 16 ++++++++++++++++
 3 files changed, 19 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index 2606100..aa77da2 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -95,6 +95,8 @@
 #define PEBS_DATACFG_LBRS	BIT_ULL(3)
 #define PEBS_DATACFG_LBR_SHIFT	24
 
+#define X86_PERF_CAP_MASK_LBR_FMT			0x3f
+
 /*
  * Intel "Architectural Performance Monitoring" CPUID
  * detection/enumeration details:
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 4992e7c..4b9e713 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -361,7 +361,7 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
 		F(XMM3) | F(PCLMULQDQ) | 0 /* DTES64, MONITOR */ |
 		0 /* DS-CPL, VMX, SMX, EST */ |
 		0 /* TM2 */ | F(SSSE3) | 0 /* CNXT-ID */ | 0 /* Reserved */ |
-		F(FMA) | F(CX16) | 0 /* xTPR Update, PDCM */ |
+		F(FMA) | F(CX16) | 0 /* xTPR Update */ | F(PDCM) |
 		F(PCID) | 0 /* Reserved, DCA */ | F(XMM4_1) |
 		F(XMM4_2) | F(X2APIC) | F(MOVBE) | F(POPCNT) |
 		0 /* Reserved*/ | F(AES) | F(XSAVE) | 0 /* OSXSAVE */ | F(AVX) |
diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index 1e19b01..09ae6ff 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -151,6 +151,7 @@ static bool intel_is_valid_msr(struct kvm_vcpu *vcpu, u32 msr)
 	case MSR_CORE_PERF_GLOBAL_STATUS:
 	case MSR_CORE_PERF_GLOBAL_CTRL:
 	case MSR_CORE_PERF_GLOBAL_OVF_CTRL:
+	case MSR_IA32_PERF_CAPABILITIES:
 		ret = pmu->version > 1;
 		break;
 	default:
@@ -316,6 +317,19 @@ static int intel_pmu_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 	case MSR_CORE_PERF_GLOBAL_OVF_CTRL:
 		msr_info->data = pmu->global_ovf_ctrl;
 		return 0;
+	case MSR_IA32_PERF_CAPABILITIES: {
+		u64 data;
+
+		if (!boot_cpu_has(X86_FEATURE_PDCM) ||
+		    (!msr_info->host_initiated &&
+		     !guest_cpuid_has(vcpu, X86_FEATURE_PDCM)))
+			return 1;
+		data = native_read_msr(MSR_IA32_PERF_CAPABILITIES);
+		msr_info->data = 0;
+		if (vcpu->kvm->arch.lbr_in_guest)
+			msr_info->data |= (data & X86_PERF_CAP_MASK_LBR_FMT);
+		return 0;
+	}
 	default:
 		if ((pmc = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0))) {
 			u64 val = pmc_read_counter(pmc);
@@ -374,6 +388,8 @@ static int intel_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 			return 0;
 		}
 		break;
+	case MSR_IA32_PERF_CAPABILITIES:
+		return 1; /* RO MSR */
 	default:
 		if ((pmc = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0))) {
 			if (msr_info->host_initiated)
-- 
2.7.4



* [PATCH v7 07/12] perf/x86: no counter allocation support
  2019-07-08  1:23 [PATCH v7 00/12] Guest LBR Enabling Wei Wang
                   ` (5 preceding siblings ...)
  2019-07-08  1:23 ` [PATCH v7 06/12] KVM/x86: expose MSR_IA32_PERF_CAPABILITIES to the guest Wei Wang
@ 2019-07-08  1:23 ` Wei Wang
  2019-07-08 14:29   ` Peter Zijlstra
  2019-07-08  1:23 ` [PATCH v7 08/12] KVM/x86/vPMU: Add APIs to support host save/restore the guest lbr stack Wei Wang
                   ` (4 subsequent siblings)
  11 siblings, 1 reply; 33+ messages in thread
From: Wei Wang @ 2019-07-08  1:23 UTC (permalink / raw)
  To: linux-kernel, kvm, pbonzini, ak, peterz
  Cc: kan.liang, mingo, rkrcmar, like.xu, wei.w.wang, jannh,
	arei.gonglei, jmattson

In some cases, an event may be created without needing a counter
allocation. For example, an lbr event may be created by the host
only to help save/restore the lbr stack on vCPU context switches.

This patch adds a new interface to allow users to create a perf event
without the need for counter assignment.
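
For example, a caller that only needs the event for context-switch
bookkeeping would do something like the following (a sketch; the attr
setup is omitted here and shown for the lbr case in patch 08):

	struct perf_event *event;

	/* need_counter=false: no counter is collected or assigned */
	event = perf_event_create(&attr, -1, current, NULL, NULL, false);
	if (IS_ERR(event))
		return PTR_ERR(event);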

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
---
 arch/x86/events/core.c     | 12 ++++++++++++
 include/linux/perf_event.h | 13 +++++++++++++
 kernel/events/core.c       | 37 +++++++++++++++++++++++++------------
 3 files changed, 50 insertions(+), 12 deletions(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index f315425..eebbd65 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -410,6 +410,9 @@ int x86_setup_perfctr(struct perf_event *event)
 	struct hw_perf_event *hwc = &event->hw;
 	u64 config;
 
+	if (is_no_counter_event(event))
+		return 0;
+
 	if (!is_sampling_event(event)) {
 		hwc->sample_period = x86_pmu.max_period;
 		hwc->last_period = hwc->sample_period;
@@ -1248,6 +1251,12 @@ static int x86_pmu_add(struct perf_event *event, int flags)
 	hwc = &event->hw;
 
 	n0 = cpuc->n_events;
+
+	if (is_no_counter_event(event)) {
+		n = n0;
+		goto done_collect;
+	}
+
 	ret = n = collect_events(cpuc, event, false);
 	if (ret < 0)
 		goto out;
@@ -1422,6 +1431,9 @@ static void x86_pmu_del(struct perf_event *event, int flags)
 	if (cpuc->txn_flags & PERF_PMU_TXN_ADD)
 		goto do_del;
 
+	if (is_no_counter_event(event))
+		goto do_del;
+
 	/*
 	 * Not a TXN, therefore cleanup properly.
 	 */
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 0ab99c7..19e6593 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -528,6 +528,7 @@ typedef void (*perf_overflow_handler_t)(struct perf_event *,
  */
 #define PERF_EV_CAP_SOFTWARE		BIT(0)
 #define PERF_EV_CAP_READ_ACTIVE_PKG	BIT(1)
+#define PERF_EV_CAP_NO_COUNTER		BIT(2)
 
 #define SWEVENT_HLIST_BITS		8
 #define SWEVENT_HLIST_SIZE		(1 << SWEVENT_HLIST_BITS)
@@ -895,6 +896,13 @@ extern int perf_event_refresh(struct perf_event *event, int refresh);
 extern void perf_event_update_userpage(struct perf_event *event);
 extern int perf_event_release_kernel(struct perf_event *event);
 extern struct perf_event *
+perf_event_create(struct perf_event_attr *attr,
+		  int cpu,
+		  struct task_struct *task,
+		  perf_overflow_handler_t overflow_handler,
+		  void *context,
+		  bool counter_assignment);
+extern struct perf_event *
 perf_event_create_kernel_counter(struct perf_event_attr *attr,
 				int cpu,
 				struct task_struct *task,
@@ -1032,6 +1040,11 @@ static inline bool is_sampling_event(struct perf_event *event)
 	return event->attr.sample_period != 0;
 }
 
+static inline bool is_no_counter_event(struct perf_event *event)
+{
+	return !!(event->event_caps & PERF_EV_CAP_NO_COUNTER);
+}
+
 /*
  * Return 1 for a software event, 0 for a hardware event
  */
diff --git a/kernel/events/core.c b/kernel/events/core.c
index abbd4b3..70884df 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -11162,18 +11162,10 @@ SYSCALL_DEFINE5(perf_event_open,
 	return err;
 }
 
-/**
- * perf_event_create_kernel_counter
- *
- * @attr: attributes of the counter to create
- * @cpu: cpu in which the counter is bound
- * @task: task to profile (NULL for percpu)
- */
-struct perf_event *
-perf_event_create_kernel_counter(struct perf_event_attr *attr, int cpu,
-				 struct task_struct *task,
-				 perf_overflow_handler_t overflow_handler,
-				 void *context)
+struct perf_event *perf_event_create(struct perf_event_attr *attr, int cpu,
+				     struct task_struct *task,
+				     perf_overflow_handler_t overflow_handler,
+				     void *context, bool need_counter)
 {
 	struct perf_event_context *ctx;
 	struct perf_event *event;
@@ -11193,6 +11185,9 @@ perf_event_create_kernel_counter(struct perf_event_attr *attr, int cpu,
 	/* Mark owner so we could distinguish it from user events. */
 	event->owner = TASK_TOMBSTONE;
 
+	if (!need_counter)
+		event->event_caps |= PERF_EV_CAP_NO_COUNTER;
+
 	ctx = find_get_context(event->pmu, task, event);
 	if (IS_ERR(ctx)) {
 		err = PTR_ERR(ctx);
@@ -11241,6 +11236,24 @@ perf_event_create_kernel_counter(struct perf_event_attr *attr, int cpu,
 err:
 	return ERR_PTR(err);
 }
+EXPORT_SYMBOL_GPL(perf_event_create);
+
+/**
+ * perf_event_create_kernel_counter
+ *
+ * @attr: attributes of the counter to create
+ * @cpu: cpu in which the counter is bound
+ * @task: task to profile (NULL for percpu)
+ */
+struct perf_event *
+perf_event_create_kernel_counter(struct perf_event_attr *attr, int cpu,
+				 struct task_struct *task,
+				 perf_overflow_handler_t overflow_handler,
+				 void *context)
+{
+	return perf_event_create(attr, cpu, task, overflow_handler,
+				 context, true);
+}
 EXPORT_SYMBOL_GPL(perf_event_create_kernel_counter);
 
 void perf_pmu_migrate_context(struct pmu *pmu, int src_cpu, int dst_cpu)
-- 
2.7.4



* [PATCH v7 08/12] KVM/x86/vPMU: Add APIs to support host save/restore the guest lbr stack
  2019-07-08  1:23 [PATCH v7 00/12] Guest LBR Enabling Wei Wang
                   ` (6 preceding siblings ...)
  2019-07-08  1:23 ` [PATCH v7 07/12] perf/x86: no counter allocation support Wei Wang
@ 2019-07-08  1:23 ` Wei Wang
  2019-07-08 14:48   ` Peter Zijlstra
  2019-07-09 11:45   ` Peter Zijlstra
  2019-07-08  1:23 ` [PATCH v7 09/12] perf/x86: save/restore LBR_SELECT on vCPU switching Wei Wang
                   ` (3 subsequent siblings)
  11 siblings, 2 replies; 33+ messages in thread
From: Wei Wang @ 2019-07-08  1:23 UTC (permalink / raw)
  To: linux-kernel, kvm, pbonzini, ak, peterz
  Cc: kan.liang, mingo, rkrcmar, like.xu, wei.w.wang, jannh,
	arei.gonglei, jmattson

From: Like Xu <like.xu@intel.com>

This patch adds support to enable/disable the host side save/restore
for the guest lbr stack on vCPU switching. To enable that, the host
creates a perf event for the vCPU, and the event attributes are set
to the user callstack lbr mode so that all the conditions are met in
the host perf subsystem to save the lbr stack on task switching.

The host side lbr perf event is created only for the purpose of saving
and restoring the lbr stack. There is no need to enable the lbr
functionality for this perf event, because the feature is essentially
used in the vCPU. So perf_event_create is invoked with need_counter=false
so that no counter is assigned for the perf event.

The vcpu_lbr field is added to cpuc to indicate that the lbr perf event
is used by the vCPU only for context switching. When the perf subsystem
handles this event (e.g. lbr enable or read lbr stack on PMI) and finds
the field set, it simply returns.

Signed-off-by: Like Xu <like.xu@intel.com>
Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
---
 arch/x86/events/intel/lbr.c     | 13 +++++++--
 arch/x86/events/perf_event.h    |  1 +
 arch/x86/include/asm/kvm_host.h |  1 +
 arch/x86/kvm/pmu.h              |  3 ++
 arch/x86/kvm/vmx/pmu_intel.c    | 61 +++++++++++++++++++++++++++++++++++++++++
 5 files changed, 76 insertions(+), 3 deletions(-)

diff --git a/arch/x86/events/intel/lbr.c b/arch/x86/events/intel/lbr.c
index 784642a..118764b 100644
--- a/arch/x86/events/intel/lbr.c
+++ b/arch/x86/events/intel/lbr.c
@@ -462,6 +462,9 @@ void intel_pmu_lbr_add(struct perf_event *event)
 	if (!x86_pmu.lbr_nr)
 		return;
 
+	if (event->attr.exclude_guest && is_no_counter_event(event))
+		cpuc->vcpu_lbr = 1;
+
 	cpuc->br_sel = event->hw.branch_reg.reg;
 
 	if (branch_user_callstack(cpuc->br_sel) && event->ctx->task_ctx_data) {
@@ -509,6 +512,9 @@ void intel_pmu_lbr_del(struct perf_event *event)
 		task_ctx->lbr_callstack_users--;
 	}
 
+	if (event->attr.exclude_guest && is_no_counter_event(event))
+		cpuc->vcpu_lbr = 0;
+
 	if (x86_pmu.intel_cap.pebs_baseline && event->attr.precise_ip > 0)
 		cpuc->lbr_pebs_users--;
 	cpuc->lbr_users--;
@@ -521,7 +527,7 @@ void intel_pmu_lbr_enable_all(bool pmi)
 {
 	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
 
-	if (cpuc->lbr_users)
+	if (cpuc->lbr_users && !cpuc->vcpu_lbr)
 		__intel_pmu_lbr_enable(pmi);
 }
 
@@ -529,7 +535,7 @@ void intel_pmu_lbr_disable_all(void)
 {
 	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
 
-	if (cpuc->lbr_users)
+	if (cpuc->lbr_users && !cpuc->vcpu_lbr)
 		__intel_pmu_lbr_disable();
 }
 
@@ -669,7 +675,8 @@ void intel_pmu_lbr_read(void)
 	 * This could be smarter and actually check the event,
 	 * but this simple approach seems to work for now.
 	 */
-	if (!cpuc->lbr_users || cpuc->lbr_users == cpuc->lbr_pebs_users)
+	if (!cpuc->lbr_users || cpuc->vcpu_lbr ||
+	    cpuc->lbr_users == cpuc->lbr_pebs_users)
 		return;
 
 	if (x86_pmu.intel_cap.lbr_format == LBR_FORMAT_32)
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 186c1c7..86605d1 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -238,6 +238,7 @@ struct cpu_hw_events {
 	/*
 	 * Intel LBR bits
 	 */
+	u8				vcpu_lbr;
 	int				lbr_users;
 	int				lbr_pebs_users;
 	struct perf_branch_stack	lbr_stack;
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 8d80925..79e9c92 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -474,6 +474,7 @@ struct kvm_pmu {
 	struct kvm_pmc fixed_counters[INTEL_PMC_MAX_FIXED];
 	struct irq_work irq_work;
 	u64 reprogram_pmi;
+	struct perf_event *vcpu_lbr_event;
 };
 
 struct kvm_pmu_ops;
diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
index 7926b65..384a0b7 100644
--- a/arch/x86/kvm/pmu.h
+++ b/arch/x86/kvm/pmu.h
@@ -123,6 +123,9 @@ void kvm_pmu_destroy(struct kvm_vcpu *vcpu);
 
 bool is_vmware_backdoor_pmc(u32 pmc_idx);
 
+extern int intel_pmu_enable_save_guest_lbr(struct kvm_vcpu *vcpu);
+extern void intel_pmu_disable_save_guest_lbr(struct kvm_vcpu *vcpu);
+
 extern struct kvm_pmu_ops intel_pmu_ops;
 extern struct kvm_pmu_ops amd_pmu_ops;
 #endif /* __KVM_X86_PMU_H */
diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index 09ae6ff..24a544e 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -507,6 +507,67 @@ static void intel_pmu_reset(struct kvm_vcpu *vcpu)
 		pmu->global_ovf_ctrl = 0;
 }
 
+int intel_pmu_enable_save_guest_lbr(struct kvm_vcpu *vcpu)
+{
+	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+	struct perf_event *event;
+
+	/*
+	 * The main purpose of this perf event is to have the host perf core
+	 * help save/restore the guest lbr stack on vcpu switching. There is
+	 * no perf counter allocated for the event.
+	 *
+	 * About the attr:
+	 * exclude_guest: set to true to indicate that the event runs on the
+	 *                host only.
+	 * pinned:        set to false, so that the FLEXIBLE events will not
+	 *                be rescheduled for this event which actually doesn't
+	 *                need a perf counter.
+	 * config:        Actually this field won't be used by the perf core
+	 *                as this event doesn't have a perf counter.
+	 * sample_period: Same as above.
+	 * sample_type:   tells the perf core that it is an lbr event.
+	 * branch_sample_type: tells the perf core that the lbr event works in
+	 *                the user callstack mode so that the lbr stack will be
+	 *                saved/restored on vCPU switching.
+	 */
+	struct perf_event_attr attr = {
+		.type = PERF_TYPE_RAW,
+		.size = sizeof(attr),
+		.exclude_guest = true,
+		.pinned = false,
+		.config = 0,
+		.sample_period = 0,
+		.sample_type = PERF_SAMPLE_BRANCH_STACK,
+		.branch_sample_type = PERF_SAMPLE_BRANCH_CALL_STACK |
+				      PERF_SAMPLE_BRANCH_USER,
+	};
+
+	if (pmu->vcpu_lbr_event)
+		return 0;
+
+	event = perf_event_create(&attr, -1, current, NULL, NULL, false);
+	if (IS_ERR(event)) {
+		pr_err("%s: failed %ld\n", __func__, PTR_ERR(event));
+		return -ENOENT;
+	}
+	pmu->vcpu_lbr_event = event;
+
+	return 0;
+}
+
+void intel_pmu_disable_save_guest_lbr(struct kvm_vcpu *vcpu)
+{
+	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+	struct perf_event *event = pmu->vcpu_lbr_event;
+
+	if (!event)
+		return;
+
+	perf_event_release_kernel(event);
+	pmu->vcpu_lbr_event = NULL;
+}
+
 struct kvm_pmu_ops intel_pmu_ops = {
 	.find_arch_event = intel_find_arch_event,
 	.find_fixed_event = intel_find_fixed_event,
-- 
2.7.4



* [PATCH v7 09/12] perf/x86: save/restore LBR_SELECT on vCPU switching
  2019-07-08  1:23 [PATCH v7 00/12] Guest LBR Enabling Wei Wang
                   ` (7 preceding siblings ...)
  2019-07-08  1:23 ` [PATCH v7 08/12] KVM/x86/vPMU: Add APIs to support host save/restore the guest lbr stack Wei Wang
@ 2019-07-08  1:23 ` Wei Wang
  2019-07-08  1:23 ` [PATCH v7 10/12] KVM/x86/lbr: lazy save the guest lbr stack Wei Wang
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 33+ messages in thread
From: Wei Wang @ 2019-07-08  1:23 UTC (permalink / raw)
  To: linux-kernel, kvm, pbonzini, ak, peterz
  Cc: kan.liang, mingo, rkrcmar, like.xu, wei.w.wang, jannh,
	arei.gonglei, jmattson

The vCPU lbr event relies on the host to save/restore all the lbr
related MSRs. So add the LBR_SELECT save/restore to the related
functions for the vCPU case.

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/events/intel/lbr.c  | 7 +++++++
 arch/x86/events/perf_event.h | 1 +
 2 files changed, 8 insertions(+)

diff --git a/arch/x86/events/intel/lbr.c b/arch/x86/events/intel/lbr.c
index 118764b..4861a9d 100644
--- a/arch/x86/events/intel/lbr.c
+++ b/arch/x86/events/intel/lbr.c
@@ -383,6 +383,9 @@ static void __intel_pmu_lbr_restore(struct x86_perf_task_context *task_ctx)
 
 	wrmsrl(x86_pmu.lbr_tos, tos);
 	task_ctx->lbr_stack_state = LBR_NONE;
+
+	if (cpuc->vcpu_lbr)
+		wrmsrl(MSR_LBR_SELECT, task_ctx->lbr_sel);
 }
 
 static void __intel_pmu_lbr_save(struct x86_perf_task_context *task_ctx)
@@ -409,6 +412,10 @@ static void __intel_pmu_lbr_save(struct x86_perf_task_context *task_ctx)
 		if (x86_pmu.intel_cap.lbr_format == LBR_FORMAT_INFO)
 			rdmsrl(MSR_LBR_INFO_0 + lbr_idx, task_ctx->lbr_info[i]);
 	}
+
+	if (cpuc->vcpu_lbr)
+		rdmsrl(MSR_LBR_SELECT, task_ctx->lbr_sel);
+
 	task_ctx->valid_lbrs = i;
 	task_ctx->tos = tos;
 	task_ctx->lbr_stack_state = LBR_VALID;
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 86605d1..e37ff82 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -721,6 +721,7 @@ struct x86_perf_task_context {
 	u64 lbr_from[MAX_LBR_ENTRIES];
 	u64 lbr_to[MAX_LBR_ENTRIES];
 	u64 lbr_info[MAX_LBR_ENTRIES];
+	u64 lbr_sel;
 	int tos;
 	int valid_lbrs;
 	int lbr_callstack_users;
-- 
2.7.4



* [PATCH v7 10/12] KVM/x86/lbr: lazy save the guest lbr stack
  2019-07-08  1:23 [PATCH v7 00/12] Guest LBR Enabling Wei Wang
                   ` (8 preceding siblings ...)
  2019-07-08  1:23 ` [PATCH v7 09/12] perf/x86: save/restore LBR_SELECT on vCPU switching Wei Wang
@ 2019-07-08  1:23 ` Wei Wang
  2019-07-08 14:53   ` Peter Zijlstra
  2019-07-08  1:23 ` [PATCH v7 11/12] KVM/x86: remove the common handling of the debugctl msr Wei Wang
  2019-07-08  1:23 ` [PATCH v7 12/12] KVM/VMX/vPMU: support to report GLOBAL_STATUS_LBRS_FROZEN Wei Wang
  11 siblings, 1 reply; 33+ messages in thread
From: Wei Wang @ 2019-07-08  1:23 UTC (permalink / raw)
  To: linux-kernel, kvm, pbonzini, ak, peterz
  Cc: kan.liang, mingo, rkrcmar, like.xu, wei.w.wang, jannh,
	arei.gonglei, jmattson

When the vCPU is scheduled in:
- if the lbr feature was used in the last vCPU time slice, set the lbr
  stack MSRs to be intercepted, so that the host can capture whether the
  lbr feature will be used in this time slice;
- if the lbr feature wasn't used in the last vCPU time slice, disable
  the host side save/restore of the guest lbr stack.

Upon the first access to one of the lbr related MSRs (since the vCPU was
scheduled in):
- record that the guest has used the lbr;
- create a host perf event to help save/restore the guest lbr stack;
- pass the lbr stack MSRs through to the guest (a condensed sketch of
  this flow follows the list).
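
A condensed restatement of the two paths implemented below (no new
logic; the names are the functions added in this patch and in patch 08):

	/* first guest access to an lbr msr since sched-in */
	pmu->lbr_used = true;
	intel_pmu_set_intercept_for_lbr_msrs(vcpu, false);	/* pass through */
	intel_pmu_enable_save_guest_lbr(vcpu);

	/* on the next sched-in */
	if (pmu->lbr_used) {
		pmu->lbr_used = false;
		intel_pmu_set_intercept_for_lbr_msrs(vcpu, true); /* re-arm */
	} else if (pmu->vcpu_lbr_event &&
		   !(vmcs_read64(GUEST_IA32_DEBUGCTL) & DEBUGCTLMSR_LBR)) {
		intel_pmu_disable_save_guest_lbr(vcpu);
	}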

Suggested-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
---
 arch/x86/include/asm/kvm_host.h |   2 +
 arch/x86/kvm/pmu.c              |   6 ++
 arch/x86/kvm/pmu.h              |   2 +
 arch/x86/kvm/vmx/pmu_intel.c    | 141 ++++++++++++++++++++++++++++++++++++++++
 arch/x86/kvm/vmx/vmx.c          |   4 +-
 arch/x86/kvm/vmx/vmx.h          |   2 +
 arch/x86/kvm/x86.c              |   2 +
 7 files changed, 157 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 79e9c92..cf8996e 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -469,6 +469,8 @@ struct kvm_pmu {
 	u64 global_ctrl_mask;
 	u64 global_ovf_ctrl_mask;
 	u64 reserved_bits;
+	/* Indicate if the lbr msrs were accessed in this vCPU time slice */
+	bool lbr_used;
 	u8 version;
 	struct kvm_pmc gp_counters[INTEL_PMC_MAX_GENERIC];
 	struct kvm_pmc fixed_counters[INTEL_PMC_MAX_FIXED];
diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index ee6ed47..323bb45 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -325,6 +325,12 @@ int kvm_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 	return kvm_x86_ops->pmu_ops->set_msr(vcpu, msr_info);
 }
 
+void kvm_pmu_sched_in(struct kvm_vcpu *vcpu, int cpu)
+{
+	if (kvm_x86_ops->pmu_ops->sched_in)
+		kvm_x86_ops->pmu_ops->sched_in(vcpu, cpu);
+}
+
 /* refresh PMU settings. This function generally is called when underlying
  * settings are changed (such as changes of PMU CPUID by guest VMs), which
  * should rarely happen.
diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
index 384a0b7..cadf91a 100644
--- a/arch/x86/kvm/pmu.h
+++ b/arch/x86/kvm/pmu.h
@@ -32,6 +32,7 @@ struct kvm_pmu_ops {
 	bool (*lbr_enable)(struct kvm_vcpu *vcpu);
 	int (*get_msr)(struct kvm_vcpu *vcpu, struct msr_data *msr_info);
 	int (*set_msr)(struct kvm_vcpu *vcpu, struct msr_data *msr_info);
+	void (*sched_in)(struct kvm_vcpu *vcpu, int cpu);
 	void (*refresh)(struct kvm_vcpu *vcpu);
 	void (*init)(struct kvm_vcpu *vcpu);
 	void (*reset)(struct kvm_vcpu *vcpu);
@@ -116,6 +117,7 @@ int kvm_pmu_is_valid_msr_idx(struct kvm_vcpu *vcpu, unsigned idx);
 bool kvm_pmu_is_valid_msr(struct kvm_vcpu *vcpu, u32 msr);
 int kvm_pmu_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info);
 int kvm_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info);
+void kvm_pmu_sched_in(struct kvm_vcpu *vcpu, int cpu);
 void kvm_pmu_refresh(struct kvm_vcpu *vcpu);
 void kvm_pmu_reset(struct kvm_vcpu *vcpu);
 void kvm_pmu_init(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index 24a544e..fd09777 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -13,10 +13,12 @@
 #include <linux/perf_event.h>
 #include <asm/perf_event.h>
 #include <asm/intel-family.h>
+#include <asm/vmx.h>
 #include "x86.h"
 #include "cpuid.h"
 #include "lapic.h"
 #include "pmu.h"
+#include "vmx.h"
 
 static struct kvm_event_hw_type_mapping intel_arch_events[] = {
 	/* Index must match CPUID 0x0A.EBX bit vector */
@@ -141,6 +143,18 @@ static struct kvm_pmc *intel_msr_idx_to_pmc(struct kvm_vcpu *vcpu,
 	return &counters[idx];
 }
 
+static inline bool is_lbr_msr(struct kvm_vcpu *vcpu, u32 index)
+{
+	struct x86_perf_lbr_stack *stack = &vcpu->kvm->arch.lbr_stack;
+	int nr = stack->nr;
+
+	return !!(index == MSR_LBR_SELECT ||
+		  index == stack->tos ||
+		  (index >= stack->from && index < stack->from + nr) ||
+		  (index >= stack->to && index < stack->to + nr) ||
+		  (stack->info && index >= stack->info &&
+		   index < stack->info + nr));
+}
+
 static bool intel_is_valid_msr(struct kvm_vcpu *vcpu, u32 msr)
 {
 	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
@@ -152,9 +166,12 @@ static bool intel_is_valid_msr(struct kvm_vcpu *vcpu, u32 msr)
 	case MSR_CORE_PERF_GLOBAL_CTRL:
 	case MSR_CORE_PERF_GLOBAL_OVF_CTRL:
 	case MSR_IA32_PERF_CAPABILITIES:
+	case MSR_IA32_DEBUGCTLMSR:
 		ret = pmu->version > 1;
 		break;
 	default:
+		if (is_lbr_msr(vcpu, msr))
+			return pmu->version > 1;
 		ret = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0) ||
 			get_gp_pmc(pmu, msr, MSR_P6_EVNTSEL0) ||
 			get_fixed_pmc(pmu, msr);
@@ -298,6 +315,104 @@ static bool intel_pmu_lbr_enable(struct kvm_vcpu *vcpu)
 	return true;
 }
 
+static void intel_pmu_set_intercept_for_lbr_msrs(struct kvm_vcpu *vcpu,
+						 bool set)
+{
+	unsigned long *msr_bitmap = to_vmx(vcpu)->vmcs01.msr_bitmap;
+	struct x86_perf_lbr_stack *stack = &vcpu->kvm->arch.lbr_stack;
+	int nr = stack->nr;
+	int i;
+
+	vmx_set_intercept_for_msr(msr_bitmap, MSR_LBR_SELECT,
+				  MSR_TYPE_RW, set);
+	vmx_set_intercept_for_msr(msr_bitmap, stack->tos,
+				  MSR_TYPE_RW, set);
+	for (i = 0; i < nr; i++) {
+		vmx_set_intercept_for_msr(msr_bitmap, stack->from + i,
+					  MSR_TYPE_RW, set);
+		vmx_set_intercept_for_msr(msr_bitmap, stack->to + i,
+					  MSR_TYPE_RW, set);
+		if (stack->info)
+			vmx_set_intercept_for_msr(msr_bitmap, stack->info + i,
+						  MSR_TYPE_RW, set);
+	}
+}
+
+static bool intel_pmu_get_lbr_msr(struct kvm_vcpu *vcpu,
+			      struct msr_data *msr_info)
+{
+	u32 index = msr_info->index;
+	bool ret = false;
+
+	switch (index) {
+	case MSR_IA32_DEBUGCTLMSR:
+		msr_info->data = vmcs_read64(GUEST_IA32_DEBUGCTL);
+		ret = true;
+		break;
+	default:
+		if (is_lbr_msr(vcpu, index)) {
+			ret = true;
+			rdmsrl(index, msr_info->data);
+		}
+	}
+
+	return ret;
+}
+
+static bool intel_pmu_set_lbr_msr(struct kvm_vcpu *vcpu,
+				  struct msr_data *msr_info)
+{
+	u32 index = msr_info->index;
+	u64 data = msr_info->data;
+	bool ret = false;
+
+	switch (index) {
+	case MSR_IA32_DEBUGCTLMSR:
+		ret = true;
+		/*
+		 * Currently, only FREEZE_LBRS_ON_PMI and DEBUGCTLMSR_LBR are
+		 * supported.
+		 */
+		data &= (DEBUGCTLMSR_FREEZE_LBRS_ON_PMI | DEBUGCTLMSR_LBR);
+		vmcs_write64(GUEST_IA32_DEBUGCTL, data);
+		break;
+	default:
+		if (is_lbr_msr(vcpu, index)) {
+			ret = true;
+			wrmsrl(index, data);
+		}
+	}
+
+	return ret;
+}
+
+static bool intel_pmu_access_lbr_msr(struct kvm_vcpu *vcpu,
+				 struct msr_data *msr_info,
+				 bool set)
+{
+	bool ret = false;
+
+	/*
+	 * Some userspace implementations (e.g. QEMU) expect the msrs to
+	 * always be accessible.
+	 */
+	if (!msr_info->host_initiated && !vcpu->kvm->arch.lbr_in_guest)
+		return false;
+
+	if (set)
+		ret = intel_pmu_set_lbr_msr(vcpu, msr_info);
+	else
+		ret = intel_pmu_get_lbr_msr(vcpu, msr_info);
+
+	if (ret && !vcpu->arch.pmu.lbr_used) {
+		vcpu->arch.pmu.lbr_used = true;
+		intel_pmu_set_intercept_for_lbr_msrs(vcpu, false);
+		intel_pmu_enable_save_guest_lbr(vcpu);
+	}
+
+	return ret;
+}
+
 static int intel_pmu_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 {
 	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
@@ -344,6 +459,8 @@ static int intel_pmu_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 		} else if ((pmc = get_gp_pmc(pmu, msr, MSR_P6_EVNTSEL0))) {
 			msr_info->data = pmc->eventsel;
 			return 0;
+		} else if (intel_pmu_access_lbr_msr(vcpu, msr_info, false)) {
+			return 0;
 		}
 	}
 
@@ -407,12 +524,33 @@ static int intel_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 				reprogram_gp_counter(pmc, data);
 				return 0;
 			}
+		} else if (intel_pmu_access_lbr_msr(vcpu, msr_info, true)) {
+			return 0;
 		}
 	}
 
 	return 1;
 }
 
+static void intel_pmu_sched_in(struct kvm_vcpu *vcpu, int cpu)
+{
+	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+	u64 guest_debugctl;
+
+	if (pmu->lbr_used) {
+		pmu->lbr_used = false;
+		intel_pmu_set_intercept_for_lbr_msrs(vcpu, true);
+	} else if (pmu->vcpu_lbr_event) {
+		/*
+		 * The lbr feature wasn't used during the last vCPU time
+		 * slice, so it's time to disable the host side save/restore.
+		 */
+		guest_debugctl = vmcs_read64(GUEST_IA32_DEBUGCTL);
+		if (!(guest_debugctl & DEBUGCTLMSR_LBR))
+			intel_pmu_disable_save_guest_lbr(vcpu);
+	}
+}
+
 static void intel_pmu_refresh(struct kvm_vcpu *vcpu)
 {
 	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
@@ -505,6 +643,8 @@ static void intel_pmu_reset(struct kvm_vcpu *vcpu)
 
 	pmu->fixed_ctr_ctrl = pmu->global_ctrl = pmu->global_status =
 		pmu->global_ovf_ctrl = 0;
+
+	intel_pmu_disable_save_guest_lbr(vcpu);
 }
 
 int intel_pmu_enable_save_guest_lbr(struct kvm_vcpu *vcpu)
@@ -579,6 +719,7 @@ struct kvm_pmu_ops intel_pmu_ops = {
 	.lbr_enable = intel_pmu_lbr_enable,
 	.get_msr = intel_pmu_get_msr,
 	.set_msr = intel_pmu_set_msr,
+	.sched_in = intel_pmu_sched_in,
 	.refresh = intel_pmu_refresh,
 	.init = intel_pmu_init,
 	.reset = intel_pmu_reset,
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index d98eac3..5dc0dcf 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -3523,8 +3523,8 @@ static __always_inline void vmx_enable_intercept_for_msr(unsigned long *msr_bitm
 	}
 }
 
-static __always_inline void vmx_set_intercept_for_msr(unsigned long *msr_bitmap,
-			     			      u32 msr, int type, bool value)
+void vmx_set_intercept_for_msr(unsigned long *msr_bitmap, u32 msr, int type,
+			       bool value)
 {
 	if (value)
 		vmx_enable_intercept_for_msr(msr_bitmap, msr, type);
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index 61128b4..ed94909 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -317,6 +317,8 @@ void vmx_update_msr_bitmap(struct kvm_vcpu *vcpu);
 bool vmx_get_nmi_mask(struct kvm_vcpu *vcpu);
 void vmx_set_nmi_mask(struct kvm_vcpu *vcpu, bool masked);
 void vmx_set_virtual_apic_mode(struct kvm_vcpu *vcpu);
+void vmx_set_intercept_for_msr(unsigned long *msr_bitmap, u32 msr, int type,
+			       bool value);
 struct shared_msr_entry *find_msr_entry(struct vcpu_vmx *vmx, u32 msr);
 void pt_update_intercept_for_msr(struct vcpu_vmx *vmx);
 void vmx_update_host_rsp(struct vcpu_vmx *vmx, unsigned long host_rsp);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 3e34286..a3bc7f2 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9233,6 +9233,8 @@ void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu)
 void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu)
 {
 	vcpu->arch.l1tf_flush_l1d = true;
+
+	kvm_pmu_sched_in(vcpu, cpu);
 	kvm_x86_ops->sched_in(vcpu, cpu);
 }
 
-- 
2.7.4



* [PATCH v7 11/12] KVM/x86: remove the common handling of the debugctl msr
  2019-07-08  1:23 [PATCH v7 00/12] Guest LBR Enabling Wei Wang
                   ` (9 preceding siblings ...)
  2019-07-08  1:23 ` [PATCH v7 10/12] KVM/x86/lbr: lazy save the guest lbr stack Wei Wang
@ 2019-07-08  1:23 ` Wei Wang
  2019-07-08  1:23 ` [PATCH v7 12/12] KVM/VMX/vPMU: support to report GLOBAL_STATUS_LBRS_FROZEN Wei Wang
  11 siblings, 0 replies; 33+ messages in thread
From: Wei Wang @ 2019-07-08  1:23 UTC (permalink / raw)
  To: linux-kernel, kvm, pbonzini, ak, peterz
  Cc: kan.liang, mingo, rkrcmar, like.xu, wei.w.wang, jannh,
	arei.gonglei, jmattson

The debugctl msr is not identical on AMD and Intel CPUs; for example,
FREEZE_LBRS_ON_PMI is supported by Intel CPUs only. This msr is now
handled separately in svm.c and intel_pmu.c, so remove the common
debugctl msr handling code from kvm_get/set_msr_common.

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
---
 arch/x86/kvm/x86.c | 13 -------------
 1 file changed, 13 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index a3bc7f2..08aa34b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2513,18 +2513,6 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 			return 1;
 		}
 		break;
-	case MSR_IA32_DEBUGCTLMSR:
-		if (!data) {
-			/* We support the non-activated case already */
-			break;
-		} else if (data & ~(DEBUGCTLMSR_LBR | DEBUGCTLMSR_BTF)) {
-			/* Values other than LBR and BTF are vendor-specific,
-			   thus reserved and should throw a #GP */
-			return 1;
-		}
-		vcpu_unimpl(vcpu, "%s: MSR_IA32_DEBUGCTLMSR 0x%llx, nop\n",
-			    __func__, data);
-		break;
 	case 0x200 ... 0x2ff:
 		return kvm_mtrr_set_msr(vcpu, msr, data);
 	case MSR_IA32_APICBASE:
@@ -2766,7 +2754,6 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 	switch (msr_info->index) {
 	case MSR_IA32_PLATFORM_ID:
 	case MSR_IA32_EBL_CR_POWERON:
-	case MSR_IA32_DEBUGCTLMSR:
 	case MSR_IA32_LASTBRANCHFROMIP:
 	case MSR_IA32_LASTBRANCHTOIP:
 	case MSR_IA32_LASTINTFROMIP:
-- 
2.7.4



* [PATCH v7 12/12] KVM/VMX/vPMU: support to report GLOBAL_STATUS_LBRS_FROZEN
  2019-07-08  1:23 [PATCH v7 00/12] Guest LBR Enabling Wei Wang
                   ` (10 preceding siblings ...)
  2019-07-08  1:23 ` [PATCH v7 11/12] KVM/x86: remove the common handling of the debugctl msr Wei Wang
@ 2019-07-08  1:23 ` Wei Wang
  2019-07-08 15:09   ` Peter Zijlstra
  11 siblings, 1 reply; 33+ messages in thread
From: Wei Wang @ 2019-07-08  1:23 UTC (permalink / raw)
  To: linux-kernel, kvm, pbonzini, ak, peterz
  Cc: kan.liang, mingo, rkrcmar, like.xu, wei.w.wang, jannh,
	arei.gonglei, jmattson

This patch enables the LBR related features in Arch v4 in advance,
though the current vPMU only has v2 support. Other arch v4 related
support will be enabled later in another series.

Arch v4 supports streamlined Freeze_LBR_on_PMI. According to the SDM,
the LBR_FRZ bit is set in the global status when debugctl.freeze_lbr_on_pmi
is set and a PMI is generated. The CTR_FRZ bit is set when
debugctl.freeze_perfmon_on_pmi is set and a PMI is generated.

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Kan Liang <kan.liang@intel.com>
---
 arch/x86/kvm/pmu.c           | 11 +++++++++--
 arch/x86/kvm/pmu.h           |  1 +
 arch/x86/kvm/vmx/pmu_intel.c | 20 ++++++++++++++++++++
 3 files changed, 30 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index 323bb45..89bff8f 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -52,6 +52,13 @@ static void kvm_pmi_trigger_fn(struct irq_work *irq_work)
 	kvm_pmu_deliver_pmi(vcpu);
 }
 
+static void kvm_perf_set_global_status(struct kvm_pmu *pmu, u8 idx)
+{
+	__set_bit(idx, (unsigned long *)&pmu->global_status);
+	if (kvm_x86_ops->pmu_ops->set_global_status)
+		kvm_x86_ops->pmu_ops->set_global_status(pmu, idx);
+}
+
 static void kvm_perf_overflow(struct perf_event *perf_event,
 			      struct perf_sample_data *data,
 			      struct pt_regs *regs)
@@ -61,7 +68,7 @@ static void kvm_perf_overflow(struct perf_event *perf_event,
 
 	if (!test_and_set_bit(pmc->idx,
 			      (unsigned long *)&pmu->reprogram_pmi)) {
-		__set_bit(pmc->idx, (unsigned long *)&pmu->global_status);
+		kvm_perf_set_global_status(pmu, pmc->idx);
 		kvm_make_request(KVM_REQ_PMU, pmc->vcpu);
 	}
 }
@@ -75,7 +82,7 @@ static void kvm_perf_overflow_intr(struct perf_event *perf_event,
 
 	if (!test_and_set_bit(pmc->idx,
 			      (unsigned long *)&pmu->reprogram_pmi)) {
-		__set_bit(pmc->idx, (unsigned long *)&pmu->global_status);
+		kvm_perf_set_global_status(pmu, pmc->idx);
 		kvm_make_request(KVM_REQ_PMU, pmc->vcpu);
 
 		/*
diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
index cadf91a..408ddc2 100644
--- a/arch/x86/kvm/pmu.h
+++ b/arch/x86/kvm/pmu.h
@@ -24,6 +24,7 @@ struct kvm_pmu_ops {
 				    u8 unit_mask);
 	unsigned (*find_fixed_event)(int idx);
 	bool (*pmc_is_enabled)(struct kvm_pmc *pmc);
+	void (*set_global_status)(struct kvm_pmu *pmu, u8 idx);
 	struct kvm_pmc *(*pmc_idx_to_pmc)(struct kvm_pmu *pmu, int pmc_idx);
 	struct kvm_pmc *(*msr_idx_to_pmc)(struct kvm_vcpu *vcpu, unsigned idx,
 					  u64 *mask);
diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index fd09777..6f74b69 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -413,6 +413,22 @@ static bool intel_pmu_access_lbr_msr(struct kvm_vcpu *vcpu,
 	return ret;
 }
 
+static void intel_pmu_set_global_status(struct kvm_pmu *pmu, u8 idx)
+{
+	u64 guest_debugctl;
+
+	if (pmu->version >= 4) {
+		guest_debugctl = vmcs_read64(GUEST_IA32_DEBUGCTL);
+
+		if (guest_debugctl & DEBUGCTLMSR_FREEZE_LBRS_ON_PMI)
+			__set_bit(GLOBAL_STATUS_LBRS_FROZEN,
+				  (unsigned long *)&pmu->global_status);
+		if (guest_debugctl & DEBUGCTLMSR_FREEZE_PERFMON_ON_PMI)
+			__set_bit(GLOBAL_STATUS_COUNTERS_FROZEN,
+				  (unsigned long *)&pmu->global_status);
+	}
+}
+
 static int intel_pmu_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 {
 	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
@@ -597,6 +613,9 @@ static void intel_pmu_refresh(struct kvm_vcpu *vcpu)
 	pmu->global_ovf_ctrl_mask = pmu->global_ctrl_mask
 			& ~(MSR_CORE_PERF_GLOBAL_OVF_CTRL_OVF_BUF |
 			    MSR_CORE_PERF_GLOBAL_OVF_CTRL_COND_CHGD);
+	if (pmu->version >= 4)
+		pmu->global_ovf_ctrl_mask &= ~(GLOBAL_STATUS_LBRS_FROZEN |
+					       GLOBAL_STATUS_COUNTERS_FROZEN);
 	if (kvm_x86_ops->pt_supported())
 		pmu->global_ovf_ctrl_mask &=
 				~MSR_CORE_PERF_GLOBAL_OVF_CTRL_TRACE_TOPA_PMI;
@@ -711,6 +730,7 @@ void intel_pmu_disable_save_guest_lbr(struct kvm_vcpu *vcpu)
 struct kvm_pmu_ops intel_pmu_ops = {
 	.find_arch_event = intel_find_arch_event,
 	.find_fixed_event = intel_find_fixed_event,
+	.set_global_status = intel_pmu_set_global_status,
 	.pmc_is_enabled = intel_pmc_is_enabled,
 	.pmc_idx_to_pmc = intel_pmc_idx_to_pmc,
 	.msr_idx_to_pmc = intel_msr_idx_to_pmc,
-- 
2.7.4



* Re: [PATCH v7 07/12] perf/x86: no counter allocation support
  2019-07-08  1:23 ` [PATCH v7 07/12] perf/x86: no counter allocation support Wei Wang
@ 2019-07-08 14:29   ` Peter Zijlstra
  2019-07-09  2:58     ` Wei Wang
  0 siblings, 1 reply; 33+ messages in thread
From: Peter Zijlstra @ 2019-07-08 14:29 UTC (permalink / raw)
  To: Wei Wang
  Cc: linux-kernel, kvm, pbonzini, ak, kan.liang, mingo, rkrcmar,
	like.xu, jannh, arei.gonglei, jmattson

On Mon, Jul 08, 2019 at 09:23:14AM +0800, Wei Wang wrote:
> In some cases, an event may be created without needing a counter
> allocation. For example, an lbr event may be created by the host
> only to help save/restore the lbr stack on vCPU context switching.
> 
> This patch adds a new interface to allow users to create a perf event
> without the need of counter assignment.
> 
> Signed-off-by: Wei Wang <wei.w.wang@intel.com>
> Cc: Andi Kleen <ak@linux.intel.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> ---

I _really_ hate this one.

>  arch/x86/events/core.c     | 12 ++++++++++++
>  include/linux/perf_event.h | 13 +++++++++++++
>  kernel/events/core.c       | 37 +++++++++++++++++++++++++------------
>  3 files changed, 50 insertions(+), 12 deletions(-)
> 
> diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
> index f315425..eebbd65 100644
> --- a/arch/x86/events/core.c
> +++ b/arch/x86/events/core.c
> @@ -410,6 +410,9 @@ int x86_setup_perfctr(struct perf_event *event)
>  	struct hw_perf_event *hwc = &event->hw;
>  	u64 config;
>  
> +	if (is_no_counter_event(event))
> +		return 0;
> +
>  	if (!is_sampling_event(event)) {
>  		hwc->sample_period = x86_pmu.max_period;
>  		hwc->last_period = hwc->sample_period;
> @@ -1248,6 +1251,12 @@ static int x86_pmu_add(struct perf_event *event, int flags)
>  	hwc = &event->hw;
>  
>  	n0 = cpuc->n_events;
> +
> +	if (is_no_counter_event(event)) {
> +		n = n0;
> +		goto done_collect;
> +	}
> +
>  	ret = n = collect_events(cpuc, event, false);
>  	if (ret < 0)
>  		goto out;
> @@ -1422,6 +1431,9 @@ static void x86_pmu_del(struct perf_event *event, int flags)
>  	if (cpuc->txn_flags & PERF_PMU_TXN_ADD)
>  		goto do_del;
>  
> +	if (is_no_counter_event(event))
> +		goto do_del;
> +
>  	/*
>  	 * Not a TXN, therefore cleanup properly.
>  	 */

That's truly an abomination.

> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index 0ab99c7..19e6593 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -528,6 +528,7 @@ typedef void (*perf_overflow_handler_t)(struct perf_event *,
>   */
>  #define PERF_EV_CAP_SOFTWARE		BIT(0)
>  #define PERF_EV_CAP_READ_ACTIVE_PKG	BIT(1)
> +#define PERF_EV_CAP_NO_COUNTER		BIT(2)
>  
>  #define SWEVENT_HLIST_BITS		8
>  #define SWEVENT_HLIST_SIZE		(1 << SWEVENT_HLIST_BITS)
> @@ -895,6 +896,13 @@ extern int perf_event_refresh(struct perf_event *event, int refresh);
>  extern void perf_event_update_userpage(struct perf_event *event);
>  extern int perf_event_release_kernel(struct perf_event *event);
>  extern struct perf_event *
> +perf_event_create(struct perf_event_attr *attr,
> +		  int cpu,
> +		  struct task_struct *task,
> +		  perf_overflow_handler_t overflow_handler,
> +		  void *context,
> +		  bool counter_assignment);
> +extern struct perf_event *
>  perf_event_create_kernel_counter(struct perf_event_attr *attr,
>  				int cpu,
>  				struct task_struct *task,

Why the heck are you creating this wrapper nonsense?


* Re: [PATCH v7 08/12] KVM/x86/vPMU: Add APIs to support host save/restore the guest lbr stack
  2019-07-08  1:23 ` [PATCH v7 08/12] KVM/x86/vPMU: Add APIs to support host save/restore the guest lbr stack Wei Wang
@ 2019-07-08 14:48   ` Peter Zijlstra
  2019-07-09  3:04     ` Wei Wang
  2019-07-09 11:45   ` Peter Zijlstra
  1 sibling, 1 reply; 33+ messages in thread
From: Peter Zijlstra @ 2019-07-08 14:48 UTC (permalink / raw)
  To: Wei Wang
  Cc: linux-kernel, kvm, pbonzini, ak, kan.liang, mingo, rkrcmar,
	like.xu, jannh, arei.gonglei, jmattson

On Mon, Jul 08, 2019 at 09:23:15AM +0800, Wei Wang wrote:
> From: Like Xu <like.xu@intel.com>
> 
> This patch adds support to enable/disable the host side save/restore

This patch should be disqualified on Changelog alone...

  Documentation/process/submitting-patches.rst:instead of "[This patch] makes xyzzy do frotz" or "[I] changed xyzzy

> for the guest lbr stack on vCPU switching. To enable that, the host
> creates a perf event for the vCPU, and the event attributes are set
> to the user callstack mode lbr so that all the conditions are met in
> the host perf subsystem to save the lbr stack on task switching.
> 
> The host side lbr perf event is created only for the purpose of saving
> and restoring the lbr stack. There is no need to enable the lbr
> functionality for this perf event, because the feature is essentially
> used in the vCPU. So perf_event_create is invoked with need_counter=false
> to get no counter assigned for the perf event.
> 
> The vcpu_lbr field is added to cpuc, to indicate if the lbr perf event is
> used by the vCPU only for context switching. When the perf subsystem
> handles this event (e.g. lbr enable or read lbr stack on PMI) and finds
> it's non-zero, it simply returns.

*WHY* does the host need to save/restore? Why not make VMENTER/VMEXIT do
this?

Many of these patches don't explain why things are done; that's a
problem.


* Re: [PATCH v7 10/12] KVM/x86/lbr: lazy save the guest lbr stack
  2019-07-08  1:23 ` [PATCH v7 10/12] KVM/x86/lbr: lazy save the guest lbr stack Wei Wang
@ 2019-07-08 14:53   ` Peter Zijlstra
  2019-07-08 15:11     ` Andi Kleen
  2019-07-09  3:14     ` Wei Wang
  0 siblings, 2 replies; 33+ messages in thread
From: Peter Zijlstra @ 2019-07-08 14:53 UTC (permalink / raw)
  To: Wei Wang
  Cc: linux-kernel, kvm, pbonzini, ak, kan.liang, mingo, rkrcmar,
	like.xu, jannh, arei.gonglei, jmattson

On Mon, Jul 08, 2019 at 09:23:17AM +0800, Wei Wang wrote:
> When the vCPU is scheduled in:
> - if the lbr feature was used in the last vCPU time slice, set the lbr
>   stack to be interceptible, so that the host can capture whether the
>   lbr feature will be used in this time slice;
> - if the lbr feature wasn't used in the last vCPU time slice, disable
>   the vCPU support of the guest lbr switching.
> 
> Upon the first access to one of the lbr related MSRs (since the vCPU was
> scheduled in):
> - record that the guest has used the lbr;
> - create a host perf event to help save/restore the guest lbr stack;
> - pass the stack through to the guest.

I don't understand a word of that.

Who cares if the LBR MSRs are touched; the guest expects up-to-date
values when it does read them.



* Re: [PATCH v7 12/12] KVM/VMX/vPMU: support to report GLOBAL_STATUS_LBRS_FROZEN
  2019-07-08  1:23 ` [PATCH v7 12/12] KVM/VMX/vPMU: support to report GLOBAL_STATUS_LBRS_FROZEN Wei Wang
@ 2019-07-08 15:09   ` Peter Zijlstra
  2019-07-09  3:24     ` Wei Wang
  0 siblings, 1 reply; 33+ messages in thread
From: Peter Zijlstra @ 2019-07-08 15:09 UTC (permalink / raw)
  To: Wei Wang
  Cc: linux-kernel, kvm, pbonzini, ak, kan.liang, mingo, rkrcmar,
	like.xu, jannh, arei.gonglei, jmattson

On Mon, Jul 08, 2019 at 09:23:19AM +0800, Wei Wang wrote:
> This patch enables the LBR related features in Arch v4 in advance,
> though the current vPMU only has v2 support. Other arch v4 related
> support will be enabled later in another series.
> 
> Arch v4 supports streamlined Freeze_LBR_on_PMI. According to the SDM,
> the LBR_FRZ bit is set to global status when debugctl.freeze_lbr_on_pmi
> has been set and a PMI is generated. The CTR_FRZ bit is set when
> debugctl.freeze_perfmon_on_pmi is set and a PMI is generated.

(that's still a misnomer; it is: freeze_perfmon_on_overflow)

Why?

Who uses that v4 crud? It's broken. It loses events between overflow
and PMI.


* Re: [PATCH v7 10/12] KVM/x86/lbr: lazy save the guest lbr stack
  2019-07-08 14:53   ` Peter Zijlstra
@ 2019-07-08 15:11     ` Andi Kleen
  2019-07-09 11:39       ` Peter Zijlstra
  2019-07-09  3:14     ` Wei Wang
  1 sibling, 1 reply; 33+ messages in thread
From: Andi Kleen @ 2019-07-08 15:11 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Wei Wang, linux-kernel, kvm, pbonzini, kan.liang, mingo, rkrcmar,
	like.xu, jannh, arei.gonglei, jmattson

> I don't understand a word of that.
> 
> Who cares if the LBR MSRs are touched; the guest expects up-to-date
> > values when it does read them.

This is only for when the LBRs are disabled in the guest.

It doesn't make sense to constantly save/restore disabled LBRs, which
would be a large overhead for guests that don't use LBRs. 

When the LBRs are enabled they always need to be saved/restored as you
say.

-Andi


* Re: [PATCH v7 07/12] perf/x86: no counter allocation support
  2019-07-08 14:29   ` Peter Zijlstra
@ 2019-07-09  2:58     ` Wei Wang
  2019-07-09  9:43       ` Peter Zijlstra
  0 siblings, 1 reply; 33+ messages in thread
From: Wei Wang @ 2019-07-09  2:58 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, kvm, pbonzini, ak, kan.liang, mingo, rkrcmar,
	like.xu, jannh, arei.gonglei, jmattson

On 07/08/2019 10:29 PM, Peter Zijlstra wrote:

Thanks for the comments.

>
>> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
>> index 0ab99c7..19e6593 100644
>> --- a/include/linux/perf_event.h
>> +++ b/include/linux/perf_event.h
>> @@ -528,6 +528,7 @@ typedef void (*perf_overflow_handler_t)(struct perf_event *,
>>    */
>>   #define PERF_EV_CAP_SOFTWARE		BIT(0)
>>   #define PERF_EV_CAP_READ_ACTIVE_PKG	BIT(1)
>> +#define PERF_EV_CAP_NO_COUNTER		BIT(2)
>>   
>>   #define SWEVENT_HLIST_BITS		8
>>   #define SWEVENT_HLIST_SIZE		(1 << SWEVENT_HLIST_BITS)
>> @@ -895,6 +896,13 @@ extern int perf_event_refresh(struct perf_event *event, int refresh);
>>   extern void perf_event_update_userpage(struct perf_event *event);
>>   extern int perf_event_release_kernel(struct perf_event *event);
>>   extern struct perf_event *
>> +perf_event_create(struct perf_event_attr *attr,
>> +		  int cpu,
>> +		  struct task_struct *task,
>> +		  perf_overflow_handler_t overflow_handler,
>> +		  void *context,
>> +		  bool counter_assignment);
>> +extern struct perf_event *
>>   perf_event_create_kernel_counter(struct perf_event_attr *attr,
>>   				int cpu,
>>   				struct task_struct *task,
> Why the heck are you creating this wrapper nonsense?

(please see early discussions: https://lkml.org/lkml/2018/9/20/868)
I thought we agreed that the perf event created here doesn't need to consume
an extra counter.

In the previous version, we added a "no_counter" bit to perf_event_attr, and
that would be exposed to the user ABI, which doesn't seem good.
(https://lkml.org/lkml/2019/2/14/791)
So we wrap a new kernel API above to support this.

Do you have a different suggestion to do this?
(exclude host/guest just clears the enable bit on VM-exit/entry, but
still consumes the counter)

Best,
Wei


* Re: [PATCH v7 08/12] KVM/x86/vPMU: Add APIs to support host save/restore the guest lbr stack
  2019-07-08 14:48   ` Peter Zijlstra
@ 2019-07-09  3:04     ` Wei Wang
  2019-07-09  9:39       ` Peter Zijlstra
  0 siblings, 1 reply; 33+ messages in thread
From: Wei Wang @ 2019-07-09  3:04 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, kvm, pbonzini, ak, kan.liang, mingo, rkrcmar,
	like.xu, jannh, arei.gonglei, jmattson

On 07/08/2019 10:48 PM, Peter Zijlstra wrote:
> On Mon, Jul 08, 2019 at 09:23:15AM +0800, Wei Wang wrote:
>> From: Like Xu <like.xu@intel.com>
>>
>> This patch adds support to enable/disable the host side save/restore
> This patch should be disqualified on Changelog alone...
>
>    Documentation/process/submitting-patches.rst:instead of "[This patch] makes xyzzy do frotz" or "[I] changed xyzzy

OK, we will discard "This patch" in the description:

To enable/disable the host side save/restore for the guest lbr stack on
vCPU switching, the host creates a perf event for the vCPU..

>> for the guest lbr stack on vCPU switching. To enable that, the host
>> creates a perf event for the vCPU, and the event attributes are set
>> to the user callstack mode lbr so that all the conditions are met in
>> the host perf subsystem to save the lbr stack on task switching.
>>
>> The host side lbr perf event is created only for the purpose of saving
>> and restoring the lbr stack. There is no need to enable the lbr
>> functionality for this perf event, because the feature is essentially
>> used in the vCPU. So perf_event_create is invoked with need_counter=false
>> to get no counter assigned for the perf event.
>>
>> The vcpu_lbr field is added to cpuc, to indicate if the lbr perf event is
>> used by the vCPU only for context switching. When the perf subsystem
>> handles this event (e.g. lbr enable or read lbr stack on PMI) and finds
>> it's non-zero, it simply returns.
> *WHY* does the host need to save/restore? Why not make VMENTER/VMEXIT do
> this?

Because the VMX transition is much more frequent than the vCPU switching.
On SKL, saving 32 LBR entries could add 3000~4000 cycles overhead, which
would be too large for the frequent VMX transitions.

LBR state is saved when the vCPU is scheduled out to ensure that this
vCPU's LBR data doesn't get lost (as another vCPU or host thread that
is scheduled in may use LBR).


> Many of these patches don't explain why things are done; that's a
> problem.

We'll improve; please point out anything that isn't clear to you, thanks.

Best,
Wei


* Re: [PATCH v7 10/12] KVM/x86/lbr: lazy save the guest lbr stack
  2019-07-08 14:53   ` Peter Zijlstra
  2019-07-08 15:11     ` Andi Kleen
@ 2019-07-09  3:14     ` Wei Wang
  1 sibling, 0 replies; 33+ messages in thread
From: Wei Wang @ 2019-07-09  3:14 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, kvm, pbonzini, ak, kan.liang, mingo, rkrcmar,
	like.xu, jannh, arei.gonglei, jmattson

On 07/08/2019 10:53 PM, Peter Zijlstra wrote:
> On Mon, Jul 08, 2019 at 09:23:17AM +0800, Wei Wang wrote:
>> When the vCPU is scheduled in:
>> - if the lbr feature was used in the last vCPU time slice, set the lbr
>>    stack to be interceptible, so that the host can capture whether the
>>    lbr feature will be used in this time slice;
>> - if the lbr feature wasn't used in the last vCPU time slice, disable
>>    the vCPU support of the guest lbr switching.
>>
>> Upon the first access to one of the lbr related MSRs (since the vCPU was
>> scheduled in):
>> - record that the guest has used the lbr;
>> - create a host perf event to help save/restore the guest lbr stack;
>> - pass the stack through to the guest.
> I don't understand a word of that.
>
> Who cares if the LBR MSRs are touched; the guest expects up-to-date
> values when it does read them.

Another host thread that shares the same pCPU with this vCPU thread
may use the lbr stack, so the host needs to save/restore the vCPU's
lbr state. Otherwise the guest perf inside the vCPU wouldn't read the
correct data from the lbr msrs (as the msrs would already have been
changed by another host thread).

As Andi also replied, if the vCPU isn't using the lbr anymore, the
host doesn't need to save the lbr msrs then.

Best,
Wei


* Re: [PATCH v7 12/12] KVM/VMX/vPMU: support to report GLOBAL_STATUS_LBRS_FROZEN
  2019-07-08 15:09   ` Peter Zijlstra
@ 2019-07-09  3:24     ` Wei Wang
  2019-07-09 11:35       ` Peter Zijlstra
  0 siblings, 1 reply; 33+ messages in thread
From: Wei Wang @ 2019-07-09  3:24 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, kvm, pbonzini, ak, kan.liang, mingo, rkrcmar,
	like.xu, jannh, arei.gonglei, jmattson

On 07/08/2019 11:09 PM, Peter Zijlstra wrote:
> On Mon, Jul 08, 2019 at 09:23:19AM +0800, Wei Wang wrote:
>> This patch enables the LBR related features in Arch v4 in advance,
>> though the current vPMU only has v2 support. Other arch v4 related
>> support will be enabled later in another series.
>>
>> Arch v4 supports streamlined Freeze_LBR_on_PMI. According to the SDM,
> > the LBR_FRZ bit is set in the global status when debugctl.freeze_lbr_on_pmi
> > is set and a PMI is generated. The CTR_FRZ bit is set when
>> debugctl.freeze_perfmon_on_pmi is set and a PMI is generated.
> (that's still a misnomer; it is: freeze_perfmon_on_overflow)

OK. (but that was directly copied from the sdm 18.2.4.1)


> Why?
>
> Who uses that v4 crud?

I saw the native perf driver has been updated to v4.
After the vPMU gets updated to v4, the guest perf would use that.

If you prefer to hold this patch until vPMU v4 support,
we could do that as well.


> It's broken. It loses events between overflow
> and PMI.

Do you mean it's a v4 hardware issue?

Best,
Wei


* Re: [PATCH v7 08/12] KVM/x86/vPMU: Add APIs to support host save/restore the guest lbr stack
  2019-07-09  3:04     ` Wei Wang
@ 2019-07-09  9:39       ` Peter Zijlstra
  2019-07-09 11:34         ` Wei Wang
  0 siblings, 1 reply; 33+ messages in thread
From: Peter Zijlstra @ 2019-07-09  9:39 UTC (permalink / raw)
  To: Wei Wang
  Cc: linux-kernel, kvm, pbonzini, ak, kan.liang, mingo, rkrcmar,
	like.xu, jannh, arei.gonglei, jmattson

On Tue, Jul 09, 2019 at 11:04:21AM +0800, Wei Wang wrote:
> On 07/08/2019 10:48 PM, Peter Zijlstra wrote:

> > *WHY* does the host need to save/restore? Why not make VMENTER/VMEXIT do
> > this?
> 
> Because the VMX transition is much more frequent than the vCPU switching.
> On SKL, saving 32 LBR entries could add 3000~4000 cycles overhead, which
> would be too large for the frequent VMX transitions.
>
> LBR state is saved when the vCPU is scheduled out to ensure that this
> vCPU's LBR data doesn't get lost (as another vCPU or host thread that
> is scheduled in may use LBR).

But VMENTER/VMEXIT still have to enable/disable the LBR, right?
Otherwise the host will pollute LBR contents. And you then rely on this
'fake' event to ensure the host doesn't use LBR when the VCPU is
running.

But what about the counter scheduling rules; what happens when a CPU
event claims the LBR before the task event can claim it? CPU events have
precedence over task events.

Then your VCPU will clobber host state, which is BAD.

I'm missing all these details in the Changelogs. Please describe the
whole setup and explain why this approach.


* Re: [PATCH v7 07/12] perf/x86: no counter allocation support
  2019-07-09  2:58     ` Wei Wang
@ 2019-07-09  9:43       ` Peter Zijlstra
  2019-07-09 11:36         ` Wei Wang
  0 siblings, 1 reply; 33+ messages in thread
From: Peter Zijlstra @ 2019-07-09  9:43 UTC (permalink / raw)
  To: Wei Wang
  Cc: linux-kernel, kvm, pbonzini, ak, kan.liang, mingo, rkrcmar,
	like.xu, jannh, arei.gonglei, jmattson

On Tue, Jul 09, 2019 at 10:58:46AM +0800, Wei Wang wrote:
> On 07/08/2019 10:29 PM, Peter Zijlstra wrote:
> 
> Thanks for the comments.
> 
> > 
> > > diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> > > index 0ab99c7..19e6593 100644
> > > --- a/include/linux/perf_event.h
> > > +++ b/include/linux/perf_event.h
> > > @@ -528,6 +528,7 @@ typedef void (*perf_overflow_handler_t)(struct perf_event *,
> > >    */
> > >   #define PERF_EV_CAP_SOFTWARE		BIT(0)
> > >   #define PERF_EV_CAP_READ_ACTIVE_PKG	BIT(1)
> > > +#define PERF_EV_CAP_NO_COUNTER		BIT(2)
> > >   #define SWEVENT_HLIST_BITS		8
> > >   #define SWEVENT_HLIST_SIZE		(1 << SWEVENT_HLIST_BITS)
> > > @@ -895,6 +896,13 @@ extern int perf_event_refresh(struct perf_event *event, int refresh);
> > >   extern void perf_event_update_userpage(struct perf_event *event);
> > >   extern int perf_event_release_kernel(struct perf_event *event);
> > >   extern struct perf_event *
> > > +perf_event_create(struct perf_event_attr *attr,
> > > +		  int cpu,
> > > +		  struct task_struct *task,
> > > +		  perf_overflow_handler_t overflow_handler,
> > > +		  void *context,
> > > +		  bool counter_assignment);
> > > +extern struct perf_event *
> > >   perf_event_create_kernel_counter(struct perf_event_attr *attr,
> > >   				int cpu,
> > >   				struct task_struct *task,
> > Why the heck are you creating this wrapper nonsense?
> 
> (please see early discussions: https://lkml.org/lkml/2018/9/20/868)
> I thought we agreed that the perf event created here doesn't need to consume
> an extra counter.

That's almost a year ago; I really can't remember that and you didn't
put any of that in your Changelog to help me remember.

(also please use: https://lkml.kernel.org/r/$msgid style links)

> In the previous version, we added a "no_counter" bit to perf_event_attr, and
> > that would be exposed to the user ABI, which doesn't seem good.
> (https://lkml.org/lkml/2019/2/14/791)
> So we wrap a new kernel API above to support this.
> 
> Do you have a different suggestion to do this?
> (exclude host/guest just clears the enable bit on VM-exit/entry, but
> still consumes the counter)

Just add an argument to perf_event_create_kernel_counter() ?
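
Something like this minimal sketch, with the existing callers passing
true (the flag name below is just an example, not a settled API):

	extern struct perf_event *
	perf_event_create_kernel_counter(struct perf_event_attr *attr,
					 int cpu,
					 struct task_struct *task,
					 perf_overflow_handler_t overflow_handler,
					 void *context,
					 bool need_counter);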


* Re: [PATCH v7 08/12] KVM/x86/vPMU: Add APIs to support host save/restore the guest lbr stack
  2019-07-09  9:39       ` Peter Zijlstra
@ 2019-07-09 11:34         ` Wei Wang
  2019-07-09 12:19           ` Peter Zijlstra
  0 siblings, 1 reply; 33+ messages in thread
From: Wei Wang @ 2019-07-09 11:34 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, kvm, pbonzini, ak, kan.liang, mingo, rkrcmar,
	like.xu, jannh, arei.gonglei, jmattson

On 07/09/2019 05:39 PM, Peter Zijlstra wrote:
> On Tue, Jul 09, 2019 at 11:04:21AM +0800, Wei Wang wrote:
>> On 07/08/2019 10:48 PM, Peter Zijlstra wrote:
>>> *WHY* does the host need to save/restore? Why not make VMENTER/VMEXIT do
>>> this?
>> Because the VMX transition is much more frequent than the vCPU switching.
>> On SKL, saving 32 LBR entries could add 3000~4000 cycles overhead, which
>> would be too large for the frequent VMX transitions.
>>
>> LBR state is saved when the vCPU is scheduled out to ensure that this
>> vCPU's LBR data doesn't get lost (as another vCPU or host thread that
>> is scheduled in may use LBR).
> But VMENTER/VMEXIT still have to enable/disable the LBR, right?
> Otherwise the host will pollute LBR contents. And you then rely on this
> 'fake' event to ensure the host doesn't use LBR when the VCPU is
> running.

Yes, only the debugctl msr is saved/restored on VMX transitions.


>
> But what about the counter scheduling rules;

The counter is emulated independently of the lbr emulation.

Here is the background reason:

The direction we are going is the architectural emulation, where the
features are emulated based on the hardware behavior described in the
spec. So the lbr emulation path only offers the lbr feature to the
guest (no counters associated, as the lbr feature doesn't have a
counter essentially).

If the above isn't clear, please see this example: the guest could run
any software to use the lbr feature (non-perf or non-linux, or even a
testing kernel module to try lbr for their own purpose), and it could
choose to use a regular timer to do sampling. If the lbr emulation
takes a counter to generate a PMI to the guest to do sampling, that
pmi isn't expected from the guest perspective.

So counter scheduling isn't handled by the lbr emulation here; it is
handled by the counter emulation. If the guest needs a counter, it
configures the related msr, which traps to KVM, and the counter
emulation has its own emulation path (e.g. reprogram_gp_counter, which
is called when the guest writes to the emulated eventsel msr).


> what happens when a CPU
> event claims the LBR before the task event can claim it? CPU events have
> precedence over task events.

I think the precedence (cpu pinned and task pinned) is for counter
multiplexing, right?

For the lbr feature, could we think of it as first come, first served?
For example, if we have 2 host threads that want to use lbr at the same
time, I think one of them would simply fail to use it.

So if the guest gets the lbr first, the host wouldn't take over unless
some userspace command (we added to QEMU) is executed to have the vCPU
actively stop using lbr.


>
> I'm missing all these details in the Changelogs. Please describe the
> whole setup and explain why this approach.

OK, I just shared some important background above.
I'll see if any more important details are missing.

Best,
Wei


* Re: [PATCH v7 12/12] KVM/VMX/vPMU: support to report GLOBAL_STATUS_LBRS_FROZEN
  2019-07-09  3:24     ` Wei Wang
@ 2019-07-09 11:35       ` Peter Zijlstra
  2019-07-10  9:23         ` Wei Wang
  0 siblings, 1 reply; 33+ messages in thread
From: Peter Zijlstra @ 2019-07-09 11:35 UTC (permalink / raw)
  To: Wei Wang
  Cc: linux-kernel, kvm, pbonzini, ak, kan.liang, mingo, rkrcmar,
	like.xu, jannh, arei.gonglei, jmattson

On Tue, Jul 09, 2019 at 11:24:07AM +0800, Wei Wang wrote:
> On 07/08/2019 11:09 PM, Peter Zijlstra wrote:
> > On Mon, Jul 08, 2019 at 09:23:19AM +0800, Wei Wang wrote:
> > > This patch enables the LBR related features in Arch v4 in advance,
> > > though the current vPMU only has v2 support. Other arch v4 related
> > > support will be enabled later in another series.
> > > 
> > > Arch v4 supports streamlined Freeze_LBR_on_PMI. According to the SDM,
> > > the LBR_FRZ bit is set in the global status when debugctl.freeze_lbr_on_pmi
> > > is set and a PMI is generated. The CTR_FRZ bit is set when
> > > debugctl.freeze_perfmon_on_pmi is set and a PMI is generated.
> > (that's still a misnomer; it is: freeze_perfmon_on_overflow)
> 
> OK. (but that was directly copied from the sdm 18.2.4.1)

Yeah, I know. But that name doesn't correctly describe what it actually
does. If it worked as named it would in fact be OK.

> > Why?
> > 
> > Who uses that v4 crud?
> 
> I saw the native perf driver has been updated to v4.

It's disabled by default and I'm tempted to simply remove it. See below.

> After the vPMU gets updated to v4, the guest perf would use that.
> 
> If you prefer to hold this patch until vPMU v4 support,
> we could do that as well.
> 
> 
> > It's broken. It loses events between overflow
> > and PMI.
> 
> Do you mean it's a v4 hardware issue?

Yeah; although I'm not sure if it's an implementation or specification
problem. But as it exists it is of very limited use.

Fundamentally our events (with exception of event groups) are
independent. Events should always count, except when the PMI is running
-- so as to not include the measurement overhead in the measurement
itself. But this (mis)feature stops the entire PMU as soon as a single
counter overflows, inhibiting all other counters from running (as they
should) until the PMI has happened and reset the state.

(Note that, strictly speaking, we even expect the overflowing counter to
continue counting until the PMI happens. Having an overflow should not
mean we loose events. A sampling and !sampling event should produce the
same event count.)

So even when there's only a single event (group) scheduled, it isn't
strictly right. And when there's multiple events scheduled it is
definitely wrong.

And while I understand the purpose of the current semantics; it makes a
single event group sample count more coherent, the fact that it loses
events just bugs me something fierce -- and as shown, it breaks tools.


* Re: [PATCH v7 07/12] perf/x86: no counter allocation support
  2019-07-09  9:43       ` Peter Zijlstra
@ 2019-07-09 11:36         ` Wei Wang
  0 siblings, 0 replies; 33+ messages in thread
From: Wei Wang @ 2019-07-09 11:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, kvm, pbonzini, ak, kan.liang, mingo, rkrcmar,
	like.xu, jannh, arei.gonglei, jmattson

On 07/09/2019 05:43 PM, Peter Zijlstra wrote:
> That's almost a year ago; I really can't remember that and you didn't
> put any of that in your Changelog to help me remember.
>
> (also please use: https://lkml.kernel.org/r/$msgid style links)

OK, I'll put this link in the cover letter or commit log as a reminder.

>
>> In the previous version, we added a "no_counter" bit to perf_event_attr, and
>> that will be exposed to user ABI, which seems not good.
>> (https://lkml.org/lkml/2019/2/14/791)
>> So we wrap a new kernel API above to support this.
>>
>> Do you have a different suggestion to do this?
>> (exclude host/guest just clears the enable bit when on VM-exit/entry,
>> still consumes the counter)
> Just add an argument to perf_event_create_kernel_counter() ?

Yes. I didn't find a proper place to add this "no_counter" indicator, so
I added a wrapper to avoid changing the existing callers of
perf_event_create_kernel_counter.

Best,
Wei


* Re: [PATCH v7 10/12] KVM/x86/lbr: lazy save the guest lbr stack
  2019-07-08 15:11     ` Andi Kleen
@ 2019-07-09 11:39       ` Peter Zijlstra
  0 siblings, 0 replies; 33+ messages in thread
From: Peter Zijlstra @ 2019-07-09 11:39 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Wei Wang, linux-kernel, kvm, pbonzini, kan.liang, mingo, rkrcmar,
	like.xu, jannh, arei.gonglei, jmattson

On Mon, Jul 08, 2019 at 08:11:41AM -0700, Andi Kleen wrote:
> > I don't understand a word of that.
> > 
> > Who cares if the LBR MSRs are touched; the guest expects up-to-date
> > values when it does read them.
> 
> This is only for when the LBRs are disabled in the guest.

But your Changelog didn't clearly state that. It was an unreadable mess.


* Re: [PATCH v7 08/12] KVM/x86/vPMU: Add APIs to support host save/restore the guest lbr stack
  2019-07-08  1:23 ` [PATCH v7 08/12] KVM/x86/vPMU: Add APIs to support host save/restore the guest lbr stack Wei Wang
  2019-07-08 14:48   ` Peter Zijlstra
@ 2019-07-09 11:45   ` Peter Zijlstra
  2019-07-10  8:21     ` Wei Wang
  1 sibling, 1 reply; 33+ messages in thread
From: Peter Zijlstra @ 2019-07-09 11:45 UTC (permalink / raw)
  To: Wei Wang
  Cc: linux-kernel, kvm, pbonzini, ak, kan.liang, mingo, rkrcmar,
	like.xu, jannh, arei.gonglei, jmattson

On Mon, Jul 08, 2019 at 09:23:15AM +0800, Wei Wang wrote:
> +int intel_pmu_enable_save_guest_lbr(struct kvm_vcpu *vcpu)
> +{
> +	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
> +	struct perf_event *event;
> +
> +	/*
> +	 * The main purpose of this perf event is to have the host perf core
> +	 * help save/restore the guest lbr stack on vcpu switching. There are
> +	 * no perf counters allocated for the event.
> +	 *
> +	 * About the attr:
> +	 * exclude_guest: set to true to indicate that the event runs on the
> +	 *                host only.

That's a lie; it _never_ runs. You specifically don't want the host to
enable LBR so as not to corrupt the guest state.

> +	 * pinned:        set to false, so that the FLEXIBLE events will not
> +	 *                be rescheduled for this event which actually doesn't
> +	 *                need a perf counter.

Unparsable gibberish. Specifically, by making it flexible it is
susceptible to rotation and there's no guarantee it will actually get
scheduled.

> +	 * config:        Actually this field won't be used by the perf core
> +	 *                as this event doesn't have a perf counter.
> +	 * sample_period: Same as above.

If it's unused; why do we need to set it at all?

> +	 * sample_type:   tells the perf core that it is an lbr event.
> +	 * branch_sample_type: tells the perf core that the lbr event works in
> +	 *                the user callstack mode so that the lbr stack will be
> +	 *                saved/restored on vCPU switching.

Again; doesn't make sense. What does the user part have to do with
save/restore? What happens when this vcpu thread drops to userspace for
an assist?

> +	 */
> +	struct perf_event_attr attr = {
> +		.type = PERF_TYPE_RAW,
> +		.size = sizeof(attr),
> +		.exclude_guest = true,
> +		.pinned = false,
> +		.config = 0,
> +		.sample_period = 0,
> +		.sample_type = PERF_SAMPLE_BRANCH_STACK,
> +		.branch_sample_type = PERF_SAMPLE_BRANCH_CALL_STACK |
> +				      PERF_SAMPLE_BRANCH_USER,
> +	};
> +
> +	if (pmu->vcpu_lbr_event)
> +		return 0;
> +
> +	event = perf_event_create(&attr, -1, current, NULL, NULL, false);
> +	if (IS_ERR(event)) {
> +		pr_err("%s: failed %ld\n", __func__, PTR_ERR(event));
> +		return -ENOENT;
> +	}
> +	pmu->vcpu_lbr_event = event;
> +
> +	return 0;
> +}


* Re: [PATCH v7 08/12] KVM/x86/vPMU: Add APIs to support host save/restore the guest lbr stack
  2019-07-09 11:34         ` Wei Wang
@ 2019-07-09 12:19           ` Peter Zijlstra
  2019-07-10  8:19             ` Wei Wang
  0 siblings, 1 reply; 33+ messages in thread
From: Peter Zijlstra @ 2019-07-09 12:19 UTC (permalink / raw)
  To: Wei Wang
  Cc: linux-kernel, kvm, pbonzini, ak, kan.liang, mingo, rkrcmar,
	like.xu, jannh, arei.gonglei, jmattson

On Tue, Jul 09, 2019 at 07:34:26PM +0800, Wei Wang wrote:

> > But what about the counter scheduling rules;
> 
> The counter is emulated independently of the lbr emulation.

> > what happens when a CPU
> > event claims the LBR before the task event can claim it? CPU events have
> > precedence over task events.
> 
> I think the precedence (cpu pinned and task pinned) is for counter
> multiplexing, right?

No; for all scheduling. The order is:

  CPU-pinned
  Task-pinned
  CPU-flexible
  Task-flexible

The way you created the event it would land in 'task-flexible', but even
if you make it task-pinned, a CPU (or CPU-pinned) event could claim the
LBR before your fake event.
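
(For reference, a task-pinned variant of the attr from patch 08 would
just set:

	.pinned = 1,

but as above, even a task-pinned event sits below CPU-pinned events in
that order.)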

> For the lbr feature, could we think of it as first come, first served?
> For example, if we have 2 host threads that want to use lbr at the same
> time, I think one of them would simply fail to use it.
>
> So if the guest gets the lbr first, the host wouldn't take over unless
> some userspace command (we added to QEMU) is executed to have the vCPU
> actively stop using lbr.

Doesn't work that way.

Say you start KVM with LBR emulation, it creates this task event, it
gets the LBR (nobody else wants it) and the guest works and starts using
the LBR.

Then the host creates a CPU LBR event and the vCPU suddenly gets denied
the LBR and the guest no longer functions correctly.

Or you should fail to VMENTER, in which case you starve the guest, but
at least it doesn't malfunction.



* Re: [PATCH v7 08/12] KVM/x86/vPMU: Add APIs to support host save/restore the guest lbr stack
  2019-07-09 12:19           ` Peter Zijlstra
@ 2019-07-10  8:19             ` Wei Wang
  0 siblings, 0 replies; 33+ messages in thread
From: Wei Wang @ 2019-07-10  8:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, kvm, pbonzini, ak, kan.liang, mingo, rkrcmar,
	like.xu, jannh, arei.gonglei, jmattson

On 07/09/2019 08:19 PM, Peter Zijlstra wrote:
>
>> For the lbr feature, could we think of it as first come, first served?
>> For example, if we have 2 host threads that want to use lbr at the same
>> time, I think one of them would simply fail to use it.
>>
>> So if the guest gets the lbr first, the host wouldn't take over unless
>> some userspace command (we added to QEMU) is executed to have the vCPU
>> actively stop using lbr.
> Doesn't work that way.
>
> Say you start KVM with LBR emulation, it creates this task event, it
> gets the LBR (nobody else wants it) and the guest works and starts using
> the LBR.
>
> Then the host creates a CPU LBR event and the vCPU suddenly gets denied
> the LBR and the guest no longer functions correctly.
>
> Or you should fail to VMENTER, in which case you starve the guest, but
> at least it doesn't malfunction.
>

Ok, I think we could go with the first option for now, like this:

When another host thread starts to use the lbr, host perf notifies
kvm to disable the guest use of lbr (via a notifier_block), which
includes the following (a rough sketch follows the list):
  - cancel passing through the lbr msrs to the guest;
  - upon guest access to the lbr msrs (they will trap to KVM now),
    return #GP (or 0).
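
A rough sketch of that notification path; the notifier chain and the
action name are hypothetical (perf doesn't export such a chain today),
and lbr_nb would be a new kvm_pmu field:

	/* Hypothetical: called by host perf when a host LBR user appears. */
	static int kvm_lbr_usage_notify(struct notifier_block *nb,
					unsigned long action, void *data)
	{
		struct kvm_pmu *pmu = container_of(nb, struct kvm_pmu, lbr_nb);
		struct kvm_vcpu *vcpu = pmu_to_vcpu(pmu);

		if (action == PERF_LBR_HOST_CLAIM) {
			/*
			 * Stop passing the lbr msrs through; subsequent
			 * guest accesses trap to KVM and return #GP (or 0).
			 */
			intel_pmu_set_intercept_for_lbr_msrs(vcpu, true);
			intel_pmu_disable_save_guest_lbr(vcpu);
		}
		return NOTIFY_OK;
	}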


Best,
Wei





* Re: [PATCH v7 08/12] KVM/x86/vPMU: Add APIs to support host save/restore the guest lbr stack
  2019-07-09 11:45   ` Peter Zijlstra
@ 2019-07-10  8:21     ` Wei Wang
  0 siblings, 0 replies; 33+ messages in thread
From: Wei Wang @ 2019-07-10  8:21 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, kvm, pbonzini, ak, kan.liang, mingo, rkrcmar,
	like.xu, jannh, arei.gonglei, jmattson

On 07/09/2019 07:45 PM, Peter Zijlstra wrote:
>
>> +	 * config:        Actually this field won't be used by the perf core
>> +	 *                as this event doesn't have a perf counter.
>> +	 * sample_period: Same as above.
> If it's unused; why do we need to set it at all?

OK, we'll remove the unused fields.

>
>> +	 * sample_type:   tells the perf core that it is an lbr event.
>> +	 * branch_sample_type: tells the perf core that the lbr event works in
>> +	 *                the user callstack mode so that the lbr stack will be
>> +	 *                saved/restored on vCPU switching.
> Again; doesn't make sense. What does the user part have to do with
> save/restore? What happens when this vcpu thread drops to userspace for
> an assist?

This is a fake event which doesn't run the lbr on the host.
Returning to userspace doesn't require saving/restoring the lbr,
because userspace wouldn't change those lbr msrs.

The event is created to save/restore the lbr msrs on vcpu switching.
Host perf only does this save/restore for a "user callstack mode" lbr
event, so we construct the event to be "user callstack mode" to reuse
the host lbr save/restore.

An alternative method is to have the KVM vcpu sched_in/out callbacks
save/restore those msrs, which wouldn't require creating this fake event.
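
A minimal sketch of the save half of that alternative (the lbr_* fields
on kvm_pmu are assumptions; the msr constants are the architectural
ones from msr-index.h):

	static void guest_lbr_save(struct kvm_vcpu *vcpu)
	{
		struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
		int i;

		/* Read the guest's lbr stack out before another task runs. */
		for (i = 0; i < pmu->lbr_nr; i++) {
			rdmsrl(MSR_LBR_NHM_FROM + i, pmu->lbr_from[i]);
			rdmsrl(MSR_LBR_NHM_TO + i, pmu->lbr_to[i]);
		}
		rdmsrl(MSR_LBR_SELECT, pmu->lbr_select);
		rdmsrl(MSR_LBR_TOS, pmu->lbr_tos);
	}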

Please let me know if you prefer the second method.

Best,
Wei


* Re: [PATCH v7 12/12] KVM/VMX/vPMU: support to report GLOBAL_STATUS_LBRS_FROZEN
  2019-07-09 11:35       ` Peter Zijlstra
@ 2019-07-10  9:23         ` Wei Wang
  0 siblings, 0 replies; 33+ messages in thread
From: Wei Wang @ 2019-07-10  9:23 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, kvm, pbonzini, ak, kan.liang, mingo, rkrcmar,
	like.xu, jannh, arei.gonglei, jmattson

On 07/09/2019 07:35 PM, Peter Zijlstra wrote:
>
> Yeah; although I'm not sure if it's an implementation or specification
> problem. But as it exists it is of very limited use.
>
> Fundamentally our events (with exception of event groups) are
> independent. Events should always count, except when the PMI is running
> -- so as to not include the measurement overhead in the measurement
> itself. But this (mis)feature stops the entire PMU as soon as a single
> counter overflows, inhibiting all other counters from running (as they
> should) until the PMI has happened and reset the state.
>
> (Note that, strictly speaking, we even expect the overflowing counter to
> continue counting until the PMI happens. Having an overflow should not
> mean we lose events. A sampling and !sampling event should produce the
> same event count.)
>
> So even when there's only a single event (group) scheduled, it isn't
> strictly right. And when there's multiple events scheduled it is
> definitely wrong.
>
> And while I understand the purpose of the current semantics; it makes a
> single event group sample count more coherent, the fact that it loses
> events just bugs me something fierce -- and as shown, it breaks tools.

Thanks for sharing the finding.
If I understand this correctly, you observed the counter getting frozen
earlier than expected (it is expected to freeze at the time the PMI is
generated).

Have you talked to anyone about a possible freeze adjustment in the
hardware?

Best,
Wei



Thread overview: 33+ messages
2019-07-08  1:23 [PATCH v7 00/12] Guest LBR Enabling Wei Wang
2019-07-08  1:23 ` [PATCH v7 01/12] perf/x86: fix the variable type of the LBR MSRs Wei Wang
2019-07-08  1:23 ` [PATCH v7 02/12] perf/x86: add a function to get the lbr stack Wei Wang
2019-07-08  1:23 ` [PATCH v7 03/12] KVM/x86: KVM_CAP_X86_GUEST_LBR Wei Wang
2019-07-08  1:23 ` [PATCH v7 04/12] KVM/x86: intel_pmu_lbr_enable Wei Wang
2019-07-08  1:23 ` [PATCH v7 05/12] KVM/x86/vPMU: tweak kvm_pmu_get_msr Wei Wang
2019-07-08  1:23 ` [PATCH v7 06/12] KVM/x86: expose MSR_IA32_PERF_CAPABILITIES to the guest Wei Wang
2019-07-08  1:23 ` [PATCH v7 07/12] perf/x86: no counter allocation support Wei Wang
2019-07-08 14:29   ` Peter Zijlstra
2019-07-09  2:58     ` Wei Wang
2019-07-09  9:43       ` Peter Zijlstra
2019-07-09 11:36         ` Wei Wang
2019-07-08  1:23 ` [PATCH v7 08/12] KVM/x86/vPMU: Add APIs to support host save/restore the guest lbr stack Wei Wang
2019-07-08 14:48   ` Peter Zijlstra
2019-07-09  3:04     ` Wei Wang
2019-07-09  9:39       ` Peter Zijlstra
2019-07-09 11:34         ` Wei Wang
2019-07-09 12:19           ` Peter Zijlstra
2019-07-10  8:19             ` Wei Wang
2019-07-09 11:45   ` Peter Zijlstra
2019-07-10  8:21     ` Wei Wang
2019-07-08  1:23 ` [PATCH v7 09/12] perf/x86: save/restore LBR_SELECT on vCPU switching Wei Wang
2019-07-08  1:23 ` [PATCH v7 10/12] KVM/x86/lbr: lazy save the guest lbr stack Wei Wang
2019-07-08 14:53   ` Peter Zijlstra
2019-07-08 15:11     ` Andi Kleen
2019-07-09 11:39       ` Peter Zijlstra
2019-07-09  3:14     ` Wei Wang
2019-07-08  1:23 ` [PATCH v7 11/12] KVM/x86: remove the common handling of the debugctl msr Wei Wang
2019-07-08  1:23 ` [PATCH v7 12/12] KVM/VMX/vPMU: support to report GLOBAL_STATUS_LBRS_FROZEN Wei Wang
2019-07-08 15:09   ` Peter Zijlstra
2019-07-09  3:24     ` Wei Wang
2019-07-09 11:35       ` Peter Zijlstra
2019-07-10  9:23         ` Wei Wang
