linux-kernel.vger.kernel.org archive mirror
* [PATCH v8 00/14] Guest LBR Enabling
@ 2019-08-06  7:16 Wei Wang
  2019-08-06  7:16 ` [PATCH v8 01/14] perf/x86: fix the variable type of the lbr msrs Wei Wang
                   ` (15 more replies)
  0 siblings, 16 replies; 20+ messages in thread
From: Wei Wang @ 2019-08-06  7:16 UTC (permalink / raw)
  To: linux-kernel, kvm, ak, peterz, pbonzini
  Cc: kan.liang, mingo, rkrcmar, like.xu, wei.w.wang, jannh,
	arei.gonglei, jmattson

Last Branch Recording (LBR) is a performance monitoring unit (PMU) feature
on Intel CPUs that captures branch related information. This patch series
makes this feature available to KVM guests.

The feature is exposed per VM: userspace enables it by setting the enable
parameter of KVM_CAP_X86_GUEST_LBR (patch 3).
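
For illustration, a minimal userspace sketch of enabling the capability (not
part of this series; vm_fd comes from KVM_CREATE_VM, a 64-bit userspace is
assumed, and only <linux/kvm.h>, <sys/ioctl.h> and <stdio.h> are needed):

    /* Layout copied from the api.txt addition in patch 3. */
    struct x86_perf_lbr_stack {
            unsigned int nr, tos, from, to, info;
    } lbr_stack;

    struct kvm_enable_cap cap = {
            .cap = KVM_CAP_X86_GUEST_LBR,
            .args[0] = 1,                    /* enable the guest lbr feature */
            .args[1] = (__u64)&lbr_stack,    /* filled by kvm with the stack info */
    };

    if (ioctl(vm_fd, KVM_ENABLE_CAP, &cap) < 0)
            perror("KVM_ENABLE_CAP(KVM_CAP_X86_GUEST_LBR)");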

About the lbr emulation method:
When the vcpu is scheduled in, the lbr related msrs are configured to be
intercepted. The guest's first access to an lbr related msr therefore always
vm-exits to kvm, so that kvm knows whether the lbr feature is used during
the vcpu's time slice. The kvm lbr msr handler does the following things
(a code sketch of the flow appears after this description):
  - create an lbr perf event (task pinned) for the vcpu thread.
    The perf event mainly serves 2 purposes:
      -- follow the host perf scheduling rules to manage the vcpu's usage
         of lbr (e.g. a cpu pinned lbr event could reclaim lbr and thus
         stop the vcpu's use);
      -- have the host perf do context switching of the lbr state when the
         vcpu thread is switched.
  - pass the lbr related msrs through to the guest.
    This lets subsequent guest accesses to the lbr related msrs proceed
    without vm-exits, as long as the vcpu's lbr event owns the lbr feature.
    A cpu pinned lbr event on the host could still take over the lbr
    feature via IPI calls. In this case, the pass-through is cancelled
    (patch 13), and the guest's subsequent accesses to the lbr msrs
    vm-exit to kvm, where they are forbidden by the handler.

If the guest doesn't touch any of the lbr related msrs (which likely means
the guest won't need lbr in the near future), the vcpu's lbr perf event is
freed (see the patch 12 commit message for details).
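
To make the flow above concrete, here is a hedged sketch of the intercept
path (intel_pmu_create_lbr_event() and the lbr_event field come from patch 9;
lbr_event_owns_hw() and pass_through_lbr_msrs() are hypothetical stand-ins
for the checks and msr bitmap updates added in patches 12 and 13):

    static int handle_guest_lbr_msr_access(struct kvm_vcpu *vcpu)
    {
            struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);

            /* First access in this time slice: claim the lbr via a perf event. */
            if (!pmu->lbr_event && intel_pmu_create_lbr_event(vcpu))
                    return 1;                 /* the access is forbidden */

            /* Pass the msrs through only while the event still owns the lbr. */
            if (lbr_event_owns_hw(pmu->lbr_event))
                    pass_through_lbr_msrs(vcpu);

            return 0;
    }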

* Tests
Conclusion: the profiling results on the guest are similar to those on the host.

Run: ./perf -b ./test_program

- Test on the host:
Overhead  Command  Source Shared Object  Source Symbol    Target Symbol   
  22.35%  ftest    libc-2.23.so          [.] __random     [.] __random        
   8.20%  ftest    ftest                 [.] qux          [.] qux             
   5.88%  ftest    ftest                 [.] random@plt   [.] __random        
   5.88%  ftest    libc-2.23.so          [.] __random     [.] __random_r  
   5.79%  ftest    ftest                 [.] main         [.] random@plt  
   5.60%  ftest    ftest                 [.] main         [.] foo             
   5.24%  ftest    libc-2.23.so          [.] __random     [.] main            
   5.20%  ftest    libc-2.23.so          [.] __random_r   [.] __random        
   5.00%  ftest    ftest                 [.] foo          [.] qux             
   4.91%  ftest    ftest                 [.] main         [.] bar             
   4.83%  ftest    ftest                 [.] bar          [.] qux             
   4.57%  ftest    ftest                 [.] main         [.] main            
   4.38%  ftest    ftest                 [.] foo          [.] main            
   4.13%  ftest    ftest                 [.] qux          [.] foo             
   3.89%  ftest    ftest                 [.] qux          [.] bar             
   3.86%  ftest    ftest                 [.] bar          [.] main            

- Test on the guest:
Overhead  Command  Source Shared Object  Source Symbol    Target Symbol
  22.36%  ftest    libc-2.23.so          [.] random       [.] random  
   8.55%  ftest    ftest                 [.] qux          [.] qux                    
   5.79%  ftest    libc-2.23.so          [.] random       [.] random_r                     
   5.64%  ftest    ftest                 [.] random@plt   [.] random                     
   5.58%  ftest    ftest                 [.] main         [.] random@plt                       
   5.55%  ftest    ftest                 [.] main         [.] foo                       
   5.41%  ftest    libc-2.23.so          [.] random       [.] main                 
   5.31%  ftest    libc-2.23.so          [.] random_r     [.] random                      
   5.11%  ftest    ftest                 [.] foo          [.] qux                     
   4.93%  ftest    ftest                 [.] main         [.] main                     
   4.59%  ftest    ftest                 [.] qux          [.] bar                       
   4.49%  ftest    ftest                 [.] bar          [.] main                       
   4.42%  ftest    ftest                 [.] bar          [.] qux                       
   4.16%  ftest    ftest                 [.] main         [.] bar                       
   3.95%  ftest    ftest                 [.] qux          [.] foo                        
   3.79%  ftest    ftest                 [.] foo          [.] main
(due to the libc version difference, "random" here is equivalent to "__random" above)

v7->v8 Changelog:
  - Patch 3:
    -- document KVM_CAP_X86_GUEST_LBR in api.txt
    -- make the check of KVM_CAP_X86_GUEST_LBR return the size of
       struct x86_perf_lbr_stack, to let userspace do a compatibility
       check.
  - Patch 7:
    -- teach the perf scheduler to not assign a counter to a perf event
       that has PERF_EV_CAP_NO_COUNTER set (rather than skipping the perf
       scheduler entirely). This lets the scheduler detect lbr usage
       conflicts via get_event_constraints, so lower priority events
       eventually fail to use lbr.
    -- define X86_PMC_IDX_NA as "-1", which represents a never-assigned
       counter id. Other places that use "-1" could be updated to the new
       macro in a follow-up series.
  - Patch 8:
    -- move the event->owner assignment into perf_event_alloc to have it
       set before event_init is called. Please see this patch's commit for
       reasons.
  - Patch 9:
    -- use "exclude_host" and "is_kernel_event" to decide if the lbr event
       is used for the vcpu lbr emulation, which doesn't need a counter,
       and removes the usage of the previous new perf_event_create API.
    -- remove the unused attr fields.
  - Patch 10:
    -- set a hardware reserved bit (bit 62 of LBR_SELECT) in reg->config
       for the vcpu lbr emulation event. This makes the config different
       from that of other host lbr events, so that they don't share the
       lbr. Please see the comments in the patch for why they shouldn't
       share.
  - Patch 12:
    -- disable interrupts and check whether the vcpu lbr event owns the lbr
       feature before kvm writes to the lbr related msrs. This avoids kvm
       updating the lbr msrs after the lbr has been reclaimed by another
       event via IPI.
    -- remove arch v4 related support.
  - Patch 13:
    -- double check whether the vcpu lbr event owns the lbr feature before
       vm-entry into the guest. The lbr pass-through is cancelled if the
       lbr feature has been reclaimed by a cpu pinned lbr event.

Previous:
https://lkml.kernel.org/r/1562548999-37095-1-git-send-email-wei.w.wang@intel.com

Wei Wang (14):
  perf/x86: fix the variable type of the lbr msrs
  perf/x86: add a function to get the addresses of the lbr stack msrs
  KVM/x86: KVM_CAP_X86_GUEST_LBR
  KVM/x86: intel_pmu_lbr_enable
  KVM/x86/vPMU: tweak kvm_pmu_get_msr
  KVM/x86: expose MSR_IA32_PERF_CAPABILITIES to the guest
  perf/x86: support to create a perf event without counter allocation
  perf/core: set the event->owner before event_init
  KVM/x86/vPMU: APIs to create/free lbr perf event for a vcpu thread
  perf/x86/lbr: don't share lbr for the vcpu usage case
  perf/x86: save/restore LBR_SELECT on vcpu switching
  KVM/x86/lbr: lbr emulation
  KVM/x86/vPMU: check the lbr feature before entering guest
  KVM/x86: remove the common handling of the debugctl msr

 Documentation/virt/kvm/api.txt    |  26 +++
 arch/x86/events/core.c            |  36 ++-
 arch/x86/events/intel/core.c      |   3 +
 arch/x86/events/intel/lbr.c       |  95 +++++++-
 arch/x86/events/perf_event.h      |   6 +-
 arch/x86/include/asm/kvm_host.h   |   5 +
 arch/x86/include/asm/perf_event.h |  17 ++
 arch/x86/kvm/cpuid.c              |   2 +-
 arch/x86/kvm/pmu.c                |  24 +-
 arch/x86/kvm/pmu.h                |  11 +-
 arch/x86/kvm/pmu_amd.c            |   7 +-
 arch/x86/kvm/vmx/pmu_intel.c      | 476 +++++++++++++++++++++++++++++++++++++-
 arch/x86/kvm/vmx/vmx.c            |   4 +-
 arch/x86/kvm/vmx/vmx.h            |   2 +
 arch/x86/kvm/x86.c                |  47 ++--
 include/linux/perf_event.h        |  18 ++
 include/uapi/linux/kvm.h          |   1 +
 kernel/events/core.c              |  19 +-
 18 files changed, 738 insertions(+), 61 deletions(-)

-- 
2.7.4



* [PATCH v8 01/14] perf/x86: fix the variable type of the lbr msrs
  2019-08-06  7:16 [PATCH v8 00/14] Guest LBR Enabling Wei Wang
@ 2019-08-06  7:16 ` Wei Wang
  2019-08-06  7:16 ` [PATCH v8 02/14] perf/x86: add a function to get the addresses of the lbr stack msrs Wei Wang
                   ` (14 subsequent siblings)
  15 siblings, 0 replies; 20+ messages in thread
From: Wei Wang @ 2019-08-06  7:16 UTC (permalink / raw)
  To: linux-kernel, kvm, ak, peterz, pbonzini
  Cc: kan.liang, mingo, rkrcmar, like.xu, wei.w.wang, jannh,
	arei.gonglei, jmattson

The lbr msr variables can be "unsigned int", which uses less memory than
the longer "unsigned long". lbr_nr will never be negative, so make it
"unsigned int" as well.

Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andi Kleen <ak@linux.intel.com>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Wei Wang <wei.w.wang@intel.com>
---
 arch/x86/events/perf_event.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 8751008..27e4d32 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -660,8 +660,8 @@ struct x86_pmu {
 	/*
 	 * Intel LBR
 	 */
-	unsigned long	lbr_tos, lbr_from, lbr_to; /* MSR base regs       */
-	int		lbr_nr;			   /* hardware stack size */
+	unsigned int	lbr_tos, lbr_from, lbr_to,
+			lbr_nr;			   /* lbr stack and size */
 	u64		lbr_sel_mask;		   /* LBR_SELECT valid bits */
 	const int	*lbr_sel_map;		   /* lbr_select mappings */
 	bool		lbr_double_abort;	   /* duplicated lbr aborts */
-- 
2.7.4



* [PATCH v8 02/14] perf/x86: add a function to get the addresses of the lbr stack msrs
  2019-08-06  7:16 [PATCH v8 00/14] Guest LBR Enabling Wei Wang
  2019-08-06  7:16 ` [PATCH v8 01/14] perf/x86: fix the variable type of the lbr msrs Wei Wang
@ 2019-08-06  7:16 ` Wei Wang
  2019-08-06  7:16 ` [PATCH v8 03/14] KVM/x86: KVM_CAP_X86_GUEST_LBR Wei Wang
                   ` (13 subsequent siblings)
  15 siblings, 0 replies; 20+ messages in thread
From: Wei Wang @ 2019-08-06  7:16 UTC (permalink / raw)
  To: linux-kernel, kvm, ak, peterz, pbonzini
  Cc: kan.liang, mingo, rkrcmar, like.xu, wei.w.wang, jannh,
	arei.gonglei, jmattson

The lbr stack msrs are model specific. The perf subsystem has already
assigned the abstracted msr address values based on the cpu model. So add
a function that lets callers outside the perf subsystem get the lbr stack
msr addresses. This is useful for hypervisors that emulate the lbr feature
for a guest.
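
For reference, a hypervisor-side caller would look roughly like this (a
sketch using only the names introduced by this patch):

    struct x86_perf_lbr_stack stack;

    if (x86_perf_get_lbr_stack(&stack))
            return -ENOENT;

    pr_debug("lbr: %u entries, tos 0x%x, from 0x%x, to 0x%x, info 0x%x\n",
             stack.nr, stack.tos, stack.from, stack.to, stack.info);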

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Wei Wang <wei.w.wang@intel.com>
---
 arch/x86/events/intel/lbr.c       | 23 +++++++++++++++++++++++
 arch/x86/include/asm/perf_event.h | 14 ++++++++++++++
 2 files changed, 37 insertions(+)

diff --git a/arch/x86/events/intel/lbr.c b/arch/x86/events/intel/lbr.c
index 6f814a2..9b2d05c 100644
--- a/arch/x86/events/intel/lbr.c
+++ b/arch/x86/events/intel/lbr.c
@@ -1311,3 +1311,26 @@ void intel_pmu_lbr_init_knl(void)
 	if (x86_pmu.intel_cap.lbr_format == LBR_FORMAT_LIP)
 		x86_pmu.intel_cap.lbr_format = LBR_FORMAT_EIP_FLAGS;
 }
+
+/**
+ * x86_perf_get_lbr_stack - get the lbr stack related msrs
+ *
+ * @stack: the caller's memory to get the lbr stack
+ *
+ * Returns: 0 indicates that the lbr stack has been successfully obtained.
+ */
+int x86_perf_get_lbr_stack(struct x86_perf_lbr_stack *stack)
+{
+	stack->nr = x86_pmu.lbr_nr;
+	stack->tos = x86_pmu.lbr_tos;
+	stack->from = x86_pmu.lbr_from;
+	stack->to = x86_pmu.lbr_to;
+
+	if (x86_pmu.intel_cap.lbr_format == LBR_FORMAT_INFO)
+		stack->info = MSR_LBR_INFO_0;
+	else
+		stack->info = 0;
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(x86_perf_get_lbr_stack);
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index 1392d5e..2606100 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -318,7 +318,16 @@ struct perf_guest_switch_msr {
 	u64 host, guest;
 };
 
+struct x86_perf_lbr_stack {
+	unsigned int	nr;
+	unsigned int	tos;
+	unsigned int	from;
+	unsigned int	to;
+	unsigned int	info;
+};
+
 extern struct perf_guest_switch_msr *perf_guest_get_msrs(int *nr);
+extern int x86_perf_get_lbr_stack(struct x86_perf_lbr_stack *stack);
 extern void perf_get_x86_pmu_capability(struct x86_pmu_capability *cap);
 extern void perf_check_microcode(void);
 extern int x86_perf_rdpmc_index(struct perf_event *event);
@@ -329,6 +338,11 @@ static inline struct perf_guest_switch_msr *perf_guest_get_msrs(int *nr)
 	return NULL;
 }
 
+static inline int x86_perf_get_lbr_stack(struct x86_perf_lbr_stack *stack)
+{
+	return -1;
+}
+
 static inline void perf_get_x86_pmu_capability(struct x86_pmu_capability *cap)
 {
 	memset(cap, 0, sizeof(*cap));
-- 
2.7.4



* [PATCH v8 03/14] KVM/x86: KVM_CAP_X86_GUEST_LBR
  2019-08-06  7:16 [PATCH v8 00/14] Guest LBR Enabling Wei Wang
  2019-08-06  7:16 ` [PATCH v8 01/14] perf/x86: fix the variable type of the lbr msrs Wei Wang
  2019-08-06  7:16 ` [PATCH v8 02/14] perf/x86: add a function to get the addresses of the lbr stack msrs Wei Wang
@ 2019-08-06  7:16 ` Wei Wang
  2019-08-06  7:16 ` [PATCH v8 04/14] KVM/x86: intel_pmu_lbr_enable Wei Wang
                   ` (12 subsequent siblings)
  15 siblings, 0 replies; 20+ messages in thread
From: Wei Wang @ 2019-08-06  7:16 UTC (permalink / raw)
  To: linux-kernel, kvm, ak, peterz, pbonzini
  Cc: kan.liang, mingo, rkrcmar, like.xu, wei.w.wang, jannh,
	arei.gonglei, jmattson

Introduce KVM_CAP_X86_GUEST_LBR to allow per-VM enabling of the guest
lbr feature.

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
---
 Documentation/virt/kvm/api.txt  | 26 ++++++++++++++++++++++++++
 arch/x86/include/asm/kvm_host.h |  2 ++
 arch/x86/kvm/x86.c              | 16 ++++++++++++++++
 include/uapi/linux/kvm.h        |  1 +
 4 files changed, 45 insertions(+)

diff --git a/Documentation/virt/kvm/api.txt b/Documentation/virt/kvm/api.txt
index 2d06776..64632a8 100644
--- a/Documentation/virt/kvm/api.txt
+++ b/Documentation/virt/kvm/api.txt
@@ -5046,6 +5046,32 @@ it hard or impossible to use it correctly.  The availability of
 KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2 signals that those bugs are fixed.
 Userspace should not try to use KVM_CAP_MANUAL_DIRTY_LOG_PROTECT.
 
+7.19 KVM_CAP_X86_GUEST_LBR
+Architectures: x86
+Parameters: args[0] whether feature should be enabled or not
+            args[1] pointer to the userspace memory to load the lbr stack info
+
+The lbr stack info is described by
+struct x86_perf_lbr_stack {
+	unsigned int	nr;
+	unsigned int	tos;
+	unsigned int	from;
+	unsigned int	to;
+	unsigned int	info;
+};
+
+@nr: number of lbr stack entries
+@tos: index of the top of stack msr
+@from: index of the msr that stores a branch source address
+@to: index of the msr that stores a branch destination address
+@info: index of the msr that stores lbr related flags
+
+Enabling this capability allows guest accesses to the lbr feature. Otherwise,
+#GP will be injected to the guest when it accesses the lbr related msrs.
+
+After the feature is enabled, before exiting to userspace, kvm handlers should
+fill the lbr stack info into the userspace memory pointed by args[1].
+
 8. Other capabilities.
 ----------------------
 
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 7b0a4ee..d29dddd 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -875,6 +875,7 @@ struct kvm_arch {
 	atomic_t vapics_in_nmi_mode;
 	struct mutex apic_map_lock;
 	struct kvm_apic_map *apic_map;
+	struct x86_perf_lbr_stack lbr_stack;
 
 	bool apic_access_page_done;
 
@@ -884,6 +885,7 @@ struct kvm_arch {
 	bool hlt_in_guest;
 	bool pause_in_guest;
 	bool cstate_in_guest;
+	bool lbr_in_guest;
 
 	unsigned long irq_sources_bitmap;
 	s64 kvmclock_offset;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index c6d951c..e1eb1be 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3129,6 +3129,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 	case KVM_CAP_EXCEPTION_PAYLOAD:
 		r = 1;
 		break;
+	case KVM_CAP_X86_GUEST_LBR:
+		r = sizeof(struct x86_perf_lbr_stack);
+		break;
 	case KVM_CAP_SYNC_REGS:
 		r = KVM_SYNC_X86_VALID_FIELDS;
 		break;
@@ -4670,6 +4673,19 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
 		kvm->arch.exception_payload_enabled = cap->args[0];
 		r = 0;
 		break;
+	case KVM_CAP_X86_GUEST_LBR:
+		r = -EINVAL;
+		if (cap->args[0] &&
+		    x86_perf_get_lbr_stack(&kvm->arch.lbr_stack))
+			break;
+
+		if (copy_to_user((void __user *)cap->args[1],
+				 &kvm->arch.lbr_stack,
+				 sizeof(struct x86_perf_lbr_stack)))
+			break;
+		kvm->arch.lbr_in_guest = cap->args[0];
+		r = 0;
+		break;
 	default:
 		r = -EINVAL;
 		break;
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 5e3f12d..dd53edc 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -996,6 +996,7 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_ARM_PTRAUTH_ADDRESS 171
 #define KVM_CAP_ARM_PTRAUTH_GENERIC 172
 #define KVM_CAP_PMU_EVENT_FILTER 173
+#define KVM_CAP_X86_GUEST_LBR 174
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
-- 
2.7.4



* [PATCH v8 04/14] KVM/x86: intel_pmu_lbr_enable
  2019-08-06  7:16 [PATCH v8 00/14] Guest LBR Enabling Wei Wang
                   ` (2 preceding siblings ...)
  2019-08-06  7:16 ` [PATCH v8 03/14] KVM/x86: KVM_CAP_X86_GUEST_LBR Wei Wang
@ 2019-08-06  7:16 ` Wei Wang
  2019-08-06  7:16 ` [PATCH v8 05/14] KVM/x86/vPMU: tweak kvm_pmu_get_msr Wei Wang
                   ` (11 subsequent siblings)
  15 siblings, 0 replies; 20+ messages in thread
From: Wei Wang @ 2019-08-06  7:16 UTC (permalink / raw)
  To: linux-kernel, kvm, ak, peterz, pbonzini
  Cc: kan.liang, mingo, rkrcmar, like.xu, wei.w.wang, jannh,
	arei.gonglei, jmattson

The lbr stack is model specific: for example, SKX has 32 lbr stack
entries while HSW has 16, so a HSW guest running on a SKX machine may not
get accurate perf results. Currently, we forbid enabling the guest lbr
when the guest and host see a different number of lbr stack entries or
different lbr stack msr indices.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Kan Liang <kan.liang@intel.com>
Signed-off-by: Wei Wang <wei.w.wang@intel.com>
---
 arch/x86/kvm/pmu.c           |   8 +++
 arch/x86/kvm/pmu.h           |   2 +
 arch/x86/kvm/vmx/pmu_intel.c | 136 +++++++++++++++++++++++++++++++++++++++++++
 arch/x86/kvm/x86.c           |   3 +-
 4 files changed, 147 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index 46875bb..26fac6c 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -331,6 +331,14 @@ int kvm_pmu_rdpmc(struct kvm_vcpu *vcpu, unsigned idx, u64 *data)
 	return 0;
 }
 
+bool kvm_pmu_lbr_enable(struct kvm_vcpu *vcpu)
+{
+	if (kvm_x86_ops->pmu_ops->lbr_enable)
+		return kvm_x86_ops->pmu_ops->lbr_enable(vcpu);
+
+	return false;
+}
+
 void kvm_pmu_deliver_pmi(struct kvm_vcpu *vcpu)
 {
 	if (lapic_in_kernel(vcpu))
diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
index 58265f7..d9eec9a 100644
--- a/arch/x86/kvm/pmu.h
+++ b/arch/x86/kvm/pmu.h
@@ -29,6 +29,7 @@ struct kvm_pmu_ops {
 					  u64 *mask);
 	int (*is_valid_msr_idx)(struct kvm_vcpu *vcpu, unsigned idx);
 	bool (*is_valid_msr)(struct kvm_vcpu *vcpu, u32 msr);
+	bool (*lbr_enable)(struct kvm_vcpu *vcpu);
 	int (*get_msr)(struct kvm_vcpu *vcpu, u32 msr, u64 *data);
 	int (*set_msr)(struct kvm_vcpu *vcpu, struct msr_data *msr_info);
 	void (*refresh)(struct kvm_vcpu *vcpu);
@@ -107,6 +108,7 @@ void reprogram_gp_counter(struct kvm_pmc *pmc, u64 eventsel);
 void reprogram_fixed_counter(struct kvm_pmc *pmc, u8 ctrl, int fixed_idx);
 void reprogram_counter(struct kvm_pmu *pmu, int pmc_idx);
 
+bool kvm_pmu_lbr_enable(struct kvm_vcpu *vcpu);
 void kvm_pmu_deliver_pmi(struct kvm_vcpu *vcpu);
 void kvm_pmu_handle_event(struct kvm_vcpu *vcpu);
 int kvm_pmu_rdpmc(struct kvm_vcpu *vcpu, unsigned pmc, u64 *data);
diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index 4dea0e0..6294a86 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -12,6 +12,7 @@
 #include <linux/kvm_host.h>
 #include <linux/perf_event.h>
 #include <asm/perf_event.h>
+#include <asm/intel-family.h>
 #include "x86.h"
 #include "cpuid.h"
 #include "lapic.h"
@@ -162,6 +163,140 @@ static bool intel_is_valid_msr(struct kvm_vcpu *vcpu, u32 msr)
 	return ret;
 }
 
+static bool intel_pmu_lbr_enable(struct kvm_vcpu *vcpu)
+{
+	struct kvm *kvm = vcpu->kvm;
+	u8 vcpu_model = guest_cpuid_model(vcpu);
+	unsigned int vcpu_lbr_from, vcpu_lbr_nr;
+
+	if (x86_perf_get_lbr_stack(&kvm->arch.lbr_stack))
+		return false;
+
+	if (guest_cpuid_family(vcpu) != boot_cpu_data.x86)
+		return false;
+
+	/*
+	 * It could be possible that people have vcpus of old model run on
+	 * physical cpus of newer model, for example a BDW guest on a SKX
+	 * machine (but not possible to be the other way around).
+	 * The BDW guest may not get accurate results on a SKX machine as it
+	 * only reads 16 entries of the lbr stack while there are 32 entries
+	 * of recordings. We currently forbid the lbr enabling when the vcpu
+	 * and physical cpu see different lbr stack entries or the guest lbr
+	 * msr indices are not compatible with the host.
+	 */
+	switch (vcpu_model) {
+	case INTEL_FAM6_CORE2_MEROM:
+	case INTEL_FAM6_CORE2_MEROM_L:
+	case INTEL_FAM6_CORE2_PENRYN:
+	case INTEL_FAM6_CORE2_DUNNINGTON:
+		/* intel_pmu_lbr_init_core() */
+		vcpu_lbr_nr = 4;
+		vcpu_lbr_from = MSR_LBR_CORE_FROM;
+		break;
+	case INTEL_FAM6_NEHALEM:
+	case INTEL_FAM6_NEHALEM_EP:
+	case INTEL_FAM6_NEHALEM_EX:
+		/* intel_pmu_lbr_init_nhm() */
+		vcpu_lbr_nr = 16;
+		vcpu_lbr_from = MSR_LBR_NHM_FROM;
+		break;
+	case INTEL_FAM6_ATOM_BONNELL:
+	case INTEL_FAM6_ATOM_BONNELL_MID:
+	case INTEL_FAM6_ATOM_SALTWELL:
+	case INTEL_FAM6_ATOM_SALTWELL_MID:
+	case INTEL_FAM6_ATOM_SALTWELL_TABLET:
+		/* intel_pmu_lbr_init_atom() */
+		vcpu_lbr_nr = 8;
+		vcpu_lbr_from = MSR_LBR_CORE_FROM;
+		break;
+	case INTEL_FAM6_ATOM_SILVERMONT:
+	case INTEL_FAM6_ATOM_SILVERMONT_X:
+	case INTEL_FAM6_ATOM_SILVERMONT_MID:
+	case INTEL_FAM6_ATOM_AIRMONT:
+	case INTEL_FAM6_ATOM_AIRMONT_MID:
+		/* intel_pmu_lbr_init_slm() */
+		vcpu_lbr_nr = 8;
+		vcpu_lbr_from = MSR_LBR_CORE_FROM;
+		break;
+	case INTEL_FAM6_ATOM_GOLDMONT:
+	case INTEL_FAM6_ATOM_GOLDMONT_X:
+		/* intel_pmu_lbr_init_skl(); */
+		vcpu_lbr_nr = 32;
+		vcpu_lbr_from = MSR_LBR_NHM_FROM;
+		break;
+	case INTEL_FAM6_ATOM_GOLDMONT_PLUS:
+		/* intel_pmu_lbr_init_skl()*/
+		vcpu_lbr_nr = 32;
+		vcpu_lbr_from = MSR_LBR_NHM_FROM;
+		break;
+	case INTEL_FAM6_WESTMERE:
+	case INTEL_FAM6_WESTMERE_EP:
+	case INTEL_FAM6_WESTMERE_EX:
+		/* intel_pmu_lbr_init_nhm() */
+		vcpu_lbr_nr = 16;
+		vcpu_lbr_from = MSR_LBR_NHM_FROM;
+		break;
+	case INTEL_FAM6_SANDYBRIDGE:
+	case INTEL_FAM6_SANDYBRIDGE_X:
+		/* intel_pmu_lbr_init_snb() */
+		vcpu_lbr_nr = 16;
+		vcpu_lbr_from = MSR_LBR_NHM_FROM;
+		break;
+	case INTEL_FAM6_IVYBRIDGE:
+	case INTEL_FAM6_IVYBRIDGE_X:
+		/* intel_pmu_lbr_init_snb() */
+		vcpu_lbr_nr = 16;
+		vcpu_lbr_from = MSR_LBR_NHM_FROM;
+		break;
+	case INTEL_FAM6_HASWELL_CORE:
+	case INTEL_FAM6_HASWELL_X:
+	case INTEL_FAM6_HASWELL_ULT:
+	case INTEL_FAM6_HASWELL_GT3E:
+		/* intel_pmu_lbr_init_hsw() */
+		vcpu_lbr_nr = 16;
+		vcpu_lbr_from = MSR_LBR_NHM_FROM;
+		break;
+	case INTEL_FAM6_BROADWELL_CORE:
+	case INTEL_FAM6_BROADWELL_XEON_D:
+	case INTEL_FAM6_BROADWELL_GT3E:
+	case INTEL_FAM6_BROADWELL_X:
+		/* intel_pmu_lbr_init_hsw() */
+		vcpu_lbr_nr = 16;
+		vcpu_lbr_from = MSR_LBR_NHM_FROM;
+		break;
+	case INTEL_FAM6_XEON_PHI_KNL:
+	case INTEL_FAM6_XEON_PHI_KNM:
+		/* intel_pmu_lbr_init_knl() */
+		vcpu_lbr_nr = 8;
+		vcpu_lbr_from = MSR_LBR_NHM_FROM;
+		break;
+	case INTEL_FAM6_SKYLAKE_MOBILE:
+	case INTEL_FAM6_SKYLAKE_DESKTOP:
+	case INTEL_FAM6_SKYLAKE_X:
+	case INTEL_FAM6_KABYLAKE_MOBILE:
+	case INTEL_FAM6_KABYLAKE_DESKTOP:
+		/* intel_pmu_lbr_init_skl() */
+		vcpu_lbr_nr = 32;
+		vcpu_lbr_from = MSR_LBR_NHM_FROM;
+		break;
+	default:
+		vcpu_lbr_nr = 0;
+		vcpu_lbr_from = 0;
+		pr_warn("%s: vcpu model not supported %d\n", __func__,
+			vcpu_model);
+	}
+
+	if (vcpu_lbr_nr != kvm->arch.lbr_stack.nr ||
+	    vcpu_lbr_from != kvm->arch.lbr_stack.from) {
+		pr_warn("%s: vcpu model %x incompatible to pcpu %x\n",
+			__func__, vcpu_model, boot_cpu_data.x86_model);
+		return false;
+	}
+
+	return true;
+}
+
 static int intel_pmu_get_msr(struct kvm_vcpu *vcpu, u32 msr, u64 *data)
 {
 	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
@@ -366,6 +501,7 @@ struct kvm_pmu_ops intel_pmu_ops = {
 	.msr_idx_to_pmc = intel_msr_idx_to_pmc,
 	.is_valid_msr_idx = intel_is_valid_msr_idx,
 	.is_valid_msr = intel_is_valid_msr,
+	.lbr_enable = intel_pmu_lbr_enable,
 	.get_msr = intel_pmu_get_msr,
 	.set_msr = intel_pmu_set_msr,
 	.refresh = intel_pmu_refresh,
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index e1eb1be..da45962 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4675,8 +4675,7 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
 		break;
 	case KVM_CAP_X86_GUEST_LBR:
 		r = -EINVAL;
-		if (cap->args[0] &&
-		    x86_perf_get_lbr_stack(&kvm->arch.lbr_stack))
+		if (cap->args[0] && !kvm_pmu_lbr_enable(kvm->vcpus[0]))
 			break;
 
 		if (copy_to_user((void __user *)cap->args[1],
-- 
2.7.4



* [PATCH v8 05/14] KVM/x86/vPMU: tweak kvm_pmu_get_msr
  2019-08-06  7:16 [PATCH v8 00/14] Guest LBR Enabling Wei Wang
                   ` (3 preceding siblings ...)
  2019-08-06  7:16 ` [PATCH v8 04/14] KVM/x86: intel_pmu_lbr_enable Wei Wang
@ 2019-08-06  7:16 ` Wei Wang
  2019-08-06  7:16 ` [PATCH v8 06/14] KVM/x86: expose MSR_IA32_PERF_CAPABILITIES to the guest Wei Wang
                   ` (10 subsequent siblings)
  15 siblings, 0 replies; 20+ messages in thread
From: Wei Wang @ 2019-08-06  7:16 UTC (permalink / raw)
  To: linux-kernel, kvm, ak, peterz, pbonzini
  Cc: kan.liang, mingo, rkrcmar, like.xu, wei.w.wang, jannh,
	arei.gonglei, jmattson

Change kvm_pmu_get_msr to take the msr_data struct, as the host_initiated
field of the struct can be used by get_msr. This also makes the API
consistent with kvm_pmu_set_msr.
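
For example, with the full msr_data available, a get_msr handler can treat
host-initiated reads differently from guest reads, as the PERF_CAPABILITIES
handler in patch 06 of this series does:

    if (!msr_info->host_initiated &&
        !guest_cpuid_has(vcpu, X86_FEATURE_PDCM))
            return 1;    /* fault guest reads unless PDCM is exposed */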

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Wei Wang <wei.w.wang@intel.com>
---
 arch/x86/kvm/pmu.c           |  4 ++--
 arch/x86/kvm/pmu.h           |  4 ++--
 arch/x86/kvm/pmu_amd.c       |  7 ++++---
 arch/x86/kvm/vmx/pmu_intel.c | 19 +++++++++++--------
 arch/x86/kvm/x86.c           |  4 ++--
 5 files changed, 21 insertions(+), 17 deletions(-)

diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index 26fac6c..1a291ed 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -350,9 +350,9 @@ bool kvm_pmu_is_valid_msr(struct kvm_vcpu *vcpu, u32 msr)
 	return kvm_x86_ops->pmu_ops->is_valid_msr(vcpu, msr);
 }
 
-int kvm_pmu_get_msr(struct kvm_vcpu *vcpu, u32 msr, u64 *data)
+int kvm_pmu_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 {
-	return kvm_x86_ops->pmu_ops->get_msr(vcpu, msr, data);
+	return kvm_x86_ops->pmu_ops->get_msr(vcpu, msr_info);
 }
 
 int kvm_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
index d9eec9a..f61024e 100644
--- a/arch/x86/kvm/pmu.h
+++ b/arch/x86/kvm/pmu.h
@@ -30,7 +30,7 @@ struct kvm_pmu_ops {
 	int (*is_valid_msr_idx)(struct kvm_vcpu *vcpu, unsigned idx);
 	bool (*is_valid_msr)(struct kvm_vcpu *vcpu, u32 msr);
 	bool (*lbr_enable)(struct kvm_vcpu *vcpu);
-	int (*get_msr)(struct kvm_vcpu *vcpu, u32 msr, u64 *data);
+	int (*get_msr)(struct kvm_vcpu *vcpu, struct msr_data *msr_info);
 	int (*set_msr)(struct kvm_vcpu *vcpu, struct msr_data *msr_info);
 	void (*refresh)(struct kvm_vcpu *vcpu);
 	void (*init)(struct kvm_vcpu *vcpu);
@@ -114,7 +114,7 @@ void kvm_pmu_handle_event(struct kvm_vcpu *vcpu);
 int kvm_pmu_rdpmc(struct kvm_vcpu *vcpu, unsigned pmc, u64 *data);
 int kvm_pmu_is_valid_msr_idx(struct kvm_vcpu *vcpu, unsigned idx);
 bool kvm_pmu_is_valid_msr(struct kvm_vcpu *vcpu, u32 msr);
-int kvm_pmu_get_msr(struct kvm_vcpu *vcpu, u32 msr, u64 *data);
+int kvm_pmu_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info);
 int kvm_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info);
 void kvm_pmu_refresh(struct kvm_vcpu *vcpu);
 void kvm_pmu_reset(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/pmu_amd.c b/arch/x86/kvm/pmu_amd.c
index c838838..4a64a3f 100644
--- a/arch/x86/kvm/pmu_amd.c
+++ b/arch/x86/kvm/pmu_amd.c
@@ -208,21 +208,22 @@ static bool amd_is_valid_msr(struct kvm_vcpu *vcpu, u32 msr)
 	return ret;
 }
 
-static int amd_pmu_get_msr(struct kvm_vcpu *vcpu, u32 msr, u64 *data)
+static int amd_pmu_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 {
 	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
 	struct kvm_pmc *pmc;
+	u32 msr = msr_info->index;
 
 	/* MSR_PERFCTRn */
 	pmc = get_gp_pmc_amd(pmu, msr, PMU_TYPE_COUNTER);
 	if (pmc) {
-		*data = pmc_read_counter(pmc);
+		msr_info->data = pmc_read_counter(pmc);
 		return 0;
 	}
 	/* MSR_EVNTSELn */
 	pmc = get_gp_pmc_amd(pmu, msr, PMU_TYPE_EVNTSEL);
 	if (pmc) {
-		*data = pmc->eventsel;
+		msr_info->data = pmc->eventsel;
 		return 0;
 	}
 
diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index 6294a86..53bb95e 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -297,35 +297,38 @@ static bool intel_pmu_lbr_enable(struct kvm_vcpu *vcpu)
 	return true;
 }
 
-static int intel_pmu_get_msr(struct kvm_vcpu *vcpu, u32 msr, u64 *data)
+static int intel_pmu_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 {
 	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
 	struct kvm_pmc *pmc;
+	u32 msr = msr_info->index;
 
 	switch (msr) {
 	case MSR_CORE_PERF_FIXED_CTR_CTRL:
-		*data = pmu->fixed_ctr_ctrl;
+		msr_info->data = pmu->fixed_ctr_ctrl;
 		return 0;
 	case MSR_CORE_PERF_GLOBAL_STATUS:
-		*data = pmu->global_status;
+		msr_info->data = pmu->global_status;
 		return 0;
 	case MSR_CORE_PERF_GLOBAL_CTRL:
-		*data = pmu->global_ctrl;
+		msr_info->data = pmu->global_ctrl;
 		return 0;
 	case MSR_CORE_PERF_GLOBAL_OVF_CTRL:
-		*data = pmu->global_ovf_ctrl;
+		msr_info->data = pmu->global_ovf_ctrl;
 		return 0;
 	default:
 		if ((pmc = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0))) {
 			u64 val = pmc_read_counter(pmc);
-			*data = val & pmu->counter_bitmask[KVM_PMC_GP];
+			msr_info->data =
+				val & pmu->counter_bitmask[KVM_PMC_GP];
 			return 0;
 		} else if ((pmc = get_fixed_pmc(pmu, msr))) {
 			u64 val = pmc_read_counter(pmc);
-			*data = val & pmu->counter_bitmask[KVM_PMC_FIXED];
+			msr_info->data =
+				val & pmu->counter_bitmask[KVM_PMC_FIXED];
 			return 0;
 		} else if ((pmc = get_gp_pmc(pmu, msr, MSR_P6_EVNTSEL0))) {
-			*data = pmc->eventsel;
+			msr_info->data = pmc->eventsel;
 			return 0;
 		}
 	}
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index da45962..7a7cd93 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2824,7 +2824,7 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 	case MSR_P6_PERFCTR0 ... MSR_P6_PERFCTR1:
 	case MSR_P6_EVNTSEL0 ... MSR_P6_EVNTSEL1:
 		if (kvm_pmu_is_valid_msr(vcpu, msr_info->index))
-			return kvm_pmu_get_msr(vcpu, msr_info->index, &msr_info->data);
+			return kvm_pmu_get_msr(vcpu, msr_info);
 		msr_info->data = 0;
 		break;
 	case MSR_IA32_UCODE_REV:
@@ -2982,7 +2982,7 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 		break;
 	default:
 		if (kvm_pmu_is_valid_msr(vcpu, msr_info->index))
-			return kvm_pmu_get_msr(vcpu, msr_info->index, &msr_info->data);
+			return kvm_pmu_get_msr(vcpu, msr_info);
 		if (!ignore_msrs) {
 			vcpu_debug_ratelimited(vcpu, "unhandled rdmsr: 0x%x\n",
 					       msr_info->index);
-- 
2.7.4



* [PATCH v8 06/14] KVM/x86: expose MSR_IA32_PERF_CAPABILITIES to the guest
  2019-08-06  7:16 [PATCH v8 00/14] Guest LBR Enabling Wei Wang
                   ` (4 preceding siblings ...)
  2019-08-06  7:16 ` [PATCH v8 05/14] KVM/x86/vPMU: tweak kvm_pmu_get_msr Wei Wang
@ 2019-08-06  7:16 ` Wei Wang
  2019-08-06  7:16 ` [PATCH v8 07/14] perf/x86: support to create a perf event without counter allocation Wei Wang
                   ` (9 subsequent siblings)
  15 siblings, 0 replies; 20+ messages in thread
From: Wei Wang @ 2019-08-06  7:16 UTC (permalink / raw)
  To: linux-kernel, kvm, ak, peterz, pbonzini
  Cc: kan.liang, mingo, rkrcmar, like.xu, wei.w.wang, jannh,
	arei.gonglei, jmattson

Bits [5:0] of MSR_IA32_PERF_CAPABILITIES describe the format of the
addresses stored in the lbr stack. Expose those bits to the guest when the
guest lbr feature is enabled.
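
For illustration, the lbr format field is consumed by masking the low six
bits; a minimal sketch (X86_PERF_CAP_MASK_LBR_FMT is the 0x3f mask added
below; a guest kernel would use its own equivalent definition):

    u64 caps;

    rdmsrl(MSR_IA32_PERF_CAPABILITIES, caps);
    caps &= X86_PERF_CAP_MASK_LBR_FMT;    /* bits 5:0: lbr address format */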

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Wei Wang <wei.w.wang@intel.com>
---
 arch/x86/include/asm/perf_event.h |  2 ++
 arch/x86/kvm/cpuid.c              |  2 +-
 arch/x86/kvm/vmx/pmu_intel.c      | 16 ++++++++++++++++
 3 files changed, 19 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index 2606100..aa77da2 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -95,6 +95,8 @@
 #define PEBS_DATACFG_LBRS	BIT_ULL(3)
 #define PEBS_DATACFG_LBR_SHIFT	24
 
+#define X86_PERF_CAP_MASK_LBR_FMT			0x3f
+
 /*
  * Intel "Architectural Performance Monitoring" CPUID
  * detection/enumeration details:
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 22c2720..826b2dc 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -458,7 +458,7 @@ static inline int __do_cpuid_func(struct kvm_cpuid_entry2 *entry, u32 function,
 		F(XMM3) | F(PCLMULQDQ) | 0 /* DTES64, MONITOR */ |
 		0 /* DS-CPL, VMX, SMX, EST */ |
 		0 /* TM2 */ | F(SSSE3) | 0 /* CNXT-ID */ | 0 /* Reserved */ |
-		F(FMA) | F(CX16) | 0 /* xTPR Update, PDCM */ |
+		F(FMA) | F(CX16) | 0 /* xTPR Update*/ | F(PDCM) |
 		F(PCID) | 0 /* Reserved, DCA */ | F(XMM4_1) |
 		F(XMM4_2) | F(X2APIC) | F(MOVBE) | F(POPCNT) |
 		0 /* Reserved*/ | F(AES) | F(XSAVE) | 0 /* OSXSAVE */ | F(AVX) |
diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index 53bb95e..f0ad78f 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -151,6 +151,7 @@ static bool intel_is_valid_msr(struct kvm_vcpu *vcpu, u32 msr)
 	case MSR_CORE_PERF_GLOBAL_STATUS:
 	case MSR_CORE_PERF_GLOBAL_CTRL:
 	case MSR_CORE_PERF_GLOBAL_OVF_CTRL:
+	case MSR_IA32_PERF_CAPABILITIES:
 		ret = pmu->version > 1;
 		break;
 	default:
@@ -316,6 +317,19 @@ static int intel_pmu_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 	case MSR_CORE_PERF_GLOBAL_OVF_CTRL:
 		msr_info->data = pmu->global_ovf_ctrl;
 		return 0;
+	case MSR_IA32_PERF_CAPABILITIES: {
+		u64 data;
+
+		if (!boot_cpu_has(X86_FEATURE_PDCM) ||
+		    (!msr_info->host_initiated &&
+		     !guest_cpuid_has(vcpu, X86_FEATURE_PDCM)))
+			return 1;
+		data = native_read_msr(MSR_IA32_PERF_CAPABILITIES);
+		msr_info->data = 0;
+		if (vcpu->kvm->arch.lbr_in_guest)
+			msr_info->data |= (data & X86_PERF_CAP_MASK_LBR_FMT);
+		return 0;
+	}
 	default:
 		if ((pmc = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0))) {
 			u64 val = pmc_read_counter(pmc);
@@ -374,6 +388,8 @@ static int intel_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 			return 0;
 		}
 		break;
+	case MSR_IA32_PERF_CAPABILITIES:
+		return 1; /* RO MSR */
 	default:
 		if ((pmc = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0))) {
 			if (msr_info->host_initiated)
-- 
2.7.4



* [PATCH v8 07/14] perf/x86: support to create a perf event without counter allocation
  2019-08-06  7:16 [PATCH v8 00/14] Guest LBR Enabling Wei Wang
                   ` (5 preceding siblings ...)
  2019-08-06  7:16 ` [PATCH v8 06/14] KVM/x86: expose MSR_IA32_PERF_CAPABILITIES to the guest Wei Wang
@ 2019-08-06  7:16 ` Wei Wang
  2019-08-06  7:16 ` [PATCH v8 08/14] perf/core: set the event->owner before event_init Wei Wang
                   ` (8 subsequent siblings)
  15 siblings, 0 replies; 20+ messages in thread
From: Wei Wang @ 2019-08-06  7:16 UTC (permalink / raw)
  To: linux-kernel, kvm, ak, peterz, pbonzini
  Cc: kan.liang, mingo, rkrcmar, like.xu, wei.w.wang, jannh,
	arei.gonglei, jmattson

A hypervisor may create an lbr event for a vcpu's lbr emulation, and that
emulation fundamentally doesn't need a counter. This matches the x86 SDM's
description of lbr, which doesn't involve a counter, and it avoids wasting
one.

Teach the perf scheduler to not assign a counter to a perf event that
doesn't need one. Define a macro, X86_PMC_IDX_NA, to replace "-1" as the
id of a never-assigned counter.

Cc: Andi Kleen <ak@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Wei Wang <wei.w.wang@intel.com>
https://lkml.kernel.org/r/20180920162407.GA24124@hirez.programming.kicks-ass.net
---
 arch/x86/events/core.c            | 36 +++++++++++++++++++++++++++---------
 arch/x86/events/intel/core.c      |  3 +++
 arch/x86/include/asm/perf_event.h |  1 +
 include/linux/perf_event.h        | 11 +++++++++++
 4 files changed, 42 insertions(+), 9 deletions(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 81b005e..ffa27bb 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -73,7 +73,7 @@ u64 x86_perf_event_update(struct perf_event *event)
 	int idx = hwc->idx;
 	u64 delta;
 
-	if (idx == INTEL_PMC_IDX_FIXED_BTS)
+	if ((idx == INTEL_PMC_IDX_FIXED_BTS) || (idx == X86_PMC_IDX_NA))
 		return 0;
 
 	/*
@@ -595,7 +595,7 @@ static int __x86_pmu_event_init(struct perf_event *event)
 	atomic_inc(&active_events);
 	event->destroy = hw_perf_event_destroy;
 
-	event->hw.idx = -1;
+	event->hw.idx = X86_PMC_IDX_NA;
 	event->hw.last_cpu = -1;
 	event->hw.last_tag = ~0ULL;
 
@@ -763,6 +763,8 @@ static bool perf_sched_restore_state(struct perf_sched *sched)
 static bool __perf_sched_find_counter(struct perf_sched *sched)
 {
 	struct event_constraint *c;
+	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+	struct perf_event *e = cpuc->event_list[sched->state.event];
 	int idx;
 
 	if (!sched->state.unassigned)
@@ -772,6 +774,14 @@ static bool __perf_sched_find_counter(struct perf_sched *sched)
 		return false;
 
 	c = sched->constraints[sched->state.event];
+	if (c == &emptyconstraint)
+		return false;
+
+	if (is_no_counter_event(e)) {
+		idx = X86_PMC_IDX_NA;
+		goto done;
+	}
+
 	/* Prefer fixed purpose counters */
 	if (c->idxmsk64 & (~0ULL << INTEL_PMC_IDX_FIXED)) {
 		idx = INTEL_PMC_IDX_FIXED;
@@ -797,7 +807,7 @@ static bool __perf_sched_find_counter(struct perf_sched *sched)
 done:
 	sched->state.counter = idx;
 
-	if (c->overlap)
+	if ((idx != X86_PMC_IDX_NA) && c->overlap)
 		perf_sched_save_state(sched);
 
 	return true;
@@ -918,7 +928,7 @@ int x86_schedule_events(struct cpu_hw_events *cpuc, int n, int *assign)
 		c = cpuc->event_constraint[i];
 
 		/* never assigned */
-		if (hwc->idx == -1)
+		if (hwc->idx == X86_PMC_IDX_NA)
 			break;
 
 		/* constraint still honored */
@@ -969,7 +979,8 @@ int x86_schedule_events(struct cpu_hw_events *cpuc, int n, int *assign)
 	if (!unsched && assign) {
 		for (i = 0; i < n; i++) {
 			e = cpuc->event_list[i];
-			if (x86_pmu.commit_scheduling)
+			if (x86_pmu.commit_scheduling &&
+			    (assign[i] != X86_PMC_IDX_NA))
 				x86_pmu.commit_scheduling(cpuc, i, assign[i]);
 		}
 	} else {
@@ -1038,7 +1049,8 @@ static inline void x86_assign_hw_event(struct perf_event *event,
 	hwc->last_cpu = smp_processor_id();
 	hwc->last_tag = ++cpuc->tags[i];
 
-	if (hwc->idx == INTEL_PMC_IDX_FIXED_BTS) {
+	if ((hwc->idx == INTEL_PMC_IDX_FIXED_BTS) ||
+	    (hwc->idx == X86_PMC_IDX_NA)) {
 		hwc->config_base = 0;
 		hwc->event_base	= 0;
 	} else if (hwc->idx >= INTEL_PMC_IDX_FIXED) {
@@ -1115,7 +1127,7 @@ static void x86_pmu_enable(struct pmu *pmu)
 			 * - running on same CPU as last time
 			 * - no other event has used the counter since
 			 */
-			if (hwc->idx == -1 ||
+			if (hwc->idx == X86_PMC_IDX_NA ||
 			    match_prev_assignment(hwc, cpuc, i))
 				continue;
 
@@ -1169,7 +1181,7 @@ int x86_perf_event_set_period(struct perf_event *event)
 	s64 period = hwc->sample_period;
 	int ret = 0, idx = hwc->idx;
 
-	if (idx == INTEL_PMC_IDX_FIXED_BTS)
+	if ((idx == INTEL_PMC_IDX_FIXED_BTS) || (idx == X86_PMC_IDX_NA))
 		return 0;
 
 	/*
@@ -1306,7 +1318,7 @@ static void x86_pmu_start(struct perf_event *event, int flags)
 	if (WARN_ON_ONCE(!(event->hw.state & PERF_HES_STOPPED)))
 		return;
 
-	if (WARN_ON_ONCE(idx == -1))
+	if (idx == X86_PMC_IDX_NA)
 		return;
 
 	if (flags & PERF_EF_RELOAD) {
@@ -1388,6 +1400,9 @@ void x86_pmu_stop(struct perf_event *event, int flags)
 	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
 	struct hw_perf_event *hwc = &event->hw;
 
+	if (hwc->idx == X86_PMC_IDX_NA)
+		return;
+
 	if (test_bit(hwc->idx, cpuc->active_mask)) {
 		x86_pmu.disable(event);
 		__clear_bit(hwc->idx, cpuc->active_mask);
@@ -2128,6 +2143,9 @@ static int x86_pmu_event_idx(struct perf_event *event)
 	if (!(event->hw.flags & PERF_X86_EVENT_RDPMC_ALLOWED))
 		return 0;
 
+	if (idx == X86_PMC_IDX_NA)
+		return X86_PMC_IDX_NA;
+
 	if (x86_pmu.num_counters_fixed && idx >= INTEL_PMC_IDX_FIXED) {
 		idx -= INTEL_PMC_IDX_FIXED;
 		idx |= 1 << 30;
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 648260b5..177c321 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -2156,6 +2156,9 @@ static void intel_pmu_disable_event(struct perf_event *event)
 		return;
 	}
 
+	if (hwc->idx == X86_PMC_IDX_NA)
+		return;
+
 	cpuc->intel_ctrl_guest_mask &= ~(1ull << hwc->idx);
 	cpuc->intel_ctrl_host_mask &= ~(1ull << hwc->idx);
 	cpuc->intel_cp_status &= ~(1ull << hwc->idx);
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index aa77da2..23ab90c 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -11,6 +11,7 @@
 #define INTEL_PMC_IDX_FIXED				       32
 
 #define X86_PMC_IDX_MAX					       64
+#define X86_PMC_IDX_NA					       -1
 
 #define MSR_ARCH_PERFMON_PERFCTR0			      0xc1
 #define MSR_ARCH_PERFMON_PERFCTR1			      0xc2
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index e8ad3c5..2cae06a 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -530,6 +530,7 @@ typedef void (*perf_overflow_handler_t)(struct perf_event *,
  */
 #define PERF_EV_CAP_SOFTWARE		BIT(0)
 #define PERF_EV_CAP_READ_ACTIVE_PKG	BIT(1)
+#define PERF_EV_CAP_NO_COUNTER		BIT(2)
 
 #define SWEVENT_HLIST_BITS		8
 #define SWEVENT_HLIST_SIZE		(1 << SWEVENT_HLIST_BITS)
@@ -1039,6 +1040,16 @@ static inline bool is_sampling_event(struct perf_event *event)
 	return event->attr.sample_period != 0;
 }
 
+static inline bool is_no_counter_event(struct perf_event *event)
+{
+	return !!(event->event_caps & PERF_EV_CAP_NO_COUNTER);
+}
+
+static inline void perf_event_set_no_counter(struct perf_event *event)
+{
+	event->event_caps |= PERF_EV_CAP_NO_COUNTER;
+}
+
 /*
  * Return 1 for a software event, 0 for a hardware event
  */
-- 
2.7.4



* [PATCH v8 08/14] perf/core: set the event->owner before event_init
  2019-08-06  7:16 [PATCH v8 00/14] Guest LBR Enabling Wei Wang
                   ` (6 preceding siblings ...)
  2019-08-06  7:16 ` [PATCH v8 07/14] perf/x86: support to create a perf event without counter allocation Wei Wang
@ 2019-08-06  7:16 ` Wei Wang
  2019-08-06  7:16 ` [PATCH v8 09/14] KVM/x86/vPMU: APIs to create/free lbr perf event for a vcpu thread Wei Wang
                   ` (7 subsequent siblings)
  15 siblings, 0 replies; 20+ messages in thread
From: Wei Wang @ 2019-08-06  7:16 UTC (permalink / raw)
  To: linux-kernel, kvm, ak, peterz, pbonzini
  Cc: kan.liang, mingo, rkrcmar, like.xu, wei.w.wang, jannh,
	arei.gonglei, jmattson

Kernel and user events can be distinguished by checking event->owner. Some
pmu driver implementations may need to know event->owner in event_init. For
example, intel_pmu_setup_lbr_filter treats a kernel event with
exclude_host set as an lbr event created for guest lbr emulation, which
doesn't need a pmu counter.

So move the event->owner assignment into perf_event_alloc to have it
set before event_init is called.
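
For reference, the check this enables in the lbr driver boils down to the
following two lines, added to intel_pmu_setup_lbr_filter() in patch 09 of
this series:

    if (event->attr.exclude_host && is_kernel_event(event))
            perf_event_set_no_counter(event);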

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
---
 kernel/events/core.c | 12 +++++-------
 1 file changed, 5 insertions(+), 7 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 0463c11..7663f85 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -10288,6 +10288,7 @@ static void account_event(struct perf_event *event)
 static struct perf_event *
 perf_event_alloc(struct perf_event_attr *attr, int cpu,
 		 struct task_struct *task,
+		 struct task_struct *owner,
 		 struct perf_event *group_leader,
 		 struct perf_event *parent_event,
 		 perf_overflow_handler_t overflow_handler,
@@ -10340,6 +10341,7 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
 	event->group_leader	= group_leader;
 	event->pmu		= NULL;
 	event->oncpu		= -1;
+	event->owner		= owner;
 
 	event->parent		= parent_event;
 
@@ -10891,7 +10893,7 @@ SYSCALL_DEFINE5(perf_event_open,
 	if (flags & PERF_FLAG_PID_CGROUP)
 		cgroup_fd = pid;
 
-	event = perf_event_alloc(&attr, cpu, task, group_leader, NULL,
+	event = perf_event_alloc(&attr, cpu, task, current, group_leader, NULL,
 				 NULL, NULL, cgroup_fd);
 	if (IS_ERR(event)) {
 		err = PTR_ERR(event);
@@ -11153,8 +11155,6 @@ SYSCALL_DEFINE5(perf_event_open,
 	perf_event__header_size(event);
 	perf_event__id_header_size(event);
 
-	event->owner = current;
-
 	perf_install_in_context(ctx, event, event->cpu);
 	perf_unpin_context(ctx);
 
@@ -11231,16 +11231,13 @@ perf_event_create_kernel_counter(struct perf_event_attr *attr, int cpu,
 	 * Get the target context (task or percpu):
 	 */
 
-	event = perf_event_alloc(attr, cpu, task, NULL, NULL,
+	event = perf_event_alloc(attr, cpu, task, TASK_TOMBSTONE, NULL, NULL,
 				 overflow_handler, context, -1);
 	if (IS_ERR(event)) {
 		err = PTR_ERR(event);
 		goto err;
 	}
 
-	/* Mark owner so we could distinguish it from user events. */
-	event->owner = TASK_TOMBSTONE;
-
 	ctx = find_get_context(event->pmu, task, event);
 	if (IS_ERR(ctx)) {
 		err = PTR_ERR(ctx);
@@ -11677,6 +11674,7 @@ inherit_event(struct perf_event *parent_event,
 
 	child_event = perf_event_alloc(&parent_event->attr,
 					   parent_event->cpu,
+					   parent_event->owner,
 					   child,
 					   group_leader, parent_event,
 					   NULL, NULL, -1);
-- 
2.7.4



* [PATCH v8 09/14] KVM/x86/vPMU: APIs to create/free lbr perf event for a vcpu thread
  2019-08-06  7:16 [PATCH v8 00/14] Guest LBR Enabling Wei Wang
                   ` (7 preceding siblings ...)
  2019-08-06  7:16 ` [PATCH v8 08/14] perf/core: set the event->owner before event_init Wei Wang
@ 2019-08-06  7:16 ` Wei Wang
  2019-08-06  7:16 ` [PATCH v8 10/14] perf/x86/lbr: don't share lbr for the vcpu usage case Wei Wang
                   ` (6 subsequent siblings)
  15 siblings, 0 replies; 20+ messages in thread
From: Wei Wang @ 2019-08-06  7:16 UTC (permalink / raw)
  To: linux-kernel, kvm, ak, peterz, pbonzini
  Cc: kan.liang, mingo, rkrcmar, like.xu, wei.w.wang, jannh,
	arei.gonglei, jmattson

VMX transitions are much more frequent than vcpu switches, and
saving/restoring tens of lbr msrs (e.g. 32 lbr stack entries) on every vmx
transition would add unnecessary overhead. So the vcpu's lbr state is only
saved/restored on vcpu context switches.

The main purposes of using the vcpu's lbr perf event are
- follow the host perf scheduling rules to manage the vcpu's usage of
  lbr (e.g. a cpu pinned lbr event could reclaim lbr and thus stop the
  vcpu's use);
- have the host perf do context switching of the lbr state when the vcpu
  thread is switched.
Please see the comments in intel_pmu_create_lbr_event for more details.

To achieve pure lbr emulation, the perf event is created only to claim
the lbr feature; no perf counter is needed for it.

The vcpu_lbr field is added to tell the host lbr driver that the lbr is
currently assigned to a vcpu. The guest driver inside the vcpu has its own
logic for using the lbr, so the host side lbr driver doesn't need to
enable or use the lbr feature in this case.

Some design choice considerations:
- Why use "is_kernel_event", instead of checking the PF_VCPU flag, to
  determine that this is a vcpu perf event for lbr emulation?
  Because PF_VCPU is set right before vm-entry into the guest and cleared
  after the guest vm-exits to the host, so the flag is not set while host
  code runs.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Co-developed-by: Like Xu <like.xu@intel.com>
Signed-off-by: Like Xu <like.xu@intel.com>
Signed-off-by: Wei Wang <wei.w.wang@intel.com>
---
 arch/x86/events/intel/lbr.c     | 38 ++++++++++++++++++++++--
 arch/x86/events/perf_event.h    |  1 +
 arch/x86/include/asm/kvm_host.h |  1 +
 arch/x86/kvm/vmx/pmu_intel.c    | 64 +++++++++++++++++++++++++++++++++++++++++
 include/linux/perf_event.h      |  7 +++++
 kernel/events/core.c            |  7 -----
 6 files changed, 108 insertions(+), 10 deletions(-)

diff --git a/arch/x86/events/intel/lbr.c b/arch/x86/events/intel/lbr.c
index 9b2d05c..4f4bd18 100644
--- a/arch/x86/events/intel/lbr.c
+++ b/arch/x86/events/intel/lbr.c
@@ -462,6 +462,14 @@ void intel_pmu_lbr_add(struct perf_event *event)
 	if (!x86_pmu.lbr_nr)
 		return;
 
+	/*
+	 * An lbr event without a counter indicates this is for the vcpu lbr
+	 * emulation, so set the vcpu_lbr flag when the vcpu lbr event
+	 * gets scheduled on the lbr here.
+	 */
+	if (is_no_counter_event(event))
+		cpuc->vcpu_lbr = 1;
+
 	cpuc->br_sel = event->hw.branch_reg.reg;
 
 	if (branch_user_callstack(cpuc->br_sel) && event->ctx->task_ctx_data) {
@@ -509,6 +517,14 @@ void intel_pmu_lbr_del(struct perf_event *event)
 		task_ctx->lbr_callstack_users--;
 	}
 
+	/*
+	 * An lbr event without a counter indicates this is for the vcpu lbr
+	 * emulation, so clear the vcpu_lbr flag when the vcpu's lbr event
+	 * gets scheduled out from the lbr.
+	 */
+	if (is_no_counter_event(event))
+		cpuc->vcpu_lbr = 0;
+
 	if (x86_pmu.intel_cap.pebs_baseline && event->attr.precise_ip > 0)
 		cpuc->lbr_pebs_users--;
 	cpuc->lbr_users--;
@@ -521,7 +537,12 @@ void intel_pmu_lbr_enable_all(bool pmi)
 {
 	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
 
-	if (cpuc->lbr_users)
+	/*
+	 * The vcpu lbr emulation doesn't need host to enable lbr at this
+	 * point, because the guest will set the enabling at a proper time
+	 * itself.
+	 */
+	if (cpuc->lbr_users && !cpuc->vcpu_lbr)
 		__intel_pmu_lbr_enable(pmi);
 }
 
@@ -529,7 +550,11 @@ void intel_pmu_lbr_disable_all(void)
 {
 	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
 
-	if (cpuc->lbr_users)
+	/*
+	 * Same as intel_pmu_lbr_enable_all, the guest is responsible for
+	 * clearing the enabling itself.
+	 */
+	if (cpuc->lbr_users && !cpuc->vcpu_lbr)
 		__intel_pmu_lbr_disable();
 }
 
@@ -668,8 +693,12 @@ void intel_pmu_lbr_read(void)
 	 *
 	 * This could be smarter and actually check the event,
 	 * but this simple approach seems to work for now.
+	 *
+	 * And no need to read the lbr msrs here if the vcpu lbr event
+	 * is using it, as the guest will read them itself.
 	 */
-	if (!cpuc->lbr_users || cpuc->lbr_users == cpuc->lbr_pebs_users)
+	if (!cpuc->lbr_users || cpuc->vcpu_lbr ||
+	    cpuc->lbr_users == cpuc->lbr_pebs_users)
 		return;
 
 	if (x86_pmu.intel_cap.lbr_format == LBR_FORMAT_32)
@@ -802,6 +831,9 @@ int intel_pmu_setup_lbr_filter(struct perf_event *event)
 	if (!x86_pmu.lbr_nr)
 		return -EOPNOTSUPP;
 
+	if (event->attr.exclude_host && is_kernel_event(event))
+		perf_event_set_no_counter(event);
+
 	/*
 	 * setup SW LBR filter
 	 */
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 27e4d32..8b90a25 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -220,6 +220,7 @@ struct cpu_hw_events {
 	/*
 	 * Intel LBR bits
 	 */
+	u8				vcpu_lbr;
 	int				lbr_users;
 	int				lbr_pebs_users;
 	struct perf_branch_stack	lbr_stack;
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index d29dddd..692a0c2 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -474,6 +474,7 @@ struct kvm_pmu {
 	struct kvm_pmc fixed_counters[INTEL_PMC_MAX_FIXED];
 	struct irq_work irq_work;
 	u64 reprogram_pmi;
+	struct perf_event *lbr_event;
 };
 
 struct kvm_pmu_ops;
diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index f0ad78f..89730f8 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -164,6 +164,70 @@ static bool intel_is_valid_msr(struct kvm_vcpu *vcpu, u32 msr)
 	return ret;
 }
 
+int intel_pmu_create_lbr_event(struct kvm_vcpu *vcpu)
+{
+	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+	struct perf_event *event;
+
+	/*
+	 * The perf event is created for the following purposes:
+	 * - have the host perf subsystem manage (prioritize) the guest's use
+	 *   of lbr with other host lbr events (if there are). The pinned field
+	 *   is set to true to make this event task pinned. If a cpu pinned
+	 *   lbr event reclaims lbr, the event->oncpu field will be set to -1.
+	 *   It will be checked at the moment before vm-entry, and the lbr
+	 *   feature will not be passed through to the guest for direct
+	 *   accesses if the vcpu's lbr event does not own the lbr feature
+	 *   anymore. This will cause the guest's lbr accesses to trap to the
+	 *   kvm's handler, where the accesses will be prevented in this case.
+	 * - have the host perf subsystem help save/restore the guest lbr stack
+	 *   on vcpu switching. Since the host perf only performs this
+	 *   save/restore for the user callstack mode lbr event, we configure
+	 *   the sample_type and branch_sample_type fields accordingly to make
+	 *   this a user callstack mode lbr event.
+	 *
+	 * This perf event is used for the emulation of the lbr feature, which
+	 * doesn't have a pmu counter. Accordingly, the related attr fields,
+	 * such as config and sample period, don't need to be set here.
+	 * exclude_host is set to tell the perf lbr driver that the event is for
+	 * the guest lbr emulation.
+	 */
+	struct perf_event_attr attr = {
+		.type = PERF_TYPE_RAW,
+		.size = sizeof(attr),
+		.pinned = true,
+		.exclude_host = true,
+		.sample_type = PERF_SAMPLE_BRANCH_STACK,
+		.branch_sample_type = PERF_SAMPLE_BRANCH_CALL_STACK |
+				      PERF_SAMPLE_BRANCH_USER,
+	};
+
+	if (pmu->lbr_event)
+		return 0;
+
+	event = perf_event_create_kernel_counter(&attr, -1, current, NULL,
+						 NULL);
+	if (IS_ERR(event)) {
+		pr_err("%s: failed %ld\n", __func__, PTR_ERR(event));
+		return -ENOENT;
+	}
+	pmu->lbr_event = event;
+
+	return 0;
+}
+
+void intel_pmu_free_lbr_event(struct kvm_vcpu *vcpu)
+{
+	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+	struct perf_event *event = pmu->lbr_event;
+
+	if (!event)
+		return;
+
+	perf_event_release_kernel(event);
+	pmu->lbr_event = NULL;
+}
+
 static bool intel_pmu_lbr_enable(struct kvm_vcpu *vcpu)
 {
 	struct kvm *kvm = vcpu->kvm;
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 2cae06a..eb76cdc 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -1050,6 +1050,13 @@ static inline void perf_event_set_no_counter(struct perf_event *event)
 	event->event_caps |= PERF_EV_CAP_NO_COUNTER;
 }
 
+#define TASK_TOMBSTONE ((void *)-1L)
+
+static inline bool is_kernel_event(struct perf_event *event)
+{
+	return READ_ONCE(event->owner) == TASK_TOMBSTONE;
+}
+
 /*
  * Return 1 for a software event, 0 for a hardware event
  */
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 7663f85..9aa987a 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -164,13 +164,6 @@ static void perf_ctx_unlock(struct perf_cpu_context *cpuctx,
 	raw_spin_unlock(&cpuctx->ctx.lock);
 }
 
-#define TASK_TOMBSTONE ((void *)-1L)
-
-static bool is_kernel_event(struct perf_event *event)
-{
-	return READ_ONCE(event->owner) == TASK_TOMBSTONE;
-}
-
 /*
  * On task ctx scheduling...
  *
-- 
2.7.4



* [PATCH v8 10/14] perf/x86/lbr: don't share lbr for the vcpu usage case
  2019-08-06  7:16 [PATCH v8 00/14] Guest LBR Enabling Wei Wang
                   ` (8 preceding siblings ...)
  2019-08-06  7:16 ` [PATCH v8 09/14] KVM/x86/vPMU: APIs to create/free lbr perf event for a vcpu thread Wei Wang
@ 2019-08-06  7:16 ` Wei Wang
  2019-08-06  7:16 ` [PATCH v8 11/14] perf/x86: save/restore LBR_SELECT on vcpu switching Wei Wang
                   ` (5 subsequent siblings)
  15 siblings, 0 replies; 20+ messages in thread
From: Wei Wang @ 2019-08-06  7:16 UTC (permalink / raw)
  To: linux-kernel, kvm, ak, peterz, pbonzini
  Cc: kan.liang, mingo, rkrcmar, like.xu, wei.w.wang, jannh,
	arei.gonglei, jmattson

Perf event scheduling lets multiple lbr events share the lbr if they
use the same config for LBR_SELECT. For the vcpu case, the vcpu's lbr
event created on the host deliberately sets the config to the user
callstack mode, so that the host perf saves/restores the lbr state on
vcpu context switching. That config is never written to LBR_SELECT,
because LBR_SELECT is configured by the guest and might not match the
user callstack mode. So don't allow the vcpu's lbr event to share the
lbr with other host lbr events.
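
To illustrate the rule this relies on: the shared-regs constraint code
(__intel_shared_reg_get_constraints) only lets two lbr events co-exist
when their reg->config values match; otherwise the lower priority event
gets the emptyconstraint. A simplified sketch of that decision follows
(not the actual kernel code, which also handles reference counting and
locking; lbr_can_share is a made-up name):

static bool lbr_can_share(struct er_account *era,
			  struct hw_perf_event_extra *reg)
{
	/* the lbr is free, or already programmed with an identical config */
	return !atomic_read(&era->ref) || era->config == reg->config;
}

Setting the reserved LBR_VCPU bit in the vcpu event's reg->config thus
guarantees a mismatch with any regular host lbr event.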

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
---
 arch/x86/events/intel/lbr.c | 27 +++++++++++++++++++++++++++
 1 file changed, 27 insertions(+)

diff --git a/arch/x86/events/intel/lbr.c b/arch/x86/events/intel/lbr.c
index 4f4bd18..a0f3686 100644
--- a/arch/x86/events/intel/lbr.c
+++ b/arch/x86/events/intel/lbr.c
@@ -45,6 +45,12 @@ static const enum {
 #define LBR_CALL_STACK_BIT	9 /* enable call stack */
 
 /*
+ * Set this hardware reserved bit if the lbr perf event is for the vcpu lbr
+ * emulation. This makes the reg->config different from other regular lbr
+ * events' config, so that they will not share the lbr feature.
+ */
+#define LBR_VCPU_BIT		62
+/*
  * Following bit only exists in Linux; we mask it out before writing it to
  * the actual MSR. But it helps the constraint perf code to understand
  * that this is a separate configuration.
@@ -62,6 +68,7 @@ static const enum {
 #define LBR_FAR		(1 << LBR_FAR_BIT)
 #define LBR_CALL_STACK	(1 << LBR_CALL_STACK_BIT)
 #define LBR_NO_INFO	(1ULL << LBR_NO_INFO_BIT)
+#define LBR_VCPU	(1ULL << LBR_VCPU_BIT)
 
 #define LBR_PLM (LBR_KERNEL | LBR_USER)
 
@@ -818,6 +825,26 @@ static int intel_pmu_setup_hw_lbr_filter(struct perf_event *event)
 	    (x86_pmu.intel_cap.lbr_format == LBR_FORMAT_INFO))
 		reg->config |= LBR_NO_INFO;
 
+	/*
+	 * An lbr perf event without a counter indicates this is for the vcpu
+	 * lbr emulation. The vcpu lbr emulation does not allow the lbr
+	 * feature to be shared with other lbr events on the host, because the
+	 * LBR_SELECT msr is configured by the guest itself. The reg->config
+	 * is deliberately configured to be user call stack mode via the
+	 * related attr fields to get the host perf's help to save/restore the
+	 * lbr state on vcpu context switching. It doesn't represent what
+	 * LBR_SELECT will actually be configured to.
+	 *
+	 * Set the reserved LBR_VCPU bit for the vcpu usage case, so that the
+	 * vcpu's lbr perf event will not share the lbr feature with other perf
+	 * events. (see __intel_shared_reg_get_constraints, failing to share
+	 * makes it return the emptyconstraint, which finally makes
+	 * x86_schedule_events fail to schedule the lower priority lbr event on
+	 * the lbr feature).
+	 */
+	if (is_no_counter_event(event))
+		reg->config |= LBR_VCPU;
+
 	return 0;
 }
 
-- 
2.7.4



* [PATCH v8 11/14] perf/x86: save/restore LBR_SELECT on vcpu switching
  2019-08-06  7:16 [PATCH v8 00/14] Guest LBR Enabling Wei Wang
                   ` (9 preceding siblings ...)
  2019-08-06  7:16 ` [PATCH v8 10/14] perf/x86/lbr: don't share lbr for the vcpu usage case Wei Wang
@ 2019-08-06  7:16 ` Wei Wang
  2019-08-06  7:16 ` [PATCH v8 12/14] KVM/x86/lbr: lbr emulation Wei Wang
                   ` (4 subsequent siblings)
  15 siblings, 0 replies; 20+ messages in thread
From: Wei Wang @ 2019-08-06  7:16 UTC (permalink / raw)
  To: linux-kernel, kvm, ak, peterz, pbonzini
  Cc: kan.liang, mingo, rkrcmar, like.xu, wei.w.wang, jannh,
	arei.gonglei, jmattson

The regular host lbr perf event doesn't save/restore the LBR_SELECT msr
during a thread context switch, because the LBR_SELECT value is
generated from attr.branch_sample_type and already stored in
event->hw.branch_reg (please see intel_pmu_setup_hw_lbr_filter), which
doesn't get lost across thread context switches.

The attr.branch_sample_type for the vcpu lbr event is deliberately set
to the user call stack mode to enable the perf core to save/restore the
lbr related msrs on vcpu switching. So the attr.branch_sample_type
essentially doesn't represent what the guest pmu driver will write to
LBR_SELECT. Meanwhile, the host lbr driver doesn't configure the lbr msrs,
including the LBR_SELECT msr, for the vcpu thread case, as the pmu driver
inside the vcpu will do that.

So for the vcpu case, add the LBR_SELECT save/restore to ensure what the
guest writes to the LBR_SELECT msr doesn't get lost during the vcpu context
switching.

Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Kan Liang <kan.liang@intel.com>
Signed-off-by: Wei Wang <wei.w.wang@intel.com>
---
 arch/x86/events/intel/lbr.c  | 7 +++++++
 arch/x86/events/perf_event.h | 1 +
 2 files changed, 8 insertions(+)

diff --git a/arch/x86/events/intel/lbr.c b/arch/x86/events/intel/lbr.c
index a0f3686..236f8352 100644
--- a/arch/x86/events/intel/lbr.c
+++ b/arch/x86/events/intel/lbr.c
@@ -390,6 +390,9 @@ static void __intel_pmu_lbr_restore(struct x86_perf_task_context *task_ctx)
 
 	wrmsrl(x86_pmu.lbr_tos, tos);
 	task_ctx->lbr_stack_state = LBR_NONE;
+
+	if (cpuc->vcpu_lbr)
+		wrmsrl(MSR_LBR_SELECT, task_ctx->lbr_sel);
 }
 
 static void __intel_pmu_lbr_save(struct x86_perf_task_context *task_ctx)
@@ -416,6 +419,10 @@ static void __intel_pmu_lbr_save(struct x86_perf_task_context *task_ctx)
 		if (x86_pmu.intel_cap.lbr_format == LBR_FORMAT_INFO)
 			rdmsrl(MSR_LBR_INFO_0 + lbr_idx, task_ctx->lbr_info[i]);
 	}
+
+	if (cpuc->vcpu_lbr)
+		rdmsrl(MSR_LBR_SELECT, task_ctx->lbr_sel);
+
 	task_ctx->valid_lbrs = i;
 	task_ctx->tos = tos;
 	task_ctx->lbr_stack_state = LBR_VALID;
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 8b90a25..0b2f660 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -699,6 +699,7 @@ struct x86_perf_task_context {
 	u64 lbr_from[MAX_LBR_ENTRIES];
 	u64 lbr_to[MAX_LBR_ENTRIES];
 	u64 lbr_info[MAX_LBR_ENTRIES];
+	u64 lbr_sel;
 	int tos;
 	int valid_lbrs;
 	int lbr_callstack_users;
-- 
2.7.4



* [PATCH v8 12/14] KVM/x86/lbr: lbr emulation
  2019-08-06  7:16 [PATCH v8 00/14] Guest LBR Enabling Wei Wang
                   ` (10 preceding siblings ...)
  2019-08-06  7:16 ` [PATCH v8 11/14] perf/x86: save/restore LBR_SELECT on vcpu switching Wei Wang
@ 2019-08-06  7:16 ` Wei Wang
  2019-12-10 23:37   ` Sean Christopherson
  2019-08-06  7:16 ` [PATCH v8 13/14] KVM/x86/vPMU: check the lbr feature before entering guest Wei Wang
                   ` (3 subsequent siblings)
  15 siblings, 1 reply; 20+ messages in thread
From: Wei Wang @ 2019-08-06  7:16 UTC (permalink / raw)
  To: linux-kernel, kvm, ak, peterz, pbonzini
  Cc: kan.liang, mingo, rkrcmar, like.xu, wei.w.wang, jannh,
	arei.gonglei, jmattson

In general, the lbr emulation works in this way:
The guest's first access (since the vcpu was scheduled in) to an lbr
related msr gets trapped to kvm, and the handler does the following:
  - create an lbr perf event to have the vcpu get the lbr feature
    from host perf following the perf scheduling rules;
  - pass the lbr related msrs through to the guest for direct accesses
    without vm-exits till the end of this vcpu time slice.

The guest's first access is made interceptible so that the kvm side lbr
emulation can always tell whether the lbr feature has been used during
the vcpu time slice. If the lbr feature isn't used during a time slice,
the lbr event created for the vcpu will be freed.

Some considerations:
- Why not free the vcpu lbr event when the guest clears the lbr enable bit?
The guest may frequently clear the lbr enable bit (in the debugctl msr)
during its use of the lbr feature, e.g. in its PMI handler. This would
cause the kvm emulation to frequently alloc/free the vcpu lbr event, which
is unnecessary. Ideally, we want to free the vcpu lbr event when the guest
doesn't need to run lbr anymore. Heuristically, we free the vcpu lbr event
when the guest doesn't touch any of the lbr msrs during an entire vcpu
time slice.
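
As a rough sketch of the resulting guest-visible behaviour (illustrative
only, not part of the patch; the LBR_SELECT value below is made up), only
the first lbr msr access in a vcpu time slice pays a vm-exit:

	u64 tos;

	/* in the guest kernel, some time after the vcpu is scheduled in */
	wrmsrl(MSR_IA32_DEBUGCTLMSR, DEBUGCTLMSR_LBR);	/* traps to kvm once:
							 * the handler creates
							 * the lbr event and
							 * opens the pass-through
							 */
	wrmsrl(MSR_LBR_SELECT, 0x1);	/* direct access, no vm-exit */
	rdmsrl(MSR_LBR_TOS, tos);	/* direct access, no vm-exit */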

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Suggested-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Wei Wang <wei.w.wang@intel.com>
---
 arch/x86/include/asm/kvm_host.h |   2 +
 arch/x86/kvm/pmu.c              |   6 ++
 arch/x86/kvm/pmu.h              |   2 +
 arch/x86/kvm/vmx/pmu_intel.c    | 206 ++++++++++++++++++++++++++++++++++++++++
 arch/x86/kvm/vmx/vmx.c          |   4 +-
 arch/x86/kvm/vmx/vmx.h          |   2 +
 arch/x86/kvm/x86.c              |   2 +
 7 files changed, 222 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 692a0c2..ecd22b5 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -469,6 +469,8 @@ struct kvm_pmu {
 	u64 global_ctrl_mask;
 	u64 global_ovf_ctrl_mask;
 	u64 reserved_bits;
+	/* Indicate if the lbr msrs were accessed in this vcpu time slice */
+	bool lbr_used;
 	u8 version;
 	struct kvm_pmc gp_counters[INTEL_PMC_MAX_GENERIC];
 	struct kvm_pmc fixed_counters[INTEL_PMC_MAX_FIXED];
diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index 1a291ed..afad092 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -360,6 +360,12 @@ int kvm_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 	return kvm_x86_ops->pmu_ops->set_msr(vcpu, msr_info);
 }
 
+void kvm_pmu_sched_in(struct kvm_vcpu *vcpu, int cpu)
+{
+	if (kvm_x86_ops->pmu_ops->sched_in)
+		kvm_x86_ops->pmu_ops->sched_in(vcpu, cpu);
+}
+
 /* refresh PMU settings. This function generally is called when underlying
  * settings are changed (such as changes of PMU CPUID by guest VMs), which
  * should rarely happen.
diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
index f61024e..f875721 100644
--- a/arch/x86/kvm/pmu.h
+++ b/arch/x86/kvm/pmu.h
@@ -32,6 +32,7 @@ struct kvm_pmu_ops {
 	bool (*lbr_enable)(struct kvm_vcpu *vcpu);
 	int (*get_msr)(struct kvm_vcpu *vcpu, struct msr_data *msr_info);
 	int (*set_msr)(struct kvm_vcpu *vcpu, struct msr_data *msr_info);
+	void (*sched_in)(struct kvm_vcpu *vcpu, int cpu);
 	void (*refresh)(struct kvm_vcpu *vcpu);
 	void (*init)(struct kvm_vcpu *vcpu);
 	void (*reset)(struct kvm_vcpu *vcpu);
@@ -116,6 +117,7 @@ int kvm_pmu_is_valid_msr_idx(struct kvm_vcpu *vcpu, unsigned idx);
 bool kvm_pmu_is_valid_msr(struct kvm_vcpu *vcpu, u32 msr);
 int kvm_pmu_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info);
 int kvm_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info);
+void kvm_pmu_sched_in(struct kvm_vcpu *vcpu, int cpu);
 void kvm_pmu_refresh(struct kvm_vcpu *vcpu);
 void kvm_pmu_reset(struct kvm_vcpu *vcpu);
 void kvm_pmu_init(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index 89730f8..5580f1a 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -17,6 +17,7 @@
 #include "cpuid.h"
 #include "lapic.h"
 #include "pmu.h"
+#include "vmx.h"
 
 static struct kvm_event_hw_type_mapping intel_arch_events[] = {
 	/* Index must match CPUID 0x0A.EBX bit vector */
@@ -141,6 +142,19 @@ static struct kvm_pmc *intel_msr_idx_to_pmc(struct kvm_vcpu *vcpu,
 	return &counters[idx];
 }
 
+/* Return true if it is one of the lbr related msrs. */
+static inline bool is_lbr_msr(struct kvm_vcpu *vcpu, u32 index)
+{
+	struct x86_perf_lbr_stack *stack = &vcpu->kvm->arch.lbr_stack;
+	int nr = stack->nr;
+
+	return !!(index == MSR_LBR_SELECT ||
+		  index == stack->tos ||
+		  (index >= stack->from && index < stack->from + nr) ||
+		  (index >= stack->to && index < stack->to + nr) ||
+		  (index >= stack->info && index < stack->info + nr));
+}
+
 static bool intel_is_valid_msr(struct kvm_vcpu *vcpu, u32 msr)
 {
 	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
@@ -152,9 +166,12 @@ static bool intel_is_valid_msr(struct kvm_vcpu *vcpu, u32 msr)
 	case MSR_CORE_PERF_GLOBAL_CTRL:
 	case MSR_CORE_PERF_GLOBAL_OVF_CTRL:
 	case MSR_IA32_PERF_CAPABILITIES:
+	case MSR_IA32_DEBUGCTLMSR:
 		ret = pmu->version > 1;
 		break;
 	default:
+		if (is_lbr_msr(vcpu, msr))
+			return pmu->version > 1;
 		ret = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0) ||
 			get_gp_pmc(pmu, msr, MSR_P6_EVNTSEL0) ||
 			get_fixed_pmc(pmu, msr);
@@ -362,6 +379,163 @@ static bool intel_pmu_lbr_enable(struct kvm_vcpu *vcpu)
 	return true;
 }
 
+/*
+ * "set = 1" to make the lbr msrs interceptible, otherwise pass the lbr msrs
+ * through to the guest.
+ */
+static void intel_pmu_set_intercept_for_lbr_msrs(struct kvm_vcpu *vcpu,
+						 bool set)
+{
+	unsigned long *msr_bitmap = to_vmx(vcpu)->vmcs01.msr_bitmap;
+	struct x86_perf_lbr_stack *stack = &vcpu->kvm->arch.lbr_stack;
+	int nr = stack->nr;
+	int i;
+
+	vmx_set_intercept_for_msr(msr_bitmap, MSR_LBR_SELECT,
+				  MSR_TYPE_RW, set);
+	vmx_set_intercept_for_msr(msr_bitmap, stack->tos,
+				  MSR_TYPE_RW, set);
+	for (i = 0; i < nr; i++) {
+		vmx_set_intercept_for_msr(msr_bitmap, stack->from + i,
+					  MSR_TYPE_RW, set);
+		vmx_set_intercept_for_msr(msr_bitmap, stack->to + i,
+					  MSR_TYPE_RW, set);
+		if (stack->info)
+			vmx_set_intercept_for_msr(msr_bitmap, stack->info + i,
+						  MSR_TYPE_RW, set);
+	}
+}
+
+static bool intel_pmu_set_lbr_msr(struct kvm_vcpu *vcpu,
+				  struct msr_data *msr_info)
+{
+	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+	u32 index = msr_info->index;
+	u64 data = msr_info->data;
+	bool ret = false;
+
+	/* The lbr event should have been allocated when reaching here. */
+	if (WARN_ON(!pmu->lbr_event))
+		return ret;
+
+	/*
+	 * Host perf could reclaim the lbr feature via ipi calls, and this can
+	 * be detected via lbr_event->oncpu being set to -1. To ensure the
+	 * writes to the lbr msrs don't happen after the lbr feature has been
+	 * reclaimed by the host, the interrupt is disabled before performing
+	 * the writes.
+	 */
+	local_irq_disable();
+	if (pmu->lbr_event->oncpu == -1)
+		goto out;
+
+	switch (index) {
+	case MSR_IA32_DEBUGCTLMSR:
+		ret = true;
+		/*
+		 * Currently, only FREEZE_LBRS_ON_PMI and DEBUGCTLMSR_LBR are
+		 * supported.
+		 */
+		data &= (DEBUGCTLMSR_FREEZE_LBRS_ON_PMI | DEBUGCTLMSR_LBR);
+		vmcs_write64(GUEST_IA32_DEBUGCTL, data);
+		break;
+	default:
+		if (is_lbr_msr(vcpu, index)) {
+			ret = true;
+			wrmsrl(index, data);
+		}
+	}
+
+out:
+	local_irq_enable();
+	return ret;
+}
+
+static bool intel_pmu_get_lbr_msr(struct kvm_vcpu *vcpu,
+				  struct msr_data *msr_info)
+{
+	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+	u32 index = msr_info->index;
+	bool ret = false;
+
+	/* The lbr event should have been allocated when reaching here. */
+	if (WARN_ON(!pmu->lbr_event))
+		return ret;
+
+	/*
+	 * Disable irq to ensure the lbr feature doesn't get reclaimed by the
+	 * host at the time the value is read from the msr; this prevents the
+	 * host lbr value from being leaked to the guest. If lbr has been
+	 * reclaimed, return 0 on guest reads.
+	 */
+	local_irq_disable();
+	if (pmu->lbr_event->oncpu == -1) {
+		msr_info->data = 0;
+		goto out;
+	}
+
+	switch (index) {
+	case MSR_IA32_DEBUGCTLMSR:
+		msr_info->data = vmcs_read64(GUEST_IA32_DEBUGCTL);
+		ret = true;
+		break;
+	default:
+		if (is_lbr_msr(vcpu, index)) {
+			ret = true;
+			rdmsrl(index, msr_info->data);
+		}
+	}
+
+out:
+	local_irq_enable();
+	return ret;
+}
+
+static bool intel_pmu_access_lbr_msr(struct kvm_vcpu *vcpu,
+				     struct msr_data *msr_info,
+				     bool set)
+{
+	u32 index = msr_info->index;
+	bool ret = false;
+
+	/* Return false if the msr access has nothing to do with lbr. */
+	if ((index != MSR_IA32_DEBUGCTLMSR) && !is_lbr_msr(vcpu, index))
+		return false;
+
+	/*
+	 * Guest initiated access is allowed when userspace has explicitly
+	 * enabled lbr.
+	 */
+	if (!msr_info->host_initiated && !vcpu->kvm->arch.lbr_in_guest)
+		return false;
+
+	if (intel_pmu_create_lbr_event(vcpu))
+		return false;
+
+	if (set)
+		ret = intel_pmu_set_lbr_msr(vcpu, msr_info);
+	else
+		ret = intel_pmu_get_lbr_msr(vcpu, msr_info);
+
+	/*
+	 * If this is the guest's first access to an lbr related msr, pass
+	 * the lbr related msrs through to the guest for direct accesses.
+	 * It is possible in theory that a cpu pinned lbr event could take
+	 * over the lbr feature the moment after the pass-through is set
+	 * up via intel_pmu_set_intercept_for_lbr_msrs below. This is fine,
+	 * because it will be double checked right before vm-entry to
+	 * ensure the lbr msrs are only passed through while the lbr is
+	 * still owned by this vcpu's lbr event (see vcpu_enter_guest for
+	 * more details).
+	 */
+	if (ret && !vcpu->arch.pmu.lbr_used) {
+		vcpu->arch.pmu.lbr_used = true;
+		intel_pmu_set_intercept_for_lbr_msrs(vcpu, false);
+	}
+
+	return ret;
+}
+
 static int intel_pmu_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 {
 	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
@@ -408,6 +582,8 @@ static int intel_pmu_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 		} else if ((pmc = get_gp_pmc(pmu, msr, MSR_P6_EVNTSEL0))) {
 			msr_info->data = pmc->eventsel;
 			return 0;
+		} else if (intel_pmu_access_lbr_msr(vcpu, msr_info, false)) {
+			return 0;
 		}
 	}
 
@@ -471,12 +647,39 @@ static int intel_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 				reprogram_gp_counter(pmc, data);
 				return 0;
 			}
+		} else if (intel_pmu_access_lbr_msr(vcpu, msr_info, true)) {
+			return 0;
 		}
 	}
 
 	return 1;
 }
 
+static void intel_pmu_sched_in(struct kvm_vcpu *vcpu, int cpu)
+{
+	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+	u64 guest_debugctl;
+
+	/*
+	 * The lbr feature was used in last vcpu time slice, so set the lbr
+	 * msrs interceptible so that we can capture whether it's used again
+	 * in this time slice.
+	 */
+	if (pmu->lbr_used) {
+		pmu->lbr_used = false;
+		intel_pmu_set_intercept_for_lbr_msrs(vcpu, true);
+	} else if (pmu->lbr_event) {
+		/*
+		 * The lbr feature wasn't used during last vcpu time slice
+		 * and the vcpu lbr event hasn't been freed, so it's time to
+		 * free the lbr event.
+		 */
+		guest_debugctl = vmcs_read64(GUEST_IA32_DEBUGCTL);
+		if (!(guest_debugctl & DEBUGCTLMSR_LBR))
+			intel_pmu_free_lbr_event(vcpu);
+	}
+}
+
 static void intel_pmu_refresh(struct kvm_vcpu *vcpu)
 {
 	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
@@ -574,6 +777,8 @@ static void intel_pmu_reset(struct kvm_vcpu *vcpu)
 
 	pmu->fixed_ctr_ctrl = pmu->global_ctrl = pmu->global_status =
 		pmu->global_ovf_ctrl = 0;
+
+	intel_pmu_free_lbr_event(vcpu);
 }
 
 struct kvm_pmu_ops intel_pmu_ops = {
@@ -587,6 +792,7 @@ struct kvm_pmu_ops intel_pmu_ops = {
 	.lbr_enable = intel_pmu_lbr_enable,
 	.get_msr = intel_pmu_get_msr,
 	.set_msr = intel_pmu_set_msr,
+	.sched_in = intel_pmu_sched_in,
 	.refresh = intel_pmu_refresh,
 	.init = intel_pmu_init,
 	.reset = intel_pmu_reset,
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 074385c..af1a1f9 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -3557,8 +3557,8 @@ static __always_inline void vmx_enable_intercept_for_msr(unsigned long *msr_bitm
 	}
 }
 
-static __always_inline void vmx_set_intercept_for_msr(unsigned long *msr_bitmap,
-			     			      u32 msr, int type, bool value)
+void vmx_set_intercept_for_msr(unsigned long *msr_bitmap, u32 msr, int type,
+			       bool value)
 {
 	if (value)
 		vmx_enable_intercept_for_msr(msr_bitmap, msr, type);
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index 82d0bc3..85acaf0 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -328,6 +328,8 @@ void vmx_update_msr_bitmap(struct kvm_vcpu *vcpu);
 bool vmx_get_nmi_mask(struct kvm_vcpu *vcpu);
 void vmx_set_nmi_mask(struct kvm_vcpu *vcpu, bool masked);
 void vmx_set_virtual_apic_mode(struct kvm_vcpu *vcpu);
+void vmx_set_intercept_for_msr(unsigned long *msr_bitmap, u32 msr, int type,
+			       bool value);
 struct shared_msr_entry *find_msr_entry(struct vcpu_vmx *vmx, u32 msr);
 void pt_update_intercept_for_msr(struct vcpu_vmx *vmx);
 void vmx_update_host_rsp(struct vcpu_vmx *vmx, unsigned long host_rsp);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 7a7cd93..b76f019 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9294,6 +9294,8 @@ void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu)
 void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu)
 {
 	vcpu->arch.l1tf_flush_l1d = true;
+
+	kvm_pmu_sched_in(vcpu, cpu);
 	kvm_x86_ops->sched_in(vcpu, cpu);
 }
 
-- 
2.7.4



* [PATCH v8 13/14] KVM/x86/vPMU: check the lbr feature before entering guest
  2019-08-06  7:16 [PATCH v8 00/14] Guest LBR Enabling Wei Wang
                   ` (11 preceding siblings ...)
  2019-08-06  7:16 ` [PATCH v8 12/14] KVM/x86/lbr: lbr emulation Wei Wang
@ 2019-08-06  7:16 ` Wei Wang
  2019-08-07  6:02   ` Wei Wang
  2019-08-06  7:16 ` [PATCH v8 14/14] KVM/x86: remove the common handling of the debugctl msr Wei Wang
                   ` (2 subsequent siblings)
  15 siblings, 1 reply; 20+ messages in thread
From: Wei Wang @ 2019-08-06  7:16 UTC (permalink / raw)
  To: linux-kernel, kvm, ak, peterz, pbonzini
  Cc: kan.liang, mingo, rkrcmar, like.xu, wei.w.wang, jannh,
	arei.gonglei, jmattson

The guest can access the lbr related msrs only when the vcpu's lbr event
has been assigned the lbr feature. A cpu pinned lbr event (though there is
no such event usage in the current upstream kernel) could reclaim the lbr
feature from the vcpu's lbr event (task pinned) via ipi calls. If the cpu
is running in non-root mode, this causes the cpu to vm-exit to handle the
host ipi and then vm-enter back to the guest. So on vm-entry (where
interrupts have been disabled), double check that the vcpu's lbr event is
still assigned the lbr feature by checking event->oncpu.

The pass-through of the lbr related msrs will be cancelled if the lbr is
reclaimed, and subsequent guest accesses to the lbr related msrs will
vm-exit to the related msr emulation handler in kvm, which will prevent
the accesses.
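
For reference, the kind of cpu pinned lbr event described above could be
created from host userspace roughly as below (an illustrative sketch; the
attribute values are just an example):

#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * A cpu pinned event sampling branch stacks (i.e. using the lbr) on cpu 0.
 * Being cpu pinned, host perf gives it priority over task pinned events
 * such as the vcpu's lbr event.
 */
static int open_cpu_pinned_lbr_event(void)
{
	struct perf_event_attr attr = {
		.type = PERF_TYPE_HARDWARE,
		.config = PERF_COUNT_HW_CPU_CYCLES,
		.size = sizeof(attr),
		.sample_period = 100000,
		.sample_type = PERF_SAMPLE_BRANCH_STACK,
		.branch_sample_type = PERF_SAMPLE_BRANCH_ANY,
		.pinned = 1,
	};

	/* pid = -1, cpu = 0: count on cpu 0 regardless of which task runs */
	return syscall(__NR_perf_event_open, &attr, -1, 0, -1, 0);
}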

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
---
 arch/x86/kvm/pmu.c           |  6 ++++++
 arch/x86/kvm/pmu.h           |  3 +++
 arch/x86/kvm/vmx/pmu_intel.c | 35 +++++++++++++++++++++++++++++++++++
 arch/x86/kvm/x86.c           | 13 +++++++++++++
 4 files changed, 57 insertions(+)

diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index afad092..ed10a57 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -339,6 +339,12 @@ bool kvm_pmu_lbr_enable(struct kvm_vcpu *vcpu)
 	return false;
 }
 
+void kvm_pmu_enabled_feature_confirm(struct kvm_vcpu *vcpu)
+{
+	if (kvm_x86_ops->pmu_ops->enabled_feature_confirm)
+		kvm_x86_ops->pmu_ops->enabled_feature_confirm(vcpu);
+}
+
 void kvm_pmu_deliver_pmi(struct kvm_vcpu *vcpu)
 {
 	if (lapic_in_kernel(vcpu))
diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
index f875721..7467907 100644
--- a/arch/x86/kvm/pmu.h
+++ b/arch/x86/kvm/pmu.h
@@ -30,6 +30,7 @@ struct kvm_pmu_ops {
 	int (*is_valid_msr_idx)(struct kvm_vcpu *vcpu, unsigned idx);
 	bool (*is_valid_msr)(struct kvm_vcpu *vcpu, u32 msr);
 	bool (*lbr_enable)(struct kvm_vcpu *vcpu);
+	void (*enabled_feature_confirm)(struct kvm_vcpu *vcpu);
 	int (*get_msr)(struct kvm_vcpu *vcpu, struct msr_data *msr_info);
 	int (*set_msr)(struct kvm_vcpu *vcpu, struct msr_data *msr_info);
 	void (*sched_in)(struct kvm_vcpu *vcpu, int cpu);
@@ -126,6 +127,8 @@ int kvm_vm_ioctl_set_pmu_event_filter(struct kvm *kvm, void __user *argp);
 
 bool is_vmware_backdoor_pmc(u32 pmc_idx);
 
+void kvm_pmu_enabled_feature_confirm(struct kvm_vcpu *vcpu);
+
 extern struct kvm_pmu_ops intel_pmu_ops;
 extern struct kvm_pmu_ops amd_pmu_ops;
 #endif /* __KVM_X86_PMU_H */
diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index 5580f1a..421051aa 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -781,6 +781,40 @@ static void intel_pmu_reset(struct kvm_vcpu *vcpu)
 	intel_pmu_free_lbr_event(vcpu);
 }
 
+void intel_pmu_lbr_confirm(struct kvm_vcpu *vcpu)
+{
+	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+
+	/*
+	 * Either lbr_event being NULL or lbr_used being false indicates that
+	 * the lbr msrs haven't been passed through to the guest, so no need
+	 * to cancel passthrough.
+	 */
+	if (!pmu->lbr_event || !pmu->lbr_used)
+		return;
+
+	/*
+	 * The lbr feature gets reclaimed via IPI calls, so checking of
+	 * lbr_event->oncpu needs to be in an atomic context. Just confirm
+	 * that irq has been disabled already.
+	 */
+	lockdep_assert_irqs_disabled();
+
+	/*
+	 * Cancel the pass-through of the lbr msrs if lbr has been reclaimed
+	 * by the host perf.
+	 */
+	if (pmu->lbr_event->oncpu != -1) {
+		pmu->lbr_used = false;
+		intel_pmu_set_intercept_for_lbr_msrs(vcpu, true);
+	}
+}
+
+void intel_pmu_enabled_feature_confirm(struct kvm_vcpu *vcpu)
+{
+	intel_pmu_lbr_confirm(vcpu);
+}
+
 struct kvm_pmu_ops intel_pmu_ops = {
 	.find_arch_event = intel_find_arch_event,
 	.find_fixed_event = intel_find_fixed_event,
@@ -790,6 +824,7 @@ struct kvm_pmu_ops intel_pmu_ops = {
 	.is_valid_msr_idx = intel_is_valid_msr_idx,
 	.is_valid_msr = intel_is_valid_msr,
 	.lbr_enable = intel_pmu_lbr_enable,
+	.enabled_feature_confirm = intel_pmu_enabled_feature_confirm,
 	.get_msr = intel_pmu_get_msr,
 	.set_msr = intel_pmu_set_msr,
 	.sched_in = intel_pmu_sched_in,
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index b76f019..efaf0e8 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -7985,6 +7985,19 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 	smp_mb__after_srcu_read_unlock();
 
 	/*
+	 * Higher priority host perf events (e.g. cpu pinned) could reclaim the
+	 * pmu resources (e.g. lbr) that were assigned to the vcpu. This is
+	 * usually done via ipi calls (see perf_install_in_context for
+	 * details).
+	 *
+	 * Before entering the non-root mode (with irq disabled here), double
+	 * confirm that the pmu features enabled to the guest are not reclaimed
+	 * by higher priority host events. Otherwise, disallow vcpu's access to
+	 * the reclaimed features.
+	 */
+	kvm_pmu_enabled_feature_confirm(vcpu);
+
+	/*
 	 * This handles the case where a posted interrupt was
 	 * notified with kvm_vcpu_kick.
 	 */
-- 
2.7.4



* [PATCH v8 14/14] KVM/x86: remove the common handling of the debugctl msr
  2019-08-06  7:16 [PATCH v8 00/14] Guest LBR Enabling Wei Wang
                   ` (12 preceding siblings ...)
  2019-08-06  7:16 ` [PATCH v8 13/14] KVM/x86/vPMU: check the lbr feature before entering guest Wei Wang
@ 2019-08-06  7:16 ` Wei Wang
  2019-09-06  8:50 ` [PATCH v8 00/14] Guest LBR Enabling Wang, Wei W
  2020-01-30 20:14 ` Eduardo Habkost
  15 siblings, 0 replies; 20+ messages in thread
From: Wei Wang @ 2019-08-06  7:16 UTC (permalink / raw)
  To: linux-kernel, kvm, ak, peterz, pbonzini
  Cc: kan.liang, mingo, rkrcmar, like.xu, wei.w.wang, jannh,
	arei.gonglei, jmattson

The debugctl msr is not completely identical on AMD and Intel CPUs; for
example, FREEZE_LBRS_ON_PMI is supported by Intel CPUs only. This msr is
now handled separately in svm.c and intel_pmu.c, so remove the common
debugctl msr handling code from kvm_get/set_msr_common.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Wei Wang <wei.w.wang@intel.com>
---
 arch/x86/kvm/x86.c | 13 -------------
 1 file changed, 13 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index efaf0e8..3839ebd 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2528,18 +2528,6 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 			return 1;
 		}
 		break;
-	case MSR_IA32_DEBUGCTLMSR:
-		if (!data) {
-			/* We support the non-activated case already */
-			break;
-		} else if (data & ~(DEBUGCTLMSR_LBR | DEBUGCTLMSR_BTF)) {
-			/* Values other than LBR and BTF are vendor-specific,
-			   thus reserved and should throw a #GP */
-			return 1;
-		}
-		vcpu_unimpl(vcpu, "%s: MSR_IA32_DEBUGCTLMSR 0x%llx, nop\n",
-			    __func__, data);
-		break;
 	case 0x200 ... 0x2ff:
 		return kvm_mtrr_set_msr(vcpu, msr, data);
 	case MSR_IA32_APICBASE:
@@ -2800,7 +2788,6 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 	switch (msr_info->index) {
 	case MSR_IA32_PLATFORM_ID:
 	case MSR_IA32_EBL_CR_POWERON:
-	case MSR_IA32_DEBUGCTLMSR:
 	case MSR_IA32_LASTBRANCHFROMIP:
 	case MSR_IA32_LASTBRANCHTOIP:
 	case MSR_IA32_LASTINTFROMIP:
-- 
2.7.4



* Re: [PATCH v8 13/14] KVM/x86/vPMU: check the lbr feature before entering guest
  2019-08-06  7:16 ` [PATCH v8 13/14] KVM/x86/vPMU: check the lbr feature before entering guest Wei Wang
@ 2019-08-07  6:02   ` Wei Wang
  0 siblings, 0 replies; 20+ messages in thread
From: Wei Wang @ 2019-08-07  6:02 UTC (permalink / raw)
  To: linux-kernel, kvm, ak, peterz, pbonzini
  Cc: kan.liang, mingo, rkrcmar, like.xu, jannh, arei.gonglei, jmattson

On 08/06/2019 03:16 PM, Wei Wang wrote:
> The guest can access the lbr related msrs only when the vcpu's lbr event
> has been assigned the lbr feature. A cpu pinned lbr event (though no such
> event usages in the current upstream kernel) could reclaim the lbr feature
> from the vcpu's lbr event (task pinned) via ipi calls. If the cpu is
> running in the non-root mode, this will cause the cpu to vm-exit to handle
> the host ipi and then vm-entry back to the guest. So on vm-entry (where
> interrupt has been disabled), we double confirm that the vcpu's lbr event
> is still assigned the lbr feature via checking event->oncpu.
>
> The pass-through of the lbr related msrs will be cancelled if the lbr is
> reclaimed, and the following guest accesses to the lbr related msrs will
> vm-exit to the related msr emulation handler in kvm, which will prevent
> the accesses.
>
> Signed-off-by: Wei Wang <wei.w.wang@intel.com>
> ---
>   arch/x86/kvm/pmu.c           |  6 ++++++
>   arch/x86/kvm/pmu.h           |  3 +++
>   arch/x86/kvm/vmx/pmu_intel.c | 35 +++++++++++++++++++++++++++++++++++
>   arch/x86/kvm/x86.c           | 13 +++++++++++++
>   4 files changed, 57 insertions(+)
>
> diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
> index afad092..ed10a57 100644
> --- a/arch/x86/kvm/pmu.c
> +++ b/arch/x86/kvm/pmu.c
> @@ -339,6 +339,12 @@ bool kvm_pmu_lbr_enable(struct kvm_vcpu *vcpu)
>   	return false;
>   }
>   
> +void kvm_pmu_enabled_feature_confirm(struct kvm_vcpu *vcpu)
> +{
> +	if (kvm_x86_ops->pmu_ops->enabled_feature_confirm)
> +		kvm_x86_ops->pmu_ops->enabled_feature_confirm(vcpu);
> +}
> +
>   void kvm_pmu_deliver_pmi(struct kvm_vcpu *vcpu)
>   {
>   	if (lapic_in_kernel(vcpu))
> diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
> index f875721..7467907 100644
> --- a/arch/x86/kvm/pmu.h
> +++ b/arch/x86/kvm/pmu.h
> @@ -30,6 +30,7 @@ struct kvm_pmu_ops {
>   	int (*is_valid_msr_idx)(struct kvm_vcpu *vcpu, unsigned idx);
>   	bool (*is_valid_msr)(struct kvm_vcpu *vcpu, u32 msr);
>   	bool (*lbr_enable)(struct kvm_vcpu *vcpu);
> +	void (*enabled_feature_confirm)(struct kvm_vcpu *vcpu);
>   	int (*get_msr)(struct kvm_vcpu *vcpu, struct msr_data *msr_info);
>   	int (*set_msr)(struct kvm_vcpu *vcpu, struct msr_data *msr_info);
>   	void (*sched_in)(struct kvm_vcpu *vcpu, int cpu);
> @@ -126,6 +127,8 @@ int kvm_vm_ioctl_set_pmu_event_filter(struct kvm *kvm, void __user *argp);
>   
>   bool is_vmware_backdoor_pmc(u32 pmc_idx);
>   
> +void kvm_pmu_enabled_feature_confirm(struct kvm_vcpu *vcpu);
> +
>   extern struct kvm_pmu_ops intel_pmu_ops;
>   extern struct kvm_pmu_ops amd_pmu_ops;
>   #endif /* __KVM_X86_PMU_H */
> diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
> index 5580f1a..421051aa 100644
> --- a/arch/x86/kvm/vmx/pmu_intel.c
> +++ b/arch/x86/kvm/vmx/pmu_intel.c
> @@ -781,6 +781,40 @@ static void intel_pmu_reset(struct kvm_vcpu *vcpu)
>   	intel_pmu_free_lbr_event(vcpu);
>   }
>   
> +void intel_pmu_lbr_confirm(struct kvm_vcpu *vcpu)
> +{
> +	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
> +
> +	/*
> +	 * Either lbr_event being NULL or lbr_used being false indicates that
> +	 * the lbr msrs haven't been passed through to the guest, so no need
> +	 * to cancel passthrough.
> +	 */
> +	if (!pmu->lbr_event || !pmu->lbr_used)
> +		return;
> +
> +	/*
> +	 * The lbr feature gets reclaimed via IPI calls, so checking of
> +	 * lbr_event->oncpu needs to be in an atomic context. Just confirm
> +	 * that irq has been disabled already.
> +	 */
> +	lockdep_assert_irqs_disabled();
> +
> +	/*
> +	 * Cancel the pass-through of the lbr msrs if lbr has been reclaimed
> +	 * by the host perf.
> +	 */
> +	if (pmu->lbr_event->oncpu != -1) {

A mistake here: it should be "pmu->lbr_event->oncpu == -1".
(It didn't seem to affect the profiling results, but it generated
more vm-exits due to mistakenly cancelling the pass-through.)
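
For clarity, the corrected hunk would read (a sketch of the fix, not a
formally posted patch):

	/*
	 * Cancel the pass-through of the lbr msrs only if the lbr has
	 * actually been reclaimed by the host perf, i.e. the vcpu's lbr
	 * event is no longer scheduled on a cpu.
	 */
	if (pmu->lbr_event->oncpu == -1) {
		pmu->lbr_used = false;
		intel_pmu_set_intercept_for_lbr_msrs(vcpu, true);
	}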

Best,
Wei


* RE: [PATCH v8 00/14] Guest LBR Enabling
  2019-08-06  7:16 [PATCH v8 00/14] Guest LBR Enabling Wei Wang
                   ` (13 preceding siblings ...)
  2019-08-06  7:16 ` [PATCH v8 14/14] KVM/x86: remove the common handling of the debugctl msr Wei Wang
@ 2019-09-06  8:50 ` Wang, Wei W
  2020-01-30 20:14 ` Eduardo Habkost
  15 siblings, 0 replies; 20+ messages in thread
From: Wang, Wei W @ 2019-09-06  8:50 UTC (permalink / raw)
  To: linux-kernel, kvm, ak, peterz, pbonzini
  Cc: Liang, Kan, mingo, rkrcmar, Xu, Like, jannh, arei.gonglei, jmattson

A polite ping for comments on this version, thanks!

On Tuesday, August 6, 2019 3:16 PM, Wei Wang wrote:
> Last Branch Recording (LBR) is a performance monitor unit (PMU) feature on
> Intel CPUs that captures branch related info. This patch series enables this
> feature to KVM guests.
> 
> Each guest can be configured to expose this LBR feature to the guest via
> userspace setting the enabling param in KVM_CAP_X86_GUEST_LBR (patch 3).
> 
> About the lbr emulation method:
> Since the vcpu get scheduled in, the lbr related msrs are made interceptible.
> This makes guest first access to a lbr related msr always vm-exit to kvm, so
> that kvm can know whether the lbr feature is used during the vcpu time slice.
> The kvm lbr msr handler does the following
> things:
>   - create an lbr perf event (task pinned) for the vcpu thread.
>     The perf event mainly serves 2 purposes:
>       -- follow the host perf scheduling rules to manage the vcpu's usage
>          of lbr (e.g. a cpu pinned lbr event could reclaim lbr and thus
>          stopping the vcpu's use);
>       -- have the host perf do context switching of the lbr state on the
>          vcpu thread switching.
>   - pass the lbr related msrs through to the guest.
>     This enables the following guest accesses to the lbr related msrs
>     without vm-exit, as long as the vcpu's lbr event owns the lbr feature.
>     A cpu pinned lbr event on the host could come and take over the lbr
>     feature via IPI calls. In this case, the pass-through will be
>     cancelled (patch 13), and the guest following accesses to the lbr msrs
>     will vm-exit to kvm and accesses will be forbidden in the handler.
> 
> If the guest doesn't touch any of the lbr related msrs (likely the guest doesn't
> need to run lbr in the near future), the vcpu's lbr perf event will be freed
> (please see patch 12 commit for more details).
> 
> * Tests
> Conclusion: the profiling results on the guest are similar to that on the host.
> 
> Run: ./perf -b ./test_program
> 
> - Test on the host:
> Overhead  Command  Source Shared Object  Source Symbol    Target Symbol
>   22.35%  ftest    libc-2.23.so          [.] __random     [.] __random
>    8.20%  ftest    ftest                 [.] qux          [.] qux
>    5.88%  ftest    ftest                 [.] random@plt   [.] __random
>    5.88%  ftest    libc-2.23.so          [.] __random     [.] __random_r
>    5.79%  ftest    ftest                 [.] main         [.] random@plt
>    5.60%  ftest    ftest                 [.] main         [.] foo
>    5.24%  ftest    libc-2.23.so          [.] __random     [.] main
>    5.20%  ftest    libc-2.23.so          [.] __random_r   [.] __random
>    5.00%  ftest    ftest                 [.] foo          [.] qux
>    4.91%  ftest    ftest                 [.] main         [.] bar
>    4.83%  ftest    ftest                 [.] bar          [.] qux
>    4.57%  ftest    ftest                 [.] main         [.] main
>    4.38%  ftest    ftest                 [.] foo          [.] main
>    4.13%  ftest    ftest                 [.] qux          [.] foo
>    3.89%  ftest    ftest                 [.] qux          [.] bar
>    3.86%  ftest    ftest                 [.] bar          [.] main
> 
> - Test on the guest:
> Overhead  Command  Source Shared Object  Source Symbol    Target Symbol
>   22.36%  ftest    libc-2.23.so          [.] random       [.] random
>    8.55%  ftest    ftest                 [.] qux          [.] qux
>    5.79%  ftest    libc-2.23.so          [.] random       [.] random_r
>    5.64%  ftest    ftest                 [.] random@plt   [.] random
>    5.58%  ftest    ftest                 [.] main         [.] random@plt
>    5.55%  ftest    ftest                 [.] main         [.] foo
>    5.41%  ftest    libc-2.23.so          [.] random       [.] main
>    5.31%  ftest    libc-2.23.so          [.] random_r     [.] random
>    5.11%  ftest    ftest                 [.] foo          [.] qux
>    4.93%  ftest    ftest                 [.] main         [.] main
>    4.59%  ftest    ftest                 [.] qux          [.] bar
>    4.49%  ftest    ftest                 [.] bar          [.] main
>    4.42%  ftest    ftest                 [.] bar          [.] qux
>    4.16%  ftest    ftest                 [.] main         [.] bar
>    3.95%  ftest    ftest                 [.] qux          [.] foo
>    3.79%  ftest    ftest                 [.] foo          [.] main
> (due to the lib version difference, "random" is equivalent to __random above)
> 
> v7->v8 Changelog:
>   - Patch 3:
>     -- document KVM_CAP_X86_GUEST_LBR in api.txt
>     -- make the check of KVM_CAP_X86_GUEST_LBR return the size of
>        struct x86_perf_lbr_stack, to let userspace do a compatibility
>        check.
>   - Patch 7:
>        that has PERF_EV_CAP_NO_COUNTER set (rather than skipping the perf
>        scheduler). This allows the scheduler to detect lbr usage conflicts
>        scheduler). This allows the scheduler to detect lbr usage conflicts
>        via get_event_constraints, and lower priority events will finally
>        fail to use lbr.
>     -- define X86_PMC_IDX_NA as "-1", which represents a never assigned
>        counter id. There are other places that use "-1", but could be
>        updated to use the new macro in another patch series.
>   - Patch 8:
>     -- move the event->owner assignment into perf_event_alloc to have it
>        set before event_init is called. Please see this patch's commit for
>        reasons.
>   - Patch 9:
>     -- use "exclude_host" and "is_kernel_event" to decide if the lbr event
>        is used for the vcpu lbr emulation, which doesn't need a counter,
>        and removes the usage of the previous new perf_event_create API.
>     -- remove the unused attr fields.
>   - Patch 10:
>     -- set a hardware reserved bit (bit 62 of LBR_SELECT) to reg->config
>        for the vcpu lbr emulation event. This makes the config different
>        from other host lbr event, so that they don't share the lbr.
>        Please see the comments in the patch for the reasons why they
>        shouldn't share.
>   - Patch 12:
>     -- disable interrupt and check if the vcpu lbr event owns the lbr
>        feature before kvm writing to the lbr related msr. This avoids kvm
>        updating the lbr msrs after lbr has been reclaimed by other events
>        via ipi.
>     -- remove arch v4 related support.
>   - Patch 13:
>     -- double check if the vcpu lbr event owns the lbr feature before
>        vm-entry into the guest. The lbr pass-through will be cancelled if
>        lbr feature has been reclaimed by a cpu pinned lbr event.
> 
> Previous:
> https://lkml.kernel.org/r/1562548999-37095-1-git-send-email-wei.w.wang@intel.com
> 
> Wei Wang (14):
>   perf/x86: fix the variable type of the lbr msrs
>   perf/x86: add a function to get the addresses of the lbr stack msrs
>   KVM/x86: KVM_CAP_X86_GUEST_LBR
>   KVM/x86: intel_pmu_lbr_enable
>   KVM/x86/vPMU: tweak kvm_pmu_get_msr
>   KVM/x86: expose MSR_IA32_PERF_CAPABILITIES to the guest
>   perf/x86: support to create a perf event without counter allocation
>   perf/core: set the event->owner before event_init
>   KVM/x86/vPMU: APIs to create/free lbr perf event for a vcpu thread
>   perf/x86/lbr: don't share lbr for the vcpu usage case
>   perf/x86: save/restore LBR_SELECT on vcpu switching
>   KVM/x86/lbr: lbr emulation
>   KVM/x86/vPMU: check the lbr feature before entering guest
>   KVM/x86: remove the common handling of the debugctl msr
> 
>  Documentation/virt/kvm/api.txt    |  26 +++
>  arch/x86/events/core.c            |  36 ++-
>  arch/x86/events/intel/core.c      |   3 +
>  arch/x86/events/intel/lbr.c       |  95 +++++++-
>  arch/x86/events/perf_event.h      |   6 +-
>  arch/x86/include/asm/kvm_host.h   |   5 +
>  arch/x86/include/asm/perf_event.h |  17 ++
>  arch/x86/kvm/cpuid.c              |   2 +-
>  arch/x86/kvm/pmu.c                |  24 +-
>  arch/x86/kvm/pmu.h                |  11 +-
>  arch/x86/kvm/pmu_amd.c            |   7 +-
>  arch/x86/kvm/vmx/pmu_intel.c      | 476 +++++++++++++++++++++++++++++++++++++-
>  arch/x86/kvm/vmx/vmx.c            |   4 +-
>  arch/x86/kvm/vmx/vmx.h            |   2 +
>  arch/x86/kvm/x86.c                |  47 ++--
>  include/linux/perf_event.h        |  18 ++
>  include/uapi/linux/kvm.h          |   1 +
>  kernel/events/core.c              |  19 +-
>  18 files changed, 738 insertions(+), 61 deletions(-)
> 
> --
> 2.7.4



* Re: [PATCH v8 12/14] KVM/x86/lbr: lbr emulation
  2019-08-06  7:16 ` [PATCH v8 12/14] KVM/x86/lbr: lbr emulation Wei Wang
@ 2019-12-10 23:37   ` Sean Christopherson
  0 siblings, 0 replies; 20+ messages in thread
From: Sean Christopherson @ 2019-12-10 23:37 UTC (permalink / raw)
  To: Wei Wang
  Cc: linux-kernel, kvm, ak, peterz, pbonzini, kan.liang, mingo,
	rkrcmar, like.xu, jannh, arei.gonglei, jmattson

On Tue, Aug 06, 2019 at 03:16:12PM +0800, Wei Wang wrote:
> +static bool intel_pmu_set_lbr_msr(struct kvm_vcpu *vcpu,
> +				  struct msr_data *msr_info)
> +{
> +	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
> +	u32 index = msr_info->index;
> +	u64 data = msr_info->data;
> +	bool ret = false;
> +
> +	/* The lbr event should have been allocated when reaching here. */
> +	if (WARN_ON(!pmu->lbr_event))
> +		return ret;
> +
> +	/*
> +	 * Host perf could reclaim the lbr feature via ipi calls, and this can
> +	 * be detected via lbr_event->oncpu being set to -1. To ensure the
> +	 * writes to the lbr msrs don't happen after the lbr feature has been
> +	 * reclaimed by the host, the interrupt is disabled before performing
> +	 * the writes.
> +	 */
> +	local_irq_disable();
> +	if (pmu->lbr_event->oncpu == -1)
> +		goto out;
> +
> +	switch (index) {
> +	case MSR_IA32_DEBUGCTLMSR:
> +		ret = true;
> +		/*
> +		 * Currently, only FREEZE_LBRS_ON_PMI and DEBUGCTLMSR_LBR are
> +		 * supported.
> +		 */
> +		data &= (DEBUGCTLMSR_FREEZE_LBRS_ON_PMI | DEBUGCTLMSR_LBR);
> +		vmcs_write64(GUEST_IA32_DEBUGCTL, data);
> +		break;
> +	default:
> +		if (is_lbr_msr(vcpu, index)) {
> +			ret = true;
> +			wrmsrl(index, data);

@data needs to be run through is_noncanonical_address() when writing the
MSRs that take an address.  In general, it looks like there's a lack of
checking on the validity of @data.

> +		}
> +	}
> +
> +out:
> +	local_irq_enable();
> +	return ret;
> +}
> +
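
One shape the suggested check could take (an illustrative sketch based on
the review comment, not code from the series; a complete fix would only
apply it to the lbr from/to msrs, which hold linear addresses):

	default:
		if (is_lbr_msr(vcpu, index)) {
			/* reject non-canonical values before the write */
			if (is_noncanonical_address(data, vcpu))
				break;	/* the caller then signals a #GP */
			ret = true;
			wrmsrl(index, data);
		}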


* Re: [PATCH v8 00/14] Guest LBR Enabling
  2019-08-06  7:16 [PATCH v8 00/14] Guest LBR Enabling Wei Wang
                   ` (14 preceding siblings ...)
  2019-09-06  8:50 ` [PATCH v8 00/14] Guest LBR Enabling Wang, Wei W
@ 2020-01-30 20:14 ` Eduardo Habkost
  2020-01-31  1:01   ` Wang, Wei W
  15 siblings, 1 reply; 20+ messages in thread
From: Eduardo Habkost @ 2020-01-30 20:14 UTC (permalink / raw)
  To: Wei Wang
  Cc: linux-kernel, kvm, ak, peterz, pbonzini, kan.liang, mingo,
	rkrcmar, like.xu, jannh, arei.gonglei, jmattson,
	Arnaldo Carvalho de Melo, Jiri Olsa

On Tue, Aug 06, 2019 at 03:16:00PM +0800, Wei Wang wrote:
> Last Branch Recording (LBR) is a performance monitor unit (PMU) feature
> on Intel CPUs that captures branch related info. This patch series enables
> this feature to KVM guests.
> 
> Each guest can be configured to expose this LBR feature to the guest via
> userspace setting the enabling param in KVM_CAP_X86_GUEST_LBR (patch 3).

Are QEMU patches for enabling KVM_CAP_X86_GUEST_LBR being planned?

-- 
Eduardo


^ permalink raw reply	[flat|nested] 20+ messages in thread

* RE: [PATCH v8 00/14] Guest LBR Enabling
  2020-01-30 20:14 ` Eduardo Habkost
@ 2020-01-31  1:01   ` Wang, Wei W
  0 siblings, 0 replies; 20+ messages in thread
From: Wang, Wei W @ 2020-01-31  1:01 UTC (permalink / raw)
  To: Eduardo Habkost
  Cc: linux-kernel, kvm, ak, peterz, pbonzini, Liang, Kan, mingo,
	rkrcmar, Xu, Like, jannh, arei.gonglei, jmattson,
	Arnaldo Carvalho de Melo, Jiri Olsa

On Friday, January 31, 2020 4:14 AM, Eduardo Habkost wrote:
> On Tue, Aug 06, 2019 at 03:16:00PM +0800, Wei Wang wrote:
> > Last Branch Recording (LBR) is a performance monitor unit (PMU)
> > feature on Intel CPUs that captures branch related info. This patch
> > series enables this feature to KVM guests.
> >
> > Each guest can be configured to expose this LBR feature to the guest
> > via userspace setting the enabling param in KVM_CAP_X86_GUEST_LBR
> (patch 3).
> 
> Are QEMU patches for enabling KVM_CAP_X86_GUEST_LBR being planned?
> 
Yes, we have a couple of qemu patches. They are planned to be sent for review after the kernel part gets finalized 😊
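
For anyone curious, the userspace side could look roughly like the sketch
below (hypothetical: it assumes the cap is toggled through KVM_ENABLE_CAP
on the VM fd with args[0] = 1 as described for patch 3, and the helper
name is made up):

#include <linux/kvm.h>
#include <sys/ioctl.h>

static int enable_guest_lbr(int vm_fd)
{
	struct kvm_enable_cap cap = {
		.cap = KVM_CAP_X86_GUEST_LBR,	/* added by this series */
		.args[0] = 1,			/* expose lbr to the guest */
	};

	/*
	 * Per the v8 changelog, a non-zero return here is the size of
	 * struct x86_perf_lbr_stack and serves as a compatibility check.
	 */
	if (ioctl(vm_fd, KVM_CHECK_EXTENSION, KVM_CAP_X86_GUEST_LBR) <= 0)
		return -1;

	return ioctl(vm_fd, KVM_ENABLE_CAP, &cap);
}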

Best,
Wei


Thread overview: 20+ messages
2019-08-06  7:16 [PATCH v8 00/14] Guest LBR Enabling Wei Wang
2019-08-06  7:16 ` [PATCH v8 01/14] perf/x86: fix the variable type of the lbr msrs Wei Wang
2019-08-06  7:16 ` [PATCH v8 02/14] perf/x86: add a function to get the addresses of the lbr stack msrs Wei Wang
2019-08-06  7:16 ` [PATCH v8 03/14] KVM/x86: KVM_CAP_X86_GUEST_LBR Wei Wang
2019-08-06  7:16 ` [PATCH v8 04/14] KVM/x86: intel_pmu_lbr_enable Wei Wang
2019-08-06  7:16 ` [PATCH v8 05/14] KVM/x86/vPMU: tweak kvm_pmu_get_msr Wei Wang
2019-08-06  7:16 ` [PATCH v8 06/14] KVM/x86: expose MSR_IA32_PERF_CAPABILITIES to the guest Wei Wang
2019-08-06  7:16 ` [PATCH v8 07/14] perf/x86: support to create a perf event without counter allocation Wei Wang
2019-08-06  7:16 ` [PATCH v8 08/14] perf/core: set the event->owner before event_init Wei Wang
2019-08-06  7:16 ` [PATCH v8 09/14] KVM/x86/vPMU: APIs to create/free lbr perf event for a vcpu thread Wei Wang
2019-08-06  7:16 ` [PATCH v8 10/14] perf/x86/lbr: don't share lbr for the vcpu usage case Wei Wang
2019-08-06  7:16 ` [PATCH v8 11/14] perf/x86: save/restore LBR_SELECT on vcpu switching Wei Wang
2019-08-06  7:16 ` [PATCH v8 12/14] KVM/x86/lbr: lbr emulation Wei Wang
2019-12-10 23:37   ` Sean Christopherson
2019-08-06  7:16 ` [PATCH v8 13/14] KVM/x86/vPMU: check the lbr feature before entering guest Wei Wang
2019-08-07  6:02   ` Wei Wang
2019-08-06  7:16 ` [PATCH v8 14/14] KVM/x86: remove the common handling of the debugctl msr Wei Wang
2019-09-06  8:50 ` [PATCH v8 00/14] Guest LBR Enabling Wang, Wei W
2020-01-30 20:14 ` Eduardo Habkost
2020-01-31  1:01   ` Wang, Wei W
