* [PATCH v3 0/7] x86/pmu: Corner cases fixes and optimization
From: Like Xu @ 2022-08-31  8:53 UTC
  To: Paolo Bonzini, Sean Christopherson; +Cc: kvm, linux-kernel

Well-designed tests can help us find more bugs, especially when the
test steps differ from the Linux kernel's behaviour in the timing of
accesses to virtualized hardware resources.

Please feel free to run the tests, add more, or share comments.

Previous:
https://lore.kernel.org/kvm/20220823093221.38075-1-likexu@tencent.com/

V2 RESEND -> V3 Changelog:
- Post perf change as a separate patch to the perf folks; (Sean)
- Rewrite the deferred logic using imperative mood; (Sean)
- Drop some useless comments; (Sean)
- Rename __reprogram_counter() to kvm_pmu_request_counter_reprogam(); (Sean)
- Replace a play-by-play of the code changes with a high-level description; (Sean)
- Rename pmc->stale_counter to pmc->prev_counter; (Sean)
- Drop an unnecessary check about pmc->prev_counter; (Sean)
- Simplify the code about "CTLn is even, CTRn is odd"; (Sean)
- Refine commit message to avoid pronouns; (Sean)

Like Xu (7):
  KVM: x86/pmu: Avoid setting BIT_ULL(-1) to pmu->host_cross_mapped_mask
  KVM: x86/pmu: Don't generate PEBS records for emulated instructions
  KVM: x86/pmu: Avoid using PEBS perf_events for normal counters
  KVM: x86/pmu: Defer reprogram_counter() to kvm_pmu_handle_event()
  KVM: x86/pmu: Defer counter emulated overflow via pmc->prev_counter
  KVM: x86/svm/pmu: Direct access pmu->gp_counter[] to implement
    amd_*_to_pmc()
  KVM: x86/svm/pmu: Rewrite get_gp_pmc_amd() for more counters
    scalability

 arch/x86/include/asm/kvm_host.h |   6 +-
 arch/x86/kvm/pmu.c              |  44 +++++++-----
 arch/x86/kvm/pmu.h              |   6 +-
 arch/x86/kvm/svm/pmu.c          | 121 ++++++--------------------------
 arch/x86/kvm/vmx/pmu_intel.c    |  36 +++++-----
 5 files changed, 75 insertions(+), 138 deletions(-)

-- 
2.37.3



* [PATCH v3 1/7] KVM: x86/pmu: Avoid setting BIT_ULL(-1) to pmu->host_cross_mapped_mask
From: Like Xu @ 2022-08-31  8:53 UTC
  To: Paolo Bonzini, Sean Christopherson; +Cc: kvm, linux-kernel

From: Like Xu <likexu@tencent.com>

In the extreme case of host counter multiplexing and contention, the
perf_event requested by the guest's PEBS counter is not allocated to any
actual physical counter, in which case hw.idx is bookkept as -1,
resulting in an out-of-bounds access to host_cross_mapped_mask.
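
For reference, BIT_ULL() is essentially a plain left shift (roughly
"#define BIT_ULL(nr) (1ULL << (nr))"), so a stale hw.idx turns into a
negative shift count. A rough sketch of the failure mode, not part of
the patch:

	hw_idx = pmc->perf_event->hw.idx;	/* -1 under contention */
	/* 1ULL << -1 is undefined behavior; any bit may leak into the mask */
	pmu->host_cross_mapped_mask |= BIT_ULL(hw_idx);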

Fixes: 854250329c02 ("KVM: x86/pmu: Disable guest PEBS temporarily in two rare situations")
Signed-off-by: Like Xu <likexu@tencent.com>
---
 arch/x86/kvm/vmx/pmu_intel.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index c399637a3a79..d595ff33d32d 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -776,20 +776,20 @@ static void intel_pmu_cleanup(struct kvm_vcpu *vcpu)
 void intel_pmu_cross_mapped_check(struct kvm_pmu *pmu)
 {
 	struct kvm_pmc *pmc = NULL;
-	int bit;
+	int bit, hw_idx;
 
 	for_each_set_bit(bit, (unsigned long *)&pmu->global_ctrl,
 			 X86_PMC_IDX_MAX) {
 		pmc = intel_pmc_idx_to_pmc(pmu, bit);
 
 		if (!pmc || !pmc_speculative_in_use(pmc) ||
-		    !intel_pmc_is_enabled(pmc))
+		    !intel_pmc_is_enabled(pmc) || !pmc->perf_event)
 			continue;
 
-		if (pmc->perf_event && pmc->idx != pmc->perf_event->hw.idx) {
-			pmu->host_cross_mapped_mask |=
-				BIT_ULL(pmc->perf_event->hw.idx);
-		}
+		hw_idx = pmc->perf_event->hw.idx;
+		/* make it a little less dependent on perf's exact behavior */
+		if (hw_idx != pmc->idx && hw_idx > -1)
+			pmu->host_cross_mapped_mask |= BIT_ULL(hw_idx);
 	}
 }
 
-- 
2.37.3



* [PATCH v3 2/7] KVM: x86/pmu: Don't generate PEBS records for emulated instructions
From: Like Xu @ 2022-08-31  8:53 UTC
  To: Paolo Bonzini, Sean Christopherson; +Cc: kvm, linux-kernel

From: Like Xu <likexu@tencent.com>

KVM accumulates an enabled counter for the INSTRUCTIONS or
BRANCH_INSTRUCTIONS hw event on any KVM-emulated instruction and
generates an emulated overflow interrupt when the counter overflows.
In theory the same should happen when a PEBS counter overflows, but
that part of the underlying support is currently missing (e.g. software
injection of records in the IRQ context, or a lazy approach).
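
The emulated-instruction path is roughly (a sketch; see the splat in
patch 5/7 for the full call trace):

	kvm_skip_emulated_instruction()
	  -> kvm_pmu_trigger_event(vcpu, PERF_COUNT_HW_INSTRUCTIONS)
	    -> kvm_pmu_incr_counter(pmc)		/* counter++, may wrap */
	      -> __kvm_perf_overflow(pmc, false)	/* in_pmi == false */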

In this case, KVM skips the injection of this BUFFER_OVF PMI (effectively
dropping one PEBS record) and lets the overflowed counter move on. The
loss of a single sample does not introduce a loss of accuracy, but it is
easily noticeable for certain specific instructions.

This issue is expected to be addressed along with the issue
of PEBS cross-mapped counters with a slow-path proposal.

Fixes: 79f3e3b58386 ("KVM: x86/pmu: Reprogram PEBS event to emulate guest PEBS counter")
Signed-off-by: Like Xu <likexu@tencent.com>
---
 arch/x86/kvm/pmu.c | 16 +++++++++++++---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index 02f9e4f245bd..390d697efde1 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -106,9 +106,19 @@ static inline void __kvm_perf_overflow(struct kvm_pmc *pmc, bool in_pmi)
 		return;
 
 	if (pmc->perf_event && pmc->perf_event->attr.precise_ip) {
-		/* Indicate PEBS overflow PMI to guest. */
-		skip_pmi = __test_and_set_bit(GLOBAL_STATUS_BUFFER_OVF_BIT,
-					      (unsigned long *)&pmu->global_status);
+		if (!in_pmi) {
+			/*
+			 * TODO: KVM is currently _choosing_ to not generate records
+			 * for emulated instructions, avoiding BUFFER_OVF PMI when
+			 * there are no records. Strictly speaking, it should be done
+			 * as well in the right context to improve sampling accuracy.
+			 */
+			skip_pmi = true;
+		} else {
+			/* Indicate PEBS overflow PMI to guest. */
+			skip_pmi = __test_and_set_bit(GLOBAL_STATUS_BUFFER_OVF_BIT,
+						      (unsigned long *)&pmu->global_status);
+		}
 	} else {
 		__set_bit(pmc->idx, (unsigned long *)&pmu->global_status);
 	}
-- 
2.37.3



* [PATCH v3 3/7] KVM: x86/pmu: Avoid using PEBS perf_events for normal counters
From: Like Xu @ 2022-08-31  8:53 UTC
  To: Paolo Bonzini, Sean Christopherson; +Cc: kvm, linux-kernel

From: Like Xu <likexu@tencent.com>

The check logic in pmc_resume_counter() that determines whether a
perf_event is reusable is partial and flawed, especially when it comes
to a pseudocode sequence (not sensible, but clearly valid) like:

  - enable a counter and its PEBS bit
  - enable global_ctrl
  - run workload
  - disable only the PEBS bit, leaving the global_ctrl bit enabled
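
Concretely, the sequence above corresponds to guest MSR accesses along
the lines of (a sketch; counter 0 and the event choice are illustrative
only):

	wrmsr(MSR_P6_EVNTSEL0, ARCH_PERFMON_EVENTSEL_ENABLE | event);
	wrmsr(MSR_IA32_PEBS_ENABLE, BIT_ULL(0));	/* PEBS on PMC0 */
	wrmsr(MSR_CORE_PERF_GLOBAL_CTRL, BIT_ULL(0));
	/* ... run workload ... */
	wrmsr(MSR_IA32_PEBS_ENABLE, 0);		/* clear only the PEBS bit */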

In this corner case, a perf_event created for PEBS can be reused by
a normal counter before it has been released and recreated, and when this
normal counter overflows, it triggers a PEBS interrupt (precise_ip != 0).

To address this issue, the reuse check has been revamped and KVM will
go back to do reprogram_counter() when any bit of guest PEBS_ENABLE
msr has changed, which is similar to what global_ctrl_changed() does.

Fixes: 79f3e3b58386 ("KVM: x86/pmu: Reprogram PEBS event to emulate guest PEBS counter")
Signed-off-by: Like Xu <likexu@tencent.com>
---
 arch/x86/kvm/pmu.c           |  4 ++--
 arch/x86/kvm/vmx/pmu_intel.c | 14 +++++++-------
 2 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index 390d697efde1..d9b9a0f0db17 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -237,8 +237,8 @@ static bool pmc_resume_counter(struct kvm_pmc *pmc)
 			      get_sample_period(pmc, pmc->counter)))
 		return false;
 
-	if (!test_bit(pmc->idx, (unsigned long *)&pmc_to_pmu(pmc)->pebs_enable) &&
-	    pmc->perf_event->attr.precise_ip)
+	if (test_bit(pmc->idx, (unsigned long *)&pmc_to_pmu(pmc)->pebs_enable) !=
+	    (!!pmc->perf_event->attr.precise_ip))
 		return false;
 
 	/* reuse perf_event to serve as pmc_reprogram_counter() does*/
diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index d595ff33d32d..6242b0b81116 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -68,15 +68,11 @@ static struct kvm_pmc *intel_pmc_idx_to_pmc(struct kvm_pmu *pmu, int pmc_idx)
 	}
 }
 
-/* function is called when global control register has been updated. */
-static void global_ctrl_changed(struct kvm_pmu *pmu, u64 data)
+static void reprogram_counters(struct kvm_pmu *pmu, u64 diff)
 {
 	int bit;
-	u64 diff = pmu->global_ctrl ^ data;
 	struct kvm_pmc *pmc;
 
-	pmu->global_ctrl = data;
-
 	for_each_set_bit(bit, (unsigned long *)&diff, X86_PMC_IDX_MAX) {
 		pmc = intel_pmc_idx_to_pmc(pmu, bit);
 		if (pmc)
@@ -397,7 +393,7 @@ static int intel_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 	struct kvm_pmc *pmc;
 	u32 msr = msr_info->index;
 	u64 data = msr_info->data;
-	u64 reserved_bits;
+	u64 reserved_bits, diff;
 
 	switch (msr) {
 	case MSR_CORE_PERF_FIXED_CTR_CTRL:
@@ -418,7 +414,9 @@ static int intel_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 		if (pmu->global_ctrl == data)
 			return 0;
 		if (kvm_valid_perf_global_ctrl(pmu, data)) {
-			global_ctrl_changed(pmu, data);
+			diff = pmu->global_ctrl ^ data;
+			pmu->global_ctrl = data;
+			reprogram_counters(pmu, diff);
 			return 0;
 		}
 		break;
@@ -433,7 +431,9 @@ static int intel_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 		if (pmu->pebs_enable == data)
 			return 0;
 		if (!(data & pmu->pebs_enable_mask)) {
+			diff = pmu->pebs_enable ^ data;
 			pmu->pebs_enable = data;
+			reprogram_counters(pmu, diff);
 			return 0;
 		}
 		break;
-- 
2.37.3



* [PATCH v3 4/7] KVM: x86/pmu: Defer reprogram_counter() to kvm_pmu_handle_event()
From: Like Xu @ 2022-08-31  8:53 UTC
  To: Paolo Bonzini, Sean Christopherson; +Cc: kvm, linux-kernel

From: Like Xu <likexu@tencent.com>

Batch reprogramming PMU counters by setting KVM_REQ_PMU and thus
deferring reprogramming to kvm_pmu_handle_event(), to avoid
reprogramming a counter multiple times during a single VM-Exit.

Deferring programming will also allow KVM to fix a bug where immediately
reprogramming a counter can result in sleeping (taking a mutex) while
interrupts are disabled in the VM-Exit fastpath.

Introduce kvm_pmu_request_counter_reprogam() to make it obvious that
KVM is _requesting_ a reprogram and not actually doing the reprogram.

Opportunistically refine related comments to avoid misunderstandings.
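
The deferred flow is then roughly:

	WRMSR / fixed ctrl update
	  -> kvm_pmu_request_counter_reprogam(pmc)
	       /* __set_bit(pmc->idx, reprogram_pmi) + KVM_REQ_PMU */
	  -> vcpu_enter_guest() sees KVM_REQ_PMU
	  -> kvm_pmu_handle_event()
	  -> reprogram_counter(pmc)	/* at most once per dirty counter */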

Signed-off-by: Like Xu <likexu@tencent.com>
---
 arch/x86/include/asm/kvm_host.h |  1 +
 arch/x86/kvm/pmu.c              | 11 ++++++-----
 arch/x86/kvm/pmu.h              |  6 +++++-
 arch/x86/kvm/svm/pmu.c          |  2 +-
 arch/x86/kvm/vmx/pmu_intel.c    |  6 +++---
 5 files changed, 16 insertions(+), 10 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 2c96c43c313a..4e568a7ef464 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -493,6 +493,7 @@ struct kvm_pmc {
 	struct perf_event *perf_event;
 	struct kvm_vcpu *vcpu;
 	/*
+	 * only for creating or reusing perf_event,
 	 * eventsel value for general purpose counters,
 	 * ctrl value for fixed counters.
 	 */
diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index d9b9a0f0db17..7f391750ebd3 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -101,7 +101,6 @@ static inline void __kvm_perf_overflow(struct kvm_pmc *pmc, bool in_pmi)
 	struct kvm_pmu *pmu = pmc_to_pmu(pmc);
 	bool skip_pmi = false;
 
-	/* Ignore counters that have been reprogrammed already. */
 	if (test_and_set_bit(pmc->idx, pmu->reprogram_pmi))
 		return;
 
@@ -293,7 +292,7 @@ static bool check_pmu_event_filter(struct kvm_pmc *pmc)
 	return allow_event;
 }
 
-void reprogram_counter(struct kvm_pmc *pmc)
+static void reprogram_counter(struct kvm_pmc *pmc)
 {
 	struct kvm_pmu *pmu = pmc_to_pmu(pmc);
 	u64 eventsel = pmc->eventsel;
@@ -335,7 +334,6 @@ void reprogram_counter(struct kvm_pmc *pmc)
 			      !(eventsel & ARCH_PERFMON_EVENTSEL_OS),
 			      eventsel & ARCH_PERFMON_EVENTSEL_INT);
 }
-EXPORT_SYMBOL_GPL(reprogram_counter);
 
 void kvm_pmu_handle_event(struct kvm_vcpu *vcpu)
 {
@@ -345,10 +343,11 @@ void kvm_pmu_handle_event(struct kvm_vcpu *vcpu)
 	for_each_set_bit(bit, pmu->reprogram_pmi, X86_PMC_IDX_MAX) {
 		struct kvm_pmc *pmc = static_call(kvm_x86_pmu_pmc_idx_to_pmc)(pmu, bit);
 
-		if (unlikely(!pmc || !pmc->perf_event)) {
+		if (unlikely(!pmc)) {
 			clear_bit(bit, pmu->reprogram_pmi);
 			continue;
 		}
+
 		reprogram_counter(pmc);
 	}
 
@@ -542,7 +541,9 @@ static inline bool eventsel_match_perf_hw_id(struct kvm_pmc *pmc,
 static inline bool cpl_is_matched(struct kvm_pmc *pmc)
 {
 	bool select_os, select_user;
-	u64 config = pmc->current_config;
+	u64 config = pmc_is_gp(pmc) ? pmc->eventsel :
+		(u64)fixed_ctrl_field(pmc_to_pmu(pmc)->fixed_ctr_ctrl,
+				      pmc->idx - INTEL_PMC_IDX_FIXED);
 
 	if (pmc_is_gp(pmc)) {
 		select_os = config & ARCH_PERFMON_EVENTSEL_OS;
diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
index 5cc5721f260b..847e7112a5d3 100644
--- a/arch/x86/kvm/pmu.h
+++ b/arch/x86/kvm/pmu.h
@@ -183,7 +183,11 @@ static inline void kvm_init_pmu_capability(void)
 					     KVM_PMC_MAX_FIXED);
 }
 
-void reprogram_counter(struct kvm_pmc *pmc);
+static inline void kvm_pmu_request_counter_reprogam(struct kvm_pmc *pmc)
+{
+	__set_bit(pmc->idx, pmc_to_pmu(pmc)->reprogram_pmi);
+	kvm_make_request(KVM_REQ_PMU, pmc->vcpu);
+}
 
 void kvm_pmu_deliver_pmi(struct kvm_vcpu *vcpu);
 void kvm_pmu_handle_event(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/svm/pmu.c b/arch/x86/kvm/svm/pmu.c
index f24613a108c5..70219c19b872 100644
--- a/arch/x86/kvm/svm/pmu.c
+++ b/arch/x86/kvm/svm/pmu.c
@@ -238,7 +238,7 @@ static int amd_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 		data &= ~pmu->reserved_bits;
 		if (data != pmc->eventsel) {
 			pmc->eventsel = data;
-			reprogram_counter(pmc);
+			kvm_pmu_request_counter_reprogam(pmc);
 		}
 		return 0;
 	}
diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index 6242b0b81116..863a6ff9e214 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -52,7 +52,7 @@ static void reprogram_fixed_counters(struct kvm_pmu *pmu, u64 data)
 		pmc = get_fixed_pmc(pmu, MSR_CORE_PERF_FIXED_CTR0 + i);
 
 		__set_bit(INTEL_PMC_IDX_FIXED + i, pmu->pmc_in_use);
-		reprogram_counter(pmc);
+		kvm_pmu_request_counter_reprogam(pmc);
 	}
 }
 
@@ -76,7 +76,7 @@ static void reprogram_counters(struct kvm_pmu *pmu, u64 diff)
 	for_each_set_bit(bit, (unsigned long *)&diff, X86_PMC_IDX_MAX) {
 		pmc = intel_pmc_idx_to_pmc(pmu, bit);
 		if (pmc)
-			reprogram_counter(pmc);
+			kvm_pmu_request_counter_reprogam(pmc);
 	}
 }
 
@@ -477,7 +477,7 @@ static int intel_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 				reserved_bits ^= HSW_IN_TX_CHECKPOINTED;
 			if (!(data & reserved_bits)) {
 				pmc->eventsel = data;
-				reprogram_counter(pmc);
+				kvm_pmu_request_counter_reprogam(pmc);
 				return 0;
 			}
 		} else if (intel_pmu_handle_lbr_msrs_access(vcpu, msr_info, false))
-- 
2.37.3



* [PATCH v3 5/7] KVM: x86/pmu: Defer counter emulated overflow via pmc->prev_counter
From: Like Xu @ 2022-08-31  8:53 UTC
  To: Paolo Bonzini, Sean Christopherson; +Cc: kvm, linux-kernel, Wanpeng Li

From: Like Xu <likexu@tencent.com>

Defer reprogramming counters and handling overflow via KVM_REQ_PMU
when incrementing counters.  KVM skips emulated WRMSR in the VM-Exit
fastpath, the fastpath runs with IRQs disabled, skipping instructions
can increment and reprogram counters, reprogramming counters can
sleep, and sleeping is disallowed while IRQs are disabled.

 [*] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:580
 [*] in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 2981888, name: CPU 15/KVM
 [*] preempt_count: 1, expected: 0
 [*] RCU nest depth: 0, expected: 0
 [*] INFO: lockdep is turned off.
 [*] irq event stamp: 0
 [*] hardirqs last  enabled at (0): [<0000000000000000>] 0x0
 [*] hardirqs last disabled at (0): [<ffffffff8121222a>] copy_process+0x146a/0x62d0
 [*] softirqs last  enabled at (0): [<ffffffff81212269>] copy_process+0x14a9/0x62d0
 [*] softirqs last disabled at (0): [<0000000000000000>] 0x0
 [*] Preemption disabled at:
 [*] [<ffffffffc2063fc1>] vcpu_enter_guest+0x1001/0x3dc0 [kvm]
 [*] CPU: 17 PID: 2981888 Comm: CPU 15/KVM Kdump: 5.19.0-rc1-g239111db364c-dirty #2
 [*] Call Trace:
 [*]  <TASK>
 [*]  dump_stack_lvl+0x6c/0x9b
 [*]  __might_resched.cold+0x22e/0x297
 [*]  __mutex_lock+0xc0/0x23b0
 [*]  perf_event_ctx_lock_nested+0x18f/0x340
 [*]  perf_event_pause+0x1a/0x110
 [*]  reprogram_counter+0x2af/0x1490 [kvm]
 [*]  kvm_pmu_trigger_event+0x429/0x950 [kvm]
 [*]  kvm_skip_emulated_instruction+0x48/0x90 [kvm]
 [*]  handle_fastpath_set_msr_irqoff+0x349/0x3b0 [kvm]
 [*]  vmx_vcpu_run+0x268e/0x3b80 [kvm_intel]
 [*]  vcpu_enter_guest+0x1d22/0x3dc0 [kvm]

Add a field to kvm_pmc to track the previous counter value in order
to defer overflow detection to kvm_pmu_handle_event() (reprogramming
must be done before handling overflow).

Opportunistically shrink sizeof(struct kvm_pmc) a bit.
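
Walking through an emulated overflow with, e.g., a 48-bit wide counter
(the width is illustrative):

	pmc->prev_counter = pmc->counter;	/* 0xffffffffffff */
	pmc->counter = (pmc->counter + 1) & pmc_bitmask(pmc);	/* wraps to 0 */
	...
	if (pmc->counter < pmc->prev_counter)	/* 0 < 0xffffffffffff */
		__kvm_perf_overflow(pmc, false);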

Suggested-by: Wanpeng Li <wanpengli@tencent.com>
Fixes: 9cd803d496e7 ("KVM: x86: Update vPMCs when retiring instructions")
Signed-off-by: Like Xu <likexu@tencent.com>
---
 arch/x86/include/asm/kvm_host.h |  5 +++--
 arch/x86/kvm/pmu.c              | 13 ++++++-------
 arch/x86/kvm/svm/pmu.c          |  2 +-
 arch/x86/kvm/vmx/pmu_intel.c    |  4 ++--
 4 files changed, 12 insertions(+), 12 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 4e568a7ef464..08c3f90b4ac3 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -488,7 +488,10 @@ enum pmc_type {
 struct kvm_pmc {
 	enum pmc_type type;
 	u8 idx;
+	bool is_paused;
+	bool intr;
 	u64 counter;
+	u64 prev_counter;
 	u64 eventsel;
 	struct perf_event *perf_event;
 	struct kvm_vcpu *vcpu;
@@ -498,8 +501,6 @@ struct kvm_pmc {
 	 * ctrl value for fixed counters.
 	 */
 	u64 current_config;
-	bool is_paused;
-	bool intr;
 };
 
 #define KVM_PMC_MAX_FIXED	3
diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index 7f391750ebd3..3c42df3a55ff 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -349,6 +349,10 @@ void kvm_pmu_handle_event(struct kvm_vcpu *vcpu)
 		}
 
 		reprogram_counter(pmc);
+
+		if (pmc->counter < pmc->prev_counter)
+			__kvm_perf_overflow(pmc, false);
+		pmc->prev_counter = 0;
 	}
 
 	/*
@@ -521,14 +525,9 @@ void kvm_pmu_destroy(struct kvm_vcpu *vcpu)
 
 static void kvm_pmu_incr_counter(struct kvm_pmc *pmc)
 {
-	u64 prev_count;
-
-	prev_count = pmc->counter;
+	pmc->prev_counter = pmc->counter;
 	pmc->counter = (pmc->counter + 1) & pmc_bitmask(pmc);
-
-	reprogram_counter(pmc);
-	if (pmc->counter < prev_count)
-		__kvm_perf_overflow(pmc, false);
+	kvm_pmu_request_counter_reprogam(pmc);
 }
 
 static inline bool eventsel_match_perf_hw_id(struct kvm_pmc *pmc,
diff --git a/arch/x86/kvm/svm/pmu.c b/arch/x86/kvm/svm/pmu.c
index 70219c19b872..0166f3bc6447 100644
--- a/arch/x86/kvm/svm/pmu.c
+++ b/arch/x86/kvm/svm/pmu.c
@@ -290,7 +290,7 @@ static void amd_pmu_reset(struct kvm_vcpu *vcpu)
 		struct kvm_pmc *pmc = &pmu->gp_counters[i];
 
 		pmc_stop_counter(pmc);
-		pmc->counter = pmc->eventsel = 0;
+		pmc->counter = pmc->prev_counter = pmc->eventsel = 0;
 	}
 }
 
diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index 863a6ff9e214..1d3d0bd3e0e7 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -647,14 +647,14 @@ static void intel_pmu_reset(struct kvm_vcpu *vcpu)
 		pmc = &pmu->gp_counters[i];
 
 		pmc_stop_counter(pmc);
-		pmc->counter = pmc->eventsel = 0;
+		pmc->counter = pmc->prev_counter = pmc->eventsel = 0;
 	}
 
 	for (i = 0; i < KVM_PMC_MAX_FIXED; i++) {
 		pmc = &pmu->fixed_counters[i];
 
 		pmc_stop_counter(pmc);
-		pmc->counter = 0;
+		pmc->counter = pmc->prev_counter = 0;
 	}
 
 	pmu->fixed_ctr_ctrl = pmu->global_ctrl = pmu->global_status = 0;
-- 
2.37.3



* [PATCH v3 6/7] KVM: x86/svm/pmu: Direct access pmu->gp_counter[] to implement amd_*_to_pmc()
From: Like Xu @ 2022-08-31  8:53 UTC
  To: Paolo Bonzini, Sean Christopherson; +Cc: kvm, linux-kernel

From: Like Xu <likexu@tencent.com>

Access PMU counters on AMD by directly indexing the array of general
purpose counters instead of translating the PMC index to an MSR index.
AMD only supports gp counters, so there's no need to translate a PMC
index to an MSR index and back to a PMC index.

Opportunistically apply array_index_nospec() to reduce the attack
surface for speculative execution and remove the dead code.
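
For background, array_index_nospec() (from <linux/nospec.h>) forces the
index to 0 when it is out of range, even under speculative execution,
so the pattern below stays in bounds even if the preceding check is
speculatively bypassed:

	if (pmc_idx >= num_counters)
		return NULL;
	/* idx < num_counters holds even under speculation */
	return &pmu->gp_counters[array_index_nospec(pmc_idx, num_counters)];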

Signed-off-by: Like Xu <likexu@tencent.com>
---
 arch/x86/kvm/svm/pmu.c | 41 +++++------------------------------------
 1 file changed, 5 insertions(+), 36 deletions(-)

diff --git a/arch/x86/kvm/svm/pmu.c b/arch/x86/kvm/svm/pmu.c
index 0166f3bc6447..c736757c29d2 100644
--- a/arch/x86/kvm/svm/pmu.c
+++ b/arch/x86/kvm/svm/pmu.c
@@ -33,23 +33,6 @@ enum index {
 	INDEX_ERROR,
 };
 
-static unsigned int get_msr_base(struct kvm_pmu *pmu, enum pmu_type type)
-{
-	struct kvm_vcpu *vcpu = pmu_to_vcpu(pmu);
-
-	if (guest_cpuid_has(vcpu, X86_FEATURE_PERFCTR_CORE)) {
-		if (type == PMU_TYPE_COUNTER)
-			return MSR_F15H_PERF_CTR;
-		else
-			return MSR_F15H_PERF_CTL;
-	} else {
-		if (type == PMU_TYPE_COUNTER)
-			return MSR_K7_PERFCTR0;
-		else
-			return MSR_K7_EVNTSEL0;
-	}
-}
-
 static enum index msr_to_index(u32 msr)
 {
 	switch (msr) {
@@ -141,18 +124,12 @@ static bool amd_pmc_is_enabled(struct kvm_pmc *pmc)
 
 static struct kvm_pmc *amd_pmc_idx_to_pmc(struct kvm_pmu *pmu, int pmc_idx)
 {
-	unsigned int base = get_msr_base(pmu, PMU_TYPE_COUNTER);
-	struct kvm_vcpu *vcpu = pmu_to_vcpu(pmu);
+	unsigned int num_counters = pmu->nr_arch_gp_counters;
 
-	if (guest_cpuid_has(vcpu, X86_FEATURE_PERFCTR_CORE)) {
-		/*
-		 * The idx is contiguous. The MSRs are not. The counter MSRs
-		 * are interleaved with the event select MSRs.
-		 */
-		pmc_idx *= 2;
-	}
+	if (pmc_idx >= num_counters)
+		return NULL;
 
-	return get_gp_pmc_amd(pmu, base + pmc_idx, PMU_TYPE_COUNTER);
+	return &pmu->gp_counters[array_index_nospec(pmc_idx, num_counters)];
 }
 
 static bool amd_is_valid_rdpmc_ecx(struct kvm_vcpu *vcpu, unsigned int idx)
@@ -168,15 +145,7 @@ static bool amd_is_valid_rdpmc_ecx(struct kvm_vcpu *vcpu, unsigned int idx)
 static struct kvm_pmc *amd_rdpmc_ecx_to_pmc(struct kvm_vcpu *vcpu,
 	unsigned int idx, u64 *mask)
 {
-	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
-	struct kvm_pmc *counters;
-
-	idx &= ~(3u << 30);
-	if (idx >= pmu->nr_arch_gp_counters)
-		return NULL;
-	counters = pmu->gp_counters;
-
-	return &counters[idx];
+	return amd_pmc_idx_to_pmc(vcpu_to_pmu(vcpu), idx & ~(3u << 30));
 }
 
 static bool amd_is_valid_msr(struct kvm_vcpu *vcpu, u32 msr)
-- 
2.37.3



* [PATCH v3 7/7] KVM: x86/svm/pmu: Rewrite get_gp_pmc_amd() for more counters scalability
From: Like Xu @ 2022-08-31  8:53 UTC
  To: Paolo Bonzini, Sean Christopherson; +Cc: kvm, linux-kernel

From: Like Xu <likexu@tencent.com>

If the number of AMD gp counters continues to grow, the code will
become very clumsy, and the switch-case design of the inline
get_gp_pmc_amd() will also bloat the kernel text size.

Teach the code to manage two groups of MSRs, each representing a
different version of the AMD PMU counter MSRs. The MSR addresses within
each group are contiguous, with no holes, and the two address ranges do
not intersect, but by design the two groups lay out their registers
differently:

[Group A : All counter MSRs are tightly bound to all event select MSRs ]

  MSR_K7_EVNTSEL0			0xc0010000
  MSR_K7_EVNTSELi			0xc0010000 + i
  ...
  MSR_K7_EVNTSEL3			0xc0010003
  MSR_K7_PERFCTR0			0xc0010004
  MSR_K7_PERFCTRi			0xc0010004 + i
  ...
  MSR_K7_PERFCTR3			0xc0010007

[Group B : The counter MSRs are interleaved with the event select MSRs ]

  MSR_F15H_PERF_CTL0		0xc0010200
  MSR_F15H_PERF_CTR0		(0xc0010200 + 1)
  ...
  MSR_F15H_PERF_CTLi		(0xc0010200 + 2 * i)
  MSR_F15H_PERF_CTRi		(0xc0010200 + 2 * i + 1)
  ...
  MSR_F15H_PERF_CTL5		(0xc0010200 + 2 * 5)
  MSR_F15H_PERF_CTR5		(0xc0010200 + 2 * 5 + 1)

Rewrite get_gp_pmc_amd() in this way: first determine which group of
registers is being accessed, then check whether the MSR matches the
requested type, apply the group's address-to-index scaling, and finally
pass the resulting pmc_idx into amd_pmc_idx_to_pmc().
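
Walking two examples through the new mapping (addresses from the tables
above):

	/* Group B: MSR_F15H_PERF_CTL2 == 0xc0010204 */
	idx = (0xc0010204 - 0xc0010200) / 2;	/* == 2 */
	/* bit 0 is clear -> an event select (CTL), i.e. PMU_TYPE_EVNTSEL */

	/* Group A: MSR_K7_PERFCTR3 == 0xc0010007 */
	idx = 0xc0010007 - 0xc0010004;		/* == 3, PMU_TYPE_COUNTER */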

Signed-off-by: Like Xu <likexu@tencent.com>
---
 arch/x86/kvm/svm/pmu.c | 88 ++++++++++--------------------------------
 1 file changed, 20 insertions(+), 68 deletions(-)

diff --git a/arch/x86/kvm/svm/pmu.c b/arch/x86/kvm/svm/pmu.c
index c736757c29d2..2ec420b85d6a 100644
--- a/arch/x86/kvm/svm/pmu.c
+++ b/arch/x86/kvm/svm/pmu.c
@@ -23,90 +23,52 @@ enum pmu_type {
 	PMU_TYPE_EVNTSEL,
 };
 
-enum index {
-	INDEX_ZERO = 0,
-	INDEX_ONE,
-	INDEX_TWO,
-	INDEX_THREE,
-	INDEX_FOUR,
-	INDEX_FIVE,
-	INDEX_ERROR,
-};
-
-static enum index msr_to_index(u32 msr)
+static struct kvm_pmc *amd_pmc_idx_to_pmc(struct kvm_pmu *pmu, int pmc_idx)
 {
-	switch (msr) {
-	case MSR_F15H_PERF_CTL0:
-	case MSR_F15H_PERF_CTR0:
-	case MSR_K7_EVNTSEL0:
-	case MSR_K7_PERFCTR0:
-		return INDEX_ZERO;
-	case MSR_F15H_PERF_CTL1:
-	case MSR_F15H_PERF_CTR1:
-	case MSR_K7_EVNTSEL1:
-	case MSR_K7_PERFCTR1:
-		return INDEX_ONE;
-	case MSR_F15H_PERF_CTL2:
-	case MSR_F15H_PERF_CTR2:
-	case MSR_K7_EVNTSEL2:
-	case MSR_K7_PERFCTR2:
-		return INDEX_TWO;
-	case MSR_F15H_PERF_CTL3:
-	case MSR_F15H_PERF_CTR3:
-	case MSR_K7_EVNTSEL3:
-	case MSR_K7_PERFCTR3:
-		return INDEX_THREE;
-	case MSR_F15H_PERF_CTL4:
-	case MSR_F15H_PERF_CTR4:
-		return INDEX_FOUR;
-	case MSR_F15H_PERF_CTL5:
-	case MSR_F15H_PERF_CTR5:
-		return INDEX_FIVE;
-	default:
-		return INDEX_ERROR;
-	}
+	unsigned int num_counters = pmu->nr_arch_gp_counters;
+
+	if (pmc_idx >= num_counters)
+		return NULL;
+
+	return &pmu->gp_counters[array_index_nospec(pmc_idx, num_counters)];
 }
 
 static inline struct kvm_pmc *get_gp_pmc_amd(struct kvm_pmu *pmu, u32 msr,
 					     enum pmu_type type)
 {
 	struct kvm_vcpu *vcpu = pmu_to_vcpu(pmu);
+	unsigned int idx;
 
 	if (!vcpu->kvm->arch.enable_pmu)
 		return NULL;
 
 	switch (msr) {
-	case MSR_F15H_PERF_CTL0:
-	case MSR_F15H_PERF_CTL1:
-	case MSR_F15H_PERF_CTL2:
-	case MSR_F15H_PERF_CTL3:
-	case MSR_F15H_PERF_CTL4:
-	case MSR_F15H_PERF_CTL5:
+	case MSR_F15H_PERF_CTL0 ... MSR_F15H_PERF_CTR5:
 		if (!guest_cpuid_has(vcpu, X86_FEATURE_PERFCTR_CORE))
 			return NULL;
-		fallthrough;
+		/*
+		 * Each PMU counter has a pair of CTL and CTR MSRs. CTLn
+		 * MSRs (accessed via EVNTSEL) are even, CTRn MSRs are odd.
+		 */
+		idx = (unsigned int)((msr - MSR_F15H_PERF_CTL0) / 2);
+		if (!(msr & 0x1) != (type == PMU_TYPE_EVNTSEL))
+			return NULL;
+		break;
 	case MSR_K7_EVNTSEL0 ... MSR_K7_EVNTSEL3:
 		if (type != PMU_TYPE_EVNTSEL)
 			return NULL;
+		idx = msr - MSR_K7_EVNTSEL0;
 		break;
-	case MSR_F15H_PERF_CTR0:
-	case MSR_F15H_PERF_CTR1:
-	case MSR_F15H_PERF_CTR2:
-	case MSR_F15H_PERF_CTR3:
-	case MSR_F15H_PERF_CTR4:
-	case MSR_F15H_PERF_CTR5:
-		if (!guest_cpuid_has(vcpu, X86_FEATURE_PERFCTR_CORE))
-			return NULL;
-		fallthrough;
 	case MSR_K7_PERFCTR0 ... MSR_K7_PERFCTR3:
 		if (type != PMU_TYPE_COUNTER)
 			return NULL;
+		idx = msr - MSR_K7_PERFCTR0;
 		break;
 	default:
 		return NULL;
 	}
 
-	return &pmu->gp_counters[msr_to_index(msr)];
+	return amd_pmc_idx_to_pmc(pmu, idx);
 }
 
 static bool amd_hw_event_available(struct kvm_pmc *pmc)
@@ -122,16 +84,6 @@ static bool amd_pmc_is_enabled(struct kvm_pmc *pmc)
 	return true;
 }
 
-static struct kvm_pmc *amd_pmc_idx_to_pmc(struct kvm_pmu *pmu, int pmc_idx)
-{
-	unsigned int num_counters = pmu->nr_arch_gp_counters;
-
-	if (pmc_idx >= num_counters)
-		return NULL;
-
-	return &pmu->gp_counters[array_index_nospec(pmc_idx, num_counters)];
-}
-
 static bool amd_is_valid_rdpmc_ecx(struct kvm_vcpu *vcpu, unsigned int idx)
 {
 	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
-- 
2.37.3



* Re: [PATCH v3 0/7] x86/pmu: Corner cases fixes and optimization
From: Like Xu @ 2022-09-07  9:52 UTC
  To: Paolo Bonzini, Sean Christopherson; +Cc: kvm, linux-kernel

A review reminder for the shepherds. Thanks!

On 31/8/2022 4:53 pm, Like Xu wrote:
> Well-designed tests can help us find more bugs, especially when the
> test steps differ from the Linux kernel's behaviour in the timing of
> accesses to virtualized hardware resources.
> 
> Please feel free to run the tests, add more, or share comments.
> 
> Previous:
> https://lore.kernel.org/kvm/20220823093221.38075-1-likexu@tencent.com/
> 
> V2 RESEND -> V3 Changelog:
> - Post perf change as a separate patch to the perf folks; (Sean)
> - Rewrite the deferred logic using imperative mood; (Sean)
> - Drop some useless comments; (Sean)
> - Rename __reprogram_counter() to kvm_pmu_request_counter_reprogam(); (Sean)
> - Replace a play-by-play of the code changes with a high-level description; (Sean)
> - Rename pmc->stale_counter to pmc->prev_counter; (Sean)
> - Drop an unnecessary check about pmc->prev_counter; (Sean)
> - Simplify the code about "CTLn is even, CTRn is odd"; (Sean)
> - Refine commit message to avoid pronouns; (Sean)
> 
> Like Xu (7):
>    KVM: x86/pmu: Avoid setting BIT_ULL(-1) to pmu->host_cross_mapped_mask
>    KVM: x86/pmu: Don't generate PEBS records for emulated instructions
>    KVM: x86/pmu: Avoid using PEBS perf_events for normal counters
>    KVM: x86/pmu: Defer reprogram_counter() to kvm_pmu_handle_event()
>    KVM: x86/pmu: Defer counter emulated overflow via pmc->prev_counter
>    KVM: x86/svm/pmu: Direct access pmu->gp_counter[] to implement
>      amd_*_to_pmc()
>    KVM: x86/svm/pmu: Rewrite get_gp_pmc_amd() for more counters
>      scalability
> 
>   arch/x86/include/asm/kvm_host.h |   6 +-
>   arch/x86/kvm/pmu.c              |  44 +++++++-----
>   arch/x86/kvm/pmu.h              |   6 +-
>   arch/x86/kvm/svm/pmu.c          | 121 ++++++--------------------------
>   arch/x86/kvm/vmx/pmu_intel.c    |  36 +++++-----
>   5 files changed, 75 insertions(+), 138 deletions(-)
> 


* Re: [PATCH v3 0/7] x86/pmu: Corner cases fixes and optimization
From: Like Xu @ 2022-09-19  8:58 UTC
  To: Paolo Bonzini, Sean Christopherson; +Cc: kvm, linux-kernel, Jim Mattson

+ Jim for more comments if any.

On 7/9/2022 5:52 pm, Like Xu wrote:
> A review reminder for shepherds. Thanks!
> 
> On 31/8/2022 4:53 pm, Like Xu wrote:
>> Well-designed tests can help us find more bugs, especially when the
>> test steps differ from the Linux kernel's behaviour in the timing of
>> accesses to virtualized hardware resources.
>>
>> Please feel free to run the tests, add more, or share comments.
>>
>> Previous:
>> https://lore.kernel.org/kvm/20220823093221.38075-1-likexu@tencent.com/
>>
>> V2 RESEND -> V3 Changelog:
>> - Post perf change as a separate patch to the perf folks; (Sean)
>> - Rewrite the deferred logic using imperative mood; (Sean)
>> - Drop some useless comments; (Sean)
>> - Rename __reprogram_counter() to kvm_pmu_request_counter_reprogam(); (Sean)
>> - Replace a play-by-play of the code changes with a high-level description; (Sean)
>> - Rename pmc->stale_counter to pmc->prev_counter; (Sean)
>> - Drop an unnecessary check about pmc->prev_counter; (Sean)
>> - Simplify the code about "CTLn is even, CTRn is odd"; (Sean)
>> - Refine commit message to avoid pronouns; (Sean)
>>
>> Like Xu (7):
>>    KVM: x86/pmu: Avoid setting BIT_ULL(-1) to pmu->host_cross_mapped_mask
>>    KVM: x86/pmu: Don't generate PEBS records for emulated instructions
>>    KVM: x86/pmu: Avoid using PEBS perf_events for normal counters
>>    KVM: x86/pmu: Defer reprogram_counter() to kvm_pmu_handle_event()
>>    KVM: x86/pmu: Defer counter emulated overflow via pmc->prev_counter
>>    KVM: x86/svm/pmu: Direct access pmu->gp_counter[] to implement
>>      amd_*_to_pmc()
>>    KVM: x86/svm/pmu: Rewrite get_gp_pmc_amd() for more counters
>>      scalability
>>
>>   arch/x86/include/asm/kvm_host.h |   6 +-
>>   arch/x86/kvm/pmu.c              |  44 +++++++-----
>>   arch/x86/kvm/pmu.h              |   6 +-
>>   arch/x86/kvm/svm/pmu.c          | 121 ++++++--------------------------
>>   arch/x86/kvm/vmx/pmu_intel.c    |  36 +++++-----
>>   5 files changed, 75 insertions(+), 138 deletions(-)
>>


* Re: [PATCH v3 0/7] x86/pmu: Corner cases fixes and optimization
From: Sean Christopherson @ 2022-09-22 22:29 UTC
  To: Like Xu; +Cc: Paolo Bonzini, kvm, linux-kernel

On Wed, Aug 31, 2022, Like Xu wrote:
> Like Xu (7):
>   KVM: x86/pmu: Avoid setting BIT_ULL(-1) to pmu->host_cross_mapped_mask
>   KVM: x86/pmu: Don't generate PEBS records for emulated instructions
>   KVM: x86/pmu: Avoid using PEBS perf_events for normal counters
>   KVM: x86/pmu: Defer reprogram_counter() to kvm_pmu_handle_event()
>   KVM: x86/pmu: Defer counter emulated overflow via pmc->prev_counter
>   KVM: x86/svm/pmu: Direct access pmu->gp_counter[] to implement
>     amd_*_to_pmc()
>   KVM: x86/svm/pmu: Rewrite get_gp_pmc_amd() for more counters
>     scalability

With a few tweaks (will respond to individual patches), pushed everything except
the "defer" patches to branch `for_paolo/6.1` at:

    https://github.com/sean-jc/linux.git

Unless you hear otherwise, it will make its way to kvm/queue "soon".

Regarding the "defer" patches, your patches are ok (with one or two tweaks), but
there are existing bugs that I believe will interact poorly with using reprogram_pmi
more aggressively.  Nothing major, but I'd prefer to get everything squared away
before merging, and definitely want your input on my proposed fixes.  I'll post
the patches shortly.


* Re: [PATCH v3 1/7] KVM: x86/pmu: Avoid setting BIT_ULL(-1) to pmu->host_cross_mapped_mask
From: Sean Christopherson @ 2022-09-22 22:30 UTC
  To: Like Xu; +Cc: Paolo Bonzini, kvm, linux-kernel

On Wed, Aug 31, 2022, Like Xu wrote:
> +		hw_idx = pmc->perf_event->hw.idx;
> +		/* make it a little less dependent on perf's exact behavior */

Reworded this to explain how a negative value can be encountered.

> +		if (hw_idx != pmc->idx && hw_idx > -1)
> +			pmu->host_cross_mapped_mask |= BIT_ULL(hw_idx);
>  	}
>  }


* Re: [PATCH v3 3/7] KVM: x86/pmu: Avoid using PEBS perf_events for normal counters
From: Sean Christopherson @ 2022-09-22 22:35 UTC
  To: Like Xu; +Cc: Paolo Bonzini, kvm, linux-kernel

On Wed, Aug 31, 2022, Like Xu wrote:
> To address this issue, the reuse check has been revamped and KVM will

State what the patch does as a command, don't describe the effects in the past
tense.

> go back to do reprogram_counter() when any bit of guest PEBS_ENABLE
> msr has changed, which is similar to what global_ctrl_changed() does.

This managed to confuse the heck out of me.  I misread reprogram_counter() as
reprogram_counters() because that's what I see in the diff, and then got all
turned around because this patch also stealthily renames global_ctrl_changed() to
reprogram_counters().

This is a very good example of how splitting a refactor from a bug fix can help
tremendously, e.g. with the refactor moved to a separate patch, this fix becomes:

diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index 390d697efde1..d9b9a0f0db17 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -237,8 +237,8 @@ static bool pmc_resume_counter(struct kvm_pmc *pmc)
                              get_sample_period(pmc, pmc->counter)))
                return false;
 
-       if (!test_bit(pmc->idx, (unsigned long *)&pmc_to_pmu(pmc)->pebs_enable) &&
-           pmc->perf_event->attr.precise_ip)
+       if (test_bit(pmc->idx, (unsigned long *)&pmc_to_pmu(pmc)->pebs_enable) !=
+           (!!pmc->perf_event->attr.precise_ip))
                return false;
 
        /* reuse perf_event to serve as pmc_reprogram_counter() does*/
diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index 5592b1259e1b..25b70a85bef5 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -431,7 +431,9 @@ static int intel_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
                if (pmu->pebs_enable == data)
                        return 0;
                if (!(data & pmu->pebs_enable_mask)) {
+                       diff = pmu->pebs_enable ^ data;
                        pmu->pebs_enable = data;
+                       reprogram_counters(pmu, diff);
                        return 0;
                }
                break;

which is much, much easier to describe and to follow.

TL;DR: I split this into two patches.

> Fixes: 79f3e3b58386 ("KVM: x86/pmu: Reprogram PEBS event to emulate guest PEBS counter")
> Signed-off-by: Like Xu <likexu@tencent.com>
> ---

...

> -/* function is called when global control register has been updated. */
> -static void global_ctrl_changed(struct kvm_pmu *pmu, u64 data)
> +static void reprogram_counters(struct kvm_pmu *pmu, u64 diff)


* Re: [PATCH v3 0/7] x86/pmu: Corner cases fixes and optimization
From: Sean Christopherson @ 2022-09-22 23:11 UTC
  To: Like Xu; +Cc: Paolo Bonzini, kvm, linux-kernel

On Thu, Sep 22, 2022, Sean Christopherson wrote:
> Regarding the "defer" patches, your patches are ok (with one or two tweaks), but
> there are existing bugs that I believe will interact poorly with using reprogram_pmi
> more aggressively.  Nothing major, but I'd prefer to get everything squared away
> before merging, and definitely want your input on my proposed fixes.  I'll post
> the patches shortly.

I take that back, there are issues with deferred overflow handling.  I'll follow-up
in the patch.


* Re: [PATCH v3 4/7] KVM: x86/pmu: Defer reprogram_counter() to kvm_pmu_handle_event()
From: Sean Christopherson @ 2022-09-22 23:18 UTC
  To: Like Xu; +Cc: Paolo Bonzini, kvm, linux-kernel

On Wed, Aug 31, 2022, Like Xu wrote:
> @@ -542,7 +541,9 @@ static inline bool eventsel_match_perf_hw_id(struct kvm_pmc *pmc,
>  static inline bool cpl_is_matched(struct kvm_pmc *pmc)
>  {
>  	bool select_os, select_user;
> -	u64 config = pmc->current_config;
> +	u64 config = pmc_is_gp(pmc) ? pmc->eventsel :
> +		(u64)fixed_ctrl_field(pmc_to_pmu(pmc)->fixed_ctr_ctrl,
> +				      pmc->idx - INTEL_PMC_IDX_FIXED);

Don't use a ternary here, the same conditional exists immediately below, i.e.
this can simply be:

	u64 config;

	if (pmc_is_gp(pmc)) {
		config = pmc->eventsel;
		select_os = config & ARCH_PERFMON_EVENTSEL_OS;
		select_user = config & ARCH_PERFMON_EVENTSEL_USR;
	} else {
		config = fixed_ctrl_field(pmc_to_pmu(pmc)->fixed_ctr_ctrl,
					  pmc->idx - INTEL_PMC_IDX_FIXED);
		select_os = config & 0x1;
		select_user = config & 0x2;
	}

Note, there's no need to cast to a u64; fixed_ctr_ctrl is a u64, and even if it
gets temporarily truncated to a u8, the relevant path only checks bits 0 and 1.

>  
>  	if (pmc_is_gp(pmc)) {
>  		select_os = config & ARCH_PERFMON_EVENTSEL_OS;
> diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
> index 5cc5721f260b..847e7112a5d3 100644
> --- a/arch/x86/kvm/pmu.h
> +++ b/arch/x86/kvm/pmu.h
> @@ -183,7 +183,11 @@ static inline void kvm_init_pmu_capability(void)
>  					     KVM_PMC_MAX_FIXED);
>  }
>  
> -void reprogram_counter(struct kvm_pmc *pmc);
> +static inline void kvm_pmu_request_counter_reprogam(struct kvm_pmc *pmc)
> +{
> +	__set_bit(pmc->idx, pmc_to_pmu(pmc)->reprogram_pmi);

Given that everything else that accesses reprogram_pmi uses atomic accesses, please
make this atomic as well, i.e. use set_bit().  Even if reprogram_pmi is currently
guaranteed to be accessed only from the CPU that has loaded the vCPU (no idea if
that's true), I am planning on fixing the PMU filter (KVM doesn't force reprogramming
on filter changes) by setting all bits in reprogram_pmi from a separate CPU, at
which point this needs to be atomic.
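
I.e. a sketch of the requested change:

	static inline void kvm_pmu_request_counter_reprogam(struct kvm_pmc *pmc)
	{
		set_bit(pmc->idx, pmc_to_pmu(pmc)->reprogram_pmi);
		kvm_make_request(KVM_REQ_PMU, pmc->vcpu);
	}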


* Re: [PATCH v3 5/7] KVM: x86/pmu: Defer counter emulated overflow via pmc->prev_counter
From: Sean Christopherson @ 2022-09-22 23:59 UTC
  To: Like Xu; +Cc: Paolo Bonzini, kvm, linux-kernel, Wanpeng Li

On Wed, Aug 31, 2022, Like Xu wrote:
> diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
> index 7f391750ebd3..3c42df3a55ff 100644
> --- a/arch/x86/kvm/pmu.c
> +++ b/arch/x86/kvm/pmu.c
> @@ -349,6 +349,10 @@ void kvm_pmu_handle_event(struct kvm_vcpu *vcpu)
>  		}
>  
>  		reprogram_counter(pmc);
> +
> +		if (pmc->counter < pmc->prev_counter)
> +			__kvm_perf_overflow(pmc, false);

I would prefer to stick this in reprogram_counter(), after pausing the counter
and checking that the event is enabled, but before the actual programming/resume.

I don't think false positives are actually possible, especially without my fixes
for clearing reprogram_pmi bits (incoming), but I don't like the twisty logic
that's required to suss out that prev_counter can be non-zero if and only if the
PMC is enabled.

The bigger issue is that calling __kvm_perf_overflow() here can get false negatives.
If reprogramming fails due to contention, the reprogram_pmi bit will be left set
and so this check in __kvm_perf_overflow() will suppress the PMI.

	if (test_and_set_bit(pmc->idx, pmu->reprogram_pmi))
		return;

And the related issue is that because __kvm_perf_overflow() sets the bit and
makes another KVM_REQ_PMU, overflow will cause KVM to reprogram the counter a
second time.  That's especially inefficient since KVM will get quite far into the
VM-Enter flow before detecting the new event.

So, I think this? (goes on top of patches I'm about to post)

static void reprogram_counter(struct kvm_pmc *pmc)
{
	struct kvm_pmu *pmu = pmc_to_pmu(pmc);
	u64 eventsel = pmc->eventsel;
	u64 new_config = eventsel;
	u8 fixed_ctr_ctrl;

	pmc_pause_counter(pmc);

	if (!pmc_speculative_in_use(pmc) || !pmc_is_enabled(pmc))
		goto reprogram_complete;

	if (!check_pmu_event_filter(pmc))
		goto reprogram_complete;

	if (pmc->counter < pmc->prev_counter)
		__kvm_perf_overflow(pmc, false);

	if (eventsel & ARCH_PERFMON_EVENTSEL_PIN_CONTROL)
		printk_once("kvm pmu: pin control bit is ignored\n");

	if (pmc_is_fixed(pmc)) {
		fixed_ctr_ctrl = fixed_ctrl_field(pmu->fixed_ctr_ctrl,
						  pmc->idx - INTEL_PMC_IDX_FIXED);
		if (fixed_ctr_ctrl & 0x1)
			eventsel |= ARCH_PERFMON_EVENTSEL_OS;
		if (fixed_ctr_ctrl & 0x2)
			eventsel |= ARCH_PERFMON_EVENTSEL_USR;
		if (fixed_ctr_ctrl & 0x8)
			eventsel |= ARCH_PERFMON_EVENTSEL_INT;
		new_config = (u64)fixed_ctr_ctrl;
	}

	if (pmc->current_config == new_config && pmc_resume_counter(pmc))
		goto reprogram_complete;

	pmc_release_perf_event(pmc);

	pmc->current_config = new_config;

	/*
	 * If reprogramming fails, e.g. due to contention, leave the counter's
	 * reprogram bit set, i.e. opportunistically try again on the next PMU
	 * refresh.  Don't make a new request as doing so can stall the guest
	 * if reprogramming repeatedly fails.
	 */
	if (pmc_reprogram_counter(pmc, PERF_TYPE_RAW,
				  (eventsel & pmu->raw_event_mask),
				  !(eventsel & ARCH_PERFMON_EVENTSEL_USR),
				  !(eventsel & ARCH_PERFMON_EVENTSEL_OS),
				  eventsel & ARCH_PERFMON_EVENTSEL_INT))
		return;

reprogram_complete:
	clear_bit(pmc->idx, (unsigned long *)&pmc_to_pmu(pmc)->reprogram_pmi);
	pmc->prev_counter = 0;
}


static inline void __kvm_perf_overflow(struct kvm_pmc *pmc, bool in_pmi)
{
	struct kvm_pmu *pmu = pmc_to_pmu(pmc);
	bool skip_pmi = false;

	if (pmc->perf_event && pmc->perf_event->attr.precise_ip) {
		if (!in_pmi) {
			/*
			 * TODO: KVM is currently _choosing_ to not generate records
			 * for emulated instructions, avoiding BUFFER_OVF PMI when
			 * there are no records. Strictly speaking, it should be done
			 * as well in the right context to improve sampling accuracy.
			 */
			skip_pmi = true;
		} else {
			/* Indicate PEBS overflow PMI to guest. */
			skip_pmi = __test_and_set_bit(GLOBAL_STATUS_BUFFER_OVF_BIT,
						      (unsigned long *)&pmu->global_status);
		}
	} else {
		__set_bit(pmc->idx, (unsigned long *)&pmu->global_status);
	}

	if (!pmc->intr || skip_pmi)
		return;

	/*
	 * Inject PMI. If vcpu was in a guest mode during NMI PMI
	 * can be ejected on a guest mode re-entry. Otherwise we can't
	 * be sure that vcpu wasn't executing hlt instruction at the
	 * time of vmexit and is not going to re-enter guest mode until
	 * woken up. So we should wake it, but this is impossible from
	 * NMI context. Do it from irq work instead.
	 */
	if (in_pmi && !kvm_handling_nmi_from_guest(pmc->vcpu))
		irq_work_queue(&pmc_to_pmu(pmc)->irq_work);
	else
		kvm_make_request(KVM_REQ_PMI, pmc->vcpu);
}

static void kvm_perf_overflow(struct perf_event *perf_event,
			      struct perf_sample_data *data,
			      struct pt_regs *regs)
{
	struct kvm_pmc *pmc = perf_event->overflow_handler_context;

	/*
	 * Ignore overflow events for counters that are scheduled to be
	 * reprogrammed, e.g. if a PMI for the previous event races with KVM's
	 * handling of a related guest WRMSR.
	 */
	if (test_and_set_bit(pmc->idx, pmc_to_pmu(pmc)->reprogram_pmi))
		return;

	__kvm_perf_overflow(pmc, true);

	kvm_make_request(KVM_REQ_PMU, pmc->vcpu);
}



* Re: [PATCH v3 5/7] KVM: x86/pmu: Defer counter emulated overflow via pmc->prev_counter
From: Like Xu @ 2022-10-14  8:08 UTC
  To: Sean Christopherson; +Cc: Paolo Bonzini, kvm, linux-kernel

On 23/9/2022 7:59 am, Sean Christopherson wrote:
> The bigger issue is that calling __kvm_perf_overflow() here can get false negatives.
> If reprogramming fails due to contention, the reprogram_pmi bit will be left set
> and so this check in __kvm_perf_overflow() will suppress the PMI.

I understand your motivation, but before we rework the handling of a
"left set" reprogram_pmi bit, we may need reproducible test cases that
cover the "reprogramming fails" path.

