kvm.vger.kernel.org archive mirror
* [PATCH 0/2] kvm: x86: Fix PMU virtualization for some basic events
@ 2021-11-12 23:52 Jim Mattson
  2021-11-12 23:52 ` [PATCH 1/2] KVM: x86: Update vPMCs when retiring instructions Jim Mattson
                   ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Jim Mattson @ 2021-11-12 23:52 UTC (permalink / raw)
  To: kvm, pbonzini; +Cc: Jim Mattson

Google Cloud has a customer that needs accurate virtualization of two
architected PMU events on Intel hardware: "instructions retired" and
"branch instructions retired." The existing PMU virtualization code
fails to account for instructions that are emulated by kvm.

Accurately virtualizing all PMU events for all microarchitectures is a
herculean task, but there are only 8 architected events, so maybe we
can at least try to get those right.

Eric Hankland wrote this code originally, but his plate is full, so
I've volunteered to shepherd the changes through upstream acceptance.

Jim Mattson (2):
  KVM: x86: Update vPMCs when retiring instructions
  KVM: x86: Update vPMCs when retiring branch instructions

 arch/x86/kvm/emulate.c     | 57 +++++++++++++++++++++-----------------
 arch/x86/kvm/kvm_emulate.h |  1 +
 arch/x86/kvm/pmu.c         | 31 +++++++++++++++++++++
 arch/x86/kvm/pmu.h         |  1 +
 arch/x86/kvm/vmx/nested.c  |  6 +++-
 arch/x86/kvm/x86.c         |  5 ++++
 6 files changed, 75 insertions(+), 26 deletions(-)

-- 
2.34.0.rc1.387.gb447b232ab-goog


^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH 1/2] KVM: x86: Update vPMCs when retiring instructions
  2021-11-12 23:52 [PATCH 0/2] kvm: x86: Fix PMU virtualization for some basic events Jim Mattson
@ 2021-11-12 23:52 ` Jim Mattson
  2021-11-16 10:30   ` Paolo Bonzini
  2021-11-16 12:43   ` Like Xu
  2021-11-12 23:52 ` [PATCH 2/2] KVM: x86: Update vPMCs when retiring branch instructions Jim Mattson
  2021-11-15  3:43 ` [PATCH 0/2] kvm: x86: Fix PMU virtualization for some basic events Like Xu
  2 siblings, 2 replies; 14+ messages in thread
From: Jim Mattson @ 2021-11-12 23:52 UTC (permalink / raw)
  To: kvm, pbonzini; +Cc: Jim Mattson, Eric Hankland

When KVM retires a guest instruction through emulation, increment any
vPMCs that are configured to monitor "instructions retired," and
update the sample period of those counters so that they will overflow
at the right time.

Signed-off-by: Eric Hankland <ehankland@google.com>
[jmattson:
  - Split the code to increment "branch instructions retired" into a
    separate commit.
  - Added 'static' to kvm_pmu_incr_counter() definition.
  - Modified kvm_pmu_incr_counter() to check pmc->perf_event->state ==
    PERF_EVENT_STATE_ACTIVE.
]
Signed-off-by: Jim Mattson <jmattson@google.com>
Fixes: f5132b01386b ("KVM: Expose a version 2 architectural PMU to a guests")
---
 arch/x86/kvm/pmu.c | 31 +++++++++++++++++++++++++++++++
 arch/x86/kvm/pmu.h |  1 +
 arch/x86/kvm/x86.c |  3 +++
 3 files changed, 35 insertions(+)

diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index 09873f6488f7..153c488032a5 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -490,6 +490,37 @@ void kvm_pmu_destroy(struct kvm_vcpu *vcpu)
 	kvm_pmu_reset(vcpu);
 }
 
+static void kvm_pmu_incr_counter(struct kvm_pmc *pmc, u64 evt)
+{
+	u64 counter_value, sample_period;
+
+	if (pmc->perf_event &&
+	    pmc->perf_event->attr.type == PERF_TYPE_HARDWARE &&
+	    pmc->perf_event->state == PERF_EVENT_STATE_ACTIVE &&
+	    pmc->perf_event->attr.config == evt) {
+		pmc->counter++;
+		counter_value = pmc_read_counter(pmc);
+		sample_period = get_sample_period(pmc, counter_value);
+		if (!counter_value)
+			perf_event_overflow(pmc->perf_event, NULL, NULL);
+		if (local64_read(&pmc->perf_event->hw.period_left) >
+		    sample_period)
+			perf_event_period(pmc->perf_event, sample_period);
+	}
+}
+
+void kvm_pmu_record_event(struct kvm_vcpu *vcpu, u64 evt)
+{
+	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+	int i;
+
+	for (i = 0; i < pmu->nr_arch_gp_counters; i++)
+		kvm_pmu_incr_counter(&pmu->gp_counters[i], evt);
+	for (i = 0; i < pmu->nr_arch_fixed_counters; i++)
+		kvm_pmu_incr_counter(&pmu->fixed_counters[i], evt);
+}
+EXPORT_SYMBOL_GPL(kvm_pmu_record_event);
+
 int kvm_vm_ioctl_set_pmu_event_filter(struct kvm *kvm, void __user *argp)
 {
 	struct kvm_pmu_event_filter tmp, *filter;
diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
index 59d6b76203d5..d1dd2294f8fb 100644
--- a/arch/x86/kvm/pmu.h
+++ b/arch/x86/kvm/pmu.h
@@ -159,6 +159,7 @@ void kvm_pmu_init(struct kvm_vcpu *vcpu);
 void kvm_pmu_cleanup(struct kvm_vcpu *vcpu);
 void kvm_pmu_destroy(struct kvm_vcpu *vcpu);
 int kvm_vm_ioctl_set_pmu_event_filter(struct kvm *kvm, void __user *argp);
+void kvm_pmu_record_event(struct kvm_vcpu *vcpu, u64 evt);
 
 bool is_vmware_backdoor_pmc(u32 pmc_idx);
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index d7def720227d..bd49e2a204d5 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -7854,6 +7854,8 @@ int kvm_skip_emulated_instruction(struct kvm_vcpu *vcpu)
 	if (unlikely(!r))
 		return 0;
 
+	kvm_pmu_record_event(vcpu, PERF_COUNT_HW_INSTRUCTIONS);
+
 	/*
 	 * rflags is the old, "raw" value of the flags.  The new value has
 	 * not been saved yet.
@@ -8101,6 +8103,7 @@ int x86_emulate_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
 		vcpu->arch.emulate_regs_need_sync_to_vcpu = false;
 		if (!ctxt->have_exception ||
 		    exception_type(ctxt->exception.vector) == EXCPT_TRAP) {
+			kvm_pmu_record_event(vcpu, PERF_COUNT_HW_INSTRUCTIONS);
 			kvm_rip_write(vcpu, ctxt->eip);
 			if (r && (ctxt->tf || (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP)))
 				r = kvm_vcpu_do_singlestep(vcpu);
-- 
2.34.0.rc1.387.gb447b232ab-goog


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH 2/2] KVM: x86: Update vPMCs when retiring branch instructions
  2021-11-12 23:52 [PATCH 0/2] kvm: x86: Fix PMU virtualization for some basic events Jim Mattson
  2021-11-12 23:52 ` [PATCH 1/2] KVM: x86: Update vPMCs when retiring instructions Jim Mattson
@ 2021-11-12 23:52 ` Jim Mattson
  2021-11-15  3:43 ` [PATCH 0/2] kvm: x86: Fix PMU virtualization for some basic events Like Xu
  2 siblings, 0 replies; 14+ messages in thread
From: Jim Mattson @ 2021-11-12 23:52 UTC (permalink / raw)
  To: kvm, pbonzini; +Cc: Jim Mattson, Eric Hankland

When KVM retires a guest branch instruction through emulation,
increment any vPMCs that are configured to monitor "branch
instructions retired," and update the sample period of those counters
so that they will overflow at the right time.

Signed-off-by: Eric Hankland <ehankland@google.com>
[jmattson:
  - Split the code to increment "branch instructions retired" into a
    separate commit.
  - Moved/consolidated the calls to kvm_pmu_record_event() in the
    emulation of VMLAUNCH/VMRESUME to accommodate the evolution of
    that code.
]
Signed-off-by: Jim Mattson <jmattson@google.com>
Fixes: f5132b01386b ("KVM: Expose a version 2 architectural PMU to a guests")
---
 arch/x86/kvm/emulate.c     | 55 +++++++++++++++++++++-----------------
 arch/x86/kvm/kvm_emulate.h |  1 +
 arch/x86/kvm/vmx/nested.c  |  7 +++--
 arch/x86/kvm/x86.c         |  2 ++
 4 files changed, 39 insertions(+), 26 deletions(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 28b1a4e57827..166a145fc1e6 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -175,6 +175,7 @@
 #define No16	    ((u64)1 << 53)  /* No 16 bit operand */
 #define IncSP       ((u64)1 << 54)  /* SP is incremented before ModRM calc */
 #define TwoMemOp    ((u64)1 << 55)  /* Instruction has two memory operand */
+#define IsBranch    ((u64)1 << 56)  /* Instruction is considered a branch. */
 
 #define DstXacc     (DstAccLo | SrcAccHi | SrcWrite)
 
@@ -191,8 +192,9 @@
 #define FASTOP_SIZE 8
 
 struct opcode {
-	u64 flags : 56;
-	u64 intercept : 8;
+	u64 flags;
+	u8 intercept;
+	u8 pad[7];
 	union {
 		int (*execute)(struct x86_emulate_ctxt *ctxt);
 		const struct opcode *group;
@@ -4364,10 +4366,10 @@ static const struct opcode group4[] = {
 static const struct opcode group5[] = {
 	F(DstMem | SrcNone | Lock,		em_inc),
 	F(DstMem | SrcNone | Lock,		em_dec),
-	I(SrcMem | NearBranch,			em_call_near_abs),
-	I(SrcMemFAddr | ImplicitOps,		em_call_far),
-	I(SrcMem | NearBranch,			em_jmp_abs),
-	I(SrcMemFAddr | ImplicitOps,		em_jmp_far),
+	I(SrcMem | NearBranch | IsBranch,       em_call_near_abs),
+	I(SrcMemFAddr | ImplicitOps | IsBranch, em_call_far),
+	I(SrcMem | NearBranch | IsBranch,       em_jmp_abs),
+	I(SrcMemFAddr | ImplicitOps | IsBranch, em_jmp_far),
 	I(SrcMem | Stack | TwoMemOp,		em_push), D(Undefined),
 };
 
@@ -4577,7 +4579,7 @@ static const struct opcode opcode_table[256] = {
 	I2bvIP(DstDI | SrcDX | Mov | String | Unaligned, em_in, ins, check_perm_in), /* insb, insw/insd */
 	I2bvIP(SrcSI | DstDX | String, em_out, outs, check_perm_out), /* outsb, outsw/outsd */
 	/* 0x70 - 0x7F */
-	X16(D(SrcImmByte | NearBranch)),
+	X16(D(SrcImmByte | NearBranch | IsBranch)),
 	/* 0x80 - 0x87 */
 	G(ByteOp | DstMem | SrcImm, group1),
 	G(DstMem | SrcImm, group1),
@@ -4596,7 +4598,7 @@ static const struct opcode opcode_table[256] = {
 	DI(SrcAcc | DstReg, pause), X7(D(SrcAcc | DstReg)),
 	/* 0x98 - 0x9F */
 	D(DstAcc | SrcNone), I(ImplicitOps | SrcAcc, em_cwd),
-	I(SrcImmFAddr | No64, em_call_far), N,
+	I(SrcImmFAddr | No64 | IsBranch, em_call_far), N,
 	II(ImplicitOps | Stack, em_pushf, pushf),
 	II(ImplicitOps | Stack, em_popf, popf),
 	I(ImplicitOps, em_sahf), I(ImplicitOps, em_lahf),
@@ -4616,17 +4618,19 @@ static const struct opcode opcode_table[256] = {
 	X8(I(DstReg | SrcImm64 | Mov, em_mov)),
 	/* 0xC0 - 0xC7 */
 	G(ByteOp | Src2ImmByte, group2), G(Src2ImmByte, group2),
-	I(ImplicitOps | NearBranch | SrcImmU16, em_ret_near_imm),
-	I(ImplicitOps | NearBranch, em_ret),
+	I(ImplicitOps | NearBranch | SrcImmU16 | IsBranch, em_ret_near_imm),
+	I(ImplicitOps | NearBranch | IsBranch, em_ret),
 	I(DstReg | SrcMemFAddr | ModRM | No64 | Src2ES, em_lseg),
 	I(DstReg | SrcMemFAddr | ModRM | No64 | Src2DS, em_lseg),
 	G(ByteOp, group11), G(0, group11),
 	/* 0xC8 - 0xCF */
-	I(Stack | SrcImmU16 | Src2ImmByte, em_enter), I(Stack, em_leave),
-	I(ImplicitOps | SrcImmU16, em_ret_far_imm),
-	I(ImplicitOps, em_ret_far),
-	D(ImplicitOps), DI(SrcImmByte, intn),
-	D(ImplicitOps | No64), II(ImplicitOps, em_iret, iret),
+	I(Stack | SrcImmU16 | Src2ImmByte | IsBranch, em_enter),
+	I(Stack | IsBranch, em_leave),
+	I(ImplicitOps | SrcImmU16 | IsBranch, em_ret_far_imm),
+	I(ImplicitOps | IsBranch, em_ret_far),
+	D(ImplicitOps | IsBranch), DI(SrcImmByte | IsBranch, intn),
+	D(ImplicitOps | No64 | IsBranch),
+	II(ImplicitOps | IsBranch, em_iret, iret),
 	/* 0xD0 - 0xD7 */
 	G(Src2One | ByteOp, group2), G(Src2One, group2),
 	G(Src2CL | ByteOp, group2), G(Src2CL, group2),
@@ -4637,14 +4641,15 @@ static const struct opcode opcode_table[256] = {
 	/* 0xD8 - 0xDF */
 	N, E(0, &escape_d9), N, E(0, &escape_db), N, E(0, &escape_dd), N, N,
 	/* 0xE0 - 0xE7 */
-	X3(I(SrcImmByte | NearBranch, em_loop)),
-	I(SrcImmByte | NearBranch, em_jcxz),
+	X3(I(SrcImmByte | NearBranch | IsBranch, em_loop)),
+	I(SrcImmByte | NearBranch | IsBranch, em_jcxz),
 	I2bvIP(SrcImmUByte | DstAcc, em_in,  in,  check_perm_in),
 	I2bvIP(SrcAcc | DstImmUByte, em_out, out, check_perm_out),
 	/* 0xE8 - 0xEF */
-	I(SrcImm | NearBranch, em_call), D(SrcImm | ImplicitOps | NearBranch),
-	I(SrcImmFAddr | No64, em_jmp_far),
-	D(SrcImmByte | ImplicitOps | NearBranch),
+	I(SrcImm | NearBranch | IsBranch, em_call),
+	D(SrcImm | ImplicitOps | NearBranch | IsBranch),
+	I(SrcImmFAddr | No64 | IsBranch, em_jmp_far),
+	D(SrcImmByte | ImplicitOps | NearBranch | IsBranch),
 	I2bvIP(SrcDX | DstAcc, em_in,  in,  check_perm_in),
 	I2bvIP(SrcAcc | DstDX, em_out, out, check_perm_out),
 	/* 0xF0 - 0xF7 */
@@ -4660,7 +4665,7 @@ static const struct opcode opcode_table[256] = {
 static const struct opcode twobyte_table[256] = {
 	/* 0x00 - 0x0F */
 	G(0, group6), GD(0, &group7), N, N,
-	N, I(ImplicitOps | EmulateOnUD, em_syscall),
+	N, I(ImplicitOps | EmulateOnUD | IsBranch, em_syscall),
 	II(ImplicitOps | Priv, em_clts, clts), N,
 	DI(ImplicitOps | Priv, invd), DI(ImplicitOps | Priv, wbinvd), N, N,
 	N, D(ImplicitOps | ModRM | SrcMem | NoAccess), N, N,
@@ -4691,8 +4696,8 @@ static const struct opcode twobyte_table[256] = {
 	IIP(ImplicitOps, em_rdtsc, rdtsc, check_rdtsc),
 	II(ImplicitOps | Priv, em_rdmsr, rdmsr),
 	IIP(ImplicitOps, em_rdpmc, rdpmc, check_rdpmc),
-	I(ImplicitOps | EmulateOnUD, em_sysenter),
-	I(ImplicitOps | Priv | EmulateOnUD, em_sysexit),
+	I(ImplicitOps | EmulateOnUD | IsBranch, em_sysenter),
+	I(ImplicitOps | Priv | EmulateOnUD | IsBranch, em_sysexit),
 	N, N,
 	N, N, N, N, N, N, N, N,
 	/* 0x40 - 0x4F */
@@ -4710,7 +4715,7 @@ static const struct opcode twobyte_table[256] = {
 	N, N, N, N,
 	N, N, N, GP(SrcReg | DstMem | ModRM | Mov, &pfx_0f_6f_0f_7f),
 	/* 0x80 - 0x8F */
-	X16(D(SrcImm | NearBranch)),
+	X16(D(SrcImm | NearBranch | IsBranch)),
 	/* 0x90 - 0x9F */
 	X16(D(ByteOp | DstMem | SrcNone | ModRM| Mov)),
 	/* 0xA0 - 0xA7 */
@@ -5224,6 +5229,8 @@ int x86_decode_insn(struct x86_emulate_ctxt *ctxt, void *insn, int insn_len, int
 		ctxt->d |= opcode.flags;
 	}
 
+	ctxt->is_branch = opcode.flags & IsBranch;
+
 	/* Unrecognised? */
 	if (ctxt->d == 0)
 		return EMULATION_FAILED;
diff --git a/arch/x86/kvm/kvm_emulate.h b/arch/x86/kvm/kvm_emulate.h
index 68b420289d7e..39eded2426ff 100644
--- a/arch/x86/kvm/kvm_emulate.h
+++ b/arch/x86/kvm/kvm_emulate.h
@@ -369,6 +369,7 @@ struct x86_emulate_ctxt {
 	struct fetch_cache fetch;
 	struct read_cache io_read;
 	struct read_cache mem_read;
+	bool is_branch;
 };
 
 /* Repeat String Operation Prefix */
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index b4ee5e9f9e20..fb15249dc385 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -3523,10 +3523,13 @@ static int nested_vmx_run(struct kvm_vcpu *vcpu, bool launch)
 	if (evmptrld_status == EVMPTRLD_ERROR) {
 		kvm_queue_exception(vcpu, UD_VECTOR);
 		return 1;
-	} else if (CC(evmptrld_status == EVMPTRLD_VMFAIL)) {
-		return nested_vmx_failInvalid(vcpu);
 	}
 
+	kvm_pmu_record_event(vcpu, PERF_COUNT_HW_BRANCH_INSTRUCTIONS);
+
+	if (CC(evmptrld_status == EVMPTRLD_VMFAIL))
+		return nested_vmx_failInvalid(vcpu);
+
 	if (CC(!evmptr_is_valid(vmx->nested.hv_evmcs_vmptr) &&
 	       vmx->nested.current_vmptr == INVALID_GPA))
 		return nested_vmx_failInvalid(vcpu);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index bd49e2a204d5..ae86a5cbccd7 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8104,6 +8104,8 @@ int x86_emulate_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
 		if (!ctxt->have_exception ||
 		    exception_type(ctxt->exception.vector) == EXCPT_TRAP) {
 			kvm_pmu_record_event(vcpu, PERF_COUNT_HW_INSTRUCTIONS);
+			if (ctxt->is_branch)
+				kvm_pmu_record_event(vcpu, PERF_COUNT_HW_BRANCH_INSTRUCTIONS);
 			kvm_rip_write(vcpu, ctxt->eip);
 			if (r && (ctxt->tf || (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP)))
 				r = kvm_vcpu_do_singlestep(vcpu);
-- 
2.34.0.rc1.387.gb447b232ab-goog


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH 0/2] kvm: x86: Fix PMU virtualization for some basic events
  2021-11-12 23:52 [PATCH 0/2] kvm: x86: Fix PMU virtualization for some basic events Jim Mattson
  2021-11-12 23:52 ` [PATCH 1/2] KVM: x86: Update vPMCs when retiring instructions Jim Mattson
  2021-11-12 23:52 ` [PATCH 2/2] KVM: x86: Update vPMCs when retiring branch instructions Jim Mattson
@ 2021-11-15  3:43 ` Like Xu
  2021-11-15 17:51   ` Jim Mattson
  2 siblings, 1 reply; 14+ messages in thread
From: Like Xu @ 2021-11-15  3:43 UTC (permalink / raw)
  To: Jim Mattson; +Cc: kvm, Inc. (kernel-recipes.org)

On 13/11/2021 7:52 am, Jim Mattson wrote:
> Google Cloud has a customer that needs accurate virtualization of two
> architected PMU events on Intel hardware: "instructions retired" and
> "branch instructions retired." The existing PMU virtualization code
> fails to account for instructions that are emulated by kvm.

Does this customer need to set force_emulation_prefix=Y ?

Is this "accurate statistics" capability fatal to the use case ?

> 
> Accurately virtualizing all PMU events for all microarchitectures is a
> herculean task, but there are only 8 architected events, so maybe we
> can at least try to get those right.

I assume you mean the architectural events "Instruction Retired"
and "Branch Instruction Retired" defined by the Intel CPUID
since it looks like we don't have a similar concept on AMD.

This patch set opens Pandora's Box, especially when we have
the real accurate Guest PEBS facility, and things get even
more complicated for just some PMU corner use cases.

> 
> Eric Hankland wrote this code originally, but his plate is full, so
> I've volunteered to shepherd the changes through upstream acceptance.

Does Eric have more code to implement
accurate virtualization on the following events ?

"UnHalted Core Cycles"
"UnHalted Reference Cycles"
"LLC Reference"
"LLC Misses"
"Branch Misses Retired"
"Topdown Slots" (unimplemented)

Obviously, it's difficult, even absurd, to emulate these.

> 
> Jim Mattson (2):
>    KVM: x86: Update vPMCs when retiring instructions
>    KVM: x86: Update vPMCs when retiring branch instructions
> 
>   arch/x86/kvm/emulate.c     | 57 +++++++++++++++++++++-----------------
>   arch/x86/kvm/kvm_emulate.h |  1 +
>   arch/x86/kvm/pmu.c         | 31 +++++++++++++++++++++
>   arch/x86/kvm/pmu.h         |  1 +
>   arch/x86/kvm/vmx/nested.c  |  6 +++-
>   arch/x86/kvm/x86.c         |  5 ++++
>   6 files changed, 75 insertions(+), 26 deletions(-)
> 

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 0/2] kvm: x86: Fix PMU virtualization for some basic events
  2021-11-15  3:43 ` [PATCH 0/2] kvm: x86: Fix PMU virtualization for some basic events Like Xu
@ 2021-11-15 17:51   ` Jim Mattson
  2021-11-16  3:22     ` Like Xu
  0 siblings, 1 reply; 14+ messages in thread
From: Jim Mattson @ 2021-11-15 17:51 UTC (permalink / raw)
  To: Like Xu; +Cc: kvm, Inc. (kernel-recipes.org)

On Sun, Nov 14, 2021 at 7:43 PM Like Xu <like.xu.linux@gmail.com> wrote:
>
> On 13/11/2021 7:52 am, Jim Mattson wrote:
> > Google Cloud has a customer that needs accurate virtualization of two
> > architected PMU events on Intel hardware: "instructions retired" and
> > "branch instructions retired." The existing PMU virtualization code
> > fails to account for instructions that are emulated by kvm.
>
> Does this customer need to set force_emulation_prefix=Y ?

No. That module parameter does make it easier to write the test, though.

It's possible that the L0 hypervisor will never emulate a branch
instruction for this use case. However, since the code being
instrumented is potential malware, one can't make the usual
assumptions about "well-behaved" code. For example, it is quite
possible that the code in question deliberately runs with the TLBs and
in-memory page tables out of sync. Therefore, it's hard to prove that
the "branch instructions retired" patch isn't needed.

> Is this "accurate statistics" capability fatal to the use case ?

Yes, that is my understanding.

> >
> > Accurately virtualizing all PMU events for all microarchitectures is a
> > herculean task, but there are only 8 architected events, so maybe we
> > can at least try to get those right.
>
> I assume you mean the architectural events "Instruction Retired"
> and "Branch Instruction Retired" defined by the Intel CPUID
> since it looks like we don't have a similar concept on AMD.

Yes.

> This patch set opens Pandora's Box, especially when we have
> the real accurate Guest PEBS facility, and things get even
> more complicated for just some PMU corner use cases.

KVM's PMU virtualization is rife with bugs, but this patch set doesn't
make that worse. It actually makes things better by fixing two of
those bugs.

> >
> > Eric Hankland wrote this code originally, but his plate is full, so
> > I've volunteered to shepherd the changes through upstream acceptance.
>
> Does Eric have more code to implement
> accurate virtualization on the following events ?

No. We only offer PMU virtualization to one customer, and that
customer is only interested in the two events addressed by this patch
set.

> "UnHalted Core Cycles"
> "UnHalted Reference Cycles"
> "LLC Reference"
> "LLC Misses"
> "Branch Misses Retired"
> "Topdown Slots" (unimplemented)
>
> Obviously, it's difficult, even absurd, to emulate these.

Sorry; I should not have mentioned the eight architected events. It's
not entirely clear what some of these events mean in a virtual
environment. Let's just stick to the two events covered by this patch
set.

> > Jim Mattson (2):
> >    KVM: x86: Update vPMCs when retiring instructions
> >    KVM: x86: Update vPMCs when retiring branch instructions
> >
> >   arch/x86/kvm/emulate.c     | 57 +++++++++++++++++++++-----------------
> >   arch/x86/kvm/kvm_emulate.h |  1 +
> >   arch/x86/kvm/pmu.c         | 31 +++++++++++++++++++++
> >   arch/x86/kvm/pmu.h         |  1 +
> >   arch/x86/kvm/vmx/nested.c  |  6 +++-
> >   arch/x86/kvm/x86.c         |  5 ++++
> >   6 files changed, 75 insertions(+), 26 deletions(-)
> >

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 0/2] kvm: x86: Fix PMU virtualization for some basic events
  2021-11-15 17:51   ` Jim Mattson
@ 2021-11-16  3:22     ` Like Xu
  0 siblings, 0 replies; 14+ messages in thread
From: Like Xu @ 2021-11-16  3:22 UTC (permalink / raw)
  To: Jim Mattson; +Cc: kvm, Inc. (kernel-recipes.org)

Hi Jim,

On 16/11/2021 1:51 am, Jim Mattson wrote:
> On Sun, Nov 14, 2021 at 7:43 PM Like Xu <like.xu.linux@gmail.com> wrote:
>>
>> On 13/11/2021 7:52 am, Jim Mattson wrote:
>>> Google Cloud has a customer that needs accurate virtualization of two
>>> architected PMU events on Intel hardware: "instructions retired" and
>>> "branch instructions retired." The existing PMU virtualization code
>>> fails to account for instructions that are emulated by kvm.
>>
>> Does this customer need to set force_emulation_prefix=Y ?
> 
> No. That module parameter does make it easier to write the test, though.
> 
> It's possible that the L0 hypervisor will never emulate a branch
> instruction for this use case. However, since the code being
> instrumented is potential malware, one can't make the usual
> assumptions about "well-behaved" code. For example, it is quite
> possible that the code in question deliberately runs with the TLBs and
> in-memory page tables out of sync. Therefore, it's hard to prove that
> the "branch instructions retired" patch isn't needed.

Thanks for your input.

> 
>> Is this "accurate statistics" capability fatal to the use case ?
> 
> Yes, that is my understanding.

Uh, looks like it's the right time to do this.

> 
>>>
>>> Accurately virtualizing all PMU events for all microarchitectures is a
>>> herculean task, but there are only 8 architected events, so maybe we
>>> can at least try to get those right.
>>
>> I assume you mean the architectural events "Instruction Retired"
>> and "Branch Instruction Retired" defined by the Intel CPUID
>> since it looks like we don't have a similar concept on AMD.
> 
> Yes.
> 
>> This patch set opens Pandora's Box, especially when we have
>> the real accurate Guest PEBS facility, and things get even
>> more complicated for just some PMU corner use cases.
> 
> KVM's PMU virtualization is rife with bugs, but this patch set doesn't
> make that worse. It actually makes things better by fixing two of
> those bugs.

Yes, I can't agree more.

> 
>>>
>>> Eric Hankland wrote this code originally, but his plate is full, so
>>> I've volunteered to shepherd the changes through upstream acceptance.
>>
>> Does Eric have more code to implement
>> accurate virtualization on the following events ?
> 
> No. We only offer PMU virtualization to one customer, and that
> customer is only interested in the two events addressed by this patch
> set.

Fine to me and I'll start looking at the code.

> 
>> "UnHalted Core Cycles"
>> "UnHalted Reference Cycles"
>> "LLC Reference"
>> "LLC Misses"
>> "Branch Misses Retired"
>> "Topdown Slots" (unimplemented)
>>
>> Obviously, it's difficult, even absurd, to emulate these.
> 
> Sorry; I should not have mentioned the eight architected events. It's
> not entirely clear what some of these events mean in a virtual
> environment. Let's just stick to the two events covered by this patch
> set.

Thanks for the clarification.

> 
>>> Jim Mattson (2):
>>>     KVM: x86: Update vPMCs when retiring instructions
>>>     KVM: x86: Update vPMCs when retiring branch instructions
>>>
>>>    arch/x86/kvm/emulate.c     | 57 +++++++++++++++++++++-----------------
>>>    arch/x86/kvm/kvm_emulate.h |  1 +
>>>    arch/x86/kvm/pmu.c         | 31 +++++++++++++++++++++
>>>    arch/x86/kvm/pmu.h         |  1 +
>>>    arch/x86/kvm/vmx/nested.c  |  6 +++-
>>>    arch/x86/kvm/x86.c         |  5 ++++
>>>    6 files changed, 75 insertions(+), 26 deletions(-)
>>>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/2] KVM: x86: Update vPMCs when retiring instructions
  2021-11-12 23:52 ` [PATCH 1/2] KVM: x86: Update vPMCs when retiring instructions Jim Mattson
@ 2021-11-16 10:30   ` Paolo Bonzini
  2021-11-16 12:43   ` Like Xu
  1 sibling, 0 replies; 14+ messages in thread
From: Paolo Bonzini @ 2021-11-16 10:30 UTC (permalink / raw)
  To: Jim Mattson, kvm; +Cc: Eric Hankland

On 11/13/21 00:52, Jim Mattson wrote:
> When KVM retires a guest instruction through emulation, increment any
> vPMCs that are configured to monitor "instructions retired," and
> update the sample period of those counters so that they will overflow
> at the right time.
> 
> Signed-off-by: Eric Hankland <ehankland@google.com>
> [jmattson:
>    - Split the code to increment "branch instructions retired" into a
>      separate commit.
>    - Added 'static' to kvm_pmu_incr_counter() definition.
>    - Modified kvm_pmu_incr_counter() to check pmc->perf_event->state ==
>      PERF_EVENT_STATE_ACTIVE.
> ]
> Signed-off-by: Jim Mattson <jmattson@google.com>
> Fixes: f5132b01386b ("KVM: Expose a version 2 architectural PMU to a guests")

Queued both, with the addition of an

+	if (!pmu->event_count)
+		return;

check in kvm_pmu_record_event.
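
For reference, with that check folded in, the function reads roughly as
follows (illustrative only; the form actually queued may differ in detail):

void kvm_pmu_record_event(struct kvm_vcpu *vcpu, u64 evt)
{
	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
	int i;

	/* Nothing to do if no perf_events have been created for this vCPU. */
	if (!pmu->event_count)
		return;

	for (i = 0; i < pmu->nr_arch_gp_counters; i++)
		kvm_pmu_incr_counter(&pmu->gp_counters[i], evt);
	for (i = 0; i < pmu->nr_arch_fixed_counters; i++)
		kvm_pmu_incr_counter(&pmu->fixed_counters[i], evt);
}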

Paolo

> ---
>   arch/x86/kvm/pmu.c | 31 +++++++++++++++++++++++++++++++
>   arch/x86/kvm/pmu.h |  1 +
>   arch/x86/kvm/x86.c |  3 +++
>   3 files changed, 35 insertions(+)
> 
> diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
> index 09873f6488f7..153c488032a5 100644
> --- a/arch/x86/kvm/pmu.c
> +++ b/arch/x86/kvm/pmu.c
> @@ -490,6 +490,37 @@ void kvm_pmu_destroy(struct kvm_vcpu *vcpu)
>   	kvm_pmu_reset(vcpu);
>   }
>   
> +static void kvm_pmu_incr_counter(struct kvm_pmc *pmc, u64 evt)
> +{
> +	u64 counter_value, sample_period;
> +
> +	if (pmc->perf_event &&
> +	    pmc->perf_event->attr.type == PERF_TYPE_HARDWARE &&
> +	    pmc->perf_event->state == PERF_EVENT_STATE_ACTIVE &&
> +	    pmc->perf_event->attr.config == evt) {
> +		pmc->counter++;
> +		counter_value = pmc_read_counter(pmc);
> +		sample_period = get_sample_period(pmc, counter_value);
> +		if (!counter_value)
> +			perf_event_overflow(pmc->perf_event, NULL, NULL);
> +		if (local64_read(&pmc->perf_event->hw.period_left) >
> +		    sample_period)
> +			perf_event_period(pmc->perf_event, sample_period);
> +	}
> +}
> +
> +void kvm_pmu_record_event(struct kvm_vcpu *vcpu, u64 evt)
> +{
> +	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
> +	int i;
> +
> +	for (i = 0; i < pmu->nr_arch_gp_counters; i++)
> +		kvm_pmu_incr_counter(&pmu->gp_counters[i], evt);
> +	for (i = 0; i < pmu->nr_arch_fixed_counters; i++)
> +		kvm_pmu_incr_counter(&pmu->fixed_counters[i], evt);
> +}
> +EXPORT_SYMBOL_GPL(kvm_pmu_record_event);
> +
>   int kvm_vm_ioctl_set_pmu_event_filter(struct kvm *kvm, void __user *argp)
>   {
>   	struct kvm_pmu_event_filter tmp, *filter;
> diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
> index 59d6b76203d5..d1dd2294f8fb 100644
> --- a/arch/x86/kvm/pmu.h
> +++ b/arch/x86/kvm/pmu.h
> @@ -159,6 +159,7 @@ void kvm_pmu_init(struct kvm_vcpu *vcpu);
>   void kvm_pmu_cleanup(struct kvm_vcpu *vcpu);
>   void kvm_pmu_destroy(struct kvm_vcpu *vcpu);
>   int kvm_vm_ioctl_set_pmu_event_filter(struct kvm *kvm, void __user *argp);
> +void kvm_pmu_record_event(struct kvm_vcpu *vcpu, u64 evt);
>   
>   bool is_vmware_backdoor_pmc(u32 pmc_idx);
>   
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index d7def720227d..bd49e2a204d5 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -7854,6 +7854,8 @@ int kvm_skip_emulated_instruction(struct kvm_vcpu *vcpu)
>   	if (unlikely(!r))
>   		return 0;
>   
> +	kvm_pmu_record_event(vcpu, PERF_COUNT_HW_INSTRUCTIONS);
> +
>   	/*
>   	 * rflags is the old, "raw" value of the flags.  The new value has
>   	 * not been saved yet.
> @@ -8101,6 +8103,7 @@ int x86_emulate_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
>   		vcpu->arch.emulate_regs_need_sync_to_vcpu = false;
>   		if (!ctxt->have_exception ||
>   		    exception_type(ctxt->exception.vector) == EXCPT_TRAP) {
> +			kvm_pmu_record_event(vcpu, PERF_COUNT_HW_INSTRUCTIONS);
>   			kvm_rip_write(vcpu, ctxt->eip);
>   			if (r && (ctxt->tf || (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP)))
>   				r = kvm_vcpu_do_singlestep(vcpu);
> 


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/2] KVM: x86: Update vPMCs when retiring instructions
  2021-11-12 23:52 ` [PATCH 1/2] KVM: x86: Update vPMCs when retiring instructions Jim Mattson
  2021-11-16 10:30   ` Paolo Bonzini
@ 2021-11-16 12:43   ` Like Xu
  2021-11-16 22:15     ` Jim Mattson
  1 sibling, 1 reply; 14+ messages in thread
From: Like Xu @ 2021-11-16 12:43 UTC (permalink / raw)
  To: Jim Mattson
  Cc: Eric Hankland, kvm,
	Paolo Bonzini - Distinguished Engineer (kernel-recipes.org)

Hi Jim,

On 13/11/2021 7:52 am, Jim Mattson wrote:
> When KVM retires a guest instruction through emulation, increment any
> vPMCs that are configured to monitor "instructions retired," and
> update the sample period of those counters so that they will overflow
> at the right time.
> 
> Signed-off-by: Eric Hankland <ehankland@google.com>
> [jmattson:
>    - Split the code to increment "branch instructions retired" into a
>      separate commit.
>    - Added 'static' to kvm_pmu_incr_counter() definition.
>    - Modified kvm_pmu_incr_counter() to check pmc->perf_event->state ==
>      PERF_EVENT_STATE_ACTIVE.
> ]
> Signed-off-by: Jim Mattson <jmattson@google.com>
> Fixes: f5132b01386b ("KVM: Expose a version 2 architectural PMU to a guests")
> ---
>   arch/x86/kvm/pmu.c | 31 +++++++++++++++++++++++++++++++
>   arch/x86/kvm/pmu.h |  1 +
>   arch/x86/kvm/x86.c |  3 +++
>   3 files changed, 35 insertions(+)
> 
> diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
> index 09873f6488f7..153c488032a5 100644
> --- a/arch/x86/kvm/pmu.c
> +++ b/arch/x86/kvm/pmu.c
> @@ -490,6 +490,37 @@ void kvm_pmu_destroy(struct kvm_vcpu *vcpu)
>   	kvm_pmu_reset(vcpu);
>   }
>   
> +static void kvm_pmu_incr_counter(struct kvm_pmc *pmc, u64 evt)
> +{
> +	u64 counter_value, sample_period;
> +
> +	if (pmc->perf_event &&

We need to incr pmc->counter whether it has a perf_event or not.

> +	    pmc->perf_event->attr.type == PERF_TYPE_HARDWARE &&

We need to cover PERF_TYPE_RAW as well, for example,
it has the basic bits for "{ 0xc0, 0x00, PERF_COUNT_HW_INSTRUCTIONS },"
plus HSW_IN_TX or ARCH_PERFMON_EVENTSEL_EDGE stuff.

We just need to focus on checking the select and umask bits:

static inline bool eventsel_match_perf_hw_id(struct kvm_pmc *pmc,
	unsigned int perf_hw_id)
{
	u64 old_eventsel = pmc->eventsel;
	unsigned int config;

	pmc->eventsel &=
		(ARCH_PERFMON_EVENTSEL_EVENT | ARCH_PERFMON_EVENTSEL_UMASK);
	config = kvm_x86_ops.pmu_ops->find_perf_hw_id(pmc);
	pmc->eventsel = old_eventsel;
	return config == perf_hw_id;
}

> +	    pmc->perf_event->state == PERF_EVENT_STATE_ACTIVE &&

Again, we should not care the pmc->perf_event.

> +	    pmc->perf_event->attr.config == evt) {

So how about the emulated instructions for
ARCH_PERFMON_EVENTSEL_OS and ARCH_PERFMON_EVENTSEL_USR ?

> +		pmc->counter++;
> +		counter_value = pmc_read_counter(pmc);
> +		sample_period = get_sample_period(pmc, counter_value);
> +		if (!counter_value)
> +			perf_event_overflow(pmc->perf_event, NULL, NULL);

We need to call kvm_perf_overflow() or kvm_perf_overflow_intr().
And the patch set doesn't export the perf_event_overflow() SYMBOL.

> +		if (local64_read(&pmc->perf_event->hw.period_left) >
> +		    sample_period)
> +			perf_event_period(pmc->perf_event, sample_period);
> +	}
> +}

Not cc'ing PeterZ or the perf reviewers on this part of the code is not a good thing.

How about this:

static void kvm_pmu_incr_counter(struct kvm_pmc *pmc)
{
	struct kvm_pmu *pmu = pmc_to_pmu(pmc);

	pmc->counter++;
	reprogram_counter(pmu, pmc->idx);
	if (!pmc_read_counter(pmc))
		// https://lore.kernel.org/kvm/20211116122030.4698-1-likexu@tencent.com/T/#t
		kvm_pmu_counter_overflow(pmc, need_overflow_intr(pmc));
}

> +
> +void kvm_pmu_record_event(struct kvm_vcpu *vcpu, u64 evt)

s/kvm_pmu_record_event/kvm_pmu_trigger_event/

> +{
> +	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
> +	int i;
> +
> +	for (i = 0; i < pmu->nr_arch_gp_counters; i++)
> +		kvm_pmu_incr_counter(&pmu->gp_counters[i], evt);

Why do we need to accumulate a counter that is not enabled at all ?

> +	for (i = 0; i < pmu->nr_arch_fixed_counters; i++)
> +		kvm_pmu_incr_counter(&pmu->fixed_counters[i], evt);

How about this:

	for_each_set_bit(i, pmu->all_valid_pmc_idx, X86_PMC_IDX_MAX) {
		pmc = kvm_x86_ops.pmu_ops->pmc_idx_to_pmc(pmu, i);

		if (!pmc || !pmc_is_enabled(pmc) || !pmc_speculative_in_use(pmc))
			continue;

		// https://lore.kernel.org/kvm/20211116122030.4698-1-likexu@tencent.com/T/#t
		if (eventsel_match_perf_hw_id(pmc, perf_hw_id))
			kvm_pmu_incr_counter(pmc);
	}

> +}
> +EXPORT_SYMBOL_GPL(kvm_pmu_record_event);
> +
>   int kvm_vm_ioctl_set_pmu_event_filter(struct kvm *kvm, void __user *argp)
>   {
>   	struct kvm_pmu_event_filter tmp, *filter;
> diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
> index 59d6b76203d5..d1dd2294f8fb 100644
> --- a/arch/x86/kvm/pmu.h
> +++ b/arch/x86/kvm/pmu.h
> @@ -159,6 +159,7 @@ void kvm_pmu_init(struct kvm_vcpu *vcpu);
>   void kvm_pmu_cleanup(struct kvm_vcpu *vcpu);
>   void kvm_pmu_destroy(struct kvm_vcpu *vcpu);
>   int kvm_vm_ioctl_set_pmu_event_filter(struct kvm *kvm, void __user *argp);
> +void kvm_pmu_record_event(struct kvm_vcpu *vcpu, u64 evt);
>   
>   bool is_vmware_backdoor_pmc(u32 pmc_idx);
>   
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index d7def720227d..bd49e2a204d5 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -7854,6 +7854,8 @@ int kvm_skip_emulated_instruction(struct kvm_vcpu *vcpu)
>   	if (unlikely(!r))
>   		return 0;
>   
> +	kvm_pmu_record_event(vcpu, PERF_COUNT_HW_INSTRUCTIONS);
> +
>   	/*
>   	 * rflags is the old, "raw" value of the flags.  The new value has
>   	 * not been saved yet.
> @@ -8101,6 +8103,7 @@ int x86_emulate_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
>   		vcpu->arch.emulate_regs_need_sync_to_vcpu = false;
>   		if (!ctxt->have_exception ||
>   		    exception_type(ctxt->exception.vector) == EXCPT_TRAP) {
> +			kvm_pmu_record_event(vcpu, PERF_COUNT_HW_INSTRUCTIONS);
>   			kvm_rip_write(vcpu, ctxt->eip);
>   			if (r && (ctxt->tf || (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP)))
>   				r = kvm_vcpu_do_singlestep(vcpu);
> 

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/2] KVM: x86: Update vPMCs when retiring instructions
  2021-11-16 12:43   ` Like Xu
@ 2021-11-16 22:15     ` Jim Mattson
  2021-11-17  3:21       ` Like Xu
  0 siblings, 1 reply; 14+ messages in thread
From: Jim Mattson @ 2021-11-16 22:15 UTC (permalink / raw)
  To: Like Xu
  Cc: Eric Hankland, kvm,
	Paolo Bonzini - Distinguished Engineer (kernel-recipes.org),
	Ingo Molnar, Arnaldo Carvalho de Melo, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, linux-perf-users,
	LKML

On Tue, Nov 16, 2021 at 4:44 AM Like Xu <like.xu.linux@gmail.com> wrote:
>
> Hi Jim,
>
> On 13/11/2021 7:52 am, Jim Mattson wrote:
> > When KVM retires a guest instruction through emulation, increment any
> > vPMCs that are configured to monitor "instructions retired," and
> > update the sample period of those counters so that they will overflow
> > at the right time.
> >
> > Signed-off-by: Eric Hankland <ehankland@google.com>
> > [jmattson:
> >    - Split the code to increment "branch instructions retired" into a
> >      separate commit.
> >    - Added 'static' to kvm_pmu_incr_counter() definition.
> >    - Modified kvm_pmu_incr_counter() to check pmc->perf_event->state ==
> >      PERF_EVENT_STATE_ACTIVE.
> > ]
> > Signed-off-by: Jim Mattson <jmattson@google.com>
> > Fixes: f5132b01386b ("KVM: Expose a version 2 architectural PMU to a guests")
> > ---
> >   arch/x86/kvm/pmu.c | 31 +++++++++++++++++++++++++++++++
> >   arch/x86/kvm/pmu.h |  1 +
> >   arch/x86/kvm/x86.c |  3 +++
> >   3 files changed, 35 insertions(+)
> >
> > diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
> > index 09873f6488f7..153c488032a5 100644
> > --- a/arch/x86/kvm/pmu.c
> > +++ b/arch/x86/kvm/pmu.c
> > @@ -490,6 +490,37 @@ void kvm_pmu_destroy(struct kvm_vcpu *vcpu)
> >       kvm_pmu_reset(vcpu);
> >   }
> >
> > +static void kvm_pmu_incr_counter(struct kvm_pmc *pmc, u64 evt)
> > +{
> > +     u64 counter_value, sample_period;
> > +
> > +     if (pmc->perf_event &&
>
> We need to incr pmc->counter whether it has a perf_event or not.
>
> > +         pmc->perf_event->attr.type == PERF_TYPE_HARDWARE &&
>
> We need to cover PERF_TYPE_RAW as well, for example,
> it has the basic bits for "{ 0xc0, 0x00, PERF_COUNT_HW_INSTRUCTIONS },"
> plus HSW_IN_TX or ARCH_PERFMON_EVENTSEL_EDGE stuff.
>
> We just need to focus on checking the select and umask bits:

[What follows applies only to Intel CPUs. I haven't looked at AMD's
PMU implementation yet.]

Looking at the SDM, volume 3, Figure 18-1: Layout of IA32_PERFEVTSELx
MSRs, there seems to be a lot of complexity here, actually. In
addition to checking for the desired event select and unit mask, it
looks like we need to check the following:

1. The EN bit is set.
2. The CMASK field is 0 (for events that can only happen once per cycle).
3. The E bit is clear (maybe?).
4. The OS bit is set if the guest is running at CPL0.
5. The USR bit is set if the guest is running at CPL>0.
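
Very roughly, I'd expect the filter to end up looking something like this
(just a sketch: the helper name is made up, the macros are the
ARCH_PERFMON_EVENTSEL_* definitions from asm/perf_event.h, and the
CMASK/edge handling is only a conservative guess):

static bool eventsel_counts_emulated_event(struct kvm_vcpu *vcpu,
					   u64 eventsel, u8 select, u8 umask)
{
	if (!(eventsel & ARCH_PERFMON_EVENTSEL_ENABLE))
		return false;
	if ((eventsel & ARCH_PERFMON_EVENTSEL_EVENT) != select ||
	    ((eventsel & ARCH_PERFMON_EVENTSEL_UMASK) >> 8) != umask)
		return false;
	/* "Instructions retired" fires at most once per instruction, so a
	 * non-zero CMASK or edge detect changes the meaning; bail out. */
	if (eventsel & (ARCH_PERFMON_EVENTSEL_CMASK |
			ARCH_PERFMON_EVENTSEL_EDGE))
		return false;
	/* Honor the OS/USR bits against the guest's current CPL. */
	if (static_call(kvm_x86_get_cpl)(vcpu))
		return eventsel & ARCH_PERFMON_EVENTSEL_USR;
	return eventsel & ARCH_PERFMON_EVENTSEL_OS;
}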


> static inline bool eventsel_match_perf_hw_id(struct kvm_pmc *pmc,
>         unsigned int perf_hw_id)
> {
>         u64 old_eventsel = pmc->eventsel;
>         unsigned int config;
>
>         pmc->eventsel &=
>                 (ARCH_PERFMON_EVENTSEL_EVENT | ARCH_PERFMON_EVENTSEL_UMASK);
>         config = kvm_x86_ops.pmu_ops->find_perf_hw_id(pmc);
>         pmc->eventsel = old_eventsel;
>         return config == perf_hw_id;
> }
>
> > +         pmc->perf_event->state == PERF_EVENT_STATE_ACTIVE &&
>
> Again, we should not care the pmc->perf_event.

This test was intended as a proxy for checking that the counter is
enabled in the guest's IA32_PERF_GLOBAL_CTRL MSR.

> > +         pmc->perf_event->attr.config == evt) {
>
> So how about the emulated instructions for
> ARCH_PERFMON_EVENTSEL_OS and ARCH_PERFMON_EVENTSEL_USR ?

I assume you're referring to the OS and USR bits of the corresponding
IA32_PERFEVTSELx MSR. I agree that these bits have to be consulted,
along with guest privilege level, before deciding whether or not to
count the event.

> > +             pmc->counter++;
> > +             counter_value = pmc_read_counter(pmc);
> > +             sample_period = get_sample_period(pmc, counter_value);
> > +             if (!counter_value)
> > +                     perf_event_overflow(pmc->perf_event, NULL, NULL);
>
> We need to call kvm_perf_overflow() or kvm_perf_overflow_intr().
> And the patch set doesn't export the perf_event_overflow() SYMBOL.

Oops. I was compiling with kvm built into vmlinux, so I missed this.

> > +             if (local64_read(&pmc->perf_event->hw.period_left) >
> > +                 sample_period)
> > +                     perf_event_period(pmc->perf_event, sample_period);
> > +     }
> > +}
>
> Not cc'ing PeterZ or the perf reviewers on this part of the code is not a good thing.

Added.

> How about this:
>
> static void kvm_pmu_incr_counter(struct kvm_pmc *pmc)
> {
>         struct kvm_pmu *pmu = pmc_to_pmu(pmc);
>
>         pmc->counter++;
>         reprogram_counter(pmu, pmc->idx);
>         if (!pmc_read_counter(pmc))
>                 // https://lore.kernel.org/kvm/20211116122030.4698-1-likexu@tencent.com/T/#t
>                 kvm_pmu_counter_overflow(pmc, need_overflow_intr(pmc));
> }
>
> > +
> > +void kvm_pmu_record_event(struct kvm_vcpu *vcpu, u64 evt)
>
> s/kvm_pmu_record_event/kvm_pmu_trigger_event/
>
> > +{
> > +     struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
> > +     int i;
> > +
> > +     for (i = 0; i < pmu->nr_arch_gp_counters; i++)
> > +             kvm_pmu_incr_counter(&pmu->gp_counters[i], evt);
>
> Why do we need to accumulate a counter that is not enabled at all ?

In the original code, the condition checked in kvm_pmu_incr_counter()
was intended to filter out disabled counters.

> > +     for (i = 0; i < pmu->nr_arch_fixed_counters; i++)
> > +             kvm_pmu_incr_counter(&pmu->fixed_counters[i], evt);
>
> How about this:
>
>         for_each_set_bit(i, pmu->all_valid_pmc_idx, X86_PMC_IDX_MAX) {
>                 pmc = kvm_x86_ops.pmu_ops->pmc_idx_to_pmc(pmu, i);
>
>                 if (!pmc || !pmc_is_enabled(pmc) || !pmc_speculative_in_use(pmc))
>                         continue;
>
>                 // https://lore.kernel.org/kvm/20211116122030.4698-1-likexu@tencent.com/T/#t
>                 if (eventsel_match_perf_hw_id(pmc, perf_hw_id))
>                         kvm_pmu_incr_counter(pmc);
>         }
>

Let me expand the list of reviewers and come back with v2 after I
collect more input.

Thanks!


> > +}
> > +EXPORT_SYMBOL_GPL(kvm_pmu_record_event);
> > +
> >   int kvm_vm_ioctl_set_pmu_event_filter(struct kvm *kvm, void __user *argp)
> >   {
> >       struct kvm_pmu_event_filter tmp, *filter;
> > diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
> > index 59d6b76203d5..d1dd2294f8fb 100644
> > --- a/arch/x86/kvm/pmu.h
> > +++ b/arch/x86/kvm/pmu.h
> > @@ -159,6 +159,7 @@ void kvm_pmu_init(struct kvm_vcpu *vcpu);
> >   void kvm_pmu_cleanup(struct kvm_vcpu *vcpu);
> >   void kvm_pmu_destroy(struct kvm_vcpu *vcpu);
> >   int kvm_vm_ioctl_set_pmu_event_filter(struct kvm *kvm, void __user *argp);
> > +void kvm_pmu_record_event(struct kvm_vcpu *vcpu, u64 evt);
> >
> >   bool is_vmware_backdoor_pmc(u32 pmc_idx);
> >
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index d7def720227d..bd49e2a204d5 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -7854,6 +7854,8 @@ int kvm_skip_emulated_instruction(struct kvm_vcpu *vcpu)
> >       if (unlikely(!r))
> >               return 0;
> >
> > +     kvm_pmu_record_event(vcpu, PERF_COUNT_HW_INSTRUCTIONS);
> > +
> >       /*
> >        * rflags is the old, "raw" value of the flags.  The new value has
> >        * not been saved yet.
> > @@ -8101,6 +8103,7 @@ int x86_emulate_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> >               vcpu->arch.emulate_regs_need_sync_to_vcpu = false;
> >               if (!ctxt->have_exception ||
> >                   exception_type(ctxt->exception.vector) == EXCPT_TRAP) {
> > +                     kvm_pmu_record_event(vcpu, PERF_COUNT_HW_INSTRUCTIONS);
> >                       kvm_rip_write(vcpu, ctxt->eip);
> >                       if (r && (ctxt->tf || (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP)))
> >                               r = kvm_vcpu_do_singlestep(vcpu);
> >

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/2] KVM: x86: Update vPMCs when retiring instructions
  2021-11-16 22:15     ` Jim Mattson
@ 2021-11-17  3:21       ` Like Xu
  2021-11-17 20:01         ` Jim Mattson
  0 siblings, 1 reply; 14+ messages in thread
From: Like Xu @ 2021-11-17  3:21 UTC (permalink / raw)
  To: Jim Mattson, Paolo Bonzini - Distinguished Engineer (kernel-recipes.org)
  Cc: Eric Hankland, kvm, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	linux-perf-users, LKML, Peter Zijlstra (Intel OTC, Netherlander)

On 17/11/2021 6:15 am, Jim Mattson wrote:
> On Tue, Nov 16, 2021 at 4:44 AM Like Xu <like.xu.linux@gmail.com> wrote:
>>
>> Hi Jim,
>>
>> On 13/11/2021 7:52 am, Jim Mattson wrote:
>>> When KVM retires a guest instruction through emulation, increment any
>>> vPMCs that are configured to monitor "instructions retired," and
>>> update the sample period of those counters so that they will overflow
>>> at the right time.
>>>
>>> Signed-off-by: Eric Hankland <ehankland@google.com>
>>> [jmattson:
>>>     - Split the code to increment "branch instructions retired" into a
>>>       separate commit.
>>>     - Added 'static' to kvm_pmu_incr_counter() definition.
>>>     - Modified kvm_pmu_incr_counter() to check pmc->perf_event->state ==
>>>       PERF_EVENT_STATE_ACTIVE.
>>> ]
>>> Signed-off-by: Jim Mattson <jmattson@google.com>
>>> Fixes: f5132b01386b ("KVM: Expose a version 2 architectural PMU to a guests")
>>> ---
>>>    arch/x86/kvm/pmu.c | 31 +++++++++++++++++++++++++++++++
>>>    arch/x86/kvm/pmu.h |  1 +
>>>    arch/x86/kvm/x86.c |  3 +++
>>>    3 files changed, 35 insertions(+)
>>>
>>> diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
>>> index 09873f6488f7..153c488032a5 100644
>>> --- a/arch/x86/kvm/pmu.c
>>> +++ b/arch/x86/kvm/pmu.c
>>> @@ -490,6 +490,37 @@ void kvm_pmu_destroy(struct kvm_vcpu *vcpu)
>>>        kvm_pmu_reset(vcpu);
>>>    }
>>>
>>> +static void kvm_pmu_incr_counter(struct kvm_pmc *pmc, u64 evt)
>>> +{
>>> +     u64 counter_value, sample_period;
>>> +
>>> +     if (pmc->perf_event &&
>>
>> We need to incr pmc->counter whether it has a perf_event or not.
>>
>>> +         pmc->perf_event->attr.type == PERF_TYPE_HARDWARE &&
>>
>> We need to cover PERF_TYPE_RAW as well, for example,
>> it has the basic bits for "{ 0xc0, 0x00, PERF_COUNT_HW_INSTRUCTIONS },"
>> plus HSW_IN_TX or ARCH_PERFMON_EVENTSEL_EDGE stuff.
>>
>> We just need to focus on checking the select and umask bits:
> 
> [What follows applies only to Intel CPUs. I haven't looked at AMD's
> PMU implementation yet.]

x86 has the same bit definition and semantics on at least the select and umask bits.

> 
> Looking at the SDM, volume 3, Figure 18-1: Layout of IA32_PERFEVTSELx
> MSRs, there seems to be a lot of complexity here, actually. In

The devil is in the details.

> addition to checking for the desired event select and unit mask, it
> looks like we need to check the following:
> 
> 1. The EN bit is set.

We need to cover the EN bit of fixed counter 0 for HW_INSTRUCTIONS.

> 2. The CMASK field is 0 (for events that can only happen once per cycle).
> 3. The E bit is clear (maybe?).

The "Edge detect" bit is about hw detail and let's ignore it.

> 4. The OS bit is set if the guest is running at CPL0.
> 5. The USR bit is set if the guest is running at CPL>0.

CPL is a necessity.

> 
> 
>> static inline bool eventsel_match_perf_hw_id(struct kvm_pmc *pmc,
>>          unsigned int perf_hw_id)
>> {
>>          u64 old_eventsel = pmc->eventsel;
>>          unsigned int config;
>>
>>          pmc->eventsel &=
>>                  (ARCH_PERFMON_EVENTSEL_EVENT | ARCH_PERFMON_EVENTSEL_UMASK);
>>          config = kvm_x86_ops.pmu_ops->find_perf_hw_id(pmc);
>>          pmc->eventsel = old_eventsel;
>>          return config == perf_hw_id;
>> }

My proposal is to incr counter as long as the select and mask bits match the 
generic event.

What do you think?

>>
>>> +         pmc->perf_event->state == PERF_EVENT_STATE_ACTIVE &&
>>
>> Again, we should not care the pmc->perf_event.
> 
> This test was intended as a proxy for checking that the counter is
> enabled in the guest's IA32_PERF_GLOBAL_CTRL MSR.

The two are not equivalent.

A enabled counter means true from "pmc_is_enabled(pmc)  && 
pmc_speculative_in_use(pmc)".
A well-emulated counter means true from "perf_event->state == 
PERF_EVENT_STATE_ACTIVE".

A bad-emulated but enabled counter should be incremented for emulated instructions.
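
In code, the idea is roughly this (illustration only, reusing the helpers
I suggested earlier in this thread):

	if (pmc_is_enabled(pmc) && pmc_speculative_in_use(pmc) &&
	    eventsel_match_perf_hw_id(pmc, perf_hw_id))
		kvm_pmu_incr_counter(pmc); /* even if the perf_event is inactive */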

> 
>>> +         pmc->perf_event->attr.config == evt) {
>>
>> So how about the emulated instructions for
>> ARCH_PERFMON_EVENTSEL_OS and ARCH_PERFMON_EVENTSEL_USR ?
> 
> I assume you're referring to the OS and USR bits of the corresponding
> IA32_PERFEVTSELx MSR. I agree that these bits have to be consulted,
> along with guest privilege level, before deciding whether or not to
> count the event.

Thanks and we may need update the testcase as well.

> 
>>> +             pmc->counter++;
>>> +             counter_value = pmc_read_counter(pmc);
>>> +             sample_period = get_sample_period(pmc, counter_value);
>>> +             if (!counter_value)
>>> +                     perf_event_overflow(pmc->perf_event, NULL, NULL);
>>
>> We need to call kvm_perf_overflow() or kvm_perf_overflow_intr().
>> And the patch set doesn't export the perf_event_overflow() SYMBOL.
> 
> Oops. I was compiling with kvm built into vmlinux, so I missed this.

In fact, I don't think the perf code would accept such rude symbolic export
And I do propose to apply kvm_pmu_incr_counter() in a less invasive way.

> 
>>> +             if (local64_read(&pmc->perf_event->hw.period_left) >
>>> +                 sample_period)
>>> +                     perf_event_period(pmc->perf_event, sample_period);
>>> +     }
>>> +}
>>
>> Not cc'ing PeterZ or the perf reviewers on this part of the code is not a good thing.
> 
> Added.
> 
>> How about this:
>>
>> static void kvm_pmu_incr_counter(struct kvm_pmc *pmc)
>> {
>>          struct kvm_pmu *pmu = pmc_to_pmu(pmc);
>>
>>          pmc->counter++;
>>          reprogram_counter(pmu, pmc->idx);
>>          if (!pmc_read_counter(pmc))
>>                  // https://lore.kernel.org/kvm/20211116122030.4698-1-likexu@tencent.com/T/#t
>>                  kvm_pmu_counter_overflow(pmc, need_overflow_intr(pmc));
>> }
>>
>>> +
>>> +void kvm_pmu_record_event(struct kvm_vcpu *vcpu, u64 evt)
>>
>> s/kvm_pmu_record_event/kvm_pmu_trigger_event/
>>
>>> +{
>>> +     struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
>>> +     int i;
>>> +
>>> +     for (i = 0; i < pmu->nr_arch_gp_counters; i++)
>>> +             kvm_pmu_incr_counter(&pmu->gp_counters[i], evt);
>>
>> Why do we need to accumulate a counter that is not enabled at all ?
> 
> In the original code, the condition checked in kvm_pmu_incr_counter()
> was intended to filter out disabled counters.

The bar for code review hasn't been lowered, eh?

> 
>>> +     for (i = 0; i < pmu->nr_arch_fixed_counters; i++)
>>> +             kvm_pmu_incr_counter(&pmu->fixed_counters[i], evt);
>>
>> How about this:
>>
>>          for_each_set_bit(i, pmu->all_valid_pmc_idx, X86_PMC_IDX_MAX) {
>>                  pmc = kvm_x86_ops.pmu_ops->pmc_idx_to_pmc(pmu, i);
>>
>>                  if (!pmc || !pmc_is_enabled(pmc) || !pmc_speculative_in_use(pmc))
>>                          continue;
>>
>>                  // https://lore.kernel.org/kvm/20211116122030.4698-1-likexu@tencent.com/T/#t
>>                  if (eventsel_match_perf_hw_id(pmc, perf_hw_id))
>>                          kvm_pmu_incr_counter(pmc);
>>          }
>>
> 
> Let me expand the list of reviewers and come back with v2 after I
> collect more input.

I'm not sure Paolo will revert the "Queued both" decision,
but I'm not taking my eyes or hands off the vPMU code.

> 
> Thanks!
> 
> 
>>> +}
>>> +EXPORT_SYMBOL_GPL(kvm_pmu_record_event);
>>> +
>>>    int kvm_vm_ioctl_set_pmu_event_filter(struct kvm *kvm, void __user *argp)
>>>    {
>>>        struct kvm_pmu_event_filter tmp, *filter;
>>> diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
>>> index 59d6b76203d5..d1dd2294f8fb 100644
>>> --- a/arch/x86/kvm/pmu.h
>>> +++ b/arch/x86/kvm/pmu.h
>>> @@ -159,6 +159,7 @@ void kvm_pmu_init(struct kvm_vcpu *vcpu);
>>>    void kvm_pmu_cleanup(struct kvm_vcpu *vcpu);
>>>    void kvm_pmu_destroy(struct kvm_vcpu *vcpu);
>>>    int kvm_vm_ioctl_set_pmu_event_filter(struct kvm *kvm, void __user *argp);
>>> +void kvm_pmu_record_event(struct kvm_vcpu *vcpu, u64 evt);
>>>
>>>    bool is_vmware_backdoor_pmc(u32 pmc_idx);
>>>
>>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>>> index d7def720227d..bd49e2a204d5 100644
>>> --- a/arch/x86/kvm/x86.c
>>> +++ b/arch/x86/kvm/x86.c
>>> @@ -7854,6 +7854,8 @@ int kvm_skip_emulated_instruction(struct kvm_vcpu *vcpu)
>>>        if (unlikely(!r))
>>>                return 0;
>>>
>>> +     kvm_pmu_record_event(vcpu, PERF_COUNT_HW_INSTRUCTIONS);
>>> +
>>>        /*
>>>         * rflags is the old, "raw" value of the flags.  The new value has
>>>         * not been saved yet.
>>> @@ -8101,6 +8103,7 @@ int x86_emulate_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
>>>                vcpu->arch.emulate_regs_need_sync_to_vcpu = false;
>>>                if (!ctxt->have_exception ||
>>>                    exception_type(ctxt->exception.vector) == EXCPT_TRAP) {
>>> +                     kvm_pmu_record_event(vcpu, PERF_COUNT_HW_INSTRUCTIONS);
>>>                        kvm_rip_write(vcpu, ctxt->eip);
>>>                        if (r && (ctxt->tf || (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP)))
>>>                                r = kvm_vcpu_do_singlestep(vcpu);
>>>
> 

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/2] KVM: x86: Update vPMCs when retiring instructions
  2021-11-17  3:21       ` Like Xu
@ 2021-11-17 20:01         ` Jim Mattson
  2021-11-18  3:37           ` Jim Mattson
  2021-11-18  9:13           ` Like Xu
  0 siblings, 2 replies; 14+ messages in thread
From: Jim Mattson @ 2021-11-17 20:01 UTC (permalink / raw)
  To: Like Xu
  Cc: Paolo Bonzini - Distinguished Engineer (kernel-recipes.org),
	Eric Hankland, kvm, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	linux-perf-users, LKML, Peter Zijlstra (Intel OTC, Netherlander)

On Tue, Nov 16, 2021 at 7:22 PM Like Xu <like.xu.linux@gmail.com> wrote:
>
> On 17/11/2021 6:15 am, Jim Mattson wrote:
> > On Tue, Nov 16, 2021 at 4:44 AM Like Xu <like.xu.linux@gmail.com> wrote:
> >>
> >> Hi Jim,
> >>
> >> On 13/11/2021 7:52 am, Jim Mattson wrote:
> >>> When KVM retires a guest instruction through emulation, increment any
> >>> vPMCs that are configured to monitor "instructions retired," and
> >>> update the sample period of those counters so that they will overflow
> >>> at the right time.
> >>>
> >>> Signed-off-by: Eric Hankland <ehankland@google.com>
> >>> [jmattson:
> >>>     - Split the code to increment "branch instructions retired" into a
> >>>       separate commit.
> >>>     - Added 'static' to kvm_pmu_incr_counter() definition.
> >>>     - Modified kvm_pmu_incr_counter() to check pmc->perf_event->state ==
> >>>       PERF_EVENT_STATE_ACTIVE.
> >>> ]
> >>> Signed-off-by: Jim Mattson <jmattson@google.com>
> >>> Fixes: f5132b01386b ("KVM: Expose a version 2 architectural PMU to a guests")
> >>> ---
> >>>    arch/x86/kvm/pmu.c | 31 +++++++++++++++++++++++++++++++
> >>>    arch/x86/kvm/pmu.h |  1 +
> >>>    arch/x86/kvm/x86.c |  3 +++
> >>>    3 files changed, 35 insertions(+)
> >>>
> >>> diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
> >>> index 09873f6488f7..153c488032a5 100644
> >>> --- a/arch/x86/kvm/pmu.c
> >>> +++ b/arch/x86/kvm/pmu.c
> >>> @@ -490,6 +490,37 @@ void kvm_pmu_destroy(struct kvm_vcpu *vcpu)
> >>>        kvm_pmu_reset(vcpu);
> >>>    }
> >>>
> >>> +static void kvm_pmu_incr_counter(struct kvm_pmc *pmc, u64 evt)
> >>> +{
> >>> +     u64 counter_value, sample_period;
> >>> +
> >>> +     if (pmc->perf_event &&
> >>
> >> We need to incr pmc->counter whether it has a perf_event or not.
> >>
> >>> +         pmc->perf_event->attr.type == PERF_TYPE_HARDWARE &&
> >>
> >> We need to cover PERF_TYPE_RAW as well, for example,
> >> it has the basic bits for "{ 0xc0, 0x00, PERF_COUNT_HW_INSTRUCTIONS },"
> >> plus HSW_IN_TX or ARCH_PERFMON_EVENTSEL_EDGE stuff.
> >>
> >> We just need to focus on checking the select and umask bits:
> >
> > [What follows applies only to Intel CPUs. I haven't looked at AMD's
> > PMU implementation yet.]
>
> x86 has the same bit definition and semantics on at least the select and umask bits.

Yes, but AMD supports 12 bits of event selector. AMD also has the
HG_ONLY bits, which affect whether or not to count the event based on
context.
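
Just to illustrate (the helper below is made up; the AMD64_EVENTSEL_*
macros are the ones from asm/perf_event.h, and the exact policy is still
TBD):

static bool amd_eventsel_counts_guest_event(u64 eventsel, u16 event_select)
{
	/* AMD's 12-bit event select lives in bits 35:32 and 7:0. */
	u64 sel = ((u64)(event_select & 0xf00) << 24) | (event_select & 0xff);

	if ((eventsel & AMD64_EVENTSEL_EVENT) != sel)
		return false;
	/* A host-only counter should not count guest-emulated events. */
	if ((eventsel & AMD64_EVENTSEL_HOSTONLY) &&
	    !(eventsel & AMD64_EVENTSEL_GUESTONLY))
		return false;
	return true;
}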

> >
> > Looking at the SDM, volume 3, Figure 18-1: Layout of IA32_PERFEVTSELx
> > MSRs, there seems to be a lot of complexity here, actually. In
>
> The devil is in the details.
>
> > addition to checking for the desired event select and unit mask, it
> > looks like we need to check the following:
> >
> > 1. The EN bit is set.
>
> We need to cover the EN bit of fixed counter 0 for HW_INSTRUCTIONS.

I don't know what you mean by that.

> > 2. The CMASK field is 0 (for events that can only happen once per cycle).
> > 3. The E bit is clear (maybe?).
>
> The "Edge detect" bit is about hw detail and let's ignore it.

From my reading of the SDM, I don't think the edge detect bit can be
ignored, but I will do some empirical tests to convince myself when I
get back from vacation.

> > 4. The OS bit is set if the guest is running at CPL0.
> > 5. The USR bit is set if the guest is running at CPL>0.
>
> CPL is a necessity.
>

As is host/guest mode on AMD.
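
To make the checklist above concrete, a rough sketch of such a filter is
below. The helper name and the guest_cpl parameter are made up for
illustration, and the set of checks is deliberately incomplete (no edge
detect, INV, or fixed-counter controls):

static bool eventsel_counts_emulated_insn(u64 eventsel, int guest_cpl)
{
        /* The counter must be enabled in its event select MSR. */
        if (!(eventsel & ARCH_PERFMON_EVENTSEL_ENABLE))
                return false;

        /* A non-zero CMASK changes the counting semantics; skip it. */
        if (eventsel & ARCH_PERFMON_EVENTSEL_CMASK)
                return false;

        /* Honor the OS/USR bits against the guest's current CPL. */
        if (guest_cpl == 0 && !(eventsel & ARCH_PERFMON_EVENTSEL_OS))
                return false;
        if (guest_cpl > 0 && !(eventsel & ARCH_PERFMON_EVENTSEL_USR))
                return false;

        /* AMD only: a host-only counter must not count guest work. */
        if (eventsel & AMD64_EVENTSEL_HOSTONLY)
                return false;

        return true;
}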

> >
> >
> >> static inline bool eventsel_match_perf_hw_id(struct kvm_pmc *pmc,
> >>          unsigned int perf_hw_id)
> >> {
> >>          u64 old_eventsel = pmc->eventsel;
> >>          unsigned int config;
> >>
> >>          pmc->eventsel &=
> >>                  (ARCH_PERFMON_EVENTSEL_EVENT | ARCH_PERFMON_EVENTSEL_UMASK);
> >>          config = kvm_x86_ops.pmu_ops->find_perf_hw_id(pmc);
> >>          pmc->eventsel = old_eventsel;
> >>          return config == perf_hw_id;
> >> }
>
> My proposal is to incr counter as long as the select and mask bits match the
> generic event.
>
> What do you think?
>
> >>
> >>> +         pmc->perf_event->state == PERF_EVENT_STATE_ACTIVE &&
> >>
> >> Again, we should not care about the pmc->perf_event.
> >
> > This test was intended as a proxy for checking that the counter is
> > enabled in the guest's IA32_PERF_GLOBAL_CTRL MSR.
>
> The two are not equivalent.

Yes. I'm getting that now.

> An enabled counter means true from "pmc_is_enabled(pmc) &&
> pmc_speculative_in_use(pmc)".
> A well-emulated counter means true from "perf_event->state ==
> PERF_EVENT_STATE_ACTIVE".
>
> A bad-emulated but enabled counter should be incremented for emulated instructions.

What is a "bad-emulated" counter?

> >
> >>> +         pmc->perf_event->attr.config == evt) {
> >>
> >> So how about the emulated instructions for
> >> ARCH_PERFMON_EVENTSEL_OS and ARCH_PERFMON_EVENTSEL_USR ?
> >
> > I assume you're referring to the OS and USR bits of the corresponding
> > IA32_PERFEVTSELx MSR. I agree that these bits have to be consulted,
> > along with guest privilege level, before deciding whether or not to
> > count the event.
>
> Thanks, and we may need to update the testcase as well.

Indeed.

> >
> >>> +             pmc->counter++;
> >>> +             counter_value = pmc_read_counter(pmc);
> >>> +             sample_period = get_sample_period(pmc, counter_value);
> >>> +             if (!counter_value)
> >>> +                     perf_event_overflow(pmc->perf_event, NULL, NULL);
> >>
> >> We need to call kvm_perf_overflow() or kvm_perf_overflow_intr().
> >> And the patch set doesn't export the perf_event_overflow() SYMBOL.
> >
> > Oops. I was compiling with kvm built into vmlinux, so I missed this.
>
> In fact, I don't think the perf code would accept such a rude symbolic export.
> And I do propose to apply kvm_pmu_incr_counter() in a less invasive way.
>
> >
> >>> +             if (local64_read(&pmc->perf_event->hw.period_left) >
> >>> +                 sample_period)
> >>> +                     perf_event_period(pmc->perf_event, sample_period);
> >>> +     }
> >>> +}
> >>
> >> Not cc'ing PeterZ or the perf reviewers on this part of the code is not a good thing.
> >
> > Added.
> >
> >> How about this:
> >>
> >> static void kvm_pmu_incr_counter(struct kvm_pmc *pmc)
> >> {
> >>          struct kvm_pmu *pmu = pmc_to_pmu(pmc);
> >>
> >>          pmc->counter++;
> >>          reprogram_counter(pmu, pmc->idx);
> >>          if (!pmc_read_counter(pmc))
> >>                  // https://lore.kernel.org/kvm/20211116122030.4698-1-likexu@tencent.com/T/#t
> >>                  kvm_pmu_counter_overflow(pmc, need_overflow_intr(pmc));
> >> }
> >>
> >>> +
> >>> +void kvm_pmu_record_event(struct kvm_vcpu *vcpu, u64 evt)
> >>
> >> s/kvm_pmu_record_event/kvm_pmu_trigger_event/
> >>
> >>> +{
> >>> +     struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
> >>> +     int i;
> >>> +
> >>> +     for (i = 0; i < pmu->nr_arch_gp_counters; i++)
> >>> +             kvm_pmu_incr_counter(&pmu->gp_counters[i], evt);
> >>
> >> Why do we need to accumulate a counter that is not enabled at all ?
> >
> > In the original code, the condition checked in kvm_pmu_incr_counter()
> > was intended to filter out disabled counters.
>
> The bar of code review hasn't been lowered, eh?

I have no idea what you mean. If anything, I'd like the bar for both
review and acceptance to be higher than it is today. No one was more
surprised than I was when Paolo accepted these patches so quickly.

> >
> >>> +     for (i = 0; i < pmu->nr_arch_fixed_counters; i++)
> >>> +             kvm_pmu_incr_counter(&pmu->fixed_counters[i], evt);
> >>
> >> How about this:
> >>
> >>          for_each_set_bit(i, pmu->all_valid_pmc_idx, X86_PMC_IDX_MAX) {
> >>                  pmc = kvm_x86_ops.pmu_ops->pmc_idx_to_pmc(pmu, i);
> >>
> >>                  if (!pmc || !pmc_is_enabled(pmc) || !pmc_speculative_in_use(pmc))
> >>                          continue;
> >>
> >>                  // https://lore.kernel.org/kvm/20211116122030.4698-1-likexu@tencent.com/T/#t
> >>                  if (eventsel_match_perf_hw_id(pmc, perf_hw_id))
> >>                          kvm_pmu_incr_counter(pmc);
> >>          }
> >>
> >
> > Let me expand the list of reviewers and come back with v2 after I
> > collect more input.
>
> I'm not sure Paolo will revert the "Queued both" decision,
> but I'm not taking my eyes or hands off the vPMU code.

I'm going on vacation for a couple of weeks. If Paolo doesn't want to
revert the buggy submissions from kvm-queue, then I will gladly defer
to you as the self-declared warden of the vPMU code to fix it as you
see fit.

Thanks!

--jim

> >
> > Thanks!
> >
> >
> >>> +}
> >>> +EXPORT_SYMBOL_GPL(kvm_pmu_record_event);
> >>> +
> >>>    int kvm_vm_ioctl_set_pmu_event_filter(struct kvm *kvm, void __user *argp)
> >>>    {
> >>>        struct kvm_pmu_event_filter tmp, *filter;
> >>> diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
> >>> index 59d6b76203d5..d1dd2294f8fb 100644
> >>> --- a/arch/x86/kvm/pmu.h
> >>> +++ b/arch/x86/kvm/pmu.h
> >>> @@ -159,6 +159,7 @@ void kvm_pmu_init(struct kvm_vcpu *vcpu);
> >>>    void kvm_pmu_cleanup(struct kvm_vcpu *vcpu);
> >>>    void kvm_pmu_destroy(struct kvm_vcpu *vcpu);
> >>>    int kvm_vm_ioctl_set_pmu_event_filter(struct kvm *kvm, void __user *argp);
> >>> +void kvm_pmu_record_event(struct kvm_vcpu *vcpu, u64 evt);
> >>>
> >>>    bool is_vmware_backdoor_pmc(u32 pmc_idx);
> >>>
> >>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> >>> index d7def720227d..bd49e2a204d5 100644
> >>> --- a/arch/x86/kvm/x86.c
> >>> +++ b/arch/x86/kvm/x86.c
> >>> @@ -7854,6 +7854,8 @@ int kvm_skip_emulated_instruction(struct kvm_vcpu *vcpu)
> >>>        if (unlikely(!r))
> >>>                return 0;
> >>>
> >>> +     kvm_pmu_record_event(vcpu, PERF_COUNT_HW_INSTRUCTIONS);
> >>> +
> >>>        /*
> >>>         * rflags is the old, "raw" value of the flags.  The new value has
> >>>         * not been saved yet.
> >>> @@ -8101,6 +8103,7 @@ int x86_emulate_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> >>>                vcpu->arch.emulate_regs_need_sync_to_vcpu = false;
> >>>                if (!ctxt->have_exception ||
> >>>                    exception_type(ctxt->exception.vector) == EXCPT_TRAP) {
> >>> +                     kvm_pmu_record_event(vcpu, PERF_COUNT_HW_INSTRUCTIONS);
> >>>                        kvm_rip_write(vcpu, ctxt->eip);
> >>>                        if (r && (ctxt->tf || (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP)))
> >>>                                r = kvm_vcpu_do_singlestep(vcpu);
> >>>
> >

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/2] KVM: x86: Update vPMCs when retiring instructions
  2021-11-17 20:01         ` Jim Mattson
@ 2021-11-18  3:37           ` Jim Mattson
  2021-11-18 11:26             ` Like Xu
  2021-11-18  9:13           ` Like Xu
  1 sibling, 1 reply; 14+ messages in thread
From: Jim Mattson @ 2021-11-18  3:37 UTC (permalink / raw)
  To: Like Xu
  Cc: Paolo Bonzini - Distinguished Engineer (kernel-recipes.org),
	Eric Hankland, kvm, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	linux-perf-users, LKML, Peter Zijlstra (Intel OTC, Netherlander)

On Wed, Nov 17, 2021 at 12:01 PM Jim Mattson <jmattson@google.com> wrote:
>
> On Tue, Nov 16, 2021 at 7:22 PM Like Xu <like.xu.linux@gmail.com> wrote:
> >
> > On 17/11/2021 6:15 am, Jim Mattson wrote:
> > > On Tue, Nov 16, 2021 at 4:44 AM Like Xu <like.xu.linux@gmail.com> wrote:
> > >>
> > >> Hi Jim,
> > >>
> > >> On 13/11/2021 7:52 am, Jim Mattson wrote:
> > >>> When KVM retires a guest instruction through emulation, increment any
> > >>> vPMCs that are configured to monitor "instructions retired," and
> > >>> update the sample period of those counters so that they will overflow
> > >>> at the right time.
> > >>>
> > >>> Signed-off-by: Eric Hankland <ehankland@google.com>
> > >>> [jmattson:
> > >>>     - Split the code to increment "branch instructions retired" into a
> > >>>       separate commit.
> > >>>     - Added 'static' to kvm_pmu_incr_counter() definition.
> > >>>     - Modified kvm_pmu_incr_counter() to check pmc->perf_event->state ==
> > >>>       PERF_EVENT_STATE_ACTIVE.
> > >>> ]
> > >>> Signed-off-by: Jim Mattson <jmattson@google.com>
> > >>> Fixes: f5132b01386b ("KVM: Expose a version 2 architectural PMU to a guests")
> > >>> ---
> > >>>    arch/x86/kvm/pmu.c | 31 +++++++++++++++++++++++++++++++
> > >>>    arch/x86/kvm/pmu.h |  1 +
> > >>>    arch/x86/kvm/x86.c |  3 +++
> > >>>    3 files changed, 35 insertions(+)
> > >>>
> > >>> diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
> > >>> index 09873f6488f7..153c488032a5 100644
> > >>> --- a/arch/x86/kvm/pmu.c
> > >>> +++ b/arch/x86/kvm/pmu.c
> > >>> @@ -490,6 +490,37 @@ void kvm_pmu_destroy(struct kvm_vcpu *vcpu)
> > >>>        kvm_pmu_reset(vcpu);
> > >>>    }
> > >>>
> > >>> +static void kvm_pmu_incr_counter(struct kvm_pmc *pmc, u64 evt)
> > >>> +{
> > >>> +     u64 counter_value, sample_period;
> > >>> +
> > >>> +     if (pmc->perf_event &&
> > >>
> > >> We need to incr pmc->counter whether it has a perf_event or not.
> > >>
> > >>> +         pmc->perf_event->attr.type == PERF_TYPE_HARDWARE &&
> > >>
> > >> We need to cover PERF_TYPE_RAW as well, for example,
> > >> it has the basic bits for "{ 0xc0, 0x00, PERF_COUNT_HW_INSTRUCTIONS },"
> > >> plus HSW_IN_TX or ARCH_PERFMON_EVENTSEL_EDGE stuff.
> > >>
> > >> We just need to focus on checking the select and umask bits:
> > >
> > > [What follows applies only to Intel CPUs. I haven't looked at AMD's
> > > PMU implementation yet.]
> >
> > x86 has the same bit definition and semantics on at least the select and umask bits.
>
> Yes, but AMD supports 12 bits of event selector. AMD also has the
> HG_ONLY bits, which affect whether or not to count the event based on
> context.

It looks like we already have an issue with event selector truncation
on the AMD side. It's not clear from the APM if AMD has always had a
12-bit event selector field, but it's 12 bits now. Milan, for example,
has at least 6 different events with selectors > 255. I don't see how
a guest could monitor those events with the existing KVM
implementation.
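
For reference, AMD's EventSelect field spans bits 7:0 and 35:32 of
PerfEvtSeln, so assembling the full 12-bit selector would look roughly
like the sketch below (helper name made up; illustrative only).
Comparing only the low byte, as the legacy mapping does, truncates any
selector above 0xff:

static u16 amd_full_event_select(u64 eventsel)
{
        /* EventSelect[7:0] is in bits 7:0, EventSelect[11:8] in bits 35:32. */
        return (eventsel & 0xffULL) | ((eventsel >> 24) & 0xf00ULL);
}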

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/2] KVM: x86: Update vPMCs when retiring instructions
  2021-11-17 20:01         ` Jim Mattson
  2021-11-18  3:37           ` Jim Mattson
@ 2021-11-18  9:13           ` Like Xu
  1 sibling, 0 replies; 14+ messages in thread
From: Like Xu @ 2021-11-18  9:13 UTC (permalink / raw)
  To: Jim Mattson
  Cc: Paolo Bonzini - Distinguished Engineer (kernel-recipes.org),
	Eric Hankland, kvm, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	linux-perf-users, LKML, Peter Zijlstra (Intel OTC, Netherlander)

On 18/11/2021 4:01 am, Jim Mattson wrote:
> On Tue, Nov 16, 2021 at 7:22 PM Like Xu <like.xu.linux@gmail.com> wrote:
>>
>> On 17/11/2021 6:15 am, Jim Mattson wrote:
>>> On Tue, Nov 16, 2021 at 4:44 AM Like Xu <like.xu.linux@gmail.com> wrote:
>>>>
>>>> Hi Jim,
>>>>
>>>> On 13/11/2021 7:52 am, Jim Mattson wrote:
>>>>> When KVM retires a guest instruction through emulation, increment any
>>>>> vPMCs that are configured to monitor "instructions retired," and
>>>>> update the sample period of those counters so that they will overflow
>>>>> at the right time.
>>>>>
>>>>> Signed-off-by: Eric Hankland <ehankland@google.com>
>>>>> [jmattson:
>>>>>      - Split the code to increment "branch instructions retired" into a
>>>>>        separate commit.
>>>>>      - Added 'static' to kvm_pmu_incr_counter() definition.
>>>>>      - Modified kvm_pmu_incr_counter() to check pmc->perf_event->state ==
>>>>>        PERF_EVENT_STATE_ACTIVE.
>>>>> ]
>>>>> Signed-off-by: Jim Mattson <jmattson@google.com>
>>>>> Fixes: f5132b01386b ("KVM: Expose a version 2 architectural PMU to a guests")
>>>>> ---
>>>>>     arch/x86/kvm/pmu.c | 31 +++++++++++++++++++++++++++++++
>>>>>     arch/x86/kvm/pmu.h |  1 +
>>>>>     arch/x86/kvm/x86.c |  3 +++
>>>>>     3 files changed, 35 insertions(+)
>>>>>
>>>>> diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
>>>>> index 09873f6488f7..153c488032a5 100644
>>>>> --- a/arch/x86/kvm/pmu.c
>>>>> +++ b/arch/x86/kvm/pmu.c
>>>>> @@ -490,6 +490,37 @@ void kvm_pmu_destroy(struct kvm_vcpu *vcpu)
>>>>>         kvm_pmu_reset(vcpu);
>>>>>     }
>>>>>
>>>>> +static void kvm_pmu_incr_counter(struct kvm_pmc *pmc, u64 evt)
>>>>> +{
>>>>> +     u64 counter_value, sample_period;
>>>>> +
>>>>> +     if (pmc->perf_event &&
>>>>
>>>> We need to incr pmc->counter whether it has a perf_event or not.
>>>>
>>>>> +         pmc->perf_event->attr.type == PERF_TYPE_HARDWARE &&
>>>>
>>>> We need to cover PERF_TYPE_RAW as well, for example,
>>>> it has the basic bits for "{ 0xc0, 0x00, PERF_COUNT_HW_INSTRUCTIONS },"
>>>> plus HSW_IN_TX or ARCH_PERFMON_EVENTSEL_EDGE stuff.
>>>>
>>>> We just need to focus on checking the select and umask bits:
>>>
>>> [What follows applies only to Intel CPUs. I haven't looked at AMD's
>>> PMU implementation yet.]
>>
>> x86 has the same bit definition and semantics on at least the select and umask bits.
> 
> Yes, but AMD supports 12 bits of event selector. AMD also has the
> HG_ONLY bits, which affect whether or not to count the event based on
> context.

You're right, "EVENT_SELECT[11:8] (Event Select): read/write.
This field extends the EVENT_SELECT field from 8 bits to 12 bits."

But as a legacy way, we're using amd_event_mapping[]
(which checks the shared select and umask bits) to
get the two generalized performance events
"instructions retired" and "branch instructions retired".

So we may insist on checking the x86 shared 16 bits and comment on the risk from
the extra 4 bits.

We may not support the HG_ONLY bits in the KVM world (df51fe7ea1c1).
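
Concretely, that shared 16-bit select/umask check boils down to
something like the sketch below (the helper and its parameters are
illustrative; the legacy table entries are assumed to carry an 8-bit
select and an 8-bit unit mask):

static bool eventsel_matches_entry(u64 eventsel, u8 select, u8 unit_mask)
{
        return (eventsel & ARCH_PERFMON_EVENTSEL_EVENT) == select &&
               ((eventsel & ARCH_PERFMON_EVENTSEL_UMASK) >> 8) == unit_mask;
}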

> 
>>>
>>> Looking at the SDM, volume 3, Figure 18-1: Layout of IA32_PERFEVTSELx
>>> MSRs, there seems to be a lot of complexity here, actually. In
>>
>> The devil is in the details.
>>
>>> addition to checking for the desired event select and unit mask, it
>>> looks like we need to check the following:
>>>
>>> 1. The EN bit is set.
>>
>> We need to cover the EN bit of fixed counter 0 for HW_INSTRUCTIONS.
> 
> I don't know what you mean by that.

The four fixed ctrl bits for fixed counter 0 should be consulted as well.
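
For concreteness, fixed counter 0 is controlled by the low 4 bits of
IA32_FIXED_CTR_CTRL (bit 0: count at CPL 0, bit 1: count at CPL > 0),
so that consultation might look like the sketch below (helper name made
up; the AnyThread and PMI bits are ignored):

static bool fixed_ctr0_counts_at_cpl(u64 fixed_ctr_ctrl, int guest_cpl)
{
        u64 ctrl = fixed_ctr_ctrl & 0xf;  /* control field for fixed counter 0 */

        return guest_cpl == 0 ? (ctrl & 0x1) : (ctrl & 0x2);
}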

> 
>>> 2. The CMASK field is 0 (for events that can only happen once per cycle).
>>> 3. The E bit is clear (maybe?).
>>
>> The "Edge detect" bit is about hw detail and let's ignore it.
> 
>   From my reading of the SDM, I don't think the edge detect bit can be
> ignored, but I will do some empirical tests to convince myself when I
> get back from vacation.

Sorry to bother you on vacation.

I assume turning edge detection on or off causes certain events
to occur depending on the HW implementation.

Can we trap the edge status of an emulated instruction ?

> 
>>> 4. The OS bit is set if the guest is running at CPL0.
>>> 5. The USR bit is set if the guest is running at CPL>0.
>>
>> CPL is a necessity.
>>
> 
> As is host/guest mode on AMD.
> 
>>>
>>>
>>>> static inline bool eventsel_match_perf_hw_id(struct kvm_pmc *pmc,
>>>>           unsigned int perf_hw_id)
>>>> {
>>>>           u64 old_eventsel = pmc->eventsel;
>>>>           unsigned int config;
>>>>
>>>>           pmc->eventsel &=
>>>>                   (ARCH_PERFMON_EVENTSEL_EVENT | ARCH_PERFMON_EVENTSEL_UMASK);
>>>>           config = kvm_x86_ops.pmu_ops->find_perf_hw_id(pmc);
>>>>           pmc->eventsel = old_eventsel;
>>>>           return config == perf_hw_id;
>>>> }
>>
>> My proposal is to incr counter as long as the select and mask bits match the
>> generic event.
>>
>> What do you think?
>>
>>>>
>>>>> +         pmc->perf_event->state == PERF_EVENT_STATE_ACTIVE &&
>>>>
>>>> Again, we should not care about the pmc->perf_event.
>>>
>>> This test was intended as a proxy for checking that the counter is
>>> enabled in the guest's IA32_PERF_GLOBAL_CTRL MSR.
>>
>> The two are not equivalent.
> 
> Yes. I'm getting that now.
> 
>> An enabled counter means true from "pmc_is_enabled(pmc) &&
>> pmc_speculative_in_use(pmc)".
>> A well-emulated counter means true from "perf_event->state ==
>> PERF_EVENT_STATE_ACTIVE".
>>
>> A bad-emulated but enabled counter should be incremented for emulated instructions.
> 
> What is a "bad-emulated" counter?

When pmc->perf_event is not created or not scheduled by the host perf scheduler.
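
In other words, the increment itself should not depend on the host
perf_event at all; a minimal sketch of that idea, reusing names from
the patch above (illustrative only; overflow handling and the
period_left comparison are omitted):

static void emulated_event_incr(struct kvm_pmc *pmc)
{
        /* Bump the guest-visible counter unconditionally. */
        pmc->counter = (pmc->counter + 1) & pmc_bitmask(pmc);

        /* Only touch the host perf_event if one is actually scheduled. */
        if (pmc->perf_event &&
            pmc->perf_event->state == PERF_EVENT_STATE_ACTIVE)
                perf_event_period(pmc->perf_event,
                                  get_sample_period(pmc, pmc->counter));
}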

> 
>>>
>>>>> +         pmc->perf_event->attr.config == evt) {
>>>>
>>>> So how about the emulated instructions for
>>>> ARCH_PERFMON_EVENTSEL_OS and ARCH_PERFMON_EVENTSEL_USR ?
>>>
>>> I assume you're referring to the OS and USR bits of the corresponding
>>> IA32_PERFEVTSELx MSR. I agree that these bits have to be consulted,
>>> along with guest privilege level, before deciding whether or not to
>>> count the event.
>>
>> Thanks, and we may need to update the testcase as well.
> 
> Indeed.
> 
>>>
>>>>> +             pmc->counter++;
>>>>> +             counter_value = pmc_read_counter(pmc);
>>>>> +             sample_period = get_sample_period(pmc, counter_value);
>>>>> +             if (!counter_value)
>>>>> +                     perf_event_overflow(pmc->perf_event, NULL, NULL);
>>>>
>>>> We need to call kvm_perf_overflow() or kvm_perf_overflow_intr().
>>>> And the patch set doesn't export the perf_event_overflow() SYMBOL.
>>>
>>> Oops. I was compiling with kvm built into vmlinux, so I missed this.
>>
>> In fact, I don't think the perf code would accept such a rude symbolic export.
>> And I do propose to apply kvm_pmu_incr_counter() in a less invasive way.
>>
>>>
>>>>> +             if (local64_read(&pmc->perf_event->hw.period_left) >
>>>>> +                 sample_period)
>>>>> +                     perf_event_period(pmc->perf_event, sample_period);
>>>>> +     }
>>>>> +}
>>>>
>>>> Not cc'ing PeterZ or the perf reviewers on this part of the code is not a good thing.
>>>
>>> Added.
>>>
>>>> How about this:
>>>>
>>>> static void kvm_pmu_incr_counter(struct kvm_pmc *pmc)
>>>> {
>>>>           struct kvm_pmu *pmu = pmc_to_pmu(pmc);
>>>>
>>>>           pmc->counter++;
>>>>           reprogram_counter(pmu, pmc->idx);
>>>>           if (!pmc_read_counter(pmc))
>>>>                   // https://lore.kernel.org/kvm/20211116122030.4698-1-likexu@tencent.com/T/#t
>>>>                   kvm_pmu_counter_overflow(pmc, need_overflow_intr(pmc));
>>>> }
>>>>
>>>>> +
>>>>> +void kvm_pmu_record_event(struct kvm_vcpu *vcpu, u64 evt)
>>>>
>>>> s/kvm_pmu_record_event/kvm_pmu_trigger_event/
>>>>
>>>>> +{
>>>>> +     struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
>>>>> +     int i;
>>>>> +
>>>>> +     for (i = 0; i < pmu->nr_arch_gp_counters; i++)
>>>>> +             kvm_pmu_incr_counter(&pmu->gp_counters[i], evt);
>>>>
>>>> Why do we need to accumulate a counter that is not enabled at all ?
>>>
>>> In the original code, the condition checked in kvm_pmu_incr_counter()
>>> was intended to filter out disabled counters.
>>
>> The bar of code review hasn't been lowered, eh?
> 
> I have no idea what you mean. If anything, I'd like the bar for both
> review and acceptance to be higher than it is today. No one was more
> surprised than I was when Paolo accepted these patches so quickly.

Personally, I have great sympathy for the maintainers who are always overworked.

Surely, we need more eyes on all corners of the KVM code. At least I may offer a 
little help.

> 
>>>
>>>>> +     for (i = 0; i < pmu->nr_arch_fixed_counters; i++)
>>>>> +             kvm_pmu_incr_counter(&pmu->fixed_counters[i], evt);
>>>>
>>>> How about this:
>>>>
>>>>           for_each_set_bit(i, pmu->all_valid_pmc_idx, X86_PMC_IDX_MAX) {
>>>>                   pmc = kvm_x86_ops.pmu_ops->pmc_idx_to_pmc(pmu, i);
>>>>
>>>>                   if (!pmc || !pmc_is_enabled(pmc) || !pmc_speculative_in_use(pmc))
>>>>                           continue;
>>>>
>>>>                   // https://lore.kernel.org/kvm/20211116122030.4698-1-likexu@tencent.com/T/#t
>>>>                   if (eventsel_match_perf_hw_id(pmc, perf_hw_id))
>>>>                           kvm_pmu_incr_counter(pmc);
>>>>           }
>>>>
>>>
>>> Let me expand the list of reviewers and come back with v2 after I
>>> collect more input.
>>
>> I'm not sure Paolo will revert the "Queued both" decision,
>> but I'm not taking my eyes or hands off the vPMU code.
> 
> I'm going on vacation for a couple of weeks. If Paolo doesn't want to
> revert the buggy submissions from kvm-queue, then I will gladly defer
> to you as the self-declared warden of the vPMU code to fix it as you
> see fit.
> 

Sorry to disturb you before a wonderful holiday.

It looks like we committed the test case and reverted this patch set.
Please let me know if you need me to take over for a V2 w/ your SOBs.

> Thanks!
> 
> --jim
> 
>>>
>>> Thanks!
>>>
>>>
>>>>> +}
>>>>> +EXPORT_SYMBOL_GPL(kvm_pmu_record_event);
>>>>> +
>>>>>     int kvm_vm_ioctl_set_pmu_event_filter(struct kvm *kvm, void __user *argp)
>>>>>     {
>>>>>         struct kvm_pmu_event_filter tmp, *filter;
>>>>> diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
>>>>> index 59d6b76203d5..d1dd2294f8fb 100644
>>>>> --- a/arch/x86/kvm/pmu.h
>>>>> +++ b/arch/x86/kvm/pmu.h
>>>>> @@ -159,6 +159,7 @@ void kvm_pmu_init(struct kvm_vcpu *vcpu);
>>>>>     void kvm_pmu_cleanup(struct kvm_vcpu *vcpu);
>>>>>     void kvm_pmu_destroy(struct kvm_vcpu *vcpu);
>>>>>     int kvm_vm_ioctl_set_pmu_event_filter(struct kvm *kvm, void __user *argp);
>>>>> +void kvm_pmu_record_event(struct kvm_vcpu *vcpu, u64 evt);
>>>>>
>>>>>     bool is_vmware_backdoor_pmc(u32 pmc_idx);
>>>>>
>>>>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>>>>> index d7def720227d..bd49e2a204d5 100644
>>>>> --- a/arch/x86/kvm/x86.c
>>>>> +++ b/arch/x86/kvm/x86.c
>>>>> @@ -7854,6 +7854,8 @@ int kvm_skip_emulated_instruction(struct kvm_vcpu *vcpu)
>>>>>         if (unlikely(!r))
>>>>>                 return 0;
>>>>>
>>>>> +     kvm_pmu_record_event(vcpu, PERF_COUNT_HW_INSTRUCTIONS);
>>>>> +
>>>>>         /*
>>>>>          * rflags is the old, "raw" value of the flags.  The new value has
>>>>>          * not been saved yet.
>>>>> @@ -8101,6 +8103,7 @@ int x86_emulate_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
>>>>>                 vcpu->arch.emulate_regs_need_sync_to_vcpu = false;
>>>>>                 if (!ctxt->have_exception ||
>>>>>                     exception_type(ctxt->exception.vector) == EXCPT_TRAP) {
>>>>> +                     kvm_pmu_record_event(vcpu, PERF_COUNT_HW_INSTRUCTIONS);
>>>>>                         kvm_rip_write(vcpu, ctxt->eip);
>>>>>                         if (r && (ctxt->tf || (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP)))
>>>>>                                 r = kvm_vcpu_do_singlestep(vcpu);
>>>>>
>>>
> 

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/2] KVM: x86: Update vPMCs when retiring instructions
  2021-11-18  3:37           ` Jim Mattson
@ 2021-11-18 11:26             ` Like Xu
  0 siblings, 0 replies; 14+ messages in thread
From: Like Xu @ 2021-11-18 11:26 UTC (permalink / raw)
  To: Jim Mattson
  Cc: Paolo Bonzini - Distinguished Engineer (kernel-recipes.org),
	Eric Hankland, kvm, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	linux-perf-users, LKML, Peter Zijlstra (Intel OTC, Netherlander)

On 18/11/2021 11:37 am, Jim Mattson wrote:
> On Wed, Nov 17, 2021 at 12:01 PM Jim Mattson <jmattson@google.com> wrote:
>>
>> On Tue, Nov 16, 2021 at 7:22 PM Like Xu <like.xu.linux@gmail.com> wrote:
>>>
>>> On 17/11/2021 6:15 am, Jim Mattson wrote:
>>>> On Tue, Nov 16, 2021 at 4:44 AM Like Xu <like.xu.linux@gmail.com> wrote:
>>>>>
>>>>> Hi Jim,
>>>>>
>>>>> On 13/11/2021 7:52 am, Jim Mattson wrote:
>>>>>> When KVM retires a guest instruction through emulation, increment any
>>>>>> vPMCs that are configured to monitor "instructions retired," and
>>>>>> update the sample period of those counters so that they will overflow
>>>>>> at the right time.
>>>>>>
>>>>>> Signed-off-by: Eric Hankland <ehankland@google.com>
>>>>>> [jmattson:
>>>>>>      - Split the code to increment "branch instructions retired" into a
>>>>>>        separate commit.
>>>>>>      - Added 'static' to kvm_pmu_incr_counter() definition.
>>>>>>      - Modified kvm_pmu_incr_counter() to check pmc->perf_event->state ==
>>>>>>        PERF_EVENT_STATE_ACTIVE.
>>>>>> ]
>>>>>> Signed-off-by: Jim Mattson <jmattson@google.com>
>>>>>> Fixes: f5132b01386b ("KVM: Expose a version 2 architectural PMU to a guests")
>>>>>> ---
>>>>>>     arch/x86/kvm/pmu.c | 31 +++++++++++++++++++++++++++++++
>>>>>>     arch/x86/kvm/pmu.h |  1 +
>>>>>>     arch/x86/kvm/x86.c |  3 +++
>>>>>>     3 files changed, 35 insertions(+)
>>>>>>
>>>>>> diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
>>>>>> index 09873f6488f7..153c488032a5 100644
>>>>>> --- a/arch/x86/kvm/pmu.c
>>>>>> +++ b/arch/x86/kvm/pmu.c
>>>>>> @@ -490,6 +490,37 @@ void kvm_pmu_destroy(struct kvm_vcpu *vcpu)
>>>>>>         kvm_pmu_reset(vcpu);
>>>>>>     }
>>>>>>
>>>>>> +static void kvm_pmu_incr_counter(struct kvm_pmc *pmc, u64 evt)
>>>>>> +{
>>>>>> +     u64 counter_value, sample_period;
>>>>>> +
>>>>>> +     if (pmc->perf_event &&
>>>>>
>>>>> We need to incr pmc->counter whether it has a perf_event or not.
>>>>>
>>>>>> +         pmc->perf_event->attr.type == PERF_TYPE_HARDWARE &&
>>>>>
>>>>> We need to cover PERF_TYPE_RAW as well, for example,
>>>>> it has the basic bits for "{ 0xc0, 0x00, PERF_COUNT_HW_INSTRUCTIONS },"
>>>>> plus HSW_IN_TX or ARCH_PERFMON_EVENTSEL_EDGE stuff.
>>>>>
>>>>> We just need to focus on checking the select and umask bits:
>>>>
>>>> [What follows applies only to Intel CPUs. I haven't looked at AMD's
>>>> PMU implementation yet.]
>>>
>>> x86 has the same bit definition and semantics on at least the select and umask bits.
>>
>> Yes, but AMD supports 12 bits of event selector. AMD also has the
>> HG_ONLY bits, which affect whether or not to count the event based on
>> context.
> 
> It looks like we already have an issue with event selector truncation
> on the AMD side. It's not clear from the APM if AMD has always had a
> 12-bit event selector field, but it's 12 bits now. Milan, for example,
> has at least 6 different events with selectors > 255. I don't see how
> a guest could monitor those events with the existing KVM
> implementation.

Yes and I have reproduced the issue on a Milan.
Thanks for your input, and let me try to fix it.

Thanks,
Like Xu


^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2021-11-18 11:28 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-11-12 23:52 [PATCH 0/2] kvm: x86: Fix PMU virtualization for some basic events Jim Mattson
2021-11-12 23:52 ` [PATCH 1/2] KVM: x86: Update vPMCs when retiring instructions Jim Mattson
2021-11-16 10:30   ` Paolo Bonzini
2021-11-16 12:43   ` Like Xu
2021-11-16 22:15     ` Jim Mattson
2021-11-17  3:21       ` Like Xu
2021-11-17 20:01         ` Jim Mattson
2021-11-18  3:37           ` Jim Mattson
2021-11-18 11:26             ` Like Xu
2021-11-18  9:13           ` Like Xu
2021-11-12 23:52 ` [PATCH 2/2] KVM: x86: Update vPMCs when retiring branch instructions Jim Mattson
2021-11-15  3:43 ` [PATCH 0/2] kvm: x86: Fix PMU virtualization for some basic events Like Xu
2021-11-15 17:51   ` Jim Mattson
2021-11-16  3:22     ` Like Xu
