* [PATCH RESEND v2 0/2] target/i386/kvm: fix two svm pmu virtualization bugs
@ 2023-06-21  1:38 Dongli Zhang
  2023-06-21  1:38 ` [PATCH RESEND v2 1/2] target/i386/kvm: introduce 'pmu-cap-disabled' to set KVM_PMU_CAP_DISABLE Dongli Zhang
  2023-06-21  1:38 ` [PATCH RESEND v2 2/2] target/i386/kvm: get and put AMD pmu registers Dongli Zhang
  0 siblings, 2 replies; 7+ messages in thread
From: Dongli Zhang @ 2023-06-21  1:38 UTC (permalink / raw)
  To: qemu-devel, kvm
  Cc: pbonzini, mtosatti, joe.jin, likexu, like.xu.linux, zhenyuw, groug, lyan

This is to rebase the patchset on top of the most recent QEMU.

This patchset fixes two svm pmu virtualization bugs, x86 only.

version 1:
https://lore.kernel.org/all/20221119122901.2469-1-dongli.zhang@oracle.com/

1. The 1st bug is that "-cpu host,-pmu" cannot disable svm pmu virtualization.

Neither "-cpu EPYC" nor "-cpu host,-pmu" can disable the pmu
virtualization. The VM Linux kernel still reports the below ...

[    0.510611] Performance Events: Fam17h+ core perfctr, AMD PMU driver.

... although we expect something like below.

[    0.596381] Performance Events: PMU not available due to virtualization, using software events only.
[    0.600972] NMI watchdog: Perf NMI watchdog permanently disabled

The 1st patch introduces a new x86-only accel/kvm property
"pmu-cap-disabled=true" to disable the pmu virtualization via
KVM_PMU_CAP_DISABLE.

I initially considered 'KVM_X86_SET_MSR_FILTER' before patchset v1. Since
both KVM_X86_SET_MSR_FILTER and KVM_PMU_CAP_DISABLE are VM ioctls, I
finally chose the latter because it is easier to use.
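
For reference, the property boils down to a single VM-scope KVM_ENABLE_CAP
call issued before any vcpu is created. A minimal standalone sketch of the
underlying ioctl (using a raw VM fd instead of QEMU's kvm_vm_enable_cap()
helper; disable_vm_pmu() is only an illustrative name):

#include <linux/kvm.h>
#include <sys/ioctl.h>

/* Must run before the first KVM_CREATE_VCPU on this VM fd. */
static int disable_vm_pmu(int vm_fd)
{
    struct kvm_enable_cap cap = {
        .cap = KVM_CAP_PMU_CAPABILITY,
        .args = { KVM_PMU_CAP_DISABLE },
    };

    return ioctl(vm_fd, KVM_ENABLE_CAP, &cap);
}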


2. The 2nd bug is that un-reclaimed perf events (left at the KVM side after
QEMU system_reset) may inject random unwanted/unknown NMIs into the VM.

The svm pmu registers are not reset during QEMU system_reset.

(1). The VM resets (e.g., via QEMU system_reset or VM kdump/kexec) while it
is running "perf top". The pmu registers are not disabled gracefully.

(2). Although x86_cpu_reset() resets many registers to zero,
kvm_put_msrs() does not put the AMD pmu registers to the KVM side. As a
result, some pmu events are still enabled at the KVM side.

(3). The KVM pmc_speculative_in_use() always returns true so that the events
will not be reclaimed. The kvm_pmc->perf_event is still active.

(4). After the reboot, the VM kernel reports the below error:

[    0.092011] Performance Events: Fam17h+ core perfctr, Broken BIOS detected, complain to your hardware vendor.
[    0.092023] [Firmware Bug]: the BIOS has corrupted hw-PMU resources (MSR c0010200 is 530076)

(5). In the worst case, the active kvm_pmc->perf_event can still inject
unknown NMIs randomly into the VM kernel.

[...] Uhhuh. NMI received for unknown reason 30 on CPU 0.

The 2nd patch fixes the issue by resetting the AMD pmu registers in the
same way as the Intel ones.


This patchset does not cover PerfMonV2 until the below patchset is merged
into the KVM side. It has been queued to kvm-x86 next by Sean.

[PATCH v7 00/12] KVM: x86: Add AMD Guest PerfMonV2 PMU support
https://lore.kernel.org/all/168609790857.1417369.13152633386083458084.b4-ty@google.com/


Dongli Zhang (2):
      target/i386/kvm: introduce 'pmu-cap-disabled' to set KVM_PMU_CAP_DISABLE
      target/i386/kvm: get and put AMD pmu registers

 accel/kvm/kvm-all.c      |   1 +
 include/sysemu/kvm_int.h |   1 +
 qemu-options.hx          |   7 +++
 target/i386/cpu.h        |   5 ++
 target/i386/kvm/kvm.c    | 129 +++++++++++++++++++++++++++++++++++++++++-
 5 files changed, 141 insertions(+), 2 deletions(-)


Thank you very much!

Dongli Zhang




* [PATCH RESEND v2 1/2] target/i386/kvm: introduce 'pmu-cap-disabled' to set KVM_PMU_CAP_DISABLE
  2023-06-21  1:38 [PATCH RESEND v2 0/2] target/i386/kvm: fix two svm pmu virtualization bugs Dongli Zhang
@ 2023-06-21  1:38 ` Dongli Zhang
  2023-07-02 13:41   ` Like Xu
  2023-06-21  1:38 ` [PATCH RESEND v2 2/2] target/i386/kvm: get and put AMD pmu registers Dongli Zhang
  1 sibling, 1 reply; 7+ messages in thread
From: Dongli Zhang @ 2023-06-21  1:38 UTC (permalink / raw)
  To: qemu-devel, kvm
  Cc: pbonzini, mtosatti, joe.jin, likexu, like.xu.linux, zhenyuw, groug, lyan

The "perf stat" at the VM side still works even we set "-cpu host,-pmu" in
the QEMU command line. That is, neither "-cpu host,-pmu" nor "-cpu EPYC"
could disable the pmu virtualization in an AMD environment.

We still see the below at the VM kernel side ...

[    0.510611] Performance Events: Fam17h+ core perfctr, AMD PMU driver.

... although we expect something like below.

[    0.596381] Performance Events: PMU not available due to virtualization, using software events only.
[    0.600972] NMI watchdog: Perf NMI watchdog permanently disabled

This is because the AMD pmu (v1) does not rely on cpuid to decide if the
pmu virtualization is supported.

We introduce a new property 'pmu-cap-disabled' for KVM accel to set
KVM_PMU_CAP_DISABLE if KVM_CAP_PMU_CAPABILITY is supported. Only x86 host
is supported because currently KVM uses KVM_CAP_PMU_CAPABILITY only for
x86.
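
With this patch applied, the pmu virtualization can then be disabled with,
e.g., "-accel kvm,pmu-cap-disabled=true" on the QEMU command line.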

Cc: Joe Jin <joe.jin@oracle.com>
Cc: Like Xu <likexu@tencent.com>
Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
---
Changed since v1:
- In version 1 we did not introduce the new property. We issued the
  KVM_PMU_CAP_DISABLE ioctl only once, before the creation of the 1st
  vcpu, via a helper function introduced for that purpose.

 accel/kvm/kvm-all.c      |  1 +
 include/sysemu/kvm_int.h |  1 +
 qemu-options.hx          |  7 ++++++
 target/i386/kvm/kvm.c    | 46 ++++++++++++++++++++++++++++++++++++++++
 4 files changed, 55 insertions(+)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index 7679f397ae..238098e991 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -3763,6 +3763,7 @@ static void kvm_accel_instance_init(Object *obj)
     s->xen_version = 0;
     s->xen_gnttab_max_frames = 64;
     s->xen_evtchn_max_pirq = 256;
+    s->pmu_cap_disabled = false;
 }
 
 /**
diff --git a/include/sysemu/kvm_int.h b/include/sysemu/kvm_int.h
index 511b42bde5..cbbe08ec54 100644
--- a/include/sysemu/kvm_int.h
+++ b/include/sysemu/kvm_int.h
@@ -123,6 +123,7 @@ struct KVMState
     uint32_t xen_caps;
     uint16_t xen_gnttab_max_frames;
     uint16_t xen_evtchn_max_pirq;
+    bool pmu_cap_disabled;
 };
 
 void kvm_memory_listener_register(KVMState *s, KVMMemoryListener *kml,
diff --git a/qemu-options.hx b/qemu-options.hx
index b57489d7ca..1976c0ca3e 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -187,6 +187,7 @@ DEF("accel", HAS_ARG, QEMU_OPTION_accel,
     "                tb-size=n (TCG translation block cache size)\n"
     "                dirty-ring-size=n (KVM dirty ring GFN count, default 0)\n"
     "                notify-vmexit=run|internal-error|disable,notify-window=n (enable notify VM exit and set notify window, x86 only)\n"
+    "                pmu-cap-disabled=true|false (disable KVM_CAP_PMU_CAPABILITY, x86 only, default false)\n"
     "                thread=single|multi (enable multi-threaded TCG)\n", QEMU_ARCH_ALL)
 SRST
 ``-accel name[,prop=value[,...]]``
@@ -254,6 +255,12 @@ SRST
         open up for a specified of time (i.e. notify-window).
         Default: notify-vmexit=run,notify-window=0.
 
+    ``pmu-cap-disabled=true|false``
+        When the KVM accelerator is used, it controls whether to disable the
+        KVM_CAP_PMU_CAPABILITY via KVM_PMU_CAP_DISABLE. When disabled, the
+        PMU virtualization is disabled at the KVM module side. This is for
+        x86 host only.
+
 ERST
 
 DEF("smp", HAS_ARG, QEMU_OPTION_smp,
diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
index de531842f6..bf4136fa1b 100644
--- a/target/i386/kvm/kvm.c
+++ b/target/i386/kvm/kvm.c
@@ -129,6 +129,7 @@ static bool has_msr_ucode_rev;
 static bool has_msr_vmx_procbased_ctls2;
 static bool has_msr_perf_capabs;
 static bool has_msr_pkrs;
+static bool has_pmu_cap;
 
 static uint32_t has_architectural_pmu_version;
 static uint32_t num_architectural_pmu_gp_counters;
@@ -2767,6 +2768,23 @@ int kvm_arch_init(MachineState *ms, KVMState *s)
         }
     }
 
+    has_pmu_cap = kvm_check_extension(s, KVM_CAP_PMU_CAPABILITY);
+
+    if (s->pmu_cap_disabled) {
+        if (has_pmu_cap) {
+            ret = kvm_vm_enable_cap(s, KVM_CAP_PMU_CAPABILITY, 0,
+                                    KVM_PMU_CAP_DISABLE);
+            if (ret < 0) {
+                s->pmu_cap_disabled = false;
+                error_report("kvm: Failed to disable pmu cap: %s",
+                             strerror(-ret));
+            }
+        } else {
+            s->pmu_cap_disabled = false;
+            error_report("kvm: KVM_CAP_PMU_CAPABILITY is not supported");
+        }
+    }
+
     return 0;
 }
 
@@ -5951,6 +5969,28 @@ static void kvm_arch_set_xen_evtchn_max_pirq(Object *obj, Visitor *v,
     s->xen_evtchn_max_pirq = value;
 }
 
+static void kvm_set_pmu_cap_disabled(Object *obj, Visitor *v,
+                                     const char *name, void *opaque,
+                                     Error **errp)
+{
+    KVMState *s = KVM_STATE(obj);
+    bool pmu_cap_disabled;
+    Error *error = NULL;
+
+    if (s->fd != -1) {
+        error_setg(errp, "Cannot set properties after the accelerator has been initialized");
+        return;
+    }
+
+    visit_type_bool(v, name, &pmu_cap_disabled, &error);
+    if (error) {
+        error_propagate(errp, error);
+        return;
+    }
+
+    s->pmu_cap_disabled = pmu_cap_disabled;
+}
+
 void kvm_arch_accel_class_init(ObjectClass *oc)
 {
     object_class_property_add_enum(oc, "notify-vmexit", "NotifyVMexitOption",
@@ -5990,6 +6030,12 @@ void kvm_arch_accel_class_init(ObjectClass *oc)
                               NULL, NULL);
     object_class_property_set_description(oc, "xen-evtchn-max-pirq",
                                           "Maximum number of Xen PIRQs");
+
+    object_class_property_add(oc, "pmu-cap-disabled", "bool",
+                              NULL, kvm_set_pmu_cap_disabled,
+                              NULL, NULL);
+    object_class_property_set_description(oc, "pmu-cap-disabled",
+                                          "Disable KVM_CAP_PMU_CAPABILITY");
 }
 
 void kvm_set_max_apic_id(uint32_t max_apic_id)
-- 
2.34.1



* [PATCH RESEND v2 2/2] target/i386/kvm: get and put AMD pmu registers
  2023-06-21  1:38 [PATCH RESEND v2 0/2] target/i386/kvm: fix two svm pmu virtualization bugs Dongli Zhang
  2023-06-21  1:38 ` [PATCH RESEND v2 1/2] target/i386/kvm: introduce 'pmu-cap-disabled' to set KVM_PMU_CAP_DISABLE Dongli Zhang
@ 2023-06-21  1:38 ` Dongli Zhang
  2023-07-02 14:15   ` Like Xu
  1 sibling, 1 reply; 7+ messages in thread
From: Dongli Zhang @ 2023-06-21  1:38 UTC (permalink / raw)
  To: qemu-devel, kvm
  Cc: pbonzini, mtosatti, joe.jin, likexu, like.xu.linux, zhenyuw, groug, lyan

The QEMU side calls kvm_get_msrs() to save the pmu registers from the KVM
side to QEMU, and calls kvm_put_msrs() to store the pmu registers back to
the KVM side.

However, only the Intel gp/fixed/global pmu registers are involved. There
is no implementation for the AMD pmu registers. The
'has_architectural_pmu_version' and 'num_architectural_pmu_gp_counters'
are calculated at kvm_arch_init_vcpu() via cpuid(0xa). This does not work
for AMD. Before AMD PerfMonV2, the number of gp registers is decided based
on the CPU version.
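
For reference, the two AMD counter MSR layouts involved are (addresses as
defined in this patch; the PERFCTR_CORE layout interleaves CTLn/CTRn, which
is why a stride of 2 is used below):

legacy (4 counters, stride 1):
    MSR_K7_EVNTSEL0..3        0xc0010000 - 0xc0010003
    MSR_K7_PERFCTR0..3        0xc0010004 - 0xc0010007

PERFCTR_CORE (6 counters, stride 2):
    MSR_F15H_PERF_CTL0/CTR0   0xc0010200 / 0xc0010201
    ...
    MSR_F15H_PERF_CTL5/CTR5   0xc001020a / 0xc001020b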

This patch adds support for the AMD version=1 pmu, to get and put the AMD
pmu registers. Otherwise, there will be a bug:

1. The VM resets (e.g., via QEMU system_reset or VM kdump/kexec) while it
is running "perf top". The pmu registers are not disabled gracefully.

2. Although x86_cpu_reset() resets many registers to zero, kvm_put_msrs()
does not put the AMD pmu registers to the KVM side. As a result, some pmu
events are still enabled at the KVM side.

3. The KVM pmc_speculative_in_use() always returns true so that the events
will not be reclaimed. The kvm_pmc->perf_event is still active.

4. After the reboot, the VM kernel reports the below error:

[    0.092011] Performance Events: Fam17h+ core perfctr, Broken BIOS detected, complain to your hardware vendor.
[    0.092023] [Firmware Bug]: the BIOS has corrupted hw-PMU resources (MSR c0010200 is 530076)

5. In the worst case, the active kvm_pmc->perf_event can still inject
unknown NMIs randomly into the VM kernel.

[...] Uhhuh. NMI received for unknown reason 30 on CPU 0.

The patch fixes the issue by resetting the AMD pmu registers during the
reset.

Cc: Joe Jin <joe.jin@oracle.com>
Cc: Like Xu <likexu@tencent.com>
Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
---
 target/i386/cpu.h     |  5 +++
 target/i386/kvm/kvm.c | 83 +++++++++++++++++++++++++++++++++++++++++--
 2 files changed, 86 insertions(+), 2 deletions(-)

diff --git a/target/i386/cpu.h b/target/i386/cpu.h
index cd047e0410..b8ba72e87a 100644
--- a/target/i386/cpu.h
+++ b/target/i386/cpu.h
@@ -471,6 +471,11 @@ typedef enum X86Seg {
 #define MSR_CORE_PERF_GLOBAL_CTRL       0x38f
 #define MSR_CORE_PERF_GLOBAL_OVF_CTRL   0x390
 
+#define MSR_K7_EVNTSEL0                 0xc0010000
+#define MSR_K7_PERFCTR0                 0xc0010004
+#define MSR_F15H_PERF_CTL0              0xc0010200
+#define MSR_F15H_PERF_CTR0              0xc0010201
+
 #define MSR_MC0_CTL                     0x400
 #define MSR_MC0_STATUS                  0x401
 #define MSR_MC0_ADDR                    0x402
diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
index bf4136fa1b..a0f7273dad 100644
--- a/target/i386/kvm/kvm.c
+++ b/target/i386/kvm/kvm.c
@@ -2084,6 +2084,32 @@ int kvm_arch_init_vcpu(CPUState *cs)
         }
     }
 
+    /*
+     * If KVM_CAP_PMU_CAPABILITY is not supported, there is no way to
+     * disable the AMD pmu virtualization.
+     *
+     * If KVM_CAP_PMU_CAPABILITY is supported, kvm_state->pmu_cap_disabled
+     * indicates the KVM side has already disabled the pmu virtualization.
+     */
+    if (IS_AMD_CPU(env) && !cs->kvm_state->pmu_cap_disabled) {
+        int64_t family;
+
+        family = (env->cpuid_version >> 8) & 0xf;
+        if (family == 0xf) {
+            family += (env->cpuid_version >> 20) & 0xff;
+        }
+
+        if (family >= 6) {
+            has_architectural_pmu_version = 1;
+
+            if (env->features[FEAT_8000_0001_ECX] & CPUID_EXT3_PERFCORE) {
+                num_architectural_pmu_gp_counters = 6;
+            } else {
+                num_architectural_pmu_gp_counters = 4;
+            }
+        }
+    }
+
     cpu_x86_cpuid(env, 0x80000000, 0, &limit, &unused, &unused, &unused);
 
     for (i = 0x80000000; i <= limit; i++) {
@@ -3438,7 +3464,7 @@ static int kvm_put_msrs(X86CPU *cpu, int level)
             kvm_msr_entry_add(cpu, MSR_KVM_POLL_CONTROL, env->poll_control_msr);
         }
 
-        if (has_architectural_pmu_version > 0) {
+        if (has_architectural_pmu_version > 0 && IS_INTEL_CPU(env)) {
             if (has_architectural_pmu_version > 1) {
                 /* Stop the counter.  */
                 kvm_msr_entry_add(cpu, MSR_CORE_PERF_FIXED_CTR_CTRL, 0);
@@ -3469,6 +3495,26 @@ static int kvm_put_msrs(X86CPU *cpu, int level)
                                   env->msr_global_ctrl);
             }
         }
+
+        if (has_architectural_pmu_version > 0 && IS_AMD_CPU(env)) {
+            uint32_t sel_base = MSR_K7_EVNTSEL0;
+            uint32_t ctr_base = MSR_K7_PERFCTR0;
+            uint32_t step = 1;
+
+            if (num_architectural_pmu_gp_counters == 6) {
+                sel_base = MSR_F15H_PERF_CTL0;
+                ctr_base = MSR_F15H_PERF_CTR0;
+                step = 2;
+            }
+
+            for (i = 0; i < num_architectural_pmu_gp_counters; i++) {
+                kvm_msr_entry_add(cpu, ctr_base + i * step,
+                                  env->msr_gp_counters[i]);
+                kvm_msr_entry_add(cpu, sel_base + i * step,
+                                  env->msr_gp_evtsel[i]);
+            }
+        }
+
         /*
          * Hyper-V partition-wide MSRs: to avoid clearing them on cpu hot-add,
          * only sync them to KVM on the first cpu
@@ -3929,7 +3975,7 @@ static int kvm_get_msrs(X86CPU *cpu)
     if (env->features[FEAT_KVM] & (1 << KVM_FEATURE_POLL_CONTROL)) {
         kvm_msr_entry_add(cpu, MSR_KVM_POLL_CONTROL, 1);
     }
-    if (has_architectural_pmu_version > 0) {
+    if (has_architectural_pmu_version > 0 && IS_INTEL_CPU(env)) {
         if (has_architectural_pmu_version > 1) {
             kvm_msr_entry_add(cpu, MSR_CORE_PERF_FIXED_CTR_CTRL, 0);
             kvm_msr_entry_add(cpu, MSR_CORE_PERF_GLOBAL_CTRL, 0);
@@ -3945,6 +3991,25 @@ static int kvm_get_msrs(X86CPU *cpu)
         }
     }
 
+    if (has_architectural_pmu_version > 0 && IS_AMD_CPU(env)) {
+        uint32_t sel_base = MSR_K7_EVNTSEL0;
+        uint32_t ctr_base = MSR_K7_PERFCTR0;
+        uint32_t step = 1;
+
+        if (num_architectural_pmu_gp_counters == 6) {
+            sel_base = MSR_F15H_PERF_CTL0;
+            ctr_base = MSR_F15H_PERF_CTR0;
+            step = 2;
+        }
+
+        for (i = 0; i < num_architectural_pmu_gp_counters; i++) {
+            kvm_msr_entry_add(cpu, ctr_base + i * step,
+                              env->msr_gp_counters[i]);
+            kvm_msr_entry_add(cpu, sel_base + i * step,
+                              env->msr_gp_evtsel[i]);
+        }
+    }
+
     if (env->mcg_cap) {
         kvm_msr_entry_add(cpu, MSR_MCG_STATUS, 0);
         kvm_msr_entry_add(cpu, MSR_MCG_CTL, 0);
@@ -4230,6 +4295,20 @@ static int kvm_get_msrs(X86CPU *cpu)
         case MSR_P6_EVNTSEL0 ... MSR_P6_EVNTSEL0 + MAX_GP_COUNTERS - 1:
             env->msr_gp_evtsel[index - MSR_P6_EVNTSEL0] = msrs[i].data;
             break;
+        case MSR_K7_EVNTSEL0 ... MSR_K7_EVNTSEL0 + 3:
+            env->msr_gp_evtsel[index - MSR_K7_EVNTSEL0] = msrs[i].data;
+            break;
+        case MSR_K7_PERFCTR0 ... MSR_K7_PERFCTR0 + 3:
+            env->msr_gp_counters[index - MSR_K7_PERFCTR0] = msrs[i].data;
+            break;
+        case MSR_F15H_PERF_CTL0 ... MSR_F15H_PERF_CTL0 + 0xb:
+            index = index - MSR_F15H_PERF_CTL0;
+            if (index & 0x1) {
+                env->msr_gp_counters[index] = msrs[i].data;
+            } else {
+                env->msr_gp_evtsel[index] = msrs[i].data;
+            }
+            break;
         case HV_X64_MSR_HYPERCALL:
             env->msr_hv_hypercall = msrs[i].data;
             break;
-- 
2.34.1



* Re: [PATCH RESEND v2 1/2] target/i386/kvm: introduce 'pmu-cap-disabled' to set KVM_PMU_CAP_DISABLE
  2023-06-21  1:38 ` [PATCH RESEND v2 1/2] target/i386/kvm: introduce 'pmu-cap-disabled' to set KVM_PMU_CAP_DISABLE Dongli Zhang
@ 2023-07-02 13:41   ` Like Xu
  2023-07-03 21:58     ` Dongli Zhang
  0 siblings, 1 reply; 7+ messages in thread
From: Like Xu @ 2023-07-02 13:41 UTC (permalink / raw)
  To: Dongli Zhang
  Cc: qemu-devel, kvm, pbonzini, mtosatti, joe.jin, zhenyuw, groug, lyan

On Wed, Jun 21, 2023 at 9:39 AM Dongli Zhang <dongli.zhang@oracle.com> wrote:
>
> The "perf stat" at the VM side still works even we set "-cpu host,-pmu" in
> the QEMU command line. That is, neither "-cpu host,-pmu" nor "-cpu EPYC"
> could disable the pmu virtualization in an AMD environment.
>
> We still see below at VM kernel side ...
>
> [    0.510611] Performance Events: Fam17h+ core perfctr, AMD PMU driver.
>
> ... although we expect something like below.
>
> [    0.596381] Performance Events: PMU not available due to virtualization, using software events only.
> [    0.600972] NMI watchdog: Perf NMI watchdog permanently disabled
>
> This is because the AMD pmu (v1) does not rely on cpuid to decide if the
> pmu virtualization is supported.
>
> We introduce a new property 'pmu-cap-disabled' for KVM accel to set
> KVM_PMU_CAP_DISABLE if KVM_CAP_PMU_CAPABILITY is supported. Only x86 host
> is supported because currently KVM uses KVM_CAP_PMU_CAPABILITY only for
> x86.

We may check cpu->enable_pmu when creating the first CPU or a BSP one
(before it gets running) and then choose whether to disable the guest pmu
using the vm ioctl KVM_CAP_PMU_CAPABILITY. Introducing a new property is
not really acceptable if there are other options.
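
A rough sketch of that alternative (kvm_maybe_disable_pmu() is only an
illustrative name; it would have to run before the first vcpu fd is
created, since KVM rejects KVM_PMU_CAP_DISABLE once vcpus exist):

/* call once, right before the first KVM vcpu fd is created */
static void kvm_maybe_disable_pmu(CPUState *cs)
{
    X86CPU *cpu = X86_CPU(cs);
    int ret;

    if (cpu->enable_pmu ||
        !kvm_check_extension(cs->kvm_state, KVM_CAP_PMU_CAPABILITY)) {
        return;
    }

    ret = kvm_vm_enable_cap(cs->kvm_state, KVM_CAP_PMU_CAPABILITY, 0,
                            KVM_PMU_CAP_DISABLE);
    if (ret < 0) {
        error_report("kvm: failed to disable pmu cap: %s",
                     strerror(-ret));
    }
}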

>
> Cc: Joe Jin <joe.jin@oracle.com>
> Cc: Like Xu <likexu@tencent.com>
> Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
> ---
> Changed since v1:
> - In version 1 we did not introduce the new property. We ioctl
>   KVM_PMU_CAP_DISABLE only before the creation of the 1st vcpu. We had
>   introduced a helpfer function to do this job before creating the 1st
>   KVM vcpu in v1.
>
>  accel/kvm/kvm-all.c      |  1 +
>  include/sysemu/kvm_int.h |  1 +
>  qemu-options.hx          |  7 ++++++
>  target/i386/kvm/kvm.c    | 46 ++++++++++++++++++++++++++++++++++++++++
>  4 files changed, 55 insertions(+)
>
> diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
> index 7679f397ae..238098e991 100644
> --- a/accel/kvm/kvm-all.c
> +++ b/accel/kvm/kvm-all.c
> @@ -3763,6 +3763,7 @@ static void kvm_accel_instance_init(Object *obj)
>      s->xen_version = 0;
>      s->xen_gnttab_max_frames = 64;
>      s->xen_evtchn_max_pirq = 256;
> +    s->pmu_cap_disabled = false;
>  }
>
>  /**
> diff --git a/include/sysemu/kvm_int.h b/include/sysemu/kvm_int.h
> index 511b42bde5..cbbe08ec54 100644
> --- a/include/sysemu/kvm_int.h
> +++ b/include/sysemu/kvm_int.h
> @@ -123,6 +123,7 @@ struct KVMState
>      uint32_t xen_caps;
>      uint16_t xen_gnttab_max_frames;
>      uint16_t xen_evtchn_max_pirq;
> +    bool pmu_cap_disabled;
>  };
>
>  void kvm_memory_listener_register(KVMState *s, KVMMemoryListener *kml,
> diff --git a/qemu-options.hx b/qemu-options.hx
> index b57489d7ca..1976c0ca3e 100644
> --- a/qemu-options.hx
> +++ b/qemu-options.hx
> @@ -187,6 +187,7 @@ DEF("accel", HAS_ARG, QEMU_OPTION_accel,
>      "                tb-size=n (TCG translation block cache size)\n"
>      "                dirty-ring-size=n (KVM dirty ring GFN count, default 0)\n"
>      "                notify-vmexit=run|internal-error|disable,notify-window=n (enable notify VM exit and set notify window, x86 only)\n"
> +    "                pmu-cap-disabled=true|false (disable KVM_CAP_PMU_CAPABILITY, x86 only, default false)\n"
>      "                thread=single|multi (enable multi-threaded TCG)\n", QEMU_ARCH_ALL)
>  SRST
>  ``-accel name[,prop=value[,...]]``
> @@ -254,6 +255,12 @@ SRST
>          open up for a specified of time (i.e. notify-window).
>          Default: notify-vmexit=run,notify-window=0.
>
> +    ``pmu-cap-disabled=true|false``
> +        When the KVM accelerator is used, it controls whether to disable the
> +        KVM_CAP_PMU_CAPABILITY via KVM_PMU_CAP_DISABLE. When disabled, the
> +        PMU virtualization is disabled at the KVM module side. This is for
> +        x86 host only.
> +
>  ERST
>
>  DEF("smp", HAS_ARG, QEMU_OPTION_smp,
> diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
> index de531842f6..bf4136fa1b 100644
> --- a/target/i386/kvm/kvm.c
> +++ b/target/i386/kvm/kvm.c
> @@ -129,6 +129,7 @@ static bool has_msr_ucode_rev;
>  static bool has_msr_vmx_procbased_ctls2;
>  static bool has_msr_perf_capabs;
>  static bool has_msr_pkrs;
> +static bool has_pmu_cap;
>
>  static uint32_t has_architectural_pmu_version;
>  static uint32_t num_architectural_pmu_gp_counters;
> @@ -2767,6 +2768,23 @@ int kvm_arch_init(MachineState *ms, KVMState *s)
>          }
>      }
>
> +    has_pmu_cap = kvm_check_extension(s, KVM_CAP_PMU_CAPABILITY);
> +
> +    if (s->pmu_cap_disabled) {
> +        if (has_pmu_cap) {
> +            ret = kvm_vm_enable_cap(s, KVM_CAP_PMU_CAPABILITY, 0,
> +                                    KVM_PMU_CAP_DISABLE);
> +            if (ret < 0) {
> +                s->pmu_cap_disabled = false;
> +                error_report("kvm: Failed to disable pmu cap: %s",
> +                             strerror(-ret));
> +            }
> +        } else {
> +            s->pmu_cap_disabled = false;
> +            error_report("kvm: KVM_CAP_PMU_CAPABILITY is not supported");
> +        }
> +    }
> +
>      return 0;
>  }
>
> @@ -5951,6 +5969,28 @@ static void kvm_arch_set_xen_evtchn_max_pirq(Object *obj, Visitor *v,
>      s->xen_evtchn_max_pirq = value;
>  }
>
> +static void kvm_set_pmu_cap_disabled(Object *obj, Visitor *v,
> +                                     const char *name, void *opaque,
> +                                     Error **errp)
> +{
> +    KVMState *s = KVM_STATE(obj);
> +    bool pmu_cap_disabled;
> +    Error *error = NULL;
> +
> +    if (s->fd != -1) {
> +        error_setg(errp, "Cannot set properties after the accelerator has been initialized");
> +        return;
> +    }
> +
> +    visit_type_bool(v, name, &pmu_cap_disabled, &error);
> +    if (error) {
> +        error_propagate(errp, error);
> +        return;
> +    }
> +
> +    s->pmu_cap_disabled = pmu_cap_disabled;
> +}
> +
>  void kvm_arch_accel_class_init(ObjectClass *oc)
>  {
>      object_class_property_add_enum(oc, "notify-vmexit", "NotifyVMexitOption",
> @@ -5990,6 +6030,12 @@ void kvm_arch_accel_class_init(ObjectClass *oc)
>                                NULL, NULL);
>      object_class_property_set_description(oc, "xen-evtchn-max-pirq",
>                                            "Maximum number of Xen PIRQs");
> +
> +    object_class_property_add(oc, "pmu-cap-disabled", "bool",
> +                              NULL, kvm_set_pmu_cap_disabled,
> +                              NULL, NULL);
> +    object_class_property_set_description(oc, "pmu-cap-disabled",
> +                                          "Disable KVM_CAP_PMU_CAPABILITY");
>  }
>
>  void kvm_set_max_apic_id(uint32_t max_apic_id)
> --
> 2.34.1
>


* Re: [PATCH RESEND v2 2/2] target/i386/kvm: get and put AMD pmu registers
  2023-06-21  1:38 ` [PATCH RESEND v2 2/2] target/i386/kvm: get and put AMD pmu registers Dongli Zhang
@ 2023-07-02 14:15   ` Like Xu
  2023-07-03 23:20     ` Dongli Zhang
  0 siblings, 1 reply; 7+ messages in thread
From: Like Xu @ 2023-07-02 14:15 UTC (permalink / raw)
  To: Dongli Zhang
  Cc: qemu-devel, pbonzini, mtosatti, joe.jin, zhenyuw, groug, lyan

On Wed, Jun 21, 2023 at 9:39 AM Dongli Zhang <dongli.zhang@oracle.com> wrote:
>
> The QEMU side calls kvm_get_msrs() to save the pmu registers from the KVM
> side to QEMU, and calls kvm_put_msrs() to store the pmu registers back to
> the KVM side.
>
> However, only the Intel gp/fixed/global pmu registers are involved. There
> is not any implementation for AMD pmu registers. The
> 'has_architectural_pmu_version' and 'num_architectural_pmu_gp_counters' are
> calculated at kvm_arch_init_vcpu() via cpuid(0xa). This does not work for
> AMD. Before AMD PerfMonV2, the number of gp registers is decided based on
> the CPU version.

Updating the relevant documentation to clarify this part of the deficiency
would be a good first step.

>
> This patch is to add the support for AMD version=1 pmu, to get and put AMD
> pmu registers. Otherwise, there will be a bug:

AMD version=1 ?

AMD does not have a version 1; it jumps directly to 2, perhaps because of
x86 compatibility. AMD also does not have the so-called architectural pmu.
Maybe has_architectural_pmu_version needs to be renamed for AMD.

It might be more helpful to add similar support for AMD PerfMonV2.

>
> 1. The VM resets (e.g., via QEMU system_reset or VM kdump/kexec) while it
> is running "perf top". The pmu registers are not disabled gracefully.
>
> 2. Although the x86_cpu_reset() resets many registers to zero, the
> kvm_put_msrs() does not puts AMD pmu registers to KVM side. As a result,
> some pmu events are still enabled at the KVM side.

I agree that we should have done that, especially if guest pmu is enabled
on the AMD platforms.

>
> 3. The KVM pmc_speculative_in_use() always returns true so that the events
> will not be reclaimed. The kvm_pmc->perf_event is still active.
>
> 4. After the reboot, the VM kernel reports below error:
>
> [    0.092011] Performance Events: Fam17h+ core perfctr, Broken BIOS detected, complain to your hardware vendor.
> [    0.092023] [Firmware Bug]: the BIOS has corrupted hw-PMU resources (MSR c0010200 is 530076)
>
> 5. In a worse case, the active kvm_pmc->perf_event is still able to
> inject unknown NMIs randomly to the VM kernel.
>
> [...] Uhhuh. NMI received for unknown reason 30 on CPU 0.
>
> The patch is to fix the issue by resetting AMD pmu registers during the
> reset.

I'm not sure if the qemu_reset or VM kexec will necessarily trigger
kvm::amd_pmu_reset().

>
> Cc: Joe Jin <joe.jin@oracle.com>
> Cc: Like Xu <likexu@tencent.com>
> Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
> ---
>  target/i386/cpu.h     |  5 +++
>  target/i386/kvm/kvm.c | 83 +++++++++++++++++++++++++++++++++++++++++--
>  2 files changed, 86 insertions(+), 2 deletions(-)
>
> diff --git a/target/i386/cpu.h b/target/i386/cpu.h
> index cd047e0410..b8ba72e87a 100644
> --- a/target/i386/cpu.h
> +++ b/target/i386/cpu.h
> @@ -471,6 +471,11 @@ typedef enum X86Seg {
>  #define MSR_CORE_PERF_GLOBAL_CTRL       0x38f
>  #define MSR_CORE_PERF_GLOBAL_OVF_CTRL   0x390
>
> +#define MSR_K7_EVNTSEL0                 0xc0010000
> +#define MSR_K7_PERFCTR0                 0xc0010004
> +#define MSR_F15H_PERF_CTL0              0xc0010200
> +#define MSR_F15H_PERF_CTR0              0xc0010201
> +
>  #define MSR_MC0_CTL                     0x400
>  #define MSR_MC0_STATUS                  0x401
>  #define MSR_MC0_ADDR                    0x402
> diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
> index bf4136fa1b..a0f7273dad 100644
> --- a/target/i386/kvm/kvm.c
> +++ b/target/i386/kvm/kvm.c
> @@ -2084,6 +2084,32 @@ int kvm_arch_init_vcpu(CPUState *cs)
>          }
>      }
>
> +    /*
> +     * If KVM_CAP_PMU_CAPABILITY is not supported, there is no way to
> +     * disable the AMD pmu virtualization.
> +     *
> +     * If KVM_CAP_PMU_CAPABILITY is supported, kvm_state->pmu_cap_disabled
> +     * indicates the KVM side has already disabled the pmu virtualization.
> +     */
> +    if (IS_AMD_CPU(env) && !cs->kvm_state->pmu_cap_disabled) {
> +        int64_t family;
> +
> +        family = (env->cpuid_version >> 8) & 0xf;
> +        if (family == 0xf) {
> +            family += (env->cpuid_version >> 20) & 0xff;
> +        }
> +
> +        if (family >= 6) {
> +            has_architectural_pmu_version = 1;
> +
> +            if (env->features[FEAT_8000_0001_ECX] & CPUID_EXT3_PERFCORE) {
> +                num_architectural_pmu_gp_counters = 6;

Please make the code a little more readable with some macro definitions.

#define AMD64_NUM_COUNTERS 4
#define AMD64_NUM_COUNTERS_CORE 6

> +            } else {
> +                num_architectural_pmu_gp_counters = 4;
> +            }
> +        }
> +    }
> +
>      cpu_x86_cpuid(env, 0x80000000, 0, &limit, &unused, &unused, &unused);
>
>      for (i = 0x80000000; i <= limit; i++) {
> @@ -3438,7 +3464,7 @@ static int kvm_put_msrs(X86CPU *cpu, int level)
>              kvm_msr_entry_add(cpu, MSR_KVM_POLL_CONTROL, env->poll_control_msr);
>          }
>
> -        if (has_architectural_pmu_version > 0) {
> +        if (has_architectural_pmu_version > 0 && IS_INTEL_CPU(env)) {

It seems to be saying here that it is possible to perform kvm_msr_entry_add()
for the Intel-related pmu on AMD platforms. So the reverse is expected to be
feasible as well. Why is the distinction needed?

>              if (has_architectural_pmu_version > 1) {
>                  /* Stop the counter.  */
>                  kvm_msr_entry_add(cpu, MSR_CORE_PERF_FIXED_CTR_CTRL, 0);
> @@ -3469,6 +3495,26 @@ static int kvm_put_msrs(X86CPU *cpu, int level)
>                                    env->msr_global_ctrl);
>              }
>          }
> +
> +        if (has_architectural_pmu_version > 0 && IS_AMD_CPU(env)) {
> +            uint32_t sel_base = MSR_K7_EVNTSEL0;
> +            uint32_t ctr_base = MSR_K7_PERFCTR0;
> +            uint32_t step = 1;
> +
> +            if (num_architectural_pmu_gp_counters == 6) {
> +                sel_base = MSR_F15H_PERF_CTL0;
> +                ctr_base = MSR_F15H_PERF_CTR0;
> +                step = 2;
> +            }
> +
> +            for (i = 0; i < num_architectural_pmu_gp_counters; i++) {
> +                kvm_msr_entry_add(cpu, ctr_base + i * step,
> +                                  env->msr_gp_counters[i]);
> +                kvm_msr_entry_add(cpu, sel_base + i * step,
> +                                  env->msr_gp_evtsel[i]);
> +            }
> +        }
> +
>          /*
>           * Hyper-V partition-wide MSRs: to avoid clearing them on cpu hot-add,
>           * only sync them to KVM on the first cpu
> @@ -3929,7 +3975,7 @@ static int kvm_get_msrs(X86CPU *cpu)
>      if (env->features[FEAT_KVM] & (1 << KVM_FEATURE_POLL_CONTROL)) {
>          kvm_msr_entry_add(cpu, MSR_KVM_POLL_CONTROL, 1);
>      }
> -    if (has_architectural_pmu_version > 0) {
> +    if (has_architectural_pmu_version > 0 && IS_INTEL_CPU(env)) {
>          if (has_architectural_pmu_version > 1) {
>              kvm_msr_entry_add(cpu, MSR_CORE_PERF_FIXED_CTR_CTRL, 0);
>              kvm_msr_entry_add(cpu, MSR_CORE_PERF_GLOBAL_CTRL, 0);
> @@ -3945,6 +3991,25 @@ static int kvm_get_msrs(X86CPU *cpu)
>          }
>      }
>
> +    if (has_architectural_pmu_version > 0 && IS_AMD_CPU(env)) {
> +        uint32_t sel_base = MSR_K7_EVNTSEL0;
> +        uint32_t ctr_base = MSR_K7_PERFCTR0;
> +        uint32_t step = 1;
> +
> +        if (num_architectural_pmu_gp_counters == 6) {
> +            sel_base = MSR_F15H_PERF_CTL0;
> +            ctr_base = MSR_F15H_PERF_CTR0;
> +            step = 2;

Perhaps it would be more helpful to add a little commentary on why
step 2 is needed.

> +        }
> +
> +        for (i = 0; i < num_architectural_pmu_gp_counters; i++) {
> +            kvm_msr_entry_add(cpu, ctr_base + i * step,
> +                              env->msr_gp_counters[i]);
> +            kvm_msr_entry_add(cpu, sel_base + i * step,
> +                              env->msr_gp_evtsel[i]);
> +        }
> +    }
> +
>      if (env->mcg_cap) {
>          kvm_msr_entry_add(cpu, MSR_MCG_STATUS, 0);
>          kvm_msr_entry_add(cpu, MSR_MCG_CTL, 0);
> @@ -4230,6 +4295,20 @@ static int kvm_get_msrs(X86CPU *cpu)
>          case MSR_P6_EVNTSEL0 ... MSR_P6_EVNTSEL0 + MAX_GP_COUNTERS - 1:
>              env->msr_gp_evtsel[index - MSR_P6_EVNTSEL0] = msrs[i].data;
>              break;
> +        case MSR_K7_EVNTSEL0 ... MSR_K7_EVNTSEL0 + 3:
> +            env->msr_gp_evtsel[index - MSR_K7_EVNTSEL0] = msrs[i].data;
> +            break;
> +        case MSR_K7_PERFCTR0 ... MSR_K7_PERFCTR0 + 3:
> +            env->msr_gp_counters[index - MSR_K7_PERFCTR0] = msrs[i].data;
> +            break;
> +        case MSR_F15H_PERF_CTL0 ... MSR_F15H_PERF_CTL0 + 0xb:
> +            index = index - MSR_F15H_PERF_CTL0;
> +            if (index & 0x1) {
> +                env->msr_gp_counters[index] = msrs[i].data;
> +            } else {
> +                env->msr_gp_evtsel[index] = msrs[i].data;
> +            }
> +            break;
>          case HV_X64_MSR_HYPERCALL:
>              env->msr_hv_hypercall = msrs[i].data;
>              break;
> --
> 2.34.1
>



* Re: [PATCH RESEND v2 1/2] target/i386/kvm: introduce 'pmu-cap-disabled' to set KVM_PMU_CAP_DISABLE
  2023-07-02 13:41   ` Like Xu
@ 2023-07-03 21:58     ` Dongli Zhang
  0 siblings, 0 replies; 7+ messages in thread
From: Dongli Zhang @ 2023-07-03 21:58 UTC (permalink / raw)
  To: Like Xu, groug
  Cc: qemu-devel, kvm, pbonzini, mtosatti, joe.jin, zhenyuw, lyan

Hi Like,

On 7/2/23 06:41, Like Xu wrote:
> On Wed, Jun 21, 2023 at 9:39 AM Dongli Zhang <dongli.zhang@oracle.com> wrote:
>>
>> The "perf stat" at the VM side still works even we set "-cpu host,-pmu" in
>> the QEMU command line. That is, neither "-cpu host,-pmu" nor "-cpu EPYC"
>> could disable the pmu virtualization in an AMD environment.
>>
>> We still see below at VM kernel side ...
>>
>> [    0.510611] Performance Events: Fam17h+ core perfctr, AMD PMU driver.
>>
>> ... although we expect something like below.
>>
>> [    0.596381] Performance Events: PMU not available due to virtualization, using software events only.
>> [    0.600972] NMI watchdog: Perf NMI watchdog permanently disabled
>>
>> This is because the AMD pmu (v1) does not rely on cpuid to decide if the
>> pmu virtualization is supported.
>>
>> We introduce a new property 'pmu-cap-disabled' for KVM accel to set
>> KVM_PMU_CAP_DISABLE if KVM_CAP_PMU_CAPABILITY is supported. Only x86 host
>> is supported because currently KVM uses KVM_CAP_PMU_CAPABILITY only for
>> x86.
> 
> We may check cpu->enable_pmu when creating the first CPU or a BSP one
> (before it gets running) and then choose whether to disable guest pmu using
> vm ioctl KVM_CAP_PMU_CAPABILITY. Introducing a new property is not too
> acceptable if there are other options.

In v1 of the implementation, we did something similar: not based on the
cpu_index (or BSP), but by introducing a helper before creating the 1st KVM
vcpu to let the further implementation decide. We issued the
KVM_CAP_PMU_CAPABILITY ioctl once in that helper.

[PATCH 1/3] kvm: introduce a helper before creating the 1st vcpu
https://lore.kernel.org/all/20221119122901.2469-2-dongli.zhang@oracle.com/

[PATCH 2/3] i386: kvm: disable KVM_CAP_PMU_CAPABILITY if "pmu" is disabled
https://lore.kernel.org/all/20221119122901.2469-3-dongli.zhang@oracle.com/


The below was the suggestion from Greg Kurz about using a per-vCPU property
to control a per-VM cap:

"It doesn't seem conceptually correct to configure VM level stuff out of
a vCPU property, which could theoretically be different for each vCPU,
even if this isn't the case with the current code base.

Maybe consider controlling PMU with a machine property and this
could be done in kvm_arch_init() like other VM level stuff ?"

Would you mind commenting on that?

Thank you very much!

Dongli Zhang

> 
>>
>> Cc: Joe Jin <joe.jin@oracle.com>
>> Cc: Like Xu <likexu@tencent.com>
>> Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
>> ---
>> Changed since v1:
>> - In version 1 we did not introduce the new property. We ioctl
>>   KVM_PMU_CAP_DISABLE only before the creation of the 1st vcpu. We had
>>   introduced a helpfer function to do this job before creating the 1st
>>   KVM vcpu in v1.
>>
>>  accel/kvm/kvm-all.c      |  1 +
>>  include/sysemu/kvm_int.h |  1 +
>>  qemu-options.hx          |  7 ++++++
>>  target/i386/kvm/kvm.c    | 46 ++++++++++++++++++++++++++++++++++++++++
>>  4 files changed, 55 insertions(+)
>>
>> diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
>> index 7679f397ae..238098e991 100644
>> --- a/accel/kvm/kvm-all.c
>> +++ b/accel/kvm/kvm-all.c
>> @@ -3763,6 +3763,7 @@ static void kvm_accel_instance_init(Object *obj)
>>      s->xen_version = 0;
>>      s->xen_gnttab_max_frames = 64;
>>      s->xen_evtchn_max_pirq = 256;
>> +    s->pmu_cap_disabled = false;
>>  }
>>
>>  /**
>> diff --git a/include/sysemu/kvm_int.h b/include/sysemu/kvm_int.h
>> index 511b42bde5..cbbe08ec54 100644
>> --- a/include/sysemu/kvm_int.h
>> +++ b/include/sysemu/kvm_int.h
>> @@ -123,6 +123,7 @@ struct KVMState
>>      uint32_t xen_caps;
>>      uint16_t xen_gnttab_max_frames;
>>      uint16_t xen_evtchn_max_pirq;
>> +    bool pmu_cap_disabled;
>>  };
>>
>>  void kvm_memory_listener_register(KVMState *s, KVMMemoryListener *kml,
>> diff --git a/qemu-options.hx b/qemu-options.hx
>> index b57489d7ca..1976c0ca3e 100644
>> --- a/qemu-options.hx
>> +++ b/qemu-options.hx
>> @@ -187,6 +187,7 @@ DEF("accel", HAS_ARG, QEMU_OPTION_accel,
>>      "                tb-size=n (TCG translation block cache size)\n"
>>      "                dirty-ring-size=n (KVM dirty ring GFN count, default 0)\n"
>>      "                notify-vmexit=run|internal-error|disable,notify-window=n (enable notify VM exit and set notify window, x86 only)\n"
>> +    "                pmu-cap-disabled=true|false (disable KVM_CAP_PMU_CAPABILITY, x86 only, default false)\n"
>>      "                thread=single|multi (enable multi-threaded TCG)\n", QEMU_ARCH_ALL)
>>  SRST
>>  ``-accel name[,prop=value[,...]]``
>> @@ -254,6 +255,12 @@ SRST
>>          open up for a specified of time (i.e. notify-window).
>>          Default: notify-vmexit=run,notify-window=0.
>>
>> +    ``pmu-cap-disabled=true|false``
>> +        When the KVM accelerator is used, it controls whether to disable the
>> +        KVM_CAP_PMU_CAPABILITY via KVM_PMU_CAP_DISABLE. When disabled, the
>> +        PMU virtualization is disabled at the KVM module side. This is for
>> +        x86 host only.
>> +
>>  ERST
>>
>>  DEF("smp", HAS_ARG, QEMU_OPTION_smp,
>> diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
>> index de531842f6..bf4136fa1b 100644
>> --- a/target/i386/kvm/kvm.c
>> +++ b/target/i386/kvm/kvm.c
>> @@ -129,6 +129,7 @@ static bool has_msr_ucode_rev;
>>  static bool has_msr_vmx_procbased_ctls2;
>>  static bool has_msr_perf_capabs;
>>  static bool has_msr_pkrs;
>> +static bool has_pmu_cap;
>>
>>  static uint32_t has_architectural_pmu_version;
>>  static uint32_t num_architectural_pmu_gp_counters;
>> @@ -2767,6 +2768,23 @@ int kvm_arch_init(MachineState *ms, KVMState *s)
>>          }
>>      }
>>
>> +    has_pmu_cap = kvm_check_extension(s, KVM_CAP_PMU_CAPABILITY);
>> +
>> +    if (s->pmu_cap_disabled) {
>> +        if (has_pmu_cap) {
>> +            ret = kvm_vm_enable_cap(s, KVM_CAP_PMU_CAPABILITY, 0,
>> +                                    KVM_PMU_CAP_DISABLE);
>> +            if (ret < 0) {
>> +                s->pmu_cap_disabled = false;
>> +                error_report("kvm: Failed to disable pmu cap: %s",
>> +                             strerror(-ret));
>> +            }
>> +        } else {
>> +            s->pmu_cap_disabled = false;
>> +            error_report("kvm: KVM_CAP_PMU_CAPABILITY is not supported");
>> +        }
>> +    }
>> +
>>      return 0;
>>  }
>>
>> @@ -5951,6 +5969,28 @@ static void kvm_arch_set_xen_evtchn_max_pirq(Object *obj, Visitor *v,
>>      s->xen_evtchn_max_pirq = value;
>>  }
>>
>> +static void kvm_set_pmu_cap_disabled(Object *obj, Visitor *v,
>> +                                     const char *name, void *opaque,
>> +                                     Error **errp)
>> +{
>> +    KVMState *s = KVM_STATE(obj);
>> +    bool pmu_cap_disabled;
>> +    Error *error = NULL;
>> +
>> +    if (s->fd != -1) {
>> +        error_setg(errp, "Cannot set properties after the accelerator has been initialized");
>> +        return;
>> +    }
>> +
>> +    visit_type_bool(v, name, &pmu_cap_disabled, &error);
>> +    if (error) {
>> +        error_propagate(errp, error);
>> +        return;
>> +    }
>> +
>> +    s->pmu_cap_disabled = pmu_cap_disabled;
>> +}
>> +
>>  void kvm_arch_accel_class_init(ObjectClass *oc)
>>  {
>>      object_class_property_add_enum(oc, "notify-vmexit", "NotifyVMexitOption",
>> @@ -5990,6 +6030,12 @@ void kvm_arch_accel_class_init(ObjectClass *oc)
>>                                NULL, NULL);
>>      object_class_property_set_description(oc, "xen-evtchn-max-pirq",
>>                                            "Maximum number of Xen PIRQs");
>> +
>> +    object_class_property_add(oc, "pmu-cap-disabled", "bool",
>> +                              NULL, kvm_set_pmu_cap_disabled,
>> +                              NULL, NULL);
>> +    object_class_property_set_description(oc, "pmu-cap-disabled",
>> +                                          "Disable KVM_CAP_PMU_CAPABILITY");
>>  }
>>
>>  void kvm_set_max_apic_id(uint32_t max_apic_id)
>> --
>> 2.34.1
>>


* Re: [PATCH RESEND v2 2/2] target/i386/kvm: get and put AMD pmu registers
  2023-07-02 14:15   ` Like Xu
@ 2023-07-03 23:20     ` Dongli Zhang
  0 siblings, 0 replies; 7+ messages in thread
From: Dongli Zhang @ 2023-07-03 23:20 UTC (permalink / raw)
  To: Like Xu; +Cc: qemu-devel, pbonzini, mtosatti, joe.jin, zhenyuw, groug, lyan

Hi Like,

On 7/2/23 07:15, Like Xu wrote:
> On Wed, Jun 21, 2023 at 9:39 AM Dongli Zhang <dongli.zhang@oracle.com> wrote:
>>
>> The QEMU side calls kvm_get_msrs() to save the pmu registers from the KVM
>> side to QEMU, and calls kvm_put_msrs() to store the pmu registers back to
>> the KVM side.
>>
>> However, only the Intel gp/fixed/global pmu registers are involved. There
>> is not any implementation for AMD pmu registers. The
>> 'has_architectural_pmu_version' and 'num_architectural_pmu_gp_counters' are
>> calculated at kvm_arch_init_vcpu() via cpuid(0xa). This does not work for
>> AMD. Before AMD PerfMonV2, the number of gp registers is decided based on
>> the CPU version.
> 
> Updating the relevant documentation to clarify this part of the deficiency
> would be a good first step.

Would you mind suggesting which doc this TODO/deficiency should be added to?

The only place I can find is to add a new TODO under docs/system/i386, but I
am not sure it is worth it. This bugfix is not complex.

> 
>>
>> This patch is to add the support for AMD version=1 pmu, to get and put AMD
>> pmu registers. Otherwise, there will be a bug:
> 
> AMD version=1 ?
> 
> AMD does not have version 1, just directly has 2, perhaps because of x86
> compatibility. AMD also does not have the so-called architectural pmu.
> Maybe need to rename has_architectural_pmu_version for AMD.
> 

Thank you very much for the explanation. I will use version 2.

> It might be more helpful to add similar support for AMD PerfMonV2.

Yes, I will do that. At that time, the AMD PerfMonV2 KVM patchset (from you)
was still in progress.

I saw the pull request from Paolo today. I will add that in v3.

> 
>>
>> 1. The VM resets (e.g., via QEMU system_reset or VM kdump/kexec) while it
>> is running "perf top". The pmu registers are not disabled gracefully.
>>
>> 2. Although the x86_cpu_reset() resets many registers to zero, the
>> kvm_put_msrs() does not puts AMD pmu registers to KVM side. As a result,
>> some pmu events are still enabled at the KVM side.
> 
> I agree that we should have done that, especially if guest pmu is enabled
> on the AMD platforms.
> 
>>
>> 3. The KVM pmc_speculative_in_use() always returns true so that the events
>> will not be reclaimed. The kvm_pmc->perf_event is still active.
>>
>> 4. After the reboot, the VM kernel reports below error:
>>
>> [    0.092011] Performance Events: Fam17h+ core perfctr, Broken BIOS detected, complain to your hardware vendor.
>> [    0.092023] [Firmware Bug]: the BIOS has corrupted hw-PMU resources (MSR c0010200 is 530076)
>>
>> 5. In a worse case, the active kvm_pmc->perf_event is still able to
>> inject unknown NMIs randomly to the VM kernel.
>>
>> [...] Uhhuh. NMI received for unknown reason 30 on CPU 0.
>>
>> The patch is to fix the issue by resetting AMD pmu registers during the
>> reset.
> 
> I'm not sure if the qemu_reset or VM kexec will necessarily trigger
> kvm::amd_pmu_reset().

According to the mainline linux kernel:

kvm_vcpu_reset()
-> kvm_pmu_reset()
   -> amd_pmu_reset()

The PMU is not reset when init_event==true, that is, when processing INIT
(see line 12049 below):

11975 void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
11976 {
... ...
12049         if (!init_event) {
12050                 kvm_pmu_reset(vcpu);
12051                 vcpu->arch.smbase = 0x30000;
12052
12053                 vcpu->arch.msr_misc_features_enables = 0;
12054                 vcpu->arch.ia32_misc_enable_msr = MSR_IA32_MISC_ENABLE_PEBS_UNAVAIL |
12055                                                   MSR_IA32_MISC_ENABLE_BTS_UNAVAIL;
12056
12057                 __kvm_set_xcr(vcpu, 0, XFEATURE_MASK_FP);
12058                 __kvm_set_msr(vcpu, MSR_IA32_XSS, 0, true);
12059         }

According to the below ...

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d28bc9dd25ce023270d2e039e7c98d38ecbf7758

... "INIT does not initialize the FPU (including MMX, XMM, YMM, etc.), TSC, PMU,
MSRs (in general), MTRRs machine-check, APIC ID, APIC arbitration ID and BSP."

That's why I initially did not send a KVM patch to remove the 'init_event'
check.

> 
>>
>> Cc: Joe Jin <joe.jin@oracle.com>
>> Cc: Like Xu <likexu@tencent.com>
>> Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
>> ---
>>  target/i386/cpu.h     |  5 +++
>>  target/i386/kvm/kvm.c | 83 +++++++++++++++++++++++++++++++++++++++++--
>>  2 files changed, 86 insertions(+), 2 deletions(-)
>>
>> diff --git a/target/i386/cpu.h b/target/i386/cpu.h
>> index cd047e0410..b8ba72e87a 100644
>> --- a/target/i386/cpu.h
>> +++ b/target/i386/cpu.h
>> @@ -471,6 +471,11 @@ typedef enum X86Seg {
>>  #define MSR_CORE_PERF_GLOBAL_CTRL       0x38f
>>  #define MSR_CORE_PERF_GLOBAL_OVF_CTRL   0x390
>>
>> +#define MSR_K7_EVNTSEL0                 0xc0010000
>> +#define MSR_K7_PERFCTR0                 0xc0010004
>> +#define MSR_F15H_PERF_CTL0              0xc0010200
>> +#define MSR_F15H_PERF_CTR0              0xc0010201
>> +
>>  #define MSR_MC0_CTL                     0x400
>>  #define MSR_MC0_STATUS                  0x401
>>  #define MSR_MC0_ADDR                    0x402
>> diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
>> index bf4136fa1b..a0f7273dad 100644
>> --- a/target/i386/kvm/kvm.c
>> +++ b/target/i386/kvm/kvm.c
>> @@ -2084,6 +2084,32 @@ int kvm_arch_init_vcpu(CPUState *cs)
>>          }
>>      }
>>
>> +    /*
>> +     * If KVM_CAP_PMU_CAPABILITY is not supported, there is no way to
>> +     * disable the AMD pmu virtualization.
>> +     *
>> +     * If KVM_CAP_PMU_CAPABILITY is supported, kvm_state->pmu_cap_disabled
>> +     * indicates the KVM side has already disabled the pmu virtualization.
>> +     */
>> +    if (IS_AMD_CPU(env) && !cs->kvm_state->pmu_cap_disabled) {
>> +        int64_t family;
>> +
>> +        family = (env->cpuid_version >> 8) & 0xf;
>> +        if (family == 0xf) {
>> +            family += (env->cpuid_version >> 20) & 0xff;
>> +        }
>> +
>> +        if (family >= 6) {
>> +            has_architectural_pmu_version = 1;
>> +
>> +            if (env->features[FEAT_8000_0001_ECX] & CPUID_EXT3_PERFCORE) {
>> +                num_architectural_pmu_gp_counters = 6;
> 
> Please make the code a little more readable with some macro definitions.
> 
> #define AMD64_NUM_COUNTERS 4
> #define AMD64_NUM_COUNTERS_CORE 6

I will do that. Thank you very much!

> 
>> +            } else {
>> +                num_architectural_pmu_gp_counters = 4;
>> +            }
>> +        }
>> +    }
>> +
>>      cpu_x86_cpuid(env, 0x80000000, 0, &limit, &unused, &unused, &unused);
>>
>>      for (i = 0x80000000; i <= limit; i++) {
>> @@ -3438,7 +3464,7 @@ static int kvm_put_msrs(X86CPU *cpu, int level)
>>              kvm_msr_entry_add(cpu, MSR_KVM_POLL_CONTROL, env->poll_control_msr);
>>          }
>>
>> -        if (has_architectural_pmu_version > 0) {
>> +        if (has_architectural_pmu_version > 0 && IS_INTEL_CPU(env)) {
> 
> It seems to be saying here that it is possible to perform kvm_msr_entry_add()
> for Intel-related pmu on AMD platforms. So the reverse is expected to be
> feasible as well. Why need distinction?

My fault. I always used something like the below when emulating AMD on an
Intel host.

-cpu EPYC,vendor="AuthenticAMD"

I should consider the AMD-on-Intel and Intel-on-AMD cases more carefully.

I will address that in v3.

> 
>>              if (has_architectural_pmu_version > 1) {
>>                  /* Stop the counter.  */
>>                  kvm_msr_entry_add(cpu, MSR_CORE_PERF_FIXED_CTR_CTRL, 0);
>> @@ -3469,6 +3495,26 @@ static int kvm_put_msrs(X86CPU *cpu, int level)
>>                                    env->msr_global_ctrl);
>>              }
>>          }
>> +
>> +        if (has_architectural_pmu_version > 0 && IS_AMD_CPU(env)) {
>> +            uint32_t sel_base = MSR_K7_EVNTSEL0;
>> +            uint32_t ctr_base = MSR_K7_PERFCTR0;
>> +            uint32_t step = 1;
>> +
>> +            if (num_architectural_pmu_gp_counters == 6) {
>> +                sel_base = MSR_F15H_PERF_CTL0;
>> +                ctr_base = MSR_F15H_PERF_CTR0;
>> +                step = 2;
>> +            }
>> +
>> +            for (i = 0; i < num_architectural_pmu_gp_counters; i++) {
>> +                kvm_msr_entry_add(cpu, ctr_base + i * step,
>> +                                  env->msr_gp_counters[i]);
>> +                kvm_msr_entry_add(cpu, sel_base + i * step,
>> +                                  env->msr_gp_evtsel[i]);
>> +            }
>> +        }
>> +
>>          /*
>>           * Hyper-V partition-wide MSRs: to avoid clearing them on cpu hot-add,
>>           * only sync them to KVM on the first cpu
>> @@ -3929,7 +3975,7 @@ static int kvm_get_msrs(X86CPU *cpu)
>>      if (env->features[FEAT_KVM] & (1 << KVM_FEATURE_POLL_CONTROL)) {
>>          kvm_msr_entry_add(cpu, MSR_KVM_POLL_CONTROL, 1);
>>      }
>> -    if (has_architectural_pmu_version > 0) {
>> +    if (has_architectural_pmu_version > 0 && IS_INTEL_CPU(env)) {
>>          if (has_architectural_pmu_version > 1) {
>>              kvm_msr_entry_add(cpu, MSR_CORE_PERF_FIXED_CTR_CTRL, 0);
>>              kvm_msr_entry_add(cpu, MSR_CORE_PERF_GLOBAL_CTRL, 0);
>> @@ -3945,6 +3991,25 @@ static int kvm_get_msrs(X86CPU *cpu)
>>          }
>>      }
>>
>> +    if (has_architectural_pmu_version > 0 && IS_AMD_CPU(env)) {
>> +        uint32_t sel_base = MSR_K7_EVNTSEL0;
>> +        uint32_t ctr_base = MSR_K7_PERFCTR0;
>> +        uint32_t step = 1;
>> +
>> +        if (num_architectural_pmu_gp_counters == 6) {
>> +            sel_base = MSR_F15H_PERF_CTL0;
>> +            ctr_base = MSR_F15H_PERF_CTR0;
>> +            step = 2;
> 
> Perhaps it would be more helpful to add a little commentary on why
> step 2 is needed.

Sure. I will do that. The step "2" is because MSR_F15H_PERF_CTL1 is
MSR_F15H_PERF_CTL0 + 2.

#define MSR_F15H_PERF_CTL0              0xc0010200
#define MSR_F15H_PERF_CTR0              0xc0010201


Thank you very much for your feedback. I will address those in v3.

Dongli Zhang

> 
>> +        }
>> +
>> +        for (i = 0; i < num_architectural_pmu_gp_counters; i++) {
>> +            kvm_msr_entry_add(cpu, ctr_base + i * step,
>> +                              env->msr_gp_counters[i]);
>> +            kvm_msr_entry_add(cpu, sel_base + i * step,
>> +                              env->msr_gp_evtsel[i]);
>> +        }
>> +    }
>> +
>>      if (env->mcg_cap) {
>>          kvm_msr_entry_add(cpu, MSR_MCG_STATUS, 0);
>>          kvm_msr_entry_add(cpu, MSR_MCG_CTL, 0);
>> @@ -4230,6 +4295,20 @@ static int kvm_get_msrs(X86CPU *cpu)
>>          case MSR_P6_EVNTSEL0 ... MSR_P6_EVNTSEL0 + MAX_GP_COUNTERS - 1:
>>              env->msr_gp_evtsel[index - MSR_P6_EVNTSEL0] = msrs[i].data;
>>              break;
>> +        case MSR_K7_EVNTSEL0 ... MSR_K7_EVNTSEL0 + 3:
>> +            env->msr_gp_evtsel[index - MSR_K7_EVNTSEL0] = msrs[i].data;
>> +            break;
>> +        case MSR_K7_PERFCTR0 ... MSR_K7_PERFCTR0 + 3:
>> +            env->msr_gp_counters[index - MSR_K7_PERFCTR0] = msrs[i].data;
>> +            break;
>> +        case MSR_F15H_PERF_CTL0 ... MSR_F15H_PERF_CTL0 + 0xb:
>> +            index = index - MSR_F15H_PERF_CTL0;
>> +            if (index & 0x1) {
>> +                env->msr_gp_counters[index] = msrs[i].data;
>> +            } else {
>> +                env->msr_gp_evtsel[index] = msrs[i].data;
>> +            }
>> +            break;
>>          case HV_X64_MSR_HYPERCALL:
>>              env->msr_hv_hypercall = msrs[i].data;
>>              break;
>> --
>> 2.34.1
>>

