* [PATCHv2 0/9] KVM in-guest performance monitoring
@ 2011-11-03 12:33 Gleb Natapov
  2011-11-03 12:33 ` [PATCHv2 1/9] KVM: Expose kvm_apic_local_deliver() Gleb Natapov
                   ` (8 more replies)
  0 siblings, 9 replies; 42+ messages in thread
From: Gleb Natapov @ 2011-11-03 12:33 UTC (permalink / raw)
  To: kvm; +Cc: avi, mtosatti, linux-kernel, mingo, a.p.zijlstra, acme

This patchset exposes an emulated version 2 architectural performance
monitoring unit to KVM guests.  The PMU is emulated using perf_events,
so the host kernel can multiplex host-wide, host-user, and guest events
on the available hardware counters.
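In outline (a simplified sketch of what patch 2 does, with names taken
from that patch and error handling omitted): a guest write to an
event-select MSR is turned into a host perf_event that counts only
while the guest is running:

    struct perf_event_attr attr = {
            .type          = PERF_TYPE_RAW,
            .size          = sizeof(attr),
            .config        = eventsel & X86_RAW_EVENT_MASK,
            .exclude_host  = 1,     /* count guest time only */
            .sample_period = (-pmc->counter) & pmc_bitmask(pmc),
    };

    pmc->perf_event = perf_event_create_kernel_counter(&attr, -1 /* any cpu */,
                            current, kvm_perf_overflow, pmc);

When the host event overflows, the overflow handler raises a PMI in the
guest's emulated local APIC.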

The patches are against the next branch of kvm.git.

If you want to try running perf in a guest, apply the patch below to
qemu-kvm and use -cpu host on the qemu command line. But DO NOT TRY
these patches without applying [1][2] to the host kernel first.
Don't tell me I didn't warn you!

[1] https://lkml.org/lkml/2011/10/18/390
[2] https://lkml.org/lkml/2011/10/23/163
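With those applied, a quick smoke test looks something like this (the
disk image, qemu flags and event list are only illustrative):

    host$  qemu-kvm -cpu host -smp 2 -m 1024 guest.img
    guest$ perf stat -e cycles,instructions,branch-misses sleep 1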

Changelog:
 v1->v2
  - put the index into struct kvm_pmc instead of calculating it
  - use the locked versions of the bitops
  - inject the PMI from irq work if the vcpu was not in guest mode during the NMI
  - provide a stub for perf_get_x86_pmu_capability() for !PERF_EVENTS

Avi Kivity (6):
  KVM: Expose kvm_apic_local_deliver()
  KVM: Add generic RDPMC support
  KVM: SVM: Intercept RDPMC
  KVM: VMX: Intercept RDPMC
  KVM: x86 emulator: fix RDPMC privilege check
  KVM: x86 emulator: implement RDPMC (0F 33)

Gleb Natapov (3):
  KVM: Expose a version 2 architectural PMU to guests
  perf: expose perf capability to other modules.
  KVM: Expose the architectural performance monitoring CPUID leaf

 arch/x86/include/asm/kvm_emulate.h     |    1 +
 arch/x86/include/asm/kvm_host.h        |   49 +++
 arch/x86/include/asm/perf_event.h      |   15 +
 arch/x86/kernel/cpu/perf_event.c       |   11 +
 arch/x86/kernel/cpu/perf_event.h       |    2 +
 arch/x86/kernel/cpu/perf_event_intel.c |    3 +
 arch/x86/kvm/Kconfig                   |    1 +
 arch/x86/kvm/Makefile                  |    2 +-
 arch/x86/kvm/emulate.c                 |   13 +-
 arch/x86/kvm/lapic.c                   |    2 +-
 arch/x86/kvm/lapic.h                   |    1 +
 arch/x86/kvm/pmu.c                     |  529 ++++++++++++++++++++++++++++++++
 arch/x86/kvm/svm.c                     |   15 +
 arch/x86/kvm/vmx.c                     |   15 +-
 arch/x86/kvm/x86.c                     |   69 ++++-
 include/linux/kvm_host.h               |    2 +
 16 files changed, 716 insertions(+), 14 deletions(-)
 create mode 100644 arch/x86/kvm/pmu.c


diff --git a/target-i386/cpuid.c b/target-i386/cpuid.c
index f179999..ff2a0ca 100644
--- a/target-i386/cpuid.c
+++ b/target-i386/cpuid.c
@@ -1178,11 +1178,20 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count,
         *edx = 0;
         break;
     case 0xA:
-        /* Architectural Performance Monitoring Leaf */
-        *eax = 0;
-        *ebx = 0;
-        *ecx = 0;
-        *edx = 0;
+        if (kvm_enabled()) {
+            KVMState *s = env->kvm_state;
+
+            *eax = kvm_arch_get_supported_cpuid(s, 0xA, count, R_EAX);
+            *ebx = kvm_arch_get_supported_cpuid(s, 0xA, count, R_EBX);
+            *ecx = kvm_arch_get_supported_cpuid(s, 0xA, count, R_ECX);
+            *edx = kvm_arch_get_supported_cpuid(s, 0xA, count, R_EDX);
+        } else {
+            /* Architectural Performance Monitoring Leaf */
+            *eax = 0;
+            *ebx = 0;
+            *ecx = 0;
+            *edx = 0;
+        }
         break;
     case 0xD:
         /* Processor Extended State */
-- 
1.7.5.3


* [PATCHv2 1/9] KVM: Expose kvm_apic_local_deliver()
  2011-11-03 12:33 [PATCHv2 0/9] KVM in-guest performance monitoring Gleb Natapov
@ 2011-11-03 12:33 ` Gleb Natapov
  2011-11-03 12:33 ` [PATCHv2 2/9] KVM: Expose a version 2 architectural PMU to guests Gleb Natapov
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 42+ messages in thread
From: Gleb Natapov @ 2011-11-03 12:33 UTC (permalink / raw)
  To: kvm; +Cc: avi, mtosatti, linux-kernel, mingo, a.p.zijlstra, acme

From: Avi Kivity <avi@redhat.com>

Needed by the PMU emulation code to deliver performance monitoring interrupts.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
---
 arch/x86/kvm/lapic.c |    2 +-
 arch/x86/kvm/lapic.h |    1 +
 2 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 54abb40..e87e43e 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -1120,7 +1120,7 @@ int apic_has_pending_timer(struct kvm_vcpu *vcpu)
 	return 0;
 }
 
-static int kvm_apic_local_deliver(struct kvm_lapic *apic, int lvt_type)
+int kvm_apic_local_deliver(struct kvm_lapic *apic, int lvt_type)
 {
 	u32 reg = apic_get_reg(apic, lvt_type);
 	int vector, mode, trig_mode;
diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
index 138e8cc..6f4ce25 100644
--- a/arch/x86/kvm/lapic.h
+++ b/arch/x86/kvm/lapic.h
@@ -34,6 +34,7 @@ void kvm_apic_set_version(struct kvm_vcpu *vcpu);
 int kvm_apic_match_physical_addr(struct kvm_lapic *apic, u16 dest);
 int kvm_apic_match_logical_addr(struct kvm_lapic *apic, u8 mda);
 int kvm_apic_set_irq(struct kvm_vcpu *vcpu, struct kvm_lapic_irq *irq);
+int kvm_apic_local_deliver(struct kvm_lapic *apic, int lvt_type);
 
 u64 kvm_get_apic_base(struct kvm_vcpu *vcpu);
 void kvm_set_apic_base(struct kvm_vcpu *vcpu, u64 data);
-- 
1.7.5.3


* [PATCHv2 2/9] KVM: Expose a version 2 architectural PMU to guests
  2011-11-03 12:33 [PATCHv2 0/9] KVM in-guest performance monitoring Gleb Natapov
  2011-11-03 12:33 ` [PATCHv2 1/9] KVM: Expose kvm_apic_local_deliver() Gleb Natapov
@ 2011-11-03 12:33 ` Gleb Natapov
  2011-11-07 14:22   ` Peter Zijlstra
                     ` (2 more replies)
  2011-11-03 12:33 ` [PATCHv2 3/9] KVM: Add generic RDPMC support Gleb Natapov
                   ` (6 subsequent siblings)
  8 siblings, 3 replies; 42+ messages in thread
From: Gleb Natapov @ 2011-11-03 12:33 UTC (permalink / raw)
  To: kvm; +Cc: avi, mtosatti, linux-kernel, mingo, a.p.zijlstra, acme

Use perf_events to emulate an architectural PMU, version 2.

Based on PMU version 1 emulation by Avi Kivity.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
---
 arch/x86/include/asm/kvm_host.h |   48 ++++
 arch/x86/kvm/Kconfig            |    1 +
 arch/x86/kvm/Makefile           |    2 +-
 arch/x86/kvm/pmu.c              |  529 +++++++++++++++++++++++++++++++++++++++
 arch/x86/kvm/x86.c              |   24 ++-
 include/linux/kvm_host.h        |    2 +
 6 files changed, 596 insertions(+), 10 deletions(-)
 create mode 100644 arch/x86/kvm/pmu.c

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 6d83264..5807a49 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -16,10 +16,12 @@
 #include <linux/mmu_notifier.h>
 #include <linux/tracepoint.h>
 #include <linux/cpumask.h>
+#include <linux/irq_work.h>
 
 #include <linux/kvm.h>
 #include <linux/kvm_para.h>
 #include <linux/kvm_types.h>
+#include <linux/perf_event.h>
 
 #include <asm/pvclock-abi.h>
 #include <asm/desc.h>
@@ -289,6 +291,37 @@ struct kvm_mmu {
 	u64 pdptrs[4]; /* pae */
 };
 
+enum pmc_type {
+	KVM_PMC_GP = 0,
+	KVM_PMC_FIXED,
+};
+
+struct kvm_pmc {
+	enum pmc_type type;
+	u8 idx;
+	u64 counter;
+	u64 eventsel;
+	struct perf_event *perf_event;
+	struct kvm_vcpu *vcpu;
+};
+
+struct kvm_pmu {
+	unsigned nr_arch_gp_counters;
+	unsigned nr_arch_fixed_counters;
+	unsigned available_event_types;
+	u64 fixed_ctr_ctrl;
+	u64 global_ctrl;
+	u64 global_status;
+	u64 global_ovf_ctrl;
+	u64 counter_bitmask[2];
+	u64 global_ctrl_mask;
+	u8 version;
+	struct kvm_pmc gp_counters[X86_PMC_MAX_GENERIC];
+	struct kvm_pmc fixed_counters[X86_PMC_MAX_FIXED];
+	struct irq_work irq_work;
+	u64 reprogram_pmi;
+};
+
 struct kvm_vcpu_arch {
 	/*
 	 * rip and regs accesses must go through
@@ -422,6 +455,8 @@ struct kvm_vcpu_arch {
 	unsigned access;
 	gfn_t mmio_gfn;
 
+	struct kvm_pmu pmu;
+
 	/* used for guest single stepping over the given code position */
 	unsigned long singlestep_rip;
 
@@ -881,4 +916,17 @@ extern bool kvm_find_async_pf_gfn(struct kvm_vcpu *vcpu, gfn_t gfn);
 
 void kvm_complete_insn_gp(struct kvm_vcpu *vcpu, int err);
 
+int kvm_is_in_guest(void);
+
+void kvm_pmu_init(struct kvm_vcpu *vcpu);
+void kvm_pmu_destroy(struct kvm_vcpu *vcpu);
+void kvm_pmu_reset(struct kvm_vcpu *vcpu);
+void kvm_pmu_cpuid_update(struct kvm_vcpu *vcpu);
+bool kvm_pmu_msr(struct kvm_vcpu *vcpu, u32 msr);
+int kvm_pmu_get_msr(struct kvm_vcpu *vcpu, u32 msr, u64 *data);
+int kvm_pmu_set_msr(struct kvm_vcpu *vcpu, u32 msr, u64 data);
+int kvm_pmu_read_pmc(struct kvm_vcpu *vcpu, unsigned pmc, u64 *data);
+void kvm_handle_pmu_event(struct kvm_vcpu *vcpu);
+void kvm_deliver_pmi(struct kvm_vcpu *vcpu);
+
 #endif /* _ASM_X86_KVM_HOST_H */
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index ff5790d..c27dd11 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -35,6 +35,7 @@ config KVM
 	select KVM_MMIO
 	select TASKSTATS
 	select TASK_DELAY_ACCT
+	select PERF_EVENTS
 	---help---
 	  Support hosting fully virtualized guest machines using hardware
 	  virtualization extensions.  You will need a fairly recent
diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
index f15501f..cfca03f 100644
--- a/arch/x86/kvm/Makefile
+++ b/arch/x86/kvm/Makefile
@@ -12,7 +12,7 @@ kvm-$(CONFIG_IOMMU_API)	+= $(addprefix ../../../virt/kvm/, iommu.o)
 kvm-$(CONFIG_KVM_ASYNC_PF)	+= $(addprefix ../../../virt/kvm/, async_pf.o)
 
 kvm-y			+= x86.o mmu.o emulate.o i8259.o irq.o lapic.o \
-			   i8254.o timer.o
+			   i8254.o timer.o pmu.o
 kvm-intel-y		+= vmx.o
 kvm-amd-y		+= svm.o
 
diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
new file mode 100644
index 0000000..d4ada30
--- /dev/null
+++ b/arch/x86/kvm/pmu.c
@@ -0,0 +1,529 @@
+/*
+ * Kernel-based Virtual Machine -- Performance Monitoring Unit support
+ *
+ * Copyright 2011 Red Hat, Inc. and/or its affiliates.
+ *
+ * Authors:
+ *   Avi Kivity   <avi@redhat.com>
+ *   Gleb Natapov <gleb@redhat.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.  See
+ * the COPYING file in the top-level directory.
+ *
+ */
+
+#include <linux/types.h>
+#include <linux/kvm_host.h>
+#include <linux/perf_event.h>
+#include "x86.h"
+#include "lapic.h"
+
+static struct kvm_arch_event_perf_mapping {
+	u8 eventsel;
+	u8 unit_mask;
+	unsigned event_type;
+	bool inexact;
+} arch_events[] = {
+	/* Index must match CPUID 0x0A.EBX bit vector */
+	[0] = { 0x3c, 0x00, PERF_COUNT_HW_CPU_CYCLES },
+	[1] = { 0xc0, 0x00, PERF_COUNT_HW_INSTRUCTIONS },
+	[2] = { 0x3c, 0x01, PERF_COUNT_HW_BUS_CYCLES  },
+	[3] = { 0x2e, 0x4f, PERF_COUNT_HW_CACHE_REFERENCES },
+	[4] = { 0x2e, 0x41, PERF_COUNT_HW_CACHE_MISSES },
+	[5] = { 0xc4, 0x00, PERF_COUNT_HW_BRANCH_INSTRUCTIONS },
+	[6] = { 0xc5, 0x00, PERF_COUNT_HW_BRANCH_MISSES },
+};
+
+/* mapping between fixed pmc index and arch_events array */
+int fixed_pmc_events[] = {1, 0, 2};
+
+static bool pmc_is_gp(struct kvm_pmc *pmc)
+{
+	return pmc->type == KVM_PMC_GP;
+}
+
+static inline u64 pmc_bitmask(struct kvm_pmc *pmc)
+{
+	struct kvm_pmu *pmu = &pmc->vcpu->arch.pmu;
+
+	return pmu->counter_bitmask[pmc->type];
+}
+
+static inline bool pmc_enabled(struct kvm_pmc *pmc)
+{
+	struct kvm_pmu *pmu = &pmc->vcpu->arch.pmu;
+	return test_bit(pmc->idx, (unsigned long *)&pmu->global_ctrl);
+}
+
+static inline struct kvm_pmc *get_gp_pmc(struct kvm_pmu *pmu, u32 msr,
+					 u32 base)
+{
+	if (msr >= base && msr < base + pmu->nr_arch_gp_counters)
+		return &pmu->gp_counters[msr - base];
+	return NULL;
+}
+
+static inline struct kvm_pmc *get_fixed_pmc(struct kvm_pmu *pmu, u32 msr)
+{
+	int base = MSR_CORE_PERF_FIXED_CTR0;
+	if (msr >= base && msr < base + pmu->nr_arch_fixed_counters)
+		return &pmu->fixed_counters[msr - base];
+	return NULL;
+}
+
+static inline struct kvm_pmc *get_fixed_pmc_idx(struct kvm_pmu *pmu, int idx)
+{
+	return get_fixed_pmc(pmu, MSR_CORE_PERF_FIXED_CTR0 + idx);
+}
+
+static struct kvm_pmc *global_idx_to_pmc(struct kvm_pmu *pmu, int idx)
+{
+	if (idx < X86_PMC_IDX_FIXED)
+		return get_gp_pmc(pmu, MSR_P6_EVNTSEL0 + idx, MSR_P6_EVNTSEL0);
+	else
+		return get_fixed_pmc_idx(pmu, idx - X86_PMC_IDX_FIXED);
+}
+
+void kvm_deliver_pmi(struct kvm_vcpu *vcpu)
+{
+	if (vcpu->arch.apic)
+		kvm_apic_local_deliver(vcpu->arch.apic, APIC_LVTPC);
+}
+
+static void trigger_pmi(struct irq_work *irq_work)
+{
+	struct kvm_pmu *pmu = container_of(irq_work, struct kvm_pmu,
+			irq_work);
+	struct kvm_vcpu *vcpu = container_of(pmu, struct kvm_vcpu,
+			arch.pmu);
+
+	kvm_deliver_pmi(vcpu);
+}
+
+static void kvm_perf_overflow(struct perf_event *perf_event,
+			      struct perf_sample_data *data,
+			      struct pt_regs *regs)
+{
+	struct kvm_pmc *pmc = perf_event->overflow_handler_context;
+	struct kvm_pmu *pmu = &pmc->vcpu->arch.pmu;
+	__set_bit(pmc->idx, (unsigned long *)&pmu->global_status);
+}
+
+static void kvm_perf_overflow_intr(struct perf_event *perf_event,
+		struct perf_sample_data *data, struct pt_regs *regs)
+{
+	struct kvm_pmc *pmc = perf_event->overflow_handler_context;
+	struct kvm_pmu *pmu = &pmc->vcpu->arch.pmu;
+	if (!test_and_set_bit(pmc->idx, (unsigned long *)&pmu->reprogram_pmi)) {
+		kvm_perf_overflow(perf_event, data, regs);
+		kvm_make_request(KVM_REQ_PMU, pmc->vcpu);
+		/*
+		 * Inject PMI. If the vcpu was in guest mode during the NMI,
+		 * the PMI can be injected on guest re-entry. Otherwise we
+		 * can't be sure the vcpu is not halted, so we should kick
+		 * it, but that is impossible from NMI context. Do it from
+		 * irq work instead.
+		 */
+		if (!kvm_is_in_guest())
+			irq_work_queue(&pmc->vcpu->arch.pmu.irq_work);
+		else
+			kvm_make_request(KVM_REQ_PMI, pmc->vcpu);
+	}
+}
+
+static u64 read_pmc(struct kvm_pmc *pmc)
+{
+	u64 counter, enabled, running;
+
+	counter = pmc->counter;
+
+	if (pmc->perf_event)
+		counter += perf_event_read_value(pmc->perf_event,
+						 &enabled, &running);
+
+	/* FIXME: Scaling needed? */
+
+	return counter & pmc_bitmask(pmc);
+}
+
+static void stop_counter(struct kvm_pmc *pmc)
+{
+	if (pmc->perf_event) {
+		pmc->counter = read_pmc(pmc);
+		perf_event_release_kernel(pmc->perf_event);
+		pmc->perf_event = NULL;
+	}
+}
+
+static void reprogram_counter(struct kvm_pmc *pmc, u32 type,
+		unsigned config, bool exclude_user, bool exclude_kernel,
+		bool intr)
+{
+	struct perf_event *event;
+	struct perf_event_attr attr = {
+		.type = type,
+		.size = sizeof(attr),
+		.exclude_idle = true,
+		.exclude_host = 1,
+		.exclude_user = exclude_user,
+		.exclude_kernel = exclude_kernel,
+		.sample_period = (-pmc->counter) & pmc_bitmask(pmc),
+		.config = config,
+	};
+
+	event = perf_event_create_kernel_counter(&attr, -1, current,
+						 intr ? kvm_perf_overflow_intr :
+						 kvm_perf_overflow, pmc);
+	if (IS_ERR(event)) {
+		printk_once("kvm: pmu event creation failed %ld\n",
+				PTR_ERR(event));
+		return;
+	}
+
+	pmc->perf_event = event;
+	clear_bit(pmc->idx, (unsigned long*)&pmc->vcpu->arch.pmu.reprogram_pmi);
+}
+
+static unsigned find_arch_event(struct kvm_pmu *pmu, u8 event_select,
+		u8 unit_mask)
+{
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(arch_events); i++)
+		if (arch_events[i].eventsel == event_select
+				&& arch_events[i].unit_mask == unit_mask
+				&& (pmu->available_event_types & (1 << i)))
+			break;
+
+	if (i == ARRAY_SIZE(arch_events))
+		return PERF_COUNT_HW_MAX;
+
+	return arch_events[i].event_type;
+}
+
+static void reprogram_gp_counter(struct kvm_pmc *pmc, u64 eventsel)
+{
+	unsigned config, type = PERF_TYPE_RAW;
+	u8 event_select, unit_mask;
+
+	pmc->eventsel = eventsel;
+
+	stop_counter(pmc);
+
+	if (!(eventsel & ARCH_PERFMON_EVENTSEL_ENABLE) || !pmc_enabled(pmc))
+		return;
+
+	event_select = eventsel & ARCH_PERFMON_EVENTSEL_EVENT;
+	unit_mask = (eventsel & ARCH_PERFMON_EVENTSEL_UMASK) >> 8;
+
+	if (!(event_select & (ARCH_PERFMON_EVENTSEL_EDGE |
+				ARCH_PERFMON_EVENTSEL_INV |
+				ARCH_PERFMON_EVENTSEL_CMASK))) {
+		config = find_arch_event(&pmc->vcpu->arch.pmu, event_select,
+				unit_mask);
+		if (config != PERF_COUNT_HW_MAX)
+			type = PERF_TYPE_HARDWARE;
+	}
+
+	if (type == PERF_TYPE_RAW)
+		config = eventsel & X86_RAW_EVENT_MASK;
+
+	reprogram_counter(pmc, type, config,
+			!(eventsel & ARCH_PERFMON_EVENTSEL_USR),
+			!(eventsel & ARCH_PERFMON_EVENTSEL_OS),
+			eventsel & ARCH_PERFMON_EVENTSEL_INT);
+}
+
+static void reprogram_fixed_counter(struct kvm_pmc *pmc, u8 en_pmi, int idx)
+{
+	unsigned en = en_pmi & 0x3;
+	bool pmi = en_pmi & 0x8;
+
+	stop_counter(pmc);
+
+	if (!en || !pmc_enabled(pmc))
+		return;
+
+	reprogram_counter(pmc, PERF_TYPE_HARDWARE,
+			arch_events[fixed_pmc_events[idx]].event_type,
+			!(en & 0x2), /* exclude user */
+			!(en & 0x1), /* exclude kernel */
+			pmi);
+}
+
+static inline u8 fixed_en_pmi(u64 ctrl, int idx)
+{
+	return (ctrl >> (idx * 4)) & 0xf;
+}
+
+static void reprogram_fixed_counters(struct kvm_pmu *pmu, u64 data)
+{
+	int i;
+
+	for (i = 0; i < pmu->nr_arch_fixed_counters; i++) {
+		u8 en_pmi = fixed_en_pmi(data, i);
+		struct kvm_pmc *pmc = get_fixed_pmc_idx(pmu, i);
+
+		if (fixed_en_pmi(pmu->fixed_ctr_ctrl, i) == en_pmi)
+			continue;
+
+		reprogram_fixed_counter(pmc, en_pmi, i);
+	}
+
+	pmu->fixed_ctr_ctrl = data;
+}
+
+static void reprogram_idx(struct kvm_pmu *pmu, int idx)
+{
+	struct kvm_pmc *pmc = global_idx_to_pmc(pmu, idx);
+
+	if (!pmc)
+		return;
+
+	if (pmc_is_gp(pmc))
+		reprogram_gp_counter(pmc, pmc->eventsel);
+	else {
+		int fidx = idx - X86_PMC_IDX_FIXED;
+		reprogram_fixed_counter(pmc,
+				fixed_en_pmi(pmu->fixed_ctr_ctrl, fidx), fidx);
+	}
+}
+
+static void global_ctrl_changed(struct kvm_pmu *pmu, u64 data)
+{
+	int bit;
+	u64 diff = pmu->global_ctrl ^ data;
+
+	pmu->global_ctrl = data;
+
+	for_each_set_bit(bit, (unsigned long *)&diff, X86_PMC_IDX_MAX)
+		reprogram_idx(pmu, bit);
+}
+
+bool kvm_pmu_msr(struct kvm_vcpu *vcpu, u32 msr)
+{
+	struct kvm_pmu *pmu = &vcpu->arch.pmu;
+	int ret;
+
+	switch (msr) {
+	case MSR_CORE_PERF_FIXED_CTR_CTRL:
+	case MSR_CORE_PERF_GLOBAL_STATUS:
+	case MSR_CORE_PERF_GLOBAL_CTRL:
+	case MSR_CORE_PERF_GLOBAL_OVF_CTRL:
+		ret = pmu->version > 1;
+		break;
+	default:
+		ret = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0)
+			|| get_gp_pmc(pmu, msr, MSR_P6_EVNTSEL0)
+			|| get_fixed_pmc(pmu, msr);
+		break;
+	}
+	return ret;
+}
+
+int kvm_pmu_get_msr(struct kvm_vcpu *vcpu, u32 index, u64 *data)
+{
+	struct kvm_pmu *pmu = &vcpu->arch.pmu;
+	struct kvm_pmc *pmc;
+
+	switch (index) {
+	case MSR_CORE_PERF_FIXED_CTR_CTRL:
+		*data = pmu->fixed_ctr_ctrl;
+		return 0;
+	case MSR_CORE_PERF_GLOBAL_STATUS:
+		*data = pmu->global_status;
+		return 0;
+	case MSR_CORE_PERF_GLOBAL_CTRL:
+		*data = pmu->global_ctrl;
+		return 0;
+	case MSR_CORE_PERF_GLOBAL_OVF_CTRL:
+		*data = pmu->global_ovf_ctrl;
+		return 0;
+	default:
+		if ((pmc = get_gp_pmc(pmu, index, MSR_IA32_PERFCTR0)) ||
+				(pmc = get_fixed_pmc(pmu, index))) {
+			*data = read_pmc(pmc);
+			return 0;
+		} else if ((pmc = get_gp_pmc(pmu, index, MSR_P6_EVNTSEL0))) {
+			*data = pmc->eventsel;
+			return 0;
+		}
+	}
+	return 1;
+}
+
+int kvm_pmu_set_msr(struct kvm_vcpu *vcpu, u32 index, u64 data)
+{
+	struct kvm_pmu *pmu = &vcpu->arch.pmu;
+	struct kvm_pmc *pmc;
+
+	switch (index) {
+	case MSR_CORE_PERF_FIXED_CTR_CTRL:
+		if (pmu->fixed_ctr_ctrl == data)
+			return 0;
+		if (!(data & 0xfffffffffffff444)) {
+			reprogram_fixed_counters(pmu, data);
+			return 0;
+		}
+		break;
+	case MSR_CORE_PERF_GLOBAL_STATUS:
+		break; /* RO MSR */
+	case MSR_CORE_PERF_GLOBAL_CTRL:
+		if (pmu->global_ctrl == data)
+			return 0;
+		if (!(data & pmu->global_ctrl_mask)) {
+			global_ctrl_changed(pmu, data);
+			return 0;
+		}
+		break;
+	case MSR_CORE_PERF_GLOBAL_OVF_CTRL:
+		if (!(data & (pmu->global_ctrl_mask & ~(3ull<<62)))) {
+			pmu->global_status &= ~data;
+			pmu->global_ovf_ctrl = data;
+			return 0;
+		}
+		break;
+	default:
+		if ((pmc = get_gp_pmc(pmu, index, MSR_IA32_PERFCTR0)) ||
+				(pmc = get_fixed_pmc(pmu, index))) {
+			data = (s64)(s32)data;
+			pmc->counter += data - read_pmc(pmc);
+			return 0;
+		} else if ((pmc = get_gp_pmc(pmu, index, MSR_P6_EVNTSEL0))) {
+			if (data == pmc->eventsel)
+				return 0;
+			if (!(data & 0xffffffff00200000ull)) {
+				reprogram_gp_counter(pmc, data);
+				return 0;
+			}
+		}
+	}
+	return 1;
+}
+
+int kvm_pmu_read_pmc(struct kvm_vcpu *vcpu, unsigned pmc, u64 *data)
+{
+	struct kvm_pmu *pmu = &vcpu->arch.pmu;
+	bool fast_mode = pmc & (1u << 31);
+	bool fixed = pmc & (1u << 30);
+	struct kvm_pmc *counters;
+	u64 ctr;
+
+	pmc &= ~(3u << 30);	/* strip the fast/fixed flag bits */
+	if (!fixed && pmc >= pmu->nr_arch_gp_counters)
+		return 1;
+	if (fixed && pmc >= pmu->nr_arch_fixed_counters)
+		return 1;
+	counters = fixed ? pmu->fixed_counters : pmu->gp_counters;
+	ctr = read_pmc(&counters[pmc]);
+	if (fast_mode)
+		ctr = (u32)ctr;
+	*data = ctr;
+
+	return 0;
+}
+
+void kvm_pmu_cpuid_update(struct kvm_vcpu *vcpu)
+{
+	struct kvm_pmu *pmu = &vcpu->arch.pmu;
+	struct kvm_cpuid_entry2 *entry;
+	unsigned bitmap_len;
+
+	pmu->nr_arch_gp_counters = 0;
+	pmu->nr_arch_fixed_counters = 0;
+	pmu->counter_bitmask[KVM_PMC_GP] = 0;
+	pmu->counter_bitmask[KVM_PMC_FIXED] = 0;
+	pmu->version = 0;
+
+	entry = kvm_find_cpuid_entry(vcpu, 0xa, 0);
+	if (!entry)
+		return;
+
+	pmu->version = entry->eax & 0xff;
+	if (!pmu->version)
+		return;
+
+	pmu->nr_arch_gp_counters = min((int)(entry->eax >> 8) & 0xff,
+			X86_PMC_MAX_GENERIC);
+	pmu->counter_bitmask[KVM_PMC_GP] =
+		((u64)1 << ((entry->eax >> 16) & 0xff)) - 1;
+	bitmap_len = (entry->eax >> 24) & 0xff;
+	pmu->available_event_types = ~entry->ebx & ((1ull << bitmap_len) - 1);
+
+	if (pmu->version == 1) {
+		pmu->global_ctrl = (1 << pmu->nr_arch_gp_counters) - 1;
+		return;
+	}
+
+	pmu->nr_arch_fixed_counters = min((int)(entry->edx & 0x1f),
+			X86_PMC_MAX_FIXED);
+	pmu->counter_bitmask[KVM_PMC_FIXED] =
+		((u64)1 << ((entry->edx >> 5) & 0xff)) - 1;
+	pmu->global_ctrl_mask = ~(((1 << pmu->nr_arch_gp_counters) - 1)
+			| (((1ull << pmu->nr_arch_fixed_counters) - 1)
+				<< X86_PMC_IDX_FIXED));
+}
+
+void kvm_pmu_init(struct kvm_vcpu *vcpu)
+{
+	int i;
+	struct kvm_pmu *pmu = &vcpu->arch.pmu;
+
+	memset(pmu, 0, sizeof(*pmu));
+	for (i = 0; i < X86_PMC_MAX_GENERIC; i++) {
+		pmu->gp_counters[i].type = KVM_PMC_GP;
+		pmu->gp_counters[i].vcpu = vcpu;
+		pmu->gp_counters[i].idx = i;
+	}
+	for (i = 0; i < X86_PMC_MAX_FIXED; i++) {
+		pmu->fixed_counters[i].type = KVM_PMC_FIXED;
+		pmu->fixed_counters[i].vcpu = vcpu;
+		pmu->fixed_counters[i].idx = i + X86_PMC_IDX_FIXED;
+	}
+	init_irq_work(&pmu->irq_work, trigger_pmi);
+	kvm_pmu_cpuid_update(vcpu);
+}
+
+void kvm_pmu_reset(struct kvm_vcpu *vcpu)
+{
+	struct kvm_pmu *pmu = &vcpu->arch.pmu;
+	int i;
+
+	irq_work_sync(&pmu->irq_work);
+	for (i = 0; i < X86_PMC_MAX_GENERIC; i++) {
+		struct kvm_pmc *pmc = &pmu->gp_counters[i];
+		stop_counter(pmc);
+		pmc->counter = pmc->eventsel = 0;
+	}
+
+	for (i = 0; i < X86_PMC_MAX_FIXED; i++)
+		stop_counter(&pmu->fixed_counters[i]);
+
+	pmu->fixed_ctr_ctrl = pmu->global_ctrl = pmu->global_status =
+		pmu->global_ovf_ctrl = 0;
+}
+
+void kvm_pmu_destroy(struct kvm_vcpu *vcpu)
+{
+	kvm_pmu_reset(vcpu);
+}
+
+void kvm_handle_pmu_event(struct kvm_vcpu *vcpu)
+{
+	struct kvm_pmu *pmu = &vcpu->arch.pmu;
+	u64 bitmask;
+	int bit;
+
+	bitmask = pmu->reprogram_pmi;
+
+	for_each_set_bit(bit, (unsigned long *)&bitmask, X86_PMC_IDX_MAX) {
+		struct kvm_pmc *pmc = global_idx_to_pmc(pmu, bit);
+
+		if (unlikely(!pmc || !pmc->perf_event)) {
+			clear_bit(bit, (unsigned long *)&pmu->reprogram_pmi);
+			continue;
+		}
+
+		reprogram_idx(pmu, bit);
+	}
+}
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 9eff4af..52a8666 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -624,6 +624,8 @@ static void update_cpuid(struct kvm_vcpu *vcpu)
 
 	if (apic)
 		apic->lapic_timer.timer_mode_mask = timer_mode_mask;
+
+	kvm_pmu_cpuid_update(vcpu);
 }
 
 int kvm_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
@@ -1656,8 +1658,6 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)
 	 * which we perfectly emulate ;-). Any other value should be at least
 	 * reported, some guests depend on them.
 	 */
-	case MSR_P6_EVNTSEL0:
-	case MSR_P6_EVNTSEL1:
 	case MSR_K7_EVNTSEL0:
 	case MSR_K7_EVNTSEL1:
 	case MSR_K7_EVNTSEL2:
@@ -1669,8 +1669,6 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)
 	/* at least RHEL 4 unconditionally writes to the perfctr registers,
 	 * so we ignore writes to make it happy.
 	 */
-	case MSR_P6_PERFCTR0:
-	case MSR_P6_PERFCTR1:
 	case MSR_K7_PERFCTR0:
 	case MSR_K7_PERFCTR1:
 	case MSR_K7_PERFCTR2:
@@ -1707,6 +1705,8 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)
 	default:
 		if (msr && (msr == vcpu->kvm->arch.xen_hvm_config.msr))
 			return xen_hvm_config(vcpu, data);
+		if (kvm_pmu_msr(vcpu, msr))
+			return kvm_pmu_set_msr(vcpu, msr, data);
 		if (!ignore_msrs) {
 			pr_unimpl(vcpu, "unhandled wrmsr: 0x%x data %llx\n",
 				msr, data);
@@ -1869,10 +1869,6 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata)
 	case MSR_K8_SYSCFG:
 	case MSR_K7_HWCR:
 	case MSR_VM_HSAVE_PA:
-	case MSR_P6_PERFCTR0:
-	case MSR_P6_PERFCTR1:
-	case MSR_P6_EVNTSEL0:
-	case MSR_P6_EVNTSEL1:
 	case MSR_K7_EVNTSEL0:
 	case MSR_K7_PERFCTR0:
 	case MSR_K8_INT_PENDING_MSG:
@@ -1983,6 +1979,8 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata)
 		data = 0xbe702111;
 		break;
 	default:
+		if (kvm_pmu_msr(vcpu, msr))
+			return kvm_pmu_get_msr(vcpu, msr, pdata);
 		if (!ignore_msrs) {
 			pr_unimpl(vcpu, "unhandled rdmsr: 0x%x\n", msr);
 			return 1;
@@ -5136,7 +5134,7 @@ static void kvm_timer_init(void)
 
 static DEFINE_PER_CPU(struct kvm_vcpu *, current_vcpu);
 
-static int kvm_is_in_guest(void)
+int kvm_is_in_guest(void)
 {
 	return percpu_read(current_vcpu) != NULL;
 }
@@ -5719,6 +5717,10 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 			process_nmi(vcpu);
 		req_immediate_exit =
 			kvm_check_request(KVM_REQ_IMMEDIATE_EXIT, vcpu);
+		if (kvm_check_request(KVM_REQ_PMU, vcpu))
+			kvm_handle_pmu_event(vcpu);
+		if (kvm_check_request(KVM_REQ_PMI, vcpu))
+			kvm_deliver_pmi(vcpu);
 	}
 
 	r = kvm_mmu_reload(vcpu);
@@ -6459,6 +6461,8 @@ int kvm_arch_vcpu_reset(struct kvm_vcpu *vcpu)
 	kvm_async_pf_hash_reset(vcpu);
 	vcpu->arch.apf.halted = false;
 
+	kvm_pmu_reset(vcpu);
+
 	return kvm_x86_ops->vcpu_reset(vcpu);
 }
 
@@ -6547,6 +6551,7 @@ int kvm_arch_vcpu_init(struct kvm_vcpu *vcpu)
 		goto fail_free_mce_banks;
 
 	kvm_async_pf_hash_reset(vcpu);
+	kvm_pmu_init(vcpu);
 
 	return 0;
 fail_free_mce_banks:
@@ -6565,6 +6570,7 @@ void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu)
 {
 	int idx;
 
+	kvm_pmu_destroy(vcpu);
 	kfree(vcpu->arch.mce_banks);
 	kvm_free_lapic(vcpu);
 	idx = srcu_read_lock(&vcpu->kvm->srcu);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index c6a2ec9..40d0878 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -52,6 +52,8 @@
 #define KVM_REQ_STEAL_UPDATE      13
 #define KVM_REQ_NMI               14
 #define KVM_REQ_IMMEDIATE_EXIT    15
+#define KVM_REQ_PMU               16
+#define KVM_REQ_PMI               17
 
 #define KVM_USERSPACE_IRQ_SOURCE_ID	0
 
-- 
1.7.5.3


* [PATCHv2 3/9] KVM: Add generic RDPMC support
  2011-11-03 12:33 [PATCHv2 0/9] KVM in-guest performance monitoring Gleb Natapov
  2011-11-03 12:33 ` [PATCHv2 1/9] KVM: Expose kvm_apic_local_deliver() Gleb Natapov
  2011-11-03 12:33 ` [PATCHv2 2/9] KVM: Expose a version 2 architectural PMU to guests Gleb Natapov
@ 2011-11-03 12:33 ` Gleb Natapov
  2011-11-03 12:33 ` [PATCHv2 4/9] KVM: SVM: Intercept RDPMC Gleb Natapov
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 42+ messages in thread
From: Gleb Natapov @ 2011-11-03 12:33 UTC (permalink / raw)
  To: kvm; +Cc: avi, mtosatti, linux-kernel, mingo, a.p.zijlstra, acme

From: Avi Kivity <avi@redhat.com>

Add a helper function that emulates the RDPMC instruction.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
---
 arch/x86/include/asm/kvm_host.h |    1 +
 arch/x86/kvm/x86.c              |   15 +++++++++++++++
 2 files changed, 16 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 5807a49..422824c 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -756,6 +756,7 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data);
 
 unsigned long kvm_get_rflags(struct kvm_vcpu *vcpu);
 void kvm_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags);
+bool kvm_rdpmc(struct kvm_vcpu *vcpu);
 
 void kvm_queue_exception(struct kvm_vcpu *vcpu, unsigned nr);
 void kvm_queue_exception_e(struct kvm_vcpu *vcpu, unsigned nr, u32 error_code);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 52a8666..b88426c 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -815,6 +815,21 @@ int kvm_get_dr(struct kvm_vcpu *vcpu, int dr, unsigned long *val)
 }
 EXPORT_SYMBOL_GPL(kvm_get_dr);
 
+bool kvm_rdpmc(struct kvm_vcpu *vcpu)
+{
+	u32 ecx = kvm_register_read(vcpu, VCPU_REGS_RCX);
+	u64 data;
+	int err;
+
+	err = kvm_pmu_read_pmc(vcpu, ecx, &data);
+	if (err)
+		return err;
+	kvm_register_write(vcpu, VCPU_REGS_RAX, (u32)data);
+	kvm_register_write(vcpu, VCPU_REGS_RDX, data >> 32);
+	return err;
+}
+EXPORT_SYMBOL_GPL(kvm_rdpmc);
+
 /*
  * List of msr numbers which we expose to userspace through KVM_GET_MSRS
  * and KVM_SET_MSRS, and KVM_GET_MSR_INDEX_LIST.
-- 
1.7.5.3


* [PATCHv2 4/9] KVM: SVM: Intercept RDPMC
  2011-11-03 12:33 [PATCHv2 0/9] KVM in-guest performance monitoring Gleb Natapov
                   ` (2 preceding siblings ...)
  2011-11-03 12:33 ` [PATCHv2 3/9] KVM: Add generic RDPMC support Gleb Natapov
@ 2011-11-03 12:33 ` Gleb Natapov
  2011-11-03 12:33 ` [PATCHv2 5/9] KVM: VMX: " Gleb Natapov
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 42+ messages in thread
From: Gleb Natapov @ 2011-11-03 12:33 UTC (permalink / raw)
  To: kvm; +Cc: avi, mtosatti, linux-kernel, mingo, a.p.zijlstra, acme

From: Avi Kivity <avi@redhat.com>

Intercept RDPMC and forward it to the PMU emulation code.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
---
 arch/x86/kvm/svm.c |   15 +++++++++++++++
 1 files changed, 15 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index e32243e..5fa553b 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -1014,6 +1014,7 @@ static void init_vmcb(struct vcpu_svm *svm)
 	set_intercept(svm, INTERCEPT_NMI);
 	set_intercept(svm, INTERCEPT_SMI);
 	set_intercept(svm, INTERCEPT_SELECTIVE_CR0);
+	set_intercept(svm, INTERCEPT_RDPMC);
 	set_intercept(svm, INTERCEPT_CPUID);
 	set_intercept(svm, INTERCEPT_INVD);
 	set_intercept(svm, INTERCEPT_HLT);
@@ -2770,6 +2771,19 @@ static int emulate_on_interception(struct vcpu_svm *svm)
 	return emulate_instruction(&svm->vcpu, 0) == EMULATE_DONE;
 }
 
+static int rdpmc_interception(struct vcpu_svm *svm)
+{
+	int err;
+
+	if (!static_cpu_has(X86_FEATURE_NRIPS))
+		return emulate_on_interception(svm);
+
+	err = kvm_rdpmc(&svm->vcpu);
+	kvm_complete_insn_gp(&svm->vcpu, err);
+
+	return 1;
+}
+
 bool check_selective_cr0_intercepted(struct vcpu_svm *svm, unsigned long val)
 {
 	unsigned long cr0 = svm->vcpu.arch.cr0;
@@ -3190,6 +3204,7 @@ static int (*svm_exit_handlers[])(struct vcpu_svm *svm) = {
 	[SVM_EXIT_SMI]				= nop_on_interception,
 	[SVM_EXIT_INIT]				= nop_on_interception,
 	[SVM_EXIT_VINTR]			= interrupt_window_interception,
+	[SVM_EXIT_RDPMC]			= rdpmc_interception,
 	[SVM_EXIT_CPUID]			= cpuid_interception,
 	[SVM_EXIT_IRET]                         = iret_interception,
 	[SVM_EXIT_INVD]                         = emulate_on_interception,
-- 
1.7.5.3


* [PATCHv2 5/9] KVM: VMX: Intercept RDPMC
  2011-11-03 12:33 [PATCHv2 0/9] KVM in-guest performance monitoring Gleb Natapov
                   ` (3 preceding siblings ...)
  2011-11-03 12:33 ` [PATCHv2 4/9] KVM: SVM: Intercept RDPMC Gleb Natapov
@ 2011-11-03 12:33 ` Gleb Natapov
  2011-11-03 12:33 ` [PATCHv2 6/9] perf: expose perf capability to other modules Gleb Natapov
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 42+ messages in thread
From: Gleb Natapov @ 2011-11-03 12:33 UTC (permalink / raw)
  To: kvm; +Cc: avi, mtosatti, linux-kernel, mingo, a.p.zijlstra, acme

From: Avi Kivity <avi@redhat.com>

Intercept RDPMC and forward it to the PMU emulation code.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
---
 arch/x86/kvm/vmx.c |   15 ++++++++++++++-
 1 files changed, 14 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 6e28d58..a6535ba 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -1956,6 +1956,7 @@ static __init void nested_vmx_setup_ctls_msrs(void)
 #endif
 		CPU_BASED_MOV_DR_EXITING | CPU_BASED_UNCOND_IO_EXITING |
 		CPU_BASED_USE_IO_BITMAPS | CPU_BASED_MONITOR_EXITING |
+		CPU_BASED_RDPMC_EXITING |
 		CPU_BASED_ACTIVATE_SECONDARY_CONTROLS;
 	/*
 	 * We can allow some features even when not supported by the
@@ -2414,7 +2415,8 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf)
 	      CPU_BASED_USE_TSC_OFFSETING |
 	      CPU_BASED_MWAIT_EXITING |
 	      CPU_BASED_MONITOR_EXITING |
-	      CPU_BASED_INVLPG_EXITING;
+	      CPU_BASED_INVLPG_EXITING |
+	      CPU_BASED_RDPMC_EXITING;
 
 	if (yield_on_hlt)
 		min |= CPU_BASED_HLT_EXITING;
@@ -4615,6 +4617,16 @@ static int handle_invlpg(struct kvm_vcpu *vcpu)
 	return 1;
 }
 
+static int handle_rdpmc(struct kvm_vcpu *vcpu)
+{
+	int err;
+
+	err = kvm_rdpmc(vcpu);
+	kvm_complete_insn_gp(vcpu, err);
+
+	return 1;
+}
+
 static int handle_wbinvd(struct kvm_vcpu *vcpu)
 {
 	skip_emulated_instruction(vcpu);
@@ -5565,6 +5577,7 @@ static int (*kvm_vmx_exit_handlers[])(struct kvm_vcpu *vcpu) = {
 	[EXIT_REASON_HLT]                     = handle_halt,
 	[EXIT_REASON_INVD]		      = handle_invd,
 	[EXIT_REASON_INVLPG]		      = handle_invlpg,
+	[EXIT_REASON_RDPMC]                   = handle_rdpmc,
 	[EXIT_REASON_VMCALL]                  = handle_vmcall,
 	[EXIT_REASON_VMCLEAR]	              = handle_vmclear,
 	[EXIT_REASON_VMLAUNCH]                = handle_vmlaunch,
-- 
1.7.5.3


* [PATCHv2 6/9] perf: expose perf capability to other modules.
  2011-11-03 12:33 [PATCHv2 0/9] KVM in-guest performance monitoring Gleb Natapov
                   ` (4 preceding siblings ...)
  2011-11-03 12:33 ` [PATCHv2 5/9] KVM: VMX: " Gleb Natapov
@ 2011-11-03 12:33 ` Gleb Natapov
  2011-11-07 14:07   ` Peter Zijlstra
  2011-11-03 12:33 ` [PATCHv2 7/9] KVM: Expose the architectural performance monitoring CPUID leaf Gleb Natapov
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 42+ messages in thread
From: Gleb Natapov @ 2011-11-03 12:33 UTC (permalink / raw)
  To: kvm; +Cc: avi, mtosatti, linux-kernel, mingo, a.p.zijlstra, acme

KVM needs to know the host's perf capabilities to decide which PMU it
can expose to a guest.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
---
 arch/x86/include/asm/perf_event.h      |   15 +++++++++++++++
 arch/x86/kernel/cpu/perf_event.c       |   11 +++++++++++
 arch/x86/kernel/cpu/perf_event.h       |    2 ++
 arch/x86/kernel/cpu/perf_event_intel.c |    3 +++
 4 files changed, 31 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index f61c62f..44b9145 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -81,6 +81,15 @@ union cpuid10_edx {
 	unsigned int full;
 };
 
+struct x86_pmu_capability {
+	int version;
+	int num_counters_gp;
+	int num_counters_fixed;
+	int bit_width_gp;
+	int bit_width_fixed;
+	unsigned int events_mask;
+	int events_mask_len;
+};
 
 /*
  * Fixed-purpose performance events:
@@ -202,6 +211,7 @@ struct perf_guest_switch_msr {
 };
 
 extern struct perf_guest_switch_msr *perf_guest_get_msrs(int *nr);
+extern void perf_get_x86_pmu_capability(struct x86_pmu_capability *cap);
 #else
 static inline struct perf_guest_switch_msr *perf_guest_get_msrs(int *nr)
 {
@@ -209,6 +219,11 @@ static inline perf_guest_switch_msr *perf_guest_get_msrs(int *nr)
 	return NULL;
 }
 
+static inline void perf_get_x86_pmu_capability(struct x86_pmu_capability *cap)
+{
+	memset(cap, 0, sizeof(*cap));
+}
+
 static inline void perf_events_lapic_init(void)	{ }
 #endif
 
diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 6408910..94ac9ca 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -1570,3 +1570,14 @@ unsigned long perf_misc_flags(struct pt_regs *regs)
 
 	return misc;
 }
+
+void perf_get_x86_pmu_capability(struct x86_pmu_capability *cap)
+{
+	cap->version = x86_pmu.version;
+	cap->num_counters_gp = x86_pmu.num_counters;
+	cap->num_counters_fixed = x86_pmu.num_counters_fixed;
+	cap->bit_width_gp = cap->bit_width_fixed = x86_pmu.cntval_bits;
+	cap->events_mask = x86_pmu.events_mask;
+	cap->events_mask_len = x86_pmu.events_mask_len;
+}
+EXPORT_SYMBOL_GPL(perf_get_x86_pmu_capability);
diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index b9698d4..e9ed238 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -259,6 +259,8 @@ struct x86_pmu {
 	int		num_counters_fixed;
 	int		cntval_bits;
 	u64		cntval_mask;
+	u32		events_mask;
+	int		events_mask_len;
 	int		apic;
 	u64		max_period;
 	struct event_constraint *
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index e09ca20..64e5f35 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -1580,6 +1580,8 @@ __init int intel_pmu_init(void)
 	x86_pmu.num_counters		= eax.split.num_counters;
 	x86_pmu.cntval_bits		= eax.split.bit_width;
 	x86_pmu.cntval_mask		= (1ULL << eax.split.bit_width) - 1;
+	x86_pmu.events_mask		= ebx;
+	x86_pmu.events_mask_len		= eax.split.mask_length;
 
 	/*
 	 * Quirk: v2 perfmon does not report fixed-purpose events, so
@@ -1651,6 +1653,7 @@ __init int intel_pmu_init(void)
 			 * architectural event which is often completely bogus:
 			 */
 			intel_perfmon_event_map[PERF_COUNT_HW_BRANCH_MISSES] = 0x7f89;
+			x86_pmu.events_mask &= ~0x40;
 
 			pr_cont("erratum AAJ80 worked around, ");
 		}
-- 
1.7.5.3


* [PATCHv2 7/9] KVM: Expose the architectural performance monitoring CPUID leaf
  2011-11-03 12:33 [PATCHv2 0/9] KVM in-guest performance monitoring Gleb Natapov
                   ` (5 preceding siblings ...)
  2011-11-03 12:33 ` [PATCHv2 6/9] perf: expose perf capability to other modules Gleb Natapov
@ 2011-11-03 12:33 ` Gleb Natapov
  2011-11-07 14:09   ` Peter Zijlstra
  2011-11-03 12:33 ` [PATCHv2 8/9] KVM: x86 emulator: fix RDPMC privilege check Gleb Natapov
  2011-11-03 12:33 ` [PATCHv2 9/9] KVM: x86 emulator: implement RDPMC (0F 33) Gleb Natapov
  8 siblings, 1 reply; 42+ messages in thread
From: Gleb Natapov @ 2011-11-03 12:33 UTC (permalink / raw)
  To: kvm; +Cc: avi, mtosatti, linux-kernel, mingo, a.p.zijlstra, acme

Provide a CPUID leaf that describes the emulated PMU.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
---
 arch/x86/kvm/x86.c |   23 ++++++++++++++++++++++-
 1 files changed, 22 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index b88426c..3c9c8c9 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2544,6 +2544,28 @@ static void do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
 	}
 	case 9:
 		break;
+	case 0xa: { /* Architectural Performance Monitoring */
+		struct x86_pmu_capability cap;
+
+		perf_get_x86_pmu_capability(&cap);
+
+		/*
+		 * Only support guest architectural pmu on a host
+		 * with architectural pmu.
+		 */
+		if (!cap.version)
+			memset(&cap, 0, sizeof(cap));
+
+		entry->eax = min(cap.version, 2)
+			| (cap.num_counters_gp << 8)
+			| (cap.bit_width_gp << 16)
+			| (cap.events_mask_len << 24);
+		entry->ebx = cap.events_mask;
+		entry->ecx = 0;
+		entry->edx = cap.num_counters_fixed
+			| (cap.bit_width_fixed << 5);
+		break;
+	}
 	/* function 0xb has additional index. */
 	case 0xb: {
 		int i, level_type;
@@ -2638,7 +2660,6 @@ static void do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
 	case 3: /* Processor serial number */
 	case 5: /* MONITOR/MWAIT */
 	case 6: /* Thermal management */
-	case 0xA: /* Architectural Performance Monitoring */
 	case 0x80000007: /* Advanced power management */
 	case 0xC0000002:
 	case 0xC0000003:
-- 
1.7.5.3


* [PATCHv2 8/9] KVM: x86 emulator: fix RDPMC privilege check
  2011-11-03 12:33 [PATCHv2 0/9] KVM in-guest performance monitoring Gleb Natapov
                   ` (6 preceding siblings ...)
  2011-11-03 12:33 ` [PATCHv2 7/9] KVM: Expose the architectural performance monitoring CPUID leaf Gleb Natapov
@ 2011-11-03 12:33 ` Gleb Natapov
  2011-11-03 12:33 ` [PATCHv2 9/9] KVM: x86 emulator: implement RDPMC (0F 33) Gleb Natapov
  8 siblings, 0 replies; 42+ messages in thread
From: Gleb Natapov @ 2011-11-03 12:33 UTC (permalink / raw)
  To: kvm; +Cc: avi, mtosatti, linux-kernel, mingo, a.p.zijlstra, acme

From: Avi Kivity <avi@redhat.com>

RDPMC is only privileged if CR4.PCE=0.  check_rdpmc() already implements this,
so all we need to do is drop the Priv flag.
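For reference, check_rdpmc() in emulate.c is roughly the following
(reconstructed here for context; it is not part of this diff and the
exact counter-bounds check may differ):

    static int check_rdpmc(struct x86_emulate_ctxt *ctxt)
    {
            u64 cr4 = ctxt->ops->get_cr(ctxt, 4);
            u64 rcx = ctxt->regs[VCPU_REGS_RCX];

            /* #GP if CPL > 0 while CR4.PCE is clear, or bad counter */
            if ((!(cr4 & X86_CR4_PCE) && ctxt->ops->cpl(ctxt)) ||
                (rcx > 3))
                    return emulate_gp(ctxt, 0);

            return X86EMUL_CONTINUE;
    }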

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
---
 arch/x86/kvm/emulate.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 8547958..c0ee85b 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -3254,7 +3254,7 @@ static struct opcode twobyte_table[256] = {
 	DI(ImplicitOps | Priv, wrmsr),
 	IIP(ImplicitOps, em_rdtsc, rdtsc, check_rdtsc),
 	DI(ImplicitOps | Priv, rdmsr),
-	DIP(ImplicitOps | Priv, rdpmc, check_rdpmc),
+	DIP(ImplicitOps, rdpmc, check_rdpmc),
 	I(ImplicitOps | VendorSpecific, em_sysenter),
 	I(ImplicitOps | Priv | VendorSpecific, em_sysexit),
 	N, N,
-- 
1.7.5.3


* [PATCHv2 9/9] KVM: x86 emulator: implement RDPMC (0F 33)
  2011-11-03 12:33 [PATCHv2 0/9] KVM in-guest performance monitoring Gleb Natapov
                   ` (7 preceding siblings ...)
  2011-11-03 12:33 ` [PATCHv2 8/9] KVM: x86 emulator: fix RDPMC privilege check Gleb Natapov
@ 2011-11-03 12:33 ` Gleb Natapov
  8 siblings, 0 replies; 42+ messages in thread
From: Gleb Natapov @ 2011-11-03 12:33 UTC (permalink / raw)
  To: kvm; +Cc: avi, mtosatti, linux-kernel, mingo, a.p.zijlstra, acme

From: Avi Kivity <avi@redhat.com>

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
---
 arch/x86/include/asm/kvm_emulate.h |    1 +
 arch/x86/kvm/emulate.c             |   13 ++++++++++++-
 arch/x86/kvm/x86.c                 |    7 +++++++
 3 files changed, 20 insertions(+), 1 deletions(-)

diff --git a/arch/x86/include/asm/kvm_emulate.h b/arch/x86/include/asm/kvm_emulate.h
index 9a4acf4..ab4092e 100644
--- a/arch/x86/include/asm/kvm_emulate.h
+++ b/arch/x86/include/asm/kvm_emulate.h
@@ -181,6 +181,7 @@ struct x86_emulate_ops {
 	int (*set_dr)(struct x86_emulate_ctxt *ctxt, int dr, ulong value);
 	int (*set_msr)(struct x86_emulate_ctxt *ctxt, u32 msr_index, u64 data);
 	int (*get_msr)(struct x86_emulate_ctxt *ctxt, u32 msr_index, u64 *pdata);
+	int (*read_pmc)(struct x86_emulate_ctxt *ctxt, u32 pmc, u64 *pdata);
 	void (*halt)(struct x86_emulate_ctxt *ctxt);
 	void (*wbinvd)(struct x86_emulate_ctxt *ctxt);
 	int (*fix_hypercall)(struct x86_emulate_ctxt *ctxt);
diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index c0ee85b..d76a852 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -2623,6 +2623,17 @@ static int em_rdtsc(struct x86_emulate_ctxt *ctxt)
 	return X86EMUL_CONTINUE;
 }
 
+static int em_rdpmc(struct x86_emulate_ctxt *ctxt)
+{
+	u64 pmc;
+
+	if (ctxt->ops->read_pmc(ctxt, ctxt->regs[VCPU_REGS_RCX], &pmc))
+		return emulate_gp(ctxt, 0);
+	ctxt->regs[VCPU_REGS_RAX] = (u32)pmc;
+	ctxt->regs[VCPU_REGS_RDX] = pmc >> 32;
+	return X86EMUL_CONTINUE;
+}
+
 static int em_mov(struct x86_emulate_ctxt *ctxt)
 {
 	ctxt->dst.val = ctxt->src.val;
@@ -3254,7 +3265,7 @@ static struct opcode twobyte_table[256] = {
 	DI(ImplicitOps | Priv, wrmsr),
 	IIP(ImplicitOps, em_rdtsc, rdtsc, check_rdtsc),
 	DI(ImplicitOps | Priv, rdmsr),
-	DIP(ImplicitOps, rdpmc, check_rdpmc),
+	IIP(ImplicitOps, em_rdpmc, rdpmc, check_rdpmc),
 	I(ImplicitOps | VendorSpecific, em_sysenter),
 	I(ImplicitOps | Priv | VendorSpecific, em_sysexit),
 	N, N,
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 3c9c8c9..9ca94d1 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4655,6 +4655,12 @@ static int emulator_set_msr(struct x86_emulate_ctxt *ctxt,
 	return kvm_set_msr(emul_to_vcpu(ctxt), msr_index, data);
 }
 
+static int emulator_read_pmc(struct x86_emulate_ctxt *ctxt,
+			     u32 pmc, u64 *pdata)
+{
+	return kvm_pmu_read_pmc(emul_to_vcpu(ctxt), pmc, pdata);
+}
+
 static void emulator_halt(struct x86_emulate_ctxt *ctxt)
 {
 	emul_to_vcpu(ctxt)->arch.halt_request = 1;
@@ -4707,6 +4713,7 @@ static struct x86_emulate_ops emulate_ops = {
 	.set_dr              = emulator_set_dr,
 	.set_msr             = emulator_set_msr,
 	.get_msr             = emulator_get_msr,
+	.read_pmc            = emulator_read_pmc,
 	.halt                = emulator_halt,
 	.wbinvd              = emulator_wbinvd,
 	.fix_hypercall       = emulator_fix_hypercall,
-- 
1.7.5.3


* Re: [PATCHv2 6/9] perf: expose perf capability to other modules.
  2011-11-03 12:33 ` [PATCHv2 6/9] perf: expose perf capability to other modules Gleb Natapov
@ 2011-11-07 14:07   ` Peter Zijlstra
  2011-11-07 15:53     ` Gleb Natapov
  2011-11-08 12:49     ` Gleb Natapov
  0 siblings, 2 replies; 42+ messages in thread
From: Peter Zijlstra @ 2011-11-07 14:07 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: kvm, avi, mtosatti, linux-kernel, mingo, acme

On Thu, 2011-11-03 at 14:33 +0200, Gleb Natapov wrote:
> @@ -1580,6 +1580,8 @@ __init int intel_pmu_init(void)
>         x86_pmu.num_counters            = eax.split.num_counters;
>         x86_pmu.cntval_bits             = eax.split.bit_width;
>         x86_pmu.cntval_mask             = (1ULL << eax.split.bit_width) - 1;
> +       x86_pmu.events_mask             = ebx;
> +       x86_pmu.events_mask_len         = eax.split.mask_length;
>  
>         /*
>          * Quirk: v2 perfmon does not report fixed-purpose events, so
> @@ -1651,6 +1653,7 @@ __init int intel_pmu_init(void)
>                          * architectural event which is often completely bogus:
>                          */
>                         intel_perfmon_event_map[PERF_COUNT_HW_BRANCH_MISSES] = 0x7f89;
> +                       x86_pmu.events_mask &= ~0x40;
>  
>                         pr_cont("erratum AAJ80 worked around, ");
>                 } 

It might make sense to introduce a cpuid10_ebx union or so. Also, I think
the removal of the branch-miss-retired event is either unwanted or
incomplete: as seen, software already expects that bit to be set, even
though it's known to be broken.

At the very least add a full ebx iteration to disable unsupported events
in the intel-v1 case.
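Such a union could mirror the existing cpuid10_eax/cpuid10_edx
definitions, e.g. (a sketch only; the field names are illustrative):

    union cpuid10_ebx {
            struct {
                    unsigned int no_unhalted_core_cycles:1;
                    unsigned int no_instructions_retired:1;
                    unsigned int no_unhalted_reference_cycles:1;
                    unsigned int no_llc_reference:1;
                    unsigned int no_llc_misses:1;
                    unsigned int no_branch_instruction_retired:1;
                    unsigned int no_branch_misses_retired:1;
            } split;
            unsigned int full;
    };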

* Re: [PATCHv2 7/9] KVM: Expose the architectural performance monitoring CPUID leaf
  2011-11-03 12:33 ` [PATCHv2 7/9] KVM: Expose the architectural performance monitoring CPUID leaf Gleb Natapov
@ 2011-11-07 14:09   ` Peter Zijlstra
  2011-11-07 15:41     ` Gleb Natapov
  0 siblings, 1 reply; 42+ messages in thread
From: Peter Zijlstra @ 2011-11-07 14:09 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: kvm, avi, mtosatti, linux-kernel, mingo, acme

On Thu, 2011-11-03 at 14:33 +0200, Gleb Natapov wrote:
> +       case 0xa: { /* Architectural Performance Monitoring */
> +               struct x86_pmu_capability cap;
> +
> +               perf_get_x86_pmu_capability(&cap);
> +
> +               /*
> +                * Only support guest architectural pmu on a host
> +                * with architectural pmu.
> +                */
> +               if (!cap.version)
> +                       memset(&cap, 0, sizeof(cap));
> +
> +               entry->eax = min(cap.version, 2)
> +                       | (cap.num_counters_gp << 8)
> +                       | (cap.bit_width_gp << 16)
> +                       | (cap.events_mask_len << 24);
> +               entry->ebx = cap.events_mask;
> +               entry->ecx = 0;
> +               entry->edx = cap.num_counters_fixed
> +                       | (cap.bit_width_fixed << 5);
> +               break;
> +       } 

would it make sense to use the cpuid10_e[ad]x unions to fill out that
data?
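Something like this, reusing the bitfield unions from
<asm/perf_event.h> (a sketch):

    union cpuid10_eax eax = { .full = 0 };
    union cpuid10_edx edx = { .full = 0 };

    eax.split.version_id   = min(cap.version, 2);
    eax.split.num_counters = cap.num_counters_gp;
    eax.split.bit_width    = cap.bit_width_gp;
    eax.split.mask_length  = cap.events_mask_len;

    edx.split.num_counters_fixed = cap.num_counters_fixed;
    edx.split.bit_width_fixed    = cap.bit_width_fixed;

    entry->eax = eax.full;
    entry->ebx = cap.events_mask;
    entry->ecx = 0;
    entry->edx = edx.full;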

* Re: [PATCHv2 2/9] KVM: Expose a version 2 architectural PMU to guests
  2011-11-03 12:33 ` [PATCHv2 2/9] KVM: Expose a version 2 architectural PMU to guests Gleb Natapov
@ 2011-11-07 14:22   ` Peter Zijlstra
  2011-11-07 15:34     ` Gleb Natapov
  2011-11-07 14:34   ` Peter Zijlstra
  2011-11-07 14:36   ` Peter Zijlstra
  2 siblings, 1 reply; 42+ messages in thread
From: Peter Zijlstra @ 2011-11-07 14:22 UTC (permalink / raw)
  To: Gleb Natapov
  Cc: kvm, avi, mtosatti, linux-kernel, mingo, acme, Frederic Weisbecker

On Thu, 2011-11-03 at 14:33 +0200, Gleb Natapov wrote:
> @@ -35,6 +35,7 @@ config KVM
>         select KVM_MMIO
>         select TASKSTATS
>         select TASK_DELAY_ACCT
> +       select PERF_EVENTS 

Do you really want to make that an unconditional part of KVM? I know we
can't currently build x86 without perf due to that hw breakpoint
trainwreck, but people were actually wanting to solve that.



* Re: [PATCHv2 2/9] KVM: Expose a version 2 architectural PMU to guests
  2011-11-03 12:33 ` [PATCHv2 2/9] KVM: Expose a version 2 architectural PMU to guests Gleb Natapov
  2011-11-07 14:22   ` Peter Zijlstra
@ 2011-11-07 14:34   ` Peter Zijlstra
  2011-11-07 14:46     ` Avi Kivity
  2011-11-07 14:36   ` Peter Zijlstra
  2 siblings, 1 reply; 42+ messages in thread
From: Peter Zijlstra @ 2011-11-07 14:34 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: kvm, avi, mtosatti, linux-kernel, mingo, acme

On Thu, 2011-11-03 at 14:33 +0200, Gleb Natapov wrote:
> +static void kvm_perf_overflow_intr(struct perf_event *perf_event,
> +               struct perf_sample_data *data, struct pt_regs *regs)
> +{
> +       struct kvm_pmc *pmc = perf_event->overflow_handler_context;
> +       struct kvm_pmu *pmu = &pmc->vcpu->arch.pmu;
> +       if (!test_and_set_bit(pmc->idx, (unsigned long *)&pmu->reprogram_pmi)) {
> +               kvm_perf_overflow(perf_event, data, regs);
> +               kvm_make_request(KVM_REQ_PMU, pmc->vcpu);
> +               /*
> +                * Inject PMI. If the vcpu was in guest mode during the NMI,
> +                * the PMI can be injected on guest re-entry. Otherwise we
> +                * can't be sure the vcpu is not halted, so we should kick
> +                * it, but that is impossible from NMI context. Do it from
> +                * irq work instead.
> +                */
> +               if (!kvm_is_in_guest())
> +                       irq_work_queue(&pmc->vcpu->arch.pmu.irq_work);
> +               else
> +                       kvm_make_request(KVM_REQ_PMI, pmc->vcpu);
> +       }
> +} 

I'm not sure I get this: since the counters are bound to the vcpu task and
only count while the guest is running as per commit 144d31e6f19, how can we
get an NMI outside of guest context with the vcpu not halted?

* Re: [PATCHv2 2/9] KVM: Expose a version 2 architectural PMU to guests
  2011-11-03 12:33 ` [PATCHv2 2/9] KVM: Expose a version 2 architectural PMU to guests Gleb Natapov
  2011-11-07 14:22   ` Peter Zijlstra
  2011-11-07 14:34   ` Peter Zijlstra
@ 2011-11-07 14:36   ` Peter Zijlstra
  2011-11-07 15:25     ` Gleb Natapov
  2 siblings, 1 reply; 42+ messages in thread
From: Peter Zijlstra @ 2011-11-07 14:36 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: kvm, avi, mtosatti, linux-kernel, mingo, acme

On Thu, 2011-11-03 at 14:33 +0200, Gleb Natapov wrote:
> +static u64 read_pmc(struct kvm_pmc *pmc)
> +{
> +       u64 counter, enabled, running;
> +
> +       counter = pmc->counter;
> +
> +       if (pmc->perf_event)
> +               counter += perf_event_read_value(pmc->perf_event,
> +                                                &enabled, &running);
> +
> +       /* FIXME: Scaling needed? */

Since the programming below doesn't use perf_event_attr::pinned, yes.

> +       return counter & pmc_bitmask(pmc);
> +}

> +static void reprogram_counter(struct kvm_pmc *pmc, u32 type,
> +               unsigned config, bool exclude_user, bool exclude_kernel,
> +               bool intr)
> +{
> +       struct perf_event *event;
> +       struct perf_event_attr attr = {
> +               .type = type,
> +               .size = sizeof(attr),
> +               .exclude_idle = true,
> +               .exclude_host = 1,
> +               .exclude_user = exclude_user,
> +               .exclude_kernel = exclude_kernel,
> +               .sample_period = (-pmc->counter) & pmc_bitmask(pmc),
> +               .config = config,
> +       };
> +
> +       event = perf_event_create_kernel_counter(&attr, -1, current,
> +                                                intr ? kvm_perf_overflow_intr :
> +                                                kvm_perf_overflow, pmc);
> +       if (IS_ERR(event)) {
> +               printk_once("kvm: pmu event creation failed %ld\n",
> +                               PTR_ERR(event));
> +               return;
> +       }
> +
> +       pmc->perf_event = event;
> +       clear_bit(pmc->idx, (unsigned long*)&pmc->vcpu->arch.pmu.reprogram_pmi);
> +} 
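A time-scaled read along those lines might look like this (a sketch;
whether to scale in read_pmc() or to set attr.pinned instead is exactly
the open question):

    u64 enabled, running, delta;

    delta = perf_event_read_value(pmc->perf_event, &enabled, &running);
    if (running && running < enabled)
            /* the event was multiplexed; may overflow for huge deltas */
            delta = div64_u64(delta * enabled, running);
    counter += delta;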

* Re: [PATCHv2 2/9] KVM: Expose a version 2 architectural PMU to guests
  2011-11-07 14:34   ` Peter Zijlstra
@ 2011-11-07 14:46     ` Avi Kivity
  2011-11-07 14:59       ` Peter Zijlstra
  0 siblings, 1 reply; 42+ messages in thread
From: Avi Kivity @ 2011-11-07 14:46 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Gleb Natapov, kvm, mtosatti, linux-kernel, mingo, acme

On 11/07/2011 04:34 PM, Peter Zijlstra wrote:
> On Thu, 2011-11-03 at 14:33 +0200, Gleb Natapov wrote:
> > +static void kvm_perf_overflow_intr(struct perf_event *perf_event,
> > +               struct perf_sample_data *data, struct pt_regs *regs)
> > +{
> > +       struct kvm_pmc *pmc = perf_event->overflow_handler_context;
> > +       struct kvm_pmu *pmu = &pmc->vcpu->arch.pmu;
> > +       if (!test_and_set_bit(pmc->idx, (unsigned long *)&pmu->reprogram_pmi)) {
> > +               kvm_perf_overflow(perf_event, data, regs);
> > +               kvm_make_request(KVM_REQ_PMU, pmc->vcpu);
> > +               /*
> > +                * Inject PMI. If the vcpu was in guest mode during the NMI,
> > +                * the PMI can be injected on guest re-entry. Otherwise we
> > +                * can't be sure the vcpu is not halted, so we should kick
> > +                * it, but that is impossible from NMI context. Do it from
> > +                * irq work instead.
> > +                */
> > +               if (!kvm_is_in_guest())
> > +                       irq_work_queue(&pmc->vcpu->arch.pmu.irq_work);
> > +               else
> > +                       kvm_make_request(KVM_REQ_PMI, pmc->vcpu);
> > +       }
> > +} 
>
> I'm not sure I get this, since the counters are vcpu task bound and only
> count while the guest is running as per (144d31e6f19) how can we get an
> NMI outside of guest context with the vcpu not halted?

PMI skew.  Do we know how bad it can get?

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCHv2 2/9] KVM: Expose a version 2 architectural PMU to a guests
  2011-11-07 14:46     ` Avi Kivity
@ 2011-11-07 14:59       ` Peter Zijlstra
  2011-11-07 15:11         ` Gleb Natapov
  2011-11-07 15:13         ` Avi Kivity
  0 siblings, 2 replies; 42+ messages in thread
From: Peter Zijlstra @ 2011-11-07 14:59 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Gleb Natapov, kvm, mtosatti, linux-kernel, mingo, acme

On Mon, 2011-11-07 at 16:46 +0200, Avi Kivity wrote:
> On 11/07/2011 04:34 PM, Peter Zijlstra wrote:
> > On Thu, 2011-11-03 at 14:33 +0200, Gleb Natapov wrote:
> > > +static void kvm_perf_overflow_intr(struct perf_event *perf_event,
> > > +               struct perf_sample_data *data, struct pt_regs *regs)
> > > +{
> > > +       struct kvm_pmc *pmc = perf_event->overflow_handler_context;
> > > +       struct kvm_pmu *pmu = &pmc->vcpu->arch.pmu;
> > > +       if (!test_and_set_bit(pmc->idx, (unsigned long *)&pmu->reprogram_pmi)) {
> > > +               kvm_perf_overflow(perf_event, data, regs);
> > > +               kvm_make_request(KVM_REQ_PMU, pmc->vcpu);
> > > +               /*
> > > +                * Inject PMI. If vcpu was in a guest mode during NMI PMI
> > > +                * can be ejected on a guest mode re-entry. Otherwise we can't
> > > +                * be sure that vcpu is not halted, so we should kick it, but
> > > +                * this is impossible from NMI context. Do it from irq work
> > > +                * instead.
> > > +                */
> > > +               if (!kvm_is_in_guest())
> > > +                       irq_work_queue(&pmc->vcpu->arch.pmu.irq_work);
> > > +               else
> > > +                       kvm_make_request(KVM_REQ_PMI, pmc->vcpu);
> > > +       }
> > > +} 
> >
> > I'm not sure I get this, since the counters are vcpu task bound and only
> > count while the guest is running as per (144d31e6f19) how can we get an
> > NMI outside of guest context with the vcpu not halted?
> 
> PMI skew.  Do we know how bad it can get?

You're talking about the PMI getting asserted before leaving guest mode
but not getting ran until after we left guest mode?

But in that case we know the guest is stopped, right? So we can simply
set that REQ_PMI bit and have guest entry sort things, no?

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCHv2 2/9] KVM: Expose a version 2 architectural PMU to a guests
  2011-11-07 14:59       ` Peter Zijlstra
@ 2011-11-07 15:11         ` Gleb Natapov
  2011-11-07 15:13         ` Avi Kivity
  1 sibling, 0 replies; 42+ messages in thread
From: Gleb Natapov @ 2011-11-07 15:11 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Avi Kivity, kvm, mtosatti, linux-kernel, mingo, acme

On Mon, Nov 07, 2011 at 03:59:08PM +0100, Peter Zijlstra wrote:
> On Mon, 2011-11-07 at 16:46 +0200, Avi Kivity wrote:
> > On 11/07/2011 04:34 PM, Peter Zijlstra wrote:
> > > On Thu, 2011-11-03 at 14:33 +0200, Gleb Natapov wrote:
> > > > +static void kvm_perf_overflow_intr(struct perf_event *perf_event,
> > > > +               struct perf_sample_data *data, struct pt_regs *regs)
> > > > +{
> > > > +       struct kvm_pmc *pmc = perf_event->overflow_handler_context;
> > > > +       struct kvm_pmu *pmu = &pmc->vcpu->arch.pmu;
> > > > +       if (!test_and_set_bit(pmc->idx, (unsigned long *)&pmu->reprogram_pmi)) {
> > > > +               kvm_perf_overflow(perf_event, data, regs);
> > > > +               kvm_make_request(KVM_REQ_PMU, pmc->vcpu);
> > > > +               /*
> > > > +                * Inject PMI. If vcpu was in a guest mode during NMI PMI
> > > > +                * can be ejected on a guest mode re-entry. Otherwise we can't
> > > > +                * be sure that vcpu is not halted, so we should kick it, but
> > > > +                * this is impossible from NMI context. Do it from irq work
> > > > +                * instead.
> > > > +                */
> > > > +               if (!kvm_is_in_guest())
> > > > +                       irq_work_queue(&pmc->vcpu->arch.pmu.irq_work);
> > > > +               else
> > > > +                       kvm_make_request(KVM_REQ_PMI, pmc->vcpu);
> > > > +       }
> > > > +} 
> > >
> > > I'm not sure I get this, since the counters are vcpu task bound and only
> > > count while the guest is running as per (144d31e6f19) how can we get an
> > > NMI outside of guest context with the vcpu not halted?
> > 
> > PMI skew.  Do we know how bad it can get?
> 
I checked, and we do get them. Not often; something like once per second
during a "perf record" run in a guest.

> You're talking about the PMI getting asserted before leaving guest mode
> but not getting ran until after we left guest mode?
> 
> But in that case we know the guest is stopped, right? So we can simply
> set that REQ_PMI bit and have guest entry sort things, no?
If the exit reason was HLT, the guest will not attempt a reentry unless a
PMI or some other event is injected. We can't inject an event from NMI
context, so we schedule irq work in this case.

--
			Gleb.

^ permalink raw reply	[flat|nested] 42+ messages in thread
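
[Illustration: the irq-work side described above might look roughly like
this. kvm_pmu_pmi_work() is a hypothetical name; kvm_vcpu_kick() is the
existing helper that wakes a blocked vcpu and, unlike NMI context, irq
work context may call it.]

	static void kvm_pmu_pmi_work(struct irq_work *w)
	{
		struct kvm_pmu *pmu = container_of(w, struct kvm_pmu, irq_work);
		struct kvm_vcpu *vcpu = container_of(pmu, struct kvm_vcpu, arch.pmu);

		/* Mark the PMI pending and wake the vcpu in case it is halted. */
		kvm_make_request(KVM_REQ_PMI, vcpu);
		kvm_vcpu_kick(vcpu);
	}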

* Re: [PATCHv2 2/9] KVM: Expose a version 2 architectural PMU to a guests
  2011-11-07 14:59       ` Peter Zijlstra
  2011-11-07 15:11         ` Gleb Natapov
@ 2011-11-07 15:13         ` Avi Kivity
  2011-11-07 15:19           ` Gleb Natapov
  1 sibling, 1 reply; 42+ messages in thread
From: Avi Kivity @ 2011-11-07 15:13 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Gleb Natapov, kvm, mtosatti, linux-kernel, mingo, acme

On 11/07/2011 04:59 PM, Peter Zijlstra wrote:
> > > > +                */
> > > > +               if (!kvm_is_in_guest())
> > > > +                       irq_work_queue(&pmc->vcpu->arch.pmu.irq_work);
> > > > +               else
> > > > +                       kvm_make_request(KVM_REQ_PMI, pmc->vcpu);
> > > > +       }
> > > > +} 
> > >
> > > I'm not sure I get this, since the counters are vcpu task bound and only
> > > count while the guest is running as per (144d31e6f19) how can we get an
> > > NMI outside of guest context with the vcpu not halted?
> > 
> > PMI skew.  Do we know how bad it can get?
>
> You're talking about the PMI getting asserted before leaving guest mode
> but not getting ran until after we left guest mode?

Yes (here, "guest mode" is not the hardware guest mode, but what
kvm_is_in_guest() returns).

> But in that case we know the guest is stopped, right? So we can simply
> set that REQ_PMI bit and have guest entry sort things, no?

Unless there is no guest entry, due to the guest sleeping.

Consider the sequence:

guest:
   nop
hardware:
   counter overflow; PMI sent to APIC
guest:
   hlt
hardware:
   begin vmexit due to HLT
host:
   process vmexit
   mark as outside guest mode (i.e. kvm_is_in_guest() now returns false)
   schedule(), since the guest is sleeping (and assume no interrupts
pending)
hardware:
   APIC finally posts NMI
host:
   process NMI; set KVM_REQ_PMI
   iret
host:
   happily do other things, ignoring that we need to post a PMI to the guest

note, this needs a fairly huge PMI skew to happen.

(btw, like userspace, kvm PMIs may as well be ordinary interrupts. 
Should we consider some logic to only make them NMIs if some counter
requires it?  NMI costs are set to increase dramatically with the
check-all-possible-sources patchset).

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCHv2 2/9] KVM: Expose a version 2 architectural PMU to a guests
  2011-11-07 15:13         ` Avi Kivity
@ 2011-11-07 15:19           ` Gleb Natapov
  2011-11-07 15:25             ` Avi Kivity
  0 siblings, 1 reply; 42+ messages in thread
From: Gleb Natapov @ 2011-11-07 15:19 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Peter Zijlstra, kvm, mtosatti, linux-kernel, mingo, acme

On Mon, Nov 07, 2011 at 05:13:17PM +0200, Avi Kivity wrote:
> On 11/07/2011 04:59 PM, Peter Zijlstra wrote:
> > > > > +                */
> > > > > +               if (!kvm_is_in_guest())
> > > > > +                       irq_work_queue(&pmc->vcpu->arch.pmu.irq_work);
> > > > > +               else
> > > > > +                       kvm_make_request(KVM_REQ_PMI, pmc->vcpu);
> > > > > +       }
> > > > > +} 
> > > >
> > > > I'm not sure I get this, since the counters are vcpu task bound and only
> > > > count while the guest is running as per (144d31e6f19) how can we get an
> > > > NMI outside of guest context with the vcpu not halted?
> > > 
> > > PMI skew.  Do we know how bad it can get?
> >
> > You're talking about the PMI getting asserted before leaving guest mode
> > but not getting ran until after we left guest mode?
> 
> Yes (here, "guest mode" is not the hardware guest mode, but what
> kvm_is_in_guest() returns).
> 
> > But in that case we know the guest is stopped, right? So we can simply
> > set that REQ_PMI bit and have guest entry sort things, no?
> 
> Unless there is no guest entry, due to the guest sleeping.
> 
> Consider the sequence:
> 
> guest:
>    nop
> hardware:
>    counter overflow; PMI sent to APIC
> guest:
>    hlt
> hardware:
>    begin vmexit due to HLT
> host:
>    process vmexit
>    mark as outside guest mode (i.e. kvm_is_in_guest() now returns false)
>    schedule(), since the guest is sleeping (and assume no interrupts
> pending)
> hardware:
>    APIC finally posts NMI
> host:
>    process NMI; set KVM_REQ_PMI
>    iret
> host:
>    happily do other things, ignoring that we need to post a PMI to the guest
> 
> note, this needs a fairly huge PMI skew to happen.
> 
No, it need not. It is enough for the exit reason to be hlt instead of
nmi for a vcpu to go to the blocking state instead of reentering guest
mode. Note that we do not check request flags in kvm_arch_vcpu_runnable().

> (btw, like userspace, kvm PMIs may as well be ordinary interrupts. 
> Should we consider some logic to only make them NMIs if some counter
> requires it?  NMI costs are set to increase dramatically with the
> check-all-possible-sources patchset).
> 
> -- 
> error compiling committee.c: too many arguments to function

--
			Gleb.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCHv2 2/9] KVM: Expose a version 2 architectural PMU to a guests
  2011-11-07 15:19           ` Gleb Natapov
@ 2011-11-07 15:25             ` Avi Kivity
  2011-11-07 16:22               ` Peter Zijlstra
  0 siblings, 1 reply; 42+ messages in thread
From: Avi Kivity @ 2011-11-07 15:25 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Peter Zijlstra, kvm, mtosatti, linux-kernel, mingo, acme

On 11/07/2011 05:19 PM, Gleb Natapov wrote:
> > 
> > note, this needs a fairly huge PMI skew to happen.
> > 
> No, it need not. It is enough for the exit reason to be hlt instead of
> nmi for a vcpu to go to the blocking state instead of reentering guest
> mode. Note that we do not check request flags in kvm_arch_vcpu_runnable().

Right.

If we had a guarantee about the maximum skew, we could add a check for
KVM_REQ_PMI in kvm_vcpu_block().

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 42+ messages in thread
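
[Illustration: the check Avi suggests, sketched against the wait loop of
kvm_vcpu_block() in virt/kvm/kvm_main.c of this era; only the KVM_REQ_PMI
test is new, the rest is paraphrased context.]

	for (;;) {
		prepare_to_wait(&vcpu->wq, &wait, TASK_INTERRUPTIBLE);

		if (kvm_arch_vcpu_runnable(vcpu)) {
			kvm_make_request(KVM_REQ_UNHALT, vcpu);
			break;
		}
		/* New: a skewed PMI may land after the HLT exit; treat the
		 * pending request as a wakeup condition. */
		if (test_bit(KVM_REQ_PMI, &vcpu->requests))
			break;
		if (kvm_cpu_has_pending_timer(vcpu))
			break;
		if (signal_pending(current))
			break;

		schedule();
	}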

* Re: [PATCHv2 2/9] KVM: Expose a version 2 architectural PMU to a guests
  2011-11-07 14:36   ` Peter Zijlstra
@ 2011-11-07 15:25     ` Gleb Natapov
  2011-11-07 16:45       ` Peter Zijlstra
  0 siblings, 1 reply; 42+ messages in thread
From: Gleb Natapov @ 2011-11-07 15:25 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: kvm, avi, mtosatti, linux-kernel, mingo, acme

On Mon, Nov 07, 2011 at 03:36:49PM +0100, Peter Zijlstra wrote:
> On Thu, 2011-11-03 at 14:33 +0200, Gleb Natapov wrote:
> > +static u64 read_pmc(struct kvm_pmc *pmc)
> > +{
> > +       u64 counter, enabled, running;
> > +
> > +       counter = pmc->counter;
> > +
> > +       if (pmc->perf_event)
> > +               counter += perf_event_read_value(pmc->perf_event,
> > +                                                &enabled, &running);
> > +
> > +       /* FIXME: Scaling needed? */
> 
> Since the below programming doesn't use perf_event_attr::pinned, yes.
> 
Yes, that is on the todo list :). Actually I do want to place all guest
perf counters into the same event group and make it pinned. But currently
perf event groups are not very flexible. In our usage scenario we can't
have one event as a group leader since events are created and destroyed
very dynamically. What I would like is something like a meta event that
will group all the other real events.

> > +       return counter & pmc_bitmask(pmc);
> > +}
> 
> > +static void reprogram_counter(struct kvm_pmc *pmc, u32 type,
> > +               unsigned config, bool exclude_user, bool exclude_kernel,
> > +               bool intr)
> > +{
> > +       struct perf_event *event;
> > +       struct perf_event_attr attr = {
> > +               .type = type,
> > +               .size = sizeof(attr),
> > +               .exclude_idle = true,
> > +               .exclude_host = 1,
> > +               .exclude_user = exclude_user,
> > +               .exclude_kernel = exclude_kernel,
> > +               .sample_period = (-pmc->counter) & pmc_bitmask(pmc),
> > +               .config = config,
> > +       };
> > +
> > +       event = perf_event_create_kernel_counter(&attr, -1, current,
> > +                                                intr ? kvm_perf_overflow_intr :
> > +                                                kvm_perf_overflow, pmc);
> > +       if (IS_ERR(event)) {
> > +               printk_once("kvm: pmu event creation failed %ld\n",
> > +                               PTR_ERR(event));
> > +               return;
> > +       }
> > +
> > +       pmc->perf_event = event;
> > +       clear_bit(pmc->idx, (unsigned long*)&pmc->vcpu->arch.pmu.reprogram_pmi);
> > +} 

--
			Gleb.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCHv2 2/9] KVM: Expose a version 2 architectural PMU to a guests
  2011-11-07 14:22   ` Peter Zijlstra
@ 2011-11-07 15:34     ` Gleb Natapov
  2011-11-07 15:40       ` Avi Kivity
  0 siblings, 1 reply; 42+ messages in thread
From: Gleb Natapov @ 2011-11-07 15:34 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: kvm, avi, mtosatti, linux-kernel, mingo, acme, Frederic Weisbecker

On Mon, Nov 07, 2011 at 03:22:05PM +0100, Peter Zijlstra wrote:
> On Thu, 2011-11-03 at 14:33 +0200, Gleb Natapov wrote:
> > @@ -35,6 +35,7 @@ config KVM
> >         select KVM_MMIO
> >         select TASKSTATS
> >         select TASK_DELAY_ACCT
> > +       select PERF_EVENTS 
> 
> Do you really want to make that an unconditional part of KVM? I know we
> can't currently build x86 without perf due to that hw breakpoint
> trainwreck, but people were actually wanting to solve that.
> 
I am fine either way (the only thing is that I can't check that I can
build kvm without pmu right now). But I doubt that there will be many
KVM deployments without perf enabled, and making it optional will increase
the test matrix. KVM has a pretty extensive list of selects already for
that reason. I'll let the maintainers decide on this.

--
			Gleb.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCHv2 2/9] KVM: Expose a version 2 architectural PMU to a guests
  2011-11-07 15:34     ` Gleb Natapov
@ 2011-11-07 15:40       ` Avi Kivity
  0 siblings, 0 replies; 42+ messages in thread
From: Avi Kivity @ 2011-11-07 15:40 UTC (permalink / raw)
  To: Gleb Natapov
  Cc: Peter Zijlstra, kvm, mtosatti, linux-kernel, mingo, acme,
	Frederic Weisbecker

On 11/07/2011 05:34 PM, Gleb Natapov wrote:
> On Mon, Nov 07, 2011 at 03:22:05PM +0100, Peter Zijlstra wrote:
> > On Thu, 2011-11-03 at 14:33 +0200, Gleb Natapov wrote:
> > > @@ -35,6 +35,7 @@ config KVM
> > >         select KVM_MMIO
> > >         select TASKSTATS
> > >         select TASK_DELAY_ACCT
> > > +       select PERF_EVENTS 
> > 
> > Do you really want to make that an unconditional part of KVM? I know we
> > can't currently build x86 without perf due to that hw breakpoint
> > trainwreck, but people were actually wanting to solve that.
> > 
> I am fine either way (the only thing is that I can't check that I can
> build kvm without pmu right now). But I doubt that there will be many
> KVM deployment without perf enabled and making it optional will increase
> test matrix. KVM has pretty extensive list of selects already for that
> reason. I let maintainers to decide on this.

I prefer to avoid the select, when possible.  But that can be done after
PERF_EVENTS is disentangled from x86.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 42+ messages in thread
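
[Illustration: if the select were ever dropped, KVM would need a fallback
for the capability query. A minimal sketch, assuming the
struct x86_pmu_capability introduced by this series:]

	#ifndef CONFIG_PERF_EVENTS
	static inline void
	perf_get_x86_pmu_capability(struct x86_pmu_capability *cap)
	{
		/* No perf, no host PMU access: report version 0 so that KVM
		 * exposes no architectural PMU to guests. */
		memset(cap, 0, sizeof(*cap));
	}
	#endif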

* Re: [PATCHv2 7/9] KVM: Expose the architectural performance monitoring CPUID leaf
  2011-11-07 14:09   ` Peter Zijlstra
@ 2011-11-07 15:41     ` Gleb Natapov
  2011-11-07 15:45       ` Peter Zijlstra
  0 siblings, 1 reply; 42+ messages in thread
From: Gleb Natapov @ 2011-11-07 15:41 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: kvm, avi, mtosatti, linux-kernel, mingo, acme

On Mon, Nov 07, 2011 at 03:09:46PM +0100, Peter Zijlstra wrote:
> On Thu, 2011-11-03 at 14:33 +0200, Gleb Natapov wrote:
> > +       case 0xa: { /* Architectural Performance Monitoring */
> > +               struct x86_pmu_capability cap;
> > +
> > +               perf_get_x86_pmu_capability(&cap);
> > +
> > +               /*
> > +                * Only support guest architectural pmu on a host
> > +                * with architectural pmu.
> > +                */
> > +               if (!cap.version)
> > +                       memset(&cap, 0, sizeof(cap));
> > +
> > +               entry->eax = min(cap.version, 2)
> > +                       | (cap.num_counters_gp << 8)
> > +                       | (cap.bit_width_gp << 16)
> > +                       | (cap.events_mask_len << 24);
> > +               entry->ebx = cap.events_mask;
> > +               entry->ecx = 0;
> > +               entry->edx = cap.num_counters_fixed
> > +                       | (cap.bit_width_fixed << 5);
> > +               break;
> > +       } 
> 
> would it make sense to use the cpuid10_e[ad]x unions to fill out that
> data?
Do you mean doing cpuid here directly instead of checking the perf
capability? We do not (entirely) pass the PMU through to a guest. We
emulate it using the perf subsystem, and I can imagine cases where perf
capabilities will be different from what the host reports. For instance,
the host cpu may have an erratum that makes one of its counters unusable.
In such a case the code that checks for the erratum would have to be
duplicated here too. Or we may want to emulate an architectural PMU for a
guest running on an AMD host.

--
			Gleb.

^ permalink raw reply	[flat|nested] 42+ messages in thread
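
[Illustration: the capability export Gleb refers to fills the structure
from the host's x86_pmu state rather than from raw cpuid, which is what
lets quirks such as AAJ80 be applied in a single place. A sketch; reusing
cntval_bits for bit_width_fixed is an assumption.]

	void perf_get_x86_pmu_capability(struct x86_pmu_capability *cap)
	{
		cap->version		= x86_pmu.version;
		cap->num_counters_gp	= x86_pmu.num_counters;
		cap->num_counters_fixed	= x86_pmu.num_counters_fixed;
		cap->bit_width_gp	= x86_pmu.cntval_bits;
		cap->bit_width_fixed	= x86_pmu.cntval_bits;
		cap->events_mask	= x86_pmu.events_mask;
		cap->events_mask_len	= x86_pmu.events_mask_len;
	}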

* Re: [PATCHv2 7/9] KVM: Expose the architectural performance monitoring CPUID leaf
  2011-11-07 15:41     ` Gleb Natapov
@ 2011-11-07 15:45       ` Peter Zijlstra
  2011-11-07 15:54         ` Gleb Natapov
  0 siblings, 1 reply; 42+ messages in thread
From: Peter Zijlstra @ 2011-11-07 15:45 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: kvm, avi, mtosatti, linux-kernel, mingo, acme

On Mon, 2011-11-07 at 17:41 +0200, Gleb Natapov wrote:
> > > +               entry->eax = min(cap.version, 2)
> > > +                       | (cap.num_counters_gp << 8)
> > > +                       | (cap.bit_width_gp << 16)
> > > +                       | (cap.events_mask_len << 24); 


> Do you mean doing cpuid here directly instead of checking the perf
> capability?

No, I meant something like:

        union cpuid10_eax eax;

        eax.split.version_id   = min(cap.version, 2);
        eax.split.bit_width    = cap.bit_width_gp;
        eax.split.num_counters = cap.num_counters_gp;
        eax.split.mask_length  = cap.events_mask_len;

        entry->eax = eax.full;
        


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCHv2 6/9] perf: expose perf capability to other modules.
  2011-11-07 14:07   ` Peter Zijlstra
@ 2011-11-07 15:53     ` Gleb Natapov
  2011-11-07 16:01       ` Peter Zijlstra
  2011-11-08 12:49     ` Gleb Natapov
  1 sibling, 1 reply; 42+ messages in thread
From: Gleb Natapov @ 2011-11-07 15:53 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: kvm, avi, mtosatti, linux-kernel, mingo, acme

On Mon, Nov 07, 2011 at 03:07:50PM +0100, Peter Zijlstra wrote:
> On Thu, 2011-11-03 at 14:33 +0200, Gleb Natapov wrote:
> > @@ -1580,6 +1580,8 @@ __init int intel_pmu_init(void)
> >         x86_pmu.num_counters            = eax.split.num_counters;
> >         x86_pmu.cntval_bits             = eax.split.bit_width;
> >         x86_pmu.cntval_mask             = (1ULL << eax.split.bit_width) - 1;
> > +       x86_pmu.events_mask             = ebx;
> > +       x86_pmu.events_mask_len         = eax.split.mask_length;
> >  
> >         /*
> >          * Quirk: v2 perfmon does not report fixed-purpose events, so
> > @@ -1651,6 +1653,7 @@ __init int intel_pmu_init(void)
> >                          * architectural event which is often completely bogus:
> >                          */
> >                         intel_perfmon_event_map[PERF_COUNT_HW_BRANCH_MISSES] = 0x7f89;
> > +                       x86_pmu.events_mask &= ~0x40;
> >  
> >                         pr_cont("erratum AAJ80 worked around, ");
> >                 } 
> 
> It might make sense to introduce cpuid10_ebx or so, also I think the
> removal of the branch-miss-retired event is either unwanted or
> incomplete. As seen software already expects that bit to be set, even
> though its known broken.
> 
I removed branch-miss-retired here because, for a perf user, it exists.
Perf approximates it with another event, but the perf user shouldn't know
that. A guest does not always run with exactly the same cpu model number
as the host, so if we do not drop the bit here, a guest that sees a cpu
model other than the one with this quirk will not be able to use the event.

BTW, why does perf not check the event mask to see whether an architectural
event is available before programming it?

> At the very least add a full ebx iteration to disable unsupported events
> in the intel-v1 case.

--
			Gleb.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCHv2 7/9] KVM: Expose the architectural performance monitoring CPUID leaf
  2011-11-07 15:45       ` Peter Zijlstra
@ 2011-11-07 15:54         ` Gleb Natapov
  0 siblings, 0 replies; 42+ messages in thread
From: Gleb Natapov @ 2011-11-07 15:54 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: kvm, avi, mtosatti, linux-kernel, mingo, acme

On Mon, Nov 07, 2011 at 04:45:57PM +0100, Peter Zijlstra wrote:
> On Mon, 2011-11-07 at 17:41 +0200, Gleb Natapov wrote:
> > > > +               entry->eax = min(cap.version, 2)
> > > > +                       | (cap.num_counters_gp << 8)
> > > > +                       | (cap.bit_width_gp << 16)
> > > > +                       | (cap.events_mask_len << 24); 
> 
> 
> > Do you mean doing cpuid here directly instead of checking the perf
> > capability?
> 
> No, I meant something like:
>         
>         union cpuid10_eax eax;
>         
>         eax.split.version_id   = min(cap.version, 2);
>         eax.split.bit_width    = cap.bit_width_gp;
>         eax.split.num_counters = cap.num_counters_gp;
>         eax.split.mask_length  = cap.events_mask_len;
>         
>         entry->eax = eax.full;
>         
Ah, yes. Will do that.

--
			Gleb.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCHv2 6/9] perf: expose perf capability to other modules.
  2011-11-07 15:53     ` Gleb Natapov
@ 2011-11-07 16:01       ` Peter Zijlstra
  2011-11-07 16:22         ` Gleb Natapov
  0 siblings, 1 reply; 42+ messages in thread
From: Peter Zijlstra @ 2011-11-07 16:01 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: kvm, avi, mtosatti, linux-kernel, mingo, acme

On Mon, 2011-11-07 at 17:53 +0200, Gleb Natapov wrote:
> I removed branch-miss-retired here because, for a perf user, it exists.
> Perf approximates it with another event, but the perf user shouldn't know
> that. A guest does not always run with exactly the same cpu model number
> as the host, so if we do not drop the bit here, a guest that sees a cpu
> model other than the one with this quirk will not be able to use the event.

Right, so what model number do you expose? Anyway I don't really mind
masking the thing as long as we grow an ebx iteration,

> BTW, why does perf not check the event mask to see whether an
> architectural event is available before programming it?

No idea why not.. just one of those things nobody noticed/got around to
doing or so.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCHv2 6/9] perf: expose perf capability to other modules.
  2011-11-07 16:01       ` Peter Zijlstra
@ 2011-11-07 16:22         ` Gleb Natapov
  2011-11-07 16:25           ` Peter Zijlstra
  0 siblings, 1 reply; 42+ messages in thread
From: Gleb Natapov @ 2011-11-07 16:22 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: kvm, avi, mtosatti, linux-kernel, mingo, acme

On Mon, Nov 07, 2011 at 05:01:11PM +0100, Peter Zijlstra wrote:
> On Mon, 2011-11-07 at 17:53 +0200, Gleb Natapov wrote:
> > I removed branch-miss-retired here because, for a perf user, it exists.
> > Perf approximates it with another event, but the perf user shouldn't know
> > that. A guest does not always run with exactly the same cpu model number
> > as the host, so if we do not drop the bit here, a guest that sees a cpu
> > model other than the one with this quirk will not be able to use the event.
> 
> Right, so what model number do you expose? Anyway I don't really mind
Depends on what management wants. You can specify -cpu Nehalem or -cpu
Conroe or even override model manually by doing -cpu host,model=15.

> masking the thing as long as we grow an ebx iteration,
> 
> > BTW, why does perf not check the event mask to see whether an
> > architectural event is available before programming it?
> 
> No idea why not.. just one of those things nobody noticed/got around to
> doing or so.

--
			Gleb.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCHv2 2/9] KVM: Expose a version 2 architectural PMU to a guests
  2011-11-07 15:25             ` Avi Kivity
@ 2011-11-07 16:22               ` Peter Zijlstra
  2011-11-07 16:26                 ` Gleb Natapov
  0 siblings, 1 reply; 42+ messages in thread
From: Peter Zijlstra @ 2011-11-07 16:22 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Gleb Natapov, kvm, mtosatti, linux-kernel, mingo, acme, hpa

On Mon, 2011-11-07 at 17:25 +0200, Avi Kivity wrote:
> On 11/07/2011 05:19 PM, Gleb Natapov wrote:
> > > 
> > > note, this needs a fairly huge PMI skew to happen.
> > > 
> > No, it need not. It is enough for the exit reason to be hlt instead of
> > nmi for a vcpu to go to the blocking state instead of reentering guest
> > mode. Note that we do not check request flags in kvm_arch_vcpu_runnable().
> 
> Right.
> 
> If we had a guarantee about the maximum skew, we could add a check for
> KVM_REQ_PMI in kvm_vcpu_block().

Right, it shouldn't be more than a few instructions since it's NMIs we're
talking about, but I'm not sure there are any really hard guarantees;
the hardware folks would be able to say more.

Typically you can assert NMIs at instruction boundaries, but things like
instruction fusing and a few 'special' insns can delay NMI delivery. Then
again, I'm clueless as to the actual implementation details of any of
this stuff.

Also I'm not sure if there are any non-deterministic delays in the PMU
event -> PMU overflow -> PMI raise path that could push you out into
silly land.

So yeah, I think the proposed code is fine, although I think the comment
can be improved by mentioning the vcpu hlt case.



^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCHv2 6/9] perf: expose perf capability to other modules.
  2011-11-07 16:22         ` Gleb Natapov
@ 2011-11-07 16:25           ` Peter Zijlstra
  0 siblings, 0 replies; 42+ messages in thread
From: Peter Zijlstra @ 2011-11-07 16:25 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: kvm, avi, mtosatti, linux-kernel, mingo, acme

On Mon, 2011-11-07 at 18:22 +0200, Gleb Natapov wrote:
> > Right, so what model number do you expose? 

> Depends on what management wants. You can specify -cpu Nehalem or -cpu
> Conroe or even override model manually by doing -cpu host,model=15.

Oh cute ;-)

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCHv2 2/9] KVM: Expose a version 2 architectural PMU to a guests
  2011-11-07 16:22               ` Peter Zijlstra
@ 2011-11-07 16:26                 ` Gleb Natapov
  0 siblings, 0 replies; 42+ messages in thread
From: Gleb Natapov @ 2011-11-07 16:26 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Avi Kivity, kvm, mtosatti, linux-kernel, mingo, acme, hpa

On Mon, Nov 07, 2011 at 05:22:21PM +0100, Peter Zijlstra wrote:
> On Mon, 2011-11-07 at 17:25 +0200, Avi Kivity wrote:
> > On 11/07/2011 05:19 PM, Gleb Natapov wrote:
> > > > 
> > > > note, this needs a fairly huge PMI skew to happen.
> > > > 
> > > No, it need not. It is enough for the exit reason to be hlt instead of
> > > nmi for a vcpu to go to the blocking state instead of reentering guest
> > > mode. Note that we do not check request flags in kvm_arch_vcpu_runnable().
> > 
> > Right.
> > 
> > If we had a guarantee about the maximum skew, we could add a check for
> > KVM_REQ_PMI in kvm_vcpu_block().
> 
> Right, it shouldn't be more than a few instructions since it's NMIs we're
> talking about, but I'm not sure there are any really hard guarantees;
> the hardware folks would be able to say more.
> 
> Typically you can assert NMIs at instruction boundaries, but things like
> instruction fusing and a few 'special' insns can delay NMI delivery. Then
> again, I'm clueless as to the actual implementation details of any of
> this stuff.
> 
> Also I'm not sure if there are any non-deterministic delays in the PMU
> event -> PMU overflow -> PMI raise path that could push you out into
> silly land.
> 
> So yeah, I think the proposed code is fine, although I think the comment
> can be improved by mentioning the vcpu hlt case.
> 
Will do.

--
			Gleb.

^ permalink raw reply	[flat|nested] 42+ messages in thread
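
[Illustration: the reworded comment might read something like this, now
covering the hlt case:]

	/*
	 * Inject PMI. If the vcpu was in guest mode during the NMI, the
	 * PMI is delivered on the next guest entry. Otherwise the vcpu
	 * may have exited on HLT and gone to sleep without reentering
	 * guest mode; we cannot kick it from NMI context, so queue irq
	 * work to kick it and deliver the PMI from there.
	 */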

* Re: [PATCHv2 2/9] KVM: Expose a version 2 architectural PMU to a guests
  2011-11-07 15:25     ` Gleb Natapov
@ 2011-11-07 16:45       ` Peter Zijlstra
  2011-11-07 17:17         ` Gleb Natapov
  0 siblings, 1 reply; 42+ messages in thread
From: Peter Zijlstra @ 2011-11-07 16:45 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: kvm, avi, mtosatti, linux-kernel, mingo, acme

On Mon, 2011-11-07 at 17:25 +0200, Gleb Natapov wrote:
> > Since the below programming doesn't use perf_event_attr::pinned, yes.
> > 
> > Yes, that is on the todo list :). Actually I do want to place all guest
> > perf counters into the same event group and make it pinned. But currently
> > perf event groups are not very flexible. In our usage scenario we can't
> > have one event as a group leader since events are created and destroyed
> > very dynamically. What I would like is something like a meta event that
> > will group all the other real events. 

Is there a reason to have them grouped if you pin them all anyway?

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCHv2 2/9] KVM: Expose a version 2 architectural PMU to a guests
  2011-11-07 16:45       ` Peter Zijlstra
@ 2011-11-07 17:17         ` Gleb Natapov
  0 siblings, 0 replies; 42+ messages in thread
From: Gleb Natapov @ 2011-11-07 17:17 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: kvm, avi, mtosatti, linux-kernel, mingo, acme

On Mon, Nov 07, 2011 at 05:45:14PM +0100, Peter Zijlstra wrote:
> On Mon, 2011-11-07 at 17:25 +0200, Gleb Natapov wrote:
> > > Since the below programming doesn't use perf_event_attr::pinned, yes.
> > > 
> > Yes, that is on the todo list :). Actually I do want to place all guest
> > perf counters into the same event group and make it pinned. But currently
> > perf event groups are not very flexible. In our usage scenario we can't
> > have one event as a group leader since events are created and destroyed
> > very dynamically. What I would like is something like a meta event that
> > will group all the other real events. 
> 
> Is there a reason to have them grouped if you pin them all anyway?
Hmm, good question. Maybe we shouldn't pin them, since that would prevent
profiling the vcpu task on the host while perf is running in the guest,
but then we need to group guest-created events to get meaningful results.
I think I'll pin them for now and will look into grouping them later.

--
			Gleb.

^ permalink raw reply	[flat|nested] 42+ messages in thread
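
[Illustration: a pinned variant of the attribute setup in
reprogram_counter(). Note that a pinned event that cannot get onto the
PMU goes into error state instead of being time-multiplexed, so
read_pmc() would then need no scaling, but creation failures would still
have to be surfaced to the guest somehow.]

	struct perf_event_attr attr = {
		.type		= type,
		.size		= sizeof(attr),
		.pinned		= 1,	/* keep the counter on the PMU; no rotation */
		.exclude_idle	= true,
		.exclude_host	= 1,
		.exclude_user	= exclude_user,
		.exclude_kernel	= exclude_kernel,
		.sample_period	= (-pmc->counter) & pmc_bitmask(pmc),
		.config		= config,
	};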

* Re: [PATCHv2 6/9] perf: expose perf capability to other modules.
  2011-11-07 14:07   ` Peter Zijlstra
  2011-11-07 15:53     ` Gleb Natapov
@ 2011-11-08 12:49     ` Gleb Natapov
  2011-11-08 13:26       ` Peter Zijlstra
  1 sibling, 1 reply; 42+ messages in thread
From: Gleb Natapov @ 2011-11-08 12:49 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: kvm, avi, mtosatti, linux-kernel, mingo, acme

On Mon, Nov 07, 2011 at 03:07:50PM +0100, Peter Zijlstra wrote:
> On Thu, 2011-11-03 at 14:33 +0200, Gleb Natapov wrote:
> > @@ -1580,6 +1580,8 @@ __init int intel_pmu_init(void)
> >         x86_pmu.num_counters            = eax.split.num_counters;
> >         x86_pmu.cntval_bits             = eax.split.bit_width;
> >         x86_pmu.cntval_mask             = (1ULL << eax.split.bit_width) - 1;
> > +       x86_pmu.events_mask             = ebx;
> > +       x86_pmu.events_mask_len         = eax.split.mask_length;
> >  
> >         /*
> >          * Quirk: v2 perfmon does not report fixed-purpose events, so
> > @@ -1651,6 +1653,7 @@ __init int intel_pmu_init(void)
> >                          * architectural event which is often completely bogus:
> >                          */
> >                         intel_perfmon_event_map[PERF_COUNT_HW_BRANCH_MISSES] = 0x7f89;
> > +                       x86_pmu.events_mask &= ~0x40;
> >  
> >                         pr_cont("erratum AAJ80 worked around, ");
> >                 } 
> 
> It might make sense to introduce cpuid10_ebx or so, also I think the
cpuid10_ebx will have only one field though (event_mask).

> At the very least add a full ebx iteration to disable unsupported events
> in the intel-v1 case.
I do not understand what you mean here; cpuid10_ebx was introduced by the
intel v1 architectural PMU, so it should already contain correct information.

--
			Gleb.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCHv2 6/9] perf: expose perf capability to other modules.
  2011-11-08 12:49     ` Gleb Natapov
@ 2011-11-08 13:26       ` Peter Zijlstra
  2011-11-08 13:54         ` Gleb Natapov
  0 siblings, 1 reply; 42+ messages in thread
From: Peter Zijlstra @ 2011-11-08 13:26 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: kvm, avi, mtosatti, linux-kernel, mingo, acme

On Tue, 2011-11-08 at 14:49 +0200, Gleb Natapov wrote:
> > It might make sense to introduce cpuid10_ebx or so, also I think the
> cpuid10_ebx will have only one field though (event_mask).
> 
> > At the very least add a full ebx iteration to disable unsupported events
> > in the intel-v1 case.
> I do not understand what you mean here; cpuid10_ebx was introduced by the
> intel v1 architectural PMU, so it should already contain correct information.

I meant something like the below

---
 arch/x86/include/asm/perf_event.h      |   13 +++++++++++++
 arch/x86/kernel/cpu/perf_event.c       |    3 +++
 arch/x86/kernel/cpu/perf_event_intel.c |   21 ++++++++++++++++++---
 3 files changed, 34 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index f61c62f..98e397a 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -72,6 +72,19 @@ union cpuid10_eax {
 	unsigned int full;
 };
 
+union cpuid10_ebx {
+	struct {
+		unsigned int unhalted_core_cycles:1;
+		unsigned int instructions_retired:1;
+		unsigned int unhalted_reference_cycles:1;
+		unsigned int llc_reference:1;
+		unsigned int llc_misses:1;
+		unsigned int branch_instruction_retired:1;
+		unsigned int branch_misses_retired:1;
+	} split;
+	unsigned int full;
+};
+
 union cpuid10_edx {
 	struct {
 		unsigned int num_counters_fixed:5;
diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 6408910..e4fdb9d 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -336,6 +336,9 @@ int x86_setup_perfctr(struct perf_event *event)
 	if (config == -1LL)
 		return -EINVAL;
 
+	if (config == -2LL)
+		return -EOPNOTSUPP;
+
 	/*
 	 * Branch tracing:
 	 */
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index e09ca20..aaaed9a 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -1547,9 +1547,9 @@ static void intel_clovertown_quirks(void)
 __init int intel_pmu_init(void)
 {
 	union cpuid10_edx edx;
+	union cpuid10_ebx ebx;
 	union cpuid10_eax eax;
 	unsigned int unused;
-	unsigned int ebx;
 	int version;
 
 	if (!cpu_has(&boot_cpu_data, X86_FEATURE_ARCH_PERFMON)) {
@@ -1566,7 +1566,7 @@ __init int intel_pmu_init(void)
 	 * Check whether the Architectural PerfMon supports
 	 * Branch Misses Retired hw_event or not.
 	 */
-	cpuid(10, &eax.full, &ebx, &unused, &edx.full);
+	cpuid(10, &eax.full, &ebx.full, &unused, &edx.full);
 	if (eax.split.mask_length <= ARCH_PERFMON_BRANCH_MISSES_RETIRED)
 		return -ENODEV;
 
@@ -1598,6 +1598,21 @@ __init int intel_pmu_init(void)
 		x86_pmu.intel_cap.capabilities = capabilities;
 	}
 
+	if (!ebx.split.unhalted_core_cycles)
+		intel_perfmon_event_map[PERF_COUNT_HW_CPU_CYCLES] = -2;
+	if (!ebx.split.instructions_retired)
+		intel_perfmon_event_map[PERF_COUNT_HW_INSTRUCTIONS] = -2;
+	if (!ebx.split.unhalted_reference_cycles)
+		intel_perfmon_event_map[PERF_COUNT_HW_BUS_CYCLES] = -2;
+	if (!ebx.split.llc_reference)
+		intel_perfmon_event_map[PERF_COUNT_HW_CACHE_REFERENCES] = -2;
+	if (!ebx.split.llc_misses)
+		intel_perfmon_event_map[PERF_COUNT_HW_CACHE_MISSES] = -2;
+	if (!ebx.split.branch_instruction_retired)
+		intel_perfmon_event_map[PERF_COUNT_HW_BRANCH_INSTRUCTIONS] = -2;
+	if (!ebx.split.branch_misses_retired)
+		intel_perfmon_event_map[PERF_COUNT_HW_BRANCH_MISSES] = -2;
+
 	intel_ds_init();
 
 	/*
@@ -1643,7 +1658,7 @@ __init int intel_pmu_init(void)
 		/* UOPS_EXECUTED.CORE_ACTIVE_CYCLES,c=1,i=1 */
 		intel_perfmon_event_map[PERF_COUNT_HW_STALLED_CYCLES_BACKEND] = 0x1803fb1;
 
-		if (ebx & 0x40) {
+		if (ebx.split.branch_misses_retired) {
 			/*
 			 * Erratum AAJ80 detected, we work it around by using
 			 * the BR_MISP_EXEC.ANY event. This will over-count


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [PATCHv2 6/9] perf: expose perf capability to other modules.
  2011-11-08 13:26       ` Peter Zijlstra
@ 2011-11-08 13:54         ` Gleb Natapov
  2011-11-08 14:12           ` Peter Zijlstra
  0 siblings, 1 reply; 42+ messages in thread
From: Gleb Natapov @ 2011-11-08 13:54 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: kvm, avi, mtosatti, linux-kernel, mingo, acme

On Tue, Nov 08, 2011 at 02:26:51PM +0100, Peter Zijlstra wrote:
> On Tue, 2011-11-08 at 14:49 +0200, Gleb Natapov wrote:
> > > It might make sense to introduce cpuid10_ebx or so, also I think the
> > cpuid10_ebx will have only one field though (event_mask).
> > 
> > > At the very least add a full ebx iteration to disable unsupported events
> > > in the intel-v1 case.
> > I do not understand what you mean here; cpuid10_ebx was introduced by the
> > intel v1 architectural PMU, so it should already contain correct information.
> 
> I meant something like the below
> 
Isn't it better to introduce a mapping between ebx bits and architectural
events and do a for_each_set_bit loop? But I wouldn't want to introduce a
patch like the one below as part of this series. I do not want to introduce
incidental regressions. For instance, the patch below would introduce a
regression on my Nehalem cpu. It reports the value 0x44 in cpuid10.ebx,
which means that unhalted_reference_cycles is not available (a set bit
means the event is not available), but the event still works! Actually it
is listed as supported by the cpu in Table A-4 of SDM 3B. Go figure.

> ---
>  arch/x86/include/asm/perf_event.h      |   13 +++++++++++++
>  arch/x86/kernel/cpu/perf_event.c       |    3 +++
>  arch/x86/kernel/cpu/perf_event_intel.c |   21 ++++++++++++++++++---
>  3 files changed, 34 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
> index f61c62f..98e397a 100644
> --- a/arch/x86/include/asm/perf_event.h
> +++ b/arch/x86/include/asm/perf_event.h
> @@ -72,6 +72,19 @@ union cpuid10_eax {
>  	unsigned int full;
>  };
>  
> +union cpuid10_ebx {
> +	struct {
> +		unsigned int unhalted_core_cycles:1;
> +		unsigned int instructions_retired:1;
> +		unsigned int unhalted_reference_cycles:1;
> +		unsigned int llc_reference:1;
> +		unsigned int llc_misses:1;
> +		unsigned int branch_instruction_retired:1;
> +		unsigned int branch_misses_retired:1;
> +	} split;
> +	unsigned int full;
> +};
> +
>  union cpuid10_edx {
>  	struct {
>  		unsigned int num_counters_fixed:5;
> diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
> index 6408910..e4fdb9d 100644
> --- a/arch/x86/kernel/cpu/perf_event.c
> +++ b/arch/x86/kernel/cpu/perf_event.c
> @@ -336,6 +336,9 @@ int x86_setup_perfctr(struct perf_event *event)
>  	if (config == -1LL)
>  		return -EINVAL;
>  
> +	if (config == -2LL)
> +		return -EOPNOTSUPP;
> +
>  	/*
>  	 * Branch tracing:
>  	 */
> diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
> index e09ca20..aaaed9a 100644
> --- a/arch/x86/kernel/cpu/perf_event_intel.c
> +++ b/arch/x86/kernel/cpu/perf_event_intel.c
> @@ -1547,9 +1547,9 @@ static void intel_clovertown_quirks(void)
>  __init int intel_pmu_init(void)
>  {
>  	union cpuid10_edx edx;
> +	union cpuid10_ebx ebx;
>  	union cpuid10_eax eax;
>  	unsigned int unused;
> -	unsigned int ebx;
>  	int version;
>  
>  	if (!cpu_has(&boot_cpu_data, X86_FEATURE_ARCH_PERFMON)) {
> @@ -1566,7 +1566,7 @@ __init int intel_pmu_init(void)
>  	 * Check whether the Architectural PerfMon supports
>  	 * Branch Misses Retired hw_event or not.
>  	 */
> -	cpuid(10, &eax.full, &ebx, &unused, &edx.full);
> +	cpuid(10, &eax.full, &ebx.full, &unused, &edx.full);
>  	if (eax.split.mask_length <= ARCH_PERFMON_BRANCH_MISSES_RETIRED)
>  		return -ENODEV;
>  
> @@ -1598,6 +1598,21 @@ __init int intel_pmu_init(void)
>  		x86_pmu.intel_cap.capabilities = capabilities;
>  	}
>  
> +	if (!ebx.split.unhalted_core_cycles)
0 means the event is available; 1 means it is not.

> +		intel_perfmon_event_map[PERF_COUNT_HW_CPU_CYCLES] = -2;
> +	if (!ebx.split.instructions_retired)
> +		intel_perfmon_event_map[PERF_COUNT_HW_INSTRUCTIONS] = -2;
> +	if (!ebx.split.unhalted_reference_cycles)
> +		intel_perfmon_event_map[PERF_COUNT_HW_BUS_CYCLES] = -2;
> +	if (!ebx.split.llc_reference)
> +		intel_perfmon_event_map[PERF_COUNT_HW_CACHE_REFERENCES] = -2;
> +	if (!ebx.split.llc_misses)
> +		intel_perfmon_event_map[PERF_COUNT_HW_CACHE_MISSES] = -2;
> +	if (!ebx.split.branch_instruction_retired)
> +		intel_perfmon_event_map[PERF_COUNT_HW_BRANCH_INSTRUCTIONS] = -2;
> +	if (!ebx.split.branch_misses_retired)
> +		intel_perfmon_event_map[PERF_COUNT_HW_BRANCH_MISSES] = -2;
> +
>  	intel_ds_init();
>  
>  	/*
> @@ -1643,7 +1658,7 @@ __init int intel_pmu_init(void)
>  		/* UOPS_EXECUTED.CORE_ACTIVE_CYCLES,c=1,i=1 */
>  		intel_perfmon_event_map[PERF_COUNT_HW_STALLED_CYCLES_BACKEND] = 0x1803fb1;
>  
> -		if (ebx & 0x40) {
> +		if (ebx.split.branch_misses_retired) {
>  			/*
>  			 * Erratum AAJ80 detected, we work it around by using
>  			 * the BR_MISP_EXEC.ANY event. This will over-count

--
			Gleb.

^ permalink raw reply	[flat|nested] 42+ messages in thread
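
[Illustration: the mapping Gleb suggests might look roughly like this;
the table name is hypothetical, the bit order follows the cpuid10_ebx
layout in Peter's patch, and a set bit in ebx means the event is not
available.]

	static const unsigned int intel_arch_event_for_ebx_bit[] = {
		PERF_COUNT_HW_CPU_CYCLES,		/* bit 0 */
		PERF_COUNT_HW_INSTRUCTIONS,		/* bit 1 */
		PERF_COUNT_HW_BUS_CYCLES,		/* bit 2 */
		PERF_COUNT_HW_CACHE_REFERENCES,		/* bit 3 */
		PERF_COUNT_HW_CACHE_MISSES,		/* bit 4 */
		PERF_COUNT_HW_BRANCH_INSTRUCTIONS,	/* bit 5 */
		PERF_COUNT_HW_BRANCH_MISSES,		/* bit 6 */
	};

	static void intel_disable_unavailable_events(unsigned int ebx_full)
	{
		unsigned long bits = ebx_full;
		int bit;

		for_each_set_bit(bit, &bits,
				 ARRAY_SIZE(intel_arch_event_for_ebx_bit))
			intel_perfmon_event_map[intel_arch_event_for_ebx_bit[bit]] = -2;
	}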

* Re: [PATCHv2 6/9] perf: expose perf capability to other modules.
  2011-11-08 13:54         ` Gleb Natapov
@ 2011-11-08 14:12           ` Peter Zijlstra
  2011-11-08 14:18             ` Gleb Natapov
  2011-11-10 11:56             ` Gleb Natapov
  0 siblings, 2 replies; 42+ messages in thread
From: Peter Zijlstra @ 2011-11-08 14:12 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: kvm, avi, mtosatti, linux-kernel, mingo, acme

On Tue, 2011-11-08 at 15:54 +0200, Gleb Natapov wrote:
> Isn't it better to introduce a mapping between ebx bits and architectural
> events and do a for_each_set_bit loop?

Probably, but I only thought of that halfway through ;-)

>  But I wouldn't want to introduce a
> patch like the one below as part of this series. 

Well, since you're actually going to frob cpuid10.ebx bits we had better
deal with it properly.

> I do not want to introduce
> incidental regressions. For instance, the patch below would introduce a
> regression on my Nehalem cpu. It reports the value 0x44 in cpuid10.ebx,
> which means that unhalted_reference_cycles is not available (a set bit
> means the event is not available), but the event still works! Actually
> it is listed as supported by the cpu in Table A-4 of SDM 3B. Go figure. 

We'd better figure out why your machine says that. It could be we need
another quirk for the nehalem machines, it could be your BIOS is smoking
crack and there's nothing we can do about it.



^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCHv2 6/9] perf: expose perf capability to other modules.
  2011-11-08 14:12           ` Peter Zijlstra
@ 2011-11-08 14:18             ` Gleb Natapov
  2011-11-08 14:31               ` Peter Zijlstra
  2011-11-10 11:56             ` Gleb Natapov
  1 sibling, 1 reply; 42+ messages in thread
From: Gleb Natapov @ 2011-11-08 14:18 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: kvm, avi, mtosatti, linux-kernel, mingo, acme

On Tue, Nov 08, 2011 at 03:12:27PM +0100, Peter Zijlstra wrote:
> On Tue, 2011-11-08 at 15:54 +0200, Gleb Natapov wrote:
> > Isn't it better to introduce a mapping between ebx bits and architectural
> > events and do a for_each_set_bit loop?
> 
> Probably, but I only thought of that halfway through ;-)
> 
> >  But I wouldn't want to introduce a
> > patch like the one below as part of this series. 
> 
> Well, since you're actually going to frob cpuid10.ebx bits we had better
> deal with it properly.
OK. We need to figure out what the proper way is, though. How about I
introduce cpuid10_ebx with an event_mask bitmask and a mapping array, but
without disabling events for now?

> 
> > I do not want to introduce
> > incidental regressions. For instance, the patch below would introduce a
> > regression on my Nehalem cpu. It reports the value 0x44 in cpuid10.ebx,
> > which means that unhalted_reference_cycles is not available (a set bit
> > means the event is not available), but the event still works! Actually
> > it is listed as supported by the cpu in Table A-4 of SDM 3B. Go figure. 
> 
> We'd better figure out why your machine says that. It could be we need
> another quirk for the nehalem machines, it could be your BIOS is smoking
> crack and there's nothing we can do about it.
> 
I asked Intel a week (or so) ago. Waiting for an answer. Will try to
find other machines with a similar cpu to do more checks meanwhile.

--
			Gleb.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCHv2 6/9] perf: expose perf capability to other modules.
  2011-11-08 14:18             ` Gleb Natapov
@ 2011-11-08 14:31               ` Peter Zijlstra
  0 siblings, 0 replies; 42+ messages in thread
From: Peter Zijlstra @ 2011-11-08 14:31 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: kvm, avi, mtosatti, linux-kernel, mingo, acme

On Tue, 2011-11-08 at 16:18 +0200, Gleb Natapov wrote:
> On Tue, Nov 08, 2011 at 03:12:27PM +0100, Peter Zijlstra wrote:
> > On Tue, 2011-11-08 at 15:54 +0200, Gleb Natapov wrote:
> > > Isn't it better to introduce mapping between ebx bits and architectural
> > > events and do for_each_set_bit loop?
> > 
> > Probably, but I only thought of that halfway through ;-)
> > 
> > >  But I wouldn't want to introduce
> > > patch as below as part of this series. 
> > 
> > Well, since you're actually going to frob cpuid10.ebx bits we had better
> > deal with it properly.
> OK. We need to figure out what the proper way is, though. How about I
> introduce cpuid10_ebx with an event_mask bitmask and a mapping array, but
> without disabling events for now?

Ah, only now that you pointed out that those bits are in fact inverted do
I realize you propagate the workaround. So by clearing a bit you say the
event is supported.

OK, maybe we should change the names of the split fields to reflect their
inverted nature, although I'm clean out of ideas atm.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCHv2 6/9] perf: expose perf capability to other modules.
  2011-11-08 14:12           ` Peter Zijlstra
  2011-11-08 14:18             ` Gleb Natapov
@ 2011-11-10 11:56             ` Gleb Natapov
  1 sibling, 0 replies; 42+ messages in thread
From: Gleb Natapov @ 2011-11-10 11:56 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: kvm, avi, mtosatti, linux-kernel, mingo, acme

On Tue, Nov 08, 2011 at 03:12:27PM +0100, Peter Zijlstra wrote:
> > I do not want to introduce
> > incidental regressions. For instance, the patch below would introduce a
> > regression on my Nehalem cpu. It reports the value 0x44 in cpuid10.ebx,
> > which means that unhalted_reference_cycles is not available (a set bit
> > means the event is not available), but the event still works! Actually
> > it is listed as supported by the cpu in Table A-4 of SDM 3B. Go figure. 
> 
> We'd better figure out why your machine says that. It could be we need
> another quirk for the nehalem machines, it could be your BIOS is smoking
> crack and there's nothing we can do about it.
> 
Looks like on nehalem the 013c event does not count unhalted reference
cycles as the spec says it should, but instead increments at 133MHz.
So disabling it should be the right thing to do.

--
			Gleb.

^ permalink raw reply	[flat|nested] 42+ messages in thread

end of thread, other threads:[~2011-11-10 11:56 UTC | newest]

Thread overview: 42+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-11-03 12:33 [PATCHv2 0/9] KVM in-guest performance monitoring Gleb Natapov
2011-11-03 12:33 ` [PATCHv2 1/9] KVM: Expose kvm_lapic_local_deliver() Gleb Natapov
2011-11-03 12:33 ` [PATCHv2 2/9] KVM: Expose a version 2 architectural PMU to a guests Gleb Natapov
2011-11-07 14:22   ` Peter Zijlstra
2011-11-07 15:34     ` Gleb Natapov
2011-11-07 15:40       ` Avi Kivity
2011-11-07 14:34   ` Peter Zijlstra
2011-11-07 14:46     ` Avi Kivity
2011-11-07 14:59       ` Peter Zijlstra
2011-11-07 15:11         ` Gleb Natapov
2011-11-07 15:13         ` Avi Kivity
2011-11-07 15:19           ` Gleb Natapov
2011-11-07 15:25             ` Avi Kivity
2011-11-07 16:22               ` Peter Zijlstra
2011-11-07 16:26                 ` Gleb Natapov
2011-11-07 14:36   ` Peter Zijlstra
2011-11-07 15:25     ` Gleb Natapov
2011-11-07 16:45       ` Peter Zijlstra
2011-11-07 17:17         ` Gleb Natapov
2011-11-03 12:33 ` [PATCHv2 3/9] KVM: Add generic RDPMC support Gleb Natapov
2011-11-03 12:33 ` [PATCHv2 4/9] KVM: SVM: Intercept RDPMC Gleb Natapov
2011-11-03 12:33 ` [PATCHv2 5/9] KVM: VMX: " Gleb Natapov
2011-11-03 12:33 ` [PATCHv2 6/9] perf: expose perf capability to other modules Gleb Natapov
2011-11-07 14:07   ` Peter Zijlstra
2011-11-07 15:53     ` Gleb Natapov
2011-11-07 16:01       ` Peter Zijlstra
2011-11-07 16:22         ` Gleb Natapov
2011-11-07 16:25           ` Peter Zijlstra
2011-11-08 12:49     ` Gleb Natapov
2011-11-08 13:26       ` Peter Zijlstra
2011-11-08 13:54         ` Gleb Natapov
2011-11-08 14:12           ` Peter Zijlstra
2011-11-08 14:18             ` Gleb Natapov
2011-11-08 14:31               ` Peter Zijlstra
2011-11-10 11:56             ` Gleb Natapov
2011-11-03 12:33 ` [PATCHv2 7/9] KVM: Expose the architectural performance monitoring CPUID leaf Gleb Natapov
2011-11-07 14:09   ` Peter Zijlstra
2011-11-07 15:41     ` Gleb Natapov
2011-11-07 15:45       ` Peter Zijlstra
2011-11-07 15:54         ` Gleb Natapov
2011-11-03 12:33 ` [PATCHv2 8/9] KVM: x86 emulator: fix RDPMC privilege check Gleb Natapov
2011-11-03 12:33 ` [PATCHv2 9/9] KVM: x86 emulator: implement RDPMC (0F 33) Gleb Natapov
