* Implement PEBS virtualization for Silvermont
@ 2014-05-30  1:12 Andi Kleen
  2014-05-30  1:12 ` [PATCH 1/4] perf: Add PEBS virtualization enable " Andi Kleen
                   ` (4 more replies)
  0 siblings, 5 replies; 29+ messages in thread
From: Andi Kleen @ 2014-05-30  1:12 UTC (permalink / raw)
  To: peterz; +Cc: gleb, pbonzini, eranian, kvm, linux-kernel

PEBS is very useful (e.g. enabling the more precise cycles:pp event or
memory profiling). Unfortunately it did not work under virtualization,
which is becoming more and more common.

This patch kit implements simple PEBS virtualization for KVM on Silvermont
CPUs. Silvermont does not have the leak problems that prevented successful
PEBS virtualization earlier.

It needs some (simple) modifications to the host perf code, in addition to
a PEBS device model in KVM. The guest does not need any modifications.

It also requires running with -cpu host. This may in turn cause
some other problems with the guest perf (due to writes to various missing
MSRs), but these can be addressed separately.

For more details please see the description of the individual patches.

Available in
git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-misc.git perf/kvm-pebs-slm

-Andi

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH 1/4] perf: Add PEBS virtualization enable for Silvermont
  2014-05-30  1:12 Implement PEBS virtualization for Silvermont Andi Kleen
@ 2014-05-30  1:12 ` Andi Kleen
  2014-05-30  1:12 ` [PATCH 2/4] perf: Allow guest PEBS for KVM owned counters Andi Kleen
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 29+ messages in thread
From: Andi Kleen @ 2014-05-30  1:12 UTC (permalink / raw)
  To: peterz; +Cc: gleb, pbonzini, eranian, kvm, linux-kernel, Andi Kleen

From: Andi Kleen <ak@linux.intel.com>

To avoid various problems (like leaking counters), PEBS
virtualization needs whitelisting per CPU model. Add state to the
x86_pmu for this and enable it for Silvermont.

Silvermont is currently the only CPU where it is safe
to virtualize PEBS, as it does not leak PEBS events
through exits (as long as the exit MSR list disables
the counter with PEBS_ENABLE).

Silvermont is also relatively simple to handle,
as it only has one PEBS counter.

Also export the information to (modular) KVM.

It is used in the follow-on patches.
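
As an illustration only (a hedged sketch; the kvm_pebs_example_* names
are made up here, the real consumers are the KVM patches later in this
series), modular KVM code can query the new exports roughly like this:

	/* Hypothetical sketch; only the three perf_* helpers are real. */
	static bool kvm_pebs_example_supported(void)
	{
		/* true only on whitelisted CPUs (currently Silvermont) */
		return perf_pebs_virtualization();
	}

	static void kvm_pebs_example_host_state(u64 *ds_area, u64 *pebs_enable)
	{
		/* host values to restore on VM exit, for the current CPU */
		*ds_area = perf_get_ds_area();
		*pebs_enable = perf_get_pebs_enable();
	}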

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/include/asm/perf_event.h         |  6 ++++++
 arch/x86/kernel/cpu/perf_event.h          |  1 +
 arch/x86/kernel/cpu/perf_event_intel.c    |  1 +
 arch/x86/kernel/cpu/perf_event_intel_ds.c | 20 ++++++++++++++++++++
 4 files changed, 28 insertions(+)

diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index 8249df4..c49c7d3 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -250,6 +250,9 @@ struct perf_guest_switch_msr {
 extern struct perf_guest_switch_msr *perf_guest_get_msrs(int *nr);
 extern void perf_get_x86_pmu_capability(struct x86_pmu_capability *cap);
 extern void perf_check_microcode(void);
+extern unsigned long long perf_get_ds_area(void);
+extern unsigned long long perf_get_pebs_enable(void);
+extern bool perf_pebs_virtualization(void);
 #else
 static inline struct perf_guest_switch_msr *perf_guest_get_msrs(int *nr)
 {
@@ -264,6 +267,9 @@ static inline void perf_get_x86_pmu_capability(struct x86_pmu_capability *cap)
 
 static inline void perf_events_lapic_init(void)	{ }
 static inline void perf_check_microcode(void) { }
+static inline unsigned long long perf_get_ds_area(void) { return 0; }
+static inline unsigned long long perf_get_pebs_enable(void) { return 0; }
+static inline bool perf_pebs_virtualization(void) { return false; }
 #endif
 
 #if defined(CONFIG_PERF_EVENTS) && defined(CONFIG_CPU_SUP_AMD)
diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index 3b2f9bd..6ab8fdd 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -449,6 +449,7 @@ struct x86_pmu {
 	struct event_constraint *pebs_constraints;
 	void		(*pebs_aliases)(struct perf_event *event);
 	int 		max_pebs_events;
+	bool		pebs_virtualization;
 
 	/*
 	 * Intel LBR
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index aa333d9..86ccb81 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -2399,6 +2399,7 @@ __init int intel_pmu_init(void)
 		x86_pmu.pebs_constraints = intel_slm_pebs_event_constraints;
 		x86_pmu.extra_regs = intel_slm_extra_regs;
 		x86_pmu.er_flags |= ERF_HAS_RSP_1;
+		x86_pmu.pebs_virtualization = true;
 		pr_cont("Silvermont events, ");
 		break;
 
diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
index ae96cfa..29622a7 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -429,6 +429,26 @@ void reserve_ds_buffers(void)
 	put_online_cpus();
 }
 
+unsigned long long perf_get_ds_area(void)
+{
+	return (u64)__get_cpu_var(cpu_hw_events).ds;
+}
+EXPORT_SYMBOL_GPL(perf_get_ds_area);
+
+unsigned long long perf_get_pebs_enable(void)
+{
+	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
+
+	return cpuc->pebs_enabled;
+}
+EXPORT_SYMBOL_GPL(perf_get_pebs_enable);
+
+bool perf_pebs_virtualization(void)
+{
+	return x86_pmu.pebs_virtualization;
+}
+EXPORT_SYMBOL_GPL(perf_pebs_virtualization);
+
 /*
  * BTS
  */
-- 
1.9.0


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 2/4] perf: Allow guest PEBS for KVM owned counters
  2014-05-30  1:12 Implement PEBS virtualization for Silvermont Andi Kleen
  2014-05-30  1:12 ` [PATCH 1/4] perf: Add PEBS virtualization enable " Andi Kleen
@ 2014-05-30  1:12 ` Andi Kleen
  2014-05-30  7:31   ` Peter Zijlstra
  2014-05-30  1:12 ` [PATCH 3/4] perf: Handle guest PEBS events with a fake event Andi Kleen
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 29+ messages in thread
From: Andi Kleen @ 2014-05-30  1:12 UTC (permalink / raw)
  To: peterz; +Cc: gleb, pbonzini, eranian, kvm, linux-kernel, Andi Kleen

From: Andi Kleen <ak@linux.intel.com>

Currently perf unconditionally disables PEBS for guests.

Now that we have the infrastructure in place to handle
it, we can allow it for KVM-owned guest events. For this,
perf needs to know that an event is owned by a guest.
Add a new state bit in the perf_event for that.

The bit is only set by KVM and cannot be selected
by anyone else.

Then change the MSR entry/exit list to allow
PEBS for these counters.
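
For reference, a hedged sketch of how a KVM-style caller can create a
guest-owned counter through the new entry point (patch 4 converts KVM's
reprogram_counter() to do this; attr and pmc stand in for the caller's
local state, kvm_perf_overflow_intr is KVM's existing overflow callback):

	event = __perf_event_create_kernel_counter(&attr, -1, current,
						   kvm_perf_overflow_intr,
						   pmc, /* guest_owned */ true);
	if (IS_ERR(event))
		return;
	pmc->perf_event = event;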

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/kernel/cpu/perf_event.h       |  1 +
 arch/x86/kernel/cpu/perf_event_intel.c | 14 +++++++++++---
 arch/x86/kvm/pmu.c                     |  1 +
 include/linux/perf_event.h             | 15 ++++++++++++++-
 kernel/events/core.c                   |  7 ++++---
 5 files changed, 31 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index 6ab8fdd..422bca5 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -163,6 +163,7 @@ struct cpu_hw_events {
 	 */
 	u64				intel_ctrl_guest_mask;
 	u64				intel_ctrl_host_mask;
+	u64				intel_ctrl_guest_owned;
 	struct perf_guest_switch_msr	guest_switch_msrs[X86_PMC_IDX_MAX];
 
 	/*
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index 86ccb81..3bcfda0 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -1202,6 +1202,7 @@ static void intel_pmu_disable_event(struct perf_event *event)
 
 	cpuc->intel_ctrl_guest_mask &= ~(1ull << hwc->idx);
 	cpuc->intel_ctrl_host_mask &= ~(1ull << hwc->idx);
+	cpuc->intel_ctrl_guest_owned &= ~(1ull << hwc->idx);
 	cpuc->intel_cp_status &= ~(1ull << hwc->idx);
 
 	/*
@@ -1274,6 +1275,8 @@ static void intel_pmu_enable_event(struct perf_event *event)
 
 	if (event->attr.exclude_host)
 		cpuc->intel_ctrl_guest_mask |= (1ull << hwc->idx);
+	if (event->guest_owned)
+		cpuc->intel_ctrl_guest_owned |= (1ull << hwc->idx);
 	if (event->attr.exclude_guest)
 		cpuc->intel_ctrl_host_mask |= (1ull << hwc->idx);
 
@@ -1775,18 +1778,23 @@ static struct perf_guest_switch_msr *intel_guest_get_msrs(int *nr)
 {
 	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
 	struct perf_guest_switch_msr *arr = cpuc->guest_switch_msrs;
+	u64 mask;
 
 	arr[0].msr = MSR_CORE_PERF_GLOBAL_CTRL;
 	arr[0].host = x86_pmu.intel_ctrl & ~cpuc->intel_ctrl_guest_mask;
 	arr[0].guest = x86_pmu.intel_ctrl & ~cpuc->intel_ctrl_host_mask;
+
+	arr[1].msr = MSR_IA32_PEBS_ENABLE;
+	arr[1].host = cpuc->pebs_enabled;
 	/*
+	 * For PEBS virtualization only allow guest owned counters.
+	 *
 	 * If PMU counter has PEBS enabled it is not enough to disable counter
 	 * on a guest entry since PEBS memory write can overshoot guest entry
 	 * and corrupt guest memory. Disabling PEBS solves the problem.
 	 */
-	arr[1].msr = MSR_IA32_PEBS_ENABLE;
-	arr[1].host = cpuc->pebs_enabled;
-	arr[1].guest = 0;
+	mask = cpuc->intel_ctrl_guest_owned;
+	arr[1].guest = cpuc->pebs_enabled & (mask | (mask << 32));
 
 	*nr = 2;
 	return arr;
diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index 5c4f631..4c6f417 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -188,6 +188,7 @@ static void reprogram_counter(struct kvm_pmc *pmc, u32 type,
 				PTR_ERR(event));
 		return;
 	}
+	event->guest_owned = true;
 
 	pmc->perf_event = event;
 	clear_bit(pmc->idx, (unsigned long*)&pmc->vcpu->arch.pmu.reprogram_pmi);
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 3356abc..ad2b3f6 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -437,6 +437,8 @@ struct perf_event {
 	int				cgrp_defer_enabled;
 #endif
 
+	bool				guest_owned;	/* Owned by a guest */
+
 #endif /* CONFIG_PERF_EVENTS */
 };
 
@@ -550,11 +552,22 @@ extern int perf_event_refresh(struct perf_event *event, int refresh);
 extern void perf_event_update_userpage(struct perf_event *event);
 extern int perf_event_release_kernel(struct perf_event *event);
 extern struct perf_event *
+__perf_event_create_kernel_counter(struct perf_event_attr *attr,
+				int cpu,
+				struct task_struct *task,
+				perf_overflow_handler_t callback,
+				void *context, bool guest_owned);
+static inline struct perf_event *
 perf_event_create_kernel_counter(struct perf_event_attr *attr,
 				int cpu,
 				struct task_struct *task,
 				perf_overflow_handler_t callback,
-				void *context);
+				void *context)
+{
+	return __perf_event_create_kernel_counter(attr, cpu, task, callback,
+						  context, false);
+}
+
 extern void perf_pmu_migrate_context(struct pmu *pmu,
 				int src_cpu, int dst_cpu);
 extern u64 perf_event_read_value(struct perf_event *event,
diff --git a/kernel/events/core.c b/kernel/events/core.c
index f83a71a..3450ba7 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -7249,10 +7249,10 @@ err_fd:
  * @task: task to profile (NULL for percpu)
  */
 struct perf_event *
-perf_event_create_kernel_counter(struct perf_event_attr *attr, int cpu,
+__perf_event_create_kernel_counter(struct perf_event_attr *attr, int cpu,
 				 struct task_struct *task,
 				 perf_overflow_handler_t overflow_handler,
-				 void *context)
+				 void *context, bool guest_owned)
 {
 	struct perf_event_context *ctx;
 	struct perf_event *event;
@@ -7268,6 +7268,7 @@ perf_event_create_kernel_counter(struct perf_event_attr *attr, int cpu,
 		err = PTR_ERR(event);
 		goto err;
 	}
+	event->guest_owned = guest_owned;
 
 	account_event(event);
 
@@ -7290,7 +7291,7 @@ err_free:
 err:
 	return ERR_PTR(err);
 }
-EXPORT_SYMBOL_GPL(perf_event_create_kernel_counter);
+EXPORT_SYMBOL_GPL(__perf_event_create_kernel_counter);
 
 void perf_pmu_migrate_context(struct pmu *pmu, int src_cpu, int dst_cpu)
 {
-- 
1.9.0


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 3/4] perf: Handle guest PEBS events with a fake event
  2014-05-30  1:12 Implement PEBS virtualization for Silvermont Andi Kleen
  2014-05-30  1:12 ` [PATCH 1/4] perf: Add PEBS virtualization enable " Andi Kleen
  2014-05-30  1:12 ` [PATCH 2/4] perf: Allow guest PEBS for KVM owned counters Andi Kleen
@ 2014-05-30  1:12 ` Andi Kleen
  2014-05-30  7:34   ` Peter Zijlstra
  2014-05-30  1:12 ` [PATCH 4/4] kvm: Implement PEBS virtualization Andi Kleen
  2014-05-30  7:39 ` Implement PEBS virtualization for Silvermont Peter Zijlstra
  4 siblings, 1 reply; 29+ messages in thread
From: Andi Kleen @ 2014-05-30  1:12 UTC (permalink / raw)
  To: peterz; +Cc: gleb, pbonzini, eranian, kvm, linux-kernel, Andi Kleen

From: Andi Kleen <ak@linux.intel.com>

With PEBS virtualization the PEBS record gets delivered to the guest,
but the host sees the PMI. This would normally result in a spurious
PEBS PMI that is ignored. But we need to inject the PMI into the guest,
so that the guest PMI handler can handle the PEBS record.

Check for this case in the perf PEBS handler.  When any guest PEBS
counters are active, always check the counters explicitly for
overflow. If a guest PEBS counter overflowed, trigger a fake event. The
fake event results in calling the KVM PMI callback, which injects
the PMI into the guest. The guest handler then retrieves the correct
information from its own PEBS record and the guest state.
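
As background for the overflow check: KVM programs the counter with the
negated sample period, so the counter counts up towards zero and has
overflowed once its sign-extended value is no longer negative. A worked
example of the sign extension (assuming 40 valid counter bits purely for
illustration; the real code uses x86_pmu.cntval_bits):

	int shift = 64 - 40;					/* 24 */
	s64 count;

	/* counter still negative: still counting up, no overflow yet */
	count = ((s64)(0xffffffff00ULL << shift)) >> shift;	/* -256 */

	/* counter wrapped past zero: overflowed, raise the fake event */
	count = ((s64)(0x10ULL << shift)) >> shift;		/* 16 */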

Note: in very rare cases with exotic events this may lead to spurious PMIs
in the guest.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/kernel/cpu/perf_event_intel_ds.c | 49 +++++++++++++++++++++++++++++++
 1 file changed, 49 insertions(+)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
index 29622a7..0267174 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -998,6 +998,53 @@ static void intel_pmu_drain_pebs_core(struct pt_regs *iregs)
 	__intel_pmu_pebs_event(event, iregs, at);
 }
 
+/*
+ * We may be running with virtualized PEBS, so the PEBS record
+ * was logged into the guest's DS and is invisible to us.
+ *
+ * For guest-owned counters we always have to check the counter
+ * and see if they are overflowed, because PEBS thresholds
+ * are not reported in the GLOBAL_STATUS.
+ *
+ * In this case just trigger a fake event for KVM to forward
+ * to the guest as PMI.  The guest will then see the real PEBS
+ * record and read the counter values.
+ *
+ * The contents of the event do not matter.
+ */
+static void intel_pmu_handle_guest_pebs(struct cpu_hw_events *cpuc,
+					struct pt_regs *iregs)
+{
+	int bit;
+	struct perf_event *event;
+
+	if (!cpuc->intel_ctrl_guest_owned)
+		return;
+
+	for_each_set_bit(bit, (unsigned long *)&cpuc->intel_ctrl_guest_owned,
+			 x86_pmu.max_pebs_events) {
+		struct perf_sample_data data;
+		s64 count;
+		int shift;
+
+		event = cpuc->events[bit];
+		if (!event->attr.precise_ip)
+			continue;
+		rdpmcl(event->hw.event_base_rdpmc, count);
+
+		/* sign extend */
+		shift = 64 - x86_pmu.cntval_bits;
+		count = ((s64)((u64)count << shift)) >> shift;
+
+		if (count < 0)
+			continue;
+
+		perf_sample_data_init(&data, 0, event->hw.last_period);
+		if (perf_event_overflow(event, &data, iregs))
+			x86_pmu_stop(event, 0);
+	}
+}
+
 static void intel_pmu_drain_pebs_nhm(struct pt_regs *iregs)
 {
 	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
@@ -1010,6 +1057,8 @@ static void intel_pmu_drain_pebs_nhm(struct pt_regs *iregs)
 	if (!x86_pmu.pebs_active)
 		return;
 
+	intel_pmu_handle_guest_pebs(cpuc, iregs);
+
 	at  = (struct pebs_record_nhm *)(unsigned long)ds->pebs_buffer_base;
 	top = (struct pebs_record_nhm *)(unsigned long)ds->pebs_index;
 
-- 
1.9.0


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 4/4] kvm: Implement PEBS virtualization
  2014-05-30  1:12 Implement PEBS virtualization for Silvermont Andi Kleen
                   ` (2 preceding siblings ...)
  2014-05-30  1:12 ` [PATCH 3/4] perf: Handle guest PEBS events with a fake event Andi Kleen
@ 2014-05-30  1:12 ` Andi Kleen
  2014-05-30  8:21   ` Gleb Natapov
                     ` (3 more replies)
  2014-05-30  7:39 ` Implement PEBS virtualization for Silvermont Peter Zijlstra
  4 siblings, 4 replies; 29+ messages in thread
From: Andi Kleen @ 2014-05-30  1:12 UTC (permalink / raw)
  To: peterz; +Cc: gleb, pbonzini, eranian, kvm, linux-kernel, Andi Kleen

From: Andi Kleen <ak@linux.intel.com>

PEBS (Precise Event Based Sampling) profiling is very powerful,
allowing improved sampling precision and much additional information,
like address or TSX abort profiling. cycles:p and :pp use PEBS.

This patch enables PEBS profiling in KVM guests.

PEBS writes profiling records to a virtual address in memory. Since
the guest controls the virtual address space, the PEBS record
is delivered directly to the guest buffer. We set up the PEBS state
so that it works correctly. The CPU cannot handle any kind of fault
during these guest writes.

To avoid any problems with guest pages being swapped by the host we
pin the pages when the PEBS buffer is set up, by intercepting
that MSR.

Typically profilers only set up a single page, so pinning that is not
a big problem. The pinning is currently limited to 17 pages (a 64K buffer
plus the DS area).
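
With 4K pages the limit works out as follows (a sketch of the arithmetic
only; the EXAMPLE_* names are not part of the patch):

	/* 64K PEBS buffer -> 16 pages of 4K, plus one page for the
	 * debug store area itself, matching MAX_PINNED_PAGES == 17
	 * in the kvm_host.h hunk below.
	 */
	#define EXAMPLE_PEBS_BUF_PAGES	((64 * 1024) / 4096)		/* 16 */
	#define EXAMPLE_MAX_PINNED	(EXAMPLE_PEBS_BUF_PAGES + 1)	/* 17 */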

In theory the guest can change its own page tables after the PEBS
setup. The host has no way to track that with EPT. But if a guest
would do that it could only crash itself. It's not expected
that normal profilers do that.

The patch also adds the basic glue to enable the PEBS CPUIDs
and other PEBS MSRs, and asks perf to enable PEBS as needed.

Due to various limitations it currently only works on Silvermont-based
systems.

This patch doesn't implement the extended MSRs some CPUs support.
For example, latency profiling on SLM will not work at this point.

Timing:

The emulation is somewhat more expensive than a real PMU. This
may trigger the expensive PMI detection in the guest.
Usually this can be disabled with
echo 0 > /proc/sys/kernel/perf_cpu_time_max_percent

Migration:

In theory it should be possible (as long as we migrate to
a host with the same PEBS event and the same PEBS format), but I'm not
sure the basic KVM PMU code supports it correctly: there is no code to
save/restore state, unless I'm missing something. Once the PMU
code grows proper migration support it should be straightforward
to handle the PEBS state too.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/include/asm/kvm_host.h       |   6 ++
 arch/x86/include/uapi/asm/msr-index.h |   4 +
 arch/x86/kvm/cpuid.c                  |  10 +-
 arch/x86/kvm/pmu.c                    | 184 ++++++++++++++++++++++++++++++++--
 arch/x86/kvm/vmx.c                    |   6 ++
 5 files changed, 196 insertions(+), 14 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 7de069af..d87cb66 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -319,6 +319,8 @@ struct kvm_pmc {
 	struct kvm_vcpu *vcpu;
 };
 
+#define MAX_PINNED_PAGES 17 /* 64k buffer + ds */
+
 struct kvm_pmu {
 	unsigned nr_arch_gp_counters;
 	unsigned nr_arch_fixed_counters;
@@ -335,6 +337,10 @@ struct kvm_pmu {
 	struct kvm_pmc fixed_counters[INTEL_PMC_MAX_FIXED];
 	struct irq_work irq_work;
 	u64 reprogram_pmi;
+	u64 pebs_enable;
+	u64 ds_area;
+	struct page *pinned_pages[MAX_PINNED_PAGES];
+	unsigned num_pinned_pages;
 };
 
 enum {
diff --git a/arch/x86/include/uapi/asm/msr-index.h b/arch/x86/include/uapi/asm/msr-index.h
index fcf2b3a..409a582 100644
--- a/arch/x86/include/uapi/asm/msr-index.h
+++ b/arch/x86/include/uapi/asm/msr-index.h
@@ -72,6 +72,10 @@
 #define MSR_IA32_PEBS_ENABLE		0x000003f1
 #define MSR_IA32_DS_AREA		0x00000600
 #define MSR_IA32_PERF_CAPABILITIES	0x00000345
+#define PERF_CAP_PEBS_TRAP		(1U << 6)
+#define PERF_CAP_ARCH_REG		(1U << 7)
+#define PERF_CAP_PEBS_FORMAT		(0xf << 8)
+
 #define MSR_PEBS_LD_LAT_THRESHOLD	0x000003f6
 
 #define MSR_MTRRfix64K_00000		0x00000250
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index f47a104..c8cc76b 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -260,6 +260,10 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
 	unsigned f_rdtscp = kvm_x86_ops->rdtscp_supported() ? F(RDTSCP) : 0;
 	unsigned f_invpcid = kvm_x86_ops->invpcid_supported() ? F(INVPCID) : 0;
 	unsigned f_mpx = kvm_x86_ops->mpx_supported() ? F(MPX) : 0;
+	bool pebs = perf_pebs_virtualization();
+	unsigned f_ds = pebs ? F(DS) : 0;
+	unsigned f_pdcm = pebs ? F(PDCM) : 0;
+	unsigned f_dtes64 = pebs ? F(DTES64) : 0;
 
 	/* cpuid 1.edx */
 	const u32 kvm_supported_word0_x86_features =
@@ -268,7 +272,7 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
 		F(CX8) | F(APIC) | 0 /* Reserved */ | F(SEP) |
 		F(MTRR) | F(PGE) | F(MCA) | F(CMOV) |
 		F(PAT) | F(PSE36) | 0 /* PSN */ | F(CLFLUSH) |
-		0 /* Reserved, DS, ACPI */ | F(MMX) |
+		f_ds /* Reserved, ACPI */ | F(MMX) |
 		F(FXSR) | F(XMM) | F(XMM2) | F(SELFSNOOP) |
 		0 /* HTT, TM, Reserved, PBE */;
 	/* cpuid 0x80000001.edx */
@@ -283,10 +287,10 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
 		0 /* Reserved */ | f_lm | F(3DNOWEXT) | F(3DNOW);
 	/* cpuid 1.ecx */
 	const u32 kvm_supported_word4_x86_features =
-		F(XMM3) | F(PCLMULQDQ) | 0 /* DTES64, MONITOR */ |
+		F(XMM3) | F(PCLMULQDQ) | f_dtes64 /* MONITOR */ |
 		0 /* DS-CPL, VMX, SMX, EST */ |
 		0 /* TM2 */ | F(SSSE3) | 0 /* CNXT-ID */ | 0 /* Reserved */ |
-		F(FMA) | F(CX16) | 0 /* xTPR Update, PDCM */ |
+		F(FMA) | F(CX16) | f_pdcm /* xTPR Update */ |
 		F(PCID) | 0 /* Reserved, DCA */ | F(XMM4_1) |
 		F(XMM4_2) | F(X2APIC) | F(MOVBE) | F(POPCNT) |
 		0 /* Reserved*/ | F(AES) | F(XSAVE) | 0 /* OSXSAVE */ | F(AVX) |
diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index 4c6f417..6362db7 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -15,9 +15,11 @@
 #include <linux/types.h>
 #include <linux/kvm_host.h>
 #include <linux/perf_event.h>
+#include <linux/highmem.h>
 #include "x86.h"
 #include "cpuid.h"
 #include "lapic.h"
+#include "mmu.h"
 
 static struct kvm_arch_event_perf_mapping {
 	u8 eventsel;
@@ -36,9 +38,23 @@ static struct kvm_arch_event_perf_mapping {
 	[7] = { 0x00, 0x30, PERF_COUNT_HW_REF_CPU_CYCLES },
 };
 
+struct debug_store {
+	u64	bts_buffer_base;
+	u64	bts_index;
+	u64	bts_absolute_maximum;
+	u64	bts_interrupt_threshold;
+	u64	pebs_buffer_base;
+	u64	pebs_index;
+	u64	pebs_absolute_maximum;
+	u64	pebs_interrupt_threshold;
+	u64	pebs_event_reset[4];
+};
+
 /* mapping between fixed pmc index and arch_events array */
 int fixed_pmc_events[] = {1, 0, 7};
 
+static u64 host_perf_cap __read_mostly;
+
 static bool pmc_is_gp(struct kvm_pmc *pmc)
 {
 	return pmc->type == KVM_PMC_GP;
@@ -108,7 +124,10 @@ static void kvm_perf_overflow(struct perf_event *perf_event,
 {
 	struct kvm_pmc *pmc = perf_event->overflow_handler_context;
 	struct kvm_pmu *pmu = &pmc->vcpu->arch.pmu;
-	__set_bit(pmc->idx, (unsigned long *)&pmu->global_status);
+	if (perf_event->attr.precise_ip)
+		__set_bit(62, (unsigned long *)&pmu->global_status);
+	else
+		__set_bit(pmc->idx, (unsigned long *)&pmu->global_status);
 }
 
 static void kvm_perf_overflow_intr(struct perf_event *perf_event,
@@ -160,7 +179,7 @@ static void stop_counter(struct kvm_pmc *pmc)
 
 static void reprogram_counter(struct kvm_pmc *pmc, u32 type,
 		unsigned config, bool exclude_user, bool exclude_kernel,
-		bool intr, bool in_tx, bool in_tx_cp)
+		bool intr, bool in_tx, bool in_tx_cp, bool pebs)
 {
 	struct perf_event *event;
 	struct perf_event_attr attr = {
@@ -177,18 +196,20 @@ static void reprogram_counter(struct kvm_pmc *pmc, u32 type,
 		attr.config |= HSW_IN_TX;
 	if (in_tx_cp)
 		attr.config |= HSW_IN_TX_CHECKPOINTED;
+	if (pebs)
+		attr.precise_ip = 1;
 
 	attr.sample_period = (-pmc->counter) & pmc_bitmask(pmc);
 
-	event = perf_event_create_kernel_counter(&attr, -1, current,
-						 intr ? kvm_perf_overflow_intr :
-						 kvm_perf_overflow, pmc);
+	event = __perf_event_create_kernel_counter(&attr, -1, current,
+						 (intr || pebs) ?
+						 kvm_perf_overflow_intr :
+						 kvm_perf_overflow, pmc, true);
 	if (IS_ERR(event)) {
 		printk_once("kvm: pmu event creation failed %ld\n",
 				PTR_ERR(event));
 		return;
 	}
-	event->guest_owned = true;
 
 	pmc->perf_event = event;
 	clear_bit(pmc->idx, (unsigned long*)&pmc->vcpu->arch.pmu.reprogram_pmi);
@@ -211,7 +232,8 @@ static unsigned find_arch_event(struct kvm_pmu *pmu, u8 event_select,
 	return arch_events[i].event_type;
 }
 
-static void reprogram_gp_counter(struct kvm_pmc *pmc, u64 eventsel)
+static void reprogram_gp_counter(struct kvm_pmu *pmu, struct kvm_pmc *pmc,
+				 u64 eventsel)
 {
 	unsigned config, type = PERF_TYPE_RAW;
 	u8 event_select, unit_mask;
@@ -248,7 +270,8 @@ static void reprogram_gp_counter(struct kvm_pmc *pmc, u64 eventsel)
 			!(eventsel & ARCH_PERFMON_EVENTSEL_OS),
 			eventsel & ARCH_PERFMON_EVENTSEL_INT,
 			(eventsel & HSW_IN_TX),
-			(eventsel & HSW_IN_TX_CHECKPOINTED));
+			(eventsel & HSW_IN_TX_CHECKPOINTED),
+			test_bit(pmc->idx, (unsigned long *)&pmu->pebs_enable));
 }
 
 static void reprogram_fixed_counter(struct kvm_pmc *pmc, u8 en_pmi, int idx)
@@ -265,7 +288,7 @@ static void reprogram_fixed_counter(struct kvm_pmc *pmc, u8 en_pmi, int idx)
 			arch_events[fixed_pmc_events[idx]].event_type,
 			!(en & 0x2), /* exclude user */
 			!(en & 0x1), /* exclude kernel */
-			pmi, false, false);
+			pmi, false, false, false);
 }
 
 static inline u8 fixed_en_pmi(u64 ctrl, int idx)
@@ -298,7 +321,7 @@ static void reprogram_idx(struct kvm_pmu *pmu, int idx)
 		return;
 
 	if (pmc_is_gp(pmc))
-		reprogram_gp_counter(pmc, pmc->eventsel);
+		reprogram_gp_counter(pmu, pmc, pmc->eventsel);
 	else {
 		int fidx = idx - INTEL_PMC_IDX_FIXED;
 		reprogram_fixed_counter(pmc,
@@ -323,6 +346,12 @@ bool kvm_pmu_msr(struct kvm_vcpu *vcpu, u32 msr)
 	int ret;
 
 	switch (msr) {
+	case MSR_IA32_DS_AREA:
+	case MSR_IA32_PEBS_ENABLE:
+	case MSR_IA32_PERF_CAPABILITIES:
+		ret = perf_pebs_virtualization() ? 1 : 0;
+		break;
+
 	case MSR_CORE_PERF_FIXED_CTR_CTRL:
 	case MSR_CORE_PERF_GLOBAL_STATUS:
 	case MSR_CORE_PERF_GLOBAL_CTRL:
@@ -356,6 +385,18 @@ int kvm_pmu_get_msr(struct kvm_vcpu *vcpu, u32 index, u64 *data)
 	case MSR_CORE_PERF_GLOBAL_OVF_CTRL:
 		*data = pmu->global_ovf_ctrl;
 		return 0;
+	case MSR_IA32_DS_AREA:
+		*data = pmu->ds_area;
+		return 0;
+	case MSR_IA32_PEBS_ENABLE:
+		*data = pmu->pebs_enable;
+		return 0;
+	case MSR_IA32_PERF_CAPABILITIES:
+		/* Report host PEBS format to guest */
+		*data = host_perf_cap &
+			(PERF_CAP_PEBS_TRAP | PERF_CAP_ARCH_REG |
+			 PERF_CAP_PEBS_FORMAT);
+		return 0;
 	default:
 		if ((pmc = get_gp_pmc(pmu, index, MSR_IA32_PERFCTR0)) ||
 				(pmc = get_fixed_pmc(pmu, index))) {
@@ -369,6 +410,109 @@ int kvm_pmu_get_msr(struct kvm_vcpu *vcpu, u32 index, u64 *data)
 	return 1;
 }
 
+static void kvm_pmu_release_pin(struct kvm_vcpu *vcpu)
+{
+	struct kvm_pmu *pmu = &vcpu->arch.pmu;
+	int i;
+
+	for (i = 0; i < pmu->num_pinned_pages; i++)
+		put_page(pmu->pinned_pages[i]);
+	pmu->num_pinned_pages = 0;
+}
+
+static struct page *get_guest_page(struct kvm_vcpu *vcpu,
+				   unsigned long addr)
+{
+	unsigned long pfn;
+	struct x86_exception exception;
+	gpa_t gpa = vcpu->arch.walk_mmu->gva_to_gpa(vcpu, addr,
+						    PFERR_WRITE_MASK,
+						    &exception);
+
+	if (gpa == UNMAPPED_GVA) {
+		printk_once("Cannot translate guest page %lx\n", addr);
+		return NULL;
+	}
+	pfn = gfn_to_pfn(vcpu->kvm, gpa_to_gfn(gpa));
+	if (is_error_noslot_pfn(pfn)) {
+		printk_once("gfn_to_pfn failed for %llx\n", gpa);
+		return NULL;
+	}
+	return pfn_to_page(pfn);
+}
+
+static int pin_and_copy(struct kvm_vcpu *vcpu,
+			unsigned long addr, void *dst, int len,
+			struct page **p)
+{
+	unsigned long offset = addr & ~PAGE_MASK;
+	void *map;
+
+	*p = get_guest_page(vcpu, addr);
+	if (!*p)
+		return -EIO;
+	map = kmap(*p);
+	memcpy(dst, map + offset, len);
+	kunmap(map);
+	return 0;
+}
+
+/*
+ * Pin the DS area and the PEBS buffer while PEBS is active,
+ * because the CPU cannot tolerate EPT faults for PEBS updates.
+ *
+ * We assume that any guest who changes the DS buffer disables
+ * PEBS first and does not change the page tables during operation.
+ *
+ * When the guest violates these assumptions it may crash itself.
+ * This is expected to not happen with standard profilers.
+ *
+ * No need to clean up anything, as the caller will always eventually
+ * unpin pages.
+ */
+
+static void kvm_pmu_pebs_pin(struct kvm_vcpu *vcpu)
+{
+	struct kvm_pmu *pmu = &vcpu->arch.pmu;
+	struct debug_store ds;
+	int pg;
+	unsigned len;
+	unsigned long offset;
+	unsigned long addr;
+
+	offset = pmu->ds_area & ~PAGE_MASK;
+	len = sizeof(struct debug_store);
+	len = min_t(unsigned, PAGE_SIZE - offset, len);
+	if (pin_and_copy(vcpu, pmu->ds_area, &ds, len,
+			 &pmu->pinned_pages[0]) < 0) {
+		printk_once("Cannot pin ds area %llx\n", pmu->ds_area);
+		return;
+	}
+	pmu->num_pinned_pages++;
+	if (len < sizeof(struct debug_store)) {
+		if (pin_and_copy(vcpu, pmu->ds_area + len, (void *)&ds + len,
+				  sizeof(struct debug_store) - len,
+				  &pmu->pinned_pages[1]) < 0)
+			return;
+		pmu->num_pinned_pages++;
+	}
+
+	pg = pmu->num_pinned_pages;
+	for (addr = ds.pebs_buffer_base;
+	     addr < ds.pebs_absolute_maximum && pg < MAX_PINNED_PAGES;
+	     addr += PAGE_SIZE, pg++) {
+		pmu->pinned_pages[pg] = get_guest_page(vcpu, addr);
+		if (!pmu->pinned_pages[pg]) {
+			printk_once("Cannot pin PEBS buffer %lx (%llx-%llx)\n",
+				 addr,
+				 ds.pebs_buffer_base,
+				 ds.pebs_absolute_maximum);
+			break;
+		}
+	}
+	pmu->num_pinned_pages = pg;
+}
+
 int kvm_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 {
 	struct kvm_pmu *pmu = &vcpu->arch.pmu;
@@ -407,6 +551,20 @@ int kvm_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 			return 0;
 		}
 		break;
+	case MSR_IA32_DS_AREA:
+		pmu->ds_area = data;
+		return 0;
+	case MSR_IA32_PEBS_ENABLE:
+		if (data & ~0xf0000000fULL)
+			break;
+		if (data && data != pmu->pebs_enable) {
+			kvm_pmu_release_pin(vcpu);
+			kvm_pmu_pebs_pin(vcpu);
+		} else if (data == 0 && pmu->pebs_enable) {
+			kvm_pmu_release_pin(vcpu);
+		}
+		pmu->pebs_enable = data;
+		return 0;
 	default:
 		if ((pmc = get_gp_pmc(pmu, index, MSR_IA32_PERFCTR0)) ||
 				(pmc = get_fixed_pmc(pmu, index))) {
@@ -418,7 +576,7 @@ int kvm_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 			if (data == pmc->eventsel)
 				return 0;
 			if (!(data & pmu->reserved_bits)) {
-				reprogram_gp_counter(pmc, data);
+				reprogram_gp_counter(pmu, pmc, data);
 				return 0;
 			}
 		}
@@ -514,6 +672,9 @@ void kvm_pmu_init(struct kvm_vcpu *vcpu)
 	}
 	init_irq_work(&pmu->irq_work, trigger_pmi);
 	kvm_pmu_cpuid_update(vcpu);
+
+	if (boot_cpu_has(X86_FEATURE_PDCM))
+		rdmsrl_safe(MSR_IA32_PERF_CAPABILITIES, &host_perf_cap);
 }
 
 void kvm_pmu_reset(struct kvm_vcpu *vcpu)
@@ -538,6 +699,7 @@ void kvm_pmu_reset(struct kvm_vcpu *vcpu)
 void kvm_pmu_destroy(struct kvm_vcpu *vcpu)
 {
 	kvm_pmu_reset(vcpu);
+	kvm_pmu_release_pin(vcpu);
 }
 
 void kvm_handle_pmu_event(struct kvm_vcpu *vcpu)
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 33e8c02..4f39917 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -7288,6 +7288,12 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
 	atomic_switch_perf_msrs(vmx);
 	debugctlmsr = get_debugctlmsr();
 
+	/* Move this somewhere else? */
+	if (vcpu->arch.pmu.ds_area)
+		add_atomic_switch_msr(vmx, MSR_IA32_DS_AREA,
+				      vcpu->arch.pmu.ds_area,
+				      perf_get_ds_area());
+
 	vmx->__launched = vmx->loaded_vmcs->launched;
 	asm(
 		/* Store host registers */
-- 
1.9.0


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH 2/4] perf: Allow guest PEBS for KVM owned counters
  2014-05-30  1:12 ` [PATCH 2/4] perf: Allow guest PEBS for KVM owned counters Andi Kleen
@ 2014-05-30  7:31   ` Peter Zijlstra
  2014-05-30 16:03     ` Andi Kleen
  0 siblings, 1 reply; 29+ messages in thread
From: Peter Zijlstra @ 2014-05-30  7:31 UTC (permalink / raw)
  To: Andi Kleen; +Cc: gleb, pbonzini, eranian, kvm, linux-kernel, Andi Kleen

On Thu, May 29, 2014 at 06:12:05PM -0700, Andi Kleen wrote:
> From: Andi Kleen <ak@linux.intel.com>
> 
> Currently perf unconditionally disables PEBS for guests.
> 
> Now that we have the infrastructure in place to handle
> it, we can allow it for KVM-owned guest events. For this,
> perf needs to know that an event is owned by a guest.
> Add a new state bit in the perf_event for that.
> 

This doesn't make sense; why does it need to be owned?

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 3/4] perf: Handle guest PEBS events with a fake event
  2014-05-30  1:12 ` [PATCH 3/4] perf: Handle guest PEBS events with a fake event Andi Kleen
@ 2014-05-30  7:34   ` Peter Zijlstra
  2014-05-30 16:29     ` Andi Kleen
  0 siblings, 1 reply; 29+ messages in thread
From: Peter Zijlstra @ 2014-05-30  7:34 UTC (permalink / raw)
  To: Andi Kleen; +Cc: gleb, pbonzini, eranian, kvm, linux-kernel, Andi Kleen

On Thu, May 29, 2014 at 06:12:06PM -0700, Andi Kleen wrote:

> Note: in very rare cases with exotic events this may lead to spurious PMIs
> in the guest.

Qualify that statement so that if someone runs into it we at least know
it is known/expected.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Implement PEBS virtualization for Silvermont
  2014-05-30  1:12 Implement PEBS virtualization for Silvermont Andi Kleen
                   ` (3 preceding siblings ...)
  2014-05-30  1:12 ` [PATCH 4/4] kvm: Implement PEBS virtualization Andi Kleen
@ 2014-05-30  7:39 ` Peter Zijlstra
  4 siblings, 0 replies; 29+ messages in thread
From: Peter Zijlstra @ 2014-05-30  7:39 UTC (permalink / raw)
  To: Andi Kleen; +Cc: gleb, pbonzini, eranian, kvm, linux-kernel

On Thu, May 29, 2014 at 06:12:03PM -0700, Andi Kleen wrote:
> PEBS is very useful (e.g. enabling the more precise cycles:pp event or
> memory profiling). Unfortunately it did not work under virtualization,
> which is becoming more and more common.
> 
> This patch kit implements simple PEBS virtualization for KVM on Silvermont
> CPUs. Silvermont does not have the leak problems that prevented successful
> PEBS virtualization earlier.

Silvermont is such an underpowered thing, who in his right mind would
run anything virt on it to further reduce performance?

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 4/4] kvm: Implement PEBS virtualization
  2014-05-30  1:12 ` [PATCH 4/4] kvm: Implement PEBS virtualization Andi Kleen
@ 2014-05-30  8:21   ` Gleb Natapov
  2014-05-30 16:24     ` Andi Kleen
  2014-06-02 19:05   ` Eric Northup
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 29+ messages in thread
From: Gleb Natapov @ 2014-05-30  8:21 UTC (permalink / raw)
  To: Andi Kleen; +Cc: peterz, pbonzini, eranian, kvm, linux-kernel, Andi Kleen

On Thu, May 29, 2014 at 06:12:07PM -0700, Andi Kleen wrote:
> From: Andi Kleen <ak@linux.intel.com>
> 
> PEBS (Precise Event Based Sampling) profiling is very powerful,
> allowing improved sampling precision and much additional information,
> like address or TSX abort profiling. cycles:p and :pp use PEBS.
> 
> This patch enables PEBS profiling in KVM guests.
That sounds really cool!

> 
> PEBS writes profiling records to a virtual address in memory. Since
> the guest controls the virtual address space, the PEBS record
> is delivered directly to the guest buffer. We set up the PEBS state
> so that it works correctly. The CPU cannot handle any kind of fault
> during these guest writes.
> 
> To avoid any problems with guest pages being swapped by the host we
> pin the pages when the PEBS buffer is set up, by intercepting
> that MSR.
It will prevent the guest page from being swapped, but shadow paging code may
still drop shadow PT pages that build a mapping from the DS virtual address to
the guest page.
With EPT it is less likely to happen (but still possible IIRC depending on memory
pressure and how much memory shadow paging code is allowed to use), without EPT
it will happen for sure.

> 
> Typically profilers only set up a single page, so pinning that is not
> a big problem. The pinning is currently limited to 17 pages (a 64K buffer
> plus the DS area).
> 
> In theory the guest can change its own page tables after the PEBS
> setup. The host has no way to track that with EPT. But if a guest
> would do that it could only crash itself. It's not expected
> that normal profilers do that.
Spec says:

 The following restrictions should be applied to the DS save area.
   • The three DS save area sections should be allocated from a
   non-paged pool, and marked accessed and dirty. It is the responsibility
   of the operating system to keep the pages that contain the buffer
   present and to mark them accessed and dirty. The implication is that
   the operating system cannot do “lazy” page-table entry propagation
   for these pages.

There is nothing, as far as I can see, that says what will happen if the
condition is not met. I always interpreted it as undefined behaviour, so
anything can happen, including the CPU dying completely.  You are saying
above, on one hand, that the CPU cannot handle any kind of fault during
writes to the DS area, but on the other hand that a guest could only crash
itself. Is this architecturally guaranteed?

--
			Gleb.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 2/4] perf: Allow guest PEBS for KVM owned counters
  2014-05-30  7:31   ` Peter Zijlstra
@ 2014-05-30 16:03     ` Andi Kleen
  2014-05-30 16:17       ` Peter Zijlstra
  0 siblings, 1 reply; 29+ messages in thread
From: Andi Kleen @ 2014-05-30 16:03 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Andi Kleen, gleb, pbonzini, eranian, kvm, linux-kernel

On Fri, May 30, 2014 at 09:31:53AM +0200, Peter Zijlstra wrote:
> On Thu, May 29, 2014 at 06:12:05PM -0700, Andi Kleen wrote:
> > From: Andi Kleen <ak@linux.intel.com>
> > 
> > Currently perf unconditionally disables PEBS for guests.
> > 
> > Now that we have the infrastructure in place to handle
> > it, we can allow it for KVM-owned guest events. For this,
> > perf needs to know that an event is owned by a guest.
> > Add a new state bit in the perf_event for that.
> > 
> 
> This doesn't make sense; why does it need to be owned?

Please read the complete patch kit.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 2/4] perf: Allow guest PEBS for KVM owned counters
  2014-05-30 16:03     ` Andi Kleen
@ 2014-05-30 16:17       ` Peter Zijlstra
  0 siblings, 0 replies; 29+ messages in thread
From: Peter Zijlstra @ 2014-05-30 16:17 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Andi Kleen, gleb, pbonzini, eranian, kvm, linux-kernel


On Fri, May 30, 2014 at 09:03:57AM -0700, Andi Kleen wrote:
> On Fri, May 30, 2014 at 09:31:53AM +0200, Peter Zijlstra wrote:
> > On Thu, May 29, 2014 at 06:12:05PM -0700, Andi Kleen wrote:
> > > From: Andi Kleen <ak@linux.intel.com>
> > > 
> > > Currently perf unconditionally disables PEBS for guests.
> > > 
> > > Now that we have the infrastructure in place to handle
> > > it, we can allow it for KVM-owned guest events. For this,
> > > perf needs to know that an event is owned by a guest.
> > > Add a new state bit in the perf_event for that.
> > > 
> > 
> > This doesn't make sense; why does it need to be owned?
> 
> Please read the complete patch kit

Please write coherent and self-sustaining changelogs.


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 4/4] kvm: Implement PEBS virtualization
  2014-05-30  8:21   ` Gleb Natapov
@ 2014-05-30 16:24     ` Andi Kleen
  2014-06-02 16:45       ` Gleb Natapov
  0 siblings, 1 reply; 29+ messages in thread
From: Andi Kleen @ 2014-05-30 16:24 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Andi Kleen, peterz, pbonzini, eranian, kvm, linux-kernel

> > To avoid any problems with guest pages being swapped by the host we
> > pin the pages when the PEBS buffer is set up, by intercepting
> > that MSR.
> It will prevent the guest page from being swapped, but shadow paging code may
> still drop shadow PT pages that build a mapping from the DS virtual address to
> the guest page.

You're saying the EPT code could tear down the EPT mappings?

OK that would need to be prevented too. Any suggestions how?

> With EPT it is less likely to happen (but still possible IIRC depending on memory
> pressure and how much memory shadow paging code is allowed to use), without EPT
> it will happen for sure.

Don't care about the non-EPT case, this is whitelisted only for EPT-supporting
CPUs.

> There is nothing, as far as I can see, that says what will happen if the
> condition is not met. I always interpreted it as undefined behaviour, so
> anything can happen, including the CPU dying completely.  You are saying
> above, on one hand, that the CPU cannot handle any kind of fault during
> writes to the DS area, but on the other hand that a guest could only crash
> itself. Is this architecturally guaranteed?

You essentially would get random page faults, and the PEBS event will
be cancelled. No hangs.

It's not architecturally guaranteed, but we whitelist anyway, so
we only care about the whitelisted CPUs at this point. For them
I have confirmation that it works.

-Andi

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 3/4] perf: Handle guest PEBS events with a fake event
  2014-05-30  7:34   ` Peter Zijlstra
@ 2014-05-30 16:29     ` Andi Kleen
  0 siblings, 0 replies; 29+ messages in thread
From: Andi Kleen @ 2014-05-30 16:29 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Andi Kleen, gleb, pbonzini, eranian, kvm, linux-kernel

On Fri, May 30, 2014 at 09:34:39AM +0200, Peter Zijlstra wrote:
> On Thu, May 29, 2014 at 06:12:06PM -0700, Andi Kleen wrote:
> 
> > Note: in very rare cases with exotic events this may lead to spurious PMIs
> > in the guest.
> 
> Qualify that statement so that if someone runs into it we at least know
> it is known/expected.

You cannot actually observe it, so it's not a real problem.
I'll drop the Note.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 4/4] kvm: Implement PEBS virtualization
  2014-05-30 16:24     ` Andi Kleen
@ 2014-06-02 16:45       ` Gleb Natapov
  2014-06-02 16:52         ` Andi Kleen
  2014-06-02 19:09         ` Marcelo Tosatti
  0 siblings, 2 replies; 29+ messages in thread
From: Gleb Natapov @ 2014-06-02 16:45 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andi Kleen, peterz, pbonzini, eranian, kvm, linux-kernel, mtosatti

On Fri, May 30, 2014 at 09:24:24AM -0700, Andi Kleen wrote:
> > > To avoid any problems with guest pages being swapped by the host we
> > > pin the pages when the PEBS buffer is setup, by intercepting
> > > that MSR.
> > It will avoid guest page to be swapped, but shadow paging code may still drop
> > shadow PT pages that build a mapping from DS virtual address to the guest page.
> 
> You're saying the EPT code could tear down the EPT mappings?

Under memory pressure, yes. mmu_shrink_scan() calls
prepare_zap_oldest_mmu_page(), which destroys the oldest mmu pages, as its
name says. As far as I can tell, running a nested guest can also result in
EPT mappings being dropped, since it will create a lot of shadow pages and
this will cause make_mmu_pages_available() to destroy some shadow pages,
and it may choose EPT pages to destroy.

CCing Marcelo to confirm/correct.

> 
> OK that would need to be prevented too. Any suggestions how?
Only at a high level: mark the shadow pages involved in the translation we want
to keep, and skip them in prepare_zap_oldest_mmu_page().
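Roughly, as a sketch only (the "pebs_pinned" marker is a made-up field, not
something that exists in struct kvm_mmu_page today):

	/* Skip marked shadow pages when looking for a victim to zap.
	 * "pebs_pinned" would be set on the pages that translate the
	 * guest's DS area while PEBS is enabled. */
	static bool prepare_zap_oldest_mmu_page(struct kvm *kvm,
						struct list_head *invalid_list)
	{
		struct kvm_mmu_page *sp;

		list_for_each_entry_reverse(sp, &kvm->arch.active_mmu_pages, link) {
			if (sp->pebs_pinned)
				continue;	/* keep translations PEBS depends on */
			kvm_mmu_prepare_zap_page(kvm, sp, invalid_list);
			return true;
		}
		return false;
	}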

> 
> > With EPT it is less likely to happen (but still possible IIRC depending on memory
> > pressure and how much memory shadow paging code is allowed to use), without EPT
> > it will happen for sure.
> 
> Don't care about the non EPT case, this is white listed only for EPT supporting 
> CPUs.
The user may still disable EPT during module load, so PEBS should be dropped
from the guest's CPUID in this case.

> 
> > There is nothing, as far as I can see, that says what will happen if the
> > condition is not met. I always interpreted it as undefined behaviour so
> > anything can happen including CPU dies completely.  You are saying above
> > on one hand that CPU cannot handle any kinds of faults during write to
> > DS area, but on the other hand a guest could only crash itself. Is this
> > architecturally guarantied?
> 
> You essentially would get random page faults, and the PEBS event will
> be cancelled. No hangs.
Is it the guest that will get those random page faults, or the host?

> 
> It's not architecturally guaranteed, but we white list anyways so 
> we only care about the white listed CPUs at this point. For them
> I have confirmation that it works.
> 
> -Andi

--
			Gleb.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 4/4] kvm: Implement PEBS virtualization
  2014-06-02 16:45       ` Gleb Natapov
@ 2014-06-02 16:52         ` Andi Kleen
  2014-06-02 19:09         ` Marcelo Tosatti
  1 sibling, 0 replies; 29+ messages in thread
From: Andi Kleen @ 2014-06-02 16:52 UTC (permalink / raw)
  To: Gleb Natapov
  Cc: Andi Kleen, Andi Kleen, peterz, pbonzini, eranian, kvm,
	linux-kernel, mtosatti


BTW I found some more problems in the v1 version.

> > > With EPT it is less likely to happen (but still possible IIRC depending on memory
> > > pressure and how much memory shadow paging code is allowed to use), without EPT
> > > it will happen for sure.
> > 
> > Don't care about the non EPT case, this is white listed only for EPT supporting 
> > CPUs.
> User may still disable EPT during module load, so pebs should be dropped
> from a guest's cpuid in this case.

Ok.

> 
> > 
> > > There is nothing, as far as I can see, that says what will happen if the
> > > condition is not met. I always interpreted it as undefined behaviour so
> > > anything can happen including CPU dies completely.  You are saying above
> > > on one hand that CPU cannot handle any kinds of faults during write to
> > > DS area, but on the other hand a guest could only crash itself. Is this
> > > architecturally guarantied?
> > 
> > You essentially would get random page faults, and the PEBS event will
> > be cancelled. No hangs.
> Is this a guest who will get those random page faults or a host?

The guest (on the whitelisted CPU models)

-Andi

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 4/4] kvm: Implement PEBS virtualization
  2014-05-30  1:12 ` [PATCH 4/4] kvm: Implement PEBS virtualization Andi Kleen
  2014-05-30  8:21   ` Gleb Natapov
@ 2014-06-02 19:05   ` Eric Northup
  2014-06-02 19:57     ` Andi Kleen
  2014-06-10 18:04   ` Marcelo Tosatti
  2014-06-22 13:57   ` Avi Kivity
  3 siblings, 1 reply; 29+ messages in thread
From: Eric Northup @ 2014-06-02 19:05 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Peter Zijlstra, gleb, Paolo Bonzini, Stephane Eranian, KVM,
	Linux Kernel Mailing List, Andi Kleen

On Thu, May 29, 2014 at 6:12 PM, Andi Kleen <andi@firstfloor.org> wrote:
> From: Andi Kleen <ak@linux.intel.com>
>
> PEBS (Precise Event Based Sampling) profiling is very powerful,
> allowing improved sampling precision and much additional information,
> like address or TSX abort profiling. cycles:p and :pp use PEBS.
>
> This patch enables PEBS profiling in KVM guests.
>
> PEBS writes profiling records to a virtual address in memory. Since
> the guest controls the virtual address space the PEBS record
> is directly delivered to the guest buffer. We set up the PEBS state
> so that it works correctly. The CPU cannot handle any kind of fault during
> these guest writes.
>
> To avoid any problems with guest pages being swapped by the host we
> pin the pages when the PEBS buffer is setup, by intercepting
> that MSR.
>
> Typically profilers only set up a single page, so pinning that is not
> a big problem. The pinning is limited to 17 pages currently (64K+1)
>
> In theory the guest can change its own page tables after the PEBS
> setup. The host has no way to track that with EPT. But if a guest
> would do that it could only crash itself. It's not expected
> that normal profilers do that.
>
> The patch also adds the basic glue to enable the PEBS CPUIDs
> and other PEBS MSRs, and ask perf to enable PEBS as needed.
>
> Due to various limitations it currently only works on Silvermont
> based systems.
>
> This patch doesn't implement the extended MSRs some CPUs support.
> For example latency profiling on SLM will not work at this point.
>
> Timing:
>
> The emulation is somewhat more expensive than a real PMU. This
> may trigger the expensive PMI detection in the guest.
> Usually this can be disabled with
> echo 0 > /proc/sys/kernel/perf_cpu_time_max_percent
>
> Migration:
>
> In theory it should be possible (as long as we migrate to
> a host with the same PEBS event and the same PEBS format), but I'm not
> sure the basic KVM PMU code supports it correctly: no code to
> save/restore state, unless I'm missing something. Once the PMU
> code grows proper migration support it should be straightforward
> to handle the PEBS state too.
>
> Signed-off-by: Andi Kleen <ak@linux.intel.com>
> ---
>  arch/x86/include/asm/kvm_host.h       |   6 ++
>  arch/x86/include/uapi/asm/msr-index.h |   4 +
>  arch/x86/kvm/cpuid.c                  |  10 +-
>  arch/x86/kvm/pmu.c                    | 184 ++++++++++++++++++++++++++++++++--
>  arch/x86/kvm/vmx.c                    |   6 ++
>  5 files changed, 196 insertions(+), 14 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 7de069af..d87cb66 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -319,6 +319,8 @@ struct kvm_pmc {
>         struct kvm_vcpu *vcpu;
>  };
>
> +#define MAX_PINNED_PAGES 17 /* 64k buffer + ds */
> +
>  struct kvm_pmu {
>         unsigned nr_arch_gp_counters;
>         unsigned nr_arch_fixed_counters;
> @@ -335,6 +337,10 @@ struct kvm_pmu {
>         struct kvm_pmc fixed_counters[INTEL_PMC_MAX_FIXED];
>         struct irq_work irq_work;
>         u64 reprogram_pmi;
> +       u64 pebs_enable;
> +       u64 ds_area;
> +       struct page *pinned_pages[MAX_PINNED_PAGES];
> +       unsigned num_pinned_pages;
>  };
>
>  enum {
> diff --git a/arch/x86/include/uapi/asm/msr-index.h b/arch/x86/include/uapi/asm/msr-index.h
> index fcf2b3a..409a582 100644
> --- a/arch/x86/include/uapi/asm/msr-index.h
> +++ b/arch/x86/include/uapi/asm/msr-index.h
> @@ -72,6 +72,10 @@
>  #define MSR_IA32_PEBS_ENABLE           0x000003f1
>  #define MSR_IA32_DS_AREA               0x00000600
>  #define MSR_IA32_PERF_CAPABILITIES     0x00000345
> +#define PERF_CAP_PEBS_TRAP             (1U << 6)
> +#define PERF_CAP_ARCH_REG              (1U << 7)
> +#define PERF_CAP_PEBS_FORMAT           (0xf << 8)
> +
>  #define MSR_PEBS_LD_LAT_THRESHOLD      0x000003f6
>
>  #define MSR_MTRRfix64K_00000           0x00000250
> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> index f47a104..c8cc76b 100644
> --- a/arch/x86/kvm/cpuid.c
> +++ b/arch/x86/kvm/cpuid.c
> @@ -260,6 +260,10 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
>         unsigned f_rdtscp = kvm_x86_ops->rdtscp_supported() ? F(RDTSCP) : 0;
>         unsigned f_invpcid = kvm_x86_ops->invpcid_supported() ? F(INVPCID) : 0;
>         unsigned f_mpx = kvm_x86_ops->mpx_supported() ? F(MPX) : 0;
> +       bool pebs = perf_pebs_virtualization();
> +       unsigned f_ds = pebs ? F(DS) : 0;
> +       unsigned f_pdcm = pebs ? F(PDCM) : 0;
> +       unsigned f_dtes64 = pebs ? F(DTES64) : 0;
>
>         /* cpuid 1.edx */
>         const u32 kvm_supported_word0_x86_features =
> @@ -268,7 +272,7 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
>                 F(CX8) | F(APIC) | 0 /* Reserved */ | F(SEP) |
>                 F(MTRR) | F(PGE) | F(MCA) | F(CMOV) |
>                 F(PAT) | F(PSE36) | 0 /* PSN */ | F(CLFLUSH) |
> -               0 /* Reserved, DS, ACPI */ | F(MMX) |
> +               f_ds /* Reserved, ACPI */ | F(MMX) |
>                 F(FXSR) | F(XMM) | F(XMM2) | F(SELFSNOOP) |
>                 0 /* HTT, TM, Reserved, PBE */;
>         /* cpuid 0x80000001.edx */
> @@ -283,10 +287,10 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
>                 0 /* Reserved */ | f_lm | F(3DNOWEXT) | F(3DNOW);
>         /* cpuid 1.ecx */
>         const u32 kvm_supported_word4_x86_features =
> -               F(XMM3) | F(PCLMULQDQ) | 0 /* DTES64, MONITOR */ |
> +               F(XMM3) | F(PCLMULQDQ) | f_dtes64 /* MONITOR */ |
>                 0 /* DS-CPL, VMX, SMX, EST */ |
>                 0 /* TM2 */ | F(SSSE3) | 0 /* CNXT-ID */ | 0 /* Reserved */ |
> -               F(FMA) | F(CX16) | 0 /* xTPR Update, PDCM */ |
> +               F(FMA) | F(CX16) | f_pdcm /* xTPR Update */ |
>                 F(PCID) | 0 /* Reserved, DCA */ | F(XMM4_1) |
>                 F(XMM4_2) | F(X2APIC) | F(MOVBE) | F(POPCNT) |
>                 0 /* Reserved*/ | F(AES) | F(XSAVE) | 0 /* OSXSAVE */ | F(AVX) |
> diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
> index 4c6f417..6362db7 100644
> --- a/arch/x86/kvm/pmu.c
> +++ b/arch/x86/kvm/pmu.c
> @@ -15,9 +15,11 @@
>  #include <linux/types.h>
>  #include <linux/kvm_host.h>
>  #include <linux/perf_event.h>
> +#include <linux/highmem.h>
>  #include "x86.h"
>  #include "cpuid.h"
>  #include "lapic.h"
> +#include "mmu.h"
>
>  static struct kvm_arch_event_perf_mapping {
>         u8 eventsel;
> @@ -36,9 +38,23 @@ static struct kvm_arch_event_perf_mapping {
>         [7] = { 0x00, 0x30, PERF_COUNT_HW_REF_CPU_CYCLES },
>  };
>
> +struct debug_store {
> +       u64     bts_buffer_base;
> +       u64     bts_index;
> +       u64     bts_absolute_maximum;
> +       u64     bts_interrupt_threshold;
> +       u64     pebs_buffer_base;
> +       u64     pebs_index;
> +       u64     pebs_absolute_maximum;
> +       u64     pebs_interrupt_threshold;
> +       u64     pebs_event_reset[4];
> +};
> +
>  /* mapping between fixed pmc index and arch_events array */
>  int fixed_pmc_events[] = {1, 0, 7};
>
> +static u64 host_perf_cap __read_mostly;
> +
>  static bool pmc_is_gp(struct kvm_pmc *pmc)
>  {
>         return pmc->type == KVM_PMC_GP;
> @@ -108,7 +124,10 @@ static void kvm_perf_overflow(struct perf_event *perf_event,
>  {
>         struct kvm_pmc *pmc = perf_event->overflow_handler_context;
>         struct kvm_pmu *pmu = &pmc->vcpu->arch.pmu;
> -       __set_bit(pmc->idx, (unsigned long *)&pmu->global_status);
> +       if (perf_event->attr.precise_ip)
> +               __set_bit(62, (unsigned long *)&pmu->global_status);
> +       else
> +               __set_bit(pmc->idx, (unsigned long *)&pmu->global_status);
>  }
>
>  static void kvm_perf_overflow_intr(struct perf_event *perf_event,
> @@ -160,7 +179,7 @@ static void stop_counter(struct kvm_pmc *pmc)
>
>  static void reprogram_counter(struct kvm_pmc *pmc, u32 type,
>                 unsigned config, bool exclude_user, bool exclude_kernel,
> -               bool intr, bool in_tx, bool in_tx_cp)
> +               bool intr, bool in_tx, bool in_tx_cp, bool pebs)
>  {
>         struct perf_event *event;
>         struct perf_event_attr attr = {
> @@ -177,18 +196,20 @@ static void reprogram_counter(struct kvm_pmc *pmc, u32 type,
>                 attr.config |= HSW_IN_TX;
>         if (in_tx_cp)
>                 attr.config |= HSW_IN_TX_CHECKPOINTED;
> +       if (pebs)
> +               attr.precise_ip = 1;
>
>         attr.sample_period = (-pmc->counter) & pmc_bitmask(pmc);
>
> -       event = perf_event_create_kernel_counter(&attr, -1, current,
> -                                                intr ? kvm_perf_overflow_intr :
> -                                                kvm_perf_overflow, pmc);
> +       event = __perf_event_create_kernel_counter(&attr, -1, current,
> +                                                (intr || pebs) ?
> +                                                kvm_perf_overflow_intr :
> +                                                kvm_perf_overflow, pmc, true);
>         if (IS_ERR(event)) {
>                 printk_once("kvm: pmu event creation failed %ld\n",
>                                 PTR_ERR(event));
>                 return;
>         }
> -       event->guest_owned = true;
>
>         pmc->perf_event = event;
>         clear_bit(pmc->idx, (unsigned long*)&pmc->vcpu->arch.pmu.reprogram_pmi);
> @@ -211,7 +232,8 @@ static unsigned find_arch_event(struct kvm_pmu *pmu, u8 event_select,
>         return arch_events[i].event_type;
>  }
>
> -static void reprogram_gp_counter(struct kvm_pmc *pmc, u64 eventsel)
> +static void reprogram_gp_counter(struct kvm_pmu *pmu, struct kvm_pmc *pmc,
> +                                u64 eventsel)
>  {
>         unsigned config, type = PERF_TYPE_RAW;
>         u8 event_select, unit_mask;
> @@ -248,7 +270,8 @@ static void reprogram_gp_counter(struct kvm_pmc *pmc, u64 eventsel)
>                         !(eventsel & ARCH_PERFMON_EVENTSEL_OS),
>                         eventsel & ARCH_PERFMON_EVENTSEL_INT,
>                         (eventsel & HSW_IN_TX),
> -                       (eventsel & HSW_IN_TX_CHECKPOINTED));
> +                       (eventsel & HSW_IN_TX_CHECKPOINTED),
> +                       test_bit(pmc->idx, (unsigned long *)&pmu->pebs_enable));
>  }
>
>  static void reprogram_fixed_counter(struct kvm_pmc *pmc, u8 en_pmi, int idx)
> @@ -265,7 +288,7 @@ static void reprogram_fixed_counter(struct kvm_pmc *pmc, u8 en_pmi, int idx)
>                         arch_events[fixed_pmc_events[idx]].event_type,
>                         !(en & 0x2), /* exclude user */
>                         !(en & 0x1), /* exclude kernel */
> -                       pmi, false, false);
> +                       pmi, false, false, false);
>  }
>
>  static inline u8 fixed_en_pmi(u64 ctrl, int idx)
> @@ -298,7 +321,7 @@ static void reprogram_idx(struct kvm_pmu *pmu, int idx)
>                 return;
>
>         if (pmc_is_gp(pmc))
> -               reprogram_gp_counter(pmc, pmc->eventsel);
> +               reprogram_gp_counter(pmu, pmc, pmc->eventsel);
>         else {
>                 int fidx = idx - INTEL_PMC_IDX_FIXED;
>                 reprogram_fixed_counter(pmc,
> @@ -323,6 +346,12 @@ bool kvm_pmu_msr(struct kvm_vcpu *vcpu, u32 msr)
>         int ret;
>
>         switch (msr) {
> +       case MSR_IA32_DS_AREA:
> +       case MSR_IA32_PEBS_ENABLE:
> +       case MSR_IA32_PERF_CAPABILITIES:
> +               ret = perf_pebs_virtualization() ? 1 : 0;

Should this also check the CPUID exposed to the guest?  The KVM PMU
module is careful to not expose PMU MSRs to the guest unless the right
CPUID bits have been exposed.

It seems to me that with this patch, there is no way to expose a
PMU-without-PEBS to the guest if the host has PEBS.

It would be a bigger concern if we expected virtual PMU migration to
work, but I think it would be nice to update kvm_pmu_cpuid_update() to
notice the presence/absence of the new CPUID bits, and then store that
into per-VM kvm_pmu->pebs_allowed rather than relying only on the
per-host perf_pebs_virtualization().
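Purely as a sketch of that suggestion (guest_cpuid_has_pebs() and
pmu->pebs_allowed are made-up names, not existing KVM code):

	/* Derive per-VM PEBS permission from the CPUID bits userspace
	 * actually exposed, instead of only the host-wide check. */
	static bool guest_cpuid_has_pebs(struct kvm_vcpu *vcpu)
	{
		struct kvm_cpuid_entry2 *entry;

		if (!perf_pebs_virtualization())
			return false;
		entry = kvm_find_cpuid_entry(vcpu, 1, 0);
		if (!entry)
			return false;
		/* DS lives in CPUID.1:EDX, DTES64 and PDCM in CPUID.1:ECX */
		return (entry->edx & bit(X86_FEATURE_DS)) &&
		       (entry->ecx & bit(X86_FEATURE_DTES64)) &&
		       (entry->ecx & bit(X86_FEATURE_PDCM));
	}

kvm_pmu_cpuid_update() would then cache the result in the hypothetical
pmu->pebs_allowed, and kvm_pmu_msr()/kvm_pmu_set_msr() would test that
instead of perf_pebs_virtualization() alone.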

> +               break;
> +
>         case MSR_CORE_PERF_FIXED_CTR_CTRL:
>         case MSR_CORE_PERF_GLOBAL_STATUS:
>         case MSR_CORE_PERF_GLOBAL_CTRL:
> @@ -356,6 +385,18 @@ int kvm_pmu_get_msr(struct kvm_vcpu *vcpu, u32 index, u64 *data)
>         case MSR_CORE_PERF_GLOBAL_OVF_CTRL:
>                 *data = pmu->global_ovf_ctrl;
>                 return 0;
> +       case MSR_IA32_DS_AREA:
> +               *data = pmu->ds_area;
> +               return 0;
> +       case MSR_IA32_PEBS_ENABLE:
> +               *data = pmu->pebs_enable;
> +               return 0;
> +       case MSR_IA32_PERF_CAPABILITIES:
> +               /* Report host PEBS format to guest */
> +               *data = host_perf_cap &
> +                       (PERF_CAP_PEBS_TRAP | PERF_CAP_ARCH_REG |
> +                        PERF_CAP_PEBS_FORMAT);
> +               return 0;
>         default:
>                 if ((pmc = get_gp_pmc(pmu, index, MSR_IA32_PERFCTR0)) ||
>                                 (pmc = get_fixed_pmc(pmu, index))) {
> @@ -369,6 +410,109 @@ int kvm_pmu_get_msr(struct kvm_vcpu *vcpu, u32 index, u64 *data)
>         return 1;
>  }
>
> +static void kvm_pmu_release_pin(struct kvm_vcpu *vcpu)
> +{
> +       struct kvm_pmu *pmu = &vcpu->arch.pmu;
> +       int i;
> +
> +       for (i = 0; i < pmu->num_pinned_pages; i++)
> +               put_page(pmu->pinned_pages[i]);
> +       pmu->num_pinned_pages = 0;
> +}
> +
> +static struct page *get_guest_page(struct kvm_vcpu *vcpu,
> +                                  unsigned long addr)
> +{
> +       unsigned long pfn;
> +       struct x86_exception exception;
> +       gpa_t gpa = vcpu->arch.walk_mmu->gva_to_gpa(vcpu, addr,
> +                                                   PFERR_WRITE_MASK,
> +                                                   &exception);
> +
> +       if (gpa == UNMAPPED_GVA) {
> +               printk_once("Cannot translate guest page %lx\n", addr);
> +               return NULL;
> +       }
> +       pfn = gfn_to_pfn(vcpu->kvm, gpa_to_gfn(gpa));
> +       if (is_error_noslot_pfn(pfn)) {
> +               printk_once("gfn_to_pfn failed for %llx\n", gpa);
> +               return NULL;
> +       }
> +       return pfn_to_page(pfn);
> +}
> +
> +static int pin_and_copy(struct kvm_vcpu *vcpu,
> +                       unsigned long addr, void *dst, int len,
> +                       struct page **p)
> +{
> +       unsigned long offset = addr & ~PAGE_MASK;
> +       void *map;
> +
> +       *p = get_guest_page(vcpu, addr);
> +       if (!*p)
> +               return -EIO;
> +       map = kmap(*p);
> +       memcpy(dst, map + offset, len);
> +       kunmap(map);
> +       return 0;
> +}
> +
> +/*
> + * Pin the DS area and the PEBS buffer while PEBS is active,
> + * because the CPU cannot tolerate EPT faults for PEBS updates.
> + *
> + * We assume that any guest who changes the DS buffer disables
> + * PEBS first and does not change the page tables during operation.
> + *
> + * When the guest violates these assumptions it may crash itself.
> + * This is expected to not happen with standard profilers.
> + *
> + * No need to clean up anything, as the caller will always eventually
> + * unpin pages.
> + */
> +
> +static void kvm_pmu_pebs_pin(struct kvm_vcpu *vcpu)
> +{
> +       struct kvm_pmu *pmu = &vcpu->arch.pmu;
> +       struct debug_store ds;
> +       int pg;
> +       unsigned len;
> +       unsigned long offset;
> +       unsigned long addr;
> +
> +       offset = pmu->ds_area & ~PAGE_MASK;
> +       len = sizeof(struct debug_store);
> +       len = min_t(unsigned, PAGE_SIZE - offset, len);
> +       if (pin_and_copy(vcpu, pmu->ds_area, &ds, len,
> +                        &pmu->pinned_pages[0]) < 0) {
> +               printk_once("Cannot pin ds area %llx\n", pmu->ds_area);
> +               return;
> +       }
> +       pmu->num_pinned_pages++;
> +       if (len < sizeof(struct debug_store)) {
> +               if (pin_and_copy(vcpu, pmu->ds_area + len, (void *)&ds + len,
> +                                 sizeof(struct debug_store) - len,
> +                                 &pmu->pinned_pages[1]) < 0)
> +                       return;
> +               pmu->num_pinned_pages++;
> +       }
> +
> +       pg = pmu->num_pinned_pages;
> +       for (addr = ds.pebs_buffer_base;
> +            addr < ds.pebs_absolute_maximum && pg < MAX_PINNED_PAGES;
> +            addr += PAGE_SIZE, pg++) {
> +               pmu->pinned_pages[pg] = get_guest_page(vcpu, addr);
> +               if (!pmu->pinned_pages[pg]) {
> +                       printk_once("Cannot pin PEBS buffer %lx (%llx-%llx)\n",
> +                                addr,
> +                                ds.pebs_buffer_base,
> +                                ds.pebs_absolute_maximum);
> +                       break;
> +               }
> +       }
> +       pmu->num_pinned_pages = pg;
> +}
> +
>  int kvm_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>  {
>         struct kvm_pmu *pmu = &vcpu->arch.pmu;
> @@ -407,6 +551,20 @@ int kvm_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>                         return 0;
>                 }
>                 break;
> +       case MSR_IA32_DS_AREA:
> +               pmu->ds_area = data;
> +               return 0;
> +       case MSR_IA32_PEBS_ENABLE:
> +               if (data & ~0xf0000000fULL)
> +                       break;
> +               if (data && data != pmu->pebs_enable) {
> +                       kvm_pmu_release_pin(vcpu);
> +                       kvm_pmu_pebs_pin(vcpu);
> +               } else if (data == 0 && pmu->pebs_enable) {
> +                       kvm_pmu_release_pin(vcpu);
> +               }
> +               pmu->pebs_enable = data;
> +               return 0;
>         default:
>                 if ((pmc = get_gp_pmc(pmu, index, MSR_IA32_PERFCTR0)) ||
>                                 (pmc = get_fixed_pmc(pmu, index))) {
> @@ -418,7 +576,7 @@ int kvm_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>                         if (data == pmc->eventsel)
>                                 return 0;
>                         if (!(data & pmu->reserved_bits)) {
> -                               reprogram_gp_counter(pmc, data);
> +                               reprogram_gp_counter(pmu, pmc, data);
>                                 return 0;
>                         }
>                 }
> @@ -514,6 +672,9 @@ void kvm_pmu_init(struct kvm_vcpu *vcpu)
>         }
>         init_irq_work(&pmu->irq_work, trigger_pmi);
>         kvm_pmu_cpuid_update(vcpu);
> +
> +       if (boot_cpu_has(X86_FEATURE_PDCM))
> +               rdmsrl_safe(MSR_IA32_PERF_CAPABILITIES, &host_perf_cap);
>  }
>
>  void kvm_pmu_reset(struct kvm_vcpu *vcpu)
> @@ -538,6 +699,7 @@ void kvm_pmu_reset(struct kvm_vcpu *vcpu)
>  void kvm_pmu_destroy(struct kvm_vcpu *vcpu)
>  {
>         kvm_pmu_reset(vcpu);
> +       kvm_pmu_release_pin(vcpu);
>  }
>
>  void kvm_handle_pmu_event(struct kvm_vcpu *vcpu)
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 33e8c02..4f39917 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -7288,6 +7288,12 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
>         atomic_switch_perf_msrs(vmx);
>         debugctlmsr = get_debugctlmsr();
>
> +       /* Move this somewhere else? */
> +       if (vcpu->arch.pmu.ds_area)
> +               add_atomic_switch_msr(vmx, MSR_IA32_DS_AREA,
> +                                     vcpu->arch.pmu.ds_area,
> +                                     perf_get_ds_area());
> +
>         vmx->__launched = vmx->loaded_vmcs->launched;
>         asm(
>                 /* Store host registers */
> --
> 1.9.0
>
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 4/4] kvm: Implement PEBS virtualization
  2014-06-02 16:45       ` Gleb Natapov
  2014-06-02 16:52         ` Andi Kleen
@ 2014-06-02 19:09         ` Marcelo Tosatti
  1 sibling, 0 replies; 29+ messages in thread
From: Marcelo Tosatti @ 2014-06-02 19:09 UTC (permalink / raw)
  To: Gleb Natapov
  Cc: Andi Kleen, Andi Kleen, peterz, pbonzini, eranian, kvm, linux-kernel

On Mon, Jun 02, 2014 at 07:45:35PM +0300, Gleb Natapov wrote:
> On Fri, May 30, 2014 at 09:24:24AM -0700, Andi Kleen wrote:
> > > > To avoid any problems with guest pages being swapped by the host we
> > > > pin the pages when the PEBS buffer is setup, by intercepting
> > > > that MSR.
> > > It will avoid guest page to be swapped, but shadow paging code may still drop
> > > shadow PT pages that build a mapping from DS virtual address to the guest page.
> > 
> > You're saying the EPT code could tear down the EPT mappings?
> 
> Under memory pressure yes. mmu_shrink_scan() calls
> prepare_zap_oldest_mmu_page() which destroys oldest mmu pages like its
> name says. As far as I can tell running nested guest can also result in
> EPT mapping to be dropped since it will create a lot of shadow pages and
> this will cause make_mmu_pages_available() to destroy some shadow pages
> and it may choose EPT pages to destroy.
> 
> CCing Marcelo to confirm/correct.

Yes. Given SLAB pressure, any shadow page can be deleted except the ones
pinned via root_count=1.

> > OK that would need to be prevented too. Any suggestions how?
> Only high level. Mark shadow pages involved in translation we want to keep and skip them in
> prepare_zap_oldest_mmu_page().

We should special-case such translations so that they are not zapped
(either via page deletion or single-entry EPT deletion). Them
and any of their parents, bummer.

Maybe it's cleaner to check that the DS area is EPT-mapped before VM-entry.

Is there no way the processor can generate VM-exits?

Is it not an option to fake a DS-save area in the host (and trap
any accesses to the DS_AREA MSR from the guest)?
Then, before notifying the guest of the PEBS event, copy from that host area to
the guest's address. Probably slow, though.
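As a rough illustration of that copy step only (kvm_copy_pebs_to_guest() is a
made-up helper, and the guest's DS pointers would still need to be translated
and its pebs_index updated):

	/* Copy PEBS records the hardware wrote into a host-owned buffer
	 * over to the guest's PEBS buffer on the PEBS interrupt. */
	static int kvm_copy_pebs_to_guest(struct kvm_vcpu *vcpu,
					  void *host_records,
					  gpa_t guest_pebs_gpa,
					  unsigned long bytes)
	{
		return kvm_write_guest(vcpu->kvm, guest_pebs_gpa,
				       host_records, bytes);
	}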


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 4/4] kvm: Implement PEBS virtualization
  2014-06-02 19:05   ` Eric Northup
@ 2014-06-02 19:57     ` Andi Kleen
  2014-06-19 14:39       ` Paolo Bonzini
  0 siblings, 1 reply; 29+ messages in thread
From: Andi Kleen @ 2014-06-02 19:57 UTC (permalink / raw)
  To: Eric Northup
  Cc: Andi Kleen, Peter Zijlstra, gleb, Paolo Bonzini,
	Stephane Eranian, KVM, Linux Kernel Mailing List, Andi Kleen

> It seems to me that with this patch, there is no way to expose a
> PMU-without-PEBS to the guest if the host has PEBS.

If you clear the CPUID bits then no one would likely access it.

But fair enough, I'll add extra checks for CPUID.

> It would be a bigger concern if we expected virtual PMU migration to
> work, but I think it would be nice to update kvm_pmu_cpuid_update() to
> notice the presence/absence of the new CPUID bits, and then store that
> into per-VM kvm_pmu->pebs_allowed rather than relying only on the
> per-host perf_pebs_virtualization().

I hope at some point it can work. There shouldn't be any problems
with migrating to the same CPU model; in many cases (same event
and same PEBS format) it'll likely even work between models or
gracefully degrade.

BTW in practice it'll likely work anyway, because many profilers
regularly re-set the PMU.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 4/4] kvm: Implement PEBS virtualization
  2014-05-30  1:12 ` [PATCH 4/4] kvm: Implement PEBS virtualization Andi Kleen
  2014-05-30  8:21   ` Gleb Natapov
  2014-06-02 19:05   ` Eric Northup
@ 2014-06-10 18:04   ` Marcelo Tosatti
  2014-06-10 19:22     ` Andi Kleen
  2014-06-22 13:57   ` Avi Kivity
  3 siblings, 1 reply; 29+ messages in thread
From: Marcelo Tosatti @ 2014-06-10 18:04 UTC (permalink / raw)
  To: Andi Kleen; +Cc: peterz, gleb, pbonzini, eranian, kvm, linux-kernel, Andi Kleen

On Thu, May 29, 2014 at 06:12:07PM -0700, Andi Kleen wrote:
>  {
>  	struct kvm_pmu *pmu = &vcpu->arch.pmu;
> @@ -407,6 +551,20 @@ int kvm_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>  			return 0;
>  		}
>  		break;
> +	case MSR_IA32_DS_AREA:
> +		pmu->ds_area = data;
> +		return 0;
> +	case MSR_IA32_PEBS_ENABLE:
> +		if (data & ~0xf0000000fULL)
> +			break;

Bit 63 == PS_ENABLE ?

>  void kvm_handle_pmu_event(struct kvm_vcpu *vcpu)
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 33e8c02..4f39917 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -7288,6 +7288,12 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
>  	atomic_switch_perf_msrs(vmx);
>  	debugctlmsr = get_debugctlmsr();
>  
> +	/* Move this somewhere else? */

Unless you hook into vcpu->arch.pmu.ds_area and perf_get_ds_area()
writers, it has to be at every vcpu entry.

Could compare values in MSR save area to avoid switch.

> +	if (vcpu->arch.pmu.ds_area)
> +		add_atomic_switch_msr(vmx, MSR_IA32_DS_AREA,
> +				      vcpu->arch.pmu.ds_area,
> +				      perf_get_ds_area());

Should clear_atomic_switch_msr before 
add_atomic_switch_msr.
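Something like this, as a sketch (vmx->pebs_ds_area_cached is a made-up field
used only to remember the last value programmed into the autoload list):

	if (vcpu->arch.pmu.ds_area != vmx->pebs_ds_area_cached) {
		/* drop any stale entry before (re)adding the MSR */
		clear_atomic_switch_msr(vmx, MSR_IA32_DS_AREA);
		if (vcpu->arch.pmu.ds_area)
			add_atomic_switch_msr(vmx, MSR_IA32_DS_AREA,
					      vcpu->arch.pmu.ds_area,
					      perf_get_ds_area());
		vmx->pebs_ds_area_cached = vcpu->arch.pmu.ds_area;
		/* if the host DS area can change between entries, the host
		 * value would need the same compare-and-update treatment */
	}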


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 4/4] kvm: Implement PEBS virtualization
  2014-06-10 18:04   ` Marcelo Tosatti
@ 2014-06-10 19:22     ` Andi Kleen
  2014-06-10 21:06       ` Marcelo Tosatti
  0 siblings, 1 reply; 29+ messages in thread
From: Andi Kleen @ 2014-06-10 19:22 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Andi Kleen, peterz, gleb, pbonzini, eranian, kvm, linux-kernel

On Tue, Jun 10, 2014 at 03:04:48PM -0300, Marcelo Tosatti wrote:
> On Thu, May 29, 2014 at 06:12:07PM -0700, Andi Kleen wrote:
> >  {
> >  	struct kvm_pmu *pmu = &vcpu->arch.pmu;
> > @@ -407,6 +551,20 @@ int kvm_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
> >  			return 0;
> >  		}
> >  		break;
> > +	case MSR_IA32_DS_AREA:
> > +		pmu->ds_area = data;
> > +		return 0;
> > +	case MSR_IA32_PEBS_ENABLE:
> > +		if (data & ~0xf0000000fULL)
> > +			break;
> 
> Bit 63 == PS_ENABLE ?

PEBS_EN is [3:0], one bit per counter, but only one of those bits is used on
Silvermont (it has a single PEBS counter). LL_EN is [36:32], but currently unused.

> 
> >  void kvm_handle_pmu_event(struct kvm_vcpu *vcpu)
> > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> > index 33e8c02..4f39917 100644
> > --- a/arch/x86/kvm/vmx.c
> > +++ b/arch/x86/kvm/vmx.c
> > @@ -7288,6 +7288,12 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
> >  	atomic_switch_perf_msrs(vmx);
> >  	debugctlmsr = get_debugctlmsr();
> >  
> > +	/* Move this somewhere else? */
> 
> Unless you hook into vcpu->arch.pmu.ds_area and perf_get_ds_area()
> writers, it has to be at every vcpu entry.
> 
> Could compare values in MSR save area to avoid switch.

Ok.

> 
> > +	if (vcpu->arch.pmu.ds_area)
> > +		add_atomic_switch_msr(vmx, MSR_IA32_DS_AREA,
> > +				      vcpu->arch.pmu.ds_area,
> > +				      perf_get_ds_area());
> 
> Should clear_atomic_switch_msr before 
> add_atomic_switch_msr.

Ok.

BTW how about general PMU migration? As far as I can tell there 
is no code to save/restore the state for that currently, right?

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 4/4] kvm: Implement PEBS virtualization
  2014-06-10 19:22     ` Andi Kleen
@ 2014-06-10 21:06       ` Marcelo Tosatti
  2014-06-19 14:42         ` Paolo Bonzini
  0 siblings, 1 reply; 29+ messages in thread
From: Marcelo Tosatti @ 2014-06-10 21:06 UTC (permalink / raw)
  To: Andi Kleen, Paolo Bonzini
  Cc: Andi Kleen, peterz, gleb, pbonzini, eranian, kvm, linux-kernel

On Tue, Jun 10, 2014 at 12:22:07PM -0700, Andi Kleen wrote:
> On Tue, Jun 10, 2014 at 03:04:48PM -0300, Marcelo Tosatti wrote:
> > On Thu, May 29, 2014 at 06:12:07PM -0700, Andi Kleen wrote:
> > >  {
> > >  	struct kvm_pmu *pmu = &vcpu->arch.pmu;
> > > @@ -407,6 +551,20 @@ int kvm_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
> > >  			return 0;
> > >  		}
> > >  		break;
> > > +	case MSR_IA32_DS_AREA:
> > > +		pmu->ds_area = data;
> > > +		return 0;
> > > +	case MSR_IA32_PEBS_ENABLE:
> > > +		if (data & ~0xf0000000fULL)
> > > +			break;
> > 
> > Bit 63 == PS_ENABLE ?
> 
> PEBS_EN is [3:0] for each counter, but only one bit on Silvermont.
> LL_EN is [36:32], but currently unused.
> 
> > 
> > >  void kvm_handle_pmu_event(struct kvm_vcpu *vcpu)
> > > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> > > index 33e8c02..4f39917 100644
> > > --- a/arch/x86/kvm/vmx.c
> > > +++ b/arch/x86/kvm/vmx.c
> > > @@ -7288,6 +7288,12 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
> > >  	atomic_switch_perf_msrs(vmx);
> > >  	debugctlmsr = get_debugctlmsr();
> > >  
> > > +	/* Move this somewhere else? */
> > 
> > Unless you hook into vcpu->arch.pmu.ds_area and perf_get_ds_area()
> > writers, it has to be at every vcpu entry.
> > 
> > Could compare values in MSR save area to avoid switch.
> 
> Ok.
> 
> > 
> > > +	if (vcpu->arch.pmu.ds_area)
> > > +		add_atomic_switch_msr(vmx, MSR_IA32_DS_AREA,
> > > +				      vcpu->arch.pmu.ds_area,
> > > +				      perf_get_ds_area());
> > 
> > Should clear_atomic_switch_msr before 
> > add_atomic_switch_msr.
> 
> Ok.
> 
> BTW how about general PMU migration? As far as I can tell there 
> is no code to save/restore the state for that currently, right?
> 
> -Andi

Paolo wrote support for it, recently. Paolo?


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 4/4] kvm: Implement PEBS virtualization
  2014-06-02 19:57     ` Andi Kleen
@ 2014-06-19 14:39       ` Paolo Bonzini
  0 siblings, 0 replies; 29+ messages in thread
From: Paolo Bonzini @ 2014-06-19 14:39 UTC (permalink / raw)
  To: Andi Kleen, Eric Northup
  Cc: Peter Zijlstra, gleb, Stephane Eranian, KVM,
	Linux Kernel Mailing List, Andi Kleen

On 02/06/2014 21:57, Andi Kleen wrote:
> > It would be a bigger concern if we expected virtual PMU migration to
> > work, but I think it would be nice to update kvm_pmu_cpuid_update() to
> > notice the presence/absence of the new CPUID bits, and then store that
> > into per-VM kvm_pmu->pebs_allowed rather than relying only on the
> > per-host perf_pebs_virtualization().
>
> I hope at some point it can work. There shouldn't be any problems
> with migrating to the same CPU model, in many cases (same event
> and same PEBS format) it'll likely even work between models or
> gracefully degrade.

The code is there in both kernel and QEMU, it's just very little tested.

Paolo

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 4/4] kvm: Implement PEBS virtualization
  2014-06-10 21:06       ` Marcelo Tosatti
@ 2014-06-19 14:42         ` Paolo Bonzini
  2014-06-19 17:33           ` Andi Kleen
  0 siblings, 1 reply; 29+ messages in thread
From: Paolo Bonzini @ 2014-06-19 14:42 UTC (permalink / raw)
  To: Marcelo Tosatti, Andi Kleen
  Cc: Andi Kleen, peterz, gleb, eranian, kvm, linux-kernel

On 10/06/2014 23:06, Marcelo Tosatti wrote:
> > BTW how about general PMU migration? As far as I can tell there
> > is no code to save/restore the state for that currently, right?
>
> Paolo wrote support for it, recently. Paolo?

Yes, on the KVM side all that is needed is to special case MSR reads and 
writes that have side effects, for example:

         case MSR_CORE_PERF_GLOBAL_STATUS:
                 if (msr_info->host_initiated) {
                         pmu->global_status = data;
                         return 0;
                 }
                 break; /* RO MSR */
         case MSR_CORE_PERF_GLOBAL_OVF_CTRL:
                 if (!(data & (pmu->global_ctrl_mask & ~(3ull<<62)))) {
                         if (!msr_info->host_initiated)
                                 pmu->global_status &= ~data;
                         pmu->global_ovf_ctrl = data;
                         return 0;
                 }
                 break;

Right now this is only needed for writes.

Userspace then can read/write these MSRs, and add them to the migration 
stream.  QEMU has code for that.

Paolo

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 4/4] kvm: Implement PEBS virtualization
  2014-06-19 14:42         ` Paolo Bonzini
@ 2014-06-19 17:33           ` Andi Kleen
  2014-06-19 20:33             ` Paolo Bonzini
  0 siblings, 1 reply; 29+ messages in thread
From: Andi Kleen @ 2014-06-19 17:33 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Marcelo Tosatti, Andi Kleen, peterz, gleb, eranian, kvm, linux-kernel

> Userspace then can read/write these MSRs, and add them to the migration
> stream.  QEMU has code for that.

Thanks. The PEBS setup always redoes its state; it can be redone arbitrarily often.

So the only change needed would be to add the MSRs to some list in QEMU?

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 4/4] kvm: Implement PEBS virtualization
  2014-06-19 17:33           ` Andi Kleen
@ 2014-06-19 20:33             ` Paolo Bonzini
  0 siblings, 0 replies; 29+ messages in thread
From: Paolo Bonzini @ 2014-06-19 20:33 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Marcelo Tosatti, Andi Kleen, peterz, gleb, eranian, kvm, linux-kernel

> > Userspace then can read/write these MSRs, and add them to the migration
> > stream.  QEMU has code for that.
> 
> Thanks. The PEBS setup always redoes its state, can be arbitarily often
> redone.
> 
> So the only change needed would be to add the MSRs to some list in qemu?

Yes, and also adding them to the migration stream if the MSRs do not
have the default (all-zero? need to look at the SDM) values.

Paolo

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 4/4] kvm: Implement PEBS virtualization
  2014-05-30  1:12 ` [PATCH 4/4] kvm: Implement PEBS virtualization Andi Kleen
                     ` (2 preceding siblings ...)
  2014-06-10 18:04   ` Marcelo Tosatti
@ 2014-06-22 13:57   ` Avi Kivity
  2014-06-22 19:02     ` Andi Kleen
  3 siblings, 1 reply; 29+ messages in thread
From: Avi Kivity @ 2014-06-22 13:57 UTC (permalink / raw)
  To: Andi Kleen, peterz; +Cc: gleb, pbonzini, eranian, kvm, linux-kernel, Andi Kleen


On 05/30/2014 04:12 AM, Andi Kleen wrote:
> From: Andi Kleen <ak@linux.intel.com>
>
> PEBS (Precise Event Based Sampling) profiling is very powerful,
> allowing improved sampling precision and much additional information,
> like address or TSX abort profiling. cycles:p and :pp use PEBS.
>
> This patch enables PEBS profiling in KVM guests.
>
> PEBS writes profiling records to a virtual address in memory. Since
> the guest controls the virtual address space the PEBS record
> is directly delivered to the guest buffer. We set up the PEBS state
> so that it works correctly. The CPU cannot handle any kind of fault during
> these guest writes.
>
> To avoid any problems with guest pages being swapped by the host we
> pin the pages when the PEBS buffer is setup, by intercepting
> that MSR.
>
> Typically profilers only set up a single page, so pinning that is not
> a big problem. The pinning is limited to 17 pages currently (64K+1)
>
> In theory the guest can change its own page tables after the PEBS
> setup. The host has no way to track that with EPT. But if a guest
> would do that it could only crash itself. It's not expected
> that normal profilers do that.
>
>

Talking a bit with Gleb about this, I think this is impossible.

First, it's not sufficient to pin the debug store area; you also have to
pin the guest page tables that are used to map the debug store.  But
even if you do that, as soon as the guest fork()s, it will create a new
pgd which the host will be free to swap out.  The processor can then
attempt a PEBS store to an unmapped address, which will fail even though
the guest is configured correctly.


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 4/4] kvm: Implement PEBS virtualization
  2014-06-22 13:57   ` Avi Kivity
@ 2014-06-22 19:02     ` Andi Kleen
  2014-06-24 16:45       ` Marcelo Tosatti
  0 siblings, 1 reply; 29+ messages in thread
From: Andi Kleen @ 2014-06-22 19:02 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Andi Kleen, peterz, gleb, pbonzini, eranian, kvm, linux-kernel,
	Andi Kleen

> First, it's not sufficient to pin the debug store area, you also
> have to pin the guest page tables that are used to map the debug
> store.  But even if you do that, as soon as the guest fork()s, it
> will create a new pgd which the host will be free to swap out.  The
> processor can then attempt a PEBS store to an unmapped address which
> will fail, even though the guest is configured correctly.

That's a good point. You're right of course.

The only way I can think of around it would be to intercept CR3 writes
while PEBS is active and always pin all the page-table pages leading
to the PEBS buffer. That's slow, but it should only be needed
while PEBS is running.
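A very rough sketch of that (64-bit 4-level paging only, huge pages ignored,
and kvm_pin_guest_page() is a made-up helper):

	/* On a guest CR3 write while PEBS is active, walk the new page
	 * tables down to the PEBS buffer and pin every table page. */
	static void kvm_pin_pebs_translation(struct kvm_vcpu *vcpu,
					     unsigned long cr3, gva_t pebs_gva)
	{
		gpa_t table_gpa = cr3 & PAGE_MASK;
		int level;

		for (level = 4; level >= 1; level--) {
			int index = (pebs_gva >> (12 + 9 * (level - 1))) & 0x1ff;
			u64 pte;

			kvm_pin_guest_page(vcpu, table_gpa);	/* made up */
			if (kvm_read_guest(vcpu->kvm, table_gpa + index * 8,
					   &pte, 8) < 0 || !(pte & 1))
				return;		/* not present, nothing to pin */
			table_gpa = pte & 0x000ffffffffff000ULL;
		}
		kvm_pin_guest_page(vcpu, table_gpa);	/* the buffer page itself */
	}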

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 4/4] kvm: Implement PEBS virtualization
  2014-06-22 19:02     ` Andi Kleen
@ 2014-06-24 16:45       ` Marcelo Tosatti
  2014-06-25  7:04         ` Avi Kivity
  0 siblings, 1 reply; 29+ messages in thread
From: Marcelo Tosatti @ 2014-06-24 16:45 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Avi Kivity, peterz, gleb, pbonzini, eranian, kvm, linux-kernel,
	Andi Kleen

On Sun, Jun 22, 2014 at 09:02:25PM +0200, Andi Kleen wrote:
> > First, it's not sufficient to pin the debug store area, you also
> > have to pin the guest page tables that are used to map the debug
> > store.  But even if you do that, as soon as the guest fork()s, it
> > will create a new pgd which the host will be free to swap out.  The
> > processor can then attempt a PEBS store to an unmapped address which
> > will fail, even though the guest is configured correctly.
> 
> That's a good point. You're right of course.
> 
> The only way I can think around it would be to intercept CR3 writes
> while PEBS is active and always pin all the table pages leading 
> to the PEBS buffer. That's slow, but should be only needed
> while PEBS is running.
> 
> -Andi

I suppose that can be done separately from the pinned-spte patchset.
And it requires accounting against the mlock limits as well, as noted.

One set of page tables per pinned virtual address, leading down to the
last translation, is sufficient per vcpu.



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 4/4] kvm: Implement PEBS virtualization
  2014-06-24 16:45       ` Marcelo Tosatti
@ 2014-06-25  7:04         ` Avi Kivity
  0 siblings, 0 replies; 29+ messages in thread
From: Avi Kivity @ 2014-06-25  7:04 UTC (permalink / raw)
  To: Marcelo Tosatti, Andi Kleen
  Cc: peterz, gleb, pbonzini, eranian, kvm, linux-kernel, Andi Kleen


On 06/24/2014 07:45 PM, Marcelo Tosatti wrote:
> On Sun, Jun 22, 2014 at 09:02:25PM +0200, Andi Kleen wrote:
>>> First, it's not sufficient to pin the debug store area, you also
>>> have to pin the guest page tables that are used to map the debug
>>> store.  But even if you do that, as soon as the guest fork()s, it
>>> will create a new pgd which the host will be free to swap out.  The
>>> processor can then attempt a PEBS store to an unmapped address which
>>> will fail, even though the guest is configured correctly.
>> That's a good point. You're right of course.
>>
>> The only way I can think around it would be to intercept CR3 writes
>> while PEBS is active and always pin all the table pages leading
>> to the PEBS buffer. That's slow, but should be only needed
>> while PEBS is running.
>>
>> -Andi
> Suppose that can be done separately from the pinned spte patchset.
> And it requires accounting into mlock limits as well, as noted.
>
> One set of pagetables per pinned virtual address leading down to the
> last translations is sufficient per-vcpu.

Or 4, and use the CR3 exit filter to prevent vmexits among the last 4 
LRU CR3 values.

^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2014-06-25  7:04 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-05-30  1:12 Implement PEBS virtualization for Silvermont Andi Kleen
2014-05-30  1:12 ` [PATCH 1/4] perf: Add PEBS virtualization enable " Andi Kleen
2014-05-30  1:12 ` [PATCH 2/4] perf: Allow guest PEBS for KVM owned counters Andi Kleen
2014-05-30  7:31   ` Peter Zijlstra
2014-05-30 16:03     ` Andi Kleen
2014-05-30 16:17       ` Peter Zijlstra
2014-05-30  1:12 ` [PATCH 3/4] perf: Handle guest PEBS events with a fake event Andi Kleen
2014-05-30  7:34   ` Peter Zijlstra
2014-05-30 16:29     ` Andi Kleen
2014-05-30  1:12 ` [PATCH 4/4] kvm: Implement PEBS virtualization Andi Kleen
2014-05-30  8:21   ` Gleb Natapov
2014-05-30 16:24     ` Andi Kleen
2014-06-02 16:45       ` Gleb Natapov
2014-06-02 16:52         ` Andi Kleen
2014-06-02 19:09         ` Marcelo Tosatti
2014-06-02 19:05   ` Eric Northup
2014-06-02 19:57     ` Andi Kleen
2014-06-19 14:39       ` Paolo Bonzini
2014-06-10 18:04   ` Marcelo Tosatti
2014-06-10 19:22     ` Andi Kleen
2014-06-10 21:06       ` Marcelo Tosatti
2014-06-19 14:42         ` Paolo Bonzini
2014-06-19 17:33           ` Andi Kleen
2014-06-19 20:33             ` Paolo Bonzini
2014-06-22 13:57   ` Avi Kivity
2014-06-22 19:02     ` Andi Kleen
2014-06-24 16:45       ` Marcelo Tosatti
2014-06-25  7:04         ` Avi Kivity
2014-05-30  7:39 ` Implement PEBS virtualization for Silvermont Peter Zijlstra
