kvm.vger.kernel.org archive mirror
* [RFC] [PATCH v2 0/5] Intel Virtual PMU Optimization
@ 2019-03-23 14:18 Like Xu
  2019-03-23 14:18 ` [RFC] [PATCH v2 1/5] perf/x86: avoid host changing counter state for kvm_intel events holder Like Xu
                   ` (5 more replies)
  0 siblings, 6 replies; 13+ messages in thread
From: Like Xu @ 2019-03-23 14:18 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: like.xu, wei.w.wang, Andi Kleen, Peter Zijlstra, Kan Liang,
	Ingo Molnar, Paolo Bonzini

As a treasured tool for developers, the Performance Monitoring Unit is
designed to monitor micro-architectural events, which helps in analyzing
how applications and operating systems perform on the processor.

Today KVM implements a working version 2 Architectural PMU on Intel and
AMD hosts. With the joint efforts of the community, it would be an
inspiring journey to make all available PMU features usable by guests as
completely, smoothly and accurately as possible.

=== Brief description ===

This proposal for the Intel vPMU is still committed to optimizing the
basic functionality by reducing the PMU virtualization overhead; it is
not a blind pass-through of the PMU. The proposal applies to existing
models and, in short, is "host perf hands over control to kvm after
counter allocation".

pmc_reprogram_counter() is a heavyweight and high-frequency operation:
it goes through the host perf software stack to create a perf event for
counter assignment, which can take millions of nanoseconds. The current
vPMU always does reprogram_counter when the guest changes the eventsel,
fixctrl and global_ctrl msrs. This brings too much overhead to the usage
of perf inside the guest, especially to guest PMI handling and to
context switching of guest threads with perf in use.
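
For context, the hot path in question looks roughly like the following
(condensed from arch/x86/kvm/pmu.c; most perf_event_attr fields and the
error handling are elided, so treat this as a sketch rather than the
exact code). Every reprogram that needs a new event takes a full trip
through the host perf scheduler:

	static void pmc_reprogram_counter(struct kvm_pmc *pmc, u32 type,
					  unsigned config, ...)
	{
		struct perf_event *event;
		struct perf_event_attr attr = {
			.type		= type,
			.size		= sizeof(attr),
			.config		= config,
			.exclude_host	= 1,
			/* ... */
		};

		/* allocate and schedule a host perf event for this vPMC */
		event = perf_event_create_kernel_counter(&attr, -1, current,
				intr ? kvm_perf_overflow_intr :
				       kvm_perf_overflow, pmc);
		if (IS_ERR(event))
			return;

		pmc->perf_event = event;
	}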

We optimize the current vPMU to work in this manner:

(1) rely on the existing host perf (perf_event_create_kernel_counter)
    to allocate counters for in-use vPMC and always try to reuse events;
(2) vPMU traps guest accesses to the eventsel and fixctrl msrs and
    applies them directly to the hardware msr that the corresponding
    host event is scheduled on, while keeping the counter free of host
    pollution during its guest-owned runtime;
(3) save and restore the counter state during vCPU scheduling in hooks;
(4) apply a lazy approach to release the vPMC's perf event. That is, if
    the vPMC isn't used in a fixed sched slice, its event will be released.

With a vPMC in use, the vPMU focuses on the assigned resources: guest
perf benefits significantly from direct access to the hardware, does not
track the runtime state of the perf_event created by the host, and tries
not to pay for its maintenance. However, to avoid the event entering any
unexpected state, calling pmc_read_counter at the appropriate points is
still necessary.
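
As a rough sketch, the reuse/lazy-release flow on vCPU sched_in looks
like the following (condensed from patches 2 and 4 of this series; only
the gp-counter loop is shown, the fixed-counter loop is analogous):

	struct kvm_pmc *pmc;
	int i;

	for (i = 0; i < INTEL_PMC_MAX_GENERIC; i++) {
		pmc = &pmu->gp_counters[i];

		/* idle for a whole sched slice: lazily release the event */
		if (pmc->perf_event && pmc->hw_life_count == 0)
			intel_pmc_stop_counter(pmc);

		if (!intel_pmc_is_assigned(pmc))
			continue;

		/* reload the saved guest counter value and enable bits */
		intel_pmu_restore_guest_pmc(pmu, pmc->idx);

		/* a vPMC left disabled burns one unit of its lifetime */
		if (!(pmc->eventsel & ARCH_PERFMON_EVENTSEL_ENABLE))
			pmc->hw_life_count--;
	}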

=== vPMU Overhead Comparison ===

For the guest perf usage like "perf stat -e branches,cpu-cycles,\
L1-icache-load-misses,branch-load-misses,branch-loads,\
dTLB-load-misses ./ftest", here are some performance numbers which show the
improvement with this optimization (in nanoseconds) [1]:

(1) Basic operation latencies on the legacy Intel vPMU

kvm_pmu_rdpmc                               200
pmc_stop_counter: gp                        30,000
pmc_stop_counter: fixed                     2,000,000
perf_event_create_kernel_counter: gp        30,000,000 <== (mark as 3.1)
perf_event_create_kernel_counter: fixed     25,000

(2) Comparison of max guest behavior latency
                            legacy          v2
enable global_ctrl          57,000,000      17,000,000 <== (3.2)
disable global_ctrl         2,000,000       21,000
r/w fixed_ctrl              21,000          1,100
r/w eventsel                36,000          17,000
rdpmcl                      35,000          18,000
x86_pmu.handle_irq          3,500,000       8,800 <== (3.3)

(3) For 3.2, the v2 value is only the worst case for the initial
reprogram and quickly becomes negligible as perf_events are reused. In
general, this optimization is ~400 times faster (3.3) than the original
Intel vPMU, mainly due to the large reduction in the number of calls to
perf_event_create_kernel_counter (3.1).

(4) Comparison of guest behavior call counts
                            legacy                  v2
enable global_ctrl          74,000                  3,000 <== (6.1)
rd/wr fixed_ctrl            11,000                  1,400
rd/wr eventsel              7,000,000               7,600
rdpmcl                      130,000                 10,000
x86_pmu.handle_irq          11                      14

(5) Comparison of perf-attached thread guest context_switch latency
                            legacy          v2
context_switch, sched_in    350,000,000     4,000,000
context_switch, sched_out   55,000,000      200,000

(6) From 6.1 and table 5, we can see a substantial reduction in the
runtime of a perf-attached guest thread, and the vPMU no longer gets stuck.

=== vPMU Precision Comparison ===

We don't want to lose any precision with this optimization. For a perf
usage like "perf record -e cpu-cycles --all-user ./ftest", here is the
comparison of the profiling results with and without this optimization [1]:

(1) Test in Guest without optimization:

[ perf record: Woken up 2 times to write data ]
[ perf record: Captured and wrote 0.437 MB perf.data (5198 samples) ]

  36.95%  ftest    ftest          [.] qux
  15.68%  ftest    ftest          [.] foo
  15.45%  ftest    ftest          [.] bar
  12.32%  ftest    ftest          [.] main
   9.56%  ftest    libc-2.27.so   [.] __random
   8.87%  ftest    libc-2.27.so   [.] __random_r
   1.17%  ftest    ftest          [.] random@plt
   0.00%  ftest    ld-2.27.so     [.] _start

(2) Test in Guest with this optimization:

[ perf record: Woken up 4 times to write data ]
[ perf record: Captured and wrote 0.861 MB perf.data (22550 samples) ]

  36.64%  ftest    ftest             [.] qux
  14.35%  ftest    ftest             [.] foo
  14.07%  ftest    ftest             [.] bar
  12.60%  ftest    ftest             [.] main
  11.73%  ftest    libc-2.27.so      [.] __random
   9.18%  ftest    libc-2.27.so      [.] __random_r
   1.42%  ftest    ftest             [.] random@plt
   0.00%  ftest    ld-2.27.so        [.] do_lookup_x
   0.00%  ftest    ld-2.27.so        [.] _dl_new_object
   0.00%  ftest    ld-2.27.so        [.] _dl_sysdep_start
   0.00%  ftest    ld-2.27.so        [.] _start

(3) Test in Host:

[ perf record: Woken up 4 times to write data ]
[ perf record: Captured and wrote 0.789 MB perf.data (20652 samples) ]

  37.87%  ftest    ftest          [.] qux
  15.78%  ftest    ftest          [.] foo
  13.18%  ftest    ftest          [.] main
  12.14%  ftest    ftest          [.] bar
   9.85%  ftest    libc-2.17.so   [.] __random_r
   9.59%  ftest    libc-2.17.so   [.] __random
   1.59%  ftest    ftest          [.] random@plt
   0.00%  ftest    ld-2.17.so     [.] _dl_cache_libcmp
   0.00%  ftest    ld-2.17.so     [.] _dl_start
   0.00%  ftest    ld-2.17.so     [.] _start

=== NEXT ===

This proposal tries to respect the necessary functionality of the host
perf driver while bypassing the host perf subsystem software stack on
most execution paths, with no loss of precision compared to the legacy
implementation. If this proposal is acceptable, here are some things we
could do next:

(1) If host perf wants to perceive all the events for scheduling, some
    event hooks could be implemented to update host perf_event with the
    proper counts/runtimes/state.
(2) Loosen the scheduling restrictions on pinned events,
    but still keep an eye on special requests.
(3) This series currently covers the basic perf counter virtualization.
    Other features, such as PEBS, BTS and LBR, will come after this series.

There may be something wrong in the whole series; please help me reach
the other side of this performance improvement with your comments.

[1] Tested on Linux 5.0.0 on an Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz,
    with "nowatchdog" added to the host boot parameters. The values come
    from sched_clock() with tsc as the guest clocksource.
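
For reference, latencies like the ones above can be collected with
sched_clock() deltas around the operation of interest. A minimal sketch
of the assumed instrumentation (the exact probe points and the trace
message are illustrative, not part of this series):

	u64 t0 = sched_clock();

	/* the operation being measured, e.g. the event creation path */
	event = perf_event_create_kernel_counter(&attr, -1, current,
						 kvm_perf_overflow, pmc);

	trace_printk("create_kernel_counter: %llu ns\n", sched_clock() - t0);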

=== Changelog ===

v1: Wei Wang (8): https://lkml.org/lkml/2018/11/1/937
  perf/x86: add support to mask counters from host
  perf/x86/intel: add pmi callback support
  KVM/x86/vPMU: optimize intel vPMU
  KVM/x86/vPMU: support msr switch on vmx transitions
  KVM/x86/vPMU: intel_pmu_read_pmc
  KVM/x86/vPMU: remove some unused functions
  KVM/x86/vPMU: save/restore guest perf counters on vCPU switching
  KVM/x86/vPMU: return the counters to host if guest is torn down

v2: Like Xu (5):
  perf/x86: avoid host changing counter state for kvm_intel events holder
  KVM/x86/vPMU: add pmc operations for vmx and count to track release
  KVM/x86/vPMU: add Intel vPMC enable/disable and save/restore support
  KVM/x86/vPMU: add vCPU scheduling support for hw-assigned vPMC
  KVM/x86/vPMU: not do reprogram_counter for Intel hw-assigned vPMC

 arch/x86/events/core.c          |  37 ++++-
 arch/x86/events/intel/core.c    |   5 +-
 arch/x86/events/perf_event.h    |  13 +-
 arch/x86/include/asm/kvm_host.h |   2 +
 arch/x86/kvm/pmu.c              |  34 +++++
 arch/x86/kvm/pmu.h              |  22 +++
 arch/x86/kvm/vmx/pmu_intel.c    | 329 +++++++++++++++++++++++++++++++++++++---
 arch/x86/kvm/x86.c              |   6 +
 8 files changed, 421 insertions(+), 27 deletions(-)

-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [RFC] [PATCH v2 1/5] perf/x86: avoid host changing counter state for kvm_intel events holder
  2019-03-23 14:18 [RFC] [PATCH v2 0/5] Intel Virtual PMU Optimization Like Xu
@ 2019-03-23 14:18 ` Like Xu
  2019-03-23 14:18 ` [RFC] [PATCH v2 2/5] KVM/x86/vPMU: add pmc operations for vmx and count to track release Like Xu
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 13+ messages in thread
From: Like Xu @ 2019-03-23 14:18 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: like.xu, wei.w.wang, Andi Kleen, Peter Zijlstra, Kan Liang,
	Ingo Molnar, Paolo Bonzini

When a perf_event is used by the intel vPMU, the vPMU is responsible
for updating its event_base and config_base msrs. Only the write paths
are intercepted (reads are left unchanged), which lets perf_events run
as usual.

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Signed-off-by: Like Xu <like.xu@linux.intel.com>
---
 arch/x86/events/core.c       | 37 +++++++++++++++++++++++++++++++++----
 arch/x86/events/intel/core.c |  5 +++--
 arch/x86/events/perf_event.h | 13 +++++++++----
 3 files changed, 45 insertions(+), 10 deletions(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index e2b1447..d4b5fc0 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -1120,6 +1120,35 @@ static void x86_pmu_enable(struct pmu *pmu)
 static DEFINE_PER_CPU(u64 [X86_PMC_IDX_MAX], pmc_prev_left);
 
 /*
+ * If this is an event used by intel vPMU,
+ * intel_kvm_pmu would be responsible for updating the HW.
+ */
+void x86_perf_event_set_event_base(struct perf_event *event,
+	unsigned long val)
+{
+	if (event->attr.exclude_host &&
+			boot_cpu_data.x86_vendor == X86_VENDOR_INTEL)
+		return;
+
+	wrmsrl(event->hw.event_base, val);
+}
+
+void x86_perf_event_set_config_base(struct perf_event *event,
+	unsigned long val, bool set_extra_config)
+{
+	struct hw_perf_event *hwc = &event->hw;
+
+	if (event->attr.exclude_host &&
+			boot_cpu_data.x86_vendor == X86_VENDOR_INTEL)
+		return;
+
+	if (set_extra_config)
+		wrmsrl(hwc->extra_reg.reg, hwc->extra_reg.config);
+
+	wrmsrl(event->hw.config_base, val);
+}
+
+/*
  * Set the next IRQ period, based on the hwc->period_left value.
  * To be called with the event disabled in hw:
  */
@@ -1169,17 +1198,17 @@ int x86_perf_event_set_period(struct perf_event *event)
 	 */
 	local64_set(&hwc->prev_count, (u64)-left);
 
-	wrmsrl(hwc->event_base, (u64)(-left) & x86_pmu.cntval_mask);
+	x86_perf_event_set_event_base(event,
+		(u64)(-left) & x86_pmu.cntval_mask);
 
 	/*
 	 * Due to erratum on certan cpu we need
 	 * a second write to be sure the register
 	 * is updated properly
 	 */
-	if (x86_pmu.perfctr_second_write) {
-		wrmsrl(hwc->event_base,
+	if (x86_pmu.perfctr_second_write)
+		x86_perf_event_set_event_base(event,
 			(u64)(-left) & x86_pmu.cntval_mask);
-	}
 
 	perf_event_update_userpage(event);
 
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 8baa441..817257c 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -2061,6 +2061,7 @@ static inline void intel_pmu_ack_status(u64 ack)
 
 static void intel_pmu_disable_fixed(struct hw_perf_event *hwc)
 {
+	struct perf_event *event = container_of(hwc, struct perf_event, hw);
 	int idx = hwc->idx - INTEL_PMC_IDX_FIXED;
 	u64 ctrl_val, mask;
 
@@ -2068,7 +2069,7 @@ static void intel_pmu_disable_fixed(struct hw_perf_event *hwc)
 
 	rdmsrl(hwc->config_base, ctrl_val);
 	ctrl_val &= ~mask;
-	wrmsrl(hwc->config_base, ctrl_val);
+	x86_perf_event_set_config_base(event, ctrl_val, false);
 }
 
 static inline bool event_is_checkpointed(struct perf_event *event)
@@ -2148,7 +2149,7 @@ static void intel_pmu_enable_fixed(struct perf_event *event)
 	rdmsrl(hwc->config_base, ctrl_val);
 	ctrl_val &= ~mask;
 	ctrl_val |= bits;
-	wrmsrl(hwc->config_base, ctrl_val);
+	x86_perf_event_set_config_base(event, ctrl_val, false);
 }
 
 static void intel_pmu_enable_event(struct perf_event *event)
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index a759557..3029960 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -726,6 +726,11 @@ static inline bool x86_pmu_has_lbr_callstack(void)
 
 int x86_perf_event_set_period(struct perf_event *event);
 
+void x86_perf_event_set_config_base(struct perf_event *event,
+	unsigned long val, bool set_extra_config);
+void x86_perf_event_set_event_base(struct perf_event *event,
+	unsigned long val);
+
 /*
  * Generalized hw caching related hw_event table, filled
  * in on a per model basis. A value of 0 means
@@ -785,11 +790,11 @@ static inline int x86_pmu_rdpmc_index(int index)
 static inline void __x86_pmu_enable_event(struct hw_perf_event *hwc,
 					  u64 enable_mask)
 {
+	struct perf_event *event = container_of(hwc, struct perf_event, hw);
 	u64 disable_mask = __this_cpu_read(cpu_hw_events.perf_ctr_virt_mask);
 
-	if (hwc->extra_reg.reg)
-		wrmsrl(hwc->extra_reg.reg, hwc->extra_reg.config);
-	wrmsrl(hwc->config_base, (hwc->config | enable_mask) & ~disable_mask);
+	x86_perf_event_set_config_base(event,
+		(hwc->config | enable_mask) & ~disable_mask, true);
 }
 
 void x86_pmu_enable_all(int added);
@@ -804,7 +809,7 @@ static inline void x86_pmu_disable_event(struct perf_event *event)
 {
 	struct hw_perf_event *hwc = &event->hw;
 
-	wrmsrl(hwc->config_base, hwc->config);
+	x86_perf_event_set_config_base(event, hwc->config, false);
 }
 
 void x86_pmu_enable_event(struct perf_event *event);
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [RFC] [PATCH v2 2/5] KVM/x86/vPMU: add pmc operations for vmx and count to track release
  2019-03-23 14:18 [RFC] [PATCH v2 0/5] Intel Virtual PMU Optimization Like Xu
  2019-03-23 14:18 ` [RFC] [PATCH v2 1/5] perf/x86: avoid host changing counter state for kvm_intel events holder Like Xu
@ 2019-03-23 14:18 ` Like Xu
  2019-03-23 14:18 ` [RFC] [PATCH v2 3/5] KVM/x86/vPMU: add Intel vPMC enable/disable and save/restore support Like Xu
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 13+ messages in thread
From: Like Xu @ 2019-03-23 14:18 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: like.xu, wei.w.wang, Andi Kleen, Peter Zijlstra, Kan Liang,
	Ingo Molnar, Paolo Bonzini

The newly introduced hw_life_count is initialized to HW_LIFE_COUNT_MAX
when the vPMC holds a hw-assigned perf_event, and the kvm_pmu sched
context counts it down (0 means the event is to be released) whenever
the vPMC is not recharged.

If the vPMC is assigned, intel_pmc_read_counter() uses rdpmcl directly
instead of perf_event_read_value() and recharges hw_life_count to its
maximum.

To keep responsibility clear for kvm's potential operating space, this
patch does not invoke similar functions from host perf.

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Signed-off-by: Like Xu <like.xu@linux.intel.com>
---
 arch/x86/include/asm/kvm_host.h |  2 +
 arch/x86/kvm/vmx/pmu_intel.c    | 98 +++++++++++++++++++++++++++++++++++++++++
 2 files changed, 100 insertions(+)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index a5db447..2a2c78f2 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -449,6 +449,7 @@ enum pmc_type {
 	KVM_PMC_FIXED,
 };
 
+#define	HW_LIFE_COUNT_MAX	2
 struct kvm_pmc {
 	enum pmc_type type;
 	u8 idx;
@@ -456,6 +457,7 @@ struct kvm_pmc {
 	u64 eventsel;
 	struct perf_event *perf_event;
 	struct kvm_vcpu *vcpu;
+	int hw_life_count;
 };
 
 struct kvm_pmu {
diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index 5ab4a36..bb16031 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -35,6 +35,104 @@
 /* mapping between fixed pmc index and intel_arch_events array */
 static int fixed_pmc_events[] = {1, 0, 7};
 
+static bool intel_pmc_is_assigned(struct kvm_pmc *pmc)
+{
+	return pmc->perf_event != NULL &&
+		   pmc->perf_event->hw.idx != -1 &&
+		   pmc->perf_event->oncpu != -1;
+}
+
+static int intel_pmc_read_counter(struct kvm_vcpu *vcpu,
+	unsigned int idx, u64 *data)
+{
+	struct kvm_pmc *pmc = kvm_x86_ops->pmu_ops->msr_idx_to_pmc(vcpu, idx);
+
+	if (intel_pmc_is_assigned(pmc)) {
+		rdpmcl(pmc->perf_event->hw.event_base_rdpmc, *data);
+		pmc->counter = *data;
+		pmc->hw_life_count = HW_LIFE_COUNT_MAX;
+	} else {
+		*data = pmc->counter;
+	}
+	return 0;
+}
+
+static void intel_pmu_enable_host_gp_counter(struct kvm_pmc *pmc)
+{
+	u64 config;
+
+	if (!intel_pmc_is_assigned(pmc))
+		return;
+
+	config = (pmc->type == KVM_PMC_GP) ? pmc->eventsel :
+		pmc->perf_event->hw.config | ARCH_PERFMON_EVENTSEL_ENABLE;
+	wrmsrl(pmc->perf_event->hw.config_base, config);
+}
+
+static void intel_pmu_disable_host_gp_counter(struct kvm_pmc *pmc)
+{
+	if (!intel_pmc_is_assigned(pmc))
+		return;
+
+	wrmsrl(pmc->perf_event->hw.config_base, 0);
+}
+
+static void intel_pmu_enable_host_fixed_counter(struct kvm_pmc *pmc)
+{
+	struct kvm_pmu *pmu = vcpu_to_pmu(pmc->vcpu);
+	int host_idx = pmc->perf_event->hw.idx - INTEL_PMC_IDX_FIXED;
+	u64 ctrl_val, mask, bits = 0;
+
+	if (!intel_pmc_is_assigned(pmc))
+		return;
+
+	if (!pmc->perf_event->attr.precise_ip)
+		bits |= 0x8;
+	if (pmc->perf_event->hw.config & ARCH_PERFMON_EVENTSEL_USR)
+		bits |= 0x2;
+	if (pmc->perf_event->hw.config & ARCH_PERFMON_EVENTSEL_OS)
+		bits |= 0x1;
+
+	if (pmu->version > 2
+		&& (pmc->perf_event->hw.config & ARCH_PERFMON_EVENTSEL_ANY))
+		bits |= 0x4;
+
+	bits <<= (host_idx * 4);
+	mask = 0xfULL << (host_idx * 4);
+
+	rdmsrl(pmc->perf_event->hw.config_base, ctrl_val);
+	ctrl_val &= ~mask;
+	ctrl_val |= bits;
+	wrmsrl(pmc->perf_event->hw.config_base, ctrl_val);
+}
+
+static void intel_pmu_disable_host_fixed_counter(struct kvm_pmc *pmc)
+{
+	u64 ctrl_val, mask = 0;
+	u8 host_idx;
+
+	if (!intel_pmc_is_assigned(pmc))
+		return;
+
+	host_idx = pmc->perf_event->hw.idx - INTEL_PMC_IDX_FIXED;
+	mask = 0xfULL << (host_idx * 4);
+	rdmsrl(pmc->perf_event->hw.config_base, ctrl_val);
+	ctrl_val &= ~mask;
+	wrmsrl(pmc->perf_event->hw.config_base, ctrl_val);
+}
+
+static void intel_pmu_update_host_fixed_ctrl(u64 new_ctrl, u8 host_idx)
+{
+	u64 host_ctrl, mask;
+
+	rdmsrl(MSR_ARCH_PERFMON_FIXED_CTR_CTRL, host_ctrl);
+	mask = 0xfULL << (host_idx * 4);
+	host_ctrl &= ~mask;
+	new_ctrl <<= (host_idx * 4);
+	host_ctrl |= new_ctrl;
+	wrmsrl(MSR_ARCH_PERFMON_FIXED_CTR_CTRL, host_ctrl);
+}
+
 static void reprogram_fixed_counters(struct kvm_pmu *pmu, u64 data)
 {
 	int i;
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [RFC] [PATCH v2 3/5] KVM/x86/vPMU: add Intel vPMC enable/disable and save/restore support
  2019-03-23 14:18 [RFC] [PATCH v2 0/5] Intel Virtual PMU Optimization Like Xu
  2019-03-23 14:18 ` [RFC] [PATCH v2 1/5] perf/x86: avoid host changing counter state for kvm_intel events holder Like Xu
  2019-03-23 14:18 ` [RFC] [PATCH v2 2/5] KVM/x86/vPMU: add pmc operations for vmx and count to track release Like Xu
@ 2019-03-23 14:18 ` Like Xu
  2019-03-23 14:18 ` [RFC] [PATCH v2 4/5] KVM/x86/vPMU: add vCPU scheduling support for hw-assigned vPMC Like Xu
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 13+ messages in thread
From: Like Xu @ 2019-03-23 14:18 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: like.xu, wei.w.wang, Andi Kleen, Peter Zijlstra, Kan Liang,
	Ingo Molnar, Paolo Bonzini

We may not assume that a guest fixed vPMC will be assigned a host fixed
counter, or vice versa. This case (the host hw->idx has a different type
from the guest idx) is called cross-mapping, and the semantics of mask
select and enable ctrl need to be kept for it.

Signed-off-by: Like Xu <like.xu@linux.intel.com>
---
 arch/x86/kvm/vmx/pmu_intel.c | 92 +++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 87 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index bb16031..0b69acc 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -133,6 +133,34 @@ static void intel_pmu_update_host_fixed_ctrl(u64 new_ctrl, u8 host_idx)
 	wrmsrl(MSR_ARCH_PERFMON_FIXED_CTR_CTRL, host_ctrl);
 }
 
+static void intel_pmu_enable_host_counter(struct kvm_pmc *pmc)
+{
+	u8 host_idx;
+
+	if (!intel_pmc_is_assigned(pmc))
+		return;
+
+	host_idx = pmc->perf_event->hw.idx;
+	if (host_idx >= INTEL_PMC_IDX_FIXED)
+		intel_pmu_enable_host_fixed_counter(pmc);
+	else
+		intel_pmu_enable_host_gp_counter(pmc);
+}
+
+static void intel_pmu_disable_host_counter(struct kvm_pmc *pmc)
+{
+	u8 host_idx;
+
+	if (!intel_pmc_is_assigned(pmc))
+		return;
+
+	host_idx = pmc->perf_event->hw.idx;
+	if (host_idx >= INTEL_PMC_IDX_FIXED)
+		intel_pmu_disable_host_fixed_counter(pmc);
+	else
+		intel_pmu_disable_host_gp_counter(pmc);
+}
+
 static void reprogram_fixed_counters(struct kvm_pmu *pmu, u64 data)
 {
 	int i;
@@ -262,6 +290,57 @@ static bool intel_is_valid_msr(struct kvm_vcpu *vcpu, u32 msr)
 	return ret;
 }
 
+static void intel_pmu_save_guest_pmc(struct kvm_pmu *pmu, u32 idx)
+{
+	struct kvm_pmc *pmc = intel_pmc_idx_to_pmc(pmu, idx);
+
+	if (!intel_pmc_is_assigned(pmc))
+		return;
+
+	rdmsrl(pmc->perf_event->hw.event_base, pmc->counter);
+	wrmsrl(pmc->perf_event->hw.event_base, 0);
+}
+
+static void intel_pmu_restore_guest_pmc(struct kvm_pmu *pmu, u32 idx)
+{
+	struct kvm_pmc *pmc = intel_pmc_idx_to_pmc(pmu, idx);
+	u8 ctrl;
+
+	if (!intel_pmc_is_assigned(pmc))
+		return;
+
+	if (pmc->idx >= INTEL_PMC_IDX_FIXED) {
+		ctrl = fixed_ctrl_field(pmu->fixed_ctr_ctrl,
+			pmc->idx - INTEL_PMC_IDX_FIXED);
+		if (ctrl)
+			intel_pmu_enable_host_counter(pmc);
+		else
+			intel_pmu_disable_host_counter(pmc);
+	} else {
+		if (!(pmc->eventsel & ARCH_PERFMON_EVENTSEL_ENABLE))
+			intel_pmu_disable_host_counter(pmc);
+		else
+			intel_pmu_enable_host_counter(pmc);
+	}
+
+	wrmsrl(pmc->perf_event->hw.event_base, pmc->counter);
+}
+
+static void intel_pmc_stop_counter(struct kvm_pmc *pmc)
+{
+	struct kvm_pmu *pmu = pmc_to_pmu(pmc);
+
+	if (!pmc->perf_event)
+		return;
+
+	intel_pmu_disable_host_counter(pmc);
+	intel_pmu_save_guest_pmc(pmu, pmc->idx);
+	pmc_read_counter(pmc);
+	perf_event_release_kernel(pmc->perf_event);
+	pmc->perf_event = NULL;
+	pmc->hw_life_count = 0;
+}
+
 static int intel_pmu_get_msr(struct kvm_vcpu *vcpu, u32 msr, u64 *data)
 {
 	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
@@ -424,17 +503,20 @@ static void intel_pmu_init(struct kvm_vcpu *vcpu)
 static void intel_pmu_reset(struct kvm_vcpu *vcpu)
 {
 	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+	struct kvm_pmc *pmc;
 	int i;
 
 	for (i = 0; i < INTEL_PMC_MAX_GENERIC; i++) {
-		struct kvm_pmc *pmc = &pmu->gp_counters[i];
-
-		pmc_stop_counter(pmc);
+		pmc = &pmu->gp_counters[i];
+		intel_pmc_stop_counter(pmc);
 		pmc->counter = pmc->eventsel = 0;
 	}
 
-	for (i = 0; i < INTEL_PMC_MAX_FIXED; i++)
-		pmc_stop_counter(&pmu->fixed_counters[i]);
+	for (i = 0; i < INTEL_PMC_MAX_FIXED; i++) {
+		pmc = &pmu->fixed_counters[i];
+		intel_pmc_stop_counter(pmc);
+		pmc->counter = 0;
+	}
 
 	pmu->fixed_ctr_ctrl = pmu->global_ctrl = pmu->global_status =
 		pmu->global_ovf_ctrl = 0;
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [RFC] [PATCH v2 4/5] KVM/x86/vPMU: add vCPU scheduling support for hw-assigned vPMC
  2019-03-23 14:18 [RFC] [PATCH v2 0/5] Intel Virtual PMU Optimization Like Xu
                   ` (2 preceding siblings ...)
  2019-03-23 14:18 ` [RFC] [PATCH v2 3/5] KVM/x86/vPMU: add Intel vPMC enable/disable and save/restore support Like Xu
@ 2019-03-23 14:18 ` Like Xu
  2019-03-23 14:18 ` [RFC] [PATCH v2 5/5] KVM/x86/vPMU: not do reprogram_counter for Intel " Like Xu
  2019-03-23 17:28 ` [RFC] [PATCH v2 0/5] Intel Virtual PMU Optimization Peter Zijlstra
  5 siblings, 0 replies; 13+ messages in thread
From: Like Xu @ 2019-03-23 14:18 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: like.xu, wei.w.wang, Andi Kleen, Peter Zijlstra, Kan Liang,
	Ingo Molnar, Paolo Bonzini

This patch dispatches the generic vPMU requests to the intel ones. In
the vCPU scheduling context, it saves and restores the state of in-use
assigned counters without interference.

intel_pmu_sched_in() releases the event if its hw_life_count has counted
down to zero; if the vPMC is disabled, it is considered no longer in use
and its hw_life_count is decreased by one.

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Signed-off-by: Like Xu <like.xu@linux.intel.com>
---
 arch/x86/kvm/pmu.c           | 15 ++++++++
 arch/x86/kvm/pmu.h           | 22 ++++++++++++
 arch/x86/kvm/vmx/pmu_intel.c | 81 ++++++++++++++++++++++++++++++++++++++++++++
 arch/x86/kvm/x86.c           |  6 ++++
 4 files changed, 124 insertions(+)

diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index 58ead7d..672e268 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -284,6 +284,9 @@ int kvm_pmu_rdpmc(struct kvm_vcpu *vcpu, unsigned idx, u64 *data)
 	struct kvm_pmc *pmc;
 	u64 ctr_val;
 
+	if (kvm_x86_ops->pmu_ops->pmc_read_counter)
+		return kvm_x86_ops->pmu_ops->pmc_read_counter(vcpu, idx, data);
+
 	if (is_vmware_backdoor_pmc(idx))
 		return kvm_pmu_rdpmc_vmware(vcpu, idx, data);
 
@@ -337,6 +340,18 @@ void kvm_pmu_reset(struct kvm_vcpu *vcpu)
 	kvm_x86_ops->pmu_ops->reset(vcpu);
 }
 
+void kvm_pmu_sched_out(struct kvm_vcpu *vcpu)
+{
+	if (kvm_x86_ops->pmu_ops->sched_out)
+		kvm_x86_ops->pmu_ops->sched_out(vcpu);
+}
+
+void kvm_pmu_sched_in(struct kvm_vcpu *vcpu)
+{
+	if (kvm_x86_ops->pmu_ops->sched_in)
+		kvm_x86_ops->pmu_ops->sched_in(vcpu);
+}
+
 void kvm_pmu_init(struct kvm_vcpu *vcpu)
 {
 	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
index ba8898e..de68ff0 100644
--- a/arch/x86/kvm/pmu.h
+++ b/arch/x86/kvm/pmu.h
@@ -33,6 +33,12 @@ struct kvm_pmu_ops {
 	void (*refresh)(struct kvm_vcpu *vcpu);
 	void (*init)(struct kvm_vcpu *vcpu);
 	void (*reset)(struct kvm_vcpu *vcpu);
+	bool (*pmc_is_assigned)(struct kvm_pmc *pmc);
+	void (*pmc_stop_counter)(struct kvm_pmc *pmc);
+	int (*pmc_read_counter)(struct kvm_vcpu *vcpu,
+		unsigned int idx, u64 *data);
+	void (*sched_out)(struct kvm_vcpu *vcpu);
+	void (*sched_in)(struct kvm_vcpu *vcpu);
 };
 
 static inline u64 pmc_bitmask(struct kvm_pmc *pmc)
@@ -54,8 +60,22 @@ static inline u64 pmc_read_counter(struct kvm_pmc *pmc)
 	return counter & pmc_bitmask(pmc);
 }
 
+static inline bool pmc_is_assigned(struct kvm_pmc *pmc)
+{
+	if (kvm_x86_ops->pmu_ops->pmc_is_assigned)
+		return kvm_x86_ops->pmu_ops->pmc_is_assigned(pmc);
+
+	return false;
+}
+
 static inline void pmc_stop_counter(struct kvm_pmc *pmc)
 {
+	if (kvm_x86_ops->pmu_ops->pmc_stop_counter) {
+		if (pmc_is_assigned(pmc))
+			rdmsrl(pmc->perf_event->hw.event_base, pmc->counter);
+		return;
+	}
+
 	if (pmc->perf_event) {
 		pmc->counter = pmc_read_counter(pmc);
 		perf_event_release_kernel(pmc->perf_event);
@@ -117,6 +137,8 @@ static inline struct kvm_pmc *get_fixed_pmc(struct kvm_pmu *pmu, u32 msr)
 void kvm_pmu_reset(struct kvm_vcpu *vcpu);
 void kvm_pmu_init(struct kvm_vcpu *vcpu);
 void kvm_pmu_destroy(struct kvm_vcpu *vcpu);
+void kvm_pmu_sched_out(struct kvm_vcpu *vcpu);
+void kvm_pmu_sched_in(struct kvm_vcpu *vcpu);
 
 bool is_vmware_backdoor_pmc(u32 pmc_idx);
 
diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index 0b69acc..63e00ea 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -522,6 +522,82 @@ static void intel_pmu_reset(struct kvm_vcpu *vcpu)
 		pmu->global_ovf_ctrl = 0;
 }
 
+static void intel_pmu_sched_out(struct kvm_vcpu *vcpu)
+{
+	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+	struct kvm_pmc *pmc;
+	int i;
+
+	for (i = 0; i < INTEL_PMC_MAX_GENERIC; i++) {
+		pmc = &pmu->gp_counters[i];
+		intel_pmu_disable_host_counter(pmc);
+		intel_pmu_save_guest_pmc(pmu, pmc->idx);
+	}
+
+	for (i = 0; i < INTEL_PMC_MAX_FIXED; i++) {
+		pmc = &pmu->fixed_counters[i];
+		intel_pmu_disable_host_counter(pmc);
+		intel_pmu_save_guest_pmc(pmu, pmc->idx);
+	}
+}
+
+static void intel_pmu_sched_in(struct kvm_vcpu *vcpu)
+{
+	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+	struct kvm_pmc *pmc;
+	struct hw_perf_event *hwc;
+	u64 host_ctrl, test, disabled_ctrl_val = 0;
+	int i;
+
+	for (i = 0; i < INTEL_PMC_MAX_GENERIC; i++) {
+		pmc = &pmu->gp_counters[i];
+
+		if (pmc->perf_event && pmc->hw_life_count == 0)
+			intel_pmc_stop_counter(pmc);
+
+		if (!intel_pmc_is_assigned(pmc))
+			continue;
+
+		intel_pmu_restore_guest_pmc(pmu, pmc->idx);
+
+		hwc = &pmc->perf_event->hw;
+		if (hwc->idx >= INTEL_PMC_IDX_FIXED) {
+			u64 mask = 0xfULL <<
+				((hwc->idx - INTEL_PMC_IDX_FIXED) * 4);
+			disabled_ctrl_val &= ~mask;
+			rdmsrl(hwc->config_base, host_ctrl);
+			if (disabled_ctrl_val == host_ctrl)
+				pmc->hw_life_count--;
+		} else if (!(pmc->eventsel & ARCH_PERFMON_EVENTSEL_ENABLE)) {
+			pmc->hw_life_count--;
+		}
+	}
+
+	for (i = 0; i < INTEL_PMC_MAX_FIXED; i++) {
+		pmc = &pmu->fixed_counters[i];
+
+		if (pmc->perf_event && pmc->hw_life_count == 0)
+			intel_pmc_stop_counter(pmc);
+
+		if (!intel_pmc_is_assigned(pmc))
+			continue;
+
+		intel_pmu_restore_guest_pmc(pmu, pmc->idx);
+
+		hwc = &pmc->perf_event->hw;
+		if (hwc->idx < INTEL_PMC_IDX_FIXED) {
+			rdmsrl(hwc->config_base, test);
+			if (!(test & ARCH_PERFMON_EVENTSEL_ENABLE))
+				pmc->hw_life_count--;
+		} else {
+			u8 ctrl = fixed_ctrl_field(pmu->fixed_ctr_ctrl,
+				pmc->idx - INTEL_PMC_IDX_FIXED);
+			if (ctrl == 0)
+				pmc->hw_life_count--;
+		}
+	}
+}
+
 struct kvm_pmu_ops intel_pmu_ops = {
 	.find_arch_event = intel_find_arch_event,
 	.find_fixed_event = intel_find_fixed_event,
@@ -535,4 +611,9 @@ struct kvm_pmu_ops intel_pmu_ops = {
 	.refresh = intel_pmu_refresh,
 	.init = intel_pmu_init,
 	.reset = intel_pmu_reset,
+	.pmc_is_assigned = intel_pmc_is_assigned,
+	.pmc_stop_counter = intel_pmc_stop_counter,
+	.pmc_read_counter = intel_pmc_read_counter,
+	.sched_out = intel_pmu_sched_out,
+	.sched_in = intel_pmu_sched_in,
 };
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 65e4559..f9c715b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9100,9 +9100,15 @@ void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu)
 		static_key_slow_dec(&kvm_no_apic_vcpu);
 }
 
+void kvm_arch_sched_out(struct kvm_vcpu *vcpu)
+{
+	kvm_pmu_sched_out(vcpu);
+}
+
 void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu)
 {
 	vcpu->arch.l1tf_flush_l1d = true;
+	kvm_pmu_sched_in(vcpu);
 	kvm_x86_ops->sched_in(vcpu, cpu);
 }
 
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [RFC] [PATCH v2 5/5] KVM/x86/vPMU: not do reprogram_counter for Intel hw-assigned vPMC
  2019-03-23 14:18 [RFC] [PATCH v2 0/5] Intel Virtual PMU Optimization Like Xu
                   ` (3 preceding siblings ...)
  2019-03-23 14:18 ` [RFC] [PATCH v2 4/5] KVM/x86/vPMU: add vCPU scheduling support for hw-assigned vPMC Like Xu
@ 2019-03-23 14:18 ` Like Xu
  2019-03-23 17:28 ` [RFC] [PATCH v2 0/5] Intel Virtual PMU Optimization Peter Zijlstra
  5 siblings, 0 replies; 13+ messages in thread
From: Like Xu @ 2019-03-23 14:18 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: like.xu, wei.w.wang, Andi Kleen, Peter Zijlstra, Kan Liang,
	Ingo Molnar, Paolo Bonzini

Considering the cross-mapping issue, this patch passes the
intel_pmu_set_msr request value directly to the hw-assigned vPMC.

With this patch, a counter is reprogrammed through the host perf
scheduler only once, when it is first requested, and is then reused for
a certain period of time until it is lazily released, which is governed
by HW_LIFE_COUNT_MAX and the scheduling time slice.

Signed-off-by: Like Xu <like.xu@linux.intel.com>
---
 arch/x86/kvm/pmu.c           | 19 +++++++++++++++
 arch/x86/kvm/vmx/pmu_intel.c | 58 +++++++++++++++++++++++++++++++++++---------
 2 files changed, 65 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index 672e268..d7e7fb6 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -137,6 +137,11 @@ static void pmc_reprogram_counter(struct kvm_pmc *pmc, u32 type,
 	}
 
 	pmc->perf_event = event;
+	if (pmc_is_assigned(pmc)) {
+		pmc->hw_life_count = HW_LIFE_COUNT_MAX;
+		wrmsrl(pmc->perf_event->hw.event_base, pmc->counter);
+	}
+
 	clear_bit(pmc->idx, (unsigned long*)&pmc_to_pmu(pmc)->reprogram_pmi);
 }
 
@@ -155,6 +160,13 @@ void reprogram_gp_counter(struct kvm_pmc *pmc, u64 eventsel)
 	if (!(eventsel & ARCH_PERFMON_EVENTSEL_ENABLE) || !pmc_is_enabled(pmc))
 		return;
 
+	if (pmc_is_assigned(pmc)) {
+		pmc->hw_life_count = HW_LIFE_COUNT_MAX;
+		clear_bit(pmc->idx,
+			(unsigned long *)&pmc_to_pmu(pmc)->reprogram_pmi);
+		return;
+	}
+
 	event_select = eventsel & ARCH_PERFMON_EVENTSEL_EVENT;
 	unit_mask = (eventsel & ARCH_PERFMON_EVENTSEL_UMASK) >> 8;
 
@@ -192,6 +204,13 @@ void reprogram_fixed_counter(struct kvm_pmc *pmc, u8 ctrl, int idx)
 	if (!en_field || !pmc_is_enabled(pmc))
 		return;
 
+	if (pmc_is_assigned(pmc)) {
+		pmc->hw_life_count = HW_LIFE_COUNT_MAX;
+		clear_bit(pmc->idx,
+			(unsigned long *)&pmc_to_pmu(pmc)->reprogram_pmi);
+		return;
+	}
+
 	pmc_reprogram_counter(pmc, PERF_TYPE_HARDWARE,
 			      kvm_x86_ops->pmu_ops->find_fixed_event(idx),
 			      !(en_field & 0x2), /* exclude user */
diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index 63e00ea..2dfdf54 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -163,12 +163,13 @@ static void intel_pmu_disable_host_counter(struct kvm_pmc *pmc)
 
 static void reprogram_fixed_counters(struct kvm_pmu *pmu, u64 data)
 {
+	struct hw_perf_event *hwc;
+	struct kvm_pmc *pmc;
 	int i;
 
 	for (i = 0; i < pmu->nr_arch_fixed_counters; i++) {
 		u8 new_ctrl = fixed_ctrl_field(data, i);
 		u8 old_ctrl = fixed_ctrl_field(pmu->fixed_ctr_ctrl, i);
-		struct kvm_pmc *pmc;
 
 		pmc = get_fixed_pmc(pmu, MSR_CORE_PERF_FIXED_CTR0 + i);
 
@@ -176,6 +177,19 @@ static void reprogram_fixed_counters(struct kvm_pmu *pmu, u64 data)
 			continue;
 
 		reprogram_fixed_counter(pmc, new_ctrl, i);
+
+		if (!intel_pmc_is_assigned(pmc))
+			continue;
+
+		hwc = &pmc->perf_event->hw;
+		if (hwc->idx < INTEL_PMC_IDX_FIXED) {
+			u64 config = (new_ctrl == 0) ? 0 :
+				(hwc->config | ARCH_PERFMON_EVENTSEL_ENABLE);
+			wrmsrl(hwc->config_base, config);
+		} else {
+			intel_pmu_update_host_fixed_ctrl(new_ctrl,
+				hwc->idx - INTEL_PMC_IDX_FIXED);
+		}
 	}
 
 	pmu->fixed_ctr_ctrl = data;
@@ -345,6 +359,7 @@ static int intel_pmu_get_msr(struct kvm_vcpu *vcpu, u32 msr, u64 *data)
 {
 	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
 	struct kvm_pmc *pmc;
+	struct hw_perf_event *hwc;
 
 	switch (msr) {
 	case MSR_CORE_PERF_FIXED_CTR_CTRL:
@@ -362,7 +377,13 @@ static int intel_pmu_get_msr(struct kvm_vcpu *vcpu, u32 msr, u64 *data)
 	default:
 		if ((pmc = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0)) ||
 		    (pmc = get_fixed_pmc(pmu, msr))) {
-			*data = pmc_read_counter(pmc);
+			if (intel_pmc_is_assigned(pmc)) {
+				hwc = &pmc->perf_event->hw;
+				rdmsrl_safe(hwc->event_base, data);
+				pmc->counter = *data;
+			} else {
+				*data = pmc->counter;
+			}
 			return 0;
 		} else if ((pmc = get_gp_pmc(pmu, msr, MSR_P6_EVNTSEL0))) {
 			*data = pmc->eventsel;
@@ -377,6 +398,7 @@ static int intel_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 {
 	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
 	struct kvm_pmc *pmc;
+	struct hw_perf_event *hwc;
 	u32 msr = msr_info->index;
 	u64 data = msr_info->data;
 
@@ -414,18 +436,30 @@ static int intel_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 	default:
 		if ((pmc = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0)) ||
 		    (pmc = get_fixed_pmc(pmu, msr))) {
-			if (!msr_info->host_initiated)
-				data = (s64)(s32)data;
-			pmc->counter += data - pmc_read_counter(pmc);
-			return 0;
-		} else if ((pmc = get_gp_pmc(pmu, msr, MSR_P6_EVNTSEL0))) {
-			if (data == pmc->eventsel)
-				return 0;
-			if (!(data & pmu->reserved_bits)) {
-				reprogram_gp_counter(pmc, data);
-				return 0;
+			pmc->counter = data;
+			if (intel_pmc_is_assigned(pmc)) {
+				hwc = &pmc->perf_event->hw;
+				wrmsrl(hwc->event_base, pmc->counter);
 			}
+			return 0;
 		}
+
+		pmc = get_gp_pmc(pmu, msr, MSR_P6_EVNTSEL0);
+		if (!pmc)
+			return 1;
+
+		if (data == pmc->eventsel
+				|| (data & pmu->reserved_bits))
+			return 0;
+
+		reprogram_gp_counter(pmc, data);
+
+		if (pmc->eventsel & ARCH_PERFMON_EVENTSEL_ENABLE)
+			intel_pmu_enable_host_counter(pmc);
+		else
+			intel_pmu_disable_host_counter(pmc);
+
+		return 0;
 	}
 
 	return 1;
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [RFC] [PATCH v2 0/5] Intel Virtual PMU Optimization
  2019-03-23 14:18 [RFC] [PATCH v2 0/5] Intel Virtual PMU Optimization Like Xu
                   ` (4 preceding siblings ...)
  2019-03-23 14:18 ` [RFC] [PATCH v2 5/5] KVM/x86/vPMU: not do reprogram_counter for Intel " Like Xu
@ 2019-03-23 17:28 ` Peter Zijlstra
  2019-03-23 23:15   ` Andi Kleen
  2019-03-25  6:47   ` Like Xu
  5 siblings, 2 replies; 13+ messages in thread
From: Peter Zijlstra @ 2019-03-23 17:28 UTC (permalink / raw)
  To: Like Xu
  Cc: linux-kernel, kvm, like.xu, wei.w.wang, Andi Kleen, Kan Liang,
	Ingo Molnar, Paolo Bonzini, Thomas Gleixner

On Sat, Mar 23, 2019 at 10:18:03PM +0800, Like Xu wrote:
> === Brief description ===
> 
> This proposal for Intel vPMU is still committed to optimize the basic
> functionality by reducing the PMU virtualization overhead and not a blind
> pass-through of the PMU. The proposal applies to existing models, in short,
> is "host perf would hand over control to kvm after counter allocation".
> 
> The pmc_reprogram_counter is a heavyweight and high frequency operation
> which goes through the host perf software stack to create a perf event for
> counter assignment, this could take millions of nanoseconds. The current
> vPMU always does reprogram_counter when the guest changes the eventsel,
> fixctrl, and global_ctrl msrs. This brings too much overhead to the usage
> of perf inside the guest, especially the guest PMI handling and context
> switching of guest threads with perf in use.

I think I asked for starting with making pmc_reprogram_counter() less
retarded. I'm not seeing that here.

> We optimize the current vPMU to work in this manner:
> 
> (1) rely on the existing host perf (perf_event_create_kernel_counter)
>     to allocate counters for in-use vPMC and always try to reuse events;
> (2) vPMU captures guest accesses to the eventsel and fixctrl msr directly
>     to the hardware msr that the corresponding host event is scheduled on
>     and avoid pollution from host is also needed in its partial runtime;

If you do pass-through; how do you deal with event constraints?

> (3) save and restore the counter state during vCPU scheduling in hooks;
> (4) apply a lazy approach to release the vPMC's perf event. That is, if
>     the vPMC isn't used in a fixed sched slice, its event will be released.
> 
> In the use of vPMC, the vPMU always focus on the assigned resources and
> guest perf would significantly benefit from direct access to hardware and
> may not care about runtime state of perf_event created by host and always
> try not to pay for their maintenance. However to avoid events entering into
> any unexpected state, calling pmc_read_counter in appropriate is necessary.

what?!

I can't follow that, and the quick look I had at the patches doesn't
seem to help. I did note it is intel only and that is really sad.

It also makes a mess of who programs what msr when.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC] [PATCH v2 0/5] Intel Virtual PMU Optimization
  2019-03-23 17:28 ` [RFC] [PATCH v2 0/5] Intel Virtual PMU Optimization Peter Zijlstra
@ 2019-03-23 23:15   ` Andi Kleen
  2019-03-25  6:07     ` Like Xu
  2019-03-25  6:47   ` Like Xu
  1 sibling, 1 reply; 13+ messages in thread
From: Andi Kleen @ 2019-03-23 23:15 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Like Xu, linux-kernel, kvm, like.xu, wei.w.wang, Kan Liang,
	Ingo Molnar, Paolo Bonzini, Thomas Gleixner

> > We optimize the current vPMU to work in this manner:
> > 
> > (1) rely on the existing host perf (perf_event_create_kernel_counter)
> >     to allocate counters for in-use vPMC and always try to reuse events;
> > (2) vPMU captures guest accesses to the eventsel and fixctrl msr directly
> >     to the hardware msr that the corresponding host event is scheduled on
> >     and avoid pollution from host is also needed in its partial runtime;
> 
> If you do pass-through; how do you deal with event constraints?

The guest has to deal with them. It already needs to know
the model number to program the right events, can as well know
the constraints too.

For architectural events that don't need the model number it's
not a problem because they don't have constraints.

-Andi

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC] [PATCH v2 0/5] Intel Virtual PMU Optimization
  2019-03-23 23:15   ` Andi Kleen
@ 2019-03-25  6:07     ` Like Xu
  0 siblings, 0 replies; 13+ messages in thread
From: Like Xu @ 2019-03-25  6:07 UTC (permalink / raw)
  To: Andi Kleen, Peter Zijlstra
  Cc: linux-kernel, kvm, like.xu, wei.w.wang, Kan Liang, Ingo Molnar,
	Paolo Bonzini, Thomas Gleixner

On 2019/3/24 7:15, Andi Kleen wrote:
>>> We optimize the current vPMU to work in this manner:
>>>
>>> (1) rely on the existing host perf (perf_event_create_kernel_counter)
>>>      to allocate counters for in-use vPMC and always try to reuse events;
>>> (2) vPMU captures guest accesses to the eventsel and fixctrl msr directly
>>>      to the hardware msr that the corresponding host event is scheduled on
>>>      and avoid pollution from host is also needed in its partial runtime;
>>
>> If you do pass-through; how do you deal with event constraints?
> 
> The guest has to deal with them. It already needs to know
> the model number to program the right events, can as well know
> the constraints too.
> 
> For architectural events that don't need the model number it's
> not a problem because they don't have constraints.
> 
> -Andi
> 

I agree that this version does not deliberately keep an eye on host perf
event constraints:

1. Based on my limited knowledge, assuming the model number means hwc->idx.
2. The guest event constraints would be constructed into the
hwc->config_base value, which is pmc->eventsel and pmu->fixed_ctr_ctrl
from the KVM point of view.
3. The guest PMU has the same semantic model of virt hardware limitations
as the host does with the real PMU (the related CPUID/PERF_MSR bits
expose this part of the information to the guest).
4. The guest perf scheduler would make sure the guest event constraints
dance with the right guest model number.
5. The vPMU would make sure the guest vPMC gets the right guest model
number by hard-coding EVENT_PINNED, or just fails the creation.
6. This patch directly applies the guest hwc->config_base value to the
host-assigned hardware without consent from host perf (a bit deceptive,
but practical for reducing the number of reprogram calls).

=== OR ====

If we insist on passing guest event constraints to host perf,
this proposal may need the following changes:

Because the guest configuration of hwc->config_base mostly only toggles
the enable bit of eventsel or fixctrl, it is not necessary to do
reprogram_counter, because it is serving the same guest perf event.

The event creation is only needed when the guest writes a completely new
value to eventsel or fixctrl. The code for the guest MSR_P6_EVNTSEL0
trap, for example, may be modified to be like this:

	u64 diff = pmc->eventsel ^ data;
	if (intel_pmc_is_assigned(pmc)
		&& diff	!= ARCH_PERFMON_EVENTSEL_ENABLE) {
		intel_pmu_save_guest_pmc(pmu, pmc->idx);
		intel_pmc_stop_counter(pmc);
	}
	reprogram_gp_counter(pmc, data);

Does this seem to satisfy our needs?


Please correct me if I'm wrong; that would make everything easier.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC] [PATCH v2 0/5] Intel Virtual PMU Optimization
  2019-03-23 17:28 ` [RFC] [PATCH v2 0/5] Intel Virtual PMU Optimization Peter Zijlstra
  2019-03-23 23:15   ` Andi Kleen
@ 2019-03-25  6:47   ` Like Xu
  2019-03-25  7:19     ` Peter Zijlstra
  1 sibling, 1 reply; 13+ messages in thread
From: Like Xu @ 2019-03-25  6:47 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, kvm, like.xu, wei.w.wang, Andi Kleen, Kan Liang,
	Ingo Molnar, Paolo Bonzini, Thomas Gleixner

On 2019/3/24 1:28, Peter Zijlstra wrote:
> On Sat, Mar 23, 2019 at 10:18:03PM +0800, Like Xu wrote:
>> === Brief description ===
>>
>> This proposal for Intel vPMU is still committed to optimize the basic
>> functionality by reducing the PMU virtualization overhead and not a blind
>> pass-through of the PMU. The proposal applies to existing models, in short,
>> is "host perf would hand over control to kvm after counter allocation".
>>
>> The pmc_reprogram_counter is a heavyweight and high frequency operation
>> which goes through the host perf software stack to create a perf event for
>> counter assignment, this could take millions of nanoseconds. The current
>> vPMU always does reprogram_counter when the guest changes the eventsel,
>> fixctrl, and global_ctrl msrs. This brings too much overhead to the usage
>> of perf inside the guest, especially the guest PMI handling and context
>> switching of guest threads with perf in use.
> 
> I think I asked for starting with making pmc_reprogram_counter() less
> retarded. I'm not seeing that here.

Do you mean passing perf_event_attr to refactor pmc_reprogram_counter
via paravirt? Please share more details.

> 
>> We optimize the current vPMU to work in this manner:
>>
>> (1) rely on the existing host perf (perf_event_create_kernel_counter)
>>      to allocate counters for in-use vPMC and always try to reuse events;
>> (2) vPMU captures guest accesses to the eventsel and fixctrl msr directly
>>      to the hardware msr that the corresponding host event is scheduled on
>>      and avoid pollution from host is also needed in its partial runtime;
> 
> If you do pass-through; how do you deal with event constraints >
>> (3) save and restore the counter state during vCPU scheduling in hooks;
>> (4) apply a lazy approach to release the vPMC's perf event. That is, if
>>      the vPMC isn't used in a fixed sched slice, its event will be released.
>>
>> In the use of vPMC, the vPMU always focus on the assigned resources and
>> guest perf would significantly benefit from direct access to hardware and
>> may not care about runtime state of perf_event created by host and always
>> try not to pay for their maintenance. However to avoid events entering into
>> any unexpected state, calling pmc_read_counter in appropriate is necessary.
> 
> what?!

The patch reuses the created event as much as possible for the same
guest vPMC, which may have different config_base values over its
partial runtime.

pmc_read_counter is designed to be called in kvm_pmu_rdpmc and
pmc_stop_counter as the legacy code does; it is not for vPMU
functionality but for host perf maintenance (which seems to have gone
missing in the code, oops).

> 
> I can't follow that, and the quick look I had at the patches doesn't
> seem to help. I did note it is intel only and that is really sad.

The basic idea of the optimization is x86-generic; the Intel-only
implementation is not intentional, it's just that I could not access
non-Intel machines to verify it on.

> 
> It also makes a mess of who programs what msr when.
> 

who programs: the vPMU does, as usual, in pmc_reprogram_counter.

which msr: the host perf scheduler makes the decision, and I'm not sure
whether host perf would do cross-mapping scheduling, which means
assigning a host fixed counter to a guest gp counter and vice versa.

when it programs: every time reprogram_gp/fixed_counter is called and
pmc_is_assigned(pmc) is false; check the fifth patch for details.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC] [PATCH v2 0/5] Intel Virtual PMU Optimization
  2019-03-25  6:47   ` Like Xu
@ 2019-03-25  7:19     ` Peter Zijlstra
  2019-03-25 15:58       ` Andi Kleen
  2019-04-01  9:08       ` Wei Wang
  0 siblings, 2 replies; 13+ messages in thread
From: Peter Zijlstra @ 2019-03-25  7:19 UTC (permalink / raw)
  To: Like Xu
  Cc: linux-kernel, kvm, like.xu, wei.w.wang, Andi Kleen, Kan Liang,
	Ingo Molnar, Paolo Bonzini, Thomas Gleixner

On Mon, Mar 25, 2019 at 02:47:32PM +0800, Like Xu wrote:
> On 2019/3/24 1:28, Peter Zijlstra wrote:
> > On Sat, Mar 23, 2019 at 10:18:03PM +0800, Like Xu wrote:
> > > === Brief description ===
> > > 
> > > This proposal for Intel vPMU is still committed to optimize the basic
> > > functionality by reducing the PMU virtualization overhead and not a blind
> > > pass-through of the PMU. The proposal applies to existing models, in short,
> > > is "host perf would hand over control to kvm after counter allocation".
> > > 
> > > The pmc_reprogram_counter is a heavyweight and high frequency operation
> > > which goes through the host perf software stack to create a perf event for
> > > counter assignment, this could take millions of nanoseconds. The current
> > > vPMU always does reprogram_counter when the guest changes the eventsel,
> > > fixctrl, and global_ctrl msrs. This brings too much overhead to the usage
> > > of perf inside the guest, especially the guest PMI handling and context
> > > switching of guest threads with perf in use.
> > 
> > I think I asked for starting with making pmc_reprogram_counter() less
> > retarded. I'm not seeing that here.
> 
> Do you mean pass perf_event_attr to refactor pmc_reprogram_counter
> via paravirt ? Please share more details.

I mean nothing; I'm trying to understand wth you're doing.

> > > We optimize the current vPMU to work in this manner:
> > > 
> > > (1) rely on the existing host perf (perf_event_create_kernel_counter)
> > >      to allocate counters for in-use vPMC and always try to reuse events;
> > > (2) vPMU captures guest accesses to the eventsel and fixctrl msr directly
> > >      to the hardware msr that the corresponding host event is scheduled on
> > >      and avoid pollution from host is also needed in its partial runtime;
> > 
> > If you do pass-through; how do you deal with event constraints >
> > > (3) save and restore the counter state during vCPU scheduling in hooks;
> > > (4) apply a lazy approach to release the vPMC's perf event. That is, if
> > >      the vPMC isn't used in a fixed sched slice, its event will be released.
> > > 
> > > In the use of vPMC, the vPMU always focus on the assigned resources and
> > > guest perf would significantly benefit from direct access to hardware and
> > > may not care about runtime state of perf_event created by host and always
> > > try not to pay for their maintenance. However to avoid events entering into
> > > any unexpected state, calling pmc_read_counter in appropriate is necessary.
> > 
> > what?!
> 
> The patch will reuse the created events as much as possible for same guest
> vPMC which may has different config_base in its partial runtime.

again. what?!

> The pmc_read_counter is designed to be called in kvm_pmu_rdpmc and
> pmc_stop_counter as legacy does and it's not for vPMU functionality but for
> host perf maintenance (seems to be gone in code,Oops).
> 
> > 
> > I can't follow that, and the quick look I had at the patches doesn't
> > seem to help. I did note it is intel only and that is really sad.
> 
> The basic idea of optimization is x86 generic, and the implementation is not
> intentional cause I could not access non-Intel machines and verified it.
> 
> > 
> > It also makes a mess of who programs what msr when.
> > 
> 
> who programs: vPMU does as usual in pmc_reprogram_counter
> 
> what msr: host perf scheduler make decisions and I'm not sure the hosy perf
> would do cross-mapping scheduling which means to assign a host fixed counter
> to guest gp counter and vice versa.
> 
> when programs: every time to call reprogram_gp/fixed_counter &&
> pmc_is_assigned(pmc) is false; check the fifth pacth for details.

I'm not going to reverse engineer this; if you can't write coherent
descriptions, this isn't going anywhere.

It isn't going anywhere anyway, its insane. You let perf do all its
normal things and then discard the results by avoiding the wrmsr.

Then you fudge a second wrmsr path somewhere.

Please, just make the existing event dtrt.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC] [PATCH v2 0/5] Intel Virtual PMU Optimization
  2019-03-25  7:19     ` Peter Zijlstra
@ 2019-03-25 15:58       ` Andi Kleen
  2019-04-01  9:08       ` Wei Wang
  1 sibling, 0 replies; 13+ messages in thread
From: Andi Kleen @ 2019-03-25 15:58 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Like Xu, linux-kernel, kvm, like.xu, wei.w.wang, Kan Liang,
	Ingo Molnar, Paolo Bonzini, Thomas Gleixner

> It isn't going anywhere anyway, its insane. You let perf do all its
> normal things and then discard the results by avoiding the wrmsr.
> 
> Then you fudge a second wrmsr path somewhere.
> 
> Please, just make the existing event dtrt.

I still think the right way is to force an event to a counter
from an internal field. And then set that field from KVM.
This is quite straight forward to do in the event scheduler.

I did it for some experimential PEBS virtualization patches
which require the same because they have to expose the
counter indexes inside the PEBS record to the guest.

-Andi

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC] [PATCH v2 0/5] Intel Virtual PMU Optimization
  2019-03-25  7:19     ` Peter Zijlstra
  2019-03-25 15:58       ` Andi Kleen
@ 2019-04-01  9:08       ` Wei Wang
  1 sibling, 0 replies; 13+ messages in thread
From: Wei Wang @ 2019-04-01  9:08 UTC (permalink / raw)
  To: Peter Zijlstra, Like Xu
  Cc: linux-kernel, kvm, like.xu, Andi Kleen, Kan Liang, Ingo Molnar,
	Paolo Bonzini, Thomas Gleixner

On 03/25/2019 03:19 PM, Peter Zijlstra wrote:
> On Mon, Mar 25, 2019 at 02:47:32PM +0800, Like Xu wrote:
>> On 2019/3/24 1:28, Peter Zijlstra wrote:
>>> On Sat, Mar 23, 2019 at 10:18:03PM +0800, Like Xu wrote:
>>>> === Brief description ===
>>>>
>>>> This proposal for Intel vPMU is still committed to optimize the basic
>>>> functionality by reducing the PMU virtualization overhead and not a blind
>>>> pass-through of the PMU. The proposal applies to existing models, in short,
>>>> is "host perf would hand over control to kvm after counter allocation".
>>>>
>>>> The pmc_reprogram_counter is a heavyweight and high frequency operation
>>>> which goes through the host perf software stack to create a perf event for
>>>> counter assignment, this could take millions of nanoseconds. The current
>>>> vPMU always does reprogram_counter when the guest changes the eventsel,
>>>> fixctrl, and global_ctrl msrs. This brings too much overhead to the usage
>>>> of perf inside the guest, especially the guest PMI handling and context
>>>> switching of guest threads with perf in use.
>>> I think I asked for starting with making pmc_reprogram_counter() less
>>> retarded. I'm not seeing that here.
>> Do you mean pass perf_event_attr to refactor pmc_reprogram_counter
>> via paravirt ? Please share more details.
> I mean nothing; I'm trying to understand wth you're doing.

I also feel the description looks confusing (sorry for being late to
join in, I was on leave). Also, the code needs to be improved a lot.


Please see the basic idea here:

reprogram_counter is a heavyweight operation which goes through the
perf software stack to create a perf event; this could take millions of
nanoseconds. The current KVM vPMU always does reprogram_counter
when the guest changes the eventsel, fixctrl, and global_ctrl msrs. This
brings too much overhead to the usage of perf inside the guest, especially
the guest PMI handling and context switching of guest threads with perf in
use.

In fact, during the guest perf event life cycle, the guest mostly only
toggles the enable bit of eventsel or fixctrl. From the KVM point of
view, if the guest
only toggles the enable bits, it is not necessary to do reprogram_counter,
because it is serving the same guest perf event. So the "enable bit" can
be directly applied to the hardware msr that the corresponding host event
is occupying.

We optimize the current vPMU to work in this manner:
1) rely on the existing host perf (perf_event_create_kernel_counter) to
create a perf event for each vPMC. This creation is only needed when
guest writes a complete new value to eventsel or fixctrl.

2) vPMU captures guest accesses to the eventsel and fixctrl msrs.
If the guest only toggles the enable bit, then we don't need to call
reprogram_counter, as the vPMC is serving the same guest event. So
KVM only updates the enable bit directly on the hardware msr that the
corresponding host event is scheduled on (see the sketch further below).

3) When the host perf reschedules perf counters and happens to
have the vPMC's perf event scheduled out, KVM will do
reprogram_counter.

4) We use a lazy approach to release the vPMC's perf event. That is,
if the vPMC wasn't used for a vCPU time slice, the corresponding perf
event will be released via kvm calling perf_event_release_kernel.

Regarding who updates the underlying hardware counter:
The change here is when a perf event is used by the guest
(i.e. exclude_host=true or using a new flag if necessary), perf doesn't
update the hardware counter (e.g. a counter's event_base and config_base),
instead, the hypervisor helps to update them.
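
In code form, the enable-bit fast path of 2) for a gp counter amounts
to roughly the following (a simplified sketch built from the helpers
introduced in this series, not the literal patch code):

	/* Guest wrote eventsel: if only the enable bit differs and the
	 * vPMC already has a hw-assigned event, toggle the bit on the
	 * real msr instead of recreating the perf event. */
	u64 diff = pmc->eventsel ^ data;

	if (intel_pmc_is_assigned(pmc) &&
	    !(diff & ~(u64)ARCH_PERFMON_EVENTSEL_ENABLE)) {
		pmc->eventsel = data;
		if (data & ARCH_PERFMON_EVENTSEL_ENABLE)
			intel_pmu_enable_host_counter(pmc);
		else
			intel_pmu_disable_host_counter(pmc);
		return 0;
	}

	/* otherwise: a genuinely new event, reprogram via host perf */
	reprogram_gp_counter(pmc, data);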

Hope the above makes it clear. Thanks!

Best,
Wei

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2019-04-01  9:08 UTC | newest]

Thread overview: 13+ messages
-- links below jump to the message on this page --
2019-03-23 14:18 [RFC] [PATCH v2 0/5] Intel Virtual PMU Optimization Like Xu
2019-03-23 14:18 ` [RFC] [PATCH v2 1/5] perf/x86: avoid host changing counter state for kvm_intel events holder Like Xu
2019-03-23 14:18 ` [RFC] [PATCH v2 2/5] KVM/x86/vPMU: add pmc operations for vmx and count to track release Like Xu
2019-03-23 14:18 ` [RFC] [PATCH v2 3/5] KVM/x86/vPMU: add Intel vPMC enable/disable and save/restore support Like Xu
2019-03-23 14:18 ` [RFC] [PATCH v2 4/5] KVM/x86/vPMU: add vCPU scheduling support for hw-assigned vPMC Like Xu
2019-03-23 14:18 ` [RFC] [PATCH v2 5/5] KVM/x86/vPMU: not do reprogram_counter for Intel " Like Xu
2019-03-23 17:28 ` [RFC] [PATCH v2 0/5] Intel Virtual PMU Optimization Peter Zijlstra
2019-03-23 23:15   ` Andi Kleen
2019-03-25  6:07     ` Like Xu
2019-03-25  6:47   ` Like Xu
2019-03-25  7:19     ` Peter Zijlstra
2019-03-25 15:58       ` Andi Kleen
2019-04-01  9:08       ` Wei Wang
