* [PATCH v25 01/15] common/symbols: Export hypervisor symbols to privileged guest
2015-06-19 18:44 [PATCH v25 00/15] x86/PMU: Xen PMU PV(H) support Boris Ostrovsky
@ 2015-06-19 18:44 ` Boris Ostrovsky
2015-06-19 18:44 ` [PATCH v25 02/15] x86/VPMU: Add public xenpmu.h Boris Ostrovsky
` (15 subsequent siblings)
16 siblings, 0 replies; 37+ messages in thread
From: Boris Ostrovsky @ 2015-06-19 18:44 UTC (permalink / raw)
To: JBeulich, kevin.tian, suravee.suthikulpanit,
Aravind.Gopalakrishnan, dietmar.hahn, dgdegra, andrew.cooper3
Cc: boris.ostrovsky, xen-devel
Export Xen's symbols as {<address><type><name>} triplets via the new
XENPF_get_symbol hypercall.
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Acked-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Dietmar Hahn <dietmar.hahn@ts.fujitsu.com>
Tested-by: Dietmar Hahn <dietmar.hahn@ts.fujitsu.com>
---
xen/arch/x86/platform_hypercall.c | 28 +++++++++++++++++++
xen/common/symbols.c | 54 +++++++++++++++++++++++++++++++++++++
xen/include/public/platform.h | 19 +++++++++++++
xen/include/xen/symbols.h | 3 +++
xen/include/xlat.lst | 1 +
xen/xsm/flask/hooks.c | 4 +++
xen/xsm/flask/policy/access_vectors | 2 ++
7 files changed, 111 insertions(+)
diff --git a/xen/arch/x86/platform_hypercall.c b/xen/arch/x86/platform_hypercall.c
index 334d474..7626261 100644
--- a/xen/arch/x86/platform_hypercall.c
+++ b/xen/arch/x86/platform_hypercall.c
@@ -23,6 +23,7 @@
#include <xen/cpu.h>
#include <xen/pmstat.h>
#include <xen/irq.h>
+#include <xen/symbols.h>
#include <asm/current.h>
#include <public/platform.h>
#include <acpi/cpufreq/processor_perf.h>
@@ -798,6 +799,33 @@ ret_t do_platform_op(XEN_GUEST_HANDLE_PARAM(xen_platform_op_t) u_xenpf_op)
}
break;
+ case XENPF_get_symbol:
+ {
+ static char name[KSYM_NAME_LEN + 1]; /* protected by xenpf_lock */
+ XEN_GUEST_HANDLE(char) nameh;
+ uint32_t namelen, copylen;
+
+ guest_from_compat_handle(nameh, op->u.symdata.name);
+
+ ret = xensyms_read(&op->u.symdata.symnum, &op->u.symdata.type,
+ &op->u.symdata.address, name);
+
+ namelen = strlen(name) + 1;
+
+ if ( namelen > op->u.symdata.namelen )
+ copylen = op->u.symdata.namelen;
+ else
+ copylen = namelen;
+
+ op->u.symdata.namelen = namelen;
+
+ if ( !ret && copy_to_guest(nameh, name, copylen) )
+ ret = -EFAULT;
+ if ( !ret && __copy_field_to_guest(u_xenpf_op, op, u.symdata) )
+ ret = -EFAULT;
+ }
+ break;
+
default:
ret = -ENOSYS;
break;
diff --git a/xen/common/symbols.c b/xen/common/symbols.c
index fc7c9e7..a59c59d 100644
--- a/xen/common/symbols.c
+++ b/xen/common/symbols.c
@@ -17,6 +17,8 @@
#include <xen/lib.h>
#include <xen/string.h>
#include <xen/spinlock.h>
+#include <public/platform.h>
+#include <xen/guest_access.h>
#ifdef SYMBOLS_ORIGIN
extern const unsigned int symbols_offsets[];
@@ -148,3 +150,55 @@ const char *symbols_lookup(unsigned long addr,
*offset = addr - symbols_address(low);
return namebuf;
}
+
+/*
+ * Get symbol type information. This is encoded as a single char at the
+ * beginning of the symbol name.
+ */
+static char symbols_get_symbol_type(unsigned int off)
+{
+ /*
+ * Get just the first code, look it up in the token table,
+ * and return the first char from this token.
+ */
+ return symbols_token_table[symbols_token_index[symbols_names[off + 1]]];
+}
+
+int xensyms_read(uint32_t *symnum, char *type,
+ uint64_t *address, char *name)
+{
+ /*
+ * Symbols are most likely accessed sequentially so we remember position
+ * from previous read. This can help us avoid the extra call to
+ * get_symbol_offset().
+ */
+ static uint64_t next_symbol, next_offset;
+ static DEFINE_SPINLOCK(symbols_mutex);
+
+ if ( *symnum > symbols_num_syms )
+ return -ERANGE;
+ if ( *symnum == symbols_num_syms )
+ {
+ /* No more symbols */
+ name[0] = '\0';
+ return 0;
+ }
+
+ spin_lock(&symbols_mutex);
+
+ if ( *symnum == 0 )
+ next_offset = next_symbol = 0;
+ if ( next_symbol != *symnum )
+ /* Non-sequential access */
+ next_offset = get_symbol_offset(*symnum);
+
+ *type = symbols_get_symbol_type(next_offset);
+ next_offset = symbols_expand_symbol(next_offset, name);
+ *address = symbols_address(*symnum);
+
+ next_symbol = ++*symnum;
+
+ spin_unlock(&symbols_mutex);
+
+ return 0;
+}
diff --git a/xen/include/public/platform.h b/xen/include/public/platform.h
index 82ec84e..1e6a6ce 100644
--- a/xen/include/public/platform.h
+++ b/xen/include/public/platform.h
@@ -590,6 +590,24 @@ struct xenpf_resource_op {
typedef struct xenpf_resource_op xenpf_resource_op_t;
DEFINE_XEN_GUEST_HANDLE(xenpf_resource_op_t);
+#define XENPF_get_symbol 63
+struct xenpf_symdata {
+ /* IN/OUT variables */
+ uint32_t namelen; /* IN: size of name buffer */
+ /* OUT: strlen(name) of hypervisor symbol (may be */
+ /* larger than what's been copied to guest) */
+ uint32_t symnum; /* IN: Symbol to read */
+ /* OUT: Next available symbol. If same as IN then */
+ /* we reached the end */
+
+ /* OUT variables */
+ XEN_GUEST_HANDLE(char) name;
+ uint64_t address;
+ char type;
+};
+typedef struct xenpf_symdata xenpf_symdata_t;
+DEFINE_XEN_GUEST_HANDLE(xenpf_symdata_t);
+
/*
* ` enum neg_errnoval
* ` HYPERVISOR_platform_op(const struct xen_platform_op*);
@@ -619,6 +637,7 @@ struct xen_platform_op {
struct xenpf_mem_hotadd mem_add;
struct xenpf_core_parking core_parking;
struct xenpf_resource_op resource_op;
+ struct xenpf_symdata symdata;
uint8_t pad[128];
} u;
};
diff --git a/xen/include/xen/symbols.h b/xen/include/xen/symbols.h
index 87cd77d..1fa0537 100644
--- a/xen/include/xen/symbols.h
+++ b/xen/include/xen/symbols.h
@@ -11,4 +11,7 @@ const char *symbols_lookup(unsigned long addr,
unsigned long *offset,
char *namebuf);
+int xensyms_read(uint32_t *symnum, char *type,
+ uint64_t *address, char *name);
+
#endif /*_XEN_SYMBOLS_H*/
diff --git a/xen/include/xlat.lst b/xen/include/xlat.lst
index 9c9fd9a..906e6fc 100644
--- a/xen/include/xlat.lst
+++ b/xen/include/xlat.lst
@@ -89,6 +89,7 @@
? processor_px platform.h
! psd_package platform.h
? xenpf_enter_acpi_sleep platform.h
+! xenpf_symdata platform.h
? xenpf_pcpuinfo platform.h
? xenpf_pcpu_version platform.h
? xenpf_resource_entry platform.h
diff --git a/xen/xsm/flask/hooks.c b/xen/xsm/flask/hooks.c
index 6e37d29..b4aae27 100644
--- a/xen/xsm/flask/hooks.c
+++ b/xen/xsm/flask/hooks.c
@@ -1515,6 +1515,10 @@ static int flask_platform_op(uint32_t op)
return avc_current_has_perm(SECINITSID_XEN, SECCLASS_XEN2,
XEN2__RESOURCE_OP, NULL);
+ case XENPF_get_symbol:
+ return avc_has_perm(domain_sid(current->domain), SECINITSID_XEN,
+ SECCLASS_XEN2, XEN2__GET_SYMBOL, NULL);
+
default:
printk("flask_platform_op: Unknown op %d\n", op);
return -EPERM;
diff --git a/xen/xsm/flask/policy/access_vectors b/xen/xsm/flask/policy/access_vectors
index 68284d5..b35a150 100644
--- a/xen/xsm/flask/policy/access_vectors
+++ b/xen/xsm/flask/policy/access_vectors
@@ -85,6 +85,8 @@ class xen2
resource_op
# XEN_SYSCTL_psr_cmt_op
psr_cmt_op
+# XENPF_get_symbol
+ get_symbol
}
# Classes domain and domain2 consist of operations that a domain performs on
--
1.8.1.4
^ permalink raw reply related [flat|nested] 37+ messages in thread
* [PATCH v25 02/15] x86/VPMU: Add public xenpmu.h
2015-06-19 18:44 [PATCH v25 00/15] x86/PMU: Xen PMU PV(H) support Boris Ostrovsky
2015-06-19 18:44 ` [PATCH v25 01/15] common/symbols: Export hypervisor symbols to privileged guest Boris Ostrovsky
@ 2015-06-19 18:44 ` Boris Ostrovsky
2015-06-19 18:44 ` [PATCH v25 03/15] x86/VPMU: Make vpmu not HVM-specific Boris Ostrovsky
` (14 subsequent siblings)
16 siblings, 0 replies; 37+ messages in thread
From: Boris Ostrovsky @ 2015-06-19 18:44 UTC (permalink / raw)
To: JBeulich, kevin.tian, suravee.suthikulpanit,
Aravind.Gopalakrishnan, dietmar.hahn, dgdegra, andrew.cooper3
Cc: boris.ostrovsky, xen-devel
Add pmu.h header files and move various macros and structures that will be
shared between the hypervisor and PV guests into them.
Move MSR banks out of the architectural PMU structures to allow for larger
sizes in the future. The banks are allocated immediately after the context,
and the PMU structures store offsets to them.
While making these updates, also:
* Remove unused vpmu_domain() macro from vpmu.h
* Convert msraddr_to_bitpos() into an inline and make it a little faster by
realizing that all Intel's PMU-related MSRs are in the lower MSR range.
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Acked-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
---
xen/arch/x86/hvm/svm/vpmu.c | 83 +++++++++++----------
xen/arch/x86/hvm/vmx/vpmu_core2.c | 123 +++++++++++++++++--------------
xen/arch/x86/hvm/vpmu.c | 8 ++
xen/arch/x86/oprofile/op_model_ppro.c | 6 +-
xen/include/Makefile | 3 +-
xen/include/asm-x86/hvm/vmx/vpmu_core2.h | 32 --------
xen/include/asm-x86/hvm/vpmu.h | 16 ++--
xen/include/public/arch-arm.h | 5 ++
xen/include/public/arch-x86/pmu.h | 116 +++++++++++++++++++++++++++++
xen/include/public/pmu.h | 58 +++++++++++++++
xen/include/xlat.lst | 6 ++
11 files changed, 322 insertions(+), 134 deletions(-)
delete mode 100644 xen/include/asm-x86/hvm/vmx/vpmu_core2.h
create mode 100644 xen/include/public/arch-x86/pmu.h
create mode 100644 xen/include/public/pmu.h
diff --git a/xen/arch/x86/hvm/svm/vpmu.c b/xen/arch/x86/hvm/svm/vpmu.c
index 6764070..a8b79df 100644
--- a/xen/arch/x86/hvm/svm/vpmu.c
+++ b/xen/arch/x86/hvm/svm/vpmu.c
@@ -30,10 +30,7 @@
#include <asm/apic.h>
#include <asm/hvm/vlapic.h>
#include <asm/hvm/vpmu.h>
-
-#define F10H_NUM_COUNTERS 4
-#define F15H_NUM_COUNTERS 6
-#define MAX_NUM_COUNTERS F15H_NUM_COUNTERS
+#include <public/pmu.h>
#define MSR_F10H_EVNTSEL_GO_SHIFT 40
#define MSR_F10H_EVNTSEL_EN_SHIFT 22
@@ -49,6 +46,9 @@ static const u32 __read_mostly *counters;
static const u32 __read_mostly *ctrls;
static bool_t __read_mostly k7_counters_mirrored;
+#define F10H_NUM_COUNTERS 4
+#define F15H_NUM_COUNTERS 6
+
/* PMU Counter MSRs. */
static const u32 AMD_F10H_COUNTERS[] = {
MSR_K7_PERFCTR0,
@@ -83,12 +83,14 @@ static const u32 AMD_F15H_CTRLS[] = {
MSR_AMD_FAM15H_EVNTSEL5
};
-/* storage for context switching */
-struct amd_vpmu_context {
- u64 counters[MAX_NUM_COUNTERS];
- u64 ctrls[MAX_NUM_COUNTERS];
- bool_t msr_bitmap_set;
-};
+/* Use private context as a flag for MSR bitmap */
+#define msr_bitmap_on(vpmu) do { \
+ (vpmu)->priv_context = (void *)-1L; \
+ } while (0)
+#define msr_bitmap_off(vpmu) do { \
+ (vpmu)->priv_context = NULL; \
+ } while (0)
+#define is_msr_bitmap_on(vpmu) ((vpmu)->priv_context != NULL)
static inline int get_pmu_reg_type(u32 addr)
{
@@ -142,7 +144,6 @@ static void amd_vpmu_set_msr_bitmap(struct vcpu *v)
{
unsigned int i;
struct vpmu_struct *vpmu = vcpu_vpmu(v);
- struct amd_vpmu_context *ctxt = vpmu->context;
for ( i = 0; i < num_counters; i++ )
{
@@ -150,14 +151,13 @@ static void amd_vpmu_set_msr_bitmap(struct vcpu *v)
svm_intercept_msr(v, ctrls[i], MSR_INTERCEPT_WRITE);
}
- ctxt->msr_bitmap_set = 1;
+ msr_bitmap_on(vpmu);
}
static void amd_vpmu_unset_msr_bitmap(struct vcpu *v)
{
unsigned int i;
struct vpmu_struct *vpmu = vcpu_vpmu(v);
- struct amd_vpmu_context *ctxt = vpmu->context;
for ( i = 0; i < num_counters; i++ )
{
@@ -165,7 +165,7 @@ static void amd_vpmu_unset_msr_bitmap(struct vcpu *v)
svm_intercept_msr(v, ctrls[i], MSR_INTERCEPT_RW);
}
- ctxt->msr_bitmap_set = 0;
+ msr_bitmap_off(vpmu);
}
static int amd_vpmu_do_interrupt(struct cpu_user_regs *regs)
@@ -177,19 +177,22 @@ static inline void context_load(struct vcpu *v)
{
unsigned int i;
struct vpmu_struct *vpmu = vcpu_vpmu(v);
- struct amd_vpmu_context *ctxt = vpmu->context;
+ struct xen_pmu_amd_ctxt *ctxt = vpmu->context;
+ uint64_t *counter_regs = vpmu_reg_pointer(ctxt, counters);
+ uint64_t *ctrl_regs = vpmu_reg_pointer(ctxt, ctrls);
for ( i = 0; i < num_counters; i++ )
{
- wrmsrl(counters[i], ctxt->counters[i]);
- wrmsrl(ctrls[i], ctxt->ctrls[i]);
+ wrmsrl(counters[i], counter_regs[i]);
+ wrmsrl(ctrls[i], ctrl_regs[i]);
}
}
static void amd_vpmu_load(struct vcpu *v)
{
struct vpmu_struct *vpmu = vcpu_vpmu(v);
- struct amd_vpmu_context *ctxt = vpmu->context;
+ struct xen_pmu_amd_ctxt *ctxt = vpmu->context;
+ uint64_t *ctrl_regs = vpmu_reg_pointer(ctxt, ctrls);
vpmu_reset(vpmu, VPMU_FROZEN);
@@ -198,7 +201,7 @@ static void amd_vpmu_load(struct vcpu *v)
unsigned int i;
for ( i = 0; i < num_counters; i++ )
- wrmsrl(ctrls[i], ctxt->ctrls[i]);
+ wrmsrl(ctrls[i], ctrl_regs[i]);
return;
}
@@ -212,17 +215,17 @@ static inline void context_save(struct vcpu *v)
{
unsigned int i;
struct vpmu_struct *vpmu = vcpu_vpmu(v);
- struct amd_vpmu_context *ctxt = vpmu->context;
+ struct xen_pmu_amd_ctxt *ctxt = vpmu->context;
+ uint64_t *counter_regs = vpmu_reg_pointer(ctxt, counters);
/* No need to save controls -- they are saved in amd_vpmu_do_wrmsr */
for ( i = 0; i < num_counters; i++ )
- rdmsrl(counters[i], ctxt->counters[i]);
+ rdmsrl(counters[i], counter_regs[i]);
}
static int amd_vpmu_save(struct vcpu *v)
{
struct vpmu_struct *vpmu = vcpu_vpmu(v);
- struct amd_vpmu_context *ctx = vpmu->context;
unsigned int i;
/*
@@ -245,7 +248,7 @@ static int amd_vpmu_save(struct vcpu *v)
context_save(v);
if ( !vpmu_is_set(vpmu, VPMU_RUNNING) &&
- has_hvm_container_vcpu(v) && ctx->msr_bitmap_set )
+ has_hvm_container_vcpu(v) && is_msr_bitmap_on(vpmu) )
amd_vpmu_unset_msr_bitmap(v);
return 1;
@@ -256,7 +259,9 @@ static void context_update(unsigned int msr, u64 msr_content)
unsigned int i;
struct vcpu *v = current;
struct vpmu_struct *vpmu = vcpu_vpmu(v);
- struct amd_vpmu_context *ctxt = vpmu->context;
+ struct xen_pmu_amd_ctxt *ctxt = vpmu->context;
+ uint64_t *counter_regs = vpmu_reg_pointer(ctxt, counters);
+ uint64_t *ctrl_regs = vpmu_reg_pointer(ctxt, ctrls);
if ( k7_counters_mirrored &&
((msr >= MSR_K7_EVNTSEL0) && (msr <= MSR_K7_PERFCTR3)) )
@@ -268,12 +273,12 @@ static void context_update(unsigned int msr, u64 msr_content)
{
if ( msr == ctrls[i] )
{
- ctxt->ctrls[i] = msr_content;
+ ctrl_regs[i] = msr_content;
return;
}
else if (msr == counters[i] )
{
- ctxt->counters[i] = msr_content;
+ counter_regs[i] = msr_content;
return;
}
}
@@ -303,8 +308,7 @@ static int amd_vpmu_do_wrmsr(unsigned int msr, uint64_t msr_content,
return 1;
vpmu_set(vpmu, VPMU_RUNNING);
- if ( has_hvm_container_vcpu(v) &&
- !((struct amd_vpmu_context *)vpmu->context)->msr_bitmap_set )
+ if ( has_hvm_container_vcpu(v) && is_msr_bitmap_on(vpmu) )
amd_vpmu_set_msr_bitmap(v);
}
@@ -313,8 +317,7 @@ static int amd_vpmu_do_wrmsr(unsigned int msr, uint64_t msr_content,
(is_pmu_enabled(msr_content) == 0) && vpmu_is_set(vpmu, VPMU_RUNNING) )
{
vpmu_reset(vpmu, VPMU_RUNNING);
- if ( has_hvm_container_vcpu(v) &&
- ((struct amd_vpmu_context *)vpmu->context)->msr_bitmap_set )
+ if ( has_hvm_container_vcpu(v) && is_msr_bitmap_on(vpmu) )
amd_vpmu_unset_msr_bitmap(v);
release_pmu_ownship(PMU_OWNER_HVM);
}
@@ -355,7 +358,7 @@ static int amd_vpmu_do_rdmsr(unsigned int msr, uint64_t *msr_content)
static int amd_vpmu_initialise(struct vcpu *v)
{
- struct amd_vpmu_context *ctxt;
+ struct xen_pmu_amd_ctxt *ctxt;
struct vpmu_struct *vpmu = vcpu_vpmu(v);
uint8_t family = current_cpu_data.x86;
@@ -382,7 +385,8 @@ static int amd_vpmu_initialise(struct vcpu *v)
}
}
- ctxt = xzalloc(struct amd_vpmu_context);
+ ctxt = xzalloc_bytes(sizeof(*ctxt) +
+ 2 * sizeof(uint64_t) * num_counters);
if ( !ctxt )
{
gdprintk(XENLOG_WARNING, "Insufficient memory for PMU, "
@@ -391,7 +395,11 @@ static int amd_vpmu_initialise(struct vcpu *v)
return -ENOMEM;
}
+ ctxt->counters = sizeof(*ctxt);
+ ctxt->ctrls = ctxt->counters + sizeof(uint64_t) * num_counters;
+
vpmu->context = ctxt;
+ vpmu->priv_context = NULL;
vpmu_set(vpmu, VPMU_CONTEXT_ALLOCATED);
return 0;
}
@@ -400,8 +408,7 @@ static void amd_vpmu_destroy(struct vcpu *v)
{
struct vpmu_struct *vpmu = vcpu_vpmu(v);
- if ( has_hvm_container_vcpu(v) &&
- ((struct amd_vpmu_context *)vpmu->context)->msr_bitmap_set )
+ if ( has_hvm_container_vcpu(v) && is_msr_bitmap_on(vpmu) )
amd_vpmu_unset_msr_bitmap(v);
xfree(vpmu->context);
@@ -418,7 +425,9 @@ static void amd_vpmu_destroy(struct vcpu *v)
static void amd_vpmu_dump(const struct vcpu *v)
{
const struct vpmu_struct *vpmu = vcpu_vpmu(v);
- const struct amd_vpmu_context *ctxt = vpmu->context;
+ const struct xen_pmu_amd_ctxt *ctxt = vpmu->context;
+ const uint64_t *counter_regs = vpmu_reg_pointer(ctxt, counters);
+ const uint64_t *ctrl_regs = vpmu_reg_pointer(ctxt, ctrls);
unsigned int i;
printk(" VPMU state: 0x%x ", vpmu->flags);
@@ -448,8 +457,8 @@ static void amd_vpmu_dump(const struct vcpu *v)
rdmsrl(ctrls[i], ctrl);
rdmsrl(counters[i], cntr);
printk(" %#x: %#lx (%#lx in HW) %#x: %#lx (%#lx in HW)\n",
- ctrls[i], ctxt->ctrls[i], ctrl,
- counters[i], ctxt->counters[i], cntr);
+ ctrls[i], ctrl_regs[i], ctrl,
+ counters[i], counter_regs[i], cntr);
}
}
diff --git a/xen/arch/x86/hvm/vmx/vpmu_core2.c b/xen/arch/x86/hvm/vmx/vpmu_core2.c
index 311f35f..6fc634c 100644
--- a/xen/arch/x86/hvm/vmx/vpmu_core2.c
+++ b/xen/arch/x86/hvm/vmx/vpmu_core2.c
@@ -35,8 +35,8 @@
#include <asm/hvm/vmx/vmcs.h>
#include <public/sched.h>
#include <public/hvm/save.h>
+#include <public/pmu.h>
#include <asm/hvm/vpmu.h>
-#include <asm/hvm/vmx/vpmu_core2.h>
/*
* See Intel SDM Vol 2a Instruction Set Reference chapter 3 for CPUID
@@ -68,6 +68,10 @@
#define MSR_PMC_ALIAS_MASK (~(MSR_IA32_PERFCTR0 ^ MSR_IA32_A_PERFCTR0))
static bool_t __read_mostly full_width_write;
+/* Intel-specific VPMU features */
+#define VPMU_CPU_HAS_DS 0x100 /* Has Debug Store */
+#define VPMU_CPU_HAS_BTS 0x200 /* Has Branch Trace Store */
+
/*
* MSR_CORE_PERF_FIXED_CTR_CTRL contains the configuration of all fixed
* counters. 4 bits for every counter.
@@ -75,17 +79,6 @@ static bool_t __read_mostly full_width_write;
#define FIXED_CTR_CTRL_BITS 4
#define FIXED_CTR_CTRL_MASK ((1 << FIXED_CTR_CTRL_BITS) - 1)
-#define VPMU_CORE2_MAX_FIXED_PMCS 4
-struct core2_vpmu_context {
- u64 fixed_ctrl;
- u64 ds_area;
- u64 pebs_enable;
- u64 global_ovf_status;
- u64 enabled_cntrs; /* Follows PERF_GLOBAL_CTRL MSR format */
- u64 fix_counters[VPMU_CORE2_MAX_FIXED_PMCS];
- struct arch_msr_pair arch_msr_pair[1];
-};
-
/* Number of general-purpose and fixed performance counters */
static unsigned int __read_mostly arch_pmc_cnt, fixed_pmc_cnt;
@@ -222,6 +215,12 @@ static int is_core2_vpmu_msr(u32 msr_index, int *type, int *index)
}
}
+static inline int msraddr_to_bitpos(int x)
+{
+ ASSERT(x == (x & 0x1fff));
+ return x;
+}
+
static void core2_vpmu_set_msr_bitmap(unsigned long *msr_bitmap)
{
int i;
@@ -291,12 +290,15 @@ static void core2_vpmu_unset_msr_bitmap(unsigned long *msr_bitmap)
static inline void __core2_vpmu_save(struct vcpu *v)
{
int i;
- struct core2_vpmu_context *core2_vpmu_cxt = vcpu_vpmu(v)->context;
+ struct xen_pmu_intel_ctxt *core2_vpmu_cxt = vcpu_vpmu(v)->context;
+ uint64_t *fixed_counters = vpmu_reg_pointer(core2_vpmu_cxt, fixed_counters);
+ struct xen_pmu_cntr_pair *xen_pmu_cntr_pair =
+ vpmu_reg_pointer(core2_vpmu_cxt, arch_counters);
for ( i = 0; i < fixed_pmc_cnt; i++ )
- rdmsrl(MSR_CORE_PERF_FIXED_CTR0 + i, core2_vpmu_cxt->fix_counters[i]);
+ rdmsrl(MSR_CORE_PERF_FIXED_CTR0 + i, fixed_counters[i]);
for ( i = 0; i < arch_pmc_cnt; i++ )
- rdmsrl(MSR_IA32_PERFCTR0 + i, core2_vpmu_cxt->arch_msr_pair[i].counter);
+ rdmsrl(MSR_IA32_PERFCTR0 + i, xen_pmu_cntr_pair[i].counter);
}
static int core2_vpmu_save(struct vcpu *v)
@@ -319,10 +321,13 @@ static int core2_vpmu_save(struct vcpu *v)
static inline void __core2_vpmu_load(struct vcpu *v)
{
unsigned int i, pmc_start;
- struct core2_vpmu_context *core2_vpmu_cxt = vcpu_vpmu(v)->context;
+ struct xen_pmu_intel_ctxt *core2_vpmu_cxt = vcpu_vpmu(v)->context;
+ uint64_t *fixed_counters = vpmu_reg_pointer(core2_vpmu_cxt, fixed_counters);
+ struct xen_pmu_cntr_pair *xen_pmu_cntr_pair =
+ vpmu_reg_pointer(core2_vpmu_cxt, arch_counters);
for ( i = 0; i < fixed_pmc_cnt; i++ )
- wrmsrl(MSR_CORE_PERF_FIXED_CTR0 + i, core2_vpmu_cxt->fix_counters[i]);
+ wrmsrl(MSR_CORE_PERF_FIXED_CTR0 + i, fixed_counters[i]);
if ( full_width_write )
pmc_start = MSR_IA32_A_PERFCTR0;
@@ -330,8 +335,8 @@ static inline void __core2_vpmu_load(struct vcpu *v)
pmc_start = MSR_IA32_PERFCTR0;
for ( i = 0; i < arch_pmc_cnt; i++ )
{
- wrmsrl(pmc_start + i, core2_vpmu_cxt->arch_msr_pair[i].counter);
- wrmsrl(MSR_P6_EVNTSEL(i), core2_vpmu_cxt->arch_msr_pair[i].control);
+ wrmsrl(pmc_start + i, xen_pmu_cntr_pair[i].counter);
+ wrmsrl(MSR_P6_EVNTSEL(i), xen_pmu_cntr_pair[i].control);
}
wrmsrl(MSR_CORE_PERF_FIXED_CTR_CTRL, core2_vpmu_cxt->fixed_ctrl);
@@ -354,7 +359,8 @@ static void core2_vpmu_load(struct vcpu *v)
static int core2_vpmu_alloc_resource(struct vcpu *v)
{
struct vpmu_struct *vpmu = vcpu_vpmu(v);
- struct core2_vpmu_context *core2_vpmu_cxt;
+ struct xen_pmu_intel_ctxt *core2_vpmu_cxt = NULL;
+ uint64_t *p = NULL;
if ( !acquire_pmu_ownership(PMU_OWNER_HVM) )
return 0;
@@ -367,12 +373,20 @@ static int core2_vpmu_alloc_resource(struct vcpu *v)
goto out_err;
vmx_write_guest_msr(MSR_CORE_PERF_GLOBAL_CTRL, 0);
- core2_vpmu_cxt = xzalloc_bytes(sizeof(struct core2_vpmu_context) +
- (arch_pmc_cnt-1)*sizeof(struct arch_msr_pair));
- if ( !core2_vpmu_cxt )
+ core2_vpmu_cxt = xzalloc_bytes(sizeof(*core2_vpmu_cxt) +
+ sizeof(uint64_t) * fixed_pmc_cnt +
+ sizeof(struct xen_pmu_cntr_pair) *
+ arch_pmc_cnt);
+ p = xzalloc(uint64_t);
+ if ( !core2_vpmu_cxt || !p )
goto out_err;
- vpmu->context = (void *)core2_vpmu_cxt;
+ core2_vpmu_cxt->fixed_counters = sizeof(*core2_vpmu_cxt);
+ core2_vpmu_cxt->arch_counters = core2_vpmu_cxt->fixed_counters +
+ sizeof(uint64_t) * fixed_pmc_cnt;
+
+ vpmu->context = core2_vpmu_cxt;
+ vpmu->priv_context = p;
vpmu_set(vpmu, VPMU_CONTEXT_ALLOCATED);
@@ -381,6 +395,9 @@ static int core2_vpmu_alloc_resource(struct vcpu *v)
out_err:
release_pmu_ownship(PMU_OWNER_HVM);
+ xfree(core2_vpmu_cxt);
+ xfree(p);
+
printk("Failed to allocate VPMU resources for domain %u vcpu %u\n",
v->vcpu_id, v->domain->domain_id);
@@ -418,7 +435,8 @@ static int core2_vpmu_do_wrmsr(unsigned int msr, uint64_t msr_content,
int type = -1, index = -1;
struct vcpu *v = current;
struct vpmu_struct *vpmu = vcpu_vpmu(v);
- struct core2_vpmu_context *core2_vpmu_cxt = NULL;
+ struct xen_pmu_intel_ctxt *core2_vpmu_cxt;
+ uint64_t *enabled_cntrs;
if ( !core2_vpmu_msr_common_check(msr, &type, &index) )
{
@@ -446,10 +464,11 @@ static int core2_vpmu_do_wrmsr(unsigned int msr, uint64_t msr_content,
ASSERT(!supported);
core2_vpmu_cxt = vpmu->context;
+ enabled_cntrs = vpmu->priv_context;
switch ( msr )
{
case MSR_CORE_PERF_GLOBAL_OVF_CTRL:
- core2_vpmu_cxt->global_ovf_status &= ~msr_content;
+ core2_vpmu_cxt->global_status &= ~msr_content;
return 1;
case MSR_CORE_PERF_GLOBAL_STATUS:
gdprintk(XENLOG_INFO, "Can not write readonly MSR: "
@@ -483,15 +502,14 @@ static int core2_vpmu_do_wrmsr(unsigned int msr, uint64_t msr_content,
break;
case MSR_CORE_PERF_FIXED_CTR_CTRL:
vmx_read_guest_msr(MSR_CORE_PERF_GLOBAL_CTRL, &global_ctrl);
- core2_vpmu_cxt->enabled_cntrs &=
- ~(((1ULL << VPMU_CORE2_MAX_FIXED_PMCS) - 1) << 32);
+ *enabled_cntrs &= ~(((1ULL << fixed_pmc_cnt) - 1) << 32);
if ( msr_content != 0 )
{
u64 val = msr_content;
for ( i = 0; i < fixed_pmc_cnt; i++ )
{
if ( val & 3 )
- core2_vpmu_cxt->enabled_cntrs |= (1ULL << 32) << i;
+ *enabled_cntrs |= (1ULL << 32) << i;
val >>= FIXED_CTR_CTRL_BITS;
}
}
@@ -502,19 +520,21 @@ static int core2_vpmu_do_wrmsr(unsigned int msr, uint64_t msr_content,
tmp = msr - MSR_P6_EVNTSEL(0);
if ( tmp >= 0 && tmp < arch_pmc_cnt )
{
+ struct xen_pmu_cntr_pair *xen_pmu_cntr_pair =
+ vpmu_reg_pointer(core2_vpmu_cxt, arch_counters);
+
vmx_read_guest_msr(MSR_CORE_PERF_GLOBAL_CTRL, &global_ctrl);
if ( msr_content & (1ULL << 22) )
- core2_vpmu_cxt->enabled_cntrs |= 1ULL << tmp;
+ *enabled_cntrs |= 1ULL << tmp;
else
- core2_vpmu_cxt->enabled_cntrs &= ~(1ULL << tmp);
+ *enabled_cntrs &= ~(1ULL << tmp);
- core2_vpmu_cxt->arch_msr_pair[tmp].control = msr_content;
+ xen_pmu_cntr_pair[tmp].control = msr_content;
}
}
- if ( (global_ctrl & core2_vpmu_cxt->enabled_cntrs) ||
- (core2_vpmu_cxt->ds_area != 0) )
+ if ( (global_ctrl & *enabled_cntrs) || (core2_vpmu_cxt->ds_area != 0) )
vpmu_set(vpmu, VPMU_RUNNING);
else
vpmu_reset(vpmu, VPMU_RUNNING);
@@ -560,7 +580,7 @@ static int core2_vpmu_do_rdmsr(unsigned int msr, uint64_t *msr_content)
int type = -1, index = -1;
struct vcpu *v = current;
struct vpmu_struct *vpmu = vcpu_vpmu(v);
- struct core2_vpmu_context *core2_vpmu_cxt = NULL;
+ struct xen_pmu_intel_ctxt *core2_vpmu_cxt;
if ( core2_vpmu_msr_common_check(msr, &type, &index) )
{
@@ -571,7 +591,7 @@ static int core2_vpmu_do_rdmsr(unsigned int msr, uint64_t *msr_content)
*msr_content = 0;
break;
case MSR_CORE_PERF_GLOBAL_STATUS:
- *msr_content = core2_vpmu_cxt->global_ovf_status;
+ *msr_content = core2_vpmu_cxt->global_status;
break;
case MSR_CORE_PERF_GLOBAL_CTRL:
vmx_read_guest_msr(MSR_CORE_PERF_GLOBAL_CTRL, msr_content);
@@ -620,10 +640,12 @@ static void core2_vpmu_dump(const struct vcpu *v)
{
const struct vpmu_struct *vpmu = vcpu_vpmu(v);
unsigned int i;
- const struct core2_vpmu_context *core2_vpmu_cxt = NULL;
+ const struct xen_pmu_intel_ctxt *core2_vpmu_cxt = vpmu->context;
u64 val;
+ uint64_t *fixed_counters;
+ struct xen_pmu_cntr_pair *cntr_pair;
- if ( !vpmu_is_set(vpmu, VPMU_CONTEXT_ALLOCATED) )
+ if ( !core2_vpmu_cxt || !vpmu_is_set(vpmu, VPMU_CONTEXT_ALLOCATED) )
return;
if ( !vpmu_is_set(vpmu, VPMU_RUNNING) )
@@ -636,16 +658,15 @@ static void core2_vpmu_dump(const struct vcpu *v)
}
printk(" vPMU running\n");
- core2_vpmu_cxt = vpmu->context;
+
+ cntr_pair = vpmu_reg_pointer(core2_vpmu_cxt, arch_counters);
+ fixed_counters = vpmu_reg_pointer(core2_vpmu_cxt, fixed_counters);
/* Print the contents of the counter and its configuration msr. */
for ( i = 0; i < arch_pmc_cnt; i++ )
- {
- const struct arch_msr_pair *msr_pair = core2_vpmu_cxt->arch_msr_pair;
-
printk(" general_%d: 0x%016lx ctrl: 0x%016lx\n",
- i, msr_pair[i].counter, msr_pair[i].control);
- }
+ i, cntr_pair[i].counter, cntr_pair[i].control);
+
/*
* The configuration of the fixed counter is 4 bits each in the
* MSR_CORE_PERF_FIXED_CTR_CTRL.
@@ -654,7 +675,7 @@ static void core2_vpmu_dump(const struct vcpu *v)
for ( i = 0; i < fixed_pmc_cnt; i++ )
{
printk(" fixed_%d: 0x%016lx ctrl: %#lx\n",
- i, core2_vpmu_cxt->fix_counters[i],
+ i, fixed_counters[i],
val & FIXED_CTR_CTRL_MASK);
val >>= FIXED_CTR_CTRL_BITS;
}
@@ -665,14 +686,14 @@ static int core2_vpmu_do_interrupt(struct cpu_user_regs *regs)
struct vcpu *v = current;
u64 msr_content;
struct vpmu_struct *vpmu = vcpu_vpmu(v);
- struct core2_vpmu_context *core2_vpmu_cxt = vpmu->context;
+ struct xen_pmu_intel_ctxt *core2_vpmu_cxt = vpmu->context;
rdmsrl(MSR_CORE_PERF_GLOBAL_STATUS, msr_content);
if ( msr_content )
{
if ( is_pmc_quirk )
handle_pmc_quirk(msr_content);
- core2_vpmu_cxt->global_ovf_status |= msr_content;
+ core2_vpmu_cxt->global_status |= msr_content;
msr_content = 0xC000000700000000 | ((1 << arch_pmc_cnt) - 1);
wrmsrl(MSR_CORE_PERF_GLOBAL_OVF_CTRL, msr_content);
}
@@ -739,13 +760,6 @@ static int core2_vpmu_initialise(struct vcpu *v, unsigned int vpmu_flags)
arch_pmc_cnt = core2_get_arch_pmc_count();
fixed_pmc_cnt = core2_get_fixed_pmc_count();
- if ( fixed_pmc_cnt > VPMU_CORE2_MAX_FIXED_PMCS )
- {
- fixed_pmc_cnt = VPMU_CORE2_MAX_FIXED_PMCS;
- printk(XENLOG_G_WARNING "Limiting number of fixed counters to %d\n",
- fixed_pmc_cnt);
- }
-
check_pmc_quirk();
return 0;
}
@@ -755,6 +769,7 @@ static void core2_vpmu_destroy(struct vcpu *v)
struct vpmu_struct *vpmu = vcpu_vpmu(v);
xfree(vpmu->context);
+ xfree(vpmu->priv_context);
if ( has_hvm_container_vcpu(v) && cpu_has_vmx_msr_bitmap )
core2_vpmu_unset_msr_bitmap(v->arch.hvm_vmx.msr_bitmap);
release_pmu_ownship(PMU_OWNER_HVM);
diff --git a/xen/arch/x86/hvm/vpmu.c b/xen/arch/x86/hvm/vpmu.c
index c3273ee..d829d68 100644
--- a/xen/arch/x86/hvm/vpmu.c
+++ b/xen/arch/x86/hvm/vpmu.c
@@ -32,6 +32,11 @@
#include <asm/hvm/svm/svm.h>
#include <asm/hvm/svm/vmcb.h>
#include <asm/apic.h>
+#include <public/pmu.h>
+
+#include <compat/pmu.h>
+CHECK_pmu_cntr_pair;
+CHECK_pmu_data;
/*
* "vpmu" : vpmu generally enabled
@@ -233,6 +238,9 @@ void vpmu_initialise(struct vcpu *v)
uint8_t vendor = current_cpu_data.x86_vendor;
int ret;
+ BUILD_BUG_ON(sizeof(struct xen_pmu_intel_ctxt) > XENPMU_CTXT_PAD_SZ);
+ BUILD_BUG_ON(sizeof(struct xen_pmu_amd_ctxt) > XENPMU_CTXT_PAD_SZ);
+
if ( is_pvh_vcpu(v) )
return;
diff --git a/xen/arch/x86/oprofile/op_model_ppro.c b/xen/arch/x86/oprofile/op_model_ppro.c
index aa99e4d..ca429a1 100644
--- a/xen/arch/x86/oprofile/op_model_ppro.c
+++ b/xen/arch/x86/oprofile/op_model_ppro.c
@@ -20,11 +20,15 @@
#include <asm/regs.h>
#include <asm/current.h>
#include <asm/hvm/vpmu.h>
-#include <asm/hvm/vmx/vpmu_core2.h>
#include "op_x86_model.h"
#include "op_counter.h"
+struct arch_msr_pair {
+ u64 counter;
+ u64 control;
+};
+
/*
* Intel "Architectural Performance Monitoring" CPUID
* detection/enumeration details:
diff --git a/xen/include/Makefile b/xen/include/Makefile
index 7c7f6a8..6664107 100644
--- a/xen/include/Makefile
+++ b/xen/include/Makefile
@@ -22,10 +22,11 @@ headers-y := \
compat/version.h \
compat/xen.h \
compat/xenoprof.h
+headers-$(CONFIG_X86) += compat/arch-x86/pmu.h
headers-$(CONFIG_X86) += compat/arch-x86/xen-mca.h
headers-$(CONFIG_X86) += compat/arch-x86/xen.h
headers-$(CONFIG_X86) += compat/arch-x86/xen-$(compat-arch-y).h
-headers-y += compat/arch-$(compat-arch-y).h compat/xlat.h
+headers-y += compat/arch-$(compat-arch-y).h compat/pmu.h compat/xlat.h
headers-$(FLASK_ENABLE) += compat/xsm/flask_op.h
cppflags-y := -include public/xen-compat.h
diff --git a/xen/include/asm-x86/hvm/vmx/vpmu_core2.h b/xen/include/asm-x86/hvm/vmx/vpmu_core2.h
deleted file mode 100644
index 410372d..0000000
--- a/xen/include/asm-x86/hvm/vmx/vpmu_core2.h
+++ /dev/null
@@ -1,32 +0,0 @@
-
-/*
- * vpmu_core2.h: CORE 2 specific PMU virtualization for HVM domain.
- *
- * Copyright (c) 2007, Intel Corporation.
- *
- * This program is free software; you can redistribute it and/or modify it
- * under the terms and conditions of the GNU General Public License,
- * version 2, as published by the Free Software Foundation.
- *
- * This program is distributed in the hope it will be useful, but WITHOUT
- * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
- * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
- * more details.
- *
- * You should have received a copy of the GNU General Public License along with
- * this program; if not, write to the Free Software Foundation, Inc., 59 Temple
- * Place - Suite 330, Boston, MA 02111-1307 USA.
- *
- * Author: Haitao Shan <haitao.shan@intel.com>
- */
-
-#ifndef __ASM_X86_HVM_VPMU_CORE_H_
-#define __ASM_X86_HVM_VPMU_CORE_H_
-
-struct arch_msr_pair {
- u64 counter;
- u64 control;
-};
-
-#endif /* __ASM_X86_HVM_VPMU_CORE_H_ */
-
diff --git a/xen/include/asm-x86/hvm/vpmu.h b/xen/include/asm-x86/hvm/vpmu.h
index 9c4e65a..83eea7e 100644
--- a/xen/include/asm-x86/hvm/vpmu.h
+++ b/xen/include/asm-x86/hvm/vpmu.h
@@ -22,6 +22,8 @@
#ifndef __ASM_X86_HVM_VPMU_H_
#define __ASM_X86_HVM_VPMU_H_
+#include <public/pmu.h>
+
/*
* Flag bits given as a string on the hypervisor boot parameter 'vpmu'.
* See arch/x86/hvm/vpmu.c.
@@ -29,12 +31,9 @@
#define VPMU_BOOT_ENABLED 0x1 /* vpmu generally enabled. */
#define VPMU_BOOT_BTS 0x2 /* Intel BTS feature wanted. */
-
-#define msraddr_to_bitpos(x) (((x)&0xffff) + ((x)>>31)*0x2000)
#define vcpu_vpmu(vcpu) (&((vcpu)->arch.hvm_vcpu.vpmu))
#define vpmu_vcpu(vpmu) (container_of((vpmu), struct vcpu, \
arch.hvm_vcpu.vpmu))
-#define vpmu_domain(vpmu) (vpmu_vcpu(vpmu)->domain)
#define MSR_TYPE_COUNTER 0
#define MSR_TYPE_CTRL 1
@@ -42,6 +41,9 @@
#define MSR_TYPE_ARCH_COUNTER 3
#define MSR_TYPE_ARCH_CTRL 4
+/* Start of PMU register bank */
+#define vpmu_reg_pointer(ctxt, offset) ((void *)((uintptr_t)ctxt + \
+ (uintptr_t)ctxt->offset))
/* Arch specific operations shared by all vpmus */
struct arch_vpmu_ops {
@@ -65,7 +67,8 @@ struct vpmu_struct {
u32 flags;
u32 last_pcpu;
u32 hw_lapic_lvtpc;
- void *context;
+ void *context; /* May be shared with PV guest */
+ void *priv_context; /* hypervisor-only */
struct arch_vpmu_ops *arch_vpmu_ops;
};
@@ -77,11 +80,6 @@ struct vpmu_struct {
#define VPMU_FROZEN 0x10 /* Stop counters while VCPU is not running */
#define VPMU_PASSIVE_DOMAIN_ALLOCATED 0x20
-/* VPMU features */
-#define VPMU_CPU_HAS_DS 0x100 /* Has Debug Store */
-#define VPMU_CPU_HAS_BTS 0x200 /* Has Branch Trace Store */
-
-
static inline void vpmu_set(struct vpmu_struct *vpmu, const u32 mask)
{
vpmu->flags |= mask;
diff --git a/xen/include/public/arch-arm.h b/xen/include/public/arch-arm.h
index 6f24c5f..d127746 100644
--- a/xen/include/public/arch-arm.h
+++ b/xen/include/public/arch-arm.h
@@ -434,6 +434,11 @@ typedef uint64_t xen_callback_t;
#endif
+#ifndef __ASSEMBLY__
+/* Stub definition of PMU structure */
+typedef struct xen_pmu_arch { uint8_t dummy; } xen_pmu_arch_t;
+#endif
+
#endif /* __XEN_PUBLIC_ARCH_ARM_H__ */
/*
diff --git a/xen/include/public/arch-x86/pmu.h b/xen/include/public/arch-x86/pmu.h
new file mode 100644
index 0000000..4351115
--- /dev/null
+++ b/xen/include/public/arch-x86/pmu.h
@@ -0,0 +1,116 @@
+#ifndef __XEN_PUBLIC_ARCH_X86_PMU_H__
+#define __XEN_PUBLIC_ARCH_X86_PMU_H__
+
+/* x86-specific PMU definitions */
+
+/* AMD PMU registers and structures */
+struct xen_pmu_amd_ctxt {
+ /* Offsets to counter and control MSRs (relative to xen_pmu_arch.c.amd) */
+ uint32_t counters;
+ uint32_t ctrls;
+
+ /* Counter MSRs */
+#if defined(__STDC_VERSION__) && __STDC_VERSION__ >= 199901L
+ uint64_t regs[];
+#elif defined(__GNUC__)
+ uint64_t regs[0];
+#endif
+};
+typedef struct xen_pmu_amd_ctxt xen_pmu_amd_ctxt_t;
+DEFINE_XEN_GUEST_HANDLE(xen_pmu_amd_ctxt_t);
+
+/* Intel PMU registers and structures */
+struct xen_pmu_cntr_pair {
+ uint64_t counter;
+ uint64_t control;
+};
+typedef struct xen_pmu_cntr_pair xen_pmu_cntr_pair_t;
+DEFINE_XEN_GUEST_HANDLE(xen_pmu_cntr_pair_t);
+
+struct xen_pmu_intel_ctxt {
+ /*
+ * Offsets to fixed and architectural counter MSRs (relative to
+ * xen_pmu_arch.c.intel)
+ */
+ uint32_t fixed_counters;
+ uint32_t arch_counters;
+
+ /* PMU registers */
+ uint64_t global_ctrl;
+ uint64_t global_ovf_ctrl;
+ uint64_t global_status;
+ uint64_t fixed_ctrl;
+ uint64_t ds_area;
+ uint64_t pebs_enable;
+ uint64_t debugctl;
+
+ /* Fixed and architectural counter MSRs */
+#if defined(__STDC_VERSION__) && __STDC_VERSION__ >= 199901L
+ uint64_t regs[];
+#elif defined(__GNUC__)
+ uint64_t regs[0];
+#endif
+};
+typedef struct xen_pmu_intel_ctxt xen_pmu_intel_ctxt_t;
+DEFINE_XEN_GUEST_HANDLE(xen_pmu_intel_ctxt_t);
+
+/* Sampled domain's registers */
+struct xen_pmu_regs {
+ uint64_t ip;
+ uint64_t sp;
+ uint64_t flags;
+ uint16_t cs;
+ uint16_t ss;
+ uint8_t cpl;
+ uint8_t pad[3];
+};
+typedef struct xen_pmu_regs xen_pmu_regs_t;
+DEFINE_XEN_GUEST_HANDLE(xen_pmu_regs_t);
+
+/* PMU flags */
+#define PMU_CACHED (1<<0) /* PMU MSRs are cached in the context */
+
+/*
+ * Architecture-specific information describing state of the processor at
+ * the time of PMU interrupt.
+ * Fields of this structure marked as RW for guest should only be written by
+ * the guest when PMU_CACHED bit in pmu_flags is set (which is done by the
+ * hypervisor during PMU interrupt). Hypervisor will read updated data in
+ * XENPMU_flush hypercall and clear PMU_CACHED bit.
+ */
+struct xen_pmu_arch {
+ union {
+ /*
+ * Processor's registers at the time of interrupt.
+ * WO for hypervisor, RO for guests.
+ */
+ struct xen_pmu_regs regs;
+ /* Padding for adding new registers to xen_pmu_regs in the future */
+#define XENPMU_REGS_PAD_SZ 64
+ uint8_t pad[XENPMU_REGS_PAD_SZ];
+ } r;
+
+ /* WO for hypervisor, RO for guest */
+ uint64_t pmu_flags;
+
+ /* Placeholder for APIC LVTPC register */
+ uint64_t lvtpc_pad;
+
+ /* Placeholder for vendor-specific PMU registers */
+#define XENPMU_CTXT_PAD_SZ 128
+ uint64_t pmu_regs_pad[XENPMU_CTXT_PAD_SZ / 8];
+};
+typedef struct xen_pmu_arch xen_pmu_arch_t;
+DEFINE_XEN_GUEST_HANDLE(xen_pmu_arch_t);
+
+#endif /* __XEN_PUBLIC_ARCH_X86_PMU_H__ */
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
+
diff --git a/xen/include/public/pmu.h b/xen/include/public/pmu.h
new file mode 100644
index 0000000..844159e
--- /dev/null
+++ b/xen/include/public/pmu.h
@@ -0,0 +1,58 @@
+#ifndef __XEN_PUBLIC_PMU_H__
+#define __XEN_PUBLIC_PMU_H__
+
+#include "xen.h"
+#if defined(__i386__) || defined(__x86_64__)
+#include "arch-x86/pmu.h"
+#elif defined (__arm__) || defined (__aarch64__)
+#include "arch-arm.h"
+#else
+#error "Unsupported architecture"
+#endif
+
+#define XENPMU_VER_MAJ 0
+#define XENPMU_VER_MIN 1
+
+/*
+ * Shared PMU data between hypervisor and PV(H) domains.
+ *
+ * The hypervisor fills out this structure during PMU interrupt and sends an
+ * interrupt to appropriate VCPU.
+ * Architecture-independent fields of xen_pmu_data are WO for the hypervisor
+ * and RO for the guest but some fields in xen_pmu_arch can be writable
+ * by both the hypervisor and the guest (see arch-$arch/pmu.h).
+ */
+struct xen_pmu_data {
+ /* Interrupted VCPU */
+ uint32_t vcpu_id;
+
+ /*
+ * Physical processor on which the interrupt occurred. On non-privileged
+ * guests set to vcpu_id;
+ */
+ uint32_t pcpu_id;
+
+ /*
+ * Domain that was interrupted. On non-privileged guests set to DOMID_SELF.
+ * On privileged guests can be DOMID_SELF, DOMID_XEN, or, when in
+ * XENPMU_MODE_ALL mode, domain ID of another domain.
+ */
+ domid_t domain_id;
+
+ uint8_t pad[6];
+
+ /* Architecture-specific information */
+ struct xen_pmu_arch pmu;
+};
+
+#endif /* __XEN_PUBLIC_PMU_H__ */
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
diff --git a/xen/include/xlat.lst b/xen/include/xlat.lst
index 906e6fc..997d2db 100644
--- a/xen/include/xlat.lst
+++ b/xen/include/xlat.lst
@@ -8,6 +8,11 @@
! start_info xen.h
? vcpu_info xen.h
? vcpu_time_info xen.h
+? pmu_amd_ctxt arch-x86/pmu.h
+? pmu_arch arch-x86/pmu.h
+? pmu_cntr_pair arch-x86/pmu.h
+? pmu_intel_ctxt arch-x86/pmu.h
+? pmu_regs arch-x86/pmu.h
! cpu_user_regs arch-x86/xen-@arch@.h
! trap_info arch-x86/xen.h
? cpu_offline_action arch-x86/xen-mca.h
@@ -93,6 +98,7 @@
? xenpf_pcpuinfo platform.h
? xenpf_pcpu_version platform.h
? xenpf_resource_entry platform.h
+? pmu_data pmu.h
! sched_poll sched.h
? sched_remote_shutdown sched.h
? sched_shutdown sched.h
--
1.8.1.4
^ permalink raw reply related [flat|nested] 37+ messages in thread
* [PATCH v25 03/15] x86/VPMU: Make vpmu not HVM-specific
2015-06-19 18:44 [PATCH v25 00/15] x86/PMU: Xen PMU PV(H) support Boris Ostrovsky
2015-06-19 18:44 ` [PATCH v25 01/15] common/symbols: Export hypervisor symbols to privileged guest Boris Ostrovsky
2015-06-19 18:44 ` [PATCH v25 02/15] x86/VPMU: Add public xenpmu.h Boris Ostrovsky
@ 2015-06-19 18:44 ` Boris Ostrovsky
2015-06-19 18:44 ` [PATCH v25 04/15] x86/VPMU: Interface for setting PMU mode and flags Boris Ostrovsky
` (13 subsequent siblings)
16 siblings, 0 replies; 37+ messages in thread
From: Boris Ostrovsky @ 2015-06-19 18:44 UTC (permalink / raw)
To: JBeulich, kevin.tian, suravee.suthikulpanit,
Aravind.Gopalakrishnan, dietmar.hahn, dgdegra, andrew.cooper3
Cc: boris.ostrovsky, xen-devel
vpmu structure will be used for both HVM and PV guests. Move it from
hvm_vcpu to arch_vcpu.
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Dietmar Hahn <dietmar.hahn@ts.fujitsu.com>
Tested-by: Dietmar Hahn <dietmar.hahn@ts.fujitsu.com>
---
xen/include/asm-x86/domain.h | 2 ++
xen/include/asm-x86/hvm/vcpu.h | 3 ---
xen/include/asm-x86/hvm/vpmu.h | 5 ++---
3 files changed, 4 insertions(+), 6 deletions(-)
diff --git a/xen/include/asm-x86/domain.h b/xen/include/asm-x86/domain.h
index 5eb6832..908d63f 100644
--- a/xen/include/asm-x86/domain.h
+++ b/xen/include/asm-x86/domain.h
@@ -451,6 +451,8 @@ struct arch_vcpu
void (*ctxt_switch_from) (struct vcpu *);
void (*ctxt_switch_to) (struct vcpu *);
+ struct vpmu_struct vpmu;
+
/* Virtual Machine Extensions */
union {
struct pv_vcpu pv_vcpu;
diff --git a/xen/include/asm-x86/hvm/vcpu.h b/xen/include/asm-x86/hvm/vcpu.h
index 3d8f4dc..0faf60d 100644
--- a/xen/include/asm-x86/hvm/vcpu.h
+++ b/xen/include/asm-x86/hvm/vcpu.h
@@ -151,9 +151,6 @@ struct hvm_vcpu {
u32 msr_tsc_aux;
u64 msr_tsc_adjust;
- /* VPMU */
- struct vpmu_struct vpmu;
-
union {
struct arch_vmx_struct vmx;
struct arch_svm_struct svm;
diff --git a/xen/include/asm-x86/hvm/vpmu.h b/xen/include/asm-x86/hvm/vpmu.h
index 83eea7e..82bfa0e 100644
--- a/xen/include/asm-x86/hvm/vpmu.h
+++ b/xen/include/asm-x86/hvm/vpmu.h
@@ -31,9 +31,8 @@
#define VPMU_BOOT_ENABLED 0x1 /* vpmu generally enabled. */
#define VPMU_BOOT_BTS 0x2 /* Intel BTS feature wanted. */
-#define vcpu_vpmu(vcpu) (&((vcpu)->arch.hvm_vcpu.vpmu))
-#define vpmu_vcpu(vpmu) (container_of((vpmu), struct vcpu, \
- arch.hvm_vcpu.vpmu))
+#define vcpu_vpmu(vcpu) (&(vcpu)->arch.vpmu)
+#define vpmu_vcpu(vpmu) container_of((vpmu), struct vcpu, arch.vpmu)
#define MSR_TYPE_COUNTER 0
#define MSR_TYPE_CTRL 1
--
1.8.1.4
^ permalink raw reply related [flat|nested] 37+ messages in thread
* [PATCH v25 04/15] x86/VPMU: Interface for setting PMU mode and flags
2015-06-19 18:44 [PATCH v25 00/15] x86/PMU: Xen PMU PV(H) support Boris Ostrovsky
` (2 preceding siblings ...)
2015-06-19 18:44 ` [PATCH v25 03/15] x86/VPMU: Make vpmu not HVM-specific Boris Ostrovsky
@ 2015-06-19 18:44 ` Boris Ostrovsky
2015-06-22 15:10 ` Jan Beulich
` (2 more replies)
2015-06-19 18:44 ` [PATCH v25 05/15] x86/VPMU: Initialize VPMUs with __initcall Boris Ostrovsky
` (12 subsequent siblings)
16 siblings, 3 replies; 37+ messages in thread
From: Boris Ostrovsky @ 2015-06-19 18:44 UTC (permalink / raw)
To: JBeulich, kevin.tian, suravee.suthikulpanit,
Aravind.Gopalakrishnan, dietmar.hahn, dgdegra, andrew.cooper3
Cc: boris.ostrovsky, xen-devel
Add runtime interface for setting PMU mode and flags. Three main modes are
provided:
* XENPMU_MODE_OFF: PMU is not virtualized
* XENPMU_MODE_SELF: Guests can access PMU MSRs and receive PMU interrupts.
* XENPMU_MODE_HV: Same as XENPMU_MODE_SELF for non-privileged guests, dom0
can profile itself and the hypervisor.
Note that PMU modes are different from what can be provided at Xen's boot line
with 'vpmu' argument. An 'off' (or '0') value is equivalent to XENPMU_MODE_OFF.
Any other value, on the other hand, will cause VPMU mode to be set to
XENPMU_MODE_SELF during boot.
For feature flags only Intel's BTS is currently supported.
Mode and flags are set via HYPERVISOR_xenpmu_op hypercall.
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Acked-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>
---
Changes in v25:
* Added (vpmu_features == pmu_params.val) check in do_xenpmu_op:XENPMU_mode_get
case (for consistency)
* Replaced 'return -EFAULT' with 'ret=-EFAULT' in XENPMU_feature_get
tools/flask/policy/policy/modules/xen/xen.te | 3 +
xen/arch/x86/domain.c | 4 +-
xen/arch/x86/hvm/svm/vpmu.c | 4 +-
xen/arch/x86/hvm/vmx/vpmu_core2.c | 10 +-
xen/arch/x86/hvm/vpmu.c | 159 +++++++++++++++++++++++++--
xen/arch/x86/x86_64/compat/entry.S | 4 +
xen/arch/x86/x86_64/entry.S | 4 +
xen/include/asm-x86/hvm/vpmu.h | 27 +++--
xen/include/public/pmu.h | 46 ++++++++
xen/include/public/xen.h | 1 +
xen/include/xen/hypercall.h | 4 +
xen/include/xlat.lst | 1 +
xen/include/xsm/dummy.h | 15 +++
xen/include/xsm/xsm.h | 6 +
xen/xsm/dummy.c | 1 +
xen/xsm/flask/hooks.c | 18 +++
xen/xsm/flask/policy/access_vectors | 2 +
17 files changed, 284 insertions(+), 25 deletions(-)
diff --git a/tools/flask/policy/policy/modules/xen/xen.te b/tools/flask/policy/policy/modules/xen/xen.te
index 51f59c5..45b5cb2 100644
--- a/tools/flask/policy/policy/modules/xen/xen.te
+++ b/tools/flask/policy/policy/modules/xen/xen.te
@@ -68,6 +68,9 @@ allow dom0_t xen_t:xen2 {
resource_op
psr_cmt_op
};
+allow dom0_t xen_t:xen2 {
+ pmu_ctrl
+};
allow dom0_t xen_t:mmu memorymap;
# Allow dom0 to use these domctls on itself. For domctls acting on other
diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index 0363650..dc18565 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -1546,7 +1546,7 @@ void context_switch(struct vcpu *prev, struct vcpu *next)
if ( is_hvm_domain(prevd) )
{
if (prev != next)
- vpmu_save(prev);
+ vpmu_switch_from(prev);
if ( !list_empty(&prev->arch.hvm_vcpu.tm_list) )
pt_save_timer(prev);
@@ -1591,7 +1591,7 @@ void context_switch(struct vcpu *prev, struct vcpu *next)
if (is_hvm_domain(nextd) && (prev != next) )
/* Must be done with interrupts enabled */
- vpmu_load(next);
+ vpmu_switch_to(next);
context_saved(prev);
diff --git a/xen/arch/x86/hvm/svm/vpmu.c b/xen/arch/x86/hvm/svm/vpmu.c
index a8b79df..481ea7b 100644
--- a/xen/arch/x86/hvm/svm/vpmu.c
+++ b/xen/arch/x86/hvm/svm/vpmu.c
@@ -472,14 +472,14 @@ struct arch_vpmu_ops amd_vpmu_ops = {
.arch_vpmu_dump = amd_vpmu_dump
};
-int svm_vpmu_initialise(struct vcpu *v, unsigned int vpmu_flags)
+int svm_vpmu_initialise(struct vcpu *v)
{
struct vpmu_struct *vpmu = vcpu_vpmu(v);
uint8_t family = current_cpu_data.x86;
int ret = 0;
/* vpmu enabled? */
- if ( !vpmu_flags )
+ if ( vpmu_mode == XENPMU_MODE_OFF )
return 0;
switch ( family )
diff --git a/xen/arch/x86/hvm/vmx/vpmu_core2.c b/xen/arch/x86/hvm/vmx/vpmu_core2.c
index 6fc634c..cfcdf42 100644
--- a/xen/arch/x86/hvm/vmx/vpmu_core2.c
+++ b/xen/arch/x86/hvm/vmx/vpmu_core2.c
@@ -708,13 +708,13 @@ static int core2_vpmu_do_interrupt(struct cpu_user_regs *regs)
return 1;
}
-static int core2_vpmu_initialise(struct vcpu *v, unsigned int vpmu_flags)
+static int core2_vpmu_initialise(struct vcpu *v)
{
struct vpmu_struct *vpmu = vcpu_vpmu(v);
u64 msr_content;
static bool_t ds_warned;
- if ( !(vpmu_flags & VPMU_BOOT_BTS) )
+ if ( !(vpmu_features & XENPMU_FEATURE_INTEL_BTS) )
goto func_out;
/* Check the 'Debug Store' feature in the CPUID.EAX[1]:EDX[21] */
while ( boot_cpu_has(X86_FEATURE_DS) )
@@ -826,7 +826,7 @@ struct arch_vpmu_ops core2_no_vpmu_ops = {
.do_cpuid = core2_no_vpmu_do_cpuid,
};
-int vmx_vpmu_initialise(struct vcpu *v, unsigned int vpmu_flags)
+int vmx_vpmu_initialise(struct vcpu *v)
{
struct vpmu_struct *vpmu = vcpu_vpmu(v);
uint8_t family = current_cpu_data.x86;
@@ -834,7 +834,7 @@ int vmx_vpmu_initialise(struct vcpu *v, unsigned int vpmu_flags)
int ret = 0;
vpmu->arch_vpmu_ops = &core2_no_vpmu_ops;
- if ( !vpmu_flags )
+ if ( vpmu_mode == XENPMU_MODE_OFF )
return 0;
if ( family == 6 )
@@ -884,7 +884,7 @@ int vmx_vpmu_initialise(struct vcpu *v, unsigned int vpmu_flags)
/* next gen Xeon Phi */
case 0x57:
- ret = core2_vpmu_initialise(v, vpmu_flags);
+ ret = core2_vpmu_initialise(v);
if ( !ret )
vpmu->arch_vpmu_ops = &core2_vpmu_ops;
return ret;
diff --git a/xen/arch/x86/hvm/vpmu.c b/xen/arch/x86/hvm/vpmu.c
index d829d68..3e65c9b 100644
--- a/xen/arch/x86/hvm/vpmu.c
+++ b/xen/arch/x86/hvm/vpmu.c
@@ -21,6 +21,8 @@
#include <xen/config.h>
#include <xen/sched.h>
#include <xen/xenoprof.h>
+#include <xen/event.h>
+#include <xen/guest_access.h>
#include <asm/regs.h>
#include <asm/types.h>
#include <asm/msr.h>
@@ -33,10 +35,12 @@
#include <asm/hvm/svm/vmcb.h>
#include <asm/apic.h>
#include <public/pmu.h>
+#include <xsm/xsm.h>
#include <compat/pmu.h>
CHECK_pmu_cntr_pair;
CHECK_pmu_data;
+CHECK_pmu_params;
/*
* "vpmu" : vpmu generally enabled
@@ -44,9 +48,14 @@ CHECK_pmu_data;
* "vpmu=bts" : vpmu enabled and Intel BTS feature switched on.
*/
static unsigned int __read_mostly opt_vpmu_enabled;
+unsigned int __read_mostly vpmu_mode = XENPMU_MODE_OFF;
+unsigned int __read_mostly vpmu_features = 0;
static void parse_vpmu_param(char *s);
custom_param("vpmu", parse_vpmu_param);
+static DEFINE_SPINLOCK(vpmu_lock);
+static unsigned vpmu_count;
+
static DEFINE_PER_CPU(struct vcpu *, last_vcpu);
static void __init parse_vpmu_param(char *s)
@@ -57,7 +66,7 @@ static void __init parse_vpmu_param(char *s)
break;
default:
if ( !strcmp(s, "bts") )
- opt_vpmu_enabled |= VPMU_BOOT_BTS;
+ vpmu_features |= XENPMU_FEATURE_INTEL_BTS;
else if ( *s )
{
printk("VPMU: unknown flag: %s - vpmu disabled!\n", s);
@@ -65,7 +74,9 @@ static void __init parse_vpmu_param(char *s)
}
/* fall through */
case 1:
- opt_vpmu_enabled |= VPMU_BOOT_ENABLED;
+ /* Default VPMU mode */
+ vpmu_mode = XENPMU_MODE_SELF;
+ opt_vpmu_enabled = 1;
break;
}
}
@@ -74,7 +85,7 @@ void vpmu_lvtpc_update(uint32_t val)
{
struct vpmu_struct *vpmu;
- if ( !opt_vpmu_enabled )
+ if ( vpmu_mode == XENPMU_MODE_OFF )
return;
vpmu = vcpu_vpmu(current);
@@ -87,6 +98,9 @@ int vpmu_do_wrmsr(unsigned int msr, uint64_t msr_content, uint64_t supported)
{
struct vpmu_struct *vpmu = vcpu_vpmu(current);
+ if ( vpmu_mode == XENPMU_MODE_OFF )
+ return 0;
+
if ( vpmu->arch_vpmu_ops && vpmu->arch_vpmu_ops->do_wrmsr )
return vpmu->arch_vpmu_ops->do_wrmsr(msr, msr_content, supported);
return 0;
@@ -96,6 +110,12 @@ int vpmu_do_rdmsr(unsigned int msr, uint64_t *msr_content)
{
struct vpmu_struct *vpmu = vcpu_vpmu(current);
+ if ( vpmu_mode == XENPMU_MODE_OFF )
+ {
+ *msr_content = 0;
+ return 0;
+ }
+
if ( vpmu->arch_vpmu_ops && vpmu->arch_vpmu_ops->do_rdmsr )
return vpmu->arch_vpmu_ops->do_rdmsr(msr, msr_content);
return 0;
@@ -246,28 +266,45 @@ void vpmu_initialise(struct vcpu *v)
ASSERT(!vpmu->flags && !vpmu->context);
+ /*
+ * Count active VPMUs so that we won't try to change vpmu_mode while
+ * they are in use.
+ */
+ spin_lock(&vpmu_lock);
+ vpmu_count++;
+ spin_unlock(&vpmu_lock);
+
switch ( vendor )
{
case X86_VENDOR_AMD:
- ret = svm_vpmu_initialise(v, opt_vpmu_enabled);
+ ret = svm_vpmu_initialise(v);
break;
case X86_VENDOR_INTEL:
- ret = vmx_vpmu_initialise(v, opt_vpmu_enabled);
+ ret = vmx_vpmu_initialise(v);
break;
default:
- if ( opt_vpmu_enabled )
+ if ( vpmu_mode != XENPMU_MODE_OFF )
{
printk(XENLOG_G_WARNING "VPMU: Unknown CPU vendor %d. "
"Disabling VPMU\n", vendor);
opt_vpmu_enabled = 0;
+ vpmu_mode = XENPMU_MODE_OFF;
}
- return;
+ return; /* Don't bother restoring vpmu_count, VPMU is off forever */
}
if ( ret )
printk(XENLOG_G_WARNING "VPMU: Initialization failed for %pv\n", v);
+
+ /* Intel needs to initialize VPMU ops even if VPMU is not in use */
+ if ( ret || (vpmu_mode == XENPMU_MODE_OFF) )
+ {
+ spin_lock(&vpmu_lock);
+ vpmu_count--;
+ spin_unlock(&vpmu_lock);
+ }
}
static void vpmu_clear_last(void *arg)
@@ -296,6 +333,10 @@ void vpmu_destroy(struct vcpu *v)
if ( vpmu->arch_vpmu_ops && vpmu->arch_vpmu_ops->arch_vpmu_destroy )
vpmu->arch_vpmu_ops->arch_vpmu_destroy(v);
+
+ spin_lock(&vpmu_lock);
+ vpmu_count--;
+ spin_unlock(&vpmu_lock);
}
/* Dump some vpmu informations on console. Used in keyhandler dump_domains(). */
@@ -307,6 +348,109 @@ void vpmu_dump(struct vcpu *v)
vpmu->arch_vpmu_ops->arch_vpmu_dump(v);
}
+long do_xenpmu_op(unsigned int op, XEN_GUEST_HANDLE_PARAM(xen_pmu_params_t) arg)
+{
+ int ret;
+ struct xen_pmu_params pmu_params = {.val = 0};
+
+ if ( !opt_vpmu_enabled )
+ return -EOPNOTSUPP;
+
+ ret = xsm_pmu_op(XSM_OTHER, current->domain, op);
+ if ( ret )
+ return ret;
+
+ /* Check major version when parameters are specified */
+ switch ( op )
+ {
+ case XENPMU_mode_set:
+ case XENPMU_feature_set:
+ if ( copy_from_guest(&pmu_params, arg, 1) )
+ return -EFAULT;
+
+ if ( pmu_params.version.maj != XENPMU_VER_MAJ )
+ return -EINVAL;
+ }
+
+ switch ( op )
+ {
+ case XENPMU_mode_set:
+ {
+ if ( (pmu_params.val & ~(XENPMU_MODE_SELF | XENPMU_MODE_HV)) ||
+ (hweight64(pmu_params.val) > 1) )
+ return -EINVAL;
+
+ /* 32-bit dom0 can only sample itself. */
+ if ( is_pv_32bit_vcpu(current) && (pmu_params.val & XENPMU_MODE_HV) )
+ return -EINVAL;
+
+ spin_lock(&vpmu_lock);
+
+ /*
+ * We can always safely switch between XENPMU_MODE_SELF and
+ * XENPMU_MODE_HV while other VPMUs are active.
+ */
+ if ( (vpmu_count == 0) || (vpmu_mode == pmu_params.val) ||
+ ((vpmu_mode ^ pmu_params.val) ==
+ (XENPMU_MODE_SELF | XENPMU_MODE_HV)) )
+ vpmu_mode = pmu_params.val;
+ else
+ {
+ printk(XENLOG_WARNING "VPMU: Cannot change mode while"
+ " active VPMUs exist\n");
+ ret = -EBUSY;
+ }
+
+ spin_unlock(&vpmu_lock);
+
+ break;
+ }
+
+ case XENPMU_mode_get:
+ memset(&pmu_params, 0, sizeof(pmu_params));
+ pmu_params.val = vpmu_mode;
+
+ pmu_params.version.maj = XENPMU_VER_MAJ;
+ pmu_params.version.min = XENPMU_VER_MIN;
+
+ if ( copy_to_guest(arg, &pmu_params, 1) )
+ ret = -EFAULT;
+
+ break;
+
+ case XENPMU_feature_set:
+ if ( pmu_params.val & ~XENPMU_FEATURE_INTEL_BTS )
+ return -EINVAL;
+
+ spin_lock(&vpmu_lock);
+
+ if ( (vpmu_count == 0) || (vpmu_features == pmu_params.val) )
+ vpmu_features = pmu_params.val;
+ else
+ {
+ printk(XENLOG_WARNING "VPMU: Cannot change features while"
+ " active VPMUs exist\n");
+ ret = -EBUSY;
+ }
+
+ spin_unlock(&vpmu_lock);
+
+ break;
+
+ case XENPMU_feature_get:
+ pmu_params.val = vpmu_features;
+ if ( copy_field_to_guest(arg, &pmu_params, val) )
+ ret = -EFAULT;
+
+ break;
+
+ default:
+ ret = -EINVAL;
+ }
+
+ return ret;
+}
+
static int __init vpmu_init(void)
{
/* NMI watchdog uses LVTPC and HW counter */
@@ -314,6 +458,7 @@ static int __init vpmu_init(void)
{
printk(XENLOG_WARNING "NMI watchdog is enabled. Turning VPMU off.\n");
opt_vpmu_enabled = 0;
+ vpmu_mode = XENPMU_MODE_OFF;
}
return 0;
diff --git a/xen/arch/x86/x86_64/compat/entry.S b/xen/arch/x86/x86_64/compat/entry.S
index 46f340b..1521779 100644
--- a/xen/arch/x86/x86_64/compat/entry.S
+++ b/xen/arch/x86/x86_64/compat/entry.S
@@ -430,6 +430,8 @@ ENTRY(compat_hypercall_table)
.quad do_domctl
.quad compat_kexec_op
.quad do_tmem_op
+ .quad do_ni_hypercall /* reserved for XenClient */
+ .quad do_xenpmu_op /* 40 */
.rept __HYPERVISOR_arch_0-((.-compat_hypercall_table)/8)
.quad compat_ni_hypercall
.endr
@@ -479,6 +481,8 @@ ENTRY(compat_hypercall_args_table)
.byte 1 /* do_domctl */
.byte 2 /* compat_kexec_op */
.byte 1 /* do_tmem_op */
+ .byte 0 /* reserved for XenClient */
+ .byte 2 /* do_xenpmu_op */ /* 40 */
.rept __HYPERVISOR_arch_0-(.-compat_hypercall_args_table)
.byte 0 /* compat_ni_hypercall */
.endr
diff --git a/xen/arch/x86/x86_64/entry.S b/xen/arch/x86/x86_64/entry.S
index 4a37642..74677a2 100644
--- a/xen/arch/x86/x86_64/entry.S
+++ b/xen/arch/x86/x86_64/entry.S
@@ -763,6 +763,8 @@ ENTRY(hypercall_table)
.quad do_domctl
.quad do_kexec_op
.quad do_tmem_op
+ .quad do_ni_hypercall /* reserved for XenClient */
+ .quad do_xenpmu_op /* 40 */
.rept __HYPERVISOR_arch_0-((.-hypercall_table)/8)
.quad do_ni_hypercall
.endr
@@ -812,6 +814,8 @@ ENTRY(hypercall_args_table)
.byte 1 /* do_domctl */
.byte 2 /* do_kexec */
.byte 1 /* do_tmem_op */
+ .byte 0 /* reserved for XenClient */
+ .byte 2 /* do_xenpmu_op */ /* 40 */
.rept __HYPERVISOR_arch_0-(.-hypercall_args_table)
.byte 0 /* do_ni_hypercall */
.endr
diff --git a/xen/include/asm-x86/hvm/vpmu.h b/xen/include/asm-x86/hvm/vpmu.h
index 82bfa0e..88ffc19 100644
--- a/xen/include/asm-x86/hvm/vpmu.h
+++ b/xen/include/asm-x86/hvm/vpmu.h
@@ -24,13 +24,6 @@
#include <public/pmu.h>
-/*
- * Flag bits given as a string on the hypervisor boot parameter 'vpmu'.
- * See arch/x86/hvm/vpmu.c.
- */
-#define VPMU_BOOT_ENABLED 0x1 /* vpmu generally enabled. */
-#define VPMU_BOOT_BTS 0x2 /* Intel BTS feature wanted. */
-
#define vcpu_vpmu(vcpu) (&(vcpu)->arch.vpmu)
#define vpmu_vcpu(vpmu) container_of((vpmu), struct vcpu, arch.vpmu)
@@ -59,8 +52,8 @@ struct arch_vpmu_ops {
void (*arch_vpmu_dump)(const struct vcpu *);
};
-int vmx_vpmu_initialise(struct vcpu *, unsigned int flags);
-int svm_vpmu_initialise(struct vcpu *, unsigned int flags);
+int vmx_vpmu_initialise(struct vcpu *);
+int svm_vpmu_initialise(struct vcpu *);
struct vpmu_struct {
u32 flags;
@@ -116,5 +109,21 @@ void vpmu_dump(struct vcpu *v);
extern int acquire_pmu_ownership(int pmu_ownership);
extern void release_pmu_ownership(int pmu_ownership);
+extern unsigned int vpmu_mode;
+extern unsigned int vpmu_features;
+
+/* Context switch */
+static inline void vpmu_switch_from(struct vcpu *prev)
+{
+ if ( vpmu_mode & (XENPMU_MODE_SELF | XENPMU_MODE_HV) )
+ vpmu_save(prev);
+}
+
+static inline void vpmu_switch_to(struct vcpu *next)
+{
+ if ( vpmu_mode & (XENPMU_MODE_SELF | XENPMU_MODE_HV) )
+ vpmu_load(next);
+}
+
#endif /* __ASM_X86_HVM_VPMU_H_*/
diff --git a/xen/include/public/pmu.h b/xen/include/public/pmu.h
index 844159e..50cc048 100644
--- a/xen/include/public/pmu.h
+++ b/xen/include/public/pmu.h
@@ -14,6 +14,52 @@
#define XENPMU_VER_MIN 1
/*
+ * ` enum neg_errnoval
+ * ` HYPERVISOR_xenpmu_op(enum xenpmu_op cmd, struct xenpmu_params *args);
+ *
+ * @cmd == XENPMU_* (PMU operation)
+ * @args == struct xenpmu_params
+ */
+/* ` enum xenpmu_op { */
+#define XENPMU_mode_get 0 /* Also used for getting PMU version */
+#define XENPMU_mode_set 1
+#define XENPMU_feature_get 2
+#define XENPMU_feature_set 3
+/* ` } */
+
+/* Parameters structure for HYPERVISOR_xenpmu_op call */
+struct xen_pmu_params {
+ /* IN/OUT parameters */
+ struct {
+ uint32_t maj;
+ uint32_t min;
+ } version;
+ uint64_t val;
+
+ /* IN parameters */
+ uint32_t vcpu;
+ uint32_t pad;
+};
+typedef struct xen_pmu_params xen_pmu_params_t;
+DEFINE_XEN_GUEST_HANDLE(xen_pmu_params_t);
+
+/* PMU modes:
+ * - XENPMU_MODE_OFF: No PMU virtualization
+ * - XENPMU_MODE_SELF: Guests can profile themselves
+ * - XENPMU_MODE_HV: Guests can profile themselves, dom0 profiles
+ * itself and Xen
+ */
+#define XENPMU_MODE_OFF 0
+#define XENPMU_MODE_SELF (1<<0)
+#define XENPMU_MODE_HV (1<<1)
+
+/*
+ * PMU features:
+ * - XENPMU_FEATURE_INTEL_BTS: Intel BTS support (ignored on AMD)
+ */
+#define XENPMU_FEATURE_INTEL_BTS 1
+
+/*
* Shared PMU data between hypervisor and PV(H) domains.
*
* The hypervisor fills out this structure during PMU interrupt and sends an
diff --git a/xen/include/public/xen.h b/xen/include/public/xen.h
index 17ecb94..59fb544 100644
--- a/xen/include/public/xen.h
+++ b/xen/include/public/xen.h
@@ -101,6 +101,7 @@ DEFINE_XEN_GUEST_HANDLE(xen_ulong_t);
#define __HYPERVISOR_kexec_op 37
#define __HYPERVISOR_tmem_op 38
#define __HYPERVISOR_xc_reserved_op 39 /* reserved for XenClient */
+#define __HYPERVISOR_xenpmu_op 40
/* Architecture-specific hypercall definitions. */
#define __HYPERVISOR_arch_0 48
diff --git a/xen/include/xen/hypercall.h b/xen/include/xen/hypercall.h
index 79fe114..26cb615 100644
--- a/xen/include/xen/hypercall.h
+++ b/xen/include/xen/hypercall.h
@@ -14,6 +14,7 @@
#include <public/event_channel.h>
#include <public/tmem.h>
#include <public/version.h>
+#include <public/pmu.h>
#include <asm/hypercall.h>
#include <xsm/xsm.h>
@@ -139,6 +140,9 @@ do_tmem_op(
extern long
do_xenoprof_op(int op, XEN_GUEST_HANDLE_PARAM(void) arg);
+extern long
+do_xenpmu_op(unsigned int op, XEN_GUEST_HANDLE_PARAM(xen_pmu_params_t) arg);
+
#ifdef CONFIG_COMPAT
extern int
diff --git a/xen/include/xlat.lst b/xen/include/xlat.lst
index 997d2db..7fdd0ea 100644
--- a/xen/include/xlat.lst
+++ b/xen/include/xlat.lst
@@ -99,6 +99,7 @@
? xenpf_pcpu_version platform.h
? xenpf_resource_entry platform.h
? pmu_data pmu.h
+? pmu_params pmu.h
! sched_poll sched.h
? sched_remote_shutdown sched.h
? sched_shutdown sched.h
diff --git a/xen/include/xsm/dummy.h b/xen/include/xsm/dummy.h
index f044c0f..a496116 100644
--- a/xen/include/xsm/dummy.h
+++ b/xen/include/xsm/dummy.h
@@ -694,4 +694,19 @@ static XSM_INLINE int xsm_ioport_mapping(XSM_DEFAULT_ARG struct domain *d, uint3
return xsm_default_action(action, current->domain, d);
}
+static XSM_INLINE int xsm_pmu_op (XSM_DEFAULT_ARG struct domain *d, int op)
+{
+ XSM_ASSERT_ACTION(XSM_OTHER);
+ switch ( op )
+ {
+ case XENPMU_mode_set:
+ case XENPMU_mode_get:
+ case XENPMU_feature_set:
+ case XENPMU_feature_get:
+ return xsm_default_action(XSM_PRIV, d, current->domain);
+ default:
+ return -EPERM;
+ }
+}
+
#endif /* CONFIG_X86 */
diff --git a/xen/include/xsm/xsm.h b/xen/include/xsm/xsm.h
index c872d44..4e555d5 100644
--- a/xen/include/xsm/xsm.h
+++ b/xen/include/xsm/xsm.h
@@ -188,6 +188,7 @@ struct xsm_operations {
int (*priv_mapping) (struct domain *d, struct domain *t);
int (*ioport_permission) (struct domain *d, uint32_t s, uint32_t e, uint8_t allow);
int (*ioport_mapping) (struct domain *d, uint32_t s, uint32_t e, uint8_t allow);
+ int (*pmu_op) (struct domain *d, unsigned int op);
#endif
};
@@ -715,6 +716,11 @@ static inline int xsm_ioport_mapping (xsm_default_t def, struct domain *d, uint3
return xsm_ops->ioport_mapping(d, s, e, allow);
}
+static inline int xsm_pmu_op (xsm_default_t def, struct domain *d, int op)
+{
+ return xsm_ops->pmu_op(d, op);
+}
+
#endif /* CONFIG_X86 */
#endif /* XSM_NO_WRAPPERS */
diff --git a/xen/xsm/dummy.c b/xen/xsm/dummy.c
index e84b0e4..06ac911 100644
--- a/xen/xsm/dummy.c
+++ b/xen/xsm/dummy.c
@@ -159,5 +159,6 @@ void xsm_fixup_ops (struct xsm_operations *ops)
set_to_dummy_if_null(ops, priv_mapping);
set_to_dummy_if_null(ops, ioport_permission);
set_to_dummy_if_null(ops, ioport_mapping);
+ set_to_dummy_if_null(ops, pmu_op);
#endif
}
diff --git a/xen/xsm/flask/hooks.c b/xen/xsm/flask/hooks.c
index b4aae27..ac98966 100644
--- a/xen/xsm/flask/hooks.c
+++ b/xen/xsm/flask/hooks.c
@@ -1579,6 +1579,23 @@ static int flask_priv_mapping(struct domain *d, struct domain *t)
{
return domain_has_perm(d, t, SECCLASS_MMU, MMU__TARGET_HACK);
}
+
+static int flask_pmu_op (struct domain *d, unsigned int op)
+{
+ u32 dsid = domain_sid(d);
+
+ switch ( op )
+ {
+ case XENPMU_mode_set:
+ case XENPMU_mode_get:
+ case XENPMU_feature_set:
+ case XENPMU_feature_get:
+ return avc_has_perm(dsid, SECINITSID_XEN, SECCLASS_XEN2,
+ XEN2__PMU_CTRL, NULL);
+ default:
+ return -EPERM;
+ }
+}
#endif /* CONFIG_X86 */
long do_flask_op(XEN_GUEST_HANDLE_PARAM(xsm_op_t) u_flask_op);
@@ -1716,6 +1733,7 @@ static struct xsm_operations flask_ops = {
.priv_mapping = flask_priv_mapping,
.ioport_permission = flask_ioport_permission,
.ioport_mapping = flask_ioport_mapping,
+ .pmu_op = flask_pmu_op,
#endif
};
diff --git a/xen/xsm/flask/policy/access_vectors b/xen/xsm/flask/policy/access_vectors
index b35a150..3a5d798 100644
--- a/xen/xsm/flask/policy/access_vectors
+++ b/xen/xsm/flask/policy/access_vectors
@@ -87,6 +87,8 @@ class xen2
psr_cmt_op
# XENPF_get_symbol
get_symbol
+# PMU control
+ pmu_ctrl
}
# Classes domain and domain2 consist of operations that a domain performs on
--
1.8.1.4
^ permalink raw reply related [flat|nested] 37+ messages in thread
* Re: [PATCH v25 04/15] x86/VPMU: Interface for setting PMU mode and flags
2015-06-19 18:44 ` [PATCH v25 04/15] x86/VPMU: Interface for setting PMU mode and flags Boris Ostrovsky
@ 2015-06-22 15:10 ` Jan Beulich
2015-06-22 16:10 ` Boris Ostrovsky
2015-07-07 7:16 ` Dietmar Hahn
2015-07-09 1:27 ` Tian, Kevin
2 siblings, 1 reply; 37+ messages in thread
From: Jan Beulich @ 2015-06-22 15:10 UTC (permalink / raw)
To: Boris Ostrovsky
Cc: kevin.tian, andrew.cooper3, dietmar.hahn, xen-devel,
Aravind.Gopalakrishnan, suravee.suthikulpanit, dgdegra
>>> On 19.06.15 at 20:44, <boris.ostrovsky@oracle.com> wrote:
> Add runtime interface for setting PMU mode and flags. Three main modes are
> provided:
> * XENPMU_MODE_OFF: PMU is not virtualized
> * XENPMU_MODE_SELF: Guests can access PMU MSRs and receive PMU interrupts.
> * XENPMU_MODE_HV: Same as XENPMU_MODE_SELF for non-privileged guests, dom0
> can profile itself and the hypervisor.
>
> Note that PMU modes are different from what can be provided at Xen's boot
> line
> with 'vpmu' argument. An 'off' (or '0') value is equivalent to
> XENPMU_MODE_OFF.
> Any other value, on the other hand, will cause VPMU mode to be set to
> XENPMU_MODE_SELF during boot.
>
> For feature flags only Intel's BTS is currently supported.
>
> Mode and flags are set via HYPERVISOR_xenpmu_op hypercall.
>
> Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
> Acked-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>
Acked-by: Jan Beulich <jbeulich@suse.com>
with one minor comment
> + switch ( op )
> + {
> + case XENPMU_mode_set:
> + {
> + if ( (pmu_params.val & ~(XENPMU_MODE_SELF | XENPMU_MODE_HV)) ||
> + (hweight64(pmu_params.val) > 1) )
> + return -EINVAL;
> +
> + /* 32-bit dom0 can only sample itself. */
> + if ( is_pv_32bit_vcpu(current) && (pmu_params.val & XENPMU_MODE_HV) )
> + return -EINVAL;
> +
> + spin_lock(&vpmu_lock);
> +
> + /*
> + * We can always safely switch between XENPMU_MODE_SELF and
> + * XENPMU_MODE_HV while other VPMUs are active.
> + */
> + if ( (vpmu_count == 0) || (vpmu_mode == pmu_params.val) ||
> + ((vpmu_mode ^ pmu_params.val) ==
> + (XENPMU_MODE_SELF | XENPMU_MODE_HV)) )
> + vpmu_mode = pmu_params.val;
> + else
> + {
I think this would better be
if ( (vpmu_count == 0) ||
((vpmu_mode ^ pmu_params.val) ==
(XENPMU_MODE_SELF | XENPMU_MODE_HV)) )
vpmu_mode = pmu_params.val;
else if ( vpmu_mode != pmu_params.val )
{
But there's no need to re-submit just because of this (it could be
changed upon commit if you agree).
Or wait - I just checked again, and while I thought this was long
addressed I still don't seen struct xen_pmu_params "pad" field
being checked to be zero as input, and being stored as zero when
only an output. Did this get lost? Did I forget about a reason why
this isn't being done? Unless the latter is the case, the ack above
is dependent upon this getting fixed.
Jan
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH v25 04/15] x86/VPMU: Interface for setting PMU mode and flags
2015-06-22 15:10 ` Jan Beulich
@ 2015-06-22 16:10 ` Boris Ostrovsky
2015-06-23 8:26 ` Jan Beulich
0 siblings, 1 reply; 37+ messages in thread
From: Boris Ostrovsky @ 2015-06-22 16:10 UTC (permalink / raw)
To: Jan Beulich
Cc: kevin.tian, andrew.cooper3, dietmar.hahn, xen-devel,
Aravind.Gopalakrishnan, suravee.suthikulpanit, dgdegra
On 06/22/2015 11:10 AM, Jan Beulich wrote:
>
>> + switch ( op )
>> + {
>> + case XENPMU_mode_set:
>> + {
>> + if ( (pmu_params.val & ~(XENPMU_MODE_SELF | XENPMU_MODE_HV)) ||
>> + (hweight64(pmu_params.val) > 1) )
>> + return -EINVAL;
>> +
>> + /* 32-bit dom0 can only sample itself. */
>> + if ( is_pv_32bit_vcpu(current) && (pmu_params.val & XENPMU_MODE_HV) )
>> + return -EINVAL;
>> +
>> + spin_lock(&vpmu_lock);
>> +
>> + /*
>> + * We can always safely switch between XENPMU_MODE_SELF and
>> + * XENPMU_MODE_HV while other VPMUs are active.
>> + */
>> + if ( (vpmu_count == 0) || (vpmu_mode == pmu_params.val) ||
>> + ((vpmu_mode ^ pmu_params.val) ==
>> + (XENPMU_MODE_SELF | XENPMU_MODE_HV)) )
>> + vpmu_mode = pmu_params.val;
>> + else
>> + {
> I think this would better be
>
> if ( (vpmu_count == 0) ||
> ((vpmu_mode ^ pmu_params.val) ==
> (XENPMU_MODE_SELF | XENPMU_MODE_HV)) )
> vpmu_mode = pmu_params.val;
> else if ( vpmu_mode != pmu_params.val )
> {
>
> But there's no need to re-submit just because of this (it could be
> changed upon commit if you agree).
This will generate an error (with a message to the log) even if we keep
the mode unchanged, something that I wanted to avoid. (Same thing for
XENPMU_feature_set, which is what Kevin mentioned last time).
>
> Or wait - I just checked again, and while I thought this was long
> addressed I still don't seen struct xen_pmu_params "pad" field
> being checked to be zero as input, and being stored as zero when
> only an output. Did this get lost? Did I forget about a reason why
> this isn't being done? Unless the latter is the case, the ack above
> is dependent upon this getting fixed.
Yes, we did talk about this and as result I added major version check
(which I should have had there anyway): if we decide to start using this
field we'd need to increment the major version.
-boris
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH v25 04/15] x86/VPMU: Interface for setting PMU mode and flags
2015-06-22 16:10 ` Boris Ostrovsky
@ 2015-06-23 8:26 ` Jan Beulich
2015-06-24 2:17 ` Boris Ostrovsky
0 siblings, 1 reply; 37+ messages in thread
From: Jan Beulich @ 2015-06-23 8:26 UTC (permalink / raw)
To: Boris Ostrovsky
Cc: kevin.tian, andrew.cooper3, dietmar.hahn, xen-devel,
Aravind.Gopalakrishnan, suravee.suthikulpanit, dgdegra
>>> On 22.06.15 at 18:10, <boris.ostrovsky@oracle.com> wrote:
> On 06/22/2015 11:10 AM, Jan Beulich wrote:
>>
>>> + switch ( op )
>>> + {
>>> + case XENPMU_mode_set:
>>> + {
>>> + if ( (pmu_params.val & ~(XENPMU_MODE_SELF | XENPMU_MODE_HV)) ||
>>> + (hweight64(pmu_params.val) > 1) )
>>> + return -EINVAL;
>>> +
>>> + /* 32-bit dom0 can only sample itself. */
>>> + if ( is_pv_32bit_vcpu(current) && (pmu_params.val & XENPMU_MODE_HV) )
>>> + return -EINVAL;
>>> +
>>> + spin_lock(&vpmu_lock);
>>> +
>>> + /*
>>> + * We can always safely switch between XENPMU_MODE_SELF and
>>> + * XENPMU_MODE_HV while other VPMUs are active.
>>> + */
>>> + if ( (vpmu_count == 0) || (vpmu_mode == pmu_params.val) ||
>>> + ((vpmu_mode ^ pmu_params.val) ==
>>> + (XENPMU_MODE_SELF | XENPMU_MODE_HV)) )
>>> + vpmu_mode = pmu_params.val;
>>> + else
>>> + {
>> I think this would better be
>>
>> if ( (vpmu_count == 0) ||
>> ((vpmu_mode ^ pmu_params.val) ==
>> (XENPMU_MODE_SELF | XENPMU_MODE_HV)) )
>> vpmu_mode = pmu_params.val;
>> else if ( vpmu_mode != pmu_params.val )
>> {
>>
>> But there's no need to re-submit just because of this (it could be
>> changed upon commit if you agree).
>
> This will generate an error (with a message to the log) even if we keep
> the mode unchanged, something that I wanted to avoid. (Same thing for
> XENPMU_feature_set, which is what Kevin mentioned last time).
I don't see this: The only difference to the code you have is -
afaics - that my variant avoids the pointless write to vpmu_mode.
>> Or wait - I just checked again, and while I thought this was long
>> addressed I still don't seen struct xen_pmu_params "pad" field
>> being checked to be zero as input, and being stored as zero when
>> only an output. Did this get lost? Did I forget about a reason why
>> this isn't being done? Unless the latter is the case, the ack above
>> is dependent upon this getting fixed.
>
> Yes, we did talk about this and as result I added major version check
> (which I should have had there anyway): if we decide to start using this
> field we'd need to increment the major version.
Okay then.
Jan
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH v25 04/15] x86/VPMU: Interface for setting PMU mode and flags
2015-06-23 8:26 ` Jan Beulich
@ 2015-06-24 2:17 ` Boris Ostrovsky
0 siblings, 0 replies; 37+ messages in thread
From: Boris Ostrovsky @ 2015-06-24 2:17 UTC (permalink / raw)
To: Jan Beulich
Cc: kevin.tian, andrew.cooper3, dietmar.hahn, xen-devel,
Aravind.Gopalakrishnan, suravee.suthikulpanit, dgdegra
On 06/23/2015 04:26 AM, Jan Beulich wrote:
>>>> On 22.06.15 at 18:10, <boris.ostrovsky@oracle.com> wrote:
>> On 06/22/2015 11:10 AM, Jan Beulich wrote:
>>>> + switch ( op )
>>>> + {
>>>> + case XENPMU_mode_set:
>>>> + {
>>>> + if ( (pmu_params.val & ~(XENPMU_MODE_SELF | XENPMU_MODE_HV)) ||
>>>> + (hweight64(pmu_params.val) > 1) )
>>>> + return -EINVAL;
>>>> +
>>>> + /* 32-bit dom0 can only sample itself. */
>>>> + if ( is_pv_32bit_vcpu(current) && (pmu_params.val & XENPMU_MODE_HV) )
>>>> + return -EINVAL;
>>>> +
>>>> + spin_lock(&vpmu_lock);
>>>> +
>>>> + /*
>>>> + * We can always safely switch between XENPMU_MODE_SELF and
>>>> + * XENPMU_MODE_HV while other VPMUs are active.
>>>> + */
>>>> + if ( (vpmu_count == 0) || (vpmu_mode == pmu_params.val) ||
>>>> + ((vpmu_mode ^ pmu_params.val) ==
>>>> + (XENPMU_MODE_SELF | XENPMU_MODE_HV)) )
>>>> + vpmu_mode = pmu_params.val;
>>>> + else
>>>> + {
>>> I think this would better be
>>>
>>> if ( (vpmu_count == 0) ||
>>> ((vpmu_mode ^ pmu_params.val) ==
>>> (XENPMU_MODE_SELF | XENPMU_MODE_HV)) )
>>> vpmu_mode = pmu_params.val;
>>> else if ( vpmu_mode != pmu_params.val )
>>> {
>>>
>>> But there's no need to re-submit just because of this (it could be
>>> changed upon commit if you agree).
>> This will generate an error (with a message to the log) even if we keep
>> the mode unchanged, something that I wanted to avoid. (Same thing for
>> XENPMU_feature_set, which is what Kevin mentioned last time).
> I don't see this: The only difference to the code you have is -
> afaics - that my variant avoids the pointless write to vpmu_mode.
Oh, right, I was only looking at the change that you made to 'if', not
to the 'else'. Yes, that's good.
-boris
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH v25 04/15] x86/VPMU: Interface for setting PMU mode and flags
2015-06-19 18:44 ` [PATCH v25 04/15] x86/VPMU: Interface for setting PMU mode and flags Boris Ostrovsky
2015-06-22 15:10 ` Jan Beulich
@ 2015-07-07 7:16 ` Dietmar Hahn
2015-07-09 1:27 ` Tian, Kevin
2 siblings, 0 replies; 37+ messages in thread
From: Dietmar Hahn @ 2015-07-07 7:16 UTC (permalink / raw)
To: Boris Ostrovsky
Cc: kevin.tian, JBeulich, andrew.cooper3, xen-devel,
Aravind.Gopalakrishnan, suravee.suthikulpanit, dgdegra
Am Freitag 19 Juni 2015, 14:44:35 schrieb Boris Ostrovsky:
> Add runtime interface for setting PMU mode and flags. Three main modes are
> provided:
> * XENPMU_MODE_OFF: PMU is not virtualized
> * XENPMU_MODE_SELF: Guests can access PMU MSRs and receive PMU interrupts.
> * XENPMU_MODE_HV: Same as XENPMU_MODE_SELF for non-privileged guests, dom0
> can profile itself and the hypervisor.
>
> Note that PMU modes are different from what can be provided at Xen's boot line
> with 'vpmu' argument. An 'off' (or '0') value is equivalent to XENPMU_MODE_OFF.
> Any other value, on the other hand, will cause VPMU mode to be set to
> XENPMU_MODE_SELF during boot.
>
> For feature flags only Intel's BTS is currently supported.
>
> Mode and flags are set via HYPERVISOR_xenpmu_op hypercall.
>
> Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
> Acked-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>
>
> ---
> Changes in v25:
> * Added (vpmu_features == pmu_params.val) check in do_xenpmu_op:XENPMU_mode_get
You mean in do_xenpmu_op:XENPMU_feature_set I think?
Reviewed-by: Dietmar Hahn <dietmar.hahn@ts.fujitsu.com>
Dietmar.
--
Company details: http://ts.fujitsu.com/imprint.html
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH v25 04/15] x86/VPMU: Interface for setting PMU mode and flags
2015-06-19 18:44 ` [PATCH v25 04/15] x86/VPMU: Interface for setting PMU mode and flags Boris Ostrovsky
2015-06-22 15:10 ` Jan Beulich
2015-07-07 7:16 ` Dietmar Hahn
@ 2015-07-09 1:27 ` Tian, Kevin
2 siblings, 0 replies; 37+ messages in thread
From: Tian, Kevin @ 2015-07-09 1:27 UTC (permalink / raw)
To: Boris Ostrovsky, JBeulich, suravee.suthikulpanit,
Aravind.Gopalakrishnan, dietmar.hahn, dgdegra, andrew.cooper3
Cc: xen-devel
> From: Boris Ostrovsky [mailto:boris.ostrovsky@oracle.com]
> Sent: Saturday, June 20, 2015 2:45 AM
>
> Add runtime interface for setting PMU mode and flags. Three main modes are
> provided:
> * XENPMU_MODE_OFF: PMU is not virtualized
> * XENPMU_MODE_SELF: Guests can access PMU MSRs and receive PMU interrupts.
> * XENPMU_MODE_HV: Same as XENPMU_MODE_SELF for non-privileged guests, dom0
> can profile itself and the hypervisor.
>
> Note that PMU modes are different from what can be provided at Xen's boot line
> with 'vpmu' argument. An 'off' (or '0') value is equivalent to XENPMU_MODE_OFF.
> Any other value, on the other hand, will cause VPMU mode to be set to
> XENPMU_MODE_SELF during boot.
>
> For feature flags only Intel's BTS is currently supported.
>
> Mode and flags are set via HYPERVISOR_xenpmu_op hypercall.
>
> Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
> Acked-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>
>
Acked-by: Kevin Tian <kevin.tian@intel.com>
^ permalink raw reply [flat|nested] 37+ messages in thread
* [PATCH v25 05/15] x86/VPMU: Initialize VPMUs with __initcall
2015-06-19 18:44 [PATCH v25 00/15] x86/PMU: Xen PMU PV(H) support Boris Ostrovsky
` (3 preceding siblings ...)
2015-06-19 18:44 ` [PATCH v25 04/15] x86/VPMU: Interface for setting PMU mode and flags Boris Ostrovsky
@ 2015-06-19 18:44 ` Boris Ostrovsky
2015-07-08 11:48 ` Dietmar Hahn
2015-07-09 6:04 ` Tian, Kevin
2015-06-19 18:44 ` [PATCH v25 06/15] x86/VPMU: Initialize PMU for PV(H) guests Boris Ostrovsky
` (11 subsequent siblings)
16 siblings, 2 replies; 37+ messages in thread
From: Boris Ostrovsky @ 2015-06-19 18:44 UTC (permalink / raw)
To: JBeulich, kevin.tian, suravee.suthikulpanit,
Aravind.Gopalakrishnan, dietmar.hahn, dgdegra, andrew.cooper3
Cc: boris.ostrovsky, xen-devel
Move some VPMU initialization operations into __initcalls to avoid performing
same tests and calculations for each vcpu.
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
---
xen/arch/x86/hvm/svm/vpmu.c | 106 ++++++++++++--------------
xen/arch/x86/hvm/vmx/vpmu_core2.c | 151 +++++++++++++++++++-------------------
xen/arch/x86/hvm/vpmu.c | 32 ++++++++
xen/include/asm-x86/hvm/vpmu.h | 2 +
4 files changed, 156 insertions(+), 135 deletions(-)
diff --git a/xen/arch/x86/hvm/svm/vpmu.c b/xen/arch/x86/hvm/svm/vpmu.c
index 481ea7b..b60ca40 100644
--- a/xen/arch/x86/hvm/svm/vpmu.c
+++ b/xen/arch/x86/hvm/svm/vpmu.c
@@ -356,54 +356,6 @@ static int amd_vpmu_do_rdmsr(unsigned int msr, uint64_t *msr_content)
return 1;
}
-static int amd_vpmu_initialise(struct vcpu *v)
-{
- struct xen_pmu_amd_ctxt *ctxt;
- struct vpmu_struct *vpmu = vcpu_vpmu(v);
- uint8_t family = current_cpu_data.x86;
-
- if ( counters == NULL )
- {
- switch ( family )
- {
- case 0x15:
- num_counters = F15H_NUM_COUNTERS;
- counters = AMD_F15H_COUNTERS;
- ctrls = AMD_F15H_CTRLS;
- k7_counters_mirrored = 1;
- break;
- case 0x10:
- case 0x12:
- case 0x14:
- case 0x16:
- default:
- num_counters = F10H_NUM_COUNTERS;
- counters = AMD_F10H_COUNTERS;
- ctrls = AMD_F10H_CTRLS;
- k7_counters_mirrored = 0;
- break;
- }
- }
-
- ctxt = xzalloc_bytes(sizeof(*ctxt) +
- 2 * sizeof(uint64_t) * num_counters);
- if ( !ctxt )
- {
- gdprintk(XENLOG_WARNING, "Insufficient memory for PMU, "
- " PMU feature is unavailable on domain %d vcpu %d.\n",
- v->vcpu_id, v->domain->domain_id);
- return -ENOMEM;
- }
-
- ctxt->counters = sizeof(*ctxt);
- ctxt->ctrls = ctxt->counters + sizeof(uint64_t) * num_counters;
-
- vpmu->context = ctxt;
- vpmu->priv_context = NULL;
- vpmu_set(vpmu, VPMU_CONTEXT_ALLOCATED);
- return 0;
-}
-
static void amd_vpmu_destroy(struct vcpu *v)
{
struct vpmu_struct *vpmu = vcpu_vpmu(v);
@@ -474,30 +426,62 @@ struct arch_vpmu_ops amd_vpmu_ops = {
int svm_vpmu_initialise(struct vcpu *v)
{
+ struct xen_pmu_amd_ctxt *ctxt;
struct vpmu_struct *vpmu = vcpu_vpmu(v);
- uint8_t family = current_cpu_data.x86;
- int ret = 0;
- /* vpmu enabled? */
if ( vpmu_mode == XENPMU_MODE_OFF )
return 0;
- switch ( family )
+ if ( !counters )
+ return -EINVAL;
+
+ ctxt = xzalloc_bytes(sizeof(*ctxt) +
+ 2 * sizeof(uint64_t) * num_counters);
+ if ( !ctxt )
{
+ printk(XENLOG_G_WARNING "Insufficient memory for PMU, "
+ " PMU feature is unavailable on domain %d vcpu %d.\n",
+ v->vcpu_id, v->domain->domain_id);
+ return -ENOMEM;
+ }
+
+ ctxt->counters = sizeof(*ctxt);
+ ctxt->ctrls = ctxt->counters + sizeof(uint64_t) * num_counters;
+
+ vpmu->context = ctxt;
+ vpmu->priv_context = NULL;
+
+ vpmu->arch_vpmu_ops = &amd_vpmu_ops;
+
+ vpmu_set(vpmu, VPMU_CONTEXT_ALLOCATED);
+ return 0;
+}
+
+int __init amd_vpmu_init(void)
+{
+ switch ( current_cpu_data.x86 )
+ {
+ case 0x15:
+ num_counters = F15H_NUM_COUNTERS;
+ counters = AMD_F15H_COUNTERS;
+ ctrls = AMD_F15H_CTRLS;
+ k7_counters_mirrored = 1;
+ break;
case 0x10:
case 0x12:
case 0x14:
- case 0x15:
case 0x16:
- ret = amd_vpmu_initialise(v);
- if ( !ret )
- vpmu->arch_vpmu_ops = &amd_vpmu_ops;
- return ret;
+ num_counters = F10H_NUM_COUNTERS;
+ counters = AMD_F10H_COUNTERS;
+ ctrls = AMD_F10H_CTRLS;
+ k7_counters_mirrored = 0;
+ break;
+ default:
+ printk(XENLOG_WARNING "VPMU: Unsupported CPU family %#x\n",
+ current_cpu_data.x86);
+ return -EINVAL;
}
- printk("VPMU: Initialization failed. "
- "AMD processor family %d has not "
- "been supported\n", family);
- return -EINVAL;
+ return 0;
}
diff --git a/xen/arch/x86/hvm/vmx/vpmu_core2.c b/xen/arch/x86/hvm/vmx/vpmu_core2.c
index cfcdf42..025c970 100644
--- a/xen/arch/x86/hvm/vmx/vpmu_core2.c
+++ b/xen/arch/x86/hvm/vmx/vpmu_core2.c
@@ -708,62 +708,6 @@ static int core2_vpmu_do_interrupt(struct cpu_user_regs *regs)
return 1;
}
-static int core2_vpmu_initialise(struct vcpu *v)
-{
- struct vpmu_struct *vpmu = vcpu_vpmu(v);
- u64 msr_content;
- static bool_t ds_warned;
-
- if ( !(vpmu_features & XENPMU_FEATURE_INTEL_BTS) )
- goto func_out;
- /* Check the 'Debug Store' feature in the CPUID.EAX[1]:EDX[21] */
- while ( boot_cpu_has(X86_FEATURE_DS) )
- {
- if ( !boot_cpu_has(X86_FEATURE_DTES64) )
- {
- if ( !ds_warned )
- printk(XENLOG_G_WARNING "CPU doesn't support 64-bit DS Area"
- " - Debug Store disabled for guests\n");
- break;
- }
- vpmu_set(vpmu, VPMU_CPU_HAS_DS);
- rdmsrl(MSR_IA32_MISC_ENABLE, msr_content);
- if ( msr_content & MSR_IA32_MISC_ENABLE_BTS_UNAVAIL )
- {
- /* If BTS_UNAVAIL is set reset the DS feature. */
- vpmu_reset(vpmu, VPMU_CPU_HAS_DS);
- if ( !ds_warned )
- printk(XENLOG_G_WARNING "CPU has set BTS_UNAVAIL"
- " - Debug Store disabled for guests\n");
- break;
- }
-
- vpmu_set(vpmu, VPMU_CPU_HAS_BTS);
- if ( !ds_warned )
- {
- if ( !boot_cpu_has(X86_FEATURE_DSCPL) )
- printk(XENLOG_G_INFO
- "vpmu: CPU doesn't support CPL-Qualified BTS\n");
- printk("******************************************************\n");
- printk("** WARNING: Emulation of BTS Feature is switched on **\n");
- printk("** Using this processor feature in a virtualized **\n");
- printk("** environment is not 100%% safe. **\n");
- printk("** Setting the DS buffer address with wrong values **\n");
- printk("** may lead to hypervisor hangs or crashes. **\n");
- printk("** It is NOT recommended for production use! **\n");
- printk("******************************************************\n");
- }
- break;
- }
- ds_warned = 1;
- func_out:
-
- arch_pmc_cnt = core2_get_arch_pmc_count();
- fixed_pmc_cnt = core2_get_fixed_pmc_count();
- check_pmc_quirk();
- return 0;
-}
-
static void core2_vpmu_destroy(struct vcpu *v)
{
struct vpmu_struct *vpmu = vcpu_vpmu(v);
@@ -829,23 +773,77 @@ struct arch_vpmu_ops core2_no_vpmu_ops = {
int vmx_vpmu_initialise(struct vcpu *v)
{
struct vpmu_struct *vpmu = vcpu_vpmu(v);
- uint8_t family = current_cpu_data.x86;
- uint8_t cpu_model = current_cpu_data.x86_model;
- int ret = 0;
+ u64 msr_content;
+ static bool_t ds_warned;
vpmu->arch_vpmu_ops = &core2_no_vpmu_ops;
if ( vpmu_mode == XENPMU_MODE_OFF )
return 0;
- if ( family == 6 )
- {
- u64 caps;
+ if ( (arch_pmc_cnt + fixed_pmc_cnt) == 0 )
+ return -EINVAL;
- rdmsrl(MSR_IA32_PERF_CAPABILITIES, caps);
- full_width_write = (caps >> 13) & 1;
+ if ( !(vpmu_features & XENPMU_FEATURE_INTEL_BTS) )
+ goto func_out;
+ /* Check the 'Debug Store' feature in the CPUID.EAX[1]:EDX[21] */
+ while ( boot_cpu_has(X86_FEATURE_DS) )
+ {
+ if ( !boot_cpu_has(X86_FEATURE_DTES64) )
+ {
+ if ( !ds_warned )
+ printk(XENLOG_G_WARNING "CPU doesn't support 64-bit DS Area"
+ " - Debug Store disabled for guests\n");
+ break;
+ }
+ vpmu_set(vpmu, VPMU_CPU_HAS_DS);
+ rdmsrl(MSR_IA32_MISC_ENABLE, msr_content);
+ if ( msr_content & MSR_IA32_MISC_ENABLE_BTS_UNAVAIL )
+ {
+ /* If BTS_UNAVAIL is set reset the DS feature. */
+ vpmu_reset(vpmu, VPMU_CPU_HAS_DS);
+ if ( !ds_warned )
+ printk(XENLOG_G_WARNING "CPU has set BTS_UNAVAIL"
+ " - Debug Store disabled for guests\n");
+ break;
+ }
- switch ( cpu_model )
+ vpmu_set(vpmu, VPMU_CPU_HAS_BTS);
+ if ( !ds_warned )
{
+ if ( !boot_cpu_has(X86_FEATURE_DSCPL) )
+ printk(XENLOG_G_INFO
+ "vpmu: CPU doesn't support CPL-Qualified BTS\n");
+ printk("******************************************************\n");
+ printk("** WARNING: Emulation of BTS Feature is switched on **\n");
+ printk("** Using this processor feature in a virtualized **\n");
+ printk("** environment is not 100%% safe. **\n");
+ printk("** Setting the DS buffer address with wrong values **\n");
+ printk("** may lead to hypervisor hangs or crashes. **\n");
+ printk("** It is NOT recommended for production use! **\n");
+ printk("******************************************************\n");
+ }
+ break;
+ }
+ ds_warned = 1;
+ func_out:
+
+ vpmu->arch_vpmu_ops = &core2_vpmu_ops;
+
+ return 0;
+}
+
+int __init core2_vpmu_init(void)
+{
+ u64 caps;
+
+ if ( current_cpu_data.x86 != 6 )
+ {
+ printk(XENLOG_WARNING "VPMU: only family 6 is supported\n");
+ return -EINVAL;
+ }
+
+ switch ( current_cpu_data.x86_model )
+ {
/* Core2: */
case 0x0f: /* original 65 nm celeron/pentium/core2/xeon, "Merom"/"Conroe" */
case 0x16: /* single-core 65 nm celeron/core2solo "Merom-L"/"Conroe-L" */
@@ -884,16 +882,21 @@ int vmx_vpmu_initialise(struct vcpu *v)
/* next gen Xeon Phi */
case 0x57:
- ret = core2_vpmu_initialise(v);
- if ( !ret )
- vpmu->arch_vpmu_ops = &core2_vpmu_ops;
- return ret;
- }
+ break;
+
+ default:
+ printk(XENLOG_WARNING "VPMU: Unsupported CPU model %#x\n",
+ current_cpu_data.x86_model);
+ return -EINVAL;
}
- printk("VPMU: Initialization failed. "
- "Intel processor family %d model %d has not "
- "been supported\n", family, cpu_model);
- return -EINVAL;
+ arch_pmc_cnt = core2_get_arch_pmc_count();
+ fixed_pmc_cnt = core2_get_fixed_pmc_count();
+ rdmsrl(MSR_IA32_PERF_CAPABILITIES, caps);
+ full_width_write = (caps >> 13) & 1;
+
+ check_pmc_quirk();
+
+ return 0;
}
diff --git a/xen/arch/x86/hvm/vpmu.c b/xen/arch/x86/hvm/vpmu.c
index 3e65c9b..e7e552e 100644
--- a/xen/arch/x86/hvm/vpmu.c
+++ b/xen/arch/x86/hvm/vpmu.c
@@ -453,14 +453,46 @@ long do_xenpmu_op(unsigned int op, XEN_GUEST_HANDLE_PARAM(xen_pmu_params_t) arg)
static int __init vpmu_init(void)
{
+ int vendor = current_cpu_data.x86_vendor;
+
+ if ( !opt_vpmu_enabled )
+ {
+ printk(XENLOG_INFO "VPMU: disabled\n");
+ return 0;
+ }
+
/* NMI watchdog uses LVTPC and HW counter */
if ( opt_watchdog && opt_vpmu_enabled )
{
printk(XENLOG_WARNING "NMI watchdog is enabled. Turning VPMU off.\n");
opt_vpmu_enabled = 0;
vpmu_mode = XENPMU_MODE_OFF;
+ return 0;
+ }
+
+ switch ( vendor )
+ {
+ case X86_VENDOR_AMD:
+ if ( amd_vpmu_init() )
+ vpmu_mode = XENPMU_MODE_OFF;
+ break;
+ case X86_VENDOR_INTEL:
+ if ( core2_vpmu_init() )
+ vpmu_mode = XENPMU_MODE_OFF;
+ break;
+ default:
+ printk(XENLOG_WARNING "VPMU: Unknown CPU vendor: %d. "
+ "Turning VPMU off.\n", vendor);
+ vpmu_mode = XENPMU_MODE_OFF;
+ break;
}
+ if ( vpmu_mode != XENPMU_MODE_OFF )
+ printk(XENLOG_INFO "VPMU: version " __stringify(XENPMU_VER_MAJ) "."
+ __stringify(XENPMU_VER_MIN) "\n");
+ else
+ opt_vpmu_enabled = 0;
+
return 0;
}
__initcall(vpmu_init);
diff --git a/xen/include/asm-x86/hvm/vpmu.h b/xen/include/asm-x86/hvm/vpmu.h
index 88ffc19..96f7666 100644
--- a/xen/include/asm-x86/hvm/vpmu.h
+++ b/xen/include/asm-x86/hvm/vpmu.h
@@ -52,7 +52,9 @@ struct arch_vpmu_ops {
void (*arch_vpmu_dump)(const struct vcpu *);
};
+int core2_vpmu_init(void);
int vmx_vpmu_initialise(struct vcpu *);
+int amd_vpmu_init(void);
int svm_vpmu_initialise(struct vcpu *);
struct vpmu_struct {
--
1.8.1.4
^ permalink raw reply related [flat|nested] 37+ messages in thread
* Re: [PATCH v25 05/15] x86/VPMU: Initialize VPMUs with __initcall
2015-06-19 18:44 ` [PATCH v25 05/15] x86/VPMU: Initialize VPMUs with __initcall Boris Ostrovsky
@ 2015-07-08 11:48 ` Dietmar Hahn
2015-07-09 6:04 ` Tian, Kevin
1 sibling, 0 replies; 37+ messages in thread
From: Dietmar Hahn @ 2015-07-08 11:48 UTC (permalink / raw)
To: Boris Ostrovsky
Cc: kevin.tian, JBeulich, andrew.cooper3, xen-devel,
Aravind.Gopalakrishnan, suravee.suthikulpanit, dgdegra
Am Freitag 19 Juni 2015, 14:44:36 schrieb Boris Ostrovsky:
> Move some VPMU initialization operations into __initcalls to avoid performing
> same tests and calculations for each vcpu.
>
> Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
> Acked-by: Jan Beulich <jbeulich@suse.com>
For the Intel/VMX part:
Reviewed-by: Dietmar Hahn <dietmar.hahn@ts.fujitsu.com>
> ---
> xen/arch/x86/hvm/svm/vpmu.c | 106 ++++++++++++--------------
> xen/arch/x86/hvm/vmx/vpmu_core2.c | 151 +++++++++++++++++++-------------------
> xen/arch/x86/hvm/vpmu.c | 32 ++++++++
> xen/include/asm-x86/hvm/vpmu.h | 2 +
> 4 files changed, 156 insertions(+), 135 deletions(-)
>
> diff --git a/xen/arch/x86/hvm/svm/vpmu.c b/xen/arch/x86/hvm/svm/vpmu.c
> index 481ea7b..b60ca40 100644
> --- a/xen/arch/x86/hvm/svm/vpmu.c
> +++ b/xen/arch/x86/hvm/svm/vpmu.c
> @@ -356,54 +356,6 @@ static int amd_vpmu_do_rdmsr(unsigned int msr, uint64_t *msr_content)
> return 1;
> }
>
> -static int amd_vpmu_initialise(struct vcpu *v)
> -{
> - struct xen_pmu_amd_ctxt *ctxt;
> - struct vpmu_struct *vpmu = vcpu_vpmu(v);
> - uint8_t family = current_cpu_data.x86;
> -
> - if ( counters == NULL )
> - {
> - switch ( family )
> - {
> - case 0x15:
> - num_counters = F15H_NUM_COUNTERS;
> - counters = AMD_F15H_COUNTERS;
> - ctrls = AMD_F15H_CTRLS;
> - k7_counters_mirrored = 1;
> - break;
> - case 0x10:
> - case 0x12:
> - case 0x14:
> - case 0x16:
> - default:
> - num_counters = F10H_NUM_COUNTERS;
> - counters = AMD_F10H_COUNTERS;
> - ctrls = AMD_F10H_CTRLS;
> - k7_counters_mirrored = 0;
> - break;
> - }
> - }
> -
> - ctxt = xzalloc_bytes(sizeof(*ctxt) +
> - 2 * sizeof(uint64_t) * num_counters);
> - if ( !ctxt )
> - {
> - gdprintk(XENLOG_WARNING, "Insufficient memory for PMU, "
> - " PMU feature is unavailable on domain %d vcpu %d.\n",
> - v->vcpu_id, v->domain->domain_id);
> - return -ENOMEM;
> - }
> -
> - ctxt->counters = sizeof(*ctxt);
> - ctxt->ctrls = ctxt->counters + sizeof(uint64_t) * num_counters;
> -
> - vpmu->context = ctxt;
> - vpmu->priv_context = NULL;
> - vpmu_set(vpmu, VPMU_CONTEXT_ALLOCATED);
> - return 0;
> -}
> -
> static void amd_vpmu_destroy(struct vcpu *v)
> {
> struct vpmu_struct *vpmu = vcpu_vpmu(v);
> @@ -474,30 +426,62 @@ struct arch_vpmu_ops amd_vpmu_ops = {
>
> int svm_vpmu_initialise(struct vcpu *v)
> {
> + struct xen_pmu_amd_ctxt *ctxt;
> struct vpmu_struct *vpmu = vcpu_vpmu(v);
> - uint8_t family = current_cpu_data.x86;
> - int ret = 0;
>
> - /* vpmu enabled? */
> if ( vpmu_mode == XENPMU_MODE_OFF )
> return 0;
>
> - switch ( family )
> + if ( !counters )
> + return -EINVAL;
> +
> + ctxt = xzalloc_bytes(sizeof(*ctxt) +
> + 2 * sizeof(uint64_t) * num_counters);
> + if ( !ctxt )
> {
> + printk(XENLOG_G_WARNING "Insufficient memory for PMU, "
> + " PMU feature is unavailable on domain %d vcpu %d.\n",
> + v->vcpu_id, v->domain->domain_id);
> + return -ENOMEM;
> + }
> +
> + ctxt->counters = sizeof(*ctxt);
> + ctxt->ctrls = ctxt->counters + sizeof(uint64_t) * num_counters;
> +
> + vpmu->context = ctxt;
> + vpmu->priv_context = NULL;
> +
> + vpmu->arch_vpmu_ops = &amd_vpmu_ops;
> +
> + vpmu_set(vpmu, VPMU_CONTEXT_ALLOCATED);
> + return 0;
> +}
> +
> +int __init amd_vpmu_init(void)
> +{
> + switch ( current_cpu_data.x86 )
> + {
> + case 0x15:
> + num_counters = F15H_NUM_COUNTERS;
> + counters = AMD_F15H_COUNTERS;
> + ctrls = AMD_F15H_CTRLS;
> + k7_counters_mirrored = 1;
> + break;
> case 0x10:
> case 0x12:
> case 0x14:
> - case 0x15:
> case 0x16:
> - ret = amd_vpmu_initialise(v);
> - if ( !ret )
> - vpmu->arch_vpmu_ops = &amd_vpmu_ops;
> - return ret;
> + num_counters = F10H_NUM_COUNTERS;
> + counters = AMD_F10H_COUNTERS;
> + ctrls = AMD_F10H_CTRLS;
> + k7_counters_mirrored = 0;
> + break;
> + default:
> + printk(XENLOG_WARNING "VPMU: Unsupported CPU family %#x\n",
> + current_cpu_data.x86);
> + return -EINVAL;
> }
>
> - printk("VPMU: Initialization failed. "
> - "AMD processor family %d has not "
> - "been supported\n", family);
> - return -EINVAL;
> + return 0;
> }
>
> diff --git a/xen/arch/x86/hvm/vmx/vpmu_core2.c b/xen/arch/x86/hvm/vmx/vpmu_core2.c
> index cfcdf42..025c970 100644
> --- a/xen/arch/x86/hvm/vmx/vpmu_core2.c
> +++ b/xen/arch/x86/hvm/vmx/vpmu_core2.c
> @@ -708,62 +708,6 @@ static int core2_vpmu_do_interrupt(struct cpu_user_regs *regs)
> return 1;
> }
>
> -static int core2_vpmu_initialise(struct vcpu *v)
> -{
> - struct vpmu_struct *vpmu = vcpu_vpmu(v);
> - u64 msr_content;
> - static bool_t ds_warned;
> -
> - if ( !(vpmu_features & XENPMU_FEATURE_INTEL_BTS) )
> - goto func_out;
> - /* Check the 'Debug Store' feature in the CPUID.EAX[1]:EDX[21] */
> - while ( boot_cpu_has(X86_FEATURE_DS) )
> - {
> - if ( !boot_cpu_has(X86_FEATURE_DTES64) )
> - {
> - if ( !ds_warned )
> - printk(XENLOG_G_WARNING "CPU doesn't support 64-bit DS Area"
> - " - Debug Store disabled for guests\n");
> - break;
> - }
> - vpmu_set(vpmu, VPMU_CPU_HAS_DS);
> - rdmsrl(MSR_IA32_MISC_ENABLE, msr_content);
> - if ( msr_content & MSR_IA32_MISC_ENABLE_BTS_UNAVAIL )
> - {
> - /* If BTS_UNAVAIL is set reset the DS feature. */
> - vpmu_reset(vpmu, VPMU_CPU_HAS_DS);
> - if ( !ds_warned )
> - printk(XENLOG_G_WARNING "CPU has set BTS_UNAVAIL"
> - " - Debug Store disabled for guests\n");
> - break;
> - }
> -
> - vpmu_set(vpmu, VPMU_CPU_HAS_BTS);
> - if ( !ds_warned )
> - {
> - if ( !boot_cpu_has(X86_FEATURE_DSCPL) )
> - printk(XENLOG_G_INFO
> - "vpmu: CPU doesn't support CPL-Qualified BTS\n");
> - printk("******************************************************\n");
> - printk("** WARNING: Emulation of BTS Feature is switched on **\n");
> - printk("** Using this processor feature in a virtualized **\n");
> - printk("** environment is not 100%% safe. **\n");
> - printk("** Setting the DS buffer address with wrong values **\n");
> - printk("** may lead to hypervisor hangs or crashes. **\n");
> - printk("** It is NOT recommended for production use! **\n");
> - printk("******************************************************\n");
> - }
> - break;
> - }
> - ds_warned = 1;
> - func_out:
> -
> - arch_pmc_cnt = core2_get_arch_pmc_count();
> - fixed_pmc_cnt = core2_get_fixed_pmc_count();
> - check_pmc_quirk();
> - return 0;
> -}
> -
> static void core2_vpmu_destroy(struct vcpu *v)
> {
> struct vpmu_struct *vpmu = vcpu_vpmu(v);
> @@ -829,23 +773,77 @@ struct arch_vpmu_ops core2_no_vpmu_ops = {
> int vmx_vpmu_initialise(struct vcpu *v)
> {
> struct vpmu_struct *vpmu = vcpu_vpmu(v);
> - uint8_t family = current_cpu_data.x86;
> - uint8_t cpu_model = current_cpu_data.x86_model;
> - int ret = 0;
> + u64 msr_content;
> + static bool_t ds_warned;
>
> vpmu->arch_vpmu_ops = &core2_no_vpmu_ops;
> if ( vpmu_mode == XENPMU_MODE_OFF )
> return 0;
>
> - if ( family == 6 )
> - {
> - u64 caps;
> + if ( (arch_pmc_cnt + fixed_pmc_cnt) == 0 )
> + return -EINVAL;
>
> - rdmsrl(MSR_IA32_PERF_CAPABILITIES, caps);
> - full_width_write = (caps >> 13) & 1;
> + if ( !(vpmu_features & XENPMU_FEATURE_INTEL_BTS) )
> + goto func_out;
> + /* Check the 'Debug Store' feature in the CPUID.EAX[1]:EDX[21] */
> + while ( boot_cpu_has(X86_FEATURE_DS) )
> + {
> + if ( !boot_cpu_has(X86_FEATURE_DTES64) )
> + {
> + if ( !ds_warned )
> + printk(XENLOG_G_WARNING "CPU doesn't support 64-bit DS Area"
> + " - Debug Store disabled for guests\n");
> + break;
> + }
> + vpmu_set(vpmu, VPMU_CPU_HAS_DS);
> + rdmsrl(MSR_IA32_MISC_ENABLE, msr_content);
> + if ( msr_content & MSR_IA32_MISC_ENABLE_BTS_UNAVAIL )
> + {
> + /* If BTS_UNAVAIL is set reset the DS feature. */
> + vpmu_reset(vpmu, VPMU_CPU_HAS_DS);
> + if ( !ds_warned )
> + printk(XENLOG_G_WARNING "CPU has set BTS_UNAVAIL"
> + " - Debug Store disabled for guests\n");
> + break;
> + }
>
> - switch ( cpu_model )
> + vpmu_set(vpmu, VPMU_CPU_HAS_BTS);
> + if ( !ds_warned )
> {
> + if ( !boot_cpu_has(X86_FEATURE_DSCPL) )
> + printk(XENLOG_G_INFO
> + "vpmu: CPU doesn't support CPL-Qualified BTS\n");
> + printk("******************************************************\n");
> + printk("** WARNING: Emulation of BTS Feature is switched on **\n");
> + printk("** Using this processor feature in a virtualized **\n");
> + printk("** environment is not 100%% safe. **\n");
> + printk("** Setting the DS buffer address with wrong values **\n");
> + printk("** may lead to hypervisor hangs or crashes. **\n");
> + printk("** It is NOT recommended for production use! **\n");
> + printk("******************************************************\n");
> + }
> + break;
> + }
> + ds_warned = 1;
> + func_out:
> +
> + vpmu->arch_vpmu_ops = &core2_vpmu_ops;
> +
> + return 0;
> +}
> +
> +int __init core2_vpmu_init(void)
> +{
> + u64 caps;
> +
> + if ( current_cpu_data.x86 != 6 )
> + {
> + printk(XENLOG_WARNING "VPMU: only family 6 is supported\n");
> + return -EINVAL;
> + }
> +
> + switch ( current_cpu_data.x86_model )
> + {
> /* Core2: */
> case 0x0f: /* original 65 nm celeron/pentium/core2/xeon, "Merom"/"Conroe" */
> case 0x16: /* single-core 65 nm celeron/core2solo "Merom-L"/"Conroe-L" */
> @@ -884,16 +882,21 @@ int vmx_vpmu_initialise(struct vcpu *v)
>
> /* next gen Xeon Phi */
> case 0x57:
> - ret = core2_vpmu_initialise(v);
> - if ( !ret )
> - vpmu->arch_vpmu_ops = &core2_vpmu_ops;
> - return ret;
> - }
> + break;
> +
> + default:
> + printk(XENLOG_WARNING "VPMU: Unsupported CPU model %#x\n",
> + current_cpu_data.x86_model);
> + return -EINVAL;
> }
>
> - printk("VPMU: Initialization failed. "
> - "Intel processor family %d model %d has not "
> - "been supported\n", family, cpu_model);
> - return -EINVAL;
> + arch_pmc_cnt = core2_get_arch_pmc_count();
> + fixed_pmc_cnt = core2_get_fixed_pmc_count();
> + rdmsrl(MSR_IA32_PERF_CAPABILITIES, caps);
> + full_width_write = (caps >> 13) & 1;
> +
> + check_pmc_quirk();
> +
> + return 0;
> }
>
> diff --git a/xen/arch/x86/hvm/vpmu.c b/xen/arch/x86/hvm/vpmu.c
> index 3e65c9b..e7e552e 100644
> --- a/xen/arch/x86/hvm/vpmu.c
> +++ b/xen/arch/x86/hvm/vpmu.c
> @@ -453,14 +453,46 @@ long do_xenpmu_op(unsigned int op, XEN_GUEST_HANDLE_PARAM(xen_pmu_params_t) arg)
>
> static int __init vpmu_init(void)
> {
> + int vendor = current_cpu_data.x86_vendor;
> +
> + if ( !opt_vpmu_enabled )
> + {
> + printk(XENLOG_INFO "VPMU: disabled\n");
> + return 0;
> + }
> +
> /* NMI watchdog uses LVTPC and HW counter */
> if ( opt_watchdog && opt_vpmu_enabled )
> {
> printk(XENLOG_WARNING "NMI watchdog is enabled. Turning VPMU off.\n");
> opt_vpmu_enabled = 0;
> vpmu_mode = XENPMU_MODE_OFF;
> + return 0;
> + }
> +
> + switch ( vendor )
> + {
> + case X86_VENDOR_AMD:
> + if ( amd_vpmu_init() )
> + vpmu_mode = XENPMU_MODE_OFF;
> + break;
> + case X86_VENDOR_INTEL:
> + if ( core2_vpmu_init() )
> + vpmu_mode = XENPMU_MODE_OFF;
> + break;
> + default:
> + printk(XENLOG_WARNING "VPMU: Unknown CPU vendor: %d. "
> + "Turning VPMU off.\n", vendor);
> + vpmu_mode = XENPMU_MODE_OFF;
> + break;
> }
>
> + if ( vpmu_mode != XENPMU_MODE_OFF )
> + printk(XENLOG_INFO "VPMU: version " __stringify(XENPMU_VER_MAJ) "."
> + __stringify(XENPMU_VER_MIN) "\n");
> + else
> + opt_vpmu_enabled = 0;
> +
> return 0;
> }
> __initcall(vpmu_init);
> diff --git a/xen/include/asm-x86/hvm/vpmu.h b/xen/include/asm-x86/hvm/vpmu.h
> index 88ffc19..96f7666 100644
> --- a/xen/include/asm-x86/hvm/vpmu.h
> +++ b/xen/include/asm-x86/hvm/vpmu.h
> @@ -52,7 +52,9 @@ struct arch_vpmu_ops {
> void (*arch_vpmu_dump)(const struct vcpu *);
> };
>
> +int core2_vpmu_init(void);
> int vmx_vpmu_initialise(struct vcpu *);
> +int amd_vpmu_init(void);
> int svm_vpmu_initialise(struct vcpu *);
>
> struct vpmu_struct {
>
--
Company details: http://ts.fujitsu.com/imprint.html
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH v25 05/15] x86/VPMU: Initialize VPMUs with __initcall
2015-06-19 18:44 ` [PATCH v25 05/15] x86/VPMU: Initialize VPMUs with __initcall Boris Ostrovsky
2015-07-08 11:48 ` Dietmar Hahn
@ 2015-07-09 6:04 ` Tian, Kevin
1 sibling, 0 replies; 37+ messages in thread
From: Tian, Kevin @ 2015-07-09 6:04 UTC (permalink / raw)
To: Boris Ostrovsky, JBeulich, suravee.suthikulpanit,
Aravind.Gopalakrishnan, dietmar.hahn, dgdegra, andrew.cooper3
Cc: xen-devel
> From: Boris Ostrovsky [mailto:boris.ostrovsky@oracle.com]
> Sent: Saturday, June 20, 2015 2:45 AM
>
> Move some VPMU initialization operations into __initcalls to avoid performing
> same tests and calculations for each vcpu.
>
> Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
> Acked-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Kevin Tian <kevin.tian@intel.com>
^ permalink raw reply [flat|nested] 37+ messages in thread
* [PATCH v25 06/15] x86/VPMU: Initialize PMU for PV(H) guests
2015-06-19 18:44 [PATCH v25 00/15] x86/PMU: Xen PMU PV(H) support Boris Ostrovsky
` (4 preceding siblings ...)
2015-06-19 18:44 ` [PATCH v25 05/15] x86/VPMU: Initialize VPMUs with __initcall Boris Ostrovsky
@ 2015-06-19 18:44 ` Boris Ostrovsky
2015-07-08 12:20 ` Dietmar Hahn
2015-06-19 18:44 ` [PATCH v25 07/15] x86/VPMU: Save VPMU state for PV guests during context switch Boris Ostrovsky
` (10 subsequent siblings)
16 siblings, 1 reply; 37+ messages in thread
From: Boris Ostrovsky @ 2015-06-19 18:44 UTC (permalink / raw)
To: JBeulich, kevin.tian, suravee.suthikulpanit,
Aravind.Gopalakrishnan, dietmar.hahn, dgdegra, andrew.cooper3
Cc: boris.ostrovsky, xen-devel
Code for initializing/tearing down PMU for PV guests
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Acked-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>
Acked-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Kevin Tian <kevin.tian@intel.com>
---
tools/flask/policy/policy/modules/xen/xen.te | 4 +
xen/arch/x86/domain.c | 2 +
xen/arch/x86/hvm/hvm.c | 1 +
xen/arch/x86/hvm/svm/svm.c | 4 +-
xen/arch/x86/hvm/svm/vpmu.c | 16 +++-
xen/arch/x86/hvm/vmx/vmx.c | 4 +-
xen/arch/x86/hvm/vmx/vpmu_core2.c | 30 ++++--
xen/arch/x86/hvm/vpmu.c | 131 ++++++++++++++++++++++++---
xen/common/event_channel.c | 1 +
xen/include/asm-x86/hvm/vpmu.h | 2 +
xen/include/public/pmu.h | 2 +
xen/include/public/xen.h | 1 +
xen/include/xsm/dummy.h | 3 +
xen/xsm/flask/hooks.c | 4 +
xen/xsm/flask/policy/access_vectors | 2 +
15 files changed, 181 insertions(+), 26 deletions(-)
diff --git a/tools/flask/policy/policy/modules/xen/xen.te b/tools/flask/policy/policy/modules/xen/xen.te
index 45b5cb2..f553eb5 100644
--- a/tools/flask/policy/policy/modules/xen/xen.te
+++ b/tools/flask/policy/policy/modules/xen/xen.te
@@ -130,6 +130,10 @@ if (guest_writeconsole) {
dontaudit domain_type xen_t : xen writeconsole;
}
+# Allow all domains to use PMU (but not to change its settings --- that's what
+# pmu_ctrl is for)
+allow domain_type xen_t:xen2 pmu_use;
+
###############################################################################
#
# Domain creation
diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index dc18565..b699f68 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -438,6 +438,8 @@ int vcpu_initialise(struct vcpu *v)
vmce_init_vcpu(v);
}
+ spin_lock_init(&v->arch.vpmu.vpmu_lock);
+
if ( has_hvm_container_domain(d) )
{
rc = hvm_vcpu_initialise(v);
diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
index d5e5242..83a81f5 100644
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -4931,6 +4931,7 @@ static hvm_hypercall_t *const pvh_hypercall64_table[NR_hypercalls] = {
HYPERCALL(hvm_op),
HYPERCALL(sysctl),
HYPERCALL(domctl),
+ HYPERCALL(xenpmu_op),
[ __HYPERVISOR_arch_1 ] = (hvm_hypercall_t *)paging_domctl_continuation
};
diff --git a/xen/arch/x86/hvm/svm/svm.c b/xen/arch/x86/hvm/svm/svm.c
index a02f983..680eebe 100644
--- a/xen/arch/x86/hvm/svm/svm.c
+++ b/xen/arch/x86/hvm/svm/svm.c
@@ -1165,7 +1165,9 @@ static int svm_vcpu_initialise(struct vcpu *v)
return rc;
}
- vpmu_initialise(v);
+ /* PVH's VPMU is initialized via hypercall */
+ if ( is_hvm_vcpu(v) )
+ vpmu_initialise(v);
svm_guest_osvw_init(v);
diff --git a/xen/arch/x86/hvm/svm/vpmu.c b/xen/arch/x86/hvm/svm/vpmu.c
index b60ca40..a8572a6 100644
--- a/xen/arch/x86/hvm/svm/vpmu.c
+++ b/xen/arch/x86/hvm/svm/vpmu.c
@@ -364,13 +364,11 @@ static void amd_vpmu_destroy(struct vcpu *v)
amd_vpmu_unset_msr_bitmap(v);
xfree(vpmu->context);
- vpmu_reset(vpmu, VPMU_CONTEXT_ALLOCATED);
if ( vpmu_is_set(vpmu, VPMU_RUNNING) )
- {
- vpmu_reset(vpmu, VPMU_RUNNING);
release_pmu_ownship(PMU_OWNER_HVM);
- }
+
+ vpmu_clear(vpmu);
}
/* VPMU part of the 'q' keyhandler */
@@ -482,6 +480,16 @@ int __init amd_vpmu_init(void)
return -EINVAL;
}
+ if ( sizeof(struct xen_pmu_data) +
+ 2 * sizeof(uint64_t) * num_counters > PAGE_SIZE )
+ {
+ printk(XENLOG_WARNING
+ "VPMU: Register bank does not fit into VPMU shared page\n");
+ counters = ctrls = NULL;
+ num_counters = 0;
+ return -ENOSPC;
+ }
+
return 0;
}
diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
index 0837627..50e11dd 100644
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -140,7 +140,9 @@ static int vmx_vcpu_initialise(struct vcpu *v)
}
}
- vpmu_initialise(v);
+ /* PVH's VPMU is initialized via hypercall */
+ if ( is_hvm_vcpu(v) )
+ vpmu_initialise(v);
vmx_install_vlapic_mapping(v);
diff --git a/xen/arch/x86/hvm/vmx/vpmu_core2.c b/xen/arch/x86/hvm/vmx/vpmu_core2.c
index 025c970..e7642e5 100644
--- a/xen/arch/x86/hvm/vmx/vpmu_core2.c
+++ b/xen/arch/x86/hvm/vmx/vpmu_core2.c
@@ -365,13 +365,16 @@ static int core2_vpmu_alloc_resource(struct vcpu *v)
if ( !acquire_pmu_ownership(PMU_OWNER_HVM) )
return 0;
- wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, 0);
- if ( vmx_add_host_load_msr(MSR_CORE_PERF_GLOBAL_CTRL) )
- goto out_err;
+ if ( has_hvm_container_vcpu(v) )
+ {
+ wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, 0);
+ if ( vmx_add_host_load_msr(MSR_CORE_PERF_GLOBAL_CTRL) )
+ goto out_err;
- if ( vmx_add_guest_msr(MSR_CORE_PERF_GLOBAL_CTRL) )
- goto out_err;
- vmx_write_guest_msr(MSR_CORE_PERF_GLOBAL_CTRL, 0);
+ if ( vmx_add_guest_msr(MSR_CORE_PERF_GLOBAL_CTRL) )
+ goto out_err;
+ vmx_write_guest_msr(MSR_CORE_PERF_GLOBAL_CTRL, 0);
+ }
core2_vpmu_cxt = xzalloc_bytes(sizeof(*core2_vpmu_cxt) +
sizeof(uint64_t) * fixed_pmc_cnt +
@@ -717,7 +720,7 @@ static void core2_vpmu_destroy(struct vcpu *v)
if ( has_hvm_container_vcpu(v) && cpu_has_vmx_msr_bitmap )
core2_vpmu_unset_msr_bitmap(v->arch.hvm_vmx.msr_bitmap);
release_pmu_ownship(PMU_OWNER_HVM);
- vpmu_reset(vpmu, VPMU_CONTEXT_ALLOCATED);
+ vpmu_clear(vpmu);
}
struct arch_vpmu_ops core2_vpmu_ops = {
@@ -827,6 +830,10 @@ int vmx_vpmu_initialise(struct vcpu *v)
ds_warned = 1;
func_out:
+ /* PV domains can allocate resources immediately */
+ if ( is_pv_vcpu(v) && !core2_vpmu_alloc_resource(v) )
+ return -EIO;
+
vpmu->arch_vpmu_ops = &core2_vpmu_ops;
return 0;
@@ -897,6 +904,15 @@ int __init core2_vpmu_init(void)
check_pmc_quirk();
+ if ( sizeof(struct xen_pmu_data) + sizeof(uint64_t) * fixed_pmc_cnt +
+ sizeof(struct xen_pmu_cntr_pair) * arch_pmc_cnt > PAGE_SIZE )
+ {
+ printk(XENLOG_WARNING
+ "VPMU: Register bank does not fit into VPMU share page\n");
+ arch_pmc_cnt = fixed_pmc_cnt = 0;
+ return -ENOSPC;
+ }
+
return 0;
}
diff --git a/xen/arch/x86/hvm/vpmu.c b/xen/arch/x86/hvm/vpmu.c
index e7e552e..e3e5a9c 100644
--- a/xen/arch/x86/hvm/vpmu.c
+++ b/xen/arch/x86/hvm/vpmu.c
@@ -27,6 +27,7 @@
#include <asm/types.h>
#include <asm/msr.h>
#include <asm/nmi.h>
+#include <asm/p2m.h>
#include <asm/hvm/support.h>
#include <asm/hvm/vmx/vmx.h>
#include <asm/hvm/vmx/vmcs.h>
@@ -257,22 +258,25 @@ void vpmu_initialise(struct vcpu *v)
struct vpmu_struct *vpmu = vcpu_vpmu(v);
uint8_t vendor = current_cpu_data.x86_vendor;
int ret;
+ bool_t is_priv_vpmu = is_hardware_domain(v->domain);
BUILD_BUG_ON(sizeof(struct xen_pmu_intel_ctxt) > XENPMU_CTXT_PAD_SZ);
BUILD_BUG_ON(sizeof(struct xen_pmu_amd_ctxt) > XENPMU_CTXT_PAD_SZ);
- if ( is_pvh_vcpu(v) )
- return;
-
ASSERT(!vpmu->flags && !vpmu->context);
- /*
- * Count active VPMUs so that we won't try to change vpmu_mode while
- * they are in use.
- */
- spin_lock(&vpmu_lock);
- vpmu_count++;
- spin_unlock(&vpmu_lock);
+ if ( !is_priv_vpmu )
+ {
+ /*
+ * Count active VPMUs so that we won't try to change vpmu_mode while
+ * they are in use.
+ * vpmu_mode can be safely updated while dom0's VPMUs are active and
+ * so we don't need to include it in the count.
+ */
+ spin_lock(&vpmu_lock);
+ vpmu_count++;
+ spin_unlock(&vpmu_lock);
+ }
switch ( vendor )
{
@@ -299,7 +303,7 @@ void vpmu_initialise(struct vcpu *v)
printk(XENLOG_G_WARNING "VPMU: Initialization failed for %pv\n", v);
/* Intel needs to initialize VPMU ops even if VPMU is not in use */
- if ( ret || (vpmu_mode == XENPMU_MODE_OFF) )
+ if ( !is_priv_vpmu && (ret || (vpmu_mode == XENPMU_MODE_OFF)) )
{
spin_lock(&vpmu_lock);
vpmu_count--;
@@ -332,13 +336,104 @@ void vpmu_destroy(struct vcpu *v)
vpmu_clear_last, v, 1);
if ( vpmu->arch_vpmu_ops && vpmu->arch_vpmu_ops->arch_vpmu_destroy )
- vpmu->arch_vpmu_ops->arch_vpmu_destroy(v);
+ {
+ /* Unload VPMU first. This will stop counters */
+ on_selected_cpus(cpumask_of(vcpu_vpmu(v)->last_pcpu),
+ vpmu_save_force, v, 1);
+ vpmu->arch_vpmu_ops->arch_vpmu_destroy(v);
+ }
spin_lock(&vpmu_lock);
- vpmu_count--;
+ if ( !is_hardware_domain(v->domain) )
+ vpmu_count--;
spin_unlock(&vpmu_lock);
}
+static int pvpmu_init(struct domain *d, xen_pmu_params_t *params)
+{
+ struct vcpu *v;
+ struct vpmu_struct *vpmu;
+ struct page_info *page;
+ uint64_t gfn = params->val;
+
+ if ( vpmu_mode == XENPMU_MODE_OFF )
+ return -EINVAL;
+
+ if ( (params->vcpu >= d->max_vcpus) || (d->vcpu[params->vcpu] == NULL) )
+ return -EINVAL;
+
+ page = get_page_from_gfn(d, gfn, NULL, P2M_ALLOC);
+ if ( !page )
+ return -EINVAL;
+
+ if ( !get_page_type(page, PGT_writable_page) )
+ {
+ put_page(page);
+ return -EINVAL;
+ }
+
+ v = d->vcpu[params->vcpu];
+ vpmu = vcpu_vpmu(v);
+
+ spin_lock(&vpmu->vpmu_lock);
+
+ if ( v->arch.vpmu.xenpmu_data )
+ {
+ spin_unlock(&vpmu->vpmu_lock);
+ put_page_and_type(page);
+ return -EEXIST;
+ }
+
+ v->arch.vpmu.xenpmu_data = __map_domain_page_global(page);
+ if ( !v->arch.vpmu.xenpmu_data )
+ {
+ spin_unlock(&vpmu->vpmu_lock);
+ put_page_and_type(page);
+ return -ENOMEM;
+ }
+
+ vpmu_initialise(v);
+
+ spin_unlock(&vpmu->vpmu_lock);
+
+ return 0;
+}
+
+static void pvpmu_finish(struct domain *d, xen_pmu_params_t *params)
+{
+ struct vcpu *v;
+ struct vpmu_struct *vpmu;
+ uint64_t mfn;
+ void *xenpmu_data;
+
+ if ( (params->vcpu >= d->max_vcpus) || (d->vcpu[params->vcpu] == NULL) )
+ return;
+
+ v = d->vcpu[params->vcpu];
+ if ( v != current )
+ vcpu_pause(v);
+
+ vpmu = vcpu_vpmu(v);
+ spin_lock(&vpmu->vpmu_lock);
+
+ vpmu_destroy(v);
+ xenpmu_data = vpmu->xenpmu_data;
+ vpmu->xenpmu_data = NULL;
+
+ spin_unlock(&vpmu->vpmu_lock);
+
+ if ( xenpmu_data )
+ {
+ mfn = domain_page_map_to_mfn(xenpmu_data);
+ ASSERT(mfn_valid(mfn));
+ unmap_domain_page_global(xenpmu_data);
+ put_page_and_type(mfn_to_page(mfn));
+ }
+
+ if ( v != current )
+ vcpu_unpause(v);
+}
+
/* Dump some vpmu informations on console. Used in keyhandler dump_domains(). */
void vpmu_dump(struct vcpu *v)
{
@@ -365,6 +460,8 @@ long do_xenpmu_op(unsigned int op, XEN_GUEST_HANDLE_PARAM(xen_pmu_params_t) arg)
{
case XENPMU_mode_set:
case XENPMU_feature_set:
+ case XENPMU_init:
+ case XENPMU_finish:
if ( copy_from_guest(&pmu_params, arg, 1) )
return -EFAULT;
@@ -444,6 +541,14 @@ long do_xenpmu_op(unsigned int op, XEN_GUEST_HANDLE_PARAM(xen_pmu_params_t) arg)
break;
+ case XENPMU_init:
+ ret = pvpmu_init(current->domain, &pmu_params);
+ break;
+
+ case XENPMU_finish:
+ pvpmu_finish(current->domain, &pmu_params);
+ break;
+
default:
ret = -EINVAL;
}
diff --git a/xen/common/event_channel.c b/xen/common/event_channel.c
index 2208de0..adf742a 100644
--- a/xen/common/event_channel.c
+++ b/xen/common/event_channel.c
@@ -106,6 +106,7 @@ static int virq_is_global(uint32_t virq)
case VIRQ_TIMER:
case VIRQ_DEBUG:
case VIRQ_XENOPROF:
+ case VIRQ_XENPMU:
rc = 0;
break;
case VIRQ_ARCH_0 ... VIRQ_ARCH_7:
diff --git a/xen/include/asm-x86/hvm/vpmu.h b/xen/include/asm-x86/hvm/vpmu.h
index 96f7666..642a4b7 100644
--- a/xen/include/asm-x86/hvm/vpmu.h
+++ b/xen/include/asm-x86/hvm/vpmu.h
@@ -64,6 +64,8 @@ struct vpmu_struct {
void *context; /* May be shared with PV guest */
void *priv_context; /* hypervisor-only */
struct arch_vpmu_ops *arch_vpmu_ops;
+ struct xen_pmu_data *xenpmu_data;
+ spinlock_t vpmu_lock;
};
/* VPMU states */
diff --git a/xen/include/public/pmu.h b/xen/include/public/pmu.h
index 50cc048..e6307b5 100644
--- a/xen/include/public/pmu.h
+++ b/xen/include/public/pmu.h
@@ -25,6 +25,8 @@
#define XENPMU_mode_set 1
#define XENPMU_feature_get 2
#define XENPMU_feature_set 3
+#define XENPMU_init 4
+#define XENPMU_finish 5
/* ` } */
/* Parameters structure for HYPERVISOR_xenpmu_op call */
diff --git a/xen/include/public/xen.h b/xen/include/public/xen.h
index 59fb544..1b98d8b 100644
--- a/xen/include/public/xen.h
+++ b/xen/include/public/xen.h
@@ -161,6 +161,7 @@ DEFINE_XEN_GUEST_HANDLE(xen_ulong_t);
#define VIRQ_MEM_EVENT 10 /* G. (DOM0) A memory event has occured */
#define VIRQ_XC_RESERVED 11 /* G. Reserved for XenClient */
#define VIRQ_ENOMEM 12 /* G. (DOM0) Low on heap memory */
+#define VIRQ_XENPMU 13 /* V. PMC interrupt */
/* Architecture-specific VIRQ definitions. */
#define VIRQ_ARCH_0 16
diff --git a/xen/include/xsm/dummy.h b/xen/include/xsm/dummy.h
index a496116..6456f72 100644
--- a/xen/include/xsm/dummy.h
+++ b/xen/include/xsm/dummy.h
@@ -704,6 +704,9 @@ static XSM_INLINE int xsm_pmu_op (XSM_DEFAULT_ARG struct domain *d, int op)
case XENPMU_feature_set:
case XENPMU_feature_get:
return xsm_default_action(XSM_PRIV, d, current->domain);
+ case XENPMU_init:
+ case XENPMU_finish:
+ return xsm_default_action(XSM_HOOK, d, current->domain);
default:
return -EPERM;
}
diff --git a/xen/xsm/flask/hooks.c b/xen/xsm/flask/hooks.c
index ac98966..aefcbda 100644
--- a/xen/xsm/flask/hooks.c
+++ b/xen/xsm/flask/hooks.c
@@ -1592,6 +1592,10 @@ static int flask_pmu_op (struct domain *d, unsigned int op)
case XENPMU_feature_get:
return avc_has_perm(dsid, SECINITSID_XEN, SECCLASS_XEN2,
XEN2__PMU_CTRL, NULL);
+ case XENPMU_init:
+ case XENPMU_finish:
+ return avc_has_perm(dsid, SECINITSID_XEN, SECCLASS_XEN2,
+ XEN2__PMU_USE, NULL);
default:
return -EPERM;
}
diff --git a/xen/xsm/flask/policy/access_vectors b/xen/xsm/flask/policy/access_vectors
index 3a5d798..a6517a3 100644
--- a/xen/xsm/flask/policy/access_vectors
+++ b/xen/xsm/flask/policy/access_vectors
@@ -89,6 +89,8 @@ class xen2
get_symbol
# PMU control
pmu_ctrl
+# PMU use (domains, including unprivileged ones, will be using this operation)
+ pmu_use
}
# Classes domain and domain2 consist of operations that a domain performs on
--
1.8.1.4
^ permalink raw reply related [flat|nested] 37+ messages in thread
* Re: [PATCH v25 06/15] x86/VPMU: Initialize PMU for PV(H) guests
2015-06-19 18:44 ` [PATCH v25 06/15] x86/VPMU: Initialize PMU for PV(H) guests Boris Ostrovsky
@ 2015-07-08 12:20 ` Dietmar Hahn
0 siblings, 0 replies; 37+ messages in thread
From: Dietmar Hahn @ 2015-07-08 12:20 UTC (permalink / raw)
To: Boris Ostrovsky
Cc: kevin.tian, JBeulich, andrew.cooper3, xen-devel,
Aravind.Gopalakrishnan, suravee.suthikulpanit, dgdegra
Am Freitag 19 Juni 2015, 14:44:37 schrieb Boris Ostrovsky:
> Code for initializing/tearing down PMU for PV guests
>
> Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
> Acked-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>
> Acked-by: Jan Beulich <jbeulich@suse.com>
> Acked-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Dietmar Hahn <dietmar.hahn@ts.fujitsu.com>
> ---
> tools/flask/policy/policy/modules/xen/xen.te | 4 +
> xen/arch/x86/domain.c | 2 +
> xen/arch/x86/hvm/hvm.c | 1 +
> xen/arch/x86/hvm/svm/svm.c | 4 +-
> xen/arch/x86/hvm/svm/vpmu.c | 16 +++-
> xen/arch/x86/hvm/vmx/vmx.c | 4 +-
> xen/arch/x86/hvm/vmx/vpmu_core2.c | 30 ++++--
> xen/arch/x86/hvm/vpmu.c | 131 ++++++++++++++++++++++++---
> xen/common/event_channel.c | 1 +
> xen/include/asm-x86/hvm/vpmu.h | 2 +
> xen/include/public/pmu.h | 2 +
> xen/include/public/xen.h | 1 +
> xen/include/xsm/dummy.h | 3 +
> xen/xsm/flask/hooks.c | 4 +
> xen/xsm/flask/policy/access_vectors | 2 +
> 15 files changed, 181 insertions(+), 26 deletions(-)
>
> diff --git a/tools/flask/policy/policy/modules/xen/xen.te b/tools/flask/policy/policy/modules/xen/xen.te
> index 45b5cb2..f553eb5 100644
> --- a/tools/flask/policy/policy/modules/xen/xen.te
> +++ b/tools/flask/policy/policy/modules/xen/xen.te
> @@ -130,6 +130,10 @@ if (guest_writeconsole) {
> dontaudit domain_type xen_t : xen writeconsole;
> }
>
> +# Allow all domains to use PMU (but not to change its settings --- that's what
> +# pmu_ctrl is for)
> +allow domain_type xen_t:xen2 pmu_use;
> +
> ###############################################################################
> #
> # Domain creation
> diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
> index dc18565..b699f68 100644
> --- a/xen/arch/x86/domain.c
> +++ b/xen/arch/x86/domain.c
> @@ -438,6 +438,8 @@ int vcpu_initialise(struct vcpu *v)
> vmce_init_vcpu(v);
> }
>
> + spin_lock_init(&v->arch.vpmu.vpmu_lock);
> +
> if ( has_hvm_container_domain(d) )
> {
> rc = hvm_vcpu_initialise(v);
> diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
> index d5e5242..83a81f5 100644
> --- a/xen/arch/x86/hvm/hvm.c
> +++ b/xen/arch/x86/hvm/hvm.c
> @@ -4931,6 +4931,7 @@ static hvm_hypercall_t *const pvh_hypercall64_table[NR_hypercalls] = {
> HYPERCALL(hvm_op),
> HYPERCALL(sysctl),
> HYPERCALL(domctl),
> + HYPERCALL(xenpmu_op),
> [ __HYPERVISOR_arch_1 ] = (hvm_hypercall_t *)paging_domctl_continuation
> };
>
> diff --git a/xen/arch/x86/hvm/svm/svm.c b/xen/arch/x86/hvm/svm/svm.c
> index a02f983..680eebe 100644
> --- a/xen/arch/x86/hvm/svm/svm.c
> +++ b/xen/arch/x86/hvm/svm/svm.c
> @@ -1165,7 +1165,9 @@ static int svm_vcpu_initialise(struct vcpu *v)
> return rc;
> }
>
> - vpmu_initialise(v);
> + /* PVH's VPMU is initialized via hypercall */
> + if ( is_hvm_vcpu(v) )
> + vpmu_initialise(v);
>
> svm_guest_osvw_init(v);
>
> diff --git a/xen/arch/x86/hvm/svm/vpmu.c b/xen/arch/x86/hvm/svm/vpmu.c
> index b60ca40..a8572a6 100644
> --- a/xen/arch/x86/hvm/svm/vpmu.c
> +++ b/xen/arch/x86/hvm/svm/vpmu.c
> @@ -364,13 +364,11 @@ static void amd_vpmu_destroy(struct vcpu *v)
> amd_vpmu_unset_msr_bitmap(v);
>
> xfree(vpmu->context);
> - vpmu_reset(vpmu, VPMU_CONTEXT_ALLOCATED);
>
> if ( vpmu_is_set(vpmu, VPMU_RUNNING) )
> - {
> - vpmu_reset(vpmu, VPMU_RUNNING);
> release_pmu_ownship(PMU_OWNER_HVM);
> - }
> +
> + vpmu_clear(vpmu);
> }
>
> /* VPMU part of the 'q' keyhandler */
> @@ -482,6 +480,16 @@ int __init amd_vpmu_init(void)
> return -EINVAL;
> }
>
> + if ( sizeof(struct xen_pmu_data) +
> + 2 * sizeof(uint64_t) * num_counters > PAGE_SIZE )
> + {
> + printk(XENLOG_WARNING
> + "VPMU: Register bank does not fit into VPMU shared page\n");
> + counters = ctrls = NULL;
> + num_counters = 0;
> + return -ENOSPC;
> + }
> +
> return 0;
> }
>
> diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
> index 0837627..50e11dd 100644
> --- a/xen/arch/x86/hvm/vmx/vmx.c
> +++ b/xen/arch/x86/hvm/vmx/vmx.c
> @@ -140,7 +140,9 @@ static int vmx_vcpu_initialise(struct vcpu *v)
> }
> }
>
> - vpmu_initialise(v);
> + /* PVH's VPMU is initialized via hypercall */
> + if ( is_hvm_vcpu(v) )
> + vpmu_initialise(v);
>
> vmx_install_vlapic_mapping(v);
>
> diff --git a/xen/arch/x86/hvm/vmx/vpmu_core2.c b/xen/arch/x86/hvm/vmx/vpmu_core2.c
> index 025c970..e7642e5 100644
> --- a/xen/arch/x86/hvm/vmx/vpmu_core2.c
> +++ b/xen/arch/x86/hvm/vmx/vpmu_core2.c
> @@ -365,13 +365,16 @@ static int core2_vpmu_alloc_resource(struct vcpu *v)
> if ( !acquire_pmu_ownership(PMU_OWNER_HVM) )
> return 0;
>
> - wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, 0);
> - if ( vmx_add_host_load_msr(MSR_CORE_PERF_GLOBAL_CTRL) )
> - goto out_err;
> + if ( has_hvm_container_vcpu(v) )
> + {
> + wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, 0);
> + if ( vmx_add_host_load_msr(MSR_CORE_PERF_GLOBAL_CTRL) )
> + goto out_err;
>
> - if ( vmx_add_guest_msr(MSR_CORE_PERF_GLOBAL_CTRL) )
> - goto out_err;
> - vmx_write_guest_msr(MSR_CORE_PERF_GLOBAL_CTRL, 0);
> + if ( vmx_add_guest_msr(MSR_CORE_PERF_GLOBAL_CTRL) )
> + goto out_err;
> + vmx_write_guest_msr(MSR_CORE_PERF_GLOBAL_CTRL, 0);
> + }
>
> core2_vpmu_cxt = xzalloc_bytes(sizeof(*core2_vpmu_cxt) +
> sizeof(uint64_t) * fixed_pmc_cnt +
> @@ -717,7 +720,7 @@ static void core2_vpmu_destroy(struct vcpu *v)
> if ( has_hvm_container_vcpu(v) && cpu_has_vmx_msr_bitmap )
> core2_vpmu_unset_msr_bitmap(v->arch.hvm_vmx.msr_bitmap);
> release_pmu_ownship(PMU_OWNER_HVM);
> - vpmu_reset(vpmu, VPMU_CONTEXT_ALLOCATED);
> + vpmu_clear(vpmu);
> }
>
> struct arch_vpmu_ops core2_vpmu_ops = {
> @@ -827,6 +830,10 @@ int vmx_vpmu_initialise(struct vcpu *v)
> ds_warned = 1;
> func_out:
>
> + /* PV domains can allocate resources immediately */
> + if ( is_pv_vcpu(v) && !core2_vpmu_alloc_resource(v) )
> + return -EIO;
> +
> vpmu->arch_vpmu_ops = &core2_vpmu_ops;
>
> return 0;
> @@ -897,6 +904,15 @@ int __init core2_vpmu_init(void)
>
> check_pmc_quirk();
>
> + if ( sizeof(struct xen_pmu_data) + sizeof(uint64_t) * fixed_pmc_cnt +
> + sizeof(struct xen_pmu_cntr_pair) * arch_pmc_cnt > PAGE_SIZE )
> + {
> + printk(XENLOG_WARNING
> + "VPMU: Register bank does not fit into VPMU share page\n");
> + arch_pmc_cnt = fixed_pmc_cnt = 0;
> + return -ENOSPC;
> + }
> +
> return 0;
> }
>
> diff --git a/xen/arch/x86/hvm/vpmu.c b/xen/arch/x86/hvm/vpmu.c
> index e7e552e..e3e5a9c 100644
> --- a/xen/arch/x86/hvm/vpmu.c
> +++ b/xen/arch/x86/hvm/vpmu.c
> @@ -27,6 +27,7 @@
> #include <asm/types.h>
> #include <asm/msr.h>
> #include <asm/nmi.h>
> +#include <asm/p2m.h>
> #include <asm/hvm/support.h>
> #include <asm/hvm/vmx/vmx.h>
> #include <asm/hvm/vmx/vmcs.h>
> @@ -257,22 +258,25 @@ void vpmu_initialise(struct vcpu *v)
> struct vpmu_struct *vpmu = vcpu_vpmu(v);
> uint8_t vendor = current_cpu_data.x86_vendor;
> int ret;
> + bool_t is_priv_vpmu = is_hardware_domain(v->domain);
>
> BUILD_BUG_ON(sizeof(struct xen_pmu_intel_ctxt) > XENPMU_CTXT_PAD_SZ);
> BUILD_BUG_ON(sizeof(struct xen_pmu_amd_ctxt) > XENPMU_CTXT_PAD_SZ);
>
> - if ( is_pvh_vcpu(v) )
> - return;
> -
> ASSERT(!vpmu->flags && !vpmu->context);
>
> - /*
> - * Count active VPMUs so that we won't try to change vpmu_mode while
> - * they are in use.
> - */
> - spin_lock(&vpmu_lock);
> - vpmu_count++;
> - spin_unlock(&vpmu_lock);
> + if ( !is_priv_vpmu )
> + {
> + /*
> + * Count active VPMUs so that we won't try to change vpmu_mode while
> + * they are in use.
> + * vpmu_mode can be safely updated while dom0's VPMUs are active and
> + * so we don't need to include it in the count.
> + */
> + spin_lock(&vpmu_lock);
> + vpmu_count++;
> + spin_unlock(&vpmu_lock);
> + }
>
> switch ( vendor )
> {
> @@ -299,7 +303,7 @@ void vpmu_initialise(struct vcpu *v)
> printk(XENLOG_G_WARNING "VPMU: Initialization failed for %pv\n", v);
>
> /* Intel needs to initialize VPMU ops even if VPMU is not in use */
> - if ( ret || (vpmu_mode == XENPMU_MODE_OFF) )
> + if ( !is_priv_vpmu && (ret || (vpmu_mode == XENPMU_MODE_OFF)) )
> {
> spin_lock(&vpmu_lock);
> vpmu_count--;
> @@ -332,13 +336,104 @@ void vpmu_destroy(struct vcpu *v)
> vpmu_clear_last, v, 1);
>
> if ( vpmu->arch_vpmu_ops && vpmu->arch_vpmu_ops->arch_vpmu_destroy )
> - vpmu->arch_vpmu_ops->arch_vpmu_destroy(v);
> + {
> + /* Unload VPMU first. This will stop counters */
> + on_selected_cpus(cpumask_of(vcpu_vpmu(v)->last_pcpu),
> + vpmu_save_force, v, 1);
> + vpmu->arch_vpmu_ops->arch_vpmu_destroy(v);
> + }
>
> spin_lock(&vpmu_lock);
> - vpmu_count--;
> + if ( !is_hardware_domain(v->domain) )
> + vpmu_count--;
> spin_unlock(&vpmu_lock);
> }
>
> +static int pvpmu_init(struct domain *d, xen_pmu_params_t *params)
> +{
> + struct vcpu *v;
> + struct vpmu_struct *vpmu;
> + struct page_info *page;
> + uint64_t gfn = params->val;
> +
> + if ( vpmu_mode == XENPMU_MODE_OFF )
> + return -EINVAL;
> +
> + if ( (params->vcpu >= d->max_vcpus) || (d->vcpu[params->vcpu] == NULL) )
> + return -EINVAL;
> +
> + page = get_page_from_gfn(d, gfn, NULL, P2M_ALLOC);
> + if ( !page )
> + return -EINVAL;
> +
> + if ( !get_page_type(page, PGT_writable_page) )
> + {
> + put_page(page);
> + return -EINVAL;
> + }
> +
> + v = d->vcpu[params->vcpu];
> + vpmu = vcpu_vpmu(v);
> +
> + spin_lock(&vpmu->vpmu_lock);
> +
> + if ( v->arch.vpmu.xenpmu_data )
> + {
> + spin_unlock(&vpmu->vpmu_lock);
> + put_page_and_type(page);
> + return -EEXIST;
> + }
> +
> + v->arch.vpmu.xenpmu_data = __map_domain_page_global(page);
> + if ( !v->arch.vpmu.xenpmu_data )
> + {
> + spin_unlock(&vpmu->vpmu_lock);
> + put_page_and_type(page);
> + return -ENOMEM;
> + }
> +
> + vpmu_initialise(v);
> +
> + spin_unlock(&vpmu->vpmu_lock);
> +
> + return 0;
> +}
> +
> +static void pvpmu_finish(struct domain *d, xen_pmu_params_t *params)
> +{
> + struct vcpu *v;
> + struct vpmu_struct *vpmu;
> + uint64_t mfn;
> + void *xenpmu_data;
> +
> + if ( (params->vcpu >= d->max_vcpus) || (d->vcpu[params->vcpu] == NULL) )
> + return;
> +
> + v = d->vcpu[params->vcpu];
> + if ( v != current )
> + vcpu_pause(v);
> +
> + vpmu = vcpu_vpmu(v);
> + spin_lock(&vpmu->vpmu_lock);
> +
> + vpmu_destroy(v);
> + xenpmu_data = vpmu->xenpmu_data;
> + vpmu->xenpmu_data = NULL;
> +
> + spin_unlock(&vpmu->vpmu_lock);
> +
> + if ( xenpmu_data )
> + {
> + mfn = domain_page_map_to_mfn(xenpmu_data);
> + ASSERT(mfn_valid(mfn));
> + unmap_domain_page_global(xenpmu_data);
> + put_page_and_type(mfn_to_page(mfn));
> + }
> +
> + if ( v != current )
> + vcpu_unpause(v);
> +}
> +
> /* Dump some vpmu informations on console. Used in keyhandler dump_domains(). */
> void vpmu_dump(struct vcpu *v)
> {
> @@ -365,6 +460,8 @@ long do_xenpmu_op(unsigned int op, XEN_GUEST_HANDLE_PARAM(xen_pmu_params_t) arg)
> {
> case XENPMU_mode_set:
> case XENPMU_feature_set:
> + case XENPMU_init:
> + case XENPMU_finish:
> if ( copy_from_guest(&pmu_params, arg, 1) )
> return -EFAULT;
>
> @@ -444,6 +541,14 @@ long do_xenpmu_op(unsigned int op, XEN_GUEST_HANDLE_PARAM(xen_pmu_params_t) arg)
>
> break;
>
> + case XENPMU_init:
> + ret = pvpmu_init(current->domain, &pmu_params);
> + break;
> +
> + case XENPMU_finish:
> + pvpmu_finish(current->domain, &pmu_params);
> + break;
> +
> default:
> ret = -EINVAL;
> }
> diff --git a/xen/common/event_channel.c b/xen/common/event_channel.c
> index 2208de0..adf742a 100644
> --- a/xen/common/event_channel.c
> +++ b/xen/common/event_channel.c
> @@ -106,6 +106,7 @@ static int virq_is_global(uint32_t virq)
> case VIRQ_TIMER:
> case VIRQ_DEBUG:
> case VIRQ_XENOPROF:
> + case VIRQ_XENPMU:
> rc = 0;
> break;
> case VIRQ_ARCH_0 ... VIRQ_ARCH_7:
> diff --git a/xen/include/asm-x86/hvm/vpmu.h b/xen/include/asm-x86/hvm/vpmu.h
> index 96f7666..642a4b7 100644
> --- a/xen/include/asm-x86/hvm/vpmu.h
> +++ b/xen/include/asm-x86/hvm/vpmu.h
> @@ -64,6 +64,8 @@ struct vpmu_struct {
> void *context; /* May be shared with PV guest */
> void *priv_context; /* hypervisor-only */
> struct arch_vpmu_ops *arch_vpmu_ops;
> + struct xen_pmu_data *xenpmu_data;
> + spinlock_t vpmu_lock;
> };
>
> /* VPMU states */
> diff --git a/xen/include/public/pmu.h b/xen/include/public/pmu.h
> index 50cc048..e6307b5 100644
> --- a/xen/include/public/pmu.h
> +++ b/xen/include/public/pmu.h
> @@ -25,6 +25,8 @@
> #define XENPMU_mode_set 1
> #define XENPMU_feature_get 2
> #define XENPMU_feature_set 3
> +#define XENPMU_init 4
> +#define XENPMU_finish 5
> /* ` } */
>
> /* Parameters structure for HYPERVISOR_xenpmu_op call */
> diff --git a/xen/include/public/xen.h b/xen/include/public/xen.h
> index 59fb544..1b98d8b 100644
> --- a/xen/include/public/xen.h
> +++ b/xen/include/public/xen.h
> @@ -161,6 +161,7 @@ DEFINE_XEN_GUEST_HANDLE(xen_ulong_t);
> #define VIRQ_MEM_EVENT 10 /* G. (DOM0) A memory event has occured */
> #define VIRQ_XC_RESERVED 11 /* G. Reserved for XenClient */
> #define VIRQ_ENOMEM 12 /* G. (DOM0) Low on heap memory */
> +#define VIRQ_XENPMU 13 /* V. PMC interrupt */
>
> /* Architecture-specific VIRQ definitions. */
> #define VIRQ_ARCH_0 16
> diff --git a/xen/include/xsm/dummy.h b/xen/include/xsm/dummy.h
> index a496116..6456f72 100644
> --- a/xen/include/xsm/dummy.h
> +++ b/xen/include/xsm/dummy.h
> @@ -704,6 +704,9 @@ static XSM_INLINE int xsm_pmu_op (XSM_DEFAULT_ARG struct domain *d, int op)
> case XENPMU_feature_set:
> case XENPMU_feature_get:
> return xsm_default_action(XSM_PRIV, d, current->domain);
> + case XENPMU_init:
> + case XENPMU_finish:
> + return xsm_default_action(XSM_HOOK, d, current->domain);
> default:
> return -EPERM;
> }
> diff --git a/xen/xsm/flask/hooks.c b/xen/xsm/flask/hooks.c
> index ac98966..aefcbda 100644
> --- a/xen/xsm/flask/hooks.c
> +++ b/xen/xsm/flask/hooks.c
> @@ -1592,6 +1592,10 @@ static int flask_pmu_op (struct domain *d, unsigned int op)
> case XENPMU_feature_get:
> return avc_has_perm(dsid, SECINITSID_XEN, SECCLASS_XEN2,
> XEN2__PMU_CTRL, NULL);
> + case XENPMU_init:
> + case XENPMU_finish:
> + return avc_has_perm(dsid, SECINITSID_XEN, SECCLASS_XEN2,
> + XEN2__PMU_USE, NULL);
> default:
> return -EPERM;
> }
> diff --git a/xen/xsm/flask/policy/access_vectors b/xen/xsm/flask/policy/access_vectors
> index 3a5d798..a6517a3 100644
> --- a/xen/xsm/flask/policy/access_vectors
> +++ b/xen/xsm/flask/policy/access_vectors
> @@ -89,6 +89,8 @@ class xen2
> get_symbol
> # PMU control
> pmu_ctrl
> +# PMU use (domains, including unprivileged ones, will be using this operation)
> + pmu_use
> }
>
> # Classes domain and domain2 consist of operations that a domain performs on
>
--
Company details: http://ts.fujitsu.com/imprint.html
^ permalink raw reply [flat|nested] 37+ messages in thread
* [PATCH v25 07/15] x86/VPMU: Save VPMU state for PV guests during context switch
2015-06-19 18:44 [PATCH v25 00/15] x86/PMU: Xen PMU PV(H) support Boris Ostrovsky
` (5 preceding siblings ...)
2015-06-19 18:44 ` [PATCH v25 06/15] x86/VPMU: Initialize PMU for PV(H) guests Boris Ostrovsky
@ 2015-06-19 18:44 ` Boris Ostrovsky
2015-06-22 15:14 ` Jan Beulich
2015-07-08 5:51 ` Dietmar Hahn
2015-06-19 18:44 ` [PATCH v25 08/15] x86/VPMU: When handling MSR accesses, leave fault injection to callers Boris Ostrovsky
` (9 subsequent siblings)
16 siblings, 2 replies; 37+ messages in thread
From: Boris Ostrovsky @ 2015-06-19 18:44 UTC (permalink / raw)
To: JBeulich, kevin.tian, suravee.suthikulpanit,
Aravind.Gopalakrishnan, dietmar.hahn, dgdegra, andrew.cooper3
Cc: boris.ostrovsky, xen-devel
Save VPMU state during context switch for both HVM and PV(H) guests.
A subsequent patch ("x86/VPMU: NMI-based VPMU support") will make it possible
for vpmu_switch_to() to call vmx_vmcs_try_enter()->vcpu_pause() which needs
is_running to be correctly set/cleared. To prepare for that, call context_saved()
before vpmu_switch_to() is executed. (Note that while this change could have
been delayed until that later patch, the changes are harmless to existing code
and so we do it here)
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
---
Changes in v25:
* Replaced is_hvm_vcpu with is_hvm_domain to be consistent with recent changes
to context_switch()
xen/arch/x86/domain.c | 22 ++++++++++------------
1 file changed, 10 insertions(+), 12 deletions(-)
diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index b699f68..11197e4 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -1543,17 +1543,14 @@ void context_switch(struct vcpu *prev, struct vcpu *next)
}
if ( prev != next )
- _update_runstate_area(prev);
-
- if ( is_hvm_domain(prevd) )
{
- if (prev != next)
- vpmu_switch_from(prev);
-
- if ( !list_empty(&prev->arch.hvm_vcpu.tm_list) )
- pt_save_timer(prev);
+ _update_runstate_area(prev);
+ vpmu_switch_from(prev);
}
+ if ( is_hvm_domain(prevd) && !list_empty(&prev->arch.hvm_vcpu.tm_list) )
+ pt_save_timer(prev);
+
local_irq_disable();
set_current(next);
@@ -1591,15 +1588,16 @@ void context_switch(struct vcpu *prev, struct vcpu *next)
!is_hardware_domain(nextd));
}
- if (is_hvm_domain(nextd) && (prev != next) )
- /* Must be done with interrupts enabled */
- vpmu_switch_to(next);
-
context_saved(prev);
if ( prev != next )
+ {
_update_runstate_area(next);
+ /* Must be done with interrupts enabled */
+ vpmu_switch_to(next);
+ }
+
/* Ensure that the vcpu has an up-to-date time base. */
update_vcpu_system_time(next);
--
1.8.1.4
^ permalink raw reply related [flat|nested] 37+ messages in thread
* Re: [PATCH v25 07/15] x86/VPMU: Save VPMU state for PV guests during context switch
2015-06-19 18:44 ` [PATCH v25 07/15] x86/VPMU: Save VPMU state for PV guests during context switch Boris Ostrovsky
@ 2015-06-22 15:14 ` Jan Beulich
2015-07-08 5:51 ` Dietmar Hahn
1 sibling, 0 replies; 37+ messages in thread
From: Jan Beulich @ 2015-06-22 15:14 UTC (permalink / raw)
To: Boris Ostrovsky
Cc: kevin.tian, andrew.cooper3, dietmar.hahn, xen-devel,
Aravind.Gopalakrishnan, suravee.suthikulpanit, dgdegra
>>> On 19.06.15 at 20:44, <boris.ostrovsky@oracle.com> wrote:
> Save VPMU state during context switch for both HVM and PV(H) guests.
>
> A subsequent patch ("x86/VPMU: NMI-based VPMU support") will make it possible
> for vpmu_switch_to() to call vmx_vmcs_try_enter()->vcpu_pause() which needs
> is_running to be correctly set/cleared. To prepare for that, call
> context_saved()
> before vpmu_switch_to() is executed. (Note that while this change could have
> been delayed until that later patch, the changes are harmless to existing
> code
> and so we do it here)
>
> Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
>
> ---
> Changes in v25:
> * Replaced is_hvm_vcpu with is_hvm_domain to be consistent with recent changes
> to context_switch()
That's not really a change, but just a re-base (your code doesn't add
any new uses), so imo doesn't invalidate previous ack/reviews. I.e.
Reviewed-by: Jan Beulich <jbeulich@suse.com>
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH v25 07/15] x86/VPMU: Save VPMU state for PV guests during context switch
2015-06-19 18:44 ` [PATCH v25 07/15] x86/VPMU: Save VPMU state for PV guests during context switch Boris Ostrovsky
2015-06-22 15:14 ` Jan Beulich
@ 2015-07-08 5:51 ` Dietmar Hahn
1 sibling, 0 replies; 37+ messages in thread
From: Dietmar Hahn @ 2015-07-08 5:51 UTC (permalink / raw)
To: Boris Ostrovsky
Cc: kevin.tian, JBeulich, andrew.cooper3, xen-devel,
Aravind.Gopalakrishnan, suravee.suthikulpanit, dgdegra
Am Freitag 19 Juni 2015, 14:44:38 schrieb Boris Ostrovsky:
> Save VPMU state during context switch for both HVM and PV(H) guests.
>
> A subsequent patch ("x86/VPMU: NMI-based VPMU support") will make it possible
> for vpmu_switch_to() to call vmx_vmcs_try_enter()->vcpu_pause() which needs
> is_running to be correctly set/cleared. To prepare for that, call context_saved()
> before vpmu_switch_to() is executed. (Note that while this change could have
> been delayed until that later patch, the changes are harmless to existing code
> and so we do it here)
>
> Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
>
> ---
> Changes in v25:
> * Replaced is_hvm_vcpu with is_hvm_domain to be consistent with recent changes
> to context_switch()
Reviewed-by: Dietmar Hahn <dietmar.hahn@ts.fujitsu.com>
>
> xen/arch/x86/domain.c | 22 ++++++++++------------
> 1 file changed, 10 insertions(+), 12 deletions(-)
>
> diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
> index b699f68..11197e4 100644
> --- a/xen/arch/x86/domain.c
> +++ b/xen/arch/x86/domain.c
> @@ -1543,17 +1543,14 @@ void context_switch(struct vcpu *prev, struct vcpu *next)
> }
>
> if ( prev != next )
> - _update_runstate_area(prev);
> -
> - if ( is_hvm_domain(prevd) )
> {
> - if (prev != next)
> - vpmu_switch_from(prev);
> -
> - if ( !list_empty(&prev->arch.hvm_vcpu.tm_list) )
> - pt_save_timer(prev);
> + _update_runstate_area(prev);
> + vpmu_switch_from(prev);
> }
>
> + if ( is_hvm_domain(prevd) && !list_empty(&prev->arch.hvm_vcpu.tm_list) )
> + pt_save_timer(prev);
> +
> local_irq_disable();
>
> set_current(next);
> @@ -1591,15 +1588,16 @@ void context_switch(struct vcpu *prev, struct vcpu *next)
> !is_hardware_domain(nextd));
> }
>
> - if (is_hvm_domain(nextd) && (prev != next) )
> - /* Must be done with interrupts enabled */
> - vpmu_switch_to(next);
> -
> context_saved(prev);
>
> if ( prev != next )
> + {
> _update_runstate_area(next);
>
> + /* Must be done with interrupts enabled */
> + vpmu_switch_to(next);
> + }
> +
> /* Ensure that the vcpu has an up-to-date time base. */
> update_vcpu_system_time(next);
>
>
--
Company details: http://ts.fujitsu.com/imprint.html
^ permalink raw reply [flat|nested] 37+ messages in thread
* [PATCH v25 08/15] x86/VPMU: When handling MSR accesses, leave fault injection to callers
2015-06-19 18:44 [PATCH v25 00/15] x86/PMU: Xen PMU PV(H) support Boris Ostrovsky
` (6 preceding siblings ...)
2015-06-19 18:44 ` [PATCH v25 07/15] x86/VPMU: Save VPMU state for PV guests during context switch Boris Ostrovsky
@ 2015-06-19 18:44 ` Boris Ostrovsky
2015-07-08 6:11 ` Dietmar Hahn
2015-06-19 18:44 ` [PATCH v25 09/15] x86/VPMU: Add support for PMU register handling on PV guests Boris Ostrovsky
` (8 subsequent siblings)
16 siblings, 1 reply; 37+ messages in thread
From: Boris Ostrovsky @ 2015-06-19 18:44 UTC (permalink / raw)
To: JBeulich, kevin.tian, suravee.suthikulpanit,
Aravind.Gopalakrishnan, dietmar.hahn, dgdegra, andrew.cooper3
Cc: boris.ostrovsky, xen-devel
Hypervisor cannot easily inject faults into PV guests from arch-specific VPMU
read/write MSR handlers (unlike in the case of HVM guests).
With this patch vpmu_do_msr() will return an error code to indicate whether an
error was encountered during MSR processing (instead of stating that the access
was to a VPMU register). The caller will then decide how to deal with the error.
As part of this patch we also check for validity of certain MSR accesses right
when we determine which register is being written, as opposed to postponing this
until later.
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Acked-by: Kevin Tian <kevin.tian@intel.com>
---
Changes in v25:
* Updated commit message to mention reason for changing vpmu_do_msr return values
xen/arch/x86/hvm/svm/svm.c | 6 ++-
xen/arch/x86/hvm/svm/vpmu.c | 6 +--
xen/arch/x86/hvm/vmx/vmx.c | 24 ++++++++---
xen/arch/x86/hvm/vmx/vpmu_core2.c | 86 +++++++++++++++------------------------
4 files changed, 57 insertions(+), 65 deletions(-)
diff --git a/xen/arch/x86/hvm/svm/svm.c b/xen/arch/x86/hvm/svm/svm.c
index 680eebe..3b5d15d 100644
--- a/xen/arch/x86/hvm/svm/svm.c
+++ b/xen/arch/x86/hvm/svm/svm.c
@@ -1708,7 +1708,8 @@ static int svm_msr_read_intercept(unsigned int msr, uint64_t *msr_content)
case MSR_AMD_FAM15H_EVNTSEL3:
case MSR_AMD_FAM15H_EVNTSEL4:
case MSR_AMD_FAM15H_EVNTSEL5:
- vpmu_do_rdmsr(msr, msr_content);
+ if ( vpmu_do_rdmsr(msr, msr_content) )
+ goto gpf;
break;
case MSR_AMD64_DR0_ADDRESS_MASK:
@@ -1859,7 +1860,8 @@ static int svm_msr_write_intercept(unsigned int msr, uint64_t msr_content)
case MSR_AMD_FAM15H_EVNTSEL3:
case MSR_AMD_FAM15H_EVNTSEL4:
case MSR_AMD_FAM15H_EVNTSEL5:
- vpmu_do_wrmsr(msr, msr_content, 0);
+ if ( vpmu_do_wrmsr(msr, msr_content, 0) )
+ goto gpf;
break;
case MSR_IA32_MCx_MISC(4): /* Threshold register */
diff --git a/xen/arch/x86/hvm/svm/vpmu.c b/xen/arch/x86/hvm/svm/vpmu.c
index a8572a6..74d03a5 100644
--- a/xen/arch/x86/hvm/svm/vpmu.c
+++ b/xen/arch/x86/hvm/svm/vpmu.c
@@ -305,7 +305,7 @@ static int amd_vpmu_do_wrmsr(unsigned int msr, uint64_t msr_content,
is_pmu_enabled(msr_content) && !vpmu_is_set(vpmu, VPMU_RUNNING) )
{
if ( !acquire_pmu_ownership(PMU_OWNER_HVM) )
- return 1;
+ return 0;
vpmu_set(vpmu, VPMU_RUNNING);
if ( has_hvm_container_vcpu(v) && is_msr_bitmap_on(vpmu) )
@@ -335,7 +335,7 @@ static int amd_vpmu_do_wrmsr(unsigned int msr, uint64_t msr_content,
/* Write to hw counters */
wrmsrl(msr, msr_content);
- return 1;
+ return 0;
}
static int amd_vpmu_do_rdmsr(unsigned int msr, uint64_t *msr_content)
@@ -353,7 +353,7 @@ static int amd_vpmu_do_rdmsr(unsigned int msr, uint64_t *msr_content)
rdmsrl(msr, *msr_content);
- return 1;
+ return 0;
}
static void amd_vpmu_destroy(struct vcpu *v)
diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
index 50e11dd..db1fa82 100644
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -2166,12 +2166,17 @@ static int vmx_msr_read_intercept(unsigned int msr, uint64_t *msr_content)
*msr_content |= MSR_IA32_MISC_ENABLE_BTS_UNAVAIL |
MSR_IA32_MISC_ENABLE_PEBS_UNAVAIL;
/* Perhaps vpmu will change some bits. */
+ /* FALLTHROUGH */
+ case MSR_P6_PERFCTR(0)...MSR_P6_PERFCTR(7):
+ case MSR_P6_EVNTSEL(0)...MSR_P6_EVNTSEL(3):
+ case MSR_CORE_PERF_FIXED_CTR0...MSR_CORE_PERF_FIXED_CTR2:
+ case MSR_CORE_PERF_FIXED_CTR_CTRL...MSR_CORE_PERF_GLOBAL_OVF_CTRL:
+ case MSR_IA32_PEBS_ENABLE:
+ case MSR_IA32_DS_AREA:
if ( vpmu_do_rdmsr(msr, msr_content) )
- goto done;
+ goto gp_fault;
break;
default:
- if ( vpmu_do_rdmsr(msr, msr_content) )
- break;
if ( passive_domain_do_rdmsr(msr, msr_content) )
goto done;
switch ( long_mode_do_msr_read(msr, msr_content) )
@@ -2347,7 +2352,7 @@ static int vmx_msr_write_intercept(unsigned int msr, uint64_t msr_content)
if ( msr_content & ~supported )
{
/* Perhaps some other bits are supported in vpmu. */
- if ( !vpmu_do_wrmsr(msr, msr_content, supported) )
+ if ( vpmu_do_wrmsr(msr, msr_content, supported) )
break;
}
if ( msr_content & IA32_DEBUGCTLMSR_LBR )
@@ -2375,9 +2380,16 @@ static int vmx_msr_write_intercept(unsigned int msr, uint64_t msr_content)
if ( !nvmx_msr_write_intercept(msr, msr_content) )
goto gp_fault;
break;
+ case MSR_P6_PERFCTR(0)...MSR_P6_PERFCTR(7):
+ case MSR_P6_EVNTSEL(0)...MSR_P6_EVNTSEL(7):
+ case MSR_CORE_PERF_FIXED_CTR0...MSR_CORE_PERF_FIXED_CTR2:
+ case MSR_CORE_PERF_FIXED_CTR_CTRL...MSR_CORE_PERF_GLOBAL_OVF_CTRL:
+ case MSR_IA32_PEBS_ENABLE:
+ case MSR_IA32_DS_AREA:
+ if ( vpmu_do_wrmsr(msr, msr_content, 0) )
+ goto gp_fault;
+ break;
default:
- if ( vpmu_do_wrmsr(msr, msr_content, 0) )
- return X86EMUL_OKAY;
if ( passive_domain_do_wrmsr(msr, msr_content) )
return X86EMUL_OKAY;
diff --git a/xen/arch/x86/hvm/vmx/vpmu_core2.c b/xen/arch/x86/hvm/vmx/vpmu_core2.c
index e7642e5..9710149 100644
--- a/xen/arch/x86/hvm/vmx/vpmu_core2.c
+++ b/xen/arch/x86/hvm/vmx/vpmu_core2.c
@@ -454,36 +454,41 @@ static int core2_vpmu_do_wrmsr(unsigned int msr, uint64_t msr_content,
IA32_DEBUGCTLMSR_BTS_OFF_USR;
if ( !(msr_content & ~supported) &&
vpmu_is_set(vpmu, VPMU_CPU_HAS_BTS) )
- return 1;
+ return 0;
if ( (msr_content & supported) &&
!vpmu_is_set(vpmu, VPMU_CPU_HAS_BTS) )
printk(XENLOG_G_WARNING
"%pv: Debug Store unsupported on this CPU\n",
current);
}
- return 0;
+ return -EINVAL;
}
ASSERT(!supported);
+ if ( type == MSR_TYPE_COUNTER &&
+ (msr_content &
+ ~((1ull << core2_get_bitwidth_fix_count()) - 1)) )
+ /* Writing unsupported bits to a fixed counter */
+ return -EINVAL;
+
core2_vpmu_cxt = vpmu->context;
enabled_cntrs = vpmu->priv_context;
switch ( msr )
{
case MSR_CORE_PERF_GLOBAL_OVF_CTRL:
core2_vpmu_cxt->global_status &= ~msr_content;
- return 1;
+ return 0;
case MSR_CORE_PERF_GLOBAL_STATUS:
gdprintk(XENLOG_INFO, "Can not write readonly MSR: "
"MSR_PERF_GLOBAL_STATUS(0x38E)!\n");
- hvm_inject_hw_exception(TRAP_gp_fault, 0);
- return 1;
+ return -EINVAL;
case MSR_IA32_PEBS_ENABLE:
if ( msr_content & 1 )
gdprintk(XENLOG_WARNING, "Guest is trying to enable PEBS, "
"which is not supported.\n");
core2_vpmu_cxt->pebs_enable = msr_content;
- return 1;
+ return 0;
case MSR_IA32_DS_AREA:
if ( vpmu_is_set(vpmu, VPMU_CPU_HAS_DS) )
{
@@ -492,18 +497,21 @@ static int core2_vpmu_do_wrmsr(unsigned int msr, uint64_t msr_content,
gdprintk(XENLOG_WARNING,
"Illegal address for IA32_DS_AREA: %#" PRIx64 "x\n",
msr_content);
- hvm_inject_hw_exception(TRAP_gp_fault, 0);
- return 1;
+ return -EINVAL;
}
core2_vpmu_cxt->ds_area = msr_content;
break;
}
gdprintk(XENLOG_WARNING, "Guest setting of DTS is ignored.\n");
- return 1;
+ return 0;
case MSR_CORE_PERF_GLOBAL_CTRL:
global_ctrl = msr_content;
break;
case MSR_CORE_PERF_FIXED_CTR_CTRL:
+ if ( msr_content &
+ ( ~((1ull << (fixed_pmc_cnt * FIXED_CTR_CTRL_BITS)) - 1)) )
+ return -EINVAL;
+
vmx_read_guest_msr(MSR_CORE_PERF_GLOBAL_CTRL, &global_ctrl);
*enabled_cntrs &= ~(((1ULL << fixed_pmc_cnt) - 1) << 32);
if ( msr_content != 0 )
@@ -526,6 +534,9 @@ static int core2_vpmu_do_wrmsr(unsigned int msr, uint64_t msr_content,
struct xen_pmu_cntr_pair *xen_pmu_cntr_pair =
vpmu_reg_pointer(core2_vpmu_cxt, arch_counters);
+ if ( msr_content & (~((1ull << 32) - 1)) )
+ return -EINVAL;
+
vmx_read_guest_msr(MSR_CORE_PERF_GLOBAL_CTRL, &global_ctrl);
if ( msr_content & (1ULL << 22) )
@@ -537,45 +548,17 @@ static int core2_vpmu_do_wrmsr(unsigned int msr, uint64_t msr_content,
}
}
+ if ( type != MSR_TYPE_GLOBAL )
+ wrmsrl(msr, msr_content);
+ else
+ vmx_write_guest_msr(MSR_CORE_PERF_GLOBAL_CTRL, msr_content);
+
if ( (global_ctrl & *enabled_cntrs) || (core2_vpmu_cxt->ds_area != 0) )
vpmu_set(vpmu, VPMU_RUNNING);
else
vpmu_reset(vpmu, VPMU_RUNNING);
- if ( type != MSR_TYPE_GLOBAL )
- {
- u64 mask;
- int inject_gp = 0;
- switch ( type )
- {
- case MSR_TYPE_ARCH_CTRL: /* MSR_P6_EVNTSEL[0,...] */
- mask = ~((1ull << 32) - 1);
- if (msr_content & mask)
- inject_gp = 1;
- break;
- case MSR_TYPE_CTRL: /* IA32_FIXED_CTR_CTRL */
- if ( msr == MSR_IA32_DS_AREA )
- break;
- /* 4 bits per counter, currently 3 fixed counters implemented. */
- mask = ~((1ull << (fixed_pmc_cnt * FIXED_CTR_CTRL_BITS)) - 1);
- if (msr_content & mask)
- inject_gp = 1;
- break;
- case MSR_TYPE_COUNTER: /* IA32_FIXED_CTR[0-2] */
- mask = ~((1ull << core2_get_bitwidth_fix_count()) - 1);
- if (msr_content & mask)
- inject_gp = 1;
- break;
- }
- if (inject_gp)
- hvm_inject_hw_exception(TRAP_gp_fault, 0);
- else
- wrmsrl(msr, msr_content);
- }
- else
- vmx_write_guest_msr(MSR_CORE_PERF_GLOBAL_CTRL, msr_content);
-
- return 1;
+ return 0;
}
static int core2_vpmu_do_rdmsr(unsigned int msr, uint64_t *msr_content)
@@ -603,19 +586,14 @@ static int core2_vpmu_do_rdmsr(unsigned int msr, uint64_t *msr_content)
rdmsrl(msr, *msr_content);
}
}
- else
+ else if ( msr == MSR_IA32_MISC_ENABLE )
{
/* Extension for BTS */
- if ( msr == MSR_IA32_MISC_ENABLE )
- {
- if ( vpmu_is_set(vpmu, VPMU_CPU_HAS_BTS) )
- *msr_content &= ~MSR_IA32_MISC_ENABLE_BTS_UNAVAIL;
- }
- else
- return 0;
+ if ( vpmu_is_set(vpmu, VPMU_CPU_HAS_BTS) )
+ *msr_content &= ~MSR_IA32_MISC_ENABLE_BTS_UNAVAIL;
}
- return 1;
+ return 0;
}
static void core2_vpmu_do_cpuid(unsigned int input,
@@ -760,9 +738,9 @@ static int core2_no_vpmu_do_rdmsr(unsigned int msr, uint64_t *msr_content)
{
int type = -1, index = -1;
if ( !is_core2_vpmu_msr(msr, &type, &index) )
- return 0;
+ return -EINVAL;
*msr_content = 0;
- return 1;
+ return 0;
}
/*
--
1.8.1.4
^ permalink raw reply related [flat|nested] 37+ messages in thread
* Re: [PATCH v25 08/15] x86/VPMU: When handling MSR accesses, leave fault injection to callers
2015-06-19 18:44 ` [PATCH v25 08/15] x86/VPMU: When handling MSR accesses, leave fault injection to callers Boris Ostrovsky
@ 2015-07-08 6:11 ` Dietmar Hahn
0 siblings, 0 replies; 37+ messages in thread
From: Dietmar Hahn @ 2015-07-08 6:11 UTC (permalink / raw)
To: Boris Ostrovsky
Cc: kevin.tian, JBeulich, andrew.cooper3, xen-devel,
Aravind.Gopalakrishnan, suravee.suthikulpanit, dgdegra
Am Freitag 19 Juni 2015, 14:44:39 schrieb Boris Ostrovsky:
> Hypervisor cannot easily inject faults into PV guests from arch-specific VPMU
> read/write MSR handlers (unlike in the case of HVM guests).
>
> With this patch vpmu_do_msr() will return an error code to indicate whether an
> error was encountered during MSR processing (instead of stating that the access
> was to a VPMU register). The caller will then decide how to deal with the error.
>
> As part of this patch we also check for validity of certain MSR accesses right
> when we determine which register is being written, as opposed to postponing this
> until later.
>
> Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
> Acked-by: Kevin Tian <kevin.tian@intel.com>
> ---
> Changes in v25:
> * Updated commit message to mention reason for changing vpmu_do_msr return values
Reviewed-by: Dietmar Hahn <dietmar.hahn@ts.fujitsu.com>
>
> xen/arch/x86/hvm/svm/svm.c | 6 ++-
> xen/arch/x86/hvm/svm/vpmu.c | 6 +--
> xen/arch/x86/hvm/vmx/vmx.c | 24 ++++++++---
> xen/arch/x86/hvm/vmx/vpmu_core2.c | 86 +++++++++++++++------------------------
> 4 files changed, 57 insertions(+), 65 deletions(-)
>
> diff --git a/xen/arch/x86/hvm/svm/svm.c b/xen/arch/x86/hvm/svm/svm.c
> index 680eebe..3b5d15d 100644
> --- a/xen/arch/x86/hvm/svm/svm.c
> +++ b/xen/arch/x86/hvm/svm/svm.c
> @@ -1708,7 +1708,8 @@ static int svm_msr_read_intercept(unsigned int msr, uint64_t *msr_content)
> case MSR_AMD_FAM15H_EVNTSEL3:
> case MSR_AMD_FAM15H_EVNTSEL4:
> case MSR_AMD_FAM15H_EVNTSEL5:
> - vpmu_do_rdmsr(msr, msr_content);
> + if ( vpmu_do_rdmsr(msr, msr_content) )
> + goto gpf;
> break;
>
> case MSR_AMD64_DR0_ADDRESS_MASK:
> @@ -1859,7 +1860,8 @@ static int svm_msr_write_intercept(unsigned int msr, uint64_t msr_content)
> case MSR_AMD_FAM15H_EVNTSEL3:
> case MSR_AMD_FAM15H_EVNTSEL4:
> case MSR_AMD_FAM15H_EVNTSEL5:
> - vpmu_do_wrmsr(msr, msr_content, 0);
> + if ( vpmu_do_wrmsr(msr, msr_content, 0) )
> + goto gpf;
> break;
>
> case MSR_IA32_MCx_MISC(4): /* Threshold register */
> diff --git a/xen/arch/x86/hvm/svm/vpmu.c b/xen/arch/x86/hvm/svm/vpmu.c
> index a8572a6..74d03a5 100644
> --- a/xen/arch/x86/hvm/svm/vpmu.c
> +++ b/xen/arch/x86/hvm/svm/vpmu.c
> @@ -305,7 +305,7 @@ static int amd_vpmu_do_wrmsr(unsigned int msr, uint64_t msr_content,
> is_pmu_enabled(msr_content) && !vpmu_is_set(vpmu, VPMU_RUNNING) )
> {
> if ( !acquire_pmu_ownership(PMU_OWNER_HVM) )
> - return 1;
> + return 0;
> vpmu_set(vpmu, VPMU_RUNNING);
>
> if ( has_hvm_container_vcpu(v) && is_msr_bitmap_on(vpmu) )
> @@ -335,7 +335,7 @@ static int amd_vpmu_do_wrmsr(unsigned int msr, uint64_t msr_content,
>
> /* Write to hw counters */
> wrmsrl(msr, msr_content);
> - return 1;
> + return 0;
> }
>
> static int amd_vpmu_do_rdmsr(unsigned int msr, uint64_t *msr_content)
> @@ -353,7 +353,7 @@ static int amd_vpmu_do_rdmsr(unsigned int msr, uint64_t *msr_content)
>
> rdmsrl(msr, *msr_content);
>
> - return 1;
> + return 0;
> }
>
> static void amd_vpmu_destroy(struct vcpu *v)
> diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
> index 50e11dd..db1fa82 100644
> --- a/xen/arch/x86/hvm/vmx/vmx.c
> +++ b/xen/arch/x86/hvm/vmx/vmx.c
> @@ -2166,12 +2166,17 @@ static int vmx_msr_read_intercept(unsigned int msr, uint64_t *msr_content)
> *msr_content |= MSR_IA32_MISC_ENABLE_BTS_UNAVAIL |
> MSR_IA32_MISC_ENABLE_PEBS_UNAVAIL;
> /* Perhaps vpmu will change some bits. */
> + /* FALLTHROUGH */
> + case MSR_P6_PERFCTR(0)...MSR_P6_PERFCTR(7):
> + case MSR_P6_EVNTSEL(0)...MSR_P6_EVNTSEL(3):
> + case MSR_CORE_PERF_FIXED_CTR0...MSR_CORE_PERF_FIXED_CTR2:
> + case MSR_CORE_PERF_FIXED_CTR_CTRL...MSR_CORE_PERF_GLOBAL_OVF_CTRL:
> + case MSR_IA32_PEBS_ENABLE:
> + case MSR_IA32_DS_AREA:
> if ( vpmu_do_rdmsr(msr, msr_content) )
> - goto done;
> + goto gp_fault;
> break;
> default:
> - if ( vpmu_do_rdmsr(msr, msr_content) )
> - break;
> if ( passive_domain_do_rdmsr(msr, msr_content) )
> goto done;
> switch ( long_mode_do_msr_read(msr, msr_content) )
> @@ -2347,7 +2352,7 @@ static int vmx_msr_write_intercept(unsigned int msr, uint64_t msr_content)
> if ( msr_content & ~supported )
> {
> /* Perhaps some other bits are supported in vpmu. */
> - if ( !vpmu_do_wrmsr(msr, msr_content, supported) )
> + if ( vpmu_do_wrmsr(msr, msr_content, supported) )
> break;
> }
> if ( msr_content & IA32_DEBUGCTLMSR_LBR )
> @@ -2375,9 +2380,16 @@ static int vmx_msr_write_intercept(unsigned int msr, uint64_t msr_content)
> if ( !nvmx_msr_write_intercept(msr, msr_content) )
> goto gp_fault;
> break;
> + case MSR_P6_PERFCTR(0)...MSR_P6_PERFCTR(7):
> + case MSR_P6_EVNTSEL(0)...MSR_P6_EVNTSEL(7):
> + case MSR_CORE_PERF_FIXED_CTR0...MSR_CORE_PERF_FIXED_CTR2:
> + case MSR_CORE_PERF_FIXED_CTR_CTRL...MSR_CORE_PERF_GLOBAL_OVF_CTRL:
> + case MSR_IA32_PEBS_ENABLE:
> + case MSR_IA32_DS_AREA:
> + if ( vpmu_do_wrmsr(msr, msr_content, 0) )
> + goto gp_fault;
> + break;
> default:
> - if ( vpmu_do_wrmsr(msr, msr_content, 0) )
> - return X86EMUL_OKAY;
> if ( passive_domain_do_wrmsr(msr, msr_content) )
> return X86EMUL_OKAY;
>
> diff --git a/xen/arch/x86/hvm/vmx/vpmu_core2.c b/xen/arch/x86/hvm/vmx/vpmu_core2.c
> index e7642e5..9710149 100644
> --- a/xen/arch/x86/hvm/vmx/vpmu_core2.c
> +++ b/xen/arch/x86/hvm/vmx/vpmu_core2.c
> @@ -454,36 +454,41 @@ static int core2_vpmu_do_wrmsr(unsigned int msr, uint64_t msr_content,
> IA32_DEBUGCTLMSR_BTS_OFF_USR;
> if ( !(msr_content & ~supported) &&
> vpmu_is_set(vpmu, VPMU_CPU_HAS_BTS) )
> - return 1;
> + return 0;
> if ( (msr_content & supported) &&
> !vpmu_is_set(vpmu, VPMU_CPU_HAS_BTS) )
> printk(XENLOG_G_WARNING
> "%pv: Debug Store unsupported on this CPU\n",
> current);
> }
> - return 0;
> + return -EINVAL;
> }
>
> ASSERT(!supported);
>
> + if ( type == MSR_TYPE_COUNTER &&
> + (msr_content &
> + ~((1ull << core2_get_bitwidth_fix_count()) - 1)) )
> + /* Writing unsupported bits to a fixed counter */
> + return -EINVAL;
> +
> core2_vpmu_cxt = vpmu->context;
> enabled_cntrs = vpmu->priv_context;
> switch ( msr )
> {
> case MSR_CORE_PERF_GLOBAL_OVF_CTRL:
> core2_vpmu_cxt->global_status &= ~msr_content;
> - return 1;
> + return 0;
> case MSR_CORE_PERF_GLOBAL_STATUS:
> gdprintk(XENLOG_INFO, "Can not write readonly MSR: "
> "MSR_PERF_GLOBAL_STATUS(0x38E)!\n");
> - hvm_inject_hw_exception(TRAP_gp_fault, 0);
> - return 1;
> + return -EINVAL;
> case MSR_IA32_PEBS_ENABLE:
> if ( msr_content & 1 )
> gdprintk(XENLOG_WARNING, "Guest is trying to enable PEBS, "
> "which is not supported.\n");
> core2_vpmu_cxt->pebs_enable = msr_content;
> - return 1;
> + return 0;
> case MSR_IA32_DS_AREA:
> if ( vpmu_is_set(vpmu, VPMU_CPU_HAS_DS) )
> {
> @@ -492,18 +497,21 @@ static int core2_vpmu_do_wrmsr(unsigned int msr, uint64_t msr_content,
> gdprintk(XENLOG_WARNING,
> "Illegal address for IA32_DS_AREA: %#" PRIx64 "x\n",
> msr_content);
> - hvm_inject_hw_exception(TRAP_gp_fault, 0);
> - return 1;
> + return -EINVAL;
> }
> core2_vpmu_cxt->ds_area = msr_content;
> break;
> }
> gdprintk(XENLOG_WARNING, "Guest setting of DTS is ignored.\n");
> - return 1;
> + return 0;
> case MSR_CORE_PERF_GLOBAL_CTRL:
> global_ctrl = msr_content;
> break;
> case MSR_CORE_PERF_FIXED_CTR_CTRL:
> + if ( msr_content &
> + ( ~((1ull << (fixed_pmc_cnt * FIXED_CTR_CTRL_BITS)) - 1)) )
> + return -EINVAL;
> +
> vmx_read_guest_msr(MSR_CORE_PERF_GLOBAL_CTRL, &global_ctrl);
> *enabled_cntrs &= ~(((1ULL << fixed_pmc_cnt) - 1) << 32);
> if ( msr_content != 0 )
> @@ -526,6 +534,9 @@ static int core2_vpmu_do_wrmsr(unsigned int msr, uint64_t msr_content,
> struct xen_pmu_cntr_pair *xen_pmu_cntr_pair =
> vpmu_reg_pointer(core2_vpmu_cxt, arch_counters);
>
> + if ( msr_content & (~((1ull << 32) - 1)) )
> + return -EINVAL;
> +
> vmx_read_guest_msr(MSR_CORE_PERF_GLOBAL_CTRL, &global_ctrl);
>
> if ( msr_content & (1ULL << 22) )
> @@ -537,45 +548,17 @@ static int core2_vpmu_do_wrmsr(unsigned int msr, uint64_t msr_content,
> }
> }
>
> + if ( type != MSR_TYPE_GLOBAL )
> + wrmsrl(msr, msr_content);
> + else
> + vmx_write_guest_msr(MSR_CORE_PERF_GLOBAL_CTRL, msr_content);
> +
> if ( (global_ctrl & *enabled_cntrs) || (core2_vpmu_cxt->ds_area != 0) )
> vpmu_set(vpmu, VPMU_RUNNING);
> else
> vpmu_reset(vpmu, VPMU_RUNNING);
>
> - if ( type != MSR_TYPE_GLOBAL )
> - {
> - u64 mask;
> - int inject_gp = 0;
> - switch ( type )
> - {
> - case MSR_TYPE_ARCH_CTRL: /* MSR_P6_EVNTSEL[0,...] */
> - mask = ~((1ull << 32) - 1);
> - if (msr_content & mask)
> - inject_gp = 1;
> - break;
> - case MSR_TYPE_CTRL: /* IA32_FIXED_CTR_CTRL */
> - if ( msr == MSR_IA32_DS_AREA )
> - break;
> - /* 4 bits per counter, currently 3 fixed counters implemented. */
> - mask = ~((1ull << (fixed_pmc_cnt * FIXED_CTR_CTRL_BITS)) - 1);
> - if (msr_content & mask)
> - inject_gp = 1;
> - break;
> - case MSR_TYPE_COUNTER: /* IA32_FIXED_CTR[0-2] */
> - mask = ~((1ull << core2_get_bitwidth_fix_count()) - 1);
> - if (msr_content & mask)
> - inject_gp = 1;
> - break;
> - }
> - if (inject_gp)
> - hvm_inject_hw_exception(TRAP_gp_fault, 0);
> - else
> - wrmsrl(msr, msr_content);
> - }
> - else
> - vmx_write_guest_msr(MSR_CORE_PERF_GLOBAL_CTRL, msr_content);
> -
> - return 1;
> + return 0;
> }
>
> static int core2_vpmu_do_rdmsr(unsigned int msr, uint64_t *msr_content)
> @@ -603,19 +586,14 @@ static int core2_vpmu_do_rdmsr(unsigned int msr, uint64_t *msr_content)
> rdmsrl(msr, *msr_content);
> }
> }
> - else
> + else if ( msr == MSR_IA32_MISC_ENABLE )
> {
> /* Extension for BTS */
> - if ( msr == MSR_IA32_MISC_ENABLE )
> - {
> - if ( vpmu_is_set(vpmu, VPMU_CPU_HAS_BTS) )
> - *msr_content &= ~MSR_IA32_MISC_ENABLE_BTS_UNAVAIL;
> - }
> - else
> - return 0;
> + if ( vpmu_is_set(vpmu, VPMU_CPU_HAS_BTS) )
> + *msr_content &= ~MSR_IA32_MISC_ENABLE_BTS_UNAVAIL;
> }
>
> - return 1;
> + return 0;
> }
>
> static void core2_vpmu_do_cpuid(unsigned int input,
> @@ -760,9 +738,9 @@ static int core2_no_vpmu_do_rdmsr(unsigned int msr, uint64_t *msr_content)
> {
> int type = -1, index = -1;
> if ( !is_core2_vpmu_msr(msr, &type, &index) )
> - return 0;
> + return -EINVAL;
> *msr_content = 0;
> - return 1;
> + return 0;
> }
>
> /*
>
--
Company details: http://ts.fujitsu.com/imprint.html
^ permalink raw reply [flat|nested] 37+ messages in thread
* [PATCH v25 09/15] x86/VPMU: Add support for PMU register handling on PV guests
2015-06-19 18:44 [PATCH v25 00/15] x86/PMU: Xen PMU PV(H) support Boris Ostrovsky
` (7 preceding siblings ...)
2015-06-19 18:44 ` [PATCH v25 08/15] x86/VPMU: When handling MSR accesses, leave fault injection to callers Boris Ostrovsky
@ 2015-06-19 18:44 ` Boris Ostrovsky
2015-06-19 18:44 ` [PATCH v25 10/15] x86/VPMU: Use pre-computed masks when checking validity of MSRs Boris Ostrovsky
` (7 subsequent siblings)
16 siblings, 0 replies; 37+ messages in thread
From: Boris Ostrovsky @ 2015-06-19 18:44 UTC (permalink / raw)
To: JBeulich, kevin.tian, suravee.suthikulpanit,
Aravind.Gopalakrishnan, dietmar.hahn, dgdegra, andrew.cooper3
Cc: boris.ostrovsky, xen-devel
Intercept accesses to PMU MSRs and process them in VPMU module. If vpmu ops
for VCPU are not initialized (which is the case, for example, for PV guests that
are not "VPMU-enlightened") access to MSRs will return failure.
Dump VPMU state for all domains (HVM and PV) when requested.
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Dietmar Hahn <dietmar.hahn@ts.fujitsu.com>
Tested-by: Dietmar Hahn <dietmar.hahn@ts.fujitsu.com>
---
xen/arch/x86/domain.c | 3 +--
xen/arch/x86/hvm/vmx/vpmu_core2.c | 49 +++++++++++++++++++++++++++++++------
xen/arch/x86/hvm/vpmu.c | 3 +++
xen/arch/x86/traps.c | 51 +++++++++++++++++++++++++++++++++++++--
4 files changed, 95 insertions(+), 11 deletions(-)
diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index 11197e4..be4963f 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -2075,8 +2075,7 @@ void arch_dump_vcpu_info(struct vcpu *v)
{
paging_dump_vcpu_info(v);
- if ( is_hvm_vcpu(v) )
- vpmu_dump(v);
+ vpmu_dump(v);
}
void domain_cpuid(
diff --git a/xen/arch/x86/hvm/vmx/vpmu_core2.c b/xen/arch/x86/hvm/vmx/vpmu_core2.c
index 9710149..089154e 100644
--- a/xen/arch/x86/hvm/vmx/vpmu_core2.c
+++ b/xen/arch/x86/hvm/vmx/vpmu_core2.c
@@ -27,6 +27,7 @@
#include <asm/regs.h>
#include <asm/types.h>
#include <asm/apic.h>
+#include <asm/traps.h>
#include <asm/msr.h>
#include <asm/msr-index.h>
#include <asm/hvm/support.h>
@@ -299,12 +300,18 @@ static inline void __core2_vpmu_save(struct vcpu *v)
rdmsrl(MSR_CORE_PERF_FIXED_CTR0 + i, fixed_counters[i]);
for ( i = 0; i < arch_pmc_cnt; i++ )
rdmsrl(MSR_IA32_PERFCTR0 + i, xen_pmu_cntr_pair[i].counter);
+
+ if ( !has_hvm_container_vcpu(v) )
+ rdmsrl(MSR_CORE_PERF_GLOBAL_STATUS, core2_vpmu_cxt->global_status);
}
static int core2_vpmu_save(struct vcpu *v)
{
struct vpmu_struct *vpmu = vcpu_vpmu(v);
+ if ( !has_hvm_container_vcpu(v) )
+ wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, 0);
+
if ( !vpmu_are_all_set(vpmu, VPMU_CONTEXT_SAVE | VPMU_CONTEXT_LOADED) )
return 0;
@@ -342,6 +349,13 @@ static inline void __core2_vpmu_load(struct vcpu *v)
wrmsrl(MSR_CORE_PERF_FIXED_CTR_CTRL, core2_vpmu_cxt->fixed_ctrl);
wrmsrl(MSR_IA32_DS_AREA, core2_vpmu_cxt->ds_area);
wrmsrl(MSR_IA32_PEBS_ENABLE, core2_vpmu_cxt->pebs_enable);
+
+ if ( !has_hvm_container_vcpu(v) )
+ {
+ wrmsrl(MSR_CORE_PERF_GLOBAL_OVF_CTRL, core2_vpmu_cxt->global_ovf_ctrl);
+ core2_vpmu_cxt->global_ovf_ctrl = 0;
+ wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, core2_vpmu_cxt->global_ctrl);
+ }
}
static void core2_vpmu_load(struct vcpu *v)
@@ -433,7 +447,6 @@ static int core2_vpmu_msr_common_check(u32 msr_index, int *type, int *index)
static int core2_vpmu_do_wrmsr(unsigned int msr, uint64_t msr_content,
uint64_t supported)
{
- u64 global_ctrl;
int i, tmp;
int type = -1, index = -1;
struct vcpu *v = current;
@@ -477,7 +490,12 @@ static int core2_vpmu_do_wrmsr(unsigned int msr, uint64_t msr_content,
switch ( msr )
{
case MSR_CORE_PERF_GLOBAL_OVF_CTRL:
+ if ( msr_content & ~(0xC000000000000000 |
+ (((1ULL << fixed_pmc_cnt) - 1) << 32) |
+ ((1ULL << arch_pmc_cnt) - 1)) )
+ return -EINVAL;
core2_vpmu_cxt->global_status &= ~msr_content;
+ wrmsrl(MSR_CORE_PERF_GLOBAL_OVF_CTRL, msr_content);
return 0;
case MSR_CORE_PERF_GLOBAL_STATUS:
gdprintk(XENLOG_INFO, "Can not write readonly MSR: "
@@ -505,14 +523,18 @@ static int core2_vpmu_do_wrmsr(unsigned int msr, uint64_t msr_content,
gdprintk(XENLOG_WARNING, "Guest setting of DTS is ignored.\n");
return 0;
case MSR_CORE_PERF_GLOBAL_CTRL:
- global_ctrl = msr_content;
+ core2_vpmu_cxt->global_ctrl = msr_content;
break;
case MSR_CORE_PERF_FIXED_CTR_CTRL:
if ( msr_content &
( ~((1ull << (fixed_pmc_cnt * FIXED_CTR_CTRL_BITS)) - 1)) )
return -EINVAL;
- vmx_read_guest_msr(MSR_CORE_PERF_GLOBAL_CTRL, &global_ctrl);
+ if ( has_hvm_container_vcpu(v) )
+ vmx_read_guest_msr(MSR_CORE_PERF_GLOBAL_CTRL,
+ &core2_vpmu_cxt->global_ctrl);
+ else
+ rdmsrl(MSR_CORE_PERF_GLOBAL_CTRL, core2_vpmu_cxt->global_ctrl);
*enabled_cntrs &= ~(((1ULL << fixed_pmc_cnt) - 1) << 32);
if ( msr_content != 0 )
{
@@ -537,7 +559,11 @@ static int core2_vpmu_do_wrmsr(unsigned int msr, uint64_t msr_content,
if ( msr_content & (~((1ull << 32) - 1)) )
return -EINVAL;
- vmx_read_guest_msr(MSR_CORE_PERF_GLOBAL_CTRL, &global_ctrl);
+ if ( has_hvm_container_vcpu(v) )
+ vmx_read_guest_msr(MSR_CORE_PERF_GLOBAL_CTRL,
+ &core2_vpmu_cxt->global_ctrl);
+ else
+ rdmsrl(MSR_CORE_PERF_GLOBAL_CTRL, core2_vpmu_cxt->global_ctrl);
if ( msr_content & (1ULL << 22) )
*enabled_cntrs |= 1ULL << tmp;
@@ -551,9 +577,15 @@ static int core2_vpmu_do_wrmsr(unsigned int msr, uint64_t msr_content,
if ( type != MSR_TYPE_GLOBAL )
wrmsrl(msr, msr_content);
else
- vmx_write_guest_msr(MSR_CORE_PERF_GLOBAL_CTRL, msr_content);
+ {
+ if ( has_hvm_container_vcpu(v) )
+ vmx_write_guest_msr(MSR_CORE_PERF_GLOBAL_CTRL, msr_content);
+ else
+ wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, msr_content);
+ }
- if ( (global_ctrl & *enabled_cntrs) || (core2_vpmu_cxt->ds_area != 0) )
+ if ( (core2_vpmu_cxt->global_ctrl & *enabled_cntrs) ||
+ (core2_vpmu_cxt->ds_area != 0) )
vpmu_set(vpmu, VPMU_RUNNING);
else
vpmu_reset(vpmu, VPMU_RUNNING);
@@ -580,7 +612,10 @@ static int core2_vpmu_do_rdmsr(unsigned int msr, uint64_t *msr_content)
*msr_content = core2_vpmu_cxt->global_status;
break;
case MSR_CORE_PERF_GLOBAL_CTRL:
- vmx_read_guest_msr(MSR_CORE_PERF_GLOBAL_CTRL, msr_content);
+ if ( has_hvm_container_vcpu(v) )
+ vmx_read_guest_msr(MSR_CORE_PERF_GLOBAL_CTRL, msr_content);
+ else
+ rdmsrl(MSR_CORE_PERF_GLOBAL_CTRL, *msr_content);
break;
default:
rdmsrl(msr, *msr_content);
diff --git a/xen/arch/x86/hvm/vpmu.c b/xen/arch/x86/hvm/vpmu.c
index e3e5a9c..6d3f0ab 100644
--- a/xen/arch/x86/hvm/vpmu.c
+++ b/xen/arch/x86/hvm/vpmu.c
@@ -119,6 +119,9 @@ int vpmu_do_rdmsr(unsigned int msr, uint64_t *msr_content)
if ( vpmu->arch_vpmu_ops && vpmu->arch_vpmu_ops->do_rdmsr )
return vpmu->arch_vpmu_ops->do_rdmsr(msr, msr_content);
+ else
+ *msr_content = 0;
+
return 0;
}
diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
index c634008..d2ff1a9 100644
--- a/xen/arch/x86/traps.c
+++ b/xen/arch/x86/traps.c
@@ -72,6 +72,7 @@
#include <asm/apic.h>
#include <asm/mc146818rtc.h>
#include <asm/hpet.h>
+#include <asm/hvm/vpmu.h>
#include <public/arch-x86/cpuid.h>
#include <xsm/xsm.h>
@@ -969,8 +970,10 @@ void pv_cpuid(struct cpu_user_regs *regs)
__clear_bit(X86_FEATURE_TOPOEXT % 32, &c);
break;
+ case 0x0000000a: /* Architectural Performance Monitor Features (Intel) */
+ break;
+
case 0x00000005: /* MONITOR/MWAIT */
- case 0x0000000a: /* Architectural Performance Monitor Features */
case 0x0000000b: /* Extended Topology Enumeration */
case 0x8000000a: /* SVM revision and features */
case 0x8000001b: /* Instruction Based Sampling */
@@ -986,6 +989,9 @@ void pv_cpuid(struct cpu_user_regs *regs)
}
out:
+ /* VPMU may decide to modify some of the leaves */
+ vpmu_do_cpuid(regs->eax, &a, &b, &c, &d);
+
regs->eax = a;
regs->ebx = b;
regs->ecx = c;
@@ -2008,6 +2014,7 @@ static int emulate_privileged_op(struct cpu_user_regs *regs)
char *io_emul_stub = NULL;
void (*io_emul)(struct cpu_user_regs *) __attribute__((__regparm__(1)));
uint64_t val;
+ bool_t vpmu_msr;
if ( !read_descriptor(regs->cs, v, regs,
&code_base, &code_limit, &ar,
@@ -2500,7 +2507,7 @@ static int emulate_privileged_op(struct cpu_user_regs *regs)
uint32_t eax = regs->eax;
uint32_t edx = regs->edx;
uint64_t msr_content = ((uint64_t)edx << 32) | eax;
-
+ vpmu_msr = 0;
switch ( regs->_ecx )
{
case MSR_FS_BASE:
@@ -2637,6 +2644,22 @@ static int emulate_privileged_op(struct cpu_user_regs *regs)
if ( v->arch.debugreg[7] & DR7_ACTIVE_MASK )
wrmsrl(regs->_ecx, msr_content);
break;
+ case MSR_P6_PERFCTR(0)...MSR_P6_PERFCTR(7):
+ case MSR_P6_EVNTSEL(0)...MSR_P6_EVNTSEL(3):
+ case MSR_CORE_PERF_FIXED_CTR0...MSR_CORE_PERF_FIXED_CTR2:
+ case MSR_CORE_PERF_FIXED_CTR_CTRL...MSR_CORE_PERF_GLOBAL_OVF_CTRL:
+ if ( boot_cpu_data.x86_vendor == X86_VENDOR_INTEL )
+ {
+ vpmu_msr = 1;
+ case MSR_AMD_FAM15H_EVNTSEL0...MSR_AMD_FAM15H_PERFCTR5:
+ if ( vpmu_msr || (boot_cpu_data.x86_vendor == X86_VENDOR_AMD) )
+ {
+ if ( vpmu_do_wrmsr(regs->ecx, msr_content, 0) )
+ goto fail;
+ }
+ break;
+ }
+ /*FALLTHROUGH*/
default:
if ( wrmsr_hypervisor_regs(regs->ecx, msr_content) == 1 )
@@ -2672,6 +2695,7 @@ static int emulate_privileged_op(struct cpu_user_regs *regs)
break;
case 0x32: /* RDMSR */
+ vpmu_msr = 0;
switch ( regs->_ecx )
{
case MSR_FS_BASE:
@@ -2739,6 +2763,29 @@ static int emulate_privileged_op(struct cpu_user_regs *regs)
[regs->_ecx - MSR_AMD64_DR1_ADDRESS_MASK + 1];
regs->edx = 0;
break;
+ case MSR_IA32_PERF_CAPABILITIES:
+ /* No extra capabilities are supported */
+ regs->eax = regs->edx = 0;
+ break;
+ case MSR_P6_PERFCTR(0)...MSR_P6_PERFCTR(7):
+ case MSR_P6_EVNTSEL(0)...MSR_P6_EVNTSEL(3):
+ case MSR_CORE_PERF_FIXED_CTR0...MSR_CORE_PERF_FIXED_CTR2:
+ case MSR_CORE_PERF_FIXED_CTR_CTRL...MSR_CORE_PERF_GLOBAL_OVF_CTRL:
+ if ( boot_cpu_data.x86_vendor == X86_VENDOR_INTEL )
+ {
+ vpmu_msr = 1;
+ case MSR_AMD_FAM15H_EVNTSEL0...MSR_AMD_FAM15H_PERFCTR5:
+ if ( vpmu_msr || (boot_cpu_data.x86_vendor == X86_VENDOR_AMD) )
+ {
+ if ( vpmu_do_rdmsr(regs->ecx, &val) )
+ goto fail;
+
+ regs->eax = (uint32_t)val;
+ regs->edx = (uint32_t)(val >> 32);
+ }
+ break;
+ }
+ /*FALLTHROUGH*/
default:
if ( rdmsr_hypervisor_regs(regs->ecx, &val) )
--
1.8.1.4
^ permalink raw reply related [flat|nested] 37+ messages in thread
* [PATCH v25 10/15] x86/VPMU: Use pre-computed masks when checking validity of MSRs
2015-06-19 18:44 [PATCH v25 00/15] x86/PMU: Xen PMU PV(H) support Boris Ostrovsky
` (8 preceding siblings ...)
2015-06-19 18:44 ` [PATCH v25 09/15] x86/VPMU: Add support for PMU register handling on PV guests Boris Ostrovsky
@ 2015-06-19 18:44 ` Boris Ostrovsky
2015-07-08 6:49 ` Dietmar Hahn
2015-06-19 18:44 ` [PATCH v25 11/15] VPMU/AMD: Check MSR values before writing to hardware Boris Ostrovsky
` (6 subsequent siblings)
16 siblings, 1 reply; 37+ messages in thread
From: Boris Ostrovsky @ 2015-06-19 18:44 UTC (permalink / raw)
To: JBeulich, kevin.tian, suravee.suthikulpanit,
Aravind.Gopalakrishnan, dietmar.hahn, dgdegra, andrew.cooper3
Cc: boris.ostrovsky, xen-devel
No need to compute those masks on every MSR access.
Also, when checking MSR_P6_EVNTSELx registers make sure that bit 21
(which is a reserved bit) is not set.
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Acked-by: Kevin Tian <kevin.tian@intel.com>
---
xen/arch/x86/hvm/vmx/vpmu_core2.c | 28 ++++++++++++++++++----------
1 file changed, 18 insertions(+), 10 deletions(-)
diff --git a/xen/arch/x86/hvm/vmx/vpmu_core2.c b/xen/arch/x86/hvm/vmx/vpmu_core2.c
index 089154e..166277a 100644
--- a/xen/arch/x86/hvm/vmx/vpmu_core2.c
+++ b/xen/arch/x86/hvm/vmx/vpmu_core2.c
@@ -80,9 +80,16 @@ static bool_t __read_mostly full_width_write;
#define FIXED_CTR_CTRL_BITS 4
#define FIXED_CTR_CTRL_MASK ((1 << FIXED_CTR_CTRL_BITS) - 1)
+#define ARCH_CNTR_ENABLED (1ULL << 22)
+
/* Number of general-purpose and fixed performance counters */
static unsigned int __read_mostly arch_pmc_cnt, fixed_pmc_cnt;
+/* Masks used for testing whether an MSR is valid */
+#define ARCH_CTRL_MASK (~((1ull << 32) - 1) | (1ull << 21))
+static uint64_t __read_mostly fixed_ctrl_mask, fixed_counters_mask;
+static uint64_t __read_mostly global_ovf_ctrl_mask;
+
/*
* QUIRK to workaround an issue on various family 6 cpus.
* The issue leads to endless PMC interrupt loops on the processor.
@@ -479,9 +486,7 @@ static int core2_vpmu_do_wrmsr(unsigned int msr, uint64_t msr_content,
ASSERT(!supported);
- if ( type == MSR_TYPE_COUNTER &&
- (msr_content &
- ~((1ull << core2_get_bitwidth_fix_count()) - 1)) )
+ if ( (type == MSR_TYPE_COUNTER) && (msr_content & fixed_counters_mask) )
/* Writing unsupported bits to a fixed counter */
return -EINVAL;
@@ -490,9 +495,7 @@ static int core2_vpmu_do_wrmsr(unsigned int msr, uint64_t msr_content,
switch ( msr )
{
case MSR_CORE_PERF_GLOBAL_OVF_CTRL:
- if ( msr_content & ~(0xC000000000000000 |
- (((1ULL << fixed_pmc_cnt) - 1) << 32) |
- ((1ULL << arch_pmc_cnt) - 1)) )
+ if ( msr_content & global_ovf_ctrl_mask )
return -EINVAL;
core2_vpmu_cxt->global_status &= ~msr_content;
wrmsrl(MSR_CORE_PERF_GLOBAL_OVF_CTRL, msr_content);
@@ -526,8 +529,7 @@ static int core2_vpmu_do_wrmsr(unsigned int msr, uint64_t msr_content,
core2_vpmu_cxt->global_ctrl = msr_content;
break;
case MSR_CORE_PERF_FIXED_CTR_CTRL:
- if ( msr_content &
- ( ~((1ull << (fixed_pmc_cnt * FIXED_CTR_CTRL_BITS)) - 1)) )
+ if ( msr_content & fixed_ctrl_mask )
return -EINVAL;
if ( has_hvm_container_vcpu(v) )
@@ -556,7 +558,7 @@ static int core2_vpmu_do_wrmsr(unsigned int msr, uint64_t msr_content,
struct xen_pmu_cntr_pair *xen_pmu_cntr_pair =
vpmu_reg_pointer(core2_vpmu_cxt, arch_counters);
- if ( msr_content & (~((1ull << 32) - 1)) )
+ if ( msr_content & ARCH_CTRL_MASK )
return -EINVAL;
if ( has_hvm_container_vcpu(v) )
@@ -565,7 +567,7 @@ static int core2_vpmu_do_wrmsr(unsigned int msr, uint64_t msr_content,
else
rdmsrl(MSR_CORE_PERF_GLOBAL_CTRL, core2_vpmu_cxt->global_ctrl);
- if ( msr_content & (1ULL << 22) )
+ if ( msr_content & ARCH_CNTR_ENABLED )
*enabled_cntrs |= 1ULL << tmp;
else
*enabled_cntrs &= ~(1ULL << tmp);
@@ -915,6 +917,12 @@ int __init core2_vpmu_init(void)
rdmsrl(MSR_IA32_PERF_CAPABILITIES, caps);
full_width_write = (caps >> 13) & 1;
+ fixed_ctrl_mask = ~((1ull << (fixed_pmc_cnt * FIXED_CTR_CTRL_BITS)) - 1);
+ fixed_counters_mask = ~((1ull << core2_get_bitwidth_fix_count()) - 1);
+ global_ovf_ctrl_mask = ~(0xC000000000000000 |
+ (((1ULL << fixed_pmc_cnt) - 1) << 32) |
+ ((1ULL << arch_pmc_cnt) - 1));
+
check_pmc_quirk();
if ( sizeof(struct xen_pmu_data) + sizeof(uint64_t) * fixed_pmc_cnt +
--
1.8.1.4
^ permalink raw reply related [flat|nested] 37+ messages in thread
* Re: [PATCH v25 10/15] x86/VPMU: Use pre-computed masks when checking validity of MSRs
2015-06-19 18:44 ` [PATCH v25 10/15] x86/VPMU: Use pre-computed masks when checking validity of MSRs Boris Ostrovsky
@ 2015-07-08 6:49 ` Dietmar Hahn
0 siblings, 0 replies; 37+ messages in thread
From: Dietmar Hahn @ 2015-07-08 6:49 UTC (permalink / raw)
To: Boris Ostrovsky
Cc: kevin.tian, JBeulich, andrew.cooper3, xen-devel,
Aravind.Gopalakrishnan, suravee.suthikulpanit, dgdegra
Am Freitag 19 Juni 2015, 14:44:41 schrieb Boris Ostrovsky:
> No need to compute those masks on every MSR access.
>
> Also, when checking MSR_P6_EVNTSELx registers make sure that bit 21
> (which is a reserved bit) is not set.
>
> Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
> Acked-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Dietmar Hahn <dietmar.hahn@ts.fujitsu.com>
> ---
> xen/arch/x86/hvm/vmx/vpmu_core2.c | 28 ++++++++++++++++++----------
> 1 file changed, 18 insertions(+), 10 deletions(-)
>
> diff --git a/xen/arch/x86/hvm/vmx/vpmu_core2.c b/xen/arch/x86/hvm/vmx/vpmu_core2.c
> index 089154e..166277a 100644
> --- a/xen/arch/x86/hvm/vmx/vpmu_core2.c
> +++ b/xen/arch/x86/hvm/vmx/vpmu_core2.c
> @@ -80,9 +80,16 @@ static bool_t __read_mostly full_width_write;
> #define FIXED_CTR_CTRL_BITS 4
> #define FIXED_CTR_CTRL_MASK ((1 << FIXED_CTR_CTRL_BITS) - 1)
>
> +#define ARCH_CNTR_ENABLED (1ULL << 22)
> +
> /* Number of general-purpose and fixed performance counters */
> static unsigned int __read_mostly arch_pmc_cnt, fixed_pmc_cnt;
>
> +/* Masks used for testing whether an MSR is valid */
> +#define ARCH_CTRL_MASK (~((1ull << 32) - 1) | (1ull << 21))
> +static uint64_t __read_mostly fixed_ctrl_mask, fixed_counters_mask;
> +static uint64_t __read_mostly global_ovf_ctrl_mask;
> +
> /*
> * QUIRK to workaround an issue on various family 6 cpus.
> * The issue leads to endless PMC interrupt loops on the processor.
> @@ -479,9 +486,7 @@ static int core2_vpmu_do_wrmsr(unsigned int msr, uint64_t msr_content,
>
> ASSERT(!supported);
>
> - if ( type == MSR_TYPE_COUNTER &&
> - (msr_content &
> - ~((1ull << core2_get_bitwidth_fix_count()) - 1)) )
> + if ( (type == MSR_TYPE_COUNTER) && (msr_content & fixed_counters_mask) )
> /* Writing unsupported bits to a fixed counter */
> return -EINVAL;
>
> @@ -490,9 +495,7 @@ static int core2_vpmu_do_wrmsr(unsigned int msr, uint64_t msr_content,
> switch ( msr )
> {
> case MSR_CORE_PERF_GLOBAL_OVF_CTRL:
> - if ( msr_content & ~(0xC000000000000000 |
> - (((1ULL << fixed_pmc_cnt) - 1) << 32) |
> - ((1ULL << arch_pmc_cnt) - 1)) )
> + if ( msr_content & global_ovf_ctrl_mask )
> return -EINVAL;
> core2_vpmu_cxt->global_status &= ~msr_content;
> wrmsrl(MSR_CORE_PERF_GLOBAL_OVF_CTRL, msr_content);
> @@ -526,8 +529,7 @@ static int core2_vpmu_do_wrmsr(unsigned int msr, uint64_t msr_content,
> core2_vpmu_cxt->global_ctrl = msr_content;
> break;
> case MSR_CORE_PERF_FIXED_CTR_CTRL:
> - if ( msr_content &
> - ( ~((1ull << (fixed_pmc_cnt * FIXED_CTR_CTRL_BITS)) - 1)) )
> + if ( msr_content & fixed_ctrl_mask )
> return -EINVAL;
>
> if ( has_hvm_container_vcpu(v) )
> @@ -556,7 +558,7 @@ static int core2_vpmu_do_wrmsr(unsigned int msr, uint64_t msr_content,
> struct xen_pmu_cntr_pair *xen_pmu_cntr_pair =
> vpmu_reg_pointer(core2_vpmu_cxt, arch_counters);
>
> - if ( msr_content & (~((1ull << 32) - 1)) )
> + if ( msr_content & ARCH_CTRL_MASK )
> return -EINVAL;
>
> if ( has_hvm_container_vcpu(v) )
> @@ -565,7 +567,7 @@ static int core2_vpmu_do_wrmsr(unsigned int msr, uint64_t msr_content,
> else
> rdmsrl(MSR_CORE_PERF_GLOBAL_CTRL, core2_vpmu_cxt->global_ctrl);
>
> - if ( msr_content & (1ULL << 22) )
> + if ( msr_content & ARCH_CNTR_ENABLED )
> *enabled_cntrs |= 1ULL << tmp;
> else
> *enabled_cntrs &= ~(1ULL << tmp);
> @@ -915,6 +917,12 @@ int __init core2_vpmu_init(void)
> rdmsrl(MSR_IA32_PERF_CAPABILITIES, caps);
> full_width_write = (caps >> 13) & 1;
>
> + fixed_ctrl_mask = ~((1ull << (fixed_pmc_cnt * FIXED_CTR_CTRL_BITS)) - 1);
> + fixed_counters_mask = ~((1ull << core2_get_bitwidth_fix_count()) - 1);
> + global_ovf_ctrl_mask = ~(0xC000000000000000 |
> + (((1ULL << fixed_pmc_cnt) - 1) << 32) |
> + ((1ULL << arch_pmc_cnt) - 1));
> +
> check_pmc_quirk();
>
> if ( sizeof(struct xen_pmu_data) + sizeof(uint64_t) * fixed_pmc_cnt +
>
--
Company details: http://ts.fujitsu.com/imprint.html
^ permalink raw reply [flat|nested] 37+ messages in thread
* [PATCH v25 11/15] VPMU/AMD: Check MSR values before writing to hardware
2015-06-19 18:44 [PATCH v25 00/15] x86/PMU: Xen PMU PV(H) support Boris Ostrovsky
` (9 preceding siblings ...)
2015-06-19 18:44 ` [PATCH v25 10/15] x86/VPMU: Use pre-computed masks when checking validity of MSRs Boris Ostrovsky
@ 2015-06-19 18:44 ` Boris Ostrovsky
2015-07-08 15:35 ` Aravind Gopalakrishnan
2015-06-19 18:44 ` [PATCH v25 12/15] x86/VPMU: Handle PMU interrupts for PV(H) guests Boris Ostrovsky
` (5 subsequent siblings)
16 siblings, 1 reply; 37+ messages in thread
From: Boris Ostrovsky @ 2015-06-19 18:44 UTC (permalink / raw)
To: JBeulich, kevin.tian, suravee.suthikulpanit,
Aravind.Gopalakrishnan, dietmar.hahn, dgdegra, andrew.cooper3
Cc: boris.ostrovsky, xen-devel
A number of fields of PMU control MSRs are defined as Reserved. AMD
documentation requires that such fields are preserved when the register
is written by software.
Add checks to amd_vpmu_do_wrmsr() to make sure that guests don't attempt
to modify those bits.
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
---
xen/arch/x86/hvm/svm/vpmu.c | 49 +++++++++++++++++++++++++++++++++++++++------
1 file changed, 43 insertions(+), 6 deletions(-)
diff --git a/xen/arch/x86/hvm/svm/vpmu.c b/xen/arch/x86/hvm/svm/vpmu.c
index 74d03a5..934f1b7 100644
--- a/xen/arch/x86/hvm/svm/vpmu.c
+++ b/xen/arch/x86/hvm/svm/vpmu.c
@@ -48,6 +48,7 @@ static bool_t __read_mostly k7_counters_mirrored;
#define F10H_NUM_COUNTERS 4
#define F15H_NUM_COUNTERS 6
+#define MAX_NUM_COUNTERS F15H_NUM_COUNTERS
/* PMU Counter MSRs. */
static const u32 AMD_F10H_COUNTERS[] = {
@@ -83,6 +84,11 @@ static const u32 AMD_F15H_CTRLS[] = {
MSR_AMD_FAM15H_EVNTSEL5
};
+/* Bits [63:42], [39:36], 21 and 19 are reserved */
+#define CTRL_RSVD_MASK ((-1ULL & (~((1ULL << 42) - 1))) | \
+ (0xfULL << 36) | (1ULL << 21) | (1ULL << 19))
+static uint64_t __read_mostly ctrl_rsvd[MAX_NUM_COUNTERS];
+
/* Use private context as a flag for MSR bitmap */
#define msr_bitmap_on(vpmu) do { \
(vpmu)->priv_context = (void *)-1L; \
@@ -92,17 +98,24 @@ static const u32 AMD_F15H_CTRLS[] = {
} while (0)
#define is_msr_bitmap_on(vpmu) ((vpmu)->priv_context != NULL)
-static inline int get_pmu_reg_type(u32 addr)
+static inline int get_pmu_reg_type(u32 addr, unsigned int *idx)
{
if ( (addr >= MSR_K7_EVNTSEL0) && (addr <= MSR_K7_EVNTSEL3) )
+ {
+ *idx = addr - MSR_K7_EVNTSEL0;
return MSR_TYPE_CTRL;
+ }
if ( (addr >= MSR_K7_PERFCTR0) && (addr <= MSR_K7_PERFCTR3) )
+ {
+ *idx = addr - MSR_K7_PERFCTR0;
return MSR_TYPE_COUNTER;
+ }
if ( (addr >= MSR_AMD_FAM15H_EVNTSEL0) &&
(addr <= MSR_AMD_FAM15H_PERFCTR5 ) )
{
+ *idx = (addr - MSR_AMD_FAM15H_EVNTSEL0) >> 1;
if (addr & 1)
return MSR_TYPE_COUNTER;
else
@@ -140,6 +153,16 @@ static inline u32 get_fam15h_addr(u32 addr)
return addr;
}
+static void amd_vpmu_init_regs(struct xen_pmu_amd_ctxt *ctxt)
+{
+ unsigned i;
+ uint64_t *ctrl_regs = vpmu_reg_pointer(ctxt, ctrls);
+
+ memset(&ctxt->regs[0], 0, 2 * sizeof(uint64_t) * num_counters);
+ for ( i = 0; i < num_counters; i++ )
+ ctrl_regs[i] = ctrl_rsvd[i];
+}
+
static void amd_vpmu_set_msr_bitmap(struct vcpu *v)
{
unsigned int i;
@@ -289,19 +312,24 @@ static int amd_vpmu_do_wrmsr(unsigned int msr, uint64_t msr_content,
{
struct vcpu *v = current;
struct vpmu_struct *vpmu = vcpu_vpmu(v);
+ unsigned int idx = 0;
+ int type = get_pmu_reg_type(msr, &idx);
ASSERT(!supported);
+ if ( (type == MSR_TYPE_CTRL ) &&
+ ((msr_content & CTRL_RSVD_MASK) != ctrl_rsvd[idx]) )
+ return -EINVAL;
+
/* For all counters, enable guest only mode for HVM guest */
- if ( has_hvm_container_vcpu(v) &&
- (get_pmu_reg_type(msr) == MSR_TYPE_CTRL) &&
+ if ( has_hvm_container_vcpu(v) && (type == MSR_TYPE_CTRL) &&
!is_guest_mode(msr_content) )
{
set_guest_mode(msr_content);
}
/* check if the first counter is enabled */
- if ( (get_pmu_reg_type(msr) == MSR_TYPE_CTRL) &&
+ if ( (type == MSR_TYPE_CTRL) &&
is_pmu_enabled(msr_content) && !vpmu_is_set(vpmu, VPMU_RUNNING) )
{
if ( !acquire_pmu_ownership(PMU_OWNER_HVM) )
@@ -313,7 +341,7 @@ static int amd_vpmu_do_wrmsr(unsigned int msr, uint64_t msr_content,
}
/* stop saving & restore if guest stops first counter */
- if ( (get_pmu_reg_type(msr) == MSR_TYPE_CTRL) &&
+ if ( (type == MSR_TYPE_CTRL) &&
(is_pmu_enabled(msr_content) == 0) && vpmu_is_set(vpmu, VPMU_RUNNING) )
{
vpmu_reset(vpmu, VPMU_RUNNING);
@@ -433,7 +461,7 @@ int svm_vpmu_initialise(struct vcpu *v)
if ( !counters )
return -EINVAL;
- ctxt = xzalloc_bytes(sizeof(*ctxt) +
+ ctxt = xmalloc_bytes(sizeof(*ctxt) +
2 * sizeof(uint64_t) * num_counters);
if ( !ctxt )
{
@@ -445,6 +473,7 @@ int svm_vpmu_initialise(struct vcpu *v)
ctxt->counters = sizeof(*ctxt);
ctxt->ctrls = ctxt->counters + sizeof(uint64_t) * num_counters;
+ amd_vpmu_init_regs(ctxt);
vpmu->context = ctxt;
vpmu->priv_context = NULL;
@@ -457,6 +486,8 @@ int svm_vpmu_initialise(struct vcpu *v)
int __init amd_vpmu_init(void)
{
+ unsigned int i;
+
switch ( current_cpu_data.x86 )
{
case 0x15:
@@ -490,6 +521,12 @@ int __init amd_vpmu_init(void)
return -ENOSPC;
}
+ for ( i = 0; i < num_counters; i++ )
+ {
+ rdmsrl(ctrls[i], ctrl_rsvd[i]);
+ ctrl_rsvd[i] &= CTRL_RSVD_MASK;
+ }
+
return 0;
}
--
1.8.1.4
^ permalink raw reply related [flat|nested] 37+ messages in thread
* Re: [PATCH v25 11/15] VPMU/AMD: Check MSR values before writing to hardware
2015-06-19 18:44 ` [PATCH v25 11/15] VPMU/AMD: Check MSR values before writing to hardware Boris Ostrovsky
@ 2015-07-08 15:35 ` Aravind Gopalakrishnan
0 siblings, 0 replies; 37+ messages in thread
From: Aravind Gopalakrishnan @ 2015-07-08 15:35 UTC (permalink / raw)
To: Boris Ostrovsky, JBeulich, kevin.tian, suravee.suthikulpanit,
dietmar.hahn, dgdegra, andrew.cooper3
Cc: xen-devel
On 6/19/2015 1:44 PM, Boris Ostrovsky wrote:
> A number of fields of PMU control MSRs are defined as Reserved. AMD
> documentation requires that such fields are preserved when the register
> is written by software.
>
> Add checks to amd_vpmu_do_wrmsr() to make sure that guests don't attempt
> to modify those bits.
>
> Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: Aravind Gopalakrishnan <Aravind.Gopalakrishnan@amd.com>
> ---
> xen/arch/x86/hvm/svm/vpmu.c | 49 +++++++++++++++++++++++++++++++++++++++------
> 1 file changed, 43 insertions(+), 6 deletions(-)
>
> diff --git a/xen/arch/x86/hvm/svm/vpmu.c b/xen/arch/x86/hvm/svm/vpmu.c
> index 74d03a5..934f1b7 100644
> --- a/xen/arch/x86/hvm/svm/vpmu.c
> +++ b/xen/arch/x86/hvm/svm/vpmu.c
> @@ -48,6 +48,7 @@ static bool_t __read_mostly k7_counters_mirrored;
>
> #define F10H_NUM_COUNTERS 4
> #define F15H_NUM_COUNTERS 6
> +#define MAX_NUM_COUNTERS F15H_NUM_COUNTERS
>
> /* PMU Counter MSRs. */
> static const u32 AMD_F10H_COUNTERS[] = {
> @@ -83,6 +84,11 @@ static const u32 AMD_F15H_CTRLS[] = {
> MSR_AMD_FAM15H_EVNTSEL5
> };
>
> +/* Bits [63:42], [39:36], 21 and 19 are reserved */
> +#define CTRL_RSVD_MASK ((-1ULL & (~((1ULL << 42) - 1))) | \
> + (0xfULL << 36) | (1ULL << 21) | (1ULL << 19))
> +static uint64_t __read_mostly ctrl_rsvd[MAX_NUM_COUNTERS];
> +
> /* Use private context as a flag for MSR bitmap */
> #define msr_bitmap_on(vpmu) do { \
> (vpmu)->priv_context = (void *)-1L; \
> @@ -92,17 +98,24 @@ static const u32 AMD_F15H_CTRLS[] = {
> } while (0)
> #define is_msr_bitmap_on(vpmu) ((vpmu)->priv_context != NULL)
>
> -static inline int get_pmu_reg_type(u32 addr)
> +static inline int get_pmu_reg_type(u32 addr, unsigned int *idx)
> {
> if ( (addr >= MSR_K7_EVNTSEL0) && (addr <= MSR_K7_EVNTSEL3) )
> + {
> + *idx = addr - MSR_K7_EVNTSEL0;
> return MSR_TYPE_CTRL;
> + }
>
> if ( (addr >= MSR_K7_PERFCTR0) && (addr <= MSR_K7_PERFCTR3) )
> + {
> + *idx = addr - MSR_K7_PERFCTR0;
> return MSR_TYPE_COUNTER;
> + }
>
> if ( (addr >= MSR_AMD_FAM15H_EVNTSEL0) &&
> (addr <= MSR_AMD_FAM15H_PERFCTR5 ) )
> {
> + *idx = (addr - MSR_AMD_FAM15H_EVNTSEL0) >> 1;
> if (addr & 1)
> return MSR_TYPE_COUNTER;
> else
> @@ -140,6 +153,16 @@ static inline u32 get_fam15h_addr(u32 addr)
> return addr;
> }
>
> +static void amd_vpmu_init_regs(struct xen_pmu_amd_ctxt *ctxt)
> +{
> + unsigned i;
> + uint64_t *ctrl_regs = vpmu_reg_pointer(ctxt, ctrls);
> +
> + memset(&ctxt->regs[0], 0, 2 * sizeof(uint64_t) * num_counters);
> + for ( i = 0; i < num_counters; i++ )
> + ctrl_regs[i] = ctrl_rsvd[i];
> +}
> +
> static void amd_vpmu_set_msr_bitmap(struct vcpu *v)
> {
> unsigned int i;
> @@ -289,19 +312,24 @@ static int amd_vpmu_do_wrmsr(unsigned int msr, uint64_t msr_content,
> {
> struct vcpu *v = current;
> struct vpmu_struct *vpmu = vcpu_vpmu(v);
> + unsigned int idx = 0;
> + int type = get_pmu_reg_type(msr, &idx);
>
> ASSERT(!supported);
>
> + if ( (type == MSR_TYPE_CTRL ) &&
> + ((msr_content & CTRL_RSVD_MASK) != ctrl_rsvd[idx]) )
> + return -EINVAL;
> +
> /* For all counters, enable guest only mode for HVM guest */
> - if ( has_hvm_container_vcpu(v) &&
> - (get_pmu_reg_type(msr) == MSR_TYPE_CTRL) &&
> + if ( has_hvm_container_vcpu(v) && (type == MSR_TYPE_CTRL) &&
> !is_guest_mode(msr_content) )
> {
> set_guest_mode(msr_content);
> }
>
> /* check if the first counter is enabled */
> - if ( (get_pmu_reg_type(msr) == MSR_TYPE_CTRL) &&
> + if ( (type == MSR_TYPE_CTRL) &&
> is_pmu_enabled(msr_content) && !vpmu_is_set(vpmu, VPMU_RUNNING) )
> {
> if ( !acquire_pmu_ownership(PMU_OWNER_HVM) )
> @@ -313,7 +341,7 @@ static int amd_vpmu_do_wrmsr(unsigned int msr, uint64_t msr_content,
> }
>
> /* stop saving & restore if guest stops first counter */
> - if ( (get_pmu_reg_type(msr) == MSR_TYPE_CTRL) &&
> + if ( (type == MSR_TYPE_CTRL) &&
> (is_pmu_enabled(msr_content) == 0) && vpmu_is_set(vpmu, VPMU_RUNNING) )
> {
> vpmu_reset(vpmu, VPMU_RUNNING);
> @@ -433,7 +461,7 @@ int svm_vpmu_initialise(struct vcpu *v)
> if ( !counters )
> return -EINVAL;
>
> - ctxt = xzalloc_bytes(sizeof(*ctxt) +
> + ctxt = xmalloc_bytes(sizeof(*ctxt) +
> 2 * sizeof(uint64_t) * num_counters);
> if ( !ctxt )
> {
> @@ -445,6 +473,7 @@ int svm_vpmu_initialise(struct vcpu *v)
>
> ctxt->counters = sizeof(*ctxt);
> ctxt->ctrls = ctxt->counters + sizeof(uint64_t) * num_counters;
> + amd_vpmu_init_regs(ctxt);
>
> vpmu->context = ctxt;
> vpmu->priv_context = NULL;
> @@ -457,6 +486,8 @@ int svm_vpmu_initialise(struct vcpu *v)
>
> int __init amd_vpmu_init(void)
> {
> + unsigned int i;
> +
> switch ( current_cpu_data.x86 )
> {
> case 0x15:
> @@ -490,6 +521,12 @@ int __init amd_vpmu_init(void)
> return -ENOSPC;
> }
>
> + for ( i = 0; i < num_counters; i++ )
> + {
> + rdmsrl(ctrls[i], ctrl_rsvd[i]);
> + ctrl_rsvd[i] &= CTRL_RSVD_MASK;
> + }
> +
> return 0;
> }
>
^ permalink raw reply [flat|nested] 37+ messages in thread
* [PATCH v25 12/15] x86/VPMU: Handle PMU interrupts for PV(H) guests
2015-06-19 18:44 [PATCH v25 00/15] x86/PMU: Xen PMU PV(H) support Boris Ostrovsky
` (10 preceding siblings ...)
2015-06-19 18:44 ` [PATCH v25 11/15] VPMU/AMD: Check MSR values before writing to hardware Boris Ostrovsky
@ 2015-06-19 18:44 ` Boris Ostrovsky
2015-06-22 15:20 ` Jan Beulich
2015-07-09 12:19 ` Dietmar Hahn
2015-06-19 18:44 ` [PATCH v25 13/15] x86/VPMU: Merge vpmu_rdmsr and vpmu_wrmsr Boris Ostrovsky
` (4 subsequent siblings)
16 siblings, 2 replies; 37+ messages in thread
From: Boris Ostrovsky @ 2015-06-19 18:44 UTC (permalink / raw)
To: JBeulich, kevin.tian, suravee.suthikulpanit,
Aravind.Gopalakrishnan, dietmar.hahn, dgdegra, andrew.cooper3
Cc: boris.ostrovsky, xen-devel
Add support for handling PMU interrupts for PV(H) guests.
VPMU for the interrupted VCPU is unloaded until the guest issues XENPMU_flush
hypercall. This allows the guest to access PMU MSR values that are stored in
VPMU context which is shared between hypervisor and domain, thus avoiding
traps to hypervisor.
Since the interrupt handler may now force VPMU context save (i.e. set
VPMU_CONTEXT_SAVE flag) we need to make changes to amd_vpmu_save() which
until now expected this flag to be set only when the counters were stopped.
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Acked-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>
Acked-by: Kevin Tian <kevin.tian@intel.com>
---
Changes in v25:
* Replaced 'int num_enabled' with 'bool_t is_running' in amd_vpmu_load()
* Partially restored comment in amd_vpmu_save()
* Replaced sizeof(*ctxt) with offsetof() svm_vpmu_initialise()'s memcpy
* Replaced a couple of returns with 'ret=-E...' in do_xenpmu_op()
xen/arch/x86/hvm/svm/vpmu.c | 92 ++++++++++---
xen/arch/x86/hvm/vmx/vpmu_core2.c | 108 ++++++++++++++-
xen/arch/x86/hvm/vpmu.c | 269 ++++++++++++++++++++++++++++++++++----
xen/include/asm-x86/hvm/vpmu.h | 10 +-
xen/include/public/arch-x86/pmu.h | 41 +++++-
xen/include/public/pmu.h | 2 +
xen/include/xsm/dummy.h | 4 +-
xen/xsm/flask/hooks.c | 2 +
8 files changed, 467 insertions(+), 61 deletions(-)
diff --git a/xen/arch/x86/hvm/svm/vpmu.c b/xen/arch/x86/hvm/svm/vpmu.c
index 934f1b7..17272cb 100644
--- a/xen/arch/x86/hvm/svm/vpmu.c
+++ b/xen/arch/x86/hvm/svm/vpmu.c
@@ -46,6 +46,9 @@ static const u32 __read_mostly *counters;
static const u32 __read_mostly *ctrls;
static bool_t __read_mostly k7_counters_mirrored;
+/* Total size of PMU registers block (copied to/from PV(H) guest) */
+static unsigned int __read_mostly regs_sz;
+
#define F10H_NUM_COUNTERS 4
#define F15H_NUM_COUNTERS 6
#define MAX_NUM_COUNTERS F15H_NUM_COUNTERS
@@ -158,7 +161,7 @@ static void amd_vpmu_init_regs(struct xen_pmu_amd_ctxt *ctxt)
unsigned i;
uint64_t *ctrl_regs = vpmu_reg_pointer(ctxt, ctrls);
- memset(&ctxt->regs[0], 0, 2 * sizeof(uint64_t) * num_counters);
+ memset(&ctxt->regs[0], 0, regs_sz);
for ( i = 0; i < num_counters; i++ )
ctrl_regs[i] = ctrl_rsvd[i];
}
@@ -211,27 +214,65 @@ static inline void context_load(struct vcpu *v)
}
}
-static void amd_vpmu_load(struct vcpu *v)
+static int amd_vpmu_load(struct vcpu *v, bool_t from_guest)
{
struct vpmu_struct *vpmu = vcpu_vpmu(v);
- struct xen_pmu_amd_ctxt *ctxt = vpmu->context;
- uint64_t *ctrl_regs = vpmu_reg_pointer(ctxt, ctrls);
+ struct xen_pmu_amd_ctxt *ctxt;
+ uint64_t *ctrl_regs;
+ unsigned int i;
vpmu_reset(vpmu, VPMU_FROZEN);
- if ( vpmu_is_set(vpmu, VPMU_CONTEXT_LOADED) )
+ if ( !from_guest && vpmu_is_set(vpmu, VPMU_CONTEXT_LOADED) )
{
- unsigned int i;
+ ctxt = vpmu->context;
+ ctrl_regs = vpmu_reg_pointer(ctxt, ctrls);
for ( i = 0; i < num_counters; i++ )
wrmsrl(ctrls[i], ctrl_regs[i]);
- return;
+ return 0;
+ }
+
+ if ( from_guest )
+ {
+ bool_t is_running = 0;
+ struct xen_pmu_amd_ctxt *guest_ctxt = &vpmu->xenpmu_data->pmu.c.amd;
+
+ ASSERT(!is_hvm_vcpu(v));
+
+ ctxt = vpmu->context;
+ ctrl_regs = vpmu_reg_pointer(ctxt, ctrls);
+
+ memcpy(&ctxt->regs[0], &guest_ctxt->regs[0], regs_sz);
+
+ for ( i = 0; i < num_counters; i++ )
+ {
+ if ( (ctrl_regs[i] & CTRL_RSVD_MASK) != ctrl_rsvd[i] )
+ {
+ /*
+ * Not necessary to re-init context since we should never load
+ * it until guest provides valid values. But just to be safe.
+ */
+ amd_vpmu_init_regs(ctxt);
+ return -EINVAL;
+ }
+
+ if ( is_pmu_enabled(ctrl_regs[i]) )
+ is_running = 1;
+ }
+
+ if ( is_running )
+ vpmu_set(vpmu, VPMU_RUNNING);
+ else
+ vpmu_reset(vpmu, VPMU_RUNNING);
}
vpmu_set(vpmu, VPMU_CONTEXT_LOADED);
context_load(v);
+
+ return 0;
}
static inline void context_save(struct vcpu *v)
@@ -246,22 +287,18 @@ static inline void context_save(struct vcpu *v)
rdmsrl(counters[i], counter_regs[i]);
}
-static int amd_vpmu_save(struct vcpu *v)
+static int amd_vpmu_save(struct vcpu *v, bool_t to_guest)
{
struct vpmu_struct *vpmu = vcpu_vpmu(v);
unsigned int i;
- /*
- * Stop the counters. If we came here via vpmu_save_force (i.e.
- * when VPMU_CONTEXT_SAVE is set) counters are already stopped.
- */
+ /* Stop the counters. */
+ for ( i = 0; i < num_counters; i++ )
+ wrmsrl(ctrls[i], 0);
+
if ( !vpmu_is_set(vpmu, VPMU_CONTEXT_SAVE) )
{
vpmu_set(vpmu, VPMU_FROZEN);
-
- for ( i = 0; i < num_counters; i++ )
- wrmsrl(ctrls[i], 0);
-
return 0;
}
@@ -274,6 +311,16 @@ static int amd_vpmu_save(struct vcpu *v)
has_hvm_container_vcpu(v) && is_msr_bitmap_on(vpmu) )
amd_vpmu_unset_msr_bitmap(v);
+ if ( to_guest )
+ {
+ struct xen_pmu_amd_ctxt *guest_ctxt, *ctxt;
+
+ ASSERT(!is_hvm_vcpu(v));
+ ctxt = vpmu->context;
+ guest_ctxt = &vpmu->xenpmu_data->pmu.c.amd;
+ memcpy(&guest_ctxt->regs[0], &ctxt->regs[0], regs_sz);
+ }
+
return 1;
}
@@ -461,8 +508,7 @@ int svm_vpmu_initialise(struct vcpu *v)
if ( !counters )
return -EINVAL;
- ctxt = xmalloc_bytes(sizeof(*ctxt) +
- 2 * sizeof(uint64_t) * num_counters);
+ ctxt = xmalloc_bytes(sizeof(*ctxt) + regs_sz);
if ( !ctxt )
{
printk(XENLOG_G_WARNING "Insufficient memory for PMU, "
@@ -478,6 +524,14 @@ int svm_vpmu_initialise(struct vcpu *v)
vpmu->context = ctxt;
vpmu->priv_context = NULL;
+ if ( !is_hvm_vcpu(v) )
+ {
+ /* Copy register offsets to shared area */
+ ASSERT(vpmu->xenpmu_data);
+ memcpy(&vpmu->xenpmu_data->pmu.c.amd, ctxt,
+ offsetof(struct xen_pmu_amd_ctxt, regs));
+ }
+
vpmu->arch_vpmu_ops = &amd_vpmu_ops;
vpmu_set(vpmu, VPMU_CONTEXT_ALLOCATED);
@@ -527,6 +581,8 @@ int __init amd_vpmu_init(void)
ctrl_rsvd[i] &= CTRL_RSVD_MASK;
}
+ regs_sz = 2 * sizeof(uint64_t) * num_counters;
+
return 0;
}
diff --git a/xen/arch/x86/hvm/vmx/vpmu_core2.c b/xen/arch/x86/hvm/vmx/vpmu_core2.c
index 166277a..1206e90 100644
--- a/xen/arch/x86/hvm/vmx/vpmu_core2.c
+++ b/xen/arch/x86/hvm/vmx/vpmu_core2.c
@@ -90,6 +90,13 @@ static unsigned int __read_mostly arch_pmc_cnt, fixed_pmc_cnt;
static uint64_t __read_mostly fixed_ctrl_mask, fixed_counters_mask;
static uint64_t __read_mostly global_ovf_ctrl_mask;
+/* Total size of PMU registers block (copied to/from PV(H) guest) */
+static unsigned int __read_mostly regs_sz;
+/* Offset into context of the beginning of PMU register block */
+static const unsigned int regs_off =
+ sizeof(((struct xen_pmu_intel_ctxt *)0)->fixed_counters) +
+ sizeof(((struct xen_pmu_intel_ctxt *)0)->arch_counters);
+
/*
* QUIRK to workaround an issue on various family 6 cpus.
* The issue leads to endless PMC interrupt loops on the processor.
@@ -312,7 +319,7 @@ static inline void __core2_vpmu_save(struct vcpu *v)
rdmsrl(MSR_CORE_PERF_GLOBAL_STATUS, core2_vpmu_cxt->global_status);
}
-static int core2_vpmu_save(struct vcpu *v)
+static int core2_vpmu_save(struct vcpu *v, bool_t to_guest)
{
struct vpmu_struct *vpmu = vcpu_vpmu(v);
@@ -329,6 +336,13 @@ static int core2_vpmu_save(struct vcpu *v)
has_hvm_container_vcpu(v) && cpu_has_vmx_msr_bitmap )
core2_vpmu_unset_msr_bitmap(v->arch.hvm_vmx.msr_bitmap);
+ if ( to_guest )
+ {
+ ASSERT(!is_hvm_vcpu(v));
+ memcpy((void *)(&vpmu->xenpmu_data->pmu.c.intel) + regs_off,
+ vpmu->context + regs_off, regs_sz);
+ }
+
return 1;
}
@@ -365,16 +379,93 @@ static inline void __core2_vpmu_load(struct vcpu *v)
}
}
-static void core2_vpmu_load(struct vcpu *v)
+static int core2_vpmu_verify(struct vcpu *v)
+{
+ unsigned int i;
+ struct vpmu_struct *vpmu = vcpu_vpmu(v);
+ struct xen_pmu_intel_ctxt *core2_vpmu_cxt = vcpu_vpmu(v)->context;
+ uint64_t *fixed_counters = vpmu_reg_pointer(core2_vpmu_cxt, fixed_counters);
+ struct xen_pmu_cntr_pair *xen_pmu_cntr_pair =
+ vpmu_reg_pointer(core2_vpmu_cxt, arch_counters);
+ uint64_t fixed_ctrl;
+ uint64_t *priv_context = vpmu->priv_context;
+ uint64_t enabled_cntrs = 0;
+
+ if ( core2_vpmu_cxt->global_ovf_ctrl & global_ovf_ctrl_mask )
+ return -EINVAL;
+
+ fixed_ctrl = core2_vpmu_cxt->fixed_ctrl;
+ if ( fixed_ctrl & fixed_ctrl_mask )
+ return -EINVAL;
+
+ for ( i = 0; i < fixed_pmc_cnt; i++ )
+ {
+ if ( fixed_counters[i] & fixed_counters_mask )
+ return -EINVAL;
+ if ( (fixed_ctrl >> (i * FIXED_CTR_CTRL_BITS)) & 3 )
+ enabled_cntrs |= (1ULL << i);
+ }
+ enabled_cntrs <<= 32;
+
+ for ( i = 0; i < arch_pmc_cnt; i++ )
+ {
+ uint64_t control = xen_pmu_cntr_pair[i].control;
+
+ if ( control & ARCH_CTRL_MASK )
+ return -EINVAL;
+ if ( control & ARCH_CNTR_ENABLED )
+ enabled_cntrs |= (1ULL << i);
+ }
+
+ if ( vpmu_is_set(vcpu_vpmu(v), VPMU_CPU_HAS_DS) &&
+ !is_canonical_address(core2_vpmu_cxt->ds_area) )
+ return -EINVAL;
+
+ if ( (core2_vpmu_cxt->global_ctrl & enabled_cntrs) ||
+ (core2_vpmu_cxt->ds_area != 0) )
+ vpmu_set(vpmu, VPMU_RUNNING);
+ else
+ vpmu_reset(vpmu, VPMU_RUNNING);
+
+ *priv_context = enabled_cntrs;
+
+ return 0;
+}
+
+static int core2_vpmu_load(struct vcpu *v, bool_t from_guest)
{
struct vpmu_struct *vpmu = vcpu_vpmu(v);
if ( vpmu_is_set(vpmu, VPMU_CONTEXT_LOADED) )
- return;
+ return 0;
+
+ if ( from_guest )
+ {
+ int ret;
+
+ ASSERT(!is_hvm_vcpu(v));
+
+ memcpy(vpmu->context + regs_off,
+ (void *)&v->arch.vpmu.xenpmu_data->pmu.c.intel + regs_off,
+ regs_sz);
+
+ ret = core2_vpmu_verify(v);
+ if ( ret )
+ {
+ /*
+ * Not necessary since we should never load the context until
+ * guest provides valid values. But just to be safe.
+ */
+ memset(vpmu->context + regs_off, 0, regs_sz);
+ return ret;
+ }
+ }
vpmu_set(vpmu, VPMU_CONTEXT_LOADED);
__core2_vpmu_load(v);
+
+ return 0;
}
static int core2_vpmu_alloc_resource(struct vcpu *v)
@@ -412,6 +503,13 @@ static int core2_vpmu_alloc_resource(struct vcpu *v)
vpmu->context = core2_vpmu_cxt;
vpmu->priv_context = p;
+ if ( !is_hvm_vcpu(v) )
+ {
+ /* Copy fixed/arch register offsets to shared area */
+ ASSERT(vpmu->xenpmu_data);
+ memcpy(&vpmu->xenpmu_data->pmu.c.intel, core2_vpmu_cxt, regs_off);
+ }
+
vpmu_set(vpmu, VPMU_CONTEXT_ALLOCATED);
return 1;
@@ -923,6 +1021,10 @@ int __init core2_vpmu_init(void)
(((1ULL << fixed_pmc_cnt) - 1) << 32) |
((1ULL << arch_pmc_cnt) - 1));
+ regs_sz = (sizeof(struct xen_pmu_intel_ctxt) - regs_off) +
+ sizeof(uint64_t) * fixed_pmc_cnt +
+ sizeof(struct xen_pmu_cntr_pair) * arch_pmc_cnt;
+
check_pmc_quirk();
if ( sizeof(struct xen_pmu_data) + sizeof(uint64_t) * fixed_pmc_cnt +
diff --git a/xen/arch/x86/hvm/vpmu.c b/xen/arch/x86/hvm/vpmu.c
index 6d3f0ab..c0a22fb 100644
--- a/xen/arch/x86/hvm/vpmu.c
+++ b/xen/arch/x86/hvm/vpmu.c
@@ -85,31 +85,56 @@ static void __init parse_vpmu_param(char *s)
void vpmu_lvtpc_update(uint32_t val)
{
struct vpmu_struct *vpmu;
+ struct vcpu *curr = current;
- if ( vpmu_mode == XENPMU_MODE_OFF )
+ if ( likely(vpmu_mode == XENPMU_MODE_OFF) )
return;
- vpmu = vcpu_vpmu(current);
+ vpmu = vcpu_vpmu(curr);
vpmu->hw_lapic_lvtpc = PMU_APIC_VECTOR | (val & APIC_LVT_MASKED);
- apic_write(APIC_LVTPC, vpmu->hw_lapic_lvtpc);
+
+ /* Postpone APIC updates for PV(H) guests if PMU interrupt is pending */
+ if ( is_hvm_vcpu(curr) || !vpmu->xenpmu_data ||
+ !vpmu_is_set(vpmu, VPMU_CACHED) )
+ apic_write(APIC_LVTPC, vpmu->hw_lapic_lvtpc);
}
int vpmu_do_wrmsr(unsigned int msr, uint64_t msr_content, uint64_t supported)
{
- struct vpmu_struct *vpmu = vcpu_vpmu(current);
+ struct vcpu *curr = current;
+ struct vpmu_struct *vpmu;
if ( vpmu_mode == XENPMU_MODE_OFF )
return 0;
+ vpmu = vcpu_vpmu(curr);
if ( vpmu->arch_vpmu_ops && vpmu->arch_vpmu_ops->do_wrmsr )
- return vpmu->arch_vpmu_ops->do_wrmsr(msr, msr_content, supported);
+ {
+ int ret = vpmu->arch_vpmu_ops->do_wrmsr(msr, msr_content, supported);
+
+ /*
+ * We may have received a PMU interrupt during WRMSR handling
+ * and since do_wrmsr may load VPMU context we should save
+ * (and unload) it again.
+ */
+ if ( !is_hvm_vcpu(curr) && vpmu->xenpmu_data &&
+ vpmu_is_set(vpmu, VPMU_CACHED) )
+ {
+ vpmu_set(vpmu, VPMU_CONTEXT_SAVE);
+ vpmu->arch_vpmu_ops->arch_vpmu_save(curr, 0);
+ vpmu_reset(vpmu, VPMU_CONTEXT_SAVE | VPMU_CONTEXT_LOADED);
+ }
+ return ret;
+ }
+
return 0;
}
int vpmu_do_rdmsr(unsigned int msr, uint64_t *msr_content)
{
- struct vpmu_struct *vpmu = vcpu_vpmu(current);
+ struct vcpu *curr = current;
+ struct vpmu_struct *vpmu;
if ( vpmu_mode == XENPMU_MODE_OFF )
{
@@ -117,39 +142,184 @@ int vpmu_do_rdmsr(unsigned int msr, uint64_t *msr_content)
return 0;
}
+ vpmu = vcpu_vpmu(curr);
if ( vpmu->arch_vpmu_ops && vpmu->arch_vpmu_ops->do_rdmsr )
- return vpmu->arch_vpmu_ops->do_rdmsr(msr, msr_content);
+ {
+ int ret = vpmu->arch_vpmu_ops->do_rdmsr(msr, msr_content);
+
+ if ( !is_hvm_vcpu(curr) && vpmu->xenpmu_data &&
+ vpmu_is_set(vpmu, VPMU_CACHED) )
+ {
+ vpmu_set(vpmu, VPMU_CONTEXT_SAVE);
+ vpmu->arch_vpmu_ops->arch_vpmu_save(curr, 0);
+ vpmu_reset(vpmu, VPMU_CONTEXT_SAVE | VPMU_CONTEXT_LOADED);
+ }
+ return ret;
+ }
else
*msr_content = 0;
return 0;
}
+static inline struct vcpu *choose_hwdom_vcpu(void)
+{
+ unsigned idx;
+
+ if ( hardware_domain->max_vcpus == 0 )
+ return NULL;
+
+ idx = smp_processor_id() % hardware_domain->max_vcpus;
+
+ return hardware_domain->vcpu[idx];
+}
+
void vpmu_do_interrupt(struct cpu_user_regs *regs)
{
- struct vcpu *v = current;
- struct vpmu_struct *vpmu = vcpu_vpmu(v);
+ struct vcpu *sampled = current, *sampling;
+ struct vpmu_struct *vpmu;
+ struct vlapic *vlapic;
+ u32 vlapic_lvtpc;
- if ( vpmu->arch_vpmu_ops )
+ /* dom0 will handle interrupt for special domains (e.g. idle domain) */
+ if ( sampled->domain->domain_id >= DOMID_FIRST_RESERVED )
+ {
+ sampling = choose_hwdom_vcpu();
+ if ( !sampling )
+ return;
+ }
+ else
+ sampling = sampled;
+
+ vpmu = vcpu_vpmu(sampling);
+ if ( !vpmu->arch_vpmu_ops )
+ return;
+
+ /* PV(H) guest */
+ if ( !is_hvm_vcpu(sampling) )
{
- struct vlapic *vlapic = vcpu_vlapic(v);
- u32 vlapic_lvtpc;
+ const struct cpu_user_regs *cur_regs;
+ uint64_t *flags = &vpmu->xenpmu_data->pmu.pmu_flags;
+ domid_t domid = DOMID_SELF;
+
+ if ( !vpmu->xenpmu_data )
+ return;
+
+ if ( is_pvh_vcpu(sampling) &&
+ !vpmu->arch_vpmu_ops->do_interrupt(regs) )
+ return;
- if ( !vpmu->arch_vpmu_ops->do_interrupt(regs) ||
- !is_vlapic_lvtpc_enabled(vlapic) )
+ if ( vpmu_is_set(vpmu, VPMU_CACHED) )
return;
- vlapic_lvtpc = vlapic_get_reg(vlapic, APIC_LVTPC);
+ /* PV guest will be reading PMU MSRs from xenpmu_data */
+ vpmu_set(vpmu, VPMU_CONTEXT_SAVE | VPMU_CONTEXT_LOADED);
+ vpmu->arch_vpmu_ops->arch_vpmu_save(sampling, 1);
+ vpmu_reset(vpmu, VPMU_CONTEXT_SAVE | VPMU_CONTEXT_LOADED);
- switch ( GET_APIC_DELIVERY_MODE(vlapic_lvtpc) )
+ if ( has_hvm_container_vcpu(sampled) )
+ *flags = 0;
+ else
+ *flags = PMU_SAMPLE_PV;
+
+ /* Store appropriate registers in xenpmu_data */
+ /* FIXME: 32-bit PVH should go here as well */
+ if ( is_pv_32bit_vcpu(sampling) )
{
- case APIC_MODE_FIXED:
- vlapic_set_irq(vlapic, vlapic_lvtpc & APIC_VECTOR_MASK, 0);
- break;
- case APIC_MODE_NMI:
- v->nmi_pending = 1;
- break;
+ /*
+ * 32-bit dom0 cannot process Xen's addresses (which are 64 bit)
+ * and therefore we treat it the same way as a non-privileged
+ * PV 32-bit domain.
+ */
+ struct compat_pmu_regs *cmp;
+
+ cur_regs = guest_cpu_user_regs();
+
+ cmp = (void *)&vpmu->xenpmu_data->pmu.r.regs;
+ cmp->ip = cur_regs->rip;
+ cmp->sp = cur_regs->rsp;
+ cmp->flags = cur_regs->eflags;
+ cmp->ss = cur_regs->ss;
+ cmp->cs = cur_regs->cs;
+ if ( (cmp->cs & 3) > 1 )
+ *flags |= PMU_SAMPLE_USER;
+ }
+ else
+ {
+ struct xen_pmu_regs *r = &vpmu->xenpmu_data->pmu.r.regs;
+
+ if ( (vpmu_mode & XENPMU_MODE_SELF) )
+ cur_regs = guest_cpu_user_regs();
+ else if ( !guest_mode(regs) && is_hardware_domain(sampling->domain) )
+ {
+ cur_regs = regs;
+ domid = DOMID_XEN;
+ }
+ else
+ cur_regs = guest_cpu_user_regs();
+
+ r->ip = cur_regs->rip;
+ r->sp = cur_regs->rsp;
+ r->flags = cur_regs->eflags;
+
+ if ( !has_hvm_container_vcpu(sampled) )
+ {
+ r->ss = cur_regs->ss;
+ r->cs = cur_regs->cs;
+ if ( !(sampled->arch.flags & TF_kernel_mode) )
+ *flags |= PMU_SAMPLE_USER;
+ }
+ else
+ {
+ struct segment_register seg;
+
+ hvm_get_segment_register(sampled, x86_seg_cs, &seg);
+ r->cs = seg.sel;
+ hvm_get_segment_register(sampled, x86_seg_ss, &seg);
+ r->ss = seg.sel;
+ r->cpl = seg.attr.fields.dpl;
+ if ( !(sampled->arch.hvm_vcpu.guest_cr[0] & X86_CR0_PE) )
+ *flags |= PMU_SAMPLE_REAL;
+ }
}
+
+ vpmu->xenpmu_data->domain_id = domid;
+ vpmu->xenpmu_data->vcpu_id = sampled->vcpu_id;
+ if ( is_hardware_domain(sampling->domain) )
+ vpmu->xenpmu_data->pcpu_id = smp_processor_id();
+ else
+ vpmu->xenpmu_data->pcpu_id = sampled->vcpu_id;
+
+ vpmu->hw_lapic_lvtpc |= APIC_LVT_MASKED;
+ apic_write(APIC_LVTPC, vpmu->hw_lapic_lvtpc);
+ *flags |= PMU_CACHED;
+ vpmu_set(vpmu, VPMU_CACHED);
+
+ send_guest_vcpu_virq(sampling, VIRQ_XENPMU);
+
+ return;
+ }
+
+ /* HVM guests */
+ vlapic = vcpu_vlapic(sampling);
+
+ /* We don't support (yet) HVM dom0 */
+ ASSERT(sampling == sampled);
+
+ if ( !vpmu->arch_vpmu_ops->do_interrupt(regs) ||
+ !is_vlapic_lvtpc_enabled(vlapic) )
+ return;
+
+ vlapic_lvtpc = vlapic_get_reg(vlapic, APIC_LVTPC);
+
+ switch ( GET_APIC_DELIVERY_MODE(vlapic_lvtpc) )
+ {
+ case APIC_MODE_FIXED:
+ vlapic_set_irq(vlapic, vlapic_lvtpc & APIC_VECTOR_MASK, 0);
+ break;
+ case APIC_MODE_NMI:
+ sampling->nmi_pending = 1;
+ break;
}
}
@@ -174,7 +344,7 @@ static void vpmu_save_force(void *arg)
vpmu_set(vpmu, VPMU_CONTEXT_SAVE);
if ( vpmu->arch_vpmu_ops )
- (void)vpmu->arch_vpmu_ops->arch_vpmu_save(v);
+ (void)vpmu->arch_vpmu_ops->arch_vpmu_save(v, 0);
vpmu_reset(vpmu, VPMU_CONTEXT_SAVE);
@@ -193,20 +363,20 @@ void vpmu_save(struct vcpu *v)
per_cpu(last_vcpu, pcpu) = v;
if ( vpmu->arch_vpmu_ops )
- if ( vpmu->arch_vpmu_ops->arch_vpmu_save(v) )
+ if ( vpmu->arch_vpmu_ops->arch_vpmu_save(v, 0) )
vpmu_reset(vpmu, VPMU_CONTEXT_LOADED);
apic_write(APIC_LVTPC, PMU_APIC_VECTOR | APIC_LVT_MASKED);
}
-void vpmu_load(struct vcpu *v)
+int vpmu_load(struct vcpu *v, bool_t from_guest)
{
struct vpmu_struct *vpmu = vcpu_vpmu(v);
int pcpu = smp_processor_id();
struct vcpu *prev = NULL;
if ( !vpmu_is_set(vpmu, VPMU_CONTEXT_ALLOCATED) )
- return;
+ return 0;
/* First time this VCPU is running here */
if ( vpmu->last_pcpu != pcpu )
@@ -245,15 +415,26 @@ void vpmu_load(struct vcpu *v)
local_irq_enable();
/* Only when PMU is counting, we load PMU context immediately. */
- if ( !vpmu_is_set(vpmu, VPMU_RUNNING) )
- return;
+ if ( !vpmu_is_set(vpmu, VPMU_RUNNING) ||
+ (!is_hvm_vcpu(vpmu_vcpu(vpmu)) && vpmu_is_set(vpmu, VPMU_CACHED)) )
+ return 0;
if ( vpmu->arch_vpmu_ops && vpmu->arch_vpmu_ops->arch_vpmu_load )
{
+ int ret;
+
apic_write_around(APIC_LVTPC, vpmu->hw_lapic_lvtpc);
/* Arch code needs to set VPMU_CONTEXT_LOADED */
- vpmu->arch_vpmu_ops->arch_vpmu_load(v);
+ ret = vpmu->arch_vpmu_ops->arch_vpmu_load(v, from_guest);
+ if ( ret )
+ {
+ apic_write_around(APIC_LVTPC,
+ vpmu->hw_lapic_lvtpc | APIC_LVT_MASKED);
+ return ret;
+ }
}
+
+ return 0;
}
void vpmu_initialise(struct vcpu *v)
@@ -265,6 +446,8 @@ void vpmu_initialise(struct vcpu *v)
BUILD_BUG_ON(sizeof(struct xen_pmu_intel_ctxt) > XENPMU_CTXT_PAD_SZ);
BUILD_BUG_ON(sizeof(struct xen_pmu_amd_ctxt) > XENPMU_CTXT_PAD_SZ);
+ BUILD_BUG_ON(sizeof(struct xen_pmu_regs) > XENPMU_REGS_PAD_SZ);
+ BUILD_BUG_ON(sizeof(struct compat_pmu_regs) > XENPMU_REGS_PAD_SZ);
ASSERT(!vpmu->flags && !vpmu->context);
@@ -449,7 +632,10 @@ void vpmu_dump(struct vcpu *v)
long do_xenpmu_op(unsigned int op, XEN_GUEST_HANDLE_PARAM(xen_pmu_params_t) arg)
{
int ret;
+ struct vcpu *curr;
struct xen_pmu_params pmu_params = {.val = 0};
+ struct xen_pmu_data *xenpmu_data;
+ struct vpmu_struct *vpmu;
if ( !opt_vpmu_enabled )
return -EOPNOTSUPP;
@@ -552,6 +738,31 @@ long do_xenpmu_op(unsigned int op, XEN_GUEST_HANDLE_PARAM(xen_pmu_params_t) arg)
pvpmu_finish(current->domain, &pmu_params);
break;
+ case XENPMU_lvtpc_set:
+ xenpmu_data = current->arch.vpmu.xenpmu_data;
+ if ( xenpmu_data != NULL )
+ vpmu_lvtpc_update(xenpmu_data->pmu.l.lapic_lvtpc);
+ else
+ ret = -EINVAL;
+ break;
+
+ case XENPMU_flush:
+ curr = current;
+ vpmu = vcpu_vpmu(curr);
+ xenpmu_data = curr->arch.vpmu.xenpmu_data;
+ if ( xenpmu_data == NULL )
+ return -EINVAL;
+ xenpmu_data->pmu.pmu_flags &= ~PMU_CACHED;
+ vpmu_reset(vpmu, VPMU_CACHED);
+ vpmu_lvtpc_update(xenpmu_data->pmu.l.lapic_lvtpc);
+ if ( vpmu_load(curr, 1) )
+ {
+ xenpmu_data->pmu.pmu_flags |= PMU_CACHED;
+ vpmu_set(vpmu, VPMU_CACHED);
+ ret = -EIO;
+ }
+ break ;
+
default:
ret = -EINVAL;
}
diff --git a/xen/include/asm-x86/hvm/vpmu.h b/xen/include/asm-x86/hvm/vpmu.h
index 642a4b7..f486d2f 100644
--- a/xen/include/asm-x86/hvm/vpmu.h
+++ b/xen/include/asm-x86/hvm/vpmu.h
@@ -47,8 +47,8 @@ struct arch_vpmu_ops {
unsigned int *eax, unsigned int *ebx,
unsigned int *ecx, unsigned int *edx);
void (*arch_vpmu_destroy)(struct vcpu *v);
- int (*arch_vpmu_save)(struct vcpu *v);
- void (*arch_vpmu_load)(struct vcpu *v);
+ int (*arch_vpmu_save)(struct vcpu *v, bool_t to_guest);
+ int (*arch_vpmu_load)(struct vcpu *v, bool_t from_guest);
void (*arch_vpmu_dump)(const struct vcpu *);
};
@@ -75,6 +75,8 @@ struct vpmu_struct {
#define VPMU_CONTEXT_SAVE 0x8 /* Force context save */
#define VPMU_FROZEN 0x10 /* Stop counters while VCPU is not running */
#define VPMU_PASSIVE_DOMAIN_ALLOCATED 0x20
+/* PV(H) guests: VPMU registers are accessed by guest from shared page */
+#define VPMU_CACHED 0x40
static inline void vpmu_set(struct vpmu_struct *vpmu, const u32 mask)
{
@@ -107,7 +109,7 @@ void vpmu_do_cpuid(unsigned int input, unsigned int *eax, unsigned int *ebx,
void vpmu_initialise(struct vcpu *v);
void vpmu_destroy(struct vcpu *v);
void vpmu_save(struct vcpu *v);
-void vpmu_load(struct vcpu *v);
+int vpmu_load(struct vcpu *v, bool_t from_guest);
void vpmu_dump(struct vcpu *v);
extern int acquire_pmu_ownership(int pmu_ownership);
@@ -126,7 +128,7 @@ static inline void vpmu_switch_from(struct vcpu *prev)
static inline void vpmu_switch_to(struct vcpu *next)
{
if ( vpmu_mode & (XENPMU_MODE_SELF | XENPMU_MODE_HV) )
- vpmu_load(next);
+ vpmu_load(next, 0);
}
#endif /* __ASM_X86_HVM_VPMU_H_*/
diff --git a/xen/include/public/arch-x86/pmu.h b/xen/include/public/arch-x86/pmu.h
index 4351115..1a53888 100644
--- a/xen/include/public/arch-x86/pmu.h
+++ b/xen/include/public/arch-x86/pmu.h
@@ -5,7 +5,10 @@
/* AMD PMU registers and structures */
struct xen_pmu_amd_ctxt {
- /* Offsets to counter and control MSRs (relative to xen_pmu_arch.c.amd) */
+ /*
+ * Offsets to counter and control MSRs (relative to xen_pmu_arch.c.amd).
+ * For PV(H) guests these fields are RO.
+ */
uint32_t counters;
uint32_t ctrls;
@@ -30,7 +33,8 @@ DEFINE_XEN_GUEST_HANDLE(xen_pmu_cntr_pair_t);
struct xen_pmu_intel_ctxt {
/*
* Offsets to fixed and architectural counter MSRs (relative to
- * xen_pmu_arch.c.intel)
+ * xen_pmu_arch.c.intel).
+ * For PV(H) guests these fields are RO.
*/
uint32_t fixed_counters;
uint32_t arch_counters;
@@ -69,6 +73,9 @@ DEFINE_XEN_GUEST_HANDLE(xen_pmu_regs_t);
/* PMU flags */
#define PMU_CACHED (1<<0) /* PMU MSRs are cached in the context */
+#define PMU_SAMPLE_USER (1<<1) /* Sample is from user or kernel mode */
+#define PMU_SAMPLE_REAL (1<<2) /* Sample is from realmode */
+#define PMU_SAMPLE_PV (1<<3) /* Sample from a PV guest */
/*
* Architecture-specific information describing state of the processor at
@@ -93,12 +100,34 @@ struct xen_pmu_arch {
/* WO for hypervisor, RO for guest */
uint64_t pmu_flags;
- /* Placeholder for APIC LVTPC register */
- uint64_t lvtpc_pad;
+ /*
+ * APIC LVTPC register.
+ * RW for both hypervisor and guest.
+ * Only APIC_LVT_MASKED bit is loaded by the hypervisor into hardware
+ * during XENPMU_flush or XENPMU_lvtpc_set.
+ */
+ union {
+ uint32_t lapic_lvtpc;
+ uint64_t pad;
+ } l;
+
+ /*
+ * Vendor-specific PMU registers.
+ * RW for both hypervisor and guest (see exceptions above).
+ * Guest's updates to this field are verified and then loaded by the
+ * hypervisor into hardware during XENPMU_flush
+ */
+ union {
+ struct xen_pmu_amd_ctxt amd;
+ struct xen_pmu_intel_ctxt intel;
- /* Placeholder for vendor-specific PMU registers */
+ /*
+ * Padding for contexts (fixed parts only, does not include MSR banks
+ * that are specified by offsets)
+ */
#define XENPMU_CTXT_PAD_SZ 128
- uint64_t pmu_regs_pad[XENPMU_CTXT_PAD_SZ / 8];
+ uint8_t pad[XENPMU_CTXT_PAD_SZ];
+ } c;
};
typedef struct xen_pmu_arch xen_pmu_arch_t;
DEFINE_XEN_GUEST_HANDLE(xen_pmu_arch_t);
diff --git a/xen/include/public/pmu.h b/xen/include/public/pmu.h
index e6307b5..7a45783 100644
--- a/xen/include/public/pmu.h
+++ b/xen/include/public/pmu.h
@@ -27,6 +27,8 @@
#define XENPMU_feature_set 3
#define XENPMU_init 4
#define XENPMU_finish 5
+#define XENPMU_lvtpc_set 6
+#define XENPMU_flush 7 /* Write cached MSR values to HW */
/* ` } */
/* Parameters structure for HYPERVISOR_xenpmu_op call */
diff --git a/xen/include/xsm/dummy.h b/xen/include/xsm/dummy.h
index 6456f72..37e6aa3 100644
--- a/xen/include/xsm/dummy.h
+++ b/xen/include/xsm/dummy.h
@@ -705,7 +705,9 @@ static XSM_INLINE int xsm_pmu_op (XSM_DEFAULT_ARG struct domain *d, int op)
case XENPMU_feature_get:
return xsm_default_action(XSM_PRIV, d, current->domain);
case XENPMU_init:
- case XENPMU_finish:
+ case XENPMU_finish:
+ case XENPMU_lvtpc_set:
+ case XENPMU_flush:
return xsm_default_action(XSM_HOOK, d, current->domain);
default:
return -EPERM;
diff --git a/xen/xsm/flask/hooks.c b/xen/xsm/flask/hooks.c
index aefcbda..4aa3e79 100644
--- a/xen/xsm/flask/hooks.c
+++ b/xen/xsm/flask/hooks.c
@@ -1594,6 +1594,8 @@ static int flask_pmu_op (struct domain *d, unsigned int op)
XEN2__PMU_CTRL, NULL);
case XENPMU_init:
case XENPMU_finish:
+ case XENPMU_lvtpc_set:
+ case XENPMU_flush:
return avc_has_perm(dsid, SECINITSID_XEN, SECCLASS_XEN2,
XEN2__PMU_USE, NULL);
default:
--
1.8.1.4
^ permalink raw reply related [flat|nested] 37+ messages in thread
* Re: [PATCH v25 12/15] x86/VPMU: Handle PMU interrupts for PV(H) guests
2015-06-19 18:44 ` [PATCH v25 12/15] x86/VPMU: Handle PMU interrupts for PV(H) guests Boris Ostrovsky
@ 2015-06-22 15:20 ` Jan Beulich
2015-07-09 12:19 ` Dietmar Hahn
1 sibling, 0 replies; 37+ messages in thread
From: Jan Beulich @ 2015-06-22 15:20 UTC (permalink / raw)
To: Boris Ostrovsky
Cc: kevin.tian, andrew.cooper3, dietmar.hahn, xen-devel,
Aravind.Gopalakrishnan, suravee.suthikulpanit, dgdegra
>>> On 19.06.15 at 20:44, <boris.ostrovsky@oracle.com> wrote:
> Add support for handling PMU interrupts for PV(H) guests.
>
> VPMU for the interrupted VCPU is unloaded until the guest issues the
> XENPMU_flush hypercall. This allows the guest to access PMU MSR values that
> are stored in
> VPMU context which is shared between hypervisor and domain, thus avoiding
> traps to hypervisor.
>
> Since the interrupt handler may now force VPMU context save (i.e. set
> VPMU_CONTEXT_SAVE flag) we need to make changes to amd_vpmu_save() which
> until now expected this flag to be set only when the counters were stopped.
>
> Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
> Acked-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>
> Acked-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH v25 12/15] x86/VPMU: Handle PMU interrupts for PV(H) guests
2015-06-19 18:44 ` [PATCH v25 12/15] x86/VPMU: Handle PMU interrupts for PV(H) guests Boris Ostrovsky
2015-06-22 15:20 ` Jan Beulich
@ 2015-07-09 12:19 ` Dietmar Hahn
1 sibling, 0 replies; 37+ messages in thread
From: Dietmar Hahn @ 2015-07-09 12:19 UTC (permalink / raw)
To: Boris Ostrovsky
Cc: kevin.tian, JBeulich, andrew.cooper3, xen-devel,
Aravind.Gopalakrishnan, suravee.suthikulpanit, dgdegra
Am Freitag 19 Juni 2015, 14:44:43 schrieb Boris Ostrovsky:
> Add support for handling PMU interrupts for PV(H) guests.
>
> VPMU for the interrupted VCPU is unloaded until the guest issues the XENPMU_flush
> hypercall. This allows the guest to access PMU MSR values that are stored in
> VPMU context which is shared between hypervisor and domain, thus avoiding
> traps to hypervisor.
>
> Since the interrupt handler may now force VPMU context save (i.e. set
> VPMU_CONTEXT_SAVE flag) we need to make changes to amd_vpmu_save() which
> until now expected this flag to be set only when the counters were stopped.
>
> Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
> Acked-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>
> Acked-by: Kevin Tian <kevin.tian@intel.com>
> ---
> Changes in v25:
> * Replaced 'int num_enabled' with 'bool_t is_running' in amd_vpmu_load()
> * Partially restored comment in amd_vpmu_save()
> * Replaced sizeof(*ctxt) with offsetof() svm_vpmu_initialise()'s memcpy
> * Replaced a couple of returns with 'ret=-E...' in do_xenpmu_op()
Reviewed-by: Dietmar Hahn <dietmar.hahn@ts.fujitsu.com>
>
>
> xen/arch/x86/hvm/svm/vpmu.c | 92 ++++++++++---
> xen/arch/x86/hvm/vmx/vpmu_core2.c | 108 ++++++++++++++-
> xen/arch/x86/hvm/vpmu.c | 269 ++++++++++++++++++++++++++++++++++----
> xen/include/asm-x86/hvm/vpmu.h | 10 +-
> xen/include/public/arch-x86/pmu.h | 41 +++++-
> xen/include/public/pmu.h | 2 +
> xen/include/xsm/dummy.h | 4 +-
> xen/xsm/flask/hooks.c | 2 +
> 8 files changed, 467 insertions(+), 61 deletions(-)
>
> diff --git a/xen/arch/x86/hvm/svm/vpmu.c b/xen/arch/x86/hvm/svm/vpmu.c
> index 934f1b7..17272cb 100644
> --- a/xen/arch/x86/hvm/svm/vpmu.c
> +++ b/xen/arch/x86/hvm/svm/vpmu.c
> @@ -46,6 +46,9 @@ static const u32 __read_mostly *counters;
> static const u32 __read_mostly *ctrls;
> static bool_t __read_mostly k7_counters_mirrored;
>
> +/* Total size of PMU registers block (copied to/from PV(H) guest) */
> +static unsigned int __read_mostly regs_sz;
> +
> #define F10H_NUM_COUNTERS 4
> #define F15H_NUM_COUNTERS 6
> #define MAX_NUM_COUNTERS F15H_NUM_COUNTERS
> @@ -158,7 +161,7 @@ static void amd_vpmu_init_regs(struct xen_pmu_amd_ctxt *ctxt)
> unsigned i;
> uint64_t *ctrl_regs = vpmu_reg_pointer(ctxt, ctrls);
>
> - memset(&ctxt->regs[0], 0, 2 * sizeof(uint64_t) * num_counters);
> + memset(&ctxt->regs[0], 0, regs_sz);
> for ( i = 0; i < num_counters; i++ )
> ctrl_regs[i] = ctrl_rsvd[i];
> }
> @@ -211,27 +214,65 @@ static inline void context_load(struct vcpu *v)
> }
> }
>
> -static void amd_vpmu_load(struct vcpu *v)
> +static int amd_vpmu_load(struct vcpu *v, bool_t from_guest)
> {
> struct vpmu_struct *vpmu = vcpu_vpmu(v);
> - struct xen_pmu_amd_ctxt *ctxt = vpmu->context;
> - uint64_t *ctrl_regs = vpmu_reg_pointer(ctxt, ctrls);
> + struct xen_pmu_amd_ctxt *ctxt;
> + uint64_t *ctrl_regs;
> + unsigned int i;
>
> vpmu_reset(vpmu, VPMU_FROZEN);
>
> - if ( vpmu_is_set(vpmu, VPMU_CONTEXT_LOADED) )
> + if ( !from_guest && vpmu_is_set(vpmu, VPMU_CONTEXT_LOADED) )
> {
> - unsigned int i;
> + ctxt = vpmu->context;
> + ctrl_regs = vpmu_reg_pointer(ctxt, ctrls);
>
> for ( i = 0; i < num_counters; i++ )
> wrmsrl(ctrls[i], ctrl_regs[i]);
>
> - return;
> + return 0;
> + }
> +
> + if ( from_guest )
> + {
> + bool_t is_running = 0;
> + struct xen_pmu_amd_ctxt *guest_ctxt = &vpmu->xenpmu_data->pmu.c.amd;
> +
> + ASSERT(!is_hvm_vcpu(v));
> +
> + ctxt = vpmu->context;
> + ctrl_regs = vpmu_reg_pointer(ctxt, ctrls);
> +
> + memcpy(&ctxt->regs[0], &guest_ctxt->regs[0], regs_sz);
> +
> + for ( i = 0; i < num_counters; i++ )
> + {
> + if ( (ctrl_regs[i] & CTRL_RSVD_MASK) != ctrl_rsvd[i] )
> + {
> + /*
> + * Not necessary to re-init context since we should never load
> + * it until guest provides valid values. But just to be safe.
> + */
> + amd_vpmu_init_regs(ctxt);
> + return -EINVAL;
> + }
> +
> + if ( is_pmu_enabled(ctrl_regs[i]) )
> + is_running = 1;
> + }
> +
> + if ( is_running )
> + vpmu_set(vpmu, VPMU_RUNNING);
> + else
> + vpmu_reset(vpmu, VPMU_RUNNING);
> }
>
> vpmu_set(vpmu, VPMU_CONTEXT_LOADED);
>
> context_load(v);
> +
> + return 0;
> }
>
> static inline void context_save(struct vcpu *v)
> @@ -246,22 +287,18 @@ static inline void context_save(struct vcpu *v)
> rdmsrl(counters[i], counter_regs[i]);
> }
>
> -static int amd_vpmu_save(struct vcpu *v)
> +static int amd_vpmu_save(struct vcpu *v, bool_t to_guest)
> {
> struct vpmu_struct *vpmu = vcpu_vpmu(v);
> unsigned int i;
>
> - /*
> - * Stop the counters. If we came here via vpmu_save_force (i.e.
> - * when VPMU_CONTEXT_SAVE is set) counters are already stopped.
> - */
> + /* Stop the counters. */
> + for ( i = 0; i < num_counters; i++ )
> + wrmsrl(ctrls[i], 0);
> +
> if ( !vpmu_is_set(vpmu, VPMU_CONTEXT_SAVE) )
> {
> vpmu_set(vpmu, VPMU_FROZEN);
> -
> - for ( i = 0; i < num_counters; i++ )
> - wrmsrl(ctrls[i], 0);
> -
> return 0;
> }
>
> @@ -274,6 +311,16 @@ static int amd_vpmu_save(struct vcpu *v)
> has_hvm_container_vcpu(v) && is_msr_bitmap_on(vpmu) )
> amd_vpmu_unset_msr_bitmap(v);
>
> + if ( to_guest )
> + {
> + struct xen_pmu_amd_ctxt *guest_ctxt, *ctxt;
> +
> + ASSERT(!is_hvm_vcpu(v));
> + ctxt = vpmu->context;
> + guest_ctxt = &vpmu->xenpmu_data->pmu.c.amd;
> + memcpy(&guest_ctxt->regs[0], &ctxt->regs[0], regs_sz);
> + }
> +
> return 1;
> }
>
> @@ -461,8 +508,7 @@ int svm_vpmu_initialise(struct vcpu *v)
> if ( !counters )
> return -EINVAL;
>
> - ctxt = xmalloc_bytes(sizeof(*ctxt) +
> - 2 * sizeof(uint64_t) * num_counters);
> + ctxt = xmalloc_bytes(sizeof(*ctxt) + regs_sz);
> if ( !ctxt )
> {
> printk(XENLOG_G_WARNING "Insufficient memory for PMU, "
> @@ -478,6 +524,14 @@ int svm_vpmu_initialise(struct vcpu *v)
> vpmu->context = ctxt;
> vpmu->priv_context = NULL;
>
> + if ( !is_hvm_vcpu(v) )
> + {
> + /* Copy register offsets to shared area */
> + ASSERT(vpmu->xenpmu_data);
> + memcpy(&vpmu->xenpmu_data->pmu.c.amd, ctxt,
> + offsetof(struct xen_pmu_amd_ctxt, regs));
> + }
> +
> vpmu->arch_vpmu_ops = &amd_vpmu_ops;
>
> vpmu_set(vpmu, VPMU_CONTEXT_ALLOCATED);
> @@ -527,6 +581,8 @@ int __init amd_vpmu_init(void)
> ctrl_rsvd[i] &= CTRL_RSVD_MASK;
> }
>
> + regs_sz = 2 * sizeof(uint64_t) * num_counters;
> +
> return 0;
> }
>
> diff --git a/xen/arch/x86/hvm/vmx/vpmu_core2.c b/xen/arch/x86/hvm/vmx/vpmu_core2.c
> index 166277a..1206e90 100644
> --- a/xen/arch/x86/hvm/vmx/vpmu_core2.c
> +++ b/xen/arch/x86/hvm/vmx/vpmu_core2.c
> @@ -90,6 +90,13 @@ static unsigned int __read_mostly arch_pmc_cnt, fixed_pmc_cnt;
> static uint64_t __read_mostly fixed_ctrl_mask, fixed_counters_mask;
> static uint64_t __read_mostly global_ovf_ctrl_mask;
>
> +/* Total size of PMU registers block (copied to/from PV(H) guest) */
> +static unsigned int __read_mostly regs_sz;
> +/* Offset into context of the beginning of PMU register block */
> +static const unsigned int regs_off =
> + sizeof(((struct xen_pmu_intel_ctxt *)0)->fixed_counters) +
> + sizeof(((struct xen_pmu_intel_ctxt *)0)->arch_counters);
> +
> /*
> * QUIRK to workaround an issue on various family 6 cpus.
> * The issue leads to endless PMC interrupt loops on the processor.
> @@ -312,7 +319,7 @@ static inline void __core2_vpmu_save(struct vcpu *v)
> rdmsrl(MSR_CORE_PERF_GLOBAL_STATUS, core2_vpmu_cxt->global_status);
> }
>
> -static int core2_vpmu_save(struct vcpu *v)
> +static int core2_vpmu_save(struct vcpu *v, bool_t to_guest)
> {
> struct vpmu_struct *vpmu = vcpu_vpmu(v);
>
> @@ -329,6 +336,13 @@ static int core2_vpmu_save(struct vcpu *v)
> has_hvm_container_vcpu(v) && cpu_has_vmx_msr_bitmap )
> core2_vpmu_unset_msr_bitmap(v->arch.hvm_vmx.msr_bitmap);
>
> + if ( to_guest )
> + {
> + ASSERT(!is_hvm_vcpu(v));
> + memcpy((void *)(&vpmu->xenpmu_data->pmu.c.intel) + regs_off,
> + vpmu->context + regs_off, regs_sz);
> + }
> +
> return 1;
> }
>
> @@ -365,16 +379,93 @@ static inline void __core2_vpmu_load(struct vcpu *v)
> }
> }
>
> -static void core2_vpmu_load(struct vcpu *v)
> +static int core2_vpmu_verify(struct vcpu *v)
> +{
> + unsigned int i;
> + struct vpmu_struct *vpmu = vcpu_vpmu(v);
> + struct xen_pmu_intel_ctxt *core2_vpmu_cxt = vcpu_vpmu(v)->context;
> + uint64_t *fixed_counters = vpmu_reg_pointer(core2_vpmu_cxt, fixed_counters);
> + struct xen_pmu_cntr_pair *xen_pmu_cntr_pair =
> + vpmu_reg_pointer(core2_vpmu_cxt, arch_counters);
> + uint64_t fixed_ctrl;
> + uint64_t *priv_context = vpmu->priv_context;
> + uint64_t enabled_cntrs = 0;
> +
> + if ( core2_vpmu_cxt->global_ovf_ctrl & global_ovf_ctrl_mask )
> + return -EINVAL;
> +
> + fixed_ctrl = core2_vpmu_cxt->fixed_ctrl;
> + if ( fixed_ctrl & fixed_ctrl_mask )
> + return -EINVAL;
> +
> + for ( i = 0; i < fixed_pmc_cnt; i++ )
> + {
> + if ( fixed_counters[i] & fixed_counters_mask )
> + return -EINVAL;
> + if ( (fixed_ctrl >> (i * FIXED_CTR_CTRL_BITS)) & 3 )
> + enabled_cntrs |= (1ULL << i);
> + }
> + enabled_cntrs <<= 32;
> +
> + for ( i = 0; i < arch_pmc_cnt; i++ )
> + {
> + uint64_t control = xen_pmu_cntr_pair[i].control;
> +
> + if ( control & ARCH_CTRL_MASK )
> + return -EINVAL;
> + if ( control & ARCH_CNTR_ENABLED )
> + enabled_cntrs |= (1ULL << i);
> + }
> +
> + if ( vpmu_is_set(vcpu_vpmu(v), VPMU_CPU_HAS_DS) &&
> + !is_canonical_address(core2_vpmu_cxt->ds_area) )
> + return -EINVAL;
> +
> + if ( (core2_vpmu_cxt->global_ctrl & enabled_cntrs) ||
> + (core2_vpmu_cxt->ds_area != 0) )
> + vpmu_set(vpmu, VPMU_RUNNING);
> + else
> + vpmu_reset(vpmu, VPMU_RUNNING);
> +
> + *priv_context = enabled_cntrs;
> +
> + return 0;
> +}
> +
> +static int core2_vpmu_load(struct vcpu *v, bool_t from_guest)
> {
> struct vpmu_struct *vpmu = vcpu_vpmu(v);
>
> if ( vpmu_is_set(vpmu, VPMU_CONTEXT_LOADED) )
> - return;
> + return 0;
> +
> + if ( from_guest )
> + {
> + int ret;
> +
> + ASSERT(!is_hvm_vcpu(v));
> +
> + memcpy(vpmu->context + regs_off,
> + (void *)&v->arch.vpmu.xenpmu_data->pmu.c.intel + regs_off,
> + regs_sz);
> +
> + ret = core2_vpmu_verify(v);
> + if ( ret )
> + {
> + /*
> + * Not necessary since we should never load the context until
> + * guest provides valid values. But just to be safe.
> + */
> + memset(vpmu->context + regs_off, 0, regs_sz);
> + return ret;
> + }
> + }
>
> vpmu_set(vpmu, VPMU_CONTEXT_LOADED);
>
> __core2_vpmu_load(v);
> +
> + return 0;
> }
>
> static int core2_vpmu_alloc_resource(struct vcpu *v)
> @@ -412,6 +503,13 @@ static int core2_vpmu_alloc_resource(struct vcpu *v)
> vpmu->context = core2_vpmu_cxt;
> vpmu->priv_context = p;
>
> + if ( !is_hvm_vcpu(v) )
> + {
> + /* Copy fixed/arch register offsets to shared area */
> + ASSERT(vpmu->xenpmu_data);
> + memcpy(&vpmu->xenpmu_data->pmu.c.intel, core2_vpmu_cxt, regs_off);
> + }
> +
> vpmu_set(vpmu, VPMU_CONTEXT_ALLOCATED);
>
> return 1;
> @@ -923,6 +1021,10 @@ int __init core2_vpmu_init(void)
> (((1ULL << fixed_pmc_cnt) - 1) << 32) |
> ((1ULL << arch_pmc_cnt) - 1));
>
> + regs_sz = (sizeof(struct xen_pmu_intel_ctxt) - regs_off) +
> + sizeof(uint64_t) * fixed_pmc_cnt +
> + sizeof(struct xen_pmu_cntr_pair) * arch_pmc_cnt;
> +
> check_pmc_quirk();
>
> if ( sizeof(struct xen_pmu_data) + sizeof(uint64_t) * fixed_pmc_cnt +
> diff --git a/xen/arch/x86/hvm/vpmu.c b/xen/arch/x86/hvm/vpmu.c
> index 6d3f0ab..c0a22fb 100644
> --- a/xen/arch/x86/hvm/vpmu.c
> +++ b/xen/arch/x86/hvm/vpmu.c
> @@ -85,31 +85,56 @@ static void __init parse_vpmu_param(char *s)
> void vpmu_lvtpc_update(uint32_t val)
> {
> struct vpmu_struct *vpmu;
> + struct vcpu *curr = current;
>
> - if ( vpmu_mode == XENPMU_MODE_OFF )
> + if ( likely(vpmu_mode == XENPMU_MODE_OFF) )
> return;
>
> - vpmu = vcpu_vpmu(current);
> + vpmu = vcpu_vpmu(curr);
>
> vpmu->hw_lapic_lvtpc = PMU_APIC_VECTOR | (val & APIC_LVT_MASKED);
> - apic_write(APIC_LVTPC, vpmu->hw_lapic_lvtpc);
> +
> + /* Postpone APIC updates for PV(H) guests if PMU interrupt is pending */
> + if ( is_hvm_vcpu(curr) || !vpmu->xenpmu_data ||
> + !vpmu_is_set(vpmu, VPMU_CACHED) )
> + apic_write(APIC_LVTPC, vpmu->hw_lapic_lvtpc);
> }
>
> int vpmu_do_wrmsr(unsigned int msr, uint64_t msr_content, uint64_t supported)
> {
> - struct vpmu_struct *vpmu = vcpu_vpmu(current);
> + struct vcpu *curr = current;
> + struct vpmu_struct *vpmu;
>
> if ( vpmu_mode == XENPMU_MODE_OFF )
> return 0;
>
> + vpmu = vcpu_vpmu(curr);
> if ( vpmu->arch_vpmu_ops && vpmu->arch_vpmu_ops->do_wrmsr )
> - return vpmu->arch_vpmu_ops->do_wrmsr(msr, msr_content, supported);
> + {
> + int ret = vpmu->arch_vpmu_ops->do_wrmsr(msr, msr_content, supported);
> +
> + /*
> + * We may have received a PMU interrupt during WRMSR handling
> + * and since do_wrmsr may load VPMU context we should save
> + * (and unload) it again.
> + */
> + if ( !is_hvm_vcpu(curr) && vpmu->xenpmu_data &&
> + vpmu_is_set(vpmu, VPMU_CACHED) )
> + {
> + vpmu_set(vpmu, VPMU_CONTEXT_SAVE);
> + vpmu->arch_vpmu_ops->arch_vpmu_save(curr, 0);
> + vpmu_reset(vpmu, VPMU_CONTEXT_SAVE | VPMU_CONTEXT_LOADED);
> + }
> + return ret;
> + }
> +
> return 0;
> }
>
> int vpmu_do_rdmsr(unsigned int msr, uint64_t *msr_content)
> {
> - struct vpmu_struct *vpmu = vcpu_vpmu(current);
> + struct vcpu *curr = current;
> + struct vpmu_struct *vpmu;
>
> if ( vpmu_mode == XENPMU_MODE_OFF )
> {
> @@ -117,39 +142,184 @@ int vpmu_do_rdmsr(unsigned int msr, uint64_t *msr_content)
> return 0;
> }
>
> + vpmu = vcpu_vpmu(curr);
> if ( vpmu->arch_vpmu_ops && vpmu->arch_vpmu_ops->do_rdmsr )
> - return vpmu->arch_vpmu_ops->do_rdmsr(msr, msr_content);
> + {
> + int ret = vpmu->arch_vpmu_ops->do_rdmsr(msr, msr_content);
> +
> + if ( !is_hvm_vcpu(curr) && vpmu->xenpmu_data &&
> + vpmu_is_set(vpmu, VPMU_CACHED) )
> + {
> + vpmu_set(vpmu, VPMU_CONTEXT_SAVE);
> + vpmu->arch_vpmu_ops->arch_vpmu_save(curr, 0);
> + vpmu_reset(vpmu, VPMU_CONTEXT_SAVE | VPMU_CONTEXT_LOADED);
> + }
> + return ret;
> + }
> else
> *msr_content = 0;
>
> return 0;
> }
>
> +static inline struct vcpu *choose_hwdom_vcpu(void)
> +{
> + unsigned idx;
> +
> + if ( hardware_domain->max_vcpus == 0 )
> + return NULL;
> +
> + idx = smp_processor_id() % hardware_domain->max_vcpus;
> +
> + return hardware_domain->vcpu[idx];
> +}
> +
> void vpmu_do_interrupt(struct cpu_user_regs *regs)
> {
> - struct vcpu *v = current;
> - struct vpmu_struct *vpmu = vcpu_vpmu(v);
> + struct vcpu *sampled = current, *sampling;
> + struct vpmu_struct *vpmu;
> + struct vlapic *vlapic;
> + u32 vlapic_lvtpc;
>
> - if ( vpmu->arch_vpmu_ops )
> + /* dom0 will handle interrupt for special domains (e.g. idle domain) */
> + if ( sampled->domain->domain_id >= DOMID_FIRST_RESERVED )
> + {
> + sampling = choose_hwdom_vcpu();
> + if ( !sampling )
> + return;
> + }
> + else
> + sampling = sampled;
> +
> + vpmu = vcpu_vpmu(sampling);
> + if ( !vpmu->arch_vpmu_ops )
> + return;
> +
> + /* PV(H) guest */
> + if ( !is_hvm_vcpu(sampling) )
> {
> - struct vlapic *vlapic = vcpu_vlapic(v);
> - u32 vlapic_lvtpc;
> + const struct cpu_user_regs *cur_regs;
> + uint64_t *flags = &vpmu->xenpmu_data->pmu.pmu_flags;
> + domid_t domid = DOMID_SELF;
> +
> + if ( !vpmu->xenpmu_data )
> + return;
> +
> + if ( is_pvh_vcpu(sampling) &&
> + !vpmu->arch_vpmu_ops->do_interrupt(regs) )
> + return;
>
> - if ( !vpmu->arch_vpmu_ops->do_interrupt(regs) ||
> - !is_vlapic_lvtpc_enabled(vlapic) )
> + if ( vpmu_is_set(vpmu, VPMU_CACHED) )
> return;
>
> - vlapic_lvtpc = vlapic_get_reg(vlapic, APIC_LVTPC);
> + /* PV guest will be reading PMU MSRs from xenpmu_data */
> + vpmu_set(vpmu, VPMU_CONTEXT_SAVE | VPMU_CONTEXT_LOADED);
> + vpmu->arch_vpmu_ops->arch_vpmu_save(sampling, 1);
> + vpmu_reset(vpmu, VPMU_CONTEXT_SAVE | VPMU_CONTEXT_LOADED);
>
> - switch ( GET_APIC_DELIVERY_MODE(vlapic_lvtpc) )
> + if ( has_hvm_container_vcpu(sampled) )
> + *flags = 0;
> + else
> + *flags = PMU_SAMPLE_PV;
> +
> + /* Store appropriate registers in xenpmu_data */
> + /* FIXME: 32-bit PVH should go here as well */
> + if ( is_pv_32bit_vcpu(sampling) )
> {
> - case APIC_MODE_FIXED:
> - vlapic_set_irq(vlapic, vlapic_lvtpc & APIC_VECTOR_MASK, 0);
> - break;
> - case APIC_MODE_NMI:
> - v->nmi_pending = 1;
> - break;
> + /*
> + * 32-bit dom0 cannot process Xen's addresses (which are 64 bit)
> + * and therefore we treat it the same way as a non-privileged
> + * PV 32-bit domain.
> + */
> + struct compat_pmu_regs *cmp;
> +
> + cur_regs = guest_cpu_user_regs();
> +
> + cmp = (void *)&vpmu->xenpmu_data->pmu.r.regs;
> + cmp->ip = cur_regs->rip;
> + cmp->sp = cur_regs->rsp;
> + cmp->flags = cur_regs->eflags;
> + cmp->ss = cur_regs->ss;
> + cmp->cs = cur_regs->cs;
> + if ( (cmp->cs & 3) > 1 )
> + *flags |= PMU_SAMPLE_USER;
> + }
> + else
> + {
> + struct xen_pmu_regs *r = &vpmu->xenpmu_data->pmu.r.regs;
> +
> + if ( (vpmu_mode & XENPMU_MODE_SELF) )
> + cur_regs = guest_cpu_user_regs();
> + else if ( !guest_mode(regs) && is_hardware_domain(sampling->domain) )
> + {
> + cur_regs = regs;
> + domid = DOMID_XEN;
> + }
> + else
> + cur_regs = guest_cpu_user_regs();
> +
> + r->ip = cur_regs->rip;
> + r->sp = cur_regs->rsp;
> + r->flags = cur_regs->eflags;
> +
> + if ( !has_hvm_container_vcpu(sampled) )
> + {
> + r->ss = cur_regs->ss;
> + r->cs = cur_regs->cs;
> + if ( !(sampled->arch.flags & TF_kernel_mode) )
> + *flags |= PMU_SAMPLE_USER;
> + }
> + else
> + {
> + struct segment_register seg;
> +
> + hvm_get_segment_register(sampled, x86_seg_cs, &seg);
> + r->cs = seg.sel;
> + hvm_get_segment_register(sampled, x86_seg_ss, &seg);
> + r->ss = seg.sel;
> + r->cpl = seg.attr.fields.dpl;
> + if ( !(sampled->arch.hvm_vcpu.guest_cr[0] & X86_CR0_PE) )
> + *flags |= PMU_SAMPLE_REAL;
> + }
> }
> +
> + vpmu->xenpmu_data->domain_id = domid;
> + vpmu->xenpmu_data->vcpu_id = sampled->vcpu_id;
> + if ( is_hardware_domain(sampling->domain) )
> + vpmu->xenpmu_data->pcpu_id = smp_processor_id();
> + else
> + vpmu->xenpmu_data->pcpu_id = sampled->vcpu_id;
> +
> + vpmu->hw_lapic_lvtpc |= APIC_LVT_MASKED;
> + apic_write(APIC_LVTPC, vpmu->hw_lapic_lvtpc);
> + *flags |= PMU_CACHED;
> + vpmu_set(vpmu, VPMU_CACHED);
> +
> + send_guest_vcpu_virq(sampling, VIRQ_XENPMU);
> +
> + return;
> + }
> +
> + /* HVM guests */
> + vlapic = vcpu_vlapic(sampling);
> +
> + /* We don't support (yet) HVM dom0 */
> + ASSERT(sampling == sampled);
> +
> + if ( !vpmu->arch_vpmu_ops->do_interrupt(regs) ||
> + !is_vlapic_lvtpc_enabled(vlapic) )
> + return;
> +
> + vlapic_lvtpc = vlapic_get_reg(vlapic, APIC_LVTPC);
> +
> + switch ( GET_APIC_DELIVERY_MODE(vlapic_lvtpc) )
> + {
> + case APIC_MODE_FIXED:
> + vlapic_set_irq(vlapic, vlapic_lvtpc & APIC_VECTOR_MASK, 0);
> + break;
> + case APIC_MODE_NMI:
> + sampling->nmi_pending = 1;
> + break;
> }
> }
>
> @@ -174,7 +344,7 @@ static void vpmu_save_force(void *arg)
> vpmu_set(vpmu, VPMU_CONTEXT_SAVE);
>
> if ( vpmu->arch_vpmu_ops )
> - (void)vpmu->arch_vpmu_ops->arch_vpmu_save(v);
> + (void)vpmu->arch_vpmu_ops->arch_vpmu_save(v, 0);
>
> vpmu_reset(vpmu, VPMU_CONTEXT_SAVE);
>
> @@ -193,20 +363,20 @@ void vpmu_save(struct vcpu *v)
> per_cpu(last_vcpu, pcpu) = v;
>
> if ( vpmu->arch_vpmu_ops )
> - if ( vpmu->arch_vpmu_ops->arch_vpmu_save(v) )
> + if ( vpmu->arch_vpmu_ops->arch_vpmu_save(v, 0) )
> vpmu_reset(vpmu, VPMU_CONTEXT_LOADED);
>
> apic_write(APIC_LVTPC, PMU_APIC_VECTOR | APIC_LVT_MASKED);
> }
>
> -void vpmu_load(struct vcpu *v)
> +int vpmu_load(struct vcpu *v, bool_t from_guest)
> {
> struct vpmu_struct *vpmu = vcpu_vpmu(v);
> int pcpu = smp_processor_id();
> struct vcpu *prev = NULL;
>
> if ( !vpmu_is_set(vpmu, VPMU_CONTEXT_ALLOCATED) )
> - return;
> + return 0;
>
> /* First time this VCPU is running here */
> if ( vpmu->last_pcpu != pcpu )
> @@ -245,15 +415,26 @@ void vpmu_load(struct vcpu *v)
> local_irq_enable();
>
> /* Only when PMU is counting, we load PMU context immediately. */
> - if ( !vpmu_is_set(vpmu, VPMU_RUNNING) )
> - return;
> + if ( !vpmu_is_set(vpmu, VPMU_RUNNING) ||
> + (!is_hvm_vcpu(vpmu_vcpu(vpmu)) && vpmu_is_set(vpmu, VPMU_CACHED)) )
> + return 0;
>
> if ( vpmu->arch_vpmu_ops && vpmu->arch_vpmu_ops->arch_vpmu_load )
> {
> + int ret;
> +
> apic_write_around(APIC_LVTPC, vpmu->hw_lapic_lvtpc);
> /* Arch code needs to set VPMU_CONTEXT_LOADED */
> - vpmu->arch_vpmu_ops->arch_vpmu_load(v);
> + ret = vpmu->arch_vpmu_ops->arch_vpmu_load(v, from_guest);
> + if ( ret )
> + {
> + apic_write_around(APIC_LVTPC,
> + vpmu->hw_lapic_lvtpc | APIC_LVT_MASKED);
> + return ret;
> + }
> }
> +
> + return 0;
> }
>
> void vpmu_initialise(struct vcpu *v)
> @@ -265,6 +446,8 @@ void vpmu_initialise(struct vcpu *v)
>
> BUILD_BUG_ON(sizeof(struct xen_pmu_intel_ctxt) > XENPMU_CTXT_PAD_SZ);
> BUILD_BUG_ON(sizeof(struct xen_pmu_amd_ctxt) > XENPMU_CTXT_PAD_SZ);
> + BUILD_BUG_ON(sizeof(struct xen_pmu_regs) > XENPMU_REGS_PAD_SZ);
> + BUILD_BUG_ON(sizeof(struct compat_pmu_regs) > XENPMU_REGS_PAD_SZ);
>
> ASSERT(!vpmu->flags && !vpmu->context);
>
> @@ -449,7 +632,10 @@ void vpmu_dump(struct vcpu *v)
> long do_xenpmu_op(unsigned int op, XEN_GUEST_HANDLE_PARAM(xen_pmu_params_t) arg)
> {
> int ret;
> + struct vcpu *curr;
> struct xen_pmu_params pmu_params = {.val = 0};
> + struct xen_pmu_data *xenpmu_data;
> + struct vpmu_struct *vpmu;
>
> if ( !opt_vpmu_enabled )
> return -EOPNOTSUPP;
> @@ -552,6 +738,31 @@ long do_xenpmu_op(unsigned int op, XEN_GUEST_HANDLE_PARAM(xen_pmu_params_t) arg)
> pvpmu_finish(current->domain, &pmu_params);
> break;
>
> + case XENPMU_lvtpc_set:
> + xenpmu_data = current->arch.vpmu.xenpmu_data;
> + if ( xenpmu_data != NULL )
> + vpmu_lvtpc_update(xenpmu_data->pmu.l.lapic_lvtpc);
> + else
> + ret = -EINVAL;
> + break;
> +
> + case XENPMU_flush:
> + curr = current;
> + vpmu = vcpu_vpmu(curr);
> + xenpmu_data = curr->arch.vpmu.xenpmu_data;
> + if ( xenpmu_data == NULL )
> + return -EINVAL;
> + xenpmu_data->pmu.pmu_flags &= ~PMU_CACHED;
> + vpmu_reset(vpmu, VPMU_CACHED);
> + vpmu_lvtpc_update(xenpmu_data->pmu.l.lapic_lvtpc);
> + if ( vpmu_load(curr, 1) )
> + {
> + xenpmu_data->pmu.pmu_flags |= PMU_CACHED;
> + vpmu_set(vpmu, VPMU_CACHED);
> + ret = -EIO;
> + }
> + break ;
> +
> default:
> ret = -EINVAL;
> }
> diff --git a/xen/include/asm-x86/hvm/vpmu.h b/xen/include/asm-x86/hvm/vpmu.h
> index 642a4b7..f486d2f 100644
> --- a/xen/include/asm-x86/hvm/vpmu.h
> +++ b/xen/include/asm-x86/hvm/vpmu.h
> @@ -47,8 +47,8 @@ struct arch_vpmu_ops {
> unsigned int *eax, unsigned int *ebx,
> unsigned int *ecx, unsigned int *edx);
> void (*arch_vpmu_destroy)(struct vcpu *v);
> - int (*arch_vpmu_save)(struct vcpu *v);
> - void (*arch_vpmu_load)(struct vcpu *v);
> + int (*arch_vpmu_save)(struct vcpu *v, bool_t to_guest);
> + int (*arch_vpmu_load)(struct vcpu *v, bool_t from_guest);
> void (*arch_vpmu_dump)(const struct vcpu *);
> };
>
> @@ -75,6 +75,8 @@ struct vpmu_struct {
> #define VPMU_CONTEXT_SAVE 0x8 /* Force context save */
> #define VPMU_FROZEN 0x10 /* Stop counters while VCPU is not running */
> #define VPMU_PASSIVE_DOMAIN_ALLOCATED 0x20
> +/* PV(H) guests: VPMU registers are accessed by guest from shared page */
> +#define VPMU_CACHED 0x40
>
> static inline void vpmu_set(struct vpmu_struct *vpmu, const u32 mask)
> {
> @@ -107,7 +109,7 @@ void vpmu_do_cpuid(unsigned int input, unsigned int *eax, unsigned int *ebx,
> void vpmu_initialise(struct vcpu *v);
> void vpmu_destroy(struct vcpu *v);
> void vpmu_save(struct vcpu *v);
> -void vpmu_load(struct vcpu *v);
> +int vpmu_load(struct vcpu *v, bool_t from_guest);
> void vpmu_dump(struct vcpu *v);
>
> extern int acquire_pmu_ownership(int pmu_ownership);
> @@ -126,7 +128,7 @@ static inline void vpmu_switch_from(struct vcpu *prev)
> static inline void vpmu_switch_to(struct vcpu *next)
> {
> if ( vpmu_mode & (XENPMU_MODE_SELF | XENPMU_MODE_HV) )
> - vpmu_load(next);
> + vpmu_load(next, 0);
> }
>
> #endif /* __ASM_X86_HVM_VPMU_H_*/
> diff --git a/xen/include/public/arch-x86/pmu.h b/xen/include/public/arch-x86/pmu.h
> index 4351115..1a53888 100644
> --- a/xen/include/public/arch-x86/pmu.h
> +++ b/xen/include/public/arch-x86/pmu.h
> @@ -5,7 +5,10 @@
>
> /* AMD PMU registers and structures */
> struct xen_pmu_amd_ctxt {
> - /* Offsets to counter and control MSRs (relative to xen_pmu_arch.c.amd) */
> + /*
> + * Offsets to counter and control MSRs (relative to xen_pmu_arch.c.amd).
> + * For PV(H) guests these fields are RO.
> + */
> uint32_t counters;
> uint32_t ctrls;
>
> @@ -30,7 +33,8 @@ DEFINE_XEN_GUEST_HANDLE(xen_pmu_cntr_pair_t);
> struct xen_pmu_intel_ctxt {
> /*
> * Offsets to fixed and architectural counter MSRs (relative to
> - * xen_pmu_arch.c.intel)
> + * xen_pmu_arch.c.intel).
> + * For PV(H) guests these fields are RO.
> */
> uint32_t fixed_counters;
> uint32_t arch_counters;
> @@ -69,6 +73,9 @@ DEFINE_XEN_GUEST_HANDLE(xen_pmu_regs_t);
>
> /* PMU flags */
> #define PMU_CACHED (1<<0) /* PMU MSRs are cached in the context */
> +#define PMU_SAMPLE_USER (1<<1) /* Sample is from user or kernel mode */
> +#define PMU_SAMPLE_REAL (1<<2) /* Sample is from realmode */
> +#define PMU_SAMPLE_PV (1<<3) /* Sample from a PV guest */
>
> /*
> * Architecture-specific information describing state of the processor at
> @@ -93,12 +100,34 @@ struct xen_pmu_arch {
> /* WO for hypervisor, RO for guest */
> uint64_t pmu_flags;
>
> - /* Placeholder for APIC LVTPC register */
> - uint64_t lvtpc_pad;
> + /*
> + * APIC LVTPC register.
> + * RW for both hypervisor and guest.
> + * Only APIC_LVT_MASKED bit is loaded by the hypervisor into hardware
> + * during XENPMU_flush or XENPMU_lvtpc_set.
> + */
> + union {
> + uint32_t lapic_lvtpc;
> + uint64_t pad;
> + } l;
> +
> + /*
> + * Vendor-specific PMU registers.
> + * RW for both hypervisor and guest (see exceptions above).
> + * Guest's updates to this field are verified and then loaded by the
> + * hypervisor into hardware during XENPMU_flush
> + */
> + union {
> + struct xen_pmu_amd_ctxt amd;
> + struct xen_pmu_intel_ctxt intel;
>
> - /* Placeholder for vendor-specific PMU registers */
> + /*
> + * Padding for contexts (fixed parts only, does not include MSR banks
> + * that are specified by offsets)
> + */
> #define XENPMU_CTXT_PAD_SZ 128
> - uint64_t pmu_regs_pad[XENPMU_CTXT_PAD_SZ / 8];
> + uint8_t pad[XENPMU_CTXT_PAD_SZ];
> + } c;
> };
> typedef struct xen_pmu_arch xen_pmu_arch_t;
> DEFINE_XEN_GUEST_HANDLE(xen_pmu_arch_t);
> diff --git a/xen/include/public/pmu.h b/xen/include/public/pmu.h
> index e6307b5..7a45783 100644
> --- a/xen/include/public/pmu.h
> +++ b/xen/include/public/pmu.h
> @@ -27,6 +27,8 @@
> #define XENPMU_feature_set 3
> #define XENPMU_init 4
> #define XENPMU_finish 5
> +#define XENPMU_lvtpc_set 6
> +#define XENPMU_flush 7 /* Write cached MSR values to HW */
> /* ` } */
>
> /* Parameters structure for HYPERVISOR_xenpmu_op call */
> diff --git a/xen/include/xsm/dummy.h b/xen/include/xsm/dummy.h
> index 6456f72..37e6aa3 100644
> --- a/xen/include/xsm/dummy.h
> +++ b/xen/include/xsm/dummy.h
> @@ -705,7 +705,9 @@ static XSM_INLINE int xsm_pmu_op (XSM_DEFAULT_ARG struct domain *d, int op)
> case XENPMU_feature_get:
> return xsm_default_action(XSM_PRIV, d, current->domain);
> case XENPMU_init:
> - case XENPMU_finish:
> + case XENPMU_finish:
> + case XENPMU_lvtpc_set:
> + case XENPMU_flush:
> return xsm_default_action(XSM_HOOK, d, current->domain);
> default:
> return -EPERM;
> diff --git a/xen/xsm/flask/hooks.c b/xen/xsm/flask/hooks.c
> index aefcbda..4aa3e79 100644
> --- a/xen/xsm/flask/hooks.c
> +++ b/xen/xsm/flask/hooks.c
> @@ -1594,6 +1594,8 @@ static int flask_pmu_op (struct domain *d, unsigned int op)
> XEN2__PMU_CTRL, NULL);
> case XENPMU_init:
> case XENPMU_finish:
> + case XENPMU_lvtpc_set:
> + case XENPMU_flush:
> return avc_has_perm(dsid, SECINITSID_XEN, SECCLASS_XEN2,
> XEN2__PMU_USE, NULL);
> default:
>
--
Company details: http://ts.fujitsu.com/imprint.html
^ permalink raw reply [flat|nested] 37+ messages in thread
* [PATCH v25 13/15] x86/VPMU: Merge vpmu_rdmsr and vpmu_wrmsr
2015-06-19 18:44 [PATCH v25 00/15] x86/PMU: Xen PMU PV(H) support Boris Ostrovsky
` (11 preceding siblings ...)
2015-06-19 18:44 ` [PATCH v25 12/15] x86/VPMU: Handle PMU interrupts for PV(H) guests Boris Ostrovsky
@ 2015-06-19 18:44 ` Boris Ostrovsky
2015-07-09 6:13 ` Tian, Kevin
2015-06-19 18:44 ` [PATCH v25 14/15] x86/VPMU: Add privileged PMU mode Boris Ostrovsky
` (3 subsequent siblings)
16 siblings, 1 reply; 37+ messages in thread
From: Boris Ostrovsky @ 2015-06-19 18:44 UTC (permalink / raw)
To: JBeulich, kevin.tian, suravee.suthikulpanit,
Aravind.Gopalakrishnan, dietmar.hahn, dgdegra, andrew.cooper3
Cc: boris.ostrovsky, xen-devel
The two routines share most of their logic.
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: Dietmar Hahn <dietmar.hahn@ts.fujitsu.com>
---
xen/arch/x86/hvm/vpmu.c | 75 ++++++++++++++++--------------------------
xen/include/asm-x86/hvm/vpmu.h | 14 ++++++--
2 files changed, 41 insertions(+), 48 deletions(-)
diff --git a/xen/arch/x86/hvm/vpmu.c b/xen/arch/x86/hvm/vpmu.c
index c0a22fb..9d6ca93 100644
--- a/xen/arch/x86/hvm/vpmu.c
+++ b/xen/arch/x86/hvm/vpmu.c
@@ -100,63 +100,46 @@ void vpmu_lvtpc_update(uint32_t val)
apic_write(APIC_LVTPC, vpmu->hw_lapic_lvtpc);
}
-int vpmu_do_wrmsr(unsigned int msr, uint64_t msr_content, uint64_t supported)
+int vpmu_do_msr(unsigned int msr, uint64_t *msr_content,
+ uint64_t supported, bool_t is_write)
{
struct vcpu *curr = current;
struct vpmu_struct *vpmu;
+ const struct arch_vpmu_ops *ops;
+ int ret = 0;
- if ( vpmu_mode == XENPMU_MODE_OFF )
- return 0;
+ if ( likely(vpmu_mode == XENPMU_MODE_OFF) )
+ goto nop;
vpmu = vcpu_vpmu(curr);
- if ( vpmu->arch_vpmu_ops && vpmu->arch_vpmu_ops->do_wrmsr )
- {
- int ret = vpmu->arch_vpmu_ops->do_wrmsr(msr, msr_content, supported);
-
- /*
- * We may have received a PMU interrupt during WRMSR handling
- * and since do_wrmsr may load VPMU context we should save
- * (and unload) it again.
- */
- if ( !is_hvm_vcpu(curr) && vpmu->xenpmu_data &&
- vpmu_is_set(vpmu, VPMU_CACHED) )
- {
- vpmu_set(vpmu, VPMU_CONTEXT_SAVE);
- vpmu->arch_vpmu_ops->arch_vpmu_save(curr, 0);
- vpmu_reset(vpmu, VPMU_CONTEXT_SAVE | VPMU_CONTEXT_LOADED);
- }
- return ret;
- }
-
- return 0;
-}
-
-int vpmu_do_rdmsr(unsigned int msr, uint64_t *msr_content)
-{
- struct vcpu *curr = current;
- struct vpmu_struct *vpmu;
+ ops = vpmu->arch_vpmu_ops;
+ if ( !ops )
+ goto nop;
+
+ if ( is_write && ops->do_wrmsr )
+ ret = ops->do_wrmsr(msr, *msr_content, supported);
+ else if ( !is_write && ops->do_rdmsr )
+ ret = ops->do_rdmsr(msr, msr_content);
+ else
+ goto nop;
- if ( vpmu_mode == XENPMU_MODE_OFF )
+ /*
+ * We may have received a PMU interrupt while handling MSR access
+ * and since do_wr/rdmsr may load VPMU context we should save
+ * (and unload) it again.
+ */
+ if ( !is_hvm_vcpu(curr) && vpmu->xenpmu_data &&
+ vpmu_is_set(vpmu, VPMU_CACHED) )
{
- *msr_content = 0;
- return 0;
+ vpmu_set(vpmu, VPMU_CONTEXT_SAVE);
+ ops->arch_vpmu_save(curr, 0);
+ vpmu_reset(vpmu, VPMU_CONTEXT_SAVE | VPMU_CONTEXT_LOADED);
}
- vpmu = vcpu_vpmu(curr);
- if ( vpmu->arch_vpmu_ops && vpmu->arch_vpmu_ops->do_rdmsr )
- {
- int ret = vpmu->arch_vpmu_ops->do_rdmsr(msr, msr_content);
+ return ret;
- if ( !is_hvm_vcpu(curr) && vpmu->xenpmu_data &&
- vpmu_is_set(vpmu, VPMU_CACHED) )
- {
- vpmu_set(vpmu, VPMU_CONTEXT_SAVE);
- vpmu->arch_vpmu_ops->arch_vpmu_save(curr, 0);
- vpmu_reset(vpmu, VPMU_CONTEXT_SAVE | VPMU_CONTEXT_LOADED);
- }
- return ret;
- }
- else
+ nop:
+ if ( !is_write )
*msr_content = 0;
return 0;
diff --git a/xen/include/asm-x86/hvm/vpmu.h b/xen/include/asm-x86/hvm/vpmu.h
index f486d2f..212e496 100644
--- a/xen/include/asm-x86/hvm/vpmu.h
+++ b/xen/include/asm-x86/hvm/vpmu.h
@@ -101,8 +101,8 @@ static inline bool_t vpmu_are_all_set(const struct vpmu_struct *vpmu,
}
void vpmu_lvtpc_update(uint32_t val);
-int vpmu_do_wrmsr(unsigned int msr, uint64_t msr_content, uint64_t supported);
-int vpmu_do_rdmsr(unsigned int msr, uint64_t *msr_content);
+int vpmu_do_msr(unsigned int msr, uint64_t *msr_content,
+ uint64_t supported, bool_t is_write);
void vpmu_do_interrupt(struct cpu_user_regs *regs);
void vpmu_do_cpuid(unsigned int input, unsigned int *eax, unsigned int *ebx,
unsigned int *ecx, unsigned int *edx);
@@ -112,6 +112,16 @@ void vpmu_save(struct vcpu *v);
int vpmu_load(struct vcpu *v, bool_t from_guest);
void vpmu_dump(struct vcpu *v);
+static inline int vpmu_do_wrmsr(unsigned int msr, uint64_t msr_content,
+ uint64_t supported)
+{
+ return vpmu_do_msr(msr, &msr_content, supported, 1);
+}
+static inline int vpmu_do_rdmsr(unsigned int msr, uint64_t *msr_content)
+{
+ return vpmu_do_msr(msr, msr_content, 0, 0);
+}
+
extern int acquire_pmu_ownership(int pmu_ownership);
extern void release_pmu_ownership(int pmu_ownership);
--
1.8.1.4
^ permalink raw reply related [flat|nested] 37+ messages in thread
* Re: [PATCH v25 13/15] x86/VPMU: Merge vpmu_rdmsr and vpmu_wrmsr
2015-06-19 18:44 ` [PATCH v25 13/15] x86/VPMU: Merge vpmu_rdmsr and vpmu_wrmsr Boris Ostrovsky
@ 2015-07-09 6:13 ` Tian, Kevin
0 siblings, 0 replies; 37+ messages in thread
From: Tian, Kevin @ 2015-07-09 6:13 UTC (permalink / raw)
To: Boris Ostrovsky, JBeulich, suravee.suthikulpanit,
Aravind.Gopalakrishnan, dietmar.hahn, dgdegra, andrew.cooper3
Cc: xen-devel
> From: Boris Ostrovsky [mailto:boris.ostrovsky@oracle.com]
> Sent: Saturday, June 20, 2015 2:45 AM
>
> The two routines share most of their logic.
>
> Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
> Reviewed-by: Dietmar Hahn <dietmar.hahn@ts.fujitsu.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
> ---
> xen/arch/x86/hvm/vpmu.c | 75
> ++++++++++++++++--------------------------
> xen/include/asm-x86/hvm/vpmu.h | 14 ++++++--
> 2 files changed, 41 insertions(+), 48 deletions(-)
>
> diff --git a/xen/arch/x86/hvm/vpmu.c b/xen/arch/x86/hvm/vpmu.c
> index c0a22fb..9d6ca93 100644
> --- a/xen/arch/x86/hvm/vpmu.c
> +++ b/xen/arch/x86/hvm/vpmu.c
> @@ -100,63 +100,46 @@ void vpmu_lvtpc_update(uint32_t val)
> apic_write(APIC_LVTPC, vpmu->hw_lapic_lvtpc);
> }
>
> -int vpmu_do_wrmsr(unsigned int msr, uint64_t msr_content, uint64_t supported)
> +int vpmu_do_msr(unsigned int msr, uint64_t *msr_content,
> + uint64_t supported, bool_t is_write)
> {
> struct vcpu *curr = current;
> struct vpmu_struct *vpmu;
> + const struct arch_vpmu_ops *ops;
> + int ret = 0;
>
> - if ( vpmu_mode == XENPMU_MODE_OFF )
> - return 0;
> + if ( likely(vpmu_mode == XENPMU_MODE_OFF) )
> + goto nop;
>
> vpmu = vcpu_vpmu(curr);
> - if ( vpmu->arch_vpmu_ops && vpmu->arch_vpmu_ops->do_wrmsr )
> - {
> - int ret = vpmu->arch_vpmu_ops->do_wrmsr(msr, msr_content, supported);
> -
> - /*
> - * We may have received a PMU interrupt during WRMSR handling
> - * and since do_wrmsr may load VPMU context we should save
> - * (and unload) it again.
> - */
> - if ( !is_hvm_vcpu(curr) && vpmu->xenpmu_data &&
> - vpmu_is_set(vpmu, VPMU_CACHED) )
> - {
> - vpmu_set(vpmu, VPMU_CONTEXT_SAVE);
> - vpmu->arch_vpmu_ops->arch_vpmu_save(curr, 0);
> - vpmu_reset(vpmu, VPMU_CONTEXT_SAVE | VPMU_CONTEXT_LOADED);
> - }
> - return ret;
> - }
> -
> - return 0;
> -}
> -
> -int vpmu_do_rdmsr(unsigned int msr, uint64_t *msr_content)
> -{
> - struct vcpu *curr = current;
> - struct vpmu_struct *vpmu;
> + ops = vpmu->arch_vpmu_ops;
> + if ( !ops )
> + goto nop;
> +
> + if ( is_write && ops->do_wrmsr )
> + ret = ops->do_wrmsr(msr, *msr_content, supported);
> + else if ( !is_write && ops->do_rdmsr )
> + ret = ops->do_rdmsr(msr, msr_content);
> + else
> + goto nop;
>
> - if ( vpmu_mode == XENPMU_MODE_OFF )
> + /*
> + * We may have received a PMU interrupt while handling MSR access
> + * and since do_wr/rdmsr may load VPMU context we should save
> + * (and unload) it again.
> + */
> + if ( !is_hvm_vcpu(curr) && vpmu->xenpmu_data &&
> + vpmu_is_set(vpmu, VPMU_CACHED) )
> {
> - *msr_content = 0;
> - return 0;
> + vpmu_set(vpmu, VPMU_CONTEXT_SAVE);
> + ops->arch_vpmu_save(curr, 0);
> + vpmu_reset(vpmu, VPMU_CONTEXT_SAVE | VPMU_CONTEXT_LOADED);
> }
>
> - vpmu = vcpu_vpmu(curr);
> - if ( vpmu->arch_vpmu_ops && vpmu->arch_vpmu_ops->do_rdmsr )
> - {
> - int ret = vpmu->arch_vpmu_ops->do_rdmsr(msr, msr_content);
> + return ret;
>
> - if ( !is_hvm_vcpu(curr) && vpmu->xenpmu_data &&
> - vpmu_is_set(vpmu, VPMU_CACHED) )
> - {
> - vpmu_set(vpmu, VPMU_CONTEXT_SAVE);
> - vpmu->arch_vpmu_ops->arch_vpmu_save(curr, 0);
> - vpmu_reset(vpmu, VPMU_CONTEXT_SAVE | VPMU_CONTEXT_LOADED);
> - }
> - return ret;
> - }
> - else
> + nop:
> + if ( !is_write )
> *msr_content = 0;
>
> return 0;
> diff --git a/xen/include/asm-x86/hvm/vpmu.h b/xen/include/asm-x86/hvm/vpmu.h
> index f486d2f..212e496 100644
> --- a/xen/include/asm-x86/hvm/vpmu.h
> +++ b/xen/include/asm-x86/hvm/vpmu.h
> @@ -101,8 +101,8 @@ static inline bool_t vpmu_are_all_set(const struct vpmu_struct
> *vpmu,
> }
>
> void vpmu_lvtpc_update(uint32_t val);
> -int vpmu_do_wrmsr(unsigned int msr, uint64_t msr_content, uint64_t supported);
> -int vpmu_do_rdmsr(unsigned int msr, uint64_t *msr_content);
> +int vpmu_do_msr(unsigned int msr, uint64_t *msr_content,
> + uint64_t supported, bool_t is_write);
> void vpmu_do_interrupt(struct cpu_user_regs *regs);
> void vpmu_do_cpuid(unsigned int input, unsigned int *eax, unsigned int *ebx,
> unsigned int *ecx, unsigned int *edx);
> @@ -112,6 +112,16 @@ void vpmu_save(struct vcpu *v);
> int vpmu_load(struct vcpu *v, bool_t from_guest);
> void vpmu_dump(struct vcpu *v);
>
> +static inline int vpmu_do_wrmsr(unsigned int msr, uint64_t msr_content,
> + uint64_t supported)
> +{
> + return vpmu_do_msr(msr, &msr_content, supported, 1);
> +}
> +static inline int vpmu_do_rdmsr(unsigned int msr, uint64_t *msr_content)
> +{
> + return vpmu_do_msr(msr, msr_content, 0, 0);
> +}
> +
> extern int acquire_pmu_ownership(int pmu_ownership);
> extern void release_pmu_ownership(int pmu_ownership);
>
> --
> 1.8.1.4
^ permalink raw reply [flat|nested] 37+ messages in thread
* [PATCH v25 14/15] x86/VPMU: Add privileged PMU mode
2015-06-19 18:44 [PATCH v25 00/15] x86/PMU: Xen PMU PV(H) support Boris Ostrovsky
` (12 preceding siblings ...)
2015-06-19 18:44 ` [PATCH v25 13/15] x86/VPMU: Merge vpmu_rdmsr and vpmu_wrmsr Boris Ostrovsky
@ 2015-06-19 18:44 ` Boris Ostrovsky
2015-07-09 6:13 ` Tian, Kevin
2015-07-09 12:38 ` Dietmar Hahn
2015-06-19 18:44 ` [PATCH v25 15/15] x86/VPMU: Move VPMU files up from hvm/ directory Boris Ostrovsky
` (2 subsequent siblings)
16 siblings, 2 replies; 37+ messages in thread
From: Boris Ostrovsky @ 2015-06-19 18:44 UTC (permalink / raw)
To: JBeulich, kevin.tian, suravee.suthikulpanit,
Aravind.Gopalakrishnan, dietmar.hahn, dgdegra, andrew.cooper3
Cc: boris.ostrovsky, xen-devel
Add support for privileged PMU mode (XENPMU_MODE_ALL) which allows the
privileged domain (dom0) to profile both itself (and the hypervisor) and the
guests. While this mode is on, profiling in guests is disabled.
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
---
xen/arch/x86/hvm/vpmu.c | 40 +++++++++++++++++++++++++++++-----------
xen/arch/x86/traps.c | 13 +++++++++++++
xen/include/public/pmu.h | 3 +++
3 files changed, 45 insertions(+), 11 deletions(-)
diff --git a/xen/arch/x86/hvm/vpmu.c b/xen/arch/x86/hvm/vpmu.c
index 9d6ca93..3ad0b94 100644
--- a/xen/arch/x86/hvm/vpmu.c
+++ b/xen/arch/x86/hvm/vpmu.c
@@ -108,8 +108,10 @@ int vpmu_do_msr(unsigned int msr, uint64_t *msr_content,
const struct arch_vpmu_ops *ops;
int ret = 0;
- if ( likely(vpmu_mode == XENPMU_MODE_OFF) )
- goto nop;
+ if ( likely(vpmu_mode == XENPMU_MODE_OFF) ||
+ ((vpmu_mode & XENPMU_MODE_ALL) &&
+ !is_hardware_domain(current->domain)) )
+ goto nop;
vpmu = vcpu_vpmu(curr);
ops = vpmu->arch_vpmu_ops;
@@ -164,8 +166,12 @@ void vpmu_do_interrupt(struct cpu_user_regs *regs)
struct vlapic *vlapic;
u32 vlapic_lvtpc;
- /* dom0 will handle interrupt for special domains (e.g. idle domain) */
- if ( sampled->domain->domain_id >= DOMID_FIRST_RESERVED )
+ /*
+ * dom0 will handle interrupt for special domains (e.g. idle domain) or,
+ * in XENPMU_MODE_ALL, for everyone.
+ */
+ if ( (vpmu_mode & XENPMU_MODE_ALL) ||
+ (sampled->domain->domain_id >= DOMID_FIRST_RESERVED) )
{
sampling = choose_hwdom_vcpu();
if ( !sampling )
@@ -179,16 +185,17 @@ void vpmu_do_interrupt(struct cpu_user_regs *regs)
return;
/* PV(H) guest */
- if ( !is_hvm_vcpu(sampling) )
+ if ( !is_hvm_vcpu(sampling) || (vpmu_mode & XENPMU_MODE_ALL) )
{
const struct cpu_user_regs *cur_regs;
uint64_t *flags = &vpmu->xenpmu_data->pmu.pmu_flags;
- domid_t domid = DOMID_SELF;
+ domid_t domid;
if ( !vpmu->xenpmu_data )
return;
if ( is_pvh_vcpu(sampling) &&
+ !(vpmu_mode & XENPMU_MODE_ALL) &&
!vpmu->arch_vpmu_ops->do_interrupt(regs) )
return;
@@ -205,6 +212,11 @@ void vpmu_do_interrupt(struct cpu_user_regs *regs)
else
*flags = PMU_SAMPLE_PV;
+ if ( sampled == sampling )
+ domid = DOMID_SELF;
+ else
+ domid = sampled->domain->domain_id;
+
/* Store appropriate registers in xenpmu_data */
/* FIXME: 32-bit PVH should go here as well */
if ( is_pv_32bit_vcpu(sampling) )
@@ -233,7 +245,8 @@ void vpmu_do_interrupt(struct cpu_user_regs *regs)
if ( (vpmu_mode & XENPMU_MODE_SELF) )
cur_regs = guest_cpu_user_regs();
- else if ( !guest_mode(regs) && is_hardware_domain(sampling->domain) )
+ else if ( !guest_mode(regs) &&
+ is_hardware_domain(sampling->domain) )
{
cur_regs = regs;
domid = DOMID_XEN;
@@ -472,7 +485,9 @@ void vpmu_initialise(struct vcpu *v)
printk(XENLOG_G_WARNING "VPMU: Initialization failed for %pv\n", v);
/* Intel needs to initialize VPMU ops even if VPMU is not in use */
- if ( !is_priv_vpmu && (ret || (vpmu_mode == XENPMU_MODE_OFF)) )
+ if ( !is_priv_vpmu &&
+ (ret || (vpmu_mode == XENPMU_MODE_OFF) ||
+ (vpmu_mode == XENPMU_MODE_ALL)) )
{
spin_lock(&vpmu_lock);
vpmu_count--;
@@ -525,7 +540,8 @@ static int pvpmu_init(struct domain *d, xen_pmu_params_t *params)
struct page_info *page;
uint64_t gfn = params->val;
- if ( vpmu_mode == XENPMU_MODE_OFF )
+ if ( (vpmu_mode == XENPMU_MODE_OFF) ||
+ ((vpmu_mode & XENPMU_MODE_ALL) && !is_hardware_domain(d)) )
return -EINVAL;
if ( (params->vcpu >= d->max_vcpus) || (d->vcpu[params->vcpu] == NULL) )
@@ -645,12 +661,14 @@ long do_xenpmu_op(unsigned int op, XEN_GUEST_HANDLE_PARAM(xen_pmu_params_t) arg)
{
case XENPMU_mode_set:
{
- if ( (pmu_params.val & ~(XENPMU_MODE_SELF | XENPMU_MODE_HV)) ||
+ if ( (pmu_params.val &
+ ~(XENPMU_MODE_SELF | XENPMU_MODE_HV | XENPMU_MODE_ALL)) ||
(hweight64(pmu_params.val) > 1) )
return -EINVAL;
/* 32-bit dom0 can only sample itself. */
- if ( is_pv_32bit_vcpu(current) && (pmu_params.val & XENPMU_MODE_HV) )
+ if ( is_pv_32bit_vcpu(current) &&
+ (pmu_params.val & (XENPMU_MODE_HV | XENPMU_MODE_ALL)) )
return -EINVAL;
spin_lock(&vpmu_lock);
diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
index d2ff1a9..ac5622f 100644
--- a/xen/arch/x86/traps.c
+++ b/xen/arch/x86/traps.c
@@ -2654,6 +2654,10 @@ static int emulate_privileged_op(struct cpu_user_regs *regs)
case MSR_AMD_FAM15H_EVNTSEL0...MSR_AMD_FAM15H_PERFCTR5:
if ( vpmu_msr || (boot_cpu_data.x86_vendor == X86_VENDOR_AMD) )
{
+ if ( (vpmu_mode & XENPMU_MODE_ALL) &&
+ !is_hardware_domain(v->domain) )
+ break;
+
if ( vpmu_do_wrmsr(regs->ecx, msr_content, 0) )
goto fail;
}
@@ -2777,6 +2781,15 @@ static int emulate_privileged_op(struct cpu_user_regs *regs)
case MSR_AMD_FAM15H_EVNTSEL0...MSR_AMD_FAM15H_PERFCTR5:
if ( vpmu_msr || (boot_cpu_data.x86_vendor == X86_VENDOR_AMD) )
{
+
+ if ( (vpmu_mode & XENPMU_MODE_ALL) &&
+ !is_hardware_domain(v->domain) )
+ {
+ /* Don't leak PMU MSRs to unprivileged domains */
+ regs->eax = regs->edx = 0;
+ break;
+ }
+
if ( vpmu_do_rdmsr(regs->ecx, &val) )
goto fail;
diff --git a/xen/include/public/pmu.h b/xen/include/public/pmu.h
index 7a45783..1149678 100644
--- a/xen/include/public/pmu.h
+++ b/xen/include/public/pmu.h
@@ -52,10 +52,13 @@ DEFINE_XEN_GUEST_HANDLE(xen_pmu_params_t);
* - XENPMU_MODE_SELF: Guests can profile themselves
* - XENPMU_MODE_HV: Guests can profile themselves, dom0 profiles
* itself and Xen
+ * - XENPMU_MODE_ALL: Only dom0 has access to VPMU and it profiles
+ * everyone: itself, the hypervisor and the guests.
*/
#define XENPMU_MODE_OFF 0
#define XENPMU_MODE_SELF (1<<0)
#define XENPMU_MODE_HV (1<<1)
+#define XENPMU_MODE_ALL (1<<2)
/*
* PMU features:
--
1.8.1.4
^ permalink raw reply related [flat|nested] 37+ messages in thread
* Re: [PATCH v25 14/15] x86/VPMU: Add privileged PMU mode
2015-06-19 18:44 ` [PATCH v25 14/15] x86/VPMU: Add privileged PMU mode Boris Ostrovsky
@ 2015-07-09 6:13 ` Tian, Kevin
2015-07-09 12:38 ` Dietmar Hahn
1 sibling, 0 replies; 37+ messages in thread
From: Tian, Kevin @ 2015-07-09 6:13 UTC (permalink / raw)
To: Boris Ostrovsky, JBeulich, suravee.suthikulpanit,
Aravind.Gopalakrishnan, dietmar.hahn, dgdegra, andrew.cooper3
Cc: xen-devel
> From: Boris Ostrovsky [mailto:boris.ostrovsky@oracle.com]
> Sent: Saturday, June 20, 2015 2:45 AM
>
> Add support for privileged PMU mode (XENPMU_MODE_ALL) which allows the privileged
> domain (dom0) to profile both itself (and the hypervisor) and the guests. While
> this mode is on, profiling in guests is disabled.
>
> Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
> Acked-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
> ---
> xen/arch/x86/hvm/vpmu.c | 40
> +++++++++++++++++++++++++++++-----------
> xen/arch/x86/traps.c | 13 +++++++++++++
> xen/include/public/pmu.h | 3 +++
> 3 files changed, 45 insertions(+), 11 deletions(-)
>
> diff --git a/xen/arch/x86/hvm/vpmu.c b/xen/arch/x86/hvm/vpmu.c
> index 9d6ca93..3ad0b94 100644
> --- a/xen/arch/x86/hvm/vpmu.c
> +++ b/xen/arch/x86/hvm/vpmu.c
> @@ -108,8 +108,10 @@ int vpmu_do_msr(unsigned int msr, uint64_t *msr_content,
> const struct arch_vpmu_ops *ops;
> int ret = 0;
>
> - if ( likely(vpmu_mode == XENPMU_MODE_OFF) )
> - goto nop;
> + if ( likely(vpmu_mode == XENPMU_MODE_OFF) ||
> + ((vpmu_mode & XENPMU_MODE_ALL) &&
> + !is_hardware_domain(current->domain)) )
> + goto nop;
>
> vpmu = vcpu_vpmu(curr);
> ops = vpmu->arch_vpmu_ops;
> @@ -164,8 +166,12 @@ void vpmu_do_interrupt(struct cpu_user_regs *regs)
> struct vlapic *vlapic;
> u32 vlapic_lvtpc;
>
> - /* dom0 will handle interrupt for special domains (e.g. idle domain) */
> - if ( sampled->domain->domain_id >= DOMID_FIRST_RESERVED )
> + /*
> + * dom0 will handle interrupt for special domains (e.g. idle domain) or,
> + * in XENPMU_MODE_ALL, for everyone.
> + */
> + if ( (vpmu_mode & XENPMU_MODE_ALL) ||
> + (sampled->domain->domain_id >= DOMID_FIRST_RESERVED) )
> {
> sampling = choose_hwdom_vcpu();
> if ( !sampling )
> @@ -179,16 +185,17 @@ void vpmu_do_interrupt(struct cpu_user_regs *regs)
> return;
>
> /* PV(H) guest */
> - if ( !is_hvm_vcpu(sampling) )
> + if ( !is_hvm_vcpu(sampling) || (vpmu_mode & XENPMU_MODE_ALL) )
> {
> const struct cpu_user_regs *cur_regs;
> uint64_t *flags = &vpmu->xenpmu_data->pmu.pmu_flags;
> - domid_t domid = DOMID_SELF;
> + domid_t domid;
>
> if ( !vpmu->xenpmu_data )
> return;
>
> if ( is_pvh_vcpu(sampling) &&
> + !(vpmu_mode & XENPMU_MODE_ALL) &&
> !vpmu->arch_vpmu_ops->do_interrupt(regs) )
> return;
>
> @@ -205,6 +212,11 @@ void vpmu_do_interrupt(struct cpu_user_regs *regs)
> else
> *flags = PMU_SAMPLE_PV;
>
> + if ( sampled == sampling )
> + domid = DOMID_SELF;
> + else
> + domid = sampled->domain->domain_id;
> +
> /* Store appropriate registers in xenpmu_data */
> /* FIXME: 32-bit PVH should go here as well */
> if ( is_pv_32bit_vcpu(sampling) )
> @@ -233,7 +245,8 @@ void vpmu_do_interrupt(struct cpu_user_regs *regs)
>
> if ( (vpmu_mode & XENPMU_MODE_SELF) )
> cur_regs = guest_cpu_user_regs();
> - else if ( !guest_mode(regs) && is_hardware_domain(sampling->domain) )
> + else if ( !guest_mode(regs) &&
> + is_hardware_domain(sampling->domain) )
> {
> cur_regs = regs;
> domid = DOMID_XEN;
> @@ -472,7 +485,9 @@ void vpmu_initialise(struct vcpu *v)
> printk(XENLOG_G_WARNING "VPMU: Initialization failed for %pv\n", v);
>
> /* Intel needs to initialize VPMU ops even if VPMU is not in use */
> - if ( !is_priv_vpmu && (ret || (vpmu_mode == XENPMU_MODE_OFF)) )
> + if ( !is_priv_vpmu &&
> + (ret || (vpmu_mode == XENPMU_MODE_OFF) ||
> + (vpmu_mode == XENPMU_MODE_ALL)) )
> {
> spin_lock(&vpmu_lock);
> vpmu_count--;
> @@ -525,7 +540,8 @@ static int pvpmu_init(struct domain *d, xen_pmu_params_t
> *params)
> struct page_info *page;
> uint64_t gfn = params->val;
>
> - if ( vpmu_mode == XENPMU_MODE_OFF )
> + if ( (vpmu_mode == XENPMU_MODE_OFF) ||
> + ((vpmu_mode & XENPMU_MODE_ALL) && !is_hardware_domain(d)) )
> return -EINVAL;
>
> if ( (params->vcpu >= d->max_vcpus) || (d->vcpu[params->vcpu] == NULL) )
> @@ -645,12 +661,14 @@ long do_xenpmu_op(unsigned int op,
> XEN_GUEST_HANDLE_PARAM(xen_pmu_params_t) arg)
> {
> case XENPMU_mode_set:
> {
> - if ( (pmu_params.val & ~(XENPMU_MODE_SELF | XENPMU_MODE_HV)) ||
> + if ( (pmu_params.val &
> + ~(XENPMU_MODE_SELF | XENPMU_MODE_HV | XENPMU_MODE_ALL)) ||
> (hweight64(pmu_params.val) > 1) )
> return -EINVAL;
>
> /* 32-bit dom0 can only sample itself. */
> - if ( is_pv_32bit_vcpu(current) && (pmu_params.val & XENPMU_MODE_HV) )
> + if ( is_pv_32bit_vcpu(current) &&
> + (pmu_params.val & (XENPMU_MODE_HV | XENPMU_MODE_ALL)) )
> return -EINVAL;
>
> spin_lock(&vpmu_lock);
> diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
> index d2ff1a9..ac5622f 100644
> --- a/xen/arch/x86/traps.c
> +++ b/xen/arch/x86/traps.c
> @@ -2654,6 +2654,10 @@ static int emulate_privileged_op(struct cpu_user_regs *regs)
> case MSR_AMD_FAM15H_EVNTSEL0...MSR_AMD_FAM15H_PERFCTR5:
> if ( vpmu_msr || (boot_cpu_data.x86_vendor == X86_VENDOR_AMD) )
> {
> + if ( (vpmu_mode & XENPMU_MODE_ALL) &&
> + !is_hardware_domain(v->domain) )
> + break;
> +
> if ( vpmu_do_wrmsr(regs->ecx, msr_content, 0) )
> goto fail;
> }
> @@ -2777,6 +2781,15 @@ static int emulate_privileged_op(struct cpu_user_regs *regs)
> case MSR_AMD_FAM15H_EVNTSEL0...MSR_AMD_FAM15H_PERFCTR5:
> if ( vpmu_msr || (boot_cpu_data.x86_vendor == X86_VENDOR_AMD) )
> {
> +
> + if ( (vpmu_mode & XENPMU_MODE_ALL) &&
> + !is_hardware_domain(v->domain) )
> + {
> + /* Don't leak PMU MSRs to unprivileged domains */
> + regs->eax = regs->edx = 0;
> + break;
> + }
> +
> if ( vpmu_do_rdmsr(regs->ecx, &val) )
> goto fail;
>
> diff --git a/xen/include/public/pmu.h b/xen/include/public/pmu.h
> index 7a45783..1149678 100644
> --- a/xen/include/public/pmu.h
> +++ b/xen/include/public/pmu.h
> @@ -52,10 +52,13 @@ DEFINE_XEN_GUEST_HANDLE(xen_pmu_params_t);
> * - XENPMU_MODE_SELF: Guests can profile themselves
> * - XENPMU_MODE_HV: Guests can profile themselves, dom0 profiles
> * itself and Xen
> + * - XENPMU_MODE_ALL: Only dom0 has access to VPMU and it profiles
> + * everyone: itself, the hypervisor and the guests.
> */
> #define XENPMU_MODE_OFF 0
> #define XENPMU_MODE_SELF (1<<0)
> #define XENPMU_MODE_HV (1<<1)
> +#define XENPMU_MODE_ALL (1<<2)
>
> /*
> * PMU features:
> --
> 1.8.1.4
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH v25 14/15] x86/VPMU: Add privileged PMU mode
2015-06-19 18:44 ` [PATCH v25 14/15] x86/VPMU: Add privileged PMU mode Boris Ostrovsky
2015-07-09 6:13 ` Tian, Kevin
@ 2015-07-09 12:38 ` Dietmar Hahn
1 sibling, 0 replies; 37+ messages in thread
From: Dietmar Hahn @ 2015-07-09 12:38 UTC (permalink / raw)
To: Boris Ostrovsky
Cc: kevin.tian, JBeulich, andrew.cooper3, xen-devel,
Aravind.Gopalakrishnan, suravee.suthikulpanit, dgdegra
Am Freitag 19 Juni 2015, 14:44:45 schrieb Boris Ostrovsky:
> Add support for privileged PMU mode (XENPMU_MODE_ALL) which allows the privileged
> domain (dom0) to profile both itself (and the hypervisor) and the guests. While
> this mode is on, profiling in guests is disabled.
>
> Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
> Acked-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Dietmar Hahn <dietmar.hahn@ts.fujitsu.com>
> ---
> xen/arch/x86/hvm/vpmu.c | 40 +++++++++++++++++++++++++++++-----------
> xen/arch/x86/traps.c | 13 +++++++++++++
> xen/include/public/pmu.h | 3 +++
> 3 files changed, 45 insertions(+), 11 deletions(-)
>
> diff --git a/xen/arch/x86/hvm/vpmu.c b/xen/arch/x86/hvm/vpmu.c
> index 9d6ca93..3ad0b94 100644
> --- a/xen/arch/x86/hvm/vpmu.c
> +++ b/xen/arch/x86/hvm/vpmu.c
> @@ -108,8 +108,10 @@ int vpmu_do_msr(unsigned int msr, uint64_t *msr_content,
> const struct arch_vpmu_ops *ops;
> int ret = 0;
>
> - if ( likely(vpmu_mode == XENPMU_MODE_OFF) )
> - goto nop;
> + if ( likely(vpmu_mode == XENPMU_MODE_OFF) ||
> + ((vpmu_mode & XENPMU_MODE_ALL) &&
> + !is_hardware_domain(current->domain)) )
> + goto nop;
>
> vpmu = vcpu_vpmu(curr);
> ops = vpmu->arch_vpmu_ops;
> @@ -164,8 +166,12 @@ void vpmu_do_interrupt(struct cpu_user_regs *regs)
> struct vlapic *vlapic;
> u32 vlapic_lvtpc;
>
> - /* dom0 will handle interrupt for special domains (e.g. idle domain) */
> - if ( sampled->domain->domain_id >= DOMID_FIRST_RESERVED )
> + /*
> + * dom0 will handle interrupt for special domains (e.g. idle domain) or,
> + * in XENPMU_MODE_ALL, for everyone.
> + */
> + if ( (vpmu_mode & XENPMU_MODE_ALL) ||
> + (sampled->domain->domain_id >= DOMID_FIRST_RESERVED) )
> {
> sampling = choose_hwdom_vcpu();
> if ( !sampling )
> @@ -179,16 +185,17 @@ void vpmu_do_interrupt(struct cpu_user_regs *regs)
> return;
>
> /* PV(H) guest */
> - if ( !is_hvm_vcpu(sampling) )
> + if ( !is_hvm_vcpu(sampling) || (vpmu_mode & XENPMU_MODE_ALL) )
> {
> const struct cpu_user_regs *cur_regs;
> uint64_t *flags = &vpmu->xenpmu_data->pmu.pmu_flags;
> - domid_t domid = DOMID_SELF;
> + domid_t domid;
>
> if ( !vpmu->xenpmu_data )
> return;
>
> if ( is_pvh_vcpu(sampling) &&
> + !(vpmu_mode & XENPMU_MODE_ALL) &&
> !vpmu->arch_vpmu_ops->do_interrupt(regs) )
> return;
>
> @@ -205,6 +212,11 @@ void vpmu_do_interrupt(struct cpu_user_regs *regs)
> else
> *flags = PMU_SAMPLE_PV;
>
> + if ( sampled == sampling )
> + domid = DOMID_SELF;
> + else
> + domid = sampled->domain->domain_id;
> +
> /* Store appropriate registers in xenpmu_data */
> /* FIXME: 32-bit PVH should go here as well */
> if ( is_pv_32bit_vcpu(sampling) )
> @@ -233,7 +245,8 @@ void vpmu_do_interrupt(struct cpu_user_regs *regs)
>
> if ( (vpmu_mode & XENPMU_MODE_SELF) )
> cur_regs = guest_cpu_user_regs();
> - else if ( !guest_mode(regs) && is_hardware_domain(sampling->domain) )
> + else if ( !guest_mode(regs) &&
> + is_hardware_domain(sampling->domain) )
> {
> cur_regs = regs;
> domid = DOMID_XEN;
> @@ -472,7 +485,9 @@ void vpmu_initialise(struct vcpu *v)
> printk(XENLOG_G_WARNING "VPMU: Initialization failed for %pv\n", v);
>
> /* Intel needs to initialize VPMU ops even if VPMU is not in use */
> - if ( !is_priv_vpmu && (ret || (vpmu_mode == XENPMU_MODE_OFF)) )
> + if ( !is_priv_vpmu &&
> + (ret || (vpmu_mode == XENPMU_MODE_OFF) ||
> + (vpmu_mode == XENPMU_MODE_ALL)) )
> {
> spin_lock(&vpmu_lock);
> vpmu_count--;
> @@ -525,7 +540,8 @@ static int pvpmu_init(struct domain *d, xen_pmu_params_t *params)
> struct page_info *page;
> uint64_t gfn = params->val;
>
> - if ( vpmu_mode == XENPMU_MODE_OFF )
> + if ( (vpmu_mode == XENPMU_MODE_OFF) ||
> + ((vpmu_mode & XENPMU_MODE_ALL) && !is_hardware_domain(d)) )
> return -EINVAL;
>
> if ( (params->vcpu >= d->max_vcpus) || (d->vcpu[params->vcpu] == NULL) )
> @@ -645,12 +661,14 @@ long do_xenpmu_op(unsigned int op, XEN_GUEST_HANDLE_PARAM(xen_pmu_params_t) arg)
> {
> case XENPMU_mode_set:
> {
> - if ( (pmu_params.val & ~(XENPMU_MODE_SELF | XENPMU_MODE_HV)) ||
> + if ( (pmu_params.val &
> + ~(XENPMU_MODE_SELF | XENPMU_MODE_HV | XENPMU_MODE_ALL)) ||
> (hweight64(pmu_params.val) > 1) )
> return -EINVAL;
>
> /* 32-bit dom0 can only sample itself. */
> - if ( is_pv_32bit_vcpu(current) && (pmu_params.val & XENPMU_MODE_HV) )
> + if ( is_pv_32bit_vcpu(current) &&
> + (pmu_params.val & (XENPMU_MODE_HV | XENPMU_MODE_ALL)) )
> return -EINVAL;
>
> spin_lock(&vpmu_lock);
> diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
> index d2ff1a9..ac5622f 100644
> --- a/xen/arch/x86/traps.c
> +++ b/xen/arch/x86/traps.c
> @@ -2654,6 +2654,10 @@ static int emulate_privileged_op(struct cpu_user_regs *regs)
> case MSR_AMD_FAM15H_EVNTSEL0...MSR_AMD_FAM15H_PERFCTR5:
> if ( vpmu_msr || (boot_cpu_data.x86_vendor == X86_VENDOR_AMD) )
> {
> + if ( (vpmu_mode & XENPMU_MODE_ALL) &&
> + !is_hardware_domain(v->domain) )
> + break;
> +
> if ( vpmu_do_wrmsr(regs->ecx, msr_content, 0) )
> goto fail;
> }
> @@ -2777,6 +2781,15 @@ static int emulate_privileged_op(struct cpu_user_regs *regs)
> case MSR_AMD_FAM15H_EVNTSEL0...MSR_AMD_FAM15H_PERFCTR5:
> if ( vpmu_msr || (boot_cpu_data.x86_vendor == X86_VENDOR_AMD) )
> {
> +
> + if ( (vpmu_mode & XENPMU_MODE_ALL) &&
> + !is_hardware_domain(v->domain) )
> + {
> + /* Don't leak PMU MSRs to unprivileged domains */
> + regs->eax = regs->edx = 0;
> + break;
> + }
> +
> if ( vpmu_do_rdmsr(regs->ecx, &val) )
> goto fail;
>
> diff --git a/xen/include/public/pmu.h b/xen/include/public/pmu.h
> index 7a45783..1149678 100644
> --- a/xen/include/public/pmu.h
> +++ b/xen/include/public/pmu.h
> @@ -52,10 +52,13 @@ DEFINE_XEN_GUEST_HANDLE(xen_pmu_params_t);
> * - XENPMU_MODE_SELF: Guests can profile themselves
> * - XENPMU_MODE_HV: Guests can profile themselves, dom0 profiles
> * itself and Xen
> + * - XENPMU_MODE_ALL: Only dom0 has access to VPMU and it profiles
> + * everyone: itself, the hypervisor and the guests.
> */
> #define XENPMU_MODE_OFF 0
> #define XENPMU_MODE_SELF (1<<0)
> #define XENPMU_MODE_HV (1<<1)
> +#define XENPMU_MODE_ALL (1<<2)
>
> /*
> * PMU features:
>
--
Company details: http://ts.fujitsu.com/imprint.html
^ permalink raw reply [flat|nested] 37+ messages in thread
* [PATCH v25 15/15] x86/VPMU: Move VPMU files up from hvm/ directory
2015-06-19 18:44 [PATCH v25 00/15] x86/PMU: Xen PMU PV(H) support Boris Ostrovsky
` (13 preceding siblings ...)
2015-06-19 18:44 ` [PATCH v25 14/15] x86/VPMU: Add privileged PMU mode Boris Ostrovsky
@ 2015-06-19 18:44 ` Boris Ostrovsky
2015-07-08 8:03 ` [PATCH v25 00/15] x86/PMU: Xen PMU PV(H) support Jan Beulich
2015-07-10 9:13 ` Dietmar Hahn
16 siblings, 0 replies; 37+ messages in thread
From: Boris Ostrovsky @ 2015-06-19 18:44 UTC (permalink / raw)
To: JBeulich, kevin.tian, suravee.suthikulpanit,
Aravind.Gopalakrishnan, dietmar.hahn, dgdegra, andrew.cooper3
Cc: boris.ostrovsky, xen-devel
Since the PMU is now not HVM-specific, we can move VPMU-related files up from
the arch/x86/hvm/ directory.
Specifically:
arch/x86/hvm/vpmu.c -> arch/x86/cpu/vpmu.c
arch/x86/hvm/svm/vpmu.c -> arch/x86/cpu/vpmu_amd.c
arch/x86/hvm/vmx/vpmu_core2.c -> arch/x86/cpu/vpmu_intel.c
include/asm-x86/hvm/vpmu.h -> include/asm-x86/vpmu.h
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Dietmar Hahn <dietmar.hahn@ts.fujitsu.com>
Tested-by: Dietmar Hahn <dietmar.hahn@ts.fujitsu.com>
Acked-by: Kevin Tian <kevin.tian@intel.com>
---
xen/arch/x86/cpu/Makefile | 1 +
xen/arch/x86/{hvm => cpu}/vpmu.c | 2 +-
xen/arch/x86/{hvm/svm/vpmu.c => cpu/vpmu_amd.c} | 2 +-
xen/arch/x86/{hvm/vmx/vpmu_core2.c => cpu/vpmu_intel.c} | 2 +-
xen/arch/x86/hvm/Makefile | 1 -
xen/arch/x86/hvm/svm/Makefile | 1 -
xen/arch/x86/hvm/vlapic.c | 2 +-
xen/arch/x86/hvm/vmx/Makefile | 1 -
xen/arch/x86/oprofile/op_model_ppro.c | 2 +-
xen/arch/x86/traps.c | 2 +-
xen/include/asm-x86/hvm/vmx/vmcs.h | 2 +-
xen/include/asm-x86/{hvm => }/vpmu.h | 0
12 files changed, 8 insertions(+), 10 deletions(-)
rename xen/arch/x86/{hvm => cpu}/vpmu.c (99%)
rename xen/arch/x86/{hvm/svm/vpmu.c => cpu/vpmu_amd.c} (99%)
rename xen/arch/x86/{hvm/vmx/vpmu_core2.c => cpu/vpmu_intel.c} (99%)
rename xen/include/asm-x86/{hvm => }/vpmu.h (100%)
diff --git a/xen/arch/x86/cpu/Makefile b/xen/arch/x86/cpu/Makefile
index d73d93a..74f23ae 100644
--- a/xen/arch/x86/cpu/Makefile
+++ b/xen/arch/x86/cpu/Makefile
@@ -7,3 +7,4 @@ obj-y += common.o
obj-y += intel.o
obj-y += intel_cacheinfo.o
obj-y += mwait-idle.o
+obj-y += vpmu.o vpmu_amd.o vpmu_intel.o
diff --git a/xen/arch/x86/hvm/vpmu.c b/xen/arch/x86/cpu/vpmu.c
similarity index 99%
rename from xen/arch/x86/hvm/vpmu.c
rename to xen/arch/x86/cpu/vpmu.c
index 3ad0b94..802e6a3 100644
--- a/xen/arch/x86/hvm/vpmu.c
+++ b/xen/arch/x86/cpu/vpmu.c
@@ -28,10 +28,10 @@
#include <asm/msr.h>
#include <asm/nmi.h>
#include <asm/p2m.h>
+#include <asm/vpmu.h>
#include <asm/hvm/support.h>
#include <asm/hvm/vmx/vmx.h>
#include <asm/hvm/vmx/vmcs.h>
-#include <asm/hvm/vpmu.h>
#include <asm/hvm/svm/svm.h>
#include <asm/hvm/svm/vmcb.h>
#include <asm/apic.h>
diff --git a/xen/arch/x86/hvm/svm/vpmu.c b/xen/arch/x86/cpu/vpmu_amd.c
similarity index 99%
rename from xen/arch/x86/hvm/svm/vpmu.c
rename to xen/arch/x86/cpu/vpmu_amd.c
index 17272cb..e5195cf 100644
--- a/xen/arch/x86/hvm/svm/vpmu.c
+++ b/xen/arch/x86/cpu/vpmu_amd.c
@@ -28,8 +28,8 @@
#include <xen/sched.h>
#include <xen/irq.h>
#include <asm/apic.h>
+#include <asm/vpmu.h>
#include <asm/hvm/vlapic.h>
-#include <asm/hvm/vpmu.h>
#include <public/pmu.h>
#define MSR_F10H_EVNTSEL_GO_SHIFT 40
diff --git a/xen/arch/x86/hvm/vmx/vpmu_core2.c b/xen/arch/x86/cpu/vpmu_intel.c
similarity index 99%
rename from xen/arch/x86/hvm/vmx/vpmu_core2.c
rename to xen/arch/x86/cpu/vpmu_intel.c
index 1206e90..423ca82 100644
--- a/xen/arch/x86/hvm/vmx/vpmu_core2.c
+++ b/xen/arch/x86/cpu/vpmu_intel.c
@@ -30,6 +30,7 @@
#include <asm/traps.h>
#include <asm/msr.h>
#include <asm/msr-index.h>
+#include <asm/vpmu.h>
#include <asm/hvm/support.h>
#include <asm/hvm/vlapic.h>
#include <asm/hvm/vmx/vmx.h>
@@ -37,7 +38,6 @@
#include <public/sched.h>
#include <public/hvm/save.h>
#include <public/pmu.h>
-#include <asm/hvm/vpmu.h>
/*
* See Intel SDM Vol 2a Instruction Set Reference chapter 3 for CPUID
diff --git a/xen/arch/x86/hvm/Makefile b/xen/arch/x86/hvm/Makefile
index 69af47f..794e793 100644
--- a/xen/arch/x86/hvm/Makefile
+++ b/xen/arch/x86/hvm/Makefile
@@ -23,4 +23,3 @@ obj-y += vlapic.o
obj-y += vmsi.o
obj-y += vpic.o
obj-y += vpt.o
-obj-y += vpmu.o
diff --git a/xen/arch/x86/hvm/svm/Makefile b/xen/arch/x86/hvm/svm/Makefile
index a10a55e..760d295 100644
--- a/xen/arch/x86/hvm/svm/Makefile
+++ b/xen/arch/x86/hvm/svm/Makefile
@@ -6,4 +6,3 @@ obj-y += nestedsvm.o
obj-y += svm.o
obj-y += svmdebug.o
obj-y += vmcb.o
-obj-y += vpmu.o
diff --git a/xen/arch/x86/hvm/vlapic.c b/xen/arch/x86/hvm/vlapic.c
index 92b0fa8..e03d670 100644
--- a/xen/arch/x86/hvm/vlapic.c
+++ b/xen/arch/x86/hvm/vlapic.c
@@ -33,12 +33,12 @@
#include <asm/page.h>
#include <asm/apic.h>
#include <asm/io_apic.h>
+#include <asm/vpmu.h>
#include <asm/hvm/hvm.h>
#include <asm/hvm/io.h>
#include <asm/hvm/support.h>
#include <asm/hvm/vmx/vmx.h>
#include <asm/hvm/nestedhvm.h>
-#include <asm/hvm/vpmu.h>
#include <public/hvm/ioreq.h>
#include <public/hvm/params.h>
diff --git a/xen/arch/x86/hvm/vmx/Makefile b/xen/arch/x86/hvm/vmx/Makefile
index 373b3d9..04a29ce 100644
--- a/xen/arch/x86/hvm/vmx/Makefile
+++ b/xen/arch/x86/hvm/vmx/Makefile
@@ -3,5 +3,4 @@ obj-y += intr.o
obj-y += realmode.o
obj-y += vmcs.o
obj-y += vmx.o
-obj-y += vpmu_core2.o
obj-y += vvmx.o
diff --git a/xen/arch/x86/oprofile/op_model_ppro.c b/xen/arch/x86/oprofile/op_model_ppro.c
index ca429a1..89649d0 100644
--- a/xen/arch/x86/oprofile/op_model_ppro.c
+++ b/xen/arch/x86/oprofile/op_model_ppro.c
@@ -19,7 +19,7 @@
#include <asm/processor.h>
#include <asm/regs.h>
#include <asm/current.h>
-#include <asm/hvm/vpmu.h>
+#include <asm/vpmu.h>
#include "op_x86_model.h"
#include "op_counter.h"
diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
index ac5622f..bfa6002 100644
--- a/xen/arch/x86/traps.c
+++ b/xen/arch/x86/traps.c
@@ -72,7 +72,7 @@
#include <asm/apic.h>
#include <asm/mc146818rtc.h>
#include <asm/hpet.h>
-#include <asm/hvm/vpmu.h>
+#include <asm/vpmu.h>
#include <public/arch-x86/cpuid.h>
#include <xsm/xsm.h>
diff --git a/xen/include/asm-x86/hvm/vmx/vmcs.h b/xen/include/asm-x86/hvm/vmx/vmcs.h
index 1104bda..3132644 100644
--- a/xen/include/asm-x86/hvm/vmx/vmcs.h
+++ b/xen/include/asm-x86/hvm/vmx/vmcs.h
@@ -19,8 +19,8 @@
#ifndef __ASM_X86_HVM_VMX_VMCS_H__
#define __ASM_X86_HVM_VMX_VMCS_H__
+#include <asm/vpmu.h>
#include <asm/hvm/io.h>
-#include <asm/hvm/vpmu.h>
#include <irq_vectors.h>
extern void vmcs_dump_vcpu(struct vcpu *v);
diff --git a/xen/include/asm-x86/hvm/vpmu.h b/xen/include/asm-x86/vpmu.h
similarity index 100%
rename from xen/include/asm-x86/hvm/vpmu.h
rename to xen/include/asm-x86/vpmu.h
--
1.8.1.4
^ permalink raw reply related [flat|nested] 37+ messages in thread
* Re: [PATCH v25 00/15] x86/PMU: Xen PMU PV(H) support
2015-06-19 18:44 [PATCH v25 00/15] x86/PMU: Xen PMU PV(H) support Boris Ostrovsky
` (14 preceding siblings ...)
2015-06-19 18:44 ` [PATCH v25 15/15] x86/VPMU: Move VPMU files up from hvm/ directory Boris Ostrovsky
@ 2015-07-08 8:03 ` Jan Beulich
2015-07-10 9:13 ` Dietmar Hahn
16 siblings, 0 replies; 37+ messages in thread
From: Jan Beulich @ 2015-07-08 8:03 UTC (permalink / raw)
To: Boris Ostrovsky
Cc: kevin.tian, andrew.cooper3, dietmar.hahn, xen-devel,
Aravind.Gopalakrishnan, suravee.suthikulpanit, dgdegra
>>> On 19.06.15 at 20:44, <boris.ostrovsky@oracle.com> wrote:
While making another scan through this series now that some more
reviews from Dietmar are trickling in, I notice:
> Boris Ostrovsky (15):
> common/symbols: Export hypervisor symbols to privileged guest
> x86/VPMU: Add public xenpmu.h
> x86/VPMU: Make vpmu not HVM-specific
> x86/VPMU: Interface for setting PMU mode and flags
still missing a VMX maintainer's ack
> x86/VPMU: Initialize VPMUs with __initcall
same here plus no review (albeit I wouldn't make the latter a
requirement)
> x86/VPMU: Initialize PMU for PV(H) guests
same regarding review state
> x86/VPMU: Save VPMU state for PV guests during context switch
> x86/VPMU: When handling MSR accesses, leave fault injection to callers
again same regarding review state
> x86/VPMU: Add support for PMU register handling on PV guests
> x86/VPMU: Use pre-computed masks when checking validity of MSRs
> VPMU/AMD: Check MSR values before writing to hardware
no review yet (and here I'd really like to have one)
> x86/VPMU: Handle PMU interrupts for PV(H) guests
same here
> x86/VPMU: Merge vpmu_rdmsr and vpmu_wrmsr
> x86/VPMU: Add privileged PMU mode
here a review would again be nice, but I'd again not make it a
requirement
> x86/VPMU: Move VPMU files up from hvm/ directory
Jan
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH v25 00/15] x86/PMU: Xen PMU PV(H) support
2015-06-19 18:44 [PATCH v25 00/15] x86/PMU: Xen PMU PV(H) support Boris Ostrovsky
` (15 preceding siblings ...)
2015-07-08 8:03 ` [PATCH v25 00/15] x86/PMU: Xen PMU PV(H) support Jan Beulich
@ 2015-07-10 9:13 ` Dietmar Hahn
16 siblings, 0 replies; 37+ messages in thread
From: Dietmar Hahn @ 2015-07-10 9:13 UTC (permalink / raw)
To: Boris Ostrovsky
Cc: kevin.tian, JBeulich, andrew.cooper3, xen-devel,
Aravind.Gopalakrishnan, suravee.suthikulpanit, dgdegra
Am Freitag 19 Juni 2015, 14:44:31 schrieb Boris Ostrovsky:
> Changes in v25:
> * Add extra check for consistency in patch 4
> * Replace few returns with 'ret = -E...' (patches 4 and 12)
> * Clarified commit message in patch 8
> * A couple of cosmetic changes in patch 12
> (I left AMD multi-counter problem unfixed since what I thought would
> be sufficient to fix it did not quite work and I don't want to do
> a partial fix)
I did some tests with HVM guests on Intel processors so for the vmx part:
Tested-by: Dietmar Hahn <dietmar.hahn@ts.fujitsu.com>
Many Thanks!
Dietmar.
--
Company details: http://ts.fujitsu.com/imprint.html
^ permalink raw reply [flat|nested] 37+ messages in thread