* [DESIGN] Feature Levelling improvements
From: Andrew Cooper @ 2015-06-16 10:50 UTC (permalink / raw)
  To: Xen-devel List
  Cc: Wei Liu, Ian Campbell, Tim Deegan, Ian Jackson,
	Marcos E. Matsunaga, Jan Beulich

All,

With migration v2 getting close to being done, I have had time to pick
back up with feature levelling improvements.  Presented here for review
is draft E.

A PDF version of the design is available here:

http://xenbits.xen.org/people/andrewcoop/feature-levelling/feature-levelling-E.pdf

Pandoc version as follows:

% VM CPU Feature Levelling Improvements
% Andrew Cooper <<andrew.cooper3@citrix.com>>
% Draft E

Introduction
============

Revision History
----------------

------------------------------------------------------------------------------
Version  Date         Changes
-------  -----------  --------------------------------------------------------
Draft A  07 Feb 2014  Initial draft

Draft B  13 Feb 2014  More detail for proposed new implementation

Draft C  17 Feb 2014  Even more details for proposed new implementation

Draft D  11 Jun 2014  More background, having had time to hack around and
                      experiment

Draft E  15 Jun 2015  More details for the proposed implementation.
------------------------------------------------------------------------------

Background
----------

_CPU feature masking_ is a term used to mean altering the visible feature-set
of a processor.  For single systems, this could be used to hide certain
features for which operating system support is buggy.

In the world of virtualisation, it is common to have non-identical hardware in
a cluster but still want to migrate a virtual machine safely.  On regular
hardware, the kernel can safely assume that the feature-set as detected on
boot will remain the same.  Live migration invalidates this assumption when
moving between two non-identical pieces of hardware.

To migrate virtual machines in this fashion, orchestration software must
ensure that the available feature set remains consistent anywhere the virtual
machine might end up.

The feature-set of a particular CPU can be obtained using the `CPUID`
instruction.  It was introduced as a forward compatible way of advertising new
features which were detectable at runtime.  Information available includes
processor branding, available features, topology information and cache
details.

The `CPUID` instruction is an unprivileged instruction, usable from user-mode
without interception from the kernel.  This makes it impossible to
paravirtualise using the standard trap-and-emulate method.

Purpose
-------

This project originally started to improve the way in which XenServer
performed heterogeneous pool levelling.  In the process of investigation, it
was discovered that the current implementations in Xen and libxc are in need
of improvement, particularly in relation to PV guests.

This document describes:

* What properties are needed from a VM point of view
* What hardware features are available to aid with levelling
* What abilities are exposed by Xen and libxc for levelling
* How XenServer currently does pool levelling (and why it is in need of
  improvements)

This document also proposes a new mechanism for VM feature levelling, taking
into account the information needed by orchestration software.


What a Virtual Machine cares about
==================================

On native hardware, a kernel, as well as certain userspace libraries, will use
the set of available features to tune themselves to run more efficiently.
Over a migrate, it is critical that features a VM is using do not disappear.
(In some cases it might be possible to trap-and-emulate missing features, but
this would incur an exceedingly high overhead and is not considered.)

When a VM is liable to migrate between hardware of differing feature-sets, it
is important to ensure that the VM is strictly only using the common subset of
features available on any potential destination.

This can be done either by hiding features outside of the common subset, or in
some cases specifically instructing the kernel not to use a feature which it
can see.
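
As a minimal sketch of what "the common subset" means in practice, the
following C helper intersects the feature words reported by several hosts.
The function name and the flat array layout are assumptions made purely for
illustration; nothing here is an existing Xen or libxc interface.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Compute the common feature subset of several hosts (hypothetical helper).
 *
 * hosts[h * nr_words + w] holds feature word w of host h.  out[w] receives
 * the bitwise AND across all hosts, i.e. only features present everywhere.
 * With nr_hosts == 0 every bit of out[] is left set.
 */
static void featureset_common(size_t nr_hosts, size_t nr_words,
                              const uint32_t *hosts, uint32_t *out)
{
    for (size_t w = 0; w < nr_words; w++) {
        uint32_t common = ~UINT32_C(0);

        for (size_t h = 0; h < nr_hosts; h++)
            common &= hosts[h * nr_words + w];

        out[w] = common;
    }
}
```

A VM restricted to the resulting words never sees a feature which is missing
on any host it could be migrated to.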

Hardware features to aid levelling
==================================

HVM
---

HVM guests (using `Intel VT-x` or `AMD SVM`) will unconditionally exit to Xen
on all `CPUID` instructions, allowing Xen full and complete control over all
leaves.

PV
--

The `CPUID` instruction is unprivileged, so executing it in a PV guest will
not trap, leaving Xen no direct ability to control the information returned.

Xen Forced Emulation Prefix
---------------------------

Xen-aware PV guest kernels and userspace can make use of the 'Forced Emulation
Prefix'

> `ud2a; .byte 'x'; .byte 'e'; .byte 'n'; cpuid`

which Xen recognises as a deliberate attempt to get the fully-controlled
`CPUID` information rather than the hardware-reported information.  This only
works with cooperative guests and guest userspace, so cannot be directly
relied upon.

Masking and Override MSRs
-------------------------

AMD CPUs from the `K8` onwards support _Feature Override_ MSRs, which specify
the raw value returned for all `CPUID` instructions querying a specific
feature bitmap.  These MSRs allow any result to be returned, including the
ability to advertise features which are not actually supported.

Intel CPUs between `Nehalem` and `SandyBridge` have differing numbers of
_Feature Mask_ MSRs, which are a simple AND-mask applied to all `CPUID`
instructions requesting specific feature bitmap sets.  The exact MSRs, and
which feature bitmap sets they affect, are hardware specific.  These MSRs
allow features to be hidden by clearing the appropriate bit in the mask, but
do not allow unsupported features to be advertised.
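
The difference between the two MSR styles can be modelled in a few lines of C.
This is only an illustrative model of the semantics described above; the
function names are hypothetical and no real MSR is touched.

```c
#include <stdint.h>

/* Intel-style Feature Mask: the MSR value is ANDed with the hardware bits,
 * so a feature can be hidden but never invented. */
static uint32_t cpuid_with_mask(uint32_t hw, uint32_t mask_msr)
{
    return hw & mask_msr;
}

/* AMD-style Feature Override: the MSR value is returned as-is, whatever
 * the hardware actually supports. */
static uint32_t cpuid_with_override(uint32_t hw, uint32_t override_msr)
{
    (void)hw; /* hardware bits play no part in the result */
    return override_msr;
}
```

For hardware bits `0x3` and an MSR value of `0x6`, the mask model yields `0x2`
while the override model yields `0x6`, advertising a bit the hardware lacks.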

CPUID Faulting
----------------

On newer Intel hardware, a feature known as _CPUID Faulting_ can allow Xen to
cause `CPUID` instructions executed in PV guests to trap, which allows Xen full
and complete control over all leaves (exactly like an HVM guest).  _CPUID
Faulting_ support is present in `IvyBridge` and newer CPUs, although not
architecturally guaranteed.


How Xen currently uses and exposes levelling support
====================================================

Libxc has a `CPUID` Policy API which can be set by the toolstack for a domain.
Libxc performs some information gathering, and uses the `DOMCTL_set_cpuid`
hypercall to specify what information should be returned by Xen when the
domain requests specific `CPUID` leaves.

The user of the libxc `CPUID` Policy API may specify, for any leaf whatsoever,
whether particular bits should be forced high, forced low, default (as chosen
by libxc), specifically the same as hardware, or specifically the same as
hardware and maintained consistently across migration.
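
The five per-bit choices can be sketched as a 32-character string per
register, one character per bit.  The character set used below (`1`, `0`,
`x`, `k`, `s`) and the most-significant-bit-first ordering reflect how libxc
configuration strings are commonly written, but should be treated as
assumptions here; the helper itself is hypothetical and not part of libxc.

```c
#include <stdint.h>
#include <string.h>

/*
 * Apply a per-bit policy string to one CPUID register (illustrative only).
 *
 *   '1'  force the bit high          '0'  force the bit low
 *   'x'  take libxc's default        'k'  take the host value
 *   's'  take the host value; a real implementation would also record it
 *        so that the same value is used after migration
 *
 * cfg[0] describes bit 31 and cfg[31] describes bit 0 (assumed ordering).
 * Returns 0 on success, -1 on a malformed string.
 */
static int apply_policy_string(const char *cfg, uint32_t host,
                               uint32_t dflt, uint32_t *out)
{
    uint32_t val = 0;

    if (strlen(cfg) != 32)
        return -1;

    for (int bit = 31; bit >= 0; bit--) {
        uint32_t b = UINT32_C(1) << bit;

        switch (cfg[31 - bit]) {
        case '1': val |= b;        break;
        case '0':                  break;
        case 'x': val |= dflt & b; break;
        case 'k':
        case 's': val |= host & b; break;
        default:  return -1;
        }
    }

    *out = val;
    return 0;
}
```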

The default `CPUID` Policy involves libxc trying to work out which features
should be set or cleared in the policy.  It does this with a mixture of native
`CPUID` instructions, some switch statements choosing to enable/disable
certain features and hypercalls querying certain Xen state.

When Xen is servicing a `CPUID` instruction on behalf of a guest and ends up
using the policy provided by libxc, it subsequently edits certain fields,
particularly in the feature sets.

Support for the feature masking MSRs is available via the `cpuid_mask_*`
command line parameters which get applied at boot and reduce the visible
feature set to every subsequent `CPUID` instruction.

Support for enabling _CPUID Faulting_ exists, but it does nothing more than
defer back to the default policy.


How XenServer currently does levelling
======================================

The _Heterogeneous Pool Levelling_ support in XenServer appears to predate the
libxc CPUID policy API, so does not currently use it.  The toolstack has a
table of CPU model numbers identifying whether levelling is supported.  It
then uses native `CPUID` instructions to look at the first four feature masks,
and identifies the subset of features across the pool.
`cpuid_mask_{,extd_}{ecx,edx}` is then set on Xen's command line for each host
in the pool, and all hosts rebooted.

This has several limitations:

* Xen and dom0 have a reduced feature set despite not needing to migrate
* There is only a single level for all VMs in the pool
* The toolstack only understands the first 4 of the possible masking MSRs, and
  there are now feature maps in further `CPUID` leaves which have no masking
  MSRs


Notes and observations
======================

Experimentally, the masking MSRs can be context switched.  There is no need to
force all PV guests to the same level, and no need to prevent dom0 or Xen from
using certain features.  Context switching the masking MSRs will however incur
an overhead, and should be avoided where possible.

The toolstack needs to know how much control Xen has over VM features.  In the
case that there are insufficient masking MSRs, and no faulting support is
present, a PV VM can still potentially be made safe to migrate by explicitly
disabling features on the kernel command line.  As a result, there should be a
new mechanism which reports the levelling controls Xen has available.

The features available to each type of guest are really only known to Xen.
Having libxc try to divine them is bogus (especially as libxc is subject to
the toolstack domain's cpuid policy itself).  Therefore on boot, Xen should
work out the maximal feature set available to each type of guest and make this
information available to the toolstack.


Design
======

`struct sysctl_physinfo.levelling_caps`
---------------------------------------

Xen shall gain a new physinfo field which reports the degree to which it can
influence `CPUID` executed by a PV guest.  This is a bitmap containing:

* `faulting`
    * CPUID Faulting is available, and full control can be exercised.
* `mask_ecx`
    * Leaf 0x00000001.ECX
* `mask_edx`
    * Leaf 0x00000001.EDX
* `mask_extd_ecx`
    * Leaf 0x80000001.ECX
* `mask_extd_edx`
    * Leaf 0x80000001.EDX
* `mask_xsave_eax`
    * Leaf 0x0000000D[ECX=1].EAX
* `mask_therm_ecx`
    * Leaf 0x00000006.ECX
* `mask_l7s0_eax`
    * Leaf 0x00000007[ECX=0].EAX
* `mask_l7s0_ebx`
    * Leaf 0x00000007[ECX=0].EBX

At the time of writing, these are all the masking MSRs known by Xen.  The
bitmap shall be extended as new MSRs become available.
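
A toolstack-side view of such a bitmap might look like the sketch below.  The
design only names the bits; the `LCAP_*` macro names, their bit positions and
the helper are all hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit assignments for levelling_caps, in the order listed. */
#define LCAP_faulting        (UINT32_C(1) << 0)
#define LCAP_mask_ecx        (UINT32_C(1) << 1)
#define LCAP_mask_edx        (UINT32_C(1) << 2)
#define LCAP_mask_extd_ecx   (UINT32_C(1) << 3)
#define LCAP_mask_extd_edx   (UINT32_C(1) << 4)
#define LCAP_mask_xsave_eax  (UINT32_C(1) << 5)
#define LCAP_mask_therm_ecx  (UINT32_C(1) << 6)
#define LCAP_mask_l7s0_eax   (UINT32_C(1) << 7)
#define LCAP_mask_l7s0_ebx   (UINT32_C(1) << 8)

/*
 * Can Xen control what a PV guest sees in leaf 0x00000001?  Either CPUID
 * Faulting gives full control, or both leaf 1 masks must be available.
 */
static bool pv_can_level_leaf1(uint32_t caps)
{
    const uint32_t need = LCAP_mask_ecx | LCAP_mask_edx;

    if (caps & LCAP_faulting)
        return true;

    return (caps & need) == need;
}
```

Where such a check fails, a toolstack could fall back to disabling features
on the guest kernel command line, as noted earlier.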

New 'featureset' API for use by the toolstack
---------------------------------------------

A featureset is defined as a collection of words covering the cpuid leaves
which report features to the caller.  It is variable length, and expected to
grow over time as processors gain more features, or Xen starts supporting
exposing more features to guests.

At the time of writing, the leaves containing feature bits are:

* 0x00000001.ECX
* 0x00000001.EDX
* 0x80000001.ECX
* 0x80000001.EDX
* 0x0000000D[ECX=1].EAX
* 0x00000007[ECX=0].EBX
* 0x00000006.EAX
* 0x00000006.ECX
* 0x0000000A.EAX
* 0x0000000A.EBX
* 0x0000000F[ECX=0].EDX
* 0x0000000F[ECX=1].EDX

XEN_SYSCTL_get_featureset
-------------------------

Xen shall on boot create a featureset for itself, and the maximum available
features for each type of guest, based on hardware features, command line
options etc.  A toolstack shall be able to query all of these.

Cpuid feature-verification library
----------------------------------

There shall be a new library (shared between Xen and libxc in the same way as
libelf etc.) which can verify a featureset.  In particular, it will confirm
that no features are enabled without the features they depend on.
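
One possible shape for such a check is a small dependency table walked over
the featureset.  The sketch below assumes that featureset words follow the
order of the leaf list above (word 0 is 0x00000001.ECX, word 5 is
0x00000007[ECX=0].EBX); the macro and function names are hypothetical, and
the table holds only three well-known dependencies as examples.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define FEAT(word, bit) ((word) * 32u + (bit))

#define X86_FMA   FEAT(0, 12)  /* CPUID.1:ECX bit 12 */
#define X86_XSAVE FEAT(0, 26)  /* CPUID.1:ECX bit 26 */
#define X86_AVX   FEAT(0, 28)  /* CPUID.1:ECX bit 28 */
#define X86_AVX2  FEAT(5, 5)   /* CPUID.7[0]:EBX bit 5 */

struct dep {
    unsigned int feature;   /* feature being advertised */
    unsigned int requires;  /* feature it cannot work without */
};

static const struct dep deps[] = {
    { X86_AVX,  X86_XSAVE },
    { X86_FMA,  X86_AVX   },
    { X86_AVX2, X86_AVX   },
};

/* Test one feature bit; bits beyond the end of the set read as clear. */
static bool test_feat(const uint32_t *fs, size_t nr_words, unsigned int f)
{
    return (f / 32 < nr_words) && ((fs[f / 32] >> (f % 32)) & 1);
}

/* Return true if every enabled feature also has its prerequisite enabled. */
static bool featureset_is_consistent(const uint32_t *fs, size_t nr_words)
{
    for (size_t i = 0; i < sizeof(deps) / sizeof(deps[0]); i++)
        if (test_feat(fs, nr_words, deps[i].feature) &&
            !test_feat(fs, nr_words, deps[i].requires))
            return false;

    return true;
}
```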

XEN_DOMCTL_set_cpuid
--------------------

This is an existing hypercall.  Currently it just stashes the policy from
userspace.  It shall be extended to provide verification of the policy, and
reject attempts to advertise features which Xen is incapable of providing
(via hardware or emulation support).
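
The rejection rule amounts to a subset test of the requested featureset
against the per-guest-type maximum.  A minimal sketch, with a hypothetical
function name and assuming both sets have the same number of words:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Return true if every feature bit set in req[] is also set in max[].
 * A hypercall handler built on this would refuse the policy otherwise.
 */
static bool featureset_is_subset(const uint32_t *req, const uint32_t *max,
                                 size_t nr_words)
{
    for (size_t w = 0; w < nr_words; w++)
        if (req[w] & ~max[w])
            return false;

    return true;
}
```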

VCPU context switch
-------------------

Xen shall be updated to lazily context switch all available masking MSRs.  It
is noted that this shall incur a performance overhead if restricted
featuresets are assigned to PV guests, and _CPUID Faulting_ is not available.

It shall be the responsibility of the host administrator to avoid creating
such a scenario, if the performance overhead is a concern.
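
"Lazily" here can be read as: remember what was last written to a masking MSR
on this CPU, and skip the write when the incoming VCPU wants the same value.
The sketch below models that idea with a stub in place of a real MSR write;
all names are hypothetical and a single mask stands in for the full set.

```c
#include <stdint.h>

/* Stand-ins for the hardware register and a write counter (model only). */
static uint64_t hw_mask_msr = ~UINT64_C(0);
static unsigned int nr_msr_writes;

static void stub_wrmsr(uint64_t val)
{
    hw_mask_msr = val;
    nr_msr_writes++;
}

/* Value last written on this CPU; all-ones means "nothing masked". */
static uint64_t cached_mask = ~UINT64_C(0);

/* Switch to the next VCPU's mask, writing the MSR only on a real change. */
static void ctxt_switch_levelling(uint64_t next_mask)
{
    if (next_mask == cached_mask)
        return;

    stub_wrmsr(next_mask);
    cached_mask = next_mask;
}
```

Switching between VCPUs which share a mask costs nothing in this model; only
a transition between differently levelled guests pays for the MSR write.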


Future work
===========

The above is a minimum quantity of work to support feature levelling, but
further problems exist.  They are acknowledged as being issues, but are not in
scope for fixing as part of feature levelling.

* Xen has no notion of per-cpu and per-package data in the cpuid policy.  In
  particular, this causes issues for VMs attempting to detect topology, which
  find inconsistent/incorrect cache information.

* In the case that `domain_cpuid()` can't locate a leaf in the topology, it
  will fall back to issuing a plain `CPUID` instruction.  This breaks VM
  encapsulation, as a VM which has migrated can observe differences which
  should be hidden.

* There is currently a positioning issue with the domain's cpuid policy.
  Verifying the register state requires the policy, but the policy is behind
  the register state in the migration stream.  The domain's cpuid policy should
  become an item in Xen's migration state for a VM.


* Re: [DESIGN] Feature Levelling improvements
  2015-06-16 10:50 [DESIGN] Feature Levelling improvements Andrew Cooper
@ 2015-06-16 15:33 ` Jan Beulich
  2015-06-16 16:45   ` Andrew Cooper
  2015-06-22 19:18 ` Konrad Rzeszutek Wilk
  1 sibling, 1 reply; 7+ messages in thread
From: Jan Beulich @ 2015-06-16 15:33 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Wei Liu, Ian Campbell, Tim Deegan, Ian Jackson, Xen-devel List,
	Marcos E.Matsunaga

>>> On 16.06.15 at 12:50, <andrew.cooper3@citrix.com> wrote:
> How XenServer currently does levelling
> ======================================
> 
> The _Heterogeneous Pool Levelling_ support in XenServer appears to
> predate the
> libxc CPUID policy API, so does not currently use it.  The toolstack has a
> table of CPU model numbers identifying whether levelling is supported.  It
> then uses native `CPUID` instructions to look at the first four feature
> masks,
> and identifies the subset of features across the pool.
> `cpuid_mask_{,extd_}{ecx,edx}` is then set on Xen's command line for
> each host
> in the pool, and all hosts rebooted.
> 
> This has several limitations:
> 
> * Xen and dom0 have a reduced feature set despite not needing to migrate

I don't think Xen is affected by this, as it reads the CPUID bits
before setting the masks (there are a few cpuid() invocations
in "random" code, but I don't think these access maskable ones).

> Notes and observations
> ======================
> 
> Experimentally, the masking MSRs can be context switched.  There is no
> need to
> force all PV guests to the same level, and no need to prevent dom0 or
> Xen from
> using certain features.  Context switching the masking MSRs will however
> incur
> an overhead, and should be avoided where possible.
> 
> The toolstack needs to know how much control Xen has over VM features. 
> In the
> case that there are insufficient masking MSRs, and no faulting support is
> present, a PV VM can still potentially be made safe to migrate by explicitly
> disabling features on the kernel command line.

That wouldn't help with user mode code, would it?

> VCPU context switch
> -------------------
> 
> Xen shall be updated to lazily context switch all available masking
> MSRs.  It
> is noted that this shall incur a performance overhead if restricted
> featuresets are assigned to PV guests, and _CPUID Faulting_ is not
> available.
> 
> It shall be the responsibility of the host administrator to avoid creating
> such a scenario, if the performance overhead is a concern.

Not sure how feasible this is: Even if you run all PV guests at equal
feature levels, context switching between PV and non-PV guests
would still incur overhead (unless you mean to run HVM/PVH ones
with whatever masking is currently in place). Plus this still wouldn't
deal with masks in place when Xen itself wants to look at any of the
maskable ones, unless you intend to audit code to make sure no
such uses exist (which - as per above - I suppose/hope to be the
case).

> Future work
> ===========
> 
> The above is a minimum quantity of work to support feature levelling, but
> further problems exist.  They are acknowledged as being issues, but are
> not in
> scope for fixing as part of feature levelling.
> 
> * Xen has no notion of per-cpu and per-package data in the cpuid policy.  In
>   particular, this causes issues for VMs attempting to detect topology,
> which
>   find inconsistent/incorrect cache information.
> 
> * In the case that `domain_cpuid()` can't locate a leaf in the topology, it
>   will fall back to issuing a plain `CPUID` instruction.  This breaks VM
>   encapsulation, as a VM which has migrated can observe differences which
>   should be hidden.

I think this is actually something that (a) needs addressing not too
far in the future and (b) reminds me that I didn't see any talk here
regarding black vs white listing of features not explicitly known to
Xen or the tool stack.

Jan


* Re: [DESIGN] Feature Levelling improvements
  2015-06-16 15:33 ` Jan Beulich
@ 2015-06-16 16:45   ` Andrew Cooper
  0 siblings, 0 replies; 7+ messages in thread
From: Andrew Cooper @ 2015-06-16 16:45 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Wei Liu, Ian Campbell, Tim Deegan, Ian Jackson, Xen-devel List,
	Marcos E.Matsunaga

On 16/06/15 16:33, Jan Beulich wrote:
>>>> On 16.06.15 at 12:50, <andrew.cooper3@citrix.com> wrote:
>> How XenServer currently does levelling
>> ======================================
>>
>> The _Heterogeneous Pool Levelling_ support in XenServer appears to
>> predate the
>> libxc CPUID policy API, so does not currently use it.  The toolstack has a
>> table of CPU model numbers identifying whether levelling is supported.  It
>> then uses native `CPUID` instructions to look at the first four feature
>> masks,
>> and identifies the subset of features across the pool.
>> `cpuid_mask_{,extd_}{ecx,edx}` is then set on Xen's command line for
>> each host
>> in the pool, and all hosts rebooted.
>>
>> This has several limitations:
>>
>> * Xen and dom0 have a reduced feature set despite not needing to migrate
> I don't think Xen is affected by this, as it reads the CPUID bits
> before setting the masks (there are a few cpuid() invocations
> in "random" code, but I don't think these access maskable ones).

As part of existing levelling in XenServer, xsave (and in particular,
xsaveopt) are disabled, which does infringe on Xen's ability to do a
context switch in an efficient manner.

For the gory details (from
https://github.com/xenserver/xen-4.5.pg/blob/master/master/series ), the
following hacks are in place which have accumulated over time to "fix"
regressions in migration when it comes to exposed features:

fix-xsave-dependent-CPUID-bits-being-advertised-to-guests.patch
xen-dont-hide-vtx-or-svm.patch
xen-capture-boot-cpuid-info.patch
xen-apply-cpuid-mask-to-cpuid-faulting.patch
xen-disable-xsave.patch
xen-hide-fma4-on-amd-fam15h.patch
mixed-cpuid-before-mask.patch

All of which I will abolish with pleasure once these levelling
improvements are complete.

>
>> Notes and observations
>> ======================
>>
>> Experimentally, the masking MSRs can be context switched.  There is no
>> need to
>> force all PV guests to the same level, and no need to prevent dom0 or
>> Xen from
>> using certain features.  Context switching the masking MSRs will however
>> incur
>> an overhead, and should be avoided where possible.
>>
>> The toolstack needs to know how much control Xen has over VM features. 
>> In the
>> case that there are insufficient masking MSRs, and no faulting support is
>> present, a PV VM can still potentially be made safe to migrate by explicitly
>> disabling features on the kernel command line.
> That wouldn't help with user mode code, would it?

Generally not, but it does depend on whether user code queries cpuid
directly, or asks the OS for features.  This is already a last-ditch
effort at this point.

>
>> VCPU context switch
>> -------------------
>>
>> Xen shall be updated to lazily context switch all available masking
>> MSRs.  It
>> is noted that this shall incur a performance overhead if restricted
>> featuresets are assigned to PV guests, and _CPUID Faulting_ is not
>> available.
>>
>> It shall be the responsibility of the host administrator to avoid creating
>> such a scenario, if the performance overhead is a concern.
> Not sure how feasible this is: Even if you run all PV guests at equal
> feature levels, context switching between PV and non-PV guests
> would still incur overhead (unless you mean to run HVM/PVH ones
> with whatever masking is currently in place). Plus this still wouldn't
> deal with masks in place when Xen itself wants to look at any of the
> maskable ones, unless you intend to audit code to make sure no
> such uses exist (which - as per above - I suppose/hope to be the
> case).

This is partially linked with Future work b), to try and remove Xen's
reliance on cpuid after boot.

As domains shall inherit the default maximal policy, then moderated
downwards by DOMCTL_set_cpuid_policy, no maskable feature leaves will
fall through the existing domain_cpuid() implementation to a plain
`cpuid` instruction, so masking is safe to leave in place when context
switching to an HVM VCPU.

I will audit the code to check that Xen is never checking feature leaves
at runtime.  It is never needed (and is specifically inefficient in a
nested case).

>
>> Future work
>> ===========
>>
>> The above is a minimum quantity of work to support feature levelling, but
>> further problems exist.  They are acknowledged as being issues, but are
>> not in
>> scope for fixing as part of feature levelling.
>>
>> * Xen has no notion of per-cpu and per-package data in the cpuid policy.  In
>>   particular, this causes issues for VMs attempting to detect topology,
>> which
>>   find inconsistent/incorrect cache information.
>>
>> * In the case that `domain_cpuid()` can't locate a leaf in the topology, it
>>   will fall back to issuing a plain `CPUID` instruction.  This breaks VM
>>   encapsulation, as a VM which has migrated can observe differences which
>>   should be hidden.
> I think this is actually something that (a) needs addressing not too
> far in the future and (b) reminds me that I didn't see any talk here
> regarding black vs white listing of features not explicitly known to
> Xen or the tool stack.

I have put quite a lot of thought towards a) and while it absolutely
does need addressing, it is substantially more work than just fixing the
levelling issues (which are my top priority, from XenServer's point of
view).  I do hope to manage it as follow-on work, and will have it in
mind when fixing levelling.

The whitelist is implicit by virtue of Xen calculating the per-vm-type
maximum feature set which can be offered, then forcibly preventing the
toolstack from expanding on that.  There is possibly room for a command
line parameter to change the default behaviour.

Part of fixing both b) and a) involves Xen gaining a more structured
understanding of the cpuid leaves, and enforcing things like max_leaf,
which are logically limited by Xen's understanding of the leaves.

~Andrew


* Re: [DESIGN] Feature Levelling improvements
  2015-06-16 10:50 [DESIGN] Feature Levelling improvements Andrew Cooper
  2015-06-16 15:33 ` Jan Beulich
@ 2015-06-22 19:18 ` Konrad Rzeszutek Wilk
  2015-06-23  9:11   ` Andrew Cooper
  1 sibling, 1 reply; 7+ messages in thread
From: Konrad Rzeszutek Wilk @ 2015-06-22 19:18 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Wei Liu, Ian Campbell, Tim Deegan, Ian Jackson, Xen-devel List,
	Marcos E. Matsunaga, Jan Beulich

Thank you for posting this!

Some comments below.

> Design
> ======
> 
> `struct sysctl_physinfo.levelling_caps`
> ---------------------------------------
> 
> Xen shall gain a new physinfo field which reports the degree to which it can
> influence `CPUID` executed by a PV guest.  This is a bitmap containing:
> 
> * `faulting`
>     * CPUID Faulting is available, and full control can be exercised.
> * `mask_ecx`
>     * Leaf 0x00000001.ECX
> * `mask_edx`
>     * Leaf 0x00000001.EDX
> * `mask_extd_ecx`
>     * Leaf 0x80000001.ECX
> * `mask_extd_edx`
>     * Leaf 0x80000001.EDX
> * `mask_xsave_eax`
>     * Leaf 0x0000000D[ECX=1].EAX
> * `mask_therm_ecx`
>     * Leaf 0x00000006.ECX
> * `mask_l7s0_eax`
>     * Leaf 0x00000007[ECX=0].EAX
> * `mask_l7s0_ebx`

Those 'l' look like '1' in the PDF.

Can it be called something else?

>     * Leaf 0x00000007[ECX=0].EBX
> 
> At the time of writing, these are all the masking MSRs known by Xen.  The
> bitmap shall be extended as new MSRs become available.
> 
> New 'featureset' API for use by the toolstack
> ---------------------------------------------
> 
> A featureset is a defined as a collection of words covering the cpuid leaves
> which report features to the caller.  It is variable length, and expected to
> grow over time as processors gain more features, or Xen starts supporting
> exposing more features to guests.
> 
> At the time of writing, the leaves containing feature bits are:
> 
> * 0x00000001.ECX
> * 0x00000001.EDX
> * 0x80000001.ECX
> * 0x80000001.EDX
> * 0x0000000D[ECX=1].EAX
> * 0x00000007[ECX=0].EBX
> * 0x00000006.EAX
> * 0x00000006.ECX
> * 0x0000000A.EAX
> * 0x0000000A.EBX
> * 0x0000000F[ECX=0].EDX
> * 0x0000000F[ECX=1].EDX
> 
> XEN_SYSCTL_get_featureset
> -------------------------
> 
> Xen shall on boot create a featureset for itself, and the maximum available
> features for each type of guest, based on hardware features, command line
> options etc.  A toolstack shall be able to query all of these.

maximum available features? As in two sets of features - one for
PV and another for HVM. The PV being a subset of HVM (since it is more
constrained)?

Command line options being the same old ones (the cpuid_mask..?) and then
more? Or just rewrite this to be:

cpuid=mask_therm_ecx=[blahbla],mask_xsave_eax=[blahbal] ?


> 
> Cpuid feature-verification library
> ----------------------------------
> 
> There shall be a new library (shared between Xen and libxc in the same
> way as
> libelf etc.) which can verify the a featureset.  In particular, it will

s/ a //
> confirm that no features are enabled without their dependent features.

And presumably can compare these features and do an and-subset (or an
or-subset) ?

> 
> XEN_DOMCTL_set_cpuid
> --------------------
> 
> This is an existing hypercall.  Currently it just stashes the policy from
> userspace.  It shall be extended to provide verification of the policy, and
> reject attempts to advertise features which Xen is incapable of providing
> (via hardware or emulation support).

Where would be the code to trim the 'maximum available features' in the
subsets (like PV) with some cpuid=X flags from user-space?


> 
> VCPU context switch
> -------------------
> 
> Xen shall be updated to lazily context switch all available masking
> MSRs.  It
> is noted that this shall incur a performance overhead if restricted
> featuresets are assigned to PV guests, and _CPUID Faulting_ is not
> available.
> 
> It shall be the responsibility of the host administrator to avoid creating
> such a scenario, if the performance overhead is a concern.

.. and perhaps add warnings in the toolstack to tell the admin?

> 
> 
> Future work
> ===========
> 
> The above is a minimum quantity of work to support feature levelling, but
> further problems exist.  They are acknowledged as being issues, but are
> not in
> scope for fixing as part of feature levelling.
> 
> * Xen has no notion of per-cpu and per-package data in the cpuid policy.  In
>   particular, this causes issues for VMs attempting to detect topology,
> which
>   find inconsistent/incorrect cache information.
> 
> * In the case that `domain_cpuid()` can't locate a leaf in the topology, it
>   will fall back to issuing a plain `CPUID` instruction.  This breaks VM
>   encapsulation, as a VM which has migrated can observe differences which
>   should be hidden.
> 
> * There is currently a positioning issue with the domains cpuid policy.
>   Verifying the register state requires the policy, but the policy is behind
>   the register state in the migration stream.  The domains cpuid policy
> should
>   become an item in Xen's migration state for a VM.


And potentially code in libxl to allow subset manipulation to allow
leveling across different platforms. As in the common features would
be exposed while all the other ones are masked? And I suppose some
format to stash this so it can be ingested by the libxl tools?


> 


* Re: [DESIGN] Feature Levelling improvements
  2015-06-22 19:18 ` Konrad Rzeszutek Wilk
@ 2015-06-23  9:11   ` Andrew Cooper
  2015-06-23 15:09     ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 7+ messages in thread
From: Andrew Cooper @ 2015-06-23  9:11 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Wei Liu, Ian Campbell, Tim Deegan, Ian Jackson, Xen-devel List,
	Marcos E. Matsunaga, Jan Beulich

On 22/06/15 20:18, Konrad Rzeszutek Wilk wrote:
> Thank you for posting this!
>
> Some comments below.
>
>> Design
>> ======
>>
>> `struct sysctl_physinfo.levelling_caps`
>> ---------------------------------------
>>
>> Xen shall gain a new physinfo field which reports the degree to which it can
>> influence `CPUID` executed by a PV guest.  This is a bitmap containing:
>>
>> * `faulting`
>>     * CPUID Faulting is available, and full control can be exercised.
>> * `mask_ecx`
>>     * Leaf 0x00000001.ECX
>> * `mask_edx`
>>     * Leaf 0x00000001.EDX
>> * `mask_extd_ecx`
>>     * Leaf 0x80000001.ECX
>> * `mask_extd_edx`
>>     * Leaf 0x80000001.EDX
>> * `mask_xsave_eax`
>>     * Leaf 0x0000000D[ECX=1].EAX
>> * `mask_therm_ecx`
>>     * Leaf 0x00000006.ECX
>> * `mask_l7s0_eax`
>>     * Leaf 0x00000007[ECX=0].EAX
>> * `mask_l7s0_ebx`
> Those 'l' look like '1' in the PDF.
>
> Can it be called something else?

If you can suggest a better name, yes.  As for now, these are the
variable names used in-tree (top of xen/arch/x86/cpu/amd.c)

>
>>     * Leaf 0x00000007[ECX=0].EBX
>>
>> At the time of writing, these are all the masking MSRs known by Xen.  The
>> bitmap shall be extended as new MSRs become available.
>>
>> New 'featureset' API for use by the toolstack
>> ---------------------------------------------
>>
>> A featureset is a defined as a collection of words covering the cpuid leaves
>> which report features to the caller.  It is variable length, and expected to
>> grow over time as processors gain more features, or Xen starts supporting
>> exposing more features to guests.
>>
>> At the time of writing, the leaves containing feature bits are:
>>
>> * 0x00000001.ECX
>> * 0x00000001.EDX
>> * 0x80000001.ECX
>> * 0x80000001.EDX
>> * 0x0000000D[ECX=1].EAX
>> * 0x00000007[ECX=0].EBX
>> * 0x00000006.EAX
>> * 0x00000006.ECX
>> * 0x0000000A.EAX
>> * 0x0000000A.EBX
>> * 0x0000000F[ECX=0].EDX
>> * 0x0000000F[ECX=1].EDX
>>
>> XEN_SYSCTL_get_featureset
>> -------------------------
>>
>> Xen shall on boot create a featureset for itself, and the maximum available
>> features for each type of guest, based on hardware features, command line
>> options etc.  A toolstack shall be able to query all of these.
> maximum available features?

Maximum set of features Xen is able to provide to particular guests on
this specific host.

>  As in two sets of features - one for
> PV and another for HVM. The PV being a subset of HVM (since it is more
> constrained)?

Three really (including the host featureset), but yes.

>
> Command line options being the same old ones (the cpuid_mask..?) and then
> more? Or just rewrite this to be:
>
> cpuid=mask_therm_ecx=[blahbla],mask_xsave_eax=[blahbal] ?

No.  What I meant by that is that something like "no-xsave" will turn
off whole swathes of features in all sets.

The maximum set of features available to Xen, PV and HVM guests alike
depends on the hardware, firmware settings and command line options
provided to Xen enabling or disabling functionality.

It is specifically not guaranteed to remain the same across reboot,
which is why Xen shall recalculate it on each boot.
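
To make the relationship between the three sets concrete, here is a minimal
sketch in C.  It is not in-tree code: `calculate_featuresets()`,
`struct featuresets`, the `*_supported` masks and the 12-word length are all
hypothetical names and values chosen for illustration.  It assumes each guest
maximum is simply the host featureset restricted to what Xen can offer that
type of guest.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical length: one 32-bit word per feature leaf currently listed. */
#define FEATURESET_NR_WORDS 12

/* Hypothetical container for the three featuresets calculated at boot. */
struct featuresets {
    uint32_t host[FEATURESET_NR_WORDS];
    uint32_t pv_max[FEATURESET_NR_WORDS];
    uint32_t hvm_max[FEATURESET_NR_WORDS];
};

/*
 * hw:            feature words as read from CPUID on this boot.
 * cmdline_off:   bits disabled by command line options (e.g. "no-xsave").
 * pv_supported:  bits Xen is able to offer to PV guests at all.
 * hvm_supported: bits Xen is able to offer to HVM guests at all.
 */
static void calculate_featuresets(struct featuresets *fs,
                                  const uint32_t *hw,
                                  const uint32_t *cmdline_off,
                                  const uint32_t *pv_supported,
                                  const uint32_t *hvm_supported)
{
    for (size_t i = 0; i < FEATURESET_NR_WORDS; i++) {
        /* Anything turned off on the command line disappears from all sets. */
        fs->host[i]    = hw[i] & ~cmdline_off[i];
        fs->pv_max[i]  = fs->host[i] & pv_supported[i];
        fs->hvm_max[i] = fs->host[i] & hvm_supported[i];
    }
}
```

Because everything is derived from what the hardware reports on this boot, a
firmware change or a different command line naturally yields different sets.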

>
>
>> Cpuid feature-verification library
>> ----------------------------------
>>
>> There shall be a new library (shared between Xen and libxc in the same way as
>> libelf etc.) which can verify the a featureset.  In particular, it will
> s/ a //
>> confirm that no features are enabled without their dependent features.
> And presumably can compare these features and do an and-subset (or an
> or-subset) ?

At the end of the day, these are just bitmaps with an (unknown but fixed)
integer length.
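
As a rough illustration of what that means for combining featuresets, the
sketch below treats a featureset as an array of 32-bit words.  The function
names `featureset_and()` and `featureset_or()` are hypothetical and not part
of any existing Xen or libxc interface; the word count is passed in by the
caller because the length is not fixed by this design.

```c
#include <stddef.h>
#include <stdint.h>

/* out = a AND b: the features common to both featuresets. */
static void featureset_and(const uint32_t *a, const uint32_t *b,
                           uint32_t *out, size_t nr_words)
{
    for (size_t i = 0; i < nr_words; i++)
        out[i] = a[i] & b[i];
}

/* out = a OR b: the features present in either featureset. */
static void featureset_or(const uint32_t *a, const uint32_t *b,
                          uint32_t *out, size_t nr_words)
{
    for (size_t i = 0; i < nr_words; i++)
        out[i] = a[i] | b[i];
}
```

Levelling across a pool of hosts would then amount to AND-ing together the
relevant featureset from each host.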

>
>> XEN_DOMCTL_set_cpuid
>> --------------------
>>
>> This is an existing hypercall.  Currently it just stashes the policy from
>> userspace.  It shall be extended to provide verification of the policy, and
>> reject attempts to advertise features which Xen is incapable of providing
>> (via hardware or emulation support).
> Where would be the code to trim the 'maximum available features' in the
> subsets (like PV) with some cpuid=X flags from user-space?

There is already code to do this in both libxl and libxc.  There will of
course be some changes as part of this work, but nothing major (I hope).

The important point is that the hypercall shall now check Xen's ability
to provide what the toolstack has requested, and say no if it can't.
This will avoid the current situation, in which the domain cpuid code in
Xen always has to second-guess what is present in the domain policy,
because it is usually junk.
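
The core of such a check could look something like the sketch below.  This is
an assumption about the shape of the logic, not the actual hypercall
implementation: `featureset_is_subset()` and `check_requested_featureset()`
are hypothetical helpers, and returning `-EINVAL` is just one plausible way of
saying no.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True if every bit set in req is also set in max. */
static bool featureset_is_subset(const uint32_t *req, const uint32_t *max,
                                 size_t nr_words)
{
    for (size_t i = 0; i < nr_words; i++)
        if (req[i] & ~max[i])
            return false;

    return true;
}

/*
 * Accept the requested featureset only if Xen can provide all of it for
 * this type of guest; otherwise refuse rather than silently stashing it.
 */
static int check_requested_featureset(const uint32_t *req,
                                      const uint32_t *guest_max,
                                      size_t nr_words)
{
    return featureset_is_subset(req, guest_max, nr_words) ? 0 : -EINVAL;
}
```

With a check like this at the point the policy is set, later consumers of the
policy would no longer need to re-validate it.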

>
>
>> VCPU context switch
>> -------------------
>>
>> Xen shall be updated to lazily context switch all available masking MSRs.  It
>> is noted that this shall incur a performance overhead if restricted
>> featuresets are assigned to PV guests, and _CPUID Faulting_ is not available.
>>
>> It shall be the responsibility of the host administrator to avoid creating
>> such a scenario, if the performance overhead is a concern.
> .. and perhaps add warnings in the toolstack to tell the admin?

How and where would this surface?  xl/libxl is not designed to run the
system as a whole.

>
>>
>> Future work
>> ===========
>>
>> The above is a minimum quantity of work to support feature levelling, but
>> further problems exist.  They are acknowledged as being issues, but are not in
>> scope for fixing as part of feature levelling.
>>
>> * Xen has no notion of per-cpu and per-package data in the cpuid policy.  In
>>   particular, this causes issues for VMs attempting to detect topology, which
>>   find inconsistent/incorrect cache information.
>>
>> * In the case that `domain_cpuid()` can't locate a leaf in the topology, it
>>   will fall back to issuing a plain `CPUID` instruction.  This breaks VM
>>   encapsulation, as a VM which has migrated can observe differences which
>>   should be hidden.
>>
>> * There is currently a positioning issue with the domain's cpuid policy.
>>   Verifying the register state requires the policy, but the policy is behind
>>   the register state in the migration stream.  The domain's cpuid policy should
>>   become an item in Xen's migration state for a VM.
>
> And potentially code in libxl to allow subset manipulation to allow
> leveling across different platforms. As in the common features would
> be exposed while all the other ones are masked? And I suppose some
> format to stash this so it can be ingested by the libxl tools?

libxl's knowledge of multiple platforms is precisely nothing.  xl knows
just enough to ssh and set up some pipes to push a VM through.

The domain configuration does have cpuid information in it.  That will
be sufficient, given these proposed changes, to prevent running the VM
on an incompatible destination.

~Andrew


* Re: [DESIGN] Feature Levelling improvements
  2015-06-23  9:11   ` Andrew Cooper
@ 2015-06-23 15:09     ` Konrad Rzeszutek Wilk
  2015-06-24 17:28       ` Andrew Cooper
  0 siblings, 1 reply; 7+ messages in thread
From: Konrad Rzeszutek Wilk @ 2015-06-23 15:09 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Wei Liu, Ian Campbell, Tim Deegan, Ian Jackson, Xen-devel List,
	Marcos E. Matsunaga, Jan Beulich

> >>     * Leaf 0x00000007[ECX=0].EAX
> >> * `mask_l7s0_ebx`
> > Those 'l' look like '1' in the PDF.
> >
> > Can it be called something else?
> 
> If you can suggest a better name, yes.  As for now, these are the
> variable names used in-tree (top of xen/arch/x86/cpu/amd.c)

low?
> 
> >
> >>     * Leaf 0x00000007[ECX=0].EBX
> >>
> >> At the time of writing, these are all the masking MSRs known by Xen.  The
> >> bitmap shall be extended as new MSRs become available.
> >>
> >> New 'featureset' API for use by the toolstack
> >> ---------------------------------------------
> >>
> >> A featureset is defined as a collection of words covering the cpuid leaves
> >> which report features to the caller.  It is variable length, and expected to
> >> grow over time as processors gain more features, or Xen starts supporting
> >> exposing more features to guests.
> >>
> >> At the time of writing, the leaves containing feature bits are:
> >>
> >> * 0x00000001.ECX
> >> * 0x00000001.EDX
> >> * 0x80000001.ECX
> >> * 0x80000001.EDX
> >> * 0x0000000D[ECX=1].EAX
> >> * 0x00000007[ECX=0].EBX
> >> * 0x00000006.EAX
> >> * 0x00000006.ECX
> >> * 0x0000000A.EAX
> >> * 0x0000000A.EBX
> >> * 0x0000000F[ECX=0].EDX
> >> * 0x0000000F[ECX=1].EDX
> >>
> >> XEN_SYSCTL_get_featureset
> >> -------------------------
> >>
> >> Xen shall on boot create a featureset for itself, and the maximum available
> >> features for each type of guest, based on hardware features, command line
> >> options etc.  A toolstack shall be able to query all of these.
> > maximum available features?
> 
> Maximum set of features Xen is able to provide to particular guests on
> this specific host.
> 
> >  As in two sets of features - one for
> > PV and another for HVM. The PV being a subset of HVM (since it is more
> > constrained)?
> 
> Three really (including the host featureset), but yes.
> 
> >
> > Command line options being the same old ones (the cpuid_mask..?) and then
> > more? Or just rewrite this to be:
> >
> > cpuid=mask_therm_ecx=[blahbla],mask_xsave_eax=[blahbal] ?
> 
> No.  What I meant by that is that something like "no-xsave" will turn
> off whole swathes of features in all sets.
> 
> The maximum set of features available to Xen, PV and HVM guests alike
> depends on the hardware, firmware settings and command line options
> provided to Xen enabling or disabling functionality.
> 
> It is specifically not guaranteed to remain the same across reboot,
> which is why Xen shall recalculate it on each boot.
> 
> >
> >
> >> Cpuid feature-verification library
> >> ----------------------------------
> >>
> >> There shall be a new library (shared between Xen and libxc in the same way as
> >> libelf etc.) which can verify the a featureset.  In particular, it will
> > s/ a //
> >> confirm that no features are enabled without their dependent features.
> > And presumably can compare these features and do an and-subset (or an
> > or-subset) ?
> 
> At the end of the day, these are just bitmaps with an (unknown but fixed)
> integer length.
> 
> >
> >> XEN_DOMCTL_set_cpuid
> >> --------------------
> >>
> >> This is an existing hypercall.  Currently it just stashes the policy from
> >> userspace.  It shall be extended to provide verification of the policy, and
> >> reject attempts to advertise features which Xen is incapable of providing
> >> (via hardware or emulation support).
> > Where would be the code to trim the 'maximum available features' in the
> > subsets (like PV) with some cpuid=X flags from user-space?
> 
> There is already code to do this in both libxl and libxc.  There will of
> course be some changes as part of this work, but nothing major (I hope).
> 
> The important point is that the hypercall shall now check Xen's ability
> to provide what the toolstack has requested, and say no if it can't.
> This will avoid the current situation, in which the domain cpuid code in
> Xen always has to second-guess what is present in the domain policy,
> because it is usually junk.
> 
> >
> >
> >> VCPU context switch
> >> -------------------
> >>
> >> Xen shall be updated to lazily context switch all available masking MSRs.  It
> >> is noted that this shall incur a performance overhead if restricted
> >> featuresets are assigned to PV guests, and _CPUID Faulting_ is not available.
> >>
> >> It shall be the responsibility of the host administrator to avoid creating
> >> such a scenario, if the performance overhead is a concern.
> > .. and perhaps add warnings in the toolstack to tell the admin?
> 
> How and where would this surface?  xl/libxl is not designed to run the
> system as a whole.

Not sure. We have some code for silly NUMA configuration that tells the user
when they are picking the wrong option.
> 
> >
> >>
> >> Future work
> >> ===========
> >>
> >> The above is a minimum quantity of work to support feature levelling, but
> >> further problems exist.  They are acknowledged as being issues, but are not in
> >> scope for fixing as part of feature levelling.
> >>
> >> * Xen has no notion of per-cpu and per-package data in the cpuid policy.  In
> >>   particular, this causes issues for VMs attempting to detect topology, which
> >>   find inconsistent/incorrect cache information.
> >>
> >> * In the case that `domain_cpuid()` can't locate a leaf in the topology, it
> >>   will fall back to issuing a plain `CPUID` instruction.  This breaks VM
> >>   encapsulation, as a VM which has migrated can observe differences which
> >>   should be hidden.
> >>
> >> * There is currently a positioning issue with the domain's cpuid policy.
> >>   Verifying the register state requires the policy, but the policy is behind
> >>   the register state in the migration stream.  The domain's cpuid policy should
> >>   become an item in Xen's migration state for a VM.
> >
> > And potentially code in libxl to allow subset manipulation to allow
> > leveling across different platforms. As in the common features would
> > be exposed while all the other ones are masked? And I suppose some
> > format to stash this so it can be ingested by the libxl tools?
> 
> libxl's knowledge of multiple platforms is precisely nothing.  xl knows
> just enough to ssh and set up some pipes to push a VM through.
> 
> The domain configuration does have cpuid information in it.  That will

It does if the 'cpuid' configuration is present?

> be sufficient, given these proposed changes, to prevent running the VM
> on an incompatible destination.
> 
> ~Andrew


* Re: [DESIGN] Feature Levelling improvements
  2015-06-23 15:09     ` Konrad Rzeszutek Wilk
@ 2015-06-24 17:28       ` Andrew Cooper
  0 siblings, 0 replies; 7+ messages in thread
From: Andrew Cooper @ 2015-06-24 17:28 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Wei Liu, Ian Campbell, Tim Deegan, Ian Jackson, Xen-devel List,
	Marcos E. Matsunaga, Jan Beulich

On 23/06/15 16:09, Konrad Rzeszutek Wilk wrote:
>>>>     * Leaf 0x00000007[ECX=0].EAX
>>>> * `mask_l7s0_ebx`
>>> Those 'l' look like '1' in the PDF.
>>>
>>> Can it be called something else?
>> If you can suggest a better name, yes.  As for now, these are the
>> variable names used in-tree (top of xen/arch/x86/cpu/amd.c)
> low?

Low what?  l7s0 means "leaf 7 subleaf 0" which is the most accurate
description of it, given no specific name in either the Intel or AMD
references.

>>
>>>
>>>> VCPU context switch
>>>> -------------------
>>>>
>>>> Xen shall be updated to lazily context switch all available masking MSRs.  It
>>>> is noted that this shall incur a performance overhead if restricted
>>>> featuresets are assigned to PV guests, and _CPUID Faulting_ is not available.
>>>>
>>>> It shall be the responsibility of the host administrator to avoid creating
>>>> such a scenario, if the performance overhead is a concern.
>>> .. and perhaps add warnings in the toolstack to tell the admin?
>> How and where would this surface?  xl/libxl is not designed to run the
>> system as a whole.
> Not sure. We have some code for silly NUMA configuration that tells the user
> when they are picking the wrong option.

Well yes.  If xl isn't going to block the creation attempt (and it
absolutely shouldn't), this is something better left to some sort of
"host health check".

>>>> Future work
>>>> ===========
>>>>
>>>> The above is a minimum quantity of work to support feature levelling, but
>>>> further problems exist.  They are acknowledged as being issues, but are not in
>>>> scope for fixing as part of feature levelling.
>>>>
>>>> * Xen has no notion of per-cpu and per-package data in the cpuid policy.  In
>>>>   particular, this causes issues for VMs attempting to detect topology, which
>>>>   find inconsistent/incorrect cache information.
>>>>
>>>> * In the case that `domain_cpuid()` can't locate a leaf in the topology, it
>>>>   will fall back to issuing a plain `CPUID` instruction.  This breaks VM
>>>>   encapsulation, as a VM which has migrated can observe differences which
>>>>   should be hidden.
>>>>
>>>> * There is currently a positioning issue with the domain's cpuid policy.
>>>>   Verifying the register state requires the policy, but the policy is behind
>>>>   the register state in the migration stream.  The domain's cpuid policy should
>>>>   become an item in Xen's migration state for a VM.
>>> And potentially code in libxl to allow subset manipulation to allow
>>> leveling across different platforms. As in the common features would
>>> be exposed while all the other ones are masked? And I suppose some
>>> format to stash this so it can be ingested by the libxl tools?
>> libxl's knowledge of multiple platforms is precisely nothing.  xl knows
>> just enough to ssh and set up some pipes to push a VM through.
>>
>> The domain configuration does have cpuid information in it.  That will
> It does if the 'cpuid' configuration is present?

Correct.

~Andrew


Thread overview: 7+ messages
2015-06-16 10:50 [DESIGN] Feature Levelling improvements Andrew Cooper
2015-06-16 15:33 ` Jan Beulich
2015-06-16 16:45   ` Andrew Cooper
2015-06-22 19:18 ` Konrad Rzeszutek Wilk
2015-06-23  9:11   ` Andrew Cooper
2015-06-23 15:09     ` Konrad Rzeszutek Wilk
2015-06-24 17:28       ` Andrew Cooper
