* DESIGN: CPUID part 3
@ 2017-06-08 13:12 Andrew Cooper
  2017-06-08 13:47 ` Jan Beulich
                   ` (2 more replies)
  0 siblings, 3 replies; 19+ messages in thread
From: Andrew Cooper @ 2017-06-08 13:12 UTC (permalink / raw)
  To: Xen-devel
  Cc: Juergen Gross, Lan Tianyu, Kevin Tian, Stefano Stabellini,
	Wei Liu, George Dunlap, Andrew Cooper, Anshul Makkar,
	Ian Jackson, Tim Deegan, Euan Harris, Jan Beulich,
	Boris Ostrovsky, Sergey Dyasli, Joao Martins, Lai, Paul C

Presented herewith is a plan for the final part of CPUID work, which
primarily covers better Xen/Toolstack interaction for configuring the guest's
CPUID policy.

A PDF version of this document is available from:

http://xenbits.xen.org/people/andrewcoop/cpuid-part-3.pdf

There are a number of still-open questions, which I would appreciate views
on.

~Andrew

-----8<-----
% CPUID Handling (part 3)
% Revision 1

# Current state

At early boot, Xen enumerates the features it can see, takes into account
errata checks and command line arguments, and stores this information in the
`boot_cpu_data.x86_capability[]` bitmap.  This gets adjusted as APs boot up,
and is sanitised to disable all dependent leaf features.

At mid/late boot (before dom0 is constructed), Xen performs the necessary
calculations for guest cpuid handling.  Data are contained within the `struct
cpuid_policy` object, which is a representation of the architectural CPUID
information as specified by the Intel and AMD manuals.

There are a few global `cpuid_policy` objects.  First is the **raw_policy**
which is filled in from native `CPUID` instructions.  This represents what the
hardware is capable of, in its current firmware/microcode configuration.
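
For illustration, filling in such a raw policy amounts to little more than
iterating over the leaves with the `CPUID` instruction.  A minimal sketch
(the `raw_leaf` structure and `collect_raw_policy()` are illustrative names,
not Xen's actual code, which also covers the extended and Xen leaf ranges):

    #include <stdint.h>
    #include <cpuid.h>          /* GCC/Clang __cpuid()/__cpuid_count() */

    struct raw_leaf { uint32_t a, b, c, d; };

    /* Capture the basic leaves, much as a raw policy is filled in. */
    static void collect_raw_policy(struct raw_leaf *leaves, unsigned int nr)
    {
        uint32_t max_leaf, b, c, d;
        unsigned int leaf;

        __cpuid(0, max_leaf, b, c, d);

        for ( leaf = 0; leaf < nr && leaf <= max_leaf; ++leaf )
            __cpuid_count(leaf, 0, leaves[leaf].a, leaves[leaf].b,
                          leaves[leaf].c, leaves[leaf].d);
    }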

The next global object is **host_policy**, which is derived from the
**raw_policy** and `boot_cpu_data.x86_capability[]`. It represents the
features which Xen knows about and is using.  Next, the **pv_max_policy** and
**hvm_max_policy** are derived from the **host_policy**, and represent the
upper bounds available to guests.

The toolstack may query for the **{raw,host,pv,hvm}\_featureset** information
using _XEN\_SYSCTL\_get\_cpu\_featureset_.  This is a bitmap form of the feature
leaves only.
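
For reference, a toolstack query for one of these featuresets looks roughly
as follows.  This is a sketch against libxc; the exact
`xc_get_cpu_featureset()` prototype and the `XEN_SYSCTL_cpu_featureset_host`
index constant are assumed to match the in-tree headers:

    #include <stdio.h>
    #include <stdint.h>
    #include <xenctrl.h>

    int main(void)
    {
        xc_interface *xch = xc_interface_open(NULL, NULL, 0);
        uint32_t fs[16] = { 0 };             /* ample for the current ABI */
        uint32_t nr = sizeof(fs) / sizeof(fs[0]);
        uint32_t i;

        if ( !xch )
            return 1;

        /* Other index values select the raw, pv and hvm featuresets. */
        if ( !xc_get_cpu_featureset(xch, XEN_SYSCTL_cpu_featureset_host,
                                    &nr, fs) )
            for ( i = 0; i < nr; ++i )
                printf("word %2u: %08x\n", i, fs[i]);

        xc_interface_close(xch);
        return 0;
    }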

When a new domain is created, the appropriate **{pv,hvm}_max_policy** is
duplicated as a starting point, and can be subsequently mutated indirectly by
some hypercalls
(_XEN\_DOMCTL\_{set\_address\_size,disable\_migrate,settscinfo}_) or directly
by _XEN\_DOMCTL\_set\_cpuid_.


# Issues with the existing hypercalls

_XEN\_DOMCTL\_set\_cpuid_ doesn't have a return value which the domain builder
pays attention to.  This is because, before CPUID part 2, there were no
failure conditions, as Xen would accept all toolstack-provided data, and
attempt to audit it at the time it was requested by the guest.  To simplify
the part 2 work, this behaviour was maintained, although Xen was altered to
audit the data at hypercall time, typically zeroing out areas which failed the
audit.

There is no mechanism for the toolstack to query the CPUID configuration for a
specific domain.  Originally, the domain builder constructed a guest's CPUID
policy from first principles, using native `CPUID` instructions in the control
domain.  This functioned to an extent, but was subject to masking problems,
and is fundamentally incompatible with HVM control domains or the use of
_CPUID Faulting_ in newer Intel processors.

CPUID phase 1 introduced the featureset information, which provided an
architecturally sound mechanism for the toolstack to identify which features
are usable for guests.  However, the rest of the CPUID policy is still
generated from native `CPUID` instructions.

The `cpuid_policy` is per-domain information.  Most CPUID data is identical
across all CPUs.  Some data are dynamic, based on other control settings
(APIC, OSXSAVE, OSPKE, OSLWP), and Xen substitutes these appropriately when
the information is requested.  Other areas however are topology information,
including thread/core/socket layout, cache and TLB hierarchy.  These data are
inherited from whichever physical CPU the domain builder happened to be
running on when it was making calculations.  As a result, it is inappropriate
for the guest under construction, and usually entirely bogus when considered
alongside other data.
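
As an illustration of the dynamic substitution mentioned above, the OSXSAVE
handling is conceptually along the following lines.  This is a simplified
sketch rather than Xen's actual code; only the architectural bit positions
are taken from the manuals:

    #include <stdint.h>

    #define CPUID1_ECX_OSXSAVE  (1u << 27)   /* CPUID.1:ECX.OSXSAVE */
    #define CR4_OSXSAVE         (1u << 18)

    /* Fold dynamic state into the static policy value at the point the
     * guest executes CPUID, rather than storing it in the policy itself. */
    static uint32_t dynamic_leaf1_ecx(uint32_t policy_ecx, uint64_t guest_cr4)
    {
        uint32_t ecx = policy_ecx & ~CPUID1_ECX_OSXSAVE;

        if ( guest_cr4 & CR4_OSXSAVE )
            ecx |= CPUID1_ECX_OSXSAVE;

        return ecx;
    }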


# Other problems

There is no easy provision for features at different code maturity levels,
both in the hypervisor, and in the toolstack.

Some CPUID features have top-level command line options on the Xen command
line, but most do not.  On some hardware, some features can be hidden
indirectly by altering the `cpuid_mask_*` parameters.  This is a problem for
developing new features (which want to be off by default, but able to be
opted in to), for debugging (where it can sometimes be very useful to hide
features and see whether a problem reoccurs), and occasionally in security
circumstances (where disabling a feature outright is an easy stop-gap
solution).

From the toolstack side, given no other constraints, a guest gets the
hypervisor-max set of features.  This set of features is a trade off between
what is supported in the hypervisor, and which features can reasonably be
offered without impeding the migrateability of the guest.  There is little
provision for features which can be opted in to at the toolstack level, and
those which do exist are handled via ad-hoc means.


# Proposal

First and foremost, split the current **max\_policy** notion into separate
**max** and **default** policies.  This allows for the provision of features
which are unused by default, but may be opted in to, both at the hypervisor
level and the toolstack level.

At the hypervisor level, **max** constitutes all the features Xen can use on
the current hardware, while **default** is the subset thereof which are
supported features, the features which the user has explicitly opted in to,
and excluding any features the user has explicitly opted out of.
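
Expressed over the featureset bitmap words, the intended relationship is
roughly the following (a sketch; the names and word count are illustrative):

    #include <stdint.h>

    #define FS_WORDS 16                      /* illustrative, not the ABI */

    /* Xen's default = supported subset of max, plus command line opt-ins,
     * minus command line opt-outs. */
    static void calc_xen_default(const uint32_t *max,
                                 const uint32_t *supported,
                                 const uint32_t *opt_in,
                                 const uint32_t *opt_out,
                                 uint32_t *def)
    {
        unsigned int i;

        for ( i = 0; i < FS_WORDS; ++i )
            def[i] = ((max[i] & supported[i]) | opt_in[i]) & ~opt_out[i];
    }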

A new `cpuid=` command line option shall be introduced, whose internals are
generated automatically from the featureset ABI.  This means that all features
added to `include/public/arch-x86/cpufeatureset.h` automatically gain command
line control.  (RFC: The same top level option can probably be used for
non-feature CPUID data control, although I can't currently think of any cases
where this would be used.  Also, find a sensible way to express 'available but
not to be used by Xen', as per the current `smep` and `smap` options.)
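
The intended shape of the implementation is a name-to-bit table generated
alongside the featureset, driving a generic parser.  A rough sketch (the
names and bit positions below are placeholders, not the generated ABI):

    #include <stdint.h>
    #include <stdbool.h>
    #include <string.h>

    struct feature_name { const char *name; unsigned int bit; };

    /* In practice, generated from cpufeatureset.h. */
    static const struct feature_name feature_names[] = {
        { "smep", 5 * 32 +  7 },             /* placeholder positions */
        { "smap", 5 * 32 + 20 },
    };

    /* Parse one "cpuid=" item, e.g. "smep" or "no-smap". */
    static bool parse_cpuid_item(const char *s,
                                 uint32_t *opt_in, uint32_t *opt_out)
    {
        bool disable = !strncmp(s, "no-", 3);
        const char *name = disable ? s + 3 : s;
        unsigned int i, nr = sizeof(feature_names) / sizeof(*feature_names);

        for ( i = 0; i < nr; ++i )
            if ( !strcmp(name, feature_names[i].name) )
            {
                unsigned int bit = feature_names[i].bit;
                uint32_t *map = disable ? opt_out : opt_in;

                map[bit / 32] |= 1u << (bit % 32);
                return true;
            }

        return false;                        /* unrecognised feature name */
    }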


At the guest level, **max** constitutes all the features which can be offered
to each type of guest on this hardware.  Derived from Xen's **default**
policy, it includes the supported features, and the explicitly opted-in
features, which are appropriate for the guest.

The guest's **default** policy is then derived from its **max**, and includes
the supported features which are considered migration safe.  (RFC: This
distinction is rather fuzzy, but for example it wouldn't include things like
ITSC by default, as that is likely to go wrong unless special care is taken.)

All global policies (Xen and guest, max and default) shall be made available
to the toolstack, in a manner similar to the existing
_XEN\_SYSCTL\_get\_cpu\_featureset_ mechanism.  This allows decisions to be
taken which include all CPUID data, not just the feature bitmaps.
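
One plausible shape for such an interface is a flat array of leaf records,
in the style of the featureset interface.  The layout below is hypothetical,
purely to illustrate the idea, and is not a defined ABI:

    #include <stdint.h>

    /* Hypothetical serialised form of a cpuid_policy: one record per
     * (leaf, subleaf) pair, copied to/from toolstack-provided buffers. */
    typedef struct xen_cpuid_leaf {
        uint32_t leaf, subleaf;
        uint32_t a, b, c, d;
    } xen_cpuid_leaf_t;

A policy would then travel as a count plus an array of such records, which
keeps the interface extensible as new leaves are defined.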

New _XEN\_DOMCTL\_{get,set}\_cpuid\_policy_ hypercalls will be introduced,
which allows the toolstack to query and set the cpuid policy for a specific
domain.  It shall supersede _XEN\_DOMCTL\_set\_cpuid_, and shall fail if Xen is
unhappy with any aspect of the policy during auditing.

When a domain is initially created, the appropriate guest **default** policy
is duplicated for use.  When auditing, Xen shall audit the toolstack's
requested policy against the guest's **max** policy.  This allows experimental
features or non-migration-safe features to be opted in to, without those
features being imposed upon all guests automatically.
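
For the feature leaves, the audit reduces to a subset check against the
guest's max policy, returning an error rather than silently zeroing bits
(sketch; names are illustrative):

    #include <stdint.h>
    #include <errno.h>

    #define FS_WORDS 16                      /* illustrative */

    static int audit_featureset(const uint32_t *requested, const uint32_t *max)
    {
        unsigned int i;

        for ( i = 0; i < FS_WORDS; ++i )
            if ( requested[i] & ~max[i] )
                return -EINVAL;              /* feature outside guest max */

        return 0;
    }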

A guest's CPUID policy shall be immutable after construction.  This better
matches real hardware, and simplifies the logic in Xen to translate policy
alterations into configuration changes.

(RFC: Decide exactly where to fit this.  _XEN\_DOMCTL\_max\_vcpus_ perhaps?)
The toolstack shall also have a mechanism to explicitly select topology
configuration for the guest, which primarily affects the virtual APIC ID
layout, and has a knock on effect for the APIC ID of the virtual IO-APIC.
Xen's auditing shall ensure that guests observe values consistent with the
guarantees made by the vendor manuals.
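
As a worked example of the kind of layout being audited, a conventional
power-of-two topology derives each vCPU's initial APIC ID along these lines
(a sketch; the real topology leaves impose further constraints):

    #include <stdint.h>

    /* Number of APIC ID bits needed to represent n IDs at one level. */
    static unsigned int id_bits(unsigned int n)
    {
        unsigned int bits = 0;

        while ( (1u << bits) < n )
            ++bits;

        return bits;
    }

    static uint32_t vcpu_apic_id(unsigned int vcpu,
                                 unsigned int threads_per_core,
                                 unsigned int cores_per_socket)
    {
        unsigned int thread_bits = id_bits(threads_per_core);
        unsigned int core_bits   = id_bits(cores_per_socket);
        unsigned int thread = vcpu % threads_per_core;
        unsigned int core   = (vcpu / threads_per_core) % cores_per_socket;
        unsigned int socket = vcpu / (threads_per_core * cores_per_socket);

        return (socket << (core_bits + thread_bits)) |
               (core << thread_bits) | thread;
    }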

The `disable_migrate` field shall be dropped.  The concept of migrateability
is not boolean; it is a large spectrum, all of which needs to be managed by
the toolstack.  The simple case is picking the common subset of features
between the source and destination.  This becomes more complicated e.g. if the
guest uses LBR/LER, at which point the toolstack needs to consider hardware
with the same LBR/LER format in addition to just the plain features.

`disable_migrate` is currently only used to expose ITSC to guests, but there
are cases where it is perfectly safe to migrate such a guest, if the destination
host has the same TSC frequency or hardware TSC scaling support.
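
In bitmap terms, the simple case is just the intersection of the source and
destination featuresets; features such as ITSC then need extra predicates of
roughly the following shape (sketch; names are illustrative):

    #include <stdint.h>
    #include <stdbool.h>

    #define FS_WORDS 16                      /* illustrative */

    /* Simple case: offer the guest the common subset of both hosts. */
    static void common_features(const uint32_t *src, const uint32_t *dst,
                                uint32_t *out)
    {
        unsigned int i;

        for ( i = 0; i < FS_WORDS; ++i )
            out[i] = src[i] & dst[i];
    }

    /* ITSC-specific check: safe if the TSC frequencies match, or the
     * destination can scale the TSC in hardware. */
    static bool itsc_migration_ok(uint64_t src_tsc_khz, uint64_t dst_tsc_khz,
                                  bool dst_has_tsc_scaling)
    {
        return src_tsc_khz == dst_tsc_khz || dst_has_tsc_scaling;
    }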

Finally, `disable_migrate` isn't (and cannot reasonably be) used to inhibit
state gathering operations, as this interferes with debugging and monitoring
tasks.


* Re: DESIGN: CPUID part 3
  2017-06-08 13:12 DESIGN: CPUID part 3 Andrew Cooper
@ 2017-06-08 13:47 ` Jan Beulich
  2017-06-12 13:07   ` Andrew Cooper
  2017-06-09 12:24 ` Anshul Makkar
  2017-07-04 14:55 ` DESIGN v2: " Andrew Cooper
  2 siblings, 1 reply; 19+ messages in thread
From: Jan Beulich @ 2017-06-08 13:47 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Lan Tianyu, Sergey Dyasli, Kevin Tian, Stefano Stabellini,
	Wei Liu, Juergen Gross, George Dunlap, Tim Deegan, Anshul Makkar,
	Ian Jackson, Xen-devel, Euan Harris, Joao Martins,
	Boris Ostrovsky, Paul C Lai

>>> On 08.06.17 at 15:12, <andrew.cooper3@citrix.com> wrote:
> # Proposal
> 
> First and foremost, split the current **max\_policy** notion into separate
> **max** and **default** policies.  This allows for the provision of features
> which are unused by default, but may be opted in to, both at the hypervisor
> level and the toolstack level.
> 
> At the hypervisor level, **max** constitutes all the features Xen can use on
> the current hardware, while **default** is the subset thereof which are
> supported features, the features which the user has explicitly opted in to,
> and excluding any features the user has explicitly opted out of.
> 
> A new `cpuid=` command line option shall be introduced, whose internals are
> generated automatically from the featureset ABI.  This means that all features
> added to `include/public/arch-x86/cpufeatureset.h` automatically gain command
> line control.  (RFC: The same top level option can probably be used for
> non-feature CPUID data control, although I can't currently think of any cases
> where this would be used.  Also, find a sensible way to express 'available but
> not to be used by Xen', as per the current `smep` and `smap` options.)

Especially for disabling individual features I'm not sure "cpuid=" is
an appropriate name. After all CPUID is only a manifestation of
behavior elsewhere, and hence we don't really want CPUID
behavior be controlled, but behavior which CPUID output reflects.
I can't, however, think of an alternative name I would consider
more suitable.

> At the guest level, **max** constitutes all the features which can be offered
> to each type of guest on this hardware.  Derived from Xen's **default**
> policy, it includes the supported features and explicitly opted in to
> features, which are appropriate for the guest.

There's no provision here at all for features which hardware doesn't
offer, but which we can emulate in a reasonable way (UMIP being
the example I'd be thinking of right away). While perhaps this could
be viewed as being covered by "explicitly opted in to features", I think
it would be nice to make this explicit.

> The guests **default** policy is then derived from its **max**, and includes
> the supported features which are considered migration safe.  (RFC: This
> distinction is rather fuzzy, but for example it wouldn't include things like
> ITSC by default, as that is likely to go wrong unless special care is 
> taken.)

As per above I think the delta between max and default is larger
than just migration-unsafe pieces. Iirc for UMIP we would mean to
have it off by default at least in the case where emulation incurs
side effects.

> The `disable_migrate` field shall be dropped.  The concept of migrateability
> is not boolean; it is a large spectrum, all of which needs to be managed by
> the toolstack.  The simple case is picking the common subset of features
> between the source and destination.  This becomes more complicated e.g. if the
> guest uses LBR/LER, at which point the toolstack needs to consider hardware
> with the same LBR/LER format in addition to just the plain features.

Not sure about this - by intercepting the MSR accesses to the involved
MSRs, it would be possible to mimic the LBR/LER format expected by
the guest even if different from that of the host.

Jan



* Re: DESIGN: CPUID part 3
  2017-06-08 13:12 DESIGN: CPUID part 3 Andrew Cooper
  2017-06-08 13:47 ` Jan Beulich
@ 2017-06-09 12:24 ` Anshul Makkar
  2017-06-12 13:21   ` Andrew Cooper
  2017-07-04 14:55 ` DESIGN v2: " Andrew Cooper
  2 siblings, 1 reply; 19+ messages in thread
From: Anshul Makkar @ 2017-06-09 12:24 UTC (permalink / raw)
  To: Andrew Cooper, Xen-devel
  Cc: Juergen Gross, Lan Tianyu, Kevin Tian, Stefano Stabellini,
	Wei Liu, George Dunlap, Tim Deegan, Lai, Paul C, Ian Jackson,
	Euan Harris, Jan Beulich, Boris Ostrovsky, Sergey Dyasli,
	Joao Martins

On 08/06/2017 14:12, Andrew Cooper wrote:
> Presented herewith is the a plan for the final part of CPUID work, which
> primarily covers better Xen/Toolstack interaction for configuring the guests
> CPUID policy.
>
> A PDF version of this document is available from:
>
> http://xenbits.xen.org/people/andrewcoop/cpuid-part-3.pdf
>
> There are a number of still-open questions, which I would appreaciate views
> on.
>
> ~Andrew
>
>
> # Proposal
>
> First and foremost, split the current **max\_policy** notion into separate
> **max** and **default** policies.  This allows for the provision of features
> which are unused by default, but may be opted in to, both at the hypervisor
> level and the toolstack level.
>
> At the hypervisor level, **max** constitutes all the features Xen can use on
> the current hardware, while **default** is the subset thereof which are
> supported features, the features which the user has explicitly opted in to,
> and excluding any features the user has explicitly opted out of.
>
> A new `cpuid=` command line option shall be introduced, whose internals are
> generated automatically from the featureset ABI.  This means that all features
> added to `include/public/arch-x86/cpufeatureset.h` automatically gain command
> line control.  (RFC: The same top level option can probably be used for
> non-feature CPUID data control, although I can't currently think of any cases
> where this would be used.  Also, find a sensible way to express 'available but
> not to be used by Xen', as per the current `smep` and `smap` options.)
>
>
> At the guest level, **max** constitutes all the features which can be offered
> to each type of guest on this hardware.  Derived from Xen's **default**
> policy, it includes the supported features and explicitly opted in to
> features, which are appropriate for the guest.
>
> The guests **default** policy is then derived from its **max**, and includes
> the supported features which are considered migration safe.  (RFC: This
> distinction is rather fuzzy, but for example it wouldn't include things like
> ITSC by default, as that is likely to go wrong unless special care is taken.)
>
Just from another perspective, what happens to the features which have
been explicitly selected and are not migration safe?  Do we consider
them in the guest's default policy?

> All global policies (Xen and guest, max and default) shall be made available
> to the toolstack, in a manner similar to the existing
Instead of all, do you see any harm if we expose only the default
policies of Xen and the guest to the toolstack?
> _XEN\_SYSCTL\_get\_cpu\_featureset_ mechanism.  This allows decisions to be
> taken which include all CPUID data, not just the feature bitmaps.
>
> New _XEN\_DOMCTL\_{get,set}\_cpuid\_policy_ hypercalls will be introduced,
> which allows the toolstack to query and set the cpuid policy for a specific
> domain.  It shall supersede _XEN\_DOMCTL\_set\_cpuid_, shall fail if Xen is
> unhappy with any aspect of the policy during auditing.
>
> When a domain is initially created, the appropriate guests **default** policy
> is duplicated for use.  When auditing, Xen shall audit the toolstacks
> requested policy against the guests **max** policy.  This allows experimental
> features or non-migration-safe features to be opted in to, without those
> features being imposed upon all guests automatically.

>
> The `disable_migrate` field shall be dropped.  The concept of migrateability
> is not boolean; it is a large spectrum, all of which needs to be managed by
> the toolstack.
Can't this large spectrum result in a bool which can then be used for
disable_migrate?  Sorry, I can't see any value added in removing
disable_migrate.
>  The simple case is picking the common subset of features
> between the source and destination.  This becomes more complicated e.g. if the
> guest uses LBR/LER, at which point the toolstack needs to consider hardware
> with the same LBR/LER format in addition to just the plain features.
>
> `disable_migrate` is currently only used to expose ITSC to guests, but there
> are cases where it is perfectly safe to migrate such a guest, if the destination
> host has the same TSC frequency or hardware TSC scaling support.
>
> Finally, `disable_migrate` isn't (and cannot reasonably be) used to inhibit
> state gather operations, as this interferes with debugging and monitoring
> tasks.
>
Thanks
Anshul



* Re: DESIGN: CPUID part 3
  2017-06-08 13:47 ` Jan Beulich
@ 2017-06-12 13:07   ` Andrew Cooper
  2017-06-12 13:29     ` Jan Beulich
  0 siblings, 1 reply; 19+ messages in thread
From: Andrew Cooper @ 2017-06-12 13:07 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Lan Tianyu, Sergey Dyasli, Kevin Tian, Stefano Stabellini,
	Wei Liu, Juergen Gross, George Dunlap, Tim Deegan, Anshul Makkar,
	Ian Jackson, Xen-devel, Euan Harris, Joao Martins,
	Boris Ostrovsky, Paul C Lai

On 08/06/17 14:47, Jan Beulich wrote:
>>>> On 08.06.17 at 15:12, <andrew.cooper3@citrix.com> wrote:
>> # Proposal
>>
>> First and foremost, split the current **max\_policy** notion into separate
>> **max** and **default** policies.  This allows for the provision of features
>> which are unused by default, but may be opted in to, both at the hypervisor
>> level and the toolstack level.
>>
>> At the hypervisor level, **max** constitutes all the features Xen can use on
>> the current hardware, while **default** is the subset thereof which are
>> supported features, the features which the user has explicitly opted in to,
>> and excluding any features the user has explicitly opted out of.
>>
>> A new `cpuid=` command line option shall be introduced, whose internals are
>> generated automatically from the featureset ABI.  This means that all features
>> added to `include/public/arch-x86/cpufeatureset.h` automatically gain command
>> line control.  (RFC: The same top level option can probably be used for
>> non-feature CPUID data control, although I can't currently think of any cases
>> where this would be used.  Also, find a sensible way to express 'available but
>> not to be used by Xen', as per the current `smep` and `smap` options.)
> Especially for disabling individual features I'm not sure "cpuid=" is
> an appropriate name. After all CPUID is only a manifestation of
> behavior elsewhere, and hence we don't really want CPUID
> behavior be controlled, but behavior which CPUID output reflects.
> I can't, however, think of an alternative name I would consider
> more suitable.

I suppose I view it a little like "information contained within cpuid"=

I'm happy to use an alternative name if we can think of a better one,
but I definitely want a way to control every feature (rather than the
controls being ad-hoc), and don't want to introduce top level booleans
for each feature.

>
>> At the guest level, **max** constitutes all the features which can be offered
>> to each type of guest on this hardware.  Derived from Xen's **default**
>> policy, it includes the supported features and explicitly opted in to
>> features, which are appropriate for the guest.
> There's no provision here at all for features which hardware doesn't
> offer, but which we can emulate in a reasonable way (UMIP being
> the example I'd be thinking of right away). While perhaps this could
> be viewed to be covered by "explicitly opted in to features", I think
> it would be nice to make this explicit.

In this case, I'd include that within "the features which can be offered".

So far, there is only a single feature we emulate to guests without
hardware support, which is x2apic mode for HVM guests.

I should call this distinction out more clearly.

>
>> The guests **default** policy is then derived from its **max**, and includes
>> the supported features which are considered migration safe.  (RFC: This
>> distinction is rather fuzzy, but for example it wouldn't include things like
>> ITSC by default, as that is likely to go wrong unless special care is 
>> taken.)
> As per above I think the delta between max and default is larger
> than just migration-unsafe pieces. Iirc for UMIP we would mean to
> have it off by default at least in the case where emulation incurs
> side effects.

There is a lot of emulation overhead for UMIP on non-UMIP-capable
hardware.  I'd advocate for it needing to be opt-in at both the
hypervisor and toolstack level.  In general, I'd expect people to be
more wary of the added emulation than the information leak.

>
>> The `disable_migrate` field shall be dropped.  The concept of migrateability
>> is not boolean; it is a large spectrum, all of which needs to be managed by
>> the toolstack.  The simple case is picking the common subset of features
>> between the source and destination.  This becomes more complicated e.g. if the
>> guest uses LBR/LER, at which point the toolstack needs to consider hardware
>> with the same LBR/LER format in addition to just the plain features.
> Not sure about this - by intercepting the MSR accesses to the involved
> MSRs, it would be possible to mimic the LBR/LER format expected by
> the guest even if different from that of the host.

LER yes, but how would you emulate LBR?

You could set DBG_CTL.BTF/EFLAGS.TF and intercept #DB, but this would be
visible to the guest via pushf/popf.  It would also interfere with a
guest trying to single-step itself.

~Andrew


* Re: DESIGN: CPUID part 3
  2017-06-09 12:24 ` Anshul Makkar
@ 2017-06-12 13:21   ` Andrew Cooper
  0 siblings, 0 replies; 19+ messages in thread
From: Andrew Cooper @ 2017-06-12 13:21 UTC (permalink / raw)
  To: Anshul Makkar, Xen-devel
  Cc: Juergen Gross, Lan Tianyu, Kevin Tian, Stefano Stabellini,
	Wei Liu, George Dunlap, Tim Deegan, Lai, Paul C, Ian Jackson,
	Euan Harris, Jan Beulich, Boris Ostrovsky, Sergey Dyasli,
	Joao Martins

On 09/06/17 13:24, Anshul Makkar wrote:
> On 08/06/2017 14:12, Andrew Cooper wrote:
>> Presented herewith is the a plan for the final part of CPUID work, which
>> primarily covers better Xen/Toolstack interaction for configuring the
>> guests
>> CPUID policy.
>>
>> A PDF version of this document is available from:
>>
>> http://xenbits.xen.org/people/andrewcoop/cpuid-part-3.pdf
>>
>> There are a number of still-open questions, which I would appreaciate
>> views
>> on.
>>
>> ~Andrew
>>
>>
>> # Proposal
>>
>> First and foremost, split the current **max\_policy** notion into
>> separate
>> **max** and **default** policies.  This allows for the provision of
>> features
>> which are unused by default, but may be opted in to, both at the
>> hypervisor
>> level and the toolstack level.
>>
>> At the hypervisor level, **max** constitutes all the features Xen can
>> use on
>> the current hardware, while **default** is the subset thereof which are
>> supported features, the features which the user has explicitly opted
>> in to,
>> and excluding any features the user has explicitly opted out of.
>>
>> A new `cpuid=` command line option shall be introduced, whose
>> internals are
>> generated automatically from the featureset ABI.  This means that all
>> features
>> added to `include/public/arch-x86/cpufeatureset.h` automatically gain
>> command
>> line control.  (RFC: The same top level option can probably be used for
>> non-feature CPUID data control, although I can't currently think of
>> any cases
>> where this would be used.  Also, find a sensible way to express
>> 'available but
>> not to be used by Xen', as per the current `smep` and `smap` options.)
>>
>>
>> At the guest level, **max** constitutes all the features which can be
>> offered
>> to each type of guest on this hardware.  Derived from Xen's **default**
>> policy, it includes the supported features and explicitly opted in to
>> features, which are appropriate for the guest.
>>
>> The guests **default** policy is then derived from its **max**, and
>> includes
>> the supported features which are considered migration safe.  (RFC: This
>> distinction is rather fuzzy, but for example it wouldn't include
>> things like
>> ITSC by default, as that is likely to go wrong unless special care is
>> taken.)
>>
> Just from other perspective, what happens to the features which have
> been explicilty selected and are not migration safe ? Do, we consider
> them in guest's default policy.

Explicitly selected where?

Explicit selection at the Xen level is for using experimental/preview
features, while explicit selection at the toolstack level is for both
experimental/preview features, and using features which require more
care wrt migration.

>
>> All global policies (Xen and guest, max and default) shall be made
>> available
>> to the toolstack, in a manner similar to the existing
> Instead of all, do you see any harm if we expose only the default
> policies of Xen and Guest to toolstack.

The entire point of this work is to provide the toolstack with enough
information to work correctly.  Hiding the max policies is not an
option, as it prevents the toolstack from being able to work out whether
it can offer non-default features or not.

>> _XEN\_SYSCTL\_get\_cpu\_featureset_ mechanism.  This allows decisions
>> to be
>> taken which include all CPUID data, not just the feature bitmaps.
>>
>> New _XEN\_DOMCTL\_{get,set}\_cpuid\_policy_ hypercalls will be
>> introduced,
>> which allows the toolstack to query and set the cpuid policy for a
>> specific
>> domain.  It shall supersede _XEN\_DOMCTL\_set\_cpuid_, shall fail if
>> Xen is
>> unhappy with any aspect of the policy during auditing.
>>
>> When a domain is initially created, the appropriate guests
>> **default** policy
>> is duplicated for use.  When auditing, Xen shall audit the toolstacks
>> requested policy against the guests **max** policy.  This allows
>> experimental
>> features or non-migration-safe features to be opted in to, without those
>> features being imposed upon all guests automatically.
>
>>
>> The `disable_migrate` field shall be dropped.  The concept of
>> migrateability
>> is not boolean; it is a large spectrum, all of which needs to be
>> managed by
>> the toolstack.
> Can't this large spectrum result in a bool which can then be used for
> disable_migrate. Sorry, I can't see any value add in removing
> disable_migrate.

A spectrum is by definition not a single boolean.  What is unclear about
my argument here that disable_migrate is unfit for purpose?

~Andrew

>  The simple case is picking the common subset of features
>> between the source and destination.  This becomes more complicated
>> e.g. if the
>> guest uses LBR/LER, at which point the toolstack needs to consider
>> hardware
>> with the same LBR/LER format in addition to just the plain features.
>>
>> `disable_migrate` is currently only used to expose ITSC to guests,
>> but there
>> are cases where it is perfectly safe to migrate such a guest, if the
>> destination
>> host has the same TSC frequency or hardware TSC scaling support.
>>
>> Finally, `disable_migrate` isn't (and cannot reasonably be) used to
>> inhibit
>> state gather operations, as this interferes with debugging and
>> monitoring
>> tasks.
>>
>


* Re: DESIGN: CPUID part 3
  2017-06-12 13:07   ` Andrew Cooper
@ 2017-06-12 13:29     ` Jan Beulich
  2017-06-12 13:36       ` Andrew Cooper
  0 siblings, 1 reply; 19+ messages in thread
From: Jan Beulich @ 2017-06-12 13:29 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Lan Tianyu, Sergey Dyasli, Kevin Tian, Stefano Stabellini,
	Wei Liu, Juergen Gross, George Dunlap, Tim Deegan, Anshul Makkar,
	Ian Jackson, Xen-devel, Euan Harris, Joao Martins,
	Boris Ostrovsky, Paul C Lai

>>> On 12.06.17 at 15:07, <andrew.cooper3@citrix.com> wrote:
> On 08/06/17 14:47, Jan Beulich wrote:
>>>>> On 08.06.17 at 15:12, <andrew.cooper3@citrix.com> wrote:
>>> The `disable_migrate` field shall be dropped.  The concept of migrateability
>>> is not boolean; it is a large spectrum, all of which needs to be managed by
>>> the toolstack.  The simple case is picking the common subset of features
>>> between the source and destination.  This becomes more complicated e.g. if the
>>> guest uses LBR/LER, at which point the toolstack needs to consider hardware
>>> with the same LBR/LER format in addition to just the plain features.
>> Not sure about this - by intercepting the MSR accesses to the involved
>> MSRs, it would be possible to mimic the LBR/LER format expected by
>> the guest even if different from that of the host.
> 
> LER yes, but how would you emulate LBR?
> 
> You could set DBG_CTL.BTF/EFLAGS.TF and intercept #DB, but this would be
> visible to the guest via pushf/popf.  It would also interfere with a
> guest trying to single-step itself.

I don't understand: LBR is an MSR just like LER, and hence the
guest can't avoid using RDMSR to read its contents. If we
intercept that read, we can give them whatever format is
needed, without a need to intercept anything else. But maybe
I'm not seeing what you're getting at.

Jan



* Re: DESIGN: CPUID part 3
  2017-06-12 13:29     ` Jan Beulich
@ 2017-06-12 13:36       ` Andrew Cooper
  2017-06-12 13:42         ` Jan Beulich
  0 siblings, 1 reply; 19+ messages in thread
From: Andrew Cooper @ 2017-06-12 13:36 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Lan Tianyu, Sergey Dyasli, Kevin Tian, Stefano Stabellini,
	Wei Liu, Juergen Gross, George Dunlap, Tim Deegan, Anshul Makkar,
	Ian Jackson, Xen-devel, Euan Harris, Joao Martins,
	Boris Ostrovsky, Paul C Lai

On 12/06/17 14:29, Jan Beulich wrote:
>>>> On 12.06.17 at 15:07, <andrew.cooper3@citrix.com> wrote:
>> On 08/06/17 14:47, Jan Beulich wrote:
>>>>>> On 08.06.17 at 15:12, <andrew.cooper3@citrix.com> wrote:
>>>> The `disable_migrate` field shall be dropped.  The concept of migrateability
>>>> is not boolean; it is a large spectrum, all of which needs to be managed by
>>>> the toolstack.  The simple case is picking the common subset of features
>>>> between the source and destination.  This becomes more complicated e.g. if the
>>>> guest uses LBR/LER, at which point the toolstack needs to consider hardware
>>>> with the same LBR/LER format in addition to just the plain features.
>>> Not sure about this - by intercepting the MSR accesses to the involved
>>> MSRs, it would be possible to mimic the LBR/LER format expected by
>>> the guest even if different from that of the host.
>> LER yes, but how would you emulate LBR?
>>
>> You could set DBG_CTL.BTF/EFLAGS.TF and intercept #DB, but this would be
>> visible to the guest via pushf/popf.  It would also interfere with a
>> guest trying to single-step itself.
> I don't understand: LBR is an MSR just like LER, and hence the
> guest can't avoid using RDMSR to read its contents. If we
> intercept that read, we can give them whatever format is
> needed, without a need to intercept anything else. But maybe
> I'm not seeing what you're getting at.

To emulate it, we need to sample state at the point that the last
exception or branch happened.

You can't reverse the current value in hardware at the point of the
guest reading the LBR MSR to the value it should have been under a
different format.

~Andrew


* Re: DESIGN: CPUID part 3
  2017-06-12 13:36       ` Andrew Cooper
@ 2017-06-12 13:42         ` Jan Beulich
  2017-06-12 14:02           ` Andrew Cooper
  0 siblings, 1 reply; 19+ messages in thread
From: Jan Beulich @ 2017-06-12 13:42 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Lan Tianyu, Sergey Dyasli, Kevin Tian, Stefano Stabellini,
	Wei Liu, Juergen Gross, George Dunlap, Tim Deegan, Anshul Makkar,
	Ian Jackson, Xen-devel, Euan Harris, Joao Martins, Boris Ostrovsky,
	Paul C Lai

>>> On 12.06.17 at 15:36, <andrew.cooper3@citrix.com> wrote:
> On 12/06/17 14:29, Jan Beulich wrote:
>>>>> On 12.06.17 at 15:07, <andrew.cooper3@citrix.com> wrote:
>>> On 08/06/17 14:47, Jan Beulich wrote:
>>>>>>> On 08.06.17 at 15:12, <andrew.cooper3@citrix.com> wrote:
>>>>> The `disable_migrate` field shall be dropped.  The concept of migrateability
>>>>> is not boolean; it is a large spectrum, all of which needs to be managed by
>>>>> the toolstack.  The simple case is picking the common subset of features
>>>>> between the source and destination.  This becomes more complicated e.g. if 
> the
>>>>> guest uses LBR/LER, at which point the toolstack needs to consider hardware
>>>>> with the same LBR/LER format in addition to just the plain features.
>>>> Not sure about this - by intercepting the MSR accesses to the involved
>>>> MSRs, it would be possible to mimic the LBR/LER format expected by
>>>> the guest even if different from that of the host.
>>> LER yes, but how would you emulate LBR?
>>>
>>> You could set DBG_CTL.BTF/EFLAGS.TF and intercept #DB, but this would be
>>> visible to the guest via pushf/popf.  It would also interfere with a
>>> guest trying to single-step itself.
>> I don't understand: LBR is an MSR just like LER, and hence the
>> guest can't avoid using RDMSR to read its contents. If we
>> intercept that read, we can give them whatever format is
>> needed, without a need to intercept anything else. But maybe
>> I'm not seeing what you're getting at.
> 
> To emulate it, we need to sample state at the point that the last
> exception or branch happened.
> 
> You can't reverse the current value in hardware at the point of the
> guest reading the LBR MSR to the value it should have been under a
> different format.

Aren't we talking about correct (or at least unproblematic) top
bits of the value only? In which case the actual address bits
can be taken as is, and only the top bits need adjustment.

Jan



* Re: DESIGN: CPUID part 3
  2017-06-12 13:42         ` Jan Beulich
@ 2017-06-12 14:02           ` Andrew Cooper
  2017-06-12 14:18             ` Jan Beulich
  0 siblings, 1 reply; 19+ messages in thread
From: Andrew Cooper @ 2017-06-12 14:02 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Lan Tianyu, Sergey Dyasli, Kevin Tian, StefanoStabellini,
	Wei Liu, Juergen Gross, George Dunlap, TimDeegan, Anshul Makkar,
	IanJackson, Xen-devel, Euan Harris, JoaoMartins, Boris Ostrovsky,
	PaulC Lai

On 12/06/17 14:42, Jan Beulich wrote:
>>>> On 12.06.17 at 15:36, <andrew.cooper3@citrix.com> wrote:
>> On 12/06/17 14:29, Jan Beulich wrote:
>>>>>> On 12.06.17 at 15:07, <andrew.cooper3@citrix.com> wrote:
>>>> On 08/06/17 14:47, Jan Beulich wrote:
>>>>>>>> On 08.06.17 at 15:12, <andrew.cooper3@citrix.com> wrote:
>>>>>> The `disable_migrate` field shall be dropped.  The concept of migrateability
>>>>>> is not boolean; it is a large spectrum, all of which needs to be managed by
>>>>>> the toolstack.  The simple case is picking the common subset of features
>>>>>> between the source and destination.  This becomes more complicated e.g. if 
>> the
>>>>>> guest uses LBR/LER, at which point the toolstack needs to consider hardware
>>>>>> with the same LBR/LER format in addition to just the plain features.
>>>>> Not sure about this - by intercepting the MSR accesses to the involved
>>>>> MSRs, it would be possible to mimic the LBR/LER format expected by
>>>>> the guest even if different from that of the host.
>>>> LER yes, but how would you emulate LBR?
>>>>
>>>> You could set DBG_CTL.BTF/EFLAGS.TF and intercept #DB, but this would be
>>>> visible to the guest via pushf/popf.  It would also interfere with a
>>>> guest trying to single-step itself.
>>> I don't understand: LBR is an MSR just like LER, and hence the
>>> guest can't avoid using RDMSR to read its contents. If we
>>> intercept that read, we can give them whatever format is
>>> needed, without a need to intercept anything else. But maybe
>>> I'm not seeing what you're getting at.
>> To emulate it, we need to sample state at the point that the last
>> exception or branch happened.
>>
>> You can't reverse the current value in hardware at the point of the
>> guest reading the LBR MSR to the value it should have been under a
>> different format.
> Aren't we talking about correct (or at least unproblematic) top
> bits of the value only? In which case the actual address bits
> can be taken as is, and only the top bits need adjustment.

I'm completely confused.

My original statement was "if the guest uses LBR/LER, then migration
needs to be restricted to hardware with an identical LBR format".

You countered that, saying we could emulate LBR/LER as an alternative. 
The implication here is that we could alter the LBR format via
emulation, by cooking the value observed when the guest reads the LBR MSRs.

For the record, the formats are:

Software should query an architectural MSR IA32_PERF_CAPABILITIES[5:0]
about the format of the address that is stored in the LBR stack. Four
formats are defined by the following encoding:
* 000000B (32-bit record format) — Stores 32-bit offset in current CS of
respective source/destination,
* 000001B (64-bit LIP record format) — Stores 64-bit linear address of
respective source/destination,
* 000010B (64-bit EIP record format) — Stores 64-bit offset (effective
address) of respective source/destination.
* 000011B (64-bit EIP record format) and Flags — Stores 64-bit offset
(effective address) of respective source/destination. Misprediction info
is reported in the upper bit of 'FROM' registers in the LBR stack. See
LBR stack details below for flag support and definition.
* 000100B (64-bit EIP record format), Flags and TSX — Stores 64-bit
offset (effective address) of respective source/destination.
Misprediction and TSX info are reported in the upper bits of ‘FROM’
registers in the LBR stack.
* 000101B (64-bit EIP record format), Flags, TSX, LBR_INFO — Stores
64-bit offset (effective address) of respective source/destination.
Misprediction, TSX, and elapsed cycles since the last LBR update are
reported in the LBR_INFO MSR stack.
* 000110B (64-bit EIP record format), Flags, Cycles — Stores 64-bit
linear address (CS.Base + effective address) of respective
source/destination. Misprediction info is reported in the upper bits of
'FROM' registers in the LBR stack. Elapsed cycles since the
last LBR update are reported in the upper 16 bits of the 'TO' registers
in the LBR stack (see Section 17.6).

In general, I don't see any sensible way of being able to convert
between these formats at the point of an RDMSR.

~Andrew


* Re: DESIGN: CPUID part 3
  2017-06-12 14:02           ` Andrew Cooper
@ 2017-06-12 14:18             ` Jan Beulich
  0 siblings, 0 replies; 19+ messages in thread
From: Jan Beulich @ 2017-06-12 14:18 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Lan Tianyu, Sergey Dyasli, Kevin Tian, Stefano Stabellini,
	Wei Liu, Juergen Gross, George Dunlap, Tim Deegan, Anshul Makkar,
	Ian Jackson, Xen-devel, Euan Harris, Joao Martins, Boris Ostrovsky,
	Paul C Lai

>>> On 12.06.17 at 16:02, <andrew.cooper3@citrix.com> wrote:
> My original statement was "if the guest uses LBR/LER, then migration
> needs to be restricted to hardware with an identical LBR format".
> 
> You countered that, saying we could emulate LBR/LER as an alternative. 
> The implication here is that we could alter the LBR format via
> emulation, by cooking the value observed when the guest reads the LBR MSRs.
> 
> For the record, the formats are:
> 
> Software should query an architectural MSR IA32_PERF_CAPABILITIES[5:0]
> about the format of the address that is stored in the LBR stack. Four
> formats are defined by the following encoding:
> * 000000B (32-bit record format) — Stores 32-bit offset in current CS of
> respective source/destination,
> * 000001B (64-bit LIP record format) — Stores 64-bit linear address of
> respective source/destination,
> * 000010B (64-bit EIP record format) — Stores 64-bit offset (effective
> address) of respective source/destination.
> * 000011B (64-bit EIP record format) and Flags — Stores 64-bit offset
> (effective address) of respective source/destination. Misprediction info
> is reported in the upper bit of 'FROM' registers in the LBR stack. See
> LBR stack details below for flag support and definition.
> * 000100B (64-bit EIP record format), Flags and TSX — Stores 64-bit
> offset (effective address) of respective source/destination.
> Misprediction and TSX info are reported in the upper bits of ‘FROM’
> registers in the LBR stack.
> * 000101B (64-bit EIP record format), Flags, TSX, LBR_INFO — Stores
> 64-bit offset (effective address) of respective source/destination.
> Misprediction, TSX, and elapsed cycles since the last LBR update are
> reported in the LBR_INFO MSR stack.
> * 000110B (64-bit EIP record format), Flags, Cycles — Stores 64-bit
> linear address (CS.Base + effective address) of respective
> source/destination. Misprediction info is reported in the upper bits of
> 'FROM' registers in the LBR stack. Elapsed cycles since the
> last LBR update are reported in the upper 16 bits of the 'TO' registers
> in the LBR stack (see Section 17.6).
> 
> In general, I don't see any sensible way of being able to convert
> between these formats at the point of an RDMSR.

Hmm, I don't see a problem converting formats 3..6 to formats 0
or 2. I also don't think any misbehavior can possibly result when
converting 2 to 3 by simply always loading a fixed value into the
mis-prediction bit. Whether 2 can be converted sensibly to 4..6
would need to be determined. Format 1 clearly is the odd one out,
conversion to/from which would only be reasonable if we assumed
flat addressing everywhere (which obviously we can assume as
long as a guest stays in 64-bit mode).

It is also clear that format 6 won't survive the addition of 5-level
page tables, as there aren't enough bits to store a meaningful
cycle count.

Jan


* DESIGN v2: CPUID part 3
  2017-06-08 13:12 DESIGN: CPUID part 3 Andrew Cooper
  2017-06-08 13:47 ` Jan Beulich
  2017-06-09 12:24 ` Anshul Makkar
@ 2017-07-04 14:55 ` Andrew Cooper
  2017-07-05  9:46   ` Joao Martins
  2 siblings, 1 reply; 19+ messages in thread
From: Andrew Cooper @ 2017-07-04 14:55 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper

Presented herewith is a plan for the final part of CPUID work, which
primarily covers better Xen/Toolstack interaction for configuring the guest's
CPUID policy.

A PDF version of this document is available from:

http://xenbits.xen.org/people/andrewcoop/cpuid-part-3-rev2.pdf

Changes from v1:
 * Clarification of the interaction of emulated features
 * More information about the difference between max and default featuresets.

~Andrew

-----8<-----
% CPUID Handling (part 3)
% Revision 2

# Current state

At early boot, Xen enumerates the features it can see, takes into account
errata checks and command line arguments, and stores this information in the
`boot_cpu_data.x86_capability[]` bitmap.  This gets adjusted as APs boot up,
and is sanitised to disable all dependent leaf features.

At mid/late boot (before dom0 is constructed), Xen performs the necessary
calculations for guest cpuid handling.  Data are contained within the `struct
cpuid_policy` object, which is a representation of the architectural CPUID
information as specified by the Intel and AMD manuals.

There are a few global `cpuid_policy` objects.  First is the **raw_policy**
which is filled in from native `CPUID` instructions.  This represents what the
hardware is capable of, in its current firmware/microcode configuration.

The next global object is **host_policy**, which is derived from the
**raw_policy** and `boot_cpu_data.x86_capability[]`. It represents the
features which Xen knows about and is using.  The **host_policy** is
necessarily a subset of **raw_policy**.

The **pv_max_policy** and **hvm_max_policy** are derived from the
**host_policy**, and represent the upper bounds available to guests.
Generally speaking, the guest policies are less featureful than the
**host_policy** because there are features which Xen doesn't or cannot safely
provide to guests.  However, they are not subsets.  There are some features
(the HYPERVISOR bit for all guests, and X2APIC mode for HVM guests) which are
emulated in the absence of real hardware support.
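
In featureset terms, the guest max policies are therefore not a pure mask of
the host policy; the handful of emulated bits are OR'd back in afterwards.
A sketch, with illustrative names:

    #include <stdint.h>

    #define FS_WORDS 16                      /* illustrative */

    /* 'mask' removes features Xen can't or won't offer to this guest type;
     * 'emulated' re-adds the bits which need no hardware support (e.g. the
     * HYPERVISOR bit, and X2APIC mode for HVM guests). */
    static void calc_guest_max(const uint32_t *host, const uint32_t *mask,
                               const uint32_t *emulated, uint32_t *guest_max)
    {
        unsigned int i;

        for ( i = 0; i < FS_WORDS; ++i )
            guest_max[i] = (host[i] & mask[i]) | emulated[i];
    }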

The toolstack may query for the **{raw,host,pv,hvm}\_featureset** information
using _XEN\_SYSCTL\_get\_cpu\_featureset_.  This is a bitmap form of the feature
leaves only.

When a new domain is created, the appropriate **{pv,hvm}\_max_policy** is
duplicated as a starting point, and can be subsequently mutated indirectly by
some hypercalls
(_XEN\_DOMCTL\_{set\_address\_size,disable\_migrate,settscinfo}_) or directly
by _XEN\_DOMCTL\_set\_cpuid_.


# Issues with the existing hypercalls

_XEN\_DOMCTL\_set\_cpuid_ doesn't have a return value which the domain builder
pays attention to.  This is because, before CPUID part 2, there were no
failure conditions, as Xen would accept all toolstack-provided data, and
attempt to audit it at the time it was requested by the guest.  To simplify
the part 2 work, this behaviour was maintained, although Xen was altered to
audit the data at hypercall time, typically zeroing out areas which failed the
audit.

There is no mechanism for the toolstack to query the CPUID configuration for a
specific domain.  Originally, the domain builder constructed a guest's CPUID
policy from first principles, using native `CPUID` instructions in the control
domain.  This functioned to an extent, but was subject to masking problems,
and is fundamentally incompatible with HVM control domains or the use of
_CPUID Faulting_ in newer Intel processors.

CPUID phase 1 introduced the featureset information, which provided an
architecturally sound mechanism for the toolstack to identify which features
are usable for guests.  However, the rest of the CPUID policy is still
generated from native `CPUID` instructions.

The `cpuid_policy` is per-domain information.  Most CPUID data is identical
across all CPUs.  Some data are dynamic, based on other control settings
(APIC, OSXSAVE, OSPKE, OSLWP), and Xen substitutes these appropriately when
the information is requested.  Other areas however are topology information,
including thread/core/socket layout, cache and TLB hierarchy.  These data are
inherited from whichever physical CPU the domain builder happened to be
running on when it was making calculations.  As a result, it is inappropriate
for the guest under construction, and usually entirely bogus when considered
alongside other data.


# Other problems

There is no easy provision for features at different code maturity levels,
both in the hypervisor, and in the toolstack.

Some CPUID features have top-level command line options on the Xen command
line, but most do not.  On some hardware, some features can be hidden
indirectly by altering the `cpuid_mask_*` parameters.  This is a problem for
developing new features (which want to be off by default, but able to be
opted in to), for debugging (where it can sometimes be very useful to hide
features and see whether a problem reoccurs), and occasionally in security
circumstances (where disabling a feature outright is an easy stop-gap
solution).

From the toolstack side, given no other constraints, a guest gets the
hypervisor-max set of features.  This set of features is a trade off between
what is supported in the hypervisor, and which features can reasonably be
offered without impeding the migrateability of the guest.  There is little
provision for features which can be opted in to at the toolstack level, and
those which do exist are handled via ad-hoc means.


# Proposal

First and foremost, split the current **max\_policy** notion into separate
**max** and **default** policies.  This allows for the provision of features
which are unused by default, but may be opted in to, both at the hypervisor
level and the toolstack level.

At the hypervisor level, **max** constitutes all the features Xen can use on
the current hardware, while **default** is the subset thereof which are
supported features, the features which the user has explicitly opted in to,
and excluding any features the user has explicitly opted out of.

A new `cpuid=` command line option shall be introduced, whose internals are
generated automatically from the featureset ABI.  This means that all features
added to `include/public/arch-x86/cpufeatureset.h` automatically gain command
line control.  (RFC: The same top level option can probably be used for
non-feature CPUID data control, although I can't currently think of any cases
where this would be used.  Also, find a sensible way to express 'available but
not to be used by Xen', as per the current `smep` and `smap` options.)


At the guest level, the **max** policy is conceptually unchanged.  It
constitutes all the features Xen is willing to offer to each type of guest on
the current hardware (including emulated features).  However, it shall instead
be derived from Xen's **default** host policy.  This is to ensure that
experimental hypervisor features must be opted in to at the Xen level before
they can be opted in to at the toolstack level.

The guest's **default** policy is then derived from its **max**.  This is
because there are some features which should always be explicitly opted in to
by the toolstack, such as emulated features which come with a security
trade-off, or for non-architectural features which may differ in
implementation in heterogeneous environments.

All global policies (Xen and guest, max and default) shall be made available
to the toolstack, in a manner similar to the existing
_XEN\_SYSCTL\_get\_cpu\_featureset_ mechanism.  This allows decisions to be
taken which include all CPUID data, not just the feature bitmaps.

New _XEN\_DOMCTL\_{get,set}\_cpuid\_policy_ hypercalls will be introduced,
which allows the toolstack to query and set the cpuid policy for a specific
domain.  It shall supersede _XEN\_DOMCTL\_set\_cpuid_, and shall fail if Xen
is unhappy with any aspect of the policy during auditing.  This provides
feedback to the user that a chosen combination will not work, rather than the
guest booting in an unexpected state.

When a domain is initially created, the appropriate guest **default** policy
is duplicated for use.  When auditing, Xen shall audit the toolstack's
requested policy against the guest's **max** policy.  This allows experimental
features or non-migration-safe features to be opted in to, without those
features being imposed upon all guests automatically.

A guest's CPUID policy shall be immutable after construction.  This better
matches real hardware, and simplifies the logic in Xen to translate policy
alterations into configuration changes.

(RFC: Decide exactly where to fit this.  _XEN\_DOMCTL\_max\_vcpus_ perhaps?)
The toolstack shall also have a mechanism to explicitly select topology
configuration for the guest, which primarily affects the virtual APIC ID
layout, and has a knock on effect for the APIC ID of the virtual IO-APIC.
Xen's auditing shall ensure that guests observe values consistent with the
guarantees made by the vendor manuals.

The `disable_migrate` field shall be dropped.  The concept of migrateability
is not boolean; it is a large spectrum, all of which needs to be managed by
the toolstack.  The simple case is picking the common subset of features
between the source and destination.  This becomes more complicated e.g. if the
guest uses LBR/LER, at which point the toolstack needs to consider hardware
with the same LBR/LER format in addition to just the plain features.

`disable_migrate` is currently only used to expose ITSC to guests, but there
are cases where it is perfectly safe to migrate such a guest, if the destination
host has the same TSC frequency or hardware TSC scaling support.

Finally, `disable_migrate` isn't (and cannot reasonably be) used to inhibit
state gathering operations, as this interferes with debugging and monitoring
tasks.


* Re: DESIGN v2: CPUID part 3
  2017-07-04 14:55 ` DESIGN v2: " Andrew Cooper
@ 2017-07-05  9:46   ` Joao Martins
  2017-07-05 10:32     ` Joao Martins
  2017-07-05 11:16     ` Andrew Cooper
  0 siblings, 2 replies; 19+ messages in thread
From: Joao Martins @ 2017-07-05  9:46 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Xen-devel

Hey Andrew,

On 07/04/2017 03:55 PM, Andrew Cooper wrote:
> Presented herewith is the a plan for the final part of CPUID work, which
> primarily covers better Xen/Toolstack interaction for configuring the guests
> CPUID policy.
> 
Really nice write up, a few comments below.

> A PDF version of this document is available from:
> 
> http://xenbits.xen.org/people/andrewcoop/cpuid-part-3-rev2.pdf
> 
> Changes from v1:
>  * Clarification of the interaction of emulated features
>  * More information about the difference between max and default featuresets.
> 
> ~Andrew
> 
> -----8<-----
> % CPUID Handling (part 3)
> % Revision 2
> 
> # Current state
> 
> At early boot, Xen enumerates the features it can see, takes into account
> errata checks and command line arguments, and stores this information in the
> `boot_cpu_data.x86_capability[]` bitmap.  This gets adjusted as APs boot up,
> and is sanitised to disable all dependent leaf features.
> 
> At mid/late boot (before dom0 is constructed), Xen performs the necessary
> calculations for guest cpuid handling.  Data are contained within the `struct
> cpuid_policy` object, which is a representation of the architectural CPUID
> information as specified by the Intel and AMD manuals.
> 
> There are a few global `cpuid_policy` objects.  First is the **raw_policy**
> which is filled in from native `CPUID` instructions.  This represents what the
> hardware is capable of, in its current firmware/microcode configuration.
> 
> The next global object is **host_policy**, which is derived from the
> **raw_policy** and `boot_cpu_data.x86_capability[]`. It represents the
> features which Xen knows about and is using.  The **host_policy** is
> necessarily a subset of **raw_policy**.
> 
> The **pv_max_policy** and **hvm_max_policy** are derived from the
> **host_policy**, and represent the upper bounds available to guests.
> Generally speaking, the guest policies are less featurefull than the
> **host_policy** because there are features which Xen doesn't or cannot safely
> provide to guests.  However, they are not subsets.  There are some features
> (the HYPERVISOR bit for all guests, and X2APIC mode for HVM guests) which are
> emulated in the absence of real hardware support.
> 
> The toolstack may query for the **{raw,host,pv,hvm}\_featureset** information
> using _XEN\_SYSCTL\_get\_cpu\_featureset_.  This is bitmap form of the feature
> leaves only.
> 
> When a new domain is created, the appropriate **{pv,hvm}\_max_policy** is
> duplicated as a starting point, and can be subsequently mutated indirectly by
> some hypercalls
> (_XEN\_DOMCTL\_{set\_address\_size,disable\_migrate,settscinfo}_) or directly
> by _XEN\_DOMCTL\_set\_cpuid_.
> 
> 
> # Issues with the existing hypercalls
> 
> _XEN\_DOMCTL\_set\_cpuid_ doesn't have a return value which the domain builder
> pays attention to.  This is because, before CPUID part 2, there were no
> failure conditions, as Xen would accept all toolstack-provided data, and
> attempt to audit it at the time it was requested by the guest.  To simplify
> the part 2 work, this behaviour was maintained, although Xen was altered to
> audit the data at hypercall time, typically zeroing out areas which failed the
> audit.
> 
> There is no mechanism for the toolstack to query the CPUID configuration for a
> specific domain.  Originally, the domain builder constructed a guests CPUID
> policy from first principles, using native `CPUID` instructions in the control
> domain.  This functioned to an extent, but was subject to masking problems,
> and is fundamentally incompatible with HVM control domains or the use of
> _CPUID Faulting_ in newer Intel processors.
> 
> CPUID phase 1 introduced the featureset information, which provided an
> architecturally sound mechanism for the toolstack to identify which features
> are usable for guests.  However, the rest of the CPUID policy is still
> generated from native `CPUID` instructions.
> 
> The `cpuid_policy` is per-domain information.  Most CPUID data is identical
> across all CPUs.  Some data are dynamic, based on other control settings
> (APIC, OSXSAVE, OSPKE, OSLWP), and Xen substitutes these appropriately when
> the information is requested..  Other areas however are topology information,
> including thread/core/socket layout, cache and TLB hierarchy.  These data are
> inherited from whichever physical CPU the domain builder happened to be
> running on when it was making calculations.  As a result, it is inappropriate
> for the guest under construction, and usually entirely bogus when considered
> alongside other data.
> 
> 
> # Other problems
> 
> There is no easy provision for features at different code maturity levels,
> both in the hypervisor, and in the toolstack.
> 
> Some CPUID features have top-level command line options on the Xen command
> line, but most do not.  On some hardware, some features can be hidden
> indirectly by altering the `cpuid_mask_*` parameters.  This is a problem for
> developing new features (which want to be off-by-default but able to be opted
> in to), debugging, where it can sometimes be very useful to hide features and
> see if a problem reoccurs, and occasionally in security circumstances, where
> disabling a feature outright is an easy stop-gap solution.
> 
> From the toolstack side, given no other constraints, a guest gets the
> hypervisor-max set of features.  This set of features is a trade off between
> what is supported in the hypervisor, and which features can reasonably be
> offered without impeding the migrateability of the guest.  There is little
> provision for features which can be opted in to at the toolstack level, and
> those that are are done so via ad-hoc means.
> 
> 
> # Proposal
> 
> First and foremost, split the current **max\_policy** notion into separate
> **max** and **default** policies.  This allows for the provision of features
> which are unused by default, but may be opted in to, both at the hypervisor
> level and the toolstack level.
> 
> At the hypervisor level, **max** constitutes all the features Xen can use on
> the current hardware, while **default** is the subset thereof which are
> supported features, the features which the user has explicitly opted in to,
> and excluding any features the user has explicitly opted out of.
> 
> A new `cpuid=` command line option shall be introduced, whose internals are
> generated automatically from the featureset ABI.  This means that all features
> added to `include/public/arch-x86/cpufeatureset.h` automatically gain command
> line control.  (RFC: The same top level option can probably be used for
> non-feature CPUID data control, although I can't currently think of any cases
> where this would be used Also find a sensible way to express 'available but
> not to be used by Xen', as per the current `smep` and `smap` options.)
> 
> 
> At the guest level, the **max** policy is conceptually unchanged.  It
> constitutes all the features Xen is willing to offer to each type of guest on
> the current hardware (including emulated features).  However, it shall instead
> be derived from Xen's **default** host policy.  This is to ensure that
> experimental hypervisor features must be opted in to at the Xen level before
> they can be opted in to at the toolstack level.
> 
> The guests **default** policy is then derived from its **max**.  This is
> because there are some features which should always be explicitly opted in to
> by the toolstack, such as emulated features which come with a security
> trade-off, or for non-architectural features which may differ in
> implementation in heterogeneous environments.
> 
> All global policies (Xen and guest, max and default) shall be made available
> to the toolstack, in a manner similar to the existing
> _XEN\_SYSCTL\_get\_cpu\_featureset_ mechanism.  This allows decisions to be
> taken which include all CPUID data, not just the feature bitmaps.
> 
> New _XEN\_DOMCTL\_{get,set}\_cpuid\_policy_ hypercalls will be introduced,
> which allows the toolstack to query and set the cpuid policy for a specific
> domain.  It shall supersede _XEN\_DOMCTL\_set\_cpuid_, and shall fail if Xen
> is unhappy with any aspect of the policy during auditing.  This provides
> feedback to the user that a chosen combination will not work, rather than the
> guest booting in an unexpected state.
> 
> When a domain is initially created, the appropriate guests **default** policy
> is duplicated for use.  When auditing, Xen shall audit the toolstacks
> requested policy against the guests **max** policy.  This allows experimental
> features or non-migration-safe features to be opted in to, without those
> features being imposed upon all guests automatically.
> 
> A guests CPUID policy shall be immutable after construction.  This better
> matches real hardware, and simplifies the logic in Xen to translate policy
> alterations into configuration changes.
> 

This appears to be a suitable abstraction even for higher-level toolstacks
(libxl). At least I can imagine libvirt fetching the PV/HVM max policies and
comparing them between different servers when the user computes the guest cpu
config (the normalized one), then using the common denominator as the guest
policy. Higher-level toolstacks could probably even use these policy
constructs to build the idea of models, such that the user could easily choose
one for a pool of hosts with different families. But the discussion here is
more focused on xc <-> Xen, so I won't clobber the discussion with libxl remarks.

> (RFC: Decide exactly where to fit this.  _XEN\_DOMCTL\_max\_vcpus_ perhaps?)
> The toolstack shall also have a mechanism to explicitly select topology
> configuration for the guest, which primarily affects the virtual APIC ID
> layout, and has a knock on effect for the APIC ID of the virtual IO-APIC.
> Xen's auditing shall ensure that guests observe values consistent with the
> guarantees made by the vendor manuals.
> 
Why choose the max_vcpus domctl?

With multiple sockets/nodes, and with the extended topology leaf supported,
the APIC ID layout will change considerably and require fixup if... say we set
vNUMA (I know a NUMA node != socket spec-wise, but on the machines we have
seen so far it's a 1:1 mapping).

Another question, since we are speaking about topology: how do we make
hvmloader aware of the APIC ID layout? Right now it is hardcoded as
APIC_ID = 2 * vcpu_id :( Probably a xenstore entry 'hvmloader/cputopology-threads'
and 'hvmloader/cputopology-sockets' (or use vnuma_topo.nr_nodes for the latter)?

This all brings me to the question of perhaps a separate domctl?

> The `disable_migrate` field shall be dropped.  The concept of migrateability
> is not boolean; it is a large spectrum, all of which needs to be managed by
> the toolstack.  The simple case is picking the common subset of features
> between the source and destination.  This becomes more complicated e.g. if the
> guest uses LBR/LER, at which point the toolstack needs to consider hardware
> with the same LBR/LER format in addition to just the plain features.
> 
> `disable_migrate` is currently only used to expose ITSC to guests, but there
> are cases where is perfectly safe to migrate such a guest, if the destination
> host has the same TSC frequency or hardware TSC scaling support.
> 
> Finally, `disable_migrate` doesn't (and cannot reasonably) be used to inhibit
> state gather operations, as this interferes with debugging and monitoring
> tasks.



* Re: DESIGN v2: CPUID part 3
  2017-07-05  9:46   ` Joao Martins
@ 2017-07-05 10:32     ` Joao Martins
  2017-07-05 11:16     ` Andrew Cooper
  1 sibling, 0 replies; 19+ messages in thread
From: Joao Martins @ 2017-07-05 10:32 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Xen-devel

On 07/05/2017 10:46 AM, Joao Martins wrote:
> Hey Andrew,
> 
> On 07/04/2017 03:55 PM, Andrew Cooper wrote:
>> Presented herewith is the a plan for the final part of CPUID work, which
>> primarily covers better Xen/Toolstack interaction for configuring the guests
>> CPUID policy.
>>
> Really nice write up, a few comments below.
> 
>> A PDF version of this document is available from:
>>
>> http://xenbits.xen.org/people/andrewcoop/cpuid-part-3-rev2.pdf
>>
>> Changes from v1:
>>  * Clarification of the interaction of emulated features
>>  * More information about the difference between max and default featuresets.
>>
>> ~Andrew
>>
>> -----8<-----

[snip]

>> # Proposal
>>
>> First and foremost, split the current **max\_policy** notion into separate
>> **max** and **default** policies.  This allows for the provision of features
>> which are unused by default, but may be opted in to, both at the hypervisor
>> level and the toolstack level.
>>
>> At the hypervisor level, **max** constitutes all the features Xen can use on
>> the current hardware, while **default** is the subset thereof which are
>> supported features, the features which the user has explicitly opted in to,
>> and excluding any features the user has explicitly opted out of.
>>
>> A new `cpuid=` command line option shall be introduced, whose internals are
>> generated automatically from the featureset ABI.  This means that all features
>> added to `include/public/arch-x86/cpufeatureset.h` automatically gain command
>> line control.  (RFC: The same top level option can probably be used for
>> non-feature CPUID data control, although I can't currently think of any cases
>> where this would be used Also find a sensible way to express 'available but
>> not to be used by Xen', as per the current `smep` and `smap` options.)
>>
>>
>> At the guest level, the **max** policy is conceptually unchanged.  It
>> constitutes all the features Xen is willing to offer to each type of guest on
>> the current hardware (including emulated features).  However, it shall instead
>> be derived from Xen's **default** host policy.  This is to ensure that
>> experimental hypervisor features must be opted in to at the Xen level before
>> they can be opted in to at the toolstack level.
>>
>> The guests **default** policy is then derived from its **max**.  This is
>> because there are some features which should always be explicitly opted in to
>> by the toolstack, such as emulated features which come with a security
>> trade-off, or for non-architectural features which may differ in
>> implementation in heterogeneous environments.
>>
>> All global policies (Xen and guest, max and default) shall be made available
>> to the toolstack, in a manner similar to the existing
>> _XEN\_SYSCTL\_get\_cpu\_featureset_ mechanism.  This allows decisions to be
>> taken which include all CPUID data, not just the feature bitmaps.
>>
>> New _XEN\_DOMCTL\_{get,set}\_cpuid\_policy_ hypercalls will be introduced,
>> which allows the toolstack to query and set the cpuid policy for a specific
>> domain.  It shall supersede _XEN\_DOMCTL\_set\_cpuid_, and shall fail if Xen
>> is unhappy with any aspect of the policy during auditing.  This provides
>> feedback to the user that a chosen combination will not work, rather than the
>> guest booting in an unexpected state.
>>
>> When a domain is initially created, the appropriate guests **default** policy
>> is duplicated for use.  When auditing, Xen shall audit the toolstacks
>> requested policy against the guests **max** policy.  This allows experimental
>> features or non-migration-safe features to be opted in to, without those
>> features being imposed upon all guests automatically.
>>
>> A guests CPUID policy shall be immutable after construction.  This better
>> matches real hardware, and simplifies the logic in Xen to translate policy
>> alterations into configuration changes.
>>
> 
> This appears to be a suitable abstraction even for higher level toolstacks
> (libxl). At least I can imagine libvirt fetching the PV/HVM max policy, and
> compare them between different servers when user computes the guest cpu config
> (the normalized one) and use the common denominator as the guest policy.
> Probably higher level toolstack could even use these said policies constructs
> and built the idea of models such that the user could easily choose one for a
> pool of hosts with different families. But the discussion here is more focused
> on xc <-> Xen so I won't clobber discussion with libxl remarks.
> 
>> (RFC: Decide exactly where to fit this.  _XEN\_DOMCTL\_max\_vcpus_ perhaps?)
>> The toolstack shall also have a mechanism to explicitly select topology
>> configuration for the guest, which primarily affects the virtual APIC ID
>> layout, and has a knock on effect for the APIC ID of the virtual IO-APIC.
>> Xen's auditing shall ensure that guests observe values consistent with the
>> guarantees made by the vendor manuals.
>>
> Why choose max_vcpus domctl?
> 
> With multiple sockets/nodes and having supported extended topology leaf the APIC
> ID layout will change considerably requiring fixup if... say we set vNUMA (I
> know numa node != socket spec wise, but on the machines we have seen so far,
> it's a 1:1 mapping).
> 
> Another question since we are speaking about topology is would be: how do we
> make hvmloader aware of each the APIC_ID layout? Right now, it is too hardcoded
> 2 * APIC_ID :( Probably a xenstore entry 'hvmloader/cputopology-threads' and
> 'hvmloader/cputopology-sockets' (or use vnuma_topo.nr_nodes for the latter)?
> 
> This all brings me to the question of perhaps a separate domctl?

"perhaps a separate domctl" as opposed to the max_vcpus domctl. Just to give
better context and clarify that of the sentence wasn't referring to hvmloader.

Joao


* Re: DESIGN v2: CPUID part 3
  2017-07-05  9:46   ` Joao Martins
  2017-07-05 10:32     ` Joao Martins
@ 2017-07-05 11:16     ` Andrew Cooper
  2017-07-05 13:22       ` Joao Martins
  1 sibling, 1 reply; 19+ messages in thread
From: Andrew Cooper @ 2017-07-05 11:16 UTC (permalink / raw)
  To: Joao Martins; +Cc: Xen-devel

On 05/07/17 10:46, Joao Martins wrote:
> Hey Andrew,
>
> On 07/04/2017 03:55 PM, Andrew Cooper wrote:
>> Presented herewith is the a plan for the final part of CPUID work, which
>> primarily covers better Xen/Toolstack interaction for configuring the guests
>> CPUID policy.
>>
> Really nice write up, a few comments below.
>
>> A PDF version of this document is available from:
>>
>> http://xenbits.xen.org/people/andrewcoop/cpuid-part-3-rev2.pdf
>>
>> Changes from v1:
>>  * Clarification of the interaction of emulated features
>>  * More information about the difference between max and default featuresets.
>>
>> ~Andrew
>>
>> -----8<-----
>> % CPUID Handling (part 3)
>> % Revision 2
>>
>> # Current state
>>
>> At early boot, Xen enumerates the features it can see, takes into account
>> errata checks and command line arguments, and stores this information in the
>> `boot_cpu_data.x86_capability[]` bitmap.  This gets adjusted as APs boot up,
>> and is sanitised to disable all dependent leaf features.
>>
>> At mid/late boot (before dom0 is constructed), Xen performs the necessary
>> calculations for guest cpuid handling.  Data are contained within the `struct
>> cpuid_policy` object, which is a representation of the architectural CPUID
>> information as specified by the Intel and AMD manuals.
>>
>> There are a few global `cpuid_policy` objects.  First is the **raw_policy**
>> which is filled in from native `CPUID` instructions.  This represents what the
>> hardware is capable of, in its current firmware/microcode configuration.
>>
>> The next global object is **host_policy**, which is derived from the
>> **raw_policy** and `boot_cpu_data.x86_capability[]`. It represents the
>> features which Xen knows about and is using.  The **host_policy** is
>> necessarily a subset of **raw_policy**.
>>
>> The **pv_max_policy** and **hvm_max_policy** are derived from the
>> **host_policy**, and represent the upper bounds available to guests.
>> Generally speaking, the guest policies are less featurefull than the
>> **host_policy** because there are features which Xen doesn't or cannot safely
>> provide to guests.  However, they are not subsets.  There are some features
>> (the HYPERVISOR bit for all guests, and X2APIC mode for HVM guests) which are
>> emulated in the absence of real hardware support.
>>
>> The toolstack may query for the **{raw,host,pv,hvm}\_featureset** information
>> using _XEN\_SYSCTL\_get\_cpu\_featureset_.  This is bitmap form of the feature
>> leaves only.
>>
>> When a new domain is created, the appropriate **{pv,hvm}\_max_policy** is
>> duplicated as a starting point, and can be subsequently mutated indirectly by
>> some hypercalls
>> (_XEN\_DOMCTL\_{set\_address\_size,disable\_migrate,settscinfo}_) or directly
>> by _XEN\_DOMCTL\_set\_cpuid_.
>>
>>
>> # Issues with the existing hypercalls
>>
>> _XEN\_DOMCTL\_set\_cpuid_ doesn't have a return value which the domain builder
>> pays attention to.  This is because, before CPUID part 2, there were no
>> failure conditions, as Xen would accept all toolstack-provided data, and
>> attempt to audit it at the time it was requested by the guest.  To simplify
>> the part 2 work, this behaviour was maintained, although Xen was altered to
>> audit the data at hypercall time, typically zeroing out areas which failed the
>> audit.
>>
>> There is no mechanism for the toolstack to query the CPUID configuration for a
>> specific domain.  Originally, the domain builder constructed a guests CPUID
>> policy from first principles, using native `CPUID` instructions in the control
>> domain.  This functioned to an extent, but was subject to masking problems,
>> and is fundamentally incompatible with HVM control domains or the use of
>> _CPUID Faulting_ in newer Intel processors.
>>
>> CPUID phase 1 introduced the featureset information, which provided an
>> architecturally sound mechanism for the toolstack to identify which features
>> are usable for guests.  However, the rest of the CPUID policy is still
>> generated from native `CPUID` instructions.
>>
>> The `cpuid_policy` is per-domain information.  Most CPUID data is identical
>> across all CPUs.  Some data are dynamic, based on other control settings
>> (APIC, OSXSAVE, OSPKE, OSLWP), and Xen substitutes these appropriately when
>> the information is requested..  Other areas however are topology information,
>> including thread/core/socket layout, cache and TLB hierarchy.  These data are
>> inherited from whichever physical CPU the domain builder happened to be
>> running on when it was making calculations.  As a result, it is inappropriate
>> for the guest under construction, and usually entirely bogus when considered
>> alongside other data.
>>
>>
>> # Other problems
>>
>> There is no easy provision for features at different code maturity levels,
>> both in the hypervisor, and in the toolstack.
>>
>> Some CPUID features have top-level command line options on the Xen command
>> line, but most do not.  On some hardware, some features can be hidden
>> indirectly by altering the `cpuid_mask_*` parameters.  This is a problem for
>> developing new features (which want to be off-by-default but able to be opted
>> in to), debugging, where it can sometimes be very useful to hide features and
>> see if a problem reoccurs, and occasionally in security circumstances, where
>> disabling a feature outright is an easy stop-gap solution.
>>
>> From the toolstack side, given no other constraints, a guest gets the
>> hypervisor-max set of features.  This set of features is a trade off between
>> what is supported in the hypervisor, and which features can reasonably be
>> offered without impeding the migrateability of the guest.  There is little
>> provision for features which can be opted in to at the toolstack level, and
>> those that are are done so via ad-hoc means.
>>
>>
>> # Proposal
>>
>> First and foremost, split the current **max\_policy** notion into separate
>> **max** and **default** policies.  This allows for the provision of features
>> which are unused by default, but may be opted in to, both at the hypervisor
>> level and the toolstack level.
>>
>> At the hypervisor level, **max** constitutes all the features Xen can use on
>> the current hardware, while **default** is the subset thereof which are
>> supported features, the features which the user has explicitly opted in to,
>> and excluding any features the user has explicitly opted out of.
>>
>> A new `cpuid=` command line option shall be introduced, whose internals are
>> generated automatically from the featureset ABI.  This means that all features
>> added to `include/public/arch-x86/cpufeatureset.h` automatically gain command
>> line control.  (RFC: The same top level option can probably be used for
>> non-feature CPUID data control, although I can't currently think of any cases
>> where this would be used Also find a sensible way to express 'available but
>> not to be used by Xen', as per the current `smep` and `smap` options.)
>>
>>
>> At the guest level, the **max** policy is conceptually unchanged.  It
>> constitutes all the features Xen is willing to offer to each type of guest on
>> the current hardware (including emulated features).  However, it shall instead
>> be derived from Xen's **default** host policy.  This is to ensure that
>> experimental hypervisor features must be opted in to at the Xen level before
>> they can be opted in to at the toolstack level.
>>
>> The guests **default** policy is then derived from its **max**.  This is
>> because there are some features which should always be explicitly opted in to
>> by the toolstack, such as emulated features which come with a security
>> trade-off, or for non-architectural features which may differ in
>> implementation in heterogeneous environments.
>>
>> All global policies (Xen and guest, max and default) shall be made available
>> to the toolstack, in a manner similar to the existing
>> _XEN\_SYSCTL\_get\_cpu\_featureset_ mechanism.  This allows decisions to be
>> taken which include all CPUID data, not just the feature bitmaps.
>>
>> New _XEN\_DOMCTL\_{get,set}\_cpuid\_policy_ hypercalls will be introduced,
>> which allows the toolstack to query and set the cpuid policy for a specific
>> domain.  It shall supersede _XEN\_DOMCTL\_set\_cpuid_, and shall fail if Xen
>> is unhappy with any aspect of the policy during auditing.  This provides
>> feedback to the user that a chosen combination will not work, rather than the
>> guest booting in an unexpected state.
>>
>> When a domain is initially created, the appropriate guests **default** policy
>> is duplicated for use.  When auditing, Xen shall audit the toolstacks
>> requested policy against the guests **max** policy.  This allows experimental
>> features or non-migration-safe features to be opted in to, without those
>> features being imposed upon all guests automatically.
>>
>> A guests CPUID policy shall be immutable after construction.  This better
>> matches real hardware, and simplifies the logic in Xen to translate policy
>> alterations into configuration changes.
>>
> This appears to be a suitable abstraction even for higher level toolstacks
> (libxl). At least I can imagine libvirt fetching the PV/HVM max policy, and
> compare them between different servers when user computes the guest cpu config
> (the normalized one) and use the common denominator as the guest policy.
> Probably higher level toolstack could even use these said policies constructs
> and built the idea of models such that the user could easily choose one for a
> pool of hosts with different families. But the discussion here is more focused
> on xc <-> Xen so I won't clobber discussion with libxl remarks.

One thing I haven't decided on yet is how to represent the policy at a
higher level.  Somewhere (probably libxc), I am going to need to
implement is_policy_compatible(a, b), and calculate_compatible_policy(a,
b, res), which will definitely be needed by Xapi, and will probably be
useful to other higher level toolstacks.
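
As a very rough sketch of the shape I have in mind (the policy structure and
its fields below are placeholders, not a settled representation):

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define FEATURESET_NR_ENTRIES 16        /* illustrative size only */

/* Placeholder policy representation: a feature bitmap plus a couple of
 * non-feature fields which also need compatibility checks. */
struct policy {
    uint32_t featureset[FEATURESET_NR_ENTRIES];
    uint32_t max_leaf;
    uint8_t  x86_vendor;
};

/* Can a guest built against policy 'a' run under policy 'b'? */
bool is_policy_compatible(const struct policy *a, const struct policy *b)
{
    if ( a->x86_vendor != b->x86_vendor || a->max_leaf > b->max_leaf )
        return false;

    for ( unsigned int i = 0; i < FEATURESET_NR_ENTRIES; i++ )
        if ( a->featureset[i] & ~b->featureset[i] )
            return false;

    return true;
}

/* Compute the largest policy usable under both 'a' and 'b'. */
void calculate_compatible_policy(const struct policy *a,
                                 const struct policy *b, struct policy *res)
{
    memset(res, 0, sizeof(*res));
    res->x86_vendor = a->x86_vendor;    /* assumes the vendors already match */
    res->max_leaf = a->max_leaf < b->max_leaf ? a->max_leaf : b->max_leaf;

    for ( unsigned int i = 0; i < FEATURESET_NR_ENTRIES; i++ )
        res->featureset[i] = a->featureset[i] & b->featureset[i];
}
```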

>
>> (RFC: Decide exactly where to fit this.  _XEN\_DOMCTL\_max\_vcpus_ perhaps?)
>> The toolstack shall also have a mechanism to explicitly select topology
>> configuration for the guest, which primarily affects the virtual APIC ID
>> layout, and has a knock on effect for the APIC ID of the virtual IO-APIC.
>> Xen's auditing shall ensure that guests observe values consistent with the
>> guarantees made by the vendor manuals.
>>
> Why choose max_vcpus domctl?

Despite its name, the max_vcpus hypercall is the one which allocates all
the vcpus in the hypervisor.  I don't want there to be any opportunity
for vcpus to exist but no topology information to have been provided.

>
> With multiple sockets/nodes and having supported extended topology leaf the APIC
> ID layout will change considerably requiring fixup if... say we set vNUMA (I
> know numa node != socket spec wise, but on the machines we have seen so far,
> it's a 1:1 mapping).

AMD Fam15h and later (may) have multiple NUMA nodes per socket, which
will need to be accounted for in how the information is represented,
especially in leaf 0x8000001e.

Intel on the other hand (as far as I can tell), has no interaction
between NUMA and topology as far as CPUID is concerned.

> Another question since we are speaking about topology is would be: how do we
> make hvmloader aware of each the APIC_ID layout? Right now, it is too hardcoded
> 2 * APIC_ID :( Probably a xenstore entry 'hvmloader/cputopology-threads' and
> 'hvmloader/cputopology-sockets' (or use vnuma_topo.nr_nodes for the latter)?

ACPI table writing is in the toolstack now, but even if it weren't,
HVMLoader would have to do what all real firmware needs to do, and look
at CPUID.
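
For reference, the sort of thing firmware has to do is walk the extended
topology leaf to recover the layout.  A minimal, illustrative sketch of
reading Intel's leaf 0xB using the GCC/Clang cpuid.h helpers (HVMLoader would
use its own cpuid wrapper instead):

```c
#include <cpuid.h>      /* __cpuid_count() */
#include <stdint.h>
#include <stdio.h>

/* Walk CPUID leaf 0xB to recover the APIC ID layout on the current CPU. */
static void read_topology(void)
{
    uint32_t eax, ebx, ecx, edx;
    unsigned int level = 0, smt_shift = 0, core_shift = 0;

    for ( ;; )
    {
        __cpuid_count(0xb, level, eax, ebx, ecx, edx);

        unsigned int type = (ecx >> 8) & 0xff;    /* 1 = SMT, 2 = Core */

        if ( type == 0 )    /* No further levels enumerated. */
            break;
        if ( type == 1 )
            smt_shift = eax & 0x1f;   /* APIC ID bits occupied by threads */
        else if ( type == 2 )
            core_shift = eax & 0x1f;  /* bits occupied by threads + cores */

        level++;
    }

    __cpuid_count(0xb, 0, eax, ebx, ecx, edx);
    printf("x2APIC ID %u, SMT shift %u, core shift %u\n",
           edx, smt_shift, core_shift);
}
```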

> This all brings me to the question of perhaps a separate domctl?

I specifically want to avoid having a separate hypercall for this
information.

~Andrew


* Re: DESIGN v2: CPUID part 3
  2017-07-05 11:16     ` Andrew Cooper
@ 2017-07-05 13:22       ` Joao Martins
  2017-07-31 19:49         ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 19+ messages in thread
From: Joao Martins @ 2017-07-05 13:22 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Xen-devel

On 07/05/2017 12:16 PM, Andrew Cooper wrote:
> On 05/07/17 10:46, Joao Martins wrote:
>> Hey Andrew,
>>
>> On 07/04/2017 03:55 PM, Andrew Cooper wrote:
>>> Presented herewith is the a plan for the final part of CPUID work, which
>>> primarily covers better Xen/Toolstack interaction for configuring the guests
>>> CPUID policy.
>>>
>> Really nice write up, a few comments below.
>>
>>> A PDF version of this document is available from:
>>>
>>> http://xenbits.xen.org/people/andrewcoop/cpuid-part-3-rev2.pdf
>>>
>>> Changes from v1:
>>>  * Clarification of the interaction of emulated features
>>>  * More information about the difference between max and default featuresets.
>>>
>>> ~Andrew
>>>
>>> -----8<-----
>>> % CPUID Handling (part 3)
>>> % Revision 2
>>>

[snip]

>>> # Proposal
>>>
>>> First and foremost, split the current **max\_policy** notion into separate
>>> **max** and **default** policies.  This allows for the provision of features
>>> which are unused by default, but may be opted in to, both at the hypervisor
>>> level and the toolstack level.
>>>
>>> At the hypervisor level, **max** constitutes all the features Xen can use on
>>> the current hardware, while **default** is the subset thereof which are
>>> supported features, the features which the user has explicitly opted in to,
>>> and excluding any features the user has explicitly opted out of.
>>>
>>> A new `cpuid=` command line option shall be introduced, whose internals are
>>> generated automatically from the featureset ABI.  This means that all features
>>> added to `include/public/arch-x86/cpufeatureset.h` automatically gain command
>>> line control.  (RFC: The same top level option can probably be used for
>>> non-feature CPUID data control, although I can't currently think of any cases
>>> where this would be used Also find a sensible way to express 'available but
>>> not to be used by Xen', as per the current `smep` and `smap` options.)
>>>
>>>
>>> At the guest level, the **max** policy is conceptually unchanged.  It
>>> constitutes all the features Xen is willing to offer to each type of guest on
>>> the current hardware (including emulated features).  However, it shall instead
>>> be derived from Xen's **default** host policy.  This is to ensure that
>>> experimental hypervisor features must be opted in to at the Xen level before
>>> they can be opted in to at the toolstack level.
>>>
>>> The guests **default** policy is then derived from its **max**.  This is
>>> because there are some features which should always be explicitly opted in to
>>> by the toolstack, such as emulated features which come with a security
>>> trade-off, or for non-architectural features which may differ in
>>> implementation in heterogeneous environments.
>>>
>>> All global policies (Xen and guest, max and default) shall be made available
>>> to the toolstack, in a manner similar to the existing
>>> _XEN\_SYSCTL\_get\_cpu\_featureset_ mechanism.  This allows decisions to be
>>> taken which include all CPUID data, not just the feature bitmaps.
>>>
>>> New _XEN\_DOMCTL\_{get,set}\_cpuid\_policy_ hypercalls will be introduced,
>>> which allows the toolstack to query and set the cpuid policy for a specific
>>> domain.  It shall supersede _XEN\_DOMCTL\_set\_cpuid_, and shall fail if Xen
>>> is unhappy with any aspect of the policy during auditing.  This provides
>>> feedback to the user that a chosen combination will not work, rather than the
>>> guest booting in an unexpected state.
>>>
>>> When a domain is initially created, the appropriate guests **default** policy
>>> is duplicated for use.  When auditing, Xen shall audit the toolstacks
>>> requested policy against the guests **max** policy.  This allows experimental
>>> features or non-migration-safe features to be opted in to, without those
>>> features being imposed upon all guests automatically.
>>>
>>> A guests CPUID policy shall be immutable after construction.  This better
>>> matches real hardware, and simplifies the logic in Xen to translate policy
>>> alterations into configuration changes.
>>>
>> This appears to be a suitable abstraction even for higher level toolstacks
>> (libxl). At least I can imagine libvirt fetching the PV/HVM max policy, and
>> compare them between different servers when user computes the guest cpu config
>> (the normalized one) and use the common denominator as the guest policy.
>> Probably higher level toolstack could even use these said policies constructs
>> and built the idea of models such that the user could easily choose one for a
>> pool of hosts with different families. But the discussion here is more focused
>> on xc <-> Xen so I won't clobber discussion with libxl remarks.
> 
> One thing I haven't decided on yet is how to represent the policy at a
> higher level.  Somewhere (probably libxc), I am going to need to
> implement is_policy_compatible(a, b), and calculate_compatible_policy(a,
> b, res), which will definitely be needed by Xapi, and will probably be
> useful to other higher level toolstacks.
>
I had initially intended for libxl to keep this sort of logic when I was looking
at the topic, but with the problems depicted above, libxc is probably better
suited to have this.

>>> (RFC: Decide exactly where to fit this.  _XEN\_DOMCTL\_max\_vcpus_ perhaps?)
>>> The toolstack shall also have a mechanism to explicitly select topology
>>> configuration for the guest, which primarily affects the virtual APIC ID
>>> layout, and has a knock on effect for the APIC ID of the virtual IO-APIC.
>>> Xen's auditing shall ensure that guests observe values consistent with the
>>> guarantees made by the vendor manuals.
>>>
>> Why choose max_vcpus domctl?
> 
> Despite its name, the max_vcpus hypercall is the one which allocates all
> the vcpus in the hypervisor.  I don't want there to be any opportunity
> for vcpus to exist but no topology information to have been provided.
> 
/nods

So then, doing this at vcpu allocation, we would need to pass an additional
CPU topology argument on the max_vcpus hypercall? Otherwise it's sort of
guesswork wrt sockets, cores, threads ... no?

There could be other uses for passing this info to Xen too. For example, if
the scheduler knew the guest CPU topology, it could make a better selection of
core+sibling pairs, matching the cache/cpu topology passed to the guest (for
unpinned SMT guests).

>>
>> With multiple sockets/nodes and having supported extended topology leaf the APIC
>> ID layout will change considerably requiring fixup if... say we set vNUMA (I
>> know numa node != socket spec wise, but on the machines we have seen so far,
>> it's a 1:1 mapping).
> 
> AMD Fam15h and later (may) have multiple NUMA nodes per socket, which
> will need to be accounted for in how the information is represented,
> especially in leaf 0x8000001e.
> 
> Intel on the other hand (as far as I can tell), has no interaction
> between NUMA and topology as far as CPUID is concerned.
>
Sorry, I should probably have mentioned earlier that the "machines we have seen
so far" were Intel - I am a bit unaware of the additional possibilities on AMD.

>> Another question since we are speaking about topology is would be: how do we
>> make hvmloader aware of each the APIC_ID layout? Right now, it is too hardcoded
>> 2 * APIC_ID :( Probably a xenstore entry 'hvmloader/cputopology-threads' and
>> 'hvmloader/cputopology-sockets' (or use vnuma_topo.nr_nodes for the latter)?
> 
> ACPI table writing is in the toolstack now, but even if it weren't,
> HVMLoader would have to do what all real firmware needs to do, and look
> at CPUID.
> 
Right, but the MP tables (and local APIC IDs) are still adjusted/created by
hvmloader, unless of course I am reading it wrong. But anyhow - if you're
planning for this to be based on CPUID, that is certainly more correct than
what I had suggested earlier, though it needs a bit more surgery on hvmloader.

>> This all brings me to the question of perhaps a separate domctl?
> 
> I specifically want to avoid having a separate hypercall for this
> information.
> 
OK.

Joao


* Re: DESIGN v2: CPUID part 3
  2017-07-05 13:22       ` Joao Martins
@ 2017-07-31 19:49         ` Konrad Rzeszutek Wilk
  2017-08-01 18:34           ` Andrew Cooper
  0 siblings, 1 reply; 19+ messages in thread
From: Konrad Rzeszutek Wilk @ 2017-07-31 19:49 UTC (permalink / raw)
  To: Joao Martins; +Cc: Andrew Cooper, Xen-devel

On Wed, Jul 05, 2017 at 02:22:00PM +0100, Joao Martins wrote:
> On 07/05/2017 12:16 PM, Andrew Cooper wrote:
> > On 05/07/17 10:46, Joao Martins wrote:
> >> Hey Andrew,
> >>
> >> On 07/04/2017 03:55 PM, Andrew Cooper wrote:
> >>> Presented herewith is the a plan for the final part of CPUID work, which
> >>> primarily covers better Xen/Toolstack interaction for configuring the guests
> >>> CPUID policy.
> >>>
> >> Really nice write up, a few comments below.
> >>
> >>> A PDF version of this document is available from:
> >>>
> >>> http://xenbits.xen.org/people/andrewcoop/cpuid-part-3-rev2.pdf
> >>>
> >>> Changes from v1:
> >>>  * Clarification of the interaction of emulated features
> >>>  * More information about the difference between max and default featuresets.
> >>>
> >>> ~Andrew
> >>>
> >>> -----8<-----
> >>> % CPUID Handling (part 3)
> >>> % Revision 2
> >>>
> 
> [snip]
> 
> >>> # Proposal
> >>>
> >>> First and foremost, split the current **max\_policy** notion into separate
> >>> **max** and **default** policies.  This allows for the provision of features
> >>> which are unused by default, but may be opted in to, both at the hypervisor
> >>> level and the toolstack level.
> >>>
> >>> At the hypervisor level, **max** constitutes all the features Xen can use on
> >>> the current hardware, while **default** is the subset thereof which are
> >>> supported features, the features which the user has explicitly opted in to,
> >>> and excluding any features the user has explicitly opted out of.
> >>>
> >>> A new `cpuid=` command line option shall be introduced, whose internals are
> >>> generated automatically from the featureset ABI.  This means that all features
> >>> added to `include/public/arch-x86/cpufeatureset.h` automatically gain command
> >>> line control.  (RFC: The same top level option can probably be used for
> >>> non-feature CPUID data control, although I can't currently think of any cases
> >>> where this would be used Also find a sensible way to express 'available but
> >>> not to be used by Xen', as per the current `smep` and `smap` options.)
> >>>
> >>>
> >>> At the guest level, the **max** policy is conceptually unchanged.  It
> >>> constitutes all the features Xen is willing to offer to each type of guest on
> >>> the current hardware (including emulated features).  However, it shall instead
> >>> be derived from Xen's **default** host policy.  This is to ensure that
> >>> experimental hypervisor features must be opted in to at the Xen level before
> >>> they can be opted in to at the toolstack level.
> >>>
> >>> The guests **default** policy is then derived from its **max**.  This is
> >>> because there are some features which should always be explicitly opted in to
> >>> by the toolstack, such as emulated features which come with a security
> >>> trade-off, or for non-architectural features which may differ in
> >>> implementation in heterogeneous environments.
> >>>
> >>> All global policies (Xen and guest, max and default) shall be made available
> >>> to the toolstack, in a manner similar to the existing
> >>> _XEN\_SYSCTL\_get\_cpu\_featureset_ mechanism.  This allows decisions to be
> >>> taken which include all CPUID data, not just the feature bitmaps.
> >>>
> >>> New _XEN\_DOMCTL\_{get,set}\_cpuid\_policy_ hypercalls will be introduced,
> >>> which allows the toolstack to query and set the cpuid policy for a specific
> >>> domain.  It shall supersede _XEN\_DOMCTL\_set\_cpuid_, and shall fail if Xen
> >>> is unhappy with any aspect of the policy during auditing.  This provides
> >>> feedback to the user that a chosen combination will not work, rather than the
> >>> guest booting in an unexpected state.
> >>>
> >>> When a domain is initially created, the appropriate guests **default** policy
> >>> is duplicated for use.  When auditing, Xen shall audit the toolstacks
> >>> requested policy against the guests **max** policy.  This allows experimental
> >>> features or non-migration-safe features to be opted in to, without those
> >>> features being imposed upon all guests automatically.
> >>>
> >>> A guests CPUID policy shall be immutable after construction.  This better
> >>> matches real hardware, and simplifies the logic in Xen to translate policy
> >>> alterations into configuration changes.
> >>>
> >> This appears to be a suitable abstraction even for higher level toolstacks
> >> (libxl). At least I can imagine libvirt fetching the PV/HVM max policy, and
> >> compare them between different servers when user computes the guest cpu config
> >> (the normalized one) and use the common denominator as the guest policy.
> >> Probably higher level toolstack could even use these said policies constructs
> >> and built the idea of models such that the user could easily choose one for a
> >> pool of hosts with different families. But the discussion here is more focused
> >> on xc <-> Xen so I won't clobber discussion with libxl remarks.
> > 
> > One thing I haven't decided on yet is how to represent the policy at a
> > higher level.  Somewhere (probably libxc), I am going to need to
> > implement is_policy_compatible(a, b), and calculate_compatible_policy(a,
> > b, res), which will definitely be needed by Xapi, and will probably be
> > useful to other higher level toolstacks.
> >
> I had initially intended for libxl to keep this sort of logic when I was looking
> at the topic, but with the problems depicted above, libxc is probably better
> suited to have this.
> 
> >>> (RFC: Decide exactly where to fit this.  _XEN\_DOMCTL\_max\_vcpus_ perhaps?)
> >>> The toolstack shall also have a mechanism to explicitly select topology
> >>> configuration for the guest, which primarily affects the virtual APIC ID
> >>> layout, and has a knock on effect for the APIC ID of the virtual IO-APIC.
> >>> Xen's auditing shall ensure that guests observe values consistent with the
> >>> guarantees made by the vendor manuals.
> >>>
> >> Why choose max_vcpus domctl?
> > 
> > Despite its name, the max_vcpus hypercall is the one which allocates all
> > the vcpus in the hypervisor.  I don't want there to be any opportunity
> > for vcpus to exist but no topology information to have been provided.
> > 
> /nods
> 
> So then doing this at vcpus allocation we would need to pass an additional CPU
> topology argument on the max_vcpus hypercall? Otherwise it's sort of guess work
> wrt sockets, cores, threads ... no?

Andrew, thoughts on this and the one below?

> 
> There could be other uses too on passing this info to Xen, say e.g. the
> scheduler knowing the guest CPU topology it would allow better selection of
> core+sibling pair such that it could match cache/cpu topology passed on the
> guest (for unpinned SMT guests).
> 
> >>
> >> With multiple sockets/nodes and having supported extended topology leaf the APIC
> >> ID layout will change considerably requiring fixup if... say we set vNUMA (I
> >> know numa node != socket spec wise, but on the machines we have seen so far,
> >> it's a 1:1 mapping).
> > 
> > AMD Fam15h and later (may) have multiple NUMA nodes per socket, which
> > will need to be accounted for in how the information is represented,
> > especially in leaf 0x8000001e.
> > 
> > Intel on the other hand (as far as I can tell), has no interaction
> > between NUMA and topology as far as CPUID is concerned.
> >
> Sorry, I should probably have mentioned earlier that "machines we have seen so
> far" were Intel - I am bit unaware of the AMD added possibilities.
> 
> >> Another question since we are speaking about topology is would be: how do we
> >> make hvmloader aware of each the APIC_ID layout? Right now, it is too hardcoded
> >> 2 * APIC_ID :( Probably a xenstore entry 'hvmloader/cputopology-threads' and
> >> 'hvmloader/cputopology-sockets' (or use vnuma_topo.nr_nodes for the latter)?
> > 
> > ACPI table writing is in the toolstack now, but even if it weren't,
> > HVMLoader would have to do what all real firmware needs to do, and look
> > at CPUID.

I think real hardware, when constructing interesting topologies, uses
platform-specific MSRs or other hidden gems (like the AMD Northbridge).

> > 
> Right, but the mp tables (and lapic ids) are still adjusted/created by hvmloader
> unless ofc I am reading it wrong. But anyhow - if you're planning to be based on

<nods>

I can't see how CPUID would allow constructing the proper MADT APIC entries so
that the APIC IDs match, as of right now?

Unless hvmloader is changed to do a full SMP bootup (it does that now at some
point), with each CPU reporting this information and all of them updating the
table based on their EAX=1 CPUID value?

> CPUID, that is certainly more correct than what I had suggested earlier, though
> with a bit more cirurgy on hvmloader.
> 
> >> This all brings me to the question of perhaps a separate domctl?
> > 
> > I specifically want to avoid having a separate hypercall for this
> > information.
> > 
> OK.
> 
> Joao
> 

* Re: DESIGN v2: CPUID part 3
  2017-07-31 19:49         ` Konrad Rzeszutek Wilk
@ 2017-08-01 18:34           ` Andrew Cooper
  2017-08-02 10:34             ` Joao Martins
  0 siblings, 1 reply; 19+ messages in thread
From: Andrew Cooper @ 2017-08-01 18:34 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk, Joao Martins; +Cc: Xen-devel

On 31/07/2017 20:49, Konrad Rzeszutek Wilk wrote:
> On Wed, Jul 05, 2017 at 02:22:00PM +0100, Joao Martins wrote:
>> On 07/05/2017 12:16 PM, Andrew Cooper wrote:
>>> On 05/07/17 10:46, Joao Martins wrote:
>>>> Hey Andrew,
>>>>
>>>> On 07/04/2017 03:55 PM, Andrew Cooper wrote:
>>>>
>>>>> (RFC: Decide exactly where to fit this.  _XEN\_DOMCTL\_max\_vcpus_ perhaps?)
>>>>> The toolstack shall also have a mechanism to explicitly select topology
>>>>> configuration for the guest, which primarily affects the virtual APIC ID
>>>>> layout, and has a knock on effect for the APIC ID of the virtual IO-APIC.
>>>>> Xen's auditing shall ensure that guests observe values consistent with the
>>>>> guarantees made by the vendor manuals.
>>>>>
>>>> Why choose max_vcpus domctl?
>>> Despite its name, the max_vcpus hypercall is the one which allocates all
>>> the vcpus in the hypervisor.  I don't want there to be any opportunity
>>> for vcpus to exist but no topology information to have been provided.
>>>
>> /nods
>>
>> So then doing this at vcpus allocation we would need to pass an additional CPU
>> topology argument on the max_vcpus hypercall? Otherwise it's sort of guess work
>> wrt sockets, cores, threads ... no?
> Andrew, thoughts on this and the one below?

Urgh sorry.  I've been distracted with some high priority interrupts (of
the non-maskable variety).

So, the bad news is that the CPUID and MSR policy handling has become
substantially more complicated and entwined than I had first planned.  A
change to either set of data alters the auditing of the other, so I am
leaning towards implementing everything with a single set hypercall (as
this is the only way to get a plausibly-consistent set of data).

The good news is that I don't think we actually need any changes to
XEN_DOMCTL_max_vcpus.  I now think there is sufficient expressibility in
the static cpuid policy to make this work.

>> There could be other uses too on passing this info to Xen, say e.g. the
>> scheduler knowing the guest CPU topology it would allow better selection of
>> core+sibling pair such that it could match cache/cpu topology passed on the
>> guest (for unpinned SMT guests).

I remain to be convinced (i.e. with some real performance numbers) that
the added complexity in the scheduler for that logic is a benefit in the
general case.

In practice, customers are either running very specific and dedicated
workloads (at which point pinning is used and there is no
oversubscription, and exposing the actual SMT topology is a good thing),
or customers are running general workloads with no pinning (or perhaps
cpupool-numa-split) with a moderate amount of oversubscription (at which
point exposing SMT is a bad move).

Counterintuitively, exposing NUMA in general oversubscribed scenarios is
terrible for net system performance.  What happens in practice is that
VMs which see NUMA spend their idle cycles trying to balance their own
userspace processes, rather than yielding to the hypervisor so another
guest can get a go.

>>
>>>> With multiple sockets/nodes and having supported extended topology leaf the APIC
>>>> ID layout will change considerably requiring fixup if... say we set vNUMA (I
>>>> know numa node != socket spec wise, but on the machines we have seen so far,
>>>> it's a 1:1 mapping).
>>> AMD Fam15h and later (may) have multiple NUMA nodes per socket, which
>>> will need to be accounted for in how the information is represented,
>>> especially in leaf 0x8000001e.
>>>
>>> Intel on the other hand (as far as I can tell), has no interaction
>>> between NUMA and topology as far as CPUID is concerned.
>>>
>> Sorry, I should probably have mentioned earlier that "machines we have seen so
>> far" were Intel - I am bit unaware of the AMD added possibilities.
>>
>>>> Another question since we are speaking about topology is would be: how do we
>>>> make hvmloader aware of each the APIC_ID layout? Right now, it is too hardcoded
>>>> 2 * APIC_ID :( Probably a xenstore entry 'hvmloader/cputopology-threads' and
>>>> 'hvmloader/cputopology-sockets' (or use vnuma_topo.nr_nodes for the latter)?
>>> ACPI table writing is in the toolstack now, but even if it weren't,
>>> HVMLoader would have to do what all real firmware needs to do, and look
>>> at CPUID.
> I think the real hardware when constructing interesting topologies uses
> platform specific MSRs or other hidden gems (like AMD Northbridge).

It was my understanding that APIC IDs are negotiated at power-on time,
as they are the base layer of addressing in the system.

>
>> Right, but the mp tables (and lapic ids) are still adjusted/created by hvmloader
>> unless ofc I am reading it wrong. But anyhow - if you're planning to be based on
> <nods>
>
> I can't see how the CPUID would allow to construct the proper APIC MADT entries so that
> the APIC IDs match as of right now?
>
> Unless hvmloader is changed to do full SMP bootup (it does that now at some point)
> and each CPU reports this information and they all update this table based on their
> EAX=1 CPUID value?

HVMLoader is currently hardcoded to the same assumption (APIC ID =
vcpu_id * 2) as other areas of Xen and the toolstack.

All vcpus are already booted, so the MTRRs can be configured suitably. 
Having said that, I think vcpu0 can write out the ACPI tables properly,
so long as it knows that Xen doesn't insert arbitrary holes into the
APIC ID space.
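
As a purely illustrative sketch of that (using the hardcoded convention
mentioned above, and the MADT Processor Local APIC entry layout from the ACPI
spec), vcpu0 could emit one entry per vcpu like so:

```c
#include <stdint.h>

/* ACPI MADT Processor Local APIC entry (structure type 0). */
struct acpi_madt_lapic {
    uint8_t  type;       /* 0 */
    uint8_t  length;     /* 8 */
    uint8_t  acpi_uid;
    uint8_t  apic_id;
    uint32_t flags;      /* bit 0: enabled */
};

/* Emit one entry per vcpu under the current APIC ID = vcpu_id * 2 scheme. */
static void build_madt_lapics(struct acpi_madt_lapic *ent,
                              unsigned int nr_vcpus)
{
    for ( unsigned int i = 0; i < nr_vcpus; i++ )
        ent[i] = (struct acpi_madt_lapic){
            .type     = 0,
            .length   = sizeof(ent[i]),
            .acpi_uid = i,
            .apic_id  = i * 2,    /* the hardcoded convention noted above */
            .flags    = 1,        /* enabled */
        };
}
```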

~Andrew


* Re: DESIGN v2: CPUID part 3
  2017-08-01 18:34           ` Andrew Cooper
@ 2017-08-02 10:34             ` Joao Martins
  2017-08-03  2:55               ` Dario Faggioli
  0 siblings, 1 reply; 19+ messages in thread
From: Joao Martins @ 2017-08-02 10:34 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Dario Faggioli, Xen-devel

On 08/01/2017 07:34 PM, Andrew Cooper wrote:
> On 31/07/2017 20:49, Konrad Rzeszutek Wilk wrote:
>> On Wed, Jul 05, 2017 at 02:22:00PM +0100, Joao Martins wrote:
>>> On 07/05/2017 12:16 PM, Andrew Cooper wrote:
>>>> On 05/07/17 10:46, Joao Martins wrote:
>>>>> Hey Andrew,
>>>>>
>>>>> On 07/04/2017 03:55 PM, Andrew Cooper wrote:
>>>>>
>>>>>> (RFC: Decide exactly where to fit this.  _XEN\_DOMCTL\_max\_vcpus_ perhaps?)
>>>>>> The toolstack shall also have a mechanism to explicitly select topology
>>>>>> configuration for the guest, which primarily affects the virtual APIC ID
>>>>>> layout, and has a knock on effect for the APIC ID of the virtual IO-APIC.
>>>>>> Xen's auditing shall ensure that guests observe values consistent with the
>>>>>> guarantees made by the vendor manuals.
>>>>>>
>>>>> Why choose max_vcpus domctl?
>>>> Despite its name, the max_vcpus hypercall is the one which allocates all
>>>> the vcpus in the hypervisor.  I don't want there to be any opportunity
>>>> for vcpus to exist but no topology information to have been provided.
>>>>
>>> /nods
>>>
>>> So then doing this at vcpus allocation we would need to pass an additional CPU
>>> topology argument on the max_vcpus hypercall? Otherwise it's sort of guess work
>>> wrt sockets, cores, threads ... no?
>> Andrew, thoughts on this and the one below?
> 
> Urgh sorry.  I've been distracted with some high priority interrupts (of
> the non-maskable variety).
> 
> So, bad news is that the CPUID and MSR policy handling has become
> substantially more complicated and entwined than I had first planned.  A
> change in either of the data alters the auditing of the other, so I am
> leaning towards implementing everything with a single set hypercall (as
> this is the only way to get a plausibly-consistent set of data).
> 
> The good news is that I don't think we actually need any changes to the
> XEN_DOMCTL_max_vcpus.  I now think there is sufficient expressibility in
> the static cpuid policy to work.
> 
Awesome!

>>> There could be other uses too on passing this info to Xen, say e.g. the
>>> scheduler knowing the guest CPU topology it would allow better selection of
>>> core+sibling pair such that it could match cache/cpu topology passed on the
>>> guest (for unpinned SMT guests).
> 
> I remain to be convinced (i.e. with some real performance numbers) that
> the added complexity in the scheduler for that logic is a benefit in the
> general case.
> 
The suggestion above was a simple extension to struct domain (e.g. cores/threads
or struct cpu_topology field) - nothing too disruptive I think.

But I cannot really argue on this, as it was just an idea that I found
interesting (no numbers to fully support it). We just happened to see it
under-perform when a simple range of cpus was used for affinity, and IIRC some
vcpus ended up being scheduled on the same core+sibling pair; hence I (perhaps
naively) imagined that there could be value in further scheduler
enlightenment, e.g. "gang-scheduling" where we always schedule core+sibling
together. I was speaking to Dario (CC'ed) at the summit about whether CPU
topology could have value - there might be, but it remains to be explored once
we're able to pass a cpu topology to the guest. (In the past he seemed
enthusiastic about the idea of the topology[0], hence I assumed it was in the
context of schedulers.)

[0] https://lists.xenproject.org/archives/html/xen-devel/2016-02/msg03850.html

> In practice, customers are either running very specific and dedicated
> workloads (at which point pinning is used and there is no
> oversubscription, and exposing the actual SMT topology is a good thing),
>
/nods

> or customers are running general workloads with no pinning (or perhaps
> cpupool-numa-split) with a moderate amount of oversubscription (at which
> point exposing SMT is a bad move).
> 
Given the scale you folks invest in over-subscription (1000 VMs), I wonder what
moderate means here :P

> Counterintuitively, exposing NUMA in general oversubscribed scenarios is
> terrible for net system performance.  What happens in practice is that
> VMs which see NUMA spend their idle cycles trying to balance their own
> userspace processes, rather than yielding to the hypervisor so another
> guest can get a go.
> 
Interesting to know - vNUMA perhaps is only better placed for performance cases
where both (or either) I/O topology and memory locality matter, or when going
for bigger guests - provided that the corresponding CPU topology is also exposed.

Joao


* Re: DESIGN v2: CPUID part 3
  2017-08-02 10:34             ` Joao Martins
@ 2017-08-03  2:55               ` Dario Faggioli
  0 siblings, 0 replies; 19+ messages in thread
From: Dario Faggioli @ 2017-08-03  2:55 UTC (permalink / raw)
  To: Joao Martins, Andrew Cooper; +Cc: Xen-devel


On Wed, 2017-08-02 at 11:34 +0100, Joao Martins wrote:
> On 08/01/2017 07:34 PM, Andrew Cooper wrote:
> > > On Wed, Jul 05, 2017 at 02:22:00PM +0100, Joao Martins wrote:
> > > > 
> > > > There could be other uses too in passing this info to Xen; e.g. the
> > > > scheduler knowing the guest CPU topology would allow better selection
> > > > of core+sibling pairs, such that it could match the cache/CPU topology
> > > > passed to the guest (for unpinned SMT guests).
> > 
> > I remain to be convinced (i.e. with some real performance numbers) that
> > the added complexity in the scheduler for that logic is a benefit in the
> > general case.
> > 
> 
> The suggestion above was a simple extension to struct domain (e.g. a
> cores/threads or struct cpu_topology field) - nothing too disruptive, I
> think.
> 
> But I cannot really argue on this, as it was just an idea that I found
> interesting (no numbers to support it).  We just happened to see it
> under-perform when a simple range of cpus was used for affinity, and some
> vcpus ended up being scheduled on the same core+sibling pair, IIRC; hence I
> (perhaps naively) imagined that there could be value in further scheduler
> enlightenment, e.g. "gang-scheduling", where we always schedule a
> core+sibling pair together.  I was speaking to Dario (CC'ed) at the summit
> about whether CPU topology could have value for the scheduler - there might
> be, but it remains to be explored once we're able to pass a cpu topology to
> the guest.  (In the past he seemed enthusiastic about the idea of guest
> topology[0], hence I assumed it was in the context of schedulers.)
> 
> [0] https://lists.xenproject.org/archives/html/xen-devel/2016-02/msg03850.html
> 
> > In practice, customers are either running very specific and dedicated
> > workloads (at which point pinning is used and there is no
> > oversubscription, and exposing the actual SMT topology is a good thing),
> > 
> /nods
> 
I am enthusiastic that there is going to be a way of specifying explicitly
the CPU topology of a guest.

The way we can take advantage of this, at least as a first step, is: when the
guest is pinned, and two of its vCPUs are pinned to two host hyperthreads,
make those two vCPUs hyperthreads as well, from the guest's point of view.

Then it will be the guest's (e.g., Linux's) scheduler that does something
clever with this information, so there is no need for adding complexity
anywhere (well, in theory, in the guest scheduler, but in practice the code
is already there!).

Or, on the other hand, if pinning is *not* used, then I'd use this
mechanism to tell the guest that there's no relationship between its
vCPUs whatsoever.
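
As a rough sketch of what I mean (pseudo-C with hypothetical helpers, just to
illustrate the idea, not real Xen/toolstack code):

    /* Sketch: derive guest SMT siblings from pinning (hypothetical helpers). */
    static void sketch_derive_guest_smt(struct domain *d, unsigned int nr_vcpus)
    {
        unsigned int v;

        for ( v = 0; v + 1 < nr_vcpus; v += 2 )
        {
            /* Both vCPUs pinned, and their pCPUs are SMT siblings? */
            if ( vcpu_is_pinned(d, v) && vcpu_is_pinned(d, v + 1) &&
                 pcpus_are_smt_siblings(vcpu_pinned_pcpu(d, v),
                                        vcpu_pinned_pcpu(d, v + 1)) )
                guest_topology_mark_siblings(d, v, v + 1);
            else
                /* No pinning (or unrelated pCPUs): expose no relationship. */
                guest_topology_mark_independent(d, v, v + 1);
        }
    }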

In fact, currently --sticking to SMT as the example-- by not specifying the
topology explicitly, there may be cases where the guest scheduler comes to
think that two vCPUs are SMT siblings, while they either are not, or (if no
pinning is in place) they may or may not be, depending on which pCPUs the two
vCPUs are executing on at any given time.  This means the guest scheduler's
SMT optimization logic will trigger when it probably better shouldn't.

These are the first two use cases that, as the "scheduler guy", I'm
interested in using this feature for.

Then there indeed is the chance of using the guest topology to affect the
decisions of Xen's scheduler, e.g., to implement some form of gang
scheduling, or to force two vCPUs to be executed on pCPUs that respect such
topology... But this is all still in the "wild ideas" camp, for now. :-D

> > or customers are running general workloads with no pinning (or perhaps
> > cpupool-numa-split) with a moderate amount of oversubscription (at which
> > point exposing SMT is a bad move).
> > 
> 
> Given the scale you folks invest in over-subscription (1000 VMs), I wonder
> what moderate means here :P
> 
> > Counterintuitively, exposing NUMA in general oversubscribed scenarios is
> > terrible for net system performance.  What happens in practice is that
> > VMs which see NUMA spend their idle cycles trying to balance their own
> > userspace processes, rather than yielding to the hypervisor so another
> > guest can get a go.
> > 
> 
For NUMA-aware workloads running in guests, the guests themselves doing
something sane with both the placement and the balancing of tasks and memory
is a good thing, and it will improve the performance of the workload itself.
Provided the (topology) information used for doing this placement and
balancing is accurate... and vNUMA is what makes it accurate.

So, IMO, for big guests running NUMA-aware workloads, vNUMA will most of the
time improve things.

I totally don't get the part where a vCPU becoming idle is what Xen needs in
order to run other guests' vCPUs... Xen does not rely at all on a vCPU
periodically blocking or yielding in order to let another vCPU run, neither
in undersubscribed nor in oversubscribed scenarios.

It's preemptive multitasking, not cooperative multitasking, that we do,
i.e., running vCPUs are preempted when it's the turn of some other vCPU to
execute.

The only thing we gain from the fact that vCPUs go idle from time to time is
that we may be able to let the actual pCPUs sleep a bit, and hence save power
and produce less heat, but that's mostly the case in undersubscribed
scenarios, not in oversubscribed ones.

> Interesting to know - vNUMA perhaps is only better placed for performance
> cases where both (or either) I/O topology and memory locality matter, or
> when going for bigger guests - provided that the corresponding CPU topology
> is also exposed.
> 
Exactly; it matters if the guest is big enough and/or NUMA enough (e.g., as
you say, both its memory and I/O accesses are sensitive, and it suffers from
having to go the long route), and if the workload is also NUMA-aware.

And yes, vNUMA needs topology information to be accurate and
consistent.

Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


Thread overview: 19+ messages
2017-06-08 13:12 DESIGN: CPUID part 3 Andrew Cooper
2017-06-08 13:47 ` Jan Beulich
2017-06-12 13:07   ` Andrew Cooper
2017-06-12 13:29     ` Jan Beulich
2017-06-12 13:36       ` Andrew Cooper
2017-06-12 13:42         ` Jan Beulich
2017-06-12 14:02           ` Andrew Cooper
2017-06-12 14:18             ` Jan Beulich
2017-06-09 12:24 ` Anshul Makkar
2017-06-12 13:21   ` Andrew Cooper
2017-07-04 14:55 ` DESIGN v2: " Andrew Cooper
2017-07-05  9:46   ` Joao Martins
2017-07-05 10:32     ` Joao Martins
2017-07-05 11:16     ` Andrew Cooper
2017-07-05 13:22       ` Joao Martins
2017-07-31 19:49         ` Konrad Rzeszutek Wilk
2017-08-01 18:34           ` Andrew Cooper
2017-08-02 10:34             ` Joao Martins
2017-08-03  2:55               ` Dario Faggioli
