* [DRAFT C] PVH CPU hotplug design document
From: Roger Pau Monné @ 2017-01-17 17:14 UTC
  To: xen-devel
  Cc: Stefano Stabellini, Graeme Gregory, Al Stone, Andrew Cooper,
	Anshul Makkar, Julien Grall, Paul Durrant, Jan Beulich,
	Boris Ostrovsky

Hello,

Below is a draft of a design document for PVHv2 CPU hotplug. It should cover
both vCPU and pCPU hotplug. It is mainly centered on the hardware domain,
since for unprivileged PVH guests the vCPU hotplug mechanism is already
described in Boris' series [0], and is shared with HVM.

The aim here is to find a way to use ACPI vCPU hotplug for the hardware domain,
while still being able to properly detect and notify Xen of pCPU hotplug.

[0] https://lists.xenproject.org/archives/html/xen-devel/2017-01/msg00060.html

---8<---
% CPU hotplug support for PVH
% Roger Pau Monné <roger.pau@citrix.com>
% Draft C

# Revision History

| Version | Date        | Changes                                           |
|---------|-------------|---------------------------------------------------|
| Draft A | 5 Jan 2017  | Initial draft.                                    |
|---------|-------------|---------------------------------------------------|
| Draft B | 12 Jan 2017 | Removed the XXX comments and clarified some       |
|         |             | sections.                                         |
|         |             |                                                   |
|         |             | Added a sample of the SSDT ASL code that would be |
|         |             | appended to the hardware domain.                  |
|---------|-------------|---------------------------------------------------|
| Draft C | 17 Jan 2017 | Define a _SB.XEN0 bus device and place all the    |
|         |             | processor objects and the GPE block inside of it. |
|         |             |                                                   |
|         |             | Place the GPE status and enable registers and     |
|         |             | the vCPU enable bitmap in memory instead of IO    |
|         |             | space.                                            |

# Preface

This document describes the interface to be used in order to implement CPU
hotplug for PVH guests; it applies to hotplug of both physical and virtual
CPUs.

# Introduction

One of the design goals of PVH is to remove as much Xen PV-specific code as
possible, thus limiting the number of Xen PV interfaces used by guests, and
tending to use native interfaces (as used by bare metal) as much as possible.
This is in line with the efforts also made by Xen on ARM, and helps reduce the
burden of maintaining large amounts of Xen PV code inside guest kernels.

This however presents some challenges due to the model used by the Xen
Hypervisor, where some devices are handled by Xen while others are left for the
hardware domain to manage. The fact that Xen lacks an AML parser also makes it
harder, since it cannot get the full hardware description from the dynamic ACPI
tables (DSDT, SSDT) without the hardware domain's collaboration.

One such issue is CPU enumeration and hotplug, for both the hardware and
unprivileged domains. The aim is to be able to use the same enumeration and
hotplug interface for all PVH guests, regardless of their privilege.

This document aims to describe the interface used in order to fulfill the
following actions:

 * Virtual CPU (vCPU) enumeration at boot time.
 * Hotplug of vCPUs.
 * Hotplug of physical CPUs (pCPUs) to Xen.

# Prior work

## PV CPU hotplug

CPU hotplug for Xen PV guests is implemented using xenstore and hypercalls. The
guest has to set up a watch on the "cpu/" xenstore node, and react to changes
in this directory. CPUs are added by creating a new node and setting its
"availability" to online:

    cpu/X/availability = "online"

Where X is the vCPU ID. This is an out-of-band method that relies on
Xen-specific interfaces in order to perform CPU hotplug.
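
As an illustration, here is a minimal userspace sketch of the guest side of
this protocol using libxenstore (real PV guests implement this in their
kernel's xenbus driver); the watch token is arbitrary, relative paths are
assumed to resolve against the domain's xenstore home, and error handling is
omitted:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <xenstore.h>

    int main(void)
    {
        struct xs_handle *xsh = xs_open(0);
        unsigned int num, len, vcpu;

        if (!xsh || !xs_watch(xsh, "cpu", "cpu-hotplug"))
            return 1;

        for (;;) {
            /* Blocks until a node below "cpu/" changes. */
            char **event = xs_read_watch(xsh, &num);
            char *path = event[XS_WATCH_PATH];

            /* Only react to "cpu/X/availability" nodes. */
            if (sscanf(path, "cpu/%u/availability", &vcpu) == 1) {
                char *avail = xs_read(xsh, XBT_NULL, path, &len);

                if (avail && !strcmp(avail, "online"))
                    printf("vCPU %u onlined\n", vcpu); /* bring it up */
                else if (avail)
                    printf("vCPU %u offlined\n", vcpu);
                free(avail);
            }
            free(event);
        }
    }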

## QEMU CPU hotplug using ACPI

The ACPI tables provided to HVM guests contain processor objects, as created by
libacpi. The number of processor objects in the ACPI namespace matches the
maximum number of processors supported by HVM guests (up to 128 at the time of
writing). Processors that are currently disabled are marked as such in the MADT
and in their \_MAT and \_STA methods.

A PRST operation region in I/O space is also defined, with a size of 128 bits,
that is used as a bitmap of enabled vCPUs on the system. A PRSC method is
provided in order to check for updates to the PRST region and trigger
notifications on the affected processor objects. The PRSC method is executed in
response to a GPE event. The OSPM then checks the value returned by \_STA for
the ACPI\_STA\_DEVICE\_PRESENT flag in order to know whether the vCPU has been
enabled.

## Native CPU hotplug

The OSPM waits for a notification from ACPI on the processor object, and when
an event is received the return value of \_STA is checked in order to see
whether ACPI\_STA\_DEVICE\_PRESENT has been set. This notification is triggered
from the method of a GPE block.

# PVH CPU hotplug

The aim, as stated in the introduction, is to use a method as similar as
possible to bare-metal CPU hotplug for PVH. This is feasible for unprivileged
domains, since the ACPI tables can be created by the toolstack and provided to
the guest; a minimal I/O or memory handler is then added to Xen in order to
report the bitmap of enabled vCPUs. There's already a [series][0] posted to
xen-devel that implements this functionality for unprivileged PVH guests.
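
For reference, here is a hedged sketch of what such a handler could look like
inside Xen, modeled on the x86 portio handler interface; the port base and the
derivation of the bitmap from vCPU pause flags are illustrative assumptions,
not necessarily what the series implements:

    /* Assumed I/O port base for the vCPU availability bitmap. */
    #define XEN_CPU_MAP_PORT 0xaf00

    static int vcpu_bitmap_ioport_read(int dir, unsigned int port,
                                       unsigned int bytes, uint32_t *val)
    {
        const struct domain *d = current->domain;
        unsigned int i, first_bit = (port - XEN_CPU_MAP_PORT) * 8;
        uint32_t data = 0;

        if ( dir != IOREQ_READ )
            return X86EMUL_OKAY; /* writes are ignored */

        /* One bit per vCPU, starting at the vCPU covered by this port. */
        for ( i = 0; i < bytes * 8; i++ )
        {
            unsigned int cpu = first_bit + i;

            if ( cpu < d->max_vcpus && d->vcpu[cpu] != NULL &&
                 !test_bit(_VPF_down, &d->vcpu[cpu]->pause_flags) )
                data |= 1u << i;
        }

        *val = data;
        return X86EMUL_OKAY;
    }

A handler like this would be registered with register_portio_handler() (or an
MMIO equivalent) when the domain is built.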

This has however proven to be quite difficult to implement for the hardware
domain, since it has to manage both pCPUs and vCPUs. The hardware domain should
be able to notify Xen of the addition of new pCPUs, so that they can be used by
the hypervisor, and it should also be able to hotplug new vCPUs for its own
usage. Since Xen cannot access the dynamic (AML) ACPI tables, because it lacks
an AML parser, it is the duty of the hardware domain to parse those tables and
notify Xen of relevant events.
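
For the pCPU side, Xen already provides a platform op that the hardware domain
can invoke once it has parsed the relevant ACPI objects. The following sketch
shows that notification path with Linux-style includes; field usage follows my
reading of Xen's public headers and should be treated as illustrative:

    #include <xen/interface/platform.h>
    #include <asm/xen/hypercall.h>

    /* Notify Xen of a hot-added pCPU, parsed from the firmware tables. */
    static int pcpu_hotadd_notify(uint32_t apic_id, uint32_t acpi_id,
                                  uint32_t pxm)
    {
        struct xen_platform_op op = {
            .cmd = XENPF_cpu_hotadd,
            .interface_version = XENPF_INTERFACE_VERSION,
            .u.cpu_add = {
                .apic_id = apic_id, /* local APIC ID of the new CPU */
                .acpi_id = acpi_id, /* ACPI Processor UID */
                .pxm     = pxm,     /* NUMA proximity domain */
            },
        };

        return HYPERVISOR_platform_op(&op);
    }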

There are several related issues here that prevent a straightforward solution
to this problem:

 * Xen cannot parse AML tables, and thus cannot get notifications from ACPI
   events. And even if Xen could parse those tables, there can only be one
   OSPM registered with ACPI.
 * Xen can provide a valid MADT table to the hardware domain that describes the
   environment in which the hardware domain is running, but it cannot prevent
   the hardware domain from seeing the real processor devices in the ACPI
   namespace, nor can Xen currently provide the hardware domain with processor
   devices that match the vCPUs.

[0]: https://lists.xenproject.org/archives/html/xen-devel/2017-01/msg00060.html

## Proposed solution using the STAO

The general idea of this method is to use the STAO in order to hide the pCPUs
from the hardware domain, and provide processor objects for vCPUs in an extra
SSDT table.

This method requires one change to the STAO, in order to be able to notify the
hardware domain which of the processors found in the ACPI tables are pCPUs. The
description of the new STAO field is as follows:

 |   Field            | Byte Length | Byte Offset |     Description          |
 |--------------------|:-----------:|:-----------:|--------------------------|
 | Processor List [n] |      -      |      -      | A list of ACPI numbers,  |
 |                    |             |             | where each number is the |
 |                    |             |             | Processor UID of a       |
 |                    |             |             | physical CPU, and should |
 |                    |             |             | be treated specially by  |
 |                    |             |             | the OSPM                 |

The list of UIDs in this new field would be matched against the ACPI Processor
UID field found in the local APIC and x2APIC MADT structures and in the
Processor objects in the ACPI namespace; the OSPM should either ignore those
objects, or, in case it implements pCPU hotplug, notify Xen of changes to these
objects.
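
As an illustration, the OSPM side of this check could look like the following
sketch; the MADT entry layout is simplified here, and a real OS would use its
own ACPI table accessors:

    #include <stdbool.h>
    #include <stdint.h>

    /* Simplified ACPI processor local x2APIC MADT entry (type 9). */
    struct madt_x2apic {
        uint8_t  type;
        uint8_t  length;
        uint16_t reserved;
        uint32_t x2apic_id;
        uint32_t flags;      /* bit 0: enabled */
        uint32_t acpi_uid;   /* ACPI Processor UID */
    };

    /* Return true if 'uid' names a physical CPU per the STAO list. */
    static bool stao_is_pcpu(const uint64_t *stao_uids, unsigned int n,
                             uint64_t uid)
    {
        unsigned int i;

        for (i = 0; i < n; i++)
            if (stao_uids[i] == uid)
                return true;

        return false;
    }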

The contents of the MADT provided to the hardware domain are also going to be
different from the contents of the MADT as found in native ACPI. The local/x2
APIC entries for all the pCPUs are going to be marked as disabled.

Extra entries are going to be added for each vCPU available to the hardware
domain, up to the maximum number of supported vCPUs. Note that the number of
supported vCPUs might differ from the number of enabled vCPUs, so it's possible
that some of these entries are also going to be marked as disabled. The entries
for vCPUs in the MADT are going to use a processor local x2APIC structure, and
the ACPI processor IDs of vCPUs are not going to reuse processor IDs already
used by pCPUs. Xen makes no guarantee about the processor ID of the first vCPU,
nor must the OS assume the IDs to be consecutive. Note that this would limit
the number of vCPUs so that (pCPUs + vCPUs) < 2^32.
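
The following sketch shows how Xen could append such local x2APIC entries when
building the hardware domain's MADT, reusing the simplified struct madt_x2apic
layout from the sketch above; the choice of first_uid (any range not used by a
pCPU) and of the APIC IDs is an assumption:

    #include <string.h>

    /* Append one local x2APIC entry per vCPU; returns the new cursor. */
    static uint8_t *madt_add_vcpu_entries(uint8_t *cursor,
                                          uint32_t first_uid,
                                          unsigned int max_vcpus,
                                          unsigned int enabled_vcpus)
    {
        unsigned int i;

        for (i = 0; i < max_vcpus; i++) {
            struct madt_x2apic e = {
                .type      = 9,                 /* local x2APIC */
                .length    = sizeof(e),
                .x2apic_id = i,                 /* vCPU APIC ID */
                .flags     = i < enabled_vcpus, /* bit 0: enabled */
                .acpi_uid  = first_uid + i,     /* no pCPU overlap */
            };

            memcpy(cursor, &e, sizeof(e));
            cursor += sizeof(e);
        }

        return cursor;
    }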

In order to be able to perform vCPU hotplug, the vCPUs must have an ACPI
processor object in the ACPI namespace, so that the OSPM can request
notifications and get the value of the \_STA and \_MAT methods. This can be
problematic because Xen doesn't know the ACPI name of the other processor
objects, so blindly adding new ones can create namespace clashes.

This can be solved by using a different ACPI name in order to describe vCPUs in
the ACPI namespace. Most hardware vendors tend to use CPU or PR prefixes for
the processor objects, so using a 'VP' (i.e. Virtual Processor) prefix should
prevent clashes.

A Xen GPE device block will be used in order to deliver events related to the
vCPUs available to the guest, since Xen doesn't know whether there are any bits
available in the native GPEs. An SCI interrupt will be injected into the guest
in order to trigger the event.
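
On the hypervisor side, delivering such an event amounts to updating the vCPU
bitmap, latching the status bit of the Xen GPE block, and injecting an SCI if
the guest has enabled that bit. The sketch below is purely illustrative: the
register layout and the assert_sci() helper are assumptions, not existing Xen
internals:

    /* Guest-visible registers backing the Xen GPE block (assumed). */
    struct xen_cpuhp_regs {
        unsigned long vcpu_bitmap[BITS_TO_LONGS(128)]; /* PRS */
        unsigned long gpe_status;                      /* GPE0 STS */
        unsigned long gpe_enable;                      /* GPE0 EN */
    };

    #define XEN_GPE_CPUHP_BIT 2 /* matches the _E02 method below */

    static void vcpu_hotplug_notify(struct domain *d,
                                    struct xen_cpuhp_regs *r,
                                    unsigned int vcpu, bool online)
    {
        if ( online )
            set_bit(vcpu, r->vcpu_bitmap);
        else
            clear_bit(vcpu, r->vcpu_bitmap);

        /* Latch the event in the Xen GPE block's status register... */
        set_bit(XEN_GPE_CPUHP_BIT, &r->gpe_status);

        /* ...and raise an SCI if the guest has this GPE bit enabled. */
        if ( test_bit(XEN_GPE_CPUHP_BIT, &r->gpe_enable) )
            assert_sci(d); /* hypothetical SCI injection helper */
    }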

The following snippet is a representation of the ASL SSDT code that is proposed
for the hardware domain:

    DefinitionBlock ("SSDT.aml", "SSDT", 5, "Xen", "HVM", 0)
    {
        Device ( \_SB.XEN0 ) {
            Name ( _HID, "ACPI0004" ) /* ACPI Module Device (bus node) */
        }
        Scope (\_SB.XEN0)
        {
            OperationRegion(XEN, SystemMemory, 0xXXXXXXXX, 41)
            Field(XEN, ByteAcc, NoLock, Preserve) {
                PRS, 2,   /* vCPU enabled bitmap */
                NCPU, 16, /* Number of vCPUs */
                MSUA, 32, /* MADT checksum address */
                MAPA, 32, /* MADT LAPIC0 address */
            }
            OperationRegion ( MSUM, SystemMemory, \_SB.XEN0.MSUA, 1 )
            Field ( MSUM, ByteAcc, NoLock, Preserve ) {
                MSU, 8
            }
            Method ( PMAT, 2 ) {
                If ( LLess(Arg0, NCPU) ) {
                    Return ( ToBuffer(Arg1) )
                }
                Return ( Buffer() {0, 8, 0xff, 0xff, 0, 0, 0, 0} )
            }
            Processor ( VP00, 0, 0x0000b010, 0x06 ) {
                Name ( _HID, "ACPI0007" )
                Name ( _UID, 1 )
                OperationRegion ( MATR, SystemMemory, Add(\_SB.XEN0.MAPA, 0), 8 )
                Field ( MATR, ByteAcc, NoLock, Preserve ) {
                    MAT, 64
                }
                Field ( MATR, ByteAcc, NoLock, Preserve ) {
                    Offset(4),
                    FLG, 1
                }
                Method ( _MAT, 0 ) {
                    Return ( ToBuffer(MAT) )
                }
                Method ( _STA ) {
                    If ( FLG ) {
                        Return ( 0xF )
                    }
                    Return ( 0x0 )
                }
                Method ( _EJ0, 1, NotSerialized ) {
                    Sleep ( 0xC8 )
                }
            }
            Processor ( VP01, 1, 0x0000b010, 0x06 ) {
                Name ( _HID, "ACPI0007" )
                Name ( _UID, 2 )
                OperationRegion ( MATR, SystemMemory, Add(\_SB.XEN0.MAPA, 8), 8 )
                Field ( MATR, ByteAcc, NoLock, Preserve ) {
                    MAT, 64
                }
                Field ( MATR, ByteAcc, NoLock, Preserve ) {
                    Offset(4),
                    FLG, 1
                }
                Method ( _MAT, 0 ) {
                    Return ( PMAT (1, MAT) )
                }
                Method ( _STA ) {
                    If ( LLess(1, \_SB.XEN0.NCPU) ) {
                        If ( FLG ) {
                            Return ( 0xF )
                        }
                    }
                    Return ( 0x0 )
                }
                Method ( _EJ0, 1, NotSerialized ) {
                    Sleep ( 0xC8 )
                }
            }
            Method ( PRSC, 0 ) {
                Store ( ToBuffer(PRS), Local0 )
                Store ( DerefOf(Index(Local0, 0)), Local1 )
                And ( Local1, 1, Local2 )
                If ( LNotEqual(Local2, \_SB.XEN0.VP00.FLG) ) {
                    Store ( Local2, \_SB.XEN0.VP00.FLG )
                    If ( LEqual(Local2, 1) ) {
                        Notify ( VP00, 1 )
                        Subtract ( \_SB.XEN0.MSU, 1, \_SB.XEN0.MSU )
                    }
                    Else {
                        Notify ( VP00, 3 )
                        Add ( \_SB.XEN0.MSU, 1, \_SB.XEN0.MSU )
                    }
                }
                ShiftRight ( Local1, 1, Local1 )
                And ( Local1, 1, Local2 )
                If ( LNotEqual(Local2, \_SB.XEN0.VP01.FLG) ) {
                    Store ( Local2, \_SB.XEN0.VP01.FLG )
                    If ( LEqual(Local2, 1) ) {
                        Notify ( VP01, 1 )
                        Subtract ( \_SB.XEN0.MSU, 1, \_SB.XEN0.MSU )
                    }
                    Else {
                        Notify ( VP01, 3 )
                        Add ( \_SB.XEN0.MSU, 1, \_SB.XEN0.MSU )
                    }
                }
                Return ( One )
            }
        }
        Device ( \_SB.XEN0.GPE0 ) {
            Name ( _HID, "ACPI0006" )
            Name ( _UID, "XENGPE0" )
            Name ( _CRS, ResourceTemplate() {
                Memory32Fixed ( ReadWrite, 0xXXXXXXXX, 0x4 )
            } )
            Method ( _E02 ) {
                \_SB.XEN0.PRSC ()
            }
        }
    }

Since the position of the XEN data memory area is not known in advance, the
hypervisor will have to replace the address noted as 0xXXXXXXXX with the actual
memory address where this structure has been copied. The ACPI processor IDs
will also be replaced by Xen at runtime (noted as 1 and 2 in the snippet
above). The PRST region containing the vCPU enabled bitmap would also need to
be relocated by Xen over a RAM region, and updated accordingly when a vCPU is
added or removed.

The replacement can be done by compiling two different versions of the above
ASL code, each one having different values for the XEN operation region, the
ACPI processor object IDs and the other values that need to be set on a
per-system basis, and doing a binary comparison between them in order to get
the relative offsets of the differences. Note that the XEN operation region and
the GPE event and status regions would be placed over a RAM memory region.
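
A sketch of that comparison step: compile the SSDT twice with different
placeholder values and record the byte offsets at which the two blobs differ;
those offsets become the patch points. Note that the AML table checksum byte
(offset 9 of the header) will also show up as a difference, and has to be
recomputed after patching anyway:

    #include <stddef.h>

    /* Record up to 'max' offsets where blobs differ; return the count. */
    static size_t aml_diff(const unsigned char *a, const unsigned char *b,
                           size_t len, size_t *offs, size_t max)
    {
        size_t i, n = 0;

        for (i = 0; i < len && n < max; i++)
            if (a[i] != b[i])
                offs[n++] = i; /* patch point (or the checksum byte) */

        return n;
    }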

In order to implement this, the hypervisor build is going to use part of
libacpi and the iasl compiler.


* Re: [DRAFT C] PVH CPU hotplug design document
From: Jan Beulich @ 2017-01-23 16:30 UTC
  To: Roger Pau Monné
  Cc: Stefano Stabellini, Graeme Gregory, Al Stone, Andrew Cooper,
	Anshul Makkar, Julien Grall, Paul Durrant, xen-devel,
	Boris Ostrovsky

>>> On 17.01.17 at 18:14, <roger.pau@citrix.com> wrote:
> This can be solved by using a different ACPI name in order to describe vCPUs in
> the ACPI namespace. Most hardware vendors tend to use CPU or PR prefixes for
> the processor objects, so using a 'VP' (ie: Virtual Processor) prefix should
> prevent clashes.

I continue to think that this is insufficient, without seeing a nice
clean way to solve the issue properly.

Jan


* Re: [DRAFT C] PVH CPU hotplug design document
From: Roger Pau Monné @ 2017-01-23 16:42 UTC
  To: Jan Beulich
  Cc: Stefano Stabellini, Graeme Gregory, Al Stone, Andrew Cooper,
	Anshul Makkar, Julien Grall, Paul Durrant, xen-devel,
	Boris Ostrovsky

On Mon, Jan 23, 2017 at 09:30:30AM -0700, Jan Beulich wrote:
> >>> On 17.01.17 at 18:14, <roger.pau@citrix.com> wrote:
> > This can be solved by using a different ACPI name in order to describe vCPUs in
> > the ACPI namespace. Most hardware vendors tend to use CPU or PR prefixes for
> > the processor objects, so using a 'VP' (ie: Virtual Processor) prefix should
> > prevent clashes.
> 
> I continue to think that this is insufficient, without seeing a nice
> clean way to solve the issue properly.

But in this document the namespace path for processor objects will be
_SB.XEN0.VPXX, which should prevent any namespace clashes. Maybe I should have
updated the wording here: every Xen-related ACPI bit will be inside the
_SB.XEN0 namespace.

Roger.

* Re: [DRAFT C] PVH CPU hotplug design document
From: Jan Beulich @ 2017-01-23 16:55 UTC
  To: Roger Pau Monné
  Cc: Stefano Stabellini, Graeme Gregory, Al Stone, Andrew Cooper,
	Anshul Makkar, Julien Grall, Paul Durrant, xen-devel,
	Boris Ostrovsky

>>> On 23.01.17 at 17:42, <roger.pau@citrix.com> wrote:
> On Mon, Jan 23, 2017 at 09:30:30AM -0700, Jan Beulich wrote:
>> >>> On 17.01.17 at 18:14, <roger.pau@citrix.com> wrote:
>> > This can be solved by using a different ACPI name in order to describe vCPUs in
>> > the ACPI namespace. Most hardware vendors tend to use CPU or PR prefixes for
>> > the processor objects, so using a 'VP' (ie: Virtual Processor) prefix should
>> > prevent clashes.
>> 
>> I continue to think that this is insufficient, without seeing a nice
>> clean way to solve the issue properly.
> 
> But in this document the namespace path for processor objects will be
> _SB.XEN0.VPXX, which should prevent any namespace clashes. Maybe I should have
> updated the wording here, every Xen-related ACPI bit will be inside the
> _SB.XEN0 namespace.

Well, if we want to introduce our own parent name space, why the
special naming convention then? Any name not colliding with other
things in _SB.XEN0 should do then, so the only remaining risk would
then be that the firmware also has _SB.XEN0.

Jan


* Re: [DRAFT C] PVH CPU hotplug design document
From: Roger Pau Monné @ 2017-01-23 17:12 UTC
  To: Jan Beulich
  Cc: Stefano Stabellini, Graeme Gregory, Al Stone, Andrew Cooper,
	Anshul Makkar, Julien Grall, Paul Durrant, xen-devel,
	Boris Ostrovsky

On Mon, Jan 23, 2017 at 09:55:19AM -0700, Jan Beulich wrote:
> >>> On 23.01.17 at 17:42, <roger.pau@citrix.com> wrote:
> > On Mon, Jan 23, 2017 at 09:30:30AM -0700, Jan Beulich wrote:
> >> >>> On 17.01.17 at 18:14, <roger.pau@citrix.com> wrote:
> >> > This can be solved by using a different ACPI name in order to describe vCPUs in
> >> > the ACPI namespace. Most hardware vendors tend to use CPU or PR prefixes for
> >> > the processor objects, so using a 'VP' (ie: Virtual Processor) prefix should
> >> > prevent clashes.
> >> 
> >> I continue to think that this is insufficient, without seeing a nice
> >> clean way to solve the issue properly.
> > 
> > But in this document the namespace path for processor objects will be
> > _SB.XEN0.VPXX, which should prevent any namespace clashes. Maybe I should have
> > updated the wording here, every Xen-related ACPI bit will be inside the
> > _SB.XEN0 namespace.
> 
> Well, if we want to introduce our own parent name space, why the
> special naming convention then? Any name not colliding with other
> things in _SB.XEN0 should do then, so the only remaining risk would
> then be that the firmware also has _SB.XEN0.

Right, that's why I said that I should have reworded this. We can then use
PXXX, CXXX or whatever we want.

Yes, the only remaining risk is some vendor using _SB.XEN0, and AFAICT there's
no way to reserve anything in there (mostly because it's assumed that ACPI
tables will be created by a single entity I guess).

I think that the chance of this happening is 0%, and that there's no single
system out there with a _SB.XEN0 node. I've been wondering whether I should try
to post this to the ACPI working group, and try to get some feedback there.

Roger.

* Re: [DRAFT C] PVH CPU hotplug design document
From: Jan Beulich @ 2017-01-24  7:45 UTC
  To: Roger Pau Monné
  Cc: Stefano Stabellini, Graeme Gregory, Al Stone, Andrew Cooper,
	Anshul Makkar, Julien Grall, Paul Durrant, xen-devel,
	Boris Ostrovsky

>>> On 23.01.17 at 18:12, <roger.pau@citrix.com> wrote:
> On Mon, Jan 23, 2017 at 09:55:19AM -0700, Jan Beulich wrote:
>> >>> On 23.01.17 at 17:42, <roger.pau@citrix.com> wrote:
>> > On Mon, Jan 23, 2017 at 09:30:30AM -0700, Jan Beulich wrote:
>> >> >>> On 17.01.17 at 18:14, <roger.pau@citrix.com> wrote:
>> >> > This can be solved by using a different ACPI name in order to describe vCPUs in
>> >> > the ACPI namespace. Most hardware vendors tend to use CPU or PR prefixes for
>> >> > the processor objects, so using a 'VP' (ie: Virtual Processor) prefix should
>> >> > prevent clashes.
>> >> 
>> >> I continue to think that this is insufficient, without seeing a nice
>> >> clean way to solve the issue properly.
>> > 
>> > But in this document the namespace path for processor objects will be
>> > _SB.XEN0.VPXX, which should prevent any namespace clashes. Maybe I should have
>> > updated the wording here, every Xen-related ACPI bit will be inside the
>> > _SB.XEN0 namespace.
>> 
>> Well, if we want to introduce our own parent name space, why the
>> special naming convention then? Any name not colliding with other
>> things in _SB.XEN0 should do then, so the only remaining risk would
>> then be that the firmware also has _SB.XEN0.
> 
> Right, that's why I say that I should have reworded this. We can then use PXXX,
> CXXX or whatever we want.
> 
> Yes, the only remaining risk is some vendor using _SB.XEN0, and AFAICT there's
> no way to reserve anything in there (mostly because it's assumed that ACPI
> tables will be created by a single entity I guess).

Right.

> I think that the chance of this happening is 0%, and that there's no single
> system out there with a _SB.XEN0 node. I've been wondering whether I should try
> to post this to the ACPI working group, and try to get some feedback there.

As you've said during some earlier discussion, it won't hurt to give
this a try.

Jan


* Re: [DRAFT C] PVH CPU hotplug design document
From: Boris Ostrovsky @ 2017-01-24 14:20 UTC
  To: Roger Pau Monné, Jan Beulich
  Cc: Stefano Stabellini, Graeme Gregory, Al Stone, Andrew Cooper,
	Anshul Makkar, Julien Grall, Paul Durrant, xen-devel


> Yes, the only remaining risk is some vendor using _SB.XEN0, and AFAICT there's
> no way to reserve anything in there (mostly because it's assumed that ACPI
> tables will be created by a single entity I guess).
>
> I think that the chance of this happening is 0%, and that there's no single
> system out there with a _SB.XEN0 node. I've been wondering whether I should try
> to post this to the ACPI working group, and try to get some feedback there.

If you end up asking there, I'd suggest including Rafael Wysocki and Len
Brown (rafael@kernel.org and lenb@kernel.org) and maybe 
linux-acpi@vger.kernel.org as well.

-boris


* Re: [DRAFT C] PVH CPU hotplug design document
From: Al Stone @ 2017-02-06 23:06 UTC
  To: Boris Ostrovsky, Roger Pau Monné, Jan Beulich
  Cc: Stefano Stabellini, Graeme Gregory, Andrew Cooper, Anshul Makkar,
	Julien Grall, Paul Durrant, xen-devel

On 01/24/2017 07:20 AM, Boris Ostrovsky wrote:
> 
>> Yes, the only remaining risk is some vendor using _SB.XEN0, and AFAICT there's
>> no way to reserve anything in there (mostly because it's assumed that ACPI
>> tables will be created by a single entity I guess).
>>
>> I think that the chance of this happening is 0%, and that there's no single
>> system out there with a _SB.XEN0 node. I've been wondering whether I should try
>> to post this to the ACPI working group, and try to get some feedback there.
> 
> If you end up asking there, I'd suggest including Rafael Wysocki and Len
> Brown (rafael@kernel.org and lenb@kernel.org) and maybe 
> linux-acpi@vger.kernel.org as well.
> 
> -boris
> 

My apologies for not leaping into this discussion earlier; real life has been
somewhat complicated lately.  Hopefully I won't annoy too many people.

So, I am on the ASWG (ACPI Spec Working Group) as a Red Hat and/or Linaro
representative.  To clarify something mentioned quite some time ago, the STAO
and XENV tables are in the ACPI spec in a special form.  Essentially, there are
two classes of tables within ACPI: official tables defined in the spec itself
that are meant to be used anywhere ACPI is used, and tables whose names are
recognized but whose content is defined elsewhere.  The STAO and XENV belong
to this second class -- the spec reserves their signatures so that others do
not use them, but then points to an external source -- Xen, specifically -- for
the definition.  The practical implication is that Xen can change the
definitions as it wishes, without direct oversight by the ASWG.  Just the same,
it is considered bad form to do so, so new revisions should at least be sent
to the ASWG for discussion (it may make sense to pull the table into the spec
itself...).  Stefano and I worked together to get the original reservation made
for the STAO and XENV tables.

The other thing I've noticed so far in the discussion is that everything
discussed may work on x86 or ia64, but will not work at all on arm64.  The
HARDWARE_REDUCED flag in the FADT was mentioned -- this is the crux of the
problem.  For arm64, that flag is required to be set, so overloading it is most
definitely an issue.  More problematic, however, is the notion of using GPE
blocks; when the HARDWARE_REDUCED flag is set, the spec requires that GPE block
definitions be ignored.

Then it gets messy :).  The APIC and/or x2APIC subtables of the MADT are not
likely to exist on arm64; chances are just about zero, actually.  There are
other similar MADT subtables for arm64, but APIC, x2APIC and many more just
won't be there.  There is some overlap with ia64, but not entirely.

The other issue is that a separate name space for the added CPUs would have
to be very carefully done.  If not, then the processor hierarchy information
in the AML either becomes useless, or at the least inconsistent, and OSPMs
are just now beginning to use some of that info to make scheduling decisions.
It would be possible to just assume the hot plug CPUs are outside of any
existing processor hierarchy, but I would then worry that power management
decisions made by the OSPM might be wrong; I can imagine a scenario where
a CPU is inserted and shares a power rail with an existing CPU, but the
existing CPU is idle so it decides to power off since it's the last in the
hierarchy, so the power rail isn't needed, and now the power gets turned off
to the unit just plugged in because the OSPM doesn't realize it shares power.

So at a minimum, it sounds like there would need to be a solution for each
architecture, with maybe some fiddling around on ia64, too.  Unfortunately,
I believe the ACPI spec provides a way to handle all of the things wanted,
but an ASL interpreter would be required because it does rely on executing
methods (e.g., _CRS to determine processor resources on hot plug).  The ACPICA
code is dual-licensed, GPL and commercial, and there is the OpenBSD code.
But without an interpreter, it feels like we're trying to push dynamic
behavior into static tables, and they really weren't designed for that.

That's my $0.02 worth at least....

-- 
ciao,
al
-----------------------------------
Al Stone
Software Engineer
Linaro Enterprise Group
al.stone@linaro.org
-----------------------------------

* Re: [DRAFT C] PVH CPU hotplug design document
From: Roger Pau Monné @ 2017-02-07 12:21 UTC
  To: Al Stone
  Cc: Stefano Stabellini, Graeme Gregory, Andrew Cooper, Anshul Makkar,
	Julien Grall, Paul Durrant, Jan Beulich, xen-devel,
	Boris Ostrovsky

Hello Al,

Thanks for your comments, please see below.

On Mon, Feb 06, 2017 at 04:06:45PM -0700, Al Stone wrote:
> On 01/24/2017 07:20 AM, Boris Ostrovsky wrote:
> > 
> >> Yes, the only remaining risk is some vendor using _SB.XEN0, and AFAICT there's
> >> no way to reserve anything in there (mostly because it's assumed that ACPI
> >> tables will be created by a single entity I guess).
> >>
> >> I think that the chance of this happening is 0%, and that there's no single
> >> system out there with a _SB.XEN0 node. I've been wondering whether I should try
> >> to post this to the ACPI working group, and try to get some feedback there.
> > 
> > If you end up asking there, I'd suggest including Rafael Wysocki and Len
> > Brown (rafael@kernel.org and lenb@kernel.org) and maybe 
> > linux-acpi@vger.kernel.org as well.
> > 
> > -boris
> > 
> 
> My apologies for not leaping into this discussion earlier; real life has been
> somewhat complicated lately.  Hopefully I won't annoy too many people.
> 
> So, I am on the ASWG (ACPI Spec Working Group) as a Red Hat and/or Linaro
> representative.  To clarify something mentioned quite some time ago, the STAO
> and XENV tables are in the ACPI in a special form.  Essentially, there are two
> classes of tables within ACPI: official tables defined in the spec itself that
> are meant to be used anywhere ACPI is used, and, tables whose names are to be
> recognized but whose content is defined elsewhere.  The STAO and XENV belong
> to this second class -- the spec reserved their signatures so that others do
> not use them, but then points to an external source -- Xen, specifically -- for
> the definition.  The practical implication is that Xen can change definitions
> as they wish, without direct oversight of the ASWG.  Just the same, it is
> considered bad form to do so, however, so new revisions should at least be sent
> to the ASWG for discussion (it may make sense to pull the table into the spec
> itself...).  Stefano and I worked together to get the original reservation made
> for the STAO and XENV tables.
> 
> The other thing I've noticed so far in the discussion is that everything
> discussed may work on x86 or ia64, but will not work at all on arm64.  The
> HARDWARE_REDUCED flag in the FADT was mentioned -- this is the crux of the
> problem.  For arm64, that flag is required to be set, so overloading it is most
> definitely an issue.  More problematic, however, is the notion of using GPE
> blocks; when the HARDWARE_REDUCED flag is set, the spec requires GPE block
> definitions are to be ignored.

Yes, this document is specific to x86. I believe that the differences between
x86 and ARM regarding ACPI would make it too complicated to come up with a
solution that's usable on both, mainly because ACPI tables on ARM and x86 are
already too different.

> Then it gets messy :).  The APIC and/or x2APIC subtables of the MADT are not
> likely to exist on arm64; chances are just about zero, actually.  There are
> other similar MADT subtables for arm64, but APIC, x2APIC and many more just
> won't be there.  There is some overlap with ia64, but not entirely.

ia64 is also out of the picture here, all the more so since Xen doesn't support
it, and it doesn't look like anyone is working on it.

> The other issue is that a separate name space for the added CPUs would have
> to be very carefully done.  If not, then the processor hierarchy information
> in the AML either becomes useless, or at the least inconsistent, and OSPMs
> are just now beginning to use some of that info to make scheduling decisions.
> It would be possible to just assume the hot plug CPUs are outside of any
> existing processor hierarchy, but I would then worry that power management
> decisions made by the OSPM might be wrong; I can imagine a scenario where
> a CPU is inserted and shares a power rail with an existing CPU, but the
> existing CPU is idle so it decides to power off since it's the last in the
> hierarchy, so the power rail isn't needed, and now the power gets turned off
> to the unit just plugged in because the OSPM doesn't realize it shares power.

Well, my suggestion was to add the processor objects of the virtual CPUs inside
an ACPI Module Device that has the _SB.XEN0 namespace. However, AFAIK there's
no way to reserve the _SB.XEN0 namespace, so a vendor could use that for
something else. I think the chances of that happening are very low, but it's
not impossible.

Is there any way in ACPI to reserve a namespace for a certain usage? (i.e.
would it be possible to somehow reserve _SB.XEN0 for Xen usage?)

Or if we want to go more generic, we could reserve _SB.VIRT for generic
hypervisor usage.

> So at a minimum, it sounds like there would need to be a solution for each
> architecture, with maybe some fiddling around on ia64, too.  Unfortunately,
> I believe the ACPI spec provides a way to handle all of the things wanted,
> but an ASL interpreter would be required because it does rely on executing
> methods (e.g., _CRS to determine processor resources on hot plug).  The ACPICA
> code is dual-licensed, GPL and commercial, and there is the OpenBSD code.
> But without an interpreter, it feels like we're trying to push dynamic
> behavior into static tables, and they really weren't designed for that.

Yes, I think an arch-specific solution is needed in this case. Currently Dom0
passes all this information to Xen using hypercalls, but I don't think an AML
parser in Xen is strictly needed in order to implement the solution that I'm
proposing. We can get the ACPI processor object IDs from the MADT, and that
could be used in the STAO to hide them from Dom0 (provided that the STAO is
modified to add a new field, as described in the design document).

I'm also a member of the ACPI working group, and I was planning to send this
design document there for further discussion, just haven't found the time yet
to write a proper mail :(.

Roger.

* Re: [DRAFT C] PVH CPU hotplug design document
From: Al Stone @ 2017-02-22 19:29 UTC
  To: Roger Pau Monné
  Cc: Stefano Stabellini, Graeme Gregory, Andrew Cooper, Anshul Makkar,
	Julien Grall, Paul Durrant, Jan Beulich, xen-devel,
	Boris Ostrovsky

On 02/07/2017 05:21 AM, Roger Pau Monné wrote:
> Hello Al,
> 
> Thanks for your comments, please see below.
> 
> On Mon, Feb 06, 2017 at 04:06:45PM -0700, Al Stone wrote:
>> On 01/24/2017 07:20 AM, Boris Ostrovsky wrote:
[snip....]

>> Then it gets messy :).  The APIC and/or x2APIC subtables of the MADT are not
>> likely to exist on arm64; chances are just about zero, actually.  There are
>> other similar MADT subtables for arm64, but APIC, x2APIC and many more just
>> won't be there.  There is some overlap with ia64, but not entirely.
> 
> ia64 is also out of the picture here, the more that Xen doesn't support it, and
> it doesn't look like anyone is working on it.

Aw.  That's kind of sad.  I worked on Xen/ia64 briefly many, many moons ago.

Yeah, there are arch differences.  Once you have the x86 side going, though, I
think adding in arm64 wouldn't be too bad; they're a little simpler, in some
respects.

>> The other issue is that a separate name space for the added CPUs would have
>> to be very carefully done.  If not, then the processor hierarchy information
>> in the AML either becomes useless, or at the least inconsistent, and OSPMs
>> are just now beginning to use some of that info to make scheduling decisions.
>> It would be possible to just assume the hot plug CPUs are outside of any
>> existing processor hierarchy, but I would then worry that power management
>> decisions made by the OSPM might be wrong; I can imagine a scenario where
>> a CPU is inserted and shares a power rail with an existing CPU, but the
>> existing CPU is idle so it decides to power off since it's the last in the
>> hierarchy, so the power rail isn't needed, and now the power gets turned off
>> to the unit just plugged in because the OSPM doesn't realize it shares power.
> 
> Well, my suggestion was to add the processor objects of the virtual CPUs inside
> an ACPI Module Device that has the _SB.XEN0 namespace. However, AFAIK there's
> no way to reserve the _SB.XEN0 namespace, so a vendor could use that for
> something else. I think the chances of that happening are very low, but it's
> not impossible.
> 
> Is there anyway in ACPI to reserve a namespace for a certain usage? (ie: would
> it be possible to somehow reserve _SB.XEN0 for Xen usage?)

The only really reserved namespace is "_XXX".  The rest is fair game; since one
can only use four characters, I suspect there will be some reluctance to set
aside more.

There are the top-level names (mostly just \_SB these days).  Maybe a top level
\_XEN or \_VRT could work, perhaps with some fairly strict rules on what can be
in that subspace.  I think the issue at that point would be whether or not this
is a solution to a general problem, or if it is something that affects only Xen.

> Or if we want to go more generic, we could reserve _SB.VIRT for generic
> hypervisor usage.

Right.  And this would be one of the key questions from ASWG -- can it be
generalized?

> [snip...] 
> I'm also a member of the ACPI working group, and I was planning to send this
> design document there for further discussion, just haven't found the time yet
> to write a proper mail :(.
> 
> Roger.
> 

No worries.  Getting things started is not too bad; it's the discussion after
that can go on for a while :-).

-- 
ciao,
al
-----------------------------------
Al Stone
Software Engineer
Linaro Enterprise Group
al.stone@linaro.org
-----------------------------------
