* HVMlite ABI specification DRAFT B + implementation outline
@ 2016-02-08 19:03 Roger Pau Monné
  2016-02-08 21:26 ` Boris Ostrovsky
                   ` (3 more replies)
  0 siblings, 4 replies; 24+ messages in thread
From: Roger Pau Monné @ 2016-02-08 19:03 UTC (permalink / raw)
  To: xen-devel
  Cc: Wei Liu, Stefano Stabellini, Andrew Cooper, Tim Deegan,
	Paul Durrant, David Vrabel, Jan Beulich, Samuel Thibault,
	Boris Ostrovsky

Hello,

I've Cced a bunch of people who have expressed interest in the HVMlite
design/implementation, from either a Xen or an OS point of view. If you
would like to be removed, please say so and I will remove you from
further iterations. The same applies if you want to be added to the Cc.

This is an initial draft on the HVMlite design and implementation. I've 
mixed certain aspects of the design with the implementation, because I 
think we are quite tied by the implementation possibilities in certain 
aspects, so not speaking about it would make the document incomplete. I 
might be wrong on that, so feel free to comment otherwise if you would 
prefer a different approach.

The document is still not complete. I'm of course not as knowledgeable 
as some people on the Cc, so please correct me if you think there are 
mistakes or simply impossible goals.

I think I've managed to integrate all the comments from DRAFT A. I still
haven't done a s/HVMlite/PVH/, but I plan to do so once the document is
finished and ready to go inside of the Xen tree.

Roger.

---
Xen HVMlite ABI
===============

Boot ABI
--------

Since the Xen entry point into the kernel can be different from the
native entry point, an `ELFNOTE` is used in order to tell the domain
builder how to load and jump into the kernel entry point:

    ELFNOTE(Xen, XEN_ELFNOTE_PHYS32_ENTRY,          .long,  xen_start32)

The presence of the `XEN_ELFNOTE_PHYS32_ENTRY` note indicates that the
kernel supports the boot ABI described in this document.

The domain builder shall load the kernel into the guest memory space and
jump into the entry point defined at `XEN_ELFNOTE_PHYS32_ENTRY` with the
following machine state:

 * `ebx`: contains the physical memory address where the loader has placed
   the boot start info structure.

 * `cr0`: bit 0 (PE) must be set. All the other writeable bits are cleared.

 * `cr4`: all bits are cleared.

 * `cs`: must be a 32-bit read/execute code segment with a base of ‘0’
   and a limit of ‘0xFFFFFFFF’. The selector value is unspecified.

 * `ds`, `es`: must be a 32-bit read/write data segment with a base of
   ‘0’ and a limit of ‘0xFFFFFFFF’. The selector values are all unspecified.

 * `tr`: must be a 32-bit TSS (active) with a base of '0' and a limit of '0x67'.

 * `eflags`: all user settable bits are clear.

All other processor registers and flag bits are unspecified. The OS is in
charge of setting up its own stack, GDT and IDT.

The format of the boot start info structure is the following (pointed to
be %ebx):

NOTE: nothing will be loaded at physical address 0, so a 0 value in any of the
address fields should be treated as not present.

 0 +----------------+
   | magic          | Contains the magic value 0x336ec578
   |                | ("xEn3" with the 0x80 bit of the "E" set).
 4 +----------------+
   | flags          | SIF_xxx flags.
 8 +----------------+
   | cmdline_paddr  | Physical address of the command line,
   |                | a zero-terminated ASCII string.
12 +----------------+
   | nr_modules     | Number of modules passed to the kernel.
16 +----------------+
   | modlist_paddr  | Physical address of an array of modules
   |                | (layout of the structure below).
20 +----------------+

The layout of each entry in the module structure is the following:

 0 +----------------+
   | paddr          | Physical address of the module.
 4 +----------------+
   | size           | Size of the module in bytes.
 8 +----------------+
   | cmdline_paddr  | Physical address of the command line,
   |                | a zero-terminated ASCII string.
12 +----------------+
   | reserved       |
16 +----------------+
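
For illustration, a C view of the two layouts above could look like the
following (this is only a sketch: the struct and field names are mine and
simply mirror the byte offsets in the diagrams; they are not taken from a
public header):

    #include <stdint.h>

    /* Structure pointed to by %ebx (offsets 0 to 20 above). */
    struct hvmlite_start_info {
        uint32_t magic;           /* 0x336ec578 ("xEn3" with bit 7 of 'E' set) */
        uint32_t flags;           /* SIF_xxx flags                             */
        uint32_t cmdline_paddr;   /* physical address of the command line      */
        uint32_t nr_modules;      /* number of entries in the module list      */
        uint32_t modlist_paddr;   /* physical address of the module list       */
    };

    /* One entry of the module list (offsets 0 to 16 above). */
    struct hvmlite_modlist_entry {
        uint32_t paddr;           /* physical address of the module            */
        uint32_t size;            /* size of the module in bytes               */
        uint32_t cmdline_paddr;   /* physical address of the module cmdline    */
        uint32_t reserved;
    };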

Other relevant information needed in order to boot a guest kernel
(console page address, xenstore event channel...) can be obtained
using HVMPARAMS, just like it's done on HVM guests.

The setup of the hypercall page is also performed in the same way
as HVM guests, using the hypervisor cpuid leaves and msr ranges.
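
As a rough sketch of that flow (not a definitive implementation: the cpuid,
wrmsr and hypercall wrappers are assumed to be provided by the kernel, and the
numeric constants should be taken from the Xen public headers rather than from
this example):

    #include <stdint.h>
    #include <string.h>

    /* Assumed to exist in the guest kernel. */
    extern void cpuid(uint32_t leaf, uint32_t *a, uint32_t *b,
                      uint32_t *c, uint32_t *d);
    extern void wrmsr(uint32_t msr, uint64_t val);
    extern long HYPERVISOR_hvm_op(unsigned int op, void *arg);

    extern char hypercall_page[4096];       /* page aligned, in the kernel image */

    #define HVMOP_get_param         1       /* see xen/hvm/hvm_op.h  */
    #define HVM_PARAM_STORE_EVTCHN  2       /* see xen/hvm/params.h  */
    #define DOMID_SELF              0x7ff0U

    struct xen_hvm_param {
        uint16_t domid;
        uint32_t index;
        uint64_t value;
    };

    static int xen_setup_hypercall_page(void)
    {
        uint32_t eax, ebx, ecx, edx, base = 0x40000000;
        char sig[13];

        cpuid(base, &eax, &ebx, &ecx, &edx);
        memcpy(sig + 0, &ebx, 4);
        memcpy(sig + 4, &ecx, 4);
        memcpy(sig + 8, &edx, 4);
        sig[12] = '\0';
        if (strcmp(sig, "XenVMMXenVMM"))
            return -1;                      /* not running on Xen */

        /* Leaf base + 2: ebx holds the MSR used to set the hypercall page. */
        cpuid(base + 2, &eax, &ebx, &ecx, &edx);
        /* Physical address of the page; identity-mapped at this early stage. */
        wrmsr(ebx, (uint64_t)(uintptr_t)hypercall_page);
        return 0;
    }

    static uint64_t xen_get_hvm_param(uint32_t index)
    {
        struct xen_hvm_param p = { .domid = DOMID_SELF, .index = index };

        return HYPERVISOR_hvm_op(HVMOP_get_param, &p) ? 0 : p.value;
    }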

Hardware description
--------------------

Hardware description can come from two different sources, just like on (PV)HVM
guests.

Description of PV devices will always come from xenbus, and in fact
xenbus is the only hardware description that is guaranteed to always be
provided to HVMlite guests.

Description of physical hardware devices will always come from ACPI; in the
absence of any physical hardware devices, no ACPI tables will be provided. The
presence of ACPI tables can be detected by finding the RSDP, just like on
bare metal.
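
As a reminder of what "just like on bare metal" means, a minimal RSDP scan
could look like the sketch below (checksum validation omitted; note that this
draft does not spell out where exactly the toolstack places the tables for
DomUs, so the EBDA leg may simply find nothing there):

    #include <stdint.h>
    #include <string.h>

    /* Assumes the low physical memory is accessible 1:1 at this point. */
    static const void *scan_for_rsdp(uintptr_t start, uintptr_t end)
    {
        uintptr_t p;

        for (p = start; p < end; p += 16)   /* signature is 16-byte aligned */
            if (!memcmp((const void *)p, "RSD PTR ", 8))
                return (const void *)p;
        return NULL;
    }

    static const void *find_rsdp(void)
    {
        /* Real-mode segment of the EBDA, stored at physical 0x40E. */
        uintptr_t ebda = (uintptr_t)(*(volatile uint16_t *)0x40E) << 4;
        const void *rsdp = NULL;

        if (ebda)
            rsdp = scan_for_rsdp(ebda, ebda + 1024);
        if (!rsdp)
            rsdp = scan_for_rsdp(0xE0000, 0x100000);
        return rsdp;                        /* NULL means "no ACPI provided" */
    }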

Non-PV devices exposed to the guest
-----------------------------------

The initial idea was to simply not provide any emulated devices to a HVMlite
guest as the default option. We have however identified certain situations
where emulated devices could be interesting, both from a performance and
ease of implementation point of view. The following list tries to encompass
the different identified scenarios:

 * 1. HVMlite with no emulated devices at all
   ------------------------------------------
   This is the current implementation inside of Xen, everything is disabled
   by default and the guest has access to the PV devices only. This is of
   course the most secure design because it has the smaller surface of attack.

 * 2. HVMlite with (or capable of) PCI-passthrough
   -----------------------------------------------
   The current model of PCI-passthrough in PV guests is complex and requires
   heavy modifications to the guest OS. Going forward we would like to remove
   this limitation, by providing an interface that's the same as found on bare
   metal. In order to do this, at least an emulated local APIC should be
   provided to guests, together with the access to a PCI-Root complex.
   As said in the 'Hardware description' section above, this will also require
   ACPI. So this proposed scenario will require the following elements that are
   not present in the minimal (or default) HVMlite implementation: ACPI, local
   APIC, IO APIC (optional) and PCI-Root complex.

 * 3. HVMlite hardware domain
   --------------------------
   The aim is that a HVMlite hardware domain is going to work exactly like a
   HVMlite domain with passed-through devices. This means that the domain will
   need access to the same set of emulated devices, and that some ACPI tables
   must be fixed in order to reflect the reality of the container the hardware
   domain is running on. The ACPI section contains more detailed information
   about which/how these tables are going to be fixed.

   Note that in this scenario the hardware domain will *always* have a local
   APIC and IO APIC, and that the usage of PHYSDEV operations and PIRQ event
   channels is going to be removed in favour of the bare metal mechanisms.

The default model for HVMlite guests is going to be to provide a local APIC
together with a minimal set of ACPI tables that accurately match the reality of
the container the guest is running on. An administrator should be able to change
the default setting using the following tunables that are part of the xl
toolstack:

 * lapic: default to true. Indicates whether a local APIC is provided.
 * ioapic: default to false. Indicates whether an IO APIC is provided
   (requires lapic set to true).
 * acpi: default to true. Indicates whether ACPI tables are provided.

ACPI
----

ACPI tables will be provided to both the hardware domain and unprivileged
domains. In the case of unprivileged guests ACPI tables are going to be
created by the toolstack and will only contain the set of devices available
to the guest, which will at least be the following: local APIC and
optionally an IO APIC and passed-through device(s). In order to provide this
information from ACPI the following tables are needed as a minimum: RSDT,
FADT, MADT and DSDT. If an administrator decides to not provide a local APIC,
the MADT table is not going to be provided to the guest OS.

The ACPI_FADT_NO_CMOS_RTC flag in the FADT boot_flags field is going to be used
to signal guests that there's no RTC device (the Xen PV wall clock should be
used instead). It is likely that this flag is not going to be set for the
hardware domain, since it should have access to the RTC present in the host
(if there's one). The ACPI_FADT_NO_VGA is also very likely to be set in the
same boot_flags FADT field for DomUs in order to signal that there's no VGA
adapter present.

Finally, the ACPI_FADT_HW_REDUCED flag is going to be set in the FADT flags field
in order to signal that there are no legacy devices: i8259 PIC or i8254 PIT.
There's no intention to enable these devices, so it is expected that the
hardware-reduced FADT flag is always going to be set.
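
From the guest side, acting on those flags could look roughly like this (a
sketch using the ACPICA names for the FADT fields and flag bits; an OS with
its own ACPI implementation will have different spellings for the same bits):

    #include <stdbool.h>
    #include "acpi.h"      /* ACPICA: struct acpi_table_fadt and flag macros */

    static bool has_cmos_rtc, has_vga, has_legacy_pic_pit;

    static void hvmlite_parse_fadt(const struct acpi_table_fadt *fadt)
    {
        /* IA-PC boot architecture flags (the boot_flags field). */
        has_cmos_rtc = !(fadt->boot_flags & ACPI_FADT_NO_CMOS_RTC);
        has_vga      = !(fadt->boot_flags & ACPI_FADT_NO_VGA);

        /* Fixed feature flags: hardware-reduced means no i8259 PIC/i8254 PIT. */
        has_legacy_pic_pit = !(fadt->flags & ACPI_FADT_HW_REDUCED);

        /* When has_cmos_rtc is false, the Xen PV wall clock is the time source. */
    }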

In the case of the hardware domain, Xen has traditionally passed through the
native ACPI tables to the guest. This is something that of course we still
want to do, but in the case of HVMlite Xen will have to make sure that
the data passed in the ACPI tables to the hardware domain contain the accurate
hardware description. This means that at least certain tables will have to
be modified/mangled before being presented to the guest:

 * MADT: the number of local APIC entries needs to be fixed to match the number
         of vCPUs available to the guest. The address of the IO APIC(s) also
         needs to be fixed in order to match the emulated ones that we are going
         to provide.

 * DSDT: certain devices reported in the DSDT may not be available to the guest,
         but since the DSDT is a run-time generated table we cannot fix it. In
         order to cope with this, a STAO table will be provided that should
         be able to signal which devices are not available to the hardware
         domain. This is in line with the Xen/ACPI implementation for ARM.

 * MPST, PMTT, SBTT, SRAT and SLIT: won't be initially presented to the guest,
   until we get our act together on the vNUMA stuff.

NB: there are corner cases that I'm not sure how to solve properly. Currently
the hardware domain has some 'hacks' regarding ACPI and Xen. At least I'm aware
of the following:

 * 1. Reporting CPU PM info back to Xen: this comes from the DSDT table, and
   since this table is only available to the hardware domain it has to report
   the PM info back to Xen so that Xen can perform proper PM.
 * 2. Doing proper shutdown (S5) requires the usage of a hypercall, which is
   mixed with native ACPICA code in most OSes. This is awkward and requires
   the usage of hooks into ACPICA which we have not yet managed to upstream.
 * 3. Reporting the PCI devices it finds to the hypervisor: this is not very
   intrusive in general, so I'm not that pushed to remove it. It's generally
   easy in any OS to add some kind of hook that's executed every time a PCI
   device is discovered.
 * 4. Report PCI memory-mapped configuration areas to Xen: my opinion regarding
   this one is the same as (3), it's not really intrusive so I'm not very
   pushed to remove it.

I would ideally like to get rid of (2) in the list above, since I'm quite sure
we are never going to be able to merge the needed hooks into ACPICA. AFAICT Xen
should be able to parse the FADT table and find the address of the PM1a and
PM1b control registers and trap on access.
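
To make the idea for (2) a bit more concrete, the relevant addresses are plain
FADT fields, so Xen could fetch them along these lines (field names as in the
ACPICA-derived actbl.h; the include path depends on the tree and the actual
trapping/emulation is hand-waved here):

    #include <stdint.h>
    #include "actbl.h"   /* ACPICA-derived struct acpi_table_fadt definition */

    static void find_pm1_control(const struct acpi_table_fadt *fadt,
                                 uint64_t *pm1a, uint64_t *pm1b)
    {
        /* Prefer the extended (generic address) fields when they are filled. */
        *pm1a = fadt->xpm1a_control_block.address ?
                fadt->xpm1a_control_block.address : fadt->pm1a_control_block;
        *pm1b = fadt->xpm1b_control_block.address ?
                fadt->xpm1b_control_block.address : fadt->pm1b_control_block;

        /*
         * With these known, Xen could trap hardware domain writes to the
         * PM1a/PM1b control registers and perform the S5 transition itself,
         * instead of relying on a hook inside the guest's ACPICA code.
         */
    }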

(1) is also quite nasty, but I don't see any possible way to get rid of it.

AP startup
----------

AP startup is performed using hypercalls. The following VCPU operations
are used in order to bring up secondary vCPUs:

 * VCPUOP_initialise is used to set the initial state of the vCPU. The
   argument passed to the hypercall must be of the type vcpu_hvm_context.
   See public/hvm/hvm_vcpu.h for the layout of the structure. Note that
   this hypercall allows starting the vCPU in several modes (16/32/64bits),
   regardless of the mode the BSP is currently running on.

 * VCPUOP_up is used to launch the vCPU once the initial state has been
   set using VCPUOP_initialise.

 * VCPUOP_down is used to bring down a vCPU.

 * VCPUOP_is_up is used to scan the number of available vCPUs.

Additionally, if a local APIC is available CPU bringup can also be performed
using the hardware native AP startup sequence (IPIs). In this case the
hypercall interface will still be provided, as a faster and more convenient
way of starting APs.
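
A very rough sketch of the hypercall based bring-up follows; the include paths
and the exact field names inside vcpu_hvm_context are assumptions on my side
(check public/vcpu.h and public/hvm/hvm_vcpu.h), and a real implementation
also has to fill in the segment state:

    #include <stdint.h>
    #include <string.h>

    #include <xen/vcpu.h>            /* VCPUOP_initialise, VCPUOP_up             */
    #include <xen/hvm/hvm_vcpu.h>    /* struct vcpu_hvm_context, VCPU_HVM_MODE_* */

    extern long HYPERVISOR_vcpu_op(int cmd, unsigned int vcpuid, void *extra);

    static int bring_up_ap(unsigned int cpu, uint32_t entry, uint32_t stack)
    {
        struct vcpu_hvm_context ctx;
        int rc;

        memset(&ctx, 0, sizeof(ctx));
        ctx.mode = VCPU_HVM_MODE_32B;         /* start the AP in 32-bit mode    */
        ctx.cpu_regs.x86_32.eip = entry;      /* AP entry point                 */
        ctx.cpu_regs.x86_32.esp = stack;      /* per-CPU stack                  */
        ctx.cpu_regs.x86_32.cr0 = 0x1;        /* PE set, like the BSP at boot   */
        /* The segment base/limit/attribute fields also need to be filled in.  */

        rc = HYPERVISOR_vcpu_op(VCPUOP_initialise, cpu, &ctx);
        if (rc)
            return rc;
        return HYPERVISOR_vcpu_op(VCPUOP_up, cpu, NULL);
    }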

MMIO mapping
------------

For DomUs without any device passed-through no direct MMIO mappings will be
present in the physical memory map presented to the guest. For DomUs with
devices passed-through the toolstack will create direct MMIO mappings as
part of the domain build process, and thus no action will be required
from the DomU.

For the hardware domain initial direct MMIO mappings will be set for the
following regions:

NOTE: ranges are defined using memory addresses, not pages.

 * [0x0, 0xFFFFF]: the low 1MiB will be mapped into the physical guest
   memory map at the same position.

 * [0xF00000, 0xFFFFFF]: the ISA memory hole will be mapped 1:1 into the
   guest physical memory.

 * ACPI memory areas: regions with type E820_ACPI or E820_NVS will be mapped
   1:1 to the guest physical memory map. There are going to be exceptions if
   Xen has to modify the tables before presenting them to the guest.

 * PCI Express MMCFG: if Xen is able to identify any of these regions at boot
   time they will also be made available to the guest at the same position
   in it's physical memory map. It is possible that Xen will trap accesses to
   those regions, but a guest should be able to use the native configuration
   mechanism in order to interact with this configuration space. If the
   hardware domain reports the presence of any of those regions using the
   PHYSDEVOP_pci_mmcfg_reserved hypercall Xen will also all guest access to
   them.

 * PCI BARs: it's not possible for Xen to know the position of the BARs of
   the PCI devices without hardware domain interaction. In order to have
   the BARs of PCI devices properly mapped the hardware domain needs to
   call the PHYSDEVOP_pci_device_add hypercall, that will take care of setting
   up the BARs in the guest physical memory map using 1:1 MMIO mappings. This
   procedure will be transparent from guest point of view, and upon returning
   from the hypercall mappings must be already established.
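
To make the hardware domain's side of the last two points concrete, here is a
sketch of the two reporting hypercalls. The structure layouts and the numeric
values of the PHYSDEVOP_*/XEN_PCI_* constants are copied from xen/physdev.h as
I remember them, so double check against the public header before relying on
them:

    #include <stdint.h>

    extern long HYPERVISOR_physdev_op(int cmd, void *arg);

    #define PHYSDEVOP_pci_mmcfg_reserved  24
    #define PHYSDEVOP_pci_device_add      25
    #define XEN_PCI_MMCFG_RESERVED        0x1
    #define XEN_PCI_DEV_PXM               0x4

    struct physdev_pci_mmcfg_reserved {
        uint64_t address;                     /* base of the MMCFG window      */
        uint16_t segment;
        uint8_t  start_bus, end_bus;
        uint32_t flags;
    };

    struct physdev_pci_device_add {
        uint16_t seg;
        uint8_t  bus, devfn;
        uint32_t flags;
        struct { uint8_t bus, devfn; } physfn;
        uint32_t optarr[1];                   /* optarr[0] = PXM when flagged  */
    };

    static int report_mmcfg(uint64_t addr, uint16_t seg,
                            uint8_t start_bus, uint8_t end_bus)
    {
        struct physdev_pci_mmcfg_reserved r = {
            .address = addr, .segment = seg,
            .start_bus = start_bus, .end_bus = end_bus,
            .flags = XEN_PCI_MMCFG_RESERVED,
        };

        return HYPERVISOR_physdev_op(PHYSDEVOP_pci_mmcfg_reserved, &r);
    }

    static int report_pci_device(uint16_t seg, uint8_t bus, uint8_t devfn,
                                 uint32_t pxm)
    {
        struct physdev_pci_device_add d = {
            .seg = seg, .bus = bus, .devfn = devfn,
            .flags = XEN_PCI_DEV_PXM,
            .optarr = { pxm },                /* NUMA proximity domain         */
        };

        return HYPERVISOR_physdev_op(PHYSDEVOP_pci_device_add, &d);
    }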

Xen HVMlite implementation plan
===============================

This is of course not part of the ABI, but I guess it makes sense to add it
here in order to be able to more easily split the tasks required in order to
make the proposed implementation above a reality. I've tried to split
the tasks into smaller sub-tasks when possible.

DomU
----

 1. Initial HVMlite implementation based on a HVM guest: no emulated devices
    will be provided, interface exactly the same as a PVH guest except for the
    boot ABI.

 2. Provide ACPI tables to HVMlite guests: the initial set of provided tables
    will be: RSDT, FADT, MADT (iff local APIC is enabled).

 3. Enable the local APIC by default for HVMlite guests.

 4. Provide options to xl/libxl in order to allow admins to select the
    presence of a local APIC and IO APIC to HVMlite guests.

 5. Implement an emulated PCI Root Complex inside of Xen.

 6. Provide a DSDT table to HVMlite guests in order to signal the presence
    of PCI-passthrough devices.

IMHO, we should focus on (2) and (3) at the moment, and (4) is quite trivial
once those two are in place. (5) and (6) should be implemented once HVMlite
hardware domains are functional.

When implementing (2) it would be good to place the ACPI related code in a
place that's accessible from libxl, hvmloader and Xen itself, in order
to reduce code duplication. hvmloader already has most if not all the required
code in order to build the tables that are needed for HVMlite DomU.

Dom0
----

 1. Add a new Dom0 builder specific for HVM-like domains. PV domains have
    different requirements and sharing the same Dom0 domain builder only makes
    the code for both cases much harder to read and disentangle.

 2. Implement the code required in order to mangle/modify the ACPI tables
    provided to Dom0, so that it matches the reality of the container provided
    to Dom0.

 3. Allow HVM Dom0 to use PHYSDEVOP_pci_mmcfg_reserved and
    PHYSDEVOP_pci_device_add and make sure these hypercalls add the proper
    MMIO mappings.

 4. Do the necessary wiring so that interrupts from physical devices are
    received by Dom0 using the emulated interrupt controllers (local and IO
    APICs).

This plan is not as detailed as the DomU one, since the Dom0 work is not as
advanced as the DomU work, and is also tied to the DomU implementation. I
have an initial implementation for (1), and will continue working on it.


* Re: HVMlite ABI specification DRAFT B + implementation outline
  2016-02-08 19:03 HVMlite ABI specification DRAFT B + implementation outline Roger Pau Monné
@ 2016-02-08 21:26 ` Boris Ostrovsky
  2016-02-09 10:56 ` Andrew Cooper
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 24+ messages in thread
From: Boris Ostrovsky @ 2016-02-08 21:26 UTC (permalink / raw)
  To: Roger Pau Monné, xen-devel
  Cc: Wei Liu, Stefano Stabellini, Andrew Cooper, Tim Deegan,
	Paul Durrant, David Vrabel, Jan Beulich, Samuel Thibault



On 02/08/2016 02:03 PM, Roger Pau Monné wrote:
>
>   * PCI BARs: it's not possible for Xen to know the position of the BARs of
>     the PCI devices without hardware domain interaction. In order to have
>     the BARs of PCI devices properly mapped the hardware domain needs to
>     call the PHYSDEVOP_pci_device_add hypercall, that will take care of setting
>     up the BARs in the guest physical memory map using 1:1 MMIO mappings. This
>     procedure will be transparent from guest point of view, and upon returning
>     from the hypercall mappings must be already established.

We also want to use PHYSDEVOP_pci_device_add because that's how we pass the
device's PXM information to the hypervisor.

>
> Xen HVMlite implementation plan
> ===============================
>
> This is of course not part of the ABI, but I guess it makes sense to add it
> here in order to be able to more easily split the tasks required in order to
> make the proposed implementation above a reality. I've tried to split
> the tasks into smaller sub-tasks when possible.
>
> DomU
> ----
>
>   1. Initial HVMlite implementation based on a HVM guest: no emulated devices
>      will be provided, interface exactly the same as a PVH guest except for the
>      boot ABI.
>
>   2. Provide ACPI tables to HVMlite guests: the initial set of provided tables
>      will be: RSDT, FADT, MADT (iff local APIC is enabled).
>
>   3. Enable the local APIC by default for HVMlite guests.
>
>   4. Provide options to xl/libxl in order to allow admins to select the
>      presence of a local APIC and IO APIC to HVMlite guests.
>
>   5. Implement an emulated PCI Root Complex inside of Xen.
>
>   6. Provide a DSDT table to HVMlite guests in order to signal the presence
>      of PCI-passthrough devices.
>
> IMHO, we should focus on (2) and (3) at the moment, and (4) is quite trivial
> once those two are in place. (5) and (6) should be implemented once HVMlite
> hardware domains are functional.
>
> When implementing (2) it would be good to place the ACPI related code in a
> place that's accessible from libxl, hvmloader and Xen itself, in order
> to reduce code duplication. hvmloader already has most if not all the required
> code in order to build the tables that are needed for HVMlite DomU.

Since I started poking at ACPI code in hvmloader anyway I can take a 
stab at (2).

-boris



* Re: HVMlite ABI specification DRAFT B + implementation outline
  2016-02-08 19:03 HVMlite ABI specification DRAFT B + implementation outline Roger Pau Monné
  2016-02-08 21:26 ` Boris Ostrovsky
@ 2016-02-09 10:56 ` Andrew Cooper
  2016-02-09 11:58   ` Roger Pau Monné
  2016-02-09 13:24 ` Jan Beulich
  2016-02-09 15:14 ` Boris Ostrovsky
  3 siblings, 1 reply; 24+ messages in thread
From: Andrew Cooper @ 2016-02-09 10:56 UTC (permalink / raw)
  To: Roger Pau Monné, xen-devel
  Cc: Wei Liu, Stefano Stabellini, Tim Deegan, Paul Durrant,
	David Vrabel, Jan Beulich, Samuel Thibault, Boris Ostrovsky

On 08/02/16 19:03, Roger Pau Monné wrote:
> The format of the boot start info structure is the following (pointed to
> be %ebx):
>
> NOTE: nothing will be loaded at physical address 0, so a 0 value in any of the
> address fields should be treated as not present.
>
>  0 +----------------+
>    | magic          | Contains the magic value 0x336ec578
>    |                | ("xEn3" with the 0x80 bit of the "E" set).
>  4 +----------------+
>    | flags          | SIF_xxx flags.
>  8 +----------------+
>    | cmdline_paddr  | Physical address of the command line,
>    |                | a zero-terminated ASCII string.
> 12 +----------------+
>    | nr_modules     | Number of modules passed to the kernel.
> 16 +----------------+
>    | modlist_paddr  | Physical address of an array of modules
>    |                | (layout of the structure below).
> 20 +----------------+
>
> The layout of each entry in the module structure is the following:
>
>  0 +----------------+
>    | paddr          | Physical address of the module.
>  4 +----------------+
>    | size           | Size of the module in bytes.
>  8 +----------------+
>    | cmdline_paddr  | Physical address of the command line,
>    |                | a zero-terminated ASCII string.
> 12 +----------------+
>    | reserved       |
> 16 +----------------+
>
> Other relevant information needed in order to boot a guest kernel
> (console page address, xenstore event channel...) can be obtained
> using HVMPARAMS, just like it's done on HVM guests.
>
> The setup of the hypercall page is also performed in the same way
> as HVM guests, using the hypervisor cpuid leaves and msr ranges.
>
> Hardware description
> --------------------
>
> Hardware description can come from two different sources, just like on (PV)HVM
> guests.
>
> Description of PV devices will always come from xenbus, and in fact
> xenbus is the only hardware description that is guaranteed to always be
> provided to HVMlite guests.
>
> Description of physical hardware devices will always come from ACPI, in the
> absence of any physical hardware device no ACPI tables will be provided. The
> presence of ACPI tables can be detected by finding the RSDP, just like on
> bare metal.

As we are extending the base structure, why not have an RSDP paddr in it
as well?  This avoids the need to scan RAM, and also serves as an
indication of "No ACPI".

>
> Non-PV devices exposed to the guest
> -----------------------------------
>
> The initial idea was to simply don't provide any emulated devices to a HVMlite
> guest as the default option. We have however identified certain situations
> where emulated devices could be interesting, both from a performance and
> ease of implementation point of view. The following list tries to encompass
> the different identified scenarios:
>
>  * 1. HVMlite with no emulated devices at all
>    ------------------------------------------
>    This is the current implementation inside of Xen, everything is disabled
>    by default and the guest has access to the PV devices only. This is of
>    course the most secure design because it has the smaller surface of attack.
>
>  * 2. HVMlite with (or capable to) PCI-passthrough
>    -----------------------------------------------
>    The current model of PCI-passthrought in PV guests is complex and requires
>    heavy modifications to the guest OS. Going forward we would like to remove
>    this limitation, by providing an interface that's the same as found on bare
>    metal. In order to do this, at least an emulated local APIC should be
>    provided to guests, together with the access to a PCI-Root complex.
>    As said in the 'Hardware description' section above, this will also require
>    ACPI. So this proposed scenario will require the following elements that are
>    not present in the minimal (or default) HVMlite implementation: ACPI, local
>    APIC, IO APIC (optional) and PCI-Root complex.
>
>  * 3. HVMlite hardware domain
>    --------------------------
>    The aim is that a HVMlite hardware domain is going to work exactly like a
>    HVMlite domain with passed-through devices. This means that the domain will
>    need access to the same set of emulated devices, and that some ACPI tables
>    must be fixed in order to reflect the reality of the container the hardware
>    domain is running on. The ACPI section contains more detailed information
>    about which/how these tables are going to be fixed.
>
>    Note that in this scenario the hardware domain will *always* have a local
>    APIC and IO APIC, and that the usage of PHYSDEV operations and PIRQ event
>    channels is going to be removed in favour of the bare metal mechanisms.
>
> The default model for HVMlite guests is going to be to provide a local APIC
> together with a minimal set of ACPI tables that accurately match the reality of
> the container is guest is running on.

This statement is contrary to option 1 above, which states that all
emulation is disabled.

FWIW, I think there needs to be a 4th option, inbetween current 1 and 2,
which is HVMLite + LAPIC.  This is then the default HVMLite ABI, and is
not passthrough-capable.

>  An administrator should be able to change
> the default setting using the following tunables that are part of the xl
> toolstack:
>
>  * lapic: default to true. Indicates whether a local APIC is provided.
>  * ioapic: default to false. Indicates whether an IO APIC is provided
>    (requires lapic set to true).
>  * acpi: default to true. Indicates whether ACPI tables are provided.
>
> <snip>
>
> MMIO mapping
> ------------
>
> For DomUs without any device passed-through no direct MMIO mappings will be
> present in the physical memory map presented to the guest. For DomUs with
> devices passed-though the toolstack will create direct MMIO mappings as
> part of the domain build process, and thus no action will be required
> from the DomU.
>
> For the hardware domain initial direct MMIO mappings will be set for the
> following regions:
>
> NOTE: ranges are defined using memory addresses, not pages.

I would preface this with "where applicable".  Non-legacy boots are
unlikely to have anything interesting in the first 1MB.

>
>  * [0x0, 0xFFFFF]: the low 1MiB will be mapped into the physical guest
>    memory map at the same position.
>
>  * [0xF00000, 0xFFFFFF]: the ISA memory hole will be mapped 1:1 into the
>    guest physical memory.
>
>  * ACPI memory areas: regions with type E820_ACPI or E820_NVS will be mapped
>    1:1 to the guest physical memory map. There are going to be exceptions if
>    Xen has to modify the tables before presenting them to the guest.
>
>  * PCI Express MMCFG: if Xen is able to identify any of these regions at boot
>    time they will also be made available to the guest at the same position
>    in it's physical memory map. It is possible that Xen will trap accesses to
>    those regions, but a guest should be able to use the native configuration
>    mechanism in order to interact with this configuration space. If the
>    hardware domain reports the presence of any of those regions using the
>    PHYSDEVOP_pci_mmcfg_reserved hypercall Xen will also all guest access to
>    them.
>
>  * PCI BARs: it's not possible for Xen to know the position of the BARs of
>    the PCI devices without hardware domain interaction.

Xen requires no dom0 interaction to find all information like this for
devices in segment 0 (i.e. all current hardware).  Segments other than 0
may have their MMCONF regions expressed in AML only.

The reason this is all awkward in Xen is that PCI devices were hacked in
as second-class citizens when IOMMU support was added.  This is purely
a Xen software issue which needs undoing.

>  In order to have
>    the BARs of PCI devices properly mapped the hardware domain needs to
>    call the PHYSDEVOP_pci_device_add hypercall, that will take care of setting
>    up the BARs in the guest physical memory map using 1:1 MMIO mappings. This
>    procedure will be transparent from guest point of view, and upon returning
>    from the hypercall mappings must be already established.
>
>

~Andrew


* Re: HVMlite ABI specification DRAFT B + implementation outline
  2016-02-09 10:56 ` Andrew Cooper
@ 2016-02-09 11:58   ` Roger Pau Monné
  2016-02-09 12:10     ` Jan Beulich
  2016-02-09 14:36     ` Boris Ostrovsky
  0 siblings, 2 replies; 24+ messages in thread
From: Roger Pau Monné @ 2016-02-09 11:58 UTC (permalink / raw)
  To: Andrew Cooper, xen-devel
  Cc: Wei Liu, Stefano Stabellini, Tim Deegan, Paul Durrant,
	David Vrabel, Jan Beulich, Samuel Thibault, Boris Ostrovsky

El 9/2/16 a les 11:56, Andrew Cooper ha escrit:
> On 08/02/16 19:03, Roger Pau Monné wrote:
>> The format of the boot start info structure is the following (pointed to
>> be %ebx):
>>
>> NOTE: nothing will be loaded at physical address 0, so a 0 value in any of the
>> address fields should be treated as not present.
>>
>>  0 +----------------+
>>    | magic          | Contains the magic value 0x336ec578
>>    |                | ("xEn3" with the 0x80 bit of the "E" set).
>>  4 +----------------+
>>    | flags          | SIF_xxx flags.
>>  8 +----------------+
>>    | cmdline_paddr  | Physical address of the command line,
>>    |                | a zero-terminated ASCII string.
>> 12 +----------------+
>>    | nr_modules     | Number of modules passed to the kernel.
>> 16 +----------------+
>>    | modlist_paddr  | Physical address of an array of modules
>>    |                | (layout of the structure below).
>> 20 +----------------+
>>
>> The layout of each entry in the module structure is the following:
>>
>>  0 +----------------+
>>    | paddr          | Physical address of the module.
>>  4 +----------------+
>>    | size           | Size of the module in bytes.
>>  8 +----------------+
>>    | cmdline_paddr  | Physical address of the command line,
>>    |                | a zero-terminated ASCII string.
>> 12 +----------------+
>>    | reserved       |
>> 16 +----------------+
>>
>> Other relevant information needed in order to boot a guest kernel
>> (console page address, xenstore event channel...) can be obtained
>> using HVMPARAMS, just like it's done on HVM guests.
>>
>> The setup of the hypercall page is also performed in the same way
>> as HVM guests, using the hypervisor cpuid leaves and msr ranges.
>>
>> Hardware description
>> --------------------
>>
>> Hardware description can come from two different sources, just like on (PV)HVM
>> guests.
>>
>> Description of PV devices will always come from xenbus, and in fact
>> xenbus is the only hardware description that is guaranteed to always be
>> provided to HVMlite guests.
>>
>> Description of physical hardware devices will always come from ACPI, in the
>> absence of any physical hardware device no ACPI tables will be provided. The
>> presence of ACPI tables can be detected by finding the RSDP, just like on
>> bare metal.
> 
> As we are extending the base structure, why not have an RSDP paddr in it
> as well?  This avoids the need to scan RAM, and also serves as an
> indication of "No ACPI".

Right, this seems fine to me. I can send a patch later to expand the
structure unless anyone else complains.

> 
>>
>> Non-PV devices exposed to the guest
>> -----------------------------------
>>
>> The initial idea was to simply don't provide any emulated devices to a HVMlite
>> guest as the default option. We have however identified certain situations
>> where emulated devices could be interesting, both from a performance and
>> ease of implementation point of view. The following list tries to encompass
>> the different identified scenarios:
>>
>>  * 1. HVMlite with no emulated devices at all
>>    ------------------------------------------
>>    This is the current implementation inside of Xen, everything is disabled
>>    by default and the guest has access to the PV devices only. This is of
>>    course the most secure design because it has the smaller surface of attack.
>>
>>  * 2. HVMlite with (or capable to) PCI-passthrough
>>    -----------------------------------------------
>>    The current model of PCI-passthrought in PV guests is complex and requires
>>    heavy modifications to the guest OS. Going forward we would like to remove
>>    this limitation, by providing an interface that's the same as found on bare
>>    metal. In order to do this, at least an emulated local APIC should be
>>    provided to guests, together with the access to a PCI-Root complex.
>>    As said in the 'Hardware description' section above, this will also require
>>    ACPI. So this proposed scenario will require the following elements that are
>>    not present in the minimal (or default) HVMlite implementation: ACPI, local
>>    APIC, IO APIC (optional) and PCI-Root complex.
>>
>>  * 3. HVMlite hardware domain
>>    --------------------------
>>    The aim is that a HVMlite hardware domain is going to work exactly like a
>>    HVMlite domain with passed-through devices. This means that the domain will
>>    need access to the same set of emulated devices, and that some ACPI tables
>>    must be fixed in order to reflect the reality of the container the hardware
>>    domain is running on. The ACPI section contains more detailed information
>>    about which/how these tables are going to be fixed.
>>
>>    Note that in this scenario the hardware domain will *always* have a local
>>    APIC and IO APIC, and that the usage of PHYSDEV operations and PIRQ event
>>    channels is going to be removed in favour of the bare metal mechanisms.
>>
>> The default model for HVMlite guests is going to be to provide a local APIC
>> together with a minimal set of ACPI tables that accurately match the reality of
>> the container is guest is running on.
> 
> This statement is contrary to option 1 above, which states that all
> emulation is disabled.
> 
> FWIW, I think there needs to be a 4th option, inbetween current 1 and 2,
> which is HVMLite + LAPIC.  This is then the default HVMLite ABI, and is
> not passthrough-capable.

Right, I think this makes sense because (2) is not exactly the same, as
it requires the presence of a PCI root complex.

>>  An administrator should be able to change
>> the default setting using the following tunables that are part of the xl
>> toolstack:
>>
>>  * lapic: default to true. Indicates whether a local APIC is provided.
>>  * ioapic: default to false. Indicates whether an IO APIC is provided
>>    (requires lapic set to true).
>>  * acpi: default to true. Indicates whether ACPI tables are provided.
>>
>> <snip>
>>
>> MMIO mapping
>> ------------
>>
>> For DomUs without any device passed-through no direct MMIO mappings will be
>> present in the physical memory map presented to the guest. For DomUs with
>> devices passed-though the toolstack will create direct MMIO mappings as
>> part of the domain build process, and thus no action will be required
>> from the DomU.
>>
>> For the hardware domain initial direct MMIO mappings will be set for the
>> following regions:
>>
>> NOTE: ranges are defined using memory addresses, not pages.
> 
> I would preface this with "where applicable".  Non-legacy boots are
> unlikely to have anything interesting in the first 1MB.

Yes, I've only taken legacy (BIOS) boot into account here. I'm not
familiar with UEFI, so I'm not really sure how different it is, or which
memory regions should be mapped into the guest physmap in that case. I
should have made this explicit by adding a title, like:

Legacy BIOS boot
----------------

> 
>>
>>  * [0x0, 0xFFFFF]: the low 1MiB will be mapped into the physical guest
>>    memory map at the same position.
>>
>>  * [0xF00000, 0xFFFFFF]: the ISA memory hole will be mapped 1:1 into the
>>    guest physical memory.
>>
>>  * ACPI memory areas: regions with type E820_ACPI or E820_NVS will be mapped
>>    1:1 to the guest physical memory map. There are going to be exceptions if
>>    Xen has to modify the tables before presenting them to the guest.
>>
>>  * PCI Express MMCFG: if Xen is able to identify any of these regions at boot
>>    time they will also be made available to the guest at the same position
>>    in it's physical memory map. It is possible that Xen will trap accesses to
>>    those regions, but a guest should be able to use the native configuration
>>    mechanism in order to interact with this configuration space. If the
>>    hardware domain reports the presence of any of those regions using the
>>    PHYSDEVOP_pci_mmcfg_reserved hypercall Xen will also all guest access to
>>    them.
>>
>>  * PCI BARs: it's not possible for Xen to know the position of the BARs of
>>    the PCI devices without hardware domain interaction.
> 
> Xen requires no dom0 interaction to find all information like this for
> devices in segment 0 (i.e. all current hardware).  Segments other than 0
> may have their MMCONF regions expressed in AML only.

Thanks for the comments, please bear with me. I think we are mixing two
things here, one is the MMCFG areas, and the other one are the BARs of
each PCI device.

AFAIK MMCFG areas are described in the 'MCFG' ACPI table, which is
static and Xen should be able to parse on its own. Then I'm not sure
why PHYSDEVOP_pci_mmcfg_reserved is needed at all.

Then for BARs you need to know the specific PCI devices, which are
enumerated in the DSDT or similar ACPI tables, which are not static, and
thus cannot be parsed by Xen. We could do a brute force scan of the
whole PCI bus using the config registers, but that seems hacky. And as
Boris said we need to keep the usage of PHYSDEVOP_pci_device_add in
order to notify Xen of the PXM information.
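
(For reference, the "brute force scan" above is nothing more than walking
config space through the legacy 0xCF8/0xCFC ports, which only reaches segment
0; a sketch, assuming inl/outl style port I/O helpers exist in the OS:)

    #include <stdint.h>

    extern void     outl(uint16_t port, uint32_t val);
    extern uint32_t inl(uint16_t port);

    static uint32_t pci_cf8_read(unsigned int bus, unsigned int dev,
                                 unsigned int fn, unsigned int reg)
    {
        outl(0xCF8, 0x80000000u | (bus << 16) | (dev << 11) | (fn << 8) |
                    (reg & 0xFC));
        return inl(0xCFC);
    }

    static void pci_brute_force_scan(void)
    {
        unsigned int bus, dev, fn;

        for (bus = 0; bus < 256; bus++)
            for (dev = 0; dev < 32; dev++)
                for (fn = 0; fn < 8; fn++) {
                    uint32_t id = pci_cf8_read(bus, dev, fn, 0);

                    if ((id & 0xFFFF) == 0xFFFF)
                        continue;         /* no function at this BDF */
                    /* found a device: record it / report it to Xen */
                }
    }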

If we indeed have all the information about the BARs (position and size)
we could pre-map them 1:1 before creating the hardware domain, and thus
no modifications will be needed to the PHYSDEVOP_pci_device_add hypercall.

Roger.



* Re: HVMlite ABI specification DRAFT B + implementation outline
  2016-02-09 11:58   ` Roger Pau Monné
@ 2016-02-09 12:10     ` Jan Beulich
  2016-02-09 13:00       ` Roger Pau Monné
  2016-02-09 14:36     ` Boris Ostrovsky
  1 sibling, 1 reply; 24+ messages in thread
From: Jan Beulich @ 2016-02-09 12:10 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Wei Liu, Stefano Stabellini, Andrew Cooper, Tim Deegan,
	Paul Durrant, David Vrabel, xen-devel, Samuel Thibault,
	Boris Ostrovsky

>>> On 09.02.16 at 12:58, <roger.pau@citrix.com> wrote:
> El 9/2/16 a les 11:56, Andrew Cooper ha escrit:
>> On 08/02/16 19:03, Roger Pau Monné wrote:
>>>  * PCI Express MMCFG: if Xen is able to identify any of these regions at boot
>>>    time they will also be made available to the guest at the same position
>>>    in it's physical memory map. It is possible that Xen will trap accesses to
>>>    those regions, but a guest should be able to use the native configuration
>>>    mechanism in order to interact with this configuration space. If the
>>>    hardware domain reports the presence of any of those regions using the
>>>    PHYSDEVOP_pci_mmcfg_reserved hypercall Xen will also all guest access to
>>>    them.
>>>
>>>  * PCI BARs: it's not possible for Xen to know the position of the BARs of
>>>    the PCI devices without hardware domain interaction.
>> 
>> Xen requires no dom0 interaction to find all information like this for
>> devices in segment 0 (i.e. all current hardware).  Segments other than 0
>> may have their MMCONF regions expressed in AML only.
> 
> Thanks for the comments, please bear with me. I think we are mixing two
> things here, one is the MMCFG areas, and the other one are the BARs of
> each PCI device.
> 
> AFAIK MMCFG areas are described in the 'MCFG' ACPI table, which is
> static and Xen should be able to parse on it's own. Then I'm not sure
> why PHYSDEVOP_pci_mmcfg_reserved is needed at all.

Because there are safety nets: Xen and Linux consult the memory
map to determine whether actually using what the firmware says is
an MMCFG area is safe. Linux in addition checks ACPI
tables, and this is what Xen can't do (as it requires an AML
interpreter), hence the hypercall to inform the hypervisor.

> Then for BARs you need to know the specific PCI devices, which are
> enumerated in the DSDT or similar ACPI tables, which are not static, and
> thus cannot be parsed by Xen. We could do a brute force scan of the
> whole PCI bus using the config registers, but that seems hacky. And as
> Boris said we need to keep the usage of PHYSDEVOP_pci_device_add in
> order to notify Xen of the PXM information.

We can only be sure to have access to segment 0 at boot time.
Other segments may be accessible via MMCFG only, and whether
we can safely use the respective MMCFG we won't know - see
above - early enough. That's also the reason why today we do a
brute force scan only on segment 0.

Jan


* Re: HVMlite ABI specification DRAFT B + implementation outline
  2016-02-09 12:10     ` Jan Beulich
@ 2016-02-09 13:00       ` Roger Pau Monné
  2016-02-09 13:41         ` Jan Beulich
  0 siblings, 1 reply; 24+ messages in thread
From: Roger Pau Monné @ 2016-02-09 13:00 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Wei Liu, Stefano Stabellini, Andrew Cooper, Tim Deegan,
	Paul Durrant, David Vrabel, xen-devel, Samuel Thibault,
	Boris Ostrovsky

El 9/2/16 a les 13:10, Jan Beulich ha escrit:
>>>> On 09.02.16 at 12:58, <roger.pau@citrix.com> wrote:
>> El 9/2/16 a les 11:56, Andrew Cooper ha escrit:
>>> On 08/02/16 19:03, Roger Pau Monné wrote:
>>>>  * PCI Express MMCFG: if Xen is able to identify any of these regions at boot
>>>>    time they will also be made available to the guest at the same position
>>>>    in it's physical memory map. It is possible that Xen will trap accesses to
>>>>    those regions, but a guest should be able to use the native configuration
>>>>    mechanism in order to interact with this configuration space. If the
>>>>    hardware domain reports the presence of any of those regions using the
>>>>    PHYSDEVOP_pci_mmcfg_reserved hypercall Xen will also all guest access to
>>>>    them.
>>>>
>>>>  * PCI BARs: it's not possible for Xen to know the position of the BARs of
>>>>    the PCI devices without hardware domain interaction.
>>>
>>> Xen requires no dom0 interaction to find all information like this for
>>> devices in segment 0 (i.e. all current hardware).  Segments other than 0
>>> may have their MMCONF regions expressed in AML only.
>>
>> Thanks for the comments, please bear with me. I think we are mixing two
>> things here, one is the MMCFG areas, and the other one are the BARs of
>> each PCI device.
>>
>> AFAIK MMCFG areas are described in the 'MCFG' ACPI table, which is
>> static and Xen should be able to parse on it's own. Then I'm not sure
>> why PHYSDEVOP_pci_mmcfg_reserved is needed at all.
> 
> Because there are safety nets: Xen and Linux consult the memory
> map to determine whether actually using what the firmware says is
> an MMCFG area is safe. Linux in addition checks ACPI
> tables, and this is what Xen can't do (as it requires an AML
> interpreter), hence the hypercall to inform the hypervisor.

Hm, I guess I'm overlooking something, but I think Xen checks the ACPI
tables, see xen/arch/x86/x86_64/mmconfig-shared.c:400:

    if (pci_mmcfg_check_hostbridge()) {
        unsigned int i;

        pci_mmcfg_arch_init();
        for (i = 0; i < pci_mmcfg_config_num; ++i)
            if (pci_mmcfg_arch_enable(i))
                valid = 0;
    } else {
        acpi_table_parse(ACPI_SIG_MCFG, acpi_parse_mcfg);
        pci_mmcfg_arch_init();
        valid = pci_mmcfg_reject_broken();
    }

Which AFAICT suggests that Xen is indeed able to parse the 'MCFG' table,
which contains the list of MMCFG regions on the system. Is there any
other ACPI table where this information is reported that I'm missing?

This would suggest then that the hypercall is no longer needed.

>> Then for BARs you need to know the specific PCI devices, which are
>> enumerated in the DSDT or similar ACPI tables, which are not static, and
>> thus cannot be parsed by Xen. We could do a brute force scan of the
>> whole PCI bus using the config registers, but that seems hacky. And as
>> Boris said we need to keep the usage of PHYSDEVOP_pci_device_add in
>> order to notify Xen of the PXM information.
> 
> We can only be sure to have access to segment 0 at boot time.
> Other segments may be accessible via MMCFG only, and whether
> we can safely use the respective MMCFG we won't know - see
> above - early enough. That's also the reason why today we do a
> brute force scan only on segment 0.

Andrew suggested that all current hardware only uses segment 0, so it
seems like covering segment 0 is all that's needed for now. Are OSes
already capable of dealing with segments different than 0 that only
support MMCFG accesses?

And in which case, if a segment != 0 appears, and it only supports MMCFG
accesses, the MCFG table should contain an entry for this area, which
should allow us to properly scan it.
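
(For reference, each MCFG entry already carries everything needed for such a
scan; the layout below mirrors the ACPI "MMCFG base address allocation"
structure, with my own field names:)

    #include <stdint.h>

    struct mcfg_allocation {
        uint64_t address;         /* base of the MMCFG window              */
        uint16_t pci_segment;     /* PCI segment group number              */
        uint8_t  start_bus;       /* first bus decoded by this window      */
        uint8_t  end_bus;         /* last bus decoded by this window       */
        uint32_t reserved;
    };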

Roger.



* Re: HVMlite ABI specification DRAFT B + implementation outline
  2016-02-08 19:03 HVMlite ABI specification DRAFT B + implementation outline Roger Pau Monné
  2016-02-08 21:26 ` Boris Ostrovsky
  2016-02-09 10:56 ` Andrew Cooper
@ 2016-02-09 13:24 ` Jan Beulich
  2016-02-09 15:06   ` Stefano Stabellini
  2016-02-10 12:01   ` Roger Pau Monné
  2016-02-09 15:14 ` Boris Ostrovsky
  3 siblings, 2 replies; 24+ messages in thread
From: Jan Beulich @ 2016-02-09 13:24 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Wei Liu, Stefano Stabellini, Andrew Cooper, Tim Deegan,
	Paul Durrant, David Vrabel, xen-devel, Samuel Thibault,
	Boris Ostrovsky

>>> On 08.02.16 at 20:03, <roger.pau@citrix.com> wrote:
> Boot ABI
> --------
> 
> Since the Xen entry point into the kernel can be different from the
> native entry point, a `ELFNOTE` is used in order to tell the domain
> builder how to load and jump into the kernel entry point:
> 
>     ELFNOTE(Xen, XEN_ELFNOTE_PHYS32_ENTRY,          .long,  xen_start32)
> 
> The presence of the `XEN_ELFNOTE_PHYS32_ENTRY` note indicates that the
> kernel supports the boot ABI described in this document.
> 
> The domain builder shall load the kernel into the guest memory space and
> jump into the entry point defined at `XEN_ELFNOTE_PHYS32_ENTRY` with the
> following machine state:
> 
>  * `ebx`: contains the physical memory address where the loader has placed
>    the boot start info structure.
> 
>  * `cr0`: bit 0 (PE) must be set. All the other writeable bits are cleared.
> 
>  * `cr4`: all bits are cleared.
> 
>  * `cs`: must be a 32-bit read/execute code segment with a base of ‘0’
>    and a limit of ‘0xFFFFFFFF’. The selector value is unspecified.
> 
>  * `ds`, `es`: must be a 32-bit read/write data segment with a base of
>    ‘0’ and a limit of ‘0xFFFFFFFF’. The selector values are all unspecified.
> 
>  * `tr`: must be a 32-bit TSS (active) with a base of '0' and a limit of '0x67'.
> 
>  * `eflags`: all user settable bits are clear.

The word "user" here can be mistaken. Perhaps better "all modifiable
bits"?

> All other processor registers and flag bits are unspecified. The OS is in
> charge of setting up it's own stack, GDT and IDT.

The "flag bits" part should now probably be dropped?

> The format of the boot start info structure is the following (pointed to
> be %ebx):

"... by %ebx"

> NOTE: nothing will be loaded at physical address 0, so a 0 value in any of 
> the address fields should be treated as not present.
> 
>  0 +----------------+
>    | magic          | Contains the magic value 0x336ec578
>    |                | ("xEn3" with the 0x80 bit of the "E" set).
>  4 +----------------+
>    | flags          | SIF_xxx flags.
>  8 +----------------+
>    | cmdline_paddr  | Physical address of the command line,
>    |                | a zero-terminated ASCII string.
> 12 +----------------+
>    | nr_modules     | Number of modules passed to the kernel.
> 16 +----------------+
>    | modlist_paddr  | Physical address of an array of modules
>    |                | (layout of the structure below).
> 20 +----------------+

There having been talk about extending the structure, I think we
need some indicator that the consumer can use to know which
fields are present. I.e. either a version field, another flags one,
or a size one.

> The layout of each entry in the module structure is the following:
> 
>  0 +----------------+
>    | paddr          | Physical address of the module.
>  4 +----------------+
>    | size           | Size of the module in bytes.
>  8 +----------------+
>    | cmdline_paddr  | Physical address of the command line,
>    |                | a zero-terminated ASCII string.
> 12 +----------------+
>    | reserved       |
> 16 +----------------+

I've been thinking about this on draft A already: Do we really want
to paint ourselves into the corner of not supporting >4Gb modules,
by limiting their addresses and sizes to 32 bits?

> Hardware description
> --------------------
> 
> Hardware description can come from two different sources, just like on 
> (PV)HVM
> guests.
> 
> Description of PV devices will always come from xenbus, and in fact
> xenbus is the only hardware description that is guaranteed to always be
> provided to HVMlite guests.
> 
> Description of physical hardware devices will always come from ACPI, in the
> absence of any physical hardware device no ACPI tables will be provided.

This seems too strict: How about "in the absence of any physical
hardware device ACPI tables may not be provided"?

> Non-PV devices exposed to the guest
> -----------------------------------
> 
> The initial idea was to simply don't provide any emulated devices to a 
> HVMlite
> guest as the default option. We have however identified certain situations
> where emulated devices could be interesting, both from a performance and
> ease of implementation point of view. The following list tries to encompass
> the different identified scenarios:
> 
>  * 1. HVMlite with no emulated devices at all
>    ------------------------------------------
>    This is the current implementation inside of Xen, everything is disabled
>    by default and the guest has access to the PV devices only. This is of
>    course the most secure design because it has the smaller surface of attack.

smallest?

>  * 2. HVMlite with (or capable to) PCI-passthrough
>    -----------------------------------------------
>    The current model of PCI-passthrought in PV guests is complex and requires
>    heavy modifications to the guest OS. Going forward we would like to remove
>    this limitation, by providing an interface that's the same as found on bare
>    metal. In order to do this, at least an emulated local APIC should be
>    provided to guests, together with the access to a PCI-Root complex.
>    As said in the 'Hardware description' section above, this will also require
>    ACPI. So this proposed scenario will require the following elements that are
>    not present in the minimal (or default) HVMlite implementation: ACPI, local
>    APIC, IO APIC (optional) and PCI-Root complex.

Are you reasonably convinced that the absence of an IO-APIC
won't, with LAPICs present, cause more confusion than aid to the
OSes wanting to adopt PVHv2?

>  * 3. HVMlite hardware domain
>    --------------------------
>    The aim is that a HVMlite hardware domain is going to work exactly like a
>    HVMlite domain with passed-through devices. This means that the domain will
>    need access to the same set of emulated devices, and that some ACPI tables
>    must be fixed in order to reflect the reality of the container the hardware
>    domain is running on. The ACPI section contains more detailed information
>    about which/how these tables are going to be fixed.
> 
>    Note that in this scenario the hardware domain will *always* have a local
>    APIC and IO APIC, and that the usage of PHYSDEV operations and PIRQ event
>    channels is going to be removed in favour of the bare metal mechanisms.

Do you really mean "*always*"? What about a system without IO-APIC?
Would you mean to emulate one there for no reason?

Also I think you should say "the usage of many PHYSDEV operations",
because - as we've already pointed out - some are unavoidable.

> ACPI
> ----
> 
> ACPI tables will be provided to the hardware domain or to unprivileged
> domains. In the case of unprivileged guests ACPI tables are going to be
> created by the toolstack and will only contain the set of devices available
> to the guest, which will at least be the following: local APIC and
> optionally an IO APIC and passed-through device(s). In order to provide this
> information from ACPI the following tables are needed as a minimum: RSDT,
> FADT, MADT and DSDT. If an administrator decides to not provide a local APIC,
> the MADT table is not going to be provided to the guest OS.
> 
> The ACPI_FADT_NO_CMOS_RTC flag in the FADT boot_flags field is going to be used
> to signal guests that there's no RTC device (the Xen PV wall clock should be
> used instead). It is likely that this flag is not going to be set for the
> hardware domain, since it should have access to the RTC present in the host
> (if there's one). The ACPI_FADT_NO_VGA is also very likely to be set in the
> same boot_flags FADT field for DomUs in order to signal that there's no VGA
> adapter present.
> 
> Finally the ACPI_FADT_HW_REDUCED is going to be set in the FADT flags field
> in order to signal that there are no legacy devices: i8259 PIC or i8254 PIT.
> There's no intention to enable these devices, so it is expected that the
> hardware-reduced FADT flag is always going to be set.

We'll need to be absolutely certain that use of this flag doesn't carry
any further implications.

> In the case of the hardware domain, Xen has traditionally passed-through the
> native ACPI tables to the guest. This is something that of course we still
> want to do, but in the case of HVMlite Xen will have to make sure that
> the data passed in the ACPI tables to the hardware domain contain the 
> accurate
> hardware description. This means that at least certain tables will have to
> be modified/mangled before being presented to the guest:
> 
>  * MADT: the number of local APIC entries need to be fixed to match the number
>          of vCPUs available to the guest. The address of the IO APIC(s) also
>          need to be fixed in order to match the emulated ones that we are going
>          to provide.
> 
>  * DSDT: certain devices reported in the DSDT may not be available to the guest,
>          but since the DSDT is a run-time generated table we cannot fix it. In
>          order to cope with this, a STAO table will be provided that should
>          be able to signal which devices are not available to the hardware
>          domain. This is in line with the Xen/ACPI implementation for ARM.

Will STAO be sufficient for everything that may need customization?
I'm particularly worried about processor related methods in DSDT or
SSDT, which - if we're really meaning to do as you say - would need
to be limited (or extended) to the number of vCPU-s Dom0 gets.
What's even less clear to me is how you mean to deal with P-, C-,
and (once supported) T-state management for CPUs which don't
have a vCPU equivalent in Dom0.

> NB: there are corner cases that I'm not sure how to solve properly. Currently
> the hardware domain has some 'hacks' regarding ACPI and Xen. At least I'm aware
> of the following:
> 
>  * 1. Reporting CPU PM info back to Xen: this comes from the DSDT table, and
>    since this table is only available to the hardware domain it has to report
>    the PM info back to Xen so that Xen can perform proper PM.
>  * 2. Doing proper shutdown (S5) requires the usage of a hypercall, which is
>    mixed with native ACPICA code in most OSes. This is awkward and requires
>    the usage of hooks into ACPICA which we have not yet managed to upstream.

Iirc shutdown doesn't require any custom patches anymore in Linux.

>  * 3. Reporting the PCI devices it finds to the hypervisor: this is not very
>    intrusive in general, so I'm not that pushed to remove it. It's generally
>    easy in any OS to add some kind of hook that's executed every time a PCI
>    device is discovered.
>  * 4. Report PCI memory-mapped configuration areas to Xen: my opinion regarding
>    this one is the same as (3), it's not really intrusive so I'm not very
>    pushed to remove it.

As said in another reply - for both of these, we just can't remove the
reporting to Xen.

> MMIO mapping
> ------------
> 
> For DomUs without any device passed-through no direct MMIO mappings will be
> present in the physical memory map presented to the guest. For DomUs with
> devices passed-though the toolstack will create direct MMIO mappings as
> part of the domain build process, and thus no action will be required
> from the DomU.
> 
> For the hardware domain initial direct MMIO mappings will be set for the
> following regions:
> 
> NOTE: ranges are defined using memory addresses, not pages.
> 
>  * [0x0, 0xFFFFF]: the low 1MiB will be mapped into the physical guest
>    memory map at the same position.
> 
>  * [0xF00000, 0xFFFFFF]: the ISA memory hole will be mapped 1:1 into the
>    guest physical memory.

When have you last seen a machine with a hole right below the
16Mb boundary?

>  * ACPI memory areas: regions with type E820_ACPI or E820_NVS will be mapped
>    1:1 to the guest physical memory map. There are going to be exceptions if
>    Xen has to modify the tables before presenting them to the guest.
> 
>  * PCI Express MMCFG: if Xen is able to identify any of these regions at boot
>    time they will also be made available to the guest at the same position
>    in it's physical memory map. It is possible that Xen will trap accesses to
>    those regions, but a guest should be able to use the native configuration
>    mechanism in order to interact with this configuration space. If the
>    hardware domain reports the presence of any of those regions using the
>    PHYSDEVOP_pci_mmcfg_reserved hypercall Xen will also all guest access to
>    them.

s/all guest/allow Dom0/ in this last sentence?

>  * PCI BARs: it's not possible for Xen to know the position of the BARs of
>    the PCI devices without hardware domain interaction. In order to have
>    the BARs of PCI devices properly mapped the hardware domain needs to
>    call the PHYSDEVOP_pci_device_add hypercall, that will take care of setting
>    up the BARs in the guest physical memory map using 1:1 MMIO mappings. This
>    procedure will be transparent from guest point of view, and upon returning
>    from the hypercall mappings must be already established.

I'm not sure this can work, as it imposes restrictions on the ordering
of operations internal to the Dom0 OS: Successfully having probed
for a PCI device (and hence reporting its presence to Xen) doesn't
imply its BARs have already got set up. Together with the possibility
of the OS re-assigning BARs I think we will actually need another
hypercall, or the same device-add hypercall may need to be issued
more than once per device (i.e. also every time any BAR assignment
got changed).

Jan


* Re: HVMlite ABI specification DRAFT B + implementation outline
  2016-02-09 13:00       ` Roger Pau Monné
@ 2016-02-09 13:41         ` Jan Beulich
  2016-02-09 16:32           ` Roger Pau Monné
  0 siblings, 1 reply; 24+ messages in thread
From: Jan Beulich @ 2016-02-09 13:41 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Wei Liu, Stefano Stabellini, Andrew Cooper, Tim Deegan,
	Paul Durrant, David Vrabel, xen-devel, Samuel Thibault,
	Boris Ostrovsky

>>> On 09.02.16 at 14:00, <roger.pau@citrix.com> wrote:
> El 9/2/16 a les 13:10, Jan Beulich ha escrit:
>>>>> On 09.02.16 at 12:58, <roger.pau@citrix.com> wrote:
>>> El 9/2/16 a les 11:56, Andrew Cooper ha escrit:
>>>> On 08/02/16 19:03, Roger Pau Monné wrote:
>>>>>  * PCI Express MMCFG: if Xen is able to identify any of these regions at 
> boot
>>>>>    time they will also be made available to the guest at the same position
>>>>>    in it's physical memory map. It is possible that Xen will trap accesses 
> to
>>>>>    those regions, but a guest should be able to use the native configuration
>>>>>    mechanism in order to interact with this configuration space. If the
>>>>>    hardware domain reports the presence of any of those regions using the
>>>>>    PHYSDEVOP_pci_mmcfg_reserved hypercall Xen will also all guest access to
>>>>>    them.
>>>>>
>>>>>  * PCI BARs: it's not possible for Xen to know the position of the BARs of
>>>>>    the PCI devices without hardware domain interaction.
>>>>
>>>> Xen requires no dom0 interaction to find all information like this for
>>>> devices in segment 0 (i.e. all current hardware).  Segments other than 0
>>>> may have their MMCONF regions expressed in AML only.
>>>
>>> Thanks for the comments, please bear with me. I think we are mixing two
>>> things here, one is the MMCFG areas, and the other one are the BARs of
>>> each PCI device.
>>>
>>> AFAIK MMCFG areas are described in the 'MCFG' ACPI table, which is
>>> static and Xen should be able to parse on it's own. Then I'm not sure
>>> why PHYSDEVOP_pci_mmcfg_reserved is needed at all.
>> 
>> Because there are safety nets: Xen and Linux consult the memory
>> map to determine whether actually using what the firmware say is
>> an MMCFG area is actually safe. Linux in addition checks ACPI
>> tables, and this is what Xen can't do (as it require an AML
>> interpreter), hence the hypercall to inform the hypervisor.
> 
> Hm, I guess I'm overlooking something, but I think Xen checks the ACPI
> tables, see xen/arch/x86/x86_64/mmconfig-shared.c:400:
> 
>     if (pci_mmcfg_check_hostbridge()) {
>         unsigned int i;
> 
>         pci_mmcfg_arch_init();
>         for (i = 0; i < pci_mmcfg_config_num; ++i)
>             if (pci_mmcfg_arch_enable(i))
>                 valid = 0;
>     } else {
>         acpi_table_parse(ACPI_SIG_MCFG, acpi_parse_mcfg);
>         pci_mmcfg_arch_init();
>         valid = pci_mmcfg_reject_broken();
>     }
> 
> Which AFAICT suggests that Xen is indeed able to parse the 'MCFG' table,
> which contains the list of MMCFG regions on the system. Is there any
> other ACPI table where this information is reported that I'm missing?

You didn't read my reply carefully enough: I didn't say Xen can't
parse these tables. What I said is that Xen isn't by itself in the
position to do sanity checks that have proven necessary. Hence ...

> This would suggest then that the hypercall is no longer needed.

... it's needed - see Linux's is_mmconf_reserved(), which checks
both E820 and data read from DSDT (and maybe SSDT). Also
please don't forget about the hotplug case, where _CBA methods
need evaluation in order to obtain address information.
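
(Purely to illustrate the kind of check meant here -- a rough sketch, not
the actual is_mmconf_reserved() code, and the e820 representation is made
up for the example:)

    /* Illustrative only: trust an MMCFG window only if it is fully covered
     * by a reserved entry in the firmware-provided memory map. */
    #include <stdint.h>

    struct e820entry { uint64_t addr, size; uint32_t type; };
    #define E820_RESERVED 2

    static int mmcfg_is_e820_reserved(const struct e820entry *map, unsigned int n,
                                      uint64_t base, uint64_t size)
    {
        for (unsigned int i = 0; i < n; i++)
            if (map[i].type == E820_RESERVED &&
                base >= map[i].addr &&
                base + size <= map[i].addr + map[i].size)
                return 1;   /* fully inside a reserved region */
        return 0;           /* not reserved: don't use it without further checks */
    }

(IIRC the real code additionally accepts regions reserved via ACPI
motherboard resources, which is exactly the part Xen can't evaluate on
its own.)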

>>> Then for BARs you need to know the specific PCI devices, which are
>>> enumerated in the DSDT or similar ACPI tables, which are not static, and
>>> thus cannot be parsed by Xen. We could do a brute force scan of the
>>> whole PCI bus using the config registers, but that seems hacky. And as
>>> Boris said we need to keep the usage of PHYSDEVOP_pci_device_add in
>>> order to notify Xen of the PXM information.
>> 
>> We can only be sure to have access to segment 0 at boot time.
>> Other segments may be accessible via MMCFG only, and whether
>> we can safely use the respective MMCFG we won't know - see
>> above - early enough. That's also the reason why today we do a
>> brute force scan only on segment 0.
> 
> Andrew suggested that all current hardware only uses segment 0, so it
> seems like covering segment 0 is all that's needed for now. Are OSes
> already capable of dealing with segments different than 0 that only
> support MMCFG accesses?

Of course - Linux for example. I've actually seen Linux (and Xen) run
on a system with multiple segments.

> And in which case, if a segment != 0 appears, and it only supports MMCFG
> accesses, the MCFG table should contain an entry for this area, which
> should allow us to properly scan it.

See above.

Jan


* Re: HVMlite ABI specification DRAFT B + implementation outline
  2016-02-09 11:58   ` Roger Pau Monné
  2016-02-09 12:10     ` Jan Beulich
@ 2016-02-09 14:36     ` Boris Ostrovsky
  2016-02-09 14:42       ` Andrew Cooper
  2016-02-09 14:48       ` Jan Beulich
  1 sibling, 2 replies; 24+ messages in thread
From: Boris Ostrovsky @ 2016-02-09 14:36 UTC (permalink / raw)
  To: Roger Pau Monné, Andrew Cooper, xen-devel
  Cc: Wei Liu, Stefano Stabellini, Tim Deegan, Paul Durrant,
	David Vrabel, Jan Beulich, Samuel Thibault

On 02/09/2016 06:58 AM, Roger Pau Monné wrote:
> El 9/2/16 a les 11:56, Andrew Cooper ha escrit:
>> On 08/02/16 19:03, Roger Pau Monné wrote:
>>>
>>> Description of physical hardware devices will always come from ACPI, in the
>>> absence of any physical hardware device no ACPI tables will be provided. The
>>> presence of ACPI tables can be detected by finding the RSDP, just like on
>>> bare metal.
>> As we are extending the base structure, why not have an RSDP paddr in it
>> as well?  This avoids the need to scan RAM, and also serves as an
>> indication of "No ACPI".
> Right, this seems fine to me. I can send a patch later to expand the
> structure unless anyone else complains.
>

Isn't scanning memory the standard procedure for finding RSDP? Even if 
we provide the address, guests will still search the memory (or is it 
just Linux that does that?)

-boris


* Re: HVMlite ABI specification DRAFT B + implementation outline
  2016-02-09 14:36     ` Boris Ostrovsky
@ 2016-02-09 14:42       ` Andrew Cooper
  2016-02-09 14:48       ` Jan Beulich
  1 sibling, 0 replies; 24+ messages in thread
From: Andrew Cooper @ 2016-02-09 14:42 UTC (permalink / raw)
  To: Boris Ostrovsky, Roger Pau Monné, xen-devel
  Cc: Wei Liu, Stefano Stabellini, Tim Deegan, Paul Durrant,
	David Vrabel, Jan Beulich, Samuel Thibault

On 09/02/16 14:36, Boris Ostrovsky wrote:
> On 02/09/2016 06:58 AM, Roger Pau Monné wrote:
>> El 9/2/16 a les 11:56, Andrew Cooper ha escrit:
>>> On 08/02/16 19:03, Roger Pau Monné wrote:
>>>>
>>>> Description of physical hardware devices will always come from
>>>> ACPI, in the
>>>> absence of any physical hardware device no ACPI tables will be
>>>> provided. The
>>>> presence of ACPI tables can be detected by finding the RSDP, just
>>>> like on
>>>> bare metal.
>>> As we are extending the base structure, why not have an RSDP paddr
>>> in it
>>> as well?  This avoids the need to scan RAM, and also serves as an
>>> indication of "No ACPI".
>> Right, this seems fine to me. I can send a patch later to expand the
>> structure unless anyone else complains.
>>
>
> Isn't scanning memory the standard procedure for finding RSDP?

For BIOS boot, yes.  For EFI, not necessarily.

> Even if we provide the address, guests will still search the memory
> (or is it just Linux that does that?)

A suitably enlightened guest need not scan memory if it has a direct
pointer to the ACPI tables.  A guest might still scan memory, but all it
is doing is wasting time.
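
(For anyone not familiar with the legacy scan being discussed, it is
roughly the following -- a sketch only, with checksum handling and the
EBDA portion of the search omitted:)

    /* Legacy BIOS-style RSDP search: look for the "RSD PTR " signature on
     * 16-byte boundaries in the BIOS read-only area.  A guest that gets
     * the RSDP address directly from the start info structure can skip
     * this. */
    #include <stdint.h>
    #include <string.h>
    #include <stddef.h>

    static void *scan_for_rsdp(uint8_t *start, uint8_t *end)
    {
        for (uint8_t *p = start; p + 8 <= end; p += 16)
            if (memcmp(p, "RSD PTR ", 8) == 0)
                return p;
        return NULL;
    }

    /* e.g. scan_for_rsdp((uint8_t *)0xE0000, (uint8_t *)0x100000) on BIOS
     * systems, assuming the low 1MiB is identity-mapped. */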

~Andrew


* Re: HVMlite ABI specification DRAFT B + implementation outline
  2016-02-09 14:36     ` Boris Ostrovsky
  2016-02-09 14:42       ` Andrew Cooper
@ 2016-02-09 14:48       ` Jan Beulich
  1 sibling, 0 replies; 24+ messages in thread
From: Jan Beulich @ 2016-02-09 14:48 UTC (permalink / raw)
  To: Boris Ostrovsky
  Cc: Wei Liu, Stefano Stabellini, Andrew Cooper, Tim Deegan,
	Paul Durrant, David Vrabel, xen-devel, Samuel Thibault,
	roger.pau

>>> On 09.02.16 at 15:36, <boris.ostrovsky@oracle.com> wrote:
> On 02/09/2016 06:58 AM, Roger Pau Monné wrote:
>> El 9/2/16 a les 11:56, Andrew Cooper ha escrit:
>>> On 08/02/16 19:03, Roger Pau Monné wrote:
>>>>
>>>> Description of physical hardware devices will always come from ACPI, in the
>>>> absence of any physical hardware device no ACPI tables will be provided. The
>>>> presence of ACPI tables can be detected by finding the RSDP, just like on
>>>> bare metal.
>>> As we are extending the base structure, why not have an RSDP paddr in it
>>> as well?  This avoids the need to scan RAM, and also serves as an
>>> indication of "No ACPI".
>> Right, this seems fine to me. I can send a patch later to expand the
>> structure unless anyone else complains.
>>
> 
> Isn't scanning memory the standard procedure for finding RSDP? Even if 
> we provide the address, guests will still search the memory (or is it 
> just Linux that does that?)

Not with e.g. EFI, so there's a precedent of avoiding that (legacy)
scan.

Jan


* Re: HVMlite ABI specification DRAFT B + implementation outline
  2016-02-09 13:24 ` Jan Beulich
@ 2016-02-09 15:06   ` Stefano Stabellini
  2016-02-09 16:15     ` Jan Beulich
  2016-02-10 12:01   ` Roger Pau Monné
  1 sibling, 1 reply; 24+ messages in thread
From: Stefano Stabellini @ 2016-02-09 15:06 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Wei Liu, Stefano Stabellini, Andrew Cooper, Tim Deegan,
	Paul Durrant, David Vrabel, xen-devel, Samuel Thibault,
	Boris Ostrovsky, Roger Pau Monné

On Tue, 9 Feb 2016, Jan Beulich wrote:
> > In the case of the hardware domain, Xen has traditionally passed-through the
> > native ACPI tables to the guest. This is something that of course we still
> > want to do, but in the case of HVMlite Xen will have to make sure that
> > the data passed in the ACPI tables to the hardware domain contain the 
> > accurate
> > hardware description. This means that at least certain tables will have to
> > be modified/mangled before being presented to the guest:
> > 
> >  * MADT: the number of local APIC entries need to be fixed to match the number
> >          of vCPUs available to the guest. The address of the IO APIC(s) also
> >          need to be fixed in order to match the emulated ones that we are going
> >          to provide.
> > 
> >  * DSDT: certain devices reported in the DSDT may not be available to the guest,
> >          but since the DSDT is a run-time generated table we cannot fix it. In
> >          order to cope with this, a STAO table will be provided that should
> >          be able to signal which devices are not available to the hardware
> >          domain. This is in line with the Xen/ACPI implementation for ARM.
> 
> Will STAO be sufficient for everything that may need customization?
> I'm particularly worried about processor related methods in DSDT or
> SSDT, which - if we're really meaning to do as you say - would need
> to be limited (or extended) to the number of vCPU-s Dom0 gets.
> What's even less clear to me is how you mean to deal with P-, C-,
> and (once supported) T-state management for CPUs which don't
> have a vCPU equivalent in Dom0.

It is possible to use the STAO to hide entire objects, including
processors, from the DSDT, which should be good enough to prevent dom0
from calling any of the processor related methods you are referring to.
Then we can let Xen do cpuidle and cpufreq as it is already doing.

Would that work? Or do we still need Dom0 to call any ACPI methods for
power management?


* Re: HVMlite ABI specification DRAFT B + implementation outline
  2016-02-08 19:03 HVMlite ABI specification DRAFT B + implementation outline Roger Pau Monné
                   ` (2 preceding siblings ...)
  2016-02-09 13:24 ` Jan Beulich
@ 2016-02-09 15:14 ` Boris Ostrovsky
  3 siblings, 0 replies; 24+ messages in thread
From: Boris Ostrovsky @ 2016-02-09 15:14 UTC (permalink / raw)
  To: Roger Pau Monné, xen-devel
  Cc: Wei Liu, Stefano Stabellini, Andrew Cooper, Tim Deegan,
	Paul Durrant, David Vrabel, Jan Beulich, Samuel Thibault

On 02/08/2016 02:03 PM, Roger Pau Monné wrote:
> The format of the boot start info structure is the following (pointed to
> be %ebx):
>
> NOTE: nothing will be loaded at physical address 0, so a 0 value in any of the
> address fields should be treated as not present.
>
>   0 +----------------+
>     | magic          | Contains the magic value 0x336ec578
>     |                | ("xEn3" with the 0x80 bit of the "E" set).
>   4 +----------------+
>     | flags          | SIF_xxx flags.
>   8 +----------------+
>     | cmdline_paddr  | Physical address of the command line,
>     |                | a zero-terminated ASCII string.
> 12 +----------------+
>     | nr_modules     | Number of modules passed to the kernel.
> 16 +----------------+
>     | modlist_paddr  | Physical address of an array of modules
>     |                | (layout of the structure below).
> 20 +----------------+

Do we want to add a version field here?

-boris


* Re: HVMlite ABI specification DRAFT B + implementation outline
  2016-02-09 15:06   ` Stefano Stabellini
@ 2016-02-09 16:15     ` Jan Beulich
  2016-02-09 16:17       ` David Vrabel
  2016-02-09 16:26       ` Stefano Stabellini
  0 siblings, 2 replies; 24+ messages in thread
From: Jan Beulich @ 2016-02-09 16:15 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: Wei Liu, Andrew Cooper, Tim Deegan, PaulDurrant, David Vrabel,
	xen-devel, SamuelThibault, Boris Ostrovsky, roger.pau

>>> On 09.02.16 at 16:06, <stefano.stabellini@eu.citrix.com> wrote:
> On Tue, 9 Feb 2016, Jan Beulich wrote:
>> Will STAO be sufficient for everything that may need customization?
>> I'm particularly worried about processor related methods in DSDT or
>> SSDT, which - if we're really meaning to do as you say - would need
>> to be limited (or extended) to the number of vCPU-s Dom0 gets.
>> What's even less clear to me is how you mean to deal with P-, C-,
>> and (once supported) T-state management for CPUs which don't
>> have a vCPU equivalent in Dom0.
> 
> It is possible to use the STAO to hide entire objects, including
> processors, from the DSDT, which should be good enough to prevent dom0
> from calling any of the processor related methods you are referreing to.
> Then we can let Xen do cpuidle and cpufreq as it is already doing.
> 
> Would that work? Or do we still need Dom0 to call any ACPI methods for
> power management?

We want two things at once here, which afaict can't possibly work:
On one hand we want Dom0 to only see ACPI objects corresponding
to its own vCPU-s. Otoh we need Dom0 to see all objects, in order
to propagate respective information to Xen.

Jan


* Re: HVMlite ABI specification DRAFT B + implementation outline
  2016-02-09 16:15     ` Jan Beulich
@ 2016-02-09 16:17       ` David Vrabel
  2016-02-09 16:28         ` Jan Beulich
  2016-02-09 16:26       ` Stefano Stabellini
  1 sibling, 1 reply; 24+ messages in thread
From: David Vrabel @ 2016-02-09 16:17 UTC (permalink / raw)
  To: Jan Beulich, Stefano Stabellini
  Cc: Wei Liu, Andrew Cooper, Tim Deegan, PaulDurrant, xen-devel,
	SamuelThibault, Boris Ostrovsky, roger.pau

On 09/02/16 16:15, Jan Beulich wrote:
>>>> On 09.02.16 at 16:06, <stefano.stabellini@eu.citrix.com> wrote:
>> On Tue, 9 Feb 2016, Jan Beulich wrote:
>>> Will STAO be sufficient for everything that may need customization?
>>> I'm particularly worried about processor related methods in DSDT or
>>> SSDT, which - if we're really meaning to do as you say - would need
>>> to be limited (or extended) to the number of vCPU-s Dom0 gets.
>>> What's even less clear to me is how you mean to deal with P-, C-,
>>> and (once supported) T-state management for CPUs which don't
>>> have a vCPU equivalent in Dom0.
>>
>> It is possible to use the STAO to hide entire objects, including
>> processors, from the DSDT, which should be good enough to prevent dom0
>> from calling any of the processor related methods you are referreing to.
>> Then we can let Xen do cpuidle and cpufreq as it is already doing.
>>
>> Would that work? Or do we still need Dom0 to call any ACPI methods for
>> power management?
> 
> We want two things at once here, which afaict can't possibly work:
> On one hand we want Dom0 to only see ACPI objects corresponding
> to its own vCPU-s. Otoh we need Dom0 to see all objects, in order
> to propagate respective information to Xen.

Could dom0 query Xen for the machine ACPI tables via a hypercall?

David


* Re: HVMlite ABI specification DRAFT B + implementation outline
  2016-02-09 16:15     ` Jan Beulich
  2016-02-09 16:17       ` David Vrabel
@ 2016-02-09 16:26       ` Stefano Stabellini
  2016-02-09 16:33         ` Jan Beulich
  1 sibling, 1 reply; 24+ messages in thread
From: Stefano Stabellini @ 2016-02-09 16:26 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Wei Liu, Stefano Stabellini, Andrew Cooper, Tim Deegan,
	PaulDurrant, David Vrabel, SamuelThibault, xen-devel,
	Boris Ostrovsky, roger.pau

On Tue, 9 Feb 2016, Jan Beulich wrote:
> >>> On 09.02.16 at 16:06, <stefano.stabellini@eu.citrix.com> wrote:
> > On Tue, 9 Feb 2016, Jan Beulich wrote:
> >> Will STAO be sufficient for everything that may need customization?
> >> I'm particularly worried about processor related methods in DSDT or
> >> SSDT, which - if we're really meaning to do as you say - would need
> >> to be limited (or extended) to the number of vCPU-s Dom0 gets.
> >> What's even less clear to me is how you mean to deal with P-, C-,
> >> and (once supported) T-state management for CPUs which don't
> >> have a vCPU equivalent in Dom0.
> > 
> > It is possible to use the STAO to hide entire objects, including
> > processors, from the DSDT, which should be good enough to prevent dom0
> > from calling any of the processor related methods you are referreing to.
> > Then we can let Xen do cpuidle and cpufreq as it is already doing.
> > 
> > Would that work? Or do we still need Dom0 to call any ACPI methods for
> > power management?
> 
> We want two things at once here, which afaict can't possibly work:
> On one hand we want Dom0 to only see ACPI objects corresponding
> to its own vCPU-s. Otoh we need Dom0 to see all objects, in order
> to propagate respective information to Xen.

Having Dom0 see only objects corresponding to its own vCPU-s would of
course be nicer from an architectural point of view. What exactly do we
need to propagate from Dom0 to Xen? Can we get rid of those calls?


* Re: HVMlite ABI specification DRAFT B + implementation outline
  2016-02-09 16:17       ` David Vrabel
@ 2016-02-09 16:28         ` Jan Beulich
  0 siblings, 0 replies; 24+ messages in thread
From: Jan Beulich @ 2016-02-09 16:28 UTC (permalink / raw)
  To: David Vrabel
  Cc: Wei Liu, Stefano Stabellini, Andrew Cooper, Tim Deegan,
	PaulDurrant, xen-devel, SamuelThibault, Boris Ostrovsky,
	roger.pau

>>> On 09.02.16 at 17:17, <david.vrabel@citrix.com> wrote:
> On 09/02/16 16:15, Jan Beulich wrote:
>>>>> On 09.02.16 at 16:06, <stefano.stabellini@eu.citrix.com> wrote:
>>> On Tue, 9 Feb 2016, Jan Beulich wrote:
>>>> Will STAO be sufficient for everything that may need customization?
>>>> I'm particularly worried about processor related methods in DSDT or
>>>> SSDT, which - if we're really meaning to do as you say - would need
>>>> to be limited (or extended) to the number of vCPU-s Dom0 gets.
>>>> What's even less clear to me is how you mean to deal with P-, C-,
>>>> and (once supported) T-state management for CPUs which don't
>>>> have a vCPU equivalent in Dom0.
>>>
>>> It is possible to use the STAO to hide entire objects, including
>>> processors, from the DSDT, which should be good enough to prevent dom0
>>> from calling any of the processor related methods you are referreing to.
>>> Then we can let Xen do cpuidle and cpufreq as it is already doing.
>>>
>>> Would that work? Or do we still need Dom0 to call any ACPI methods for
>>> power management?
>> 
>> We want two things at once here, which afaict can't possibly work:
>> On one hand we want Dom0 to only see ACPI objects corresponding
>> to its own vCPU-s. Otoh we need Dom0 to see all objects, in order
>> to propagate respective information to Xen.
> 
> Could dom0 query Xen for the machine ACPI tables via a hypercall?

That would certainly be doable, but what would it do then with
these tables? Loading them just like other tables is not an option,
as that would result in a mess of name space collisions. And I don't
think abstracting ACPI CA to be able to deal with two independent
sets of tables would be very welcome on the Linux or ACPI side...

Jan


* Re: HVMlite ABI specification DRAFT B + implementation outline
  2016-02-09 13:41         ` Jan Beulich
@ 2016-02-09 16:32           ` Roger Pau Monné
  2016-02-09 16:41             ` Jan Beulich
  0 siblings, 1 reply; 24+ messages in thread
From: Roger Pau Monné @ 2016-02-09 16:32 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Wei Liu, Stefano Stabellini, Andrew Cooper, Tim Deegan,
	Paul Durrant, David Vrabel, xen-devel, Samuel Thibault,
	Boris Ostrovsky

El 9/2/16 a les 14:41, Jan Beulich ha escrit:
>>>> On 09.02.16 at 14:00, <roger.pau@citrix.com> wrote:
>> Hm, I guess I'm overlooking something, but I think Xen checks the ACPI
>> tables, see xen/arch/x86/x86_64/mmconfig-shared.c:400:
>>
>>     if (pci_mmcfg_check_hostbridge()) {
>>         unsigned int i;
>>
>>         pci_mmcfg_arch_init();
>>         for (i = 0; i < pci_mmcfg_config_num; ++i)
>>             if (pci_mmcfg_arch_enable(i))
>>                 valid = 0;
>>     } else {
>>         acpi_table_parse(ACPI_SIG_MCFG, acpi_parse_mcfg);
>>         pci_mmcfg_arch_init();
>>         valid = pci_mmcfg_reject_broken();
>>     }
>>
>> Which AFAICT suggests that Xen is indeed able to parse the 'MCFG' table,
>> which contains the list of MMCFG regions on the system. Is there any
>> other ACPI table where this information is reported that I'm missing?
> 
> You didn't read my reply carefully enough: I didn't say Xen can't
> parse these tables. What I said is that Xen isn't by itself in the
> position to do sanity checks that have proven necessary. Hence ...

Sorry, Ack, AFAICT FreeBSD is much more naive in this aspect and blindly
trusts what the ACPI MCFG table contains (or at least it seems to me
that way).
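
(For reference, each region in the MCFG table is described by a 16-byte
allocation entry following the standard ACPI header; a sketch from memory
of the spec, so treat the field names as illustrative:)

    /* One MCFG allocation entry: the ECAM base address for a given PCI
     * segment group and the bus range it decodes. */
    #include <stdint.h>

    struct mcfg_allocation {
        uint64_t base_address;   /* ECAM base for this segment/bus range */
        uint16_t pci_segment;    /* PCI segment group number */
        uint8_t  start_bus;      /* first bus number decoded at base_address */
        uint8_t  end_bus;        /* last bus number decoded at base_address */
        uint32_t reserved;
    } __attribute__((packed));   /* 16 bytes per entry */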

I'm not going to argue since you say that this has proven necessary, but
are these kinds of broken systems still around? PVH/HVMlite requires
recent hardware in order to run, so maybe things have improved since
this was implemented.

>> This would suggest then that the hypercall is no longer needed.
> 
> ... it's needed - see Linux'es is_mmconf_reserved() which checks
> both E820 and data read from DSDT (and maybe SSDT). Also
> please don't forget about the hotplug case, where _CBA methods
> need evaluation in order to obtain address information.

Right, _CBA methods report the ECAM space, and are the only way to get
this information in the hotplug case I guess.

>>>> Then for BARs you need to know the specific PCI devices, which are
>>>> enumerated in the DSDT or similar ACPI tables, which are not static, and
>>>> thus cannot be parsed by Xen. We could do a brute force scan of the
>>>> whole PCI bus using the config registers, but that seems hacky. And as
>>>> Boris said we need to keep the usage of PHYSDEVOP_pci_device_add in
>>>> order to notify Xen of the PXM information.
>>>
>>> We can only be sure to have access to segment 0 at boot time.
>>> Other segments may be accessible via MMCFG only, and whether
>>> we can safely use the respective MMCFG we won't know - see
>>> above - early enough. That's also the reason why today we do a
>>> brute force scan only on segment 0.
>>
>> Andrew suggested that all current hardware only uses segment 0, so it
>> seems like covering segment 0 is all that's needed for now. Are OSes
>> already capable of dealing with segments different than 0 that only
>> support MMCFG accesses?
> 
> Of course - Linux for example. I've actually seen Linux (and Xen) run
> on a system with multiple segments.
> 
>> And in which case, if a segment != 0 appears, and it only supports MMCFG
>> accesses, the MCFG table should contain an entry for this area, which
>> should allow us to properly scan it.

Right, I wasn't especially thrilled to remove these two PHYSDEV hypercalls
anyway; they are not very intrusive IMHO. We still need to figure out
how to do MMIO mapping of BARs however. We can continue the discussion
about MMIO mappings on your other reply to the ABI
<56B9F69002000078000D012B@prv-mh.provo.novell.com>.

Roger.


* Re: HVMlite ABI specification DRAFT B + implementation outline
  2016-02-09 16:26       ` Stefano Stabellini
@ 2016-02-09 16:33         ` Jan Beulich
  0 siblings, 0 replies; 24+ messages in thread
From: Jan Beulich @ 2016-02-09 16:33 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: Wei Liu, Andrew Cooper, Tim Deegan, PaulDurrant, David Vrabel,
	xen-devel, SamuelThibault, Boris Ostrovsky, roger.pau

>>> On 09.02.16 at 17:26, <stefano.stabellini@eu.citrix.com> wrote:
> On Tue, 9 Feb 2016, Jan Beulich wrote:
>> >>> On 09.02.16 at 16:06, <stefano.stabellini@eu.citrix.com> wrote:
>> > On Tue, 9 Feb 2016, Jan Beulich wrote:
>> >> Will STAO be sufficient for everything that may need customization?
>> >> I'm particularly worried about processor related methods in DSDT or
>> >> SSDT, which - if we're really meaning to do as you say - would need
>> >> to be limited (or extended) to the number of vCPU-s Dom0 gets.
>> >> What's even less clear to me is how you mean to deal with P-, C-,
>> >> and (once supported) T-state management for CPUs which don't
>> >> have a vCPU equivalent in Dom0.
>> > 
>> > It is possible to use the STAO to hide entire objects, including
>> > processors, from the DSDT, which should be good enough to prevent dom0
>> > from calling any of the processor related methods you are referreing to.
>> > Then we can let Xen do cpuidle and cpufreq as it is already doing.
>> > 
>> > Would that work? Or do we still need Dom0 to call any ACPI methods for
>> > power management?
>> 
>> We want two things at once here, which afaict can't possibly work:
>> On one hand we want Dom0 to only see ACPI objects corresponding
>> to its own vCPU-s. Otoh we need Dom0 to see all objects, in order
>> to propagate respective information to Xen.
> 
> Having Dom0 see only objects corresponding to its own vCPU-s would of
> course be nicer from an architectural point of view. What exactly do we
> need to propagate from Dom0 to Xen? Can we get rid of those calls?

Not really, no. Or else - as said above - there wouldn't be any
P- or C-state management anymore.

Jan


* Re: HVMlite ABI specification DRAFT B + implementation outline
  2016-02-09 16:32           ` Roger Pau Monné
@ 2016-02-09 16:41             ` Jan Beulich
  0 siblings, 0 replies; 24+ messages in thread
From: Jan Beulich @ 2016-02-09 16:41 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Wei Liu, Stefano Stabellini, Andrew Cooper, Tim Deegan,
	Paul Durrant, David Vrabel, xen-devel, Samuel Thibault,
	Boris Ostrovsky

>>> On 09.02.16 at 17:32, <roger.pau@citrix.com> wrote:
> El 9/2/16 a les 14:41, Jan Beulich ha escrit:
>>>>> On 09.02.16 at 14:00, <roger.pau@citrix.com> wrote:
>>> Hm, I guess I'm overlooking something, but I think Xen checks the ACPI
>>> tables, see xen/arch/x86/x86_64/mmconfig-shared.c:400:
>>>
>>>     if (pci_mmcfg_check_hostbridge()) {
>>>         unsigned int i;
>>>
>>>         pci_mmcfg_arch_init();
>>>         for (i = 0; i < pci_mmcfg_config_num; ++i)
>>>             if (pci_mmcfg_arch_enable(i))
>>>                 valid = 0;
>>>     } else {
>>>         acpi_table_parse(ACPI_SIG_MCFG, acpi_parse_mcfg);
>>>         pci_mmcfg_arch_init();
>>>         valid = pci_mmcfg_reject_broken();
>>>     }
>>>
>>> Which AFAICT suggests that Xen is indeed able to parse the 'MCFG' table,
>>> which contains the list of MMCFG regions on the system. Is there any
>>> other ACPI table where this information is reported that I'm missing?
>> 
>> You didn't read my reply carefully enough: I didn't say Xen can't
>> parse these tables. What I said is that Xen isn't by itself in the
>> position to do sanity checks that have proven necessary. Hence ...
> 
> Sorry, Ack, AFAICT FreeBSD is much more naive in this aspect and blindly
> trusts what the ACPI MCFG table contains (or at least it seems to me
> that way).
> 
> I'm not going to argue since you say that this has proven necessary, but
> are this kind of broken systems still around? PVH/HVMlite requires
> recent hardware in order to run, so maybe things have improved since
> this was implemented.

Let me not get started on the quality of memory maps various
vendors' UEFI implementations provide.

Jan


* Re: HVMlite ABI specification DRAFT B + implementation outline
  2016-02-09 13:24 ` Jan Beulich
  2016-02-09 15:06   ` Stefano Stabellini
@ 2016-02-10 12:01   ` Roger Pau Monné
  2016-02-10 12:53     ` Jan Beulich
  1 sibling, 1 reply; 24+ messages in thread
From: Roger Pau Monné @ 2016-02-10 12:01 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Wei Liu, Stefano Stabellini, Andrew Cooper, Tim Deegan,
	Paul Durrant, David Vrabel, xen-devel, Samuel Thibault,
	Boris Ostrovsky

El 9/2/16 a les 14:24, Jan Beulich ha escrit:
>>>> On 08.02.16 at 20:03, <roger.pau@citrix.com> wrote:
>>  * `eflags`: all user settable bits are clear.
> 
> The word "user" here can be mistaken. Perhaps better "all modifiable
> bits"?
>
>> All other processor registers and flag bits are unspecified. The OS is in
>> charge of setting up it's own stack, GDT and IDT.
> 
> The "flag bits" part should now probably be dropped?
> 
>> The format of the boot start info structure is the following (pointed to
>> be %ebx):
> 
> "... by %ebx"

Done for both of the above comments.

>> NOTE: nothing will be loaded at physical address 0, so a 0 value in any of 
>> the address fields should be treated as not present.
>>
>>  0 +----------------+
>>    | magic          | Contains the magic value 0x336ec578
>>    |                | ("xEn3" with the 0x80 bit of the "E" set).
>>  4 +----------------+
>>    | flags          | SIF_xxx flags.
>>  8 +----------------+
>>    | cmdline_paddr  | Physical address of the command line,
>>    |                | a zero-terminated ASCII string.
>> 12 +----------------+
>>    | nr_modules     | Number of modules passed to the kernel.
>> 16 +----------------+
>>    | modlist_paddr  | Physical address of an array of modules
>>    |                | (layout of the structure below).
>> 20 +----------------+
> 
> There having been talk about extending the structure, I think we
> need some indicator that the consumer can use to know which
> fields are present. I.e. either a version field, another flags one,
> or a size one.

Either a version or flags field sounds good to me. A version is
probably more desirable in order to prevent confusion with the already
present flags field:

 0 +----------------+
   | magic          | Contains the magic value 0x336ec578
   |                | ("xEn3" with the 0x80 bit of the "E" set).
 4 +----------------+
   | version        | Version of this structure. Current version is 0.
   |                | New versions are guaranteed to be
   |                | backwards-compatible.
 8 +----------------+
   | flags          | SIF_xxx flags.
12 +----------------+
   | cmdline_paddr  | Physical address of the command line,
   |                | a zero-terminated ASCII string.
16 +----------------+
   | nr_modules     | Number of modules passed to the kernel.
20 +----------------+
   | modlist_paddr  | Physical address of an array of modules
   |                | (layout of the structure below).
24 +----------------+
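
(As a C sketch of the layout above -- the struct and field names are only
illustrative here, the real names would be whatever ends up in the public
headers:)

    /* Proposed start info layout: six little-endian 32-bit fields,
     * 24 bytes total, pointed to by %ebx at entry. */
    #include <stdint.h>

    struct hvmlite_start_info {
        uint32_t magic;          /* 0x336ec578 ("xEn3" with bit 7 of 'E' set) */
        uint32_t version;        /* 0 for this layout */
        uint32_t flags;          /* SIF_xxx flags */
        uint32_t cmdline_paddr;  /* physical address of the command line */
        uint32_t nr_modules;     /* number of entries in the module list */
        uint32_t modlist_paddr;  /* physical address of the module list */
    } __attribute__((packed));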

> 
>> The layout of each entry in the module structure is the following:
>>
>>  0 +----------------+
>>    | paddr          | Physical address of the module.
>>  4 +----------------+
>>    | size           | Size of the module in bytes.
>>  8 +----------------+
>>    | cmdline_paddr  | Physical address of the command line,
>>    |                | a zero-terminated ASCII string.
>> 12 +----------------+
>>    | reserved       |
>> 16 +----------------+
> 
> I've been thinking about this on draft A already: Do we really want
> to paint ourselves into the corner of not supporting >4Gb modules,
> by limiting their addresses and sizes to 32 bits?

Hm, that's an itchy question. TBH I doubt we are going to see modules >4GB
ATM, but maybe in the future this no longer holds.

I wouldn't mind making all the fields in the module structure 64bits,
but I think we should then spell out that Xen will always try to place
the modules below the 4GiB boundary when possible.
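
(Again only as a sketch, with illustrative names: a module list entry with
all fields widened to 64 bits would then look something like this, 32
bytes per entry:)

    #include <stdint.h>

    struct hvmlite_modlist_entry {
        uint64_t paddr;          /* physical address of the module */
        uint64_t size;           /* size of the module in bytes */
        uint64_t cmdline_paddr;  /* physical address of the command line */
        uint64_t reserved;       /* padding/future use */
    } __attribute__((packed));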

>> Hardware description
>> --------------------
>>
>> Hardware description can come from two different sources, just like on 
>> (PV)HVM
>> guests.
>>
>> Description of PV devices will always come from xenbus, and in fact
>> xenbus is the only hardware description that is guaranteed to always be
>> provided to HVMlite guests.
>>
>> Description of physical hardware devices will always come from ACPI, in the
>> absence of any physical hardware device no ACPI tables will be provided.
> 
> This seems too strict: How about "in the absence of any physical
> hardware device ACPI tables may not be provided"?

Right, this should allow us more freedom when deciding whether to
provide ACPI tables or not.

The only case where we might avoid ACPI tables is when no local APIC or
IO APIC is provided, and even in this scenario I would be tempted to
provide at least a FADT in order to announce that no CMOS RTC is
available (and possibly also signal reduced HW).

>> Non-PV devices exposed to the guest
>> -----------------------------------
>>
>> The initial idea was to simply don't provide any emulated devices to a 
>> HVMlite
>> guest as the default option. We have however identified certain situations
>> where emulated devices could be interesting, both from a performance and
>> ease of implementation point of view. The following list tries to encompass
>> the different identified scenarios:
>>
>>  * 1. HVMlite with no emulated devices at all
>>    ------------------------------------------
>>    This is the current implementation inside of Xen, everything is disabled
>>    by default and the guest has access to the PV devices only. This is of
>>    course the most secure design because it has the smaller surface of attack.
> 
> smallest?

Right, fixed.

>>  * 2. HVMlite with (or capable to) PCI-passthrough
>>    -----------------------------------------------
>>    The current model of PCI-passthrought in PV guests is complex and requires
>>    heavy modifications to the guest OS. Going forward we would like to remove
>>    this limitation, by providing an interface that's the same as found on bare
>>    metal. In order to do this, at least an emulated local APIC should be
>>    provided to guests, together with the access to a PCI-Root complex.
>>    As said in the 'Hardware description' section above, this will also require
>>    ACPI. So this proposed scenario will require the following elements that are
>>    not present in the minimal (or default) HVMlite implementation: ACPI, local
>>    APIC, IO APIC (optional) and PCI-Root complex.
> 
> Are you reasonably convinced that the absence of an IO-APIC
> won't, with LAPICs present, cause more confusion than aid to the
> OSes wanting to adopt PVHv2?

As long as the data provided in the MADT represent the container
provided I think we should be fine. In the case of no IO APICs no
entries of type 1 (IO APIC) will be provided in the MADT.

>>  * 3. HVMlite hardware domain
>>    --------------------------
>>    The aim is that a HVMlite hardware domain is going to work exactly like a
>>    HVMlite domain with passed-through devices. This means that the domain will
>>    need access to the same set of emulated devices, and that some ACPI tables
>>    must be fixed in order to reflect the reality of the container the hardware
>>    domain is running on. The ACPI section contains more detailed information
>>    about which/how these tables are going to be fixed.
>>
>>    Note that in this scenario the hardware domain will *always* have a local
>>    APIC and IO APIC, and that the usage of PHYSDEV operations and PIRQ event
>>    channels is going to be removed in favour of the bare metal mechanisms.
> 
> Do you really mean "*always*"? What about a system without IO-APIC?
> Would you mean to emulate one there for no reason?

Oh, a real system without an IO APIC. No, then we wouldn't provide one
to the hardware domain, since it makes no sense.

> Also I think you should say "the usage of many PHYSDEV operations",
> because - as we've already pointed out - some are unavoidable.

Yes, that's right.

>> ACPI
>> ----
>>
>> ACPI tables will be provided to the hardware domain or to unprivileged
>> domains. In the case of unprivileged guests ACPI tables are going to be
>> created by the toolstack and will only contain the set of devices available
>> to the guest, which will at least be the following: local APIC and
>> optionally an IO APIC and passed-through device(s). In order to provide this
>> information from ACPI the following tables are needed as a minimum: RSDT,
>> FADT, MADT and DSDT. If an administrator decides to not provide a local APIC,
>> the MADT table is not going to be provided to the guest OS.
>>
>> The ACPI_FADT_NO_CMOS_RTC flag in the FADT boot_flags field is going to be used
>> to signal guests that there's no RTC device (the Xen PV wall clock should be
>> used instead). It is likely that this flag is not going to be set for the
>> hardware domain, since it should have access to the RTC present in the host
>> (if there's one). The ACPI_FADT_NO_VGA is also very likely to be set in the
>> same boot_flags FADT field for DomUs in order to signal that there's no VGA
>> adapter present.
>>
>> Finally the ACPI_FADT_HW_REDUCED is going to be set in the FADT flags field
>> in order to signal that there are no legacy devices: i8259 PIC or i8254 PIT.
>> There's no intention to enable these devices, so it is expected that the
>> hardware-reduced FADT flag is always going to be set.
> 
> We'll need to be absolutely certain that use of this flag doesn't carry
> any further implications.

No, after taking a closer look at the ACPI spec I don't think we can use
this flag. It has some connotations that wouldn't be true, for example:

 - UEFI must be used for boot.
 - Sleep state entering is different. Using SLEEP_CONTROL_REG and
SLEEP_STATUS_REG instead of SLP_TYP, SLP_EN and WAK_STS. This of course
is not something that we can decide for Dom0.

And there are more implications which I think would not hold in our case.

So are we just going to say that HVMlite systems will never have an i8259
PIC or i8254 PIT? Because I don't see a proper way to report this using
standard ACPI fields.

>> In the case of the hardware domain, Xen has traditionally passed-through the
>> native ACPI tables to the guest. This is something that of course we still
>> want to do, but in the case of HVMlite Xen will have to make sure that
>> the data passed in the ACPI tables to the hardware domain contain the 
>> accurate
>> hardware description. This means that at least certain tables will have to
>> be modified/mangled before being presented to the guest:
>>
>>  * MADT: the number of local APIC entries need to be fixed to match the number
>>          of vCPUs available to the guest. The address of the IO APIC(s) also
>>          need to be fixed in order to match the emulated ones that we are going
>>          to provide.
>>
>>  * DSDT: certain devices reported in the DSDT may not be available to the guest,
>>          but since the DSDT is a run-time generated table we cannot fix it. In
>>          order to cope with this, a STAO table will be provided that should
>>          be able to signal which devices are not available to the hardware
>>          domain. This is in line with the Xen/ACPI implementation for ARM.
> 
> Will STAO be sufficient for everything that may need customization?
> I'm particularly worried about processor related methods in DSDT or
> SSDT, which - if we're really meaning to do as you say - would need
> to be limited (or extended) to the number of vCPU-s Dom0 gets.
> What's even less clear to me is how you mean to deal with P-, C-,
> and (once supported) T-state management for CPUs which don't
> have a vCPU equivalent in Dom0.

I was mostly planning to use the STAO in order to hide the UART. Hiding
the CPU methods is also something that we might do from the STAO, but as
you say we still need to report them to Xen in order to have proper PM.
This is already listed in the section called 'hacks' below.

IMHO the processor related methods should not be hidden; instead a
custom Xen driver should be implemented in Dom0 in order to
report them to Xen. AFAICT masking them in the STAO would effectively
prevent _any_ driver in Dom0 from using them. This is ugly, but I don't
see any alternative at all.

> 
>> NB: there are corner cases that I'm not sure how to solve properly. Currently
>> the hardware domain has some 'hacks' regarding ACPI and Xen. At least I'm aware
>> of the following:
>>
>>  * 1. Reporting CPU PM info back to Xen: this comes from the DSDT table, and
>>    since this table is only available to the hardware domain it has to report
>>    the PM info back to Xen so that Xen can perform proper PM.
>>  * 2. Doing proper shutdown (S5) requires the usage of a hypercall, which is
>>    mixed with native ACPICA code in most OSes. This is awkward and requires
>>    the usage of hooks into ACPICA which we have not yet managed to upstream.
> 
> Iirc shutdown doesn't require any custom patches anymore in Linux.

Hm, not in Linux, but the hooks have not been merged into ACPICA, which
is the standard code base used by many OSes in order to deal with ACPI.

>>  * 3. Reporting the PCI devices it finds to the hypervisor: this is not very
>>    intrusive in general, so I'm not that pushed to remove it. It's generally
>>    easy in any OS to add some kind of hook that's executed every time a PCI
>>    device is discovered.
>>  * 4. Report PCI memory-mapped configuration areas to Xen: my opinion regarding
>>    this one is the same as (3), it's not really intrusive so I'm not very
>>    pushed to remove it.
> 
> As said in another reply - for both of these, we just can't remove the
> reporting to Xen.

Right, as I said above I'm not especially pushed to remove them. IMHO
they are not that intrusive.

>> MMIO mapping
>> ------------
>>
>> For DomUs without any device passed-through no direct MMIO mappings will be
>> present in the physical memory map presented to the guest. For DomUs with
>> devices passed-though the toolstack will create direct MMIO mappings as
>> part of the domain build process, and thus no action will be required
>> from the DomU.
>>
>> For the hardware domain initial direct MMIO mappings will be set for the
>> following regions:
>>
>> NOTE: ranges are defined using memory addresses, not pages.
>>
>>  * [0x0, 0xFFFFF]: the low 1MiB will be mapped into the physical guest
>>    memory map at the same position.
>>
>>  * [0xF00000, 0xFFFFFF]: the ISA memory hole will be mapped 1:1 into the
>>    guest physical memory.
> 
> When have you last seen a machine with a hole right below the
> 16Mb boundary?

Right, I will remove this. Even my old Nehalems (which is the first
Intel architecture with an IOMMU, IIRC) don't have them.

Should I also mention RMRR?

  * Any RMRR regions reported will also be mapped 1:1 to Dom0.

>>  * ACPI memory areas: regions with type E820_ACPI or E820_NVS will be mapped
>>    1:1 to the guest physical memory map. There are going to be exceptions if
>>    Xen has to modify the tables before presenting them to the guest.
>>
>>  * PCI Express MMCFG: if Xen is able to identify any of these regions at boot
>>    time they will also be made available to the guest at the same position
>>    in it's physical memory map. It is possible that Xen will trap accesses to
>>    those regions, but a guest should be able to use the native configuration
>>    mechanism in order to interact with this configuration space. If the
>>    hardware domain reports the presence of any of those regions using the
>>    PHYSDEVOP_pci_mmcfg_reserved hypercall Xen will also all guest access to
>>    them.
> 
> s/all guest/allow Dom0/ in this last sentence?

Yes.

>>  * PCI BARs: it's not possible for Xen to know the position of the BARs of
>>    the PCI devices without hardware domain interaction. In order to have
>>    the BARs of PCI devices properly mapped the hardware domain needs to
>>    call the PHYSDEVOP_pci_device_add hypercall, that will take care of setting
>>    up the BARs in the guest physical memory map using 1:1 MMIO mappings. This
>>    procedure will be transparent from guest point of view, and upon returning
>>    from the hypercall mappings must be already established.
> 
> I'm not sure this can work, as it imposes restrictions on the ordering
> of operations internal of the Dom0 OS: Successfully having probed
> for a PCI device (and hence reporting its presence to Xen) doesn't
> imply its BARs have already got set up. Together with the possibility
> of the OS re-assigning BARs I think we will actually need another
> hypercall, or the same device-add hypercall may need to be issued
> more than once per device (i.e. also every time any BAR assignment
> got changed).

We already trap accesses to 0xcf8/0xcfc; can't we detect BAR
reassignments there and change the MMIO mappings accordingly?

I was thinking that we could do the initial map at the current position
when issuing the hypercall, and then detect further changes and perform
remapping if needed, but maybe I'm missing something again that makes
this approach not feasible.
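
(To make the idea a bit more concrete -- a rough sketch only, not a
proposal for the actual implementation: the dword latched at 0xcf8
encodes bus/device/function plus the register offset, so a trapped write
to 0xcfc can be classified as a BAR update along these lines:)

    /* Decide whether a trapped 0xcfc write (with the given 0xcf8 value
     * latched) touches one of the BARs of a type 0 header (offsets
     * 0x10-0x27).  ROM BAR, type 1 headers and MMCFG accesses ignored. */
    #include <stdint.h>
    #include <stdbool.h>

    static bool cf8_write_hits_bar(uint32_t cf8, uint16_t port)
    {
        if (!(cf8 & 0x80000000u))                /* CONFIG_ADDRESS enable bit */
            return false;
        uint8_t reg = (cf8 & 0xfc) | (port & 3); /* config space offset 0-255 */
        return reg >= 0x10 && reg < 0x28;        /* BAR0..BAR5 */
    }

    /* bus = (cf8 >> 16) & 0xff, dev = (cf8 >> 11) & 0x1f, fn = (cf8 >> 8) & 7
     * then identify which device's 1:1 mapping may need to be refreshed. */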

Roger.


* Re: HVMlite ABI specification DRAFT B + implementation outline
  2016-02-10 12:01   ` Roger Pau Monné
@ 2016-02-10 12:53     ` Jan Beulich
  2016-02-10 14:51       ` Roger Pau Monné
  0 siblings, 1 reply; 24+ messages in thread
From: Jan Beulich @ 2016-02-10 12:53 UTC (permalink / raw)
  To: Roger Pau Monné, Konrad Rzeszutek Wilk
  Cc: Wei Liu, Stefano Stabellini, Andrew Cooper, Tim Deegan,
	Paul Durrant, David Vrabel, xen-devel, Samuel Thibault,
	Boris Ostrovsky

>>> On 10.02.16 at 13:01, <roger.pau@citrix.com> wrote:
> El 9/2/16 a les 14:24, Jan Beulich ha escrit:
>>>>> On 08.02.16 at 20:03, <roger.pau@citrix.com> wrote:
>>> The layout of each entry in the module structure is the following:
>>>
>>>  0 +----------------+
>>>    | paddr          | Physical address of the module.
>>>  4 +----------------+
>>>    | size           | Size of the module in bytes.
>>>  8 +----------------+
>>>    | cmdline_paddr  | Physical address of the command line,
>>>    |                | a zero-terminated ASCII string.
>>> 12 +----------------+
>>>    | reserved       |
>>> 16 +----------------+
>> 
>> I've been thinking about this on draft A already: Do we really want
>> to paint ourselves into the corner of not supporting >4Gb modules,
>> by limiting their addresses and sizes to 32 bits?
> 
> Hm, that's an itchy question. TBH I doubt we are going to see modules >4GB
> ATM, but maybe in the future this no longer holds.
> 
> I wouldn't mind making all the fields in the module structure 64bits,
> but I think we should then spell out that Xen will always try to place
> the modules below the 4GiB boundary when possible.

Sounds reasonable.

>>>  * 2. HVMlite with (or capable to) PCI-passthrough
>>>    -----------------------------------------------
>>>    The current model of PCI-passthrought in PV guests is complex and requires
>>>    heavy modifications to the guest OS. Going forward we would like to remove
>>>    this limitation, by providing an interface that's the same as found on bare
>>>    metal. In order to do this, at least an emulated local APIC should be
>>>    provided to guests, together with the access to a PCI-Root complex.
>>>    As said in the 'Hardware description' section above, this will also require
>>>    ACPI. So this proposed scenario will require the following elements that are
>>>    not present in the minimal (or default) HVMlite implementation: ACPI, local
>>>    APIC, IO APIC (optional) and PCI-Root complex.
>> 
>> Are you reasonably convinced that the absence of an IO-APIC
>> won't, with LAPICs present, cause more confusion than aid to the
>> OSes wanting to adopt PVHv2?
> 
> As long as the data provided in the MADT represent the container
> provided I think we should be fine. In the case of no IO APICs no
> entries of type 1 (IO APIC) will be provided in the MADT.

I understand that, but are certain OSes prepared for that?

>>>  * 3. HVMlite hardware domain
>>>    --------------------------
>>>    The aim is that a HVMlite hardware domain is going to work exactly like a
>>>    HVMlite domain with passed-through devices. This means that the domain will
>>>    need access to the same set of emulated devices, and that some ACPI tables
>>>    must be fixed in order to reflect the reality of the container the hardware
>>>    domain is running on. The ACPI section contains more detailed information
>>>    about which/how these tables are going to be fixed.
>>>
>>>    Note that in this scenario the hardware domain will *always* have a local
>>>    APIC and IO APIC, and that the usage of PHYSDEV operations and PIRQ event
>>>    channels is going to be removed in favour of the bare metal mechanisms.
>> 
>> Do you really mean "*always*"? What about a system without IO-APIC?
>> Would you mean to emulate one there for no reason?
> 
> Oh, a real system without an IO APIC. No, then we wouldn't provide one
> to the hardware domain, since it makes no sense.

I.e. the above should say "... will always have local APICs and IO-APICs
mirroring the physical machine's, ..." or something equivalent.

>>> ACPI
>>> ----
>>>
>>> ACPI tables will be provided to the hardware domain or to unprivileged
>>> domains. In the case of unprivileged guests ACPI tables are going to be
>>> created by the toolstack and will only contain the set of devices available
>>> to the guest, which will at least be the following: local APIC and
>>> optionally an IO APIC and passed-through device(s). In order to provide this
>>> information from ACPI the following tables are needed as a minimum: RSDT,
>>> FADT, MADT and DSDT. If an administrator decides to not provide a local APIC,
>>> the MADT table is not going to be provided to the guest OS.
>>>
>>> The ACPI_FADT_NO_CMOS_RTC flag in the FADT boot_flags field is going to be used
>>> to signal guests that there's no RTC device (the Xen PV wall clock should be
>>> used instead). It is likely that this flag is not going to be set for the
>>> hardware domain, since it should have access to the RTC present in the host
>>> (if there's one). The ACPI_FADT_NO_VGA flag is also very likely to be set in
>>> the same boot_flags FADT field for DomUs in order to signal that there's no
>>> VGA adapter present.
>>>
>>> Finally, the ACPI_FADT_HW_REDUCED flag is going to be set in the FADT flags
>>> field in order to signal that there are no legacy devices: i8259 PIC or
>>> i8254 PIT.
>>> There's no intention to enable these devices, so it is expected that the
>>> hardware-reduced FADT flag is always going to be set.
>> 
>> We'll need to be absolutely certain that use of this flag doesn't carry
>> any further implications.
> 
> No, after taking a closer look at the ACPI spec I don't think we can use
> this flag. It has some implications that wouldn't be true, for example:
> 
>  - UEFI must be used for boot.
>  - Sleep state handling is different: SLEEP_CONTROL_REG and
> SLEEP_STATUS_REG are used instead of SLP_TYP, SLP_EN and WAK_STS. This of
> course is not something that we can decide for Dom0.
> 
> And there are more implications which I think would not hold in our case.
> 
> So are we just going to say that HVMlite systems will never have an i8259
> PIC or i8254 PIT? Because I don't see a proper way to report this using
> standard ACPI fields.

I think so, yes.
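
Incidentally, the way a guest would consume the boot_flags hints
mentioned further up could look roughly like this (a minimal sketch; the
bit positions follow the ACPI spec's IA-PC boot architecture flags and
are called ACPI_FADT_NO_VGA / ACPI_FADT_NO_CMOS_RTC in ACPICA, while the
struct and helper names here are just stand-ins):

    #include <stdbool.h>
    #include <stdint.h>

    /* Stand-in for the relevant FADT field; the real layout comes from
     * the ACPI spec (or the ACPICA headers). */
    struct fadt_flags_view {
        uint16_t boot_flags;               /* IA-PC Boot Architecture Flags */
    };

    #define FADT_NO_VGA       (1u << 2)    /* VGA Not Present */
    #define FADT_NO_CMOS_RTC  (1u << 5)    /* CMOS RTC Not Present */

    /* Sketch of what a guest concludes from the flags Xen sets. */
    static inline bool must_use_pv_wallclock(const struct fadt_flags_view *f)
    {
        return f->boot_flags & FADT_NO_CMOS_RTC;  /* no emulated CMOS RTC */
    }

    static inline bool have_vga(const struct fadt_flags_view *f)
    {
        return !(f->boot_flags & FADT_NO_VGA);    /* no VGA adapter otherwise */
    }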

>>> MMIO mapping
>>> ------------
>>>
>>> For DomUs without any device passed-through no direct MMIO mappings will be
>>> present in the physical memory map presented to the guest. For DomUs with
>>> devices passed through, the toolstack will create direct MMIO mappings as
>>> part of the domain build process, and thus no action will be required
>>> from the DomU.
>>>
>>> For the hardware domain initial direct MMIO mappings will be set for the
>>> following regions:
>>>
>>> NOTE: ranges are defined using memory addresses, not pages.
>>>
>>>  * [0x0, 0xFFFFF]: the low 1MiB will be mapped into the physical guest
>>>    memory map at the same position.
>>>
>>>  * [0xF00000, 0xFFFFFF]: the ISA memory hole will be mapped 1:1 into the
>>>    guest physical memory.
>> 
>> When have you last seen a machine with a hole right below the
>> 16MB boundary?
> 
> Right, I will remove this. Even my old Nehalems (which was the first Intel
> architecture with an IOMMU, IIRC) don't have such a hole.
> 
> Should I also mention RMRR?
> 
>   * Any RMRR regions reported will also be mapped 1:1 to Dom0.

That's a good idea, yes. But please make it explicit that such
mappings will go away together with the removal of devices (for
pass-through purposes) from Dom0.

>>>  * PCI BARs: it's not possible for Xen to know the position of the BARs of
>>>    the PCI devices without hardware domain interaction. In order to have
>>>    the BARs of PCI devices properly mapped the hardware domain needs to
>>>    call the PHYSDEVOP_pci_device_add hypercall, which will take care of setting
>>>    up the BARs in the guest physical memory map using 1:1 MMIO mappings. This
>>>    procedure will be transparent from the guest's point of view, and upon
>>>    returning from the hypercall the mappings must already be established.
>> 
>> I'm not sure this can work, as it imposes restrictions on the ordering
>> of operations internal to the Dom0 OS: successfully having probed
>> for a PCI device (and hence reporting its presence to Xen) doesn't
>> imply its BARs have already been set up. Together with the possibility
>> of the OS re-assigning BARs, I think we will actually need another
>> hypercall, or the same device-add hypercall may need to be issued
>> more than once per device (i.e. also every time any BAR assignment
>> is changed).
> 
> We already trap accesses to 0xcf8/0xcfc, can't we detect BAR
> reassignments and then act accordingly and change the MMIO mapping?
> 
> I was thinking that we could do the initial map at the current position
> when issuing the hypercall, and then detect further changes and perform
> remapping if needed, but maybe I'm missing something again that makes
> this approach not feasible.

I think that's certainly possible, but will require quite a bit of care
when implementing. (In fact this way I think we could then also
observe bus renumbering, without requiring Dom0 to remove and
then re-add all affected devices. Konrad - what do you think?)
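
Roughly the kind of tracking I'd expect, just to illustrate the idea (a
much simplified sketch: BAR sizing cycles, 64-bit BARs and the memory
decode enable bit all need proper handling, and the remap_bar() helper
at the end is hypothetical):

    #include <stdint.h>

    /* Track BAR writes seen through the trapped 0xcf8/0xcfc ports. The
     * write itself is still forwarded to the device; this only decides
     * whether the p2m mappings need to be updated. */

    #define PCI_BAR0_OFFSET  0x10
    #define PCI_BAR5_OFFSET  0x24

    struct cf8_state {
        uint32_t addr;                       /* last value written to 0xcf8 */
    };

    static int      cf8_enabled(uint32_t a) { return !!(a & 0x80000000u); }
    static uint8_t  cf8_reg(uint32_t a)     { return a & 0xfc; }
    static uint16_t cf8_bdf(uint32_t a)     { return (a >> 8) & 0xffff; }

    /* Hypothetical hook: update the p2m so the new BAR value is mapped 1:1. */
    void remap_bar(uint16_t bdf, unsigned int bar, uint32_t new_value);

    void handle_cfc_write(struct cf8_state *s, uint32_t val)
    {
        uint8_t reg;

        if (!cf8_enabled(s->addr))
            return;

        reg = cf8_reg(s->addr);
        if (reg < PCI_BAR0_OFFSET || reg > PCI_BAR5_OFFSET)
            return;                          /* not a BAR register */

        if (val == 0xffffffffu)
            return;                          /* sizing cycle, ignore */

        remap_bar(cf8_bdf(s->addr), (reg - PCI_BAR0_OFFSET) / 4, val);
    }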

Jan


* Re: HVMlite ABI specification DRAFT B + implementation outline
  2016-02-10 12:53     ` Jan Beulich
@ 2016-02-10 14:51       ` Roger Pau Monné
  2016-02-10 15:14         ` Boris Ostrovsky
  0 siblings, 1 reply; 24+ messages in thread
From: Roger Pau Monné @ 2016-02-10 14:51 UTC (permalink / raw)
  To: Jan Beulich, Konrad Rzeszutek Wilk
  Cc: Wei Liu, Stefano Stabellini, Andrew Cooper, Tim Deegan,
	Paul Durrant, David Vrabel, xen-devel, Samuel Thibault,
	Boris Ostrovsky

On 10/2/16 at 13:53, Jan Beulich wrote:
>>>> On 10.02.16 at 13:01, <roger.pau@citrix.com> wrote:
>> On 9/2/16 at 14:24, Jan Beulich wrote:
>>>>>> On 08.02.16 at 20:03, <roger.pau@citrix.com> wrote:
>>>>  * 2. HVMlite with (or capable of) PCI-passthrough
>>>>    -----------------------------------------------
>>>>    The current model of PCI passthrough in PV guests is complex and requires
>>>>    heavy modifications to the guest OS. Going forward we would like to remove
>>>>    this limitation, by providing an interface that's the same as found on bare
>>>>    metal. In order to do this, at least an emulated local APIC should be
>>>>    provided to guests, together with access to a PCI-Root complex.
>>>>    As said in the 'Hardware description' section above, this will also require
>>>>    ACPI. So this proposed scenario will require the following elements that are
>>>>    not present in the minimal (or default) HVMlite implementation: ACPI, local
>>>>    APIC, IO APIC (optional) and PCI-Root complex.
>>>
>>> Are you reasonably convinced that the absence of an IO-APIC
>>> won't, with LAPICs present, cause more confusion than aid to the
>>> OSes wanting to adopt PVHv2?
>>
>> As long as the data provided in the MADT represents the container
>> provided, I think we should be fine. In the case of no IO APICs, no
>> entries of type 1 (IO APIC) will be provided in the MADT.
> 
> I understand that, but are you certain OSes are prepared for that?

Well, given that some modifications will always be needed in order to
run as a PVH guest, I don't think this should stop us. FWIW, I think
FreeBSD should be able to cope with this without much fuss:

https://svnweb.freebsd.org/base/head/sys/x86/acpica/madt.c?revision=291686&view=co

The detection of local APICs and IO APICs is quite isolated from each
other, and the failure to find any IO APICs should not prevent local
APICs from being enabled. Although I don't think there's any hardware
with this setup, such a configuration would be valid from an ACPI point
of view. I've of course not tested this in any way, so these are only
observations from a quick look at the code. I cannot speak for Linux.
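
Just to illustrate the point, the kind of MADT walk I have in mind is
roughly the following (a simplified sketch, not the FreeBSD code; the
subtable header layout follows the ACPI spec, and the two register_*()
callbacks are hypothetical):

    #include <stdint.h>

    /* MADT interrupt controller subtable header (layout per the ACPI spec). */
    struct madt_subtable {
        uint8_t type;                /* 0 = local APIC, 1 = IO APIC, ... */
        uint8_t length;
    } __attribute__((packed));

    #define MADT_TYPE_LOCAL_APIC  0
    #define MADT_TYPE_IO_APIC     1

    /* Hypothetical callbacks into the rest of the kernel. */
    void register_local_apic(const void *entry);
    void register_io_apic(const void *entry);

    /*
     * Walk the MADT subtables between 'start' and 'end'. Local APICs and
     * IO APICs are handled independently: if no type 1 entries are present
     * (as in the minimal HVMlite case) the local APICs are still registered
     * and the IO APIC setup is simply skipped.
     */
    void parse_madt_subtables(const uint8_t *start, const uint8_t *end)
    {
        const uint8_t *p = start;

        while (p + sizeof(struct madt_subtable) <= end) {
            const struct madt_subtable *sub = (const void *)p;

            if (sub->length < sizeof(*sub) || p + sub->length > end)
                break;                          /* malformed entry, stop */

            switch (sub->type) {
            case MADT_TYPE_LOCAL_APIC:
                register_local_apic(sub);
                break;
            case MADT_TYPE_IO_APIC:
                register_io_apic(sub);
                break;
            default:
                break;                          /* ignore other entry types */
            }

            p += sub->length;
        }
    }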

Also, thanks for the observations and comments on the document.

Roger.


* Re: HVMlite ABI specification DRAFT B + implementation outline
  2016-02-10 14:51       ` Roger Pau Monné
@ 2016-02-10 15:14         ` Boris Ostrovsky
  0 siblings, 0 replies; 24+ messages in thread
From: Boris Ostrovsky @ 2016-02-10 15:14 UTC (permalink / raw)
  To: Roger Pau Monné, Jan Beulich, Konrad Rzeszutek Wilk
  Cc: Wei Liu, Stefano Stabellini, Andrew Cooper, Tim Deegan,
	Paul Durrant, David Vrabel, xen-devel, Samuel Thibault

On 02/10/2016 09:51 AM, Roger Pau Monné wrote:
>
> The detection of local APICs and IO APICs is quite isolated from each
> other, and the failure to find any IO APICs should not prevent local
> APICs from being enabled. Although I don't think there's any hardware
> with this setup, such a configuration would be valid from an ACPI point
> of view. I've of course not tested this in any way, so these are only
> observations from a quick look at the code. I cannot speak for Linux.

I believe Linux should be able to handle this as well:
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/arch/x86/kernel/acpi/boot.c?id=refs/tags/v4.5-rc3#n1140

I also removed ioapic entries from hvmloader's ACPI builder and the 
(HVM) guest came up with no immediately visible issues.
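
Conceptually the builder side then boils down to something like the
following (a sketch of the idea only, not the actual hvmloader code;
subtable layouts follow the ACPI spec, while the function name, the APIC
ID scheme and the constants are illustrative):

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    struct madt_lapic {
        uint8_t type, length, acpi_id, apic_id;
        uint32_t flags;
    } __attribute__((packed));

    struct madt_ioapic {
        uint8_t type, length, id, rsvd;
        uint32_t addr, gsi_base;
    } __attribute__((packed));

    /* Emit the MADT subtables for a guest: one local APIC entry per vCPU,
     * and an IO APIC entry only if the domain is configured with one.
     * Returns the number of bytes written to buf. */
    size_t build_madt_subtables(uint8_t *buf, unsigned int nr_vcpus,
                                bool has_ioapic)
    {
        uint8_t *p = buf;

        for (unsigned int i = 0; i < nr_vcpus; i++) {
            struct madt_lapic lapic = {
                .type = 0, .length = sizeof(lapic),
                .acpi_id = i, .apic_id = i,     /* ID scheme is illustrative */
                .flags = 1,                     /* enabled */
            };
            memcpy(p, &lapic, sizeof(lapic));
            p += sizeof(lapic);
        }

        if (has_ioapic) {
            struct madt_ioapic ioapic = {
                .type = 1, .length = sizeof(ioapic),
                .id = 0, .addr = 0xfec00000, .gsi_base = 0,
            };
            memcpy(p, &ioapic, sizeof(ioapic));
            p += sizeof(ioapic);
        }

        return p - buf;
    }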

-boris



Thread overview: 24+ messages
2016-02-08 19:03 HVMlite ABI specification DRAFT B + implementation outline Roger Pau Monné
2016-02-08 21:26 ` Boris Ostrovsky
2016-02-09 10:56 ` Andrew Cooper
2016-02-09 11:58   ` Roger Pau Monné
2016-02-09 12:10     ` Jan Beulich
2016-02-09 13:00       ` Roger Pau Monné
2016-02-09 13:41         ` Jan Beulich
2016-02-09 16:32           ` Roger Pau Monné
2016-02-09 16:41             ` Jan Beulich
2016-02-09 14:36     ` Boris Ostrovsky
2016-02-09 14:42       ` Andrew Cooper
2016-02-09 14:48       ` Jan Beulich
2016-02-09 13:24 ` Jan Beulich
2016-02-09 15:06   ` Stefano Stabellini
2016-02-09 16:15     ` Jan Beulich
2016-02-09 16:17       ` David Vrabel
2016-02-09 16:28         ` Jan Beulich
2016-02-09 16:26       ` Stefano Stabellini
2016-02-09 16:33         ` Jan Beulich
2016-02-10 12:01   ` Roger Pau Monné
2016-02-10 12:53     ` Jan Beulich
2016-02-10 14:51       ` Roger Pau Monné
2016-02-10 15:14         ` Boris Ostrovsky
2016-02-09 15:14 ` Boris Ostrovsky
