* (v2) Design proposal for RMRR fix
From: Tian, Kevin @ 2014-12-26 11:23 UTC
  To: Jan Beulich, Chen, Tiejun, ian.campbell, wei.liu2, ian.jackson,
	stefano.stabellini, Zhang, Yang Z, xen-devel, konrad.wilk, tim

(Please note that parts of this proposal differ from the last version
sent out, after further discussion. I have tried to summarize the
earlier discussions and explain why we chose a different approach.
Apologies if I have missed any open questions or conclusions from the
past months; pointing them out would be much appreciated. :-)

----
TOC:
	1. What's RMRR
	2. RMRR status in Xen
	3. High Level Design
		3.1 Guidelines
		3.2 Conflict detection
		3.3 Policies
		3.4 Xen: setup RMRR identity mapping
		3.5 New interface: expose reserved region information
		3.6 Libxc/hvmloader: detect and avoid conflicts
		3.7 Hvmloader: reserve 'reserved regions' in guest E820
		3.8 Xen: Handle devices sharing reserved regions
	4. Plan
		4.1 Stage-1: hypervisor hardening
		4.2 Stage-2: libxc/hvmloader hardening
		
1. What's RMRR?
=====================================================================

RMRR is an acronym for Reserved Memory Region Reporting; it is
expected to be used for legacy usages (such as USB, UMA graphics,
etc.) that require reserved memory.

(From vt-d spec)
----
Reserved system memory regions are typically allocated by BIOS at boot
time and reported to OS as reserved address ranges in the system memory
map. Requests to these reserved regions may either occur as a result of
operations performed by the system software driver (for example in the
case of DMA from unified memory access (UMA) graphics controllers to
graphics reserved memory) or may be initiated by non system software
(for example in case of DMA performed by a USB controller under BIOS
SMM control for legacy keyboard emulation). 

For proper functioning of these legacy reserved memory usages, when 
system software enables DMA remapping, the translation structures for 
the respective devices are expected to be set up to provide identity 
mapping for the specified reserved memory regions with read and write 
permissions. The system software is also responsible for ensuring 
that any input addresses used for device accesses to OS-visible memory 
do not overlap with the reserved system memory address ranges.

BIOS may report each such reserved memory region through the RMRR
structures, along with the devices that require access to the 
specified reserved memory region. Reserved memory ranges that are
either not DMA targets, or memory ranges that may be target of BIOS
initiated DMA only during pre-boot phase (such as from a boot disk
drive) must not be included in the reserved memory region reporting.
The base address of each RMRR region must be 4KB aligned and the size
must be an integer multiple of 4KB. If there are no RMRR structures,
the system software concludes that the platform does not have any 
reserved memory ranges that are DMA targets.

Platform designers should avoid or limit use of reserved memory regions
since these require system software to create holes in the DMA virtual
address range available to system software and its drivers.
----

Below is one example from a BDW machine:
(XEN) [VT-D]dmar.c:834: found ACPI_DMAR_RMRR:
(XEN) [VT-D]dmar.c:679:   RMRR region: base_addr ab80a000 end_address ab81dfff
(XEN) [VT-D]dmar.c:834: found ACPI_DMAR_RMRR:
(XEN) [VT-D]dmar.c:679:   RMRR region: base_addr ad000000 end_address af7fffff

Here the first reserved region belongs to the USB controller, and the
second to the IGD.
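
For reference, each such region is reported via an RMRR structure in
the ACPI DMAR table. A rough sketch of its layout per the VT-d spec is
below; the field names are illustrative, not Xen's exact definitions:

/* Sketch of a DMAR RMRR reporting structure (VT-d spec layout);
 * names are illustrative, not Xen's exact definitions. */
struct dmar_rmrr_sketch {
    uint16_t type;             /* ACPI_DMAR_TYPE_RESERVED_MEMORY */
    uint16_t length;           /* structure length incl. device scopes */
    uint16_t reserved;
    uint16_t segment;          /* PCI segment number */
    uint64_t base_address;     /* 4KB-aligned base of the region */
    uint64_t end_address;      /* inclusive end; size multiple of 4KB */
    /* followed by device scope entries naming the devices that
     * require access to [base_address, end_address] */
};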



2. RMRR status in Xen
=====================================================================

There are two main design goals according to the VT-d spec:

a) Set up identity mappings for reserved regions in the IOMMU page tables
b) Ensure reserved regions do not conflict with OS-visible memory
(OS-visible memory in a VM means guest physical memory; more strictly,
it also means no conflict with other types of allocations in the guest
physical address space, such as PCI MMIO, ACPI, etc.)

However, the current RMRR implementation in Xen only partially achieves
a) and completely misses b), which causes several issues:

--
[Issue-1] The identity mapping is not set up in the shared-EPT case, so
a device with an RMRR may not function correctly when assigned to a VM.

This was the original problem we found when assigning the IGD on a BDW
platform, which triggered the whole long discussion of the past months.

--
[Issue-2] Lacking goal b), existing device assignment with RMRRs works
only when the reserved regions happen not to conflict with other valid
allocations in the guest physical address space. This can lead to
unpredictable failures in various deployments, due to undetected
conflicts caused by differences in platform and VM configuration.

One example is USB controller assignment. It is already a known problem
on some platforms that USB reserved regions conflict with the guest
BIOS region. However, since the host BIOS only touches those reserved
regions for legacy keyboard emulation in the early Dom0 boot phase, a
trick was added in Xen to bypass RMRR handling for USB controllers.

--
[Issue-3] Devices may share the same reserved region, but there is no
logic to handle this in Xen. Assigning such devices to different VMs
could lead to a security concern.



3. High Level Design
=====================================================================

To achieve the aforementioned two goals, major enhancements are
required across the Xen hypervisor, libxc, and hvmloader to close the
gap in goal b), i.e. handling possible conflicts in gfn space. Fixing
goal a) is straightforward.

>>>3.1 Guidelines
----
There are several guidelines considered in the design:

--
[Guideline-1] No regression in a VM w/o statically-assigned devices

  If a VM isn't configured with assigned devices at creation, new
conflict detection logic shouldn't block VM boot progress (it should
either be skipped, or just emit a warning)

--
[Guideline-2] No regression on devices which do not have RMRR reported

  If a VM is assigned a device which doesn't have an RMRR reported,
whether statically or dynamically assigned, new conflict detection
logic shouldn't fail the assignment request for that device.

--
[Guideline-3] New interface should be kept as common as possible

  A new interface will be introduced to expose reserved regions to
user space. Though RMRR is VT-d-specific terminology, the interface
design should be generic, i.e. it should support any feature that
requires the hypervisor to force-reserve one or more gfn ranges.

--
[Guideline-4] Keep changes simple

  RMRR reserved regions should be avoided or limited by platform
designers, per the VT-d specification. Per our observations, there are
only a few reported examples (USB, IGD) on real platforms. So we need
to balance code complexity against usage limitations. If a limitation
only affects niche scenarios, we'd rather declare it unsupported to
keep the changes simple for now.

>>>3.2 Conflict detection
----
Conflicts must be detected in several places as far as gfns are
concerned (how to handle a conflict is discussed in 3.3):

1) libxc domain builder
  Here the coarse-grained gfn layout is created, including two
contiguous guest RAM trunks (lowmem and/or highmem) and MMIO holes
(VGA, PCI), which are passed to hvmloader for later fine-grained
manipulation. Guest RAM trunks are populated with valid translations
in the underlying p2m layer. Device reserved regions must be detected
against that layout.

2) Xen hypervisor device assignment
  Device assignment can happen either at VM creation time (after the
domain builder runs), or at any time through hotplug after the VM has
booted. Regardless of how userspace handles conflicts, the Xen
hypervisor always performs a final, conservative check when setting up
the identity mapping:
	* gfn space unoccupied:
		-> insert identity mapping; no conflict
	* gfn space already occupied by an identity mapping:
		-> do nothing; no conflict
	* gfn space already occupied by another mapping:
		-> conflict detected
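
In code, the above checks might look like the following sketch; the
helpers (p2m_lookup_type(), p2m_set_identity()) and the GFN_* values
are hypothetical placeholders for the real p2m accessors:

/* Sketch of the final hypervisor-side check when inserting an RMRR
 * identity mapping; helper names are hypothetical. */
static int rmrr_identity_map(struct domain *d, unsigned long base_gfn,
                             unsigned long end_gfn, bool warn_only)
{
    unsigned long gfn;

    for ( gfn = base_gfn; gfn <= end_gfn; gfn++ )
    {
        switch ( p2m_lookup_type(d, gfn) )       /* hypothetical */
        {
        case GFN_UNOCCUPIED:
            p2m_set_identity(d, gfn);            /* gfn == mfn, r/w */
            break;
        case GFN_IDENTITY_MAPPED:
            break;                               /* nothing to do */
        default:                                 /* conflict detected */
            if ( !warn_only )
                return -EBUSY;
            printk(XENLOG_WARNING
                   "RMRR conflict at gfn %lx, continuing\n", gfn);
            break;
        }
    }

    return 0;
}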

3) hvmloader
  Hvmloader allocates other resources (ACPI, PCI MMIO, etc.) and
internal data structures in gfn space, and it creates the final guest
e820. So hvmloader also needs to detect conflicts when performing
those operations. If there is no conflict, hvmloader reserves those
regions in the guest e820 to make the guest OS aware of them.

>>>3.3 Policies
----
An intuitive approach is to fail immediately upon a conflict; however,
that is not flexible enough for the differing requirements:

a) it's not appropriate to fail the libxc domain builder just because
of such a conflict. We still want the guest to boot even without the
assigned device;

b) whether to fail in hvmloader depends on several factors. If the
check is for hotplug preparation, a warning is an acceptable option,
since the assignment may never actually happen. Or if it's a USB
controller but the user doesn't care about legacy keyboard emulation,
it's also OK to move forward despite a conflict;

c) in the Xen hypervisor, where the device is actually assigned, it is
reasonable to fail upon a conflict. But due to the same USB controller
consideration, we might sometimes want assignment to succeed with only
warnings.

Given the complexity of addressing all of the above flexibility (user
preferences, per-device policy), which would require inventing quite a
few parameters to pass among the different components, and given that
failures should be rare (except for some USB cases) with proactive
avoidance in userspace, we'd like to propose the following simplified
policy, per [Guideline-4]:

- 'warn' on conflicts in user space (libxc and hvmloader)
- a boot option to choose 'fail' or 'warn' upon conflict in the Xen
device assignment path, defaulting to 'fail' (the user can set it to
'warn' for the USB case)

This provides a relaxed user space policy, with the hypervisor as the
final judge. Its particular merit is to simplify the later interface
design and hotplug support, without breaking [Guideline-1/2] even when
all possible reserved regions are exposed.
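
A minimal sketch of such a boot option follows; the name
'rmrr-conflict' is a placeholder, not a settled name:

/* Sketch: hypothetical "rmrr-conflict=fail|warn" Xen command line
 * option, defaulting to 'fail'. */
static bool __read_mostly rmrr_conflict_warn; /* false => 'fail' */

static void __init parse_rmrr_conflict(const char *s)
{
    if ( !strcmp(s, "warn") )
        rmrr_conflict_warn = true;
    else if ( !strcmp(s, "fail") )
        rmrr_conflict_warn = false;
}
custom_param("rmrr-conflict", parse_rmrr_conflict);

The assignment path would then consult rmrr_conflict_warn to decide
between returning an error and merely logging.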

    ******agreement is first required on the above policy******

>>>3.4 Xen: setup RMRR identity mapping
----
Regardless of whether userspace has detected a conflict, the Xen
hypervisor always needs to detect conflicts itself when setting up
identity mappings for reserved gfn regions, following the policy
defined above.

Identity mappings should really be handled at the general p2m layer,
so that the same r/w permissions apply equally to the CPU and DMA
access paths, regardless of whether EPT is shared with the IOMMU.

This matches the behavior on bare metal: although reserved regions are
marked E820_RESERVED, that is just a hint to system software, which can
still read the data back because the memory physically exists. So in
the virtualization case we don't need to treat CPU accesses to RMRR
reserved regions specially (similar to other reserved regions like
ACPI NVS).
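
A sketch of the p2m-level setup is below; set_identity_p2m_entry() is
an illustrative name for a wrapper around the p2m set-entry path:

/* Sketch: insert identity mappings through the p2m layer so they
 * take effect for both CPU (EPT) and DMA paths, whether or not EPT
 * is shared with the IOMMU. */
static int rmrr_setup_p2m(struct domain *d, unsigned long base_pfn,
                          unsigned long end_pfn)
{
    int rc = 0;

    for ( ; base_pfn <= end_pfn && !rc; base_pfn++ )
        /* hypothetical wrapper: gfn == mfn with r/w permissions */
        rc = set_identity_p2m_entry(d, base_pfn);

    return rc;
}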

>>>3.5 New interface: expose reserved region information
----
As explained in [Guideline-3], we'd like to keep this interface
general: a common interface for the hypervisor to force-reserve gfn
ranges, for whatever reason (RMRR is one client of this feature).

One design question was discussed back and forth: whether the
interface should return the regions reported for all devices in the
platform (report-all), or selectively return only the regions belonging
to assigned devices (report-sel). report-sel can be built on top of
report-all, with extra work to help the hypervisor generate filtered
regions (e.g. introducing a new interface, or making device assignment
happen before the domain builder runs).

We propose report-all as the simple solution (unlike the last version
sent out, which used report-sel), based on the following facts:

  - the 'warn' policy in user space makes report-all harmless
  - 'report-all' still means only a few entries in reality:
    * RMRR reserved regions should be avoided or limited by platform
designers, per the VT-d specification;
    * RMRR reserved regions are only a few on real platforms, per our
observations so far;
  - the OS needs to handle all the reserved regions on bare metal anyway;
  - it is hotplug friendly;
  - report-all can be extended to report-sel if really required

With this approach, there are two situations in which the libxc domain
builder may query reserved region information through the same
interface:

a) if there are any statically-assigned devices, and/or
b) if a new parameter is specified, asking for hotplug preparation
	('rdm_check' or 'prepare_hotplug'?)

The first invocation of this interface saves all reported reserved
regions under the domain structure, and later invocations (e.g. from
hvmloader) retrieve the saved content.
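
As a strawman, the interface could look something like the sketch
below; the structure and hypercall names are placeholders, not a
settled ABI:

/* Sketch of a generic reserved-region query interface; all names
 * are placeholders pending agreement. */
struct xen_reserved_region {
    uint64_t start_pfn;
    uint64_t nr_pages;
};

struct xen_reserved_region_map {
    domid_t  domid;
    uint32_t nr_entries;   /* IN: buffer size; OUT: entries written */
    XEN_GUEST_HANDLE(xen_reserved_region_t) buffer;
};
/* e.g. reached via a new XENMEM_reserved_region_map memory op */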

If a VM is configured without assigned devices, this interface is not
invoked, so there is no impact and [Guideline-1] is upheld.

If a VM is configured with assigned devices which don't have reserved
regions, this interface is invoked. In some cases a warning may be
emitted due to a conflict caused by other, non-assigned devices, but it
is purely informational and there is no impact on the assigned devices,
so [Guideline-2] is upheld.

>>>3.6 Libxc/hvmloader: detect and avoid conflicts
----
libxc needs to detect reserved region conflicts with:
	- guest RAM
	- monolithic PCI MMIO hole

hvmloader needs to detect reserved region conflicts with:
	- guest RAM
	- PCI MMIO allocation
	- memory allocation
	- some e820 entries like ACPI Opregion, etc.
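
In both components the core test is the same simple range-overlap
check against each reserved region; a sketch (whether ranges are in
frames or bytes is a detail left open here):

/* Sketch: inclusive-range overlap test applied by libxc and
 * hvmloader against each reserved region. */
static inline int ranges_overlap(uint64_t s1, uint64_t e1,
                                 uint64_t s2, uint64_t e2)
{
    return s1 <= e2 && s2 <= e1;
}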

When a conflict is detected, libxc/hvmloader first try to relocate the
conflicting gfn resources to avoid it; a warning is emitted when such
relocation fails. The relocation policy is straightforward for most
resources; however, there remains a major design tradeoff for guest
RAM, regarding the handoff between libxc and hvmloader...

In the current implementation, guest RAM is contiguous in gfn space,
with at most two trunks: lowmem (<4G) and highmem (>4G), which are
passed to hvmloader through hvm_info. Relocating guest RAM to avoid
conflicts with reserved regions creates sparse memory trunks, and
introducing such a sparse structure into hvm_info is not considered an
extensible approach.

Several other options have been discussed so far:

a) Duplicate the same relocation algorithm within the libxc domain
builder (when populating the physmap) and hvmloader (when creating the
e820)
  - Pros:
	* no interface/structure change
	* anyway hvmloader still needs to handle reserved regions
  - Cons:
	* duplication is not good

b) Pass the sparse layout through Xenstore
  (not fully thought through; input from toolstack maintainers needed)

c) Utilize the XENMEM_{set,}_memory_map hypercall pair, with libxc
setting and hvmloader getting the map; an extension is required to
allow HVM guests to invoke it (a sketch follows below).
  - Pros:
	* centralized ownership in libxc; flexible for extension
  - Cons:
	* limits entries to E820MAX (should be fine)
	* hvmloader e820 construction may become more complex, given
two predefined tables (reserved_regions, memory_map)
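
A sketch of the libxc side under option c), assuming
xc_domain_set_memory_map() keeps its current libxc signature and the
hypercall is extended to permit HVM use:

/* Sketch: libxc records the (possibly sparse) guest RAM layout;
 * hvmloader would read it back via XENMEM_memory_map. The layout
 * construction shown here is illustrative. */
static int push_guest_layout(xc_interface *xch, uint32_t domid,
                             uint64_t lowmem_end)
{
    struct e820entry map[E820MAX];
    unsigned int nr = 0;

    map[nr].addr = 0;
    map[nr].size = lowmem_end;       /* first RAM trunk below 4G */
    map[nr].type = E820_RAM;
    nr++;
    /* ... further RAM trunks, leaving holes for reserved regions ... */

    return xc_domain_set_memory_map(xch, domid, map, nr);
}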

********Inputs are required to find a good option here*********

>>>3.7 hvmloader: reserve 'reserved regions' in guest E820
----
If no conflict is detected, hvmloader needs to mark the reserved
regions as E820_RESERVED in the guest E820 table, so that the guest OS
is aware of them (and thus avoids problematic actions, e.g. when
re-allocating PCI MMIO).
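
A sketch of that step in hvmloader's e820 construction; 'struct rdm'
and the helper name are illustrative:

/* Sketch: append each reserved region to the guest E820 being
 * built by hvmloader. */
struct rdm { uint64_t base, end; };            /* inclusive range */

static unsigned int add_reserved_e820(struct e820entry *e820,
                                      unsigned int nr,
                                      const struct rdm *rdm,
                                      unsigned int nr_rdm)
{
    unsigned int i;

    for ( i = 0; i < nr_rdm; i++, nr++ )
    {
        e820[nr].addr = rdm[i].base;
        e820[nr].size = rdm[i].end - rdm[i].base + 1;
        e820[nr].type = E820_RESERVED;
    }

    return nr;
}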

>>>3.8 Xen: Handle devices sharing reserved regions
----
Per the VT-d spec, it's possible for two devices to share the same
reserved region. Though we haven't seen such an example in reality, the
hypervisor needs to detect and handle this scenario; otherwise a
vulnerability may exist if the two devices are assigned to different
VMs (a malicious VM could program its assigned device to clobber the
shared region and thereby disrupt another VM's device).

Ideally all devices sharing a reserved region should be assigned to a
single VM. However, this can't be achieved solely in the hypervisor
without reworking the current device assignment interface: assignment
is managed by the toolstack, so it would require exposing the sharing
information to userspace and then extending the toolstack to manage
assignment of such devices as a bundle.

Given that the problem is so far only theoretical, we propose not to
support this scenario, i.e. the hypervisor fails the assignment if the
target device happens to share a reserved region with another device,
following [Guideline-4] to keep things simple.
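
A sketch of that check on the assignment path; the iteration helpers
are hypothetical, standing in for however Xen associates parsed RMRR
units with device scopes:

/* Sketch: refuse assignment if any reserved region of the device
 * is shared with another device. */
static int check_shared_rmrr(const struct pci_dev *pdev)
{
    const struct acpi_rmrr_unit *rmrr;

    for_each_rmrr_of_device ( rmrr, pdev )     /* hypothetical */
        if ( rmrr_device_count(rmrr) > 1 )     /* hypothetical */
            return -EXDEV; /* shared region: refuse, per [Guideline-4] */

    return 0;
}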



4. Plan
=====================================================================
We're seeking an incremental approach, splitting the above tasks into
two stages, each moving a step forward without causing regressions.
This benefits people who want to use device assignment early, and also
helps new developers ramp up toward a final sane solution.

4.1 Stage-1: hypervisor hardening
----
  [Tasks]
	1) Set up RMRR identity mappings in the p2m layer, with
conflict detection
	2) Add a boot option for the fail/warn policy
	3) Remove the USB hack
	4) Detect and fail device assignment with shared reserved regions

  [Enhancements]
	* fixes [Issue-1] and [Issue-3]
	* partially fixes [Issue-2], with limitations:
		- without userspace relocation there is a larger chance
of seeing conflicts
		- without reservations in the guest e820, the guest OS
may allocate a reserved pfn when re-enumerating PCI resources

  [Regressions]
	* devices which could previously be assigned successfully may
now fail due to conflict detection. However, this is not a regression
per se, and the user can change the policy to 'warn' if required.

4.2 Stage-2: libxc/hvmloader hardening
----
  [Tasks]
	5) Introduce a new interface to expose reserved region information
	6) Detect and avoid reserved region conflicts in libxc
	7) Pass the libxc guest RAM layout to hvmloader
	8) Detect and avoid reserved region conflicts in hvmloader
	9) Reserve 'reserved regions' in the guest E820 in hvmloader

  [Enhancements]
	* completely fixes [Issue-2]

  [Regression]
	* n/a

Thanks,
Kevin


* Re: (v2) Design proposal for RMRR fix
From: Tian, Kevin @ 2015-01-08  0:43 UTC
  To: Tian, Kevin, Jan Beulich, Chen, Tiejun, ian.campbell, wei.liu2,
	ian.jackson, stefano.stabellini, Zhang, Yang Z, xen-devel,
	konrad.wilk, tim

Ping, in case this mail got buried over the long holiday. :-)


* Re: (v2) Design proposal for RMRR fix
From: Tim Deegan @ 2015-01-08 12:32 UTC
  To: Tian, Kevin
  Cc: wei.liu2, ian.campbell, stefano.stabellini, ian.jackson,
	xen-devel, Jan Beulich, Zhang, Yang Z, Chen, Tiejun

Hi Kevin,

Thanks for sending out this design document.  I think Jan will have
the most to say about this.  Looking just at the hypervisor side of
things, and leaving the tools design to others...

At 11:23 +0000 on 26 Dec (1419589382), Tian, Kevin wrote:
> c) in the Xen hypervisor, where the device is actually assigned, it is
> reasonable to fail upon a conflict. But due to the same USB controller
> consideration, we might sometimes want assignment to succeed with only
> warnings.

Can you explain more concretely why we would want to allow assignment
when the RMRR setup fails?  It seems like the device's use of the RMRR
will (at least) corrupt the OS and (possibly) make the device itself
fail.

If we do need to allow this, it should be configured for a particular
device, rather than just disabling the safety checks for all devices at
once.

> >>>3.4 Xen: setup RMRR identity mapping
> ----
> Regardless of whether userspace has detected a conflict, the Xen
> hypervisor always needs to detect conflicts itself when setting up
> identity mappings for reserved gfn regions, following the policy
> defined above.
> 
> Identity mappings should really be handled at the general p2m layer,
> so that the same r/w permissions apply equally to the CPU and DMA
> access paths, regardless of whether EPT is shared with the IOMMU.

Agreed!

> >>>3.8 Xen: Handle devices sharing reserved regions
> ----
> Per the VT-d spec, it's possible for two devices to share the same
> reserved region. Though we haven't seen such an example in reality,
> the hypervisor needs to detect and handle this scenario; otherwise a
> vulnerability may exist if the two devices are assigned to different
> VMs (a malicious VM could program its assigned device to clobber the
> shared region and thereby disrupt another VM's device).
> 
> Ideally all devices sharing a reserved region should be assigned to a
> single VM. However, this can't be achieved solely in the hypervisor
> without reworking the current device assignment interface: assignment
> is managed by the toolstack, so it would require exposing the sharing
> information to userspace and then extending the toolstack to manage
> assignment of such devices as a bundle.

Xen can at least enforce that devices that share an RMRR are not assigned
to different domains.  Is the problem here that "unassigned" devices
are actually assigned to dom0?

> Given that the problem is so far only theoretical, we propose not to
> support this scenario, i.e. the hypervisor fails the assignment if the
> target device happens to share a reserved region with another device,
> following [Guideline-4] to keep things simple.

How confident are you that this doesn't happen in real servers?  What
sort of range of servers have you been able to check?

Cheers,

Tim.


* Re: (v2) Design proposal for RMRR fix
From: George Dunlap @ 2015-01-08 12:49 UTC
  To: Tian, Kevin
  Cc: wei.liu2, ian.campbell, stefano.stabellini, tim, ian.jackson,
	xen-devel, Jan Beulich, Zhang, Yang Z, Chen, Tiejun

On Fri, Dec 26, 2014 at 11:23 AM, Tian, Kevin <kevin.tian@intel.com> wrote:
> (Please note that parts of this proposal differ from the last version
> sent out, after further discussion. I have tried to summarize the
> earlier discussions and explain why we chose a different approach.
> Apologies if I have missed any open questions or conclusions from the
> past months; pointing them out would be much appreciated. :-)

Kevin, thanks for this document.  A few questions / comments below:

> For proper functioning of these legacy reserved memory usages, when
> system software enables DMA remapping, the translation structures for
> the respective devices are expected to be set up to provide identity
> mapping for the specified reserved memory regions with read and write
> permissions. The system software is also responsible for ensuring
> that any input addresses used for device accesses to OS-visible memory
> do not overlap with the reserved system memory address ranges.

Just to be clear: "identity mapping" here means that gpfn == mfn, in
both the p2m and IOMMU.  (I suppose it might mean vfn == gpfn as well,
but that wouldn't really concern us, as the guest deals with virtual
mappings.)

> However, the current RMRR implementation in Xen only partially achieves
> a) and completely misses b), which causes several issues:
>
> --
> [Issue-1] The identity mapping is not set up in the shared-EPT case, so
> a device with an RMRR may not function correctly when assigned to a VM.
>
> This was the original problem we found when assigning the IGD on a BDW
> platform, which triggered the whole long discussion of the past months.
>
> --
> [Issue-2] Lacking goal b), existing device assignment with RMRRs works
> only when the reserved regions happen not to conflict with other valid
> allocations in the guest physical address space. This can lead to
> unpredictable failures in various deployments, due to undetected
> conflicts caused by differences in platform and VM configuration.
>
> One example is USB controller assignment. It is already a known problem
> on some platforms that USB reserved regions conflict with the guest
> BIOS region. However, since the host BIOS only touches those reserved
> regions for legacy keyboard emulation in the early Dom0 boot phase, a
> trick was added in Xen to bypass RMRR handling for USB controllers.
>
> --
> [Issue-3] Devices may share the same reserved region, but there is no
> logic to handle this in Xen. Assigning such devices to different VMs
> could lead to a security concern.

So to summarize:

When assigning a device to a guest, the device's associated RMRRs must
be identity mapped in the p2m and IOMMU.

At the moment, we don't have a reliable way to reclaim a particular
gpfn space from a guest once it's been used for other purposes (e.g.,
guest RAM or other MMIO ranges).

So, we need to make sure at guest creation time that we reserve any
RMRR ranges for devices we may wish to assign, and make sure that the
RMRR in gpfn space is empty.

For statically-assigned devices, we know at guest creation time which
RMRRs may be required.  But if we want to dynamically add devices, we
must figure out ahead of time which devices we *might* add, and
reserve the RMRRs at boot time.

As a separate problem, two different devices may share the same RMRR,
meaning that if we assign these devices to two different VMs, the RMRR
may be mapped into the gpfn space of two different VMs.  This may well
be a security issue, so we need to handle it carefully.

> 3. High Level Design
> =====================================================================
>
> To achieve the aforementioned two goals, major enhancements are
> required across the Xen hypervisor, libxc, and hvmloader to close the
> gap in goal b), i.e. handling possible conflicts in gfn space. Fixing
> goal a) is straightforward.
>
>>>>3.1 Guidelines
> ----
> There are several guidelines considered in the design:
>
> --
> [Guideline-1] No regression in a VM w/o statically-assigned devices
>
>   If a VM isn't configured with assigned devices at creation, new
> conflict detection logic shouldn't block VM boot progress (it should
> either be skipped, or just emit a warning)
>
> --
> [Guideline-2] No regression on devices which do not have RMRR reported
>
>   If a VM is assigned a device which doesn't have an RMRR reported,
> whether statically or dynamically assigned, new conflict detection
> logic shouldn't fail the assignment request for that device.
>
> --
> [Guideline-3] New interface should be kept as common as possible
>
>   A new interface will be introduced to expose reserved regions to
> user space. Though RMRR is VT-d-specific terminology, the interface
> design should be generic, i.e. it should support any feature that
> requires the hypervisor to force-reserve one or more gfn ranges.
>
> --
> [Guideline-4] Keep changes simple
>
>   RMRR reserved regions should be avoided or limited by platform
> designers, per the VT-d specification. Per our observations, there are
> only a few reported examples (USB, IGD) on real platforms. So we need
> to balance code complexity against usage limitations. If a limitation
> only affects niche scenarios, we'd rather declare it unsupported to
> keep the changes simple for now.

This is an excellent set of principles -- thanks.

>
>>>>3.2 Conflict detection
> ----
> Conflicts must be detected in several places as far as gfns are
> concerned (how to handle a conflict is discussed in 3.3):
>
> 1) libxc domain builder
>   Here the coarse-grained gfn layout is created, including two
> contiguous guest RAM trunks (lowmem and/or highmem) and MMIO holes
> (VGA, PCI), which are passed to hvmloader for later fine-grained
> manipulation. Guest RAM trunks are populated with valid translations
> in the underlying p2m layer. Device reserved regions must be detected
> against that layout.
>
> 2) Xen hypervisor device assignment
>   Device assignment can happen either at VM creation time (after the
> domain builder runs), or at any time through hotplug after the VM has
> booted. Regardless of how userspace handles conflicts, the Xen
> hypervisor always performs a final, conservative check when setting up
> the identity mapping:
>         * gfn space unoccupied:
>                 -> insert identity mapping; no conflict
>         * gfn space already occupied by an identity mapping:
>                 -> do nothing; no conflict
>         * gfn space already occupied by another mapping:
>                 -> conflict detected
>
> 3) hvmloader
>   Hvmloader allocates other resources (ACPI, PCI MMIO, etc.) and
> internal data structures in gfn space, and it creates the final guest
> e820. So hvmloader also needs to detect conflicts when performing
> those operations. If there is no conflict, hvmloader reserves those
> regions in the guest e820 to make the guest OS aware of them.

I think this can be summarized a bit more clearly by what each bit of
code needs to actually do:

1. libxc
 - RMRR areas must be left unpopulated in the gfn space at build time.

2. Xen
 - When a device with RMRRs is assigned, Xen must make an
identity-mapping of the appropriate RMRR ranges.

3. hvmloader
 - hvmloader must report in the e820 map the RMRRs of all devices which
may ever be assigned to the guest
 - when placing devices in MMIO space, hvmloader must avoid placing
MMIO devices over RMRR regions which are / may be assigned to a guest.

One component I think may be missing here -- qemu-traditional is very
tolerant with regards to the gpfn space; but qemu-upstream expects to
know the layout of guest gpfn space, and may crash if its idea of gpfn
space doesn't match Xen's idea.  Unfortunately, however, there is not
a very close link between these two at the moment; IIUC at the moment
this is limited to the domain builder telling qemu how big the lowmem
PCI hole will be.  Any solution which marks GPFN space as "non-memory"
needs to make sure this is communicated to qemu-upstream as well.

>>>>3.3 Policies
> ----
> An intuitive approach is to fail immediately upon a conflict; however,
> that is not flexible enough for the differing requirements:
>
> a) it's not appropriate to fail the libxc domain builder just because
> of such a conflict. We still want the guest to boot even without the
> assigned device;
>
> b) whether to fail in hvmloader depends on several factors. If the
> check is for hotplug preparation, a warning is an acceptable option,
> since the assignment may never actually happen. Or if it's a USB
> controller but the user doesn't care about legacy keyboard emulation,
> it's also OK to move forward despite a conflict;
>
> c) in the Xen hypervisor, where the device is actually assigned, it is
> reasonable to fail upon a conflict. But due to the same USB controller
> consideration, we might sometimes want assignment to succeed with only
> warnings.
>
> Given the complexity of addressing all of the above flexibility (user
> preferences, per-device policy), which would require inventing quite a
> few parameters to pass among the different components, and given that
> failures should be rare (except for some USB cases) with proactive
> avoidance in userspace, we'd like to propose the following simplified
> policy, per [Guideline-4]:
>
> - 'warn' on conflicts in user space (libxc and hvmloader)
> - a boot option to choose 'fail' or 'warn' upon conflict in the Xen
> device assignment path, defaulting to 'fail' (the user can set it to
> 'warn' for the USB case)
>
> This provides a relaxed user space policy, with the hypervisor as the
> final judge. Its particular merit is to simplify the later interface
> design and hotplug support, without breaking [Guideline-1/2] even when
> all possible reserved regions are exposed.
>
>     ******agreement is first required on the above policy******

So the important part of policy is what the user experience is.  I
think we can assume that all device assignment will happen through
libxl; so from a user interface perspective we mainly want to be
thinking about the xl / libxl interface.

How the various sub-components react if something unexpected happens
is then just a matter of robust system design.

So first of all, I think RMRR reservations should be specified at
domain creation time.  If a user tries to assign a device with RMRRs
to a VM that has not reserved those ranges at creation time, the
assignment should fail.

The main place this checking should happen is in the toolstack
(libxl).  The toolstack can then give a sensible error message to the
user, which may include things they can do to fix the problem.

In the case of statically-assigned devices, the toolstack can look at
the RMRRs required and make sure to reserve them at domain creation
time.

For dynamically-assigned devices, I think there should be an option to
make the guest's memory layout mirror the host: this would include the
PCI hole and all RMRR ranges.  This would be off by default.

We could imagine a way of specifying "I may want to assign this pool
of devices to this VM", or to manually specify RMRR ranges which
should be reserved, but I think that's a bit more advanced than we
really need right now.
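
(As an illustration only, a hypothetical xl config syntax for such
reservations might look like:

    # reserve all of the host's reserved regions in the guest layout
    rdm = "strategy=host"

    # or reserve explicit ranges for devices we might hotplug later
    rdm = "ranges=0xab80a000-0xab81dfff"

where 'rdm' and its values are made-up names, not a proposal.)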

>>>>3.5 New interface: expose reserved region information

It's not clear to me who this new interface is being exposed to.

It seems to me what we want is for the toolstack to figure out, at
guest creation time, what RMRRs should be reserved for this VM, and
probably put that information in xenstore somewhere, where it's
available to hvmloader.  I assume the RMRR information is already
available through sysfs in dom0?

One question: where are these RMRRs typically located in memory?  Are
they normally up in the MMIO region?  Or can they occur anywhere (even
in really low areas, say, under 1GiB)?

If RMRRs almost always happen up above 2G, for example, then a simple
solution that wouldn't require too much work would be to make sure
that the PCI MMIO hole we specify to libxc and to qemu-upstream is big
enough to include all RMRRs.  That would satisfy the libxc and qemu
requirements.

If we then store specific RMRRs we want included in xenstore,
hvmloader can put them in the e820 map, and that would satisfy the
hvmloader requirement.

Then when we assign the device, those ranges will be already unused in
the p2m, and (if I understand correctly) Xen will already map the RMRR
ranges 1-1 upon device assignment.

What do you think?

If making the RMRRs fit inside the guest MMIO hole is not practical
(for example, if the ranges occur very low in memory), then we'll have
to come up with a way to specify, both to libxc and to qemu, where
these  holes in memory are.

>>>>3.8 Xen: Handle devices sharing reserved regions
> ----
> Per the VT-d spec, it's possible for two devices to share the same
> reserved region. Though we haven't seen such an example in reality,
> the hypervisor needs to detect and handle this scenario; otherwise a
> vulnerability may exist if the two devices are assigned to different
> VMs (a malicious VM could program its assigned device to clobber the
> shared region and thereby disrupt another VM's device).
>
> Ideally all devices sharing a reserved region should be assigned to a
> single VM. However, this can't be achieved solely in the hypervisor
> without reworking the current device assignment interface: assignment
> is managed by the toolstack, so it would require exposing the sharing
> information to userspace and then extending the toolstack to manage
> assignment of such devices as a bundle.
>
> Given that the problem is so far only theoretical, we propose not to
> support this scenario, i.e. the hypervisor fails the assignment if the
> target device happens to share a reserved region with another device,
> following [Guideline-4] to keep things simple.

I think denying it by default, first in the toolstack and as a
fall-back in the hypervisor, is a good idea.

It shouldn't be too difficult, however, to add an option to override
this.  We have a lot of individual users who use Xen for device
pass-through; such advanced users should be allowed to "shoot
themselves in the foot" if they want to.

Thoughts?

 -George


* Re: (v2) Design proposal for RMRR fix
From: George Dunlap @ 2015-01-08 12:54 UTC
  To: Tian, Kevin
  Cc: wei.liu2, ian.campbell, stefano.stabellini, tim, ian.jackson,
	xen-devel, Jan Beulich, Zhang, Yang Z, Chen, Tiejun

On Thu, Jan 8, 2015 at 12:49 PM, George Dunlap
<George.Dunlap@eu.citrix.com> wrote:
> If RMRRs almost always happen up above 2G, for example, then a simple
> solution that wouldn't require too much work would be to make sure
> that the PCI MMIO hole we specify to libxc and to qemu-upstream is big
> enough to include all RMRRs.  That would satisfy the libxc and qemu
> requirements.
>
> If we then store specific RMRRs we want included in xenstore,
> hvmloader can put them in the e820 map, and that would satisfy the
> hvmloader requirement.

An alternate thing to do here would be to "properly" fix the
qemu-upstream problem, by making a way for hvmloader to communicate
changes in the gpfn layout to qemu.

Then hvmloader could do the work of moving memory under RMRRs to
higher memory; and libxc wouldn't need to be involved at all.

I think it would also fix our long-standing issues with assigning PCI
devices to qemu-upstream guests, which up until now have only been
worked around.

 -George


* Re: (v2) Design proposal for RMRR fix
From: Jan Beulich @ 2015-01-08 12:58 UTC (permalink / raw)
  To: George Dunlap, Kevin Tian
  Cc: wei.liu2, ian.campbell, stefano.stabellini, tim, ian.jackson,
	xen-devel, Yang Z Zhang, Tiejun Chen

>>> On 08.01.15 at 13:49, <George.Dunlap@eu.citrix.com> wrote:
> One question: where are these RMRRs typically located in memory?  Are
> they normally up in the MMIO region?  Or can they occur anywhere (even
> in really low areas, say, under 1GiB)?

They would typically sit in the MMIO hole or below 1Mb; that latter case
is particularly problematic as it might conflict with what we want to put
there (BIOS etc).

Jan

* Re: (v2) Design proposal for RMRR fix
  2015-01-08 12:54   ` George Dunlap
@ 2015-01-08 13:00     ` Jan Beulich
  2015-01-08 15:15       ` George Dunlap
  2015-01-09  2:43     ` Tian, Kevin
  1 sibling, 1 reply; 139+ messages in thread
From: Jan Beulich @ 2015-01-08 13:00 UTC (permalink / raw)
  To: George Dunlap, Kevin Tian
  Cc: wei.liu2, ian.campbell, stefano.stabellini, tim, ian.jackson,
	xen-devel, Yang Z Zhang, Tiejun Chen

>>> On 08.01.15 at 13:54, <George.Dunlap@eu.citrix.com> wrote:
> On Thu, Jan 8, 2015 at 12:49 PM, George Dunlap
> <George.Dunlap@eu.citrix.com> wrote:
>> If RMRRs almost always happen up above 2G, for example, then a simple
>> solution that wouldn't require too much work would be to make sure
>> that the PCI MMIO hole we specify to libxc and to qemu-upstream is big
>> enough to include all RMRRs.  That would satisfy the libxc and qemu
>> requirements.
>>
>> If we then store specific RMRRs we want included in xenstore,
>> hvmloader can put them in the e820 map, and that would satisfy the
>> hvmloader requirement.
> 
> An alternate thing to do here would be to "properly" fix the
> qemu-upstream problem, by making a way for hvmloader to communicate
> changes in the gpfn layout to qemu.
> 
> Then hvmloader could do the work of moving memory under RMRRs to
> higher memory; and libxc wouldn't need to be involved at all.

I don't think avoiding libxc involvement is possible: Once a certain
range of memory has been determined to need reserving (e.g.
due to a statically assigned device), attempts to populate the
respective GFNs with RAM would (ought to) fail.

Jan

* Re: (v2) Design proposal for RMRR fix
  2014-12-26 11:23 (v2) Design proposal for RMRR fix Tian, Kevin
                   ` (2 preceding siblings ...)
  2015-01-08 12:49 ` George Dunlap
@ 2015-01-08 13:54 ` Jan Beulich
  2015-01-08 15:59   ` George Dunlap
  2015-01-09  2:27   ` Tian, Kevin
  3 siblings, 2 replies; 139+ messages in thread
From: Jan Beulich @ 2015-01-08 13:54 UTC (permalink / raw)
  To: Kevin Tian
  Cc: wei.liu2, ian.campbell, stefano.stabellini, tim, ian.jackson,
	xen-devel, Yang Z Zhang, Tiejun Chen

>>> On 26.12.14 at 12:23, <kevin.tian@intel.com> wrote:
> [Issue-2] Being lacking of goal-b), existing device assignment with 
> RMRR works only when reserved regions happen to not conflicting with
> other valid allocations in the guest physical address space. This could
> lead to unpredicted failures in various deployments, due to non-detected
> conflictions caused by platform difference and VM configuration 
> difference.
> 
> One example is about USB controller assignment. It's already identified
> as a problem on some platforms, that USB reserved regions conflict with
> guest BIOS region. However, being the fact that host BIOS only touches 
> those reserved regions for legacy keyboard emulation at early Dom0 boot 
> phase, a trick is added in Xen to bypass RMRR handling for usb 
> controllers. 

s/trick/hack/ - after all, doing this is not safe. Plus if these regions
really were needed only for early boot legacy keyboard emulation,
they wouldn't need expressing as RMRR afaict, or if that really was
a requirement a suitable flag should be added to tell the OS that
once a proper driver is in place for the device, the RMRR won't be
needed anymore. In any event - the hack needs to go away.

> [Issue-3] devices may share same reserved regions, however
> there is no logic to handle this in Xen. Assigning such devices to 
> different VMs could lead to secure concern

s/could lead to/is a/

> [Guideline-3] New interface should be kept as common as possible
> 
>   New interface will be introduced to expose reserved regions to the
> user space. Though RMRR is a VT-d specific terminology, the interface
> design should be generic enough, i.e. to support a function which 
> allows hypervisor to force reserving one or more gfn ranges. 

s/hypervisor/user space/ ? Or else I don't see the connection between
the new interface and the enforcement of the reserved ranges.

> 3) hvmloader
>   Hvmloader allocates other resources (ACPI, PCI MMIO, etc.) and 
> internal data structures in gfn space, and it creates the final guest 
> e820. So hvmloader also needs to detect conflictions when conducting 
> those operations. If there's no confliction, hvmloader will reserve 
> those regions in guest e820 to let guest OS aware.

Ideally, rather than detecting conflicts, hvmloader would just
consume what libxc set up. Obviously that would require awareness
in libxc of things it currently doesn't care about (like fitting PCI BARs
into the MMIO hole, enlarging it as necessary). I admit that this may
end up being difficult to implement. Another alternative would be to
have libxc only populate a limited part of RAM (for hvmloader to be
loadable), and have hvmloader do the bulk of the populating.
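
To illustrate the libxc side of that: a rough sketch only, where the
one-pfn-at-a-time loop and the is_reserved() callback are made up for
brevity (xc_domain_populate_physmap_exact() is the existing libxc
call):

#include <stdint.h>
#include <xenctrl.h>

/* Populate guest pfns [0, nr_pages), leaving holes wherever a reserved
 * (e.g. RMRR) range sits.  A real domain builder would batch extents,
 * but the conflict avoidance is the same. */
static int populate_avoiding_reserved(xc_interface *xch, uint32_t domid,
                                      xen_pfn_t nr_pages,
                                      int (*is_reserved)(xen_pfn_t pfn))
{
    for (xen_pfn_t pfn = 0; pfn < nr_pages; pfn++) {
        if (is_reserved(pfn))
            continue;                   /* leave this gfn unpopulated */
        if (xc_domain_populate_physmap_exact(xch, domid, 1, 0, 0, &pfn))
            return -1;
    }
    return 0;
}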

>>>>3.3 Policies
> ----
> An intuitive thought is to fail immediately upon a confliction, however 
> it is not flexible regarding to different requirements:
> 
> a) it's not appropriate to fail libxc domain builder just because such
> confliction. We still want the guest to boot even w/o assigned device;

I don't think that's right (and I believe this was discussed before):
When device assignment fails, VM creation should fail too. It is the
responsibility of the host admin in that case to remove some or all
of the to be assigned devices from the guest config.

> b) whether to fail in hvmloader has several dependencies. If it's
> to check for hotplug preparation, warning is also an acceptable option
> since assignment may not happen at all. Or if it's a USB controller 
> but user doesn't care about legacy keyboard emulation, it's also OK to 
> move forward upon a confliction;

Again assuming that RMRRs for USB devices are _only_ used for
legacy keyboard emulation, which may or may not be true.

> c) in Xen hypervisor it is reasonable to fail upon confliction, where
> device is actually assigned. But due to the same requirement on USB
> controller, sometimes we might want it succeed just w/ warnings.

But only when asked to do so by the host admin.

> Regarding to the complexity of addressing all above flexibilities (user
> preferences, per-device), which requires inventing quite some parameters
> passed among different components, and regarding to the fact that 
> failures would be rare (except some USB) with proactive avoidance  
> in userspace, we'd like to propose below simplified policy following 
> [Guideline-4]:
> 
> - 'warn' conflictions in user space (libxc and hvmloader)
> - a boot option to specify 'fail' or 'warn' confliction in Xen device
> assignment path, default to 'fail' (user can set to 'warn' for USB case)

I think someone else (Tim?) already said this: Such a "warn" option
would unlikely to be desirable as a global one, affecting all devices,
but should rather be a flag settable on particular devices.

>>>>3.5 New interface: expose reserved region information
> ----
> As explained in [Guideline-3], we'd like to keep this interface general 
> enough, as a common interface for hypervisor to force reserving gfn 
> ranges, due to various reasons (RMRR is a client of this feature).
> 
> One design open was discussed back-and-forth accordingly, regarding to
> whether the interface should return regions reported for all devices
> in the platform (report-all), or selectively return regions only 
> belonging to assigned devices (report-sel). report-sel can be built on
> top of report-all, with extra work to help hypervisor generate filtered 
> regions (e.g. introduce new interface or make device assignment happened 
> before domain builder)
> 
> We propose report-all as the simple solution (different from last sent
> version which used report-sel), regarding to the below facts:
> 
>   - 'warn' policy in user space makes report-all not harmful
>   - 'report-all' still means only a few entries in reality:
>     * RMRR reserved regions should be avoided or limited by platform
> designers, per VT-d specification;
>     * RMRR reserved regions are only a few on real platforms, per our
> current observations;

Few yes, but in the IGD example you gave the region is quite large,
and it would be fairly odd to have all guests have a strange, large
hole in their address spaces. Furthermore remember that these
holes vary from machine to machine, so a migrateable guest would
needlessly end up having a hole potentially not helping subsequent
hotplug at all.

> In this way, there are two situations libxc domain builder may request 
> to query reserved region information w/ same interface:
> 
> a) if any statically-assigned devices, and/or
> b) if a new parameter is specified, asking for hotplug preparation
> 	('rdm_check' or 'prepare_hotplug'?)
> 
> the 1st invocation of this interface will save all reported reserved
> regions under domain structure, and later invocation (e.g. from 
> hvmloader) gets saved content.

Why would the reserved regions need attaching to the domain
structure? The combination of (to be) assigned devices and
global RMRR list always allow reproducing the intended set of
regions without any extra storage.

>>>>3.6 Libxc/hvmloader: detect and avoid conflictions
> ----
> libxc needs to detect reserved region conflictions with:
> 	- guest RAM
> 	- monolithic PCI MMIO hole
> 
> hvmloader needs to detect reserved region confliction with:
> 	- guest RAM
> 	- PCI MMIO allocation
> 	- memory allocation
> 	- some e820 entries like ACPI Opregion, etc.

- BIOS and alike

> There are several other options discussed so far:
> 
> a) Duplicate same relocation algorithm within libxc domain builder 
> (when populating physmap) and hvmloader (when creating e820)
>   - Pros:
> 	* no interface/structure change
> 	* anyway hvmloader still needs to handle reserved regions
>   - Cons:
> 	* duplication is not good
> 
> b) pass sparse information through Xenstore
>   (no much idea. need input from toolstack maintainers)
> 
> c) utilize XENMEM_{set,}_memory_map pair of hypercalls, with libxc to
> set and hvmloader to get. Extension required to allow hvm invoke.
>   - Pros:
> 	* centralized ownership in libxc. flexible for extension
>   - Cons:
> 	* limiting entry to E820MAX (should be fine)
> 	* hvmloader e820 construction may become more complex, given
> two predefined tables (reserved_regions, memory_map)

d) Move down the lowmem RAM/MMIO boundary so that a single,
contiguous chunk of lowmem results, with all other RAM moving up
beyond 4Gb. Of course RMRRs below the 1Mb boundary must not be
considered here, and I think we can reasonably safely assume that
no RMRRs will ever report ranges above 1Mb but below the host
lowmem RAM/MMIO boundary (i.e. we can presumably rest assured
that the lowmem chunk will always be reasonably big).
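
Expressed as arithmetic (standalone sketch with invented names):

#include <stddef.h>
#include <stdint.h>

#define MB(x) ((uint64_t)(x) << 20)

struct range { uint64_t base, end; };

/* Option d): choose the highest lowmem RAM/MMIO boundary that keeps
 * every RMRR above 1MB out of guest lowmem; RAM that no longer fits
 * below the boundary moves above 4GB instead.  Sub-1MB regions are
 * deliberately ignored, per the above. */
static uint64_t lowmem_boundary(const struct range *rmrrs, size_t nr,
                                uint64_t host_boundary)
{
    uint64_t boundary = host_boundary;

    for (size_t i = 0; i < nr; i++) {
        if (rmrrs[i].end < MB(1))
            continue;
        if (rmrrs[i].base < boundary)
            boundary = rmrrs[i].base;
    }
    return boundary;   /* assumed to remain "reasonably big" */
}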

> 4. Plan
> =====================================================================
> We're seeking an incremental way to split above tasks into 2 stages, 
> and in each stage we move forward a step w/o causing regression. Doing
> so can benefit people who want to use device assignment early, and 
> also benefit newbie developer to rampup, toward a final sane solution.
> 
> 4.1 Stage-1: hypervisor hardening
> ----
>   [Tasks]
> 	1) Setup RMRR identity mapping in p2m layer with confliction 
> detection
> 	2) add a boot option for fail/warn policy
> 	3) remove USB hack
> 	4) Detect and fail device assignment w/ shared reserve regions 
> 
>   [Enhancements]
> 	* fix [Issue-1] and [Issue-3]

According to what you wrote earlier, [Issue-3] is not intended to be
fixed, but instead devices sharing the same RMRR(s) are to be
declared unassignable.

> 	* partially fix [Issue-2] with limitations:
> 		- w/o userspace relocation there's larger chance to 
> see conflictions. 
> 		- w/o reserve in guest e820, guest OS may allocate 
> reserved pfn when re-enumerating PCI resource
> 
>   [Regressions]
> 	* devices which can be assigned successfully before may be
> failed now due to confliction detection. However it's not a regression
> per se. and user can change policy to 'warn' if required.  

Avoiding such a (perceived) regression would seem to be possible by
intermixing hypervisor and libxc/hvmloader adjustments.

Jan

* Re: (v2) Design proposal for RMRR fix
  2015-01-08 13:00     ` Jan Beulich
@ 2015-01-08 15:15       ` George Dunlap
  2015-01-08 15:21         ` Jan Beulich
  0 siblings, 1 reply; 139+ messages in thread
From: George Dunlap @ 2015-01-08 15:15 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Kevin Tian, wei.liu2, ian.campbell, stefano.stabellini, tim,
	ian.jackson, xen-devel, Yang Z Zhang, Tiejun Chen

On Thu, Jan 8, 2015 at 1:00 PM, Jan Beulich <JBeulich@suse.com> wrote:
>>>> On 08.01.15 at 13:54, <George.Dunlap@eu.citrix.com> wrote:
>> On Thu, Jan 8, 2015 at 12:49 PM, George Dunlap
>> <George.Dunlap@eu.citrix.com> wrote:
>>> If RMRRs almost always happen up above 2G, for example, then a simple
>>> solution that wouldn't require too much work would be to make sure
>>> that the PCI MMIO hole we specify to libxc and to qemu-upstream is big
>>> enough to include all RMRRs.  That would satisfy the libxc and qemu
>>> requirements.
>>>
>>> If we then store specific RMRRs we want included in xenstore,
>>> hvmloader can put them in the e820 map, and that would satisfy the
>>> hvmloader requirement.
>>
>> An alternate thing to do here would be to "properly" fix the
>> qemu-upstream problem, by making a way for hvmloader to communicate
>> changes in the gpfn layout to qemu.
>>
>> Then hvmloader could do the work of moving memory under RMRRs to
>> higher memory; and libxc wouldn't need to be involved at all.
>
> I don't think avoiding libxc involvement is possible: Once a certain
> range of memory has been determined to need reserving (e.g.
> due to a statically assigned device), attempts to populate the
> respective GFNs with RAM would (ought to) fail.

Is it the case that marking a range as an RMRR in Xen basically only
involves adding a 1-1 mapping in the p2m / IOMMU?

So is it the case that the main reason the "empty RMRR gpfn range in
hvmloader" solution wouldn't work is that for statically-assigned
devices, the devices are assigned to the guest before it boots (and
thus before hvmloader gets to run)?

We could imagine not actually mapping the RMRRs on domain creation,
but allowing hvmloader to move the gpfns around and then ask Xen to do
the 1-1 mapping.

It would obviously be a bit more complicated, but there's also
something satisfying about being able to have all of the gpfn
re-arrangement stuff happening in hvmloader, rather than have to
separately specify the guest p2m layout to three different bits of
code (libxc domain builder, qemu, hvmloader).
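
As a strawman, the hvmloader flow could be as small as this (all three
helpers are invented for illustration; in particular,
map_rmrr_identity() stands in for a new hypercall that does not exist
today):

#include <stdint.h>

extern int relocate_ram(uint64_t base, uint64_t end);       /* hypothetical */
extern int map_rmrr_identity(uint64_t base, uint64_t end);  /* hypothetical */
extern int e820_mark_reserved(uint64_t base, uint64_t end); /* hypothetical */

static int prepare_rmrr(uint64_t base, uint64_t end)
{
    if (relocate_ram(base, end))       /* move any populated gfns away */
        return -1;
    if (map_rmrr_identity(base, end))  /* Xen re-checks emptiness here */
        return -1;
    return e820_mark_reserved(base, end); /* keep the guest OS out too */
}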

 -George

* Re: (v2) Design proposal for RMRR fix
  2015-01-08 15:15       ` George Dunlap
@ 2015-01-08 15:21         ` Jan Beulich
  0 siblings, 0 replies; 139+ messages in thread
From: Jan Beulich @ 2015-01-08 15:21 UTC (permalink / raw)
  To: George Dunlap
  Cc: Kevin Tian, wei.liu2, ian.campbell, stefano.stabellini, tim,
	ian.jackson, xen-devel, Yang Z Zhang, Tiejun Chen

>>> On 08.01.15 at 16:15, <George.Dunlap@eu.citrix.com> wrote:
> On Thu, Jan 8, 2015 at 1:00 PM, Jan Beulich <JBeulich@suse.com> wrote:
>>>>> On 08.01.15 at 13:54, <George.Dunlap@eu.citrix.com> wrote:
>>> On Thu, Jan 8, 2015 at 12:49 PM, George Dunlap
>>> <George.Dunlap@eu.citrix.com> wrote:
>>>> If RMRRs almost always happen up above 2G, for example, then a simple
>>>> solution that wouldn't require too much work would be to make sure
>>>> that the PCI MMIO hole we specify to libxc and to qemu-upstream is big
>>>> enough to include all RMRRs.  That would satisfy the libxc and qemu
>>>> requirements.
>>>>
>>>> If we then store specific RMRRs we want included in xenstore,
>>>> hvmloader can put them in the e820 map, and that would satisfy the
>>>> hvmloader requirement.
>>>
>>> An alternate thing to do here would be to "properly" fix the
>>> qemu-upstream problem, by making a way for hvmloader to communicate
>>> changes in the gpfn layout to qemu.
>>>
>>> Then hvmloader could do the work of moving memory under RMRRs to
>>> higher memory; and libxc wouldn't need to be involved at all.
>>
>> I don't think avoiding libxc involvement is possible: Once a certain
>> range of memory has been determined to need reserving (e.g.
>> due to a statically assigned device), attempts to populate the
>> respective GFNs with RAM would (ought to) fail.
> 
> Is it the case that marking a range as an RMRR in Xen basically only
> involves adding a 1-1 mapping in the p2m / IOMMU?
> 
> So is it the case that the main reason the "empty RMRR gpfn range in
> hvmloader" solution wouldn't work is that for statically-assigned
> devices, the devices are assigned to the guest before it boots (and
> thus before hvmloader gets to run)?
> 
> We could imagine not actually mapping the RMRRs on domain creation,
> but allowing hvmloader to move the gpfns around and then ask Xen to do
> the 1-1 mapping.

That might be an option, but would be safe/secure only as long as
nothing gets mapped in these regions before, which again would
require libxc changes.

Jan

* Re: (v2) Design proposal for RMRR fix
  2015-01-08 13:54 ` Jan Beulich
@ 2015-01-08 15:59   ` George Dunlap
  2015-01-08 16:10     ` Jan Beulich
  2015-01-09  2:49     ` Tian, Kevin
  2015-01-09  2:27   ` Tian, Kevin
  1 sibling, 2 replies; 139+ messages in thread
From: George Dunlap @ 2015-01-08 15:59 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Kevin Tian, wei.liu2, ian.campbell, stefano.stabellini, tim,
	ian.jackson, xen-devel, Yang Z Zhang, Tiejun Chen

On Thu, Jan 8, 2015 at 1:54 PM, Jan Beulich <JBeulich@suse.com> wrote:
> Ideally, rather than detecting conflicts, hvmloader would just
> consume what libxc set up. Obviously that would require awareness
> in libxc of things it currently doesn't care about (like fitting PCI BARs
> into the MMIO hole, enlarging it as necessary). I admit that this may
> end up being difficult to implement.

Yes, the idea of moving all that logic into libxc just seems not very
nice; particularly as, if I remember correctly, the domain builder
code cannot access xenstore, since it would introduce a circular
dependency.  (I might be remembering that incorrectly.)

> Another alternative would be to
> have libxc only populate a limited part of RAM (for hvmloader to be
> loadable), and have hvmloader do the bulk of the populating.

Ah, that's an interesting idea.  It seems like it might make
development of domain-building features quite a bit more complicated
though.  Worth having a think about.

>>>>>3.3 Policies
>> ----
>> An intuitive thought is to fail immediately upon a confliction, however
>> it is not flexible regarding to different requirements:
>>
>> a) it's not appropriate to fail libxc domain builder just because such
>> confliction. We still want the guest to boot even w/o assigned device;
>
> I don't think that's right (and I believe this was discussed before):
> When device assignment fails, VM creation should fail too. It is the
> responsibility of the host admin in that case to remove some or all
> of the to be assigned devices from the guest config.

Yes; basically, if we get to domain build time and we haven't reserved
the RMRRs, then it's a bug in libxl (since it's apparently told Xen
one thing and libxc another thing).  Having libxc fail in that case is
a perfectly sensible thing to do.

>> We propose report-all as the simple solution (different from last sent
>> version which used report-sel), regarding to the below facts:
>>
>>   - 'warn' policy in user space makes report-all not harmful
>>   - 'report-all' still means only a few entries in reality:
>>     * RMRR reserved regions should be avoided or limited by platform
>> designers, per VT-d specification;
>>     * RMRR reserved regions are only a few on real platforms, per our
>> current observations;
>
> Few yes, but in the IGD example you gave the region is quite large,
> and it would be fairly odd to have all guests have a strange, large
> hole in their address spaces. Furthermore remember that these
> holes vary from machine to machine, so a migrateable guest would
> needlessly end up having a hole potentially not helping subsequent
> hotplug at all.

Yes, I think that by default VMs should have no RMRRs set up on domain
creation.  The only way to get RMRRs in your address space should be
to opt-in at domain creation time (either by statically assigning
devices, or by requesting your memory layout to mirror the host's).


>> In this way, there are two situations libxc domain builder may request
>> to query reserved region information w/ same interface:
>>
>> a) if any statically-assigned devices, and/or
>> b) if a new parameter is specified, asking for hotplug preparation
>>       ('rdm_check' or 'prepare_hotplug'?)
>>
>> the 1st invocation of this interface will save all reported reserved
>> regions under domain structure, and later invocation (e.g. from
>> hvmloader) gets saved content.
>
> Why would the reserved regions need attaching to the domain
> structure? The combination of (to be) assigned devices and
> global RMRR list always allow reproducing the intended set of
> regions without any extra storage.

So when you say "(to be) assigned devices", you mean any device which
is currently assigned, *or may be assigned at some point in the
future*?

Do you think the extra storage for "this VM might possibly be assigned
this device at some point" wouldn't really be that much bigger than
"this VM might possibly map this RMRR at some point in the future"?

It seems a lot cleaner to me to have the toolstack tell Xen what
ranges are reserved for RMRR per VM, and then have Xen check again
when assigning a device to make sure that the RMRRs have already been
reserved.

>> 4. Plan
>> =====================================================================
>> We're seeking an incremental way to split above tasks into 2 stages,
>> and in each stage we move forward a step w/o causing regression. Doing
>> so can benefit people who want to use device assignment early, and
>> also benefit newbie developer to rampup, toward a final sane solution.
>>
>> 4.1 Stage-1: hypervisor hardening
>> ----
>>   [Tasks]
>>       1) Setup RMRR identity mapping in p2m layer with confliction
>> detection
>>       2) add a boot option for fail/warn policy
>>       3) remove USB hack
>>       4) Detect and fail device assignment w/ shared reserve regions
>>
>>   [Enhancements]
>>       * fix [Issue-1] and [Issue-3]
>
> According to what you wrote earlier, [Issue-3] is not intended to be
> fixed, but instead devices sharing the same RMRR(s) are to be
> declared unassignable.

Yes -- fix the security hole by forbidding the situation which causes
it to happen.

 -George

* Re: (v2) Design proposal for RMRR fix
  2015-01-08 15:59   ` George Dunlap
@ 2015-01-08 16:10     ` Jan Beulich
  2015-01-08 18:02       ` George Dunlap
  2015-01-09  2:49     ` Tian, Kevin
  1 sibling, 1 reply; 139+ messages in thread
From: Jan Beulich @ 2015-01-08 16:10 UTC (permalink / raw)
  To: George Dunlap
  Cc: Kevin Tian, wei.liu2, ian.campbell, stefano.stabellini, tim,
	ian.jackson, xen-devel, Yang Z Zhang, Tiejun Chen

>>> On 08.01.15 at 16:59, <dunlapg@umich.edu> wrote:
> On Thu, Jan 8, 2015 at 1:54 PM, Jan Beulich <JBeulich@suse.com> wrote:
>>> the 1st invocation of this interface will save all reported reserved
>>> regions under domain structure, and later invocation (e.g. from
>>> hvmloader) gets saved content.
>>
>> Why would the reserved regions need attaching to the domain
>> structure? The combination of (to be) assigned devices and
>> global RMRR list always allow reproducing the intended set of
>> regions without any extra storage.
> 
> So when you say "(to be) assigned devices", you mean any device which
> is currently assigned, *or may be assigned at some point in the
> future*?

Yes.

> Do you think the extra storage for "this VM might possibly be assigned
> this device at some point" wouldn't really be that much bigger than
> "this VM might possibly map this RMRR at some point in the future"?

Since listing devices without RMRR association would be pointless,
I think a list of devices would require less storage. But see below.

> It seems a lot cleaner to me to have the toolstack tell Xen what
> ranges are reserved for RMRR per VM, and then have Xen check again
> when assigning a device to make sure that the RMRRs have already been
> reserved.

With an extra level of what can be got wrong by the admin.
However, I now realize that doing it this way would allow
specifying regions not associated with any device on the host
the guest boots on, but associated with one on a host the guest
may later migrate to.

Jan

* Re: (v2) Design proposal for RMRR fix
  2015-01-08 16:10     ` Jan Beulich
@ 2015-01-08 18:02       ` George Dunlap
  2015-01-08 18:12         ` Pasi Kärkkäinen
                           ` (3 more replies)
  0 siblings, 4 replies; 139+ messages in thread
From: George Dunlap @ 2015-01-08 18:02 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Kevin Tian, wei.liu2, ian.campbell, stefano.stabellini, tim,
	ian.jackson, xen-devel, Yang Z Zhang, Tiejun Chen

On Thu, Jan 8, 2015 at 4:10 PM, Jan Beulich <JBeulich@suse.com> wrote:
>>>> On 08.01.15 at 16:59, <dunlapg@umich.edu> wrote:
>> On Thu, Jan 8, 2015 at 1:54 PM, Jan Beulich <JBeulich@suse.com> wrote:
>>>> the 1st invocation of this interface will save all reported reserved
>>>> regions under domain structure, and later invocation (e.g. from
>>>> hvmloader) gets saved content.
>>>
>>> Why would the reserved regions need attaching to the domain
>>> structure? The combination of (to be) assigned devices and
>>> global RMRR list always allow reproducing the intended set of
>>> regions without any extra storage.
>>
>> So when you say "(to be) assigned devices", you mean any device which
>> is currently assigned, *or may be assigned at some point in the
>> future*?
>
> Yes.
>
>> Do you think the extra storage for "this VM might possibly be assigned
>> this device at some point" wouldn't really be that much bigger than
>> "this VM might possibly map this RMRR at some point in the future"?
>
> Since listing devices without RMRR association would be pointless,
> I think a list of devices would require less storage. But see below.
>
>> It seems a lot cleaner to me to have the toolstack tell Xen what
>> ranges are reserved for RMRR per VM, and then have Xen check again
>> when assigning a device to make sure that the RMRRs have already been
>> reserved.
>
> With an extra level of what can be got wrong by the admin.
> However, I now realize that doing it this way would allow
> specifying regions not associated with any device on the host
> the guest boots on, but associated with one on a host the guest
> may later migrate to.

I did say the toolstack, not the admin. :-)

At the xl level, I envisioned a single boolean that would say, "Make
my memory layout resemble the host system" -- so the MMIO hole would
be the same size, and all the RMRRs would be reserved.

But xapi, for instance, has a concept of "hardware pools" containing
individual hardware devices, which can be assigned to VMs.  You could
imagine a toolstack like xapi keeping track of all devices which
*might be* assigned to a guest, and supplying Xen with the RMRRs.  As
you say, then this could include hardware across a pool of hosts, with
the RMRRs of any device in the system reserved.

Alternately, could the toolstack be responsible for making sure that
nobody uses such a range; and then, when a device is assigned, Xen
can check to make sure that the gpfn space is empty before adding
the RMRRs?  That might be the most flexible.
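
Sketched as code, the check-then-map step might look like this (the
gfn_t typedef and all three helpers are invented stand-ins, not the
real p2m interfaces):

#include <stdbool.h>
#include <stdint.h>

typedef uint64_t gfn_t;                     /* simplified for the sketch */

extern bool gfn_is_unoccupied(gfn_t gfn);       /* hypothetical */
extern bool gfn_is_identity_mapped(gfn_t gfn);  /* hypothetical */
extern int  set_identity_entry(gfn_t gfn);      /* hypothetical */

/* On device assignment: refuse unless every gfn in the RMRR is either
 * free or already identity-mapped (e.g. for a device sharing it). */
static int map_rmrr_on_assign(gfn_t first, gfn_t last)
{
    for (gfn_t gfn = first; gfn <= last; gfn++) {
        if (gfn_is_identity_mapped(gfn))
            continue;                    /* nothing to do */
        if (!gfn_is_unoccupied(gfn))
            return -1;                   /* conflict: fail (or warn) */
        if (set_identity_entry(gfn))
            return -1;
    }
    return 0;
}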

 -George

* Re: (v2) Design proposal for RMRR fix
  2015-01-08 18:02       ` George Dunlap
@ 2015-01-08 18:12         ` Pasi Kärkkäinen
  2015-01-09  3:12         ` Tian, Kevin
                           ` (2 subsequent siblings)
  3 siblings, 0 replies; 139+ messages in thread
From: Pasi Kärkkäinen @ 2015-01-08 18:12 UTC (permalink / raw)
  To: George Dunlap
  Cc: Kevin Tian, wei.liu2, ian.campbell, stefano.stabellini,
	ian.jackson, tim, xen-devel, Jan Beulich, Yang Z Zhang,
	Tiejun Chen

On Thu, Jan 08, 2015 at 06:02:04PM +0000, George Dunlap wrote:
> On Thu, Jan 8, 2015 at 4:10 PM, Jan Beulich <JBeulich@suse.com> wrote:
> >>>> On 08.01.15 at 16:59, <dunlapg@umich.edu> wrote:
> >> On Thu, Jan 8, 2015 at 1:54 PM, Jan Beulich <JBeulich@suse.com> wrote:
> >>>> the 1st invocation of this interface will save all reported reserved
> >>>> regions under domain structure, and later invocation (e.g. from
> >>>> hvmloader) gets saved content.
> >>>
> >>> Why would the reserved regions need attaching to the domain
> >>> structure? The combination of (to be) assigned devices and
> >>> global RMRR list always allow reproducing the intended set of
> >>> regions without any extra storage.
> >>
> >> So when you say "(to be) assigned devices", you mean any device which
> >> is currently assigned, *or may be assigned at some point in the
> >> future*?
> >
> > Yes.
> >
> >> Do you think the extra storage for "this VM might possibly be assigned
> >> this device at some point" wouldn't really be that much bigger than
> >> "this VM might possibly map this RMRR at some point in the future"?
> >
> > Since listing devices without RMRR association would be pointless,
> > I think a list of devices would require less storage. But see below.
> >
> >> It seems a lot cleaner to me to have the toolstack tell Xen what
> >> ranges are reserved for RMRR per VM, and then have Xen check again
> >> when assigning a device to make sure that the RMRRs have already been
> >> reserved.
> >
> > With an extra level of what can be got wrong by the admin.
> > However, I now realize that doing it this way would allow
> > specifying regions not associated with any device on the host
> > the guest boots on, but associated with one on a host the guest
> > may later migrate to.
> 
> I did say the toolstack, not the admin. :-)
> 
> At the xl level, I envisioned a single boolean that would say, "Make
> my memory layout resemble the host system" -- so the MMIO hole would
> be the same size, and all the RMRRs would be reserved.
>

There is an e820_host= parameter for domUs already, but currently it doesn't work for HVM guests afaik.


-- Pasi

* Re: (v2) Design proposal for RMRR fix
  2015-01-08 12:32 ` Tim Deegan
@ 2015-01-09  0:53   ` Tian, Kevin
  2015-01-09 12:00     ` Andrew Cooper
  0 siblings, 1 reply; 139+ messages in thread
From: Tian, Kevin @ 2015-01-09  0:53 UTC (permalink / raw)
  To: Tim Deegan
  Cc: wei.liu2, ian.campbell, stefano.stabellini, ian.jackson,
	xen-devel, Jan Beulich, Zhang, Yang Z, Chen, Tiejun

> From: Tim Deegan [mailto:tim@xen.org]
> Sent: Thursday, January 08, 2015 8:32 PM
> 
> Hi Kevin,
> 
> Thanks for sending out this design document.  I think Jan will have
> the most to say about this.  Looking just at the hypervisor side of
> things, and leaving the tools design to others...
> 
> At 11:23 +0000 on 26 Dec (1419589382), Tian, Kevin wrote:
> > c) in Xen hypervisor it is reasonable to fail upon confliction, where
> > device is actually assigned. But due to the same requirement on USB
> > controller, sometimes we might want it succeed just w/ warnings.
> 
> Can you explain more concretely why we would want to allow assignment
> whn the RMRR setup fails?  It seems like the device's use of the RMRR
> will (at least) corrupt the OS and (possibly) make the device itself
> fail.

For USB devices, the RMRR is used only in the early-boot phase, as a
way of communicating between the legacy keyboard driver and the BIOS.
That emulation mode is disabled either at some ACPI initialization
step or when the USB keyboard driver is loaded (sorry, I don't
remember the details). So when such a device is assigned to a guest,
there's no legacy emulation, and not setting up the identity mapping
is not a critical issue. If we think such a usage is a valid one,
'fail' would cause a regression there.

> 
> If we do need to allow this, it should be configured for a particular
> device, rather than just disabling the safety checks for all devices at
> once.

Will think about this. We still have the safety checks; the open
question is whether to fail immediately or just warn users that the
assigned device might have a problem, leaving it to them to decide
whether to move forward. If a global warning is OK, then we don't need
a per-device policy. But if everyone thinks 'fail' should be the
default option, then yes, we need a per-device policy to support the
relaxed option.
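
A tiny sketch of how the per-device variant might look in the
assignment path (the field and helper names are invented):

#include <stdio.h>

struct pdev_policy { int rdm_relaxed; };   /* set per device by admin */

static int handle_rmrr_conflict(const struct pdev_policy *p)
{
    if (!p->rdm_relaxed)
        return -1;  /* default policy: fail the assignment */
    fprintf(stderr,
            "RMRR conflict: device may not function correctly\n");
    return 0;       /* admin opted in: proceed with a warning */
}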

> 
> > >>>3.4 Xen: setup RMRR identity mapping
> > ----
> > Regardless of whether userspace has detected confliction, Xen hypervisor
> > always needs to detect confliction itself when setting up identify
> > mapping for reserved gfn regions, following above defined policy.
> >
> > Identity mapping should be really handled from the general p2m layer,
> > so the same r/w permissions apply equally to CPU/DMA access paths,
> > regardless of the underlying fact whether EPT is shared with IOMMU.
> 
> Agreed!
> 
> > >>>3.8 Xen: Handle devices sharing reserved regions
> > ----
> > Per VT-d spec, it's possible to have two devices sharing same reserved
> > region. Though we didn't see such example in reality, hypervisor needs
> > to detect and handle such scenario, otherwise vulnerability may exist
> > if two devices are assigned to different VMs (so a malicious VM may
> > program its assigned device to clobber the shared region to malform
> > another VM's device)
> >
> > Ideally all devices sharing reserved regions should be assigned to a
> > single VM. However achieving this goal can't be done sole in hypervisor
> > w/o reworking current device assignment interface. Assignment is managed
> > by toolstack, which requires exposing group sharing information to
> > userspace and then extends toolstack to manage assignment in bundle.
> 
> Xen can at least enforce that devices that share RMRR are not assigned
> to different domains.  Is the problem here that "unassigned" devices
> are actually assigned to dom0?

Yes, that's why we said Xen will check, and fail the assignment of
this device, if other devices sharing the same RMRR regions have
already been assigned to other VMs (e.g. dom0).

> 
> > Given the problem only in ideal space, we propose to not support such
> > scenario, i.e. having hypervisor to fail the assignment, if the target
> > device happens to share some reserved regions with another device,
> > following [Guideline-4] to keep things simple.
> 
> How confident are you that this doesn't happen in real servers?  What
> sort of range of servers have you been able to check?

We didn't do a one-off check across many servers; it's based on our
experience working on VT-d in past years. If it turns out to be a real
problem later, we can redesign the assignment interface to allow group
assignment. For now we think it's enough to enforce the check in
Xen. :-)

> 
> Cheers,
> 
> Tim.

* Re: (v2) Design proposal for RMRR fix
  2015-01-08 13:54 ` Jan Beulich
  2015-01-08 15:59   ` George Dunlap
@ 2015-01-09  2:27   ` Tian, Kevin
  2015-01-09  9:21     ` Jan Beulich
  1 sibling, 1 reply; 139+ messages in thread
From: Tian, Kevin @ 2015-01-09  2:27 UTC (permalink / raw)
  To: Jan Beulich
  Cc: wei.liu2, ian.campbell, stefano.stabellini, tim, ian.jackson,
	xen-devel, Zhang, Yang Z, Chen, Tiejun

> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Thursday, January 08, 2015 9:55 PM
> 
> >>> On 26.12.14 at 12:23, <kevin.tian@intel.com> wrote:
> > [Issue-2] Being lacking of goal-b), existing device assignment with
> > RMRR works only when reserved regions happen to not conflicting with
> > other valid allocations in the guest physical address space. This could
> > lead to unpredicted failures in various deployments, due to non-detected
> > conflictions caused by platform difference and VM configuration
> > difference.
> >
> > One example is about USB controller assignment. It's already identified
> > as a problem on some platforms, that USB reserved regions conflict with
> > guest BIOS region. However, being the fact that host BIOS only touches
> > those reserved regions for legacy keyboard emulation at early Dom0 boot
> > phase, a trick is added in Xen to bypass RMRR handling for usb
> > controllers.
> 
> s/trick/hack/ - after all, doing this is not safe. Plus if these regions
> really were needed only for early boot legacy keyboard emulation,
> they wouldn't need expressing as RMRR afaict, or if that really was
> a requirement a suitable flag should be added to tell the OS that
> once a proper driver is in place for the device, the RMRR won't be
> needed anymore. In any event - the hack needs to go away.

Early boot doesn't mean pre-OS boot. The region is still used in early
OS boot, before ACPI switches mode or a USB keyboard driver writes a
register (will check the details later). And VT-d can be used on bare
metal; that's why the RMRR is required from the specification's p.o.v.
(an OS can set up the IOMMU for USB devices very early).

So it is reported, and if we add strict conflict detection with an
immediate-failure policy (when the RMRR is <1M), USB devices which
could previously be assigned would now fail after the hack is removed.
Whether we still want to keep a warning option to support them is a
later open.

> 
> > [Issue-3] devices may share same reserved regions, however
> > there is no logic to handle this in Xen. Assigning such devices to
> > different VMs could lead to secure concern
> 
> s/could lead to/is a/
> 
> > [Guideline-3] New interface should be kept as common as possible
> >
> >   New interface will be introduced to expose reserved regions to the
> > user space. Though RMRR is a VT-d specific terminology, the interface
> > design should be generic enough, i.e. to support a function which
> > allows hypervisor to force reserving one or more gfn ranges.
> 
> s/hypervisor/user space/ ? Or else I don't see the connection between
> the new interface and the enforcement of the reserved ranges.

The reserved regions are specified by the hypervisor for some reason
(e.g. RMRR), and are ultimately reserved by user space. Here I wanted
to convey that the intention comes from the hypervisor.

> 
> > 3) hvmloader
> >   Hvmloader allocates other resources (ACPI, PCI MMIO, etc.) and
> > internal data structures in gfn space, and it creates the final guest
> > e820. So hvmloader also needs to detect conflictions when conducting
> > those operations. If there's no confliction, hvmloader will reserve
> > those regions in guest e820 to let guest OS aware.
> 
> Ideally, rather than detecting conflicts, hvmloader would just
> consume what libxc set up. Obviously that would require awareness
> in libxc of things it currently doesn't care about (like fitting PCI BARs
> into the MMIO hole, enlarging it as necessary). I admit that this may
> end up being difficult to implement. Another alternative would be to
> have libxc only populate a limited part of RAM (for hvmloader to be
> loadable), and have hvmloader do the bulk of the populating.

There are quite a few allocations which are best done in hvmloader,
such as ACPI, PCI BARs, and other hole allocations. Some of them are
for hvmloader's own usage, and others relate to the guest BIOS. I
don't think the mass refactoring of moving those allocations into
libxc is worthwhile just for this very specific task. As long as
hvmloader still needs to allocate gfns, it needs to keep its own
conflict detection logic.

So I want to avoid big changes where possible (they could proliferate
into more tasks, so that this specific RMRR task never ends), and only
target the minimal necessary changes for now.


> 
> >>>>3.3 Policies
> > ----
> > An intuitive thought is to fail immediately upon a confliction, however
> > it is not flexible regarding to different requirements:
> >
> > a) it's not appropriate to fail libxc domain builder just because such
> > confliction. We still want the guest to boot even w/o assigned device;
> 
> I don't think that's right (and I believe this was discussed before):
> When device assignment fails, VM creation should fail too. It is the
> responsibility of the host admin in that case to remove some or all
> of the to be assigned devices from the guest config.

Think about bare metal. If a device, say a NIC, doesn't work, would
the platform refuse to work at all? There could be errors, but their
scope is limited to the specific function. A user can still use a
platform with errors as long as the related functions are not used.

Similarly, we should allow the domain builder to move forward upon a
device assignment failure (something like a circuit error when
powering the device), and the user will notice the problem when using
the device (either not present, or not functioning correctly).

The same goes for the hotplug usage. All the detection for future
hotplug usage is just preparation, and not strict. You don't want to
hang a platform just because it's not suitable for hotplugging some
device in the future.

> 
> > b) whether to fail in hvmloader has several dependencies. If it's
> > to check for hotplug preparation, warning is also an acceptable option
> > since assignment may not happen at all. Or if it's a USB controller
> > but user doesn't care about legacy keyboard emulation, it's also OK to
> > move forward upon a confliction;
> 
> Again assuming that RMRRs for USB devices are _only_ used for
> legacy keyboard emulation, which may or may not be true.
> 
> > c) in Xen hypervisor it is reasonable to fail upon confliction, where
> > device is actually assigned. But due to the same requirement on USB
> > controller, sometimes we might want it succeed just w/ warnings.
> 
> But only when asked to do so by the host admin.
> 
> > Regarding to the complexity of addressing all above flexibilities (user
> > preferences, per-device), which requires inventing quite some parameters
> > passed among different components, and regarding to the fact that
> > failures would be rare (except some USB) with proactive avoidance
> > in userspace, we'd like to propose below simplified policy following
> > [Guideline-4]:
> >
> > - 'warn' conflictions in user space (libxc and hvmloader)
> > - a boot option to specify 'fail' or 'warn' confliction in Xen device
> > assignment path, default to 'fail' (user can set to 'warn' for USB case)
> 
> I think someone else (Tim?) already said this: Such a "warn" option
> would unlikely to be desirable as a global one, affecting all devices,
> but should rather be a flag settable on particular devices.
> 
> >>>>3.5 New interface: expose reserved region information
> > ----
> > As explained in [Guideline-3], we'd like to keep this interface general
> > enough, as a common interface for hypervisor to force reserving gfn
> > ranges, due to various reasons (RMRR is a client of this feature).
> >
> > One design open was discussed back-and-forth accordingly, regarding to
> > whether the interface should return regions reported for all devices
> > in the platform (report-all), or selectively return regions only
> > belonging to assigned devices (report-sel). report-sel can be built on
> > top of report-all, with extra work to help hypervisor generate filtered
> > regions (e.g. introduce new interface or make device assignment happened
> > before domain builder)
> >
> > We propose report-all as the simple solution (different from last sent
> > version which used report-sel), regarding to the below facts:
> >
> >   - 'warn' policy in user space makes report-all not harmful
> >   - 'report-all' still means only a few entries in reality:
> >     * RMRR reserved regions should be avoided or limited by platform
> > designers, per VT-d specification;
> >     * RMRR reserved regions are only a few on real platforms, per our
> > current observations;
> 
> Few yes, but in the IGD example you gave the region is quite large,
> and it would be fairly odd to have all guests have a strange, large
> hole in their address spaces. Furthermore remember that these
> holes vary from machine to machine, so a migrateable guest would
> needlessly end up having a hole potentially not helping subsequent
> hotplug at all.

It's not strange, since it never exceeds the set on bare metal; but
yes, migration raises another interesting point. Currently I don't
think migration w/ assigned devices is supported. But even considering
the future possibility, there's always a limitation, since whatever
reserved regions are created at boot time in the e820 are static and
can't adapt to dynamic device changes. For hotplug or migration, you
always suffer from seeing some holes which might not be relevant at a
given moment.

> 
> > In this way, there are two situations libxc domain builder may request
> > to query reserved region information w/ same interface:
> >
> > a) if any statically-assigned devices, and/or
> > b) if a new parameter is specified, asking for hotplug preparation
> > 	('rdm_check' or 'prepare_hotplug'?)
> >
> > the 1st invocation of this interface will save all reported reserved
> > regions under domain structure, and later invocation (e.g. from
> > hvmloader) gets saved content.
> 
> Why would the reserved regions need attaching to the domain
> structure? The combination of (to be) assigned devices and
> global RMRR list always allow reproducing the intended set of
> regions without any extra storage.

It's possible that a new device is plugged into the host between two
adjacent invocations, in which case inconsistent information would be
returned.

> 
> >>>>3.6 Libxc/hvmloader: detect and avoid conflictions
> > ----
> > libxc needs to detect reserved region conflictions with:
> > 	- guest RAM
> > 	- monolithic PCI MMIO hole
> >
> > hvmloader needs to detect reserved region confliction with:
> > 	- guest RAM
> > 	- PCI MMIO allocation
> > 	- memory allocation
> > 	- some e820 entries like ACPI Opregion, etc.
> 
> - BIOS and alike

yes

> 
> > There are several other options discussed so far:
> >
> > a) Duplicate same relocation algorithm within libxc domain builder
> > (when populating physmap) and hvmloader (when creating e820)
> >   - Pros:
> > 	* no interface/structure change
> > 	* anyway hvmloader still needs to handle reserved regions
> >   - Cons:
> > 	* duplication is not good
> >
> > b) pass sparse information through Xenstore
> >   (no much idea. need input from toolstack maintainers)
> >
> > c) utilize XENMEM_{set,}_memory_map pair of hypercalls, with libxc to
> > set and hvmloader to get. Extension required to allow hvm invoke.
> >   - Pros:
> > 	* centralized ownership in libxc. flexible for extension
> >   - Cons:
> > 	* limiting entry to E820MAX (should be fine)
> > 	* hvmloader e820 construction may become more complex, given
> > two predefined tables (reserved_regions, memory_map)
> 
> d) Move down the lowmem RAM/MMIO boundary so that a single,
> contiguous chunk of lowmem results, with all other RAM moving up
> beyond 4Gb. Of course RMRRs below the 1Mb boundary must not be
> considered here, and I think we can reasonably safely assume that
> no RMRRs will ever report ranges above 1Mb but below the host
> lowmem RAM/MMIO boundary (i.e. we can presumably rest assured
> that the lowmem chunk will always be reasonably big).

I don't see how the above assumption is validated, but it's a good
simplification, since how hard we try to avoid conflicts is an
implementation tradeoff. :-)

> 
> > 4. Plan
> >
> ================================================================
> =====
> > We're seeking an incremental way to split above tasks into 2 stages,
> > and in each stage we move forward a step w/o causing regression. Doing
> > so can benefit people who want to use device assignment early, and
> > also benefit newbie developer to rampup, toward a final sane solution.
> >
> > 4.1 Stage-1: hypervisor hardening
> > ----
> >   [Tasks]
> > 	1) Setup RMRR identity mapping in p2m layer with confliction
> > detection
> > 	2) add a boot option for fail/warn policy
> > 	3) remove USB hack
> > 	4) Detect and fail device assignment w/ shared reserve regions
> >
> >   [Enhancements]
> > 	* fix [Issue-1] and [Issue-3]
> 
> According to what you wrote earlier, [Issue-3] is not intended to be
> fixed, but instead devices sharing the same RMRR(s) are to be
> declared unassignable.

yes, that's clearer.

> 
> > 	* partially fix [Issue-2] with limitations:
> > 		- w/o userspace relocation there's larger chance to
> > see conflictions.
> > 		- w/o reserve in guest e820, guest OS may allocate
> > reserved pfn when re-enumerating PCI resource
> >
> >   [Regressions]
> > 	* devices which can be assigned successfully before may be
> > failed now due to confliction detection. However it's not a regression
> > per se. and user can change policy to 'warn' if required.
> 
> Avoiding such a (perceived) regression would seem to be possible by
> intermixing hypervisor and libxc/hvmloader adjustments.
> 
> Jan

* Re: (v2) Design proposal for RMRR fix
  2015-01-08 12:58   ` Jan Beulich
@ 2015-01-09  2:29     ` Tian, Kevin
  2015-01-09  9:24       ` Jan Beulich
  0 siblings, 1 reply; 139+ messages in thread
From: Tian, Kevin @ 2015-01-09  2:29 UTC (permalink / raw)
  To: Jan Beulich, George Dunlap
  Cc: wei.liu2, ian.campbell, stefano.stabellini, tim, ian.jackson,
	xen-devel, Zhang, Yang Z, Chen, Tiejun

> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Thursday, January 08, 2015 8:59 PM
> 
> >>> On 08.01.15 at 13:49, <George.Dunlap@eu.citrix.com> wrote:
> > One question: where are these RMRRs typically located in memory?  Are
> > they normally up in the MMIO region?  Or can they occur anywhere (even
> > in really low areas, say, under 1GiB)?
> 
> They would typically sit in the MMIO hole or below 1Mb; that latter case
> is particularly problematic as it might conflict with what we want to put
> there (BIOS etc).
> 

And the latter case is not solvable, which then ties into the other
discussion of whether we want to fail such a case.

* Re: (v2) Design proposal for RMRR fix
  2015-01-08 12:49 ` George Dunlap
  2015-01-08 12:54   ` George Dunlap
  2015-01-08 12:58   ` Jan Beulich
@ 2015-01-09  2:42   ` Tian, Kevin
  2 siblings, 0 replies; 139+ messages in thread
From: Tian, Kevin @ 2015-01-09  2:42 UTC (permalink / raw)
  To: George Dunlap
  Cc: wei.liu2, ian.campbell, stefano.stabellini, tim, ian.jackson,
	xen-devel, Jan Beulich, Zhang, Yang Z, Chen, Tiejun

> From: George Dunlap
> Sent: Thursday, January 08, 2015 8:50 PM
> 
> On Fri, Dec 26, 2014 at 11:23 AM, Tian, Kevin <kevin.tian@intel.com> wrote:
> > (please note some proposal is different from last sent version after more
> > discussions. But I tried to summarize previous discussions and explained why
> > we choose a different way. Sorry if I may miss some opens/conclusions
> > discussed in past months. Please help point it out which is very
> appreciated. :-)
> 
> Kevin, thanks for this document.  A few questions / comments below:
> 
> > For proper functioning of these legacy reserved memory usages, when
> > system software enables DMA remapping, the translation structures for
> > the respective devices are expected to be set up to provide identity
> > mapping for the specified reserved memory regions with read and write
> > permissions. The system software is also responsible for ensuring
> > that any input addresses used for device accesses to OS-visible memory
> > do not overlap with the reserved system memory address ranges.
> 
> Just to be clear: "identity mapping" here means that gpfn == mfn, in
> both the p2m and IOMMU.  (I suppose it might mean vfn == gpfn as well,
> but that wouldn't really concern us, as the guest deals with virtual
> mappings.)

I'm not sure what you meant by 'vfn', but it applies to whatever
address space is created by the IOMMU page tables. And it's from the
VT-d spec, which also covers bare metal usage.

> 
> > However current RMRR implementation in Xen only partially achieves a)
> > and completely misses b), which cause some issues:
> >
> > --
> > [Issue-1] Identity mapping is not setup in shared ept case, so a device
> > with RMRR may not function correctly if assigned to a VM.
> >
> > This was the original problem we found when assigning IGD on BDW
> > platform, which triggered the whole long discussion in past months
> >
> > --
> > [Issue-2] Being lacking of goal-b), existing device assignment with
> > RMRR works only when reserved regions happen to not conflicting with
> > other valid allocations in the guest physical address space. This could
> > lead to unpredicted failures in various deployments, due to non-detected
> > conflictions caused by platform difference and VM configuration
> > difference.
> >
> > One example is about USB controller assignment. It's already identified
> > as a problem on some platforms, that USB reserved regions conflict with
> > guest BIOS region. However, being the fact that host BIOS only touches
> > those reserved regions for legacy keyboard emulation at early Dom0 boot
> > phase, a trick is added in Xen to bypass RMRR handling for usb
> > controllers.
> >
> > --
> > [Issue-3] devices may share same reserved regions, however
> > there is no logic to handle this in Xen. Assigning such devices to
> > different VMs could lead to secure concern
> 
> So to summarize:
> 
> When assigning a device to a guest, the device's associated RMRRs must
> be identity mapped in the p2m and IOMMU.
> 
> At the moment, we don't have a reliable way to reclaim a particular
> gpfn space from a guest once it's been used for other puproses (e.g.,
> guest RAM or other MMIO ranges).
> 
> So, we need to make sure at guest creation time that we reserve any
> RMRR ranges for devices we may wish to assign, and make sure that the
> RMRR in gpfn space is empty.
> 
> For statically-assigned devices, we know at guest creation time which
> RMRRs may be required.  But if we want to dynamically add devices, we
> must figure out ahead of time which devices we *might* add, and
> reserve the RMRRs at boot time.
> 
> As a separate problem, two different devices may share the same RMRR,
> meaning that if we assign these devices to two different VMs, the RMRR
> may be mapped into the gpfn space of two different VMs.  This may well
> be a security issue, so we need to handle it carefully.

exactly. :-)

> 
> > 3. High Level Design
> >
> ================================================================
> =====
> >
> > To achieve aforementioned two goals, major enhancements are required
> > cross Xen hypervisor, libxc, and hvmloader, to address the gap in
> > goal-b), i.e. handling possible conflictions in gfn space. Fixing
> > goal-a) is straightforward.
> >
> >>>>3.1 Guidelines
> > ----
> > There are several guidelines considered in the design:
> >
> > --
> > [Guideline-1] No regression in a VM w/o statically-assigned devices
> >
> >   If a VM isn't configured with assigned devices at creation, new
> > confliction detection logic shouldn't block the VM boot progress
> > (either skipped, or just throw warning)
> >
> > --
> > [Guideline-2] No regression on devices which do not have RMRR reported
> >
> >   If a VM is assigned with a device which doesn't have RMRR reported,
> > either statically-assigned or dynamically-assigned, new confliction
> > detection logic shouldn't fail the assignment request for this device.
> >
> > --
> > [Guideline-3] New interface should be kept as common as possible
> >
> >   A new interface will be introduced to expose reserved regions to
> > user space. Though RMRR is VT-d specific terminology, the interface
> > design should be generic enough, i.e. it should support a function which
> > allows the hypervisor to force reserving one or more gfn ranges.
> >
> > --
> > [Guideline-4] Keep changes simple
> >
> >   RMRR reserved regions should be avoided or limited by platform
> > designers, per the VT-d specification. Per our observations, there are
> > only a few reported examples (USB, IGD) on real platforms. So we need
> > to balance code complexity against usage limitations. If a limitation
> > only affects niche scenarios, we'd rather declare it unsupported to keep
> > the changes simple for now.
> 
> This is an excellent set of principles -- thanks.
> 
> >
> >>>>3.2 Conflict detection
> > ----
> > Conflicts must be detected in several places as far as gfn space is
> > concerned (how to handle a conflict is discussed in 3.3):
> >
> > 1) libxc domain builder
> >   Here the coarse-grained gfn layout is created, including two contiguous
> > guest RAM trunks (lowmem and/or highmem) and MMIO holes (VGA, PCI),
> > which are passed to hvmloader for later fine-grained manipulation. Guest
> > RAM trunks are populated with valid translations set up in the underlying
> > p2m layer. Device reserved regions must be detected in that layout.
> >
> > 2) Xen hypervisor device assignment
> >   Device assignment can happen either at VM creation time (after the
> > domain builder), or at any time via hotplug after the VM has booted.
> > Regardless of how userspace handles conflicts, the Xen hypervisor will
> > always perform a final, conservative check when setting up the identity
> > mapping:
> >         * gfn space unoccupied:
> >                 -> insert identity mapping; no conflict
> >         * gfn space already occupied with identity mapping:
> >                 -> do nothing; no conflict
> >         * gfn space already occupied with other mapping:
> >                 -> conflict detected
> >
> > 3) hvmloader
> >   Hvmloader allocates other resources (ACPI, PCI MMIO, etc.) and
> > internal data structures in gfn space, and it creates the final guest
> > e820. So hvmloader also needs to detect conflicts when performing
> > those operations. If there's no conflict, hvmloader reserves those
> > regions in the guest e820 to make the guest OS aware of them.
> 
> I think this can be summarized a bit more clearly by what each bit of
> code needs to actually do:
> 
> 1. libxc
>  - RMRR areas need to be not populated with gfns during boot time.
> 
> 2. Xen
>  - When a device with RMRRs is assigned, Xen must make an
> identity-mapping of the appropriate RMRR ranges.
> 
> 3. hvmloader
>  - hvmloader must report in the e820 map the RMRRs of all devices which a
> guest may ever be assigned
>  - when laying out MMIO space, hvmloader must avoid placing device BARs
> over RMRR regions which are / may be assigned to a guest.
> 
> One component I think may be missing here -- qemu-traditional is very
> tolerant with regard to the gpfn space; but qemu-upstream expects to
> know the layout of guest gpfn space, and may crash if its idea of gpfn
> space doesn't match Xen's idea.  Unfortunately, however, the link
> between the two is not very close at the moment; IIUC it is limited to
> the domain builder telling qemu how big the lowmem PCI hole will be.
> Any solution which marks GPFN space as "non-memory" needs to make sure
> this is communicated to qemu-upstream as well.

so what qemu cares about is only RAM pages, right? If that's the case,
Jan's idea in another mail might make sense, i.e. we assume no RMRR
in lowmem (or at least none low enough that we'd have to split lowmem),
and then the basic lowmem structure doesn't change.
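
E.g. (addresses purely illustrative), if the lowmem/hole boundary is
chosen below any RMRR, the part qemu cares about stays intact:

    [0x000000000, 0x07FFFFFFF]  lowmem RAM  (populated; layout known to qemu)
    [0x080000000, 0x0FBFFFFFF]  MMIO hole   (RMRRs such as the BDW example
                                             ab80a000-ab81dfff fall in here)
    [0x100000000, ...        ]  highmem RAM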

> 
> >>>>3.3 Policies
> > ----
> > An intuitive thought is to fail immediately upon a conflict; however,
> > that is not flexible given different requirements:
> >
> > a) it's not appropriate to fail the libxc domain builder just because of
> > such a conflict. We still want the guest to boot even w/o the assigned
> > device;
> >
> > b) whether to fail in hvmloader has several dependencies. If the check
> > is for hotplug preparation, a warning is an acceptable option since the
> > assignment may not happen at all. Or if it's a USB controller and the
> > user doesn't care about legacy keyboard emulation, it's also OK to
> > move forward despite a conflict;
> >
> > c) in the Xen hypervisor, where the device is actually assigned, it is
> > reasonable to fail upon a conflict. But due to the same consideration for
> > USB controllers, sometimes we might want it to succeed just w/ warnings.
> >
> > Given the complexity of addressing all the above flexibility (user
> > preferences, per-device handling), which would require inventing quite a
> > few parameters passed among the different components, and given that
> > failures would be rare (except for some USB cases) with proactive
> > avoidance in userspace, we'd like to propose the simplified policy below,
> > following [Guideline-4]:
> >
> > - 'warn' on conflicts in user space (libxc and hvmloader)
> > - a boot option to specify 'fail' or 'warn' on conflict in the Xen device
> > assignment path, defaulting to 'fail' (the user can set it to 'warn' for
> > the USB case)
> >
> > Such a policy provides a relaxed user space policy w/ the hypervisor as
> > the final judge. It has the unique merit of simplifying later interface
> > design and hotplug support, w/o breaking [Guideline-1/2] even when all
> > possible reserved regions are exposed.
> >
> >     ******agreement is first required on above policy******
> 
> So the important part of policy is what the user experience is.  I
> think we can assume that all device assignment will happen through
> libxl; so from a user interface perspective we mainly want to be
> thinking about the xl / libxl interface.
> 
> How the various sub-components react if something unexpected happens
> is then just a matter of robust system design.
> 
> So first of all, I think RMRR reservations should be specified at
> domain creation time.  If a user tries to assign a device with RMRRs
> to a VM that has not reserved those ranges at creation time, the
> assignment should fail.
> 
> The main place this checking should happen is in the toolstack
> (libxl).  The toolstack can then give a sensible error message to the
> user, which may include things they can do to fix the problem.
> 
> In the case of statically-assigned devices, the toolstack can look at
> the RMRRs required and make sure to reserve them at domain creation
> time.
> 
> For dynamically-assigned devices, I think there should be an option to
> make the guest's memory layout mirror the host: this would include the
> PCI hole and all RMRR ranges.  This would be off by default.
> 
> We could imagine a way of specifying "I may want to assign this pool
> of devices to this VM", or to manually specify RMRR ranges which
> should be reserved, but I think that's a bit more advanced than we
> really need right now.

yes, that type of enhancements can be considered later.
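
E.g. at the xl level this might eventually look something like below
(syntax purely illustrative, nothing is settled yet):

    # mirror the host layout: same MMIO hole, all host RMRRs reserved
    rdm = "strategy=host,policy=relaxed"

    # or only reserve what statically assigned devices need:
    pci = [ '01:00.0,rdm_policy=strict' ]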

> 
> >>>>3.5 New interface: expose reserved region information
> 
> It's not clear to me who this new interface is being exposed to.
> 
> It seems to me what we want is for the toolstack to figure out, at
> guest creation time, what RMRRs should be reserved for this VM, and
> probably put that information in xenstore somewhere, where it's
> available to hvmloader.  I assume the RMRR information is already
> available through sysfs in dom0?

no, that information isn't currently exposed via sysfs.

> 
> One question: where are these RMRRs typically located in memory?  Are
> they normally up in the MMIO region?  Or can they occur anywhere (even
> in really low areas, say, under 1GiB)?

they are reported via ACPI structures, and as Jan replied, they could be anywhere.

> 
> If RMRRs almost always happen up above 2G, for example, then a simple
> solution that wouldn't require too much work would be to make sure
> that the PCI MMIO hole we specify to libxc and to qemu-upstream is big
> enough to include all RMRRs.  That would satisfy the libxc and qemu
> requirements.

unfortunately it's not such a simple case.

> 
> If we then store specific RMRRs we want included in xenstore,
> hvmloader can put them in the e820 map, and that would satisfy the
> hvmloader requirement.

is xenstore really necessary given the newly introduced hypercall?
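
For reference, what I have in mind for that hypercall is a memory-op
along the following lines (a sketch only; names and layout are not final):

    /* Sketch of the proposed interface -- not final. */
    struct xen_reserved_device_memory {
        xen_pfn_t   start_pfn;   /* first gfn of the reserved region */
        xen_ulong_t nr_pages;    /* size of the region in pages */
    };

    /* e.g. a XENMEM_reserved_device_memory_map op: the caller passes a
     * guest handle to an array of the above and gets back all reserved
     * regions (report-all), or an error plus the needed entry count. */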

> 
> Then when we assign the device, those ranges will be already unused in
> the p2m, and (if I understand correctly) Xen will already map the RMRR
> ranges 1-1 upon device assignment.
> 
> What do you think?
> 
> If making the RMRRs fit inside the guest MMIO hole is not practical
> (for example, if the ranges occur very low in memory), then we'll have
> to come up with a way to specify, both to libxc and to qemu, where
> these  holes in memory are.
> 
> >>>>3.8 Xen: Handle devices sharing reserved regions
> > ----
> > Per the VT-d spec, it's possible for two devices to share the same
> > reserved region. Though we haven't seen such an example in reality, the
> > hypervisor needs to detect and handle this scenario, otherwise a
> > vulnerability may exist if the two devices are assigned to different VMs
> > (a malicious VM could program its assigned device to clobber the shared
> > region and thereby disrupt another VM's device).
> >
> > Ideally all devices sharing a reserved region should be assigned to a
> > single VM. However this can't be achieved solely in the hypervisor w/o
> > reworking the current device assignment interface. Assignment is managed
> > by the toolstack, so this would require exposing group sharing
> > information to userspace and then extending the toolstack to manage
> > assignment as a bundle.
> >
> > Given that the problem is so far only theoretical, we propose not to
> > support this scenario, i.e. to have the hypervisor fail the assignment
> > if the target device happens to share a reserved region with another
> > device, following [Guideline-4] to keep things simple.
> 
> I think denying it by default, first in the toolstack and as a
> fall-back in the hypervisor, is a good idea.
> 
> It shouldn't be too difficult, however, to add an option to override
> this.  We have a lot of individual users who use Xen for device
> pass-through; such advanced users should be allowed to "shoot
> themselves in the foot" if they want to.
> 
> Thoughts?
> 

that's also the option I'd like to keep. Instead of Xen enforcing the
strictest policy, it's better to warn about the problem but let the user
choose whether they want to move forward.
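
A rough sketch of the hypervisor-side check (struct/field names are
illustrative only -- the real RMRR bookkeeping in Xen differs):

    /* Refuse (or warn, if overridden) when another domain already owns
     * a device scoped by the same RMRR. */
    static bool rmrr_shared_with_other_domain(
        const struct acpi_rmrr_unit *rmrr, const struct domain *d)
    {
        const struct pci_dev *pdev;

        list_for_each_entry ( pdev, &rmrr->scope_devices, rmrr_list )
            if ( pdev->domain && pdev->domain != d )
                return true;    /* region already in use elsewhere */

        return false;
    }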

Thanks
Kevin


* Re: (v2) Design proposal for RMRR fix
  2015-01-08 12:54   ` George Dunlap
  2015-01-08 13:00     ` Jan Beulich
@ 2015-01-09  2:43     ` Tian, Kevin
  2015-01-12 11:25       ` George Dunlap
  1 sibling, 1 reply; 139+ messages in thread
From: Tian, Kevin @ 2015-01-09  2:43 UTC (permalink / raw)
  To: George Dunlap
  Cc: wei.liu2, ian.campbell, stefano.stabellini, tim, ian.jackson,
	xen-devel, Jan Beulich, Zhang, Yang Z, Chen, Tiejun

> From: George Dunlap
> Sent: Thursday, January 08, 2015 8:55 PM
> 
> On Thu, Jan 8, 2015 at 12:49 PM, George Dunlap
> <George.Dunlap@eu.citrix.com> wrote:
> > If RMRRs almost always happen up above 2G, for example, then a simple
> > solution that wouldn't require too much work would be to make sure
> > that the PCI MMIO hole we specify to libxc and to qemu-upstream is big
> > enough to include all RMRRs.  That would satisfy the libxc and qemu
> > requirements.
> >
> > If we then store specific RMRRs we want included in xenstore,
> > hvmloader can put them in the e820 map, and that would satisfy the
> > hvmloader requirement.
> 
> An alternate thing to do here would be to "properly" fix the
> qemu-upstream problem, by making a way for hvmloader to communicate
> changes in the gpfn layout to qemu.
> 
> Then hvmloader could do the work of moving memory under RMRRs to
> higher memory; and libxc wouldn't need to be involved at all.
> 
> I think it would also fix our long-standing issues with assigning PCI
> devices to qemu-upstream guests, which up until now have only been
> worked around.
> 

could you elaborate a bit on that long-standing issue?

Thanks
Kevin


* Re: (v2) Design proposal for RMRR fix
  2015-01-08 15:59   ` George Dunlap
  2015-01-08 16:10     ` Jan Beulich
@ 2015-01-09  2:49     ` Tian, Kevin
  1 sibling, 0 replies; 139+ messages in thread
From: Tian, Kevin @ 2015-01-09  2:49 UTC (permalink / raw)
  To: George Dunlap, Jan Beulich
  Cc: wei.liu2, ian.campbell, stefano.stabellini, tim, ian.jackson,
	xen-devel, Zhang, Yang Z, Chen, Tiejun

> From: George Dunlap
> Sent: Friday, January 09, 2015 12:00 AM
> 
> >>>>>3.3 Policies
> >> ----
> >> An intuitive thought is to fail immediately upon a conflict; however,
> >> that is not flexible given different requirements:
> >>
> >> a) it's not appropriate to fail the libxc domain builder just because of
> >> such a conflict. We still want the guest to boot even w/o the assigned
> >> device;
> >
> > I don't think that's right (and I believe this was discussed before):
> > When device assignment fails, VM creation should fail too. It is the
> > responsibility of the host admin in that case to remove some or all
> > of the to be assigned devices from the guest config.
> 
> Yes; basically, if we get to domain build time and we haven't reserved
> the RMRRs, then it's a bug in libxl (since it's apparently told Xen
> one thing and libxc another thing).  Having libxc fail in that case is
> a perfectly sensible thing to do.

I'd like a way to either throw a warning and then move forward, or to just
fail the device assignment (i.e. not expose the device to the guest) instead
of failing the whole guest, to mimic native behavior.

> 
> >> We propose report-all as the simple solution (different from the last
> >> sent version, which used report-sel), based on the facts below:
> >>
> >>   - 'warn' policy in user space makes report-all not harmful
> >>   - 'report-all' still means only a few entries in reality:
> >>     * RMRR reserved regions should be avoided or limited by platform
> >> designers, per VT-d specification;
> >>     * RMRR reserved regions are only a few on real platforms, per our
> >> current observations;
> >
> > Few yes, but in the IGD example you gave the region is quite large,
> > and it would be fairly odd to have all guests have a strange, large
> > hole in their address spaces. Furthermore remember that these
> > holes vary from machine to machine, so a migrateable guest would
> > needlessly end up having a hole potentially not helping subsequent
> > hotplug at all.
> 
> Yes, I think that by default VMs should have no RMRRs set up on domain
> creation.  The only way to get RMRRs in your address space should be
> to opt in at domain creation time (either by statically assigning
> devices, or by requesting your memory layout to mirror the host's).

ideally yes, but I'm weighing how bad it is to create some unnecessary
holes against the complexity of accurate control.

Thanks
Kevin


* Re: (v2) Design proposal for RMRR fix
  2015-01-08 18:02       ` George Dunlap
  2015-01-08 18:12         ` Pasi Kärkkäinen
@ 2015-01-09  3:12         ` Tian, Kevin
  2015-01-09  8:58         ` Jan Beulich
  2015-01-09 20:27         ` Konrad Rzeszutek Wilk
  3 siblings, 0 replies; 139+ messages in thread
From: Tian, Kevin @ 2015-01-09  3:12 UTC (permalink / raw)
  To: George Dunlap, Jan Beulich
  Cc: wei.liu2, ian.campbell, stefano.stabellini, tim, ian.jackson,
	xen-devel, Zhang, Yang Z, Chen, Tiejun

> From: George Dunlap
> Sent: Friday, January 09, 2015 2:02 AM
> 
> On Thu, Jan 8, 2015 at 4:10 PM, Jan Beulich <JBeulich@suse.com> wrote:
> >>>> On 08.01.15 at 16:59, <dunlapg@umich.edu> wrote:
> >> On Thu, Jan 8, 2015 at 1:54 PM, Jan Beulich <JBeulich@suse.com> wrote:
> >>>> the 1st invocation of this interface will save all reported reserved
> >>>> regions under domain structure, and later invocation (e.g. from
> >>>> hvmloader) gets saved content.
> >>>
> >>> Why would the reserved regions need attaching to the domain
> >>> structure? The combination of (to be) assigned devices and
> >>> global RMRR list always allow reproducing the intended set of
> >>> regions without any extra storage.
> >>
> >> So when you say "(to be) assigned devices", you mean any device which
> >> is currently assigned, *or may be assigned at some point in the
> >> future*?
> >
> > Yes.
> >
> >> Do you think the extra storage for "this VM might possibly be assigned
> >> this device at some point" wouldn't really be that much bigger than
> >> "this VM might possibly map this RMRR at some point in the future"?
> >
> > Since listing devices without RMRR association would be pointless,
> > I think a list of devices would require less storage. But see below.
> >
> >> It seems a lot cleaner to me to have the toolstack tell Xen what
> >> ranges are reserved for RMRR per VM, and then have Xen check again
> >> when assigning a device to make sure that the RMRRs have already been
> >> reserved.
> >
> > With an extra level of what can be got wrong by the admin.
> > However, I now realize that doing it this way would allow
> > specifying regions not associated with any device on the host
> > the guest boots on, but associated with one on a host the guest
> > may later migrate to.
> 
> I did say the toolstack, not the admin. :-)
> 
> At the xl level, I envisioned a single boolean that would say, "Make
> my memory layout resemble the host system" -- so the MMIO hole would
> be the same size, and all the RMRRs would be reserved.
> 
> But xapi, for instance, has a concept of "hardware pools" containing
> individual hardware devices, which can be assigned to VMs.  You could
> imagine a toolstack like xapi keeping track of all devices which
> *might be* assigned to a guest, and supplying Xen with the RMRRs.  As
> you say, then this could include hardware across a pool of hosts, with
> the RMRRs of any device in the system reserved.

you don't need to tell Xen which RMRRs (even from other hosts) are to be
reserved. What Xen needs to do is simple, i.e. set up the identity mapping
at device assignment time, and detect any conflicts for the requested gfns.
All the flexibility/preparation/reservation is userspace work between
libxl and hvmloader.

> 
> Alternately, could the toolstack be responsible for making sure that
> nobody uses such a range; and then, when a device is assigned,
> Xen can check to make sure that the gpfn space is empty before adding
> the RMRRs?  That might be the most flexible.
> 

I think that's the currently proposed way, right? Xen only does the check,
and if there is no conflict it sets up the identity mapping. Most of the
changes/open questions are in user space: what/how to reserve the ranges
and avoid conflicts.
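
Roughly, the libxc side would then just skip the reserved pfns when
populating guest RAM -- a sketch (is_reserved_pfn() is a made-up helper
over the regions returned by the new hypercall, and batching is omitted
for clarity):

    xen_pfn_t pfn;
    int rc = 0;

    for ( pfn = 0; pfn < nr_lowmem_pages; pfn++ )
    {
        if ( is_reserved_pfn(pfn) )
            continue;               /* leave a hole in gfn space */

        rc = xc_domain_populate_physmap_exact(xch, domid, 1, 0, 0, &pfn);
        if ( rc )
            break;
    }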

Thanks
Kevin


* Re: (v2) Design proposal for RMRR fix
  2015-01-08 18:02       ` George Dunlap
  2015-01-08 18:12         ` Pasi Kärkkäinen
  2015-01-09  3:12         ` Tian, Kevin
@ 2015-01-09  8:58         ` Jan Beulich
  2015-01-09 20:27         ` Konrad Rzeszutek Wilk
  3 siblings, 0 replies; 139+ messages in thread
From: Jan Beulich @ 2015-01-09  8:58 UTC (permalink / raw)
  To: George Dunlap
  Cc: Kevin Tian, wei.liu2, ian.campbell, stefano.stabellini, tim,
	ian.jackson, xen-devel, Yang Z Zhang, Tiejun Chen

>>> On 08.01.15 at 19:02, <George.Dunlap@eu.citrix.com> wrote:
> On Thu, Jan 8, 2015 at 4:10 PM, Jan Beulich <JBeulich@suse.com> wrote:
>>>>> On 08.01.15 at 16:59, <dunlapg@umich.edu> wrote:
>>> It seems a lot cleaner to me to have the toolstack tell Xen what
>>> ranges are reserved for RMRR per VM, and then have Xen check again
>>> when assigning a device to make sure that the RMRRs have already been
>>> reserved.
>>
>> With an extra level of what can be got wrong by the admin.
>> However, I now realize that doing it this way would allow
>> specifying regions not associated with any device on the host
>> the guest boots on, but associated with one on a host the guest
>> may later migrate to.
> 
> I did say the toolstack, not the admin. :-)

You did, but the tool stack needs to take its knowledge from
somewhere, which I implied to be the guest config.

> At the xl level, I envisioned a single boolean that would say, "Make
> my memory layout resemble the host system" -- so the MMIO hole would
> be the same size, and all the RMRRs would be reserved.

Right, that's an option where things can't go wrong, but not allowing
maximum flexibility (as mentioned before when considering migration).

> But xapi, for instance, has a concept of "hardware pools" containing
> individual hardware devices, which can be assigned to VMs.  You could
> imagine a toolstack like xapi keeping track of all devices which
> *might be* assigned to a guest, and supplying Xen with the RMRRs.  As
> you say, then this could include hardware across a pool of hosts, with
> the RMRRs of any device in the system reserved.
> 
> Alternately, could the toolstack be responsible for making sure that
> nobody uses such a range; and then, when a device is assigned,
> Xen can check to make sure that the gpfn space is empty before adding
> the RMRRs?  That might be the most flexible.

"such a range" is pretty unspecific here: The main question is where the
tool stack would get the information on the set of ranges from. Of course
if it has this information, it can make sure no guest (unless marked
otherwise) uses any of these ranges. The hypervisor side verification
needs to be done in any case.

Jan


* Re: (v2) Design proposal for RMRR fix
  2015-01-09  2:27   ` Tian, Kevin
@ 2015-01-09  9:21     ` Jan Beulich
  2015-01-09 10:10       ` Tian, Kevin
  0 siblings, 1 reply; 139+ messages in thread
From: Jan Beulich @ 2015-01-09  9:21 UTC (permalink / raw)
  To: Kevin Tian
  Cc: wei.liu2, ian.campbell, stefano.stabellini, tim, ian.jackson,
	xen-devel, Yang Z Zhang, Tiejun Chen

>>> On 09.01.15 at 03:27, <kevin.tian@intel.com> wrote:
>>  From: Jan Beulich [mailto:JBeulich@suse.com]
>> Sent: Thursday, January 08, 2015 9:55 PM
>> >>> On 26.12.14 at 12:23, <kevin.tian@intel.com> wrote:
>> > 3) hvmloader
>> >   Hvmloader allocates other resources (ACPI, PCI MMIO, etc.) and
>> > internal data structures in gfn space, and it creates the final guest
>> > e820. So hvmloader also needs to detect conflictions when conducting
>> > those operations. If there's no confliction, hvmloader will reserve
>> > those regions in guest e820 to let guest OS aware.
>> 
>> Ideally, rather than detecting conflicts, hvmloader would just
>> consume what libxc set up. Obviously that would require awareness
>> in libxc of things it currently doesn't care about (like fitting PCI BARs
>> into the MMIO hole, enlarging it as necessary). I admit that this may
>> end up being difficult to implement. Another alternative would be to
>> have libxc only populate a limited part of RAM (for hvmloader to be
>> loadable), and have hvmloader do the bulk of the populating.
> 
> there are quite a few allocations which are best done in hvmloader, such
> as ACPI, PCI BARs, and other hole allocations. Some of them are for
> hvmloader's own usage, and others are related to the guest BIOS. I don't
> think the mass refactoring of moving those allocations to libxc is
> worthwhile just for this very specific task. As long as hvmloader still
> needs to allocate gfns, it needs to keep conflict detection logic itself.

Allocations done by hvmloader don't need to look for conflicts.
All hvmloader needs to make sure is that what it allocates is actually RAM.
Its not doing so today is already a (latent) bug. The thing is that
if libxc sets up a proper, fully correct memory map for hvmloader
to consume, hvmloader doesn't need to do anything else than
play by this memory map.
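
I.e. hvmloader's allocator check could be as simple as the following
(illustrative only; "memory_map" stands for whatever structure libxc
hands over):

    /* An allocation is valid iff it lies fully inside an E820_RAM
     * entry of the memory map provided by libxc. */
    static int is_ram(uint64_t start, uint64_t size)
    {
        unsigned int i;

        for ( i = 0; i < memory_map.nr_map; i++ )
            if ( memory_map.map[i].type == E820_RAM &&
                 start >= memory_map.map[i].addr &&
                 start + size <= memory_map.map[i].addr +
                                 memory_map.map[i].size )
                return 1;

        return 0;
    }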

>> >>>>3.3 Policies
>> > ----
>> > An intuitive thought is to fail immediately upon a conflict; however,
>> > that is not flexible given different requirements:
>> >
>> > a) it's not appropriate to fail the libxc domain builder just because of
>> > such a conflict. We still want the guest to boot even w/o the assigned
>> > device;
>> 
>> I don't think that's right (and I believe this was discussed before):
>> When device assignment fails, VM creation should fail too. It is the
>> responsibility of the host admin in that case to remove some or all
>> of the to be assigned devices from the guest config.
> 
> Think about bare metal. If a device, say a NIC, doesn't work, would the
> platform refuse to work at all? There could be errors, but their scope
> is limited to the specific function. The user can still use a platform w/
> errors as long as the related functions are not used.
> 
> Similarly we should allow the domain builder to move forward upon a
> device assignment failure (something like a circuit error when powering
> the device), and the user will notice the problem when using the device
> (it is either not present or not functioning correctly).
> 
> Same thing for the hotplug usage: all the detection for future hotplug
> is just preparation and not strict. You don't want to hang
> a platform just because it's not suitable for hotplugging some device in
> the future.

Hotplug is something that can fail, and surely shouldn't lead to a
hung guest. Very similar to hotplug on bare metal indeed.

Boot time device assignment is different: The question isn't whether
an assigned device works, instead the proper analogy is whether a
device is _present_. If a device doesn't work on bare metal, it will
still be discoverable. Yet if device assignment fails, that's not going
to be the case - for security reasons, the guest would not see any
notion of the device.

The "device does not work" analogy to bare metal would only apply
if the device's presence would prevent the system from booting (in
which case you'd have to physically remove it from the system, just
like for the virtualized case you'd have to remove it from the guest
config).

>> > We propose report-all as the simple solution (different from the last
>> > sent version, which used report-sel), based on the facts below:
>> >
>> >   - 'warn' policy in user space makes report-all not harmful
>> >   - 'report-all' still means only a few entries in reality:
>> >     * RMRR reserved regions should be avoided or limited by platform
>> > designers, per VT-d specification;
>> >     * RMRR reserved regions are only a few on real platforms, per our
>> > current observations;
>> 
>> Few yes, but in the IGD example you gave the region is quite large,
>> and it would be fairly odd to have all guests have a strange, large
>> hole in their address spaces. Furthermore remember that these
>> holes vary from machine to machine, so a migrateable guest would
>> needlessly end up having a hole potentially not helping subsequent
>> hotplug at all.
> 
> it's not strange, since it never exceeds the set on bare metal, but yes,
> migration raises another interesting point. Currently I don't think
> migration w/ assigned devices is supported. But even considering the
> future possibility, there's always a limitation, since whatever reserved
> regions are created at boot time in the e820 are static and can't adapt
> to dynamic device changes. For hotplug or migration, you always
> suffer from seeing some holes which may not be relevant at a given
> moment.

The question isn't about migrating with devices assigned, but about
assigning devices after migration (consider a dual vif + SR-IOV NIC
guest setup where the SR-IOV NIC gets hot-removed before
migration and a new one hot-plugged afterwards).

Furthermore any tying of the guest memory layout to the host's
where the guest first boots is awkward, as post-migration there's
not going to be any reliable correlation between the guest layout
and the new host's.

>> > In this way, there are two situations libxc domain builder may request
>> > to query reserved region information w/ same interface:
>> >
>> > a) if any statically-assigned devices, and/or
>> > b) if a new parameter is specified, asking for hotplug preparation
>> > 	('rdm_check' or 'prepare_hotplug'?)
>> >
>> > the 1st invocation of this interface will save all reported reserved
>> > regions under domain structure, and later invocation (e.g. from
>> > hvmloader) gets saved content.
>> 
>> Why would the reserved regions need attaching to the domain
>> structure? The combination of (to be) assigned devices and
>> global RMRR list always allow reproducing the intended set of
>> regions without any extra storage.
> 
> it's possible that a new device is plugged into the host between two
> adjacent invocations, and inconsistent information would be returned
> that way.

Can hot-plugged devices indeed be associated with RMRRs? This
would seem like a contradiction to me, since RMRRs are specifically
there to aid boot time compatibility issues.

Jan


* Re: (v2) Design proposal for RMRR fix
  2015-01-09  2:29     ` Tian, Kevin
@ 2015-01-09  9:24       ` Jan Beulich
  2015-01-09 10:03         ` Tian, Kevin
  0 siblings, 1 reply; 139+ messages in thread
From: Jan Beulich @ 2015-01-09  9:24 UTC (permalink / raw)
  To: Kevin Tian
  Cc: wei.liu2, ian.campbell, stefano.stabellini, George Dunlap, tim,
	ian.jackson, xen-devel, Yang Z Zhang, Tiejun Chen

>>> On 09.01.15 at 03:29, <kevin.tian@intel.com> wrote:
>>  From: Jan Beulich [mailto:JBeulich@suse.com]
>> Sent: Thursday, January 08, 2015 8:59 PM
>> 
>> >>> On 08.01.15 at 13:49, <George.Dunlap@eu.citrix.com> wrote:
>> > One question: where are these RMRRs typically located in memory?  Are
>> > they normally up in the MMIO region?  Or can they occur anywhere (even
>> > in really low areas, say, under 1GiB)?
>> 
>> They would typically sit in the MMIO hole or below 1Mb; that latter case
>> is particularly problematic as it might conflict with what we want to put
>> there (BIOS etc).
>> 
> 
and the latter case is not solvable, which then relates to the other
discussion about whether we want to fail in such a case.

That latter case is partially solvable: The BIOS put below 1Mb has a
permanent and a transient part. Dealing with the transient part
overlapping an RMRR ought to be possible (e.g. by delaying the
actual device assignment until the point where hvmloader knows it
is safe to do). An overlap of the permanent part with an RMRR is of
course fatal.
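
(For reference, the contended sub-1Mb area of the guest looks roughly
like this -- so an RMRR overlapping the last range cannot be worked
around:

    0x00000-0x9FFFF  conventional RAM
    0xA0000-0xBFFFF  VGA
    0xC0000-0xEFFFF  option ROMs / transient BIOS data
    0xF0000-0xFFFFF  permanent BIOS ROM)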

Jan


* Re: (v2) Design proposal for RMRR fix
  2015-01-09  9:24       ` Jan Beulich
@ 2015-01-09 10:03         ` Tian, Kevin
  0 siblings, 0 replies; 139+ messages in thread
From: Tian, Kevin @ 2015-01-09 10:03 UTC (permalink / raw)
  To: Jan Beulich
  Cc: wei.liu2, ian.campbell, stefano.stabellini, George Dunlap, tim,
	ian.jackson, xen-devel, Zhang, Yang Z, Chen, Tiejun

> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Friday, January 09, 2015 5:24 PM
> 
> >>> On 09.01.15 at 03:29, <kevin.tian@intel.com> wrote:
> >>  From: Jan Beulich [mailto:JBeulich@suse.com]
> >> Sent: Thursday, January 08, 2015 8:59 PM
> >>
> >> >>> On 08.01.15 at 13:49, <George.Dunlap@eu.citrix.com> wrote:
> >> > One question: where are these RMRRs typically located in memory?
> Are
> >> > they normally up in the MMIO region?  Or can they occur anywhere
> (even
> >> > in really low areas, say, under 1GiB)?
> >>
> >> They would typically sit in the MMIO hole or below 1Mb; that latter case
> >> is particularly problematic as it might conflict with what we want to put
> >> there (BIOS etc).
> >>
> >
> > and the latter case is not solvable, which then relates to the other
> > discussion about whether we want to fail in such a case.
> 
> That latter case is partially solvable: The BIOS put below 1Mb has a
> permanent and a transient part. Dealing with the transient part
> overlapping an RMRR ought to be possible (e.g. by delaying the
> actual device assignment until the point where hvmloader knows it
> is safe to do). An overlap of the permanent part with an RMRR is of
> course fatal.
> 

yes, but that type of change is tricky, and since it's only a partial
solution it's not worth doing. :-)

Thanks
Kevin


* Re: (v2) Design proposal for RMRR fix
  2015-01-09  9:21     ` Jan Beulich
@ 2015-01-09 10:10       ` Tian, Kevin
  2015-01-09 10:35         ` Jan Beulich
  0 siblings, 1 reply; 139+ messages in thread
From: Tian, Kevin @ 2015-01-09 10:10 UTC (permalink / raw)
  To: Jan Beulich
  Cc: wei.liu2, ian.campbell, stefano.stabellini, tim, ian.jackson,
	xen-devel, Zhang, Yang Z, Chen, Tiejun

> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Friday, January 09, 2015 5:21 PM
> 
> >>> On 09.01.15 at 03:27, <kevin.tian@intel.com> wrote:
> >>  From: Jan Beulich [mailto:JBeulich@suse.com]
> >> Sent: Thursday, January 08, 2015 9:55 PM
> >> >>> On 26.12.14 at 12:23, <kevin.tian@intel.com> wrote:
> >> > 3) hvmloader
> >> >   Hvmloader allocates other resources (ACPI, PCI MMIO, etc.) and
> >> > internal data structures in gfn space, and it creates the final guest
> >> > e820. So hvmloader also needs to detect conflictions when conducting
> >> > those operations. If there's no confliction, hvmloader will reserve
> >> > those regions in guest e820 to let guest OS aware.
> >>
> >> Ideally, rather than detecting conflicts, hvmloader would just
> >> consume what libxc set up. Obviously that would require awareness
> >> in libxc of things it currently doesn't care about (like fitting PCI BARs
> >> into the MMIO hole, enlarging it as necessary). I admit that this may
> >> end up being difficult to implement. Another alternative would be to
> >> have libxc only populate a limited part of RAM (for hvmloader to be
> >> loadable), and have hvmloader do the bulk of the populating.
> >
> > there are quite a few allocations which are best done in hvmloader, such
> > as ACPI, PCI BARs, and other hole allocations. Some of them are for
> > hvmloader's own usage, and others are related to the guest BIOS. I don't
> > think the mass refactoring of moving those allocations to libxc is
> > worthwhile just for this very specific task. As long as hvmloader still
> > needs to allocate gfns, it needs to keep conflict detection logic itself.
> 
> Allocations done by hvmloader don't need to look for conflicts.
> All hvmloader needs to make sure is that what it allocates is actually RAM.
> Its not doing so today is already a (latent) bug. The thing is that
> if libxc sets up a proper, fully correct memory map for hvmloader
> to consume, hvmloader doesn't need to do anything else than
> play by this memory map.

for RAM I think we're aligned. With your earlier suggestion I think we can
achieve the goal. :-)

> 
> >> >>>>3.3 Policies
> >> > ----
> >> > An intuitive thought is to fail immediately upon a conflict; however,
> >> > that is not flexible given different requirements:
> >> >
> >> > a) it's not appropriate to fail the libxc domain builder just because of
> >> > such a conflict. We still want the guest to boot even w/o the assigned
> >> > device;
> >>
> >> I don't think that's right (and I believe this was discussed before):
> >> When device assignment fails, VM creation should fail too. It is the
> >> responsibility of the host admin in that case to remove some or all
> >> of the to be assigned devices from the guest config.
> >
> > Think about bare metal. If a device, say a NIC, doesn't work, would the
> > platform refuse to work at all? There could be errors, but their scope
> > is limited to the specific function. The user can still use a platform w/
> > errors as long as the related functions are not used.
> >
> > Similarly we should allow the domain builder to move forward upon a
> > device assignment failure (something like a circuit error when powering
> > the device), and the user will notice the problem when using the device
> > (it is either not present or not functioning correctly).
> >
> > Same thing for the hotplug usage: all the detection for future hotplug
> > is just preparation and not strict. You don't want to hang
> > a platform just because it's not suitable for hotplugging some device in
> > the future.
> 
> Hotplug is something that can fail, and surely shouldn't lead to a
> hung guest. Very similar to hotplug on bare metal indeed.
> 
> Boot time device assignment is different: The question isn't whether
> an assigned device works, instead the proper analogy is whether a
> device is _present_. If a device doesn't work on bare metal, it will
> still be discoverable. Yet if device assignment fails, that's not going
> to be the case - for security reasons, the guest would not see any
> notion of the device.

the question is whether we want such a device assignment to fail due to
an RMRR conflict, and the failure decision should be made when Xen handles
the actual assignment instead of when the domain builder prepares the
reserved regions.

> 
> The "device does not work" analogy to bare metal would only apply
> if the device's presence would prevent the system from booting (in
> which case you'd have to physically remove it from the system, just
> like for the virtualized case you'd have to remove it from the guest
> config).
> 
> >> > We propose report-all as the simple solution (different from the last
> >> > sent version, which used report-sel), based on the facts below:
> >> >
> >> >   - 'warn' policy in user space makes report-all not harmful
> >> >   - 'report-all' still means only a few entries in reality:
> >> >     * RMRR reserved regions should be avoided or limited by platform
> >> > designers, per VT-d specification;
> >> >     * RMRR reserved regions are only a few on real platforms, per our
> >> > current observations;
> >>
> >> Few yes, but in the IGD example you gave the region is quite large,
> >> and it would be fairly odd to have all guests have a strange, large
> >> hole in their address spaces. Furthermore remember that these
> >> holes vary from machine to machine, so a migrateable guest would
> >> needlessly end up having a hole potentially not helping subsequent
> >> hotplug at all.
> >
> > it's not strange, since it never exceeds the set on bare metal, but yes,
> > migration raises another interesting point. Currently I don't think
> > migration w/ assigned devices is supported. But even considering the
> > future possibility, there's always a limitation, since whatever reserved
> > regions are created at boot time in the e820 are static and can't adapt
> > to dynamic device changes. For hotplug or migration, you always
> > suffer from seeing some holes which may not be relevant at a given
> > moment.
> 
> The question isn't about migrating with devices assigned, but about
> assigning devices after migration (consider a dual vif + SR-IOV NIC
> guest setup where the SR-IOV NIC gets hot-removed before
> migration and a new one hot-plugged afterwards).
> 
> Furthermore any tying of the guest memory layout to the host's
> where the guest first boots is awkward, as post-migration there's
> not going to be any reliable correlation between the guest layout
> and the new host's.

how can you solve this? Like the above example: a NIC on node-A leaves
a reserved region in the guest e820. Now it's hot-removed and the guest
is then migrated to node-B. There's no way to update the e820 again since
it's a boot-time-only structure, so the user will still see such awkward
regions. Since this is not avoidable, report-all in the summary mail
doesn't look like it causes a new problem.

> 
> >> > In this way, there are two situations libxc domain builder may request
> >> > to query reserved region information w/ same interface:
> >> >
> >> > a) if any statically-assigned devices, and/or
> >> > b) if a new parameter is specified, asking for hotplug preparation
> >> > 	('rdm_check' or 'prepare_hotplug'?)
> >> >
> >> > the 1st invocation of this interface will save all reported reserved
> >> > regions under domain structure, and later invocation (e.g. from
> >> > hvmloader) gets saved content.
> >>
> >> Why would the reserved regions need attaching to the domain
> >> structure? The combination of (to be) assigned devices and
> >> global RMRR list always allow reproducing the intended set of
> >> regions without any extra storage.
> >
> > it's possible that a new device is plugged into the host between two
> > adjacent invocations, and inconsistent information would be returned
> > that way.
> 
> Can hot-plugged devices indeed be associated with RMRRs? This
> would seem like a contradiction to me, since RMRRs are specifically
> there to aid boot time compatibility issues.
> 

you're right. It's only for boot time compatibility, so please ignore it. :-)

Thanks
Kevin


* Re: (v2) Design proposal for RMRR fix
  2015-01-09 10:10       ` Tian, Kevin
@ 2015-01-09 10:35         ` Jan Beulich
  2015-01-12  8:46           ` Tian, Kevin
  0 siblings, 1 reply; 139+ messages in thread
From: Jan Beulich @ 2015-01-09 10:35 UTC (permalink / raw)
  To: Kevin Tian
  Cc: wei.liu2, ian.campbell, stefano.stabellini, tim, ian.jackson,
	xen-devel, Yang Z Zhang, Tiejun Chen

>>> On 09.01.15 at 11:10, <kevin.tian@intel.com> wrote:
>>  From: Jan Beulich [mailto:JBeulich@suse.com]
>> Boot time device assignment is different: The question isn't whether
>> an assigned device works, instead the proper analogy is whether a
>> device is _present_. If a device doesn't work on bare metal, it will
>> still be discoverable. Yet if device assignment fails, that's not going
>> to be the case - for security reasons, the guest would not see any
>> notion of the device.
> 
> the question is whether we want such a device assignment to fail due to
> an RMRR conflict, and the failure decision should be made when Xen handles
> the actual assignment instead of when the domain builder prepares the
> reserved regions.

Detecting the failure only in the hypervisor has the downside of
potentially leaving the user with few clues as to what went wrong.
Sending messages to the hypervisor log in that case is
questionable, yet the tool stack (namely libxc) is known to not
always do a good job in error propagation.

>> The question isn't about migrating with devices assigned, but about
>> assigning devices after migration (consider a dual vif + SR-IOV NIC
>> guest setup where the SR-IOV NIC gets hot-removed before
>> migration and a new one hot-plugged afterwards).
>> 
>> Furthermore any tying of the guest memory layout to the host's
>> where the guest first boots is awkward, as post-migration there's
>> not going to be any reliable correlation between the guest layout
>> and the new host's.
> 
> > how can you solve this? Like the above example: a NIC on node-A leaves
> > a reserved region in the guest e820. Now it's hot-removed and the guest
> > is then migrated to node-B. There's no way to update the e820 again
> > since it's a boot-time-only structure, so the user will still see such
> > awkward regions. Since this is not avoidable, report-all in the summary
> > mail doesn't look like it causes a new problem.

The solution to this is to have reserved regions specified in the guest
config, independent of host characteristics.
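
(Hypothetically, something like the following in the guest config --
illustrative syntax only, no such option exists today:

    # reserve fixed gfn ranges regardless of the host booted on
    rdm = [ "start=0xab80a000,size=0x14000" ]
)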

Jan


* Re: (v2) Design proposal for RMRR fix
  2015-01-09  0:53   ` Tian, Kevin
@ 2015-01-09 12:00     ` Andrew Cooper
  0 siblings, 0 replies; 139+ messages in thread
From: Andrew Cooper @ 2015-01-09 12:00 UTC (permalink / raw)
  To: Tian, Kevin, Tim Deegan
  Cc: wei.liu2, ian.campbell, stefano.stabellini, ian.jackson,
	xen-devel, Jan Beulich, Zhang, Yang Z, Chen, Tiejun

On 09/01/15 00:53, Tian, Kevin wrote:
>> From: Tim Deegan [mailto:tim@xen.org]
>> Sent: Thursday, January 08, 2015 8:32 PM
>>
>> Hi Kevin,
>>
>> Thanks for sending out this design document.  I think Jan will have
>> the most to say about this.  Looking just at the hypervisor side of
>> things, and leaving the tools desig to others...
>>
>> At 11:23 +0000 on 26 Dec (1419589382), Tian, Kevin wrote:
>>> c) in Xen hypervisor it is reasonable to fail upon confliction, where
>>> device is actually assigned. But due to the same requirement on USB
>>> controller, sometimes we might want it succeed just w/ warnings.
>> Can you explain more concretely why we would want to allow assignment
>> whn the RMRR setup fails?  It seems like the device's use of the RMRR
>> will (at least) corrupt the OS and (possibly) make the device itself
>> fail.
> For USB devices, the RMRR is used only in the early-boot phase, as a way
> of communicating between the legacy keyboard driver and the BIOS. That
> emulation mode will be disabled either at some ACPI initialization step or
> when the USB keyboard driver is loaded (sorry, I don't remember the
> details). So when such a device is assigned to a guest, there's no legacy
> emulation, and not setting up the identity mapping is not a critical
> issue. If we think such a usage case is a valid one, 'fail' would cause a
> regression there.

XenServer has observed OEMs using RMRRs for stats reporting back to the BMC.

It is not safe to assume that a USB device with an RMRR would only ever
be using it for legacy keyboard emulation.  It is also not safe to
assume that legacy keyboard emulation might not resume at some point in
the future, although I would hope that it would not.

Therefore, I feel that USB controllers should not be permitted a special
case compared to other devices.

~Andrew


* Re: (v2) Design proposal for RMRR fix
  2015-01-08 18:02       ` George Dunlap
                           ` (2 preceding siblings ...)
  2015-01-09  8:58         ` Jan Beulich
@ 2015-01-09 20:27         ` Konrad Rzeszutek Wilk
  2015-01-12  9:44           ` Tian, Kevin
  2015-01-12 12:12           ` Ian Campbell
  3 siblings, 2 replies; 139+ messages in thread
From: Konrad Rzeszutek Wilk @ 2015-01-09 20:27 UTC (permalink / raw)
  To: George Dunlap
  Cc: Kevin Tian, wei.liu2, ian.campbell, stefano.stabellini,
	ian.jackson, tim, xen-devel, Jan Beulich, Yang Z Zhang,
	Tiejun Chen

On Thu, Jan 08, 2015 at 06:02:04PM +0000, George Dunlap wrote:
> On Thu, Jan 8, 2015 at 4:10 PM, Jan Beulich <JBeulich@suse.com> wrote:
> >>>> On 08.01.15 at 16:59, <dunlapg@umich.edu> wrote:
> >> On Thu, Jan 8, 2015 at 1:54 PM, Jan Beulich <JBeulich@suse.com> wrote:
> >>>> the 1st invocation of this interface will save all reported reserved
> >>>> regions under domain structure, and later invocation (e.g. from
> >>>> hvmloader) gets saved content.
> >>>
> >>> Why would the reserved regions need attaching to the domain
> >>> structure? The combination of (to be) assigned devices and
> >>> global RMRR list always allow reproducing the intended set of
> >>> regions without any extra storage.
> >>
> >> So when you say "(to be) assigned devices", you mean any device which
> >> is currently assigned, *or may be assigned at some point in the
> >> future*?
> >
> > Yes.
> >
> >> Do you think the extra storage for "this VM might possibly be assigned
> >> this device at some point" wouldn't really be that much bigger than
> >> "this VM might possibly map this RMRR at some point in the future"?
> >
> > Since listing devices without RMRR association would be pointless,
> > I think a list of devices would require less storage. But see below.
> >
> >> It seems a lot cleaner to me to have the toolstack tell Xen what
> >> ranges are reserved for RMRR per VM, and then have Xen check again
> >> when assigning a device to make sure that the RMRRs have already been
> >> reserved.
> >
> > With an extra level of what can be got wrong by the admin.
> > However, I now realize that doing it this way would allow
> > specifying regions not associated with any device on the host
> > the guest boots on, but associated with one on a host the guest
> > may later migrate to.
> 
> I did say the toolstack, not the admin. :-)
> 
> At the xl level, I envisioned a single boolean that would say, "Make
> my memory layout resemble the host system" -- so the MMIO hole would
> be the same size, and all the RMRRs would be reserved.

Like the e820_host=1 ? :-)

> 
> But xapi, for instance, has a concept of "hardware pools" containing
> individual hardware devices, which can be assigned to VMs.  You could
> imagine a toolstack like xapi keeping track of all devices which
> *might be* assigned to a guest, and supplying Xen with the RMRRs.  As
> you say, then this could include hardware across a pool of hosts, with
> the RMRRs of any device in the system reserved.
> 
> > Alternately, could the toolstack be responsible for making sure that
> > nobody uses such a range; and then, when a device is assigned,
> > Xen can check to make sure that the gpfn space is empty before adding
> > the RMRRs?  That might be the most flexible.
> 
>  -George
> 


* Re: (v2) Design proposal for RMRR fix
  2015-01-09 10:35         ` Jan Beulich
@ 2015-01-12  8:46           ` Tian, Kevin
  2015-01-12  9:32             ` Jan Beulich
  0 siblings, 1 reply; 139+ messages in thread
From: Tian, Kevin @ 2015-01-12  8:46 UTC (permalink / raw)
  To: Jan Beulich
  Cc: wei.liu2, ian.campbell, stefano.stabellini, tim, ian.jackson,
	xen-devel, Zhang, Yang Z, Chen, Tiejun

> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Friday, January 09, 2015 6:35 PM
> 
> >>> On 09.01.15 at 11:10, <kevin.tian@intel.com> wrote:
> >>  From: Jan Beulich [mailto:JBeulich@suse.com]
> >> Boot time device assignment is different: The question isn't whether
> >> an assigned device works, instead the proper analogy is whether a
> >> device is _present_. If a device doesn't work on bare metal, it will
> >> still be discoverable. Yet if device assignment fails, that's not going
> >> to be the case - for security reasons, the guest would not see any
> >> notion of the device.
> >
> > the question is whether we want such device assignment fail due to
> > RMRR confliction, and the fail decision should be when Xen handles
> > actual assignment instead of when domain builder prepares reserved
> > regions.
> 
> Detecting the failure only in the hypervisor has the downside of
> potentially leaving the user with few clues as to what went wrong.
> Sending messages to the hypervisor log in that case is
> questionable, yet the tool stack (namely libxc) is known to not
> always do a good job in error propagation.
> 
> >> The question isn't about migrating with devices assigned, but about
> >> assigning devices after migration (consider a dual vif + SR-IOV NIC
> >> guest setup where the SR-IOV NIC gets hot-removed before
> >> migration and a new one hot-plugged afterwards).
> >>
> >> Furthermore any tying of the guest memory layout to the host's
> >> where the guest first boots is awkward, as post-migration there's
> >> not going to be any reliable correlation between the guest layout
> >> and the new host's.
> >
> > how can you solve this? like above example, a NIC on node-A leaves
> > a reserved region in guest e820. now it's hot-removed and then
> > migrated to node-b. there's no way to update e820 again since it's
> > only boot structure. then user will still see such awkward regions.
> > since it's not avoidable, report-all in the summary mail looks not
> > causing a new problem.
> 
> The solution to this are reserved regions specified in the guest config,
> independent of host characteristics.
> 

I don't think how the reserved regions are specified matters here. My
point is that when a region is reserved in the e820 at boot time, there's
no way to erase that knowledge from the guest even when the devices
causing that reservation are hot-removed later.

Thanks
Kevin


* Re: (v2) Design proposal for RMRR fix
  2015-01-12  8:46           ` Tian, Kevin
@ 2015-01-12  9:32             ` Jan Beulich
  2015-01-12  9:41               ` Tian, Kevin
  0 siblings, 1 reply; 139+ messages in thread
From: Jan Beulich @ 2015-01-12  9:32 UTC (permalink / raw)
  To: Kevin Tian
  Cc: wei.liu2, ian.campbell, stefano.stabellini, tim, ian.jackson,
	xen-devel, Yang Z Zhang, Tiejun Chen

>>> On 12.01.15 at 09:46, <kevin.tian@intel.com> wrote:
>>  From: Jan Beulich [mailto:JBeulich@suse.com]
>> Sent: Friday, January 09, 2015 6:35 PM
>> >>> On 09.01.15 at 11:10, <kevin.tian@intel.com> wrote:
>> >>  From: Jan Beulich [mailto:JBeulich@suse.com]
>> >> The question isn't about migrating with devices assigned, but about
>> >> assigning devices after migration (consider a dual vif + SR-IOV NIC
>> >> guest setup where the SR-IOV NIC gets hot-removed before
>> >> migration and a new one hot-plugged afterwards).
>> >>
>> >> Furthermore any tying of the guest memory layout to the host's
>> >> where the guest first boots is awkward, as post-migration there's
>> >> not going to be any reliable correlation between the guest layout
>> >> and the new host's.
>> >
>> > how can you solve this? Like the above example: a NIC on node-A leaves
>> > a reserved region in the guest e820. Now it's hot-removed and the guest
>> > is then migrated to node-B. There's no way to update the e820 again
>> > since it's a boot-time-only structure, so the user will still see such
>> > awkward regions. Since this is not avoidable, report-all in the summary
>> > mail doesn't look like it causes a new problem.
>> 
>> The solution to this are reserved regions specified in the guest config,
>> independent of host characteristics.
> 
> I don't think how the reserved regions are specified matters here. My
> point is that when a region is reserved in the e820 at boot time, there's
> no way to erase that knowledge from the guest even when the devices
> causing that reservation are hot-removed later.

I don't think anyone ever indicated that such erasure would be
needed/wanted - I'm not sure how you ended up there.

Jan


* Re: (v2) Design proposal for RMRR fix
  2015-01-12  9:32             ` Jan Beulich
@ 2015-01-12  9:41               ` Tian, Kevin
  2015-01-12  9:50                 ` Jan Beulich
  0 siblings, 1 reply; 139+ messages in thread
From: Tian, Kevin @ 2015-01-12  9:41 UTC (permalink / raw)
  To: Jan Beulich
  Cc: wei.liu2, ian.campbell, stefano.stabellini, tim, ian.jackson,
	xen-devel, Zhang, Yang Z, Chen, Tiejun

> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Monday, January 12, 2015 5:33 PM
> 
> >>> On 12.01.15 at 09:46, <kevin.tian@intel.com> wrote:
> >>  From: Jan Beulich [mailto:JBeulich@suse.com]
> >> Sent: Friday, January 09, 2015 6:35 PM
> >> >>> On 09.01.15 at 11:10, <kevin.tian@intel.com> wrote:
> >> >>  From: Jan Beulich [mailto:JBeulich@suse.com]
> >> >> The question isn't about migrating with devices assigned, but about
> >> >> assigning devices after migration (consider a dual vif + SR-IOV NIC
> >> >> guest setup where the SR-IOV NIC gets hot-removed before
> >> >> migration and a new one hot-plugged afterwards).
> >> >>
> >> >> Furthermore any tying of the guest memory layout to the host's
> >> >> where the guest first boots is awkward, as post-migration there's
> >> >> not going to be any reliable correlation between the guest layout
> >> >> and the new host's.
> >> >
> >> > how can you solve this? Like the above example: a NIC on node-A leaves
> >> > a reserved region in the guest e820. Now it's hot-removed and the guest
> >> > is then migrated to node-B. There's no way to update the e820 again
> >> > since it's a boot-time-only structure, so the user will still see such
> >> > awkward regions. Since this is not avoidable, report-all in the summary
> >> > mail doesn't look like it causes a new problem.
> >>
> >> The solution to this are reserved regions specified in the guest config,
> >> independent of host characteristics.
> >
> > I don't think how the reserved regions are specified matters here. My
> > point is that when a region is reserved in the e820 at boot time, there's
> > no way to erase that knowledge from the guest even when the devices
> > causing that reservation are hot-removed later.
> 
> I don't think anyone ever indicated that such erasure would be
> needed/wanted - I'm not sure how you ended up there.
> 

I ended up here to indicate that report-all, which gives the user more
reserved regions than necessary, is not a weird case, since the above
scenario can create the same situation anyway. The user shouldn't have
expectations about the reserved region layout, and this argument is needed
to support our proposal of using report-all. :-)

Thanks
Kevin


* Re: (v2) Design proposal for RMRR fix
  2015-01-09 20:27         ` Konrad Rzeszutek Wilk
@ 2015-01-12  9:44           ` Tian, Kevin
  2015-01-12 12:12           ` Ian Campbell
  1 sibling, 0 replies; 139+ messages in thread
From: Tian, Kevin @ 2015-01-12  9:44 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk, George Dunlap
  Cc: wei.liu2, ian.campbell, stefano.stabellini, tim, ian.jackson,
	xen-devel, Jan Beulich, Zhang, Yang Z, Chen, Tiejun

> From: Konrad Rzeszutek Wilk [mailto:konrad.wilk@oracle.com]
> Sent: Saturday, January 10, 2015 4:28 AM
> 
> On Thu, Jan 08, 2015 at 06:02:04PM +0000, George Dunlap wrote:
> > On Thu, Jan 8, 2015 at 4:10 PM, Jan Beulich <JBeulich@suse.com> wrote:
> > >>>> On 08.01.15 at 16:59, <dunlapg@umich.edu> wrote:
> > >> On Thu, Jan 8, 2015 at 1:54 PM, Jan Beulich <JBeulich@suse.com> wrote:
> > >>>> the 1st invocation of this interface will save all reported reserved
> > >>>> regions under domain structure, and later invocation (e.g. from
> > >>>> hvmloader) gets saved content.
> > >>>
> > >>> Why would the reserved regions need attaching to the domain
> > >>> structure? The combination of (to be) assigned devices and
> > >>> global RMRR list always allow reproducing the intended set of
> > >>> regions without any extra storage.
> > >>
> > >> So when you say "(to be) assigned devices", you mean any device which
> > >> is currently assigned, *or may be assigned at some point in the
> > >> future*?
> > >
> > > Yes.
> > >
> > >> Do you think the extra storage for "this VM might possibly be assigned
> > >> this device at some point" wouldn't really be that much bigger than
> > >> "this VM might possibly map this RMRR at some point in the future"?
> > >
> > > Since listing devices without RMRR association would be pointless,
> > > I think a list of devices would require less storage. But see below.
> > >
> > >> It seems a lot cleaner to me to have the toolstack tell Xen what
> > >> ranges are reserved for RMRR per VM, and then have Xen check again
> > >> when assigning a device to make sure that the RMRRs have already been
> > >> reserved.
> > >
> > > With an extra level of what can be got wrong by the admin.
> > > However, I now realize that doing it this way would allow
> > > specifying regions not associated with any device on the host
> > > the guest boots on, but associated with one on a host the guest
> > > may later migrate to.
> >
> > I did say the toolstack, not the admin. :-)
> >
> > At the xl level, I envisioned a single boolean that would say, "Make
> > my memory layout resemble the host system" -- so the MMIO hole would
> > be the same size, and all the RMRRs would be reserved.
> 
> Like the e820_host=1 ? :-)
> 

so this is an extension of report-all, covering not just reserved regions
but all e820 entries. :-)

one thing I'm struggling with here (w/ Jan in other threads) is whether
reporting all reserved regions on the host can be the default setting, to
simplify the overall RMRR implementation, given that end users shouldn't
set any expectation on the actual reserved regions.
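
For reference, the existing PV-only knob looks like this in an xl guest
config (a sketch of today's syntax; whether an HVM variant would reuse
the same name is exactly the open question here):

----
# xl guest config (PV guests only today):
e820_host = 1    # make the guest e820 resemble the host memory map
----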

Thanks
Kevin

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: (v2) Design proposal for RMRR fix
  2015-01-12  9:41               ` Tian, Kevin
@ 2015-01-12  9:50                 ` Jan Beulich
  2015-01-12  9:56                   ` Tian, Kevin
  0 siblings, 1 reply; 139+ messages in thread
From: Jan Beulich @ 2015-01-12  9:50 UTC (permalink / raw)
  To: Kevin Tian
  Cc: wei.liu2, ian.campbell, stefano.stabellini, tim, ian.jackson,
	xen-devel, Yang Z Zhang, Tiejun Chen

>>> On 12.01.15 at 10:41, <kevin.tian@intel.com> wrote:
>>  From: Jan Beulich [mailto:JBeulich@suse.com]
>> Sent: Monday, January 12, 2015 5:33 PM
>> >>> On 12.01.15 at 09:46, <kevin.tian@intel.com> wrote:
>> >>  From: Jan Beulich [mailto:JBeulich@suse.com]
>> >> Sent: Friday, January 09, 2015 6:35 PM
>> >> >>> On 09.01.15 at 11:10, <kevin.tian@intel.com> wrote:
>> >> >>  From: Jan Beulich [mailto:JBeulich@suse.com]
>> >> >> The question isn't about migrating with devices assigned, but about
>> >> >> assigning devices after migration (consider a dual vif + SR-IOV NIC
>> >> >> guest setup where the SR-IOV NIC gets hot-removed before
>> >> >> migration and a new one hot-plugged afterwards).
>> >> >>
>> >> >> Furthermore any tying of the guest memory layout to the host's
>> >> >> where the guest first boots is awkward, as post-migration there's
>> >> >> not going to be any reliable correlation between the guest layout
>> >> >> and the new host's.
>> >> >
>> >> > how can you solve this? like above example, a NIC on node-A leaves
>> >> > a reserved region in guest e820. now it's hot-removed and then
>> >> > migrated to node-b. there's no way to update e820 again since it's
>> >> > only boot structure. then user will still see such awkward regions.
>> >> > since it's not avoidable, report-all in the summary mail looks not
>> >> > causing a new problem.
>> >>
>> >> The solution to this are reserved regions specified in the guest config,
>> >> independent of host characteristics.
>> >
>> > I don't think how reserved regions are specified matter here. My point
>> > is that when a region is reserved in e820 at boot time, there's no way
>> > to erase that knowledge in the guest even when devices causing that
>> > reservation are hot removed later.
>> 
>> I don't think anyone ever indicated that such erasure would be
>> needed/wanted - I'm not sure how you ended up there.
>> 
> 
> I ended here to indicate that report-all which gives user more reserved
> regions than necessary is not a weird case since above scenario can also
> create such fact. User shouldn't set expectation about reserved region
> layout. and this argument is necessary to support our proposal of using
> report-all. :-)

The fact that ranges can't be removed from a guest's memory map
is irrelevant - there's simply no question that this is so. The
main counter argument against report-all remains: It may result in
unnecessarily little low memory in guests not needing all of the host
regions to be reserved for them.

Jan

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: (v2) Design proposal for RMRR fix
  2015-01-12  9:50                 ` Jan Beulich
@ 2015-01-12  9:56                   ` Tian, Kevin
  2015-01-12 10:08                     ` Jan Beulich
  0 siblings, 1 reply; 139+ messages in thread
From: Tian, Kevin @ 2015-01-12  9:56 UTC (permalink / raw)
  To: Jan Beulich
  Cc: wei.liu2, ian.campbell, stefano.stabellini, tim, ian.jackson,
	xen-devel, Zhang, Yang Z, Chen, Tiejun

> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Monday, January 12, 2015 5:51 PM
> 
> >>> On 12.01.15 at 10:41, <kevin.tian@intel.com> wrote:
> >>  From: Jan Beulich [mailto:JBeulich@suse.com]
> >> Sent: Monday, January 12, 2015 5:33 PM
> >> >>> On 12.01.15 at 09:46, <kevin.tian@intel.com> wrote:
> >> >>  From: Jan Beulich [mailto:JBeulich@suse.com]
> >> >> Sent: Friday, January 09, 2015 6:35 PM
> >> >> >>> On 09.01.15 at 11:10, <kevin.tian@intel.com> wrote:
> >> >> >>  From: Jan Beulich [mailto:JBeulich@suse.com]
> >> >> >> The question isn't about migrating with devices assigned, but about
> >> >> >> assigning devices after migration (consider a dual vif + SR-IOV NIC
> >> >> >> guest setup where the SR-IOV NIC gets hot-removed before
> >> >> >> migration and a new one hot-plugged afterwards).
> >> >> >>
> >> >> >> Furthermore any tying of the guest memory layout to the host's
> >> >> >> where the guest first boots is awkward, as post-migration there's
> >> >> >> not going to be any reliable correlation between the guest layout
> >> >> >> and the new host's.
> >> >> >
> >> >> > how can you solve this? like above example, a NIC on node-A leaves
> >> >> > a reserved region in guest e820. now it's hot-removed and then
> >> >> > migrated to node-b. there's no way to update e820 again since it's
> >> >> > only boot structure. then user will still see such awkward regions.
> >> >> > since it's not avoidable, report-all in the summary mail looks not
> >> >> > causing a new problem.
> >> >>
> >> >> The solution to this are reserved regions specified in the guest config,
> >> >> independent of host characteristics.
> >> >
> >> > I don't think how reserved regions are specified matter here. My point
> >> > is that when a region is reserved in e820 at boot time, there's no way
> >> > to erase that knowledge in the guest even when devices causing that
> >> > reservation are hot removed later.
> >>
> >> I don't think anyone ever indicated that such erasure would be
> >> needed/wanted - I'm not sure how you ended up there.
> >>
> >
> > I ended here to indicate that report-all which gives user more reserved
> > regions than necessary is not a weird case since above scenario can also
> > create such fact. User shouldn't set expectation about reserved region
> > layout. and this argument is necessary to support our proposal of using
> > report-all. :-)
> 
> The fact that ranges can't be removed from a guest's memory map
> is irrelevant - there's simply no question that this is that way. The
> main counter argument against report-all remains: It may result in
> unnecessarily little low memory in guests not needing all of the host
> regions to be reserved for them.
> 

the result is related to another open question: whether we want to block
guest boot on such a problem. If a 'warn' in the domain builder is
acceptable, we don't need to shrink lowmem for such a rare 1GB case; we
just throw a warning for unnecessary conflicts (it doesn't hurt if the
user never assigns the device). 

Thanks
Kevin

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: (v2) Design proposal for RMRR fix
  2015-01-12  9:56                   ` Tian, Kevin
@ 2015-01-12 10:08                     ` Jan Beulich
  2015-01-12 10:12                       ` Tian, Kevin
  0 siblings, 1 reply; 139+ messages in thread
From: Jan Beulich @ 2015-01-12 10:08 UTC (permalink / raw)
  To: Kevin Tian
  Cc: wei.liu2, ian.campbell, stefano.stabellini, tim, ian.jackson,
	xen-devel, Yang Z Zhang, Tiejun Chen

>>> On 12.01.15 at 10:56, <kevin.tian@intel.com> wrote:
> the result is related to another open whether we want to block guest
> boot for such problem. If 'warn' in domain builder is acceptable, we
> don't need to change lowmem for such rare 1GB case, just throws
> a warning for unnecessary conflictions (doesn't hurt if user doesn't
> assign it). 

And how would you then deal with the one guest needing that
range reserved?

Jan

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: (v2) Design proposal for RMRR fix
  2015-01-12 10:08                     ` Jan Beulich
@ 2015-01-12 10:12                       ` Tian, Kevin
  2015-01-12 10:22                         ` Jan Beulich
  0 siblings, 1 reply; 139+ messages in thread
From: Tian, Kevin @ 2015-01-12 10:12 UTC (permalink / raw)
  To: Jan Beulich
  Cc: wei.liu2, ian.campbell, stefano.stabellini, tim, ian.jackson,
	xen-devel, Zhang, Yang Z, Chen, Tiejun

> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Monday, January 12, 2015 6:09 PM
> 
> >>> On 12.01.15 at 10:56, <kevin.tian@intel.com> wrote:
> > the result is related to another open whether we want to block guest
> > boot for such problem. If 'warn' in domain builder is acceptable, we
> > don't need to change lowmem for such rare 1GB case, just throws
> > a warning for unnecessary conflictions (doesn't hurt if user doesn't
> > assign it).
> 
> And how would you then deal with the one guest needing that
> range reserved?
> 

if the guest needs the range, then report-all vs. report-sel doesn't
matter: the domain builder throws the warning, and the later device
assignment will fail (or warn w/ an override). In reality I think a
conflict as low as 1GB is rare, and making that assumption to simplify
the implementation is reasonable.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: (v2) Design proposal for RMRR fix
  2015-01-12 10:12                       ` Tian, Kevin
@ 2015-01-12 10:22                         ` Jan Beulich
  2015-01-12 11:22                           ` Tian, Kevin
  0 siblings, 1 reply; 139+ messages in thread
From: Jan Beulich @ 2015-01-12 10:22 UTC (permalink / raw)
  To: Kevin Tian
  Cc: wei.liu2, ian.campbell, stefano.stabellini, tim, ian.jackson,
	xen-devel, Yang Z Zhang, Tiejun Chen

>>> On 12.01.15 at 11:12, <kevin.tian@intel.com> wrote:
>>  From: Jan Beulich [mailto:JBeulich@suse.com]
>> Sent: Monday, January 12, 2015 6:09 PM
>> 
>> >>> On 12.01.15 at 10:56, <kevin.tian@intel.com> wrote:
>> > the result is related to another open whether we want to block guest
>> > boot for such problem. If 'warn' in domain builder is acceptable, we
>> > don't need to change lowmem for such rare 1GB case, just throws
>> > a warning for unnecessary conflictions (doesn't hurt if user doesn't
>> > assign it).
>> 
>> And how would you then deal with the one guest needing that
>> range reserved?
> 
> if guest needs the range, then report-all or report-sel doesn't matter.
> domain builder throws the warning, and later device assignment will
> fail (or warn w/ override). In reality I think 1GB is rare. Making such
> assumption to simplify implementation is reasonable.

One of my main problems with all your recent argumentation here
is the arbitrary use of the 1Gb boundary - there's nothing special
in this discussion with where the boundary is. Everything revolves
around the (undue) effect of report-all on domains not needing all
of the ranges found on the host.

Jan

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: (v2) Design proposal for RMRR fix
  2015-01-12 10:22                         ` Jan Beulich
@ 2015-01-12 11:22                           ` Tian, Kevin
  2015-01-12 11:37                             ` Jan Beulich
  2015-01-12 12:13                             ` George Dunlap
  0 siblings, 2 replies; 139+ messages in thread
From: Tian, Kevin @ 2015-01-12 11:22 UTC (permalink / raw)
  To: Jan Beulich
  Cc: wei.liu2, ian.campbell, stefano.stabellini, tim, ian.jackson,
	xen-devel, Zhang, Yang Z, Chen, Tiejun

> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Monday, January 12, 2015 6:23 PM
> 
> >>> On 12.01.15 at 11:12, <kevin.tian@intel.com> wrote:
> >>  From: Jan Beulich [mailto:JBeulich@suse.com]
> >> Sent: Monday, January 12, 2015 6:09 PM
> >>
> >> >>> On 12.01.15 at 10:56, <kevin.tian@intel.com> wrote:
> >> > the result is related to another open whether we want to block guest
> >> > boot for such problem. If 'warn' in domain builder is acceptable, we
> >> > don't need to change lowmem for such rare 1GB case, just throws
> >> > a warning for unnecessary conflictions (doesn't hurt if user doesn't
> >> > assign it).
> >>
> >> And how would you then deal with the one guest needing that
> >> range reserved?
> >
> > if guest needs the range, then report-all or report-sel doesn't matter.
> > domain builder throws the warning, and later device assignment will
> > fail (or warn w/ override). In reality I think 1GB is rare. Making such
> > assumption to simplify implementation is reasonable.
> 
> One of my main problems with all you recent argumentation here
> is the arbitrary use of the 1Gb boundary - there's nothing special
> in this discussion with where the boundary is. Everything revolves
> around the (undue) effect of report-all on domains not needing all
> of the ranges found on the host.
> 

I'm not sure which part of my argument is unclear here. report-all would
be a problem only if we want to fix all the conflicts in the domain
builder (pulling in unnecessary devices then increases the chance of a
conflict). But if we only fix the reasonable ones (e.g. >3GB) while
merely warning about the others (e.g. <3GB) in the domain builder,
letting the later assignment path actually fail if a conflict does
matter, then we don't need to solve every conflict in the domain builder
(in the 1GB example, fixing it could instead shrink lowmem greatly), and
report-all may just add more warnings than report-sel for unused devices.
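
To make the policy concrete, here is a minimal standalone sketch (plain
C with invented names, not actual libxc code) of "fix the reasonable
conflicts, warn about the rest" in the domain builder:

----
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

struct rmrr_range { uint64_t base, end; };      /* 4KB-aligned, end exclusive */

#define GB(x) ((uint64_t)(x) << 30)

/* Hypothetical builder-side check: "reasonable" high conflicts get an
 * e820 hole punched; low conflicts only produce a warning, and the
 * later assignment path makes the final fail/override decision. */
static void builder_check_rmrrs(const struct rmrr_range *r, size_t nr,
                                uint64_t lowmem_end)
{
    for ( size_t i = 0; i < nr; i++ )
    {
        if ( r[i].base >= lowmem_end )
            continue;                           /* no conflict with guest RAM */
        if ( r[i].base >= GB(3) )
            printf("punching e820 hole at [%" PRIx64 ", %" PRIx64 ")\n",
                   r[i].base, r[i].end);
        else
            fprintf(stderr, "warning: RMRR [%" PRIx64 ", %" PRIx64 ") "
                    "overlaps guest RAM; assigning the owning device "
                    "will fail later unless overridden\n",
                    r[i].base, r[i].end);
    }
}

int main(void)
{
    const struct rmrr_range host[] = {
        { GB(1), GB(1) + 0x14000 },             /* made-up low RMRR: warn  */
        { 0xcd000000ULL, 0xcd020000ULL },       /* made-up high RMRR: hole */
    };

    builder_check_rmrrs(host, 2, 0xe0000000ULL); /* 3.5GB lowmem boundary */
    return 0;
}
----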

Thanks
Kevin

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: (v2) Design proposal for RMRR fix
  2015-01-09  2:43     ` Tian, Kevin
@ 2015-01-12 11:25       ` George Dunlap
  2015-01-12 13:56         ` Pasi Kärkkäinen
  0 siblings, 1 reply; 139+ messages in thread
From: George Dunlap @ 2015-01-12 11:25 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: wei.liu2, ian.campbell, stefano.stabellini, tim, ian.jackson,
	xen-devel, Jan Beulich, Zhang, Yang Z, Chen, Tiejun

On Fri, Jan 9, 2015 at 2:43 AM, Tian, Kevin <kevin.tian@intel.com> wrote:
>> From: George Dunlap
>> Sent: Thursday, January 08, 2015 8:55 PM
>>
>> On Thu, Jan 8, 2015 at 12:49 PM, George Dunlap
>> <George.Dunlap@eu.citrix.com> wrote:
>> > If RMRRs almost always happen up above 2G, for example, then a simple
>> > solution that wouldn't require too much work would be to make sure
>> > that the PCI MMIO hole we specify to libxc and to qemu-upstream is big
>> > enough to include all RMRRs.  That would satisfy the libxc and qemu
>> > requirements.
>> >
>> > If we then store specific RMRRs we want included in xenstore,
>> > hvmloader can put them in the e820 map, and that would satisfy the
>> > hvmloader requirement.
>>
>> An alternate thing to do here would be to "properly" fix the
>> qemu-upstream problem, by making a way for hvmloader to communicate
>> changes in the gpfn layout to qemu.
>>
>> Then hvmloader could do the work of moving memory under RMRRs to
>> higher memory; and libxc wouldn't need to be involved at all.
>>
>> I think it would also fix our long-standing issues with assigning PCI
>> devices to qemu-upstream guests, which up until now have only been
>> worked around.
>>
>
> could you elaborate a bit for that long-standing issue?

So qemu-traditional didn't particularly expect to know the guest
memory layout.  qemu-upstream does; it expects to know what areas of
memory are guest memory and what areas of memory are unmapped.  If a
read or write happens to a gpfn which *xen* knows is valid, but which
*qemu-upstream* thinks is unmapped, then qemu-upstream will crash.

The problem though is that the guest's memory map is not actually
communicated to qemu-upstream in any way.  Originally, qemu-upstream
was only told how much memory the guest had, and it just "happens" to
choose the same guest memory layout as the libxc domain builder does.
This works, but it is bad design, because if libxc were to change for
some reason, people would have to simply remember to also change the
qemu-upstream layout.

Where this really bites us is in PCI pass-through.  The default <4G
MMIO hole is very small; and hvmloader naturally expects to be able to
make this area larger by relocating memory from below 4G to above 4G.
It moves the memory in Xen's p2m, but it has no way of communicating
this to qemu-upstream.  So when the guest does an MMIO instruction that
causes qemu-upstream to access that memory, the guest crashes.
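
(For the curious, the relocation is done with the gmfn remapping
hypercall; paraphrased from memory, not verbatim hvmloader source:)

----
/* Move one guest page from gpfn 'from' (below 4GB) to gpfn 'to'
 * (above 4GB). Xen's p2m is updated; qemu-upstream is never told. */
struct xen_add_to_physmap xatp = {
    .domid = DOMID_SELF,
    .space = XENMAPSPACE_gmfn,
    .idx   = from,
    .gpfn  = to,
};
int rc = hypercall_memory_op(XENMEM_add_to_physmap, &xatp);
----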

There are two work-arounds at the moment:
1. A flag which tells hvmloader not to relocate memory
2. The option to tell qemu-upstream to make the memory hole larger.

Both are just work-arounds though; a "proper fix" would be to allow
hvmloader some way of telling qemu that the memory has moved, so it
can update its memory map.

This will (I'm pretty sure) have an effect on RMRR regions as well,
for the reasons I've mentioned above: whether we make the "holes" for the
RMRRs in libxc or in hvmloader, if we *move* that memory up to the top
of the address space (rather than, say, just not giving that RAM to
the guest), then qemu-upstream's idea of the guest memory map will be
wrong, and will probably crash at some point.

Having the ability for hvmloader to populate and/or move the memory
around, and then tell qemu-upstream what the resulting map looked
like, would fix both the MMIO-resize issue and the RMRR problem, wrt
qemu-upstream.

 -George

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: (v2) Design proposal for RMRR fix
  2015-01-12 11:22                           ` Tian, Kevin
@ 2015-01-12 11:37                             ` Jan Beulich
  2015-01-12 11:41                               ` Tian, Kevin
  2015-01-12 12:13                             ` George Dunlap
  1 sibling, 1 reply; 139+ messages in thread
From: Jan Beulich @ 2015-01-12 11:37 UTC (permalink / raw)
  To: Kevin Tian
  Cc: wei.liu2, ian.campbell, stefano.stabellini, tim, ian.jackson,
	xen-devel, Yang Z Zhang, Tiejun Chen

>>> On 12.01.15 at 12:22, <kevin.tian@intel.com> wrote:
>>  From: Jan Beulich [mailto:JBeulich@suse.com]
>> Sent: Monday, January 12, 2015 6:23 PM
>> One of my main problems with all you recent argumentation here
>> is the arbitrary use of the 1Gb boundary - there's nothing special
>> in this discussion with where the boundary is. Everything revolves
>> around the (undue) effect of report-all on domains not needing all
>> of the ranges found on the host.
>> 
> 
> I'm not sure which part of my argument is not clear here. report-all
> would be a problem here only if we want to fix all the conflictions
> (then pulling unnecessary devices increases the confliction possibility) 
> in the domain builder. but if we only fix reasonable ones (e.g. >3GB)
> while warn other conflictions (e.g. <3G) in domain builder (let later 
> assignment path to actually fail if confliction does matter),

And have no way for the user to (securely) avoid that failure. Plus
the definition of "reasonable" here is of course going to be arbitrary.

Jan

> then we 
> don't need to solve all conflictions in domain builder (if say 1G example
> fixing it may instead reduce lowmem greatly) and then report-all 
> may just add more warnings than report-sel for unused devices.
> 
> Thanks
> Kevin

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: (v2) Design proposal for RMRR fix
  2015-01-12 11:37                             ` Jan Beulich
@ 2015-01-12 11:41                               ` Tian, Kevin
  2015-01-12 12:03                                 ` Jan Beulich
  0 siblings, 1 reply; 139+ messages in thread
From: Tian, Kevin @ 2015-01-12 11:41 UTC (permalink / raw)
  To: Jan Beulich
  Cc: wei.liu2, ian.campbell, stefano.stabellini, tim, ian.jackson,
	xen-devel, Zhang, Yang Z, Chen, Tiejun

> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Monday, January 12, 2015 7:37 PM
> 
> >>> On 12.01.15 at 12:22, <kevin.tian@intel.com> wrote:
> >>  From: Jan Beulich [mailto:JBeulich@suse.com]
> >> Sent: Monday, January 12, 2015 6:23 PM
> >> One of my main problems with all you recent argumentation here
> >> is the arbitrary use of the 1Gb boundary - there's nothing special
> >> in this discussion with where the boundary is. Everything revolves
> >> around the (undue) effect of report-all on domains not needing all
> >> of the ranges found on the host.
> >>
> >
> > I'm not sure which part of my argument is not clear here. report-all
> > would be a problem here only if we want to fix all the conflictions
> > (then pulling unnecessary devices increases the confliction possibility)
> > in the domain builder. but if we only fix reasonable ones (e.g. >3GB)
> > while warn other conflictions (e.g. <3G) in domain builder (let later
> > assignment path to actually fail if confliction does matter),
> 
> And have no way for the user to (securely) avoid that failure. Plus
> the definition of "reasonable" here is of course going to be arbitrary.
> 
> Jan

actually I didn't get your point there, because it's your own proposal
to make a reasonable assumption like the one below:

---
d) Move down the lowmem RAM/MMIO boundary so that a single,
contiguous chunk of lowmem results, with all other RAM moving up
beyond 4Gb. Of course RMRRs below the 1Mb boundary must not be
considered here, and I think we can reasonably safely assume that
no RMRRs will ever report ranges above 1Mb but below the host
lowmem RAM/MMIO boundary (i.e. we can presumably rest assured
that the lowmem chunk will always be reasonably big).
---
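
i.e., illustrating d) with made-up numbers, assuming a host lowmem
RAM/MMIO boundary at 3.5GB:

----
0x000000000 - 0x00009ffff  RAM   (below 1MB, as usual)
0x000100000 - 0x0dfffffff  RAM   (single contiguous lowmem chunk)
0x0e0000000 - 0x0ffffffff  hole  (MMIO + all host RMRRs above the boundary)
0x100000000 - ...          RAM   (remaining memory moved above 4GB)
----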

Thanks
Kevin

> 
> > then we
> > don't need to solve all conflictions in domain builder (if say 1G example
> > fixing it may instead reduce lowmem greatly) and then report-all
> > may just add more warnings than report-sel for unused devices.
> >
> > Thanks
> > Kevin
> 
> 

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: (v2) Design proposal for RMRR fix
  2015-01-12 11:41                               ` Tian, Kevin
@ 2015-01-12 12:03                                 ` Jan Beulich
  2015-01-12 12:16                                   ` Tian, Kevin
  0 siblings, 1 reply; 139+ messages in thread
From: Jan Beulich @ 2015-01-12 12:03 UTC (permalink / raw)
  To: Kevin Tian
  Cc: wei.liu2, ian.campbell, stefano.stabellini, tim, ian.jackson,
	xen-devel, Yang Z Zhang, Tiejun Chen

>>> On 12.01.15 at 12:41, <kevin.tian@intel.com> wrote:
>>  From: Jan Beulich [mailto:JBeulich@suse.com]
>> Sent: Monday, January 12, 2015 7:37 PM
>> 
>> >>> On 12.01.15 at 12:22, <kevin.tian@intel.com> wrote:
>> >>  From: Jan Beulich [mailto:JBeulich@suse.com]
>> >> Sent: Monday, January 12, 2015 6:23 PM
>> >> One of my main problems with all you recent argumentation here
>> >> is the arbitrary use of the 1Gb boundary - there's nothing special
>> >> in this discussion with where the boundary is. Everything revolves
>> >> around the (undue) effect of report-all on domains not needing all
>> >> of the ranges found on the host.
>> >>
>> >
>> > I'm not sure which part of my argument is not clear here. report-all
>> > would be a problem here only if we want to fix all the conflictions
>> > (then pulling unnecessary devices increases the confliction possibility)
>> > in the domain builder. but if we only fix reasonable ones (e.g. >3GB)
>> > while warn other conflictions (e.g. <3G) in domain builder (let later
>> > assignment path to actually fail if confliction does matter),
>> 
>> And have no way for the user to (securely) avoid that failure. Plus
>> the definition of "reasonable" here is of course going to be arbitrary.
> 
> actually here I didn't get your point then. It's your proposal to make 
> reasonable assumption like below:
> 
> ---
> d) Move down the lowmem RAM/MMIO boundary so that a single,
> contiguous chunk of lowmem results, with all other RAM moving up
> beyond 4Gb. Of course RMRRs below the 1Mb boundary must not be
> considered here, and I think we can reasonably safely assume that
> no RMRRs will ever report ranges above 1Mb but below the host
> lowmem RAM/MMIO boundary (i.e. we can presumably rest assured
> that the lowmem chunk will always be reasonably big).

Correct - but my point is that this won't work well with your report-all
mechanism, only with the report-sel one.

Jan

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: (v2) Design proposal for RMRR fix
  2015-01-09 20:27         ` Konrad Rzeszutek Wilk
  2015-01-12  9:44           ` Tian, Kevin
@ 2015-01-12 12:12           ` Ian Campbell
  2015-01-14 20:06             ` Konrad Rzeszutek Wilk
  1 sibling, 1 reply; 139+ messages in thread
From: Ian Campbell @ 2015-01-12 12:12 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Kevin Tian, wei.liu2, stefano.stabellini, George Dunlap,
	ian.jackson, tim, xen-devel, Jan Beulich, Yang Z Zhang,
	Tiejun Chen

On Fri, 2015-01-09 at 15:27 -0500, Konrad Rzeszutek Wilk wrote:
> On Thu, Jan 08, 2015 at 06:02:04PM +0000, George Dunlap wrote:
> > On Thu, Jan 8, 2015 at 4:10 PM, Jan Beulich <JBeulich@suse.com> wrote:
> > >>>> On 08.01.15 at 16:59, <dunlapg@umich.edu> wrote:
> > >> On Thu, Jan 8, 2015 at 1:54 PM, Jan Beulich <JBeulich@suse.com> wrote:
> > >>>> the 1st invocation of this interface will save all reported reserved
> > >>>> regions under domain structure, and later invocation (e.g. from
> > >>>> hvmloader) gets saved content.
> > >>>
> > >>> Why would the reserved regions need attaching to the domain
> > >>> structure? The combination of (to be) assigned devices and
> > >>> global RMRR list always allow reproducing the intended set of
> > >>> regions without any extra storage.
> > >>
> > >> So when you say "(to be) assigned devices", you mean any device which
> > >> is currently assigned, *or may be assigned at some point in the
> > >> future*?
> > >
> > > Yes.
> > >
> > >> Do you think the extra storage for "this VM might possibly be assigned
> > >> this device at some point" wouldn't really be that much bigger than
> > >> "this VM might possibly map this RMRR at some point in the future"?
> > >
> > > Since listing devices without RMRR association would be pointless,
> > > I think a list of devices would require less storage. But see below.
> > >
> > >> It seems a lot cleaner to me to have the toolstack tell Xen what
> > >> ranges are reserved for RMRR per VM, and then have Xen check again
> > >> when assigning a device to make sure that the RMRRs have already been
> > >> reserved.
> > >
> > > With an extra level of what can be got wrong by the admin.
> > > However, I now realize that doing it this way would allow
> > > specifying regions not associated with any device on the host
> > > the guest boots on, but associated with one on a host the guest
> > > may later migrate to.
> > 
> > I did say the toolstack, not the admin. :-)
> > 
> > At the xl level, I envisioned a single boolean that would say, "Make
> > my memory layout resemble the host system" -- so the MMIO hole would
> > be the same size, and all the RMRRs would be reserved.
> 
> Like the e820_host=1 ? :-)

I'd been thinking about that all the way down this thread ;-) It seems
like a fairly reasonable approach, and the interfaces (e.g. get host
memory e820) are mostly already there. But maybe there are HVM-specific
reasons why it's not...

Ian.

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: (v2) Design proposal for RMRR fix
  2015-01-12 11:22                           ` Tian, Kevin
  2015-01-12 11:37                             ` Jan Beulich
@ 2015-01-12 12:13                             ` George Dunlap
  2015-01-12 12:23                               ` Ian Campbell
                                                 ` (2 more replies)
  1 sibling, 3 replies; 139+ messages in thread
From: George Dunlap @ 2015-01-12 12:13 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: wei.liu2, ian.campbell, stefano.stabellini, tim, ian.jackson,
	xen-devel, Jan Beulich, Zhang, Yang Z, Chen, Tiejun

On Mon, Jan 12, 2015 at 11:22 AM, Tian, Kevin <kevin.tian@intel.com> wrote:
>> From: Jan Beulich [mailto:JBeulich@suse.com]
>> Sent: Monday, January 12, 2015 6:23 PM
>>
>> >>> On 12.01.15 at 11:12, <kevin.tian@intel.com> wrote:
>> >>  From: Jan Beulich [mailto:JBeulich@suse.com]
>> >> Sent: Monday, January 12, 2015 6:09 PM
>> >>
>> >> >>> On 12.01.15 at 10:56, <kevin.tian@intel.com> wrote:
>> >> > the result is related to another open whether we want to block guest
>> >> > boot for such problem. If 'warn' in domain builder is acceptable, we
>> >> > don't need to change lowmem for such rare 1GB case, just throws
>> >> > a warning for unnecessary conflictions (doesn't hurt if user doesn't
>> >> > assign it).
>> >>
>> >> And how would you then deal with the one guest needing that
>> >> range reserved?
>> >
>> > if guest needs the range, then report-all or report-sel doesn't matter.
>> > domain builder throws the warning, and later device assignment will
>> > fail (or warn w/ override). In reality I think 1GB is rare. Making such
>> > assumption to simplify implementation is reasonable.
>>
>> One of my main problems with all you recent argumentation here
>> is the arbitrary use of the 1Gb boundary - there's nothing special
>> in this discussion with where the boundary is. Everything revolves
>> around the (undue) effect of report-all on domains not needing all
>> of the ranges found on the host.
>>
>
> I'm not sure which part of my argument is not clear here. report-all
> would be a problem here only if we want to fix all the conflictions
> (then pulling unnecessary devices increases the confliction possibility)
> in the domain builder. but if we only fix reasonable ones (e.g. >3GB)
> while warn other conflictions (e.g. <3G) in domain builder (let later
> assignment path to actually fail if confliction does matter), then we
> don't need to solve all conflictions in domain builder (if say 1G example
> fixing it may instead reduce lowmem greatly) and then report-all
> may just add more warnings than report-sel for unused devices.

You keep saying "report-all" or "report-sel", but I'm not 100% clear
what you mean by those.  In any case, the naming has got to be a bit
misleading: the important questions at the moment, AFAICT, are:

1. Whether we make holes at boot time for all RMRRs on the system, or
whether we only make holes for some subset (or potentially some other
arbitrary range, which may include RMRRs on other hosts to which we
may want to migrate).

2. Whether those holes are made by the domain builder in libxc, or by hvmloader

3. What happens if Xen is asked to assign a device and it finds that
the required RMRR is not empty:
 a. during guest creation
 b. after the guest has booted

Obviously at some point some part of the toolstack needs to identify
which RMRRs go with what device, so that either libxc or hvmloader can
make the appropriate holes in the address space; but at that point,
"report" is not so much the right word as "query".  (Obviously we want
to "report" in the e820 map all RMRRs that we've made holes for in the
guest.)

 -George

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: (v2) Design proposal for RMRR fix
  2015-01-12 12:03                                 ` Jan Beulich
@ 2015-01-12 12:16                                   ` Tian, Kevin
  2015-01-12 12:46                                     ` Jan Beulich
  0 siblings, 1 reply; 139+ messages in thread
From: Tian, Kevin @ 2015-01-12 12:16 UTC (permalink / raw)
  To: Jan Beulich
  Cc: wei.liu2, ian.campbell, stefano.stabellini, tim, ian.jackson,
	xen-devel, Zhang, Yang Z, Chen, Tiejun

> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Monday, January 12, 2015 8:03 PM
> 
> >>> On 12.01.15 at 12:41, <kevin.tian@intel.com> wrote:
> >>  From: Jan Beulich [mailto:JBeulich@suse.com]
> >> Sent: Monday, January 12, 2015 7:37 PM
> >>
> >> >>> On 12.01.15 at 12:22, <kevin.tian@intel.com> wrote:
> >> >>  From: Jan Beulich [mailto:JBeulich@suse.com]
> >> >> Sent: Monday, January 12, 2015 6:23 PM
> >> >> One of my main problems with all you recent argumentation here
> >> >> is the arbitrary use of the 1Gb boundary - there's nothing special
> >> >> in this discussion with where the boundary is. Everything revolves
> >> >> around the (undue) effect of report-all on domains not needing all
> >> >> of the ranges found on the host.
> >> >>
> >> >
> >> > I'm not sure which part of my argument is not clear here. report-all
> >> > would be a problem here only if we want to fix all the conflictions
> >> > (then pulling unnecessary devices increases the confliction possibility)
> >> > in the domain builder. but if we only fix reasonable ones (e.g. >3GB)
> >> > while warn other conflictions (e.g. <3G) in domain builder (let later
> >> > assignment path to actually fail if confliction does matter),
> >>
> >> And have no way for the user to (securely) avoid that failure. Plus
> >> the definition of "reasonable" here is of course going to be arbitrary.
> >
> > actually here I didn't get your point then. It's your proposal to make
> > reasonable assumption like below:
> >
> > ---
> > d) Move down the lowmem RAM/MMIO boundary so that a single,
> > contiguous chunk of lowmem results, with all other RAM moving up
> > beyond 4Gb. Of course RMRRs below the 1Mb boundary must not be
> > considered here, and I think we can reasonably safely assume that
> > no RMRRs will ever report ranges above 1Mb but below the host
> > lowmem RAM/MMIO boundary (i.e. we can presumably rest assured
> > that the lowmem chunk will always be reasonably big).
> 
> Correct - but my point is that this won't work well with your report-all
> mechanism, only with the report-sel one.
> 

I've explained this several times. If a required device violates the
above assumption, report-all and report-sel behave the same. If the
violation is caused by an unnecessary device, please note I'm proposing a
'warn' here, so report-all at most adds some more warnings in the domain
builder; the conflict will be caught later if the device actually becomes
relevant for assignment (e.g. through hotplug).

Thanks
Kevin

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: (v2) Design proposal for RMRR fix
  2015-01-12 12:13                             ` George Dunlap
@ 2015-01-12 12:23                               ` Ian Campbell
  2015-01-12 12:28                               ` Tian, Kevin
  2015-01-12 12:30                               ` Tian, Kevin
  2 siblings, 0 replies; 139+ messages in thread
From: Ian Campbell @ 2015-01-12 12:23 UTC (permalink / raw)
  To: George Dunlap
  Cc: Tian, Kevin, wei.liu2, stefano.stabellini, ian.jackson, tim,
	xen-devel, Jan Beulich, Zhang, Yang Z, Chen, Tiejun

On Mon, 2015-01-12 at 12:13 +0000, George Dunlap wrote:
> On Mon, Jan 12, 2015 at 11:22 AM, Tian, Kevin <kevin.tian@intel.com> wrote:
> >> From: Jan Beulich [mailto:JBeulich@suse.com]
> >> Sent: Monday, January 12, 2015 6:23 PM
> >>
> >> >>> On 12.01.15 at 11:12, <kevin.tian@intel.com> wrote:
> >> >>  From: Jan Beulich [mailto:JBeulich@suse.com]
> >> >> Sent: Monday, January 12, 2015 6:09 PM
> >> >>
> >> >> >>> On 12.01.15 at 10:56, <kevin.tian@intel.com> wrote:
> >> >> > the result is related to another open whether we want to block guest
> >> >> > boot for such problem. If 'warn' in domain builder is acceptable, we
> >> >> > don't need to change lowmem for such rare 1GB case, just throws
> >> >> > a warning for unnecessary conflictions (doesn't hurt if user doesn't
> >> >> > assign it).
> >> >>
> >> >> And how would you then deal with the one guest needing that
> >> >> range reserved?
> >> >
> >> > if guest needs the range, then report-all or report-sel doesn't matter.
> >> > domain builder throws the warning, and later device assignment will
> >> > fail (or warn w/ override). In reality I think 1GB is rare. Making such
> >> > assumption to simplify implementation is reasonable.
> >>
> >> One of my main problems with all you recent argumentation here
> >> is the arbitrary use of the 1Gb boundary - there's nothing special
> >> in this discussion with where the boundary is. Everything revolves
> >> around the (undue) effect of report-all on domains not needing all
> >> of the ranges found on the host.
> >>
> >
> > I'm not sure which part of my argument is not clear here. report-all
> > would be a problem here only if we want to fix all the conflictions
> > (then pulling unnecessary devices increases the confliction possibility)
> > in the domain builder. but if we only fix reasonable ones (e.g. >3GB)
> > while warn other conflictions (e.g. <3G) in domain builder (let later
> > assignment path to actually fail if confliction does matter), then we
> > don't need to solve all conflictions in domain builder (if say 1G example
> > fixing it may instead reduce lowmem greatly) and then report-all
> > may just add more warnings than report-sel for unused devices.
> 
> You keep saying "report-all" or "report-sel", but I'm not 100% clear
> what you mean by those.

Is the distinction between "all reserved areas" and "only (selectively)
those which are related to an RMRR"? That's how I've been reading it...

Ian.

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: (v2) Design proposal for RMRR fix
  2015-01-12 12:13                             ` George Dunlap
  2015-01-12 12:23                               ` Ian Campbell
@ 2015-01-12 12:28                               ` Tian, Kevin
  2015-01-12 14:19                                 ` George Dunlap
  2015-01-12 12:30                               ` Tian, Kevin
  2 siblings, 1 reply; 139+ messages in thread
From: Tian, Kevin @ 2015-01-12 12:28 UTC (permalink / raw)
  To: George Dunlap
  Cc: wei.liu2, ian.campbell, stefano.stabellini, tim, ian.jackson,
	xen-devel, Jan Beulich, Zhang, Yang Z, Chen, Tiejun

> From: George Dunlap
> Sent: Monday, January 12, 2015 8:14 PM
> 
> On Mon, Jan 12, 2015 at 11:22 AM, Tian, Kevin <kevin.tian@intel.com> wrote:
> >> From: Jan Beulich [mailto:JBeulich@suse.com]
> >> Sent: Monday, January 12, 2015 6:23 PM
> >>
> >> >>> On 12.01.15 at 11:12, <kevin.tian@intel.com> wrote:
> >> >>  From: Jan Beulich [mailto:JBeulich@suse.com]
> >> >> Sent: Monday, January 12, 2015 6:09 PM
> >> >>
> >> >> >>> On 12.01.15 at 10:56, <kevin.tian@intel.com> wrote:
> >> >> > the result is related to another open whether we want to block guest
> >> >> > boot for such problem. If 'warn' in domain builder is acceptable, we
> >> >> > don't need to change lowmem for such rare 1GB case, just throws
> >> >> > a warning for unnecessary conflictions (doesn't hurt if user doesn't
> >> >> > assign it).
> >> >>
> >> >> And how would you then deal with the one guest needing that
> >> >> range reserved?
> >> >
> >> > if guest needs the range, then report-all or report-sel doesn't matter.
> >> > domain builder throws the warning, and later device assignment will
> >> > fail (or warn w/ override). In reality I think 1GB is rare. Making such
> >> > assumption to simplify implementation is reasonable.
> >>
> >> One of my main problems with all you recent argumentation here
> >> is the arbitrary use of the 1Gb boundary - there's nothing special
> >> in this discussion with where the boundary is. Everything revolves
> >> around the (undue) effect of report-all on domains not needing all
> >> of the ranges found on the host.
> >>
> >
> > I'm not sure which part of my argument is not clear here. report-all
> > would be a problem here only if we want to fix all the conflictions
> > (then pulling unnecessary devices increases the confliction possibility)
> > in the domain builder. but if we only fix reasonable ones (e.g. >3GB)
> > while warn other conflictions (e.g. <3G) in domain builder (let later
> > assignment path to actually fail if confliction does matter), then we
> > don't need to solve all conflictions in domain builder (if say 1G example
> > fixing it may instead reduce lowmem greatly) and then report-all
> > may just add more warnings than report-sel for unused devices.
> 
> You keep saying "report-all" or "report-sel", but I'm not 100% clear
> what you mean by those.  In any case, the naming has got to be a bit
> misleading: the important questions at the moment, AFAICT, are:

I explained them in the original proposal.

> 
> 1. Whether we make holes at boot time for all RMRRs on the system, or
> whether only make RMRRs for some subset (or potentially some other
> arbitrary range, which may include RMRRs on other hosts to which we
> may want to migrate).

I use 'report-all' to mean making holes for all RMRRs on the system, and
'report-sel' to mean making holes for a specified subset only.

Including other RMRRs (supplied by the admin for migration) is orthogonal
to the open question above.

> 
> 2. Whether those holes are made by the domain builder in libxc, or by
> hvmloader

based on the current discussion, whether the holes are made in hvmloader
makes no fundamental difference. As long as the domain builder still
needs to populate memory (even the minimum for hvmloader to boot), it
has to check for conflicts and would ideally make holes too (though we
may assume it doesn't have to).

> 
> 3. What happens if Xen is asked to assign a device and it finds that
> the required RMRR is not empty:
>  a. during guest creation
>  b. after the guest has booted

for Xen we don't need to differentiate a/b. By default a clear failure
should be returned, since moving forward would imply a security/
correctness issue. But based on the discussion an override to 'warn'
only is preferred, so the admin can make the decision (whether to do a
global override or a per-device override remains an open question).

> 
> Obviously at some point some part of the toolstack needs to identify
> which RMRRs go with what device, so that either libxc or hvmloader can
> make the appropriate holes in the address space; but at that point,
> "report" is not so much the right word as "query".  (Obviously we want
> to "report" in the e820 map all RMRRs that we've made holes for in the
> guest.)

yes, 'report' doesn't capture all the changes we need to make; I just
used the terms to simplify the discussion, assuming everyone was on the
same page. Clearly my original explanation didn't manage that. :/

To state my main intention again: I don't think the preparation for
device assignment (i.e. detecting conflicts and making holes) should be
a blocking failure. Throwing a warning should be enough (i.e. in libxc).
We should let the actual device assignment path make the final call,
based on the admin's configuration (default 'fail', w/ a 'warn'
override). Based on that policy I think 'report-all' (making holes for
all host RMRRs) is an acceptable approach, with the small downside of
possibly more warning messages (actually not bad, since they help the
admin understand the hotplug possibilities on this platform) and more
reserved regions shown to the end user (who shouldn't make any
assumptions about them anyway). :-)
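
In pseudo-C form, the fail/warn split I'm arguing for looks like this
(a standalone sketch with invented names, not actual Xen/libxc code):

----
#include <stdbool.h>
#include <stdio.h>

enum rmrr_policy { RMRR_FAIL, RMRR_WARN };    /* default vs. admin override */

/* libxc / domain builder: never a hard stop. */
static void builder_prepare(bool conflict)
{
    if ( conflict )
        fprintf(stderr, "warning: RMRR overlaps guest RAM\n");
}

/* assignment path: the final call, based on the admin's configuration. */
static int assign_device(bool rmrr_mapped_ok, enum rmrr_policy p)
{
    if ( rmrr_mapped_ok )
        return 0;                             /* hole + identity map in place */
    if ( p == RMRR_WARN )
    {
        fprintf(stderr, "warning: assigning despite RMRR conflict\n");
        return 0;                             /* admin accepted the risk */
    }
    fprintf(stderr, "error: RMRR conflict, refusing assignment\n");
    return -1;                                /* default: clear failure */
}

int main(void)
{
    builder_prepare(true);                    /* build continues regardless */
    return assign_device(false, RMRR_FAIL);   /* assignment is what fails */
}
----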

Thanks
Kevin

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: (v2) Design proposal for RMRR fix
  2015-01-12 12:13                             ` George Dunlap
  2015-01-12 12:23                               ` Ian Campbell
  2015-01-12 12:28                               ` Tian, Kevin
@ 2015-01-12 12:30                               ` Tian, Kevin
  2 siblings, 0 replies; 139+ messages in thread
From: Tian, Kevin @ 2015-01-12 12:30 UTC (permalink / raw)
  To: George Dunlap
  Cc: wei.liu2, ian.campbell, stefano.stabellini, tim, ian.jackson,
	xen-devel, Jan Beulich, Zhang, Yang Z, Chen, Tiejun

> From: Tian, Kevin
> Sent: Monday, January 12, 2015 8:29 PM
> 
> > From: George Dunlap
> > Sent: Monday, January 12, 2015 8:14 PM
> >
> > On Mon, Jan 12, 2015 at 11:22 AM, Tian, Kevin <kevin.tian@intel.com>
> wrote:
> > >> From: Jan Beulich [mailto:JBeulich@suse.com]
> > >> Sent: Monday, January 12, 2015 6:23 PM
> > >>
> > >> >>> On 12.01.15 at 11:12, <kevin.tian@intel.com> wrote:
> > >> >>  From: Jan Beulich [mailto:JBeulich@suse.com]
> > >> >> Sent: Monday, January 12, 2015 6:09 PM
> > >> >>
> > >> >> >>> On 12.01.15 at 10:56, <kevin.tian@intel.com> wrote:
> > >> >> > the result is related to another open whether we want to block guest
> > >> >> > boot for such problem. If 'warn' in domain builder is acceptable, we
> > >> >> > don't need to change lowmem for such rare 1GB case, just throws
> > >> >> > a warning for unnecessary conflictions (doesn't hurt if user doesn't
> > >> >> > assign it).
> > >> >>
> > >> >> And how would you then deal with the one guest needing that
> > >> >> range reserved?
> > >> >
> > >> > if guest needs the range, then report-all or report-sel doesn't matter.
> > >> > domain builder throws the warning, and later device assignment will
> > >> > fail (or warn w/ override). In reality I think 1GB is rare. Making such
> > >> > assumption to simplify implementation is reasonable.
> > >>
> > >> One of my main problems with all you recent argumentation here
> > >> is the arbitrary use of the 1Gb boundary - there's nothing special
> > >> in this discussion with where the boundary is. Everything revolves
> > >> around the (undue) effect of report-all on domains not needing all
> > >> of the ranges found on the host.
> > >>
> > >
> > > I'm not sure which part of my argument is not clear here. report-all
> > > would be a problem here only if we want to fix all the conflictions
> > > (then pulling unnecessary devices increases the confliction possibility)
> > > in the domain builder. but if we only fix reasonable ones (e.g. >3GB)
> > > while warn other conflictions (e.g. <3G) in domain builder (let later
> > > assignment path to actually fail if confliction does matter), then we
> > > don't need to solve all conflictions in domain builder (if say 1G example
> > > fixing it may instead reduce lowmem greatly) and then report-all
> > > may just add more warnings than report-sel for unused devices.
> >
> > You keep saying "report-all" or "report-sel", but I'm not 100% clear
> > what you mean by those.  In any case, the naming has got to be a bit
> > misleading: the important questions at the moment, AFAICT, are:
> 
> I explained them in original proposal
> 
> >
> > 1. Whether we make holes at boot time for all RMRRs on the system, or
> > whether only make RMRRs for some subset (or potentially some other
> > arbitrary range, which may include RMRRs on other hosts to which we
> > may want to migrate).
> 
> I use 'report-all' to stand for making holes for all RMRRs on the system,
> while 'report-sel' for specified subset.
> 

to be more accurate: 'report-sel' means making holes only for the RMRRs
belonging to the specified devices.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: (v2) Design proposal for RMRR fix
  2015-01-12 12:16                                   ` Tian, Kevin
@ 2015-01-12 12:46                                     ` Jan Beulich
  0 siblings, 0 replies; 139+ messages in thread
From: Jan Beulich @ 2015-01-12 12:46 UTC (permalink / raw)
  To: Kevin Tian
  Cc: wei.liu2, ian.campbell, stefano.stabellini, tim, ian.jackson,
	xen-devel, Yang Z Zhang, Tiejun Chen

>>> On 12.01.15 at 13:16, <kevin.tian@intel.com> wrote:
> I've explained this several times. If there's a violation on above 
> assumption 
> from required devices, same for report-all and report-sel. If the violation is 
> caused by unnecessary devices, please note I'm proposing 'warn' here so
> report-all at most just adds more warnings in domain builder. the conflict
> will be caught later if it becomes relevant to be assigned (e.g. thru 
> hotplug).

Since we're apparently not understanding one another, please
explain with a suitable example how you envision things behaving.

Jan

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: (v2) Design proposal for RMRR fix
  2015-01-12 11:25       ` George Dunlap
@ 2015-01-12 13:56         ` Pasi Kärkkäinen
  2015-01-12 14:23           ` George Dunlap
  0 siblings, 1 reply; 139+ messages in thread
From: Pasi Kärkkäinen @ 2015-01-12 13:56 UTC (permalink / raw)
  To: George Dunlap
  Cc: Tian, Kevin, wei.liu2, ian.campbell, stefano.stabellini, tim,
	ian.jackson, xen-devel, Jan Beulich, Zhang, Yang Z, Chen, Tiejun

On Mon, Jan 12, 2015 at 11:25:56AM +0000, George Dunlap wrote:
> On Fri, Jan 9, 2015 at 2:43 AM, Tian, Kevin <kevin.tian@intel.com> wrote:
> >> From: George Dunlap
> >> Sent: Thursday, January 08, 2015 8:55 PM
> >>
> >> On Thu, Jan 8, 2015 at 12:49 PM, George Dunlap
> >> <George.Dunlap@eu.citrix.com> wrote:
> >> > If RMRRs almost always happen up above 2G, for example, then a simple
> >> > solution that wouldn't require too much work would be to make sure
> >> > that the PCI MMIO hole we specify to libxc and to qemu-upstream is big
> >> > enough to include all RMRRs.  That would satisfy the libxc and qemu
> >> > requirements.
> >> >
> >> > If we then store specific RMRRs we want included in xenstore,
> >> > hvmloader can put them in the e820 map, and that would satisfy the
> >> > hvmloader requirement.
> >>
> >> An alternate thing to do here would be to "properly" fix the
> >> qemu-upstream problem, by making a way for hvmloader to communicate
> >> changes in the gpfn layout to qemu.
> >>
> >> Then hvmloader could do the work of moving memory under RMRRs to
> >> higher memory; and libxc wouldn't need to be involved at all.
> >>
> >> I think it would also fix our long-standing issues with assigning PCI
> >> devices to qemu-upstream guests, which up until now have only been
> >> worked around.
> >>
> >
> > could you elaborate a bit for that long-standing issue?
> 
> So qemu-traditional didn't particularly expect to know the guest
> memory layout.  qemu-upstream does; it expects to know what areas of
> memory are guest memory and what areas of memory are unmapped.  If a
> read or write happens to a gpfn which *xen* knows is valid, but which
> *qemu-upstream* thinks is unmapped, then qemu-upstream will crash.
> 
> The problem though is that the guest's memory map is not actually
> communicated to qemu-upstream in any way.  Originally, qemu-upstream
> was only told how much memory the guest had, and it just "happens" to
> choose the same guest memory layout as the libxc domain builder does.
> This works, but it is bad design, because if libxc were to change for
> some reason, people would have to simply remember to also change the
> qemu-upstream layout.
> 
> Where this really bites us is in PCI pass-through.  The default <4G
> MMIO hole is very small; and hvmloader naturally expects to be able to
> make this area larger by relocating memory from below 4G to above 4G.
> It moves the memory in Xen's p2m, but it has no way of communicating
> this to qemu-upstream.  So when the guest does an MMIO instuction that
> causes qemu-upstream to access that memory, the guest crashes.
> 
> There are two work-arounds at the moment:
> 1. A flag which tells hvmloader not to relocate memory
> 2. The option to tell qemu-upstream to make the memory hole larger.
> 
> Both are just work-arounds though; a "proper fix" would be to allow
> hvmloader some way of telling qemu that the memory has moved, so it
> can update its memory map.
> 
> This will (I'm pretty sure) have an effect on RMRR regions as well,
> for the reasons I've mentioned above: whether make the "holes" for the
> RMRRs in libxc or in hvmloader, if we *move* that memory up to the top
> of the address space (rather than, say, just not giving that RAM to
> the guest), then qemu-upstream's idea of the guest memory map will be
> wrong, and will probably crash at some point.
> 
> Having the ability for hvmloader to populate and/or move the memory
> around, and then tell qemu-upstream what the resulting map looked
> like, would fix both the MMIO-resize issue and the RMRR problem, wrt
> qemu-upstream.
> 

Hmm, wasn't this changed slightly during Xen 4.5 development by Don Slutz?

You can now specify the mmio_hole size for HVM guests when using qemu-upstream:
http://wiki.xenproject.org/wiki/Xen_Project_4.5_Feature_List


"Bigger PCI MMIO hole in QEMU via the mmio_hole parameter in guest config, which allows configuring the MMIO size below 4GB. "

"Backport pc & q35: Add new machine opt max-ram-below-4g":
http://xenbits.xen.org/gitweb/?p=qemu-upstream-unstable.git;a=commit;h=ffdacad07002e14a8072ae28086a57452e48d458

"x86: hvm: Allow configuration of the size of the mmio_hole.":
http://xenbits.xen.org/gitweb/?p=xen.git;a=commit;h=2d927fc41b8e130b3b8910e4442d4691111d2ac7
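
So e.g. something like this in the guest config should already give a
big enough hole (my reading of xl.cfg: the size is in bytes and it is
only honoured with the upstream device model):

----
builder = "hvm"
device_model_version = "qemu-xen"
memory = 4096
mmio_hole = 0x40000000    # force a 1GB MMIO hole below 4GiB
----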


-- Pasi


>  -George
> 

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: (v2) Design proposal for RMRR fix
  2015-01-12 12:28                               ` Tian, Kevin
@ 2015-01-12 14:19                                 ` George Dunlap
  2015-01-13 11:03                                   ` Tian, Kevin
  0 siblings, 1 reply; 139+ messages in thread
From: George Dunlap @ 2015-01-12 14:19 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: wei.liu2, ian.campbell, stefano.stabellini, tim, ian.jackson,
	xen-devel, Jan Beulich, Zhang, Yang Z, Chen, Tiejun

On Mon, Jan 12, 2015 at 12:28 PM, Tian, Kevin <kevin.tian@intel.com> wrote:
>> From: George Dunlap
>> Sent: Monday, January 12, 2015 8:14 PM
>>
>> On Mon, Jan 12, 2015 at 11:22 AM, Tian, Kevin <kevin.tian@intel.com> wrote:
>> >> From: Jan Beulich [mailto:JBeulich@suse.com]
>> >> Sent: Monday, January 12, 2015 6:23 PM
>> >>
>> >> >>> On 12.01.15 at 11:12, <kevin.tian@intel.com> wrote:
>> >> >>  From: Jan Beulich [mailto:JBeulich@suse.com]
>> >> >> Sent: Monday, January 12, 2015 6:09 PM
>> >> >>
>> >> >> >>> On 12.01.15 at 10:56, <kevin.tian@intel.com> wrote:
>> >> >> > the result is related to another open whether we want to block guest
>> >> >> > boot for such problem. If 'warn' in domain builder is acceptable, we
>> >> >> > don't need to change lowmem for such rare 1GB case, just throws
>> >> >> > a warning for unnecessary conflictions (doesn't hurt if user doesn't
>> >> >> > assign it).
>> >> >>
>> >> >> And how would you then deal with the one guest needing that
>> >> >> range reserved?
>> >> >
>> >> > if guest needs the range, then report-all or report-sel doesn't matter.
>> >> > domain builder throws the warning, and later device assignment will
>> >> > fail (or warn w/ override). In reality I think 1GB is rare. Making such
>> >> > assumption to simplify implementation is reasonable.
>> >>
>> >> One of my main problems with all you recent argumentation here
>> >> is the arbitrary use of the 1Gb boundary - there's nothing special
>> >> in this discussion with where the boundary is. Everything revolves
>> >> around the (undue) effect of report-all on domains not needing all
>> >> of the ranges found on the host.
>> >>
>> >
>> > I'm not sure which part of my argument is not clear here. report-all
>> > would be a problem only if we want to fix all the conflicts in the
>> > domain builder (pulling in unnecessary devices increases the chance
>> > of conflicts). But if we only fix the reasonable ones (e.g. >3GB) and
>> > merely warn about other conflicts (e.g. <3G) in the domain builder
>> > (letting the later assignment path actually fail if a conflict does
>> > matter), then we don't need to solve every conflict in the domain
>> > builder (in, say, the 1G example, fixing it may instead reduce lowmem
>> > greatly), and report-all may just add more warnings than report-sel
>> > for unused devices.
>>
>> You keep saying "report-all" or "report-sel", but I'm not 100% clear
>> what you mean by those.  In any case, the naming has got to be a bit
>> misleading: the important questions at the moment, AFAICT, are:
>
> I explained them in original proposal

Yes, I read it and didn't understand it there either. :-)

>> 1. Whether we make holes at boot time for all RMRRs on the system, or
>> whether only make RMRRs for some subset (or potentially some other
>> arbitrary range, which may include RMRRs on other hosts to which we
>> may want to migrate).
>
> I use 'report-all' to stand for making holes for all RMRRs on the system,
> and 'report-sel' for a specified subset.
>
> including other RMRRs (from the admin, for migration) is orthogonal to
> the above open question.

Right; so the "report" in this case is "report to the guest".

As I said, I think that's confusing terminology; after all, we want to
report to the guest all holes that we make, and only the holes that we
make.  The question isn't then which ones we report, but which ones we
make holes for. :-)

So for this discussion, maybe "rmrr-host" (meaning, copy RMRRs from
the host) or "rmrr-sel" (meaning, specify a selection of RMRRs, which
may be from this host, or even another host)?

Given that the ranges may be of arbitrary size, and that we may want
to specify additional ranges for migration to other hosts, I think
that at some level we need the machinery to be in place to specify
the RMRRs that will be reserved for a specific guest.

At the xl level, there should of course be a way to specify "use all
host RMRRs"; but what should happen then is that xl / libxl should
query Xen for the host RMRRs and then pass those down to the next
layer of the library.
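
To make that concrete, here is a minimal sketch of the idea (all names
are invented for illustration; none of this is an existing libxl API):

    #include <stddef.h>
    #include <stdint.h>

    struct rmrr_range { uint64_t start, end; };  /* inclusive addresses */

    /* Hypothetical query of the host's RMRR list (e.g. via a hypercall). */
    size_t query_host_rmrrs(struct rmrr_range *buf, size_t max);

    /* Decide which ranges to reserve for this guest: either everything
     * the host reports ("use all host RMRRs"), or an explicit selection
     * from the guest config, which may include ranges taken from other
     * hosts (for migration). */
    size_t rmrrs_for_guest(int use_all_host,
                           const struct rmrr_range *sel, size_t nr_sel,
                           struct rmrr_range *out, size_t max)
    {
        size_t i, n;

        if (use_all_host)
            return query_host_rmrrs(out, max);

        n = nr_sel < max ? nr_sel : max;
        for (i = 0; i < n; i++)
            out[i] = sel[i];
        return n;
    }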

>> 2. Whether those holes are made by the domain builder in libxc, or by
>> hvmloader
>
> Based on the current discussion, whether to make holes in hvmloader
> doesn't make a fundamental difference. As long as the domain builder
> still needs to populate memory (even the minimum for hvmloader to
> boot), it needs to check for conflicts and may ideally make holes too
> (though we may assume it doesn't).

Well it will have an impact on the overall design of the code; but
you're right, if RMRRs really can (and will) be anywhere in memory,
then the domain builder will need to know what RMRRs are going to be
reserved for this VM and avoid populating those.  If, on the other
hand, we can make some fairly small assumptions about where there will
not be any RMRRs, then we can get away with handling everything in
hvmloader.

>>
>> 3. What happens if Xen is asked to assign a device and it finds that
>> the required RMRR is not empty:
>>  a. during guest creation
>>  b. after the guest has booted
>
> For Xen we don't need to differentiate a/b. By default a clear failure
> should be returned, as moving forward would imply a security/correctness
> issue. But based on the discussion an override to 'warn' only is
> preferred, so the admin can make the decision (it remains open whether
> to do a global override or a per-device override).

Well I think part of our confusion here is what "fail" vs "warn" means.

Fail / warn might be "Do we refuse to assign the device, or do we go
ahead and assign the device, knowing that it may act buggy?"

Or it might be, "Do we fail domain creation if at some step we
discover an RMRR conflict?  Or do we let the domain create succeed but
warn that the device has not been attached?"

I think in any case, failing to *assign* the device is the right thing
to do (except perhaps with a per-device override option).

libxl already has a policy of what happens when pci assignment fails
during domain creation.  If I'm reading the code right, libxl will
destroy the domain if libxl__device_pci_add() fails during domain
creation; I think that's the right thing to do.  If you want to change
that policy, that's a different discussion.

But if the device assignment fails due to an unspecified RMRR, that's
a bug in the toolstack -- it should have looked at the device list,
found out what RMRRs were necessary, and reserved those ranges before
we got to that point.

The only time I would expect device assignment might fail during
domain creation is if one of the devices had an RMRR shared with a
device already assigned to another VM.

>> Obviously at some point some part of the toolstack needs to identify
>> which RMRRs go with what device, so that either libxc or hvmloader can
>> make the appropriate holes in the address space; but at that point,
>> "report" is not so much the right word as "query".  (Obviously we want
>> to "report" in the e820 map all RMRRs that we've made holes for in the
>> guest.)
>
> yes, using 'report' doesn't catch all the changes we need to make. I just
> used the terms to simplify discussion, assuming all were on the same page.
> However, clearly my original explanation didn't get that across. :/
>
> To state my major intention again: I don't think the preparation for
> device assignment (i.e. detecting conflicts and making holes) should be
> a blocking failure.  Throwing a warning should be enough (i.e. in libxc).
> We should let the actual device assignment path make the final call based
> on the admin's configuration (default 'fail' w/ 'warn' override). Based on
> that policy I think 'report-all' (making holes for all host RMRRs) is an
> acceptable approach, with the small impact of possibly more warning
> messages (actually not bad, as they help the admin understand the hotplug
> possibilities on this platform) and of showing more reserved regions to
> the end user (but he shouldn't make any assumption here). :-)

I don't really understand what you're talking about here.

When the libxc domain builder runs, there is *no* guest memory mapped.
So if it has the RMRRs, then it can *avoid* conflict; and if it
doesn't have the RMRRs, it can't even *detect* conflict.  So there is
no reason for libxc to either give a warning, or cause a failure.
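
To sketch what I mean (illustrative only; allocate_guest_page() stands
in for the real populate-physmap path, and MMIO-hole handling is
omitted):

    #include <stdint.h>

    struct rmrr_range { uint64_t start, end; };  /* inclusive addresses */

    void allocate_guest_page(uint64_t pfn);      /* hypothetical */

    static int pfn_reserved(uint64_t pfn, const struct rmrr_range *r, int nr)
    {
        for (int i = 0; i < nr; i++)
            if (pfn >= (r[i].start >> 12) && pfn <= (r[i].end >> 12))
                return 1;
        return 0;
    }

    /* If the builder knows the reserved ranges up front, it simply never
     * allocates a guest frame there -- nothing to warn about, nothing to
     * fail on. */
    static void populate_ram(uint64_t nr_pages,
                             const struct rmrr_range *r, int nr)
    {
        for (uint64_t pfn = 0; nr_pages; pfn++) {
            if (pfn_reserved(pfn, r, nr))
                continue;                /* leave a hole at the RMRR */
            allocate_guest_page(pfn);
            nr_pages--;
        }
    }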

So I'm not sure why you think making holes for all RMRRs would have
more warning messages.

And when you say "show more reserved regions to the end user", I take
it you mean the guest kernel (via the e820 map)?

I'm also not clear what assumptions "he" may be making: you mean, the
existence of an RMRR in the e820 map shouldn't be taken to mean that
he has a specific device assigned?  No, indeed, he should not make
such an assumption. :-)

Again -- I think that the only place "rmrr-host" and "rmrr-sel" is
important is at the very top level -- in xl, and possibly at a high
level in libxl.  By the time things reach libxc and hvmloader, they
should simply be told, "These are the RMRRs for this domain", and they
should avoid conflicts and report those in the e820 map.

 -George

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: (v2) Design proposal for RMRR fix
  2015-01-12 13:56         ` Pasi Kärkkäinen
@ 2015-01-12 14:23           ` George Dunlap
  0 siblings, 0 replies; 139+ messages in thread
From: George Dunlap @ 2015-01-12 14:23 UTC (permalink / raw)
  To: Pasi Kärkkäinen
  Cc: Tian, Kevin, wei.liu2, ian.campbell, stefano.stabellini, tim,
	ian.jackson, xen-devel, Jan Beulich, Zhang, Yang Z, Chen, Tiejun

On Mon, Jan 12, 2015 at 1:56 PM, Pasi Kärkkäinen <pasik@iki.fi> wrote:
> On Mon, Jan 12, 2015 at 11:25:56AM +0000, George Dunlap wrote:
>> So qemu-traditional didn't particularly expect to know the guest
>> memory layout.  qemu-upstream does; it expects to know what areas of
>> memory are guest memory and what areas of memory are unmapped.  If a
>> read or write happens to a gpfn which *xen* knows is valid, but which
>> *qemu-upstream* thinks is unmapped, then qemu-upstream will crash.
>>
>> The problem though is that the guest's memory map is not actually
>> communicated to qemu-upstream in any way.  Originally, qemu-upstream
>> was only told how much memory the guest had, and it just "happens" to
>> choose the same guest memory layout as the libxc domain builder does.
>> This works, but it is bad design, because if libxc were to change for
>> some reason, people would have to simply remember to also change the
>> qemu-upstream layout.
>>
>> Where this really bites us is in PCI pass-through.  The default <4G
>> MMIO hole is very small; and hvmloader naturally expects to be able to
>> make this area larger by relocating memory from below 4G to above 4G.
>> It moves the memory in Xen's p2m, but it has no way of communicating
>> this to qemu-upstream.  So when the guest does an MMIO instruction that
>> causes qemu-upstream to access that memory, the guest crashes.
>>
>> There are two work-arounds at the moment:
>> 1. A flag which tells hvmloader not to relocate memory
>> 2. The option to tell qemu-upstream to make the memory hole larger.
>>
>> Both are just work-arounds though; a "proper fix" would be to allow
>> hvmloader some way of telling qemu that the memory has moved, so it
>> can update its memory map.
>>
>> This will (I'm pretty sure) have an effect on RMRR regions as well,
>> for the reasons I've mentioned above: whether make the "holes" for the
>> RMRRs in libxc or in hvmloader, if we *move* that memory up to the top
>> of the address space (rather than, say, just not giving that RAM to
>> the guest), then qemu-upstream's idea of the guest memory map will be
>> wrong, and will probably crash at some point.
>>
>> Having the ability for hvmloader to populate and/or move the memory
>> around, and then tell qemu-upstream what the resulting map looked
>> like, would fix both the MMIO-resize issue and the RMRR problem, wrt
>> qemu-upstream.
>>
>
> Hmm, wasn't this changed slightly during Xen 4.5 development by Don Slutz?
>
> You can now specify the mmio_hole size for HVM guests when using qemu-upstream:
> http://wiki.xenproject.org/wiki/Xen_Project_4.5_Feature_List
>
>
> "Bigger PCI MMIO hole in QEMU via the mmio_hole parameter in guest config, which allows configuring the MMIO size below 4GB. "
>
> "Backport pc & q35: Add new machine opt max-ram-below-4g":
> http://xenbits.xen.org/gitweb/?p=qemu-upstream-unstable.git;a=commit;h=ffdacad07002e14a8072ae28086a57452e48d458
>
> "x86: hvm: Allow configuration of the size of the mmio_hole.":
> http://xenbits.xen.org/gitweb/?p=xen.git;a=commit;h=2d927fc41b8e130b3b8910e4442d4691111d2ac7

Yes -- that's workaround #2 above ("tell qemu-upstream to make the
memory hole larger").  But it's still a work-around, because it
requires the admin to figure out how big a memory hole he needs.  With
qemu-traditional, he could just assign whatever devices he wanted, and
hvmloader would make it the right size automatically.  Ideally that's
how it would work for qemu-upstream as well.

 -George

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: (v2) Design proposal for RMRR fix
  2015-01-12 14:19                                 ` George Dunlap
@ 2015-01-13 11:03                                   ` Tian, Kevin
  2015-01-13 11:56                                     ` Jan Beulich
                                                       ` (2 more replies)
  0 siblings, 3 replies; 139+ messages in thread
From: Tian, Kevin @ 2015-01-13 11:03 UTC (permalink / raw)
  To: George Dunlap
  Cc: wei.liu2, ian.campbell, stefano.stabellini, tim, ian.jackson,
	xen-devel, Jan Beulich, Zhang, Yang Z, Chen, Tiejun

> From: George Dunlap
> Sent: Monday, January 12, 2015 10:20 PM
> 
> On Mon, Jan 12, 2015 at 12:28 PM, Tian, Kevin <kevin.tian@intel.com> wrote:
> >> From: George Dunlap
> >> Sent: Monday, January 12, 2015 8:14 PM
> >>
> >> On Mon, Jan 12, 2015 at 11:22 AM, Tian, Kevin <kevin.tian@intel.com>
> wrote:
> >> >> From: Jan Beulich [mailto:JBeulich@suse.com]
> >> >> Sent: Monday, January 12, 2015 6:23 PM
> >> >>
> >> >> >>> On 12.01.15 at 11:12, <kevin.tian@intel.com> wrote:
> >> >> >>  From: Jan Beulich [mailto:JBeulich@suse.com]
> >> >> >> Sent: Monday, January 12, 2015 6:09 PM
> >> >> >>
> >> >> >> >>> On 12.01.15 at 10:56, <kevin.tian@intel.com> wrote:
> >> >> >> > the result is related to another open question: whether we want
> >> >> >> > to block guest boot for such a problem. If a 'warn' in the domain
> >> >> >> > builder is acceptable, we don't need to change lowmem for such a
> >> >> >> > rare 1GB case, just throw a warning for unnecessary conflicts
> >> >> >> > (it doesn't hurt if the user doesn't assign the device).
> >> >> >>
> >> >> >> And how would you then deal with the one guest needing that
> >> >> >> range reserved?
> >> >> >
> >> >> > if the guest needs the range, then report-all vs. report-sel
> >> >> > doesn't matter. The domain builder throws the warning, and later
> >> >> > device assignment will fail (or warn w/ override). In reality I
> >> >> > think 1GB is rare. Making such an assumption to simplify the
> >> >> > implementation is reasonable.
> >> >>
> >> >> One of my main problems with all your recent argumentation here
> >> >> is the arbitrary use of the 1Gb boundary - there's nothing special
> >> >> in this discussion about where the boundary is. Everything revolves
> >> >> around the (undue) effect of report-all on domains not needing all
> >> >> of the ranges found on the host.
> >> >>
> >> >
> >> > I'm not sure which part of my argument is not clear here. report-all
> >> > would be a problem only if we want to fix all the conflicts in the
> >> > domain builder (pulling in unnecessary devices increases the chance
> >> > of conflicts). But if we only fix the reasonable ones (e.g. >3GB) and
> >> > merely warn about other conflicts (e.g. <3G) in the domain builder
> >> > (letting the later assignment path actually fail if a conflict does
> >> > matter), then we don't need to solve every conflict in the domain
> >> > builder (in, say, the 1G example, fixing it may instead reduce lowmem
> >> > greatly), and report-all may just add more warnings than report-sel
> >> > for unused devices.
> >>
> >> You keep saying "report-all" or "report-sel", but I'm not 100% clear
> >> what you mean by those.  In any case, the naming has got to be a bit
> >> misleading: the important questions at the moment, AFAICT, are:
> >
> > I explained them in original proposal
> 
> Yes, I read it and didn't understand it there either. :-)

sorry for that.

> 
> >> 1. Whether we make holes at boot time for all RMRRs on the system, or
> >> whether only make RMRRs for some subset (or potentially some other
> >> arbitrary range, which may include RMRRs on other hosts to which we
> >> may want to migrate).
> >
> > I use 'report-all' to stand for making holes for all RMRRs on the system,
> > and 'report-sel' for a specified subset.
> >
> > including other RMRRs (from the admin, for migration) is orthogonal to
> > the above open question.
> 
> Right; so the "report" in this case is "report to the guest".
> 
> As I said, I think that's confusing terminology; after all, we want to
> report to the guest all holes that we make, and only the holes that we
> make.  The question isn't then which ones we report, but which ones we
> make holes for. :-)

Originally I used 'report' to describe the hypercall with which the
hypervisor composes the actual information about RMRRs, so it can be
'report to libxl' or 'report to the guest' depending on who invokes
that hypercall.

But yes, here we care more about what's reported to the guest.

> 
> So for this discussion, maybe "rmrr-host" (meaning, copy RMRRs from
> the host) or "rmrr-sel" (meaning, specify a selection of RMRRs, which
> may be from this host, or even another host)?

the counterpart of 'rmrr-host' gives me the feeling of 'rmrr-guest'. :-)

> 
> Given that the ranges may be of arbitrary size, and that we may want
> to specify additional ranges for migration to other hosts, I think
> that at some level we need the machinery to be in place to specify
> the RMRRs that will be reserved for a specific guest.
> 
> At the xl level, there should of course be a way to specify "use all
> host RMRRs"; but what should happen then is that xl / libxl should
> query Xen for the host RMRRs and then pass those down to the next
> layer of the library.
> 
> >> 2. Whether those holes are made by the domain builder in libxc, or by
> >> hvmloader
> >
> > Based on the current discussion, whether to make holes in hvmloader
> > doesn't make a fundamental difference. As long as the domain builder
> > still needs to populate memory (even the minimum for hvmloader to
> > boot), it needs to check for conflicts and may ideally make holes too
> > (though we may assume it doesn't).
> 
> Well it will have an impact on the overall design of the code; but
> you're right, if RMRRs really can (and will) be anywhere in memory,
> then the domain builder will need to know what RMRRs are going to be
> reserved for this VM and avoid populating those.  If, on the other
> hand, we can make some fairly small assumptions about where there will
> not be any RMRRs, then we can get away with handling everything in
> hvmloader.

I'm not sure such fairly small assumptions can be made. For example,
the host RMRRs may include one or several regions in host PCI MMIO
space (say >3G); hvmloader then has to know about them to avoid
allocating those ranges for guest PCI MMIO.
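
For illustration, the kind of check hvmloader would need when placing a
BAR might look like this (a sketch only, names made up; alignment
handling omitted, and 'resv' assumed sorted by start):

    #include <stdint.h>

    struct range { uint64_t start, end; };   /* inclusive */

    static int overlaps(uint64_t base, uint64_t size, struct range r)
    {
        return base <= r.end && base + size - 1 >= r.start;
    }

    /* Skip the candidate BAR window over any reserved region it would
     * intersect, retrying just above that region. */
    static uint64_t place_bar(uint64_t base, uint64_t size,
                              const struct range *resv, int nr)
    {
        for (int i = 0; i < nr; i++)
            if (overlaps(base, size, resv[i]))
                base = resv[i].end + 1;
        return base;
    }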

> 
> >>
> >> 3. What happens if Xen is asked to assign a device and it finds that
> >> the required RMRR is not empty:
> >>  a. during guest creation
> >>  b. after the guest has booted
> >
> > For Xen we don't need to differentiate a/b. By default a clear failure
> > should be returned, as moving forward would imply a security/correctness
> > issue. But based on the discussion an override to 'warn' only is
> > preferred, so the admin can make the decision (it remains open whether
> > to do a global override or a per-device override).
> 
> Well I think part of our confusion here is what "fail" vs "warn" means.
> 
> Fail / warn might be "Do we refuse to assign the device, or do we go
> ahead and assign the device, knowing that it may act buggy?"
> 
> Or it might be, "Do we fail domain creation if at some step we
> discover an RMRR conflict?  Or do we let the domain create succeed but
> warn that the device has not been attached?"
> 
> I think in any case, failing to *assign* the device is the right thing
> to do (except perhaps with a per-device override option).

yes

> 
> libxl already has a policy of what happens when pci assignment fails
> during domain creation.  If I'm reading the code right, libxl will
> destroy the domain if libxl__device_pci_add() fails during domain
> creation; I think that's the right thing to do.  If you want to change
> that policy, that's a different discussion.

Not my intention. As I said, the policy for Xen is clear: just fail the
assignment hypercall (or warn w/ override), and keep whatever policy is
defined by libxl.

> 
> But if the device assignment fails due to an unspecified RMRR, that's
> a bug in the toolstack -- it should have looked at the device list,
> found out what RMRRs were necessary, and reserved those ranges before
> we got to that point.
> 
> The only time I would expect device assignment might fail during
> domain creation is if one of the devices had an RMRR shared with a
> device already assigned to another VM.
> 
> >> Obviously at some point some part of the toolstack needs to identify
> >> which RMRRs go with what device, so that either libxc or hvmloader can
> >> make the appropriate holes in the address space; but at that point,
> >> "report" is not so much the right word as "query".  (Obviously we want
> >> to "report" in the e820 map all RMRRs that we've made holes for in the
> >> guest.)
> >
> > yes, using 'report' doesn't catch all the changes we need to make. I just
> > used the terms to simplify discussion, assuming all were on the same page.
> > However, clearly my original explanation didn't get that across. :/
> >
> > To state my major intention again: I don't think the preparation for
> > device assignment (i.e. detecting conflicts and making holes) should be
> > a blocking failure.  Throwing a warning should be enough (i.e. in libxc).
> > We should let the actual device assignment path make the final call based
> > on the admin's configuration (default 'fail' w/ 'warn' override). Based on
> > that policy I think 'report-all' (making holes for all host RMRRs) is an
> > acceptable approach, with the small impact of possibly more warning
> > messages (actually not bad, as they help the admin understand the hotplug
> > possibilities on this platform) and of showing more reserved regions to
> > the end user (but he shouldn't make any assumption here). :-)
> 
> I don't really understand what you're talking about here.
> 
> When the libxc domain builder runs, there is *no* guest memory mapped.
> So if it has the RMRRs, then it can *avoid* conflict; and if it
> doesn't have the RMRRs, it can't even *detect* conflict.  So there is
> no reason for libxc to either give a warning, or cause a failure.

Not all of the conflicts can or will be avoided. E.g. a USB device may
report a region conflicting with the guest BIOS, which is a hard
conflict. Another example (from one design option): we may want to keep
the current cross-component structure (one lowmem trunk + one highmem
trunk), so a conflict in the middle (e.g. at 2G) is a problem (avoiding
it would break lowmem apart or make lowmem too small).

As long as we agree some conflicts may not be avoidable, it comes down
to the open question of whether to give a warning or cause a failure.
I view making holes in the domain builder as preparation for later
device assignment, so giving a warning should be sufficient here, since
Xen will fail the assignment hypercall later when the conflict actually
matters, and libxl will then react according to the defined policy, as
you described above.

> 
> So I'm not sure why you think making holes for all RMRRs would have
> more warning messages.

Based on the fact that not all RMRRs can or will be avoided, making
holes for all RMRRs on the host can potentially lead to more conflicts
than just making holes for the RMRRs belonging to assigned devices.
Once we agree a warning is OK in the domain builder, that means more
warning messages.

> 
> And when you say "show more reserved regions to the end user", I take
> it you mean the guest kernel (via the e820 map)?

Yes: all reserved regions have to be marked in the e820 so that the
guest OS itself won't allocate from the hole, e.g. when doing PCI
re-configuration.

> 
> I'm also not clear what assumptions "he" may be making: you mean, the
> existence of an RMRR in the e820 map shouldn't be taken to mean that
> he has a specific device assigned?  No, indeed, he should not make
> such an assumption. :-)

I meant 'he' shouldn't make assumptions about how many reserved regions
should exist in the e820 based on the exposed devices. Jan has a concern
that exposing more reserved regions in the e820 than necessary is not a
good thing. I'm trying to convince him it should be fine. :-)

> 
> Again -- I think that the only place "rmrr-host" and "rmrr-sel" is
> important is at the very top level -- in xl, and possibly at a high
> level in libxl.  By the time things reach libxc and hvmloader, they
> should simply be told, "These are the RMRRs for this domain", and they
> should avoid conflicts and report those in the e820 map.
> 

Having libxl centrally manage RMRRs at a high level is a good idea,
but it needs help from you and Ian on what the detailed tasks in
libxl are to achieve that goal. We're not toolstack experts
(especially Tiejun), so more suggestions in this area are definitely
welcome. :-)

Then I hope you now understand our discussion about libxl/xen/
hvmloader, based on the fact that a conflict may not be avoidable.
That's the major open question in the original discussion with Jan.
I'd like to give an example of the flow here, per Jan's suggestion,
starting from the domain builder after the reserved regions have
been specified by high-level libxl.

Let's take a synthetic platform w/ two devices, each reported with
one RMRR reserved region:
	(D1): [0xe0000, 0xeffff] in <1MB area
	(D2): [0xa0000000, 0xa37fffff] in ~2.75G area

The guest is configured with 4G memory and assigned D2. Due to libxl
policy (say, for migration and hotplug), 3 ranges in total are
reported:
	(hotplug): [0xe0000, 0xeffff] in <1MB area in this node
	(migration): [0x40000000, 0x40003fff] in ~1G area in another node
	(static-assign): [0xa0000000, 0xa37fffff] in ~2.75G area in this node

let's use xenstore to save such information (assume accessible to both
domain builder and hvmloader?)

STEP-1. domain builder

say the default layout w/o reserved regions would be:
	lowmem: 	[0, 0xbfffffff]
	mmio hole: 	[0xc0000000, 0xffffffff]
	highmem:	[0x100000000, 0x140000000]

domain builder then queries reserved regions from xenstore, 
and tries to avoid conflicts.

For [0xad000000, 0xaf7fffff], it can be avoided by reducing
lowmem to 0xad000000 and increase highmem:
	lowmem: 	[0, 0x9fffffff]
	mmio hole: 	[0xa0000000, 0xffffffff]
	highmem:	[0x100000000, 0x160000000]


For [0x40000000, 0x40003fff], leave it as a conflict, since either
reducing lowmem to 1G is not nice to a guest which doesn't use
highmem, or we have to break lowmem into two trunks, which requires
more structural changes.

For [0xe0000, 0xeffff], leave it as a conflict (w/ guest BIOS)

In the libxl centrally-managed mode, the domain builder doesn't know
whether a conflict will lead to an immediate error or not, so the
best policy here is to throw a warning and then move forward.
Conflicts will be caught in later steps when a region actually
matters.
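
(As an illustrative sketch of the lowmem adjustment above, not the
actual domain builder code; 'lowmem_end' and 'highmem_end' are
exclusive ends:)

    #include <stdint.h>

    struct layout { uint64_t lowmem_end, mmio_base, highmem_end; };

    /* Shrink lowmem so it ends at the reserved region's start and push
     * the displaced RAM above 4G, as in the example: lowmem_end
     * 0xc0000000 -> 0xa0000000, highmem_end 0x140000000 -> 0x160000000. */
    static void avoid_low_rmrr(struct layout *l, uint64_t rmrr_start)
    {
        if (rmrr_start < l->lowmem_end) {
            uint64_t displaced = l->lowmem_end - rmrr_start;

            l->lowmem_end   = rmrr_start;  /* lowmem now [0, rmrr_start) */
            l->mmio_base    = rmrr_start;  /* MMIO hole grows downwards  */
            l->highmem_end += displaced;   /* same total RAM, moved up   */
        }
    }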

STEP-2. static device assignment

After the domain builder, libxl will request the Xen hypervisor to
complete the actual device assignment. Because D2 is statically
assigned to this guest, Xen will set up the identity mapping for
[0xa0000000, 0xa37fffff] with conflict detection in gfn space. Since
the domain builder has made a hole for this region, there will be no
conflict and the device will be assigned to the guest successfully.

STEP-3. hvmloader boot

hvmloader also needs to query reserved regions (still thru xenstore?)
due to two reasons:
	- mark all reported reserved regions in guest e820
	- make holes to avoid conflict in dynamic allocation (e.g. PCI
BAR, ACPI opregion, etc.)

hvmloader can avoid making holes in guest RAM again (even if there are
remaining conflicts w/ guest RAM they would be acceptable; otherwise
libxl would have failed the boot before reaching here). So hvmloader
will just add a new reserved e820 entry and make a hole for
[0xa0000000, 0xa37fffff] in this example, which doesn't conflict with
guest RAM.
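
(An illustrative sketch of the e820 marking above; the struct and
helper are made up, not hvmloader's actual code. E.g.
mark_rmrr(e820, nr, 0xa0000000, 0xa37fffff) for the region in this
example:)

    #include <stdint.h>

    #define E820_RESERVED 2   /* standard e820 "reserved" type */

    struct e820entry { uint64_t addr, size; uint32_t type; };

    /* Append one reserved entry per RMRR so the guest OS won't place
     * anything on top of it (e.g. during PCI re-configuration); the
     * table must be kept sorted/merged separately. */
    static int mark_rmrr(struct e820entry *e, int nr,
                         uint64_t start, uint64_t end /* inclusive */)
    {
        e[nr].addr = start;
        e[nr].size = end - start + 1;
        e[nr].type = E820_RESERVED;
        return nr + 1;                /* new number of entries */
    }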

STEP-4. D2 hotplug

After the guest has booted, the user decides to hotplug D1, so libxl
will raise another device assignment request to the Xen hypervisor. At
this point, a conflict is detected on [0xe0000, 0xeffff] and a failure
is returned by default. An override is provided in Xen to allow a
warning and return success, with the user understanding that doing so
implies an insecure environment.

Beyond this example, hotplug will succeed if there is no conflict on
the RMRRs of the hotplugged device.
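
(A sketch of the Xen-side check described above; illustrative only, as
gfn_in_use(), map_identity() and warn_conflict() stand in for the real
p2m/IOMMU operations:)

    #include <errno.h>
    #include <stdint.h>

    int  gfn_in_use(uint64_t gfn);       /* hypothetical p2m lookup  */
    void map_identity(uint64_t gfn);     /* hypothetical 1:1 mapping */
    void warn_conflict(uint64_t gfn);

    /* Before wiring up the 1:1 mapping for an RMRR, verify no gfn in
     * the range is already backed by something else; fail by default,
     * or merely warn if the admin set the override (accepting the
     * insecure/broken behaviour that may result). */
    static int assign_rmrr(uint64_t start_gfn, uint64_t end_gfn, int relaxed)
    {
        for (uint64_t gfn = start_gfn; gfn <= end_gfn; gfn++) {
            if (gfn_in_use(gfn)) {
                if (!relaxed)
                    return -EBUSY;       /* default: fail the hypercall */
                warn_conflict(gfn);      /* override: warn and continue */
            }
            map_identity(gfn);
        }
        return 0;
    }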

STEP-5. migration

After the guest has booted, the user decides to migrate the guest to
another node, where it will be assigned a new device w/ a [0x40000000,
0x40003fff] reserved region. After the guest is migrated, Xen on the
new node will detect the conflict when the new device is hotplugged,
and a failure (or, with the warn override, success) will be returned
accordingly.

Again, hotplug will succeed if there is no conflict on the RMRRs of
the hotplugged device.

---

After working through this long thread, I actually like this
high-level libxl management idea, which avoids complexity in the
domain builder/hvmloader to understand the device/RMRR association,
and then applies different policies according to whether a device is
statically assigned, hotplugged, or reserved for another purpose. :-)

Thanks
Kevin

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: (v2) Design proposal for RMRR fix
  2015-01-13 11:03                                   ` Tian, Kevin
@ 2015-01-13 11:56                                     ` Jan Beulich
  2015-01-13 12:03                                       ` Tian, Kevin
  2015-01-13 13:45                                     ` George Dunlap
  2015-01-13 16:45                                     ` Konrad Rzeszutek Wilk
  2 siblings, 1 reply; 139+ messages in thread
From: Jan Beulich @ 2015-01-13 11:56 UTC (permalink / raw)
  To: Kevin Tian
  Cc: wei.liu2, ian.campbell, stefano.stabellini, George Dunlap, tim,
	ian.jackson, xen-devel, Yang Z Zhang, Tiejun Chen

>>> On 13.01.15 at 12:03, <kevin.tian@intel.com> wrote:
> Then I hope you now understand our discussion about libxl/xen/
> hvmloader, based on the fact that a conflict may not be avoidable.
> That's the major open question in the original discussion with Jan.
> I'd like to give an example of the flow here, per Jan's suggestion,
> starting from the domain builder after the reserved regions have
> been specified by high-level libxl.
> 
> Let's take a synthetic platform w/ two devices, each reported with
> one RMRR reserved region:
> 	(D1): [0xe0000, 0xeffff] in <1MB area
> 	(D2): [0xa0000000, 0xa37fffff] in ~2.75G area
> 
> The guest is configured with 4G memory and assigned D2. Due to libxl
> policy (say, for migration and hotplug), 3 ranges in total are
> reported:
> 	(hotplug): [0xe0000, 0xeffff] in <1MB area in this node
> 	(migration): [0x40000000, 0x40003fff] in ~1G area in another node
> 	(static-assign): [0xa0000000, 0xa37fffff] in ~2.75G area in this node
> 
> let's use xenstore to save such information (assume accessible to both
> domain builder and hvmloader?)
> 
> STEP-1. domain builder
> 
> say the default layout w/o reserved regions would be:
> 	lowmem: 	[0, 0xbfffffff]
> 	mmio hole: 	[0xc0000000, 0xffffffff]
> 	highmem:	[0x100000000, 0x140000000]
> 
> domain builder then queries reserved regions from xenstore, 
> and tries to avoid conflicts.
> 
> For [0xad000000, 0xaf7fffff], it can be avoided by reducing
> lowmem to 0xad000000 and increase highmem:

Inconsistent numbers?

> 	lowmem: 	[0, 0x9fffffff]
> 	mmio hole: 	[0xa0000000, 0xffffffff]
> 	highmem:	[0x100000000, 0x160000000]
> 
> 
> For [0x40000000, 0x40003fff], leave it as a conflict, since either
> reducing lowmem to 1G is not nice to a guest which doesn't use
> highmem, or we have to break lowmem into two trunks, which requires
> more structural changes.

This makes no sense - if such an area was explicitly requested to
be reserved, leaving it as a conflict is not an option.

> For [0xe0000, 0xeffff], leave it as a conflict (w/ guest BIOS)
> 
> In the libxl centrally-managed mode, the domain builder doesn't know
> whether a conflict will lead to an immediate error or not, so the
> best policy here is to throw a warning and then move forward.
> Conflicts will be caught in later steps when a region actually
> matters.
> 
> STEP-2. static device assignment
> 
> After the domain builder, libxl will request the Xen hypervisor to
> complete the actual device assignment. Because D2 is statically
> assigned to this guest, Xen will set up the identity mapping for
> [0xa0000000, 0xa37fffff] with conflict detection in gfn space. Since
> the domain builder has made a hole for this region, there will be no
> conflict and the device will be assigned to the guest successfully.
> 
> STEP-3. hvmloader boot
> 
> hvmloader also needs to query reserved regions (still thru xenstore?)

The mechanism (xenstore vs hypercall) is secondary right now I think.

> due to two reasons:
> 	- mark all reported reserved regions in guest e820
> 	- make holes to avoid conflict in dynamic allocation (e.g. PCI
> BAR, ACPI opregion, etc.)
> 
> hvmloader can avoid making holes in guest RAM again (even if there are
> remaining conflicts w/ guest RAM they would be acceptable; otherwise
> libxl would have failed the boot before reaching here). So hvmloader
> will just add a new reserved e820 entry and make a hole for
> [0xa0000000, 0xa37fffff] in this example, which doesn't conflict with
> guest RAM.

Make hole? The hole is already there from STEP-1.

Jan

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: (v2) Design proposal for RMRR fix
  2015-01-13 11:56                                     ` Jan Beulich
@ 2015-01-13 12:03                                       ` Tian, Kevin
  2015-01-13 15:52                                         ` Jan Beulich
  0 siblings, 1 reply; 139+ messages in thread
From: Tian, Kevin @ 2015-01-13 12:03 UTC (permalink / raw)
  To: Jan Beulich
  Cc: wei.liu2, ian.campbell, stefano.stabellini, George Dunlap, tim,
	ian.jackson, xen-devel, Zhang, Yang Z, Chen, Tiejun

> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Tuesday, January 13, 2015 7:56 PM
> 
> >>> On 13.01.15 at 12:03, <kevin.tian@intel.com> wrote:
> > Then I hope you now understand our discussion about libxl/xen/
> > hvmloader, based on the fact that a conflict may not be avoidable.
> > That's the major open question in the original discussion with Jan.
> > I'd like to give an example of the flow here, per Jan's suggestion,
> > starting from the domain builder after the reserved regions have
> > been specified by high-level libxl.
> >
> > Let's take a synthetic platform w/ two devices, each reported with
> > one RMRR reserved region:
> > 	(D1): [0xe0000, 0xeffff] in <1MB area
> > 	(D2): [0xa0000000, 0xa37fffff] in ~2.75G area
> >
> > The guest is configured with 4G memory and assigned D2. Due to libxl
> > policy (say, for migration and hotplug), 3 ranges in total are
> > reported:
> > 	(hotplug): [0xe0000, 0xeffff] in <1MB area in this node
> > 	(migration): [0x40000000, 0x40003fff] in ~1G area in another node
> > 	(static-assign): [0xa0000000, 0xa37fffff] in ~2.75G area in this node
> >
> > let's use xenstore to save such information (assume accessible to both
> > domain builder and hvmloader?)
> >
> > STEP-1. domain builder
> >
> > say the default layout w/o reserved regions would be:
> > 	lowmem: 	[0, 0xbfffffff]
> > 	mmio hole: 	[0xc0000000, 0xffffffff]
> > 	highmem:	[0x100000000, 0x140000000]
> >
> > domain builder then queries reserved regions from xenstore,
> > and tries to avoid conflicts.
> >
> > For [0xad000000, 0xaf7fffff], it can be avoided by reducing
> > lowmem to 0xad000000 and increase highmem:
> 
> Inconsistent numbers?

sorry, should be [0xa0000000, 0xa37fffff]

> 
> > 	lowmem: 	[0, 0x9fffffff]
> > 	mmio hole: 	[0xa0000000, 0xffffffff]
> > 	highmem:	[0x100000000, 0x160000000]
> >
> >
> > For [0x40000000, 0x40003fff], leave it as a conflict, since either
> > reducing lowmem to 1G is not nice to a guest which doesn't use
> > highmem, or we have to break lowmem into two trunks, which requires
> > more structural changes.
> 
> This makes no sense - if such an area was explicitly requested to
> be reserved, leaving it as a conflict is not an option.

Explicitly requested by libxl, yes, but leaving it as a conflict in the
domain builder is just fine: later steps will catch the conflicts when
the relevant regions are actually used (e.g. in static assignment, in
hotplug, or in migration).

> 
> > For [0xe0000, 0xeffff], leave it as a conflict (w/ guest BIOS)
> >
> > In the libxl centrally-managed mode, the domain builder doesn't know
> > whether a conflict will lead to an immediate error or not, so the
> > best policy here is to throw a warning and then move forward.
> > Conflicts will be caught in later steps when a region actually
> > matters.
> >
> > STEP-2. static device assignment
> >
> > After the domain builder, libxl will request the Xen hypervisor to
> > complete the actual device assignment. Because D2 is statically
> > assigned to this guest, Xen will set up the identity mapping for
> > [0xa0000000, 0xa37fffff] with conflict detection in gfn space. Since
> > the domain builder has made a hole for this region, there will be no
> > conflict and the device will be assigned to the guest successfully.
> >
> > STEP-3. hvmloader boot
> >
> > hvmloader also needs to query reserved regions (still thru xenstore?)
> 
> The mechanism (xenstore vs hypercall) is secondary right now I think.
> 
> > due to two reasons:
> > 	- mark all reported reserved regions in guest e820
> > 	- make holes to avoid conflict in dynamic allocation (e.g. PCI
> > BAR, ACPI opregion, etc.)
> >
> > hvmloader can avoid making holes in guest RAM again (even if there are
> > remaining conflicts w/ guest RAM they would be acceptable; otherwise
> > libxl would have failed the boot before reaching here). So hvmloader
> > will just add a new reserved e820 entry and make a hole for
> > [0xa0000000, 0xa37fffff] in this example, which doesn't conflict with
> > guest RAM.
> 
> Make hole? The hole is already there from STEP-1.
> 

In STEP-1 it's only part of the big MMIO hole. Here, when hvmloader
allocates guest PCI MMIO, it needs to be aware of this region again.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: (v2) Design proposal for RMRR fix
  2015-01-13 11:03                                   ` Tian, Kevin
  2015-01-13 11:56                                     ` Jan Beulich
@ 2015-01-13 13:45                                     ` George Dunlap
  2015-01-13 15:47                                       ` Jan Beulich
  2015-01-13 16:45                                     ` Konrad Rzeszutek Wilk
  2 siblings, 1 reply; 139+ messages in thread
From: George Dunlap @ 2015-01-13 13:45 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: wei.liu2, ian.campbell, stefano.stabellini, tim, ian.jackson,
	xen-devel, Jan Beulich, Zhang, Yang Z, Chen, Tiejun

On Tue, Jan 13, 2015 at 11:03 AM, Tian, Kevin <kevin.tian@intel.com> wrote:
>> Right; so the "report" in this case is "report to the guest".
>>
>> As I said, I think that's confusing terminology; after all, we want to
>> report to the guest all holes that we make, and only the holes that we
>> make.  The question isn't then which ones we report, but which ones we
>> make holes for. :-)
>
> Originally I used 'report' to describe the hypercall with which the
> hypervisor composes the actual information about RMRRs, so it can be
> 'report to libxl' or 'report to the guest' depending on who invokes
> that hypercall.

Yes, that's what I thought "report" meant; which I think disposes one
to think of one specific way the system would work: i.e., libxl asks
Xen for the RMRRs, Xen filters out which ones to give it, and libxl
passes on all RMRRs reported to it by Xen.  Since I think libxl is the
right place to "filter" RMRRs, then we shouldn't think about this as
Xen "reporting" RMRRs, but as libxl "querying" RMRRs, and then
choosing which ones to pass in (perhaps all of them, perhaps a
subset).

>> So for this discussion, maybe "rmrr-host" (meaning, copy RMRRs from
>> the host) or "rmrr-sel" (meaning, specify a selection of RMRRs, which
>> may be from this host, or even another host)?
>
> the counterpart of 'rmrr-host' gives me feeling of 'rmrr-guest'. :-)

Think about the "e820_host" option for a bit (which tries to make the
guest's e820 memory regions look like the hosts') and maybe it will
make more sense. :-)

>> Well it will have an impact on the overall design of the code; but
>> you're right, if RMRRs really can (and will) be anywhere in memory,
>> then the domain builder will need to know what RMRRs are going to be
>> reserved for this VM and avoid populating those.  If, on the other
>> hand, we can make some fairly small assumptions about where there will
>> not be any RMRRs, then we can get away with handling everything in
>> hvmloader.
>
> I'm not sure such fairly small assumptions can be made. For example,
> the host RMRRs may include one or several regions in host PCI MMIO
> space (say >3G); hvmloader then has to know about them to avoid
> allocating those ranges for guest PCI MMIO.

Yes, I'm talking here about Jan's idea of having the domain builder in
libxc do the minimal amount of work to get hvmloader to run, and then
having hvmloader populate the rest of the address space. So the
comparison is:

1. Both libxc and hvmloader know about RMRRs.  libxc uses this
information to avoid placing the hvmloader over an RMRR region,
hvmloader uses the information to populate the memory map and place
the MMIO ranges such that neither overlap with RMRRs.

2. Only hvmloader knows about RMRRs.  libxc places hvmloader in a
location in RAM basically guaranteed never to overlap with RMRRs;
hvmloader uses the information to populate memory map and place the
MMIO ranges such that neither overlap with RMRRs.

#2 is only possible if we can find a region of the physical address
space almost guaranteed never to overlap with RMRRs.  Otherwise, we
may have to fall back to #1.
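
As a sketch of the check that would tell us whether #2 is viable
(illustrative; the size bound on the loader image is an assumption):

    #include <stdint.h>

    struct range { uint64_t start, end; };   /* inclusive */

    #define HVMLOADER_BASE 0x100000ULL   /* hvmloader loads at 1MB */
    #define HVMLOADER_SIZE 0x200000ULL   /* generous bound, assumed */

    /* Option #2 is only safe if no RMRR can overlap the loader image
     * itself; on overlap we must fall back to option #1. */
    static int option2_safe(const struct range *rmrr, int nr)
    {
        for (int i = 0; i < nr; i++)
            if (rmrr[i].start <= HVMLOADER_BASE + HVMLOADER_SIZE - 1 &&
                rmrr[i].end >= HVMLOADER_BASE)
                return 0;
        return 1;
    }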

>> > To state my major intention again: I don't think the preparation for
>> > device assignment (i.e. detecting conflicts and making holes) should
>> > be a blocking failure.  Throwing a warning should be enough (i.e. in
>> > libxc). We should let the actual device assignment path make the
>> > final call based on the admin's configuration (default 'fail' w/
>> > 'warn' override). Based on that policy I think 'report-all' (making
>> > holes for all host RMRRs) is an acceptable approach, with the small
>> > impact of possibly more warning messages (actually not bad, as they
>> > help the admin understand the hotplug possibilities on this platform)
>> > and of showing more reserved regions to the end user (but he
>> > shouldn't make any assumption here). :-)
>>
>> I don't really understand what you're talking about here.
>>
>> When the libxc domain builder runs, there is *no* guest memory mapped.
>> So if it has the RMRRs, then it can *avoid* conflict; and if it
>> doesn't have the RMRRs, it can't even *detect* conflict.  So there is
>> no reason for libxc to either give a warning, or cause a failure.
>
> Not all of the conflicts can or will be avoided. E.g. a USB device may
> report a region conflicting with the guest BIOS, which is a hard
> conflict. Another example (from one design option): we may want to keep
> the current cross-component structure (one lowmem trunk + one highmem
> trunk), so a conflict in the middle (e.g. at 2G) is a problem (avoiding
> it would break lowmem apart or make lowmem too small).

Ah, this is the missing bit of information.  So can you expand on this
a bit -- are you saying that the guest BIOS has a fixed place in RAM
it must be loaded, and that area can't be changed?  And that
furthermore, for some reason, this may overlap with RMRRs for
passed-through devices?

>> I'm also not clear what assumptions "he" may be making: you mean, the
>> existence of an RMRR in the e820 map shouldn't be taken to mean that
>> he has a specific device assigned?  No, indeed, he should not make
>> such an assumption. :-)
>
> I meant 'he' shouldn't make assumptions about how many reserved regions
> should exist in the e820 based on the exposed devices. Jan has a concern
> that exposing more reserved regions in the e820 than necessary is not a
> good thing. I'm trying to convince him it should be fine. :-)

Right -- well there is a level of practicality here: if in fact many
operating systems ignore the e820 map and base their ideas on what
devices are present, then we would have to try to work around that.

But since this is actually done by the OS and not the driver, in the
absence of any major OSes that actually behave this way, it seems to
me like taking the simpler option of assuming that the guest OS will
honor the e820 map should be OK.

> Then I hope you now understand our discussion about libxl/xen/
> hvmloader, based on the fact that a conflict may not be avoidable.
> That's the major open question in the original discussion with Jan.
> I'd like to give an example of the flow here, per Jan's suggestion,
> starting from the domain builder after the reserved regions have
> been specified by high-level libxl.
>
> Let's take a synthetic platform w/ two devices, each reported with
> one RMRR reserved region:
>         (D1): [0xe0000, 0xeffff] in <1MB area
>         (D2): [0xa0000000, 0xa37fffff] in ~2.75G area
>
> The guest is configured with 4G memory and assigned D2. Due to libxl
> policy (say, for migration and hotplug), 3 ranges in total are
> reported:
>         (hotplug): [0xe0000, 0xeffff] in <1MB area in this node
>         (migration): [0x40000000, 0x40003fff] in ~1G area in another node
>         (static-assign): [0xa0000000, 0xa37fffff] in ~2.75G area in this node
>
> let's use xenstore to save such information (assume accessible to both
> domain builder and hvmloader?)

IIRC xenstore is available to hvmloader, but *not* to the libxc domain
builder.  (I could be wrong about that.)

>
> STEP-1. domain builder
>
> say the default layout w/o reserved regions would be:
>         lowmem:         [0, 0xbfffffff]
>         mmio hole:      [0xc0000000, 0xffffffff]
>         highmem:        [0x100000000, 0x140000000]
>
> domain builder then queries reserved regions from xenstore,
> and tries to avoid conflicts.
>
> For [0xad000000, 0xaf7fffff], it can be avoided by reducing
> lowmem to 0xad000000 and increase highmem:
>         lowmem:         [0, 0x9fffffff]
>         mmio hole:      [0xa0000000, 0xffffffff]
>         highmem:        [0x100000000, 0x160000000]
>
>
> For [0x40000000, 0x40003fff], leave it as a conflict, since either
> reducing lowmem to 1G is not nice to a guest which doesn't use
> highmem, or we have to break lowmem into two trunks, which requires
> more structural changes.

Why can we not just leave that area of memory unpopulated?

> For [0xe0000, 0xeffff], leave it as a conflict (w/ guest BIOS)

And we can't move the guest BIOS in any way?

So in your example here, libxc continues to do the main work of laying
out the address space, and hvmloader only has to deal with laying out
the MMIO regions.

What do you think of Jan's idea, of changing things so that libxc only
does just enough work to set up hvmloader, and then hvmloader
populates the guest memory -- avoiding putting either RAM or MMIO
regions into RMRRs?

 -George

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: (v2) Design proposal for RMRR fix
  2015-01-13 13:45                                     ` George Dunlap
@ 2015-01-13 15:47                                       ` Jan Beulich
  2015-01-13 16:00                                         ` George Dunlap
  0 siblings, 1 reply; 139+ messages in thread
From: Jan Beulich @ 2015-01-13 15:47 UTC (permalink / raw)
  To: George Dunlap
  Cc: Kevin Tian, wei.liu2, ian.campbell, stefano.stabellini, tim,
	ian.jackson, xen-devel, Yang Z Zhang, Tiejun Chen

>>> On 13.01.15 at 14:45, <George.Dunlap@eu.citrix.com> wrote:
> On Tue, Jan 13, 2015 at 11:03 AM, Tian, Kevin <kevin.tian@intel.com> wrote:
>>> Well it will have an impact on the overall design of the code; but
>>> you're right, if RMRRs really can (and will) be anywhere in memory,
>>> then the domain builder will need to know what RMRRs are going to be
>>> reserved for this VM and avoid populating those.  If, on the other
>>> hand, we can make some fairly small assumptions about where there will
>>> not be any RMRRs, then we can get away with handling everything in
>>> hvmloader.
>>
>> I'm not sure such fairly small assumptions can be made. For example,
>> the host RMRRs may include one or several regions in host PCI MMIO
>> space (say >3G); hvmloader then has to know about them to avoid
>> allocating those ranges for guest PCI MMIO.
> 
> Yes, I'm talking here about Jan's idea of having the domain builder in
> libxc do the minimal amount of work to get hvmloader to run, and then
> having hvmloader populate the rest of the address space. So the
> comparison is:
> 
> 1. Both libxc and hvmloader know about RMRRs.  libxc uses this
> information to avoid placing the hvmloader over an RMRR region,
> hvmloader uses the information to populate the memory map and place
> the MMIO ranges such that neither overlap with RMRRs.
> 
> 2. Only hvmloader knows about RMRRs.  libxc places hvmloader in a
> location in RAM basically guaranteed never to overlap with RMRRs;
> hvmloader uses the information to populate memory map and place the
> MMIO ranges such that neither overlap with RMRRs.
> 
> #2 is only possible if we can find a region of the physical address
> space almost guaranteed never to overlap with RMRRs.  Otherwise, we
> may have to fall back to #1.

hvmloader loads at 0x100000, and I think we can be pretty certain
that there's not going to be any RMRRs for that space.

>>> I'm also not clear what assumptions "he" may be making: you mean, the
>>> existence of an RMRR in the e820 map shouldn't be taken to mean that
>>> he has a specific device assigned?  No, indeed, he should not make
>>> such an assumption. :-)
>>
>> I meant 'he' shouldn't make assumptions about how many reserved regions
>> should exist in the e820 based on the exposed devices. Jan has a concern
>> that exposing more reserved regions in the e820 than necessary is not a
>> good thing. I'm trying to convince him it should be fine. :-)
> 
> Right -- well there is a level of practicality here: if in fact many
> operating systems ignore the e820 map and base their ideas on what
> devices are present, then we would have to try to work around that.
> 
> But since this is actually done by the OS and not the driver, in the
> absence of any major OSes that actually behave this way, it seems to
> me like taking the simpler option of assuming that the guest OS will
> honor the e820 map should be OK.

Since your response doesn't seem connected to what Kevin said, I
think there's some misunderstanding here: The concern Kevin
mentioned I have is marking more regions than necessary as reserved
in the E820 map (needlessly reducing or splitting up lowmem).

>> For [0xe0000, 0xeffff], leave it as a conflict (w/ guest BIOS)
> 
> And we can't move the guest BIOS in any way?

No. BIOSes know the address they get put at. The only hope here
is that conflicts would be only with the transiently loaded init-time
portion of the BIOS: typically, the BIOS has a large resident part
in the F0000-FFFFF range, while SeaBIOS in particular has another
init-time part living immediately below the resident one, which gets
discarded once BIOS init is done.

Jan

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: (v2) Design proposal for RMRR fix
  2015-01-13 12:03                                       ` Tian, Kevin
@ 2015-01-13 15:52                                         ` Jan Beulich
  2015-01-13 15:58                                           ` George Dunlap
  0 siblings, 1 reply; 139+ messages in thread
From: Jan Beulich @ 2015-01-13 15:52 UTC (permalink / raw)
  To: Kevin Tian
  Cc: wei.liu2, ian.campbell, stefano.stabellini, George Dunlap, tim,
	ian.jackson, xen-devel, Yang Z Zhang, Tiejun Chen

>>> On 13.01.15 at 13:03, <kevin.tian@intel.com> wrote:
>>  From: Jan Beulich [mailto:JBeulich@suse.com]
>> Sent: Tuesday, January 13, 2015 7:56 PM
>> >>> On 13.01.15 at 12:03, <kevin.tian@intel.com> wrote:
>> > 	lowmem: 	[0, 0x9fffffff]
>> > 	mmio hole: 	[0xa0000000, 0xffffffff]
>> > 	highmem:	[0x100000000, 0x160000000]
>> >
>> >
>> > For [0x40000000, 0x40003fff], leave it as a conflict, since either
>> > reducing lowmem to 1G is not nice to a guest which doesn't use
>> > highmem, or we have to break lowmem into two trunks, which requires
>> > more structural changes.
>> 
>> This makes no sense - if such an area was explicitly requested to
>> be reserved, leaving it as a conflict is not an option.
> 
> Explicitly requested by libxl, yes, but leaving it as a conflict in the
> domain builder is just fine: later steps will catch the conflicts when
> the relevant regions are actually used (e.g. in static assignment, in
> hotplug, or in migration).

But why do you think xl requested the region to be reserved?
Presumably because the guest config file said so. And if the
config file said so, there's no alternative to punching a hole
there - failing device assignment later on is what the guest
config file setting was added to avoid.

Jan

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: (v2) Design proposal for RMRR fix
  2015-01-13 15:52                                         ` Jan Beulich
@ 2015-01-13 15:58                                           ` George Dunlap
  2015-01-14  8:06                                             ` Tian, Kevin
  0 siblings, 1 reply; 139+ messages in thread
From: George Dunlap @ 2015-01-13 15:58 UTC (permalink / raw)
  To: Jan Beulich, Kevin Tian
  Cc: wei.liu2, ian.campbell, stefano.stabellini, tim, ian.jackson,
	xen-devel, Yang Z Zhang, Tiejun Chen

On 01/13/2015 03:52 PM, Jan Beulich wrote:
>>>> On 13.01.15 at 13:03, <kevin.tian@intel.com> wrote:
>>>  From: Jan Beulich [mailto:JBeulich@suse.com]
>>> Sent: Tuesday, January 13, 2015 7:56 PM
>>>>>> On 13.01.15 at 12:03, <kevin.tian@intel.com> wrote:
>>>> 	lowmem: 	[0, 0x9fffffff]
>>>> 	mmio hole: 	[0xa0000000, 0xffffffff]
>>>> 	highmem:	[0x100000000, 0x160000000]
>>>>
>>>>
>>>> For [0x40000000, 0x40003fff], leave it as a conflict, since either
>>>> reducing lowmem to 1G is not nice to a guest which doesn't use
>>>> highmem, or we have to break lowmem into two trunks, which requires
>>>> more structural changes.
>>>
>>> This makes no sense - if such an area was explicitly requested to
>>> be reserved, leaving it as a conflict is not an option.
>>
>> Explicitly requested by libxl, yes, but leaving it as a conflict in the
>> domain builder is just fine: later steps will catch the conflicts when
>> the relevant regions are actually used (e.g. in static assignment, in
>> hotplug, or in migration).
> 
> But why do you think xl requested the region to be reserved?
> Presumably because the guest config file said so. And if the
> config file said so, there's no alternative to punching a hole
> there - failing device assignment later on is what the guest
> config file setting was added to avoid.

Yes -- the general principle should be: if the admin has asked for X,
and X cannot be done, then the entire operation should fail, so that the
admin can either make it possible for X to happen, or decide to do Y
instead.  Automatically doing Z when the admin has explicitly asked for
X is poor interface design.

That's why libxl will destroy the domain if the pci_add fails: if you've
asked for device A to be assigned (X), and it can't be assigned for
whatever reason, the domain creation should fail and the admin should be
told why, so he can ask for what he wants instead.

In the case of RMRRs, if the admin has asked for an RMRR to be made, and
it can't be made for whatever reason, the guest creation should fail.
For statically assigned devices this will fall out naturally from the
device assignment failing; but for RMRRs corresponding to hot-plug
regions, we'll have to fail somewhere else -- preferably in libxc
rather than hvmloader?

We may want to make a special case for RMRRs that overlap the guest
BIOS, I guess.

 -George

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: (v2) Design proposal for RMRR fix
  2015-01-13 15:47                                       ` Jan Beulich
@ 2015-01-13 16:00                                         ` George Dunlap
  2015-01-13 16:06                                           ` Jan Beulich
  0 siblings, 1 reply; 139+ messages in thread
From: George Dunlap @ 2015-01-13 16:00 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Kevin Tian, wei.liu2, ian.campbell, stefano.stabellini, tim,
	ian.jackson, xen-devel, Yang Z Zhang, Tiejun Chen

On 01/13/2015 03:47 PM, Jan Beulich wrote:
>>>> On 13.01.15 at 14:45, <George.Dunlap@eu.citrix.com> wrote:
>> On Tue, Jan 13, 2015 at 11:03 AM, Tian, Kevin <kevin.tian@intel.com> wrote:
>>>> Well it will have an impact on the overall design of the code; but
>>>> you're right, if RMRRs really can (and will) be anywhere in memory,
>>>> then the domain builder will need to know what RMRRs are going to be
>>>> reserved for this VM and avoid populating those.  If, on the other
>>>> hand, we can make some fairly small assumptions about where there will
>>>> not be any RMRRs, then we can get away with handling everything in
>>>> hvmloader.
>>>
>>> I'm not sure such fairly small assumptions can be made. For example,
>>> the host RMRRs may include one or several regions in host PCI MMIO
>>> space (say >3G); hvmloader then has to know about them to avoid
>>> allocating those ranges for guest PCI MMIO.
>>
>> Yes, I'm talking here about Jan's idea of having the domain builder in
>> libxc do the minimal amount of work to get hvmloader to run, and then
>> having hvmloader populate the rest of the address space. So the
>> comparison is:
>>
>> 1. Both libxc and hvmloader know about RMRRs.  libxc uses this
>> information to avoid placing the hvmloader over an RMRR region,
>> hvmloader uses the information to populate the memory map and place
>> the MMIO ranges such that neither overlap with RMRRs.
>>
>> 2. Only hvmloader knows about RMRRs.  libxc places hvmloader in a
>> location in RAM basically guaranteed never to overlap with RMRRs;
>> hvmloader uses the information to populate memory map and place the
>> MMIO ranges such that neither overlap with RMRRs.
>>
>> #2 is only possible if we can find a region of the physical address
>> space almost guaranteed never to overlap with RMRRs.  Otherwise, we
>> may have to fall back to #1.
> 
> hvmloader loads at 0x100000, and I think we can be pretty certain
> that there's not going to be any RMRRs for that space.

Good.

>>>> I'm also not clear what assumptions "he" may be making: you mean, the
>>>> existence of an RMRR in the e820 map shouldn't be taken to mean that
>>>> he has a specific device assigned?  No, indeed, he should not make
>>>> such an assumption. :-)
>>>
>>> I meant 'he' shouldn't make assumptions about how many reserved regions
>>> should exist in the e820 based on the exposed devices. Jan has a concern
>>> that exposing more reserved regions in the e820 than necessary is not a
>>> good thing. I'm trying to convince him it should be fine. :-)
>>
>> Right -- well there is a level of practicality here: if in fact many
>> operating systems ignore the e820 map and base their ideas on what
>> devices are present, then we would have to try to work around that.
>>
>> But since this is actually done by the OS and not the driver, in the
>> absence of any major OSes that actually behave this way, it seems to
>> me like taking the simpler option of assuming that the guest OS will
>> honor the e820 map should be OK.
> 
> Since your response doesn't seem connected to what Kevin said, I
> think there's some misunderstanding here: The concern Kevin
> mentioned I have is marking more regions than necessary as reserved
> in the E820 map (needlessly reducing or splitting up lowmem).

OK, so you're concerned with reducing fragmentation / maximizing
availability of lowmem.  Yes, that's another reason to try to minimize
the number of RMRRs reported in general.

Another option I was thinking about: Before assigning a device to a
guest, you have to unplug the device and assign it to pci-back (e.g.,
with xl pci-assignable-add).  In addition to something like rmrr=host,
we could add rmrr=assignable, which would add all of the RMRRs of all
devices currently listed as "assignable".  The idea would then be that
you first make all your devices assignable, then just start your guests,
and everything you've made assignable will be able to be assigned.
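
Concretely, I'd imagine something like this in a guest config
(hypothetical syntax, none of this exists yet):

	pci  = [ '01:00.0' ]	# statically assigned device
	rmrr = "assignable"	# additionally reserve the RMRRs of every
				# device currently bound to pciback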

> 
>>> For [0xe0000, 0xeffff], leave it as a conflict (w/ guest BIOS)
>>
>> And we can't move the guest BIOS in any way?
> 
> No. BIOSes know the address they get put at. The only hope here
> is that conflicts would be only with the transiently loaded init-time
> portion of the BIOS: Typically, the BIOS has a large resident part
> in the F0000-FFFFF range, while SeaBIOS in particular has another
> init-time part living immediately below the resident one, and getting
> discarded once BIOS init was done.

I see, thanks.

 -George

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: (v2) Design proposal for RMRR fix
  2015-01-13 16:00                                         ` George Dunlap
@ 2015-01-13 16:06                                           ` Jan Beulich
  2015-01-14  6:52                                             ` Tian, Kevin
  0 siblings, 1 reply; 139+ messages in thread
From: Jan Beulich @ 2015-01-13 16:06 UTC (permalink / raw)
  To: George Dunlap
  Cc: Kevin Tian, wei.liu2, ian.campbell, stefano.stabellini, tim,
	ian.jackson, xen-devel, Yang Z Zhang, Tiejun Chen

>>> On 13.01.15 at 17:00, <george.dunlap@eu.citrix.com> wrote:
> Another option I was thinking about: Before assigning a device to a
> guest, you have to unplug the device and assign it to pci-back (e.g.,
> with xl pci-assignable-add).  In addition to something like rmrr=host,
> we could add rmrr=assignable, which would add all of the RMRRs of all
> devices currently listed as "assignable".  The idea would then be that
> you first make all your devices assignable, then just start your guests,
> and everything you've made assignable will be able to be assigned.

Nice idea indeed, but I'm not sure about its practicability: It may
not be desirable to make all devices eventually to be handed to a
guest prior to starting any of the guests it may get handed to. In
particular there may be reasons why the host needs the device
while (or until after) creating the guests.

Jan

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: (v2) Design proposal for RMRR fix
  2015-01-13 11:03                                   ` Tian, Kevin
  2015-01-13 11:56                                     ` Jan Beulich
  2015-01-13 13:45                                     ` George Dunlap
@ 2015-01-13 16:45                                     ` Konrad Rzeszutek Wilk
  2015-01-14  8:13                                       ` Tian, Kevin
  2015-01-14 12:47                                       ` George Dunlap
  2 siblings, 2 replies; 139+ messages in thread
From: Konrad Rzeszutek Wilk @ 2015-01-13 16:45 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: wei.liu2, ian.campbell, stefano.stabellini, George Dunlap,
	ian.jackson, tim, xen-devel, Jan Beulich, Zhang, Yang Z, Chen,
	Tiejun

On Tue, Jan 13, 2015 at 11:03:22AM +0000, Tian, Kevin wrote:
> > From: George Dunlap
> > Sent: Monday, January 12, 2015 10:20 PM
> > 
> > On Mon, Jan 12, 2015 at 12:28 PM, Tian, Kevin <kevin.tian@intel.com> wrote:
> > >> From: George Dunlap
> > >> Sent: Monday, January 12, 2015 8:14 PM
> > >>
> > >> On Mon, Jan 12, 2015 at 11:22 AM, Tian, Kevin <kevin.tian@intel.com>
> > wrote:
> > >> >> From: Jan Beulich [mailto:JBeulich@suse.com]
> > >> >> Sent: Monday, January 12, 2015 6:23 PM
> > >> >>
> > >> >> >>> On 12.01.15 at 11:12, <kevin.tian@intel.com> wrote:
> > >> >> >>  From: Jan Beulich [mailto:JBeulich@suse.com]
> > >> >> >> Sent: Monday, January 12, 2015 6:09 PM
> > >> >> >>
> > >> >> >> >>> On 12.01.15 at 10:56, <kevin.tian@intel.com> wrote:
> >> >> >> > the result is related to another open question, whether we want
> >> >> >> > to block guest boot for such a problem. If 'warn' in the domain
> >> >> >> > builder is acceptable, we don't need to change lowmem for such a
> >> >> >> > rare 1GB case, just throw a warning for unnecessary conflicts
> >> >> >> > (it doesn't hurt if the user doesn't assign the device).
> > >> >> >>
> > >> >> >> And how would you then deal with the one guest needing that
> > >> >> >> range reserved?
> > >> >> >
> >> >> > if the guest needs the range, then report-all or report-sel doesn't
> >> >> > matter. The domain builder throws the warning, and later device
> >> >> > assignment will fail (or warn w/ override). In reality I think 1GB
> >> >> > is rare. Making such an assumption to simplify the implementation is
> >> >> > reasonable.
> > >> >>
> >> >> One of my main problems with all your recent argumentation here
> > >> >> is the arbitrary use of the 1Gb boundary - there's nothing special
> > >> >> in this discussion with where the boundary is. Everything revolves
> > >> >> around the (undue) effect of report-all on domains not needing all
> > >> >> of the ranges found on the host.
> > >> >>
> > >> >
> >> > I'm not sure which part of my argument is not clear here. report-all
> >> > would be a problem here only if we want to fix all the conflicts
> >> > in the domain builder (pulling in unnecessary devices increases the
> >> > chance of a conflict). But if we only fix reasonable ones (e.g. >3GB)
> >> > while warning about other conflicts (e.g. <3G) in the domain builder
> >> > (letting the later assignment path actually fail if a conflict does
> >> > matter), then we don't need to solve all conflicts in the domain
> >> > builder (in, say, the 1G example, fixing it may instead reduce lowmem
> >> > greatly), and then report-all may just add more warnings than
> >> > report-sel for unused devices.
> > >>
> > >> You keep saying "report-all" or "report-sel", but I'm not 100% clear
> > >> what you mean by those.  In any case, the naming has got to be a bit
> > >> misleading: the important questions at the moment, AFAICT, are:
> > >
> > > I explained them in the original proposal
> > 
> > Yes, I read it and didn't understand it there either. :-)
> 
> sorry for that.
> 
> > 
> > >> 1. Whether we make holes at boot time for all RMRRs on the system, or
> > >> whether only make RMRRs for some subset (or potentially some other
> > >> arbitrary range, which may include RMRRs on other hosts to which we
> > >> may want to migrate).
> > >
> > > I use 'report-all' to stand for making holes for all RMRRs on the system,
> > > and 'report-sel' for a specified subset.
> > >
> > > Including other RMRRs (from the admin, for migration) is orthogonal to
> > > the above open question.
> > 
> > Right; so the "report" in this case is "report to the guest".
> > 
> > As I said, I think that's confusing terminology; after all, we want to
> > report to the guest all holes that we make, and only the holes that we
> > make.  The question isn't then which ones we report, but which ones we
> > make holes for. :-)
> 
> originally I used 'report' to describe the hypercall by which the hypervisor
> composes the actual information about RMRRs, so it can be 'report to libxl'
> or 'report to the guest' depending on who invokes that hypercall.
> 
> but yes, here we care more about what's reported to the guest.
> 
> > 
> > So for this discussion, maybe "rmrr-host" (meaning, copy RMRRs from
> > the host) or "rmrr-sel" (meaning, specify a selection of RMRRs, which
> > may be from this host, or even another host)?
> 
> the counterpart of 'rmrr-host' gives me the feeling of 'rmrr-guest'. :-)
> 
> > 
> > Given that the ranges may be of arbitrary size, and that we may want
> > to specify additional ranges for migration to other hosts, I think
> > that at some level we need the machinery to be in place to
> > specify the RMRRs that will be reserved for a specific guest.
> > 
> > At the xl level, there should of course be a way to specify "use all
> > host RMRRs"; but what should happen then is that xl / libxl should
> > query Xen for the host RMRRs and then pass those down to the next
> > layer of the library.
> > 
> > >> 2. Whether those holes are made by the domain builder in libxc, or by
> > >> hvmloader
> > >
> > > based on the current discussion, whether to make holes in hvmloader
> > > doesn't make a fundamental difference. As long as the domain builder
> > > still needs to populate memory (even the minimum for hvmloader to boot),
> > > it needs to check for conflicts and may ideally make holes too (though
> > > we may assume it doesn't)
> > 
> > Well it will have an impact on the overall design of the code; but
> > you're right, if RMRRs really can (and will) be anywhere in memory,
> > then the domain builder will need to know what RMRRs are going to be
> > reserved for this VM and avoid populating those.  If, on the other
> > hand, we can make some fairly small assumptions about where there will
> > not be any RMRRs, then we can get away with handling everything in
> > hvmloader.
> 
> I'm not sure such fairly small assumptions can be made. For example,
> a host RMRR may include one or several regions in host PCI MMIO
> space (say >3G), so hvmloader has to know about those regions to
> avoid allocating them for guest PCI MMIO.
> 
> > 
> > >>
> > >> 3. What happens if Xen is asked to assign a device and it finds that
> > >> the required RMRR is not empty:
> > >>  a. during guest creation
> > >>  b. after the guest has booted
> > >
> > > for Xen we don't need to differentiate a/b. By default a clear failure
> > > should be returned, as moving forward would imply a security/correctness
> > > issue. But based on the discussion an override to 'warn' only is
> > > preferred, so the admin can make the decision (it remains open whether
> > > to do a global override or a per-device override)
> > 
> > Well I think part of our confusion here is what "fail" vs "warn" means.
> > 
> > Fail / warn might be "Do we refuse to assign the device, or do we go
> > ahead and assign the device, knowing that it may act buggy?"
> > 
> > Or it might be, "Do we fail domain creation if at some step we
> > discover an RMRR conflict?  Or do we let the domain create succeed but
> > warn that the device has not been attached?"
> > 
> > I think in any case, failing to *assign* the device is the right thing
> > to do (except perhaps with a per-device override option).
> 
> yes
> 
> > 
> > libxl already has a policy of what happens when pci assignment fails
> > during domain creation.  If I'm reading the code right, libxl will
> > destroy the domain if libxl__device_pci_add() fails during domain
> > creation; I think that's the right thing to do.  If you want to change
> > that policy, that's a different discussion.
> 
> not my intention. As I said, the policy for Xen is clear: just fail the
> assignment hypercall (or warn w/ override) and keep whatever policy is
> defined by libxl.
> 
> > 
> > But if the device assignment fails due to an unspecified RMRR, that's
> > a bug in the toolstack -- it should have looked at the device list,
> > found out what RMRRs were necessary, and reserved those ranges before
> > we got to that point.
> > 
> > The only time I would expect device assignment might fail during
> > domain creation is if one of the devices had an RMRR shared with a
> > device already assigned to another VM.
> > 
> > >> Obviously at some point some part of the toolstack needs to identify
> > >> which RMRRs go with what device, so that either libxc or hvmloader can
> > >> make the appropriate holes in the address space; but at that point,
> > >> "report" is not so much the right word as "query".  (Obviously we want
> > >> to "report" in the e820 map all RMRRs that we've made holes for in the
> > >> guest.)
> > >
> > > yes, using 'report' doesn't capture all the changes we need to make. I
> > > just used the terms to simplify discussion, assuming all were on the
> > > same page. However, clearly my original explanation didn't manage that. :/
> > >
> > > And to state my major intention again: I don't think the preparation (i.e.
> > > detecting conflicts and making holes) for device assignment should be a
> > > blocking failure.  Throwing a warning should be enough (i.e. in libxc). We
> > > should let the actual device assignment path make the final call based on
> > > the admin's configuration (default 'fail' w/ a 'warn' override). Based on
> > > that policy I think 'report-all' (making holes for all host RMRRs) is an
> > > acceptable approach, w/ the small impact of possibly more warning
> > > messages (actually not bad, helping the admin understand the hotplug
> > > possibilities on this platform) and more reserved regions shown to the
> > > end user (but he shouldn't make any assumptions here). :-)
> > 
> > I don't really understand what you're talking about here.
> > 
> > When the libxc domain builder runs, there is *no* guest memory mapped.
> > So if it has the RMRRs, then it can *avoid* conflict; and if it
> > doesn't have the RMRRs, it can't even *detect* conflict.  So there is
> > no reason for libxc to either give a warning, or cause a failure.
> 
> not all of the conflicts can or will be avoided. E.g. USB may report a
> region conflicting with the guest BIOS, which is a hard conflict. Another
> example (from one design option) is that we may want to keep the
> current cross-component structure (one lowmem + one highmem),
> so a conflict in the middle (e.g. at 2G) is a problem (avoiding it would
> break lowmem or make lowmem too small).
> 
> As long as we agree some conflicts may not be avoided, it comes down
> to the open question of whether to give a warning or cause a failure.
> I view making holes in the domain builder as preparation for later device
> assignment, so giving a warning should be sufficient here, since Xen will
> fail the assignment hypercall later when it actually happens and then
> libxl will react according to the defined policy like you described above.
> 
> > 
> > So I'm not sure why you think making holes for all RMRRs would have
> > more warning messages.
> 
> based on the fact that not all RMRRs can or will be avoided, making
> holes for all RMRRs on the host can definitely lead to more conflicts
> than just making holes for the RMRRs belonging to assigned devices.
> Once we agree a warning is OK in the domain builder, that means
> more warning messages.
> 
> > 
> > And when you say "show more reserved regions to the end user", I take
> > it you mean the guest kernel (via the e820 map)?
> 
> yes, since all reserved regions have to be marked in the e820 so that the
> guest OS itself won't allocate from the hole, e.g. when doing PCI
> re-configuration.
> 
> > 
> > I'm also not clear what assumptions "he" may be making: you mean, the
> > existence of an RMRR in the e820 map shouldn't be taken to mean that
> > he has a specific device assigned?  No, indeed, he should not make
> > such an assumption. :-)
> 
> I meant 'he' shouldn't make assumptions about how many reserved regions
> should exist in the e820 based on the exposed devices. Jan has a concern
> that exposing more reserved regions in the e820 than necessary is not a
> good thing. I'm trying to convince him it should be fine. :-)
> 
> > 
> > Again -- I think that the only place "rmrr-host" and "rmrr-sel" is
> > important is at the very top level -- in xl, and possibly at a high
> > level in libxl.  By the time things reach libxc and hvmloader, they
> > should simply be told, "These are the RMRRs for this domain", and they
> > should avoid conflicts and report those in the e820 map.
> > 
> 
> Having libxl centrally manage RMRRs at a high level is a good idea,
> which however needs help from you and Ian on the detailed tasks in
> libxl to achieve such a goal. We're not toolstack experts (especially
> Tiejun), so definitely more suggestions in this area are welcome. :-)
> 
> Then I hope you now understand our discussion regarding libxl/Xen/
> hvmloader, based on the fact that some conflicts may not be avoided.
> That's the major open question in the original discussion with Jan. I'd
> like to give an example of the flow here, per Jan's suggestion, starting
> from the domain builder after the reserved regions have been specified
> by high-level libxl.
> 
> Let's take a synthetic platform w/ two devices, each reported
> with one RMRR reserved region:
> 	(D1): [0xe0000, 0xeffff] in <1MB area
> 	(D2): [0xa0000000, 0xa37fffff] in ~2.75G area
> 
> The guest is configured with 4G memory and assigned D2.
> Due to libxl policy (say for migration and hotplug), in total 3
> ranges are reported:
> 	(hotplug): [0xe0000, 0xeffff] in <1MB area on this node
> 	(migration): [0x40000000, 0x40003fff] in ~1G area on another node
> 	(static-assign): [0xa0000000, 0xa37fffff] in ~2.75G area on this node
> 
> Let's use xenstore to save this information (assumed accessible to both
> the domain builder and hvmloader?)
> 
> STEP-1. domain builder
> 
> say the default layout w/o reserved regions would be:
> 	lowmem: 	[0, 0xbfffffff]
> 	mmio hole: 	[0xc0000000, 0xffffffff]
> 	highmem:	[0x100000000, 0x140000000]
> 
> The domain builder then queries the reserved regions from xenstore
> and tries to avoid conflicts.

Perhaps an easier way of doing this is to use the existing
mechanism we have - that is, XENMEM_memory_map (which
BTW e820_host uses). If all of this is done in libxl (which
already does this based on the host E820, though it can
be modified to query the hypervisor for other 'reserved
regions') and hvmloader is modified to use XENMEM_memory_map
and base its E820 on that (and also QEMU-xen), then we solve
this problem and also http://bugs.xenproject.org/xen/bug/28?

(lots of handwaving).
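
A rough sketch of what the libxl side could look like with the existing
libxc wrappers (untested; E820MAX is assumed, all filtering is elided,
and struct e820entry is taken from libxc's e820 definitions):

	#include <xenctrl.h>

	#define E820MAX 128   /* assumption */

	static int push_guest_e820(xc_interface *xch, uint32_t domid)
	{
	    struct e820entry map[E820MAX];
	    int nr = xc_get_machine_memory_map(xch, map, E820MAX);

	    if ( nr < 0 )
	        return nr;

	    /* ... clamp RAM entries to the guest's allocation and keep the
	     * host's reserved (RMRR) ranges as E820_RESERVED here ... */

	    return xc_domain_set_memory_map(xch, domid, map, nr);
	}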
> 
> For [0xa0000000, 0xa37fffff], it can be avoided by reducing
> lowmem to 0xa0000000 and increasing highmem:
> 	lowmem: 	[0, 0x9fffffff]
> 	mmio hole: 	[0xa0000000, 0xffffffff]
> 	highmem:	[0x100000000, 0x160000000]
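>
> a minimal sketch of that adjustment (illustrative only, hypothetical
> variable names):
>
> 	/* the RMRR base falls inside lowmem: end lowmem at the RMRR
> 	 * base and move the displaced RAM above 4GB instead */
> 	if ( rmrr_base < lowmem_end )
> 	{
> 	    uint64_t displaced = lowmem_end - rmrr_base;
> 	    lowmem_end   = rmrr_base;    /* 0xc0000000 -> 0xa0000000 */
> 	    highmem_end += displaced;    /* 0x140000000 -> 0x160000000 */
> 	}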
> 
> 
> For [0x40000000, 0x40003fff], leave it as a conflict, since either
> reducing lowmem to 1G is not nice to a guest which doesn't use
> highmem, or we have to break lowmem into two chunks, so more
> structural changes would be required.
> 
> For [0xe0000, 0xeffff], leave it as a conflict (w/ guest BIOS)
> 
> w/ the libxl centrally-managed mode, the domain builder doesn't know
> whether a conflict will lead to an immediate error or not, so the
> best policy here is to throw a warning and then move forward.
> Conflicts will be caught in later steps when a region is actually
> used.
> 
> STEP-2. static device assignment
> 
> after the domain builder, libxl will request the Xen hypervisor to
> complete the actual device assignment. Because D2 is statically assigned
> to this guest, Xen will set up an identity mapping for [0xa0000000,
> 0xa37fffff] with conflict detection in gfn space. Since the domain builder
> has made a hole for this region, there'll be no conflict and the device
> will be assigned to the guest successfully.
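>
> conceptually something like this in Xen (hypothetical helper names;
> real code needs locking and rollback):
>
> 	/* identity-map the RMRR, failing on any gfn already in use */
> 	static int map_rmrr_identity(struct domain *d,
> 	                             unsigned long base, unsigned long end)
> 	{
> 	    for ( unsigned long gfn = PFN_DOWN(base);
> 	          gfn <= PFN_DOWN(end); gfn++ )
> 	    {
> 	        if ( gfn_is_populated(d, gfn) )   /* conflict in gfn space */
> 	            return -EBUSY;                /* fail the assignment   */
> 	        map_identity_gfn(d, gfn);         /* gfn -> mfn, 1:1       */
> 	    }
> 	    return 0;
> 	}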
> 
> STEP-3. hvmloader boot
> 
> hvmloader also needs to query the reserved regions (still through
> xenstore?) for two reasons:
> 	- mark all reported reserved regions in the guest e820
> 	- make holes to avoid conflicts in dynamic allocations (e.g. PCI
> BARs, ACPI opregion, etc.)
> 
> hvmloader can avoid making holes in guest RAM again (even if there
> are potential conflicts w/ guest RAM, they would be acceptable, otherwise
> libxl would have failed the boot before reaching here). So hvmloader will
> just add a new reserved e820 entry and make a hole for [0xa0000000,
> 0xa37fffff] in this example, which doesn't conflict with guest RAM.
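>
> i.e. roughly what hvmloader would do (illustrative fragment, assuming
> e820[] and nr are in scope during its E820 construction):
>
> 	/* mark the RMRR reserved in the guest E820 */
> 	e820[nr].addr = 0xa0000000ULL;
> 	e820[nr].size = 0x03800000ULL;    /* covers up to 0xa37fffff */
> 	e820[nr].type = E820_RESERVED;
> 	nr++;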
> 
> STEP-4. D1 hotplug
> 
> After the guest has booted, the user decides to hotplug D1, so libxl
> will raise another device assignment request to the Xen hypervisor. At
> this point, a conflict is detected on [0xe0000, 0xeffff] and a
> failure is returned by default. An override is provided in Xen to allow
> warning and returning success instead, and the user understands doing so
> implies an insecure environment.
> 
> Outside of this example, hotplug will succeed if there is no conflict
> on the RMRRs of the hotplugged device.
> 
> STEP-5. migration
> 
> after the guest has booted, the user decides to migrate the guest
> to another node, where it will be assigned a new device w/ a
> [0x40000000, 0x40003fff] reserved region. After the guest is migrated,
> Xen on the new node will detect the conflict when the new device is
> hotplugged, and failure (or, with the warn override, success) will be
> returned accordingly.
> 
> Outside of this example, hotplug will succeed if there is no conflict
> on the RMRRs of the hotplugged device.
> 
> ---
> 
> After working through this long thread, I actually like this high-level
> libxl management idea, which avoids the complexity of the domain builder/
> hvmloader having to understand the device/RMRR association, and then
> applies a different policy according to whether a region is for static
> assignment, hotplug, or some other purpose. :-)
> 
> Thanks
> Kevin
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: (v2) Design proposal for RMRR fix
  2015-01-13 16:06                                           ` Jan Beulich
@ 2015-01-14  6:52                                             ` Tian, Kevin
  2015-01-14 12:14                                               ` Ian Campbell
  0 siblings, 1 reply; 139+ messages in thread
From: Tian, Kevin @ 2015-01-14  6:52 UTC (permalink / raw)
  To: Jan Beulich, George Dunlap
  Cc: wei.liu2, ian.campbell, stefano.stabellini, tim, ian.jackson,
	xen-devel, Zhang, Yang Z, Chen, Tiejun

> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Wednesday, January 14, 2015 12:06 AM
> 
> >>> On 13.01.15 at 17:00, <george.dunlap@eu.citrix.com> wrote:
> > Another option I was thinking about: Before assigning a device to a
> > guest, you have to unplug the device and assign it to pci-back (e.g.,
> > with xl pci-assignable-add).  In addition to something like rmmr=host,
> > we could add rmrr=assignable, which would add all of the RMRRs of all
> > devices currently listed as "assignable".  The idea would then be that
> > you first make all your devices assignable, then just start your guests,
> > and everything you've made assignable will be able to be assigned.
> 
> Nice idea indeed, but I'm not sure about its practicability: It may
> not be desirable to make all devices eventually to be handed to a
> guest prior to starting any of the guests it may get handed to. In
> particular there may be reasons why the host needs the device
> while (or until after) creating the guests.
> 

and I'm not sure whether there's enough knowledge to judge whether
a device is assignable, since potential conflicts may be detected only
when the guest is launched.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: (v2) Design proposal for RMRR fix
  2015-01-13 15:58                                           ` George Dunlap
@ 2015-01-14  8:06                                             ` Tian, Kevin
  2015-01-14  9:00                                               ` Jan Beulich
                                                                 ` (2 more replies)
  0 siblings, 3 replies; 139+ messages in thread
From: Tian, Kevin @ 2015-01-14  8:06 UTC (permalink / raw)
  To: George Dunlap, Jan Beulich
  Cc: wei.liu2, ian.campbell, stefano.stabellini, tim, ian.jackson,
	xen-devel, Zhang, Yang Z, Chen, Tiejun

> From: George Dunlap [mailto:george.dunlap@eu.citrix.com]
> Sent: Tuesday, January 13, 2015 11:59 PM
> 
> On 01/13/2015 03:52 PM, Jan Beulich wrote:
> >>>> On 13.01.15 at 13:03, <kevin.tian@intel.com> wrote:
> >>>  From: Jan Beulich [mailto:JBeulich@suse.com]
> >>> Sent: Tuesday, January 13, 2015 7:56 PM
> >>>>>> On 13.01.15 at 12:03, <kevin.tian@intel.com> wrote:
> >>>> 	lowmem: 	[0, 0x9fffffff]
> >>>> 	mmio hole: 	[0xa0000000, 0xffffffff]
> >>>> 	highmem:	[0x100000000, 0x160000000]
> >>>>
> >>>>
> >>>> For [0x40000000, 0x40003fff], leave it as a conflict since either
> >>>> reducing lowmem to 1G is not nice to guest which doesn't use
> >>>> highmem or we have to break lowmem into two trunks so more
> >>>> structure changes are required.
> >>>
> >>> This makes no sense - if such an area was explicitly requested to
> >>> be reserved, leaving it as a conflict is not an option.
> >>
> >> explicitly requested by libxl but leaving it as a conflict in domain
> >> builder is just fine. later steps will actually catch conflicts when
> >> relevant regions are actually used (e.g. in static assignment, in
> >> hotplug, or in migration).
> >
> > But why do you think xl requested the region to be reserved?
> > Presumably because the guest config file said so. And if the
> > config file said so, there's no alternative to punching a hole
> > there - failing device assignment later on is what the guest
> > config file setting was added to avoid.
> 
> Yes -- the general principle should be: if the admin has asked for X,
> and X cannot be done, then the entire operation should fail, so that the
> admin can either make it possible for X to happen, or decide to do Y
> instead.  Automatically doing Z when the admin has explicitly asked for
> X is poor interface design.
> 
> That's why libxl will destroy the domain if the pci_add fails: if you've
> asked for device A to be assigned (X), and it can't be assigned for
> whatever reason, the domain creation should fail and the admin should be
> told why, so he can ask for what he wants instead.
> 
> In the case of RMRRs, if the admin has asked for an RMRR to be made, and
> it can't be made for whatever reason, the guest creation should fail.
> For statically assigned devices this will fall out naturally from the
> device assignment failing; but for RMRRs corresponding to hot-plug
> regions, we'll have to fail somewhere else -- preferably in libxc
> rather than hvmloader?
> 
> We may want to make a special case for RMRRs that overlap the guest
> BIOS, I guess.

Apparently I didn't get support for the simplified model which only warns
about conflicts in the domain builder while letting the later assignment
actually fail. As you pointed out here, it violates the general principle.
That's fine.

So I'll withdraw my proposal and discuss the details along the lines of
your idea.

Now assume high-level libxl queries all host RMRRs from the Xen hypervisor,
and provides one option for rmrr_host (reserving all host RMRRs in
preparation for hotplug) and another option allowing additional regions to
be specified, which may come from another host due to migration (for
statically assigned devices we don't need a new option). High-level libxl
will decide on a set of reserved regions according to the user's
configuration or whatever policies apply, and pass that information to the
domain builder.

So the domain builder doesn't know the association between devices and the
specified regions (for migration there won't be such an association). It
just tries to detect conflicts and make holes to avoid them, and if it
can't, it fails the whole domain creation, per your suggestion.

We discussed earlier that there are two reasons why some conflicts may not
be avoidable:
	- RMRRs conflicting with the guest BIOS in the <1MB area, as an
example of a hard conflict
	- RMRRs low enough in lowmem that avoiding them would either break
lowmem or make lowmem too small, impacting the guest (this one is just an
option being discussed)

Now the open question is whether we want to fail domain creation for all of
the above conflicts. The user may choose to bear with conflicts at his own
risk, or libxl may not want to fail on conflicts made in preparation for
future hotplug/migration. One possible option is to add a per-region flag
specifying whether to treat a conflict on that region as an error, when
libxl composes the list for the domain builder. This information would be
saved in a user-space database accessible to all components, and would also
trickle down to the Xen hypervisor when libxl requests the actual device
assignment.

With such a model telling whether a conflict is treated as an error for a
specific region, we get a unified policy across the different components
(libxc/Xen/hvmloader).
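
To make it concrete, the list handed to the domain builder could look
something like this (a minimal sketch; all names are made up for
illustration, and the two extern helpers are hypothetical):

	#include <stdint.h>
	#include <stdbool.h>
	#include <stdio.h>
	#include <inttypes.h>

	struct rsvd_region {
	    uint64_t start, end;   /* inclusive guest-physical range */
	    bool     force;        /* true: a conflict is a fatal error */
	};

	/* hypothetical domain-builder helpers, declared for the sketch only */
	extern bool overlaps_guest_ram(uint64_t start, uint64_t end);
	extern int  punch_hole(uint64_t start, uint64_t end);

	/* returns 0 on success, -1 if domain creation must be aborted */
	static int check_regions(const struct rsvd_region *r, unsigned int nr)
	{
	    for ( unsigned int i = 0; i < nr; i++ )
	    {
	        if ( !overlaps_guest_ram(r[i].start, r[i].end) )
	            continue;                   /* no conflict at all */
	        if ( punch_hole(r[i].start, r[i].end) == 0 )
	            continue;                   /* conflict avoided */
	        if ( r[i].force )
	            return -1;                  /* per-region policy: fail */
	        fprintf(stderr, "warning: region %#" PRIx64 "-%#" PRIx64
	                " conflicts, continuing\n", r[i].start, r[i].end);
	    }
	    return 0;
	}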

As for the other open question, having the domain builder do minimal work
to launch hvmloader, which then does the actual population: it's a good
idea, but I'm not sure how much work remains in that direction. It's a
bigger scope change than what's anticipated for this RMRR work, and the
details haven't been discussed or agreed on by the toolstack experts. So
I'd propose sticking to the original proposal, keeping this RMRR fix a
reasonably contained task (i.e. let the domain builder handle RAM conflicts
while hvmloader handles the other conflicts), instead of growing it
unboundedly into a more general cleanup. If that's the right direction to
go, we also hope other experts can jump in to help. :-)

Thanks
Kevin

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: (v2) Design proposal for RMRR fix
  2015-01-13 16:45                                     ` Konrad Rzeszutek Wilk
@ 2015-01-14  8:13                                       ` Tian, Kevin
  2015-01-14  9:02                                         ` Jan Beulich
  2015-01-14 20:42                                         ` Konrad Rzeszutek Wilk
  2015-01-14 12:47                                       ` George Dunlap
  1 sibling, 2 replies; 139+ messages in thread
From: Tian, Kevin @ 2015-01-14  8:13 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: wei.liu2, ian.campbell, stefano.stabellini, George Dunlap,
	ian.jackson, tim, xen-devel, Jan Beulich, Zhang, Yang Z, Chen,
	Tiejun

> From: Konrad Rzeszutek Wilk [mailto:konrad.wilk@oracle.com]
> Sent: Wednesday, January 14, 2015 12:46 AM
> 
> 
> Perhaps an easier way of this is to use the existing
> mechanism we have - that is the XENMEM_memory_map (which
> BTW e820_host uses). If all of this is done in the libxl (which
> already does this based on the host E820, thought it can
> be modified to query the hypervisor for other 'reserved
> regions') and hvmloader is modified to use XENMEM_memory_map
> and base its E820 on that (and also QEMU-xen), then we solve
> this problem and also the http://bugs.xenproject.org/xen/bug/28?
> 

I'm not familiar with that option, but a quick search suggests it's
only for PV guests?

And please note XENMEM_memory_map only includes RAM entries (and also
looks PV-only), while following the above intention what we really want
is the real e820_host w/ all entries filled.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: (v2) Design proposal for RMRR fix
  2015-01-14  8:06                                             ` Tian, Kevin
@ 2015-01-14  9:00                                               ` Jan Beulich
  2015-01-14  9:43                                                 ` Tian, Kevin
  2015-01-14 12:17                                               ` Ian Campbell
  2015-01-14 12:29                                               ` George Dunlap
  2 siblings, 1 reply; 139+ messages in thread
From: Jan Beulich @ 2015-01-14  9:00 UTC (permalink / raw)
  To: Kevin Tian
  Cc: wei.liu2, ian.campbell, stefano.stabellini, George Dunlap, tim,
	ian.jackson, xen-devel, Yang Z Zhang, Tiejun Chen

>>> On 14.01.15 at 09:06, <kevin.tian@intel.com> wrote:
> Now the open is whether we want to fail domain creation for all of above
> conflicts. user may choose to bear with conflicts at his own disposal, or
> libxl doesn't want to fail conflicts as preparation for future 
> hotplug/migration.
> One possible option is to add a per-region flag to specify whether treating
> relevant conflict as an error, when libxl composes the list to domain 
> builder. 
> and this information will be saved in a user space database accessible to
> all components and also waterfall to Xen hypervisor when libxl requests 
> actual device assignment.

That's certainly a possibility, albeit saying (in the guest config) that a
region is to be reserved only when possible is about the same as not
stating that region at all. If anything, I'd see the rmrr-host value being
a tristate (don't, try, and force) to that effect.
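
I.e. something like (names purely illustrative):

	enum rmrr_host_policy {
	    RMRR_HOST_DONT,    /* don't reserve host RMRRs at all */
	    RMRR_HOST_TRY,     /* reserve where possible, warn on conflict */
	    RMRR_HOST_FORCE,   /* fail domain creation on any conflict */
	};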

Jan

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: (v2) Design proposal for RMRR fix
  2015-01-14  8:13                                       ` Tian, Kevin
@ 2015-01-14  9:02                                         ` Jan Beulich
  2015-01-14  9:44                                           ` Tian, Kevin
  2015-01-14 20:42                                         ` Konrad Rzeszutek Wilk
  1 sibling, 1 reply; 139+ messages in thread
From: Jan Beulich @ 2015-01-14  9:02 UTC (permalink / raw)
  To: Kevin Tian
  Cc: wei.liu2, ian.campbell, stefano.stabellini, George Dunlap, tim,
	ian.jackson, xen-devel, Yang Z Zhang, Tiejun Chen

>>> On 14.01.15 at 09:13, <kevin.tian@intel.com> wrote:
>>  From: Konrad Rzeszutek Wilk [mailto:konrad.wilk@oracle.com]
>> Sent: Wednesday, January 14, 2015 12:46 AM
>> 
>> Perhaps an easier way of this is to use the existing
>> mechanism we have - that is the XENMEM_memory_map (which
>> BTW e820_host uses). If all of this is done in the libxl (which
>> already does this based on the host E820, thought it can
>> be modified to query the hypervisor for other 'reserved
>> regions') and hvmloader is modified to use XENMEM_memory_map
>> and base its E820 on that (and also QEMU-xen), then we solve
>> this problem and also the http://bugs.xenproject.org/xen/bug/28? 
> 
> I'm not familiar with that option, but a quick search suggests it's
> only for PV guests?
> 
> and please note XENMEM_memory_map only includes RAM entries (and also
> looks PV-only), while following the above intention what we really want
> is the real e820_host w/ all entries filled.

But from the very beginning when these were proposed it was
said that they would need extending from being PV/PVH only to
also be usable for HVM. Such a change would be minimally
intrusive afaict as at least the latter already is allowed for PVH
too.

Jan

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: (v2) Design proposal for RMRR fix
  2015-01-14  9:00                                               ` Jan Beulich
@ 2015-01-14  9:43                                                 ` Tian, Kevin
  2015-01-14 10:24                                                   ` Jan Beulich
                                                                     ` (2 more replies)
  0 siblings, 3 replies; 139+ messages in thread
From: Tian, Kevin @ 2015-01-14  9:43 UTC (permalink / raw)
  To: Jan Beulich
  Cc: wei.liu2, ian.campbell, stefano.stabellini, George Dunlap, tim,
	ian.jackson, xen-devel, Zhang, Yang Z, Chen, Tiejun

> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Wednesday, January 14, 2015 5:00 PM
> 
> >>> On 14.01.15 at 09:06, <kevin.tian@intel.com> wrote:
> > Now the open is whether we want to fail domain creation for all of above
> > conflicts. user may choose to bear with conflicts at his own disposal, or
> > libxl doesn't want to fail conflicts as preparation for future
> > hotplug/migration.
> > One possible option is to add a per-region flag to specify whether treating
> > relevant conflict as an error, when libxl composes the list to domain
> > builder.
> > and this information will be saved in a user space database accessible to
> > all components and also waterfall to Xen hypervisor when libxl requests
> > actual device assignment.
> 
> That's certainly a possibility, albeit saying (in the guest config) that
> a region to be reserved only when possible is about the same as
> not stating that region. If at all, I'd see the rmrr-host value be a
> tristate (don't, try, and force) to that effect.
> 

how about something like the below, with a bi-state?

for a statically assigned device:
	pci = [ "00:02.0, 0/1" ]
where '0/1' represents try/force (or use 'try/force', or have a meaningful
attribute like rmrr_check=try/force?)

for other usages like hotplug/migration:
	reserved_regions = [ 'host, 0/1', 'start, end, 0/1', 'start, end, 0/1', ...]
If 'host' is specified, it implies rmrr_host; besides that, the user can
specify explicit ranges according to his detailed requirements.

Based on the above configuration interface, libxl can construct the
necessary reserved regions with individual try/force policies.
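
e.g. a concrete guest config might then look like (hypothetical syntax,
nothing of this is implemented yet):

	pci = [ "00:02.0, rmrr_check=force" ]
	reserved_regions = [ 'host, try', '0x40000000, 0x40003fff, force' ]

i.e. reserve all host RMRRs on a best-effort basis, but treat a conflict
on the explicitly listed migration range as fatal.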

Thanks
Kevin

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: (v2) Design proposal for RMRR fix
  2015-01-14  9:02                                         ` Jan Beulich
@ 2015-01-14  9:44                                           ` Tian, Kevin
  2015-01-14 10:25                                             ` Jan Beulich
  0 siblings, 1 reply; 139+ messages in thread
From: Tian, Kevin @ 2015-01-14  9:44 UTC (permalink / raw)
  To: Jan Beulich
  Cc: wei.liu2, ian.campbell, stefano.stabellini, George Dunlap, tim,
	ian.jackson, xen-devel, Zhang, Yang Z, Chen, Tiejun

> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Wednesday, January 14, 2015 5:03 PM
> 
> >>> On 14.01.15 at 09:13, <kevin.tian@intel.com> wrote:
> >>  From: Konrad Rzeszutek Wilk [mailto:konrad.wilk@oracle.com]
> >> Sent: Wednesday, January 14, 2015 12:46 AM
> >>
> >> Perhaps an easier way of this is to use the existing
> >> mechanism we have - that is the XENMEM_memory_map (which
> >> BTW e820_host uses). If all of this is done in the libxl (which
> >> already does this based on the host E820, thought it can
> >> be modified to query the hypervisor for other 'reserved
> >> regions') and hvmloader is modified to use XENMEM_memory_map
> >> and base its E820 on that (and also QEMU-xen), then we solve
> >> this problem and also the http://bugs.xenproject.org/xen/bug/28?
> >
> > I'm not familiar with that option, but a quick search suggests it's
> > only for PV guests?
> >
> > and please note XENMEM_memory_map only includes RAM entries (and also
> > looks PV-only), while following the above intention what we really want
> > is the real e820_host w/ all entries filled.
> 
> But from the very beginning when these were proposed it was
> said that they would need extending from being PV/PVH only to
> also be usable for HVM. Such a change would be minimally
> intrusive afaict as at least the latter already is allowed for PVH
> too.
> 

if we make the assumption of not breaking the lowmem/highmem structure
in the domain builder, then this change can be avoided.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: (v2) Design proposal for RMRR fix
  2015-01-14  9:43                                                 ` Tian, Kevin
@ 2015-01-14 10:24                                                   ` Jan Beulich
  2015-01-14 12:01                                                     ` George Dunlap
                                                                       ` (2 more replies)
  2015-01-14 12:16                                                   ` George Dunlap
  2015-01-14 12:21                                                   ` Ian Campbell
  2 siblings, 3 replies; 139+ messages in thread
From: Jan Beulich @ 2015-01-14 10:24 UTC (permalink / raw)
  To: Kevin Tian
  Cc: wei.liu2, ian.campbell, stefano.stabellini, George Dunlap, tim,
	ian.jackson, xen-devel, Yang Z Zhang, Tiejun Chen

>>> On 14.01.15 at 10:43, <kevin.tian@intel.com> wrote:
>>  From: Jan Beulich [mailto:JBeulich@suse.com]
>> Sent: Wednesday, January 14, 2015 5:00 PM
>> 
>> >>> On 14.01.15 at 09:06, <kevin.tian@intel.com> wrote:
>> > Now the open is whether we want to fail domain creation for all of above
>> > conflicts. user may choose to bear with conflicts at his own disposal, or
>> > libxl doesn't want to fail conflicts as preparation for future
>> > hotplug/migration.
>> > One possible option is to add a per-region flag to specify whether treating
>> > relevant conflict as an error, when libxl composes the list to domain
>> > builder.
>> > and this information will be saved in a user space database accessible to
>> > all components and also waterfall to Xen hypervisor when libxl requests
>> > actual device assignment.
>> 
>> That's certainly a possibility, albeit saying (in the guest config) that
>> a region to be reserved only when possible is about the same as
>> not stating that region. If at all, I'd see the rmrr-host value be a
>> tristate (don't, try, and force) to that effect.
>> 
> 
> how about something like below with bi-state?
> 
> for statically assigned device:
> 	pci = [ "00:02.0, 0/1" ]
> where '0/1' represents try/force (or use 'try/force', or have a meaningful 
> attribute like rmrr_check=try/force?)

As said many times before, for statically assigned devices such a flag
makes no sense.

> for other usages like hotplug/migration:
> 	reserved_regions = [ 'host, 0/1', 'start, end, 0/1', 'start, end, 0/1', ...]
> If 'host' is specified, it implies rmrr_host; besides that, the user can
> specify explicit ranges according to his detailed requirements.

For host the flag makes sense, but for the explicitly specified regions
- as said before - I don't think it does.

Jan

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: (v2) Design proposal for RMRR fix
  2015-01-14  9:44                                           ` Tian, Kevin
@ 2015-01-14 10:25                                             ` Jan Beulich
  0 siblings, 0 replies; 139+ messages in thread
From: Jan Beulich @ 2015-01-14 10:25 UTC (permalink / raw)
  To: Kevin Tian
  Cc: wei.liu2, ian.campbell, stefano.stabellini, George Dunlap, tim,
	ian.jackson, xen-devel, Yang Z Zhang, Tiejun Chen

>>> On 14.01.15 at 10:44, <kevin.tian@intel.com> wrote:
>>  From: Jan Beulich [mailto:JBeulich@suse.com]
>> Sent: Wednesday, January 14, 2015 5:03 PM
>> 
>> >>> On 14.01.15 at 09:13, <kevin.tian@intel.com> wrote:
>> >>  From: Konrad Rzeszutek Wilk [mailto:konrad.wilk@oracle.com]
>> >> Sent: Wednesday, January 14, 2015 12:46 AM
>> >>
>> >> Perhaps an easier way of this is to use the existing
>> >> mechanism we have - that is the XENMEM_memory_map (which
>> >> BTW e820_host uses). If all of this is done in the libxl (which
>> >> already does this based on the host E820, thought it can
>> >> be modified to query the hypervisor for other 'reserved
>> >> regions') and hvmloader is modified to use XENMEM_memory_map
>> >> and base its E820 on that (and also QEMU-xen), then we solve
>> >> this problem and also the http://bugs.xenproject.org/xen/bug/28? 
>> >
>> > I'm not familiar with that option, but a quick search suggests it's
>> > only for PV guests?
>> >
>> > and please note XENMEM_memory_map only includes RAM entries (and also
>> > looks PV-only), while following the above intention what we really want
>> > is the real e820_host w/ all entries filled.
>> 
>> But from the very beginning when these were proposed it was
>> said that they would need extending from being PV/PVH only to
>> also be usable for HVM. Such a change would be minimally
>> intrusive afaict as at least the latter already is allowed for PVH
>> too.
> 
> if we make assumption on not breaking lowmem/highmem structure
> in domain builder, then this change can be avoided.

Of course. But we're currently evaluating options...

Jan

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: (v2) Design proposal for RMRR fix
  2015-01-14 10:24                                                   ` Jan Beulich
@ 2015-01-14 12:01                                                     ` George Dunlap
  2015-01-14 12:11                                                       ` Tian, Kevin
  2015-01-14 14:32                                                       ` Jan Beulich
  2015-01-14 12:03                                                     ` Tian, Kevin
  2015-01-14 12:12                                                     ` George Dunlap
  2 siblings, 2 replies; 139+ messages in thread
From: George Dunlap @ 2015-01-14 12:01 UTC (permalink / raw)
  To: Jan Beulich, Kevin Tian
  Cc: wei.liu2, ian.campbell, stefano.stabellini, tim, ian.jackson,
	xen-devel, Yang Z Zhang, Tiejun Chen

On 01/14/2015 10:24 AM, Jan Beulich wrote:
>>>> On 14.01.15 at 10:43, <kevin.tian@intel.com> wrote:
>>>  From: Jan Beulich [mailto:JBeulich@suse.com]
>>> Sent: Wednesday, January 14, 2015 5:00 PM
>>>
>>>>>> On 14.01.15 at 09:06, <kevin.tian@intel.com> wrote:
>>>> Now the open is whether we want to fail domain creation for all of above
>>>> conflicts. user may choose to bear with conflicts at his own disposal, or
>>>> libxl doesn't want to fail conflicts as preparation for future
>>>> hotplug/migration.
>>>> One possible option is to add a per-region flag to specify whether treating
>>>> relevant conflict as an error, when libxl composes the list to domain
>>>> builder.
>>>> and this information will be saved in a user space database accessible to
>>>> all components and also waterfall to Xen hypervisor when libxl requests
>>>> actual device assignment.
>>>
>>> That's certainly a possibility, albeit saying (in the guest config) that
>>> a region to be reserved only when possible is about the same as
>>> not stating that region. If at all, I'd see the rmrr-host value be a
>>> tristate (don't, try, and force) to that effect.
>>>
>>
>> how about something like below with bi-state?
>>
>> for statically assigned device:
>> 	pci = [ "00:02.0, 0/1" ]
>> where '0/1' represents try/force (or use 'try/force', or have a meaningful 
>> attribute like rmrr_check=try/force?)
> 
> As said many times before, for statically assigned devices such a flag
> makes no sense.
> 
>> for other usages like hotplug/migration:
>> 	reserved_regions = [ 'host, 0/1', 'start, end, 0/1', 'start, end, 0/1', ...]
>> If 'host' is specified, it implies rmrr_host; besides that, the user can
>> specify explicit ranges according to his detailed requirements.
> 
> For host the flag makes sense, but for the explicitly specified regions
> - as said before - I don't think it does.

You don't think there are any circumstances where an admin should be
allowed to "shoot himself in the foot" by assigning a device whose RMRRs
he knows conflict -- perhaps because he "knows" that the RMRRs won't
actually be used?

I thought I heard someone say that many devices will only use RMRRs for
compatibility with older OSes or during boot; in which case, there may
be devices which you can safely assign to newer OSes / hot-plug after
the guest has booted even without reserving the RMRR.  If such devices
exist, then the admin should be able to assign those, shouldn't they?

Making it "rmrr=force" by default, but allowing an admin to specify
"rmrr=try", makes sense to me.  It does introduce an extra layer of
complication, so I wouldn't push for it; but if Kevin / Intel wants to
do the work, I think it's a good thing.

 -George

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: (v2) Design proposal for RMRR fix
  2015-01-14 10:24                                                   ` Jan Beulich
  2015-01-14 12:01                                                     ` George Dunlap
@ 2015-01-14 12:03                                                     ` Tian, Kevin
  2015-01-14 14:34                                                       ` Jan Beulich
  2015-01-14 12:12                                                     ` George Dunlap
  2 siblings, 1 reply; 139+ messages in thread
From: Tian, Kevin @ 2015-01-14 12:03 UTC (permalink / raw)
  To: Jan Beulich
  Cc: wei.liu2, ian.campbell, stefano.stabellini, George Dunlap, tim,
	ian.jackson, xen-devel, Zhang, Yang Z, Chen, Tiejun

> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Wednesday, January 14, 2015 6:24 PM
> 
> >>> On 14.01.15 at 10:43, <kevin.tian@intel.com> wrote:
> >>  From: Jan Beulich [mailto:JBeulich@suse.com]
> >> Sent: Wednesday, January 14, 2015 5:00 PM
> >>
> >> >>> On 14.01.15 at 09:06, <kevin.tian@intel.com> wrote:
> >> > Now the open is whether we want to fail domain creation for all of above
> >> > conflicts. user may choose to bear with conflicts at his own disposal, or
> >> > libxl doesn't want to fail conflicts as preparation for future
> >> > hotplug/migration.
> >> > One possible option is to add a per-region flag to specify whether treating
> >> > relevant conflict as an error, when libxl composes the list to domain
> >> > builder.
> >> > and this information will be saved in a user space database accessible to
> >> > all components and also waterfall to Xen hypervisor when libxl requests
> >> > actual device assignment.
> >>
> >> That's certainly a possibility, albeit saying (in the guest config) that
> >> a region to be reserved only when possible is about the same as
> >> not stating that region. If at all, I'd see the rmrr-host value be a
> >> tristate (don't, try, and force) to that effect.
> >>
> >
> > how about something like below with bi-state?
> >
> > for statically assigned device:
> > 	pci = [ "00:02.0, 0/1" ]
> > where '0/1' represents try/force (or use 'try/force', or have a meaningful
> > attribute like rmrr_check=try/force?)
> 
> As said many times before, for statically assigned devices such a flag
> makes no sense.

why? It seems we agreed in the very first place to allow the admin to
provide a per-device override for this (e.g. like your argument that the
user may get confirmation from the device vendor that, though a conflict
exists, he may opt to ignore it...). Alternatively this option replaces
the hackish USB code in the Xen hypervisor but still preserves the opt-out.

> 
> > for other usages like hotplug/migration:
> > 	reserved_regions = [ 'host, 0/1', 'start, end, 0/1', 'start, end, 0/1', ...]
> > If 'host' is specified, it implies rmrr_host; besides that, the user can
> > specify explicit ranges according to his detailed requirements.
> 
> For host the flag makes sense, but for the explicitly specified regions
> - as said before - I don't think it does.
> 

I may have missed your suggestion on this. What do you think is the right
way to handle explicitly specified regions?

Thanks
Kevin

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: (v2) Design proposal for RMRR fix
  2015-01-14 12:01                                                     ` George Dunlap
@ 2015-01-14 12:11                                                       ` Tian, Kevin
  2015-01-14 14:32                                                       ` Jan Beulich
  1 sibling, 0 replies; 139+ messages in thread
From: Tian, Kevin @ 2015-01-14 12:11 UTC (permalink / raw)
  To: George Dunlap, Jan Beulich
  Cc: wei.liu2, ian.campbell, stefano.stabellini, tim, ian.jackson,
	xen-devel, Zhang, Yang Z, Chen, Tiejun

> From: George Dunlap [mailto:george.dunlap@eu.citrix.com]
> Sent: Wednesday, January 14, 2015 8:02 PM
> 
> On 01/14/2015 10:24 AM, Jan Beulich wrote:
> >>>> On 14.01.15 at 10:43, <kevin.tian@intel.com> wrote:
> >>>  From: Jan Beulich [mailto:JBeulich@suse.com]
> >>> Sent: Wednesday, January 14, 2015 5:00 PM
> >>>
> >>>>>> On 14.01.15 at 09:06, <kevin.tian@intel.com> wrote:
> >>>> Now the open is whether we want to fail domain creation for all of above
> >>>> conflicts. user may choose to bear with conflicts at his own disposal, or
> >>>> libxl doesn't want to fail conflicts as preparation for future
> >>>> hotplug/migration.
> >>>> One possible option is to add a per-region flag to specify whether
> treating
> >>>> relevant conflict as an error, when libxl composes the list to domain
> >>>> builder.
> >>>> and this information will be saved in a user space database accessible to
> >>>> all components and also waterfall to Xen hypervisor when libxl requests
> >>>> actual device assignment.
> >>>
> >>> That's certainly a possibility, albeit saying (in the guest config) that
> >>> a region to be reserved only when possible is about the same as
> >>> not stating that region. If at all, I'd see the rmrr-host value be a
> >>> tristate (don't, try, and force) to that effect.
> >>>
> >>
> >> how about something like below with bi-state?
> >>
> >> for statically assigned device:
> >> 	pci = [ "00:02.0, 0/1" ]
> >> where '0/1' represents try/force (or use 'try/force', or have a meaningful
> >> attribute like rmrr_check=try/force?)
> >
> > As said many times before, for statically assigned devices such a flag
> > makes no sense.
> >
> >> for other usages like hotplug/migration:
> >> 	reserved_regions = [ 'host, 0/1', 'start, end, 0/1', 'start, end, 0/1', ...]
> >> If 'host' is specified, it implies rmrr_host; besides that, the user can
> >> specify explicit ranges according to his detailed requirements.
> >
> > For host the flag makes sense, but for the explicitly specified regions
> > - as said before - I don't think it does.
> 
> You don't think there are any circumstances where an admin should be
> allowed to "shoot himself in the foot" by assigning a device which he
> knows the RMRRs conflict -- perhaps because he "knows" that the RMRRs
> won't actually be used?
> 
> I thought I heard someone say that many devices will only use RMRRs for
> compatibility with older OSes or during boot; in which case, there may
> be devices which you can safely assign to newer OSes / hot-plug after
> the guest has booted even without reserving the RMRR.  If such devices
> exist, then the admin should be able to assign those, shouldn't they?
> 
> Making it "rmrr=force" by default, but allowing an admin to specify
> "rmrr=try", makes sense to me.  It does introduce an extra layer of
> complication, so I wouldn't push for it; but if Kevin / Intel wants to
> do the work, I think it's a good thing.
> 

We'd like to hear more opinions here, both from developers and users.
Possibly the Citrix guys have more input from the XenServer/XenClient
products, because they may have been assigning devices which have RMRRs
but which didn't expose an issue before this fix (e.g. a USB controller
in a laptop). Adding a strict policy now to treat all conflicts as errors
may cause some regression for those usages. I'm not sure how severe it
would be. If it's apparently not a problem for most people, then we can
follow the simple policy, which is easier. :-)

Thanks
Kevin

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: (v2) Design proposal for RMRR fix
  2015-01-14 10:24                                                   ` Jan Beulich
  2015-01-14 12:01                                                     ` George Dunlap
  2015-01-14 12:03                                                     ` Tian, Kevin
@ 2015-01-14 12:12                                                     ` George Dunlap
  2015-01-14 14:36                                                       ` Jan Beulich
  2 siblings, 1 reply; 139+ messages in thread
From: George Dunlap @ 2015-01-14 12:12 UTC (permalink / raw)
  To: Jan Beulich, Kevin Tian
  Cc: wei.liu2, ian.campbell, stefano.stabellini, tim, ian.jackson,
	xen-devel, Yang Z Zhang, Tiejun Chen

On 01/14/2015 10:24 AM, Jan Beulich wrote:
>> for other usages like hotplug/migration:
>> 	reserved_regions = [ 'host, 0/1', 'start, end, 0/1', 'start, end, 0/1', ...]
>> If 'host' is specified, it implies rmrr_host; besides that, the user can
>> specify explicit ranges according to his detailed requirements.
> 
> For host the flag makes sense, but for the explicitly specified regions
> - as said before - I don't think it does.

I thought we wanted to be able to specify regions that weren't on this
host, but were on another host, so that an admin could migrate to
another host and add devices there which might have RMRRs?

In any case, at some point there will be a place in libxl which simply
accepts a list of RMRRs; there's no real reason not to expose that.  It
may come in handy at some point in the future.

 -George

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: (v2) Design proposal for RMRR fix
  2015-01-14  6:52                                             ` Tian, Kevin
@ 2015-01-14 12:14                                               ` Ian Campbell
  2015-01-14 12:23                                                 ` George Dunlap
  0 siblings, 1 reply; 139+ messages in thread
From: Ian Campbell @ 2015-01-14 12:14 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: wei.liu2, stefano.stabellini, George Dunlap, tim, ian.jackson,
	xen-devel, Jan Beulich, Zhang, Yang Z, Chen, Tiejun

On Wed, 2015-01-14 at 06:52 +0000, Tian, Kevin wrote:
> > From: Jan Beulich [mailto:JBeulich@suse.com]
> > Sent: Wednesday, January 14, 2015 12:06 AM
> > 
> > >>> On 13.01.15 at 17:00, <george.dunlap@eu.citrix.com> wrote:
> > > Another option I was thinking about: Before assigning a device to a
> > > guest, you have to unplug the device and assign it to pci-back (e.g.,
> > > with xl pci-assignable-add).  In addition to something like rmrr=host,
> > > we could add rmrr=assignable, which would add all of the RMRRs of all
> > > devices currently listed as "assignable".  The idea would then be that
> > > you first make all your devices assignable, then just start your guests,
> > > and everything you've made assignable will be able to be assigned.
> > 
> > Nice idea indeed, but I'm not sure about its practicability: It may
> > not be desirable to make all devices eventually to be handed to a
> > guest prior to starting any of the guests it may get handed to. In
> > particular there may be reasons why the host needs the device
> > while (or until after) creating the guests.
> > 
> 
> and I'm not sure whether there's enough knowledge to judge whether 
> a device is assignable since potential conflicts may be detected only
> when the guest is launched.

I don't think George was intending to imply otherwise; assignable here
just means "bound to xen-pciback". There may be other reasons why the
device cannot be assigned in practice when you come to actually use it;
RMRR conflicts which may only be discovered when a guest is started
would be one such practical reason.

George's suggestion sounds to me like a nice shortcut configuration
which will benefit many users, even if not all of them. So long as more
fine-grained control is an option it seems like a Nice To Have type of
thing.

Ian.

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: (v2) Design proposal for RMRR fix
  2015-01-14  9:43                                                 ` Tian, Kevin
  2015-01-14 10:24                                                   ` Jan Beulich
@ 2015-01-14 12:16                                                   ` George Dunlap
  2015-01-14 14:39                                                     ` Jan Beulich
  2015-01-14 12:21                                                   ` Ian Campbell
  2 siblings, 1 reply; 139+ messages in thread
From: George Dunlap @ 2015-01-14 12:16 UTC (permalink / raw)
  To: Tian, Kevin, Jan Beulich
  Cc: wei.liu2, ian.campbell, stefano.stabellini, tim, ian.jackson,
	xen-devel, Zhang, Yang Z, Chen, Tiejun

On 01/14/2015 09:43 AM, Tian, Kevin wrote:
>> From: Jan Beulich [mailto:JBeulich@suse.com]
>> Sent: Wednesday, January 14, 2015 5:00 PM
>>
>>>>> On 14.01.15 at 09:06, <kevin.tian@intel.com> wrote:
>>> Now the open is whether we want to fail domain creation for all of above
>>> conflicts. user may choose to bear with conflicts at his own disposal, or
>>> libxl doesn't want to fail conflicts as preparation for future
>>> hotplug/migration.
>>> One possible option is to add a per-region flag to specify whether treating
>>> relevant conflict as an error, when libxl composes the list to domain
>>> builder.
>>> and this information will be saved in a user space database accessible to
>>> all components and also waterfall to Xen hypervisor when libxl requests
>>> actual device assignment.
>>
>> That's certainly a possibility, albeit saying (in the guest config) that
>> a region to be reserved only when possible is about the same as
>> not stating that region. If at all, I'd see the rmrr-host value be a
>> tristate (don't, try, and force) to that effect.
>>
> 
> how about something like below with bi-state?
> 
> for statically assigned device:
> 	pci = [ "00:02.0, 0/1" ]
> where '0/1' represents try/force (or use 'try/force', or have a meaningful 
> attribute like rmrr_check=try/force?)

I think the typical thing to do here would be to introduce a named
parameter.  pci already has things like:

  pci = [ "00:02.0,seize=1,msitranslate=0" ]

So you should just follow suit and make it something like, "rmrr=try".

> 
> for other usages like hotplug/migration:
> 	reserved_regions = [ 'host, 0/1', 'start, end, 0/1', 'start, end, 0/1', ...]
> If 'host' is specified, it implies rmrr_host; besides, the user can
> specify explicit ranges according to his detailed requirements.
> 
> based on above configuration interface, libxl can construct necessary
> reserve regions with individual try/force policies.

Same here; I'd do something like:

 rmrr = [ "0xe0000:0xeffff,check=try", "0xa000000:0xa0000fff" ]

Where here the first one would be allowed to conflict in the domain
builder; but the second would error out if it couldn't be made for some
reason.
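
As a strawman, the domain builder side of parsing such an entry could
look roughly like below (purely illustrative; none of this is actual
libxl code and all names are invented):

    /* Parse "<start>:<end>[,check=try|force]". */
    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    struct rmrr_entry {
        uint64_t start, end;
        int force;                    /* fail domain build on conflict */
    };

    static int parse_rmrr(const char *s, struct rmrr_entry *e)
    {
        char *p;

        e->force = 1;                 /* default to force */
        e->start = strtoull(s, &p, 0);
        if (*p++ != ':')
            return -1;
        e->end = strtoull(p, &p, 0);
        if (*p == '\0')
            return 0;
        if (!strcmp(p, ",check=try"))
            e->force = 0;
        else if (strcmp(p, ",check=force"))
            return -1;
        return 0;
    }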

 -George

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: (v2) Design proposal for RMRR fix
  2015-01-14  8:06                                             ` Tian, Kevin
  2015-01-14  9:00                                               ` Jan Beulich
@ 2015-01-14 12:17                                               ` Ian Campbell
  2015-01-14 15:07                                                 ` Jan Beulich
  2015-01-14 12:29                                               ` George Dunlap
  2 siblings, 1 reply; 139+ messages in thread
From: Ian Campbell @ 2015-01-14 12:17 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: wei.liu2, stefano.stabellini, George Dunlap, tim, ian.jackson,
	xen-devel, Jan Beulich, Zhang, Yang Z, Chen, Tiejun

On Wed, 2015-01-14 at 08:06 +0000, Tian, Kevin wrote:
> - RMRRs conflicting with guest BIOS in <1MB area, as an example of 
> hard conflicts

OOI what is the (estimated) probability of such an RMRR existing which
doesn't already conflict with the real host BIOS?

Host BIOSes are generally large compared to the guest BIOS, but with the
amount of decompression and relocation etc they do I don't know how much
of them generally remains in the <1MB region.

Ian.

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: (v2) Design proposal for RMRR fix
  2015-01-14  9:43                                                 ` Tian, Kevin
  2015-01-14 10:24                                                   ` Jan Beulich
  2015-01-14 12:16                                                   ` George Dunlap
@ 2015-01-14 12:21                                                   ` Ian Campbell
  2 siblings, 0 replies; 139+ messages in thread
From: Ian Campbell @ 2015-01-14 12:21 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: wei.liu2, stefano.stabellini, George Dunlap, tim, ian.jackson,
	xen-devel, Jan Beulich, Zhang, Yang Z, Chen, Tiejun

On Wed, 2015-01-14 at 09:43 +0000, Tian, Kevin wrote:
> > From: Jan Beulich [mailto:JBeulich@suse.com]
> > Sent: Wednesday, January 14, 2015 5:00 PM
> > 
> > >>> On 14.01.15 at 09:06, <kevin.tian@intel.com> wrote:
> > > Now the open is whether we want to fail domain creation for all of above
> > > conflicts. user may choose to bear with conflicts at his own disposal, or
> > > libxl doesn't want to fail conflicts as preparation for future
> > > hotplug/migration.
> > > One possible option is to add a per-region flag to specify whether treating
> > > relevant conflict as an error, when libxl composes the list to domain
> > > builder.
> > > and this information will be saved in a user space database accessible to
> > > all components and also waterfall to Xen hypervisor when libxl requests
> > > actual device assignment.
> > 
> > That's certainly a possibility, albeit saying (in the guest config) that
> > a region to be reserved only when possible is about the same as
> > not stating that region. If at all, I'd see the rmrr-host value be a
> > tristate (don't, try, and force) to that effect.
> > 
> 
> how about something like below with bi-state?
> 
> for statically assigned device:
> 	pci = [ "00:02.0, 0/1" ]
> where '0/1' represents try/force (or use 'try/force', or have a meaningful 
> attribute like rmrr_check=try/force?)

NB the pci syntax already supports key=value style options, so this
would have to use the more meaningful foo=try/force approach if you want
to go this route (and it's also clearly preferable from a UI pov
anyway).
> 
> for other usages like hotplug/migration:
> 	reserved_regions = [ 'host, 0/1', 'start, end, 0/1', 'start, end, 0/1', ...]
> If 'host' is specified, it implies rmrr_host; besides, the user can
> specify explicit ranges according to his detailed requirements.

Please avoid opaque "0" vs "1" values here too if you go this way.

Ian.

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: (v2) Design proposal for RMRR fix
  2015-01-14 12:14                                               ` Ian Campbell
@ 2015-01-14 12:23                                                 ` George Dunlap
  2015-01-15  8:12                                                   ` Tian, Kevin
  0 siblings, 1 reply; 139+ messages in thread
From: George Dunlap @ 2015-01-14 12:23 UTC (permalink / raw)
  To: Ian Campbell, Tian, Kevin
  Cc: wei.liu2, stefano.stabellini, tim, ian.jackson, xen-devel,
	Jan Beulich, Zhang, Yang Z, Chen, Tiejun

On 01/14/2015 12:14 PM, Ian Campbell wrote:
> On Wed, 2015-01-14 at 06:52 +0000, Tian, Kevin wrote:
>>> From: Jan Beulich [mailto:JBeulich@suse.com]
>>> Sent: Wednesday, January 14, 2015 12:06 AM
>>>
>>>>>> On 13.01.15 at 17:00, <george.dunlap@eu.citrix.com> wrote:
>>>> Another option I was thinking about: Before assigning a device to a
>>>> guest, you have to unplug the device and assign it to pci-back (e.g.,
>>>> with xl pci-assignable-add).  In addition to something like rmrr=host,
>>>> we could add rmrr=assignable, which would add all of the RMRRs of all
>>>> devices currently listed as "assignable".  The idea would then be that
>>>> you first make all your devices assignable, then just start your guests,
>>>> and everything you've made assignable will be able to be assigned.
>>>
>>> Nice idea indeed, but I'm not sure about its practicability: It may
>>> not be desirable to make all devices eventually to be handed to a
>>> guest prior to starting any of the guests it may get handed to. In
>>> particular there may be reasons why the host needs the device
>>> while (or until after) creating the guests.
>>>
>>
>> and I'm not sure whether there's enough knowledge to judge whether 
>> a device is assignable since potential conflicts may be detected only
>> when the guest is launched.
> 
> I don't think George was intending to imply otherwise; assignable here
> just means "bound to xen-pciback". There may be other reasons why the
> device cannot be assigned in practice when you come to actually use it;
> RMRR conflicts which may only be discovered when a guest is started
> would be one such practical reason.

Yes -- xl has a concept called "pci-assignable".  Before you can add a
device to a guest, you have to call "xl pci-assignable-add [device
spec]".  You can also run "xl pci-assignable-list" to see which devices
are currently assignable.

Normally this is true even for statically-assigned devices: If you add
pci = [ "$bdf" ] to a config file, and $bdf hasn't been made assignable,
then the pci-attach in domain creation will fail and the domain will be
destroyed.  You can make the domain builder do this automatically with
the "seize=1" parameter; i.e., pci = [ "$bdf,seize=1" ].

My suggestion was that in addition to specifying the particular ranges,
and specifying rmrr=host, we could also specify "rmrr=assignable", which
would cause the domain builder to internally run
libxl_pci_assignable_list() and find the RMRRs for all devices on the list.
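
To make that concrete, the admin flow would be something like this
(the rmrr= line being the hypothetical new bit):

    # bind the device to pciback, making it assignable
    xl pci-assignable-add 00:02.0
    # see what is currently assignable
    xl pci-assignable-list

and then in the guest config:

    pci  = [ "00:02.0" ]
    rmrr = [ "assignable" ]  # reserve RMRRs of all assignable devices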

But as Ian says, that's a "nice to have", not a requirement.

 -George

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: (v2) Design proposal for RMRR fix
  2015-01-14  8:06                                             ` Tian, Kevin
  2015-01-14  9:00                                               ` Jan Beulich
  2015-01-14 12:17                                               ` Ian Campbell
@ 2015-01-14 12:29                                               ` George Dunlap
  2015-01-14 14:42                                                 ` Jan Beulich
  2 siblings, 1 reply; 139+ messages in thread
From: George Dunlap @ 2015-01-14 12:29 UTC (permalink / raw)
  To: Tian, Kevin, Jan Beulich
  Cc: wei.liu2, ian.campbell, stefano.stabellini, tim, ian.jackson,
	xen-devel, Zhang, Yang Z, Chen, Tiejun

On 01/14/2015 08:06 AM, Tian, Kevin wrote:
> We discussed earlier there are two reasons that some conflicts may not be 
> avoided:
> 	- RMRRs conflicting with guest BIOS in <1MB area, as an example of 
> hard conflicts
> 	- RMRRs conflicting with lowmem which is low enough then avoiding it
> will either break lowmem or make lowmem too low to impact guest (just
> an option being discussed)

So here you're assuming that we're going to keep the lowmem / mmio hole
/ himem thing.  Is that necessary?  I was assuming that if we have
arbitrary RMRRs, that we would just have to accept that we'd need to be
able to punch an arbitrary number of holes in the p2m space.

That would involve telling qemu-upstream about the existence of such
holes, but that's something I think we need to be able to do at some
point anyway.

If we're going to tell xc_hvm_build about the RMRR ranges anyway, it
might as well just go ahead and punch the holes arbitrarily, rather than
restricting itself to simply making the mmio hole larger.
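
Roughly, I mean something like this in the builder (sketch only; the
struct and helper are made up, not actual xc_hvm_build code):

    #include <stdint.h>

    struct reserved_range {
        uint64_t start, end;          /* inclusive byte addresses */
    };

    /* Should this pfn stay unpopulated because an RMRR covers it? */
    static int pfn_reserved(uint64_t pfn,
                            const struct reserved_range *r, int nr)
    {
        int i;

        for (i = 0; i < nr; i++)
            if (pfn >= (r[i].start >> 12) && pfn <= (r[i].end >> 12))
                return 1;
        return 0;
    }

i.e. the builder would simply skip (and not count) such pfns wherever
they happen to land, rather than special-casing the mmio hole.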

 -George

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: (v2) Design proposal for RMRR fix
  2015-01-13 16:45                                     ` Konrad Rzeszutek Wilk
  2015-01-14  8:13                                       ` Tian, Kevin
@ 2015-01-14 12:47                                       ` George Dunlap
  1 sibling, 0 replies; 139+ messages in thread
From: George Dunlap @ 2015-01-14 12:47 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk, Tian, Kevin
  Cc: wei.liu2, ian.campbell, stefano.stabellini, tim, ian.jackson,
	xen-devel, Jan Beulich, Zhang, Yang Z, Chen, Tiejun

[ BTW, Konrad, could you do a bit of quote trimming when quoting such a
long e-mail?  It takes a non-trivial amount of time to figure out where
you've actually said something. Thanks. :-) ]

On 01/13/2015 04:45 PM, Konrad Rzeszutek Wilk wrote:
>> STEP-1. domain builder
>>
>> say the default layout w/o reserved regions would be:
>> 	lowmem: 	[0, 0xbfffffff]
>> 	mmio hole: 	[0xc0000000, 0xffffffff]
>> 	highmem:	[0x100000000, 0x140000000]
>>
>> domain builder then queries reserved regions from xenstore, 
>> and tries to avoid conflicts.
> 
> Perhaps an easier way of this is to use the existing
> mechanism we have - that is the XENMEM_memory_map (which
> BTW e820_host uses). If all of this is done in the libxl (which
> already does this based on the host E820, though it can
> be modified to query the hypervisor for other 'reserved
> regions') and hvmloader is modified to use XENMEM_memory_map
> and base its E820 on that (and also QEMU-xen), then we solve
> this problem and also the http://bugs.xenproject.org/xen/bug/28?
> 
> (lots of handwaving).

Hmm -- yes, since we have that, that might be a better option.

Having qemu-upstream read XENMEM_memory_map for a domain would avoid
having to pass a massive set of parameters to qemu for RMRRs (and
would allow us to get rid of mmio_hole_size as well).

But I don't think that by itself it will fix
http://bugs.xenproject.org/xen/bug/28, because that's ultimately about
hvmloader *moving* memory around after the domain has been created.
We'd still need to add a way for hvmloader to tell qemu about changes to
the memory map on-the-fly.

Would we need to have hvmloader then update the e820 in Xen as well, so
that future calls to XENMEM_memory_map returned accurate values?
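
For reference, the consuming side could then be as simple as something
like this hvmloader-flavoured sketch (written from memory, so wrapper
and constant names are only approximate):

    #include <xen/memory.h>

    #define MAP_ENTRIES 128
    static struct e820entry map[MAP_ENTRIES];

    static unsigned int get_guest_memory_map(void)
    {
        struct xen_memory_map memmap = { .nr_entries = MAP_ENTRIES };

        set_xen_guest_handle(memmap.buffer, map);
        if ( hypercall_memory_op(XENMEM_memory_map, &memmap) )
            BUG();
        return memmap.nr_entries; /* now including reserved regions */
    }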

 -George

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: (v2) Design proposal for RMRR fix
  2015-01-14 12:01                                                     ` George Dunlap
  2015-01-14 12:11                                                       ` Tian, Kevin
@ 2015-01-14 14:32                                                       ` Jan Beulich
  2015-01-14 14:37                                                         ` George Dunlap
  1 sibling, 1 reply; 139+ messages in thread
From: Jan Beulich @ 2015-01-14 14:32 UTC (permalink / raw)
  To: George Dunlap
  Cc: Kevin Tian, wei.liu2, ian.campbell, stefano.stabellini, tim,
	ian.jackson, xen-devel, YangZ Zhang, Tiejun Chen

>>> On 14.01.15 at 13:01, <george.dunlap@eu.citrix.com> wrote:
> On 01/14/2015 10:24 AM, Jan Beulich wrote:
>>>>> On 14.01.15 at 10:43, <kevin.tian@intel.com> wrote:
>>>>  From: Jan Beulich [mailto:JBeulich@suse.com]
>>>> Sent: Wednesday, January 14, 2015 5:00 PM
>>>>
>>>>>>> On 14.01.15 at 09:06, <kevin.tian@intel.com> wrote:
>>>>> Now the open is whether we want to fail domain creation for all of above
>>>>> conflicts. user may choose to bear with conflicts at his own disposal, or
>>>>> libxl doesn't want to fail conflicts as preparation for future
>>>>> hotplug/migration.
>>>>> One possible option is to add a per-region flag to specify whether treating
>>>>> relevant conflict as an error, when libxl composes the list to domain
>>>>> builder.
>>>>> and this information will be saved in a user space database accessible to
>>>>> all components and also waterfall to Xen hypervisor when libxl requests
>>>>> actual device assignment.
>>>>
>>>> That's certainly a possibility, albeit saying (in the guest config) that
>>>> a region to be reserved only when possible is about the same as
>>>> not stating that region. If at all, I'd see the rmrr-host value be a
>>>> tristate (don't, try, and force) to that effect.
>>>>
>>>
>>> how about something like below with bi-state?
>>>
>>> for statically assigned device:
>>> 	pci = [ "00:02.0, 0/1" ]
>>> where '0/1' represents try/force (or use 'try/force', or have a meaningful 
>>> attribute like rmrr_check=try/force?)
>> 
>> As said many times before, for statically assigned devices such a flag
>> makes no sense.
>> 
>>> for other usages like hotplug/migration:
>>> 	reserved_regions = [ 'host, 0/1', 'start, end, 0/1', 'start, end, 0/1', ...]
>>> If 'host' is specified, it implies rmrr_host; besides, the user can
>>> specify explicit ranges according to his detailed requirements.
>> 
>> For host the flag makes sense, but for the explicitly specified regions
>> - as said before - I don't think it does.
> 
> You don't think there are any circumstances where an admin should be
> allowed to "shoot himself in the foot" by assigning a device which he
> knows the RMRRs conflict -- perhaps because he "knows" that the RMRRs
> won't actually be used?

I did advocate for allowing this, and continue to do so. But I think
the necessary override for this would apply at assignment time,
not when punching the holes (i.e. would need to be a different
setting).

Jan

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: (v2) Design proposal for RMRR fix
  2015-01-14 12:03                                                     ` Tian, Kevin
@ 2015-01-14 14:34                                                       ` Jan Beulich
  0 siblings, 0 replies; 139+ messages in thread
From: Jan Beulich @ 2015-01-14 14:34 UTC (permalink / raw)
  To: Kevin Tian
  Cc: wei.liu2, ian.campbell, stefano.stabellini, George Dunlap, tim,
	ian.jackson, xen-devel, Yang Z Zhang, Tiejun Chen

>>> On 14.01.15 at 13:03, <kevin.tian@intel.com> wrote:
>>  From: Jan Beulich [mailto:JBeulich@suse.com]
>> Sent: Wednesday, January 14, 2015 6:24 PM
>> 
>> >>> On 14.01.15 at 10:43, <kevin.tian@intel.com> wrote:
>> >>  From: Jan Beulich [mailto:JBeulich@suse.com]
>> >> Sent: Wednesday, January 14, 2015 5:00 PM
>> >>
>> >> >>> On 14.01.15 at 09:06, <kevin.tian@intel.com> wrote:
>> >> > Now the open is whether we want to fail domain creation for all of above
>> >> > conflicts. user may choose to bear with conflicts at his own disposal, or
>> >> > libxl doesn't want to fail conflicts as preparation for future
>> >> > hotplug/migration.
>> >> > One possible option is to add a per-region flag to specify whether treating
>> >> > relevant conflict as an error, when libxl composes the list to domain
>> >> > builder.
>> >> > and this information will be saved in a user space database accessible to
>> >> > all components and also waterfall to Xen hypervisor when libxl requests
>> >> > actual device assignment.
>> >>
>> >> That's certainly a possibility, albeit saying (in the guest config) that
>> >> a region to be reserved only when possible is about the same as
>> >> not stating that region. If at all, I'd see the rmrr-host value be a
>> >> tristate (don't, try, and force) to that effect.
>> >>
>> >
>> > how about something like below with bi-state?
>> >
>> > for statically assigned device:
>> > 	pci = [ "00:02.0, 0/1" ]
>> > where '0/1' represents try/force (or use 'try/force', or have a meaningful
>> > attribute like rmrr_check=try/force?)
>> 
>> As said many times before, for statically assigned devices such a flag
>> makes no sense.
> 
> why? It seems we agreed in the very first place to allow the admin to
> provide a per-device override for this (e.g. per your argument that the
> user may get confirmation from the device vendor that, though a
> conflict exists, it can be ignored...). Alternatively this option
> replaces the hackish USB code in the Xen hypervisor while still
> preserving that opt-out.

As just said in another reply to George - making this possible is (to
me) independent of the try/force flag you suggest to be associated
with any of the reserved regions.

>> > for other usages like hotplug/migration:
>> > 	reserved_regions = [ 'host, 0/1', 'start, end, 0/1', 'start, end, 0/1', 
> ...]
>> > If 'host' is specified, it implies rmrr_host; besides, the user can
>> > specify explicit ranges according to his detailed requirements.
>> 
>> For host the flag makes sense, but for the explicitly specified regions
>> - as said before - I don't think it does.
>> 
> 
> I may have missed your suggestion on this. What do you think is the
> right way to handle explicitly specified regions?

Explicitly specified regions are unconditionally "force".

Jan

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: (v2) Design proposal for RMRR fix
  2015-01-14 12:12                                                     ` George Dunlap
@ 2015-01-14 14:36                                                       ` Jan Beulich
  0 siblings, 0 replies; 139+ messages in thread
From: Jan Beulich @ 2015-01-14 14:36 UTC (permalink / raw)
  To: George Dunlap
  Cc: Kevin Tian, wei.liu2, ian.campbell, stefano.stabellini, tim,
	ian.jackson, xen-devel, YangZ Zhang, Tiejun Chen

>>> On 14.01.15 at 13:12, <george.dunlap@eu.citrix.com> wrote:
> On 01/14/2015 10:24 AM, Jan Beulich wrote:
>>> for other usages like hotplug/migration:
>>> 	reserved_regions = [ 'host, 0/1', 'start, end, 0/1', 'start, end, 0/1', ...]
>>> If 'host' is specified, it implies rmrr_host; besides, the user can
>>> specify explicit ranges according to his detailed requirements.
>> 
>> For host the flag makes sense, but for the explicitly specified regions
>> - as said before - I don't think it does.
> 
> I thought we wanted to be able to specify regions that weren't on this
> host, but were on another host, so that an admin could migrate to
> another host and add devices there which might have RMRRs?

Correct. But being admin-specified, they shouldn't have a try/force
flag associated with them: when the admin asks for such regions,
they need to be observed no matter what (or guest creation needs
to fail).

Jan

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: (v2) Design proposal for RMRR fix
  2015-01-14 14:32                                                       ` Jan Beulich
@ 2015-01-14 14:37                                                         ` George Dunlap
  2015-01-14 14:47                                                           ` Jan Beulich
  0 siblings, 1 reply; 139+ messages in thread
From: George Dunlap @ 2015-01-14 14:37 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Kevin Tian, wei.liu2, ian.campbell, stefano.stabellini, tim,
	ian.jackson, xen-devel, YangZ Zhang, Tiejun Chen

On 01/14/2015 02:32 PM, Jan Beulich wrote:
>>>> On 14.01.15 at 13:01, <george.dunlap@eu.citrix.com> wrote:
>> On 01/14/2015 10:24 AM, Jan Beulich wrote:
>>>>>> On 14.01.15 at 10:43, <kevin.tian@intel.com> wrote:
>>>>>  From: Jan Beulich [mailto:JBeulich@suse.com]
>>>>> Sent: Wednesday, January 14, 2015 5:00 PM
>>>>>
>>>>>>>> On 14.01.15 at 09:06, <kevin.tian@intel.com> wrote:
>>>>>> Now the open is whether we want to fail domain creation for all of above
>>>>>> conflicts. user may choose to bear with conflicts at his own disposal, or
>>>>>> libxl doesn't want to fail conflicts as preparation for future
>>>>>> hotplug/migration.
>>>>>> One possible option is to add a per-region flag to specify whether treating
>>>>>> relevant conflict as an error, when libxl composes the list to domain
>>>>>> builder.
>>>>>> and this information will be saved in a user space database accessible to
>>>>>> all components and also waterfall to Xen hypervisor when libxl requests
>>>>>> actual device assignment.
>>>>>
>>>>> That's certainly a possibility, albeit saying (in the guest config) that
>>>>> a region to be reserved only when possible is about the same as
>>>>> not stating that region. If at all, I'd see the rmrr-host value be a
>>>>> tristate (don't, try, and force) to that effect.
>>>>>
>>>>
>>>> how about something like below with bi-state?
>>>>
>>>> for statically assigned device:
>>>> 	pci = [ "00:02.0, 0/1" ]
>>>> where '0/1' represents try/force (or use 'try/force', or have a meaningful 
>>>> attribute like rmrr_check=try/force?)
>>>
>>> As said many times before, for statically assigned devices such a flag
>>> makes no sense.
>>>
>>>> for other usages like hotplug/migration:
>>>> 	reserved_regions = [ 'host, 0/1', 'start, end, 0/1', 'start, end, 0/1', ...]
>>>> If 'host' is specified, it implies rmrr_host; besides, the user can
>>>> specify explicit ranges according to his detailed requirements.
>>>
>>> For host the flag makes sense, but for the explicitly specified regions
>>> - as said before - I don't think it does.
>>
>> You don't think there are any circumstances where an admin should be
>> allowed to "shoot himself in the foot" by assigning a device which he
>> knows the RMRRs conflict -- perhaps because he "knows" that the RMRRs
>> won't actually be used?
> 
> I did advocate for allowing this, and continue to do so. But I think
> the necessary override for this would apply at assignment time,
> not when punching the holes (i.e. would need to be a different
> setting).

But essentially what you're saying then is that for such devices, you
should not be able to statically assign them; you are only allowed to
hotplug them.

If you want to statically assign such a device, then libxl *should* try
to make the RMRR if possible, but shouldn't fail if it can't; and, it
needs to tell Xen not to fail the assignment when setting up the domain.

For that purpose, adding "rmrr=try" to the pci config spec makes the
most sense, doesn't it?
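
E.g. (syntax illustrative):

  pci = [ "00:02.0,rmrr=try" ]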

Or am I missing something?

 -George

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: (v2) Design proposal for RMRR fix
  2015-01-14 12:16                                                   ` George Dunlap
@ 2015-01-14 14:39                                                     ` Jan Beulich
  2015-01-14 18:16                                                       ` George Dunlap
  0 siblings, 1 reply; 139+ messages in thread
From: Jan Beulich @ 2015-01-14 14:39 UTC (permalink / raw)
  To: George Dunlap, Kevin Tian
  Cc: wei.liu2, ian.campbell, stefano.stabellini, tim, ian.jackson,
	xen-devel, Yang Z Zhang, Tiejun Chen

>>> On 14.01.15 at 13:16, <george.dunlap@eu.citrix.com> wrote:
> On 01/14/2015 09:43 AM, Tian, Kevin wrote:
>> for other usages like hotplug/migration:
>> 	reserved_regions = [ 'host, 0/1', 'start, end, 0/1', 'start, end, 0/1', ...]
>> If 'host' is specified, it implies rmrr_host; besides, the user can
>> specify explicit ranges according to his detailed requirements.
>> 
>> based on above configuration interface, libxl can construct necessary
>> reserve regions with individual try/force policies.
> 
> Same here; I'd do something like:
> 
>  rmrr = [ "0xe0000:0xeffff,check=try", "0xa000000:0xa0000fff" ]
> 
> Where here the first one would be allowed to conflict in the domain
> builder; but the second would error out if it couldn't be made for some
> reason.

Just to avoid confusion - I continue to think that the try flag on
explicitly specified regions makes no sense, i.e. I'd see only
something like

>  rmrr = [ "host,check=try", "0xe0000:0xeffff", "0xa000000:0xa0000fff" ]

as viable (with the token "check" not necessarily being the most
expressive one for the purpose it has).

Jan

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: (v2) Design proposal for RMRR fix
  2015-01-14 12:29                                               ` George Dunlap
@ 2015-01-14 14:42                                                 ` Jan Beulich
  2015-01-14 18:22                                                   ` George Dunlap
  0 siblings, 1 reply; 139+ messages in thread
From: Jan Beulich @ 2015-01-14 14:42 UTC (permalink / raw)
  To: George Dunlap, Kevin Tian
  Cc: wei.liu2, ian.campbell, stefano.stabellini, tim, ian.jackson,
	xen-devel, Yang Z Zhang, Tiejun Chen

>>> On 14.01.15 at 13:29, <george.dunlap@eu.citrix.com> wrote:
> On 01/14/2015 08:06 AM, Tian, Kevin wrote:
>> We discussed earlier there are two reasons that some conflicts may not be 
>> avoided:
>> 	- RMRRs conflicting with guest BIOS in <1MB area, as an example of 
>> hard conflicts
>> 	- RMRRs conflicting with lowmem which is low enough then avoiding it
>> will either break lowmem or make lowmem too low to impact guest (just
>> an option being discussed)
> 
> So here you're assuming that we're going to keep the lowmem / mmio hole
> / himem thing.  Is that necessary?  I was assuming that if we have
> arbitrary RMRRs, that we would just have to accept that we'd need to be
> able to punch an arbitrary number of holes in the p2m space.

On the basis that the host would have placed the RMRRs in its MMIO
hole, I think I agree with Kevin that we should stick with the
simpler lowmem / mmio-hole / highmem model if possible. If we
really find this too limiting, switching to the more fine-grained model
later on will still be possible.

Jan

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: (v2) Design proposal for RMRR fix
  2015-01-14 14:37                                                         ` George Dunlap
@ 2015-01-14 14:47                                                           ` Jan Beulich
  2015-01-14 18:29                                                             ` George Dunlap
  0 siblings, 1 reply; 139+ messages in thread
From: Jan Beulich @ 2015-01-14 14:47 UTC (permalink / raw)
  To: George Dunlap
  Cc: Kevin Tian, wei.liu2, ian.campbell, stefano.stabellini, tim,
	ian.jackson, xen-devel, YangZ Zhang, TiejunChen

>>> On 14.01.15 at 15:37, <george.dunlap@eu.citrix.com> wrote:
> On 01/14/2015 02:32 PM, Jan Beulich wrote:
>>>>> On 14.01.15 at 13:01, <george.dunlap@eu.citrix.com> wrote:
>>> On 01/14/2015 10:24 AM, Jan Beulich wrote:
>>>>>>> On 14.01.15 at 10:43, <kevin.tian@intel.com> wrote:
>>>>>>  From: Jan Beulich [mailto:JBeulich@suse.com]
>>>>>> Sent: Wednesday, January 14, 2015 5:00 PM
>>>>>>
>>>>>>>>> On 14.01.15 at 09:06, <kevin.tian@intel.com> wrote:
>>>>>>> Now the open is whether we want to fail domain creation for all of above
>>>>>>> conflicts. user may choose to bear with conflicts at his own disposal, or
>>>>>>> libxl doesn't want to fail conflicts as preparation for future
>>>>>>> hotplug/migration.
>>>>>>> One possible option is to add a per-region flag to specify whether treating
>>>>>>> relevant conflict as an error, when libxl composes the list to domain
>>>>>>> builder.
>>>>>>> and this information will be saved in a user space database accessible to
>>>>>>> all components and also waterfall to Xen hypervisor when libxl requests
>>>>>>> actual device assignment.
>>>>>>
>>>>>> That's certainly a possibility, albeit saying (in the guest config) that
>>>>>> a region to be reserved only when possible is about the same as
>>>>>> not stating that region. If at all, I'd see the rmrr-host value be a
>>>>>> tristate (don't, try, and force) to that effect.
>>>>>>
>>>>>
>>>>> how about something like below with bi-state?
>>>>>
>>>>> for statically assigned device:
>>>>> 	pci = [ "00:02.0, 0/1" ]
>>>>> where '0/1' represents try/force (or use 'try/force', or have a meaningful 
>>>>> attribute like rmrr_check=try/force?)
>>>>
>>>> As said many times before, for statically assigned devices such a flag
>>>> makes no sense.
>>>>
>>>>> for other usages like hotplug/migration:
>>>>> 	reserved_regions = [ 'host, 0/1', 'start, end, 0/1', 'start, end, 0/1', ...]
>>>>> If 'host' is specified, it implies rmrr_host; besides, the user can
>>>>> specify explicit ranges according to his detailed requirements.
>>>>
>>>> For host the flag makes sense, but for the explicitly specified regions
>>>> - as said before - I don't think it does.
>>>
>>> You don't think there are any circumstances where an admin should be
>>> allowed to "shoot himself in the foot" by assigning a device which he
>>> knows the RMRRs conflict -- perhaps because he "knows" that the RMRRs
>>> won't actually be used?
>> 
>> I did advocate for allowing this, and continue to do so. But I think
>> the necessary override for this would apply at assignment time,
>> not when punching the holes (i.e. would need to be a different
>> setting).
> 
> But essentially what you're saying then is that for such devices, you
> should not be able to statically assign them; you are only allowed to
> hotplug them.
> 
> If you want to statically assign such a device, then libxl *should* try
> to make the RMRR if possible, but shouldn't fail if it can't; and, it
> needs to tell Xen not to fail the assignment when setting up the domain.
> 
> For that purpose, adding "rmrr=try" to the pci config spec makes the
> most sense, doesn't it?
> 
> Or am I missing something?

No, you're right. The model is just a little more complicated: the
rmrr = [] settings need to be combined with the statically assigned
devices' pci = [] settings. What will get most problematic is if you
want rmrr = [ "host,check=force" ] but then want to make an exception
for a statically assigned device (like a USB controller).
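
I.e. (made-up syntax) something like

  rmrr = [ "host,check=force" ]
  pci  = [ "00:1d.0,rmrr=try" ]

where the second line is meant to override the first for just that
one device.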

Jan

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: (v2) Design proposal for RMRR fix
  2015-01-14 12:17                                               ` Ian Campbell
@ 2015-01-14 15:07                                                 ` Jan Beulich
  2015-01-14 15:18                                                   ` Ian Campbell
  2015-01-15  8:40                                                   ` Tian, Kevin
  0 siblings, 2 replies; 139+ messages in thread
From: Jan Beulich @ 2015-01-14 15:07 UTC (permalink / raw)
  To: Ian Campbell, Kevin Tian
  Cc: wei.liu2, stefano.stabellini, George Dunlap, tim, ian.jackson,
	xen-devel, Yang Z Zhang, Tiejun Chen

>>> On 14.01.15 at 13:17, <Ian.Campbell@citrix.com> wrote:
> On Wed, 2015-01-14 at 08:06 +0000, Tian, Kevin wrote:
>> - RMRRs conflicting with guest BIOS in <1MB area, as an example of 
>> hard conflicts
> 
> OOI what is the (estimated) probability of such an RMRR existing which
> doesn't already conflict with the real host BIOS?

Surely the host BIOS will know to place the RMRRs outside its BIOS
image.

> Host BIOSes are generally large compared to the guest BIOS, but with the
> amount of decompression and relocation etc they do I don't know how much
> of them generally remains in the <1MB region.

Recall the example: (host) RMRR naming E0000-EFFFF, which
overlaps with the init-time guest BIOS image, but doesn't overlap
with its resident part (as long as that doesn't exceed 64k in size).
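
Pictorially (taking the resident part to live in the top 64k):

    0xE0000 +------------------------+ <- RMRR E0000-EFFFF covers only
            | guest BIOS (init-time) |    this part of the image
    0xF0000 +------------------------+
            | guest BIOS (resident)  | <- untouched, provided it stays
    0xFFFFF +------------------------+    within 64k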

Jan

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: (v2) Design proposal for RMRR fix
  2015-01-14 15:07                                                 ` Jan Beulich
@ 2015-01-14 15:18                                                   ` Ian Campbell
  2015-01-14 15:39                                                     ` George Dunlap
  2015-01-15  8:40                                                   ` Tian, Kevin
  1 sibling, 1 reply; 139+ messages in thread
From: Ian Campbell @ 2015-01-14 15:18 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Kevin Tian, wei.liu2, stefano.stabellini, George Dunlap,
	ian.jackson, tim, xen-devel, Yang Z Zhang, Tiejun Chen

On Wed, 2015-01-14 at 15:07 +0000, Jan Beulich wrote:
> >>> On 14.01.15 at 13:17, <Ian.Campbell@citrix.com> wrote:
> > On Wed, 2015-01-14 at 08:06 +0000, Tian, Kevin wrote:
> >> - RMRRs conflicting with guest BIOS in <1MB area, as an example of 
> >> hard conflicts
> > 
> > OOI what is the (estimated) probability of such an RMRR existing which
> > doesn't already conflict with the real host BIOS?
> 
> Surely the host BIOS will know to place the RMRRs outside its BIOS
> image.

Yes, my point was that if this were the case (as you'd expect) and the
virtual BIOS was smaller than the physical one, then the probability of
an RMRR conflicting with the virtual BIOS would be low.

> > Host BIOSes are generally large compared to the guest BIOS, but with the
> > amount of decompression and relocation etc they do I don't know how much
> > of them generally remains in the <1MB region.
> 
> Recall the example: (host) RMRR naming E0000-EFFFF, which
> overlaps with the init-time guest BIOS image, but doesn't overlap
> with its resident part (as long as that doesn't exceed 64k in size).

Right, that means second precondition above doesn't really hold, which
is a shame.

In principle it might be possible to have some of the RMRR setup and
conflict detection stuff in SeaBIOS rather than hvmloader, and therefore
take advantage of the same init-time vs resident distinction, but I
suspect that won't lead to an overall design we are happy with, mainly
since such things are typically done by hvmloader in a Xen system.

Ian.

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: (v2) Design proposal for RMRR fix
  2015-01-14 15:18                                                   ` Ian Campbell
@ 2015-01-14 15:39                                                     ` George Dunlap
  2015-01-14 15:43                                                       ` Ian Campbell
  2015-01-14 16:26                                                       ` Jan Beulich
  0 siblings, 2 replies; 139+ messages in thread
From: George Dunlap @ 2015-01-14 15:39 UTC (permalink / raw)
  To: Ian Campbell, Jan Beulich
  Cc: Kevin Tian, wei.liu2, stefano.stabellini, ian.jackson, tim,
	xen-devel, Yang Z Zhang, Tiejun Chen

On 01/14/2015 03:18 PM, Ian Campbell wrote:
>>> Host BIOSes are generally large compared to the guest BIOS, but with the
>>> amount of decompression and relocation etc they do I don't know how much
>>> of them generally remains in the <1MB region.
>>
>> Recall the example: (host) RMRR naming E0000-EFFFF, which
>> overlaps with the init-time guest BIOS image, but doesn't overlap
>> with its resident part (as long as that doesn't exceed 64k in size).
> 
> Right, that means second precondition above doesn't really hold, which
> is a shame.
> 
> In principle it might be possible to have some of the RMRR setup and
> conflict detection stuff in SeaBIOS rather than hvmloader, and therefore
> take advantage of the same init-time vs resident distinction, but I
> suspect that won't lead to an overall design we are happy with, mainly
> since such things are typically done by hvmloader in a Xen system.

Actually, I was just thinking about this -- I'm not really sure why we
do the PCI MMIO stuff in hvmloader at all.  Is there any reason, other
than the fact that we need to tell Xen about updates to the physical
address space?  If not, it seems like doing it in SeaBIOS would make a
lot more sense, rather than having to maintain duplicate functionality
in hvmloader.

Anthony is looking into this, but if SeaBIOS inside KVM is able to
notify qemu about changes to the memory map, then it seems like teaching
SeaBIOS how to tell Xen about those changes (or have qemu do it) would
make a lot of our problems in this area a lot simpler.

For RMRRs, presumably SeaBIOS is already set up to avoid them; so if we
can just give it an e820 with the RMRRs in it, then everything will just
fall out of that.

 -George

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: (v2) Design proposal for RMRR fix
  2015-01-14 15:39                                                     ` George Dunlap
@ 2015-01-14 15:43                                                       ` Ian Campbell
  2015-01-14 18:14                                                         ` George Dunlap
  2015-01-14 16:26                                                       ` Jan Beulich
  1 sibling, 1 reply; 139+ messages in thread
From: Ian Campbell @ 2015-01-14 15:43 UTC (permalink / raw)
  To: George Dunlap
  Cc: Kevin Tian, wei.liu2, stefano.stabellini, ian.jackson, tim,
	xen-devel, Jan Beulich, Yang Z Zhang, Tiejun Chen

On Wed, 2015-01-14 at 15:39 +0000, George Dunlap wrote:
> On 01/14/2015 03:18 PM, Ian Campbell wrote:
> >>> Host BIOSes are generally large compared to the guest BIOS, but with the
> >>> amount of decompression and relocation etc they do I don't know how much
> >>> of them generally remains in the <1MB region.
> >>
> >> Recall the example: (host) RMRR naming E0000-EFFFF, which
> >> overlaps with the init-time guest BIOS image, but doesn't overlap
> >> with its resident part (as long as that doesn't exceed 64k in size).
> > 
> > Right, that means second precondition above doesn't really hold, which
> > is a shame.
> > 
> > In principle it might be possible to have some of the RMRR setup and
> > conflict detection stuff in SeaBIOS rather than hvmloader, and therefore
> > take advantage of the same init-time vs resident distinction, but I
> > suspect that won't lead to an overall design we are happy with, mainly
> > since such things are typically done by hvmloader in a Xen system.
> 
> Actually, I was just thinking about this -- I'm not really sure why we
> do the PCI MMIO stuff in hvmloader at all.  Is there any reason, other
> than the fact that we need to tell Xen about updates to the physical
> address space?  If not, it seems like doing it in SeaBIOS would make a
> lot more sense, rather than having to maintain duplicate functionality
> in hvmloader.

I don't remember exactly, but I think it was because something about the
PCI enumeration needed to be reflected in the ACPI tables, which hvmloader
also provides. Splitting it up was tricky; that was what I initially
tried when adding SeaBIOS support, and it turned into a rat's nest.

> Anthony is looking into this, but if SeaBIOS inside KVM is able to
> notify qemu about changes to the memory map, then it seems like teaching
> SeaBIOS how to tell Xen about those changes (or have qemu do it) would
> make a lot of our problems in this area a lot simpler.

SeaBIOS on qemu uses the firmware cfg interface (a bit bashed protocol
over a magic port) to split these responsibilities. I'm not sure of the
exact split but I know that not so long ago responsibility for
constructing the ACPI tables moved from SeaBIOS to qemu (or maybe just a
subset, perhaps someone else knows better).
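
(For the curious: the classic x86 variant of that protocol really is
just two I/O ports - sketch below, from memory:

    #include <stdint.h>
    #include <sys/io.h>

    /* Select a fw_cfg item by key, then stream its bytes out. */
    static void fw_cfg_read(uint16_t key, void *buf, unsigned int len)
    {
        uint8_t *p = buf;

        outw(key, 0x510);             /* selector port */
        while (len--)
            *p++ = inb(0x511);        /* data port */
    }

so the producer side can move between components fairly freely.)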

> For RMRRs, presumably SeaBIOS is already set up to avoid them; so if we
> can just give it an e820 with the RMRRs in it, then everything will just
> fall out of that.

I suppose; my guess would be that any code which would go anywhere near
stuff like that is already gated on Xen because hvmloader takes care of it.

Ian.

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: (v2) Design proposal for RMRR fix
  2015-01-14 15:39                                                     ` George Dunlap
  2015-01-14 15:43                                                       ` Ian Campbell
@ 2015-01-14 16:26                                                       ` Jan Beulich
  1 sibling, 0 replies; 139+ messages in thread
From: Jan Beulich @ 2015-01-14 16:26 UTC (permalink / raw)
  To: George Dunlap
  Cc: Kevin Tian, wei.liu2, Ian Campbell, stefano.stabellini, tim,
	ian.jackson, xen-devel, Yang Z Zhang, Tiejun Chen

>>> On 14.01.15 at 16:39, <george.dunlap@eu.citrix.com> wrote:
> On 01/14/2015 03:18 PM, Ian Campbell wrote:
>>>> Host BIOSes are generally large compared to the guest BIOS, but with the
>>>> amount of decompression and relocation etc they do I don't know how much
>>>> of them generally remains in the <1MB region.
>>>
>>> Recall the example: (host) RMRR naming E0000-EFFFF, which
>>> overlaps with the init-time guest BIOS image, but doesn't overlap
>>> with its resident part (as long as that doesn't exceed 64k in size).
>> 
>> Right, that means second precondition above doesn't really hold, which
>> is a shame.
>> 
>> In principle it might be possible to have some of the RMRR setup and
>> conflict detection stuff in SeaBIOS rather than hvmloader, and therefore
>> take advantage of the same init-time vs resident distinction, but I
>> suspect that won't lead to an overall design we are happy with, mainly
>> since such things are typically done by hvmloader in a Xen system.
> 
> Actually, I was just thinking about this -- I'm not really sure why we
> do the PCI MMIO stuff in hvmloader at all.  Is there any reason, other
> than the fact that we need to tell Xen about updates to the physical
> address space?  If not, it seems like doing it in SeaBIOS would make a
> lot more sense, rather than having to maintain duplicate functionality
> in hvmloader.

Fully agreed - this really is the BIOS's job. The only caveat being
that for as long as we support it, things need to continue to work
with qemu-trad.

> Anthony is looking into this, but if SeaBIOS inside KVM is able to
> notify qemu about changes to the memory map, then it seems like teaching
> SeaBIOS how to tell Xen about those changes (or have qemu do it) would
> make a lot of our problems in this area a lot simpler.
> 
> For RMRRs, presumably SeaBIOS is already set up to avoid them; so if we
> can just give it an e820 with the RMRRs in it, then everything will just
> fall out of that.

Not sure - building the E820 table is also the BIOS's job, as is (on
real hardware) placing the RMRRs. The latter is what's entirely
different in our case - BIOS/hvmloader can't pick where they want the
RMRRs; they have to live with where they are.

Jan

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: (v2) Design proposal for RMRR fix
  2015-01-14 15:43                                                       ` Ian Campbell
@ 2015-01-14 18:14                                                         ` George Dunlap
  2015-01-15 10:05                                                           ` Ian Campbell
  0 siblings, 1 reply; 139+ messages in thread
From: George Dunlap @ 2015-01-14 18:14 UTC (permalink / raw)
  To: Ian Campbell
  Cc: Kevin Tian, wei.liu2, stefano.stabellini, ian.jackson, tim,
	xen-devel, Jan Beulich, Yang Z Zhang, Tiejun Chen

On 01/14/2015 03:43 PM, Ian Campbell wrote:
> On Wed, 2015-01-14 at 15:39 +0000, George Dunlap wrote:
>> On 01/14/2015 03:18 PM, Ian Campbell wrote:
>>>>> Host BIOSes are generally large compared to the guest BIOS, but with the
>>>>> amount of decompression and relocation etc they do I don't know how much
>>>>> of them generally remains in the <1MB region.
>>>>
>>>> Recall the example: (host) RMRR naming E0000-EFFFF, which
>>>> overlaps with the init-time guest BIOS image, but doesn't overlap
>>>> with its resident part (as long as that doesn't exceed 64k in size).
>>>
>>> Right, that means second precondition above doesn't really hold, which
>>> is a shame.
>>>
>>> In principle it might be possible to have some of the RMRR setup and
>>> conflict detection stuff in SeaBIOS rather than hvmloader, and therefore
>>> take advantage of the same init-time vs resident distinction, but I
>>> suspect that won't lead to an overall design we are happy with, mainly
>>> since such things are typically done by hvmloader in a Xen system.
>>
>> Actually, I was just thinking about this -- I'm not really sure why we
>> do the PCI MMIO stuff in hvmloader at all.  Is there any reason, other
>> than the fact that we need to tell Xen about updates to the physical
>> address space?  If not, it seems like doing it in SeaBIOS would make a
>> lot more sense, rather than having to maintain duplicate functionality
>> in hvmloader.
> 
> I don't remember exactly, but I think it was because something about the
> PCI enumeration needed to be reflected in the ACPI tables, which hvmloader
> also provides. Splitting it up was tricky; that was what I initially
> tried when adding SeaBIOS support, and it turned into a rat's nest.

Blah. :-(

>> Anthony is looking into this, but if SeaBIOS inside KVM is able to
>> notify qemu about changes to the memory map, then it seems like teaching
>> SeaBIOS how to tell Xen about those changes (or have qemu do it) would
>> make a lot of our problems in this area a lot simpler.
> 
> SeaBIOS on qemu uses the firmware cfg interface (a bit bashed protocol
> over a magic port) to split these responsibilities. I'm not sure of the
> exact split but I know that not so long ago responsibility for
> constructing the ACPI tables moved from SeaBIOS to qemu (or maybe just a
> subset, perhaps someone else knows better).
> 
>> For RMRRs, presumably SeaBIOS is already set up to avoid them; so if we
>> can just give it an e820 with the RMRRs in it, then everything will just
>> fall out of that.
> 
> I suppose; my guess would be that any code which would go anywhere near
> stuff like that is already gated on Xen because hvmloader takes care of it.

Yes, that's in fact what happens (I'm pretty sure in the MMIO placement
code it has an "if (xen) return;" at the top); I was trying to envision
a future where this was all rationalized and de-duplicated.

And for the record, Anthony has just looked into what happens with the
MMIO hole on KVM, and apparently SeaBIOS is just given either a 0.5 or
1G hole, and if it can't fit everything, it calls panic().  (i.e.,
there's no way for SeaBIOS to make the hole bigger on KVM either).

So it looks like the path to a rational system is more difficult than I
had initially hoped.

 -George

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: (v2) Design proposal for RMRR fix
  2015-01-14 14:39                                                     ` Jan Beulich
@ 2015-01-14 18:16                                                       ` George Dunlap
  0 siblings, 0 replies; 139+ messages in thread
From: George Dunlap @ 2015-01-14 18:16 UTC (permalink / raw)
  To: Jan Beulich, Kevin Tian
  Cc: wei.liu2, ian.campbell, stefano.stabellini, tim, ian.jackson,
	xen-devel, Yang Z Zhang, Tiejun Chen

On 01/14/2015 02:39 PM, Jan Beulich wrote:
>>>> On 14.01.15 at 13:16, <george.dunlap@eu.citrix.com> wrote:
>> On 01/14/2015 09:43 AM, Tian, Kevin wrote:
>>> for other usages like hotplug/migration:
>>> 	reserved_regions = [ 'host, 0/1', 'start, end, 0/1', 'start, end, 0/1', ...]
>>> If 'host' is specified, it implies rmrr_host; besides, the user can
>>> specify explicit ranges according to his detailed requirements.
>>>
>>> based on above configuration interface, libxl can construct necessary
>>> reserve regions with individual try/force policies.
>>
>> Same here; I'd do something like:
>>
>>  rmrr = [ "0xe0000:0xeffff,check=try", "0xa000000:0xa0000fff" ]
>>
>> Where here the first one would be allowed to conflict in the domain
>> builder; but the second would error out if it couldn't be made for some
>> reason.
> 
> Just to avoid confusion - I continue to think that the try flag on
> explicitly specified regions makes no sense, i.e. I'd see only
> something like
> 
>>  rmrr = [ "host,check=try", "0xe0000:0xeffff", "0xa000000:0xa0000fff" ]
> 
> as viable (with the token "check" not necessarily being the most
> expressive one for the purpose it has).

OK -- this is a fairly minor detail that we can discuss when the patches
are actually submitted.

 -George

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: (v2) Design proposal for RMRR fix
  2015-01-14 14:42                                                 ` Jan Beulich
@ 2015-01-14 18:22                                                   ` George Dunlap
  2015-01-15  8:18                                                     ` Tian, Kevin
  0 siblings, 1 reply; 139+ messages in thread
From: George Dunlap @ 2015-01-14 18:22 UTC (permalink / raw)
  To: Jan Beulich, Kevin Tian
  Cc: wei.liu2, ian.campbell, stefano.stabellini, tim, ian.jackson,
	xen-devel, Yang Z Zhang, Tiejun Chen

On 01/14/2015 02:42 PM, Jan Beulich wrote:
>>>> On 14.01.15 at 13:29, <george.dunlap@eu.citrix.com> wrote:
>> On 01/14/2015 08:06 AM, Tian, Kevin wrote:
>>> We discussed earlier there are two reasons that some conflicts may not be 
>>> avoided:
>>> 	- RMRRs conflicting with guest BIOS in <1MB area, as an example of 
>>> hard conflicts
>>> 	- RMRRs conflicting with lowmem which is low enough then avoiding it
>>> will either break lowmem or make lowmem too low to impact guest (just
>>> an option being discussed)
>>
>> So here you're assuming that we're going to keep the lowmem / mmio hole
>> / himem thing.  Is that necessary?  I was assuming that if we have
>> arbitrary RMRRs, that we would just have to accept that we'd need to be
>> able to punch an arbitrary number of holes in the p2m space.
> 
> On the basis that the host would have placed the RMRRs in its MMIO
> hole, I think I agree with Kevin that we should stick with the
> simpler lowmem / mmio-hole / highmem model if possible. If we
> really find this too limiting, switching to the more fine-grained model
> later on will still be possible.

OK, sounds good.

One detail to work out in that case then is if / when we want to error
out or warn the user that the mmio hole is "too big" (or the RMRR is
"too low").

(I may think about it and post some thoughts tomorrow.)

 -George

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: (v2) Design proposal for RMRR fix
  2015-01-14 14:47                                                           ` Jan Beulich
@ 2015-01-14 18:29                                                             ` George Dunlap
  2015-01-15  8:37                                                               ` Jan Beulich
  0 siblings, 1 reply; 139+ messages in thread
From: George Dunlap @ 2015-01-14 18:29 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Kevin Tian, wei.liu2, ian.campbell, stefano.stabellini, tim,
	ian.jackson, xen-devel, YangZ Zhang, TiejunChen

On 01/14/2015 02:47 PM, Jan Beulich wrote:
>>>> On 14.01.15 at 15:37, <george.dunlap@eu.citrix.com> wrote:
>> On 01/14/2015 02:32 PM, Jan Beulich wrote:
>>>>>> On 14.01.15 at 13:01, <george.dunlap@eu.citrix.com> wrote:
>>>> On 01/14/2015 10:24 AM, Jan Beulich wrote:
>>>>>>>> On 14.01.15 at 10:43, <kevin.tian@intel.com> wrote:
>>>>>>>  From: Jan Beulich [mailto:JBeulich@suse.com]
>>>>>>> Sent: Wednesday, January 14, 2015 5:00 PM
>>>>>>>
>>>>>>>>>> On 14.01.15 at 09:06, <kevin.tian@intel.com> wrote:
>>>>>>>> Now the open question is whether we want to fail domain creation for
>>>>>>>> all of the above conflicts. The user may choose to put up with
>>>>>>>> conflicts at his own risk, or libxl may not want to fail on conflicts
>>>>>>>> in preparation for future hotplug/migration.
>>>>>>>> One possible option is to add a per-region flag specifying whether to
>>>>>>>> treat the relevant conflict as an error, when libxl composes the list
>>>>>>>> for the domain builder; this information would be saved in a user
>>>>>>>> space database accessible to all components and also flow down to the
>>>>>>>> Xen hypervisor when libxl requests actual device assignment.
>>>>>>>
>>>>>>> That's certainly a possibility, albeit saying (in the guest config) that
>>>>>>> a region is to be reserved only when possible is about the same as
>>>>>>> not stating that region. If at all, I'd see the rmrr-host value be a
>>>>>>> tristate (don't, try, and force) to that effect.
>>>>>>>
>>>>>>
>>>>>> how about something like the below, with two states?
>>>>>>
>>>>>> for statically assigned device:
>>>>>> 	pci = [ "00:02.0, 0/1" ]
>>>>>> where '0/1' represents try/force (or use 'try/force', or have a meaningful 
>>>>>> attribute like rmrr_check=try/force?)
>>>>>
>>>>> As said many times before, for statically assigned devices such a flag
>>>>> makes no sense.
>>>>>
>>>>>> for other usages like hotplug/migration:
>>>>>> 	reserved_regions = [ 'host, 0/1', 'start, end, 0/1', 'start, end, 0/1', ...]
>>>>>> If 'host' is specified, it implies rmrr_host; besides that, the user can
>>>>>> specify explicit ranges according to his detailed requirements.
>>>>>
>>>>> For host the flag makes sense, but for the explicitly specified regions
>>>>> - as said before - I don't think it does.
>>>>
>>>> You don't think there are any circumstances where an admin should be
>>>> allowed to "shoot himself in the foot" by assigning a device which he
>>>> knows the RMRRs conflict -- perhaps because he "knows" that the RMRRs
>>>> won't actually be used?
>>>
>>> I did advocate for allowing this, and continue to do so. But I think
>>> the necessary override for this would apply at assignment time,
>>> not when punching the holes (i.e. would need to be a different
>>> setting).
>>
>> But essentially what you're saying then is that for such devices, you
>> should not be able to statically assign them; you are only allowed to
>> hotplug them.
>>
>> If you want to statically assign such a device, then libxl *should* try
>> to make the RMRR reservation if possible, but shouldn't fail if it can't; and, it
>> needs to tell Xen not to fail the assignment when setting up the domain.
>>
>> For that purpose, adding "rmrr=try" to the pci config spec makes the
>> most sense, doesn't it?
>>
>> Or am I missing something?
> 
> No, you're right. The model just is a little more complicated: The
> rmrr = [] settings need to be combined with the statically assigned
> devices' pci = [] settings. What will get most problematic is if you
> want rmrr = [ "host,check=force" ] but then make an exception for
> a statically assigned device (like a USB controller).

Yes, that's a policy question we'll have to think about; but that can
probably wait until the patches are posted.

Just to be clear -- what we're talking about here is that at the
do_domain_create() level (called by libxl_domain_create_new()), it will
take a list of pci devices, and the rmrr list above (including "host"
and individual ranges), and generate a list of RMRRs to pass to the
lower layer.  The lower layer will simply see the range, and a "force /
no force" flag, and behave appropriately.  The determination of which
RMRRs to force will be done at the domain_create level.

Is that about right?
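
Concretely (purely illustrative; the names below are hypothetical, not
proposed code), I imagine the lower layer consuming something like:

	#include <stdint.h>
	#include <stdbool.h>

	/* One reserved region, as handed down from the domain_create level. */
	struct rmrr_region {
	    uint64_t start, end;   /* guest-physical range, 4K-aligned */
	    bool     force;        /* fail on conflict vs. best effort */
	};

	/* domain_create composes an array of these from the pci = [...] and
	 * rmrr = [...] settings; the lower layer never sees the policy inputs. */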

 -George

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: (v2) Design proposal for RMRR fix
  2015-01-12 12:12           ` Ian Campbell
@ 2015-01-14 20:06             ` Konrad Rzeszutek Wilk
  0 siblings, 0 replies; 139+ messages in thread
From: Konrad Rzeszutek Wilk @ 2015-01-14 20:06 UTC (permalink / raw)
  To: Ian Campbell
  Cc: Kevin Tian, wei.liu2, stefano.stabellini, George Dunlap,
	ian.jackson, tim, xen-devel, Jan Beulich, Yang Z Zhang,
	Tiejun Chen

On Mon, Jan 12, 2015 at 12:12:50PM +0000, Ian Campbell wrote:
> On Fri, 2015-01-09 at 15:27 -0500, Konrad Rzeszutek Wilk wrote:
> > On Thu, Jan 08, 2015 at 06:02:04PM +0000, George Dunlap wrote:
> > > On Thu, Jan 8, 2015 at 4:10 PM, Jan Beulich <JBeulich@suse.com> wrote:
> > > >>>> On 08.01.15 at 16:59, <dunlapg@umich.edu> wrote:
> > > >> On Thu, Jan 8, 2015 at 1:54 PM, Jan Beulich <JBeulich@suse.com> wrote:
> > > >>>> the 1st invocation of this interface will save all reported reserved
> > > >>>> regions under domain structure, and later invocation (e.g. from
> > > >>>> hvmloader) gets saved content.
> > > >>>
> > > >>> Why would the reserved regions need attaching to the domain
> > > >>> structure? The combination of (to be) assigned devices and
> > > >>> global RMRR list always allow reproducing the intended set of
> > > >>> regions without any extra storage.
> > > >>
> > > >> So when you say "(to be) assigned devices", you mean any device which
> > > >> is currently assigned, *or may be assigned at some point in the
> > > >> future*?
> > > >
> > > > Yes.
> > > >
> > > >> Do you think the extra storage for "this VM might possibly be assigned
> > > >> this device at some point" wouldn't really be that much bigger than
> > > >> "this VM might possibly map this RMRR at some point in the future"?
> > > >
> > > > Since listing devices without RMRR association would be pointless,
> > > > I think a list of devices would require less storage. But see below.
> > > >
> > > >> It seems a lot cleaner to me to have the toolstack tell Xen what
> > > >> ranges are reserved for RMRR per VM, and then have Xen check again
> > > >> when assigning a device to make sure that the RMRRs have already been
> > > >> reserved.
> > > >
> > > > With an extra level of what can be got wrong by the admin.
> > > > However, I now realize that doing it this way would allow
> > > > specifying regions not associated with any device on the host
> > > > the guest boots on, but associated with one on a host the guest
> > > > may later migrate to.
> > > 
> > > I did say the toolstack, not the admin. :-)
> > > 
> > > At the xl level, I envisioned a single boolean that would say, "Make
> > > my memory layout resemble the host system" -- so the MMIO hole would
> > > be the same size, and all the RMRRs would be reserved.
> > 
> > Like the e820_host=1 ? :-)
> 
> I'd been thinking about that all the way down this thread ;-) It seems
> like a fairly reasonable approach, and the interfaces (e.g. get host
> memory e820) are mostly already there. But maybe there are HVM specific
> reasons why its not...

Originally the hypercall (when e820_host was added) could only run under PV
guests (Jan's huge split between the pv and hvm struct domain). That changed
with Mukesh's PVH patches, which make the e820 part of the struct domain so
both PV and HVM can use it. In regards to using it under HVM (hvmloader)
there are no issues - except that, of course, if you mirror the host E820,
all the hard-coded ACPI table locations, etc. have to become dynamic.
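
As a rough illustration (error handling elided; this leans on the libxc
wrappers xc_get_machine_memory_map() / xc_domain_set_memory_map() that the
e820_host code already uses; struct e820entry comes from libxc's xc_e820.h):

	#include <xenctrl.h>

	static int mirror_host_e820(xc_interface *xch, uint32_t domid)
	{
	    struct e820entry map[128];   /* E820MAX-sized scratch buffer */
	    int nr = xc_get_machine_memory_map(xch, map, 128);

	    if ( nr < 0 )
	        return nr;
	    /* RAM entries would still need clamping to the guest's actual
	     * allocation; reserved (e.g. RMRR) entries are kept 1:1. */
	    return xc_domain_set_memory_map(xch, domid, map, nr);
	}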
> 
> Ian.
> 
> 

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: (v2) Design proposal for RMRR fix
  2015-01-14  8:13                                       ` Tian, Kevin
  2015-01-14  9:02                                         ` Jan Beulich
@ 2015-01-14 20:42                                         ` Konrad Rzeszutek Wilk
  2015-01-15  8:09                                           ` Tian, Kevin
  2015-01-15  8:43                                           ` Jan Beulich
  1 sibling, 2 replies; 139+ messages in thread
From: Konrad Rzeszutek Wilk @ 2015-01-14 20:42 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: wei.liu2, ian.campbell, stefano.stabellini, George Dunlap,
	ian.jackson, tim, xen-devel, Jan Beulich, Zhang, Yang Z, Chen,
	Tiejun

On Wed, Jan 14, 2015 at 08:13:14AM +0000, Tian, Kevin wrote:
> > From: Konrad Rzeszutek Wilk [mailto:konrad.wilk@oracle.com]
> > Sent: Wednesday, January 14, 2015 12:46 AM
> > 
> > 
> > Perhaps an easier way of this is to use the existing
> > mechanism we have - that is the XENMEM_memory_map (which
> > BTW e820_host uses). If all of this is done in the libxl (which
> > already does this based on the host E820, though it can
> > be modified to query the hypervisor for other 'reserved
> > regions') and hvmloader is modified to use XENMEM_memory_map
> > and base its E820 on that (and also QEMU-xen), then we solve
> > this problem and also the http://bugs.xenproject.org/xen/bug/28?
> > 
> 
> I'm not familiar with that option, but a quick search suggests
> it's only for PV guests?

It was originally for PV, but the hypercall can be executed under
HVM now too (thanks to the PVH patches).
> 
> and please note XENMEM_memory_map only includes RAM entries
> (and looks to be only for pv), while following the above intention what we
> really want is a real e820_host w/ all entries filled.

It includes whatever we want to put in there. It can have
RAM, RSV, etc. That is what e820_host does for PV guests - it fills
it out with an E820 that looks like the real thing.
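
For reference, a PV guest opts into that behavior today with a single
config line:

	e820_host=1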

> 
> Thanks
> Kevin

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: (v2) Design proposal for RMRR fix
  2015-01-14 20:42                                         ` Konrad Rzeszutek Wilk
@ 2015-01-15  8:09                                           ` Tian, Kevin
  2015-01-16 17:17                                             ` Konrad Rzeszutek Wilk
  2015-01-15  8:43                                           ` Jan Beulich
  1 sibling, 1 reply; 139+ messages in thread
From: Tian, Kevin @ 2015-01-15  8:09 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: wei.liu2, ian.campbell, stefano.stabellini, George Dunlap,
	ian.jackson, tim, xen-devel, Jan Beulich, Zhang, Yang Z, Chen,
	Tiejun

> From: Konrad Rzeszutek Wilk [mailto:konrad.wilk@oracle.com]
> Sent: Thursday, January 15, 2015 4:43 AM
> 
> On Wed, Jan 14, 2015 at 08:13:14AM +0000, Tian, Kevin wrote:
> > > From: Konrad Rzeszutek Wilk [mailto:konrad.wilk@oracle.com]
> > > Sent: Wednesday, January 14, 2015 12:46 AM
> > >
> > >
> > > Perhaps an easier way of this is to use the existing
> > > mechanism we have - that is the XENMEM_memory_map (which
> > > BTW e820_host uses). If all of this is done in the libxl (which
> > > already does this based on the host E820, though it can
> > > be modified to query the hypervisor for other 'reserved
> > > regions') and hvmloader is modified to use XENMEM_memory_map
> > > and base its E820 on that (and also QEMU-xen), then we solve
> > > this problem and also the http://bugs.xenproject.org/xen/bug/28?
> > >
> >
> > I'm not familiar with that option, but a quick search suggests
> > it's only for PV guests?
> 
> It was originally for PV, but the hypercall can be executed under
> HVM now too (thanks to the PVH patches).
> >
> > and please note XENMEM_memory_map only includes RAM entries
> > (and looks to be only for pv), while following the above intention what we
> > really want is a real e820_host w/ all entries filled.
> 
> It includes whatever we want to put in there. It can have
> RAM, RSV, etc. That is what e820_host does for PV guests - it fills
> it out with an E820 that looks like the real thing.
> 

Thanks for helping me understand the status. So if we want to use this
interface, the major work would be in the caller (e.g. hvmloader)
to honor that layout.

Kevin

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: (v2) Design proposal for RMRR fix
  2015-01-14 12:23                                                 ` George Dunlap
@ 2015-01-15  8:12                                                   ` Tian, Kevin
  0 siblings, 0 replies; 139+ messages in thread
From: Tian, Kevin @ 2015-01-15  8:12 UTC (permalink / raw)
  To: George Dunlap, Ian Campbell
  Cc: wei.liu2, stefano.stabellini, tim, ian.jackson, xen-devel,
	Jan Beulich, Zhang, Yang Z, Chen, Tiejun

> From: George Dunlap [mailto:george.dunlap@eu.citrix.com]
> Sent: Wednesday, January 14, 2015 8:23 PM
> 
> On 01/14/2015 12:14 PM, Ian Campbell wrote:
> > On Wed, 2015-01-14 at 06:52 +0000, Tian, Kevin wrote:
> >>> From: Jan Beulich [mailto:JBeulich@suse.com]
> >>> Sent: Wednesday, January 14, 2015 12:06 AM
> >>>
> >>>>>> On 13.01.15 at 17:00, <george.dunlap@eu.citrix.com> wrote:
> >>>> Another option I was thinking about: Before assigning a device to a
> >>>> guest, you have to unplug the device and assign it to pci-back (e.g.,
> >>>> with xl pci-assignable-add).  In addition to something like rmrr=host,
> >>>> we could add rmrr=assignable, which would add all of the RMRRs of all
> >>>> devices currently listed as "assignable".  The idea would then be that
> >>>> you first make all your devices assignable, then just start your guests,
> >>>> and everything you've made assignable will be able to be assigned.
> >>>
> >>> Nice idea indeed, but I'm not sure about its practicability: It may
> >>> not be desirable to make all devices eventually to be handed to a
> >>> guest prior to starting any of the guests it may get handed to. In
> >>> particular there may be reasons why the host needs the device
> >>> while (or until after) creating the guests.
> >>>
> >>
> >> and I'm not sure whether there's enough knowledge to judge whether
> >> a device is assignable since potential conflicts may be detected only
> >> when the guest is launched.
> >
> > I don't think George was intending to imply otherwise, assignable here
> > just means "bound to xen-pciback", there may be other reasons why the
> > device cannot be assigned in practice when you come to actually use it,
> > i.e. RMRR conflicts which may only be discovered when a guest is started
> > would be one such practical reason.
> 
> Yes -- xl has a concept called "pci-assignable".  Before you can add a
> device to a guest, you have to call "xl pci-assignable-add [device
> spec]".  You can also run "xl pci-assignable-list" to see which devices
> are currently assignable.
> 
> Normally this is true even for statically-assigned devices: If you add
> pci= [ "$bdi" ] to a config file, and $bdi hasn't been made assignable,
> then the pci-attach in domain creation will fail and the domain will be
> destroyed.  You can make the domain builder do this automatically with
> the "seize=1" parameter; i.e., pci = [ "$bdf,seize=1" ].
> 
> My suggestion was that in addition to specifying the particular ranges,
> and specifying rmrr=host, we could also specify "rmrr=assignable", which
> would cause the domain builder to internally run
> libxl_pci_assignable_list() and find the RMRRs for all devices on the list.
> 
> But as Ian says, that's a "nice to have", not a requirement.
> 

I see. Thanks.
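
So for a concrete picture of that flow (the device address is just an
example):

	xl pci-assignable-add 00:02.0   # bind the device to pciback
	xl pci-assignable-list          # what "rmrr=assignable" would cover
	xl create guest.cfg             # with pci = [ '00:02.0' ] inside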

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: (v2) Design proposal for RMRR fix
  2015-01-14 18:22                                                   ` George Dunlap
@ 2015-01-15  8:18                                                     ` Tian, Kevin
  0 siblings, 0 replies; 139+ messages in thread
From: Tian, Kevin @ 2015-01-15  8:18 UTC (permalink / raw)
  To: George Dunlap, Jan Beulich
  Cc: wei.liu2, ian.campbell, stefano.stabellini, tim, ian.jackson,
	xen-devel, Zhang, Yang Z, Chen, Tiejun

> From: George Dunlap [mailto:george.dunlap@eu.citrix.com]
> Sent: Thursday, January 15, 2015 2:22 AM
> 
> On 01/14/2015 02:42 PM, Jan Beulich wrote:
> >>>> On 14.01.15 at 13:29, <george.dunlap@eu.citrix.com> wrote:
> >> On 01/14/2015 08:06 AM, Tian, Kevin wrote:
> >>> We discussed earlier that there are two reasons why some conflicts may
> >>> not be avoidable:
> >>> 	- RMRRs conflicting with the guest BIOS in the <1MB area, as an
> >>> example of hard conflicts
> >>> 	- RMRRs sitting low enough in lowmem that avoiding them would
> >>> either break lowmem or shrink it far enough to impact the guest
> >>> (just an option being discussed)
> >>
> >> So here you're assuming that we're going to keep the lowmem / mmio hole
> >> / himem thing.  Is that necessary?  I was assuming that if we have
> >> arbitrary RMRRs, that we would just have to accept that we'd need to be
> >> able to punch an arbitrary number of holes in the p2m space.
> >
> > On the basis that the host would have placed the RMRRs in its MMIO
> > hole, I think I agree with Kevin that we should stick with
> > the simpler lowmem / mmio-hole / highmem model if possible. If we
> > really find this too limiting, switching to the more fine grained model
> > later on will still be possible.
> 
> OK, sounds good.
> 
> One detail to work out in that case then is if / when we want to error
> out or warn the user that the mmio hole is "too big" (or the RMRR is
> "too low").
> 
> (I may think about it and post some thoughts tomorrow.)
> 

yes, this is a balance between real observation and ideal possibilities. :-)
Hardcoding a value (e.g. 2G) is simple but not flexible. I'm wondering
whether there's a simple method to decide a reasonable boundary
based on the host e820, or whether we should even make this boundary a
configurable option if we are going in the direction of more fine-grained
control for the user anyway?

Thanks
Kevin

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: (v2) Design proposal for RMRR fix
  2015-01-14 18:29                                                             ` George Dunlap
@ 2015-01-15  8:37                                                               ` Jan Beulich
  2015-01-15  9:36                                                                 ` Tian, Kevin
  0 siblings, 1 reply; 139+ messages in thread
From: Jan Beulich @ 2015-01-15  8:37 UTC (permalink / raw)
  To: George Dunlap
  Cc: Kevin Tian, wei.liu2, ian.campbell, stefano.stabellini, tim,
	ian.jackson, xen-devel, YangZ Zhang, TiejunChen

>>> On 14.01.15 at 19:29, <george.dunlap@eu.citrix.com> wrote:
> Just to be clear -- what we're talking about here is that at the
> do_domain_create() level (called by libxl_domain_create_new()), it will
> take a list of pci devices, and the rmrr list above (including "host"
> and individual ranges), and generate a list of RMRRs to pass to the
> lower layer.  The lower layer will simply see the range, and a "force /
> no force" flag, and behave appropriately.  The determination of which
> RMRRs to force will be done at the domain_create level.
> 
> Is that about right?

That's certainly a sensible model.

Jan

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: (v2) Design proposal for RMRR fix
  2015-01-14 15:07                                                 ` Jan Beulich
  2015-01-14 15:18                                                   ` Ian Campbell
@ 2015-01-15  8:40                                                   ` Tian, Kevin
  1 sibling, 0 replies; 139+ messages in thread
From: Tian, Kevin @ 2015-01-15  8:40 UTC (permalink / raw)
  To: Jan Beulich, Ian Campbell
  Cc: wei.liu2, stefano.stabellini, George Dunlap, tim, ian.jackson,
	xen-devel, Zhang, Yang Z, Chen, Tiejun

> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Wednesday, January 14, 2015 11:08 PM
> 
> >>> On 14.01.15 at 13:17, <Ian.Campbell@citrix.com> wrote:
> > On Wed, 2015-01-14 at 08:06 +0000, Tian, Kevin wrote:
> >> - RMRRs conflicting with guest BIOS in <1MB area, as an example of
> >> hard conflicts
> >
> > OOI what is the (estimated) probability of such an RMRR existing which
> > doesn't already conflict with the real host BIOS?
> 
> Surely the host BIOS will know to place the RMRRs outside its BIOS
> image.
> 
> > Host BIOSes are generally large compared to the guest BIOS, but with the
> > amount of decompression and relocation etc they do I don't know how much
> > of them generally remains in the <1MB region.
> 
> Recall the example: (host) RMRR naming E0000-EFFFF, which
> overlaps with the init-time guest BIOS image, but doesn't overlap
> with its resident part (as long as that doesn't exceed 64k in size).
> 

such an RMRR could be in the resident part of the host BIOS. Here is one
example of how such a reserved region is used to support legacy keyboard
emulation for a USB keyboard: an SMI handler is triggered when an 8042
controller driver accesses a legacy keyboard resource (e.g. ports 60h/64h),
and the SMI handler then accesses the reserved region, which holds
meaningful data from the USB controller. The host BIOS can allocate such a
page anywhere it deems appropriate: in E0000-EFFFF, in F0000-FFFFF, or in
higher RAM. So strictly speaking it could be a hard conflict with the guest
BIOS, though if we know all the information some conflicts may be avoided,
as Jan mentioned.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: (v2) Design proposal for RMRR fix
  2015-01-14 20:42                                         ` Konrad Rzeszutek Wilk
  2015-01-15  8:09                                           ` Tian, Kevin
@ 2015-01-15  8:43                                           ` Jan Beulich
  1 sibling, 0 replies; 139+ messages in thread
From: Jan Beulich @ 2015-01-15  8:43 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Kevin Tian, wei.liu2, ian.campbell, stefano.stabellini,
	George Dunlap, tim, ian.jackson, xen-devel, Yang Z Zhang,
	Tiejun Chen

>>> On 14.01.15 at 21:42, <konrad.wilk@oracle.com> wrote:
> On Wed, Jan 14, 2015 at 08:13:14AM +0000, Tian, Kevin wrote:
>> > From: Konrad Rzeszutek Wilk [mailto:konrad.wilk@oracle.com]
>> > Sent: Wednesday, January 14, 2015 12:46 AM
>> > 
>> > 
>> > Perhaps an easier way of this is to use the existing
>> > mechanism we have - that is the XENMEM_memory_map (which
>> > BTW e820_host uses). If all of this is done in the libxl (which
>> > already does this based on the host E820, though it can
>> > be modified to query the hypervisor for other 'reserved
>> > regions') and hvmloader is modified to use XENMEM_memory_map
>> > and base its E820 on that (and also QEMU-xen), then we solve
>> > this problem and also the http://bugs.xenproject.org/xen/bug/28? 
>> > 
>> 
>> I'm not familiar with that option, but a quick search suggests
>> it's only for PV guests?
> 
> It was originally for PV, but the hypercall can be executed under
> HVM now too (thanks to the PVH patches).

Except that XENMEM_set_memory_map refuses to work on HVM
guests (but that clause is easily dropped).

Jan

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: (v2) Design proposal for RMRR fix
  2015-01-15  8:37                                                               ` Jan Beulich
@ 2015-01-15  9:36                                                                 ` Tian, Kevin
  2015-01-15 10:06                                                                   ` Jan Beulich
  2015-01-15 11:45                                                                   ` George Dunlap
  0 siblings, 2 replies; 139+ messages in thread
From: Tian, Kevin @ 2015-01-15  9:36 UTC (permalink / raw)
  To: Jan Beulich, George Dunlap
  Cc: wei.liu2, ian.campbell, stefano.stabellini, tim, ian.jackson,
	xen-devel, Zhang, Yang Z, Chen, Tiejun

> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Thursday, January 15, 2015 4:38 PM
> 
> >>> On 14.01.15 at 19:29, <george.dunlap@eu.citrix.com> wrote:
> > Just to be clear -- what we're talking about here is that at the
> > do_domain_create() level (called by libxl_domain_create_new()), it will
> > take a list of pci devices, and the rmrr list above (including "host"
> > and individual ranges), and generate a list of RMRRs to pass to the
> > lower layer.  The lower layer will simply see the range, and a "force /
> > no force" flag, and behave appropriately.  The determination of which
> > RMRRs to force will be done at the domain_create level.
> >
> > Is that about right?
> 
> That's certainly a sensible model.
> 

It's really a mess for my Outlook to sort multiple threads under the same
topic... so I'll reply to this one after reading through all the previous good
discussions.

First, sorry that I used '0/1' as a bad example for the user options, and thanks
for your suggestions on the right way to define them.

I also agree with the above model. Policy is all decided at domain_create,
while the lower layer only reacts to the specified regions, each carrying an
individual policy of 'force' or 'not force'.

We'll need to make that skeleton ready first. Then, regarding the config
interface, how about simplifying things by only considering
statically-assigned devices and hotplug for now (leaving migration to the
future based on necessity; extending it later doesn't change the low-level
skeleton)?

The pci option can be extended as:
	pci = [ 'bdf, rmrr_check=force/try' ]
domain_create queries the Xen hypervisor about the RMRRs associated with the
assigned devices, and then marks each region with the user-specified override.

The user can also specify an rmrr_host option as:
	rmrr = [ 'host, check=force/try' ]
When rmrr_host is specified, domain_create queries all RMRRs reported
on this platform, and marks the per-region policy accordingly.

A per-device override is always favored over a conflicting setting in rmrr_host.
 
The composed reserved region list is then passed to the domain builder,
which tries to detect and avoid conflicts when populating guest RAM.
To avoid breaking the lowmem/highmem layout, we can define a
lowmem_guard, so that if making a hole for a region would push lowmem_top
below lowmem_guard we treat that region as a conflict. We may
either just hardcode the value, like 2G (or another reasonable value you
have in mind), or allow the user to configure it, e.g.:
	rmrr = [ 'host, check=force/try', 'lowmem_boundary=2G' ]
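
As an illustration only (the helper name and its exact semantics are
placeholders, not settled code), the domain builder check could be as
simple as:

	#include <stdint.h>

	/* Sketch: would punching a hole for a reserved region starting at
	 * 'start' force lowmem_top below the configured boundary? */
	static int rmrr_conflicts(uint64_t start, uint64_t lowmem_top,
	                          uint64_t lowmem_boundary /* e.g. 2G */)
	{
	    if ( start >= lowmem_top )
	        return 0;   /* region sits above lowmem already */
	    /* Avoiding the region means lowering lowmem_top down to 'start'. */
	    return start < lowmem_boundary;
	}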

After the domain builder runs, libxl will request the actual device
assignment from the Xen hypervisor. At that point, the current assignment
hypercall needs to be extended to include the override value, so Xen will
handle conflicts accordingly when setting up the identity map.

The last step is in hvmloader, which is passed all the necessary RMRR
information and then handles potential conflicts accordingly.
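
On the hvmloader side part of this boils down to adding reserved entries to
the guest E820 it constructs; as a sketch (the rsv[]/nr_rsv/nr variables are
placeholders for however the list ends up being passed):

	/* Mark each reserved region in the guest E820 built by hvmloader. */
	for ( i = 0; i < nr_rsv; i++ )
	{
	    e820[nr].addr = rsv[i].start;
	    e820[nr].size = rsv[i].end - rsv[i].start + 1;
	    e820[nr].type = E820_RESERVED;
	    nr++;
	}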

In the future, if we deem it necessary to allow the user to specify arbitrary
regions for migration usage, this can easily be extended, and we can
argue about whether to introduce the same override there.

If the above high-level flow can be agreed on, then we can move forward to
discuss the next level of detail, e.g. how to pass the RMRR list across the
different components. :-)

Thanks
Kevin

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: (v2) Design proposal for RMRR fix
  2015-01-14 18:14                                                         ` George Dunlap
@ 2015-01-15 10:05                                                           ` Ian Campbell
  2015-01-15 11:58                                                             ` George Dunlap
  0 siblings, 1 reply; 139+ messages in thread
From: Ian Campbell @ 2015-01-15 10:05 UTC (permalink / raw)
  To: George Dunlap
  Cc: Kevin Tian, wei.liu2, stefano.stabellini, ian.jackson, tim,
	xen-devel, Jan Beulich, Yang Z Zhang, Tiejun Chen

On Wed, 2015-01-14 at 18:14 +0000, George Dunlap wrote:
> On 01/14/2015 03:43 PM, Ian Campbell wrote:
> > On Wed, 2015-01-14 at 15:39 +0000, George Dunlap wrote:
> >> On 01/14/2015 03:18 PM, Ian Campbell wrote:
> >>>>> Host BIOSes are generally large compared to the guest BIOS, but with the
> >>>>> amount of decompression and relocation etc they do I don't know how much
> >>>>> of them generally remains in the <1MB region.
> >>>>
> >>>> Recall the example: (host) RMRR naming E0000-EFFFF, which
> >>>> overlaps with the init-time guest BIOS image, but doesn't overlap
> >>>> with its resident part (as long as that doesn't exceed 64k in size).
> >>>
> >>> Right, that means second precondition above doesn't really hold, which
> >>> is a shame.
> >>>
> >>> In principal it might be possible to have some of the RMRR setup and
> >>> conflict detection stuff in SeaBIOS rather than hvmloader, and therefore
> >>> take advantage of the same init-time vs resident distinction, but I
> >>> suspect that won't lead to an overall design we are happy with, mainly
> >>> since such things are typically done by hvmloader in a Xen system.
> >>
> >> Actually, I was just thinking about this -- I'm not really sure why we
> >> do the PCI MMIO stuff in hvmloader at all.  Is there any reason, other
> >> than the fact that we need to tell Xen about updates to the physical
> >> address space?  If not, it seems like doing it in SeaBIOS would make a
> >> lot more sense, rather than having to maintain duplicate functionality
> >> in hvmloader.
> > 
> > I don't remember exactly, but I think it was because something about the
> > PCI enumeration required reflecting in the ACPI tables, which hvmloader
> > also provides. Splitting it up was tricky, that was what I initially
> > tried when adding SeaBIOS support; it turned into a rat's nest.
> 
> Blah. :-(

It *might* have been more complicated because I was also trying to keep
ROMBIOS+qemu-trad doing something sensible and worrying about code
duplication, plus the whole seabios thing was pretty new to me at the
time as well.

It probably wouldn't be a waste of time for someone to spend, say, 1/2 a
day taking another poke at it (modulo what you said below perhaps making
it a little moot).

> >> Anthony is looking into this, but if SeaBIOS inside KVM is able to
> >> notify qemu about changes to the memory map, then it seems like teaching
> >> SeaBIOS how to tell Xen about those changes (or have qemu do it) would
> >> make a lot of our problems in this area a lot simpler.
> > 
> > SeaBIOS on qemu uses the firmware cfg interface (a bit-bashed protocol
> > over a magic port) to split these responsibilities. I'm not sure of the
> > exact split but I know that not so long ago responsibility for
> > constructing the ACPI tables moved from SeaBIOS to qemu (or maybe just a
> > subset, perhaps someone else knows better).
> > 
> >> For RMRRs, presumably SeaBIOS is already set up to avoid them; so if we
> >> can just give it an e820 with the RMRRs in it, then everything will just
> >> fall out of that.
> > 
> > I suppose, my guess would be that any code which would go anywhere near
> > stuff like that is already gated on Xen because hvmloader takes care of it.
> 
> Yes, that's in fact what happens (I'm pretty sure in the MMIO placement
> code it has an "if (xen) return;" at the top); I was trying to envision
> a future where this was all rationalized and de-duplicated.
> 
> And for the record, Anthony has just looked into what happens with the
> MMIO hole on KVM, and apparently SeaBIOS is just given either a 0.5 or
> 1G hole, and if it can't fit everything, it calls panic().  (i.e.,
> there's no way for SeaBIOS to make the hole bigger on KVM either).

It's possible that solving this for both would gain some traction with
upstreams, although whether our preferred solution would be a good fit
for kvm I dunno, since we have different concepts of who is the ultimate
authority on the address space.

> So it looks like the path to a rational system is more difficult than I
> had initially hoped.

True.

Ian.

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: (v2) Design proposal for RMRR fix
  2015-01-15  9:36                                                                 ` Tian, Kevin
@ 2015-01-15 10:06                                                                   ` Jan Beulich
  2015-01-18  8:36                                                                     ` Tian, Kevin
  2015-01-15 11:45                                                                   ` George Dunlap
  1 sibling, 1 reply; 139+ messages in thread
From: Jan Beulich @ 2015-01-15 10:06 UTC (permalink / raw)
  To: Kevin Tian
  Cc: wei.liu2, ian.campbell, stefano.stabellini, George Dunlap, tim,
	ian.jackson, xen-devel, Yang Z Zhang, Tiejun Chen

>>> On 15.01.15 at 10:36, <kevin.tian@intel.com> wrote:
> We'll need to make that skeleton ready first. Then regarding to config 
> interface, how about making some simplification by only considering
> statically-assigned device and hotplug now (leaving migration to future 
> based on necessity, and extending that doesn't change low level skeleton)?
> 
> pci option can be extended as:
> 	pci = [ 'bdf, rmrr_check=force/try' ]
> domain_create queries Xen hypervisor about RMRRs associated with
> assigned devices, and then mark each region with user specified override. 
> 
> User can also specify a rmrr_host option as:
> 	rmrr = [ 'host, check=force/try' ]
> when rmrr_host is specified, domain_create queries all RMRRs reported
> in this platform, and mark per-region policy accordingly. 
> 
> per-device override is always favored if a conflicting setting in rmrr_host.
>  
> The composed reserved region list is then passed to domain builder,
> which tries to detect and avoid conflicts when populating guest RAM.
> To avoid breaking lowmem/highmem layout, we can define a 
> lowmem_guard so if making hole for a region would make lowmem_top
> below lowmem_guard we'll treat this region as a conflict. We may
> either just hardcode the value like 2G (or other reasonable value in your
> mind), or allow user to config e.g.:
> 	rmrr = [ 'host, check=force/try', 'lowmem_boundary=2G' ]

To me it looks like lowmem_boundary makes sense only when
check=try.

> after domain builder, libxl will request actual device assignment to
> Xen hypervisor. At that point, current assignment hypercall needs
> to be extended to include the override value, so Xen will handle
> conflict accordingly when setting up identity map.
> 
> last step is in hvmloader, which is passed with all the necessary RMRR
> information, and then handles potential conflicts accordingly.
> 
> In the future when we think necessary to allow user specify random
> regions for migration usage, it can be easily extended and we can
> argue whether to introduce same override.
> 
> If above high level flow can be agreed, then we can move forward to
> discuss next level detail e.g. how to pass the rmrr list cross different
> components. :-)

Apart from the minor detail mentioned I think the above is a good
representation of what we want/need.

Jan

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: (v2) Design proposal for RMRR fix
  2015-01-15  9:36                                                                 ` Tian, Kevin
  2015-01-15 10:06                                                                   ` Jan Beulich
@ 2015-01-15 11:45                                                                   ` George Dunlap
  2015-01-18  8:58                                                                     ` Tian, Kevin
  1 sibling, 1 reply; 139+ messages in thread
From: George Dunlap @ 2015-01-15 11:45 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: wei.liu2, ian.campbell, stefano.stabellini, tim, ian.jackson,
	xen-devel, Jan Beulich, Zhang, Yang Z, Chen, Tiejun

On Thu, Jan 15, 2015 at 9:36 AM, Tian, Kevin <kevin.tian@intel.com> wrote:
>> From: Jan Beulich [mailto:JBeulich@suse.com]
>> Sent: Thursday, January 15, 2015 4:38 PM
>>
>> >>> On 14.01.15 at 19:29, <george.dunlap@eu.citrix.com> wrote:
>> > Just to be clear -- what we're talking about here is that at the
>> > do_domain_create() level (called by libxl_domain_create_new()), it will
>> > take a list of pci devices, and the rmrr list above (including "host"
>> > and individual ranges), and generate a list of RMRRs to pass to the
>> > lower layer.  The lower layer will simply see the range, and a "force /
>> > no force" flag, and behave appropriately.  The determination of which
>> > RMRRs to force will be done at the domain_create level.
>> >
>> > Is that about right?
>>
>> That's certainly a sensible model.
>>
>
> It's really a mess for my outlook to sort multiple threads under same
> topic... so I'll reply to this one after reading through all previous good
> discussions.
>
> First, sorry that I used '0/1' as a bad example for user options, and thanks
> for your suggestion on a right way defining them.
>
> I also agree above model. Policy is all decided at domain_create while
> lower layer only reacts to specified regions which have individual policy
> to indicate 'force' or 'not force'.
>
> We'll need to make that skeleton ready first. Then regarding to config
> interface, how about making some simplification by only considering
> statically-assigned device and hotplug now (leaving migration to future
> based on necessity, and extending that doesn't change low level skeleton)?
>
> pci option can be extended as:
>         pci = [ 'bdf, rmrr_check=force/try' ]
> domain_create queries Xen hypervisor about RMRRs associated with
> assigned devices, and then mark each region with user specified override.
>
> User can also specify a rmrr_host option as:
>         rmrr = [ 'host, check=force/try' ]
> when rmrr_host is specified, domain_create queries all RMRRs reported
> in this platform, and mark per-region policy accordingly.
>
> per-device override is always favored if a conflicting setting in rmrr_host.
>
> The composed reserved region list is then passed to domain builder,
> which tries to detect and avoid conflicts when populating guest RAM.
> To avoid breaking lowmem/highmem layout, we can define a
> lowmem_guard so if making hole for a region would make lowmem_top
> below lowmem_guard we'll treat this region as a conflict. We may
> either just hardcode the value like 2G (or other reasonable value in your
> mind), or allow user to config e.g.:
>         rmrr = [ 'host, check=force/try', 'lowmem_boundary=2G' ]
>
> after domain builder, libxl will request actual device assignment to
> Xen hypervisor. At that point, current assignment hypercall needs
> to be extended to include the override value, so Xen will handle
> conflict accordingly when setting up identity map.
>
> last step is in hvmloader, which is passed with all the necessary RMRR
> information, and then handles potential conflicts accordingly.
>
> In the future when we think necessary to allow user specify random
> regions for migration usage, it can be easily extended and we can
> argue whether to introduce same override.
>
> If above high level flow can be agreed, then we can move forward to
> discuss next level detail e.g. how to pass the rmrr list cross different
> components. :-)

I think we're definitely ready to move on.  There are a bunch of tiny
details we could discuss, but those are  mostly minor changes that can
be tweaked when the patches are submitted.

 -George

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: (v2) Design proposal for RMRR fix
  2015-01-15 10:05                                                           ` Ian Campbell
@ 2015-01-15 11:58                                                             ` George Dunlap
  0 siblings, 0 replies; 139+ messages in thread
From: George Dunlap @ 2015-01-15 11:58 UTC (permalink / raw)
  To: Ian Campbell
  Cc: Kevin Tian, wei.liu2, stefano.stabellini, tim, ian.jackson,
	xen-devel, Jan Beulich, Yang Z Zhang, Tiejun Chen

On Thu, Jan 15, 2015 at 10:05 AM, Ian Campbell <Ian.Campbell@citrix.com> wrote:
> On Wed, 2015-01-14 at 18:14 +0000, George Dunlap wrote:
>> On 01/14/2015 03:43 PM, Ian Campbell wrote:
>> > On Wed, 2015-01-14 at 15:39 +0000, George Dunlap wrote:
>> >> Actually, I was just thinking about this -- I'm not really sure why we
>> >> do the PCI MMIO stuff in hvmloader at all.  Is there any reason, other
>> >> than the fact that we need to tell Xen about updates to the physical
>> >> address space?  If not, it seems like doing it in SeaBIOS would make a
>> >> lot more sense, rather than having to maintain duplicate functionality
>> >> in hvmloader.
>> >
>> > I don't remember exactly, but I think it was because something about the
>> > PCI enumeration required reflecting in the ACPI tables, which hvmloader
>> > also provides. Splitting it up was tricky, that was what I initially
>> > tried when adding SeaBIOS support; it turned into a rat's nest.
>>
>> Blah. :-(
>
> It *might* have been more complicated because I was also trying to keep
> ROMBIOS+qemu-trad doing something sensible and worrying about code
> duplication, plus the whole seabios thing was pretty new to me at the
> time as well.
>
> It probably wouldn't be a waste of time for someone for spend say 1/2 a
> day taking another poke at it (modulo what you said below perhaps making
> it a little moot).

Another option to "solve" xenbug #28 might actually be to just start
by following what appears to be KVM's model -- i.e., rather than
creating the minimal MMIO hole possible and making it larger (as
hvmloader does), just start with a 0.5 or 1 G hole.  Modifying SeaBIOS
to understand Xen's mmio_hole_size parameter shouldn't be *too* hard;
then we could look at adding memory relocation back in (coming up with
something that works for both qemu-kvm and xen) if/when it turns out
to be necessary.

 -George

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: (v2) Design proposal for RMRR fix
  2015-01-15  8:09                                           ` Tian, Kevin
@ 2015-01-16 17:17                                             ` Konrad Rzeszutek Wilk
  0 siblings, 0 replies; 139+ messages in thread
From: Konrad Rzeszutek Wilk @ 2015-01-16 17:17 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: wei.liu2, ian.campbell, stefano.stabellini, George Dunlap,
	ian.jackson, tim, xen-devel, Jan Beulich, Zhang, Yang Z, Chen,
	Tiejun

On Thu, Jan 15, 2015 at 08:09:34AM +0000, Tian, Kevin wrote:
> > From: Konrad Rzeszutek Wilk [mailto:konrad.wilk@oracle.com]
> > Sent: Thursday, January 15, 2015 4:43 AM
> > 
> > On Wed, Jan 14, 2015 at 08:13:14AM +0000, Tian, Kevin wrote:
> > > > From: Konrad Rzeszutek Wilk [mailto:konrad.wilk@oracle.com]
> > > > Sent: Wednesday, January 14, 2015 12:46 AM
> > > >
> > > >
> > > > Perhaps an easier way of this is to use the existing
> > > > mechanism we have - that is the XENMEM_memory_map (which
> > > > BTW e820_host uses). If all of this is done in the libxl (which
> > > > already does this based on the host E820, though it can
> > > > be modified to query the hypervisor for other 'reserved
> > > > regions') and hvmloader is modified to use XENMEM_memory_map
> > > > and base its E820 on that (and also QEMU-xen), then we solve
> > > > this problem and also the http://bugs.xenproject.org/xen/bug/28?
> > > >
> > >
> > > I'm not familiar with that option, but a quick search suggests
> > > it's only for PV guests?
> > 
> > It was originally for PV, but the hypercall can be executed under
> > HVM now too (thanks to the PVH patches).
> > >
> > > and please note XENMEM_memory_map only includes RAM entries
> > > (and looks to be only for pv), while following the above intention what we
> > > really want is a real e820_host w/ all entries filled.
> > 
> > It includes whatever we want to put in there. It can have
> > RAM, RSV, etc. That is what e820_host does for PV guests - it fills
> > it out with an E820 that looks like the real thing.
> > 
> 
> Thanks for helping me understand the status. So if we want to use this
> interface, the major work would be in the caller (e.g. hvmloader)
> to honor that layout.

Right, and it would also help in other situations (such as when
doing PCI passthrough in HVM guests with some inflexible PCIe devices
that really want a 1-1 BAR mapping).

> 
> Kevin

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: (v2) Design proposal for RMRR fix
  2015-01-15 10:06                                                                   ` Jan Beulich
@ 2015-01-18  8:36                                                                     ` Tian, Kevin
  2015-01-19  8:42                                                                       ` Jan Beulich
  0 siblings, 1 reply; 139+ messages in thread
From: Tian, Kevin @ 2015-01-18  8:36 UTC (permalink / raw)
  To: Jan Beulich
  Cc: wei.liu2, ian.campbell, stefano.stabellini, George Dunlap, tim,
	ian.jackson, xen-devel, Zhang, Yang Z, Chen, Tiejun

> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Thursday, January 15, 2015 6:06 PM
> 
> > The composed reserved region list is then passed to domain builder,
> > which tries to detect and avoid conflicts when populating guest RAM.
> > To avoid breaking lowmem/highmem layout, we can define a
> > lowmem_guard so if making hole for a region would make lowmem_top
> > below lowmem_guard we'll treat this region as a conflict. We may
> > either just hardcode the value like 2G (or other reasonable value in your
> > mind), or allow user to config e.g.:
> > 	rmrr = [ 'host, check=force/try', 'lowmem_boundary=2G' ]
> 
> To me it looks like lowmem_boundary makes sense only when
> check=try.

Yes, it only makes sense when check=try, but the setting should be global,
i.e. we don't want to have it configured per device, right? Do you have
a thought on a better option here?

> >
> > If above high level flow can be agreed, then we can move forward to
> > discuss next level detail e.g. how to pass the rmrr list cross different
> > components. :-)
> 
> Apart from the minor detail mentioned I think the above is a good
> representation of what we want/need.
> 

Thanks for your valuable input.
Kevin

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: (v2) Design proposal for RMRR fix
  2015-01-15 11:45                                                                   ` George Dunlap
@ 2015-01-18  8:58                                                                     ` Tian, Kevin
  2015-01-19  9:32                                                                       ` Jan Beulich
  2015-01-19 10:21                                                                       ` George Dunlap
  0 siblings, 2 replies; 139+ messages in thread
From: Tian, Kevin @ 2015-01-18  8:58 UTC (permalink / raw)
  To: George Dunlap
  Cc: wei.liu2, ian.campbell, stefano.stabellini, tim, ian.jackson,
	xen-devel, Jan Beulich, Zhang, Yang Z, Chen, Tiejun

> From: George Dunlap
> Sent: Thursday, January 15, 2015 7:45 PM
> 
> >
> > If above high level flow can be agreed, then we can move forward to
> > discuss next level detail e.g. how to pass the rmrr list cross different
> > components. :-)
> 
> I think we're definitely ready to move on.  There are a bunch of tiny
> details we could discuss, but those are  mostly minor changes that can
> be tweaked when the patches are submitted.
> 

Thanks for all the good discussion in the thread, and it's good that we have
consensus to move forward now.

There is still one open question to hear suggestions on, though: how we want
to pass the reserved regions to the domain builder and hvmloader (for Xen
we will extend the related assignment hypercall to include the per-device
override).

One simple solution is to extend xc_hvm_build_args and hvm_info_table
to include the specified regions, with the limitation of defining a fixed
number (possibly using E820_MAX as a reasonable bound); a rough sketch is
below.
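
For that first option, the extension might look roughly like this (the field
and type names are invented purely for illustration):

	/* Hypothetical extension -- not existing code. */
	struct xc_reserved_region {
	    uint64_t start, end;
	};

	struct xc_hvm_build_args {
	    /* ... existing fields ... */
	    struct xc_reserved_region rsv_regions[128]; /* E820_MAX-like bound */
	    uint32_t nr_rsv_regions;
	};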

Another option is to place the information in xenstore, which is more
flexible. However, the domain builder doesn't use xenstore right now (I
suppose extending it to use xenstore is not complex?).

Or is there any other approach to be evaluated?

Thanks
Kevin

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: (v2) Design proposal for RMRR fix
  2015-01-18  8:36                                                                     ` Tian, Kevin
@ 2015-01-19  8:42                                                                       ` Jan Beulich
  0 siblings, 0 replies; 139+ messages in thread
From: Jan Beulich @ 2015-01-19  8:42 UTC (permalink / raw)
  To: Kevin Tian
  Cc: wei.liu2, ian.campbell, stefano.stabellini, George Dunlap, tim,
	ian.jackson, xen-devel, Yang Z Zhang, Tiejun Chen

>>> On 18.01.15 at 09:36, <kevin.tian@intel.com> wrote:
>>  From: Jan Beulich [mailto:JBeulich@suse.com]
>> Sent: Thursday, January 15, 2015 6:06 PM
>> 
>> > The composed reserved region list is then passed to domain builder,
>> > which tries to detect and avoid conflicts when populating guest RAM.
>> > To avoid breaking lowmem/highmem layout, we can define a
>> > lowmem_guard so if making hole for a region would make lowmem_top
>> > below lowmem_guard we'll treat this region as a conflict. We may
>> > either just hardcode the value like 2G (or other reasonable value in your
>> > mind), or allow user to config e.g.:
>> > 	rmrr = [ 'host, check=force/try', 'lowmem_boundary=2G' ]
>> 
>> To me it looks like lowmem_boundary makes sense only when
>> check=try.
> 
> yes it only makes sense when check=try but the setting should be global
> i.e. we don't want to have it configured per-device, right? do you have
> a thought on a better option here?

No, I think the naming is acceptable. I really just wanted to point
out that the example line you gave wasn't fully consistent.

Jan

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: (v2) Design proposal for RMRR fix
  2015-01-18  8:58                                                                     ` Tian, Kevin
@ 2015-01-19  9:32                                                                       ` Jan Beulich
  2015-01-19 11:24                                                                         ` Tian, Kevin
  2015-01-19 10:21                                                                       ` George Dunlap
  1 sibling, 1 reply; 139+ messages in thread
From: Jan Beulich @ 2015-01-19  9:32 UTC (permalink / raw)
  To: Kevin Tian
  Cc: wei.liu2, ian.campbell, stefano.stabellini, George Dunlap, tim,
	ian.jackson, xen-devel, Yang Z Zhang, Tiejun Chen

>>> On 18.01.15 at 09:58, <kevin.tian@intel.com> wrote:
> still one open to hear suggestion though, regarding to how we want 
> to pass the reserved regions to domain builder and hvmloader (for Xen 
> we will extend related assignment hypercall to include per device override).
> 
> one simple solution is to extend xc_hvm_build_args and hvm_info_table 
> to include specified regions, with the limitation on defining a fixed 
> number (possibly use E820_MAX as a reasonable assumption)

As said (in other contexts?) a couple of times recently - I think we
should try to avoid altering struct hvm_info_table if at all possible.
It should really only be used for information that cannot be passed
by any other means between the involved components.

> another option is to place the information in xenstore which is more
> flexible. However domain builder doesn't use xenstore right now (suppose 
> extending use xenstore is not complex?)
> 
> or any other thought to be evaluated?

What's wrong with attaching them to the domain with a domctl, and
having a suitable "normal" hypercall (along the lines of
XENMEM_reserved_device_memory_map, just that this one would
need to be domain specific) for the domain (including hvmloader) to
retrieve it?
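
Roughly (all structure names below are made up just to illustrate the shape
of the interface):

	/* Hypothetical domctl payload: the toolstack attaches the composed
	 * reserved-region list to the domain. */
	struct xen_domctl_set_reserved_regions {
	    uint32_t nr_regions;
	    XEN_GUEST_HANDLE_64(xen_reserved_region_t) regions;
	};

	/* Hypothetical memory-op, in the spirit of
	 * XENMEM_reserved_device_memory_map but per-domain, so the domain
	 * (including hvmloader) can read the list back. */
	struct xen_get_reserved_regions {
	    domid_t  domid;        /* which domain's list */
	    uint32_t nr_regions;   /* in: array size; out: entries written */
	    XEN_GUEST_HANDLE(xen_reserved_region_t) regions;
	};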

Jan

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: (v2) Design proposal for RMRR fix
  2015-01-18  8:58                                                                     ` Tian, Kevin
  2015-01-19  9:32                                                                       ` Jan Beulich
@ 2015-01-19 10:21                                                                       ` George Dunlap
  2015-01-19 11:08                                                                         ` Ian Campbell
  1 sibling, 1 reply; 139+ messages in thread
From: George Dunlap @ 2015-01-19 10:21 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: wei.liu2, ian.campbell, stefano.stabellini, tim, ian.jackson,
	xen-devel, Jan Beulich, Zhang, Yang Z, Chen, Tiejun

On 01/18/2015 08:58 AM, Tian, Kevin wrote:
>> From: George Dunlap
>> Sent: Thursday, January 15, 2015 7:45 PM
>>
>>>
>>> If above high level flow can be agreed, then we can move forward to
>>> discuss next level detail e.g. how to pass the rmrr list cross different
>>> components. :-)
>>
>> I think we're definitely ready to move on.  There are a bunch of tiny
>> details we could discuss, but those are  mostly minor changes that can
>> be tweaked when the patches are submitted.
>>
> 
> Thanks for all the good discussions in the thread, and good we have 
> consensus to move forward now.
> 
> still one open to hear suggestion though, regarding to how we want 
> to pass the reserved regions to domain builder and hvmloader (for Xen 
> we will extend related assignment hypercall to include per device override).
> 
> one simple solution is to extend xc_hvm_build_args and hvm_info_table 
> to include specified regions, with the limitation on defining a fixed 
> number (possibly use E820_MAX as a reasonable assumption)
> 
> another option is to place the information in xenstore which is more
> flexible. However domain builder doesn't use xenstore right now (suppose 
> extending use xenstore is not complex?)

I *think* the last time I asked such a question, the answer was that
allowing the domain builder to access xenstore would introduce a
cyclical dependency.  But I can't remember the details now (and I may
have it wrong).

 -George

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: (v2) Design proposal for RMRR fix
  2015-01-19 10:21                                                                       ` George Dunlap
@ 2015-01-19 11:08                                                                         ` Ian Campbell
  0 siblings, 0 replies; 139+ messages in thread
From: Ian Campbell @ 2015-01-19 11:08 UTC (permalink / raw)
  To: George Dunlap
  Cc: Tian, Kevin, wei.liu2, stefano.stabellini, ian.jackson, tim,
	xen-devel, Jan Beulich, Zhang, Yang Z, Chen, Tiejun

On Mon, 2015-01-19 at 10:21 +0000, George Dunlap wrote:
> On 01/18/2015 08:58 AM, Tian, Kevin wrote:
> >> From: George Dunlap
> >> Sent: Thursday, January 15, 2015 7:45 PM
> >>
> >>>
> >>> If above high level flow can be agreed, then we can move forward to
> >>> discuss next level detail e.g. how to pass the rmrr list cross different
> >>> components. :-)
> >>
> >> I think we're definitely ready to move on.  There are a bunch of tiny
> >> details we could discuss, but those are  mostly minor changes that can
> >> be tweaked when the patches are submitted.
> >>
> > 
> > Thanks for all the good discussions in the thread, and good we have 
> > consensus to move forward now.
> > 
> > still one open to hear suggestion though, regarding to how we want 
> > to pass the reserved regions to domain builder and hvmloader (for Xen 
> > we will extend related assignment hypercall to include per device override).
> > 
> > one simple solution is to extend xc_hvm_build_args and hvm_info_table 
> > to include specified regions, with the limitation on defining a fixed 
> > number (possibly use E820_MAX as a reasonable assumption)
> > 
> > another option is to place the information in xenstore which is more
> > flexible. However domain builder doesn't use xenstore right now (suppose 
> > extending use xenstore is not complex?)
> 
> I *think* the last time I asked such a question, the answer was that
> allowing the domain builder to access xenstore would introduce a
> cyclical dependency.  But I can't remember the details now (and I may
> have it wrong).

IIRC libxenstore depends on libxenctrl, so having libxenctrl depend on
libxenstore would be problematic. (There has been talk recently of
refactoring libxenctrl into multiple more single-minded libraries, which
might help with this sort of thing).

I don't think xenstore is particularly the right answer here, though;
either hvm_info (or a table referenced from it) or, since Jan doesn't
like that approach, a hypercall as he suggests would work.

Ian.

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: (v2) Design proposal for RMRR fix
  2015-01-19  9:32                                                                       ` Jan Beulich
@ 2015-01-19 11:24                                                                         ` Tian, Kevin
  2015-01-19 11:33                                                                           ` Tim Deegan
  0 siblings, 1 reply; 139+ messages in thread
From: Tian, Kevin @ 2015-01-19 11:24 UTC (permalink / raw)
  To: Jan Beulich
  Cc: wei.liu2, ian.campbell, stefano.stabellini, George Dunlap, tim,
	ian.jackson, xen-devel, Zhang, Yang Z, Chen, Tiejun

> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Monday, January 19, 2015 5:33 PM
> 
> >>> On 18.01.15 at 09:58, <kevin.tian@intel.com> wrote:
> > still one open to hear suggestion though, regarding to how we want
> > to pass the reserved regions to domain builder and hvmloader (for Xen
> > we will extend related assignment hypercall to include per device override).
> >
> > one simple solution is to extend xc_hvm_build_args and hvm_info_table
> > to include specified regions, with the limitation on defining a fixed
> > number (possibly use E820_MAX as a reasonable assumption)
> 
> As said (in other contexts?) a couple of times recently - I think we
> should try to avoid altering struct hvm_info_table if at all possible.
> It should really only be used for information that cannot be passed
> by any other means between the involved components.

Yes, I should have added this note when describing that option. I just wanted
to put what was discussed before on the table.

> 
> > another option is to place the information in xenstore which is more
> > flexible. However domain builder doesn't use xenstore right now (suppose
> > extending use xenstore is not complex?)
> >
> > or any other thought to be evaluated?
> 
> What's wrong with attaching them to the domain with a domctl, and
> having a suitable "normal" hypercall (along the lines of
> XENMEM_reserved_device_memory_map, just that this one would
> need to be domain specific) for the domain (including hvmloader) to
> retrieve it?
> 

definitely it's not wrong. I meant to include this option, but it looks like
it got missed when sending the mail. Using a hypercall is similar to using
xenstore (it's just a matter of where to centrally save the information).
Originally I thought people might consider a user-space option more flexible,
but since using xenstore has the dependency problem, let's go with the
hypercall option.

so in summary two new hypercalls will be added: a general one to allow
libxl to query per-device RMRRs, and a domain-specific one to allow libxl
to specify reserved regions.

Thanks
Kevin
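
For concreteness, here is a rough sketch of the shape such a pair of
interfaces could take. This is purely illustrative: neither function
exists in libxc, and the names, structure layout and parameters are
made up for the example.

----
/* Sketch only: hypothetical libxc interfaces illustrating the two
 * proposed hypercalls; nothing below exists in the real tree. */
#include <stdint.h>
#include <xenctrl.h>

struct xen_reserved_region {      /* hypothetical layout */
    uint64_t start_pfn;           /* first 4KB frame of the region */
    uint64_t nr_pages;            /* region length in 4KB pages */
    uint32_t flags;               /* per-region policy, e.g. force vs. try */
};

/* General query: which reserved regions does one assignable device
 * need (along the lines of XENMEM_reserved_device_memory_map)? */
int xc_get_device_reserved_regions(xc_interface *xch, uint32_t sbdf,
                                   struct xen_reserved_region *regions,
                                   uint32_t *nr_regions);

/* Domain-specific: attach the regions libxl chose for this guest, so
 * that hvmloader can later retrieve them with a matching query. */
int xc_domain_set_reserved_regions(xc_interface *xch, uint32_t domid,
                                   const struct xen_reserved_region *regions,
                                   uint32_t nr_regions);
----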


* Re: (v2) Design proposal for RMRR fix
  2015-01-19 11:24                                                                         ` Tian, Kevin
@ 2015-01-19 11:33                                                                           ` Tim Deegan
  2015-01-19 11:41                                                                             ` Jan Beulich
  0 siblings, 1 reply; 139+ messages in thread
From: Tim Deegan @ 2015-01-19 11:33 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: wei.liu2, ian.campbell, stefano.stabellini, George Dunlap,
	ian.jackson, xen-devel, Jan Beulich, Zhang, Yang Z, Chen, Tiejun

At 11:24 +0000 on 19 Jan (1421663081), Tian, Kevin wrote:
> > From: Jan Beulich [mailto:JBeulich@suse.com]
> > Sent: Monday, January 19, 2015 5:33 PM
> > 
> > >>> On 18.01.15 at 09:58, <kevin.tian@intel.com> wrote:
> > > still one open to hear suggestion though, regarding to how we want
> > > to pass the reserved regions to domain builder and hvmloader (for Xen
> > > we will extend related assignment hypercall to include per device override).
> > >
> > > one simple solution is to extend xc_hvm_build_args and hvm_info_table
> > > to include specified regions, with the limitation on defining a fixed
> > > number (possibly use E820_MAX as a reasonable assumption)
> > 
> > As said (in other contexts?) a couple of times recently - I think we
> > should try to avoid altering struct hvm_info_table if at all possible.
> > It should really only be used for information that cannot be passed
> > by any other means between the involved components.
> 
> yes, I should add this note when describing that option. Just want to
> put what's discussed before in the table.
> 
> > 
> > > another option is to place the information in xenstore which is more
> > > flexible. However domain builder doesn't use xenstore right now (suppose
> > > extending use xenstore is not complex?)
> > >
> > > or any other thought to be evaluated?
> > 
> > What's wrong with attaching them to the domain with a domctl, and
> > having a suitable "normal" hypercall (along the lines of
> > XENMEM_reserved_device_memory_map, just that this one would
> > need to be domain specific) for the domain (including hvmloader) to
> > retrieve it?

FWIW, I don't like adding hypervisor state (and even more so
hypervisor mechanism like a new hypercall) for things that the
hypervisor doesn't need to know about.  Since the e820 is only shared
between the tools and the guest, I'd prefer it to go in either
the hvm_info_table or xenstore.

Cheers,

Tim.


* Re: (v2) Design proposal for RMRR fix
  2015-01-19 11:33                                                                           ` Tim Deegan
@ 2015-01-19 11:41                                                                             ` Jan Beulich
  2015-01-19 12:23                                                                               ` Tim Deegan
  0 siblings, 1 reply; 139+ messages in thread
From: Jan Beulich @ 2015-01-19 11:41 UTC (permalink / raw)
  To: Tim Deegan
  Cc: Kevin Tian, wei.liu2, ian.campbell, stefano.stabellini,
	George Dunlap, ian.jackson, xen-devel, Yang Z Zhang, Tiejun Chen

>>> On 19.01.15 at 12:33, <tim@xen.org> wrote:
> FWIW, I don't like adding hypervisor state (and even more so
> hypervisor mechanism like a new hypercall) for things that the
> hypervisor doesn't need to know about.  Since the e820 is only shared
> between the tools and the guest, I'd prefer it to go in either
> the hvm_info_table or xenstore.

But we have the guest E820 in the hypervisor already, which we
also can't drop (as XENMEM_memory_map is a generally accessible
hypercall).

Jan


* Re: (v2) Design proposal for RMRR fix
  2015-01-19 11:41                                                                             ` Jan Beulich
@ 2015-01-19 12:23                                                                               ` Tim Deegan
  2015-01-19 13:00                                                                                 ` George Dunlap
  2015-01-19 13:52                                                                                 ` Jan Beulich
  0 siblings, 2 replies; 139+ messages in thread
From: Tim Deegan @ 2015-01-19 12:23 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Kevin Tian, wei.liu2, ian.campbell, stefano.stabellini,
	George Dunlap, ian.jackson, xen-devel, Yang Z Zhang, Tiejun Chen

At 11:41 +0000 on 19 Jan (1421664109), Jan Beulich wrote:
> >>> On 19.01.15 at 12:33, <tim@xen.org> wrote:
> > FWIW, I don't like adding hypervisor state (and even more so
> > hypervisor mechanism like a new hypercall) for things that the
> > hypervisor doesn't need to know about.  Since the e820 is only shared
> > between the tools and the guest, I'd prefer it to go in either
> > the hvm_info_table or xenstore.
> 
> But we have the guest E820 in the hypervisor already, which we
> also can't drop (as XENMEM_memory_map is a generally accessible
> hypercall).

So we do. :(  What is the difference between that (with appropriate
reserved regions in the map) and the proposed new hypercall?

Tim.


* Re: (v2) Design proposal for RMRR fix
  2015-01-19 12:23                                                                               ` Tim Deegan
@ 2015-01-19 13:00                                                                                 ` George Dunlap
  2015-01-20  0:52                                                                                   ` Tian, Kevin
  2015-01-19 13:52                                                                                 ` Jan Beulich
  1 sibling, 1 reply; 139+ messages in thread
From: George Dunlap @ 2015-01-19 13:00 UTC (permalink / raw)
  To: Tim Deegan, Jan Beulich
  Cc: Kevin Tian, wei.liu2, ian.campbell, stefano.stabellini,
	ian.jackson, xen-devel, Yang Z Zhang, Tiejun Chen

On 01/19/2015 12:23 PM, Tim Deegan wrote:
> At 11:41 +0000 on 19 Jan (1421664109), Jan Beulich wrote:
>>>>> On 19.01.15 at 12:33, <tim@xen.org> wrote:
>>> FWIW, I don't like adding hypervisor state (and even more so
>>> hypervisor mechanism like a new hypercall) for things that the
>>> hypervisor doesn't need to know about.  Since the e820 is only shared
>>> between the tools and the guest, I'd prefer it to go in either
>>> the hvm_info_table or xenstore.
>>
>> But we have the guest E820 in the hypervisor already, which we
>> also can't drop (as XENMEM_memory_map is a generally accessible
>> hypercall).
> 
> So we do. :(  What is the difference between that (with appropriate
> reserved regions in the map) and the proposed new hypercall?

Well, one thing that's been proposed is that we extend that e820 to
include RMRRs, so we can just re-use the same hypercall in hvmloader.

If we're sticking with the "lowmem / mmio hole / himem" thing for now,
does libxc actually need access to the RMRRs?

For RMRRs outside the BIOS area, libxl will either be making the mmio
hole large enough (in which case it will definitely know that there are
no conflicts) or it will not (in which case it will definitely know that
there are conflicts).

For RMRRs in the BIOS area, libxl will already need to know where that
area is (to know that it doesn't need to fit it into the MMIO hole); if
we just make it smart enough to know where the actual BIOS resides, then
it can detect the conflict itself without needing to involve libxc.  Not
sure if that's easier than teaching libxc how to use XENMEM_memory_map.

 -George


* Re: (v2) Design proposal for RMRR fix
  2015-01-19 12:23                                                                               ` Tim Deegan
  2015-01-19 13:00                                                                                 ` George Dunlap
@ 2015-01-19 13:52                                                                                 ` Jan Beulich
  2015-01-19 15:29                                                                                   ` Tim Deegan
  2015-01-20  0:45                                                                                   ` Tian, Kevin
  1 sibling, 2 replies; 139+ messages in thread
From: Jan Beulich @ 2015-01-19 13:52 UTC (permalink / raw)
  To: Tim Deegan
  Cc: Kevin Tian, wei.liu2, ian.campbell, stefano.stabellini,
	George Dunlap, ian.jackson, xen-devel, Yang Z Zhang, Tiejun Chen

>>> On 19.01.15 at 13:23, <tim@xen.org> wrote:
> At 11:41 +0000 on 19 Jan (1421664109), Jan Beulich wrote:
>> >>> On 19.01.15 at 12:33, <tim@xen.org> wrote:
>> > FWIW, I don't like adding hypervisor state (and even more so
>> > hypervisor mechanism like a new hypercall) for things that the
>> > hypervisor doesn't need to know about.  Since the e820 is only shared
>> > between the tools and the guest, I'd prefer it to go in either
>> > the hvm_info_table or xenstore.
>> 
>> But we have the guest E820 in the hypervisor already, which we
>> also can't drop (as XENMEM_memory_map is a generally accessible
>> hypercall).
> 
> So we do. :(  What is the difference between that (with appropriate
> reserved regions in the map) and the proposed new hypercall?

The proposed new hypercall represents _only_ reserved regions.
But it was said several times that making the existing one work
for HVM (and then fit the purposes here) is at least an option
worth investigating.

Jan


* Re: (v2) Design proposal for RMRR fix
  2015-01-19 13:52                                                                                 ` Jan Beulich
@ 2015-01-19 15:29                                                                                   ` Tim Deegan
  2015-01-20  0:45                                                                                   ` Tian, Kevin
  1 sibling, 0 replies; 139+ messages in thread
From: Tim Deegan @ 2015-01-19 15:29 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Kevin Tian, wei.liu2, ian.campbell, stefano.stabellini,
	George Dunlap, ian.jackson, xen-devel, Yang Z Zhang, Tiejun Chen

At 13:52 +0000 on 19 Jan (1421671928), Jan Beulich wrote:
> >>> On 19.01.15 at 13:23, <tim@xen.org> wrote:
> > At 11:41 +0000 on 19 Jan (1421664109), Jan Beulich wrote:
> >> >>> On 19.01.15 at 12:33, <tim@xen.org> wrote:
> >> > FWIW, I don't like adding hypervisor state (and even more so
> >> > hypervisor mechanism like a new hypercall) for things that the
> >> > hypervisor doesn't need to know about.  Since the e820 is only shared
> >> > between the tools and the guest, I'd prefer it to go in either
> >> > the hvm_info_table or xenstore.
> >> 
> >> But we have the guest E820 in the hypervisor already, which we
> >> also can't drop (as XENMEM_memory_map is a generally accessible
> >> hypercall).
> > 
> > So we do. :(  What is the difference between that (with appropriate
> > reserved regions in the map) and the proposed new hypercall?
> 
> The proposed new hypercall represents _only_ reserved regions.
> But it was said several times that making the existing one work
> for HVM (and then fit the purposes here) is at least an option
> worth investigating.

Ah, OK.  That sounds reasonable, thanks.

Tim.


* Re: (v2) Design proposal for RMRR fix
  2015-01-19 13:52                                                                                 ` Jan Beulich
  2015-01-19 15:29                                                                                   ` Tim Deegan
@ 2015-01-20  0:45                                                                                   ` Tian, Kevin
  2015-01-20  7:29                                                                                     ` Jan Beulich
  1 sibling, 1 reply; 139+ messages in thread
From: Tian, Kevin @ 2015-01-20  0:45 UTC (permalink / raw)
  To: Jan Beulich, Tim Deegan
  Cc: wei.liu2, ian.campbell, stefano.stabellini, George Dunlap,
	ian.jackson, xen-devel, Zhang, Yang Z, Chen, Tiejun

> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Monday, January 19, 2015 9:52 PM
> 
> >>> On 19.01.15 at 13:23, <tim@xen.org> wrote:
> > At 11:41 +0000 on 19 Jan (1421664109), Jan Beulich wrote:
> >> >>> On 19.01.15 at 12:33, <tim@xen.org> wrote:
> >> > FWIW, I don't like adding hypervisor state (and even more so
> >> > hypervisor mechanism like a new hypercall) for things that the
> >> > hypervisor doesn't need to know about.  Since the e820 is only shared
> >> > between the tools and the guest, I'd prefer it to go in either
> >> > the hvm_info_table or xenstore.
> >>
> >> But we have the guest E820 in the hypervisor already, which we
> >> also can't drop (as XENMEM_memory_map is a generally accessible
> >> hypercall).
> >
> > So we do. :(  What is the difference between that (with appropriate
> > reserved regions in the map) and the proposed new hypercall?
> 
> The proposed new hypercall represents _only_ reserved regions.
> But it was said several times that making the existing one work
> for HVM (and then fit the purposes here) is at least an option
> worth investigating.
> 

I did consider this option, but there's a reason that makes it unsuitable.
Based on the current discussion, we need to provide a per-region
override (force/try) to the caller, e.g. hvmloader here, while
XENMEM_memory_map only provides plain e820 information, and
extending it with such an override for general e820 entries looks a bit
weird.

Thanks
Kevin


* Re: (v2) Design proposal for RMRR fix
  2015-01-19 13:00                                                                                 ` George Dunlap
@ 2015-01-20  0:52                                                                                   ` Tian, Kevin
  2015-01-20  8:43                                                                                     ` Jan Beulich
  2015-01-20 12:56                                                                                     ` George Dunlap
  0 siblings, 2 replies; 139+ messages in thread
From: Tian, Kevin @ 2015-01-20  0:52 UTC (permalink / raw)
  To: George Dunlap, Tim Deegan, Jan Beulich
  Cc: wei.liu2, ian.campbell, stefano.stabellini, ian.jackson,
	xen-devel, Zhang, Yang Z, Chen, Tiejun

> From: George Dunlap [mailto:george.dunlap@eu.citrix.com]
> Sent: Monday, January 19, 2015 9:01 PM
> 
> On 01/19/2015 12:23 PM, Tim Deegan wrote:
> > At 11:41 +0000 on 19 Jan (1421664109), Jan Beulich wrote:
> >>>>> On 19.01.15 at 12:33, <tim@xen.org> wrote:
> >>> FWIW, I don't like adding hypervisor state (and even more so
> >>> hypervisor mechanism like a new hypercall) for things that the
> >>> hypervisor doesn't need to know about.  Since the e820 is only shared
> >>> between the tools and the guest, I'd prefer it to go in either
> >>> the hvm_info_table or xenstore.
> >>
> >> But we have the guest E820 in the hypervisor already, which we
> >> also can't drop (as XENMEM_memory_map is a generally accessible
> >> hypercall).
> >
> > So we do. :(  What is the difference between that (with appropriate
> > reserved regions in the map) and the proposed new hypercall?
> 
> Well one thing that's been proposed that we extend that e820 to include
> RMRRs, so we can just re-use the same hypercall in hvmloader.
> 
> If we're sticking with the "lowmem / mmio hole / himem" thing for now,
> does libxc actually need access to the RMRRs?
> 
> For RMRRs outside the BIOS area, libxl will either be making the mmio
> hole large enough (in which case it will definitely know that there are
> no conflicts) or it will not (in which case it will definitely know that
> there are conflicts).
> 
> For RMRRs in the BIOS area, libxl will already need to know where that
> area is (to know that it doesn't need to fit it into the MMIO hole); if
> we just make it smart enough to know where the actual BIOS resides, then
> it can detect the conflict itself without needing to involve libxc.  Not
> sure if that's easier than teaching libxc how to use XENMEM_memory_map.
> 

We may make a reasonable simplification and treat all RMRRs <1MB as
conflicts (all real observations so far are in the BIOS region).

If the above is possible, are you proposing to use xenstore instead of
introducing a new hypercall (one is definitely still required to query
per-device RMRRs for libxl), given that libxc may not require changes now?

Thanks
Kevin
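
To make the cases above concrete, here is a minimal sketch of the
conflict check being described, combining the MMIO-hole reasoning with
the <1MB simplification. The function and types are illustrative only,
not taken from any actual patch.

----
#include <stdbool.h>
#include <stdint.h>

struct rmrr_range {
    uint64_t base, end;    /* inclusive, 4KB-aligned per the VT-d spec */
};

/* mmio_hole_start is where guest lowmem ends; the hole runs up to 4GB. */
static bool rmrr_conflicts(const struct rmrr_range *r,
                           uint64_t mmio_hole_start)
{
    if (r->end < (1ULL << 20))
        return true;   /* <1MB: treated as a BIOS-area conflict */
    if (r->base >= mmio_hole_start && r->end < (1ULL << 32))
        return false;  /* fits entirely inside the MMIO hole */
    return true;       /* overlaps guest RAM, so a conflict */
}
----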


* Re: (v2) Design proposal for RMRR fix
  2015-01-20  0:45                                                                                   ` Tian, Kevin
@ 2015-01-20  7:29                                                                                     ` Jan Beulich
  2015-01-20  8:59                                                                                       ` Tian, Kevin
  0 siblings, 1 reply; 139+ messages in thread
From: Jan Beulich @ 2015-01-20  7:29 UTC (permalink / raw)
  To: Kevin Tian
  Cc: wei.liu2, ian.campbell, stefano.stabellini, George Dunlap,
	Tim Deegan, ian.jackson, xen-devel, Yang Z Zhang, Tiejun Chen

>>> On 20.01.15 at 01:45, <kevin.tian@intel.com> wrote:
>>  From: Jan Beulich [mailto:JBeulich@suse.com]
>> The proposed new hypercall represents _only_ reserved regions.
>> But it was said several times that making the existing one work
>> for HVM (and then fit the purposes here) is at least an option
>> worth investigating.
> 
> I did consider this option but there's a reason which makes it not
> suitable. Based on current discussion, we need provide per-region
> override (force/try) to the caller e.g. hvmloader here, while 
> XENMEM_memory_map only provides plain e820 information, and
> extending it w/ such override for general e820 entry looks a bit
> weird. 

I don't see why - the returned table only resembles an E820 one,
i.e. I can't see why we couldn't steal a flag bit from e.g. the type
field, or define a maybe-reserved type.

Jan
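
For illustration, stealing a flag bit from the type field could look
like the sketch below. E820_RESERVED and the entry layout match the
standard e820 definitions; the flag itself is hypothetical.

----
#include <stdbool.h>
#include <stdint.h>

#define E820_RESERVED      2            /* standard e820 type */
#define E820_FLAG_TRY     (1u << 31)    /* hypothetical "try" override bit */

struct e820entry {
    uint64_t addr;
    uint64_t size;
    uint32_t type;
};

/* Entries carrying the flag are reserved regions hvmloader may try to
 * accommodate; everything else is a plain e820 entry. */
static bool entry_is_try(const struct e820entry *e)
{
    return (e->type & E820_FLAG_TRY) != 0;
}

static uint32_t entry_plain_type(const struct e820entry *e)
{
    return e->type & ~E820_FLAG_TRY;    /* what a guest e820 would carry */
}
----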


* Re: (v2) Design proposal for RMRR fix
  2015-01-20  0:52                                                                                   ` Tian, Kevin
@ 2015-01-20  8:43                                                                                     ` Jan Beulich
  2015-01-20  8:56                                                                                       ` Tian, Kevin
  2015-01-20 12:56                                                                                     ` George Dunlap
  1 sibling, 1 reply; 139+ messages in thread
From: Jan Beulich @ 2015-01-20  8:43 UTC (permalink / raw)
  To: Kevin Tian
  Cc: wei.liu2, ian.campbell, stefano.stabellini, George Dunlap,
	Tim Deegan, ian.jackson, xen-devel, Yang Z Zhang, Tiejun Chen

>>> On 20.01.15 at 01:52, <kevin.tian@intel.com> wrote:
> We may make a reasonable simplification to treat all RMRRs <1MB as
> conflicts (all real observations so far are in BIOS region).

I'm not really agreeing to this, in particular because of you (once
again) dropping the distinction between BIOS resident and init-time
regions. Avoiding conflicts with the non-resident part of it ought
to be possible, as said before.

Jan


* Re: (v2) Design proposal for RMRR fix
  2015-01-20  8:43                                                                                     ` Jan Beulich
@ 2015-01-20  8:56                                                                                       ` Tian, Kevin
  0 siblings, 0 replies; 139+ messages in thread
From: Tian, Kevin @ 2015-01-20  8:56 UTC (permalink / raw)
  To: Jan Beulich
  Cc: wei.liu2, ian.campbell, stefano.stabellini, George Dunlap,
	Tim Deegan, ian.jackson, xen-devel, Zhang, Yang Z, Chen, Tiejun

> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Tuesday, January 20, 2015 4:43 PM
> 
> >>> On 20.01.15 at 01:52, <kevin.tian@intel.com> wrote:
> > We may make a reasonable simplification to treat all RMRRs <1MB as
> > conflicts (all real observations so far are in BIOS region).
> 
> I'm not really agreeing to this, in particular because of you (once
> again) dropping the distinction between BIOS resident and init-time
> regions. Avoiding conflicts with the non-resident part of it ought
> to be possible, as said before.
> 

yes, possible, but not necessary for the initial implementation; that's
why I call it a reasonable simplification (anyway, the only example observed
so far is a USB controller, and we'll provide an override option to the user)

Thanks
Kevin


* Re: (v2) Design proposal for RMRR fix
  2015-01-20  7:29                                                                                     ` Jan Beulich
@ 2015-01-20  8:59                                                                                       ` Tian, Kevin
  2015-01-20  9:10                                                                                         ` Jan Beulich
  0 siblings, 1 reply; 139+ messages in thread
From: Tian, Kevin @ 2015-01-20  8:59 UTC (permalink / raw)
  To: Jan Beulich
  Cc: wei.liu2, ian.campbell, stefano.stabellini, George Dunlap,
	Tim Deegan, ian.jackson, xen-devel, Zhang, Yang Z, Chen, Tiejun

> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Tuesday, January 20, 2015 3:29 PM
> 
> >>> On 20.01.15 at 01:45, <kevin.tian@intel.com> wrote:
> >>  From: Jan Beulich [mailto:JBeulich@suse.com]
> >> The proposed new hypercall represents _only_ reserved regions.
> >> But it was said several times that making the existing one work
> >> for HVM (and then fit the purposes here) is at least an option
> >> worth investigating.
> >
> > I did consider this option but there's a reason which makes it not
> > suitable. Based on current discussion, we need provide per-region
> > override (force/try) to the caller e.g. hvmloader here, while
> > XENMEM_memory_map only provides plain e820 information, and
> > extending it w/ such override for general e820 entry looks a bit
> > weird.
> 
> I don't see why - the returned table only resembles an E820 one,
> i.e. I can't see why we couldn't steal a flag bit from e.g. the type
> field, or define a maybe-reserved type.
> 

Originally I was not sure whether any caller assumes exactly-mimicked
e820 behavior, so by your reasoning that doesn't look like a problem
here.

Thanks
Kevin


* Re: (v2) Design proposal for RMRR fix
  2015-01-20  8:59                                                                                       ` Tian, Kevin
@ 2015-01-20  9:10                                                                                         ` Jan Beulich
  2015-01-20 10:38                                                                                           ` Ian Campbell
  0 siblings, 1 reply; 139+ messages in thread
From: Jan Beulich @ 2015-01-20  9:10 UTC (permalink / raw)
  To: Kevin Tian
  Cc: wei.liu2, ian.campbell, stefano.stabellini, George Dunlap,
	Tim Deegan, ian.jackson, xen-devel, Yang Z Zhang, Tiejun Chen

>>> On 20.01.15 at 09:59, <kevin.tian@intel.com> wrote:
>>  From: Jan Beulich [mailto:JBeulich@suse.com]
>> Sent: Tuesday, January 20, 2015 3:29 PM
>> 
>> >>> On 20.01.15 at 01:45, <kevin.tian@intel.com> wrote:
>> >>  From: Jan Beulich [mailto:JBeulich@suse.com]
>> >> The proposed new hypercall represents _only_ reserved regions.
>> >> But it was said several times that making the existing one work
>> >> for HVM (and then fit the purposes here) is at least an option
>> >> worth investigating.
>> >
>> > I did consider this option but there's a reason which makes it not
>> > suitable. Based on current discussion, we need provide per-region
>> > override (force/try) to the caller e.g. hvmloader here, while
>> > XENMEM_memory_map only provides plain e820 information, and
>> > extending it w/ such override for general e820 entry looks a bit
>> > weird.
>> 
>> I don't see why - the returned table only resembles an E820 one,
>> i.e. I can't see why we couldn't steal a flag bit from e.g. the type
>> field, or define a maybe-reserved type.
> 
> Originally I was not sure whether any caller assumption is made on
> an exactly-mimicked e820 behavior. so looks not a problem here
> from your thought.

The main aspect being that there is no existing caller in the HVM world,
since at least one of [gs]et is currently getting explicitly refused for
HVM guests. And PVH is still experimental...

Jan


* Re: (v2) Design proposal for RMRR fix
  2015-01-20  9:10                                                                                         ` Jan Beulich
@ 2015-01-20 10:38                                                                                           ` Ian Campbell
  2015-01-20 10:48                                                                                             ` Jan Beulich
  0 siblings, 1 reply; 139+ messages in thread
From: Ian Campbell @ 2015-01-20 10:38 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Kevin Tian, wei.liu2, stefano.stabellini, George Dunlap,
	Tim Deegan, ian.jackson, xen-devel, Yang Z Zhang, Tiejun Chen

On Tue, 2015-01-20 at 09:10 +0000, Jan Beulich wrote:
> >>> On 20.01.15 at 09:59, <kevin.tian@intel.com> wrote:
> >>  From: Jan Beulich [mailto:JBeulich@suse.com]
> >> Sent: Tuesday, January 20, 2015 3:29 PM
> >> 
> >> >>> On 20.01.15 at 01:45, <kevin.tian@intel.com> wrote:
> >> >>  From: Jan Beulich [mailto:JBeulich@suse.com]
> >> >> The proposed new hypercall represents _only_ reserved regions.
> >> >> But it was said several times that making the existing one work
> >> >> for HVM (and then fit the purposes here) is at least an option
> >> >> worth investigating.
> >> >
> >> > I did consider this option but there's a reason which makes it not
> >> > suitable. Based on current discussion, we need provide per-region
> >> > override (force/try) to the caller e.g. hvmloader here, while
> >> > XENMEM_memory_map only provides plain e820 information, and
> >> > extending it w/ such override for general e820 entry looks a bit
> >> > weird.
> >> 
> >> I don't see why - the returned table only resembles an E820 one,
> >> i.e. I can't see why we couldn't steal a flag bit from e.g. the type
> >> field, or define a maybe-reserved type.
> > 
> > Originally I was not sure whether any caller assumption is made on
> > an exactly-mimicked e820 behavior. so looks not a problem here
> > from your thought.

hvmloader could update the table to convert any magic entries into
standard ones, such that by the time any guest software sees it, it would
look like a normal e820.

In fact it would have to do that for the version it passes on to the
guest via SeaBIOS (i.e. the thing which would become the actual e820);
maybe it's an open question what the guest would see if it chose to use
the hypercall directly.

> The main aspect being that there is no existing caller in the HVM world,
> since at least one of [gs]et is currently getting explicitly refused for
> HVM guests. And PVH is still experimental...

Indeed.

Ian.


* Re: (v2) Design proposal for RMRR fix
  2015-01-20 10:38                                                                                           ` Ian Campbell
@ 2015-01-20 10:48                                                                                             ` Jan Beulich
  2015-01-21  2:30                                                                                               ` Tian, Kevin
  0 siblings, 1 reply; 139+ messages in thread
From: Jan Beulich @ 2015-01-20 10:48 UTC (permalink / raw)
  To: Ian Campbell
  Cc: Kevin Tian, wei.liu2, stefano.stabellini, George Dunlap,
	Tim Deegan, ian.jackson, xen-devel, Yang Z Zhang, Tiejun Chen

>>> On 20.01.15 at 11:38, <Ian.Campbell@citrix.com> wrote:
> On Tue, 2015-01-20 at 09:10 +0000, Jan Beulich wrote:
>> >>> On 20.01.15 at 09:59, <kevin.tian@intel.com> wrote:
>> >>  From: Jan Beulich [mailto:JBeulich@suse.com]
>> >> Sent: Tuesday, January 20, 2015 3:29 PM
>> >> 
>> >> >>> On 20.01.15 at 01:45, <kevin.tian@intel.com> wrote:
>> >> >>  From: Jan Beulich [mailto:JBeulich@suse.com]
>> >> >> The proposed new hypercall represents _only_ reserved regions.
>> >> >> But it was said several times that making the existing one work
>> >> >> for HVM (and then fit the purposes here) is at least an option
>> >> >> worth investigating.
>> >> >
>> >> > I did consider this option but there's a reason which makes it not
>> >> > suitable. Based on current discussion, we need provide per-region
>> >> > override (force/try) to the caller e.g. hvmloader here, while
>> >> > XENMEM_memory_map only provides plain e820 information, and
>> >> > extending it w/ such override for general e820 entry looks a bit
>> >> > weird.
>> >> 
>> >> I don't see why - the returned table only resembles an E820 one,
>> >> i.e. I can't see why we couldn't steal a flag bit from e.g. the type
>> >> field, or define a maybe-reserved type.
>> > 
>> > Originally I was not sure whether any caller assumption is made on
>> > an exactly-mimicked e820 behavior. so looks not a problem here
>> > from your thought.
> 
> hvmloader could update the table to convert any magic entries into
> standard ones, such that by the time any guest software sees it it would
> look like a normal e820.
> 
> In fact it would have to do that for the version it passes on to the
> guest via SeaBIOS (i.e. the thing which would become the actual e820),
> maybe it's an open question what the guest would see if it chose to use
> the hypercall directly.

Yeah, for the actual E820 the conversion of course has to happen.
But I think there's no strong need for it to be done on the variant
obtainable via hypercall - it would only destroy information, and who
knows what having that piece of information available may be good
for in the future.

Jan


* Re: (v2) Design proposal for RMRR fix
  2015-01-20  0:52                                                                                   ` Tian, Kevin
  2015-01-20  8:43                                                                                     ` Jan Beulich
@ 2015-01-20 12:56                                                                                     ` George Dunlap
  2015-01-21  2:43                                                                                       ` Tian, Kevin
  1 sibling, 1 reply; 139+ messages in thread
From: George Dunlap @ 2015-01-20 12:56 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: wei.liu2, ian.campbell, stefano.stabellini, Tim Deegan,
	ian.jackson, xen-devel, Jan Beulich, Zhang, Yang Z, Chen, Tiejun

On Tue, Jan 20, 2015 at 12:52 AM, Tian, Kevin <kevin.tian@intel.com> wrote:
>> For RMRRs in the BIOS area, libxl will already need to know where that
>> area is (to know that it doesn't need to fit it into the MMIO hole); if
>> we just make it smart enough to know where the actual BIOS resides, then
>> it can detect the conflict itself without needing to involve libxc.  Not
>> sure if that's easier than teaching libxc how to use XENMEM_memory_map.
>>
>
> We may make a reasonable simplification to treat all RMRRs <1MB as
> conflicts (all real observations so far are in BIOS region).
>
> If above is possible, are you proposing to use xenstore instead of introducing
> new hypercall (definitely still one required to query per-device RMRR for
> libxl), given that libxc may not require change now?

Well I'm beginning to have less strong feelings, as we're moving from
public interface (which we have to try very hard not to change) into
an internal interface (which we can re-work if we need to).

But one thing I was thinking: if we add the entries to xenstore now,
we will not be *able* to then add RMRR checking in libxc (say, for
BIOS area conflicts) without another re-architecting.  Going with the
e820 hypercall might make adding RMRR checking in libxc easier.  OTOH,
it may be better to just directly pass some ranges into libxc anyway,
so perhaps that doesn't matter.

 -George


* Re: (v2) Design proposal for RMRR fix
  2015-01-20 10:48                                                                                             ` Jan Beulich
@ 2015-01-21  2:30                                                                                               ` Tian, Kevin
  2015-01-21 10:18                                                                                                 ` Jan Beulich
  0 siblings, 1 reply; 139+ messages in thread
From: Tian, Kevin @ 2015-01-21  2:30 UTC (permalink / raw)
  To: Jan Beulich, Ian Campbell
  Cc: wei.liu2, stefano.stabellini, George Dunlap, Tim Deegan,
	ian.jackson, xen-devel, Zhang, Yang Z, Chen, Tiejun

> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Tuesday, January 20, 2015 6:49 PM
> 
> >>> On 20.01.15 at 11:38, <Ian.Campbell@citrix.com> wrote:
> > On Tue, 2015-01-20 at 09:10 +0000, Jan Beulich wrote:
> >> >>> On 20.01.15 at 09:59, <kevin.tian@intel.com> wrote:
> >> >>  From: Jan Beulich [mailto:JBeulich@suse.com]
> >> >> Sent: Tuesday, January 20, 2015 3:29 PM
> >> >>
> >> >> >>> On 20.01.15 at 01:45, <kevin.tian@intel.com> wrote:
> >> >> >>  From: Jan Beulich [mailto:JBeulich@suse.com]
> >> >> >> The proposed new hypercall represents _only_ reserved regions.
> >> >> >> But it was said several times that making the existing one work
> >> >> >> for HVM (and then fit the purposes here) is at least an option
> >> >> >> worth investigating.
> >> >> >
> >> >> > I did consider this option but there's a reason which makes it not
> >> >> > suitable. Based on current discussion, we need provide per-region
> >> >> > override (force/try) to the caller e.g. hvmloader here, while
> >> >> > XENMEM_memory_map only provides plain e820 information, and
> >> >> > extending it w/ such override for general e820 entry looks a bit
> >> >> > weird.
> >> >>
> >> >> I don't see why - the returned table only resembles an E820 one,
> >> >> i.e. I can't see why we couldn't steal a flag bit from e.g. the type
> >> >> field, or define a maybe-reserved type.
> >> >
> >> > Originally I was not sure whether any caller assumption is made on
> >> > an exactly-mimicked e820 behavior. so looks not a problem here
> >> > from your thought.
> >
> > hvmloader could update the table to convert any magic entries into
> > standard ones, such that by the time any guest software sees it it would
> > look like a normal e820.
> >
> > In fact it would have to do that for the version it passes on to the
> > guest via SeaBIOS (i.e. the thing which would become the actual e820),
> > maybe it's an open question what the guest would see if it chose to use
> > the hypercall directly.
> 
> Yeah, for the actual E820 the conversion of course has to happen.
> But I think there's no strong need for it to be done on the variant
> obtainable via hypercall - it would only destroy information, and who
> knows what having that piece of information available may be good
> for in the future.
> 

If the new flag is not destroyed, it may break the BIOS or OS if they don't
know about it. So I'd consider the lifecycle of the new flag bit valid only
between the hypercall return and the construction of the actual e820, where
it will be translated into E820_RESERVED if no conflict is detected.

Thanks
Kevin
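
A sketch of that lifecycle on the hvmloader side, reusing the
hypothetical E820_FLAG_TRY flag and e820entry layout from the earlier
sketch (conflict handling is omitted here):

----
/* hvmloader consumes the hypercall variant of the map, resolves any
 * flagged entries, and emits a plain e820; the flag never reaches the
 * BIOS or the guest OS. */
static void finalize_guest_e820(struct e820entry *map, unsigned int nr)
{
    unsigned int i;

    for (i = 0; i < nr; i++) {
        if (map[i].type & E820_FLAG_TRY)
            /* No conflict detected: keep the range as plain reserved. */
            map[i].type = E820_RESERVED;
    }
}
----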


* Re: (v2) Design proposal for RMRR fix
  2015-01-20 12:56                                                                                     ` George Dunlap
@ 2015-01-21  2:43                                                                                       ` Tian, Kevin
  0 siblings, 0 replies; 139+ messages in thread
From: Tian, Kevin @ 2015-01-21  2:43 UTC (permalink / raw)
  To: George Dunlap
  Cc: wei.liu2, ian.campbell, stefano.stabellini, Tim Deegan,
	ian.jackson, xen-devel, Jan Beulich, Zhang, Yang Z, Chen, Tiejun

> From: George Dunlap
> Sent: Tuesday, January 20, 2015 8:57 PM
> 
> On Tue, Jan 20, 2015 at 12:52 AM, Tian, Kevin <kevin.tian@intel.com> wrote:
> >> For RMRRs in the BIOS area, libxl will already need to know where that
> >> area is (to know that it doesn't need to fit it into the MMIO hole); if
> >> we just make it smart enough to know where the actual BIOS resides, then
> >> it can detect the conflict itself without needing to involve libxc.  Not
> >> sure if that's easier than teaching libxc how to use
> XENMEM_memory_map.
> >>
> >
> > We may make a reasonable simplification to treat all RMRRs <1MB as
> > conflicts (all real observations so far are in BIOS region).
> >
> > If above is possible, are you proposing to use xenstore instead of introducing
> > new hypercall (definitely still one required to query per-device RMRR for
> > libxl), given that libxc may not require change now?
> 
> Well I'm beginning to have less strong feelings, as we're moving from
> public interface (which we have to try very hard not to change) into
> an internal interface (which we can re-work if we need to).
> 
> But one thing I was thinking: if we add the entries to xenstore now,
> we will not be *able* to then add RMRR checking in libxc (say, for
> BIOS area conflicts) without another re-architecting.  Going with the
> e820 hypercall might make adding RMRR checking in libxc easier.  OTOH,
> it may be better to just directly pass some ranges into libxc anyway,
> so perhaps that doesn't matter.
> 

OK, combining all the comments, we'll take a close look at how
extending XENMEM_{get/set}_memory_map may work. The initial goal
would be to minimize major re-architecting (i.e. having libxl do
coarse-grained conflict avoidance and set the memory map, and then
having hvmloader get the memory map when constructing the actual
e820; no change to libxc, nor to hvm_info etc.)

Thanks
Kevin
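
At the libxl/libxc boundary that flow could look like the sketch below.
xc_domain_set_memory_map() is the existing libxc wrapper around
XENMEM_set_memory_map, today used only for PV guests (e820_host); the
sketch assumes the HVM restriction on that hypercall is lifted.

----
#include <xenctrl.h>

/* libxl pushes down a guest e820 that already carves out the reserved
 * regions; hvmloader later reads it back via XENMEM_memory_map and
 * finalizes it as in the earlier sketch. */
static int push_guest_memory_map(xc_interface *xch, uint32_t domid,
                                 struct e820entry *map, uint32_t nr)
{
    return xc_domain_set_memory_map(xch, domid, map, nr);
}
----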


* Re: (v2) Design proposal for RMRR fix
  2015-01-21  2:30                                                                                               ` Tian, Kevin
@ 2015-01-21 10:18                                                                                                 ` Jan Beulich
  0 siblings, 0 replies; 139+ messages in thread
From: Jan Beulich @ 2015-01-21 10:18 UTC (permalink / raw)
  To: Kevin Tian
  Cc: wei.liu2, Ian Campbell, stefano.stabellini, George Dunlap,
	Tim Deegan, ian.jackson, xen-devel, Yang Z Zhang, Tiejun Chen

>>> On 21.01.15 at 03:30, <kevin.tian@intel.com> wrote:
>>  From: Jan Beulich [mailto:JBeulich@suse.com]
>> Yeah, for the actual E820 the conversion of course has to happen.
>> But I think there's no strong need for it to be done on the variant
>> obtainable via hypercall - it would only destroy information, and who
>> knows what having that piece of information available may be good
>> for in the future.
> 
> If not destroying the new flag, it may break BIOS or OS if they don't know
> the new flag. So I'd interpret the life cycle of new flag bit only valid
> between hypercall return and constructing actual e820, and it will be
> translated into E820_RESERVE in actual e820 if no conflict is detected.

That's what I said above. I'm advocating for keeping the flag set only
in what the hypercall returns.

Jan


end of thread

Thread overview: 139+ messages
2014-12-26 11:23 (v2) Design proposal for RMRR fix Tian, Kevin
2015-01-08  0:43 ` Tian, Kevin
2015-01-08 12:32 ` Tim Deegan
2015-01-09  0:53   ` Tian, Kevin
2015-01-09 12:00     ` Andrew Cooper
2015-01-08 12:49 ` George Dunlap
2015-01-08 12:54   ` George Dunlap
2015-01-08 13:00     ` Jan Beulich
2015-01-08 15:15       ` George Dunlap
2015-01-08 15:21         ` Jan Beulich
2015-01-09  2:43     ` Tian, Kevin
2015-01-12 11:25       ` George Dunlap
2015-01-12 13:56         ` Pasi Kärkkäinen
2015-01-12 14:23           ` George Dunlap
2015-01-08 12:58   ` Jan Beulich
2015-01-09  2:29     ` Tian, Kevin
2015-01-09  9:24       ` Jan Beulich
2015-01-09 10:03         ` Tian, Kevin
2015-01-09  2:42   ` Tian, Kevin
2015-01-08 13:54 ` Jan Beulich
2015-01-08 15:59   ` George Dunlap
2015-01-08 16:10     ` Jan Beulich
2015-01-08 18:02       ` George Dunlap
2015-01-08 18:12         ` Pasi Kärkkäinen
2015-01-09  3:12         ` Tian, Kevin
2015-01-09  8:58         ` Jan Beulich
2015-01-09 20:27         ` Konrad Rzeszutek Wilk
2015-01-12  9:44           ` Tian, Kevin
2015-01-12 12:12           ` Ian Campbell
2015-01-14 20:06             ` Konrad Rzeszutek Wilk
2015-01-09  2:49     ` Tian, Kevin
2015-01-09  2:27   ` Tian, Kevin
2015-01-09  9:21     ` Jan Beulich
2015-01-09 10:10       ` Tian, Kevin
2015-01-09 10:35         ` Jan Beulich
2015-01-12  8:46           ` Tian, Kevin
2015-01-12  9:32             ` Jan Beulich
2015-01-12  9:41               ` Tian, Kevin
2015-01-12  9:50                 ` Jan Beulich
2015-01-12  9:56                   ` Tian, Kevin
2015-01-12 10:08                     ` Jan Beulich
2015-01-12 10:12                       ` Tian, Kevin
2015-01-12 10:22                         ` Jan Beulich
2015-01-12 11:22                           ` Tian, Kevin
2015-01-12 11:37                             ` Jan Beulich
2015-01-12 11:41                               ` Tian, Kevin
2015-01-12 12:03                                 ` Jan Beulich
2015-01-12 12:16                                   ` Tian, Kevin
2015-01-12 12:46                                     ` Jan Beulich
2015-01-12 12:13                             ` George Dunlap
2015-01-12 12:23                               ` Ian Campbell
2015-01-12 12:28                               ` Tian, Kevin
2015-01-12 14:19                                 ` George Dunlap
2015-01-13 11:03                                   ` Tian, Kevin
2015-01-13 11:56                                     ` Jan Beulich
2015-01-13 12:03                                       ` Tian, Kevin
2015-01-13 15:52                                         ` Jan Beulich
2015-01-13 15:58                                           ` George Dunlap
2015-01-14  8:06                                             ` Tian, Kevin
2015-01-14  9:00                                               ` Jan Beulich
2015-01-14  9:43                                                 ` Tian, Kevin
2015-01-14 10:24                                                   ` Jan Beulich
2015-01-14 12:01                                                     ` George Dunlap
2015-01-14 12:11                                                       ` Tian, Kevin
2015-01-14 14:32                                                       ` Jan Beulich
2015-01-14 14:37                                                         ` George Dunlap
2015-01-14 14:47                                                           ` Jan Beulich
2015-01-14 18:29                                                             ` George Dunlap
2015-01-15  8:37                                                               ` Jan Beulich
2015-01-15  9:36                                                                 ` Tian, Kevin
2015-01-15 10:06                                                                   ` Jan Beulich
2015-01-18  8:36                                                                     ` Tian, Kevin
2015-01-19  8:42                                                                       ` Jan Beulich
2015-01-15 11:45                                                                   ` George Dunlap
2015-01-18  8:58                                                                     ` Tian, Kevin
2015-01-19  9:32                                                                       ` Jan Beulich
2015-01-19 11:24                                                                         ` Tian, Kevin
2015-01-19 11:33                                                                           ` Tim Deegan
2015-01-19 11:41                                                                             ` Jan Beulich
2015-01-19 12:23                                                                               ` Tim Deegan
2015-01-19 13:00                                                                                 ` George Dunlap
2015-01-20  0:52                                                                                   ` Tian, Kevin
2015-01-20  8:43                                                                                     ` Jan Beulich
2015-01-20  8:56                                                                                       ` Tian, Kevin
2015-01-20 12:56                                                                                     ` George Dunlap
2015-01-21  2:43                                                                                       ` Tian, Kevin
2015-01-19 13:52                                                                                 ` Jan Beulich
2015-01-19 15:29                                                                                   ` Tim Deegan
2015-01-20  0:45                                                                                   ` Tian, Kevin
2015-01-20  7:29                                                                                     ` Jan Beulich
2015-01-20  8:59                                                                                       ` Tian, Kevin
2015-01-20  9:10                                                                                         ` Jan Beulich
2015-01-20 10:38                                                                                           ` Ian Campbell
2015-01-20 10:48                                                                                             ` Jan Beulich
2015-01-21  2:30                                                                                               ` Tian, Kevin
2015-01-21 10:18                                                                                                 ` Jan Beulich
2015-01-19 10:21                                                                       ` George Dunlap
2015-01-19 11:08                                                                         ` Ian Campbell
2015-01-14 12:03                                                     ` Tian, Kevin
2015-01-14 14:34                                                       ` Jan Beulich
2015-01-14 12:12                                                     ` George Dunlap
2015-01-14 14:36                                                       ` Jan Beulich
2015-01-14 12:16                                                   ` George Dunlap
2015-01-14 14:39                                                     ` Jan Beulich
2015-01-14 18:16                                                       ` George Dunlap
2015-01-14 12:21                                                   ` Ian Campbell
2015-01-14 12:17                                               ` Ian Campbell
2015-01-14 15:07                                                 ` Jan Beulich
2015-01-14 15:18                                                   ` Ian Campbell
2015-01-14 15:39                                                     ` George Dunlap
2015-01-14 15:43                                                       ` Ian Campbell
2015-01-14 18:14                                                         ` George Dunlap
2015-01-15 10:05                                                           ` Ian Campbell
2015-01-15 11:58                                                             ` George Dunlap
2015-01-14 16:26                                                       ` Jan Beulich
2015-01-15  8:40                                                   ` Tian, Kevin
2015-01-14 12:29                                               ` George Dunlap
2015-01-14 14:42                                                 ` Jan Beulich
2015-01-14 18:22                                                   ` George Dunlap
2015-01-15  8:18                                                     ` Tian, Kevin
2015-01-13 13:45                                     ` George Dunlap
2015-01-13 15:47                                       ` Jan Beulich
2015-01-13 16:00                                         ` George Dunlap
2015-01-13 16:06                                           ` Jan Beulich
2015-01-14  6:52                                             ` Tian, Kevin
2015-01-14 12:14                                               ` Ian Campbell
2015-01-14 12:23                                                 ` George Dunlap
2015-01-15  8:12                                                   ` Tian, Kevin
2015-01-13 16:45                                     ` Konrad Rzeszutek Wilk
2015-01-14  8:13                                       ` Tian, Kevin
2015-01-14  9:02                                         ` Jan Beulich
2015-01-14  9:44                                           ` Tian, Kevin
2015-01-14 10:25                                             ` Jan Beulich
2015-01-14 20:42                                         ` Konrad Rzeszutek Wilk
2015-01-15  8:09                                           ` Tian, Kevin
2015-01-16 17:17                                             ` Konrad Rzeszutek Wilk
2015-01-15  8:43                                           ` Jan Beulich
2015-01-14 12:47                                       ` George Dunlap
2015-01-12 12:30                               ` Tian, Kevin
