* <summary-1> (v2) Design proposal for RMRR fix
@ 2015-01-09  6:57 Tian, Kevin
  2015-01-09  9:46 ` Jan Beulich
  2015-01-12 13:42 ` Ian Campbell
From: Tian, Kevin @ 2015-01-09  6:57 UTC
  To: Tian, Kevin, Jan Beulich, Chen, Tiejun, ian.campbell, wei.liu2,
	ian.jackson, stefano.stabellini, Zhang, Yang Z, xen-devel,
	konrad.wilk, tim, george.dunlap

Thanks Jan/George/Tim for your valuable inputs. To make the discussion
more efficient, here is a summary of the previous discussions and the
remaining open questions:

--
1) 'fail' vs. 'warn' upon gfn conflict

Assigning a device which fails the RMRR conflict check (i.e. the intended
gfns are already allocated to other resources) brings unknown stability
problems (the device may clobber those valid resources) and a potential
security issue within the VM (though no worse than what a malicious driver
can do w/o a virtual IOMMU).

So by default we should not move forward if a gfn conflict is detected
when setting up the RMRR identity mapping; call this the 'fail' policy.

One open question, though, is whether we want to allow the admin to
override the default 'fail' policy with a 'warn' policy, i.e. reporting the
conflict details but letting the device assignment succeed. USB was
discussed as one example before (a hack is how it works today despite the
<1MB conflict), so it might be good to allow enthusiasts to try device
assignment, or to provide flexibility to users who have already verified
that the predicted potential problem is not a real issue for their specific
deployment.

I'd like to hear your votes on whether to provide such a 'warn' option. If
the final decision is 'no', that actually simplifies the later design;
please skip 1.1) and jump to 1.2). Otherwise 1.1) is the next open question
to consider.

--
1.1) per-device 'warn' vs. global 'warn'

Both Tim and Jan prefer 'warn' as a per-device option for the admin
instead of a global option.

At a glance a per-device 'warn' option provides more fine-grained control
than a global option. Thinking it through, however, allowing one device w/
a potential problem isn't more correct or secure than allowing multiple
devices w/ potential problems. Even though in practice a device like USB
can work despite the <1MB conflict, as Jan pointed out there are always
corner cases which we might not know about. So as long as we open the door
for one device, it implies a problematic environment to users, and the
user's judgment on whether he can live with the problem is not affected by
how many devices the door is opened for (he needs to study the warning
messages and do verification either way).

Given that, IMO if we agree to provide a 'warn' option, just providing a
global override (per-VM, certainly) is acceptable and simpler.

--
1.2) when to 'fail'

There is one open question: whether we should fail immediately in the
domain builder if a conflict is detected.

Jan's comment is yes, we should 'fail' VM creation as it's an error.

My previous point was to mimic native behavior, where a device failure (in
our case actually a potential device failure, since the VM is not powered
on yet) doesn't impact the user until its function is actually touched. In
our case, even if the domain builder fails to re-arrange guest RAM to skip
reserved regions, we have a centralized policy (either 'fail' or 'warn',
per the conclusion above) in the Xen hypervisor when the device is
actually assigned. So a 'warn' should be fine, though I don't insist
strongly on this.

Another point is about hotplug. 'fail' for future devices is too strict,
but to differentiate them from statically-assigned devices, the domain
builder would then need to maintain a per-device reserved region
structure. Just 'warn' keeps things simple.

For the Xen hypervisor it's clear to follow the conclusion of 1): either
only 'fail', or favor the admin's decision. That's fair since it's where
assignment actually happens.

For hvmloader, the same static vs. hotplug puzzle exists. It can just
'warn' on RAM conflicts, since if any conflict exists on RAM the Xen
hypervisor will already have caught it to block or warn. hvmloader can
then focus on keeping its own allocations and the PCI BARs free of
conflicts, and 'fail' on any problem there.

--
2) RMRR management

George raised a good point that RMRR reserved regions could be maintained
in the toolstack, with the toolstack telling Xen which regions to reserve.
Besides providing more flexibility, another benefit, noted by Jan, is the
ability to specify the reserved regions of another node (a
might-be-migrated-to host) as a preparation for migration.

While it sounds like a good long-term plan, my feeling is that it could be
a parallel effort driven by toolstack experts. Xen can't simply rely on
user space to set up all necessary reserved regions, since that violates
Xen's isolation philosophy: whatever a toolstack may tell Xen, Xen still
needs to set up identity mappings for all reserved regions reported for
the assigned device.

So I still prefer the current way, i.e. having Xen organize reserved
regions according to the assigned devices, and then having libxc/hvmloader
query them to avoid conflicts. In the future a new interface can be
created to allow the toolstack to specify plain reserved regions to Xen
for whatever reason, as a complement.

--
3) report-sel vs. report-all

report-sel means reporting reserved regions selectively (for hotplug, all
potentially-to-be-assigned devices must be listed, but doing so is not
user friendly)

report-all means reporting reserved regions for all available devices on
the platform (covers hotplug w/ enough flexibility)

report-sel has been Jan's preference from the start, as report-all exposes
some confusing reserved regions to the end user.

OTOH, our proposal picks report-all as the simplified option, because we
don't think the user should make assumptions about the e820 layout, which
is a platform attribute; at most it's similar to a physical layout.

First, report-all doesn't cause more conflicts than report-sel, by a
reasonable argument: a virtual platform is simpler than a physical
platform, and since those regions can be reserved on the physical
platform, it's reasonable to assume the same reservations can succeed on
the virtual platform (putting the <1MB conflict aside).

Second, I'm not sure to what degree users care about those reserved
regions. At most it's the same layout as the physical one, so even
sensitive users won't see it as a UFO. :-) And e820 is a platform
attribute, so users shouldn't make assumptions about it.

There's also a mixed mode: report-sel for static assignment and report-all
for hotplug.

So this is another open question w/ clear options; more people are welcome
to help draw a conclusion!

--
4) handling conflicts

Several points have been discussed.

Jan raised a good point that a reasonable assumption can be made to avoid
splitting lowmem into a scattered structure, i.e. assuming reserved
regions sit only <1MB or >host lowmem. A scattered structure has an impact
on the RAM layout sharing between the domain builder and hvmloader, and
further, per George's comment, impacts upstream qemu. With that reasonable
assumption the domain builder can arrange lowmem to always end below the
high reserved regions, and thus preserve the existing coarse-grained
structure of low/highmem plus the mmio hole. Detection will of course
still be done if a reserved region breaks that assumption, but no attempt
is made to split guest RAM to avoid the conflict.
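
A rough sketch of that lowmem arrangement (illustrative only; the names
here are made up, not actual libxc code):

    #include <stdint.h>

    struct rsvd_region { uint64_t base, end; };  /* host RMRRs, [base, end) */

    /* Clamp the end of guest lowmem below the lowest "high" reserved
     * region, per the assumption that RMRRs sit either <1MB or near the
     * end of host lowmem. Remaining RAM moves above 4GB (highmem). */
    static uint64_t clamp_lowmem_end(uint64_t lowmem_end,
                                     const struct rsvd_region *r, int nr)
    {
        int i;

        for ( i = 0; i < nr; i++ )
        {
            if ( r[i].end <= 0x100000 )  /* <1MB regions handled separately */
                continue;
            if ( r[i].base < lowmem_end )
                lowmem_end = r[i].base;
        }
        return lowmem_end;
    }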

Following that, the hvmloader changes become simpler too, focusing on
BIOS/ACPI and PCI BARs.

Two other ideas came from Jan. One is to move more layout work (like PCI
BARs, etc.) into the domain builder; however, per the discussion it's not
simple, and as long as hvmloader still allocates gfns it needs to handle
conflicts anyway. The other is to let libxc populate only the small amount
of RAM needed to bring up hvmloader and then have hvmloader do the bulk of
the populating. Again, such a change is not small, and the earlier
suggestion based on the lowmem assumption is more reasonable.

--
(There are other good comments which are either small/clear or next-level
details associated with the major open questions above. I will include
them in the next version once the high-level questions are closed.)

Thanks
Kevin

> From: Tian, Kevin
> Sent: Friday, December 26, 2014 7:23 PM
> 
> (please note some proposals differ from the last sent version after more
> discussion, but I have tried to summarize the previous discussions and
> explain why we chose a different way. Sorry if I missed some open
> questions/conclusions discussed in past months. Please help point them
> out; it is very much appreciated. :-)
> 
> ----
> TOC:
> 	1. What's RMRR
> 	2. RMRR status in Xen
> 	3. High Level Design
> 		3.1 Guidelines
> 		3.2 Confliction detection
> 		3.3 Policies
> 		3.4 Xen: setup RMRR identity mapping
> 		3.5 New interface: expose reserved region information
> 		3.6 Libxc/hvmloader: detect and avoid conflictions
> 		3.7 Hvmloader: reserve 'reserved regions' in guest E820
> 		3.8 Xen: Handle devices sharing reserved regions
> 	4. Plan
> 		4.1 Stage-1: hypervisor hardening
> 		4.2 Stage-2: libxc/hvmloader hardening
> 
> 1. What's RMRR?
> ================================================================
> =====
> 
> RMRR is an acronym for Reserved Memory Region Reporting, expected to
> be used for legacy usages (such as USB, UMA Graphics, etc.) requiring
> reserved memory.
> 
> (From vt-d spec)
> ----
> Reserved system memory regions are typically allocated by BIOS at boot
> time and reported to OS as reserved address ranges in the system memory
> map. Requests to these reserved regions may either occur as a result of
> operations performed by the system software driver (for example in the
> case of DMA from unified memory access (UMA) graphics controllers to
> graphics reserved memory) or may be initiated by non system software
> (for example in case of DMA performed by a USB controller under BIOS
> SMM control for legacy keyboard emulation).
> 
> For proper functioning of these legacy reserved memory usages, when
> system software enables DMA remapping, the translation structures for
> the respective devices are expected to be set up to provide identity
> mapping for the specified reserved memory regions with read and write
> permissions. The system software is also responsible for ensuring
> that any input addresses used for device accesses to OS-visible memory
> do not overlap with the reserved system memory address ranges.
> 
> BIOS may report each such reserved memory region through the RMRR
> structures, along with the devices that requires access to the
> specified reserved memory region. Reserved memory ranges that are
> either not DMA targets, or memory ranges that may be target of BIOS
> initiated DMA only during pre-boot phase (such as from a boot disk
> drive) must not be included in the reserved memory region reporting.
> The base address of each RMRR region must be 4KB aligned and the size
> must be an integer multiple of 4KB. If there are no RMRR structures,
> the system software concludes that the platform does not have any
> reserved memory ranges that are DMA targets.
> 
> Platform designers should avoid or limit use of reserved memory regions
> since these require system software to create holes in the DMA virtual
> address range available to system software and its drivers.
> ----
> 
> Below is one example from a BDW machine:
> (XEN) [VT-D]dmar.c:834: found ACPI_DMAR_RMRR:
> (XEN) [VT-D]dmar.c:679:   RMRR region: base_addr ab80a000 end_address
> ab81dfff
> (XEN) [VT-D]dmar.c:834: found ACPI_DMAR_RMRR:
> (XEN) [VT-D]dmar.c:679:   RMRR region: base_addr ad000000 end_address
> af7fffff
> 
> Here the 1st reserved region is for USB controller, with the 2nd one
> belonging to IGD.
> 
> 
> 
> 2. RMRR status in Xen
> ================================================================
> =====
> 
> There are two main design goals according to the VT-d spec:
> 
> a) Set up identity mappings for reserved regions in the IOMMU page table
> b) Ensure reserved regions do not conflict with OS-visible memory
> (OS-visible memory in a VM means guest physical memory, and more
> strictly it also means no conflict with other types of allocations
> in the guest physical address space, such as PCI MMIO, ACPI, etc.)
> 
> However the current RMRR implementation in Xen only partially achieves a)
> and completely misses b), which causes some issues:
> 
> --
> [Issue-1] The identity mapping is not set up in the shared EPT case, so a
> device with an RMRR may not function correctly if assigned to a VM.
> 
> This was the original problem we found when assigning the IGD on a BDW
> platform, which triggered the whole long discussion of the past months.
> 
> --
> [Issue-2] Lacking goal-b), existing device assignment with RMRR works
> only when the reserved regions happen not to conflict with other valid
> allocations in the guest physical address space. This can lead to
> unpredictable failures in various deployments, due to undetected
> conflicts caused by platform differences and VM configuration
> differences.
> 
> One example is USB controller assignment. It's already identified as a
> problem on some platforms that USB reserved regions conflict with the
> guest BIOS region. However, given that the host BIOS only touches those
> reserved regions for legacy keyboard emulation at the early Dom0 boot
> phase, a trick was added in Xen to bypass RMRR handling for USB
> controllers.
> 
> --
> [Issue-3] Devices may share the same reserved regions, but there
> is no logic to handle this in Xen. Assigning such devices to
> different VMs could lead to a security concern.
> 
> 
> 
> 3. High Level Design
> ================================================================
> =====
> 
> To achieve the aforementioned two goals, major enhancements are required
> across the Xen hypervisor, libxc, and hvmloader to address the gap in
> goal-b), i.e. handling possible conflicts in the gfn space. Fixing
> goal-a) is straightforward.
> 
> >>>3.1 Guidelines
> ----
> There are several guidelines considered in the design:
> 
> --
> [Guideline-1] No regression in a VM w/o statically-assigned devices
> 
>   If a VM isn't configured with assigned devices at creation, the new
> conflict detection logic shouldn't block the VM boot process
> (it is either skipped, or just throws a warning)
> 
> --
> [Guideline-2] No regression for devices which do not have an RMRR reported
> 
>   If a VM is assigned a device which doesn't have an RMRR reported,
> whether statically-assigned or dynamically-assigned, the new conflict
> detection logic shouldn't fail the assignment request for that device.
> 
> --
> [Guideline-3] The new interface should be kept as general as possible
> 
>   A new interface will be introduced to expose reserved regions to
> user space. Though RMRR is a VT-d specific term, the interface
> design should be generic enough, i.e. supporting a function which
> allows the hypervisor to force reserving one or more gfn ranges.
> 
> --
> [Guideline-4] Keep changes simple
> 
>   RMRR reserved regions should be avoided or limited by platform
> designers, per the VT-d specification. Per our observations, there are
> only a few reported examples (USB, IGD) on real platforms. So we need
> to balance code complexity against usage limitations. If a limitation
> affects only niche scenarios, we'd rather vote no-support, to simplify
> the changes for now.
> 
> >>>3.2 Conflict detection
> ----
> Conflicts must be detected in several places as far as gfns are
> concerned (how to handle a conflict is discussed in 3.3)
> 
> 1) libxc domain builder
>   Here the coarse-grained gfn layout is created, including two contiguous
> guest RAM trunks (lowmem and/or highmem) and mmio holes (VGA, PCI),
> which are passed to hvmloader for later fine-grained manipulation. Guest
> RAM trunks are populated with valid translations set up in the underlying
> p2m layer. Device reserved regions must be detected in that layout.
> 
> 2) Xen hypervisor device assignment
>   Device assignment can happen either at VM creation time (after domain
> builder) or anytime through hotplug after the VM has booted. Regardless
> of how userspace handles conflicts, the Xen hypervisor will always do a
> last-conservative detection when setting up the identity mapping (a rough
> sketch follows after this list):
> 	* gfn space unoccupied:
> 		-> insert the identity mapping; no conflict
> 	* gfn space already occupied with the identity mapping:
> 		-> do nothing; no conflict
> 	* gfn space already occupied with another mapping:
> 		-> conflict detected
> 
> 3) hvmloader
>   Hvmloader allocates other resources (ACPI, PCI MMIO, etc.) and
> internal data structures in the gfn space, and it creates the final guest
> e820. So hvmloader also needs to detect conflicts when conducting
> those operations. If there's no conflict, hvmloader will reserve
> those regions in the guest e820 to make the guest OS aware.
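> 
>   A rough sketch of the last-conservative check in 2) above (the helper
> names here are hypothetical, not existing Xen functions):
> 
>     /* Returns 0 on success, -EBUSY upon conflict. */
>     static int rmrr_identity_map(struct domain *d, unsigned long gfn)
>     {
>         unsigned long mfn = gfn;                /* identity: gfn == mfn */
>         unsigned long cur = p2m_lookup_gfn(d, gfn);    /* hypothetical */
> 
>         if ( cur == INVALID_MFN )               /* unoccupied: map it */
>             return p2m_set_identity(d, gfn);           /* hypothetical */
>         if ( cur == mfn )                       /* already identity-mapped */
>             return 0;
>         return -EBUSY;                          /* occupied: conflict */
>     }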
> 
> >>>3.3 Policies
> ----
> An intuitive thought is to fail immediately upon a conflict; however
> that is not flexible with regard to different requirements:
> 
> a) it's not appropriate to fail the libxc domain builder just because of
> such a conflict. We still want the guest to boot even w/o the assigned
> device;
> 
> b) whether to fail in hvmloader has several dependencies. If the check is
> for hotplug preparation, a warning is an acceptable option since
> assignment may not happen at all. Or if it's a USB controller but the
> user doesn't care about legacy keyboard emulation, it's also OK to move
> forward upon a conflict;
> 
> c) in the Xen hypervisor it is reasonable to fail upon conflict, since
> that is where the device is actually assigned. But due to the same
> requirement on USB controllers, sometimes we might want it to succeed
> just w/ warnings.
> 
> Regarding the complexity of addressing all the above flexibilities (user
> preferences, per-device), which would require inventing quite a few
> parameters passed among different components, and given that failures
> would be rare (except some USB cases) with proactive avoidance in
> userspace, we'd like to propose the simplified policy below, following
> [Guideline-4]:
> 
> - 'warn' on conflicts in user space (libxc and hvmloader)
> - a boot option to specify the 'fail' or 'warn' policy on conflict in the
> Xen device assignment path, defaulting to 'fail' (the user can set it to
> 'warn' for the USB case)
> 
> Such a policy provides a relaxed user space policy w/ the hypervisor as
> the final judge. It has the unique merit of simplifying the later
> interface design and hotplug support, w/o breaking [Guideline-1/2] even
> when all possible reserved regions are exposed.
> 
>     ******agreement is first required on above policy******
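> 
>   A possible shape for the boot option (illustrative only; the option
> name and exact semantics are part of what needs agreement):
> 
>     /* e.g. "rmrr=fail" (default) or "rmrr=warn" on the Xen command line */
>     static bool_t __initdata rmrr_fail_on_conflict = 1;
> 
>     static void __init parse_rmrr_param(const char *s)
>     {
>         if ( !strcmp(s, "warn") )
>             rmrr_fail_on_conflict = 0;
>     }
>     custom_param("rmrr", parse_rmrr_param);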
> 
> >>>3.4 Xen: setup RMRR identity mapping
> ----
> Regardless of whether userspace has detected a conflict, the Xen
> hypervisor always needs to detect conflicts itself when setting up the
> identity mapping for reserved gfn regions, following the policy defined
> above.
> 
> Identity mapping should really be handled in the general p2m layer,
> so the same r/w permissions apply equally to the CPU and DMA access
> paths, regardless of whether the EPT tables are actually shared with the
> IOMMU.
> 
> This matches the behavior on bare metal, where although reserved
> regions are marked as E820_RESERVED, that's just a hint to system
> software, which can still read the data back because physically those
> bits do exist. So in the virtualization case we don't need to treat CPU
> accesses to RMRR reserved regions specially (similar to other reserved
> regions like ACPI NVS)
> 
> >>>3.5 New interface: expose reserved region information
> ----
> As explained in [Guideline-3], we'd like to keep this interface general
> enough to serve as a common interface for the hypervisor to force
> reserving gfn ranges for various reasons (RMRR is one client of this
> feature).
> 
> One design question was discussed back and forth: whether the interface
> should return the regions reported for all devices in the platform
> (report-all), or selectively return only the regions belonging to
> assigned devices (report-sel). report-sel can be built on top of
> report-all, with extra work to help the hypervisor generate the filtered
> regions (e.g. introduce a new interface, or make device assignment happen
> before the domain builder)
> 
> We propose report-all as the simple solution (different from the last
> sent version, which used report-sel), based on the following facts:
> 
>   - the 'warn' policy in user space makes report-all not harmful
>   - 'report-all' still means only a few entries in reality:
>     * RMRR reserved regions should be avoided or limited by platform
> designers, per the VT-d specification;
>     * RMRR reserved regions are only a few on real platforms, per our
> observations so far;
>   - the OS needs to handle all the reserved regions on bare metal anyway;
>   - it is hotplug friendly;
>   - report-all can be extended to report-sel if really required
> 
> In this way, there are two situations in which the libxc domain builder
> may request reserved region information w/ the same interface:
> 
> a) if there are any statically-assigned devices, and/or
> b) if a new parameter is specified, asking for hotplug preparation
> 	('rdm_check' or 'prepare_hotplug'?)
> 
> The 1st invocation of this interface will save all reported reserved
> regions under the domain structure, and later invocations (e.g. from
> hvmloader) get the saved content.
> 
> If a VM is configured w/o assigned devices, this interface is not
> invoked, so there's no impact and [Guideline-1] is met;
> 
> If a VM is configured w/ assigned devices which don't have reserved
> regions, this interface is invoked. In some cases a warning may be thrown
> out due to a conflict caused by other, non-assigned devices, but it's
> just informational and there is no impact on the assigned devices, so
> [Guideline-2] is met;
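> 
>   One possible shape for the interface (illustrative only; the actual
> hypercall name and layout are to be decided):
> 
>     /* Entry describing one gfn range the hypervisor wants reserved. */
>     struct xen_reserved_region {
>         uint64_t start_gfn;
>         uint64_t nr_pages;
>         uint32_t flags;    /* e.g. origin: RMRR vs. toolstack-specified */
>     };
> 
>     /* A query op (a XENMEM_* subop, say) would take a guest buffer of
>      * such entries plus a count, fill the buffer, and return the number
>      * of entries used. */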
> 
> >>>3.6 Libxc/hvmloader: detect and avoid conflictions
> ----
> libxc needs to detect reserved region conflicts with:
> 	- guest RAM
> 	- the monolithic PCI MMIO hole
> 
> hvmloader needs to detect reserved region conflicts with:
> 	- guest RAM
> 	- PCI MMIO allocations
> 	- memory allocations
> 	- some e820 entries like the ACPI Opregion, etc.
> 
> When a conflict is detected, libxc/hvmloader first try to relocate the
> conflicting gfn resources to avoid the conflict. A warning is thrown out
> when such relocation fails. The relocation policy is straightforward for
> most resources; however there remains a major design tradeoff for guest
> RAM, regarding the handoff between libxc and hvmloader...
> 
> In the current implementation, guest RAM is contiguous in the gfn
> space, w/ at most two trunks: lowmem (<4G) and highmem (>4G), which are
> passed to hvmloader through hvm_info. If guest RAM is now relocated to
> avoid conflicts with reserved regions, sparse memory trunks are created,
> and introducing such a sparse structure into hvm_info is not considered
> an extensible approach.
> 
> There are several other options discussed so far:
> 
> a) Duplicate same relocation algorithm within libxc domain builder
> (when populating physmap) and hvmloader (when creating e820)
>   - Pros:
> 	* no interface/structure change
> 	* anyway hvmloader still needs to handle reserved regions
>   - Cons:
> 	* duplication is not good
> 
> b) pass the sparse information through Xenstore
>   (no concrete idea yet; input is needed from the toolstack maintainers)
> 
> c) utilize the XENMEM_{set,}_memory_map pair of hypercalls, with libxc
> doing the set and hvmloader the get. An extension is required to allow
> HVM to invoke the get.
>   - Pros:
> 	* centralized ownership in libxc. flexible for extension
>   - Cons:
> 	* limiting entry to E820MAX (should be fine)
> 	* hvmloader e820 construction may become more complex, given
> two predefined tables (reserved_regions, memory_map)
> 
> ********Inputs are required to find a good option here*********
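> 
>   For option c) the libxc side might look roughly like this (a sketch,
> assuming XENMEM_set_memory_map is extended for HVM use; today
> xc_domain_set_memory_map() serves only the PV e820_host path):
> 
>     struct e820entry map[3] = {
>         { .addr = 0,          .size = lowmem_end,   .type = E820_RAM      },
>         { .addr = rmrr_base,  .size = rmrr_size,    .type = E820_RESERVED },
>         { .addr = 1ULL << 32, .size = highmem_size, .type = E820_RAM      },
>     };
> 
>     rc = xc_domain_set_memory_map(xch, domid, map, 3);
>     /* hvmloader would later retrieve this via XENMEM_memory_map */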
> 
> >>>3.7 hvmloader: reserve 'reserved regions' in guest E820
> ----
> If no conflict is detected, hvmloader needs to mark those
> reserved regions as E820_RESERVED in the guest E820 table, so the guest
> OS is aware of those reserved regions (and thus doesn't take problematic
> actions, e.g. when re-allocating PCI MMIO)
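> 
>   A sketch of that marking in hvmloader's e820 construction (names are
> illustrative; real entries must also be kept sorted and non-overlapping):
> 
>     for ( i = 0; i < nr_reserved; i++ )
>     {
>         e820[nr].addr = reserved[i].start_gfn << PAGE_SHIFT;
>         e820[nr].size = reserved[i].nr_pages << PAGE_SHIFT;
>         e820[nr].type = E820_RESERVED;
>         nr++;
>     }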
> 
> >>>3.8 Xen: Handle devices sharing reserved regions
> ----
> Per the VT-d spec, it's possible for two devices to share the same
> reserved region. Though we haven't seen such an example in reality, the
> hypervisor needs to detect and handle that scenario; otherwise a
> vulnerability may exist if the two devices are assigned to different VMs
> (a malicious VM may program its assigned device to clobber the shared
> region and thereby disturb another VM's device)
> 
> Ideally all devices sharing reserved regions should be assigned to a
> single VM. However, achieving this goal can't be done solely in the
> hypervisor w/o reworking the current device assignment interface.
> Assignment is managed by the toolstack, so this would require exposing
> the group sharing information to userspace and then extending the
> toolstack to manage assignment in bundles.
> 
> Given that the problem is so far only theoretical, we propose not to
> support such a scenario, i.e. having the hypervisor fail the assignment
> if the target device happens to share some reserved regions with another
> device, following [Guideline-4] to keep things simple.
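> 
>   A sketch of that refusal check at assignment time (for_each_rmrr_device
> is the existing VT-d iterator; the other helpers are hypothetical):
> 
>     for_each_rmrr_device ( rmrr, bdf, i )
>     {
>         if ( bdf == assigned_bdf )
>             continue;
>         if ( rmrr_overlaps(rmrr, assigned_rmrr) &&     /* hypothetical */
>              device_owner(bdf) != d )                  /* hypothetical */
>             return -EPERM;  /* shared region, different owner: refuse */
>     }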
> 
> 
> 
> 4. Plan
> ================================================================
> =====
> We're seeking an incremental way to split the above tasks into 2
> stages, where each stage moves forward a step w/o causing regressions.
> Doing so can benefit people who want to use device assignment early, and
> also helps new developers ramp up, toward a final sane solution.
> 
> 4.1 Stage-1: hypervisor hardening
> ----
>   [Tasks]
> 	1) Set up RMRR identity mappings in the p2m layer with conflict
> detection
> 	2) Add a boot option for the fail/warn policy
> 	3) Remove the USB hack
> 	4) Detect and fail device assignment w/ shared reserved regions
> 
>   [Enhancements]
> 	* fix [Issue-1] and [Issue-3]
> 	* partially fix [Issue-2], with limitations:
> 		- w/o userspace relocation there's a larger chance of
> seeing conflicts.
> 		- w/o reservations in the guest e820, the guest OS may
> allocate a reserved pfn when re-enumerating PCI resources
> 
>   [Regressions]
> 	* devices which could be assigned successfully before may fail
> now due to conflict detection. However it's not a regression per se, and
> the user can change the policy to 'warn' if required.
> 
> 4.2 Stage-2: libxc/hvmloader hardening
> ----
>   [Tasks]
> 	5) Introduce a new interface to expose reserved region information
> 	6) Detect and avoid reserved region conflicts in libxc
> 	7) Pass the libxc guest RAM layout to hvmloader
> 	8) Detect and avoid reserved region conflicts in hvmloader
> 	9) Reserve 'reserved regions' in the guest E820 in hvmloader
> 
>   [Enhancements]
> 	* completely fix [Issue-2]
> 
>   [Regression]
> 	* n/a
> 
> Thanks,
> Kevin
> 
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel


* Re: <summary-1> (v2) Design proposal for RMRR fix
  2015-01-09  6:57 <summary-1> (v2) Design proposal for RMRR fix Tian, Kevin
@ 2015-01-09  9:46 ` Jan Beulich
  2015-01-09 10:26   ` Tian, Kevin
  2015-01-12 13:42 ` Ian Campbell
From: Jan Beulich @ 2015-01-09  9:46 UTC
  To: Kevin Tian
  Cc: wei.liu2, ian.campbell, stefano.stabellini, george.dunlap, tim,
	ian.jackson, xen-devel, Yang Z Zhang, Tiejun Chen

>>> On 09.01.15 at 07:57, <kevin.tian@intel.com> wrote:
> 1) 'fail' vs. 'warn' upon gfn confliction
> 
> Assigning device which fails RMRR confliction check (i.e. intended gfns 
> already allocated for other resources) actually brings unknown stability 
> problem (device may clobber those valid resources) and potentially security 
> issue within VM (but not worse than what a malicious driver can do w/o 
> virtual IOMMU). 
> 
> So by default we should not move forward if gfn confliction is detected 
> when setting up RMRR identity mapping, so-called a 'fail' policy.
> 
> One open though, is whether we want to allow admin to override default
> 'fail' policy with a 'warn' policy, i.e. throwing out confliction detail but
> succeeding device assignment. USB is discussed as one example before
> (hack how it works today upon <1MB confliction), so it might be good to 
> allow enthusiast trying device assignment, or provide flexibility to users
> who already verified predicted potential problem not a real issue to their
> specific deployment.
> 
> I'd like to hear your votes on whether to provide such 'warn' option.

Yes, I certainly see value in such an option, as a means to circumvent
the other possible perceived regression of not being able to pass
through certain devices anymore.

> 1.1) per-device 'warn' vs. global 'warn'
> 
> Both Tim/Jan prefer to 'warn' as a per-device option to the admin instead 
> of a global option.
> 
> In a glimpse a per-device 'warn' option provides more fine-grained control 
> than a global option, however if thinking it carefully allowing one device 
> w/ 
> potential problem isn't more correct or secure than allowing multiple 
> devices w/ potential problem. Even in practice a device like USB can
> work bearing <1MB confliction, like Jan pointed out there's always corner
> cases which we might not know so as long as we open door for one device,
> it implies a problematic environment to users and user's judge on whether
> he can live up to this problem is not impacted by how many devices the door
> is opened for (he anyway needs to study warning message and do verification
> if choosing to live up)
> 
> Regarding to that, imo if we agree to provide 'warn' option, just providing
> a global overriding option (definitely per-vm) is acceptable and simpler.

If the admin determined that ignoring the RMRR requirements for one
device is safe, that doesn't (and shouldn't) mean the same is true for
all other devices too.

> 1.2) when to 'fail'
> 
> There is one open whether we should fail immediately in domain builder
> if a confliction is detected. 
> 
> Jan's comment is yes, we should 'fail' the VM creation as it's an error.
> 
> My previous point is more mimicking native behavior, where a device 
> failure (in our case it's actually potential device failure since VM is not 
> powered yet) doesn't impact user until its function is actually touched. 
> In our case, even domain builder fails to re-arrange guest RAM to skip 
> reserved regions, we have centralized policy (either 'fail' or 'warn' per 
> above conclusion) in Xen hypervisor when the device is actually assigned. 
> so a 'warn' should be fine, but my insist on this is not strong.

See my earlier reply: Failure to add a device to me is more like a
device preventing a bare metal system from coming up altogether.

> and another point is about hotplug. 'fail' for future devices is too strict,
> but to differentiate that from static-assigned devices, domain builder
> will then need maintain a per-device reserved region structure. just
> 'warn' makes things simple.

Whereas here I agree - hotplug should just fail (without otherwise
impacting the guest).

> 2) RMRR management
> 
> George raised a good point that RMRR reserved regions can be maintained
> in toolstack, and it's toolstack to tell Xen which regions to be reserved. 
> When
> providing more flexibility, another benefit from Jan is to specify reserved 
> regions in another node (might-be-migrated-to) as a preparation for migration.
> 
> When it sounds like a good long term plan, my feeling is that it might be
> some parallel effort driving from toolstack experts. Xen can't simply rely
> on user space to setup all necessary reserved regions, since it violates the
> isolation philosophy in Xen. Whatever a toolstack may tell Xen, Xen still
> needs to setup identity mapping for all reserved regions reported for the
> assigned device.

Of course. If the tool stack failed to reserve a certain page in a guest's
memory map, failure will result.

> So I still prefer to current way i.e. having Xen to organize reserve regions
> according to assigned device, and then having libxc/hvmloader to query
> to avoid confliction. In the future new interface can be created to allow
> toolstack specific plain reserved regions for whatever reason to Xen, as
> a compliment.

Indeed we should presumably allow for both - the guest config may
specify regions independent of what the host properties are, yet by
default the reserved regions within the guest layout will depend on host
properties (and guest config settings - as pointed out before, the
default ought to be no reserved regions anyway). How much of this
gets implemented right away vs. deferred until found necessary is a
different question.

> 3) report-sel vs. report-all
> 
> report-sel means report reserved regions selectively (all potentially-to-be 
> assigned are listed for hotplug, but doing this is not user friendly)
> 
> report-all means report reserved regions for all available devices in this
> platform (cover hotplug w/ enough flexibility)
> 
> report-sel is opted by Jan from start, as report-all leaves some confusing
> reserved regions to end user. 
> 
> otoh, our proposal seeks report-all as a simplified option, because we don't
> think user should set assumption on e820 layout which is the platform
> attribute and at most it's similar to a physical layout.
> 
> first, report-all doesn't cause more conflictions than report-sel in a 
> reasonable
> thinking. virtual platform is simpler than physical platform. Since those 
> regions
> can be reserved in physical platform, it's reasonable to assume same 
> reservation
> can succeed in virtual platform (putting 1MB confliction aside .

As again said in an earlier reply, tying the guest layout to the one of
the host where it boots is going to lead to inconsistencies when the
guest later gets migrated to a host with a different memory layout.

> second, I'm not sure to what level users care about those reserved regions.
> At most it's same layout as physical so even sensitive users won't see it as
> a UFO. :-) and e820 is platform attributes so user shouldn't set assumption
> on it.

Just consider the case where, in order to accommodate the reserved
regions, low memory needs to be reduced from the default of over
3Gb to say 1Gb. If the guest OS then is incapable of using memory
above 4Gb (say Linux with HIGHMEM=n), there is a significant
difference to be seen by the user.

> 4) handle conflictions
> 
> there are several points discussed.
> 
> Jan raised a good point that reasonable assumption can be made to avoid
> split lowmen into scattered structure, i.e. assuming reserved region only 
> <1MB or >host lowmem. Scatter structure has an impact on RAM layout
> sharing between domain builder and hvmloader, and further per George's
> comment impacting qemu upstream. By doing that reasonable assumption
> domain builder can arrange lowmem always under high end reserved
> regions and thus preserve existing coarse-grained structure w/ low/highmem
> and mmio hole. definitely detection will still be done if a reserved region
> breaks that assumption but no attempt to break guest RAM to avoid 
> confliction.
> 
> following that hvmloader changes would become simpler too, with focus on
> BIOS/ACPI and PCI BARs.
> 
> other two ideas from Jan. One is to move more layout stuff (like PCI BAR, 
> etc.)

To clarify - I didn't mean libxc to do BAR assignments, all I meant was
that it would need to size the MMIO hole in a way that hvmloader can
do the assignments without needing to fiddle with the lowmem/highmem
split.

Jan


* Re: <summary-1> (v2) Design proposal for RMRR fix
  2015-01-09  9:46 ` Jan Beulich
@ 2015-01-09 10:26   ` Tian, Kevin
  2015-01-09 10:46     ` Jan Beulich
From: Tian, Kevin @ 2015-01-09 10:26 UTC
  To: Jan Beulich
  Cc: wei.liu2, ian.campbell, stefano.stabellini, george.dunlap, tim,
	ian.jackson, xen-devel, Zhang, Yang Z, Chen, Tiejun

> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Friday, January 09, 2015 5:46 PM
> 
> >>> On 09.01.15 at 07:57, <kevin.tian@intel.com> wrote:
> > 1) 'fail' vs. 'warn' upon gfn confliction
> >
> > Assigning device which fails RMRR confliction check (i.e. intended gfns
> > already allocated for other resources) actually brings unknown stability
> > problem (device may clobber those valid resources) and potentially security
> > issue within VM (but not worse than what a malicious driver can do w/o
> > virtual IOMMU).
> >
> > So by default we should not move forward if gfn confliction is detected
> > when setting up RMRR identity mapping, so-called a 'fail' policy.
> >
> > One open though, is whether we want to allow admin to override default
> > 'fail' policy with a 'warn' policy, i.e. throwing out confliction detail but
> > succeeding device assignment. USB is discussed as one example before
> > (hack how it works today upon <1MB confliction), so it might be good to
> > allow enthusiast trying device assignment, or provide flexibility to users
> > who already verified predicted potential problem not a real issue to their
> > specific deployment.
> >
> > I'd like to hear your votes on whether to provide such 'warn' option.
> 
> Yes, I certainly see value in such an option, to have a means to
> circumvent the other possible case of perceived regressions in not
> being able to pass through certain devices anymore.
> 
> > 1.1) per-device 'warn' vs. global 'warn'
> >
> > Both Tim/Jan prefer to 'warn' as a per-device option to the admin instead
> > of a global option.
> >
> > In a glimpse a per-device 'warn' option provides more fine-grained control
> > than a global option, however if thinking it carefully allowing one device
> > w/
> > potential problem isn't more correct or secure than allowing multiple
> > devices w/ potential problem. Even in practice a device like USB can
> > work bearing <1MB confliction, like Jan pointed out there's always corner
> > cases which we might not know so as long as we open door for one device,
> > it implies a problematic environment to users and user's judge on whether
> > he can live up to this problem is not impacted by how many devices the door
> > is opened for (he anyway needs to study warning message and do
> verification
> > if choosing to live up)
> >
> > Regarding to that, imo if we agree to provide 'warn' option, just providing
> > a global overriding option (definitely per-vm) is acceptable and simpler.
> 
> If the admin determined that ignoring the RMRR requirements for one
> devices is safe, that doesn't (and shouldn't) mean this is the case for
> all other devices too.

I don't think the admin can determine whether it's 100% safe. What the
admin can decide is whether he can live with the potential problem, based
on his purpose or on some experiments; only the device vendor knows when
and how the RMRR is used. So as long as 'warn' is enabled for one device,
it already means a problematic environment, and adding more devices is
just the same situation.

> 
> > 1.2) when to 'fail'
> >
> > There is one open whether we should fail immediately in domain builder
> > if a confliction is detected.
> >
> > Jan's comment is yes, we should 'fail' the VM creation as it's an error.
> >
> > My previous point is more mimicking native behavior, where a device
> > failure (in our case it's actually potential device failure since VM is not
> > powered yet) doesn't impact user until its function is actually touched.
> > In our case, even domain builder fails to re-arrange guest RAM to skip
> > reserved regions, we have centralized policy (either 'fail' or 'warn' per
> > above conclusion) in Xen hypervisor when the device is actually assigned.
> > so a 'warn' should be fine, but my insist on this is not strong.
> 
> See my earlier reply: Failure to add a device to me is more like a
> device preventing a bare metal system from coming up altogether.

Not all devices are required for bare metal to boot. A device causes a
problem only when it's actually used in the boot process. Say the disk
(inserted in a PCI slot) is broken at power-up (not sure whether you'd
call such a thing 'failure to add a device'); it only becomes an error
when the BIOS tries to read the disk.

Note the device assignment path is where it's actually decided whether a
device will be presented to the guest, not domain build time.

> 
> > and another point is about hotplug. 'fail' for future devices is too strict,
> > but to differentiate that from static-assigned devices, domain builder
> > will then need maintain a per-device reserved region structure. just
> > 'warn' makes things simple.
> 
> Whereas here I agree - hotplug should just fail (without otherwise
> impacting the guest).

so 'should' -> 'shouldn't'?

> 
> > 2) RMRR management
> >
> > George raised a good point that RMRR reserved regions can be maintained
> > in toolstack, and it's toolstack to tell Xen which regions to be reserved.
> > When
> > providing more flexibility, another benefit from Jan is to specify reserved
> > regions in another node (might-be-migrated-to) as a preparation for
> migration.
> >
> > When it sounds like a good long term plan, my feeling is that it might be
> > some parallel effort driving from toolstack experts. Xen can't simply rely
> > on user space to setup all necessary reserved regions, since it violates the
> > isolation philosophy in Xen. Whatever a toolstack may tell Xen, Xen still
> > needs to setup identity mapping for all reserved regions reported for the
> > assigned device.
> 
> Of course. If the tool stack failed to reserve a certain page in a guest's
> memory map, failure will result.
> 
> > So I still prefer to current way i.e. having Xen to organize reserve regions
> > according to assigned device, and then having libxc/hvmloader to query
> > to avoid confliction. In the future new interface can be created to allow
> > toolstack specific plain reserved regions for whatever reason to Xen, as
> > a compliment.
> 
> Indeed we should presumably allow for both - the guest config may
> specify regions independent on what the host properties are, yet by
> default reserved regions within the guest layout will depend on host
> properties (and guest config settings - as pointed out before, the
> default ought to be no reserved regions anyway). How much of this

If no devices are assigned, yes, there are no reserved regions at all. The
later report-all vs. report-sel question is only relevant where assignment
is concerned.

> gets implemented right away vs deferred until found necessary is a
> different question.

Agreed.

> 
> > 3) report-sel vs. report-all
> >
> > report-sel means report reserved regions selectively (all potentially-to-be
> > assigned are listed for hotplug, but doing this is not user friendly)
> >
> > report-all means report reserved regions for all available devices in this
> > platform (cover hotplug w/ enough flexibility)
> >
> > report-sel is opted by Jan from start, as report-all leaves some confusing
> > reserved regions to end user.
> >
> > otoh, our proposal seeks report-all as a simplified option, because we don't
> > think user should set assumption on e820 layout which is the platform
> > attribute and at most it's similar to a physical layout.
> >
> > first, report-all doesn't cause more conflictions than report-sel in a
> > reasonable
> > thinking. virtual platform is simpler than physical platform. Since those
> > regions
> > can be reserved in physical platform, it's reasonable to assume same
> > reservation
> > can succeed in virtual platform (putting 1MB confliction aside .
> 
> As again said in an earlier reply, tying the guest layout to the one of
> the host where it boots is going to lead to inconsistencies when the
> guest later gets migrated to a host with a different memory layout.

My point is that such inconsistency can arise even without this design; as
I said in the other reply, if you hot-remove a device (a boot device
having an RMRR reported) there's no way to erase that knowledge.

> 
> > second, I'm not sure to what level users care about those reserved regions.
> > At most it's same layout as physical so even sensitive users won't see it as
> > a UFO. :-) and e820 is platform attributes so user shouldn't set assumption
> > on it.
> 
> Just consider the case where, in order to accommodate the reserved
> regions, low memory needs to be reduced from the default of over
> 3Gb to say 1Gb. If the guest OS then is incapable of using memory
> above 4Gb (say Linux with HIGHMEM=n), there is a significant
> difference to be seen by the user.

That makes some sense... but if so, it's also a limitation on your
proposal below to avoid fiddling with lowmem, if there's a region at 1GB.
I think for this we can go with your earlier assumption that we only
support regions which are reasonably high, say 3GB. Violating that
assumption will be warned about (so guest RAM is not moved) and later
device assignment will fail.

> 
> > 4) handle conflictions
> >
> > there are several points discussed.
> >
> > Jan raised a good point that reasonable assumption can be made to avoid
> > split lowmen into scattered structure, i.e. assuming reserved region only
> > <1MB or >host lowmem. Scatter structure has an impact on RAM layout
> > sharing between domain builder and hvmloader, and further per George's
> > comment impacting qemu upstream. By doing that reasonable assumption
> > domain builder can arrange lowmem always under high end reserved
> > regions and thus preserve existing coarse-grained structure w/
> low/highmem
> > and mmio hole. definitely detection will still be done if a reserved region
> > breaks that assumption but no attempt to break guest RAM to avoid
> > confliction.
> >
> > following that hvmloader changes would become simpler too, with focus on
> > BIOS/ACPI and PCI BARs.
> >
> > other two ideas from Jan. One is to move more layout stuff (like PCI BAR,
> > etc.)
> 
> To clarify - I didn't mean libxc to do BAR assignments, all I meant was
> that it would need to size the MMIO hole in a way that hvmloader can
> do the assignments without needing to fiddle with the lowmem/highmem
> split.

Sorry for the misunderstanding.

Thanks
Kevin


* Re: <summary-1> (v2) Design proposal for RMRR fix
  2015-01-09 10:26   ` Tian, Kevin
@ 2015-01-09 10:46     ` Jan Beulich
  2015-01-12  9:25       ` Tian, Kevin
From: Jan Beulich @ 2015-01-09 10:46 UTC
  To: Kevin Tian
  Cc: wei.liu2, ian.campbell, stefano.stabellini, george.dunlap, tim,
	ian.jackson, xen-devel, Yang Z Zhang, Tiejun Chen

>>> On 09.01.15 at 11:26, <kevin.tian@intel.com> wrote:
>>  From: Jan Beulich [mailto:JBeulich@suse.com]
>> >>> On 09.01.15 at 07:57, <kevin.tian@intel.com> wrote:
>> > 1.1) per-device 'warn' vs. global 'warn'
>> >
>> > Both Tim/Jan prefer to 'warn' as a per-device option to the admin instead
>> > of a global option.
>> >
>> > In a glimpse a per-device 'warn' option provides more fine-grained control
>> > than a global option, however if thinking it carefully allowing one device
>> > w/
>> > potential problem isn't more correct or secure than allowing multiple
>> > devices w/ potential problem. Even in practice a device like USB can
>> > work bearing <1MB confliction, like Jan pointed out there's always corner
>> > cases which we might not know so as long as we open door for one device,
>> > it implies a problematic environment to users and user's judge on whether
>> > he can live up to this problem is not impacted by how many devices the door
>> > is opened for (he anyway needs to study warning message and do
>> verification
>> > if choosing to live up)
>> >
>> > Regarding to that, imo if we agree to provide 'warn' option, just providing
>> > a global overriding option (definitely per-vm) is acceptable and simpler.
>> 
>> If the admin determined that ignoring the RMRR requirements for one
>> devices is safe, that doesn't (and shouldn't) mean this is the case for
>> all other devices too.
> 
> I don't think admin can determine whether it's 100% safe. What admin can 
> decide is whether he lives up to the potential problem based on his purpose
> or based on some experiments. only device vendor knows when and how
> RMRR is used. So as long as warn is opened for one device, I think it
> already means a problem environment and then adding more device is
> just same situation.

What if the admin consulted the device and BIOS vendors, and got
assured there's not going to be any accesses to the reserved regions
post-boot?

>> > 1.2) when to 'fail'
>> >
>> > There is one open whether we should fail immediately in domain builder
>> > if a confliction is detected.
>> >
>> > Jan's comment is yes, we should 'fail' the VM creation as it's an error.
>> >
>> > My previous point is more mimicking native behavior, where a device
>> > failure (in our case it's actually potential device failure since VM is not
>> > powered yet) doesn't impact user until its function is actually touched.
>> > In our case, even domain builder fails to re-arrange guest RAM to skip
>> > reserved regions, we have centralized policy (either 'fail' or 'warn' per
>> > above conclusion) in Xen hypervisor when the device is actually assigned.
>> > so a 'warn' should be fine, but my insist on this is not strong.
>> 
>> See my earlier reply: Failure to add a device to me is more like a
>> device preventing a bare metal system from coming up altogether.
> 
> not all devices are required for bare metal to boot. it causes problem
> only when it's being used in the boot process. say at powering up the
> disk (insert in the PCI slot) is broken (not sure whether you call such
> thing as 'failure to add a device'), it is only error when BIOS tries to
> read disk.

Not necessarily. Any malfunctioning device touched by the BIOS,
irrespective of whether the device is needed for booting, can cause
the boot process to hang. Again, the analogy to bare metal is
device presence, not whether the device is functioning properly.

> note device assignment path is the actual path to decide whether a
> device will be present to the guest. not at this domain build time.

That would only make a marginal difference in time of when domain
creation fails.

>> > and another point is about hotplug. 'fail' for future devices is too 
> strict,
>> > but to differentiate that from static-assigned devices, domain builder
>> > will then need maintain a per-device reserved region structure. just
>> > 'warn' makes things simple.
>> 
>> Whereas here I agree - hotplug should just fail (without otherwise
>> impacting the guest).
> 
> so 'should' -> 'shoundn't'?

No. Perhaps what you imply from fail is different from my reading:
I mean this to be the result of the hotplug operation - the device
would just not appear in the guest. The guest isn't to be brought
down because of such failure (i.e. behavior here is different from
the boot time assignment, where the guest would be prevented
from coming up).

>> > second, I'm not sure to what level users care about those reserved regions.
>> > At most it's same layout as physical so even sensitive users won't see it as
>> > a UFO. :-) and e820 is platform attributes so user shouldn't set assumption
>> > on it.
>> 
>> Just consider the case where, in order to accommodate the reserved
>> regions, low memory needs to be reduced from the default of over
>> 3Gb to say 1Gb. If the guest OS then is incapable of using memory
>> above 4Gb (say Linux with HIGHMEM=n), there is a significant
>> difference to be seen by the user.
> 
> that makes some sense... but if yes it's also a limitation to your below
> proposal on avoid fiddling lowmem, if there's a region at 1GB. I think
> for this we can go with your earlier assumption, that we only support
> the case which is reasonable high say 3G. violating that assumption
> will be warned (so guest RAM is not moved) and later device assignment
> will fail.

No, we shouldn't put in arbitrary restrictions on where RMRRs can sit.
If there's one at 1Gb, and the associated device is to be passed through,
so be it. All I wanted to make clear is that the report-all approach is
going to have too heavy an impact on the guest.

Jan


* Re: <summary-1> (v2) Design proposal for RMRR fix
  2015-01-09 10:46     ` Jan Beulich
@ 2015-01-12  9:25       ` Tian, Kevin
From: Tian, Kevin @ 2015-01-12  9:25 UTC
  To: Jan Beulich
  Cc: wei.liu2, ian.campbell, stefano.stabellini, george.dunlap, tim,
	ian.jackson, xen-devel, Zhang, Yang Z, Chen, Tiejun

> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Friday, January 09, 2015 6:46 PM
> 
> >>> On 09.01.15 at 11:26, <kevin.tian@intel.com> wrote:
> >>  From: Jan Beulich [mailto:JBeulich@suse.com]
> >> >>> On 09.01.15 at 07:57, <kevin.tian@intel.com> wrote:
> >> > 1.1) per-device 'warn' vs. global 'warn'
> >> >
> >> > Both Tim/Jan prefer to 'warn' as a per-device option to the admin instead
> >> > of a global option.
> >> >
> >> > In a glimpse a per-device 'warn' option provides more fine-grained
> control
> >> > than a global option, however if thinking it carefully allowing one device
> >> > w/
> >> > potential problem isn't more correct or secure than allowing multiple
> >> > devices w/ potential problem. Even in practice a device like USB can
> >> > work bearing <1MB confliction, like Jan pointed out there's always corner
> >> > cases which we might not know so as long as we open door for one
> device,
> >> > it implies a problematic environment to users and user's judge on
> whether
> >> > he can live up to this problem is not impacted by how many devices the
> door
> >> > is opened for (he anyway needs to study warning message and do
> >> verification
> >> > if choosing to live up)
> >> >
> >> > Regarding to that, imo if we agree to provide 'warn' option, just providing
> >> > a global overriding option (definitely per-vm) is acceptable and simpler.
> >>
> >> If the admin determined that ignoring the RMRR requirements for one
> >> devices is safe, that doesn't (and shouldn't) mean this is the case for
> >> all other devices too.
> >
> > I don't think admin can determine whether it's 100% safe. What admin can
> > decide is whether he lives up to the potential problem based on his purpose
> > or based on some experiments. only device vendor knows when and how
> > RMRR is used. So as long as warn is opened for one device, I think it
> > already means a problem environment and then adding more device is
> > just same situation.
> 
> What if the admin consulted the device and BIOS vendors, and got
> assured there's not going to be any accesses to the reserved regions
> post-boot?

That consultation could still be inaccurate, or human error may happen.

> 
> >> > 1.2) when to 'fail'
> >> >
> >> > There is one open whether we should fail immediately in domain builder
> >> > if a confliction is detected.
> >> >
> >> > Jan's comment is yes, we should 'fail' the VM creation as it's an error.
> >> >
> >> > My previous point is more mimicking native behavior, where a device
> >> > failure (in our case it's actually potential device failure since VM is not
> >> > powered yet) doesn't impact user until its function is actually touched.
> >> > In our case, even domain builder fails to re-arrange guest RAM to skip
> >> > reserved regions, we have centralized policy (either 'fail' or 'warn' per
> >> > above conclusion) in Xen hypervisor when the device is actually assigned.
> >> > so a 'warn' should be fine, but my insist on this is not strong.
> >>
> >> See my earlier reply: Failure to add a device to me is more like a
> >> device preventing a bare metal system from coming up altogether.
> >
> > not all devices are required for bare metal to boot. it causes problem
> > only when it's being used in the boot process. say at powering up the
> > disk (insert in the PCI slot) is broken (not sure whether you call such
> > thing as 'failure to add a device'), it is only error when BIOS tries to
> > read disk.
> 
> Not necessarily. Any malfunctioning device touched by the BIOS,
> irrespective of whether the device is needed for booting, can cause
> the boot process to hang. Again, the analogy to bare metal is
> device presence, not whether the device is functioning properly.
> 
> > note device assignment path is the actual path to decide whether a
> > device will be present to the guest. not at this domain build time.
> 
> That would only make a marginal difference in time of when domain
> creation fails.

it's not marginal difference. instead it's about who owns the policy.

To me, detecting/avoiding conflicts in the domain builder is just a
preparation for later device assignment (either deterministic static
assignment or non-deterministic hotplug). As a preparation, a failure
there doesn't need to be a blocker that prevents guest boot. Instead,
leave the decision to where device assignment actually happens, since
that is where a hard requirement on any conflict is made at the moment.
Then we just follow the existing device assignment policy (either block
guest boot, or move forward w/o presenting the device) if a conflict is
treated as a failure by default (w/o the 'warn' override). A sketch of
that assignment-time check follows below.
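
To make the split concrete, here is a minimal sketch (in C; all names
are invented for illustration, this is not actual Xen code) of the
centralized check on the assignment path:

#include <stdbool.h>
#include <stdio.h>

struct rmrr_region { unsigned long start_pfn, end_pfn; };

/* Stub: a real implementation would ask the p2m whether any gfn in
 * [start, end) is already allocated to RAM or another resource. */
static bool gfn_range_conflicts(unsigned long start, unsigned long end)
{
    (void)start; (void)end;
    return true;   /* pretend a conflict was found, for the example */
}

/* Centralized policy at assignment time: 'fail' by default, with an
 * optional per-VM 'warn' override supplied by the admin. */
static bool rmrr_may_assign(const struct rmrr_region *r, bool warn_override)
{
    if (!gfn_range_conflicts(r->start_pfn, r->end_pfn))
        return true;                   /* identity mapping can be set up */

    fprintf(stderr, "RMRR conflict at pfn %lx-%lx\n",
            r->start_pfn, r->end_pfn); /* always report the details */

    return warn_override;              /* 'warn': proceed despite conflict */
}

int main(void)
{
    struct rmrr_region r = { 0xad000, 0xaf000 };
    printf("assignment allowed: %d\n", rmrr_may_assign(&r, false));
    return 0;
}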

> 
> >> > and another point is about hotplug. 'fail' for future devices is too strict,
> >> > but to differentiate that from static-assigned devices, domain builder
> >> > will then need maintain a per-device reserved region structure. just
> >> > 'warn' makes things simple.
> >>
> >> Whereas here I agree - hotplug should just fail (without otherwise
> >> impacting the guest).
> >
> > so 'should' -> 'shouldn't'?
> 
> No. Perhaps what you imply from fail is different from my reading:
> I mean this to be the result of the hotplug operation - the device
> would just not appear in the guest. The guest isn't to be brought
> down because of such failure (i.e. behavior here is different from
> the boot time assignment, where the guest would be prevented
> from coming up).

Yes, the guest shouldn't be blocked for a failure that lies only in a
possible future. But to differentiate such a case from static
assignment, as you proposed earlier, we would need to whitelist all
devices that could potentially be hotplugged, which is user-unfriendly
to figure out. That's why I want to check whether just report-all is
simple enough w/o big impact (as we discussed in another mail, we can't
make the boot-time reservation adapt to dynamic device changes later,
so at some point the user may see unrelated reserved regions anyway).
A sketch contrasting the two reporting modes follows below.
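
As a rough illustration of the two reporting modes, assuming the
hypervisor can enumerate (device, region) pairs; the rmrr_entry type
and the is_assigned() helper are made up for the example:

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

struct rmrr_entry {
    unsigned int sbdf;                 /* owning device (seg/bus/dev/fn) */
    unsigned long start_pfn, end_pfn;  /* its reserved region */
};

static bool is_assigned(unsigned int sbdf,
                        const unsigned int *assigned, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (assigned[i] == sbdf)
            return true;
    return false;
}

/* report-sel: only regions of devices on the assignment list.
 * report-all: every region on the platform, so a later hotplug of any
 * device finds its region already reserved in the guest layout. */
static void report(const struct rmrr_entry *tab, size_t n,
                   const unsigned int *assigned, size_t n_assigned,
                   bool report_all)
{
    for (size_t i = 0; i < n; i++)
        if (report_all || is_assigned(tab[i].sbdf, assigned, n_assigned))
            printf("reserve pfn %lx-%lx\n",
                   tab[i].start_pfn, tab[i].end_pfn);
}

int main(void)
{
    const struct rmrr_entry tab[] = {
        { 0x00f8, 0xad000, 0xaf000 },  /* e.g. a USB controller */
        { 0x0100, 0x40000, 0x40800 },  /* unrelated device near 1GB */
    };
    const unsigned int assigned[] = { 0x00f8 };

    report(tab, 2, assigned, 1, true);   /* report-all: both regions */
    report(tab, 2, assigned, 1, false);  /* report-sel: first one only */
    return 0;
}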

> 
> >> > second, I'm not sure to what level users care about those reserved regions.
> >> > At most it's same layout as physical so even sensitive users won't see it as
> >> > a UFO. :-) and e820 is platform attributes so user shouldn't set
> >> > assumption on it.
> >>
> >> Just consider the case where, in order to accommodate the reserved
> >> regions, low memory needs to be reduced from the default of over
> >> 3Gb to say 1Gb. If the guest OS then is incapable of using memory
> >> above 4Gb (say Linux with HIGHMEM=n), there is a significant
> >> difference to be seen by the user.
> >
> > That makes some sense... but if so, it's also a limitation of your
> > proposal below on avoiding fiddling with lowmem, if there's a region
> > at 1GB. I think for this we can go with your earlier assumption that
> > we only support the case where regions sit reasonably high, say 3G.
> > Violating that assumption will be warned about (so guest RAM is not
> > moved) and later device assignment will fail.
> 
> No, we shouldn't put in arbitrary restrictions on where RMRRs can sit.
> If there's one at 1Gb, and the associated device is to be passed through,
> so be it. All I wanted to make clear is that the report-all approach is
> going to have too heavy an impact on the guest.
> 

I think the key here goes back to the policy discussion about the domain
builder above.

If we think 'warn' in the domain builder is acceptable, then report-all
doesn't make things worse. If the 1G conflict is caused by a statically
assigned device, both report-all and report-sel will fail in the later
assignment path. If the 1G conflict is caused by an unrelated device,
report-all will only throw a warning but no failure, since that device
is not assigned. Hotplug is the same for both.

It's only a problem for report-all if we treat any conflict as a
'failure' that blocks guest boot in the domain builder.

If 'warn' is acceptable, then the only impact we discussed is that
report-all leaves more reserved regions in the guest layout than
report-sel, but per our other discussion I haven't seen that as a hard
problem so far. A toy sketch of how the builder would carve those
regions out of guest RAM follows below.
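
For concreteness, a sketch of the hole-punching the domain builder would
do with whichever regions get reported; purely illustrative, not the
real libxc/hvmloader logic:

#include <stdio.h>

#define E820_RAM      1
#define E820_RESERVED 2

struct e820entry { unsigned long long addr, size; int type; };

/* Split any RAM entry overlapping [start, end) around that hole; a
 * zero-size leftover RAM entry is tolerated in this toy version. */
static int punch_hole(struct e820entry *map, int n, int max,
                      unsigned long long start, unsigned long long end)
{
    for (int i = 0; i < n; i++) {
        unsigned long long e = map[i].addr + map[i].size;
        if (map[i].type != E820_RAM || start >= e || end <= map[i].addr)
            continue;
        if (end < e && n < max)    /* keep the RAM tail above the hole */
            map[n++] = (struct e820entry){ end, e - end, E820_RAM };
        map[i].size = start > map[i].addr ? start - map[i].addr : 0;
        if (n < max)               /* mark the hole itself reserved */
            map[n++] = (struct e820entry){ start, end - start,
                                           E820_RESERVED };
    }
    return n;
}

int main(void)
{
    /* 3G of lowmem, with one reported region punched out of it */
    struct e820entry map[8] = { { 0, 0xc0000000ULL, E820_RAM } };
    int n = punch_hole(map, 1, 8, 0xad000000ULL, 0xaf000000ULL);

    for (int i = 0; i < n; i++)
        printf("%llx-%llx type %d\n", map[i].addr,
               map[i].addr + map[i].size, map[i].type);
    return 0;
}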

Thanks
Kevin

* Re: <summary-1> (v2) Design proposal for RMRR fix
  2015-01-09  6:57 <summary-1> (v2) Design proposal for RMRR fix Tian, Kevin
  2015-01-09  9:46 ` Jan Beulich
@ 2015-01-12 13:42 ` Ian Campbell
  2015-01-12 13:53   ` Tian, Kevin
  1 sibling, 1 reply; 8+ messages in thread
From: Ian Campbell @ 2015-01-12 13:42 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: wei.liu2, stefano.stabellini, george.dunlap, tim, ian.jackson,
	xen-devel, Jan Beulich, Zhang, Yang Z, Chen, Tiejun

On Fri, 2015-01-09 at 06:57 +0000, Tian, Kevin wrote:
> 3) report-sel vs. report-all

One thing I'm not clear on is whether you are suggesting to reserve RMRR
(either -all or -sel) for every domain by default, or whether the guest
CFG will need to explicitly opt-in, IOW is there a 3rd report-none
option which is the default unless otherwise requested (e.g. by
e820_host=1, or some other new option)?

Ian.

* Re: <summary-1> (v2) Design proposal for RMRR fix
  2015-01-12 13:42 ` Ian Campbell
@ 2015-01-12 13:53   ` Tian, Kevin
  2015-01-12 13:57     ` Ian Campbell
  0 siblings, 1 reply; 8+ messages in thread
From: Tian, Kevin @ 2015-01-12 13:53 UTC (permalink / raw)
  To: Ian Campbell
  Cc: wei.liu2, stefano.stabellini, george.dunlap, tim, ian.jackson,
	xen-devel, Jan Beulich, Zhang, Yang Z, Chen, Tiejun

> From: Ian Campbell [mailto:Ian.Campbell@citrix.com]
> Sent: Monday, January 12, 2015 9:42 PM
> 
> On Fri, 2015-01-09 at 06:57 +0000, Tian, Kevin wrote:
> > 3) report-sel vs. report-all
> 
> One thing I'm not clear on is whether you are suggesting to reserve RMRR
> (either -all or -sel) for every domain by default, or whether the guest
> CFG will need to explicitly opt-in, IOW is there a 3rd report-none
> option which is the default unless otherwise requested (e.g. by
> e820_host=1, or some other new option)?

Only when a device is assigned (or to prepare for potential hotplug
usage). report-all/sel is about what the hypervisor tells userspace:
all RMRR regions on the platform, or just the RMRR regions belonging
to the specified devices. 'e820_host' only makes holes as preparation
(more than RMRR requires). Finally, we still need to query the actual
reserved regions reported by the hypervisor, mark them reserved in the
guest e820 table, and avoid conflicts for PCI BAR allocation etc. A
sketch of that query-and-mark flow follows below.
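
A rough sketch of that flow; xc_get_reserved_regions() here is a
made-up stand-in for whatever query interface the hypervisor eventually
exposes:

#include <stdbool.h>
#include <stdio.h>

struct region { unsigned long long start, end; };

/* Made-up stand-in for the hypervisor query interface. */
static int xc_get_reserved_regions(struct region *out, int max)
{
    if (max < 1)
        return 0;
    out[0] = (struct region){ 0xad000000ULL, 0xaf000000ULL };
    return 1;
}

static bool overlaps(const struct region *r,
                     unsigned long long s, unsigned long long e)
{
    return s < r->end && e > r->start;
}

/* First-fit BAR placement that skips the reserved regions. A single
 * pass is enough for this example; a real allocator would re-scan
 * after bumping past a region. */
static unsigned long long place_bar(unsigned long long base,
                                    unsigned long long size,
                                    const struct region *rsvd, int n)
{
    for (int i = 0; i < n; i++)
        if (overlaps(&rsvd[i], base, base + size))
            base = rsvd[i].end;
    return base;
}

int main(void)
{
    struct region rsvd[16];
    int n = xc_get_reserved_regions(rsvd, 16);

    /* ... mark rsvd[] as reserved in the guest e820 (see the earlier
     * hole-punching sketch), then allocate BARs around them ... */
    printf("BAR placed at %llx\n",
           place_bar(0xac000000ULL, 0x2000000ULL, rsvd, n));
    return 0;
}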

Thanks
Kevin

* Re: <summary-1> (v2) Design proposal for RMRR fix
  2015-01-12 13:53   ` Tian, Kevin
@ 2015-01-12 13:57     ` Ian Campbell
  0 siblings, 0 replies; 8+ messages in thread
From: Ian Campbell @ 2015-01-12 13:57 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: wei.liu2, stefano.stabellini, george.dunlap, tim, ian.jackson,
	xen-devel, Jan Beulich, Zhang, Yang Z, Chen, Tiejun

On Mon, 2015-01-12 at 13:53 +0000, Tian, Kevin wrote:
> > From: Ian Campbell [mailto:Ian.Campbell@citrix.com]
> > Sent: Monday, January 12, 2015 9:42 PM
> > 
> > On Fri, 2015-01-09 at 06:57 +0000, Tian, Kevin wrote:
> > > 3) report-sel vs. report-all
> > 
> > One thing I'm not clear on is whether you are suggesting to reserve RMRR
> > (either -all or -sel) for every domain by default, or whether the guest
> > CFG will need to explicitly opt-in, IOW is there a 3rd report-none
> > option which is the default unless otherwise requested (e.g. by
> > e820_host=1, or some other new option)?
> 
> Only when a device is assigned (or to prepare for potential hotplug usage).

How is this triggered though? Via pci= non-empty in the guest CFG?

This sounds like it should behave similarly (in terms of when it is
enabled, not necessarily in functionality) to the e820_host boolean,
which is set by default if the pci= list is not empty, or can be
overridden if the host admin anticipates doing hotplug.
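
For illustration, that defaulting might look like the sketch below;
the type and field names are assumptions, not the actual libxl code:

struct guest_cfg {
    int num_pcidevs;   /* length of the pci= list in the guest CFG */
    int e820_host;     /* -1 = unset, 0 = off, 1 = on */
};

/* Enable host e820 reservation when pci= is non-empty, unless the
 * admin set it explicitly (e.g. to anticipate hotplug). */
static void apply_e820_default(struct guest_cfg *cfg)
{
    if (cfg->e820_host == -1)
        cfg->e820_host = cfg->num_pcidevs > 0;
}

int main(void)
{
    struct guest_cfg cfg = { .num_pcidevs = 1, .e820_host = -1 };
    apply_e820_default(&cfg);
    return cfg.e820_host ? 0 : 1;  /* enabled: pci= list is non-empty */
}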

Ian.
