[PATCH 0/8] Relocate devices rather than memory for qemu-xen

* [PATCH 0/8] Relocate devices rather than memory for qemu-xen
@ 2013-06-21 10:46 George Dunlap
  2013-06-21 10:46 ` [PATCH v4 1/8] hvmloader: Remove all 64-bit print arguments George Dunlap
                   ` (7 more replies)
  0 siblings, 8 replies; 29+ messages in thread
From: George Dunlap @ 2013-06-21 10:46 UTC (permalink / raw)
  To: xen-devel
  Cc: Keir Fraser, Ian Campbell, Hanweidong, George Dunlap,
	Stefano Stabellini, Ian Jackson

This is the third version of a patch series to address the issue of
qemu-xen not being able to handle moving guest memory in order to
resize the lowmem MMIO hole.

A brief summary can be seen below:

- 1/8 hvmloader: Remove all 64-bit print arguments
- 2/8 hvmloader: Make the printfs more informative
- 3/8 hvmloader: Set up highmem resouce appropriately if there is no RAM above 4G
A 4/8 hvmloader: Fix check for needing a 64-bit bar
A 5/8 hvmloader: Correct bug in low mmio region accounting
A 6/8 hvmloader: Load large devices into high MMIO space as needed
A 7/8 hvmloader: Remove minimum size for BARs to relocate to 64-bit space
- 8/8 libxl,hvmloader: Don't relocate memory for MMIO hole

Key 
 -: Changes in v4
 A: Reviewed / Acked by 2 or more people

= The situation = 

The default MMIO hole for Xen systems starts at 0xf0000000; this
leaves just under 256MiB of space for PCI devices.  At the moment,
hvmloader will scan the pci bus for devices and resize this hole, up
to 2GiB (i.e., starting at 0x80000000) to make space, relocating any
overlapping guest memory above 4GiB (0x100000000).  (After that point,
if there is still not enough space, the intention seemed to be that it
would begin mapping devices with 64-bit-capabile BARs into high memory
as well, just above the end of RAM; however, there seems to be a bug
in the code which detects this condition; it is likely that the 64-bit
remapping code was never capable of being triggered.)

We expect the default MMIO hole to be insufficient only when passing
through devices to guests.

This works fine for qemu-traditional, but qemu-xen unfortunately has
expectations of where guest memory will be, and will get confused if
it moves.  If hvmloader does relocate guest RAM, then at some point
qemu will try to map that pfn space, resulting in a seg fault and qemu
crashing.

hvmloader of course will only move RAM if it would overlap the MMIO
region; this means that if the guest has a small enough amount of RAM
-- say, 2GiB -- then this "memory move" condition will also never be
triggered. 

So at the moment, then under the following conditions:
 * A user is passing through a device or set of devices requiring more
 MMIO space than the default MMIO hole
 * The user has enough memory that resizing the hole will overlap
 guest memory
 * The user is using qemu-xen (not qemu-traditoinal)
then the user will shortly after boot experience qemu crashing.

= The proposed fix = 

This patch series makes the following functional changes:

The core change is this:

 * When running qemu-xen, don't resize the MMIO hole; instead rely on
devices being moved into the 64-bit MMIO region.

In order to make this more effective, we also make the following
changes to the 64-bit relocation code:

 * Allow devices smaller than 512MiB to be relocated to the high MMIO
region.

 * When moving devices into the 64-bit MMIO region, start with the
ones with the largest BARs, and only relocate them if there is not
enough space for all the remaining BARs

= Risk analysis =

There are two kinds of risks: risks due to unintended changes (i.e., a
bug in the patch itself), and risks due to intended changes.

We hope that we can solve the first by a combination of testing and
code review.

The rest of this analysis will assume that the patch is correct, and
will try to do a risk analysis on the effects of the patch.

The main risk is that moving some devices into 64-bit memory will
cause problems with the operation of those devices.  Relocating a
device may have the following outcomes:
 1. In the best case, the relocated device will Just Work.  
 2. A relocated device may fail in a way that leaves the OS intact: the
guest OS may not be able to see them, or the driver may not load.
 3. A relocated device may fail in a way that crashes the guest OS: the
driver may crash, or one of the relocated devices which fails may be
system-critical.
 4. A relocated device may fail in a way which is unpredictable, but
does not cause data loss: crashing the guest randomly at some point in
the future, or causing strange quirks in functionality (e.g.,
network connectivity dropping, glitches when watching video).
 5. A relocated device may fail in a way that is unpredictable, and
corrupts data.

Outcomes 1-3 are equivalent or strictly better than crashing within a
few minutes of boot.  Outcome 4 is arguably also not much worse.

The main risk to our users would be #5.  However:
 - This is definitely a bug in the driver, OS, or the hardware
 - This is a bug that might be seen running on real hardware, or in
KVM (or some other hypervisor)
 - This is not a bug that we would be likely to catch, even if we had
a full development cycle worth of testing.

I think we should therefore not worry about #5, and consider in
general that relocating a device into 64-bit space will be no worse,
and potentially better, than crashing within a few minutes of boot.

There is another risk with this method, which is that a user may end
up passing through a number of devices with NON-64-bit BARs such that
the devices cannot all fit in the default lowmem MMIO region, but also
cannot be remapped above the 64-bit region.  If this is the case, then
some devices will simply not be able to be mapped.  If these
non-mapped devices are system critical, the VM will not boot; if they
are not, then the devices will simply be invisible.  Both of these are
either no worse than, and potentially better than, crashing within a
few minutes of boot.

Starting with all VMs:

Any VM running in PV mode will be unaffected.

Any VM running in HVM mode but not passing through devices will be
unaffected.

Any VM running in HVM mode and passing through devices that fit inside
the default MMIO space will be unaffected.

Any VM running in HVM mode, and passing through devices that require
less than 2GiB of MMIO space, *and* having a low enough guest memory
that the MMIO hole can be enlarged without moving guest memory, will
be unaffected.  (For example, if you need 512MiB and you have <3584
MiB of guest RAM; or if you need 1024MiB and have <3072 MiB of guest
RAM.)

Any VM running in HVM mode, is passing through devices requiring less
than 2GiB of MMIO space, and is using qemu-traditional will be
unaffected.

For a VM running in HVM mode, passing through devices which require more
than 2GiB of MMIO space, and using qemu-traditional, and having more
than 2GiB of guest memory:  
 * We believe that at the moment what will happen is that because of a
bug in hvmloader (fixed in this series), no devices will be mapped in
64-bit space; instead, the smallest devices will simply not be mapped.
This will likely cause critical platform devices not to be mapped,
causing the VM not to be able to boot.
 * With this patch, the largest devices *will* be remapped into 64-bit
space.  
 * If we are right that the current code will fail, this is a uniform
improvement, even if the devices don't work.  
 * If the current code would work, then a different set of devices
will be re-mapped to high memory.  This may change some configurations
from "works" into "doesn't work".

I think this is a small enough contingent of users, that this is an
acceptable amount of risk to take.

We have now covered all configurations of qemu-traditional.  Since
xend only knows how to use qemu-traditional, this also covers all
configurations using xend.

For VMs running in HVM mode, using qemu-xen, but not using libxl, this
patch will have no effect: qemu-xen will crash.  NB that this cannot
include xend, as it only knows how to drive qemu-traditional.  This
can be worked around by using qemu-traditional instead, or by setting
the appropriate xenstore key on boot.  This is acceptable, because
this is not really a supported configuration; users should use one of
the supported toolstacks, or use libxl.

We have now covered all users of any non-libxl-based toolstack.

For VM running in HVM mode, using qemu-xen, using libxl, and passing
through devices such that the required 32-bit only MMIO space does not
fit in the default MMIO hole, and with enough memory that resizing the
MMIO hole requires moving guest RAM:
 * At the moment, hvmloader will relocate guest memory.  This will
cause qemu-xen to crash within a few minutes.
 * With this change, the devices with the smallest BARs will simply
not be mapped.  If these devices are non-critical, they will simply be
invisible to the OS; if these devices are critical, the OS will not
boot.

Crashing immediately or having non-visible devices are the same or
better than crashing a few minutes into boot, so this is an
improvement (or at least not a regression).

For a VM running in HVM mode, using qemu-xen, using libxl, having a
required 32-bit only MMIO space that does fit within the default MMIO
hole, but a total MMIO space that does not, and having enough memory
that resizing the MMIO hole requires moving guest RAM:
 * At the moment, hvmloader will relocate memory.  This will cause
qemu-xen to crash within a few minutes of booting.  Note that this is
true whether the total MMIO space is less than 2GiB or more.
 * With this change, devices with the largest BARs will be relocated
to 64-bit space.  We expect that in general, the devices thus
relocated will be the passed-through PCI devices.

We have decided already to consider any outcome of mapping a device
into a 64-bit address space to be no worse than, and potentially
better than, qemu-xen crashing; so this can be considered an improvement.

We have now covered all possible configurations.

In summary:
 * The vast majority of configurations are unaffected
 * For those that are affected, the vast majority are either a strict
improvement, or no worse than, the status quo.
 * There is a slight possibility that in one extreme corner case
(using qemu-traditional with >2GiB of MMIO space), we may possibly be
changing "works" into "fails".  I think this is an acceptable risk.

Therefore, I think the risks posed by this change are acceptable.

CC: George Dunlap <george.dunlap@eu.citrix.com>
CC: Ian Campbell <ian.campbell@citrix.com>
CC: Ian Jackson <ian.jackson@citrix.com>
CC: Stefano Stabellini <stefano.stabellini@citrix.com>
CC: Hanweidong <hanweidong@huawei.com>
CC: Keir Fraser <keir@xen.org>

^ permalink raw reply	[flat|nested] 29+ messages in thread