* [PATCH 0 of 4] Patches for PCI passthrough with modified E820 (v4.2).
@ 2011-05-26 14:04 Konrad Rzeszutek Wilk
  2011-05-26 14:04 ` [PATCH 1 of 4] tools: Add xc_domain_set_memory_map and xc_get_machine_memory_map calls (x86, amd64 only) Konrad Rzeszutek Wilk
                   ` (4 more replies)
  0 siblings, 5 replies; 7+ messages in thread
From: Konrad Rzeszutek Wilk @ 2011-05-26 14:04 UTC (permalink / raw)
  To: xen-devel, Ian.Jackson, stefano.stabellini, Ian.Campbell; +Cc: konrad.wilk

Hello

This set of v4.2 patches allows a PV domain to see the machine's
E820, figure out where the PCI I/O gap is, and match it with reality.

Changes since v4.1 posting:
 - Fixed compile problems under 32-bit builds.
[v4.1]:
 - Added 'e820_host' and fixed tab/space issues. Also by mistake posted
   that patchset as v3.1 which was incorrect.
[v3 posting]
 - Applied Ian's review comments. Retested with SLES, RHEL5, NetBSD.
[v2 posting]:
 - Moved 'libxl__e820_alloc' to be called from do_domain_create and if
   machine_e820 == true.
 - Made no_machine_e820 be set to true, if the guest has no PCI devices (and
   is PV)
 - Used Keir's re-worked code for E820 creation.
[v1 posting]:
 - Squashed the "x86: make the pv-only e820 array be dynamic" and
   "x86: adjust the size of the e820 for pv guest to be dynamic" together.
 - Made xc_domain_set_memmap_limit use the 'xc_domain_set_memory_map'
 - Moved 'libxl_e820_alloc' and 'libxl_e820_sanitize' to be an internal
   operation and called from 'libxl_device_pci_parse_bdf'.
 - Expanded 'libxl_device_pci_parse_bdf' API call to have an extra argument
   (optional).

The short version is that with these patches a PV domain can:

 - Use the correct PCI I/O gap. Before these patches, a Linux guest would
   boot up and report:
   [    0.000000] Allocating PCI resources starting at 40000000 (gap: 40000000:c0000000)
   while in actuality the PCI I/O gap should have been:
   [    0.000000] Allocating PCI resources starting at b0000000 (gap: b0000000:4c000000)

 - Run with more than 3GB. A PV domain with PCI devices was limited to 3GB.
   It can now be booted with 4GB, 8GB, or whatever amount you want. The PCI
   devices will _not_ conflict with System RAM, meaning the drivers can load.

 - With 2.6.39 kernels (which have the 1-1 mapping code), the VM_IO flag will
   now be automatically applied to regions that are considered PCI I/O regions.
   You can find out which those are by looking for '1-1' in the kernel boot log.
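
The two "gap:" values in the kernel messages above can be reproduced with a
small search for the largest hole below 4GB. What follows is only a sketch of
that idea, not the kernel's own routine: it assumes the map is sorted by
address and non-overlapping, the struct is a local stand-in for the one in
xc_e820.h, and largest_gap_below_4g is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local stand-in for struct e820entry from xc_e820.h. */
struct e820entry {
    uint64_t addr;
    uint64_t size;
    uint32_t type;
};

#define GAP_LIMIT (1ULL << 32)   /* only look for holes below 4GB */

/* Hypothetical helper: return the size of the largest hole below 4GB and
 * store where it begins in *gap_start. The map must be sorted by address. */
static uint64_t largest_gap_below_4g(const struct e820entry *map, size_t nr,
                                     uint64_t *gap_start)
{
    uint64_t prev_end = 0, best = 0;
    size_t i;

    assert(gap_start != NULL && (map != NULL || nr == 0));
    *gap_start = 0;

    /* The extra iteration (i == nr) treats the 4GB limit as a final entry. */
    for (i = 0; i <= nr && prev_end < GAP_LIMIT; i++) {
        uint64_t start = (i < nr) ? map[i].addr : GAP_LIMIT;
        uint64_t end = (i < nr) ? map[i].addr + map[i].size : GAP_LIMIT;

        if (start > GAP_LIMIT)
            start = GAP_LIMIT;
        if (start > prev_end && start - prev_end > best) {
            best = start - prev_end;
            *gap_start = prev_end;
        }
        if (end > prev_end)
            prev_end = end;
    }
    return best;
}
```

With a map that holds nothing but 1GB of RAM the hole runs from 40000000 to
4GB (size c0000000). Once the host's reserved regions are present, the largest
hole is b0000000:4c000000.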

To use this patchset, the guest config file has to have the parameter
'pci=['<BDF>',...]' and 'e820_host=1' enabled.

This has been tested with 2.6.18 (RHEL5), 2.6.27 (SLES11), 2.6.36, 2.6.37,
2.6.38, and 2.6.39 kernels. Also tested with PV NetBSD 5.1.

Tested this with the PCI devices (NIC, MSI), and with 2GB, 4GB, and 6GB guests
with success.

 libxc/xc_domain.c      |   77 +++++++++-----
 libxc/xc_e820.h        |    3 
 libxc/xenctrl.h        |   11 ++
 libxl/libxl.idl        |    1 
 libxl/libxl_create.c   |    8 +
 libxl/libxl_internal.h |    1 
 libxl/libxl_pci.c      |  257 +++++++++++++++++++++++++++++++++++++++++++++++--
 libxl/xl_cmdimpl.c     |   13 ++
 8 files changed, 338 insertions(+), 33 deletions(-)


* [PATCH 1 of 4] tools: Add xc_domain_set_memory_map and xc_get_machine_memory_map calls (x86, amd64 only)
  2011-05-26 14:04 [PATCH 0 of 4] Patches for PCI passthrough with modified E820 (v4.2) Konrad Rzeszutek Wilk
@ 2011-05-26 14:04 ` Konrad Rzeszutek Wilk
  2011-05-26 14:04 ` [PATCH 2 of 4] libxl: Add support for passing in the host's E820 for PCI passthrough Konrad Rzeszutek Wilk
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 7+ messages in thread
From: Konrad Rzeszutek Wilk @ 2011-05-26 14:04 UTC (permalink / raw)
  To: xen-devel, Ian.Jackson, stefano.stabellini, Ian.Campbell; +Cc: konrad.wilk

# HG changeset patch
# User Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
# Date 1306417988 14400
# Node ID 7826a85cf195a65012224777281539fd6faa0f50
# Parent  37c77bacb52aa7795978b994f9d371b979b2cb07
tools: Add xc_domain_set_memory_map and xc_get_machine_memory_map calls (x86,amd64 only).

The latter retrieves the E820 as seen by the hypervisor (completely
unchanged) and the former sets the E820 for the specified guest.
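
As a rough illustration of what the setter consumes, the sketch below builds
the single-entry map that xc_domain_set_memmap_limit hands over in this patch.
No hypercall is made. The struct is a local stand-in for the one in xc_e820.h,
and memmap_limit_entry is a hypothetical helper name.

```c
#include <assert.h>
#include <stdint.h>

#define E820_RAM 1

/* Local stand-in for struct e820entry from xc_e820.h. */
struct e820entry {
    uint64_t addr;
    uint64_t size;
    uint32_t type;
};

/* Hypothetical helper: describe [0, map_limitkb kB) as one RAM entry. */
static struct e820entry memmap_limit_entry(unsigned long map_limitkb)
{
    struct e820entry e;

    e.addr = 0;
    /* Widen before shifting so the kB-to-bytes conversion cannot
     * overflow a 32-bit unsigned long. */
    e.size = (uint64_t)map_limitkb << 10;
    e.type = E820_RAM;

    /* Sanity check: the conversion must round-trip. */
    assert((e.size >> 10) == map_limitkb);
    return e;
}
```

A one-element array holding this entry is what would be passed as 'entries'
with nr_entries set to 1.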

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

diff -r 37c77bacb52a -r 7826a85cf195 tools/libxc/xc_domain.c
--- a/tools/libxc/xc_domain.c	Mon May 23 17:38:28 2011 +0100
+++ b/tools/libxc/xc_domain.c	Thu May 26 09:53:08 2011 -0400
@@ -478,37 +478,64 @@ int xc_domain_pin_memory_cacheattr(xc_in
 }
 
 #if defined(__i386__) || defined(__x86_64__)
-#include "xc_e820.h"
+int xc_domain_set_memory_map(xc_interface *xch,
+                               uint32_t domid,
+                               struct e820entry entries[],
+                               uint32_t nr_entries)
+{
+    int rc;
+    struct xen_foreign_memory_map fmap = {
+        .domid = domid,
+        .map = { .nr_entries = nr_entries }
+    };
+    DECLARE_HYPERCALL_BOUNCE(entries, nr_entries * sizeof(struct e820entry),
+                             XC_HYPERCALL_BUFFER_BOUNCE_IN);
+
+    if ( !entries || xc_hypercall_bounce_pre(xch, entries) )
+        return -1;
+
+    set_xen_guest_handle(fmap.map.buffer, entries);
+
+    rc = do_memory_op(xch, XENMEM_set_memory_map, &fmap, sizeof(fmap));
+
+    xc_hypercall_bounce_post(xch, entries);
+
+    return rc;
+}
+int xc_get_machine_memory_map(xc_interface *xch,
+                              struct e820entry entries[],
+                              uint32_t max_entries)
+{
+    int rc;
+    struct xen_memory_map memmap = {
+        .nr_entries = max_entries
+    };
+    DECLARE_HYPERCALL_BOUNCE(entries, sizeof(struct e820entry) * max_entries,
+                             XC_HYPERCALL_BUFFER_BOUNCE_OUT);
+
+    if ( !entries || max_entries <= 1 || xc_hypercall_bounce_pre(xch, entries) )
+        return -1;
+
+
+    set_xen_guest_handle(memmap.buffer, entries);
+
+    rc = do_memory_op(xch, XENMEM_machine_memory_map, &memmap, sizeof(memmap));
+
+    xc_hypercall_bounce_post(xch, entries);
+
+    return rc ? rc : memmap.nr_entries;
+}
 int xc_domain_set_memmap_limit(xc_interface *xch,
                                uint32_t domid,
                                unsigned long map_limitkb)
 {
-    int rc;
-    struct xen_foreign_memory_map fmap = {
-        .domid = domid,
-        .map = { .nr_entries = 1 }
-    };
-    DECLARE_HYPERCALL_BUFFER(struct e820entry, e820);
+    struct e820entry e820;
 
-    e820 = xc_hypercall_buffer_alloc(xch, e820, sizeof(*e820));
+    e820.addr = 0;
+    e820.size = (uint64_t)map_limitkb << 10;
+    e820.type = E820_RAM;
 
-    if ( e820 == NULL )
-    {
-        PERROR("Could not allocate memory for xc_domain_set_memmap_limit hypercall");
-        return -1;
-    }
-
-    e820->addr = 0;
-    e820->size = (uint64_t)map_limitkb << 10;
-    e820->type = E820_RAM;
-
-    set_xen_guest_handle(fmap.map.buffer, e820);
-
-    rc = do_memory_op(xch, XENMEM_set_memory_map, &fmap, sizeof(fmap));
-
-    xc_hypercall_buffer_free(xch, e820);
-
-    return rc;
+    return xc_domain_set_memory_map(xch, domid, &e820, 1);
 }
 #else
 int xc_domain_set_memmap_limit(xc_interface *xch,
diff -r 37c77bacb52a -r 7826a85cf195 tools/libxc/xc_e820.h
--- a/tools/libxc/xc_e820.h	Mon May 23 17:38:28 2011 +0100
+++ b/tools/libxc/xc_e820.h	Thu May 26 09:53:08 2011 -0400
@@ -26,6 +26,9 @@
 #define E820_RESERVED     2
 #define E820_ACPI         3
 #define E820_NVS          4
+#define E820_UNUSABLE     5
+
+#define E820MAX           (128)
 
 struct e820entry {
     uint64_t addr;
diff -r 37c77bacb52a -r 7826a85cf195 tools/libxc/xenctrl.h
--- a/tools/libxc/xenctrl.h	Mon May 23 17:38:28 2011 +0100
+++ b/tools/libxc/xenctrl.h	Thu May 26 09:53:08 2011 -0400
@@ -966,6 +966,17 @@ int xc_domain_set_memmap_limit(xc_interf
                                uint32_t domid,
                                unsigned long map_limitkb);
 
+#if defined(__i386__) || defined(__x86_64__)
+#include "xc_e820.h"
+int xc_domain_set_memory_map(xc_interface *xch,
+                               uint32_t domid,
+                               struct e820entry entries[],
+                               uint32_t nr_entries);
+
+int xc_get_machine_memory_map(xc_interface *xch,
+                              struct e820entry entries[],
+                              uint32_t max_entries);
+#endif
 int xc_domain_set_time_offset(xc_interface *xch,
                               uint32_t domid,
                               int32_t time_offset_seconds);


* [PATCH 2 of 4] libxl: Add support for passing in the host's E820 for PCI passthrough
  2011-05-26 14:04 [PATCH 0 of 4] Patches for PCI passthrough with modified E820 (v4.2) Konrad Rzeszutek Wilk
  2011-05-26 14:04 ` [PATCH 1 of 4] tools: Add xc_domain_set_memory_map and xc_get_machine_memory_map calls (x86, amd64 only) Konrad Rzeszutek Wilk
@ 2011-05-26 14:04 ` Konrad Rzeszutek Wilk
  2011-05-26 14:04 ` [PATCH 3 of 4] libxl: Convert E820_UNUSABLE and E820_RAM to E820_UNUSABLE as appropriate Konrad Rzeszutek Wilk
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 7+ messages in thread
From: Konrad Rzeszutek Wilk @ 2011-05-26 14:04 UTC (permalink / raw)
  To: xen-devel, Ian.Jackson, stefano.stabellini, Ian.Campbell; +Cc: konrad.wilk

# HG changeset patch
# User Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
# Date 1306418186 14400
# Node ID a929aa838964173ad0a9fa5423e4f2a9b0264a33
# Parent  7826a85cf195a65012224777281539fd6faa0f50
libxl: Add support for passing in the host's E820 for PCI passthrough

The code that populates the E820 is triggered when the guest
configuration has "pci=['<BDF>,..']", the guest is PV, and
b_info->u.pv.e820_host is set.

do_domain_create calls libxl__e820_alloc when
it notices that the guest is PV, has at least one PCI device, and has
the e820_host flag set.

libxl__e820_alloc calls xc_get_machine_memory_map to retrieve the system's
E820. Then the E820 is sanitized to weed out E820 entries below 1MB, and to
remove any E820_RAM or E820_UNUSABLE regions, as the guest does not need to
know about them. The guest only needs the E820_ACPI, E820_NVS, and E820_RESERVED
entries to get an idea of where the PCI I/O space is. Mostly. The Linux kernel
assumes that any gap in the E820 is PCI I/O space, which means that if we pass
in the guest 2GB, and the E820_ACPI and its friends start at 3GB, the
gap between 2GB and 3GB will be considered PCI I/O space. To guard against
that we also create an E820_UNUSABLE region from 'target_kb'
(called ram_end in the code) up to the first E820_[ACPI,NVS,RESERVED] region.
Lastly, xc_domain_set_memory_map is called to install the new E820.
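
The steps above can be sketched in isolation. This is a simplified
illustration, not the e820_sanitize from the diff: it leaves out the balloon
region, build_guest_map and keep_entry are hypothetical names, and the struct
is a local stand-in for the one in xc_e820.h.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define E820_RAM      1
#define E820_RESERVED 2
#define E820_UNUSABLE 5

/* Local stand-in for struct e820entry from xc_e820.h. */
struct e820entry {
    uint64_t addr;
    uint64_t size;
    uint32_t type;
};

/* Hypothetical helper: host entries the guest should see are those above
 * 1MB that are neither RAM nor unusable. */
static int keep_entry(const struct e820entry *e)
{
    return e->addr > 0x100000 &&
           e->type != E820_RAM && e->type != E820_UNUSABLE;
}

/* Hypothetical helper: write guest RAM, an UNUSABLE filler up to the first
 * kept host entry, then the kept host entries. dst needs room for nr + 2
 * entries. Returns the number of entries written. */
static size_t build_guest_map(const struct e820entry *host, size_t nr,
                              uint64_t ram_bytes, struct e820entry *dst)
{
    uint64_t start = UINT64_MAX;   /* lowest kept host address */
    size_t i, idx = 0;

    assert(host != NULL && dst != NULL);

    for (i = 0; i < nr; i++)
        if (keep_entry(&host[i]) && host[i].addr < start)
            start = host[i].addr;

    /* Guest RAM, trimmed so it never runs into the kept entries. */
    dst[idx].addr = 0;
    dst[idx].size = ram_bytes < start ? ram_bytes : start;
    dst[idx].type = E820_RAM;
    idx++;

    /* Plug the hole so the guest does not take it for PCI I/O space. */
    if (start != UINT64_MAX && start > ram_bytes) {
        dst[idx].addr = ram_bytes;
        dst[idx].size = start - ram_bytes;
        dst[idx].type = E820_UNUSABLE;
        idx++;
    }

    for (i = 0; i < nr; i++)
        if (keep_entry(&host[i]))
            dst[idx++] = host[i];

    return idx;
}
```

For a 2GB guest on a host whose first reserved region starts at 3GB, the
result is RAM [0, 2GB), UNUSABLE [2GB, 3GB), followed by the host's reserved
entries.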

When tested with another PV guest (NetBSD 5.1) the modified E820 gave
it no trouble. The code has also been tested with older "classic" Xen Linux
and with the newer "pvops" with success (SLES11, RHEL5, Ubuntu Lucid,
Debian Squeeze, 2.6.37, 2.6.38, 2.6.39).

Memory that is slack or for ballooning (so 'maxmem' in the guest configuration)
is put behind the machine E820, which in most cases is after 4GB.
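
The placement arithmetic can be worked through on its own. The sketch below
follows the same steps as the diff (trim the low RAM at the first host non-RAM
entry, then put the trimmed amount plus the balloon at 4GB or at the end of the
host entries, whichever is higher). The names high_ram_region, struct region,
hole_start and last are illustrative only.

```c
#include <assert.h>
#include <stdint.h>

#define FOUR_GB (1ULL << 32)

struct region {
    uint64_t addr;   /* bytes */
    uint64_t size;   /* bytes */
};

/* Hypothetical helper.
 * target_kb:  RAM given to the guest at boot
 * balloon_kb: (maxmem - memory) plus slack
 * hole_start: lowest address of a host non-RAM entry, in bytes
 * last:       highest end address of a host non-RAM entry, in bytes */
static struct region high_ram_region(uint64_t target_kb, uint64_t balloon_kb,
                                     uint64_t hole_start, uint64_t last)
{
    uint64_t start_kb = hole_start >> 10;
    uint64_t delta_kb = 0;
    struct region r;

    assert(hole_start <= last);

    /* RAM that would have overlapped the host entries gets moved up. */
    if (start_kb && target_kb > start_kb)
        delta_kb = target_kb - start_kb;

    r.addr = last > FOUR_GB ? last : FOUR_GB;
    r.size = (delta_kb + balloon_kb) << 10;
    return r;
}
```

For example, a 4GB guest with 1GB of balloon on a host whose non-RAM entries
span b0000000 to 4GB has 1.25GB trimmed from the low region, so the high region
starts at 4GB and is 2.25GB long.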

The reason for doing the fetching of the E820 using the hypercall in
the toolstack (instead of the guest doing it) is that when a guest
would do a hypercall to 'XENMEM_machine_memory_map' it would
retrieve an E820 with I/O range caps added in. Meaning that the
region after 4GB up to end of possible memory would be marked as unusable
and the kernel would not have any space to allocate a balloon
region.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

diff -r 7826a85cf195 -r a929aa838964 tools/libxl/libxl.idl
--- a/tools/libxl/libxl.idl	Thu May 26 09:53:08 2011 -0400
+++ b/tools/libxl/libxl.idl	Thu May 26 09:56:26 2011 -0400
@@ -180,6 +180,7 @@ libxl_domain_build_info = Struct("domain
                                         ("cmdline", string),
                                         ("ramdisk", libxl_file_reference),
                                         ("features", string, True),
+                                        ("e820_host", bool, False, "Use host's E820 for PCI passthrough."),
                                         ])),
                  ])),
     ],
diff -r 7826a85cf195 -r a929aa838964 tools/libxl/libxl_create.c
--- a/tools/libxl/libxl_create.c	Thu May 26 09:53:08 2011 -0400
+++ b/tools/libxl/libxl_create.c	Thu May 26 09:56:26 2011 -0400
@@ -522,6 +522,14 @@ static int do_domain_create(libxl__gc *g
     for (i = 0; i < d_config->num_pcidevs; i++)
         libxl__device_pci_add(gc, domid, &d_config->pcidevs[i], 1);
 
+    if (!d_config->c_info.hvm && d_config->b_info.u.pv.e820_host) {
+        int rc;
+        rc = libxl__e820_alloc(ctx, domid, d_config);
+        if (rc)
+            LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR,
+                      "Failed while collecting E820 with: %d (errno:%d)\n",
+                      rc, errno);
+    }
     if ( cb && (d_config->c_info.hvm || d_config->b_info.u.pv.bootloader )) {
         if ( (*cb)(ctx, domid, priv) )
             goto error_out;
diff -r 7826a85cf195 -r a929aa838964 tools/libxl/libxl_internal.h
--- a/tools/libxl/libxl_internal.h	Thu May 26 09:53:08 2011 -0400
+++ b/tools/libxl/libxl_internal.h	Thu May 26 09:56:26 2011 -0400
@@ -362,4 +362,5 @@ _hidden int libxl__error_set(libxl__gc *
 _hidden int libxl__file_reference_map(libxl_file_reference *f);
 _hidden int libxl__file_reference_unmap(libxl_file_reference *f);
 
+_hidden int libxl__e820_alloc(libxl_ctx *ctx, uint32_t domid, libxl_domain_config *d_config);
 #endif
diff -r 7826a85cf195 -r a929aa838964 tools/libxl/libxl_pci.c
--- a/tools/libxl/libxl_pci.c	Thu May 26 09:53:08 2011 -0400
+++ b/tools/libxl/libxl_pci.c	Thu May 26 09:56:26 2011 -0400
@@ -1047,3 +1047,167 @@ int libxl_device_pci_shutdown(libxl_ctx 
     free(pcidevs);
     return 0;
 }
+
+static const char *e820_names(int type)
+{
+    switch (type) {
+        case E820_RAM: return "RAM";
+        case E820_RESERVED: return "Reserved";
+        case E820_ACPI: return "ACPI";
+        case E820_NVS: return "ACPI NVS";
+        case E820_UNUSABLE: return "Unusable";
+        default: break;
+    }
+    return "Unknown";
+}
+
+static int e820_sanitize(libxl_ctx *ctx, struct e820entry src[],
+                         uint32_t *nr_entries,
+                         unsigned long map_limitkb,
+                         unsigned long balloon_kb)
+{
+    uint64_t delta_kb = 0, start = 0, start_kb = 0, last = 0, ram_end;
+    uint32_t i, idx = 0, nr;
+    struct e820entry e820[E820MAX];
+
+    if (!src || !map_limitkb || !balloon_kb || !nr_entries)
+        return ERROR_INVAL;
+
+    nr = *nr_entries;
+    if (!nr)
+        return ERROR_INVAL;
+
+    if (nr > E820MAX)
+        return ERROR_NOMEM;
+
+    /* Weed out anything under 1MB */
+    for (i = 0; i < nr; i++) {
+        if (src[i].addr > 0x100000)
+            continue;
+
+        src[i].type = 0;
+        src[i].size = 0;
+        src[i].addr = -1ULL;
+    }
+
+    /* Find the lowest and highest entry in E820, skipping over
+     * undesired entries. */
+    start = -1ULL;
+    last = 0;
+    for (i = 0; i < nr; i++) {
+        if ((src[i].type == E820_RAM) ||
+            (src[i].type == E820_UNUSABLE) ||
+            (src[i].type == 0))
+		continue;
+
+        start = src[i].addr < start ? src[i].addr : start;
+        last = src[i].addr + src[i].size > last ?
+               src[i].addr + src[i].size : last;
+    }
+    if (start > 1024)
+        start_kb = start >> 10;
+
+    /* Add the memory RAM region for the guest */
+    e820[idx].addr = 0;
+    e820[idx].size = (uint64_t)map_limitkb << 10;
+    e820[idx].type = E820_RAM;
+
+    /* .. and trim if necessary */
+    if (start_kb && map_limitkb > start_kb) {
+        delta_kb = map_limitkb - start_kb;
+        if (delta_kb)
+            e820[idx].size -= (uint64_t)(delta_kb << 10);
+    }
+    /* Note: We don't touch balloon_kb here. Will add it at the end. */
+    ram_end = e820[idx].addr + e820[idx].size;
+    idx ++;
+
+    LIBXL__LOG(ctx, LIBXL__LOG_DEBUG, "Memory: %"PRIu64"kB End of RAM: " \
+               "0x%"PRIx64" (PFN) Delta: %"PRIu64"kB, PCI start: %"PRIu64"kB " \
+               "(0x%"PRIx64" PFN), Balloon %"PRIu64"kB\n", (uint64_t)map_limitkb,
+               ram_end >> 12, delta_kb, start_kb ,start >> 12,
+               (uint64_t)balloon_kb);
+
+    /* Check if there is a region between ram_end and start. */
+    if (start > ram_end) {
+        /* .. and if not present, add it in. This is to guard against
+          the Linux guest assuming that the gap between the end of
+          RAM region and the start of the E820_[ACPI,NVS,RESERVED]
+          is PCI I/O space. Which it certainly is _not_. */
+        e820[idx].type = E820_UNUSABLE;
+        e820[idx].addr = ram_end;
+        e820[idx].size = start - ram_end;
+        idx++;
+    }
+    /* Almost done: copy them over, ignoring the undesireable ones */
+    for (i = 0; i < nr; i++) {
+        if ((src[i].type == E820_RAM) ||
+	    (src[i].type == E820_UNUSABLE) ||
+	    (src[i].type == 0))
+	    continue;
+
+        e820[idx].type = src[i].type;
+        e820[idx].addr = src[i].addr;
+        e820[idx].size = src[i].size;
+        idx++;
+    }
+    /* At this point we have the mapped RAM + E820 entries from src. */
+    if (balloon_kb) {
+        /* and if we truncated the RAM region, then add it to the end. */
+        e820[idx].type = E820_RAM;
+        e820[idx].addr = (uint64_t)(1ULL << 32) > last ?
+                         (uint64_t)(1ULL << 32) : last;
+        /* also add the balloon memory to the end. */
+        e820[idx].size = (uint64_t)(delta_kb << 10) +
+                         ((uint64_t)balloon_kb << 10);
+        idx++;
+
+    }
+    nr = idx;
+
+    for (i = 0; i < nr; i++) {
+      LIBXL__LOG(ctx, LIBXL__LOG_DEBUG, ":\t[%"PRIx64" -> %"PRIx64"] %s",
+                 e820[i].addr >> 12, (e820[i].addr + e820[i].size) >> 12,
+                 e820_names(e820[i].type));
+    }
+
+    /* Done: copy the sanitized version. */
+    *nr_entries = nr;
+    memcpy(src, e820, nr * sizeof(struct e820entry));
+    return 0;
+}
+
+int libxl__e820_alloc(libxl_ctx *ctx, uint32_t domid, libxl_domain_config *d_config)
+{
+    int rc;
+    uint32_t nr;
+    struct e820entry map[E820MAX];
+    libxl_domain_build_info *b_info;
+
+    if (d_config == NULL || d_config->c_info.hvm)
+        return ERROR_INVAL;
+
+    b_info = &d_config->b_info;
+    if (!b_info->u.pv.e820_host)
+        return ERROR_INVAL;
+
+    rc = xc_get_machine_memory_map(ctx->xch, map, E820MAX);
+    if (rc < 0) {
+        errno = rc;
+        return ERROR_FAIL;
+    }
+    nr = rc;
+    rc = e820_sanitize(ctx, map, &nr, b_info->target_memkb,
+                       (b_info->max_memkb - b_info->target_memkb) +
+                       b_info->u.pv.slack_memkb);
+    if (rc)
+        return ERROR_FAIL;
+
+    rc = xc_domain_set_memory_map(ctx->xch, domid, map, nr);
+
+    if (rc < 0) {
+        errno  = rc;
+        return ERROR_FAIL;
+    }
+    return 0;
+}
diff -r 7826a85cf195 -r a929aa838964 tools/libxl/xl_cmdimpl.c
--- a/tools/libxl/xl_cmdimpl.c	Thu May 26 09:53:08 2011 -0400
+++ b/tools/libxl/xl_cmdimpl.c	Thu May 26 09:56:26 2011 -0400
@@ -373,6 +373,7 @@ static void printf_info(int domid,
         printf("\t\t\t(kernel %s)\n", b_info->u.pv.kernel.path);
         printf("\t\t\t(cmdline %s)\n", b_info->u.pv.cmdline);
         printf("\t\t\t(ramdisk %s)\n", b_info->u.pv.ramdisk.path);
+        printf("\t\t\t(e820_host %d)\n", b_info->u.pv.e820_host);
         printf("\t\t)\n");
     }
     printf("\t)\n");
@@ -994,6 +995,8 @@ skip_vfb:
             if (!libxl_device_pci_parse_bdf(ctx, pcidev, buf))
                 d_config->num_pcidevs++;
         }
+        if (d_config->num_pcidevs && !c_info->hvm)
+            b_info->u.pv.e820_host = true;
     }
 
     switch (xlu_cfg_get_list(config, "cpuid", &cpuids, 0, 1)) {


* [PATCH 3 of 4] libxl: Convert E820_UNUSABLE and E820_RAM to E820_UNUSABLE as appropriate
  2011-05-26 14:04 [PATCH 0 of 4] Patches for PCI passthrough with modified E820 (v4.2) Konrad Rzeszutek Wilk
  2011-05-26 14:04 ` [PATCH 1 of 4] tools: Add xc_domain_set_memory_map and xc_get_machine_memory_map calls (x86, amd64 only) Konrad Rzeszutek Wilk
  2011-05-26 14:04 ` [PATCH 2 of 4] libxl: Add support for passing in the host's E820 for PCI passthrough Konrad Rzeszutek Wilk
@ 2011-05-26 14:04 ` Konrad Rzeszutek Wilk
  2011-05-26 14:04 ` [PATCH 4 of 4] libxl: Add 'e820_host' option to config file Konrad Rzeszutek Wilk
  2011-05-26 15:02 ` [PATCH 0 of 4] Patches for PCI passthrough with modified E820 (v4.2) Ian Jackson
  4 siblings, 0 replies; 7+ messages in thread
From: Konrad Rzeszutek Wilk @ 2011-05-26 14:04 UTC (permalink / raw)
  To: xen-devel, Ian.Jackson, stefano.stabellini, Ian.Campbell; +Cc: konrad.wilk

# HG changeset patch
# User Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
# Date 1306418190 14400
# Node ID 81e090660d86d3bfb41ea0c75fdc7ef2b4107bf5
# Parent  a929aa838964173ad0a9fa5423e4f2a9b0264a33
libxl: Convert E820_UNUSABLE and E820_RAM to E820_UNUSABLE as appropriate.

Most machines have, after the RAM regions in the E820, a couple of
E820_RESERVED regions along with E820_ACPI and E820_NVS. On some Intel
machines, the E820 looks like Swiss cheese:

(XEN) Initial Xen-e820 RAM map:
(XEN)  0000000000000000 - 000000000009d000 (usable)
(XEN)  000000000009d000 - 00000000000a0000 (reserved)
(XEN)  00000000000e0000 - 0000000000100000 (reserved)
(XEN)  0000000000100000 - 000000009cf66000 (usable)
(XEN)  000000009cf66000 - 000000009d102000 (ACPI NVS)
(XEN)  000000009d102000 - 000000009f6bd000 (usable)  <--
(XEN)  000000009f6bd000 - 000000009f6bf000 (reserved)
(XEN)  000000009f6bf000 - 000000009f714000 (usable)  <--
(XEN)  000000009f714000 - 000000009f7bf000 (ACPI NVS)
(XEN)  000000009f7bf000 - 000000009f7e0000 (usable)  <--
(XEN)  000000009f7e0000 - 000000009f7ff000 (ACPI data)
(XEN)  000000009f7ff000 - 000000009f800000 (usable)  <--
(XEN)  000000009f800000 - 00000000a0000000 (reserved)
(XEN)  00000000a0000000 - 00000000b0000000 (reserved)
(XEN)  00000000fc000000 - 00000000fd000000 (reserved)
(XEN)  00000000ffe00000 - 0000000100000000 (reserved)
(XEN)  0000000100000000 - 0000000160000000 (usable)

This means we have to pay attention to the E820_RAM regions that sit
between the E820_[ACPI,NVS,RESERVED] ones. If we remove those
E820_RAM regions from the E820 (because the amount of memory passed to
the guest is less than where those E820 regions reside), the
Linux kernel interprets those "gaps" as PCI I/O space.
This is what we are currently doing.

This can be disastrous if we pass in an Intel IGD card which tries
to use the first available PCI I/O space - and ends up
using the MFNs which are actually RAM instead of being the
PCI I/O space.

To make this work, we convert all E820_RAM that are above
the 'target_kb' (those that overlap the 'target_kb'
are truncated appropriately) to be E820_UNUSABLE. We also limit this
alteration to regions below 4GB. This means that an E820 for a guest
from this (target_kb=1024, maxmem=2048):

[    0.000000] Set 405658 page(s) to 1-1 mapping.
[    0.000000] BIOS-provided physical RAM map:
[    0.000000]  Xen: 0000000000000000 - 00000000000a0000 (usable)
[    0.000000]  Xen: 00000000000a0000 - 0000000000100000 (reserved)
[    0.000000]  Xen: 0000000000100000 - 0000000040000000 (usable)
[    0.000000]  Xen: 0000000040000000 - 000000009cf66000 (unusable)
[    0.000000]  Xen: 000000009cf66000 - 000000009d102000 (ACPI NVS)
[    0.000000]  Xen: 000000009f6bd000 - 000000009f6bf000 (reserved)
[    0.000000]  Xen: 000000009f714000 - 000000009f7bf000 (ACPI NVS)
[    0.000000]  Xen: 000000009f7e0000 - 000000009f7ff000 (ACPI data)
[    0.000000]  Xen: 000000009f800000 - 00000000b0000000 (reserved)
[    0.000000]  Xen: 00000000fc000000 - 00000000fd000000 (reserved)
[    0.000000]  Xen: 00000000fec00000 - 00000000fec01000 (reserved)
[    0.000000]  Xen: 00000000fee00000 - 00000000fee01000 (reserved)
[    0.000000]  Xen: 00000000ffe00000 - 0000000100000000 (reserved)
[    0.000000]  Xen: 0000000100000000 - 0000000140800000 (usable)


will look like this:

[    0.000000] Set 395880 page(s) to 1-1 mapping.
[    0.000000] BIOS-provided physical RAM map:
[    0.000000]  Xen: 0000000000000000 - 00000000000a0000 (usable)
[    0.000000]  Xen: 00000000000a0000 - 0000000000100000 (reserved)
[    0.000000]  Xen: 0000000000100000 - 0000000040000000 (usable)
[    0.000000]  Xen: 0000000040000000 - 000000009cf66000 (unusable)
[    0.000000]  Xen: 000000009cf66000 - 000000009d102000 (ACPI NVS)
[    0.000000]  Xen: 000000009d102000 - 000000009f6bd000 (unusable)
[    0.000000]  Xen: 000000009f6bd000 - 000000009f6bf000 (reserved)
[    0.000000]  Xen: 000000009f6bf000 - 000000009f714000 (unusable)
[    0.000000]  Xen: 000000009f714000 - 000000009f7bf000 (ACPI NVS)
[    0.000000]  Xen: 000000009f7bf000 - 000000009f7e0000 (unusable)
[    0.000000]  Xen: 000000009f7e0000 - 000000009f7ff000 (ACPI data)
[    0.000000]  Xen: 000000009f7ff000 - 000000009f800000 (unusable)
[    0.000000]  Xen: 000000009f800000 - 00000000b0000000 (reserved)
[    0.000000]  Xen: 00000000fc000000 - 00000000fd000000 (reserved)
[    0.000000]  Xen: 00000000fec00000 - 00000000fec01000 (reserved)
[    0.000000]  Xen: 00000000fee00000 - 00000000fee01000 (reserved)
[    0.000000]  Xen: 00000000ffe00000 - 0000000100000000 (reserved)
[    0.000000]  Xen: 0000000100000000 - 0000000140800000 (usable)
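
The conversion that produces the second listing can be sketched as a single
pass over the host map. This is a reduced illustration that only handles RAM
entries. The loop in the diff below also drops host UNUSABLE entries and
anything that ends below ram_end. retag_ram_holes is a hypothetical name and
the struct is a local stand-in for the one in xc_e820.h.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define E820_RAM      1
#define E820_UNUSABLE 5
#define FOUR_GB       (1ULL << 32)

/* Local stand-in for struct e820entry from xc_e820.h. */
struct e820entry {
    uint64_t addr;
    uint64_t size;
    uint32_t type;
};

/* Hypothetical helper: retag, in place, host RAM lying in [ram_end, 4GB) as
 * unusable. RAM fully covered by the guest's own RAM is dropped by setting
 * its type to 0. Returns how many entries became unusable. */
static size_t retag_ram_holes(struct e820entry *map, size_t nr,
                              uint64_t ram_end)
{
    size_t i, changed = 0;

    assert(map != NULL || nr == 0);

    for (i = 0; i < nr; i++) {
        uint64_t end = map[i].addr + map[i].size;

        /* Only RAM below 4GB is of interest. */
        if (map[i].type != E820_RAM || map[i].addr >= FOUR_GB)
            continue;

        if (end <= ram_end) {
            map[i].type = 0;          /* covered by guest RAM */
            continue;
        }
        if (map[i].addr < ram_end) {
            /* Straddles ram_end: keep only the part above it. */
            map[i].size = end - ram_end;
            map[i].addr = ram_end;
        }
        map[i].type = E820_UNUSABLE;
        changed++;
    }
    return changed;
}
```

Applied to the host map above with ram_end at 40000000, the large RAM entry
becomes unusable from 40000000 to 9cf66000 and the small RAM entries between
the ACPI regions are retagged, while RAM above 4GB is left alone.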

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

diff -r a929aa838964 -r 81e090660d86 tools/libxl/libxl_pci.c
--- a/tools/libxl/libxl_pci.c	Thu May 26 09:56:26 2011 -0400
+++ b/tools/libxl/libxl_pci.c	Thu May 26 09:56:30 2011 -0400
@@ -1128,21 +1128,98 @@ static int e820_sanitize(libxl_ctx *ctx,
                ram_end >> 12, delta_kb, start_kb ,start >> 12,
                (uint64_t)balloon_kb);
 
+
+    /* This whole code below is to guard against if the Intel IGD is passed into
+     * the guest. If we don't pass in IGD, this whole code can be ignored.
+     *
+     * The reason for this code is that Intel boxes fill their E820 with
+     * E820_RAM amongst E820_RESERVED and we can't just ditch those E820_RAM.
+     * That is b/c any "gaps" in the E820 is considered PCI I/O space by
+     * Linux and it would be utilized by the Intel IGD as I/O space while
+     * in reality it was a RAM region.
+     *
+     * What this means is that we have to walk the E820 and change the type
+     * of any region that is RAM, below 4GB, and above ram_end to
+     * E820_UNUSABLE. We also need to move some of the E820_RAM regions if
+     * they overlap with ram_end. */
+    for (i = 0; i < nr; i++) {
+	uint64_t end = src[i].addr + src[i].size;
+
+	/* We don't care about E820_UNUSABLE, but we need to
+         * change the type to zero b/c the loop after this
+         * sticks E820_UNUSABLE on the guest's E820 but ignores
+         * the ones with type zero. */
+        if ((src[i].type == E820_UNUSABLE) ||
+	    /* Any region that is within the "RAM region" can
+             * be safely ditched. */
+            (end < ram_end)) {
+                src[i].type = 0;
+                continue;
+        }
+
+        /* Look only at RAM regions. */
+	if (src[i].type != E820_RAM)
+            continue;
+
+        /* We only care about RAM regions below 4GB. */
+        if (src[i].addr >= (1ULL<<32))
+            continue;
+
+	/* E820_RAM overlaps with our RAM region. Move it */
+	if (src[i].addr < ram_end) {
+            uint64_t delta;
+
+            src[i].type = E820_UNUSABLE;
+            delta = ram_end - src[i].addr;
+            /* The end < ram_end should weed this out */
+            if (src[i].size < delta)
+                src[i].type = 0;
+            else {
+                src[i].size -= delta;
+                src[i].addr = ram_end;
+            }
+            if (src[i].addr + src[i].size != end) {
+                /* We messed up somewhere */
+                src[i].type = 0;
+                LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "Computed E820 wrongly. Continuing on.");
+            }
+        }
+        /* Lastly, convert the RAM to UNUSABLE. Look in the Linux kernel
+           at git commit 2f14ddc3a7146ea4cd5a3d1ecd993f85f2e4f948
+            "xen/setup: Inhibit resource API from using System RAM E820
+           gaps as PCI mem gaps" for full explanation. */
+        if (end > ram_end)
+            src[i].type = E820_UNUSABLE;
+    }
+
     /* Check if there is a region between ram_end and start. */
     if (start > ram_end) {
+        int add_unusable = 1;
+        for (i = 0; i < nr && add_unusable; i++) {
+            if (src[i].type != E820_UNUSABLE)
+                continue;
+            if (ram_end != src[i].addr)
+                continue;
+            if (start != src[i].addr + src[i].size) {
+                /* there is one, adjust it */
+                src[i].size = start - src[i].addr;
+            }
+            add_unusable = 0;
+        }
         /* .. and if not present, add it in. This is to guard against
-          the Linux guest assuming that the gap between the end of
-          RAM region and the start of the E820_[ACPI,NVS,RESERVED]
-          is PCI I/O space. Which it certainly is _not_. */
-        e820[idx].type = E820_UNUSABLE;
-        e820[idx].addr = ram_end;
-        e820[idx].size = start - ram_end;
-        idx++;
+           the Linux guest assuming that the gap between the end of
+           RAM region and the start of the E820_[ACPI,NVS,RESERVED]
+           is PCI I/O space. Which it certainly is _not_. */
+        if (add_unusable) {
+            e820[idx].type = E820_UNUSABLE;
+            e820[idx].addr = ram_end;
+            e820[idx].size = start - ram_end;
+            idx++;
+        }
     }
     /* Almost done: copy them over, ignoring the undesireable ones */
     for (i = 0; i < nr; i++) {
         if ((src[i].type == E820_RAM) ||
-	    (src[i].type == E820_UNUSABLE) ||
 	    (src[i].type == 0))
 	    continue;


* [PATCH 4 of 4] libxl: Add 'e820_host' option to config file
  2011-05-26 14:04 [PATCH 0 of 4] Patches for PCI passthrough with modified E820 (v4.2) Konrad Rzeszutek Wilk
                   ` (2 preceding siblings ...)
  2011-05-26 14:04 ` [PATCH 3 of 4] libxl: Convert E820_UNUSABLE and E820_RAM to E820_UNUSABLE as appropriate Konrad Rzeszutek Wilk
@ 2011-05-26 14:04 ` Konrad Rzeszutek Wilk
  2011-05-26 15:02 ` [PATCH 0 of 4] Patches for PCI passthrough with modified E820 (v4.2) Ian Jackson
  4 siblings, 0 replies; 7+ messages in thread
From: Konrad Rzeszutek Wilk @ 2011-05-26 14:04 UTC (permalink / raw)
  To: xen-devel, Ian.Jackson, stefano.stabellini, Ian.Campbell; +Cc: konrad.wilk

# HG changeset patch
# User Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
# Date 1306418197 14400
# Node ID 980ec2be580408e71891298bc4d1e72c0aa564a4
# Parent  81e090660d86d3bfb41ea0c75fdc7ef2b4107bf5
libxl: Add 'e820_host' option to config file.

.. which will be removed once the auto-ballooning of guests
with PCI devices works. During testing of the patches which provide
a host E820 in a PV guest, certain inconsistencies were found with
guests. When launching a RHEL5 or SLES11 PV guest with 4GB and a PCI device,
the kernel would report 4GB, but have 1.5G "used". What happened was that
the P2M entries that fall within the E820 I/O holes would never be used and
were just wasted. The mechanism to get around this is to shrink the size of
the guest before launch (say memory=2048, maxmem=4096) and then balloon back
to 4096M after start. PVOPS type kernels would detect the E820 I/O holes and
deflate by the correct amount but would not inflate back to 4GB.
Manually inflating makes it work.

The future fix for guests whose memory amount flows over the PCI hole
is to launch the guest with a decreased amount, right up to the cusp
of where the E820 PCI hole starts. Also increase the 'maxmem' by the delta
and then, once the guest has launched, balloon up by the delta.
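
As a rough illustration of the arithmetic described above (a sketch only, not
code from this series): the struct, the helper names 'pci_hole_start' and
'plan_launch', and the choice of "highest end of RAM below 4GB" as the start
of the PCI hole are all assumptions made for the example. It takes the delta
to be the requested memory minus the hole start, which is one reading of the
paragraph above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define E820_RAM 1
#define FOUR_GB  (4ULL << 30)

/* Hypothetical, simplified stand-in for an E820 entry. */
struct e820_entry {
    uint64_t addr;
    uint64_t size;
    uint32_t type;
};

/* Hypothetical result: what to launch with and how much to balloon later. */
struct launch_plan {
    uint64_t memory_mb;   /* amount to launch the guest with */
    uint64_t maxmem_mb;   /* 'maxmem', raised by the delta */
    uint64_t balloon_mb;  /* the delta to balloon up by after launch */
};

/* Assumption: the PCI hole starts at the highest end of RAM below 4GB. */
static uint64_t pci_hole_start(const struct e820_entry *map, size_t nr)
{
    uint64_t hole = 0;
    size_t i;

    assert(map != NULL || nr == 0);
    for (i = 0; i < nr; i++) {
        uint64_t end;

        if (map[i].type != E820_RAM || map[i].addr >= FOUR_GB)
            continue;
        end = map[i].addr + map[i].size;
        if (end > FOUR_GB)
            end = FOUR_GB;
        if (end > hole)
            hole = end;
    }
    return hole;
}

/* Shrink to the cusp of the hole, grow maxmem by the delta. */
static struct launch_plan plan_launch(uint64_t target_mb,
                                      const struct e820_entry *map, size_t nr)
{
    uint64_t hole_mb = pci_hole_start(map, nr) >> 20;
    struct launch_plan p = { target_mb, target_mb, 0 };

    if (hole_mb != 0 && target_mb > hole_mb) {
        p.balloon_mb = target_mb - hole_mb;
        p.memory_mb = hole_mb;
        p.maxmem_mb = target_mb + p.balloon_mb;
    }
    return p;
}
```

With a host map whose RAM below 4GB ends at b0000000 (2816M), a 4096M guest
would under these assumptions launch with 2816M, have maxmem raised by 1280M,
and balloon up by 1280M once started.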

This will require some careful surgery, so for now this parameter
will guard against unsuspecting users seeing their PV guests' memory "vanish."

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

diff -r 81e090660d86 -r 980ec2be5804 tools/libxl/xl_cmdimpl.c
--- a/tools/libxl/xl_cmdimpl.c	Thu May 26 09:56:30 2011 -0400
+++ b/tools/libxl/xl_cmdimpl.c	Thu May 26 09:56:37 2011 -0400
@@ -979,6 +979,16 @@ skip_vfb:
     if (!xlu_cfg_get_long (config, "pci_power_mgmt", &l))
         pci_power_mgmt = l;
 
+    /* To be reworked (automatically enabled) once the auto ballooning
+     * after guest starts is done (with PCI devices passed in). */
+    if (!xlu_cfg_get_long (config, "e820_host", &l)) {
+        if (c_info->hvm)
+          fprintf(stderr, "Can't do e820_host in HVM mode!\n");
+        else {
+          if (l)
+            b_info->u.pv.e820_host = true;
+        }
+    }
     if (!xlu_cfg_get_list (config, "pci", &pcis, 0, 0)) {
         int i;
         d_config->num_pcidevs = 0;


* Re: [PATCH 0 of 4] Patches for PCI passthrough with modified E820 (v4.2).
  2011-05-26 14:04 [PATCH 0 of 4] Patches for PCI passthrough with modified E820 (v4.2) Konrad Rzeszutek Wilk
                   ` (3 preceding siblings ...)
  2011-05-26 14:04 ` [PATCH 4 of 4] libxl: Add 'e820_host' option to config file Konrad Rzeszutek Wilk
@ 2011-05-26 15:02 ` Ian Jackson
  2012-05-24 16:24   ` Pasi Kärkkäinen
  4 siblings, 1 reply; 7+ messages in thread
From: Ian Jackson @ 2011-05-26 15:02 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk; +Cc: xen-devel, Ian.Campbell, stefano.stabellini

Konrad Rzeszutek Wilk writes ("[Xen-devel] [PATCH 0 of 4] Patches for PCI passthrough with modified E820 (v4.2)."):
> This set of v4.2 patches allows a PV domain to see the machine's
> E820 and figure out where the PCI I/O gap is and match it with the reality.

Thanks.  I have applied 2-4; 1 was applied earlier.

Ian.


* Re: [PATCH 0 of 4] Patches for PCI passthrough with modified E820 (v4.2).
  2011-05-26 15:02 ` [PATCH 0 of 4] Patches for PCI passthrough with modified E820 (v4.2) Ian Jackson
@ 2012-05-24 16:24   ` Pasi Kärkkäinen
  0 siblings, 0 replies; 7+ messages in thread
From: Pasi Kärkkäinen @ 2012-05-24 16:24 UTC (permalink / raw)
  To: Ian Jackson
  Cc: xen-devel, Ian.Campbell, stefano.stabellini, Konrad Rzeszutek Wilk

On Thu, May 26, 2011 at 04:02:40PM +0100, Ian Jackson wrote:
> Konrad Rzeszutek Wilk writes ("[Xen-devel] [PATCH 0 of 4] Patches for PCI passthrough with modified E820 (v4.2)."):
> > This set of v4.2 patches allows a PV domain to see the machine's
> > E820 and figure out where the PCI I/O gap is and match it with the reality.
> 
> Thanks.  I have applied 2-4; 1 was applied earlier.
> 

Hello,

I've seen people asking for the e820_host= option for Xen 4.1.x.

Are these patches something that could/should be backported to xen-4.1-testing.hg?

-- Pasi
