All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0 of 3] Patches for PCI passthrough with modified E820 (v3) - resent.
@ 2011-05-04 14:17 Konrad Rzeszutek Wilk
  2011-05-04 14:17 ` [PATCH 1 of 3] tools: Add xc_domain_set_memory_map and xc_get_machine_memory_map calls (x86, amd64 only) Konrad Rzeszutek Wilk
                   ` (3 more replies)
  0 siblings, 4 replies; 24+ messages in thread
From: Konrad Rzeszutek Wilk @ 2011-05-04 14:17 UTC (permalink / raw)
  To: xen-devel, Ian.Jackson, stefano.stabellini, Ian.Campbell; +Cc: konrad.wilk

Hello,

This set of v3 patches allows a PV domain to see the machine's
E820 and figure out where the "PCI I/O" gap is and match it with the reality.

Changelog since v2 posting:
 - Moved 'libxl__e820_alloc' to be called from do_domain_create and if
   machine_e820 == true.
 - Made no_machine_e820 be set to true, if the guest has no PCI devices (and is PV)
 - Used Keir's re-worked code for E820 creation.
Changelog since v1 posting:
 - Squashed the "x86: make the pv-only e820 array be dynamic" and 
   "x86: adjust the size of the e820 for pv guest to be dynamic" together.
 - Made xc_domain_set_memmap_limit use the 'xc_domain_set_memory_map'
 - Moved 'libxl_e820_alloc' and 'libxl_e820_sanitize' to be an internal
   operation and called from 'libxl_device_pci_parse_bdf'.
 - Expanded 'libxl_device_pci_parse_bdf' API call to have an extra argument
   (optional).

The short end is that with these patches a PV domain can:

 - Use the correct PCI I/O gap. Before these patches, Linux guest would
   boot up and would tell:
   [    0.000000] Allocating PCI resources starting at 40000000 (gap: 40000000:c0000000)
   while in actuality the PCI I/O gap should have been:
   [    0.000000] Allocating PCI resources starting at b0000000 (gap: b0000000:4c000000)

 - The PV domain with PCI devices was limited to 3GB. It now can be booted
   with 4GB, 8GB, or whatever number you want. The PCI devices will now _not_ conflict
   with System RAM. Meaning the drivers can load.

 - With 2.6.39 kernels (which has the 1-1 mapping code), the VM_IO flag will be
   now automatically applied to regions that are considerd PCI I/O regions. You can
   find out which those are by looking for '1-1' in the kernel bootup.

To use this patchset, the guest config file has to have the parameter 'pci=['<BDF>',...]'
enabled.

This has been tested with 2.6.18 (RHEL5), 2.6.27(SLES11), 2.6.36, 2.6.37, 2.6.38,
and 2.6.39 kernels. Also tested with PV NetBSD 5.1.

Tested this with the PCI devices (NIC, MSI), and with 2GB, 4GB, and 6GB guests
with success.

 libxc/xc_domain.c      |   77 +++++++++++-----
 libxc/xc_e820.h        |    3 
 libxc/xenctrl.h        |   11 ++
 libxl/libxl.idl        |    1 
 libxl/libxl_create.c   |    8 +
 libxl/libxl_internal.h |    1 
 libxl/libxl_pci.c      |  230 +++++++++++++++++++++++++++++++++++++++++++++++++
 libxl/xl_cmdimpl.c     |    3 
 8 files changed, 309 insertions(+), 25 deletions(-)

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 1 of 3] tools: Add xc_domain_set_memory_map and xc_get_machine_memory_map calls (x86, amd64 only)
  2011-05-04 14:17 [PATCH 0 of 3] Patches for PCI passthrough with modified E820 (v3) - resent Konrad Rzeszutek Wilk
@ 2011-05-04 14:17 ` Konrad Rzeszutek Wilk
  2011-05-04 14:17 ` [PATCH 2 of 3] libxl: Add support for passing in the machine's E820 for PCI passthrough Konrad Rzeszutek Wilk
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 24+ messages in thread
From: Konrad Rzeszutek Wilk @ 2011-05-04 14:17 UTC (permalink / raw)
  To: xen-devel, Ian.Jackson, stefano.stabellini, Ian.Campbell; +Cc: konrad.wilk

# HG changeset patch
# User Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
# Date 1304518568 14400
# Node ID b6af9b428bb16c4c5364ace0617923ffa44ad887
# Parent  476b0d68e7d5405babc1182da3b345b1e4cc1bca
tools: Add xc_domain_set_memory_map and xc_get_machine_memory_map calls (x86,amd64 only).

The later retrieves the E820 as seen by the hypervisor (completely
unchanged) and the second call sets the E820 for the specified guest.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

diff -r 476b0d68e7d5 -r b6af9b428bb1 tools/libxc/xc_domain.c
--- a/tools/libxc/xc_domain.c	Sat Apr 30 09:48:16 2011 +0100
+++ b/tools/libxc/xc_domain.c	Wed May 04 10:16:08 2011 -0400
@@ -478,37 +478,64 @@ int xc_domain_pin_memory_cacheattr(xc_in
 }
 
 #if defined(__i386__) || defined(__x86_64__)
-#include "xc_e820.h"
+int xc_domain_set_memory_map(xc_interface *xch,
+                               uint32_t domid,
+                               struct e820entry entries[],
+                               uint32_t nr_entries)
+{
+    int rc;
+    struct xen_foreign_memory_map fmap = {
+        .domid = domid,
+        .map = { .nr_entries = nr_entries }
+    };
+    DECLARE_HYPERCALL_BOUNCE(entries, nr_entries * sizeof(struct e820entry),
+                             XC_HYPERCALL_BUFFER_BOUNCE_IN);
+
+    if ( !entries || xc_hypercall_bounce_pre(xch, entries) )
+        return -1;
+
+    set_xen_guest_handle(fmap.map.buffer, entries);
+
+    rc = do_memory_op(xch, XENMEM_set_memory_map, &fmap, sizeof(fmap));
+
+    xc_hypercall_bounce_post(xch, entries);
+
+    return rc;
+}
+int xc_get_machine_memory_map(xc_interface *xch,
+                              struct e820entry entries[],
+                              uint32_t max_entries)
+{
+    int rc;
+    struct xen_memory_map memmap = {
+        .nr_entries = max_entries
+    };
+    DECLARE_HYPERCALL_BOUNCE(entries, sizeof(struct e820entry) * max_entries,
+                             XC_HYPERCALL_BUFFER_BOUNCE_OUT);
+
+    if ( !entries || xc_hypercall_bounce_pre(xch, entries) || max_entries <= 1)
+        return -1;
+
+
+    set_xen_guest_handle(memmap.buffer, entries);
+
+    rc = do_memory_op(xch, XENMEM_machine_memory_map, &memmap, sizeof(memmap));
+
+    xc_hypercall_bounce_post(xch, entries);
+
+    return rc ? rc : memmap.nr_entries;
+}
 int xc_domain_set_memmap_limit(xc_interface *xch,
                                uint32_t domid,
                                unsigned long map_limitkb)
 {
-    int rc;
-    struct xen_foreign_memory_map fmap = {
-        .domid = domid,
-        .map = { .nr_entries = 1 }
-    };
-    DECLARE_HYPERCALL_BUFFER(struct e820entry, e820);
+    struct e820entry e820;
 
-    e820 = xc_hypercall_buffer_alloc(xch, e820, sizeof(*e820));
+    e820.addr = 0;
+    e820.size = (uint64_t)map_limitkb << 10;
+    e820.type = E820_RAM;
 
-    if ( e820 == NULL )
-    {
-        PERROR("Could not allocate memory for xc_domain_set_memmap_limit hypercall");
-        return -1;
-    }
-
-    e820->addr = 0;
-    e820->size = (uint64_t)map_limitkb << 10;
-    e820->type = E820_RAM;
-
-    set_xen_guest_handle(fmap.map.buffer, e820);
-
-    rc = do_memory_op(xch, XENMEM_set_memory_map, &fmap, sizeof(fmap));
-
-    xc_hypercall_buffer_free(xch, e820);
-
-    return rc;
+    return xc_domain_set_memory_map(xch, domid, &e820, 1);
 }
 #else
 int xc_domain_set_memmap_limit(xc_interface *xch,
diff -r 476b0d68e7d5 -r b6af9b428bb1 tools/libxc/xc_e820.h
--- a/tools/libxc/xc_e820.h	Sat Apr 30 09:48:16 2011 +0100
+++ b/tools/libxc/xc_e820.h	Wed May 04 10:16:08 2011 -0400
@@ -26,6 +26,9 @@
 #define E820_RESERVED     2
 #define E820_ACPI         3
 #define E820_NVS          4
+#define E820_UNUSABLE     5
+
+#define E820MAX           (128)
 
 struct e820entry {
     uint64_t addr;
diff -r 476b0d68e7d5 -r b6af9b428bb1 tools/libxc/xenctrl.h
--- a/tools/libxc/xenctrl.h	Sat Apr 30 09:48:16 2011 +0100
+++ b/tools/libxc/xenctrl.h	Wed May 04 10:16:08 2011 -0400
@@ -966,6 +966,17 @@ int xc_domain_set_memmap_limit(xc_interf
                                uint32_t domid,
                                unsigned long map_limitkb);
 
+#if defined(__i386__) || defined(__x86_64__)
+#include "xc_e820.h"
+int xc_domain_set_memory_map(xc_interface *xch,
+                               uint32_t domid,
+                               struct e820entry entries[],
+                               uint32_t nr_entries);
+
+int xc_get_machine_memory_map(xc_interface *xch,
+                              struct e820entry entries[],
+                              uint32_t max_entries);
+#endif
 int xc_domain_set_time_offset(xc_interface *xch,
                               uint32_t domid,
                               int32_t time_offset_seconds);

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 2 of 3] libxl: Add support for passing in the machine's E820 for PCI passthrough
  2011-05-04 14:17 [PATCH 0 of 3] Patches for PCI passthrough with modified E820 (v3) - resent Konrad Rzeszutek Wilk
  2011-05-04 14:17 ` [PATCH 1 of 3] tools: Add xc_domain_set_memory_map and xc_get_machine_memory_map calls (x86, amd64 only) Konrad Rzeszutek Wilk
@ 2011-05-04 14:17 ` Konrad Rzeszutek Wilk
  2011-05-10  8:29   ` Ian Campbell
  2011-05-17 15:10   ` Ian Jackson
  2011-05-04 14:17 ` [PATCH 3 of 3] libxl: Convert E820_UNUSABLE and E820_RAM to E820_UNUSABLE as appropriate Konrad Rzeszutek Wilk
  2011-05-09  9:00 ` [PATCH 0 of 3] Patches for PCI passthrough with modified E820 (v3) - resent Ian Campbell
  3 siblings, 2 replies; 24+ messages in thread
From: Konrad Rzeszutek Wilk @ 2011-05-04 14:17 UTC (permalink / raw)
  To: xen-devel, Ian.Jackson, stefano.stabellini, Ian.Campbell; +Cc: konrad.wilk

# HG changeset patch
# User Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
# Date 1304518568 14400
# Node ID ba218fa1a48ed682651fd90dd8eeb68eeab68e7a
# Parent  b6af9b428bb16c4c5364ace0617923ffa44ad887
libxl: Add support for passing in the machine's E820 for PCI passthrough

The code that populates E820 is unconditionally triggered by the guest
configuration having "pci=['<BDF>,..']", being a PV guest, and if
b_info->u.pv.machine_e820 is set.

The code do_domain_create calls the libxl__e820_alloc when
it notices that the guest is PV, has at least one PCI devices, and has
the machine_e820 flag set.

libxl__e820_alloc calls the xc_get_machine_memory_map to retrieve the systems
E820. Then the E820 is sanitized to weed out E820 entries below 16MB, and as
well remove any E820_RAM or E820_UNUSED regions as the guest does not need to
know about them. The guest only needs the E820_ACPI, E820_NVS, E820_RESERVED to
get an idea of where the PCI I/O space is. Mostly.. The Linux kernel assumes that any
gap in the E820 is considered PCI I/O space which means that if we pass
in the guest 2GB, and the E820_ACPI, and its friend start at 3GB, the
gap between 2GB and 3GB will be considered as PCI I/O space. To guard against
that we also create an E820_UNUSABLE between the region of 'target_kb'
(called ram_end in the code) up to the first E820_[ACPI,NVS,RESERVED] region.
Lastly, the xc_domain_set_memory_map is called to install the new E820.

When tested with another PV guest (NetBSD 5.1) the modified E820 gave
it no trouble. The code has also been tested with older "classic" Xen Linux
and with the newer "pvops" with success (SLES11, RHEL5, Ubuntu Lucid,
Debian Squeeze, 2.6.37, 2.6.38, 2.6.39).

Memory that is slack or for balloon (so 'maxmem' in guest configuration)
is put behind the machine E820. Which in most cases is after the 4GB.

The reason for doing the fetching of the E820 using the hypercall in
the toolstack (instead of the guest doing it) is that when a guest
would do a hypercall to 'XENMEM_machine_memory_map' it would
retrieve an E820 with I/O range caps added in. Meaning that the
region after 4GB up to end of possible memory would be marked as unusable
and the kernel would not have any space to allocate a balloon
region.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

diff -r b6af9b428bb1 -r ba218fa1a48e tools/libxl/libxl.idl
--- a/tools/libxl/libxl.idl	Wed May 04 10:16:08 2011 -0400
+++ b/tools/libxl/libxl.idl	Wed May 04 10:16:08 2011 -0400
@@ -180,6 +180,7 @@ libxl_domain_build_info = Struct("domain
                                         ("cmdline", string),
                                         ("ramdisk", libxl_file_reference),
                                         ("features", string, True),
+                                        ("machine_e820", bool, False, "Use machine's E820 for PCI passthrough."),
                                         ])),
                  ])),
     ],
diff -r b6af9b428bb1 -r ba218fa1a48e tools/libxl/libxl_create.c
--- a/tools/libxl/libxl_create.c	Wed May 04 10:16:08 2011 -0400
+++ b/tools/libxl/libxl_create.c	Wed May 04 10:16:08 2011 -0400
@@ -519,6 +519,14 @@ static int do_domain_create(libxl__gc *g
     for (i = 0; i < d_config->num_pcidevs; i++)
         libxl__device_pci_add(gc, domid, &d_config->pcidevs[i], 1);
 
+    if (!d_config->c_info.hvm && d_config->b_info.u.pv.machine_e820) {
+        int rc = 0;
+        rc = libxl__e820_alloc(ctx, domid, d_config);
+        if (rc)
+            LIBXL__LOG(ctx, LIBXL__LOG_WARNING,
+                      "Failed while collecting E820 with: %d (errno:%d)\n",
+                      rc, errno);
+    }
     if ( cb && (d_config->c_info.hvm || d_config->b_info.u.pv.bootloader )) {
         if ( (*cb)(ctx, domid, priv) )
             goto error_out;
diff -r b6af9b428bb1 -r ba218fa1a48e tools/libxl/libxl_internal.h
--- a/tools/libxl/libxl_internal.h	Wed May 04 10:16:08 2011 -0400
+++ b/tools/libxl/libxl_internal.h	Wed May 04 10:16:08 2011 -0400
@@ -362,4 +362,5 @@ _hidden int libxl__error_set(libxl__gc *
 _hidden int libxl__file_reference_map(libxl_file_reference *f);
 _hidden int libxl__file_reference_unmap(libxl_file_reference *f);
 
+_hidden int libxl__e820_alloc(libxl_ctx *ctx, uint32_t domid, libxl_domain_config *d_config);
 #endif
diff -r b6af9b428bb1 -r ba218fa1a48e tools/libxl/libxl_pci.c
--- a/tools/libxl/libxl_pci.c	Wed May 04 10:16:08 2011 -0400
+++ b/tools/libxl/libxl_pci.c	Wed May 04 10:16:08 2011 -0400
@@ -1047,3 +1047,157 @@ int libxl_device_pci_shutdown(libxl_ctx 
     free(pcidevs);
     return 0;
 }
+
+static int e820_sanitize(libxl_ctx *ctx, struct e820entry src[],
+                         uint32_t *nr_entries,
+                         unsigned long map_limitkb,
+                         unsigned long balloon_kb)
+{
+    uint64_t delta_kb = 0, start = 0, start_kb = 0, last = 0, ram_end;
+    uint32_t i, idx = 0, nr;
+    struct e820entry e820[E820MAX];
+
+    if (!src || !map_limitkb || !balloon_kb || !nr_entries)
+        return ERROR_INVAL;
+
+    nr = *nr_entries;
+    if (!nr)
+        return ERROR_INVAL;
+
+    if (nr > E820MAX)
+        return ERROR_NOMEM;
+
+    /* Weed out anything under 16MB */
+    for (i = 0; i < nr; i++) {
+      if (src[i].addr > 0x100000)
+        continue;
+
+      src[i].type = 0;
+      src[i].size = 0;
+      src[i].addr = -1ULL;
+    }
+
+    /* Find the lowest and highest entry in E820, skipping over
+     * undersired entries. */
+    start = -1ULL;
+    last = 0;
+    for (i = 0; i < nr; i++) {
+        if ((src[i].type == E820_RAM) ||
+            (src[i].type == E820_UNUSABLE) ||
+            (src[i].type == 0)) 
+		continue;
+
+            start = src[i].addr < start ? src[i].addr : start;
+            last = src[i].addr + src[i].size > last ?
+                    src[i].addr + src[i].size > last : last;
+    }
+    if (start > 1024)
+      start_kb = start >> 10;
+
+    /* Add the memory RAM region for the guest */
+    e820[idx].addr = 0;
+    e820[idx].size = (uint64_t)map_limitkb << 10;
+    e820[idx].type = E820_RAM;
+
+    /* .. and trim if neccessary */
+    if (start_kb && map_limitkb > start_kb) {
+        delta_kb = map_limitkb - start_kb;
+        if (delta_kb)
+          e820[idx].size -= (uint64_t)(delta_kb << 10);
+    }
+    /* Note: We don't touch balloon_kb here. Will add it at the end. */
+    ram_end = e820[idx].addr + e820[idx].size;
+    idx ++;
+
+    LIBXL__LOG(ctx, LIBXL__LOG_DEBUG, "Memory: %ldkB End of RAM: 0x%lx (PFN) " \
+               "Delta: %ldkB, PCI start: %ldkB (0x%lx PFN), Balloon %ldkB\n",
+		map_limitkb, ram_end >> 12, delta_kb, start_kb ,start >> 12,
+                balloon_kb);
+
+    /* Check if there is a region between ram_end and start. */
+    if (start > ram_end) {
+      /* .. and if not present, add it in. This is to guard against
+       the Linux guest assuming that the gap between the end of
+       RAM region and the start of the E820_[ACPI,NVS,RESERVED]
+       is PCI I/O space. Which it certainly is _not_. */
+      e820[idx].type = E820_UNUSABLE;
+      e820[idx].addr = ram_end;
+      e820[idx].size = start - ram_end;
+      idx++;
+    } 
+     /* Almost done: copy them over, ignoring the undesireable ones */
+     for (i = 0; i < nr; i++) {
+            if ((src[i].type == E820_RAM) ||
+	        (src[i].type == E820_UNUSABLE) ||
+	         (src[i].type == 0))
+		continue;
+            e820[idx].type = src[i].type;
+            e820[idx].addr = src[i].addr;
+            e820[idx].size = src[i].size;
+            idx++;
+     }
+
+     /* At this point we have the mapped RAM + E820 entries from src. */
+     if (balloon_kb) {
+        /* and if we truncated the RAM region, then add it to the end. */
+        e820[idx].type = E820_RAM;
+        e820[idx].addr = (uint64_t)(1ULL << 32) > last ? (uint64_t)(1ULL << 32) : last;
+        /* also add the balloon memory to the end. */
+        e820[idx].size = (uint64_t)(delta_kb << 10) + (uint64_t)(balloon_kb << 10);
+        idx++;
+
+    }
+    nr = idx;
+
+    for (i = 0; i < nr; i++) {
+      LIBXL__LOG(ctx, LIBXL__LOG_DEBUG, ":%s\t[%lx -> %lx]",
+	e820[i].type == E820_RAM ? "RAM " :
+	(e820[i].type == E820_RESERVED ? "RSV " :
+	 e820[i].type == E820_ACPI ? "ACPI" :
+         (e820[i].type == E820_NVS ? "NVS " :
+          (e820[i].type == E820_UNUSABLE ? "UNU " : "----"))),
+          e820[i].addr >> 12,
+         (e820[i].addr + e820[i].size) >> 12);
+    }
+
+    /* Done: copy the sanitized version. */
+    *nr_entries = nr;
+    memcpy(src, e820, nr * sizeof(struct e820entry));
+    return 0;
+}
+      
+
+int libxl__e820_alloc(libxl_ctx *ctx, uint32_t domid, libxl_domain_config *d_config)
+{
+    int rc;
+    uint32_t nr;
+    struct e820entry map[E820MAX];
+    libxl_domain_build_info *b_info;
+
+    if (d_config == NULL || d_config->c_info.hvm)
+      return ERROR_INVAL;
+
+    b_info = &d_config->b_info;
+    if (!b_info->u.pv.machine_e820)
+      return ERROR_INVAL;
+
+    rc = xc_get_machine_memory_map(ctx->xch, map, E820MAX);
+    if (rc < 0) {
+        errno = rc;
+        return ERROR_FAIL;
+    }
+    nr = rc;
+    rc = e820_sanitize(ctx, map, &nr, b_info->target_memkb,
+                       (b_info->max_memkb - b_info->target_memkb) +
+                       b_info->u.pv.slack_memkb);
+    if (rc)
+        return ERROR_FAIL;
+
+    rc = xc_domain_set_memory_map(ctx->xch, domid, map, nr);
+
+    if (rc < 0) {
+        errno  = rc;
+        return ERROR_FAIL;
+    }
+    return 0;
+}
diff -r b6af9b428bb1 -r ba218fa1a48e tools/libxl/xl_cmdimpl.c
--- a/tools/libxl/xl_cmdimpl.c	Wed May 04 10:16:08 2011 -0400
+++ b/tools/libxl/xl_cmdimpl.c	Wed May 04 10:16:08 2011 -0400
@@ -373,6 +373,7 @@ static void printf_info(int domid,
         printf("\t\t\t(kernel %s)\n", b_info->u.pv.kernel.path);
         printf("\t\t\t(cmdline %s)\n", b_info->u.pv.cmdline);
         printf("\t\t\t(ramdisk %s)\n", b_info->u.pv.ramdisk.path);
+        printf("\t\t\t(machine_e820 %d)\n", b_info->u.pv.machine_e820);
         printf("\t\t)\n");
     }
     printf("\t)\n");
@@ -994,6 +995,8 @@ skip_vfb:
             if (!libxl_device_pci_parse_bdf(ctx, pcidev, buf))
                 d_config->num_pcidevs++;
         }
+        if (d_config->num_pcidevs && !c_info->hvm)
+          b_info->u.pv.machine_e820 = true;
     }
 
     switch (xlu_cfg_get_list(config, "cpuid", &cpuids, 0, 1)) {

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 3 of 3] libxl: Convert E820_UNUSABLE and E820_RAM to E820_UNUSABLE as appropriate
  2011-05-04 14:17 [PATCH 0 of 3] Patches for PCI passthrough with modified E820 (v3) - resent Konrad Rzeszutek Wilk
  2011-05-04 14:17 ` [PATCH 1 of 3] tools: Add xc_domain_set_memory_map and xc_get_machine_memory_map calls (x86, amd64 only) Konrad Rzeszutek Wilk
  2011-05-04 14:17 ` [PATCH 2 of 3] libxl: Add support for passing in the machine's E820 for PCI passthrough Konrad Rzeszutek Wilk
@ 2011-05-04 14:17 ` Konrad Rzeszutek Wilk
  2011-05-09  9:00 ` [PATCH 0 of 3] Patches for PCI passthrough with modified E820 (v3) - resent Ian Campbell
  3 siblings, 0 replies; 24+ messages in thread
From: Konrad Rzeszutek Wilk @ 2011-05-04 14:17 UTC (permalink / raw)
  To: xen-devel, Ian.Jackson, stefano.stabellini, Ian.Campbell; +Cc: konrad.wilk

# HG changeset patch
# User Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
# Date 1304518568 14400
# Node ID dd3e8dd061891127a75ee26961b7b3b1dd6c5af7
# Parent  ba218fa1a48ed682651fd90dd8eeb68eeab68e7a
libxl: Convert E820_UNUSABLE and E820_RAM to E820_UNUSABLE as appropriate.

Most machines after the RAM regions in the e802 have a couple of
E820_RESERVED, with E820_ACPI and E820_NVS. On some Intel machines, the
E820 looks like swiss cheese:

(XEN) Initial Xen-e820 RAM map:
(XEN)  0000000000000000 - 000000000009d000 (usable)
(XEN)  000000000009d000 - 00000000000a0000 (reserved)
(XEN)  00000000000e0000 - 0000000000100000 (reserved)
(XEN)  0000000000100000 - 000000009cf66000 (usable)
(XEN)  000000009cf66000 - 000000009d102000 (ACPI NVS)
(XEN)  000000009d102000 - 000000009f6bd000 (usable)  <--
(XEN)  000000009f6bd000 - 000000009f6bf000 (reserved)
(XEN)  000000009f6bf000 - 000000009f714000 (usable)  <--
(XEN)  000000009f714000 - 000000009f7bf000 (ACPI NVS)
(XEN)  000000009f7bf000 - 000000009f7e0000 (usable)  <--
(XEN)  000000009f7e0000 - 000000009f7ff000 (ACPI data)
(XEN)  000000009f7ff000 - 000000009f800000 (usable)  <--
(XEN)  000000009f800000 - 00000000a0000000 (reserved)
(XEN)  00000000a0000000 - 00000000b0000000 (reserved)
(XEN)  00000000fc000000 - 00000000fd000000 (reserved)
(XEN)  00000000ffe00000 - 0000000100000000 (reserved)
(XEN)  0000000100000000 - 0000000160000000 (usable)

Which means we have to pay attention to the E820_RAM that are
between the E820_[ACPI,NVS,RESERVED]. If we remove those
E820_RAM (b/c the amount of memory passed to the guest
is less that where those E820 regions reside) from the E820, the
Linux kernel interprets those "gaps" as PCI I/O space.
This is what we are currently doing.

This can be disastrous if we pass in an Intel IGD card which tries
to use the first available PCI I/O space - and ends up
using the MFNs which are actually RAM instead of being the
PCI I/O space.

To make this work, we convert all E820_RAM that are above
the 'target_kb' (those that overlap the 'target_kb'
are truncated appropriately) to be E820_UNUSABLE. We also limit this
alternation up to 4GB. This means that an E820 for a guest
from this (target_kb=1024, maxmem=2048):

[    0.000000] Set 405658 page(s) to 1-1 mapping.
[    0.000000] BIOS-provided physical RAM map:
[    0.000000]  Xen: 0000000000000000 - 00000000000a0000 (usable)
[    0.000000]  Xen: 00000000000a0000 - 0000000000100000 (reserved)
[    0.000000]  Xen: 0000000000100000 - 0000000040000000 (usable)
[    0.000000]  Xen: 0000000040000000 - 000000009cf66000 (unusable)
[    0.000000]  Xen: 000000009cf66000 - 000000009d102000 (ACPI NVS)
[    0.000000]  Xen: 000000009f6bd000 - 000000009f6bf000 (reserved)
[    0.000000]  Xen: 000000009f714000 - 000000009f7bf000 (ACPI NVS)
[    0.000000]  Xen: 000000009f7e0000 - 000000009f7ff000 (ACPI data)
[    0.000000]  Xen: 000000009f800000 - 00000000b0000000 (reserved)
[    0.000000]  Xen: 00000000fc000000 - 00000000fd000000 (reserved)
[    0.000000]  Xen: 00000000fec00000 - 00000000fec01000 (reserved)
[    0.000000]  Xen: 00000000fee00000 - 00000000fee01000 (reserved)
[    0.000000]  Xen: 00000000ffe00000 - 0000000100000000 (reserved)
[    0.000000]  Xen: 0000000100000000 - 0000000140800000 (usable)


Will look as so:

[    0.000000] Set 395880 page(s) to 1-1 mapping.
[    0.000000] BIOS-provided physical RAM map:
[    0.000000]  Xen: 0000000000000000 - 00000000000a0000 (usable)
[    0.000000]  Xen: 00000000000a0000 - 0000000000100000 (reserved)
[    0.000000]  Xen: 0000000000100000 - 0000000040000000 (usable)
[    0.000000]  Xen: 0000000040000000 - 000000009cf66000 (unusable)
[    0.000000]  Xen: 000000009cf66000 - 000000009d102000 (ACPI NVS)
[    0.000000]  Xen: 000000009d102000 - 000000009f6bd000 (unusable)
[    0.000000]  Xen: 000000009f6bd000 - 000000009f6bf000 (reserved)
[    0.000000]  Xen: 000000009f6bf000 - 000000009f714000 (unusable)
[    0.000000]  Xen: 000000009f714000 - 000000009f7bf000 (ACPI NVS)
[    0.000000]  Xen: 000000009f7bf000 - 000000009f7e0000 (unusable)
[    0.000000]  Xen: 000000009f7e0000 - 000000009f7ff000 (ACPI data)
[    0.000000]  Xen: 000000009f7ff000 - 000000009f800000 (unusable)
[    0.000000]  Xen: 000000009f800000 - 00000000b0000000 (reserved)
[    0.000000]  Xen: 00000000fc000000 - 00000000fd000000 (reserved)
[    0.000000]  Xen: 00000000fec00000 - 00000000fec01000 (reserved)
[    0.000000]  Xen: 00000000fee00000 - 00000000fee01000 (reserved)
[    0.000000]  Xen: 00000000ffe00000 - 0000000100000000 (reserved)
[    0.000000]  Xen: 0000000100000000 - 0000000140800000 (usable)

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

diff -r ba218fa1a48e -r dd3e8dd06189 tools/libxl/libxl_pci.c
--- a/tools/libxl/libxl_pci.c	Wed May 04 10:16:08 2011 -0400
+++ b/tools/libxl/libxl_pci.c	Wed May 04 10:16:08 2011 -0400
@@ -1114,22 +1114,98 @@ static int e820_sanitize(libxl_ctx *ctx,
 		map_limitkb, ram_end >> 12, delta_kb, start_kb ,start >> 12,
                 balloon_kb);
 
+    /* This whole code below is to guard against if the Intel IGD is passed into
+     * the guest. If we don't pass in IGD, this whole code can be ignored.
+     *
+     * The reason for this code is that Intel boxes fill their E820 with
+     * E820_RAM amongst E820_RESERVED and we can't just ditch the E820_RAM.
+     * That is b/c any "gaps" in the E820 is considered PCI I/O space by
+     * Linux and it would be utilized by the Intel IGD as I/O space while
+     * in reality it was an RAM region.
+     *
+     * What this means is that we have to walk the E820 and for any region
+     * that is RAM and below 4GB and above ram_end, needs to change its type
+     * to E820_UNUSED. We also need to move some of the E820_RAM regions if
+     * the overlap with ram_end. */
+    for (i = 0; i < nr; i++) {
+	uint64_t end = src[i].addr + src[i].size;
+
+	/* We don't care about E820_UNUSABLE, but we need to
+         * change the type to zero b/c the loop after this
+         * sticks E820_UNUSABLE on the guest's E820 but ignores
+         * the ones with type zero. */
+        if ((src[i].type == E820_UNUSABLE) ||
+	    /* Any region that is within the "RAM region" can
+             * be safely ditched. */
+            (end < ram_end)) {
+		src[i].type = 0;
+                continue;
+        }
+
+        /* Look only at RAM regions. */
+	if (src[i].type != E820_RAM)
+            continue;
+
+        /* We only care about RAM regions below 4GB. */
+        if (src[i].addr >= (1ULL<<32))
+            continue;
+
+	/* E820_RAM overlaps with our RAM region. Move it */
+	if (src[i].addr < ram_end) {
+		uint64_t delta;
+
+		src[i].type = E820_UNUSABLE;
+		delta = ram_end - src[i].addr;
+		/* The end < ram_end should weed this out */
+		if (src[i].size - delta < 0)
+			src[i].type = 0;
+		else {
+			src[i].size -= delta;
+			src[i].addr = ram_end;
+		}
+		if (src[i].addr + src[i].size != end) {
+			/* We messed up somewhere */
+			src[i].type = 0;
+                        LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "Computed E820 wrongly. Continuing on.");
+		}
+	}
+	/* Lastly, convert the RAM to UNSUABLE. Look in the Linux kernel
+           at git commit 2f14ddc3a7146ea4cd5a3d1ecd993f85f2e4f948
+            "xen/setup: Inhibit resource API from using System RAM E820
+           gaps as PCI mem gaps" for full explanation. */
+	if (end > ram_end)
+          src[i].type = E820_UNUSABLE;
+    }
+
     /* Check if there is a region between ram_end and start. */
     if (start > ram_end) {
+        int add_unusable = 1;
+        for (i = 0; i < nr && add_unusable; i++) {
+          if (src[i].type != E820_UNUSABLE)
+            continue;
+          if (ram_end != src[i].addr)
+            continue;
+          if (start != src[i].addr + src[i].size)
+            /* there is one, adjust it */
+            src[i].size = start - src[i].addr;
+
+          add_unusable = 0;
+      }
       /* .. and if not present, add it in. This is to guard against
        the Linux guest assuming that the gap between the end of
        RAM region and the start of the E820_[ACPI,NVS,RESERVED]
        is PCI I/O space. Which it certainly is _not_. */
-      e820[idx].type = E820_UNUSABLE;
-      e820[idx].addr = ram_end;
-      e820[idx].size = start - ram_end;
-      idx++;
+      if (add_unusable) {
+        e820[idx].type = E820_UNUSABLE;
+        e820[idx].addr = ram_end;
+        e820[idx].size = start - ram_end;
+        idx++;
+      }
     } 
-     /* Almost done: copy them over, ignoring the undesireable ones */
-     for (i = 0; i < nr; i++) {
+    /* Almost done: copy them over, ignoring the undesireable ones */
+    for (i = 0; i < nr; i++) {
             if ((src[i].type == E820_RAM) ||
-	        (src[i].type == E820_UNUSABLE) ||
-	         (src[i].type == 0))
+	        (src[i].type == 0))
 		continue;
             e820[idx].type = src[i].type;
             e820[idx].addr = src[i].addr;

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 0 of 3] Patches for PCI passthrough with modified E820 (v3) - resent.
  2011-05-04 14:17 [PATCH 0 of 3] Patches for PCI passthrough with modified E820 (v3) - resent Konrad Rzeszutek Wilk
                   ` (2 preceding siblings ...)
  2011-05-04 14:17 ` [PATCH 3 of 3] libxl: Convert E820_UNUSABLE and E820_RAM to E820_UNUSABLE as appropriate Konrad Rzeszutek Wilk
@ 2011-05-09  9:00 ` Ian Campbell
  2011-05-09 21:01   ` Konrad Rzeszutek Wilk
  3 siblings, 1 reply; 24+ messages in thread
From: Ian Campbell @ 2011-05-09  9:00 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk; +Cc: xen-devel, Ian Jackson, Stefano Stabellini

On Wed, 2011-05-04 at 15:17 +0100, Konrad Rzeszutek Wilk wrote:
> Hello,
> 
> This set of v3 patches allows a PV domain to see the machine's
> E820 and figure out where the "PCI I/O" gap is and match it with the reality.
> 
> Changelog since v2 posting:
>  - Moved 'libxl__e820_alloc' to be called from do_domain_create and if
>    machine_e820 == true.
>  - Made no_machine_e820 be set to true, if the guest has no PCI devices (and is PV)
>  - Used Keir's re-worked code for E820 creation.
> Changelog since v1 posting:
>  - Squashed the "x86: make the pv-only e820 array be dynamic" and 
>    "x86: adjust the size of the e820 for pv guest to be dynamic" together.
>  - Made xc_domain_set_memmap_limit use the 'xc_domain_set_memory_map'
>  - Moved 'libxl_e820_alloc' and 'libxl_e820_sanitize' to be an internal
>    operation and called from 'libxl_device_pci_parse_bdf'.
>  - Expanded 'libxl_device_pci_parse_bdf' API call to have an extra argument
>    (optional).
> 
> The short end is that with these patches a PV domain can:
> 
>  - Use the correct PCI I/O gap. Before these patches, Linux guest would
>    boot up and would tell:
>    [    0.000000] Allocating PCI resources starting at 40000000 (gap: 40000000:c0000000)
>    while in actuality the PCI I/O gap should have been:
>    [    0.000000] Allocating PCI resources starting at b0000000 (gap: b0000000:4c000000)

The reason it needs to be a particular gap is that we can't (easily? at
all?) rewrite the device BARs to match the guest's idea of the hole, is
that right? So it needs to be consistent with the underlying host hole.

I wonder if it is time to enable IOMMU for PV guests by default.
Presumably in that case we can manufacture any hole we like in the e820,
which is useful e.g. when migrating to not-completely-homogeneous hosts.

>  - The PV domain with PCI devices was limited to 3GB. It now can be booted
>    with 4GB, 8GB, or whatever number you want. The PCI devices will now _not_ conflict
>    with System RAM. Meaning the drivers can load.
> 
>  - With 2.6.39 kernels (which has the 1-1 mapping code), the VM_IO flag will be
>    now automatically applied to regions that are considerd PCI I/O regions. You can
>    find out which those are by looking for '1-1' in the kernel bootup.
> 
> To use this patchset, the guest config file has to have the parameter 'pci=['<BDF>',...]'
> enabled.
> 
> This has been tested with 2.6.18 (RHEL5), 2.6.27(SLES11), 2.6.36, 2.6.37, 2.6.38,
> and 2.6.39 kernels. Also tested with PV NetBSD 5.1.
> 
> Tested this with the PCI devices (NIC, MSI), and with 2GB, 4GB, and 6GB guests
> with success.
> 
>  libxc/xc_domain.c      |   77 +++++++++++-----
>  libxc/xc_e820.h        |    3 
>  libxc/xenctrl.h        |   11 ++
>  libxl/libxl.idl        |    1 
>  libxl/libxl_create.c   |    8 +
>  libxl/libxl_internal.h |    1 
>  libxl/libxl_pci.c      |  230 +++++++++++++++++++++++++++++++++++++++++++++++++
>  libxl/xl_cmdimpl.c     |    3 
>  8 files changed, 309 insertions(+), 25 deletions(-)
> 
> 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Re: [PATCH 0 of 3] Patches for PCI passthrough with modified E820 (v3) - resent.
  2011-05-09  9:00 ` [PATCH 0 of 3] Patches for PCI passthrough with modified E820 (v3) - resent Ian Campbell
@ 2011-05-09 21:01   ` Konrad Rzeszutek Wilk
  2011-05-10  8:30     ` Ian Campbell
  0 siblings, 1 reply; 24+ messages in thread
From: Konrad Rzeszutek Wilk @ 2011-05-09 21:01 UTC (permalink / raw)
  To: Ian Campbell; +Cc: xen-devel, Ian Jackson, Stefano Stabellini

On Mon, May 09, 2011 at 10:00:30AM +0100, Ian Campbell wrote:
> On Wed, 2011-05-04 at 15:17 +0100, Konrad Rzeszutek Wilk wrote:
> > Hello,
> > 
> > This set of v3 patches allows a PV domain to see the machine's
> > E820 and figure out where the "PCI I/O" gap is and match it with the reality.
> > 
> > Changelog since v2 posting:
> >  - Moved 'libxl__e820_alloc' to be called from do_domain_create and if
> >    machine_e820 == true.
> >  - Made no_machine_e820 be set to true, if the guest has no PCI devices (and is PV)
> >  - Used Keir's re-worked code for E820 creation.
> > Changelog since v1 posting:
> >  - Squashed the "x86: make the pv-only e820 array be dynamic" and 
> >    "x86: adjust the size of the e820 for pv guest to be dynamic" together.
> >  - Made xc_domain_set_memmap_limit use the 'xc_domain_set_memory_map'
> >  - Moved 'libxl_e820_alloc' and 'libxl_e820_sanitize' to be an internal
> >    operation and called from 'libxl_device_pci_parse_bdf'.
> >  - Expanded 'libxl_device_pci_parse_bdf' API call to have an extra argument
> >    (optional).
> > 
> > The short end is that with these patches a PV domain can:
> > 
> >  - Use the correct PCI I/O gap. Before these patches, Linux guest would
> >    boot up and would tell:
> >    [    0.000000] Allocating PCI resources starting at 40000000 (gap: 40000000:c0000000)
> >    while in actuality the PCI I/O gap should have been:
> >    [    0.000000] Allocating PCI resources starting at b0000000 (gap: b0000000:4c000000)
> 
> The reason it needs to be a particular gap is that we can't (easily? at
> all?) rewrite the device BARs to match the guest's idea of the hole, is
> that right? So it needs to be consistent with the underlying host hole.

correct.
> 
> I wonder if it is time to enable IOMMU for PV guests by default.

Would be nice. I thought if the IOMMU was present it wouldautomatically do that?

> Presumably in that case we can manufacture any hole we like in the e820,
> which is useful e.g. when migrating to not-completely-homogeneous hosts.


Hmm. I want to say yes, but not entirely sure what are all the pieces that
this would entail.

> 
> >  - The PV domain with PCI devices was limited to 3GB. It now can be booted
> >    with 4GB, 8GB, or whatever number you want. The PCI devices will now _not_ conflict
> >    with System RAM. Meaning the drivers can load.
> > 
> >  - With 2.6.39 kernels (which has the 1-1 mapping code), the VM_IO flag will be
> >    now automatically applied to regions that are considerd PCI I/O regions. You can
> >    find out which those are by looking for '1-1' in the kernel bootup.
> > 
> > To use this patchset, the guest config file has to have the parameter 'pci=['<BDF>',...]'
> > enabled.
> > 
> > This has been tested with 2.6.18 (RHEL5), 2.6.27(SLES11), 2.6.36, 2.6.37, 2.6.38,
> > and 2.6.39 kernels. Also tested with PV NetBSD 5.1.
> > 
> > Tested this with the PCI devices (NIC, MSI), and with 2GB, 4GB, and 6GB guests
> > with success.
> > 
> >  libxc/xc_domain.c      |   77 +++++++++++-----
> >  libxc/xc_e820.h        |    3 
> >  libxc/xenctrl.h        |   11 ++
> >  libxl/libxl.idl        |    1 
> >  libxl/libxl_create.c   |    8 +
> >  libxl/libxl_internal.h |    1 
> >  libxl/libxl_pci.c      |  230 +++++++++++++++++++++++++++++++++++++++++++++++++
> >  libxl/xl_cmdimpl.c     |    3 
> >  8 files changed, 309 insertions(+), 25 deletions(-)
> > 
> > 
> 
> 
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xensource.com
> http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2 of 3] libxl: Add support for passing in the machine's E820 for PCI passthrough
  2011-05-04 14:17 ` [PATCH 2 of 3] libxl: Add support for passing in the machine's E820 for PCI passthrough Konrad Rzeszutek Wilk
@ 2011-05-10  8:29   ` Ian Campbell
  2011-05-10 14:56     ` Konrad Rzeszutek Wilk
  2011-05-17 15:10   ` Ian Jackson
  1 sibling, 1 reply; 24+ messages in thread
From: Ian Campbell @ 2011-05-10  8:29 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk; +Cc: xen-devel, Ian Jackson, Stefano Stabellini

On Wed, 2011-05-04 at 15:17 +0100, Konrad Rzeszutek Wilk wrote:
> # HG changeset patch
> # User Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> # Date 1304518568 14400
> # Node ID ba218fa1a48ed682651fd90dd8eeb68eeab68e7a
> # Parent  b6af9b428bb16c4c5364ace0617923ffa44ad887
> libxl: Add support for passing in the machine's E820 for PCI passthrough
> 
> The code that populates E820 is unconditionally triggered by the guest
> configuration having "pci=['<BDF>,..']", being a PV guest, and if
> b_info->u.pv.machine_e820 is set.
> 
> The code do_domain_create calls the libxl__e820_alloc when
> it notices that the guest is PV, has at least one PCI devices, and has
> the machine_e820 flag set.
> 
> libxl__e820_alloc calls the xc_get_machine_memory_map to retrieve the systems
> E820. Then the E820 is sanitized to weed out E820 entries below 16MB, and as
> well remove any E820_RAM or E820_UNUSED regions as the guest does not need to
> know about them. The guest only needs the E820_ACPI, E820_NVS, E820_RESERVED to
> get an idea of where the PCI I/O space is. Mostly.. The Linux kernel assumes that any
> gap in the E820 is considered PCI I/O space which means that if we pass
> in the guest 2GB, and the E820_ACPI, and its friend start at 3GB, the
> gap between 2GB and 3GB will be considered as PCI I/O space. To guard against
> that we also create an E820_UNUSABLE between the region of 'target_kb'
> (called ram_end in the code) up to the first E820_[ACPI,NVS,RESERVED] region.
> Lastly, the xc_domain_set_memory_map is called to install the new E820.

Do we need to document the requirements this places on a PV guest
somewhere more prominent?

Phrases like "up to the first E820...." make me worry that this will
only work on "sensible" machines, but I suppose we can cross those
bridges when we come to them...

> When tested with another PV guest (NetBSD 5.1) the modified E820 gave
> it no trouble. The code has also been tested with older "classic" Xen Linux
> and with the newer "pvops" with success (SLES11, RHEL5, Ubuntu Lucid,
> Debian Squeeze, 2.6.37, 2.6.38, 2.6.39).
> 
> Memory that is slack or for balloon (so 'maxmem' in guest configuration)
> is put behind the machine E820. Which in most cases is after the 4GB.
> 
> The reason for doing the fetching of the E820 using the hypercall in
> the toolstack (instead of the guest doing it) is that when a guest
> would do a hypercall to 'XENMEM_machine_memory_map' it would
> retrieve an E820 with I/O range caps added in. Meaning that the
> region after 4GB up to end of possible memory would be marked as unusable
> and the kernel would not have any space to allocate a balloon
> region.
> 
> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> 
> diff -r b6af9b428bb1 -r ba218fa1a48e tools/libxl/libxl.idl
> --- a/tools/libxl/libxl.idl	Wed May 04 10:16:08 2011 -0400
> +++ b/tools/libxl/libxl.idl	Wed May 04 10:16:08 2011 -0400
> @@ -180,6 +180,7 @@ libxl_domain_build_info = Struct("domain
>                                          ("cmdline", string),
>                                          ("ramdisk", libxl_file_reference),
>                                          ("features", string, True),
> +                                        ("machine_e820", bool, False, "Use machine's E820 for PCI passthrough."),
>                                          ])),
>                   ])),
>      ],
> diff -r b6af9b428bb1 -r ba218fa1a48e tools/libxl/libxl_create.c
> --- a/tools/libxl/libxl_create.c	Wed May 04 10:16:08 2011 -0400
> +++ b/tools/libxl/libxl_create.c	Wed May 04 10:16:08 2011 -0400
> @@ -519,6 +519,14 @@ static int do_domain_create(libxl__gc *g
>      for (i = 0; i < d_config->num_pcidevs; i++)
>          libxl__device_pci_add(gc, domid, &d_config->pcidevs[i], 1);
>  
> +    if (!d_config->c_info.hvm && d_config->b_info.u.pv.machine_e820) {
> +        int rc = 0;
> +        rc = libxl__e820_alloc(ctx, domid, d_config);

rc is pretty comprehensively initialised ;-0

> +        if (rc)
> +            LIBXL__LOG(ctx, LIBXL__LOG_WARNING,
> +                      "Failed while collecting E820 with: %d (errno:%d)\n",
> +                      rc, errno);

LIBXL__LOG_ERRNO takes care of logging errno for you.

> +    }
>      if ( cb && (d_config->c_info.hvm || d_config->b_info.u.pv.bootloader )) {
>          if ( (*cb)(ctx, domid, priv) )
>              goto error_out;

> diff -r b6af9b428bb1 -r ba218fa1a48e tools/libxl/libxl_pci.c
> --- a/tools/libxl/libxl_pci.c	Wed May 04 10:16:08 2011 -0400
> +++ b/tools/libxl/libxl_pci.c	Wed May 04 10:16:08 2011 -0400
> @@ -1047,3 +1047,157 @@ int libxl_device_pci_shutdown(libxl_ctx 
>      free(pcidevs);
>      return 0;
>  }
> +
> +static int e820_sanitize(libxl_ctx *ctx, struct e820entry src[],
> +                         uint32_t *nr_entries,
> +                         unsigned long map_limitkb,
> +                         unsigned long balloon_kb)
> +{
> +    uint64_t delta_kb = 0, start = 0, start_kb = 0, last = 0, ram_end;
> +    uint32_t i, idx = 0, nr;
> +    struct e820entry e820[E820MAX];
> +
> +    if (!src || !map_limitkb || !balloon_kb || !nr_entries)
> +        return ERROR_INVAL;
> +
> +    nr = *nr_entries;
> +    if (!nr)
> +        return ERROR_INVAL;
> +
> +    if (nr > E820MAX)
> +        return ERROR_NOMEM;
> +
> +    /* Weed out anything under 16MB */
> +    for (i = 0; i < nr; i++) {
> +      if (src[i].addr > 0x100000)

0x100000 is 1MB not 16MB. I'm not sure if the code or the comment is
correct.

> +        continue;
> +
> +      src[i].type = 0;
> +      src[i].size = 0;
> +      src[i].addr = -1ULL;
> +    }
> +
> +    /* Find the lowest and highest entry in E820, skipping over
> +     * undersired entries. */

          undesired ?

> +    start = -1ULL;
> +    last = 0;
> +    for (i = 0; i < nr; i++) {
> +        if ((src[i].type == E820_RAM) ||
> +            (src[i].type == E820_UNUSABLE) ||
> +            (src[i].type == 0)) 
[...]
>     nr = idx;
> +
> +    for (i = 0; i < nr; i++) {
> +      LIBXL__LOG(ctx, LIBXL__LOG_DEBUG, ":%s\t[%lx -> %lx]",
> +	e820[i].type == E820_RAM ? "RAM " :
> +	(e820[i].type == E820_RESERVED ? "RSV " :
> +	 e820[i].type == E820_ACPI ? "ACPI" :
> +         (e820[i].type == E820_NVS ? "NVS " :
> +          (e820[i].type == E820_UNUSABLE ? "UNU " : "----"))),
> +          e820[i].addr >> 12,
> +         (e820[i].addr + e820[i].size) >> 12);

This screams out for a e820_type_as_string style function or a lookup
table, or just about anything else really ;-).

Also you can use "%-4s" instead of manually aligning all the strings.

> +    }
> +
> +    /* Done: copy the sanitized version. */
> +    *nr_entries = nr;
> +    memcpy(src, e820, nr * sizeof(struct e820entry));
> +    return 0;
> +}
> +

Ian.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Re: [PATCH 0 of 3] Patches for PCI passthrough with modified E820 (v3) - resent.
  2011-05-09 21:01   ` Konrad Rzeszutek Wilk
@ 2011-05-10  8:30     ` Ian Campbell
  2011-05-10 14:53       ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 24+ messages in thread
From: Ian Campbell @ 2011-05-10  8:30 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk; +Cc: xen-devel, Ian Jackson, Stefano Stabellini

On Mon, 2011-05-09 at 22:01 +0100, Konrad Rzeszutek Wilk wrote:
> On Mon, May 09, 2011 at 10:00:30AM +0100, Ian Campbell wrote:
> > On Wed, 2011-05-04 at 15:17 +0100, Konrad Rzeszutek Wilk wrote:
> > > Hello,
> > > 
> > > This set of v3 patches allows a PV domain to see the machine's
> > > E820 and figure out where the "PCI I/O" gap is and match it with the reality.
> > > 
> > > Changelog since v2 posting:
> > >  - Moved 'libxl__e820_alloc' to be called from do_domain_create and if
> > >    machine_e820 == true.
> > >  - Made no_machine_e820 be set to true, if the guest has no PCI devices (and is PV)
> > >  - Used Keir's re-worked code for E820 creation.
> > > Changelog since v1 posting:
> > >  - Squashed the "x86: make the pv-only e820 array be dynamic" and 
> > >    "x86: adjust the size of the e820 for pv guest to be dynamic" together.
> > >  - Made xc_domain_set_memmap_limit use the 'xc_domain_set_memory_map'
> > >  - Moved 'libxl_e820_alloc' and 'libxl_e820_sanitize' to be an internal
> > >    operation and called from 'libxl_device_pci_parse_bdf'.
> > >  - Expanded 'libxl_device_pci_parse_bdf' API call to have an extra argument
> > >    (optional).
> > > 
> > > The short end is that with these patches a PV domain can:
> > > 
> > >  - Use the correct PCI I/O gap. Before these patches, Linux guest would
> > >    boot up and would tell:
> > >    [    0.000000] Allocating PCI resources starting at 40000000 (gap: 40000000:c0000000)
> > >    while in actuality the PCI I/O gap should have been:
> > >    [    0.000000] Allocating PCI resources starting at b0000000 (gap: b0000000:4c000000)
> > 
> > The reason it needs to be a particular gap is that we can't (easily? at
> > all?) rewrite the device BARs to match the guest's idea of the hole, is
> > that right? So it needs to be consistent with the underlying host hole.
> 
> correct.
> > 
> > I wonder if it is time to enable IOMMU for PV guests by default.
> 
> Would be nice. I thought if the IOMMU was present it wouldautomatically do that?

I must admit I wasn't sure but I thought not, however 21770:510c797ee115
"iommu: Remove pointless iommu=pv boot option" removed the option so I
think you are right, passing iommu=1 is sufficient to enable IOMMU for
both PV and HVM guests.

> > Presumably in that case we can manufacture any hole we like in the e820,
> > which is useful e.g. when migrating to not-completely-homogeneous hosts.

> Hmm. I want to say yes, but not entirely sure what are all the pieces that
> this would entail.

I think it's a decision which will be internal to libxl (in particular
the important thing is that the option isn't exposed in the guest cfg in
a non-forward compatible way) so we can implement it as and when we get
round to it and not block this series on anything along these lines.


> > > This has been tested with 2.6.18 (RHEL5), 2.6.27(SLES11), 2.6.36, 2.6.37, 2.6.38,
> > > and 2.6.39 kernels. Also tested with PV NetBSD 5.1.

They all saw the full amount of RAM? Since the domain builder does not
obey the e820 I'd have thought they would end up with RAM in their I/O
holes which needs to be swizzled around, which at least some of those
guests won't do...

> > > 
> > > Tested this with the PCI devices (NIC, MSI), and with 2GB, 4GB, and 6GB guests
> > > with success.
> > > 
> > >  libxc/xc_domain.c      |   77 +++++++++++-----
> > >  libxc/xc_e820.h        |    3 
> > >  libxc/xenctrl.h        |   11 ++
> > >  libxl/libxl.idl        |    1 
> > >  libxl/libxl_create.c   |    8 +
> > >  libxl/libxl_internal.h |    1 
> > >  libxl/libxl_pci.c      |  230 +++++++++++++++++++++++++++++++++++++++++++++++++
> > >  libxl/xl_cmdimpl.c     |    3 
> > >  8 files changed, 309 insertions(+), 25 deletions(-)
> > > 
> > > 
> > 
> > 
> > 
> > _______________________________________________
> > Xen-devel mailing list
> > Xen-devel@lists.xensource.com
> > http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Re: [PATCH 0 of 3] Patches for PCI passthrough with modified E820 (v3) - resent.
  2011-05-10  8:30     ` Ian Campbell
@ 2011-05-10 14:53       ` Konrad Rzeszutek Wilk
  2011-05-10 15:09         ` Ian Campbell
  0 siblings, 1 reply; 24+ messages in thread
From: Konrad Rzeszutek Wilk @ 2011-05-10 14:53 UTC (permalink / raw)
  To: Ian Campbell; +Cc: xen-devel, Ian Jackson, Stefano Stabellini

> > > Presumably in that case we can manufacture any hole we like in the e820,
> > > which is useful e.g. when migrating to not-completely-homogeneous hosts.
> 
> > Hmm. I want to say yes, but not entirely sure what are all the pieces that
> > this would entail.
> 
> I think it's a decision which will be internal to libxl (in particular
> the important thing is that the option isn't exposed in the guest cfg in
> a non-forward compatible way) so we can implement it as and when we get

Right.
> round to it and not block this series on anything along these lines.

OK.
> 
> 
> > > > This has been tested with 2.6.18 (RHEL5), 2.6.27(SLES11), 2.6.36, 2.6.37, 2.6.38,
> > > > and 2.6.39 kernels. Also tested with PV NetBSD 5.1.
> 
> They all saw the full amount of RAM? Since the domain builder does not

Correct. So if the guest had 6GB passed in, roughly 3GB ended below the
4GB mark, and then rest - 3GB got stuck behind the 4GB.

> obey the e820 I'd have thought they would end up with RAM in their I/O
> holes which needs to be swizzled around, which at least some of those
> guests won't do...

So, Linux works with no trouble (32-bit or 64-bit, ancient or new).
NetBSD seemed to work just fine. I passed in 4GB (I think - I can re-run
the test just to make sure), and a PCI card (82574L) and it booted.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2 of 3] libxl: Add support for passing in the machine's E820 for PCI passthrough
  2011-05-10  8:29   ` Ian Campbell
@ 2011-05-10 14:56     ` Konrad Rzeszutek Wilk
  0 siblings, 0 replies; 24+ messages in thread
From: Konrad Rzeszutek Wilk @ 2011-05-10 14:56 UTC (permalink / raw)
  To: Ian Campbell; +Cc: xen-devel, Ian Jackson, Stefano Stabellini

On Tue, May 10, 2011 at 09:29:50AM +0100, Ian Campbell wrote:
> On Wed, 2011-05-04 at 15:17 +0100, Konrad Rzeszutek Wilk wrote:
> > # HG changeset patch
> > # User Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> > # Date 1304518568 14400
> > # Node ID ba218fa1a48ed682651fd90dd8eeb68eeab68e7a
> > # Parent  b6af9b428bb16c4c5364ace0617923ffa44ad887
> > libxl: Add support for passing in the machine's E820 for PCI passthrough
> > 
> > The code that populates E820 is unconditionally triggered by the guest
> > configuration having "pci=['<BDF>,..']", being a PV guest, and if
> > b_info->u.pv.machine_e820 is set.
> > 
> > The code do_domain_create calls the libxl__e820_alloc when
> > it notices that the guest is PV, has at least one PCI devices, and has
> > the machine_e820 flag set.
> > 
> > libxl__e820_alloc calls the xc_get_machine_memory_map to retrieve the systems
> > E820. Then the E820 is sanitized to weed out E820 entries below 16MB, and as
> > well remove any E820_RAM or E820_UNUSED regions as the guest does not need to
> > know about them. The guest only needs the E820_ACPI, E820_NVS, E820_RESERVED to
> > get an idea of where the PCI I/O space is. Mostly.. The Linux kernel assumes that any
> > gap in the E820 is considered PCI I/O space which means that if we pass
> > in the guest 2GB, and the E820_ACPI, and its friend start at 3GB, the
> > gap between 2GB and 3GB will be considered as PCI I/O space. To guard against
> > that we also create an E820_UNUSABLE between the region of 'target_kb'
> > (called ram_end in the code) up to the first E820_[ACPI,NVS,RESERVED] region.
> > Lastly, the xc_domain_set_memory_map is called to install the new E820.
> 
> Do we need to document the requirements this places on a PV guest
> somewhere more prominent?
> 
> Phrases like "up to the first E820...." make me worry that this will
> only work on "sensible" machines, but I suppose we can cross those
> bridges when we come to them...

I did test it with non-sensible machines that have a E820 peppered with
E820_RAM amongs the E820_RESERVED. The last patches makes that work.


> 
> > When tested with another PV guest (NetBSD 5.1) the modified E820 gave
> > it no trouble. The code has also been tested with older "classic" Xen Linux
> > and with the newer "pvops" with success (SLES11, RHEL5, Ubuntu Lucid,
> > Debian Squeeze, 2.6.37, 2.6.38, 2.6.39).
> > 
> > Memory that is slack or for balloon (so 'maxmem' in guest configuration)
> > is put behind the machine E820. Which in most cases is after the 4GB.
> > 
> > The reason for doing the fetching of the E820 using the hypercall in
> > the toolstack (instead of the guest doing it) is that when a guest
> > would do a hypercall to 'XENMEM_machine_memory_map' it would
> > retrieve an E820 with I/O range caps added in. Meaning that the
> > region after 4GB up to end of possible memory would be marked as unusable
> > and the kernel would not have any space to allocate a balloon
> > region.
> > 
> > Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> > 
> > diff -r b6af9b428bb1 -r ba218fa1a48e tools/libxl/libxl.idl
> > --- a/tools/libxl/libxl.idl	Wed May 04 10:16:08 2011 -0400
> > +++ b/tools/libxl/libxl.idl	Wed May 04 10:16:08 2011 -0400
> > @@ -180,6 +180,7 @@ libxl_domain_build_info = Struct("domain
> >                                          ("cmdline", string),
> >                                          ("ramdisk", libxl_file_reference),
> >                                          ("features", string, True),
> > +                                        ("machine_e820", bool, False, "Use machine's E820 for PCI passthrough."),
> >                                          ])),
> >                   ])),
> >      ],
> > diff -r b6af9b428bb1 -r ba218fa1a48e tools/libxl/libxl_create.c
> > --- a/tools/libxl/libxl_create.c	Wed May 04 10:16:08 2011 -0400
> > +++ b/tools/libxl/libxl_create.c	Wed May 04 10:16:08 2011 -0400
> > @@ -519,6 +519,14 @@ static int do_domain_create(libxl__gc *g
> >      for (i = 0; i < d_config->num_pcidevs; i++)
> >          libxl__device_pci_add(gc, domid, &d_config->pcidevs[i], 1);
> >  
> > +    if (!d_config->c_info.hvm && d_config->b_info.u.pv.machine_e820) {
> > +        int rc = 0;
> > +        rc = libxl__e820_alloc(ctx, domid, d_config);
> 
> rc is pretty comprehensively initialised ;-0

Duh!
> 
> > +        if (rc)
> > +            LIBXL__LOG(ctx, LIBXL__LOG_WARNING,
> > +                      "Failed while collecting E820 with: %d (errno:%d)\n",
> > +                      rc, errno);
> 
> LIBXL__LOG_ERRNO takes care of logging errno for you.

OK.
> 
> > +    }
> >      if ( cb && (d_config->c_info.hvm || d_config->b_info.u.pv.bootloader )) {
> >          if ( (*cb)(ctx, domid, priv) )
> >              goto error_out;
> 
> > diff -r b6af9b428bb1 -r ba218fa1a48e tools/libxl/libxl_pci.c
> > --- a/tools/libxl/libxl_pci.c	Wed May 04 10:16:08 2011 -0400
> > +++ b/tools/libxl/libxl_pci.c	Wed May 04 10:16:08 2011 -0400
> > @@ -1047,3 +1047,157 @@ int libxl_device_pci_shutdown(libxl_ctx 
> >      free(pcidevs);
> >      return 0;
> >  }
> > +
> > +static int e820_sanitize(libxl_ctx *ctx, struct e820entry src[],
> > +                         uint32_t *nr_entries,
> > +                         unsigned long map_limitkb,
> > +                         unsigned long balloon_kb)
> > +{
> > +    uint64_t delta_kb = 0, start = 0, start_kb = 0, last = 0, ram_end;
> > +    uint32_t i, idx = 0, nr;
> > +    struct e820entry e820[E820MAX];
> > +
> > +    if (!src || !map_limitkb || !balloon_kb || !nr_entries)
> > +        return ERROR_INVAL;
> > +
> > +    nr = *nr_entries;
> > +    if (!nr)
> > +        return ERROR_INVAL;
> > +
> > +    if (nr > E820MAX)
> > +        return ERROR_NOMEM;
> > +
> > +    /* Weed out anything under 16MB */
> > +    for (i = 0; i < nr; i++) {
> > +      if (src[i].addr > 0x100000)
> 
> 0x100000 is 1MB not 16MB. I'm not sure if the code or the comment is
> correct.

Comment should have said 1MB.
> 
> > +        continue;
> > +
> > +      src[i].type = 0;
> > +      src[i].size = 0;
> > +      src[i].addr = -1ULL;
> > +    }
> > +
> > +    /* Find the lowest and highest entry in E820, skipping over
> > +     * undersired entries. */
> 
>           undesired ?
> 
> > +    start = -1ULL;
> > +    last = 0;
> > +    for (i = 0; i < nr; i++) {
> > +        if ((src[i].type == E820_RAM) ||
> > +            (src[i].type == E820_UNUSABLE) ||
> > +            (src[i].type == 0)) 
> [...]
> >     nr = idx;
> > +
> > +    for (i = 0; i < nr; i++) {
> > +      LIBXL__LOG(ctx, LIBXL__LOG_DEBUG, ":%s\t[%lx -> %lx]",
> > +	e820[i].type == E820_RAM ? "RAM " :
> > +	(e820[i].type == E820_RESERVED ? "RSV " :
> > +	 e820[i].type == E820_ACPI ? "ACPI" :
> > +         (e820[i].type == E820_NVS ? "NVS " :
> > +          (e820[i].type == E820_UNUSABLE ? "UNU " : "----"))),
> > +          e820[i].addr >> 12,
> > +         (e820[i].addr + e820[i].size) >> 12);
> 
> This screams out for a e820_type_as_string style function or a lookup
> table, or just about anything else really ;-).

<nods> I had it at some point in the patches, not sure why it got lost.
Let me add it back in.
> Also you can use "%-4s" instead of manually aligning all the strings.

/me nods.
> 
> > +    }
> > +
> > +    /* Done: copy the sanitized version. */
> > +    *nr_entries = nr;
> > +    memcpy(src, e820, nr * sizeof(struct e820entry));
> > +    return 0;
> > +}
> > +
> 
> Ian.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Re: [PATCH 0 of 3] Patches for PCI passthrough with modified E820 (v3) - resent.
  2011-05-10 14:53       ` Konrad Rzeszutek Wilk
@ 2011-05-10 15:09         ` Ian Campbell
  2011-05-10 15:27           ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 24+ messages in thread
From: Ian Campbell @ 2011-05-10 15:09 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk; +Cc: xen-devel, Ian Jackson, Stefano Stabellini

On Tue, 2011-05-10 at 15:53 +0100, Konrad Rzeszutek Wilk wrote:
>  
> > > > > This has been tested with 2.6.18 (RHEL5), 2.6.27(SLES11), 2.6.36, 2.6.37, 2.6.38,
> > > > > and 2.6.39 kernels. Also tested with PV NetBSD 5.1.
> > 
> > They all saw the full amount of RAM? Since the domain builder does not
> 
> Correct. So if the guest had 6GB passed in, roughly 3GB ended below the
> 4GB mark, and then rest - 3GB got stuck behind the 4GB.

Do you happen to know what caused that to work? I wasn't aware of any
code which moved the p2m around based on e820 in older guests (and I
can't find any right now) and I didn't think the builder itself paid any
attention to the e820 either.

Ian.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Re: [PATCH 0 of 3] Patches for PCI passthrough with modified E820 (v3) - resent.
  2011-05-10 15:09         ` Ian Campbell
@ 2011-05-10 15:27           ` Konrad Rzeszutek Wilk
  2011-05-10 15:32             ` Ian Campbell
  0 siblings, 1 reply; 24+ messages in thread
From: Konrad Rzeszutek Wilk @ 2011-05-10 15:27 UTC (permalink / raw)
  To: Ian Campbell; +Cc: xen-devel, Ian Jackson, Stefano Stabellini

On Tue, May 10, 2011 at 04:09:58PM +0100, Ian Campbell wrote:
> On Tue, 2011-05-10 at 15:53 +0100, Konrad Rzeszutek Wilk wrote:
> >  
> > > > > > This has been tested with 2.6.18 (RHEL5), 2.6.27(SLES11), 2.6.36, 2.6.37, 2.6.38,
> > > > > > and 2.6.39 kernels. Also tested with PV NetBSD 5.1.
> > > 
> > > They all saw the full amount of RAM? Since the domain builder does not
> > 
> > Correct. So if the guest had 6GB passed in, roughly 3GB ended below the
> > 4GB mark, and then rest - 3GB got stuck behind the 4GB.
> 
> Do you happen to know what caused that to work? I wasn't aware of any
> code which moved the p2m around based on e820 in older guests (and I
> can't find any right now) and I didn't think the builder itself paid any
> attention to the e820 either.

Correct. The build does not pay any attention to it. But the Xen setup code
in the older kernels (or perhaps it is the balloon code itself), does inflate
back to what was deflated. The PVOPS did not do that (argh, I didn't say
that in the earlier email - sorry about misleading you about that).

But just to make sure I am not feeding false information let me re-run this
with an RHEL5 guest and double-check.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Re: [PATCH 0 of 3] Patches for PCI passthrough with modified E820 (v3) - resent.
  2011-05-10 15:27           ` Konrad Rzeszutek Wilk
@ 2011-05-10 15:32             ` Ian Campbell
  2011-05-10 15:51               ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 24+ messages in thread
From: Ian Campbell @ 2011-05-10 15:32 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk; +Cc: xen-devel, Ian Jackson, Stefano Stabellini

On Tue, 2011-05-10 at 16:27 +0100, Konrad Rzeszutek Wilk wrote:
> On Tue, May 10, 2011 at 04:09:58PM +0100, Ian Campbell wrote:
> > On Tue, 2011-05-10 at 15:53 +0100, Konrad Rzeszutek Wilk wrote:
> > >  
> > > > > > > This has been tested with 2.6.18 (RHEL5), 2.6.27(SLES11), 2.6.36, 2.6.37, 2.6.38,
> > > > > > > and 2.6.39 kernels. Also tested with PV NetBSD 5.1.
> > > > 
> > > > They all saw the full amount of RAM? Since the domain builder does not
> > > 
> > > Correct. So if the guest had 6GB passed in, roughly 3GB ended below the
> > > 4GB mark, and then rest - 3GB got stuck behind the 4GB.
> > 
> > Do you happen to know what caused that to work? I wasn't aware of any
> > code which moved the p2m around based on e820 in older guests (and I
> > can't find any right now) and I didn't think the builder itself paid any
> > attention to the e820 either.
> 
> Correct. The build does not pay any attention to it. But the Xen setup code
> in the older kernels (or perhaps it is the balloon code itself), does inflate
> back to what was deflated.

I'm not concerned with the ballooned case, I'm thinking of the normal
"boot with all memory" case. If the builder has populated 0..4G with RAM
but the E820 says 0..3G,4G..5G (i.e. an IO hole at 3G..4G) then somebody
must move the mfns from pfn-space 3G..4G to 4G..5G, mustn't they? Or at
least deallocate the ones in 3G..4G so that ballooning will subsequently
take over.

>  The PVOPS did not do that (argh, I didn't say
> that in the earlier email - sorry about misleading you about that).
> 
> But just to make sure I am not feeding false information let me re-run this
> with an RHEL5 guest and double-check.

Thanks.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Re: [PATCH 0 of 3] Patches for PCI passthrough with modified E820 (v3) - resent.
  2011-05-10 15:32             ` Ian Campbell
@ 2011-05-10 15:51               ` Konrad Rzeszutek Wilk
  2011-05-11  7:49                 ` Ian Campbell
  0 siblings, 1 reply; 24+ messages in thread
From: Konrad Rzeszutek Wilk @ 2011-05-10 15:51 UTC (permalink / raw)
  To: Ian Campbell; +Cc: xen-devel, Ian Jackson, Stefano Stabellini

> > > > > > > > This has been tested with 2.6.18 (RHEL5), 2.6.27(SLES11), 2.6.36, 2.6.37, 2.6.38,
> > > > > > > > and 2.6.39 kernels. Also tested with PV NetBSD 5.1.
> > > > > 
> > > > > They all saw the full amount of RAM? Since the domain builder does not
> > > > 
> > > > Correct. So if the guest had 6GB passed in, roughly 3GB ended below the
> > > > 4GB mark, and then rest - 3GB got stuck behind the 4GB.
> > > 
> > > Do you happen to know what caused that to work? I wasn't aware of any
> > > code which moved the p2m around based on e820 in older guests (and I
> > > can't find any right now) and I didn't think the builder itself paid any
> > > attention to the e820 either.
> > 
> > Correct. The build does not pay any attention to it. But the Xen setup code
> > in the older kernels (or perhaps it is the balloon code itself), does inflate
> > back to what was deflated.
> 
> I'm not concerned with the ballooned case, I'm thinking of the normal
> "boot with all memory" case. If the builder has populated 0..4G with RAM
> but the E820 says 0..3G,4G..5G (i.e. an IO hole at 3G..4G) then somebody
> must move the mfns from pfn-space 3G..4G to 4G..5G, mustn't they? Or at
> least deallocate the ones in 3G..4G so that ballooning will subsequently
> take over.

Right - and in the case of RHEL5, SLES11 ..
> 
> >  The PVOPS did not do that (argh, I didn't say
> > that in the earlier email - sorry about misleading you about that).
> > 
> > But just to make sure I am not feeding false information let me re-run this
> > with an RHEL5 guest and double-check.
> 

.. the kernels do all the magic stuff. I provided 4G to the guest, with a PCI
device (and hence enabled the E820 patches), and while the bootup at the start
initially finds:

Linux version 2.6.18-194.el5xen (mockbuild@builder10.centos.org) (gcc version 4.
1.2 20080704 (Red Hat 4.1.2-48)) #1 SMP Fri Apr 2 15:34:40 EDT 2010
BIOS-provided physical RAM map:
 Xen: 0000000000000000 - 00000000bffb0000 (usable)
 Xen: 00000000bffb0000 - 00000000bffbe000 (ACPI data)
 Xen: 00000000bffbe000 - 00000000bffe0000 (ACPI NVS)
 Xen: 00000000bffe0000 - 00000000c0000000 (reserved)
 Xen: 00000000fec00000 - 00000000fec01000 (reserved)
 Xen: 00000000fee00000 - 00000000fee01000 (reserved)
 Xen: 00000000fff00000 - 0000000100000000 (reserved)
 Xen: 0000000100000000 - 0000000140850000 (usable)
..
Memory: 3026548k/5251392k available (2512k kernel code, 118176k reserved, 1396k 
..
when it finished booting, I found that 4GBs are available:
[root@g-rhel5-amd64 ~]# cat /proc/meminfo  | grep MemTotal
MemTotal:      4186112 kB

With the PVOPS one, it does almost all of the same thing except
it does not balloon back to the initial size (4GB), so the user
has to manually inflate. But that seems like a Linux kernel feature
is missing: I get the same thing when my 8GB machine boots the PVOPS kernel.
I end up with 6GB (no dom0_mem thing). If I try to 'inflate' past the 6GB it
ignores me - while if I boot with the older Xen-O-Linux I get 8GB to Dom0.

NetBSD seems to do the same thing as RHEL5, SLES11 (note, I could not actually
give it 4096MB, b/c the NetBSD unstable kernel would crash - irregardless if
a PCI device was passed in. But passing in 3900MB made it work). With 3900MB
and a PCI device (so with a modified E820), the guest told me I've
3891 MB free (the rest seem to be used by the kernel).

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Re: [PATCH 0 of 3] Patches for PCI passthrough with modified E820 (v3) - resent.
  2011-05-10 15:51               ` Konrad Rzeszutek Wilk
@ 2011-05-11  7:49                 ` Ian Campbell
  2011-05-12 17:41                   ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 24+ messages in thread
From: Ian Campbell @ 2011-05-11  7:49 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk; +Cc: xen-devel, Ian Jackson, Stefano Stabellini

On Tue, 2011-05-10 at 16:51 +0100, Konrad Rzeszutek Wilk wrote:
> > > > > > > > > This has been tested with 2.6.18 (RHEL5), 2.6.27(SLES11), 2.6.36, 2.6.37, 2.6.38,
> > > > > > > > > and 2.6.39 kernels. Also tested with PV NetBSD 5.1.
> > > > > > 
> > > > > > They all saw the full amount of RAM? Since the domain builder does not
> > > > > 
> > > > > Correct. So if the guest had 6GB passed in, roughly 3GB ended below the
> > > > > 4GB mark, and then rest - 3GB got stuck behind the 4GB.
> > > > 
> > > > Do you happen to know what caused that to work? I wasn't aware of any
> > > > code which moved the p2m around based on e820 in older guests (and I
> > > > can't find any right now) and I didn't think the builder itself paid any
> > > > attention to the e820 either.
> > > 
> > > Correct. The build does not pay any attention to it. But the Xen setup code
> > > in the older kernels (or perhaps it is the balloon code itself), does inflate
> > > back to what was deflated.
> > 
> > I'm not concerned with the ballooned case, I'm thinking of the normal
> > "boot with all memory" case. If the builder has populated 0..4G with RAM
> > but the E820 says 0..3G,4G..5G (i.e. an IO hole at 3G..4G) then somebody
> > must move the mfns from pfn-space 3G..4G to 4G..5G, mustn't they? Or at
> > least deallocate the ones in 3G..4G so that ballooning will subsequently
> > take over.
> 
> Right - and in the case of RHEL5, SLES11 ..
> > 
> > >  The PVOPS did not do that (argh, I didn't say
> > > that in the earlier email - sorry about misleading you about that).
> > > 
> > > But just to make sure I am not feeding false information let me re-run this
> > > with an RHEL5 guest and double-check.
> > 
> 
> .. the kernels do all the magic stuff. I provided 4G to the guest, with a PCI
> device (and hence enabled the E820 patches), and while the bootup at the start
> initially finds:
> 
> Linux version 2.6.18-194.el5xen (mockbuild@builder10.centos.org) (gcc version 4.
> 1.2 20080704 (Red Hat 4.1.2-48)) #1 SMP Fri Apr 2 15:34:40 EDT 2010
> BIOS-provided physical RAM map:
>  Xen: 0000000000000000 - 00000000bffb0000 (usable)
>  Xen: 00000000bffb0000 - 00000000bffbe000 (ACPI data)
>  Xen: 00000000bffbe000 - 00000000bffe0000 (ACPI NVS)
>  Xen: 00000000bffe0000 - 00000000c0000000 (reserved)
>  Xen: 00000000fec00000 - 00000000fec01000 (reserved)
>  Xen: 00000000fee00000 - 00000000fee01000 (reserved)
>  Xen: 00000000fff00000 - 0000000100000000 (reserved)
>  Xen: 0000000100000000 - 0000000140850000 (usable)
> ..
> Memory: 3026548k/5251392k available (2512k kernel code, 118176k reserved, 1396k 
> ..
> when it finished booting, I found that 4GBs are available:
> [root@g-rhel5-amd64 ~]# cat /proc/meminfo  | grep MemTotal
> MemTotal:      4186112 kB

Can you actually allocate and use (nearly) all of that 4G? (modulo
kernel allocations/overheads etc). The above doesn't really show that
the p2m above 4G is actually sane (I admit I'd be surprised if we got
away with it enough to boot if it wasn't...).

linux-2.6.18-xen.hg has a decrease_reservation in in setup_arch which
will give back pages between max_pfn and xen_start_info->nr_pages but I
don't think that will trigger for this case, will it? I don't see any
other plausible looking decrease_reservation calls...

Anyway, I guess if it works then it works and although I'm curious as to
how/why I'm not curious enough to start digging too deeply...

> With the PVOPS one, it does almost all of the same thing except
> it does not balloon back to the initial size (4GB), so the user
> has to manually inflate. But that seems like a Linux kernel feature
> is missing: I get the same thing when my 8GB machine boots the PVOPS kernel.
> I end up with 6GB (no dom0_mem thing). If I try to 'inflate' past the 6GB it
> ignores me - while if I boot with the older Xen-O-Linux I get 8GB to Dom0.
> 
> NetBSD seems to do the same thing as RHEL5, SLES11 (note, I could not actually
> give it 4096MB, b/c the NetBSD unstable kernel would crash - irregardless if
> a PCI device was passed in. But passing in 3900MB made it work). With 3900MB
> and a PCI device (so with a modified E820), the guest told me I've
> 3891 MB free (the rest seem to be used by the kernel).
> 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Re: [PATCH 0 of 3] Patches for PCI passthrough with modified E820 (v3) - resent.
  2011-05-11  7:49                 ` Ian Campbell
@ 2011-05-12 17:41                   ` Konrad Rzeszutek Wilk
  2011-05-13  8:47                     ` Ian Campbell
  0 siblings, 1 reply; 24+ messages in thread
From: Konrad Rzeszutek Wilk @ 2011-05-12 17:41 UTC (permalink / raw)
  To: Ian Campbell; +Cc: xen-devel, Ian Jackson, Stefano Stabellini

> > Linux version 2.6.18-194.el5xen (mockbuild@builder10.centos.org) (gcc version 4.
> > 1.2 20080704 (Red Hat 4.1.2-48)) #1 SMP Fri Apr 2 15:34:40 EDT 2010
> > BIOS-provided physical RAM map:
> >  Xen: 0000000000000000 - 00000000bffb0000 (usable)
> >  Xen: 00000000bffb0000 - 00000000bffbe000 (ACPI data)
> >  Xen: 00000000bffbe000 - 00000000bffe0000 (ACPI NVS)
> >  Xen: 00000000bffe0000 - 00000000c0000000 (reserved)
> >  Xen: 00000000fec00000 - 00000000fec01000 (reserved)
> >  Xen: 00000000fee00000 - 00000000fee01000 (reserved)
> >  Xen: 00000000fff00000 - 0000000100000000 (reserved)
> >  Xen: 0000000100000000 - 0000000140850000 (usable)
> > ..
> > Memory: 3026548k/5251392k available (2512k kernel code, 118176k reserved, 1396k 
> > ..
> > when it finished booting, I found that 4GBs are available:
> > [root@g-rhel5-amd64 ~]# cat /proc/meminfo  | grep MemTotal
> > MemTotal:      4186112 kB
> 
> Can you actually allocate and use (nearly) all of that 4G? (modulo
> kernel allocations/overheads etc). The above doesn't really show that

memhog 4G worked great.. but then I noticed it started slowing down and
it was using the swap disk?
> the p2m above 4G is actually sane (I admit I'd be surprised if we got
> away with it enough to boot if it wasn't...).

So your inquisitive questions did find an issue. It looks as while
top and /proc/meminfo both report 4G, they also report over 1G being
'used'.

Mem:   4186112k total,  1451684k used,  2734428k free,    13664k buffers

I think that you are spot on with the P2M between bffb0 and
10000 not being used. And the P2M's after 10000 are.

So then I played with the 'maxmem=4096', 'memory=2048' :

BIOS-provided physical RAM map:
 Xen: 0000000000000000 - 0000000080000000 (usable)
 Xen: 0000000080000000 - 00000000bffb0000 type 5
 Xen: 00000000bffb0000 - 00000000bffbe000 (ACPI data)
 Xen: 00000000bffbe000 - 00000000bffe0000 (ACPI NVS)
 Xen: 00000000bffe0000 - 00000000c0000000 (reserved)
 Xen: 00000000fec00000 - 00000000fec01000 (reserved)
 Xen: 00000000fee00000 - 00000000fee01000 (reserved)
 Xen: 00000000fff00000 - 0000000100000000 (reserved)
 Xen: 0000000100000000 - 0000000180800000 (usable)

and when I ballooned the memory up: 'xl memset  4096' it showed 
up it was using the memory above the 4G:

Mem:   4186112k total,   419964k used,  3766148k free,    13776k buffers

So much much better. And yes, memhog 4G ran with no trouble.


PVOPS is much better equiped for this: 'memory=4096':

[    0.000000] released 261886 pages of unused memory
[    0.000000] 1-1 mapping on bffb0->100000
[    0.000000] Set 262224 page(s) to 1-1 mapping.
[    0.000000] BIOS-provided physical RAM map:
[    0.000000]  Xen: 0000000000000000 - 00000000000a0000 (usable)
[    0.000000]  Xen: 00000000000a0000 - 0000000000100000 (reserved)
[    0.000000]  Xen: 0000000000100000 - 00000000bffb0000 (usable)
[    0.000000]  Xen: 00000000bffb0000 - 00000000bffbe000 (ACPI data)
[    0.000000]  Xen: 00000000bffbe000 - 00000000bffe0000 (ACPI NVS)
[    0.000000]  Xen: 00000000bffe0000 - 00000000c0000000 (reserved)
[    0.000000]  Xen: 00000000fec00000 - 00000000fec01000 (reserved)
[    0.000000]  Xen: 00000000fee00000 - 00000000fee01000 (reserved)
[    0.000000]  Xen: 00000000fff00000 - 0000000100000000 (reserved)
[    0.000000]  Xen: 0000000100000000 - 000000018074e000 (usable)
..
[    0.000000] Memory: 2535068k/6298936k available (4653k kernel code, 1049344k absent, 2714524k reserved, 4034k data, 716k init)

top shows:
Mem:   2978632k total,   514812k used,  2463820k free,        0k buffers

and dom0:
konrad@phenom: sudo xl list
Name                                        ID   Mem VCPUs      State   Time(s)
Domain-0                                     0  2718     6     r-----  633351.2
latest                                      20  3073     1     -b----       4.1


and after ballonning:
Mem:   4027200k total,   518884k used,  3508316k free,        0k buffers

konrad@phenom:~$ sudo xl list
Name                                        ID   Mem VCPUs      State   Time(s)
Domain-0                                     0  2718     6     r-----  633385.5
latest                                      20  4097     1     -b----       7.3

(thought it was curious, the balloon driver thought the target_kb was 4G
and not 3G (which is what 'xl list' showed). Giving a big number (I did 9G, but
probably 5G would have done) to ./devices/system/xen_memory/xen_memory0/target_kb
made it balloon up.

Looks like there is some need for fixes in the balloon accounting code.

Anyhow, seems that if you are using RHEL5, SLES11, you need to be carefull to
use 'memory' and 'maxmem'. With the PVOPS, need to balloon up and you are OK.

Thought I do want to see about writting the code that would automatically balloon
back to the amount of memory that was deflated.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Re: [PATCH 0 of 3] Patches for PCI passthrough with modified E820 (v3) - resent.
  2011-05-12 17:41                   ` Konrad Rzeszutek Wilk
@ 2011-05-13  8:47                     ` Ian Campbell
  2011-05-13 13:57                       ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 24+ messages in thread
From: Ian Campbell @ 2011-05-13  8:47 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk; +Cc: xen-devel, Ian Jackson, Stefano Stabellini

On Thu, 2011-05-12 at 18:41 +0100, Konrad Rzeszutek Wilk wrote: 
> > Can you actually allocate and use (nearly) all of that 4G? (modulo
> > kernel allocations/overheads etc). The above doesn't really show that
> 
> memhog 4G worked great.. but then I noticed it started slowing down and
> it was using the swap disk?

I guess the I/O holes shadowed the RAM and hence it is basically wasted.

[...]
> Anyhow, seems that if you are using RHEL5, SLES11, you need to be carefull to
> use 'memory' and 'maxmem'.

Hrm, changing behaviour for existing guests isn't so nice, at least not
without a way to turn the behaviour off, perhaps we do need an explicit
cfg file variable to control this after all?

>  With the PVOPS, need to balloon up and you are OK.
> 
> Thought I do want to see about writting the code that would automatically balloon
> back to the amount of memory that was deflated.

I wonder if just writing the correct balloon target to xenstore while
building the guest would be sufficient for the guest to balloon up once
it's up?

Ian.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Re: [PATCH 0 of 3] Patches for PCI passthrough with modified E820 (v3) - resent.
  2011-05-13  8:47                     ` Ian Campbell
@ 2011-05-13 13:57                       ` Konrad Rzeszutek Wilk
  2011-05-17 16:02                         ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 24+ messages in thread
From: Konrad Rzeszutek Wilk @ 2011-05-13 13:57 UTC (permalink / raw)
  To: Ian Campbell; +Cc: xen-devel, Ian Jackson, Stefano Stabellini

> > memhog 4G worked great.. but then I noticed it started slowing down and
> > it was using the swap disk?
> 
> I guess the I/O holes shadowed the RAM and hence it is basically wasted.

<nods>
> > Anyhow, seems that if you are using RHEL5, SLES11, you need to be carefull to
> > use 'memory' and 'maxmem'.
> 
> Hrm, changing behaviour for existing guests isn't so nice, at least not
> without a way to turn the behaviour off, perhaps we do need an explicit
> cfg file variable to control this after all?

We could do that, and then once your idea below has been completly working
we can rip out the parameter?
> 
> >  With the PVOPS, need to balloon up and you are OK.
> > 
> > Thought I do want to see about writting the code that would automatically balloon
> > back to the amount of memory that was deflated.
> 
> I wonder if just writing the correct balloon target to xenstore while
> building the guest would be sufficient for the guest to balloon up once
> it's up?

Yeah, that way we can also fix the RHEL5, SLES11 ones too. Just subtract the
delta from the 'memory', put it in a variable ('target_now'?) that will be written
to the 'target' XenStore key (not sure if that was the correct name). However it also
implies that we need to do some extra steps with the P2M allocation _before_ we play
with the 'memory' value as the P2M allocation kicks of the populate_physmap and
we don't want it to shadow the I/O holes.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2 of 3] libxl: Add support for passing in the machine's E820 for PCI passthrough
  2011-05-04 14:17 ` [PATCH 2 of 3] libxl: Add support for passing in the machine's E820 for PCI passthrough Konrad Rzeszutek Wilk
  2011-05-10  8:29   ` Ian Campbell
@ 2011-05-17 15:10   ` Ian Jackson
  2011-05-17 16:00     ` Konrad Rzeszutek Wilk
  1 sibling, 1 reply; 24+ messages in thread
From: Ian Jackson @ 2011-05-17 15:10 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk; +Cc: Ian Campbell, xen-devel, Stabellini

Sorry for the delay replying.  But:

Konrad Rzeszutek Wilk writes ("[PATCH 2 of 3] libxl: Add support for passing in the machine's E820 for PCI passthrough"):
> diff -r b6af9b428bb1 -r ba218fa1a48e tools/libxl/xl_cmdimpl.c
...
> @@ -994,6 +995,8 @@ skip_vfb:
>              if (!libxl_device_pci_parse_bdf(ctx, pcidev, buf))
>                  d_config->num_pcidevs++;
>          }
> +        if (d_config->num_pcidevs && !c_info->hvm)
> +          b_info->u.pv.machine_e820 = true;
>      }

This doesn't seem to provide a way to force this behaviour on, which
is (I assume) necessary if PCI devices are to be hotplugged later ?

Ian.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2 of 3] libxl: Add support for passing in the machine's E820 for PCI passthrough
  2011-05-17 15:10   ` Ian Jackson
@ 2011-05-17 16:00     ` Konrad Rzeszutek Wilk
  2011-05-17 16:05       ` Ian Campbell
  0 siblings, 1 reply; 24+ messages in thread
From: Konrad Rzeszutek Wilk @ 2011-05-17 16:00 UTC (permalink / raw)
  To: Ian Jackson; +Cc: Ian Campbell, xen-devel, Stefano Stabellini

On Tue, May 17, 2011 at 04:10:54PM +0100, Ian Jackson wrote:
> Sorry for the delay replying.  But:
> 
> Konrad Rzeszutek Wilk writes ("[PATCH 2 of 3] libxl: Add support for passing in the machine's E820 for PCI passthrough"):
> > diff -r b6af9b428bb1 -r ba218fa1a48e tools/libxl/xl_cmdimpl.c
> ...
> > @@ -994,6 +995,8 @@ skip_vfb:
> >              if (!libxl_device_pci_parse_bdf(ctx, pcidev, buf))
> >                  d_config->num_pcidevs++;
> >          }
> > +        if (d_config->num_pcidevs && !c_info->hvm)
> > +          b_info->u.pv.machine_e820 = true;
> >      }
> 
> This doesn't seem to provide a way to force this behaviour on, which
> is (I assume) necessary if PCI devices are to be hotplugged later ?

Right. We never resolved that question. One idea was that if you had
'pci=[]' it would trigger this. But that just seemed wrong.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Re: [PATCH 0 of 3] Patches for PCI passthrough with modified E820 (v3) - resent.
  2011-05-13 13:57                       ` Konrad Rzeszutek Wilk
@ 2011-05-17 16:02                         ` Konrad Rzeszutek Wilk
  2011-05-17 16:07                           ` Ian Campbell
  0 siblings, 1 reply; 24+ messages in thread
From: Konrad Rzeszutek Wilk @ 2011-05-17 16:02 UTC (permalink / raw)
  To: Ian Campbell; +Cc: xen-devel, Ian Jackson, Stefano Stabellini

On Fri, May 13, 2011 at 09:57:08AM -0400, Konrad Rzeszutek Wilk wrote:
> > > memhog 4G worked great.. but then I noticed it started slowing down and
> > > it was using the swap disk?
> > 
> > I guess the I/O holes shadowed the RAM and hence it is basically wasted.
> 
> <nods>
> > > Anyhow, seems that if you are using RHEL5, SLES11, you need to be carefull to
> > > use 'memory' and 'maxmem'.
> > 
> > Hrm, changing behaviour for existing guests isn't so nice, at least not
> > without a way to turn the behaviour off, perhaps we do need an explicit
> > cfg file variable to control this after all?
> 
> We could do that, and then once your idea below has been completly working
> we can rip out the parameter?

How does this patch look to your eyes:

# HG changeset patch
# Parent c6fa04014d6e99ca4e62d04132180338403c0478
libxl: Add 'e820_host' option to config file.

.. which will be removed once the auto-ballooning of guests
with PCI devices works. During testing of the patches which provide
a host E820 in a PV guest, certain inconsistencies were found with
guests. When launching a RHEL5 or SLES11 PV guest with 4GB and a PCI device,
the kernel would report 4GB, but have 1.5G "used". What happend was that
the P2M that fall within the E820 I/O holes would never be used and was just
wasted. The mechanism to go around this is to shrink the size of the guest
before launch (say memory=2048, maxmem=4096) and then balloon back to 4096M
after start. For PVOPS type kernels it would detect the E820 I/O holes and
deflate by the correct amount but would not inflate back to 4GB.
Manually inflating makes it work.

The fix in the future for guests where the memory amount flows over the
PCI hole, is to launch the guest with decreased amount right up to the cusp
of where the E820 PCI hole starts. Also increase the 'maxmem' by the delta
and then when the guest has launched, balloon up to the delta number.

This will require some careful surgery so for right now this parameter
will guard against unsuspecting users seeing their PV guests memory "vanish."

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

diff -r c6fa04014d6e tools/libxl/xl_cmdimpl.c
--- a/tools/libxl/xl_cmdimpl.c	Tue May 17 10:33:27 2011 -0400
+++ b/tools/libxl/xl_cmdimpl.c	Tue May 17 11:03:44 2011 -0400
@@ -626,6 +626,7 @@ static void parse_config_data(const char
     XLU_ConfigList *vbds, *nics, *pcis, *cvfbs, *cpuids;
     int pci_power_mgmt = 0;
     int pci_msitranslate = 1;
+    int e820_host_override = 0;
     int e;
 
     libxl_domain_create_info *c_info = &d_config->c_info;
@@ -979,6 +980,10 @@ skip_vfb:
     if (!xlu_cfg_get_long (config, "pci_power_mgmt", &l))
         pci_power_mgmt = l;
 
+    /* To be removed once the auto ballooning after guest starts is done. */
+    if (!xlu_cfg_get_long (config, "e820_host", &l))
+        e820_host_override = l;
+
     if (!xlu_cfg_get_list (config, "pci", &pcis, 0, 0)) {
         int i;
         d_config->num_pcidevs = 0;
@@ -995,7 +1000,7 @@ skip_vfb:
             if (!libxl_device_pci_parse_bdf(ctx, pcidev, buf))
                 d_config->num_pcidevs++;
         }
-        if (d_config->num_pcidevs && !c_info->hvm)
+        if (d_config->num_pcidevs && !c_info->hvm && e820_host_override)
           b_info->u.pv.machine_e820 = true;
     }

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2 of 3] libxl: Add support for passing in the machine's E820 for PCI passthrough
  2011-05-17 16:00     ` Konrad Rzeszutek Wilk
@ 2011-05-17 16:05       ` Ian Campbell
  0 siblings, 0 replies; 24+ messages in thread
From: Ian Campbell @ 2011-05-17 16:05 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk; +Cc: xen-devel, Ian Jackson, Stefano Stabellini

On Tue, 2011-05-17 at 17:00 +0100, Konrad Rzeszutek Wilk wrote:
> On Tue, May 17, 2011 at 04:10:54PM +0100, Ian Jackson wrote:
> > Sorry for the delay replying.  But:
> > 
> > Konrad Rzeszutek Wilk writes ("[PATCH 2 of 3] libxl: Add support for passing in the machine's E820 for PCI passthrough"):
> > > diff -r b6af9b428bb1 -r ba218fa1a48e tools/libxl/xl_cmdimpl.c
> > ...
> > > @@ -994,6 +995,8 @@ skip_vfb:
> > >              if (!libxl_device_pci_parse_bdf(ctx, pcidev, buf))
> > >                  d_config->num_pcidevs++;
> > >          }
> > > +        if (d_config->num_pcidevs && !c_info->hvm)
> > > +          b_info->u.pv.machine_e820 = true;
> > >      }
> > 
> > This doesn't seem to provide a way to force this behaviour on, which
> > is (I assume) necessary if PCI devices are to be hotplugged later ?
> 
> Right. We never resolved that question. One idea was that if you had
> 'pci=[]' it would trigger this. But that just seemed wrong.

The e820_host option which you just posted suits this purpose (so long
as we decide to make it permanent), doesn't it?

Ian.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Re: [PATCH 0 of 3] Patches for PCI passthrough with modified E820 (v3) - resent.
  2011-05-17 16:02                         ` Konrad Rzeszutek Wilk
@ 2011-05-17 16:07                           ` Ian Campbell
  2011-05-17 16:34                             ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 24+ messages in thread
From: Ian Campbell @ 2011-05-17 16:07 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk; +Cc: xen-devel, Ian Jackson, Stefano Stabellini

On Tue, 2011-05-17 at 17:02 +0100, Konrad Rzeszutek Wilk wrote:
> On Fri, May 13, 2011 at 09:57:08AM -0400, Konrad Rzeszutek Wilk wrote:
> > > > memhog 4G worked great.. but then I noticed it started slowing down and
> > > > it was using the swap disk?
> > > 
> > > I guess the I/O holes shadowed the RAM and hence it is basically wasted.
> > 
> > <nods>
> > > > Anyhow, seems that if you are using RHEL5, SLES11, you need to be carefull to
> > > > use 'memory' and 'maxmem'.
> > > 
> > > Hrm, changing behaviour for existing guests isn't so nice, at least not
> > > without a way to turn the behaviour off, perhaps we do need an explicit
> > > cfg file variable to control this after all?
> > 
> > We could do that, and then once your idea below has been completly working
> > we can rip out the parameter?
> 
> How does this patch look to your eyes:

Looks ok to me.

We've been using the _override suffix for the cfg visible symbol, not
just the internal variables, so if we think this is something the user
typically should not touch then we should call it e820_host_override in
the cfg file too. Although see my earlier comment about this option also
enabling hotplug -- perhaps this is an option user will want to care
about in the long run?

Ian.

> 
> # HG changeset patch
> # Parent c6fa04014d6e99ca4e62d04132180338403c0478
> libxl: Add 'e820_host' option to config file.
> 
> .. which will be removed once the auto-ballooning of guests
> with PCI devices works. During testing of the patches which provide
> a host E820 in a PV guest, certain inconsistencies were found with
> guests. When launching a RHEL5 or SLES11 PV guest with 4GB and a PCI device,
> the kernel would report 4GB, but have 1.5G "used". What happend was that
> the P2M that fall within the E820 I/O holes would never be used and was just
> wasted. The mechanism to go around this is to shrink the size of the guest
> before launch (say memory=2048, maxmem=4096) and then balloon back to 4096M
> after start. For PVOPS type kernels it would detect the E820 I/O holes and
> deflate by the correct amount but would not inflate back to 4GB.
> Manually inflating makes it work.
> 
> The fix in the future for guests where the memory amount flows over the
> PCI hole, is to launch the guest with decreased amount right up to the cusp
> of where the E820 PCI hole starts. Also increase the 'maxmem' by the delta
> and then when the guest has launched, balloon up to the delta number.
> 
> This will require some careful surgery so for right now this parameter
> will guard against unsuspecting users seeing their PV guests memory "vanish."
> 
> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> 
> diff -r c6fa04014d6e tools/libxl/xl_cmdimpl.c
> --- a/tools/libxl/xl_cmdimpl.c	Tue May 17 10:33:27 2011 -0400
> +++ b/tools/libxl/xl_cmdimpl.c	Tue May 17 11:03:44 2011 -0400
> @@ -626,6 +626,7 @@ static void parse_config_data(const char
>      XLU_ConfigList *vbds, *nics, *pcis, *cvfbs, *cpuids;
>      int pci_power_mgmt = 0;
>      int pci_msitranslate = 1;
> +    int e820_host_override = 0;
>      int e;
>  
>      libxl_domain_create_info *c_info = &d_config->c_info;
> @@ -979,6 +980,10 @@ skip_vfb:
>      if (!xlu_cfg_get_long (config, "pci_power_mgmt", &l))
>          pci_power_mgmt = l;
>  
> +    /* To be removed once the auto ballooning after guest starts is done. */
> +    if (!xlu_cfg_get_long (config, "e820_host", &l))
> +        e820_host_override = l;
> +
>      if (!xlu_cfg_get_list (config, "pci", &pcis, 0, 0)) {
>          int i;
>          d_config->num_pcidevs = 0;
> @@ -995,7 +1000,7 @@ skip_vfb:
>              if (!libxl_device_pci_parse_bdf(ctx, pcidev, buf))
>                  d_config->num_pcidevs++;
>          }
> -        if (d_config->num_pcidevs && !c_info->hvm)
> +        if (d_config->num_pcidevs && !c_info->hvm && e820_host_override)
>            b_info->u.pv.machine_e820 = true;
>      }
>  

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Re: [PATCH 0 of 3] Patches for PCI passthrough with modified E820 (v3) - resent.
  2011-05-17 16:07                           ` Ian Campbell
@ 2011-05-17 16:34                             ` Konrad Rzeszutek Wilk
  0 siblings, 0 replies; 24+ messages in thread
From: Konrad Rzeszutek Wilk @ 2011-05-17 16:34 UTC (permalink / raw)
  To: Ian Campbell; +Cc: xen-devel, Ian Jackson, Stefano Stabellini

On Tue, May 17, 2011 at 05:07:59PM +0100, Ian Campbell wrote:
> On Tue, 2011-05-17 at 17:02 +0100, Konrad Rzeszutek Wilk wrote:
> > On Fri, May 13, 2011 at 09:57:08AM -0400, Konrad Rzeszutek Wilk wrote:
> > > > > memhog 4G worked great.. but then I noticed it started slowing down and
> > > > > it was using the swap disk?
> > > > 
> > > > I guess the I/O holes shadowed the RAM and hence it is basically wasted.
> > > 
> > > <nods>
> > > > > Anyhow, seems that if you are using RHEL5, SLES11, you need to be carefull to
> > > > > use 'memory' and 'maxmem'.
> > > > 
> > > > Hrm, changing behaviour for existing guests isn't so nice, at least not
> > > > without a way to turn the behaviour off, perhaps we do need an explicit
> > > > cfg file variable to control this after all?
> > > 
> > > We could do that, and then once your idea below has been completly working
> > > we can rip out the parameter?
> > 
> > How does this patch look to your eyes:
> 
> Looks ok to me.
> 
> We've been using the _override suffix for the cfg visible symbol, not
> just the internal variables, so if we think this is something the user
> typically should not touch then we should call it e820_host_override in
> the cfg file too. Although see my earlier comment about this option also

Ok.
> enabling hotplug -- perhaps this is an option user will want to care
> about in the long run?

In which case we should decide on a good name since it will stay with us
forever. Perhaps just e820_host and drop the override? And do something like this:

# HG changeset patch
# Parent c6fa04014d6e99ca4e62d04132180338403c0478
libxl: Add 'e820_host' option to config file.

.. which will allow PV guests to see the host's E820. Previously
this was latched of the config having an entry in the 'pci' option.
But during testing of the patches which provide a host E820 in a PV guest,
certain inconsistencies were found with guests. When launching a RHEL5 or
SLES11 PV guest with 4GB and a PCI device, the kernel would report 4GB,
but have 1.5G "used". What happend was that the P2M that fall within the
E820 I/O holes would never be used and was just wasted. The mechanism to
go around this is to shrink the size of the guest
before launch (say memory=2048, maxmem=4096) and then balloon back to 4096M
after start. For PVOPS type kernels it would detect the E820 I/O holes and
deflate by the correct amount but would not inflate back to 4GB.
Manually inflating makes it work.

The fix in the future for guests where the memory amount flows over the
PCI hole, is to launch the guest with decreased amount right up to the cusp
of where the E820 PCI hole starts. Also increase the 'maxmem' by the delta
and then when the guest has launched, balloon up to the delta number.

This will require some careful surgery so for right now this parameter
will guard against unsuspecting users seeing their PV guests memory "vanish."

In the future, this option will remain (so that PCI hotplugging
can be done), and turn itself on when there is a 'pci' option.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

diff -r c6fa04014d6e tools/libxl/xl_cmdimpl.c
--- a/tools/libxl/xl_cmdimpl.c	Tue May 17 10:33:27 2011 -0400
+++ b/tools/libxl/xl_cmdimpl.c	Tue May 17 12:30:38 2011 -0400
@@ -979,6 +979,16 @@ skip_vfb:
     if (!xlu_cfg_get_long (config, "pci_power_mgmt", &l))
         pci_power_mgmt = l;
 
+    /* To be reworked (automatically enabled) once the auto ballooning
+     * after guest starts is done (with PCI devices passed in). */
+    if (!xlu_cfg_get_long (config, "e820_host", &l)) {
+        if (c_info->hvm)
+          fprintf(stderr, "Can't do e820_host in HVM mode!");
+        else {
+          if (l)
+            b_info->u.pv.machine_e820 = true;
+        }
+    }
     if (!xlu_cfg_get_list (config, "pci", &pcis, 0, 0)) {
         int i;
         d_config->num_pcidevs = 0;
@@ -995,8 +1005,6 @@ skip_vfb:
             if (!libxl_device_pci_parse_bdf(ctx, pcidev, buf))
                 d_config->num_pcidevs++;
         }
-        if (d_config->num_pcidevs && !c_info->hvm)
-          b_info->u.pv.machine_e820 = true;
     }
 
     switch (xlu_cfg_get_list(config, "cpuid", &cpuids, 0, 1)) {

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2011-05-17 16:34 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-05-04 14:17 [PATCH 0 of 3] Patches for PCI passthrough with modified E820 (v3) - resent Konrad Rzeszutek Wilk
2011-05-04 14:17 ` [PATCH 1 of 3] tools: Add xc_domain_set_memory_map and xc_get_machine_memory_map calls (x86, amd64 only) Konrad Rzeszutek Wilk
2011-05-04 14:17 ` [PATCH 2 of 3] libxl: Add support for passing in the machine's E820 for PCI passthrough Konrad Rzeszutek Wilk
2011-05-10  8:29   ` Ian Campbell
2011-05-10 14:56     ` Konrad Rzeszutek Wilk
2011-05-17 15:10   ` Ian Jackson
2011-05-17 16:00     ` Konrad Rzeszutek Wilk
2011-05-17 16:05       ` Ian Campbell
2011-05-04 14:17 ` [PATCH 3 of 3] libxl: Convert E820_UNUSABLE and E820_RAM to E820_UNUSABLE as appropriate Konrad Rzeszutek Wilk
2011-05-09  9:00 ` [PATCH 0 of 3] Patches for PCI passthrough with modified E820 (v3) - resent Ian Campbell
2011-05-09 21:01   ` Konrad Rzeszutek Wilk
2011-05-10  8:30     ` Ian Campbell
2011-05-10 14:53       ` Konrad Rzeszutek Wilk
2011-05-10 15:09         ` Ian Campbell
2011-05-10 15:27           ` Konrad Rzeszutek Wilk
2011-05-10 15:32             ` Ian Campbell
2011-05-10 15:51               ` Konrad Rzeszutek Wilk
2011-05-11  7:49                 ` Ian Campbell
2011-05-12 17:41                   ` Konrad Rzeszutek Wilk
2011-05-13  8:47                     ` Ian Campbell
2011-05-13 13:57                       ` Konrad Rzeszutek Wilk
2011-05-17 16:02                         ` Konrad Rzeszutek Wilk
2011-05-17 16:07                           ` Ian Campbell
2011-05-17 16:34                             ` Konrad Rzeszutek Wilk

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.