All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC Patch] Support for making an E820 PCI hole in toolstack (xl + xm)
@ 2010-11-12 23:08 Konrad Rzeszutek Wilk
  2010-11-12 23:16 ` Konrad Rzeszutek Wilk
                   ` (2 more replies)
  0 siblings, 3 replies; 25+ messages in thread
From: Konrad Rzeszutek Wilk @ 2010-11-12 23:08 UTC (permalink / raw)
  To: xen-devel, stefano.stabellini, gianni.tedesco
  Cc: Jeremy Fitzhardinge, bruce.edge

Hey guys,

Attached is an RFC patch for making a PCI hole in the PV guests. This allows
PV guests(*) with 4GB or more to now properly work with or without
PCI passthrough cards.

Previously the Linux kernel would not be able to allocate the PCI region
underneath the 4GB region as that region was all System RAM. And you would see
this:

[    0.000000] PM: Registered nosave memory: 00000000000a0000 - 0000000000100000
[    0.000000] PCI: Warning: Cannot find a gap in the 32bit address range
[    0.000000] PCI: Unassigned devices with 32bit resource registers may break!
[    0.000000] Allocating PCI resources starting at 100100000 (gap: 100100000:400000)


This patchset punches an PCI hole in the E820 region and as well fills the P2M properly,
so that now you can see (*):
[    0.000000] Allocating PCI resources starting at a0000000 (gap: a0000000:60000000)

It adds a new option to guest config file, which is "pci_hole". The user can
specify the PFN number, such as '0xc0000' or in case of using the xl, '1' which
will automatically figure out the start of the PCI address.

*: This option requires support in the Linux kernel to actually deal with two 
entries in the E820 map and P2M space filled with ~0. 


The patches (draft, not ready for upstream) for the Linux kernel to support this are
available at:

git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen.git devel/e820-hole

All of these patches make the E820 of the Linux guest with 4GB (or more) passed
look like this (2.6.37-rc1+devel/e820-hole):
[    0.000000]  Xen: 0000000000000000 - 00000000000a0000 (usable)
[    0.000000]  Xen: 00000000000a0000 - 0000000000100000 (reserved)
[    0.000000]  Xen: 0000000000100000 - 00000000a0000000 (usable)
[    0.000000]  Xen: 0000000100000000 - 0000000160800000 (usable)

compared to (2.6.36)
[    0.000000]  Xen: 0000000000000000 - 00000000000a0000 (usable)
[    0.000000]  Xen: 00000000000a0000 - 0000000000100000 (reserved)
[    0.000000]  Xen: 0000000000100000 - 0000000100000000 (usable)

and (2.6.37-rc1):
[    0.000000]  Xen: 0000000000000000 - 00000000000a0000 (usable)
[    0.000000]  Xen: 00000000000a0000 - 0000000000100000 (reserved)
[    0.000000]  Xen: 0000000000100000 - 0000000100800000 (usable)

In regards to the patches that I am attaching here, what is the magic incantention
to make the indentation/StyleGuide proper for the tools/libxc directory? The tab spacing
is off a bit (I think).

I've tested this so far only on 64-bit guests, and I am quite sure the tool-stack
needs some extra care for the 32-bit guests..

But please take a look and give feedback.

diff --git a/tools/libxc/xc_dom.h b/tools/libxc/xc_dom.h
--- a/tools/libxc/xc_dom.h
+++ b/tools/libxc/xc_dom.h
@@ -91,6 +91,8 @@ struct xc_dom_image {
 
     /* physical memory */
     xen_pfn_t total_pages;
+    /* start of the pci_hole. goes up to 4gb */
+    xen_pfn_t pci_hole;
     struct xc_dom_phys *phys_pages;
     int realmodearea_log;
 
diff --git a/tools/libxc/xc_dom_core.c b/tools/libxc/xc_dom_core.c
--- a/tools/libxc/xc_dom_core.c
+++ b/tools/libxc/xc_dom_core.c
@@ -715,17 +715,22 @@ int xc_dom_update_guest_p2m(struct xc_do
     uint32_t *p2m_32;
     uint64_t *p2m_64;
     xen_pfn_t i;
+    size_t tot_pages;
 
     if ( !dom->p2m_guest )
         return 0;
 
+    tot_pages = dom->total_pages;
+    if (dom->pci_hole)
+         tot_pages += (0x100000 - dom->pci_hole);
+
     switch ( dom->arch_hooks->sizeof_pfn )
     {
     case 4:
         DOMPRINTF("%s: dst 32bit, pages 0x%" PRIpfn "",
-                  __FUNCTION__, dom->total_pages);
+                  __FUNCTION__, tot_pages);
         p2m_32 = dom->p2m_guest;
-        for ( i = 0; i < dom->total_pages; i++ )
+        for ( i = 0; i < tot_pages; i++ )
             if ( dom->p2m_host[i] != INVALID_P2M_ENTRY )
                 p2m_32[i] = dom->p2m_host[i];
             else
@@ -733,9 +738,9 @@ int xc_dom_update_guest_p2m(struct xc_do
         break;
     case 8:
         DOMPRINTF("%s: dst 64bit, pages 0x%" PRIpfn "",
-                  __FUNCTION__, dom->total_pages);
+                  __FUNCTION__, tot_pages);
         p2m_64 = dom->p2m_guest;
-        for ( i = 0; i < dom->total_pages; i++ )
+        for ( i = 0; i < tot_pages; i++ )
             if ( dom->p2m_host[i] != INVALID_P2M_ENTRY )
                 p2m_64[i] = dom->p2m_host[i];
             else
diff --git a/tools/libxc/xc_dom_x86.c b/tools/libxc/xc_dom_x86.c
--- a/tools/libxc/xc_dom_x86.c
+++ b/tools/libxc/xc_dom_x86.c
@@ -406,6 +406,15 @@ static int alloc_magic_pages(struct xc_d
 {
     size_t p2m_size = dom->total_pages * dom->arch_hooks->sizeof_pfn;
 
+    if (dom->pci_hole && (dom->total_pages > dom->pci_hole))
+    {
+	size_t p2m_pci_hole_size = (0x100000 - dom->pci_hole) *
+				  dom->arch_hooks->sizeof_pfn;
+
+        DOMPRINTF("%s: Expanding P2M to include PCI hole (%ld->%ld)\n",
+		__FUNCTION__, p2m_size, p2m_size + p2m_pci_hole_size);
+	p2m_size += p2m_pci_hole_size;
+    }
     /* allocate phys2mach table */
     if ( xc_dom_alloc_segment(dom, &dom->p2m_seg, "phys2mach", 0, p2m_size) )
         return -1;
@@ -712,6 +721,7 @@ int arch_setup_meminit(struct xc_dom_ima
 {
     int rc;
     xen_pfn_t pfn, allocsz, i, j, mfn;
+    size_t p2m_size;
 
     rc = x86_compat(dom->xch, dom->guest_domid, dom->guest_type);
     if ( rc )
@@ -723,8 +733,13 @@ int arch_setup_meminit(struct xc_dom_ima
         if ( rc )
             return rc;
     }
+    p2m_size = dom->total_pages;
 
-    dom->p2m_host = xc_dom_malloc(dom, sizeof(xen_pfn_t) * dom->total_pages);
+    if (dom->pci_hole && (dom->total_pages > dom->pci_hole))
+		p2m_size += (0x100000 - dom->pci_hole);
+
+    DOMPRINTF("Allocating %ld bytes for P2M", p2m_size * sizeof(xen_pfn_t));
+    dom->p2m_host = xc_dom_malloc(dom, sizeof(xen_pfn_t) * p2m_size);
     if ( dom->superpages )
     {
         int count = dom->total_pages >> SUPERPAGE_PFN_SHIFT;
@@ -750,21 +765,66 @@ int arch_setup_meminit(struct xc_dom_ima
     }
     else
     {
-        /* setup initial p2m */
-        for ( pfn = 0; pfn < dom->total_pages; pfn++ )
-            dom->p2m_host[pfn] = pfn;
-        
-        /* allocate guest memory */
-        for ( i = rc = allocsz = 0;
-              (i < dom->total_pages) && !rc;
-              i += allocsz )
+	/* for PCI mapping, stick INVALID_MFN in the PCI_HOLE */
+        if ( dom->pci_hole && (dom->total_pages > dom->pci_hole) )
         {
-            allocsz = dom->total_pages - i;
-            if ( allocsz > 1024*1024 )
-                allocsz = 1024*1024;
-            rc = xc_domain_populate_physmap_exact(
-                dom->xch, dom->guest_domid, allocsz,
-                0, 0, &dom->p2m_host[i]);
+            /* setup initial p2m in three passes. */
+            for (pfn = 0; pfn < dom->pci_hole; pfn++)
+              dom->p2m_host[pfn] = pfn;
+
+            xc_dom_printf (dom->xch, "%s: 0x0->0x%lx has PFNs.", __FUNCTION__, pfn);
+            xc_dom_printf (dom->xch, "%s: 0x%lx -> 0x%x has INVALID_MFN",
+                          __FUNCTION__, pfn, 0x100000);
+            for (; pfn < 0x100000; pfn++)
+              dom->p2m_host[pfn] = INVALID_MFN;
+
+            for (; pfn < 0x100000 + dom->total_pages - dom->pci_hole; pfn++)
+              dom->p2m_host[pfn] = pfn;
+            xc_dom_printf (dom->xch, "%s: 0x%x -> 0x%lx has PFNs.", __FUNCTION__,
+                           0x100000, pfn);
+
+            /* allocate guest memory in two passes. */
+            for (i = rc = allocsz = 0; (i < dom->pci_hole) && !rc; i += allocsz)
+            {
+              allocsz = dom->pci_hole - i;
+              xc_dom_printf (dom->xch, "%s: Populating M2P 0x%lx->0x%lx",
+                             __FUNCTION__, i, i + allocsz);
+              rc = xc_domain_populate_physmap_exact (dom->xch, dom->guest_domid,
+                                 allocsz, 0, 0,
+                                 &dom->p2m_host[i]);
+            }
+            for (i = 0x100000, allocsz = rc = 0;
+                 (i < (0x100000 + dom->total_pages - dom->pci_hole))
+                  && !rc; i += allocsz)
+            {
+              allocsz = (dom->total_pages - dom->pci_hole) - (i - 0x100000);
+              if (allocsz > 1024 * 1024)
+                allocsz = 1024 * 1024;
+              xc_dom_printf (dom->xch, "%s: Populating M2P 0x%lx->0x%lx",
+                             __FUNCTION__, i, i + allocsz);
+              rc = xc_domain_populate_physmap_exact (dom->xch, dom->guest_domid,
+                                                      allocsz, 0, 0,
+                                                      &dom->p2m_host[i]);
+            }
+            xc_dom_printf (dom->xch, "%s: Done with PCI populate physmap",
+                          __FUNCTION__);
+        } else {
+                /* setup initial p2m */
+                for ( pfn = 0; pfn < dom->total_pages; pfn++ )
+                    dom->p2m_host[pfn] = pfn;
+                
+                /* allocate guest memory */
+                for ( i = rc = allocsz = 0;
+                      (i < dom->total_pages) && !rc;
+                      i += allocsz )
+                {
+                    allocsz = dom->total_pages - i;
+                    if ( allocsz > 1024*1024 )
+                        allocsz = 1024*1024;
+                    rc = xc_domain_populate_physmap_exact(
+                        dom->xch, dom->guest_domid, allocsz,
+                        0, 0, &dom->p2m_host[i]);
+                }
         }
     }
 
diff --git a/tools/libxc/xc_domain.c b/tools/libxc/xc_domain.c
--- a/tools/libxc/xc_domain.c
+++ b/tools/libxc/xc_domain.c
@@ -481,16 +481,25 @@ int xc_domain_pin_memory_cacheattr(xc_in
 #include "xc_e820.h"
 int xc_domain_set_memmap_limit(xc_interface *xch,
                                uint32_t domid,
-                               unsigned long map_limitkb)
+                               unsigned long map_limitkb,
+                               xen_pfn_t pci_hole_start)
 {
     int rc;
+    uint64_t delta_kb;
+    size_t e820_sz;
     struct xen_foreign_memory_map fmap = {
         .domid = domid,
         .map = { .nr_entries = 1 }
     };
     DECLARE_HYPERCALL_BUFFER(struct e820entry, e820);
 
-    e820 = xc_hypercall_buffer_alloc(xch, e820, sizeof(*e820));
+    delta_kb = map_limitkb - (uint64_t)(pci_hole_start << 2);
+    if (pci_hole_start && (delta_kb > 0))
+      e820_sz = sizeof(*e820);
+    else
+      e820_sz = sizeof(*e820)*2;
+
+    e820 = xc_hypercall_buffer_alloc(xch, e820, e820_sz);
 
     if ( e820 == NULL )
     {
@@ -502,6 +511,16 @@ int xc_domain_set_memmap_limit(xc_interf
     e820->size = (uint64_t)map_limitkb << 10;
     e820->type = E820_RAM;
 
+    if (pci_hole_start && (delta_kb > 0))
+    {
+	fmap.map.nr_entries ++;
+	e820[0].size = (uint64_t)pci_hole_start << 12;
+ 
+	e820[1].type = E820_RAM;
+	e820[1].addr = (uint64_t)0x100000 << 12; /* val in pfn...  */
+	e820[1].size = (uint64_t)delta_kb << 10; /* .. while here in in kB. */
+    }
+
     set_xen_guest_handle(fmap.map.buffer, e820);
 
     rc = do_memory_op(xch, XENMEM_set_memory_map, &fmap, sizeof(fmap));
@@ -513,7 +532,8 @@ int xc_domain_set_memmap_limit(xc_interf
 #else
 int xc_domain_set_memmap_limit(xc_interface *xch,
                                uint32_t domid,
-                               unsigned long map_limitkb)
+                               unsigned long map_limitkb,
+                               xen_pfn_t pci_hole_start)
 {
     PERROR("Function not implemented");
     errno = ENOSYS;
diff --git a/tools/libxc/xenctrl.h b/tools/libxc/xenctrl.h
--- a/tools/libxc/xenctrl.h
+++ b/tools/libxc/xenctrl.h
@@ -913,7 +913,8 @@ int xc_domain_setmaxmem(xc_interface *xc
 
 int xc_domain_set_memmap_limit(xc_interface *xch,
                                uint32_t domid,
-                               unsigned long map_limitkb);
+                               unsigned long map_limitkb,
+				xen_pfn_t pci_hole_start);
 
 int xc_domain_set_time_offset(xc_interface *xch,
                               uint32_t domid,
diff --git a/tools/libxl/libxl.h b/tools/libxl/libxl.h
--- a/tools/libxl/libxl.h
+++ b/tools/libxl/libxl.h
@@ -392,6 +392,7 @@ int libxl_device_disk_getinfo(libxl_ctx 
                               libxl_device_disk *disk, libxl_diskinfo *diskinfo);
 int libxl_cdrom_insert(libxl_ctx *ctx, uint32_t domid, libxl_device_disk *disk);
 
+int libxl_find_pci_hole(uint32_t *start_pfn);
 /*
  * Make a disk available in this domain. Returns path to a device.
  */
diff --git a/tools/libxl/libxl.idl b/tools/libxl/libxl.idl
--- a/tools/libxl/libxl.idl
+++ b/tools/libxl/libxl.idl
@@ -110,6 +110,7 @@ libxl_domain_build_info = Struct("domain
                                         ])),
                  ("pv", "!%s", Struct(None,
                                        [("slack_memkb", uint32),
+                                        ("pci_hole_start", uint32),
                                         ("bootloader", string),
                                         ("bootloader_args", string),
                                         ("cmdline", string),
diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c
--- a/tools/libxl/libxl_dom.c
+++ b/tools/libxl/libxl_dom.c
@@ -71,7 +71,8 @@ int libxl__build_pre(libxl_ctx *ctx, uin
     xc_domain_setmaxmem(ctx->xch, domid, info->target_memkb + LIBXL_MAXMEM_CONSTANT);
     xc_domain_set_memmap_limit(ctx->xch, domid, 
             (info->hvm) ? info->max_memkb : 
-            (info->max_memkb + info->u.pv.slack_memkb));
+            (info->max_memkb + info->u.pv.slack_memkb),
+            (info->hvm) ? 0 : info->u.pv.pci_hole_start);
     xc_domain_set_tsc_info(ctx->xch, domid, info->tsc_mode, 0, 0, 0);
     if ( info->disable_migrate )
         xc_domain_disable_migrate(ctx->xch, domid);
@@ -181,6 +182,8 @@ int libxl__build_pv(libxl_ctx *ctx, uint
             }
         }
     }
+    if ( info->u.pv.pci_hole_start)
+        dom->pci_hole = info->u.pv.pci_hole_start;
 
     dom->flags = flags;
     dom->console_evtchn = state->console_port;
diff --git a/tools/libxl/libxl_pci.c b/tools/libxl/libxl_pci.c
--- a/tools/libxl/libxl_pci.c
+++ b/tools/libxl/libxl_pci.c
@@ -1066,3 +1066,51 @@ int libxl_device_pci_shutdown(libxl_ctx 
     free(pcidevs);
     return 0;
 }
+
+#define MAX_LINE 300
+int libxl_find_pci_hole(uint32_t *start_pfn)
+{
+	FILE *fp;
+	char *s;
+	char buf[MAX_LINE];
+	int ret = -ENODEV;
+	long int pci_hole_phys;
+
+	*start_pfn = 0;
+	fp = fopen("/proc/iomem", "r");
+	if (!fp)
+		return ret;
+
+	while (1) {
+		s = fgets(buf, MAX_LINE, fp);
+		if (!s)
+			break;
+		if (strlen(buf) < 1)
+			continue;
+		if (buf[strlen(buf)-1] == '\n')
+			buf[strlen(buf)-1] = '\0';
+		s = strchr(buf,'P');
+		if (!s)
+			continue;
+		if (strncmp(s, "PCI", 3) == 0) {
+			if (buf[0] == ' ')
+				continue;
+			s = strchr(buf,'-');
+			if (!s)
+				break;
+			s[0]='\0';
+			pci_hole_phys = strtol(buf, NULL, 16);
+			if (!pci_hole_phys)
+				break;
+			/* We don't want to the holes below 16MB. */
+			if (pci_hole_phys <= 0x1000)
+				continue;
+			*start_pfn = pci_hole_phys >> 12;
+			fprintf(stderr,"The value is 0x%d\n", *start_pfn);
+			ret = 0;
+			break;
+		}
+	}
+	fclose(fp);
+	return ret;
+}
diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c
--- a/tools/libxl/xl_cmdimpl.c
+++ b/tools/libxl/xl_cmdimpl.c
@@ -1078,6 +1078,14 @@ skip_vfb:
     if (!xlu_cfg_get_long (config, "pci_power_mgmt", &l))
         pci_power_mgmt = l;
 
+    if (!xlu_cfg_get_long (config, "pci_hole", &l)) {
+	if (l == 1) {
+	   uint32_t pfn_start = 0;
+	   if (!libxl_find_pci_hole(&pfn_start))
+            	b_info->u.pv.pci_hole_start = pfn_start;
+	} else
+           b_info->u.pv.pci_hole_start = l;
+    }
     if (!xlu_cfg_get_list (config, "pci", &pcis, 0, 0)) {
         int i;
         d_config->num_pcidevs = 0;
diff --git a/tools/python/xen/lowlevel/xc/xc.c b/tools/python/xen/lowlevel/xc/xc.c
--- a/tools/python/xen/lowlevel/xc/xc.c
+++ b/tools/python/xen/lowlevel/xc/xc.c
@@ -458,6 +458,7 @@ static PyObject *pyxc_linux_build(XcObje
     unsigned int mem_mb;
     unsigned long store_mfn = 0;
     unsigned long console_mfn = 0;
+    unsigned long pci_hole_start = 0;
     PyObject* elfnote_dict;
     PyObject* elfnote = NULL;
     PyObject* ret;
@@ -467,14 +468,16 @@ static PyObject *pyxc_linux_build(XcObje
                                 "console_evtchn", "image",
                                 /* optional */
                                 "ramdisk", "cmdline", "flags",
-                                "features", "vhpt", "superpages", NULL };
-
-    if ( !PyArg_ParseTupleAndKeywords(args, kwds, "iiiis|ssisii", kwd_list,
+                                "features", "vhpt", "superpages",
+                                "pci_hole", NULL };
+
+    if ( !PyArg_ParseTupleAndKeywords(args, kwds, "iiiis|ssisiii", kwd_list,
                                       &domid, &store_evtchn, &mem_mb,
                                       &console_evtchn, &image,
                                       /* optional */
                                       &ramdisk, &cmdline, &flags,
-                                      &features, &vhpt, &superpages) )
+                                      &features, &vhpt, &superpages,
+                                      &pci_hole_start) )
         return NULL;
 
     xc_dom_loginit(self->xc_handle);
@@ -486,6 +489,8 @@ static PyObject *pyxc_linux_build(XcObje
 
     dom->superpages = superpages;
 
+    dom->pci_hole = pci_hole_start;
+
     if ( xc_dom_linux_build(self->xc_handle, dom, domid, mem_mb, image,
                             ramdisk, flags, store_evtchn, &store_mfn,
                             console_evtchn, &console_mfn) != 0 ) {
@@ -1659,11 +1664,13 @@ static PyObject *pyxc_domain_set_memmap_
 {
     uint32_t dom;
     unsigned int maplimit_kb;
-
-    if ( !PyArg_ParseTuple(args, "ii", &dom, &maplimit_kb) )
+    unsigned long pci_hole_start = 0;
+
+    if ( !PyArg_ParseTuple(args, "ii|i", &dom, &maplimit_kb, &pci_hole_start) )
         return NULL;
 
-    if ( xc_domain_set_memmap_limit(self->xc_handle, dom, maplimit_kb) != 0 )
+    if ( xc_domain_set_memmap_limit(self->xc_handle, dom, maplimit_kb,
+                                    pci_hole_start) != 0 )
         return pyxc_error_to_exception(self->xc_handle);
     
     Py_INCREF(zero);
@@ -2661,6 +2668,7 @@ static PyMethodDef pyxc_methods[] = {
       "Set a domain's physical memory mappping limit\n"
       " dom [int]: Identifier of domain.\n"
       " map_limitkb [int]: .\n"
+      " pci_hole_start [int]: PFN for start of PCI hole (optional).\n"
       "Returns: [int] 0 on success; -1 on error.\n" },
 
 #ifdef __ia64__
diff --git a/tools/python/xen/xend/XendConfig.py b/tools/python/xen/xend/XendConfig.py
--- a/tools/python/xen/xend/XendConfig.py
+++ b/tools/python/xen/xend/XendConfig.py
@@ -241,6 +241,7 @@ XENAPI_CFG_TYPES = {
     'suppress_spurious_page_faults': bool0,
     's3_integrity' : int,
     'superpages' : int,
+    'pci_hole' : int,
     'memory_sharing': int,
     'pool_name' : str,
     'Description': str,
@@ -422,6 +423,7 @@ class XendConfig(dict):
             'target': 0,
             'pool_name' : 'Pool-0',
             'superpages': 0,
+            'pci_hole': 0,
             'description': '',
         }
         
@@ -2135,6 +2137,9 @@ class XendConfig(dict):
             image.append(['args', self['PV_args']])
         if self.has_key('superpages'):
             image.append(['superpages', self['superpages']])
+        if self.has_key('pci_hole'):
+            image.append(['pci_hole', self['pci_hole']])
+
 
         for key in XENAPI_PLATFORM_CFG_TYPES.keys():
             if key in self['platform']:
@@ -2179,6 +2184,10 @@ class XendConfig(dict):
         val = sxp.child_value(image_sxp, 'superpages')
         if val is not None:
             self['superpages'] = val
+
+        val = sxp.child_value(image_sxp, 'pci_hole')
+        if val is not None:
+            self['pci_hole'] = val
         
         val = sxp.child_value(image_sxp, 'memory_sharing')
         if val is not None:
diff --git a/tools/python/xen/xend/image.py b/tools/python/xen/xend/image.py
--- a/tools/python/xen/xend/image.py
+++ b/tools/python/xen/xend/image.py
@@ -84,6 +84,7 @@ class ImageHandler:
 
     ostype = None
     superpages = 0
+    pci_hole = 0
     memory_sharing = 0
 
     def __init__(self, vm, vmConfig):
@@ -711,6 +712,7 @@ class LinuxImageHandler(ImageHandler):
         self.vramsize = int(vmConfig['platform'].get('videoram',4)) * 1024
         self.is_stubdom = (self.kernel.find('stubdom') >= 0)
         self.superpages = int(vmConfig['superpages'])
+        self.pci_hole = int(vmConfig['pci_hole'])
 
     def buildDomain(self):
         store_evtchn = self.vm.getStorePort()
@@ -729,6 +731,7 @@ class LinuxImageHandler(ImageHandler):
         log.debug("features       = %s", self.vm.getFeatures())
         log.debug("flags          = %d", self.flags)
         log.debug("superpages     = %d", self.superpages)
+        log.debug("pci_hole       = %d", self.pci_hole)
         if arch.type == "ia64":
             log.debug("vhpt          = %d", self.vhpt)
 
@@ -742,7 +745,8 @@ class LinuxImageHandler(ImageHandler):
                               features       = self.vm.getFeatures(),
                               flags          = self.flags,
                               vhpt           = self.vhpt,
-                              superpages     = self.superpages)
+                              superpages     = self.superpages,
+                              pci_hole       = self.pci_hole)
 
     def getBitSize(self):
         return xc.getBitSize(image    = self.kernel,
@@ -774,7 +778,6 @@ class LinuxImageHandler(ImageHandler):
         args = args + ([ "-M", "xenpv"])
         return args
 
-
 class HVMImageHandler(ImageHandler):
 
     ostype = "hvm"
@@ -1065,7 +1068,7 @@ class X86_Linux_ImageHandler(LinuxImageH
         # set physical mapping limit
         # add an 8MB slack to balance backend allocations.
         mem_kb = self.getRequiredMaximumReservation() + (8 * 1024)
-        xc.domain_set_memmap_limit(self.vm.getDomid(), mem_kb)
+        xc.domain_set_memmap_limit(self.vm.getDomid(), mem_kb, self.pci_hole)
         rc = LinuxImageHandler.buildDomain(self)
         self.setCpuid()
         return rc
diff --git a/tools/python/xen/xm/create.py b/tools/python/xen/xm/create.py
--- a/tools/python/xen/xm/create.py
+++ b/tools/python/xen/xm/create.py
@@ -680,6 +680,11 @@ gopts.var('superpages', val='0|1',
            fn=set_int, default=0,
            use="Create domain with superpages")
 
+gopts.var('pci_hole', val='0x<XXX>|0',
+           fn=set_int, default=0,
+           use="""Create domain with a PCI hole. The value is the PFN of the
+           start of PCI hole. Usually that is 0xc0000.""")
+
 def err(msg):
     """Print an error to stderr and exit.
     """
@@ -770,6 +775,9 @@ def configure_image(vals):
         config_image.append(['args', vals.extra])
     if vals.superpages:
         config_image.append(['superpages', vals.superpages])
+    if vals.pci_hole:
+        config_image.append(['pci_hole', vals.pci_hole])
+
 
     if vals.builder == 'hvm':
         configure_hvm(config_image, vals) 
diff --git a/tools/python/xen/xm/xenapi_create.py b/tools/python/xen/xm/xenapi_create.py
--- a/tools/python/xen/xm/xenapi_create.py
+++ b/tools/python/xen/xm/xenapi_create.py
@@ -285,6 +285,8 @@ class xenapi_create:
                 vm.attributes["s3_integrity"].value,
             "superpages":
                 vm.attributes["superpages"].value,
+            "pci_hole":
+                vm.attributes["pci_hole"].value,
             "memory_static_max":
                 get_child_node_attribute(vm, "memory", "static_max"),
             "memory_static_min":
@@ -697,6 +699,8 @@ class sxp2xml:
             = str(get_child_by_name(config, "s3_integrity", 0))
         vm.attributes["superpages"] \
             = str(get_child_by_name(config, "superpages", 0))
+        vm.attributes["pci_hole"] \
+            = str(get_child_by_name(config, "pci_hole", 0))
         vm.attributes["pool_name"] \
             = str(get_child_by_name(config, "pool_name", "Pool-0"))

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC Patch] Support for making an E820 PCI hole in toolstack (xl + xm)
  2010-11-12 23:08 [RFC Patch] Support for making an E820 PCI hole in toolstack (xl + xm) Konrad Rzeszutek Wilk
@ 2010-11-12 23:16 ` Konrad Rzeszutek Wilk
  2010-11-13  7:40 ` Keir Fraser
  2010-11-17 11:14 ` Gianni Tedesco
  2 siblings, 0 replies; 25+ messages in thread
From: Konrad Rzeszutek Wilk @ 2010-11-12 23:16 UTC (permalink / raw)
  To: xen-devel, stefano.stabellini, gianni.tedesco
  Cc: Jeremy Fitzhardinge, bruce.edge

> It adds a new option to guest config file, which is "pci_hole". The user can
> specify the PFN number, such as '0xc0000' or in case of using the xl, '1' which
> will automatically figure out the start of the PCI address.

Here is an example:

kernel="/home/konrad/git/xtt/bootstrap/dist/common/vmlinuz"
ramdisk="/home/konrad/git/xtt/bootstrap/dist/common/initramfs.cpio.gz"
extra="inittab=/etc/inittab-xen console=hvc0 debug earlyprintk=xen memblock=debug iommu=soft"
memory=6048
vcpus=4
on_crash="preserve"
vif = [ 'mac=00:0f:4b:00:00:68, bridge=switch' ]
pci= ["0000:03:00.0","00:1d.0","00:1d.1","00:1d.2","00:1d.7"]
vfb = [ 'vnc=1, vnclisten=0.0.0.0,vncunused=1']
pci_hole=0xa0000


Also I found out that 'xl' does not understand multiple options in the pci line
so testing with that might give you some weird results. However "xm" works fine.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC Patch] Support for making an E820 PCI hole in toolstack (xl + xm)
  2010-11-12 23:08 [RFC Patch] Support for making an E820 PCI hole in toolstack (xl + xm) Konrad Rzeszutek Wilk
  2010-11-12 23:16 ` Konrad Rzeszutek Wilk
@ 2010-11-13  7:40 ` Keir Fraser
  2010-11-15 17:03   ` Konrad Rzeszutek Wilk
  2010-11-17 11:14 ` Gianni Tedesco
  2 siblings, 1 reply; 25+ messages in thread
From: Keir Fraser @ 2010-11-13  7:40 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk, xen-devel, stefano.stabellini, gianni.tedesco
  Cc: Jeremy Fitzhardinge, bruce.edge

Why doesn't the guest punch its own hole, by relocating RAM above 4GB?
That's what all HVM guests do (in hvmloader).

 -- Keir

On 12/11/2010 23:08, "Konrad Rzeszutek Wilk" <konrad.wilk@oracle.com> wrote:

> Hey guys,
> 
> Attached is an RFC patch for making a PCI hole in the PV guests. This allows
> PV guests(*) with 4GB or more to now properly work with or without
> PCI passthrough cards.
> 
> Previously the Linux kernel would not be able to allocate the PCI region
> underneath the 4GB region as that region was all System RAM. And you would see
> this:
> 
> [    0.000000] PM: Registered nosave memory: 00000000000a0000 -
> 0000000000100000
> [    0.000000] PCI: Warning: Cannot find a gap in the 32bit address range
> [    0.000000] PCI: Unassigned devices with 32bit resource registers may
> break!
> [    0.000000] Allocating PCI resources starting at 100100000 (gap:
> 100100000:400000)
> 
> 
> This patchset punches an PCI hole in the E820 region and as well fills the P2M
> properly,
> so that now you can see (*):
> [    0.000000] Allocating PCI resources starting at a0000000 (gap:
> a0000000:60000000)
> 
> It adds a new option to guest config file, which is "pci_hole". The user can
> specify the PFN number, such as '0xc0000' or in case of using the xl, '1'
> which
> will automatically figure out the start of the PCI address.
> 
> *: This option requires support in the Linux kernel to actually deal with two
> entries in the E820 map and P2M space filled with ~0.
> 
> 
> The patches (draft, not ready for upstream) for the Linux kernel to support
> this are
> available at:
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen.git devel/e820-hole
> 
> All of these patches make the E820 of the Linux guest with 4GB (or more)
> passed
> look like this (2.6.37-rc1+devel/e820-hole):
> [    0.000000]  Xen: 0000000000000000 - 00000000000a0000 (usable)
> [    0.000000]  Xen: 00000000000a0000 - 0000000000100000 (reserved)
> [    0.000000]  Xen: 0000000000100000 - 00000000a0000000 (usable)
> [    0.000000]  Xen: 0000000100000000 - 0000000160800000 (usable)
> 
> compared to (2.6.36)
> [    0.000000]  Xen: 0000000000000000 - 00000000000a0000 (usable)
> [    0.000000]  Xen: 00000000000a0000 - 0000000000100000 (reserved)
> [    0.000000]  Xen: 0000000000100000 - 0000000100000000 (usable)
> 
> and (2.6.37-rc1):
> [    0.000000]  Xen: 0000000000000000 - 00000000000a0000 (usable)
> [    0.000000]  Xen: 00000000000a0000 - 0000000000100000 (reserved)
> [    0.000000]  Xen: 0000000000100000 - 0000000100800000 (usable)
> 
> In regards to the patches that I am attaching here, what is the magic
> incantention
> to make the indentation/StyleGuide proper for the tools/libxc directory? The
> tab spacing
> is off a bit (I think).
> 
> I've tested this so far only on 64-bit guests, and I am quite sure the
> tool-stack
> needs some extra care for the 32-bit guests..
> 
> But please take a look and give feedback.
> 
> diff --git a/tools/libxc/xc_dom.h b/tools/libxc/xc_dom.h
> --- a/tools/libxc/xc_dom.h
> +++ b/tools/libxc/xc_dom.h
> @@ -91,6 +91,8 @@ struct xc_dom_image {
>  
>      /* physical memory */
>      xen_pfn_t total_pages;
> +    /* start of the pci_hole. goes up to 4gb */
> +    xen_pfn_t pci_hole;
>      struct xc_dom_phys *phys_pages;
>      int realmodearea_log;
>  
> diff --git a/tools/libxc/xc_dom_core.c b/tools/libxc/xc_dom_core.c
> --- a/tools/libxc/xc_dom_core.c
> +++ b/tools/libxc/xc_dom_core.c
> @@ -715,17 +715,22 @@ int xc_dom_update_guest_p2m(struct xc_do
>      uint32_t *p2m_32;
>      uint64_t *p2m_64;
>      xen_pfn_t i;
> +    size_t tot_pages;
>  
>      if ( !dom->p2m_guest )
>          return 0;
>  
> +    tot_pages = dom->total_pages;
> +    if (dom->pci_hole)
> +         tot_pages += (0x100000 - dom->pci_hole);
> +
>      switch ( dom->arch_hooks->sizeof_pfn )
>      {
>      case 4:
>          DOMPRINTF("%s: dst 32bit, pages 0x%" PRIpfn "",
> -                  __FUNCTION__, dom->total_pages);
> +                  __FUNCTION__, tot_pages);
>          p2m_32 = dom->p2m_guest;
> -        for ( i = 0; i < dom->total_pages; i++ )
> +        for ( i = 0; i < tot_pages; i++ )
>              if ( dom->p2m_host[i] != INVALID_P2M_ENTRY )
>                  p2m_32[i] = dom->p2m_host[i];
>              else
> @@ -733,9 +738,9 @@ int xc_dom_update_guest_p2m(struct xc_do
>          break;
>      case 8:
>          DOMPRINTF("%s: dst 64bit, pages 0x%" PRIpfn "",
> -                  __FUNCTION__, dom->total_pages);
> +                  __FUNCTION__, tot_pages);
>          p2m_64 = dom->p2m_guest;
> -        for ( i = 0; i < dom->total_pages; i++ )
> +        for ( i = 0; i < tot_pages; i++ )
>              if ( dom->p2m_host[i] != INVALID_P2M_ENTRY )
>                  p2m_64[i] = dom->p2m_host[i];
>              else
> diff --git a/tools/libxc/xc_dom_x86.c b/tools/libxc/xc_dom_x86.c
> --- a/tools/libxc/xc_dom_x86.c
> +++ b/tools/libxc/xc_dom_x86.c
> @@ -406,6 +406,15 @@ static int alloc_magic_pages(struct xc_d
>  {
>      size_t p2m_size = dom->total_pages * dom->arch_hooks->sizeof_pfn;
>  
> +    if (dom->pci_hole && (dom->total_pages > dom->pci_hole))
> +    {
> + size_t p2m_pci_hole_size = (0x100000 - dom->pci_hole) *
> +      dom->arch_hooks->sizeof_pfn;
> +
> +        DOMPRINTF("%s: Expanding P2M to include PCI hole (%ld->%ld)\n",
> +  __FUNCTION__, p2m_size, p2m_size + p2m_pci_hole_size);
> + p2m_size += p2m_pci_hole_size;
> +    }
>      /* allocate phys2mach table */
>      if ( xc_dom_alloc_segment(dom, &dom->p2m_seg, "phys2mach", 0, p2m_size) )
>          return -1;
> @@ -712,6 +721,7 @@ int arch_setup_meminit(struct xc_dom_ima
>  {
>      int rc;
>      xen_pfn_t pfn, allocsz, i, j, mfn;
> +    size_t p2m_size;
>  
>      rc = x86_compat(dom->xch, dom->guest_domid, dom->guest_type);
>      if ( rc )
> @@ -723,8 +733,13 @@ int arch_setup_meminit(struct xc_dom_ima
>          if ( rc )
>              return rc;
>      }
> +    p2m_size = dom->total_pages;
>  
> -    dom->p2m_host = xc_dom_malloc(dom, sizeof(xen_pfn_t) * dom->total_pages);
> +    if (dom->pci_hole && (dom->total_pages > dom->pci_hole))
> +  p2m_size += (0x100000 - dom->pci_hole);
> +
> +    DOMPRINTF("Allocating %ld bytes for P2M", p2m_size * sizeof(xen_pfn_t));
> +    dom->p2m_host = xc_dom_malloc(dom, sizeof(xen_pfn_t) * p2m_size);
>      if ( dom->superpages )
>      {
>          int count = dom->total_pages >> SUPERPAGE_PFN_SHIFT;
> @@ -750,21 +765,66 @@ int arch_setup_meminit(struct xc_dom_ima
>      }
>      else
>      {
> -        /* setup initial p2m */
> -        for ( pfn = 0; pfn < dom->total_pages; pfn++ )
> -            dom->p2m_host[pfn] = pfn;
> -        
> -        /* allocate guest memory */
> -        for ( i = rc = allocsz = 0;
> -              (i < dom->total_pages) && !rc;
> -              i += allocsz )
> + /* for PCI mapping, stick INVALID_MFN in the PCI_HOLE */
> +        if ( dom->pci_hole && (dom->total_pages > dom->pci_hole) )
>          {
> -            allocsz = dom->total_pages - i;
> -            if ( allocsz > 1024*1024 )
> -                allocsz = 1024*1024;
> -            rc = xc_domain_populate_physmap_exact(
> -                dom->xch, dom->guest_domid, allocsz,
> -                0, 0, &dom->p2m_host[i]);
> +            /* setup initial p2m in three passes. */
> +            for (pfn = 0; pfn < dom->pci_hole; pfn++)
> +              dom->p2m_host[pfn] = pfn;
> +
> +            xc_dom_printf (dom->xch, "%s: 0x0->0x%lx has PFNs.",
> __FUNCTION__, pfn);
> +            xc_dom_printf (dom->xch, "%s: 0x%lx -> 0x%x has INVALID_MFN",
> +                          __FUNCTION__, pfn, 0x100000);
> +            for (; pfn < 0x100000; pfn++)
> +              dom->p2m_host[pfn] = INVALID_MFN;
> +
> +            for (; pfn < 0x100000 + dom->total_pages - dom->pci_hole; pfn++)
> +              dom->p2m_host[pfn] = pfn;
> +            xc_dom_printf (dom->xch, "%s: 0x%x -> 0x%lx has PFNs.",
> __FUNCTION__,
> +                           0x100000, pfn);
> +
> +            /* allocate guest memory in two passes. */
> +            for (i = rc = allocsz = 0; (i < dom->pci_hole) && !rc; i +=
> allocsz)
> +            {
> +              allocsz = dom->pci_hole - i;
> +              xc_dom_printf (dom->xch, "%s: Populating M2P 0x%lx->0x%lx",
> +                             __FUNCTION__, i, i + allocsz);
> +              rc = xc_domain_populate_physmap_exact (dom->xch,
> dom->guest_domid,
> +                                 allocsz, 0, 0,
> +                                 &dom->p2m_host[i]);
> +            }
> +            for (i = 0x100000, allocsz = rc = 0;
> +                 (i < (0x100000 + dom->total_pages - dom->pci_hole))
> +                  && !rc; i += allocsz)
> +            {
> +              allocsz = (dom->total_pages - dom->pci_hole) - (i - 0x100000);
> +              if (allocsz > 1024 * 1024)
> +                allocsz = 1024 * 1024;
> +              xc_dom_printf (dom->xch, "%s: Populating M2P 0x%lx->0x%lx",
> +                             __FUNCTION__, i, i + allocsz);
> +              rc = xc_domain_populate_physmap_exact (dom->xch,
> dom->guest_domid,
> +                                                      allocsz, 0, 0,
> +                                                      &dom->p2m_host[i]);
> +            }
> +            xc_dom_printf (dom->xch, "%s: Done with PCI populate physmap",
> +                          __FUNCTION__);
> +        } else {
> +                /* setup initial p2m */
> +                for ( pfn = 0; pfn < dom->total_pages; pfn++ )
> +                    dom->p2m_host[pfn] = pfn;
> +                
> +                /* allocate guest memory */
> +                for ( i = rc = allocsz = 0;
> +                      (i < dom->total_pages) && !rc;
> +                      i += allocsz )
> +                {
> +                    allocsz = dom->total_pages - i;
> +                    if ( allocsz > 1024*1024 )
> +                        allocsz = 1024*1024;
> +                    rc = xc_domain_populate_physmap_exact(
> +                        dom->xch, dom->guest_domid, allocsz,
> +                        0, 0, &dom->p2m_host[i]);
> +                }
>          }
>      }
>  
> diff --git a/tools/libxc/xc_domain.c b/tools/libxc/xc_domain.c
> --- a/tools/libxc/xc_domain.c
> +++ b/tools/libxc/xc_domain.c
> @@ -481,16 +481,25 @@ int xc_domain_pin_memory_cacheattr(xc_in
>  #include "xc_e820.h"
>  int xc_domain_set_memmap_limit(xc_interface *xch,
>                                 uint32_t domid,
> -                               unsigned long map_limitkb)
> +                               unsigned long map_limitkb,
> +                               xen_pfn_t pci_hole_start)
>  {
>      int rc;
> +    uint64_t delta_kb;
> +    size_t e820_sz;
>      struct xen_foreign_memory_map fmap = {
>          .domid = domid,
>          .map = { .nr_entries = 1 }
>      };
>      DECLARE_HYPERCALL_BUFFER(struct e820entry, e820);
>  
> -    e820 = xc_hypercall_buffer_alloc(xch, e820, sizeof(*e820));
> +    delta_kb = map_limitkb - (uint64_t)(pci_hole_start << 2);
> +    if (pci_hole_start && (delta_kb > 0))
> +      e820_sz = sizeof(*e820);
> +    else
> +      e820_sz = sizeof(*e820)*2;
> +
> +    e820 = xc_hypercall_buffer_alloc(xch, e820, e820_sz);
>  
>      if ( e820 == NULL )
>      {
> @@ -502,6 +511,16 @@ int xc_domain_set_memmap_limit(xc_interf
>      e820->size = (uint64_t)map_limitkb << 10;
>      e820->type = E820_RAM;
>  
> +    if (pci_hole_start && (delta_kb > 0))
> +    {
> + fmap.map.nr_entries ++;
> + e820[0].size = (uint64_t)pci_hole_start << 12;
> + 
> + e820[1].type = E820_RAM;
> + e820[1].addr = (uint64_t)0x100000 << 12; /* val in pfn...  */
> + e820[1].size = (uint64_t)delta_kb << 10; /* .. while here in in kB. */
> +    }
> +
>      set_xen_guest_handle(fmap.map.buffer, e820);
>  
>      rc = do_memory_op(xch, XENMEM_set_memory_map, &fmap, sizeof(fmap));
> @@ -513,7 +532,8 @@ int xc_domain_set_memmap_limit(xc_interf
>  #else
>  int xc_domain_set_memmap_limit(xc_interface *xch,
>                                 uint32_t domid,
> -                               unsigned long map_limitkb)
> +                               unsigned long map_limitkb,
> +                               xen_pfn_t pci_hole_start)
>  {
>      PERROR("Function not implemented");
>      errno = ENOSYS;
> diff --git a/tools/libxc/xenctrl.h b/tools/libxc/xenctrl.h
> --- a/tools/libxc/xenctrl.h
> +++ b/tools/libxc/xenctrl.h
> @@ -913,7 +913,8 @@ int xc_domain_setmaxmem(xc_interface *xc
>  
>  int xc_domain_set_memmap_limit(xc_interface *xch,
>                                 uint32_t domid,
> -                               unsigned long map_limitkb);
> +                               unsigned long map_limitkb,
> +    xen_pfn_t pci_hole_start);
>  
>  int xc_domain_set_time_offset(xc_interface *xch,
>                                uint32_t domid,
> diff --git a/tools/libxl/libxl.h b/tools/libxl/libxl.h
> --- a/tools/libxl/libxl.h
> +++ b/tools/libxl/libxl.h
> @@ -392,6 +392,7 @@ int libxl_device_disk_getinfo(libxl_ctx
>                                libxl_device_disk *disk, libxl_diskinfo
> *diskinfo);
>  int libxl_cdrom_insert(libxl_ctx *ctx, uint32_t domid, libxl_device_disk
> *disk);
>  
> +int libxl_find_pci_hole(uint32_t *start_pfn);
>  /*
>   * Make a disk available in this domain. Returns path to a device.
>   */
> diff --git a/tools/libxl/libxl.idl b/tools/libxl/libxl.idl
> --- a/tools/libxl/libxl.idl
> +++ b/tools/libxl/libxl.idl
> @@ -110,6 +110,7 @@ libxl_domain_build_info = Struct("domain
>                                          ])),
>                   ("pv", "!%s", Struct(None,
>                                         [("slack_memkb", uint32),
> +                                        ("pci_hole_start", uint32),
>                                          ("bootloader", string),
>                                          ("bootloader_args", string),
>                                          ("cmdline", string),
> diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c
> --- a/tools/libxl/libxl_dom.c
> +++ b/tools/libxl/libxl_dom.c
> @@ -71,7 +71,8 @@ int libxl__build_pre(libxl_ctx *ctx, uin
>      xc_domain_setmaxmem(ctx->xch, domid, info->target_memkb +
> LIBXL_MAXMEM_CONSTANT);
>      xc_domain_set_memmap_limit(ctx->xch, domid,
>              (info->hvm) ? info->max_memkb :
> -            (info->max_memkb + info->u.pv.slack_memkb));
> +            (info->max_memkb + info->u.pv.slack_memkb),
> +            (info->hvm) ? 0 : info->u.pv.pci_hole_start);
>      xc_domain_set_tsc_info(ctx->xch, domid, info->tsc_mode, 0, 0, 0);
>      if ( info->disable_migrate )
>          xc_domain_disable_migrate(ctx->xch, domid);
> @@ -181,6 +182,8 @@ int libxl__build_pv(libxl_ctx *ctx, uint
>              }
>          }
>      }
> +    if ( info->u.pv.pci_hole_start)
> +        dom->pci_hole = info->u.pv.pci_hole_start;
>  
>      dom->flags = flags;
>      dom->console_evtchn = state->console_port;
> diff --git a/tools/libxl/libxl_pci.c b/tools/libxl/libxl_pci.c
> --- a/tools/libxl/libxl_pci.c
> +++ b/tools/libxl/libxl_pci.c
> @@ -1066,3 +1066,51 @@ int libxl_device_pci_shutdown(libxl_ctx
>      free(pcidevs);
>      return 0;
>  }
> +
> +#define MAX_LINE 300
> +int libxl_find_pci_hole(uint32_t *start_pfn)
> +{
> + FILE *fp;
> + char *s;
> + char buf[MAX_LINE];
> + int ret = -ENODEV;
> + long int pci_hole_phys;
> +
> + *start_pfn = 0;
> + fp = fopen("/proc/iomem", "r");
> + if (!fp)
> +  return ret;
> +
> + while (1) {
> +  s = fgets(buf, MAX_LINE, fp);
> +  if (!s)
> +   break;
> +  if (strlen(buf) < 1)
> +   continue;
> +  if (buf[strlen(buf)-1] == '\n')
> +   buf[strlen(buf)-1] = '\0';
> +  s = strchr(buf,'P');
> +  if (!s)
> +   continue;
> +  if (strncmp(s, "PCI", 3) == 0) {
> +   if (buf[0] == ' ')
> +    continue;
> +   s = strchr(buf,'-');
> +   if (!s)
> +    break;
> +   s[0]='\0';
> +   pci_hole_phys = strtol(buf, NULL, 16);
> +   if (!pci_hole_phys)
> +    break;
> +   /* We don't want to the holes below 16MB. */
> +   if (pci_hole_phys <= 0x1000)
> +    continue;
> +   *start_pfn = pci_hole_phys >> 12;
> +   fprintf(stderr,"The value is 0x%d\n", *start_pfn);
> +   ret = 0;
> +   break;
> +  }
> + }
> + fclose(fp);
> + return ret;
> +}
> diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c
> --- a/tools/libxl/xl_cmdimpl.c
> +++ b/tools/libxl/xl_cmdimpl.c
> @@ -1078,6 +1078,14 @@ skip_vfb:
>      if (!xlu_cfg_get_long (config, "pci_power_mgmt", &l))
>          pci_power_mgmt = l;
>  
> +    if (!xlu_cfg_get_long (config, "pci_hole", &l)) {
> + if (l == 1) {
> +    uint32_t pfn_start = 0;
> +    if (!libxl_find_pci_hole(&pfn_start))
> +             b_info->u.pv.pci_hole_start = pfn_start;
> + } else
> +           b_info->u.pv.pci_hole_start = l;
> +    }
>      if (!xlu_cfg_get_list (config, "pci", &pcis, 0, 0)) {
>          int i;
>          d_config->num_pcidevs = 0;
> diff --git a/tools/python/xen/lowlevel/xc/xc.c
> b/tools/python/xen/lowlevel/xc/xc.c
> --- a/tools/python/xen/lowlevel/xc/xc.c
> +++ b/tools/python/xen/lowlevel/xc/xc.c
> @@ -458,6 +458,7 @@ static PyObject *pyxc_linux_build(XcObje
>      unsigned int mem_mb;
>      unsigned long store_mfn = 0;
>      unsigned long console_mfn = 0;
> +    unsigned long pci_hole_start = 0;
>      PyObject* elfnote_dict;
>      PyObject* elfnote = NULL;
>      PyObject* ret;
> @@ -467,14 +468,16 @@ static PyObject *pyxc_linux_build(XcObje
>                                  "console_evtchn", "image",
>                                  /* optional */
>                                  "ramdisk", "cmdline", "flags",
> -                                "features", "vhpt", "superpages", NULL };
> -
> -    if ( !PyArg_ParseTupleAndKeywords(args, kwds, "iiiis|ssisii", kwd_list,
> +                                "features", "vhpt", "superpages",
> +                                "pci_hole", NULL };
> +
> +    if ( !PyArg_ParseTupleAndKeywords(args, kwds, "iiiis|ssisiii", kwd_list,
>                                        &domid, &store_evtchn, &mem_mb,
>                                        &console_evtchn, &image,
>                                        /* optional */
>                                        &ramdisk, &cmdline, &flags,
> -                                      &features, &vhpt, &superpages) )
> +                                      &features, &vhpt, &superpages,
> +                                      &pci_hole_start) )
>          return NULL;
>  
>      xc_dom_loginit(self->xc_handle);
> @@ -486,6 +489,8 @@ static PyObject *pyxc_linux_build(XcObje
>  
>      dom->superpages = superpages;
>  
> +    dom->pci_hole = pci_hole_start;
> +
>      if ( xc_dom_linux_build(self->xc_handle, dom, domid, mem_mb, image,
>                              ramdisk, flags, store_evtchn, &store_mfn,
>                              console_evtchn, &console_mfn) != 0 ) {
> @@ -1659,11 +1664,13 @@ static PyObject *pyxc_domain_set_memmap_
>  {
>      uint32_t dom;
>      unsigned int maplimit_kb;
> -
> -    if ( !PyArg_ParseTuple(args, "ii", &dom, &maplimit_kb) )
> +    unsigned long pci_hole_start = 0;
> +
> +    if ( !PyArg_ParseTuple(args, "ii|i", &dom, &maplimit_kb, &pci_hole_start)
> )
>          return NULL;
>  
> -    if ( xc_domain_set_memmap_limit(self->xc_handle, dom, maplimit_kb) != 0 )
> +    if ( xc_domain_set_memmap_limit(self->xc_handle, dom, maplimit_kb,
> +                                    pci_hole_start) != 0 )
>          return pyxc_error_to_exception(self->xc_handle);
>      
>      Py_INCREF(zero);
> @@ -2661,6 +2668,7 @@ static PyMethodDef pyxc_methods[] = {
>        "Set a domain's physical memory mappping limit\n"
>        " dom [int]: Identifier of domain.\n"
>        " map_limitkb [int]: .\n"
> +      " pci_hole_start [int]: PFN for start of PCI hole (optional).\n"
>        "Returns: [int] 0 on success; -1 on error.\n" },
>  
>  #ifdef __ia64__
> diff --git a/tools/python/xen/xend/XendConfig.py
> b/tools/python/xen/xend/XendConfig.py
> --- a/tools/python/xen/xend/XendConfig.py
> +++ b/tools/python/xen/xend/XendConfig.py
> @@ -241,6 +241,7 @@ XENAPI_CFG_TYPES = {
>      'suppress_spurious_page_faults': bool0,
>      's3_integrity' : int,
>      'superpages' : int,
> +    'pci_hole' : int,
>      'memory_sharing': int,
>      'pool_name' : str,
>      'Description': str,
> @@ -422,6 +423,7 @@ class XendConfig(dict):
>              'target': 0,
>              'pool_name' : 'Pool-0',
>              'superpages': 0,
> +            'pci_hole': 0,
>              'description': '',
>          }
>          
> @@ -2135,6 +2137,9 @@ class XendConfig(dict):
>              image.append(['args', self['PV_args']])
>          if self.has_key('superpages'):
>              image.append(['superpages', self['superpages']])
> +        if self.has_key('pci_hole'):
> +            image.append(['pci_hole', self['pci_hole']])
> +
>  
>          for key in XENAPI_PLATFORM_CFG_TYPES.keys():
>              if key in self['platform']:
> @@ -2179,6 +2184,10 @@ class XendConfig(dict):
>          val = sxp.child_value(image_sxp, 'superpages')
>          if val is not None:
>              self['superpages'] = val
> +
> +        val = sxp.child_value(image_sxp, 'pci_hole')
> +        if val is not None:
> +            self['pci_hole'] = val
>          
>          val = sxp.child_value(image_sxp, 'memory_sharing')
>          if val is not None:
> diff --git a/tools/python/xen/xend/image.py b/tools/python/xen/xend/image.py
> --- a/tools/python/xen/xend/image.py
> +++ b/tools/python/xen/xend/image.py
> @@ -84,6 +84,7 @@ class ImageHandler:
>  
>      ostype = None
>      superpages = 0
> +    pci_hole = 0
>      memory_sharing = 0
>  
>      def __init__(self, vm, vmConfig):
> @@ -711,6 +712,7 @@ class LinuxImageHandler(ImageHandler):
>          self.vramsize = int(vmConfig['platform'].get('videoram',4)) * 1024
>          self.is_stubdom = (self.kernel.find('stubdom') >= 0)
>          self.superpages = int(vmConfig['superpages'])
> +        self.pci_hole = int(vmConfig['pci_hole'])
>  
>      def buildDomain(self):
>          store_evtchn = self.vm.getStorePort()
> @@ -729,6 +731,7 @@ class LinuxImageHandler(ImageHandler):
>          log.debug("features       = %s", self.vm.getFeatures())
>          log.debug("flags          = %d", self.flags)
>          log.debug("superpages     = %d", self.superpages)
> +        log.debug("pci_hole       = %d", self.pci_hole)
>          if arch.type == "ia64":
>              log.debug("vhpt          = %d", self.vhpt)
>  
> @@ -742,7 +745,8 @@ class LinuxImageHandler(ImageHandler):
>                                features       = self.vm.getFeatures(),
>                                flags          = self.flags,
>                                vhpt           = self.vhpt,
> -                              superpages     = self.superpages)
> +                              superpages     = self.superpages,
> +                              pci_hole       = self.pci_hole)
>  
>      def getBitSize(self):
>          return xc.getBitSize(image    = self.kernel,
> @@ -774,7 +778,6 @@ class LinuxImageHandler(ImageHandler):
>          args = args + ([ "-M", "xenpv"])
>          return args
>  
> -
>  class HVMImageHandler(ImageHandler):
>  
>      ostype = "hvm"
> @@ -1065,7 +1068,7 @@ class X86_Linux_ImageHandler(LinuxImageH
>          # set physical mapping limit
>          # add an 8MB slack to balance backend allocations.
>          mem_kb = self.getRequiredMaximumReservation() + (8 * 1024)
> -        xc.domain_set_memmap_limit(self.vm.getDomid(), mem_kb)
> +        xc.domain_set_memmap_limit(self.vm.getDomid(), mem_kb, self.pci_hole)
>          rc = LinuxImageHandler.buildDomain(self)
>          self.setCpuid()
>          return rc
> diff --git a/tools/python/xen/xm/create.py b/tools/python/xen/xm/create.py
> --- a/tools/python/xen/xm/create.py
> +++ b/tools/python/xen/xm/create.py
> @@ -680,6 +680,11 @@ gopts.var('superpages', val='0|1',
>             fn=set_int, default=0,
>             use="Create domain with superpages")
>  
> +gopts.var('pci_hole', val='0x<XXX>|0',
> +           fn=set_int, default=0,
> +           use="""Create domain with a PCI hole. The value is the PFN of the
> +           start of PCI hole. Usually that is 0xc0000.""")
> +
>  def err(msg):
>      """Print an error to stderr and exit.
>      """
> @@ -770,6 +775,9 @@ def configure_image(vals):
>          config_image.append(['args', vals.extra])
>      if vals.superpages:
>          config_image.append(['superpages', vals.superpages])
> +    if vals.pci_hole:
> +        config_image.append(['pci_hole', vals.pci_hole])
> +
>  
>      if vals.builder == 'hvm':
>          configure_hvm(config_image, vals)
> diff --git a/tools/python/xen/xm/xenapi_create.py
> b/tools/python/xen/xm/xenapi_create.py
> --- a/tools/python/xen/xm/xenapi_create.py
> +++ b/tools/python/xen/xm/xenapi_create.py
> @@ -285,6 +285,8 @@ class xenapi_create:
>                  vm.attributes["s3_integrity"].value,
>              "superpages":
>                  vm.attributes["superpages"].value,
> +            "pci_hole":
> +                vm.attributes["pci_hole"].value,
>              "memory_static_max":
>                  get_child_node_attribute(vm, "memory", "static_max"),
>              "memory_static_min":
> @@ -697,6 +699,8 @@ class sxp2xml:
>              = str(get_child_by_name(config, "s3_integrity", 0))
>          vm.attributes["superpages"] \
>              = str(get_child_by_name(config, "superpages", 0))
> +        vm.attributes["pci_hole"] \
> +            = str(get_child_by_name(config, "pci_hole", 0))
>          vm.attributes["pool_name"] \
>              = str(get_child_by_name(config, "pool_name", "Pool-0"))
>  
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xensource.com
> http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC Patch] Support for making an E820 PCI hole in toolstack (xl + xm)
  2010-11-13  7:40 ` Keir Fraser
@ 2010-11-15 17:03   ` Konrad Rzeszutek Wilk
  2010-11-15 17:20     ` Ian Campbell
  2010-11-15 17:48     ` Keir Fraser
  0 siblings, 2 replies; 25+ messages in thread
From: Konrad Rzeszutek Wilk @ 2010-11-15 17:03 UTC (permalink / raw)
  To: Keir Fraser
  Cc: Jeremy Fitzhardinge, xen-devel, bruce.edge, gianni.tedesco,
	stefano.stabellini

On Sat, Nov 13, 2010 at 07:40:30AM +0000, Keir Fraser wrote:
> Why doesn't the guest punch its own hole, by relocating RAM above 4GB?


 1). Did not work for me - I am not sure why but I had the hardest time do
     hypervisor_populate_physmap - it would just hang the guest.
 2). It is much simple to parse the E820 in the Linux kernel than actually
     creating new E820 entries in the kernel (hypercall), making a bunch of
     hypervisor calls that unmap, then remap the space, filling out the P2M
     with INVALID_MFN, and doing all of that before the "real" Linux kernel
     actually starts (all would have to be done in xen_start_kernel).
     I have a sinking feeling tha the upstream community would not like it
     this that much.

> That's what all HVM guests do (in hvmloader).

  3). Which is also part of the Xen tool-stack.


Keir, I think you posted it at some point - was there a standard 'indent'
incantention for the tools/libxc StyleGuide (which is similar to to the xen/*
one) - I can't find it in my mail archive...

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC Patch] Support for making an E820 PCI hole in toolstack (xl + xm)
  2010-11-15 17:03   ` Konrad Rzeszutek Wilk
@ 2010-11-15 17:20     ` Ian Campbell
  2010-11-15 17:28       ` Konrad Rzeszutek Wilk
  2010-11-15 17:48     ` Keir Fraser
  1 sibling, 1 reply; 25+ messages in thread
From: Ian Campbell @ 2010-11-15 17:20 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Jeremy Fitzhardinge, xen-devel, Keir Fraser, Stefano Stabellini,
	bruce.edge, Gianni Tedesco

On Mon, 2010-11-15 at 17:03 +0000, Konrad Rzeszutek Wilk wrote:
>  2). It is much simple to parse the E820 in the Linux kernel[...]

The ability to ingest an e820 from the hypervisor is also needed for
dom0 to consume the host e820 so it's not like there is additional code
on the kernel side to cope with this.

Ian.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC Patch] Support for making an E820 PCI hole in toolstack (xl + xm)
  2010-11-15 17:20     ` Ian Campbell
@ 2010-11-15 17:28       ` Konrad Rzeszutek Wilk
  0 siblings, 0 replies; 25+ messages in thread
From: Konrad Rzeszutek Wilk @ 2010-11-15 17:28 UTC (permalink / raw)
  To: Ian Campbell
  Cc: Jeremy Fitzhardinge, xen-devel, Keir Fraser, Stefano Stabellini,
	bruce.edge, Gianni Tedesco

On Mon, Nov 15, 2010 at 05:20:00PM +0000, Ian Campbell wrote:
> On Mon, 2010-11-15 at 17:03 +0000, Konrad Rzeszutek Wilk wrote:
> >  2). It is much simple to parse the E820 in the Linux kernel[...]
> 
> The ability to ingest an e820 from the hypervisor is also needed for
> dom0 to consume the host e820 so it's not like there is additional code
> on the kernel side to cope with this.

There is a bit. We need to decouple the info->nr_pages magic that is used
throughout the code b/c we are treating that value as a 'last pfn'. With
the offset, we need to be more careful about it.

P.S.
(devel code, probably going to post a refresh today):
git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen.git devel/e820-hole

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC Patch] Support for making an E820 PCI hole in toolstack (xl + xm)
  2010-11-15 17:03   ` Konrad Rzeszutek Wilk
  2010-11-15 17:20     ` Ian Campbell
@ 2010-11-15 17:48     ` Keir Fraser
  2010-11-15 18:15       ` Konrad Rzeszutek Wilk
  1 sibling, 1 reply; 25+ messages in thread
From: Keir Fraser @ 2010-11-15 17:48 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Jeremy Fitzhardinge, xen-devel, bruce.edge, gianni.tedesco,
	stefano.stabellini

On 15/11/2010 17:03, "Konrad Rzeszutek Wilk" <konrad.wilk@oracle.com> wrote:

> On Sat, Nov 13, 2010 at 07:40:30AM +0000, Keir Fraser wrote:
>> Why doesn't the guest punch its own hole, by relocating RAM above 4GB?
> 
> 
>  1). Did not work for me - I am not sure why but I had the hardest time do
>      hypervisor_populate_physmap - it would just hang the guest.

For a PV guest you don't need to do any alloc/free/move memory hypercalls.
You rewrite your own p2m to relocate mfns where you want them in pfn space.
Then some hypercalls just to update the m2p array to match.

>  2). It is much simple to parse the E820 in the Linux kernel than actually
>      creating new E820 entries in the kernel (hypercall), making a bunch of
>      hypervisor calls that unmap, then remap the space, filling out the P2M
>      with INVALID_MFN, and doing all of that before the "real" Linux kernel
>      actually starts (all would have to be done in xen_start_kernel).
>      I have a sinking feeling tha the upstream community would not like it
>      this that much.

Well it is all quite Xen specific, so I'm surprised.

>> That's what all HVM guests do (in hvmloader).
> 
>   3). Which is also part of the Xen tool-stack.
> 
> 
> Keir, I think you posted it at some point - was there a standard 'indent'
> incantention for the tools/libxc StyleGuide (which is similar to to the xen/*
> one) - I can't find it in my mail archive...

I think someone else worked one out, but I don't have it to hand I'm afraid.

 -- Keir

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC Patch] Support for making an E820 PCI hole in toolstack (xl + xm)
  2010-11-15 17:48     ` Keir Fraser
@ 2010-11-15 18:15       ` Konrad Rzeszutek Wilk
  2010-11-15 18:41         ` Keir Fraser
  2010-11-15 19:30         ` Jeremy Fitzhardinge
  0 siblings, 2 replies; 25+ messages in thread
From: Konrad Rzeszutek Wilk @ 2010-11-15 18:15 UTC (permalink / raw)
  To: Keir Fraser
  Cc: Jeremy Fitzhardinge, xen-devel, bruce.edge, gianni.tedesco,
	stefano.stabellini

> >  1). Did not work for me - I am not sure why but I had the hardest time do
> >      hypervisor_populate_physmap - it would just hang the guest.
> 
> For a PV guest you don't need to do any alloc/free/move memory hypercalls.
> You rewrite your own p2m to relocate mfns where you want them in pfn space.
> Then some hypercalls just to update the m2p array to match.

Ok, I can play with that see what fun/havoc I can create.
> 
> >  2). It is much simple to parse the E820 in the Linux kernel than actually
> >      creating new E820 entries in the kernel (hypercall), making a bunch of
> >      hypervisor calls that unmap, then remap the space, filling out the P2M
> >      with INVALID_MFN, and doing all of that before the "real" Linux kernel
> >      actually starts (all would have to be done in xen_start_kernel).
> >      I have a sinking feeling tha the upstream community would not like it
> >      this that much.
> 
> Well it is all quite Xen specific, so I'm surprised.

Oh, there was another reason that I so obvious that I completly forgot. DomU
has no idea where the host PCI hole starts. In most cases it is at 3GB (or even
further up - 3.5GB), but a quick look for 'Allocating PCI resources starting at' 
at Google shows that there are some that start at 1.2G.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC Patch] Support for making an E820 PCI hole in toolstack (xl + xm)
  2010-11-15 18:15       ` Konrad Rzeszutek Wilk
@ 2010-11-15 18:41         ` Keir Fraser
  2010-11-15 19:32           ` Jeremy Fitzhardinge
  2010-11-15 19:30         ` Jeremy Fitzhardinge
  1 sibling, 1 reply; 25+ messages in thread
From: Keir Fraser @ 2010-11-15 18:41 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Jeremy Fitzhardinge, xen-devel, bruce.edge, gianni.tedesco,
	stefano.stabellini

On 15/11/2010 18:15, "Konrad Rzeszutek Wilk" <konrad.wilk@oracle.com> wrote:

>> Well it is all quite Xen specific, so I'm surprised.
> 
> Oh, there was another reason that I so obvious that I completly forgot. DomU
> has no idea where the host PCI hole starts. In most cases it is at 3GB (or
> even
> further up - 3.5GB), but a quick look for 'Allocating PCI resources starting
> at' 
> at Google shows that there are some that start at 1.2G.

Hm, true. We could give you access to XENMEM_machine_memory_map? It's not
really got any big secrets or privileged things in it. :-)

Or, is there much disadvantage, to having a static really big PCI hole? Say
starting at 1GB? The advantage of this would be the ability to hotplug PCI
devices to a domU even across save/restore/migrate -- this may not work so
well if you commit yourself to the hole size of the original host, and the
restore/migrate target host has a bigger hole!

 -- Keir

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC Patch] Support for making an E820 PCI hole in toolstack (xl + xm)
  2010-11-15 18:15       ` Konrad Rzeszutek Wilk
  2010-11-15 18:41         ` Keir Fraser
@ 2010-11-15 19:30         ` Jeremy Fitzhardinge
  1 sibling, 0 replies; 25+ messages in thread
From: Jeremy Fitzhardinge @ 2010-11-15 19:30 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: xen-devel, Keir Fraser, bruce.edge, gianni.tedesco, stefano.stabellini

On 11/15/2010 10:15 AM, Konrad Rzeszutek Wilk wrote:
>>>  1). Did not work for me - I am not sure why but I had the hardest time do
>>>      hypervisor_populate_physmap - it would just hang the guest.
>> For a PV guest you don't need to do any alloc/free/move memory hypercalls.
>> You rewrite your own p2m to relocate mfns where you want them in pfn space.
>> Then some hypercalls just to update the m2p array to match.
> Ok, I can play with that see what fun/havoc I can create.
>>>  2). It is much simple to parse the E820 in the Linux kernel than actually
>>>      creating new E820 entries in the kernel (hypercall), making a bunch of
>>>      hypervisor calls that unmap, then remap the space, filling out the P2M
>>>      with INVALID_MFN, and doing all of that before the "real" Linux kernel
>>>      actually starts (all would have to be done in xen_start_kernel).
>>>      I have a sinking feeling tha the upstream community would not like it
>>>      this that much.
>> Well it is all quite Xen specific, so I'm surprised.
> Oh, there was another reason that I so obvious that I completly forgot. DomU
> has no idea where the host PCI hole starts. In most cases it is at 3GB (or even
> further up - 3.5GB), but a quick look for 'Allocating PCI resources starting at' 
> at Google shows that there are some that start at 1.2G.

Yes, that's the main reason I think it should be in the toolstack.  The
domain doesn't know whether the PCI hole is necessary or not, and its
too early for it to poke around in xenstore to look for passed devices
or anything.  It could conservatively always reserve a 1G hole at 3G,
but that seems too pessimistic and is a waste of sub-4G adress space
which it might otherwise have use for.

If the toolstack can tell it where to make the hole by editing the E820,
then the dom0 and domU cases are very similar.

The question of whether all the pages given by the builder to the domain
should be distributed over the E820 RAM ranges, or should just be
considered ballooned out of the holes (which is what currently happens)
is pretty much orthogonal.

    J

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC Patch] Support for making an E820 PCI hole in toolstack (xl + xm)
  2010-11-15 18:41         ` Keir Fraser
@ 2010-11-15 19:32           ` Jeremy Fitzhardinge
  2010-11-15 19:57             ` Keir Fraser
  0 siblings, 1 reply; 25+ messages in thread
From: Jeremy Fitzhardinge @ 2010-11-15 19:32 UTC (permalink / raw)
  To: Keir Fraser
  Cc: stefano.stabellini, xen-devel, bruce.edge, gianni.tedesco,
	Konrad Rzeszutek Wilk

On 11/15/2010 10:41 AM, Keir Fraser wrote:
>> Oh, there was another reason that I so obvious that I completly forgot. DomU
>> has no idea where the host PCI hole starts. In most cases it is at 3GB (or
>> even
>> further up - 3.5GB), but a quick look for 'Allocating PCI resources starting
>> at' 
>> at Google shows that there are some that start at 1.2G.
> Hm, true. We could give you access to XENMEM_machine_memory_map? It's not
> really got any big secrets or privileged things in it. :-)
>
> Or, is there much disadvantage, to having a static really big PCI hole? Say
> starting at 1GB? The advantage of this would be the ability to hotplug PCI
> devices to a domU even across save/restore/migrate -- this may not work so
> well if you commit yourself to the hole size of the original host, and the
> restore/migrate target host has a bigger hole!

Well, the other question is whether the devices have to have the same
pfn as mfn within the hole.  We're emulating the PCI config space anyway
- couldn't we stick the passthrough PCI space at 3G regardless of where
it is on the real hardware?

    J

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC Patch] Support for making an E820 PCI hole in toolstack (xl + xm)
  2010-11-15 19:32           ` Jeremy Fitzhardinge
@ 2010-11-15 19:57             ` Keir Fraser
  2010-11-15 23:11               ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 25+ messages in thread
From: Keir Fraser @ 2010-11-15 19:57 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: stefano.stabellini, xen-devel, bruce.edge, gianni.tedesco,
	Konrad Rzeszutek Wilk

On 15/11/2010 19:32, "Jeremy Fitzhardinge" <jeremy@goop.org> wrote:

>> Or, is there much disadvantage, to having a static really big PCI hole? Say
>> starting at 1GB? The advantage of this would be the ability to hotplug PCI
>> devices to a domU even across save/restore/migrate -- this may not work so
>> well if you commit yourself to the hole size of the original host, and the
>> restore/migrate target host has a bigger hole!
> 
> Well, the other question is whether the devices have to have the same
> pfn as mfn within the hole.  We're emulating the PCI config space anyway
> - couldn't we stick the passthrough PCI space at 3G regardless of where
> it is on the real hardware?

Well, I don't know. It sounds pretty sensible to me. :-)

Certain virtualisation feature sdisappearing after a save/restore/migrate --
or worsse, becoming unreliable -- would be a bit sad.

 -- Keir

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC Patch] Support for making an E820 PCI hole in toolstack (xl + xm)
  2010-11-15 19:57             ` Keir Fraser
@ 2010-11-15 23:11               ` Konrad Rzeszutek Wilk
  2010-11-16  1:06                 ` Jeremy Fitzhardinge
  2010-11-16  7:40                 ` Keir Fraser
  0 siblings, 2 replies; 25+ messages in thread
From: Konrad Rzeszutek Wilk @ 2010-11-15 23:11 UTC (permalink / raw)
  To: Keir Fraser
  Cc: Jeremy Fitzhardinge, xen-devel, bruce.edge, gianni.tedesco,
	stefano.stabellini

On Mon, Nov 15, 2010 at 07:57:47PM +0000, Keir Fraser wrote:
> On 15/11/2010 19:32, "Jeremy Fitzhardinge" <jeremy@goop.org> wrote:
> 
> >> Or, is there much disadvantage, to having a static really big PCI hole? Say
> >> starting at 1GB? The advantage of this would be the ability to hotplug PCI
> >> devices to a domU even across save/restore/migrate -- this may not work so
> >> well if you commit yourself to the hole size of the original host, and the
> >> restore/migrate target host has a bigger hole!
> > 
> > Well, the other question is whether the devices have to have the same
> > pfn as mfn within the hole.  We're emulating the PCI config space anyway
> > - couldn't we stick the passthrough PCI space at 3G regardless of where
> > it is on the real hardware?

Your thinking is that on the Linux side, any of the pfns that are within
those System RAM gaps (irregardless if they are above or below 4GB) would
be consultated during PTE creation/lookup (xen_pte_val..). 

And if those PFNs are within those System RAM gaps, we would store the 
MFN in the P2M list and instead of doing:
   val = ((pteval_t)pfn << PAGE_SHIFT) | flags

we would actually do mfn = pfn_to_mfn(pfn) and stick on the _PAGE_IOMAP flag.

And  example patch (compiled tested, not tested any other way) attached at the
end of this email.

How does that work on the Xen side? Does the hypervisor depend on the pages
that belong to the DOM_IO domain to have a INVALID_MFN value in the mfn_list?

We do make the PTE that refer to physical devices to be the DOM_IO domain..


> 
> Well, I don't know. It sounds pretty sensible to me. :-)
> 
> Certain virtualisation feature sdisappearing after a save/restore/migrate --
> or worsse, becoming unreliable -- would be a bit sad.

So having the option of the PCI hole being passed through, and giving
the tools the value (pci_hole) would mean we could migrate an SR-IOV type
device from one machine to another. Constructing the PCI hole using the
XENMEM_machine_memory_map could generate different E820 for the two guests, which
would be indeed a bit sad.


--- the patch ---

diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index 50dc626..96a08ef 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -699,7 +699,7 @@ static bool xen_page_pinned(void *ptr)
 
 static bool xen_iomap_pte(pte_t pte)
 {
-	return pte_flags(pte) & _PAGE_IOMAP;
+	return xen_pfn_is_pci(pte_mfn(pte));
 }
 
 void xen_set_domain_pte(pte_t *ptep, pte_t pteval, unsigned domid)
@@ -801,11 +801,6 @@ void set_pte_mfn(unsigned long vaddr, unsigned long mfn, pgprot_t flags)
 void xen_set_pte_at(struct mm_struct *mm, unsigned long addr,
 		    pte_t *ptep, pte_t pteval)
 {
-	if (xen_iomap_pte(pteval)) {
-		xen_set_iomap_pte(ptep, pteval);
-		goto out;
-	}
-
 	ADD_STATS(set_pte_at, 1);
 //	ADD_STATS(set_pte_at_pinned, xen_page_pinned(ptep));
 	ADD_STATS(set_pte_at_current, mm == current->mm);
@@ -889,19 +884,6 @@ static pteval_t pte_pfn_to_mfn(pteval_t val)
 	return val;
 }
 
-static pteval_t iomap_pte(pteval_t val)
-{
-	if (val & _PAGE_PRESENT) {
-		unsigned long pfn = (val & PTE_PFN_MASK) >> PAGE_SHIFT;
-		pteval_t flags = val & PTE_FLAGS_MASK;
-
-		/* We assume the pte frame number is a MFN, so
-		   just use it as-is. */
-		val = ((pteval_t)pfn << PAGE_SHIFT) | flags;
-	}
-
-	return val;
-}
 
 pteval_t xen_pte_val(pte_t pte)
 {
@@ -913,8 +895,8 @@ pteval_t xen_pte_val(pte_t pte)
 		pteval = (pteval & ~_PAGE_PAT) | _PAGE_PWT;
 	}
 
-	if (xen_initial_domain() && (pteval & _PAGE_IOMAP))
-		return pteval;
+	if (xen_pfn_is_pci(pte_mfn(pte)))
+		pteval |= _PAGE_IOMAP;
 
 	return pte_mfn_to_pfn(pteval);
 }
@@ -974,13 +956,14 @@ pte_t xen_make_pte(pteval_t pte)
 	 * mappings are just dummy local mappings to keep other
 	 * parts of the kernel happy.
 	 */
-	if (unlikely(pte & _PAGE_IOMAP) &&
-	    (xen_initial_domain() || addr >= ISA_END_ADDRESS)) {
-		pte = iomap_pte(pte);
-	} else {
+	if ((unlikely(pte & _PAGE_IOMAP) &&
+	    (xen_initial_domain() || addr >= ISA_END_ADDRESS)) ||
+	    (unlikely(xen_pfn_is_pci(PFN_UP(addr)))))
+		pte |=  _PAGE_IOMAP;
+	else
 		pte &= ~_PAGE_IOMAP;
-		pte = pte_pfn_to_mfn(pte);
-	}
+
+	pte = pte_pfn_to_mfn(pte);
 
 	return native_make_pte(pte);
 }
@@ -1037,10 +1020,8 @@ void xen_set_pud(pud_t *ptr, pud_t val)
 
 void xen_set_pte(pte_t *ptep, pte_t pte)
 {
-	if (xen_iomap_pte(pte)) {
+	if (xen_iomap_pte(pte))
 		xen_set_iomap_pte(ptep, pte);
-		return;
-	}
 
 	ADD_STATS(pte_update, 1);
 //	ADD_STATS(pte_update_pinned, xen_page_pinned(ptep));
@@ -1058,10 +1039,8 @@ void xen_set_pte(pte_t *ptep, pte_t pte)
 #ifdef CONFIG_X86_PAE
 void xen_set_pte_atomic(pte_t *ptep, pte_t pte)
 {
-	if (xen_iomap_pte(pte)) {
+	if (xen_iomap_pte(pte))
 		xen_set_iomap_pte(ptep, pte);
-		return;
-	}
 
 	set_64bit((u64 *)ptep, native_pte_val(pte));
 }
diff --git a/arch/x86/xen/setup.c b/arch/x86/xen/setup.c
index 5a1f22d..bb424e3 100644
--- a/arch/x86/xen/setup.c
+++ b/arch/x86/xen/setup.c
@@ -196,6 +196,31 @@ unsigned long xen_find_max_pfn(void)
 	xen_raw_printk("E820 max_pfn = %ld (nr_pages: %ld)\n", max_pfn, xen_start_info->nr_pages);
 	return max_pfn;
 }
+
+int xen_pfn_is_pci(unsigned long pfn)
+{
+	static struct e820entry map[E820MAX] __initdata;
+	int rc, op, i;
+	struct xen_memory_map memmap;
+	unsigned long long addr = PFN_PHYS(pfn);
+	memmap.nr_entries = E820MAX;
+	set_xen_guest_handle(memmap.buffer, map);
+
+	op = xen_initial_domain() ?
+		XENMEM_machine_memory_map :
+		XENMEM_memory_map;
+	rc = HYPERVISOR_memory_op(op, &memmap);
+	BUG_ON(rc);
+
+	for (i = 0; i < memmap.nr_entries; i++) {
+		unsigned long long end = map[i].addr + map[i].size;
+		if (map[i].type != E820_RAM)
+			continue;
+		if (addr >= map[i].addr && addr <= end)
+			return 0;
+	}
+	return 1;
+}
 /**
  * machine_specific_memory_setup - Hook for machine specific memory setup.
  **/
diff --git a/include/xen/xen-ops.h b/include/xen/xen-ops.h
index eee2045..f859b04 100644
--- a/include/xen/xen-ops.h
+++ b/include/xen/xen-ops.h
@@ -31,4 +31,6 @@ int xen_remap_domain_mfn_range(struct vm_area_struct *vma,
 extern phys_addr_t xen_extra_mem_start;
 unsigned long xen_find_max_pfn(void);
 
+int xen_pfn_is_pci(unsigned long pfn);
+
 #endif /* INCLUDE_XEN_OPS_H */

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [RFC Patch] Support for making an E820 PCI hole in toolstack (xl + xm)
  2010-11-15 23:11               ` Konrad Rzeszutek Wilk
@ 2010-11-16  1:06                 ` Jeremy Fitzhardinge
  2010-11-16  9:26                   ` Ian Campbell
  2010-11-16  7:40                 ` Keir Fraser
  1 sibling, 1 reply; 25+ messages in thread
From: Jeremy Fitzhardinge @ 2010-11-16  1:06 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: xen-devel, Keir Fraser, bruce.edge, gianni.tedesco, stefano.stabellini

On 11/15/2010 03:11 PM, Konrad Rzeszutek Wilk wrote:
> On Mon, Nov 15, 2010 at 07:57:47PM +0000, Keir Fraser wrote:
>> On 15/11/2010 19:32, "Jeremy Fitzhardinge" <jeremy@goop.org> wrote:
>>
>>>> Or, is there much disadvantage, to having a static really big PCI hole? Say
>>>> starting at 1GB? The advantage of this would be the ability to hotplug PCI
>>>> devices to a domU even across save/restore/migrate -- this may not work so
>>>> well if you commit yourself to the hole size of the original host, and the
>>>> restore/migrate target host has a bigger hole!
>>> Well, the other question is whether the devices have to have the same
>>> pfn as mfn within the hole.  We're emulating the PCI config space anyway
>>> - couldn't we stick the passthrough PCI space at 3G regardless of where
>>> it is on the real hardware?
> Your thinking is that on the Linux side, any of the pfns that are within
> those System RAM gaps (irregardless if they are above or below 4GB) would
> be consultated during PTE creation/lookup (xen_pte_val..). 
>
> And if those PFNs are within those System RAM gaps, we would store the 
> MFN in the P2M list and instead of doing:
>    val = ((pteval_t)pfn << PAGE_SHIFT) | flags
>
> we would actually do mfn = pfn_to_mfn(pfn) and stick on the _PAGE_IOMAP flag.
>
> And  example patch (compiled tested, not tested any other way) attached at the
> end of this email.

Right, it basically depends on dropping _PAGE_IOMAP and populating the
p2m with the correct mapping for both memory and hardware pages.

> How does that work on the Xen side? Does the hypervisor depend on the pages
> that belong to the DOM_IO domain to have a INVALID_MFN value in the mfn_list?

Xen wouldn't care.  I don't think its necessary to explicitly do a
cross-domain mapping with DOM_IO as we currently do; that's overkill
and/or a misunderstanding on my part.

> We do make the PTE that refer to physical devices to be the DOM_IO domain..

I think Xen will sort that out for itself when presented with a
hardware/device mfn.

>> Well, I don't know. It sounds pretty sensible to me. :-)
>>
>> Certain virtualisation feature sdisappearing after a save/restore/migrate --
>> or worsse, becoming unreliable -- would be a bit sad.
> So having the option of the PCI hole being passed through, and giving
> the tools the value (pci_hole) would mean we could migrate an SR-IOV type
> device from one machine to another. Constructing the PCI hole using the
> XENMEM_machine_memory_map could generate different E820 for the two guests, which
> would be indeed a bit sad.
>
>
> --- the patch ---
>
> diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
> index 50dc626..96a08ef 100644
> --- a/arch/x86/xen/mmu.c
> +++ b/arch/x86/xen/mmu.c
> @@ -699,7 +699,7 @@ static bool xen_page_pinned(void *ptr)
>  
>  static bool xen_iomap_pte(pte_t pte)
>  {
> -	return pte_flags(pte) & _PAGE_IOMAP;
> +	return xen_pfn_is_pci(pte_mfn(pte));
>  }

I think populating the p2m appropriately in advance is better than this
test; this is OK for prototyping I guess, but way to expensive for every
set_pte.

    J

>  
>  void xen_set_domain_pte(pte_t *ptep, pte_t pteval, unsigned domid)
> @@ -801,11 +801,6 @@ void set_pte_mfn(unsigned long vaddr, unsigned long mfn, pgprot_t flags)
>  void xen_set_pte_at(struct mm_struct *mm, unsigned long addr,
>  		    pte_t *ptep, pte_t pteval)
>  {
> -	if (xen_iomap_pte(pteval)) {
> -		xen_set_iomap_pte(ptep, pteval);
> -		goto out;
> -	}
> -
>  	ADD_STATS(set_pte_at, 1);
>  //	ADD_STATS(set_pte_at_pinned, xen_page_pinned(ptep));
>  	ADD_STATS(set_pte_at_current, mm == current->mm);
> @@ -889,19 +884,6 @@ static pteval_t pte_pfn_to_mfn(pteval_t val)
>  	return val;
>  }
>  
> -static pteval_t iomap_pte(pteval_t val)
> -{
> -	if (val & _PAGE_PRESENT) {
> -		unsigned long pfn = (val & PTE_PFN_MASK) >> PAGE_SHIFT;
> -		pteval_t flags = val & PTE_FLAGS_MASK;
> -
> -		/* We assume the pte frame number is a MFN, so
> -		   just use it as-is. */
> -		val = ((pteval_t)pfn << PAGE_SHIFT) | flags;
> -	}
> -
> -	return val;
> -}
>  
>  pteval_t xen_pte_val(pte_t pte)
>  {
> @@ -913,8 +895,8 @@ pteval_t xen_pte_val(pte_t pte)
>  		pteval = (pteval & ~_PAGE_PAT) | _PAGE_PWT;
>  	}
>  
> -	if (xen_initial_domain() && (pteval & _PAGE_IOMAP))
> -		return pteval;
> +	if (xen_pfn_is_pci(pte_mfn(pte)))
> +		pteval |= _PAGE_IOMAP;
>  
>  	return pte_mfn_to_pfn(pteval);
>  }
> @@ -974,13 +956,14 @@ pte_t xen_make_pte(pteval_t pte)
>  	 * mappings are just dummy local mappings to keep other
>  	 * parts of the kernel happy.
>  	 */
> -	if (unlikely(pte & _PAGE_IOMAP) &&
> -	    (xen_initial_domain() || addr >= ISA_END_ADDRESS)) {
> -		pte = iomap_pte(pte);
> -	} else {
> +	if ((unlikely(pte & _PAGE_IOMAP) &&
> +	    (xen_initial_domain() || addr >= ISA_END_ADDRESS)) ||
> +	    (unlikely(xen_pfn_is_pci(PFN_UP(addr)))))
> +		pte |=  _PAGE_IOMAP;
> +	else
>  		pte &= ~_PAGE_IOMAP;
> -		pte = pte_pfn_to_mfn(pte);
> -	}
> +
> +	pte = pte_pfn_to_mfn(pte);
>  
>  	return native_make_pte(pte);
>  }
> @@ -1037,10 +1020,8 @@ void xen_set_pud(pud_t *ptr, pud_t val)
>  
>  void xen_set_pte(pte_t *ptep, pte_t pte)
>  {
> -	if (xen_iomap_pte(pte)) {
> +	if (xen_iomap_pte(pte))
>  		xen_set_iomap_pte(ptep, pte);
> -		return;
> -	}
>  
>  	ADD_STATS(pte_update, 1);
>  //	ADD_STATS(pte_update_pinned, xen_page_pinned(ptep));
> @@ -1058,10 +1039,8 @@ void xen_set_pte(pte_t *ptep, pte_t pte)
>  #ifdef CONFIG_X86_PAE
>  void xen_set_pte_atomic(pte_t *ptep, pte_t pte)
>  {
> -	if (xen_iomap_pte(pte)) {
> +	if (xen_iomap_pte(pte))
>  		xen_set_iomap_pte(ptep, pte);
> -		return;
> -	}
>  
>  	set_64bit((u64 *)ptep, native_pte_val(pte));
>  }
> diff --git a/arch/x86/xen/setup.c b/arch/x86/xen/setup.c
> index 5a1f22d..bb424e3 100644
> --- a/arch/x86/xen/setup.c
> +++ b/arch/x86/xen/setup.c
> @@ -196,6 +196,31 @@ unsigned long xen_find_max_pfn(void)
>  	xen_raw_printk("E820 max_pfn = %ld (nr_pages: %ld)\n", max_pfn, xen_start_info->nr_pages);
>  	return max_pfn;
>  }
> +
> +int xen_pfn_is_pci(unsigned long pfn)
> +{
> +	static struct e820entry map[E820MAX] __initdata;
> +	int rc, op, i;
> +	struct xen_memory_map memmap;
> +	unsigned long long addr = PFN_PHYS(pfn);
> +	memmap.nr_entries = E820MAX;
> +	set_xen_guest_handle(memmap.buffer, map);
> +
> +	op = xen_initial_domain() ?
> +		XENMEM_machine_memory_map :
> +		XENMEM_memory_map;
> +	rc = HYPERVISOR_memory_op(op, &memmap);
> +	BUG_ON(rc);
> +
> +	for (i = 0; i < memmap.nr_entries; i++) {
> +		unsigned long long end = map[i].addr + map[i].size;
> +		if (map[i].type != E820_RAM)
> +			continue;
> +		if (addr >= map[i].addr && addr <= end)
> +			return 0;
> +	}
> +	return 1;
> +}
>  /**
>   * machine_specific_memory_setup - Hook for machine specific memory setup.
>   **/
> diff --git a/include/xen/xen-ops.h b/include/xen/xen-ops.h
> index eee2045..f859b04 100644
> --- a/include/xen/xen-ops.h
> +++ b/include/xen/xen-ops.h
> @@ -31,4 +31,6 @@ int xen_remap_domain_mfn_range(struct vm_area_struct *vma,
>  extern phys_addr_t xen_extra_mem_start;
>  unsigned long xen_find_max_pfn(void);
>  
> +int xen_pfn_is_pci(unsigned long pfn);
> +
>  #endif /* INCLUDE_XEN_OPS_H */
>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC Patch] Support for making an E820 PCI hole in toolstack (xl + xm)
  2010-11-15 23:11               ` Konrad Rzeszutek Wilk
  2010-11-16  1:06                 ` Jeremy Fitzhardinge
@ 2010-11-16  7:40                 ` Keir Fraser
  1 sibling, 0 replies; 25+ messages in thread
From: Keir Fraser @ 2010-11-16  7:40 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Jeremy Fitzhardinge, xen-devel, bruce.edge, gianni.tedesco,
	stefano.stabellini

On 15/11/2010 23:11, "Konrad Rzeszutek Wilk" <konrad.wilk@oracle.com> wrote:

>> Well, I don't know. It sounds pretty sensible to me. :-)
>> 
>> Certain virtualisation feature sdisappearing after a save/restore/migrate --
>> or worsse, becoming unreliable -- would be a bit sad.
> 
> So having the option of the PCI hole being passed through, and giving
> the tools the value (pci_hole) would mean we could migrate an SR-IOV type
> device from one machine to another. Constructing the PCI hole using the
> XENMEM_machine_memory_map could generate different E820 for the two guests,
> which
> would be indeed a bit sad.

SR-IOV is a nice example, but even assuming the user does not have pass-thru
devices attached during migrate, or save/restore, it would still be nice if
they could have the pci-hotplug-passthru facility available to them fully
working both before and after such an event.

 -- Keir

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC Patch] Support for making an E820 PCI hole in toolstack (xl + xm)
  2010-11-16  1:06                 ` Jeremy Fitzhardinge
@ 2010-11-16  9:26                   ` Ian Campbell
  2010-11-16  9:52                     ` Keir Fraser
  2010-11-16 15:50                     ` Konrad Rzeszutek Wilk
  0 siblings, 2 replies; 25+ messages in thread
From: Ian Campbell @ 2010-11-16  9:26 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: xen-devel, Keir Fraser, Konrad Rzeszutek Wilk,
	Stefano Stabellini, bruce.edge, Gianni, Tedesco

On Tue, 2010-11-16 at 01:06 +0000, Jeremy Fitzhardinge wrote:
> 
> > How does that work on the Xen side? Does the hypervisor depend on
> the pages
> > that belong to the DOM_IO domain to have a INVALID_MFN value in the
> mfn_list?
> 
> Xen wouldn't care.  I don't think its necessary to explicitly do a
> cross-domain mapping with DOM_IO as we currently do; that's overkill
> and/or a misunderstanding on my part.
> 
> > We do make the PTE that refer to physical devices to be the DOM_IO
> domain..
> 
> I think Xen will sort that out for itself when presented with a
> hardware/device mfn. 

My main concern would be with save/restore code will canonicalise all
the MFNs in the page tables back into PFNs and then convert back to MFNs
on the other side, which is likely to go pretty wrong on one end of the
other unless the save restore code is aware of which MFNs are device
MFNs and which are actual memory. I'm not sure there is any way it can
tell.

** scrobbles around in xc_domain_save.c **

Hrmm... The MFN_IS_IN_PSEUDOPHYS_MAP macro might have some impact on
this issue in some way (depending on what the m2p contains for DOM_IO
owned pages) but I don't think it actually fixes anything. I don't see
anything else which would make this work... Best case as it stands
AFAICT is that MFN_IS_IN_PSEUDOPHYS_MAP causes device mappings to get
zapped requiring the kernel to reinstate them on restore. Which isn't so
bad I guess.

On an unrelated note I think if we do go down the route of having the
guest kernel punch the holes itself and such we should do so iff
XENMEM_memory_map returns either ENOSYS or nr_entries == 1 to leave open
the possibility of cunning tricks on the tools side in the future.

Ian.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC Patch] Support for making an E820 PCI hole in toolstack (xl + xm)
  2010-11-16  9:26                   ` Ian Campbell
@ 2010-11-16  9:52                     ` Keir Fraser
  2010-11-16 10:02                       ` Ian Campbell
  2010-11-16 15:50                     ` Konrad Rzeszutek Wilk
  1 sibling, 1 reply; 25+ messages in thread
From: Keir Fraser @ 2010-11-16  9:52 UTC (permalink / raw)
  To: Ian Campbell, Jeremy Fitzhardinge
  Cc: xen-devel, Stefano Stabellini, bruce.edge, Gianni Tedesco,
	Konrad Rzeszutek Wilk

On 16/11/2010 09:26, "Ian Campbell" <Ian.Campbell@citrix.com> wrote:

>>> We do make the PTE that refer to physical devices to be the DOM_IO
>> domain..
>> 
>> I think Xen will sort that out for itself when presented with a
>> hardware/device mfn.
> 
> My main concern would be with save/restore code will canonicalise all
> the MFNs in the page tables back into PFNs and then convert back to MFNs
> on the other side, which is likely to go pretty wrong on one end of the
> other unless the save restore code is aware of which MFNs are device
> MFNs and which are actual memory. I'm not sure there is any way it can
> tell.

The right answer is probably to refuse save/restore/migrate when devices are
passed through. It's somewhere between very hard and very nuts to attempt
that in general. For example, even with SR-IOV, we've only been talking
about it so far for NICs, and then in terms of having a Solarflare-like
acceleration abstraction allowing us to step off of SR-IOV for at least the
duration of the critical bit of the save/restore.

A sensible first goal would simply be to be able to do PCI passthrough both
before and after a s/r/m across reasonaly heterogenous hardware, but not
attempt to be able to maintain such a device passthru *during* the s/r/m.

 -- Keir

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC Patch] Support for making an E820 PCI hole in toolstack (xl + xm)
  2010-11-16  9:52                     ` Keir Fraser
@ 2010-11-16 10:02                       ` Ian Campbell
  2010-11-16 10:11                         ` Keir Fraser
  0 siblings, 1 reply; 25+ messages in thread
From: Ian Campbell @ 2010-11-16 10:02 UTC (permalink / raw)
  To: Keir Fraser
  Cc: Jeremy Fitzhardinge, xen-devel, Konrad Rzeszutek Wilk,
	Stabellini, bruce.edge

On Tue, 2010-11-16 at 09:52 +0000, Keir Fraser wrote:
> On 16/11/2010 09:26, "Ian Campbell" <Ian.Campbell@citrix.com> wrote:
> 
> >>> We do make the PTE that refer to physical devices to be the DOM_IO
> >> domain..
> >> 
> >> I think Xen will sort that out for itself when presented with a
> >> hardware/device mfn.
> > 
> > My main concern would be with save/restore code will canonicalise all
> > the MFNs in the page tables back into PFNs and then convert back to MFNs
> > on the other side, which is likely to go pretty wrong on one end of the
> > other unless the save restore code is aware of which MFNs are device
> > MFNs and which are actual memory. I'm not sure there is any way it can
> > tell.
> 
> The right answer is probably to refuse save/restore/migrate when devices are
> passed through.

Absolutely. 

However we are talking about setting up a 1-1 mapping in the P2M region
corresponding to the PCI hole at guest boot and preserving that until
such a time as a device is plugged in, which may be after a migration. 

I don't think it matters that no device is passed through at the time of
the migration, in this configuration we still need arrange for the
relevant P2M entries to be correct after the migration (or at least
before the device gets plugged in, perhaps we can leave holes and only
establish the 1-1 p2m on demand in pcifront?).

So long as this configuration doesn't cause the save/restore code to go
mad it's something we can likely fixup in the guest on restore. My worry
is that the save/restore code will just barf before we get that
opportunity...

Ian.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC Patch] Support for making an E820 PCI hole in toolstack (xl + xm)
  2010-11-16 10:02                       ` Ian Campbell
@ 2010-11-16 10:11                         ` Keir Fraser
  2010-11-16 18:01                           ` Jeremy Fitzhardinge
  0 siblings, 1 reply; 25+ messages in thread
From: Keir Fraser @ 2010-11-16 10:11 UTC (permalink / raw)
  To: Ian Campbell
  Cc: Jeremy Fitzhardinge, xen-devel, Stefano Stabellini,
	Konrad Rzeszutek Wilk, bruce.edge, Gianni Tedesco

On 16/11/2010 10:02, "Ian Campbell" <Ian.Campbell@eu.citrix.com> wrote:

>> The right answer is probably to refuse save/restore/migrate when devices are
>> passed through.
> 
> Absolutely. 
> 
> However we are talking about setting up a 1-1 mapping in the P2M region
> corresponding to the PCI hole at guest boot and preserving that until
> such a time as a device is plugged in, which may be after a migration.
> 
> I don't think it matters that no device is passed through at the time of
> the migration, in this configuration we still need arrange for the
> relevant P2M entries to be correct after the migration (or at least
> before the device gets plugged in, perhaps we can leave holes and only
> establish the 1-1 p2m on demand in pcifront?).

Leave the hole empty and populate on demand when devices are passed through
would seem sensible.

 -- Keir

> So long as this configuration doesn't cause the save/restore code to go
> mad it's something we can likely fixup in the guest on restore. My worry
> is that the save/restore code will just barf before we get that
> opportunity...

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC Patch] Support for making an E820 PCI hole in toolstack (xl + xm)
  2010-11-16  9:26                   ` Ian Campbell
  2010-11-16  9:52                     ` Keir Fraser
@ 2010-11-16 15:50                     ` Konrad Rzeszutek Wilk
  2010-11-17 14:23                       ` Ian Campbell
  1 sibling, 1 reply; 25+ messages in thread
From: Konrad Rzeszutek Wilk @ 2010-11-16 15:50 UTC (permalink / raw)
  To: Ian Campbell
  Cc: Jeremy Fitzhardinge, xen-devel, Keir Fraser, Stefano Stabellini,
	bruce.edge, Gianni, Tedesco

disclaimer:
This email got a bit lengthy - so make sure you got a cup of coffee when you read this.

> On an unrelated note I think if we do go down the route of having the
> guest kernel punch the holes itself and such we should do so iff
> XENMEM_memory_map returns either ENOSYS or nr_entries == 1 to leave open

When would that actually happen? Is that return value returned when the
hypervisor is not implementing it (what version was that implemented this)?

> the possibility of cunning tricks on the tools side in the future.

<shuders>

I think we have three options in regards to this RFC patch I posted:
 1). Continue with this and have the toolstack punch the PCI hole. It would
     fill the PCI hole area with INVALID_MFN. The toolstack determines where
     the PCI hole starts.
 2). Do this in the guest where the guest calls both XENMEM_machine_memory_map and
     XENMEM_memory_map to get an idea of the host side PCI hole and set it up.
     Requires changes in hypervisor to allow non-privileged PV guest to make
     XENMEM_machine_memory_map call. Linux kernel decides where PCI hole starts and
     the PCI hole is filled with INVALID_MFN.
 3). Make unconditionally a PCI hole, starting at 3GB. PCI hole filled with INVALID_MFN.
 4). Another one I didn't think of?

For all of those cases when devices show up we populate on demand the P2M array
with the MFNs). For the first two proposals the BARs we read of
the PCI devices are going to be written to the P2M array as identity (so
mfn_list[0xc0000] == 0xc0000). Code has not been written.

For the third proposal, we would have non-identity mappings in the P2M array, as
during the migration we could move from a device with BARs of 0xc0000 to 0x20000.
So mfn_list[0xc0000] = 0x20000.

But for the third case I am unsure how we would get the "real" MFNs. We initially get
the BARs via 0xcf8 calls and if we don't filter them, it gets to ioremap function.
Say the host side BAR is at 0x20000, and our PCI hole starts at 0xc0000. The ioremap
gets called with 0x20000, and in its E820 that region is 'System RAM'.

        last_pfn = last_addr >> PAGE_SHIFT;
        for (pfn = phys_addr >> PAGE_SHIFT; pfn <= last_pfn; pfn++) {
                int is_ram = page_is_ram(pfn);

                if (is_ram && pfn_valid(pfn) && !PageReserved(pfn_to_page(pfn)))
                        return NULL;
                WARN_ON_ONCE(is_ram);
        }   

Ugh, and it will think (correctly) that it falls within RAM.

If we filter the 0xcf8 calls, which we can do the Xen PCI backend case, we can then
provide BARs that always start at 0xC0000. But that does not help the PV guest to
know the "real" MFNs which it needs so it can program the P2M array. So the Xen
PCI front would have to do this - which it could, thought it adds a complexity to it.

We also need to make all of this works with Domain zero, and here 1) or 2) can
easily be used as the Xen hypervisor has given us the E820 nicely peppered with holes.
(I wonder, what happens if dom0 makes a XENMEM_memory_map call - does it get anything?)

If we then go with 3), we would need to instrument the code that reads the BARs so that
it can filter it properly. That would be low-level Linux pci_conf_read and that is not
going happen - so we would have to make the Xen hypervisor be aware of this and when
it traps the in/out provide new BAR values starting at 0xC0000.

I am not comfortable maintaining this filter/keep state code in both the Xen hypervisor
and the Xen PCI front module so I think 3) would not work that well, unless there are
better ways that I have missed?

Back to 1) and 2). Migration would work if we unplug the PCI devices before suspend and
on resume plug them back in - otherwise the PCI BARs might have changed between
migrations. When the guest gets recreated - how does it iterate over the E820 to create
the P2M list? Or is that something that is not done and we just save the P2M list and
restore as-is on the other side? Naturally, since we would unplug the PCI device the
entries in the E820 gaps would be INVALID_MFN...

If we consult the E820 during resume I think doing the PCI hole in the toolstack has
merits - simply b/c the user can set the PCI hole to an arbitrary address that is low
enough (0x2000 say) to cover all of the machines that he/she would migrate too. While
if we do it in the Linux kernel we do not have that information. Even if we don't
consult the E820, the toolstack still has merits - as the PCI hole start address
might be different between the migration machines and we might have started on
a box with the PCI hole being way up (3.9GB) while the other machines might have
at 1.2GB.

The other thing I don't know is how all of this works with 32-bit kernels?

P.S.
I've done the testing of 1) with 64-bit w/ and w/o ballooning and it worked fine.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC Patch] Support for making an E820 PCI hole in toolstack (xl + xm)
  2010-11-16 10:11                         ` Keir Fraser
@ 2010-11-16 18:01                           ` Jeremy Fitzhardinge
  0 siblings, 0 replies; 25+ messages in thread
From: Jeremy Fitzhardinge @ 2010-11-16 18:01 UTC (permalink / raw)
  To: Keir Fraser
  Cc: xen-devel, Stefano Stabellini, Konrad Rzeszutek Wilk, bruce.edge,
	Ian Campbell, Gianni Tedesco

On 11/16/2010 02:11 AM, Keir Fraser wrote:
> On 16/11/2010 10:02, "Ian Campbell" <Ian.Campbell@eu.citrix.com> wrote:
>
>>> The right answer is probably to refuse save/restore/migrate when devices are
>>> passed through.
>> Absolutely. 
>>
>> However we are talking about setting up a 1-1 mapping in the P2M region
>> corresponding to the PCI hole at guest boot and preserving that until
>> such a time as a device is plugged in, which may be after a migration.
>>
>> I don't think it matters that no device is passed through at the time of
>> the migration, in this configuration we still need arrange for the
>> relevant P2M entries to be correct after the migration (or at least
>> before the device gets plugged in, perhaps we can leave holes and only
>> establish the 1-1 p2m on demand in pcifront?).
> Leave the hole empty and populate on demand when devices are passed through
> would seem sensible.

Actually I was originally thinking that the hole would all be
INVALID_MFN but then pfn_to_mfn() would translate that to being an
identity translation.  But that's pretty hacky, and only works if you
actually want identity.  On-demand population of regions is much cleaner.

    J

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC Patch] Support for making an E820 PCI hole in toolstack (xl + xm)
  2010-11-12 23:08 [RFC Patch] Support for making an E820 PCI hole in toolstack (xl + xm) Konrad Rzeszutek Wilk
  2010-11-12 23:16 ` Konrad Rzeszutek Wilk
  2010-11-13  7:40 ` Keir Fraser
@ 2010-11-17 11:14 ` Gianni Tedesco
  2010-11-17 11:43   ` Ian Campbell
  2 siblings, 1 reply; 25+ messages in thread
From: Gianni Tedesco @ 2010-11-17 11:14 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Jeremy Fitzhardinge, xen-devel, Stabellini, bruce.edge

On Fri, 2010-11-12 at 23:08 +0000, Konrad Rzeszutek Wilk wrote:
> Hey guys,
> 
> Attached is an RFC patch for making a PCI hole in the PV guests. This allows
> PV guests(*) with 4GB or more to now properly work with or without
> PCI passthrough cards.
> 
> Previously the Linux kernel would not be able to allocate the PCI region
> underneath the 4GB region as that region was all System RAM. And you would see
> this:
> 
> [    0.000000] PM: Registered nosave memory: 00000000000a0000 - 0000000000100000
> [    0.000000] PCI: Warning: Cannot find a gap in the 32bit address range
> [    0.000000] PCI: Unassigned devices with 32bit resource registers may break!
> [    0.000000] Allocating PCI resources starting at 100100000 (gap: 100100000:400000)
> 
> 
> This patchset punches an PCI hole in the E820 region and as well fills the P2M properly,
> so that now you can see (*):
> [    0.000000] Allocating PCI resources starting at a0000000 (gap: a0000000:60000000)
> 
> It adds a new option to guest config file, which is "pci_hole". The user can
> specify the PFN number, such as '0xc0000' or in case of using the xl, '1' which
> will automatically figure out the start of the PCI address.
> 
> *: This option requires support in the Linux kernel to actually deal with two
> entries in the E820 map and P2M space filled with ~0.

Doesn't this problem affect dom0 also? How is it to be fixed there?

Gianni

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Re: [RFC Patch] Support for making an E820 PCI hole in toolstack (xl + xm)
  2010-11-17 11:14 ` Gianni Tedesco
@ 2010-11-17 11:43   ` Ian Campbell
  2010-11-17 13:37     ` Gianni Tedesco
  0 siblings, 1 reply; 25+ messages in thread
From: Ian Campbell @ 2010-11-17 11:43 UTC (permalink / raw)
  To: Gianni Tedesco
  Cc: Jeremy Fitzhardinge, xen-devel, Stefano Stabellini, bruce.edge,
	Konrad Rzeszutek Wilk

On Wed, 2010-11-17 at 11:14 +0000, Gianni Tedesco wrote:
> On Fri, 2010-11-12 at 23:08 +0000, Konrad Rzeszutek Wilk wrote:
> > Hey guys,
> > 
> > Attached is an RFC patch for making a PCI hole in the PV guests. This allows
> > PV guests(*) with 4GB or more to now properly work with or without
> > PCI passthrough cards.
> > 
> > Previously the Linux kernel would not be able to allocate the PCI region
> > underneath the 4GB region as that region was all System RAM. And you would see
> > this:
> > 
> > [    0.000000] PM: Registered nosave memory: 00000000000a0000 - 0000000000100000
> > [    0.000000] PCI: Warning: Cannot find a gap in the 32bit address range
> > [    0.000000] PCI: Unassigned devices with 32bit resource registers may break!
> > [    0.000000] Allocating PCI resources starting at 100100000 (gap: 100100000:400000)
> > 
> > 
> > This patchset punches an PCI hole in the E820 region and as well fills the P2M properly,
> > so that now you can see (*):
> > [    0.000000] Allocating PCI resources starting at a0000000 (gap: a0000000:60000000)
> > 
> > It adds a new option to guest config file, which is "pci_hole". The user can
> > specify the PFN number, such as '0xc0000' or in case of using the xl, '1' which
> > will automatically figure out the start of the PCI address.
> > 
> > *: This option requires support in the Linux kernel to actually deal with two
> > entries in the E820 map and P2M space filled with ~0.
> 
> Doesn't this problem affect dom0 also? How is it to be fixed there?

Domain 0 sees the actual host e820 map instead of the paravirtualised
one constructed by the tools for domU. So the BIOS effectively takes the
place of the tools side of this patch for dom0.

Ian.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Re: [RFC Patch] Support for making an E820 PCI hole in toolstack (xl + xm)
  2010-11-17 11:43   ` Ian Campbell
@ 2010-11-17 13:37     ` Gianni Tedesco
  0 siblings, 0 replies; 25+ messages in thread
From: Gianni Tedesco @ 2010-11-17 13:37 UTC (permalink / raw)
  To: Ian Campbell
  Cc: Jeremy Fitzhardinge, xen-devel, Stefano Stabellini, bruce.edge,
	Konrad Rzeszutek Wilk

On Wed, 2010-11-17 at 11:43 +0000, Ian Campbell wrote:
> On Wed, 2010-11-17 at 11:14 +0000, Gianni Tedesco wrote:
> > On Fri, 2010-11-12 at 23:08 +0000, Konrad Rzeszutek Wilk wrote:
> > > Hey guys,
> > > 
> > > Attached is an RFC patch for making a PCI hole in the PV guests. This allows
> > > PV guests(*) with 4GB or more to now properly work with or without
> > > PCI passthrough cards.
> > > 
> > > Previously the Linux kernel would not be able to allocate the PCI region
> > > underneath the 4GB region as that region was all System RAM. And you would see
> > > this:
> > > 
> > > [    0.000000] PM: Registered nosave memory: 00000000000a0000 - 0000000000100000
> > > [    0.000000] PCI: Warning: Cannot find a gap in the 32bit address range
> > > [    0.000000] PCI: Unassigned devices with 32bit resource registers may break!
> > > [    0.000000] Allocating PCI resources starting at 100100000 (gap: 100100000:400000)
> > > 
> > > 
> > > This patchset punches an PCI hole in the E820 region and as well fills the P2M properly,
> > > so that now you can see (*):
> > > [    0.000000] Allocating PCI resources starting at a0000000 (gap: a0000000:60000000)
> > > 
> > > It adds a new option to guest config file, which is "pci_hole". The user can
> > > specify the PFN number, such as '0xc0000' or in case of using the xl, '1' which
> > > will automatically figure out the start of the PCI address.
> > > 
> > > *: This option requires support in the Linux kernel to actually deal with two
> > > entries in the E820 map and P2M space filled with ~0.
> > 
> > Doesn't this problem affect dom0 also? How is it to be fixed there?
> 
> Domain 0 sees the actual host e820 map instead of the paravirtualised
> one constructed by the tools for domU. So the BIOS effectively takes the
> place of the tools side of this patch for dom0.

Hmm, of course, I did think xen "sanitised" it somehow though.

I suppose I just assumed (hoped) this may fix my dom0 boot crash issues.

Gianni

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC Patch] Support for making an E820 PCI hole in toolstack (xl + xm)
  2010-11-16 15:50                     ` Konrad Rzeszutek Wilk
@ 2010-11-17 14:23                       ` Ian Campbell
  0 siblings, 0 replies; 25+ messages in thread
From: Ian Campbell @ 2010-11-17 14:23 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Jeremy Fitzhardinge, xen-devel, Keir Fraser, Stefano Stabellini,
	bruce.edge, Gianni, Gianni Tedesco

On Tue, 2010-11-16 at 15:50 +0000, Konrad Rzeszutek Wilk wrote:
> disclaimer:
> This email got a bit lengthy - so make sure you got a cup of coffee when you read this.
> 
> > On an unrelated note I think if we do go down the route of having the
> > guest kernel punch the holes itself and such we should do so iff
> > XENMEM_memory_map returns either ENOSYS or nr_entries == 1 to leave open
> 
> When would that actually happen? Is that return value returned when the
> hypervisor is not implementing it (what version was that implemented this)?

-ENOSYS implies an older Xen which does not have this interface.
Currently this causes the guest to create a fake one-entry e820 covering
0..nr_pages, which is what old guests which don't know about the
hypercall do too.

If the hypercall returns an e820 with nr_entries == 1 then this implies
a newer Xen which implements the interface but where the tools have only
poked down a simple one-entry e820 covering 0..nr_pages or possibly
0..max_pages (this is all any existing hypervisor/tools will do).

If the hypervisor returns nr_entries >= 2 then you have some future Xen
which has tools which (think they) know what they are doing and so we
should trust the e820 given to us. Without allowing for this now we will
end up with XENFEAT_tools_provide_a_useful_guest_e820 which would be a
shame!

Ian.

^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2010-11-17 14:23 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-11-12 23:08 [RFC Patch] Support for making an E820 PCI hole in toolstack (xl + xm) Konrad Rzeszutek Wilk
2010-11-12 23:16 ` Konrad Rzeszutek Wilk
2010-11-13  7:40 ` Keir Fraser
2010-11-15 17:03   ` Konrad Rzeszutek Wilk
2010-11-15 17:20     ` Ian Campbell
2010-11-15 17:28       ` Konrad Rzeszutek Wilk
2010-11-15 17:48     ` Keir Fraser
2010-11-15 18:15       ` Konrad Rzeszutek Wilk
2010-11-15 18:41         ` Keir Fraser
2010-11-15 19:32           ` Jeremy Fitzhardinge
2010-11-15 19:57             ` Keir Fraser
2010-11-15 23:11               ` Konrad Rzeszutek Wilk
2010-11-16  1:06                 ` Jeremy Fitzhardinge
2010-11-16  9:26                   ` Ian Campbell
2010-11-16  9:52                     ` Keir Fraser
2010-11-16 10:02                       ` Ian Campbell
2010-11-16 10:11                         ` Keir Fraser
2010-11-16 18:01                           ` Jeremy Fitzhardinge
2010-11-16 15:50                     ` Konrad Rzeszutek Wilk
2010-11-17 14:23                       ` Ian Campbell
2010-11-16  7:40                 ` Keir Fraser
2010-11-15 19:30         ` Jeremy Fitzhardinge
2010-11-17 11:14 ` Gianni Tedesco
2010-11-17 11:43   ` Ian Campbell
2010-11-17 13:37     ` Gianni Tedesco

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.