* [v7][RFC][PATCH 01/13] xen: RMRR fix
@ 2014-10-24  7:34 Tiejun Chen
  2014-10-24  7:34 ` [v7][RFC][PATCH 01/13] introduce XENMEM_reserved_device_memory_map Tiejun Chen
                   ` (14 more replies)
  0 siblings, 15 replies; 180+ messages in thread
From: Tiejun Chen @ 2014-10-24  7:34 UTC (permalink / raw)
  To: JBeulich, tim, konrad.wilk, kevin.tian, yang.z.zhang; +Cc: xen-devel

This patch series tries to resolve the remaining RMRR problems. It is
posted as an RFC to gather comments before refining everything.

The overall scheme is currently as follows:

1. Reconcile guest mmio with RMRR in pci_setup
2. Reconcile guest RAM with RMRR in e820 table

Then, in theory, the guest never accesses any RMRR range.

3. Initialize all RMRR ranges as p2m_access_n in the p2m table:
    gfn:mfn:p2m_access_n

I think we shouldn't set up a 1:1 mapping that exposes RMRR to a guest
that may never have a device assigned. This prevents leaking RMRR ranges.

4. We reset those mappings to 1:1:p2m_mmio_direct:p2m_ram_rw once a
device is assigned.

5. Before a device is actually assigned, any access to an RMRR range may
trigger ept_handle_violation because of p2m_access_n. We then just call
update_guest_eip() and return.

6. After a device is assigned, the guest may maliciously access RMRR
ranges even though we already reserve them in the e820 table. In the
worst case only that device stops working properly. This behavior is the
same as on native hardware, so I think we shouldn't do anything here.

7. It's not necessary to introduce any flag in ept_set_entry.

First of all, the hypervisor and dom0 should be trusted. Users must make
sure they never override a valid RMRR table without checking. So our
original set_identity_p2m_entry() tries to set mappings as follows:

 - gfn space unoccupied -> insert mapping; success.
 - gfn space already occupied by 1:1 RMRR mapping -> do nothing; success.
 - gfn space already occupied by other mapping -> fail.

Now in our case we add a rule:
 - if p2m_access_n is set we also set this mapping.

Another reason is that ept_set_entry is called in many scenarios to
support its own management. I think we shouldn't disturb this mechanism,
and it's also difficult to cover all call sites.

8. We need to consider grouping all devices that share the same RMRR
range, to make sure they are assigned to only one VM.
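
As a rough illustration of the mapping rules listed under point 7, the
decision can be modelled in plain C as below. This is only a sketch: the
p2m_entry layout, the access enum and set_identity_entry() are
hypothetical stand-ins, not the types or functions used in the patches.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified view of one p2m entry (not Xen's real layout). */
enum access { ACCESS_N, ACCESS_RW };

struct p2m_entry {
    bool        valid;   /* is the gfn currently mapped? */
    uint64_t    mfn;     /* machine frame it maps to */
    enum access access;  /* ACCESS_N models p2m_access_n */
};

/*
 * Try to make gfn map 1:1 onto itself.
 * Returns 0 on success, -1 if the gfn is occupied by an unrelated mapping.
 */
static int set_identity_entry(struct p2m_entry *e, uint64_t gfn)
{
    if ( !e->valid || e->access == ACCESS_N )
    {
        /* Unoccupied, or reserved earlier with no access: install 1:1. */
        e->valid = true;
        e->mfn = gfn;
        e->access = ACCESS_RW;
        return 0;
    }

    if ( e->mfn == gfn )
        return 0;        /* already the 1:1 mapping: nothing to do */

    return -1;           /* occupied by another mapping */
}
```

The ACCESS_N branch corresponds to the added rule: an entry that was
only reserved with no access is treated like free space and overwritten.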

----------------------------------------------------------------
Jan Beulich (1):
      introduce XENMEM_reserved_device_memory_map

Tiejun Chen (12):
      tools/libxc: introduce hypercall for xc_reserved_device_memory_map
      tools/libxc: check if modules space is overlapping with reserved device memory
      hvmloader/util: get reserved device memory maps
      hvmloader/mmio: reconcile guest mmio with reserved device memory
      hvmloader/ram: check if guest memory is out of reserved device memory maps
      xen/x86/p2m: introduce p2m_check_reserved_device_memory
      xen/x86/p2m: set p2m_access_n for reserved device memory mapping
      xen/x86/ept: handle reserved device memory in ept_handle_violation
      xen/x86/p2m: introduce set_identity_p2m_entry
      xen:vtd: create RMRR mapping
      xen/vtd: re-enable USB device assignment
      xen/vtd: group assigned device with RMRR

 tools/firmware/hvmloader/e820.c      | 215 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 tools/firmware/hvmloader/pci.c       |  68 +++++++++++++++++++++++++++++++++++++++++++++++++-
 tools/firmware/hvmloader/util.c      |  66 ++++++++++++++++++++++++++++++++++++++++++++++++
 tools/firmware/hvmloader/util.h      |   6 +++++
 tools/libxc/include/xenctrl.h        |   4 +++
 tools/libxc/xc_domain.c              |  29 +++++++++++++++++++++
 tools/libxc/xc_hvm_build_x86.c       | 111 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-----------
 xen/arch/x86/hvm/vmx/vmx.c           |  14 +++++++++++
 xen/arch/x86/mm/p2m.c                |  52 ++++++++++++++++++++++++++++++++++++++
 xen/common/compat/memory.c           |  52 ++++++++++++++++++++++++++++++++++++++
 xen/common/memory.c                  |  49 ++++++++++++++++++++++++++++++++++++
 xen/drivers/passthrough/iommu.c      |  10 ++++++++
 xen/drivers/passthrough/vtd/dmar.c   |  46 +++++++++++++++++++++++++++++++++-
 xen/drivers/passthrough/vtd/dmar.h   |   3 ++-
 xen/drivers/passthrough/vtd/extern.h |   1 +
 xen/drivers/passthrough/vtd/iommu.c  |  93 ++++++++++++++++++++++++++++++++++++++++++++++++++++++--------------
 xen/drivers/passthrough/vtd/utils.c  |   7 ------
 xen/include/asm-x86/p2m.h            |  17 +++++++++++++
 xen/include/public/memory.h          |  24 +++++++++++++++++-
 xen/include/xen/iommu.h              |   4 +++
 xen/include/xlat.lst                 |   3 ++-
 21 files changed, 828 insertions(+), 46 deletions(-)

Thanks
Tiejun


* [v7][RFC][PATCH 01/13] introduce XENMEM_reserved_device_memory_map
  2014-10-24  7:34 [v7][RFC][PATCH 01/13] xen: RMRR fix Tiejun Chen
@ 2014-10-24  7:34 ` Tiejun Chen
  2014-10-24 14:11   ` Jan Beulich
  2014-10-27 13:35   ` Julien Grall
  2014-10-24  7:34 ` [v7][RFC][PATCH 02/13] tools/libxc: introduce hypercall for xc_reserved_device_memory_map Tiejun Chen
                   ` (13 subsequent siblings)
  14 siblings, 2 replies; 180+ messages in thread
From: Tiejun Chen @ 2014-10-24  7:34 UTC (permalink / raw)
  To: JBeulich, tim, konrad.wilk, kevin.tian, yang.z.zhang; +Cc: xen-devel

From: Jan Beulich <jbeulich@suse.com>

This is a prerequisite for punching holes into HVM and PVH guests' P2M
to allow passing through devices that are associated with (on VT-d)
RMRRs.
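
For illustration only, the count-then-fill behaviour of the new sub-op
can be modelled in plain C as below. The names rdm_range, rdm_query,
record_range() and query_ranges() are hypothetical and exist only in
this sketch; the real handlers in the diff copy entries to guest memory
rather than to a local array.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

struct rdm_range { uint64_t start_pfn, nr_pages; };

/* Caller-supplied buffer plus a running count of ranges seen. */
struct rdm_query {
    struct rdm_range *buf;
    unsigned int nr_entries;    /* IN: capacity, OUT: ranges available */
    unsigned int used_entries;
};

/* Per-range callback: store while there is room, but always count. */
static int record_range(uint64_t start, uint64_t nr, void *ctxt)
{
    struct rdm_query *q = ctxt;

    if ( q->used_entries < q->nr_entries )
    {
        q->buf[q->used_entries].start_pfn = start;
        q->buf[q->used_entries].nr_pages = nr;
    }
    ++q->used_entries;

    return 0;
}

/* Walk every range, then report the total and whether the buffer was short. */
static int query_ranges(const struct rdm_range *all, size_t n,
                        struct rdm_query *q)
{
    int rc = 0;
    size_t i;

    q->used_entries = 0;
    for ( i = 0; i < n && !rc; i++ )
        rc = record_range(all[i].start_pfn, all[i].nr_pages, q);

    if ( !rc && q->nr_entries < q->used_entries )
        rc = -ENOBUFS;
    q->nr_entries = q->used_entries;

    return rc;
}
```

Because the count keeps running after the buffer is full, a caller that
gets -ENOBUFS learns the required size from nr_entries and can retry.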

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Tiejun Chen <tiejun.chen@intel.com>
---
 xen/common/compat/memory.c           | 52 ++++++++++++++++++++++++++++++++++++
 xen/common/memory.c                  | 49 +++++++++++++++++++++++++++++++++
 xen/drivers/passthrough/iommu.c      | 10 +++++++
 xen/drivers/passthrough/vtd/dmar.c   | 17 ++++++++++++
 xen/drivers/passthrough/vtd/extern.h |  1 +
 xen/drivers/passthrough/vtd/iommu.c  |  1 +
 xen/include/public/memory.h          | 24 ++++++++++++++++-
 xen/include/xen/iommu.h              |  4 +++
 xen/include/xlat.lst                 |  3 ++-
 9 files changed, 159 insertions(+), 2 deletions(-)

diff --git a/xen/common/compat/memory.c b/xen/common/compat/memory.c
index 43d02bc..01b9f6e 100644
--- a/xen/common/compat/memory.c
+++ b/xen/common/compat/memory.c
@@ -16,6 +16,35 @@ CHECK_TYPE(domid);
 
 CHECK_mem_access_op;
 
+struct get_reserved_device_memory {
+    struct compat_mem_reserved_device_memory_map map;
+    unsigned int used_entries;
+};
+
+static int get_reserved_device_memory(xen_pfn_t start,
+                                      xen_ulong_t nr, void *ctxt)
+{
+    struct get_reserved_device_memory *grdm = ctxt;
+
+    if ( grdm->used_entries < grdm->map.nr_entries )
+    {
+        struct compat_mem_reserved_device_memory rdm = {
+            .start_pfn = start, .nr_pages = nr
+        };
+
+        if ( rdm.start_pfn != start || rdm.nr_pages != nr )
+            return -ERANGE;
+
+        if ( __copy_to_compat_offset(grdm->map.buffer, grdm->used_entries,
+                                     &rdm, 1) )
+            return -EFAULT;
+    }
+
+    ++grdm->used_entries;
+
+    return 0;
+}
+
 int compat_memory_op(unsigned int cmd, XEN_GUEST_HANDLE_PARAM(void) compat)
 {
     int split, op = cmd & MEMOP_CMD_MASK;
@@ -273,6 +302,29 @@ int compat_memory_op(unsigned int cmd, XEN_GUEST_HANDLE_PARAM(void) compat)
             break;
         }
 
+#ifdef HAS_PASSTHROUGH
+        case XENMEM_reserved_device_memory_map:
+        {
+            struct get_reserved_device_memory grdm;
+
+            if ( copy_from_guest(&grdm.map, compat, 1) ||
+                 !compat_handle_okay(grdm.map.buffer, grdm.map.nr_entries) )
+                return -EFAULT;
+
+            grdm.used_entries = 0;
+            rc = iommu_get_reserved_device_memory(get_reserved_device_memory,
+                                                  &grdm);
+
+            if ( !rc && grdm.map.nr_entries < grdm.used_entries )
+                rc = -ENOBUFS;
+            grdm.map.nr_entries = grdm.used_entries;
+            if ( __copy_to_guest(compat, &grdm.map, 1) )
+                rc = -EFAULT;
+
+            return rc;
+        }
+#endif
+
         default:
             return compat_arch_memory_op(cmd, compat);
         }
diff --git a/xen/common/memory.c b/xen/common/memory.c
index cc36e39..51a32a8 100644
--- a/xen/common/memory.c
+++ b/xen/common/memory.c
@@ -692,6 +692,32 @@ out:
     return rc;
 }
 
+struct get_reserved_device_memory {
+    struct xen_mem_reserved_device_memory_map map;
+    unsigned int used_entries;
+};
+
+static int get_reserved_device_memory(xen_pfn_t start,
+                                      xen_ulong_t nr, void *ctxt)
+{
+    struct get_reserved_device_memory *grdm = ctxt;
+
+    if ( grdm->used_entries < grdm->map.nr_entries )
+    {
+        struct xen_mem_reserved_device_memory rdm = {
+            .start_pfn = start, .nr_pages = nr
+        };
+
+        if ( __copy_to_guest_offset(grdm->map.buffer, grdm->used_entries,
+                                    &rdm, 1) )
+            return -EFAULT;
+    }
+
+    ++grdm->used_entries;
+
+    return 0;
+}
+
 long do_memory_op(unsigned long cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
 {
     struct domain *d;
@@ -1101,6 +1127,29 @@ long do_memory_op(unsigned long cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
         break;
     }
 
+#ifdef HAS_PASSTHROUGH
+    case XENMEM_reserved_device_memory_map:
+    {
+        struct get_reserved_device_memory grdm;
+
+        if ( copy_from_guest(&grdm.map, arg, 1) ||
+             !guest_handle_okay(grdm.map.buffer, grdm.map.nr_entries) )
+            return -EFAULT;
+
+        grdm.used_entries = 0;
+        rc = iommu_get_reserved_device_memory(get_reserved_device_memory,
+                                              &grdm);
+
+        if ( !rc && grdm.map.nr_entries < grdm.used_entries )
+            rc = -ENOBUFS;
+        grdm.map.nr_entries = grdm.used_entries;
+        if ( __copy_to_guest(arg, &grdm.map, 1) )
+            rc = -EFAULT;
+
+        break;
+    }
+#endif
+
     default:
         rc = arch_memory_op(cmd, arg);
         break;
diff --git a/xen/drivers/passthrough/iommu.c b/xen/drivers/passthrough/iommu.c
index cc12735..7c17e8d 100644
--- a/xen/drivers/passthrough/iommu.c
+++ b/xen/drivers/passthrough/iommu.c
@@ -344,6 +344,16 @@ void iommu_crash_shutdown(void)
     iommu_enabled = iommu_intremap = 0;
 }
 
+int iommu_get_reserved_device_memory(iommu_grdm_t *func, void *ctxt)
+{
+    const struct iommu_ops *ops = iommu_get_ops();
+
+    if ( !iommu_enabled || !ops->get_reserved_device_memory )
+        return 0;
+
+    return ops->get_reserved_device_memory(func, ctxt);
+}
+
 bool_t iommu_has_feature(struct domain *d, enum iommu_feature feature)
 {
     const struct hvm_iommu *hd = domain_hvm_iommu(d);
diff --git a/xen/drivers/passthrough/vtd/dmar.c b/xen/drivers/passthrough/vtd/dmar.c
index 1152c3a..141e735 100644
--- a/xen/drivers/passthrough/vtd/dmar.c
+++ b/xen/drivers/passthrough/vtd/dmar.c
@@ -893,3 +893,20 @@ int platform_supports_x2apic(void)
     unsigned int mask = ACPI_DMAR_INTR_REMAP | ACPI_DMAR_X2APIC_OPT_OUT;
     return cpu_has_x2apic && ((dmar_flags & mask) == ACPI_DMAR_INTR_REMAP);
 }
+
+int intel_iommu_get_reserved_device_memory(iommu_grdm_t *func, void *ctxt)
+{
+    struct acpi_rmrr_unit *rmrr;
+    int rc = 0;
+
+    list_for_each_entry(rmrr, &acpi_rmrr_units, list)
+    {
+        rc = func(PFN_DOWN(rmrr->base_address),
+                  PFN_UP(rmrr->end_address) - PFN_DOWN(rmrr->base_address),
+                  ctxt);
+        if ( rc )
+            break;
+    }
+
+    return rc;
+}
diff --git a/xen/drivers/passthrough/vtd/extern.h b/xen/drivers/passthrough/vtd/extern.h
index 5524dba..f9ee9b0 100644
--- a/xen/drivers/passthrough/vtd/extern.h
+++ b/xen/drivers/passthrough/vtd/extern.h
@@ -75,6 +75,7 @@ int domain_context_mapping_one(struct domain *domain, struct iommu *iommu,
                                u8 bus, u8 devfn, const struct pci_dev *);
 int domain_context_unmap_one(struct domain *domain, struct iommu *iommu,
                              u8 bus, u8 devfn);
+int intel_iommu_get_reserved_device_memory(iommu_grdm_t *func, void *ctxt);
 
 unsigned int io_apic_read_remap_rte(unsigned int apic, unsigned int reg);
 void io_apic_write_remap_rte(unsigned int apic,
diff --git a/xen/drivers/passthrough/vtd/iommu.c b/xen/drivers/passthrough/vtd/iommu.c
index 1c52981..efd3390 100644
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -2491,6 +2491,7 @@ const struct iommu_ops intel_iommu_ops = {
     .crash_shutdown = vtd_crash_shutdown,
     .iotlb_flush = intel_iommu_iotlb_flush,
     .iotlb_flush_all = intel_iommu_iotlb_flush_all,
+    .get_reserved_device_memory = intel_iommu_get_reserved_device_memory,
     .dump_p2m_table = vtd_dump_p2m_table,
 };
 
diff --git a/xen/include/public/memory.h b/xen/include/public/memory.h
index f021958..4534215 100644
--- a/xen/include/public/memory.h
+++ b/xen/include/public/memory.h
@@ -573,7 +573,29 @@ struct vnuma_topology_info {
 typedef struct vnuma_topology_info vnuma_topology_info_t;
 DEFINE_XEN_GUEST_HANDLE(vnuma_topology_info_t);
 
-/* Next available subop number is 27 */
+/*
+ * For legacy reasons, some devices must be configured with special memory
+ * regions to function correctly.  The guest must avoid using any of these
+ * regions.
+ */
+#define XENMEM_reserved_device_memory_map   27
+struct xen_mem_reserved_device_memory {
+    xen_pfn_t start_pfn;
+    xen_ulong_t nr_pages;
+};
+typedef struct xen_mem_reserved_device_memory xen_mem_reserved_device_memory_t;
+DEFINE_XEN_GUEST_HANDLE(xen_mem_reserved_device_memory_t);
+
+struct xen_mem_reserved_device_memory_map {
+    /* IN/OUT */
+    unsigned int nr_entries;
+    /* OUT */
+    XEN_GUEST_HANDLE(xen_mem_reserved_device_memory_t) buffer;
+};
+typedef struct xen_mem_reserved_device_memory_map xen_mem_reserved_device_memory_map_t;
+DEFINE_XEN_GUEST_HANDLE(xen_mem_reserved_device_memory_map_t);
+
+/* Next available subop number is 28 */
 
 #endif /* __XEN_PUBLIC_MEMORY_H__ */
 
diff --git a/xen/include/xen/iommu.h b/xen/include/xen/iommu.h
index 8eb764a..409f6f8 100644
--- a/xen/include/xen/iommu.h
+++ b/xen/include/xen/iommu.h
@@ -120,6 +120,8 @@ void iommu_dt_domain_destroy(struct domain *d);
 
 struct page_info;
 
+typedef int iommu_grdm_t(xen_pfn_t start, xen_ulong_t nr, void *ctxt);
+
 struct iommu_ops {
     int (*init)(struct domain *d);
     void (*hwdom_init)(struct domain *d);
@@ -156,12 +158,14 @@ struct iommu_ops {
     void (*crash_shutdown)(void);
     void (*iotlb_flush)(struct domain *d, unsigned long gfn, unsigned int page_count);
     void (*iotlb_flush_all)(struct domain *d);
+    int (*get_reserved_device_memory)(iommu_grdm_t *, void *);
     void (*dump_p2m_table)(struct domain *d);
 };
 
 void iommu_suspend(void);
 void iommu_resume(void);
 void iommu_crash_shutdown(void);
+int iommu_get_reserved_device_memory(iommu_grdm_t *, void *);
 
 void iommu_share_p2m_table(struct domain *d);
 
diff --git a/xen/include/xlat.lst b/xen/include/xlat.lst
index 234b668..2e60a0d 100644
--- a/xen/include/xlat.lst
+++ b/xen/include/xlat.lst
@@ -60,7 +60,8 @@
 !	memory_exchange			memory.h
 !	memory_map			memory.h
 !	memory_reservation		memory.h
-?	mem_access_op		memory.h
+?	mem_access_op			memory.h
+!	mem_reserved_device_memory_map	memory.h
 !	pod_target			memory.h
 !	remove_from_physmap		memory.h
 ?	physdev_eoi			physdev.h
-- 
1.9.1


* [v7][RFC][PATCH 02/13] tools/libxc: introduce hypercall for xc_reserved_device_memory_map
  2014-10-24  7:34 [v7][RFC][PATCH 01/13] xen: RMRR fix Tiejun Chen
  2014-10-24  7:34 ` [v7][RFC][PATCH 01/13] introduce XENMEM_reserved_device_memory_map Tiejun Chen
@ 2014-10-24  7:34 ` Tiejun Chen
  2014-10-24  7:34 ` [v7][RFC][PATCH 03/13] tools/libxc: check if modules space is overlapping with reserved device memory Tiejun Chen
                   ` (12 subsequent siblings)
  14 siblings, 0 replies; 180+ messages in thread
From: Tiejun Chen @ 2014-10-24  7:34 UTC (permalink / raw)
  To: JBeulich, tim, konrad.wilk, kevin.tian, yang.z.zhang; +Cc: xen-devel

Introduce xc_reserved_device_memory_map(), a libxc wrapper for the
XENMEM_reserved_device_memory_map hypercall.

Signed-off-by: Tiejun Chen <tiejun.chen@intel.com>
---
 tools/libxc/include/xenctrl.h |  4 ++++
 tools/libxc/xc_domain.c       | 29 +++++++++++++++++++++++++++++
 2 files changed, 33 insertions(+)

diff --git a/tools/libxc/include/xenctrl.h b/tools/libxc/include/xenctrl.h
index 564e187..b4a60ba 100644
--- a/tools/libxc/include/xenctrl.h
+++ b/tools/libxc/include/xenctrl.h
@@ -1288,6 +1288,10 @@ int xc_domain_set_memory_map(xc_interface *xch,
 int xc_get_machine_memory_map(xc_interface *xch,
                               struct e820entry entries[],
                               uint32_t max_entries);
+
+int xc_reserved_device_memory_map(xc_interface *xch,
+                                  struct xen_mem_reserved_device_memory entries[],
+                                  uint32_t *max_entries);
 #endif
 int xc_domain_set_time_offset(xc_interface *xch,
                               uint32_t domid,
diff --git a/tools/libxc/xc_domain.c b/tools/libxc/xc_domain.c
index a9bcd4a..5f63b6c 100644
--- a/tools/libxc/xc_domain.c
+++ b/tools/libxc/xc_domain.c
@@ -659,6 +659,35 @@ int xc_domain_set_memory_map(xc_interface *xch,
 
     return rc;
 }
+
+int xc_reserved_device_memory_map(xc_interface *xch,
+                                  struct xen_mem_reserved_device_memory entries[],
+                                  uint32_t *max_entries)
+{
+    int rc;
+    struct xen_mem_reserved_device_memory_map memmap = {
+        .nr_entries = *max_entries
+    };
+    DECLARE_HYPERCALL_BOUNCE(entries,
+                             sizeof(struct xen_mem_reserved_device_memory) *
+                             *max_entries, XC_HYPERCALL_BUFFER_BOUNCE_OUT);
+
+    if ( xc_hypercall_bounce_pre(xch, entries) )
+        return -1;
+
+    set_xen_guest_handle(memmap.buffer, entries);
+
+    rc = do_memory_op(xch, XENMEM_reserved_device_memory_map,
+                      &memmap, sizeof(memmap));
+
+    xc_hypercall_bounce_post(xch, entries);
+
+    if ( errno == ENOBUFS )
+        *max_entries = memmap.nr_entries;
+
+    return rc ? rc : memmap.nr_entries;
+}
+
 int xc_get_machine_memory_map(xc_interface *xch,
                               struct e820entry entries[],
                               uint32_t max_entries)
-- 
1.9.1


* [v7][RFC][PATCH 03/13] tools/libxc: check if modules space is overlapping with reserved device memory
  2014-10-24  7:34 [v7][RFC][PATCH 01/13] xen: RMRR fix Tiejun Chen
  2014-10-24  7:34 ` [v7][RFC][PATCH 01/13] introduce XENMEM_reserved_device_memory_map Tiejun Chen
  2014-10-24  7:34 ` [v7][RFC][PATCH 02/13] tools/libxc: introduce hypercall for xc_reserved_device_memory_map Tiejun Chen
@ 2014-10-24  7:34 ` Tiejun Chen
  2014-10-24  7:34 ` [v7][RFC][PATCH 04/13] hvmloader/util: get reserved device memory maps Tiejun Chen
                   ` (11 subsequent siblings)
  14 siblings, 0 replies; 180+ messages in thread
From: Tiejun Chen @ 2014-10-24  7:34 UTC (permalink / raw)
  To: JBeulich, tim, konrad.wilk, kevin.tian, yang.z.zhang; +Cc: xen-devel

If reserved device memory overlaps guest RAM, it may also overlap the
modules space, so we need to check the modules range against reserved
device memory as well.
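
The check boils down to a half-open interval overlap test. A minimal
sketch follows, assuming 4 KiB pages; ranges_overlap() and
rdm_hits_modules() are hypothetical helper names used only here, and
the modules range is passed as [mstart, mend) rather than as a start
and a size.

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT 12  /* assumption: 4 KiB pages */

/* True when [a_start, a_end) and [b_start, b_end) share at least one byte. */
static bool ranges_overlap(uint64_t a_start, uint64_t a_end,
                           uint64_t b_start, uint64_t b_end)
{
    return a_start < b_end && b_start < a_end;
}

/* Does a reserved range, given in page frames, collide with [mstart, mend)? */
static bool rdm_hits_modules(uint64_t start_pfn, uint64_t nr_pages,
                             uint64_t mstart, uint64_t mend)
{
    uint64_t rdm_start = start_pfn << PAGE_SHIFT;
    uint64_t rdm_end = rdm_start + (nr_pages << PAGE_SHIFT);

    return ranges_overlap(rdm_start, rdm_end, mstart, mend);
}
```

Ranges that merely touch (one ends where the other starts) do not count
as overlapping, which matches the <= and >= comparisons in
check_mmio_hole().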

Signed-off-by: Tiejun Chen <tiejun.chen@intel.com>
---
 tools/libxc/xc_hvm_build_x86.c | 111 +++++++++++++++++++++++++++++++++++------
 1 file changed, 96 insertions(+), 15 deletions(-)

diff --git a/tools/libxc/xc_hvm_build_x86.c b/tools/libxc/xc_hvm_build_x86.c
index c81a25b..cb590ee 100644
--- a/tools/libxc/xc_hvm_build_x86.c
+++ b/tools/libxc/xc_hvm_build_x86.c
@@ -54,9 +54,99 @@
 
 #define VGA_HOLE_SIZE (0x20)
 
+/* Record reserved device memory. */
+static struct xen_mem_reserved_device_memory *xmrdm = NULL;
+
+/*
+ * Check whether there exists mmio hole in the specified memory range.
+ * Returns 1 if exists, else returns 0.
+ */
+static int check_mmio_hole(uint64_t start, uint64_t memsize,
+                           uint64_t mmio_start, uint64_t mmio_size)
+{
+    if ( start + memsize <= mmio_start || start >= mmio_start + mmio_size )
+        return 0;
+    else
+        return 1;
+}
+
+/* Getting all reserved device memory map info. */
+static int xc_get_reserved_device_memory_map(xc_interface *xch)
+{
+    static uint32_t nr_entries = 0;
+    int rc = 0;
+
+    if ( !xmrdm )
+    {
+        /* Assume we have one entry if not enough we'll expand.*/
+        nr_entries = 1;
+        if ( (xmrdm = malloc(nr_entries *
+                             sizeof(xen_mem_reserved_device_memory_t))) == NULL )
+        {
+            PERROR("Could not allocate memory for map.");
+            return -1;
+        }
+
+        rc = xc_reserved_device_memory_map(xch, xmrdm, &nr_entries);
+        if ( rc < 0 )
+        {
+            switch ( errno )
+            {
+            case ENOENT: /* reserved device memory doesn't exist. */
+                rc = 0;
+                break;
+            case ENOBUFS: /* Need more space */
+                free(xmrdm);
+                /* Now we need more space to map all reserved device memory maps. */
+                if ( (xmrdm = malloc(nr_entries *
+                                     sizeof(xen_mem_reserved_device_memory_t))) == NULL )
+                {
+                    PERROR("Could not allocate memory.");
+                    return -1;
+                }
+                rc = xc_reserved_device_memory_map(xch, xmrdm, &nr_entries);
+                if ( rc )
+                {
+                    PERROR("Could not get reserved device memory maps.");
+                    free(xmrdm);
+                    return rc;
+                }
+                break;
+            default: /* Failed to get reserved device memory maps. */
+                PERROR("Could not get reserved device memory maps.");
+                return rc;
+            }
+        }
+    }
+
+    return nr_entries;
+}
+
+static int xc_check_modules_space(xc_interface *xch, uint64_t *mstart_out,
+                                  uint64_t *mend_out)
+{
+    unsigned int i = 0;
+    uint64_t rdm_start = 0, rdm_end = 0;
+    int nr_entries = xc_get_reserved_device_memory_map(xch);
+
+    for ( i = 0; i < nr_entries; i++ )
+    {
+        rdm_start = xmrdm[i].start_pfn << XC_PAGE_SHIFT;
+        rdm_end = rdm_start + (xmrdm[i].nr_pages << XC_PAGE_SHIFT);
+
+        /* Just use check_mmio_hole() to check modules ranges. */
+        if ( check_mmio_hole(rdm_start, xmrdm[i].nr_pages << XC_PAGE_SHIFT,
+                             *mstart_out, *mend_out - *mstart_out) )
+            return -1;
+    }
+
+    return 0;
+}
+
 static int modules_init(struct xc_hvm_build_args *args,
                         uint64_t vend, struct elf_binary *elf,
-                        uint64_t *mstart_out, uint64_t *mend_out)
+                        uint64_t *mstart_out, uint64_t *mend_out,
+                        xc_interface *xch)
 {
 #define MODULE_ALIGN 1UL << 7
 #define MB_ALIGN     1UL << 20
@@ -80,6 +170,10 @@ static int modules_init(struct xc_hvm_build_args *args,
     if ( *mend_out > vend )    
         return -1;
 
+    /* Is it overlapping with reserved device memory? */
+    if ( xc_check_modules_space(xch, mstart_out, mend_out) )
+        return -1;
+
     if ( args->acpi_module.length != 0 )
         args->acpi_module.guest_addr_out = *mstart_out;
     if ( args->smbios_module.length != 0 )
@@ -226,19 +320,6 @@ static int loadmodules(xc_interface *xch,
     return rc;
 }
 
-/*
- * Check whether there exists mmio hole in the specified memory range.
- * Returns 1 if exists, else returns 0.
- */
-static int check_mmio_hole(uint64_t start, uint64_t memsize,
-                           uint64_t mmio_start, uint64_t mmio_size)
-{
-    if ( start + memsize <= mmio_start || start >= mmio_start + mmio_size )
-        return 0;
-    else
-        return 1;
-}
-
 static int setup_guest(xc_interface *xch,
                        uint32_t dom, struct xc_hvm_build_args *args,
                        char *image, unsigned long image_size)
@@ -282,7 +363,7 @@ static int setup_guest(xc_interface *xch,
         goto error_out;
     }
 
-    if ( modules_init(args, v_end, &elf, &m_start, &m_end) != 0 )
+    if ( modules_init(args, v_end, &elf, &m_start, &m_end, xch) != 0 )
     {
         ERROR("Insufficient space to load modules.");
         goto error_out;
-- 
1.9.1


* [v7][RFC][PATCH 04/13] hvmloader/util: get reserved device memory maps
  2014-10-24  7:34 [v7][RFC][PATCH 01/13] xen: RMRR fix Tiejun Chen
                   ` (2 preceding siblings ...)
  2014-10-24  7:34 ` [v7][RFC][PATCH 03/13] tools/libxc: check if modules space is overlapping with reserved device memory Tiejun Chen
@ 2014-10-24  7:34 ` Tiejun Chen
  2014-10-24 14:22   ` Jan Beulich
  2014-10-24 14:27   ` Jan Beulich
  2014-10-24  7:34 ` [v7][RFC][PATCH 05/13] hvmloader/mmio: reconcile guest mmio with reserved device memory Tiejun Chen
                   ` (10 subsequent siblings)
  14 siblings, 2 replies; 180+ messages in thread
From: Tiejun Chen @ 2014-10-24  7:34 UTC (permalink / raw)
  To: JBeulich, tim, konrad.wilk, kevin.tian, yang.z.zhang; +Cc: xen-devel

We need the reserved device memory map in several places, so provide
one common function to fetch it.
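
The fetch follows a try-small-then-grow pattern, sketched below outside
of hvmloader. Everything here is an assumption for illustration: the
rdm_query_fn hook and fake_query() stand in for the hypercall, and the
sketch uses malloc()/free() from the C library rather than hvmloader's
own allocator.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

struct rdm_range { uint64_t start_pfn, nr_pages; };

/*
 * Hypothetical query hook: fills up to *nr entries, stores the number of
 * ranges available in *nr, and returns 0, or -ENOBUFS if buf was too small.
 */
typedef int rdm_query_fn(struct rdm_range *buf, uint32_t *nr);

/* Stand-in for the hypercall: pretends the host has three reserved ranges. */
static int fake_query(struct rdm_range *buf, uint32_t *nr)
{
    static const struct rdm_range host[] = { {0x10, 2}, {0x40, 1}, {0x80, 4} };
    uint32_t i, total = sizeof(host) / sizeof(host[0]);
    int rc = *nr < total ? -ENOBUFS : 0;

    for ( i = 0; i < total && i < *nr; i++ )
        buf[i] = host[i];
    *nr = total;

    return rc;
}

/*
 * Fetch all ranges, starting with room for one and growing once on demand.
 * Returns the number of ranges (>= 0) with *out owning the array, or a
 * negative errno value with *out set to NULL.
 */
static int fetch_rdm_map(rdm_query_fn *query, struct rdm_range **out)
{
    uint32_t nr = 1;
    struct rdm_range *map = malloc(nr * sizeof(*map));
    int rc;

    *out = NULL;
    if ( !map )
        return -ENOMEM;

    rc = query(map, &nr);
    if ( rc == -ENOBUFS )
    {
        /* nr now holds the required count: reallocate and ask again. */
        free(map);
        map = malloc(nr * sizeof(*map));
        if ( !map )
            return -ENOMEM;
        rc = query(map, &nr);
    }

    if ( rc )
    {
        free(map);
        return rc;
    }

    *out = map;
    return (int)nr;
}
```

A caller would invoke fetch_rdm_map(fake_query, &map) once and cache
the result, which is the role the static nr_entries plays in the patch.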

Signed-off-by: Tiejun Chen <tiejun.chen@intel.com>
---
 tools/firmware/hvmloader/util.c | 66 +++++++++++++++++++++++++++++++++++++++++
 tools/firmware/hvmloader/util.h |  6 ++++
 2 files changed, 72 insertions(+)

diff --git a/tools/firmware/hvmloader/util.c b/tools/firmware/hvmloader/util.c
index 80d822f..aa2c17c 100644
--- a/tools/firmware/hvmloader/util.c
+++ b/tools/firmware/hvmloader/util.c
@@ -828,6 +828,72 @@ int hpet_exists(unsigned long hpet_base)
     return ((hpet_id >> 16) == 0x8086);
 }
 
+int get_reserved_device_memory_map(struct xen_mem_reserved_device_memory entries[],
+                                   uint32_t *max_entries)
+{
+    int rc;
+    struct xen_mem_reserved_device_memory_map memmap = {
+        .nr_entries = *max_entries
+    };
+
+    set_xen_guest_handle(memmap.buffer, entries);
+
+    rc = hypercall_memory_op(XENMEM_reserved_device_memory_map, &memmap);
+    if (rc == -ENOBUFS)
+        *max_entries = memmap.nr_entries;
+
+    return rc;
+}
+
+/* Getting all reserved device memory map info in case of hvmloader. */
+int hvm_get_reserved_device_memory_map(void)
+{
+    static uint32_t nr_entries = 0;
+    int rc = 0;
+
+    if ( !rdm_map )
+    {
+        /* Assume we have one entry if not enough we'll expand.*/
+        nr_entries = 1;
+        rdm_map = mem_alloc(nr_entries *
+                            sizeof(struct xen_mem_reserved_device_memory), 0);
+        if ( !rdm_map )
+        {
+            printf("No space to get reserved dev memory maps!\n");
+            return rc;
+        }
+
+        rc = get_reserved_device_memory_map(rdm_map, &nr_entries);
+        if ( rc == -ENOBUFS )
+        {
+            rdm_map = mem_alloc(nr_entries *
+                                sizeof(struct xen_mem_reserved_device_memory),
+                                0);
+            if ( rdm_map )
+            {
+                rc = get_reserved_device_memory_map(rdm_map, &nr_entries);
+                if ( rc )
+                {
+                    printf("Could not get reserved dev memory info on domain\n");
+                    return rc;
+                }
+            }
+            else
+            {
+                printf("No space to get reserved dev memory maps!\n");
+                return rc;
+            }
+        }
+        else if ( rc )
+        {
+            printf("Could not get reserved dev memory info on domain\n");
+            return rc;
+        }
+    }
+
+    return nr_entries;
+}
+
 /*
  * Local variables:
  * mode: C
diff --git a/tools/firmware/hvmloader/util.h b/tools/firmware/hvmloader/util.h
index a70e4aa..81abc44 100644
--- a/tools/firmware/hvmloader/util.h
+++ b/tools/firmware/hvmloader/util.h
@@ -241,6 +241,12 @@ int build_e820_table(struct e820entry *e820,
                      unsigned int bios_image_base);
 void dump_e820_table(struct e820entry *e820, unsigned int nr);
 
+#include <xen/memory.h>
+#define ENOBUFS     105 /* No buffer space available */
+struct xen_mem_reserved_device_memory *rdm_map;
+int get_reserved_device_memory_map(struct xen_mem_reserved_device_memory entries[],
+                                   uint32_t *max_entries);
+int hvm_get_reserved_device_memory_map(void);
 #ifndef NDEBUG
 void perform_tests(void);
 #else
-- 
1.9.1


* [v7][RFC][PATCH 05/13] hvmloader/mmio: reconcile guest mmio with reserved device memory
  2014-10-24  7:34 [v7][RFC][PATCH 01/13] xen: RMRR fix Tiejun Chen
                   ` (3 preceding siblings ...)
  2014-10-24  7:34 ` [v7][RFC][PATCH 04/13] hvmloader/util: get reserved device memory maps Tiejun Chen
@ 2014-10-24  7:34 ` Tiejun Chen
  2014-10-24 14:42   ` Jan Beulich
  2014-10-24  7:34 ` [v7][RFC][PATCH 06/13] hvmloader/ram: check if guest memory is out of reserved device memory maps Tiejun Chen
                   ` (9 subsequent siblings)
  14 siblings, 1 reply; 180+ messages in thread
From: Tiejun Chen @ 2014-10-24  7:34 UTC (permalink / raw)
  To: JBeulich, tim, konrad.wilk, kevin.tian, yang.z.zhang; +Cc: xen-devel

We need to make sure no MMIO allocation overlaps any reserved device
memory (RDM). Here we simply skip all reserved device memory ranges
in the MMIO space.
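
The idea of skipping reserved ranges while placing a BAR can be sketched
on its own as follows. This is not the code in pci_setup(): struct range
and place_bar() are hypothetical, the BAR size is assumed to be a power
of two, and addresses are assumed to stay well below 2^64 so the
arithmetic cannot wrap.

```c
#include <stddef.h>
#include <stdint.h>

struct range { uint64_t start, end; };   /* half-open: [start, end) */

/*
 * Pick the lowest size-aligned base at or above `from` such that
 * [base, base + size) avoids every reserved range and ends at or below
 * `max`. Returns 0 on success, -1 if nothing fits.
 */
static int place_bar(uint64_t from, uint64_t max, uint64_t size,
                     const struct range *rdm, size_t nr, uint64_t *base_out)
{
    uint64_t base = (from + size - 1) & ~(size - 1);
    size_t i = 0;

    while ( i < nr )
    {
        if ( base < rdm[i].end && rdm[i].start < base + size )
        {
            /* Collision: realign just past this range and rescan. */
            base = (rdm[i].end + size - 1) & ~(size - 1);
            i = 0;
            continue;
        }
        i++;
    }

    if ( base > max || size > max - base )
        return -1;

    *base_out = base;
    return 0;
}
```

Recomputing the aligned base from scratch after each collision, instead
of patching a previously computed value, keeps the alignment invariant
simple to reason about.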

Signed-off-by: Tiejun Chen <tiejun.chen@intel.com>
---
 tools/firmware/hvmloader/pci.c | 68 +++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 67 insertions(+), 1 deletion(-)

diff --git a/tools/firmware/hvmloader/pci.c b/tools/firmware/hvmloader/pci.c
index 3712988..e26481c 100644
--- a/tools/firmware/hvmloader/pci.c
+++ b/tools/firmware/hvmloader/pci.c
@@ -37,6 +37,44 @@ uint64_t pci_hi_mem_start = 0, pci_hi_mem_end = 0;
 enum virtual_vga virtual_vga = VGA_none;
 unsigned long igd_opregion_pgbase = 0;
 
+unsigned int need_skip_rmrr = 0;
+
+/*
+ * Check whether there exists mmio hole in the specified memory range.
+ * Returns 1 if exists, else returns 0.
+ */
+static int check_mmio_hole_confliction(uint64_t start, uint64_t memsize,
+                           uint64_t mmio_start, uint64_t mmio_size)
+{
+    if ( start + memsize <= mmio_start || start >= mmio_start + mmio_size )
+        return 0;
+    else
+        return 1;
+}
+
+static int check_reserved_device_memory_map(uint64_t mmio_base,
+                                            uint64_t mmio_max)
+{
+    uint32_t i = 0;
+    uint64_t rdm_start, rdm_end;
+    int nr_entries = -1;
+
+    nr_entries = hvm_get_reserved_device_memory_map();
+
+    for ( i = 0; i < nr_entries; i++ )
+    {
+        rdm_start = rdm_map[i].start_pfn << PAGE_SHIFT;
+        rdm_end = rdm_start + (rdm_map[i].nr_pages << PAGE_SHIFT);
+        if ( check_mmio_hole_confliction(rdm_start, (rdm_end - rdm_start),
+                                         mmio_base, mmio_max - mmio_base) )
+        {
+            need_skip_rmrr++;
+        }
+    }
+
+    return nr_entries;
+}
+
 void pci_setup(void)
 {
     uint8_t is_64bar, using_64bar, bar64_relocate = 0;
@@ -58,7 +96,9 @@ void pci_setup(void)
         uint32_t bar_reg;
         uint64_t bar_sz;
     } *bars = (struct bars *)scratch_start;
-    unsigned int i, nr_bars = 0;
+    unsigned int i, j, nr_bars = 0;
+    int nr_entries = 0;
+    uint64_t rdm_start, rdm_end;
 
     const char *s;
     /*
@@ -309,6 +349,14 @@ void pci_setup(void)
     io_resource.base = 0xc000;
     io_resource.max = 0x10000;
 
+    /* Check low mmio range. */
+    nr_entries = check_reserved_device_memory_map(mem_resource.base,
+                                                  mem_resource.max);
+    /* Check high mmio range. */
+    if ( nr_entries > 0 )
+        nr_entries = check_reserved_device_memory_map(high_mem_resource.base,
+                                                      high_mem_resource.max);
+
     /* Assign iomem and ioport resources in descending order of size. */
     for ( i = 0; i < nr_bars; i++ )
     {
@@ -363,11 +411,29 @@ void pci_setup(void)
             bar_data &= ~PCI_BASE_ADDRESS_IO_MASK;
         }
 
+ reallocate_mmio:
         base = (resource->base  + bar_sz - 1) & ~(uint64_t)(bar_sz - 1);
         bar_data |= (uint32_t)base;
         bar_data_upper = (uint32_t)(base >> 32);
         base += bar_sz;
 
+        if ( need_skip_rmrr )
+        {
+            for ( j = 0; j < nr_entries; j++ )
+            {
+                rdm_start = rdm_map[j].start_pfn << PAGE_SHIFT;
+                rdm_end = rdm_start + (rdm_map[j].nr_pages << PAGE_SHIFT);
+                if ( check_mmio_hole_confliction(rdm_start,
+                                                 (rdm_end - rdm_start),
+                                                 base, bar_sz) )
+                {
+                    resource->base = rdm_end;
+                    need_skip_rmrr--;
+                    goto reallocate_mmio;
+                }
+            }
+        }
+
         if ( (base < resource->base) || (base > resource->max) )
         {
             printf("pci dev %02x:%x bar %02x size "PRIllx": no space for "
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 180+ messages in thread
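
As an illustration of the approach in the hvmloader/mmio patch above, here is a
minimal sketch of an overlap test plus a BAR placement loop that skips reserved
ranges. It is not the code from the patch: struct rdm_range, ranges_overlap(),
align_up() and place_bar() are hypothetical names, ranges are assumed to be
half-open, and BAR sizes are assumed to be powers of two.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one reserved device memory (RDM) range. */
struct rdm_range {
    uint64_t start; /* first byte of the range */
    uint64_t end;   /* one past the last byte (half-open) */
};

/*
 * Two half-open ranges [a_start, a_end) and [b_start, b_end) overlap
 * iff each one starts before the other one ends.
 */
static int ranges_overlap(uint64_t a_start, uint64_t a_end,
                          uint64_t b_start, uint64_t b_end)
{
    return a_start < b_end && b_start < a_end;
}

/* Round addr up to a multiple of size; size must be a power of two. */
static uint64_t align_up(uint64_t addr, uint64_t size)
{
    assert(size != 0 && (size & (size - 1)) == 0);
    return (addr + size - 1) & ~(size - 1);
}

/*
 * Find the lowest naturally aligned base >= from for a BAR of bar_sz
 * bytes that ends at or below max and overlaps none of the nr RDM ranges.
 * Returns 0 if there is no such base (the sketch assumes from > 0).
 */
static uint64_t place_bar(uint64_t from, uint64_t max, uint64_t bar_sz,
                          const struct rdm_range *rdm, size_t nr)
{
    uint64_t base = align_up(from, bar_sz);
    size_t i = 0;

    while ( i < nr )
    {
        if ( ranges_overlap(base, base + bar_sz, rdm[i].start, rdm[i].end) )
        {
            /* Jump past the conflicting range and rescan all ranges. */
            base = align_up(rdm[i].end, bar_sz);
            i = 0;
            continue;
        }
        i++;
    }

    return (base + bar_sz <= max) ? base : 0;
}
```

Rescanning from the first range after every move keeps the sketch independent
of the order in which the ranges are reported; the candidate base only ever
grows, so the loop terminates.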

* [v7][RFC][PATCH 06/13] hvmloader/ram: check if guest memory is out of reserved device memory maps
  2014-10-24  7:34 [v7][RFC][PATCH 01/13] xen: RMRR fix Tiejun Chen
                   ` (4 preceding siblings ...)
  2014-10-24  7:34 ` [v7][RFC][PATCH 05/13] hvmloader/mmio: reconcile guest mmio with reserved device memory Tiejun Chen
@ 2014-10-24  7:34 ` Tiejun Chen
  2014-10-24 14:56   ` Jan Beulich
  2014-10-24  7:34 ` [v7][RFC][PATCH 07/13] xen/x86/p2m: introduce p2m_check_reserved_device_memory Tiejun Chen
                   ` (8 subsequent siblings)
  14 siblings, 1 reply; 180+ messages in thread
From: Tiejun Chen @ 2014-10-24  7:34 UTC (permalink / raw)
  To: JBeulich, tim, konrad.wilk, kevin.tian, yang.z.zhang; +Cc: xen-devel

We need to reserve all reserved device memory (RDM) ranges in the e820
table to avoid any potential conflict with guest memory.

Currently, if we can't insert RDM entries directly, we may need to handle
several kinds of ranges as follows:
a. Fixed ranges -> BUG()
 lowmem_reserved_base ~ 0xA0000: reserved by the BIOS implementation,
 BIOS region,
 RESERVED_MEMBASE ~ 0x100000000
b. RAM or RAM:Hole -> try to reserve

Signed-off-by: Tiejun Chen <tiejun.chen@intel.com>
---
 tools/firmware/hvmloader/e820.c | 215 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 215 insertions(+)

diff --git a/tools/firmware/hvmloader/e820.c b/tools/firmware/hvmloader/e820.c
index 2e05e93..d188e02 100644
--- a/tools/firmware/hvmloader/e820.c
+++ b/tools/firmware/hvmloader/e820.c
@@ -68,12 +68,218 @@ void dump_e820_table(struct e820entry *e820, unsigned int nr)
     }
 }
 
+static unsigned int construct_rdm_e820_maps(unsigned int next_e820_entry_index,
+                                            uint32_t nr_map,
+                                            struct xen_mem_reserved_device_memory *map,
+                                            struct e820entry *e820,
+                                            unsigned int lowmem_reserved_base,
+                                            unsigned int bios_image_base)
+{
+    unsigned int i, j, sum_nr = next_e820_entry_index + nr_map;
+    uint64_t start, end, next_start, rdm_start, rdm_end;
+    uint32_t type;
+    unsigned int insert = 0, do_insert = 0;
+    int err = 0;
+
+ do_real_construct:
+    for ( i = 0; i < nr_map; i++ )
+    {
+        rdm_start = map[i].start_pfn << PAGE_SHIFT;
+        rdm_end = rdm_start + (map[i].nr_pages << PAGE_SHIFT);
+
+        for ( j = 0; j < next_e820_entry_index - 1; j++ )
+        {
+            start = e820[j].addr;
+            end = e820[j].addr + e820[j].size;
+            type = e820[j].type;
+            next_start = e820[j+1].addr;
+
+            /* lowmem_reserved_base-0xA0000: reserved by BIOS implementation. */
+            if ( lowmem_reserved_base < 0xA0000 &&
+                 start == lowmem_reserved_base )
+            {
+                if ( rdm_start >= start && rdm_start <= end )
+                {
+                    err = -1;
+                    break;
+                }
+            }
+
+            /*
+             * BIOS region.
+             */
+            if ( start == bios_image_base )
+            {
+                if ( rdm_start >= start && rdm_start <= end )
+                {
+                    err = -1;
+                    break;
+                }
+            }
+
+            /* The default memory map always occupies one fixed reserved
+             * range: RESERVED_MEMBASE ~ 0x100000000
+             */
+            if ( rdm_start >= RESERVED_MEMBASE &&
+                      rdm_start <= ((uint64_t)1 << 32) )
+            {
+                err = -1;
+                break;
+            }
+
+            /* Just amid those remaining e820 entries. */
+            if ( (rdm_start > end) && (rdm_end < next_start) )
+            {
+                if ( do_insert )
+                {
+                    memmove(&e820[j+2], &e820[j+1],
+                            (sum_nr - j - 1) * sizeof(struct e820entry));
+
+                    /* Then fill RMRR into that entry. */
+                    e820[j+1].addr = rdm_start;
+                    e820[j+1].size = rdm_end - rdm_start;
+                    e820[j+1].type = E820_RESERVED;
+                    next_e820_entry_index++;
+                }
+                insert++;
+            }
+            /* Already at the end. */
+            else if ( (rdm_start > end) && !next_start )
+            {
+                if ( do_insert )
+                {
+                    e820[next_e820_entry_index].addr = rdm_start;
+                    e820[next_e820_entry_index].size = rdm_end - rdm_start;
+                    e820[next_e820_entry_index].type = E820_RESERVED;
+                    next_e820_entry_index++;
+                }
+                insert++;
+            }
+            /* If completely overlap with one RAM range. */
+            else if ( rdm_start == start && rdm_end == end && type == E820_RAM )
+            {
+                if ( do_insert )
+                    e820[j].type = E820_RESERVED;
+                insert++;
+            }
+            /* If we're just aligned with start of one RAM range. */
+            else if ( rdm_start == start && rdm_end < end && type == E820_RAM )
+            {
+                if ( do_insert )
+                {
+                    memmove(&e820[j+1], &e820[j],
+                            (sum_nr - j) * sizeof(struct e820entry));
+
+                    e820[j+1].addr = rdm_end;
+                    e820[j+1].size = e820[j].addr + e820[j].size - rdm_end;
+                    e820[j+1].type = E820_RAM;
+                    next_e820_entry_index++;
+
+                    e820[j].addr = rdm_start;
+                    e820[j].size = rdm_end - rdm_start;
+                    e820[j].type = E820_RESERVED;
+                }
+                insert++;
+            }
+            /* If we're just aligned with end of one RAM range. */
+            else if ( rdm_start > start && rdm_end == end && type == E820_RAM )
+            {
+                if ( do_insert )
+                {
+                    memmove(&e820[j+1], &e820[j],
+                            (sum_nr - j) * sizeof(struct e820entry));
+
+                    e820[j].size = rdm_start - e820[j].addr;
+                    e820[j].type = E820_RAM;
+
+                    e820[j+1].addr = rdm_start;
+                    e820[j+1].size = rdm_end - rdm_start;
+                    e820[j+1].type = E820_RESERVED;
+                    next_e820_entry_index++;
+                }
+                insert++;
+            }
+            /* If we're strictly inside one RAM range. */
+            else if ( rdm_start > start && rdm_end < end && type == E820_RAM )
+            {
+                if ( do_insert )
+                {
+                    memmove(&e820[j+2], &e820[j],
+                            (sum_nr - j) * sizeof(struct e820entry));
+
+                    e820[j+2].addr = rdm_end;
+                    e820[j+2].size = e820[j].addr + e820[j].size - rdm_end;
+                    e820[j+2].type = E820_RAM;
+                    next_e820_entry_index++;
+
+                    e820[j+1].addr = rdm_start;
+                    e820[j+1].size = rdm_end - rdm_start;
+                    e820[j+1].type = E820_RESERVED;
+                    next_e820_entry_index++;
+
+                    e820[j].size = rdm_start - e820[j].addr;
+                    e820[j].type = E820_RAM;
+                }
+                insert++;
+            }
+            /* If we're going last RAM:Hole range */
+            else if ( end < next_start &&
+                      rdm_start > start &&
+                      rdm_end < next_start &&
+                      type == E820_RAM )
+            {
+                if ( do_insert )
+                {
+                    memmove(&e820[j+1], &e820[j],
+                            (sum_nr - j) * sizeof(struct e820entry));
+
+                    e820[j].size = rdm_start - e820[j].addr;
+                    e820[j].type = E820_RAM;
+
+                    e820[j+1].addr = rdm_start;
+                    e820[j+1].size = rdm_end - rdm_start;
+                    e820[j+1].type = E820_RESERVED;
+                    next_e820_entry_index++;
+                }
+                insert++;
+            }
+        }
+    }
+
+    /* These overlap may issue guest can't work well. */
+    if ( err )
+    {
+        printf("Guest can't work with some reserved device memory overlap!\n");
+        BUG();
+    }
+
+    /* Just return if done. */
+    if ( do_insert )
+        return next_e820_entry_index;
+
+    /* Fine to construct RDM mappings into e820. */
+    if ( insert == nr_map )
+    {
+        do_insert = 1;
+        goto do_real_construct;
+    }
+    /* Overlap. */
+    else
+    {
+        printf("RDM overlap with those existing e820 entries!\n");
+        printf("So we don't construct RDM mapping in e820!\n");
+    }
+
+    return next_e820_entry_index;
+}
+
 /* Create an E820 table based on memory parameters provided in hvm_info. */
 int build_e820_table(struct e820entry *e820,
                      unsigned int lowmem_reserved_base,
                      unsigned int bios_image_base)
 {
     unsigned int nr = 0;
+    int nr_entries = 0;
 
     if ( !lowmem_reserved_base )
             lowmem_reserved_base = 0xA0000;
@@ -169,6 +375,15 @@ int build_e820_table(struct e820entry *e820,
         nr++;
     }
 
+    if ( rdm_map )
+    {
+        nr_entries = hvm_get_reserved_device_memory_map();
+        if ( nr_entries > 0 )
+            nr = construct_rdm_e820_maps(nr, nr_entries, rdm_map, e820,
+                                         lowmem_reserved_base,
+                                         bios_image_base);
+    }
+
     return nr;
 }
 
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 180+ messages in thread
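
The "RAM -> try to reserve" cases handled in the hvmloader/ram patch above
(aligned with the start, aligned with the end, or strictly inside a RAM entry)
all amount to carving a reserved range out of one RAM entry. The following is
a minimal sketch of that idea, not the patch's code: struct sk_e820entry,
the SK_E820_* constants and sk_carve_reserved() are hypothetical names, and the
reserved range is assumed to lie entirely within the RAM entry.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical local copies of the usual e820 type values. */
#define SK_E820_RAM      1
#define SK_E820_RESERVED 2

/* Hypothetical stand-in for an e820 entry. */
struct sk_e820entry {
    uint64_t addr;
    uint64_t size;
    uint32_t type;
};

/*
 * Carve the reserved range [rs, re) out of the RAM entry ram and write
 * the resulting one to three entries, in address order, into out.
 * Returns the number of entries written.
 */
static unsigned int sk_carve_reserved(const struct sk_e820entry *ram,
                                      uint64_t rs, uint64_t re,
                                      struct sk_e820entry out[3])
{
    uint64_t start = ram->addr, end = ram->addr + ram->size;
    unsigned int n = 0;

    assert(ram->type == SK_E820_RAM);
    assert(start <= rs && rs < re && re <= end);

    /* RAM left below the reserved range, if any. */
    if ( rs > start )
    {
        out[n].addr = start;
        out[n].size = rs - start;
        out[n].type = SK_E820_RAM;
        n++;
    }

    /* The reserved range itself. */
    out[n].addr = rs;
    out[n].size = re - rs;
    out[n].type = SK_E820_RESERVED;
    n++;

    /* RAM left above the reserved range, if any. */
    if ( re < end )
    {
        out[n].addr = re;
        out[n].size = end - re;
        out[n].type = SK_E820_RAM;
        n++;
    }

    return n;
}
```

Building the replacement entries in a separate array avoids shifting the table
in place; the caller would still have to splice them into the e820 array and
update the entry count.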

* [v7][RFC][PATCH 07/13] xen/x86/p2m: introduce p2m_check_reserved_device_memory
  2014-10-24  7:34 [v7][RFC][PATCH 01/13] xen: RMRR fix Tiejun Chen
                   ` (5 preceding siblings ...)
  2014-10-24  7:34 ` [v7][RFC][PATCH 06/13] hvmloader/ram: check if guest memory is out of reserved device memory maps Tiejun Chen
@ 2014-10-24  7:34 ` Tiejun Chen
  2014-10-24 15:02   ` Jan Beulich
  2014-10-24  7:34 ` [v7][RFC][PATCH 08/13] xen/x86/p2m: set p2m_access_n for reserved device memory mapping Tiejun Chen
                   ` (7 subsequent siblings)
  14 siblings, 1 reply; 180+ messages in thread
From: Tiejun Chen @ 2014-10-24  7:34 UTC (permalink / raw)
  To: JBeulich, tim, konrad.wilk, kevin.tian, yang.z.zhang; +Cc: xen-devel

This helper checks whether a gfn falls within a reserved device memory
range. Later patches in this series use it as a callback.

Signed-off-by: Tiejun Chen <tiejun.chen@intel.com>
---
 xen/include/asm-x86/p2m.h | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/xen/include/asm-x86/p2m.h b/xen/include/asm-x86/p2m.h
index 90ddd15..934324e 100644
--- a/xen/include/asm-x86/p2m.h
+++ b/xen/include/asm-x86/p2m.h
@@ -713,6 +713,19 @@ extern int arch_grant_map_page_identity(struct domain *d, unsigned long frame,
                                  bool_t writeable);
 extern int arch_grant_unmap_page_identity(struct domain *d, unsigned long frame);
 
+/* Check if we are accessing rdm. */
+static inline int p2m_check_reserved_device_memory(xen_pfn_t start,
+                                                   xen_ulong_t nr, void *d)
+{
+    unsigned long *gfn = d;
+    xen_pfn_t end = start + nr;
+
+    if ( *gfn >= start && *gfn <= end )
+        return 1;
+
+    return 0;
+}
+
 #endif /* _XEN_P2M_H */
 
 /*
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 180+ messages in thread
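
A callback-based range check of the kind introduced in the p2m patch above can
be sketched as follows. This is only an illustration under stated assumptions:
sk_pfn_t, struct sk_rdm, sk_gfn_in_range() and sk_for_each_rdm() are
hypothetical, and a range is assumed to cover exactly nr frames starting at
start, so its last frame is start + nr - 1.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned long sk_pfn_t;

/* Callback type: returns non-zero to stop the walk. */
typedef int (*sk_rdm_cb)(sk_pfn_t start, unsigned long nr, void *ctxt);

/* Hypothetical table entry describing one reserved range. */
struct sk_rdm {
    sk_pfn_t start;
    unsigned long nr;
};

/* Returns 1 if the gfn pointed to by ctxt lies in [start, start + nr). */
static int sk_gfn_in_range(sk_pfn_t start, unsigned long nr, void *ctxt)
{
    const unsigned long *gfn = ctxt;

    /* Written as a subtraction so that start + nr cannot wrap around. */
    return *gfn >= start && *gfn - start < nr;
}

/* Call cb for each range until it returns non-zero; return that value. */
static int sk_for_each_rdm(const struct sk_rdm *tbl, size_t n,
                           sk_rdm_cb cb, void *ctxt)
{
    size_t i;

    assert(cb != NULL);

    for ( i = 0; i < n; i++ )
    {
        int rc = cb(tbl[i].start, tbl[i].nr, ctxt);

        if ( rc )
            return rc;
    }

    return 0;
}
```

With a half-open range the frame at start + nr is the first one outside the
range, so it is reported as not reserved.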

* [v7][RFC][PATCH 08/13] xen/x86/p2m: set p2m_access_n for reserved device memory mapping
  2014-10-24  7:34 [v7][RFC][PATCH 01/13] xen: RMRR fix Tiejun Chen
                   ` (6 preceding siblings ...)
  2014-10-24  7:34 ` [v7][RFC][PATCH 07/13] xen/x86/p2m: introduce p2m_check_reserved_device_memory Tiejun Chen
@ 2014-10-24  7:34 ` Tiejun Chen
  2014-10-24 15:11   ` Jan Beulich
  2014-10-24  7:34 ` [v7][RFC][PATCH 09/13] xen/x86/ept: handle reserved device memory in ept_handle_violation Tiejun Chen
                   ` (6 subsequent siblings)
  14 siblings, 1 reply; 180+ messages in thread
From: Tiejun Chen @ 2014-10-24  7:34 UTC (permalink / raw)
  To: JBeulich, tim, konrad.wilk, kevin.tian, yang.z.zhang; +Cc: xen-devel

With shared EPT, or with non-shared EPT but a 1:1 mapping, we
need to set p2m_access_n to make sure that reserved device
memory can't be accessed through any non-IOMMU path.

Signed-off-by: Tiejun Chen <tiejun.chen@intel.com>
---
 xen/arch/x86/mm/p2m.c | 24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)

diff --git a/xen/arch/x86/mm/p2m.c b/xen/arch/x86/mm/p2m.c
index efa49dd..97eb6fd 100644
--- a/xen/arch/x86/mm/p2m.c
+++ b/xen/arch/x86/mm/p2m.c
@@ -686,6 +686,30 @@ guest_physmap_add_entry(struct domain *d, unsigned long gfn,
     /* Now, actually do the two-way mapping */
     if ( mfn_valid(_mfn(mfn)) ) 
     {
+
+        if ( !is_hardware_domain(d) )
+        {
+            rc = iommu_get_reserved_device_memory(p2m_check_reserved_device_memory,
+                                                  &gfn);
+            if ( rc )
+            {
+                /*
+                 * Just set p2m_access_n in case of shared-ept
+                 * or non-shared ept but 1:1 mapping.
+                 */
+                if ( iommu_use_hap_pt(d) ||
+                     (!iommu_use_hap_pt(d) && mfn == gfn) )
+                {
+                    rc = p2m_set_entry(p2m, gfn, _mfn(mfn), page_order, t,
+                                       p2m_access_n);
+                    if ( rc )
+                        gdprintk(XENLOG_WARNING, "set rdm p2m failed: (%#lx)\n",
+                                 gfn);
+                    goto out; /* Failed to update rdm p2m. */
+                }
+            }
+        }
+
         rc = p2m_set_entry(p2m, gfn, _mfn(mfn), page_order, t,
                            p2m->default_access);
         if ( rc )
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 180+ messages in thread

* [v7][RFC][PATCH 09/13] xen/x86/ept: handle reserved device memory in ept_handle_violation
  2014-10-24  7:34 [v7][RFC][PATCH 01/13] xen: RMRR fix Tiejun Chen
                   ` (7 preceding siblings ...)
  2014-10-24  7:34 ` [v7][RFC][PATCH 08/13] xen/x86/p2m: set p2m_access_n for reserved device memory mapping Tiejun Chen
@ 2014-10-24  7:34 ` Tiejun Chen
  2014-10-24  7:34 ` [v7][RFC][PATCH 10/13] xen/x86/p2m: introduce set_identity_p2m_entry Tiejun Chen
                   ` (5 subsequent siblings)
  14 siblings, 0 replies; 180+ messages in thread
From: Tiejun Chen @ 2014-10-24  7:34 UTC (permalink / raw)
  To: JBeulich, tim, konrad.wilk, kevin.tian, yang.z.zhang; +Cc: xen-devel

We always initialize all reserved device memory as p2m_access_n with
shared EPT, or with non-shared EPT but a 1:1 mapping, and we only
allow these entries to be reset when a device with reserved device
memory is assigned. So on any other access we just update eip and return.

Signed-off-by: Tiejun Chen <tiejun.chen@intel.com>
---
 xen/arch/x86/hvm/vmx/vmx.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
index 304aeea..5efec93 100644
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -2398,6 +2398,20 @@ static void ept_handle_violation(unsigned long qualification, paddr_t gpa)
         __trace_var(TRC_HVM_NPF, 0, sizeof(_d), &_d);
     }
 
+    /* If this violation is from reserved device memory range
+     * this means something may maliciously access them since
+     * we always initialize these tables as p2m_access_n unless
+     * in case of device assignment.
+     * So it's not allowed, and we just update eip and return.
+     */
+    ret = iommu_get_reserved_device_memory(p2m_check_reserved_device_memory,
+                                           &gfn);
+    if ( ret )
+    {
+        update_guest_eip();
+        return;
+    }
+
     if ( qualification & EPT_GLA_VALID )
     {
         __vmread(GUEST_LINEAR_ADDRESS, &gla);
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 180+ messages in thread

* [v7][RFC][PATCH 10/13] xen/x86/p2m: introduce set_identity_p2m_entry
  2014-10-24  7:34 [v7][RFC][PATCH 01/13] xen: RMRR fix Tiejun Chen
                   ` (8 preceding siblings ...)
  2014-10-24  7:34 ` [v7][RFC][PATCH 09/13] xen/x86/ept: handle reserved device memory in ept_handle_violation Tiejun Chen
@ 2014-10-24  7:34 ` Tiejun Chen
  2014-10-24  7:34 ` [v7][RFC][PATCH 11/13] xen:vtd: create RMRR mapping Tiejun Chen
                   ` (4 subsequent siblings)
  14 siblings, 0 replies; 180+ messages in thread
From: Tiejun Chen @ 2014-10-24  7:34 UTC (permalink / raw)
  To: JBeulich, tim, konrad.wilk, kevin.tian, yang.z.zhang; +Cc: xen-devel

We will create the RMRR mapping as follows:

If the gfn space is unoccupied or set to p2m_access_n, set the
mapping. If the space is already occupied by a 1:1 RMRR mapping,
do nothing. Anything else should fail.

Signed-off-by: Tiejun Chen <tiejun.chen@intel.com>
---
 xen/arch/x86/mm/p2m.c     | 28 ++++++++++++++++++++++++++++
 xen/include/asm-x86/p2m.h |  4 ++++
 2 files changed, 32 insertions(+)

diff --git a/xen/arch/x86/mm/p2m.c b/xen/arch/x86/mm/p2m.c
index 97eb6fd..80d9918 100644
--- a/xen/arch/x86/mm/p2m.c
+++ b/xen/arch/x86/mm/p2m.c
@@ -882,6 +882,34 @@ int set_mmio_p2m_entry(struct domain *d, unsigned long gfn, mfn_t mfn)
     return set_typed_p2m_entry(d, gfn, mfn, p2m_mmio_direct);
 }
 
+int set_identity_p2m_entry(struct domain *d, unsigned long gfn,
+                           p2m_access_t p2ma)
+{
+    p2m_type_t p2mt;
+    p2m_access_t a;
+    mfn_t mfn;
+    struct p2m_domain *p2m = p2m_get_hostp2m(d);
+    int ret = -EBUSY;
+
+    gfn_lock(p2m, gfn, 0);
+
+    mfn = p2m->get_entry(p2m, gfn, &p2mt, &a, 0, NULL);
+
+    if ( !mfn_valid(mfn) || a == p2m_access_n )
+        ret = p2m_set_entry(p2m, gfn, _mfn(gfn), PAGE_ORDER_4K, p2m_mmio_direct,
+                            p2ma);
+    else if ( mfn_x(mfn) == gfn && p2mt == p2m_mmio_direct && a == p2ma )
+        ret = 0;
+    else
+        printk(XENLOG_G_WARNING
+               "Cannot identity map d%d:%lx, already mapped to %lx.\n",
+               d->domain_id, gfn, mfn_x(mfn));
+
+    gfn_unlock(p2m, gfn, 0);
+
+    return ret;
+}
+
 /* Returns: 0 for success, -errno for failure */
 int clear_mmio_p2m_entry(struct domain *d, unsigned long gfn, mfn_t mfn)
 {
diff --git a/xen/include/asm-x86/p2m.h b/xen/include/asm-x86/p2m.h
index 934324e..72cdd25 100644
--- a/xen/include/asm-x86/p2m.h
+++ b/xen/include/asm-x86/p2m.h
@@ -509,6 +509,10 @@ int p2m_is_logdirty_range(struct p2m_domain *, unsigned long start,
 int set_mmio_p2m_entry(struct domain *d, unsigned long gfn, mfn_t mfn);
 int clear_mmio_p2m_entry(struct domain *d, unsigned long gfn, mfn_t mfn);
 
+/* Set identity addresses in the p2m table (for pass-through) */
+int set_identity_p2m_entry(struct domain *d, unsigned long gfn,
+                           p2m_access_t p2ma);
+
 /* Add foreign mapping to the guest's p2m table. */
 int p2m_add_foreign(struct domain *tdom, unsigned long fgfn,
                     unsigned long gpfn, domid_t foreign_domid);
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 180+ messages in thread
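
The three-way rule described in the set_identity_p2m_entry patch above can be
sketched on a toy table entry. Nothing here is Xen code: struct sk_p2m_entry,
the sk_type and sk_access enums and sk_set_identity() are hypothetical
stand-ins, and locking is left out.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical, heavily simplified entry types and access rights. */
enum sk_type   { SK_TYPE_INVALID, SK_TYPE_RAM, SK_TYPE_MMIO_DIRECT };
enum sk_access { SK_ACCESS_N, SK_ACCESS_RW };

/* Hypothetical stand-in for one p2m entry. */
struct sk_p2m_entry {
    int valid;
    unsigned long mfn;
    enum sk_type type;
    enum sk_access access;
};

/*
 * Apply the rule:
 *  - unoccupied, or occupied but blocked with SK_ACCESS_N -> map 1:1
 *  - already the wanted 1:1 mapping                       -> nothing to do
 *  - anything else                                        -> -EBUSY
 */
static int sk_set_identity(struct sk_p2m_entry *e, unsigned long gfn,
                           enum sk_access want)
{
    assert(e != NULL);

    if ( !e->valid || e->access == SK_ACCESS_N )
    {
        e->valid = 1;
        e->mfn = gfn;
        e->type = SK_TYPE_MMIO_DIRECT;
        e->access = want;
        return 0;
    }

    if ( e->mfn == gfn && e->type == SK_TYPE_MMIO_DIRECT &&
         e->access == want )
        return 0;

    return -EBUSY;
}
```

Returning success for an existing identical mapping makes the operation
idempotent, which matters when several devices refer to the same range.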

* [v7][RFC][PATCH 11/13] xen:vtd: create RMRR mapping
  2014-10-24  7:34 [v7][RFC][PATCH 01/13] xen: RMRR fix Tiejun Chen
                   ` (9 preceding siblings ...)
  2014-10-24  7:34 ` [v7][RFC][PATCH 10/13] xen/x86/p2m: introduce set_identity_p2m_entry Tiejun Chen
@ 2014-10-24  7:34 ` Tiejun Chen
  2014-10-24  7:34 ` [v7][RFC][PATCH 12/13] xen/vtd: re-enable USB device assignment Tiejun Chen
                   ` (3 subsequent siblings)
  14 siblings, 0 replies; 180+ messages in thread
From: Tiejun Chen @ 2014-10-24  7:34 UTC (permalink / raw)
  To: JBeulich, tim, konrad.wilk, kevin.tian, yang.z.zhang; +Cc: xen-devel

intel_iommu_map_page() does nothing if VT-d shares the EPT page table.
So rmrr_identity_mapping() never creates an RMRR mapping, but in some
cases, such as with some GFX drivers, the RMRR still needs to be accessed.

Here we create those RMRR mappings even in the shared EPT case.

Signed-off-by: Tiejun Chen <tiejun.chen@intel.com>
---
 xen/drivers/passthrough/vtd/iommu.c | 13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/xen/drivers/passthrough/vtd/iommu.c b/xen/drivers/passthrough/vtd/iommu.c
index efd3390..07136df 100644
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -1856,10 +1856,15 @@ static int rmrr_identity_mapping(struct domain *d, bool_t map,
 
     while ( base_pfn < end_pfn )
     {
-        int err = intel_iommu_map_page(d, base_pfn, base_pfn,
-                                       IOMMUF_readable|IOMMUF_writable);
-
-        if ( err )
+        int err = 0;
+        if ( iommu_use_hap_pt(d) )
+        {
+            ASSERT(!iommu_passthrough || !is_hardware_domain(d));
+            if ( (err = set_identity_p2m_entry(d, base_pfn, p2m_access_rw)) )
+                return err;
+        }
+        else if ( (err = intel_iommu_map_page(d, base_pfn, base_pfn,
+					      IOMMUF_readable|IOMMUF_writable)) )
             return err;
         base_pfn++;
     }
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 180+ messages in thread

* [v7][RFC][PATCH 12/13] xen/vtd: re-enable USB device assignment
  2014-10-24  7:34 [v7][RFC][PATCH 01/13] xen: RMRR fix Tiejun Chen
                   ` (10 preceding siblings ...)
  2014-10-24  7:34 ` [v7][RFC][PATCH 11/13] xen:vtd: create RMRR mapping Tiejun Chen
@ 2014-10-24  7:34 ` Tiejun Chen
  2014-10-24  7:34 ` [v7][RFC][PATCH 13/13] xen/vtd: group assigned device with RMRR Tiejun Chen
                   ` (2 subsequent siblings)
  14 siblings, 0 replies; 180+ messages in thread
From: Tiejun Chen @ 2014-10-24  7:34 UTC (permalink / raw)
  To: JBeulich, tim, konrad.wilk, kevin.tian, yang.z.zhang; +Cc: xen-devel

Before the RMRR mechanism was refined, a USB RMRR could conflict with the
guest BIOS region, so we always ignored USB RMRRs. This workaround can now
be removed.

Signed-off-by: Tiejun Chen <tiejun.chen@intel.com>
---
 xen/drivers/passthrough/vtd/dmar.h  |  1 -
 xen/drivers/passthrough/vtd/iommu.c | 11 ++---------
 xen/drivers/passthrough/vtd/utils.c |  7 -------
 3 files changed, 2 insertions(+), 17 deletions(-)

diff --git a/xen/drivers/passthrough/vtd/dmar.h b/xen/drivers/passthrough/vtd/dmar.h
index af1feef..af205f5 100644
--- a/xen/drivers/passthrough/vtd/dmar.h
+++ b/xen/drivers/passthrough/vtd/dmar.h
@@ -129,7 +129,6 @@ do {                                                \
 
 int vtd_hw_check(void);
 void disable_pmr(struct iommu *iommu);
-int is_usb_device(u16 seg, u8 bus, u8 devfn);
 int is_igd_drhd(struct acpi_drhd_unit *drhd);
 
 #endif /* _DMAR_H_ */
diff --git a/xen/drivers/passthrough/vtd/iommu.c b/xen/drivers/passthrough/vtd/iommu.c
index 07136df..298d458 100644
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -2232,11 +2232,9 @@ static int reassign_device_ownership(
     /*
      * If the device belongs to the hardware domain, and it has RMRR, don't
      * remove it from the hardware domain, because BIOS may use RMRR at
-     * booting time. Also account for the special casing of USB below (in
-     * intel_iommu_assign_device()).
+     * booting time.
      */
-    if ( !is_hardware_domain(source) &&
-         !is_usb_device(pdev->seg, pdev->bus, pdev->devfn) )
+    if ( !is_hardware_domain(source) )
     {
         const struct acpi_rmrr_unit *rmrr;
         u16 bdf;
@@ -2285,13 +2283,8 @@ static int intel_iommu_assign_device(
     if ( ret )
         return ret;
 
-    /* FIXME: Because USB RMRR conflicts with guest bios region,
-     * ignore USB RMRR temporarily.
-     */
     seg = pdev->seg;
     bus = pdev->bus;
-    if ( is_usb_device(seg, bus, pdev->devfn) )
-        return 0;
 
     /* Setup rmrr identity mapping */
     for_each_rmrr_device( rmrr, bdf, i )
diff --git a/xen/drivers/passthrough/vtd/utils.c b/xen/drivers/passthrough/vtd/utils.c
index a33564b..0d7bfe9 100644
--- a/xen/drivers/passthrough/vtd/utils.c
+++ b/xen/drivers/passthrough/vtd/utils.c
@@ -29,13 +29,6 @@
 #include "extern.h"
 #include <asm/io_apic.h>
 
-int is_usb_device(u16 seg, u8 bus, u8 devfn)
-{
-    u16 class = pci_conf_read16(seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn),
-                                PCI_CLASS_DEVICE);
-    return (class == 0xc03);
-}
-
 /* Disable vt-d protected memory registers. */
 void disable_pmr(struct iommu *iommu)
 {
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 180+ messages in thread

* [v7][RFC][PATCH 13/13] xen/vtd: group assigned device with RMRR
  2014-10-24  7:34 [v7][RFC][PATCH 01/13] xen: RMRR fix Tiejun Chen
                   ` (11 preceding siblings ...)
  2014-10-24  7:34 ` [v7][RFC][PATCH 12/13] xen/vtd: re-enable USB device assignment Tiejun Chen
@ 2014-10-24  7:34 ` Tiejun Chen
  2014-10-24 10:52 ` [v7][RFC][PATCH 01/13] xen: RMRR fix Jan Beulich
  2014-10-30 22:15 ` Tim Deegan
  14 siblings, 0 replies; 180+ messages in thread
From: Tiejun Chen @ 2014-10-24  7:34 UTC (permalink / raw)
  To: JBeulich, tim, konrad.wilk, kevin.tian, yang.z.zhang; +Cc: xen-devel

Sometimes different devices share an RMRR range. In this case we
shouldn't assign these devices to different VMs, since that could
leak data, or even cause damage, between VMs.

So we need to group all devices by RMRR range to make sure they
are only assigned to the same VM.

Here we introduce two fields, gid and domid, in struct
acpi_rmrr_unit:
 gid: indicates which group this device belongs to. "0" is invalid, so
      group ids start from "1".
 domid: indicates which domain currently owns this device. Initially
        the hardware domain owns it.
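
The grouping idea can be sketched as below. This is a simplified illustration,
not the code in this patch: struct sk_rmrr, sk_overlap() and
sk_assign_groups() are hypothetical, the end address is assumed to be
inclusive, and any overlap (not only full containment) is treated as sharing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for an RMRR unit. */
struct sk_rmrr {
    uint64_t base;    /* first byte of the region */
    uint64_t end;     /* last byte of the region (inclusive) */
    unsigned int gid; /* group id; 0 means "not assigned yet" */
};

/*
 * Two inclusive ranges overlap iff each one starts no later than the
 * other one ends.
 */
static int sk_overlap(const struct sk_rmrr *a, const struct sk_rmrr *b)
{
    return a->base <= b->end && b->base <= a->end;
}

/*
 * Give every unit a group id, reusing the id of the first earlier unit
 * it overlaps with. Ids start from 1. Returns the number of groups.
 */
static unsigned int sk_assign_groups(struct sk_rmrr *u, size_t n)
{
    unsigned int next_gid = 0;
    size_t i, j;

    assert(u != NULL || n == 0);

    for ( i = 0; i < n; i++ )
    {
        u[i].gid = 0;

        for ( j = 0; j < i; j++ )
        {
            if ( sk_overlap(&u[i], &u[j]) )
            {
                u[i].gid = u[j].gid;
                break;
            }
        }

        if ( !u[i].gid )
            u[i].gid = ++next_gid;
    }

    return next_gid;
}
```

Note that this first-fit scheme does not merge two existing groups when a later
range bridges them; handling that case would need something like a union-find
pass.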

Signed-off-by: Tiejun Chen <tiejun.chen@intel.com>
---
 xen/drivers/passthrough/vtd/dmar.c  | 29 +++++++++++++++-
 xen/drivers/passthrough/vtd/dmar.h  |  2 ++
 xen/drivers/passthrough/vtd/iommu.c | 68 +++++++++++++++++++++++++++++++++----
 3 files changed, 92 insertions(+), 7 deletions(-)

diff --git a/xen/drivers/passthrough/vtd/dmar.c b/xen/drivers/passthrough/vtd/dmar.c
index 141e735..546eca9 100644
--- a/xen/drivers/passthrough/vtd/dmar.c
+++ b/xen/drivers/passthrough/vtd/dmar.c
@@ -572,10 +572,11 @@ acpi_parse_one_rmrr(struct acpi_dmar_header *header)
 {
     struct acpi_dmar_reserved_memory *rmrr =
         container_of(header, struct acpi_dmar_reserved_memory, header);
-    struct acpi_rmrr_unit *rmrru;
+    struct acpi_rmrr_unit *rmrru, *cur_rmrr;
     void *dev_scope_start, *dev_scope_end;
     u64 base_addr = rmrr->base_address, end_addr = rmrr->end_address;
     int ret;
+    static unsigned int group_id = 0;
 
     if ( (ret = acpi_dmar_check_length(header, sizeof(*rmrr))) != 0 )
         return ret;
@@ -611,6 +612,8 @@ acpi_parse_one_rmrr(struct acpi_dmar_header *header)
     rmrru->base_address = base_addr;
     rmrru->end_address = end_addr;
     rmrru->segment = rmrr->segment;
+    /* "0" is an invalid group id. */
+    rmrru->gid = 0;
 
     dev_scope_start = (void *)(rmrr + 1);
     dev_scope_end   = ((void *)rmrr) + header->length;
@@ -674,7 +677,31 @@ acpi_parse_one_rmrr(struct acpi_dmar_header *header)
                         "  RMRR region: base_addr %"PRIx64
                         " end_address %"PRIx64"\n",
                         rmrru->base_address, rmrru->end_address);
+
+            list_for_each_entry(cur_rmrr, &acpi_rmrr_units, list)
+            {
+                /*
+                 * Identical or overlapping ranges belong in the
+                 * same group.
+                 */
+                if ( ((base_addr >= cur_rmrr->base_address) &&
+                     (end_addr <= cur_rmrr->end_address)) ||
+                     ((base_addr <= cur_rmrr->base_address) &&
+                     (end_addr >= cur_rmrr->end_address)) )
+                {
+                    rmrru->gid = cur_rmrr->gid;
+                    continue;
+                }
+            }
+
             acpi_register_rmrr_unit(rmrru);
+
+            /* Allocate group id from gid:1. */
+            if ( !rmrru->gid )
+            {
+                group_id++;
+                rmrru->gid = group_id;
+            }
         }
     }
 
diff --git a/xen/drivers/passthrough/vtd/dmar.h b/xen/drivers/passthrough/vtd/dmar.h
index af205f5..3ef7cb7 100644
--- a/xen/drivers/passthrough/vtd/dmar.h
+++ b/xen/drivers/passthrough/vtd/dmar.h
@@ -76,6 +76,8 @@ struct acpi_rmrr_unit {
     u64    end_address;
     u16    segment;
     u8     allow_all:1;
+    int    gid;
+    domid_t    domid;
 };
 
 struct acpi_atsr_unit {
diff --git a/xen/drivers/passthrough/vtd/iommu.c b/xen/drivers/passthrough/vtd/iommu.c
index 298d458..00f72cb 100644
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -1882,9 +1882,9 @@ static int rmrr_identity_mapping(struct domain *d, bool_t map,
 
 static int intel_iommu_add_device(u8 devfn, struct pci_dev *pdev)
 {
-    struct acpi_rmrr_unit *rmrr;
-    u16 bdf;
-    int ret, i;
+    struct acpi_rmrr_unit *rmrr, *g_rmrr;
+    u16 bdf, g_bdf;
+    int ret, i, j;
 
     ASSERT(spin_is_locked(&pcidevs_lock));
 
@@ -1905,6 +1905,32 @@ static int intel_iommu_add_device(u8 devfn, struct pci_dev *pdev)
              PCI_BUS(bdf) == pdev->bus &&
              PCI_DEVFN2(bdf) == devfn )
         {
+            if ( rmrr->domid == hardware_domain->domain_id )
+            {
+                for_each_rmrr_device ( g_rmrr, g_bdf, j )
+                {
+                    if ( g_rmrr->gid == rmrr->gid )
+                    {
+                        if ( g_rmrr->domid == hardware_domain->domain_id )
+                            g_rmrr->domid = pdev->domain->domain_id;
+                        else if ( g_rmrr->domid != pdev->domain->domain_id )
+                        {
+                            rmrr->domid = g_rmrr->domid;
+                            continue;
+                        }
+                    }
+                }
+            }
+
+            if ( rmrr->domid != pdev->domain->domain_id )
+            {
+                domain_context_unmap(pdev->domain, devfn, pdev);
+                dprintk(XENLOG_ERR VTDPREFIX, "d%d: this is a group device owned by d%d\n",
+                        pdev->domain->domain_id, rmrr->domid);
+                rmrr->domid = 0;
+                return -EINVAL;
+            }
+
             ret = rmrr_identity_mapping(pdev->domain, 1, rmrr);
             if ( ret )
                 dprintk(XENLOG_ERR VTDPREFIX, "d%d: RMRR mapping failed\n",
@@ -1946,6 +1972,8 @@ static int intel_iommu_remove_device(u8 devfn, struct pci_dev *pdev)
              PCI_DEVFN2(bdf) != devfn )
             continue;
 
+        /* Just release to hardware domain. */
+        rmrr->domid = hardware_domain->domain_id;
         rmrr_identity_mapping(pdev->domain, 0, rmrr);
     }
 
@@ -2104,6 +2132,8 @@ static void __hwdom_init setup_hwdom_rmrr(struct domain *d)
     spin_lock(&pcidevs_lock);
     for_each_rmrr_device ( rmrr, bdf, i )
     {
+        /* hwdom should own all devices at first. */
+        rmrr->domid = d->domain_id;
         ret = rmrr_identity_mapping(d, 1, rmrr);
         if ( ret )
             dprintk(XENLOG_ERR VTDPREFIX,
@@ -2271,9 +2301,9 @@ static int reassign_device_ownership(
 static int intel_iommu_assign_device(
     struct domain *d, u8 devfn, struct pci_dev *pdev)
 {
-    struct acpi_rmrr_unit *rmrr;
-    int ret = 0, i;
-    u16 bdf, seg;
+    struct acpi_rmrr_unit *rmrr, *g_rmrr;
+    int ret = 0, i, j;
+    u16 bdf, seg, g_bdf;
     u8 bus;
 
     if ( list_empty(&acpi_drhd_units) )
@@ -2293,6 +2323,32 @@ static int intel_iommu_assign_device(
              PCI_BUS(bdf) == bus &&
              PCI_DEVFN2(bdf) == devfn )
         {
+            if ( rmrr->domid == hardware_domain->domain_id )
+            {
+                for_each_rmrr_device ( g_rmrr, g_bdf, j )
+                {
+                    if ( g_rmrr->gid == rmrr->gid )
+                    {
+                        if ( g_rmrr->domid == hardware_domain->domain_id )
+                            g_rmrr->domid = pdev->domain->domain_id;
+                        else if ( g_rmrr->domid != pdev->domain->domain_id )
+                        {
+                            rmrr->domid = g_rmrr->domid;
+                            continue;
+                        }
+                    }
+                }
+            }
+
+            if ( rmrr->domid != pdev->domain->domain_id )
+            {
+                domain_context_unmap(pdev->domain, devfn, pdev);
+                dprintk(XENLOG_ERR VTDPREFIX, "d%d: this is a group device owned by d%d\n",
+                        pdev->domain->domain_id, rmrr->domid);
+                rmrr->domid = 0;
+                return -EINVAL;
+            }
+
             ret = rmrr_identity_mapping(d, 1, rmrr);
             if ( ret )
             {
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 180+ messages in thread

* Re: [v7][RFC][PATCH 01/13] xen: RMRR fix
  2014-10-24  7:34 [v7][RFC][PATCH 01/13] xen: RMRR fix Tiejun Chen
                   ` (12 preceding siblings ...)
  2014-10-24  7:34 ` [v7][RFC][PATCH 13/13] xen/vtd: group assigned device with RMRR Tiejun Chen
@ 2014-10-24 10:52 ` Jan Beulich
  2014-10-27  2:00   ` Chen, Tiejun
  2014-10-30 22:15 ` Tim Deegan
  14 siblings, 1 reply; 180+ messages in thread
From: Jan Beulich @ 2014-10-24 10:52 UTC (permalink / raw)
  To: Tiejun Chen; +Cc: yang.z.zhang, kevin.tian, tim, xen-devel

>>> On 24.10.14 at 09:34, <tiejun.chen@intel.com> wrote:
> 5. Before we take real device assignment, any access to RMRR may issue
> ept_handle_violation because of p2m_access_n. Then we just call
> update_guest_eip() to return.

I.e. ignore such accesses? Why?

> Now in our case we add a rule:
>  - if p2m_access_n is set we also set this mapping.

Does that not conflict with eventual use mem-access makes of this
type?

Jan

^ permalink raw reply	[flat|nested] 180+ messages in thread

* Re: [v7][RFC][PATCH 01/13] introduce XENMEM_reserved_device_memory_map
  2014-10-24  7:34 ` [v7][RFC][PATCH 01/13] introduce XENMEM_reserved_device_memory_map Tiejun Chen
@ 2014-10-24 14:11   ` Jan Beulich
  2014-10-27  2:11     ` Chen, Tiejun
  2014-10-27 13:35   ` Julien Grall
  1 sibling, 1 reply; 180+ messages in thread
From: Jan Beulich @ 2014-10-24 14:11 UTC (permalink / raw)
  To: Tiejun Chen; +Cc: yang.z.zhang, kevin.tian, tim, xen-devel

[-- Attachment #1: Type: text/plain, Size: 625 bytes --]

>>> On 24.10.14 at 09:34, <tiejun.chen@intel.com> wrote:
> From: Jan Beulich <jbeulich@suse.com>
> 
> This is a prerequisite for punching holes into HVM and PVH guests' P2M
> to allow passing through devices that are associated with (on VT-d)
> RMRRs.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> Signed-off-by: Tiejun Chen <tiejun.chen@intel.com>

I'm confused - you dropped Kevin's ack and instead added your S-o-b
despite you not having changed anything in the patch.

That said - in my local version I did a little bit of renaming. I'm
attaching the most recent version for your reference.

Jan


[-- Attachment #2: get-reserved-device-memory.patch --]
[-- Type: text/plain, Size: 8905 bytes --]

introduce XENMEM_reserved_device_memory_map

This is a prerequisite for punching holes into HVM and PVH guests' P2M
to allow passing through devices that are associated with (on VT-d)
RMRRs.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Kevin Tian <kevin.tian@intel.com>

--- a/xen/common/compat/memory.c
+++ b/xen/common/compat/memory.c
@@ -16,6 +16,35 @@ CHECK_TYPE(domid);
 
 CHECK_mem_access_op;
 
+struct get_reserved_device_memory {
+    struct compat_reserved_device_memory_map map;
+    unsigned int used_entries;
+};
+
+static int get_reserved_device_memory(xen_pfn_t start,
+                                      xen_ulong_t nr, void *ctxt)
+{
+    struct get_reserved_device_memory *grdm = ctxt;
+
+    if ( grdm->used_entries < grdm->map.nr_entries )
+    {
+        struct compat_reserved_device_memory rdm = {
+            .start_pfn = start, .nr_pages = nr
+        };
+
+        if ( rdm.start_pfn != start || rdm.nr_pages != nr )
+            return -ERANGE;
+
+        if ( __copy_to_compat_offset(grdm->map.buffer, grdm->used_entries,
+                                     &rdm, 1) )
+            return -EFAULT;
+    }
+
+    ++grdm->used_entries;
+
+    return 0;
+}
+
 int compat_memory_op(unsigned int cmd, XEN_GUEST_HANDLE_PARAM(void) compat)
 {
     int split, op = cmd & MEMOP_CMD_MASK;
@@ -273,6 +302,29 @@ int compat_memory_op(unsigned int cmd, X
             break;
         }
 
+#ifdef HAS_PASSTHROUGH
+        case XENMEM_reserved_device_memory_map:
+        {
+            struct get_reserved_device_memory grdm;
+
+            if ( copy_from_guest(&grdm.map, compat, 1) ||
+                 !compat_handle_okay(grdm.map.buffer, grdm.map.nr_entries) )
+                return -EFAULT;
+
+            grdm.used_entries = 0;
+            rc = iommu_get_reserved_device_memory(get_reserved_device_memory,
+                                                  &grdm);
+
+            if ( !rc && grdm.map.nr_entries < grdm.used_entries )
+                rc = -ENOBUFS;
+            grdm.map.nr_entries = grdm.used_entries;
+            if ( __copy_to_guest(compat, &grdm.map, 1) )
+                rc = -EFAULT;
+
+            return rc;
+        }
+#endif
+
         default:
             return compat_arch_memory_op(cmd, compat);
         }
--- a/xen/common/memory.c
+++ b/xen/common/memory.c
@@ -692,6 +692,32 @@ out:
     return rc;
 }
 
+struct get_reserved_device_memory {
+    struct xen_reserved_device_memory_map map;
+    unsigned int used_entries;
+};
+
+static int get_reserved_device_memory(xen_pfn_t start,
+                                      xen_ulong_t nr, void *ctxt)
+{
+    struct get_reserved_device_memory *grdm = ctxt;
+
+    if ( grdm->used_entries < grdm->map.nr_entries )
+    {
+        struct xen_reserved_device_memory rdm = {
+            .start_pfn = start, .nr_pages = nr
+        };
+
+        if ( __copy_to_guest_offset(grdm->map.buffer, grdm->used_entries,
+                                    &rdm, 1) )
+            return -EFAULT;
+    }
+
+    ++grdm->used_entries;
+
+    return 0;
+}
+
 long do_memory_op(unsigned long cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
 {
     struct domain *d;
@@ -1101,6 +1127,29 @@ long do_memory_op(unsigned long cmd, XEN
         break;
     }
 
+#ifdef HAS_PASSTHROUGH
+    case XENMEM_reserved_device_memory_map:
+    {
+        struct get_reserved_device_memory grdm;
+
+        if ( copy_from_guest(&grdm.map, arg, 1) ||
+             !guest_handle_okay(grdm.map.buffer, grdm.map.nr_entries) )
+            return -EFAULT;
+
+        grdm.used_entries = 0;
+        rc = iommu_get_reserved_device_memory(get_reserved_device_memory,
+                                              &grdm);
+
+        if ( !rc && grdm.map.nr_entries < grdm.used_entries )
+            rc = -ENOBUFS;
+        grdm.map.nr_entries = grdm.used_entries;
+        if ( __copy_to_guest(arg, &grdm.map, 1) )
+            rc = -EFAULT;
+
+        break;
+    }
+#endif
+
     default:
         rc = arch_memory_op(cmd, arg);
         break;
--- a/xen/drivers/passthrough/iommu.c
+++ b/xen/drivers/passthrough/iommu.c
@@ -344,6 +344,16 @@ void iommu_crash_shutdown(void)
     iommu_enabled = iommu_intremap = 0;
 }
 
+int iommu_get_reserved_device_memory(iommu_grdm_t *func, void *ctxt)
+{
+    const struct iommu_ops *ops = iommu_get_ops();
+
+    if ( !iommu_enabled || !ops->get_reserved_device_memory )
+        return 0;
+
+    return ops->get_reserved_device_memory(func, ctxt);
+}
+
 bool_t iommu_has_feature(struct domain *d, enum iommu_feature feature)
 {
     const struct hvm_iommu *hd = domain_hvm_iommu(d);
--- a/xen/drivers/passthrough/vtd/dmar.c
+++ b/xen/drivers/passthrough/vtd/dmar.c
@@ -893,3 +893,20 @@ int platform_supports_x2apic(void)
     unsigned int mask = ACPI_DMAR_INTR_REMAP | ACPI_DMAR_X2APIC_OPT_OUT;
     return cpu_has_x2apic && ((dmar_flags & mask) == ACPI_DMAR_INTR_REMAP);
 }
+
+int intel_iommu_get_reserved_device_memory(iommu_grdm_t *func, void *ctxt)
+{
+    struct acpi_rmrr_unit *rmrr;
+    int rc = 0;
+
+    list_for_each_entry(rmrr, &acpi_rmrr_units, list)
+    {
+        rc = func(PFN_DOWN(rmrr->base_address),
+                  PFN_UP(rmrr->end_address) - PFN_DOWN(rmrr->base_address),
+                  ctxt);
+        if ( rc )
+            break;
+    }
+
+    return rc;
+}
--- a/xen/drivers/passthrough/vtd/extern.h
+++ b/xen/drivers/passthrough/vtd/extern.h
@@ -75,6 +75,7 @@ int domain_context_mapping_one(struct do
                                u8 bus, u8 devfn, const struct pci_dev *);
 int domain_context_unmap_one(struct domain *domain, struct iommu *iommu,
                              u8 bus, u8 devfn);
+int intel_iommu_get_reserved_device_memory(iommu_grdm_t *func, void *ctxt);
 
 unsigned int io_apic_read_remap_rte(unsigned int apic, unsigned int reg);
 void io_apic_write_remap_rte(unsigned int apic,
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -2491,6 +2491,7 @@ const struct iommu_ops intel_iommu_ops =
     .crash_shutdown = vtd_crash_shutdown,
     .iotlb_flush = intel_iommu_iotlb_flush,
     .iotlb_flush_all = intel_iommu_iotlb_flush_all,
+    .get_reserved_device_memory = intel_iommu_get_reserved_device_memory,
     .dump_p2m_table = vtd_dump_p2m_table,
 };
 
--- a/xen/include/public/memory.h
+++ b/xen/include/public/memory.h
@@ -573,7 +573,29 @@ struct vnuma_topology_info {
 typedef struct vnuma_topology_info vnuma_topology_info_t;
 DEFINE_XEN_GUEST_HANDLE(vnuma_topology_info_t);
 
-/* Next available subop number is 27 */
+/*
+ * For legacy reasons, some devices must be configured with special memory
+ * regions to function correctly.  The guest must avoid using any of these
+ * regions.
+ */
+#define XENMEM_reserved_device_memory_map   27
+struct xen_reserved_device_memory {
+    xen_pfn_t start_pfn;
+    xen_ulong_t nr_pages;
+};
+typedef struct xen_reserved_device_memory xen_reserved_device_memory_t;
+DEFINE_XEN_GUEST_HANDLE(xen_reserved_device_memory_t);
+
+struct xen_reserved_device_memory_map {
+    /* IN/OUT */
+    unsigned int nr_entries;
+    /* OUT */
+    XEN_GUEST_HANDLE(xen_reserved_device_memory_t) buffer;
+};
+typedef struct xen_reserved_device_memory_map xen_reserved_device_memory_map_t;
+DEFINE_XEN_GUEST_HANDLE(xen_reserved_device_memory_map_t);
+
+/* Next available subop number is 28 */
 
 #endif /* __XEN_PUBLIC_MEMORY_H__ */
 
--- a/xen/include/xen/iommu.h
+++ b/xen/include/xen/iommu.h
@@ -120,6 +120,8 @@ void iommu_dt_domain_destroy(struct doma
 
 struct page_info;
 
+typedef int iommu_grdm_t(xen_pfn_t start, xen_ulong_t nr, void *ctxt);
+
 struct iommu_ops {
     int (*init)(struct domain *d);
     void (*hwdom_init)(struct domain *d);
@@ -156,12 +158,14 @@ struct iommu_ops {
     void (*crash_shutdown)(void);
     void (*iotlb_flush)(struct domain *d, unsigned long gfn, unsigned int page_count);
     void (*iotlb_flush_all)(struct domain *d);
+    int (*get_reserved_device_memory)(iommu_grdm_t *, void *);
     void (*dump_p2m_table)(struct domain *d);
 };
 
 void iommu_suspend(void);
 void iommu_resume(void);
 void iommu_crash_shutdown(void);
+int iommu_get_reserved_device_memory(iommu_grdm_t *, void *);
 
 void iommu_share_p2m_table(struct domain *d);
 
--- a/xen/include/xlat.lst
+++ b/xen/include/xlat.lst
@@ -61,9 +61,10 @@
 !	memory_exchange			memory.h
 !	memory_map			memory.h
 !	memory_reservation		memory.h
-?	mem_access_op		memory.h
+?	mem_access_op			memory.h
 !	pod_target			memory.h
 !	remove_from_physmap		memory.h
+!	reserved_device_memory_map	memory.h
 ?	physdev_eoi			physdev.h
 ?	physdev_get_free_pirq		physdev.h
 ?	physdev_irq			physdev.h

[-- Attachment #3: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 180+ messages in thread

* Re: [v7][RFC][PATCH 04/13] hvmloader/util: get reserved device memory maps
  2014-10-24  7:34 ` [v7][RFC][PATCH 04/13] hvmloader/util: get reserved device memory maps Tiejun Chen
@ 2014-10-24 14:22   ` Jan Beulich
  2014-10-27  3:12     ` Chen, Tiejun
  2014-10-24 14:27   ` Jan Beulich
  1 sibling, 1 reply; 180+ messages in thread
From: Jan Beulich @ 2014-10-24 14:22 UTC (permalink / raw)
  To: Tiejun Chen; +Cc: yang.z.zhang, kevin.tian, tim, xen-devel

>>> On 24.10.14 at 09:34, <tiejun.chen@intel.com> wrote:
> --- a/tools/firmware/hvmloader/util.c
> +++ b/tools/firmware/hvmloader/util.c
> @@ -828,6 +828,72 @@ int hpet_exists(unsigned long hpet_base)
>      return ((hpet_id >> 16) == 0x8086);
>  }
>  
> +int get_reserved_device_memory_map(struct xen_mem_reserved_device_memory entries[],
> +                                   uint32_t *max_entries)
> +{
> +    int rc;
> +    struct xen_mem_reserved_device_memory_map memmap = {
> +        .nr_entries = *max_entries
> +    };
> +
> +    set_xen_guest_handle(memmap.buffer, entries);
> +
> +    rc = hypercall_memory_op(XENMEM_reserved_device_memory_map, &memmap);
> +    if (rc == -ENOBUFS)

Coding style.

> +        *max_entries = memmap.nr_entries;
> +
> +    return rc;
> +}
> +
> +/* Getting all reserved device memory map info in case of hvmloader. */
> +int hvm_get_reserved_device_memory_map(void)
> +{
> +    static uint32_t nr_entries = 0;
> +    int rc = 0;
> +
> +    if ( !rdm_map )
> +    {
> +        /* Assume we have one entry if not enough we'll expand.*/
> +        nr_entries = 1;
> +        rdm_map = mem_alloc(nr_entries *
> +                            sizeof(struct xen_mem_reserved_device_memory), 0);
> +        if ( !rdm_map )
> +        {
> +            printf("No space to get reserved dev memory maps!\n");
> +            return rc;
> +        }
> +
> +        rc = get_reserved_device_memory_map(rdm_map, &nr_entries);
> +        if ( rc == -ENOBUFS )
> +        {
> +            rdm_map = mem_alloc(nr_entries *
> +                                sizeof(struct xen_mem_reserved_device_memory),
> +                                0);
> +            if ( rdm_map )
> +            {
> +                rc = get_reserved_device_memory_map(rdm_map, &nr_entries);
> +                if ( rc )
> +                {
> +                    printf("Could not get reserved dev memory info on domain");
> +                    return rc;
> +                }
> +            }
> +            else
> +            {
> +                printf("No space to get reserved dev memory maps!\n");
> +                return rc;
> +            }
> +        }
> +        else if ( rc )
> +        {
> +            printf("Could not get reserved dev memory info on domain");
> +            return rc;
> +        }
> +    }
> +
> +    return nr_entries;
> +}

I continue to think that adding these functions without user isn't
really worthwhile in a separate patch.

> --- a/tools/firmware/hvmloader/util.h
> +++ b/tools/firmware/hvmloader/util.h
> @@ -241,6 +241,12 @@ int build_e820_table(struct e820entry *e820,
>                       unsigned int bios_image_base);
>  void dump_e820_table(struct e820entry *e820, unsigned int nr);
>  
> +#include <xen/memory.h>
> +#define ENOBUFS     105 /* No buffer space available */

This is a joke I hope? The #include belongs at the top (albeit afaict
you don't really need it here), and the #define is completely
misplaced here. While I generally wouldn't recommend doing this, I
think in the case here including the hypervisor header that defines
them would be okay. Perhaps not via relative path, but via having
the Makefile symlink the hypervisor header here.

> +struct xen_mem_reserved_device_memory *rdm_map;
> +int get_reserved_device_memory_map(struct xen_mem_reserved_device_memory entries[],
> +                                   uint32_t *max_entries);
> +int hvm_get_reserved_device_memory_map(void);
>  #ifndef NDEBUG

Blank line missing after your additions.

Jan

^ permalink raw reply	[flat|nested] 180+ messages in thread

* Re: [v7][RFC][PATCH 04/13] hvmloader/util: get reserved device memory maps
  2014-10-24  7:34 ` [v7][RFC][PATCH 04/13] hvmloader/util: get reserved device memory maps Tiejun Chen
  2014-10-24 14:22   ` Jan Beulich
@ 2014-10-24 14:27   ` Jan Beulich
  2014-10-27  5:07     ` Chen, Tiejun
  1 sibling, 1 reply; 180+ messages in thread
From: Jan Beulich @ 2014-10-24 14:27 UTC (permalink / raw)
  To: Tiejun Chen; +Cc: yang.z.zhang, kevin.tian, tim, xen-devel

>>> On 24.10.14 at 09:34, <tiejun.chen@intel.com> wrote:
> --- a/tools/firmware/hvmloader/util.h
> +++ b/tools/firmware/hvmloader/util.h
> @@ -241,6 +241,12 @@ int build_e820_table(struct e820entry *e820,
>                       unsigned int bios_image_base);
>  void dump_e820_table(struct e820entry *e820, unsigned int nr);
>  
> +#include <xen/memory.h>
> +#define ENOBUFS     105 /* No buffer space available */
> +struct xen_mem_reserved_device_memory *rdm_map;

Oh, and - without "extern" this creates an instance in each
translation unit (and would cause build failures the moment
someone passed -fno-common to the compiler).
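
To illustrate the point, a minimal sketch of the usual split between
declaration and definition. The file names in the comments are only an
assumption about where the two pieces would live; the struct is left
opaque here.

```c
struct xen_mem_reserved_device_memory;  /* opaque for this sketch */

/* Header (e.g. util.h): declaration only, shared by every includer. */
extern struct xen_mem_reserved_device_memory *rdm_map;

/*
 * Exactly one source file (e.g. util.c): the single definition.
 * File-scope objects are zero-initialized, so no initializer is needed.
 */
struct xen_mem_reserved_device_memory *rdm_map;
```

With this layout each translation unit refers to the same object, and the
link also succeeds with -fno-common.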

Jan

^ permalink raw reply	[flat|nested] 180+ messages in thread

* Re: [v7][RFC][PATCH 05/13] hvmloader/mmio: reconcile guest mmio with reserved device memory
  2014-10-24  7:34 ` [v7][RFC][PATCH 05/13] hvmloader/mmio: reconcile guest mmio with reserved device memory Tiejun Chen
@ 2014-10-24 14:42   ` Jan Beulich
  2014-10-27  7:12     ` Chen, Tiejun
  0 siblings, 1 reply; 180+ messages in thread
From: Jan Beulich @ 2014-10-24 14:42 UTC (permalink / raw)
  To: Tiejun Chen; +Cc: yang.z.zhang, kevin.tian, tim, xen-devel

>>> On 24.10.14 at 09:34, <tiejun.chen@intel.com> wrote:
> --- a/tools/firmware/hvmloader/pci.c
> +++ b/tools/firmware/hvmloader/pci.c
> @@ -37,6 +37,44 @@ uint64_t pci_hi_mem_start = 0, pci_hi_mem_end = 0;
>  enum virtual_vga virtual_vga = VGA_none;
>  unsigned long igd_opregion_pgbase = 0;
>  
> +unsigned int need_skip_rmrr = 0;

Static (and without initializer)?

> +
> +/*
> + * Check whether there exists mmio hole in the specified memory range.
> + * Returns 1 if exists, else returns 0.
> + */
> +static int check_mmio_hole_confliction(uint64_t start, uint64_t memsize,

I don't think the word "confliction" exists. "conflict" please.

> +                           uint64_t mmio_start, uint64_t mmio_size)
> +{
> +    if ( start + memsize <= mmio_start || start >= mmio_start + mmio_size )
> +        return 0;
> +    else
> +        return 1;

Make this a simple single return statement?
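
Something along these lines would do. This is only a sketch: the name
check_mmio_hole_conflict is the suggested rename, and both ranges are
treated as half-open, exactly as in the quoted code.

```c
#include <stdint.h>

/*
 * Two half-open ranges [start, start + memsize) and
 * [mmio_start, mmio_start + mmio_size) overlap exactly when each one
 * begins before the other one ends.
 */
static int check_mmio_hole_conflict(uint64_t start, uint64_t memsize,
                                    uint64_t mmio_start, uint64_t mmio_size)
{
    return start + memsize > mmio_start && start < mmio_start + mmio_size;
}
```

This is simply the negation of the condition in the quoted if(), so
adjacent ranges still count as non-conflicting.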

> +}
> +
> +static int check_reserved_device_memory_map(uint64_t mmio_base,
> +                                            uint64_t mmio_max)
> +{
> +    uint32_t i = 0;

Pointless initializer.

> +    uint64_t rdm_start, rdm_end;
> +    int nr_entries = -1;

And again.

> +
> +    nr_entries = hvm_get_reserved_device_memory_map();

It's completely unclear why this can't be the variable's initializer.

> +
> +    for ( i = 0; i < nr_entries; i++ )
> +    {
> +        rdm_start = rdm_map[i].start_pfn << PAGE_SHIFT;
> +        rdm_end = rdm_start + (rdm_map[i].nr_pages << PAGE_SHIFT);

I'm pretty certain I pointed out before that you can't simply shift
these fields - you risk losing significant bits.
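
A small sketch of the problem, assuming the field is 32 bits wide in the
build in question (the helper names are made up for illustration): the
shift is carried out in the operand's type, so the value has to be widened
before shifting, not afterwards.

```c
#include <stdint.h>

#define PAGE_SHIFT 12

/* Widen first, then shift: all bits of the resulting address survive. */
static uint64_t pfn_to_addr(uint32_t pfn)
{
    return (uint64_t)pfn << PAGE_SHIFT;
}

/*
 * What the quoted code amounts to with a 32-bit field: the shift is
 * done in 32 bits and wraps before the result is stored in a uint64_t.
 */
static uint64_t pfn_to_addr_truncated(uint32_t pfn)
{
    return pfn << PAGE_SHIFT;
}
```

For a frame number of 0x100000 (the 4GiB boundary) the first helper
yields 0x100000000, while the second yields 0.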

> +        if ( check_mmio_hole_confliction(rdm_start, (rdm_end - rdm_start),

Pointless parentheses.

> +                                         mmio_base, mmio_max - mmio_base) )
> +        {
> +            need_skip_rmrr++;
> +        }
> +    }
> +
> +    return nr_entries;
> +}
> +
>  void pci_setup(void)
>  {
>      uint8_t is_64bar, using_64bar, bar64_relocate = 0;
> @@ -58,7 +96,9 @@ void pci_setup(void)
>          uint32_t bar_reg;
>          uint64_t bar_sz;
>      } *bars = (struct bars *)scratch_start;
> -    unsigned int i, nr_bars = 0;
> +    unsigned int i, j, nr_bars = 0;
> +    int nr_entries = 0;

And another pointless initializer. Plus as a count of something this
surely wants to be "unsigned int". Also I guess the variable name
is too generic - nr_rdm_entries perhaps?

> @@ -363,11 +411,29 @@ void pci_setup(void)
>              bar_data &= ~PCI_BASE_ADDRESS_IO_MASK;
>          }
>  
> + reallocate_mmio:
>          base = (resource->base  + bar_sz - 1) & ~(uint64_t)(bar_sz - 1);
>          bar_data |= (uint32_t)base;
>          bar_data_upper = (uint32_t)(base >> 32);
>          base += bar_sz;
>  
> +        if ( need_skip_rmrr )
> +        {
> +            for ( j = 0; j < nr_entries; j++ )
> +            {
> +                rdm_start = rdm_map[j].start_pfn << PAGE_SHIFT;
> +                rdm_end = rdm_start + (rdm_map[j].nr_pages << PAGE_SHIFT);
> +                if ( check_mmio_hole_confliction(rdm_start,
> +                                                 (rdm_end - rdm_start),
> +                                                 base, bar_sz) )

"base" was already updated by this point.

> +                {
> +                    resource->base = rdm_end;
> +                    need_skip_rmrr--;
> +                    goto reallocate_mmio;

If you ever get here, the earlier determination of whether the MMIO
hole is large enough may get invalidated. I.e. I'm afraid this approach
is still too simplistic. Also, the way you do the retry, the resulting
bar_data and bar_data_upper will likely end up being garbage.

Did you actually _test_ this code?
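
For the retry itself, a rough sketch of what I would expect: pick the
final base first, and only then derive bar_data and bar_data_upper from
it, once. struct rdm_range and place_bar() are hypothetical names, ranges
are half-open, and bar_sz is assumed to be a non-zero power of two. This
does not address the hole sizing concern above.

```c
#include <stdint.h>

struct rdm_range {
    uint64_t start, end;    /* [start, end) in bytes */
};

/*
 * Return the lowest bar_sz-aligned address at or above res_base whose
 * [base, base + bar_sz) window overlaps none of the given ranges.
 */
static uint64_t place_bar(uint64_t res_base, uint64_t bar_sz,
                          const struct rdm_range *rdm, unsigned int nr)
{
    uint64_t base;
    unsigned int i;

    do {
        base = (res_base + bar_sz - 1) & ~(bar_sz - 1);
        for ( i = 0; i < nr; i++ )
            if ( base < rdm[i].end && rdm[i].start < base + bar_sz )
            {
                /* Conflict: restart the search past this range. */
                res_base = rdm[i].end;
                break;
            }
    } while ( i < nr );

    return base;
}
```

The conflict check uses the candidate base before it gets advanced by
bar_sz, and the loop terminates because res_base strictly increases on
every conflict.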

Jan

^ permalink raw reply	[flat|nested] 180+ messages in thread

* Re: [v7][RFC][PATCH 06/13] hvmloader/ram: check if guest memory is out of reserved device memory maps
  2014-10-24  7:34 ` [v7][RFC][PATCH 06/13] hvmloader/ram: check if guest memory is out of reserved device memory maps Tiejun Chen
@ 2014-10-24 14:56   ` Jan Beulich
  2014-10-27  8:09     ` Chen, Tiejun
  0 siblings, 1 reply; 180+ messages in thread
From: Jan Beulich @ 2014-10-24 14:56 UTC (permalink / raw)
  To: Tiejun Chen; +Cc: yang.z.zhang, kevin.tian, tim, xen-devel

>>> On 24.10.14 at 09:34, <tiejun.chen@intel.com> wrote:
> We need to check to reserve all reserved device memory maps in e820
> to avoid any potential guest memory conflict.
> 
> Currently, if we can't insert RDM entries directly, we may need to handle
> several ranges as follows:
> a. Fixed Ranges --> BUG()
>  lowmem_reserved_base-0xA0000: reserved by BIOS implementation,
>  BIOS region,
>  RESERVED_MEMBASE ~ 0x100000000,

This seems conceptually wrong to me, and I said so before:
Depending on host characteristics this approach may mean you're
going to be unable to build any HVM guests. Minimally there needs
to be a way to avoid these checks (resulting in devices associated
with RMRRs not being assignable to such a guest). I'm therefore
only going to briefly look at the rest of this patch.

> +static unsigned int construct_rdm_e820_maps(unsigned int next_e820_entry_index,
> +                                            uint32_t nr_map,
> +                                            struct xen_mem_reserved_device_memory *map,
> +                                            struct e820entry *e820,
> +                                            unsigned int lowmem_reserved_base,
> +                                            unsigned int bios_image_base)
> +{
> +    unsigned int i, j, sum_nr = next_e820_entry_index + nr_map;
> +    uint64_t start, end, next_start, rdm_start, rdm_end;
> +    uint32_t type;
> +    unsigned int insert = 0, do_insert = 0;
> +    int err = 0;
> +
> + do_real_construct:
> +    for ( i = 0; i < nr_map; i++ )
> +    {
> +        rdm_start = map[i].start_pfn << PAGE_SHIFT;
> +        rdm_end = rdm_start + (map[i].nr_pages << PAGE_SHIFT);
> +
> +        for ( j = 0; j < next_e820_entry_index - 1; j++ )
> +        {
> +            start = e820[j].addr;
> +            end = e820[j].addr + e820[j].size;
> +            type = e820[j].type;
> +            next_start = e820[j+1].addr;
> +
> +            /* lowmem_reserved_base-0xA0000: reserved by BIOS implementation. */
> +            if ( lowmem_reserved_base < 0xA0000 &&
> +                 start == lowmem_reserved_base )
> +            {
> +                if ( rdm_start >= start && rdm_start <= end )
> +                {
> +                    err = -1;
> +                    break;
> +                }
> +            }
> +
> +            /*
> +             * BIOS region.
> +             */
> +            if ( start == bios_image_base )
> +            {
> +                if ( rdm_start >= start && rdm_start <= end )
> +                {
> +                    err = -1;
> +                    break;
> +                }
> +            }
> +
> +            /* The default memory map always occupy one fixed reserved
> +             * range: RESERVED_MEMBASE ~ 0x100000000
> +             */
> +            if ( rdm_start >= RESERVED_MEMBASE &&
> +                      rdm_start <= ((uint64_t)1 << 32) )
> +            {
> +                err = -1;
> +                break;
> +            }
> +
> +            /* Just amid those remaining e820 entries. */
> +            if ( (rdm_start > end) && (rdm_end < next_start) )
> +            {
> +                if ( do_insert )
> +                {
> +                    memmove(&e820[j+2], &e820[j+1],
> +                            (sum_nr - j - 1) * sizeof(struct e820entry));
> +
> +                    /* Then fill RMRR into that entry. */
> +                    e820[j+1].addr = rdm_start;
> +                    e820[j+1].size = rdm_end - rdm_start;
> +                    e820[j+1].type = E820_RESERVED;
> +                    next_e820_entry_index++;
> +                }
> +                insert++;
> +            }
> +            /* Already at the end. */
> +            else if ( (rdm_start > end) && !next_start )
> +            {
> +                if ( do_insert )
> +                {
> +                    e820[next_e820_entry_index].addr = rdm_start;
> +                    e820[next_e820_entry_index].size = rdm_end - rdm_start;
> +                    e820[next_e820_entry_index].type = E820_RESERVED;
> +                    next_e820_entry_index++;
> +                }
> +                insert++;
> +            }
> +            /* If completely overlap with one RAM range. */
> +            else if ( rdm_start == start && rdm_end == end && type == E820_RAM )

Comment and expression disagree.

> +            {
> +                if ( do_insert )
> +                    e820[j].type = E820_RESERVED;
> +                insert++;
> +            }
> +            /* If we're just alligned with start of one RAM range. */
> +            else if ( rdm_start == start && rdm_end < end && type == E820_RAM )
> +            {
> +                if ( do_insert )
> +                {
> +                    memmove(&e820[j+1], &e820[j],
> +                            (sum_nr - j) * sizeof(struct e820entry));
> +
> +                    e820[j+1].addr = rdm_end;
> +                    e820[j+1].size = e820[j].addr + e820[j].size - rdm_end;
> +                    e820[j+1].type = E820_RAM;
> +                    next_e820_entry_index++;
> +
> +                    e820[j].addr = rdm_start;
> +                    e820[j].size = rdm_end - rdm_start;
> +                    e820[j].type = E820_RESERVED;
> +                }
> +                insert++;
> +            }
> +            /* If we're just alligned with end of one RAM range. */
> +            else if ( rdm_start > start && rdm_end == end && type == E820_RAM )
> +            {
> +                if ( do_insert )
> +                {
> +                    memmove(&e820[j+1], &e820[j],
> +                            (sum_nr - j) * sizeof(struct e820entry));
> +
> +                    e820[j].size = rdm_start - e820[j].addr;
> +                    e820[j].type = E820_RAM;
> +
> +                    e820[j+1].addr = rdm_start;
> +                    e820[j+1].size = rdm_end - rdm_start;
> +                    e820[j+1].type = E820_RESERVED;
> +                    next_e820_entry_index++;
> +                }
> +                insert++;
> +            }
> +            /* If we're just in of one RAM range */
> +            else if ( rdm_start > start && rdm_end < end && type == E820_RAM )
> +            {
> +                if ( do_insert )
> +                {
> +                    memmove(&e820[j+2], &e820[j],
> +                            (sum_nr - j) * sizeof(struct e820entry));
> +
> +                    e820[j+2].addr = rdm_end;
> +                    e820[j+2].size = e820[j].addr + e820[j].size - rdm_end;
> +                    e820[j+2].type = E820_RAM;
> +                    next_e820_entry_index++;
> +
> +                    e820[j+1].addr = rdm_start;
> +                    e820[j+1].size = rdm_end - rdm_start;
> +                    e820[j+1].type = E820_RESERVED;
> +                    next_e820_entry_index++;
> +
> +                    e820[j].size = rdm_start - e820[j].addr;
> +                    e820[j].type = E820_RAM;
> +                }
> +                insert++;
> +            }
> +            /* If we're going last RAM:Hole range */
> +            else if ( end < next_start &&
> +                      rdm_start > start &&
> +                      rdm_end < next_start &&
> +                      type == E820_RAM )
> +            {
> +                if ( do_insert )
> +                {
> +                    memmove(&e820[j+1], &e820[j],
> +                            (sum_nr - j) * sizeof(struct e820entry));
> +
> +                    e820[j].size = rdm_start - e820[j].addr;
> +                    e820[j].type = E820_RAM;
> +
> +                    e820[j+1].addr = rdm_start;
> +                    e820[j+1].size = rdm_end - rdm_start;
> +                    e820[j+1].type = E820_RESERVED;
> +                    next_e820_entry_index++;
> +                }
> +                insert++;
> +            }

This if-else-if series looks horrible - is there really no way to consolidate
it? Also, other than punching holes in the E820 map you don't seem to
be doing anything here. And the earlier tools side patches didn't do
anything about this either. Consequently, at the time where it may
become necessary to establish the 1:1 mapping in the P2M, there'll
be the RAM mapping still there, causing the device assignment to fail.
Again - did you _test_ this scenario with a big enough guest and your
gfx card passed through?

Jan


* Re: [v7][RFC][PATCH 07/13] xen/x86/p2m: introduce p2m_check_reserved_device_memory
  2014-10-24  7:34 ` [v7][RFC][PATCH 07/13] xen/x86/p2m: introduce p2m_check_reserved_device_memory Tiejun Chen
@ 2014-10-24 15:02   ` Jan Beulich
  2014-10-27  8:50     ` Chen, Tiejun
  0 siblings, 1 reply; 180+ messages in thread
From: Jan Beulich @ 2014-10-24 15:02 UTC (permalink / raw)
  To: Tiejun Chen; +Cc: yang.z.zhang, kevin.tian, tim, xen-devel

>>> On 24.10.14 at 09:34, <tiejun.chen@intel.com> wrote:
> --- a/xen/include/asm-x86/p2m.h
> +++ b/xen/include/asm-x86/p2m.h
> @@ -713,6 +713,19 @@ extern int arch_grant_map_page_identity(struct domain *d, unsigned long frame,
>                                   bool_t writeable);
>  extern int arch_grant_unmap_page_identity(struct domain *d, unsigned long frame);
>  
> +/* Check if we are accessing rdm. */
> +static inline int p2m_check_reserved_device_memory(xen_pfn_t start,
> +                                                   xen_ulong_t nr, void *d)
> +{
> +    unsigned long *gfn = d;
> +    xen_pfn_t end = start + nr;
> +
> +    if ( *gfn >= start && *gfn <= end )
> +        return 1;
> +
> +    return 0;
> +}

There is absolutely nothing in the function connecting it to its name.
And the last parameter being void looks pretty bogus too. Both
signs of it (again) being wrong to add such a function without at
least one first caller.

Also the way you return from the function suggests the return
type really ought to be bool_t. And instead of the last four lines in
the function body you could again use a single simple return
statement.
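
For illustration only, a minimal sketch of the shape suggested above: a
bool_t return type and a single return statement. The typedefs are
stand-ins so the snippet compiles on its own (they are not the real Xen
definitions), the name gfn_in_range is hypothetical, and treating the range
as half-open [start, start + nr) is an assumption; the quoted patch compares
with <= end.

```c
#include <stdint.h>

/* Stand-in typedefs so the sketch compiles outside the Xen tree. */
typedef uint64_t xen_pfn_t;
typedef uint64_t xen_ulong_t;
typedef char bool_t;

/*
 * Hypothetical helper: returns non-zero if gfn lies in the half-open
 * range [start, start + nr).
 */
static inline bool_t gfn_in_range(xen_pfn_t start, xen_ulong_t nr,
                                  unsigned long gfn)
{
    return gfn >= start && gfn < start + nr;
}
```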

Jan


* Re: [v7][RFC][PATCH 08/13] xen/x86/p2m: set p2m_access_n for reserved device memory mapping
  2014-10-24  7:34 ` [v7][RFC][PATCH 08/13] xen/x86/p2m: set p2m_access_n for reserved device memory mapping Tiejun Chen
@ 2014-10-24 15:11   ` Jan Beulich
  2014-10-27  9:05     ` Chen, Tiejun
  0 siblings, 1 reply; 180+ messages in thread
From: Jan Beulich @ 2014-10-24 15:11 UTC (permalink / raw)
  To: Tiejun Chen; +Cc: yang.z.zhang, kevin.tian, tim, xen-devel

>>> On 24.10.14 at 09:34, <tiejun.chen@intel.com> wrote:
> --- a/xen/arch/x86/mm/p2m.c
> +++ b/xen/arch/x86/mm/p2m.c
> @@ -686,6 +686,30 @@ guest_physmap_add_entry(struct domain *d, unsigned long gfn,
>      /* Now, actually do the two-way mapping */
>      if ( mfn_valid(_mfn(mfn)) ) 
>      {
> +
> +        if ( !is_hardware_domain(d) )
> +        {
> +            rc = iommu_get_reserved_device_memory(p2m_check_reserved_device_memory,
> +                                                  &gfn);

Okay, now I see what that function is needed for. It being an inline
function is of course very questionable looking at this use site.

> +            if ( rc )

And the return value from the called function is of type int -
non-zero may not just mean "true" but (when negative) also
"error". You need to distinguish these cases.

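A minimal sketch of the distinction asked for here, assuming the usual
convention that a negative value is an error code, zero means "no match"
and a positive value means "match". The enum and the function name are
hypothetical and only serve the illustration.

```c
/* Hypothetical classification of an int result from a callback walk. */
enum rdm_check { RDM_ERROR = -1, RDM_MISS = 0, RDM_HIT = 1 };

static enum rdm_check classify_rc(int rc)
{
    if ( rc < 0 )
        return RDM_ERROR;   /* an error must not be treated as a match */

    return rc ? RDM_HIT : RDM_MISS;
}
```
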
> +            {
> +                /*
> +                 * Just set p2m_access_n in case of shared-ept
> +                 * or non-shared ept but 1:1 mapping.
> +                 */
> +                if ( iommu_use_hap_pt(d) ||
> +                     (!iommu_use_hap_pt(d) && mfn == gfn) )

How would, other than by chance, mfn equal gfn here? Also the
double use of iommu_use_hap_pt(d) is pointless here.
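
The redundancy can be shown with plain booleans. In this sketch a stands
for iommu_use_hap_pt(d) and b for (mfn == gfn); both function names are
hypothetical. The two forms agree for every input, so the second test of a
adds nothing.

```c
/* a: iommu_use_hap_pt(d), b: (mfn == gfn); names are illustrative only. */
static int cond_as_posted(int a, int b)
{
    return a || (!a && b);
}

static int cond_simplified(int a, int b)
{
    return a || b;
}
```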

> +                {
> +                    rc = p2m_set_entry(p2m, gfn, _mfn(mfn), page_order, t,
> +                                       p2m_access_n);
> +                    if ( rc )
> +                        gdprintk(XENLOG_WARNING, "set rdm p2m failed: (%#lx)\n",
> +                                 gfn);

Such messages are (due to acting on a foreign domain) relatively
useless without also logging the domain that is affected. Conversely,
logging the current domain and vCPU (due to using gdprintk()) is
rather pointless. Also please drop either the colon or the
parentheses in the message.

Jan


* Re: [v7][RFC][PATCH 01/13] xen: RMRR fix
  2014-10-24 10:52 ` [v7][RFC][PATCH 01/13] xen: RMRR fix Jan Beulich
@ 2014-10-27  2:00   ` Chen, Tiejun
  2014-10-27  9:41     ` Jan Beulich
  0 siblings, 1 reply; 180+ messages in thread
From: Chen, Tiejun @ 2014-10-27  2:00 UTC (permalink / raw)
  To: Jan Beulich; +Cc: yang.z.zhang, kevin.tian, tim, xen-devel

On 2014/10/24 18:52, Jan Beulich wrote:
>>>> On 24.10.14 at 09:34, <tiejun.chen@intel.com> wrote:

Thanks for your review.

>> 5. Before we take real device assignment, any access to RMRR may issue
>> ept_handle_violation because of p2m_access_n. Then we just call
>> update_guest_eip() to return.
>
> I.e. ignore such accesses? Why?

Yeah. This illegal access isn't allowed, but it's enough to ignore it
without further protection or punishment.

Or what procedure do you think should be followed here?

>
>> Now in our case we add a rule:
>>   - if p2m_access_n is set we also set this mapping.
>
> Does that not conflict with eventual use mem-access makes of this
> type?
>

In our case, we always initialize these RMRR ranges with p2m_access_n to
make sure we can intercept any illegal access to these ranges until we
reset them to p2m_access_rw via set_identity_p2m_entry(d,
base_pfn, p2m_access_rw).

Thanks
Tiejun


* Re: [v7][RFC][PATCH 01/13] introduce XENMEM_reserved_device_memory_map
  2014-10-24 14:11   ` Jan Beulich
@ 2014-10-27  2:11     ` Chen, Tiejun
  2014-10-27  2:18       ` Chen, Tiejun
  2014-10-27  9:42       ` Jan Beulich
  0 siblings, 2 replies; 180+ messages in thread
From: Chen, Tiejun @ 2014-10-27  2:11 UTC (permalink / raw)
  To: Jan Beulich; +Cc: yang.z.zhang, kevin.tian, tim, xen-devel

On 2014/10/24 22:11, Jan Beulich wrote:
>>>> On 24.10.14 at 09:34, <tiejun.chen@intel.com> wrote:
>> From: Jan Beulich <jbeulich@suse.com>
>>
>> This is a prerequisite for punching holes into HVM and PVH guests' P2M
>> to allow passing through devices that are associated with (on VT-d)
>> RMRRs.
>>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>> Signed-off-by: Tiejun Chen <tiejun.chen@intel.com>
>
> I'm confused - you dropped Kevins ack and instead added you S-o-b

I need to add this ACK.

> despite you not having changed anything in the patch.

The original you sent to me previously needed to be rebased on the
latest tree. Please check xen/include/public/memory.h.

The original:

--- a/xen/include/public/memory.h
+++ b/xen/include/public/memory.h
@@ -573,7 +573,29 @@ struct vnuma_topology_info {
  typedef struct vnuma_topology_info vnuma_topology_info_t;
  DEFINE_XEN_GUEST_HANDLE(vnuma_topology_info_t);

-/* Next available subop number is 27 */

Mine:

--- a/xen/include/public/memory.h
+++ b/xen/include/public/memory.h
@@ -523,7 +523,29 @@ DEFINE_XEN_GUEST_HANDLE(xen_mem_sharing_

  #endif /* defined(__XEN__) || defined(__XEN_TOOLS__) */

-/* Next available subop number is 26 */

Unless you're saying this kind of rebase shouldn't introduce my SOB.

>
> That said - in my local version I did a little bit of renaming. I'm
> attaching the most recent version for your reference.

I will pick this up and try it.

Thanks
Tiejun


* Re: [v7][RFC][PATCH 01/13] introduce XENMEM_reserved_device_memory_map
  2014-10-27  2:11     ` Chen, Tiejun
@ 2014-10-27  2:18       ` Chen, Tiejun
  2014-10-27  9:42       ` Jan Beulich
  1 sibling, 0 replies; 180+ messages in thread
From: Chen, Tiejun @ 2014-10-27  2:18 UTC (permalink / raw)
  To: Jan Beulich; +Cc: yang.z.zhang, kevin.tian, tim, xen-devel

On 2014/10/27 10:11, Chen, Tiejun wrote:
> On 2014/10/24 22:11, Jan Beulich wrote:
>>>>> On 24.10.14 at 09:34, <tiejun.chen@intel.com> wrote:
>>> From: Jan Beulich <jbeulich@suse.com>
>>>
>>> This is a prerequisite for punching holes into HVM and PVH guests' P2M
>>> to allow passing through devices that are associated with (on VT-d)
>>> RMRRs.
>>>
>>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>>> Signed-off-by: Tiejun Chen <tiejun.chen@intel.com>
>>
>> I'm confused - you dropped Kevins ack and instead added you S-o-b
>
> I need to add this ACK.
>
>> despite you not having changed anything in the patch.
>
> The original you sent to me previously is needed to rebase on the
> latest. Please check xen/include/public/memory.h.
>
> The original:

s/The original/Mine

>
> --- a/xen/include/public/memory.h
> +++ b/xen/include/public/memory.h
> @@ -573,7 +573,29 @@ struct vnuma_topology_info {
>   typedef struct vnuma_topology_info vnuma_topology_info_t;
>   DEFINE_XEN_GUEST_HANDLE(vnuma_topology_info_t);
>
> -/* Next available subop number is 27 */
>
> Mine:

s/Mine/The original

>
> --- a/xen/include/public/memory.h
> +++ b/xen/include/public/memory.h
> @@ -523,7 +523,29 @@ DEFINE_XEN_GUEST_HANDLE(xen_mem_sharing_
>
>   #endif /* defined(__XEN__) || defined(__XEN_TOOLS__) */
>
> -/* Next available subop number is 26 */
>
> Unless you're saying this kind of rebase shouldn't introduce my SOB.
>
>>
>> That said - in my local version I did a little bit of renaming. I'm
>> attaching the most recent version for your reference.
>
> I will pick this to try.
>
> Thanks
> Tiejun
>


* Re: [v7][RFC][PATCH 04/13] hvmloader/util: get reserved device memory maps
  2014-10-24 14:22   ` Jan Beulich
@ 2014-10-27  3:12     ` Chen, Tiejun
  2014-10-27  9:45       ` Jan Beulich
  0 siblings, 1 reply; 180+ messages in thread
From: Chen, Tiejun @ 2014-10-27  3:12 UTC (permalink / raw)
  To: Jan Beulich; +Cc: yang.z.zhang, kevin.tian, tim, xen-devel

On 2014/10/24 22:22, Jan Beulich wrote:
>>>> On 24.10.14 at 09:34, <tiejun.chen@intel.com> wrote:
>> --- a/tools/firmware/hvmloader/util.c
>> +++ b/tools/firmware/hvmloader/util.c
>> @@ -828,6 +828,72 @@ int hpet_exists(unsigned long hpet_base)
>>       return ((hpet_id >> 16) == 0x8086);
>>   }
>>
>> +int get_reserved_device_memory_map(struct xen_mem_reserved_device_memory entries[],
>> +                                   uint32_t *max_entries)
>> +{
>> +    int rc;
>> +    struct xen_mem_reserved_device_memory_map memmap = {
>> +        .nr_entries = *max_entries
>> +    };
>> +
>> +    set_xen_guest_handle(memmap.buffer, entries);
>> +
>> +    rc = hypercall_memory_op(XENMEM_reserved_device_memory_map, &memmap);
>> +    if (rc == -ENOBUFS)
>
> Coding style.

     if ( rc == -ENOBUFS )

>
>> +        *max_entries = memmap.nr_entries;
>> +
>> +    return rc;
>> +}
>> +
>> +/* Getting all reserved device memory map info in case of hvmloader. */
>> +int hvm_get_reserved_device_memory_map(void)
>> +{
>> +    static uint32_t nr_entries = 0;
>> +    int rc = 0;
>> +
>> +    if ( !rdm_map )
>> +    {
>> +        /* Assume we have one entry if not enough we'll expand.*/
>> +        nr_entries = 1;
>> +        rdm_map = mem_alloc(nr_entries *
>> +                            sizeof(struct xen_mem_reserved_device_memory), 0);
>> +        if ( !rdm_map )
>> +        {
>> +            printf("No space to get reserved dev memory maps!\n");
>> +            return rc;
>> +        }
>> +
>> +        rc = get_reserved_device_memory_map(rdm_map, &nr_entries);
>> +        if ( rc == -ENOBUFS )
>> +        {
>> +            rdm_map = mem_alloc(nr_entries *
>> +                                sizeof(struct xen_mem_reserved_device_memory),
>> +                                0);
>> +            if ( rdm_map )
>> +            {
>> +                rc = get_reserved_device_memory_map(rdm_map, &nr_entries);
>> +                if ( rc )
>> +                {
>> +                    printf("Could not get reserved dev memory info on domain");
>> +                    return rc;
>> +                }
>> +            }
>> +            else
>> +            {
>> +                printf("No space to get reserved dev memory maps!\n");
>> +                return rc;
>> +            }
>> +        }
>> +        else if ( rc )
>> +        {
>> +            printf("Could not get reserved dev memory info on domain");
>> +            return rc;
>> +        }
>> +    }
>> +
>> +    return nr_entries;
>> +}
>
> I continue to think that adding these functions without user isn't
> really worthwhile in a separate patch.

These functions will be called in pci.c:pci_setup() and
e820.c:build_e820_table(). If you really prefer one big patch, I can squash
these three patches into one.

>
>> --- a/tools/firmware/hvmloader/util.h
>> +++ b/tools/firmware/hvmloader/util.h
>> @@ -241,6 +241,12 @@ int build_e820_table(struct e820entry *e820,
>>                        unsigned int bios_image_base);
>>   void dump_e820_table(struct e820entry *e820, unsigned int nr);
>>
>> +#include <xen/memory.h>
>> +#define ENOBUFS     105 /* No buffer space available */
>
> This is a joke I hope? The #include belongs at the top (albeit afaict
> you don't really need it here), and the #define is completely

Without this line, #include <xen/memory.h>, the build fails:

In file included from build.c:25:0:
../util.h:246:70: error: array type has incomplete element type
  int get_reserved_device_memory_map(struct xen_reserved_device_memory entries[],
                                                                       ^
make[8]: *** [build.o] Error 1

> misplaced here. While I generally wouldn't recommend doing this, I
> think in the case here including the hypervisor header that defines
> them would be okay. Perhaps not via relative path, but via having

It seems we just need to include this:

#include <errno.h>

> the Makefile symlink the hypervisor header here.
>
>> +struct xen_mem_reserved_device_memory *rdm_map;
>> +int get_reserved_device_memory_map(struct xen_mem_reserved_device_memory entries[],
>> +                                   uint32_t *max_entries);
>> +int hvm_get_reserved_device_memory_map(void);
>>   #ifndef NDEBUG
>
> Blank line missing after your additions.
>

Added.

Thanks
Tiejun


* Re: [v7][RFC][PATCH 04/13] hvmloader/util: get reserved device memory maps
  2014-10-24 14:27   ` Jan Beulich
@ 2014-10-27  5:07     ` Chen, Tiejun
  0 siblings, 0 replies; 180+ messages in thread
From: Chen, Tiejun @ 2014-10-27  5:07 UTC (permalink / raw)
  To: Jan Beulich; +Cc: yang.z.zhang, kevin.tian, tim, xen-devel

On 2014/10/24 22:27, Jan Beulich wrote:
>>>> On 24.10.14 at 09:34, <tiejun.chen@intel.com> wrote:
>> --- a/tools/firmware/hvmloader/util.h
>> +++ b/tools/firmware/hvmloader/util.h
>> @@ -241,6 +241,12 @@ int build_e820_table(struct e820entry *e820,
>>                        unsigned int bios_image_base);
>>   void dump_e820_table(struct e820entry *e820, unsigned int nr);
>>
>> +#include <xen/memory.h>
>> +#define ENOBUFS     105 /* No buffer space available */
>> +struct xen_mem_reserved_device_memory *rdm_map;
>
> Oh, and - without "extern" this creates an instance in each
> translation unit (and would cause build failures the moment
> someone passed -fno-common to the compiler).
>

So instead, I'll define this in util.c and then declare it with 'extern'
where it is used.
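
As a sketch of that split, with both halves shown in one block for
brevity: the header would carry only the declaration, and exactly one
source file would carry the definition. Using a forward declaration of the
struct here is an assumption made to keep the snippet self-contained.

```c
#include <stddef.h>

/* Forward declaration; a pointer does not need the complete type. */
struct xen_mem_reserved_device_memory;

/* Header side (e.g. util.h): declaration only, no storage is created. */
extern struct xen_mem_reserved_device_memory *rdm_map;

/* Source side (e.g. util.c): the single definition of the object. */
struct xen_mem_reserved_device_memory *rdm_map = NULL;
```

With this arrangement every translation unit that includes the header
refers to the same object, and building with -fno-common no longer yields
duplicate definitions.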

Thanks
Tiejun


* Re: [v7][RFC][PATCH 05/13] hvmloader/mmio: reconcile guest mmio with reserved device memory
  2014-10-24 14:42   ` Jan Beulich
@ 2014-10-27  7:12     ` Chen, Tiejun
  2014-10-27  9:56       ` Jan Beulich
  0 siblings, 1 reply; 180+ messages in thread
From: Chen, Tiejun @ 2014-10-27  7:12 UTC (permalink / raw)
  To: Jan Beulich; +Cc: yang.z.zhang, kevin.tian, tim, xen-devel

On 2014/10/24 22:42, Jan Beulich wrote:
>>>> On 24.10.14 at 09:34, <tiejun.chen@intel.com> wrote:
>> --- a/tools/firmware/hvmloader/pci.c
>> +++ b/tools/firmware/hvmloader/pci.c
>> @@ -37,6 +37,44 @@ uint64_t pci_hi_mem_start = 0, pci_hi_mem_end = 0;
>>   enum virtual_vga virtual_vga = VGA_none;
>>   unsigned long igd_opregion_pgbase = 0;
>>
>> +unsigned int need_skip_rmrr = 0;
>
> Static (and without initializer)?

static unsigned int need_skip_rmrr;

>
>> +
>> +/*
>> + * Check whether there exists mmio hole in the specified memory range.
>> + * Returns 1 if exists, else returns 0.
>> + */
>> +static int check_mmio_hole_confliction(uint64_t start, uint64_t memsize,
>
> I don't think the word "confliction" exists. "conflict" please.

s/check_mmio_hole_confliction/check_mmio_hole_conflict/g

>
>> +                           uint64_t mmio_start, uint64_t mmio_size)
>> +{
>> +    if ( start + memsize <= mmio_start || start >= mmio_start + mmio_size )
>> +        return 0;
>> +    else
>> +        return 1;
>
> Make this a simple single return statement?

static int check_mmio_hole_conflict(uint64_t start, uint64_t memsize,
                                     uint64_t mmio_start, uint64_t mmio_size)
{
     if ( start + memsize <= mmio_start || start >= mmio_start + mmio_size )
         return 0;

     return 1;
}
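
For comparison, a sketch of the single-return form that was requested. It
is the logical negation of the "no overlap" test above, so the behaviour is
meant to be identical; the ranges are treated as half-open, as in the
posted code.

```c
#include <stdint.h>

/*
 * Returns 1 if [start, start + memsize) overlaps
 * [mmio_start, mmio_start + mmio_size), else 0.
 */
static int check_mmio_hole_conflict(uint64_t start, uint64_t memsize,
                                    uint64_t mmio_start, uint64_t mmio_size)
{
    return start < mmio_start + mmio_size && mmio_start < start + memsize;
}
```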

>
>> +}
>> +
>> +static int check_reserved_device_memory_map(uint64_t mmio_base,
>> +                                            uint64_t mmio_max)
>> +{
>> +    uint32_t i = 0;
>
> Pointless initializer.
>
>> +    uint64_t rdm_start, rdm_end;
>> +    int nr_entries = -1;
>
> And again.
>
>> +
>> +    nr_entries = hvm_get_reserved_device_memory_map();
>
> It's completely unclear why this can't be the variable's initializer.

     uint32_t i;
     uint64_t rdm_start, rdm_end;
     int nr_rdm_entries = hvm_get_reserved_device_memory_map();

>
>> +
>> +    for ( i = 0; i < nr_entries; i++ )
>> +    {
>> +        rdm_start = rdm_map[i].start_pfn << PAGE_SHIFT;
>> +        rdm_end = rdm_start + (rdm_map[i].nr_pages << PAGE_SHIFT);
>
> I'm pretty certain I pointed out before that you can't simply shift
> these fields - you risk losing significant bits.

I went back through the earlier discussion but only found you saying
I shouldn't use PAGE_SHIFT and PAGE_SIZE at the same time. If I'm still
missing something, could you show me what you expect?
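
One possible reading of the concern, as a sketch: if the frame-number field
is only 32 bits wide in this build (an assumption here, not something
stated in the thread), the shift is carried out in 32 bits and the upper
bits are lost before the result is widened to uint64_t. Casting first
avoids that. Both function names are hypothetical.

```c
#include <stdint.h>

#define PAGE_SHIFT 12

/* Shift done in 32 bits: frames at or above (4GiB >> PAGE_SHIFT) wrap. */
static uint32_t pfn_to_addr_narrow(uint32_t pfn)
{
    return pfn << PAGE_SHIFT;
}

/* Widen first, then shift: no significant bits are lost. */
static uint64_t pfn_to_addr(uint32_t pfn)
{
    return (uint64_t)pfn << PAGE_SHIFT;
}
```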

>
>> +        if ( check_mmio_hole_confliction(rdm_start, (rdm_end - rdm_start),
>
> Pointless parentheses.

         if ( check_mmio_hole_conflict(rdm_start, rdm_end - rdm_start,

>
>> +                                         mmio_base, mmio_max - mmio_base) )
>> +        {
>> +            need_skip_rmrr++;
>> +        }
>> +    }
>> +
>> +    return nr_entries;
>> +}
>> +
>>   void pci_setup(void)
>>   {
>>       uint8_t is_64bar, using_64bar, bar64_relocate = 0;
>> @@ -58,7 +96,9 @@ void pci_setup(void)
>>           uint32_t bar_reg;
>>           uint64_t bar_sz;
>>       } *bars = (struct bars *)scratch_start;
>> -    unsigned int i, nr_bars = 0;
>> +    unsigned int i, j, nr_bars = 0;
>> +    int nr_entries = 0;
>
> And another pointless initializer. Plus as a count of something this

int nr_rdm_entries;

> surely wants to be "unsigned int". Also I guess the variable name

nr_rdm_entries should literally be unsigned int, but this value is always
set from hvm_get_reserved_device_memory_map(),

nr_rdm_entries = hvm_get_reserved_device_memory_map()

and I expect that return value can be negative in some failure cases.

> is too generic - nr_rdm_entries perhaps?

Okay, s/nr_entries/nr_rdm_entries/g

>
>> @@ -363,11 +411,29 @@ void pci_setup(void)
>>               bar_data &= ~PCI_BASE_ADDRESS_IO_MASK;
>>           }
>>
>> + reallocate_mmio:
>>           base = (resource->base  + bar_sz - 1) & ~(uint64_t)(bar_sz - 1);
>>           bar_data |= (uint32_t)base;
>>           bar_data_upper = (uint32_t)(base >> 32);
>>           base += bar_sz;
>>
>> +        if ( need_skip_rmrr )
>> +        {
>> +            for ( j = 0; j < nr_entries; j++ )
>> +            {
>> +                rdm_start = rdm_map[j].start_pfn << PAGE_SHIFT;
>> +                rdm_end = rdm_start + (rdm_map[j].nr_pages << PAGE_SHIFT);
>> +                if ( check_mmio_hole_confliction(rdm_start,
>> +                                                 (rdm_end - rdm_start),
>> +                                                 base, bar_sz) )
>
> "base" was already updated by this point.

Looks like I should insert these code fragments between

bar_data_upper = (uint32_t)(base >> 32);

and

base += bar_sz;

So (See below)

>
>> +                {
>> +                    resource->base = rdm_end;
>> +                    need_skip_rmrr--;
>> +                    goto reallocate_mmio;
>
> If you ever get here, the earlier determination of whether the MMIO
> hole is large enough may get invalidated. I.e. I'm afraid this approach
> is still too simplistic. Also the way you do the retry, the resulting bar_data

Here I'd move 'reallocate_mmio' down one line, and

s/resource->base = rdm_end;/base = (rdm_end  + bar_sz - 1) & ~(uint64_t)(bar_sz - 1);

Then,
	...
         base = (resource->base  + bar_sz - 1) & ~(uint64_t)(bar_sz - 1);
  reallocate_mmio:
         bar_data |= (uint32_t)base;
         bar_data_upper = (uint32_t)(base >> 32);

         if ( need_skip_rmrr )
         {
             for ( j = 0; j < nr_rdm_entries; j++ )
             {
                 rdm_start = rdm_map[j].start_pfn << PAGE_SHIFT;
                 rdm_end = rdm_start + (rdm_map[j].nr_pages << PAGE_SHIFT);
                 if ( check_mmio_hole_conflict(rdm_start, rdm_end - rdm_start,
                                               base, bar_sz) )
                 {
                     base = (rdm_end  + bar_sz - 1) & ~(uint64_t)(bar_sz - 1);
                     need_skip_rmrr--;
                     goto reallocate_mmio;
                 }
             }
         }

         base += bar_sz;

Additionally, there is some original code right after my addition:

         if ( need_skip_rmrr )
         {
             ...
         }

         base += bar_sz;

         if ( (base < resource->base) || (base > resource->max) )
         {
             printf("pci dev %02x:%x bar %02x size "PRIllx": no space for "
                    "resource!\n", devfn>>3, devfn&7, bar_reg,
                    PRIllx_arg(bar_sz));
             continue;
         }

This guarantees we don't go beyond the MMIO resource range.
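
To make the retry-plus-bounds-check idea concrete, here is a standalone
sketch, not the hvmloader code itself. The struct, the helper names and the
half-open range convention are assumptions made for the illustration, and
address overflow is ignored. The candidate base only ever moves upwards,
so the loop ends after at most one move per reserved range.

```c
#include <stddef.h>
#include <stdint.h>

struct rdm_range { uint64_t start, end; };   /* half-open: [start, end) */

/* Round addr up to a multiple of size (size must be a power of two). */
static uint64_t align_up(uint64_t addr, uint64_t size)
{
    return (addr + size - 1) & ~(size - 1);
}

/*
 * Find an aligned base for a BAR of bar_sz bytes inside [res_base, res_max)
 * that avoids every reserved range. Returns 0 and stores the base in *out,
 * or -1 if the BAR no longer fits once the reserved ranges are skipped.
 */
static int place_bar(uint64_t res_base, uint64_t res_max, uint64_t bar_sz,
                     const struct rdm_range *rdm, size_t nr, uint64_t *out)
{
    uint64_t base = align_up(res_base, bar_sz);
    size_t i = 0;

    while ( i < nr )
    {
        if ( base < rdm[i].end && rdm[i].start < base + bar_sz )
        {
            /* Conflict: move past this range and re-check all ranges. */
            base = align_up(rdm[i].end, bar_sz);
            i = 0;
        }
        else
            i++;
    }

    if ( base + bar_sz > res_max )
        return -1;

    *out = base;
    return 0;
}
```

Computing the final base first and deriving the BAR register values from it
afterwards would avoid OR-ing bits of a rejected base into bar_data.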

> and bar_data_upper will likely end up being garbage.
>
> Did you actually _test_ this code?

Actually in my real case those RMRR ranges are always below MMIO.

Thanks
Tiejun


* Re: [v7][RFC][PATCH 06/13] hvmloader/ram: check if guest memory is out of reserved device memory maps
  2014-10-24 14:56   ` Jan Beulich
@ 2014-10-27  8:09     ` Chen, Tiejun
  2014-10-27 10:17       ` Jan Beulich
  0 siblings, 1 reply; 180+ messages in thread
From: Chen, Tiejun @ 2014-10-27  8:09 UTC (permalink / raw)
  To: Jan Beulich; +Cc: yang.z.zhang, kevin.tian, tim, xen-devel

On 2014/10/24 22:56, Jan Beulich wrote:
>>>> On 24.10.14 at 09:34, <tiejun.chen@intel.com> wrote:
>> We need to check to reserve all reserved device memory maps in e820
>> to avoid any potential guest memory conflict.
>>
>> Currently, if we can't insert RDM entries directly, we may need to handle
>> several ranges as follows:
>> a. Fixed Ranges --> BUG()
>>   lowmem_reserved_base-0xA0000: reserved by BIOS implementation,
>>   BIOS region,
>>   RESERVED_MEMBASE ~ 0x100000000,
>
> This seems conceptually wrong to me, and I said so before:
> Depending on host characteristics this approach may mean you're
> going to be unable to build any HVM guests. Minimally there needs
> to be a way to avoid these checks (resulting in devices associated
> with RMRRs not being assignable to such a guest). I'm therefore

I just use 'err' to indicate whether these fixed ranges overlap an RMRR:

+    /* These overlap may issue guest can't work well. */
+    if ( err )
+    {
+        printf("Guest can't work with some reserved device memory overlap!\n");
+        BUG();
+    }

As I understand it, these fixed ranges aren't like RAM, which we can
safely move out of any RMRR overlap. And it's actually rare for an RMRR to
overlap those fixed ranges.

But I can remove the BUG() if you insist on this point.

> only going to briefly look at the rest of this patch.
>
>> +static unsigned int construct_rdm_e820_maps(unsigned int next_e820_entry_index,
>> +                                            uint32_t nr_map,
>> +                                            struct
>> xen_mem_reserved_device_memory *map,
>> +                                            struct e820entry *e820,
>> +                                            unsigned int lowmem_reserved_base,
>> +                                            unsigned int bios_image_base)
>> +{
>> +    unsigned int i, j, sum_nr = next_e820_entry_index + nr_map;
>> +    uint64_t start, end, next_start, rdm_start, rdm_end;
>> +    uint32_t type;
>> +    unsigned int insert = 0, do_insert = 0;
>> +    int err = 0;
>> +
>> + do_real_construct:
>> +    for ( i = 0; i < nr_map; i++ )
>> +    {
>> +        rdm_start = map[i].start_pfn << PAGE_SHIFT;
>> +        rdm_end = rdm_start + (map[i].nr_pages << PAGE_SHIFT);
>> +
>> +        for ( j = 0; j < next_e820_entry_index - 1; j++ )
>> +        {
>> +            start = e820[j].addr;
>> +            end = e820[j].addr + e820[j].size;
>> +            type = e820[j].type;
>> +            next_start = e820[j+1].addr;
>> +
>> +            /* lowmem_reserved_base-0xA0000: reserved by BIOS implementation. */
>> +            if ( lowmem_reserved_base < 0xA0000 &&
>> +                 start == lowmem_reserved_base )
>> +            {
>> +                if ( rdm_start >= start && rdm_start <= end )
>> +                {
>> +                    err = -1;
>> +                    break;
>> +                }
>> +            }
>> +
>> +            /*
>> +             * BIOS region.
>> +             */
>> +            if ( start == bios_image_base )
>> +            {
>> +                if ( rdm_start >= start && rdm_start <= end )
>> +                {
>> +                    err = -1;
>> +                    break;
>> +                }
>> +            }
>> +
>> +            /* The default memory map always occupy one fixed reserved
>> +             * range: RESERVED_MEMBASE ~ 0x100000000
>> +             */
>> +            if ( rdm_start >= RESERVED_MEMBASE &&
>> +                      rdm_start <= ((uint64_t)1 << 32) )
>> +            {
>> +                err = -1;
>> +                break;
>> +            }
>> +
>> +            /* Just amid those remaining e820 entries. */
>> +            if ( (rdm_start > end) && (rdm_end < next_start) )
>> +            {
>> +                if ( do_insert )
>> +                {
>> +                    memmove(&e820[j+2], &e820[j+1],
>> +                            (sum_nr - j - 1) * sizeof(struct e820entry));
>> +
>> +                    /* Then fill RMRR into that entry. */
>> +                    e820[j+1].addr = rdm_start;
>> +                    e820[j+1].size = rdm_end - rdm_start;
>> +                    e820[j+1].type = E820_RESERVED;
>> +                    next_e820_entry_index++;
>> +                }
>> +                insert++;
>> +            }
>> +            /* Already at the end. */
>> +            else if ( (rdm_start > end) && !next_start )
>> +            {
>> +                if ( do_insert )
>> +                {
>> +                    e820[next_e820_entry_index].addr = rdm_start;
>> +                    e820[next_e820_entry_index].size = rdm_end - rdm_start;
>> +                    e820[next_e820_entry_index].type = E820_RESERVED;
>> +                    next_e820_entry_index++;
>> +                }
>> +                insert++;
>> +            }
>> +            /* If completely overlap with one RAM range. */
>> +            else if ( rdm_start == start && rdm_end == end && type == E820_RAM )
>
> Comment and expression disagree.

What about this?

/* If coincide with one RAM range. */

>
>> +            {
>> +                if ( do_insert )
>> +                    e820[j].type = E820_RESERVED;
>> +                insert++;
>> +            }
>> +            /* If we're just alligned with start of one RAM range. */
>> +            else if ( rdm_start == start && rdm_end < end && type == E820_RAM )
>> +            {
>> +                if ( do_insert )
>> +                {
>> +                    memmove(&e820[j+1], &e820[j],
>> +                            (sum_nr - j) * sizeof(struct e820entry));
>> +
>> +                    e820[j+1].addr = rdm_end;
>> +                    e820[j+1].size = e820[j].addr + e820[j].size - rdm_end;
>> +                    e820[j+1].type = E820_RAM;
>> +                    next_e820_entry_index++;
>> +
>> +                    e820[j].addr = rdm_start;
>> +                    e820[j].size = rdm_end - rdm_start;
>> +                    e820[j].type = E820_RESERVED;
>> +                }
>> +                insert++;
>> +            }
>> +            /* If we're just alligned with end of one RAM range. */
>> +            else if ( rdm_start > start && rdm_end == end && type == E820_RAM )
>> +            {
>> +                if ( do_insert )
>> +                {
>> +                    memmove(&e820[j+1], &e820[j],
>> +                            (sum_nr - j) * sizeof(struct e820entry));
>> +
>> +                    e820[j].size = rdm_start - e820[j].addr;
>> +                    e820[j].type = E820_RAM;
>> +
>> +                    e820[j+1].addr = rdm_start;
>> +                    e820[j+1].size = rdm_end - rdm_start;
>> +                    e820[j+1].type = E820_RESERVED;
>> +                    next_e820_entry_index++;
>> +                }
>> +                insert++;
>> +            }
>> +            /* If we're just in of one RAM range */
>> +            else if ( rdm_start > start && rdm_end < end && type == E820_RAM )
>> +            {
>> +                if ( do_insert )
>> +                {
>> +                    memmove(&e820[j+2], &e820[j],
>> +                            (sum_nr - j) * sizeof(struct e820entry));
>> +
>> +                    e820[j+2].addr = rdm_end;
>> +                    e820[j+2].size = e820[j].addr + e820[j].size - rdm_end;
>> +                    e820[j+2].type = E820_RAM;
>> +                    next_e820_entry_index++;
>> +
>> +                    e820[j+1].addr = rdm_start;
>> +                    e820[j+1].size = rdm_end - rdm_start;
>> +                    e820[j+1].type = E820_RESERVED;
>> +                    next_e820_entry_index++;
>> +
>> +                    e820[j].size = rdm_start - e820[j].addr;
>> +                    e820[j].type = E820_RAM;
>> +                }
>> +                insert++;
>> +            }
>> +            /* If we're going last RAM:Hole range */
>> +            else if ( end < next_start &&
>> +                      rdm_start > start &&
>> +                      rdm_end < next_start &&
>> +                      type == E820_RAM )
>> +            {
>> +                if ( do_insert )
>> +                {
>> +                    memmove(&e820[j+1], &e820[j],
>> +                            (sum_nr - j) * sizeof(struct e820entry));
>> +
>> +                    e820[j].size = rdm_start - e820[j].addr;
>> +                    e820[j].type = E820_RAM;
>> +
>> +                    e820[j+1].addr = rdm_start;
>> +                    e820[j+1].size = rdm_end - rdm_start;
>> +                    e820[j+1].type = E820_RESERVED;
>> +                    next_e820_entry_index++;
>> +                }
>> +                insert++;
>> +            }
>
> This if-else-if series looks horrible - is there really no way to consolidate
> it? Also, other than punching holes in the E820 map you don't seem to

I know this is ugly, but as you know there is no rule we can exploit in
this case. An RMRR can start anywhere, so we have to handle every
scenario:

1. Just amid the remaining e820 entries.
2. Already at the end.
3. Coinciding with one RAM range.
4. Aligned with the start of one RAM range.
5. Aligned with the end of one RAM range.
6. Strictly inside one RAM range.
7. In the last RAM:Hole range.

So if you think we're handling these correctly, we can keep optimizing
along these lines once we have a better idea.

> be doing anything here. And the earlier tools side patches didn't do
> anything about this either. Consequently, at the time where it may
> become necessary to establish the 1:1 mapping in the P2M, there'll
> be the RAM mapping still there, causing the device assignment to fail.

But I already set these ranges as p2m_access_n, and as you can see I also
reserved these ranges in the e820 table. So although the RAM mapping is
still there, nothing actually accesses it.

Then when we assign a device, we override these p2m entries with a 1:1
mapping if p2m_access_n is set.

> Again - did you _test_ this scenario with a big enough guest and your
> gfx card passed through?

Yes. I validated these patches with the following configuration:

memory = 2816
gfx_passthru=1
pci=["00:02.0"]

I also ran 'xl dmesg' to check the e820 table. Here is an instance with
overlap:

RMRR range:

root@tchen0-Shark-Bay-Client-platform:/home/tchen0/workspace# xl dmesg | grep RMRR
(XEN) [VT-D]dmar.c:834: found ACPI_DMAR_RMRR:
(XEN) [VT-D]dmar.c:679:   RMRR region: base_addr ab80a000 end_address ab81dfff
(XEN) [VT-D]dmar.c:834: found ACPI_DMAR_RMRR:
(XEN) [VT-D]dmar.c:679:   RMRR region: base_addr ad000000 end_address af7fffff
root@tchen0-Shark-Bay-Client-platform:/home/tchen0/workspace#

Without my patch:

(d4) E820 table:
(d4)  [00]: 00000000:00000000 - 00000000:0009e000: RAM
(d4)  [01]: 00000000:0009e000 - 00000000:000a0000: RESERVED
(d4)  HOLE: 00000000:000a0000 - 00000000:000e0000
(d4)  [02]: 00000000:000e0000 - 00000000:00100000: RESERVED
(d4)  [03]: 00000000:00100000 - 00000000:ab80a000: RAM
(d4)  [04]: 00000000:ab80a000 - 00000000:ab81e000: RESERVED
(d4)  [05]: 00000000:ab81e000 - 00000000:ad000000: RAM
(d4)  [06]: 00000000:ad000000 - 00000000:af800000: RESERVED
(d4)  HOLE: 00000000:af800000 - 00000000:fc000000
(d4)  [07]: 00000000:fc000000 - 00000001:00000000: RESERVED


With my patch:

(d2)  f0000-fffff: Main BIOS
(d2) E820 table:
(d2)  [00]: 00000000:00000000 - 00000000:0009e000: RAM
(d2)  [01]: 00000000:0009e000 - 00000000:000a0000: RESERVED
(d2)  HOLE: 00000000:000a0000 - 00000000:000e0000
(d2)  [02]: 00000000:000e0000 - 00000000:00100000: RESERVED
(d2)  [03]: 00000000:00100000 - 00000000:ab80a000: RAM
(d2)  [04]: 00000000:ab80a000 - 00000000:ab81e000: RESERVED
(d2)  [05]: 00000000:ab81e000 - 00000000:ad000000: RAM
(d2)  [06]: 00000000:ad000000 - 00000000:af800000: RESERVED
(d2)  HOLE: 00000000:af800000 - 00000000:fc000000
(d2)  [07]: 00000000:fc000000 - 00000000:fdffc000: RESERVED
(d2)  [08]: 00000000:fdffc000 - 00000000:fdfff000: NVS
(d2)  [09]: 00000000:fdfff000 - 00000001:00000000: RESERVED
(d2) Invoking ROMBIOS ...

Thanks
Tiejun


* Re: [v7][RFC][PATCH 07/13] xen/x86/p2m: introduce p2m_check_reserved_device_memory
  2014-10-24 15:02   ` Jan Beulich
@ 2014-10-27  8:50     ` Chen, Tiejun
  0 siblings, 0 replies; 180+ messages in thread
From: Chen, Tiejun @ 2014-10-27  8:50 UTC (permalink / raw)
  To: Jan Beulich; +Cc: yang.z.zhang, kevin.tian, tim, xen-devel

On 2014/10/24 23:02, Jan Beulich wrote:
>>>> On 24.10.14 at 09:34, <tiejun.chen@intel.com> wrote:
>> --- a/xen/include/asm-x86/p2m.h
>> +++ b/xen/include/asm-x86/p2m.h
>> @@ -713,6 +713,19 @@ extern int arch_grant_map_page_identity(struct domain *d, unsigned long frame,
>>                                    bool_t writeable);
>>   extern int arch_grant_unmap_page_identity(struct domain *d, unsigned long frame);
>>
>> +/* Check if we are accessing rdm. */
>> +static inline int p2m_check_reserved_device_memory(xen_pfn_t start,
>> +                                                   xen_ulong_t nr, void *d)
>> +{
>> +    unsigned long *gfn = d;
>> +    xen_pfn_t end = start + nr;
>> +
>> +    if ( *gfn >= start && *gfn <= end )
>> +        return 1;
>> +
>> +    return 0;
>> +}
>
> There is absolutely nothing in the function connecting it to its name.
> And the last parameter being void looks pretty bogus too. Both
> signs of it (again) being wrong to add such a function without at
> least one first caller.

It looks like you want me to fold this into the subsequent patch; let's
address this in that patch as well.

Thanks
Tiejun

>
> Also the way you return from the function suggests the return
> type really ought to be bool_t. And instead of the last four lines in
> the function body you could again use a single simple return
> statement.
>
> Jan
>
>


* Re: [v7][RFC][PATCH 08/13] xen/x86/p2m: set p2m_access_n for reserved device memory mapping
  2014-10-24 15:11   ` Jan Beulich
@ 2014-10-27  9:05     ` Chen, Tiejun
  2014-10-27 10:33       ` Jan Beulich
  0 siblings, 1 reply; 180+ messages in thread
From: Chen, Tiejun @ 2014-10-27  9:05 UTC (permalink / raw)
  To: Jan Beulich; +Cc: yang.z.zhang, kevin.tian, tim, xen-devel

On 2014/10/24 23:11, Jan Beulich wrote:
>>>> On 24.10.14 at 09:34, <tiejun.chen@intel.com> wrote:
>> --- a/xen/arch/x86/mm/p2m.c
>> +++ b/xen/arch/x86/mm/p2m.c
>> @@ -686,6 +686,30 @@ guest_physmap_add_entry(struct domain *d, unsigned long gfn,
>>       /* Now, actually do the two-way mapping */
>>       if ( mfn_valid(_mfn(mfn)) )
>>       {
>> +
>> +        if ( !is_hardware_domain(d) )
>> +        {
>> +            rc = iommu_get_reserved_device_memory(p2m_check_reserved_device_memory,
>> +                                                  &gfn);
>
> Okay, now I see what that function is needed for. It being an inline
> function is of course very questionable looking at this use site.

--- a/xen/arch/x86/mm/p2m.c
+++ b/xen/arch/x86/mm/p2m.c
@@ -556,6 +556,17 @@ guest_physmap_remove_page(struct domain *d, unsigned long gfn,
      gfn_unlock(p2m, gfn, page_order);
  }

+/* Check if we are accessing rdm. */
+int p2m_check_reserved_device_memory(xen_pfn_t start,
+                                     xen_ulong_t nr,
+                                     void *d)
+{
+    unsigned long *gfn = d;
+    xen_pfn_t end = start + nr;
+
+   return ( *gfn >= start && *gfn <= end ) ? 1 : 0;
+}
+
  int
  guest_physmap_add_entry(struct domain *d, unsigned long gfn,
                          unsigned long mfn, unsigned int page_order,

>
>> +            if ( rc )
>
> And the return value from the called function is of type int -
> non-zero may not just mean "true" but (when negative) also
> "error". You need to distinguish these cases.

But in our case it's impossible to get a negative value.

>
>> +            {
>> +                /*
>> +                 * Just set p2m_access_n in case of shared-ept
>> +                 * or non-shared ept but 1:1 mapping.
>> +                 */
>> +                if ( iommu_use_hap_pt(d) ||
>> +                     (!iommu_use_hap_pt(d) && mfn == gfn) )
>
> How would, other than by chance, mfn equal gfn here? Also the
> double use of iommu_use_hap_pt(d) is pointless here.

There are two scenarios we need to consider:

#1 Shared EPT.

We always need to check, so iommu_use_hap_pt(d) is good.

#2 Non-shared EPT.

If mfn != gfn, I think the guest doesn't access the RMRR range, so it's
allowed.

>
>> +                {
>> +                    rc = p2m_set_entry(p2m, gfn, _mfn(mfn), page_order, t,
>> +                                       p2m_access_n);
>> +                    if ( rc )
>> +                        gdprintk(XENLOG_WARNING, "set rdm p2m failed: (%#lx)\n",
>> +                                 gfn);
>
> Such messages are (due to acting on a foreign domain) relatively
> useless without also logging the domain that is affected. Conversely,
> logging the current domain and vCPU (due to using gdprintk()) is
> rather pointless. Also please drop either the colon or the
> parentheses in the message.

Can P2M_DEBUG work here?

P2M_DEBUG("set rdm p2m failed: %#lx\n", gfn);

Thanks
Tiejun


* Re: [v7][RFC][PATCH 01/13] xen: RMRR fix
  2014-10-27  2:00   ` Chen, Tiejun
@ 2014-10-27  9:41     ` Jan Beulich
  2014-10-28  8:36       ` Chen, Tiejun
  0 siblings, 1 reply; 180+ messages in thread
From: Jan Beulich @ 2014-10-27  9:41 UTC (permalink / raw)
  To: Tiejun Chen; +Cc: yang.z.zhang, kevin.tian, tim, xen-devel

>>> On 27.10.14 at 03:00, <tiejun.chen@intel.com> wrote:
> On 2014/10/24 18:52, Jan Beulich wrote:
>>>>> On 24.10.14 at 09:34, <tiejun.chen@intel.com> wrote:
>>> 5. Before we take real device assignment, any access to RMRR may issue
>>> ept_handle_violation because of p2m_access_n. Then we just call
>>> update_guest_eip() to return.
>>
>> I.e. ignore such accesses? Why?
> 
> Yeah. This illegal access isn't allowed but its enough to ignore that 
> without further protection or punishment.
> 
> Or what procedure should be concerned here based on your opinion?

If the access is illegal, inject a fault to the guest or kill it, unless you
can explain why ignoring such an access is correct/necessary (e.g.
I could see this being the equivalent of an access to a memory region
the address of which is not being decoded by any component in a
physical system).

>>> Now in our case we add a rule:
>>>   - if p2m_access_n is set we also set this mapping.
>>
>> Does that not conflict with eventual use mem-access makes of this
>> type?
>>
> 
> In our case, we always initialize these RMRR ranges with p2m_access_n to 
> make sure we can intercept any illegal access to these range until we 
> can reset them with p2m_access_rw via set_identity_p2m_entry(d, 
> base_pfn, p2m_access_rw).

This restates what the patch does but doesn't answer the question.

Jan


* Re: [v7][RFC][PATCH 01/13] introduce XENMEM_reserved_device_memory_map
  2014-10-27  2:11     ` Chen, Tiejun
  2014-10-27  2:18       ` Chen, Tiejun
@ 2014-10-27  9:42       ` Jan Beulich
  2014-10-28  2:22         ` Chen, Tiejun
  1 sibling, 1 reply; 180+ messages in thread
From: Jan Beulich @ 2014-10-27  9:42 UTC (permalink / raw)
  To: Tiejun Chen; +Cc: yang.z.zhang, kevin.tian, tim, xen-devel

>>> On 27.10.14 at 03:11, <tiejun.chen@intel.com> wrote:
> On 2014/10/24 22:11, Jan Beulich wrote:
>>>>> On 24.10.14 at 09:34, <tiejun.chen@intel.com> wrote:
>>> From: Jan Beulich <jbeulich@suse.com>
>>>
>>> This is a prerequisite for punching holes into HVM and PVH guests' P2M
>>> to allow passing through devices that are associated with (on VT-d)
>>> RMRRs.
>>>
>>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>>> Signed-off-by: Tiejun Chen <tiejun.chen@intel.com>
>>
>> I'm confused - you dropped Kevins ack and instead added you S-o-b
> 
> I need to add this ACK.
> 
>> despite you not having changed anything in the patch.
> 
> The original you sent to me previously is needed to rebase on the 
> latest. Please check xen/include/public/memory.h.
> 
> The original:
> 
> --- a/xen/include/public/memory.h
> +++ b/xen/include/public/memory.h
> @@ -573,7 +573,29 @@ struct vnuma_topology_info {
>   typedef struct vnuma_topology_info vnuma_topology_info_t;
>   DEFINE_XEN_GUEST_HANDLE(vnuma_topology_info_t);
> 
> -/* Next available subop number is 27 */
> 
> Mine:
> 
> --- a/xen/include/public/memory.h
> +++ b/xen/include/public/memory.h
> @@ -523,7 +523,29 @@ DEFINE_XEN_GUEST_HANDLE(xen_mem_sharing_
> 
>   #endif /* defined(__XEN__) || defined(__XEN_TOOLS__) */
> 
> -/* Next available subop number is 26 */
> 
> Unless you're saying this kind of rebase shouldn't introduce my SOB.

I indeed don't think rebasing counts as you actively doing any
changes.

Jan


* Re: [v7][RFC][PATCH 04/13] hvmloader/util: get reserved device memory maps
  2014-10-27  3:12     ` Chen, Tiejun
@ 2014-10-27  9:45       ` Jan Beulich
  2014-10-28  5:21         ` Chen, Tiejun
  0 siblings, 1 reply; 180+ messages in thread
From: Jan Beulich @ 2014-10-27  9:45 UTC (permalink / raw)
  To: Tiejun Chen; +Cc: yang.z.zhang, kevin.tian, tim, xen-devel

>>> On 27.10.14 at 04:12, <tiejun.chen@intel.com> wrote:
> On 2014/10/24 22:22, Jan Beulich wrote:
>>>>> On 24.10.14 at 09:34, <tiejun.chen@intel.com> wrote:
>>> --- a/tools/firmware/hvmloader/util.h
>>> +++ b/tools/firmware/hvmloader/util.h
>>> @@ -241,6 +241,12 @@ int build_e820_table(struct e820entry *e820,
>>>                        unsigned int bios_image_base);
>>>   void dump_e820_table(struct e820entry *e820, unsigned int nr);
>>>
>>> +#include <xen/memory.h>
>>> +#define ENOBUFS     105 /* No buffer space available */
>>
>> This is a joke I hope? The #include belongs at the top (albeit afaict
>> you don't really need it here), and the #define is completely
> 
> If without this line, #include <xen/memory.h>,
> 
> In file included from build.c:25:0:
> ../util.h:246:70: error: array type has incomplete element type
>   int get_reserved_device_memory_map(struct xen_reserved_device_memory entries[],
>                                                                        ^
> make[8]: *** [build.o] Error 1

So just forward declare the structure ahead of the function
declaration.
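
A minimal sketch of what a forward declaration buys here. The names
rdm_entry and fill_rdm_map are hypothetical stand-ins, not the ones from
the patch, and the sketch assumes the parameter is spelled as a pointer:
GCC rejects an array parameter whose element type is still incomplete,
which is the error quoted above.

```c
/* Header side: an incomplete type is enough for a pointer parameter. */
struct rdm_entry;
int fill_rdm_map(struct rdm_entry *entries, unsigned int max);

/* Source side: the full definition is only needed where fields are used. */
struct rdm_entry {
    unsigned long start_pfn;
    unsigned long nr_pages;
};

/* Hypothetical filler: reports one range taken from the log in this thread. */
int fill_rdm_map(struct rdm_entry *entries, unsigned int max)
{
    if (max < 1)
        return -1;              /* no room for even one entry */

    entries[0].start_pfn = 0xab80a;
    entries[0].nr_pages = 0x14; /* ab80a000 - ab81dfff is 0x14 pages */
    return 1;                   /* number of entries written */
}
```

With this split, the header no longer needs to pull in the file that
defines the structure; only the translation units touching its fields do.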

>> misplaced here. While I generally wouldn't recommend doing this, I
>> think in the case here including the hypervisor header that defines
>> them would be okay. Perhaps not via relative path, but via having
> 
> Seems we just need to include this,
> 
> #include <errno.h>

You shouldn't include system headers here - what if the build system's
-E... values differ from Xen's? Please remember that what you're making
changes to is not arbitrary application code.

Jan


* Re: [v7][RFC][PATCH 05/13] hvmloader/mmio: reconcile guest mmio with reserved device memory
  2014-10-27  7:12     ` Chen, Tiejun
@ 2014-10-27  9:56       ` Jan Beulich
  2014-10-28  7:11         ` Chen, Tiejun
  0 siblings, 1 reply; 180+ messages in thread
From: Jan Beulich @ 2014-10-27  9:56 UTC (permalink / raw)
  To: Tiejun Chen; +Cc: yang.z.zhang, kevin.tian, tim, xen-devel

>>> On 27.10.14 at 08:12, <tiejun.chen@intel.com> wrote:
> On 2014/10/24 22:42, Jan Beulich wrote:
>>>>> On 24.10.14 at 09:34, <tiejun.chen@intel.com> wrote:
>>> --- a/tools/firmware/hvmloader/pci.c
>>> +++ b/tools/firmware/hvmloader/pci.c
>>> @@ -37,6 +37,44 @@ uint64_t pci_hi_mem_start = 0, pci_hi_mem_end = 0;
>>>   enum virtual_vga virtual_vga = VGA_none;
>>>   unsigned long igd_opregion_pgbase = 0;
>>>
>>> +unsigned int need_skip_rmrr = 0;
>>
>> Static (and without initializer)?
> 
> static unsigned int need_skip_rmrr;

Please stop echoing back what was requested.

>>> +                           uint64_t mmio_start, uint64_t mmio_size)
>>> +{
>>> +    if ( start + memsize <= mmio_start || start >= mmio_start + mmio_size )
>>> +        return 0;
>>> +    else
>>> +        return 1;
>>
>> Make this a simple single return statement?
> 
> static int check_mmio_hole_conflict(uint64_t start, uint64_t memsize,
>                                      uint64_t mmio_start, uint64_t mmio_size)
> {
>      if ( start + memsize <= mmio_start || start >= mmio_start + mmio_size )
>          return 0;
> 
>      return 1;
> }

Is "a simple single return statement" ambiguous in any way? This

static bool check_mmio_hole_conflict(uint64_t start, uint64_t memsize,
                                      uint64_t mmio_start, uint64_t mmio_size)
{
     return start + memsize > mmio_start && start < mmio_start + mmio_size;
}

is how I think this should be.

>>> +}
>>> +
>>> +static int check_reserved_device_memory_map(uint64_t mmio_base,
>>> +                                            uint64_t mmio_max)
>>> +{
>>> +    uint32_t i = 0;
>>
>> Pointless initializer.
>>
>>> +    uint64_t rdm_start, rdm_end;
>>> +    int nr_entries = -1;
>>
>> And again.
>>
>>> +
>>> +    nr_entries = hvm_get_reserved_device_memory_map();
>>
>> It's completely unclear why this can't be the variable's initializer.
> 
>      uint32_t i;
>      uint64_t rdm_start, rdm_end;
>      int nr_rdm_entries = hvm_get_reserved_device_memory_map();

And (see also below) "unsigned int". It's bogus anyway to have the
function return the count by normal means but the actual array via a
global variable. I think you ought to switch to a consistent model.

>>> +    for ( i = 0; i < nr_entries; i++ )
>>> +    {
>>> +        rdm_start = rdm_map[i].start_pfn << PAGE_SHIFT;
>>> +        rdm_end = rdm_start + (rdm_map[i].nr_pages << PAGE_SHIFT);
>>
>> I'm pretty certain I pointed out before that you can't simply shift
>> these fields - you risk losing significant bits.
> 
> I tried to go back looking into something but just found you were saying 
> I shouldn't use PAGE_SHIFT and PAGE_SIZE at the same time. If I'm still 
> missing could you show me what you expect?

Shifting a 32-bit quantity left still yields a 32-bit quantity, no matter
whether the result is then stored in a 64-bit variable. You need to
up-cast the left side of the shift first.
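
To illustrate the point, a small sketch assuming the PFN field is 32 bits
wide in the build in question and PAGE_SHIFT is 12. The helper names are
made up for this example only.

```c
#include <stdint.h>

#define PAGE_SHIFT 12

/* Widen first, so the shift itself is carried out in 64 bits. */
static uint64_t pfn_to_addr(uint32_t pfn)
{
    return (uint64_t)pfn << PAGE_SHIFT;
}

/*
 * The problematic form: the shift is done in 32 bits, the high bits are
 * dropped, and only then is the result widened for the return value.
 */
static uint64_t pfn_to_addr_truncated(uint32_t pfn)
{
    return pfn << PAGE_SHIFT;
}
```

For a PFN of 0x100000 (the 4GB boundary) the first helper yields
0x100000000 while the second yields 0.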

>>> @@ -58,7 +96,9 @@ void pci_setup(void)
>>>           uint32_t bar_reg;
>>>           uint64_t bar_sz;
>>>       } *bars = (struct bars *)scratch_start;
>>> -    unsigned int i, nr_bars = 0;
>>> +    unsigned int i, j, nr_bars = 0;
>>> +    int nr_entries = 0;
>>
>> And another pointless initializer. Plus as a count of something this
> 
> int nr_rdm_entries;
> 
>> surely wants to be "unsigned int". Also I guess the variable name
> 
> nr_rdm_entries should be literally unsigned int but this value always be 
> set from  hvm_get_reserved_device_memory_map(),
> 
> nr_rdm_entries = hvm_get_reserved_device_memory_map()
> 
> I hope that return value can be negative value in some failed case

If only you checked for these negative values...

> Additionally, actually there are some original codes just following my 
> codes:
> 
>          if ( need_skip_rmrr )
>          {
> 		...
>          }
> 
> 	base += bar_sz;
> 
>          if ( (base < resource->base) || (base > resource->max) )
>          {
>              printf("pci dev %02x:%x bar %02x size "PRIllx": no space for "
>                     "resource!\n", devfn>>3, devfn&7, bar_reg,
>                     PRIllx_arg(bar_sz));
>              continue;
>          }
> 
> This can guarantee we don't overwhelm the previous mmio range.

Resulting in the BAR not getting a value assigned afaict. Certainly
not what we want as a side effect of your changes.

>> and bar_data_upper will likely end up being garbage.
>>
>> Did you actually _test_ this code?
> 
> Actually in my real case those RMRR ranges are always below MMIO.

Below whose MMIO? The host's or the guest's? In the latter case,
just (in order to test your code) increase the range reserved for
MMIO enough to cover the RMRR range.

Jan


* Re: [v7][RFC][PATCH 06/13] hvmloader/ram: check if guest memory is out of reserved device memory maps
  2014-10-27  8:09     ` Chen, Tiejun
@ 2014-10-27 10:17       ` Jan Beulich
  2014-10-28  7:47         ` Chen, Tiejun
  0 siblings, 1 reply; 180+ messages in thread
From: Jan Beulich @ 2014-10-27 10:17 UTC (permalink / raw)
  To: Tiejun Chen; +Cc: yang.z.zhang, kevin.tian, tim, xen-devel

>>> On 27.10.14 at 09:09, <tiejun.chen@intel.com> wrote:
> On 2014/10/24 22:56, Jan Beulich wrote:
>>>>> On 24.10.14 at 09:34, <tiejun.chen@intel.com> wrote:
>>> We need to check to reserve all reserved device memory maps in e820
>>> to avoid any potential guest memory conflict.
>>>
>>> Currently, if we can't insert RDM entries directly, we may need to handle
>>> several ranges as follows:
>>> a. Fixed Ranges --> BUG()
>>>   lowmem_reserved_base-0xA0000: reserved by BIOS implementation,
>>>   BIOS region,
>>>   RESERVED_MEMBASE ~ 0x100000000,
>>
>> This seems conceptually wrong to me, and I said so before:
>> Depending on host characteristics this approach may mean you're
>> going to be unable to build any HVM guests. Minimally there needs
>> to be a way to avoid these checks (resulting in devices associated
>> with RMRRs not being assignable to such a guest). I'm therefore
> 
> I just use 'err' to indicate if these fixed range overlaps RMRR,
> 
> +    /* These overlap may issue guest can't work well. */
> +    if ( err )
> +    {
> +        printf("Guest can't work with some reserved device memory overlap!\n");
> +        BUG();
> +    }
> 
> As I understand, these fixed ranges don't like RAM that we can move 
> safely out any RMRR overlap. And actually its rare to overlap with those 
> fixed ranges.

Again - one of my systems has RMRRs in the Ex000 range, which
certainly risks overlapping with the BIOS image should that one be
larger than 64k. Plus with RMRRs being in that region, I can
certainly see (physical) systems with small enough BIOS images
to place RMRRs even in the low Fx000 range, which then quite
certainly would overlap with the (virtual) BIOS range.

> But I can remove BUG if you insist on this point.

Whether removing the BUG() here is correct and/or sufficient to
address my concern I can't immediately tell. What I insist on is that
_no matter_ what RMRRs a physical host has, it should not prevent
the creation of guests (the worst that may result is that passing
through certain devices doesn't work anymore, and even then the
operator needs to be given a way of circumventing this if (s)he
knows that the device won't access the range post-boot, or if it's
being deemed acceptable for it to do so).

>>> +            /* If we're going last RAM:Hole range */
>>> +            else if ( end < next_start &&
>>> +                      rdm_start > start &&
>>> +                      rdm_end < next_start &&
>>> +                      type == E820_RAM )
>>> +            {
>>> +                if ( do_insert )
>>> +                {
>>> +                    memmove(&e820[j+1], &e820[j],
>>> +                            (sum_nr - j) * sizeof(struct e820entry));
>>> +
>>> +                    e820[j].size = rdm_start - e820[j].addr;
>>> +                    e820[j].type = E820_RAM;
>>> +
>>> +                    e820[j+1].addr = rdm_start;
>>> +                    e820[j+1].size = rdm_end - rdm_start;
>>> +                    e820[j+1].type = E820_RESERVED;
>>> +                    next_e820_entry_index++;
>>> +                }
>>> +                insert++;
>>> +            }
>>
>> This if-else-if series looks horrible - is there really no way to consolidate
>> it? Also, other than punching holes in the E820 map you don't seem to
> 
> I know this is ugly but as you know there's no any rule we can make good 
> use of this case. RMRR can start anywhere so We have to assume any 
> scenarios,
> 
> 1. Just amid those remaining e820 entries.
> 2. Already at the end.
> 3. If coincide with one RAM range.
> 4. If we're just aligned with start of one RAM range.
> 5. If we're just aligned with end of one RAM range.
> 6. If we're just in of one RAM range.
> 7. If we're going last RAM:Hole range.
> 
> So if you think we're handling correctly, maybe we can continue 
> optimizing this way once we have a better idea.

I understand that there are various cases to be considered, but
that's no different elsewhere. For example, look at
xen/arch/x86/e820.c:e820_change_range_type() which gets
away with quite a bit shorter an if/else-if sequence.
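
As a rough illustration of how the cases can collapse, here is a sketch.
It is neither the code from e820_change_range_type() nor what the patch
does: punch_reserved() and emit() are hypothetical names, and the result
is built in a separate output array rather than shifted in place with
memmove(). Every RAM entry overlapping [start, end) is reduced to an
optional head and an optional tail, which covers the aligned, contained
and trailing cases with one piece of logic. Non-RAM entries are copied
unchanged, so overlap with them is not resolved here.

```c
#include <stdint.h>

#define E820_RAM      1
#define E820_RESERVED 2

struct e820entry {
    uint64_t addr;
    uint64_t size;
    uint32_t type;
};

/* Append one entry; returns the new count, or -1 once the table is full. */
static int emit(struct e820entry *out, int n, int max,
                uint64_t addr, uint64_t size, uint32_t type)
{
    if (n < 0 || n >= max)
        return -1;
    out[n].addr = addr;
    out[n].size = size;
    out[n].type = type;
    return n + 1;
}

/*
 * Copy the sorted table in[] to out[], carving [start, end) out of any RAM
 * entry it overlaps and adding one RESERVED entry for the range.
 * Returns the number of entries written, or -1 on overflow.
 */
static int punch_reserved(const struct e820entry *in, int nr,
                          struct e820entry *out, int max,
                          uint64_t start, uint64_t end)
{
    int n = 0, placed = 0;

    for (int i = 0; i < nr; i++) {
        uint64_t s = in[i].addr, e = s + in[i].size;
        int overlaps_ram = in[i].type == E820_RAM && s < end && e > start;

        /* RAM below the reserved range survives as a shortened head. */
        if (overlaps_ram && s < start)
            n = emit(out, n, max, s, start - s, E820_RAM);

        /* Place the reserved entry exactly once, keeping the table sorted. */
        if (!placed && (overlaps_ram || s >= start)) {
            n = emit(out, n, max, start, end - start, E820_RESERVED);
            placed = 1;
        }

        if (!overlaps_ram)
            n = emit(out, n, max, s, in[i].size, in[i].type);
        else if (e > end)       /* RAM above the range survives as a tail */
            n = emit(out, n, max, end, e - end, E820_RAM);
    }

    /* The range lies beyond every existing entry. */
    if (!placed)
        n = emit(out, n, max, start, end - start, E820_RESERVED);

    return n;
}
```

Whether hiding the carved-out RAM from the guest is acceptable at all is a
separate question; the sketch only addresses the shape of the code.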

>> be doing anything here. And the earlier tools side patches didn't do
>> anything about this either. Consequently, at the time where it may
>> become necessary to establish the 1:1 mapping in the P2M, there'll
>> be the RAM mapping still there, causing the device assignment to fail.
> 
> But I already set these range as p2m_access_n, and as you see I also 
> reserved these range in e820 table. So although the RAM mapping still is 
> still there but no any actual access.

That's being done in patch 8, but we're talking about patch 6 here.
Also - what size are the RMRRs in your case? The USB ones I know
of are typically single or very few page ones, so having the guest
lose that amount of memory may be tolerable. But if the ranges can
get any larger than a couple of pages, or if there can reasonably be
a larger amount of them (like could be the case on e.g. multi-node
systems), simply hiding that memory may not be well received by
our users.

> RMRR range:
> 
> root@tchen0-Shark-Bay-Client-platform:/home/tchen0/workspace# xl dmesg | grep RMRR
> (XEN) [VT-D]dmar.c:834: found ACPI_DMAR_RMRR:
> (XEN) [VT-D]dmar.c:679:   RMRR region: base_addr ab80a000 end_address ab81dfff
> (XEN) [VT-D]dmar.c:834: found ACPI_DMAR_RMRR:
> (XEN) [VT-D]dmar.c:679:   RMRR region: base_addr ad000000 end_address af7fffff
> root@tchen0-Shark-Bay-Client-platform:/home/tchen0/workspace#
> 
> Without my patch:
> 
> (d4) E820 table:
> (d4)  [00]: 00000000:00000000 - 00000000:0009e000: RAM
> (d4)  [01]: 00000000:0009e000 - 00000000:000a0000: RESERVED
> (d4)  HOLE: 00000000:000a0000 - 00000000:000e0000
> (d4)  [02]: 00000000:000e0000 - 00000000:00100000: RESERVED
> (d4)  [03]: 00000000:00100000 - 00000000:ab80a000: RAM
> (d4)  [04]: 00000000:ab80a000 - 00000000:ab81e000: RESERVED
> (d4)  [05]: 00000000:ab81e000 - 00000000:ad000000: RAM
> (d4)  [06]: 00000000:ad000000 - 00000000:af800000: RESERVED

Where would this reserved range come from when your patches
aren't in place?

> (d4)  HOLE: 00000000:af800000 - 00000000:fc000000
> (d4)  [07]: 00000000:fc000000 - 00000001:00000000: RESERVED
> 
> 
> With my patch:
> 
> (d2)  f0000-fffff: Main BIOS
> (d2) E820 table:
> (d2)  [00]: 00000000:00000000 - 00000000:0009e000: RAM
> (d2)  [01]: 00000000:0009e000 - 00000000:000a0000: RESERVED
> (d2)  HOLE: 00000000:000a0000 - 00000000:000e0000
> (d2)  [02]: 00000000:000e0000 - 00000000:00100000: RESERVED
> (d2)  [03]: 00000000:00100000 - 00000000:ab80a000: RAM
> (d2)  [04]: 00000000:ab80a000 - 00000000:ab81e000: RESERVED
> (d2)  [05]: 00000000:ab81e000 - 00000000:ad000000: RAM
> (d2)  [06]: 00000000:ad000000 - 00000000:af800000: RESERVED

And this already answers what I asked above: You shouldn't be blindly
hiding 40MB from the guest.

Jan


* Re: [v7][RFC][PATCH 08/13] xen/x86/p2m: set p2m_access_n for reserved device memory mapping
  2014-10-27  9:05     ` Chen, Tiejun
@ 2014-10-27 10:33       ` Jan Beulich
  2014-10-28  8:26         ` Chen, Tiejun
  0 siblings, 1 reply; 180+ messages in thread
From: Jan Beulich @ 2014-10-27 10:33 UTC (permalink / raw)
  To: Tiejun Chen; +Cc: yang.z.zhang, kevin.tian, tim, xen-devel

>>> On 27.10.14 at 10:05, <tiejun.chen@intel.com> wrote:
> On 2014/10/24 23:11, Jan Beulich wrote:
>>>>> On 24.10.14 at 09:34, <tiejun.chen@intel.com> wrote:
>   int
>   guest_physmap_add_entry(struct domain *d, unsigned long gfn,
>                           unsigned long mfn, unsigned int page_order,
> 
>>
>>> +            if ( rc )
>>
>> And the return value from the called function is of type int -
>> non-zero may not just mean "true" but (when negative) also
>> "error". You need to distinguish these cases.
> 
> But in our case its impossible to get a negative value.

Being guaranteed by what? Please don't simply take the
_current implementation_ of iommu_get_reserved_device_memory()
as reference - it could be changed at any time, and it allowing for an
error return status already would make it perfectly fine for someone
adding an actual case thereof not to go through all existing callers
to check whether they can cope. This is a general code quality
requirement to assure things remain maintainable.
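
A sketch of the distinction being asked for, assuming the usual convention
that a negative value is an error code, zero means "no match" and a
positive value means "match". The helper classify_rdm_result() is
hypothetical and only shows the shape of the check at a call site.

```c
#include <stdbool.h>

/*
 * rc is what an iterator such as iommu_get_reserved_device_memory()
 * handed back:
 *   rc <  0  error, propagate it to the caller
 *   rc == 0  the gfn is not in any reserved range
 *   rc >  0  the gfn hit a reserved range
 */
static int classify_rdm_result(int rc, bool *is_rdm)
{
    if (rc < 0)
        return rc;          /* do not mistake an error for "true" */

    *is_rdm = rc > 0;
    return 0;
}
```

A caller would then bail out on a non-zero return and only look at the
flag afterwards, instead of testing a bare "if ( rc )".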

>>> +            {
>>> +                /*
>>> +                 * Just set p2m_access_n in case of shared-ept
>>> +                 * or non-shared ept but 1:1 mapping.
>>> +                 */
>>> +                if ( iommu_use_hap_pt(d) ||
>>> +                     (!iommu_use_hap_pt(d) && mfn == gfn) )
>>
>> How would, other than by chance, mfn equal gfn here? Also the
>> double use of iommu_use_hap_pt(d) is pointless here.
> 
> There are two scenarios we should concern:
> 
> #1 in case of shared-ept.
> 
> We always need to check so iommu_use_hap_pt(d) is good.
> 
> #2 in case of non-sharepd-ept
> 
> If mfn != gfn I think guest don't access RMRR range, so its allowed.

And what if subsequently a device needing a 1:1 mapping at this GFN
gets assigned? (I simply don't see why shared vs non-shared would
matter here.)

>>> +                {
>>> +                    rc = p2m_set_entry(p2m, gfn, _mfn(mfn), page_order, t,
>>> +                                       p2m_access_n);
>>> +                    if ( rc )
>>> +                        gdprintk(XENLOG_WARNING, "set rdm p2m failed: (%#lx)\n",
>>> +                                 gfn);
>>
>> Such messages are (due to acting on a foreign domain) relatively
>> useless without also logging the domain that is affected. Conversely,
>> logging the current domain and vCPU (due to using gdprintk()) is
>> rather pointless. Also please drop either the colon or the
>> parentheses in the message.
> 
> Can P2M_DEBUG work here?
> 
> P2M_DEBUG("set rdm p2m failed: %#lx\n", gfn);

I don't think this would magically add the missing information. Plus it
would limit output to the !NDEBUG case, putting the practical
usefulness of this under question even more.

But anyway, looking at the existing code again, I think you'd be better
off falling through to the p2m_set_entry() that's already there, just
altering the access permission value you pass. Less code, better
readable.

Jan


* Re: [v7][RFC][PATCH 01/13] introduce XENMEM_reserved_device_memory_map
  2014-10-24  7:34 ` [v7][RFC][PATCH 01/13] introduce XENMEM_reserved_device_memory_map Tiejun Chen
  2014-10-24 14:11   ` Jan Beulich
@ 2014-10-27 13:35   ` Julien Grall
  2014-10-28  2:35     ` Chen, Tiejun
  1 sibling, 1 reply; 180+ messages in thread
From: Julien Grall @ 2014-10-27 13:35 UTC (permalink / raw)
  To: Tiejun Chen, JBeulich, tim, konrad.wilk, kevin.tian, yang.z.zhang
  Cc: xen-devel

Hi,

On 10/24/2014 08:34 AM, Tiejun Chen wrote:
> diff --git a/xen/common/memory.c b/xen/common/memory.c
> index cc36e39..51a32a8 100644
> --- a/xen/common/memory.c
> +++ b/xen/common/memory.c
> @@ -692,6 +692,32 @@ out:
>      return rc;
>  }
>  
> +struct get_reserved_device_memory {
> +    struct xen_mem_reserved_device_memory_map map;
> +    unsigned int used_entries;
> +};
> +
> +static int get_reserved_device_memory(xen_pfn_t start,
> +                                      xen_ulong_t nr, void *ctxt)

This function is only used when HAS_PASSTHROUGH is defined. You have to
protect it with an #ifdef HAS_PASSTHROUGH.
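
Roughly like the sketch below, where rdm_ctx, count_rdm_entry() and
query_rdm() are made-up names and not the ones from the patch. The
motivation assumed here is that a static function with no caller triggers
an unused-function warning, which breaks builds that treat warnings as
errors.

```c
struct rdm_ctx {
    unsigned int used_entries;
};

#ifdef HAS_PASSTHROUGH
/* Only referenced from the passthrough path, so only built alongside it. */
static int count_rdm_entry(unsigned long start, unsigned long nr, void *ctxt)
{
    struct rdm_ctx *c = ctxt;

    (void)start;
    (void)nr;
    c->used_entries++;
    return 0;
}
#endif

static int query_rdm(struct rdm_ctx *c)
{
#ifdef HAS_PASSTHROUGH
    return count_rdm_entry(0, 1, c);
#else
    (void)c;
    return -1;  /* stand-in for an "unsupported" error code */
#endif
}
```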

Regards,

-- 
Julien Grall


* Re: [v7][RFC][PATCH 01/13] introduce XENMEM_reserved_device_memory_map
  2014-10-27  9:42       ` Jan Beulich
@ 2014-10-28  2:22         ` Chen, Tiejun
  0 siblings, 0 replies; 180+ messages in thread
From: Chen, Tiejun @ 2014-10-28  2:22 UTC (permalink / raw)
  To: Jan Beulich; +Cc: yang.z.zhang, kevin.tian, tim, xen-devel

On 2014/10/27 17:42, Jan Beulich wrote:
>>>> On 27.10.14 at 03:11, <tiejun.chen@intel.com> wrote:
>> On 2014/10/24 22:11, Jan Beulich wrote:
>>>>>> On 24.10.14 at 09:34, <tiejun.chen@intel.com> wrote:
>>>> From: Jan Beulich <jbeulich@suse.com>
>>>>
>>>> This is a prerequisite for punching holes into HVM and PVH guests' P2M
>>>> to allow passing through devices that are associated with (on VT-d)
>>>> RMRRs.
>>>>
>>>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>>>> Signed-off-by: Tiejun Chen <tiejun.chen@intel.com>
>>>
>>> I'm confused - you dropped Kevins ack and instead added you S-o-b
>>
>> I need to add this ACK.
>>
>>> despite you not having changed anything in the patch.
>>
>> The original you sent to me previously is needed to rebase on the
>> latest. Please check xen/include/public/memory.h.
>>
>> The original:
>>
>> --- a/xen/include/public/memory.h
>> +++ b/xen/include/public/memory.h
>> @@ -573,7 +573,29 @@ struct vnuma_topology_info {
>>    typedef struct vnuma_topology_info vnuma_topology_info_t;
>>    DEFINE_XEN_GUEST_HANDLE(vnuma_topology_info_t);
>>
>> -/* Next available subop number is 27 */
>>
>> Mine:
>>
>> --- a/xen/include/public/memory.h
>> +++ b/xen/include/public/memory.h
>> @@ -523,7 +523,29 @@ DEFINE_XEN_GUEST_HANDLE(xen_mem_sharing_
>>
>>    #endif /* defined(__XEN__) || defined(__XEN_TOOLS__) */
>>
>> -/* Next available subop number is 26 */
>>
>> Unless you're saying this kind of rebase shouldn't introduce my SOB.
>
> I indeed don't think rebasing counts as you actively doing any
> changes.
>

Fine. I think this patch has a long history between us.

Thanks
Tiejun


* Re: [v7][RFC][PATCH 01/13] introduce XENMEM_reserved_device_memory_map
  2014-10-27 13:35   ` Julien Grall
@ 2014-10-28  2:35     ` Chen, Tiejun
  2014-10-28 10:36       ` Jan Beulich
  0 siblings, 1 reply; 180+ messages in thread
From: Chen, Tiejun @ 2014-10-28  2:35 UTC (permalink / raw)
  To: Julien Grall, JBeulich, tim, konrad.wilk, kevin.tian, yang.z.zhang
  Cc: xen-devel

On 2014/10/27 21:35, Julien Grall wrote:
> Hi,
>
> On 10/24/2014 08:34 AM, Tiejun Chen wrote:
>> diff --git a/xen/common/memory.c b/xen/common/memory.c
>> index cc36e39..51a32a8 100644
>> --- a/xen/common/memory.c
>> +++ b/xen/common/memory.c
>> @@ -692,6 +692,32 @@ out:
>>       return rc;
>>   }
>>
>> +struct get_reserved_device_memory {
>> +    struct xen_mem_reserved_device_memory_map map;
>> +    unsigned int used_entries;
>> +};
>> +
>> +static int get_reserved_device_memory(xen_pfn_t start,
>> +                                      xen_ulong_t nr, void *ctxt)
>
> This function is only used when HAS_PASSTHROUGH is defined. You have to
> protected by an #ifdef HAS_PASSTHROUGH.
>

I guess you mean we need to do this,

diff --git a/xen/common/memory.c b/xen/common/memory.c
index 1449c10..2177c56 100644
--- a/xen/common/memory.c
+++ b/xen/common/memory.c
@@ -692,6 +692,7 @@ out:
      return rc;
  }

+#ifdef HAS_PASSTHROUGH
  struct get_reserved_device_memory {
      struct xen_reserved_device_memory_map map;
      unsigned int used_entries;
@@ -717,6 +718,7 @@ static int get_reserved_device_memory(xen_pfn_t start,

      return 0;
  }
+#endif

  long do_memory_op(unsigned long cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
  {

Jan,

With the above change, does the following work for you?

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Kevin Tian <kevin.tian@intel.com>
CC: Julien Grall <julien.grall@linaro.org>
[Julien: Protect some definitions by an #ifdef HAS_PASSTHROUGH.]
Signed-off-by: Tiejun Chen <tiejun.chen@intel.com>

Thanks
Tiejun


* Re: [v7][RFC][PATCH 04/13] hvmloader/util: get reserved device memory maps
  2014-10-27  9:45       ` Jan Beulich
@ 2014-10-28  5:21         ` Chen, Tiejun
  2014-10-28  9:48           ` Jan Beulich
  0 siblings, 1 reply; 180+ messages in thread
From: Chen, Tiejun @ 2014-10-28  5:21 UTC (permalink / raw)
  To: Jan Beulich; +Cc: yang.z.zhang, kevin.tian, tim, xen-devel

On 2014/10/27 17:45, Jan Beulich wrote:
>>>> On 27.10.14 at 04:12, <tiejun.chen@intel.com> wrote:
>> On 2014/10/24 22:22, Jan Beulich wrote:
>>>>>> On 24.10.14 at 09:34, <tiejun.chen@intel.com> wrote:
>>>> --- a/tools/firmware/hvmloader/util.h
>>>> +++ b/tools/firmware/hvmloader/util.h
>>>> @@ -241,6 +241,12 @@ int build_e820_table(struct e820entry *e820,
>>>>                         unsigned int bios_image_base);
>>>>    void dump_e820_table(struct e820entry *e820, unsigned int nr);
>>>>
>>>> +#include <xen/memory.h>
>>>> +#define ENOBUFS     105 /* No buffer space available */
>>>
>>> This is a joke I hope? The #include belongs at the top (albeit afaict
>>> you don't really need it here), and the #define is completely
>>
>> If without this line, #include <xen/memory.h>,
>>
>> In file included from build.c:25:0:
>> ../util.h:246:70: error: array type has incomplete element type
>>    int get_reserved_device_memory_map(struct xen_reserved_device_memory
>> entries[],
>>                                                                         ^
>> make[8]: *** [build.o] Error 1
>
> So just forward declare the structure ahead of the function
> declaration.

tools/firmware/hvmloader/pci.c:28:#include <xen/memory.h>
tools/firmware/hvmloader/ovmf.c:36:#include <xen/memory.h>

So is there any reason I can't do the same thing?
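
For reference, the forward declaration suggested above could look roughly
like the sketch below. It assumes the prototype takes a pointer, because
an array parameter such as "entries[]" still needs a complete element
type, which is the error shown in the build log. The helper name
count_rdm_pages() and the two-field layout are placeholders for this
sketch only, not the real definition from xen/memory.h.

```c
#include <stddef.h>
#include <stdint.h>

/* What the shared header would carry: a forward declaration only. */
struct xen_reserved_device_memory;

/*
 * A pointer to an incomplete type is fine in a prototype. Writing the
 * parameter as "entries[]" would not be, since the element type of an
 * array must be complete even in a parameter list.
 */
int count_rdm_pages(const struct xen_reserved_device_memory *entries,
                    unsigned int nr, uint64_t *total);

/*
 * What a .c file sees once it includes the real header. The layout
 * below is a placeholder assumed for this sketch.
 */
struct xen_reserved_device_memory {
    uint64_t start_pfn;
    uint64_t nr_pages;
};

/* Hypothetical helper: sum the page counts of all entries. */
int count_rdm_pages(const struct xen_reserved_device_memory *entries,
                    unsigned int nr, uint64_t *total)
{
    unsigned int i;

    if ( !total || (!entries && nr) )
        return -1;

    *total = 0;
    for ( i = 0; i < nr; i++ )
        *total += entries[i].nr_pages;

    return 0;
}
```

Only the translation units that actually dereference the entries then
need the full structure definition.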

>
>>> misplaced here. While I generally wouldn't recommend doing this, I
>>> think in the case here including the hypervisor header that defines
>>> them would be okay. Perhaps not via relative path, but via having
>>
>> Seems we just need to include this,
>>
>> #include <errno.h>
>
> You shouldn't include system headers here - what if the build system's
> -E... values differ from Xen's? Please remember that what your making

tools/firmware/hvmloader/xenbus.c:30:#include <errno.h>

And why would Xen define these differently?

Thanks
Tiejun

> changes to is not arbitrary application code.
>
> Jan
>
>
>


* Re: [v7][RFC][PATCH 05/13] hvmloader/mmio: reconcile guest mmio with reserved device memory
  2014-10-27  9:56       ` Jan Beulich
@ 2014-10-28  7:11         ` Chen, Tiejun
  2014-10-28  9:56           ` Jan Beulich
  0 siblings, 1 reply; 180+ messages in thread
From: Chen, Tiejun @ 2014-10-28  7:11 UTC (permalink / raw)
  To: Jan Beulich; +Cc: yang.z.zhang, kevin.tian, tim, xen-devel

On 2014/10/27 17:56, Jan Beulich wrote:
>>>> On 27.10.14 at 08:12, <tiejun.chen@intel.com> wrote:
>> On 2014/10/24 22:42, Jan Beulich wrote:
>>>>>> On 24.10.14 at 09:34, <tiejun.chen@intel.com> wrote:
>>>> --- a/tools/firmware/hvmloader/pci.c
>>>> +++ b/tools/firmware/hvmloader/pci.c
>>>> @@ -37,6 +37,44 @@ uint64_t pci_hi_mem_start = 0, pci_hi_mem_end = 0;
>>>>    enum virtual_vga virtual_vga = VGA_none;
>>>>    unsigned long igd_opregion_pgbase = 0;
>>>>
>>>> +unsigned int need_skip_rmrr = 0;
>>>
>>> Static (and without initializer)?
>>
>> static unsigned int need_skip_rmrr;
>
> Please stop echoing back what was requested.

I reply with the fix inline to make sure I'm addressing every comment
properly. If you think the fix is correct, please just ignore it.

>
>>>> +                           uint64_t mmio_start, uint64_t mmio_size)
>>>> +{
>>>> +    if ( start + memsize <= mmio_start || start >= mmio_start + mmio_size )
>>>> +        return 0;
>>>> +    else
>>>> +        return 1;
>>>
>>> Make this a simple single return statement?
>>
>> static int check_mmio_hole_conflict(uint64_t start, uint64_t memsize,
>>                                       uint64_t mmio_start, uint64_t
>> mmio_size)
>> {
>>       if ( start + memsize <= mmio_start || start >= mmio_start + mmio_size )
>>           return 0;
>>
>>       return 1;
>> }
>
> Is "a simple single return statement" ambiguous in any way? This
>
> static bool check_mmio_hole_conflict(uint64_t start, uint64_t memsize,
>                                        uint64_t mmio_start, uint64_t mmio_size)
> {
>       return start + memsize > mmio_start && start < mmio_start + mmio_size;
> }
>
> is how I think this should be.

Thanks for showing me.
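
The single-return form is just the negation of the original "no overlap"
test (De Morgan), so both give the same answer. A small sketch to check
that, with function names made up for this comparison:

```c
#include <stdbool.h>
#include <stdint.h>

/* Original shape: test for "no overlap" and return through two branches. */
bool overlaps_two_branch(uint64_t start, uint64_t memsize,
                         uint64_t mmio_start, uint64_t mmio_size)
{
    if ( start + memsize <= mmio_start || start >= mmio_start + mmio_size )
        return false;

    return true;
}

/* Suggested shape: one return statement with the negated condition. */
bool overlaps_single_return(uint64_t start, uint64_t memsize,
                            uint64_t mmio_start, uint64_t mmio_size)
{
    return start + memsize > mmio_start && start < mmio_start + mmio_size;
}
```

Both treat the ranges as half-open, so two ranges that merely touch
(one ends exactly where the other starts) do not count as a conflict.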

>
>>>> +}
>>>> +
>>>> +static int check_reserved_device_memory_map(uint64_t mmio_base,
>>>> +                                            uint64_t mmio_max)
>>>> +{
>>>> +    uint32_t i = 0;
>>>
>>> Pointless initializer.
>>>
>>>> +    uint64_t rdm_start, rdm_end;
>>>> +    int nr_entries = -1;
>>>
>>> And again.
>>>
>>>> +
>>>> +    nr_entries = hvm_get_reserved_device_memory_map();
>>>
>>> It's completely unclear why this can't be the variable's initializer.
>>
>>       uint32_t i;
>>       uint64_t rdm_start, rdm_end;
>>       int nr_rdm_entries = hvm_get_reserved_device_memory_map();
>
> And (see also below) "unsigned int". It's bogus anyway to have the
> function return the count by normal means by the actual array via a
> global variable. I think you ought to switch to a consistent model.
>
>>>> +    for ( i = 0; i < nr_entries; i++ )
>>>> +    {
>>>> +        rdm_start = rdm_map[i].start_pfn << PAGE_SHIFT;
>>>> +        rdm_end = rdm_start + (rdm_map[i].nr_pages << PAGE_SHIFT);
>>>
>>> I'm pretty certain I pointed out before that you can't simply shift
>>> these fields - you risk losing significant bits.
>>
>> I tried to go back looking into something but just found you were saying
>> I shouldn't use PAGE_SHIFT and PAGE_SIZE at the same time. If I'm still
>> missing could you show me what you expect?
>
> Shifting a 32-bit quantity left still yields a 32-bit quantity, no matter
> whether the result is then stored in a 64-bit variable. You need to
> up-cast the left side of the shift first.

Do you mean this?

rdm_start = (uint64_t)rdm_map[j].start_pfn << PAGE_SHIFT;
rdm_end = rdm_start + ((uint64_t)rdm_map[j].nr_pages << PAGE_SHIFT);
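
To illustrate the difference, here is a minimal sketch assuming the pfn
field is 32 bits wide, as in the remark above, and that unsigned int is
32 bits. The function names are made up for the comparison.

```c
#include <stdint.h>

#define PAGE_SHIFT 12

/*
 * The shift happens in 32 bits and only the result is widened, so any
 * bits shifted past bit 31 are already gone.
 */
uint64_t pfn_to_addr_truncating(uint32_t pfn)
{
    return pfn << PAGE_SHIFT;
}

/* Widening the left operand first keeps all significant bits. */
uint64_t pfn_to_addr_widened(uint32_t pfn)
{
    return (uint64_t)pfn << PAGE_SHIFT;
}
```

For a pfn such as 0xab80a both give the same address, which is why the
problem doesn't show up with RMRRs below 4G; a pfn of 0x100000 (the 4G
boundary) is where the two start to differ.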

>
>>>> @@ -58,7 +96,9 @@ void pci_setup(void)
>>>>            uint32_t bar_reg;
>>>>            uint64_t bar_sz;
>>>>        } *bars = (struct bars *)scratch_start;
>>>> -    unsigned int i, nr_bars = 0;
>>>> +    unsigned int i, j, nr_bars = 0;
>>>> +    int nr_entries = 0;
>>>
>>> And another pointless initializer. Plus as a count of something this
>>
>> int nr_rdm_entries;
>>
>>> surely wants to be "unsigned int". Also I guess the variable name
>>
>> nr_rdm_entries should be literally unsigned int but this value always be
>> set from  hvm_get_reserved_device_memory_map(),
>>
>> nr_rdm_entries = hvm_get_reserved_device_memory_map()
>>
>> I hope that return value can be negative value in some failed case
>
> If only you checked for these negative values...

Maybe I can simplify the handling of these failure cases inside
hvm_get_reserved_device_memory_map(), like this:
	if ( rc )
		return 0;

We don't actually need a negative return value any more, so '0' is
always fine. So here,

unsigned long nr_rdm_entries = hvm_get_reserved_device_memory_map();

>
>> Additionally, actually there are some original codes just following my
>> codes:
>>
>>           if ( need_skip_rmrr )
>>           {
>> 		...
>>           }
>>
>> 	base += bar_sz;
>>
>>           if ( (base < resource->base) || (base > resource->max) )
>>           {
>>               printf("pci dev %02x:%x bar %02x size "PRIllx": no space for "
>>                      "resource!\n", devfn>>3, devfn&7, bar_reg,
>>                      PRIllx_arg(bar_sz));
>>               continue;
>>           }
>>
>> This can guarantee we don't overwhelm the previous mmio range.
>
> Resulting in the BAR not getting a value assigned afaict. Certainly
> not what we want as a side effect of your changes.

I don't understand which side effect you mean. I'm just trying to make
sure the BARs skip any conflicting range while still staying within
these resource ranges.

>
>>> and bar_data_upper will likely end up being garbage.
>>>
>>> Did you actually _test_ this code?
>>
>> Actually in my real case those RMRR ranges are always below MMIO.
>
> Below whose MMIO? The host's or the guest's? In the latter case,
> just (in order to test your code) increase the range reserved for
> MMIO enough to cover the RMRR range.

On my platform,

RMRR region: base_addr ab80a000 end_address ab81dfff
RMRR region: base_addr ad000000 end_address af7fffff

So I guess you want me to change this

#define PCI_MEM_START       0xf0000000

to

#define PCI_MEM_START       0xa0000000

right?

But what test do you want to see? Just a boot?

Thanks
Tiejun


* Re: [v7][RFC][PATCH 06/13] hvmloader/ram: check if guest memory is out of reserved device memory maps
  2014-10-27 10:17       ` Jan Beulich
@ 2014-10-28  7:47         ` Chen, Tiejun
  2014-10-28 10:06           ` Jan Beulich
  0 siblings, 1 reply; 180+ messages in thread
From: Chen, Tiejun @ 2014-10-28  7:47 UTC (permalink / raw)
  To: Jan Beulich; +Cc: yang.z.zhang, kevin.tian, tim, xen-devel

On 2014/10/27 18:17, Jan Beulich wrote:
>>>> On 27.10.14 at 09:09, <tiejun.chen@intel.com> wrote:
>> On 2014/10/24 22:56, Jan Beulich wrote:
>>>>>> On 24.10.14 at 09:34, <tiejun.chen@intel.com> wrote:
>>>> We need to check to reserve all reserved device memory maps in e820
>>>> to avoid any potential guest memory conflict.
>>>>
>>>> Currently, if we can't insert RDM entries directly, we may need to handle
>>>> several ranges as follows:
>>>> a. Fixed Ranges --> BUG()
>>>>    lowmem_reserved_base-0xA0000: reserved by BIOS implementation,
>>>>    BIOS region,
>>>>    RESERVED_MEMBASE ~ 0x100000000,
>>>
>>> This seems conceptually wrong to me, and I said so before:
>>> Depending on host characteristics this approach may mean you're
>>> going to be unable to build any HVM guests. Minimally there needs
>>> to be a way to avoid these checks (resulting in devices associated
>>> with RMRRs not being assignable to such a guest). I'm therefore
>>
>> I just use 'err' to indicate if these fixed range overlaps RMRR,
>>
>> +    /* These overlap may issue guest can't work well. */
>> +    if ( err )
>> +    {
>> +        printf("Guest can't work with some reserved device memory overlap!\n");
>> +        BUG();
>> +    }
>>
>> As I understand, these fixed ranges don't like RAM that we can move
>> safely out any RMRR overlap. And actually its rare to overlap with those
>> fixed ranges.
>
> Again - one of my systems has RMRRs in the Ex000 range, which
> certainly risks overlapping with the BIOS image should that one be
> larger than 64k. Plus with RMRRs being in that region, I can
> certainly see (physical) systems with small enough BIOS images
> to place RMRRs even in the low Fx000 range, which then quite
> certainly would overlap with the (virtual) BIOS range.
>
>> But I can remove BUG if you insist on this point.
>
> Whether removing the BUG() here is correct and/or sufficient to
> address my concern I can't immediately tell. What I insist on is that

Okay.

> _no matter_ what RMRRs a physical host has, it should not prevent
> the creation of guests (the worst that may result is that passing
> through certain devices doesn't work anymore, and even then the
> operator needs to be given a way of circumventing this if (s)he
> knows that the device won't access the range post-boot, or if it's
> being deemed acceptable for it to do so).

As far as we know, only legacy USB and GFX need these RMRR ranges. In
particular, I believe USB needs much less than 1M of space, so its range
may well be placed below 1M. But I think we can ask the BIOS to relocate
it upwards, as on my real platform,

RMRR region: base_addr ab80a000 end_address ab81dfff

I don't know what platform you're using, maybe it's a legacy machine? But
anyway it should be feasible to update the BIOS. We could even ask BIOS
vendors to do this as a normal rule in the future.

GFX often needs dozens of MB,

RMRR region: base_addr ad000000 end_address af7fffff

So it shouldn't overlap with the area below 1M.

>
>>>> +            /* If we're going last RAM:Hole range */
>>>> +            else if ( end < next_start &&
>>>> +                      rdm_start > start &&
>>>> +                      rdm_end < next_start &&
>>>> +                      type == E820_RAM )
>>>> +            {
>>>> +                if ( do_insert )
>>>> +                {
>>>> +                    memmove(&e820[j+1], &e820[j],
>>>> +                            (sum_nr - j) * sizeof(struct e820entry));
>>>> +
>>>> +                    e820[j].size = rdm_start - e820[j].addr;
>>>> +                    e820[j].type = E820_RAM;
>>>> +
>>>> +                    e820[j+1].addr = rdm_start;
>>>> +                    e820[j+1].size = rdm_end - rdm_start;
>>>> +                    e820[j+1].type = E820_RESERVED;
>>>> +                    next_e820_entry_index++;
>>>> +                }
>>>> +                insert++;
>>>> +            }
>>>
>>> This if-else-if series looks horrible - is there really no way to consolidate
>>> it? Also, other than punching holes in the E820 map you don't seem to
>>
>> I know this is ugly but as you know there's no any rule we can make good
>> use of this case. RMRR can start anywhere so We have to assume any
>> scenarios,
>>
>> 1. Just amid those remaining e820 entries.
>> 2. Already at the end.
>> 3. If coincide with one RAM range.
>> 4. If we're just aligned with start of one RAM range.
>> 5. If we're just aligned with end of one RAM range.
>> 6. If we're just in of one RAM range.
>> 7. If we're going last RAM:Hole range.
>>
>> So if you think we're handling correctly, maybe we can continue
>> optimizing this way once we have a better idea.
>
> I understand that there are various cases to be considered, but
> that's no different elsewhere. For example, look at
> xen/arch/x86/e820.c:e820_change_range_type() which gets

I don't think that situation is the same as our requirement.

Here we are trying to insert multiple entries, each covering a
different range.

Anyway, I can take a further look at whether we can improve this.
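
One possible direction, only as a rough sketch and not what the patch
does: clip every RAM entry against the reserved range first, and insert
the RESERVED entry once afterwards, so the cases collapse into four. The
struct layout, the type constants and e820_reserve_range() are all
simplified assumptions for this sketch. It ignores overlaps with
existing non-RAM entries and may leave the table partly modified when
it runs out of space.

```c
#include <stdint.h>
#include <string.h>

#define E820_RAM      1
#define E820_RESERVED 2

/* Simplified entry layout, assumed for this sketch. */
struct e820entry {
    uint64_t addr;
    uint64_t size;
    uint32_t type;
};

/*
 * Carve [rs, re) out of every RAM entry it intersects, then insert one
 * RESERVED entry in address order. Returns the new entry count, or -1
 * if the table (capacity "max") would overflow.
 */
int e820_reserve_range(struct e820entry *e820, unsigned int nr,
                       unsigned int max, uint64_t rs, uint64_t re)
{
    unsigned int i = 0;

    while ( i < nr )
    {
        uint64_t s = e820[i].addr, e = s + e820[i].size;

        if ( e820[i].type != E820_RAM || e <= rs || s >= re )
        {
            i++;                       /* not RAM, or no intersection */
        }
        else if ( s < rs && e > re )
        {
            /* Range strictly inside: split into head and tail. */
            if ( nr == max )
                return -1;
            memmove(&e820[i + 2], &e820[i + 1],
                    (nr - i - 1) * sizeof(*e820));
            e820[i].size = rs - s;
            e820[i + 1].addr = re;
            e820[i + 1].size = e - re;
            e820[i + 1].type = E820_RAM;
            nr++;
            i += 2;
        }
        else if ( s < rs )
        {
            e820[i].size = rs - s;     /* keep only the head */
            i++;
        }
        else if ( e > re )
        {
            e820[i].addr = re;         /* keep only the tail */
            e820[i].size = e - re;
            i++;
        }
        else
        {
            /* RAM entry fully covered: drop it. */
            memmove(&e820[i], &e820[i + 1], (nr - i - 1) * sizeof(*e820));
            nr--;
        }
    }

    if ( nr == max )
        return -1;

    /* Find the slot that keeps the table sorted by address. */
    for ( i = 0; i < nr && e820[i].addr < rs; i++ )
        ;
    memmove(&e820[i + 1], &e820[i], (nr - i) * sizeof(*e820));
    e820[i].addr = rs;
    e820[i].size = re - rs;
    e820[i].type = E820_RESERVED;

    return nr + 1;
}
```

Calling this once per reserved range would cover the "inside", "aligned
with start", "aligned with end" and "coincides" scenarios without a
separate branch for each placement relative to the neighbouring entries.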

> away with quite a bit shorter an if/else-if sequence.
>
>>> be doing anything here. And the earlier tools side patches didn't do
>>> anything about this either. Consequently, at the time where it may
>>> become necessary to establish the 1:1 mapping in the P2M, there'll
>>> be the RAM mapping still there, causing the device assignment to fail.
>>
>> But I already set these range as p2m_access_n, and as you see I also
>> reserved these range in e820 table. So although the RAM mapping still is
>> still there but no any actual access.
>
> That's being done in patch 8, but we're talking about patch 6 here.
> Also - what size are the RMRRs in your case? The USB ones I know

RMRR region: base_addr ab80a000 end_address ab81dfff
RMRR region: base_addr ad000000 end_address af7fffff

> of are typical single or very few page ones, so having the guest
> lose that amount of memory may be tolerable. But if the ranges can
> get any larger than a couple of pages, or if there can reasonably be
> a larger amount of them (like could be the case on e.g. multi-node
> systems), simply hiding that memory may not be well received by
> our users.

Customers may accept losing dozens of MB, but I'm not sure.

>
>> RMRR range:
>>
>> root@tchen0-Shark-Bay-Client-platform:/home/tchen0/workspace# xl dmesg |
>> grep RMRR
>> (XEN) [VT-D]dmar.c:834: found ACPI_DMAR_RMRR:
>> (XEN) [VT-D]dmar.c:679:   RMRR region: base_addr ab80a000 end_address ab81dfff
>> (XEN) [VT-D]dmar.c:834: found ACPI_DMAR_RMRR:
>> (XEN) [VT-D]dmar.c:679:   RMRR region: base_addr ad000000 end_address af7fffff
>> root@tchen0-Shark-Bay-Client-platform:/home/tchen0/workspace#
>>
>> Without my patch:
>>
>> (d4) E820 table:
>> (d4)  [00]: 00000000:00000000 - 00000000:0009e000: RAM
>> (d4)  [01]: 00000000:0009e000 - 00000000:000a0000: RESERVED
>> (d4)  HOLE: 00000000:000a0000 - 00000000:000e0000
>> (d4)  [02]: 00000000:000e0000 - 00000000:00100000: RESERVED
>> (d4)  [03]: 00000000:00100000 - 00000000:ab80a000: RAM
>> (d4)  [04]: 00000000:ab80a000 - 00000000:ab81e000: RESERVED
>> (d4)  [05]: 00000000:ab81e000 - 00000000:ad000000: RAM
>> (d4)  [06]: 00000000:ad000000 - 00000000:af800000: RESERVED
>
> Where would this reserved range come from when you patches
> aren't in place?
>
>> (d4)  HOLE: 00000000:af800000 - 00000000:fc000000
>> (d4)  [07]: 00000000:fc000000 - 00000001:00000000: RESERVED
>>
>>
>> With my patch:
>>
>> (d2)  f0000-fffff: Main BIOS
>> (d2) E820 table:
>> (d2)  [00]: 00000000:00000000 - 00000000:0009e000: RAM
>> (d2)  [01]: 00000000:0009e000 - 00000000:000a0000: RESERVED
>> (d2)  HOLE: 00000000:000a0000 - 00000000:000e0000
>> (d2)  [02]: 00000000:000e0000 - 00000000:00100000: RESERVED
>> (d2)  [03]: 00000000:00100000 - 00000000:ab80a000: RAM
>> (d2)  [04]: 00000000:ab80a000 - 00000000:ab81e000: RESERVED
>> (d2)  [05]: 00000000:ab81e000 - 00000000:ad000000: RAM
>> (d2)  [06]: 00000000:ad000000 - 00000000:af800000: RESERVED
>
> And this already answers what I asked above: You shouldn't be blindly
> hiding 40Mb from the guest.

If we don't reserve these RMRR ranges, the guest may use them as RAM
while a 1:1 mapping exists. That could affect a device in use by another
VM, or a device could corrupt these ranges in another VM.

Yes, we really need a policy for this. So please tell me what you expect.

Thanks
Tiejun


* Re: [v7][RFC][PATCH 08/13] xen/x86/p2m: set p2m_access_n for reserved device memory mapping
  2014-10-27 10:33       ` Jan Beulich
@ 2014-10-28  8:26         ` Chen, Tiejun
  2014-10-28 10:12           ` Jan Beulich
  0 siblings, 1 reply; 180+ messages in thread
From: Chen, Tiejun @ 2014-10-28  8:26 UTC (permalink / raw)
  To: Jan Beulich; +Cc: yang.z.zhang, kevin.tian, tim, xen-devel

On 2014/10/27 18:33, Jan Beulich wrote:
>>>> On 27.10.14 at 10:05, <tiejun.chen@intel.com> wrote:
>> On 2014/10/24 23:11, Jan Beulich wrote:
>>>>>> On 24.10.14 at 09:34, <tiejun.chen@intel.com> wrote:
>>    int
>>    guest_physmap_add_entry(struct domain *d, unsigned long gfn,
>>                            unsigned long mfn, unsigned int page_order,
>>
>>>
>>>> +            if ( rc )
>>>
>>> And the return value from the called function is of type int -
>>> non-zero may not just mean "true" but (when negative) also
>>> "error". You need to distinguish these cases.
>>
>> But in our case its impossible to get a negative value.
>
> Being guaranteed by what? Please don't simply take the
> _current implementation_ of iommu_get_reserved_device_memory()
> as reference - it could be changed at any time, and it allowing for an

Okay.
	if ( rc == 1 )

This means we hit an RMRR page. Other failure cases just post a warning
message.
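
A tiny sketch of the distinction being asked for, assuming the
convention "negative is an error, zero is no match, positive is a
reserved page". The enum and the function name are made up for this
sketch, and it treats any positive value as a hit rather than only 1.

```c
/* Assumed outcomes for this sketch. */
enum rdm_result { RDM_ERROR, RDM_MISS, RDM_HIT };

/* Map an int return code onto the three cases the caller must handle. */
enum rdm_result classify_rdm_rc(int rc)
{
    if ( rc < 0 )
        return RDM_ERROR;   /* report or propagate, never treat as a hit */

    return rc ? RDM_HIT : RDM_MISS;
}
```

With this shape a later change that makes the callee return a real error
code cannot be mistaken for "this gfn is reserved".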

> error return status already would make it perfectly fine for someone
> adding an actual case thereof not to go through all existing callers
> to check whether they can cope. This is a general code quality
> requirement to assure things remain maintainable.
>
>>>> +            {
>>>> +                /*
>>>> +                 * Just set p2m_access_n in case of shared-ept
>>>> +                 * or non-shared ept but 1:1 mapping.
>>>> +                 */
>>>> +                if ( iommu_use_hap_pt(d) ||
>>>> +                     (!iommu_use_hap_pt(d) && mfn == gfn) )
>>>
>>> How would, other than by chance, mfn equal gfn here? Also the
>>> double use of iommu_use_hap_pt(d) is pointless here.
>>
>> There are two scenarios we should concern:
>>
>> #1 in case of shared-ept.
>>
>> We always need to check so iommu_use_hap_pt(d) is good.
>>
>> #2 in case of non-sharepd-ept
>>
>> If mfn != gfn I think guest don't access RMRR range, so its allowed.
>
> And what if subsequently a device needing a 1:1 mapping at this GFN
> gets assigned? (I simply don't see why shared vs non-shared would
> matter here.)

In the non-shared EPT case we only create the VT-d table when we assign
a device with a 1:1 RMRR mapping. So as long as mfn != gfn, it's not
necessary to set p2m_access_n.

In the shared EPT case, we have to set p2m_access_n in every scenario.

>
>>>> +                {
>>>> +                    rc = p2m_set_entry(p2m, gfn, _mfn(mfn), page_order, t,
>>>> +                                       p2m_access_n);
>>>> +                    if ( rc )
>>>> +                        gdprintk(XENLOG_WARNING, "set rdm p2m failed: (%#lx)\n",
>>>> +                                 gfn);
>>>
>>> Such messages are (due to acting on a foreign domain) relatively
>>> useless without also logging the domain that is affected. Conversely,
>>> logging the current domain and vCPU (due to using gdprintk()) is
>>> rather pointless. Also please drop either the colon or the
>>> parentheses in the message.
>>
>> Can P2M_DEBUG work here?
>>
>> P2M_DEBUG("set rdm p2m failed: %#lx\n", gfn);
>
> I don't think this would magically add the missing information. Plus it

Sorry, is this okay?

gdprintk(XENLOG_WARNING, "Domain %hu set rdm p2m failed: %#lx\n",
                                  d->domain_id, gfn);

I don't think I've understood your comment properly, so I will ask
the others.

> would limit output to the !NDEBUG case, putting the practical
> usefulness of this under question even more.
>
> But anyway, looking at the existing code again, I think you'd be better
> off falling through to the p2m_set_entry() that's already there, just

Are you saying I should do this in p2m_set_entry()?

> altering the access permission value you pass. Less code, better

But here guest_physmap_add_entry() only sets p2m_access_n initially.
We won't alter the access permission until we really assign a device
and create the 1:1 mapping.

Thanks
Tiejun

> readable.
>
> Jan
>
>


* Re: [v7][RFC][PATCH 01/13] xen: RMRR fix
  2014-10-27  9:41     ` Jan Beulich
@ 2014-10-28  8:36       ` Chen, Tiejun
  2014-10-28  9:34         ` Jan Beulich
  0 siblings, 1 reply; 180+ messages in thread
From: Chen, Tiejun @ 2014-10-28  8:36 UTC (permalink / raw)
  To: Jan Beulich; +Cc: yang.z.zhang, kevin.tian, tim, xen-devel

On 2014/10/27 17:41, Jan Beulich wrote:
>>>> On 27.10.14 at 03:00, <tiejun.chen@intel.com> wrote:
>> On 2014/10/24 18:52, Jan Beulich wrote:
>>>>>> On 24.10.14 at 09:34, <tiejun.chen@intel.com> wrote:
>>>> 5. Before we take real device assignment, any access to RMRR may issue
>>>> ept_handle_violation because of p2m_access_n. Then we just call
>>>> update_guest_eip() to return.
>>>
>>> I.e. ignore such accesses? Why?
>>
>> Yeah. This illegal access isn't allowed but its enough to ignore that
>> without further protection or punishment.
>>
>> Or what procedure should be concerned here based on your opinion?
>
> If the access is illegal, inject a fault to the guest or kill it, unless you

Does "kill" mean we crash the domain? That seems radical, doesn't it?
So I guess it's better to inject a fault.

But what kind of fault would you prefer?

> can explain why ignoring such an access is correct/necessary (e.g.
> I could see this being the equivalent of an access to a memory region
> the address of which is not being decoded by any component in a
> physical system).
>
>>>> Now in our case we add a rule:
>>>>    - if p2m_access_n is set we also set this mapping.
>>>
>>> Does that not conflict with eventual use mem-access makes of this
>>> type?

Do you mean what will happen after we reset these ranges to
p2m_access_rw? We already reserve these ranges, so the guest shouldn't
actually access them. And if a guest still maliciously accesses them,
that device may not work well.

>>>
>>
>> In our case, we always initialize these RMRR ranges with p2m_access_n to
>> make sure we can intercept any illegal access to these range until we
>> can reset them with p2m_access_rw via set_identity_p2m_entry(d,
>> base_pfn, p2m_access_rw).
>
> This restates what the patch does but doesn't answer the question.

Or Yang,

Could you reply to this? I guess I'm still misunderstanding Jan's question.

Thanks
Tiejun

>
> Jan
>
>
>


* Re: [v7][RFC][PATCH 01/13] xen: RMRR fix
  2014-10-28  8:36       ` Chen, Tiejun
@ 2014-10-28  9:34         ` Jan Beulich
  2014-10-28  9:39           ` Razvan Cojocaru
  2014-10-29  0:48           ` Chen, Tiejun
  0 siblings, 2 replies; 180+ messages in thread
From: Jan Beulich @ 2014-10-28  9:34 UTC (permalink / raw)
  To: Tiejun Chen; +Cc: yang.z.zhang, kevin.tian, tim, xen-devel

>>> On 28.10.14 at 09:36, <tiejun.chen@intel.com> wrote:
> On 2014/10/27 17:41, Jan Beulich wrote:
>>>>> On 27.10.14 at 03:00, <tiejun.chen@intel.com> wrote:
>>> On 2014/10/24 18:52, Jan Beulich wrote:
>>>>>>> On 24.10.14 at 09:34, <tiejun.chen@intel.com> wrote:
>>>>> 5. Before we take real device assignment, any access to RMRR may issue
>>>>> ept_handle_violation because of p2m_access_n. Then we just call
>>>>> update_guest_eip() to return.
>>>>
>>>> I.e. ignore such accesses? Why?
>>>
>>> Yeah. This illegal access isn't allowed but its enough to ignore that
>>> without further protection or punishment.
>>>
>>> Or what procedure should be concerned here based on your opinion?
>>
>> If the access is illegal, inject a fault to the guest or kill it, unless you
> 
> Kill means we will crash domain? Seems its radical, isn't it? So I guess 
> its better to inject a fault.
> 
> But what kind of fault you prefer currently?

#GP (but this being arbitrary is why simply killing the guest is another
option to consider).

>>>>> Now in our case we add a rule:
>>>>>    - if p2m_access_n is set we also set this mapping.
>>>>
>>>> Does that not conflict with eventual use mem-access makes of this
>>>> type?
> 
> Do you mean what will happen after we reset these ranges as 
> p2m_access_rw? We already reserve these ranges guest shouldn't access 
> these range actually. And a guest still maliciously access them, that 
> device may not work well.

mem-access is functionality used by a control domain, not the domain
itself. You need to make sure that neither your use of p2m_access_n
can confuse the mem-access code, nor that their use can confuse you.

Jan


* Re: [v7][RFC][PATCH 01/13] xen: RMRR fix
  2014-10-28  9:34         ` Jan Beulich
@ 2014-10-28  9:39           ` Razvan Cojocaru
  2014-10-29  0:51             ` Chen, Tiejun
  2014-10-29  0:48           ` Chen, Tiejun
  1 sibling, 1 reply; 180+ messages in thread
From: Razvan Cojocaru @ 2014-10-28  9:39 UTC (permalink / raw)
  To: Jan Beulich, Tiejun Chen; +Cc: yang.z.zhang, kevin.tian, tim, xen-devel

On 10/28/2014 11:34 AM, Jan Beulich wrote:
>>>> On 28.10.14 at 09:36, <tiejun.chen@intel.com> wrote:
>> On 2014/10/27 17:41, Jan Beulich wrote:
>>>>>> On 27.10.14 at 03:00, <tiejun.chen@intel.com> wrote:
>>>> On 2014/10/24 18:52, Jan Beulich wrote:
>>>>>>>> On 24.10.14 at 09:34, <tiejun.chen@intel.com> wrote:
>>>>>> 5. Before we take real device assignment, any access to RMRR may issue
>>>>>> ept_handle_violation because of p2m_access_n. Then we just call
>>>>>> update_guest_eip() to return.
>>>>>
>>>>> I.e. ignore such accesses? Why?
>>>>
>>>> Yeah. This illegal access isn't allowed but its enough to ignore that
>>>> without further protection or punishment.
>>>>
>>>> Or what procedure should be concerned here based on your opinion?
>>>
>>> If the access is illegal, inject a fault to the guest or kill it, unless you
>>
>> Kill means we will crash domain? Seems its radical, isn't it? So I guess 
>> its better to inject a fault.
>>
>> But what kind of fault you prefer currently?
> 
> #GP (but this being arbitrary is why simply killing the guest is another
> option to consider).
> 
>>>>>> Now in our case we add a rule:
>>>>>>    - if p2m_access_n is set we also set this mapping.
>>>>>
>>>>> Does that not conflict with eventual use mem-access makes of this
>>>>> type?
>>
>> Do you mean what will happen after we reset these ranges as 
>> p2m_access_rw? We already reserve these ranges guest shouldn't access 
>> these range actually. And a guest still maliciously access them, that 
>> device may not work well.
> 
> mem-access is functionality used by a control domain, not the domain
> itself. You need to make sure that neither your use of p2m_access_n
> can confuse the mem-access code, nor that their use can confuse you.

Jan makes a very good point. If a guest, as you say, maliciously
accesses any of the guest's pages, a dom0 application (working via the
mem_access mechanism) might want to know about it (regardless of what
the guest itself can and cannot do). :)

So please, make sure that no such application will get confused by the
changes.


Thanks,
Razvan


* Re: [v7][RFC][PATCH 04/13] hvmloader/util: get reserved device memory maps
  2014-10-28  5:21         ` Chen, Tiejun
@ 2014-10-28  9:48           ` Jan Beulich
  2014-10-29  6:54             ` Chen, Tiejun
  0 siblings, 1 reply; 180+ messages in thread
From: Jan Beulich @ 2014-10-28  9:48 UTC (permalink / raw)
  To: Tiejun Chen; +Cc: yang.z.zhang, kevin.tian, tim, xen-devel

>>> On 28.10.14 at 06:21, <tiejun.chen@intel.com> wrote:
> On 2014/10/27 17:45, Jan Beulich wrote:
>>>>> On 27.10.14 at 04:12, <tiejun.chen@intel.com> wrote:
>>> On 2014/10/24 22:22, Jan Beulich wrote:
>>>>>>> On 24.10.14 at 09:34, <tiejun.chen@intel.com> wrote:
>>>>> --- a/tools/firmware/hvmloader/util.h
>>>>> +++ b/tools/firmware/hvmloader/util.h
>>>>> @@ -241,6 +241,12 @@ int build_e820_table(struct e820entry *e820,
>>>>>                         unsigned int bios_image_base);
>>>>>    void dump_e820_table(struct e820entry *e820, unsigned int nr);
>>>>>
>>>>> +#include <xen/memory.h>
>>>>> +#define ENOBUFS     105 /* No buffer space available */
>>>>
>>>> This is a joke I hope? The #include belongs at the top (albeit afaict
>>>> you don't really need it here), and the #define is completely
>>>
>>> If without this line, #include <xen/memory.h>,
>>>
>>> In file included from build.c:25:0:
>>> ../util.h:246:70: error: array type has incomplete element type
>>>    int get_reserved_device_memory_map(struct xen_reserved_device_memory
>>> entries[],
>>>                                                                         ^
>>> make[8]: *** [build.o] Error 1
>>
>> So just forward declare the structure ahead of the function
>> declaration.
> 
> tools/firmware/hvmloader/pci.c:28:#include <xen/memory.h>
> tools/firmware/hvmloader/ovmf.c:36:#include <xen/memory.h>
> 
> So any reason I can't do such a same thing?

You can, but it's undesirable. You're wanting this in a header, i.e.
you'll make everyone consuming that header also implicitly depend
on the new header you would include. We shouldn't pointlessly
add build dependencies (and we should really try to reduce them
where possible).
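
A minimal sketch of the forward-declaration approach, assuming the
prototype is changed to take a pointer (C rejects an array parameter whose
element type is incomplete, which is the build error quoted above). The
helper name total_rdm_pages() and the field layout of the stand-in
structure are illustrative assumptions; the real definition lives in
xen/memory.h.

```c
#include <stdint.h>

/* --- What the shared header would carry: no #include needed. --- */
struct xen_reserved_device_memory;            /* forward declaration only */

/* A pointer to an incomplete type is allowed in a prototype. */
uint64_t total_rdm_pages(const struct xen_reserved_device_memory *entries,
                         unsigned int nr_entries);

/* --- What the one .c file that needs the layout would see. --- */
/* Stand-in definition for illustration; not copied from xen/memory.h. */
struct xen_reserved_device_memory {
    uint64_t start_pfn;
    uint64_t nr_pages;
};

/* Hypothetical helper: sum the page counts of all reported ranges. */
uint64_t total_rdm_pages(const struct xen_reserved_device_memory *entries,
                         unsigned int nr_entries)
{
    uint64_t total = 0;
    unsigned int i;

    for (i = 0; i < nr_entries; i++)
        total += entries[i].nr_pages;

    return total;
}
```

Only the source file that dereferences the entries pays for the full
header; every other consumer of the shared header sees just the tag.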

>>>> misplaced here. While I generally wouldn't recommend doing this, I
>>>> think in the case here including the hypervisor header that defines
>>>> them would be okay. Perhaps not via relative path, but via having
>>>
>>> Seems we just need to include this,
>>>
>>> #include <errno.h>
>>
>> You shouldn't include system headers here - what if the build system's
>> -E... values differ from Xen's? Please remember that what you're making
> 
> tools/firmware/hvmloader/xenbus.c:30:#include <errno.h>

This is a completely different case: For one, no-one really looks at
the error codes generated here. And even if someone would, these
would be error values purely internal to hvmloader. Whereas in your
case you want to interpret a value you get back from the hypervisor.

> And why will Xen define this different?

Why would Linux, *BSD, Solaris, and whatever else OS usable as a
build host for building Xen all be required to use exactly the same
-E... definitions when already Linux has variations for some of them
depending on host architecture (i.e. when doing cross builds you'd
risk running into problems even on Linux)?
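
A minimal sketch of interpreting a hypercall return value without relying
on the build host's <errno.h>. The name HV_ENOBUFS and the helper are
hypothetical; the value 105 is taken from the #define quoted earlier in
this thread and is assumed here to be what the hypervisor returns.

```c
/*
 * Hypothetical ABI constant: what the hypervisor is assumed to return,
 * independent of whatever the build host's <errno.h> says ENOBUFS is.
 */
#define HV_ENOBUFS 105

/*
 * Returns 1 if the hypercall result means "the buffer was too small,
 * retry with more entries", 0 otherwise.
 */
int rdm_map_needs_bigger_buffer(int rc)
{
    return rc == -HV_ENOBUFS;
}
```

Because the comparison uses a constant owned by the interface, the result
does not change when the code is built on a host whose own ENOBUFS differs.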

Once again - please always keep in mind that you're modifying
hypervisor code, not some simple user mode application.

Jan


* Re: [v7][RFC][PATCH 05/13] hvmloader/mmio: reconcile guest mmio with reserved device memory
  2014-10-28  7:11         ` Chen, Tiejun
@ 2014-10-28  9:56           ` Jan Beulich
  2014-10-29  7:03             ` Chen, Tiejun
  0 siblings, 1 reply; 180+ messages in thread
From: Jan Beulich @ 2014-10-28  9:56 UTC (permalink / raw)
  To: Tiejun Chen; +Cc: yang.z.zhang, kevin.tian, tim, xen-devel

>>> On 28.10.14 at 08:11, <tiejun.chen@intel.com> wrote:
> On 2014/10/27 17:56, Jan Beulich wrote:
>>>>> On 27.10.14 at 08:12, <tiejun.chen@intel.com> wrote:
>>> On 2014/10/24 22:42, Jan Beulich wrote:
>>>>>>> On 24.10.14 at 09:34, <tiejun.chen@intel.com> wrote:
>>>>> --- a/tools/firmware/hvmloader/pci.c
>>>>> +++ b/tools/firmware/hvmloader/pci.c
>>>>> @@ -37,6 +37,44 @@ uint64_t pci_hi_mem_start = 0, pci_hi_mem_end = 0;
>>>>>    enum virtual_vga virtual_vga = VGA_none;
>>>>>    unsigned long igd_opregion_pgbase = 0;
>>>>>
>>>>> +unsigned int need_skip_rmrr = 0;
>>>>
>>>> Static (and without initializer)?
>>>
>>> static unsigned int need_skip_rmrr;
>>
>> Please stop echoing back what was requested.
> 
> I try to fix inline to make sure I'm addressing all comments inline 
> properly. If you think this is correct please just ignore that.

But it still takes people reading your reply time to read that. Please
ask back only when you think an earlier reply was ambiguous.

>>>>> +    for ( i = 0; i < nr_entries; i++ )
>>>>> +    {
>>>>> +        rdm_start = rdm_map[i].start_pfn << PAGE_SHIFT;
>>>>> +        rdm_end = rdm_start + (rdm_map[i].nr_pages << PAGE_SHIFT);
>>>>
>>>> I'm pretty certain I pointed out before that you can't simply shift
>>>> these fields - you risk losing significant bits.
>>>
>>> I tried to go back looking into something but just found you were saying
>>> I shouldn't use PAGE_SHIFT and PAGE_SIZE at the same time. If I'm still
>>> missing could you show me what you expect?
>>
>> Shifting a 32-bit quantity left still yields a 32-bit quantity, no matter
>> whether the result is then stored in a 64-bit variable. You need to
>> up-cast the left side of the shift first.
> 
> Do you mean this?
> 
> rdm_start = (uint64_t)rdm_map[j].start_pfn << PAGE_SHIFT;
> rdm_end = rdm_start + ((uint64_t)rdm_map[j].nr_pages << PAGE_SHIFT);

Yes. Finally.
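
A minimal sketch of why the cast matters, assuming a 32-bit frame number
field, a 32-bit unsigned int, and a PAGE_SHIFT of 12. The function names
are illustrative only.

```c
#include <stdint.h>

#define PAGE_SHIFT 12

/*
 * Buggy variant: the shift is evaluated in 32 bits, so the upper bits are
 * discarded before the result is widened for the 64-bit return value.
 */
uint64_t pfn_to_addr_truncating(uint32_t pfn)
{
    return pfn << PAGE_SHIFT;
}

/* Fixed variant: widen the left operand first, then shift in 64 bits. */
uint64_t pfn_to_addr(uint32_t pfn)
{
    return (uint64_t)pfn << PAGE_SHIFT;
}
```

For a frame number such as 0xab80a both variants agree, which is why the
problem stays hidden on a platform whose RMRRs all sit below 4GiB. For
frame 0x100000 (the 4GiB boundary) the first variant yields 0.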

>>>>> @@ -58,7 +96,9 @@ void pci_setup(void)
>>>>>            uint32_t bar_reg;
>>>>>            uint64_t bar_sz;
>>>>>        } *bars = (struct bars *)scratch_start;
>>>>> -    unsigned int i, nr_bars = 0;
>>>>> +    unsigned int i, j, nr_bars = 0;
>>>>> +    int nr_entries = 0;
>>>>
>>>> And another pointless initializer. Plus as a count of something this
>>>
>>> int nr_rdm_entries;
>>>
>>>> surely wants to be "unsigned int". Also I guess the variable name
>>>
>>> nr_rdm_entries should be literally unsigned int but this value always be
>>> set from  hvm_get_reserved_device_memory_map(),
>>>
>>> nr_rdm_entries = hvm_get_reserved_device_memory_map()
>>>
>>> I hope that return value can be negative value in some failed case
>>
>> If only you checked for these negative values...
> 
> May I can simplify these failed cases handle with within 
> hvm_get_reserved_device_memory_map() like,
> 	if ( rc )
> 		return 0;
> 
> Because actually we don't need any negative return value again. So '0' 
> is always fine. So here,
> 
> unsigned long nr_rdm_entries = hvm_get_reserved_device_memory_map();

Except that "unsigned int" would seem more consistent.

>>> Additionally, actually there are some original codes just following my
>>> codes:
>>>
>>>           if ( need_skip_rmrr )
>>>           {
>>> 		...
>>>           }
>>>
>>> 	base += bar_sz;
>>>
>>>           if ( (base < resource->base) || (base > resource->max) )
>>>           {
>>>               printf("pci dev %02x:%x bar %02x size "PRIllx": no space for "
>>>                      "resource!\n", devfn>>3, devfn&7, bar_reg,
>>>                      PRIllx_arg(bar_sz));
>>>               continue;
>>>           }
>>>
>>> This can guarantee we don't overwhelm the previous mmio range.
>>
>> Resulting in the BAR not getting a value assigned afaict. Certainly
>> not what we want as a side effect of your changes.
> 
> I don't understand what a side effect is. I just to try to make sure BAR 
> space skip any conflict range but they are still in these resource ranges.

A side effect is an effect you don't primarily intend with your change
(or more generally, with any particular operation). In the case here,
a BAR that previously got a value assigned may not anymore with
your change in place. An acceptable effect of your change would be
if the value it gets assigned is now different, but not assigning a value
at all is not acceptable.

>>>> and bar_data_upper will likely end up being garbage.
>>>>
>>>> Did you actually _test_ this code?
>>>
>>> Actually in my real case those RMRR ranges are always below MMIO.
>>
>> Below whose MMIO? The host's or the guest's? In the latter case,
>> just (in order to test your code) increase the range reserved for
>> MMIO enough to cover the RMRR range.
> 
> In my platform,
> 
> RMRR region: base_addr ab80a000 end_address ab81dfff
> RMRR region: base_addr ad000000 end_address af7fffff
> 
> So I guess you hope I change this
> 
> #define PCI_MEM_START       0xf0000000
> 
> to
> 
> #define PCI_MEM_START       0xa0000000
> 
> right?

Almost. You'd have to use 0x80000000, or other logic would break.

> But what test you want to see? Just boot?

Yes, guest boot, but including proper inspection that what your
code does is really how it should be.

Jan


* Re: [v7][RFC][PATCH 06/13] hvmloader/ram: check if guest memory is out of reserved device memory maps
  2014-10-28  7:47         ` Chen, Tiejun
@ 2014-10-28 10:06           ` Jan Beulich
  2014-10-29  7:43             ` Chen, Tiejun
  0 siblings, 1 reply; 180+ messages in thread
From: Jan Beulich @ 2014-10-28 10:06 UTC (permalink / raw)
  To: Tiejun Chen; +Cc: yang.z.zhang, kevin.tian, tim, xen-devel

>>> On 28.10.14 at 08:47, <tiejun.chen@intel.com> wrote:
> On 2014/10/27 18:17, Jan Beulich wrote:
>>>>> On 27.10.14 at 09:09, <tiejun.chen@intel.com> wrote:
>>> On 2014/10/24 22:56, Jan Beulich wrote:
>> _no matter_ what RMRRs a physical host has, it should not prevent
>> the creation of guests (the worst that may result is that passing
>> through certain devices doesn't work anymore, and even then the
>> operator needs to be given a way of circumventing this if (s)he
>> knows that the device won't access the range post-boot, or if it's
>> being deemed acceptable for it to do so).
> 
> As we know just legacy USB and GFX need these RMRR ranges.

This is specified where?

> Especially, I 
> believe just USB need << 1M space, so it may be possible to be placed 
> below 1M. But I think we can ask BIOS to reallocate them upwards like my 
> real platform,
> 
> RMRR region: base_addr ab80a000 end_address ab81dfff
> 
> I don't know what platform you're using, maybe its a legacy machine?

A Westmere one.

> But 
> anyway it should be feasible to update BIOS. And even we can ask BIOS do 
> this as a normal rule in the future.
> 
> For GFX, oftentimes it need dozens of MB,
> 
> RMRR region: base_addr ad000000 end_address af7fffff
> 
> So it shouldn't be overlapped with <1M.

These "I believe" and "shouldn't" are a real problem here: Please
make claims only based on the specification, not on observations on
a particular system. In the real world, you have to be prepared for
implementations of a specification to be taking more liberties than
allowed for; you should never assume people don't even make use
of the full scope a specification provides for. I know I'm repeating
myself, but again - remember you're changing the hypervisor here
(and view hvmloader as an extension of the hypervisor inside the
guest).

>>> I know this is ugly but as you know there's no any rule we can make good
>>> use of this case. RMRR can start anywhere so We have to assume any
>>> scenarios,
>>>
>>> 1. Just amid those remaining e820 entries.
>>> 2. Already at the end.
>>> 3. If coincide with one RAM range.
>>> 4. If we're just aligned with start of one RAM range.
>>> 5. If we're just aligned with end of one RAM range.
>>> 6. If we're just in of one RAM range.
>>> 7. If we're going last RAM:Hole range.
>>>
>>> So if you think we're handling correctly, maybe we can continue
>>> optimizing this way once we have a better idea.
>>
>> I understand that there are various cases to be considered, but
>> that's no different elsewhere. For example, look at
>> xen/arch/x86/e820.c:e820_change_range_type() which gets
> 
> I don't think this circumstance is same as our requirement.
> 
> Here we are trying to insert different multiple entries that they have 
> different range.

Inserting multiple entries can always be done by inserting one
entry at a time. If that yields better (easier to understand and
maintain) code, and if the code path isn't a hot one, that should
be the route to go.
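
A minimal sketch of the one-entry-at-a-time idea, not the code of
e820_change_range_type() nor of the patch under review. It assumes a table
sorted by address with non-overlapping entries, room for E820_MAX entries,
and a reserved range that only overlaps RAM or holes. All names are
illustrative.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define E820_RAM      1
#define E820_RESERVED 2
#define E820_MAX      32

struct e820entry { uint64_t addr, size; uint32_t type; };

/* Insert e at index pos, shifting later entries up. -1 if the table is full. */
static int e820_insert(struct e820entry *t, unsigned int *nr,
                       unsigned int pos, struct e820entry e)
{
    if (*nr >= E820_MAX || pos > *nr)
        return -1;
    memmove(&t[pos + 1], &t[pos], (*nr - pos) * sizeof(*t));
    t[pos] = e;
    ++*nr;
    return 0;
}

/* Remove the entry at index pos, shifting later entries down. */
static void e820_remove(struct e820entry *t, unsigned int *nr, unsigned int pos)
{
    memmove(&t[pos], &t[pos + 1], (*nr - pos - 1) * sizeof(*t));
    --*nr;
}

/*
 * Reserve [start, end): trim, split or drop any RAM entry overlapping the
 * range, then insert a single RESERVED entry at its sorted position.
 * Returns 0 on success, -1 if the table is full (the table may then have
 * been partially modified).
 */
int e820_reserve_range(struct e820entry *t, unsigned int *nr,
                       uint64_t start, uint64_t end)
{
    struct e820entry rsvd = { start, end - start, E820_RESERVED };
    unsigned int i = 0;

    assert(start < end);

    while (i < *nr) {
        uint64_t s = t[i].addr, e = s + t[i].size;

        if (t[i].type != E820_RAM || e <= start || s >= end) {
            i++;                                /* no overlap with RAM */
        } else if (s >= start && e <= end) {
            e820_remove(t, nr, i);              /* RAM entry fully covered */
        } else if (s < start && e > end) {
            struct e820entry tail = { end, e - end, E820_RAM };

            t[i].size = start - s;              /* range inside: split */
            if (e820_insert(t, nr, i + 1, tail))
                return -1;
            i += 2;
        } else if (s < start) {
            t[i].size = start - s;              /* range clips the end */
            i++;
        } else {
            t[i].addr = end;                    /* range clips the start */
            t[i].size = e - end;
            i++;
        }
    }

    for (i = 0; i < *nr && t[i].addr < start; i++)
        ;                                       /* find the sorted slot */

    return e820_insert(t, nr, i, rsvd);
}
```

Handling several ranges is then a plain loop calling e820_reserve_range()
once per range; the aligned, coinciding and fully contained cases listed
earlier all collapse into the branches of that single function.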

>>> With my patch:
>>>
>>> (d2)  f0000-fffff: Main BIOS
>>> (d2) E820 table:
>>> (d2)  [00]: 00000000:00000000 - 00000000:0009e000: RAM
>>> (d2)  [01]: 00000000:0009e000 - 00000000:000a0000: RESERVED
>>> (d2)  HOLE: 00000000:000a0000 - 00000000:000e0000
>>> (d2)  [02]: 00000000:000e0000 - 00000000:00100000: RESERVED
>>> (d2)  [03]: 00000000:00100000 - 00000000:ab80a000: RAM
>>> (d2)  [04]: 00000000:ab80a000 - 00000000:ab81e000: RESERVED
>>> (d2)  [05]: 00000000:ab81e000 - 00000000:ad000000: RAM
>>> (d2)  [06]: 00000000:ad000000 - 00000000:af800000: RESERVED
>>
>> And this already answers what I asked above: You shouldn't be blindly
>> hiding 40Mb from the guest.
> 
> If we don't reserve these RMRR ranges, so guest may create 1:1 mapping. 
> Then it will affect a device usage in other VM, or a device usage may 
> corrupt these ranges in other VM.
> 
> Yes, we really need a policy to do this. So please tell me what you expect.

In the tool stack, don't even populate these holes with RAM. This
will then lead to RAM getting populated further up at the upper end.

Jan


* Re: [v7][RFC][PATCH 08/13] xen/x86/p2m: set p2m_access_n for reserved device memory mapping
  2014-10-28  8:26         ` Chen, Tiejun
@ 2014-10-28 10:12           ` Jan Beulich
  2014-10-29  8:20             ` Chen, Tiejun
  0 siblings, 1 reply; 180+ messages in thread
From: Jan Beulich @ 2014-10-28 10:12 UTC (permalink / raw)
  To: Tiejun Chen; +Cc: yang.z.zhang, kevin.tian, tim, xen-devel

>>> On 28.10.14 at 09:26, <tiejun.chen@intel.com> wrote:
> On 2014/10/27 18:33, Jan Beulich wrote:
>>>>> On 27.10.14 at 10:05, <tiejun.chen@intel.com> wrote:
>>> On 2014/10/24 23:11, Jan Beulich wrote:
>>>>>>> On 24.10.14 at 09:34, <tiejun.chen@intel.com> wrote:
>>>>> +            {
>>>>> +                /*
>>>>> +                 * Just set p2m_access_n in case of shared-ept
>>>>> +                 * or non-shared ept but 1:1 mapping.
>>>>> +                 */
>>>>> +                if ( iommu_use_hap_pt(d) ||
>>>>> +                     (!iommu_use_hap_pt(d) && mfn == gfn) )
>>>>
>>>> How would, other than by chance, mfn equal gfn here? Also the
>>>> double use of iommu_use_hap_pt(d) is pointless here.
>>>
>>> There are two scenarios we should concern:
>>>
>>> #1 in case of shared-ept.
>>>
>>> We always need to check so iommu_use_hap_pt(d) is good.
>>>
>>> #2 in case of non-sharepd-ept
>>>
>>> If mfn != gfn I think guest don't access RMRR range, so its allowed.
>>
>> And what if subsequently a device needing a 1:1 mapping at this GFN
>> gets assigned? (I simply don't see why shared vs non-shared would
>> matter here.)
> 
> In case of non-shared ept we just create VT-d table, if we assign a 
> device with 1:1 RMRR mapping. So as long as mfn != gfn, its not 
> necessary to set p2m_access_n.

I think it was mentioned before (and not just by me) that it's at least
risky to allow the two page tables to get out of sync wrt the
translations they do (as opposed to permissions they set).

>>>>> +                {
>>>>> +                    rc = p2m_set_entry(p2m, gfn, _mfn(mfn), page_order, t,
>>>>> +                                       p2m_access_n);
>>>>> +                    if ( rc )
>>>>> +                        gdprintk(XENLOG_WARNING, "set rdm p2m failed: (%#lx)\n",
>>>>> +                                 gfn);
>>>>
>>>> Such messages are (due to acting on a foreign domain) relatively
>>>> useless without also logging the domain that is affected. Conversely,
>>>> logging the current domain and vCPU (due to using gdprintk()) is
>>>> rather pointless. Also please drop either the colon or the
>>>> parentheses in the message.
>>>
>>> Can P2M_DEBUG work here?
>>>
>>> P2M_DEBUG("set rdm p2m failed: %#lx\n", gfn);
>>
>> I don't think this would magically add the missing information. Plus it
> 
> Sorry, is it okay?
> 
> gdprintk(XENLOG_WARNING, "Domain %hu set rdm p2m failed: %#lx\n",
>                                   d->domain_id, gfn);

Almost. Use printk() instead of gdprintk(), and %d instead of %hu.

>> would limit output to the !NDEBUG case, putting the practical
>> usefulness of this under question even more.
>>
>> But anyway, looking at the existing code again, I think you'd be better
>> off falling through to the p2m_set_entry() that's already there, just
> 
> Are you saying I should do this in p2m_set_entry()?

No. I said that I'd prefer you to use the _existing call to_
p2m_set_entry().

>> altering the access permission value you pass. Less code, better
> 
> But here, guest_physmap_add_entry() just initiate to set p2m_access_n. 
> We will alter the access permission until if we really assign device to 
> create a 1:1 mapping.

And I didn't question that. All I said is to use the existing call by simply
replacing the unconditional use of the default access value with one
conditionally set to p2m_access_n.
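
A toy model of that suggestion, assuming nothing about the real
guest_physmap_add_entry() beyond what is quoted above: there is one call
that sets the entry, and only the access value passed to it varies. Every
type and function name here is a stand-in, not Xen code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-ins for the p2m access type and table. */
typedef enum { ACCESS_N, ACCESS_RWX } access_t;

struct toy_p2m {
    access_t default_access;
    access_t last_access;     /* what the most recent set call was given */
    unsigned int calls;       /* how many times the set call ran */
};

struct rdm_range { uint64_t start_pfn, nr_pages; };

/* Hypothetical check: does gfn fall inside any reserved device range? */
bool gfn_in_rdm(const struct rdm_range *r, unsigned int nr, uint64_t gfn)
{
    unsigned int i;

    for (i = 0; i < nr; i++)
        if (gfn >= r[i].start_pfn && gfn - r[i].start_pfn < r[i].nr_pages)
            return true;

    return false;
}

/* Stand-in for the single, already existing "set entry" call. */
static int toy_set_entry(struct toy_p2m *p2m, uint64_t gfn, access_t a)
{
    (void)gfn;
    p2m->last_access = a;
    p2m->calls++;
    return 0;
}

/* One call site; only the access argument is chosen conditionally. */
int toy_physmap_add(struct toy_p2m *p2m, const struct rdm_range *r,
                    unsigned int nr, uint64_t gfn)
{
    access_t a = gfn_in_rdm(r, nr, gfn) ? ACCESS_N : p2m->default_access;

    return toy_set_entry(p2m, gfn, a);
}
```

The point of the shape is that no second code path with its own error
handling and logging is introduced; the reserved case differs only in the
value handed to the existing call.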

Jan


* Re: [v7][RFC][PATCH 01/13] introduce XENMEM_reserved_device_memory_map
  2014-10-28  2:35     ` Chen, Tiejun
@ 2014-10-28 10:36       ` Jan Beulich
  2014-10-29  0:40         ` Chen, Tiejun
  0 siblings, 1 reply; 180+ messages in thread
From: Jan Beulich @ 2014-10-28 10:36 UTC (permalink / raw)
  To: Tiejun Chen, Julien Grall; +Cc: yang.z.zhang, kevin.tian, tim, xen-devel

>>> On 28.10.14 at 03:35, <tiejun.chen@intel.com> wrote:
> On 2014/10/27 21:35, Julien Grall wrote:
>> Hi,
>>
>> On 10/24/2014 08:34 AM, Tiejun Chen wrote:
>>> diff --git a/xen/common/memory.c b/xen/common/memory.c
>>> index cc36e39..51a32a8 100644
>>> --- a/xen/common/memory.c
>>> +++ b/xen/common/memory.c
>>> @@ -692,6 +692,32 @@ out:
>>>       return rc;
>>>   }
>>>
>>> +struct get_reserved_device_memory {
>>> +    struct xen_mem_reserved_device_memory_map map;
>>> +    unsigned int used_entries;
>>> +};
>>> +
>>> +static int get_reserved_device_memory(xen_pfn_t start,
>>> +                                      xen_ulong_t nr, void *ctxt)
>>
>> This function is only used when HAS_PASSTHROUGH is defined. You have to
>> protected by an #ifdef HAS_PASSTHROUGH.
>>
> 
> I guess you mean we need to do this,
> 
> diff --git a/xen/common/memory.c b/xen/common/memory.c
> index 1449c10..2177c56 100644
> --- a/xen/common/memory.c
> +++ b/xen/common/memory.c
> @@ -692,6 +692,7 @@ out:
>       return rc;
>   }
> 
> +#ifdef HAS_PASSTHROUGH
>   struct get_reserved_device_memory {
>       struct xen_reserved_device_memory_map map;
>       unsigned int used_entries;
> @@ -717,6 +718,7 @@ static int get_reserved_device_memory(xen_pfn_t start,
> 
>       return 0;
>   }
> +#endif
> 
>   long do_memory_op(unsigned long cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
>   {
> 
> Jan,
> 
> With this above change, is the following working for you?

I already fixed this in my version (which I view to be the canonical
one).

Jan


* Re: [v7][RFC][PATCH 01/13] introduce XENMEM_reserved_device_memory_map
  2014-10-28 10:36       ` Jan Beulich
@ 2014-10-29  0:40         ` Chen, Tiejun
  2014-10-29  8:53           ` Jan Beulich
  0 siblings, 1 reply; 180+ messages in thread
From: Chen, Tiejun @ 2014-10-29  0:40 UTC (permalink / raw)
  To: Jan Beulich, Julien Grall; +Cc: yang.z.zhang, kevin.tian, tim, xen-devel

On 2014/10/28 18:36, Jan Beulich wrote:
>>>> On 28.10.14 at 03:35, <tiejun.chen@intel.com> wrote:
>> On 2014/10/27 21:35, Julien Grall wrote:
>>> Hi,
>>>
>>> On 10/24/2014 08:34 AM, Tiejun Chen wrote:
>>>> diff --git a/xen/common/memory.c b/xen/common/memory.c
>>>> index cc36e39..51a32a8 100644
>>>> --- a/xen/common/memory.c
>>>> +++ b/xen/common/memory.c
>>>> @@ -692,6 +692,32 @@ out:
>>>>        return rc;
>>>>    }
>>>>
>>>> +struct get_reserved_device_memory {
>>>> +    struct xen_mem_reserved_device_memory_map map;
>>>> +    unsigned int used_entries;
>>>> +};
>>>> +
>>>> +static int get_reserved_device_memory(xen_pfn_t start,
>>>> +                                      xen_ulong_t nr, void *ctxt)
>>>
>>> This function is only used when HAS_PASSTHROUGH is defined. You have to
>>> protected by an #ifdef HAS_PASSTHROUGH.
>>>
>>
>> I guess you mean we need to do this,
>>
>> diff --git a/xen/common/memory.c b/xen/common/memory.c
>> index 1449c10..2177c56 100644
>> --- a/xen/common/memory.c
>> +++ b/xen/common/memory.c
>> @@ -692,6 +692,7 @@ out:
>>        return rc;
>>    }
>>
>> +#ifdef HAS_PASSTHROUGH
>>    struct get_reserved_device_memory {
>>        struct xen_reserved_device_memory_map map;
>>        unsigned int used_entries;
>> @@ -717,6 +718,7 @@ static int get_reserved_device_memory(xen_pfn_t start,
>>
>>        return 0;
>>    }
>> +#endif
>>
>>    long do_memory_op(unsigned long cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
>>    {
>>
>> Jan,
>>
>> With this above change, is the following working for you?
>
> I already fixed this in my version (which I view to be the canonical
> one).
>

Are you referring to the attached patch? Are you sure? Here are some code
fragments from your latest patch,

--- a/xen/common/memory.c
+++ b/xen/common/memory.c
@@ -692,6 +692,32 @@ out:
      return rc;
  }

+struct get_reserved_device_memory {
+    struct xen_reserved_device_memory_map map;
+    unsigned int used_entries;
+};
+
+static int get_reserved_device_memory(xen_pfn_t start,
+                                      xen_ulong_t nr, void *ctxt)
+{
+    struct get_reserved_device_memory *grdm = ctxt;
+
+    if ( grdm->used_entries < grdm->map.nr_entries )
+    {
+        struct xen_reserved_device_memory rdm = {
+            .start_pfn = start, .nr_pages = nr
+        };
+
+        if ( __copy_to_guest_offset(grdm->map.buffer, grdm->used_entries,
+                                    &rdm, 1) )
+            return -EFAULT;
+    }
+
+    ++grdm->used_entries;
+
+    return 0;
+}
+
  long do_memory_op(unsigned long cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
  {
      struct domain *d;
@@ -1101,6 +1127,29 @@ long do_memory_op(unsigned long cmd, XEN
          break;
      }

+#ifdef HAS_PASSTHROUGH
+    case XENMEM_reserved_device_memory_map:
+    {
+        struct get_reserved_device_memory grdm;
+
+        if ( copy_from_guest(&grdm.map, arg, 1) ||
+             !guest_handle_okay(grdm.map.buffer, grdm.map.nr_entries) )
+            return -EFAULT;
+
+        grdm.used_entries = 0;
+        rc = iommu_get_reserved_device_memory(get_reserved_device_memory,
+                                              &grdm);
+
+        if ( !rc && grdm.map.nr_entries < grdm.used_entries )
+            rc = -ENOBUFS;
+        grdm.map.nr_entries = grdm.used_entries;
+        if ( __copy_to_guest(arg, &grdm.map, 1) )
+            rc = -EFAULT;
+
+        break;
+    }
+#endif
+
      default:
          rc = arch_memory_op(cmd, arg);
          break;
--- a/xen/drivers/passthrough/iommu.c

Thanks
Tiejun

> Jan
>
>


* Re: [v7][RFC][PATCH 01/13] xen: RMRR fix
  2014-10-28  9:34         ` Jan Beulich
  2014-10-28  9:39           ` Razvan Cojocaru
@ 2014-10-29  0:48           ` Chen, Tiejun
  2014-10-29  2:51             ` Chen, Tiejun
  2014-10-29  8:44             ` Jan Beulich
  1 sibling, 2 replies; 180+ messages in thread
From: Chen, Tiejun @ 2014-10-29  0:48 UTC (permalink / raw)
  To: Jan Beulich; +Cc: yang.z.zhang, kevin.tian, tim, xen-devel

On 2014/10/28 17:34, Jan Beulich wrote:
>>>> On 28.10.14 at 09:36, <tiejun.chen@intel.com> wrote:
>> On 2014/10/27 17:41, Jan Beulich wrote:
>>>>>> On 27.10.14 at 03:00, <tiejun.chen@intel.com> wrote:
>>>> On 2014/10/24 18:52, Jan Beulich wrote:
>>>>>>>> On 24.10.14 at 09:34, <tiejun.chen@intel.com> wrote:
>>>>>> 5. Before we take real device assignment, any access to RMRR may issue
>>>>>> ept_handle_violation because of p2m_access_n. Then we just call
>>>>>> update_guest_eip() to return.
>>>>>
>>>>> I.e. ignore such accesses? Why?
>>>>
>>>> Yeah. This illegal access isn't allowed but its enough to ignore that
>>>> without further protection or punishment.
>>>>
>>>> Or what procedure should be concerned here based on your opinion?
>>>
>>> If the access is illegal, inject a fault to the guest or kill it, unless you
>>
>> Kill means we will crash domain? Seems its radical, isn't it? So I guess
>> its better to inject a fault.
>>
>> But what kind of fault you prefer currently?
>
> #GP (but this being arbitrary is why simply killing the guest is another
> option to consider).

In this case I think we just need to follow native behavior. So I feel
#GP is the more reasonable choice.

>
>>>>>> Now in our case we add a rule:
>>>>>>     - if p2m_access_n is set we also set this mapping.
>>>>>
>>>>> Does that not conflict with eventual use mem-access makes of this
>>>>> type?
>>
>> Do you mean what will happen after we reset these ranges as
>> p2m_access_rw? We already reserve these ranges guest shouldn't access
>> these range actually. And a guest still maliciously access them, that
>> device may not work well.
>
> mem-access is functionality used by a control domain, not the domain

I really don't know this mechanism so thanks for your good coverage.

> itself. You need to make sure that neither your use of p2m_access_n
> can confuse the mem-access code, nor that their use can confuse you.

Absolutely, but I think I need to know more about mem-access first.

Thanks
Tiejun

>
> Jan
>
>
>


* Re: [v7][RFC][PATCH 01/13] xen: RMRR fix
  2014-10-28  9:39           ` Razvan Cojocaru
@ 2014-10-29  0:51             ` Chen, Tiejun
  0 siblings, 0 replies; 180+ messages in thread
From: Chen, Tiejun @ 2014-10-29  0:51 UTC (permalink / raw)
  To: Razvan Cojocaru, Jan Beulich; +Cc: yang.z.zhang, kevin.tian, tim, xen-devel

On 2014/10/28 17:39, Razvan Cojocaru wrote:
> On 10/28/2014 11:34 AM, Jan Beulich wrote:
>>>>> On 28.10.14 at 09:36, <tiejun.chen@intel.com> wrote:
>>> On 2014/10/27 17:41, Jan Beulich wrote:
>>>>>>> On 27.10.14 at 03:00, <tiejun.chen@intel.com> wrote:
>>>>> On 2014/10/24 18:52, Jan Beulich wrote:
>>>>>>>>> On 24.10.14 at 09:34, <tiejun.chen@intel.com> wrote:
>>>>>>> 5. Before we take real device assignment, any access to RMRR may issue
>>>>>>> ept_handle_violation because of p2m_access_n. Then we just call
>>>>>>> update_guest_eip() to return.
>>>>>>
>>>>>> I.e. ignore such accesses? Why?
>>>>>
>>>>> Yeah. This illegal access isn't allowed but its enough to ignore that
>>>>> without further protection or punishment.
>>>>>
>>>>> Or what procedure should be concerned here based on your opinion?
>>>>
>>>> If the access is illegal, inject a fault to the guest or kill it, unless you
>>>
>>> Kill means we will crash domain? Seems its radical, isn't it? So I guess
>>> its better to inject a fault.
>>>
>>> But what kind of fault you prefer currently?
>>
>> #GP (but this being arbitrary is why simply killing the guest is another
>> option to consider).
>>
>>>>>>> Now in our case we add a rule:
>>>>>>>     - if p2m_access_n is set we also set this mapping.
>>>>>>
>>>>>> Does that not conflict with eventual use mem-access makes of this
>>>>>> type?
>>>
>>> Do you mean what will happen after we reset these ranges as
>>> p2m_access_rw? We already reserve these ranges guest shouldn't access
>>> these range actually. And a guest still maliciously access them, that
>>> device may not work well.
>>
>> mem-access is functionality used by a control domain, not the domain
>> itself. You need to make sure that neither your use of p2m_access_n
>> can confuse the mem-access code, nor that their use can confuse you.
>
> Jan makes a very good point. If a guest, as you say, maliciously

Yes, he pointed out something I didn't consider but really need to address.

> accesses any of the guest's pages, a dom0 application (working via the
> mem_access mechanism) might want to know about it (regardless of what
> the guest itself can and cannot do). :)
>
> So please, make sure that no such application will get confused by the
> changes.

Thanks for your further comments.

Tiejun

>
>
> Thanks,
> Razvan
>
>


* Re: [v7][RFC][PATCH 01/13] xen: RMRR fix
  2014-10-29  0:48           ` Chen, Tiejun
@ 2014-10-29  2:51             ` Chen, Tiejun
  2014-10-29  8:45               ` Jan Beulich
  2014-10-29  8:44             ` Jan Beulich
  1 sibling, 1 reply; 180+ messages in thread
From: Chen, Tiejun @ 2014-10-29  2:51 UTC (permalink / raw)
  To: Jan Beulich; +Cc: yang.z.zhang, kevin.tian, tim, xen-devel

On 2014/10/29 8:48, Chen, Tiejun wrote:
> On 2014/10/28 17:34, Jan Beulich wrote:
>>>>> On 28.10.14 at 09:36, <tiejun.chen@intel.com> wrote:
>>> On 2014/10/27 17:41, Jan Beulich wrote:
>>>>>>> On 27.10.14 at 03:00, <tiejun.chen@intel.com> wrote:
>>>>> On 2014/10/24 18:52, Jan Beulich wrote:
>>>>>>>>> On 24.10.14 at 09:34, <tiejun.chen@intel.com> wrote:
>>>>>>> 5. Before we take real device assignment, any access to RMRR may
>>>>>>> issue
>>>>>>> ept_handle_violation because of p2m_access_n. Then we just call
>>>>>>> update_guest_eip() to return.
>>>>>>
>>>>>> I.e. ignore such accesses? Why?
>>>>>
>>>>> Yeah. This illegal access isn't allowed but its enough to ignore that
>>>>> without further protection or punishment.
>>>>>
>>>>> Or what procedure should be concerned here based on your opinion?
>>>>
>>>> If the access is illegal, inject a fault to the guest or kill it,
>>>> unless you
>>>
>>> Kill means we will crash domain? Seems its radical, isn't it? So I guess
>>> its better to inject a fault.
>>>
>>> But what kind of fault you prefer currently?
>>
>> #GP (but this being arbitrary is why simply killing the guest is another
>> option to consider).
>
> In this case I think we just need to refer to native behavior. So I feel
> GP may be a little bit reasonable.
>
>>
>>>>>>> Now in our case we add a rule:
>>>>>>>     - if p2m_access_n is set we also set this mapping.
>>>>>>
>>>>>> Does that not conflict with eventual use mem-access makes of this
>>>>>> type?
>>>
>>> Do you mean what will happen after we reset these ranges as
>>> p2m_access_rw? We already reserve these ranges guest shouldn't access
>>> these range actually. And a guest still maliciously access them, that
>>> device may not work well.
>>
>> mem-access is functionality used by a control domain, not the domain
>
> I really don't know this mechanism so thanks for your good coverage.
>
>> itself. You need to make sure that neither your use of p2m_access_n
>> can confuse the mem-access code, nor that their use can confuse you.
>
> Absolutely, but I think I need to know more about mem-access firstly.
>

I think this reserved device memory shouldn't be poked at all, since any
write may affect the device. What if a device with an RMRR isn't even
assigned to the current domain? Reads shouldn't be allowed either, since
they may still trigger unexpected behavior in the device.

So if mem_access tries to touch those RMRR ranges, could we let
mem_access exit directly with some message? I mean we can check whether
we're accessing those RMRR ranges in the XENMEM_access_op_set_access case.

Thanks
Tiejun


* Re: [v7][RFC][PATCH 04/13] hvmloader/util: get reserved device memory maps
  2014-10-28  9:48           ` Jan Beulich
@ 2014-10-29  6:54             ` Chen, Tiejun
  2014-10-29  9:05               ` Jan Beulich
  0 siblings, 1 reply; 180+ messages in thread
From: Chen, Tiejun @ 2014-10-29  6:54 UTC (permalink / raw)
  To: Jan Beulich; +Cc: yang.z.zhang, kevin.tian, tim, xen-devel

On 2014/10/28 17:48, Jan Beulich wrote:
>>>> On 28.10.14 at 06:21, <tiejun.chen@intel.com> wrote:
>> On 2014/10/27 17:45, Jan Beulich wrote:
>>>>>> On 27.10.14 at 04:12, <tiejun.chen@intel.com> wrote:
>>>> On 2014/10/24 22:22, Jan Beulich wrote:
>>>>>>>> On 24.10.14 at 09:34, <tiejun.chen@intel.com> wrote:
>>>>>> --- a/tools/firmware/hvmloader/util.h
>>>>>> +++ b/tools/firmware/hvmloader/util.h
>>>>>> @@ -241,6 +241,12 @@ int build_e820_table(struct e820entry *e820,
>>>>>>                          unsigned int bios_image_base);
>>>>>>     void dump_e820_table(struct e820entry *e820, unsigned int nr);
>>>>>>
>>>>>> +#include <xen/memory.h>
>>>>>> +#define ENOBUFS     105 /* No buffer space available */
>>>>>
>>>>> This is a joke I hope? The #include belongs at the top (albeit afaict
>>>>> you don't really need it here), and the #define is completely
>>>>
>>>> If without this line, #include <xen/memory.h>,
>>>>
>>>> In file included from build.c:25:0:
>>>> ../util.h:246:70: error: array type has incomplete element type
>>>>     int get_reserved_device_memory_map(struct xen_reserved_device_memory
>>>> entries[],
>>>>                                                                          ^
>>>> make[8]: *** [build.o] Error 1
>>>
>>> So just forward declare the structure ahead of the function
>>> declaration.
>>
>> tools/firmware/hvmloader/pci.c:28:#include <xen/memory.h>
>> tools/firmware/hvmloader/ovmf.c:36:#include <xen/memory.h>
>>
>> So any reason I can't do such a same thing?
>
> You can, but it's undesirable. You're wanting this in a header, i.e.
> you'll make everyone consuming that header also implicitly depend
> on the new header you would include. We shouldn't pointlessly
> add build dependencies (and we should really try to reduce them
> where possible).

It looks like I can remove that stuff from util.h and just declare them 
'extern' where we really need them.

>
>>>>> misplaced here. While I generally wouldn't recommend doing this, I
>>>>> think in the case here including the hypervisor header that defines
>>>>> them would be okay. Perhaps not via relative path, but via having

So is the following a way of "having the Makefile symlink the 
hypervisor header here"?

--- a/tools/include/Makefile
+++ b/tools/include/Makefile
@@ -17,6 +17,7 @@ xen/.dir:
         ln -sf ../xen-sys/$(XEN_OS) xen/sys
         ln -sf $(addprefix $(XEN_ROOT)/xen/include/xen/,libelf.h elfstructs.h) xen/libelf/
         ln -s ../xen-foreign xen/foreign
+       ln -sf $(XEN_ROOT)/xen/include/xen/errno.h xen
         touch $@

  .PHONY: install

Then we just need to include this in util.c:

--- a/tools/firmware/hvmloader/util.c
+++ b/tools/firmware/hvmloader/util.c
@@ -26,6 +26,7 @@
  #include <xen/xen.h>
  #include <xen/memory.h>
  #include <xen/sched.h>
+#include <xen/errno.h>

  void wrmsr(uint32_t idx, uint64_t v)
  {

Thanks
Tiejun

>>>>
>>>> Seems we just need to include this,
>>>>
>>>> #include <errno.h>
>>>
>>> You shouldn't include system headers here - what if the build system's
>>> -E... values differ from Xen's? Please remember that what you're making
>>
>> tools/firmware/hvmloader/xenbus.c:30:#include <errno.h>
>
> This is a completely different case: For one, no-one really looks at
> the error codes generated here. And even if someone would, these
> would be error value purely internal to hvmloader. Whereas in your
> case you want to interpret a value you get back from the hypervisor.
>
>> And why will Xen define this different?
>
> Why would Linux, *BSD, Solaris, and whatever else OS usable as a
> build host for building Xen all be required to use exactly the same
> -E... definitions when already Linux has variations for some of them
> depending on host architecture (i.e. when doing cross builds you'd
> risk running into problems even on Linux)?
>
> Once again - please always keep in mind that you're modifying
> hypervisor code, not some simple user mode application.
>
> Jan
>
>
>


* Re: [v7][RFC][PATCH 05/13] hvmloader/mmio: reconcile guest mmio with reserved device memory
  2014-10-28  9:56           ` Jan Beulich
@ 2014-10-29  7:03             ` Chen, Tiejun
  2014-10-29  9:08               ` Jan Beulich
  0 siblings, 1 reply; 180+ messages in thread
From: Chen, Tiejun @ 2014-10-29  7:03 UTC (permalink / raw)
  To: Jan Beulich; +Cc: yang.z.zhang, kevin.tian, tim, xen-devel

On 2014/10/28 17:56, Jan Beulich wrote:
>>>> On 28.10.14 at 08:11, <tiejun.chen@intel.com> wrote:
>> On 2014/10/27 17:56, Jan Beulich wrote:
>>>>>> On 27.10.14 at 08:12, <tiejun.chen@intel.com> wrote:
>>>> On 2014/10/24 22:42, Jan Beulich wrote:
>>>>>>>> On 24.10.14 at 09:34, <tiejun.chen@intel.com> wrote:
>>>>>> --- a/tools/firmware/hvmloader/pci.c
>>>>>> +++ b/tools/firmware/hvmloader/pci.c
>>>>>> @@ -37,6 +37,44 @@ uint64_t pci_hi_mem_start = 0, pci_hi_mem_end = 0;
>>>>>>     enum virtual_vga virtual_vga = VGA_none;
>>>>>>     unsigned long igd_opregion_pgbase = 0;
>>>>>>
>>>>>> +unsigned int need_skip_rmrr = 0;
>>>>>
>>>>> Static (and without initializer)?
>>>>
>>>> static unsigned int need_skip_rmrr;
>>>
>>> Please stop echoing back what was requested.
>>
>> I try to fix inline to make sure I'm addressing all comments inline
>> properly. If you think this is correct please just ignore that.
>
> But it still takes people reading your reply time to read that. Please
> ask back only when you think an earlier reply was ambiguous.
>
>>>>>> +    for ( i = 0; i < nr_entries; i++ )
>>>>>> +    {
>>>>>> +        rdm_start = rdm_map[i].start_pfn << PAGE_SHIFT;
>>>>>> +        rdm_end = rdm_start + (rdm_map[i].nr_pages << PAGE_SHIFT);
>>>>>
>>>>> I'm pretty certain I pointed out before that you can't simply shift
>>>>> these fields - you risk losing significant bits.
>>>>
>>>> I tried to go back looking into something but just found you were saying
>>>> I shouldn't use PAGE_SHIFT and PAGE_SIZE at the same time. If I'm still
>>>> missing could you show me what you expect?
>>>
>>> Shifting a 32-bit quantity left still yields a 32-bit quantity, no matter
>>> whether the result is then stored in a 64-bit variable. You need to
>>> up-cast the left side of the shift first.
>>
>> Do you mean this?
>>
>> rdm_start = (uint64_t)rdm_map[j].start_pfn << PAGE_SHIFT;
>> rdm_end = rdm_start + ((uint64_t)rdm_map[j].nr_pages << PAGE_SHIFT);
>
> Yes. Finally.
>
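
A minimal sketch of the widening point above, assuming a 32-bit unsigned
int and a PAGE_SHIFT of 12. The helper names are hypothetical and not part
of hvmloader:

```c
#include <stdint.h>

#define PAGE_SHIFT 12   /* assumed 4K pages */

/* Hypothetical helper: the shift is evaluated in 32 bits, so any bits
 * pushed past bit 31 are lost before the result is widened. */
static uint64_t pfn_to_addr_truncating(uint32_t pfn)
{
    return pfn << PAGE_SHIFT;
}

/* Hypothetical helper: widen the left operand first, then shift, so the
 * full 64-bit result is kept. */
static uint64_t pfn_to_addr(uint32_t pfn)
{
    return (uint64_t)pfn << PAGE_SHIFT;
}
```

For frame numbers below 0x100000 the two agree; from 0x100000 upwards only
the second one keeps the high bits.
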
>>>>>> @@ -58,7 +96,9 @@ void pci_setup(void)
>>>>>>             uint32_t bar_reg;
>>>>>>             uint64_t bar_sz;
>>>>>>         } *bars = (struct bars *)scratch_start;
>>>>>> -    unsigned int i, nr_bars = 0;
>>>>>> +    unsigned int i, j, nr_bars = 0;
>>>>>> +    int nr_entries = 0;
>>>>>
>>>>> And another pointless initializer. Plus as a count of something this
>>>>
>>>> int nr_rdm_entries;
>>>>
>>>>> surely wants to be "unsigned int". Also I guess the variable name
>>>>
>>>> nr_rdm_entries should be literally unsigned int but this value always be
>>>> set from  hvm_get_reserved_device_memory_map(),
>>>>
>>>> nr_rdm_entries = hvm_get_reserved_device_memory_map()
>>>>
>>>> I hope that return value can be negative value in some failed case
>>>
>>> If only you checked for these negative values...
>>
>> Maybe I can simplify the handling of these failure cases within
>> hvm_get_reserved_device_memory_map() like,
>> 	if ( rc )
>> 		return 0;
>>
>> Because actually we don't need any negative return value again. So '0'
>> is always fine. So here,
>>
>> unsigned long nr_rdm_entries = hvm_get_reserved_device_memory_map();
>
> Except that "unsigned int" would seem more consistent.

Fixed.

>
>>>> Additionally, actually there are some original codes just following my
>>>> codes:
>>>>
>>>>            if ( need_skip_rmrr )
>>>>            {
>>>> 		...
>>>>            }
>>>>
>>>> 	base += bar_sz;
>>>>
>>>>            if ( (base < resource->base) || (base > resource->max) )
>>>>            {
>>>>                printf("pci dev %02x:%x bar %02x size "PRIllx": no space for "
>>>>                       "resource!\n", devfn>>3, devfn&7, bar_reg,
>>>>                       PRIllx_arg(bar_sz));
>>>>                continue;
>>>>            }
>>>>
>>>> This can guarantee we don't overwhelm the previous mmio range.
>>>
>>> Resulting in the BAR not getting a value assigned afaict. Certainly
>>> not what we want as a side effect of your changes.
>>
>> I don't understand what a side effect is. I just to try to make sure BAR
>> space skip any conflict range but they are still in these resource ranges.
>
> A side effect is an effect you don't primarily intend with your change
> (or more generally, with any particular operation). In the case here,
> a BAR that previously got a value assigned may not anymore with
> your change in place. An acceptable effect of your change would be
> if the value it gets assigned is now different, but not assigning a value
> at all is not acceptable.

As I understand it, that value just needs to be aligned to the BAR size; 
then any range should be fine. Here I think it's not necessary to consider 
any space restriction, e.g. that some devices may only access below 4G.
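
One way to keep every BAR assigned while still avoiding reserved ranges is
to skip past a conflicting range and re-align, rather than giving up. A
rough sketch under stated assumptions: struct range, place_bar() and the
limit parameter are hypothetical and are not hvmloader's pci_setup() logic,
BAR sizes are taken to be powers of two, and 0 is used as a "no space"
marker:

```c
#include <stdint.h>

struct range {              /* hypothetical: half-open [start, end) */
    uint64_t start, end;
};

/*
 * Return the lowest size-aligned address >= base such that
 * [addr, addr + size) overlaps none of the nr reserved ranges and does
 * not extend beyond limit. Return 0 if no such address exists.
 */
static uint64_t place_bar(uint64_t base, uint64_t size, uint64_t limit,
                          const struct range *rdm, unsigned int nr)
{
    unsigned int i;

    for ( ; ; )
    {
        /* Round up to the BAR's natural alignment. */
        base = (base + size - 1) & ~(size - 1);
        if ( base + size > limit )
            return 0;               /* genuinely no space left */

        for ( i = 0; i < nr; i++ )
            if ( base < rdm[i].end && base + size > rdm[i].start )
                break;

        if ( i == nr )
            return base;            /* no conflict: use this address */

        base = rdm[i].end;          /* skip the conflicting range, retry */
    }
}
```

With the two RMRR regions quoted in this thread, a 64K BAR requested at
0xab800000 would land at 0xab820000 instead of being left unassigned.
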

>
>>>>> and bar_data_upper will likely end up being garbage.
>>>>>
>>>>> Did you actually _test_ this code?
>>>>
>>>> Actually in my real case those RMRR ranges are always below MMIO.
>>>
>>> Below whose MMIO? The host's or the guest's? In the latter case,
>>> just (in order to test your code) increase the range reserved for
>>> MMIO enough to cover the RMRR range.
>>
>> In my platform,
>>
>> RMRR region: base_addr ab80a000 end_address ab81dfff
>> RMRR region: base_addr ad000000 end_address af7fffff
>>
>> So I guess you hope I change this
>>
>> #define PCI_MEM_START       0xf0000000
>>
>> to
>>
>> #define PCI_MEM_START       0xa0000000
>>
>> right?
>
> Almost. You'd have to use 0x80000000, or other logic would break.
>
>> But what test you want to see? Just boot?

I don't see any serious error during boot.
>
> Yes, guest boot, but including proper inspection that what your
> code does is really how it should be.
>

Okay, I will go into details.

Thanks
Tiejun

> Jan
>


* Re: [v7][RFC][PATCH 06/13] hvmloader/ram: check if guest memory is out of reserved device memory maps
  2014-10-28 10:06           ` Jan Beulich
@ 2014-10-29  7:43             ` Chen, Tiejun
  2014-10-29  9:15               ` Jan Beulich
  0 siblings, 1 reply; 180+ messages in thread
From: Chen, Tiejun @ 2014-10-29  7:43 UTC (permalink / raw)
  To: Jan Beulich; +Cc: yang.z.zhang, kevin.tian, tim, xen-devel

On 2014/10/28 18:06, Jan Beulich wrote:
>>>> On 28.10.14 at 08:47, <tiejun.chen@intel.com> wrote:
>> On 2014/10/27 18:17, Jan Beulich wrote:
>>>>>> On 27.10.14 at 09:09, <tiejun.chen@intel.com> wrote:
>>>> On 2014/10/24 22:56, Jan Beulich wrote:
>>> _no matter_ what RMRRs a physical host has, it should not prevent
>>> the creation of guests (the worst that may result is that passing
>>> through certain devices doesn't work anymore, and even then the
>>> operator needs to be given a way of circumventing this if (s)he
>>> knows that the device won't access the range post-boot, or if it's
>>> being deemed acceptable for it to do so).
>>
>> As we know just legacy USB and GFX need these RMRR ranges.
>
> This is specified where?

In the VT-d specification, I just see,

"The RMRR regions are expected to be used for legacy usages (such as 
USB, UMA Graphics, etc.) requiring reserved memory. Platform designers 
should avoid or limit use of reserved memory regions since these require 
system software to create holes in the DMA virtual address range 
available to system software and its driver."

>
>> Especially, I
>> believe just USB need << 1M space, so it may be possible to be placed
>> below 1M. But I think we can ask BIOS to reallocate them upwards like my
>> real platform,
>>
>> RMRR region: base_addr ab80a000 end_address ab81dfff
>>
>> I don't know what platform you're using, maybe its a legacy machine?
>
> A Westmere one.
>
>> But
>> anyway it should be feasible to update BIOS. And even we can ask BIOS do
>> this as a normal rule in the future.
>>
>> For GFX, oftentimes it need dozens of MB,
>>
>> RMRR region: base_addr ad000000 end_address af7fffff
>>
>> So it shouldn't be overlapped with <1M.
>
> These "I believe" and "shouldn't" are a real problem here: Please
> make claims only based on the specification, not on observations on
> a particular system. In the real world, you have to be prepared for
> implementations of a specification to be taking more liberties than
> allowed for; you should never assume people don't even make use
> of the full scope a specification provides for. I know I'm repeating
> myself, but again - remember you're changing the hypervisor here
> (and view hvmloader as an extension of the hypervisor inside the
> guest).

RMRR really is very troublesome.

As far as I know, the legacy usage of USB just covers PS/2 emulation. And 
as you see, these addresses differ between platforms, which means they're 
not restricted to any specific location. And GFX needs more space, so it 
can't be placed under 1M.

So maybe I can drop patch #12, xen/vtd: re-enable USB device assignment, 
to leave USB out of our scope. Or, as a small improvement, check whether 
its own range is below 1M.

>
>>>> I know this is ugly but as you know there's no any rule we can make good
>>>> use of this case. RMRR can start anywhere so We have to assume any
>>>> scenarios,
>>>>
>>>> 1. Just amid those remaining e820 entries.
>>>> 2. Already at the end.
>>>> 3. If coincide with one RAM range.
>>>> 4. If we're just aligned with start of one RAM range.
>>>> 5. If we're just aligned with end of one RAM range.
>>>> 6. If we're just in of one RAM range.
>>>> 7. If we're going last RAM:Hole range.
>>>>
>>>> So if you think we're handling correctly, maybe we can continue
>>>> optimizing this way once we have a better idea.
>>>
>>> I understand that there are various cases to be considered, but
>>> that's no different elsewhere. For example, look at
>>> xen/arch/x86/e820.c:e820_change_range_type() which gets
>>
>> I don't think this circumstance is same as our requirement.
>>
>> Here we are trying to insert different multiple entries that they have
>> different range.
>
> Inserting multiple entries can always be done by inserting one
> entry at a time. If that yields better (easier to understand and

Inserting means you have to split an existing entry. It may even overlap 
two entries at the same time, and these two entries may not be 
contiguous.

> maintain) code, and if the code path isn't a hot one, that should
> be the route to go.

I promise I can take a further look but it may be difficult for me.
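
As a rough illustration of the one-entry-at-a-time approach, the sketch
below carves a single reserved range out of the RAM entries, splitting an
entry when the range falls strictly inside it. The struct layout and
e820_reserve() are assumptions made for the example and are not the
hvmloader or xen/arch/x86/e820.c code. It assumes the existing RAM entries
do not overlap each other, and it leaves the table unsorted, possibly with
zero-length RAM entries:

```c
#include <stdint.h>

#define E820_RAM      1
#define E820_RESERVED 2

struct e820entry {
    uint64_t addr, size;
    uint32_t type;
};

/*
 * Remove [start, end) from every RAM entry it overlaps and append it as
 * one RESERVED entry. Returns the new entry count, or 0 (table untouched)
 * if the range is empty or the table cannot hold two more entries.
 */
static unsigned int e820_reserve(struct e820entry *e, unsigned int nr,
                                 unsigned int cap,
                                 uint64_t start, uint64_t end)
{
    unsigned int i, n = nr;

    /* Worst case: one split (extra RAM tail) plus the reserved entry. */
    if ( start >= end || nr + 2 > cap )
        return 0;

    for ( i = 0; i < nr; i++ )
    {
        uint64_t s = e[i].addr, t = s + e[i].size;

        if ( e[i].type != E820_RAM || end <= s || start >= t )
            continue;                   /* not RAM, or no overlap */

        if ( start > s && end < t )
        {
            /* Strictly inside: keep the head, add the tail as new RAM. */
            e[n].addr = end;
            e[n].size = t - end;
            e[n].type = E820_RAM;
            n++;
            e[i].size = start - s;
        }
        else if ( start > s )
            e[i].size = start - s;      /* range covers the entry's tail */
        else if ( end < t )
        {
            e[i].addr = end;            /* range covers the entry's head */
            e[i].size = t - end;
        }
        else
            e[i].size = 0;              /* entry fully covered */
    }

    e[n].addr = start;
    e[n].size = end - start;
    e[n].type = E820_RESERVED;
    return n + 1;
}
```

Calling it once per reserved range handles a range that touches two RAM
entries without any special casing, since each entry is trimmed on its own.
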

>
>>>> With my patch:
>>>>
>>>> (d2)  f0000-fffff: Main BIOS
>>>> (d2) E820 table:
>>>> (d2)  [00]: 00000000:00000000 - 00000000:0009e000: RAM
>>>> (d2)  [01]: 00000000:0009e000 - 00000000:000a0000: RESERVED
>>>> (d2)  HOLE: 00000000:000a0000 - 00000000:000e0000
>>>> (d2)  [02]: 00000000:000e0000 - 00000000:00100000: RESERVED
>>>> (d2)  [03]: 00000000:00100000 - 00000000:ab80a000: RAM
>>>> (d2)  [04]: 00000000:ab80a000 - 00000000:ab81e000: RESERVED
>>>> (d2)  [05]: 00000000:ab81e000 - 00000000:ad000000: RAM
>>>> (d2)  [06]: 00000000:ad000000 - 00000000:af800000: RESERVED
>>>
>>> And this already answers what I asked above: You shouldn't be blindly
>>> hiding 40Mb from the guest.
>>
>> If we don't reserve these RMRR ranges, so guest may create 1:1 mapping.
>> Then it will affect a device usage in other VM, or a device usage may
>> corrupt these ranges in other VM.
>>
>> Yes, we really need a policy to do this. So please tell me what you expect.
>
> In the tool stack, don't even populate these holes with RAM. This
> will then lead to RAM getting populated further up at the upper end.

Wouldn't RAM still be populated via guest_physmap_add_entry()? If so, 
that is where we already mark them as p2m_access_n.

Thanks
Tiejun
>
> Jan
>


* Re: [v7][RFC][PATCH 08/13] xen/x86/p2m: set p2m_access_n for reserved device memory mapping
  2014-10-28 10:12           ` Jan Beulich
@ 2014-10-29  8:20             ` Chen, Tiejun
  2014-10-29  9:20               ` Jan Beulich
  0 siblings, 1 reply; 180+ messages in thread
From: Chen, Tiejun @ 2014-10-29  8:20 UTC (permalink / raw)
  To: Jan Beulich; +Cc: yang.z.zhang, kevin.tian, tim, xen-devel

On 2014/10/28 18:12, Jan Beulich wrote:
>>>> On 28.10.14 at 09:26, <tiejun.chen@intel.com> wrote:
>> On 2014/10/27 18:33, Jan Beulich wrote:
>>>>>> On 27.10.14 at 10:05, <tiejun.chen@intel.com> wrote:
>>>> On 2014/10/24 23:11, Jan Beulich wrote:
>>>>>>>> On 24.10.14 at 09:34, <tiejun.chen@intel.com> wrote:
>>>>>> +            {
>>>>>> +                /*
>>>>>> +                 * Just set p2m_access_n in case of shared-ept
>>>>>> +                 * or non-shared ept but 1:1 mapping.
>>>>>> +                 */
>>>>>> +                if ( iommu_use_hap_pt(d) ||
>>>>>> +                     (!iommu_use_hap_pt(d) && mfn == gfn) )
>>>>>
>>>>> How would, other than by chance, mfn equal gfn here? Also the
>>>>> double use of iommu_use_hap_pt(d) is pointless here.
>>>>
>>>> There are two scenarios we should concern:
>>>>
>>>> #1 in case of shared-ept.
>>>>
>>>> We always need to check so iommu_use_hap_pt(d) is good.
>>>>
>>>> #2 in case of non-sharepd-ept
>>>>
>>>> If mfn != gfn I think guest don't access RMRR range, so its allowed.
>>>
>>> And what if subsequently a device needing a 1:1 mapping at this GFN
>>> gets assigned? (I simply don't see why shared vs non-shared would
>>> matter here.)
>>
>> In case of non-shared ept we just create VT-d table, if we assign a
>> device with 1:1 RMRR mapping. So as long as mfn != gfn, its not
>> necessary to set p2m_access_n.
>
> I think it was mentioned before (and not just by me) that it's at least
> risky to allow the two page tables to get out of sync wrt the
> translations they do (as opposed to permissions they set).

Okay, I will remove all the condition checks here.

>
>>>>>> +                {
>>>>>> +                    rc = p2m_set_entry(p2m, gfn, _mfn(mfn), page_order, t,
>>>>>> +                                       p2m_access_n);
>>>>>> +                    if ( rc )
>>>>>> +                        gdprintk(XENLOG_WARNING, "set rdm p2m failed: (%#lx)\n",
>>>>>> +                                 gfn);
>>>>>
>>>>> Such messages are (due to acting on a foreign domain) relatively
>>>>> useless without also logging the domain that is affected. Conversely,
>>>>> logging the current domain and vCPU (due to using gdprintk()) is
>>>>> rather pointless. Also please drop either the colon or the
>>>>> parentheses in the message.
>>>>
>>>> Can P2M_DEBUG work here?
>>>>
>>>> P2M_DEBUG("set rdm p2m failed: %#lx\n", gfn);
>>>
>>> I don't think this would magically add the missing information. Plus it
>>
>> Sorry, is it okay?
>>
>> gdprintk(XENLOG_WARNING, "Domain %hu set rdm p2m failed: %#lx\n",
>>                                    d->domain_id, gfn);
>
> Almost. Use printk() instead of gdprintk(), and %d instead of %hu.
>
>>> would limit output to the !NDEBUG case, putting the practical
>>> usefulness of this under question even more.
>>>
>>> But anyway, looking at the existing code again, I think you'd be better
>>> off falling through to the p2m_set_entry() that's already there, just
>>
>> Are you saying I should do this in p2m_set_entry()?
>
> No. I said that I'd prefer you to use the _existing call to_
> p2m_set_entry().
>
>>> altering the access permission value you pass. Less code, better
>>
>> But here, guest_physmap_add_entry() just initiate to set p2m_access_n.
>> We will alter the access permission until if we really assign device to
>> create a 1:1 mapping.
>
> And I didn't question that. All I said is to use the existing call by simply
> replacing the unconditional use of the default access value with one
> conditionally set to p2m_access_n.

Okay,

@@ -686,8 +686,19 @@ guest_physmap_add_entry(struct domain *d, unsigned long gfn,
      /* Now, actually do the two-way mapping */
      if ( mfn_valid(_mfn(mfn)) )
      {
-        rc = p2m_set_entry(p2m, gfn, _mfn(mfn), page_order, t,
-                           p2m->default_access);
+        rc = 0;
+        if ( !is_hardware_domain(d) )
+        {
+            rc = iommu_get_reserved_device_memory(p2m_check_reserved_device_memory,
+                                                  &gfn);
+            if ( rc < 0 )
+                printk("Domain %d can't check reserved device memory.\n",
+                       d->domain_id);
+        }
+
+        /* We need to set reserved device memory as p2m_access_n. */
+        a = (rc == 1) ? p2m_access_n : p2m->default_access;
+        rc = p2m_set_entry(p2m, gfn, _mfn(mfn), page_order, t, a);
          if ( rc )
              goto out; /* Failed to update p2m, bail without updating m2p. */



Thanks
Tiejun

>
> Jan
>
>


* Re: [v7][RFC][PATCH 01/13] xen: RMRR fix
  2014-10-29  0:48           ` Chen, Tiejun
  2014-10-29  2:51             ` Chen, Tiejun
@ 2014-10-29  8:44             ` Jan Beulich
  2014-10-30  2:51               ` Chen, Tiejun
  1 sibling, 1 reply; 180+ messages in thread
From: Jan Beulich @ 2014-10-29  8:44 UTC (permalink / raw)
  To: Tiejun Chen; +Cc: yang.z.zhang, kevin.tian, tim, xen-devel

>>> On 29.10.14 at 01:48, <tiejun.chen@intel.com> wrote:
> On 2014/10/28 17:34, Jan Beulich wrote:
>>>>> On 28.10.14 at 09:36, <tiejun.chen@intel.com> wrote:
>>> On 2014/10/27 17:41, Jan Beulich wrote:
>>>>>>> On 27.10.14 at 03:00, <tiejun.chen@intel.com> wrote:
>>>>> On 2014/10/24 18:52, Jan Beulich wrote:
>>>>>>>>> On 24.10.14 at 09:34, <tiejun.chen@intel.com> wrote:
>>>>>>> 5. Before we take real device assignment, any access to RMRR may issue
>>>>>>> ept_handle_violation because of p2m_access_n. Then we just call
>>>>>>> update_guest_eip() to return.
>>>>>>
>>>>>> I.e. ignore such accesses? Why?
>>>>>
>>>>> Yeah. This illegal access isn't allowed but its enough to ignore that
>>>>> without further protection or punishment.
>>>>>
>>>>> Or what procedure should be concerned here based on your opinion?
>>>>
>>>> If the access is illegal, inject a fault to the guest or kill it, unless you
>>>
>>> Kill means we will crash domain? Seems its radical, isn't it? So I guess
>>> its better to inject a fault.
>>>
>>> But what kind of fault you prefer currently?
>>
>> #GP (but this being arbitrary is why simply killing the guest is another
>> option to consider).
> 
> In this case I think we just need to refer to native behavior. So I feel 
> GP may be a little bit reasonable.

But as said before - prior to switching to raising #GP, clarify for
yourself what behavior you want and why. If you properly explain
(in the patch description) why ignoring the accesses is better (read:
closer to native behavior in comparable cases), then this is fine with
me. I.e. I'm not so much questioning the solution, but the lack of
reasoning why it got chosen.

Jan


* Re: [v7][RFC][PATCH 01/13] xen: RMRR fix
  2014-10-29  2:51             ` Chen, Tiejun
@ 2014-10-29  8:45               ` Jan Beulich
  2014-10-30  8:21                 ` Chen, Tiejun
  0 siblings, 1 reply; 180+ messages in thread
From: Jan Beulich @ 2014-10-29  8:45 UTC (permalink / raw)
  To: Tiejun Chen; +Cc: yang.z.zhang, kevin.tian, tim, xen-devel

>>> On 29.10.14 at 03:51, <tiejun.chen@intel.com> wrote:
> On 2014/10/29 8:48, Chen, Tiejun wrote:
>> On 2014/10/28 17:34, Jan Beulich wrote:
>>>>>> On 28.10.14 at 09:36, <tiejun.chen@intel.com> wrote:
>>>> On 2014/10/27 17:41, Jan Beulich wrote:
>>>>>>>> On 27.10.14 at 03:00, <tiejun.chen@intel.com> wrote:
>>>>>> On 2014/10/24 18:52, Jan Beulich wrote:
>>>>>>>>>> On 24.10.14 at 09:34, <tiejun.chen@intel.com> wrote:
>>>>>>>> Now in our case we add a rule:
>>>>>>>>     - if p2m_access_n is set we also set this mapping.
>>>>>>>
>>>>>>> Does that not conflict with eventual use mem-access makes of this
>>>>>>> type?
>>>>
>>>> Do you mean what will happen after we reset these ranges as
>>>> p2m_access_rw? We already reserve these ranges guest shouldn't access
>>>> these range actually. And a guest still maliciously access them, that
>>>> device may not work well.
>>>
>>> mem-access is functionality used by a control domain, not the domain
>>
>> I really don't know this mechanism so thanks for your good coverage.
>>
>>> itself. You need to make sure that neither your use of p2m_access_n
>>> can confuse the mem-access code, nor that their use can confuse you.
>>
>> Absolutely, but I think I need to know more about mem-access firstly.
>>
> 
> I think this reserved device memory shouldn't be poked, since any write 
> may affect the device. And what if a device with an RMRR isn't assigned to 
> the current domain? Reads should not be allowed either, since they may 
> still cause unexpected behavior in the device.
> 
> So if mem_access is trying to access those RMRR ranges, could we let 
> mem_access exit directly with some message? I mean we can check whether 
> we're accessing those RMRR ranges in the XENMEM_access_op_set_access case.

Sounds reasonable at first glance.

Jan


* Re: [v7][RFC][PATCH 01/13] introduce XENMEM_reserved_device_memory_map
  2014-10-29  0:40         ` Chen, Tiejun
@ 2014-10-29  8:53           ` Jan Beulich
  2014-10-30  2:53             ` Chen, Tiejun
  0 siblings, 1 reply; 180+ messages in thread
From: Jan Beulich @ 2014-10-29  8:53 UTC (permalink / raw)
  To: Tiejun Chen; +Cc: kevin.tian, Julien Grall, tim, xen-devel, yang.z.zhang

>>> On 29.10.14 at 01:40, <tiejun.chen@intel.com> wrote:
> On 2014/10/28 18:36, Jan Beulich wrote:
>>>>> On 28.10.14 at 03:35, <tiejun.chen@intel.com> wrote:
>>> On 2014/10/27 21:35, Julien Grall wrote:
>>>> Hi,
>>>>
>>>> On 10/24/2014 08:34 AM, Tiejun Chen wrote:
>>>>> diff --git a/xen/common/memory.c b/xen/common/memory.c
>>>>> index cc36e39..51a32a8 100644
>>>>> --- a/xen/common/memory.c
>>>>> +++ b/xen/common/memory.c
>>>>> @@ -692,6 +692,32 @@ out:
>>>>>        return rc;
>>>>>    }
>>>>>
>>>>> +struct get_reserved_device_memory {
>>>>> +    struct xen_mem_reserved_device_memory_map map;
>>>>> +    unsigned int used_entries;
>>>>> +};
>>>>> +
>>>>> +static int get_reserved_device_memory(xen_pfn_t start,
>>>>> +                                      xen_ulong_t nr, void *ctxt)
>>>>
>>>> This function is only used when HAS_PASSTHROUGH is defined. You have to
>>>> protected by an #ifdef HAS_PASSTHROUGH.
>>>>
>>>
>>> I guess you mean we need to do this,
>>>
>>> diff --git a/xen/common/memory.c b/xen/common/memory.c
>>> index 1449c10..2177c56 100644
>>> --- a/xen/common/memory.c
>>> +++ b/xen/common/memory.c
>>> @@ -692,6 +692,7 @@ out:
>>>        return rc;
>>>    }
>>>
>>> +#ifdef HAS_PASSTHROUGH
>>>    struct get_reserved_device_memory {
>>>        struct xen_reserved_device_memory_map map;
>>>        unsigned int used_entries;
>>> @@ -717,6 +718,7 @@ static int get_reserved_device_memory(xen_pfn_t start,
>>>
>>>        return 0;
>>>    }
>>> +#endif
>>>
>>>    long do_memory_op(unsigned long cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
>>>    {
>>>
>>> Jan,
>>>
>>> With this above change, is the following working for you?
>>
>> I already fixed this in my version (which I view to be the canonical
>> one).
>>
> 
> Are you referring to that attached patch? Are you sure? Here I pick some 
> code fragments from your latest patch,

No, that fixup was in response to Julien's comment, which I think
was given after I had forwarded you my version. In any event,
what's going to get applied is what I have here unless you find a
need to do substantial changes to it.

Jan


* Re: [v7][RFC][PATCH 04/13] hvmloader/util: get reserved device memory maps
  2014-10-29  6:54             ` Chen, Tiejun
@ 2014-10-29  9:05               ` Jan Beulich
  2014-10-30  5:55                 ` Chen, Tiejun
  0 siblings, 1 reply; 180+ messages in thread
From: Jan Beulich @ 2014-10-29  9:05 UTC (permalink / raw)
  To: Tiejun Chen; +Cc: yang.z.zhang, kevin.tian, tim, xen-devel

>>> On 29.10.14 at 07:54, <tiejun.chen@intel.com> wrote:
> On 2014/10/28 17:48, Jan Beulich wrote:
>>>>> On 28.10.14 at 06:21, <tiejun.chen@intel.com> wrote:
>>> On 2014/10/27 17:45, Jan Beulich wrote:
>>>>>>> On 27.10.14 at 04:12, <tiejun.chen@intel.com> wrote:
>>>>> On 2014/10/24 22:22, Jan Beulich wrote:
>>>>>>>>> On 24.10.14 at 09:34, <tiejun.chen@intel.com> wrote:
>>>>>>> --- a/tools/firmware/hvmloader/util.h
>>>>>>> +++ b/tools/firmware/hvmloader/util.h
>>>>>>> @@ -241,6 +241,12 @@ int build_e820_table(struct e820entry *e820,
>>>>>>>                          unsigned int bios_image_base);
>>>>>>>     void dump_e820_table(struct e820entry *e820, unsigned int nr);
>>>>>>>
>>>>>>> +#include <xen/memory.h>
>>>>>>> +#define ENOBUFS     105 /* No buffer space available */
>>>>>>
>>>>>> This is a joke I hope? The #include belongs at the top (albeit afaict
>>>>>> you don't really need it here), and the #define is completely
>>>>>
>>>>> If without this line, #include <xen/memory.h>,
>>>>>
>>>>> In file included from build.c:25:0:
>>>>> ../util.h:246:70: error: array type has incomplete element type
>>>>>     int get_reserved_device_memory_map(struct xen_reserved_device_memory
>>>>> entries[],
>>>>>                                                                          ^
>>>>> make[8]: *** [build.o] Error 1
>>>>
>>>> So just forward declare the structure ahead of the function
>>>> declaration.
>>>
>>> tools/firmware/hvmloader/pci.c:28:#include <xen/memory.h>
>>> tools/firmware/hvmloader/ovmf.c:36:#include <xen/memory.h>
>>>
>>> So any reason I can't do such a same thing?
>>
>> You can, but it's undesirable. You're wanting this in a header, i.e.
>> you'll make everyone consuming that header also implicitly depend
>> on the new header you would include. We shouldn't pointlessly
>> add build dependencies (and we should really try to reduce them
>> where possible).
> 
> It looks like I can remove that stuff from util.h and just declare them 
> 'extern' where we really need them.

Please stop thinking this way. Declarations for things defined in .c
files are to be present in headers, and the defining .c file has to
include that header (making sure declaration and definition are and
remain in sync). I hate having to again repeat my remark that you
shouldn't forget it's not application code that you're modifying.
Robust and maintainable code are a requirement in the hypervisor
(and, as said it being an extension of it, hvmloader). Which - just
to avoid any misunderstanding - isn't to say that this shouldn't also
apply to application code. It's just that in the hypervisor and kernel
(and certain other core system components) the consequences of
being lax are much more severe.
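
A small sketch of that arrangement, assuming a pointer parameter (an array
parameter with an incomplete element type is what the compiler rejected
earlier in this thread). The struct fields, the nr parameter and the
function body are placeholders for illustration only, not the real
hvmloader code:

```c
/* ---- would live in the shared header (e.g. util.h) ---- */

/* Forward declaration: no extra #include is forced on users of the header. */
struct xen_reserved_device_memory;

/* A pointer to an incomplete type is fine in a prototype. */
int get_reserved_device_memory_map(struct xen_reserved_device_memory *entries,
                                   unsigned int nr);

/* ---- would live in the defining .c file, which includes that header ---- */

/* Stand-in for the definition the public memory header would supply. */
struct xen_reserved_device_memory {
    unsigned long start_pfn;
    unsigned long nr_pages;
};

/*
 * Placeholder body: reports one made-up entry instead of issuing a
 * hypercall. Returns the number of entries written.
 */
int get_reserved_device_memory_map(struct xen_reserved_device_memory *entries,
                                   unsigned int nr)
{
    if ( nr == 0 )
        return 0;

    entries[0].start_pfn = 0xab80a;
    entries[0].nr_pages = 0x14;
    return 1;
}
```

Because the defining file sees both the prototype and the definition, the
compiler reports any mismatch between the two, which a scattered 'extern'
in each user would not give you.
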

>>>>>> misplaced here. While I generally wouldn't recommend doing this, I
>>>>>> think in the case here including the hypervisor header that defines
>>>>>> them would be okay. Perhaps not via relative path, but via having
> 
> So is the following a way of "having the Makefile symlink the 
> hypervisor header here"?
> 
> --- a/tools/include/Makefile
> +++ b/tools/include/Makefile
> @@ -17,6 +17,7 @@ xen/.dir:
>          ln -sf ../xen-sys/$(XEN_OS) xen/sys
>          ln -sf $(addprefix $(XEN_ROOT)/xen/include/xen/,libelf.h elfstructs.h) xen/libelf/
>          ln -s ../xen-foreign xen/foreign
> +       ln -sf $(XEN_ROOT)/xen/include/xen/errno.h xen

Along those lines at least. Consult with the tools maintainers as to
whether this would better go into a special subdirectory (and
perhaps not even under tools/include, but somewhere beneath
tools/firmware/hvmloader/, since it's not meant to be used by
anyone else - as one of the ones de facto looking after hvmloader
irrespective of what ./MAINTAINERS says, I'd strongly recommend
limiting the visibility scope of this header as much as possible).

> Then we just need include this in util.c:
> 
> --- a/tools/firmware/hvmloader/util.c
> +++ b/tools/firmware/hvmloader/util.c
> @@ -26,6 +26,7 @@
>   #include <xen/xen.h>
>   #include <xen/memory.h>
>   #include <xen/sched.h>
> +#include <xen/errno.h>

Exactly.

Jan


* Re: [v7][RFC][PATCH 05/13] hvmloader/mmio: reconcile guest mmio with reserved device memory
  2014-10-29  7:03             ` Chen, Tiejun
@ 2014-10-29  9:08               ` Jan Beulich
  2014-10-30  3:18                 ` Chen, Tiejun
  0 siblings, 1 reply; 180+ messages in thread
From: Jan Beulich @ 2014-10-29  9:08 UTC (permalink / raw)
  To: Tiejun Chen; +Cc: yang.z.zhang, kevin.tian, tim, xen-devel

>>> On 29.10.14 at 08:03, <tiejun.chen@intel.com> wrote:
> On 2014/10/28 17:56, Jan Beulich wrote:
>>>>> On 28.10.14 at 08:11, <tiejun.chen@intel.com> wrote:
>>> On 2014/10/27 17:56, Jan Beulich wrote:
>>>>>>> On 27.10.14 at 08:12, <tiejun.chen@intel.com> wrote:
>>>>> On 2014/10/24 22:42, Jan Beulich wrote:
>>>>>>>>> On 24.10.14 at 09:34, <tiejun.chen@intel.com> wrote:
>>>>> Additionally, actually there are some original codes just following my
>>>>> codes:
>>>>>
>>>>>            if ( need_skip_rmrr )
>>>>>            {
>>>>> 		...
>>>>>            }
>>>>>
>>>>> 	base += bar_sz;
>>>>>
>>>>>            if ( (base < resource->base) || (base > resource->max) )
>>>>>            {
>>>>>                printf("pci dev %02x:%x bar %02x size "PRIllx": no space for "
>>>>>                       "resource!\n", devfn>>3, devfn&7, bar_reg,
>>>>>                       PRIllx_arg(bar_sz));
>>>>>                continue;
>>>>>            }
>>>>>
>>>>> This can guarantee we don't overwhelm the previous mmio range.
>>>>
>>>> Resulting in the BAR not getting a value assigned afaict. Certainly
>>>> not what we want as a side effect of your changes.
>>>
>>> I don't understand what a side effect is. I just to try to make sure BAR
>>> space skip any conflict range but they are still in these resource ranges.
>>
>> A side effect is an effect you don't primarily intend with your change
>> (or more generally, with any particular operation). In the case here,
>> a BAR that previously got a value assigned may not anymore with
>> your change in place. An acceptable effect of your change would be
>> if the value it gets assigned is now different, but not assigning a value
>> at all is not acceptable.
> 
> As I understand that value just need to align with BAR and size. Then 
> any range should be fine. Here I think its not necessary to consider any 
> space restriction, i.e, some device may just access under 4G.

No. The code determining where to put the lower boundary of the
MMIO range doesn't (with your present patch) consider the regions
the actual assignment code now skips. Hence the lower boundary
may not be low enough to accommodate all BARs.
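
To make the sizing problem concrete, here is a minimal stand-alone sketch.
It is not hvmloader code: struct range, place_bar() and mmio_space_end()
are hypothetical names, and BAR sizes are assumed to be powers of two. It
simulates the assignment loop while skipping reserved regions, and returns
where the last BAR ends, which is what the lower-boundary calculation
would have to take into account:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one reserved region. */
struct range {
    uint64_t start;
    uint64_t size;
};

/* Round addr up to a power-of-two alignment. */
static uint64_t align_up(uint64_t addr, uint64_t align)
{
    return (addr + align - 1) & ~(align - 1);
}

/*
 * Lowest size-aligned address >= base at which [addr, addr + size)
 * overlaps none of the reserved regions.
 */
static uint64_t place_bar(uint64_t base, uint64_t size,
                          const struct range *rsv, size_t nr_rsv)
{
    uint64_t addr = align_up(base, size);
    int moved;
    size_t i;

    do {
        moved = 0;
        for ( i = 0; i < nr_rsv; i++ )
        {
            uint64_t rsv_end = rsv[i].start + rsv[i].size;

            if ( addr < rsv_end && rsv[i].start < addr + size )
            {
                /* Conflict: retry just above the reserved region. */
                addr = align_up(rsv_end, size);
                moved = 1;
            }
        }
    } while ( moved );

    return addr;
}

/* Place all BARs upwards from base; return the end of the last one. */
uint64_t mmio_space_end(uint64_t base,
                        const uint64_t *bar_sz, size_t nr_bars,
                        const struct range *rsv, size_t nr_rsv)
{
    size_t i;

    for ( i = 0; i < nr_bars; i++ )
        base = place_bar(base, bar_sz[i], rsv, nr_rsv) + bar_sz[i];

    return base;
}
```

With a 16M BAR and a 4K BAR placed from 0xF0000000 and one reserved
region at 0xF0800000, the end moves from 0xF1001000 to 0xF2001000, so a
boundary computed from the plain sum of BAR sizes would be 16M short.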

Jan


* Re: [v7][RFC][PATCH 06/13] hvmloader/ram: check if guest memory is out of reserved device memory maps
  2014-10-29  7:43             ` Chen, Tiejun
@ 2014-10-29  9:15               ` Jan Beulich
  2014-10-30  3:11                 ` Chen, Tiejun
  0 siblings, 1 reply; 180+ messages in thread
From: Jan Beulich @ 2014-10-29  9:15 UTC (permalink / raw)
  To: Tiejun Chen; +Cc: yang.z.zhang, kevin.tian, tim, xen-devel

>>> On 29.10.14 at 08:43, <tiejun.chen@intel.com> wrote:
> On 2014/10/28 18:06, Jan Beulich wrote:
>>>>> On 28.10.14 at 08:47, <tiejun.chen@intel.com> wrote:
>>> On 2014/10/27 18:17, Jan Beulich wrote:
>>>>>>> On 27.10.14 at 09:09, <tiejun.chen@intel.com> wrote:
>>>>> On 2014/10/24 22:56, Jan Beulich wrote:
>>>> _no matter_ what RMRRs a physical host has, it should not prevent
>>>> the creation of guests (the worst that may result is that passing
>>>> through certain devices doesn't work anymore, and even then the
>>>> operator needs to be given a way of circumventing this if (s)he
>>>> knows that the device won't access the range post-boot, or if it's
>>>> being deemed acceptable for it to do so).
>>>
>>> As we know just legacy USB and GFX need these RMRR ranges.
>>
>> This is specified where?
> 
> In VT-D specification, I just see,
> 
> "The RMRR regions are expected to be used for legacy usages (such as 
> USB, UMA Graphics, etc.) requiring reserved memory. Platform designers 
> should avoid or limit use of reserved memory regions since these require 
> system software to create holes in the DMA virtual address range 
> available to system software and its driver."

Nice that you quote it, but did you also read it properly? There's this
little "etc" following the explicit naming of USB and UMA...

> RMRR really is very troublesome.
> 
> The legacy usage of USB just cover ps2 emulation as I know. And as you 
> see these address are different in different platforms so this mean 
> they're not redistricted somewhere specific. And GFX need more space so 
> its not possible to be placed under 1M.
> 
> So maybe I can drop patch #12, xen/vtd: re-enable USB device assignment, 
> to leave USB out our scope. Or a little improvement is to check if its 
> own range is below 1M.

I think we made clear a number of iterations ago that rather than
aiming for another half baked solution, it should be done right this
time. No excuses. It's bad enough that this half broken code made
it into the tree originally.

>> In the tool stack, don't even populate these holes with RAM. This
>> will then lead to RAM getting populated further up at the upper end.
> 
> Shouldn't populate RAM still with guest_physmap_add_entry()? If yes, we 
> already be there to mark them as p2m_access_n.

Marking them with p2m_access_n is not the same as not populating
the regions in the first place. Again - hiding multiple megabytes of
memory (and who knows if it can't grow into the gigabyte range) is
just not acceptable. Even for just a few pages I wouldn't be really
happy, but could probably be talked into accepting this.

Jan


* Re: [v7][RFC][PATCH 08/13] xen/x86/p2m: set p2m_access_n for reserved device memory mapping
  2014-10-29  8:20             ` Chen, Tiejun
@ 2014-10-29  9:20               ` Jan Beulich
  2014-10-30  7:39                 ` Chen, Tiejun
  0 siblings, 1 reply; 180+ messages in thread
From: Jan Beulich @ 2014-10-29  9:20 UTC (permalink / raw)
  To: Tiejun Chen; +Cc: yang.z.zhang, kevin.tian, tim, xen-devel

>>> On 29.10.14 at 09:20, <tiejun.chen@intel.com> wrote:
> @@ -686,8 +686,19 @@ guest_physmap_add_entry(struct domain *d, unsigned long gfn,
>       /* Now, actually do the two-way mapping */
>       if ( mfn_valid(_mfn(mfn)) )
>       {
> -        rc = p2m_set_entry(p2m, gfn, _mfn(mfn), page_order, t,
> -                           p2m->default_access);
> +        rc = 0;
> +        if ( !is_hardware_domain(d) )
> +        {
> +            rc = iommu_get_reserved_device_memory(p2m_check_reserved_device_memory,
> +                                                  &gfn);
> +            if ( rc < 0 )
> +                printk("Domain %d can't check reserved device memory.\n",
> +                       d->domain_id);
> +        }
> +
> +        /* We need to set reserved device memory as p2m_access_n. */
> +        a =  ( rc == 1 ) ? p2m_access_n : p2m->default_access;
> +        rc = p2m_set_entry(p2m, gfn, _mfn(mfn), page_order, t, a);
>           if ( rc )
>               goto out; /* Failed to update p2m, bail without updating m2p. */

Getting closer. Just set a to p2m->default_access before the if(),
and overwrite it when rc == 1 inside the if(). And properly handle
the error case (just logging a message - which btw lacks a proper
XENLOG_G_* prefix - doesn't seem enough to me).
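
A minimal stand-alone sketch of that control flow, with hypothetical
stand-in names (enum access, choose_access()) rather than the real Xen
types, and with "propagate the error to the caller" assumed as one
possible way of handling the error case:

```c
/* Hypothetical stand-ins for p2m->default_access and p2m_access_n. */
enum access {
    ACCESS_DEFAULT,
    ACCESS_NONE
};

/*
 * rdm_rc is the result of the reserved device memory check:
 * 1 = page is reserved, 0 = not reserved, < 0 = the check failed.
 * Returns 0 on success with *a filled in, or the negative error.
 */
int choose_access(int is_hwdom, int rdm_rc, enum access *a)
{
    /* Default first, so every successful path leaves *a initialized. */
    *a = ACCESS_DEFAULT;

    if ( !is_hwdom )
    {
        if ( rdm_rc < 0 )
            return rdm_rc;      /* don't just log and carry on */

        if ( rdm_rc == 1 )
            *a = ACCESS_NONE;   /* overwrite only for reserved pages */
    }

    return 0;
}
```

The caller would then bail out on a negative return before ever reaching
the p2m update.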

But then again this code may change altogether if you avoid
populating the reserved regions in the first place.

Jan


* Re: [v7][RFC][PATCH 01/13] xen: RMRR fix
  2014-10-29  8:44             ` Jan Beulich
@ 2014-10-30  2:51               ` Chen, Tiejun
  0 siblings, 0 replies; 180+ messages in thread
From: Chen, Tiejun @ 2014-10-30  2:51 UTC (permalink / raw)
  To: Jan Beulich; +Cc: yang.z.zhang, kevin.tian, tim, xen-devel


On 2014/10/29 16:44, Jan Beulich wrote:
>>>> On 29.10.14 at 01:48, <tiejun.chen@intel.com> wrote:
>> On 2014/10/28 17:34, Jan Beulich wrote:
>>>>>> On 28.10.14 at 09:36, <tiejun.chen@intel.com> wrote:
>>>> On 2014/10/27 17:41, Jan Beulich wrote:
>>>>>>>> On 27.10.14 at 03:00, <tiejun.chen@intel.com> wrote:
>>>>>> On 2014/10/24 18:52, Jan Beulich wrote:
>>>>>>>>>> On 24.10.14 at 09:34, <tiejun.chen@intel.com> wrote:
>>>>>>>> 5. Before we take real device assignment, any access to RMRR may issue
>>>>>>>> ept_handle_violation because of p2m_access_n. Then we just call
>>>>>>>> update_guest_eip() to return.
>>>>>>>
>>>>>>> I.e. ignore such accesses? Why?
>>>>>>
>>>>>> Yeah. This illegal access isn't allowed but its enough to ignore that
>>>>>> without further protection or punishment.
>>>>>>
>>>>>> Or what procedure should be concerned here based on your opinion?
>>>>>
>>>>> If the access is illegal, inject a fault to the guest or kill it, unless you
>>>>
>>>> Kill means we will crash domain? Seems its radical, isn't it? So I guess
>>>> its better to inject a fault.
>>>>
>>>> But what kind of fault you prefer currently?
>>>
>>> #GP (but this being arbitrary is why simply killing the guest is another
>>> option to consider).
>>
>> In this case I think we just need to refer to native behavior. So I feel
>> GP may be a little bit reasonable.
>
> But as said before - prior to switching to raising #GP, clarify for
> yourself what behavior you want and why. If you properly explain

Our previous thinking was that we always reserve these ranges, since we 
never allow anything to poke them.

But in theory an untrusted VM can still access them maliciously, right? So 
we also set p2m_access_n in the p2m table so that we can intercept such 
accesses. But we don't want to leak anything or introduce any side 
effect, since other OSes may touch these ranges through careless behavior, 
so I think a lightweight approach is enough. I mean it shouldn't be 
treated the same as those broken pages which cause a domain crash.

So we just need to return with the next eip and let the VM/OS handle 
such a scenario with its own logic.

Thanks
Tiejun

> (in the patch description) why ignoring the accesses is better (read:
> closer to native behavior in comparable cases), then this is fine with
> me. I.e. I'm not so much questioning the solution, but the lack of
> reasoning why it got chosen.
>
> Jan
>
>
>


* Re: [v7][RFC][PATCH 01/13] introduce XENMEM_reserved_device_memory_map
  2014-10-29  8:53           ` Jan Beulich
@ 2014-10-30  2:53             ` Chen, Tiejun
  2014-10-30  9:10               ` Jan Beulich
  0 siblings, 1 reply; 180+ messages in thread
From: Chen, Tiejun @ 2014-10-30  2:53 UTC (permalink / raw)
  To: Jan Beulich; +Cc: kevin.tian, Julien Grall, tim, xen-devel, yang.z.zhang

On 2014/10/29 16:53, Jan Beulich wrote:
>>>> On 29.10.14 at 01:40, <tiejun.chen@intel.com> wrote:
>> On 2014/10/28 18:36, Jan Beulich wrote:
>>>>>> On 28.10.14 at 03:35, <tiejun.chen@intel.com> wrote:
>>>> On 2014/10/27 21:35, Julien Grall wrote:
>>>>> Hi,
>>>>>
>>>>> On 10/24/2014 08:34 AM, Tiejun Chen wrote:
>>>>>> diff --git a/xen/common/memory.c b/xen/common/memory.c
>>>>>> index cc36e39..51a32a8 100644
>>>>>> --- a/xen/common/memory.c
>>>>>> +++ b/xen/common/memory.c
>>>>>> @@ -692,6 +692,32 @@ out:
>>>>>>         return rc;
>>>>>>     }
>>>>>>
>>>>>> +struct get_reserved_device_memory {
>>>>>> +    struct xen_mem_reserved_device_memory_map map;
>>>>>> +    unsigned int used_entries;
>>>>>> +};
>>>>>> +
>>>>>> +static int get_reserved_device_memory(xen_pfn_t start,
>>>>>> +                                      xen_ulong_t nr, void *ctxt)
>>>>>
>>>>> This function is only used when HAS_PASSTHROUGH is defined. You have to
>>>>> protected by an #ifdef HAS_PASSTHROUGH.
>>>>>
>>>>
>>>> I guess you mean we need to do this,
>>>>
>>>> diff --git a/xen/common/memory.c b/xen/common/memory.c
>>>> index 1449c10..2177c56 100644
>>>> --- a/xen/common/memory.c
>>>> +++ b/xen/common/memory.c
>>>> @@ -692,6 +692,7 @@ out:
>>>>         return rc;
>>>>     }
>>>>
>>>> +#ifdef HAS_PASSTHROUGH
>>>>     struct get_reserved_device_memory {
>>>>         struct xen_reserved_device_memory_map map;
>>>>         unsigned int used_entries;
>>>> @@ -717,6 +718,7 @@ static int get_reserved_device_memory(xen_pfn_t start,
>>>>
>>>>         return 0;
>>>>     }
>>>> +#endif
>>>>
>>>>     long do_memory_op(unsigned long cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
>>>>     {
>>>>
>>>> Jan,
>>>>
>>>> With this above change, is the following working for you?
>>>
>>> I already fixed this in my version (which I view to be the canonical
>>> one).
>>>
>>
>> Are you pointing to that attached patch? Are you sure? Here I pick some code
>> fragments from your latest patch,
>
> No, that fixup was in response to Julien's comment, which I think
> was given after I had forwarded you my version. In any event,

So could you send me the latest? I'd like to replace it in my tree.

Thanks
Tiejun

> what's going to get applied is what I have here unless you find a
> need to do substantial changes to it.
>
> Jan
>
>
>


* Re: [v7][RFC][PATCH 06/13] hvmloader/ram: check if guest memory is out of reserved device memory maps
  2014-10-29  9:15               ` Jan Beulich
@ 2014-10-30  3:11                 ` Chen, Tiejun
  2014-10-30  9:20                   ` Jan Beulich
  0 siblings, 1 reply; 180+ messages in thread
From: Chen, Tiejun @ 2014-10-30  3:11 UTC (permalink / raw)
  To: Jan Beulich; +Cc: yang.z.zhang, kevin.tian, tim, xen-devel

On 2014/10/29 17:15, Jan Beulich wrote:
>>>> On 29.10.14 at 08:43, <tiejun.chen@intel.com> wrote:
>> On 2014/10/28 18:06, Jan Beulich wrote:
>>>>>> On 28.10.14 at 08:47, <tiejun.chen@intel.com> wrote:
>>>> On 2014/10/27 18:17, Jan Beulich wrote:
>>>>>>>> On 27.10.14 at 09:09, <tiejun.chen@intel.com> wrote:
>>>>>> On 2014/10/24 22:56, Jan Beulich wrote:
>>>>> _no matter_ what RMRRs a physical host has, it should not prevent
>>>>> the creation of guests (the worst that may result is that passing
>>>>> through certain devices doesn't work anymore, and even then the
>>>>> operator needs to be given a way of circumventing this if (s)he
>>>>> knows that the device won't access the range post-boot, or if it's
>>>>> being deemed acceptable for it to do so).
>>>>
>>>> As we know just legacy USB and GFX need these RMRR ranges.
>>>
>>> This is specified where?
>>
>> In VT-D specification, I just see,
>>
>> "The RMRR regions are expected to be used for legacy usages (such as
>> USB, UMA Graphics, etc.) requiring reserved memory. Platform designers
>> should avoid or limit use of reserved memory regions since these require
>> system software to create holes in the DMA virtual address range
>> available to system software and its driver."
>
> Nice that you quote it, but did you also read it properly? There's this
> little "etc" following the explicit naming of USB and UMA...

Yes. But this already clarifies that RMRR is "used for legacy usages" and 
that designers should "avoid or limit use of reserved memory regions", so 
RMRR should eventually go away. So I mean it may be acceptable to assume 
something based on what we know.

>
>> RMRR really is very troublesome.
>>
>> The legacy usage of USB just cover ps2 emulation as I know. And as you
>> see these address are different in different platforms so this mean
>> they're not redistricted somewhere specific. And GFX need more space so
>> its not possible to be placed under 1M.
>>
>> So maybe I can drop patch #12, xen/vtd: re-enable USB device assignment,
>> to leave USB out our scope. Or a little improvement is to check if its
>> own range is below 1M.
>
> I think we made clear a number of iterations ago that rather than
> aiming for another half baked solution, it should be done right this
> time. No excuses. It's bad enough that this half broken code made
> it into the tree originally.

Okay. But if USB is really overlapping with BIOS...

Furthermore, maybe I can ask some people to look into whether we can move 
RMRR ranges in the hypervisor. If yes, things could be better.

>
>>> In the tool stack, don't even populate these holes with RAM. This
>>> will then lead to RAM getting populated further up at the upper end.
>>
>> Shouldn't populate RAM still with guest_physmap_add_entry()? If yes, we
>> already be there to mark them as p2m_access_n.
>
> Marking them with p2m_access_n is not the same as not populating
> the regions in the first place. Again - hiding multiple megabytes of
> memory (and who knows if it can't grow into the gigabyte range) is
> just not acceptable. Even for just a few pages I wouldn't be really

I don't think so. If you're considering a VM, this case should be the same 
as on native hardware. And in the native case, all RMRR ranges are 
marked as reserved in the e820 table.

Thanks
Tiejun

> happy, but could probably be talked into accepting this.
>
> Jan
>
>
>


* Re: [v7][RFC][PATCH 05/13] hvmloader/mmio: reconcile guest mmio with reserved device memory
  2014-10-29  9:08               ` Jan Beulich
@ 2014-10-30  3:18                 ` Chen, Tiejun
  0 siblings, 0 replies; 180+ messages in thread
From: Chen, Tiejun @ 2014-10-30  3:18 UTC (permalink / raw)
  To: Jan Beulich; +Cc: yang.z.zhang, kevin.tian, tim, xen-devel

On 2014/10/29 17:08, Jan Beulich wrote:
>>>> On 29.10.14 at 08:03, <tiejun.chen@intel.com> wrote:
>> On 2014/10/28 17:56, Jan Beulich wrote:
>>>>>> On 28.10.14 at 08:11, <tiejun.chen@intel.com> wrote:
>>>> On 2014/10/27 17:56, Jan Beulich wrote:
>>>>>>>> On 27.10.14 at 08:12, <tiejun.chen@intel.com> wrote:
>>>>>> On 2014/10/24 22:42, Jan Beulich wrote:
>>>>>>>>>> On 24.10.14 at 09:34, <tiejun.chen@intel.com> wrote:
>>>>>> Additionally, actually there are some original codes just following my
>>>>>> codes:
>>>>>>
>>>>>>             if ( need_skip_rmrr )
>>>>>>             {
>>>>>> 		...
>>>>>>             }
>>>>>>
>>>>>> 	base += bar_sz;
>>>>>>
>>>>>>             if ( (base < resource->base) || (base > resource->max) )
>>>>>>             {
>>>>>>                 printf("pci dev %02x:%x bar %02x size "PRIllx": no space for "
>>>>>>                        "resource!\n", devfn>>3, devfn&7, bar_reg,
>>>>>>                        PRIllx_arg(bar_sz));
>>>>>>                 continue;
>>>>>>             }
>>>>>>
>>>>>> This can guarantee we don't overwhelm the previous mmio range.
>>>>>
>>>>> Resulting in the BAR not getting a value assigned afaict. Certainly
>>>>> not what we want as a side effect of your changes.
>>>>
>>>> I don't understand what a side effect is. I just to try to make sure BAR
>>>> space skip any conflict range but they are still in these resource ranges.
>>>
>>> A side effect is an effect you don't primarily intend with your change
>>> (or more generally, with any particular operation). In the case here,
>>> a BAR that previously got a value assigned may not anymore with
>>> your change in place. An acceptable effect of your change would be
>>> if the value it gets assigned is now different, but not assigning a value
>>> at all is not acceptable.
>>
>> As I understand that value just need to align with BAR and size. Then
>> any range should be fine. Here I think its not necessary to consider any
>> space restriction, i.e, some device may just access under 4G.
>
> No. The code determining where to put the lower boundary of the
> MMIO range doesn't (with your present patch) consider the regions
> the actual assignment code now skips. Hence the lower boundary
> may not be low enough to accommodate all BARs.

Right, so maybe before any mmio allocation we can expand the mmio resource 
so that it has sufficient space to cover the RMRR ranges and all BARs.

Thanks
Tiejun

>
> Jan
>
>
>


* Re: [v7][RFC][PATCH 04/13] hvmloader/util: get reserved device memory maps
  2014-10-29  9:05               ` Jan Beulich
@ 2014-10-30  5:55                 ` Chen, Tiejun
  2014-10-30  9:13                   ` Jan Beulich
  0 siblings, 1 reply; 180+ messages in thread
From: Chen, Tiejun @ 2014-10-30  5:55 UTC (permalink / raw)
  To: Jan Beulich, wei.liu2, stefano.stabellini, ian.campbell, ian.jackson
  Cc: yang.z.zhang, kevin.tian, tim, xen-devel

On 2014/10/29 17:05, Jan Beulich wrote:
>>>> On 29.10.14 at 07:54, <tiejun.chen@intel.com> wrote:
>> On 2014/10/28 17:48, Jan Beulich wrote:
>>>>>> On 28.10.14 at 06:21, <tiejun.chen@intel.com> wrote:
>>>> On 2014/10/27 17:45, Jan Beulich wrote:
>>>>>>>> On 27.10.14 at 04:12, <tiejun.chen@intel.com> wrote:
>>>>>> On 2014/10/24 22:22, Jan Beulich wrote:
>>>>>>>>>> On 24.10.14 at 09:34, <tiejun.chen@intel.com> wrote:
>>>>>>>> --- a/tools/firmware/hvmloader/util.h
>>>>>>>> +++ b/tools/firmware/hvmloader/util.h
>>>>>>>> @@ -241,6 +241,12 @@ int build_e820_table(struct e820entry *e820,
>>>>>>>>                           unsigned int bios_image_base);
>>>>>>>>      void dump_e820_table(struct e820entry *e820, unsigned int nr);
>>>>>>>>
>>>>>>>> +#include <xen/memory.h>
>>>>>>>> +#define ENOBUFS     105 /* No buffer space available */
>>>>>>>
>>>>>>> This is a joke I hope? The #include belongs at the top (albeit afaict
>>>>>>> you don't really need it here), and the #define is completely
>>>>>>
>>>>>> If without this line, #include <xen/memory.h>,
>>>>>>
>>>>>> In file included from build.c:25:0:
>>>>>> ../util.h:246:70: error: array type has incomplete element type
>>>>>>      int get_reserved_device_memory_map(struct xen_reserved_device_memory
>>>>>> entries[],
>>>>>>                                                                           ^
>>>>>> make[8]: *** [build.o] Error 1
>>>>>
>>>>> So just forward declare the structure ahead of the function
>>>>> declaration.
>>>>
>>>> tools/firmware/hvmloader/pci.c:28:#include <xen/memory.h>
>>>> tools/firmware/hvmloader/ovmf.c:36:#include <xen/memory.h>
>>>>
>>>> So any reason I can't do such a same thing?
>>>
>>> You can, but it's undesirable. You're wanting this in a header, i.e.
>>> you'll make everyone consuming that header also implicitly depend
>>> on the new header you would include. We shouldn't pointlessly
>>> add build dependencies (and we should really try to reduce them
>>> where possible).
>>
>> Looks I can remove those stuff from util.h and just add 'extern' to them
>> when we really need them.
>
> Please stop thinking this way. Declarations for things defined in .c
> files are to be present in headers, and the defining .c file has to
> include that header (making sure declaration and definition are and
> remain in sync). I hate having to again repeat my remark that you
> shouldn't forget it's not application code that you're modifying.
> Robust and maintainable code are a requirement in the hypervisor
> (and, as said it being an extension of it, hvmloader). Which - just
> to avoid any misunderstanding - isn't to say that this shouldn't also
> apply to application code. It's just that in the hypervisor and kernel
> (and certain other code system components) the consequences of
> being lax are much more severe.

Okay. But currently the pci.c file already includes both "util.h" and 
<xen/memory.h>:

#include "util.h"
...
#include <xen/memory.h>

We can't redefine struct xen_reserved_device_memory in util.h.
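
Note, though, that a forward declaration is not a redefinition, so it
doesn't clash with the full definition pulled in from <xen/memory.h>.
What doesn't work is an array parameter of the incomplete type (the
"array type has incomplete element type" error quoted above); a pointer
parameter does. A minimal sketch, with hypothetical names (rdm_entry,
total_pages()) standing in for the real structure and function:

```c
/*
 * util.h-like part: a forward declaration is enough for a prototype
 * that only takes a pointer, so no extra #include is needed here.
 */
struct rdm_entry;
unsigned long total_pages(const struct rdm_entry *entries, unsigned int nr);

/*
 * memory.h-like part: the full definition may follow later, or come
 * from another header, without conflicting with the line above.
 */
struct rdm_entry {
    unsigned long start_pfn;
    unsigned long nr_pages;
};

/*
 * util.c-like part: the defining file sees both the prototype and the
 * full type, so the compiler checks declaration against definition.
 */
unsigned long total_pages(const struct rdm_entry *entries, unsigned int nr)
{
    unsigned long sum = 0;
    unsigned int i;

    for ( i = 0; i < nr; i++ )
        sum += entries[i].nr_pages;

    return sum;
}
```

Only the files that actually dereference the structure need to include
the defining header; everyone else including util.h gains no new
dependency.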

>
>>>>>>> misplaced here. While I generally wouldn't recommend doing this, I
>>>>>>> think in the case here including the hypervisor header that defines
>>>>>>> them would be okay. Perhaps not via relative path, but via having
>>
>> So is the following is a way "via having the Makefile symlink the
>> hypervisor header here."?
>>
>> --- a/tools/include/Makefile
>> +++ b/tools/include/Makefile
>> @@ -17,6 +17,7 @@ xen/.dir:
>>           ln -sf ../xen-sys/$(XEN_OS) xen/sys
>>           ln -sf $(addprefix $(XEN_ROOT)/xen/include/xen/,libelf.h elfstructs.h) xen/libelf/
>>           ln -s ../xen-foreign xen/foreign
>> +       ln -sf $(XEN_ROOT)/xen/include/xen/errno.h xen
>
> Along those lines at least. Consult with the tools maintainers as to
> whether this would better go into a special subdirectory (and
> perhaps not even under tools/include, but somewhere beneath
> tools/firmware/hvmloader/, since it's not meant to be used by

Agree with you.

> anyone else - as one of the ones de facto looking after hvmloader
> irrespective of what ./MAINTAINERS says, I'd strongly recommend
> limiting the visibility scope of this header as much as possible).

So this time this email is also addressed to the following maintainers:

Ian Jackson <ian.jackson@eu.citrix.com>
Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Ian Campbell <ian.campbell@citrix.com>
Wei Liu <wei.liu2@citrix.com>

So all tools maintainers,

Could you comment on which of these you would prefer?

'ln -sf $(XEN_ROOT)/xen/include/xen/errno.h xen' versus
'ln -sf $(XEN_ROOT)/xen/include/xen/errno.h $(XEN_ROOT)/tools/firmware/hvmloader/'

Thanks
Tiejun

>
>> Then we just need include this in util.c:
>>
>> --- a/tools/firmware/hvmloader/util.c
>> +++ b/tools/firmware/hvmloader/util.c
>> @@ -26,6 +26,7 @@
>>    #include <xen/xen.h>
>>    #include <xen/memory.h>
>>    #include <xen/sched.h>
>> +#include <xen/errno.h>
>
> Exactly.
>
> Jan
>
>


* Re: [v7][RFC][PATCH 08/13] xen/x86/p2m: set p2m_access_n for reserved device memory mapping
  2014-10-29  9:20               ` Jan Beulich
@ 2014-10-30  7:39                 ` Chen, Tiejun
  2014-10-30  9:24                   ` Jan Beulich
  0 siblings, 1 reply; 180+ messages in thread
From: Chen, Tiejun @ 2014-10-30  7:39 UTC (permalink / raw)
  To: Jan Beulich; +Cc: yang.z.zhang, kevin.tian, tim, xen-devel

On 2014/10/29 17:20, Jan Beulich wrote:
>>>> On 29.10.14 at 09:20, <tiejun.chen@intel.com> wrote:
>> @@ -686,8 +686,19 @@ guest_physmap_add_entry(struct domain *d, unsigned long gfn,
>>        /* Now, actually do the two-way mapping */
>>        if ( mfn_valid(_mfn(mfn)) )
>>        {
>> -        rc = p2m_set_entry(p2m, gfn, _mfn(mfn), page_order, t,
>> -                           p2m->default_access);
>> +        rc = 0;
>> +        if ( !is_hardware_domain(d) )
>> +        {
>> +            rc = iommu_get_reserved_device_memory(p2m_check_reserved_device_memory,
>> +                                                  &gfn);
>> +            if ( rc < 0 )
>> +                printk("Domain %d can't check reserved device memory.\n",
>> +                       d->domain_id);
>> +        }
>> +
>> +        /* We need to set reserved device memory as p2m_access_n. */
>> +        a =  ( rc == 1 ) ? p2m_access_n : p2m->default_access;
>> +        rc = p2m_set_entry(p2m, gfn, _mfn(mfn), page_order, t, a);
>>            if ( rc )
>>                goto out; /* Failed to update p2m, bail without updating m2p. */
>
> Getting closer. Just set a to p2m->default_access before the if(),
> and overwrite it when rc == 1 inside the if(). And properly handle
> the error case (just logging a message - which btw lacks a proper
> XENLOG_G_* prefix - doesn't seem enough to me).

Please check the following:

@@ -686,8 +686,22 @@ guest_physmap_add_entry(struct domain *d, unsigned long gfn,
      /* Now, actually do the two-way mapping */
      if ( mfn_valid(_mfn(mfn)) )
      {
-        rc = p2m_set_entry(p2m, gfn, _mfn(mfn), page_order, t,
-                           p2m->default_access);
+        rc = 0;
+        a = p2m->default_access;
+        if ( !is_hardware_domain(d) )
+        {
+            rc = iommu_get_reserved_device_memory(p2m_check_reserved_device_memory,
+                                                  &gfn);
+            /* We need to set reserved device memory as p2m_access_n. */
+            if ( rc == 1 )
+                a = p2m_access_n;
+            else if ( rc < 0 )
+                printk(XENLOG_WARNING
+                       "Domain %d can't check reserved device memory.\n",
+                       d->domain_id);
+        }
+
+        rc = p2m_set_entry(p2m, gfn, _mfn(mfn), page_order, t, a);
          if ( rc )
              goto out; /* Failed to update p2m, bail without updating m2p. */


>
> But then again this code may change altogether if you avoid
> populating the reserved regions in the first place.

Do you mean this scenario?

#1 Here we first set these ranges as p2m_access_n
#2 We reset them as 1:1 RMRR mapping with p2m_access_rw somewhere
#3 Someone may try to populate these ranges again

Thanks
Tiejun

>
> Jan
>
>
>


* Re: [v7][RFC][PATCH 01/13] xen: RMRR fix
  2014-10-29  8:45               ` Jan Beulich
@ 2014-10-30  8:21                 ` Chen, Tiejun
  2014-10-30  9:07                   ` Jan Beulich
  0 siblings, 1 reply; 180+ messages in thread
From: Chen, Tiejun @ 2014-10-30  8:21 UTC (permalink / raw)
  To: Jan Beulich; +Cc: yang.z.zhang, kevin.tian, tim, xen-devel

On 2014/10/29 16:45, Jan Beulich wrote:
>>>> On 29.10.14 at 03:51, <tiejun.chen@intel.com> wrote:
>> On 2014/10/29 8:48, Chen, Tiejun wrote:
>>> On 2014/10/28 17:34, Jan Beulich wrote:
>>>>>>> On 28.10.14 at 09:36, <tiejun.chen@intel.com> wrote:
>>>>> On 2014/10/27 17:41, Jan Beulich wrote:
>>>>>>>>> On 27.10.14 at 03:00, <tiejun.chen@intel.com> wrote:
>>>>>>> On 2014/10/24 18:52, Jan Beulich wrote:
>>>>>>>>>>> On 24.10.14 at 09:34, <tiejun.chen@intel.com> wrote:
>>>>>>>>> Now in our case we add a rule:
>>>>>>>>>      - if p2m_access_n is set we also set this mapping.
>>>>>>>>
>>>>>>>> Does that not conflict with eventual use mem-access makes of this
>>>>>>>> type?
>>>>>
>>>>> Do you mean what will happen after we reset these ranges as
>>>>> p2m_access_rw? We already reserve these ranges guest shouldn't access
>>>>> these range actually. And a guest still maliciously access them, that
>>>>> device may not work well.
>>>>
>>>> mem-access is functionality used by a control domain, not the domain
>>>
>>> I really don't know this mechanism so thanks for your good coverage.
>>>
>>>> itself. You need to make sure that neither your use of p2m_access_n
>>>> can confuse the mem-access code, nor that their use can confuse you.
>>>
>>> Absolutely, but I think I need to know more about mem-access firstly.
>>>
>>
>> I think this reserved device memory shouldn't be poked since any write
>> may affect the device. And what if a device with RMRR isn't assigned to
>> the current domain? Reads should not be allowed either since they may
>> still introduce some potential unexpected behavior in the device.
>>
>> So if mem_access is trying to access those RMRR range, could we let
>> mem_access exit directly with some message? I mean we can check if we're
>> accessing those RMRR ranges in case of XENMEM_access_op_set_access.
>
> Sounds reasonable at first glance.
>

I think we just need to guarantee that no one sets mem_access for those 
ranges, but it's fine to get mem_access:

diff --git a/xen/common/mem_access.c b/xen/common/mem_access.c
index 6c2724b..4c84f88 100644
--- a/xen/common/mem_access.c
+++ b/xen/common/mem_access.c
@@ -55,6 +55,31 @@ void mem_access_resume(struct domain *d)
      }
  }

+/* We can't expose reserved device memory. */
+static int mem_access_check_rdm(struct domain *d, uint64_aligned_t start,
+                                uint32_t nr)
+{
+    uint32_t i;
+    uint64_aligned_t gfn;
+    int rc = 0;
+
+    if ( !is_hardware_domain(d) )
+    {
+        for ( i = 0; i < nr; i++ )
+        {
+            gfn = start + i;
+            rc = iommu_get_reserved_device_memory(p2m_check_reserved_device_memory,
+                                                  &gfn);
+            if ( rc < 0 )
+                printk(XENLOG_WARNING
+                       "Domain %d can't check reserved device memory.\n",
+                       d->domain_id);
+        }
+    }
+
+    return rc;
+}
+
  int mem_access_memop(unsigned long cmd,
                       XEN_GUEST_HANDLE_PARAM(xen_mem_access_op_t) arg)
  {
@@ -99,6 +124,15 @@ int mem_access_memop(unsigned long cmd,
                ((mao.pfn + mao.nr - 1) > domain_get_maximum_gpfn(d))) )
              break;

+        rc = mem_access_check_rdm(d, mao.pfn, mao.nr);
+        if ( rc == 1 )
+        {
+            printk(XENLOG_WARNING
+                   "Domain %d: we shouldn't mem_access reserved device memory.\n",
+                   d->domain_id);
+            break;
+        }
+
          rc = p2m_set_mem_access(d, mao.pfn, mao.nr, start_iter,
                                  MEMOP_CMD_MASK, mao.access);
          if ( rc > 0 )


Thanks
Tiejun

^ permalink raw reply related	[flat|nested] 180+ messages in thread
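
For readers following the set-access check discussed in the message above,
here is a minimal sketch of a whole-range overlap test that returns on the
first hit. It is not the posted patch: struct rdm_range and rdm_overlaps()
are hypothetical names, the reserved ranges are passed in as a plain array
rather than obtained through iommu_get_reserved_device_memory(), and
overflow of start + nr is not handled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one reserved device memory (RMRR) range. */
struct rdm_range {
    uint64_t start_pfn;
    uint64_t nr_pages;
};

/*
 * Return 1 if [start, start + nr) overlaps any reserved range, 0 otherwise.
 * The loop returns on the first overlap, so the result cannot be
 * overwritten by a later range that does not overlap.
 */
static int rdm_overlaps(const struct rdm_range *rdm, size_t count,
                        uint64_t start, uint64_t nr)
{
    size_t i;

    assert(rdm != NULL || count == 0);

    if ( nr == 0 )
        return 0;

    for ( i = 0; i < count; i++ )
    {
        /* Exclusive end of the reserved range. */
        uint64_t rdm_end = rdm[i].start_pfn + rdm[i].nr_pages;

        if ( rdm[i].nr_pages &&
             start < rdm_end && rdm[i].start_pfn < start + nr )
            return 1;
    }

    return 0;
}
```

A caller in the spirit of the set-access path would refuse the request
when rdm_overlaps() returns 1 and proceed otherwise.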

* Re: [v7][RFC][PATCH 01/13] xen: RMRR fix
  2014-10-30  8:21                 ` Chen, Tiejun
@ 2014-10-30  9:07                   ` Jan Beulich
  2014-10-31  3:11                     ` Chen, Tiejun
  0 siblings, 1 reply; 180+ messages in thread
From: Jan Beulich @ 2014-10-30  9:07 UTC (permalink / raw)
  To: Tiejun Chen; +Cc: yang.z.zhang, kevin.tian, tim, xen-devel

>>> On 30.10.14 at 09:21, <tiejun.chen@intel.com> wrote:
> I think we just need to guarantee that no one sets mem_access for those
> ranges, but it's fine to get mem_access:

Seems reasonable to me, but you'll need to get the respective
maintainers to agree. And of course this needs to be cleaned up;
personally I also dislike the excessive logging of messages that
you add here and elsewhere.

Jan

^ permalink raw reply	[flat|nested] 180+ messages in thread

* Re: [v7][RFC][PATCH 01/13] introduce XENMEM_reserved_device_memory_map
  2014-10-30  2:53             ` Chen, Tiejun
@ 2014-10-30  9:10               ` Jan Beulich
  2014-10-31  1:03                 ` Chen, Tiejun
  0 siblings, 1 reply; 180+ messages in thread
From: Jan Beulich @ 2014-10-30  9:10 UTC (permalink / raw)
  To: Tiejun Chen; +Cc: kevin.tian, Julien Grall, tim, xen-devel, yang.z.zhang

[-- Attachment #1: Type: text/plain, Size: 154 bytes --]

>>> On 30.10.14 at 03:53, <tiejun.chen@intel.com> wrote:
> So could you send me the latest? I'd like to replace it in my tree.

Here you go.

Jan


[-- Attachment #2: get-reserved-device-memory.patch --]
[-- Type: text/plain, Size: 8973 bytes --]

introduce XENMEM_reserved_device_memory_map

This is a prerequisite for punching holes into HVM and PVH guests' P2M
to allow passing through devices that are associated with (on VT-d)
RMRRs.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Kevin Tian <kevin.tian@intel.com>

--- a/xen/common/compat/memory.c
+++ b/xen/common/compat/memory.c
@@ -16,6 +16,37 @@ CHECK_TYPE(domid);
 
 CHECK_mem_access_op;
 
+#ifdef HAS_PASSTHROUGH
+struct get_reserved_device_memory {
+    struct compat_reserved_device_memory_map map;
+    unsigned int used_entries;
+};
+
+static int get_reserved_device_memory(xen_pfn_t start,
+                                      xen_ulong_t nr, void *ctxt)
+{
+    struct get_reserved_device_memory *grdm = ctxt;
+
+    if ( grdm->used_entries < grdm->map.nr_entries )
+    {
+        struct compat_reserved_device_memory rdm = {
+            .start_pfn = start, .nr_pages = nr
+        };
+
+        if ( rdm.start_pfn != start || rdm.nr_pages != nr )
+            return -ERANGE;
+
+        if ( __copy_to_compat_offset(grdm->map.buffer, grdm->used_entries,
+                                     &rdm, 1) )
+            return -EFAULT;
+    }
+
+    ++grdm->used_entries;
+
+    return 0;
+}
+#endif
+
 int compat_memory_op(unsigned int cmd, XEN_GUEST_HANDLE_PARAM(void) compat)
 {
     int split, op = cmd & MEMOP_CMD_MASK;
@@ -273,6 +304,29 @@ int compat_memory_op(unsigned int cmd, X
             break;
         }
 
+#ifdef HAS_PASSTHROUGH
+        case XENMEM_reserved_device_memory_map:
+        {
+            struct get_reserved_device_memory grdm;
+
+            if ( copy_from_guest(&grdm.map, compat, 1) ||
+                 !compat_handle_okay(grdm.map.buffer, grdm.map.nr_entries) )
+                return -EFAULT;
+
+            grdm.used_entries = 0;
+            rc = iommu_get_reserved_device_memory(get_reserved_device_memory,
+                                                  &grdm);
+
+            if ( !rc && grdm.map.nr_entries < grdm.used_entries )
+                rc = -ENOBUFS;
+            grdm.map.nr_entries = grdm.used_entries;
+            if ( __copy_to_guest(compat, &grdm.map, 1) )
+                rc = -EFAULT;
+
+            return rc;
+        }
+#endif
+
         default:
             return compat_arch_memory_op(cmd, compat);
         }
--- a/xen/common/memory.c
+++ b/xen/common/memory.c
@@ -692,6 +692,34 @@ out:
     return rc;
 }
 
+#ifdef HAS_PASSTHROUGH
+struct get_reserved_device_memory {
+    struct xen_reserved_device_memory_map map;
+    unsigned int used_entries;
+};
+
+static int get_reserved_device_memory(xen_pfn_t start,
+                                      xen_ulong_t nr, void *ctxt)
+{
+    struct get_reserved_device_memory *grdm = ctxt;
+
+    if ( grdm->used_entries < grdm->map.nr_entries )
+    {
+        struct xen_reserved_device_memory rdm = {
+            .start_pfn = start, .nr_pages = nr
+        };
+
+        if ( __copy_to_guest_offset(grdm->map.buffer, grdm->used_entries,
+                                    &rdm, 1) )
+            return -EFAULT;
+    }
+
+    ++grdm->used_entries;
+
+    return 0;
+}
+#endif
+
 long do_memory_op(unsigned long cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
 {
     struct domain *d;
@@ -1101,6 +1129,29 @@ long do_memory_op(unsigned long cmd, XEN
         break;
     }
 
+#ifdef HAS_PASSTHROUGH
+    case XENMEM_reserved_device_memory_map:
+    {
+        struct get_reserved_device_memory grdm;
+
+        if ( copy_from_guest(&grdm.map, arg, 1) ||
+             !guest_handle_okay(grdm.map.buffer, grdm.map.nr_entries) )
+            return -EFAULT;
+
+        grdm.used_entries = 0;
+        rc = iommu_get_reserved_device_memory(get_reserved_device_memory,
+                                              &grdm);
+
+        if ( !rc && grdm.map.nr_entries < grdm.used_entries )
+            rc = -ENOBUFS;
+        grdm.map.nr_entries = grdm.used_entries;
+        if ( __copy_to_guest(arg, &grdm.map, 1) )
+            rc = -EFAULT;
+
+        break;
+    }
+#endif
+
     default:
         rc = arch_memory_op(cmd, arg);
         break;
--- a/xen/drivers/passthrough/iommu.c
+++ b/xen/drivers/passthrough/iommu.c
@@ -344,6 +344,16 @@ void iommu_crash_shutdown(void)
     iommu_enabled = iommu_intremap = 0;
 }
 
+int iommu_get_reserved_device_memory(iommu_grdm_t *func, void *ctxt)
+{
+    const struct iommu_ops *ops = iommu_get_ops();
+
+    if ( !iommu_enabled || !ops->get_reserved_device_memory )
+        return 0;
+
+    return ops->get_reserved_device_memory(func, ctxt);
+}
+
 bool_t iommu_has_feature(struct domain *d, enum iommu_feature feature)
 {
     const struct hvm_iommu *hd = domain_hvm_iommu(d);
--- a/xen/drivers/passthrough/vtd/dmar.c
+++ b/xen/drivers/passthrough/vtd/dmar.c
@@ -893,3 +893,20 @@ int platform_supports_x2apic(void)
     unsigned int mask = ACPI_DMAR_INTR_REMAP | ACPI_DMAR_X2APIC_OPT_OUT;
     return cpu_has_x2apic && ((dmar_flags & mask) == ACPI_DMAR_INTR_REMAP);
 }
+
+int intel_iommu_get_reserved_device_memory(iommu_grdm_t *func, void *ctxt)
+{
+    struct acpi_rmrr_unit *rmrr;
+    int rc = 0;
+
+    list_for_each_entry(rmrr, &acpi_rmrr_units, list)
+    {
+        rc = func(PFN_DOWN(rmrr->base_address),
+                  PFN_UP(rmrr->end_address) - PFN_DOWN(rmrr->base_address),
+                  ctxt);
+        if ( rc )
+            break;
+    }
+
+    return rc;
+}
--- a/xen/drivers/passthrough/vtd/extern.h
+++ b/xen/drivers/passthrough/vtd/extern.h
@@ -75,6 +75,7 @@ int domain_context_mapping_one(struct do
                                u8 bus, u8 devfn, const struct pci_dev *);
 int domain_context_unmap_one(struct domain *domain, struct iommu *iommu,
                              u8 bus, u8 devfn);
+int intel_iommu_get_reserved_device_memory(iommu_grdm_t *func, void *ctxt);
 
 unsigned int io_apic_read_remap_rte(unsigned int apic, unsigned int reg);
 void io_apic_write_remap_rte(unsigned int apic,
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -2491,6 +2491,7 @@ const struct iommu_ops intel_iommu_ops =
     .crash_shutdown = vtd_crash_shutdown,
     .iotlb_flush = intel_iommu_iotlb_flush,
     .iotlb_flush_all = intel_iommu_iotlb_flush_all,
+    .get_reserved_device_memory = intel_iommu_get_reserved_device_memory,
     .dump_p2m_table = vtd_dump_p2m_table,
 };
 
--- a/xen/include/public/memory.h
+++ b/xen/include/public/memory.h
@@ -573,7 +573,29 @@ struct vnuma_topology_info {
 typedef struct vnuma_topology_info vnuma_topology_info_t;
 DEFINE_XEN_GUEST_HANDLE(vnuma_topology_info_t);
 
-/* Next available subop number is 27 */
+/*
+ * For legacy reasons, some devices must be configured with special memory
+ * regions to function correctly.  The guest must avoid using any of these
+ * regions.
+ */
+#define XENMEM_reserved_device_memory_map   27
+struct xen_reserved_device_memory {
+    xen_pfn_t start_pfn;
+    xen_ulong_t nr_pages;
+};
+typedef struct xen_reserved_device_memory xen_reserved_device_memory_t;
+DEFINE_XEN_GUEST_HANDLE(xen_reserved_device_memory_t);
+
+struct xen_reserved_device_memory_map {
+    /* IN/OUT */
+    unsigned int nr_entries;
+    /* OUT */
+    XEN_GUEST_HANDLE(xen_reserved_device_memory_t) buffer;
+};
+typedef struct xen_reserved_device_memory_map xen_reserved_device_memory_map_t;
+DEFINE_XEN_GUEST_HANDLE(xen_reserved_device_memory_map_t);
+
+/* Next available subop number is 28 */
 
 #endif /* __XEN_PUBLIC_MEMORY_H__ */
 
--- a/xen/include/xen/iommu.h
+++ b/xen/include/xen/iommu.h
@@ -120,6 +120,8 @@ void iommu_dt_domain_destroy(struct doma
 
 struct page_info;
 
+typedef int iommu_grdm_t(xen_pfn_t start, xen_ulong_t nr, void *ctxt);
+
 struct iommu_ops {
     int (*init)(struct domain *d);
     void (*hwdom_init)(struct domain *d);
@@ -156,12 +158,14 @@ struct iommu_ops {
     void (*crash_shutdown)(void);
     void (*iotlb_flush)(struct domain *d, unsigned long gfn, unsigned int page_count);
     void (*iotlb_flush_all)(struct domain *d);
+    int (*get_reserved_device_memory)(iommu_grdm_t *, void *);
     void (*dump_p2m_table)(struct domain *d);
 };
 
 void iommu_suspend(void);
 void iommu_resume(void);
 void iommu_crash_shutdown(void);
+int iommu_get_reserved_device_memory(iommu_grdm_t *, void *);
 
 void iommu_share_p2m_table(struct domain *d);
 
--- a/xen/include/xlat.lst
+++ b/xen/include/xlat.lst
@@ -61,9 +61,10 @@
 !	memory_exchange			memory.h
 !	memory_map			memory.h
 !	memory_reservation		memory.h
-?	mem_access_op		memory.h
+?	mem_access_op			memory.h
 !	pod_target			memory.h
 !	remove_from_physmap		memory.h
+!	reserved_device_memory_map	memory.h
 ?	physdev_eoi			physdev.h
 ?	physdev_get_free_pirq		physdev.h
 ?	physdev_irq			physdev.h

^ permalink raw reply	[flat|nested] 180+ messages in thread
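
The buffer handling in the XENMEM_reserved_device_memory_map patch above
follows a "count everything, copy what fits" pattern: the callback keeps
counting after the buffer is full, and the caller gets -ENOBUFS plus the
required entry count. The sketch below models that pattern in ordinary
user-space C. The names rdm_entry, rdm_query, collect_one() and
rdm_query_all() are hypothetical, and a plain array stands in for the
IOMMU's list of RMRRs and for the guest copy helpers.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct xen_reserved_device_memory. */
struct rdm_entry {
    uint64_t start_pfn;
    uint64_t nr_pages;
};

/* Caller context, modeled on struct get_reserved_device_memory. */
struct rdm_query {
    struct rdm_entry *buffer;
    unsigned int nr_entries;    /* capacity of buffer */
    unsigned int used_entries;  /* entries seen so far */
};

/* Per-range callback: store the entry only while there is room. */
static int collect_one(uint64_t start, uint64_t nr, void *ctxt)
{
    struct rdm_query *q = ctxt;

    if ( q->used_entries < q->nr_entries )
    {
        q->buffer[q->used_entries].start_pfn = start;
        q->buffer[q->used_entries].nr_pages = nr;
    }

    /* Keep counting even when the buffer is full. */
    ++q->used_entries;

    return 0;
}

/*
 * Walk all source ranges.  On return *nr_entries holds the number of
 * entries that exist; the result is -ENOBUFS if the buffer was too small.
 */
static int rdm_query_all(const struct rdm_entry *src, unsigned int src_nr,
                         struct rdm_entry *buffer, unsigned int *nr_entries)
{
    struct rdm_query q = { buffer, 0, 0 };
    unsigned int i;
    int rc = 0;

    assert(nr_entries != NULL);
    q.nr_entries = *nr_entries;

    for ( i = 0; i < src_nr && !rc; i++ )
        rc = collect_one(src[i].start_pfn, src[i].nr_pages, &q);

    if ( !rc && q.nr_entries < q.used_entries )
        rc = -ENOBUFS;
    *nr_entries = q.used_entries;

    return rc;
}
```

A caller can therefore probe with a zero-sized buffer, read back the
required count, allocate that many entries, and call again.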

* Re: [v7][RFC][PATCH 04/13] hvmloader/util: get reserved device memory maps
  2014-10-30  5:55                 ` Chen, Tiejun
@ 2014-10-30  9:13                   ` Jan Beulich
  2014-10-31  2:20                     ` Chen, Tiejun
  0 siblings, 1 reply; 180+ messages in thread
From: Jan Beulich @ 2014-10-30  9:13 UTC (permalink / raw)
  To: Tiejun Chen
  Cc: kevin.tian, wei.liu2, ian.campbell, stefano.stabellini, tim,
	ian.jackson, xen-devel, yang.z.zhang

>>> On 30.10.14 at 06:55, <tiejun.chen@intel.com> wrote:
> On 2014/10/29 17:05, Jan Beulich wrote:
>>>>> On 29.10.14 at 07:54, <tiejun.chen@intel.com> wrote:
>>> On 2014/10/28 17:48, Jan Beulich wrote:
>>>>>>> On 28.10.14 at 06:21, <tiejun.chen@intel.com> wrote:
>>>>> On 2014/10/27 17:45, Jan Beulich wrote:
>>>>>>>>> On 27.10.14 at 04:12, <tiejun.chen@intel.com> wrote:
>>>>>>> On 2014/10/24 22:22, Jan Beulich wrote:
>>>>>>>>>>> On 24.10.14 at 09:34, <tiejun.chen@intel.com> wrote:
>>>>>>>>> --- a/tools/firmware/hvmloader/util.h
>>>>>>>>> +++ b/tools/firmware/hvmloader/util.h
>>>>>>>>> @@ -241,6 +241,12 @@ int build_e820_table(struct e820entry *e820,
>>>>>>>>>                           unsigned int bios_image_base);
>>>>>>>>>      void dump_e820_table(struct e820entry *e820, unsigned int nr);
>>>>>>>>>
>>>>>>>>> +#include <xen/memory.h>
>>>>>>>>> +#define ENOBUFS     105 /* No buffer space available */
>>>>>>>>
>>>>>>>> This is a joke I hope? The #include belongs at the top (albeit afaict
>>>>>>>> you don't really need it here), and the #define is completely
>>>>>>>
>>>>>>> If without this line, #include <xen/memory.h>,
>>>>>>>
>>>>>>> In file included from build.c:25:0:
>>>>>>> ../util.h:246:70: error: array type has incomplete element type
>>>>>>>      int get_reserved_device_memory_map(struct xen_reserved_device_memory
>>>>>>> entries[],
>>>>>>>                                                                           ^
>>>>>>> make[8]: *** [build.o] Error 1
>>>>>>
>>>>>> So just forward declare the structure ahead of the function
>>>>>> declaration.
>>>>>
>>>>> tools/firmware/hvmloader/pci.c:28:#include <xen/memory.h>
>>>>> tools/firmware/hvmloader/ovmf.c:36:#include <xen/memory.h>
>>>>>
>>>>> So any reason I can't do such a same thing?
>>>>
>>>> You can, but it's undesirable. You're wanting this in a header, i.e.
>>>> you'll make everyone consuming that header also implicitly depend
>>>> on the new header you would include. We shouldn't pointlessly
>>>> add build dependencies (and we should really try to reduce them
>>>> where possible).
>>>
>>> Looks I can remove those stuff from util.h and just add 'extern' to them
>>> when we really need them.
>>
>> Please stop thinking this way. Declarations for things defined in .c
>> files are to be present in headers, and the defining .c file has to
>> include that header (making sure declaration and definition are and
>> remain in sync). I hate having to again repeat my remark that you
>> shouldn't forget it's not application code that you're modifying.
>> Robust and maintainable code are a requirement in the hypervisor
>> (and, as said it being an extension of it, hvmloader). Which - just
>> to avoid any misunderstanding - isn't to say that this shouldn't also
>> apply to application code. It's just that in the hypervisor and kernel
>> (and certain other code system components) the consequences of
>> being lax are much more severe.
> 
> Okay. But currently, the pci.c file already includes 'util.h' and
> '<xen/memory.h>':
> 
> #include "util.h"
> ...
> #include <xen/memory.h>
> 
> We can't redefine struct xen_reserved_device_memory in util.h.

Redefine? I said forward declare.

> So all tools maintainers,
> 
> Could you give us such a comment?
> 
> 'ln -sf $(XEN_ROOT)/xen/include/xen/errno.h xen' versus
> 'ln -sf $(XEN_ROOT)/xen/include/xen/errno.h 
> $(XEN_ROOT)/tools/firmware/hvmloader/,

I think if it is (as suggested) to be limited to hvmloader, the
symlinking also should be done in its Makefile rather than along
with other tools stuff.

Jan

^ permalink raw reply	[flat|nested] 180+ messages in thread
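
The distinction between forward declaring and redefining a structure, as
raised in the message above, can be sketched as follows. This is an
illustration only: rdm_total_pages() is a hypothetical helper, and the
structure body is a simplified stand-in for the public definition (same
field names as in the patch, fixed-width types assumed). Note that the
prototype uses a pointer parameter; an array parameter such as
"entries[]" needs a complete element type, which is what the quoted
"array type has incomplete element type" error reports.

```c
#include <assert.h>
#include <stdint.h>

/* --- What a header such as util.h would carry (sketch) --- */

/* Forward declaration: names the type without defining its layout. */
struct xen_reserved_device_memory;

/* A pointer to an incomplete type is fine in a prototype. */
uint64_t rdm_total_pages(const struct xen_reserved_device_memory *entries,
                         unsigned int nr_entries);

/* --- What the defining .c file sees after including the full header --- */

/* Simplified stand-in for the public structure definition. */
struct xen_reserved_device_memory {
    uint64_t start_pfn;
    uint64_t nr_pages;
};

/* The definition needs the complete type because it reads the fields. */
uint64_t rdm_total_pages(const struct xen_reserved_device_memory *entries,
                         unsigned int nr_entries)
{
    uint64_t total = 0;
    unsigned int i;

    assert(entries || nr_entries == 0);

    for ( i = 0; i < nr_entries; i++ )
        total += entries[i].nr_pages;

    return total;
}
```

With this split, files that only include the header do not pick up a
build dependency on the header that defines the structure.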

* Re: [v7][RFC][PATCH 06/13] hvmloader/ram: check if guest memory is out of reserved device memory maps
  2014-10-30  3:11                 ` Chen, Tiejun
@ 2014-10-30  9:20                   ` Jan Beulich
  2014-10-31  5:41                     ` Chen, Tiejun
  0 siblings, 1 reply; 180+ messages in thread
From: Jan Beulich @ 2014-10-30  9:20 UTC (permalink / raw)
  To: Tiejun Chen; +Cc: yang.z.zhang, kevin.tian, tim, xen-devel

>>> On 30.10.14 at 04:11, <tiejun.chen@intel.com> wrote:
> On 2014/10/29 17:15, Jan Beulich wrote:
>>>>> On 29.10.14 at 08:43, <tiejun.chen@intel.com> wrote:
>>> In VT-D specification, I just see,
>>>
>>> "The RMRR regions are expected to be used for legacy usages (such as
>>> USB, UMA Graphics, etc.) requiring reserved memory. Platform designers
>>> should avoid or limit use of reserved memory regions since these require
>>> system software to create holes in the DMA virtual address range
>>> available to system software and its driver."
>>
>> Nice that you quote it, but did you also read it properly? There's this
>> little "etc" following the explicit naming of USB and UMA...
> 
> Yes. But this already clarifies that RMRRs are "used for legacy usages" and
> says to "avoid or limit use of reserved memory regions", so RMRRs should
> eventually go away. So I mean it may be acceptable to assume something
> based on what we know.

No. Making assumptions based on observed broken behavior is okay (to
work around it), but assuming that correct behavior is absent just
because it has not (yet) been observed should never be done.

>>>> In the tool stack, don't even populate these holes with RAM. This
>>>> will then lead to RAM getting populated further up at the upper end.
>>>
>>> Shouldn't RAM still be populated with guest_physmap_add_entry()? If yes,
>>> we are already there to mark them as p2m_access_n.
>>
>> Marking them with p2m_access_n is not the same as not populating
>> the regions in the first place. Again - hiding multiple megabytes of
>> memory (and who knows if it can't grow into the gigabyte range) is
>> just not acceptable. Even for just a few pages I wouldn't be really
> 
> I don't think so. If you're considering a VM, this case should be the same
> as on native hardware. And in the native case, all RMRR ranges are
> marked as reserved in the e820 table.

That would only be a valid comparison if all devices associated with
RMRRs would also be passed to that particular VM. But since you're
doing the E820 adjustment to all VMs (no matter whether they would
ever get _any_ device passed through to them) this is not comparing
like entities.

Thinking about this some more, this odd universal hole punching in
the E820 is very likely to end up causing problems. Hence I think
this really should be optional behavior, with pass through of devices
associated with RMRRs failing if not used. (This ought to include
punching holes for _just_ the devices passed through to a guest
upon creation when the option is not enabled.)

Jan

^ permalink raw reply	[flat|nested] 180+ messages in thread
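
As an illustration of what "punching a hole" into a guest RAM range means
in the discussion above, the sketch below removes one reserved range from
one RAM range and returns the surviving pieces. It is not code from the
series: struct mem_range and punch_hole() are hypothetical, ranges are
expressed with exclusive end addresses, and relocating the displaced RAM
to the upper end is left out.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical address range with an exclusive end. */
struct mem_range {
    uint64_t start;
    uint64_t end;
};

/*
 * Remove hole from ram and store the surviving pieces in out[0..1].
 * Returns the number of pieces written (0, 1 or 2).
 */
static unsigned int punch_hole(struct mem_range ram, struct mem_range hole,
                               struct mem_range out[2])
{
    unsigned int n = 0;

    assert(ram.start <= ram.end && hole.start <= hole.end);

    /* Empty hole or no overlap: the RAM range survives untouched. */
    if ( hole.start >= hole.end ||
         hole.end <= ram.start || hole.start >= ram.end )
    {
        if ( ram.start < ram.end )
            out[n++] = ram;
        return n;
    }

    /* Piece below the hole, if any. */
    if ( ram.start < hole.start )
    {
        out[n].start = ram.start;
        out[n].end = hole.start;
        n++;
    }

    /* Piece above the hole, if any. */
    if ( hole.end < ram.end )
    {
        out[n].start = hole.end;
        out[n].end = ram.end;
        n++;
    }

    return n;
}
```

Making the behavior optional, as suggested above, would amount to calling
such a routine only for the reserved ranges selected by the configuration
instead of for every range the host reports.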

* Re: [v7][RFC][PATCH 08/13] xen/x86/p2m: set p2m_access_n for reserved device memory mapping
  2014-10-30  7:39                 ` Chen, Tiejun
@ 2014-10-30  9:24                   ` Jan Beulich
  2014-10-31  2:50                     ` Chen, Tiejun
  0 siblings, 1 reply; 180+ messages in thread
From: Jan Beulich @ 2014-10-30  9:24 UTC (permalink / raw)
  To: Tiejun Chen; +Cc: yang.z.zhang, kevin.tian, tim, xen-devel

>>> On 30.10.14 at 08:39, <tiejun.chen@intel.com> wrote:
> On 2014/10/29 17:20, Jan Beulich wrote:
>> Getting closer. Just set a to p2m->default_access before the if(),
>> and overwrite it when rc == 1 inside the if(). And properly handle
>> the error case (just logging a message - which btw lacks a proper
>> XENLOG_G_* prefix - doesn't seem enough to me).
> 
> Please check the following:
> 
> @@ -686,8 +686,22 @@ guest_physmap_add_entry(struct domain *d, unsigned long gfn,
>       /* Now, actually do the two-way mapping */
>       if ( mfn_valid(_mfn(mfn)) )
>       {
> -        rc = p2m_set_entry(p2m, gfn, _mfn(mfn), page_order, t,
> -                           p2m->default_access);
> +        rc = 0;
> +        a =  p2m->default_access;
> +        if ( !is_hardware_domain(d) )
> +        {
> +            rc = iommu_get_reserved_device_memory(p2m_check_reserved_device_memory,
> +                                                  &gfn);
> +            /* We need to set reserved device memory as p2m_access_n. */
> +            if ( rc == 1 )
> +                a = p2m_access_n;
> +            else if ( rc < 0 )
> +                printk(XENLOG_WARNING
> +                       "Domain %d can't check reserved device memory.\n",
> +                       d->domain_id);
> +        }
> +
> +        rc = p2m_set_entry(p2m, gfn, _mfn(mfn), page_order, t, a);
>           if ( rc )
>               goto out; /* Failed to update p2m, bail without updating m2p. */

The handling of "a" looks good now, but the error handling and
logging is still as broken as it was before.

>> But then again this code may change altogether if you avoid
>> populating the reserved regions in the first place.
> 
> Are you saying this scenario?
> 
> #1 Here we first set these ranges as p2m_access_n
> #2 We reset them as 1:1 RMRR mapping with p2m_access_rw somewhere
> #3 Someone may try to populate these ranges again

No. I pointed at the fact that if you avoid populating the holes,
there's no need to force them to p2m_access_n. Any attempts
to map other than the 1:1 thing there could then simply be
rejected.

Jan

^ permalink raw reply	[flat|nested] 180+ messages in thread
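
The alternative described in the message above (leave the reserved ranges
unpopulated and reject anything other than a 1:1 mapping there) could be
expressed as a small policy check. The following is a sketch under
assumptions: struct rdm_range and rdm_check_mapping() are hypothetical
names, the reserved ranges come from a plain array, and -EPERM is an
arbitrary choice of error code. Unlike a logged warning, the error is
returned so that the caller can fail the mapping request.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one reserved device memory (RMRR) range. */
struct rdm_range {
    uint64_t start_pfn;
    uint64_t nr_pages;
};

/*
 * Decide whether mapping gfn -> mfn is acceptable.
 * Outside reserved ranges any mapping is allowed.  Inside a reserved
 * range only the identity (1:1) mapping is allowed; everything else is
 * rejected with an error the caller is expected to propagate.
 */
static int rdm_check_mapping(const struct rdm_range *rdm, size_t count,
                             uint64_t gfn, uint64_t mfn)
{
    size_t i;

    assert(rdm != NULL || count == 0);

    for ( i = 0; i < count; i++ )
    {
        if ( gfn >= rdm[i].start_pfn &&
             gfn - rdm[i].start_pfn < rdm[i].nr_pages )
            return gfn == mfn ? 0 : -EPERM;
    }

    return 0;
}
```

With such a check in the populate path there would be no need to mark
the ranges with a special access type.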

* Re: [v7][RFC][PATCH 01/13] xen: RMRR fix
  2014-10-24  7:34 [v7][RFC][PATCH 01/13] xen: RMRR fix Tiejun Chen
                   ` (13 preceding siblings ...)
  2014-10-24 10:52 ` [v7][RFC][PATCH 01/13] xen: RMRR fix Jan Beulich
@ 2014-10-30 22:15 ` Tim Deegan
  2014-10-31  2:53   ` Chen, Tiejun
  14 siblings, 1 reply; 180+ messages in thread
From: Tim Deegan @ 2014-10-30 22:15 UTC (permalink / raw)
  To: Tiejun Chen; +Cc: yang.z.zhang, kevin.tian, xen-devel, JBeulich

Hi,

At 15:34 +0800 on 24 Oct (1414161264), Tiejun Chen wrote:
> This series of patches try to reconcile those remaining problems but
> just post as RFC to ask for any comments to refine everything.
> 
> The current whole scheme is as follows:
> 
> 1. Reconcile guest mmio with RMRR in pci_setup
> 2. Reconcile guest RAM with RMRR in e820 table
> 
> Then in theory guest wouldn't access any RMRR range.
> 
> 3. Just initialize all RMRR ranges as p2m_access_n in p2m table:
>     gfn:mfn:p2m_access_n

Please don't use p2m_access for this.  It will conflict with other
users of that interface.  Just clear the entries instead (i.e. set to
p2m_type_invalid).  Then you don't need to do this either:

> 5. Before we take real device assignment, any access to RMRR may issue
> ept_handle_violation because of p2m_access_n. Then we just call
> update_guest_eip() to return.

because the RMRR area will be handled like any other un-backed address.

Cheers,

Tim.

^ permalink raw reply	[flat|nested] 180+ messages in thread

* Re: [v7][RFC][PATCH 01/13] introduce XENMEM_reserved_device_memory_map
  2014-10-30  9:10               ` Jan Beulich
@ 2014-10-31  1:03                 ` Chen, Tiejun
  0 siblings, 0 replies; 180+ messages in thread
From: Chen, Tiejun @ 2014-10-31  1:03 UTC (permalink / raw)
  To: Jan Beulich; +Cc: kevin.tian, Julien Grall, tim, xen-devel, yang.z.zhang

On 2014/10/30 17:10, Jan Beulich wrote:
>>>> On 30.10.14 at 03:53, <tiejun.chen@intel.com> wrote:
>> So could you send me the latest? I'd like to replace it in my tree.
>
> Here you go.

Thanks a lot.

Tiejun

^ permalink raw reply	[flat|nested] 180+ messages in thread

* Re: [v7][RFC][PATCH 04/13] hvmloader/util: get reserved device memory maps
  2014-10-30  9:13                   ` Jan Beulich
@ 2014-10-31  2:20                     ` Chen, Tiejun
  2014-10-31  8:14                       ` Jan Beulich
  0 siblings, 1 reply; 180+ messages in thread
From: Chen, Tiejun @ 2014-10-31  2:20 UTC (permalink / raw)
  To: Jan Beulich
  Cc: kevin.tian, wei.liu2, ian.campbell, stefano.stabellini, tim,
	ian.jackson, xen-devel, yang.z.zhang

On 2014/10/30 17:13, Jan Beulich wrote:
>>>> On 30.10.14 at 06:55, <tiejun.chen@intel.com> wrote:
>> On 2014/10/29 17:05, Jan Beulich wrote:
>>>>>> On 29.10.14 at 07:54, <tiejun.chen@intel.com> wrote:
>>>> On 2014/10/28 17:48, Jan Beulich wrote:
>>>>>>>> On 28.10.14 at 06:21, <tiejun.chen@intel.com> wrote:
>>>>>> On 2014/10/27 17:45, Jan Beulich wrote:
>>>>>>>>>> On 27.10.14 at 04:12, <tiejun.chen@intel.com> wrote:
>>>>>>>> On 2014/10/24 22:22, Jan Beulich wrote:
>>>>>>>>>>>> On 24.10.14 at 09:34, <tiejun.chen@intel.com> wrote:
>>>>>>>>>> --- a/tools/firmware/hvmloader/util.h
>>>>>>>>>> +++ b/tools/firmware/hvmloader/util.h
>>>>>>>>>> @@ -241,6 +241,12 @@ int build_e820_table(struct e820entry *e820,
>>>>>>>>>>                            unsigned int bios_image_base);
>>>>>>>>>>       void dump_e820_table(struct e820entry *e820, unsigned int nr);
>>>>>>>>>>
>>>>>>>>>> +#include <xen/memory.h>
>>>>>>>>>> +#define ENOBUFS     105 /* No buffer space available */
>>>>>>>>>
>>>>>>>>> This is a joke I hope? The #include belongs at the top (albeit afaict
>>>>>>>>> you don't really need it here), and the #define is completely
>>>>>>>>
>>>>>>>> If without this line, #include <xen/memory.h>,
>>>>>>>>
>>>>>>>> In file included from build.c:25:0:
>>>>>>>> ../util.h:246:70: error: array type has incomplete element type
>>>>>>>>       int get_reserved_device_memory_map(struct xen_reserved_device_memory
>>>>>>>> entries[],
>>>>>>>>                                                                            ^
>>>>>>>> make[8]: *** [build.o] Error 1
>>>>>>>
>>>>>>> So just forward declare the structure ahead of the function
>>>>>>> declaration.
>>>>>>
>>>>>> tools/firmware/hvmloader/pci.c:28:#include <xen/memory.h>
>>>>>> tools/firmware/hvmloader/ovmf.c:36:#include <xen/memory.h>
>>>>>>
>>>>>> So any reason I can't do such a same thing?
>>>>>
>>>>> You can, but it's undesirable. You're wanting this in a header, i.e.
>>>>> you'll make everyone consuming that header also implicitly depend
>>>>> on the new header you would include. We shouldn't pointlessly
>>>>> add build dependencies (and we should really try to reduce them
>>>>> where possible).
>>>>
>>>> Looks I can remove those stuff from util.h and just add 'extern' to them
>>>> when we really need them.
>>>
>>> Please stop thinking this way. Declarations for things defined in .c
>>> files are to be present in headers, and the defining .c file has to
>>> include that header (making sure declaration and definition are and
>>> remain in sync). I hate having to again repeat my remark that you
>>> shouldn't forget it's not application code that you're modifying.
>>> Robust and maintainable code are a requirement in the hypervisor
>>> (and, as said it being an extension of it, hvmloader). Which - just
>>> to avoid any misunderstanding - isn't to say that this shouldn't also
>>> apply to application code. It's just that in the hypervisor and kernel
>>> (and certain other code system components) the consequences of
>>> being lax are much more severe.
>>
>> Okay. But currently, the pci.c file already include 'util.h' and
>> '<xen/memory.h>,
>>
>> #include "util.h"
>> ...
>> #include <xen/memory.h>
>>
>> We can't redefine struct xen_reserved_device_memory in util.h.
>
> Redefine? I said forward declare.

It seems we just need to declare hvm_get_reserved_device_memory_map() in
the header file, tools/firmware/hvmloader/util.h,

unsigned int hvm_get_reserved_device_memory_map(void);

since this is enough for pci_setup() and construct_rdm_e820_maps().

tools/firmware/hvmloader/util.c:

struct xen_reserved_device_memory *rdm_map;
static int
get_reserved_device_memory_map(struct xen_reserved_device_memory entries[],
                               uint32_t *max_entries)
{
	...
}
unsigned int hvm_get_reserved_device_memory_map(void)
{
	...
}

>
>> So all tools maintainers,
>>
>> Could you give us such a comment?
>>
>> 'ln -sf $(XEN_ROOT)/xen/include/xen/errno.h xen' versus
>> 'ln -sf $(XEN_ROOT)/xen/include/xen/errno.h
>> $(XEN_ROOT)/tools/firmware/hvmloader/,
>
> I think if it is (as suggested) to be limited to hvmloader, the
> symlinking also should be done in its Makefile rather than along
> with other tools stuff.

You're right.

I just sent out a separate patch to ask for a review from the tools side,
and CCed you.

Thanks
Tiejun

^ permalink raw reply	[flat|nested] 180+ messages in thread

* Re: [v7][RFC][PATCH 08/13] xen/x86/p2m: set p2m_access_n for reserved device memory mapping
  2014-10-30  9:24                   ` Jan Beulich
@ 2014-10-31  2:50                     ` Chen, Tiejun
  2014-10-31  8:25                       ` Jan Beulich
  0 siblings, 1 reply; 180+ messages in thread
From: Chen, Tiejun @ 2014-10-31  2:50 UTC (permalink / raw)
  To: Jan Beulich; +Cc: yang.z.zhang, kevin.tian, tim, xen-devel

On 2014/10/30 17:24, Jan Beulich wrote:
>>>> On 30.10.14 at 08:39, <tiejun.chen@intel.com> wrote:
>> On 2014/10/29 17:20, Jan Beulich wrote:
>>> Getting closer. Just set a to p2m->default_access before the if(),
>>> and overwrite it when rc == 1 inside the if(). And properly handle
>>> the error case (just logging a message - which btw lacks a proper
>>> XENLOG_G_* prefix - doesn't seem enough to me).
>>
>> Please check the following:
>>
>> @@ -686,8 +686,22 @@ guest_physmap_add_entry(struct domain *d, unsigned long gfn,
>>        /* Now, actually do the two-way mapping */
>>        if ( mfn_valid(_mfn(mfn)) )
>>        {
>> -        rc = p2m_set_entry(p2m, gfn, _mfn(mfn), page_order, t,
>> -                           p2m->default_access);
>> +        rc = 0;
>> +        a =  p2m->default_access;
>> +        if ( !is_hardware_domain(d) )
>> +        {
>> +            rc = iommu_get_reserved_device_memory(p2m_check_reserved_device_memory,
>> +                                                  &gfn);
>> +            /* We need to set reserved device memory as p2m_access_n. */
>> +            if ( rc == 1 )
>> +                a = p2m_access_n;
>> +            else if ( rc < 0 )
>> +                printk(XENLOG_WARNING
>> +                       "Domain %d can't check reserved device memory.\n",
>> +                       d->domain_id);
>> +        }
>> +
>> +        rc = p2m_set_entry(p2m, gfn, _mfn(mfn), page_order, t, a);
>>            if ( rc )
>>                goto out; /* Failed to update p2m, bail without updating m2p. */
>
> The handling of "a" looks good now, but the error handling and
> logging is still as broken as it was before.

Do you mean I'm missing some necessary information, such as gfn and mfn,
so that the message should show the domain id, gfn and mfn?

Sorry, I don't fully understand what you expect.

>
>>> But then again this code may change altogether if you avoid
>>> populating the reserved regions in the first place.
>>
>> Are you saying this scenario?
>>
>> #1 Here we first set these ranges as p2m_access_n
>> #2 We reset them as 1:1 RMRR mapping with p2m_access_rw somewhere
>> #3 Someone may try to populate these ranges again
>
> No. I pointed at the fact that if you avoid populating the holes,
> there's no need to force them to p2m_access_n. Any attempts
> to map other than the 1:1 thing there could then simply be
> rejected.

I think any population should be rejected entirely, because a 1:1 mapping
means the guest can access these RMRR ranges even when no device with an
RMRR is assigned, right? Any access to these ranges could corrupt the real
device's usage.

Thanks
Tiejun

^ permalink raw reply	[flat|nested] 180+ messages in thread

* Re: [v7][RFC][PATCH 01/13] xen: RMRR fix
  2014-10-30 22:15 ` Tim Deegan
@ 2014-10-31  2:53   ` Chen, Tiejun
  2014-10-31  9:10     ` Tim Deegan
  0 siblings, 1 reply; 180+ messages in thread
From: Chen, Tiejun @ 2014-10-31  2:53 UTC (permalink / raw)
  To: Tim Deegan; +Cc: yang.z.zhang, kevin.tian, xen-devel, JBeulich

On 2014/10/31 6:15, Tim Deegan wrote:
> Hi,
>
> At 15:34 +0800 on 24 Oct (1414161264), Tiejun Chen wrote:
>> This series of patches try to reconcile those remaining problems but
>> just post as RFC to ask for any comments to refine everything.
>>
>> The current whole scheme is as follows:
>>
>> 1. Reconcile guest mmio with RMRR in pci_setup
>> 2. Reconcile guest RAM with RMRR in e820 table
>>
>> Then in theory guest wouldn't access any RMRR range.
>>
>> 3. Just initialize all RMRR ranges as p2m_access_n in p2m table:
>>      gfn:mfn:p2m_access_n
>
> Please don't use p2m_access for this.  It will conflict with other
> users of that interface.  Just clear the entries instead (i.e. set to
> p2m_type_invalid).  Then you don't need to do this either:

IMO all p2m tables are initialized as invalid.

Furthermore, I guess you may also agree with what Jan and I discussed.
There Jan thinks we should just reject any attempt to populate these
ranges.

Thanks
Tiejun

>
>> 5. Before we take real device assignment, any access to RMRR may issue
>> ept_handle_violation because of p2m_access_n. Then we just call
>> update_guest_eip() to return.
>
> because the RMRR area will be handled like any other un-backed address.
>
> Cheers,
>
> Tim.
>

* Re: [v7][RFC][PATCH 01/13] xen: RMRR fix
  2014-10-30  9:07                   ` Jan Beulich
@ 2014-10-31  3:11                     ` Chen, Tiejun
  0 siblings, 0 replies; 180+ messages in thread
From: Chen, Tiejun @ 2014-10-31  3:11 UTC (permalink / raw)
  To: Jan Beulich; +Cc: yang.z.zhang, kevin.tian, tim, xen-devel

On 2014/10/30 17:07, Jan Beulich wrote:
>>>> On 30.10.14 at 09:21, <tiejun.chen@intel.com> wrote:
>> I think we just guarantee no one set mem_access for those ranges, but
>> its fine to get mem_access:
>
> Seems reasonable to me, but you'll need to get the respective
> maintainers to agree. And of course this needs to be cleaned up;

If something is obviously wrong, please point it out. If possible, I'd
like to narrow down these problems before I send the next revision :)

> personally I also dislike the excessive logging of messages that
> you add here and elsewhere.

Yeah, I noticed you said this, but as I said, it is still not clear to
me.

Thanks
Tiejun

>
> Jan
>
>
>

* Re: [v7][RFC][PATCH 06/13] hvmloader/ram: check if guest memory is out of reserved device memory maps
  2014-10-30  9:20                   ` Jan Beulich
@ 2014-10-31  5:41                     ` Chen, Tiejun
  2014-10-31  6:21                       ` Tian, Kevin
  0 siblings, 1 reply; 180+ messages in thread
From: Chen, Tiejun @ 2014-10-31  5:41 UTC (permalink / raw)
  To: Jan Beulich; +Cc: yang.z.zhang, kevin.tian, tim, xen-devel

On 2014/10/30 17:20, Jan Beulich wrote:
>>>> On 30.10.14 at 04:11, <tiejun.chen@intel.com> wrote:
>> On 2014/10/29 17:15, Jan Beulich wrote:
>>>>>> On 29.10.14 at 08:43, <tiejun.chen@intel.com> wrote:
>>>> In VT-D specification, I just see,
>>>>
>>>> "The RMRR regions are expected to be used for legacy usages (such as
>>>> USB, UMA Graphics, etc.) requiring reserved memory. Platform designers
>>>> should avoid or limit use of reserved memory regions since these require
>>>> system software to create holes in the DMA virtual address range
>>>> available to system software and its driver."
>>>
>>> Nice that you quote it, but did you also read it properly? There's this
>>> little "etc" following the explicit naming of USB and UMA...
>>
>> Yes. But this already clarify RMRR "used for legacy usage" and "avoid or
>> limit use of reserved memory regions", so RMRR would be gone finally. So
>> I mean it may be acceptable to assume something based our known info.
>
> No. Making assumption on observed broken behavior is okay (to work
> around it), but making assumptions for not (yet) observed correct
> behavior to be absent should never be done.
>
>>>>> In the tool stack, don't even populate these holes with RAM. This
>>>>> will then lead to RAM getting populated further up at the upper end.
>>>>
>>>> Shouldn't populate RAM still with guest_physmap_add_entry()? If yes, we
>>>> already be there to mark them as p2m_access_n.
>>>
>>> Marking them with p2m_access_n is not the same as not populating
>>> the regions in the first place. Again - hiding multiple megabytes of
>>> memory (and who knows if it can't grow into the gigabyte range) is
>>> just not acceptable. Even for just a few pages I wouldn't be really
>>
>> I don't think so. If you're considering a VM, this case should be same
>> under native circumstance. And in native case, all RMRR ranges are
>> marked as reserved in e820 table.
>
> That would only be a valid comparison if all devices associated with
> RMRRs would also be passed to that particular VM. But since you're
> doing the E820 adjustment to all VMs (no matter whether they would
> ever get _any_ device passed through to them) this is not comparing
> like entities.
>
> Thinking about this some more, this odd universal hole punching in
> the E820 is very likely to end up causing problems. Hence I think
> this really should be optional behavior, with pass through of devices
> associated with RMRRs failing if not used. (This ought to include
> punching holes for _just_ the devices passed through to a guest
> upon creation when the option is not enabled.)

Yeah, we had a similar internal discussion about adding a parameter to
force reserving RMRRs. In that case we can't create a VM if these ranges
conflict with anything. So what about this idea?

If that's acceptable, could you give us some rules we should follow?

Thanks
Tiejun

* Re: [v7][RFC][PATCH 06/13] hvmloader/ram: check if guest memory is out of reserved device memory maps
  2014-10-31  5:41                     ` Chen, Tiejun
@ 2014-10-31  6:21                       ` Tian, Kevin
  2014-10-31  7:02                         ` Chen, Tiejun
  2014-10-31  8:20                         ` Jan Beulich
  0 siblings, 2 replies; 180+ messages in thread
From: Tian, Kevin @ 2014-10-31  6:21 UTC (permalink / raw)
  To: Chen, Tiejun, Jan Beulich; +Cc: Zhang, Yang Z, tim, xen-devel

> From: Chen, Tiejun
> Sent: Friday, October 31, 2014 1:41 PM
> 
> On 2014/10/30 17:20, Jan Beulich wrote:
> >>>> On 30.10.14 at 04:11, <tiejun.chen@intel.com> wrote:
> >> On 2014/10/29 17:15, Jan Beulich wrote:
> >>>>>> On 29.10.14 at 08:43, <tiejun.chen@intel.com> wrote:
> >>>> In VT-D specification, I just see,
> >>>>
> >>>> "The RMRR regions are expected to be used for legacy usages (such as
> >>>> USB, UMA Graphics, etc.) requiring reserved memory. Platform
> designers
> >>>> should avoid or limit use of reserved memory regions since these require
> >>>> system software to create holes in the DMA virtual address range
> >>>> available to system software and its driver."
> >>>
> >>> Nice that you quote it, but did you also read it properly? There's this
> >>> little "etc" following the explicit naming of USB and UMA...
> >>
> >> Yes. But this already clarify RMRR "used for legacy usage" and "avoid or
> >> limit use of reserved memory regions", so RMRR would be gone finally. So
> >> I mean it may be acceptable to assume something based our known info.
> >
> > No. Making assumption on observed broken behavior is okay (to work
> > around it), but making assumptions for not (yet) observed correct
> > behavior to be absent should never be done.
> >
> >>>>> In the tool stack, don't even populate these holes with RAM. This
> >>>>> will then lead to RAM getting populated further up at the upper end.
> >>>>
> >>>> Shouldn't populate RAM still with guest_physmap_add_entry()? If yes, we
> >>>> already be there to mark them as p2m_access_n.
> >>>
> >>> Marking them with p2m_access_n is not the same as not populating
> >>> the regions in the first place. Again - hiding multiple megabytes of
> >>> memory (and who knows if it can't grow into the gigabyte range) is
> >>> just not acceptable. Even for just a few pages I wouldn't be really
> >>
> >> I don't think so. If you're considering a VM, this case should be same
> >> under native circumstance. And in native case, all RMRR ranges are
> >> marked as reserved in e820 table.
> >
> > That would only be a valid comparison if all devices associated with
> > RMRRs would also be passed to that particular VM. But since you're
> > doing the E820 adjustment to all VMs (no matter whether they would
> > ever get _any_ device passed through to them) this is not comparing
> > like entities.
> >
> > Thinking about this some more, this odd universal hole punching in
> > the E820 is very likely to end up causing problems. Hence I think
> > this really should be optional behavior, with pass through of devices
> > associated with RMRRs failing if not used. (This ought to include
> > punching holes for _just_ the devices passed through to a guest
> > upon creation when the option is not enabled.)
> 
> Yeah, we had a similar discussion internal to add a parameter to force
> reserving RMRR. In this case we can't create a VM if these ranges
> conflict with anything. So what about this idea?
> 

Adding a new parameter (e.g. 'check-passthrough') looks like the right
approach. When the parameter is on, the RMRR check/hole punching is
activated at VM creation. Otherwise we just keep the existing behavior.

If the user configures device pass-through at creation time, this parameter
will be set by default. If the user wants the VM to be capable of device
hot-plug, an explicit parameter can be added in the config file to enforce
the RMRR check at creation time.

Thanks
Kevin

* Re: [v7][RFC][PATCH 06/13] hvmloader/ram: check if guest memory is out of reserved device memory maps
  2014-10-31  6:21                       ` Tian, Kevin
@ 2014-10-31  7:02                         ` Chen, Tiejun
  2014-10-31  8:20                         ` Jan Beulich
  1 sibling, 0 replies; 180+ messages in thread
From: Chen, Tiejun @ 2014-10-31  7:02 UTC (permalink / raw)
  To: Tian, Kevin, Jan Beulich; +Cc: Zhang, Yang Z, tim, xen-devel

On 2014/10/31 14:21, Tian, Kevin wrote:
>> From: Chen, Tiejun
>> Sent: Friday, October 31, 2014 1:41 PM
>>
>> On 2014/10/30 17:20, Jan Beulich wrote:
>>>>>> On 30.10.14 at 04:11, <tiejun.chen@intel.com> wrote:
>>>> On 2014/10/29 17:15, Jan Beulich wrote:
>>>>>>>> On 29.10.14 at 08:43, <tiejun.chen@intel.com> wrote:
>>>>>> In VT-D specification, I just see,
>>>>>>
>>>>>> "The RMRR regions are expected to be used for legacy usages (such as
>>>>>> USB, UMA Graphics, etc.) requiring reserved memory. Platform
>> designers
>>>>>> should avoid or limit use of reserved memory regions since these require
>>>>>> system software to create holes in the DMA virtual address range
>>>>>> available to system software and its driver."
>>>>>
>>>>> Nice that you quote it, but did you also read it properly? There's this
>>>>> little "etc" following the explicit naming of USB and UMA...
>>>>
>>>> Yes. But this already clarify RMRR "used for legacy usage" and "avoid or
>>>> limit use of reserved memory regions", so RMRR would be gone finally. So
>>>> I mean it may be acceptable to assume something based our known info.
>>>
>>> No. Making assumption on observed broken behavior is okay (to work
>>> around it), but making assumptions for not (yet) observed correct
>>> behavior to be absent should never be done.
>>>
>>>>>>> In the tool stack, don't even populate these holes with RAM. This
>>>>>>> will then lead to RAM getting populated further up at the upper end.
>>>>>>
>>>>>> Shouldn't populate RAM still with guest_physmap_add_entry()? If yes, we
>>>>>> already be there to mark them as p2m_access_n.
>>>>>
>>>>> Marking them with p2m_access_n is not the same as not populating
>>>>> the regions in the first place. Again - hiding multiple megabytes of
>>>>> memory (and who knows if it can't grow into the gigabyte range) is
>>>>> just not acceptable. Even for just a few pages I wouldn't be really
>>>>
>>>> I don't think so. If you're considering a VM, this case should be same
>>>> under native circumstance. And in native case, all RMRR ranges are
>>>> marked as reserved in e820 table.
>>>
>>> That would only be a valid comparison if all devices associated with
>>> RMRRs would also be passed to that particular VM. But since you're
>>> doing the E820 adjustment to all VMs (no matter whether they would
>>> ever get _any_ device passed through to them) this is not comparing
>>> like entities.
>>>
>>> Thinking about this some more, this odd universal hole punching in
>>> the E820 is very likely to end up causing problems. Hence I think
>>> this really should be optional behavior, with pass through of devices
>>> associated with RMRRs failing if not used. (This ought to include
>>> punching holes for _just_ the devices passed through to a guest
>>> upon creation when the option is not enabled.)
>>
>> Yeah, we had a similar discussion internal to add a parameter to force
>> reserving RMRR. In this case we can't create a VM if these ranges
>> conflict with anything. So what about this idea?
>>
>
> Adding a new parameter (e.g. 'check-passthrough') looks the right
> approach. When the parameter is on, RMRR check/hole punch is

Could we instead add this as a sub-option of the 'pci' parameter, like
pci = ["00:02.0@passthrough"] or something similar? I think that would
make sense.

> activated at VM creation. Otherwise we just keep existing behavior.

Yes, VM creation should fail on any overlap. Without this option, only
the device assignment should fail in case of a conflict, and the VM can
still proceed.
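
The overlap test that decides between these two outcomes can be sketched as
follows. This is only an illustration under assumptions: `toy_range`,
`toy_ranges_overlap()` and `toy_first_conflict()` are hypothetical names,
ranges are taken to be half-open [start, end), and nothing here is existing
Xen or hvmloader code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical half-open address range [start, end). */
struct toy_range {
    uint64_t start;
    uint64_t end;
};

/* Two half-open ranges overlap iff each one starts before the other ends. */
static bool toy_ranges_overlap(struct toy_range a, struct toy_range b)
{
    return a.start < b.end && b.start < a.end;
}

/*
 * Return the index of the first reserved range that overlaps 'guest',
 * or -1 if there is no conflict.
 */
static int toy_first_conflict(struct toy_range guest,
                              const struct toy_range *rmrr, size_t nr)
{
    assert(rmrr != NULL || nr == 0);

    for (size_t i = 0; i < nr; i++)
        if (toy_ranges_overlap(guest, rmrr[i]))
            return (int)i;

    return -1;
}
```

With the option set, a non-negative result would abort VM creation; without
it, only the assignment of the affected device would be refused.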

>
> If user configures device pass-through at creation time, this parameter
> will be set by default. If user wants the VM capable of device hot-plug,
> an explicit parameter can be added in the config file to enforce RMRR
> check at creation time.

When we find that RMRRs exist while parsing the ACPI tables, we should
print a message recommending this option.

Thanks
Tiejun

>
> Thanks
> Kevin
>
>

* Re: [v7][RFC][PATCH 04/13] hvmloader/util: get reserved device memory maps
  2014-10-31  2:20                     ` Chen, Tiejun
@ 2014-10-31  8:14                       ` Jan Beulich
  2014-11-03  2:22                         ` Chen, Tiejun
  0 siblings, 1 reply; 180+ messages in thread
From: Jan Beulich @ 2014-10-31  8:14 UTC (permalink / raw)
  To: Tiejun Chen
  Cc: kevin.tian, wei.liu2, ian.campbell, stefano.stabellini, tim,
	ian.jackson, xen-devel, yang.z.zhang

>>> On 31.10.14 at 03:20, <tiejun.chen@intel.com> wrote:
> On 2014/10/30 17:13, Jan Beulich wrote:
>>>>> On 30.10.14 at 06:55, <tiejun.chen@intel.com> wrote:
>>> On 2014/10/29 17:05, Jan Beulich wrote:
>>>>>>> On 29.10.14 at 07:54, <tiejun.chen@intel.com> wrote:
>>>>> Looks I can remove those stuff from util.h and just add 'extern' to them
>>>>> when we really need them.
>>>>
>>>> Please stop thinking this way. Declarations for things defined in .c
>>>> files are to be present in headers, and the defining .c file has to
>>>> include that header (making sure declaration and definition are and
>>>> remain in sync). I hate having to again repeat my remark that you
>>>> shouldn't forget it's not application code that you're modifying.
>>>> Robust and maintainable code are a requirement in the hypervisor
>>>> (and, as said it being an extension of it, hvmloader). Which - just
>>>> to avoid any misunderstanding - isn't to say that this shouldn't also
>>>> apply to application code. It's just that in the hypervisor and kernel
>>>> (and certain other code system components) the consequences of
>>>> being lax are much more severe.
>>>
>>> Okay. But currently, the pci.c file already include 'util.h' and
>>> '<xen/memory.h>,
>>>
>>> #include "util.h"
>>> ...
>>> #include <xen/memory.h>
>>>
>>> We can't redefine struct xen_reserved_device_memory in util.h.
>>
>> Redefine? I said forward declare.
> 
> Seems we just need to declare hvm_get_reserved_device_memory_map() in 
> the head file, tools/firmware/hvmloader/util.h,
> 
> unsigned int hvm_get_reserved_device_memory_map(void);

To me this looks very much like poor programming style, even if in
the context of hvmloader communicating information via global
variables rather than function arguments and return values is
generally possible.

Jan

* Re: [v7][RFC][PATCH 06/13] hvmloader/ram: check if guest memory is out of reserved device memory maps
  2014-10-31  6:21                       ` Tian, Kevin
  2014-10-31  7:02                         ` Chen, Tiejun
@ 2014-10-31  8:20                         ` Jan Beulich
  2014-11-03  5:49                           ` Chen, Tiejun
  1 sibling, 1 reply; 180+ messages in thread
From: Jan Beulich @ 2014-10-31  8:20 UTC (permalink / raw)
  To: Kevin Tian, Tiejun Chen; +Cc: Yang Z Zhang, tim, xen-devel

>>> On 31.10.14 at 07:21, <kevin.tian@intel.com> wrote:
>>  From: Chen, Tiejun
>> Sent: Friday, October 31, 2014 1:41 PM
>> On 2014/10/30 17:20, Jan Beulich wrote:
>> > Thinking about this some more, this odd universal hole punching in
>> > the E820 is very likely to end up causing problems. Hence I think
>> > this really should be optional behavior, with pass through of devices
>> > associated with RMRRs failing if not used. (This ought to include
>> > punching holes for _just_ the devices passed through to a guest
>> > upon creation when the option is not enabled.)
>> 
>> Yeah, we had a similar discussion internal to add a parameter to force
>> reserving RMRR. In this case we can't create a VM if these ranges
>> conflict with anything. So what about this idea?
>> 
> 
> Adding a new parameter (e.g. 'check-passthrough') looks the right 
> approach. When the parameter is on, RMRR check/hole punch is
> activated at VM creation. Otherwise we just keep existing behavior. 
> 
> If user configures device pass-through at creation time, this parameter
> will be set by default. If user wants the VM capable of device hot-plug,
> an explicit parameter can be added in the config file to enforce RMRR
> check at creation time.

Not exactly, I specifically described it slightly differently above. When
devices get passed through and the option is absent, holes should be
punched only for the RMRRs associated with those devices (i.e.
ideally none). Of course this means we'll need a way to associate
RMRRs with devices in the tool stack and hvmloader, i.e. the current
XENMEM_reserved_device_memory_map alone won't suffice.
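
To make the distinction concrete, here is a minimal sketch of the selection
described above, assuming each reserved range can be tagged with the BDF of
its device. That tagging is exactly what the current hypercall does not
provide, so `toy_rdm_entry`, `toy_bdf_assigned()` and `toy_select_holes()`
are hypothetical, as is the `force_all` flag standing in for the option.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record: one reserved range plus the device that owns it. */
struct toy_rdm_entry {
    uint64_t start_pfn;
    uint64_t nr_pages;
    uint16_t bdf;          /* bus/device/function of the owning device */
};

/* Is 'bdf' among the devices assigned to the guest? */
static bool toy_bdf_assigned(uint16_t bdf,
                             const uint16_t *assigned, size_t nr_assigned)
{
    for (size_t i = 0; i < nr_assigned; i++)
        if (assigned[i] == bdf)
            return true;
    return false;
}

/*
 * Copy into 'out' the ranges that need a hole: all of them when
 * 'force_all' is set, otherwise only those owned by an assigned device.
 * Writes at most 'out_cap' entries and returns the number written.
 */
static size_t toy_select_holes(const struct toy_rdm_entry *all, size_t nr_all,
                               const uint16_t *assigned, size_t nr_assigned,
                               bool force_all,
                               struct toy_rdm_entry *out, size_t out_cap)
{
    size_t n = 0;

    assert(all != NULL || nr_all == 0);
    assert(out != NULL || out_cap == 0);

    for (size_t i = 0; i < nr_all && n < out_cap; i++)
        if (force_all || toy_bdf_assigned(all[i].bdf, assigned, nr_assigned))
            out[n++] = all[i];

    return n;
}
```

A guest with no assigned devices and the option absent gets zero holes,
which matches the "ideally none" case.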

Jan

* Re: [v7][RFC][PATCH 08/13] xen/x86/p2m: set p2m_access_n for reserved device memory mapping
  2014-10-31  2:50                     ` Chen, Tiejun
@ 2014-10-31  8:25                       ` Jan Beulich
  2014-11-03  6:20                         ` Chen, Tiejun
  0 siblings, 1 reply; 180+ messages in thread
From: Jan Beulich @ 2014-10-31  8:25 UTC (permalink / raw)
  To: Tiejun Chen; +Cc: yang.z.zhang, kevin.tian, tim, xen-devel

>>> On 31.10.14 at 03:50, <tiejun.chen@intel.com> wrote:
> On 2014/10/30 17:24, Jan Beulich wrote:
>>>>> On 30.10.14 at 08:39, <tiejun.chen@intel.com> wrote:
>>> On 2014/10/29 17:20, Jan Beulich wrote:
>>>> Getting closer. Just set a to p2m->default_access before the if(),
>>>> and overwrite it when rc == 1 inside the if(). And properly handle
>>>> the error case (just logging a message - which btw lacks a proper
>>>> XENLOG_G_* prefix - doesn't seem enough to me).
>>>
>>> Please check the follows:
>>>
>>> @@ -686,8 +686,22 @@ guest_physmap_add_entry(struct domain *d, unsigned long gfn,
>>>        /* Now, actually do the two-way mapping */
>>>        if ( mfn_valid(_mfn(mfn)) )
>>>        {
>>> -        rc = p2m_set_entry(p2m, gfn, _mfn(mfn), page_order, t,
>>> -                           p2m->default_access);
>>> +        rc = 0;
>>> +        a =  p2m->default_access;
>>> +        if ( !is_hardware_domain(d) )
>>> +        {
>>> +            rc = iommu_get_reserved_device_memory(p2m_check_reserved_device_memory,
>>> +                                                  &gfn);
>>> +            /* We need to set reserved device memory as p2m_access_n. */
>>> +            if ( rc == 1 )
>>> +                a = p2m_access_n;
>>> +            else if ( rc < 0 )
>>> +                printk(XENLOG_WARNING
>>> +                       "Domain %d can't check reserved device memory.\n",
>>> +                       d->domain_id);
>>> +        }
>>> +
>>> +        rc = p2m_set_entry(p2m, gfn, _mfn(mfn), page_order, t, a);
>>>            if ( rc )
>>>                goto out; /* Failed to update p2m, bail without updating m2p. */
>>
>> The handling of "a" looks good now, but the error handling and
>> logging is still as broken as it was before.
> 
> Do you mean I'm missing some necessary info? Like gfn and mfn, so domain 
> id, gfn and mfn can show enough message.
> 
> Sorry I'm poor to understand what you expect.

But I explained it already, and that explanation is still visible in
the quotes above. But to avoid any doubt, I'll repeat: "And
properly handle the error case (just logging a message - which
btw lacks a proper XENLOG_G_* prefix - doesn't seem enough
to me)."

>>>> But then again this code may change altogether if you avoid
>>>> populating the reserved regions in the first place.
>>>
>>> Are you saying this scenario?
>>>
>>> #1 Here we first set these ranges as p2m_access_n
>>> #2 We reset them as 1:1 RMRR mapping with p2m_access_rw somewhere
>>> #3 Someone may try to populate these ranges again
>>
>> No. I pointed at the fact that if you avoid populating the holes,
>> there's no need to force them to p2m_access_n. Any attempts
>> to map other than the 1:1 thing there could then simply be
>> rejected.
> 
> I think any population should be rejected totally, because 1:1 mapping 
> means guest can access these RMRR ranges in case of no any device 
> assignment with RMRR, right? Any access to these range corrupt the real 
> device usage.

Oh yes, of course I implied that the 1:1 mapping would be
permitted only for those ranges where the RMRR corresponds
to a device the guest got assigned.

Jan

* Re: [v7][RFC][PATCH 01/13] xen: RMRR fix
  2014-10-31  2:53   ` Chen, Tiejun
@ 2014-10-31  9:10     ` Tim Deegan
  0 siblings, 0 replies; 180+ messages in thread
From: Tim Deegan @ 2014-10-31  9:10 UTC (permalink / raw)
  To: Chen, Tiejun; +Cc: yang.z.zhang, kevin.tian, xen-devel, JBeulich

At 10:53 +0800 on 31 Oct (1414749238), Chen, Tiejun wrote:
> On 2014/10/31 6:15, Tim Deegan wrote:
> > Hi,
> >
> > At 15:34 +0800 on 24 Oct (1414161264), Tiejun Chen wrote:
> >> This series of patches try to reconcile those remaining problems but
> >> just post as RFC to ask for any comments to refine everything.
> >>
> >> The current whole scheme is as follows:
> >>
> >> 1. Reconcile guest mmio with RMRR in pci_setup
> >> 2. Reconcile guest RAM with RMRR in e820 table
> >>
> >> Then in theory guest wouldn't access any RMRR range.
> >>
> >> 3. Just initialize all RMRR ranges as p2m_access_n in p2m table:
> >>      gfn:mfn:p2m_access_n
> >
> > Please don't use p2m_access for this.  It will conflict with other
> > users of that interface.  Just clear the entries instead (i.e. set to
> > p2m_type_invalid).  Then you don't need to do this either:
> 
> IMO all p2m tables are initialized as invalid.

Yes, so you shouldn't need to explicitly clear them.  You just need to
check that they are already clear (or already 1:1, as we agreed
before).
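
A minimal sketch of that check, using a toy array in place of the real p2m.
The names `toy_p2m`, `TOY_INVALID` and `toy_set_identity()` are hypothetical
and exist only for illustration; they are not Xen interfaces.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NR_GFNS 16
#define TOY_INVALID (~0UL)   /* stands in for an unpopulated entry */

/* Toy p2m: the index is the gfn, the value is the mfn it maps to. */
struct toy_p2m {
    unsigned long mfn[TOY_NR_GFNS];
};

/* All entries start out invalid, as in a freshly created p2m. */
static void toy_p2m_init(struct toy_p2m *p2m)
{
    for (size_t i = 0; i < TOY_NR_GFNS; i++)
        p2m->mfn[i] = TOY_INVALID;
}

/*
 * Install a 1:1 mapping for 'gfn'.
 *   - entry clear          -> insert the mapping, return 0
 *   - entry already 1:1    -> nothing to do, return 0
 *   - entry maps elsewhere -> leave it alone, return -1
 */
static int toy_set_identity(struct toy_p2m *p2m, unsigned long gfn)
{
    assert(gfn < TOY_NR_GFNS);

    if (p2m->mfn[gfn] == TOY_INVALID) {
        p2m->mfn[gfn] = gfn;
        return 0;
    }
    if (p2m->mfn[gfn] == gfn)
        return 0;

    return -1;
}
```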

Tim.

* Re: [v7][RFC][PATCH 04/13] hvmloader/util: get reserved device memory maps
  2014-10-31  8:14                       ` Jan Beulich
@ 2014-11-03  2:22                         ` Chen, Tiejun
  2014-11-03  8:53                           ` Jan Beulich
  0 siblings, 1 reply; 180+ messages in thread
From: Chen, Tiejun @ 2014-11-03  2:22 UTC (permalink / raw)
  To: Jan Beulich
  Cc: kevin.tian, wei.liu2, ian.campbell, stefano.stabellini, tim,
	ian.jackson, xen-devel, yang.z.zhang

On 2014/10/31 16:14, Jan Beulich wrote:
>>>> On 31.10.14 at 03:20, <tiejun.chen@intel.com> wrote:
>> On 2014/10/30 17:13, Jan Beulich wrote:
>>>>>> On 30.10.14 at 06:55, <tiejun.chen@intel.com> wrote:
>>>> On 2014/10/29 17:05, Jan Beulich wrote:
>>>>>>>> On 29.10.14 at 07:54, <tiejun.chen@intel.com> wrote:
>>>>>> Looks I can remove those stuff from util.h and just add 'extern' to them
>>>>>> when we really need them.
>>>>>
>>>>> Please stop thinking this way. Declarations for things defined in .c
>>>>> files are to be present in headers, and the defining .c file has to
>>>>> include that header (making sure declaration and definition are and
>>>>> remain in sync). I hate having to again repeat my remark that you
>>>>> shouldn't forget it's not application code that you're modifying.
>>>>> Robust and maintainable code are a requirement in the hypervisor
>>>>> (and, as said it being an extension of it, hvmloader). Which - just
>>>>> to avoid any misunderstanding - isn't to say that this shouldn't also
>>>>> apply to application code. It's just that in the hypervisor and kernel
>>>>> (and certain other code system components) the consequences of
>>>>> being lax are much more severe.
>>>>
>>>> Okay. But currently, the pci.c file already include 'util.h' and
>>>> '<xen/memory.h>,
>>>>
>>>> #include "util.h"
>>>> ...
>>>> #include <xen/memory.h>
>>>>
>>>> We can't redefine struct xen_reserved_device_memory in util.h.
>>>
>>> Redefine? I said forward declare.
>>
>> Seems we just need to declare hvm_get_reserved_device_memory_map() in
>> the head file, tools/firmware/hvmloader/util.h,
>>
>> unsigned int hvm_get_reserved_device_memory_map(void);
>
> To me this looks very much like poor programming style, even if in
> the context of hvmloader communicating information via global
> variables rather than function arguments and return values is

Do you mean you don't like the global variable? It can be used to get the
RDM entries without further hypercalls or function calls in the context of
hvmloader.

> generally possible.

The following is what I did:

+struct xen_reserved_device_memory *rdm_map;
+static int
+get_reserved_device_memory_map(struct xen_reserved_device_memory entries[],
+                               uint32_t *max_entries)
+{
+    int rc;
+    struct xen_reserved_device_memory_map xrdmmap = {
+        .nr_entries = *max_entries
+    };
+
+    set_xen_guest_handle(xrdmmap.buffer, entries);
+
+    rc = hypercall_memory_op(XENMEM_reserved_device_memory_map, &xrdmmap);
+    if ( rc == -ENOBUFS )
+        *max_entries = xrdmmap.nr_entries;
+
+    return rc;
+}
+
+/*
+ * Getting all reserved device memory map info in case of hvmloader.
+ * We just return zero for any failed cases, and this means we
+ * can't further handle any reserved device memory.
+ */
+unsigned int hvm_get_reserved_device_memory_map(void)
+{
+ ...
+}

So if you think these are not good, please define the prototypes you'd
like, and I can then implement them.
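
For reference, the size-probing convention used by
get_reserved_device_memory_map() above (a too-small buffer yields -ENOBUFS
and the required entry count) can be exercised without globals along these
lines. This is a sketch under assumptions: `toy_get_map()` is a fake
stand-in for the hypercall, `toy_fetch_all()` and `struct toy_rdm` are
hypothetical, and the two ranges are arbitrary made-up values.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical entry layout, for illustration only. */
struct toy_rdm {
    uint64_t start_pfn;
    uint64_t nr_pages;
};

/* Fake backend data standing in for what the hypervisor would report. */
static const struct toy_rdm toy_backend[] = {
    { 0xab000, 0x10 },
    { 0xad800, 0x2800 },
};
#define TOY_BACKEND_NR ((uint32_t)(sizeof(toy_backend) / sizeof(toy_backend[0])))

/*
 * Stand-in for the hypercall: if the buffer holds fewer than the needed
 * entries, store the needed count in *nr and return -ENOBUFS; otherwise
 * copy the entries, store the count in *nr and return 0.
 */
static int toy_get_map(struct toy_rdm *buf, uint32_t *nr)
{
    if (*nr < TOY_BACKEND_NR) {
        *nr = TOY_BACKEND_NR;
        return -ENOBUFS;
    }
    memcpy(buf, toy_backend, sizeof(toy_backend));
    *nr = TOY_BACKEND_NR;
    return 0;
}

/*
 * Probe for the size, allocate, then fetch. Returns a malloc'ed array
 * (caller frees) with the count in *nr_out, or NULL if there are no
 * entries or anything failed.
 */
static struct toy_rdm *toy_fetch_all(uint32_t *nr_out)
{
    uint32_t nr = 0;
    struct toy_rdm *buf;

    assert(nr_out != NULL);
    *nr_out = 0;

    /* First call with an empty buffer only learns the required size. */
    if (toy_get_map(NULL, &nr) != -ENOBUFS)
        return NULL;

    buf = calloc(nr, sizeof(*buf));
    if (buf == NULL)
        return NULL;

    if (toy_get_map(buf, &nr) != 0) {
        free(buf);
        return NULL;
    }

    *nr_out = nr;
    return buf;
}
```

The result and its length both travel through the return value and an
argument, so callers need no shared global state.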

Thanks
Tiejun

* Re: [v7][RFC][PATCH 06/13] hvmloader/ram: check if guest memory is out of reserved device memory maps
  2014-10-31  8:20                         ` Jan Beulich
@ 2014-11-03  5:49                           ` Chen, Tiejun
  2014-11-03  8:56                             ` Jan Beulich
  0 siblings, 1 reply; 180+ messages in thread
From: Chen, Tiejun @ 2014-11-03  5:49 UTC (permalink / raw)
  To: Jan Beulich, Kevin Tian; +Cc: Yang Z Zhang, tim, xen-devel

On 2014/10/31 16:20, Jan Beulich wrote:
>>>> On 31.10.14 at 07:21, <kevin.tian@intel.com> wrote:
>>>   From: Chen, Tiejun
>>> Sent: Friday, October 31, 2014 1:41 PM
>>> On 2014/10/30 17:20, Jan Beulich wrote:
>>>> Thinking about this some more, this odd universal hole punching in
>>>> the E820 is very likely to end up causing problems. Hence I think
>>>> this really should be optional behavior, with pass through of devices
>>>> associated with RMRRs failing if not used. (This ought to include
>>>> punching holes for _just_ the devices passed through to a guest
>>>> upon creation when the option is not enabled.)
>>>
>>> Yeah, we had a similar discussion internal to add a parameter to force
>>> reserving RMRR. In this case we can't create a VM if these ranges
>>> conflict with anything. So what about this idea?
>>>
>>
>> Adding a new parameter (e.g. 'check-passthrough') looks the right
>> approach. When the parameter is on, RMRR check/hole punch is
>> activated at VM creation. Otherwise we just keep existing behavior.
>>
>> If user configures device pass-through at creation time, this parameter
>> will be set by default. If user wants the VM capable of device hot-plug,
>> an explicit parameter can be added in the config file to enforce RMRR
>> check at creation time.
>
> Not exactly, I specifically described it slightly differently above. When
> devices get passed through and the option is absent, holes should be
> punched only for the RMRRs associated with those devices (i.e.
> ideally none). Of course this means we'll need a way to associate
> RMRRs with devices in the tool stack and hvmloader, i.e. the current
> XENMEM_reserved_device_memory_map alone won't suffice.

Yeah, the current hypercall just provides RMRR entries without the
associated BDF. In particular, in some cases one range may be shared by
multiple devices...

So let me clarify something before we take the next step. First, I'm
wondering whether you will refine that patch yourself to meet this
requirement, since it looks like you'd like to maintain that patch, or
whether I should do this. And of course we could take another approach,
such as xenstore, as Yang suggested previously.

Thanks
Tiejun

* Re: [v7][RFC][PATCH 08/13] xen/x86/p2m: set p2m_access_n for reserved device memory mapping
  2014-10-31  8:25                       ` Jan Beulich
@ 2014-11-03  6:20                         ` Chen, Tiejun
  2014-11-03  9:00                           ` Jan Beulich
  0 siblings, 1 reply; 180+ messages in thread
From: Chen, Tiejun @ 2014-11-03  6:20 UTC (permalink / raw)
  To: Jan Beulich, tim; +Cc: yang.z.zhang, kevin.tian, xen-devel

On 2014/10/31 16:25, Jan Beulich wrote:
>>>> On 31.10.14 at 03:50, <tiejun.chen@intel.com> wrote:
>> On 2014/10/30 17:24, Jan Beulich wrote:
>>>>>> On 30.10.14 at 08:39, <tiejun.chen@intel.com> wrote:
>>>> On 2014/10/29 17:20, Jan Beulich wrote:
>>>>> Getting closer. Just set a to p2m->default_access before the if(),
>>>>> and overwrite it when rc == 1 inside the if(). And properly handle
>>>>> the error case (just logging a message - which btw lacks a proper
>>>>> XENLOG_G_* prefix - doesn't seem enough to me).
>>>>
>>>> Please check the follows:
>>>>
>>>> @@ -686,8 +686,22 @@ guest_physmap_add_entry(struct domain *d, unsigned long gfn,
>>>>         /* Now, actually do the two-way mapping */
>>>>         if ( mfn_valid(_mfn(mfn)) )
>>>>         {
>>>> -        rc = p2m_set_entry(p2m, gfn, _mfn(mfn), page_order, t,
>>>> -                           p2m->default_access);
>>>> +        rc = 0;
>>>> +        a =  p2m->default_access;
>>>> +        if ( !is_hardware_domain(d) )
>>>> +        {
>>>> +            rc = iommu_get_reserved_device_memory(p2m_check_reserved_device_memory,
>>>> +                                                  &gfn);
>>>> +            /* We need to set reserved device memory as p2m_access_n. */
>>>> +            if ( rc == 1 )
>>>> +                a = p2m_access_n;
>>>> +            else if ( rc < 0 )
>>>> +                printk(XENLOG_WARNING
>>>> +                       "Domain %d can't check reserved device memory.\n",
>>>> +                       d->domain_id);
>>>> +        }
>>>> +
>>>> +        rc = p2m_set_entry(p2m, gfn, _mfn(mfn), page_order, t, a);
>>>>             if ( rc )
>>>>                 goto out; /* Failed to update p2m, bail without updating m2p. */
>>>
>>> The handling of "a" looks good now, but the error handling and
>>> logging is still as broken as it was before.
>>
>> Do you mean I'm missing some necessary info? Like gfn and mfn, so domain
>> id, gfn and mfn can show enough message.
>>
>> Sorry I'm poor to understand what you expect.
>
> But I explained it already, and that explanation is still visible in
> the quotes above. But to avoid any doubt, I'll repeat: "And

I tried to understand what you said but was still confused, so I asked
whether you could show me directly.

> properly handle the error case (just logging a message - which
> btw lacks a proper XENLOG_G_* prefix - doesn't seem enough
> to me)."

It looks like there are two problems:

#1: the error message

If the current line is not fine, would this be acceptable?
	printk(XENLOG_G_WARNING "Domain %d can't check reserved device memory.\n",
	       d->domain_id);

Otherwise, could you show me the change directly?

#2: the error handling

What should I do in the error case? Currently we still create these
mappings as normal. This means these mfns will be valid, so later we can't
set them again and the device can't be assigned for passthrough. I think
this makes sense. Or should we just refuse to set up the 1:1 mapping?

>
>>>>> But then again this code may change altogether if you avoid
>>>>> populating the reserved regions in the first place.
>>>>
>>>> Are you saying this scenario?
>>>>
>>>> #1 Here we first set these ranges as p2m_access_n
>>>> #2 We reset them as 1:1 RMRR mapping with p2m_access_rw somewhere
>>>> #3 Someone may try to populate these ranges again
>>>
>>> No. I pointed at the fact that if you avoid populating the holes,
>>> there's no need to force them to p2m_access_n. Any attempts
>>> to map other than the 1:1 thing there could then simply be
>>> rejected.
>>
>> I think any population should be rejected totally, because 1:1 mapping
>> means guest can access these RMRR ranges in case of no any device
>> assignment with RMRR, right? Any access to these range corrupt the real
>> device usage.
>
> Oh yes, of course I implied that the 1:1 mapping would be
> permitted only for those ranges where the RMRR corresponds
> to a device the guest got assigned.
>

Yeah.

Tim,

Please take a look at this; I hope it finally gets us all on the
same page :)

Thanks
Tiejun

^ permalink raw reply	[flat|nested] 180+ messages in thread

* Re: [v7][RFC][PATCH 04/13] hvmloader/util: get reserved device memory maps
  2014-11-03  2:22                         ` Chen, Tiejun
@ 2014-11-03  8:53                           ` Jan Beulich
  2014-11-03  9:32                             ` Chen, Tiejun
  0 siblings, 1 reply; 180+ messages in thread
From: Jan Beulich @ 2014-11-03  8:53 UTC (permalink / raw)
  To: Tiejun Chen
  Cc: kevin.tian, wei.liu2, ian.campbell, stefano.stabellini, tim,
	ian.jackson, xen-devel, yang.z.zhang

>>> On 03.11.14 at 03:22, <tiejun.chen@intel.com> wrote:
> On 2014/10/31 16:14, Jan Beulich wrote:
>>>>> On 31.10.14 at 03:20, <tiejun.chen@intel.com> wrote:
>>> On 2014/10/30 17:13, Jan Beulich wrote:
>>>>>>> On 30.10.14 at 06:55, <tiejun.chen@intel.com> wrote:
>>>>> On 2014/10/29 17:05, Jan Beulich wrote:
>>>>>>>>> On 29.10.14 at 07:54, <tiejun.chen@intel.com> wrote:
>>>>>>> Looks I can remove those stuff from util.h and just add 'extern' to them
>>>>>>> when we really need them.
>>>>>>
>>>>>> Please stop thinking this way. Declarations for things defined in .c
>>>>>> files are to be present in headers, and the defining .c file has to
>>>>>> include that header (making sure declaration and definition are and
>>>>>> remain in sync). I hate having to again repeat my remark that you
>>>>>> shouldn't forget it's not application code that you're modifying.
>>>>>> Robust and maintainable code are a requirement in the hypervisor
>>>>>> (and, as said it being an extension of it, hvmloader). Which - just
>>>>>> to avoid any misunderstanding - isn't to say that this shouldn't also
>>>>>> apply to application code. It's just that in the hypervisor and kernel
>>>>>> (and certain other code system components) the consequences of
>>>>>> being lax are much more severe.
>>>>>
>>>>> Okay. But currently, the pci.c file already include 'util.h' and
>>>>> '<xen/memory.h>,
>>>>>
>>>>> #include "util.h"
>>>>> ...
>>>>> #include <xen/memory.h>
>>>>>
>>>>> We can't redefine struct xen_reserved_device_memory in util.h.
>>>>
>>>> Redefine? I said forward declare.
>>>
>>> Seems we just need to declare hvm_get_reserved_device_memory_map() in
>>> the head file, tools/firmware/hvmloader/util.h,
>>>
>>> unsigned int hvm_get_reserved_device_memory_map(void);
>>
>> To me this looks very much like poor programming style, even if in
>> the context of hvmloader communicating information via global
>> variables rather than function arguments and return values is
> 
> Do you mean you don't like a global variable? But it can be use to get 
> RDM without more hypercall or function call in the context of hvmloader.

This argument which you brought up before, and which we commented
on before, is pretty pointless. We don't really care much about doing
one or two more hypercalls from hvmloader, unless these would be
long-running ones.

Jan

^ permalink raw reply	[flat|nested] 180+ messages in thread

* Re: [v7][RFC][PATCH 06/13] hvmloader/ram: check if guest memory is out of reserved device memory maps
  2014-11-03  5:49                           ` Chen, Tiejun
@ 2014-11-03  8:56                             ` Jan Beulich
  2014-11-03  9:40                               ` Chen, Tiejun
  0 siblings, 1 reply; 180+ messages in thread
From: Jan Beulich @ 2014-11-03  8:56 UTC (permalink / raw)
  To: Tiejun Chen; +Cc: Yang Z Zhang, Kevin Tian, tim, xen-devel

>>> On 03.11.14 at 06:49, <tiejun.chen@intel.com> wrote:
> On 2014/10/31 16:20, Jan Beulich wrote:
>>>>> On 31.10.14 at 07:21, <kevin.tian@intel.com> wrote:
>>>>   From: Chen, Tiejun
>>>> Sent: Friday, October 31, 2014 1:41 PM
>>>> On 2014/10/30 17:20, Jan Beulich wrote:
>>>>> Thinking about this some more, this odd universal hole punching in
>>>>> the E820 is very likely to end up causing problems. Hence I think
>>>>> this really should be optional behavior, with pass through of devices
>>>>> associated with RMRRs failing if not used. (This ought to include
>>>>> punching holes for _just_ the devices passed through to a guest
>>>>> upon creation when the option is not enabled.)
>>>>
>>>> Yeah, we had a similar discussion internal to add a parameter to force
>>>> reserving RMRR. In this case we can't create a VM if these ranges
>>>> conflict with anything. So what about this idea?
>>>>
>>>
>>> Adding a new parameter (e.g. 'check-passthrough') looks the right
>>> approach. When the parameter is on, RMRR check/hole punch is
>>> activated at VM creation. Otherwise we just keep existing behavior.
>>>
>>> If user configures device pass-through at creation time, this parameter
>>> will be set by default. If user wants the VM capable of device hot-plug,
>>> an explicit parameter can be added in the config file to enforce RMRR
>>> check at creation time.
>>
>> Not exactly, I specifically described it slightly differently above. When
>> devices get passed through and the option is absent, holes should be
>> punched only for the RMRRs associated with those devices (i.e.
>> ideally none). Of course this means we'll need a way to associate
>> RMRRs with devices in the tool stack and hvmloader, i.e. the current
>> XENMEM_reserved_device_memory_map alone won't suffice.
> 
> Yeah, current hypercall just provide RMRR entries without that 
> associated BDF. And especially, in some cases one range may be shared by 
> multiple devices...

Before we decide who's going to do an eventual change we need to
determine what behavior we want, and whether this hypercall is
really the right one. Quite possibly we'd need a per-domain view
along with the global view, and hence rather than modifying this one
we may need to introduce e.g. a new domctl.

Jan

^ permalink raw reply	[flat|nested] 180+ messages in thread

* Re: [v7][RFC][PATCH 08/13] xen/x86/p2m: set p2m_access_n for reserved device memory mapping
  2014-11-03  6:20                         ` Chen, Tiejun
@ 2014-11-03  9:00                           ` Jan Beulich
  2014-11-03  9:51                             ` Chen, Tiejun
  0 siblings, 1 reply; 180+ messages in thread
From: Jan Beulich @ 2014-11-03  9:00 UTC (permalink / raw)
  To: Tiejun Chen; +Cc: yang.z.zhang, kevin.tian, tim, xen-devel

>>> On 03.11.14 at 07:20, <tiejun.chen@intel.com> wrote:
> On 2014/10/31 16:25, Jan Beulich wrote:
>>>>> On 31.10.14 at 03:50, <tiejun.chen@intel.com> wrote:
>>> On 2014/10/30 17:24, Jan Beulich wrote:
>>>>>>> On 30.10.14 at 08:39, <tiejun.chen@intel.com> wrote:
>>>>> @@ -686,8 +686,22 @@ guest_physmap_add_entry(struct domain *d, unsigned long gfn,
>>>>>         /* Now, actually do the two-way mapping */
>>>>>         if ( mfn_valid(_mfn(mfn)) )
>>>>>         {
>>>>> -        rc = p2m_set_entry(p2m, gfn, _mfn(mfn), page_order, t,
>>>>> -                           p2m->default_access);
>>>>> +        rc = 0;
>>>>> +        a =  p2m->default_access;
>>>>> +        if ( !is_hardware_domain(d) )
>>>>> +        {
>>>>> +            rc = iommu_get_reserved_device_memory(p2m_check_reserved_device_memory,
>>>>> +                                                  &gfn);
>>>>> +            /* We need to set reserved device memory as p2m_access_n. */
>>>>> +            if ( rc == 1 )
>>>>> +                a = p2m_access_n;
>>>>> +            else if ( rc < 0 )
>>>>> +                printk(XENLOG_WARNING
>>>>> +                       "Domain %d can't check reserved device memory.\n",
>>>>> +                       d->domain_id);
>>>>> +        }
>>>>> +
>>>>> +        rc = p2m_set_entry(p2m, gfn, _mfn(mfn), page_order, t, a);
>>>>>             if ( rc )
>>>>>                 goto out; /* Failed to update p2m, bail without updating m2p.
>>> */
>>>>
>>>> The handling of "a" looks good now, but the error handling and
>>>> logging is still as broken as it was before.
>>>
>>> Do you mean I'm missing some necessary info? Like gfn and mfn, so domain
>>> id, gfn and mfn can show enough message.
>>>
>>> Sorry I'm poor to understand what you expect.
>>
>> But I explained it already, and that explanation is still visible in
>> the quotes above. But to avoid any doubt, I'll repeat: "And
> 
> I tried to understand what you said but felt a confusion so ask if you 
> show me directly.
> 
>> properly handle the error case (just logging a message - which
>> btw lacks a proper XENLOG_G_* prefix - doesn't seem enough
>> to me)."
> 
> Looks there are two problems:
> 
> #1: the error message
> 
> If current line is not fine,
> 	printk(XENLOG_G_WARNING "Domain %d can't check reserved device 
> memory.\n", d->domain_id);
> 
> I mean could you change this directly.

This looks reasonable, albeit we generally prefer Dom%d or dom%d
so that messages are somewhat grep-able.

> #2 the error handling
> 
> In an error case what should I do? Currently we still create these 
> mapping as normal. This means these mfns will be valid so later we can't 
> set them again then device can't be assigned as passthrough. I think 
> this makes sense. Or we should just stop them from setting 1:1 mapping?

You should, with very few exceptions, not ignore errors (which
includes "handling" them by just logging a message). Instead, you
should propagate the error back up the call chain.
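
As an illustration only (this is not code from the series), the advice
above could be sketched as follows. check_reserved(), set_entry() and
add_entry() are hypothetical stand-ins for
iommu_get_reserved_device_memory(), p2m_set_entry() and
guest_physmap_add_entry(); the reserved range and the failing gfn are
invented for the mock.

```c
#include <errno.h>
#include <stdio.h>

/* Hypothetical stand-in for p2m_access_t; only the two values used here. */
typedef enum { ACCESS_DEFAULT, ACCESS_N } access_t;

/* Mock state: access type of the last mapping that was actually created. */
static access_t last_access = ACCESS_DEFAULT;

/*
 * Mock of the reserved-memory lookup: returns 1 if gfn lies in a reserved
 * range, 0 if it does not, and a negative errno value if the lookup fails.
 */
static int check_reserved(unsigned long gfn)
{
    if ( gfn == 0xbadUL )
        return -ENODEV;                 /* simulated lookup failure */

    return gfn >= 0x100UL && gfn < 0x110UL;
}

/* Mock of the p2m update: always succeeds and remembers the access type. */
static int set_entry(unsigned long gfn, access_t a)
{
    (void)gfn;
    last_access = a;
    return 0;
}

/* The caller logs the failure AND hands it back to its own caller. */
static int add_entry(int domid, unsigned long gfn)
{
    access_t a = ACCESS_DEFAULT;
    int rc = check_reserved(gfn);

    if ( rc < 0 )
    {
        fprintf(stderr, "Dom%d: can't check reserved device memory (%d)\n",
                domid, rc);
        return rc;                      /* propagate, do not map anything */
    }

    if ( rc == 1 )
        a = ACCESS_N;                   /* reserved range: no guest access */

    return set_entry(gfn, a);
}
```

With this shape no mapping is created when the lookup fails, and the
negative value reaches whoever called add_entry().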

Jan

^ permalink raw reply	[flat|nested] 180+ messages in thread

* Re: [v7][RFC][PATCH 04/13] hvmloader/util: get reserved device memory maps
  2014-11-03  8:53                           ` Jan Beulich
@ 2014-11-03  9:32                             ` Chen, Tiejun
  2014-11-03  9:45                               ` Jan Beulich
  0 siblings, 1 reply; 180+ messages in thread
From: Chen, Tiejun @ 2014-11-03  9:32 UTC (permalink / raw)
  To: Jan Beulich
  Cc: kevin.tian, wei.liu2, ian.campbell, stefano.stabellini, tim,
	ian.jackson, xen-devel, yang.z.zhang

On 2014/11/3 16:53, Jan Beulich wrote:
>>>> On 03.11.14 at 03:22, <tiejun.chen@intel.com> wrote:
>> On 2014/10/31 16:14, Jan Beulich wrote:
>>>>>> On 31.10.14 at 03:20, <tiejun.chen@intel.com> wrote:
>>>> On 2014/10/30 17:13, Jan Beulich wrote:
>>>>>>>> On 30.10.14 at 06:55, <tiejun.chen@intel.com> wrote:
>>>>>> On 2014/10/29 17:05, Jan Beulich wrote:
>>>>>>>>>> On 29.10.14 at 07:54, <tiejun.chen@intel.com> wrote:
>>>>>>>> Looks I can remove those stuff from util.h and just add 'extern' to them
>>>>>>>> when we really need them.
>>>>>>>
>>>>>>> Please stop thinking this way. Declarations for things defined in .c
>>>>>>> files are to be present in headers, and the defining .c file has to
>>>>>>> include that header (making sure declaration and definition are and
>>>>>>> remain in sync). I hate having to again repeat my remark that you
>>>>>>> shouldn't forget it's not application code that you're modifying.
>>>>>>> Robust and maintainable code are a requirement in the hypervisor
>>>>>>> (and, as said it being an extension of it, hvmloader). Which - just
>>>>>>> to avoid any misunderstanding - isn't to say that this shouldn't also
>>>>>>> apply to application code. It's just that in the hypervisor and kernel
>>>>>>> (and certain other code system components) the consequences of
>>>>>>> being lax are much more severe.
>>>>>>
>>>>>> Okay. But currently, the pci.c file already include 'util.h' and
>>>>>> '<xen/memory.h>,
>>>>>>
>>>>>> #include "util.h"
>>>>>> ...
>>>>>> #include <xen/memory.h>
>>>>>>
>>>>>> We can't redefine struct xen_reserved_device_memory in util.h.
>>>>>
>>>>> Redefine? I said forward declare.
>>>>
>>>> Seems we just need to declare hvm_get_reserved_device_memory_map() in
>>>> the head file, tools/firmware/hvmloader/util.h,
>>>>
>>>> unsigned int hvm_get_reserved_device_memory_map(void);
>>>
>>> To me this looks very much like poor programming style, even if in
>>> the context of hvmloader communicating information via global
>>> variables rather than function arguments and return values is
>>
>> Do you mean you don't like a global variable? But it can be use to get
>> RDM without more hypercall or function call in the context of hvmloader.
>
> This argument which you brought up before, and which we commented
> on before, is pretty pointless. We don't really care much about doing
> one or two more hypercalls from hvmloader, unless these would be
> long-running ones.
>

Another benefit of using a global variable is that we wouldn't need to
allocate N xen_reserved_device_memory entries each time, and it would
reduce some duplicated code, unless you mean I should define it as a
local static instead.

Thanks
Tiejun

^ permalink raw reply	[flat|nested] 180+ messages in thread

* Re: [v7][RFC][PATCH 06/13] hvmloader/ram: check if guest memory is out of reserved device memory maps
  2014-11-03  8:56                             ` Jan Beulich
@ 2014-11-03  9:40                               ` Chen, Tiejun
  2014-11-03  9:51                                 ` Jan Beulich
  0 siblings, 1 reply; 180+ messages in thread
From: Chen, Tiejun @ 2014-11-03  9:40 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Yang Z Zhang, Kevin Tian, tim, xen-devel

On 2014/11/3 16:56, Jan Beulich wrote:
>>>> On 03.11.14 at 06:49, <tiejun.chen@intel.com> wrote:
>> On 2014/10/31 16:20, Jan Beulich wrote:
>>>>>> On 31.10.14 at 07:21, <kevin.tian@intel.com> wrote:
>>>>>    From: Chen, Tiejun
>>>>> Sent: Friday, October 31, 2014 1:41 PM
>>>>> On 2014/10/30 17:20, Jan Beulich wrote:
>>>>>> Thinking about this some more, this odd universal hole punching in
>>>>>> the E820 is very likely to end up causing problems. Hence I think
>>>>>> this really should be optional behavior, with pass through of devices
>>>>>> associated with RMRRs failing if not used. (This ought to include
>>>>>> punching holes for _just_ the devices passed through to a guest
>>>>>> upon creation when the option is not enabled.)
>>>>>
>>>>> Yeah, we had a similar discussion internal to add a parameter to force
>>>>> reserving RMRR. In this case we can't create a VM if these ranges
>>>>> conflict with anything. So what about this idea?
>>>>>
>>>>
>>>> Adding a new parameter (e.g. 'check-passthrough') looks the right
>>>> approach. When the parameter is on, RMRR check/hole punch is
>>>> activated at VM creation. Otherwise we just keep existing behavior.
>>>>
>>>> If user configures device pass-through at creation time, this parameter
>>>> will be set by default. If user wants the VM capable of device hot-plug,
>>>> an explicit parameter can be added in the config file to enforce RMRR
>>>> check at creation time.
>>>
>>> Not exactly, I specifically described it slightly differently above. When
>>> devices get passed through and the option is absent, holes should be
>>> punched only for the RMRRs associated with those devices (i.e.
>>> ideally none). Of course this means we'll need a way to associate
>>> RMRRs with devices in the tool stack and hvmloader, i.e. the current
>>> XENMEM_reserved_device_memory_map alone won't suffice.
>>
>> Yeah, current hypercall just provide RMRR entries without that
>> associated BDF. And especially, in some cases one range may be shared by
>> multiple devices...
>
> Before we decide who's going to do an eventual change we need to
> determine what behavior we want, and whether this hypercall is
> really the right one. Quite possibly we'd need a per-domain view
> along with the global view, and hence rather than modifying this one
> we may need to introduce e.g. a new domctl.
>

If we really need to stay with this hypercall, maybe we can extend it a
little so that the callback reports multiple entries per range. For
instance,

if RMRR entry0 has three devices and entry1 has two devices:

[start0, nr_pages0, bdf0],
[start0, nr_pages0, bdf1],
[start0, nr_pages0, bdf2],
[start1, nr_pages1, bdf3],
[start1, nr_pages1, bdf4],

Although this costs more buffer space, as you know this case is really
rare in practice, so this approach may be feasible. Then we wouldn't
need an additional hypercall or xenstore.
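
A minimal sketch of the flattened layout listed above, for
illustration only. struct rmrr, struct rdm_entry and rdm_flatten() are
hypothetical names, not part of the existing interface; the function
returns the number of entries needed even when the buffer is too
small.

```c
#include <stddef.h>
#include <stdint.h>

/* One RMRR as it might be tracked internally (hypothetical layout). */
struct rmrr {
    uint64_t start_pfn;
    uint64_t nr_pages;
    const uint16_t *bdfs;               /* devices sharing this range */
    size_t nr_bdfs;
};

/* One flattened entry: the range is repeated once per device. */
struct rdm_entry {
    uint64_t start_pfn;
    uint64_t nr_pages;
    uint16_t bdf;
};

/*
 * Write up to max entries into out[] and return how many entries the
 * full list needs, so a caller can detect a too-small buffer.
 */
static size_t rdm_flatten(const struct rmrr *rmrrs, size_t nr_rmrrs,
                          struct rdm_entry *out, size_t max)
{
    size_t n = 0;

    for ( size_t i = 0; i < nr_rmrrs; i++ )
    {
        for ( size_t j = 0; j < rmrrs[i].nr_bdfs; j++ )
        {
            if ( n < max )
            {
                out[n].start_pfn = rmrrs[i].start_pfn;
                out[n].nr_pages = rmrrs[i].nr_pages;
                out[n].bdf = rmrrs[i].bdfs[j];
            }
            n++;                        /* keep counting past the buffer */
        }
    }

    return n;
}
```

For the example above (three devices on entry0, two on entry1) this
yields five entries, which shows the extra buffer cost.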

Thanks
Tiejun

^ permalink raw reply	[flat|nested] 180+ messages in thread

* Re: [v7][RFC][PATCH 04/13] hvmloader/util: get reserved device memory maps
  2014-11-03  9:32                             ` Chen, Tiejun
@ 2014-11-03  9:45                               ` Jan Beulich
  2014-11-03  9:55                                 ` Chen, Tiejun
  0 siblings, 1 reply; 180+ messages in thread
From: Jan Beulich @ 2014-11-03  9:45 UTC (permalink / raw)
  To: Tiejun Chen
  Cc: kevin.tian, wei.liu2, ian.campbell, stefano.stabellini, tim,
	ian.jackson, xen-devel, yang.z.zhang

>>> On 03.11.14 at 10:32, <tiejun.chen@intel.com> wrote:
> On 2014/11/3 16:53, Jan Beulich wrote:
>>>>> On 03.11.14 at 03:22, <tiejun.chen@intel.com> wrote:
>>> On 2014/10/31 16:14, Jan Beulich wrote:
>>>>>>> On 31.10.14 at 03:20, <tiejun.chen@intel.com> wrote:
>>>>> On 2014/10/30 17:13, Jan Beulich wrote:
>>>>>>>>> On 30.10.14 at 06:55, <tiejun.chen@intel.com> wrote:
>>>>>>> On 2014/10/29 17:05, Jan Beulich wrote:
>>>>>>>>>>> On 29.10.14 at 07:54, <tiejun.chen@intel.com> wrote:
>>>>>>>>> Looks I can remove those stuff from util.h and just add 'extern' to them
>>>>>>>>> when we really need them.
>>>>>>>>
>>>>>>>> Please stop thinking this way. Declarations for things defined in .c
>>>>>>>> files are to be present in headers, and the defining .c file has to
>>>>>>>> include that header (making sure declaration and definition are and
>>>>>>>> remain in sync). I hate having to again repeat my remark that you
>>>>>>>> shouldn't forget it's not application code that you're modifying.
>>>>>>>> Robust and maintainable code are a requirement in the hypervisor
>>>>>>>> (and, as said it being an extension of it, hvmloader). Which - just
>>>>>>>> to avoid any misunderstanding - isn't to say that this shouldn't also
>>>>>>>> apply to application code. It's just that in the hypervisor and kernel
>>>>>>>> (and certain other code system components) the consequences of
>>>>>>>> being lax are much more severe.
>>>>>>>
>>>>>>> Okay. But currently, the pci.c file already include 'util.h' and
>>>>>>> '<xen/memory.h>,
>>>>>>>
>>>>>>> #include "util.h"
>>>>>>> ...
>>>>>>> #include <xen/memory.h>
>>>>>>>
>>>>>>> We can't redefine struct xen_reserved_device_memory in util.h.
>>>>>>
>>>>>> Redefine? I said forward declare.
>>>>>
>>>>> Seems we just need to declare hvm_get_reserved_device_memory_map() in
>>>>> the head file, tools/firmware/hvmloader/util.h,
>>>>>
>>>>> unsigned int hvm_get_reserved_device_memory_map(void);
>>>>
>>>> To me this looks very much like poor programming style, even if in
>>>> the context of hvmloader communicating information via global
>>>> variables rather than function arguments and return values is
>>>
>>> Do you mean you don't like a global variable? But it can be use to get
>>> RDM without more hypercall or function call in the context of hvmloader.
>>
>> This argument which you brought up before, and which we commented
>> on before, is pretty pointless. We don't really care much about doing
>> one or two more hypercalls from hvmloader, unless these would be
>> long-running ones.
>>
> 
> Another benefit to use a global variable is that we wouldn't allocate 
> xen_reserved_device_memory * N each time, and reduce some duplicated 
> codes, unless you mean I should define that as static inside in local.

Now this reason is indeed worth a consideration. How many times is
the data being needed/retrieved?

Jan

^ permalink raw reply	[flat|nested] 180+ messages in thread

* Re: [v7][RFC][PATCH 08/13] xen/x86/p2m: set p2m_access_n for reserved device memory mapping
  2014-11-03  9:00                           ` Jan Beulich
@ 2014-11-03  9:51                             ` Chen, Tiejun
  2014-11-03 10:03                               ` Jan Beulich
  0 siblings, 1 reply; 180+ messages in thread
From: Chen, Tiejun @ 2014-11-03  9:51 UTC (permalink / raw)
  To: Jan Beulich; +Cc: yang.z.zhang, kevin.tian, tim, xen-devel

On 2014/11/3 17:00, Jan Beulich wrote:
>>>> On 03.11.14 at 07:20, <tiejun.chen@intel.com> wrote:
>> On 2014/10/31 16:25, Jan Beulich wrote:
>>>>>> On 31.10.14 at 03:50, <tiejun.chen@intel.com> wrote:
>>>> On 2014/10/30 17:24, Jan Beulich wrote:
>>>>>>>> On 30.10.14 at 08:39, <tiejun.chen@intel.com> wrote:
>>>>>> @@ -686,8 +686,22 @@ guest_physmap_add_entry(struct domain *d, unsigned long gfn,
>>>>>>          /* Now, actually do the two-way mapping */
>>>>>>          if ( mfn_valid(_mfn(mfn)) )
>>>>>>          {
>>>>>> -        rc = p2m_set_entry(p2m, gfn, _mfn(mfn), page_order, t,
>>>>>> -                           p2m->default_access);
>>>>>> +        rc = 0;
>>>>>> +        a =  p2m->default_access;
>>>>>> +        if ( !is_hardware_domain(d) )
>>>>>> +        {
>>>>>> +            rc = iommu_get_reserved_device_memory(p2m_check_reserved_device_memory,
>>>>>> +                                                  &gfn);
>>>>>> +            /* We need to set reserved device memory as p2m_access_n. */
>>>>>> +            if ( rc == 1 )
>>>>>> +                a = p2m_access_n;
>>>>>> +            else if ( rc < 0 )
>>>>>> +                printk(XENLOG_WARNING
>>>>>> +                       "Domain %d can't check reserved device memory.\n",
>>>>>> +                       d->domain_id);
>>>>>> +        }
>>>>>> +
>>>>>> +        rc = p2m_set_entry(p2m, gfn, _mfn(mfn), page_order, t, a);
>>>>>>              if ( rc )
>>>>>>                  goto out; /* Failed to update p2m, bail without updating m2p.
>>>> */
>>>>>
>>>>> The handling of "a" looks good now, but the error handling and
>>>>> logging is still as broken as it was before.
>>>>
>>>> Do you mean I'm missing some necessary info? Like gfn and mfn, so domain
>>>> id, gfn and mfn can show enough message.
>>>>
>>>> Sorry I'm poor to understand what you expect.
>>>
>>> But I explained it already, and that explanation is still visible in
>>> the quotes above. But to avoid any doubt, I'll repeat: "And
>>
>> I tried to understand what you said but felt a confusion so ask if you
>> show me directly.
>>
>>> properly handle the error case (just logging a message - which
>>> btw lacks a proper XENLOG_G_* prefix - doesn't seem enough
>>> to me)."
>>
>> Looks there are two problems:
>>
>> #1: the error message
>>
>> If current line is not fine,
>> 	printk(XENLOG_G_WARNING "Domain %d can't check reserved device
>> memory.\n", d->domain_id);
>>
>> I mean could you change this directly.
>
> This looks reasonable, albeit we generally prefer Dom%d or dom%d
> so that messages are somewhat grep-able.

Fixed.

>
>> #2 the error handling
>>
>> In an error case what should I do? Currently we still create these
>> mapping as normal. This means these mfns will be valid so later we can't
>> set them again then device can't be assigned as passthrough. I think
>> this makes sense. Or we should just stop them from setting 1:1 mapping?
>
> You should, with very few exceptions, not ignore errors (which
> includes "handling" them by just logging a message. Instead, you
> should propagate the error back up the call chain.
>

Do you mean in your patch,

+int iommu_get_reserved_device_memory(iommu_grdm_t *func, void *ctxt)
+{
+    const struct iommu_ops *ops = iommu_get_ops();
+
+    if ( !iommu_enabled || !ops->get_reserved_device_memory )
+        return 0;
+
+    return ops->get_reserved_device_memory(func, ctxt);
+}
+

that I shouldn't return the result directly, and that we should instead
handle all error scenarios here?

Thanks
Tiejun

^ permalink raw reply	[flat|nested] 180+ messages in thread

* Re: [v7][RFC][PATCH 06/13] hvmloader/ram: check if guest memory is out of reserved device memory maps
  2014-11-03  9:40                               ` Chen, Tiejun
@ 2014-11-03  9:51                                 ` Jan Beulich
  2014-11-03 11:32                                   ` Chen, Tiejun
  0 siblings, 1 reply; 180+ messages in thread
From: Jan Beulich @ 2014-11-03  9:51 UTC (permalink / raw)
  To: Tiejun Chen; +Cc: Yang Z Zhang, Kevin Tian, tim, xen-devel

>>> On 03.11.14 at 10:40, <tiejun.chen@intel.com> wrote:
> On 2014/11/3 16:56, Jan Beulich wrote:
>>>>> On 03.11.14 at 06:49, <tiejun.chen@intel.com> wrote:
>>> On 2014/10/31 16:20, Jan Beulich wrote:
>>>>>>> On 31.10.14 at 07:21, <kevin.tian@intel.com> wrote:
>>>>>>    From: Chen, Tiejun
>>>>>> Sent: Friday, October 31, 2014 1:41 PM
>>>>>> On 2014/10/30 17:20, Jan Beulich wrote:
>>>>>>> Thinking about this some more, this odd universal hole punching in
>>>>>>> the E820 is very likely to end up causing problems. Hence I think
>>>>>>> this really should be optional behavior, with pass through of devices
>>>>>>> associated with RMRRs failing if not used. (This ought to include
>>>>>>> punching holes for _just_ the devices passed through to a guest
>>>>>>> upon creation when the option is not enabled.)
>>>>>>
>>>>>> Yeah, we had a similar discussion internal to add a parameter to force
>>>>>> reserving RMRR. In this case we can't create a VM if these ranges
>>>>>> conflict with anything. So what about this idea?
>>>>>>
>>>>>
>>>>> Adding a new parameter (e.g. 'check-passthrough') looks the right
>>>>> approach. When the parameter is on, RMRR check/hole punch is
>>>>> activated at VM creation. Otherwise we just keep existing behavior.
>>>>>
>>>>> If user configures device pass-through at creation time, this parameter
>>>>> will be set by default. If user wants the VM capable of device hot-plug,
>>>>> an explicit parameter can be added in the config file to enforce RMRR
>>>>> check at creation time.
>>>>
>>>> Not exactly, I specifically described it slightly differently above. When
>>>> devices get passed through and the option is absent, holes should be
>>>> punched only for the RMRRs associated with those devices (i.e.
>>>> ideally none). Of course this means we'll need a way to associate
>>>> RMRRs with devices in the tool stack and hvmloader, i.e. the current
>>>> XENMEM_reserved_device_memory_map alone won't suffice.
>>>
>>> Yeah, current hypercall just provide RMRR entries without that
>>> associated BDF. And especially, in some cases one range may be shared by
>>> multiple devices...
>>
>> Before we decide who's going to do an eventual change we need to
>> determine what behavior we want, and whether this hypercall is
>> really the right one. Quite possibly we'd need a per-domain view
>> along with the global view, and hence rather than modifying this one
>> we may need to introduce e.g. a new domctl.
>>
> 
> If we really need to work with a hypercall, maybe we can introduce a 
> little bit to construct that to callback with multiple entries like 
> this, for instance,
> 
> RMRR entry0 have three devices, and entry1 have two devices,
> 
> [start0, nr_pages0, bdf0],
> [start0, nr_pages0, bdf1],
> [start0, nr_pages0, bdf2],
> [start1, nr_pages1, bdf3],
> [start1, nr_pages1, bdf4],
> 
> Although its cost more buffers, actually as you know this actual case is 
> really rare. So maybe this way can be feasible. Then we don't need 
> additional hypercall or xenstore.

Conceptually, as a MEMOP, it has no business reporting BDFs. And
then rather than returning the same address range more than once,
having the caller supply a handle to an array and storing all of the
SBDFs (or perhaps a single segment would suffice along with all the
BDFs) there would seem to be an approach more consistent with
what we do elsewhere.
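
One possible reading of this suggestion, sketched for illustration
only: each range is reported once, and the devices go into an array
the caller supplies. struct rdm_query and rdm_report() are
hypothetical; a real interface would use a guest handle rather than a
plain pointer, and the 32-bit SBDF encoding and the -ENOBUFS
convention are assumptions.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* One RMRR as it might be tracked internally (hypothetical layout). */
struct rmrr {
    uint64_t start_pfn;
    uint64_t nr_pages;
    const uint32_t *sbdfs;              /* devices sharing this range */
    uint32_t nr_sbdfs;
};

/* Per-range query: the address range appears exactly once. */
struct rdm_query {
    uint64_t start_pfn;                 /* OUT */
    uint64_t nr_pages;                  /* OUT */
    uint32_t *sbdfs;                    /* IN: caller-supplied array */
    uint32_t nr_sbdfs;                  /* IN: capacity, OUT: device count */
};

/*
 * Fill in the range and copy its SBDFs into the caller's array.
 * Returns -ENOBUFS, with nr_sbdfs set to the required size, if the
 * array is too small.
 */
static int rdm_report(const struct rmrr *r, struct rdm_query *q)
{
    uint32_t capacity = q->nr_sbdfs;

    q->start_pfn = r->start_pfn;
    q->nr_pages = r->nr_pages;
    q->nr_sbdfs = r->nr_sbdfs;

    if ( capacity < r->nr_sbdfs )
        return -ENOBUFS;

    for ( uint32_t i = 0; i < r->nr_sbdfs; i++ )
        q->sbdfs[i] = r->sbdfs[i];

    return 0;
}
```

Compared with repeating the range per device, the range data stays
free of duplicates and the device list sits in its own array.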

Jan

^ permalink raw reply	[flat|nested] 180+ messages in thread

* Re: [v7][RFC][PATCH 04/13] hvmloader/util: get reserved device memory maps
  2014-11-03  9:45                               ` Jan Beulich
@ 2014-11-03  9:55                                 ` Chen, Tiejun
  2014-11-03 10:02                                   ` Jan Beulich
  0 siblings, 1 reply; 180+ messages in thread
From: Chen, Tiejun @ 2014-11-03  9:55 UTC (permalink / raw)
  To: Jan Beulich
  Cc: kevin.tian, wei.liu2, ian.campbell, stefano.stabellini, tim,
	ian.jackson, xen-devel, yang.z.zhang

On 2014/11/3 17:45, Jan Beulich wrote:
>>>> On 03.11.14 at 10:32, <tiejun.chen@intel.com> wrote:
>> On 2014/11/3 16:53, Jan Beulich wrote:
>>>>>> On 03.11.14 at 03:22, <tiejun.chen@intel.com> wrote:
>>>> On 2014/10/31 16:14, Jan Beulich wrote:
>>>>>>>> On 31.10.14 at 03:20, <tiejun.chen@intel.com> wrote:
>>>>>> On 2014/10/30 17:13, Jan Beulich wrote:
>>>>>>>>>> On 30.10.14 at 06:55, <tiejun.chen@intel.com> wrote:
>>>>>>>> On 2014/10/29 17:05, Jan Beulich wrote:
>>>>>>>>>>>> On 29.10.14 at 07:54, <tiejun.chen@intel.com> wrote:
>>>>>>>>>> Looks I can remove those stuff from util.h and just add 'extern' to them
>>>>>>>>>> when we really need them.
>>>>>>>>>
>>>>>>>>> Please stop thinking this way. Declarations for things defined in .c
>>>>>>>>> files are to be present in headers, and the defining .c file has to
>>>>>>>>> include that header (making sure declaration and definition are and
>>>>>>>>> remain in sync). I hate having to again repeat my remark that you
>>>>>>>>> shouldn't forget it's not application code that you're modifying.
>>>>>>>>> Robust and maintainable code are a requirement in the hypervisor
>>>>>>>>> (and, as said it being an extension of it, hvmloader). Which - just
>>>>>>>>> to avoid any misunderstanding - isn't to say that this shouldn't also
>>>>>>>>> apply to application code. It's just that in the hypervisor and kernel
>>>>>>>>> (and certain other code system components) the consequences of
>>>>>>>>> being lax are much more severe.
>>>>>>>>
>>>>>>>> Okay. But currently, the pci.c file already include 'util.h' and
>>>>>>>> '<xen/memory.h>,
>>>>>>>>
>>>>>>>> #include "util.h"
>>>>>>>> ...
>>>>>>>> #include <xen/memory.h>
>>>>>>>>
>>>>>>>> We can't redefine struct xen_reserved_device_memory in util.h.
>>>>>>>
>>>>>>> Redefine? I said forward declare.
>>>>>>
>>>>>> Seems we just need to declare hvm_get_reserved_device_memory_map() in
>>>>>> the head file, tools/firmware/hvmloader/util.h,
>>>>>>
>>>>>> unsigned int hvm_get_reserved_device_memory_map(void);
>>>>>
>>>>> To me this looks very much like poor programming style, even if in
>>>>> the context of hvmloader communicating information via global
>>>>> variables rather than function arguments and return values is
>>>>
>>>> Do you mean you don't like a global variable? But it can be use to get
>>>> RDM without more hypercall or function call in the context of hvmloader.
>>>
>>> This argument which you brought up before, and which we commented
>>> on before, is pretty pointless. We don't really care much about doing
>>> one or two more hypercalls from hvmloader, unless these would be
>>> long-running ones.
>>>
>>
>> Another benefit to use a global variable is that we wouldn't allocate
>> xen_reserved_device_memory * N each time, and reduce some duplicated
>> codes, unless you mean I should define that as static inside in local.
>
> Now this reason is indeed worth a consideration. How many times is
> the data being needed/retrieved?

Currently there are two cases in tools/hvmloader: setting up PCI and
building the e820 table. Each time, as you know, we don't know how many
entries we will need, so we always allocate one instance first and then,
according to the return value, allocate the proper number of instances
to get the map.
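
The allocate-one-then-retry pattern described here can be sketched as a
plain function that hands the map back through an argument and a return
value instead of a global. This is only an illustration: struct
rdm_entry, mock_query_rdm() and get_rdm_map() are hypothetical names,
the mock stands in for the real hypercall, and reporting "buffer too
small" as -ERANGE is an assumption.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for one reserved device memory entry. */
struct rdm_entry {
    uint64_t start_pfn;
    uint64_t nr_pages;
};

/* Mock data: pretend the platform reports two reserved ranges. */
static const struct rdm_entry mock_map[] = {
    { 0xab000, 0x10 },
    { 0xad000, 0x20 },
};
#define MOCK_NR ((unsigned int)(sizeof(mock_map) / sizeof(mock_map[0])))

/*
 * Mock query (assumption, not the real hypercall). On entry *nr is the
 * capacity of buf; on exit *nr is always the number of entries required.
 * Returns 0 on success or -ERANGE if buf was too small.
 */
static int mock_query_rdm(struct rdm_entry *buf, unsigned int *nr)
{
    unsigned int space = *nr, i;

    *nr = MOCK_NR;
    if ( space < MOCK_NR )
        return -ERANGE;

    for ( i = 0; i < MOCK_NR; i++ )
        buf[i] = mock_map[i];

    return 0;
}

/*
 * Probe with a single entry, then allocate exactly what is needed.
 * Returns the number of entries (>= 0) or a negative errno value.
 * On success *out is a malloc()ed array (NULL if empty); caller frees it.
 */
static int get_rdm_map(struct rdm_entry **out)
{
    struct rdm_entry probe;
    struct rdm_entry *map;
    unsigned int nr = 1;
    int rc;

    assert(out != NULL);
    *out = NULL;

    rc = mock_query_rdm(&probe, &nr);
    if ( rc && rc != -ERANGE )
        return rc;              /* real failure: hand it to the caller */
    if ( nr == 0 )
        return 0;               /* nothing reserved, nothing to do */

    map = calloc(nr, sizeof(*map));
    if ( map == NULL )
        return -ENOMEM;

    rc = mock_query_rdm(map, &nr);
    if ( rc )
    {
        free(map);
        return rc;
    }

    *out = map;
    return (int)nr;
}
```

Both the PCI setup and the e820 construction could call such a helper
and free the result afterwards, at the cost of repeating the query for
each user.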

Thanks
Tiejun

^ permalink raw reply	[flat|nested] 180+ messages in thread

* Re: [v7][RFC][PATCH 04/13] hvmloader/util: get reserved device memory maps
  2014-11-03  9:55                                 ` Chen, Tiejun
@ 2014-11-03 10:02                                   ` Jan Beulich
  2014-11-21  6:26                                     ` Chen, Tiejun
  0 siblings, 1 reply; 180+ messages in thread
From: Jan Beulich @ 2014-11-03 10:02 UTC (permalink / raw)
  To: Tiejun Chen
  Cc: kevin.tian, wei.liu2, ian.campbell, stefano.stabellini, tim,
	ian.jackson, xen-devel, yang.z.zhang

>>> On 03.11.14 at 10:55, <tiejun.chen@intel.com> wrote:
> On 2014/11/3 17:45, Jan Beulich wrote:
>>>>> On 03.11.14 at 10:32, <tiejun.chen@intel.com> wrote:
>>> On 2014/11/3 16:53, Jan Beulich wrote:
>>>>>>> On 03.11.14 at 03:22, <tiejun.chen@intel.com> wrote:
>>>>> On 2014/10/31 16:14, Jan Beulich wrote:
>>>>>>>>> On 31.10.14 at 03:20, <tiejun.chen@intel.com> wrote:
>>>>>>> On 2014/10/30 17:13, Jan Beulich wrote:
>>>>>>>>>>> On 30.10.14 at 06:55, <tiejun.chen@intel.com> wrote:
>>>>>>>>> On 2014/10/29 17:05, Jan Beulich wrote:
>>>>>>>>>>>>> On 29.10.14 at 07:54, <tiejun.chen@intel.com> wrote:
>>>>>>>>>>> Looks I can remove those stuff from util.h and just add 'extern' to them
>>>>>>>>>>> when we really need them.
>>>>>>>>>>
>>>>>>>>>> Please stop thinking this way. Declarations for things defined in .c
>>>>>>>>>> files are to be present in headers, and the defining .c file has to
>>>>>>>>>> include that header (making sure declaration and definition are and
>>>>>>>>>> remain in sync). I hate having to again repeat my remark that you
>>>>>>>>>> shouldn't forget it's not application code that you're modifying.
>>>>>>>>>> Robust and maintainable code are a requirement in the hypervisor
>>>>>>>>>> (and, as said it being an extension of it, hvmloader). Which - just
>>>>>>>>>> to avoid any misunderstanding - isn't to say that this shouldn't also
>>>>>>>>>> apply to application code. It's just that in the hypervisor and kernel
>>>>>>>>>> (and certain other code system components) the consequences of
>>>>>>>>>> being lax are much more severe.
>>>>>>>>>
>>>>>>>>> Okay. But currently, the pci.c file already include 'util.h' and
>>>>>>>>> '<xen/memory.h>,
>>>>>>>>>
>>>>>>>>> #include "util.h"
>>>>>>>>> ...
>>>>>>>>> #include <xen/memory.h>
>>>>>>>>>
>>>>>>>>> We can't redefine struct xen_reserved_device_memory in util.h.
>>>>>>>>
>>>>>>>> Redefine? I said forward declare.
>>>>>>>
>>>>>>> Seems we just need to declare hvm_get_reserved_device_memory_map() in
>>>>>>> the head file, tools/firmware/hvmloader/util.h,
>>>>>>>
>>>>>>> unsigned int hvm_get_reserved_device_memory_map(void);
>>>>>>
>>>>>> To me this looks very much like poor programming style, even if in
>>>>>> the context of hvmloader communicating information via global
>>>>>> variables rather than function arguments and return values is
>>>>>
>>>>> Do you mean you don't like a global variable? But it can be use to get
>>>>> RDM without more hypercall or function call in the context of hvmloader.
>>>>
>>>> This argument which you brought up before, and which we commented
>>>> on before, is pretty pointless. We don't really care much about doing
>>>> one or two more hypercalls from hvmloader, unless these would be
>>>> long-running ones.
>>>>
>>>
>>> Another benefit to use a global variable is that we wouldn't allocate
>>> xen_reserved_device_memory * N each time, and reduce some duplicated
>>> codes, unless you mean I should define that as static inside in local.
>>
>> Now this reason is indeed worth a consideration. How many times is
>> the data being needed/retrieved?
> 
> Currently there are two cases in tools/hvmloader, setup pci and build 
> e820 table. Each time, as you know we don't know how may entries we 
> should require, so we always allocate one instance then according to the 
> return value to allocate the proper instances to get that.

Hmm, two uses isn't really that bad, i.e. I'd then still be in favor of
a more "normal" interface.

Jan

^ permalink raw reply	[flat|nested] 180+ messages in thread

* Re: [v7][RFC][PATCH 08/13] xen/x86/p2m: set p2m_access_n for reserved device memory mapping
  2014-11-03  9:51                             ` Chen, Tiejun
@ 2014-11-03 10:03                               ` Jan Beulich
  2014-11-03 11:48                                 ` Chen, Tiejun
  0 siblings, 1 reply; 180+ messages in thread
From: Jan Beulich @ 2014-11-03 10:03 UTC (permalink / raw)
  To: Tiejun Chen; +Cc: yang.z.zhang, kevin.tian, tim, xen-devel

>>> On 03.11.14 at 10:51, <tiejun.chen@intel.com> wrote:
> On 2014/11/3 17:00, Jan Beulich wrote:
>>>>> On 03.11.14 at 07:20, <tiejun.chen@intel.com> wrote:
>>> #2 the error handling
>>>
>>> In an error case what should I do? Currently we still create these
>>> mapping as normal. This means these mfns will be valid so later we can't
>>> set them again then device can't be assigned as passthrough. I think
>>> this makes sense. Or we should just stop them from setting 1:1 mapping?
>>
>> You should, with very few exceptions, not ignore errors (which
>> includes "handling" them by just logging a message. Instead, you
>> should propagate the error back up the call chain.
>>
> 
> Do you mean in your patch,
> 
> +int iommu_get_reserved_device_memory(iommu_grdm_t *func, void *ctxt)
> +{
> +    const struct iommu_ops *ops = iommu_get_ops();
> +
> +    if ( !iommu_enabled || !ops->get_reserved_device_memory )
> +        return 0;
> +
> +    return ops->get_reserved_device_memory(func, ctxt);
> +}
> +
> 
> I shouldn't return that directly. Then instead, we should handle all 
> error scenarios here?

No. All error scenarios are already being handled here (by
propagating the error code to the caller).

Jan

^ permalink raw reply	[flat|nested] 180+ messages in thread

* Re: [v7][RFC][PATCH 06/13] hvmloader/ram: check if guest memory is out of reserved device memory maps
  2014-11-03  9:51                                 ` Jan Beulich
@ 2014-11-03 11:32                                   ` Chen, Tiejun
  2014-11-03 11:43                                     ` Jan Beulich
  0 siblings, 1 reply; 180+ messages in thread
From: Chen, Tiejun @ 2014-11-03 11:32 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Yang Z Zhang, Kevin Tian, tim, xen-devel

On 2014/11/3 17:51, Jan Beulich wrote:
>>>> On 03.11.14 at 10:40, <tiejun.chen@intel.com> wrote:
>> On 2014/11/3 16:56, Jan Beulich wrote:
>>>>>> On 03.11.14 at 06:49, <tiejun.chen@intel.com> wrote:
>>>> On 2014/10/31 16:20, Jan Beulich wrote:
>>>>>>>> On 31.10.14 at 07:21, <kevin.tian@intel.com> wrote:
>>>>>>>     From: Chen, Tiejun
>>>>>>> Sent: Friday, October 31, 2014 1:41 PM
>>>>>>> On 2014/10/30 17:20, Jan Beulich wrote:
>>>>>>>> Thinking about this some more, this odd universal hole punching in
>>>>>>>> the E820 is very likely to end up causing problems. Hence I think
>>>>>>>> this really should be optional behavior, with pass through of devices
>>>>>>>> associated with RMRRs failing if not used. (This ought to include
>>>>>>>> punching holes for _just_ the devices passed through to a guest
>>>>>>>> upon creation when the option is not enabled.)
>>>>>>>
>>>>>>> Yeah, we had a similar discussion internal to add a parameter to force
>>>>>>> reserving RMRR. In this case we can't create a VM if these ranges
>>>>>>> conflict with anything. So what about this idea?
>>>>>>>
>>>>>>
>>>>>> Adding a new parameter (e.g. 'check-passthrough') looks the right
>>>>>> approach. When the parameter is on, RMRR check/hole punch is
>>>>>> activated at VM creation. Otherwise we just keep existing behavior.
>>>>>>
>>>>>> If user configures device pass-through at creation time, this parameter
>>>>>> will be set by default. If user wants the VM capable of device hot-plug,
>>>>>> an explicit parameter can be added in the config file to enforce RMRR
>>>>>> check at creation time.
>>>>>
>>>>> Not exactly, I specifically described it slightly differently above. When
>>>>> devices get passed through and the option is absent, holes should be
>>>>> punched only for the RMRRs associated with those devices (i.e.
>>>>> ideally none). Of course this means we'll need a way to associate
>>>>> RMRRs with devices in the tool stack and hvmloader, i.e. the current
>>>>> XENMEM_reserved_device_memory_map alone won't suffice.
>>>>
>>>> Yeah, current hypercall just provide RMRR entries without that
>>>> associated BDF. And especially, in some cases one range may be shared by
>>>> multiple devices...
>>>
>>> Before we decide who's going to do an eventual change we need to
>>> determine what behavior we want, and whether this hypercall is
>>> really the right one. Quite possibly we'd need a per-domain view
>>> along with the global view, and hence rather than modifying this one
>>> we may need to introduce e.g. a new domctl.
>>>
>>
>> If we really need to work with a hypercall, maybe we can introduce a
>> little bit to construct that to callback with multiple entries like
>> this, for instance,
>>
>> RMRR entry0 have three devices, and entry1 have two devices,
>>
>> [start0, nr_pages0, bdf0],
>> [start0, nr_pages0, bdf1],
>> [start0, nr_pages0, bdf2],
>> [start1, nr_pages1, bdf3],
>> [start1, nr_pages1, bdf4],
>>
>> Although its cost more buffers, actually as you know this actual case is
>> really rare. So maybe this way can be feasible. Then we don't need
>> additional hypercall or xenstore.
>
> Conceptually, as a MEMOP, it has no business reporting BDFs. And
> then rather than returning the same address range more than once,
> having the caller supply a handle to an array and storing all of the
> SBDFs (or perhaps a single segment would suffice along with all the
> BDFs) there would seem to be an approach more consistent with
> what we do elsewhere.

Here I'm wondering if we really need to expose BDFs to the tools.
Actually the tools just want to know those ranges, no matter which
device owns them. I mean we can do this in Xen.

When we try to assign a device as passthrough, Xen can get that BDF, so
Xen can pre-check everything inside that hypercall and return something
like this:

#1 If this device has an RMRR, we return that RMRR buffer. This is
similar to our current implementation.
#2 If not, we return 'nr_entries' as '0' to notify hvmloader that it has
nothing to do.

Thanks
Tiejun

^ permalink raw reply	[flat|nested] 180+ messages in thread

* Re: [v7][RFC][PATCH 06/13] hvmloader/ram: check if guest memory is out of reserved device memory maps
  2014-11-03 11:32                                   ` Chen, Tiejun
@ 2014-11-03 11:43                                     ` Jan Beulich
  2014-11-03 11:58                                       ` Chen, Tiejun
  0 siblings, 1 reply; 180+ messages in thread
From: Jan Beulich @ 2014-11-03 11:43 UTC (permalink / raw)
  To: Tiejun Chen; +Cc: Yang Z Zhang, Kevin Tian, tim, xen-devel

>>> On 03.11.14 at 12:32, <tiejun.chen@intel.com> wrote:
> On 2014/11/3 17:51, Jan Beulich wrote:
>>>>> On 03.11.14 at 10:40, <tiejun.chen@intel.com> wrote:
>>> On 2014/11/3 16:56, Jan Beulich wrote:
>>>>>>> On 03.11.14 at 06:49, <tiejun.chen@intel.com> wrote:
>>>>> On 2014/10/31 16:20, Jan Beulich wrote:
>>>>>>>>> On 31.10.14 at 07:21, <kevin.tian@intel.com> wrote:
>>>>>>>>     From: Chen, Tiejun
>>>>>>>> Sent: Friday, October 31, 2014 1:41 PM
>>>>>>>> On 2014/10/30 17:20, Jan Beulich wrote:
>>>>>>>>> Thinking about this some more, this odd universal hole punching in
>>>>>>>>> the E820 is very likely to end up causing problems. Hence I think
>>>>>>>>> this really should be optional behavior, with pass through of devices
>>>>>>>>> associated with RMRRs failing if not used. (This ought to include
>>>>>>>>> punching holes for _just_ the devices passed through to a guest
>>>>>>>>> upon creation when the option is not enabled.)
>>>>>>>>
>>>>>>>> Yeah, we had a similar discussion internal to add a parameter to force
>>>>>>>> reserving RMRR. In this case we can't create a VM if these ranges
>>>>>>>> conflict with anything. So what about this idea?
>>>>>>>>
>>>>>>>
>>>>>>> Adding a new parameter (e.g. 'check-passthrough') looks the right
>>>>>>> approach. When the parameter is on, RMRR check/hole punch is
>>>>>>> activated at VM creation. Otherwise we just keep existing behavior.
>>>>>>>
>>>>>>> If user configures device pass-through at creation time, this parameter
>>>>>>> will be set by default. If user wants the VM capable of device hot-plug,
>>>>>>> an explicit parameter can be added in the config file to enforce RMRR
>>>>>>> check at creation time.
>>>>>>
>>>>>> Not exactly, I specifically described it slightly differently above. When
>>>>>> devices get passed through and the option is absent, holes should be
>>>>>> punched only for the RMRRs associated with those devices (i.e.
>>>>>> ideally none). Of course this means we'll need a way to associate
>>>>>> RMRRs with devices in the tool stack and hvmloader, i.e. the current
>>>>>> XENMEM_reserved_device_memory_map alone won't suffice.
>>>>>
>>>>> Yeah, current hypercall just provide RMRR entries without that
>>>>> associated BDF. And especially, in some cases one range may be shared by
>>>>> multiple devices...
>>>>
>>>> Before we decide who's going to do an eventual change we need to
>>>> determine what behavior we want, and whether this hypercall is
>>>> really the right one. Quite possibly we'd need a per-domain view
>>>> along with the global view, and hence rather than modifying this one
>>>> we may need to introduce e.g. a new domctl.
>>>>
>>>
>>> If we really need to work with a hypercall, maybe we can introduce a
>>> little bit to construct that to callback with multiple entries like
>>> this, for instance,
>>>
>>> RMRR entry0 have three devices, and entry1 have two devices,
>>>
>>> [start0, nr_pages0, bdf0],
>>> [start0, nr_pages0, bdf1],
>>> [start0, nr_pages0, bdf2],
>>> [start1, nr_pages1, bdf3],
>>> [start1, nr_pages1, bdf4],
>>>
>>> Although its cost more buffers, actually as you know this actual case is
>>> really rare. So maybe this way can be feasible. Then we don't need
>>> additional hypercall or xenstore.
>>
>> Conceptually, as a MEMOP, it has no business reporting BDFs. And
>> then rather than returning the same address range more than once,
>> having the caller supply a handle to an array and storing all of the
>> SBDFs (or perhaps a single segment would suffice along with all the
>> BDFs) there would seem to be an approach more consistent with
>> what we do elsewhere.
> 
> Here I'm wondering if we really need to expose BDFs to tools. Actually 
> tools just want to know those range no matter who owns these entries. I 
> mean we can do this in Xen.

As pointed out before, in order for the tools to (a) avoid punching
_all possible_ holes when not asked to but (b) punch holes for all
devices assigned right at guest boot time, I don't think the tools
can get away without having some way of identifying the _specific_
reserved memory regions a guest needs. Whether this works via
SBDF or some other means is tbd.

> When we try to assign device as passthrough, Xen can get that bdf so Xen 
> can pre-check everything inside that hypercall, and Xen can return 
> something like this,
> 
> #1 If this device has RMRR, we return that rmrr buffer. This is similar 
> with our current implementation.

"That rmrr buffer" being which? Remember that there may be
multiple devices associated with RMRRs and multiple devices
assigned to a guest.

> #2 If not, we return 'nr_entries' as '0' to notify hvmloader it has 
> nothing to do.

This (as vague as it is) hints at the same domain-specific reserved
region determination that I was already hinting at.

Jan

^ permalink raw reply	[flat|nested] 180+ messages in thread

* Re: [v7][RFC][PATCH 08/13] xen/x86/p2m: set p2m_access_n for reserved device memory mapping
  2014-11-03 10:03                               ` Jan Beulich
@ 2014-11-03 11:48                                 ` Chen, Tiejun
  2014-11-03 11:53                                   ` Jan Beulich
  0 siblings, 1 reply; 180+ messages in thread
From: Chen, Tiejun @ 2014-11-03 11:48 UTC (permalink / raw)
  To: Jan Beulich; +Cc: yang.z.zhang, kevin.tian, tim, xen-devel

On 2014/11/3 18:03, Jan Beulich wrote:
>>>> On 03.11.14 at 10:51, <tiejun.chen@intel.com> wrote:
>> On 2014/11/3 17:00, Jan Beulich wrote:
>>>>>> On 03.11.14 at 07:20, <tiejun.chen@intel.com> wrote:
>>>> #2 the error handling
>>>>
>>>> In an error case what should I do? Currently we still create these
>>>> mapping as normal. This means these mfns will be valid so later we can't
>>>> set them again then device can't be assigned as passthrough. I think
>>>> this makes sense. Or we should just stop them from setting 1:1 mapping?
>>>
>>> You should, with very few exceptions, not ignore errors (which
>>> includes "handling" them by just logging a message. Instead, you
>>> should propagate the error back up the call chain.
>>>
>>
>> Do you mean in your patch,
>>
>> +int iommu_get_reserved_device_memory(iommu_grdm_t *func, void *ctxt)
>> +{
>> +    const struct iommu_ops *ops = iommu_get_ops();
>> +
>> +    if ( !iommu_enabled || !ops->get_reserved_device_memory )
>> +        return 0;
>> +
>> +    return ops->get_reserved_device_memory(func, ctxt);
>> +}
>> +
>>
>> I shouldn't return that directly. Then instead, we should handle all
>> error scenarios here?
>
> No. All error scenarios are already being handled here (by
> propagating the error code to the caller).

Sorry, how do we propagate the error code?

Thanks
Tiejun

^ permalink raw reply	[flat|nested] 180+ messages in thread

* Re: [v7][RFC][PATCH 08/13] xen/x86/p2m: set p2m_access_n for reserved device memory mapping
  2014-11-03 11:48                                 ` Chen, Tiejun
@ 2014-11-03 11:53                                   ` Jan Beulich
  2014-11-04  1:35                                     ` Chen, Tiejun
  0 siblings, 1 reply; 180+ messages in thread
From: Jan Beulich @ 2014-11-03 11:53 UTC (permalink / raw)
  To: Tiejun Chen; +Cc: yang.z.zhang, kevin.tian, tim, xen-devel

>>> On 03.11.14 at 12:48, <tiejun.chen@intel.com> wrote:
> On 2014/11/3 18:03, Jan Beulich wrote:
>>>>> On 03.11.14 at 10:51, <tiejun.chen@intel.com> wrote:
>>> On 2014/11/3 17:00, Jan Beulich wrote:
>>>>>>> On 03.11.14 at 07:20, <tiejun.chen@intel.com> wrote:
>>>>> #2 the error handling
>>>>>
>>>>> In an error case what should I do? Currently we still create these
>>>>> mapping as normal. This means these mfns will be valid so later we can't
>>>>> set them again then device can't be assigned as passthrough. I think
>>>>> this makes sense. Or we should just stop them from setting 1:1 mapping?
>>>>
>>>> You should, with very few exceptions, not ignore errors (which
>>>> includes "handling" them by just logging a message. Instead, you
>>>> should propagate the error back up the call chain.
>>>>
>>>
>>> Do you mean in your patch,
>>>
>>> +int iommu_get_reserved_device_memory(iommu_grdm_t *func, void *ctxt)
>>> +{
>>> +    const struct iommu_ops *ops = iommu_get_ops();
>>> +
>>> +    if ( !iommu_enabled || !ops->get_reserved_device_memory )
>>> +        return 0;
>>> +
>>> +    return ops->get_reserved_device_memory(func, ctxt);
>>> +}
>>> +
>>>
>>> I shouldn't return that directly. Then instead, we should handle all
>>> error scenarios here?
>>
>> No. All error scenarios are already being handled here (by
>> propagating the error code to the caller).
> 
> Sorry, how to propagate the error code?

Return it to the caller (and it will do so onwards, until it reaches
[presumably] the entity having invoked a hypercall).
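
A minimal sketch of what "return it to the caller" means across several
layers follows. All names here (walk_reserved_ranges(),
get_reserved_ranges(), check_gfn(), add_mapping()) are hypothetical and
only mirror the shape of the wrapper quoted above; mapping a positive
"range hit" to -EBUSY is an assumption made for the example.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Callback type: < 0 is an error, > 0 is a hit, 0 means keep going. */
typedef int (*grdm_cb_t)(unsigned long start, unsigned long nr, void *ctxt);

/* Lowest layer: walk the ranges, stop at the first non-zero result. */
static int walk_reserved_ranges(grdm_cb_t func, void *ctxt)
{
    static const unsigned long ranges[][2] = {
        { 0xab000, 0x10 },
        { 0xad000, 0x20 },
    };
    size_t i;

    for ( i = 0; i < sizeof(ranges) / sizeof(ranges[0]); i++ )
    {
        int rc = func(ranges[i][0], ranges[i][1], ctxt);

        if ( rc )
            return rc;
    }

    return 0;
}

/* Middle layer: adds no handling of its own, just passes the result on. */
static int get_reserved_ranges(grdm_cb_t func, void *ctxt)
{
    if ( func == NULL )
        return -EINVAL;

    return walk_reserved_ranges(func, ctxt);
}

/* Callback: report a hit (1) if the gfn lies inside the range. */
static int check_gfn(unsigned long start, unsigned long nr, void *ctxt)
{
    unsigned long gfn;

    assert(ctxt != NULL);
    gfn = *(unsigned long *)ctxt;

    return (gfn >= start && gfn < start + nr) ? 1 : 0;
}

/*
 * Top layer: 0 means the mapping may proceed. Any negative value from
 * below is returned unchanged rather than being logged and dropped.
 */
static int add_mapping(unsigned long gfn)
{
    int rc = get_reserved_ranges(check_gfn, &gfn);

    if ( rc < 0 )
        return rc;          /* propagate the error as is */
    if ( rc > 0 )
        return -EBUSY;      /* gfn is reserved device memory */

    /* ... the actual entry would be set here ... */
    return 0;
}
```

The caller of add_mapping() would treat its result the same way, so the
code eventually reaches whoever issued the request.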

Jan

^ permalink raw reply	[flat|nested] 180+ messages in thread

* Re: [v7][RFC][PATCH 06/13] hvmloader/ram: check if guest memory is out of reserved device memory maps
  2014-11-03 11:43                                     ` Jan Beulich
@ 2014-11-03 11:58                                       ` Chen, Tiejun
  2014-11-03 12:34                                         ` Jan Beulich
  0 siblings, 1 reply; 180+ messages in thread
From: Chen, Tiejun @ 2014-11-03 11:58 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Yang Z Zhang, Kevin Tian, tim, xen-devel

On 2014/11/3 19:43, Jan Beulich wrote:
>>>> On 03.11.14 at 12:32, <tiejun.chen@intel.com> wrote:
>> On 2014/11/3 17:51, Jan Beulich wrote:
>>>>>> On 03.11.14 at 10:40, <tiejun.chen@intel.com> wrote:
>>>> On 2014/11/3 16:56, Jan Beulich wrote:
>>>>>>>> On 03.11.14 at 06:49, <tiejun.chen@intel.com> wrote:
>>>>>> On 2014/10/31 16:20, Jan Beulich wrote:
>>>>>>>>>> On 31.10.14 at 07:21, <kevin.tian@intel.com> wrote:
>>>>>>>>>      From: Chen, Tiejun
>>>>>>>>> Sent: Friday, October 31, 2014 1:41 PM
>>>>>>>>> On 2014/10/30 17:20, Jan Beulich wrote:
>>>>>>>>>> Thinking about this some more, this odd universal hole punching in
>>>>>>>>>> the E820 is very likely to end up causing problems. Hence I think
>>>>>>>>>> this really should be optional behavior, with pass through of devices
>>>>>>>>>> associated with RMRRs failing if not used. (This ought to include
>>>>>>>>>> punching holes for _just_ the devices passed through to a guest
>>>>>>>>>> upon creation when the option is not enabled.)
>>>>>>>>>
>>>>>>>>> Yeah, we had a similar discussion internal to add a parameter to force
>>>>>>>>> reserving RMRR. In this case we can't create a VM if these ranges
>>>>>>>>> conflict with anything. So what about this idea?
>>>>>>>>>
>>>>>>>>
>>>>>>>> Adding a new parameter (e.g. 'check-passthrough') looks the right
>>>>>>>> approach. When the parameter is on, RMRR check/hole punch is
>>>>>>>> activated at VM creation. Otherwise we just keep existing behavior.
>>>>>>>>
>>>>>>>> If user configures device pass-through at creation time, this parameter
>>>>>>>> will be set by default. If user wants the VM capable of device hot-plug,
>>>>>>>> an explicit parameter can be added in the config file to enforce RMRR
>>>>>>>> check at creation time.
>>>>>>>
>>>>>>> Not exactly, I specifically described it slightly differently above. When
>>>>>>> devices get passed through and the option is absent, holes should be
>>>>>>> punched only for the RMRRs associated with those devices (i.e.
>>>>>>> ideally none). Of course this means we'll need a way to associate
>>>>>>> RMRRs with devices in the tool stack and hvmloader, i.e. the current
>>>>>>> XENMEM_reserved_device_memory_map alone won't suffice.
>>>>>>
>>>>>> Yeah, current hypercall just provide RMRR entries without that
>>>>>> associated BDF. And especially, in some cases one range may be shared by
>>>>>> multiple devices...
>>>>>
>>>>> Before we decide who's going to do an eventual change we need to
>>>>> determine what behavior we want, and whether this hypercall is
>>>>> really the right one. Quite possibly we'd need a per-domain view
>>>>> along with the global view, and hence rather than modifying this one
>>>>> we may need to introduce e.g. a new domctl.
>>>>>
>>>>
>>>> If we really need to work with a hypercall, maybe we can introduce a
>>>> little bit to construct that to callback with multiple entries like
>>>> this, for instance,
>>>>
>>>> RMRR entry0 have three devices, and entry1 have two devices,
>>>>
>>>> [start0, nr_pages0, bdf0],
>>>> [start0, nr_pages0, bdf1],
>>>> [start0, nr_pages0, bdf2],
>>>> [start1, nr_pages1, bdf3],
>>>> [start1, nr_pages1, bdf4],
>>>>
>>>> Although its cost more buffers, actually as you know this actual case is
>>>> really rare. So maybe this way can be feasible. Then we don't need
>>>> additional hypercall or xenstore.
>>>
>>> Conceptually, as a MEMOP, it has no business reporting BDFs. And
>>> then rather than returning the same address range more than once,
>>> having the caller supply a handle to an array and storing all of the
>>> SBDFs (or perhaps a single segment would suffice along with all the
>>> BDFs) there would seem to be an approach more consistent with
>>> what we do elsewhere.
>>
>> Here I'm wondering if we really need to expose BDFs to tools. Actually
>> tools just want to know those range no matter who owns these entries. I
>> mean we can do this in Xen.
>
> As pointed out before, in order for the tools to (a) avoid punching
> _all possible_ holes when not asked to but (b) punch holes for all
> devices assigned right at guest boot time, I don't think the tools
> can get away without having some way of identifying the _specific_
> reserved memory regions a guest needs. Whether this works via
> SBDF or some other means is tbd.
>
>> When we try to assign device as passthrough, Xen can get that bdf so Xen
>> can pre-check everything inside that hypercall, and Xen can return
>> something like this,
>>
>> #1 If this device has RMRR, we return that rmrr buffer. This is similar
>> with our current implementation.
>
> "That rmrr buffer" being which? Remember that there may be
> multiple devices associated with RMRRs and multiple devices
> assigned to a guest.

Maybe I didn't describe clearly what I want to do.

I know that multiple devices can be associated with one RMRR, and that
multiple devices can be assigned to one guest.

Firstly, we have a rule that all devices associated with one RMRR may
only be assigned to the same VM, right? So I mean that while we create a
VM we still always call the current hypercall, but inside the hypercall
Xen can know which devices will be assigned to this VM. So Xen still
looks up that RMRR list, but now Xen would check whether each RMRR
belongs to a device we want to assign to this domain. If yes, we just
let that callback go through these RMRRs from that list but exclude the
other, unrelated RMRRs. If not, we don't go through any RMRR info, so
'nr_entries' is also zero.
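
The filtering proposed here can be sketched roughly as follows. This is
not the hypervisor's data model: struct rmrr, struct rdm_range and
filter_rmrrs() are hypothetical, and devices are identified by a bare
16-bit BDF purely for illustration. Each matching range is reported
once, even when several assigned devices share it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record: one RMRR and the devices (BDFs) sharing it. */
struct rmrr {
    uint64_t start_pfn;
    uint64_t nr_pages;
    const uint16_t *bdfs;
    unsigned int nr_bdfs;
};

/* What the caller gets back: just the range, no BDFs. */
struct rdm_range {
    uint64_t start_pfn;
    uint64_t nr_pages;
};

/* True if any device sharing this RMRR is assigned to the domain. */
static bool rmrr_is_relevant(const struct rmrr *r,
                             const uint16_t *assigned,
                             unsigned int nr_assigned)
{
    unsigned int i, j;

    for ( i = 0; i < r->nr_bdfs; i++ )
        for ( j = 0; j < nr_assigned; j++ )
            if ( r->bdfs[i] == assigned[j] )
                return true;

    return false;
}

/*
 * Copy the ranges relevant to the assigned devices into out[] (at most
 * max_out of them) and return how many matched. A result of 0 plays the
 * role of "nr_entries is zero": the caller has nothing to do.
 */
static unsigned int filter_rmrrs(const struct rmrr *all, unsigned int nr_all,
                                 const uint16_t *assigned,
                                 unsigned int nr_assigned,
                                 struct rdm_range *out, unsigned int max_out)
{
    unsigned int i, nr_entries = 0;

    assert(all != NULL || nr_all == 0);
    assert(assigned != NULL || nr_assigned == 0);

    for ( i = 0; i < nr_all; i++ )
    {
        if ( !rmrr_is_relevant(&all[i], assigned, nr_assigned) )
            continue;

        if ( out != NULL && nr_entries < max_out )
        {
            out[nr_entries].start_pfn = all[i].start_pfn;
            out[nr_entries].nr_pages = all[i].nr_pages;
        }
        nr_entries++;
    }

    return nr_entries;
}
```

Note that this only works if the set of assigned devices is already
known at the time of the query.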

Thanks
Tiejun

>
>> #2 If not, we return 'nr_entries' as '0' to notify hvmloader it has
>> nothing to do.
>
> This (as vague as it is) hints in the same domain-specific reserved
> region determination that I was already hinting at.
>
> Jan
>
>

^ permalink raw reply	[flat|nested] 180+ messages in thread

* Re: [v7][RFC][PATCH 06/13] hvmloader/ram: check if guest memory is out of reserved device memory maps
  2014-11-03 11:58                                       ` Chen, Tiejun
@ 2014-11-03 12:34                                         ` Jan Beulich
  2014-11-04  5:05                                           ` Chen, Tiejun
  0 siblings, 1 reply; 180+ messages in thread
From: Jan Beulich @ 2014-11-03 12:34 UTC (permalink / raw)
  To: Tiejun Chen; +Cc: Yang Z Zhang, Kevin Tian, tim, xen-devel

>>> On 03.11.14 at 12:58, <tiejun.chen@intel.com> wrote:
> Firstly we have a rule that we just allow all devices associated one 
> RMRR to be assign same VM, right? So I mean while we create VM, we 
> always call current hypercall but inside hypercall, Xen can know which 
> devices will be assigned to this VM.

I.e. the hypercall (at least optionally) becomes per-domain rather
than global. And you imply that device assignment happens
before memory getting populated (which likely can be arranged
for in the tool stack if that's not already the case, but which isn't
currently mandated by the hypervisor).

Jan

> So Xen still lookup that RMRR list 
> but now Xen would check if these RMRR belongs to that device we want to 
> assign this domain. If yes, we just let that callback go through these 
> RMRR info from that list but exclude other unrelated RMRR. If not, we 
> don't go through any RMRR info so that 'nr_entries' is also zero.
> 
> Thanks
> Tiejun

^ permalink raw reply	[flat|nested] 180+ messages in thread

* Re: [v7][RFC][PATCH 08/13] xen/x86/p2m: set p2m_access_n for reserved device memory mapping
  2014-11-03 11:53                                   ` Jan Beulich
@ 2014-11-04  1:35                                     ` Chen, Tiejun
  2014-11-04  8:02                                       ` Jan Beulich
  0 siblings, 1 reply; 180+ messages in thread
From: Chen, Tiejun @ 2014-11-04  1:35 UTC (permalink / raw)
  To: Jan Beulich; +Cc: yang.z.zhang, kevin.tian, tim, xen-devel

On 2014/11/3 19:53, Jan Beulich wrote:
>>>> On 03.11.14 at 12:48, <tiejun.chen@intel.com> wrote:
>> On 2014/11/3 18:03, Jan Beulich wrote:
>>>>>> On 03.11.14 at 10:51, <tiejun.chen@intel.com> wrote:
>>>> On 2014/11/3 17:00, Jan Beulich wrote:
>>>>>>>> On 03.11.14 at 07:20, <tiejun.chen@intel.com> wrote:
>>>>>> #2 the error handling
>>>>>>
>>>>>> In an error case what should I do? Currently we still create these
>>>>>> mapping as normal. This means these mfns will be valid so later we can't
>>>>>> set them again then device can't be assigned as passthrough. I think
>>>>>> this makes sense. Or we should just stop them from setting 1:1 mapping?
>>>>>
>>>>> You should, with very few exceptions, not ignore errors (which
>>>>> includes "handling" them by just logging a message. Instead, you
>>>>> should propagate the error back up the call chain.
>>>>>
>>>>
>>>> Do you mean in your patch,
>>>>
>>>> +int iommu_get_reserved_device_memory(iommu_grdm_t *func, void *ctxt)
>>>> +{
>>>> +    const struct iommu_ops *ops = iommu_get_ops();
>>>> +
>>>> +    if ( !iommu_enabled || !ops->get_reserved_device_memory )
>>>> +        return 0;
>>>> +
>>>> +    return ops->get_reserved_device_memory(func, ctxt);
>>>> +}
>>>> +
>>>>
>>>> I shouldn't return that directly. Then instead, we should handle all
>>>> error scenarios here?
>>>
>>> No. All error scenarios are already being handled here (by
>>> propagating the error code to the caller).
>>
>> Sorry, how to propagate the error code?
>
> Return it to the caller (and it will do so onwards, until it reaches
> [presumably] the entity having invoked a hypercall).

I guess you mean we should bail out in this error case:

@@ -686,8 +686,25 @@ guest_physmap_add_entry(struct domain *d, unsigned long gfn,
      /* Now, actually do the two-way mapping */
      if ( mfn_valid(_mfn(mfn)) )
      {
-        rc = p2m_set_entry(p2m, gfn, _mfn(mfn), page_order, t,
-                           p2m->default_access);
+        rc = 0;
+        a =  p2m->default_access;
+        if ( !is_hardware_domain(d) )
+        {
+            rc = iommu_get_reserved_device_memory(p2m_check_reserved_device_memory,
+                                                  &gfn);
+            /* We always avoid populating reserved device memory. */
+            if ( rc == 1 )
+                goto out;
+            else if ( rc < 0 )
+            {
+                printk(XENLOG_G_WARNING
+                       "Dom%d can't check reserved device memory.\n",
+                       d->domain_id);
+                goto out;
+            }
+        }
+
+        rc = p2m_set_entry(p2m, gfn, _mfn(mfn), page_order, t, a);
          if ( rc )
              goto out; /* Failed to update p2m, bail without updating m2p. */

Thanks
Tiejun

^ permalink raw reply	[flat|nested] 180+ messages in thread

* Re: [v7][RFC][PATCH 06/13] hvmloader/ram: check if guest memory is out of reserved device memory maps
  2014-11-03 12:34                                         ` Jan Beulich
@ 2014-11-04  5:05                                           ` Chen, Tiejun
  2014-11-04  7:54                                             ` Jan Beulich
  0 siblings, 1 reply; 180+ messages in thread
From: Chen, Tiejun @ 2014-11-04  5:05 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Yang Z Zhang, Kevin Tian, tim, xen-devel

On 2014/11/3 20:34, Jan Beulich wrote:
>>>> On 03.11.14 at 12:58, <tiejun.chen@intel.com> wrote:
>> Firstly we have a rule that we just allow all devices associated one
>> RMRR to be assign same VM, right? So I mean while we create VM, we
>> always call current hypercall but inside hypercall, Xen can know which
>> devices will be assigned to this VM.
>
> I.e. the hypercall (at least optionally) becomes per-domain rather
> than global. And you imply that device assignment happens
> before memory getting populated (which likely can be arranged

I tried to find a clue about this point, but unfortunately I can't trace
exactly when we assign a device. In theory, based on your hint, I'd
expect device assignment to follow memory population: when we add a
device we need to create the iommu mappings, which means that by then
the guest should already have finished populating memory, right?

Thanks
Tiejun

> for in the tool stack if that's not already the case, but which isn't
> currently mandated by the hypervisor).
>
> Jan
>
>> So Xen still lookup that RMRR list
>> but now Xen would check if these RMRR belongs to that device we want to
>> assign this domain. If yes, we just let that callback go through these
>> RMRR info from that list but exclude other unrelated RMRR. If not, we
>> don't go through any RMRR info so that 'nr_entries' is also zero.
>>
>> Thanks
>> Tiejun
>
>
>

^ permalink raw reply	[flat|nested] 180+ messages in thread

* Re: [v7][RFC][PATCH 06/13] hvmloader/ram: check if guest memory is out of reserved device memory maps
  2014-11-04  5:05                                           ` Chen, Tiejun
@ 2014-11-04  7:54                                             ` Jan Beulich
  2014-11-05  2:59                                               ` Chen, Tiejun
  0 siblings, 1 reply; 180+ messages in thread
From: Jan Beulich @ 2014-11-04  7:54 UTC (permalink / raw)
  To: Tiejun Chen; +Cc: Yang Z Zhang, Kevin Tian, tim, xen-devel

>>> On 04.11.14 at 06:05, <tiejun.chen@intel.com> wrote:
> On 2014/11/3 20:34, Jan Beulich wrote:
>>>>> On 03.11.14 at 12:58, <tiejun.chen@intel.com> wrote:
>>> Firstly we have a rule that we just allow all devices associated one
>>> RMRR to be assign same VM, right? So I mean while we create VM, we
>>> always call current hypercall but inside hypercall, Xen can know which
>>> devices will be assigned to this VM.
>>
>> I.e. the hypercall (at least optionally) becomes per-domain rather
>> than global. And you imply that device assignment happens
>> before memory getting populated (which likely can be arranged
> 
> I tried to find a clue about this point but unfortunately I can't trace 
> when we assign device exactly. But in theory, based on your hint I 
> prefer the device assignment should follow memory getting populated. 
> Because when we add a device, we need to create iommu map so this means 
> at this moment the guest should already finish populating memory, right?

There's no such strong connection: When a device gets assigned,
IOMMU mappings get created for all memory the guest already has
assigned (which at least in theory can include the "none" case).
When (more) memory gets assigned after a device was already
assigned to the guest, the IOMMU mappings would simply get
updated.
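
The ordering independence described above can be illustrated with a toy
model. This is only a sketch: struct toy_guest, toy_assign_device() and
toy_populate() are hypothetical names invented for illustration and are
not Xen code; the model merely counts pages instead of building real
mappings.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy model of a guest; this is not Xen's struct domain. */
struct toy_guest {
    bool has_device;             /* a passthrough device is assigned */
    unsigned long nr_pages;      /* pages populated in the p2m */
    unsigned long nr_iommu_maps; /* pages currently mapped in the IOMMU */
};

/* Assigning a device maps whatever memory the guest already has (maybe none). */
static void toy_assign_device(struct toy_guest *g)
{
    assert(g);
    g->has_device = true;
    g->nr_iommu_maps = g->nr_pages;
}

/* Populating memory updates the IOMMU mappings only if a device is present. */
static void toy_populate(struct toy_guest *g, unsigned long nr)
{
    assert(g);
    g->nr_pages += nr;
    if ( g->has_device )
        g->nr_iommu_maps += nr;
}
```

In this model both orders end with the same set of IOMMU mappings, which
is why neither order is mandated by the mapping logic itself.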

While I think you're right in that memory assignment happens
before device assignment, for your specific purpose it might have
been easier the other way around, since when memory gets
populated first you'll need special peeking into which devices will
get assigned later in order to avoid the respective RMRR areas,
or you'll need to modify device assignment code to move the
RAM populated there out of the way.

Jan

^ permalink raw reply	[flat|nested] 180+ messages in thread

* Re: [v7][RFC][PATCH 08/13] xen/x86/p2m: set p2m_access_n for reserved device memory mapping
  2014-11-04  1:35                                     ` Chen, Tiejun
@ 2014-11-04  8:02                                       ` Jan Beulich
  2014-11-04 10:41                                         ` Chen, Tiejun
  0 siblings, 1 reply; 180+ messages in thread
From: Jan Beulich @ 2014-11-04  8:02 UTC (permalink / raw)
  To: Tiejun Chen; +Cc: yang.z.zhang, kevin.tian, tim, xen-devel

>>> On 04.11.14 at 02:35, <tiejun.chen@intel.com> wrote:
> On 2014/11/3 19:53, Jan Beulich wrote:
>>>>> On 03.11.14 at 12:48, <tiejun.chen@intel.com> wrote:
>>> On 2014/11/3 18:03, Jan Beulich wrote:
>>>>>>> On 03.11.14 at 10:51, <tiejun.chen@intel.com> wrote:
>>>>> On 2014/11/3 17:00, Jan Beulich wrote:
>>>>>>>>> On 03.11.14 at 07:20, <tiejun.chen@intel.com> wrote:
>>>>>>> #2 the error handling
>>>>>>>
>>>>>>> In an error case what should I do? Currently we still create these
>>>>>>> mapping as normal. This means these mfns will be valid so later we can't
>>>>>>> set them again then device can't be assigned as passthrough. I think
>>>>>>> this makes sense. Or we should just stop them from setting 1:1 mapping?
>>>>>>
>>>>>> You should, with very few exceptions, not ignore errors (which
>>>>>> includes "handling" them by just logging a message. Instead, you
>>>>>> should propagate the error back up the call chain.
>>>>>>
>>>>>
>>>>> Do you mean in your patch,
>>>>>
>>>>> +int iommu_get_reserved_device_memory(iommu_grdm_t *func, void *ctxt)
>>>>> +{
>>>>> +    const struct iommu_ops *ops = iommu_get_ops();
>>>>> +
>>>>> +    if ( !iommu_enabled || !ops->get_reserved_device_memory )
>>>>> +        return 0;
>>>>> +
>>>>> +    return ops->get_reserved_device_memory(func, ctxt);
>>>>> +}
>>>>> +
>>>>>
>>>>> I shouldn't return that directly. Then instead, we should handle all
>>>>> error scenarios here?
>>>>
>>>> No. All error scenarios are already being handled here (by
>>>> propagating the error code to the caller).
>>>
>>> Sorry, how to propagate the error code?
>>
>> Return it to the caller (and it will do so onwards, until it reaches
>> [presumably] the entity having invoked a hypercall).
> 
> I guess you mean we should return out in this error case,

Yes!

> @@ -686,8 +686,25 @@ guest_physmap_add_entry(struct domain *d, unsigned long gfn,
>       /* Now, actually do the two-way mapping */
>       if ( mfn_valid(_mfn(mfn)) )
>       {
> -        rc = p2m_set_entry(p2m, gfn, _mfn(mfn), page_order, t,
> -                           p2m->default_access);
> +        rc = 0;
> +        a =  p2m->default_access;
> +        if ( !is_hardware_domain(d) )
> +        {
> +            rc = iommu_get_reserved_device_memory(p2m_check_reserved_device_memory,
> +                                                  &gfn);
> +            /* We always avoid populating reserved device memory. */
> +            if ( rc == 1 )
> +                goto out;

But you'll need to make sure that you don't return 1 to the callers:
They expect 0 or negative error codes. But with the model of
not even populating these regions (or relocating the memory
before [at boot time] assigning a device associated with an RMRR)
I think this needs to become an error anyway.

> +            else if ( rc < 0 )
> +            {
> +                printk(XENLOG_G_WARNING
> +                       "Dom%d can't check reserved device memory.\n",

Actually, d being the subject domain, please make this more like
"Can't check reserved device memory for Dom%d\n".

Jan

> +                       d->domain_id);
> +                goto out;
> +            }
> +        }
> +
> +        rc = p2m_set_entry(p2m, gfn, _mfn(mfn), page_order, t, a);
>           if ( rc )
>               goto out; /* Failed to update p2m, bail without updating m2p. */
> 
> Thanks
> Tiejun

^ permalink raw reply	[flat|nested] 180+ messages in thread

* Re: [v7][RFC][PATCH 08/13] xen/x86/p2m: set p2m_access_n for reserved device memory mapping
  2014-11-04  8:02                                       ` Jan Beulich
@ 2014-11-04 10:41                                         ` Chen, Tiejun
  2014-11-04 11:41                                           ` Jan Beulich
  0 siblings, 1 reply; 180+ messages in thread
From: Chen, Tiejun @ 2014-11-04 10:41 UTC (permalink / raw)
  To: Jan Beulich; +Cc: yang.z.zhang, kevin.tian, tim, xen-devel

On 2014/11/4 16:02, Jan Beulich wrote:
>>>> On 04.11.14 at 02:35, <tiejun.chen@intel.com> wrote:
>> On 2014/11/3 19:53, Jan Beulich wrote:
>>>>>> On 03.11.14 at 12:48, <tiejun.chen@intel.com> wrote:
>>>> On 2014/11/3 18:03, Jan Beulich wrote:
>>>>>>>> On 03.11.14 at 10:51, <tiejun.chen@intel.com> wrote:
>>>>>> On 2014/11/3 17:00, Jan Beulich wrote:
>>>>>>>>>> On 03.11.14 at 07:20, <tiejun.chen@intel.com> wrote:
>>>>>>>> #2 the error handling
>>>>>>>>
>>>>>>>> In an error case what should I do? Currently we still create these
>>>>>>>> mapping as normal. This means these mfns will be valid so later we can't
>>>>>>>> set them again then device can't be assigned as passthrough. I think
>>>>>>>> this makes sense. Or we should just stop them from setting 1:1 mapping?
>>>>>>>
>>>>>>> You should, with very few exceptions, not ignore errors (which
>>>>>>> includes "handling" them by just logging a message. Instead, you
>>>>>>> should propagate the error back up the call chain.
>>>>>>>
>>>>>>
>>>>>> Do you mean in your patch,
>>>>>>
>>>>>> +int iommu_get_reserved_device_memory(iommu_grdm_t *func, void *ctxt)
>>>>>> +{
>>>>>> +    const struct iommu_ops *ops = iommu_get_ops();
>>>>>> +
>>>>>> +    if ( !iommu_enabled || !ops->get_reserved_device_memory )
>>>>>> +        return 0;
>>>>>> +
>>>>>> +    return ops->get_reserved_device_memory(func, ctxt);
>>>>>> +}
>>>>>> +
>>>>>>
>>>>>> I shouldn't return that directly. Then instead, we should handle all
>>>>>> error scenarios here?
>>>>>
>>>>> No. All error scenarios are already being handled here (by
>>>>> propagating the error code to the caller).
>>>>
>>>> Sorry, how to propagate the error code?
>>>
>>> Return it to the caller (and it will do so onwards, until it reaches
>>> [presumably] the entity having invoked a hypercall).
>>
>> I guess you mean we should return out in this error case,
>
> Yes!
>
>> @@ -686,8 +686,25 @@ guest_physmap_add_entry(struct domain *d, unsigned long gfn,
>>        /* Now, actually do the two-way mapping */
>>        if ( mfn_valid(_mfn(mfn)) )
>>        {
>> -        rc = p2m_set_entry(p2m, gfn, _mfn(mfn), page_order, t,
>> -                           p2m->default_access);
>> +        rc = 0;
>> +        a =  p2m->default_access;
>> +        if ( !is_hardware_domain(d) )
>> +        {
>> +            rc = iommu_get_reserved_device_memory(p2m_check_reserved_device_memory,
>> +                                                  &gfn);
>> +            /* We always avoid populating reserved device memory. */
>> +            if ( rc == 1 )
>> +                goto out;
>
> But you'll need to make sure that you don't return 1 to the callers:

You're right.

> They expect 0 or negative error codes. But with the model of
> not even populating these regions (or relocating the memory
> before [at boot time] assigning a device associated with an RMRR)
> I think this needs to become an error anyway.

Looks -EINVAL might be used.

>
>> +            else if ( rc < 0 )
>> +            {
>> +                printk(XENLOG_G_WARNING
>> +                       "Dom%d can't check reserved device memory.\n",
>
> Actually, d being the subject domain, please make this more like
> "Can't check reserved device memory for Dom%d\n".

Fixed.

Thanks
Tiejun

>
> Jan
>
>> +                       d->domain_id);
>> +                goto out;
>> +            }
>> +        }
>> +
>> +        rc = p2m_set_entry(p2m, gfn, _mfn(mfn), page_order, t, a);
>>            if ( rc )
>>                goto out; /* Failed to update p2m, bail without updating m2p. */
>>
>> Thanks
>> Tiejun
>
>
>
>
>

^ permalink raw reply	[flat|nested] 180+ messages in thread

* Re: [v7][RFC][PATCH 08/13] xen/x86/p2m: set p2m_access_n for reserved device memory mapping
  2014-11-04 10:41                                         ` Chen, Tiejun
@ 2014-11-04 11:41                                           ` Jan Beulich
  2014-11-04 11:51                                             ` Chen, Tiejun
  0 siblings, 1 reply; 180+ messages in thread
From: Jan Beulich @ 2014-11-04 11:41 UTC (permalink / raw)
  To: Tiejun Chen; +Cc: yang.z.zhang, kevin.tian, tim, xen-devel

>>> On 04.11.14 at 11:41, <tiejun.chen@intel.com> wrote:
> On 2014/11/4 16:02, Jan Beulich wrote:
>>>>> On 04.11.14 at 02:35, <tiejun.chen@intel.com> wrote:
>>> @@ -686,8 +686,25 @@ guest_physmap_add_entry(struct domain *d, unsigned long gfn,
>>>        /* Now, actually do the two-way mapping */
>>>        if ( mfn_valid(_mfn(mfn)) )
>>>        {
>>> -        rc = p2m_set_entry(p2m, gfn, _mfn(mfn), page_order, t,
>>> -                           p2m->default_access);
>>> +        rc = 0;
>>> +        a =  p2m->default_access;
>>> +        if ( !is_hardware_domain(d) )
>>> +        {
>>> +            rc = iommu_get_reserved_device_memory(p2m_check_reserved_device_memory,
>>> +                                                  &gfn);
>>> +            /* We always avoid populating reserved device memory. */
>>> +            if ( rc == 1 )
>>> +                goto out;
>>
>> But you'll need to make sure that you don't return 1 to the callers:
> 
> You're right.
> 
>> They expect 0 or negative error codes. But with the model of
>> not even populating these regions (or relocating the memory
>> before [at boot time] assigning a device associated with an RMRR)
>> I think this needs to become an error anyway.
> 
> Looks -EINVAL might be used.

Hmm, -EINVAL is being used way too frequently already. -EBUSY
would make it at least a little bit more distinct.
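
A minimal sketch of that return convention, under stated assumptions:
check_rdm() is a hypothetical stand-in for the
iommu_get_reserved_device_memory() check (1 = gfn is reserved, 0 = not
reserved, negative = the check itself failed), and physmap_add_checked()
is an invented wrapper, not the actual guest_physmap_add_entry().

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical reserved range: [start, start + nr) in guest frame numbers. */
struct rdm_range { unsigned long start, nr; };

/*
 * Hypothetical stand-in for the reserved-device-memory check: returns 1 if
 * gfn lies inside a reserved range, 0 if it does not, or a negative errno
 * value if the check itself could not be performed.
 */
static int check_rdm(const struct rdm_range *r, unsigned int n,
                     unsigned long gfn)
{
    unsigned int i;

    if ( !r && n )
        return -EINVAL;

    for ( i = 0; i < n; i++ )
        if ( gfn >= r[i].start && gfn - r[i].start < r[i].nr )
            return 1;

    return 0;
}

/* Callers only ever see 0 or a negative error code, never the internal 1. */
static int physmap_add_checked(const struct rdm_range *r, unsigned int n,
                               unsigned long gfn)
{
    int rc = check_rdm(r, n, gfn);

    assert(rc <= 1);     /* check_rdm() never returns anything above 1 */

    if ( rc > 0 )
        return -EBUSY;   /* gfn overlaps reserved device memory */
    if ( rc < 0 )
        return rc;       /* propagate the failure unchanged */

    /* ... the real code would go on to set the p2m entry here ... */
    return 0;
}
```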

Jan

^ permalink raw reply	[flat|nested] 180+ messages in thread

* Re: [v7][RFC][PATCH 08/13] xen/x86/p2m: set p2m_access_n for reserved device memory mapping
  2014-11-04 11:41                                           ` Jan Beulich
@ 2014-11-04 11:51                                             ` Chen, Tiejun
  0 siblings, 0 replies; 180+ messages in thread
From: Chen, Tiejun @ 2014-11-04 11:51 UTC (permalink / raw)
  To: Jan Beulich; +Cc: yang.z.zhang, kevin.tian, tim, xen-devel

On 2014/11/4 19:41, Jan Beulich wrote:
>>>> On 04.11.14 at 11:41, <tiejun.chen@intel.com> wrote:
>> On 2014/11/4 16:02, Jan Beulich wrote:
>>>>>> On 04.11.14 at 02:35, <tiejun.chen@intel.com> wrote:
>>>> @@ -686,8 +686,25 @@ guest_physmap_add_entry(struct domain *d, unsigned long gfn,
>>>>         /* Now, actually do the two-way mapping */
>>>>         if ( mfn_valid(_mfn(mfn)) )
>>>>         {
>>>> -        rc = p2m_set_entry(p2m, gfn, _mfn(mfn), page_order, t,
>>>> -                           p2m->default_access);
>>>> +        rc = 0;
>>>> +        a =  p2m->default_access;
>>>> +        if ( !is_hardware_domain(d) )
>>>> +        {
>>>> +            rc = iommu_get_reserved_device_memory(p2m_check_reserved_device_memory,
>>>> +                                                  &gfn);
>>>> +            /* We always avoid populating reserved device memory. */
>>>> +            if ( rc == 1 )
>>>> +                goto out;
>>>
>>> But you'll need to make sure that you don't return 1 to the callers:
>>
>> You're right.
>>
>>> They expect 0 or negative error codes. But with the model of
>>> not even populating these regions (or relocating the memory
>>> before [at boot time] assigning a device associated with an RMRR)
>>> I think this needs to become an error anyway.
>>
>> Looks -EINVAL might be used.
>
> Hmm, -EINVAL is being used way too frequently already. -EBUSY
> would make it at least a little bit more distinct.

Yeah, this also makes sense.

Thanks
Tiejun

^ permalink raw reply	[flat|nested] 180+ messages in thread

* Re: [v7][RFC][PATCH 06/13] hvmloader/ram: check if guest memory is out of reserved device memory maps
  2014-11-04  7:54                                             ` Jan Beulich
@ 2014-11-05  2:59                                               ` Chen, Tiejun
  2014-11-05 17:00                                                 ` Jan Beulich
  0 siblings, 1 reply; 180+ messages in thread
From: Chen, Tiejun @ 2014-11-05  2:59 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Yang Z Zhang, Kevin Tian, tim, xen-devel

On 2014/11/4 15:54, Jan Beulich wrote:
>>>> On 04.11.14 at 06:05, <tiejun.chen@intel.com> wrote:
>> On 2014/11/3 20:34, Jan Beulich wrote:
>>>>>> On 03.11.14 at 12:58, <tiejun.chen@intel.com> wrote:
>>>> Firstly we have a rule that we just allow all devices associated one
>>>> RMRR to be assign same VM, right? So I mean while we create VM, we
>>>> always call current hypercall but inside hypercall, Xen can know which
>>>> devices will be assigned to this VM.
>>>
>>> I.e. the hypercall (at least optionally) becomes per-domain rather
>>> than global. And you imply that device assignment happens
>>> before memory getting populated (which likely can be arranged
>>
>> I tried to find a clue about this point but unfortunately I can't trace
>> when we assign device exactly. But in theory, based on your hint I
>> prefer the device assignment should follow memory getting populated.
>> Because when we add a device, we need to create iommu map so this means
>> at this moment the guest should already finish populating memory, right?
>
> There's no such strong connection: When a device gets assigned,
> IOMMU mappings get created for all memory the guest already has
> assigned (which at least in theory can include the "none" case).
> When (more) memory gets assigned after a device was already
> assigned to the guest, the IOMMU mappings would simply get
> updated.
>
> While I think you're right in that memory assignment happens
> before device assignment, for your specific purpose it might have

Yeah, I also double checked this sequence with printk.

> been easier the other way around, since when memory gets
> populated first you'll need special peeking into which devices will
> get assigned later in order to avoid the respective RMRR areas,
> or you'll need to modify device assignment code to move the
> RAM populated there out of the way.
>

Although I don't fully grasp the device assignment code, it looks like
moving the device assignment code may break some existing frameworks.

I'm not sure if you already have a definitive solution. If yes, please
guide me and just ignore the rest.

But if not, I have a preliminary picture as follows:

#1 Policy

We will introduce a global parameter, 'pci_rdmforce', which can be set
in the config file, like

pci_rdmforce = 1 => Of course this should be 0 by default.

'1' means we force a check and reserve all RMRR ranges. If that fails,
the VM won't be created. This also gives the user a chance to work well
with later hotplug, even if no device is assigned while creating the VM.

But we can override that for one specific pci device:

pci = ['AA:BB.CC,rdmforce=0/1']

This 'rdmforce' should be 1 by default, since obviously any passthrough
device always needs this. Actually no one really wants to set it to '0',
so it may be unnecessary, but I'd like to leave this as a potential
approach.

So if 'pci_rdmforce = 0' and 'rdmforce = 1', we just check and reserve
the ranges associated with this device.

And in any case where 'pci_rdmforce = 0' leaves us with nothing to do,
assigning a device with RMRR is more likely to fail, both when adding it
at creation and when attaching it later. So we have to reject the
assignment if the ranges overlap. Actually this just depends on whether
we can set those identity p2m mappings.
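
A rough sketch of how that policy decision could look, purely for
illustration: struct toy_rmrr, struct toy_pt_dev and
rmrr_needs_reserving() are hypothetical names, and matching an RMRR to
its owner by a single BDF value is a simplifying assumption (a real RMRR
may be associated with several devices).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical: one RMRR, identified only by the BDF of the owning device. */
struct toy_rmrr { unsigned int owner_bdf; };

/* Hypothetical: one passthrough device from the 'pci = [...]' list. */
struct toy_pt_dev { unsigned int bdf; bool rdmforce; };

/*
 * Decide whether an RMRR must be reserved in the guest memory map:
 *  - pci_rdmforce = 1: reserve every RMRR, regardless of devices;
 *  - pci_rdmforce = 0: reserve only RMRRs owned by a device that has
 *    rdmforce = 1.
 */
static bool rmrr_needs_reserving(bool pci_rdmforce,
                                 const struct toy_pt_dev *devs, size_t nr_devs,
                                 const struct toy_rmrr *rmrr)
{
    size_t i;

    assert(rmrr);

    if ( pci_rdmforce )
        return true;

    for ( i = 0; i < nr_devs; i++ )
        if ( devs[i].bdf == rmrr->owner_bdf && devs[i].rdmforce )
            return true;

    return false;
}
```

Any RMRR for which this returns false stays unreserved, so a later
assignment of its device has to be rejected if the range overlaps guest
memory.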

#2 How-to

Firstly, if any RMRR is found while parsing ACPI, we should actively
print a message recommending that 'pci_rdmforce' be set for passthrough
to work.

Next, we need a new domctl to pass this info, 'pci_rdmforce' and
'rdmforce', when parsing the config file. Certainly we may need to
introduce and set two new fields in struct domain, and from then on we
just use these fields inside our existing iommu callback to support our
policy. I mean we just expose the associated RMRR entries. I think it's
easy to implement this since inside Xen we know which entry is owned by
which device. So this lets us avoid modifying any tools code and most of
the Xen code we already addressed.

So does this look feasible to you? Or is there anything else I'm missing?

Thanks
Tiejun

^ permalink raw reply	[flat|nested] 180+ messages in thread

* Re: [v7][RFC][PATCH 06/13] hvmloader/ram: check if guest memory is out of reserved device memory maps
  2014-11-05  2:59                                               ` Chen, Tiejun
@ 2014-11-05 17:00                                                 ` Jan Beulich
  2014-11-06  9:28                                                   ` Chen, Tiejun
  0 siblings, 1 reply; 180+ messages in thread
From: Jan Beulich @ 2014-11-05 17:00 UTC (permalink / raw)
  To: tiejun.chen; +Cc: yang.z.zhang, kevin.tian, tim, xen-devel

>>> "Chen, Tiejun" <tiejun.chen@intel.com> 11/05/14 3:59 AM >>>

Everything up to here sounded reasonable.

>Next, we need a new domctl to provide these info, 'pci_rdmforce' and 
>'rdm_force' when parse the config file. Certainly we may need to 
>introduce and set two new fields in strcut domain, and since then we 
>just use these fields to modify our current existing iommu callback 
>inside to support our policy. I mean we just expose those associated 
>RMRR entry. I think its easy to implement this since inside Xen we can 
>know which entry is owned by which device. So this can benefit us to 
>avoid modifying any tools codes and most Xen codes we already addressed.

Whether a domctl is the right approach I can't really tell with this
somewhat vague description.

Jan

^ permalink raw reply	[flat|nested] 180+ messages in thread

* Re: [v7][RFC][PATCH 06/13] hvmloader/ram: check if guest memory is out of reserved device memory maps
  2014-11-05 17:00                                                 ` Jan Beulich
@ 2014-11-06  9:28                                                   ` Chen, Tiejun
  2014-11-06 10:06                                                     ` Jan Beulich
  0 siblings, 1 reply; 180+ messages in thread
From: Chen, Tiejun @ 2014-11-06  9:28 UTC (permalink / raw)
  To: Jan Beulich; +Cc: yang.z.zhang, kevin.tian, tim, xen-devel

On 2014/11/6 1:00, Jan Beulich wrote:
>>>> "Chen, Tiejun" <tiejun.chen@intel.com> 11/05/14 3:59 AM >>>
>
> Everything up to here sounded reasonable.
>
>> Next, we need a new domctl to provide these info, 'pci_rdmforce' and
>> 'rdm_force' when parse the config file. Certainly we may need to
>> introduce and set two new fields in strcut domain, and since then we
>> just use these fields to modify our current existing iommu callback
>> inside to support our policy. I mean we just expose those associated
>> RMRR entry. I think its easy to implement this since inside Xen we can
>> know which entry is owned by which device. So this can benefit us to
>> avoid modifying any tools codes and most Xen codes we already addressed.
>
> Whether a domctl is the right approach I can't really tell with this
> somewhat vague description.
>

Based on our current patch, I generated a draft patch as follows to
show my idea. Note I have only compile-tested this.

diff --git a/tools/libxc/xc_domain.c b/tools/libxc/xc_domain.c
index 009e351..ff22228 100644
--- a/tools/libxc/xc_domain.c
+++ b/tools/libxc/xc_domain.c
@@ -1642,6 +1642,21 @@ int xc_assign_device(
      return do_domctl(xch, &domctl);
  }

+int xc_domain_device_setrdm(xc_interface *xch,
+                            uint32_t domid,
+                            uint32_t pci_rdmforce,
+                            uint32_t pci_dev_bitmap)
+{
+    DECLARE_DOMCTL;
+
+    domctl.cmd = XEN_DOMCTL_set_rdm;
+    domctl.domain = domid;
+    domctl.u.set_rdm.pci_rdmforce = pci_rdmforce;
+    domctl.u.set_rdm.pci_dev_bitmap = pci_dev_bitmap;
+
+    return do_domctl(xch, &domctl);
+}
+
  int xc_get_device_group(
      xc_interface *xch,
      uint32_t domid,
diff --git a/tools/libxl/libxl_dm.c b/tools/libxl/libxl_dm.c
index 3e191c3..ca8b754 100644
--- a/tools/libxl/libxl_dm.c
+++ b/tools/libxl/libxl_dm.c
@@ -90,6 +90,19 @@ const char *libxl__domain_device_model(libxl__gc *gc,
      return dm;
  }

+int libxl__domain_device_setrdm(libxl__gc *gc,
+                                const libxl_domain_build_info *info,
+                                uint32_t dm_domid)
+{
+    int ret;
+    libxl_ctx *ctx = libxl__gc_owner(gc);
+
+    ret = xc_domain_device_setrdm(ctx->xch, dm_domid, info->pci_rdmforce,
+                                  info->pci_dev_bitmap);
+
+    return ret;
+}
+
  const libxl_vnc_info *libxl__dm_vnc(const libxl_domain_config *guest_config)
  {
      const libxl_vnc_info *vnc = NULL;
diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h
index 4361421..06938ee 100644
--- a/tools/libxl/libxl_internal.h
+++ b/tools/libxl/libxl_internal.h
@@ -1471,6 +1471,11 @@ _hidden int libxl__domain_build(libxl__gc *gc,
  /* for device model creation */
  _hidden const char *libxl__domain_device_model(libxl__gc *gc,
                                          const libxl_domain_build_info *info);
+
+_hidden int libxl__domain_device_setrdm(libxl__gc *gc,
+                                        const libxl_domain_build_info *info,
+                                        uint32_t domid);
+
  _hidden int libxl__need_xenpv_qemu(libxl__gc *gc,
          int nr_consoles, libxl__device_console *consoles,
          int nr_vfbs, libxl_device_vfb *vfbs,
diff --git a/tools/libxl/libxl_types.idl b/tools/libxl/libxl_types.idl
index ca3f724..ed20823 100644
--- a/tools/libxl/libxl_types.idl
+++ b/tools/libxl/libxl_types.idl
@@ -398,6 +398,8 @@ libxl_domain_build_info = Struct("domain_build_info",[
      ("kernel",           string),
      ("cmdline",          string),
      ("ramdisk",          string),
+    ("pci_rdmforce",   uint32),
+    ("pci_dev_bitmap",   uint32),
      ("u", KeyedUnion(None, libxl_domain_type, "type",
                  [("hvm", Struct(None, [("firmware",         string),
                                         ("bios", libxl_bios_type),
@@ -518,6 +520,7 @@ libxl_device_pci = Struct("device_pci", [
      ("power_mgmt", bool),
      ("permissive", bool),
      ("seize", bool),
+    ("rdmforce", bool),
      ])

  libxl_device_vtpm = Struct("device_vtpm", [
diff --git a/tools/libxl/libxlu_pci.c b/tools/libxl/libxlu_pci.c
index 26fb143..989eac8 100644
--- a/tools/libxl/libxlu_pci.c
+++ b/tools/libxl/libxlu_pci.c
@@ -143,6 +143,8 @@ int xlu_pci_parse_bdf(XLU_Config *cfg, libxl_device_pci *pcidev, const char *str
                      pcidev->permissive = atoi(tok);
                  }else if ( !strcmp(optkey, "seize") ) {
                      pcidev->seize = atoi(tok);
+                }else if ( !strcmp(optkey, "rdmforce") ) {
+                    pcidev->rdmforce = atoi(tok);
                  }else{
                      XLU__PCI_ERR(cfg, "Unknown PCI BDF option: %s", optkey);
                  }
diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c
index 3c9f146..64a1e51 100644
--- a/tools/libxl/xl_cmdimpl.c
+++ b/tools/libxl/xl_cmdimpl.c
@@ -904,6 +904,7 @@ static void replace_string(char **str, const char *val)
      *str = xstrdup(val);
  }

+#define PCI_BDF(b,d,f) ((((b) & 0xff) << 8) | PCI_DEVFN(d,f))
  static void parse_config_data(const char *config_source,
                                const char *config_data,
                                int config_len,
@@ -919,6 +920,7 @@ static void parse_config_data(const char *config_source,
      int pci_msitranslate = 0;
      int pci_permissive = 0;
      int pci_seize = 0;
+    int pci_rdmforce = 0;
      int i, e;

      libxl_domain_create_info *c_info = &d_config->c_info;
@@ -1699,6 +1701,9 @@ skip_vfb:
      if (!xlu_cfg_get_long (config, "pci_seize", &l, 0))
          pci_seize = l;

+    if (!xlu_cfg_get_long (config, "pci_rdmforce", &l, 0))
+        pci_rdmforce = l;
+
      /* To be reworked (automatically enabled) once the auto ballooning
       * after guest starts is done (with PCI devices passed in). */
      if (c_info->type == LIBXL_DOMAIN_TYPE_PV) {
@@ -1719,9 +1724,16 @@ skip_vfb:
              pcidev->power_mgmt = pci_power_mgmt;
              pcidev->permissive = pci_permissive;
              pcidev->seize = pci_seize;
+            pcidev->rdmforce = pci_rdmforce;
              if (!xlu_pci_parse_bdf(config, pcidev, buf))
+            {
+                /* Just fake this with a bitmap. */
+                b_info->pci_dev_bitmap |= 1 << PCI_BDF(pcidev->bus, pcidev->dev,
+                                                       pcidev->func);
                  d_config->num_pcidevs++;
+            }
          }
+        b_info->pci_rdmforce = pci_rdmforce;
          if (d_config->num_pcidevs && c_info->type == LIBXL_DOMAIN_TYPE_PV)
              libxl_defbool_set(&b_info->u.pv.e820_host, true);
      }
diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
index dc18f1b..3bbd28f 100644
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -2434,6 +2434,7 @@ static void ept_handle_violation(unsigned long qualification, paddr_t gpa)
       * handle such a scenario as its own logic.
       */
      ret = iommu_get_reserved_device_memory(p2m_check_reserved_device_memory,
+                                           d,
                                             &gfn);
      if ( ret )
      {
diff --git a/xen/arch/x86/mm/p2m.c b/xen/arch/x86/mm/p2m.c
index 113d996..ba489ce 100644
--- a/xen/arch/x86/mm/p2m.c
+++ b/xen/arch/x86/mm/p2m.c
@@ -691,6 +691,7 @@ guest_physmap_add_entry(struct domain *d, unsigned long gfn,
          if ( !is_hardware_domain(d) )
          {
              rc = iommu_get_reserved_device_memory(p2m_check_reserved_device_memory,
+                                                  d,
                                                    &gfn);
              /* We always avoid populating reserved device memory. */
              if ( rc == 1 )
diff --git a/xen/common/compat/memory.c b/xen/common/compat/memory.c
index af613b9..cd99ba7 100644
--- a/xen/common/compat/memory.c
+++ b/xen/common/compat/memory.c
@@ -315,6 +315,7 @@ int compat_memory_op(unsigned int cmd, XEN_GUEST_HANDLE_PARAM(void) compat)

              grdm.used_entries = 0;
              rc = iommu_get_reserved_device_memory(get_reserved_device_memory,
+                                                  current->domain,
                                                    &grdm);

              if ( !rc && grdm.map.nr_entries < grdm.used_entries )
diff --git a/xen/common/mem_access.c b/xen/common/mem_access.c
index 4c84f88..e7973ee 100644
--- a/xen/common/mem_access.c
+++ b/xen/common/mem_access.c
@@ -69,6 +69,7 @@ static int mem_access_check_rdm(struct domain *d, uint64_aligned_t start,
          {
              gfn = start + i;
              rc = iommu_get_reserved_device_memory(p2m_check_reserved_device_memory,
+                                                  d,
                                                    &gfn);
              if ( rc < 0 )
                  printk(XENLOG_WARNING
diff --git a/xen/common/memory.c b/xen/common/memory.c
index 2177c56..21db828 100644
--- a/xen/common/memory.c
+++ b/xen/common/memory.c
@@ -1140,6 +1140,7 @@ long do_memory_op(unsigned long cmd, XEN_GUEST_HANDLE_PARAM(void) arg)

          grdm.used_entries = 0;
          rc = iommu_get_reserved_device_memory(get_reserved_device_memory,
+                                              current->domain,
                                                &grdm);

          if ( !rc && grdm.map.nr_entries < grdm.used_entries )
diff --git a/xen/drivers/passthrough/iommu.c b/xen/drivers/passthrough/iommu.c
index 7c17e8d..7a5c66d 100644
--- a/xen/drivers/passthrough/iommu.c
+++ b/xen/drivers/passthrough/iommu.c
@@ -344,14 +344,15 @@ void iommu_crash_shutdown(void)
      iommu_enabled = iommu_intremap = 0;
  }

-int iommu_get_reserved_device_memory(iommu_grdm_t *func, void *ctxt)
+int iommu_get_reserved_device_memory(iommu_grdm_t *func, struct domain* d,
+                                     void *ctxt)
  {
      const struct iommu_ops *ops = iommu_get_ops();

      if ( !iommu_enabled || !ops->get_reserved_device_memory )
          return 0;

-    return ops->get_reserved_device_memory(func, ctxt);
+    return ops->get_reserved_device_memory(func, d, ctxt);
  }

  bool_t iommu_has_feature(struct domain *d, enum iommu_feature feature)
diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
index 1eba833..ee689ff 100644
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -1540,6 +1540,18 @@ int iommu_do_pci_domctl(
          }
          break;

+    case XEN_DOMCTL_set_rdm:
+        if ( unlikely(d->is_dying) )
+        {
+            ret = -EINVAL;
+            break;
+        }
+
+        d->arch.hvm_domain.pci_rdmforce = domctl->u.set_rdm.pci_rdmforce;
+        d->arch.hvm_domain.pci_dev_bitmap = domctl->u.set_rdm.pci_dev_bitmap;
+
+        break;
+
      case XEN_DOMCTL_assign_device:
          if ( unlikely(d->is_dying) )
          {
diff --git a/xen/drivers/passthrough/vtd/dmar.c b/xen/drivers/passthrough/vtd/dmar.c
index 546eca9..138840c 100644
--- a/xen/drivers/passthrough/vtd/dmar.c
+++ b/xen/drivers/passthrough/vtd/dmar.c
@@ -28,6 +28,7 @@
  #include <xen/xmalloc.h>
  #include <xen/pci.h>
  #include <xen/pci_regs.h>
+#include <xen/sched.h>
  #include <asm/string.h>
  #include "dmar.h"
  #include "iommu.h"
@@ -921,18 +922,28 @@ int platform_supports_x2apic(void)
      return cpu_has_x2apic && ((dmar_flags & mask) == ACPI_DMAR_INTR_REMAP);
  }

-int intel_iommu_get_reserved_device_memory(iommu_grdm_t *func, void *ctxt)
+int intel_iommu_get_reserved_device_memory(iommu_grdm_t *func, struct domain *d,
+                                           void *ctxt)
  {
      struct acpi_rmrr_unit *rmrr;
      int rc = 0;
+    int i;
+    u16 bdf;

-    list_for_each_entry(rmrr, &acpi_rmrr_units, list)
+    for_each_rmrr_device ( rmrr, bdf, i )
      {
-        rc = func(PFN_DOWN(rmrr->base_address),
-                  PFN_UP(rmrr->end_address) - PFN_DOWN(rmrr->base_address),
-                  ctxt);
-        if ( rc )
-            break;
+        if ( d->arch.hvm_domain.pci_rdmforce )
+        {
+            if ( d->arch.hvm_domain.pci_dev_bitmap & (uint32_t)(1 << bdf) )
+            {
+                rc = func(PFN_DOWN(rmrr->base_address),
+                                   PFN_UP(rmrr->end_address) -
+                                   PFN_DOWN(rmrr->base_address),
+                                   ctxt);
+                if ( rc )
+                    break;
+            }
+        }
      }

      return rc;
diff --git a/xen/drivers/passthrough/vtd/extern.h b/xen/drivers/passthrough/vtd/extern.h
index f9ee9b0..df9fed3 100644
--- a/xen/drivers/passthrough/vtd/extern.h
+++ b/xen/drivers/passthrough/vtd/extern.h
@@ -75,7 +75,8 @@ int domain_context_mapping_one(struct domain *domain, struct iommu *iommu,
                                 u8 bus, u8 devfn, const struct pci_dev *);
  int domain_context_unmap_one(struct domain *domain, struct iommu *iommu,
                               u8 bus, u8 devfn);
-int intel_iommu_get_reserved_device_memory(iommu_grdm_t *func, void *ctxt);
+int intel_iommu_get_reserved_device_memory(iommu_grdm_t *func, struct domain *d,
+                                           void *ctxt);

  unsigned int io_apic_read_remap_rte(unsigned int apic, unsigned int reg);
  void io_apic_write_remap_rte(unsigned int apic,
diff --git a/xen/include/asm-x86/hvm/domain.h b/xen/include/asm-x86/hvm/domain.h
index 2757c7f..5dab8dd 100644
--- a/xen/include/asm-x86/hvm/domain.h
+++ b/xen/include/asm-x86/hvm/domain.h
@@ -90,6 +90,9 @@ struct hvm_domain {
      /* Cached CF8 for guest PCI config cycles */
      uint32_t                pci_cf8;

+    uint32_t                pci_rdmforce;
+    uint32_t                pci_dev_bitmap;
+
      struct pl_time         pl_time;

      struct hvm_io_handler *io_handler;
diff --git a/xen/include/public/domctl.h b/xen/include/public/domctl.h
index 58b19e7..8b7cd5b 100644
--- a/xen/include/public/domctl.h
+++ b/xen/include/public/domctl.h
@@ -484,6 +484,14 @@ struct xen_domctl_assign_device {
  typedef struct xen_domctl_assign_device xen_domctl_assign_device_t;
  DEFINE_XEN_GUEST_HANDLE(xen_domctl_assign_device_t);

+/* Control whether/how we check and reserve device memory. */
+struct xen_domctl_set_rdm {
+    uint32_t  pci_rdmforce;
+    uint32_t  pci_dev_bitmap;
+};
+typedef struct xen_domctl_set_rdm xen_domctl_set_rdm_t;
+DEFINE_XEN_GUEST_HANDLE(xen_domctl_set_rdm_t);
+
  /* Retrieve sibling devices infomation of machine_sbdf */
  /* XEN_DOMCTL_get_device_group */
  struct xen_domctl_get_device_group {
@@ -1056,6 +1064,7 @@ struct xen_domctl {
  #define XEN_DOMCTL_set_vcpu_msrs                 73
  #define XEN_DOMCTL_setvnumainfo                  74
  #define XEN_DOMCTL_psr_cmt_op                    75
+#define XEN_DOMCTL_set_rdm                       76
  #define XEN_DOMCTL_gdbsx_guestmemio            1000
  #define XEN_DOMCTL_gdbsx_pausevcpu             1001
  #define XEN_DOMCTL_gdbsx_unpausevcpu           1002
@@ -1118,7 +1127,8 @@ struct xen_domctl {
          struct xen_domctl_gdbsx_domstatus   gdbsx_domstatus;
          struct xen_domctl_vnuma             vnuma;
          struct xen_domctl_psr_cmt_op        psr_cmt_op;
-        uint8_t                             pad[128];
+        struct xen_domctl_set_rdm           set_rdm;
+        uint8_t                             pad[120];
      } u;
  };
  typedef struct xen_domctl xen_domctl_t;
diff --git a/xen/include/xen/iommu.h b/xen/include/xen/iommu.h
index 409f6f8..adf3d52 100644
--- a/xen/include/xen/iommu.h
+++ b/xen/include/xen/iommu.h
@@ -158,14 +158,14 @@ struct iommu_ops {
      void (*crash_shutdown)(void);
      void (*iotlb_flush)(struct domain *d, unsigned long gfn, unsigned int page_count);
      void (*iotlb_flush_all)(struct domain *d);
-    int (*get_reserved_device_memory)(iommu_grdm_t *, void *);
+    int (*get_reserved_device_memory)(iommu_grdm_t *, struct domain *, void *);
      void (*dump_p2m_table)(struct domain *d);
  };

  void iommu_suspend(void);
  void iommu_resume(void);
  void iommu_crash_shutdown(void);
-int iommu_get_reserved_device_memory(iommu_grdm_t *, void *);
+int iommu_get_reserved_device_memory(iommu_grdm_t *, struct domain *, void *);

  void iommu_share_p2m_table(struct domain *d);

Thanks
Tiejun

* Re: [v7][RFC][PATCH 06/13] hvmloader/ram: check if guest memory is out of reserved device memory maps
  2014-11-06  9:28                                                   ` Chen, Tiejun
@ 2014-11-06 10:06                                                     ` Jan Beulich
  2014-11-07 10:27                                                       ` Chen, Tiejun
  0 siblings, 1 reply; 180+ messages in thread
From: Jan Beulich @ 2014-11-06 10:06 UTC (permalink / raw)
  To: Tiejun Chen; +Cc: yang.z.zhang, kevin.tian, tim, xen-devel

>>> On 06.11.14 at 10:28, <tiejun.chen@intel.com> wrote:
> --- a/xen/include/public/domctl.h
> +++ b/xen/include/public/domctl.h
> @@ -484,6 +484,14 @@ struct xen_domctl_assign_device {
>   typedef struct xen_domctl_assign_device xen_domctl_assign_device_t;
>   DEFINE_XEN_GUEST_HANDLE(xen_domctl_assign_device_t);
> 
> +/* Control whether/how we check and reserve device memory. */
> +struct xen_domctl_set_rdm {
> +    uint32_t  pci_rdmforce;
> +    uint32_t  pci_dev_bitmap;

So this allows for 32 devices; considering that you index the bitmap
by BDF, all this covers are 0000:00:00.0 through 0000:00:03.7.
Hardly enough to cover even a single pass through device (iirc the
range above is fully used by emulated devices). And of course a
much larger bitmap isn't a good solution either.
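
To make the arithmetic concrete, here is a standalone sketch (not Xen code; the EX_ macros and ex_ helpers are made up for illustration and assume only the usual 8-bit bus / 5-bit device / 3-bit function encoding within one segment):

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative local definitions; they follow the common PCI encoding
 * but are not copied from any Xen header. */
#define EX_PCI_DEVFN(d, f)  ((((d) & 0x1f) << 3) | ((f) & 0x07))
#define EX_PCI_BDF(b, d, f) ((((b) & 0xff) << 8) | EX_PCI_DEVFN(d, f))

/* Number of distinct BDFs in one segment: 256 buses * 32 devices * 8 functions. */
static inline uint32_t ex_bdf_count(void)
{
    return 256u * 32u * 8u;
}

/* Returns 1 if a bitmap of 'bits' bits, indexed directly by BDF,
 * has a slot for the given device; 0 otherwise. */
static inline int ex_bitmap_covers(unsigned int bits,
                                   uint8_t bus, uint8_t dev, uint8_t func)
{
    unsigned int bdf = EX_PCI_BDF(bus, dev, func);

    return bdf < bits;
}
```

With these helpers, a 32-bit bitmap has a slot for 00:03.7 (BDF 31) but not for 00:04.0 (BDF 32) or anything on bus 1, while a full segment needs 65536 slots.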

Nor am I really clear what you need the 32 bits of "pci_rdmforce"
for, nor why this field isn't just being named "force". Without a
comment alongside the fields, it's remaining guesswork anyway
when and how this domctl is to be used. Looking at your change
to intel_iommu_get_reserved_device_memory() the field appears
to be simply redundant.

> --- a/xen/include/xen/iommu.h
> +++ b/xen/include/xen/iommu.h
> @@ -158,14 +158,14 @@ struct iommu_ops {
>       void (*crash_shutdown)(void);
>       void (*iotlb_flush)(struct domain *d, unsigned long gfn, unsigned int page_count);
>       void (*iotlb_flush_all)(struct domain *d);
> -    int (*get_reserved_device_memory)(iommu_grdm_t *, void *);
> +    int (*get_reserved_device_memory)(iommu_grdm_t *, struct domain *, void *);
>       void (*dump_p2m_table)(struct domain *d);
>   };
> 
>   void iommu_suspend(void);
>   void iommu_resume(void);
>   void iommu_crash_shutdown(void);
> -int iommu_get_reserved_device_memory(iommu_grdm_t *, void *);
> +int iommu_get_reserved_device_memory(iommu_grdm_t *, struct domain *, void *);

I don't see why these generic interfaces would need to change;
the only thing that would seem to need changing is the callback
function (and of course the context passed to it).

Jan

* Re: [v7][RFC][PATCH 06/13] hvmloader/ram: check if guest memory is out of reserved device memory maps
  2014-11-06 10:06                                                     ` Jan Beulich
@ 2014-11-07 10:27                                                       ` Chen, Tiejun
  2014-11-07 11:08                                                         ` Jan Beulich
  0 siblings, 1 reply; 180+ messages in thread
From: Chen, Tiejun @ 2014-11-07 10:27 UTC (permalink / raw)
  To: Jan Beulich; +Cc: yang.z.zhang, kevin.tian, tim, xen-devel

On 2014/11/6 18:06, Jan Beulich wrote:
>>>> On 06.11.14 at 10:28, <tiejun.chen@intel.com> wrote:
>> --- a/xen/include/public/domctl.h
>> +++ b/xen/include/public/domctl.h
>> @@ -484,6 +484,14 @@ struct xen_domctl_assign_device {
>>    typedef struct xen_domctl_assign_device xen_domctl_assign_device_t;
>>    DEFINE_XEN_GUEST_HANDLE(xen_domctl_assign_device_t);
>>
>> +/* Control whether/how we check and reserve device memory. */
>> +struct xen_domctl_set_rdm {
>> +    uint32_t  pci_rdmforce;
>> +    uint32_t  pci_dev_bitmap;
>
> So this allows for 32 devices; considering that you index the bitmap

Sorry, that's my fault; I was only focused on finding a workable approach. We
need to cover 256 x 32 x 8 = 65536 BDFs.

> by BDF, all this covers are 0000:00:00.0 through 0000:00:03.7.
> Hardly enough to cover even a single pass through device (iirc the
> range above is fully used by emulated devices). And of course a

Are you saying Xen reserves some specific BDFs for emulated devices?
But I don't see the associated code.

> much larger bitmap isn't a good solution either.

So I guess I need to rework this; please see the new draft code.

>
> Nor am I really clear what you need the 32 bits of "pci_rdmforce"
> for, nor why this field isn't just being named "force". Without a

This variable tells Xen to force checking and reserving all RMRR ranges,
for the case where the user is sure a device will be hotplugged later but
does not assign any device when creating the VM.

Please see the new draft code as well.


> comment alongside the fields, it's remaining guesswork anyway
> when and how this domctl is to be used. Looking at your change
> to intel_iommu_get_reserved_device_memory() the field appears
> to be simply redundant.
>
>> --- a/xen/include/xen/iommu.h
>> +++ b/xen/include/xen/iommu.h
>> @@ -158,14 +158,14 @@ struct iommu_ops {
>>        void (*crash_shutdown)(void);
>>        void (*iotlb_flush)(struct domain *d, unsigned long gfn, unsigned int page_count);
>>        void (*iotlb_flush_all)(struct domain *d);
>> -    int (*get_reserved_device_memory)(iommu_grdm_t *, void *);
>> +    int (*get_reserved_device_memory)(iommu_grdm_t *, struct domain *, void *);
>>        void (*dump_p2m_table)(struct domain *d);
>>    };
>>
>>    void iommu_suspend(void);
>>    void iommu_resume(void);
>>    void iommu_crash_shutdown(void);
>> -int iommu_get_reserved_device_memory(iommu_grdm_t *, void *);
>> +int iommu_get_reserved_device_memory(iommu_grdm_t *, struct domain *, void *);
>
> I don't see why these generic interfaces would need to change;
> the only thing that would seem to need changing is the callback
> function (and of course the context passed to it).
>

I'm not 100% sure whether we can use current->domain in all scenarios. If
you can help me confirm this, I'd really like to remove this change :)
For now I assume it is safe, as follows:

diff --git a/tools/libxc/include/xenctrl.h b/tools/libxc/include/xenctrl.h
index 30764b4..5957d2e 100644
--- a/tools/libxc/include/xenctrl.h
+++ b/tools/libxc/include/xenctrl.h
@@ -2036,6 +2036,12 @@ int xc_assign_device(xc_interface *xch,
                       uint32_t domid,
                       uint32_t machine_bdf);

+int xc_domain_device_setrdm(xc_interface *xch,
+                            uint32_t domid,
+                            uint32_t num_pcidevs,
+                            uint32_t pci_rdmforce,
+                            struct xen_guest_pcidev_info *pcidevs);
+
  int xc_get_device_group(xc_interface *xch,
                       uint32_t domid,
                       uint32_t machine_bdf,
diff --git a/tools/libxc/xc_domain.c b/tools/libxc/xc_domain.c
index 009e351..f38b400 100644
--- a/tools/libxc/xc_domain.c
+++ b/tools/libxc/xc_domain.c
@@ -1642,6 +1642,34 @@ int xc_assign_device(
      return do_domctl(xch, &domctl);
  }

+int xc_domain_device_setrdm(xc_interface *xch,
+                            uint32_t domid,
+                            uint32_t num_pcidevs,
+                            uint32_t pci_rdmforce,
+                            struct xen_guest_pcidev_info *pcidevs)
+{
+    int ret;
+    DECLARE_DOMCTL;
+    DECLARE_HYPERCALL_BOUNCE(pcidevs,
+                             num_pcidevs*sizeof(xen_guest_pcidev_info_t),
+                             XC_HYPERCALL_BUFFER_BOUNCE_IN);
+
+    if ( xc_hypercall_bounce_pre(xch, pcidevs) )
+        return -1;
+
+    domctl.cmd = XEN_DOMCTL_set_rdm;
+    domctl.domain = domid;
+    domctl.u.set_rdm.pci_rdmforce = pci_rdmforce;
+    domctl.u.set_rdm.num_pcidevs = num_pcidevs;
+    set_xen_guest_handle(domctl.u.set_rdm.pcidevs, pcidevs);
+
+    ret = do_domctl(xch, &domctl);
+
+    xc_hypercall_bounce_post(xch, pcidevs);
+
+    return ret;
+}
+
  int xc_get_device_group(
      xc_interface *xch,
      uint32_t domid,
diff --git a/tools/libxc/xc_domain_restore.c b/tools/libxc/xc_domain_restore.c
index d8bd9b3..ac48a82 100644
--- a/tools/libxc/xc_domain_restore.c
+++ b/tools/libxc/xc_domain_restore.c
@@ -2165,8 +2165,8 @@ int xc_domain_restore(xc_interface *xch, int io_fd, uint32_t dom,

          if ( !ext_vcpucontext )
              goto vcpu_ext_state_restore;
-        memcpy(&domctl.u.ext_vcpucontext, vcpup, 128);
-        vcpup += 128;
+        memcpy(&domctl.u.ext_vcpucontext, vcpup, 112);
+        vcpup += 112;
          domctl.cmd = XEN_DOMCTL_set_ext_vcpucontext;
          domctl.domain = dom;
          frc = xc_domctl(xch, &domctl);
diff --git a/tools/libxl/libxl_create.c b/tools/libxl/libxl_create.c
index b1ff5ae..2429416 100644
--- a/tools/libxl/libxl_create.c
+++ b/tools/libxl/libxl_create.c
@@ -860,6 +860,9 @@ static void initiate_domain_create(libxl__egc *egc,
      ret = libxl__domain_build_info_setdefault(gc, &d_config->b_info);
      if (ret) goto error_out;

+    ret = libxl__domain_device_setrdm(gc, d_config, domid);
+    if (ret) goto error_out;
+
      if (!sched_params_valid(gc, domid, &d_config->b_info.sched_params)) {
          LOG(ERROR, "Invalid scheduling parameters\n");
          ret = ERROR_INVAL;
diff --git a/tools/libxl/libxl_dm.c b/tools/libxl/libxl_dm.c
index 3e191c3..9e402d1 100644
--- a/tools/libxl/libxl_dm.c
+++ b/tools/libxl/libxl_dm.c
@@ -90,6 +90,42 @@ const char *libxl__domain_device_model(libxl__gc *gc,
      return dm;
  }

+int libxl__domain_device_setrdm(libxl__gc *gc,
+                                libxl_domain_config *d_config,
+                                uint32_t dm_domid)
+{
+    int i, ret;
+    libxl_ctx *ctx = libxl__gc_owner(gc);
+    struct xen_guest_pcidev_info *pcidevs = NULL;
+
+    if ( d_config->num_pcidevs )
+    {
+        pcidevs = malloc(d_config->num_pcidevs*sizeof(xen_guest_pcidev_info_t));
+        if ( pcidevs )
+        {
+            for (i = 0; i < d_config->num_pcidevs; i++)
+            {
+                pcidevs[i].func = d_config->pcidevs[i].func;
+                pcidevs[i].dev = d_config->pcidevs[i].dev;
+                pcidevs[i].bus = d_config->pcidevs[i].bus;
+                pcidevs[i].domain = d_config->pcidevs[i].domain;
+                pcidevs[i].rdmforce = d_config->pcidevs[i].rdmforce;
+            }
+        }
+        else
+            LIBXL__LOG(CTX, LIBXL__LOG_ERROR,
+                               "Can't allocate for pcidevs.");
+    }
+
+    ret = xc_domain_device_setrdm(ctx->xch, dm_domid,
+                                  (uint32_t)d_config->num_pcidevs,
+                                  d_config->b_info.pci_rdmforce,
+                                  pcidevs);
+    free(pcidevs);
+
+    return ret;
+}
+
  const libxl_vnc_info *libxl__dm_vnc(const libxl_domain_config *guest_config)
  {
      const libxl_vnc_info *vnc = NULL;
diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h
index 4361421..c48a0e6 100644
--- a/tools/libxl/libxl_internal.h
+++ b/tools/libxl/libxl_internal.h
@@ -1471,6 +1471,11 @@ _hidden int libxl__domain_build(libxl__gc *gc,
  /* for device model creation */
  _hidden const char *libxl__domain_device_model(libxl__gc *gc,
                                          const libxl_domain_build_info *info);
+
+_hidden int libxl__domain_device_setrdm(libxl__gc *gc,
+                                        libxl_domain_config *d_config,
+                                        uint32_t domid);
+
  _hidden int libxl__need_xenpv_qemu(libxl__gc *gc,
          int nr_consoles, libxl__device_console *consoles,
          int nr_vfbs, libxl_device_vfb *vfbs,
diff --git a/tools/libxl/libxl_types.idl b/tools/libxl/libxl_types.idl
index ca3f724..b700263 100644
--- a/tools/libxl/libxl_types.idl
+++ b/tools/libxl/libxl_types.idl
@@ -398,6 +398,7 @@ libxl_domain_build_info = Struct("domain_build_info",[
      ("kernel",           string),
      ("cmdline",          string),
      ("ramdisk",          string),
+    ("pci_rdmforce",   uint32),
      ("u", KeyedUnion(None, libxl_domain_type, "type",
                  [("hvm", Struct(None, [("firmware",         string),
                                         ("bios", libxl_bios_type),
@@ -518,6 +519,7 @@ libxl_device_pci = Struct("device_pci", [
      ("power_mgmt", bool),
      ("permissive", bool),
      ("seize", bool),
+    ("rdmforce", bool),
      ])

  libxl_device_vtpm = Struct("device_vtpm", [
diff --git a/tools/libxl/libxlu_pci.c b/tools/libxl/libxlu_pci.c
index 26fb143..989eac8 100644
--- a/tools/libxl/libxlu_pci.c
+++ b/tools/libxl/libxlu_pci.c
@@ -143,6 +143,8 @@ int xlu_pci_parse_bdf(XLU_Config *cfg, libxl_device_pci *pcidev, const char *str
                      pcidev->permissive = atoi(tok);
                  }else if ( !strcmp(optkey, "seize") ) {
                      pcidev->seize = atoi(tok);
+                }else if ( !strcmp(optkey, "rdmforce") ) {
+                    pcidev->rdmforce = atoi(tok);
                  }else{
                      XLU__PCI_ERR(cfg, "Unknown PCI BDF option: %s", optkey);
                  }
diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c
index 3c9f146..436b4f3 100644
--- a/tools/libxl/xl_cmdimpl.c
+++ b/tools/libxl/xl_cmdimpl.c
@@ -904,6 +904,7 @@ static void replace_string(char **str, const char *val)
      *str = xstrdup(val);
  }

+#define PCI_BDF(b,d,f) ((((b) & 0xff) << 8) | PCI_DEVFN(d,f))
  static void parse_config_data(const char *config_source,
                                const char *config_data,
                                int config_len,
@@ -919,6 +920,7 @@ static void parse_config_data(const char *config_source,
      int pci_msitranslate = 0;
      int pci_permissive = 0;
      int pci_seize = 0;
+    int pci_rdmforce = 0;
      int i, e;

      libxl_domain_create_info *c_info = &d_config->c_info;
@@ -1699,6 +1701,9 @@ skip_vfb:
      if (!xlu_cfg_get_long (config, "pci_seize", &l, 0))
          pci_seize = l;

+    if (!xlu_cfg_get_long (config, "pci_rdmforce", &l, 0))
+        pci_rdmforce = l;
+
      /* To be reworked (automatically enabled) once the auto ballooning
       * after guest starts is done (with PCI devices passed in). */
      if (c_info->type == LIBXL_DOMAIN_TYPE_PV) {
@@ -1719,9 +1724,11 @@ skip_vfb:
              pcidev->power_mgmt = pci_power_mgmt;
              pcidev->permissive = pci_permissive;
              pcidev->seize = pci_seize;
+            pcidev->rdmforce = pci_rdmforce;
              if (!xlu_pci_parse_bdf(config, pcidev, buf))
                  d_config->num_pcidevs++;
          }
+        d_config->b_info.pci_rdmforce = pci_rdmforce;
          if (d_config->num_pcidevs && c_info->type == LIBXL_DOMAIN_TYPE_PV)
              libxl_defbool_set(&b_info->u.pv.e820_host, true);
      }
diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
index 1eba833..daf259e 100644
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -1540,6 +1540,34 @@ int iommu_do_pci_domctl(
          }
          break;

+    case XEN_DOMCTL_set_rdm:
+    {
+        struct xen_domctl_set_rdm *xdsr = &domctl->u.set_rdm;
+        struct xen_guest_pcidev_info *pcidevs;
+        int i;
+
+        pcidevs = xmalloc_array(xen_guest_pcidev_info_t,
+                                domctl->u.set_rdm.num_pcidevs);
+        if ( pcidevs == NULL )
+        {
+            return -ENOMEM;
+        }
+
+        for ( i = 0; i < xdsr->num_pcidevs; ++i )
+        {
+            if ( __copy_from_guest_offset(pcidevs, xdsr->pcidevs, i, 1) )
+            {
+                xfree(pcidevs);
+                return -EFAULT;
+            }
+        }
+
+        d->arch.hvm_domain.pci_rdmforce = domctl->u.set_rdm.pci_rdmforce;
+        d->arch.hvm_domain.num_pcidevs = domctl->u.set_rdm.num_pcidevs;
+        d->arch.hvm_domain.pcidevs = pcidevs;
+    }
+        break;
+
      case XEN_DOMCTL_assign_device:
          if ( unlikely(d->is_dying) )
          {
diff --git a/xen/drivers/passthrough/vtd/dmar.c b/xen/drivers/passthrough/vtd/dmar.c
index 546eca9..4b35c04 100644
--- a/xen/drivers/passthrough/vtd/dmar.c
+++ b/xen/drivers/passthrough/vtd/dmar.c
@@ -28,6 +28,7 @@
  #include <xen/xmalloc.h>
  #include <xen/pci.h>
  #include <xen/pci_regs.h>
+#include <xen/sched.h>
  #include <asm/string.h>
  #include "dmar.h"
  #include "iommu.h"
@@ -925,14 +926,37 @@ int intel_iommu_get_reserved_device_memory(iommu_grdm_t *func, void *ctxt)
  {
      struct acpi_rmrr_unit *rmrr;
      int rc = 0;
+    int i, j;
+    u16 bdf, pt_bdf;
+    struct domain *d = current->domain;

-    list_for_each_entry(rmrr, &acpi_rmrr_units, list)
+    for_each_rmrr_device ( rmrr, bdf, i )
      {
-        rc = func(PFN_DOWN(rmrr->base_address),
-                  PFN_UP(rmrr->end_address) - PFN_DOWN(rmrr->base_address),
-                  ctxt);
-        if ( rc )
-            break;
+        if ( d->arch.hvm_domain.pci_rdmforce )
+        {
+            rc = func(PFN_DOWN(rmrr->base_address),
+                      PFN_UP(rmrr->end_address) -
+                      PFN_DOWN(rmrr->base_address),
+                      ctxt);
+            if ( rc )
+                break;
+        }
+        else
+        {
+            for ( j = 0; j < d->arch.hvm_domain.num_pcidevs; j++ )
+            {
+                pt_bdf = PCI_BDF(d->arch.hvm_domain.pcidevs[j].bus,
+                                 d->arch.hvm_domain.pcidevs[j].dev,
+                                 d->arch.hvm_domain.pcidevs[j].func);
+                if ( pt_bdf == bdf )
+                    rc = func(PFN_DOWN(rmrr->base_address),
+                              PFN_UP(rmrr->end_address) -
+                              PFN_DOWN(rmrr->base_address),
+                              ctxt);
+                if ( rc )
+                    break;
+            }
+        }
      }

      return rc;
diff --git a/xen/include/asm-x86/hvm/domain.h b/xen/include/asm-x86/hvm/domain.h
index 2757c7f..b18ce40 100644
--- a/xen/include/asm-x86/hvm/domain.h
+++ b/xen/include/asm-x86/hvm/domain.h
@@ -90,6 +90,10 @@ struct hvm_domain {
      /* Cached CF8 for guest PCI config cycles */
      uint32_t                pci_cf8;

+    uint32_t                pci_rdmforce;
+    uint32_t                num_pcidevs;
+    struct xen_guest_pcidev_info      *pcidevs;
+
      struct pl_time         pl_time;

      struct hvm_io_handler *io_handler;
diff --git a/xen/include/public/domctl.h b/xen/include/public/domctl.h
index 58b19e7..4e47fac 100644
--- a/xen/include/public/domctl.h
+++ b/xen/include/public/domctl.h
@@ -484,6 +484,24 @@ struct xen_domctl_assign_device {
  typedef struct xen_domctl_assign_device xen_domctl_assign_device_t;
  DEFINE_XEN_GUEST_HANDLE(xen_domctl_assign_device_t);

+struct xen_guest_pcidev_info {
+    uint8_t func;
+    uint8_t dev;
+    uint8_t bus;
+    int domain;
+    int rdmforce;
+};
+typedef struct xen_guest_pcidev_info xen_guest_pcidev_info_t;
+DEFINE_XEN_GUEST_HANDLE(xen_guest_pcidev_info_t);
+/* Control whether/how we check and reserve device memory. */
+struct xen_domctl_set_rdm {
+    uint32_t  pci_rdmforce;
+    uint32_t  num_pcidevs;
+    XEN_GUEST_HANDLE(xen_guest_pcidev_info_t) pcidevs;
+};
+typedef struct xen_domctl_set_rdm xen_domctl_set_rdm_t;
+DEFINE_XEN_GUEST_HANDLE(xen_domctl_set_rdm_t);
+
  /* Retrieve sibling devices infomation of machine_sbdf */
  /* XEN_DOMCTL_get_device_group */
  struct xen_domctl_get_device_group {
@@ -1056,6 +1074,7 @@ struct xen_domctl {
  #define XEN_DOMCTL_set_vcpu_msrs                 73
  #define XEN_DOMCTL_setvnumainfo                  74
  #define XEN_DOMCTL_psr_cmt_op                    75
+#define XEN_DOMCTL_set_rdm                       76
  #define XEN_DOMCTL_gdbsx_guestmemio            1000
  #define XEN_DOMCTL_gdbsx_pausevcpu             1001
  #define XEN_DOMCTL_gdbsx_unpausevcpu           1002
@@ -1118,7 +1137,8 @@ struct xen_domctl {
          struct xen_domctl_gdbsx_domstatus   gdbsx_domstatus;
          struct xen_domctl_vnuma             vnuma;
          struct xen_domctl_psr_cmt_op        psr_cmt_op;
-        uint8_t                             pad[128];
+        struct xen_domctl_set_rdm           set_rdm;
+        uint8_t                             pad[112];
      } u;
  };
  typedef struct xen_domctl xen_domctl_t;


Thanks
Tiejun

* Re: [v7][RFC][PATCH 06/13] hvmloader/ram: check if guest memory is out of reserved device memory maps
  2014-11-07 10:27                                                       ` Chen, Tiejun
@ 2014-11-07 11:08                                                         ` Jan Beulich
  2014-11-11  6:32                                                           ` Chen, Tiejun
  0 siblings, 1 reply; 180+ messages in thread
From: Jan Beulich @ 2014-11-07 11:08 UTC (permalink / raw)
  To: Tiejun Chen; +Cc: yang.z.zhang, kevin.tian, tim, xen-devel

>>> On 07.11.14 at 11:27, <tiejun.chen@intel.com> wrote:
> Are you saying Xen reserves some specific BDFs for emulated devices?
> But I don't see the associated code.

I didn't say so. All I said was that some of the SBDFs are being used by
them.

>>> --- a/xen/include/xen/iommu.h
>>> +++ b/xen/include/xen/iommu.h
>>> @@ -158,14 +158,14 @@ struct iommu_ops {
>>>        void (*crash_shutdown)(void);
>>>        void (*iotlb_flush)(struct domain *d, unsigned long gfn, unsigned int page_count);
>>>        void (*iotlb_flush_all)(struct domain *d);
>>> -    int (*get_reserved_device_memory)(iommu_grdm_t *, void *);
>>> +    int (*get_reserved_device_memory)(iommu_grdm_t *, struct domain *, void *);
>>>        void (*dump_p2m_table)(struct domain *d);
>>>    };
>>>
>>>    void iommu_suspend(void);
>>>    void iommu_resume(void);
>>>    void iommu_crash_shutdown(void);
>>> -int iommu_get_reserved_device_memory(iommu_grdm_t *, void *);
>>> +int iommu_get_reserved_device_memory(iommu_grdm_t *, struct domain *, void *);
>>
>> I don't see why these generic interfaces would need to change;
>> the only thing that would seem to need changing is the callback
>> function (and of course the context passed to it).
> 
> I'm not 100% sure whether we can use current->domain in all scenarios. If
> you can help me confirm this, I'd really like to remove this change :)
> For now I assume it is safe, as follows:

Which is wrong, and not what I said. Instead you should pass the
domain as part of the context that gets made available to the
callback function.
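
Roughly what is meant, as a standalone sketch rather than a patch: the iterator keeps its two-argument form, and the caller wraps whatever it needs (including the domain) in the context structure. All ex_ names, the stand-in domain type, and the hard-coded ranges are hypothetical; only the callback shape (start, count, context) follows the calls in the quoted patch.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for struct domain; holds only what this sketch needs. */
struct ex_domain {
    int force_rdm;
};

/* Callback shape as used in the quoted patch: start pfn, page count, context. */
typedef int ex_grdm_t(unsigned long start, unsigned long nr, void *ctxt);

/* Hypothetical caller-owned context carrying the domain of interest. */
struct ex_grdm_ctxt {
    const struct ex_domain *d;
    unsigned long pages;        /* pages the callback decided to count */
};

/* Generic iterator: unchanged signature, knows nothing about domains. */
static int ex_get_reserved_device_memory(ex_grdm_t *func, void *ctxt)
{
    /* Made-up ranges standing in for the firmware-reported list. */
    static const struct { unsigned long start, nr; } ranges[] = {
        { 0xad000, 0x10 },
        { 0xab800, 0x20 },
    };
    size_t i;
    int rc = 0;

    for ( i = 0; i < sizeof(ranges) / sizeof(ranges[0]); i++ )
    {
        rc = func(ranges[i].start, ranges[i].nr, ctxt);
        if ( rc )
            break;
    }

    return rc;
}

/* Callback: the per-domain policy lives here, fed through the context. */
static int ex_count_cb(unsigned long start, unsigned long nr, void *ctxt)
{
    struct ex_grdm_ctxt *c = ctxt;

    (void)start;
    if ( c->d->force_rdm )
        c->pages += nr;

    return 0;
}
```

A caller would fill in a struct ex_grdm_ctxt with the target domain and pass its address as the context argument; neither the iterator nor the ops hook needs a new parameter.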

> --- a/xen/drivers/passthrough/pci.c
> +++ b/xen/drivers/passthrough/pci.c
> @@ -1540,6 +1540,34 @@ int iommu_do_pci_domctl(
>           }
>           break;
> 
> +    case XEN_DOMCTL_set_rdm:
> +    {
> +        struct xen_domctl_set_rdm *xdsr = &domctl->u.set_rdm;
> +        struct xen_guest_pcidev_info *pcidevs;
> +        int i;
> +
> +        pcidevs = xmalloc_array(xen_guest_pcidev_info_t,
> +                                domctl->u.set_rdm.num_pcidevs);
> +        if ( pcidevs == NULL )
> +        {
> +            return -ENOMEM;
> +        }
> +
> +        for ( i = 0; i < xdsr->num_pcidevs; ++i )
> +        {
> +            if ( __copy_from_guest_offset(pcidevs, xdsr->pcidevs, i, 1) )
> +            {
> +                xfree(pcidevs);
> +                return -EFAULT;
> +            }
> +        }

I don't see the need for a loop here. And you definitely can't use the
double-underscore-prefixed variant the way you do.
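
On the loop, a standalone mock may help (not Xen code; ex_copy_from_guest_offset() is a made-up stand-in built on memcpy, and it assumes, as I read the real helper, that the offset applies to the source handle only and the destination pointer is used as given). Under that assumption the per-element loop keeps overwriting the first destination slot, whereas one call with the full count fills the array:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Illustrative element type; field layout is an assumption. */
struct ex_pcidev {
    uint16_t seg;
    uint8_t  bus;
    uint8_t  devfn;
};

/* Mock copy helper: copies nr elements starting at src[off] into
 * dst[0..nr-1]. Returns 0 on success. No access checking is modelled. */
static int ex_copy_from_guest_offset(struct ex_pcidev *dst,
                                     const struct ex_pcidev *src,
                                     unsigned int off, unsigned int nr)
{
    memcpy(dst, src + off, nr * sizeof(*dst));
    return 0;
}

/* Loop in the style of the draft: every iteration lands in dst[0]. */
static void ex_copy_looped(struct ex_pcidev *dst,
                           const struct ex_pcidev *src, unsigned int n)
{
    unsigned int i;

    for ( i = 0; i < n; ++i )
        ex_copy_from_guest_offset(dst, src, i, 1);
}

/* Single bulk copy: one call, all n elements. */
static void ex_copy_bulk(struct ex_pcidev *dst,
                         const struct ex_pcidev *src, unsigned int n)
{
    ex_copy_from_guest_offset(dst, src, 0, n);
}
```

The access-check side of the remark is not modelled here; my understanding is that the double-underscore variants rely on the caller having validated the handle beforehand, so the plain variant is the safer default.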

> --- a/xen/include/asm-x86/hvm/domain.h
> +++ b/xen/include/asm-x86/hvm/domain.h
> @@ -90,6 +90,10 @@ struct hvm_domain {
>       /* Cached CF8 for guest PCI config cycles */
>       uint32_t                pci_cf8;
> 
> +    uint32_t                pci_rdmforce;

I still don't see why this is a uint32_t.

> --- a/xen/include/public/domctl.h
> +++ b/xen/include/public/domctl.h
> @@ -484,6 +484,24 @@ struct xen_domctl_assign_device {
>   typedef struct xen_domctl_assign_device xen_domctl_assign_device_t;
>   DEFINE_XEN_GUEST_HANDLE(xen_domctl_assign_device_t);
> 
> +struct xen_guest_pcidev_info {
> +    uint8_t func;
> +    uint8_t dev;
> +    uint8_t bus;
> +    int domain;
> +    int rdmforce;
> +};

Please see struct physdev_pci_device_add for how to properly and
space efficiently express what you need. And of course I'd expect
the code to actually use all fields you specify here - neither domain
(which really ought to be named segment if it is what I think it is
meant to be) nor rdmforce are being used anywhere. Plus - again -
just "force" would be sufficient as a name, provided the field is
needed at all.

> +struct xen_domctl_set_rdm {
> +    uint32_t  pci_rdmforce;

Same comment as on the field added to "struct hvm_domain".

> +    uint32_t  num_pcidevs;
> +    XEN_GUEST_HANDLE(xen_guest_pcidev_info_t) pcidevs;

Did you _at all_ look at any of the other domctls when adding this?
There's not a single use of XEN_GUEST_HANDLE() in the whole file.

> @@ -1118,7 +1137,8 @@ struct xen_domctl {
>           struct xen_domctl_gdbsx_domstatus   gdbsx_domstatus;
>           struct xen_domctl_vnuma             vnuma;
>           struct xen_domctl_psr_cmt_op        psr_cmt_op;
> -        uint8_t                             pad[128];
> +        struct xen_domctl_set_rdm           set_rdm;
> +        uint8_t                             pad[112];

Why are you altering the padding size here?

Jan

* Re: [v7][RFC][PATCH 06/13] hvmloader/ram: check if guest memory is out of reserved device memory maps
  2014-11-07 11:08                                                         ` Jan Beulich
@ 2014-11-11  6:32                                                           ` Chen, Tiejun
  2014-11-11  7:49                                                             ` Chen, Tiejun
  2014-11-11  8:59                                                             ` Jan Beulich
  0 siblings, 2 replies; 180+ messages in thread
From: Chen, Tiejun @ 2014-11-11  6:32 UTC (permalink / raw)
  To: Jan Beulich; +Cc: yang.z.zhang, kevin.tian, tim, xen-devel

On 2014/11/7 19:08, Jan Beulich wrote:
>>>> On 07.11.14 at 11:27, <tiejun.chen@intel.com> wrote:
>> Are you saying Xen restrict some BDFs specific to emulate some devices?
>> But I don't see these associated codes.
>
> I didn't say so. All I said that some of the SBDFs are being used by
> them.
>
>>>> --- a/xen/include/xen/iommu.h
>>>> +++ b/xen/include/xen/iommu.h
>>>> @@ -158,14 +158,14 @@ struct iommu_ops {
>>>>         void (*crash_shutdown)(void);
>>>>         void (*iotlb_flush)(struct domain *d, unsigned long gfn, unsigned int page_count);
>>>>         void (*iotlb_flush_all)(struct domain *d);
>>>> -    int (*get_reserved_device_memory)(iommu_grdm_t *, void *);
>>>> +    int (*get_reserved_device_memory)(iommu_grdm_t *, struct domain *, void *);
>>>>         void (*dump_p2m_table)(struct domain *d);
>>>>     };
>>>>
>>>>     void iommu_suspend(void);
>>>>     void iommu_resume(void);
>>>>     void iommu_crash_shutdown(void);
>>>> -int iommu_get_reserved_device_memory(iommu_grdm_t *, void *);
>>>> +int iommu_get_reserved_device_memory(iommu_grdm_t *, struct domain *, void *);
>>>
>>> I don't see why these generic interfaces would need to change;
>>> the only thing that would seem to need changing is the callback
>>> function (and of course the context passed to it).
>>
>> I'm not 100% sure if we can call current->domain in all scenarios. If
>> you can help me confirm this I'd really like to remove this change :)
>> Now I assume this should be true as follows:
>
> Which is wrong, and not what I said. Instead you should pass the
> domain as part of the context that gets made available to the
> callback function.

Okay I will try to go there.

>
>> --- a/xen/drivers/passthrough/pci.c
>> +++ b/xen/drivers/passthrough/pci.c
>> @@ -1540,6 +1540,34 @@ int iommu_do_pci_domctl(
>>            }
>>            break;
>>
>> +    case XEN_DOMCTL_set_rdm:
>> +    {
>> +        struct xen_domctl_set_rdm *xdsr = &domctl->u.set_rdm;
>> +        struct xen_guest_pcidev_info *pcidevs;
>> +        int i;
>> +
>> +        pcidevs = xmalloc_array(xen_guest_pcidev_info_t,
>> +                                domctl->u.set_rdm.num_pcidevs);
>> +        if ( pcidevs == NULL )
>> +        {
>> +            return -ENOMEM;
>> +        }
>> +
>> +        for ( i = 0; i < xdsr->num_pcidevs; ++i )
>> +        {
>> +            if ( __copy_from_guest_offset(pcidevs, xdsr->pcidevs, i, 1) )
>> +            {
>> +                xfree(pcidevs);
>> +                return -EFAULT;
>> +            }
>> +        }
>
> I don't see the need for a loop here. And you definitely can't use the
> double-underscore-prefixed variant the way you do.

Do you mean this line?

    copy_from_guest_offset(pcidevs, xdsr->pcidevs, 0,
                           xdsr->num_pcidevs * sizeof(xen_guest_pcidev_info_t))

>
>> --- a/xen/include/asm-x86/hvm/domain.h
>> +++ b/xen/include/asm-x86/hvm/domain.h
>> @@ -90,6 +90,10 @@ struct hvm_domain {
>>        /* Cached CF8 for guest PCI config cycles */
>>        uint32_t                pci_cf8;
>>
>> +    uint32_t                pci_rdmforce;
>
> I still don't see why this is a uint32_t.

How about bool_t?

>
>> --- a/xen/include/public/domctl.h
>> +++ b/xen/include/public/domctl.h
>> @@ -484,6 +484,24 @@ struct xen_domctl_assign_device {
>>    typedef struct xen_domctl_assign_device xen_domctl_assign_device_t;
>>    DEFINE_XEN_GUEST_HANDLE(xen_domctl_assign_device_t);
>>
>> +struct xen_guest_pcidev_info {
>> +    uint8_t func;
>> +    uint8_t dev;
>> +    uint8_t bus;
>> +    int domain;
>> +    int rdmforce;
>> +};
>
> Please see struct physdev_pci_device_add for how to properly and
> space efficiently express what you need. And of course I'd expect

Yes. I did consider this point, but I thought we might need to keep a
complete set of fields.

You're right; in any case we should only add what is actually needed.

> the code to actually use all fields you specify here - neither domain
> (which really ought to be named segment if it is what I think it is
> meant to be) nor rdmforce are being used anywhere. Plus - again -
> just "force" would be sufficient as a name, provided the field is
> needed at all.

Okay I can use 'force' directly.

On the Xen side we have 'bool_t', but on the tools side we have 'bool'.
So how should this be defined in xen/include/public/domctl.h?

>
>> +struct xen_domctl_set_rdm {
>> +    uint32_t  pci_rdmforce;
>
> Same comment as on the field added to "struct hvm_domain".

Ditto.

>
>> +    uint32_t  num_pcidevs;
>> +    XEN_GUEST_HANDLE(xen_guest_pcidev_info_t) pcidevs;
>
> Did you _at all_ look at any of the other domctls when adding this?
> There's not a single use of XEN_GUEST_HANDLE() in the whole file.

Looks like I should do this:

XEN_GUEST_HANDLE_64(xen_guest_pcidev_info_t) pcidevs;

>
>> @@ -1118,7 +1137,8 @@ struct xen_domctl {
>>            struct xen_domctl_gdbsx_domstatus   gdbsx_domstatus;
>>            struct xen_domctl_vnuma             vnuma;
>>            struct xen_domctl_psr_cmt_op        psr_cmt_op;
>> -        uint8_t                             pad[128];
>> +        struct xen_domctl_set_rdm           set_rdm;
>> +        uint8_t                             pad[112];
>
> Why are you altering the padding size here?

As I understand it, we should shrink this pad when we introduce a new field,
shouldn't we?

Thanks
Tiejun

* Re: [v7][RFC][PATCH 06/13] hvmloader/ram: check if guest memory is out of reserved device memory maps
  2014-11-11  6:32                                                           ` Chen, Tiejun
@ 2014-11-11  7:49                                                             ` Chen, Tiejun
  2014-11-11  9:03                                                               ` Jan Beulich
  2014-11-11  8:59                                                             ` Jan Beulich
  1 sibling, 1 reply; 180+ messages in thread
From: Chen, Tiejun @ 2014-11-11  7:49 UTC (permalink / raw)
  To: Jan Beulich; +Cc: yang.z.zhang, kevin.tian, tim, xen-devel

>>>>> --- a/xen/include/xen/iommu.h
>>>>> +++ b/xen/include/xen/iommu.h
>>>>> @@ -158,14 +158,14 @@ struct iommu_ops {
>>>>>         void (*crash_shutdown)(void);
>>>>>         void (*iotlb_flush)(struct domain *d, unsigned long gfn,
>>>>> unsigned int page_count);
>>>>>         void (*iotlb_flush_all)(struct domain *d);
>>>>> -    int (*get_reserved_device_memory)(iommu_grdm_t *, void *);
>>>>> +    int (*get_reserved_device_memory)(iommu_grdm_t *, struct
>>>>> domain *, void *);
>>>>>         void (*dump_p2m_table)(struct domain *d);
>>>>>     };
>>>>>
>>>>>     void iommu_suspend(void);
>>>>>     void iommu_resume(void);
>>>>>     void iommu_crash_shutdown(void);
>>>>> -int iommu_get_reserved_device_memory(iommu_grdm_t *, void *);
>>>>> +int iommu_get_reserved_device_memory(iommu_grdm_t *, struct domain
>>>>> *, void *);
>>>>
>>>> I don't see why these generic interfaces would need to change;
>>>> the only thing that would seem to need changing is the callback
>>>> function (and of course the context passed to it).
>>>
>>> I'm not 100% sure if we can call current->domain in all scenarios. If
>>> you can help me confirm this I'd really like to remove this change :)
>>> Now I assume this should be true as follows:
>>
>> Which is wrong, and not what I said. Instead you should pass the
>> domain as part of the context that gets made available to the
>> callback function.
>
> Okay I will try to go there.
>

Do you mean this change?

@@ -898,14 +899,36 @@ int intel_iommu_get_reserved_device_memory(iommu_grdm_t *func, void *ctxt)
  {
      struct acpi_rmrr_unit *rmrr;
      int rc = 0;
+    int i, j;
+    u16 bdf, pt_bdf;
+    struct domain *d = ctxt->domain;

-    list_for_each_entry(rmrr, &acpi_rmrr_units, list)
+    for_each_rmrr_device ( rmrr, bdf, i )
      {
-        rc = func(PFN_DOWN(rmrr->base_address),
-                  PFN_UP(rmrr->end_address) - PFN_DOWN(rmrr->base_address),
-                  ctxt);
-        if ( rc )
-            break;
+        if ( d->arch.hvm_domain.pci_force )
+        {
+            rc = func(PFN_DOWN(rmrr->base_address),
+                      PFN_UP(rmrr->end_address) -
+                      PFN_DOWN(rmrr->base_address),
+                      ctxt);
+            if ( rc )
+                break;
+        }
+        else
+        {
+            for ( j = 0; j < d->arch.hvm_domain.num_pcidevs; j++ )
+            {

But,

dmar.c: In function 'intel_iommu_get_reserved_device_memory':
dmar.c:904:28: error: dereferencing 'void *' pointer [-Werror]
      struct domain *d = ctxt->domain;
                             ^
dmar.c:904:28: error: request for member 'domain' in something not a structure or union
cc1: all warnings being treated as errors
make[6]: *** [dmar.o] Error 1
make[6]: *** Waiting for unfinished jobs.

Unless we move all the checks inside each callback function.
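
For reference, a minimal sketch of what "pass the domain as part of the
context" could look like. The types and names below (domain_stub, grdm_ctxt,
for_each_range) are stand-ins invented for this example, not the real Xen
definitions. The point is that the walker keeps treating the context as an
opaque void *, and only the callback, which knows the real type, converts it
back before dereferencing; that avoids the "dereferencing 'void *' pointer"
error above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct domain. */
struct domain_stub {
    int pci_force;
};

/* Context owned by the caller; the domain travels inside it. */
struct grdm_ctxt {
    struct domain_stub *domain;
    unsigned int used_entries;
};

typedef int grdm_cb_t(unsigned long start, unsigned long nr, void *ctxt);

/* The callback knows the concrete context type and converts back to it. */
static int count_if_forced(unsigned long start, unsigned long nr, void *ctxt)
{
    struct grdm_ctxt *c = ctxt; /* void * converts implicitly in C */

    (void)start;
    (void)nr;
    assert(c != NULL && c->domain != NULL);

    if ( c->domain->pci_force )
        ++c->used_entries;

    return 0;
}

/* The generic walker never looks inside ctxt, so its signature stays put. */
static int for_each_range(grdm_cb_t *func, void *ctxt)
{
    static const unsigned long ranges[][2] = {
        { 0xab000, 2 }, { 0xad000, 1 }, /* made-up pfn/count pairs */
    };

    for ( size_t i = 0; i < sizeof(ranges) / sizeof(ranges[0]); ++i )
    {
        int rc = func(ranges[i][0], ranges[i][1], ctxt);

        if ( rc )
            return rc;
    }

    return 0;
}
```

A caller would fill in a struct grdm_ctxt with the domain and pass its
address as ctxt.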

Thanks
Tiejun

* Re: [v7][RFC][PATCH 06/13] hvmloader/ram: check if guest memory is out of reserved device memory maps
  2014-11-11  6:32                                                           ` Chen, Tiejun
  2014-11-11  7:49                                                             ` Chen, Tiejun
@ 2014-11-11  8:59                                                             ` Jan Beulich
  2014-11-11  9:35                                                               ` Chen, Tiejun
  1 sibling, 1 reply; 180+ messages in thread
From: Jan Beulich @ 2014-11-11  8:59 UTC (permalink / raw)
  To: Tiejun Chen; +Cc: yang.z.zhang, kevin.tian, tim, xen-devel

>>> On 11.11.14 at 07:32, <tiejun.chen@intel.com> wrote:
> On 2014/11/7 19:08, Jan Beulich wrote:
>>>>> On 07.11.14 at 11:27, <tiejun.chen@intel.com> wrote:
>>> --- a/xen/drivers/passthrough/pci.c
>>> +++ b/xen/drivers/passthrough/pci.c
>>> @@ -1540,6 +1540,34 @@ int iommu_do_pci_domctl(
>>>            }
>>>            break;
>>>
>>> +    case XEN_DOMCTL_set_rdm:
>>> +    {
>>> +        struct xen_domctl_set_rdm *xdsr = &domctl->u.set_rdm;
>>> +        struct xen_guest_pcidev_info *pcidevs;
>>> +        int i;
>>> +
>>> +        pcidevs = xmalloc_array(xen_guest_pcidev_info_t,
>>> +                                domctl->u.set_rdm.num_pcidevs);
>>> +        if ( pcidevs == NULL )
>>> +        {
>>> +            return -ENOMEM;
>>> +        }
>>> +
>>> +        for ( i = 0; i < xdsr->num_pcidevs; ++i )
>>> +        {
>>> +            if ( __copy_from_guest_offset(pcidevs, xdsr->pcidevs, i, 1) )
>>> +            {
>>> +                xfree(pcidevs);
>>> +                return -EFAULT;
>>> +            }
>>> +        }
>>
>> I don't see the need for a loop here. And you definitely can't use the
>> double-underscore-prefixed variant the way you do.
> 
> Do you mean this line?
> 
> copy_from_guest_offset(pcidevs, xdsr->pcidevs, 0, 
> xdsr->num_pcidevs*sizeof(xen_guest_pcidev_info_t))

Almost:

    copy_from_guest(pcidevs, xdsr->pcidevs, xdsr->num_pcidevs * sizeof(*pcidevs))

>>> --- a/xen/include/asm-x86/hvm/domain.h
>>> +++ b/xen/include/asm-x86/hvm/domain.h
>>> @@ -90,6 +90,10 @@ struct hvm_domain {
>>>        /* Cached CF8 for guest PCI config cycles */
>>>        uint32_t                pci_cf8;
>>>
>>> +    uint32_t                pci_rdmforce;
>>
>> I still don't see why this is a uint32_t.
> 
> How about bool_t?

Exactly.

> On the Xen side we have 'bool_t', but on the tools side we have 'bool'.
> So how should this be defined in xen/include/public/domctl.h?

Have a structure field named e.g. "flags" and a #define consuming
exactly one bit of it. Just like it's being done everywhere else.

>>> @@ -1118,7 +1137,8 @@ struct xen_domctl {
>>>            struct xen_domctl_gdbsx_domstatus   gdbsx_domstatus;
>>>            struct xen_domctl_vnuma             vnuma;
>>>            struct xen_domctl_psr_cmt_op        psr_cmt_op;
>>> -        uint8_t                             pad[128];
>>> +        struct xen_domctl_set_rdm           set_rdm;
>>> +        uint8_t                             pad[112];
>>
>> Why are you altering the padding size here?
> 
> As I understand it, we should shrink this pad when we introduce a new field,
> shouldn't we?

You realize that this is inside a union?
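
To spell that out with a small sketch (the type names are invented for the
example and the payload struct is only an approximation of the one in the
patch): a union is as large as its largest member, so adding a 16-byte member
next to pad[128] changes nothing, while shrinking the pad changes the size of
the union, and with it the layout of whatever contains it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Approximation of the new sub-op payload: 4 + 4 + 8 = 16 bytes. */
struct ex_set_rdm {
    uint32_t flags;
    uint32_t num_pcidevs;
    uint64_t pcidevs; /* stands in for a 64-bit guest handle */
};

/* Before the patch: some existing 8-byte-aligned member plus the pad. */
union payload_before {
    uint64_t existing;
    uint8_t  pad[128];
};

/* New member added, pad left alone. */
union payload_after {
    uint64_t          existing;
    struct ex_set_rdm set_rdm;
    uint8_t           pad[128];
};

/* New member added and pad shrunk by sizeof(struct ex_set_rdm). */
union payload_shrunk {
    uint64_t          existing;
    struct ex_set_rdm set_rdm;
    uint8_t           pad[112];
};

/* The pad only has to stay at least as large as every other member. */
static size_t payload_size_after_adding_member(void)
{
    assert(sizeof(struct ex_set_rdm) <= sizeof(((union payload_after *)0)->pad));
    return sizeof(union payload_after);
}
```

Here payload_before and payload_after are both 128 bytes, whereas
payload_shrunk drops to 112.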

Jan

* Re: [v7][RFC][PATCH 06/13] hvmloader/ram: check if guest memory is out of reserved device memory maps
  2014-11-11  7:49                                                             ` Chen, Tiejun
@ 2014-11-11  9:03                                                               ` Jan Beulich
  2014-11-11  9:06                                                                 ` Jan Beulich
  0 siblings, 1 reply; 180+ messages in thread
From: Jan Beulich @ 2014-11-11  9:03 UTC (permalink / raw)
  To: Tiejun Chen; +Cc: yang.z.zhang, kevin.tian, tim, xen-devel

>>> On 11.11.14 at 08:49, <tiejun.chen@intel.com> wrote:
>>>>>> --- a/xen/include/xen/iommu.h
>>>>>> +++ b/xen/include/xen/iommu.h
>>>>>> @@ -158,14 +158,14 @@ struct iommu_ops {
>>>>>>         void (*crash_shutdown)(void);
>>>>>>         void (*iotlb_flush)(struct domain *d, unsigned long gfn,
>>>>>> unsigned int page_count);
>>>>>>         void (*iotlb_flush_all)(struct domain *d);
>>>>>> -    int (*get_reserved_device_memory)(iommu_grdm_t *, void *);
>>>>>> +    int (*get_reserved_device_memory)(iommu_grdm_t *, struct
>>>>>> domain *, void *);
>>>>>>         void (*dump_p2m_table)(struct domain *d);
>>>>>>     };
>>>>>>
>>>>>>     void iommu_suspend(void);
>>>>>>     void iommu_resume(void);
>>>>>>     void iommu_crash_shutdown(void);
>>>>>> -int iommu_get_reserved_device_memory(iommu_grdm_t *, void *);
>>>>>> +int iommu_get_reserved_device_memory(iommu_grdm_t *, struct domain
>>>>>> *, void *);
>>>>>
>>>>> I don't see why these generic interfaces would need to change;
>>>>> the only thing that would seem to need changing is the callback
>>>>> function (and of course the context passed to it).
>>>>
>>>> I'm not 100% sure if we can call current->domain in all scenarios. If
>>>> you can help me confirm this I'd really like to remove this change :)
>>>> Now I assume this should be true as follows:
>>>
>>> Which is wrong, and not what I said. Instead you should pass the
>>> domain as part of the context that gets made available to the
>>> callback function.
>>
>> Okay I will try to go there.
>>
> 
> Do you mean this change?
> 
> @@ -898,14 +899,36 @@ int 
> intel_iommu_get_reserved_device_memory(iommu_grdm_t *func, void *ctxt)
>   {
>       struct acpi_rmrr_unit *rmrr;
>       int rc = 0;
> +    int i, j;
> +    u16 bdf, pt_bdf;
> +    struct domain *d = ctxt->domain;
> 
> -    list_for_each_entry(rmrr, &acpi_rmrr_units, list)
> +    for_each_rmrr_device ( rmrr, bdf, i )
>       {
> -        rc = func(PFN_DOWN(rmrr->base_address),
> -                  PFN_UP(rmrr->end_address) - PFN_DOWN(rmrr->base_address),
> -                  ctxt);
> -        if ( rc )
> -            break;
> +        if ( d->arch.hvm_domain.pci_force )
> +        {
> +            rc = func(PFN_DOWN(rmrr->base_address),
> +                      PFN_UP(rmrr->end_address) -
> +                      PFN_DOWN(rmrr->base_address),
> +                      ctxt);
> +            if ( rc )
> +                break;
> +        }
> +        else
> +        {
> +            for ( j = 0; j < d->arch.hvm_domain.num_pcidevs; j++ )
> +            {
> 
> But,
> 
> dmar.c: In function 'intel_iommu_get_reserved_device_memory':
> dmar.c:904:28: error: dereferencing 'void *' pointer [-Werror]
>       struct domain *d = ctxt->domain;
>                              ^
> dmar.c:904:28: error: request for member 'domain' in something not a structure or union
> cc1: all warnings being treated as errors
> make[6]: *** [dmar.o] Error 1
> make[6]: *** Waiting for unfinished jobs.
> 
> Unless we move all the checks inside each callback function.

That's what you should do imo, albeit I realize that the comparing
of the specific SBDFs will need additional consideration.

Jan

* Re: [v7][RFC][PATCH 06/13] hvmloader/ram: check if guest memory is out of reserved device memory maps
  2014-11-11  9:03                                                               ` Jan Beulich
@ 2014-11-11  9:06                                                                 ` Jan Beulich
  2014-11-11  9:42                                                                   ` Chen, Tiejun
  0 siblings, 1 reply; 180+ messages in thread
From: Jan Beulich @ 2014-11-11  9:06 UTC (permalink / raw)
  To: Tiejun Chen; +Cc: yang.z.zhang, kevin.tian, tim, xen-devel

>>> On 11.11.14 at 10:03, <JBeulich@suse.com> wrote:
>>>> On 11.11.14 at 08:49, <tiejun.chen@intel.com> wrote:
>> Unless we move all the checks inside each callback function.
> 
> That's what you should do imo, albeit I realize that the comparing
> of the specific SBDFs will need additional consideration.

Part of which would involve re-considering whether device
assignment shouldn't be done before memory population in the
tool stack.

Jan

* Re: [v7][RFC][PATCH 06/13] hvmloader/ram: check if guest memory is out of reserved device memory maps
  2014-11-11  8:59                                                             ` Jan Beulich
@ 2014-11-11  9:35                                                               ` Chen, Tiejun
  2014-11-11  9:42                                                                 ` Jan Beulich
  0 siblings, 1 reply; 180+ messages in thread
From: Chen, Tiejun @ 2014-11-11  9:35 UTC (permalink / raw)
  To: Jan Beulich; +Cc: yang.z.zhang, kevin.tian, tim, xen-devel

>> Do you mean this line?
>>
>> copy_from_guest_offset(pcidevs, xdsr->pcidevs, 0,
>> xdsr->num_pcidevs*sizeof(xen_guest_pcidev_info_t))
>
> Almost:
>
>      copy_from_guest(pcidevs, xdsr->pcidevs, xdsr->num_pcidevs * sizeof(*pcidevs))

Fixed.

>
>>>> --- a/xen/include/asm-x86/hvm/domain.h
>>>> +++ b/xen/include/asm-x86/hvm/domain.h
>>>> @@ -90,6 +90,10 @@ struct hvm_domain {
>>>>         /* Cached CF8 for guest PCI config cycles */
>>>>         uint32_t                pci_cf8;
>>>>
>>>> +    uint32_t                pci_rdmforce;
>>>
>>> I still don't see why this is a uint32_t.
>>
>> How about bool_t?
>
> Exactly.
>
>> On the Xen side we have 'bool_t', but on the tools side we have 'bool'.
>> So how should this be defined in xen/include/public/domctl.h?
>
> Have a structure field named e.g. "flags" and a #define consuming
> exactly one bit of it. Just like it's being done everywhere else.

Like this?

  typedef struct xen_domctl_assign_device xen_domctl_assign_device_t;
  DEFINE_XEN_GUEST_HANDLE(xen_domctl_assign_device_t);

+struct xen_guest_pcidev_info {
+    uint8_t bus;
+    uint8_t devfn;
+    struct {
+        uint32_t    force           : 1,
+                    reserved        : 31;
+    }flags;
+};
+typedef struct xen_guest_pcidev_info xen_guest_pcidev_info_t;
+DEFINE_XEN_GUEST_HANDLE(xen_guest_pcidev_info_t);
+/* Control whether/how we check and reserve device memory. */
+struct xen_domctl_set_rdm {
+    struct {
+        uint32_t    force           : 1,
+                    reserved        : 31;
+    }flags;
+    uint32_t    num_pcidevs;
+    XEN_GUEST_HANDLE_64(xen_guest_pcidev_info_t) pcidevs;
+};
+typedef struct xen_domctl_set_rdm xen_domctl_set_rdm_t;
+DEFINE_XEN_GUEST_HANDLE(xen_domctl_set_rdm_t);
+
  /* Retrieve sibling devices infomation of machine_sbdf */
  /* XEN_DOMCTL_get_device_group */
  struct xen_domctl_get_device_group {


>
>>>> @@ -1118,7 +1137,8 @@ struct xen_domctl {
>>>>             struct xen_domctl_gdbsx_domstatus   gdbsx_domstatus;
>>>>             struct xen_domctl_vnuma             vnuma;
>>>>             struct xen_domctl_psr_cmt_op        psr_cmt_op;
>>>> -        uint8_t                             pad[128];
>>>> +        struct xen_domctl_set_rdm           set_rdm;
>>>> +        uint8_t                             pad[112];
>>>
>>> Why are you altering the padding size here?
>>
>> As I understand it, we should shrink this pad when we introduce a new field,
>> shouldn't we?
>
> You realize that this is inside a union?

Sorry I didn't realize this point.

Thanks
Tiejun

* Re: [v7][RFC][PATCH 06/13] hvmloader/ram: check if guest memory is out of reserved device memory maps
  2014-11-11  9:35                                                               ` Chen, Tiejun
@ 2014-11-11  9:42                                                                 ` Jan Beulich
  2014-11-11  9:51                                                                   ` Chen, Tiejun
  0 siblings, 1 reply; 180+ messages in thread
From: Jan Beulich @ 2014-11-11  9:42 UTC (permalink / raw)
  To: Tiejun Chen; +Cc: yang.z.zhang, kevin.tian, tim, xen-devel

>>> On 11.11.14 at 10:35, <tiejun.chen@intel.com> wrote:
>> Have a structure field named e.g. "flags" and a #define consuming
>> exactly one bit of it. Just like it's being done everywhere else.
> 
> Like this?
> 
>   typedef struct xen_domctl_assign_device xen_domctl_assign_device_t;
>   DEFINE_XEN_GUEST_HANDLE(xen_domctl_assign_device_t);
> 
> +struct xen_guest_pcidev_info {
> +    uint8_t bus;
> +    uint8_t devfn;
> +    struct {
> +        uint32_t    force           : 1,
> +                    reserved        : 31;
> +    }flags;
> +};

I said #define for a reason - with a few exceptions we try to avoid
using bit fields in the public interface.
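
Roughly along these lines (a sketch only; the EX_ names and the struct are
placeholders chosen for this example, not a proposed interface). A plain
fixed-width flags word with mask #defines has a fully specified layout,
whereas the placement of bit fields within their storage unit is left to the
compiler. It also makes it easy to reject bits that are not defined yet.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag definitions: exactly one bit is consumed for now. */
#define EX_RDM_FLAG_FORCE   (1u << 0)
#define EX_RDM_FLAGS_VALID  EX_RDM_FLAG_FORCE

/* Hypothetical payload with a plain fixed-width flags word. */
struct ex_set_rdm_op {
    uint32_t flags;
    uint32_t num_pcidevs;
};

/* Reject any bit that has no meaning yet so it can be assigned later. */
static int ex_validate(const struct ex_set_rdm_op *op)
{
    assert(op != NULL);
    return (op->flags & ~EX_RDM_FLAGS_VALID) ? -1 : 0;
}

/* Test the single defined bit with a mask rather than a bit-field member. */
static int ex_is_forced(const struct ex_set_rdm_op *op)
{
    assert(op != NULL);
    return !!(op->flags & EX_RDM_FLAG_FORCE);
}
```

The caller sets op.flags = EX_RDM_FLAG_FORCE (or 0) and the handler checks
ex_validate() before acting on ex_is_forced().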

Jan

* Re: [v7][RFC][PATCH 06/13] hvmloader/ram: check if guest memory is out of reserved device memory maps
  2014-11-11  9:06                                                                 ` Jan Beulich
@ 2014-11-11  9:42                                                                   ` Chen, Tiejun
  2014-11-11 10:07                                                                     ` Jan Beulich
  2014-11-12  3:05                                                                     ` Chen, Tiejun
  0 siblings, 2 replies; 180+ messages in thread
From: Chen, Tiejun @ 2014-11-11  9:42 UTC (permalink / raw)
  To: Jan Beulich; +Cc: yang.z.zhang, kevin.tian, tim, xen-devel

On 2014/11/11 17:06, Jan Beulich wrote:
>>>> On 11.11.14 at 10:03, <JBeulich@suse.com> wrote:
>>>>> On 11.11.14 at 08:49, <tiejun.chen@intel.com> wrote:
>>> Unless we move all the checks inside each callback function.
>>
>> That's what you should do imo, albeit I realize that the comparing

Yes, I can do this in all existing callback functions, but I'm somewhat
concerned: when someone introduces a new callback function, who can
guarantee that the check is done there as well?

>> of the specific SBDFs will need additional consideration.
>
> Part of which would involve re-considering whether device
> assignment shouldn't be done before memory population in the
> tool stack.
>

Yes, once we introduce this new domctl, we just need to make sure it is
called before memory population, no matter when we actually assign a
device as passthrough.

Thanks
Tiejun

* Re: [v7][RFC][PATCH 06/13] hvmloader/ram: check if guest memory is out of reserved device memory maps
  2014-11-11  9:42                                                                 ` Jan Beulich
@ 2014-11-11  9:51                                                                   ` Chen, Tiejun
  0 siblings, 0 replies; 180+ messages in thread
From: Chen, Tiejun @ 2014-11-11  9:51 UTC (permalink / raw)
  To: Jan Beulich; +Cc: yang.z.zhang, kevin.tian, tim, xen-devel

On 2014/11/11 17:42, Jan Beulich wrote:
>>>> On 11.11.14 at 10:35, <tiejun.chen@intel.com> wrote:
>>> Have a structure field named e.g. "flags" and a #define consuming
>>> exactly one bit of it. Just like it's being done everywhere else.
>>
>> Like this?
>>
>>    typedef struct xen_domctl_assign_device xen_domctl_assign_device_t;
>>    DEFINE_XEN_GUEST_HANDLE(xen_domctl_assign_device_t);
>>
>> +struct xen_guest_pcidev_info {
>> +    uint8_t bus;
>> +    uint8_t devfn;
>> +    struct {
>> +        uint32_t    force           : 1,
>> +                    reserved        : 31;
>> +    }flags;
>> +};
>
> I said #define for a reason - with a few exceptions we try to avoid
> using bit fields in the public interface.
>

Ok,

  typedef struct xen_domctl_assign_device xen_domctl_assign_device_t;
  DEFINE_XEN_GUEST_HANDLE(xen_domctl_assign_device_t);

+/* Currently just one bit to indicate force to check Reserved Device Memory. */
+#define PCI_DEV_RDM_CHECK   0x1
+struct xen_guest_pcidev_info {
+    uint8_t bus;
+    uint8_t devfn;
+    uint32_t    flags;
+};
+typedef struct xen_guest_pcidev_info xen_guest_pcidev_info_t;
+DEFINE_XEN_GUEST_HANDLE(xen_guest_pcidev_info_t);
+/* Control whether/how we check and reserve device memory. */
+struct xen_domctl_set_rdm {
+    uint32_t    flags;
+    uint32_t    num_pcidevs;
+    XEN_GUEST_HANDLE_64(xen_guest_pcidev_info_t) pcidevs;
+};
+typedef struct xen_domctl_set_rdm xen_domctl_set_rdm_t;
+DEFINE_XEN_GUEST_HANDLE(xen_domctl_set_rdm_t);
+
  /* Retrieve sibling devices infomation of machine_sbdf */
  /* XEN_DOMCTL_get_device_group */
  struct xen_domctl_get_device_group {

Thanks
Tiejun

* Re: [v7][RFC][PATCH 06/13] hvmloader/ram: check if guest memory is out of reserved device memory maps
  2014-11-11  9:42                                                                   ` Chen, Tiejun
@ 2014-11-11 10:07                                                                     ` Jan Beulich
  2014-11-12  1:36                                                                       ` Chen, Tiejun
  2014-11-12  3:05                                                                     ` Chen, Tiejun
  1 sibling, 1 reply; 180+ messages in thread
From: Jan Beulich @ 2014-11-11 10:07 UTC (permalink / raw)
  To: Tiejun Chen; +Cc: yang.z.zhang, kevin.tian, tim, xen-devel

>>> On 11.11.14 at 10:42, <tiejun.chen@intel.com> wrote:
> On 2014/11/11 17:06, Jan Beulich wrote:
>> Part of which would involve re-considering whether device
>> assignment shouldn't be done before memory population in the
>> tool stack.
>>
> 
> Yes, once we introduce this new domctl, we just need to make sure it is
> called before memory population, no matter when we actually assign a
> device as passthrough.

You didn't think through the implications then: If device assignment
happens early enough, there's no need to report SBDF tuples via a
new domctl (or only if we want to still allow for post-boot assignment
of affected devices without using the global enforcement flag, in
which case assigning devices early at boot time is pointless).

Jan

* Re: [v7][RFC][PATCH 06/13] hvmloader/ram: check if guest memory is out of reserved device memory maps
  2014-11-11 10:07                                                                     ` Jan Beulich
@ 2014-11-12  1:36                                                                       ` Chen, Tiejun
  2014-11-12  8:37                                                                         ` Jan Beulich
  0 siblings, 1 reply; 180+ messages in thread
From: Chen, Tiejun @ 2014-11-12  1:36 UTC (permalink / raw)
  To: Jan Beulich; +Cc: yang.z.zhang, kevin.tian, tim, xen-devel

On 2014/11/11 18:07, Jan Beulich wrote:
>>>> On 11.11.14 at 10:42, <tiejun.chen@intel.com> wrote:
>> On 2014/11/11 17:06, Jan Beulich wrote:
>>> Part of which would involve re-considering whether device
>>> assignment shouldn't be done before memory population in the
>>> tool stack.
>>>
>>
>> Yes, once we introduce this new domctl, we just need to make sure it is
>> called before memory population, no matter when we actually assign a
>> device as passthrough.
>
> You didn't think through the implications then: If device assignment
> happens early enough, there's no need to report SBDF tuples via a

At present, device assignment always happens after memory population.
As I mentioned previously, I double-checked this sequence with printk.

> new domctl (or only if we want to still allow for post-boot assignment
> of affected devices without using the global enforcement flag, in
> which case assigning devices early at boot time is pointless).

The global enforcement flag is mainly meant for the case where the user
knows that he/she may need to hotplug a device later, so we had better
check and reserve all RMRR ranges in advance.

So I guess you mean we need to separate this interface from our original
device assignment, like this:

#1 pci_force, as a global variable, would control whether we force a
check/reservation of __all__ RMRR ranges.

#2 The flags field of each specific device in the new domctl would control
whether this device needs to check/reserve its own RMRR range. But it's
not dependent on the current device assignment domctl, so the user can use
it to control separately which devices need to work as hotplug later.
This means we should introduce new parameters just like the current 'pci'
construction, right? Or extend the current device assignment to be
compatible with this case, for instance with a new field to indicate
whether we really want to do device assignment at boot time.

Thanks
Tiejun

* Re: [v7][RFC][PATCH 06/13] hvmloader/ram: check if guest memory is out of reserved device memory maps
  2014-11-11  9:42                                                                   ` Chen, Tiejun
  2014-11-11 10:07                                                                     ` Jan Beulich
@ 2014-11-12  3:05                                                                     ` Chen, Tiejun
  2014-11-12  8:55                                                                       ` Jan Beulich
  1 sibling, 1 reply; 180+ messages in thread
From: Chen, Tiejun @ 2014-11-12  3:05 UTC (permalink / raw)
  To: Jan Beulich; +Cc: yang.z.zhang, kevin.tian, tim, xen-devel

On 2014/11/11 17:42, Chen, Tiejun wrote:
> On 2014/11/11 17:06, Jan Beulich wrote:
>>>>> On 11.11.14 at 10:03, <JBeulich@suse.com> wrote:
>>>>>> On 11.11.14 at 08:49, <tiejun.chen@intel.com> wrote:
>>>> Unless we move all the checks inside each callback function.
>>>
>>> That's what you should do imo, albeit I realize that the comparing
>
> Yes, I can do this in all existing callback functions, but I'm somewhat
> concerned: when someone introduces a new callback function, who can
> guarantee that the check is done there as well?
>

I don't see any feedback on this point, so I assume you still prefer that
we do all the checks in the callback function.

I tried to address this, but obviously we have to pass each 'bdf' to the
callback functions:

diff --git a/xen/common/compat/memory.c b/xen/common/compat/memory.c
index af613b9..db4b90f 100644
--- a/xen/common/compat/memory.c
+++ b/xen/common/compat/memory.c
@@ -20,12 +20,16 @@ CHECK_mem_access_op;
  struct get_reserved_device_memory {
      struct compat_reserved_device_memory_map map;
      unsigned int used_entries;
+    struct domain *domain;
  };

-static int get_reserved_device_memory(xen_pfn_t start,
-                                      xen_ulong_t nr, void *ctxt)
+static int get_reserved_device_memory(xen_pfn_t start, xen_ulong_t nr,
+                                      u16 bdf, void *ctxt)
  {
      struct get_reserved_device_memory *grdm = ctxt;
+    u16 pt_bdf;
+    int i;
+    struct domain *d = grdm->domain;

      if ( grdm->used_entries < grdm->map.nr_entries )
      {
@@ -36,9 +40,24 @@ static int get_reserved_device_memory(xen_pfn_t start,
          if ( rdm.start_pfn != start || rdm.nr_pages != nr )
              return -ERANGE;

-        if ( __copy_to_compat_offset(grdm->map.buffer, grdm->used_entries,
-                                     &rdm, 1) )
-            return -EFAULT;
+        if ( d->arch.hvm_domain.pci_force )
+        {
+            if ( __copy_to_compat_offset(grdm->map.buffer, grdm->used_entries,
+                                         &rdm, 1) )
+                return -EFAULT;
+        }
+        else
+        {
+            for ( i = 0; i < d->arch.hvm_domain.num_pcidevs; i++ )
+            {
+                pt_bdf = PCI_BDF2(d->arch.hvm_domain.pcidevs[i].bus,
+                                  d->arch.hvm_domain.pcidevs[i].devfn);
+                if ( pt_bdf == bdf )
+                    if ( __copy_to_compat_offset(grdm->map.buffer, grdm->used_entries,
+                                                 &rdm, 1) )
+                        return -EFAULT;
+            }
+        }
      }

      ++grdm->used_entries;
@@ -314,6 +333,7 @@ int compat_memory_op(unsigned int cmd, XEN_GUEST_HANDLE_PARAM(void) compat)
                  return -EFAULT;

              grdm.used_entries = 0;
+            grdm.domain = current->domain;
              rc = iommu_get_reserved_device_memory(get_reserved_device_memory,
                                                    &grdm);

diff --git a/xen/common/memory.c b/xen/common/memory.c
index 2177c56..f5e9c1f 100644
--- a/xen/common/memory.c
+++ b/xen/common/memory.c
@@ -696,12 +696,16 @@ out:
  struct get_reserved_device_memory {
      struct xen_reserved_device_memory_map map;
      unsigned int used_entries;
+    struct domain *domain;
  };

-static int get_reserved_device_memory(xen_pfn_t start,
-                                      xen_ulong_t nr, void *ctxt)
+static int get_reserved_device_memory(xen_pfn_t start, xen_ulong_t nr,
+                                      u16 bdf, void *ctxt)
  {
      struct get_reserved_device_memory *grdm = ctxt;
+    u16 pt_bdf;
+    int i;
+    struct domain *d = grdm->domain;

      if ( grdm->used_entries < grdm->map.nr_entries )
      {
@@ -709,9 +713,24 @@ static int get_reserved_device_memory(xen_pfn_t start,
              .start_pfn = start, .nr_pages = nr
          };

-        if ( __copy_to_guest_offset(grdm->map.buffer, grdm->used_entries,
-                                    &rdm, 1) )
-            return -EFAULT;
+        if ( d->arch.hvm_domain.pci_force )
+        {
+            if ( __copy_to_guest_offset(grdm->map.buffer, grdm->used_entries,
+                                        &rdm, 1) )
+                return -EFAULT;
+        }
+        else
+        {
+            for ( i = 0; i < d->arch.hvm_domain.num_pcidevs; i++ )
+            {
+                pt_bdf = PCI_BDF2(d->arch.hvm_domain.pcidevs[i].bus,
+                                  d->arch.hvm_domain.pcidevs[i].devfn);
+                if ( pt_bdf == bdf )
+                    if ( __copy_to_guest_offset(grdm->map.buffer, grdm->used_entries,
+                                                &rdm, 1) )
+                        return -EFAULT;
+            }
+        }
      }

      ++grdm->used_entries;
@@ -1139,6 +1158,7 @@ long do_memory_op(unsigned long cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
              return -EFAULT;

          grdm.used_entries = 0;
+        grdm.domain = current->domain;
          rc = iommu_get_reserved_device_memory(get_reserved_device_memory,
                                                &grdm);

diff --git a/xen/drivers/passthrough/vtd/dmar.c b/xen/drivers/passthrough/vtd/dmar.c
index 141e735..68da9d0 100644
--- a/xen/drivers/passthrough/vtd/dmar.c
+++ b/xen/drivers/passthrough/vtd/dmar.c
@@ -898,11 +898,14 @@ int intel_iommu_get_reserved_device_memory(iommu_grdm_t *func, void *ctxt)
  {
      struct acpi_rmrr_unit *rmrr;
      int rc = 0;
+    int i;
+    u16 bdf;

-    list_for_each_entry(rmrr, &acpi_rmrr_units, list)
+    for_each_rmrr_device ( rmrr, bdf, i )
      {
          rc = func(PFN_DOWN(rmrr->base_address),
                    PFN_UP(rmrr->end_address) - PFN_DOWN(rmrr->base_address),
+                  bdf,
                    ctxt);
          if ( rc )
              break;
diff --git a/xen/include/xen/iommu.h b/xen/include/xen/iommu.h
index 409f6f8..ddea0be 100644
--- a/xen/include/xen/iommu.h
+++ b/xen/include/xen/iommu.h
@@ -120,7 +120,7 @@ void iommu_dt_domain_destroy(struct domain *d);

  struct page_info;

-typedef int iommu_grdm_t(xen_pfn_t start, xen_ulong_t nr, void *ctxt);
+typedef int iommu_grdm_t(xen_pfn_t start, xen_ulong_t nr, u16 bdf, void *ctxt);

  struct iommu_ops {
      int (*init)(struct domain *d);
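
To make the page arithmetic in the dmar.c hunk concrete, here is a minimal
standalone sketch, not part of the patch. It assumes 4K pages, PFN_DOWN/PFN_UP
macros defined locally in the usual way, an end_address that is the last byte
of the region, and a hypothetical struct rmrr_range standing in for the two
fields taken from struct acpi_rmrr_unit:

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT  12
#define PAGE_SIZE   (1ULL << PAGE_SHIFT)
#define PFN_DOWN(x) ((x) >> PAGE_SHIFT)
#define PFN_UP(x)   (((x) + PAGE_SIZE - 1) >> PAGE_SHIFT)

/* Hypothetical stand-in for the fields used from struct acpi_rmrr_unit. */
struct rmrr_range {
    uint64_t base_address;
    uint64_t end_address;   /* assumed: last byte of the region */
};

/* First page frame covered by the range. */
static uint64_t rmrr_start_pfn(const struct rmrr_range *r)
{
    return PFN_DOWN(r->base_address);
}

/* Number of page frames handed to the callback as 'nr'. */
static uint64_t rmrr_nr_pages(const struct rmrr_range *r)
{
    assert(r->end_address >= r->base_address);
    return PFN_UP(r->end_address) - PFN_DOWN(r->base_address);
}
```

For a range from 0xab000 to 0xadfff this gives a start pfn of 0xab and a
count of 3 pages.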

Thanks
Tiejun


* Re: [v7][RFC][PATCH 06/13] hvmloader/ram: check if guest memory is out of reserved device memory maps
  2014-11-12  1:36                                                                       ` Chen, Tiejun
@ 2014-11-12  8:37                                                                         ` Jan Beulich
  2014-11-12  8:45                                                                           ` Chen, Tiejun
  0 siblings, 1 reply; 180+ messages in thread
From: Jan Beulich @ 2014-11-12  8:37 UTC (permalink / raw)
  To: Tiejun Chen; +Cc: yang.z.zhang, kevin.tian, tim, xen-devel

>>> On 12.11.14 at 02:36, <tiejun.chen@intel.com> wrote:
> On 2014/11/11 18:07, Jan Beulich wrote:
>>>>> On 11.11.14 at 10:42, <tiejun.chen@intel.com> wrote:
>>> On 2014/11/11 17:06, Jan Beulich wrote:
>>>> Part of which would involve re-considering whether device
>>>> assignment shouldn't be done before memory population in the
>>>> tool stack.
>>>
>>> Yes, after we introduce this new domctl, we just make sure this domctl
>>> should be called before memory population no matter when we do assign a
>>> device as passthrough.
>>
>> You didn't think through the implications then: If device assignment
>> happens early enough, there's no need to report SBDF tuples via a
> 
> In the present the device assignment is always after memory population. 
> And I also mentioned previously I double checked this sequence with printk.

And I didn't question that; instead I suggested to re-consider whether
that should be changed.

>> new domctl (or only if we want to still allow for post-boot assignment
>> of affected devices without using the global enforcement flag, in
>> which case assigning devices early at boot time is pointless).
> 
> The global enforcement flag is mainly meant for the case where the user
> knows that he/she may need a hotplug later, so that we check and reserve
> all RMRR ranges in advance.
> 
> So I guess you mean we need to separate this interface from our original 
> device assignment like this,
> 
> #1 pci_force as a global variable would control whether we force to 
> check/reserve __all__ RMRR ranges.

Yes.

> #2 flags field in each specific device of new domctl would control 
> whether this device needs to check/reserve its own RMRR range. But it's
> not dependent on current device assignment domctl, so the user can use
> them to control which devices need to work as hotplug later, separately. 

And this could be left as a second step, in order for what needs to
be done now to not get more complicated than necessary.

Jan


* Re: [v7][RFC][PATCH 06/13] hvmloader/ram: check if guest memory is out of reserved device memory maps
  2014-11-12  8:37                                                                         ` Jan Beulich
@ 2014-11-12  8:45                                                                           ` Chen, Tiejun
  2014-11-12  9:02                                                                             ` Jan Beulich
  0 siblings, 1 reply; 180+ messages in thread
From: Chen, Tiejun @ 2014-11-12  8:45 UTC (permalink / raw)
  To: Jan Beulich; +Cc: yang.z.zhang, kevin.tian, tim, xen-devel

>> #2 flags field in each specific device of new domctl would control
>> whether this device needs to check/reserve its own RMRR range. But it's
>> not dependent on current device assignment domctl, so the user can use
>> them to control which devices need to work as hotplug later, separately.
>
> And this could be left as a second step, in order for what needs to
> be done now to not get more complicated than necessary.
>

Do you mean that for now we still rely on the device assignment domctl to
provide the SBDF? Then it looks like nothing needs to change in our policy.

Thanks
Tiejun


* Re: [v7][RFC][PATCH 06/13] hvmloader/ram: check if guest memory is out of reserved device memory maps
  2014-11-12  3:05                                                                     ` Chen, Tiejun
@ 2014-11-12  8:55                                                                       ` Jan Beulich
  2014-11-12 10:18                                                                         ` Chen, Tiejun
                                                                                           ` (2 more replies)
  0 siblings, 3 replies; 180+ messages in thread
From: Jan Beulich @ 2014-11-12  8:55 UTC (permalink / raw)
  To: Tiejun Chen; +Cc: yang.z.zhang, kevin.tian, tim, xen-devel

>>> On 12.11.14 at 04:05, <tiejun.chen@intel.com> wrote:
> I don't see any feedback to this point, so I think you still prefer we 
> should do all check in the callback function.

As a draft this looks reasonable, but there are various bugs to be
dealt with along with cosmetic issues (I'll point out the former, but
I'm tired of pointing out the latter once again - please go back to
earlier reviews of patches to refresh e.g. what types to use for
loop variables).

> I tried to address this but obviously we have to pass each 'bdf' to
> callback functions,

Yes, but at the generic IOMMU layer this shouldn't be named "bdf",
but something more neutral (maybe "id"). And you again lost the
segment there.

> @@ -36,9 +40,24 @@ static int get_reserved_device_memory(xen_pfn_t start,
>           if ( rdm.start_pfn != start || rdm.nr_pages != nr )
>               return -ERANGE;
> 
> -        if ( __copy_to_compat_offset(grdm->map.buffer, grdm->used_entries,
> -                                     &rdm, 1) )
> -            return -EFAULT;
> +        if ( d->arch.hvm_domain.pci_force )
> +        {
> +            if ( __copy_to_compat_offset(grdm->map.buffer, grdm->used_entries,
> +                                         &rdm, 1) )
> +                return -EFAULT;
> +        }
> +        else
> +        {
> +            for ( i = 0; i < d->arch.hvm_domain.num_pcidevs; i++ )
> +            {
> +                pt_bdf = PCI_BDF2(d->arch.hvm_domain.pcidevs[i].bus,
> +                                  d->arch.hvm_domain.pcidevs[i].devfn);
> +                if ( pt_bdf == bdf )
> +                    if ( __copy_to_compat_offset(grdm->map.buffer, grdm->used_entries,
> +                                                 &rdm, 1) )
> +                        return -EFAULT;

I think d->arch.hvm_domain.pcidevs[] shouldn't contain duplicates,
and hence there's no point continuing the loop if you found a match.
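
As a rough illustration of that point (a sketch only, with a hypothetical
struct pt_dev standing in for the pcidevs[] entries and a local bdf2() helper
mirroring how PCI_BDF2() is used above), the lookup can simply return on the
first match:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one entry of d->arch.hvm_domain.pcidevs[]. */
struct pt_dev {
    uint8_t bus;
    uint8_t devfn;
};

/* Combine bus and devfn into a 16-bit BDF, as PCI_BDF2() is used above. */
static uint16_t bdf2(uint8_t bus, uint8_t devfn)
{
    return (uint16_t)(((unsigned int)bus << 8) | devfn);
}

/* Return true as soon as 'bdf' is found; the list is assumed duplicate-free. */
static bool bdf_is_assigned(const struct pt_dev *devs, unsigned int nr,
                            uint16_t bdf)
{
    unsigned int i;

    assert(devs != NULL || nr == 0);

    for ( i = 0; i < nr; i++ )
        if ( bdf2(devs[i].bus, devs[i].devfn) == bdf )
            return true;    /* first match is enough, stop scanning */

    return false;
}
```

The caller would then do the copy once, outside the loop, when this returns
true.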

> +            }
> +        }
>       }
> 
>       ++grdm->used_entries;

This should no longer get incremented unconditionally.
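
To spell out what is meant, here is a simplified sketch under stated
assumptions: struct grdm_ctxt is a hypothetical, cut-down version of struct
get_reserved_device_memory, report_all stands in for the pci_force flag, and
a plain array store replaces the guest copy. A range that is filtered out is
neither copied nor counted, while a range that passes the filter is counted
even when the buffer is already full, so the caller can still learn how many
entries it needs:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct rdm_entry {
    uint64_t start_pfn;
    uint64_t nr_pages;
};

/* Hypothetical context, loosely modelled on get_reserved_device_memory. */
struct grdm_ctxt {
    struct rdm_entry *buffer;   /* caller-provided array */
    unsigned int nr_entries;    /* capacity of buffer */
    unsigned int used_entries;  /* ranges that passed the filter */
    bool report_all;            /* stands in for the pci_force flag */
};

static int report_range(struct grdm_ctxt *c, uint64_t start, uint64_t nr,
                        bool dev_assigned)
{
    assert(c != NULL);

    if ( !c->report_all && !dev_assigned )
        return 0;               /* filtered out: not copied, not counted */

    if ( c->used_entries < c->nr_entries )
    {
        c->buffer[c->used_entries].start_pfn = start;
        c->buffer[c->used_entries].nr_pages = nr;
    }

    ++c->used_entries;          /* counted only for reported ranges */

    return 0;
}
```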

> @@ -314,6 +333,7 @@ int compat_memory_op(unsigned int cmd, XEN_GUEST_HANDLE_PARAM(void) compat)
>                   return -EFAULT;
> 
>               grdm.used_entries = 0;
> +            grdm.domain = current->domain;
>               rc = iommu_get_reserved_device_memory(get_reserved_device_memory,
>                                                     &grdm);
> 

Maybe I misguided you with an earlier response, or maybe the
earlier reply was in a different context: There's no point
communicating current->domain to the callback function; there
would be a point communicating the domain if it was _not_
always current->domain.

Jan


* Re: [v7][RFC][PATCH 06/13] hvmloader/ram: check if guest memory is out of reserved device memory maps
  2014-11-12  8:45                                                                           ` Chen, Tiejun
@ 2014-11-12  9:02                                                                             ` Jan Beulich
  2014-11-12  9:13                                                                               ` Chen, Tiejun
  0 siblings, 1 reply; 180+ messages in thread
From: Jan Beulich @ 2014-11-12  9:02 UTC (permalink / raw)
  To: Tiejun Chen; +Cc: yang.z.zhang, kevin.tian, tim, xen-devel

>>> On 12.11.14 at 09:45, <tiejun.chen@intel.com> wrote:
>>> #2 flags field in each specific device of new domctl would control
>>> whether this device needs to check/reserve its own RMRR range. But it's
>>> not dependent on current device assignment domctl, so the user can use
>>> them to control which devices need to work as hotplug later, separately.
>>
>> And this could be left as a second step, in order for what needs to
>> be done now to not get more complicated than necessary.
>>
> 
> Do you mean currently we still rely on the device assignment domctl to 
> provide SBDF? So looks nothing should be changed in our policy.

I can't connect your question to what I said. What I tried to tell you
was that I don't currently see a need to make this overly complicated:
Having the option to punch holes for all devices and (by default)
dealing with just the devices assigned at boot may be sufficient as a
first step. Yet (repeating just to avoid any misunderstanding) that
makes things easier only if we decide to require device assignment to
happen before memory getting populated (since in that case there's
no need for a new domctl to communicate SBDFs, as devices needing
holes will be known to the hypervisor already).

Jan


* Re: [v7][RFC][PATCH 06/13] hvmloader/ram: check if guest memory is out of reserved device memory maps
  2014-11-12  9:02                                                                             ` Jan Beulich
@ 2014-11-12  9:13                                                                               ` Chen, Tiejun
  2014-11-12  9:56                                                                                 ` Jan Beulich
  0 siblings, 1 reply; 180+ messages in thread
From: Chen, Tiejun @ 2014-11-12  9:13 UTC (permalink / raw)
  To: Jan Beulich; +Cc: yang.z.zhang, kevin.tian, tim, xen-devel

On 2014/11/12 17:02, Jan Beulich wrote:
>>>> On 12.11.14 at 09:45, <tiejun.chen@intel.com> wrote:
>>>> #2 flags field in each specific device of new domctl would control
>>>> whether this device needs to check/reserve its own RMRR range. But it's
>>>> not dependent on current device assignment domctl, so the user can use
>>>> them to control which devices need to work as hotplug later, separately.
>>>
>>> And this could be left as a second step, in order for what needs to
>>> be done now to not get more complicated than necessary.
>>>
>>
>> Do you mean currently we still rely on the device assignment domctl to
>> provide SBDF? So looks nothing should be changed in our policy.
>
> I can't connect your question to what I said. What I tried to tell you

I seem to be misunderstanding something.

> was that I don't currently see a need to make this overly complicated:
> Having the option to punch holes for all devices and (by default)
> dealing with just the devices assigned at boot may be sufficient as a
> first step. Yet (repeating just to avoid any misunderstanding) that
> makes things easier only if we decide to require device assignment to
> happen before memory getting populated (since in that case there's

Here what do you mean, 'if we decide to require device assignment to 
happen before memory getting populated'?

Because -quote-
"
In the present the device assignment is always after memory population. 
And I also mentioned previously I double checked this sequence with printk.
"

Or have you already planned or decided to change this sequence?

Thanks
Tiejun

> no need for a new domctl to communicate SBDFs, as devices needing
> holes will be known to the hypervisor already).
>
> Jan
>
>
>


* Re: [v7][RFC][PATCH 06/13] hvmloader/ram: check if guest memory is out of reserved device memory maps
  2014-11-12  9:13                                                                               ` Chen, Tiejun
@ 2014-11-12  9:56                                                                                 ` Jan Beulich
  2014-11-12 10:18                                                                                   ` Chen, Tiejun
                                                                                                     ` (2 more replies)
  0 siblings, 3 replies; 180+ messages in thread
From: Jan Beulich @ 2014-11-12  9:56 UTC (permalink / raw)
  To: Tiejun Chen; +Cc: yang.z.zhang, kevin.tian, tim, xen-devel

>>> On 12.11.14 at 10:13, <tiejun.chen@intel.com> wrote:
> On 2014/11/12 17:02, Jan Beulich wrote:
>>>>> On 12.11.14 at 09:45, <tiejun.chen@intel.com> wrote:
>>>>> #2 flags field in each specific device of new domctl would control
>>>>> whether this device needs to check/reserve its own RMRR range. But it's
>>>>> not dependent on current device assignment domctl, so the user can use
>>>>> them to control which devices need to work as hotplug later, separately.
>>>>
>>>> And this could be left as a second step, in order for what needs to
>>>> be done now to not get more complicated than necessary.
>>>>
>>>
>>> Do you mean currently we still rely on the device assignment domctl to
>>> provide SBDF? So looks nothing should be changed in our policy.
>>
>> I can't connect your question to what I said. What I tried to tell you
> 
> I seem to be misunderstanding something.
> 
>> was that I don't currently see a need to make this overly complicated:
>> Having the option to punch holes for all devices and (by default)
>> dealing with just the devices assigned at boot may be sufficient as a
>> first step. Yet (repeating just to avoid any misunderstanding) that
>> makes things easier only if we decide to require device assignment to
>> happen before memory getting populated (since in that case there's
> 
> Here what do you mean, 'if we decide to require device assignment to 
> happen before memory getting populated'?
> 
> Because -quote-
> "
> In the present the device assignment is always after memory population. 
> And I also mentioned previously I double checked this sequence with printk.
> "
> 
> Or have you already planned or decided to change this sequence?

So it is now the 3rd time that I'm telling you that part of your
decision making as to which route to follow should be to
re-consider whether the current sequence of operations shouldn't
be changed. Please also consult with the VT-d maintainers (hint to
them: participating in this discussion publicly would be really nice)
on _all_ decisions to be made here.

Jan


* Re: [v7][RFC][PATCH 06/13] hvmloader/ram: check if guest memory is out of reserved device memory maps
  2014-11-12  9:56                                                                                 ` Jan Beulich
@ 2014-11-12 10:18                                                                                   ` Chen, Tiejun
  2014-11-19  8:17                                                                                   ` Tian, Kevin
  2014-11-20  7:45                                                                                   ` Tian, Kevin
  2 siblings, 0 replies; 180+ messages in thread
From: Chen, Tiejun @ 2014-11-12 10:18 UTC (permalink / raw)
  To: Jan Beulich, kevin.tian, yang.z.zhang; +Cc: tim, xen-devel



On 2014/11/12 17:56, Jan Beulich wrote:
>>>> On 12.11.14 at 10:13, <tiejun.chen@intel.com> wrote:
>> On 2014/11/12 17:02, Jan Beulich wrote:
>>>>>> On 12.11.14 at 09:45, <tiejun.chen@intel.com> wrote:
>>>>>> #2 flags field in each specific device of new domctl would control
>>>>>> whether this device needs to check/reserve its own RMRR range. But it's
>>>>>> not dependent on current device assignment domctl, so the user can use
>>>>>> them to control which devices need to work as hotplug later, separately.
>>>>>
>>>>> And this could be left as a second step, in order for what needs to
>>>>> be done now to not get more complicated than necessary.
>>>>>
>>>>
>>>> Do you mean currently we still rely on the device assignment domctl to
>>>> provide SBDF? So looks nothing should be changed in our policy.
>>>
>>> I can't connect your question to what I said. What I tried to tell you
>>
>> I seem to be misunderstanding something.
>>
>>> was that I don't currently see a need to make this overly complicated:
>>> Having the option to punch holes for all devices and (by default)
>>> dealing with just the devices assigned at boot may be sufficient as a
>>> first step. Yet (repeating just to avoid any misunderstanding) that
>>> makes things easier only if we decide to require device assignment to
>>> happen before memory getting populated (since in that case there's
>>
>> Here what do you mean, 'if we decide to require device assignment to
>> happen before memory getting populated'?
>>
>> Because -quote-
>> "
>> In the present the device assignment is always after memory population.
>> And I also mentioned previously I double checked this sequence with printk.
>> "
>>
>> Or have you already planned or decided to change this sequence?
>
> So it is now the 3rd time that I'm telling you that part of your
> decision making as to which route to follow should be to
> re-consider whether the current sequence of operations shouldn't

As I said previously, moving the device assignment code may break some
existing frameworks.

Anyway, who can formally decide on this approach? Let me actively ping them.

> be changed. Please also consult with the VT-d maintainers (hint to

Especially, Yang and Kevin are always included in this thread.

> them: participating in this discussion publicly would be really nice)

Also in public.

Thanks
Tiejun

> on _all_ decisions to be made here.
>
> Jan
>
>
>


* Re: [v7][RFC][PATCH 06/13] hvmloader/ram: check if guest memory is out of reserved device memory maps
  2014-11-12  8:55                                                                       ` Jan Beulich
@ 2014-11-12 10:18                                                                         ` Chen, Tiejun
  2014-11-12 10:24                                                                           ` Jan Beulich
  2014-11-13  3:09                                                                         ` Chen, Tiejun
  2014-11-17  7:57                                                                         ` Chen, Tiejun
  2 siblings, 1 reply; 180+ messages in thread
From: Chen, Tiejun @ 2014-11-12 10:18 UTC (permalink / raw)
  To: Jan Beulich; +Cc: yang.z.zhang, kevin.tian, tim, xen-devel

On 2014/11/12 16:55, Jan Beulich wrote:
>>>> On 12.11.14 at 04:05, <tiejun.chen@intel.com> wrote:
>> I don't see any feedback to this point, so I think you still prefer we
>> should do all check in the callback function.
>
> As a draft this looks reasonable, but there are various bugs to be
> dealt with along with cosmetic issues (I'll point out the former, but
> I'm tired of pointing out the latter once again - please go back to
> earlier reviews of patches to refresh e.g. what types to use for
> loop variables).
>
>> I tried to address this but obviously we have to pass each 'bdf' to
>> callback functions,
>
> Yes, but at the generic IOMMU layer this shouldn't be named "bdf",
> but something more neutral (maybe "id"). And you again lost the

Okay.

> segment there.

I think we don't need the segment, since when we pass through a device the
real segment of the physical device doesn't matter to that domain.

>
>> @@ -36,9 +40,24 @@ static int get_reserved_device_memory(xen_pfn_t start,
>>            if ( rdm.start_pfn != start || rdm.nr_pages != nr )
>>                return -ERANGE;
>>
>> -        if ( __copy_to_compat_offset(grdm->map.buffer, grdm->used_entries,
>> -                                     &rdm, 1) )
>> -            return -EFAULT;
>> +        if ( d->arch.hvm_domain.pci_force )
>> +        {
>> +            if ( __copy_to_compat_offset(grdm->map.buffer, grdm->used_entries,
>> +                                         &rdm, 1) )
>> +                return -EFAULT;
>> +        }
>> +        else
>> +        {
>> +            for ( i = 0; i < d->arch.hvm_domain.num_pcidevs; i++ )
>> +            {
>> +                pt_bdf = PCI_BDF2(d->arch.hvm_domain.pcidevs[i].bus,
>> +                                  d->arch.hvm_domain.pcidevs[i].devfn);
>> +                if ( pt_bdf == bdf )
>> +                    if ( __copy_to_compat_offset(grdm->map.buffer, grdm->used_entries,
>> +                                                 &rdm, 1) )
>> +                        return -EFAULT;
>
> I think d->arch.hvm_domain.pcidevs[] shouldn't contain duplicates,
> and hence there's no point continuing the loop if you found a match.

You're right.

>
>> +            }
>> +        }
>>        }
>>
>>        ++grdm->used_entries;
>
> This should no longer get incremented unconditionally.

Yes.

>
>> @@ -314,6 +333,7 @@ int compat_memory_op(unsigned int cmd, XEN_GUEST_HANDLE_PARAM(void) compat)
>>                    return -EFAULT;
>>
>>                grdm.used_entries = 0;
>> +            grdm.domain = current->domain;
>>                rc = iommu_get_reserved_device_memory(get_reserved_device_memory,
>>                                                      &grdm);
>>
>
> Maybe I misguided you with an earlier response, or maybe the
> earlier reply was in a different context: There's no point
> communicating current->domain to the callback function; there
> would be a point communicating the domain if it was _not_
> always current->domain.
>

I will look into this.

Thanks
Tiejun


* Re: [v7][RFC][PATCH 06/13] hvmloader/ram: check if guest memory is out of reserved device memory maps
  2014-11-12 10:18                                                                         ` Chen, Tiejun
@ 2014-11-12 10:24                                                                           ` Jan Beulich
  2014-11-12 10:32                                                                             ` Chen, Tiejun
  0 siblings, 1 reply; 180+ messages in thread
From: Jan Beulich @ 2014-11-12 10:24 UTC (permalink / raw)
  To: Tiejun Chen; +Cc: yang.z.zhang, kevin.tian, tim, xen-devel

>>> On 12.11.14 at 11:18, <tiejun.chen@intel.com> wrote:
> On 2014/11/12 16:55, Jan Beulich wrote:
>>>>> On 12.11.14 at 04:05, <tiejun.chen@intel.com> wrote:
>>> I don't see any feedback to this point, so I think you still prefer we
>>> should do all check in the callback function.
>>
>> As a draft this looks reasonable, but there are various bugs to be
>> dealt with along with cosmetic issues (I'll point out the former, but
>> I'm tired of pointing out the latter once again - please go back to
>> earlier reviews of patches to refresh e.g. what types to use for
>> loop variables).
>>
>>> I tried to address this but obviously we have to pass each 'bdf' to
>>> callback functions,
>>
>> Yes, but at the generic IOMMU layer this shouldn't be named "bdf",
>> but something more neutral (maybe "id"). And you again lost the
> 
> Okay.
> 
>> segment there.
> 
> I think we don't need the segment, since when we pass through a device the
> real segment of the physical device doesn't matter to that domain.

How can this not matter? If 0001:bb:dd.f is associated with an RMRR
but 0000:bb:dd.f isn't, it's quite relevant which one is being handed
to a guest.
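
A small sketch of why the segment has to be part of the identifier. The
helper names and the 32-bit packing (segment in the upper 16 bits) are
assumptions made for illustration only, not an existing interface:

```c
#include <assert.h>
#include <stdint.h>

/* Pack segment, bus, device and function into one 32-bit identifier. */
static uint32_t make_sbdf(uint16_t seg, uint8_t bus, uint8_t dev, uint8_t fn)
{
    assert(dev < 32 && fn < 8);

    return ((uint32_t)seg << 16) | ((uint32_t)bus << 8) |
           ((uint32_t)dev << 3) | fn;
}

/* Drop the segment again, which is what a bare u16 BDF amounts to. */
static uint16_t sbdf_to_bdf(uint32_t sbdf)
{
    return (uint16_t)(sbdf & 0xffff);
}
```

With these, 0000:03:1f.2 and 0001:03:1f.2 produce the same 16-bit BDF but
different 32-bit identifiers, so a comparison on the BDF alone cannot tell
the two devices apart.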

Jan


* Re: [v7][RFC][PATCH 06/13] hvmloader/ram: check if guest memory is out of reserved device memory maps
  2014-11-12 10:24                                                                           ` Jan Beulich
@ 2014-11-12 10:32                                                                             ` Chen, Tiejun
  0 siblings, 0 replies; 180+ messages in thread
From: Chen, Tiejun @ 2014-11-12 10:32 UTC (permalink / raw)
  To: Jan Beulich; +Cc: yang.z.zhang, kevin.tian, tim, xen-devel

On 2014/11/12 18:24, Jan Beulich wrote:
>>>> On 12.11.14 at 11:18, <tiejun.chen@intel.com> wrote:
>> On 2014/11/12 16:55, Jan Beulich wrote:
>>>>>> On 12.11.14 at 04:05, <tiejun.chen@intel.com> wrote:
>>>> I don't see any feedback to this point, so I think you still prefer we
>>>> should do all check in the callback function.
>>>
>>> As a draft this looks reasonable, but there are various bugs to be
>>> dealt with along with cosmetic issues (I'll point out the former, but
>>> I'm tired of pointing out the latter once again - please go back to
>>> earlier reviews of patches to refresh e.g. what types to use for
>>> loop variables).
>>>
>>>> I tried to address this but obviously we have to pass each 'bdf' to
>>>> callback functions,
>>>
>>> Yes, but at the generic IOMMU layer this shouldn't be named "bdf",
>>> but something more neutral (maybe "id"). And you again lost the
>>
>> Okay.
>>
>>> segment there.
>>
>> I think we don't need the segment, since when we pass through a device the
>> real segment of the physical device doesn't matter to that domain.
>
> How can this not matter? If 0001:bb:dd.f is associated with an RMRR
> but 0000:bb:dd.f isn't, it's quite relevant which one is being handed
> to a guest.
>

In the passthrough case this is needed, so I will add it.

Thanks
Tiejun


* Re: [v7][RFC][PATCH 06/13] hvmloader/ram: check if guest memory is out of reserved device memory maps
  2014-11-12  8:55                                                                       ` Jan Beulich
  2014-11-12 10:18                                                                         ` Chen, Tiejun
@ 2014-11-13  3:09                                                                         ` Chen, Tiejun
  2014-11-14  2:21                                                                           ` Chen, Tiejun
  2014-11-17  7:57                                                                         ` Chen, Tiejun
  2 siblings, 1 reply; 180+ messages in thread
From: Chen, Tiejun @ 2014-11-13  3:09 UTC (permalink / raw)
  To: Jan Beulich; +Cc: yang.z.zhang, kevin.tian, tim, xen-devel

On 2014/11/12 16:55, Jan Beulich wrote:
>>>> On 12.11.14 at 04:05, <tiejun.chen@intel.com> wrote:
>> I don't see any feedback to this point, so I think you still prefer we
>> should do all check in the callback function.
>
> As a draft this looks reasonable, but there are various bugs to be
> dealt with along with cosmetic issues (I'll point out the former, but
> I'm tired of pointing out the latter once again - please go back to
> earlier reviews of patches to refresh e.g. what types to use for
> loop variables).
>
>> I tried to address this but obviously we have to pass each 'bdf' to
>> callback functions,
>
> Yes, but at the generic IOMMU layer this shouldn't be named "bdf",
> but something more neutral (maybe "id"). And you again lost the
> segment there.
>
>> @@ -36,9 +40,24 @@ static int get_reserved_device_memory(xen_pfn_t start,
>>            if ( rdm.start_pfn != start || rdm.nr_pages != nr )
>>                return -ERANGE;
>>
>> -        if ( __copy_to_compat_offset(grdm->map.buffer, grdm->used_entries,
>> -                                     &rdm, 1) )
>> -            return -EFAULT;
>> +        if ( d->arch.hvm_domain.pci_force )
>> +        {
>> +            if ( __copy_to_compat_offset(grdm->map.buffer, grdm->used_entries,
>> +                                         &rdm, 1) )
>> +                return -EFAULT;
>> +        }
>> +        else
>> +        {
>> +            for ( i = 0; i < d->arch.hvm_domain.num_pcidevs; i++ )
>> +            {
>> +                pt_bdf = PCI_BDF2(d->arch.hvm_domain.pcidevs[i].bus,
>> +                                  d->arch.hvm_domain.pcidevs[i].devfn);
>> +                if ( pt_bdf == bdf )
>> +                    if ( __copy_to_compat_offset(grdm->map.buffer, grdm->used_entries,
>> +                                                 &rdm, 1) )
>> +                        return -EFAULT;
>
> I think d->arch.hvm_domain.pcidevs[] shouldn't contain duplicates,
> and hence there's no point continuing the loop if you found a match.
>

I took another look at this. It seems we shouldn't just check the BDF, 
since as you know one entry may be shared by multiple different BDFs, so 
we have to filter them down to a single entry. Instead, I'd really prefer 
that we decide what to expose before we do the hypercall.

BTW, I have already pinged Yang locally to look into the possibility of 
reordering the sequence of device assignment and memory population on the 
IOMMU side.

Thanks
Tiejun

^ permalink raw reply	[flat|nested] 180+ messages in thread

* Re: [v7][RFC][PATCH 06/13] hvmloader/ram: check if guest memory is out of reserved device memory maps
  2014-11-13  3:09                                                                         ` Chen, Tiejun
@ 2014-11-14  2:21                                                                           ` Chen, Tiejun
  2014-11-14  8:21                                                                             ` Jan Beulich
  0 siblings, 1 reply; 180+ messages in thread
From: Chen, Tiejun @ 2014-11-14  2:21 UTC (permalink / raw)
  To: Jan Beulich, yang.z.zhang, kevin.tian; +Cc: tim, xen-devel


On 2014/11/13 11:09, Chen, Tiejun wrote:
> On 2014/11/12 16:55, Jan Beulich wrote:
>>>>> On 12.11.14 at 04:05, <tiejun.chen@intel.com> wrote:
>>> I don't see any feedback to this point, so I think you still prefer we
>>> should do all check in the callback function.
>>
>> As a draft this looks reasonable, but there are various bugs to be
>> dealt with along with cosmetic issues (I'll point out the former, but
>> I'm tired of pointing out the latter once again - please go back to
>> earlier reviews of patches to refresh e.g. what types to use for
>> loop variables).
>>
>>> I tried to address this but obviously we have to pass each 'bdf' to
>>> callback functions,
>>
>> Yes, but at the generic IOMMU layer this shouldn't be named "bdf",
>> but something more neutral (maybe "id"). And you again lost the
>> segment there.
>>
>>> @@ -36,9 +40,24 @@ static int get_reserved_device_memory(xen_pfn_t
>>> start,
>>>            if ( rdm.start_pfn != start || rdm.nr_pages != nr )
>>>                return -ERANGE;
>>>
>>> -        if ( __copy_to_compat_offset(grdm->map.buffer,
>>> grdm->used_entries,
>>> -                                     &rdm, 1) )
>>> -            return -EFAULT;
>>> +        if ( d->arch.hvm_domain.pci_force )
>>> +        {
>>> +            if ( __copy_to_compat_offset(grdm->map.buffer,
>>> grdm->used_entries,
>>> +                                         &rdm, 1) )
>>> +                return -EFAULT;
>>> +        }
>>> +        else
>>> +        {
>>> +            for ( i = 0; i < d->arch.hvm_domain.num_pcidevs; i++ )
>>> +            {
>>> +                pt_bdf = PCI_BDF2(d->arch.hvm_domain.pcidevs[i].bus,
>>> +                                  d->arch.hvm_domain.pcidevs[i].devfn);
>>> +                if ( pt_bdf == bdf )
>>> +                    if ( __copy_to_compat_offset(grdm->map.buffer,
>>> grdm->used_entries,
>>> +                                                 &rdm, 1) )
>>> +                        return -EFAULT;
>>
>> I think d->arch.hvm_domain.pcidevs[] shouldn't contain duplicates,
>> and hence there's no point continuing the loop if you found a match.
>>
>
> I took another look at this. It seems we shouldn't just check the BDF,
> since as you know one entry may be shared by multiple different BDFs, so
> we have to filter them down to a single entry. Instead, I'd really prefer
> that we decide what to expose before we do the hypercall.

Even if we eventually reorder that sequence, that would only change how we 
get the BDF. Are you fine with this follow-up change?

@@ -894,18 +894,55 @@ int platform_supports_x2apic(void)
      return cpu_has_x2apic && ((dmar_flags & mask) == ACPI_DMAR_INTR_REMAP);
  }

-int intel_iommu_get_reserved_device_memory(iommu_grdm_t *func, void *ctxt)
+int intel_iommu_get_reserved_device_memory(iommu_grdm_t *func,
+                                           struct domain *d, void *ctxt)
  {
      struct acpi_rmrr_unit *rmrr;
-    int rc = 0;
+    int rc = 0, i, j, seg_check = 1;
+    u16 id, bdf;

-    list_for_each_entry(rmrr, &acpi_rmrr_units, list)
+    if ( d->arch.hvm_domain.pci_force )
      {
-        rc = func(PFN_DOWN(rmrr->base_address),
-                  PFN_UP(rmrr->end_address) - PFN_DOWN(rmrr->base_address),
-                  ctxt);
-        if ( rc )
-            break;
+        list_for_each_entry(rmrr, &acpi_rmrr_units, list)
+        {
+            rc = func(PFN_DOWN(rmrr->base_address),
+                      PFN_UP(rmrr->end_address) -
+                          PFN_DOWN(rmrr->base_address),
+                      ctxt);
+            if ( rc )
+                break;
+        }
+    }
+    else
+    {
+        list_for_each_entry(rmrr, &acpi_rmrr_units, list)
+        {
+            for ( i = 0; i < d->arch.hvm_domain.num_pcidevs &&
+                            seg_check; i++ )
+            {
+                if ( rmrr->segment == d->arch.hvm_domain.pcidevs[i].seg )
+                {
+                    bdf = PCI_BDF2(d->arch.hvm_domain.pcidevs[i].bus,
+                                   d->arch.hvm_domain.pcidevs[i].devfn);
+                    for (j = 0; (id = rmrr->scope.devices[j]) &&
+                            j < rmrr->scope.devices_cnt && seg_check; j++)
+                    {
+                        if ( bdf == id )
+                        {
+                            rc = func(PFN_DOWN(rmrr->base_address),
+                                      PFN_UP(rmrr->end_address) -
+                                        PFN_DOWN(rmrr->base_address),
+                                      ctxt);
+                            if ( rc )
+                                return rc;
+                            /* Hit this seg entry. */
+                            seg_check = 0;
+                        }
+                    }
+                }
+            }
+            /* goto next seg entry. */
+            seg_check = 1;
+        }
      }

      return rc;

>
> BTW, I have already pinged Yang locally to look into the possibility of
> reordering the sequence of device assignment and memory population on the
> IOMMU side.

Yang and Kevin,

Any comments to this requirement?

Thanks
Tiejun

^ permalink raw reply	[flat|nested] 180+ messages in thread

* Re: [v7][RFC][PATCH 06/13] hvmloader/ram: check if guest memory is out of reserved device memory maps
  2014-11-14  2:21                                                                           ` Chen, Tiejun
@ 2014-11-14  8:21                                                                             ` Jan Beulich
  2014-11-17  7:31                                                                               ` Chen, Tiejun
  0 siblings, 1 reply; 180+ messages in thread
From: Jan Beulich @ 2014-11-14  8:21 UTC (permalink / raw)
  To: Tiejun Chen; +Cc: yang.z.zhang, kevin.tian, tim, xen-devel

>>> On 14.11.14 at 03:21, <tiejun.chen@intel.com> wrote:
> Even if we eventually reorder that sequence, that would only change how we 
> get the BDF. Are you fine with this follow-up change?

You again pass a struct domain pointer to the IOMMU-specific
function. I already told you not to do so - the domain specific
aspect should be taken care of by the callback function, i.e. you
need to make SBDF available to it (just like you already properly
did in the previous round for BDF). I suppose that'll at once make
the ugly open coding of for_each_rmrr_device() unnecessary -
you can just use that construct as replacement for what right
now is list_for_each_entry().

Jan

^ permalink raw reply	[flat|nested] 180+ messages in thread

* Re: [v7][RFC][PATCH 06/13] hvmloader/ram: check if guest memory is out of reserved device memory maps
  2014-11-14  8:21                                                                             ` Jan Beulich
@ 2014-11-17  7:31                                                                               ` Chen, Tiejun
  0 siblings, 0 replies; 180+ messages in thread
From: Chen, Tiejun @ 2014-11-17  7:31 UTC (permalink / raw)
  To: Jan Beulich; +Cc: yang.z.zhang, kevin.tian, tim, xen-devel

On 2014/11/14 16:21, Jan Beulich wrote:
>>>> On 14.11.14 at 03:21, <tiejun.chen@intel.com> wrote:
>> Even if we eventually reorder that sequence, that would only change how we
>> get the BDF. Are you fine with this follow-up change?
>
> You again pass a struct domain pointer to the IOMMU-specific
> function. I already told you not to do so - the domain specific

I remember this comment, but I wanted to show that this may introduce a lot 
of duplicated code. As I understand it, this kind of check should be common 
code for a given platform.

> aspect should be taken care of by the callback function, i.e. you
> need to make SBDF available to it (just like you already properly
> did in the previous round for BDF). I suppose that'll at once make
> the ugly open coding of for_each_rmrr_device() unnecessary -
> you can just use that construct as replacement for what right
> now is list_for_each_entry().
>

Okay, I will try to go there.

Thanks
Tiejun

^ permalink raw reply	[flat|nested] 180+ messages in thread

* Re: [v7][RFC][PATCH 06/13] hvmloader/ram: check if guest memory is out of reserved device memory maps
  2014-11-12  8:55                                                                       ` Jan Beulich
  2014-11-12 10:18                                                                         ` Chen, Tiejun
  2014-11-13  3:09                                                                         ` Chen, Tiejun
@ 2014-11-17  7:57                                                                         ` Chen, Tiejun
  2014-11-17 10:05                                                                           ` Jan Beulich
  2 siblings, 1 reply; 180+ messages in thread
From: Chen, Tiejun @ 2014-11-17  7:57 UTC (permalink / raw)
  To: Jan Beulich; +Cc: yang.z.zhang, kevin.tian, tim, xen-devel

On 2014/11/12 16:55, Jan Beulich wrote:
>>>> On 12.11.14 at 04:05, <tiejun.chen@intel.com> wrote:
>> I don't see any feedback to this point, so I think you still prefer we
>> should do all check in the callback function.
>
> As a draft this looks reasonable, but there are various bugs to be
> dealt with along with cosmetic issues (I'll point out the former, but
> I'm tired of pointing out the latter once again - please go back to
> earlier reviews of patches to refresh e.g. what types to use for
> loop variables).

Does 'earlier reviews' refer to this email thread? I went back to check, but 
the only comment I can find related to our loop code is this one:

"
 >>> +        for ( i = 0; i < xdsr->num_pcidevs; ++i )
 >>> +        {
 >>> +            if ( __copy_from_guest_offset(pcidevs, xdsr->pcidevs,
 >>> +                                          i, 1) )
 >>> +            {
 >>> +                xfree(pcidevs);
 >>> +                return -EFAULT;
 >>> +            }
 >>> +        }
 >>
 >> I don't see the need for a loop here. And you definitely can't use the
 >> double-underscore-prefixed variant the way you do.
 >
 > Do you mean this line?
 >
 > copy_from_guest_offset(pcidevs, xdsr->pcidevs, 0,
 > xdsr->num_pcidevs*sizeof(xen_guest_pcidev_info_t))

Almost:

     copy_from_guest(pcidevs, xdsr->pcidevs,
                     xdsr->num_pcidevs * sizeof(*pcidevs))
"

Or do I need to use 'unsigned int' for the loop variables here?

Anyway, I have done that in the code below. If that is still wrong I 
will correct it.

>
>> I tried to address this but obviously we have to pass each 'bdf' to
>> callback functions,
>
> Yes, but at the generic IOMMU layer this shouldn't be named "bdf",
> but something more neutral (maybe "id"). And you again lost the
> segment there.
>
>> @@ -36,9 +40,24 @@ static int get_reserved_device_memory(xen_pfn_t start,
>>            if ( rdm.start_pfn != start || rdm.nr_pages != nr )
>>                return -ERANGE;
>>
>> -        if ( __copy_to_compat_offset(grdm->map.buffer, grdm->used_entries,
>> -                                     &rdm, 1) )
>> -            return -EFAULT;
>> +        if ( d->arch.hvm_domain.pci_force )
>> +        {
>> +            if ( __copy_to_compat_offset(grdm->map.buffer, grdm->used_entries,
>> +                                         &rdm, 1) )
>> +                return -EFAULT;
>> +        }
>> +        else
>> +        {
>> +            for ( i = 0; i < d->arch.hvm_domain.num_pcidevs; i++ )
>> +            {
>> +                pt_bdf = PCI_BDF2(d->arch.hvm_domain.pcidevs[i].bus,
>> +                                  d->arch.hvm_domain.pcidevs[i].devfn);
>> +                if ( pt_bdf == bdf )
>> +                    if ( __copy_to_compat_offset(grdm->map.buffer, grdm->used_entries,
>> +                                                 &rdm, 1) )
>> +                        return -EFAULT;
>
> I think d->arch.hvm_domain.pcidevs[] shouldn't contain duplicates,
> and hence there's no point continuing the loop if you found a match.
>
>> +            }
>> +        }
>>        }
>>
>>        ++grdm->used_entries;
>
> This should no longer get incremented unconditionally.
>
>> @@ -314,6 +333,7 @@ int compat_memory_op(unsigned int cmd,
>> XEN_GUEST_HANDLE_PARAM(void) compat)
>>                    return -EFAULT;
>>
>>                grdm.used_entries = 0;
>> +            grdm.domain = current->domain;
>>                rc = iommu_get_reserved_device_memory(get_reserved_device_memory,
>>                                                      &grdm);
>>
>
> Maybe I misguided you with an earlier response, or maybe the
> earlier reply was in a different context: There's no point
> communicating current->domain to the callback function; there
> would be a point communicating the domain if it was _not_
> always current->domain.
>

So we need the caller to pass the domain ID, right?

diff --git a/xen/common/compat/memory.c b/xen/common/compat/memory.c
index af613b9..314d7e6 100644
--- a/xen/common/compat/memory.c
+++ b/xen/common/compat/memory.c
@@ -22,10 +22,13 @@ struct get_reserved_device_memory {
      unsigned int used_entries;
  };

-static int get_reserved_device_memory(xen_pfn_t start,
-                                      xen_ulong_t nr, void *ctxt)
+static int get_reserved_device_memory(xen_pfn_t start, xen_ulong_t nr, u16 seg,
+                                      u16 *ids, int cnt, void *ctxt)
  {
      struct get_reserved_device_memory *grdm = ctxt;
+    struct domain *d = get_domain_by_id(grdm->map.domid);
+    unsigned int i, j;
+    u32 sbdf, pt_sbdf;

      if ( grdm->used_entries < grdm->map.nr_entries )
      {
@@ -36,13 +39,37 @@ static int get_reserved_device_memory(xen_pfn_t start,
          if ( rdm.start_pfn != start || rdm.nr_pages != nr )
              return -ERANGE;

-        if ( __copy_to_compat_offset(grdm->map.buffer, grdm->used_entries,
-                                     &rdm, 1) )
-            return -EFAULT;
+        if ( d->arch.hvm_domain.pci_force )
+        {
+            if ( __copy_to_compat_offset(grdm->map.buffer, grdm->used_entries,
+                                         &rdm, 1) )
+                return -EFAULT;
+            ++grdm->used_entries;
+        }
+        else
+        {
+            for ( i = 0; i < cnt; i++ )
+            {
+                sbdf = PCI_SBDF(seg, ids[i]);
+                for ( j = 0; j < d->arch.hvm_domain.num_pcidevs; j++ )
+                {
+                    pt_sbdf = PCI_SBDF2(d->arch.hvm_domain.pcidevs[j].seg,
+                                        d->arch.hvm_domain.pcidevs[j].bus,
+                                        d->arch.hvm_domain.pcidevs[j].devfn);
+                    if ( pt_sbdf == sbdf )
+                    {
+                        if ( __copy_to_compat_offset(grdm->map.buffer,
+                                                     grdm->used_entries,
+                                                     &rdm, 1) )
+                            return -EFAULT;
+                        ++grdm->used_entries;
+                        break;
+                    }
+                }
+            }
+        }
      }

-    ++grdm->used_entries;
-
      return 0;
  }
  #endif
diff --git a/xen/common/memory.c b/xen/common/memory.c
index 2177c56..f27b17f 100644
--- a/xen/common/memory.c
+++ b/xen/common/memory.c
@@ -698,10 +698,13 @@ struct get_reserved_device_memory {
      unsigned int used_entries;
  };

-static int get_reserved_device_memory(xen_pfn_t start,
-                                      xen_ulong_t nr, void *ctxt)
+static int get_reserved_device_memory(xen_pfn_t start, xen_ulong_t nr, u16 seg,
+                                      u16 *ids, int cnt, void *ctxt)
  {
      struct get_reserved_device_memory *grdm = ctxt;
+    struct domain *d = get_domain_by_id(grdm->map.domid);
+    unsigned int i, j;
+    u32 sbdf, pt_sbdf;

      if ( grdm->used_entries < grdm->map.nr_entries )
      {
@@ -709,13 +712,37 @@ static int get_reserved_device_memory(xen_pfn_t start,
              .start_pfn = start, .nr_pages = nr
          };

-        if ( __copy_to_guest_offset(grdm->map.buffer, grdm->used_entries,
-                                    &rdm, 1) )
-            return -EFAULT;
+        if ( d->arch.hvm_domain.pci_force )
+        {
+            if ( __copy_to_guest_offset(grdm->map.buffer, grdm->used_entries,
+                                        &rdm, 1) )
+                return -EFAULT;
+            ++grdm->used_entries;
+        }
+        else
+        {
+            for ( i = 0; i < cnt; i++ )
+            {
+                sbdf = PCI_SBDF(seg, ids[i]);
+                for ( j = 0; j < d->arch.hvm_domain.num_pcidevs; j++ )
+                {
+                    pt_sbdf = PCI_SBDF2(d->arch.hvm_domain.pcidevs[j].seg,
+                                        d->arch.hvm_domain.pcidevs[j].bus,
+                                        d->arch.hvm_domain.pcidevs[j].devfn);
+                    if ( pt_sbdf == sbdf )
+                    {
+                        if ( __copy_to_guest_offset(grdm->map.buffer,
+                                                    grdm->used_entries,
+                                                    &rdm, 1) )
+                            return -EFAULT;
+                        ++grdm->used_entries;
+                        break;
+                    }
+                }
+            }
+        }
      }

-    ++grdm->used_entries;
-
      return 0;
  }
  #endif
diff --git a/xen/drivers/passthrough/vtd/dmar.c b/xen/drivers/passthrough/vtd/dmar.c
index 141e735..fa813c5 100644
--- a/xen/drivers/passthrough/vtd/dmar.c
+++ b/xen/drivers/passthrough/vtd/dmar.c
@@ -903,6 +903,9 @@ int intel_iommu_get_reserved_device_memory(iommu_grdm_t *func, void *ctxt)
      {
          rc = func(PFN_DOWN(rmrr->base_address),
                    PFN_UP(rmrr->end_address) - PFN_DOWN(rmrr->base_address),
+                  rmrr->segment,
+                  rmrr->scope.devices,
+                  rmrr->scope.devices_cnt,
                    ctxt);
          if ( rc )
              break;
diff --git a/xen/include/public/memory.h b/xen/include/public/memory.h
index f1b6a48..f8964e1 100644
--- a/xen/include/public/memory.h
+++ b/xen/include/public/memory.h
@@ -588,6 +588,11 @@ typedef struct xen_reserved_device_memory xen_reserved_device_memory_t;
  DEFINE_XEN_GUEST_HANDLE(xen_reserved_device_memory_t);

  struct xen_reserved_device_memory_map {
+    /*
+     * Domain whose reservation is being changed.
+     * Unprivileged domains can specify only DOMID_SELF.
+     */
+    domid_t        domid;
      /* IN/OUT */
      unsigned int nr_entries;
      /* OUT */
diff --git a/xen/include/xen/iommu.h b/xen/include/xen/iommu.h
index 409f6f8..75c6759 100644
--- a/xen/include/xen/iommu.h
+++ b/xen/include/xen/iommu.h
@@ -120,7 +120,8 @@ void iommu_dt_domain_destroy(struct domain *d);

  struct page_info;

-typedef int iommu_grdm_t(xen_pfn_t start, xen_ulong_t nr, void *ctxt);
+typedef int iommu_grdm_t(xen_pfn_t start, xen_ulong_t nr, u16 seg, u16 *ids,
+                         int cnt, void *ctxt);

  struct iommu_ops {
      int (*init)(struct domain *d);
diff --git a/xen/include/xen/pci.h b/xen/include/xen/pci.h
index 91520bc..ba881ef 100644
--- a/xen/include/xen/pci.h
+++ b/xen/include/xen/pci.h
@@ -31,6 +31,8 @@
  #define PCI_DEVFN2(bdf) ((bdf) & 0xff)
  #define PCI_BDF(b,d,f)  ((((b) & 0xff) << 8) | PCI_DEVFN(d,f))
  #define PCI_BDF2(b,df)  ((((b) & 0xff) << 8) | ((df) & 0xff))
+#define PCI_SBDF(s,bdf) ((((s) & 0xffff) << 16) | ((bdf) & 0xffff))
+#define PCI_SBDF2(s,b,df) ((((s) & 0xffff) << 16) | PCI_BDF2(b,df))

  struct pci_dev_info {
      bool_t is_extfn;

Thanks
Tiejun

^ permalink raw reply related	[flat|nested] 180+ messages in thread

* Re: [v7][RFC][PATCH 06/13] hvmloader/ram: check if guest memory is out of reserved device memory maps
  2014-11-17  7:57                                                                         ` Chen, Tiejun
@ 2014-11-17 10:05                                                                           ` Jan Beulich
  2014-11-17 11:08                                                                             ` Chen, Tiejun
  0 siblings, 1 reply; 180+ messages in thread
From: Jan Beulich @ 2014-11-17 10:05 UTC (permalink / raw)
  To: Tiejun Chen; +Cc: yang.z.zhang, kevin.tian, tim, xen-devel

>>> On 17.11.14 at 08:57, <tiejun.chen@intel.com> wrote:
> --- a/xen/common/memory.c
> +++ b/xen/common/memory.c
> @@ -698,10 +698,13 @@ struct get_reserved_device_memory {
>       unsigned int used_entries;
>   };
> 
> -static int get_reserved_device_memory(xen_pfn_t start,
> -                                      xen_ulong_t nr, void *ctxt)
> +static int get_reserved_device_memory(xen_pfn_t start, xen_ulong_t nr, u16 seg,
> +                                      u16 *ids, int cnt, void *ctxt)

While the approach is a lot better than what you did previously, I still
don't like you adding 3 new parameters when one would do (calling
the callback for each SBDF individually): That way you avoid
introducing a hidden dependency on how the VT-d code manages its
internal data.

Jan

^ permalink raw reply	[flat|nested] 180+ messages in thread

* Re: [v7][RFC][PATCH 06/13] hvmloader/ram: check if guest memory is out of reserved device memory maps
  2014-11-17 10:05                                                                           ` Jan Beulich
@ 2014-11-17 11:08                                                                             ` Chen, Tiejun
  2014-11-17 11:17                                                                               ` Jan Beulich
  0 siblings, 1 reply; 180+ messages in thread
From: Chen, Tiejun @ 2014-11-17 11:08 UTC (permalink / raw)
  To: Jan Beulich; +Cc: yang.z.zhang, kevin.tian, tim, xen-devel


On 2014/11/17 18:05, Jan Beulich wrote:
>>>> On 17.11.14 at 08:57, <tiejun.chen@intel.com> wrote:
>> --- a/xen/common/memory.c
>> +++ b/xen/common/memory.c
>> @@ -698,10 +698,13 @@ struct get_reserved_device_memory {
>>        unsigned int used_entries;
>>    };
>>
>> -static int get_reserved_device_memory(xen_pfn_t start,
>> -                                      xen_ulong_t nr, void *ctxt)
>> +static int get_reserved_device_memory(xen_pfn_t start, xen_ulong_t nr, u16 seg,
>> +                                      u16 *ids, int cnt, void *ctxt)
>
> While the approach is a lot better than what you did previously, I still
> don't like you adding 3 new parameters when one would do (calling
> the callback for each SBDF individually): That way you avoid

Do you mean I should do something like this?

for_each_rmrr_device ( rmrr, bdf, i )
{
    sbdf = PCI_SBDF(rmrr->segment, rmrr->scope.devices[i]);
    rc = func(PFN_DOWN(rmrr->base_address),
              PFN_UP(rmrr->end_address) - PFN_DOWN(rmrr->base_address),
              sbdf,
              ctxt);
}

But several different SBDFs may share the same RMRR entry, as I said 
previously, so we would have to add more code in the callback to filter 
them down to a single reported entry.

Thanks
Tiejun

> introducing a hidden dependency on how the VT-d code manages its
> internal data.
>
> Jan
>
>
>

^ permalink raw reply	[flat|nested] 180+ messages in thread

* Re: [v7][RFC][PATCH 06/13] hvmloader/ram: check if guest memory is out of reserved device memory maps
  2014-11-17 11:08                                                                             ` Chen, Tiejun
@ 2014-11-17 11:17                                                                               ` Jan Beulich
  2014-11-17 11:32                                                                                 ` Chen, Tiejun
  0 siblings, 1 reply; 180+ messages in thread
From: Jan Beulich @ 2014-11-17 11:17 UTC (permalink / raw)
  To: Tiejun Chen; +Cc: yang.z.zhang, kevin.tian, tim, xen-devel

>>> On 17.11.14 at 12:08, <tiejun.chen@intel.com> wrote:

> On 2014/11/17 18:05, Jan Beulich wrote:
>>>>> On 17.11.14 at 08:57, <tiejun.chen@intel.com> wrote:
>>> --- a/xen/common/memory.c
>>> +++ b/xen/common/memory.c
>>> @@ -698,10 +698,13 @@ struct get_reserved_device_memory {
>>>        unsigned int used_entries;
>>>    };
>>>
>>> -static int get_reserved_device_memory(xen_pfn_t start,
>>> -                                      xen_ulong_t nr, void *ctxt)
>>> +static int get_reserved_device_memory(xen_pfn_t start, xen_ulong_t nr, u16 seg,
>>> +                                      u16 *ids, int cnt, void *ctxt)
>>
>> While the approach is a lot better than what you did previously, I still
>> don't like you adding 3 new parameters when one would do (calling
>> the callback for each SBDF individually): That way you avoid
> 
> Do you mean I should do something like this?
> 
> for_each_rmrr_device ( rmrr, bdf, i )
> {
>     sbdf = PCI_SBDF(rmrr->segment, rmrr->scope.devices[i]);
>     rc = func(PFN_DOWN(rmrr->base_address),
>               PFN_UP(rmrr->end_address) - PFN_DOWN(rmrr->base_address),
>               sbdf,
>               ctxt);
> }
> 
> But several different SBDFs may share the same RMRR entry, as I said 
> previously, so we would have to add more code in the callback to filter 
> them down to a single reported entry.

Not really - remember that part of what needs to be done is to make
sure all devices associated with a given RMRR get assigned to the
same guest? Or the callback function could use a special return value
(e.g. 1) to signal that the iteration for the current RMRR can be
terminated (or further entries skipped).
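
A minimal standalone sketch of that second option, assuming a callback that
takes a single packed SBDF. All names here (struct rmrr, grdm_cb_t,
for_each_rmrr_dev, report_once) are hypothetical stand-ins rather than
existing Xen interfaces, and incrementing used_entries stands in for the
copy to the guest buffer:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one RMRR and its device scope. */
struct rmrr {
    uint64_t start_pfn, nr_pages;
    uint16_t seg;
    const uint16_t *bdfs;
    unsigned int nr_bdfs;
};

/* Callback result: < 0 error, 0 keep going, 1 done with the current RMRR. */
typedef int grdm_cb_t(uint64_t start, uint64_t nr, uint32_t sbdf, void *ctxt);

/* Producer side: one callback invocation per device of each RMRR. */
static int for_each_rmrr_dev(const struct rmrr *rmrrs, unsigned int n,
                             grdm_cb_t *func, void *ctxt)
{
    unsigned int i, j;

    assert(func != NULL);
    for ( i = 0; i < n; i++ )
        for ( j = 0; j < rmrrs[i].nr_bdfs; j++ )
        {
            uint32_t sbdf = ((uint32_t)rmrrs[i].seg << 16) | rmrrs[i].bdfs[j];
            int rc = func(rmrrs[i].start_pfn, rmrrs[i].nr_pages, sbdf, ctxt);

            if ( rc < 0 )
                return rc;   /* propagate errors */
            if ( rc > 0 )
                break;       /* skip the remaining devices of this RMRR */
        }

    return 0;
}

/* Consumer side: devices assigned to the guest, plus the entry counter. */
struct report {
    const uint32_t *assigned;
    unsigned int nr_assigned;
    unsigned int used_entries;
};

static int report_once(uint64_t start, uint64_t nr, uint32_t sbdf, void *ctxt)
{
    struct report *r = ctxt;
    unsigned int k;

    (void)start;
    (void)nr;
    for ( k = 0; k < r->nr_assigned; k++ )
        if ( r->assigned[k] == sbdf )
        {
            ++r->used_entries;   /* stands in for the copy to the guest */
            return 1;            /* this RMRR has been reported */
        }

    return 0;
}
```

With two assigned devices that share one RMRR, the first match returns 1, so
the range is counted once and the second device is never visited.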

Jan

^ permalink raw reply	[flat|nested] 180+ messages in thread

* Re: [v7][RFC][PATCH 06/13] hvmloader/ram: check if guest memory is out of reserved device memory maps
  2014-11-17 11:17                                                                               ` Jan Beulich
@ 2014-11-17 11:32                                                                                 ` Chen, Tiejun
  2014-11-17 11:51                                                                                   ` Jan Beulich
  0 siblings, 1 reply; 180+ messages in thread
From: Chen, Tiejun @ 2014-11-17 11:32 UTC (permalink / raw)
  To: Jan Beulich; +Cc: yang.z.zhang, kevin.tian, tim, xen-devel

On 2014/11/17 19:17, Jan Beulich wrote:
>>>> On 17.11.14 at 12:08, <tiejun.chen@intel.com> wrote:
>
>> On 2014/11/17 18:05, Jan Beulich wrote:
>>>>>> On 17.11.14 at 08:57, <tiejun.chen@intel.com> wrote:
>>>> --- a/xen/common/memory.c
>>>> +++ b/xen/common/memory.c
>>>> @@ -698,10 +698,13 @@ struct get_reserved_device_memory {
>>>>         unsigned int used_entries;
>>>>     };
>>>>
>>>> -static int get_reserved_device_memory(xen_pfn_t start,
>>>> -                                      xen_ulong_t nr, void *ctxt)
>>>> +static int get_reserved_device_memory(xen_pfn_t start, xen_ulong_t nr, u16 seg,
>>>> +                                      u16 *ids, int cnt, void *ctxt)
>>>
>>> While the approach is a lot better than what you did previously, I still
>>> don't like you adding 3 new parameters when one would do (calling
>>> the callback for each SBDF individually): That way you avoid
>>
>> Do you mean I should do something like this?
>>
>> for_each_rmrr_device ( rmrr, bdf, i )
>> {
>>     sbdf = PCI_SBDF(rmrr->segment, rmrr->scope.devices[i]);
>>     rc = func(PFN_DOWN(rmrr->base_address),
>>               PFN_UP(rmrr->end_address) - PFN_DOWN(rmrr->base_address),
>>               sbdf,
>>               ctxt);
>> }
>>
>> But several different SBDFs may share the same RMRR entry, as I said
>> previously, so we would have to add more code in the callback to filter
>> them down to a single reported entry.
>
> Not really - remember that part of what needs to be done is to make
> sure all devices associated with a given RMRR get assigned to the
> same guest? Or the callback function could use a special return value

Yes, but I mean that in the callback,

get_reserved_device_memory()
{
    ...
    for ( each assigned PCI device: pt_sbdf )
        if ( sbdf == pt_sbdf )
            __copy_to_guest_offset(buffer, ...)
}

the buffer may end up containing multiple identical entries if two or 
more assigned devices are associated with one given RMRR entry.

Thanks
Tiejun

> (e.g. 1) to signal that the iteration for the current RMRR can be
> terminated (or further entries skipped).
>
> Jan
>
>
>

^ permalink raw reply	[flat|nested] 180+ messages in thread

* Re: [v7][RFC][PATCH 06/13] hvmloader/ram: check if guest memory is out of reserved device memory maps
  2014-11-17 11:32                                                                                 ` Chen, Tiejun
@ 2014-11-17 11:51                                                                                   ` Jan Beulich
  2014-11-18  3:08                                                                                     ` Chen, Tiejun
  0 siblings, 1 reply; 180+ messages in thread
From: Jan Beulich @ 2014-11-17 11:51 UTC (permalink / raw)
  To: Tiejun Chen; +Cc: yang.z.zhang, kevin.tian, tim, xen-devel

>>> On 17.11.14 at 12:32, <tiejun.chen@intel.com> wrote:
> On 2014/11/17 19:17, Jan Beulich wrote:
>>>>> On 17.11.14 at 12:08, <tiejun.chen@intel.com> wrote:
>>
>>> On 2014/11/17 18:05, Jan Beulich wrote:
>>>>>>> On 17.11.14 at 08:57, <tiejun.chen@intel.com> wrote:
>>>>> --- a/xen/common/memory.c
>>>>> +++ b/xen/common/memory.c
>>>>> @@ -698,10 +698,13 @@ struct get_reserved_device_memory {
>>>>>         unsigned int used_entries;
>>>>>     };
>>>>>
>>>>> -static int get_reserved_device_memory(xen_pfn_t start,
>>>>> -                                      xen_ulong_t nr, void *ctxt)
>>>>> +static int get_reserved_device_memory(xen_pfn_t start, xen_ulong_t nr, u16 seg,
>>>>> +                                      u16 *ids, int cnt, void *ctxt)
>>>>
>>>> While the approach is a lot better than what you did previously, I still
>>>> don't like you adding 3 new parameters when one would do (calling
>>>> the callback for each SBDF individually): That way you avoid
>>>
>>> Do you mean I should do this?
>>>
>>> for_each_rmrr_device ( rmrr, bdf, i )
>>> {
>>> 	 sbdf = PCI_SBDF(seg, rmrr->scope.devices[i]);
>>>            rc = func(PFN_DOWN(rmrr->base_address),
>>>                      PFN_UP(rmrr->end_address) - PFN_DOWN(rmrr->base_address),
>>> 		   sbdf,	
>>>                      ctxt);
>>>
>>> But each different sbdf may occupy one same rmrr entry as I said
>>> previously, so we have to introduce more codes to filter them as one
>>> identified entry in the callback.
>>
>> Not really - remember that part of what needs to be done is to make
>> sure all devices associated with a given RMRR get assigned to the
>> same guest? Or the callback function could use a special return value
> 
> Yes, but I means in the callback,
> 
> get_reserved_device_memory()
> {
> 	...
> 	for(each assigned pci devs:pt_sbdf)
> 		if (sbdf == pt_sbdf)
> 			__copy_to_guest_offset(buffer, ...)
> 
> Buffer may be copied to include multiple same entries if we have two or 
> more assigned devices associated to one give RMRR entry.

Which would be easily avoided by ...

>> (e.g. 1) to signal that the iteration for the current RMRR can be
>> terminated (or further entries skipped).

... the approach I outlined.

Jan

^ permalink raw reply	[flat|nested] 180+ messages in thread

* Re: [v7][RFC][PATCH 06/13] hvmloader/ram: check if guest memory is out of reserved device memory maps
  2014-11-17 11:51                                                                                   ` Jan Beulich
@ 2014-11-18  3:08                                                                                     ` Chen, Tiejun
  2014-11-18  8:01                                                                                       ` Jan Beulich
  0 siblings, 1 reply; 180+ messages in thread
From: Chen, Tiejun @ 2014-11-18  3:08 UTC (permalink / raw)
  To: Jan Beulich; +Cc: yang.z.zhang, kevin.tian, tim, xen-devel

On 2014/11/17 19:51, Jan Beulich wrote:
>>>> On 17.11.14 at 12:32, <tiejun.chen@intel.com> wrote:
>> On 2014/11/17 19:17, Jan Beulich wrote:
>>>>>> On 17.11.14 at 12:08, <tiejun.chen@intel.com> wrote:
>>>
>>>> On 2014/11/17 18:05, Jan Beulich wrote:
>>>>>>>> On 17.11.14 at 08:57, <tiejun.chen@intel.com> wrote:
>>>>>> --- a/xen/common/memory.c
>>>>>> +++ b/xen/common/memory.c
>>>>>> @@ -698,10 +698,13 @@ struct get_reserved_device_memory {
>>>>>>          unsigned int used_entries;
>>>>>>      };
>>>>>>
>>>>>> -static int get_reserved_device_memory(xen_pfn_t start,
>>>>>> -                                      xen_ulong_t nr, void *ctxt)
>>>>>> +static int get_reserved_device_memory(xen_pfn_t start, xen_ulong_t nr, u16 seg,
>>>>>> +                                      u16 *ids, int cnt, void *ctxt)
>>>>>
>>>>> While the approach is a lot better than what you did previously, I still
>>>>> don't like you adding 3 new parameters when one would do (calling
>>>>> the callback for each SBDF individually): That way you avoid
>>>>
>>>> Do you mean I should do this?
>>>>
>>>> for_each_rmrr_device ( rmrr, bdf, i )
>>>> {
>>>> 	 sbdf = PCI_SBDF(seg, rmrr->scope.devices[i]);
>>>>             rc = func(PFN_DOWN(rmrr->base_address),
>>>>                       PFN_UP(rmrr->end_address) - PFN_DOWN(rmrr->base_address),
>>>> 		   sbdf,	
>>>>                       ctxt);
>>>>
>>>> But each different sbdf may occupy one same rmrr entry as I said
>>>> previously, so we have to introduce more codes to filter them as one
>>>> identified entry in the callback.
>>>
>>> Not really - remember that part of what needs to be done is to make
>>> sure all devices associated with a given RMRR get assigned to the
>>> same guest? Or the callback function could use a special return value
>>
>> Yes, but I means in the callback,
>>
>> get_reserved_device_memory()
>> {
>> 	...
>> 	for(each assigned pci devs:pt_sbdf)
>> 		if (sbdf == pt_sbdf)
>> 			__copy_to_guest_offset(buffer, ...)
>>
>> Buffer may be copied to include multiple same entries if we have two or
>> more assigned devices associated to one give RMRR entry.
>
> Which would be easily avoided by ...
>
>>> (e.g. 1) to signal that the iteration for the current RMRR can be
>>> terminated (or further entries skipped).
>
> ... the approach I outlined.
>

Here I tried to implement what you want. Note that I only picked the two
key fragments, since the others are no big deal.

#1:

@@ -898,14 +898,25 @@ int intel_iommu_get_reserved_device_memory(iommu_grdm_t *func, void *ctxt)
  {
      struct acpi_rmrr_unit *rmrr;
      int rc = 0;
+    unsigned int i;
+    u32 id;
+    u16 bdf;

      list_for_each_entry(rmrr, &acpi_rmrr_units, list)
      {
-        rc = func(PFN_DOWN(rmrr->base_address),
-                  PFN_UP(rmrr->end_address) - PFN_DOWN(rmrr->base_address),
-                  ctxt);
-        if ( rc )
-            break;
+        for (i = 0; (bdf = rmrr->scope.devices[i]) &&
+                    i < rmrr->scope.devices_cnt && !rc; i++)
+        {
+            id = PCI_SBDF(rmrr->segment, bdf);
+            rc = func(PFN_DOWN(rmrr->base_address),
+                               PFN_UP(rmrr->end_address) -
+                                PFN_DOWN(rmrr->base_address),
+                               id,
+                               ctxt);
+            if ( rc < 0 )
+                return rc;
+        }
+        rc = 0;
      }

      return rc;


and #2,

@@ -698,10 +698,13 @@ struct get_reserved_device_memory {
      unsigned int used_entries;
  };

-static int get_reserved_device_memory(xen_pfn_t start,
-                                      xen_ulong_t nr, void *ctxt)
+static int get_reserved_device_memory(xen_pfn_t start, xen_ulong_t nr,
+                                      u32 id, void *ctxt)
  {
      struct get_reserved_device_memory *grdm = ctxt;
+    struct domain *d = get_domain_by_id(grdm->map.domid);
+    unsigned int i;
+    u32 sbdf;

      if ( grdm->used_entries < grdm->map.nr_entries )
      {
@@ -709,13 +712,34 @@ static int get_reserved_device_memory(xen_pfn_t start,
              .start_pfn = start, .nr_pages = nr
          };

-        if ( __copy_to_guest_offset(grdm->map.buffer, grdm->used_entries,
-                                    &rdm, 1) )
-            return -EFAULT;
+        if ( d->arch.hvm_domain.pci_force )
+        {
+            if ( __copy_to_guest_offset(grdm->map.buffer, grdm->used_entries,
+                                        &rdm, 1) )
+                return -EFAULT;
+            ++grdm->used_entries;
+            return 1;
+        }
+        else
+        {
+            for ( i = 0; i < d->arch.hvm_domain.num_pcidevs; i++ )
+            {
+                sbdf = PCI_SBDF2(d->arch.hvm_domain.pcidevs[i].seg,
+                                 d->arch.hvm_domain.pcidevs[i].bus,
+                                 d->arch.hvm_domain.pcidevs[i].devfn);
+                if ( sbdf == id )
+                {
+                    if ( __copy_to_guest_offset(grdm->map.buffer,
+                                                grdm->used_entries,
+                                                &rdm, 1) )
+                        return -EFAULT;
+                    ++grdm->used_entries;
+                    return 1;
+                }
+            }
+        }
      }

-    ++grdm->used_entries;
-
      return 0;
  }
  #endif


Thanks
Tiejun
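
For illustration, the callback side of the idea above could be sketched
as follows. This is a minimal, self-contained sketch and not the patch
itself: every sketch_* name is a hypothetical stand-in for the hypervisor
structures, report_all plays the role of pci_force, and returning -1 on a
full buffer is a simplification of what the real hypercall does. The only
point is the return convention: 1 once the region has been recorded, 0
when the device does not match.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one reported reserved region. */
struct sketch_rdm_entry {
    uint64_t start_pfn;
    uint64_t nr_pages;
};

/* Hypothetical stand-in for the callback context. */
struct sketch_grdm_ctxt {
    const uint32_t *assigned;     /* SBDFs assigned to the guest */
    size_t nr_assigned;
    int report_all;               /* plays the role of pci_force */
    struct sketch_rdm_entry *buf; /* caller-supplied output array */
    size_t nr_entries;            /* capacity of buf */
    size_t used_entries;          /* entries recorded so far */
};

/*
 * Returns 1 if the region was recorded (the caller may then skip the
 * remaining devices of this RMRR), 0 if the device does not match,
 * and a negative value on error.
 */
static int sketch_grdm_cb(uint64_t start, uint64_t nr, uint32_t id,
                          void *ctxt)
{
    struct sketch_grdm_ctxt *c = ctxt;
    int match = c->report_all;
    size_t i;

    /* Look for the device among those assigned to the guest. */
    for ( i = 0; !match && i < c->nr_assigned; i++ )
        match = (c->assigned[i] == id);

    if ( !match )
        return 0;

    /* Simplification: treat a full buffer as an error. */
    if ( c->used_entries >= c->nr_entries )
        return -1;

    c->buf[c->used_entries].start_pfn = start;
    c->buf[c->used_entries].nr_pages = nr;
    c->used_entries++;

    return 1;
}
```

Because the callback returns 1 as soon as one matching device has been
seen, a caller that skips the rest of the RMRR on a positive return value
never records the same region twice.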

^ permalink raw reply	[flat|nested] 180+ messages in thread

* Re: [v7][RFC][PATCH 06/13] hvmloader/ram: check if guest memory is out of reserved device memory maps
  2014-11-18  3:08                                                                                     ` Chen, Tiejun
@ 2014-11-18  8:01                                                                                       ` Jan Beulich
  2014-11-18  8:16                                                                                         ` Chen, Tiejun
  0 siblings, 1 reply; 180+ messages in thread
From: Jan Beulich @ 2014-11-18  8:01 UTC (permalink / raw)
  To: Tiejun Chen; +Cc: yang.z.zhang, kevin.tian, tim, xen-devel

>>> On 18.11.14 at 04:08, <tiejun.chen@intel.com> wrote:
> Here I tried to implement what you want. Note just pick two key 
> fragments since others have no big deal.
> 
> #1:
> 
> @@ -898,14 +898,25 @@ int 
> intel_iommu_get_reserved_device_memory(iommu_grdm_t *func, void *ctxt)
>   {
>       struct acpi_rmrr_unit *rmrr;
>       int rc = 0;
> +    unsigned int i;
> +    u32 id;
> +    u16 bdf;
> 
>       list_for_each_entry(rmrr, &acpi_rmrr_units, list)
>       {
> -        rc = func(PFN_DOWN(rmrr->base_address),
> -                  PFN_UP(rmrr->end_address) - PFN_DOWN(rmrr->base_address),
> -                  ctxt);
> -        if ( rc )
> -            break;
> +        for (i = 0; (bdf = rmrr->scope.devices[i]) &&
> +                    i < rmrr->scope.devices_cnt && !rc; i++)
> +        {
> +            id = PCI_SBDF(rmrr->segment, bdf);
> +            rc = func(PFN_DOWN(rmrr->base_address),
> +                               PFN_UP(rmrr->end_address) -
> +                                PFN_DOWN(rmrr->base_address),
> +                               id,
> +                               ctxt);
> +            if ( rc < 0 )
> +                return rc;
> +        }
> +        rc = 0;

Getting close - the main issue is that (as previously mentioned) you
should avoid open-coding for_each_rmrr_device(). It also doesn't
look like you really need the local variable 'id'.

Jan

^ permalink raw reply	[flat|nested] 180+ messages in thread

* Re: [v7][RFC][PATCH 06/13] hvmloader/ram: check if guest memory is out of reserved device memory maps
  2014-11-18  8:01                                                                                       ` Jan Beulich
@ 2014-11-18  8:16                                                                                         ` Chen, Tiejun
  2014-11-18  9:33                                                                                           ` Jan Beulich
  0 siblings, 1 reply; 180+ messages in thread
From: Chen, Tiejun @ 2014-11-18  8:16 UTC (permalink / raw)
  To: Jan Beulich; +Cc: yang.z.zhang, kevin.tian, tim, xen-devel

On 2014/11/18 16:01, Jan Beulich wrote:
>>>> On 18.11.14 at 04:08, <tiejun.chen@intel.com> wrote:
>> Here I tried to implement what you want. Note just pick two key
>> fragments since others have no big deal.
>>
>> #1:
>>
>> @@ -898,14 +898,25 @@ int
>> intel_iommu_get_reserved_device_memory(iommu_grdm_t *func, void *ctxt)
>>    {
>>        struct acpi_rmrr_unit *rmrr;
>>        int rc = 0;
>> +    unsigned int i;
>> +    u32 id;
>> +    u16 bdf;
>>
>>        list_for_each_entry(rmrr, &acpi_rmrr_units, list)
>>        {
>> -        rc = func(PFN_DOWN(rmrr->base_address),
>> -                  PFN_UP(rmrr->end_address) - PFN_DOWN(rmrr->base_address),
>> -                  ctxt);
>> -        if ( rc )
>> -            break;
>> +        for (i = 0; (bdf = rmrr->scope.devices[i]) &&
>> +                    i < rmrr->scope.devices_cnt && !rc; i++)
>> +        {
>> +            id = PCI_SBDF(rmrr->segment, bdf);
>> +            rc = func(PFN_DOWN(rmrr->base_address),
>> +                               PFN_UP(rmrr->end_address) -
>> +                                PFN_DOWN(rmrr->base_address),
>> +                               id,
>> +                               ctxt);
>> +            if ( rc < 0 )
>> +                return rc;
>> +        }
>> +        rc = 0;
>
> Getting close - the main issue is that (as previously mentioned) you
> should avoid open-coding for_each_rmrr_device(). It also doesn't

Sorry, do you mean these lines?

 >> +        for (i = 0; (bdf = rmrr->scope.devices[i]) &&
 >> +                    i < rmrr->scope.devices_cnt && !rc; i++)

So without looking up devices[i], how can we call func() for each SBDF as
you mentioned?

> look like you really need the local variable 'id'.

Okay, I can pass PCI_SBDF(rmrr->segment, bdf) directly.

Thanks
Tiejun

^ permalink raw reply	[flat|nested] 180+ messages in thread

* Re: [v7][RFC][PATCH 06/13] hvmloader/ram: check if guest memory is out of reserved device memory maps
  2014-11-18  8:16                                                                                         ` Chen, Tiejun
@ 2014-11-18  9:33                                                                                           ` Jan Beulich
  2014-11-19  1:26                                                                                             ` Chen, Tiejun
  0 siblings, 1 reply; 180+ messages in thread
From: Jan Beulich @ 2014-11-18  9:33 UTC (permalink / raw)
  To: Tiejun Chen; +Cc: yang.z.zhang, kevin.tian, tim, xen-devel

>>> On 18.11.14 at 09:16, <tiejun.chen@intel.com> wrote:
> On 2014/11/18 16:01, Jan Beulich wrote:
>>>>> On 18.11.14 at 04:08, <tiejun.chen@intel.com> wrote:
>>> Here I tried to implement what you want. Note just pick two key
>>> fragments since others have no big deal.
>>>
>>> #1:
>>>
>>> @@ -898,14 +898,25 @@ int
>>> intel_iommu_get_reserved_device_memory(iommu_grdm_t *func, void *ctxt)
>>>    {
>>>        struct acpi_rmrr_unit *rmrr;
>>>        int rc = 0;
>>> +    unsigned int i;
>>> +    u32 id;
>>> +    u16 bdf;
>>>
>>>        list_for_each_entry(rmrr, &acpi_rmrr_units, list)
>>>        {
>>> -        rc = func(PFN_DOWN(rmrr->base_address),
>>> -                  PFN_UP(rmrr->end_address) - PFN_DOWN(rmrr->base_address),
>>> -                  ctxt);
>>> -        if ( rc )
>>> -            break;
>>> +        for (i = 0; (bdf = rmrr->scope.devices[i]) &&
>>> +                    i < rmrr->scope.devices_cnt && !rc; i++)
>>> +        {
>>> +            id = PCI_SBDF(rmrr->segment, bdf);
>>> +            rc = func(PFN_DOWN(rmrr->base_address),
>>> +                               PFN_UP(rmrr->end_address) -
>>> +                                PFN_DOWN(rmrr->base_address),
>>> +                               id,
>>> +                               ctxt);
>>> +            if ( rc < 0 )
>>> +                return rc;
>>> +        }
>>> +        rc = 0;
>>
>> Getting close - the main issue is that (as previously mentioned) you
>> should avoid open-coding for_each_rmrr_device(). It also doesn't
> 
> Sorry, are you saying these lines?
> 
>  >> +        for (i = 0; (bdf = rmrr->scope.devices[i]) &&
>  >> +                    i < rmrr->scope.devices_cnt && !rc; i++)
> 
> So without lookuping devices[i], how can we call func() for each sbdf as 
> you mentioned?

You've got both rmrr and bdf in the body of for_each_rmrr_device().
After all - as I said - you just open-coded it.

Jan

^ permalink raw reply	[flat|nested] 180+ messages in thread

* Re: [v7][RFC][PATCH 06/13] hvmloader/ram: check if guest memory is out of reserved device memory maps
  2014-11-18  9:33                                                                                           ` Jan Beulich
@ 2014-11-19  1:26                                                                                             ` Chen, Tiejun
  2014-11-20  7:31                                                                                               ` Jan Beulich
  0 siblings, 1 reply; 180+ messages in thread
From: Chen, Tiejun @ 2014-11-19  1:26 UTC (permalink / raw)
  To: Jan Beulich; +Cc: yang.z.zhang, kevin.tian, tim, xen-devel

>> So without lookuping devices[i], how can we call func() for each sbdf as
>> you mentioned?
>
> You've got both rmrr and bdf in the body of for_each_rmrr_device().
> After all - as I said - you just open-coded it.
>

Yeah, so I changed this again:

int intel_iommu_get_reserved_device_memory(iommu_grdm_t *func, void *ctxt)
{
     struct acpi_rmrr_unit *rmrr;
     int rc = 0;
     unsigned int i;
     u16 bdf;

     for_each_rmrr_device ( rmrr, bdf, i )
     {
         rc = func(PFN_DOWN(rmrr->base_address),
                   PFN_UP(rmrr->end_address) -
                   PFN_DOWN(rmrr->base_address),
                   PCI_SBDF(rmrr->segment, bdf),
                   ctxt);
         /* The callback recorded this RMRR, so skip its remaining devices. */
         if ( rc == 1 )
             i = rmrr->scope.devices_cnt;
         else if ( rc < 0 )
             return rc;
     }

     return rc;
}

Thanks
Tiejun
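
As a worked example of the two values passed to func() above, here is a
small sketch of the page-count arithmetic and of packing a segment and a
BDF into one 32-bit id. It rests on stated assumptions: 4 KiB pages, an
inclusive end address that is the last byte of a page, and an id layout
with the segment in the upper 16 bits. The sketch_* helpers are
hypothetical and only mirror what PFN_DOWN(), PFN_UP() and the SBDF
macro are used for here.

```c
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PAGE_SIZE  (1ULL << SKETCH_PAGE_SHIFT)

/* Page frame number containing addr (rounds down). */
static inline uint64_t sketch_pfn_down(uint64_t addr)
{
    return addr >> SKETCH_PAGE_SHIFT;
}

/* First page frame number at or above addr (rounds up). */
static inline uint64_t sketch_pfn_up(uint64_t addr)
{
    return (addr + SKETCH_PAGE_SIZE - 1) >> SKETCH_PAGE_SHIFT;
}

/*
 * Number of pages covered by [base, end], computed the same way as
 * PFN_UP(end) - PFN_DOWN(base) in the function above. Assumes end is
 * the last byte of a page (for example 0xe0fff).
 */
static inline uint64_t sketch_nr_pages(uint64_t base, uint64_t end)
{
    return sketch_pfn_up(end) - sketch_pfn_down(base);
}

/* Assumed layout: segment in bits 31..16, bus/dev/fn in bits 15..0. */
static inline uint32_t sketch_sbdf(uint16_t seg, uint16_t bdf)
{
    return ((uint32_t)seg << 16) | bdf;
}
```

With base 0xd0000 and end 0xe0fff this gives page frames 0xd0 through
0xe0, i.e. 17 pages.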

^ permalink raw reply	[flat|nested] 180+ messages in thread

* Re: [v7][RFC][PATCH 06/13] hvmloader/ram: check if guest memory is out of reserved device memory maps
  2014-11-12  9:56                                                                                 ` Jan Beulich
  2014-11-12 10:18                                                                                   ` Chen, Tiejun
@ 2014-11-19  8:17                                                                                   ` Tian, Kevin
  2014-11-20  7:45                                                                                   ` Tian, Kevin
  2 siblings, 0 replies; 180+ messages in thread
From: Tian, Kevin @ 2014-11-19  8:17 UTC (permalink / raw)
  To: Jan Beulich, Chen, Tiejun; +Cc: Zhang, Yang Z, tim, xen-devel

> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Wednesday, November 12, 2014 5:57 PM
> 
> >>> On 12.11.14 at 10:13, <tiejun.chen@intel.com> wrote:
> > On 2014/11/12 17:02, Jan Beulich wrote:
> >>>>> On 12.11.14 at 09:45, <tiejun.chen@intel.com> wrote:
> >>>>> #2 flags field in each specific device of new domctl would control
> >>>>> whether this device need to check/reserve its own RMRR range. But its
> >>>>> not dependent on current device assignment domctl, so the user can use
> >>>>> them to control which devices need to work as hotplug later, separately.
> >>>>
> >>>> And this could be left as a second step, in order for what needs to
> >>>> be done now to not get more complicated that necessary.
> >>>>
> >>>
> >>> Do you mean currently we still rely on the device assignment domctl to
> >>> provide SBDF? So looks nothing should be changed in our policy.
> >>
> >> I can't connect your question to what I said. What I tried to tell you
> >
> > Something is misunderstanding to me.
> >
> >> was that I don't currently see a need to make this overly complicated:
> >> Having the option to punch holes for all devices and (by default)
> >> dealing with just the devices assigned at boot may be sufficient as a
> >> first step. Yet (repeating just to avoid any misunderstanding) that
> >> makes things easier only if we decide to require device assignment to
> >> happen before memory getting populated (since in that case there's
> >
> > Here what do you mean, 'if we decide to require device assignment to
> > happen before memory getting populated'?
> >
> > Because -quote-
> > "
> > In the present the device assignment is always after memory population.
> > And I also mentioned previously I double checked this sequence with printk.
> > "
> >
> > Or you already plan or deciede to change this sequence?
> 
> So it is now the 3rd time that I'm telling you that part of your
> decision making as to which route to follow should be to
> re-consider whether the current sequence of operations shouldn't
> be changed. Please also consult with the VT-d maintainers (hint to
> them: participating in this discussion publicly would be really nice)
> on _all_ decisions to be made here.
> 

No decision has been made privately. We want all of the discussion to
happen publicly, and will get back with our thoughts soon.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 180+ messages in thread

* Re: [v7][RFC][PATCH 06/13] hvmloader/ram: check if guest memory is out of reserved device memory maps
  2014-11-19  1:26                                                                                             ` Chen, Tiejun
@ 2014-11-20  7:31                                                                                               ` Jan Beulich
  2014-11-20  8:12                                                                                                 ` Chen, Tiejun
  0 siblings, 1 reply; 180+ messages in thread
From: Jan Beulich @ 2014-11-20  7:31 UTC (permalink / raw)
  To: Tiejun Chen; +Cc: yang.z.zhang, kevin.tian, tim, xen-devel

>>> On 19.11.14 at 02:26, <tiejun.chen@intel.com> wrote:
>>> So without lookuping devices[i], how can we call func() for each sbdf as
>>> you mentioned?
>>
>> You've got both rmrr and bdf in the body of for_each_rmrr_device().
>> After all - as I said - you just open-coded it.
>>
> 
> Yeah, so change this again,
> 
> int intel_iommu_get_reserved_device_memory(iommu_grdm_t *func, void *ctxt)
> {
>      struct acpi_rmrr_unit *rmrr;
>      int rc = 0;
>      unsigned int i;
>      u16 bdf;
> 
>      for_each_rmrr_device ( rmrr, bdf, i )
>      {
>          rc = func(PFN_DOWN(rmrr->base_address),
>                             PFN_UP(rmrr->end_address) -
>                              PFN_DOWN(rmrr->base_address),
>                             PCI_SBDF(rmrr->segment, bdf),
>                            ctxt);
>          /* Hit this entry so just go next. */
>          if ( rc == 1 )
>              i = rmrr->scope.devices_cnt;
>          else if ( rc < 0 )
>              return rc;
>      }
> 
>      return rc;
> }

Better. Another improvement would be to make it not depend on the
internal workings of for_each_rmrr_device()... And in any case you
should not special case 1 - just return when rc is negative and skip
the rest of the current RMRR when it's positive. And of course make
the function's final return value predictable.

Jan
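
One possible shape for such a loop is sketched below, purely as an
illustration under stated assumptions. The iterator macro, the array of
RMRR units and all sketch_* names are hypothetical stand-ins (the real
code walks a list with for_each_rmrr_device()), and the 16-bit shift used
to build the id is an assumed SBDF layout. Instead of modifying the loop
index, the body remembers which RMRR the callback has already accepted
and skips further devices of that same RMRR.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for an RMRR unit and its device scope. */
struct sketch_rmrr {
    uint64_t base_pfn, nr_pages;
    uint16_t segment;
    const uint16_t *devices;      /* BDFs in this RMRR's device scope */
    unsigned int devices_cnt;
};

typedef int sketch_grdm_fn(uint64_t start, uint64_t nr, uint32_t id,
                           void *ctxt);

/* Hypothetical iterator visiting every (rmrr, bdf) pair of a table. */
#define sketch_for_each_rmrr_device(tbl, cnt, rmrr, bdf, r, i)        \
    for ( (r) = 0; (r) < (cnt) && ((rmrr) = &(tbl)[r], 1); (r)++ )    \
        for ( (i) = 0; (i) < (rmrr)->devices_cnt &&                   \
                       ((bdf) = (rmrr)->devices[i], 1); (i)++ )

static int sketch_walk(const struct sketch_rmrr *tbl, size_t cnt,
                       sketch_grdm_fn *func, void *ctxt)
{
    const struct sketch_rmrr *rmrr, *done = NULL;
    size_t r;
    unsigned int i;
    uint16_t bdf;

    sketch_for_each_rmrr_device ( tbl, cnt, rmrr, bdf, r, i )
    {
        int rc;

        if ( rmrr == done )
            continue;           /* this RMRR was already handled */

        rc = func(rmrr->base_pfn, rmrr->nr_pages,
                  ((uint32_t)rmrr->segment << 16) | bdf, ctxt);
        if ( rc < 0 )
            return rc;          /* error: stop the whole walk */
        if ( rc > 0 )
            done = rmrr;        /* skip the rest of this RMRR */
    }

    return 0;                   /* predictable result on success */
}

/* Example callback: counts calls and accepts one wanted id. */
struct sketch_probe {
    uint32_t want;
    unsigned int calls, hits;
};

static int sketch_probe_cb(uint64_t start, uint64_t nr, uint32_t id,
                           void *ctxt)
{
    struct sketch_probe *p = ctxt;

    (void)start;
    (void)nr;
    p->calls++;
    if ( id != p->want )
        return 0;
    p->hits++;
    return 1;
}
```

Any positive return value is treated the same way, negative values abort
the walk, and the function returns 0 otherwise, so the body does not need
to know how the iterator advances.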

^ permalink raw reply	[flat|nested] 180+ messages in thread

* Re: [v7][RFC][PATCH 06/13] hvmloader/ram: check if guest memory is out of reserved device memory maps
  2014-11-12  9:56                                                                                 ` Jan Beulich
  2014-11-12 10:18                                                                                   ` Chen, Tiejun
  2014-11-19  8:17                                                                                   ` Tian, Kevin
@ 2014-11-20  7:45                                                                                   ` Tian, Kevin
  2014-11-20  8:04                                                                                     ` Jan Beulich
  2 siblings, 1 reply; 180+ messages in thread
From: Tian, Kevin @ 2014-11-20  7:45 UTC (permalink / raw)
  To: Jan Beulich, Chen, Tiejun; +Cc: Zhang, Yang Z, tim, xen-devel

> From: Tian, Kevin
> Sent: Wednesday, November 19, 2014 4:18 PM
> 
> > From: Jan Beulich [mailto:JBeulich@suse.com]
> > Sent: Wednesday, November 12, 2014 5:57 PM
> >
> > >>> On 12.11.14 at 10:13, <tiejun.chen@intel.com> wrote:
> > > On 2014/11/12 17:02, Jan Beulich wrote:
> > >>>>> On 12.11.14 at 09:45, <tiejun.chen@intel.com> wrote:
> > >>>>> #2 flags field in each specific device of new domctl would control
> > >>>>> whether this device need to check/reserve its own RMRR range. But its
> > >>>>> not dependent on current device assignment domctl, so the user can use
> > >>>>> them to control which devices need to work as hotplug later, separately.
> > >>>>
> > >>>> And this could be left as a second step, in order for what needs to
> > >>>> be done now to not get more complicated that necessary.
> > >>>>
> > >>>
> > >>> Do you mean currently we still rely on the device assignment domctl to
> > >>> provide SBDF? So looks nothing should be changed in our policy.
> > >>
> > >> I can't connect your question to what I said. What I tried to tell you
> > >
> > > Something is misunderstanding to me.
> > >
> > >> was that I don't currently see a need to make this overly complicated:
> > >> Having the option to punch holes for all devices and (by default)
> > >> dealing with just the devices assigned at boot may be sufficient as a
> > >> first step. Yet (repeating just to avoid any misunderstanding) that
> > >> makes things easier only if we decide to require device assignment to
> > >> happen before memory getting populated (since in that case there's
> > >
> > > Here what do you mean, 'if we decide to require device assignment to
> > > happen before memory getting populated'?
> > >
> > > Because -quote-
> > > "
> > > In the present the device assignment is always after memory population.
> > > And I also mentioned previously I double checked this sequence with printk.
> > > "
> > >
> > > Or you already plan or deciede to change this sequence?
> >
> > So it is now the 3rd time that I'm telling you that part of your
> > decision making as to which route to follow should be to
> > re-consider whether the current sequence of operations shouldn't
> > be changed. Please also consult with the VT-d maintainers (hint to
> > them: participating in this discussion publicly would be really nice)
> > on _all_ decisions to be made here.
> >
> 

Yang and I discussed this. We understand your point about avoiding a
new interface if we can leverage existing code. However, moving device
assignment before p2m population is not a trivial effort, and doing so
has no benefit other than this one. So we would not suggest going that
way.

The current option sounds reasonable, i.e. passing a list of the BDFs
assigned to this VM before populating the p2m, and then having the
hypervisor filter out the reserved regions associated with those BDFs.
This way libxc teaches Xen to create the reserved regions once, and the
filtered info is returned later upon query.

The limitation of memory wasted due to conflicts can be mitigated: a
further enhancement can be made later in libxc so that, when populating
the p2m, the reserved regions are skipped explicitly at the initial p2m
creation phase, and then there would be no waste at all. But this
optimization takes some time and can be built incrementally on the
current patch and interface, after the 4.5 release. For now let's focus
on correctness first.

If you agree, Tiejun will move forward and send another series for 4.5.
So far many open issues have been closed with your help, but it also
means the original v7 needs a serious update (the latest code is deep in
this discussion thread).

Thanks
Kevin
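
To make the "skip the reserved regions at initial p2m creation" idea
concrete, here is a rough sketch of the range arithmetic only. It is not
libxc code: the sketch_* names are hypothetical, the reserved regions are
assumed to be sorted by start_pfn and non-overlapping, and the sketch
simply computes which guest page frame ranges below total would be
populated. Relocating the skipped amount elsewhere, so the guest keeps
its full memory size, is left out.

```c
#include <stddef.h>
#include <stdint.h>

struct sketch_range {
    uint64_t start_pfn, nr_pages;
};

/*
 * Split the guest frames [0, total) into chunks that avoid the reserved
 * regions in rsv (sorted by start_pfn, non-overlapping). Chunks are
 * written to out. Returns the number of chunks, or -1 if out is too
 * small.
 */
static int sketch_ram_chunks(uint64_t total,
                             const struct sketch_range *rsv, size_t nr_rsv,
                             struct sketch_range *out, size_t max_out)
{
    uint64_t cur = 0;
    size_t i, n = 0;

    for ( i = 0; i <= nr_rsv; i++ )
    {
        /* The current gap ends at the next hole or at the end of RAM. */
        uint64_t stop = (i < nr_rsv && rsv[i].start_pfn < total)
                        ? rsv[i].start_pfn : total;

        if ( stop > cur )
        {
            if ( n == max_out )
                return -1;
            out[n].start_pfn = cur;
            out[n].nr_pages = stop - cur;
            n++;
        }

        if ( i == nr_rsv || stop == total )
            break;

        /* Resume right after the reserved region. */
        cur = rsv[i].start_pfn + rsv[i].nr_pages;
    }

    return (int)n;
}
```

For a guest of 0x100 frames and one reserved region of 0x11 frames at
0xd0, this yields two chunks: frames 0x0 to 0xcf and frames 0xe1 to 0xff.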

^ permalink raw reply	[flat|nested] 180+ messages in thread

* Re: [v7][RFC][PATCH 06/13] hvmloader/ram: check if guest memory is out of reserved device memory maps
  2014-11-20  7:45                                                                                   ` Tian, Kevin
@ 2014-11-20  8:04                                                                                     ` Jan Beulich
  2014-11-20  8:51                                                                                       ` Tian, Kevin
  2014-11-20 14:40                                                                                       ` Tian, Kevin
  0 siblings, 2 replies; 180+ messages in thread
From: Jan Beulich @ 2014-11-20  8:04 UTC (permalink / raw)
  To: Kevin Tian; +Cc: Yang Z Zhang, Tiejun Chen, tim, xen-devel

>>> On 20.11.14 at 08:45, <kevin.tian@intel.com> wrote:
> Yang and I did some discussion here. We understand your point to
> avoid introducing new interface if we can leverage existing code.
> However it's not a trivial effort to move device assignment before 
> populating p2m, and there is no other benefit of doing so except
> for this purpose. So we'd not suggest this way.

"It's not a trivial effort" is pretty vague: What specifically is it that
makes this difficult? I wouldn't expect there to be any strong
dependencies on the ordering of these two operations...

> Current option sounds a reasonable one, i.e. passing a list of BDFs
> assigned to this VM before populating p2m, and then having 
> hypervisor to filter out reserved regions associated with those 
> BDFs. This way libxc teaches Xen to create reserved regions once,
> and then later the filtered info is returned upon query.

Reasonable, but partly redundant. The positive aspect is that it
permits this list and the list of devices actually being assigned to
be different, i.e. allowing holes to be set up for devices that only
_may_ get assigned at some point.

> The limitation of wasted memory due to confliction can be
> mitigated, and we considered further enhancement can be made
> later in libxc that when populating p2m, the reserved regions
> can be skipped explicitly at initial p2m creation phase and then 
> there would be no waste at all. But this optimization takes some
> time and can be built incrementally on current patch and interface, 
> post 4.5 release. For now let's focus on the very correctness first.

I agree, as long as the optimization part doesn't get dropped after
the correctness part went in.

Jan

^ permalink raw reply	[flat|nested] 180+ messages in thread

* Re: [v7][RFC][PATCH 06/13] hvmloader/ram: check if guest memory is out of reserved device memory maps
  2014-11-20  7:31                                                                                               ` Jan Beulich
@ 2014-11-20  8:12                                                                                                 ` Chen, Tiejun
  2014-11-20  8:59                                                                                                   ` Jan Beulich
  0 siblings, 1 reply; 180+ messages in thread
From: Chen, Tiejun @ 2014-11-20  8:12 UTC (permalink / raw)
  To: Jan Beulich; +Cc: yang.z.zhang, kevin.tian, tim, xen-devel

On 2014/11/20 15:31, Jan Beulich wrote:
>>>> On 19.11.14 at 02:26, <tiejun.chen@intel.com> wrote:
>>>> So without lookuping devices[i], how can we call func() for each sbdf as
>>>> you mentioned?
>>>
>>> You've got both rmrr and bdf in the body of for_each_rmrr_device().
>>> After all - as I said - you just open-coded it.
>>>
>>
>> Yeah, so change this again,
>>
>> int intel_iommu_get_reserved_device_memory(iommu_grdm_t *func, void *ctxt)
>> {
>>       struct acpi_rmrr_unit *rmrr;
>>       int rc = 0;
>>       unsigned int i;
>>       u16 bdf;
>>
>>       for_each_rmrr_device ( rmrr, bdf, i )
>>       {
>>           rc = func(PFN_DOWN(rmrr->base_address),
>>                              PFN_UP(rmrr->end_address) -
>>                               PFN_DOWN(rmrr->base_address),
>>                              PCI_SBDF(rmrr->segment, bdf),
>>                             ctxt);
>>           /* Hit this entry so just go next. */
>>           if ( rc == 1 )
>>               i = rmrr->scope.devices_cnt;
>>           else if ( rc < 0 )
>>               return rc;
>>       }
>>
>>       return rc;
>> }
>
> Better. Another improvement would be make it not depend on the
> internal workings of for_each_rmrr_device()... And in any case you

Yes but as you see,

#define for_each_rmrr_device(rmrr, bdf, idx)            \
     list_for_each_entry(rmrr, &acpi_rmrr_units, list)   \
         /* assume there never is a bdf == 0 */          \
         for (idx = 0; (bdf = rmrr->scope.devices[idx]) && \
                  idx < rmrr->scope.devices_cnt; idx++)

So,
     for_each_rmrr_device ( rmrr, bdf, i )
     {
         rc = func(...);
         /* Hit this entry so just go next. */
         if ( rc > 0 )
             i = rmrr->scope.devices_cnt;

If you don't want to touch any of these internal workings, it's hard to
move on to the next RMRR entry. Maybe you already have an idea, so could
you give me some hints?

> should not special case 1 - just return when rc is negative and skip
> the rest of the current RMRR when it's positive. And of course make
> the function's final return value predictable.
>

Like this,

         /* Hit this entry so just go next. */
         if ( rc > 0 )
             xxxx;
         else if ( rc < 0 )
             return rc;
     }

     return 0;

Thanks
Tiejun

^ permalink raw reply	[flat|nested] 180+ messages in thread

* Re: [v7][RFC][PATCH 06/13] hvmloader/ram: check if guest memory is out of reserved device memory maps
  2014-11-20  8:04                                                                                     ` Jan Beulich
@ 2014-11-20  8:51                                                                                       ` Tian, Kevin
  2014-11-20 14:40                                                                                       ` Tian, Kevin
  1 sibling, 0 replies; 180+ messages in thread
From: Tian, Kevin @ 2014-11-20  8:51 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Zhang, Yang Z, Chen, Tiejun, tim, xen-devel

> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Thursday, November 20, 2014 4:04 PM
> 
> >>> On 20.11.14 at 08:45, <kevin.tian@intel.com> wrote:
> > Yang and I did some discussion here. We understand your point to
> > avoid introducing new interface if we can leverage existing code.
> > However it's not a trivial effort to move device assignment before
> > populating p2m, and there is no other benefit of doing so except
> > for this purpose. So we'd not suggest this way.
> 
> "It's not a trivial effort" is pretty vague: What specifically is it that
> makes this difficult? I wouldn't expect there to be any strong
> dependencies on the ordering of these two operations...

I'll leave this part to Yang, who did a detailed investigation, e.g. of
IOMMU page table setup. But what really matters here is not only
complexity but also flexibility. Moving device assignment earlier would
tie the policy to statically assigned devices only, which removes the
option of supporting hotpluggable devices.

> 
> > Current option sounds a reasonable one, i.e. passing a list of BDFs
> > assigned to this VM before populating p2m, and then having
> > hypervisor to filter out reserved regions associated with those
> > BDFs. This way libxc teaches Xen to create reserved regions once,
> > and then later the filtered info is returned upon query.
> 
> Reasonable, but partly redundant. The positive aspect being that
> it permits this list and the list of actually being assigned devices to
> be different, i.e. allowing holes to be set up for devices that only
> _may_ get assigned at some point.

It is redundant only if the list exactly matches the statically
assigned devices, and that is just the current situation.

It is reasonable once other policies can affect the list (e.g. if
hotpluggable devices get a config option, the potential list becomes
larger than the statically assigned devices). From this angle I think
the new interface actually makes sense for this very purpose.

> 
> > The limitation of wasted memory due to confliction can be
> > mitigated, and we considered further enhancement can be made
> > later in libxc that when populating p2m, the reserved regions
> > can be skipped explicitly at initial p2m creation phase and then
> > there would be no waste at all. But this optimization takes some
> > time and can be built incrementally on current patch and interface,
> > post 4.5 release. For now let's focus on the very correctness first.
> 
> I agree, as long as the optimization part doesn't get dropped after
> the correctness part went in.

Definitely. After the effort Tiejun has put in over the last months,
you can see our willingness to make it correct and to keep improving
it.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 180+ messages in thread

* Re: [v7][RFC][PATCH 06/13] hvmloader/ram: check if guest memory is out of reserved device memory maps
  2014-11-20  8:12                                                                                                 ` Chen, Tiejun
@ 2014-11-20  8:59                                                                                                   ` Jan Beulich
  2014-11-20 10:28                                                                                                     ` Chen, Tiejun
  0 siblings, 1 reply; 180+ messages in thread
From: Jan Beulich @ 2014-11-20  8:59 UTC (permalink / raw)
  To: Tiejun Chen; +Cc: yang.z.zhang, kevin.tian, tim, xen-devel

>>> On 20.11.14 at 09:12, <tiejun.chen@intel.com> wrote:
> On 2014/11/20 15:31, Jan Beulich wrote:
>>>>> On 19.11.14 at 02:26, <tiejun.chen@intel.com> wrote:
>>> int intel_iommu_get_reserved_device_memory(iommu_grdm_t *func, void *ctxt)
>>> {
>>>       struct acpi_rmrr_unit *rmrr;
>>>       int rc = 0;
>>>       unsigned int i;
>>>       u16 bdf;
>>>
>>>       for_each_rmrr_device ( rmrr, bdf, i )
>>>       {
>>>           rc = func(PFN_DOWN(rmrr->base_address),
>>>                              PFN_UP(rmrr->end_address) -
>>>                               PFN_DOWN(rmrr->base_address),
>>>                              PCI_SBDF(rmrr->segment, bdf),
>>>                             ctxt);
>>>           /* Hit this entry so just go next. */
>>>           if ( rc == 1 )
>>>               i = rmrr->scope.devices_cnt;
>>>           else if ( rc < 0 )
>>>               return rc;
>>>       }
>>>
>>>       return rc;
>>> }
>>
>> Better. Another improvement would be make it not depend on the
>> internal workings of for_each_rmrr_device()... And in any case you
> 
> Yes but as you see,
> 
> #define for_each_rmrr_device(rmrr, bdf, idx)            \
>      list_for_each_entry(rmrr, &acpi_rmrr_units, list)   \
>          /* assume there never is a bdf == 0 */          \
>          for (idx = 0; (bdf = rmrr->scope.devices[idx]) && \
>                   idx < rmrr->scope.devices_cnt; idx++)
> 
> So,
>      for_each_rmrr_device ( rmrr, bdf, i )
>      {
>          rc = func(...);
>          /* Hit this entry so just go next. */
>          if ( rc > 0 )
>              i = rmrr->scope.devices_cnt;
> 
> If you don't intend to reset something of this internal working, its 
> hard to go next rmrr entry. Maybe you already have idea, so could you 
> give me some hints?

Have a second struct acpi_rmrr_unit pointer, starting out as NULL
and getting set to the current one when the callback returns a
positive value. Skip further iterations as long as both pointers
match.

>> should not special case 1 - just return when rc is negative and skip
>> the rest of the current RMRR when it's positive. And of course make
>> the function's final return value predictable.
>>
> 
> Like this,
> 
>          /* Hit this entry so just go next. */
>          if ( rc > 0 )
>              xxxx;
>          else if ( rc < 0 )
>              return rc;
>      }

Yes, albeit swapping the order (and dropping the "else" along with
adding unlikely() to the error case) would be preferred.

Jan

^ permalink raw reply	[flat|nested] 180+ messages in thread

* Re: [v7][RFC][PATCH 06/13] hvmloader/ram: check if guest memory is out of reserved device memory maps
  2014-11-20  8:59                                                                                                   ` Jan Beulich
@ 2014-11-20 10:28                                                                                                     ` Chen, Tiejun
  0 siblings, 0 replies; 180+ messages in thread
From: Chen, Tiejun @ 2014-11-20 10:28 UTC (permalink / raw)
  To: Jan Beulich; +Cc: yang.z.zhang, kevin.tian, tim, xen-devel

> Have a second struct acpi_rmrr_unit pointer, starting out as NULL
> and getting set to the current one when the callback returns a
> positive value. Skip further iterations as long as both pointers
> match.

Great!

int intel_iommu_get_reserved_device_memory(iommu_grdm_t *func, void *ctxt)
{
     struct acpi_rmrr_unit *rmrr, *rmrr_cur = NULL;
     int rc = 0;
     unsigned int i;
     u16 bdf;

     for_each_rmrr_device ( rmrr, bdf, i )
     {
         if ( rmrr != rmrr_cur )
         {
             rc = func(PFN_DOWN(rmrr->base_address),
                       PFN_UP(rmrr->end_address) -
                         PFN_DOWN(rmrr->base_address),
                       PCI_SBDF(rmrr->segment, bdf),
                       ctxt);

             if ( unlikely(rc < 0) )
                 return rc;
             /* Hit this entry so just go next. */
             if ( rc > 0 )
                 rmrr_cur = rmrr;
         }
     }

     return 0;
}
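
For anyone who wants to try the skip logic outside of Xen, below is a
minimal stand-alone sketch of the same idea. It is not the patch itself:
struct mock_rmrr, mock_grdm_t, mock_get_reserved_device_memory(),
struct match_ctxt and match_bdf() are hypothetical stand-ins, and plain
nested loops replace for_each_rmrr_device(). Only the callback contract
(negative = error, zero = keep going, positive = done with this RMRR)
is taken from the discussion above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct acpi_rmrr_unit: only what the sketch needs. */
struct mock_rmrr {
    unsigned long base_pfn;
    unsigned long nr_pages;
    const unsigned short *devices;   /* BDFs in this RMRR's device scope */
    unsigned int devices_cnt;
};

/* Callback contract: < 0 error, 0 continue, > 0 this RMRR has been handled. */
typedef int mock_grdm_t(unsigned long start, unsigned long nr,
                        unsigned short bdf, void *ctxt);

static int mock_get_reserved_device_memory(const struct mock_rmrr *units,
                                           size_t nr_units,
                                           mock_grdm_t *func, void *ctxt)
{
    const struct mock_rmrr *skip = NULL;   /* RMRR already reported, if any */
    size_t u;
    unsigned int i;

    assert(func != NULL);

    for ( u = 0; u < nr_units; u++ )
        for ( i = 0; i < units[u].devices_cnt; i++ )
        {
            int rc;

            /* Remaining devices of an already handled RMRR are skipped. */
            if ( &units[u] == skip )
                continue;

            rc = func(units[u].base_pfn, units[u].nr_pages,
                      units[u].devices[i], ctxt);
            if ( rc < 0 )
                return rc;
            if ( rc > 0 )
                skip = &units[u];
        }

    return 0;
}

/* Example callback (hypothetical): report every RMRR that lists wanted_bdf. */
struct match_ctxt {
    unsigned short wanted_bdf;
    unsigned int calls;   /* how often the callback ran */
    unsigned int hits;    /* how many RMRRs matched */
};

static int match_bdf(unsigned long start, unsigned long nr,
                     unsigned short bdf, void *ctxt)
{
    struct match_ctxt *m = ctxt;

    (void)start;
    (void)nr;
    m->calls++;
    if ( bdf != m->wanted_bdf )
        return 0;
    m->hits++;
    return 1;
}
```

The loop index is never modified from inside the body, so the sketch
does not depend on how the iteration macro is implemented. The cost is
one pointer comparison per remaining device of a handled RMRR.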


Thanks
Tiejun

^ permalink raw reply	[flat|nested] 180+ messages in thread

* Re: [v7][RFC][PATCH 06/13] hvmloader/ram: check if guest memory is out of reserved device memory maps
  2014-11-20  8:04                                                                                     ` Jan Beulich
  2014-11-20  8:51                                                                                       ` Tian, Kevin
@ 2014-11-20 14:40                                                                                       ` Tian, Kevin
  2014-11-20 14:46                                                                                         ` Jan Beulich
  2014-11-20 20:11                                                                                         ` Konrad Rzeszutek Wilk
  1 sibling, 2 replies; 180+ messages in thread
From: Tian, Kevin @ 2014-11-20 14:40 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Zhang, Yang Z, Chen, Tiejun, tim, xen-devel

> From: Tian, Kevin
> Sent: Thursday, November 20, 2014 4:51 PM
> 
> > From: Jan Beulich [mailto:JBeulich@suse.com]
> > Sent: Thursday, November 20, 2014 4:04 PM
> >
> > >>> On 20.11.14 at 08:45, <kevin.tian@intel.com> wrote:
> > > Yang and I did some discussion here. We understand your point to
> > > avoid introducing new interface if we can leverage existing code.
> > > However it's not a trivial effort to move device assignment before
> > > populating p2m, and there is no other benefit of doing so except
> > > for this purpose. So we'd not suggest this way.
> >
> > "It's not a trivial effort" is pretty vague: What specifically is it that
> > makes this difficult? I wouldn't expect there to be any strong
> > dependencies on the ordering of these two operations...
> 
> I'll leave to Yang to answer this part, who did a detail investigation
> on that, e.g. on IOMMU page table setup, etc. But what really matters
> here is not only about complexity, but also flexibility. Doing so will
> tie the policy to assigned device only, which removes the option to
> support hotpluggable device.
> 
> >
> > > Current option sounds a reasonable one, i.e. passing a list of BDFs
> > > assigned to this VM before populating p2m, and then having
> > > hypervisor to filter out reserved regions associated with those
> > > BDFs. This way libxc teaches Xen to create reserved regions once,
> > > and then later the filtered info is returned upon query.
> >
> > Reasonable, but partly redundant. The positive aspect being that
> > it permits this list and the list of actually being assigned devices to
> > be different, i.e. allowing holes to be set up for devices that only
> > _may_ get assigned at some point.
> 
> redundant if we think the list only exactly matching the statically
> assigned devices. but that's just current point.
> 
> reasonable if we think there may be other policies impacting the list
> (e.g. if hotplugable device may have a config option and then we have
> a potential list larger than static assigned devices). From this angle
> I think the new interface actually makes sense for this very purpose.
> 

Jan, are you OK with this? In the previous approach we reserved all the
RMRR regions, so the hotplug scenario was covered automatically. Now
that we want BDF-specific filtering, this new interface is actually
necessary for hotplug support. If OK, Tiejun will send out a new
series to see whether it's ready for 4.5 check-in.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 180+ messages in thread

* Re: [v7][RFC][PATCH 06/13] hvmloader/ram: check if guest memory is out of reserved device memory maps
  2014-11-20 14:40                                                                                       ` Tian, Kevin
@ 2014-11-20 14:46                                                                                         ` Jan Beulich
  2014-11-20 20:11                                                                                         ` Konrad Rzeszutek Wilk
  1 sibling, 0 replies; 180+ messages in thread
From: Jan Beulich @ 2014-11-20 14:46 UTC (permalink / raw)
  To: Kevin Tian; +Cc: Yang Z Zhang, Tiejun Chen, tim, xen-devel

>>> On 20.11.14 at 15:40, <kevin.tian@intel.com> wrote:
>>  From: Tian, Kevin
>> Sent: Thursday, November 20, 2014 4:51 PM
>> > From: Jan Beulich [mailto:JBeulich@suse.com]
>> > Sent: Thursday, November 20, 2014 4:04 PM
>> > >>> On 20.11.14 at 08:45, <kevin.tian@intel.com> wrote:
>> > > Current option sounds a reasonable one, i.e. passing a list of BDFs
>> > > assigned to this VM before populating p2m, and then having
>> > > hypervisor to filter out reserved regions associated with those
>> > > BDFs. This way libxc teaches Xen to create reserved regions once,
>> > > and then later the filtered info is returned upon query.
>> >
>> > Reasonable, but partly redundant. The positive aspect being that
>> > it permits this list and the list of actually being assigned devices to
>> > be different, i.e. allowing holes to be set up for devices that only
>> > _may_ get assigned at some point.
>> 
>> redundant if we think the list only exactly matching the statically
>> assigned devices. but that's just current point.
>> 
>> reasonable if we think there may be other policies impacting the list
>> (e.g. if hotplugable device may have a config option and then we have
>> a potential list larger than static assigned devices). From this angle
>> I think the new interface actually makes sense for this very purpose.
> 
> Jan are you OK with this?

I can live with it, yes.

Jan

^ permalink raw reply	[flat|nested] 180+ messages in thread

* Re: [v7][RFC][PATCH 06/13] hvmloader/ram: check if guest memory is out of reserved device memory maps
  2014-11-20 14:40                                                                                       ` Tian, Kevin
  2014-11-20 14:46                                                                                         ` Jan Beulich
@ 2014-11-20 20:11                                                                                         ` Konrad Rzeszutek Wilk
  2014-11-21  0:32                                                                                           ` Tian, Kevin
  1 sibling, 1 reply; 180+ messages in thread
From: Konrad Rzeszutek Wilk @ 2014-11-20 20:11 UTC (permalink / raw)
  To: Tian, Kevin; +Cc: Zhang, Yang Z, Chen, Tiejun, tim, Jan Beulich, xen-devel

> Jan are you OK with this? In previous approach we reserved all the
> RMRR regions so hotplug scenario is covered automatically. Now since
> we want to do BDF specific filtering, this new interface is actually
> necessary for hotplug support. If OK, Tiejun will send out a new 
> series to see whether it's ready for 4.5 check-in.

Could you also drop the 'RFC' part please? And please do mention in your
cover letter how to reproduce the initial bug (perhaps a pointer to
the hardware and software that this affects?)

Thank you.

^ permalink raw reply	[flat|nested] 180+ messages in thread

* Re: [v7][RFC][PATCH 06/13] hvmloader/ram: check if guest memory is out of reserved device memory maps
  2014-11-20 20:11                                                                                         ` Konrad Rzeszutek Wilk
@ 2014-11-21  0:32                                                                                           ` Tian, Kevin
  0 siblings, 0 replies; 180+ messages in thread
From: Tian, Kevin @ 2014-11-21  0:32 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Zhang, Yang Z, Chen, Tiejun, tim, Jan Beulich, xen-devel

> From: Konrad Rzeszutek Wilk [mailto:konrad.wilk@oracle.com]
> Sent: Friday, November 21, 2014 4:11 AM
> 
> > Jan are you OK with this? In previous approach we reserved all the
> > RMRR regions so hotplug scenario is covered automatically. Now since
> > we want to do BDF specific filtering, this new interface is actually
> > necessary for hotplug support. If OK, Tiejun will send out a new
> > series to see whether it's ready for 4.5 check-in.
> 
> Could you also drop the 'RFC' part please? And please do mention in your
> cover letter how to reproduce the initial bug (perhaps a pointer to
> the hardware and software that this affects?)
> 

Good suggestion. Will do.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 180+ messages in thread

* Re: [v7][RFC][PATCH 04/13] hvmloader/util: get reserved device memory maps
  2014-11-03 10:02                                   ` Jan Beulich
@ 2014-11-21  6:26                                     ` Chen, Tiejun
  2014-11-21  7:43                                       ` Tian, Kevin
  0 siblings, 1 reply; 180+ messages in thread
From: Chen, Tiejun @ 2014-11-21  6:26 UTC (permalink / raw)
  To: Jan Beulich
  Cc: kevin.tian, wei.liu2, ian.campbell, stefano.stabellini, tim,
	ian.jackson, xen-devel, yang.z.zhang

On 2014/11/3 18:02, Jan Beulich wrote:
>>>> On 03.11.14 at 10:55, <tiejun.chen@intel.com> wrote:
>> On 2014/11/3 17:45, Jan Beulich wrote:
>>>>>> On 03.11.14 at 10:32, <tiejun.chen@intel.com> wrote:
>>>> On 2014/11/3 16:53, Jan Beulich wrote:
>>>>>>>> On 03.11.14 at 03:22, <tiejun.chen@intel.com> wrote:
>>>>>> On 2014/10/31 16:14, Jan Beulich wrote:
>>>>>>>>>> On 31.10.14 at 03:20, <tiejun.chen@intel.com> wrote:
>>>>>>>> On 2014/10/30 17:13, Jan Beulich wrote:
>>>>>>>>>>>> On 30.10.14 at 06:55, <tiejun.chen@intel.com> wrote:
>>>>>>>>>> On 2014/10/29 17:05, Jan Beulich wrote:
>>>>>>>>>>>>>> On 29.10.14 at 07:54, <tiejun.chen@intel.com> wrote:
>>>>>>>>>>>> Looks I can remove those stuff from util.h and just add 'extern' to them
>>>>>>>>>>>> when we really need them.
>>>>>>>>>>>
>>>>>>>>>>> Please stop thinking this way. Declarations for things defined in .c
>>>>>>>>>>> files are to be present in headers, and the defining .c file has to
>>>>>>>>>>> include that header (making sure declaration and definition are and
>>>>>>>>>>> remain in sync). I hate having to again repeat my remark that you
>>>>>>>>>>> shouldn't forget it's not application code that you're modifying.
>>>>>>>>>>> Robust and maintainable code are a requirement in the hypervisor
>>>>>>>>>>> (and, as said it being an extension of it, hvmloader). Which - just
>>>>>>>>>>> to avoid any misunderstanding - isn't to say that this shouldn't also
>>>>>>>>>>> apply to application code. It's just that in the hypervisor and kernel
>>>>>>>>>>> (and certain other code system components) the consequences of
>>>>>>>>>>> being lax are much more severe.
>>>>>>>>>>
>>>>>>>>>> Okay. But currently, the pci.c file already include 'util.h' and
>>>>>>>>>> '<xen/memory.h>,
>>>>>>>>>>
>>>>>>>>>> #include "util.h"
>>>>>>>>>> ...
>>>>>>>>>> #include <xen/memory.h>
>>>>>>>>>>
>>>>>>>>>> We can't redefine struct xen_reserved_device_memory in util.h.
>>>>>>>>>
>>>>>>>>> Redefine? I said forward declare.
>>>>>>>>
>>>>>>>> Seems we just need to declare hvm_get_reserved_device_memory_map() in
>>>>>>>> the head file, tools/firmware/hvmloader/util.h,
>>>>>>>>
>>>>>>>> unsigned int hvm_get_reserved_device_memory_map(void);
>>>>>>>
>>>>>>> To me this looks very much like poor programming style, even if in
>>>>>>> the context of hvmloader communicating information via global
>>>>>>> variables rather than function arguments and return values is
>>>>>>
>>>>>> Do you mean you don't like a global variable? But it can be use to get
>>>>>> RDM without more hypercall or function call in the context of hvmloader.
>>>>>
>>>>> This argument which you brought up before, and which we commented
>>>>> on before, is pretty pointless. We don't really care much about doing
>>>>> one or two more hypercalls from hvmloader, unless these would be
>>>>> long-running ones.
>>>>>
>>>>
>>>> Another benefit to use a global variable is that we wouldn't allocate
>>>> xen_reserved_device_memory * N each time, and reduce some duplicated
>>>> codes, unless you mean I should define that as static inside in local.
>>>
>>> Now this reason is indeed worth a consideration. How many times is
>>> the data being needed/retrieved?
>>
>> Currently there are two cases in tools/hvmloader, setup pci and build
>> e820 table. Each time, as you know we don't know how may entries we
>> should require, so we always allocate one instance then according to the
>> return value to allocate the proper instances to get that.
>
> Hmm, two uses isn't really that bad, i.e. I'd then still be in favor of
> a more "normal" interface.
>

Coming back to this: I realized that the hypercall caller in hvmloader
always uses mem_alloc(), which allocates from RESERVED_MEMORY, for all
of its buffers. Unfortunately hvmloader has no matching free function,
so calling this multiple times really wastes memory in
RESERVED_MEMORY. So I still think one global variable should be fine.
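
To illustrate the concern, here is a toy model of an allocator that
never frees. It is only a sketch under assumptions: bump_alloc(),
fetch_rdm_map() and struct rdm_entry are hypothetical names, and the
pool merely stands in for RESERVED_MEMORY. It is not hvmloader's
mem_alloc().

```c
#include <assert.h>
#include <stddef.h>

#define POOL_WORDS 512u

static unsigned long pool[POOL_WORDS];   /* word-aligned backing store */
static size_t pool_used;                 /* words handed out so far */

/* Toy bump allocator: hands memory out and never takes it back. */
static void *bump_alloc(size_t size)
{
    size_t words = (size + sizeof(unsigned long) - 1) / sizeof(unsigned long);
    void *p;

    if ( words > POOL_WORDS - pool_used )
        return NULL;

    p = &pool[pool_used];
    pool_used += words;
    return p;
}

/* Hypothetical entry layout, two words per reserved range. */
struct rdm_entry {
    unsigned long start_pfn;
    unsigned long nr_pages;
};

/* Pretend retrieval that allocates a fresh buffer on every call. */
static struct rdm_entry *fetch_rdm_map(unsigned int nr_entries)
{
    assert(nr_entries > 0);
    return bump_alloc(nr_entries * sizeof(struct rdm_entry));
}
```

Every call to fetch_rdm_map() moves pool_used forward by the full
buffer size, so retrieving the same map from two places costs twice
the memory, and nothing ever gives it back.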

Thanks
Tiejun

^ permalink raw reply	[flat|nested] 180+ messages in thread

* Re: [v7][RFC][PATCH 04/13] hvmloader/util: get reserved device memory maps
  2014-11-21  6:26                                     ` Chen, Tiejun
@ 2014-11-21  7:43                                       ` Tian, Kevin
  2014-11-21  7:54                                         ` Jan Beulich
  0 siblings, 1 reply; 180+ messages in thread
From: Tian, Kevin @ 2014-11-21  7:43 UTC (permalink / raw)
  To: Chen, Tiejun, Jan Beulich
  Cc: wei.liu2, ian.campbell, stefano.stabellini, tim, ian.jackson,
	xen-devel, Zhang, Yang Z

> From: Chen, Tiejun
> Sent: Friday, November 21, 2014 2:26 PM
> 
> On 2014/11/3 18:02, Jan Beulich wrote:
> >>>> On 03.11.14 at 10:55, <tiejun.chen@intel.com> wrote:
> >> On 2014/11/3 17:45, Jan Beulich wrote:
> >>>>>> On 03.11.14 at 10:32, <tiejun.chen@intel.com> wrote:
> >>>> On 2014/11/3 16:53, Jan Beulich wrote:
> >>>>>>>> On 03.11.14 at 03:22, <tiejun.chen@intel.com> wrote:
> >>>>>> On 2014/10/31 16:14, Jan Beulich wrote:
> >>>>>>>>>> On 31.10.14 at 03:20, <tiejun.chen@intel.com> wrote:
> >>>>>>>> On 2014/10/30 17:13, Jan Beulich wrote:
> >>>>>>>>>>>> On 30.10.14 at 06:55, <tiejun.chen@intel.com> wrote:
> >>>>>>>>>> On 2014/10/29 17:05, Jan Beulich wrote:
> >>>>>>>>>>>>>> On 29.10.14 at 07:54, <tiejun.chen@intel.com> wrote:
> >>>>>>>>>>>> Looks I can remove those stuff from util.h and just add 'extern'
> to them
> >>>>>>>>>>>> when we really need them.
> >>>>>>>>>>>
> >>>>>>>>>>> Please stop thinking this way. Declarations for things defined
> in .c
> >>>>>>>>>>> files are to be present in headers, and the defining .c file has to
> >>>>>>>>>>> include that header (making sure declaration and definition are
> and
> >>>>>>>>>>> remain in sync). I hate having to again repeat my remark that
> you
> >>>>>>>>>>> shouldn't forget it's not application code that you're modifying.
> >>>>>>>>>>> Robust and maintainable code are a requirement in the
> hypervisor
> >>>>>>>>>>> (and, as said it being an extension of it, hvmloader). Which - just
> >>>>>>>>>>> to avoid any misunderstanding - isn't to say that this shouldn't
> also
> >>>>>>>>>>> apply to application code. It's just that in the hypervisor and
> kernel
> >>>>>>>>>>> (and certain other code system components) the consequences
> of
> >>>>>>>>>>> being lax are much more severe.
> >>>>>>>>>>
> >>>>>>>>>> Okay. But currently, the pci.c file already include 'util.h' and
> >>>>>>>>>> '<xen/memory.h>,
> >>>>>>>>>>
> >>>>>>>>>> #include "util.h"
> >>>>>>>>>> ...
> >>>>>>>>>> #include <xen/memory.h>
> >>>>>>>>>>
> >>>>>>>>>> We can't redefine struct xen_reserved_device_memory in util.h.
> >>>>>>>>>
> >>>>>>>>> Redefine? I said forward declare.
> >>>>>>>>
> >>>>>>>> Seems we just need to declare
> hvm_get_reserved_device_memory_map() in
> >>>>>>>> the head file, tools/firmware/hvmloader/util.h,
> >>>>>>>>
> >>>>>>>> unsigned int hvm_get_reserved_device_memory_map(void);
> >>>>>>>
> >>>>>>> To me this looks very much like poor programming style, even if in
> >>>>>>> the context of hvmloader communicating information via global
> >>>>>>> variables rather than function arguments and return values is
> >>>>>>
> >>>>>> Do you mean you don't like a global variable? But it can be use to get
> >>>>>> RDM without more hypercall or function call in the context of
> hvmloader.
> >>>>>
> >>>>> This argument which you brought up before, and which we commented
> >>>>> on before, is pretty pointless. We don't really care much about doing
> >>>>> one or two more hypercalls from hvmloader, unless these would be
> >>>>> long-running ones.
> >>>>>
> >>>>
> >>>> Another benefit to use a global variable is that we wouldn't allocate
> >>>> xen_reserved_device_memory * N each time, and reduce some
> duplicated
> >>>> codes, unless you mean I should define that as static inside in local.
> >>>
> >>> Now this reason is indeed worth a consideration. How many times is
> >>> the data being needed/retrieved?
> >>
> >> Currently there are two cases in tools/hvmloader, setup pci and build
> >> e820 table. Each time, as you know we don't know how may entries we
> >> should require, so we always allocate one instance then according to the
> >> return value to allocate the proper instances to get that.
> >
> > Hmm, two uses isn't really that bad, i.e. I'd then still be in favor of
> > a more "normal" interface.
> >
> 
> Just go back here since I realize we always use mem_alloc(), which is
> pick from RESERVED_MEMORY, to allocate all buffer inside this hypercall
> caller in hvmloader, but unfortunately we have no any associated free
> function implementation in hvmloader, so if we call this multiple times
> this means it really waster more memory in RESERVED_MEMORY. So I still
> think one global variable should be fine.
> 

It's natural to get the reserved information once and then save it for
later use, whether it is referenced once or many times.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 180+ messages in thread

* Re: [v7][RFC][PATCH 04/13] hvmloader/util: get reserved device memory maps
  2014-11-21  7:43                                       ` Tian, Kevin
@ 2014-11-21  7:54                                         ` Jan Beulich
  2014-11-21  8:01                                           ` Tian, Kevin
  2014-11-21  8:54                                           ` Chen, Tiejun
  0 siblings, 2 replies; 180+ messages in thread
From: Jan Beulich @ 2014-11-21  7:54 UTC (permalink / raw)
  To: Kevin Tian, Tiejun Chen
  Cc: wei.liu2, ian.campbell, stefano.stabellini, tim, ian.jackson,
	xen-devel, Yang Z Zhang

>>> On 21.11.14 at 08:43, <kevin.tian@intel.com> wrote:
>>  From: Chen, Tiejun
>> Sent: Friday, November 21, 2014 2:26 PM
>> 
>> On 2014/11/3 18:02, Jan Beulich wrote:
>> >>>> On 03.11.14 at 10:55, <tiejun.chen@intel.com> wrote:
>> >> On 2014/11/3 17:45, Jan Beulich wrote:
>> >>>>>> On 03.11.14 at 10:32, <tiejun.chen@intel.com> wrote:
>> >>>> On 2014/11/3 16:53, Jan Beulich wrote:
>> >>>>>>>> On 03.11.14 at 03:22, <tiejun.chen@intel.com> wrote:
>> >>>>>> On 2014/10/31 16:14, Jan Beulich wrote:
>> >>>>>>>>>> On 31.10.14 at 03:20, <tiejun.chen@intel.com> wrote:
>> >>>>>>>> On 2014/10/30 17:13, Jan Beulich wrote:
>> >>>>>>>>>>>> On 30.10.14 at 06:55, <tiejun.chen@intel.com> wrote:
>> >>>>>>>>>> On 2014/10/29 17:05, Jan Beulich wrote:
>> >>>>>>>>>>>>>> On 29.10.14 at 07:54, <tiejun.chen@intel.com> wrote:
>> >>>>>>>>>>>> Looks I can remove those stuff from util.h and just add 'extern'
>> to them
>> >>>>>>>>>>>> when we really need them.
>> >>>>>>>>>>>
>> >>>>>>>>>>> Please stop thinking this way. Declarations for things defined
>> in .c
>> >>>>>>>>>>> files are to be present in headers, and the defining .c file has to
>> >>>>>>>>>>> include that header (making sure declaration and definition are
>> and
>> >>>>>>>>>>> remain in sync). I hate having to again repeat my remark that
>> you
>> >>>>>>>>>>> shouldn't forget it's not application code that you're modifying.
>> >>>>>>>>>>> Robust and maintainable code are a requirement in the
>> hypervisor
>> >>>>>>>>>>> (and, as said it being an extension of it, hvmloader). Which - just
>> >>>>>>>>>>> to avoid any misunderstanding - isn't to say that this shouldn't
>> also
>> >>>>>>>>>>> apply to application code. It's just that in the hypervisor and
>> kernel
>> >>>>>>>>>>> (and certain other code system components) the consequences
>> of
>> >>>>>>>>>>> being lax are much more severe.
>> >>>>>>>>>>
>> >>>>>>>>>> Okay. But currently, the pci.c file already include 'util.h' and
>> >>>>>>>>>> '<xen/memory.h>,
>> >>>>>>>>>>
>> >>>>>>>>>> #include "util.h"
>> >>>>>>>>>> ...
>> >>>>>>>>>> #include <xen/memory.h>
>> >>>>>>>>>>
>> >>>>>>>>>> We can't redefine struct xen_reserved_device_memory in util.h.
>> >>>>>>>>>
>> >>>>>>>>> Redefine? I said forward declare.
>> >>>>>>>>
>> >>>>>>>> Seems we just need to declare
>> hvm_get_reserved_device_memory_map() in
>> >>>>>>>> the head file, tools/firmware/hvmloader/util.h,
>> >>>>>>>>
>> >>>>>>>> unsigned int hvm_get_reserved_device_memory_map(void);
>> >>>>>>>
>> >>>>>>> To me this looks very much like poor programming style, even if in
>> >>>>>>> the context of hvmloader communicating information via global
>> >>>>>>> variables rather than function arguments and return values is
>> >>>>>>
>> >>>>>> Do you mean you don't like a global variable? But it can be use to get
>> >>>>>> RDM without more hypercall or function call in the context of
>> hvmloader.
>> >>>>>
>> >>>>> This argument which you brought up before, and which we commented
>> >>>>> on before, is pretty pointless. We don't really care much about doing
>> >>>>> one or two more hypercalls from hvmloader, unless these would be
>> >>>>> long-running ones.
>> >>>>>
>> >>>>
>> >>>> Another benefit to use a global variable is that we wouldn't allocate
>> >>>> xen_reserved_device_memory * N each time, and reduce some
>> duplicated
>> >>>> codes, unless you mean I should define that as static inside in local.
>> >>>
>> >>> Now this reason is indeed worth a consideration. How many times is
>> >>> the data being needed/retrieved?
>> >>
>> >> Currently there are two cases in tools/hvmloader, setup pci and build
>> >> e820 table. Each time, as you know we don't know how may entries we
>> >> should require, so we always allocate one instance then according to the
>> >> return value to allocate the proper instances to get that.
>> >
>> > Hmm, two uses isn't really that bad, i.e. I'd then still be in favor of
>> > a more "normal" interface.
>> >
>> 
>> Just go back here since I realize we always use mem_alloc(), which is
>> pick from RESERVED_MEMORY, to allocate all buffer inside this hypercall
>> caller in hvmloader, but unfortunately we have no any associated free
>> function implementation in hvmloader, so if we call this multiple times
>> this means it really waster more memory in RESERVED_MEMORY. So I still
>> think one global variable should be fine.
> 
> it's natural to get reserved information once, and then saved for either
> one-time or multiple-time references.

Not really natural, but possible. Yet if done this way, then the
interface shouldn't give the appearance of retrieving it every time,
i.e. the global should be set up separately and the users of the
data should access the data rather than calling a (fake) function.
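
A rough sketch of what such a split could look like, with an explicit
one-time setup step and consumers that read the data directly. All
names here (rdm_map, rdm_map_nr, rdm_map_init(), fake_fetch(),
pfn_is_reserved()) are hypothetical, and fake_fetch() only stands in
for the real hypercall wrapper:

```c
#include <assert.h>
#include <stddef.h>

struct rdm_entry {
    unsigned long start_pfn;
    unsigned long nr_pages;
};

#define RDM_MAX 8u

static struct rdm_entry rdm_map_storage[RDM_MAX];
static struct rdm_entry *rdm_map;   /* NULL until rdm_map_init() has run */
static unsigned int rdm_map_nr;     /* number of valid entries in rdm_map */
static unsigned int fetch_count;    /* for illustration: retrievals done */

/* Stand-in for the hypercall: fills in two made-up ranges. */
static unsigned int fake_fetch(struct rdm_entry *buf, unsigned int max)
{
    fetch_count++;
    if ( max < 2 )
        return 0;
    buf[0].start_pfn = 0xab80;
    buf[0].nr_pages = 0x10;
    buf[1].start_pfn = 0xad000;
    buf[1].nr_pages = 0x800;
    return 2;
}

/* Called once, early, and clearly named as setup rather than as a getter. */
static void rdm_map_init(void)
{
    assert(rdm_map == NULL);
    rdm_map_nr = fake_fetch(rdm_map_storage, RDM_MAX);
    rdm_map = rdm_map_storage;
}

/* A consumer: it reads the global data and never triggers a retrieval. */
static int pfn_is_reserved(unsigned long pfn)
{
    unsigned int i;

    assert(rdm_map != NULL);
    for ( i = 0; i < rdm_map_nr; i++ )
        if ( pfn >= rdm_map[i].start_pfn &&
             pfn - rdm_map[i].start_pfn < rdm_map[i].nr_pages )
            return 1;

    return 0;
}
```

With this shape the PCI setup and e820 code would both just walk
rdm_map, and it is obvious from the call sites that the retrieval
happens exactly once.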

Jan

^ permalink raw reply	[flat|nested] 180+ messages in thread

* Re: [v7][RFC][PATCH 04/13] hvmloader/util: get reserved device memory maps
  2014-11-21  7:54                                         ` Jan Beulich
@ 2014-11-21  8:01                                           ` Tian, Kevin
  2014-11-21  8:54                                           ` Chen, Tiejun
  1 sibling, 0 replies; 180+ messages in thread
From: Tian, Kevin @ 2014-11-21  8:01 UTC (permalink / raw)
  To: Jan Beulich, Chen, Tiejun
  Cc: wei.liu2, ian.campbell, stefano.stabellini, tim, ian.jackson,
	xen-devel, Zhang, Yang Z

> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Friday, November 21, 2014 3:54 PM
> 
> >>> On 21.11.14 at 08:43, <kevin.tian@intel.com> wrote:
> >>  From: Chen, Tiejun
> >> Sent: Friday, November 21, 2014 2:26 PM
> >>
> >> On 2014/11/3 18:02, Jan Beulich wrote:
> >> >>>> On 03.11.14 at 10:55, <tiejun.chen@intel.com> wrote:
> >> >> On 2014/11/3 17:45, Jan Beulich wrote:
> >> >>>>>> On 03.11.14 at 10:32, <tiejun.chen@intel.com> wrote:
> >> >>>> On 2014/11/3 16:53, Jan Beulich wrote:
> >> >>>>>>>> On 03.11.14 at 03:22, <tiejun.chen@intel.com> wrote:
> >> >>>>>> On 2014/10/31 16:14, Jan Beulich wrote:
> >> >>>>>>>>>> On 31.10.14 at 03:20, <tiejun.chen@intel.com> wrote:
> >> >>>>>>>> On 2014/10/30 17:13, Jan Beulich wrote:
> >> >>>>>>>>>>>> On 30.10.14 at 06:55, <tiejun.chen@intel.com> wrote:
> >> >>>>>>>>>> On 2014/10/29 17:05, Jan Beulich wrote:
> >> >>>>>>>>>>>>>> On 29.10.14 at 07:54, <tiejun.chen@intel.com> wrote:
> >> >>>>>>>>>>>> Looks I can remove those stuff from util.h and just add 'extern' to them
> >> >>>>>>>>>>>> when we really need them.
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> Please stop thinking this way. Declarations for things defined in .c
> >> >>>>>>>>>>> files are to be present in headers, and the defining .c file has to
> >> >>>>>>>>>>> include that header (making sure declaration and definition are and
> >> >>>>>>>>>>> remain in sync). I hate having to again repeat my remark that you
> >> >>>>>>>>>>> shouldn't forget it's not application code that you're modifying.
> >> >>>>>>>>>>> Robust and maintainable code are a requirement in the hypervisor
> >> >>>>>>>>>>> (and, as said it being an extension of it, hvmloader). Which - just
> >> >>>>>>>>>>> to avoid any misunderstanding - isn't to say that this shouldn't also
> >> >>>>>>>>>>> apply to application code. It's just that in the hypervisor and kernel
> >> >>>>>>>>>>> (and certain other code system components) the consequences of
> >> >>>>>>>>>>> being lax are much more severe.
> >> >>>>>>>>>>
> >> >>>>>>>>>> Okay. But currently, the pci.c file already include 'util.h' and
> >> >>>>>>>>>> '<xen/memory.h>,
> >> >>>>>>>>>>
> >> >>>>>>>>>> #include "util.h"
> >> >>>>>>>>>> ...
> >> >>>>>>>>>> #include <xen/memory.h>
> >> >>>>>>>>>>
> >> >>>>>>>>>> We can't redefine struct xen_reserved_device_memory in util.h.
> >> >>>>>>>>>
> >> >>>>>>>>> Redefine? I said forward declare.
> >> >>>>>>>>
> >> >>>>>>>> Seems we just need to declare hvm_get_reserved_device_memory_map() in
> >> >>>>>>>> the head file, tools/firmware/hvmloader/util.h,
> >> >>>>>>>>
> >> >>>>>>>> unsigned int hvm_get_reserved_device_memory_map(void);
> >> >>>>>>>
> >> >>>>>>> To me this looks very much like poor programming style, even if in
> >> >>>>>>> the context of hvmloader communicating information via global
> >> >>>>>>> variables rather than function arguments and return values is
> >> >>>>>>
> >> >>>>>> Do you mean you don't like a global variable? But it can be use to get
> >> >>>>>> RDM without more hypercall or function call in the context of hvmloader.
> >> >>>>>
> >> >>>>> This argument which you brought up before, and which we commented
> >> >>>>> on before, is pretty pointless. We don't really care much about doing
> >> >>>>> one or two more hypercalls from hvmloader, unless these would be
> >> >>>>> long-running ones.
> >> >>>>>
> >> >>>>
> >> >>>> Another benefit to use a global variable is that we wouldn't allocate
> >> >>>> xen_reserved_device_memory * N each time, and reduce some duplicated
> >> >>>> codes, unless you mean I should define that as static inside in local.
> >> >>>
> >> >>> Now this reason is indeed worth a consideration. How many times is
> >> >>> the data being needed/retrieved?
> >> >>
> >> >> Currently there are two cases in tools/hvmloader, setup pci and build
> >> >> e820 table. Each time, as you know we don't know how may entries we
> >> >> should require, so we always allocate one instance then according to the
> >> >> return value to allocate the proper instances to get that.
> >> >
> >> > Hmm, two uses isn't really that bad, i.e. I'd then still be in favor of
> >> > a more "normal" interface.
> >> >
> >>
> >> Just go back here since I realize we always use mem_alloc(), which is
> >> pick from RESERVED_MEMORY, to allocate all buffer inside this hypercall
> >> caller in hvmloader, but unfortunately we have no any associated free
> >> function implementation in hvmloader, so if we call this multiple times
> >> this means it really waster more memory in RESERVED_MEMORY. So I still
> >> think one global variable should be fine.
> >
> > it's natural to get reserved information once, and then saved for either
> > one-time or multiple-time references.
> 
> Not really natural, but possible. Yet if done this way, then the
> interface shouldn't give the appearance of retrieving it every time,
> i.e. the global should be set up separately and the users of the
> data should access the data rather than calling a (fake) function.
> 

That's what I meant.

Thanks
Kevin

* Re: [v7][RFC][PATCH 04/13] hvmloader/util: get reserved device memory maps
  2014-11-21  7:54                                         ` Jan Beulich
  2014-11-21  8:01                                           ` Tian, Kevin
@ 2014-11-21  8:54                                           ` Chen, Tiejun
  2014-11-21  9:33                                             ` Jan Beulich
  1 sibling, 1 reply; 180+ messages in thread
From: Chen, Tiejun @ 2014-11-21  8:54 UTC (permalink / raw)
  To: Jan Beulich, Kevin Tian
  Cc: wei.liu2, ian.campbell, stefano.stabellini, tim, ian.jackson,
	xen-devel, Yang Z Zhang

On 2014/11/21 15:54, Jan Beulich wrote:
>>>> On 21.11.14 at 08:43, <kevin.tian@intel.com> wrote:
>>>   From: Chen, Tiejun
>>> Sent: Friday, November 21, 2014 2:26 PM
>>>
>>> On 2014/11/3 18:02, Jan Beulich wrote:
>>>>>>> On 03.11.14 at 10:55, <tiejun.chen@intel.com> wrote:
>>>>> On 2014/11/3 17:45, Jan Beulich wrote:
>>>>>>>>> On 03.11.14 at 10:32, <tiejun.chen@intel.com> wrote:
>>>>>>> On 2014/11/3 16:53, Jan Beulich wrote:
>>>>>>>>>>> On 03.11.14 at 03:22, <tiejun.chen@intel.com> wrote:
>>>>>>>>> On 2014/10/31 16:14, Jan Beulich wrote:
>>>>>>>>>>>>> On 31.10.14 at 03:20, <tiejun.chen@intel.com> wrote:
>>>>>>>>>>> On 2014/10/30 17:13, Jan Beulich wrote:
>>>>>>>>>>>>>>> On 30.10.14 at 06:55, <tiejun.chen@intel.com> wrote:
>>>>>>>>>>>>> On 2014/10/29 17:05, Jan Beulich wrote:
>>>>>>>>>>>>>>>>> On 29.10.14 at 07:54, <tiejun.chen@intel.com> wrote:
>>>>>>>>>>>>>>> Looks I can remove those stuff from util.h and just add 'extern' to them
>>>>>>>>>>>>>>> when we really need them.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Please stop thinking this way. Declarations for things defined in .c
>>>>>>>>>>>>>> files are to be present in headers, and the defining .c file has to
>>>>>>>>>>>>>> include that header (making sure declaration and definition are and
>>>>>>>>>>>>>> remain in sync). I hate having to again repeat my remark that you
>>>>>>>>>>>>>> shouldn't forget it's not application code that you're modifying.
>>>>>>>>>>>>>> Robust and maintainable code are a requirement in the hypervisor
>>>>>>>>>>>>>> (and, as said it being an extension of it, hvmloader). Which - just
>>>>>>>>>>>>>> to avoid any misunderstanding - isn't to say that this shouldn't also
>>>>>>>>>>>>>> apply to application code. It's just that in the hypervisor and kernel
>>>>>>>>>>>>>> (and certain other code system components) the consequences of
>>>>>>>>>>>>>> being lax are much more severe.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Okay. But currently, the pci.c file already include 'util.h' and
>>>>>>>>>>>>> '<xen/memory.h>,
>>>>>>>>>>>>>
>>>>>>>>>>>>> #include "util.h"
>>>>>>>>>>>>> ...
>>>>>>>>>>>>> #include <xen/memory.h>
>>>>>>>>>>>>>
>>>>>>>>>>>>> We can't redefine struct xen_reserved_device_memory in util.h.
>>>>>>>>>>>>
>>>>>>>>>>>> Redefine? I said forward declare.
>>>>>>>>>>>
>>>>>>>>>>> Seems we just need to declare hvm_get_reserved_device_memory_map() in
>>>>>>>>>>> the head file, tools/firmware/hvmloader/util.h,
>>>>>>>>>>>
>>>>>>>>>>> unsigned int hvm_get_reserved_device_memory_map(void);
>>>>>>>>>>
>>>>>>>>>> To me this looks very much like poor programming style, even if in
>>>>>>>>>> the context of hvmloader communicating information via global
>>>>>>>>>> variables rather than function arguments and return values is
>>>>>>>>>
>>>>>>>>> Do you mean you don't like a global variable? But it can be use to get
>>>>>>>>> RDM without more hypercall or function call in the context of hvmloader.
>>>>>>>>
>>>>>>>> This argument which you brought up before, and which we commented
>>>>>>>> on before, is pretty pointless. We don't really care much about doing
>>>>>>>> one or two more hypercalls from hvmloader, unless these would be
>>>>>>>> long-running ones.
>>>>>>>>
>>>>>>>
>>>>>>> Another benefit to use a global variable is that we wouldn't allocate
>>>>>>> xen_reserved_device_memory * N each time, and reduce some duplicated
>>>>>>> codes, unless you mean I should define that as static inside in local.
>>>>>>
>>>>>> Now this reason is indeed worth a consideration. How many times is
>>>>>> the data being needed/retrieved?
>>>>>
>>>>> Currently there are two cases in tools/hvmloader, setup pci and build
>>>>> e820 table. Each time, as you know we don't know how may entries we
>>>>> should require, so we always allocate one instance then according to the
>>>>> return value to allocate the proper instances to get that.
>>>>
>>>> Hmm, two uses isn't really that bad, i.e. I'd then still be in favor of
>>>> a more "normal" interface.
>>>>
>>>
>>> Just go back here since I realize we always use mem_alloc(), which is
>>> pick from RESERVED_MEMORY, to allocate all buffer inside this hypercall
>>> caller in hvmloader, but unfortunately we have no any associated free
>>> function implementation in hvmloader, so if we call this multiple times
>>> this means it really waster more memory in RESERVED_MEMORY. So I still
>>> think one global variable should be fine.
>>
>> it's natural to get reserved information once, and then saved for either
>> one-time or multiple-time references.
>
> Not really natural, but possible. Yet if done this way, then the
> interface shouldn't give the appearance of retrieving it every time,
> i.e. the global should be set up separately and the users of the

Haven't we implemented exactly this previously?

+struct xen_mem_reserved_device_memory *rdm_map;

Since it is a global variable, any caller should check whether it is
!NULL before they call that function.

Thanks
Tiejun

> data should access the data rather than calling a (fake) function.
>
> Jan
>
>

* Re: [v7][RFC][PATCH 04/13] hvmloader/util: get reserved device memory maps
  2014-11-21  8:54                                           ` Chen, Tiejun
@ 2014-11-21  9:33                                             ` Jan Beulich
  0 siblings, 0 replies; 180+ messages in thread
From: Jan Beulich @ 2014-11-21  9:33 UTC (permalink / raw)
  To: Tiejun Chen
  Cc: Kevin Tian, wei.liu2, ian.campbell, stefano.stabellini, tim,
	ian.jackson, xen-devel, Yang Z Zhang

>>> On 21.11.14 at 09:54, <tiejun.chen@intel.com> wrote:
> On 2014/11/21 15:54, Jan Beulich wrote:
>>>>> On 21.11.14 at 08:43, <kevin.tian@intel.com> wrote:
>>> it's natural to get reserved information once, and then saved for either
>>> one-time or multiple-time references.
>>
>> Not really natural, but possible. Yet if done this way, then the
>> interface shouldn't give the appearance of retrieving it every time,
>> i.e. the global should be set up separately and the users of the
> 
> Haven't we implemented exactly this previously?

Not as far as I recall (my remark above about how it should not be done
was aimed specifically at what I recall you having done so far).

> Since it is a global variable, any caller should check whether it is
> !NULL before they call that function.

Of course.

Jan
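
The arrangement agreed on above (set the global up once, let users read the
data, and have users check for NULL) could look roughly like the sketch
below. It is an illustration under stated assumptions, not code from the
series: only the global's name rdm_map appears in the thread, while
struct rdm_entry, fake_query_rdm(), rdm_setup(), rdm_nr_entries,
pfn_is_reserved() and the fixed-size static storage are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical entry layout, for illustration only. */
struct rdm_entry {
    uint64_t start_pfn;
    uint64_t nr_pages;
};

#define RDM_MAX 8   /* hypothetical capacity of the static storage */

/* Hypothetical stand-in for the hypercall: copies at most 'max' entries
 * into 'buf' and returns how many entries exist in total. */
static unsigned int fake_query_rdm(struct rdm_entry *buf, unsigned int max)
{
    static const struct rdm_entry table[] = {
        { 0xab000, 0x20 },
        { 0xad800, 0x08 },
    };
    unsigned int total = sizeof(table) / sizeof(table[0]);
    unsigned int i;

    for (i = 0; i < total && i < max; i++)
        buf[i] = table[i];
    return total;
}

/* The data itself is the interface; there is no getter function. */
static struct rdm_entry rdm_storage[RDM_MAX];
struct rdm_entry *rdm_map;        /* NULL until rdm_setup() has run */
unsigned int rdm_nr_entries;

/* Explicit one-time setup, called once before any user needs the map. */
void rdm_setup(void)
{
    unsigned int total = fake_query_rdm(rdm_storage, RDM_MAX);

    assert(total <= RDM_MAX);     /* sketch only: no growth path */
    rdm_nr_entries = total;
    rdm_map = rdm_storage;
}

/* Example user: reads the global directly after checking it is set up.
 * Returns -1 if the map is not set up, 1 if 'pfn' lies inside a
 * reserved range, 0 otherwise. */
int pfn_is_reserved(uint64_t pfn)
{
    unsigned int i;

    if (rdm_map == NULL)
        return -1;
    for (i = 0; i < rdm_nr_entries; i++)
        if (pfn >= rdm_map[i].start_pfn &&
            pfn - rdm_map[i].start_pfn < rdm_map[i].nr_pages)
            return 1;
    return 0;
}
```

Because the query happens in exactly one place, the setup cost is paid once
and nothing in the users' code suggests that the map is fetched again.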

Thread overview: 180+ messages
2014-10-24  7:34 [v7][RFC][PATCH 01/13] xen: RMRR fix Tiejun Chen
2014-10-24  7:34 ` [v7][RFC][PATCH 01/13] introduce XENMEM_reserved_device_memory_map Tiejun Chen
2014-10-24 14:11   ` Jan Beulich
2014-10-27  2:11     ` Chen, Tiejun
2014-10-27  2:18       ` Chen, Tiejun
2014-10-27  9:42       ` Jan Beulich
2014-10-28  2:22         ` Chen, Tiejun
2014-10-27 13:35   ` Julien Grall
2014-10-28  2:35     ` Chen, Tiejun
2014-10-28 10:36       ` Jan Beulich
2014-10-29  0:40         ` Chen, Tiejun
2014-10-29  8:53           ` Jan Beulich
2014-10-30  2:53             ` Chen, Tiejun
2014-10-30  9:10               ` Jan Beulich
2014-10-31  1:03                 ` Chen, Tiejun
2014-10-24  7:34 ` [v7][RFC][PATCH 02/13] tools/libxc: introduce hypercall for xc_reserved_device_memory_map Tiejun Chen
2014-10-24  7:34 ` [v7][RFC][PATCH 03/13] tools/libxc: check if modules space is overlapping with reserved device memory Tiejun Chen
2014-10-24  7:34 ` [v7][RFC][PATCH 04/13] hvmloader/util: get reserved device memory maps Tiejun Chen
2014-10-24 14:22   ` Jan Beulich
2014-10-27  3:12     ` Chen, Tiejun
2014-10-27  9:45       ` Jan Beulich
2014-10-28  5:21         ` Chen, Tiejun
2014-10-28  9:48           ` Jan Beulich
2014-10-29  6:54             ` Chen, Tiejun
2014-10-29  9:05               ` Jan Beulich
2014-10-30  5:55                 ` Chen, Tiejun
2014-10-30  9:13                   ` Jan Beulich
2014-10-31  2:20                     ` Chen, Tiejun
2014-10-31  8:14                       ` Jan Beulich
2014-11-03  2:22                         ` Chen, Tiejun
2014-11-03  8:53                           ` Jan Beulich
2014-11-03  9:32                             ` Chen, Tiejun
2014-11-03  9:45                               ` Jan Beulich
2014-11-03  9:55                                 ` Chen, Tiejun
2014-11-03 10:02                                   ` Jan Beulich
2014-11-21  6:26                                     ` Chen, Tiejun
2014-11-21  7:43                                       ` Tian, Kevin
2014-11-21  7:54                                         ` Jan Beulich
2014-11-21  8:01                                           ` Tian, Kevin
2014-11-21  8:54                                           ` Chen, Tiejun
2014-11-21  9:33                                             ` Jan Beulich
2014-10-24 14:27   ` Jan Beulich
2014-10-27  5:07     ` Chen, Tiejun
2014-10-24  7:34 ` [v7][RFC][PATCH 05/13] hvmloader/mmio: reconcile guest mmio with reserved device memory Tiejun Chen
2014-10-24 14:42   ` Jan Beulich
2014-10-27  7:12     ` Chen, Tiejun
2014-10-27  9:56       ` Jan Beulich
2014-10-28  7:11         ` Chen, Tiejun
2014-10-28  9:56           ` Jan Beulich
2014-10-29  7:03             ` Chen, Tiejun
2014-10-29  9:08               ` Jan Beulich
2014-10-30  3:18                 ` Chen, Tiejun
2014-10-24  7:34 ` [v7][RFC][PATCH 06/13] hvmloader/ram: check if guest memory is out of reserved device memory maps Tiejun Chen
2014-10-24 14:56   ` Jan Beulich
2014-10-27  8:09     ` Chen, Tiejun
2014-10-27 10:17       ` Jan Beulich
2014-10-28  7:47         ` Chen, Tiejun
2014-10-28 10:06           ` Jan Beulich
2014-10-29  7:43             ` Chen, Tiejun
2014-10-29  9:15               ` Jan Beulich
2014-10-30  3:11                 ` Chen, Tiejun
2014-10-30  9:20                   ` Jan Beulich
2014-10-31  5:41                     ` Chen, Tiejun
2014-10-31  6:21                       ` Tian, Kevin
2014-10-31  7:02                         ` Chen, Tiejun
2014-10-31  8:20                         ` Jan Beulich
2014-11-03  5:49                           ` Chen, Tiejun
2014-11-03  8:56                             ` Jan Beulich
2014-11-03  9:40                               ` Chen, Tiejun
2014-11-03  9:51                                 ` Jan Beulich
2014-11-03 11:32                                   ` Chen, Tiejun
2014-11-03 11:43                                     ` Jan Beulich
2014-11-03 11:58                                       ` Chen, Tiejun
2014-11-03 12:34                                         ` Jan Beulich
2014-11-04  5:05                                           ` Chen, Tiejun
2014-11-04  7:54                                             ` Jan Beulich
2014-11-05  2:59                                               ` Chen, Tiejun
2014-11-05 17:00                                                 ` Jan Beulich
2014-11-06  9:28                                                   ` Chen, Tiejun
2014-11-06 10:06                                                     ` Jan Beulich
2014-11-07 10:27                                                       ` Chen, Tiejun
2014-11-07 11:08                                                         ` Jan Beulich
2014-11-11  6:32                                                           ` Chen, Tiejun
2014-11-11  7:49                                                             ` Chen, Tiejun
2014-11-11  9:03                                                               ` Jan Beulich
2014-11-11  9:06                                                                 ` Jan Beulich
2014-11-11  9:42                                                                   ` Chen, Tiejun
2014-11-11 10:07                                                                     ` Jan Beulich
2014-11-12  1:36                                                                       ` Chen, Tiejun
2014-11-12  8:37                                                                         ` Jan Beulich
2014-11-12  8:45                                                                           ` Chen, Tiejun
2014-11-12  9:02                                                                             ` Jan Beulich
2014-11-12  9:13                                                                               ` Chen, Tiejun
2014-11-12  9:56                                                                                 ` Jan Beulich
2014-11-12 10:18                                                                                   ` Chen, Tiejun
2014-11-19  8:17                                                                                   ` Tian, Kevin
2014-11-20  7:45                                                                                   ` Tian, Kevin
2014-11-20  8:04                                                                                     ` Jan Beulich
2014-11-20  8:51                                                                                       ` Tian, Kevin
2014-11-20 14:40                                                                                       ` Tian, Kevin
2014-11-20 14:46                                                                                         ` Jan Beulich
2014-11-20 20:11                                                                                         ` Konrad Rzeszutek Wilk
2014-11-21  0:32                                                                                           ` Tian, Kevin
2014-11-12  3:05                                                                     ` Chen, Tiejun
2014-11-12  8:55                                                                       ` Jan Beulich
2014-11-12 10:18                                                                         ` Chen, Tiejun
2014-11-12 10:24                                                                           ` Jan Beulich
2014-11-12 10:32                                                                             ` Chen, Tiejun
2014-11-13  3:09                                                                         ` Chen, Tiejun
2014-11-14  2:21                                                                           ` Chen, Tiejun
2014-11-14  8:21                                                                             ` Jan Beulich
2014-11-17  7:31                                                                               ` Chen, Tiejun
2014-11-17  7:57                                                                         ` Chen, Tiejun
2014-11-17 10:05                                                                           ` Jan Beulich
2014-11-17 11:08                                                                             ` Chen, Tiejun
2014-11-17 11:17                                                                               ` Jan Beulich
2014-11-17 11:32                                                                                 ` Chen, Tiejun
2014-11-17 11:51                                                                                   ` Jan Beulich
2014-11-18  3:08                                                                                     ` Chen, Tiejun
2014-11-18  8:01                                                                                       ` Jan Beulich
2014-11-18  8:16                                                                                         ` Chen, Tiejun
2014-11-18  9:33                                                                                           ` Jan Beulich
2014-11-19  1:26                                                                                             ` Chen, Tiejun
2014-11-20  7:31                                                                                               ` Jan Beulich
2014-11-20  8:12                                                                                                 ` Chen, Tiejun
2014-11-20  8:59                                                                                                   ` Jan Beulich
2014-11-20 10:28                                                                                                     ` Chen, Tiejun
2014-11-11  8:59                                                             ` Jan Beulich
2014-11-11  9:35                                                               ` Chen, Tiejun
2014-11-11  9:42                                                                 ` Jan Beulich
2014-11-11  9:51                                                                   ` Chen, Tiejun
2014-10-24  7:34 ` [v7][RFC][PATCH 07/13] xen/x86/p2m: introduce p2m_check_reserved_device_memory Tiejun Chen
2014-10-24 15:02   ` Jan Beulich
2014-10-27  8:50     ` Chen, Tiejun
2014-10-24  7:34 ` [v7][RFC][PATCH 08/13] xen/x86/p2m: set p2m_access_n for reserved device memory mapping Tiejun Chen
2014-10-24 15:11   ` Jan Beulich
2014-10-27  9:05     ` Chen, Tiejun
2014-10-27 10:33       ` Jan Beulich
2014-10-28  8:26         ` Chen, Tiejun
2014-10-28 10:12           ` Jan Beulich
2014-10-29  8:20             ` Chen, Tiejun
2014-10-29  9:20               ` Jan Beulich
2014-10-30  7:39                 ` Chen, Tiejun
2014-10-30  9:24                   ` Jan Beulich
2014-10-31  2:50                     ` Chen, Tiejun
2014-10-31  8:25                       ` Jan Beulich
2014-11-03  6:20                         ` Chen, Tiejun
2014-11-03  9:00                           ` Jan Beulich
2014-11-03  9:51                             ` Chen, Tiejun
2014-11-03 10:03                               ` Jan Beulich
2014-11-03 11:48                                 ` Chen, Tiejun
2014-11-03 11:53                                   ` Jan Beulich
2014-11-04  1:35                                     ` Chen, Tiejun
2014-11-04  8:02                                       ` Jan Beulich
2014-11-04 10:41                                         ` Chen, Tiejun
2014-11-04 11:41                                           ` Jan Beulich
2014-11-04 11:51                                             ` Chen, Tiejun
2014-10-24  7:34 ` [v7][RFC][PATCH 09/13] xen/x86/ept: handle reserved device memory in ept_handle_violation Tiejun Chen
2014-10-24  7:34 ` [v7][RFC][PATCH 10/13] xen/x86/p2m: introduce set_identity_p2m_entry Tiejun Chen
2014-10-24  7:34 ` [v7][RFC][PATCH 11/13] xen:vtd: create RMRR mapping Tiejun Chen
2014-10-24  7:34 ` [v7][RFC][PATCH 12/13] xen/vtd: re-enable USB device assignment Tiejun Chen
2014-10-24  7:34 ` [v7][RFC][PATCH 13/13] xen/vtd: group assigned device with RMRR Tiejun Chen
2014-10-24 10:52 ` [v7][RFC][PATCH 01/13] xen: RMRR fix Jan Beulich
2014-10-27  2:00   ` Chen, Tiejun
2014-10-27  9:41     ` Jan Beulich
2014-10-28  8:36       ` Chen, Tiejun
2014-10-28  9:34         ` Jan Beulich
2014-10-28  9:39           ` Razvan Cojocaru
2014-10-29  0:51             ` Chen, Tiejun
2014-10-29  0:48           ` Chen, Tiejun
2014-10-29  2:51             ` Chen, Tiejun
2014-10-29  8:45               ` Jan Beulich
2014-10-30  8:21                 ` Chen, Tiejun
2014-10-30  9:07                   ` Jan Beulich
2014-10-31  3:11                     ` Chen, Tiejun
2014-10-29  8:44             ` Jan Beulich
2014-10-30  2:51               ` Chen, Tiejun
2014-10-30 22:15 ` Tim Deegan
2014-10-31  2:53   ` Chen, Tiejun
2014-10-31  9:10     ` Tim Deegan
