All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC][PATCH 00/13] Fix RMRR
@ 2015-04-10  9:21 Tiejun Chen
  2015-04-10  9:21 ` [RFC][PATCH 01/13] tools: introduce some new parameters to set rdm policy Tiejun Chen
                   ` (12 more replies)
  0 siblings, 13 replies; 125+ messages in thread
From: Tiejun Chen @ 2015-04-10  9:21 UTC (permalink / raw)
  To: JBeulich, tim, konrad.wilk, andrew.cooper3, kevin.tian,
	yang.z.zhang, ian.campbell, wei.liu2, Ian.Jackson,
	stefano.stabellini
  Cc: xen-devel

RMRR is an acronym for Reserved Memory Region Reporting, expected to
be used for legacy usages (such as USB, UMA Graphics, etc.) requiring
reserved memory. Special treatment is required in system software to
setup those reserved regions in IOMMU translation structures, otherwise
passing through a device with RMRR reported may not work correctly.

This patch set tries to enhance existing Xen RMRR implementation to fix
various reported and theoretical problems. Most noteworthy changes are
to setup identity mapping in p2m layer and handle possible conflicts between
reported regions and gfn space. Initial proposal can be found at:
    http://lists.xenproject.org/archives/html/xen-devel/2015-01/msg00524.html
and after a long discussion a summarized agreement is here:
    http://lists.xen.org/archives/html/xen-devel/2015-01/msg01580.html

Below is a key summary of this patch set according to agreed proposal:

1. Use RDM (Reserved Device Memory) name in user space as a general 
description instead of using ACPI RMRR name directly.

2. Introduce configuration parameters to allow user control both per-device 
and global RDM resources along with desired policies upon a detected conflict.

3. Introduce a new hypercall to query global and per-device RDM resources.

4. Extend libxl to be a central place to manage RDM resources and handle 
potential conflicts between reserved regions and gfn space. One simplification
goal is made to keep existing lowmem / mmio / highmem layout which is
passed around various function blocks. So a reasonable assumption
is made, that conflicts falling into below areas are not re-arranged otherwise
it will result in a more scattered layout:
    a) in highmem region (>4G)
    b) in lowmem region, and below a predefined boundary (default 2G)
  a) is a new assumption not discussed before. From VT-d spec this is 
possible but no such observation in real-world. So we can make this
reasonable assumption until there's real usage on it.

5. Extend XENMEM_set_memory_map usable for HVM guest, and then have
libxl to use that hypercall to carry RDM information to hvmloader. There
is one difference from original discussion. Previously we discussed to
introduce a new E820 type specifically for RDM entries. After more thought
we think it's OK to just tag them as E820_reserved. Actually hvmloader
doesn't need to know whether the reserved entries come from RDM or
from other purposes. 

6. Then in hvmloader the change is generic for XENMEM_memory_map
change. Given a predefined memory layout, hvmloader should avoid
allocating all reserved entries for other usages (opregion, mmio, etc.)

7. Extend existing device passthrough hypercall to carry conflict handling
policy.

8. Setup identity map in p2m layer for RMRRs reported for the given
device. And conflicts are handled according to specified policy in hypercall.

Current patch set contains core enhancements calling for comments.
There are still several tasks not implemented now. We'll include them
in final version after RFC is agreed:

- remove existing USB hack
- detect and fail assigning device which has a shared RMRR with another device
- add a config parameter to configure that memory boundary flexibly
- In the case of hotplug we also need to figure out a way to fix that policy
  conflict between the per-pci policy and the global policy but firstly we think
  we'd better collect some good or correct ideas to step next in RFC. 

So here I made this as RFC to collect your any comments.

----------------------------------------------------------------
Jan Beulich (1):
      introduce XENMEM_reserved_device_memory_map
 
Tiejun Chen (12):
      tools: introduce some new parameters to set rdm policy
      tools/libxc: Expose new hypercall xc_reserved_device_memory_map
      tools/libxl: detect and avoid conflicts with RDM
      xen/x86/p2m: introduce set_identity_p2m_entry
      xen:vtd: create RMRR mapping
      xen/passthrough: extend hypercall to support rdm reservation policy
      tools: extend xc_assign_device() to support rdm reservation policy
      xen: enable XENMEM_set_memory_map in hvm
      tools: extend XENMEM_set_memory_map
      hvmloader: get guest memory map into memory_map[]
      hvmloader/pci: skip reserved ranges
      hvmloader/e820: construct guest e820 table

 docs/man/xl.cfg.pod.5                       |  44 +++++
 docs/misc/vtd.txt                           |  34 ++++
 tools/firmware/hvmloader/e820.c             |  40 ++--
 tools/firmware/hvmloader/e820.h             |   7 +
 tools/firmware/hvmloader/hvmloader.c        |  36 ++++
 tools/firmware/hvmloader/pci.c              |  19 +-
 tools/firmware/hvmloader/util.c             |  25 +++
 tools/firmware/hvmloader/util.h             |  10 +
 tools/libxc/include/xenctrl.h               |  19 +-
 tools/libxc/include/xenguest.h              |   3 +-
 tools/libxc/xc_domain.c                     |  80 +++++++-
 tools/libxc/xc_hvm_build_x86.c              |  28 +--
 tools/libxl/libxl_create.c                  |   7 +-
 tools/libxl/libxl_dm.c                      | 195 ++++++++++++++++++++
 tools/libxl/libxl_dom.c                     |  99 +++++++++-
 tools/libxl/libxl_internal.h                |  11 +-
 tools/libxl/libxl_pci.c                     |  11 +-
 tools/libxl/libxl_types.idl                 |  25 +++
 tools/libxl/libxlu_pci.c                    |  78 ++++++++
 tools/libxl/libxlutil.h                     |   4 +
 tools/libxl/xl_cmdimpl.c                    |  44 ++++-
 tools/libxl/xl_cmdtable.c                   |   2 +-
 tools/ocaml/libs/xc/xenctrl_stubs.c         |  17 +-
 tools/python/xen/lowlevel/xc/xc.c           |  29 ++-
 xen/arch/x86/hvm/hvm.c                      |   2 -
 xen/arch/x86/mm.c                           |   6 -
 xen/arch/x86/mm/p2m.c                       |  30 +++
 xen/common/compat/memory.c                  |  66 +++++++
 xen/common/memory.c                         |  63 +++++++
 xen/drivers/passthrough/amd/pci_amd_iommu.c |   3 +-
 xen/drivers/passthrough/iommu.c             |  10 +
 xen/drivers/passthrough/pci.c               |  10 +-
 xen/drivers/passthrough/vtd/dmar.c          |  32 ++++
 xen/drivers/passthrough/vtd/extern.h        |   1 +
 xen/drivers/passthrough/vtd/iommu.c         |  37 +++-
 xen/include/asm-x86/p2m.h                   |   4 +
 xen/include/public/domctl.h                 |   4 +
 xen/include/public/memory.h                 |  32 +++-
 xen/include/xen/iommu.h                     |   6 +-
 xen/include/xen/pci.h                       |   2 +
 xen/include/xlat.lst                        |   3 +-
 41 files changed, 1101 insertions(+), 77 deletions(-)

Thanks
Tiejun

^ permalink raw reply	[flat|nested] 125+ messages in thread

* [RFC][PATCH 01/13] tools: introduce some new parameters to set rdm policy
  2015-04-10  9:21 [RFC][PATCH 00/13] Fix RMRR Tiejun Chen
@ 2015-04-10  9:21 ` Tiejun Chen
  2015-05-08 13:04   ` Wei Liu
  2015-04-10  9:21 ` [RFC][PATCH 02/13] introduce XENMEM_reserved_device_memory_map Tiejun Chen
                   ` (11 subsequent siblings)
  12 siblings, 1 reply; 125+ messages in thread
From: Tiejun Chen @ 2015-04-10  9:21 UTC (permalink / raw)
  To: JBeulich, tim, konrad.wilk, andrew.cooper3, kevin.tian,
	yang.z.zhang, ian.campbell, wei.liu2, Ian.Jackson,
	stefano.stabellini
  Cc: xen-devel

This patch introduces user configurable parameters to specify RDM
resource and according policies,

Global RDM parameter:
    rdm = [ 'host, reserve=force/try' ]
Per-device RDM parameter:
    pci = [ 'sbdf, rdm_reserve=force/try' ]

Global RDM parameter allows user to specify reserved regions explicitly,
e.g. using 'host' to include all reserved regions reported on this platform
which is good to handle hotplug scenario. In the future this parameter
may be further extended to allow specifying random regions, e.g. even
those belonging to another platform as a preparation for live migration
with passthrough devices.

'force/try' policy decides how to handle conflict when reserving RDM
regions in pfn space. If conflict exists, 'force' means an immediate error
so VM will be killed, while 'try' allows moving forward with a warning
message thrown out.

Default per-device RDM policy is 'force', while default global RDM policy
is 'try'. When both policies are specified on a given region, 'force' is
always preferred.

Signed-off-by: Tiejun Chen <tiejun.chen@intel.com>
---
 docs/man/xl.cfg.pod.5       | 44 +++++++++++++++++++++++++
 docs/misc/vtd.txt           | 34 ++++++++++++++++++++
 tools/libxl/libxl_create.c  |  5 +++
 tools/libxl/libxl_types.idl | 18 +++++++++++
 tools/libxl/libxlu_pci.c    | 78 +++++++++++++++++++++++++++++++++++++++++++++
 tools/libxl/libxlutil.h     |  4 +++
 tools/libxl/xl_cmdimpl.c    | 21 +++++++++++-
 7 files changed, 203 insertions(+), 1 deletion(-)

diff --git a/docs/man/xl.cfg.pod.5 b/docs/man/xl.cfg.pod.5
index 408653f..9ed3055 100644
--- a/docs/man/xl.cfg.pod.5
+++ b/docs/man/xl.cfg.pod.5
@@ -583,6 +583,36 @@ assigned slave device.
 
 =back
 
+=item B<rdm=[ "TYPE", "RDM_RESERVE_STRING", ... ]>
+
+(HVM/x86 only) Specifies the information about Reserved Device Memory (RDM),
+which is necessary to enable robust device passthrough usage. One example of
+RDM is reported through ACPI Reserved Memory Region Reporting (RMRR)
+structure on x86 platform.
+Each B<RDM_CHECK_STRING> has the form C<["TYPE",KEY=VALUE,KEY=VALUE,...> where:
+
+=over 4
+
+=item B<"TYPE">
+
+Currently we just have one type. 'host' means all reserved device memory on
+this platform should be reserved in this VM's pfn space.
+
+=item B<KEY=VALUE>
+
+Possible B<KEY>s are:
+
+=over 4
+
+=item B<reserve="STRING">
+
+Conflict may be detected when reserving reserved device memory in gfn space.
+'force' means an unsolved conflict leads to immediate VM destroy, while
+'try' allows VM moving forward with a warning message thrown out. 'try'
+is default.
+
+Note this may be overrided by another sub item, rdm_reserve, in pci device.
+
 =item B<pci=[ "PCI_SPEC_STRING", "PCI_SPEC_STRING", ... ]>
 
 Specifies the host PCI devices to passthrough to this guest. Each B<PCI_SPEC_STRING>
@@ -645,6 +675,20 @@ dom0 without confirmation.  Please use with care.
 D0-D3hot power management states for the PCI device. False (0) by
 default.
 
+=item B<rdm_check="STRING">
+
+(HVM/x86 only) Specifies the information about Reserved Device Memory (RDM),
+which is necessary to enable robust device passthrough usage. One example of
+RDM is reported through ACPI Reserved Memory Region Reporting (RMRR)
+structure on x86 platform.
+
+Conflict may be detected when reserving reserved device memory in gfn space.
+'force' means an unsolved conflict leads to immediate VM destroy, while
+'try' allows VM moving forward with a warning message thrown out. 'force'
+is default.
+
+Note this would override another global item, rdm = [''].
+
 =back
 
 =back
diff --git a/docs/misc/vtd.txt b/docs/misc/vtd.txt
index 9af0e99..d7434d6 100644
--- a/docs/misc/vtd.txt
+++ b/docs/misc/vtd.txt
@@ -111,6 +111,40 @@ in the config file:
 To override for a specific device:
 	pci = [ '01:00.0,msitranslate=0', '03:00.0' ]
 
+RDM, 'reserved device memory', for PCI Device Passthrough
+---------------------------------------------------------
+
+There are some devices the BIOS controls, for e.g. USB devices to perform
+PS2 emulation. The regions of memory used for these devices are marked
+reserved in the e820 map. When we turn on DMA translation, DMA to those
+regions will fail. Hence BIOS uses RMRR to specify these regions along with
+devices that need to access these regions. OS is expected to setup
+identity mappings for these regions for these devices to access these regions.
+
+While creating a VM we should reserve them in advance, and avoid any conflicts.
+So we introduce user configurable parameters to specify RDM resource and
+according policies,
+
+To enable this globally, add "rdm" in the config file:
+
+    rdm = [ 'host, reserve=force/try' ]
+
+Or just for a specific device:
+
+	pci = [ '01:00.0,rdm_reserve=force/try', '03:00.0' ]
+
+Global RDM parameter allows user to specify reserved regions explicitly.
+Using 'host' to include all reserved regions reported on this platform
+which is good to handle hotplug scenario. In the future this parameter
+may be further extended to allow specifying random regions, e.g. even
+those belonging to another platform as a preparation for live migration
+with passthrough devices.
+
+'force/try' policy decides how to handle conflict when reserving RDM
+regions in pfn space. If conflict exists, 'force' means an immediate error
+so VM will be killed, while 'try' allows moving forward with a warning
+message thrown out.
+
 
 Caveat on Conventional PCI Device Passthrough
 ---------------------------------------------
diff --git a/tools/libxl/libxl_create.c b/tools/libxl/libxl_create.c
index 98687bd..9ed40d4 100644
--- a/tools/libxl/libxl_create.c
+++ b/tools/libxl/libxl_create.c
@@ -1407,6 +1407,11 @@ static void domcreate_attach_pci(libxl__egc *egc, libxl__multidev *multidev,
     }
 
     for (i = 0; i < d_config->num_pcidevs; i++) {
+        /*
+         * If the rdm global policy is 'force' we should override each device.
+         */
+        if (d_config->b_info.rdm.reserve == LIBXL_RDM_RESERVE_FLAG_FORCE)
+            d_config->pcidevs[i].rdm_reserve = LIBXL_RDM_RESERVE_FLAG_FORCE;
         ret = libxl__device_pci_add(gc, domid, &d_config->pcidevs[i], 1);
         if (ret < 0) {
             LIBXL__LOG(ctx, LIBXL__LOG_ERROR,
diff --git a/tools/libxl/libxl_types.idl b/tools/libxl/libxl_types.idl
index 47af340..5786455 100644
--- a/tools/libxl/libxl_types.idl
+++ b/tools/libxl/libxl_types.idl
@@ -71,6 +71,17 @@ libxl_domain_type = Enumeration("domain_type", [
     (2, "PV"),
     ], init_val = "LIBXL_DOMAIN_TYPE_INVALID")
 
+libxl_rdm_reserve_type = Enumeration("rdm_reserve_type", [
+    (0, "none"),
+    (1, "host"),
+    ])
+
+libxl_rdm_reserve_flag = Enumeration("rdm_reserve_flag", [
+    (-1, "invalid"),
+    (0, "force"),
+    (1, "try"),
+    ])
+
 libxl_channel_connection = Enumeration("channel_connection", [
     (0, "UNKNOWN"),
     (1, "PTY"),
@@ -356,6 +367,11 @@ libxl_domain_sched_params = Struct("domain_sched_params",[
     ("budget",       integer, {'init_val': 'LIBXL_DOMAIN_SCHED_PARAM_BUDGET_DEFAULT'}),
     ])
 
+libxl_rdm_reserve = Struct("rdm_reserve", [
+    ("type",    libxl_rdm_reserve_type),
+    ("reserve",   libxl_rdm_reserve_flag),
+    ])
+
 libxl_domain_build_info = Struct("domain_build_info",[
     ("max_vcpus",       integer),
     ("avail_vcpus",     libxl_bitmap),
@@ -401,6 +417,7 @@ libxl_domain_build_info = Struct("domain_build_info",[
     ("kernel",           string),
     ("cmdline",          string),
     ("ramdisk",          string),
+    ("rdm",     libxl_rdm_reserve),
     ("u", KeyedUnion(None, libxl_domain_type, "type",
                 [("hvm", Struct(None, [("firmware",         string),
                                        ("bios",             libxl_bios_type),
@@ -521,6 +538,7 @@ libxl_device_pci = Struct("device_pci", [
     ("power_mgmt", bool),
     ("permissive", bool),
     ("seize", bool),
+    ("rdm_reserve",   libxl_rdm_reserve_flag),
     ])
 
 libxl_device_vtpm = Struct("device_vtpm", [
diff --git a/tools/libxl/libxlu_pci.c b/tools/libxl/libxlu_pci.c
index 26fb143..45be0d9 100644
--- a/tools/libxl/libxlu_pci.c
+++ b/tools/libxl/libxlu_pci.c
@@ -42,6 +42,8 @@ static int pcidev_struct_fill(libxl_device_pci *pcidev, unsigned int domain,
 #define STATE_OPTIONS_K 6
 #define STATE_OPTIONS_V 7
 #define STATE_TERMINAL  8
+#define STATE_TYPE      9
+#define STATE_CHECK_FLAG      10
 int xlu_pci_parse_bdf(XLU_Config *cfg, libxl_device_pci *pcidev, const char *str)
 {
     unsigned state = STATE_DOMAIN;
@@ -143,6 +145,17 @@ int xlu_pci_parse_bdf(XLU_Config *cfg, libxl_device_pci *pcidev, const char *str
                     pcidev->permissive = atoi(tok);
                 }else if ( !strcmp(optkey, "seize") ) {
                     pcidev->seize = atoi(tok);
+                }else if ( !strcmp(optkey, "rdm_reserve") ) {
+                    if ( !strcmp(tok, "force") ) {
+                        pcidev->rdm_reserve = LIBXL_RDM_RESERVE_FLAG_FORCE;
+                    } else if ( !strcmp(tok, "try") ) {
+                        pcidev->rdm_reserve = LIBXL_RDM_RESERVE_FLAG_TRY;
+                    } else {
+                        pcidev->rdm_reserve = LIBXL_RDM_RESERVE_FLAG_FORCE;
+                        XLU__PCI_ERR(cfg, "Unknown PCI RDM property flag value:"
+                                          " %s, so goes 'force' by default.",
+                                     tok);
+                    }
                 }else{
                     XLU__PCI_ERR(cfg, "Unknown PCI BDF option: %s", optkey);
                 }
@@ -167,6 +180,71 @@ parse_error:
     return ERROR_INVAL;
 }
 
+int xlu_rdm_parse(XLU_Config *cfg, libxl_rdm_reserve *rdm, const char *str)
+{
+    unsigned state = STATE_TYPE;
+    char *buf2, *tok, *ptr, *end;
+
+    if ( NULL == (buf2 = ptr = strdup(str)) )
+        return ERROR_NOMEM;
+
+    for(tok = ptr, end = ptr + strlen(ptr) + 1; ptr < end; ptr++) {
+        switch(state) {
+        case STATE_TYPE:
+            if ( *ptr == '\0' || *ptr == ',' ) {
+                state = STATE_CHECK_FLAG;
+                *ptr = '\0';
+                if ( !strcmp(tok, "host") ) {
+                    rdm->type = LIBXL_RDM_RESERVE_TYPE_HOST;
+                } else {
+                    XLU__PCI_ERR(cfg, "Unknown RDM state option: %s", tok);
+                    goto parse_error;
+                }
+                tok = ptr + 1;
+            }
+            break;
+        case STATE_CHECK_FLAG:
+            if ( *ptr == '=' ) {
+                state = STATE_OPTIONS_V;
+                *ptr = '\0';
+                if ( strcmp(tok, "reserve") ) {
+                    XLU__PCI_ERR(cfg, "Unknown RDM property value: %s", tok);
+                    goto parse_error;
+                }
+                tok = ptr + 1;
+            }
+            break;
+        case STATE_OPTIONS_V:
+            if ( *ptr == ',' || *ptr == '\0' ) {
+                state = STATE_TERMINAL;
+                *ptr = '\0';
+                if ( !strcmp(tok, "force") ) {
+                    rdm->reserve = LIBXL_RDM_RESERVE_FLAG_FORCE;
+                } else if ( !strcmp(tok, "try") ) {
+                    rdm->reserve = LIBXL_RDM_RESERVE_FLAG_TRY;
+                } else {
+                    XLU__PCI_ERR(cfg, "Unknown RDM property flag value: %s",
+                                 tok);
+                    goto parse_error;
+                }
+                tok = ptr + 1;
+            }
+        default:
+            break;
+        }
+    }
+
+    free(buf2);
+
+    if ( tok != ptr || state != STATE_TERMINAL )
+        goto parse_error;
+
+    return 0;
+
+parse_error:
+    return ERROR_INVAL;
+}
+
 /*
  * Local variables:
  * mode: C
diff --git a/tools/libxl/libxlutil.h b/tools/libxl/libxlutil.h
index 0333e55..80497f8 100644
--- a/tools/libxl/libxlutil.h
+++ b/tools/libxl/libxlutil.h
@@ -93,6 +93,10 @@ int xlu_disk_parse(XLU_Config *cfg, int nspecs, const char *const *specs,
  */
 int xlu_pci_parse_bdf(XLU_Config *cfg, libxl_device_pci *pcidev, const char *str);
 
+/*
+ * RDM parsing
+ */
+int xlu_rdm_parse(XLU_Config *cfg, libxl_rdm_reserve *rdm, const char *str);
 
 /*
  * Vif rate parsing.
diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c
index 04faf98..9a58464 100644
--- a/tools/libxl/xl_cmdimpl.c
+++ b/tools/libxl/xl_cmdimpl.c
@@ -988,7 +988,7 @@ static void parse_config_data(const char *config_source,
     const char *buf;
     long l;
     XLU_Config *config;
-    XLU_ConfigList *cpus, *vbds, *nics, *pcis, *cvfbs, *cpuids, *vtpms;
+    XLU_ConfigList *cpus, *vbds, *nics, *pcis, *cvfbs, *cpuids, *vtpms, *rdms;
     XLU_ConfigList *channels, *ioports, *irqs, *iomem, *viridian;
     int num_ioports, num_irqs, num_iomem, num_cpus, num_viridian;
     int pci_power_mgmt = 0;
@@ -1727,6 +1727,23 @@ skip_vfb:
         xlu_cfg_get_defbool(config, "e820_host", &b_info->u.pv.e820_host, 0);
     }
 
+    /*
+     * By default our global policy is to query all rdm entries, and
+     * force reserve them.
+     */
+    b_info->rdm.type = LIBXL_RDM_RESERVE_TYPE_HOST;
+    b_info->rdm.reserve = LIBXL_RDM_RESERVE_FLAG_TRY;
+    if (!xlu_cfg_get_list (config, "rdm", &rdms, 0, 0)) {
+        if ((buf = xlu_cfg_get_listitem (rdms, 0)) != NULL ) {
+            libxl_rdm_reserve rdm;
+            if (!xlu_rdm_parse(config, &rdm, buf))
+            {
+                b_info->rdm.type = rdm.type;
+                b_info->rdm.reserve = rdm.reserve;
+            }
+        }
+    }
+
     if (!xlu_cfg_get_list (config, "pci", &pcis, 0, 0)) {
         d_config->num_pcidevs = 0;
         d_config->pcidevs = NULL;
@@ -1741,6 +1758,8 @@ skip_vfb:
             pcidev->power_mgmt = pci_power_mgmt;
             pcidev->permissive = pci_permissive;
             pcidev->seize = pci_seize;
+            /* We'd like to force reserve rdm specific to a device by default.*/
+            pcidev->rdm_reserve = LIBXL_RDM_RESERVE_FLAG_FORCE;
             if (!xlu_pci_parse_bdf(config, pcidev, buf))
                 d_config->num_pcidevs++;
         }
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [RFC][PATCH 02/13] introduce XENMEM_reserved_device_memory_map
  2015-04-10  9:21 [RFC][PATCH 00/13] Fix RMRR Tiejun Chen
  2015-04-10  9:21 ` [RFC][PATCH 01/13] tools: introduce some new parameters to set rdm policy Tiejun Chen
@ 2015-04-10  9:21 ` Tiejun Chen
  2015-04-16 14:59   ` Tim Deegan
  2015-04-10  9:21 ` [RFC][PATCH 03/13] tools/libxc: Expose new hypercall xc_reserved_device_memory_map Tiejun Chen
                   ` (10 subsequent siblings)
  12 siblings, 1 reply; 125+ messages in thread
From: Tiejun Chen @ 2015-04-10  9:21 UTC (permalink / raw)
  To: JBeulich, tim, konrad.wilk, andrew.cooper3, kevin.tian,
	yang.z.zhang, ian.campbell, wei.liu2, Ian.Jackson,
	stefano.stabellini
  Cc: xen-devel

From: Jan Beulich <jbeulich@suse.com>

This is a prerequisite for punching holes into HVM and PVH guests' P2M
to allow passing through devices that are associated with (on VT-d)
RMRRs.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Tiejun Chen <tiejun.chen@intel.com>
---
 xen/common/compat/memory.c           | 66 ++++++++++++++++++++++++++++++++++++
 xen/common/memory.c                  | 63 ++++++++++++++++++++++++++++++++++
 xen/drivers/passthrough/iommu.c      | 10 ++++++
 xen/drivers/passthrough/vtd/dmar.c   | 32 +++++++++++++++++
 xen/drivers/passthrough/vtd/extern.h |  1 +
 xen/drivers/passthrough/vtd/iommu.c  |  1 +
 xen/include/public/memory.h          | 32 ++++++++++++++++-
 xen/include/xen/iommu.h              |  4 +++
 xen/include/xen/pci.h                |  2 ++
 xen/include/xlat.lst                 |  3 +-
 10 files changed, 212 insertions(+), 2 deletions(-)

diff --git a/xen/common/compat/memory.c b/xen/common/compat/memory.c
index b258138..3704147 100644
--- a/xen/common/compat/memory.c
+++ b/xen/common/compat/memory.c
@@ -17,6 +17,45 @@ CHECK_TYPE(domid);
 CHECK_mem_access_op;
 CHECK_vmemrange;
 
+#ifdef HAS_PASSTHROUGH
+struct get_reserved_device_memory {
+    struct compat_reserved_device_memory_map map;
+    unsigned int used_entries;
+};
+
+static int get_reserved_device_memory(xen_pfn_t start, xen_ulong_t nr,
+                                      u32 id, void *ctxt)
+{
+    struct get_reserved_device_memory *grdm = ctxt;
+    u32 sbdf;
+    struct compat_reserved_device_memory rdm = {
+        .start_pfn = start, .nr_pages = nr
+    };
+
+    if ( rdm.start_pfn != start || rdm.nr_pages != nr )
+        return -ERANGE;
+
+    sbdf = PCI_SBDF2(grdm->map.seg, grdm->map.bus, grdm->map.devfn);
+    if ( (grdm->map.flag & PCI_DEV_RDM_ALL) || (sbdf == id) )
+    {
+        if ( grdm->used_entries < grdm->map.nr_entries )
+        {
+            if ( __copy_to_compat_offset(grdm->map.buffer,
+                                         grdm->used_entries,
+                                         &rdm,
+                                         1) )
+            {
+                return -EFAULT;
+            }
+        }
+        ++grdm->used_entries;
+        return 1;
+    }
+
+    return 0;
+}
+#endif
+
 int compat_memory_op(unsigned int cmd, XEN_GUEST_HANDLE_PARAM(void) compat)
 {
     int split, op = cmd & MEMOP_CMD_MASK;
@@ -303,6 +342,33 @@ int compat_memory_op(unsigned int cmd, XEN_GUEST_HANDLE_PARAM(void) compat)
             break;
         }
 
+#ifdef HAS_PASSTHROUGH
+        case XENMEM_reserved_device_memory_map:
+        {
+            struct get_reserved_device_memory grdm;
+
+            if ( copy_from_guest(&grdm.map, compat, 1) ||
+                 !compat_handle_okay(grdm.map.buffer, grdm.map.nr_entries) )
+                return -EFAULT;
+
+            grdm.used_entries = 0;
+            rc = iommu_get_reserved_device_memory(get_reserved_device_memory,
+                                                  &grdm);
+
+            if ( !rc && grdm.map.nr_entries < grdm.used_entries )
+                rc = -ENOBUFS;
+
+            grdm.map.nr_entries = grdm.used_entries;
+            if ( grdm.map.nr_entries )
+            {
+                if ( __copy_to_guest(compat, &grdm.map, 1) )
+                    rc = -EFAULT;
+            }
+
+            return rc;
+        }
+#endif
+
         default:
             return compat_arch_memory_op(cmd, compat);
         }
diff --git a/xen/common/memory.c b/xen/common/memory.c
index 063a1c5..1faef43 100644
--- a/xen/common/memory.c
+++ b/xen/common/memory.c
@@ -748,6 +748,42 @@ static int construct_memop_from_reservation(
     return 0;
 }
 
+#ifdef HAS_PASSTHROUGH
+struct get_reserved_device_memory {
+    struct xen_reserved_device_memory_map map;
+    unsigned int used_entries;
+};
+
+static int get_reserved_device_memory(xen_pfn_t start, xen_ulong_t nr,
+                                      u32 id, void *ctxt)
+{
+    struct get_reserved_device_memory *grdm = ctxt;
+    u32 sbdf;
+    struct xen_reserved_device_memory rdm = {
+        .start_pfn = start, .nr_pages = nr
+    };
+
+    sbdf = PCI_SBDF2(grdm->map.seg, grdm->map.bus, grdm->map.devfn);
+    if ( (grdm->map.flag & PCI_DEV_RDM_ALL) || (sbdf == id) )
+    {
+        if ( grdm->used_entries < grdm->map.nr_entries )
+        {
+            if ( __copy_to_guest_offset(grdm->map.buffer,
+                                        grdm->used_entries,
+                                        &rdm,
+                                        1) )
+            {
+                return -EFAULT;
+            }
+        }
+        ++grdm->used_entries;
+        return 1;
+    }
+
+    return 0;
+}
+#endif
+
 long do_memory_op(unsigned long cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
 {
     struct domain *d;
@@ -1162,6 +1198,33 @@ long do_memory_op(unsigned long cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
         break;
     }
 
+#ifdef HAS_PASSTHROUGH
+    case XENMEM_reserved_device_memory_map:
+    {
+        struct get_reserved_device_memory grdm;
+
+        if ( copy_from_guest(&grdm.map, arg, 1) ||
+             !guest_handle_okay(grdm.map.buffer, grdm.map.nr_entries) )
+            return -EFAULT;
+
+        grdm.used_entries = 0;
+        rc = iommu_get_reserved_device_memory(get_reserved_device_memory,
+                                              &grdm);
+
+        if ( !rc && grdm.map.nr_entries < grdm.used_entries )
+            rc = -ENOBUFS;
+
+        grdm.map.nr_entries = grdm.used_entries;
+        if ( grdm.map.nr_entries )
+        {
+            if ( __copy_to_guest(arg, &grdm.map, 1) )
+                rc = -EFAULT;
+        }
+
+        break;
+    }
+#endif
+
     default:
         rc = arch_memory_op(cmd, arg);
         break;
diff --git a/xen/drivers/passthrough/iommu.c b/xen/drivers/passthrough/iommu.c
index 92ea26f..c7aec87 100644
--- a/xen/drivers/passthrough/iommu.c
+++ b/xen/drivers/passthrough/iommu.c
@@ -344,6 +344,16 @@ void iommu_crash_shutdown(void)
     iommu_enabled = iommu_intremap = 0;
 }
 
+int iommu_get_reserved_device_memory(iommu_grdm_t *func, void *ctxt)
+{
+    const struct iommu_ops *ops = iommu_get_ops();
+
+    if ( !iommu_enabled || !ops->get_reserved_device_memory )
+        return 0;
+
+    return ops->get_reserved_device_memory(func, ctxt);
+}
+
 bool_t iommu_has_feature(struct domain *d, enum iommu_feature feature)
 {
     const struct hvm_iommu *hd = domain_hvm_iommu(d);
diff --git a/xen/drivers/passthrough/vtd/dmar.c b/xen/drivers/passthrough/vtd/dmar.c
index 1152c3a..c6b4146 100644
--- a/xen/drivers/passthrough/vtd/dmar.c
+++ b/xen/drivers/passthrough/vtd/dmar.c
@@ -893,3 +893,35 @@ int platform_supports_x2apic(void)
     unsigned int mask = ACPI_DMAR_INTR_REMAP | ACPI_DMAR_X2APIC_OPT_OUT;
     return cpu_has_x2apic && ((dmar_flags & mask) == ACPI_DMAR_INTR_REMAP);
 }
+
+int intel_iommu_get_reserved_device_memory(iommu_grdm_t *func, void *ctxt)
+{
+    struct acpi_rmrr_unit *rmrr, *rmrr_cur = NULL;
+    int rc = 0;
+    unsigned int i;
+    u16 bdf;
+
+    for_each_rmrr_device ( rmrr, bdf, i )
+    {
+        if ( rmrr != rmrr_cur )
+        {
+            rc = func(PFN_DOWN(rmrr->base_address),
+                      PFN_UP(rmrr->end_address) -
+                        PFN_DOWN(rmrr->base_address),
+                      PCI_SBDF(rmrr->segment, bdf),
+                      ctxt);
+
+            if ( unlikely(rc < 0) )
+                return rc;
+
+            if ( !rc )
+                continue;
+
+            /* Just go next. */
+            if ( rc == 1 )
+                rmrr_cur = rmrr;
+        }
+    }
+
+    return 0;
+}
diff --git a/xen/drivers/passthrough/vtd/extern.h b/xen/drivers/passthrough/vtd/extern.h
index 5524dba..f9ee9b0 100644
--- a/xen/drivers/passthrough/vtd/extern.h
+++ b/xen/drivers/passthrough/vtd/extern.h
@@ -75,6 +75,7 @@ int domain_context_mapping_one(struct domain *domain, struct iommu *iommu,
                                u8 bus, u8 devfn, const struct pci_dev *);
 int domain_context_unmap_one(struct domain *domain, struct iommu *iommu,
                              u8 bus, u8 devfn);
+int intel_iommu_get_reserved_device_memory(iommu_grdm_t *func, void *ctxt);
 
 unsigned int io_apic_read_remap_rte(unsigned int apic, unsigned int reg);
 void io_apic_write_remap_rte(unsigned int apic,
diff --git a/xen/drivers/passthrough/vtd/iommu.c b/xen/drivers/passthrough/vtd/iommu.c
index 891b9e3..4e789d1 100644
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -2477,6 +2477,7 @@ const struct iommu_ops intel_iommu_ops = {
     .crash_shutdown = vtd_crash_shutdown,
     .iotlb_flush = intel_iommu_iotlb_flush,
     .iotlb_flush_all = intel_iommu_iotlb_flush_all,
+    .get_reserved_device_memory = intel_iommu_get_reserved_device_memory,
     .dump_p2m_table = vtd_dump_p2m_table,
 };
 
diff --git a/xen/include/public/memory.h b/xen/include/public/memory.h
index 2b5206b..36e5f54 100644
--- a/xen/include/public/memory.h
+++ b/xen/include/public/memory.h
@@ -574,7 +574,37 @@ struct xen_vnuma_topology_info {
 typedef struct xen_vnuma_topology_info xen_vnuma_topology_info_t;
 DEFINE_XEN_GUEST_HANDLE(xen_vnuma_topology_info_t);
 
-/* Next available subop number is 27 */
+/*
+ * For legacy reasons, some devices must be configured with special memory
+ * regions to function correctly.  The guest would take these regions
+ * according to different user policies.
+ */
+#define XENMEM_reserved_device_memory_map   27
+struct xen_reserved_device_memory {
+    xen_pfn_t start_pfn;
+    xen_ulong_t nr_pages;
+};
+typedef struct xen_reserved_device_memory xen_reserved_device_memory_t;
+DEFINE_XEN_GUEST_HANDLE(xen_reserved_device_memory_t);
+
+struct xen_reserved_device_memory_map {
+    /* IN */
+    /* Currently just one bit to indicate checkng all Reserved Device Memory. */
+#define PCI_DEV_RDM_ALL   0x1
+    uint32_t        flag;
+    /* IN */
+    uint16_t        seg;
+    uint8_t         bus;
+    uint8_t         devfn;
+    /* IN/OUT */
+    unsigned int    nr_entries;
+    /* OUT */
+    XEN_GUEST_HANDLE(xen_reserved_device_memory_t) buffer;
+};
+typedef struct xen_reserved_device_memory_map xen_reserved_device_memory_map_t;
+DEFINE_XEN_GUEST_HANDLE(xen_reserved_device_memory_map_t);
+
+/* Next available subop number is 28 */
 
 #endif /* __XEN_PUBLIC_MEMORY_H__ */
 
diff --git a/xen/include/xen/iommu.h b/xen/include/xen/iommu.h
index bf4aff0..8565b82 100644
--- a/xen/include/xen/iommu.h
+++ b/xen/include/xen/iommu.h
@@ -121,6 +121,8 @@ void iommu_dt_domain_destroy(struct domain *d);
 
 struct page_info;
 
+typedef int iommu_grdm_t(xen_pfn_t start, xen_ulong_t nr, u32 id, void *ctxt);
+
 struct iommu_ops {
     int (*init)(struct domain *d);
     void (*hwdom_init)(struct domain *d);
@@ -152,12 +154,14 @@ struct iommu_ops {
     void (*crash_shutdown)(void);
     void (*iotlb_flush)(struct domain *d, unsigned long gfn, unsigned int page_count);
     void (*iotlb_flush_all)(struct domain *d);
+    int (*get_reserved_device_memory)(iommu_grdm_t *, void *);
     void (*dump_p2m_table)(struct domain *d);
 };
 
 void iommu_suspend(void);
 void iommu_resume(void);
 void iommu_crash_shutdown(void);
+int iommu_get_reserved_device_memory(iommu_grdm_t *, void *);
 
 void iommu_share_p2m_table(struct domain *d);
 
diff --git a/xen/include/xen/pci.h b/xen/include/xen/pci.h
index 3988ee68..a27417b 100644
--- a/xen/include/xen/pci.h
+++ b/xen/include/xen/pci.h
@@ -32,6 +32,8 @@
 #define PCI_DEVFN2(bdf) ((bdf) & 0xff)
 #define PCI_BDF(b,d,f)  ((((b) & 0xff) << 8) | PCI_DEVFN(d,f))
 #define PCI_BDF2(b,df)  ((((b) & 0xff) << 8) | ((df) & 0xff))
+#define PCI_SBDF(s,bdf) (((s & 0xffff) << 16) | (bdf & 0xffff))
+#define PCI_SBDF2(s,b,df) (((s & 0xffff) << 16) | PCI_BDF2(b,df))
 
 struct pci_dev_info {
     bool_t is_extfn;
diff --git a/xen/include/xlat.lst b/xen/include/xlat.lst
index 9c9fd9a..dd23559 100644
--- a/xen/include/xlat.lst
+++ b/xen/include/xlat.lst
@@ -61,9 +61,10 @@
 !	memory_exchange			memory.h
 !	memory_map			memory.h
 !	memory_reservation		memory.h
-?	mem_access_op		memory.h
+?	mem_access_op			memory.h
 !	pod_target			memory.h
 !	remove_from_physmap		memory.h
+!	reserved_device_memory_map	memory.h
 ?	vmemrange			memory.h
 !	vnuma_topology_info		memory.h
 ?	physdev_eoi			physdev.h
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [RFC][PATCH 03/13] tools/libxc: Expose new hypercall xc_reserved_device_memory_map
  2015-04-10  9:21 [RFC][PATCH 00/13] Fix RMRR Tiejun Chen
  2015-04-10  9:21 ` [RFC][PATCH 01/13] tools: introduce some new parameters to set rdm policy Tiejun Chen
  2015-04-10  9:21 ` [RFC][PATCH 02/13] introduce XENMEM_reserved_device_memory_map Tiejun Chen
@ 2015-04-10  9:21 ` Tiejun Chen
  2015-05-08 13:07   ` Wei Liu
  2015-04-10  9:21 ` [RFC][PATCH 04/13] tools/libxl: detect and avoid conflicts with RDM Tiejun Chen
                   ` (9 subsequent siblings)
  12 siblings, 1 reply; 125+ messages in thread
From: Tiejun Chen @ 2015-04-10  9:21 UTC (permalink / raw)
  To: JBeulich, tim, konrad.wilk, andrew.cooper3, kevin.tian,
	yang.z.zhang, ian.campbell, wei.liu2, Ian.Jackson,
	stefano.stabellini
  Cc: xen-devel

We will introduce the hypercall xc_reserved_device_memory_map
approach to libxc. This helps us get rdm entry info according to
different parameters. If flag == PCI_DEV_RDM_ALL, all entries
should be exposed. Or we just expose that rdm entry specific to
a SBDF.

Signed-off-by: Tiejun Chen <tiejun.chen@intel.com>
---
 tools/libxc/include/xenctrl.h |  8 ++++++++
 tools/libxc/xc_domain.c       | 36 ++++++++++++++++++++++++++++++++++++
 2 files changed, 44 insertions(+)

diff --git a/tools/libxc/include/xenctrl.h b/tools/libxc/include/xenctrl.h
index df18292..59bbe06 100644
--- a/tools/libxc/include/xenctrl.h
+++ b/tools/libxc/include/xenctrl.h
@@ -1294,6 +1294,14 @@ int xc_domain_set_memory_map(xc_interface *xch,
 int xc_get_machine_memory_map(xc_interface *xch,
                               struct e820entry entries[],
                               uint32_t max_entries);
+
+int xc_reserved_device_memory_map(xc_interface *xch,
+                                  uint32_t flag,
+                                  uint16_t seg,
+                                  uint8_t bus,
+                                  uint8_t devfn,
+                                  struct xen_reserved_device_memory entries[],
+                                  uint32_t *max_entries);
 #endif
 int xc_domain_set_time_offset(xc_interface *xch,
                               uint32_t domid,
diff --git a/tools/libxc/xc_domain.c b/tools/libxc/xc_domain.c
index 845d1d7..4f8383e 100644
--- a/tools/libxc/xc_domain.c
+++ b/tools/libxc/xc_domain.c
@@ -675,6 +675,42 @@ int xc_domain_set_memory_map(xc_interface *xch,
 
     return rc;
 }
+
+int xc_reserved_device_memory_map(xc_interface *xch,
+                                  uint32_t flag,
+                                  uint16_t seg,
+                                  uint8_t bus,
+                                  uint8_t devfn,
+                                  struct xen_reserved_device_memory entries[],
+                                  uint32_t *max_entries)
+{
+    int rc;
+    struct xen_reserved_device_memory_map xrdmmap = {
+        .flag = flag,
+        .seg = seg,
+        .bus = bus,
+        .devfn = devfn,
+        .nr_entries = *max_entries
+    };
+    DECLARE_HYPERCALL_BOUNCE(entries,
+                             sizeof(struct xen_reserved_device_memory) *
+                             *max_entries, XC_HYPERCALL_BUFFER_BOUNCE_OUT);
+
+    if ( xc_hypercall_bounce_pre(xch, entries) )
+        return -1;
+
+    set_xen_guest_handle(xrdmmap.buffer, entries);
+
+    rc = do_memory_op(xch, XENMEM_reserved_device_memory_map,
+                      &xrdmmap, sizeof(xrdmmap));
+
+    xc_hypercall_bounce_post(xch, entries);
+
+    *max_entries = xrdmmap.nr_entries;
+
+    return rc;
+}
+
 int xc_get_machine_memory_map(xc_interface *xch,
                               struct e820entry entries[],
                               uint32_t max_entries)
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [RFC][PATCH 04/13] tools/libxl: detect and avoid conflicts with RDM
  2015-04-10  9:21 [RFC][PATCH 00/13] Fix RMRR Tiejun Chen
                   ` (2 preceding siblings ...)
  2015-04-10  9:21 ` [RFC][PATCH 03/13] tools/libxc: Expose new hypercall xc_reserved_device_memory_map Tiejun Chen
@ 2015-04-10  9:21 ` Tiejun Chen
  2015-04-15 13:10   ` Ian Jackson
                     ` (2 more replies)
  2015-04-10  9:21 ` [RFC][PATCH 05/13] xen/x86/p2m: introduce set_identity_p2m_entry Tiejun Chen
                   ` (8 subsequent siblings)
  12 siblings, 3 replies; 125+ messages in thread
From: Tiejun Chen @ 2015-04-10  9:21 UTC (permalink / raw)
  To: JBeulich, tim, konrad.wilk, andrew.cooper3, kevin.tian,
	yang.z.zhang, ian.campbell, wei.liu2, Ian.Jackson,
	stefano.stabellini
  Cc: xen-devel

While building a VM, HVM domain builder provides struct hvm_info_table{}
to help hvmloader. Currently it includes two fields to construct guest
e820 table by hvmloader, low_mem_pgend and high_mem_pgend. So we should
check them to fix any conflict with RAM.

RMRR can reside in address space beyond 4G theoretically, but we never
see this in real world. So in order to avoid breaking highmem layout
we don't solve highmem conflict. Note this means highmem rmrr could still
be supported if no conflict.

But in the case of lowmem, RMRR probably scatter the whole RAM space.
Especially multiple RMRR entries would worsen this to lead a complicated
memory layout. And then its hard to extend hvm_info_table{} to work
hvmloader out. So here we're trying to figure out a simple solution to
avoid breaking existing layout. So when a conflict occurs,

    #1. Above a predefined boundary (default 2G)
        - move lowmem_end below reserved region to solve conflict;

    #2 Below a predefined boundary (default 2G)
        - Check force/try policy.
        "force" policy leads to fail libxl. Note when both policies
        are specified on a given region, 'force' is always preferred.
        "try" policy issue a warning message and also mask this entry INVALID
        to indicate we shouldn't expose this entry to hvmloader.

Signed-off-by: Tiejun Chen <tiejun.chen@intel.com>
---
 tools/libxc/include/xenctrl.h  |   8 ++
 tools/libxc/include/xenguest.h |   3 +-
 tools/libxc/xc_domain.c        |  40 +++++++++
 tools/libxc/xc_hvm_build_x86.c |  28 +++---
 tools/libxl/libxl_create.c     |   2 +-
 tools/libxl/libxl_dm.c         | 195 +++++++++++++++++++++++++++++++++++++++++
 tools/libxl/libxl_dom.c        |  27 +++++-
 tools/libxl/libxl_internal.h   |  11 ++-
 tools/libxl/libxl_types.idl    |   7 ++
 9 files changed, 303 insertions(+), 18 deletions(-)

diff --git a/tools/libxc/include/xenctrl.h b/tools/libxc/include/xenctrl.h
index 59bbe06..299b95f 100644
--- a/tools/libxc/include/xenctrl.h
+++ b/tools/libxc/include/xenctrl.h
@@ -2053,6 +2053,14 @@ int xc_get_device_group(xc_interface *xch,
                      uint32_t *num_sdevs,
                      uint32_t *sdev_array);
 
+struct xen_reserved_device_memory
+*xc_device_get_rdm(xc_interface *xch,
+                   uint32_t flag,
+                   uint16_t seg,
+                   uint8_t bus,
+                   uint8_t devfn,
+                   unsigned int *nr_entries);
+
 int xc_test_assign_device(xc_interface *xch,
                           uint32_t domid,
                           uint32_t machine_sbdf);
diff --git a/tools/libxc/include/xenguest.h b/tools/libxc/include/xenguest.h
index 40bbac8..0f1c23b 100644
--- a/tools/libxc/include/xenguest.h
+++ b/tools/libxc/include/xenguest.h
@@ -218,7 +218,8 @@ struct xc_hvm_firmware_module {
 };
 
 struct xc_hvm_build_args {
-    uint64_t mem_size;           /* Memory size in bytes. */
+    uint64_t lowmem_size;        /* All low memory size in bytes. */
+    uint64_t mem_size;           /* All memory size in bytes. */
     uint64_t mem_target;         /* Memory target in bytes. */
     uint64_t mmio_size;          /* Size of the MMIO hole in bytes. */
     const char *image_file_name; /* File name of the image to load. */
diff --git a/tools/libxc/xc_domain.c b/tools/libxc/xc_domain.c
index 4f8383e..85b18ea 100644
--- a/tools/libxc/xc_domain.c
+++ b/tools/libxc/xc_domain.c
@@ -1665,6 +1665,46 @@ int xc_assign_device(
     return do_domctl(xch, &domctl);
 }
 
+struct xen_reserved_device_memory
+*xc_device_get_rdm(xc_interface *xch,
+                   uint32_t flag,
+                   uint16_t seg,
+                   uint8_t bus,
+                   uint8_t devfn,
+                   unsigned int *nr_entries)
+{
+    struct xen_reserved_device_memory *xrdm = NULL;
+    int rc = xc_reserved_device_memory_map(xch, flag, seg, bus, devfn, xrdm,
+                                           nr_entries);
+
+    if ( rc < 0 )
+    {
+        if ( errno == ENOBUFS )
+        {
+            if ( (xrdm = malloc(*nr_entries *
+                                sizeof(xen_reserved_device_memory_t))) == NULL )
+            {
+                PERROR("Could not allocate memory.");
+                goto out;
+            }
+            rc = xc_reserved_device_memory_map(xch, flag, seg, bus, devfn, xrdm,
+                                               nr_entries);
+            if ( rc )
+            {
+                PERROR("Could not get reserved device memory maps.");
+                free(xrdm);
+                xrdm = NULL;
+            }
+        }
+        else {
+            PERROR("Could not get reserved device memory maps.");
+        }
+    }
+
+ out:
+    return xrdm;
+}
+
 int xc_get_device_group(
     xc_interface *xch,
     uint32_t domid,
diff --git a/tools/libxc/xc_hvm_build_x86.c b/tools/libxc/xc_hvm_build_x86.c
index c81a25b..3f87bb3 100644
--- a/tools/libxc/xc_hvm_build_x86.c
+++ b/tools/libxc/xc_hvm_build_x86.c
@@ -89,19 +89,16 @@ static int modules_init(struct xc_hvm_build_args *args,
 }
 
 static void build_hvm_info(void *hvm_info_page, uint64_t mem_size,
-                           uint64_t mmio_start, uint64_t mmio_size)
+                           uint64_t lowmem_end)
 {
     struct hvm_info_table *hvm_info = (struct hvm_info_table *)
         (((unsigned char *)hvm_info_page) + HVM_INFO_OFFSET);
-    uint64_t lowmem_end = mem_size, highmem_end = 0;
+    uint64_t highmem_end = 0;
     uint8_t sum;
     int i;
 
-    if ( lowmem_end > mmio_start )
-    {
-        highmem_end = (1ull<<32) + (lowmem_end - mmio_start);
-        lowmem_end = mmio_start;
-    }
+    if ( mem_size > lowmem_end )
+        highmem_end = (1ull<<32) + (mem_size - lowmem_end);
 
     memset(hvm_info_page, 0, PAGE_SIZE);
 
@@ -252,7 +249,7 @@ static int setup_guest(xc_interface *xch,
     void *hvm_info_page;
     uint32_t *ident_pt;
     struct elf_binary elf;
-    uint64_t v_start, v_end;
+    uint64_t v_start, v_end, v_lowend, lowmem_end;
     uint64_t m_start = 0, m_end = 0;
     int rc;
     xen_capabilities_info_t caps;
@@ -275,6 +272,7 @@ static int setup_guest(xc_interface *xch,
     elf_parse_binary(&elf);
     v_start = 0;
     v_end = args->mem_size;
+    v_lowend = lowmem_end = args->lowmem_size;
 
     if ( xc_version(xch, XENVER_capabilities, &caps) != 0 )
     {
@@ -302,8 +300,11 @@ static int setup_guest(xc_interface *xch,
 
     for ( i = 0; i < nr_pages; i++ )
         page_array[i] = i;
-    for ( i = mmio_start >> PAGE_SHIFT; i < nr_pages; i++ )
-        page_array[i] += mmio_size >> PAGE_SHIFT;
+    /*
+     * This condition 'lowmem_end <= mmio_start' is always true.
+     */
+    for ( i = lowmem_end >> PAGE_SHIFT; i < nr_pages; i++ )
+        page_array[i] += ((1ull << 32) - lowmem_end) >> PAGE_SHIFT;
 
     /*
      * Try to claim pages for early warning of insufficient memory available.
@@ -469,7 +470,7 @@ static int setup_guest(xc_interface *xch,
               xch, dom, PAGE_SIZE, PROT_READ | PROT_WRITE,
               HVM_INFO_PFN)) == NULL )
         goto error_out;
-    build_hvm_info(hvm_info_page, v_end, mmio_start, mmio_size);
+    build_hvm_info(hvm_info_page, v_end, v_lowend);
     munmap(hvm_info_page, PAGE_SIZE);
 
     /* Allocate and clear special pages. */
@@ -588,9 +589,6 @@ int xc_hvm_build(xc_interface *xch, uint32_t domid,
     if ( args.mem_target == 0 )
         args.mem_target = args.mem_size;
 
-    if ( args.mmio_size == 0 )
-        args.mmio_size = HVM_BELOW_4G_MMIO_LENGTH;
-
     /* An HVM guest must be initialised with at least 2MB memory. */
     if ( args.mem_size < (2ull << 20) || args.mem_target < (2ull << 20) )
         return -1;
@@ -634,6 +632,8 @@ int xc_hvm_build_target_mem(xc_interface *xch,
     args.mem_size = (uint64_t)memsize << 20;
     args.mem_target = (uint64_t)target << 20;
     args.image_file_name = image_name;
+    if ( args.mmio_size == 0 )
+        args.mmio_size = HVM_BELOW_4G_MMIO_LENGTH;
 
     return xc_hvm_build(xch, domid, &args);
 }
diff --git a/tools/libxl/libxl_create.c b/tools/libxl/libxl_create.c
index 9ed40d4..eab86fd 100644
--- a/tools/libxl/libxl_create.c
+++ b/tools/libxl/libxl_create.c
@@ -430,7 +430,7 @@ int libxl__domain_build(libxl__gc *gc,
 
     switch (info->type) {
     case LIBXL_DOMAIN_TYPE_HVM:
-        ret = libxl__build_hvm(gc, domid, info, state);
+        ret = libxl__build_hvm(gc, domid, d_config, state);
         if (ret)
             goto out;
 
diff --git a/tools/libxl/libxl_dm.c b/tools/libxl/libxl_dm.c
index a8b08f2..9ad04ae 100644
--- a/tools/libxl/libxl_dm.c
+++ b/tools/libxl/libxl_dm.c
@@ -90,6 +90,201 @@ const char *libxl__domain_device_model(libxl__gc *gc,
     return dm;
 }
 
+/*
+ * Check whether there exists rdm hole in the specified memory range.
+ * Returns 1 if exists, else returns 0.
+ */
+static int check_rdm_hole(uint64_t start, uint64_t memsize,
+                          uint64_t rdm_start, uint64_t rdm_size)
+{
+    if ( start + memsize <= rdm_start || start >= rdm_start + rdm_size )
+        return 0;
+    else
+        return 1;
+}
+
+/*
+ * Check reported RDM regions and handle potential gfn conflicts according
+ * to user preferred policy.
+ */
+int libxl__domain_device_check_rdm(libxl__gc *gc,
+                                   libxl_domain_config *d_config,
+                                   uint64_t rdm_mem_guard,
+                                   struct xc_hvm_build_args *args)
+{
+    int i, j, conflict;
+    libxl_ctx *ctx = libxl__gc_owner(gc);
+    struct xen_reserved_device_memory *xrdm = NULL;
+    unsigned int nr_all_rdms = 0;
+    uint64_t rdm_start, rdm_size, highmem_end = (1ULL << 32);
+    uint32_t type = d_config->b_info.rdm.type;
+    uint16_t seg;
+    uint8_t bus, devfn;
+
+    /* Might not to expose rdm. */
+    if ((type == LIBXL_RDM_RESERVE_TYPE_NONE) && !d_config->num_pcidevs)
+        return 0;
+
+    /* Collect all rdm info if exist. */
+    xrdm = xc_device_get_rdm(ctx->xch, LIBXL_RDM_RESERVE_TYPE_HOST,
+                             0, 0, 0, &nr_all_rdms);
+    if (!nr_all_rdms)
+        return 0;
+    d_config->rdms = libxl__calloc(gc, nr_all_rdms,
+                                   sizeof(libxl_device_rdm));
+    memset(d_config->rdms, 0, sizeof(libxl_device_rdm));
+
+    /* Query all RDM entries in this platform */
+    if (type == LIBXL_RDM_RESERVE_TYPE_HOST) {
+        d_config->num_rdms = nr_all_rdms;
+        for (i = 0; i < d_config->num_rdms; i++) {
+            d_config->rdms[i].start =
+                                (uint64_t)xrdm[i].start_pfn << XC_PAGE_SHIFT;
+            d_config->rdms[i].size =
+                                (uint64_t)xrdm[i].nr_pages << XC_PAGE_SHIFT;
+            d_config->rdms[i].flag = d_config->b_info.rdm.reserve;
+        }
+    } else {
+        d_config->num_rdms = 0;
+    }
+    free(xrdm);
+
+    /* Query RDM entries per-device */
+    for (i = 0; i < d_config->num_pcidevs; i++) {
+        unsigned int nr_entries = 0;
+        bool new = true;
+        seg = d_config->pcidevs[i].domain;
+        bus = d_config->pcidevs[i].bus;
+        devfn = PCI_DEVFN(d_config->pcidevs[i].dev, d_config->pcidevs[i].func);
+        nr_entries = 0;
+        xrdm = xc_device_get_rdm(ctx->xch, LIBXL_RDM_RESERVE_TYPE_NONE,
+                                 seg, bus, devfn, &nr_entries);
+        /* No RDM to associated with this device. */
+        if (!nr_entries)
+            continue;
+
+        /* Need to check whether this entry is already saved in the array.
+         * This could come from two cases:
+         *
+         *   - user may configure to get all RMRRs in this platform, which
+         * is already queried before this point
+         *   - or two assigned devices may share one RMRR entry
+         *
+         * different policies may be configured on the same RMRR due to above
+         * two cases. We choose a simple policy to always favor stricter policy
+         */
+        for (j = 0; j < d_config->num_rdms; j++) {
+            if (d_config->rdms[j].start ==
+                                (uint64_t)xrdm[0].start_pfn << XC_PAGE_SHIFT)
+             {
+                if (d_config->rdms[j].flag != LIBXL_RDM_RESERVE_FLAG_FORCE)
+                    d_config->rdms[j].flag = d_config->pcidevs[i].rdm_reserve;
+                new = false;
+                break;
+            }
+        }
+
+        if (new) {
+            if (d_config->num_rdms > nr_all_rdms - 1) {
+                LIBXL__LOG(CTX, LIBXL__LOG_ERROR, "Exceed rdm array boundary!\n");
+                free(xrdm);
+                return -1;
+            }
+
+            /*
+             * This is a new entry.
+             */
+            d_config->rdms[d_config->num_rdms].start =
+                                (uint64_t)xrdm[0].start_pfn << XC_PAGE_SHIFT;
+            d_config->rdms[d_config->num_rdms].size =
+                                (uint64_t)xrdm[0].nr_pages << XC_PAGE_SHIFT;
+            d_config->rdms[d_config->num_rdms].flag = d_config->pcidevs[i].rdm_reserve;
+            d_config->num_rdms++;
+        }
+        free(xrdm);
+    }
+
+    /* Fix highmem. */
+    if (args->mem_size > args->lowmem_size)
+        highmem_end += (args->mem_size - args->lowmem_size);
+    /* Next step is to check and avoid potential conflict between RDM entries
+     * and guest RAM. To avoid intrusive impact to existing memory layout
+     * {lowmem, mmio, highmem} which is passed around various function blocks,
+     * below conflicts are not handled which are rare and handling them would
+     * lead to a more scattered layout:
+     *  - RMRR in highmem area (>4G)
+     *  - RMRR lower than a defined memory boundary (e.g. 2G)
+     * Otherwise for conflicts between boundary and 4G, we'll simply move lowmem
+     * end below reserved region to solve conflict.
+     *
+     * If a conflict is detected on a given RMRR entry, an error will be
+     * returned.
+     * If 'force' policy is specified. Or conflict is treated as a warning if
+     * 'try' policy is specified, and we also mark this as INVALID not to expose
+     * this entry to hvmloader.
+     *
+     * Firstly we should check the case of rdm < 4G because we may need to
+     * expand highmem_end.
+     */
+    for (i = 0; i < d_config->num_rdms; i++) {
+        rdm_start = d_config->rdms[i].start;
+        rdm_size = d_config->rdms[i].size;
+        conflict = check_rdm_hole(0, args->lowmem_size, rdm_start, rdm_size);
+
+        if (!conflict)
+            continue;
+
+        /*
+         * Just check if RDM > our memory boundary
+         */
+        if (d_config->rdms[i].start > rdm_mem_guard) {
+            /*
+             * We will move downwards lowmem_end so we have to expand
+             * highmem_end.
+             */
+            highmem_end += (args->lowmem_size - rdm_start);
+            /* Now move downwards lowmem_end. */
+            args->lowmem_size = rdm_start;
+        }
+    }
+
+    /*
+     * Finally we can take same policy to check lowmem(< 2G) and
+     * highmem adjusted above.
+     */
+    for (i = 0; i < d_config->num_rdms; i++) {
+        rdm_start = d_config->rdms[i].start;
+        rdm_size = d_config->rdms[i].size;
+        /* Does this entry conflict with lowmem? */
+        conflict = check_rdm_hole(0, args->lowmem_size,
+                                  rdm_start, rdm_size);
+        /* Does this entry conflict with highmem? */
+        conflict |= check_rdm_hole((1ULL<<32), highmem_end,
+                                   rdm_start, rdm_size);
+
+        if (!conflict)
+            continue;
+
+        if(d_config->rdms[i].flag == LIBXL_RDM_RESERVE_FLAG_FORCE) {
+            LIBXL__LOG(CTX, LIBXL__LOG_ERROR, "RDM conflict at 0x%lx.\n",
+                                                d_config->rdms[i].start);
+            return -1;
+        } else {
+            LIBXL__LOG(CTX, LIBXL__LOG_WARNING,
+                                        "Ignoring RDM conflict at 0x%lx.\n",
+                                        d_config->rdms[i].start);
+
+            /*
+             * Then mask this INVALID to indicate we shouldn't expose this
+             * to hvmloader.
+             */
+            d_config->rdms[i].flag = LIBXL_RDM_RESERVE_FLAG_INVALID;
+        }
+    }
+
+    return 0;
+}
+
 const libxl_vnc_info *libxl__dm_vnc(const libxl_domain_config *guest_config)
 {
     const libxl_vnc_info *vnc = NULL;
diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c
index a16d4a1..ee67282 100644
--- a/tools/libxl/libxl_dom.c
+++ b/tools/libxl/libxl_dom.c
@@ -788,12 +788,14 @@ out:
 }
 
 int libxl__build_hvm(libxl__gc *gc, uint32_t domid,
-              libxl_domain_build_info *info,
+              libxl_domain_config *d_config,
               libxl__domain_build_state *state)
 {
     libxl_ctx *ctx = libxl__gc_owner(gc);
     struct xc_hvm_build_args args = {};
     int ret, rc = ERROR_FAIL;
+    libxl_domain_build_info *const info = &d_config->b_info;
+    uint64_t rdm_mem_boundary, mmio_start;
 
     memset(&args, 0, sizeof(struct xc_hvm_build_args));
     /* The params from the configuration file are in Mb, which are then
@@ -802,6 +804,8 @@ int libxl__build_hvm(libxl__gc *gc, uint32_t domid,
      * Do all this in one step here...
      */
     args.mem_size = (uint64_t)(info->max_memkb - info->video_memkb) << 10;
+    args.lowmem_size = (args.mem_size > (1ULL << 32)) ?
+                            (1ULL << 32) : args.mem_size;
     args.mem_target = (uint64_t)(info->target_memkb - info->video_memkb) << 10;
     args.claim_enabled = libxl_defbool_val(info->claim_mode);
     if (info->u.hvm.mmio_hole_memkb) {
@@ -811,6 +815,27 @@ int libxl__build_hvm(libxl__gc *gc, uint32_t domid,
         if (max_ram_below_4g < HVM_BELOW_4G_MMIO_START)
             args.mmio_size = info->u.hvm.mmio_hole_memkb << 10;
     }
+
+    if (args.mmio_size == 0)
+        args.mmio_size = HVM_BELOW_4G_MMIO_LENGTH;
+    mmio_start = (1ull << 32) - args.mmio_size;
+
+    if (args.lowmem_size > mmio_start)
+        args.lowmem_size = mmio_start;
+
+    /*
+     * We'd like to set a memory boundary to determine if we need to check
+     * any overlap with reserved device memory.
+     *
+     * TODO: we will add a config parameter for this boundary value next.
+     */
+    rdm_mem_boundary = 0x80000000;
+    ret = libxl__domain_device_check_rdm(gc, d_config, rdm_mem_boundary, &args);
+    if (ret) {
+        LOG(ERROR, "checking reserved device memory failed");
+        goto out;
+    }
+
     if (libxl__domain_firmware(gc, info, &args)) {
         LOG(ERROR, "initializing domain firmware failed");
         goto out;
diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h
index 934465a..64d05b3 100644
--- a/tools/libxl/libxl_internal.h
+++ b/tools/libxl/libxl_internal.h
@@ -985,7 +985,7 @@ _hidden int libxl__build_post(libxl__gc *gc, uint32_t domid,
 _hidden int libxl__build_pv(libxl__gc *gc, uint32_t domid,
              libxl_domain_build_info *info, libxl__domain_build_state *state);
 _hidden int libxl__build_hvm(libxl__gc *gc, uint32_t domid,
-              libxl_domain_build_info *info,
+              libxl_domain_config *d_config,
               libxl__domain_build_state *state);
 
 _hidden int libxl__qemu_traditional_cmd(libxl__gc *gc, uint32_t domid,
@@ -1480,6 +1480,15 @@ _hidden int libxl__need_xenpv_qemu(libxl__gc *gc,
         int nr_channels, libxl_device_channel *channels);
 
 /*
+ * This function will fix reserved device memory conflict
+ * according to user's configuration.
+ */
+_hidden int libxl__domain_device_check_rdm(libxl__gc *gc,
+                                   libxl_domain_config *d_config,
+                                   uint64_t rdm_mem_guard,
+                                   struct xc_hvm_build_args *args);
+
+/*
  * This function will cause the whole libxl process to hang
  * if the device model does not respond.  It is deprecated.
  *
diff --git a/tools/libxl/libxl_types.idl b/tools/libxl/libxl_types.idl
index 5786455..218931b 100644
--- a/tools/libxl/libxl_types.idl
+++ b/tools/libxl/libxl_types.idl
@@ -541,6 +541,12 @@ libxl_device_pci = Struct("device_pci", [
     ("rdm_reserve",   libxl_rdm_reserve_flag),
     ])
 
+libxl_device_rdm = Struct("device_rdm", [
+    ("start", uint64),
+    ("size", uint64),
+    ("flag", bool),
+    ])
+
 libxl_device_vtpm = Struct("device_vtpm", [
     ("backend_domid",    libxl_domid),
     ("backend_domname",  string),
@@ -567,6 +573,7 @@ libxl_domain_config = Struct("domain_config", [
     ("disks", Array(libxl_device_disk, "num_disks")),
     ("nics", Array(libxl_device_nic, "num_nics")),
     ("pcidevs", Array(libxl_device_pci, "num_pcidevs")),
+    ("rdms", Array(libxl_device_rdm, "num_rdms")),
     ("vfbs", Array(libxl_device_vfb, "num_vfbs")),
     ("vkbs", Array(libxl_device_vkb, "num_vkbs")),
     ("vtpms", Array(libxl_device_vtpm, "num_vtpms")),
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [RFC][PATCH 05/13] xen/x86/p2m: introduce set_identity_p2m_entry
  2015-04-10  9:21 [RFC][PATCH 00/13] Fix RMRR Tiejun Chen
                   ` (3 preceding siblings ...)
  2015-04-10  9:21 ` [RFC][PATCH 04/13] tools/libxl: detect and avoid conflicts with RDM Tiejun Chen
@ 2015-04-10  9:21 ` Tiejun Chen
  2015-04-16 15:05   ` Tim Deegan
  2015-04-10  9:21 ` [RFC][PATCH 06/13] xen:vtd: create RMRR mapping Tiejun Chen
                   ` (7 subsequent siblings)
  12 siblings, 1 reply; 125+ messages in thread
From: Tiejun Chen @ 2015-04-10  9:21 UTC (permalink / raw)
  To: JBeulich, tim, konrad.wilk, andrew.cooper3, kevin.tian,
	yang.z.zhang, ian.campbell, wei.liu2, Ian.Jackson,
	stefano.stabellini
  Cc: xen-devel

We will create this sort of identity mapping as follows:

If the gfn space is unoccupied, we just set the mapping. If the space
is already occupied by 1:1 mappings, do nothing. Failed for any
other cases.

Signed-off-by: Tiejun Chen <tiejun.chen@intel.com>
---
 xen/arch/x86/mm/p2m.c     | 30 ++++++++++++++++++++++++++++++
 xen/include/asm-x86/p2m.h |  4 ++++
 2 files changed, 34 insertions(+)

diff --git a/xen/arch/x86/mm/p2m.c b/xen/arch/x86/mm/p2m.c
index 6a06e9f..d98730c 100644
--- a/xen/arch/x86/mm/p2m.c
+++ b/xen/arch/x86/mm/p2m.c
@@ -862,6 +862,36 @@ int set_mmio_p2m_entry(struct domain *d, unsigned long gfn, mfn_t mfn,
     return set_typed_p2m_entry(d, gfn, mfn, p2m_mmio_direct, access);
 }
 
+int set_identity_p2m_entry(struct domain *d, unsigned long gfn,
+                           p2m_access_t p2ma)
+{
+    p2m_type_t p2mt;
+    p2m_access_t a;
+    mfn_t mfn;
+    struct p2m_domain *p2m = p2m_get_hostp2m(d);
+    int ret = -EBUSY;
+
+    gfn_lock(p2m, gfn, 0);
+
+    mfn = p2m->get_entry(p2m, gfn, &p2mt, &a, 0, NULL);
+
+    if ( !mfn_valid(mfn) )
+        ret = p2m_set_entry(p2m, gfn, _mfn(gfn), PAGE_ORDER_4K, p2m_mmio_direct,
+                            p2ma);
+    else if ( mfn_x(mfn) == gfn && p2mt == p2m_mmio_direct && a == p2ma )
+        ret = 0;
+    else
+    {
+        printk(XENLOG_G_WARNING
+               "Cannot identity map d%d:%lx, already mapped to %lx.\n",
+               d->domain_id, gfn, mfn_x(mfn));
+    }
+
+    gfn_unlock(p2m, gfn, 0);
+
+    return ret;
+}
+
 /* Returns: 0 for success, -errno for failure */
 int clear_mmio_p2m_entry(struct domain *d, unsigned long gfn, mfn_t mfn)
 {
diff --git a/xen/include/asm-x86/p2m.h b/xen/include/asm-x86/p2m.h
index e93c551..a9c0415 100644
--- a/xen/include/asm-x86/p2m.h
+++ b/xen/include/asm-x86/p2m.h
@@ -532,6 +532,10 @@ int set_mmio_p2m_entry(struct domain *d, unsigned long gfn, mfn_t mfn,
                        p2m_access_t access);
 int clear_mmio_p2m_entry(struct domain *d, unsigned long gfn, mfn_t mfn);
 
+/* Set identity addresses in the p2m table (for pass-through) */
+int set_identity_p2m_entry(struct domain *d, unsigned long gfn,
+                           p2m_access_t p2ma);
+
 /* Add foreign mapping to the guest's p2m table. */
 int p2m_add_foreign(struct domain *tdom, unsigned long fgfn,
                     unsigned long gpfn, domid_t foreign_domid);
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [RFC][PATCH 06/13] xen:vtd: create RMRR mapping
  2015-04-10  9:21 [RFC][PATCH 00/13] Fix RMRR Tiejun Chen
                   ` (4 preceding siblings ...)
  2015-04-10  9:21 ` [RFC][PATCH 05/13] xen/x86/p2m: introduce set_identity_p2m_entry Tiejun Chen
@ 2015-04-10  9:21 ` Tiejun Chen
  2015-04-16 15:16   ` Tim Deegan
  2015-04-10  9:21 ` [RFC][PATCH 07/13] xen/passthrough: extend hypercall to support rdm reservation policy Tiejun Chen
                   ` (6 subsequent siblings)
  12 siblings, 1 reply; 125+ messages in thread
From: Tiejun Chen @ 2015-04-10  9:21 UTC (permalink / raw)
  To: JBeulich, tim, konrad.wilk, andrew.cooper3, kevin.tian,
	yang.z.zhang, ian.campbell, wei.liu2, Ian.Jackson,
	stefano.stabellini
  Cc: xen-devel

RMRR reserved regions must be setup in the pfn space with an identity
mapping to reported mfn. However existing code has problem to setup
correct mapping when VT-d shares EPT page table, so lead to problem
when assigning devices (e.g GPU) with RMRR reported. This patch
aims to setup identity mapping in p2m layer, regardless of whether EPT
is shared or not.

Signed-off-by: Tiejun Chen <tiejun.chen@intel.com>
---
 xen/drivers/passthrough/vtd/iommu.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/xen/drivers/passthrough/vtd/iommu.c b/xen/drivers/passthrough/vtd/iommu.c
index 4e789d1..f8fc6c3 100644
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -1847,6 +1847,12 @@ static int rmrr_identity_mapping(struct domain *d, bool_t map,
 
         if ( err )
             return err;
+
+        if ( !is_hardware_domain(d) )
+        {
+            if ( (err = set_identity_p2m_entry(d, base_pfn, p2m_access_rw)) )
+                return err;
+        }
         base_pfn++;
     }
 
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [RFC][PATCH 07/13] xen/passthrough: extend hypercall to support rdm reservation policy
  2015-04-10  9:21 [RFC][PATCH 00/13] Fix RMRR Tiejun Chen
                   ` (5 preceding siblings ...)
  2015-04-10  9:21 ` [RFC][PATCH 06/13] xen:vtd: create RMRR mapping Tiejun Chen
@ 2015-04-10  9:21 ` Tiejun Chen
  2015-04-16 15:40   ` Tim Deegan
                     ` (2 more replies)
  2015-04-10  9:21 ` [RFC][PATCH 08/13] tools: extend xc_assign_device() " Tiejun Chen
                   ` (5 subsequent siblings)
  12 siblings, 3 replies; 125+ messages in thread
From: Tiejun Chen @ 2015-04-10  9:21 UTC (permalink / raw)
  To: JBeulich, tim, konrad.wilk, andrew.cooper3, kevin.tian,
	yang.z.zhang, ian.campbell, wei.liu2, Ian.Jackson,
	stefano.stabellini
  Cc: xen-devel

This patch extends the existing hypercall to support rdm reservation policy.
We return error or just throw out a warning message depending on whether
the policy is 'force' or 'try'.

Signed-off-by: Tiejun Chen <tiejun.chen@intel.com>
---
 xen/drivers/passthrough/amd/pci_amd_iommu.c |  3 ++-
 xen/drivers/passthrough/pci.c               | 10 +++++----
 xen/drivers/passthrough/vtd/iommu.c         | 32 +++++++++++++++++++++--------
 xen/include/public/domctl.h                 |  4 ++++
 xen/include/xen/iommu.h                     |  2 +-
 5 files changed, 37 insertions(+), 14 deletions(-)

diff --git a/xen/drivers/passthrough/amd/pci_amd_iommu.c b/xen/drivers/passthrough/amd/pci_amd_iommu.c
index e83bb35..920b35a 100644
--- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
+++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
@@ -394,7 +394,8 @@ static int reassign_device(struct domain *source, struct domain *target,
 }
 
 static int amd_iommu_assign_device(struct domain *d, u8 devfn,
-                                   struct pci_dev *pdev)
+                                   struct pci_dev *pdev,
+                                   u32 flag)
 {
     struct ivrs_mappings *ivrs_mappings = get_ivrs_mappings(pdev->seg);
     int bdf = PCI_BDF2(pdev->bus, devfn);
diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
index 4b83583..1b040d9 100644
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -1333,7 +1333,7 @@ static int device_assigned(u16 seg, u8 bus, u8 devfn)
     return pdev ? 0 : -EBUSY;
 }
 
-static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn)
+static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag)
 {
     struct hvm_iommu *hd = domain_hvm_iommu(d);
     struct pci_dev *pdev;
@@ -1383,7 +1383,7 @@ static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn)
 
     pdev->fault.count = 0;
 
-    if ( (rc = hd->platform_ops->assign_device(d, devfn, pci_to_dev(pdev))) )
+    if ( (rc = hd->platform_ops->assign_device(d, devfn, pci_to_dev(pdev), flag)) )
         goto done;
 
     for ( ; pdev->phantom_stride; rc = 0 )
@@ -1391,7 +1391,7 @@ static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn)
         devfn += pdev->phantom_stride;
         if ( PCI_SLOT(devfn) != PCI_SLOT(pdev->devfn) )
             break;
-        rc = hd->platform_ops->assign_device(d, devfn, pci_to_dev(pdev));
+        rc = hd->platform_ops->assign_device(d, devfn, pci_to_dev(pdev), flag);
         if ( rc )
             printk(XENLOG_G_WARNING "d%d: assign %04x:%02x:%02x.%u failed (%d)\n",
                    d->domain_id, seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn),
@@ -1508,6 +1508,7 @@ int iommu_do_pci_domctl(
 {
     u16 seg;
     u8 bus, devfn;
+    u32 flag;
     int ret = 0;
 
     switch ( domctl->cmd )
@@ -1576,9 +1577,10 @@ int iommu_do_pci_domctl(
         seg = domctl->u.assign_device.machine_sbdf >> 16;
         bus = (domctl->u.assign_device.machine_sbdf >> 8) & 0xff;
         devfn = domctl->u.assign_device.machine_sbdf & 0xff;
+        flag = domctl->u.assign_device.sbdf_flag;
 
         ret = device_assigned(seg, bus, devfn) ?:
-              assign_device(d, seg, bus, devfn);
+              assign_device(d, seg, bus, devfn, flag);
         if ( ret == -ERESTART )
             ret = hypercall_create_continuation(__HYPERVISOR_domctl,
                                                 "h", u_domctl);
diff --git a/xen/drivers/passthrough/vtd/iommu.c b/xen/drivers/passthrough/vtd/iommu.c
index f8fc6c3..2681166 100644
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -1793,8 +1793,14 @@ static void iommu_set_pgd(struct domain *d)
     hd->arch.pgd_maddr = pagetable_get_paddr(pagetable_from_mfn(pgd_mfn));
 }
 
+/*
+ * In some cases, e.g. add a device to hwdomain, and remove a device from
+ * user domain, 'try' is fine enough since this is always safe to hwdomain.
+ */
+#define XEN_DOMCTL_PCIDEV_RDM_DEFAULT XEN_DOMCTL_PCIDEV_RDM_TRY
 static int rmrr_identity_mapping(struct domain *d, bool_t map,
-                                 const struct acpi_rmrr_unit *rmrr)
+                                 const struct acpi_rmrr_unit *rmrr,
+                                 u32 flag)
 {
     unsigned long base_pfn = rmrr->base_address >> PAGE_SHIFT_4K;
     unsigned long end_pfn = PAGE_ALIGN_4K(rmrr->end_address) >> PAGE_SHIFT_4K;
@@ -1851,7 +1857,14 @@ static int rmrr_identity_mapping(struct domain *d, bool_t map,
         if ( !is_hardware_domain(d) )
         {
             if ( (err = set_identity_p2m_entry(d, base_pfn, p2m_access_rw)) )
-                return err;
+            {
+                if ( flag == XEN_DOMCTL_PCIDEV_RDM_TRY )
+                {
+                    printk(XENLOG_G_WARNING "Some devices may work failed .\n");
+                }
+                else
+                    return err;
+            }
         }
         base_pfn++;
     }
@@ -1892,7 +1905,8 @@ static int intel_iommu_add_device(u8 devfn, struct pci_dev *pdev)
              PCI_BUS(bdf) == pdev->bus &&
              PCI_DEVFN2(bdf) == devfn )
         {
-            ret = rmrr_identity_mapping(pdev->domain, 1, rmrr);
+            ret = rmrr_identity_mapping(pdev->domain, 1, rmrr,
+                                        XEN_DOMCTL_PCIDEV_RDM_DEFAULT);
             if ( ret )
                 dprintk(XENLOG_ERR VTDPREFIX, "d%d: RMRR mapping failed\n",
                         pdev->domain->domain_id);
@@ -1933,7 +1947,8 @@ static int intel_iommu_remove_device(u8 devfn, struct pci_dev *pdev)
              PCI_DEVFN2(bdf) != devfn )
             continue;
 
-        rmrr_identity_mapping(pdev->domain, 0, rmrr);
+        rmrr_identity_mapping(pdev->domain, 0, rmrr,
+                              XEN_DOMCTL_PCIDEV_RDM_DEFAULT);
     }
 
     return domain_context_unmap(pdev->domain, devfn, pdev);
@@ -2091,7 +2106,7 @@ static void __hwdom_init setup_hwdom_rmrr(struct domain *d)
     spin_lock(&pcidevs_lock);
     for_each_rmrr_device ( rmrr, bdf, i )
     {
-        ret = rmrr_identity_mapping(d, 1, rmrr);
+        ret = rmrr_identity_mapping(d, 1, rmrr, XEN_DOMCTL_PCIDEV_RDM_DEFAULT);
         if ( ret )
             dprintk(XENLOG_ERR VTDPREFIX,
                      "IOMMU: mapping reserved region failed\n");
@@ -2234,7 +2249,8 @@ static int reassign_device_ownership(
                  PCI_BUS(bdf) == pdev->bus &&
                  PCI_DEVFN2(bdf) == devfn )
             {
-                ret = rmrr_identity_mapping(source, 0, rmrr);
+                ret = rmrr_identity_mapping(source, 0, rmrr,
+                                            XEN_DOMCTL_PCIDEV_RDM_DEFAULT);
                 if ( ret != -ENOENT )
                     return ret;
             }
@@ -2258,7 +2274,7 @@ static int reassign_device_ownership(
 }
 
 static int intel_iommu_assign_device(
-    struct domain *d, u8 devfn, struct pci_dev *pdev)
+    struct domain *d, u8 devfn, struct pci_dev *pdev, u32 flag)
 {
     struct acpi_rmrr_unit *rmrr;
     int ret = 0, i;
@@ -2287,7 +2303,7 @@ static int intel_iommu_assign_device(
              PCI_BUS(bdf) == bus &&
              PCI_DEVFN2(bdf) == devfn )
         {
-            ret = rmrr_identity_mapping(d, 1, rmrr);
+            ret = rmrr_identity_mapping(d, 1, rmrr, flag);
             if ( ret )
             {
                 reassign_device_ownership(d, hardware_domain, devfn, pdev);
diff --git a/xen/include/public/domctl.h b/xen/include/public/domctl.h
index ca0e51e..e5ba7cb 100644
--- a/xen/include/public/domctl.h
+++ b/xen/include/public/domctl.h
@@ -493,6 +493,10 @@ DEFINE_XEN_GUEST_HANDLE(xen_domctl_sendtrigger_t);
 /* XEN_DOMCTL_deassign_device */
 struct xen_domctl_assign_device {
     uint32_t  machine_sbdf;   /* machine PCI ID of assigned device */
+    /* IN */
+#define XEN_DOMCTL_PCIDEV_RDM_TRY       0
+#define XEN_DOMCTL_PCIDEV_RDM_FORCE     1
+    uint32_t  sbdf_flag;   /* flag of assigned device */
 };
 typedef struct xen_domctl_assign_device xen_domctl_assign_device_t;
 DEFINE_XEN_GUEST_HANDLE(xen_domctl_assign_device_t);
diff --git a/xen/include/xen/iommu.h b/xen/include/xen/iommu.h
index 8565b82..0d10b3d 100644
--- a/xen/include/xen/iommu.h
+++ b/xen/include/xen/iommu.h
@@ -129,7 +129,7 @@ struct iommu_ops {
     int (*add_device)(u8 devfn, device_t *dev);
     int (*enable_device)(device_t *dev);
     int (*remove_device)(u8 devfn, device_t *dev);
-    int (*assign_device)(struct domain *, u8 devfn, device_t *dev);
+    int (*assign_device)(struct domain *, u8 devfn, device_t *dev, u32 flag);
     int (*reassign_device)(struct domain *s, struct domain *t,
                            u8 devfn, device_t *dev);
 #ifdef HAS_PCI
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [RFC][PATCH 08/13] tools: extend xc_assign_device() to support rdm reservation policy
  2015-04-10  9:21 [RFC][PATCH 00/13] Fix RMRR Tiejun Chen
                   ` (6 preceding siblings ...)
  2015-04-10  9:21 ` [RFC][PATCH 07/13] xen/passthrough: extend hypercall to support rdm reservation policy Tiejun Chen
@ 2015-04-10  9:21 ` Tiejun Chen
  2015-04-20 13:39   ` Jan Beulich
  2015-04-10  9:22 ` [RFC][PATCH 09/13] xen: enable XENMEM_set_memory_map in hvm Tiejun Chen
                   ` (4 subsequent siblings)
  12 siblings, 1 reply; 125+ messages in thread
From: Tiejun Chen @ 2015-04-10  9:21 UTC (permalink / raw)
  To: JBeulich, tim, konrad.wilk, andrew.cooper3, kevin.tian,
	yang.z.zhang, ian.campbell, wei.liu2, Ian.Jackson,
	stefano.stabellini
  Cc: xen-devel

This patch passes rdm reservation policy to xc_assign_device() so the policy
is checked when assigning devices to a VM.

Signed-off-by: Tiejun Chen <tiejun.chen@intel.com>
---
 tools/libxc/include/xenctrl.h       |  3 ++-
 tools/libxc/xc_domain.c             |  4 +++-
 tools/libxl/libxl_pci.c             | 11 ++++++++++-
 tools/libxl/xl_cmdimpl.c            | 23 +++++++++++++++++++----
 tools/libxl/xl_cmdtable.c           |  2 +-
 tools/ocaml/libs/xc/xenctrl_stubs.c | 17 +++++++++++++----
 tools/python/xen/lowlevel/xc/xc.c   | 29 +++++++++++++++++++----------
 7 files changed, 67 insertions(+), 22 deletions(-)

diff --git a/tools/libxc/include/xenctrl.h b/tools/libxc/include/xenctrl.h
index 299b95f..ff33769 100644
--- a/tools/libxc/include/xenctrl.h
+++ b/tools/libxc/include/xenctrl.h
@@ -2044,7 +2044,8 @@ int xc_hvm_destroy_ioreq_server(xc_interface *xch,
 /* HVM guest pass-through */
 int xc_assign_device(xc_interface *xch,
                      uint32_t domid,
-                     uint32_t machine_sbdf);
+                     uint32_t machine_sbdf,
+                     uint32_t flag);
 
 int xc_get_device_group(xc_interface *xch,
                      uint32_t domid,
diff --git a/tools/libxc/xc_domain.c b/tools/libxc/xc_domain.c
index 85b18ea..a33311a 100644
--- a/tools/libxc/xc_domain.c
+++ b/tools/libxc/xc_domain.c
@@ -1654,13 +1654,15 @@ int xc_domain_setdebugging(xc_interface *xch,
 int xc_assign_device(
     xc_interface *xch,
     uint32_t domid,
-    uint32_t machine_sbdf)
+    uint32_t machine_sbdf,
+    uint32_t flag)
 {
     DECLARE_DOMCTL;
 
     domctl.cmd = XEN_DOMCTL_assign_device;
     domctl.domain = domid;
     domctl.u.assign_device.machine_sbdf = machine_sbdf;
+    domctl.u.assign_device.sbdf_flag = flag;
 
     return do_domctl(xch, &domctl);
 }
diff --git a/tools/libxl/libxl_pci.c b/tools/libxl/libxl_pci.c
index f3ae132..dd96b4a 100644
--- a/tools/libxl/libxl_pci.c
+++ b/tools/libxl/libxl_pci.c
@@ -895,6 +895,7 @@ static int do_pci_add(libxl__gc *gc, uint32_t domid, libxl_device_pci *pcidev, i
     FILE *f;
     unsigned long long start, end, flags, size;
     int irq, i, rc, hvm = 0;
+    uint32_t flag;
 
     if (type == LIBXL_DOMAIN_TYPE_INVALID)
         return ERROR_FAIL;
@@ -988,7 +989,15 @@ static int do_pci_add(libxl__gc *gc, uint32_t domid, libxl_device_pci *pcidev, i
 
 out:
     if (!libxl_is_stubdom(ctx, domid, NULL)) {
-        rc = xc_assign_device(ctx->xch, domid, pcidev_encode_bdf(pcidev));
+        if (pcidev->rdm_reserve == LIBXL_RDM_RESERVE_FLAG_TRY) {
+            flag = XEN_DOMCTL_PCIDEV_RDM_TRY;
+        } else if (pcidev->rdm_reserve == LIBXL_RDM_RESERVE_FLAG_FORCE) {
+            flag = XEN_DOMCTL_PCIDEV_RDM_FORCE;
+        } else {
+            LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "unkwon rdm check flag.");
+            return ERROR_FAIL;
+        }
+        rc = xc_assign_device(ctx->xch, domid, pcidev_encode_bdf(pcidev), flag);
         if (rc < 0 && (hvm || errno != ENOSYS)) {
             LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "xc_assign_device failed");
             return ERROR_FAIL;
diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c
index 9a58464..24ace59 100644
--- a/tools/libxl/xl_cmdimpl.c
+++ b/tools/libxl/xl_cmdimpl.c
@@ -3153,7 +3153,8 @@ int main_pcidetach(int argc, char **argv)
     pcidetach(domid, bdf, force);
     return 0;
 }
-static void pciattach(uint32_t domid, const char *bdf, const char *vs)
+static void pciattach(uint32_t domid, const char *bdf, const char *vs,
+                      uint32_t flag)
 {
     libxl_device_pci pcidev;
     XLU_Config *config;
@@ -3163,6 +3164,7 @@ static void pciattach(uint32_t domid, const char *bdf, const char *vs)
     config = xlu_cfg_init(stderr, "command line");
     if (!config) { perror("xlu_cfg_inig"); exit(-1); }
 
+    pcidev.rdm_reserve = flag;
     if (xlu_pci_parse_bdf(config, &pcidev, bdf)) {
         fprintf(stderr, "pci-attach: malformed BDF specification \"%s\"\n", bdf);
         exit(2);
@@ -3175,9 +3177,9 @@ static void pciattach(uint32_t domid, const char *bdf, const char *vs)
 
 int main_pciattach(int argc, char **argv)
 {
-    uint32_t domid;
+    uint32_t domid, flag;
     int opt;
-    const char *bdf = NULL, *vs = NULL;
+    const char *bdf = NULL, *vs = NULL, *rdm_policy = NULL;
 
     SWITCH_FOREACH_OPT(opt, "", NULL, "pci-attach", 2) {
         /* No options */
@@ -3189,7 +3191,20 @@ int main_pciattach(int argc, char **argv)
     if (optind + 1 < argc)
         vs = argv[optind + 2];
 
-    pciattach(domid, bdf, vs);
+    if (optind + 2 < argc) {
+        rdm_policy = argv[optind + 3];
+    }
+    if (!strcmp(rdm_policy, "force")) {
+        flag = LIBXL_RDM_RESERVE_FLAG_FORCE;
+    } else if (!strcmp(rdm_policy, "try")) {
+        flag = LIBXL_RDM_RESERVE_FLAG_TRY;
+    } else {
+        fprintf(stderr, "%s is an invalid rdm policy: 'force'|'try'\n",
+                rdm_policy);
+        exit(2);
+    }
+
+    pciattach(domid, bdf, vs, flag);
     return 0;
 }
 
diff --git a/tools/libxl/xl_cmdtable.c b/tools/libxl/xl_cmdtable.c
index 22ab63b..332de33 100644
--- a/tools/libxl/xl_cmdtable.c
+++ b/tools/libxl/xl_cmdtable.c
@@ -86,7 +86,7 @@ struct cmd_spec cmd_table[] = {
     { "pci-attach",
       &main_pciattach, 0, 1,
       "Insert a new pass-through pci device",
-      "<Domain> <BDF> [Virtual Slot]",
+      "<Domain> <BDF> [Virtual Slot] <policy to reserve rdm['force'|'try']>",
     },
     { "pci-detach",
       &main_pcidetach, 0, 1,
diff --git a/tools/ocaml/libs/xc/xenctrl_stubs.c b/tools/ocaml/libs/xc/xenctrl_stubs.c
index 64f1137..c9a5437 100644
--- a/tools/ocaml/libs/xc/xenctrl_stubs.c
+++ b/tools/ocaml/libs/xc/xenctrl_stubs.c
@@ -1172,12 +1172,18 @@ CAMLprim value stub_xc_domain_test_assign_device(value xch, value domid, value d
 	CAMLreturn(Val_bool(ret == 0));
 }
 
-CAMLprim value stub_xc_domain_assign_device(value xch, value domid, value desc)
+static int domain_assign_device_rdm_flag_table[] = {
+	XEN_DOMCTL_PCIDEV_RDM_TRY,
+    XEN_DOMCTL_PCIDEV_RDM_FORCE,
+};
+
+CAMLprim value stub_xc_domain_assign_device(value xch, value domid, value desc,
+                                            value rflag)
 {
-	CAMLparam3(xch, domid, desc);
+	CAMLparam4(xch, domid, desc, rflag);
 	int ret;
 	int domain, bus, dev, func;
-	uint32_t sbdf;
+	uint32_t sbdf, flag;
 
 	domain = Int_val(Field(desc, 0));
 	bus = Int_val(Field(desc, 1));
@@ -1185,7 +1191,10 @@ CAMLprim value stub_xc_domain_assign_device(value xch, value domid, value desc)
 	func = Int_val(Field(desc, 3));
 	sbdf = encode_sbdf(domain, bus, dev, func);
 
-	ret = xc_assign_device(_H(xch), _D(domid), sbdf);
+	ret = Int_val(Field(rflag, 0));
+	flag = domain_assign_device_rdm_flag_table[ret];
+
+	ret = xc_assign_device(_H(xch), _D(domid), sbdf, flag);
 
 	if (ret < 0)
 		failwith_xc(_H(xch));
diff --git a/tools/python/xen/lowlevel/xc/xc.c b/tools/python/xen/lowlevel/xc/xc.c
index 2aa0dc7..41e55f9 100644
--- a/tools/python/xen/lowlevel/xc/xc.c
+++ b/tools/python/xen/lowlevel/xc/xc.c
@@ -592,7 +592,8 @@ static int token_value(char *token)
     return strtol(token, NULL, 16);
 }
 
-static int next_bdf(char **str, int *seg, int *bus, int *dev, int *func)
+static int next_bdf(char **str, int *seg, int *bus, int *dev, int *func,
+                    int *flag)
 {
     char *token;
 
@@ -607,8 +608,16 @@ static int next_bdf(char **str, int *seg, int *bus, int *dev, int *func)
     *dev  = token_value(token);
     token = strchr(token, ',') + 1;
     *func  = token_value(token);
-    token = strchr(token, ',');
-    *str = token ? token + 1 : NULL;
+    token = strchr(token, ',') + 1;
+    if ( token ) {
+        *flag = token_value(token);
+        *str = token + 1;
+    }
+    else
+    {
+        *flag = XEN_DOMCTL_PCIDEV_RDM_FORCE;
+        *str = NULL;
+    }
 
     return 1;
 }
@@ -620,14 +629,14 @@ static PyObject *pyxc_test_assign_device(XcObject *self,
     uint32_t dom;
     char *pci_str;
     int32_t sbdf = 0;
-    int seg, bus, dev, func;
+    int seg, bus, dev, func, flag;
 
     static char *kwd_list[] = { "domid", "pci", NULL };
     if ( !PyArg_ParseTupleAndKeywords(args, kwds, "is", kwd_list,
                                       &dom, &pci_str) )
         return NULL;
 
-    while ( next_bdf(&pci_str, &seg, &bus, &dev, &func) )
+    while ( next_bdf(&pci_str, &seg, &bus, &dev, &func, &flag) )
     {
         sbdf = seg << 16;
         sbdf |= (bus & 0xff) << 8;
@@ -653,21 +662,21 @@ static PyObject *pyxc_assign_device(XcObject *self,
     uint32_t dom;
     char *pci_str;
     int32_t sbdf = 0;
-    int seg, bus, dev, func;
+    int seg, bus, dev, func, flag;
 
     static char *kwd_list[] = { "domid", "pci", NULL };
     if ( !PyArg_ParseTupleAndKeywords(args, kwds, "is", kwd_list,
                                       &dom, &pci_str) )
         return NULL;
 
-    while ( next_bdf(&pci_str, &seg, &bus, &dev, &func) )
+    while ( next_bdf(&pci_str, &seg, &bus, &dev, &func, &flag) )
     {
         sbdf = seg << 16;
         sbdf |= (bus & 0xff) << 8;
         sbdf |= (dev & 0x1f) << 3;
         sbdf |= (func & 0x7);
 
-        if ( xc_assign_device(self->xc_handle, dom, sbdf) != 0 )
+        if ( xc_assign_device(self->xc_handle, dom, sbdf, flag) != 0 )
         {
             if (errno == ENOSYS)
                 sbdf = -1;
@@ -686,14 +695,14 @@ static PyObject *pyxc_deassign_device(XcObject *self,
     uint32_t dom;
     char *pci_str;
     int32_t sbdf = 0;
-    int seg, bus, dev, func;
+    int seg, bus, dev, func, flag;
 
     static char *kwd_list[] = { "domid", "pci", NULL };
     if ( !PyArg_ParseTupleAndKeywords(args, kwds, "is", kwd_list,
                                       &dom, &pci_str) )
         return NULL;
 
-    while ( next_bdf(&pci_str, &seg, &bus, &dev, &func) )
+    while ( next_bdf(&pci_str, &seg, &bus, &dev, &func, &flag) )
     {
         sbdf = seg << 16;
         sbdf |= (bus & 0xff) << 8;
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [RFC][PATCH 09/13] xen: enable XENMEM_set_memory_map in hvm
  2015-04-10  9:21 [RFC][PATCH 00/13] Fix RMRR Tiejun Chen
                   ` (7 preceding siblings ...)
  2015-04-10  9:21 ` [RFC][PATCH 08/13] tools: extend xc_assign_device() " Tiejun Chen
@ 2015-04-10  9:22 ` Tiejun Chen
  2015-04-16 15:42   ` Tim Deegan
  2015-04-20 13:46   ` Jan Beulich
  2015-04-10  9:22 ` [RFC][PATCH 10/13] tools: extend XENMEM_set_memory_map Tiejun Chen
                   ` (3 subsequent siblings)
  12 siblings, 2 replies; 125+ messages in thread
From: Tiejun Chen @ 2015-04-10  9:22 UTC (permalink / raw)
  To: JBeulich, tim, konrad.wilk, andrew.cooper3, kevin.tian,
	yang.z.zhang, ian.campbell, wei.liu2, Ian.Jackson,
	stefano.stabellini
  Cc: xen-devel

This patch enables XENMEM_set_memory_map in hvm. So we can use it to
setup the e820 mappings.

Signed-off-by: Tiejun Chen <tiejun.chen@intel.com>
---
 xen/arch/x86/hvm/hvm.c | 2 --
 xen/arch/x86/mm.c      | 6 ------
 2 files changed, 8 deletions(-)

diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
index 4734d71..0e05e47 100644
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -4729,7 +4729,6 @@ static long hvm_memory_op(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
 
     switch ( cmd & MEMOP_CMD_MASK )
     {
-    case XENMEM_memory_map:
     case XENMEM_machine_memory_map:
     case XENMEM_machphys_mapping:
         return -ENOSYS;
@@ -4805,7 +4804,6 @@ static long hvm_memory_op_compat32(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
 
     switch ( cmd & MEMOP_CMD_MASK )
     {
-    case XENMEM_memory_map:
     case XENMEM_machine_memory_map:
     case XENMEM_machphys_mapping:
         return -ENOSYS;
diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 45acfb5..c75a8bc 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -4693,12 +4693,6 @@ long arch_memory_op(unsigned long cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
             return rc;
         }
 
-        if ( is_hvm_domain(d) )
-        {
-            rcu_unlock_domain(d);
-            return -EPERM;
-        }
-
         e820 = xmalloc_array(e820entry_t, fmap.map.nr_entries);
         if ( e820 == NULL )
         {
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [RFC][PATCH 10/13] tools: extend XENMEM_set_memory_map
  2015-04-10  9:21 [RFC][PATCH 00/13] Fix RMRR Tiejun Chen
                   ` (8 preceding siblings ...)
  2015-04-10  9:22 ` [RFC][PATCH 09/13] xen: enable XENMEM_set_memory_map in hvm Tiejun Chen
@ 2015-04-10  9:22 ` Tiejun Chen
  2015-04-10 10:01   ` Wei Liu
  2015-04-20 13:51   ` Jan Beulich
  2015-04-10  9:22 ` [RFC][PATCH 11/13] hvmloader: get guest memory map into memory_map[] Tiejun Chen
                   ` (2 subsequent siblings)
  12 siblings, 2 replies; 125+ messages in thread
From: Tiejun Chen @ 2015-04-10  9:22 UTC (permalink / raw)
  To: JBeulich, tim, konrad.wilk, andrew.cooper3, kevin.tian,
	yang.z.zhang, ian.campbell, wei.liu2, Ian.Jackson,
	stefano.stabellini
  Cc: xen-devel

Here we'll construct a basic guest e820 table via
XENMEM_set_memory_map. This table includes lowmem, highmem
and RDMs if they exist. And hvmloader would need this info
later.

Signed-off-by: Tiejun Chen <tiejun.chen@intel.com>
---
 tools/libxl/libxl_dom.c | 72 +++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 72 insertions(+)

diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c
index ee67282..82468be 100644
--- a/tools/libxl/libxl_dom.c
+++ b/tools/libxl/libxl_dom.c
@@ -787,6 +787,70 @@ out:
     return rc;
 }
 
+static int libxl__domain_construct_memmap(libxl_ctx *ctx,
+                                          libxl_domain_config *d_config,
+                                          uint32_t domid,
+                                          struct xc_hvm_build_args *args,
+                                          int num_pcidevs,
+                                          libxl_device_pci *pcidevs)
+{
+    unsigned int nr = 0, i;
+    /* We always own at least one lowmem entry. */
+    unsigned int e820_entries = 1;
+    uint64_t highmem_end = 0, highmem_size = args->mem_size - args->lowmem_size;
+    struct e820entry *e820 = NULL;
+
+    /* Add all rdm entries. */
+    e820_entries += d_config->num_rdms;
+
+    /* If we should have a highmem range. */
+    if (highmem_size)
+    {
+        highmem_end = (1ull<<32) + highmem_size;
+        e820_entries++;
+    }
+
+    e820 = malloc(sizeof(struct e820entry) * e820_entries);
+    if (!e820) {
+        return -1;
+    }
+
+    /* Low memory */
+    e820[nr].addr = 0x100000;
+    e820[nr].size = args->lowmem_size - 0x100000;
+    e820[nr].type = E820_RAM;
+    nr++;
+
+    /* RDM mapping */
+    for (i = 0; i < d_config->num_rdms; i++) {
+        /*
+         * We should drop this kind of rdm entry.
+         */
+        if (d_config->rdms[i].flag == LIBXL_RDM_RESERVE_FLAG_INVALID)
+            continue;
+
+        e820[nr].addr = d_config->rdms[i].start;
+        e820[nr].size = d_config->rdms[i].size;
+        e820[nr].type = E820_RESERVED;
+        nr++;
+    }
+
+    /* High memory */
+    if (highmem_size) {
+        e820[nr].addr = ((uint64_t)1 << 32);
+        e820[nr].size = highmem_size;
+        e820[nr].type = E820_RAM;
+    }
+
+    if (xc_domain_set_memory_map(ctx->xch, domid, e820, e820_entries) != 0) {
+        free(e820);
+        return -1;
+    }
+
+    free(e820);
+    return 0;
+}
+
 int libxl__build_hvm(libxl__gc *gc, uint32_t domid,
               libxl_domain_config *d_config,
               libxl__domain_build_state *state)
@@ -836,6 +900,14 @@ int libxl__build_hvm(libxl__gc *gc, uint32_t domid,
         goto out;
     }
 
+    if (libxl__domain_construct_memmap(ctx, d_config, domid,
+                                       &args,
+                                       d_config->num_pcidevs,
+                                       d_config->pcidevs)) {
+        LOG(ERROR, "setting domain rdm memory map failed");
+        goto out;
+    }
+
     if (libxl__domain_firmware(gc, info, &args)) {
         LOG(ERROR, "initializing domain firmware failed");
         goto out;
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [RFC][PATCH 11/13] hvmloader: get guest memory map into memory_map[]
  2015-04-10  9:21 [RFC][PATCH 00/13] Fix RMRR Tiejun Chen
                   ` (9 preceding siblings ...)
  2015-04-10  9:22 ` [RFC][PATCH 10/13] tools: extend XENMEM_set_memory_map Tiejun Chen
@ 2015-04-10  9:22 ` Tiejun Chen
  2015-04-20 13:57   ` Jan Beulich
  2015-04-10  9:22 ` [RFC][PATCH 12/13] hvmloader/pci: skip reserved ranges Tiejun Chen
  2015-04-10  9:22 ` [RFC][PATCH 13/13] hvmloader/e820: construct guest e820 table Tiejun Chen
  12 siblings, 1 reply; 125+ messages in thread
From: Tiejun Chen @ 2015-04-10  9:22 UTC (permalink / raw)
  To: JBeulich, tim, konrad.wilk, andrew.cooper3, kevin.tian,
	yang.z.zhang, ian.campbell, wei.liu2, Ian.Jackson,
	stefano.stabellini
  Cc: xen-devel

Now we get this map layout by call XENMEM_memory_map then
save them into one global variable memory_map[]. It should
include lowmem range, rdm range and highmem range. Note
rdm range and highmem range may not exist in some cases.

And here we need to check if any reserved memory conflicts with
[RESERVED_MEMORY_DYNAMIC_START - 1, RESERVED_MEMORY_DYNAMIC_END].
This range is used to allocate memory in hvmloder level, and
we would lead hvmloader failed in case of conflict since its
another rare possibility in real world.

Signed-off-by: Tiejun Chen <tiejun.chen@intel.com>
---
 tools/firmware/hvmloader/e820.h      |  7 +++++++
 tools/firmware/hvmloader/hvmloader.c | 36 ++++++++++++++++++++++++++++++++++++
 tools/firmware/hvmloader/util.c      | 25 +++++++++++++++++++++++++
 tools/firmware/hvmloader/util.h      | 10 ++++++++++
 4 files changed, 78 insertions(+)

diff --git a/tools/firmware/hvmloader/e820.h b/tools/firmware/hvmloader/e820.h
index b2ead7f..8b5a9e0 100644
--- a/tools/firmware/hvmloader/e820.h
+++ b/tools/firmware/hvmloader/e820.h
@@ -15,6 +15,13 @@ struct e820entry {
     uint32_t type;
 } __attribute__((packed));
 
+#define E820MAX	128
+
+struct e820map {
+    unsigned int nr_map;
+    struct e820entry map[E820MAX];
+};
+
 #endif /* __HVMLOADER_E820_H__ */
 
 /*
diff --git a/tools/firmware/hvmloader/hvmloader.c b/tools/firmware/hvmloader/hvmloader.c
index 25b7f08..5595700 100644
--- a/tools/firmware/hvmloader/hvmloader.c
+++ b/tools/firmware/hvmloader/hvmloader.c
@@ -107,6 +107,8 @@ asm (
     "    .text                       \n"
     );
 
+struct e820map memory_map;
+
 unsigned long scratch_start = SCRATCH_PHYSICAL_ADDRESS;
 
 static void init_hypercalls(void)
@@ -199,6 +201,38 @@ static void apic_setup(void)
     ioapic_write(0x11, SET_APIC_ID(LAPIC_ID(0)));
 }
 
+void memory_map_setup(void)
+{
+    unsigned int nr_entries = E820MAX, i;
+    int rc;
+    uint64_t alloc_addr = RESERVED_MEMORY_DYNAMIC_START - 1;
+    uint64_t alloc_size = RESERVED_MEMORY_DYNAMIC_END - alloc_addr;
+
+    rc = get_mem_mapping_layout(memory_map.map, &nr_entries);
+
+    if ( rc )
+    {
+        printf("Failed to get guest memory map.\n");
+        BUG();
+    }
+
+    memory_map.nr_map = nr_entries;
+
+    for ( i = 0; i < nr_entries; i++ )
+    {
+        if ( memory_map.map[i].type == E820_RESERVED )
+        {
+            if ( check_hole_conflict(alloc_addr, alloc_size,
+                                     memory_map.map[i].addr,
+                                     memory_map.map[i].size) )
+            {
+                printf("RDM conflicts Memory allocation.\n");
+                BUG();
+            }
+        }
+    }
+}
+
 struct bios_info {
     const char *key;
     const struct bios_config *bios;
@@ -262,6 +296,8 @@ int main(void)
 
     init_hypercalls();
 
+    memory_map_setup();
+
     xenbus_setup();
 
     bios = detect_bios();
diff --git a/tools/firmware/hvmloader/util.c b/tools/firmware/hvmloader/util.c
index 80d822f..0e08cb8 100644
--- a/tools/firmware/hvmloader/util.c
+++ b/tools/firmware/hvmloader/util.c
@@ -27,6 +27,16 @@
 #include <xen/memory.h>
 #include <xen/sched.h>
 
+int check_hole_conflict(uint64_t start, uint64_t size,
+                        uint64_t reserved_start, uint64_t reserved_size)
+{
+    if ( start + size <= reserved_start ||
+            start >= reserved_start + reserved_size )
+        return 0;
+    else
+        return 1;
+}
+
 void wrmsr(uint32_t idx, uint64_t v)
 {
     asm volatile (
@@ -368,6 +378,21 @@ uuid_to_string(char *dest, uint8_t *uuid)
     *p = '\0';
 }
 
+int get_mem_mapping_layout(struct e820entry entries[], uint32_t *max_entries)
+{
+    int rc;
+    struct xen_memory_map memmap = {
+        .nr_entries = *max_entries
+    };
+
+    set_xen_guest_handle(memmap.buffer, entries);
+
+    rc = hypercall_memory_op(XENMEM_memory_map, &memmap);
+    *max_entries = memmap.nr_entries;
+
+    return rc;
+}
+
 void mem_hole_populate_ram(xen_pfn_t mfn, uint32_t nr_mfns)
 {
     static int over_allocated;
diff --git a/tools/firmware/hvmloader/util.h b/tools/firmware/hvmloader/util.h
index a70e4aa..1121a65 100644
--- a/tools/firmware/hvmloader/util.h
+++ b/tools/firmware/hvmloader/util.h
@@ -6,6 +6,7 @@
 #include <stddef.h>
 #include <xen/xen.h>
 #include <xen/hvm/hvm_info_table.h>
+#include "e820.h"
 
 #define __STR(...) #__VA_ARGS__
 #define STR(...) __STR(__VA_ARGS__)
@@ -222,6 +223,9 @@ int hvm_param_set(uint32_t index, uint64_t value);
 /* Setup PCI bus */
 void pci_setup(void);
 
+/* Setup memory map  */
+void memory_map_setup(void);
+
 /* Prepare the 32bit BIOS */
 uint32_t rombios_highbios_setup(void);
 
@@ -249,6 +253,12 @@ void perform_tests(void);
 
 extern char _start[], _end[];
 
+int get_mem_mapping_layout(struct e820entry entries[], uint32_t *max_entries);
+
+extern struct e820map memory_map;
+int check_hole_conflict(uint64_t start, uint64_t size,
+                        uint64_t reserved_start, uint64_t reserved_size);
+
 #endif /* __HVMLOADER_UTIL_H__ */
 
 /*
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [RFC][PATCH 12/13] hvmloader/pci: skip reserved ranges
  2015-04-10  9:21 [RFC][PATCH 00/13] Fix RMRR Tiejun Chen
                   ` (10 preceding siblings ...)
  2015-04-10  9:22 ` [RFC][PATCH 11/13] hvmloader: get guest memory map into memory_map[] Tiejun Chen
@ 2015-04-10  9:22 ` Tiejun Chen
  2015-04-20 14:21   ` Jan Beulich
  2015-04-10  9:22 ` [RFC][PATCH 13/13] hvmloader/e820: construct guest e820 table Tiejun Chen
  12 siblings, 1 reply; 125+ messages in thread
From: Tiejun Chen @ 2015-04-10  9:22 UTC (permalink / raw)
  To: JBeulich, tim, konrad.wilk, andrew.cooper3, kevin.tian,
	yang.z.zhang, ian.campbell, wei.liu2, Ian.Jackson,
	stefano.stabellini
  Cc: xen-devel

When allocating mmio address for PCI bars, we need to make
sure they don't overlap with reserved regions.

Signed-off-by: Tiejun Chen <tiejun.chen@intel.com>
---
 tools/firmware/hvmloader/pci.c | 19 +++++++++++++++++--
 1 file changed, 17 insertions(+), 2 deletions(-)

diff --git a/tools/firmware/hvmloader/pci.c b/tools/firmware/hvmloader/pci.c
index 4e8d803..51aa65d 100644
--- a/tools/firmware/hvmloader/pci.c
+++ b/tools/firmware/hvmloader/pci.c
@@ -59,8 +59,8 @@ void pci_setup(void)
         uint32_t bar_reg;
         uint64_t bar_sz;
     } *bars = (struct bars *)scratch_start;
-    unsigned int i, nr_bars = 0;
-    uint64_t mmio_hole_size = 0;
+    unsigned int i, j, nr_bars = 0;
+    uint64_t mmio_hole_size = 0, reserved_end;
 
     const char *s;
     /*
@@ -393,8 +393,23 @@ void pci_setup(void)
         }
 
         base = (resource->base  + bar_sz - 1) & ~(uint64_t)(bar_sz - 1);
+ reallocate_mmio:
         bar_data |= (uint32_t)base;
         bar_data_upper = (uint32_t)(base >> 32);
+        for ( j = 0; j < memory_map.nr_map ; j++ )
+        {
+            if ( memory_map.map[j].type != E820_RAM )
+            {
+                reserved_end = memory_map.map[j].addr + memory_map.map[j].size;
+                if ( check_hole_conflict(base, bar_sz,
+                                         memory_map.map[j].addr,
+                                         memory_map.map[j].size) )
+                {
+                    base = (reserved_end  + bar_sz - 1) & ~(uint64_t)(bar_sz - 1);
+                    goto reallocate_mmio;
+                }
+            }
+        }
         base += bar_sz;
 
         if ( (base < resource->base) || (base > resource->max) )
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [RFC][PATCH 13/13] hvmloader/e820: construct guest e820 table
  2015-04-10  9:21 [RFC][PATCH 00/13] Fix RMRR Tiejun Chen
                   ` (11 preceding siblings ...)
  2015-04-10  9:22 ` [RFC][PATCH 12/13] hvmloader/pci: skip reserved ranges Tiejun Chen
@ 2015-04-10  9:22 ` Tiejun Chen
  2015-04-20 14:29   ` Jan Beulich
  12 siblings, 1 reply; 125+ messages in thread
From: Tiejun Chen @ 2015-04-10  9:22 UTC (permalink / raw)
  To: JBeulich, tim, konrad.wilk, andrew.cooper3, kevin.tian,
	yang.z.zhang, ian.campbell, wei.liu2, Ian.Jackson,
	stefano.stabellini
  Cc: xen-devel

Now we can use that memory map to build our final
e820 table but it may need to reorder all e820
entries.

Signed-off-by: Tiejun Chen <tiejun.chen@intel.com>
---
 tools/firmware/hvmloader/e820.c | 40 +++++++++++++++++++++++++++++-----------
 1 file changed, 29 insertions(+), 11 deletions(-)

diff --git a/tools/firmware/hvmloader/e820.c b/tools/firmware/hvmloader/e820.c
index 2e05e93..3aa0be4 100644
--- a/tools/firmware/hvmloader/e820.c
+++ b/tools/firmware/hvmloader/e820.c
@@ -73,7 +73,8 @@ int build_e820_table(struct e820entry *e820,
                      unsigned int lowmem_reserved_base,
                      unsigned int bios_image_base)
 {
-    unsigned int nr = 0;
+    unsigned int nr = 0, i, j;
+    struct e820entry tmp;
 
     if ( !lowmem_reserved_base )
             lowmem_reserved_base = 0xA0000;
@@ -119,10 +120,6 @@ int build_e820_table(struct e820entry *e820,
 
     /* Low RAM goes here. Reserve space for special pages. */
     BUG_ON((hvm_info->low_mem_pgend << PAGE_SHIFT) < (2u << 20));
-    e820[nr].addr = 0x100000;
-    e820[nr].size = (hvm_info->low_mem_pgend << PAGE_SHIFT) - e820[nr].addr;
-    e820[nr].type = E820_RAM;
-    nr++;
 
     /*
      * Explicitly reserve space for special pages.
@@ -159,16 +156,37 @@ int build_e820_table(struct e820entry *e820,
         nr++;
     }
 
-
-    if ( hvm_info->high_mem_pgend )
+    /* Construct the remaings according memory_map[]. */
+    for ( i = 0; i < memory_map.nr_map; i++ )
     {
-        e820[nr].addr = ((uint64_t)1 << 32);
-        e820[nr].size =
-            ((uint64_t)hvm_info->high_mem_pgend << PAGE_SHIFT) - e820[nr].addr;
-        e820[nr].type = E820_RAM;
+        e820[nr].addr = memory_map.map[i].addr;
+        e820[nr].size = memory_map.map[i].size;
+        e820[nr].type = memory_map.map[i].type;
         nr++;
     }
 
+    /* May need to reorder all e820 entries. */
+    for ( j = 0; j < nr-1; j++ )
+    {
+        for ( i = j+1; i < nr; i++ )
+        {
+            if ( e820[j].addr > e820[i].addr )
+            {
+                tmp.addr = e820[j].addr;
+                tmp.size = e820[j].size;
+                tmp.type = e820[j].type;
+
+                e820[j].addr = e820[i].addr;
+                e820[j].size = e820[i].size;
+                e820[j].type = e820[i].type;
+
+                e820[i].addr = tmp.addr;
+                e820[i].size = tmp.size;
+                e820[i].type = tmp.type;
+            }
+        }
+    }
+
     return nr;
 }
 
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 125+ messages in thread

* Re: [RFC][PATCH 10/13] tools: extend XENMEM_set_memory_map
  2015-04-10  9:22 ` [RFC][PATCH 10/13] tools: extend XENMEM_set_memory_map Tiejun Chen
@ 2015-04-10 10:01   ` Wei Liu
  2015-04-13  2:09     ` Chen, Tiejun
  2015-04-20 13:51   ` Jan Beulich
  1 sibling, 1 reply; 125+ messages in thread
From: Wei Liu @ 2015-04-10 10:01 UTC (permalink / raw)
  To: Tiejun Chen
  Cc: kevin.tian, wei.liu2, ian.campbell, andrew.cooper3, tim,
	xen-devel, stefano.stabellini, JBeulich, yang.z.zhang,
	Ian.Jackson

On Fri, Apr 10, 2015 at 05:22:01PM +0800, Tiejun Chen wrote:
> Here we'll construct a basic guest e820 table via
> XENMEM_set_memory_map. This table includes lowmem, highmem
> and RDMs if they exist. And hvmloader would need this info
> later.
> 
> Signed-off-by: Tiejun Chen <tiejun.chen@intel.com>
> ---
>  tools/libxl/libxl_dom.c | 72 +++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 72 insertions(+)
> 
> diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c
> index ee67282..82468be 100644
> --- a/tools/libxl/libxl_dom.c
> +++ b/tools/libxl/libxl_dom.c
> @@ -787,6 +787,70 @@ out:
>      return rc;
>  }
>  
> +static int libxl__domain_construct_memmap(libxl_ctx *ctx,

Internal function should take libxl__gc *gc.

> +                                          libxl_domain_config *d_config,
> +                                          uint32_t domid,
> +                                          struct xc_hvm_build_args *args,
> +                                          int num_pcidevs,
> +                                          libxl_device_pci *pcidevs)

I think domid, num_pcidevs and pcidevs should be in d_config by the
time you call this function? At least num_pcidevs and pcidevs should be
available.

That said, I don't see pci stuff being used anywhere in the function, so
please just delete them.

> +{
> +    unsigned int nr = 0, i;
> +    /* We always own at least one lowmem entry. */
> +    unsigned int e820_entries = 1;
> +    uint64_t highmem_end = 0, highmem_size = args->mem_size - args->lowmem_size;

FWIW there are some new output fields called lowmem_end, highmem_end and
mmio_start in xc_hvm_build_args. Those might be useful as well?

Note that they are only available *after* calling xc_hvm_build, which
seems to *not* be the case for you.

So if you somehow find those fields useful you might want to consider
moving the callsite a bit later.

> +    struct e820entry *e820 = NULL;
> +
> +    /* Add all rdm entries. */
> +    e820_entries += d_config->num_rdms;
> +
> +    /* If we should have a highmem range. */
> +    if (highmem_size)
> +    {
> +        highmem_end = (1ull<<32) + highmem_size;
> +        e820_entries++;
> +    }
> +
> +    e820 = malloc(sizeof(struct e820entry) * e820_entries);

You can use libxl__malloc(gc, ...).

> +    if (!e820) {
> +        return -1;
> +    }

No need to check if you use libxl__malloc.

> +
> +    /* Low memory */
> +    e820[nr].addr = 0x100000;

Hardcoded value?

> +    e820[nr].size = args->lowmem_size - 0x100000;
> +    e820[nr].type = E820_RAM;
> +    nr++;
> +
> +    /* RDM mapping */
> +    for (i = 0; i < d_config->num_rdms; i++) {
> +        /*
> +         * We should drop this kind of rdm entry.
> +         */
> +        if (d_config->rdms[i].flag == LIBXL_RDM_RESERVE_FLAG_INVALID)
> +            continue;
> +
> +        e820[nr].addr = d_config->rdms[i].start;
> +        e820[nr].size = d_config->rdms[i].size;
> +        e820[nr].type = E820_RESERVED;
> +        nr++;
> +    }
> +
> +    /* High memory */
> +    if (highmem_size) {
> +        e820[nr].addr = ((uint64_t)1 << 32);
> +        e820[nr].size = highmem_size;
> +        e820[nr].type = E820_RAM;
> +    }
> +
> +    if (xc_domain_set_memory_map(ctx->xch, domid, e820, e820_entries) != 0) {
> +        free(e820);

No need to free if you use libxl__malloc(gc, ...).

> +        return -1;
> +    }
> +
> +    free(e820);

Ditto.

Wei.

> +    return 0;
> +}
> +
>  int libxl__build_hvm(libxl__gc *gc, uint32_t domid,
>                libxl_domain_config *d_config,
>                libxl__domain_build_state *state)
> @@ -836,6 +900,14 @@ int libxl__build_hvm(libxl__gc *gc, uint32_t domid,
>          goto out;
>      }
>  
> +    if (libxl__domain_construct_memmap(ctx, d_config, domid,
> +                                       &args,
> +                                       d_config->num_pcidevs,
> +                                       d_config->pcidevs)) {
> +        LOG(ERROR, "setting domain rdm memory map failed");
> +        goto out;
> +    }
> +
>      if (libxl__domain_firmware(gc, info, &args)) {
>          LOG(ERROR, "initializing domain firmware failed");
>          goto out;
> -- 
> 1.9.1

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC][PATCH 10/13] tools: extend XENMEM_set_memory_map
  2015-04-10 10:01   ` Wei Liu
@ 2015-04-13  2:09     ` Chen, Tiejun
  2015-04-13 11:02       ` Wei Liu
  0 siblings, 1 reply; 125+ messages in thread
From: Chen, Tiejun @ 2015-04-13  2:09 UTC (permalink / raw)
  To: Wei Liu
  Cc: kevin.tian, ian.campbell, andrew.cooper3, tim, xen-devel,
	stefano.stabellini, JBeulich, yang.z.zhang, Ian.Jackson

Wei,

Thanks for your quick comment.

>>
>> +static int libxl__domain_construct_memmap(libxl_ctx *ctx,
>
> Internal function should take libxl__gc *gc.

Right.

>
>> +                                          libxl_domain_config *d_config,
>> +                                          uint32_t domid,
>> +                                          struct xc_hvm_build_args *args,
>> +                                          int num_pcidevs,
>> +                                          libxl_device_pci *pcidevs)
>
> I think domid, num_pcidevs and pcidevs should be in d_config by the
> time you call this function? At least num_pcidevs and pcidevs should be

Yes, but looks 'domid' is still needed.

> available.
>
> That said, I don't see pci stuff being used anywhere in the function, so
> please just delete them.

Sorry I really should clean these stuffs.

>
>> +{
>> +    unsigned int nr = 0, i;
>> +    /* We always own at least one lowmem entry. */
>> +    unsigned int e820_entries = 1;
>> +    uint64_t highmem_end = 0, highmem_size = args->mem_size - args->lowmem_size;
>
> FWIW there are some new output fields called lowmem_end, highmem_end and
> mmio_start in xc_hvm_build_args. Those might be useful as well?

No. These e820 entries will be used in hvmloader.

Note these constructed e820 info are based on args->mem_size, 
args->lowmem_size and args->mmio_size. And looks these fields are never 
changed inside xc_hvm_build_args().

>
> Note that they are only available *after* calling xc_hvm_build, which
> seems to *not* be the case for you.
>
> So if you somehow find those fields useful you might want to consider
> moving the callsite a bit later.

But I think I'd better follow your logic here since we can't guarantee 
all related fields wouldn't be modified in the future.

So I will push libxl__domain_construct_memmap() after xc_hvm_build_args().

>
>> +    struct e820entry *e820 = NULL;
>> +
>> +    /* Add all rdm entries. */
>> +    e820_entries += d_config->num_rdms;
>> +
>> +    /* If we should have a highmem range. */
>> +    if (highmem_size)
>> +    {
>> +        highmem_end = (1ull<<32) + highmem_size;
>> +        e820_entries++;
>> +    }
>> +
>> +    e820 = malloc(sizeof(struct e820entry) * e820_entries);
>
> You can use libxl__malloc(gc, ...).

Good simplification.

>
>> +    if (!e820) {
>> +        return -1;
>> +    }
>
> No need to check if you use libxl__malloc.

Sure.

>
>> +
>> +    /* Low memory */
>> +    e820[nr].addr = 0x100000;
>
> Hardcoded value?

Yes. Actually, we intend to use this to present that lowmem entry,

tools/firmware/hvmloader/e820.c:

     /* Low RAM goes here. Reserve space for special pages. */
     ...
     e820[nr].addr = 0x100000;

>
>> +    e820[nr].size = args->lowmem_size - 0x100000;
>> +    e820[nr].type = E820_RAM;
>> +    nr++;
>> +
>> +    /* RDM mapping */
>> +    for (i = 0; i < d_config->num_rdms; i++) {
>> +        /*
>> +         * We should drop this kind of rdm entry.
>> +         */
>> +        if (d_config->rdms[i].flag == LIBXL_RDM_RESERVE_FLAG_INVALID)
>> +            continue;
>> +
>> +        e820[nr].addr = d_config->rdms[i].start;
>> +        e820[nr].size = d_config->rdms[i].size;
>> +        e820[nr].type = E820_RESERVED;
>> +        nr++;
>> +    }
>> +
>> +    /* High memory */
>> +    if (highmem_size) {
>> +        e820[nr].addr = ((uint64_t)1 << 32);
>> +        e820[nr].size = highmem_size;
>> +        e820[nr].type = E820_RAM;
>> +    }
>> +
>> +    if (xc_domain_set_memory_map(ctx->xch, domid, e820, e820_entries) != 0) {
>> +        free(e820);
>
> No need to free if you use libxl__malloc(gc, ...).

Sure.

>
>> +        return -1;
>> +    }
>> +
>> +    free(e820);
>
> Ditto.
>

Sure.

Thanks
Tiejun

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC][PATCH 10/13] tools: extend XENMEM_set_memory_map
  2015-04-13  2:09     ` Chen, Tiejun
@ 2015-04-13 11:02       ` Wei Liu
  2015-04-14  0:42         ` Chen, Tiejun
  0 siblings, 1 reply; 125+ messages in thread
From: Wei Liu @ 2015-04-13 11:02 UTC (permalink / raw)
  To: Chen, Tiejun
  Cc: kevin.tian, Wei Liu, ian.campbell, andrew.cooper3, tim,
	xen-devel, stefano.stabellini, JBeulich, yang.z.zhang,
	Ian.Jackson

On Mon, Apr 13, 2015 at 10:09:51AM +0800, Chen, Tiejun wrote:
[...]
> >Hardcoded value?
> 
> Yes. Actually, we intend to use this to present that lowmem entry,
> 
> tools/firmware/hvmloader/e820.c:
> 
>     /* Low RAM goes here. Reserve space for special pages. */
>     ...
>     e820[nr].addr = 0x100000;
> 

I don't like the idea of having two hardcoded values in different
locations. Please put this value into a header file and reference it
here and in hvmloader.

Wei.

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC][PATCH 10/13] tools: extend XENMEM_set_memory_map
  2015-04-13 11:02       ` Wei Liu
@ 2015-04-14  0:42         ` Chen, Tiejun
  2015-05-05  9:32           ` Wei Liu
  0 siblings, 1 reply; 125+ messages in thread
From: Chen, Tiejun @ 2015-04-14  0:42 UTC (permalink / raw)
  To: Wei Liu
  Cc: kevin.tian, ian.campbell, andrew.cooper3, tim, xen-devel,
	stefano.stabellini, JBeulich, yang.z.zhang, Ian.Jackson

On 2015/4/13 19:02, Wei Liu wrote:
> On Mon, Apr 13, 2015 at 10:09:51AM +0800, Chen, Tiejun wrote:
> [...]
>>> Hardcoded value?
>>
>> Yes. Actually, we intend to use this to present that lowmem entry,
>>
>> tools/firmware/hvmloader/e820.c:
>>
>>      /* Low RAM goes here. Reserve space for special pages. */
>>      ...
>>      e820[nr].addr = 0x100000;
>>
>
> I don't like the idea of having two hardcoded values in different

Just one place since based on our logic, hvmloader doesn't have this 
setting anymore and actually it really grab that info from here. Please 
refer to patch #13.

> locations. Please put this value into a header file and reference it
> here and in hvmloader.

Anyway, I'd like to define this here directly since no one consumes this 
again.

diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c
index 5134b33..af747e6 100644
--- a/tools/libxl/libxl_dom.c
+++ b/tools/libxl/libxl_dom.c
@@ -787,6 +787,7 @@ out:
      return rc;
  }

+#define GUEST_LOW_MEM_START_DEFAULT 0x100000
  static int libxl__domain_construct_memmap(libxl__gc *gc,
                                            libxl_domain_config *d_config,
                                            uint32_t domid,
@@ -812,8 +813,8 @@ static int libxl__domain_construct_memmap(libxl__gc *gc,
      e820 = libxl__malloc(gc, sizeof(struct e820entry) * e820_entries);

      /* Low memory */
-    e820[nr].addr = 0x100000;
-    e820[nr].size = args->lowmem_size - 0x100000;
+    e820[nr].addr = GUEST_LOW_MEM_START_DEFAULT;
+    e820[nr].size = args->lowmem_size - GUEST_LOW_MEM_START_DEFAULT;
      e820[nr].type = E820_RAM;
      nr++;

Is this fine to you?

Thanks
Tiejun

^ permalink raw reply related	[flat|nested] 125+ messages in thread

* Re: [RFC][PATCH 04/13] tools/libxl: detect and avoid conflicts with RDM
  2015-04-10  9:21 ` [RFC][PATCH 04/13] tools/libxl: detect and avoid conflicts with RDM Tiejun Chen
@ 2015-04-15 13:10   ` Ian Jackson
  2015-04-15 18:22     ` Tian, Kevin
  2015-04-23 12:31     ` Chen, Tiejun
  2015-04-20 11:13   ` Jan Beulich
  2015-05-08 14:43   ` Wei Liu
  2 siblings, 2 replies; 125+ messages in thread
From: Ian Jackson @ 2015-04-15 13:10 UTC (permalink / raw)
  To: Tiejun Chen
  Cc: kevin.tian, wei.liu2, ian.campbell, andrew.cooper3, tim,
	xen-devel, stefano.stabellini, JBeulich, yang.z.zhang

Tiejun Chen writes ("[RFC][PATCH 04/13] tools/libxl: detect and avoid conflicts with RDM"):
> While building a VM, HVM domain builder provides struct hvm_info_table{}
> to help hvmloader. Currently it includes two fields to construct guest
> e820 table by hvmloader, low_mem_pgend and high_mem_pgend. So we should
> check them to fix any conflict with RAM.

I'm not really qualified to understand all of this, because I'm not an
x86 expert - I don't even know what RDM is.  But this does all seem
very complicated.  Can I have a second opinion from an x86 expert ?

I had a quick look at the libxl code and it at the very least needs
updating to conform to tools/libxl/CODING_STYLE.

Thanks,
Ian.

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC][PATCH 04/13] tools/libxl: detect and avoid conflicts with RDM
  2015-04-15 13:10   ` Ian Jackson
@ 2015-04-15 18:22     ` Tian, Kevin
  2015-04-23 12:31     ` Chen, Tiejun
  1 sibling, 0 replies; 125+ messages in thread
From: Tian, Kevin @ 2015-04-15 18:22 UTC (permalink / raw)
  To: Ian Jackson, Chen, Tiejun
  Cc: wei.liu2, ian.campbell, andrew.cooper3, tim, xen-devel,
	stefano.stabellini, JBeulich, Zhang, Yang Z

> From: Ian Jackson [mailto:Ian.Jackson@eu.citrix.com]
> Sent: Wednesday, April 15, 2015 9:10 PM
> 
> Tiejun Chen writes ("[RFC][PATCH 04/13] tools/libxl: detect and avoid conflicts
> with RDM"):
> > While building a VM, HVM domain builder provides struct hvm_info_table{}
> > to help hvmloader. Currently it includes two fields to construct guest
> > e820 table by hvmloader, low_mem_pgend and high_mem_pgend. So we
> should
> > check them to fix any conflict with RAM.
> 
> I'm not really qualified to understand all of this, because I'm not an
> x86 expert - I don't even know what RDM is.  But this does all seem
> very complicated.  Can I have a second opinion from an x86 expert ?

[Patch 00/13] provides some background for necessary changes. In a nutshell,
the goal is to reserve ACPI reported reserved memory (through RMRR table,
but renamed to Reserved Device Memory (RDM) as a general user space
term here) in gfn space. There are various options to do reservation and 
handle potential conflicts, and based on long discussion people agreed that 
putting that logic centrally in libxl looks a cleaner way to go. That's why this 
patch is a bit complex being a core logic of this patch set. I'd expect other 
maintainers to also review this part carefully.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC][PATCH 02/13] introduce XENMEM_reserved_device_memory_map
  2015-04-10  9:21 ` [RFC][PATCH 02/13] introduce XENMEM_reserved_device_memory_map Tiejun Chen
@ 2015-04-16 14:59   ` Tim Deegan
  2015-04-16 15:10     ` Jan Beulich
  0 siblings, 1 reply; 125+ messages in thread
From: Tim Deegan @ 2015-04-16 14:59 UTC (permalink / raw)
  To: Tiejun Chen
  Cc: kevin.tian, wei.liu2, ian.campbell, andrew.cooper3, Ian.Jackson,
	xen-devel, stefano.stabellini, JBeulich, yang.z.zhang

Hi,

At 17:21 +0800 on 10 Apr (1428686513), Tiejun Chen wrote:
> diff --git a/xen/include/public/memory.h b/xen/include/public/memory.h
> index 2b5206b..36e5f54 100644
> --- a/xen/include/public/memory.h
> +++ b/xen/include/public/memory.h
> @@ -574,7 +574,37 @@ struct xen_vnuma_topology_info {
>  typedef struct xen_vnuma_topology_info xen_vnuma_topology_info_t;
>  DEFINE_XEN_GUEST_HANDLE(xen_vnuma_topology_info_t);
>  
> -/* Next available subop number is 27 */
> +/*
> + * For legacy reasons, some devices must be configured with special memory
> + * regions to function correctly.  The guest would take these regions
> + * according to different user policies.
> + */

I don't understand what this means.  Can you try to write a comment
that would tell an OS developer:
 - what the reserved device memory map actually means; and
 - what this hypercall does.

> @@ -121,6 +121,8 @@ void iommu_dt_domain_destroy(struct domain *d);
>  
>  struct page_info;
>  
> +typedef int iommu_grdm_t(xen_pfn_t start, xen_ulong_t nr, u32 id, void *ctxt);

This needs a comment describing what the return values are.

Cheers,

Tim.

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC][PATCH 05/13] xen/x86/p2m: introduce set_identity_p2m_entry
  2015-04-10  9:21 ` [RFC][PATCH 05/13] xen/x86/p2m: introduce set_identity_p2m_entry Tiejun Chen
@ 2015-04-16 15:05   ` Tim Deegan
  2015-04-23 12:33     ` Chen, Tiejun
  0 siblings, 1 reply; 125+ messages in thread
From: Tim Deegan @ 2015-04-16 15:05 UTC (permalink / raw)
  To: Tiejun Chen
  Cc: kevin.tian, wei.liu2, ian.campbell, andrew.cooper3, Ian.Jackson,
	xen-devel, stefano.stabellini, JBeulich, yang.z.zhang

Hi,

At 17:21 +0800 on 10 Apr (1428686516), Tiejun Chen wrote:
> @@ -862,6 +862,36 @@ int set_mmio_p2m_entry(struct domain *d, unsigned long gfn, mfn_t mfn,
>      return set_typed_p2m_entry(d, gfn, mfn, p2m_mmio_direct, access);
>  }
>  
> +int set_identity_p2m_entry(struct domain *d, unsigned long gfn,
> +                           p2m_access_t p2ma)
> +{
> +    p2m_type_t p2mt;
> +    p2m_access_t a;
> +    mfn_t mfn;
> +    struct p2m_domain *p2m = p2m_get_hostp2m(d);
> +    int ret = -EBUSY;
> +
> +    gfn_lock(p2m, gfn, 0);
> +
> +    mfn = p2m->get_entry(p2m, gfn, &p2mt, &a, 0, NULL);
> +
> +    if ( !mfn_valid(mfn) )

I don't think this check is quite right -- for example, this p2m entry
might be an MMIO mapping or a PoD entry.  "if ( p2mt == p2m_invalid )"
would be better.

> +        ret = p2m_set_entry(p2m, gfn, _mfn(gfn), PAGE_ORDER_4K, p2m_mmio_direct,

This line seems to be > 80 chars -- can you wrap it a bit earlier, please?

Cheers,

Tim.

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC][PATCH 02/13] introduce XENMEM_reserved_device_memory_map
  2015-04-16 14:59   ` Tim Deegan
@ 2015-04-16 15:10     ` Jan Beulich
  2015-04-16 15:24       ` Tim Deegan
  2015-04-23 12:32       ` Chen, Tiejun
  0 siblings, 2 replies; 125+ messages in thread
From: Jan Beulich @ 2015-04-16 15:10 UTC (permalink / raw)
  To: Tim Deegan
  Cc: kevin.tian, wei.liu2, ian.campbell, andrew.cooper3, Ian.Jackson,
	xen-devel, stefano.stabellini, yang.z.zhang, Tiejun Chen

>>> On 16.04.15 at 16:59, <tim@xen.org> wrote:
> At 17:21 +0800 on 10 Apr (1428686513), Tiejun Chen wrote:
>> diff --git a/xen/include/public/memory.h b/xen/include/public/memory.h
>> index 2b5206b..36e5f54 100644
>> --- a/xen/include/public/memory.h
>> +++ b/xen/include/public/memory.h
>> @@ -574,7 +574,37 @@ struct xen_vnuma_topology_info {
>>  typedef struct xen_vnuma_topology_info xen_vnuma_topology_info_t;
>>  DEFINE_XEN_GUEST_HANDLE(xen_vnuma_topology_info_t);
>>  
>> -/* Next available subop number is 27 */
>> +/*
>> + * For legacy reasons, some devices must be configured with special memory
>> + * regions to function correctly.  The guest would take these regions
>> + * according to different user policies.
>> + */
> 
> I don't understand what this means.  Can you try to write a comment
> that would tell an OS developer:
>  - what the reserved device memory map actually means; and
>  - what this hypercall does.

For one, this is meant to be a tools only interface, hence the OS
developer shouldn't care much. And then I don't think we should
be explaining the RMRR concept here. Which would leave to add
sentence saying "This hypercall allows to retrieve ...".

>> @@ -121,6 +121,8 @@ void iommu_dt_domain_destroy(struct domain *d);
>>  
>>  struct page_info;
>>  
>> +typedef int iommu_grdm_t(xen_pfn_t start, xen_ulong_t nr, u32 id, void *ctxt);
> 
> This needs a comment describing what the return values are.

Will do.

Jan

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC][PATCH 06/13] xen:vtd: create RMRR mapping
  2015-04-10  9:21 ` [RFC][PATCH 06/13] xen:vtd: create RMRR mapping Tiejun Chen
@ 2015-04-16 15:16   ` Tim Deegan
  2015-04-16 15:50     ` Tian, Kevin
  0 siblings, 1 reply; 125+ messages in thread
From: Tim Deegan @ 2015-04-16 15:16 UTC (permalink / raw)
  To: Tiejun Chen
  Cc: kevin.tian, wei.liu2, ian.campbell, andrew.cooper3, Ian.Jackson,
	xen-devel, stefano.stabellini, JBeulich, yang.z.zhang

Hi,

At 17:21 +0800 on 10 Apr (1428686517), Tiejun Chen wrote:
> RMRR reserved regions must be setup in the pfn space with an identity
> mapping to reported mfn. However existing code has problem to setup
> correct mapping when VT-d shares EPT page table, so lead to problem
> when assigning devices (e.g GPU) with RMRR reported. This patch
> aims to setup identity mapping in p2m layer, regardless of whether EPT
> is shared or not.

This leaves the IOMMU mapping in place even if the p2m mapping
fails.

I think it would be better to keep the p2m as the
canonical mapping and make the iommu tables match the p2m.
So, this function should not call intel_iommu_map_page() itself;
rather, set_identity_p2m_entry() should call intel_iommu_map_page() if
it adds the p2m entry.

Likewise for the unmap code just above this: we should be calling a
p2m function here (guest_physmap_remove_page(), I think), which will
then call iommu_unmap_page() for us.

Cheers,

Tim.

> Signed-off-by: Tiejun Chen <tiejun.chen@intel.com>
> ---
>  xen/drivers/passthrough/vtd/iommu.c | 6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/xen/drivers/passthrough/vtd/iommu.c b/xen/drivers/passthrough/vtd/iommu.c
> index 4e789d1..f8fc6c3 100644
> --- a/xen/drivers/passthrough/vtd/iommu.c
> +++ b/xen/drivers/passthrough/vtd/iommu.c
> @@ -1847,6 +1847,12 @@ static int rmrr_identity_mapping(struct domain *d, bool_t map,
>  
>          if ( err )
>              return err;
> +
> +        if ( !is_hardware_domain(d) )
> +        {
> +            if ( (err = set_identity_p2m_entry(d, base_pfn, p2m_access_rw)) )
> +                return err;
> +        }
>          base_pfn++;
>      }
>  
> -- 
> 1.9.1
> 

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC][PATCH 02/13] introduce XENMEM_reserved_device_memory_map
  2015-04-16 15:10     ` Jan Beulich
@ 2015-04-16 15:24       ` Tim Deegan
  2015-04-16 15:40         ` Tian, Kevin
  2015-04-23 12:32       ` Chen, Tiejun
  1 sibling, 1 reply; 125+ messages in thread
From: Tim Deegan @ 2015-04-16 15:24 UTC (permalink / raw)
  To: Jan Beulich
  Cc: kevin.tian, wei.liu2, ian.campbell, andrew.cooper3, Ian.Jackson,
	xen-devel, stefano.stabellini, yang.z.zhang, Tiejun Chen

At 16:10 +0100 on 16 Apr (1429200644), Jan Beulich wrote:
> >>> On 16.04.15 at 16:59, <tim@xen.org> wrote:
> > At 17:21 +0800 on 10 Apr (1428686513), Tiejun Chen wrote:
> >> diff --git a/xen/include/public/memory.h b/xen/include/public/memory.h
> >> index 2b5206b..36e5f54 100644
> >> --- a/xen/include/public/memory.h
> >> +++ b/xen/include/public/memory.h
> >> @@ -574,7 +574,37 @@ struct xen_vnuma_topology_info {
> >>  typedef struct xen_vnuma_topology_info xen_vnuma_topology_info_t;
> >>  DEFINE_XEN_GUEST_HANDLE(xen_vnuma_topology_info_t);
> >>  
> >> -/* Next available subop number is 27 */
> >> +/*
> >> + * For legacy reasons, some devices must be configured with special memory
> >> + * regions to function correctly.  The guest would take these regions
> >> + * according to different user policies.
> >> + */
> > 
> > I don't understand what this means.  Can you try to write a comment
> > that would tell an OS developer:
> >  - what the reserved device memory map actually means; and
> >  - what this hypercall does.
> 
> For one, this is meant to be a tools only interface, hence the OS
> developer shouldn't care much. And then I don't think we should
> be explaining the RMRR concept here.

Fair enough.  How about:

/*
 * On some legacy devices, certain guest-physical addresses cannot
 * safely be used to map guest RAM.  This hypercall enumerates those
 * regions so the toolstack can avoid using them.
 */

Tim.

> >> @@ -121,6 +121,8 @@ void iommu_dt_domain_destroy(struct domain *d);
> >>  
> >>  struct page_info;
> >>  
> >> +typedef int iommu_grdm_t(xen_pfn_t start, xen_ulong_t nr, u32 id, void *ctxt);
> > 
> > This needs a comment describing what the return values are.
> 
> Will do.
> 
> Jan
> 
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC][PATCH 02/13] introduce XENMEM_reserved_device_memory_map
  2015-04-16 15:24       ` Tim Deegan
@ 2015-04-16 15:40         ` Tian, Kevin
  0 siblings, 0 replies; 125+ messages in thread
From: Tian, Kevin @ 2015-04-16 15:40 UTC (permalink / raw)
  To: Tim Deegan, Jan Beulich
  Cc: wei.liu2, ian.campbell, andrew.cooper3, Ian.Jackson, xen-devel,
	stefano.stabellini, Zhang, Yang Z, Chen, Tiejun

> From: Tim Deegan [mailto:tim@xen.org]
> Sent: Thursday, April 16, 2015 11:24 PM
> 
> At 16:10 +0100 on 16 Apr (1429200644), Jan Beulich wrote:
> > >>> On 16.04.15 at 16:59, <tim@xen.org> wrote:
> > > At 17:21 +0800 on 10 Apr (1428686513), Tiejun Chen wrote:
> > >> diff --git a/xen/include/public/memory.h b/xen/include/public/memory.h
> > >> index 2b5206b..36e5f54 100644
> > >> --- a/xen/include/public/memory.h
> > >> +++ b/xen/include/public/memory.h
> > >> @@ -574,7 +574,37 @@ struct xen_vnuma_topology_info {
> > >>  typedef struct xen_vnuma_topology_info xen_vnuma_topology_info_t;
> > >>  DEFINE_XEN_GUEST_HANDLE(xen_vnuma_topology_info_t);
> > >>
> > >> -/* Next available subop number is 27 */
> > >> +/*
> > >> + * For legacy reasons, some devices must be configured with special
> memory
> > >> + * regions to function correctly.  The guest would take these regions
> > >> + * according to different user policies.
> > >> + */
> > >
> > > I don't understand what this means.  Can you try to write a comment
> > > that would tell an OS developer:
> > >  - what the reserved device memory map actually means; and
> > >  - what this hypercall does.
> >
> > For one, this is meant to be a tools only interface, hence the OS
> > developer shouldn't care much. And then I don't think we should
> > be explaining the RMRR concept here.
> 
> Fair enough.  How about:
> 
> /*
>  * On some legacy devices, certain guest-physical addresses cannot
>  * safely be used to map guest RAM.  This hypercall enumerates those
>  * regions so the toolstack can avoid using them.
>  */
> 

to be accurate it's not only about "used to map guest RAM". It's about
"allocated for any use including guest RAM, PCI BAR, etc.". How about:

/*
 * On some legacy devices, certain guest-physical addresses must be 
 * reserved in gfn space which cannot safely be used for other purposes
 * e.g. to map guest RAM.  This hypercall enumerates those regions 
 * so the toolstack can avoid using them.
 */

Thanks
Kevin

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC][PATCH 07/13] xen/passthrough: extend hypercall to support rdm reservation policy
  2015-04-10  9:21 ` [RFC][PATCH 07/13] xen/passthrough: extend hypercall to support rdm reservation policy Tiejun Chen
@ 2015-04-16 15:40   ` Tim Deegan
  2015-04-23 12:32     ` Chen, Tiejun
  2015-04-20 13:36   ` Jan Beulich
  2015-05-08 16:07   ` Julien Grall
  2 siblings, 1 reply; 125+ messages in thread
From: Tim Deegan @ 2015-04-16 15:40 UTC (permalink / raw)
  To: Tiejun Chen
  Cc: kevin.tian, wei.liu2, ian.campbell, andrew.cooper3, Ian.Jackson,
	xen-devel, stefano.stabellini, JBeulich, yang.z.zhang

Hi,

At 17:21 +0800 on 10 Apr (1428686518), Tiejun Chen wrote:
> +/*
> + * In some cases, e.g. add a device to hwdomain, and remove a device from
> + * user domain, 'try' is fine enough since this is always safe to hwdomain.
> + */
> +#define XEN_DOMCTL_PCIDEV_RDM_DEFAULT XEN_DOMCTL_PCIDEV_RDM_TRY

Do we need a way to change this default?

>  static int rmrr_identity_mapping(struct domain *d, bool_t map,
> -                                 const struct acpi_rmrr_unit *rmrr)
> +                                 const struct acpi_rmrr_unit *rmrr,
> +                                 u32 flag)
>  {
>      unsigned long base_pfn = rmrr->base_address >> PAGE_SHIFT_4K;
>      unsigned long end_pfn = PAGE_ALIGN_4K(rmrr->end_address) >> PAGE_SHIFT_4K;
> @@ -1851,7 +1857,14 @@ static int rmrr_identity_mapping(struct domain *d, bool_t map,
>          if ( !is_hardware_domain(d) )
>          {
>              if ( (err = set_identity_p2m_entry(d, base_pfn, p2m_access_rw)) )
> -                return err;
> +            {
> +                if ( flag == XEN_DOMCTL_PCIDEV_RDM_TRY )
> +                {
> +                    printk(XENLOG_G_WARNING "Some devices may work failed .\n");

This is a bit cryptic.  How about:
 "RMRR map failed.  Device %04x:%02x:%02x.%u and domain %d may be unstable.",
(and pass in the devfn from the caller so we can print the details of
the device).

> @@ -493,6 +493,10 @@ DEFINE_XEN_GUEST_HANDLE(xen_domctl_sendtrigger_t);
>  /* XEN_DOMCTL_deassign_device */
>  struct xen_domctl_assign_device {
>      uint32_t  machine_sbdf;   /* machine PCI ID of assigned device */
> +    /* IN */
> +#define XEN_DOMCTL_PCIDEV_RDM_TRY       0
> +#define XEN_DOMCTL_PCIDEV_RDM_FORCE     1

"STRICT" might be a better word than "FORCE" (here and everywhere
else).  "FORCE" sounds like either Xen will assign the device even if
it's unsafe,  which is the opposite of what's meant IIUC.

Cheers,

Tim.

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC][PATCH 09/13] xen: enable XENMEM_set_memory_map in hvm
  2015-04-10  9:22 ` [RFC][PATCH 09/13] xen: enable XENMEM_set_memory_map in hvm Tiejun Chen
@ 2015-04-16 15:42   ` Tim Deegan
  2015-04-20 13:46   ` Jan Beulich
  1 sibling, 0 replies; 125+ messages in thread
From: Tim Deegan @ 2015-04-16 15:42 UTC (permalink / raw)
  To: Tiejun Chen
  Cc: kevin.tian, wei.liu2, ian.campbell, andrew.cooper3, Ian.Jackson,
	xen-devel, stefano.stabellini, JBeulich, yang.z.zhang

At 17:22 +0800 on 10 Apr (1428686520), Tiejun Chen wrote:
> This patch enables XENMEM_set_memory_map in hvm. So we can use it to
> setup the e820 mappings.
> 
> Signed-off-by: Tiejun Chen <tiejun.chen@intel.com>

Reviewed-by: Tim Deegan <tim@xen.org>

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC][PATCH 06/13] xen:vtd: create RMRR mapping
  2015-04-16 15:16   ` Tim Deegan
@ 2015-04-16 15:50     ` Tian, Kevin
  0 siblings, 0 replies; 125+ messages in thread
From: Tian, Kevin @ 2015-04-16 15:50 UTC (permalink / raw)
  To: Tim Deegan, Chen, Tiejun
  Cc: wei.liu2, ian.campbell, andrew.cooper3, Ian.Jackson, xen-devel,
	stefano.stabellini, JBeulich, Zhang, Yang Z

> From: Tim Deegan [mailto:tim@xen.org]
> Sent: Thursday, April 16, 2015 11:16 PM
> 
> Hi,
> 
> At 17:21 +0800 on 10 Apr (1428686517), Tiejun Chen wrote:
> > RMRR reserved regions must be setup in the pfn space with an identity
> > mapping to reported mfn. However existing code has problem to setup
> > correct mapping when VT-d shares EPT page table, so lead to problem
> > when assigning devices (e.g GPU) with RMRR reported. This patch
> > aims to setup identity mapping in p2m layer, regardless of whether EPT
> > is shared or not.
> 
> This leaves the IOMMU mapping in place even if the p2m mapping
> fails.
> 
> I think it would be better to keep the p2m as the
> canonical mapping and make the iommu tables match the p2m.
> So, this function should not call intel_iommu_map_page() itself;
> rather, set_identity_p2m_entry() should call intel_iommu_map_page() if
> it adds the p2m entry.
> 
> Likewise for the unmap code just above this: we should be calling a
> p2m function here (guest_physmap_remove_page(), I think), which will
> then call iommu_unmap_page() for us.
> 
> Cheers,
> 
> Tim.
> 

Thanks for suggestion. It looks like a right direction.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC][PATCH 04/13] tools/libxl: detect and avoid conflicts with RDM
  2015-04-10  9:21 ` [RFC][PATCH 04/13] tools/libxl: detect and avoid conflicts with RDM Tiejun Chen
  2015-04-15 13:10   ` Ian Jackson
@ 2015-04-20 11:13   ` Jan Beulich
  2015-05-06 15:00     ` Chen, Tiejun
  2015-05-08 14:43   ` Wei Liu
  2 siblings, 1 reply; 125+ messages in thread
From: Jan Beulich @ 2015-04-20 11:13 UTC (permalink / raw)
  To: Tiejun Chen
  Cc: tim, kevin.tian, wei.liu2, ian.campbell, andrew.cooper3,
	Ian.Jackson, xen-devel, stefano.stabellini, yang.z.zhang

>>> On 10.04.15 at 11:21, <tiejun.chen@intel.com> wrote:
> --- a/tools/libxc/xc_domain.c
> +++ b/tools/libxc/xc_domain.c
> @@ -1665,6 +1665,46 @@ int xc_assign_device(
>      return do_domctl(xch, &domctl);
>  }
>  
> +struct xen_reserved_device_memory
> +*xc_device_get_rdm(xc_interface *xch,
> +                   uint32_t flag,
> +                   uint16_t seg,
> +                   uint8_t bus,
> +                   uint8_t devfn,
> +                   unsigned int *nr_entries)
> +{

So what's the point of having both this new function and
xc_reserved_device_memory_map()? Is the latter useful for
anything besides the purpose here?

> +    struct xen_reserved_device_memory *xrdm = NULL;
> +    int rc = xc_reserved_device_memory_map(xch, flag, seg, bus, devfn, xrdm,
> +                                           nr_entries);
> +
> +    if ( rc < 0 )
> +    {
> +        if ( errno == ENOBUFS )
> +        {
> +            if ( (xrdm = malloc(*nr_entries *
> +                                sizeof(xen_reserved_device_memory_t))) == NULL )
> +            {
> +                PERROR("Could not allocate memory.");

Now that's exactly the kind of error message that makes no sense:
As errno will already cause PERROR() to print something along the
lines of the message you provide here, you're just creating
redundancy. Indicating the purpose of the allocation, otoh, would
add helpful context for the one inspecting the resulting log.

> @@ -275,6 +272,7 @@ static int setup_guest(xc_interface *xch,
>      elf_parse_binary(&elf);
>      v_start = 0;
>      v_end = args->mem_size;
> +    v_lowend = lowmem_end = args->lowmem_size;

Considering that both variables get set to the same value and
don't appear to be changed subsequently - why two variables?

> @@ -302,8 +300,11 @@ static int setup_guest(xc_interface *xch,
>  
>      for ( i = 0; i < nr_pages; i++ )
>          page_array[i] = i;
> -    for ( i = mmio_start >> PAGE_SHIFT; i < nr_pages; i++ )
> -        page_array[i] += mmio_size >> PAGE_SHIFT;
> +    /*
> +     * This condition 'lowmem_end <= mmio_start' is always true.
> +     */

For one I think you mean "The", not "This", as there's no such
condition around here. And then - why? DYM "is supposed to
always be true"? In which case you may want to check...

> @@ -588,9 +589,6 @@ int xc_hvm_build(xc_interface *xch, uint32_t domid,
>      if ( args.mem_target == 0 )
>          args.mem_target = args.mem_size;
>  
> -    if ( args.mmio_size == 0 )
> -        args.mmio_size = HVM_BELOW_4G_MMIO_LENGTH;
> -
>      /* An HVM guest must be initialised with at least 2MB memory. */
>      if ( args.mem_size < (2ull << 20) || args.mem_target < (2ull << 20) )
>          return -1;
> @@ -634,6 +632,8 @@ int xc_hvm_build_target_mem(xc_interface *xch,
>      args.mem_size = (uint64_t)memsize << 20;
>      args.mem_target = (uint64_t)target << 20;
>      args.image_file_name = image_name;
> +    if ( args.mmio_size == 0 )
> +        args.mmio_size = HVM_BELOW_4G_MMIO_LENGTH;

Is this code motion really necessary?

> --- a/tools/libxl/libxl_dm.c
> +++ b/tools/libxl/libxl_dm.c
> @@ -90,6 +90,201 @@ const char *libxl__domain_device_model(libxl__gc *gc,
>      return dm;
>  }
>  
> +/*
> + * Check whether there exists rdm hole in the specified memory range.
> + * Returns 1 if exists, else returns 0.
> + */
> +static int check_rdm_hole(uint64_t start, uint64_t memsize,
> +                          uint64_t rdm_start, uint64_t rdm_size)
> +{
> +    if ( start + memsize <= rdm_start || start >= rdm_start + rdm_size )
> +        return 0;
> +    else
> +        return 1;
> +}

The function appears to return a boolean type, so please make its
return value express this. Also the name doesn't really imply the
meaning of "true" and "false" being returned - how about
overlaps_rdm()? And finally, while I'm not the maintainer of this code
and hence don't have the final say on stylistic issues, I don't see
why the above couldn't be expressed with a single return statement.

> +int libxl__domain_device_check_rdm(libxl__gc *gc,
> +                                   libxl_domain_config *d_config,
> +                                   uint64_t rdm_mem_guard,
> +                                   struct xc_hvm_build_args *args)
> +{
> +    int i, j, conflict;
> +    libxl_ctx *ctx = libxl__gc_owner(gc);
> +    struct xen_reserved_device_memory *xrdm = NULL;
> +    unsigned int nr_all_rdms = 0;
> +    uint64_t rdm_start, rdm_size, highmem_end = (1ULL << 32);
> +    uint32_t type = d_config->b_info.rdm.type;
> +    uint16_t seg;
> +    uint8_t bus, devfn;
> +
> +    /* Might not to expose rdm. */
> +    if ((type == LIBXL_RDM_RESERVE_TYPE_NONE) && !d_config->num_pcidevs)
> +        return 0;
> +
> +    /* Collect all rdm info if exist. */
> +    xrdm = xc_device_get_rdm(ctx->xch, LIBXL_RDM_RESERVE_TYPE_HOST,
> +                             0, 0, 0, &nr_all_rdms);

What meaning has passing a libxl private value to a libxc function?

> +    if (!nr_all_rdms)
> +        return 0;
> +    d_config->rdms = libxl__calloc(gc, nr_all_rdms,
> +                                   sizeof(libxl_device_rdm));
> +    memset(d_config->rdms, 0, sizeof(libxl_device_rdm));
> +
> +    /* Query all RDM entries in this platform */
> +    if (type == LIBXL_RDM_RESERVE_TYPE_HOST) {
> +        d_config->num_rdms = nr_all_rdms;
> +        for (i = 0; i < d_config->num_rdms; i++) {
> +            d_config->rdms[i].start =
> +                                (uint64_t)xrdm[i].start_pfn << XC_PAGE_SHIFT;
> +            d_config->rdms[i].size =
> +                                (uint64_t)xrdm[i].nr_pages << XC_PAGE_SHIFT;
> +            d_config->rdms[i].flag = d_config->b_info.rdm.reserve;
> +        }
> +    } else {
> +        d_config->num_rdms = 0;
> +    }
> +    free(xrdm);
> +
> +    /* Query RDM entries per-device */
> +    for (i = 0; i < d_config->num_pcidevs; i++) {
> +        unsigned int nr_entries = 0;
> +        bool new = true;
> +        seg = d_config->pcidevs[i].domain;
> +        bus = d_config->pcidevs[i].bus;
> +        devfn = PCI_DEVFN(d_config->pcidevs[i].dev, d_config->pcidevs[i].func);
> +        nr_entries = 0;
> +        xrdm = xc_device_get_rdm(ctx->xch, LIBXL_RDM_RESERVE_TYPE_NONE,
> +                                 seg, bus, devfn, &nr_entries);
> +        /* No RDM to associated with this device. */
> +        if (!nr_entries)
> +            continue;
> +
> +        /* Need to check whether this entry is already saved in the array.
> +         * This could come from two cases:
> +         *
> +         *   - user may configure to get all RMRRs in this platform, which
> +         * is already queried before this point
> +         *   - or two assigned devices may share one RMRR entry
> +         *
> +         * different policies may be configured on the same RMRR due to above
> +         * two cases. We choose a simple policy to always favor stricter policy
> +         */
> +        for (j = 0; j < d_config->num_rdms; j++) {
> +            if (d_config->rdms[j].start ==
> +                                (uint64_t)xrdm[0].start_pfn << XC_PAGE_SHIFT)
> +             {
> +                if (d_config->rdms[j].flag != LIBXL_RDM_RESERVE_FLAG_FORCE)
> +                    d_config->rdms[j].flag = d_config->pcidevs[i].rdm_reserve;
> +                new = false;
> +                break;
> +            }
> +        }
> +
> +        if (new) {
> +            if (d_config->num_rdms > nr_all_rdms - 1) {

">= nr_all_rdms" expresses the same thing.

> +                LIBXL__LOG(CTX, LIBXL__LOG_ERROR, "Exceed rdm array boundary!\n");
> +                free(xrdm);
> +                return -1;
> +            }
> +
> +            /*
> +             * This is a new entry.
> +             */
> +            d_config->rdms[d_config->num_rdms].start =
> +                                (uint64_t)xrdm[0].start_pfn << XC_PAGE_SHIFT;
> +            d_config->rdms[d_config->num_rdms].size =
> +                                (uint64_t)xrdm[0].nr_pages << XC_PAGE_SHIFT;
> +            d_config->rdms[d_config->num_rdms].flag = 
> d_config->pcidevs[i].rdm_reserve;
> +            d_config->num_rdms++;
> +        }
> +        free(xrdm);
> +    }
> +
> +    /* Fix highmem. */
> +    if (args->mem_size > args->lowmem_size)
> +        highmem_end += (args->mem_size - args->lowmem_size);

This looks misplaced. In particular it doesn't depend on anything done
so far in this function, so it being here is at least confusing. Is there
any reason not to set up the variable correctly right at the beginning
of the function, or do _all_ of its initialization here?

> +    /* Next step is to check and avoid potential conflict between RDM entries
> +     * and guest RAM. To avoid intrusive impact to existing memory layout
> +     * {lowmem, mmio, highmem} which is passed around various function blocks,
> +     * below conflicts are not handled which are rare and handling them would
> +     * lead to a more scattered layout:
> +     *  - RMRR in highmem area (>4G)
> +     *  - RMRR lower than a defined memory boundary (e.g. 2G)

So how is that going to do on my box having RMRRs below 1Mb?

> +     * Otherwise for conflicts between boundary and 4G, we'll simply move lowmem
> +     * end below reserved region to solve conflict.
> +     *
> +     * If a conflict is detected on a given RMRR entry, an error will be
> +     * returned.
> +     * If 'force' policy is specified. Or conflict is treated as a warning if

The first sentence here appears to belong at the end of the one ending
on the previous line.

> +     * 'try' policy is specified, and we also mark this as INVALID not to expose
> +     * this entry to hvmloader.

What is "this" in "... also mark this as ..."? Certainly neither the conflict
nor the warning.

> +     *
> +     * Firstly we should check the case of rdm < 4G because we may need to
> +     * expand highmem_end.
> +     */
> +    for (i = 0; i < d_config->num_rdms; i++) {
> +        rdm_start = d_config->rdms[i].start;
> +        rdm_size = d_config->rdms[i].size;
> +        conflict = check_rdm_hole(0, args->lowmem_size, rdm_start, rdm_size);
> +
> +        if (!conflict)
> +            continue;
> +
> +        /*
> +         * Just check if RDM > our memory boundary
> +         */
> +        if (d_config->rdms[i].start > rdm_mem_guard) {
> +            /*
> +             * We will move downwards lowmem_end so we have to expand
> +             * highmem_end.
> +             */
> +            highmem_end += (args->lowmem_size - rdm_start);
> +            /* Now move downwards lowmem_end. */
> +            args->lowmem_size = rdm_start;

Considering that the action here doesn't depend on the specific
->rdms[] slot being looked at, I don't see why the loop needs to
continue. Also - what if rdm_mem_guard > args->lowmem_size?

> +        }
> +    }
> +
> +    /*
> +     * Finally we can take same policy to check lowmem(< 2G) and
> +     * highmem adjusted above.
> +     */

I don't follow - neither the code above nor the one below makes any
distinction between memory below 2Gb and memory below 4Gb (as
far as check_rdm_hole() calls are concerned).

> +    for (i = 0; i < d_config->num_rdms; i++) {
> +        rdm_start = d_config->rdms[i].start;
> +        rdm_size = d_config->rdms[i].size;
> +        /* Does this entry conflict with lowmem? */
> +        conflict = check_rdm_hole(0, args->lowmem_size,
> +                                  rdm_start, rdm_size);
> +        /* Does this entry conflict with highmem? */
> +        conflict |= check_rdm_hole((1ULL<<32), highmem_end,

The second parameter of the function is a size, not an address.

> @@ -802,6 +804,8 @@ int libxl__build_hvm(libxl__gc *gc, uint32_t domid,
>       * Do all this in one step here...
>       */
>      args.mem_size = (uint64_t)(info->max_memkb - info->video_memkb) << 10;
> +    args.lowmem_size = (args.mem_size > (1ULL << 32)) ?
> +                            (1ULL << 32) : args.mem_size;

Use min().

> @@ -811,6 +815,27 @@ int libxl__build_hvm(libxl__gc *gc, uint32_t domid,
>          if (max_ram_below_4g < HVM_BELOW_4G_MMIO_START)
>              args.mmio_size = info->u.hvm.mmio_hole_memkb << 10;
>      }
> +
> +    if (args.mmio_size == 0)
> +        args.mmio_size = HVM_BELOW_4G_MMIO_LENGTH;
> +    mmio_start = (1ull << 32) - args.mmio_size;
> +
> +    if (args.lowmem_size > mmio_start)
> +        args.lowmem_size = mmio_start;
> +
> +    /*
> +     * We'd like to set a memory boundary to determine if we need to check
> +     * any overlap with reserved device memory.
> +     *
> +     * TODO: we will add a config parameter for this boundary value next.

According to the titles of subsequent patches this doesn't seem to
happen within this series.

But overall the intentions of this patch seem right.

Jan

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC][PATCH 07/13] xen/passthrough: extend hypercall to support rdm reservation policy
  2015-04-10  9:21 ` [RFC][PATCH 07/13] xen/passthrough: extend hypercall to support rdm reservation policy Tiejun Chen
  2015-04-16 15:40   ` Tim Deegan
@ 2015-04-20 13:36   ` Jan Beulich
  2015-05-11  8:37     ` Chen, Tiejun
  2015-05-08 16:07   ` Julien Grall
  2 siblings, 1 reply; 125+ messages in thread
From: Jan Beulich @ 2015-04-20 13:36 UTC (permalink / raw)
  To: Tiejun Chen
  Cc: tim, kevin.tian, wei.liu2, ian.campbell, andrew.cooper3,
	Ian.Jackson, xen-devel, stefano.stabellini, yang.z.zhang

>>> On 10.04.15 at 11:21, <tiejun.chen@intel.com> wrote:
> --- a/xen/drivers/passthrough/vtd/iommu.c
> +++ b/xen/drivers/passthrough/vtd/iommu.c
> @@ -1793,8 +1793,14 @@ static void iommu_set_pgd(struct domain *d)
>      hd->arch.pgd_maddr = pagetable_get_paddr(pagetable_from_mfn(pgd_mfn));
>  }
>  
> +/*
> + * In some cases, e.g. add a device to hwdomain, and remove a device from
> + * user domain, 'try' is fine enough since this is always safe to hwdomain.
> + */
> +#define XEN_DOMCTL_PCIDEV_RDM_DEFAULT XEN_DOMCTL_PCIDEV_RDM_TRY

Then why invent this one instead of just using ..._TRY at the respective
call sites.

> @@ -1851,7 +1857,14 @@ static int rmrr_identity_mapping(struct domain *d, bool_t map,
>          if ( !is_hardware_domain(d) )
>          {
>              if ( (err = set_identity_p2m_entry(d, base_pfn, p2m_access_rw)) )
> -                return err;
> +            {
> +                if ( flag == XEN_DOMCTL_PCIDEV_RDM_TRY )
> +                {
> +                    printk(XENLOG_G_WARNING "Some devices may work failed .\n");
> +                }

Iirc someone else already pointed out that this message needs to
change to something understandable. Perhaps it should also log
the PFN causing the error. And the braces here should be dropped
(if you inverted the condition you wouldn't even need an "else"; or
wait - this shouldn't be just "else", as you shouldn't imply anything
other than ..._TRY means ..._FORCE, albeit the respective check
perhaps belongs in the generic IOMMU code, not here).

> --- a/xen/include/public/domctl.h
> +++ b/xen/include/public/domctl.h
> @@ -493,6 +493,10 @@ DEFINE_XEN_GUEST_HANDLE(xen_domctl_sendtrigger_t);
>  /* XEN_DOMCTL_deassign_device */
>  struct xen_domctl_assign_device {
>      uint32_t  machine_sbdf;   /* machine PCI ID of assigned device */
> +    /* IN */
> +#define XEN_DOMCTL_PCIDEV_RDM_TRY       0
> +#define XEN_DOMCTL_PCIDEV_RDM_FORCE     1
> +    uint32_t  sbdf_flag;   /* flag of assigned device */

Why do you call this "sbdf_flag" - it holds nothing SBDF-like.

Jan

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC][PATCH 08/13] tools: extend xc_assign_device() to support rdm reservation policy
  2015-04-10  9:21 ` [RFC][PATCH 08/13] tools: extend xc_assign_device() " Tiejun Chen
@ 2015-04-20 13:39   ` Jan Beulich
  2015-05-11  9:45     ` Chen, Tiejun
  0 siblings, 1 reply; 125+ messages in thread
From: Jan Beulich @ 2015-04-20 13:39 UTC (permalink / raw)
  To: Tiejun Chen
  Cc: tim, kevin.tian, wei.liu2, ian.campbell, andrew.cooper3,
	Ian.Jackson, xen-devel, stefano.stabellini, yang.z.zhang

>>> On 10.04.15 at 11:21, <tiejun.chen@intel.com> wrote:
> --- a/tools/libxc/xc_domain.c
> +++ b/tools/libxc/xc_domain.c
> @@ -1654,13 +1654,15 @@ int xc_domain_setdebugging(xc_interface *xch,
>  int xc_assign_device(
>      xc_interface *xch,
>      uint32_t domid,
> -    uint32_t machine_sbdf)
> +    uint32_t machine_sbdf,
> +    uint32_t flag)
>  {
>      DECLARE_DOMCTL;
>  
>      domctl.cmd = XEN_DOMCTL_assign_device;
>      domctl.domain = domid;
>      domctl.u.assign_device.machine_sbdf = machine_sbdf;
> +    domctl.u.assign_device.sbdf_flag = flag;

The previous patch needs to initialize this field, in order to not pass
random input to the hypervisor. Using the ..._TRY value here
intermediately (until this patch gets applied) would seem the right
approach.

Jan

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC][PATCH 09/13] xen: enable XENMEM_set_memory_map in hvm
  2015-04-10  9:22 ` [RFC][PATCH 09/13] xen: enable XENMEM_set_memory_map in hvm Tiejun Chen
  2015-04-16 15:42   ` Tim Deegan
@ 2015-04-20 13:46   ` Jan Beulich
  2015-05-15  2:33     ` Chen, Tiejun
  1 sibling, 1 reply; 125+ messages in thread
From: Jan Beulich @ 2015-04-20 13:46 UTC (permalink / raw)
  To: Tiejun Chen
  Cc: tim, kevin.tian, wei.liu2, ian.campbell, andrew.cooper3,
	Ian.Jackson, xen-devel, stefano.stabellini, yang.z.zhang

>>> On 10.04.15 at 11:22, <tiejun.chen@intel.com> wrote:
> --- a/xen/arch/x86/hvm/hvm.c
> +++ b/xen/arch/x86/hvm/hvm.c
> @@ -4729,7 +4729,6 @@ static long hvm_memory_op(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
>  
>      switch ( cmd & MEMOP_CMD_MASK )
>      {
> -    case XENMEM_memory_map:

Title and description talk about XENMEM_set_memory_map only. As
I think the implementation is right, the former will need updating. Do
you actually need a HVM domain to be able to XENMEM_set_memory_map
on itself? If not, it should probably replace XENMEM_memory_map here.

Jan

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC][PATCH 10/13] tools: extend XENMEM_set_memory_map
  2015-04-10  9:22 ` [RFC][PATCH 10/13] tools: extend XENMEM_set_memory_map Tiejun Chen
  2015-04-10 10:01   ` Wei Liu
@ 2015-04-20 13:51   ` Jan Beulich
  2015-05-15  2:57     ` Chen, Tiejun
  1 sibling, 1 reply; 125+ messages in thread
From: Jan Beulich @ 2015-04-20 13:51 UTC (permalink / raw)
  To: Tiejun Chen
  Cc: tim, kevin.tian, wei.liu2, ian.campbell, andrew.cooper3,
	Ian.Jackson, xen-devel, stefano.stabellini, yang.z.zhang

>>> On 10.04.15 at 11:22, <tiejun.chen@intel.com> wrote:
> --- a/tools/libxl/libxl_dom.c
> +++ b/tools/libxl/libxl_dom.c
> @@ -787,6 +787,70 @@ out:
>      return rc;
>  }
>  
> +static int libxl__domain_construct_memmap(libxl_ctx *ctx,
> +                                          libxl_domain_config *d_config,
> +                                          uint32_t domid,
> +                                          struct xc_hvm_build_args *args,
> +                                          int num_pcidevs,
> +                                          libxl_device_pci *pcidevs)
> +{
> +    unsigned int nr = 0, i;
> +    /* We always own at least one lowmem entry. */
> +    unsigned int e820_entries = 1;
> +    uint64_t highmem_end = 0, highmem_size = args->mem_size - args->lowmem_size;
> +    struct e820entry *e820 = NULL;
> +
> +    /* Add all rdm entries. */
> +    e820_entries += d_config->num_rdms;
> +
> +    /* If we should have a highmem range. */
> +    if (highmem_size)
> +    {
> +        highmem_end = (1ull<<32) + highmem_size;
> +        e820_entries++;
> +    }
> +
> +    e820 = malloc(sizeof(struct e820entry) * e820_entries);
> +    if (!e820) {
> +        return -1;
> +    }
> +
> +    /* Low memory */
> +    e820[nr].addr = 0x100000;
> +    e820[nr].size = args->lowmem_size - 0x100000;
> +    e820[nr].type = E820_RAM;

If you really mean it to be this lax (not covering the low 1Mb), then
you need to explain why in a comment (and the consuming side
should also have a similar explanation then).

> +    nr++;
> +
> +    /* RDM mapping */
> +    for (i = 0; i < d_config->num_rdms; i++) {
> +        /*
> +         * We should drop this kind of rdm entry.
> +         */
> +        if (d_config->rdms[i].flag == LIBXL_RDM_RESERVE_FLAG_INVALID)
> +            continue;
> +
> +        e820[nr].addr = d_config->rdms[i].start;
> +        e820[nr].size = d_config->rdms[i].size;
> +        e820[nr].type = E820_RESERVED;
> +        nr++;
> +    }

Is this guaranteed not to produce overlapping entries?

Jan

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC][PATCH 11/13] hvmloader: get guest memory map into memory_map[]
  2015-04-10  9:22 ` [RFC][PATCH 11/13] hvmloader: get guest memory map into memory_map[] Tiejun Chen
@ 2015-04-20 13:57   ` Jan Beulich
  2015-05-15  3:10     ` Chen, Tiejun
  0 siblings, 1 reply; 125+ messages in thread
From: Jan Beulich @ 2015-04-20 13:57 UTC (permalink / raw)
  To: Tiejun Chen
  Cc: tim, kevin.tian, wei.liu2, ian.campbell, andrew.cooper3,
	Ian.Jackson, xen-devel, stefano.stabellini, yang.z.zhang

>>> On 10.04.15 at 11:22, <tiejun.chen@intel.com> wrote:
> --- a/tools/firmware/hvmloader/util.c
> +++ b/tools/firmware/hvmloader/util.c
> @@ -27,6 +27,16 @@
>  #include <xen/memory.h>
>  #include <xen/sched.h>
>  
> +int check_hole_conflict(uint64_t start, uint64_t size,
> +                        uint64_t reserved_start, uint64_t reserved_size)
> +{
> +    if ( start + size <= reserved_start ||
> +            start >= reserved_start + reserved_size )
> +        return 0;
> +    else
> +        return 1;
> +}

See the comments on the similar tool stack function. Also please get
indentation right.

Jan

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC][PATCH 12/13] hvmloader/pci: skip reserved ranges
  2015-04-10  9:22 ` [RFC][PATCH 12/13] hvmloader/pci: skip reserved ranges Tiejun Chen
@ 2015-04-20 14:21   ` Jan Beulich
  2015-05-15  3:18     ` Chen, Tiejun
  0 siblings, 1 reply; 125+ messages in thread
From: Jan Beulich @ 2015-04-20 14:21 UTC (permalink / raw)
  To: Tiejun Chen
  Cc: tim, kevin.tian, wei.liu2, ian.campbell, andrew.cooper3,
	Ian.Jackson, xen-devel, stefano.stabellini, yang.z.zhang

>>> On 10.04.15 at 11:22, <tiejun.chen@intel.com> wrote:
> --- a/tools/firmware/hvmloader/pci.c
> +++ b/tools/firmware/hvmloader/pci.c
> @@ -59,8 +59,8 @@ void pci_setup(void)
>          uint32_t bar_reg;
>          uint64_t bar_sz;
>      } *bars = (struct bars *)scratch_start;
> -    unsigned int i, nr_bars = 0;
> -    uint64_t mmio_hole_size = 0;
> +    unsigned int i, j, nr_bars = 0;
> +    uint64_t mmio_hole_size = 0, reserved_end;
>  
>      const char *s;
>      /*
> @@ -393,8 +393,23 @@ void pci_setup(void)
>          }
>  
>          base = (resource->base  + bar_sz - 1) & ~(uint64_t)(bar_sz - 1);
> + reallocate_mmio:
>          bar_data |= (uint32_t)base;
>          bar_data_upper = (uint32_t)(base >> 32);
> +        for ( j = 0; j < memory_map.nr_map ; j++ )
> +        {
> +            if ( memory_map.map[j].type != E820_RAM )
> +            {
> +                reserved_end = memory_map.map[j].addr + memory_map.map[j].size;
> +                if ( check_hole_conflict(base, bar_sz,
> +                                         memory_map.map[j].addr,
> +                                         memory_map.map[j].size) )
> +                {
> +                    base = (reserved_end  + bar_sz - 1) & ~(uint64_t)(bar_sz - 1);
> +                    goto reallocate_mmio;
> +                }
> +            }
> +        }
>          base += bar_sz;
>  
>          if ( (base < resource->base) || (base > resource->max) )

But you do nothing to make sure the MMIO regions all fit in the
available window (see the code ahead of this relocating RAM if
necessary).

Jan

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC][PATCH 13/13] hvmloader/e820: construct guest e820 table
  2015-04-10  9:22 ` [RFC][PATCH 13/13] hvmloader/e820: construct guest e820 table Tiejun Chen
@ 2015-04-20 14:29   ` Jan Beulich
  2015-05-15  6:11     ` Chen, Tiejun
  0 siblings, 1 reply; 125+ messages in thread
From: Jan Beulich @ 2015-04-20 14:29 UTC (permalink / raw)
  To: Tiejun Chen
  Cc: tim, kevin.tian, wei.liu2, ian.campbell, andrew.cooper3,
	Ian.Jackson, xen-devel, stefano.stabellini, yang.z.zhang

>>> On 10.04.15 at 11:22, <tiejun.chen@intel.com> wrote:
> --- a/tools/firmware/hvmloader/e820.c
> +++ b/tools/firmware/hvmloader/e820.c
> @@ -73,7 +73,8 @@ int build_e820_table(struct e820entry *e820,
>                       unsigned int lowmem_reserved_base,
>                       unsigned int bios_image_base)
>  {
> -    unsigned int nr = 0;
> +    unsigned int nr = 0, i, j;
> +    struct e820entry tmp;

The declaration of "tmp" belongs in the most narrow scope you need
it in.

> @@ -119,10 +120,6 @@ int build_e820_table(struct e820entry *e820,
>  
>      /* Low RAM goes here. Reserve space for special pages. */
>      BUG_ON((hvm_info->low_mem_pgend << PAGE_SHIFT) < (2u << 20));
> -    e820[nr].addr = 0x100000;
> -    e820[nr].size = (hvm_info->low_mem_pgend << PAGE_SHIFT) - e820[nr].addr;
> -    e820[nr].type = E820_RAM;
> -    nr++;

I think the above comment needs adjustment with all this code
removed. I also wonder how meaningful the BUG_ON() is with
->low_mem_pgend no longer used for E820 table construction.
Perhaps this needs another BUG_ON() validating that the field
matches some value from memory_map.map[]?

> @@ -159,16 +156,37 @@ int build_e820_table(struct e820entry *e820,
>          nr++;
>      }
>  
> -
> -    if ( hvm_info->high_mem_pgend )
> +    /* Construct the remaings according memory_map[]. */
> +    for ( i = 0; i < memory_map.nr_map; i++ )
>      {
> -        e820[nr].addr = ((uint64_t)1 << 32);
> -        e820[nr].size =
> -            ((uint64_t)hvm_info->high_mem_pgend << PAGE_SHIFT) - e820[nr].addr;
> -        e820[nr].type = E820_RAM;
> +        e820[nr].addr = memory_map.map[i].addr;
> +        e820[nr].size = memory_map.map[i].size;
> +        e820[nr].type = memory_map.map[i].type;

Afaict you could use structure assignment here to make this
more readable.

>          nr++;
>      }
>  
> +    /* May need to reorder all e820 entries. */
> +    for ( j = 0; j < nr-1; j++ )
> +    {
> +        for ( i = j+1; i < nr; i++ )
> +        {
> +            if ( e820[j].addr > e820[i].addr )
> +            {
> +                tmp.addr = e820[j].addr;
> +                tmp.size = e820[j].size;
> +                tmp.type = e820[j].type;
> +
> +                e820[j].addr = e820[i].addr;
> +                e820[j].size = e820[i].size;
> +                e820[j].type = e820[i].type;
> +
> +                e820[i].addr = tmp.addr;
> +                e820[i].size = tmp.size;
> +                e820[i].type = tmp.type;

Please again use structure assignments to make this more readable.

Jan

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC][PATCH 04/13] tools/libxl: detect and avoid conflicts with RDM
  2015-04-15 13:10   ` Ian Jackson
  2015-04-15 18:22     ` Tian, Kevin
@ 2015-04-23 12:31     ` Chen, Tiejun
  1 sibling, 0 replies; 125+ messages in thread
From: Chen, Tiejun @ 2015-04-23 12:31 UTC (permalink / raw)
  To: Ian Jackson
  Cc: kevin.tian, wei.liu2, ian.campbell, andrew.cooper3, tim,
	xen-devel, stefano.stabellini, JBeulich, yang.z.zhang

On 2015/4/15 21:10, Ian Jackson wrote:
> Tiejun Chen writes ("[RFC][PATCH 04/13] tools/libxl: detect and avoid conflicts with RDM"):
>> While building a VM, HVM domain builder provides struct hvm_info_table{}
>> to help hvmloader. Currently it includes two fields to construct guest
>> e820 table by hvmloader, low_mem_pgend and high_mem_pgend. So we should
>> check them to fix any conflict with RAM.
>

Thanks for your review.

> I'm not really qualified to understand all of this, because I'm not an
> x86 expert - I don't even know what RDM is.  But this does all seem
> very complicated.  Can I have a second opinion from an x86 expert ?

I hope Kevin's reply is fine to you. But if you still have further 
question let me know and I'd like to make this point clear to you :)

>
> I had a quick look at the libxl code and it at the very least needs
> updating to conform to tools/libxl/CODING_STYLE.
>

I took a look at some libxl codes and found some obvious code style issues.

diff --git a/tools/libxl/libxl_dm.c b/tools/libxl/libxl_dm.c
index 8a5f589..ff40c65 100644
--- a/tools/libxl/libxl_dm.c
+++ b/tools/libxl/libxl_dm.c
@@ -97,7 +97,7 @@ const char *libxl__domain_device_model(libxl__gc *gc,
  static int check_rdm_hole(uint64_t start, uint64_t memsize,
                            uint64_t rdm_start, uint64_t rdm_size)
  {
-    if ( start + memsize <= rdm_start || start >= rdm_start + rdm_size )
+    if (start + memsize <= rdm_start || start >= rdm_start + rdm_size)
          return 0;
      else
          return 1;
@@ -163,7 +163,8 @@ int libxl__domain_device_check_rdm(libxl__gc *gc,
          if (!nr_entries)
              continue;

-        /* Need to check whether this entry is already saved in the array.
+        /*
+         * Need to check whether this entry is already saved in the array.
           * This could come from two cases:
           *
           *   - user may configure to get all RMRRs in this platform, which
@@ -207,7 +208,8 @@ int libxl__domain_device_check_rdm(libxl__gc *gc,
      /* Fix highmem. */
      if (args->mem_size > args->lowmem_size)
          highmem_end += (args->mem_size - args->lowmem_size);
-    /* Next step is to check and avoid potential conflict between RDM 
entries
+    /*
+     * Next step is to check and avoid potential conflict between RDM 
entries
       * and guest RAM. To avoid intrusive impact to existing memory layout
       * {lowmem, mmio, highmem} which is passed around various function 
blocks,
       * below conflicts are not handled which are rare and handling 
them would

Is this good? I know Jan had some comments on this patch as well so 
actually something needs to be changed but here just lets focus on code 
style firstly :)

Thanks
Tiejun

^ permalink raw reply related	[flat|nested] 125+ messages in thread

* Re: [RFC][PATCH 02/13] introduce XENMEM_reserved_device_memory_map
  2015-04-16 15:10     ` Jan Beulich
  2015-04-16 15:24       ` Tim Deegan
@ 2015-04-23 12:32       ` Chen, Tiejun
  2015-04-23 12:59         ` Jan Beulich
  1 sibling, 1 reply; 125+ messages in thread
From: Chen, Tiejun @ 2015-04-23 12:32 UTC (permalink / raw)
  To: Jan Beulich, Tim Deegan
  Cc: kevin.tian, wei.liu2, ian.campbell, andrew.cooper3, Ian.Jackson,
	xen-devel, stefano.stabellini, yang.z.zhang

  >>> @@ -121,6 +121,8 @@ void iommu_dt_domain_destroy(struct domain *d);
>>>
>>>   struct page_info;
>>>
>>> +typedef int iommu_grdm_t(xen_pfn_t start, xen_ulong_t nr, u32 id, void *ctxt);
>>
>> This needs a comment describing what the return values are.
>
> Will do.
>


I'm not sure if yourself determine providing this. So here I just 
initial one draft,

/*
  * Get rdm info handling.
  *
  *    return   0: Don't hit rdm. This means either there's no any rdm 
existent,
  *                or there's no any rdm matching requirements.
  *           < 0: Failed.
  *           > 0: Handle to special case.
  *                  1: Hit one rdm entry.
  *                  others: Currently this is never happened.
  */

But if you already have one just please ignore this and tell me

Thanks
Tiejun

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC][PATCH 07/13] xen/passthrough: extend hypercall to support rdm reservation policy
  2015-04-16 15:40   ` Tim Deegan
@ 2015-04-23 12:32     ` Chen, Tiejun
  2015-04-23 13:05       ` Tim Deegan
  2015-04-23 13:59       ` Jan Beulich
  0 siblings, 2 replies; 125+ messages in thread
From: Chen, Tiejun @ 2015-04-23 12:32 UTC (permalink / raw)
  To: Tim Deegan
  Cc: kevin.tian, wei.liu2, ian.campbell, andrew.cooper3, Ian.Jackson,
	xen-devel, stefano.stabellini, JBeulich, yang.z.zhang

On 2015/4/16 23:40, Tim Deegan wrote:
> Hi,
>
> At 17:21 +0800 on 10 Apr (1428686518), Tiejun Chen wrote:
>> +/*
>> + * In some cases, e.g. add a device to hwdomain, and remove a device from
>> + * user domain, 'try' is fine enough since this is always safe to hwdomain.
>> + */
>> +#define XEN_DOMCTL_PCIDEV_RDM_DEFAULT XEN_DOMCTL_PCIDEV_RDM_TRY
>
> Do we need a way to change this default?

As I said in its comment here, this behavior should be rarely changed in 
the future, and even this really needs to balance something, I just 
think this way is enough and simply.

>
>>   static int rmrr_identity_mapping(struct domain *d, bool_t map,
>> -                                 const struct acpi_rmrr_unit *rmrr)
>> +                                 const struct acpi_rmrr_unit *rmrr,
>> +                                 u32 flag)
>>   {
>>       unsigned long base_pfn = rmrr->base_address >> PAGE_SHIFT_4K;
>>       unsigned long end_pfn = PAGE_ALIGN_4K(rmrr->end_address) >> PAGE_SHIFT_4K;
>> @@ -1851,7 +1857,14 @@ static int rmrr_identity_mapping(struct domain *d, bool_t map,
>>           if ( !is_hardware_domain(d) )
>>           {
>>               if ( (err = set_identity_p2m_entry(d, base_pfn, p2m_access_rw)) )
>> -                return err;
>> +            {
>> +                if ( flag == XEN_DOMCTL_PCIDEV_RDM_TRY )
>> +                {
>> +                    printk(XENLOG_G_WARNING "Some devices may work failed .\n");
>
> This is a bit cryptic.  How about:
>   "RMRR map failed.  Device %04x:%02x:%02x.%u and domain %d may be unstable.",
> (and pass in the devfn from the caller so we can print the details of
> the device).

Got it but we can't get SBDF here directly.

So just now we can have this line.

         {
             if ( flag == XEN_DOMCTL_PCIDEV_RDM_TRY )
                 dprintk(XENLOG_ERR VTDPREFIX,
                         "RMRR mapping failed to pfn:%"PRIx64""
                         " so Dom%d may be unstable.\n",
                         base_pfn, d->domain_id);
             else
                 return err;
         }

Certainly, we can extend rmrr_identity_mapping() to own its associated 
SBDF as an input parameter (and bring some syncs) if you still think 
this is necessary.

>
>> @@ -493,6 +493,10 @@ DEFINE_XEN_GUEST_HANDLE(xen_domctl_sendtrigger_t);
>>   /* XEN_DOMCTL_deassign_device */
>>   struct xen_domctl_assign_device {
>>       uint32_t  machine_sbdf;   /* machine PCI ID of assigned device */
>> +    /* IN */
>> +#define XEN_DOMCTL_PCIDEV_RDM_TRY       0
>> +#define XEN_DOMCTL_PCIDEV_RDM_FORCE     1
>
> "STRICT" might be a better word than "FORCE" (here and everywhere
> else).  "FORCE" sounds like either Xen will assign the device even if
> it's unsafe,  which is the opposite of what's meant IIUC.

This is definitely fine to me but this is derived from our policy based 
on that previous design,

Global RDM parameter:
     rdm = [ 'host, reserve=force/try' ]
Per-device RDM parameter:
     pci = [ 'sbdf, rdm_reserve=force/try' ]

Please refer to patch #1. So I guess we need a further agreement or 
comments from other guys :)

Thanks
Tiejun

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC][PATCH 05/13] xen/x86/p2m: introduce set_identity_p2m_entry
  2015-04-16 15:05   ` Tim Deegan
@ 2015-04-23 12:33     ` Chen, Tiejun
  0 siblings, 0 replies; 125+ messages in thread
From: Chen, Tiejun @ 2015-04-23 12:33 UTC (permalink / raw)
  To: Tim Deegan
  Cc: kevin.tian, wei.liu2, ian.campbell, andrew.cooper3, Ian.Jackson,
	xen-devel, stefano.stabellini, JBeulich, yang.z.zhang

On 2015/4/16 23:05, Tim Deegan wrote:
> Hi,
>
> At 17:21 +0800 on 10 Apr (1428686516), Tiejun Chen wrote:
>> @@ -862,6 +862,36 @@ int set_mmio_p2m_entry(struct domain *d, unsigned long gfn, mfn_t mfn,
>>       return set_typed_p2m_entry(d, gfn, mfn, p2m_mmio_direct, access);
>>   }
>>
>> +int set_identity_p2m_entry(struct domain *d, unsigned long gfn,
>> +                           p2m_access_t p2ma)
>> +{
>> +    p2m_type_t p2mt;
>> +    p2m_access_t a;
>> +    mfn_t mfn;
>> +    struct p2m_domain *p2m = p2m_get_hostp2m(d);
>> +    int ret = -EBUSY;
>> +
>> +    gfn_lock(p2m, gfn, 0);
>> +
>> +    mfn = p2m->get_entry(p2m, gfn, &p2mt, &a, 0, NULL);
>> +
>> +    if ( !mfn_valid(mfn) )
>
> I don't think this check is quite right -- for example, this p2m entry
> might be an MMIO mapping or a PoD entry.  "if ( p2mt == p2m_invalid )"
> would be better.

Okay.

>
>> +        ret = p2m_set_entry(p2m, gfn, _mfn(gfn), PAGE_ORDER_4K, p2m_mmio_direct,
>
> This line seems to be > 80 chars -- can you wrap it a bit earlier, please?
>

Thanks.

Tiejun

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC][PATCH 02/13] introduce XENMEM_reserved_device_memory_map
  2015-04-23 12:32       ` Chen, Tiejun
@ 2015-04-23 12:59         ` Jan Beulich
  2015-04-24  1:17           ` Chen, Tiejun
  0 siblings, 1 reply; 125+ messages in thread
From: Jan Beulich @ 2015-04-23 12:59 UTC (permalink / raw)
  To: Tiejun Chen
  Cc: Tim Deegan, kevin.tian, wei.liu2, ian.campbell, andrew.cooper3,
	Ian.Jackson, xen-devel, stefano.stabellini, yang.z.zhang

>>> On 23.04.15 at 14:32, <tiejun.chen@intel.com> wrote:
> But if you already have one just please ignore this and tell me

Here's what I currently have:

introduce XENMEM_reserved_device_memory_map

This is a prerequisite for punching holes into HVM and PVH guests' P2M
to allow passing through devices that are associated with (on VT-d)
RMRRs.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Kevin Tian <kevin.tian@intel.com>
---
v??: Comment iommu_grdm_t typedef. Comment on the purpose of the new
     hypercall in the public header. (Both requested by Tim.)

--- unstable.orig/xen/common/compat/memory.c
+++ unstable/xen/common/compat/memory.c
@@ -17,6 +17,37 @@ CHECK_TYPE(domid);
 CHECK_mem_access_op;
 CHECK_vmemrange;
 
+#ifdef HAS_PASSTHROUGH
+struct get_reserved_device_memory {
+    struct compat_reserved_device_memory_map map;
+    unsigned int used_entries;
+};
+
+static int get_reserved_device_memory(xen_pfn_t start,
+                                      xen_ulong_t nr, void *ctxt)
+{
+    struct get_reserved_device_memory *grdm = ctxt;
+
+    if ( grdm->used_entries < grdm->map.nr_entries )
+    {
+        struct compat_reserved_device_memory rdm = {
+            .start_pfn = start, .nr_pages = nr
+        };
+
+        if ( rdm.start_pfn != start || rdm.nr_pages != nr )
+            return -ERANGE;
+
+        if ( __copy_to_compat_offset(grdm->map.buffer, grdm->used_entries,
+                                     &rdm, 1) )
+            return -EFAULT;
+    }
+
+    ++grdm->used_entries;
+
+    return 0;
+}
+#endif
+
 int compat_memory_op(unsigned int cmd, XEN_GUEST_HANDLE_PARAM(void) compat)
 {
     int split, op = cmd & MEMOP_CMD_MASK;
@@ -303,6 +334,29 @@ int compat_memory_op(unsigned int cmd, X
             break;
         }
 
+#ifdef HAS_PASSTHROUGH
+        case XENMEM_reserved_device_memory_map:
+        {
+            struct get_reserved_device_memory grdm;
+
+            if ( copy_from_guest(&grdm.map, compat, 1) ||
+                 !compat_handle_okay(grdm.map.buffer, grdm.map.nr_entries) )
+                return -EFAULT;
+
+            grdm.used_entries = 0;
+            rc = iommu_get_reserved_device_memory(get_reserved_device_memory,
+                                                  &grdm);
+
+            if ( !rc && grdm.map.nr_entries < grdm.used_entries )
+                rc = -ENOBUFS;
+            grdm.map.nr_entries = grdm.used_entries;
+            if ( __copy_to_guest(compat, &grdm.map, 1) )
+                rc = -EFAULT;
+
+            return rc;
+        }
+#endif
+
         default:
             return compat_arch_memory_op(cmd, compat);
         }
--- unstable.orig/xen/common/memory.c
+++ unstable/xen/common/memory.c
@@ -748,6 +748,34 @@ static int construct_memop_from_reservat
     return 0;
 }
 
+#ifdef HAS_PASSTHROUGH
+struct get_reserved_device_memory {
+    struct xen_reserved_device_memory_map map;
+    unsigned int used_entries;
+};
+
+static int get_reserved_device_memory(xen_pfn_t start,
+                                      xen_ulong_t nr, void *ctxt)
+{
+    struct get_reserved_device_memory *grdm = ctxt;
+
+    if ( grdm->used_entries < grdm->map.nr_entries )
+    {
+        struct xen_reserved_device_memory rdm = {
+            .start_pfn = start, .nr_pages = nr
+        };
+
+        if ( __copy_to_guest_offset(grdm->map.buffer, grdm->used_entries,
+                                    &rdm, 1) )
+            return -EFAULT;
+    }
+
+    ++grdm->used_entries;
+
+    return 0;
+}
+#endif
+
 long do_memory_op(unsigned long cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
 {
     struct domain *d;
@@ -1162,6 +1190,32 @@ long do_memory_op(unsigned long cmd, XEN
         break;
     }
 
+#ifdef HAS_PASSTHROUGH
+    case XENMEM_reserved_device_memory_map:
+    {
+        struct get_reserved_device_memory grdm;
+
+        if ( unlikely(start_extent) )
+            return -ENOSYS;
+
+        if ( copy_from_guest(&grdm.map, arg, 1) ||
+             !guest_handle_okay(grdm.map.buffer, grdm.map.nr_entries) )
+            return -EFAULT;
+
+        grdm.used_entries = 0;
+        rc = iommu_get_reserved_device_memory(get_reserved_device_memory,
+                                              &grdm);
+
+        if ( !rc && grdm.map.nr_entries < grdm.used_entries )
+            rc = -ENOBUFS;
+        grdm.map.nr_entries = grdm.used_entries;
+        if ( __copy_to_guest(arg, &grdm.map, 1) )
+            rc = -EFAULT;
+
+        break;
+    }
+#endif
+
     default:
         rc = arch_memory_op(cmd, arg);
         break;
--- unstable.orig/xen/drivers/passthrough/iommu.c
+++ unstable/xen/drivers/passthrough/iommu.c
@@ -344,6 +344,16 @@ void iommu_crash_shutdown(void)
     iommu_enabled = iommu_intremap = 0;
 }
 
+int iommu_get_reserved_device_memory(iommu_grdm_t *func, void *ctxt)
+{
+    const struct iommu_ops *ops = iommu_get_ops();
+
+    if ( !iommu_enabled || !ops->get_reserved_device_memory )
+        return 0;
+
+    return ops->get_reserved_device_memory(func, ctxt);
+}
+
 bool_t iommu_has_feature(struct domain *d, enum iommu_feature feature)
 {
     const struct hvm_iommu *hd = domain_hvm_iommu(d);
--- unstable.orig/xen/drivers/passthrough/vtd/dmar.c
+++ unstable/xen/drivers/passthrough/vtd/dmar.c
@@ -893,3 +893,20 @@ int platform_supports_x2apic(void)
     unsigned int mask = ACPI_DMAR_INTR_REMAP | ACPI_DMAR_X2APIC_OPT_OUT;
     return cpu_has_x2apic && ((dmar_flags & mask) == ACPI_DMAR_INTR_REMAP);
 }
+
+int intel_iommu_get_reserved_device_memory(iommu_grdm_t *func, void *ctxt)
+{
+    struct acpi_rmrr_unit *rmrr;
+    int rc = 0;
+
+    list_for_each_entry(rmrr, &acpi_rmrr_units, list)
+    {
+        rc = func(PFN_DOWN(rmrr->base_address),
+                  PFN_UP(rmrr->end_address) - PFN_DOWN(rmrr->base_address),
+                  ctxt);
+        if ( rc )
+            break;
+    }
+
+    return rc;
+}
--- unstable.orig/xen/drivers/passthrough/vtd/extern.h
+++ unstable/xen/drivers/passthrough/vtd/extern.h
@@ -75,6 +75,7 @@ int domain_context_mapping_one(struct do
                                u8 bus, u8 devfn, const struct pci_dev *);
 int domain_context_unmap_one(struct domain *domain, struct iommu *iommu,
                              u8 bus, u8 devfn);
+int intel_iommu_get_reserved_device_memory(iommu_grdm_t *func, void *ctxt);
 
 unsigned int io_apic_read_remap_rte(unsigned int apic, unsigned int reg);
 void io_apic_write_remap_rte(unsigned int apic,
--- unstable.orig/xen/drivers/passthrough/vtd/iommu.c
+++ unstable/xen/drivers/passthrough/vtd/iommu.c
@@ -2491,6 +2491,7 @@ const struct iommu_ops intel_iommu_ops =
     .crash_shutdown = vtd_crash_shutdown,
     .iotlb_flush = intel_iommu_iotlb_flush,
     .iotlb_flush_all = intel_iommu_iotlb_flush_all,
+    .get_reserved_device_memory = intel_iommu_get_reserved_device_memory,
     .dump_p2m_table = vtd_dump_p2m_table,
 };
 
--- unstable.orig/xen/include/public/memory.h
+++ unstable/xen/include/public/memory.h
@@ -573,7 +573,29 @@ struct xen_vnuma_topology_info {
 typedef struct xen_vnuma_topology_info xen_vnuma_topology_info_t;
 DEFINE_XEN_GUEST_HANDLE(xen_vnuma_topology_info_t);
 
-/* Next available subop number is 27 */
+/*
+ * With some legacy devices, certain guest-physical addresses cannot safely
+ * be used for other purposes, e.g. to map guest RAM.  This hypercall
+ * enumerates those regions so the toolstack can avoid using them.
+ */
+#define XENMEM_reserved_device_memory_map   27
+struct xen_reserved_device_memory {
+    xen_pfn_t start_pfn;
+    xen_ulong_t nr_pages;
+};
+typedef struct xen_reserved_device_memory xen_reserved_device_memory_t;
+DEFINE_XEN_GUEST_HANDLE(xen_reserved_device_memory_t);
+
+struct xen_reserved_device_memory_map {
+    /* IN/OUT */
+    unsigned int nr_entries;
+    /* OUT */
+    XEN_GUEST_HANDLE(xen_reserved_device_memory_t) buffer;
+};
+typedef struct xen_reserved_device_memory_map xen_reserved_device_memory_map_t;
+DEFINE_XEN_GUEST_HANDLE(xen_reserved_device_memory_map_t);
+
+/* Next available subop number is 28 */
 
 #endif /* __XEN_PUBLIC_MEMORY_H__ */
 
--- unstable.orig/xen/include/xen/iommu.h
+++ unstable/xen/include/xen/iommu.h
@@ -121,6 +121,14 @@ void iommu_dt_domain_destroy(struct doma
 
 struct page_info;
 
+/*
+ * Any non-zero value returned from callbacks of this type will cause the
+ * function the callback was handed to terminate its iteration. Assigning
+ * meaning of these non-zero values is left to the top level caller /
+ * callback pair.
+ */
+typedef int iommu_grdm_t(xen_pfn_t start, xen_ulong_t nr, void *ctxt);
+
 struct iommu_ops {
     int (*init)(struct domain *d);
     void (*hwdom_init)(struct domain *d);
@@ -152,12 +160,14 @@ struct iommu_ops {
     void (*crash_shutdown)(void);
     void (*iotlb_flush)(struct domain *d, unsigned long gfn, unsigned int page_count);
     void (*iotlb_flush_all)(struct domain *d);
+    int (*get_reserved_device_memory)(iommu_grdm_t *, void *);
     void (*dump_p2m_table)(struct domain *d);
 };
 
 void iommu_suspend(void);
 void iommu_resume(void);
 void iommu_crash_shutdown(void);
+int iommu_get_reserved_device_memory(iommu_grdm_t *, void *);
 
 void iommu_share_p2m_table(struct domain *d);
 
--- unstable.orig/xen/include/xlat.lst
+++ unstable/xen/include/xlat.lst
@@ -61,9 +61,10 @@
 !	memory_exchange			memory.h
 !	memory_map			memory.h
 !	memory_reservation		memory.h
-?	mem_access_op		memory.h
+?	mem_access_op			memory.h
 !	pod_target			memory.h
 !	remove_from_physmap		memory.h
+!	reserved_device_memory_map	memory.h
 ?	vmemrange			memory.h
 !	vnuma_topology_info		memory.h
 ?	physdev_eoi			physdev.h

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC][PATCH 07/13] xen/passthrough: extend hypercall to support rdm reservation policy
  2015-04-23 12:32     ` Chen, Tiejun
@ 2015-04-23 13:05       ` Tim Deegan
  2015-04-23 13:59       ` Jan Beulich
  1 sibling, 0 replies; 125+ messages in thread
From: Tim Deegan @ 2015-04-23 13:05 UTC (permalink / raw)
  To: Chen, Tiejun
  Cc: kevin.tian, wei.liu2, ian.campbell, andrew.cooper3, Ian.Jackson,
	xen-devel, stefano.stabellini, JBeulich, yang.z.zhang

At 20:32 +0800 on 23 Apr (1429821151), Chen, Tiejun wrote:
> >> +                if ( flag == XEN_DOMCTL_PCIDEV_RDM_TRY )
> >> +                {
> >> +                    printk(XENLOG_G_WARNING "Some devices may work failed .\n");
> >
> > This is a bit cryptic.  How about:
> >   "RMRR map failed.  Device %04x:%02x:%02x.%u and domain %d may be unstable.",
> > (and pass in the devfn from the caller so we can print the details of
> > the device).
> 
> Got it but we can't get SBDF here directly.
> 
> So just now we can have this line.
> 
>          {
>              if ( flag == XEN_DOMCTL_PCIDEV_RDM_TRY )
>                  dprintk(XENLOG_ERR VTDPREFIX,
>                          "RMRR mapping failed to pfn:%"PRIx64""
>                          " so Dom%d may be unstable.\n",
>                          base_pfn, d->domain_id);
>              else
>                  return err;
>          }
> 
> Certainly, we can extend rmrr_identity_mapping() to own its associated 
> SBDF as an input parameter (and bring some syncs) if you still think 
> this is necessary.

Yes please.  It makes it clear to the admin which device is causing
the problem.

> > "STRICT" might be a better word than "FORCE" (here and everywhere
> > else).  "FORCE" sounds like either Xen will assign the device even if
> > it's unsafe,  which is the opposite of what's meant IIUC.
> 
> This is definitely fine to me but this is derived from our policy based 
> on that previous design,
> 
> Global RDM parameter:
>      rdm = [ 'host, reserve=force/try' ]
> Per-device RDM parameter:
>      pci = [ 'sbdf, rdm_reserve=force/try' ]
> 
> Please refer to patch #1. So I guess we need a further agreement or 
> comments from other guys :)

Well, it was just a suggestion.  If other people are happy with
'force', I am too. :)

Tim.

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC][PATCH 07/13] xen/passthrough: extend hypercall to support rdm reservation policy
  2015-04-23 12:32     ` Chen, Tiejun
  2015-04-23 13:05       ` Tim Deegan
@ 2015-04-23 13:59       ` Jan Beulich
  2015-04-23 14:26         ` Tim Deegan
  2015-05-04  8:15         ` Tian, Kevin
  1 sibling, 2 replies; 125+ messages in thread
From: Jan Beulich @ 2015-04-23 13:59 UTC (permalink / raw)
  To: Tiejun Chen
  Cc: Tim Deegan, kevin.tian, wei.liu2, ian.campbell, andrew.cooper3,
	Ian.Jackson, xen-devel, stefano.stabellini, yang.z.zhang

>>> On 23.04.15 at 14:32, <tiejun.chen@intel.com> wrote:
> On 2015/4/16 23:40, Tim Deegan wrote:
>> At 17:21 +0800 on 10 Apr (1428686518), Tiejun Chen wrote:
>>> @@ -1851,7 +1857,14 @@ static int rmrr_identity_mapping(struct domain *d, bool_t map,
>>>           if ( !is_hardware_domain(d) )
>>>           {
>>>               if ( (err = set_identity_p2m_entry(d, base_pfn, p2m_access_rw)) )
>>> -                return err;
>>> +            {
>>> +                if ( flag == XEN_DOMCTL_PCIDEV_RDM_TRY )
>>> +                {
>>> +                    printk(XENLOG_G_WARNING "Some devices may work failed .\n");
>>
>> This is a bit cryptic.  How about:
>>   "RMRR map failed.  Device %04x:%02x:%02x.%u and domain %d may be 
> unstable.",
>> (and pass in the devfn from the caller so we can print the details of
>> the device).
> 
> Got it but we can't get SBDF here directly.
> 
> So just now we can have this line.
> 
>          {
>              if ( flag == XEN_DOMCTL_PCIDEV_RDM_TRY )
>                  dprintk(XENLOG_ERR VTDPREFIX,
>                          "RMRR mapping failed to pfn:%"PRIx64""
>                          " so Dom%d may be unstable.\n",
>                          base_pfn, d->domain_id);
>              else
>                  return err;
>          }
> 
> Certainly, we can extend rmrr_identity_mapping() to own its associated 
> SBDF as an input parameter (and bring some syncs) if you still think 
> this is necessary.

I don't think we can, since a single RMRR may be associated with
more than one device.

>>> @@ -493,6 +493,10 @@ DEFINE_XEN_GUEST_HANDLE(xen_domctl_sendtrigger_t);
>>>   /* XEN_DOMCTL_deassign_device */
>>>   struct xen_domctl_assign_device {
>>>       uint32_t  machine_sbdf;   /* machine PCI ID of assigned device */
>>> +    /* IN */
>>> +#define XEN_DOMCTL_PCIDEV_RDM_TRY       0
>>> +#define XEN_DOMCTL_PCIDEV_RDM_FORCE     1
>>
>> "STRICT" might be a better word than "FORCE" (here and everywhere
>> else).  "FORCE" sounds like either Xen will assign the device even if
>> it's unsafe,  which is the opposite of what's meant IIUC.
> 
> This is definitely fine to me but this is derived from our policy based 
> on that previous design,
> 
> Global RDM parameter:
>      rdm = [ 'host, reserve=force/try' ]
> Per-device RDM parameter:
>      pci = [ 'sbdf, rdm_reserve=force/try' ]
> 
> Please refer to patch #1. So I guess we need a further agreement or 
> comments from other guys :)

I think I'd prefer strict/relaxed over force/try, but I certainly can
live with the latter.

Jan

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC][PATCH 07/13] xen/passthrough: extend hypercall to support rdm reservation policy
  2015-04-23 13:59       ` Jan Beulich
@ 2015-04-23 14:26         ` Tim Deegan
  2015-05-04  8:15         ` Tian, Kevin
  1 sibling, 0 replies; 125+ messages in thread
From: Tim Deegan @ 2015-04-23 14:26 UTC (permalink / raw)
  To: Jan Beulich
  Cc: kevin.tian, wei.liu2, ian.campbell, andrew.cooper3, Ian.Jackson,
	xen-devel, stefano.stabellini, yang.z.zhang, Tiejun Chen

At 14:59 +0100 on 23 Apr (1429801184), Jan Beulich wrote:
> >>> On 23.04.15 at 14:32, <tiejun.chen@intel.com> wrote:
> > On 2015/4/16 23:40, Tim Deegan wrote:
> >> At 17:21 +0800 on 10 Apr (1428686518), Tiejun Chen wrote:
> >>> @@ -1851,7 +1857,14 @@ static int rmrr_identity_mapping(struct domain *d, bool_t map,
> >>>           if ( !is_hardware_domain(d) )
> >>>           {
> >>>               if ( (err = set_identity_p2m_entry(d, base_pfn, p2m_access_rw)) )
> >>> -                return err;
> >>> +            {
> >>> +                if ( flag == XEN_DOMCTL_PCIDEV_RDM_TRY )
> >>> +                {
> >>> +                    printk(XENLOG_G_WARNING "Some devices may work failed .\n");
> >>
> >> This is a bit cryptic.  How about:
> >>   "RMRR map failed.  Device %04x:%02x:%02x.%u and domain %d may be 
> > unstable.",
> >> (and pass in the devfn from the caller so we can print the details of
> >> the device).
> > 
> > Got it but we can't get SBDF here directly.
> > 
> > So just now we can have this line.
> > 
> >          {
> >              if ( flag == XEN_DOMCTL_PCIDEV_RDM_TRY )
> >                  dprintk(XENLOG_ERR VTDPREFIX,
> >                          "RMRR mapping failed to pfn:%"PRIx64""
> >                          " so Dom%d may be unstable.\n",
> >                          base_pfn, d->domain_id);
> >              else
> >                  return err;
> >          }
> > 
> > Certainly, we can extend rmrr_identity_mapping() to own its associated 
> > SBDF as an input parameter (and bring some syncs) if you still think 
> > this is necessary.
> 
> I don't think we can, since a single RMRR may be associated with
> more than one device.

Ah, and we don't know which device we have in hand when we're doing
this?  Fair enough, just printing the address and domain will do then.

Tim.

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC][PATCH 02/13] introduce XENMEM_reserved_device_memory_map
  2015-04-23 12:59         ` Jan Beulich
@ 2015-04-24  1:17           ` Chen, Tiejun
  2015-04-24  7:21             ` Jan Beulich
  0 siblings, 1 reply; 125+ messages in thread
From: Chen, Tiejun @ 2015-04-24  1:17 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Tim Deegan, kevin.tian, wei.liu2, ian.campbell, andrew.cooper3,
	Ian.Jackson, xen-devel, stefano.stabellini, yang.z.zhang

On 2015/4/23 20:59, Jan Beulich wrote:
>>>> On 23.04.15 at 14:32, <tiejun.chen@intel.com> wrote:
>> But if you already have one just please ignore this and tell me
>
> Here's what I currently have:

Could you resend me this as an attached file? Then I can apply that 
properly without any miss?

Thanks
Tiejun

>
> introduce XENMEM_reserved_device_memory_map
>
> This is a prerequisite for punching holes into HVM and PVH guests' P2M
> to allow passing through devices that are associated with (on VT-d)
> RMRRs.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> Acked-by: Kevin Tian <kevin.tian@intel.com>
> ---
> v??: Comment iommu_grdm_t typedef. Comment on the purpose of the new
>       hypercall in the public header. (Both requested by Tim.)
>
> --- unstable.orig/xen/common/compat/memory.c
> +++ unstable/xen/common/compat/memory.c
> @@ -17,6 +17,37 @@ CHECK_TYPE(domid);
>   CHECK_mem_access_op;
>   CHECK_vmemrange;
>
> +#ifdef HAS_PASSTHROUGH
> +struct get_reserved_device_memory {
> +    struct compat_reserved_device_memory_map map;
> +    unsigned int used_entries;
> +};
> +
> +static int get_reserved_device_memory(xen_pfn_t start,
> +                                      xen_ulong_t nr, void *ctxt)
> +{
> +    struct get_reserved_device_memory *grdm = ctxt;
> +
> +    if ( grdm->used_entries < grdm->map.nr_entries )
> +    {
> +        struct compat_reserved_device_memory rdm = {
> +            .start_pfn = start, .nr_pages = nr
> +        };
> +
> +        if ( rdm.start_pfn != start || rdm.nr_pages != nr )
> +            return -ERANGE;
> +
> +        if ( __copy_to_compat_offset(grdm->map.buffer, grdm->used_entries,
> +                                     &rdm, 1) )
> +            return -EFAULT;
> +    }
> +
> +    ++grdm->used_entries;
> +
> +    return 0;
> +}
> +#endif
> +
>   int compat_memory_op(unsigned int cmd, XEN_GUEST_HANDLE_PARAM(void) compat)
>   {
>       int split, op = cmd & MEMOP_CMD_MASK;
> @@ -303,6 +334,29 @@ int compat_memory_op(unsigned int cmd, X
>               break;
>           }
>
> +#ifdef HAS_PASSTHROUGH
> +        case XENMEM_reserved_device_memory_map:
> +        {
> +            struct get_reserved_device_memory grdm;
> +
> +            if ( copy_from_guest(&grdm.map, compat, 1) ||
> +                 !compat_handle_okay(grdm.map.buffer, grdm.map.nr_entries) )
> +                return -EFAULT;
> +
> +            grdm.used_entries = 0;
> +            rc = iommu_get_reserved_device_memory(get_reserved_device_memory,
> +                                                  &grdm);
> +
> +            if ( !rc && grdm.map.nr_entries < grdm.used_entries )
> +                rc = -ENOBUFS;
> +            grdm.map.nr_entries = grdm.used_entries;
> +            if ( __copy_to_guest(compat, &grdm.map, 1) )
> +                rc = -EFAULT;
> +
> +            return rc;
> +        }
> +#endif
> +
>           default:
>               return compat_arch_memory_op(cmd, compat);
>           }
> --- unstable.orig/xen/common/memory.c
> +++ unstable/xen/common/memory.c
> @@ -748,6 +748,34 @@ static int construct_memop_from_reservat
>       return 0;
>   }
>
> +#ifdef HAS_PASSTHROUGH
> +struct get_reserved_device_memory {
> +    struct xen_reserved_device_memory_map map;
> +    unsigned int used_entries;
> +};
> +
> +static int get_reserved_device_memory(xen_pfn_t start,
> +                                      xen_ulong_t nr, void *ctxt)
> +{
> +    struct get_reserved_device_memory *grdm = ctxt;
> +
> +    if ( grdm->used_entries < grdm->map.nr_entries )
> +    {
> +        struct xen_reserved_device_memory rdm = {
> +            .start_pfn = start, .nr_pages = nr
> +        };
> +
> +        if ( __copy_to_guest_offset(grdm->map.buffer, grdm->used_entries,
> +                                    &rdm, 1) )
> +            return -EFAULT;
> +    }
> +
> +    ++grdm->used_entries;
> +
> +    return 0;
> +}
> +#endif
> +
>   long do_memory_op(unsigned long cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
>   {
>       struct domain *d;
> @@ -1162,6 +1190,32 @@ long do_memory_op(unsigned long cmd, XEN
>           break;
>       }
>
> +#ifdef HAS_PASSTHROUGH
> +    case XENMEM_reserved_device_memory_map:
> +    {
> +        struct get_reserved_device_memory grdm;
> +
> +        if ( unlikely(start_extent) )
> +            return -ENOSYS;
> +
> +        if ( copy_from_guest(&grdm.map, arg, 1) ||
> +             !guest_handle_okay(grdm.map.buffer, grdm.map.nr_entries) )
> +            return -EFAULT;
> +
> +        grdm.used_entries = 0;
> +        rc = iommu_get_reserved_device_memory(get_reserved_device_memory,
> +                                              &grdm);
> +
> +        if ( !rc && grdm.map.nr_entries < grdm.used_entries )
> +            rc = -ENOBUFS;
> +        grdm.map.nr_entries = grdm.used_entries;
> +        if ( __copy_to_guest(arg, &grdm.map, 1) )
> +            rc = -EFAULT;
> +
> +        break;
> +    }
> +#endif
> +
>       default:
>           rc = arch_memory_op(cmd, arg);
>           break;
> --- unstable.orig/xen/drivers/passthrough/iommu.c
> +++ unstable/xen/drivers/passthrough/iommu.c
> @@ -344,6 +344,16 @@ void iommu_crash_shutdown(void)
>       iommu_enabled = iommu_intremap = 0;
>   }
>
> +int iommu_get_reserved_device_memory(iommu_grdm_t *func, void *ctxt)
> +{
> +    const struct iommu_ops *ops = iommu_get_ops();
> +
> +    if ( !iommu_enabled || !ops->get_reserved_device_memory )
> +        return 0;
> +
> +    return ops->get_reserved_device_memory(func, ctxt);
> +}
> +
>   bool_t iommu_has_feature(struct domain *d, enum iommu_feature feature)
>   {
>       const struct hvm_iommu *hd = domain_hvm_iommu(d);
> --- unstable.orig/xen/drivers/passthrough/vtd/dmar.c
> +++ unstable/xen/drivers/passthrough/vtd/dmar.c
> @@ -893,3 +893,20 @@ int platform_supports_x2apic(void)
>       unsigned int mask = ACPI_DMAR_INTR_REMAP | ACPI_DMAR_X2APIC_OPT_OUT;
>       return cpu_has_x2apic && ((dmar_flags & mask) == ACPI_DMAR_INTR_REMAP);
>   }
> +
> +int intel_iommu_get_reserved_device_memory(iommu_grdm_t *func, void *ctxt)
> +{
> +    struct acpi_rmrr_unit *rmrr;
> +    int rc = 0;
> +
> +    list_for_each_entry(rmrr, &acpi_rmrr_units, list)
> +    {
> +        rc = func(PFN_DOWN(rmrr->base_address),
> +                  PFN_UP(rmrr->end_address) - PFN_DOWN(rmrr->base_address),
> +                  ctxt);
> +        if ( rc )
> +            break;
> +    }
> +
> +    return rc;
> +}
> --- unstable.orig/xen/drivers/passthrough/vtd/extern.h
> +++ unstable/xen/drivers/passthrough/vtd/extern.h
> @@ -75,6 +75,7 @@ int domain_context_mapping_one(struct do
>                                  u8 bus, u8 devfn, const struct pci_dev *);
>   int domain_context_unmap_one(struct domain *domain, struct iommu *iommu,
>                                u8 bus, u8 devfn);
> +int intel_iommu_get_reserved_device_memory(iommu_grdm_t *func, void *ctxt);
>
>   unsigned int io_apic_read_remap_rte(unsigned int apic, unsigned int reg);
>   void io_apic_write_remap_rte(unsigned int apic,
> --- unstable.orig/xen/drivers/passthrough/vtd/iommu.c
> +++ unstable/xen/drivers/passthrough/vtd/iommu.c
> @@ -2491,6 +2491,7 @@ const struct iommu_ops intel_iommu_ops =
>       .crash_shutdown = vtd_crash_shutdown,
>       .iotlb_flush = intel_iommu_iotlb_flush,
>       .iotlb_flush_all = intel_iommu_iotlb_flush_all,
> +    .get_reserved_device_memory = intel_iommu_get_reserved_device_memory,
>       .dump_p2m_table = vtd_dump_p2m_table,
>   };
>
> --- unstable.orig/xen/include/public/memory.h
> +++ unstable/xen/include/public/memory.h
> @@ -573,7 +573,29 @@ struct xen_vnuma_topology_info {
>   typedef struct xen_vnuma_topology_info xen_vnuma_topology_info_t;
>   DEFINE_XEN_GUEST_HANDLE(xen_vnuma_topology_info_t);
>
> -/* Next available subop number is 27 */
> +/*
> + * With some legacy devices, certain guest-physical addresses cannot safely
> + * be used for other purposes, e.g. to map guest RAM.  This hypercall
> + * enumerates those regions so the toolstack can avoid using them.
> + */
> +#define XENMEM_reserved_device_memory_map   27
> +struct xen_reserved_device_memory {
> +    xen_pfn_t start_pfn;
> +    xen_ulong_t nr_pages;
> +};
> +typedef struct xen_reserved_device_memory xen_reserved_device_memory_t;
> +DEFINE_XEN_GUEST_HANDLE(xen_reserved_device_memory_t);
> +
> +struct xen_reserved_device_memory_map {
> +    /* IN/OUT */
> +    unsigned int nr_entries;
> +    /* OUT */
> +    XEN_GUEST_HANDLE(xen_reserved_device_memory_t) buffer;
> +};
> +typedef struct xen_reserved_device_memory_map xen_reserved_device_memory_map_t;
> +DEFINE_XEN_GUEST_HANDLE(xen_reserved_device_memory_map_t);
> +
> +/* Next available subop number is 28 */
>
>   #endif /* __XEN_PUBLIC_MEMORY_H__ */
>
> --- unstable.orig/xen/include/xen/iommu.h
> +++ unstable/xen/include/xen/iommu.h
> @@ -121,6 +121,14 @@ void iommu_dt_domain_destroy(struct doma
>
>   struct page_info;
>
> +/*
> + * Any non-zero value returned from callbacks of this type will cause the
> + * function the callback was handed to terminate its iteration. Assigning
> + * meaning of these non-zero values is left to the top level caller /
> + * callback pair.
> + */
> +typedef int iommu_grdm_t(xen_pfn_t start, xen_ulong_t nr, void *ctxt);
> +
>   struct iommu_ops {
>       int (*init)(struct domain *d);
>       void (*hwdom_init)(struct domain *d);
> @@ -152,12 +160,14 @@ struct iommu_ops {
>       void (*crash_shutdown)(void);
>       void (*iotlb_flush)(struct domain *d, unsigned long gfn, unsigned int page_count);
>       void (*iotlb_flush_all)(struct domain *d);
> +    int (*get_reserved_device_memory)(iommu_grdm_t *, void *);
>       void (*dump_p2m_table)(struct domain *d);
>   };
>
>   void iommu_suspend(void);
>   void iommu_resume(void);
>   void iommu_crash_shutdown(void);
> +int iommu_get_reserved_device_memory(iommu_grdm_t *, void *);
>
>   void iommu_share_p2m_table(struct domain *d);
>
> --- unstable.orig/xen/include/xlat.lst
> +++ unstable/xen/include/xlat.lst
> @@ -61,9 +61,10 @@
>   !	memory_exchange			memory.h
>   !	memory_map			memory.h
>   !	memory_reservation		memory.h
> -?	mem_access_op		memory.h
> +?	mem_access_op			memory.h
>   !	pod_target			memory.h
>   !	remove_from_physmap		memory.h
> +!	reserved_device_memory_map	memory.h
>   ?	vmemrange			memory.h
>   !	vnuma_topology_info		memory.h
>   ?	physdev_eoi			physdev.h
>
>
>

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC][PATCH 02/13] introduce XENMEM_reserved_device_memory_map
  2015-04-24  1:17           ` Chen, Tiejun
@ 2015-04-24  7:21             ` Jan Beulich
  0 siblings, 0 replies; 125+ messages in thread
From: Jan Beulich @ 2015-04-24  7:21 UTC (permalink / raw)
  To: Tiejun Chen
  Cc: Tim Deegan, kevin.tian, wei.liu2, ian.campbell, andrew.cooper3,
	Ian.Jackson, xen-devel, stefano.stabellini, yang.z.zhang

[-- Attachment #1: Type: text/plain, Size: 398 bytes --]

>>> On 24.04.15 at 03:17, <tiejun.chen@intel.com> wrote:
> On 2015/4/23 20:59, Jan Beulich wrote:
>>>>> On 23.04.15 at 14:32, <tiejun.chen@intel.com> wrote:
>>> But if you already have one just please ignore this and tell me
>>
>> Here's what I currently have:
> 
> Could you resend me this as an attached file? Then I can apply that 
> properly without any miss?

Here you go.

Jan


[-- Attachment #2: get-reserved-device-memory.patch --]
[-- Type: text/plain, Size: 9723 bytes --]

introduce XENMEM_reserved_device_memory_map

This is a prerequisite for punching holes into HVM and PVH guests' P2M
to allow passing through devices that are associated with (on VT-d)
RMRRs.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Kevin Tian <kevin.tian@intel.com>
---
v??: Comment iommu_grdm_t typedef. Comment on the purpose of the new
     hypercall in the public header. (Both requested by Tim.)

--- unstable.orig/xen/common/compat/memory.c
+++ unstable/xen/common/compat/memory.c
@@ -17,6 +17,37 @@ CHECK_TYPE(domid);
 CHECK_mem_access_op;
 CHECK_vmemrange;
 
+#ifdef HAS_PASSTHROUGH
+struct get_reserved_device_memory {
+    struct compat_reserved_device_memory_map map;
+    unsigned int used_entries;
+};
+
+static int get_reserved_device_memory(xen_pfn_t start,
+                                      xen_ulong_t nr, void *ctxt)
+{
+    struct get_reserved_device_memory *grdm = ctxt;
+
+    if ( grdm->used_entries < grdm->map.nr_entries )
+    {
+        struct compat_reserved_device_memory rdm = {
+            .start_pfn = start, .nr_pages = nr
+        };
+
+        if ( rdm.start_pfn != start || rdm.nr_pages != nr )
+            return -ERANGE;
+
+        if ( __copy_to_compat_offset(grdm->map.buffer, grdm->used_entries,
+                                     &rdm, 1) )
+            return -EFAULT;
+    }
+
+    ++grdm->used_entries;
+
+    return 0;
+}
+#endif
+
 int compat_memory_op(unsigned int cmd, XEN_GUEST_HANDLE_PARAM(void) compat)
 {
     int split, op = cmd & MEMOP_CMD_MASK;
@@ -303,6 +334,29 @@ int compat_memory_op(unsigned int cmd, X
             break;
         }
 
+#ifdef HAS_PASSTHROUGH
+        case XENMEM_reserved_device_memory_map:
+        {
+            struct get_reserved_device_memory grdm;
+
+            if ( copy_from_guest(&grdm.map, compat, 1) ||
+                 !compat_handle_okay(grdm.map.buffer, grdm.map.nr_entries) )
+                return -EFAULT;
+
+            grdm.used_entries = 0;
+            rc = iommu_get_reserved_device_memory(get_reserved_device_memory,
+                                                  &grdm);
+
+            if ( !rc && grdm.map.nr_entries < grdm.used_entries )
+                rc = -ENOBUFS;
+            grdm.map.nr_entries = grdm.used_entries;
+            if ( __copy_to_guest(compat, &grdm.map, 1) )
+                rc = -EFAULT;
+
+            return rc;
+        }
+#endif
+
         default:
             return compat_arch_memory_op(cmd, compat);
         }
--- unstable.orig/xen/common/memory.c
+++ unstable/xen/common/memory.c
@@ -748,6 +748,34 @@ static int construct_memop_from_reservat
     return 0;
 }
 
+#ifdef HAS_PASSTHROUGH
+struct get_reserved_device_memory {
+    struct xen_reserved_device_memory_map map;
+    unsigned int used_entries;
+};
+
+static int get_reserved_device_memory(xen_pfn_t start,
+                                      xen_ulong_t nr, void *ctxt)
+{
+    struct get_reserved_device_memory *grdm = ctxt;
+
+    if ( grdm->used_entries < grdm->map.nr_entries )
+    {
+        struct xen_reserved_device_memory rdm = {
+            .start_pfn = start, .nr_pages = nr
+        };
+
+        if ( __copy_to_guest_offset(grdm->map.buffer, grdm->used_entries,
+                                    &rdm, 1) )
+            return -EFAULT;
+    }
+
+    ++grdm->used_entries;
+
+    return 0;
+}
+#endif
+
 long do_memory_op(unsigned long cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
 {
     struct domain *d;
@@ -1162,6 +1190,32 @@ long do_memory_op(unsigned long cmd, XEN
         break;
     }
 
+#ifdef HAS_PASSTHROUGH
+    case XENMEM_reserved_device_memory_map:
+    {
+        struct get_reserved_device_memory grdm;
+
+        if ( unlikely(start_extent) )
+            return -ENOSYS;
+
+        if ( copy_from_guest(&grdm.map, arg, 1) ||
+             !guest_handle_okay(grdm.map.buffer, grdm.map.nr_entries) )
+            return -EFAULT;
+
+        grdm.used_entries = 0;
+        rc = iommu_get_reserved_device_memory(get_reserved_device_memory,
+                                              &grdm);
+
+        if ( !rc && grdm.map.nr_entries < grdm.used_entries )
+            rc = -ENOBUFS;
+        grdm.map.nr_entries = grdm.used_entries;
+        if ( __copy_to_guest(arg, &grdm.map, 1) )
+            rc = -EFAULT;
+
+        break;
+    }
+#endif
+
     default:
         rc = arch_memory_op(cmd, arg);
         break;
--- unstable.orig/xen/drivers/passthrough/iommu.c
+++ unstable/xen/drivers/passthrough/iommu.c
@@ -344,6 +344,16 @@ void iommu_crash_shutdown(void)
     iommu_enabled = iommu_intremap = 0;
 }
 
+int iommu_get_reserved_device_memory(iommu_grdm_t *func, void *ctxt)
+{
+    const struct iommu_ops *ops = iommu_get_ops();
+
+    if ( !iommu_enabled || !ops->get_reserved_device_memory )
+        return 0;
+
+    return ops->get_reserved_device_memory(func, ctxt);
+}
+
 bool_t iommu_has_feature(struct domain *d, enum iommu_feature feature)
 {
     const struct hvm_iommu *hd = domain_hvm_iommu(d);
--- unstable.orig/xen/drivers/passthrough/vtd/dmar.c
+++ unstable/xen/drivers/passthrough/vtd/dmar.c
@@ -893,3 +893,20 @@ int platform_supports_x2apic(void)
     unsigned int mask = ACPI_DMAR_INTR_REMAP | ACPI_DMAR_X2APIC_OPT_OUT;
     return cpu_has_x2apic && ((dmar_flags & mask) == ACPI_DMAR_INTR_REMAP);
 }
+
+int intel_iommu_get_reserved_device_memory(iommu_grdm_t *func, void *ctxt)
+{
+    struct acpi_rmrr_unit *rmrr;
+    int rc = 0;
+
+    list_for_each_entry(rmrr, &acpi_rmrr_units, list)
+    {
+        rc = func(PFN_DOWN(rmrr->base_address),
+                  PFN_UP(rmrr->end_address) - PFN_DOWN(rmrr->base_address),
+                  ctxt);
+        if ( rc )
+            break;
+    }
+
+    return rc;
+}
--- unstable.orig/xen/drivers/passthrough/vtd/extern.h
+++ unstable/xen/drivers/passthrough/vtd/extern.h
@@ -75,6 +75,7 @@ int domain_context_mapping_one(struct do
                                u8 bus, u8 devfn, const struct pci_dev *);
 int domain_context_unmap_one(struct domain *domain, struct iommu *iommu,
                              u8 bus, u8 devfn);
+int intel_iommu_get_reserved_device_memory(iommu_grdm_t *func, void *ctxt);
 
 unsigned int io_apic_read_remap_rte(unsigned int apic, unsigned int reg);
 void io_apic_write_remap_rte(unsigned int apic,
--- unstable.orig/xen/drivers/passthrough/vtd/iommu.c
+++ unstable/xen/drivers/passthrough/vtd/iommu.c
@@ -2491,6 +2491,7 @@ const struct iommu_ops intel_iommu_ops =
     .crash_shutdown = vtd_crash_shutdown,
     .iotlb_flush = intel_iommu_iotlb_flush,
     .iotlb_flush_all = intel_iommu_iotlb_flush_all,
+    .get_reserved_device_memory = intel_iommu_get_reserved_device_memory,
     .dump_p2m_table = vtd_dump_p2m_table,
 };
 
--- unstable.orig/xen/include/public/memory.h
+++ unstable/xen/include/public/memory.h
@@ -573,7 +573,29 @@ struct xen_vnuma_topology_info {
 typedef struct xen_vnuma_topology_info xen_vnuma_topology_info_t;
 DEFINE_XEN_GUEST_HANDLE(xen_vnuma_topology_info_t);
 
-/* Next available subop number is 27 */
+/*
+ * With some legacy devices, certain guest-physical addresses cannot safely
+ * be used for other purposes, e.g. to map guest RAM.  This hypercall
+ * enumerates those regions so the toolstack can avoid using them.
+ */
+#define XENMEM_reserved_device_memory_map   27
+struct xen_reserved_device_memory {
+    xen_pfn_t start_pfn;
+    xen_ulong_t nr_pages;
+};
+typedef struct xen_reserved_device_memory xen_reserved_device_memory_t;
+DEFINE_XEN_GUEST_HANDLE(xen_reserved_device_memory_t);
+
+struct xen_reserved_device_memory_map {
+    /* IN/OUT */
+    unsigned int nr_entries;
+    /* OUT */
+    XEN_GUEST_HANDLE(xen_reserved_device_memory_t) buffer;
+};
+typedef struct xen_reserved_device_memory_map xen_reserved_device_memory_map_t;
+DEFINE_XEN_GUEST_HANDLE(xen_reserved_device_memory_map_t);
+
+/* Next available subop number is 28 */
 
 #endif /* __XEN_PUBLIC_MEMORY_H__ */
 
--- unstable.orig/xen/include/xen/iommu.h
+++ unstable/xen/include/xen/iommu.h
@@ -121,6 +121,14 @@ void iommu_dt_domain_destroy(struct doma
 
 struct page_info;
 
+/*
+ * Any non-zero value returned from callbacks of this type will cause the
+ * function the callback was handed to terminate its iteration. Assigning
+ * meaning of these non-zero values is left to the top level caller /
+ * callback pair.
+ */
+typedef int iommu_grdm_t(xen_pfn_t start, xen_ulong_t nr, void *ctxt);
+
 struct iommu_ops {
     int (*init)(struct domain *d);
     void (*hwdom_init)(struct domain *d);
@@ -152,12 +160,14 @@ struct iommu_ops {
     void (*crash_shutdown)(void);
     void (*iotlb_flush)(struct domain *d, unsigned long gfn, unsigned int page_count);
     void (*iotlb_flush_all)(struct domain *d);
+    int (*get_reserved_device_memory)(iommu_grdm_t *, void *);
     void (*dump_p2m_table)(struct domain *d);
 };
 
 void iommu_suspend(void);
 void iommu_resume(void);
 void iommu_crash_shutdown(void);
+int iommu_get_reserved_device_memory(iommu_grdm_t *, void *);
 
 void iommu_share_p2m_table(struct domain *d);
 
--- unstable.orig/xen/include/xlat.lst
+++ unstable/xen/include/xlat.lst
@@ -61,9 +61,10 @@
 !	memory_exchange			memory.h
 !	memory_map			memory.h
 !	memory_reservation		memory.h
-?	mem_access_op		memory.h
+?	mem_access_op			memory.h
 !	pod_target			memory.h
 !	remove_from_physmap		memory.h
+!	reserved_device_memory_map	memory.h
 ?	vmemrange			memory.h
 !	vnuma_topology_info		memory.h
 ?	physdev_eoi			physdev.h

[-- Attachment #3: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC][PATCH 07/13] xen/passthrough: extend hypercall to support rdm reservation policy
  2015-04-23 13:59       ` Jan Beulich
  2015-04-23 14:26         ` Tim Deegan
@ 2015-05-04  8:15         ` Tian, Kevin
  1 sibling, 0 replies; 125+ messages in thread
From: Tian, Kevin @ 2015-05-04  8:15 UTC (permalink / raw)
  To: Jan Beulich, Chen, Tiejun
  Cc: Tim Deegan, wei.liu2, ian.campbell, andrew.cooper3, Ian.Jackson,
	xen-devel, stefano.stabellini, Zhang, Yang Z

> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Thursday, April 23, 2015 10:00 PM
> 
> >>> On 23.04.15 at 14:32, <tiejun.chen@intel.com> wrote:
> > On 2015/4/16 23:40, Tim Deegan wrote:
> >> At 17:21 +0800 on 10 Apr (1428686518), Tiejun Chen wrote:
> >>> @@ -1851,7 +1857,14 @@ static int rmrr_identity_mapping(struct
> domain *d, bool_t map,
> >>>           if ( !is_hardware_domain(d) )
> >>>           {
> >>>               if ( (err = set_identity_p2m_entry(d, base_pfn,
> p2m_access_rw)) )
> >>> -                return err;
> >>> +            {
> >>> +                if ( flag == XEN_DOMCTL_PCIDEV_RDM_TRY )
> >>> +                {
> >>> +                    printk(XENLOG_G_WARNING "Some devices
> may work failed .\n");
> >>
> >> This is a bit cryptic.  How about:
> >>   "RMRR map failed.  Device %04x:%02x:%02x.%u and domain %d may
> be
> > unstable.",
> >> (and pass in the devfn from the caller so we can print the details of
> >> the device).
> >
> > Got it but we can't get SBDF here directly.
> >
> > So just now we can have this line.
> >
> >          {
> >              if ( flag == XEN_DOMCTL_PCIDEV_RDM_TRY )
> >                  dprintk(XENLOG_ERR VTDPREFIX,
> >                          "RMRR mapping failed to pfn:%"PRIx64""
> >                          " so Dom%d may be unstable.\n",
> >                          base_pfn, d->domain_id);
> >              else
> >                  return err;
> >          }
> >
> > Certainly, we can extend rmrr_identity_mapping() to own its associated
> > SBDF as an input parameter (and bring some syncs) if you still think
> > this is necessary.
> 
> I don't think we can, since a single RMRR may be associated with
> more than one device.
> 
> >>> @@ -493,6 +493,10 @@
> DEFINE_XEN_GUEST_HANDLE(xen_domctl_sendtrigger_t);
> >>>   /* XEN_DOMCTL_deassign_device */
> >>>   struct xen_domctl_assign_device {
> >>>       uint32_t  machine_sbdf;   /* machine PCI ID of assigned device
> */
> >>> +    /* IN */
> >>> +#define XEN_DOMCTL_PCIDEV_RDM_TRY       0
> >>> +#define XEN_DOMCTL_PCIDEV_RDM_FORCE     1
> >>
> >> "STRICT" might be a better word than "FORCE" (here and everywhere
> >> else).  "FORCE" sounds like either Xen will assign the device even if
> >> it's unsafe,  which is the opposite of what's meant IIUC.
> >
> > This is definitely fine to me but this is derived from our policy based
> > on that previous design,
> >
> > Global RDM parameter:
> >      rdm = [ 'host, reserve=force/try' ]
> > Per-device RDM parameter:
> >      pci = [ 'sbdf, rdm_reserve=force/try' ]
> >
> > Please refer to patch #1. So I guess we need a further agreement or
> > comments from other guys :)
> 
> I think I'd prefer strict/relaxed over force/try, but I certainly can
> live with the latter.
> 

agree the former is clearer. 

Thanks
Kevin

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC][PATCH 10/13] tools: extend XENMEM_set_memory_map
  2015-04-14  0:42         ` Chen, Tiejun
@ 2015-05-05  9:32           ` Wei Liu
  0 siblings, 0 replies; 125+ messages in thread
From: Wei Liu @ 2015-05-05  9:32 UTC (permalink / raw)
  To: Chen, Tiejun
  Cc: kevin.tian, Wei Liu, ian.campbell, andrew.cooper3, tim,
	xen-devel, stefano.stabellini, JBeulich, yang.z.zhang,
	Ian.Jackson

On Tue, Apr 14, 2015 at 08:42:39AM +0800, Chen, Tiejun wrote:
> On 2015/4/13 19:02, Wei Liu wrote:
> >On Mon, Apr 13, 2015 at 10:09:51AM +0800, Chen, Tiejun wrote:
> >[...]
> >>>Hardcoded value?
> >>
> >>Yes. Actually, we intend to use this to present that lowmem entry,
> >>
> >>tools/firmware/hvmloader/e820.c:
> >>
> >>     /* Low RAM goes here. Reserve space for special pages. */
> >>     ...
> >>     e820[nr].addr = 0x100000;
> >>
> >
> >I don't like the idea of having two hardcoded values in different
> 
> Just one place since based on our logic, hvmloader doesn't have this setting
> anymore and actually it really grab that info from here. Please refer to
> patch #13.
> 
> >locations. Please put this value into a header file and reference it
> >here and in hvmloader.
> 
> Anyway, I'd like to define this here directly since no one consumes this
> again.
> 
> diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c
> index 5134b33..af747e6 100644
> --- a/tools/libxl/libxl_dom.c
> +++ b/tools/libxl/libxl_dom.c
> @@ -787,6 +787,7 @@ out:
>      return rc;
>  }
> 
> +#define GUEST_LOW_MEM_START_DEFAULT 0x100000
>  static int libxl__domain_construct_memmap(libxl__gc *gc,
>                                            libxl_domain_config *d_config,
>                                            uint32_t domid,
> @@ -812,8 +813,8 @@ static int libxl__domain_construct_memmap(libxl__gc *gc,
>      e820 = libxl__malloc(gc, sizeof(struct e820entry) * e820_entries);
> 
>      /* Low memory */
> -    e820[nr].addr = 0x100000;
> -    e820[nr].size = args->lowmem_size - 0x100000;
> +    e820[nr].addr = GUEST_LOW_MEM_START_DEFAULT;
> +    e820[nr].size = args->lowmem_size - GUEST_LOW_MEM_START_DEFAULT;
>      e820[nr].type = E820_RAM;
>      nr++;
> 
> Is this fine to you?
> 

Having a #define without explaining why it is chosen is still fragile.
As Jan pointed out this needs more explanation.

Wei.

> Thanks
> Tiejun

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC][PATCH 04/13] tools/libxl: detect and avoid conflicts with RDM
  2015-04-20 11:13   ` Jan Beulich
@ 2015-05-06 15:00     ` Chen, Tiejun
  2015-05-06 15:34       ` Jan Beulich
  0 siblings, 1 reply; 125+ messages in thread
From: Chen, Tiejun @ 2015-05-06 15:00 UTC (permalink / raw)
  To: Jan Beulich
  Cc: tim, kevin.tian, wei.liu2, ian.campbell, andrew.cooper3,
	Ian.Jackson, xen-devel, stefano.stabellini, yang.z.zhang

On 2015/4/20 19:13, Jan Beulich wrote:
>>>> On 10.04.15 at 11:21, <tiejun.chen@intel.com> wrote:
>> --- a/tools/libxc/xc_domain.c
>> +++ b/tools/libxc/xc_domain.c
>> @@ -1665,6 +1665,46 @@ int xc_assign_device(
>>       return do_domctl(xch, &domctl);
>>   }
>>
>> +struct xen_reserved_device_memory
>> +*xc_device_get_rdm(xc_interface *xch,
>> +                   uint32_t flag,
>> +                   uint16_t seg,
>> +                   uint8_t bus,
>> +                   uint8_t devfn,
>> +                   unsigned int *nr_entries)
>> +{
>
> So what's the point of having both this new function and
> xc_reserved_device_memory_map()? Is the latter useful for
> anything besides the purpose here?

I just hope xc_reserved_device_memory_map() is a standard interface to 
call that XENMEM_reserved_device_memory_map, but xc_device_get_rdm() can 
handle some errors in current case.

I think you are hinting we just need one, right?

>
>> +    struct xen_reserved_device_memory *xrdm = NULL;
>> +    int rc = xc_reserved_device_memory_map(xch, flag, seg, bus, devfn, xrdm,
>> +                                           nr_entries);
>> +
>> +    if ( rc < 0 )
>> +    {
>> +        if ( errno == ENOBUFS )
>> +        {
>> +            if ( (xrdm = malloc(*nr_entries *
>> +                                sizeof(xen_reserved_device_memory_t))) == NULL )
>> +            {
>> +                PERROR("Could not allocate memory.");
>
> Now that's exactly the kind of error message that makes no sense:
> As errno will already cause PERROR() to print something along the
> lines of the message you provide here, you're just creating
> redundancy. Indicating the purpose of the allocation, otoh, would
> add helpful context for the one inspecting the resulting log.

What about this?

PERROR("Could not allocate memory buffers to store reserved device 
memory entries.");

>
>> @@ -275,6 +272,7 @@ static int setup_guest(xc_interface *xch,
>>       elf_parse_binary(&elf);
>>       v_start = 0;
>>       v_end = args->mem_size;
>> +    v_lowend = lowmem_end = args->lowmem_size;
>
> Considering that both variables get set to the same value and
> don't appear to be changed subsequently - why two variables?

Yes, I really should drop one.

>
>> @@ -302,8 +300,11 @@ static int setup_guest(xc_interface *xch,
>>
>>       for ( i = 0; i < nr_pages; i++ )
>>           page_array[i] = i;
>> -    for ( i = mmio_start >> PAGE_SHIFT; i < nr_pages; i++ )
>> -        page_array[i] += mmio_size >> PAGE_SHIFT;
>> +    /*
>> +     * This condition 'lowmem_end <= mmio_start' is always true.
>> +     */
>
> For one I think you mean "The", not "This", as there's no such
> condition around here. And then - why? DYM "is supposed to
> always be true"? In which case you may want to check...

I always do this inside libxl__build_hvm() but before setup_guest(),

+    if (args.lowmem_size > mmio_start)
+        args.lowmem_size = mmio_start;

And plus, we also another policy to rdm,

     #1. Above a predefined boundary (default 2G)
         - move lowmem_end below reserved region to solve conflict;

This means there's such a likelihood of args.lowmem_size < mmio_start) 
as well.

So here I'm saying the condition is always true.

>
>> @@ -588,9 +589,6 @@ int xc_hvm_build(xc_interface *xch, uint32_t domid,
>>       if ( args.mem_target == 0 )
>>           args.mem_target = args.mem_size;
>>
>> -    if ( args.mmio_size == 0 )
>> -        args.mmio_size = HVM_BELOW_4G_MMIO_LENGTH;
>> -
>>       /* An HVM guest must be initialised with at least 2MB memory. */
>>       if ( args.mem_size < (2ull << 20) || args.mem_target < (2ull << 20) )
>>           return -1;
>> @@ -634,6 +632,8 @@ int xc_hvm_build_target_mem(xc_interface *xch,
>>       args.mem_size = (uint64_t)memsize << 20;
>>       args.mem_target = (uint64_t)target << 20;
>>       args.image_file_name = image_name;
>> +    if ( args.mmio_size == 0 )
>> +        args.mmio_size = HVM_BELOW_4G_MMIO_LENGTH;
>
> Is this code motion really necessary?

Yes, and actually I'm just move this sort of code.

Please take a look at that original code flow:

libxl__build_hvm()
     |
     + xc_hvm_build()
     |
     + args.mmio_size = HVM_BELOW_4G_MMIO_LENGTH;
         |
         + setup_guest()
         |
         + populate guest memory


After we introduce libxl__domain_device_check_rdm() to handle our 
policy, libxl__domain_device_check_rdm() needs that final mmio_size to 
calculate our lowmem_size, so I move 'args.mmio_size = 
HVM_BELOW_4G_MMIO_LENGTH' out of xc_hvm_build(), and push forward 
libxl__domain_device_check_rdm():

libxl__build_hvm()
     |
     + args.mmio_size = HVM_BELOW_4G_MMIO_LENGTH;
     |
     + libxl__domain_device_check_rdm()
     |
     + xc_hvm_build()

This also brings a aftermath to xc_hvm_build_target_mem() which is also 
using xc_hvm_build().

>
>> --- a/tools/libxl/libxl_dm.c
>> +++ b/tools/libxl/libxl_dm.c
>> @@ -90,6 +90,201 @@ const char *libxl__domain_device_model(libxl__gc *gc,
>>       return dm;
>>   }
>>
>> +/*
>> + * Check whether there exists rdm hole in the specified memory range.
>> + * Returns 1 if exists, else returns 0.
>> + */
>> +static int check_rdm_hole(uint64_t start, uint64_t memsize,
>> +                          uint64_t rdm_start, uint64_t rdm_size)
>> +{
>> +    if ( start + memsize <= rdm_start || start >= rdm_start + rdm_size )
>> +        return 0;
>> +    else
>> +        return 1;
>> +}
>
> The function appears to return a boolean type, so please make its
> return value express this. Also the name doesn't really imply the
> meaning of "true" and "false" being returned - how about
> overlaps_rdm()? And finally, while I'm not the maintainer of this code

Looks good.

> and hence don't have the final say on stylistic issues, I don't see
> why the above couldn't be expressed with a single return statement.

Are you saying something like this? Note this was showed by yourself 
long time ago.

static bool check_mmio_hole_conflict(uint64_t start, uint64_t memsize,
                                       uint64_t mmio_start, uint64_t 
mmio_size)
{
      return start + memsize > mmio_start && start < mmio_start + mmio_size;
}

But I don't think this really can't work out our case.

>
>> +int libxl__domain_device_check_rdm(libxl__gc *gc,
>> +                                   libxl_domain_config *d_config,
>> +                                   uint64_t rdm_mem_guard,
>> +                                   struct xc_hvm_build_args *args)
>> +{
>> +    int i, j, conflict;
>> +    libxl_ctx *ctx = libxl__gc_owner(gc);
>> +    struct xen_reserved_device_memory *xrdm = NULL;
>> +    unsigned int nr_all_rdms = 0;
>> +    uint64_t rdm_start, rdm_size, highmem_end = (1ULL << 32);
>> +    uint32_t type = d_config->b_info.rdm.type;
>> +    uint16_t seg;
>> +    uint8_t bus, devfn;
>> +
>> +    /* Might not to expose rdm. */
>> +    if ((type == LIBXL_RDM_RESERVE_TYPE_NONE) && !d_config->num_pcidevs)
>> +        return 0;
>> +
>> +    /* Collect all rdm info if exist. */
>> +    xrdm = xc_device_get_rdm(ctx->xch, LIBXL_RDM_RESERVE_TYPE_HOST,
>> +                             0, 0, 0, &nr_all_rdms);
>
> What meaning has passing a libxl private value to a libxc function?

We intend to collect all rdm entries info in advance and then we can 
construct d_config->rdms based on our policies as follows. Because we 
need to first allocate d_config->rdms properly to store rdms, but in 
some cases we don't know how many buffers are enough. For example, we 
don't have that global flag but with multiple pci devices. And even a 
shared entry worsen this situation.

So here, we set that flag as LIBXL_RDM_RESERVE_TYPE_HOST but without any 
SBDF to grab all rdms.

>
>> +    if (!nr_all_rdms)
>> +        return 0;
>> +    d_config->rdms = libxl__calloc(gc, nr_all_rdms,
>> +                                   sizeof(libxl_device_rdm));
>> +    memset(d_config->rdms, 0, sizeof(libxl_device_rdm));
>> +
>> +    /* Query all RDM entries in this platform */
>> +    if (type == LIBXL_RDM_RESERVE_TYPE_HOST) {
>> +        d_config->num_rdms = nr_all_rdms;
>> +        for (i = 0; i < d_config->num_rdms; i++) {
>> +            d_config->rdms[i].start =
>> +                                (uint64_t)xrdm[i].start_pfn << XC_PAGE_SHIFT;
>> +            d_config->rdms[i].size =
>> +                                (uint64_t)xrdm[i].nr_pages << XC_PAGE_SHIFT;
>> +            d_config->rdms[i].flag = d_config->b_info.rdm.reserve;
>> +        }
>> +    } else {
>> +        d_config->num_rdms = 0;
>> +    }
>> +    free(xrdm);
>> +
>> +    /* Query RDM entries per-device */
>> +    for (i = 0; i < d_config->num_pcidevs; i++) {
>> +        unsigned int nr_entries = 0;
>> +        bool new = true;
>> +        seg = d_config->pcidevs[i].domain;
>> +        bus = d_config->pcidevs[i].bus;
>> +        devfn = PCI_DEVFN(d_config->pcidevs[i].dev, d_config->pcidevs[i].func);
>> +        nr_entries = 0;
>> +        xrdm = xc_device_get_rdm(ctx->xch, LIBXL_RDM_RESERVE_TYPE_NONE,
>> +                                 seg, bus, devfn, &nr_entries);
>> +        /* No RDM to associated with this device. */
>> +        if (!nr_entries)
>> +            continue;
>> +
>> +        /* Need to check whether this entry is already saved in the array.
>> +         * This could come from two cases:
>> +         *
>> +         *   - user may configure to get all RMRRs in this platform, which
>> +         * is already queried before this point
>> +         *   - or two assigned devices may share one RMRR entry
>> +         *
>> +         * different policies may be configured on the same RMRR due to above
>> +         * two cases. We choose a simple policy to always favor stricter policy
>> +         */
>> +        for (j = 0; j < d_config->num_rdms; j++) {
>> +            if (d_config->rdms[j].start ==
>> +                                (uint64_t)xrdm[0].start_pfn << XC_PAGE_SHIFT)
>> +             {
>> +                if (d_config->rdms[j].flag != LIBXL_RDM_RESERVE_FLAG_FORCE)
>> +                    d_config->rdms[j].flag = d_config->pcidevs[i].rdm_reserve;
>> +                new = false;
>> +                break;
>> +            }
>> +        }
>> +
>> +        if (new) {
>> +            if (d_config->num_rdms > nr_all_rdms - 1) {
>
> ">= nr_all_rdms" expresses the same thing.

Okay.

>
>> +                LIBXL__LOG(CTX, LIBXL__LOG_ERROR, "Exceed rdm array boundary!\n");
>> +                free(xrdm);
>> +                return -1;
>> +            }
>> +
>> +            /*
>> +             * This is a new entry.
>> +             */
>> +            d_config->rdms[d_config->num_rdms].start =
>> +                                (uint64_t)xrdm[0].start_pfn << XC_PAGE_SHIFT;
>> +            d_config->rdms[d_config->num_rdms].size =
>> +                                (uint64_t)xrdm[0].nr_pages << XC_PAGE_SHIFT;
>> +            d_config->rdms[d_config->num_rdms].flag =
>> d_config->pcidevs[i].rdm_reserve;
>> +            d_config->num_rdms++;
>> +        }
>> +        free(xrdm);
>> +    }
>> +
>> +    /* Fix highmem. */
>> +    if (args->mem_size > args->lowmem_size)
>> +        highmem_end += (args->mem_size - args->lowmem_size);
>
> This looks misplaced. In particular it doesn't depend on anything done
> so far in this function, so it being here is at least confusing. Is there

Sure.

> any reason not to set up the variable correctly right at the beginning
> of the function, or do _all_ of its initialization here?

So I think I should move that right at the beginning of the function.

>
>> +    /* Next step is to check and avoid potential conflict between RDM entries
>> +     * and guest RAM. To avoid intrusive impact to existing memory layout
>> +     * {lowmem, mmio, highmem} which is passed around various function blocks,
>> +     * below conflicts are not handled which are rare and handling them would
>> +     * lead to a more scattered layout:
>> +     *  - RMRR in highmem area (>4G)
>> +     *  - RMRR lower than a defined memory boundary (e.g. 2G)
>
> So how is that going to do on my box having RMRRs below 1Mb?

This is just right in this case of (<2G) as well. So we still check what 
policy is passed, 'strict' versus 'relaxed'.

>
>> +     * Otherwise for conflicts between boundary and 4G, we'll simply move lowmem
>> +     * end below reserved region to solve conflict.
>> +     *
>> +     * If a conflict is detected on a given RMRR entry, an error will be
>> +     * returned.
>> +     * If 'force' policy is specified. Or conflict is treated as a warning if
>
> The first sentence here appears to belong at the end of the one ending
> on the previous line.
>
>> +     * 'try' policy is specified, and we also mark this as INVALID not to expose
>> +     * this entry to hvmloader.
>
> What is "this" in "... also mark this as ..."? Certainly neither the conflict
> nor the warning.

Sorry, this is my fault.

      * If a conflict is detected on a given RMRR entry, an error will be
      * returned if 'strict' policy is specified. Or conflict is treated 
as a
      * warning if 'relaxed' policy is specified, and we also mark this as
      * INVALID not to expose this entry to hvmloader.

>
>> +     *
>> +     * Firstly we should check the case of rdm < 4G because we may need to
>> +     * expand highmem_end.
>> +     */
>> +    for (i = 0; i < d_config->num_rdms; i++) {
>> +        rdm_start = d_config->rdms[i].start;
>> +        rdm_size = d_config->rdms[i].size;
>> +        conflict = check_rdm_hole(0, args->lowmem_size, rdm_start, rdm_size);
>> +
>> +        if (!conflict)
>> +            continue;
>> +
>> +        /*
>> +         * Just check if RDM > our memory boundary
>> +         */
>> +        if (d_config->rdms[i].start > rdm_mem_guard) {
>> +            /*
>> +             * We will move downwards lowmem_end so we have to expand
>> +             * highmem_end.
>> +             */
>> +            highmem_end += (args->lowmem_size - rdm_start);
>> +            /* Now move downwards lowmem_end. */
>> +            args->lowmem_size = rdm_start;
>
> Considering that the action here doesn't depend on the specific
> ->rdms[] slot being looked at, I don't see why the loop needs to

I'm not sure if I understand what you mean.

All rdm entries are organized disorderly in d_config->rdms, so we should 
traverse all entries to make sure args->lowmem_size is below all rdms' 
start address.

> continue. Also - what if rdm_mem_guard > args->lowmem_size?

We have this check before we go to check rdm_mem_guard,

conflict = check_rdm_hole(0, args->lowmem_size, rdm_start, rdm_size);
if (!conflict) continue;

This means args->lowmem_size is already above rdm_start, so here 
actually this indicates we work with this condition,

(args->lowmem_size > rdm_start) && (rdm_start > rdm_mem_guard)

>
>> +        }
>> +    }
>> +
>> +    /*
>> +     * Finally we can take same policy to check lowmem(< 2G) and
>> +     * highmem adjusted above.
>> +     */
>
> I don't follow - neither the code above nor the one below makes any
> distinction between memory below 2Gb and memory below 4Gb (as
> far as check_rdm_hole() calls are concerned).

As you guys discussed during that design phrase, we just need a 
memory_guard specific to rdm regardless of args->lowmem_size is below 2G 
or [2G, 4G], right?

In case of [2G, 4G] we have such this policy, args->lowmem_size = 
rdm_start. As you see this is implemented above. But in case of 
args->lowmem_size < 2G (memory_guard), we are trying to handle any 
conflict directly according to our policy. And currently we'd like to 
follow this same policy to handle any conflict from highmem as well.

>
>> +    for (i = 0; i < d_config->num_rdms; i++) {
>> +        rdm_start = d_config->rdms[i].start;
>> +        rdm_size = d_config->rdms[i].size;
>> +        /* Does this entry conflict with lowmem? */
>> +        conflict = check_rdm_hole(0, args->lowmem_size,
>> +                                  rdm_start, rdm_size);
>> +        /* Does this entry conflict with highmem? */
>> +        conflict |= check_rdm_hole((1ULL<<32), highmem_end,
>
> The second parameter of the function is a size, not an address.

You're right. And this should be 'highmem_end - (1ULL<<32)'.

>
>> @@ -802,6 +804,8 @@ int libxl__build_hvm(libxl__gc *gc, uint32_t domid,
>>        * Do all this in one step here...
>>        */
>>       args.mem_size = (uint64_t)(info->max_memkb - info->video_memkb) << 10;
>> +    args.lowmem_size = (args.mem_size > (1ULL << 32)) ?
>> +                            (1ULL << 32) : args.mem_size;
>
> Use min().

Sure.

>
>> @@ -811,6 +815,27 @@ int libxl__build_hvm(libxl__gc *gc, uint32_t domid,
>>           if (max_ram_below_4g < HVM_BELOW_4G_MMIO_START)
>>               args.mmio_size = info->u.hvm.mmio_hole_memkb << 10;
>>       }
>> +
>> +    if (args.mmio_size == 0)
>> +        args.mmio_size = HVM_BELOW_4G_MMIO_LENGTH;
>> +    mmio_start = (1ull << 32) - args.mmio_size;
>> +
>> +    if (args.lowmem_size > mmio_start)
>> +        args.lowmem_size = mmio_start;
>> +
>> +    /*
>> +     * We'd like to set a memory boundary to determine if we need to check
>> +     * any overlap with reserved device memory.
>> +     *
>> +     * TODO: we will add a config parameter for this boundary value next.
>
> According to the titles of subsequent patches this doesn't seem to
> happen within this series.

Sorry, actually I'm trying to say we do this after current revision is 
accepted. So just rephrase this comment,

* TODO: Its batter to provide a config parameter for this boundary value.

>
> But overall the intentions of this patch seem right.
>

Really appreciate your comments.

Thanks
Tiejun

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC][PATCH 04/13] tools/libxl: detect and avoid conflicts with RDM
  2015-05-06 15:00     ` Chen, Tiejun
@ 2015-05-06 15:34       ` Jan Beulich
  2015-05-07  2:22         ` Chen, Tiejun
  0 siblings, 1 reply; 125+ messages in thread
From: Jan Beulich @ 2015-05-06 15:34 UTC (permalink / raw)
  To: Tiejun Chen
  Cc: tim, kevin.tian, wei.liu2, ian.campbell, andrew.cooper3,
	Ian.Jackson, xen-devel, stefano.stabellini, yang.z.zhang

>>> On 06.05.15 at 17:00, <tiejun.chen@intel.com> wrote:
> On 2015/4/20 19:13, Jan Beulich wrote:
>>>>> On 10.04.15 at 11:21, <tiejun.chen@intel.com> wrote:
>>> --- a/tools/libxc/xc_domain.c
>>> +++ b/tools/libxc/xc_domain.c
>>> @@ -1665,6 +1665,46 @@ int xc_assign_device(
>>>       return do_domctl(xch, &domctl);
>>>   }
>>>
>>> +struct xen_reserved_device_memory
>>> +*xc_device_get_rdm(xc_interface *xch,
>>> +                   uint32_t flag,
>>> +                   uint16_t seg,
>>> +                   uint8_t bus,
>>> +                   uint8_t devfn,
>>> +                   unsigned int *nr_entries)
>>> +{
>>
>> So what's the point of having both this new function and
>> xc_reserved_device_memory_map()? Is the latter useful for
>> anything besides the purpose here?
> 
> I just hope xc_reserved_device_memory_map() is a standard interface to 
> call that XENMEM_reserved_device_memory_map, but xc_device_get_rdm() can 
> handle some errors in current case.
> 
> I think you are hinting we just need one, right?

Correct. But remember - I'm not a maintainer of this code, so
maintainers may be of different opinion.

>>> +    struct xen_reserved_device_memory *xrdm = NULL;
>>> +    int rc = xc_reserved_device_memory_map(xch, flag, seg, bus, devfn, xrdm,
>>> +                                           nr_entries);
>>> +
>>> +    if ( rc < 0 )
>>> +    {
>>> +        if ( errno == ENOBUFS )
>>> +        {
>>> +            if ( (xrdm = malloc(*nr_entries *
>>> +                                sizeof(xen_reserved_device_memory_t))) == NULL )
>>> +            {
>>> +                PERROR("Could not allocate memory.");
>>
>> Now that's exactly the kind of error message that makes no sense:
>> As errno will already cause PERROR() to print something along the
>> lines of the message you provide here, you're just creating
>> redundancy. Indicating the purpose of the allocation, otoh, would
>> add helpful context for the one inspecting the resulting log.
> 
> What about this?
> 
> PERROR("Could not allocate memory buffers to store reserved device 
> memory entries.");

You kind of go from one extreme to the other - the message
doesn't need to be overly long, but it should be distinct from
all other messages (so that when seen one can identify what
went wrong).

>>> @@ -302,8 +300,11 @@ static int setup_guest(xc_interface *xch,
>>>
>>>       for ( i = 0; i < nr_pages; i++ )
>>>           page_array[i] = i;
>>> -    for ( i = mmio_start >> PAGE_SHIFT; i < nr_pages; i++ )
>>> -        page_array[i] += mmio_size >> PAGE_SHIFT;
>>> +    /*
>>> +     * This condition 'lowmem_end <= mmio_start' is always true.
>>> +     */
>>
>> For one I think you mean "The", not "This", as there's no such
>> condition around here. And then - why? DYM "is supposed to
>> always be true"? In which case you may want to check...
> 
> I always do this inside libxl__build_hvm() but before setup_guest(),
> 
> +    if (args.lowmem_size > mmio_start)
> +        args.lowmem_size = mmio_start;
> 
> And plus, we also another policy to rdm,
> 
>      #1. Above a predefined boundary (default 2G)
>          - move lowmem_end below reserved region to solve conflict;
> 
> This means there's such a likelihood of args.lowmem_size < mmio_start) 
> as well.
> 
> So here I'm saying the condition is always true.

Okay, but again - if this is relevant to the following code, an
assertion or alike may still be warranted.

>> and hence don't have the final say on stylistic issues, I don't see
>> why the above couldn't be expressed with a single return statement.
> 
> Are you saying something like this? Note this was showed by yourself 
> long time ago.

I know, and hence I was puzzled to still see you use the more
convoluted form.

> static bool check_mmio_hole_conflict(uint64_t start, uint64_t memsize,
>                                        uint64_t mmio_start, uint64_t mmio_size)
> {
>       return start + memsize > mmio_start && start < mmio_start + mmio_size;
> }
> 
> But I don't think this really can't work out our case.

It's equivalent to the original you had, so I don't see what you
mean with "this really can't work out our case".

>>> +int libxl__domain_device_check_rdm(libxl__gc *gc,
>>> +                                   libxl_domain_config *d_config,
>>> +                                   uint64_t rdm_mem_guard,
>>> +                                   struct xc_hvm_build_args *args)
>>> +{
>>> +    int i, j, conflict;
>>> +    libxl_ctx *ctx = libxl__gc_owner(gc);
>>> +    struct xen_reserved_device_memory *xrdm = NULL;
>>> +    unsigned int nr_all_rdms = 0;
>>> +    uint64_t rdm_start, rdm_size, highmem_end = (1ULL << 32);
>>> +    uint32_t type = d_config->b_info.rdm.type;
>>> +    uint16_t seg;
>>> +    uint8_t bus, devfn;
>>> +
>>> +    /* Might not to expose rdm. */
>>> +    if ((type == LIBXL_RDM_RESERVE_TYPE_NONE) && !d_config->num_pcidevs)
>>> +        return 0;
>>> +
>>> +    /* Collect all rdm info if exist. */
>>> +    xrdm = xc_device_get_rdm(ctx->xch, LIBXL_RDM_RESERVE_TYPE_HOST,
>>> +                             0, 0, 0, &nr_all_rdms);
>>
>> What meaning has passing a libxl private value to a libxc function?
> 
> We intend to collect all rdm entries info in advance and then we can 
> construct d_config->rdms based on our policies as follows. Because we 
> need to first allocate d_config->rdms properly to store rdms, but in 
> some cases we don't know how many buffers are enough. For example, we 
> don't have that global flag but with multiple pci devices. And even a 
> shared entry worsen this situation.
> 
> So here, we set that flag as LIBXL_RDM_RESERVE_TYPE_HOST but without any 
> SBDF to grab all rdms.

I'm afraid you didn't get my point: Values passed to libxc should be
known to libxc. Values privately defined by libxl for its own purposes
aren't known to libxc, and hence shouldn't be passed to libxc
functions.

>>> +     * 'try' policy is specified, and we also mark this as INVALID not to expose
>>> +     * this entry to hvmloader.
>>
>> What is "this" in "... also mark this as ..."? Certainly neither the conflict
>> nor the warning.
> 
> Sorry, this is my fault.
> 
>       * If a conflict is detected on a given RMRR entry, an error will be
>       * returned if 'strict' policy is specified. Or conflict is treated as a
>       * warning if 'relaxed' policy is specified, and we also mark this as
>       * INVALID not to expose this entry to hvmloader.

The same "this" still doesn't have anything reasonable it references. I
think you mean "the entry" (in which case the subsequent "this entry"
could become just "it" afaict). But (not being a native speaker) the
grammar of the second half of the sentence looks odd (and hence
potentially confusing) to me anyway (i.e. even with the previous
issue fixed).

>>> +     *
>>> +     * Firstly we should check the case of rdm < 4G because we may need to
>>> +     * expand highmem_end.
>>> +     */
>>> +    for (i = 0; i < d_config->num_rdms; i++) {
>>> +        rdm_start = d_config->rdms[i].start;
>>> +        rdm_size = d_config->rdms[i].size;
>>> +        conflict = check_rdm_hole(0, args->lowmem_size, rdm_start, rdm_size);
>>> +
>>> +        if (!conflict)
>>> +            continue;
>>> +
>>> +        /*
>>> +         * Just check if RDM > our memory boundary
>>> +         */
>>> +        if (d_config->rdms[i].start > rdm_mem_guard) {
>>> +            /*
>>> +             * We will move downwards lowmem_end so we have to expand
>>> +             * highmem_end.
>>> +             */
>>> +            highmem_end += (args->lowmem_size - rdm_start);
>>> +            /* Now move downwards lowmem_end. */
>>> +            args->lowmem_size = rdm_start;
>>
>> Considering that the action here doesn't depend on the specific
>> ->rdms[] slot being looked at, I don't see why the loop needs to
> 
> I'm not sure if I understand what you mean.
> 
> All rdm entries are organized disorderly in d_config->rdms, so we should 
> traverse all entries to make sure args->lowmem_size is below all rdms' 
> start address.

I think I see what confused me: in the if() condition you reference
d_config->rdms[i].start, yet the body of the if() has no reference
to d_config->rdms[i] at all. If the if() used rdm_start it would
become obvious that this is being latched at the beginning of the
body (which is what I overlooked, assuming the variable's value
to have got set prior to the loop), and hence the body is not loop
invariant.

Jan

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC][PATCH 04/13] tools/libxl: detect and avoid conflicts with RDM
  2015-05-06 15:34       ` Jan Beulich
@ 2015-05-07  2:22         ` Chen, Tiejun
  2015-05-07  6:04           ` Jan Beulich
  2015-05-08  1:24           ` Chen, Tiejun
  0 siblings, 2 replies; 125+ messages in thread
From: Chen, Tiejun @ 2015-05-07  2:22 UTC (permalink / raw)
  To: Jan Beulich, ian.campbell, stefano.stabellini, wei.liu2, Ian.Jackson
  Cc: kevin.tian, andrew.cooper3, tim, xen-devel, yang.z.zhang

On 2015/5/6 23:34, Jan Beulich wrote:
>>>> On 06.05.15 at 17:00, <tiejun.chen@intel.com> wrote:
>> On 2015/4/20 19:13, Jan Beulich wrote:
>>>>>> On 10.04.15 at 11:21, <tiejun.chen@intel.com> wrote:
>>>> --- a/tools/libxc/xc_domain.c
>>>> +++ b/tools/libxc/xc_domain.c
>>>> @@ -1665,6 +1665,46 @@ int xc_assign_device(
>>>>        return do_domctl(xch, &domctl);
>>>>    }
>>>>
>>>> +struct xen_reserved_device_memory
>>>> +*xc_device_get_rdm(xc_interface *xch,
>>>> +                   uint32_t flag,
>>>> +                   uint16_t seg,
>>>> +                   uint8_t bus,
>>>> +                   uint8_t devfn,
>>>> +                   unsigned int *nr_entries)
>>>> +{
>>>
>>> So what's the point of having both this new function and
>>> xc_reserved_device_memory_map()? Is the latter useful for
>>> anything besides the purpose here?
>>
>> I just hope xc_reserved_device_memory_map() is a standard interface to
>> call that XENMEM_reserved_device_memory_map, but xc_device_get_rdm() can
>> handle some errors in current case.
>>
>> I think you are hinting we just need one, right?
>
> Correct. But remember - I'm not a maintainer of this code, so

But this may be a little complex with one...

> maintainers may be of different opinion.

Anyway, let me ask our tools maintainers.

Campbell, Jackson, Wei and Stefano,

What about your concern to this?

>
>>>> +    struct xen_reserved_device_memory *xrdm = NULL;
>>>> +    int rc = xc_reserved_device_memory_map(xch, flag, seg, bus, devfn, xrdm,
>>>> +                                           nr_entries);
>>>> +
>>>> +    if ( rc < 0 )
>>>> +    {
>>>> +        if ( errno == ENOBUFS )
>>>> +        {
>>>> +            if ( (xrdm = malloc(*nr_entries *
>>>> +                                sizeof(xen_reserved_device_memory_t))) == NULL )
>>>> +            {
>>>> +                PERROR("Could not allocate memory.");
>>>
>>> Now that's exactly the kind of error message that makes no sense:
>>> As errno will already cause PERROR() to print something along the
>>> lines of the message you provide here, you're just creating
>>> redundancy. Indicating the purpose of the allocation, otoh, would
>>> add helpful context for the one inspecting the resulting log.
>>
>> What about this?
>>
>> PERROR("Could not allocate memory buffers to store reserved device
>> memory entries.");
>
> You kind of go from one extreme to the other - the message
> doesn't need to be overly long, but it should be distinct from
> all other messages (so that when seen one can identify what
> went wrong).

I originally refer to some existing examples like this,

int
xc_core_arch_memory_map_get(xc_interface *xch, struct 
xc_core_arch_context *unused,
                             xc_dominfo_t *info, shared_info_any_t 
*live_shinfo,
                             xc_core_memory_map_t **mapp,
                             unsigned int *nr_entries)
{
     ...
     map = malloc(sizeof(*map));
     if ( map == NULL )
     {
         PERROR("Could not allocate memory");
         return -1;
     }

Maybe this is wrong to my case. Could I change this?

PERROR("Could not allocate memory for XENMEM_reserved_device_memory_map 
hypercall");

Or just give me your line.

>
>>>> @@ -302,8 +300,11 @@ static int setup_guest(xc_interface *xch,
>>>>
>>>>        for ( i = 0; i < nr_pages; i++ )
>>>>            page_array[i] = i;
>>>> -    for ( i = mmio_start >> PAGE_SHIFT; i < nr_pages; i++ )
>>>> -        page_array[i] += mmio_size >> PAGE_SHIFT;
>>>> +    /*
>>>> +     * This condition 'lowmem_end <= mmio_start' is always true.
>>>> +     */
>>>
>>> For one I think you mean "The", not "This", as there's no such
>>> condition around here. And then - why? DYM "is supposed to
>>> always be true"? In which case you may want to check...
>>
>> I always do this inside libxl__build_hvm() but before setup_guest(),
>>
>> +    if (args.lowmem_size > mmio_start)
>> +        args.lowmem_size = mmio_start;
>>
>> And plus, we also another policy to rdm,
>>
>>       #1. Above a predefined boundary (default 2G)
>>           - move lowmem_end below reserved region to solve conflict;
>>
>> This means there's such a likelihood of args.lowmem_size < mmio_start)
>> as well.
>>
>> So here I'm saying the condition is always true.
>
> Okay, but again - if this is relevant to the following code, an
> assertion or alike may still be warranted.

Yes I should add 'assert()' here.

>
>>> and hence don't have the final say on stylistic issues, I don't see
>>> why the above couldn't be expressed with a single return statement.
>>
>> Are you saying something like this? Note this was showed by yourself
>> long time ago.
>
> I know, and hence I was puzzled to still see you use the more
> convoluted form.
>
>> static bool check_mmio_hole_conflict(uint64_t start, uint64_t memsize,
>>                                         uint64_t mmio_start, uint64_t mmio_size)
>> {
>>        return start + memsize > mmio_start && start < mmio_start + mmio_size;
>> }
>>
>> But I don't think this really can't work out our case.
>
> It's equivalent to the original you had, so I don't see what you
> mean with "this really can't work out our case".
>

Let me make this point clear.

The original implementation,

+static int check_rdm_hole(uint64_t start, uint64_t memsize,
+                          uint64_t rdm_start, uint64_t rdm_size)
+{
+    if (start + memsize <= rdm_start || start >= rdm_start + rdm_size)
+        return 0;
+    else
+        return 1;
+}

means it returns 'false' in two cases:

#1. end = start + memsize; end <= rdm_start;

This region [start, end] is below of rdm entry.

#2. rdm_end = rdm_start + rdm_size; stat >= rdm_end;

This region [start, end] is above of rdm entry.

So others conditions should indicate that rdm entry is overlapping with 
this region. Actually this has three cases:

#1. This region just conflicts with the first half of rdm entry;
#2. This region just conflicts with the second half of rdm entry;
#3. This whole region falls inside of rdm entry;

Then it should return 'true', right?

But with this single line,

return start + memsize > rdm_start && start < rdm_start + rdm_size;

=>

return end > rdm_start && start < rdm_end;

This just guarantee it return 'true' *only* if #3 above occurs.

>>>> +int libxl__domain_device_check_rdm(libxl__gc *gc,
>>>> +                                   libxl_domain_config *d_config,
>>>> +                                   uint64_t rdm_mem_guard,
>>>> +                                   struct xc_hvm_build_args *args)
>>>> +{
>>>> +    int i, j, conflict;
>>>> +    libxl_ctx *ctx = libxl__gc_owner(gc);
>>>> +    struct xen_reserved_device_memory *xrdm = NULL;
>>>> +    unsigned int nr_all_rdms = 0;
>>>> +    uint64_t rdm_start, rdm_size, highmem_end = (1ULL << 32);
>>>> +    uint32_t type = d_config->b_info.rdm.type;
>>>> +    uint16_t seg;
>>>> +    uint8_t bus, devfn;
>>>> +
>>>> +    /* Might not to expose rdm. */
>>>> +    if ((type == LIBXL_RDM_RESERVE_TYPE_NONE) && !d_config->num_pcidevs)
>>>> +        return 0;
>>>> +
>>>> +    /* Collect all rdm info if exist. */
>>>> +    xrdm = xc_device_get_rdm(ctx->xch, LIBXL_RDM_RESERVE_TYPE_HOST,
>>>> +                             0, 0, 0, &nr_all_rdms);
>>>
>>> What meaning has passing a libxl private value to a libxc function?
>>
>> We intend to collect all rdm entries info in advance and then we can
>> construct d_config->rdms based on our policies as follows. Because we
>> need to first allocate d_config->rdms properly to store rdms, but in
>> some cases we don't know how many buffers are enough. For example, we
>> don't have that global flag but with multiple pci devices. And even a
>> shared entry worsen this situation.
>>
>> So here, we set that flag as LIBXL_RDM_RESERVE_TYPE_HOST but without any
>> SBDF to grab all rdms.
>
> I'm afraid you didn't get my point: Values passed to libxc should be

Sorry for this misunderstanding.

> known to libxc. Values privately defined by libxl for its own purposes
> aren't known to libxc, and hence shouldn't be passed to libxc
> functions.

I think we should set this with 'PCI_DEV_RDM_ALL' since,

struct xen_reserved_device_memory_map {
     /* IN */
     /* Currently just one bit to indicate checkng all Reserved Device 
Memory. */
#define PCI_DEV_RDM_ALL   0x1

>
>>>> +     * 'try' policy is specified, and we also mark this as INVALID not to expose
>>>> +     * this entry to hvmloader.
>>>
>>> What is "this" in "... also mark this as ..."? Certainly neither the conflict
>>> nor the warning.
>>
>> Sorry, this is my fault.
>>
>>        * If a conflict is detected on a given RMRR entry, an error will be
>>        * returned if 'strict' policy is specified. Or conflict is treated as a
>>        * warning if 'relaxed' policy is specified, and we also mark this as
>>        * INVALID not to expose this entry to hvmloader.
>
> The same "this" still doesn't have anything reasonable it references. I
> think you mean "the entry" (in which case the subsequent "this entry"
> could become just "it" afaict). But (not being a native speaker) the
> grammar of the second half of the sentence looks odd (and hence
> potentially confusing) to me anyway (i.e. even with the previous

Sure, we need to make this better and clear.

> issue fixed).

      * If a conflict is detected on a given RMRR entry, an error will be
      * returned if 'strict' policy is specified. Instead, if 'relaxed' 
policy
      * specified, this conflict is treated just as a warning, but we 
mark this
      * RMRR entry as INVALID to indicate that this entry shouldn't be 
exposed
      * to hvmloader.

I hope this can help us understand what we do.

>
>>>> +     *
>>>> +     * Firstly we should check the case of rdm < 4G because we may need to
>>>> +     * expand highmem_end.
>>>> +     */
>>>> +    for (i = 0; i < d_config->num_rdms; i++) {
>>>> +        rdm_start = d_config->rdms[i].start;
>>>> +        rdm_size = d_config->rdms[i].size;
>>>> +        conflict = check_rdm_hole(0, args->lowmem_size, rdm_start, rdm_size);
>>>> +
>>>> +        if (!conflict)
>>>> +            continue;
>>>> +
>>>> +        /*
>>>> +         * Just check if RDM > our memory boundary
>>>> +         */
>>>> +        if (d_config->rdms[i].start > rdm_mem_guard) {
>>>> +            /*
>>>> +             * We will move downwards lowmem_end so we have to expand
>>>> +             * highmem_end.
>>>> +             */
>>>> +            highmem_end += (args->lowmem_size - rdm_start);
>>>> +            /* Now move downwards lowmem_end. */
>>>> +            args->lowmem_size = rdm_start;
>>>
>>> Considering that the action here doesn't depend on the specific
>>> ->rdms[] slot being looked at, I don't see why the loop needs to
>>
>> I'm not sure if I understand what you mean.
>>
>> All rdm entries are organized disorderly in d_config->rdms, so we should
>> traverse all entries to make sure args->lowmem_size is below all rdms'
>> start address.
>
> I think I see what confused me: in the if() condition you reference
> d_config->rdms[i].start, yet the body of the if() has no reference
> to d_config->rdms[i] at all. If the if() used rdm_start it would
> become obvious that this is being latched at the beginning of the

Indeed, I really should use rdm_start here.

> body (which is what I overlooked, assuming the variable's value
> to have got set prior to the loop), and hence the body is not loop
> invariant.
>

So just replace d_config->rdms[i].start with rdm_start like this,

         /*
          * Just check if RDM > our memory boundary
          */
         if (rdm_start > rdm_mem_guard) {
             /*
              * We will move downwards lowmem_end so we have to expand
              * highmem_end.
              */
             highmem_end += (args->lowmem_size - rdm_start);
             /* Now move downwards lowmem_end. */
             args->lowmem_size = rdm_start;
         }
     }

Thanks
Tiejun

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC][PATCH 04/13] tools/libxl: detect and avoid conflicts with RDM
  2015-05-07  2:22         ` Chen, Tiejun
@ 2015-05-07  6:04           ` Jan Beulich
  2015-05-08  1:14             ` Chen, Tiejun
  2015-05-08  1:24           ` Chen, Tiejun
  1 sibling, 1 reply; 125+ messages in thread
From: Jan Beulich @ 2015-05-07  6:04 UTC (permalink / raw)
  To: tiejun.chen
  Cc: tim, kevin.tian, wei.liu2, ian.campbell, andrew.cooper3,
	Ian.Jackson, xen-devel, stefano.stabellini, yang.z.zhang

>>> "Chen, Tiejun" <tiejun.chen@intel.com> 05/07/15 4:22 AM >>>
>On 2015/5/6 23:34, Jan Beulich wrote:
>>>>> On 06.05.15 at 17:00, <tiejun.chen@intel.com> wrote:
>>> On 2015/4/20 19:13, Jan Beulich wrote:
>>>>>>> On 10.04.15 at 11:21, <tiejun.chen@intel.com> wrote:
>>>>> +                PERROR("Could not allocate memory.");
>>>>
>>>> Now that's exactly the kind of error message that makes no sense:
>>>> As errno will already cause PERROR() to print something along the
>>>> lines of the message you provide here, you're just creating
>>>> redundancy. Indicating the purpose of the allocation, otoh, would
>>>> add helpful context for the one inspecting the resulting log.
>>>
>>> What about this?
>>>
>>> PERROR("Could not allocate memory buffers to store reserved device
>>> memory entries.");
>>
>> You kind of go from one extreme to the other - the message
>> doesn't need to be overly long, but it should be distinct from
>> all other messages (so that when seen one can identify what
>> went wrong).
>
>I originally refer to some existing examples like this,
>
>int
>xc_core_arch_memory_map_get(xc_interface *xch, struct 
>xc_core_arch_context *unused,
>xc_dominfo_t *info, shared_info_any_t 
>*live_shinfo,
>xc_core_memory_map_t **mapp,
>unsigned int *nr_entries)
>{
>...
>map = malloc(sizeof(*map));
>if ( map == NULL )
>{
>PERROR("Could not allocate memory");
>return -1;
>}
>
>Maybe this is wrong to my case. Could I change this?

Yeah, I realize there are bad examples. But we try to avoid spreading those.

>PERROR("Could not allocate memory for XENMEM_reserved_device_memory_map 
>hypercall");
>
>Or just give me your line.

How about

PERROR("Could not allocate RDM buffer");

It's brief but specific enough to uniquely identify it.

>>>> and hence don't have the final say on stylistic issues, I don't see
>>>> why the above couldn't be expressed with a single return statement.
>>>
>>> Are you saying something like this? Note this was showed by yourself
>>> long time ago.
>>
>> I know, and hence I was puzzled to still see you use the more
>> convoluted form.
>>
>>> static bool check_mmio_hole_conflict(uint64_t start, uint64_t memsize,
>>>                                         uint64_t mmio_start, uint64_t mmio_size)
>>> {
>>>        return start + memsize > mmio_start && start < mmio_start + mmio_size;
>>> }
>>>
>>> But I don't think this really can't work out our case.
>>
>> It's equivalent to the original you had, so I don't see what you
>> mean with "this really can't work out our case".
>
>Let me make this point clear.
>
>The original implementation,
>
>+static int check_rdm_hole(uint64_t start, uint64_t memsize,
>+                          uint64_t rdm_start, uint64_t rdm_size)
>+{
>+    if (start + memsize <= rdm_start || start >= rdm_start + rdm_size)
>+        return 0;
>+    else
>+        return 1;
>+}
>
>means it returns 'false' in two cases:
>
>#1. end = start + memsize; end <= rdm_start;
>
>This region [start, end] is below of rdm entry.
>
>#2. rdm_end = rdm_start + rdm_size; stat >= rdm_end;
>
>This region [start, end] is above of rdm entry.
>
>So others conditions should indicate that rdm entry is overlapping with 
>this region. Actually this has three cases:
>
>#1. This region just conflicts with the first half of rdm entry;
>#2. This region just conflicts with the second half of rdm entry;
>#3. This whole region falls inside of rdm entry;
>
>Then it should return 'true', right?
>
>But with this single line,
>
>return start + memsize > rdm_start && start < rdm_start + rdm_size;
>
>=>
>
>return end > rdm_start && start < rdm_end;
>
>This just guarantee it return 'true' *only* if #3 above occurs.

I don't even need to look at all the explanations you give. It is a simple matter
of expression re-writing to see that

   if (a <= b || c >= d)
       return 0;
   else
       return 1;

is equivalent to

    return !(a <= b || c >= d);

and a simple matter of formal logic to see that this is equivalent to

    return a > b && c < d;

Or what am I missing here?

Jan

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC][PATCH 04/13] tools/libxl: detect and avoid conflicts with RDM
  2015-05-07  6:04           ` Jan Beulich
@ 2015-05-08  1:14             ` Chen, Tiejun
  0 siblings, 0 replies; 125+ messages in thread
From: Chen, Tiejun @ 2015-05-08  1:14 UTC (permalink / raw)
  To: Jan Beulich
  Cc: tim, kevin.tian, wei.liu2, ian.campbell, andrew.cooper3,
	Ian.Jackson, xen-devel, stefano.stabellini, yang.z.zhang

On 2015/5/7 14:04, Jan Beulich wrote:
>>>> "Chen, Tiejun" <tiejun.chen@intel.com> 05/07/15 4:22 AM >>>
>> On 2015/5/6 23:34, Jan Beulich wrote:
>>>>>> On 06.05.15 at 17:00, <tiejun.chen@intel.com> wrote:
>>>> On 2015/4/20 19:13, Jan Beulich wrote:
>>>>>>>> On 10.04.15 at 11:21, <tiejun.chen@intel.com> wrote:
>>>>>> +                PERROR("Could not allocate memory.");
>>>>>
>>>>> Now that's exactly the kind of error message that makes no sense:
>>>>> As errno will already cause PERROR() to print something along the
>>>>> lines of the message you provide here, you're just creating
>>>>> redundancy. Indicating the purpose of the allocation, otoh, would
>>>>> add helpful context for the one inspecting the resulting log.
>>>>
>>>> What about this?
>>>>
>>>> PERROR("Could not allocate memory buffers to store reserved device
>>>> memory entries.");
>>>
>>> You kind of go from one extreme to the other - the message
>>> doesn't need to be overly long, but it should be distinct from
>>> all other messages (so that when seen one can identify what
>>> went wrong).
>>
>> I originally refer to some existing examples like this,
>>
>> int
>> xc_core_arch_memory_map_get(xc_interface *xch, struct
>> xc_core_arch_context *unused,
>> xc_dominfo_t *info, shared_info_any_t
>> *live_shinfo,
>> xc_core_memory_map_t **mapp,
>> unsigned int *nr_entries)
>> {
>> ...
>> map = malloc(sizeof(*map));
>> if ( map == NULL )
>> {
>> PERROR("Could not allocate memory");
>> return -1;
>> }
>>
>> Maybe this is wrong to my case. Could I change this?
>
> Yeah, I realize there are bad examples. But we try to avoid spreading those.

Sure.

>
>> PERROR("Could not allocate memory for XENMEM_reserved_device_memory_map
>> hypercall");
>>
>> Or just give me your line.
>
> How about
>
> PERROR("Could not allocate RDM buffer");
>
> It's brief but specific enough to uniquely identify it.

Looks good.

>
>>>>> and hence don't have the final say on stylistic issues, I don't see
>>>>> why the above couldn't be expressed with a single return statement.
>>>>
>>>> Are you saying something like this? Note this was showed by yourself
>>>> long time ago.
>>>
>>> I know, and hence I was puzzled to still see you use the more
>>> convoluted form.
>>>
>>>> static bool check_mmio_hole_conflict(uint64_t start, uint64_t memsize,
>>>>                                          uint64_t mmio_start, uint64_t mmio_size)
>>>> {
>>>>         return start + memsize > mmio_start && start < mmio_start + mmio_size;
>>>> }
>>>>
>>>> But I don't think this really can't work out our case.
>>>
>>> It's equivalent to the original you had, so I don't see what you
>>> mean with "this really can't work out our case".
>>
>> Let me make this point clear.
>>
>> The original implementation,
>>
>> +static int check_rdm_hole(uint64_t start, uint64_t memsize,
>> +                          uint64_t rdm_start, uint64_t rdm_size)
>> +{
>> +    if (start + memsize <= rdm_start || start >= rdm_start + rdm_size)
>> +        return 0;
>> +    else
>> +        return 1;
>> +}
>>
>> means it returns 'false' in two cases:
>>
>> #1. end = start + memsize; end <= rdm_start;
>>
>> This region [start, end] is below of rdm entry.
>>
>> #2. rdm_end = rdm_start + rdm_size; stat >= rdm_end;
>>
>> This region [start, end] is above of rdm entry.
>>
>> So others conditions should indicate that rdm entry is overlapping with
>> this region. Actually this has three cases:
>>
>> #1. This region just conflicts with the first half of rdm entry;
>> #2. This region just conflicts with the second half of rdm entry;
>> #3. This whole region falls inside of rdm entry;
>>
>> Then it should return 'true', right?
>>
>> But with this single line,
>>
>> return start + memsize > rdm_start && start < rdm_start + rdm_size;
>>
>> =>
>>
>> return end > rdm_start && start < rdm_end;
>>
>> This just guarantee it return 'true' *only* if #3 above occurs.
>
> I don't even need to look at all the explanations you give. It is a simple matter
> of expression re-writing to see that
>
>     if (a <= b || c >= d)
>         return 0;
>     else
>         return 1;
>
> is equivalent to
>
>      return !(a <= b || c >= d);
>
> and a simple matter of formal logic to see that this is equivalent to
>
>      return a > b && c < d;

Right now I think you're right.

And I can't recall exactly what's my problem while using this simple 
line, maybe others was misleading me to treat this change as a cause so 
sorry to this confusion.

Thanks
Tiejun

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC][PATCH 04/13] tools/libxl: detect and avoid conflicts with RDM
  2015-05-07  2:22         ` Chen, Tiejun
  2015-05-07  6:04           ` Jan Beulich
@ 2015-05-08  1:24           ` Chen, Tiejun
  2015-05-08 15:13             ` Wei Liu
  1 sibling, 1 reply; 125+ messages in thread
From: Chen, Tiejun @ 2015-05-08  1:24 UTC (permalink / raw)
  To: Jan Beulich, ian.campbell, stefano.stabellini, wei.liu2, Ian.Jackson
  Cc: yang.z.zhang, andrew.cooper3, kevin.tian, tim, xen-devel

Campbell, Jackson, Wei and Stefano,

Any consideration?

I can follow up Jan's idea but I need you guys make sure I'm going to do 
this properly.

Thanks
Tiejun

On 2015/5/7 10:22, Chen, Tiejun wrote:
> On 2015/5/6 23:34, Jan Beulich wrote:
>>>>> On 06.05.15 at 17:00, <tiejun.chen@intel.com> wrote:
>>> On 2015/4/20 19:13, Jan Beulich wrote:
>>>>>>> On 10.04.15 at 11:21, <tiejun.chen@intel.com> wrote:
>>>>> --- a/tools/libxc/xc_domain.c
>>>>> +++ b/tools/libxc/xc_domain.c
>>>>> @@ -1665,6 +1665,46 @@ int xc_assign_device(
>>>>>        return do_domctl(xch, &domctl);
>>>>>    }
>>>>>
>>>>> +struct xen_reserved_device_memory
>>>>> +*xc_device_get_rdm(xc_interface *xch,
>>>>> +                   uint32_t flag,
>>>>> +                   uint16_t seg,
>>>>> +                   uint8_t bus,
>>>>> +                   uint8_t devfn,
>>>>> +                   unsigned int *nr_entries)
>>>>> +{
>>>>
>>>> So what's the point of having both this new function and
>>>> xc_reserved_device_memory_map()? Is the latter useful for
>>>> anything besides the purpose here?
>>>
>>> I just hope xc_reserved_device_memory_map() is a standard interface to
>>> call that XENMEM_reserved_device_memory_map, but xc_device_get_rdm() can
>>> handle some errors in current case.
>>>
>>> I think you are hinting we just need one, right?
>>
>> Correct. But remember - I'm not a maintainer of this code, so
>
> But this may be a little complex with one...
>
>> maintainers may be of different opinion.
>
> Anyway, let me ask our tools maintainers.
>
> Campbell, Jackson, Wei and Stefano,
>
> What about your concern to this?
>
>>
>>>>> +    struct xen_reserved_device_memory *xrdm = NULL;
>>>>> +    int rc = xc_reserved_device_memory_map(xch, flag, seg, bus,
>>>>> devfn, xrdm,
>>>>> +                                           nr_entries);
>>>>> +
>>>>> +    if ( rc < 0 )
>>>>> +    {
>>>>> +        if ( errno == ENOBUFS )
>>>>> +        {
>>>>> +            if ( (xrdm = malloc(*nr_entries *
>>>>> +
>>>>> sizeof(xen_reserved_device_memory_t))) == NULL )
>>>>> +            {
>>>>> +                PERROR("Could not allocate memory.");
>>>>
>>>> Now that's exactly the kind of error message that makes no sense:
>>>> As errno will already cause PERROR() to print something along the
>>>> lines of the message you provide here, you're just creating
>>>> redundancy. Indicating the purpose of the allocation, otoh, would
>>>> add helpful context for the one inspecting the resulting log.
>>>
>>> What about this?
>>>
>>> PERROR("Could not allocate memory buffers to store reserved device
>>> memory entries.");
>>
>> You kind of go from one extreme to the other - the message
>> doesn't need to be overly long, but it should be distinct from
>> all other messages (so that when seen one can identify what
>> went wrong).
>
> I originally refer to some existing examples like this,
>
> int
> xc_core_arch_memory_map_get(xc_interface *xch, struct
> xc_core_arch_context *unused,
>                              xc_dominfo_t *info, shared_info_any_t
> *live_shinfo,
>                              xc_core_memory_map_t **mapp,
>                              unsigned int *nr_entries)
> {
>      ...
>      map = malloc(sizeof(*map));
>      if ( map == NULL )
>      {
>          PERROR("Could not allocate memory");
>          return -1;
>      }
>
> Maybe this is wrong to my case. Could I change this?
>
> PERROR("Could not allocate memory for XENMEM_reserved_device_memory_map
> hypercall");
>
> Or just give me your line.
>
>>
>>>>> @@ -302,8 +300,11 @@ static int setup_guest(xc_interface *xch,
>>>>>
>>>>>        for ( i = 0; i < nr_pages; i++ )
>>>>>            page_array[i] = i;
>>>>> -    for ( i = mmio_start >> PAGE_SHIFT; i < nr_pages; i++ )
>>>>> -        page_array[i] += mmio_size >> PAGE_SHIFT;
>>>>> +    /*
>>>>> +     * This condition 'lowmem_end <= mmio_start' is always true.
>>>>> +     */
>>>>
>>>> For one I think you mean "The", not "This", as there's no such
>>>> condition around here. And then - why? DYM "is supposed to
>>>> always be true"? In which case you may want to check...
>>>
>>> I always do this inside libxl__build_hvm() but before setup_guest(),
>>>
>>> +    if (args.lowmem_size > mmio_start)
>>> +        args.lowmem_size = mmio_start;
>>>
>>> And plus, we also another policy to rdm,
>>>
>>>       #1. Above a predefined boundary (default 2G)
>>>           - move lowmem_end below reserved region to solve conflict;
>>>
>>> This means there's such a likelihood of args.lowmem_size < mmio_start)
>>> as well.
>>>
>>> So here I'm saying the condition is always true.
>>
>> Okay, but again - if this is relevant to the following code, an
>> assertion or alike may still be warranted.
>
> Yes I should add 'assert()' here.
>
>>
>>>> and hence don't have the final say on stylistic issues, I don't see
>>>> why the above couldn't be expressed with a single return statement.
>>>
>>> Are you saying something like this? Note this was showed by yourself
>>> long time ago.
>>
>> I know, and hence I was puzzled to still see you use the more
>> convoluted form.
>>
>>> static bool check_mmio_hole_conflict(uint64_t start, uint64_t memsize,
>>>                                         uint64_t mmio_start, uint64_t
>>> mmio_size)
>>> {
>>>        return start + memsize > mmio_start && start < mmio_start +
>>> mmio_size;
>>> }
>>>
>>> But I don't think this really can't work out our case.
>>
>> It's equivalent to the original you had, so I don't see what you
>> mean with "this really can't work out our case".
>>
>
> Let me make this point clear.
>
> The original implementation,
>
> +static int check_rdm_hole(uint64_t start, uint64_t memsize,
> +                          uint64_t rdm_start, uint64_t rdm_size)
> +{
> +    if (start + memsize <= rdm_start || start >= rdm_start + rdm_size)
> +        return 0;
> +    else
> +        return 1;
> +}
>
> means it returns 'false' in two cases:
>
> #1. end = start + memsize; end <= rdm_start;
>
> This region [start, end] is below of rdm entry.
>
> #2. rdm_end = rdm_start + rdm_size; stat >= rdm_end;
>
> This region [start, end] is above of rdm entry.
>
> So others conditions should indicate that rdm entry is overlapping with
> this region. Actually this has three cases:
>
> #1. This region just conflicts with the first half of rdm entry;
> #2. This region just conflicts with the second half of rdm entry;
> #3. This whole region falls inside of rdm entry;
>
> Then it should return 'true', right?
>
> But with this single line,
>
> return start + memsize > rdm_start && start < rdm_start + rdm_size;
>
> =>
>
> return end > rdm_start && start < rdm_end;
>
> This just guarantee it return 'true' *only* if #3 above occurs.
>
>>>>> +int libxl__domain_device_check_rdm(libxl__gc *gc,
>>>>> +                                   libxl_domain_config *d_config,
>>>>> +                                   uint64_t rdm_mem_guard,
>>>>> +                                   struct xc_hvm_build_args *args)
>>>>> +{
>>>>> +    int i, j, conflict;
>>>>> +    libxl_ctx *ctx = libxl__gc_owner(gc);
>>>>> +    struct xen_reserved_device_memory *xrdm = NULL;
>>>>> +    unsigned int nr_all_rdms = 0;
>>>>> +    uint64_t rdm_start, rdm_size, highmem_end = (1ULL << 32);
>>>>> +    uint32_t type = d_config->b_info.rdm.type;
>>>>> +    uint16_t seg;
>>>>> +    uint8_t bus, devfn;
>>>>> +
>>>>> +    /* Might not to expose rdm. */
>>>>> +    if ((type == LIBXL_RDM_RESERVE_TYPE_NONE) &&
>>>>> !d_config->num_pcidevs)
>>>>> +        return 0;
>>>>> +
>>>>> +    /* Collect all rdm info if exist. */
>>>>> +    xrdm = xc_device_get_rdm(ctx->xch, LIBXL_RDM_RESERVE_TYPE_HOST,
>>>>> +                             0, 0, 0, &nr_all_rdms);
>>>>
>>>> What meaning has passing a libxl private value to a libxc function?
>>>
>>> We intend to collect all rdm entries info in advance and then we can
>>> construct d_config->rdms based on our policies as follows. Because we
>>> need to first allocate d_config->rdms properly to store rdms, but in
>>> some cases we don't know how many buffers are enough. For example, we
>>> don't have that global flag but with multiple pci devices. And even a
>>> shared entry worsen this situation.
>>>
>>> So here, we set that flag as LIBXL_RDM_RESERVE_TYPE_HOST but without any
>>> SBDF to grab all rdms.
>>
>> I'm afraid you didn't get my point: Values passed to libxc should be
>
> Sorry for this misunderstanding.
>
>> known to libxc. Values privately defined by libxl for its own purposes
>> aren't known to libxc, and hence shouldn't be passed to libxc
>> functions.
>
> I think we should set this with 'PCI_DEV_RDM_ALL' since,
>
> struct xen_reserved_device_memory_map {
>      /* IN */
>      /* Currently just one bit to indicate checkng all Reserved Device
> Memory. */
> #define PCI_DEV_RDM_ALL   0x1
>
>>
>>>>> +     * 'try' policy is specified, and we also mark this as INVALID
>>>>> not to expose
>>>>> +     * this entry to hvmloader.
>>>>
>>>> What is "this" in "... also mark this as ..."? Certainly neither the
>>>> conflict
>>>> nor the warning.
>>>
>>> Sorry, this is my fault.
>>>
>>>        * If a conflict is detected on a given RMRR entry, an error
>>> will be
>>>        * returned if 'strict' policy is specified. Or conflict is
>>> treated as a
>>>        * warning if 'relaxed' policy is specified, and we also mark
>>> this as
>>>        * INVALID not to expose this entry to hvmloader.
>>
>> The same "this" still doesn't have anything reasonable it references. I
>> think you mean "the entry" (in which case the subsequent "this entry"
>> could become just "it" afaict). But (not being a native speaker) the
>> grammar of the second half of the sentence looks odd (and hence
>> potentially confusing) to me anyway (i.e. even with the previous
>
> Sure, we need to make this better and clear.
>
>> issue fixed).
>
>       * If a conflict is detected on a given RMRR entry, an error will be
>       * returned if 'strict' policy is specified. Instead, if 'relaxed'
> policy
>       * specified, this conflict is treated just as a warning, but we
> mark this
>       * RMRR entry as INVALID to indicate that this entry shouldn't be
> exposed
>       * to hvmloader.
>
> I hope this can help us understand what we do.
>
>>
>>>>> +     *
>>>>> +     * Firstly we should check the case of rdm < 4G because we may
>>>>> need to
>>>>> +     * expand highmem_end.
>>>>> +     */
>>>>> +    for (i = 0; i < d_config->num_rdms; i++) {
>>>>> +        rdm_start = d_config->rdms[i].start;
>>>>> +        rdm_size = d_config->rdms[i].size;
>>>>> +        conflict = check_rdm_hole(0, args->lowmem_size, rdm_start,
>>>>> rdm_size);
>>>>> +
>>>>> +        if (!conflict)
>>>>> +            continue;
>>>>> +
>>>>> +        /*
>>>>> +         * Just check if RDM > our memory boundary
>>>>> +         */
>>>>> +        if (d_config->rdms[i].start > rdm_mem_guard) {
>>>>> +            /*
>>>>> +             * We will move downwards lowmem_end so we have to expand
>>>>> +             * highmem_end.
>>>>> +             */
>>>>> +            highmem_end += (args->lowmem_size - rdm_start);
>>>>> +            /* Now move downwards lowmem_end. */
>>>>> +            args->lowmem_size = rdm_start;
>>>>
>>>> Considering that the action here doesn't depend on the specific
>>>> ->rdms[] slot being looked at, I don't see why the loop needs to
>>>
>>> I'm not sure if I understand what you mean.
>>>
>>> All rdm entries are organized disorderly in d_config->rdms, so we should
>>> traverse all entries to make sure args->lowmem_size is below all rdms'
>>> start address.
>>
>> I think I see what confused me: in the if() condition you reference
>> d_config->rdms[i].start, yet the body of the if() has no reference
>> to d_config->rdms[i] at all. If the if() used rdm_start it would
>> become obvious that this is being latched at the beginning of the
>
> Indeed, I really should use rdm_start here.
>
>> body (which is what I overlooked, assuming the variable's value
>> to have got set prior to the loop), and hence the body is not loop
>> invariant.
>>
>
> So just replace d_config->rdms[i].start with rdm_start like this,
>
>          /*
>           * Just check if RDM > our memory boundary
>           */
>          if (rdm_start > rdm_mem_guard) {
>              /*
>               * We will move downwards lowmem_end so we have to expand
>               * highmem_end.
>               */
>              highmem_end += (args->lowmem_size - rdm_start);
>              /* Now move downwards lowmem_end. */
>              args->lowmem_size = rdm_start;
>          }
>      }
>
> Thanks
> Tiejun
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel
>
>

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC][PATCH 01/13] tools: introduce some new parameters to set rdm policy
  2015-04-10  9:21 ` [RFC][PATCH 01/13] tools: introduce some new parameters to set rdm policy Tiejun Chen
@ 2015-05-08 13:04   ` Wei Liu
  2015-05-11  5:35     ` Chen, Tiejun
  0 siblings, 1 reply; 125+ messages in thread
From: Wei Liu @ 2015-05-08 13:04 UTC (permalink / raw)
  To: Tiejun Chen
  Cc: kevin.tian, wei.liu2, ian.campbell, andrew.cooper3, tim,
	xen-devel, stefano.stabellini, JBeulich, yang.z.zhang,
	Ian.Jackson

Sorry for the late review.

On Fri, Apr 10, 2015 at 05:21:52PM +0800, Tiejun Chen wrote:
> This patch introduces user configurable parameters to specify RDM
> resource and according policies,
> 
> Global RDM parameter:
>     rdm = [ 'host, reserve=force/try' ]
> Per-device RDM parameter:
>     pci = [ 'sbdf, rdm_reserve=force/try' ]
> 
> Global RDM parameter allows user to specify reserved regions explicitly,
> e.g. using 'host' to include all reserved regions reported on this platform
> which is good to handle hotplug scenario. In the future this parameter
> may be further extended to allow specifying random regions, e.g. even
> those belonging to another platform as a preparation for live migration
> with passthrough devices.
> 
> 'force/try' policy decides how to handle conflict when reserving RDM
> regions in pfn space. If conflict exists, 'force' means an immediate error
> so VM will be killed, while 'try' allows moving forward with a warning
> message thrown out.
> 
> Default per-device RDM policy is 'force', while default global RDM policy
> is 'try'. When both policies are specified on a given region, 'force' is
> always preferred.
> 
> Signed-off-by: Tiejun Chen <tiejun.chen@intel.com>
> ---
>  docs/man/xl.cfg.pod.5       | 44 +++++++++++++++++++++++++
>  docs/misc/vtd.txt           | 34 ++++++++++++++++++++
>  tools/libxl/libxl_create.c  |  5 +++
>  tools/libxl/libxl_types.idl | 18 +++++++++++
>  tools/libxl/libxlu_pci.c    | 78 +++++++++++++++++++++++++++++++++++++++++++++
>  tools/libxl/libxlutil.h     |  4 +++
>  tools/libxl/xl_cmdimpl.c    | 21 +++++++++++-
>  7 files changed, 203 insertions(+), 1 deletion(-)
> 
> diff --git a/docs/man/xl.cfg.pod.5 b/docs/man/xl.cfg.pod.5
> index 408653f..9ed3055 100644
> --- a/docs/man/xl.cfg.pod.5
> +++ b/docs/man/xl.cfg.pod.5
> @@ -583,6 +583,36 @@ assigned slave device.
>  
>  =back
>  
> +=item B<rdm=[ "TYPE", "RDM_RESERVE_STRING", ... ]>
> +

Shouldn't this be "TYPE,RDM_RESERVE_STRIGN" according to your commit
message? If the only available config is just one string, you probably
don't need a list for this?

> +(HVM/x86 only) Specifies the information about Reserved Device Memory (RDM),
> +which is necessary to enable robust device passthrough usage. One example of
> +RDM is reported through ACPI Reserved Memory Region Reporting (RMRR)
> +structure on x86 platform.
> +Each B<RDM_CHECK_STRING> has the form C<["TYPE",KEY=VALUE,KEY=VALUE,...> where:
> +

RDM_CHECK_STRING?

> +=over 4
> +
> +=item B<"TYPE">
> +
> +Currently we just have one type. 'host' means all reserved device memory on
> +this platform should be reserved in this VM's pfn space.
> +

What are other possible types? If there is only one type then we can
simply ignore the type?

> +=item B<KEY=VALUE>
> +
> +Possible B<KEY>s are:
> +
> +=over 4
> +
> +=item B<reserve="STRING">
> +
> +Conflict may be detected when reserving reserved device memory in gfn space.
> +'force' means an unsolved conflict leads to immediate VM destroy, while

Do you mean "immediate VM crash"?

> +'try' allows VM moving forward with a warning message thrown out. 'try'
> +is default.

Can you please your double quotes for "force", "try" etc.

> +
> +Note this may be overrided by another sub item, rdm_reserve, in pci device.
> +
>  =item B<pci=[ "PCI_SPEC_STRING", "PCI_SPEC_STRING", ... ]>
>  
>  Specifies the host PCI devices to passthrough to this guest. Each B<PCI_SPEC_STRING>
> @@ -645,6 +675,20 @@ dom0 without confirmation.  Please use with care.
>  D0-D3hot power management states for the PCI device. False (0) by
>  default.
>  
> +=item B<rdm_check="STRING">
> +
> +(HVM/x86 only) Specifies the information about Reserved Device Memory (RDM),
> +which is necessary to enable robust device passthrough usage. One example of
> +RDM is reported through ACPI Reserved Memory Region Reporting (RMRR)
> +structure on x86 platform.
> +
> +Conflict may be detected when reserving reserved device memory in gfn space.
> +'force' means an unsolved conflict leads to immediate VM destroy, while
> +'try' allows VM moving forward with a warning message thrown out. 'force'
> +is default.
> +
> +Note this would override another global item, rdm = [''].
> +

Note this would override global B<rdm> option.

>  =back
>  
>  =back
> diff --git a/docs/misc/vtd.txt b/docs/misc/vtd.txt
> index 9af0e99..d7434d6 100644
> --- a/docs/misc/vtd.txt
> +++ b/docs/misc/vtd.txt
> @@ -111,6 +111,40 @@ in the config file:
>  To override for a specific device:
>  	pci = [ '01:00.0,msitranslate=0', '03:00.0' ]
>  
> +RDM, 'reserved device memory', for PCI Device Passthrough
> +---------------------------------------------------------
> +
> +There are some devices the BIOS controls, for e.g. USB devices to perform
> +PS2 emulation. The regions of memory used for these devices are marked
> +reserved in the e820 map. When we turn on DMA translation, DMA to those
> +regions will fail. Hence BIOS uses RMRR to specify these regions along with
> +devices that need to access these regions. OS is expected to setup
> +identity mappings for these regions for these devices to access these regions.
> +
> +While creating a VM we should reserve them in advance, and avoid any conflicts.
> +So we introduce user configurable parameters to specify RDM resource and
> +according policies,
> +
> +To enable this globally, add "rdm" in the config file:
> +
> +    rdm = [ 'host, reserve=force/try' ]
> +

The "force/try" should be called "policy". And then you explain what
policies we have.

> +Or just for a specific device:
> +
> +	pci = [ '01:00.0,rdm_reserve=force/try', '03:00.0' ]
> +
> +Global RDM parameter allows user to specify reserved regions explicitly.
> +Using 'host' to include all reserved regions reported on this platform
> +which is good to handle hotplug scenario. In the future this parameter
> +may be further extended to allow specifying random regions, e.g. even
> +those belonging to another platform as a preparation for live migration
> +with passthrough devices.
> +
> +'force/try' policy decides how to handle conflict when reserving RDM
> +regions in pfn space. If conflict exists, 'force' means an immediate error
> +so VM will be killed, while 'try' allows moving forward with a warning

Be killed by whom? I think it's hvmloader crashes voluntarily, right? 

> +message thrown out.
> +
>  
>  Caveat on Conventional PCI Device Passthrough
>  ---------------------------------------------
> diff --git a/tools/libxl/libxl_create.c b/tools/libxl/libxl_create.c
> index 98687bd..9ed40d4 100644
> --- a/tools/libxl/libxl_create.c
> +++ b/tools/libxl/libxl_create.c
> @@ -1407,6 +1407,11 @@ static void domcreate_attach_pci(libxl__egc *egc, libxl__multidev *multidev,
>      }
>  
>      for (i = 0; i < d_config->num_pcidevs; i++) {
> +        /*
> +         * If the rdm global policy is 'force' we should override each device.
> +         */
> +        if (d_config->b_info.rdm.reserve == LIBXL_RDM_RESERVE_FLAG_FORCE)
> +            d_config->pcidevs[i].rdm_reserve = LIBXL_RDM_RESERVE_FLAG_FORCE;
>          ret = libxl__device_pci_add(gc, domid, &d_config->pcidevs[i], 1);
>          if (ret < 0) {
>              LIBXL__LOG(ctx, LIBXL__LOG_ERROR,
> diff --git a/tools/libxl/libxl_types.idl b/tools/libxl/libxl_types.idl
> index 47af340..5786455 100644
> --- a/tools/libxl/libxl_types.idl
> +++ b/tools/libxl/libxl_types.idl
> @@ -71,6 +71,17 @@ libxl_domain_type = Enumeration("domain_type", [
>      (2, "PV"),
>      ], init_val = "LIBXL_DOMAIN_TYPE_INVALID")
>  
> +libxl_rdm_reserve_type = Enumeration("rdm_reserve_type", [
> +    (0, "none"),
> +    (1, "host"),
> +    ])
> +
> +libxl_rdm_reserve_flag = Enumeration("rdm_reserve_flag", [
> +    (-1, "invalid"),
> +    (0, "force"),
> +    (1, "try"),
> +    ])

If you don't set init_val, the default value would be "force" (0), is this
want you want?

> +
>  libxl_channel_connection = Enumeration("channel_connection", [
>      (0, "UNKNOWN"),
>      (1, "PTY"),
> @@ -356,6 +367,11 @@ libxl_domain_sched_params = Struct("domain_sched_params",[
>      ("budget",       integer, {'init_val': 'LIBXL_DOMAIN_SCHED_PARAM_BUDGET_DEFAULT'}),
>      ])
>  
> +libxl_rdm_reserve = Struct("rdm_reserve", [
> +    ("type",    libxl_rdm_reserve_type),
> +    ("reserve",   libxl_rdm_reserve_flag),
> +    ])
> +
>  libxl_domain_build_info = Struct("domain_build_info",[
>      ("max_vcpus",       integer),
>      ("avail_vcpus",     libxl_bitmap),
> @@ -401,6 +417,7 @@ libxl_domain_build_info = Struct("domain_build_info",[
>      ("kernel",           string),
>      ("cmdline",          string),
>      ("ramdisk",          string),
> +    ("rdm",     libxl_rdm_reserve),
>      ("u", KeyedUnion(None, libxl_domain_type, "type",
>                  [("hvm", Struct(None, [("firmware",         string),
>                                         ("bios",             libxl_bios_type),
> @@ -521,6 +538,7 @@ libxl_device_pci = Struct("device_pci", [
>      ("power_mgmt", bool),
>      ("permissive", bool),
>      ("seize", bool),
> +    ("rdm_reserve",   libxl_rdm_reserve_flag),
>      ])
>  
>  libxl_device_vtpm = Struct("device_vtpm", [
> diff --git a/tools/libxl/libxlu_pci.c b/tools/libxl/libxlu_pci.c
> index 26fb143..45be0d9 100644
> --- a/tools/libxl/libxlu_pci.c
> +++ b/tools/libxl/libxlu_pci.c
> @@ -42,6 +42,8 @@ static int pcidev_struct_fill(libxl_device_pci *pcidev, unsigned int domain,
>  #define STATE_OPTIONS_K 6
>  #define STATE_OPTIONS_V 7
>  #define STATE_TERMINAL  8
> +#define STATE_TYPE      9
> +#define STATE_CHECK_FLAG      10
>  int xlu_pci_parse_bdf(XLU_Config *cfg, libxl_device_pci *pcidev, const char *str)
>  {
>      unsigned state = STATE_DOMAIN;
> @@ -143,6 +145,17 @@ int xlu_pci_parse_bdf(XLU_Config *cfg, libxl_device_pci *pcidev, const char *str
>                      pcidev->permissive = atoi(tok);
>                  }else if ( !strcmp(optkey, "seize") ) {
>                      pcidev->seize = atoi(tok);
> +                }else if ( !strcmp(optkey, "rdm_reserve") ) {
> +                    if ( !strcmp(tok, "force") ) {
> +                        pcidev->rdm_reserve = LIBXL_RDM_RESERVE_FLAG_FORCE;
> +                    } else if ( !strcmp(tok, "try") ) {
> +                        pcidev->rdm_reserve = LIBXL_RDM_RESERVE_FLAG_TRY;
> +                    } else {
> +                        pcidev->rdm_reserve = LIBXL_RDM_RESERVE_FLAG_FORCE;
> +                        XLU__PCI_ERR(cfg, "Unknown PCI RDM property flag value:"
> +                                          " %s, so goes 'force' by default.",

If this is not an error, you don't need XLU__PCI_ERR.

But I would say we should  just treat this as an error and
abort/exit/report (whatever the parser should do in this case). 

> +                                     tok);
> +                    }
>                  }else{
>                      XLU__PCI_ERR(cfg, "Unknown PCI BDF option: %s", optkey);
>                  }
> @@ -167,6 +180,71 @@ parse_error:
>      return ERROR_INVAL;
>  }
>  
> +int xlu_rdm_parse(XLU_Config *cfg, libxl_rdm_reserve *rdm, const char *str)
> +{
> +    unsigned state = STATE_TYPE;
> +    char *buf2, *tok, *ptr, *end;
> +
> +    if ( NULL == (buf2 = ptr = strdup(str)) )
> +        return ERROR_NOMEM;
> +
> +    for(tok = ptr, end = ptr + strlen(ptr) + 1; ptr < end; ptr++) {
> +        switch(state) {
> +        case STATE_TYPE:
> +            if ( *ptr == '\0' || *ptr == ',' ) {
> +                state = STATE_CHECK_FLAG;
> +                *ptr = '\0';
> +                if ( !strcmp(tok, "host") ) {
> +                    rdm->type = LIBXL_RDM_RESERVE_TYPE_HOST;
> +                } else {
> +                    XLU__PCI_ERR(cfg, "Unknown RDM state option: %s", tok);
> +                    goto parse_error;
> +                }
> +                tok = ptr + 1;
> +            }
> +            break;
> +        case STATE_CHECK_FLAG:
> +            if ( *ptr == '=' ) {
> +                state = STATE_OPTIONS_V;
> +                *ptr = '\0';
> +                if ( strcmp(tok, "reserve") ) {
> +                    XLU__PCI_ERR(cfg, "Unknown RDM property value: %s", tok);
> +                    goto parse_error;
> +                }
> +                tok = ptr + 1;
> +            }
> +            break;
> +        case STATE_OPTIONS_V:
> +            if ( *ptr == ',' || *ptr == '\0' ) {
> +                state = STATE_TERMINAL;
> +                *ptr = '\0';
> +                if ( !strcmp(tok, "force") ) {
> +                    rdm->reserve = LIBXL_RDM_RESERVE_FLAG_FORCE;
> +                } else if ( !strcmp(tok, "try") ) {
> +                    rdm->reserve = LIBXL_RDM_RESERVE_FLAG_TRY;
> +                } else {
> +                    XLU__PCI_ERR(cfg, "Unknown RDM property flag value: %s",
> +                                 tok);
> +                    goto parse_error;
> +                }
> +                tok = ptr + 1;
> +            }
> +        default:
> +            break;
> +        }
> +    }
> +
> +    free(buf2);
> +
> +    if ( tok != ptr || state != STATE_TERMINAL )
> +        goto parse_error;
> +
> +    return 0;
> +
> +parse_error:
> +    return ERROR_INVAL;
> +}
> +
>  /*
>   * Local variables:
>   * mode: C
> diff --git a/tools/libxl/libxlutil.h b/tools/libxl/libxlutil.h
> index 0333e55..80497f8 100644
> --- a/tools/libxl/libxlutil.h
> +++ b/tools/libxl/libxlutil.h
> @@ -93,6 +93,10 @@ int xlu_disk_parse(XLU_Config *cfg, int nspecs, const char *const *specs,
>   */
>  int xlu_pci_parse_bdf(XLU_Config *cfg, libxl_device_pci *pcidev, const char *str);
>  
> +/*
> + * RDM parsing
> + */
> +int xlu_rdm_parse(XLU_Config *cfg, libxl_rdm_reserve *rdm, const char *str);
>  
>  /*
>   * Vif rate parsing.
> diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c
> index 04faf98..9a58464 100644
> --- a/tools/libxl/xl_cmdimpl.c
> +++ b/tools/libxl/xl_cmdimpl.c
> @@ -988,7 +988,7 @@ static void parse_config_data(const char *config_source,
>      const char *buf;
>      long l;
>      XLU_Config *config;
> -    XLU_ConfigList *cpus, *vbds, *nics, *pcis, *cvfbs, *cpuids, *vtpms;
> +    XLU_ConfigList *cpus, *vbds, *nics, *pcis, *cvfbs, *cpuids, *vtpms, *rdms;
>      XLU_ConfigList *channels, *ioports, *irqs, *iomem, *viridian;
>      int num_ioports, num_irqs, num_iomem, num_cpus, num_viridian;
>      int pci_power_mgmt = 0;
> @@ -1727,6 +1727,23 @@ skip_vfb:
>          xlu_cfg_get_defbool(config, "e820_host", &b_info->u.pv.e820_host, 0);
>      }
>  
> +    /*
> +     * By default our global policy is to query all rdm entries, and
> +     * force reserve them.
> +     */
> +    b_info->rdm.type = LIBXL_RDM_RESERVE_TYPE_HOST;
> +    b_info->rdm.reserve = LIBXL_RDM_RESERVE_FLAG_TRY;

This should probably to into the _setdefault function of
libxl_domain_build_info.

> +    if (!xlu_cfg_get_list (config, "rdm", &rdms, 0, 0)) {
> +        if ((buf = xlu_cfg_get_listitem (rdms, 0)) != NULL ) {
> +            libxl_rdm_reserve rdm;
> +            if (!xlu_rdm_parse(config, &rdm, buf))
> +            {
> +                b_info->rdm.type = rdm.type;
> +                b_info->rdm.reserve = rdm.reserve;
> +            }

You only have one rdm in b_info, so there is no need to use a list for
it in config file.

One side note is that this patch should probably be placed later in this
series, after other patches to add the necessary underlying
functionalities.

Wei.

> +        }
> +    }
> +
>      if (!xlu_cfg_get_list (config, "pci", &pcis, 0, 0)) {
>          d_config->num_pcidevs = 0;
>          d_config->pcidevs = NULL;
> @@ -1741,6 +1758,8 @@ skip_vfb:
>              pcidev->power_mgmt = pci_power_mgmt;
>              pcidev->permissive = pci_permissive;
>              pcidev->seize = pci_seize;
> +            /* We'd like to force reserve rdm specific to a device by default.*/
> +            pcidev->rdm_reserve = LIBXL_RDM_RESERVE_FLAG_FORCE;
>              if (!xlu_pci_parse_bdf(config, pcidev, buf))
>                  d_config->num_pcidevs++;
>          }
> -- 
> 1.9.1

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC][PATCH 03/13] tools/libxc: Expose new hypercall xc_reserved_device_memory_map
  2015-04-10  9:21 ` [RFC][PATCH 03/13] tools/libxc: Expose new hypercall xc_reserved_device_memory_map Tiejun Chen
@ 2015-05-08 13:07   ` Wei Liu
  2015-05-11  5:36     ` Chen, Tiejun
  0 siblings, 1 reply; 125+ messages in thread
From: Wei Liu @ 2015-05-08 13:07 UTC (permalink / raw)
  To: Tiejun Chen
  Cc: kevin.tian, wei.liu2, ian.campbell, andrew.cooper3, tim,
	xen-devel, stefano.stabellini, JBeulich, yang.z.zhang,
	Ian.Jackson

On Fri, Apr 10, 2015 at 05:21:54PM +0800, Tiejun Chen wrote:
> We will introduce the hypercall xc_reserved_device_memory_map
> approach to libxc. This helps us get rdm entry info according to
> different parameters. If flag == PCI_DEV_RDM_ALL, all entries
> should be exposed. Or we just expose that rdm entry specific to
> a SBDF.
> 
> Signed-off-by: Tiejun Chen <tiejun.chen@intel.com>

This patch contains a wrapper to the new hypercall. If the HV guys are
happy with this I'm fine with it too.

Wei.

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC][PATCH 04/13] tools/libxl: detect and avoid conflicts with RDM
  2015-04-10  9:21 ` [RFC][PATCH 04/13] tools/libxl: detect and avoid conflicts with RDM Tiejun Chen
  2015-04-15 13:10   ` Ian Jackson
  2015-04-20 11:13   ` Jan Beulich
@ 2015-05-08 14:43   ` Wei Liu
  2015-05-11  8:09     ` Chen, Tiejun
  2 siblings, 1 reply; 125+ messages in thread
From: Wei Liu @ 2015-05-08 14:43 UTC (permalink / raw)
  To: Tiejun Chen
  Cc: kevin.tian, wei.liu2, ian.campbell, andrew.cooper3, tim,
	xen-devel, stefano.stabellini, JBeulich, yang.z.zhang,
	Ian.Jackson

Sorry for the late review. This series fell through the crack.

On Fri, Apr 10, 2015 at 05:21:55PM +0800, Tiejun Chen wrote:
> While building a VM, HVM domain builder provides struct hvm_info_table{}
> to help hvmloader. Currently it includes two fields to construct guest
> e820 table by hvmloader, low_mem_pgend and high_mem_pgend. So we should
> check them to fix any conflict with RAM.
> 
> RMRR can reside in address space beyond 4G theoretically, but we never
> see this in real world. So in order to avoid breaking highmem layout

How does this break highmem layout?

> we don't solve highmem conflict. Note this means highmem rmrr could still
> be supported if no conflict.
> 

Aren't you actively trying to avoid conflict in libxl?

> But in the case of lowmem, RMRR probably scatter the whole RAM space.
> Especially multiple RMRR entries would worsen this to lead a complicated
> memory layout. And then its hard to extend hvm_info_table{} to work
> hvmloader out. So here we're trying to figure out a simple solution to
> avoid breaking existing layout. So when a conflict occurs,
> 
>     #1. Above a predefined boundary (default 2G)
>         - move lowmem_end below reserved region to solve conflict;
> 

I hope this "predefined boundary" is user tunable. I will check later in
this patch if it is the case.

>     #2 Below a predefined boundary (default 2G)
>         - Check force/try policy.
>         "force" policy leads to fail libxl. Note when both policies
>         are specified on a given region, 'force' is always preferred.
>         "try" policy issue a warning message and also mask this entry INVALID
>         to indicate we shouldn't expose this entry to hvmloader.
> 
> Signed-off-by: Tiejun Chen <tiejun.chen@intel.com>
> ---
>  tools/libxc/include/xenctrl.h  |   8 ++
>  tools/libxc/include/xenguest.h |   3 +-
>  tools/libxc/xc_domain.c        |  40 +++++++++
>  tools/libxc/xc_hvm_build_x86.c |  28 +++---
>  tools/libxl/libxl_create.c     |   2 +-
>  tools/libxl/libxl_dm.c         | 195 +++++++++++++++++++++++++++++++++++++++++
>  tools/libxl/libxl_dom.c        |  27 +++++-
>  tools/libxl/libxl_internal.h   |  11 ++-
>  tools/libxl/libxl_types.idl    |   7 ++
>  9 files changed, 303 insertions(+), 18 deletions(-)
> 
> diff --git a/tools/libxc/include/xenctrl.h b/tools/libxc/include/xenctrl.h
> index 59bbe06..299b95f 100644
> --- a/tools/libxc/include/xenctrl.h
> +++ b/tools/libxc/include/xenctrl.h
> @@ -2053,6 +2053,14 @@ int xc_get_device_group(xc_interface *xch,
>                       uint32_t *num_sdevs,
>                       uint32_t *sdev_array);
>  
> +struct xen_reserved_device_memory
> +*xc_device_get_rdm(xc_interface *xch,
> +                   uint32_t flag,
> +                   uint16_t seg,
> +                   uint8_t bus,
> +                   uint8_t devfn,
> +                   unsigned int *nr_entries);
> +
>  int xc_test_assign_device(xc_interface *xch,
>                            uint32_t domid,
>                            uint32_t machine_sbdf);
> diff --git a/tools/libxc/include/xenguest.h b/tools/libxc/include/xenguest.h
> index 40bbac8..0f1c23b 100644
> --- a/tools/libxc/include/xenguest.h
> +++ b/tools/libxc/include/xenguest.h
> @@ -218,7 +218,8 @@ struct xc_hvm_firmware_module {
>  };
>  
>  struct xc_hvm_build_args {
> -    uint64_t mem_size;           /* Memory size in bytes. */
> +    uint64_t lowmem_size;        /* All low memory size in bytes. */
> +    uint64_t mem_size;           /* All memory size in bytes. */

Do you find the comment confusing? I think there is no need to change
the comment of mem_size. 

>      uint64_t mem_target;         /* Memory target in bytes. */
>      uint64_t mmio_size;          /* Size of the MMIO hole in bytes. */
>      const char *image_file_name; /* File name of the image to load. */
> diff --git a/tools/libxc/xc_domain.c b/tools/libxc/xc_domain.c
> index 4f8383e..85b18ea 100644
> --- a/tools/libxc/xc_domain.c
> +++ b/tools/libxc/xc_domain.c
> @@ -1665,6 +1665,46 @@ int xc_assign_device(
>      return do_domctl(xch, &domctl);
>  }
>  
> +struct xen_reserved_device_memory
> +*xc_device_get_rdm(xc_interface *xch,
> +                   uint32_t flag,
> +                   uint16_t seg,
> +                   uint8_t bus,
> +                   uint8_t devfn,
> +                   unsigned int *nr_entries)
> +{
> +    struct xen_reserved_device_memory *xrdm = NULL;
> +    int rc = xc_reserved_device_memory_map(xch, flag, seg, bus, devfn, xrdm,
> +                                           nr_entries);
> +
> +    if ( rc < 0 )
> +    {
> +        if ( errno == ENOBUFS )
> +        {
> +            if ( (xrdm = malloc(*nr_entries *
> +                                sizeof(xen_reserved_device_memory_t))) == NULL )
> +            {
> +                PERROR("Could not allocate memory.");
> +                goto out;
> +            }

Don't you leak origin xrdm in this case?

And, this style is not very good. Shouldn't the caller allocate enough
memory before hand?

> +            rc = xc_reserved_device_memory_map(xch, flag, seg, bus, devfn, xrdm,
> +                                               nr_entries);
> +            if ( rc )
> +            {
> +                PERROR("Could not get reserved device memory maps.");
> +                free(xrdm);
> +                xrdm = NULL;
> +            }
> +        }
> +        else {
> +            PERROR("Could not get reserved device memory maps.");
> +        }
> +    }
> +
> + out:
> +    return xrdm;
> +}
> +
>  int xc_get_device_group(
>      xc_interface *xch,
>      uint32_t domid,
> diff --git a/tools/libxc/xc_hvm_build_x86.c b/tools/libxc/xc_hvm_build_x86.c
> index c81a25b..3f87bb3 100644
> --- a/tools/libxc/xc_hvm_build_x86.c
> +++ b/tools/libxc/xc_hvm_build_x86.c
> @@ -89,19 +89,16 @@ static int modules_init(struct xc_hvm_build_args *args,
>  }
>  
>  static void build_hvm_info(void *hvm_info_page, uint64_t mem_size,
> -                           uint64_t mmio_start, uint64_t mmio_size)
> +                           uint64_t lowmem_end)
>  {
>      struct hvm_info_table *hvm_info = (struct hvm_info_table *)
>          (((unsigned char *)hvm_info_page) + HVM_INFO_OFFSET);
> -    uint64_t lowmem_end = mem_size, highmem_end = 0;
> +    uint64_t highmem_end = 0;
>      uint8_t sum;
>      int i;
>  
> -    if ( lowmem_end > mmio_start )
> -    {
> -        highmem_end = (1ull<<32) + (lowmem_end - mmio_start);
> -        lowmem_end = mmio_start;
> -    }
> +    if ( mem_size > lowmem_end )
> +        highmem_end = (1ull<<32) + (mem_size - lowmem_end);
>  
>      memset(hvm_info_page, 0, PAGE_SIZE);
>  
> @@ -252,7 +249,7 @@ static int setup_guest(xc_interface *xch,
>      void *hvm_info_page;
>      uint32_t *ident_pt;
>      struct elf_binary elf;
> -    uint64_t v_start, v_end;
> +    uint64_t v_start, v_end, v_lowend, lowmem_end;
>      uint64_t m_start = 0, m_end = 0;
>      int rc;
>      xen_capabilities_info_t caps;
> @@ -275,6 +272,7 @@ static int setup_guest(xc_interface *xch,
>      elf_parse_binary(&elf);
>      v_start = 0;
>      v_end = args->mem_size;
> +    v_lowend = lowmem_end = args->lowmem_size;
>  
>      if ( xc_version(xch, XENVER_capabilities, &caps) != 0 )
>      {
> @@ -302,8 +300,11 @@ static int setup_guest(xc_interface *xch,
>  
>      for ( i = 0; i < nr_pages; i++ )
>          page_array[i] = i;
> -    for ( i = mmio_start >> PAGE_SHIFT; i < nr_pages; i++ )
> -        page_array[i] += mmio_size >> PAGE_SHIFT;
> +    /*
> +     * This condition 'lowmem_end <= mmio_start' is always true.
> +     */
> +    for ( i = lowmem_end >> PAGE_SHIFT; i < nr_pages; i++ )
> +        page_array[i] += ((1ull << 32) - lowmem_end) >> PAGE_SHIFT;
>  
>      /*
>       * Try to claim pages for early warning of insufficient memory available.
> @@ -469,7 +470,7 @@ static int setup_guest(xc_interface *xch,
>                xch, dom, PAGE_SIZE, PROT_READ | PROT_WRITE,
>                HVM_INFO_PFN)) == NULL )
>          goto error_out;
> -    build_hvm_info(hvm_info_page, v_end, mmio_start, mmio_size);
> +    build_hvm_info(hvm_info_page, v_end, v_lowend);
>      munmap(hvm_info_page, PAGE_SIZE);
>  
>      /* Allocate and clear special pages. */
> @@ -588,9 +589,6 @@ int xc_hvm_build(xc_interface *xch, uint32_t domid,
>      if ( args.mem_target == 0 )
>          args.mem_target = args.mem_size;
>  
> -    if ( args.mmio_size == 0 )
> -        args.mmio_size = HVM_BELOW_4G_MMIO_LENGTH;
> -
>      /* An HVM guest must be initialised with at least 2MB memory. */
>      if ( args.mem_size < (2ull << 20) || args.mem_target < (2ull << 20) )
>          return -1;
> @@ -634,6 +632,8 @@ int xc_hvm_build_target_mem(xc_interface *xch,
>      args.mem_size = (uint64_t)memsize << 20;
>      args.mem_target = (uint64_t)target << 20;
>      args.image_file_name = image_name;
> +    if ( args.mmio_size == 0 )
> +        args.mmio_size = HVM_BELOW_4G_MMIO_LENGTH;
>  
>      return xc_hvm_build(xch, domid, &args);
>  }
> diff --git a/tools/libxl/libxl_create.c b/tools/libxl/libxl_create.c
> index 9ed40d4..eab86fd 100644
> --- a/tools/libxl/libxl_create.c
> +++ b/tools/libxl/libxl_create.c
> @@ -430,7 +430,7 @@ int libxl__domain_build(libxl__gc *gc,
>  
>      switch (info->type) {
>      case LIBXL_DOMAIN_TYPE_HVM:
> -        ret = libxl__build_hvm(gc, domid, info, state);
> +        ret = libxl__build_hvm(gc, domid, d_config, state);
>          if (ret)
>              goto out;
>  
> diff --git a/tools/libxl/libxl_dm.c b/tools/libxl/libxl_dm.c
> index a8b08f2..9ad04ae 100644
> --- a/tools/libxl/libxl_dm.c
> +++ b/tools/libxl/libxl_dm.c
> @@ -90,6 +90,201 @@ const char *libxl__domain_device_model(libxl__gc *gc,
>      return dm;
>  }
>  
> +/*
> + * Check whether there exists rdm hole in the specified memory range.
> + * Returns 1 if exists, else returns 0.
> + */
> +static int check_rdm_hole(uint64_t start, uint64_t memsize,
> +                          uint64_t rdm_start, uint64_t rdm_size)
> +{
> +    if ( start + memsize <= rdm_start || start >= rdm_start + rdm_size )
> +        return 0;
> +    else
> +        return 1;
> +}
> +
> +/*
> + * Check reported RDM regions and handle potential gfn conflicts according
> + * to user preferred policy.
> + */
> +int libxl__domain_device_check_rdm(libxl__gc *gc,
> +                                   libxl_domain_config *d_config,
> +                                   uint64_t rdm_mem_guard,

rdm_mem_boundary

> +                                   struct xc_hvm_build_args *args)

This function does more than "checking", so a better name is needed.

May be you should split this function to one "build" function and one
"check" function? What do you think?

> +{
> +    int i, j, conflict;
> +    libxl_ctx *ctx = libxl__gc_owner(gc);

You can just use CTX macro.

> +    struct xen_reserved_device_memory *xrdm = NULL;
> +    unsigned int nr_all_rdms = 0;
> +    uint64_t rdm_start, rdm_size, highmem_end = (1ULL << 32);
> +    uint32_t type = d_config->b_info.rdm.type;
> +    uint16_t seg;
> +    uint8_t bus, devfn;
> +
> +    /* Might not to expose rdm. */

Delete "to".

> +    if ((type == LIBXL_RDM_RESERVE_TYPE_NONE) && !d_config->num_pcidevs)
> +        return 0;
> +
> +    /* Collect all rdm info if exist. */
> +    xrdm = xc_device_get_rdm(ctx->xch, LIBXL_RDM_RESERVE_TYPE_HOST,
> +                             0, 0, 0, &nr_all_rdms);
> +    if (!nr_all_rdms)
> +        return 0;
> +    d_config->rdms = libxl__calloc(gc, nr_all_rdms,
> +                                   sizeof(libxl_device_rdm));

Note that if you use "gc" here the allocated memory will be, well,
garbage collected at some point. If you don't want them to be gc'ed you
should use NOGC.

> +    memset(d_config->rdms, 0, sizeof(libxl_device_rdm));
> +
> +    /* Query all RDM entries in this platform */
> +    if (type == LIBXL_RDM_RESERVE_TYPE_HOST) {
> +        d_config->num_rdms = nr_all_rdms;
> +        for (i = 0; i < d_config->num_rdms; i++) {
> +            d_config->rdms[i].start =
> +                                (uint64_t)xrdm[i].start_pfn << XC_PAGE_SHIFT;
> +            d_config->rdms[i].size =
> +                                (uint64_t)xrdm[i].nr_pages << XC_PAGE_SHIFT;
> +            d_config->rdms[i].flag = d_config->b_info.rdm.reserve;
> +        }
> +    } else {
> +        d_config->num_rdms = 0;
> +    }

And you should move d_config->rdms = libxl__calloc inside that `if'.
That is, don't allocate memory if you don't need it.

> +    free(xrdm);
> +
> +    /* Query RDM entries per-device */
> +    for (i = 0; i < d_config->num_pcidevs; i++) {
> +        unsigned int nr_entries = 0;
> +        bool new = true;
> +        seg = d_config->pcidevs[i].domain;
> +        bus = d_config->pcidevs[i].bus;
> +        devfn = PCI_DEVFN(d_config->pcidevs[i].dev, d_config->pcidevs[i].func);
> +        nr_entries = 0;

You've already initialised this variable.

> +        xrdm = xc_device_get_rdm(ctx->xch, LIBXL_RDM_RESERVE_TYPE_NONE,
> +                                 seg, bus, devfn, &nr_entries);
> +        /* No RDM to associated with this device. */
> +        if (!nr_entries)
> +            continue;
> +
> +        /* Need to check whether this entry is already saved in the array.
> +         * This could come from two cases:
> +         *
> +         *   - user may configure to get all RMRRs in this platform, which
> +         * is already queried before this point

Formatting.

> +         *   - or two assigned devices may share one RMRR entry
> +         *
> +         * different policies may be configured on the same RMRR due to above
> +         * two cases. We choose a simple policy to always favor stricter policy
> +         */
> +        for (j = 0; j < d_config->num_rdms; j++) {
> +            if (d_config->rdms[j].start ==
> +                                (uint64_t)xrdm[0].start_pfn << XC_PAGE_SHIFT)
> +             {
> +                if (d_config->rdms[j].flag != LIBXL_RDM_RESERVE_FLAG_FORCE)
> +                    d_config->rdms[j].flag = d_config->pcidevs[i].rdm_reserve;
> +                new = false;
> +                break;
> +            }
> +        }
> +
> +        if (new) {
> +            if (d_config->num_rdms > nr_all_rdms - 1) {
> +                LIBXL__LOG(CTX, LIBXL__LOG_ERROR, "Exceed rdm array boundary!\n");

LOG(ERROR, ...)

> +                free(xrdm);
> +                return -1;

Please use goto out idiom.

> +            }
> +
> +            /*
> +             * This is a new entry.
> +             */

/* This is a new entry. */

> +            d_config->rdms[d_config->num_rdms].start =
> +                                (uint64_t)xrdm[0].start_pfn << XC_PAGE_SHIFT;
> +            d_config->rdms[d_config->num_rdms].size =
> +                                (uint64_t)xrdm[0].nr_pages << XC_PAGE_SHIFT;
> +            d_config->rdms[d_config->num_rdms].flag = d_config->pcidevs[i].rdm_reserve;
> +            d_config->num_rdms++;

Does this work? I don't see you reallocate memory.

> +        }
> +        free(xrdm);

Bug: you free xrdm several times.

Please fix  obvious bugs and do simple tests before posting. That would
save you another round of back and forth.

> +    }
> +
> +    /* Fix highmem. */
> +    if (args->mem_size > args->lowmem_size)
> +        highmem_end += (args->mem_size - args->lowmem_size);

It would be clearer if you just do
    highmem_end = (1ULL << 32) + (args->mem_size - args->lowmem_size);

And please have a blank line in between this assignment and following
comment.

> +    /* Next step is to check and avoid potential conflict between RDM entries
> +     * and guest RAM. To avoid intrusive impact to existing memory layout
> +     * {lowmem, mmio, highmem} which is passed around various function blocks,
> +     * below conflicts are not handled which are rare and handling them would
> +     * lead to a more scattered layout:
> +     *  - RMRR in highmem area (>4G)
> +     *  - RMRR lower than a defined memory boundary (e.g. 2G)
> +     * Otherwise for conflicts between boundary and 4G, we'll simply move lowmem
> +     * end below reserved region to solve conflict.
> +     *
> +     * If a conflict is detected on a given RMRR entry, an error will be
> +     * returned.
> +     * If 'force' policy is specified. Or conflict is treated as a warning if
> +     * 'try' policy is specified, and we also mark this as INVALID not to expose
> +     * this entry to hvmloader.
> +     *
> +     * Firstly we should check the case of rdm < 4G because we may need to
> +     * expand highmem_end.

Is this strategy agreed in previous discussion? How future-proof is this
strategy?

> +     */
> +    for (i = 0; i < d_config->num_rdms; i++) {
> +        rdm_start = d_config->rdms[i].start;
> +        rdm_size = d_config->rdms[i].size;
> +        conflict = check_rdm_hole(0, args->lowmem_size, rdm_start, rdm_size);
> +
> +        if (!conflict)
> +            continue;
> +
> +        /*
> +         * Just check if RDM > our memory boundary
> +         */

A one line comment is good enough.

> +        if (d_config->rdms[i].start > rdm_mem_guard) {
> +            /*
> +             * We will move downwards lowmem_end so we have to expand
> +             * highmem_end.
> +             */
> +            highmem_end += (args->lowmem_size - rdm_start);
> +            /* Now move downwards lowmem_end. */
> +            args->lowmem_size = rdm_start;
> +        }
> +    }
> +
> +    /*
> +     * Finally we can take same policy to check lowmem(< 2G) and
> +     * highmem adjusted above.
> +     */
> +    for (i = 0; i < d_config->num_rdms; i++) {
> +        rdm_start = d_config->rdms[i].start;
> +        rdm_size = d_config->rdms[i].size;
> +        /* Does this entry conflict with lowmem? */
> +        conflict = check_rdm_hole(0, args->lowmem_size,
> +                                  rdm_start, rdm_size);
> +        /* Does this entry conflict with highmem? */
> +        conflict |= check_rdm_hole((1ULL<<32), highmem_end,
> +                                   rdm_start, rdm_size);
> +
> +        if (!conflict)
> +            continue;
> +
> +        if(d_config->rdms[i].flag == LIBXL_RDM_RESERVE_FLAG_FORCE) {
> +            LIBXL__LOG(CTX, LIBXL__LOG_ERROR, "RDM conflict at 0x%lx.\n",
> +                                                d_config->rdms[i].start);

LOG(ERROR, ...)

> +            return -1;

Use goto out style.

> +        } else {
> +            LIBXL__LOG(CTX, LIBXL__LOG_WARNING,
> +                                        "Ignoring RDM conflict at 0x%lx.\n",
> +                                        d_config->rdms[i].start);
> +
> +            /*
> +             * Then mask this INVALID to indicate we shouldn't expose this
> +             * to hvmloader.
> +             */
> +            d_config->rdms[i].flag = LIBXL_RDM_RESERVE_FLAG_INVALID;
> +        }
> +    }
> +
> +    return 0;
> +}
> +
>  const libxl_vnc_info *libxl__dm_vnc(const libxl_domain_config *guest_config)
>  {
>      const libxl_vnc_info *vnc = NULL;
> diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c
> index a16d4a1..ee67282 100644
> --- a/tools/libxl/libxl_dom.c
> +++ b/tools/libxl/libxl_dom.c
> @@ -788,12 +788,14 @@ out:
>  }
>  
>  int libxl__build_hvm(libxl__gc *gc, uint32_t domid,
> -              libxl_domain_build_info *info,
> +              libxl_domain_config *d_config,
>                libxl__domain_build_state *state)
>  {
>      libxl_ctx *ctx = libxl__gc_owner(gc);
>      struct xc_hvm_build_args args = {};
>      int ret, rc = ERROR_FAIL;
> +    libxl_domain_build_info *const info = &d_config->b_info;
> +    uint64_t rdm_mem_boundary, mmio_start;
>  
>      memset(&args, 0, sizeof(struct xc_hvm_build_args));
>      /* The params from the configuration file are in Mb, which are then
> @@ -802,6 +804,8 @@ int libxl__build_hvm(libxl__gc *gc, uint32_t domid,
>       * Do all this in one step here...
>       */
>      args.mem_size = (uint64_t)(info->max_memkb - info->video_memkb) << 10;
> +    args.lowmem_size = (args.mem_size > (1ULL << 32)) ?
> +                            (1ULL << 32) : args.mem_size;
>      args.mem_target = (uint64_t)(info->target_memkb - info->video_memkb) << 10;
>      args.claim_enabled = libxl_defbool_val(info->claim_mode);
>      if (info->u.hvm.mmio_hole_memkb) {
> @@ -811,6 +815,27 @@ int libxl__build_hvm(libxl__gc *gc, uint32_t domid,
>          if (max_ram_below_4g < HVM_BELOW_4G_MMIO_START)
>              args.mmio_size = info->u.hvm.mmio_hole_memkb << 10;
>      }
> +
> +    if (args.mmio_size == 0)
> +        args.mmio_size = HVM_BELOW_4G_MMIO_LENGTH;
> +    mmio_start = (1ull << 32) - args.mmio_size;
> +
> +    if (args.lowmem_size > mmio_start)
> +        args.lowmem_size = mmio_start;
> +
> +    /*
> +     * We'd like to set a memory boundary to determine if we need to check
> +     * any overlap with reserved device memory.
> +     *
> +     * TODO: we will add a config parameter for this boundary value next.
> +     */

Yes, please do so and properly document it.

> +    rdm_mem_boundary = 0x80000000;
> +    ret = libxl__domain_device_check_rdm(gc, d_config, rdm_mem_boundary, &args);
> +    if (ret) {
> +        LOG(ERROR, "checking reserved device memory failed");
> +        goto out;
> +    }
> +
>      if (libxl__domain_firmware(gc, info, &args)) {
>          LOG(ERROR, "initializing domain firmware failed");
>          goto out;
> diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h
> index 934465a..64d05b3 100644
> --- a/tools/libxl/libxl_internal.h
> +++ b/tools/libxl/libxl_internal.h
> @@ -985,7 +985,7 @@ _hidden int libxl__build_post(libxl__gc *gc, uint32_t domid,
>  _hidden int libxl__build_pv(libxl__gc *gc, uint32_t domid,
>               libxl_domain_build_info *info, libxl__domain_build_state *state);
>  _hidden int libxl__build_hvm(libxl__gc *gc, uint32_t domid,
> -              libxl_domain_build_info *info,
> +              libxl_domain_config *d_config,
>                libxl__domain_build_state *state);
>  
>  _hidden int libxl__qemu_traditional_cmd(libxl__gc *gc, uint32_t domid,
> @@ -1480,6 +1480,15 @@ _hidden int libxl__need_xenpv_qemu(libxl__gc *gc,
>          int nr_channels, libxl_device_channel *channels);
>  
>  /*
> + * This function will fix reserved device memory conflict
> + * according to user's configuration.
> + */
> +_hidden int libxl__domain_device_check_rdm(libxl__gc *gc,
> +                                   libxl_domain_config *d_config,
> +                                   uint64_t rdm_mem_guard,
> +                                   struct xc_hvm_build_args *args);
> +
> +/*
>   * This function will cause the whole libxl process to hang
>   * if the device model does not respond.  It is deprecated.
>   *
> diff --git a/tools/libxl/libxl_types.idl b/tools/libxl/libxl_types.idl
> index 5786455..218931b 100644
> --- a/tools/libxl/libxl_types.idl
> +++ b/tools/libxl/libxl_types.idl
> @@ -541,6 +541,12 @@ libxl_device_pci = Struct("device_pci", [
>      ("rdm_reserve",   libxl_rdm_reserve_flag),
>      ])
>  
> +libxl_device_rdm = Struct("device_rdm", [
> +    ("start", uint64),
> +    ("size", uint64),
> +    ("flag", bool),
> +    ])
> +
>  libxl_device_vtpm = Struct("device_vtpm", [
>      ("backend_domid",    libxl_domid),
>      ("backend_domname",  string),
> @@ -567,6 +573,7 @@ libxl_domain_config = Struct("domain_config", [
>      ("disks", Array(libxl_device_disk, "num_disks")),
>      ("nics", Array(libxl_device_nic, "num_nics")),
>      ("pcidevs", Array(libxl_device_pci, "num_pcidevs")),
> +    ("rdms", Array(libxl_device_rdm, "num_rdms")),
>      ("vfbs", Array(libxl_device_vfb, "num_vfbs")),
>      ("vkbs", Array(libxl_device_vkb, "num_vkbs")),
>      ("vtpms", Array(libxl_device_vtpm, "num_vtpms")),
> -- 
> 1.9.1

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC][PATCH 04/13] tools/libxl: detect and avoid conflicts with RDM
  2015-05-08  1:24           ` Chen, Tiejun
@ 2015-05-08 15:13             ` Wei Liu
  2015-05-11  6:06               ` Chen, Tiejun
  0 siblings, 1 reply; 125+ messages in thread
From: Wei Liu @ 2015-05-08 15:13 UTC (permalink / raw)
  To: Chen, Tiejun
  Cc: kevin.tian, wei.liu2, ian.campbell, andrew.cooper3, Ian.Jackson,
	tim, stefano.stabellini, Jan Beulich, yang.z.zhang, xen-devel

On Fri, May 08, 2015 at 09:24:56AM +0800, Chen, Tiejun wrote:
> Campbell, Jackson, Wei and Stefano,
> 
> Any consideration?
> 
> I can follow up Jan's idea but I need you guys make sure I'm going to do
> this properly.
> 

Look at my earlier reply.

1. This function seems to have bug.
2. Caller should have allocated enough memory so that this function
   don't have to.

Looks like we don't need it, at least not in its current form.

Wei.

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC][PATCH 07/13] xen/passthrough: extend hypercall to support rdm reservation policy
  2015-04-10  9:21 ` [RFC][PATCH 07/13] xen/passthrough: extend hypercall to support rdm reservation policy Tiejun Chen
  2015-04-16 15:40   ` Tim Deegan
  2015-04-20 13:36   ` Jan Beulich
@ 2015-05-08 16:07   ` Julien Grall
  2015-05-11  8:42     ` Chen, Tiejun
  2 siblings, 1 reply; 125+ messages in thread
From: Julien Grall @ 2015-05-08 16:07 UTC (permalink / raw)
  To: Tiejun Chen, JBeulich, tim, konrad.wilk, andrew.cooper3,
	kevin.tian, yang.z.zhang, ian.campbell, wei.liu2, Ian.Jackson,
	stefano.stabellini
  Cc: xen-devel

Hi,

On 10/04/15 10:21, Tiejun Chen wrote:
> diff --git a/xen/include/public/domctl.h b/xen/include/public/domctl.h
> index ca0e51e..e5ba7cb 100644
> --- a/xen/include/public/domctl.h
> +++ b/xen/include/public/domctl.h
> @@ -493,6 +493,10 @@ DEFINE_XEN_GUEST_HANDLE(xen_domctl_sendtrigger_t);
>  /* XEN_DOMCTL_deassign_device */
>  struct xen_domctl_assign_device {
>      uint32_t  machine_sbdf;   /* machine PCI ID of assigned device */
> +    /* IN */
> +#define XEN_DOMCTL_PCIDEV_RDM_TRY       0
> +#define XEN_DOMCTL_PCIDEV_RDM_FORCE     1
> +    uint32_t  sbdf_flag;   /* flag of assigned device */
>  };
>  typedef struct xen_domctl_assign_device xen_domctl_assign_device_t;
>  DEFINE_XEN_GUEST_HANDLE(xen_domctl_assign_device_t);
> diff --git a/xen/include/xen/iommu.h b/xen/include/xen/iommu.h
> index 8565b82..0d10b3d 100644
> --- a/xen/include/xen/iommu.h
> +++ b/xen/include/xen/iommu.h
> @@ -129,7 +129,7 @@ struct iommu_ops {
>      int (*add_device)(u8 devfn, device_t *dev);
>      int (*enable_device)(device_t *dev);
>      int (*remove_device)(u8 devfn, device_t *dev);
> -    int (*assign_device)(struct domain *, u8 devfn, device_t *dev);
> +    int (*assign_device)(struct domain *, u8 devfn, device_t *dev, u32 flag);

You need to update the ARM code with this new prototype:

xen/drivers/passthrough/device_tree.c
xen/drivers/passthrough/arm/smmu.c

Regards,

-- 
Julien Grall

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC][PATCH 01/13] tools: introduce some new parameters to set rdm policy
  2015-05-08 13:04   ` Wei Liu
@ 2015-05-11  5:35     ` Chen, Tiejun
  2015-05-11 14:54       ` Wei Liu
  0 siblings, 1 reply; 125+ messages in thread
From: Chen, Tiejun @ 2015-05-11  5:35 UTC (permalink / raw)
  To: Wei Liu
  Cc: kevin.tian, ian.campbell, andrew.cooper3, tim, xen-devel,
	stefano.stabellini, JBeulich, yang.z.zhang, Ian.Jackson

On 2015/5/8 21:04, Wei Liu wrote:
> Sorry for the late review.
>

Really thanks for taking your time :)

> On Fri, Apr 10, 2015 at 05:21:52PM +0800, Tiejun Chen wrote:
>> This patch introduces user configurable parameters to specify RDM
>> resource and according policies,
>>
>> Global RDM parameter:
>>      rdm = [ 'host, reserve=force/try' ]
>> Per-device RDM parameter:
>>      pci = [ 'sbdf, rdm_reserve=force/try' ]
>>
>> Global RDM parameter allows user to specify reserved regions explicitly,
>> e.g. using 'host' to include all reserved regions reported on this platform
>> which is good to handle hotplug scenario. In the future this parameter
>> may be further extended to allow specifying random regions, e.g. even
>> those belonging to another platform as a preparation for live migration
>> with passthrough devices.
>>
>> 'force/try' policy decides how to handle conflict when reserving RDM
>> regions in pfn space. If conflict exists, 'force' means an immediate error
>> so VM will be killed, while 'try' allows moving forward with a warning
>> message thrown out.
>>
>> Default per-device RDM policy is 'force', while default global RDM policy
>> is 'try'. When both policies are specified on a given region, 'force' is
>> always preferred.
>>
>> Signed-off-by: Tiejun Chen <tiejun.chen@intel.com>
>> ---
>>   docs/man/xl.cfg.pod.5       | 44 +++++++++++++++++++++++++
>>   docs/misc/vtd.txt           | 34 ++++++++++++++++++++
>>   tools/libxl/libxl_create.c  |  5 +++
>>   tools/libxl/libxl_types.idl | 18 +++++++++++
>>   tools/libxl/libxlu_pci.c    | 78 +++++++++++++++++++++++++++++++++++++++++++++
>>   tools/libxl/libxlutil.h     |  4 +++
>>   tools/libxl/xl_cmdimpl.c    | 21 +++++++++++-
>>   7 files changed, 203 insertions(+), 1 deletion(-)
>>
>> diff --git a/docs/man/xl.cfg.pod.5 b/docs/man/xl.cfg.pod.5
>> index 408653f..9ed3055 100644
>> --- a/docs/man/xl.cfg.pod.5
>> +++ b/docs/man/xl.cfg.pod.5
>> @@ -583,6 +583,36 @@ assigned slave device.
>>
>>   =back
>>
>> +=item B<rdm=[ "TYPE", "RDM_RESERVE_STRING", ... ]>
>> +
>
> Shouldn't this be "TYPE,RDM_RESERVE_STRIGN" according to your commit
> message? If the only available config is just one string, you probably
> don't need a list for this?

Yes, based on that design we don't need a list. So

=item B<rdm=[ "RDM_RESERVE_STRING" ]>

>
>> +(HVM/x86 only) Specifies the information about Reserved Device Memory (RDM),
>> +which is necessary to enable robust device passthrough usage. One example of
>> +RDM is reported through ACPI Reserved Memory Region Reporting (RMRR)
>> +structure on x86 platform.
>> +Each B<RDM_CHECK_STRING> has the form C<["TYPE",KEY=VALUE,KEY=VALUE,...> where:
>> +
>
> RDM_CHECK_STRING?

And here should be corrected like this,

B<RDM_RESERVE_STRING> has the form ...

>
>> +=over 4
>> +
>> +=item B<"TYPE">
>> +
>> +Currently we just have one type. 'host' means all reserved device memory on
>> +this platform should be reserved in this VM's pfn space.
>> +
>
> What are other possible types? If there is only one type then we can

Currently we just have one type and looks that design doesn't make this 
clear.

> simply ignore the type?

I just think we may introduce something else specific to live migration 
in the future... But I'm really not sure right now.

>
>> +=item B<KEY=VALUE>
>> +
>> +Possible B<KEY>s are:
>> +
>> +=over 4
>> +
>> +=item B<reserve="STRING">
>> +
>> +Conflict may be detected when reserving reserved device memory in gfn space.
>> +'force' means an unsolved conflict leads to immediate VM destroy, while
>
> Do you mean "immediate VM crash"?

Yes. So I guess I should replace this.

>
>> +'try' allows VM moving forward with a warning message thrown out. 'try'
>> +is default.
>
> Can you please your double quotes for "force", "try" etc.

Sure. Just note we'd like to use "strict"/"relaxed" to replace 
"force"/"try" from next revision according to Jan's suggestion.

>
>> +
>> +Note this may be overrided by another sub item, rdm_reserve, in pci device.
>> +
>>   =item B<pci=[ "PCI_SPEC_STRING", "PCI_SPEC_STRING", ... ]>
>>
>>   Specifies the host PCI devices to passthrough to this guest. Each B<PCI_SPEC_STRING>
>> @@ -645,6 +675,20 @@ dom0 without confirmation.  Please use with care.
>>   D0-D3hot power management states for the PCI device. False (0) by
>>   default.
>>
>> +=item B<rdm_check="STRING">
>> +
>> +(HVM/x86 only) Specifies the information about Reserved Device Memory (RDM),
>> +which is necessary to enable robust device passthrough usage. One example of
>> +RDM is reported through ACPI Reserved Memory Region Reporting (RMRR)
>> +structure on x86 platform.
>> +
>> +Conflict may be detected when reserving reserved device memory in gfn space.
>> +'force' means an unsolved conflict leads to immediate VM destroy, while
>> +'try' allows VM moving forward with a warning message thrown out. 'force'
>> +is default.
>> +
>> +Note this would override another global item, rdm = [''].
>> +
>
> Note this would override global B<rdm> option.

Fixed.

>
>>   =back
>>
>>   =back
>> diff --git a/docs/misc/vtd.txt b/docs/misc/vtd.txt
>> index 9af0e99..d7434d6 100644
>> --- a/docs/misc/vtd.txt
>> +++ b/docs/misc/vtd.txt
>> @@ -111,6 +111,40 @@ in the config file:
>>   To override for a specific device:
>>   	pci = [ '01:00.0,msitranslate=0', '03:00.0' ]
>>
>> +RDM, 'reserved device memory', for PCI Device Passthrough
>> +---------------------------------------------------------
>> +
>> +There are some devices the BIOS controls, for e.g. USB devices to perform
>> +PS2 emulation. The regions of memory used for these devices are marked
>> +reserved in the e820 map. When we turn on DMA translation, DMA to those
>> +regions will fail. Hence BIOS uses RMRR to specify these regions along with
>> +devices that need to access these regions. OS is expected to setup
>> +identity mappings for these regions for these devices to access these regions.
>> +
>> +While creating a VM we should reserve them in advance, and avoid any conflicts.
>> +So we introduce user configurable parameters to specify RDM resource and
>> +according policies,
>> +
>> +To enable this globally, add "rdm" in the config file:
>> +
>> +    rdm = [ 'host, reserve=force/try' ]
>> +
>
> The "force/try" should be called "policy". And then you explain what
> policies we have.

Do you mean we should rename this?

rdm = [ 'host, policy=force/try' ]

This is really a policy but 'reserve' may can reflect our action 
explicitly, right?

>
>> +Or just for a specific device:
>> +
>> +	pci = [ '01:00.0,rdm_reserve=force/try', '03:00.0' ]

And you also can see this.

But anyway, if you're really stick to rename this, I'm going to be fine 
as well but we should ping every one to check this point since this name 
is from our previous discussion.

>> +
>> +Global RDM parameter allows user to specify reserved regions explicitly.
>> +Using 'host' to include all reserved regions reported on this platform
>> +which is good to handle hotplug scenario. In the future this parameter
>> +may be further extended to allow specifying random regions, e.g. even
>> +those belonging to another platform as a preparation for live migration
>> +with passthrough devices.
>> +
>> +'force/try' policy decides how to handle conflict when reserving RDM
>> +regions in pfn space. If conflict exists, 'force' means an immediate error
>> +so VM will be killed, while 'try' allows moving forward with a warning
>
> Be killed by whom? I think it's hvmloader crashes voluntarily, right?

s/VM will be kille/hvmloader crashes voluntarily

>
>> +message thrown out.
>> +
>>
>>   Caveat on Conventional PCI Device Passthrough
>>   ---------------------------------------------
>> diff --git a/tools/libxl/libxl_create.c b/tools/libxl/libxl_create.c
>> index 98687bd..9ed40d4 100644
>> --- a/tools/libxl/libxl_create.c
>> +++ b/tools/libxl/libxl_create.c
>> @@ -1407,6 +1407,11 @@ static void domcreate_attach_pci(libxl__egc *egc, libxl__multidev *multidev,
>>       }
>>
>>       for (i = 0; i < d_config->num_pcidevs; i++) {
>> +        /*
>> +         * If the rdm global policy is 'force' we should override each device.
>> +         */
>> +        if (d_config->b_info.rdm.reserve == LIBXL_RDM_RESERVE_FLAG_FORCE)
>> +            d_config->pcidevs[i].rdm_reserve = LIBXL_RDM_RESERVE_FLAG_FORCE;
>>           ret = libxl__device_pci_add(gc, domid, &d_config->pcidevs[i], 1);
>>           if (ret < 0) {
>>               LIBXL__LOG(ctx, LIBXL__LOG_ERROR,
>> diff --git a/tools/libxl/libxl_types.idl b/tools/libxl/libxl_types.idl
>> index 47af340..5786455 100644
>> --- a/tools/libxl/libxl_types.idl
>> +++ b/tools/libxl/libxl_types.idl
>> @@ -71,6 +71,17 @@ libxl_domain_type = Enumeration("domain_type", [
>>       (2, "PV"),
>>       ], init_val = "LIBXL_DOMAIN_TYPE_INVALID")
>>
>> +libxl_rdm_reserve_type = Enumeration("rdm_reserve_type", [
>> +    (0, "none"),
>> +    (1, "host"),
>> +    ])
>> +
>> +libxl_rdm_reserve_flag = Enumeration("rdm_reserve_flag", [
>> +    (-1, "invalid"),
>> +    (0, "force"),
>> +    (1, "try"),
>> +    ])
>
> If you don't set init_val, the default value would be "force" (0), is this

Yes.

> want you want?

We have a little bit of complexity here,

"Default per-device RDM policy is 'force', while default global RDM 
policy is 'try'. When both policies are specified on a given region, 
'force' is always preferred."

>
>> +
>>   libxl_channel_connection = Enumeration("channel_connection", [
>>       (0, "UNKNOWN"),
>>       (1, "PTY"),
>> @@ -356,6 +367,11 @@ libxl_domain_sched_params = Struct("domain_sched_params",[
>>       ("budget",       integer, {'init_val': 'LIBXL_DOMAIN_SCHED_PARAM_BUDGET_DEFAULT'}),
>>       ])
>>
>> +libxl_rdm_reserve = Struct("rdm_reserve", [
>> +    ("type",    libxl_rdm_reserve_type),
>> +    ("reserve",   libxl_rdm_reserve_flag),
>> +    ])
>> +
>>   libxl_domain_build_info = Struct("domain_build_info",[
>>       ("max_vcpus",       integer),
>>       ("avail_vcpus",     libxl_bitmap),
>> @@ -401,6 +417,7 @@ libxl_domain_build_info = Struct("domain_build_info",[
>>       ("kernel",           string),
>>       ("cmdline",          string),
>>       ("ramdisk",          string),
>> +    ("rdm",     libxl_rdm_reserve),
>>       ("u", KeyedUnion(None, libxl_domain_type, "type",
>>                   [("hvm", Struct(None, [("firmware",         string),
>>                                          ("bios",             libxl_bios_type),
>> @@ -521,6 +538,7 @@ libxl_device_pci = Struct("device_pci", [
>>       ("power_mgmt", bool),
>>       ("permissive", bool),
>>       ("seize", bool),
>> +    ("rdm_reserve",   libxl_rdm_reserve_flag),
>>       ])
>>
>>   libxl_device_vtpm = Struct("device_vtpm", [
>> diff --git a/tools/libxl/libxlu_pci.c b/tools/libxl/libxlu_pci.c
>> index 26fb143..45be0d9 100644
>> --- a/tools/libxl/libxlu_pci.c
>> +++ b/tools/libxl/libxlu_pci.c
>> @@ -42,6 +42,8 @@ static int pcidev_struct_fill(libxl_device_pci *pcidev, unsigned int domain,
>>   #define STATE_OPTIONS_K 6
>>   #define STATE_OPTIONS_V 7
>>   #define STATE_TERMINAL  8
>> +#define STATE_TYPE      9
>> +#define STATE_CHECK_FLAG      10
>>   int xlu_pci_parse_bdf(XLU_Config *cfg, libxl_device_pci *pcidev, const char *str)
>>   {
>>       unsigned state = STATE_DOMAIN;
>> @@ -143,6 +145,17 @@ int xlu_pci_parse_bdf(XLU_Config *cfg, libxl_device_pci *pcidev, const char *str
>>                       pcidev->permissive = atoi(tok);
>>                   }else if ( !strcmp(optkey, "seize") ) {
>>                       pcidev->seize = atoi(tok);
>> +                }else if ( !strcmp(optkey, "rdm_reserve") ) {
>> +                    if ( !strcmp(tok, "force") ) {
>> +                        pcidev->rdm_reserve = LIBXL_RDM_RESERVE_FLAG_FORCE;
>> +                    } else if ( !strcmp(tok, "try") ) {
>> +                        pcidev->rdm_reserve = LIBXL_RDM_RESERVE_FLAG_TRY;
>> +                    } else {
>> +                        pcidev->rdm_reserve = LIBXL_RDM_RESERVE_FLAG_FORCE;
>> +                        XLU__PCI_ERR(cfg, "Unknown PCI RDM property flag value:"
>> +                                          " %s, so goes 'force' by default.",
>
> If this is not an error, you don't need XLU__PCI_ERR.
>
> But I would say we should  just treat this as an error and
> abort/exit/report (whatever the parser should do in this case).

In our case we just want to post a message to set a appropriate flag to 
recover this behavior like we write here,

                         XLU__PCI_ERR(cfg, "Unknown PCI RDM property 
flag value:"
                                           " %s, so goes 'strict' by 
default.",
                                      tok);

This may just be a warning? But I don't we have this sort of definition, 
XLU__PCI_WARN, ...

So what LOG format can be adopted here?

>
>> +                                     tok);
>> +                    }
>>                   }else{
>>                       XLU__PCI_ERR(cfg, "Unknown PCI BDF option: %s", optkey);
>>                   }
>> @@ -167,6 +180,71 @@ parse_error:
>>       return ERROR_INVAL;
>>   }
>>
>> +int xlu_rdm_parse(XLU_Config *cfg, libxl_rdm_reserve *rdm, const char *str)
>> +{
>> +    unsigned state = STATE_TYPE;
>> +    char *buf2, *tok, *ptr, *end;
>> +
>> +    if ( NULL == (buf2 = ptr = strdup(str)) )
>> +        return ERROR_NOMEM;
>> +
>> +    for(tok = ptr, end = ptr + strlen(ptr) + 1; ptr < end; ptr++) {
>> +        switch(state) {
>> +        case STATE_TYPE:
>> +            if ( *ptr == '\0' || *ptr == ',' ) {
>> +                state = STATE_CHECK_FLAG;
>> +                *ptr = '\0';
>> +                if ( !strcmp(tok, "host") ) {
>> +                    rdm->type = LIBXL_RDM_RESERVE_TYPE_HOST;
>> +                } else {
>> +                    XLU__PCI_ERR(cfg, "Unknown RDM state option: %s", tok);
>> +                    goto parse_error;
>> +                }
>> +                tok = ptr + 1;
>> +            }
>> +            break;
>> +        case STATE_CHECK_FLAG:
>> +            if ( *ptr == '=' ) {
>> +                state = STATE_OPTIONS_V;
>> +                *ptr = '\0';
>> +                if ( strcmp(tok, "reserve") ) {
>> +                    XLU__PCI_ERR(cfg, "Unknown RDM property value: %s", tok);
>> +                    goto parse_error;
>> +                }
>> +                tok = ptr + 1;
>> +            }
>> +            break;
>> +        case STATE_OPTIONS_V:
>> +            if ( *ptr == ',' || *ptr == '\0' ) {
>> +                state = STATE_TERMINAL;
>> +                *ptr = '\0';
>> +                if ( !strcmp(tok, "force") ) {
>> +                    rdm->reserve = LIBXL_RDM_RESERVE_FLAG_FORCE;
>> +                } else if ( !strcmp(tok, "try") ) {
>> +                    rdm->reserve = LIBXL_RDM_RESERVE_FLAG_TRY;
>> +                } else {
>> +                    XLU__PCI_ERR(cfg, "Unknown RDM property flag value: %s",
>> +                                 tok);
>> +                    goto parse_error;
>> +                }
>> +                tok = ptr + 1;
>> +            }
>> +        default:
>> +            break;
>> +        }
>> +    }
>> +
>> +    free(buf2);
>> +
>> +    if ( tok != ptr || state != STATE_TERMINAL )
>> +        goto parse_error;
>> +
>> +    return 0;
>> +
>> +parse_error:
>> +    return ERROR_INVAL;
>> +}
>> +
>>   /*
>>    * Local variables:
>>    * mode: C
>> diff --git a/tools/libxl/libxlutil.h b/tools/libxl/libxlutil.h
>> index 0333e55..80497f8 100644
>> --- a/tools/libxl/libxlutil.h
>> +++ b/tools/libxl/libxlutil.h
>> @@ -93,6 +93,10 @@ int xlu_disk_parse(XLU_Config *cfg, int nspecs, const char *const *specs,
>>    */
>>   int xlu_pci_parse_bdf(XLU_Config *cfg, libxl_device_pci *pcidev, const char *str);
>>
>> +/*
>> + * RDM parsing
>> + */
>> +int xlu_rdm_parse(XLU_Config *cfg, libxl_rdm_reserve *rdm, const char *str);
>>
>>   /*
>>    * Vif rate parsing.
>> diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c
>> index 04faf98..9a58464 100644
>> --- a/tools/libxl/xl_cmdimpl.c
>> +++ b/tools/libxl/xl_cmdimpl.c
>> @@ -988,7 +988,7 @@ static void parse_config_data(const char *config_source,
>>       const char *buf;
>>       long l;
>>       XLU_Config *config;
>> -    XLU_ConfigList *cpus, *vbds, *nics, *pcis, *cvfbs, *cpuids, *vtpms;
>> +    XLU_ConfigList *cpus, *vbds, *nics, *pcis, *cvfbs, *cpuids, *vtpms, *rdms;
>>       XLU_ConfigList *channels, *ioports, *irqs, *iomem, *viridian;
>>       int num_ioports, num_irqs, num_iomem, num_cpus, num_viridian;
>>       int pci_power_mgmt = 0;
>> @@ -1727,6 +1727,23 @@ skip_vfb:
>>           xlu_cfg_get_defbool(config, "e820_host", &b_info->u.pv.e820_host, 0);
>>       }
>>
>> +    /*
>> +     * By default our global policy is to query all rdm entries, and
>> +     * force reserve them.
>> +     */
>> +    b_info->rdm.type = LIBXL_RDM_RESERVE_TYPE_HOST;
>> +    b_info->rdm.reserve = LIBXL_RDM_RESERVE_FLAG_TRY;
>
> This should probably to into the _setdefault function of
> libxl_domain_build_info.

Sorry, I just see this

libxl_domain_build_info_init()
     |
     + libxl_rdm_reserve_init(&p->rdm);
	|
	+ memset(p, '\0', sizeof(*p));

But this should be generated automatically, right? So how to implement 
your idea? Could you give me a show?

>
>> +    if (!xlu_cfg_get_list (config, "rdm", &rdms, 0, 0)) {
>> +        if ((buf = xlu_cfg_get_listitem (rdms, 0)) != NULL ) {
>> +            libxl_rdm_reserve rdm;
>> +            if (!xlu_rdm_parse(config, &rdm, buf))
>> +            {
>> +                b_info->rdm.type = rdm.type;
>> +                b_info->rdm.reserve = rdm.reserve;
>> +            }
>
> You only have one rdm in b_info, so there is no need to use a list for
> it in config file.
>

Is this fine?

     if (!xlu_cfg_get_string(config, "rdm", &buf, 0)) { 

         libxl_rdm_reserve rdm; 

         if (!xlu_rdm_parse(config, &rdm, buf)) {
             b_info->rdm.type = rdm.type; 

             b_info->rdm.reserve = rdm.reserve; 

         }
     }

> One side note is that this patch should probably be placed later in this
> series, after other patches to add the necessary underlying
> functionalities.

Okay.

Thanks
Tiejun

>
> Wei.
>
>> +        }
>> +    }
>> +
>>       if (!xlu_cfg_get_list (config, "pci", &pcis, 0, 0)) {
>>           d_config->num_pcidevs = 0;
>>           d_config->pcidevs = NULL;
>> @@ -1741,6 +1758,8 @@ skip_vfb:
>>               pcidev->power_mgmt = pci_power_mgmt;
>>               pcidev->permissive = pci_permissive;
>>               pcidev->seize = pci_seize;
>> +            /* We'd like to force reserve rdm specific to a device by default.*/
>> +            pcidev->rdm_reserve = LIBXL_RDM_RESERVE_FLAG_FORCE;
>>               if (!xlu_pci_parse_bdf(config, pcidev, buf))
>>                   d_config->num_pcidevs++;
>>           }
>> --
>> 1.9.1
>

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC][PATCH 03/13] tools/libxc: Expose new hypercall xc_reserved_device_memory_map
  2015-05-08 13:07   ` Wei Liu
@ 2015-05-11  5:36     ` Chen, Tiejun
  2015-05-11  9:50       ` Wei Liu
  0 siblings, 1 reply; 125+ messages in thread
From: Chen, Tiejun @ 2015-05-11  5:36 UTC (permalink / raw)
  To: Wei Liu
  Cc: kevin.tian, ian.campbell, andrew.cooper3, tim, xen-devel,
	stefano.stabellini, JBeulich, yang.z.zhang, Ian.Jackson

On 2015/5/8 21:07, Wei Liu wrote:
> On Fri, Apr 10, 2015 at 05:21:54PM +0800, Tiejun Chen wrote:
>> We will introduce the hypercall xc_reserved_device_memory_map
>> approach to libxc. This helps us get rdm entry info according to
>> different parameters. If flag == PCI_DEV_RDM_ALL, all entries
>> should be exposed. Or we just expose that rdm entry specific to
>> a SBDF.
>>
>> Signed-off-by: Tiejun Chen <tiejun.chen@intel.com>
>
> This patch contains a wrapper to the new hypercall. If the HV guys are
> happy with this I'm fine with it too.
>

Who should be pinged in this case?

Thanks
Tiejun

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC][PATCH 04/13] tools/libxl: detect and avoid conflicts with RDM
  2015-05-08 15:13             ` Wei Liu
@ 2015-05-11  6:06               ` Chen, Tiejun
  0 siblings, 0 replies; 125+ messages in thread
From: Chen, Tiejun @ 2015-05-11  6:06 UTC (permalink / raw)
  To: Wei Liu
  Cc: kevin.tian, ian.campbell, andrew.cooper3, Ian.Jackson, tim,
	stefano.stabellini, Jan Beulich, yang.z.zhang, xen-devel

On 2015/5/8 23:13, Wei Liu wrote:
> On Fri, May 08, 2015 at 09:24:56AM +0800, Chen, Tiejun wrote:
>> Campbell, Jackson, Wei and Stefano,
>>
>> Any consideration?
>>
>> I can follow up Jan's idea but I need you guys make sure I'm going to do
>> this properly.
>>
>
> Look at my earlier reply.

Thanks for your reply. And lets discuss this directly in your reply.

Tiejun

>
> 1. This function seems to have bug.
> 2. Caller should have allocated enough memory so that this function
>     don't have to.
>
> Looks like we don't need it, at least not in its current form.
>
> Wei.
>

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC][PATCH 04/13] tools/libxl: detect and avoid conflicts with RDM
  2015-05-08 14:43   ` Wei Liu
@ 2015-05-11  8:09     ` Chen, Tiejun
  2015-05-11 11:32       ` Wei Liu
  0 siblings, 1 reply; 125+ messages in thread
From: Chen, Tiejun @ 2015-05-11  8:09 UTC (permalink / raw)
  To: Wei Liu, JBeulich
  Cc: kevin.tian, ian.campbell, andrew.cooper3, tim, xen-devel,
	stefano.stabellini, yang.z.zhang, Ian.Jackson

On 2015/5/8 22:43, Wei Liu wrote:
> Sorry for the late review. This series fell through the crack.
>

Thanks for your review.

> On Fri, Apr 10, 2015 at 05:21:55PM +0800, Tiejun Chen wrote:
>> While building a VM, HVM domain builder provides struct hvm_info_table{}
>> to help hvmloader. Currently it includes two fields to construct guest
>> e820 table by hvmloader, low_mem_pgend and high_mem_pgend. So we should
>> check them to fix any conflict with RAM.
>>
>> RMRR can reside in address space beyond 4G theoretically, but we never
>> see this in real world. So in order to avoid breaking highmem layout
>
> How does this break highmem layout?

In most cases highmen is always continuous like [4G, ...] but RMRR is 
theoretically beyond 4G but very rarely. Especially we don't see this 
happened in real world. So we don't want to such a case breaking the 
highmem.

>
>> we don't solve highmem conflict. Note this means highmem rmrr could still
>> be supported if no conflict.
>>
>
> Aren't you actively trying to avoid conflict in libxl?

RMRR is fixed by BIOS so we can't aovid conflict. Here we want to adopt 
some good policies to address RMRR. In the case of highmemt, that simple 
policy should be enough?

>
>> But in the case of lowmem, RMRR probably scatter the whole RAM space.
>> Especially multiple RMRR entries would worsen this to lead a complicated
>> memory layout. And then its hard to extend hvm_info_table{} to work
>> hvmloader out. So here we're trying to figure out a simple solution to
>> avoid breaking existing layout. So when a conflict occurs,
>>
>>      #1. Above a predefined boundary (default 2G)
>>          - move lowmem_end below reserved region to solve conflict;
>>
>
> I hope this "predefined boundary" is user tunable. I will check later in
> this patch if it is the case.

Yes. As we clarified in that comments,

* TODO: Its batter to provide a config parameter for this boundary value.

This mean I would provide a patch address this since currently I just 
think this is not a big deal?

>
>>      #2 Below a predefined boundary (default 2G)
>>          - Check force/try policy.
>>          "force" policy leads to fail libxl. Note when both policies
>>          are specified on a given region, 'force' is always preferred.
>>          "try" policy issue a warning message and also mask this entry INVALID
>>          to indicate we shouldn't expose this entry to hvmloader.
>>
>> Signed-off-by: Tiejun Chen <tiejun.chen@intel.com>
>> ---
>>   tools/libxc/include/xenctrl.h  |   8 ++
>>   tools/libxc/include/xenguest.h |   3 +-
>>   tools/libxc/xc_domain.c        |  40 +++++++++
>>   tools/libxc/xc_hvm_build_x86.c |  28 +++---
>>   tools/libxl/libxl_create.c     |   2 +-
>>   tools/libxl/libxl_dm.c         | 195 +++++++++++++++++++++++++++++++++++++++++
>>   tools/libxl/libxl_dom.c        |  27 +++++-
>>   tools/libxl/libxl_internal.h   |  11 ++-
>>   tools/libxl/libxl_types.idl    |   7 ++
>>   9 files changed, 303 insertions(+), 18 deletions(-)
>>
>> diff --git a/tools/libxc/include/xenctrl.h b/tools/libxc/include/xenctrl.h
>> index 59bbe06..299b95f 100644
>> --- a/tools/libxc/include/xenctrl.h
>> +++ b/tools/libxc/include/xenctrl.h
>> @@ -2053,6 +2053,14 @@ int xc_get_device_group(xc_interface *xch,
>>                        uint32_t *num_sdevs,
>>                        uint32_t *sdev_array);
>>
>> +struct xen_reserved_device_memory
>> +*xc_device_get_rdm(xc_interface *xch,
>> +                   uint32_t flag,
>> +                   uint16_t seg,
>> +                   uint8_t bus,
>> +                   uint8_t devfn,
>> +                   unsigned int *nr_entries);
>> +
>>   int xc_test_assign_device(xc_interface *xch,
>>                             uint32_t domid,
>>                             uint32_t machine_sbdf);
>> diff --git a/tools/libxc/include/xenguest.h b/tools/libxc/include/xenguest.h
>> index 40bbac8..0f1c23b 100644
>> --- a/tools/libxc/include/xenguest.h
>> +++ b/tools/libxc/include/xenguest.h
>> @@ -218,7 +218,8 @@ struct xc_hvm_firmware_module {
>>   };
>>
>>   struct xc_hvm_build_args {
>> -    uint64_t mem_size;           /* Memory size in bytes. */
>> +    uint64_t lowmem_size;        /* All low memory size in bytes. */
>> +    uint64_t mem_size;           /* All memory size in bytes. */
>
> Do you find the comment confusing? I think there is no need to change
> the comment of mem_size.

Okay.

>
>>       uint64_t mem_target;         /* Memory target in bytes. */
>>       uint64_t mmio_size;          /* Size of the MMIO hole in bytes. */
>>       const char *image_file_name; /* File name of the image to load. */
>> diff --git a/tools/libxc/xc_domain.c b/tools/libxc/xc_domain.c
>> index 4f8383e..85b18ea 100644
>> --- a/tools/libxc/xc_domain.c
>> +++ b/tools/libxc/xc_domain.c
>> @@ -1665,6 +1665,46 @@ int xc_assign_device(
>>       return do_domctl(xch, &domctl);
>>   }
>>
>> +struct xen_reserved_device_memory
>> +*xc_device_get_rdm(xc_interface *xch,
>> +                   uint32_t flag,
>> +                   uint16_t seg,
>> +                   uint8_t bus,
>> +                   uint8_t devfn,
>> +                   unsigned int *nr_entries)
>> +{
>> +    struct xen_reserved_device_memory *xrdm = NULL;
>> +    int rc = xc_reserved_device_memory_map(xch, flag, seg, bus, devfn, xrdm,
>> +                                           nr_entries);
>> +
>> +    if ( rc < 0 )
>> +    {
>> +        if ( errno == ENOBUFS )
>> +        {
>> +            if ( (xrdm = malloc(*nr_entries *
>> +                                sizeof(xen_reserved_device_memory_t))) == NULL )
>> +            {
>> +                PERROR("Could not allocate memory.");
>> +                goto out;
>> +            }
>
> Don't you leak origin xrdm in this case?

The caller to xc_device_get_rdm always frees this.

>
> And, this style is not very good. Shouldn't the caller allocate enough
> memory before hand?

Are you saying the caller to xc_device_get_rdm()? If so, any caller 
don't know this, too.

Actually this is just a wrapper of that fundamental hypercall, 
xc_reserved_device_memory_map() in patch #2, and based on that, we 
always have to first call this to inquire how much memory we really 
need. And this is why we have this wrapper since we don't want to 
duplicate more codes.

One error handler of this wrapper is just handling ENOBUFS since the 
caller never know how much memory we should allocate. So oftentimes we 
always set 'entries = 0' to inquire firstly.

Here Jan suggested we may need to figure out a good way to consolidate 
xc_reserved_device_memory_map() and its wrapper, xc_device_get_rdm().

But in some ways, that wrapper likes a static function so we just need 
to move this into that file associated to its caller, right?

>
>> +            rc = xc_reserved_device_memory_map(xch, flag, seg, bus, devfn, xrdm,
>> +                                               nr_entries);
>> +            if ( rc )
>> +            {
>> +                PERROR("Could not get reserved device memory maps.");
>> +                free(xrdm);
>> +                xrdm = NULL;
>> +            }
>> +        }
>> +        else {
>> +            PERROR("Could not get reserved device memory maps.");
>> +        }
>> +    }
>> +
>> + out:
>> +    return xrdm;
>> +}
>> +
>>   int xc_get_device_group(
>>       xc_interface *xch,
>>       uint32_t domid,
>> diff --git a/tools/libxc/xc_hvm_build_x86.c b/tools/libxc/xc_hvm_build_x86.c
>> index c81a25b..3f87bb3 100644
>> --- a/tools/libxc/xc_hvm_build_x86.c
>> +++ b/tools/libxc/xc_hvm_build_x86.c
>> @@ -89,19 +89,16 @@ static int modules_init(struct xc_hvm_build_args *args,
>>   }
>>
>>   static void build_hvm_info(void *hvm_info_page, uint64_t mem_size,
>> -                           uint64_t mmio_start, uint64_t mmio_size)
>> +                           uint64_t lowmem_end)
>>   {
>>       struct hvm_info_table *hvm_info = (struct hvm_info_table *)
>>           (((unsigned char *)hvm_info_page) + HVM_INFO_OFFSET);
>> -    uint64_t lowmem_end = mem_size, highmem_end = 0;
>> +    uint64_t highmem_end = 0;
>>       uint8_t sum;
>>       int i;
>>
>> -    if ( lowmem_end > mmio_start )
>> -    {
>> -        highmem_end = (1ull<<32) + (lowmem_end - mmio_start);
>> -        lowmem_end = mmio_start;
>> -    }
>> +    if ( mem_size > lowmem_end )
>> +        highmem_end = (1ull<<32) + (mem_size - lowmem_end);
>>
>>       memset(hvm_info_page, 0, PAGE_SIZE);
>>
>> @@ -252,7 +249,7 @@ static int setup_guest(xc_interface *xch,
>>       void *hvm_info_page;
>>       uint32_t *ident_pt;
>>       struct elf_binary elf;
>> -    uint64_t v_start, v_end;
>> +    uint64_t v_start, v_end, v_lowend, lowmem_end;
>>       uint64_t m_start = 0, m_end = 0;
>>       int rc;
>>       xen_capabilities_info_t caps;
>> @@ -275,6 +272,7 @@ static int setup_guest(xc_interface *xch,
>>       elf_parse_binary(&elf);
>>       v_start = 0;
>>       v_end = args->mem_size;
>> +    v_lowend = lowmem_end = args->lowmem_size;
>>
>>       if ( xc_version(xch, XENVER_capabilities, &caps) != 0 )
>>       {
>> @@ -302,8 +300,11 @@ static int setup_guest(xc_interface *xch,
>>
>>       for ( i = 0; i < nr_pages; i++ )
>>           page_array[i] = i;
>> -    for ( i = mmio_start >> PAGE_SHIFT; i < nr_pages; i++ )
>> -        page_array[i] += mmio_size >> PAGE_SHIFT;
>> +    /*
>> +     * This condition 'lowmem_end <= mmio_start' is always true.
>> +     */
>> +    for ( i = lowmem_end >> PAGE_SHIFT; i < nr_pages; i++ )
>> +        page_array[i] += ((1ull << 32) - lowmem_end) >> PAGE_SHIFT;
>>
>>       /*
>>        * Try to claim pages for early warning of insufficient memory available.
>> @@ -469,7 +470,7 @@ static int setup_guest(xc_interface *xch,
>>                 xch, dom, PAGE_SIZE, PROT_READ | PROT_WRITE,
>>                 HVM_INFO_PFN)) == NULL )
>>           goto error_out;
>> -    build_hvm_info(hvm_info_page, v_end, mmio_start, mmio_size);
>> +    build_hvm_info(hvm_info_page, v_end, v_lowend);
>>       munmap(hvm_info_page, PAGE_SIZE);
>>
>>       /* Allocate and clear special pages. */
>> @@ -588,9 +589,6 @@ int xc_hvm_build(xc_interface *xch, uint32_t domid,
>>       if ( args.mem_target == 0 )
>>           args.mem_target = args.mem_size;
>>
>> -    if ( args.mmio_size == 0 )
>> -        args.mmio_size = HVM_BELOW_4G_MMIO_LENGTH;
>> -
>>       /* An HVM guest must be initialised with at least 2MB memory. */
>>       if ( args.mem_size < (2ull << 20) || args.mem_target < (2ull << 20) )
>>           return -1;
>> @@ -634,6 +632,8 @@ int xc_hvm_build_target_mem(xc_interface *xch,
>>       args.mem_size = (uint64_t)memsize << 20;
>>       args.mem_target = (uint64_t)target << 20;
>>       args.image_file_name = image_name;
>> +    if ( args.mmio_size == 0 )
>> +        args.mmio_size = HVM_BELOW_4G_MMIO_LENGTH;
>>
>>       return xc_hvm_build(xch, domid, &args);
>>   }
>> diff --git a/tools/libxl/libxl_create.c b/tools/libxl/libxl_create.c
>> index 9ed40d4..eab86fd 100644
>> --- a/tools/libxl/libxl_create.c
>> +++ b/tools/libxl/libxl_create.c
>> @@ -430,7 +430,7 @@ int libxl__domain_build(libxl__gc *gc,
>>
>>       switch (info->type) {
>>       case LIBXL_DOMAIN_TYPE_HVM:
>> -        ret = libxl__build_hvm(gc, domid, info, state);
>> +        ret = libxl__build_hvm(gc, domid, d_config, state);
>>           if (ret)
>>               goto out;
>>
>> diff --git a/tools/libxl/libxl_dm.c b/tools/libxl/libxl_dm.c
>> index a8b08f2..9ad04ae 100644
>> --- a/tools/libxl/libxl_dm.c
>> +++ b/tools/libxl/libxl_dm.c
>> @@ -90,6 +90,201 @@ const char *libxl__domain_device_model(libxl__gc *gc,
>>       return dm;
>>   }
>>
>> +/*
>> + * Check whether there exists rdm hole in the specified memory range.
>> + * Returns 1 if exists, else returns 0.
>> + */
>> +static int check_rdm_hole(uint64_t start, uint64_t memsize,
>> +                          uint64_t rdm_start, uint64_t rdm_size)
>> +{
>> +    if ( start + memsize <= rdm_start || start >= rdm_start + rdm_size )
>> +        return 0;
>> +    else
>> +        return 1;
>> +}
>> +
>> +/*
>> + * Check reported RDM regions and handle potential gfn conflicts according
>> + * to user preferred policy.
>> + */
>> +int libxl__domain_device_check_rdm(libxl__gc *gc,
>> +                                   libxl_domain_config *d_config,
>> +                                   uint64_t rdm_mem_guard,
>
> rdm_mem_boundary

Okay.

>
>> +                                   struct xc_hvm_build_args *args)
>
> This function does more than "checking", so a better name is needed.
>
> May be you should split this function to one "build" function and one
> "check" function? What do you think?

We'd better keep this big one since this can make our policy 
understandable, but I agree we need to rename this like,

libxl__domain_device_construct_rdm()

construct = check + build :)

>
>> +{
>> +    int i, j, conflict;
>> +    libxl_ctx *ctx = libxl__gc_owner(gc);
>
> You can just use CTX macro.

Good to know.

>
>> +    struct xen_reserved_device_memory *xrdm = NULL;
>> +    unsigned int nr_all_rdms = 0;
>> +    uint64_t rdm_start, rdm_size, highmem_end = (1ULL << 32);
>> +    uint32_t type = d_config->b_info.rdm.type;
>> +    uint16_t seg;
>> +    uint8_t bus, devfn;
>> +
>> +    /* Might not to expose rdm. */
>
> Delete "to".

Fix.

>
>> +    if ((type == LIBXL_RDM_RESERVE_TYPE_NONE) && !d_config->num_pcidevs)
>> +        return 0;
>> +
>> +    /* Collect all rdm info if exist. */
>> +    xrdm = xc_device_get_rdm(ctx->xch, LIBXL_RDM_RESERVE_TYPE_HOST,
>> +                             0, 0, 0, &nr_all_rdms);
>> +    if (!nr_all_rdms)
>> +        return 0;
>> +    d_config->rdms = libxl__calloc(gc, nr_all_rdms,
>> +                                   sizeof(libxl_device_rdm));
>
> Note that if you use "gc" here the allocated memory will be, well,
> garbage collected at some point. If you don't want them to be gc'ed you
> should use NOGC.

Sorry, what does that mean by 'garbage collected'?

>
>> +    memset(d_config->rdms, 0, sizeof(libxl_device_rdm));
>> +
>> +    /* Query all RDM entries in this platform */
>> +    if (type == LIBXL_RDM_RESERVE_TYPE_HOST) {
>> +        d_config->num_rdms = nr_all_rdms;
>> +        for (i = 0; i < d_config->num_rdms; i++) {
>> +            d_config->rdms[i].start =
>> +                                (uint64_t)xrdm[i].start_pfn << XC_PAGE_SHIFT;
>> +            d_config->rdms[i].size =
>> +                                (uint64_t)xrdm[i].nr_pages << XC_PAGE_SHIFT;
>> +            d_config->rdms[i].flag = d_config->b_info.rdm.reserve;
>> +        }
>> +    } else {
>> +        d_config->num_rdms = 0;
>> +    }
>
> And you should move d_config->rdms = libxl__calloc inside that `if'.
> That is, don't allocate memory if you don't need it.

We can't since in all cases we need to preallocate this, and then we 
will handle this according to our policy.

>
>> +    free(xrdm);
>> +
>> +    /* Query RDM entries per-device */
>> +    for (i = 0; i < d_config->num_pcidevs; i++) {
>> +        unsigned int nr_entries = 0;

Maybe I should restate this,
	unsigned int nr_entries;

>> +        bool new = true;
>> +        seg = d_config->pcidevs[i].domain;
>> +        bus = d_config->pcidevs[i].bus;
>> +        devfn = PCI_DEVFN(d_config->pcidevs[i].dev, d_config->pcidevs[i].func);
>> +        nr_entries = 0;
>
> You've already initialised this variable.

We need to set this as zero to start.

>
>> +        xrdm = xc_device_get_rdm(ctx->xch, LIBXL_RDM_RESERVE_TYPE_NONE,
>> +                                 seg, bus, devfn, &nr_entries);
>> +        /* No RDM to associated with this device. */
>> +        if (!nr_entries)
>> +            continue;
>> +
>> +        /* Need to check whether this entry is already saved in the array.
>> +         * This could come from two cases:
>> +         *
>> +         *   - user may configure to get all RMRRs in this platform, which
>> +         * is already queried before this point
>
> Formatting.

Are you saying this?

+        /* Need to check whether this entry is already saved in the array.

=>
         /* 

          * Need to check whether this entry is already saved in the 
array.
          * This could come from two cases:

>
>> +         *   - or two assigned devices may share one RMRR entry
>> +         *
>> +         * different policies may be configured on the same RMRR due to above
>> +         * two cases. We choose a simple policy to always favor stricter policy
>> +         */
>> +        for (j = 0; j < d_config->num_rdms; j++) {
>> +            if (d_config->rdms[j].start ==
>> +                                (uint64_t)xrdm[0].start_pfn << XC_PAGE_SHIFT)
>> +             {
>> +                if (d_config->rdms[j].flag != LIBXL_RDM_RESERVE_FLAG_FORCE)
>> +                    d_config->rdms[j].flag = d_config->pcidevs[i].rdm_reserve;
>> +                new = false;
>> +                break;
>> +            }
>> +        }
>> +
>> +        if (new) {
>> +            if (d_config->num_rdms > nr_all_rdms - 1) {
>> +                LIBXL__LOG(CTX, LIBXL__LOG_ERROR, "Exceed rdm array boundary!\n");
>
> LOG(ERROR, ...)

Fixed.

>
>> +                free(xrdm);
>> +                return -1;
>
> Please use goto out idiom.

We just have two 'return -1' differently so I'm not sure its worth doing 
this.

>
>> +            }
>> +
>> +            /*
>> +             * This is a new entry.
>> +             */
>
> /* This is a new entry. */

Fixed.

>
>> +            d_config->rdms[d_config->num_rdms].start =
>> +                                (uint64_t)xrdm[0].start_pfn << XC_PAGE_SHIFT;
>> +            d_config->rdms[d_config->num_rdms].size =
>> +                                (uint64_t)xrdm[0].nr_pages << XC_PAGE_SHIFT;
>> +            d_config->rdms[d_config->num_rdms].flag = d_config->pcidevs[i].rdm_reserve;
>> +            d_config->num_rdms++;
>
> Does this work? I don't see you reallocate memory.

Like I replied above we always preallocate this at the beginning.

>
>> +        }
>> +        free(xrdm);
>
> Bug: you free xrdm several times.

No any conflict.

What I did is that I would free once I finish to calling every 
xc_device_get_rdm().

>
> Please fix  obvious bugs and do simple tests before posting. That would
> save you another round of back and forth.
>
>> +    }
>> +
>> +    /* Fix highmem. */
>> +    if (args->mem_size > args->lowmem_size)
>> +        highmem_end += (args->mem_size - args->lowmem_size);
>
> It would be clearer if you just do
>      highmem_end = (1ULL << 32) + (args->mem_size - args->lowmem_size);

Okay.

>
> And please have a blank line in between this assignment and following
> comment.

As Jan commented previously I already push this at the beginning of this 
function.

>
>> +    /* Next step is to check and avoid potential conflict between RDM entries
>> +     * and guest RAM. To avoid intrusive impact to existing memory layout
>> +     * {lowmem, mmio, highmem} which is passed around various function blocks,
>> +     * below conflicts are not handled which are rare and handling them would
>> +     * lead to a more scattered layout:
>> +     *  - RMRR in highmem area (>4G)
>> +     *  - RMRR lower than a defined memory boundary (e.g. 2G)
>> +     * Otherwise for conflicts between boundary and 4G, we'll simply move lowmem
>> +     * end below reserved region to solve conflict.
>> +     *
>> +     * If a conflict is detected on a given RMRR entry, an error will be
>> +     * returned.
>> +     * If 'force' policy is specified. Or conflict is treated as a warning if
>> +     * 'try' policy is specified, and we also mark this as INVALID not to expose
>> +     * this entry to hvmloader.
>> +     *
>> +     * Firstly we should check the case of rdm < 4G because we may need to
>> +     * expand highmem_end.
>
> Is this strategy agreed in previous discussion? How future-proof is this

Yes, this is based on that design.

> strategy?

I don't see any ovious futhure proof is prompted in this RFC.

>
>> +     */
>> +    for (i = 0; i < d_config->num_rdms; i++) {
>> +        rdm_start = d_config->rdms[i].start;
>> +        rdm_size = d_config->rdms[i].size;
>> +        conflict = check_rdm_hole(0, args->lowmem_size, rdm_start, rdm_size);
>> +
>> +        if (!conflict)
>> +            continue;
>> +
>> +        /*
>> +         * Just check if RDM > our memory boundary
>> +         */
>
> A one line comment is good enough.

Fixed.

>
>> +        if (d_config->rdms[i].start > rdm_mem_guard) {
>> +            /*
>> +             * We will move downwards lowmem_end so we have to expand
>> +             * highmem_end.
>> +             */
>> +            highmem_end += (args->lowmem_size - rdm_start);
>> +            /* Now move downwards lowmem_end. */
>> +            args->lowmem_size = rdm_start;
>> +        }
>> +    }
>> +
>> +    /*
>> +     * Finally we can take same policy to check lowmem(< 2G) and
>> +     * highmem adjusted above.
>> +     */
>> +    for (i = 0; i < d_config->num_rdms; i++) {
>> +        rdm_start = d_config->rdms[i].start;
>> +        rdm_size = d_config->rdms[i].size;
>> +        /* Does this entry conflict with lowmem? */
>> +        conflict = check_rdm_hole(0, args->lowmem_size,
>> +                                  rdm_start, rdm_size);
>> +        /* Does this entry conflict with highmem? */
>> +        conflict |= check_rdm_hole((1ULL<<32), highmem_end,
>> +                                   rdm_start, rdm_size);
>> +
>> +        if (!conflict)
>> +            continue;
>> +
>> +        if(d_config->rdms[i].flag == LIBXL_RDM_RESERVE_FLAG_FORCE) {
>> +            LIBXL__LOG(CTX, LIBXL__LOG_ERROR, "RDM conflict at 0x%lx.\n",
>> +                                                d_config->rdms[i].start);
>
> LOG(ERROR, ...)

Fixed.

>
>> +            return -1;
>
> Use goto out style.

See above.

>
>> +        } else {
>> +            LIBXL__LOG(CTX, LIBXL__LOG_WARNING,
>> +                                        "Ignoring RDM conflict at 0x%lx.\n",
>> +                                        d_config->rdms[i].start);
>> +
>> +            /*
>> +             * Then mask this INVALID to indicate we shouldn't expose this
>> +             * to hvmloader.
>> +             */
>> +            d_config->rdms[i].flag = LIBXL_RDM_RESERVE_FLAG_INVALID;
>> +        }
>> +    }
>> +
>> +    return 0;
>> +}
>> +
>>   const libxl_vnc_info *libxl__dm_vnc(const libxl_domain_config *guest_config)
>>   {
>>       const libxl_vnc_info *vnc = NULL;
>> diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c
>> index a16d4a1..ee67282 100644
>> --- a/tools/libxl/libxl_dom.c
>> +++ b/tools/libxl/libxl_dom.c
>> @@ -788,12 +788,14 @@ out:
>>   }
>>
>>   int libxl__build_hvm(libxl__gc *gc, uint32_t domid,
>> -              libxl_domain_build_info *info,
>> +              libxl_domain_config *d_config,
>>                 libxl__domain_build_state *state)
>>   {
>>       libxl_ctx *ctx = libxl__gc_owner(gc);
>>       struct xc_hvm_build_args args = {};
>>       int ret, rc = ERROR_FAIL;
>> +    libxl_domain_build_info *const info = &d_config->b_info;
>> +    uint64_t rdm_mem_boundary, mmio_start;
>>
>>       memset(&args, 0, sizeof(struct xc_hvm_build_args));
>>       /* The params from the configuration file are in Mb, which are then
>> @@ -802,6 +804,8 @@ int libxl__build_hvm(libxl__gc *gc, uint32_t domid,
>>        * Do all this in one step here...
>>        */
>>       args.mem_size = (uint64_t)(info->max_memkb - info->video_memkb) << 10;
>> +    args.lowmem_size = (args.mem_size > (1ULL << 32)) ?
>> +                            (1ULL << 32) : args.mem_size;
>>       args.mem_target = (uint64_t)(info->target_memkb - info->video_memkb) << 10;
>>       args.claim_enabled = libxl_defbool_val(info->claim_mode);
>>       if (info->u.hvm.mmio_hole_memkb) {
>> @@ -811,6 +815,27 @@ int libxl__build_hvm(libxl__gc *gc, uint32_t domid,
>>           if (max_ram_below_4g < HVM_BELOW_4G_MMIO_START)
>>               args.mmio_size = info->u.hvm.mmio_hole_memkb << 10;
>>       }
>> +
>> +    if (args.mmio_size == 0)
>> +        args.mmio_size = HVM_BELOW_4G_MMIO_LENGTH;
>> +    mmio_start = (1ull << 32) - args.mmio_size;
>> +
>> +    if (args.lowmem_size > mmio_start)
>> +        args.lowmem_size = mmio_start;
>> +
>> +    /*
>> +     * We'd like to set a memory boundary to determine if we need to check
>> +     * any overlap with reserved device memory.
>> +     *
>> +     * TODO: we will add a config parameter for this boundary value next.
>> +     */
>
> Yes, please do so and properly document it.

Yes, I will try.

Thanks
Tiejun

>
>> +    rdm_mem_boundary = 0x80000000;
>> +    ret = libxl__domain_device_check_rdm(gc, d_config, rdm_mem_boundary, &args);
>> +    if (ret) {
>> +        LOG(ERROR, "checking reserved device memory failed");
>> +        goto out;
>> +    }
>> +
>>       if (libxl__domain_firmware(gc, info, &args)) {
>>           LOG(ERROR, "initializing domain firmware failed");
>>           goto out;
>> diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h
>> index 934465a..64d05b3 100644
>> --- a/tools/libxl/libxl_internal.h
>> +++ b/tools/libxl/libxl_internal.h
>> @@ -985,7 +985,7 @@ _hidden int libxl__build_post(libxl__gc *gc, uint32_t domid,
>>   _hidden int libxl__build_pv(libxl__gc *gc, uint32_t domid,
>>                libxl_domain_build_info *info, libxl__domain_build_state *state);
>>   _hidden int libxl__build_hvm(libxl__gc *gc, uint32_t domid,
>> -              libxl_domain_build_info *info,
>> +              libxl_domain_config *d_config,
>>                 libxl__domain_build_state *state);
>>
>>   _hidden int libxl__qemu_traditional_cmd(libxl__gc *gc, uint32_t domid,
>> @@ -1480,6 +1480,15 @@ _hidden int libxl__need_xenpv_qemu(libxl__gc *gc,
>>           int nr_channels, libxl_device_channel *channels);
>>
>>   /*
>> + * This function will fix reserved device memory conflict
>> + * according to user's configuration.
>> + */
>> +_hidden int libxl__domain_device_check_rdm(libxl__gc *gc,
>> +                                   libxl_domain_config *d_config,
>> +                                   uint64_t rdm_mem_guard,
>> +                                   struct xc_hvm_build_args *args);
>> +
>> +/*
>>    * This function will cause the whole libxl process to hang
>>    * if the device model does not respond.  It is deprecated.
>>    *
>> diff --git a/tools/libxl/libxl_types.idl b/tools/libxl/libxl_types.idl
>> index 5786455..218931b 100644
>> --- a/tools/libxl/libxl_types.idl
>> +++ b/tools/libxl/libxl_types.idl
>> @@ -541,6 +541,12 @@ libxl_device_pci = Struct("device_pci", [
>>       ("rdm_reserve",   libxl_rdm_reserve_flag),
>>       ])
>>
>> +libxl_device_rdm = Struct("device_rdm", [
>> +    ("start", uint64),
>> +    ("size", uint64),
>> +    ("flag", bool),
>> +    ])
>> +
>>   libxl_device_vtpm = Struct("device_vtpm", [
>>       ("backend_domid",    libxl_domid),
>>       ("backend_domname",  string),
>> @@ -567,6 +573,7 @@ libxl_domain_config = Struct("domain_config", [
>>       ("disks", Array(libxl_device_disk, "num_disks")),
>>       ("nics", Array(libxl_device_nic, "num_nics")),
>>       ("pcidevs", Array(libxl_device_pci, "num_pcidevs")),
>> +    ("rdms", Array(libxl_device_rdm, "num_rdms")),
>>       ("vfbs", Array(libxl_device_vfb, "num_vfbs")),
>>       ("vkbs", Array(libxl_device_vkb, "num_vkbs")),
>>       ("vtpms", Array(libxl_device_vtpm, "num_vtpms")),
>> --
>> 1.9.1
>

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC][PATCH 07/13] xen/passthrough: extend hypercall to support rdm reservation policy
  2015-04-20 13:36   ` Jan Beulich
@ 2015-05-11  8:37     ` Chen, Tiejun
  0 siblings, 0 replies; 125+ messages in thread
From: Chen, Tiejun @ 2015-05-11  8:37 UTC (permalink / raw)
  To: Jan Beulich
  Cc: tim, kevin.tian, wei.liu2, ian.campbell, andrew.cooper3,
	Ian.Jackson, xen-devel, stefano.stabellini, yang.z.zhang

Sorry for this delay response.

On 2015/4/20 21:36, Jan Beulich wrote:
>>>> On 10.04.15 at 11:21, <tiejun.chen@intel.com> wrote:
>> --- a/xen/drivers/passthrough/vtd/iommu.c
>> +++ b/xen/drivers/passthrough/vtd/iommu.c
>> @@ -1793,8 +1793,14 @@ static void iommu_set_pgd(struct domain *d)
>>       hd->arch.pgd_maddr = pagetable_get_paddr(pagetable_from_mfn(pgd_mfn));
>>   }
>>
>> +/*
>> + * In some cases, e.g. add a device to hwdomain, and remove a device from
>> + * user domain, 'try' is fine enough since this is always safe to hwdomain.
>> + */
>> +#define XEN_DOMCTL_PCIDEV_RDM_DEFAULT XEN_DOMCTL_PCIDEV_RDM_TRY
>
> Then why invent this one instead of just using ..._TRY at the respective
> call sites.

I just want to own one place to guarantee this behavior if we want to 
change something...

But now I can use XEN_DOMCTL_PCIDEV_RDM_TRY directly.

>
>> @@ -1851,7 +1857,14 @@ static int rmrr_identity_mapping(struct domain *d, bool_t map,
>>           if ( !is_hardware_domain(d) )
>>           {
>>               if ( (err = set_identity_p2m_entry(d, base_pfn, p2m_access_rw)) )
>> -                return err;
>> +            {
>> +                if ( flag == XEN_DOMCTL_PCIDEV_RDM_TRY )
>> +                {
>> +                    printk(XENLOG_G_WARNING "Some devices may work failed .\n");
>> +                }
>
> Iirc someone else already pointed out that this message needs to
> change to something understandable. Perhaps it should also log
> the PFN causing the error. And the braces here should be dropped
> (if you inverted the condition you wouldn't even need an "else"; or
> wait - this shouldn't be just "else", as you shouldn't imply anything
> other than ..._TRY means ..._FORCE, albeit the respective check
> perhaps belongs in the generic IOMMU code, not here).

I guess we already achieved this agreement in other thread with 
discussion involved with you and Tim.

>
>> --- a/xen/include/public/domctl.h
>> +++ b/xen/include/public/domctl.h
>> @@ -493,6 +493,10 @@ DEFINE_XEN_GUEST_HANDLE(xen_domctl_sendtrigger_t);
>>   /* XEN_DOMCTL_deassign_device */
>>   struct xen_domctl_assign_device {
>>       uint32_t  machine_sbdf;   /* machine PCI ID of assigned device */
>> +    /* IN */
>> +#define XEN_DOMCTL_PCIDEV_RDM_TRY       0
>> +#define XEN_DOMCTL_PCIDEV_RDM_FORCE     1
>> +    uint32_t  sbdf_flag;   /* flag of assigned device */
>
> Why do you call this "sbdf_flag" - it holds nothing SBDF-like.
>

I can remove 'sbdf_' to make sure this is not confused.

Thanks
Tiejun

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC][PATCH 07/13] xen/passthrough: extend hypercall to support rdm reservation policy
  2015-05-08 16:07   ` Julien Grall
@ 2015-05-11  8:42     ` Chen, Tiejun
  2015-05-11  9:51       ` Julien Grall
  0 siblings, 1 reply; 125+ messages in thread
From: Chen, Tiejun @ 2015-05-11  8:42 UTC (permalink / raw)
  To: Julien Grall, JBeulich, tim, konrad.wilk, andrew.cooper3,
	kevin.tian, yang.z.zhang, ian.campbell, wei.liu2, Ian.Jackson,
	stefano.stabellini
  Cc: xen-devel

On 2015/5/9 0:07, Julien Grall wrote:
> Hi,
>
> On 10/04/15 10:21, Tiejun Chen wrote:
>> diff --git a/xen/include/public/domctl.h b/xen/include/public/domctl.h
>> index ca0e51e..e5ba7cb 100644
>> --- a/xen/include/public/domctl.h
>> +++ b/xen/include/public/domctl.h
>> @@ -493,6 +493,10 @@ DEFINE_XEN_GUEST_HANDLE(xen_domctl_sendtrigger_t);
>>   /* XEN_DOMCTL_deassign_device */
>>   struct xen_domctl_assign_device {
>>       uint32_t  machine_sbdf;   /* machine PCI ID of assigned device */
>> +    /* IN */
>> +#define XEN_DOMCTL_PCIDEV_RDM_TRY       0
>> +#define XEN_DOMCTL_PCIDEV_RDM_FORCE     1
>> +    uint32_t  sbdf_flag;   /* flag of assigned device */
>>   };
>>   typedef struct xen_domctl_assign_device xen_domctl_assign_device_t;
>>   DEFINE_XEN_GUEST_HANDLE(xen_domctl_assign_device_t);
>> diff --git a/xen/include/xen/iommu.h b/xen/include/xen/iommu.h
>> index 8565b82..0d10b3d 100644
>> --- a/xen/include/xen/iommu.h
>> +++ b/xen/include/xen/iommu.h
>> @@ -129,7 +129,7 @@ struct iommu_ops {
>>       int (*add_device)(u8 devfn, device_t *dev);
>>       int (*enable_device)(device_t *dev);
>>       int (*remove_device)(u8 devfn, device_t *dev);
>> -    int (*assign_device)(struct domain *, u8 devfn, device_t *dev);
>> +    int (*assign_device)(struct domain *, u8 devfn, device_t *dev, u32 flag);
>
> You need to update the ARM code with this new prototype:

Thanks for your review.

>
> xen/drivers/passthrough/device_tree.c
> xen/drivers/passthrough/arm/smmu.c
>

Is this fine to you?

diff --git a/xen/drivers/passthrough/arm/smmu.c 
b/xen/drivers/passthrough/arm/smmu.c
index 8a9b58b..a3e6383 100644
--- a/xen/drivers/passthrough/arm/smmu.c
+++ b/xen/drivers/passthrough/arm/smmu.c
@@ -2599,7 +2599,7 @@ static void arm_smmu_destroy_iommu_domain(struct 
iommu_domain *domain)
  }

  static int arm_smmu_assign_dev(struct domain *d, u8 devfn,
-                              struct device *dev)
+                              struct device *dev, u32 flag)
  {
         struct iommu_domain *domain;
         struct arm_smmu_xen_domain *xen_domain;
diff --git a/xen/drivers/passthrough/device_tree.c 
b/xen/drivers/passthrough/device_tree.c
index 377d41d..97e7fc5 100644
--- a/xen/drivers/passthrough/device_tree.c
+++ b/xen/drivers/passthrough/device_tree.c
@@ -41,7 +41,8 @@ int iommu_assign_dt_device(struct domain *d, struct 
dt_device_node *dev)
      if ( !list_empty(&dev->domain_list) )
          goto fail;

-    rc = hd->platform_ops->assign_device(d, 0, dt_to_dev(dev));
+    rc = hd->platform_ops->assign_device(d, 0, dt_to_dev(dev),
+                                         XEN_DOMCTL_PCIDEV_RDM_TRY);

      if ( rc )
          goto fail;

Thanks
Tiejun

^ permalink raw reply related	[flat|nested] 125+ messages in thread

* Re: [RFC][PATCH 08/13] tools: extend xc_assign_device() to support rdm reservation policy
  2015-04-20 13:39   ` Jan Beulich
@ 2015-05-11  9:45     ` Chen, Tiejun
  2015-05-11 10:53       ` Jan Beulich
  0 siblings, 1 reply; 125+ messages in thread
From: Chen, Tiejun @ 2015-05-11  9:45 UTC (permalink / raw)
  To: Jan Beulich
  Cc: tim, kevin.tian, wei.liu2, ian.campbell, andrew.cooper3,
	Ian.Jackson, xen-devel, stefano.stabellini, yang.z.zhang

On 2015/4/20 21:39, Jan Beulich wrote:
>>>> On 10.04.15 at 11:21, <tiejun.chen@intel.com> wrote:
>> --- a/tools/libxc/xc_domain.c
>> +++ b/tools/libxc/xc_domain.c
>> @@ -1654,13 +1654,15 @@ int xc_domain_setdebugging(xc_interface *xch,
>>   int xc_assign_device(
>>       xc_interface *xch,
>>       uint32_t domid,
>> -    uint32_t machine_sbdf)
>> +    uint32_t machine_sbdf,
>> +    uint32_t flag)
>>   {
>>       DECLARE_DOMCTL;
>>
>>       domctl.cmd = XEN_DOMCTL_assign_device;
>>       domctl.domain = domid;
>>       domctl.u.assign_device.machine_sbdf = machine_sbdf;
>> +    domctl.u.assign_device.sbdf_flag = flag;
>
> The previous patch needs to initialize this field, in order to not pass
> random input to the hypervisor. Using the ..._TRY value here
> intermediately (until this patch gets applied) would seem the right
> approach.
>

If I'm correct, looks I should introduce a little of change in previous 
patch,

diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
index 250d1e4..0bcfd87 100644
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -1513,7 +1513,7 @@ int iommu_do_pci_domctl(
  {
      u16 seg;
      u8 bus, devfn;
-    u32 flag;
+    u32 flag = XEN_DOMCTL_PCIDEV_RDM_TRY;
      int ret = 0;

      switch ( domctl->cmd )
@@ -1582,7 +1582,6 @@ int iommu_do_pci_domctl(
          seg = domctl->u.assign_device.machine_sbdf >> 16;
          bus = (domctl->u.assign_device.machine_sbdf >> 8) & 0xff;
          devfn = domctl->u.assign_device.machine_sbdf & 0xff;
-        flag = domctl->u.assign_device.flag;

          ret = device_assigned(seg, bus, devfn) ?:
                assign_device(d, seg, bus, devfn, flag);

Then in this patch,

@@ -1513,7 +1513,7 @@ int iommu_do_pci_domctl(
  {
      u16 seg;
      u8 bus, devfn;
-    u32 flag = XEN_DOMCTL_PCIDEV_RDM_TRY;
+    u32 flag;
      int ret = 0;

      switch ( domctl->cmd )
@@ -1582,6 +1582,7 @@ int iommu_do_pci_domctl(
          seg = domctl->u.assign_device.machine_sbdf >> 16;
          bus = (domctl->u.assign_device.machine_sbdf >> 8) & 0xff;
          devfn = domctl->u.assign_device.machine_sbdf & 0xff;
+        flag = domctl->u.assign_device.flag;

          ret = device_assigned(seg, bus, devfn) ?:
                assign_device(d, seg, bus, devfn, flag);

Thanks
Tiejun

^ permalink raw reply related	[flat|nested] 125+ messages in thread

* Re: [RFC][PATCH 03/13] tools/libxc: Expose new hypercall xc_reserved_device_memory_map
  2015-05-11  5:36     ` Chen, Tiejun
@ 2015-05-11  9:50       ` Wei Liu
  0 siblings, 0 replies; 125+ messages in thread
From: Wei Liu @ 2015-05-11  9:50 UTC (permalink / raw)
  To: Chen, Tiejun
  Cc: kevin.tian, Wei Liu, ian.campbell, andrew.cooper3, tim,
	xen-devel, stefano.stabellini, JBeulich, yang.z.zhang,
	Ian.Jackson

On Mon, May 11, 2015 at 01:36:49PM +0800, Chen, Tiejun wrote:
> On 2015/5/8 21:07, Wei Liu wrote:
> >On Fri, Apr 10, 2015 at 05:21:54PM +0800, Tiejun Chen wrote:
> >>We will introduce the hypercall xc_reserved_device_memory_map
> >>approach to libxc. This helps us get rdm entry info according to
> >>different parameters. If flag == PCI_DEV_RDM_ALL, all entries
> >>should be exposed. Or we just expose that rdm entry specific to
> >>a SBDF.
> >>
> >>Signed-off-by: Tiejun Chen <tiejun.chen@intel.com>
> >
> >This patch contains a wrapper to the new hypercall. If the HV guys are
> >happy with this I'm fine with it too.
> >
> 
> Who should be pinged in this case?
> 

No need to ping. This patch depends on HV side interface being accepted
first. I will ack this patch when HV interface is acked.

Wei.

> Thanks
> Tiejun

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC][PATCH 07/13] xen/passthrough: extend hypercall to support rdm reservation policy
  2015-05-11  8:42     ` Chen, Tiejun
@ 2015-05-11  9:51       ` Julien Grall
  2015-05-11 10:57         ` Jan Beulich
  2015-05-14  5:47         ` Chen, Tiejun
  0 siblings, 2 replies; 125+ messages in thread
From: Julien Grall @ 2015-05-11  9:51 UTC (permalink / raw)
  To: Chen, Tiejun, Julien Grall, JBeulich, tim, konrad.wilk,
	andrew.cooper3, kevin.tian, yang.z.zhang, ian.campbell, wei.liu2,
	Ian.Jackson, stefano.stabellini
  Cc: xen-devel

Hi,

On 11/05/15 09:42, Chen, Tiejun wrote:
> diff --git a/xen/drivers/passthrough/arm/smmu.c
> b/xen/drivers/passthrough/arm/smmu.c
> index 8a9b58b..a3e6383 100644
> --- a/xen/drivers/passthrough/arm/smmu.c
> +++ b/xen/drivers/passthrough/arm/smmu.c
> @@ -2599,7 +2599,7 @@ static void arm_smmu_destroy_iommu_domain(struct
> iommu_domain *domain)
>  }
> 
>  static int arm_smmu_assign_dev(struct domain *d, u8 devfn,
> -                              struct device *dev)
> +                              struct device *dev, u32 flag)
>  {
>         struct iommu_domain *domain;
>         struct arm_smmu_xen_domain *xen_domain;
> diff --git a/xen/drivers/passthrough/device_tree.c
> b/xen/drivers/passthrough/device_tree.c
> index 377d41d..97e7fc5 100644
> --- a/xen/drivers/passthrough/device_tree.c
> +++ b/xen/drivers/passthrough/device_tree.c
> @@ -41,7 +41,8 @@ int iommu_assign_dt_device(struct domain *d, struct
> dt_device_node *dev)
>      if ( !list_empty(&dev->domain_list) )
>          goto fail;
> 
> -    rc = hd->platform_ops->assign_device(d, 0, dt_to_dev(dev));
> +    rc = hd->platform_ops->assign_device(d, 0, dt_to_dev(dev),
> +                                         XEN_DOMCTL_PCIDEV_RDM_TRY);

On ARM we can passthrough 2 different types of device: PCI device and
platform device described in the device tree (it's a tree representation
of the hardware).

This assign_device callback deals with the latter. So from the name the
value doesn't look right.

Regards,

-- 
Julien Grall

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC][PATCH 08/13] tools: extend xc_assign_device() to support rdm reservation policy
  2015-05-11  9:45     ` Chen, Tiejun
@ 2015-05-11 10:53       ` Jan Beulich
  2015-05-14  7:04         ` Chen, Tiejun
  0 siblings, 1 reply; 125+ messages in thread
From: Jan Beulich @ 2015-05-11 10:53 UTC (permalink / raw)
  To: Tiejun Chen
  Cc: tim, kevin.tian, wei.liu2, ian.campbell, andrew.cooper3,
	Ian.Jackson, xen-devel, stefano.stabellini, yang.z.zhang

>>> On 11.05.15 at 11:45, <tiejun.chen@intel.com> wrote:
> On 2015/4/20 21:39, Jan Beulich wrote:
>>>>> On 10.04.15 at 11:21, <tiejun.chen@intel.com> wrote:
>>> --- a/tools/libxc/xc_domain.c
>>> +++ b/tools/libxc/xc_domain.c
>>> @@ -1654,13 +1654,15 @@ int xc_domain_setdebugging(xc_interface *xch,
>>>   int xc_assign_device(
>>>       xc_interface *xch,
>>>       uint32_t domid,
>>> -    uint32_t machine_sbdf)
>>> +    uint32_t machine_sbdf,
>>> +    uint32_t flag)
>>>   {
>>>       DECLARE_DOMCTL;
>>>
>>>       domctl.cmd = XEN_DOMCTL_assign_device;
>>>       domctl.domain = domid;
>>>       domctl.u.assign_device.machine_sbdf = machine_sbdf;
>>> +    domctl.u.assign_device.sbdf_flag = flag;
>>
>> The previous patch needs to initialize this field, in order to not pass
>> random input to the hypervisor. Using the ..._TRY value here
>> intermediately (until this patch gets applied) would seem the right
>> approach.
>>
> 
> If I'm correct, looks I should introduce a little of change in previous 
> patch,
> 
> diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
> index 250d1e4..0bcfd87 100644
> --- a/xen/drivers/passthrough/pci.c
> +++ b/xen/drivers/passthrough/pci.c
> @@ -1513,7 +1513,7 @@ int iommu_do_pci_domctl(
>   {
>       u16 seg;
>       u8 bus, devfn;
> -    u32 flag;
> +    u32 flag = XEN_DOMCTL_PCIDEV_RDM_TRY;

Provided that constant is available already at that point (I didn't
check); if it isn't, you'd probably want to go with plain zero.

Jan

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC][PATCH 07/13] xen/passthrough: extend hypercall to support rdm reservation policy
  2015-05-11  9:51       ` Julien Grall
@ 2015-05-11 10:57         ` Jan Beulich
  2015-05-14  5:48           ` Chen, Tiejun
  2015-05-14  5:47         ` Chen, Tiejun
  1 sibling, 1 reply; 125+ messages in thread
From: Jan Beulich @ 2015-05-11 10:57 UTC (permalink / raw)
  To: Julien Grall, Tiejun Chen
  Cc: tim, kevin.tian, wei.liu2, ian.campbell, andrew.cooper3,
	Ian.Jackson, xen-devel, stefano.stabellini, yang.z.zhang

>>> On 11.05.15 at 11:51, <julien.grall@citrix.com> wrote:
> Hi,
> 
> On 11/05/15 09:42, Chen, Tiejun wrote:
>> diff --git a/xen/drivers/passthrough/arm/smmu.c
>> b/xen/drivers/passthrough/arm/smmu.c
>> index 8a9b58b..a3e6383 100644
>> --- a/xen/drivers/passthrough/arm/smmu.c
>> +++ b/xen/drivers/passthrough/arm/smmu.c
>> @@ -2599,7 +2599,7 @@ static void arm_smmu_destroy_iommu_domain(struct
>> iommu_domain *domain)
>>  }
>> 
>>  static int arm_smmu_assign_dev(struct domain *d, u8 devfn,
>> -                              struct device *dev)
>> +                              struct device *dev, u32 flag)
>>  {
>>         struct iommu_domain *domain;
>>         struct arm_smmu_xen_domain *xen_domain;
>> diff --git a/xen/drivers/passthrough/device_tree.c
>> b/xen/drivers/passthrough/device_tree.c
>> index 377d41d..97e7fc5 100644
>> --- a/xen/drivers/passthrough/device_tree.c
>> +++ b/xen/drivers/passthrough/device_tree.c
>> @@ -41,7 +41,8 @@ int iommu_assign_dt_device(struct domain *d, struct
>> dt_device_node *dev)
>>      if ( !list_empty(&dev->domain_list) )
>>          goto fail;
>> 
>> -    rc = hd->platform_ops->assign_device(d, 0, dt_to_dev(dev));
>> +    rc = hd->platform_ops->assign_device(d, 0, dt_to_dev(dev),
>> +                                         XEN_DOMCTL_PCIDEV_RDM_TRY);
> 
> On ARM we can passthrough 2 different types of device: PCI device and
> platform device described in the device tree (it's a tree representation
> of the hardware).
> 
> This assign_device callback deals with the latter. So from the name the
> value doesn't look right.

Yeah, the constant name probably shouldn't refer to PCI, but simply
to pass-through.

Jan

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC][PATCH 04/13] tools/libxl: detect and avoid conflicts with RDM
  2015-05-11  8:09     ` Chen, Tiejun
@ 2015-05-11 11:32       ` Wei Liu
  2015-05-14  8:27         ` Chen, Tiejun
  0 siblings, 1 reply; 125+ messages in thread
From: Wei Liu @ 2015-05-11 11:32 UTC (permalink / raw)
  To: Chen, Tiejun
  Cc: kevin.tian, Wei Liu, ian.campbell, andrew.cooper3, tim,
	xen-devel, stefano.stabellini, JBeulich, yang.z.zhang,
	Ian.Jackson

On Mon, May 11, 2015 at 04:09:53PM +0800, Chen, Tiejun wrote:
> On 2015/5/8 22:43, Wei Liu wrote:
> >Sorry for the late review. This series fell through the crack.
> >
> 
> Thanks for your review.
> 
> >On Fri, Apr 10, 2015 at 05:21:55PM +0800, Tiejun Chen wrote:
> >>While building a VM, HVM domain builder provides struct hvm_info_table{}
> >>to help hvmloader. Currently it includes two fields to construct guest
> >>e820 table by hvmloader, low_mem_pgend and high_mem_pgend. So we should
> >>check them to fix any conflict with RAM.
> >>
> >>RMRR can reside in address space beyond 4G theoretically, but we never
> >>see this in real world. So in order to avoid breaking highmem layout
> >
> >How does this break highmem layout?
> 
> In most cases highmen is always continuous like [4G, ...] but RMRR is
> theoretically beyond 4G but very rarely. Especially we don't see this
> happened in real world. So we don't want to such a case breaking the
> highmem.
> 

The problem is  we take this approach just because this rarely happens
*now* is not future proof.  It needs to be clearly documented somewhere
in the manual (or any other Intel docs) and be referenced in the code.
Otherwise we might end up in a situation that no-one knows how it is
supposed to work and no-one can fix it if it breaks in the future, that
is, every single device on earth requires RMRR > 4G overnight (I'm
exaggerating).

Or you can just make it works with highmem. How much more work do you
envisage?

(If my above comment makes no sense do let me know. I only have very
shallow understanding of RMRR)

> >
> >>we don't solve highmem conflict. Note this means highmem rmrr could still
> >>be supported if no conflict.
> >>
> >
> >Aren't you actively trying to avoid conflict in libxl?
> 
> RMRR is fixed by BIOS so we can't aovid conflict. Here we want to adopt some
> good policies to address RMRR. In the case of highmemt, that simple policy
> should be enough?
> 

Whatever policy you and HV maintainers agree on. Just clearly document
it.

> >
> >>But in the case of lowmem, RMRR probably scatter the whole RAM space.
> >>Especially multiple RMRR entries would worsen this to lead a complicated
> >>memory layout. And then its hard to extend hvm_info_table{} to work
> >>hvmloader out. So here we're trying to figure out a simple solution to
> >>avoid breaking existing layout. So when a conflict occurs,
> >>
> >>     #1. Above a predefined boundary (default 2G)
> >>         - move lowmem_end below reserved region to solve conflict;
> >>
> >
> >I hope this "predefined boundary" is user tunable. I will check later in
> >this patch if it is the case.
> 
> Yes. As we clarified in that comments,
> 
> * TODO: Its batter to provide a config parameter for this boundary value.
> 
> This mean I would provide a patch address this since currently I just think
> this is not a big deal?
> 

Yes please provide a config option to override that. It's reasonable
that user wants to change that.

> >
> >>     #2 Below a predefined boundary (default 2G)
> >>         - Check force/try policy.
> >>         "force" policy leads to fail libxl. Note when both policies
> >>         are specified on a given region, 'force' is always preferred.
> >>         "try" policy issue a warning message and also mask this entry INVALID
> >>         to indicate we shouldn't expose this entry to hvmloader.
> >>
> >>Signed-off-by: Tiejun Chen <tiejun.chen@intel.com>
> >>---
> >>  tools/libxc/include/xenctrl.h  |   8 ++
> >>  tools/libxc/include/xenguest.h |   3 +-
> >>  tools/libxc/xc_domain.c        |  40 +++++++++
> >>  tools/libxc/xc_hvm_build_x86.c |  28 +++---
> >>  tools/libxl/libxl_create.c     |   2 +-
> >>  tools/libxl/libxl_dm.c         | 195 +++++++++++++++++++++++++++++++++++++++++
> >>  tools/libxl/libxl_dom.c        |  27 +++++-
> >>  tools/libxl/libxl_internal.h   |  11 ++-
> >>  tools/libxl/libxl_types.idl    |   7 ++
> >>  9 files changed, 303 insertions(+), 18 deletions(-)
> >>
> >>diff --git a/tools/libxc/include/xenctrl.h b/tools/libxc/include/xenctrl.h
> >>index 59bbe06..299b95f 100644
> >>--- a/tools/libxc/include/xenctrl.h
> >>+++ b/tools/libxc/include/xenctrl.h
> >>@@ -2053,6 +2053,14 @@ int xc_get_device_group(xc_interface *xch,
> >>                       uint32_t *num_sdevs,
> >>                       uint32_t *sdev_array);
> >>
> >>+struct xen_reserved_device_memory
> >>+*xc_device_get_rdm(xc_interface *xch,
> >>+                   uint32_t flag,
> >>+                   uint16_t seg,
> >>+                   uint8_t bus,
> >>+                   uint8_t devfn,
> >>+                   unsigned int *nr_entries);
> >>+
> >>  int xc_test_assign_device(xc_interface *xch,
> >>                            uint32_t domid,
> >>                            uint32_t machine_sbdf);

[...]

> >
> >>      uint64_t mem_target;         /* Memory target in bytes. */
> >>      uint64_t mmio_size;          /* Size of the MMIO hole in bytes. */
> >>      const char *image_file_name; /* File name of the image to load. */
> >>diff --git a/tools/libxc/xc_domain.c b/tools/libxc/xc_domain.c
> >>index 4f8383e..85b18ea 100644
> >>--- a/tools/libxc/xc_domain.c
> >>+++ b/tools/libxc/xc_domain.c
> >>@@ -1665,6 +1665,46 @@ int xc_assign_device(
> >>      return do_domctl(xch, &domctl);
> >>  }
> >>
> >>+struct xen_reserved_device_memory
> >>+*xc_device_get_rdm(xc_interface *xch,
> >>+                   uint32_t flag,
> >>+                   uint16_t seg,
> >>+                   uint8_t bus,
> >>+                   uint8_t devfn,
> >>+                   unsigned int *nr_entries)
> >>+{
> >>+    struct xen_reserved_device_memory *xrdm = NULL;
> >>+    int rc = xc_reserved_device_memory_map(xch, flag, seg, bus, devfn, xrdm,
> >>+                                           nr_entries);
> >>+
> >>+    if ( rc < 0 )
> >>+    {
> >>+        if ( errno == ENOBUFS )
> >>+        {
> >>+            if ( (xrdm = malloc(*nr_entries *
> >>+                                sizeof(xen_reserved_device_memory_t))) == NULL )
> >>+            {
> >>+                PERROR("Could not allocate memory.");
> >>+                goto out;
> >>+            }
> >
> >Don't you leak origin xrdm in this case?
> 
> The caller to xc_device_get_rdm always frees this.
> 

I think I misunderstood how this function works. I thought xrdm was
passed in by caller, which is clearly not the case. Sorry!


In that case, the `if ( rc < 0 )' is not needed because the call should
always return rc < 0. An assert is good enough.

> >
> >And, this style is not very good. Shouldn't the caller allocate enough
> >memory before hand?
> 
> Are you saying the caller to xc_device_get_rdm()? If so, any caller don't
> know this, too.
> 

I see.

> Actually this is just a wrapper of that fundamental hypercall,
> xc_reserved_device_memory_map() in patch #2, and based on that, we always
> have to first call this to inquire how much memory we really need. And this
> is why we have this wrapper since we don't want to duplicate more codes.
> 
> One error handler of this wrapper is just handling ENOBUFS since the caller
> never know how much memory we should allocate. So oftentimes we always set
> 'entries = 0' to inquire firstly.
> 
> Here Jan suggested we may need to figure out a good way to consolidate
> xc_reserved_device_memory_map() and its wrapper, xc_device_get_rdm().
> 
> But in some ways, that wrapper likes a static function so we just need to
> move this into that file associated to its caller, right?
> 

Yes, if there is only one user at the moment, make a static function.

> >
> >>+            rc = xc_reserved_device_memory_map(xch, flag, seg, bus, devfn, xrdm,
> >>+                                               nr_entries);
> >>+            if ( rc )
> >>+            {
> >>+                PERROR("Could not get reserved device memory maps.");
> >>+                free(xrdm);
> >>+                xrdm = NULL;
> >>+            }
> >>+        }
> >>+        else {
> >>+            PERROR("Could not get reserved device memory maps.");
> >>+        }
> >>+    }
> >>+
> >>+ out:
> >>+    return xrdm;
> >>+}
> >>+
> >>  int xc_get_device_group(
> >>      xc_interface *xch,
> >>      uint32_t domid,
> >>diff --git a/tools/libxc/xc_hvm_build_x86.c b/tools/libxc/xc_hvm_build_x86.c
> >>index c81a25b..3f87bb3 100644
> >>--- a/tools/libxc/xc_hvm_build_x86.c
> >>+++ b/tools/libxc/xc_hvm_build_x86.c
> >>@@ -89,19 +89,16 @@ static int modules_init(struct xc_hvm_build_args *args,
> >>  }
> >>
> >>  static void build_hvm_info(void *hvm_info_page, uint64_t mem_size,
> >>-                           uint64_t mmio_start, uint64_t mmio_size)
> >>+                           uint64_t lowmem_end)
> >>  {
> >>      struct hvm_info_table *hvm_info = (struct hvm_info_table *)
> >>          (((unsigned char *)hvm_info_page) + HVM_INFO_OFFSET);
> >>-    uint64_t lowmem_end = mem_size, highmem_end = 0;
> >>+    uint64_t highmem_end = 0;
> >>      uint8_t sum;
> >>      int i;
> >>
> >>-    if ( lowmem_end > mmio_start )
> >>-    {
> >>-        highmem_end = (1ull<<32) + (lowmem_end - mmio_start);
> >>-        lowmem_end = mmio_start;
> >>-    }
> >>+    if ( mem_size > lowmem_end )
> >>+        highmem_end = (1ull<<32) + (mem_size - lowmem_end);
> >>
> >>      memset(hvm_info_page, 0, PAGE_SIZE);

[...]

> >
> >>+                                   struct xc_hvm_build_args *args)
> >
> >This function does more than "checking", so a better name is needed.
> >
> >May be you should split this function to one "build" function and one
> >"check" function? What do you think?
> 
> We'd better keep this big one since this can make our policy understandable,
> but I agree we need to rename this like,
> 
> libxl__domain_device_construct_rdm()
> 
> construct = check + build :)

I'm fine with this.

> 
> >
> >>+{
> >>+    int i, j, conflict;
> >>+    libxl_ctx *ctx = libxl__gc_owner(gc);
> >
> >You can just use CTX macro.

[...]

> >
> >>+    if ((type == LIBXL_RDM_RESERVE_TYPE_NONE) && !d_config->num_pcidevs)
> >>+        return 0;
> >>+
> >>+    /* Collect all rdm info if exist. */
> >>+    xrdm = xc_device_get_rdm(ctx->xch, LIBXL_RDM_RESERVE_TYPE_HOST,
> >>+                             0, 0, 0, &nr_all_rdms);
> >>+    if (!nr_all_rdms)
> >>+        return 0;
> >>+    d_config->rdms = libxl__calloc(gc, nr_all_rdms,
> >>+                                   sizeof(libxl_device_rdm));
> >
> >Note that if you use "gc" here the allocated memory will be, well,
> >garbage collected at some point. If you don't want them to be gc'ed you
> >should use NOGC.
> 
> Sorry, what does that mean by 'garbage collected'?
> 

That means the memory allocated with gc will be freed at some point by
GC_FREE, because those memory regions are meant to be temporary and used
internally.

When entering a libxl public API function (those start wtih
libxl_), that function calls GC_INIT to initialise a garbage collector.
When that function exits, it calls GC_FREE to free all memory allocated
with that gc.

Since d_config is very likely to be used by libxl user (xl, libvirt
etc), you probably don't want to fill it with gc allocated memory.

> >
> >>+    memset(d_config->rdms, 0, sizeof(libxl_device_rdm));
> >>+
> >>+    /* Query all RDM entries in this platform */
> >>+    if (type == LIBXL_RDM_RESERVE_TYPE_HOST) {
> >>+        d_config->num_rdms = nr_all_rdms;
> >>+        for (i = 0; i < d_config->num_rdms; i++) {
> >>+            d_config->rdms[i].start =
> >>+                                (uint64_t)xrdm[i].start_pfn << XC_PAGE_SHIFT;
> >>+            d_config->rdms[i].size =
> >>+                                (uint64_t)xrdm[i].nr_pages << XC_PAGE_SHIFT;
> >>+            d_config->rdms[i].flag = d_config->b_info.rdm.reserve;
> >>+        }
> >>+    } else {
> >>+        d_config->num_rdms = 0;
> >>+    }
> >
> >And you should move d_config->rdms = libxl__calloc inside that `if'.
> >That is, don't allocate memory if you don't need it.
> 
> We can't since in all cases we need to preallocate this, and then we will
> handle this according to our policy.
> 

How would it ever be used again if you set d_config->num_rdms to 0? How
do you know the exact size of your array again?

> >
> >>+    free(xrdm);
> >>+
> >>+    /* Query RDM entries per-device */
> >>+    for (i = 0; i < d_config->num_pcidevs; i++) {
> >>+        unsigned int nr_entries = 0;
> 
> Maybe I should restate this,
> 	unsigned int nr_entries;
> 
> >>+        bool new = true;
> >>+        seg = d_config->pcidevs[i].domain;
> >>+        bus = d_config->pcidevs[i].bus;
> >>+        devfn = PCI_DEVFN(d_config->pcidevs[i].dev, d_config->pcidevs[i].func);
> >>+        nr_entries = 0;
> >
> >You've already initialised this variable.
> 
> We need to set this as zero to start.
> 

Either of the tow works for me. Just don't want redundant
initialisation.

> >
> >>+        xrdm = xc_device_get_rdm(ctx->xch, LIBXL_RDM_RESERVE_TYPE_NONE,
> >>+                                 seg, bus, devfn, &nr_entries);
> >>+        /* No RDM to associated with this device. */
> >>+        if (!nr_entries)
> >>+            continue;
> >>+
> >>+        /* Need to check whether this entry is already saved in the array.
> >>+         * This could come from two cases:
> >>+         *
> >>+         *   - user may configure to get all RMRRs in this platform, which
> >>+         * is already queried before this point
> >
> >Formatting.
> 
> Are you saying this?

I mean you need to move "is already..." to the right go align with
previous line.

> 
> +        /* Need to check whether this entry is already saved in the array.
> 
> =>

The CODING_STYLE in libxl doesn't seem to enforce this, so you can just
follow other examples.

>         /*
> 
>          * Need to check whether this entry is already saved in the array.
>          * This could come from two cases:
> 
> >
> >>+         *   - or two assigned devices may share one RMRR entry
> >>+         *
> >>+         * different policies may be configured on the same RMRR due to above
> >>+         * two cases. We choose a simple policy to always favor stricter policy
> >>+         */
> >>+        for (j = 0; j < d_config->num_rdms; j++) {
> >>+            if (d_config->rdms[j].start ==
> >>+                                (uint64_t)xrdm[0].start_pfn << XC_PAGE_SHIFT)
> >>+             {
> >>+                if (d_config->rdms[j].flag != LIBXL_RDM_RESERVE_FLAG_FORCE)
> >>+                    d_config->rdms[j].flag = d_config->pcidevs[i].rdm_reserve;
> >>+                new = false;
> >>+                break;
> >>+            }
> >>+        }
> >>+
> >>+        if (new) {
> >>+            if (d_config->num_rdms > nr_all_rdms - 1) {
> >>+                LIBXL__LOG(CTX, LIBXL__LOG_ERROR, "Exceed rdm array boundary!\n");
> >
> >LOG(ERROR, ...)
> 
> Fixed.
> 
> >
> >>+                free(xrdm);
> >>+                return -1;
> >
> >Please use goto out idiom.
> 
> We just have two 'return -1' differently so I'm not sure its worth doing
> this.
> 

Yes, please comply with libxl idiom.

> >
> >>+            }
> >>+
> >>+            /*
> >>+             * This is a new entry.
> >>+             */
> >
> >/* This is a new entry. */
> 
> Fixed.
> 
> >
> >>+            d_config->rdms[d_config->num_rdms].start =
> >>+                                (uint64_t)xrdm[0].start_pfn << XC_PAGE_SHIFT;
> >>+            d_config->rdms[d_config->num_rdms].size =
> >>+                                (uint64_t)xrdm[0].nr_pages << XC_PAGE_SHIFT;
> >>+            d_config->rdms[d_config->num_rdms].flag = d_config->pcidevs[i].rdm_reserve;
> >>+            d_config->num_rdms++;
> >
> >Does this work? I don't see you reallocate memory.
> 
> Like I replied above we always preallocate this at the beginning.
> 

Ah, OK.

But please don't do this. It's hard to see you don't overrun the
buffer. Please allocate memory only when you need it.

> >
> >>+        }
> >>+        free(xrdm);
> >
> >Bug: you free xrdm several times.
> 
> No any conflict.
> 
> What I did is that I would free once I finish to calling every
> xc_device_get_rdm().
> 

OK. I misread. Sorry.


[...]

> 
> >
> >>+    /* Next step is to check and avoid potential conflict between RDM entries
> >>+     * and guest RAM. To avoid intrusive impact to existing memory layout
> >>+     * {lowmem, mmio, highmem} which is passed around various function blocks,
> >>+     * below conflicts are not handled which are rare and handling them would
> >>+     * lead to a more scattered layout:
> >>+     *  - RMRR in highmem area (>4G)
> >>+     *  - RMRR lower than a defined memory boundary (e.g. 2G)
> >>+     * Otherwise for conflicts between boundary and 4G, we'll simply move lowmem
> >>+     * end below reserved region to solve conflict.
> >>+     *
> >>+     * If a conflict is detected on a given RMRR entry, an error will be
> >>+     * returned.
> >>+     * If 'force' policy is specified. Or conflict is treated as a warning if
> >>+     * 'try' policy is specified, and we also mark this as INVALID not to expose
> >>+     * this entry to hvmloader.
> >>+     *
> >>+     * Firstly we should check the case of rdm < 4G because we may need to
> >>+     * expand highmem_end.
> >
> >Is this strategy agreed in previous discussion? How future-proof is this
> 
> Yes, this is based on that design.
> 

OK.

[...]

> >>
> >>  int libxl__build_hvm(libxl__gc *gc, uint32_t domid,
> >>-              libxl_domain_build_info *info,
> >>+              libxl_domain_config *d_config,
> >>                libxl__domain_build_state *state)
> >>  {
> >>      libxl_ctx *ctx = libxl__gc_owner(gc);
> >>      struct xc_hvm_build_args args = {};
> >>      int ret, rc = ERROR_FAIL;
> >>+    libxl_domain_build_info *const info = &d_config->b_info;
> >>+    uint64_t rdm_mem_boundary, mmio_start;

I didn't mention this in the first pass. You seem to have inserted some
tabs? We use space to indent.


Wei.

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC][PATCH 01/13] tools: introduce some new parameters to set rdm policy
  2015-05-11  5:35     ` Chen, Tiejun
@ 2015-05-11 14:54       ` Wei Liu
  2015-05-15  1:52         ` Chen, Tiejun
  0 siblings, 1 reply; 125+ messages in thread
From: Wei Liu @ 2015-05-11 14:54 UTC (permalink / raw)
  To: Chen, Tiejun
  Cc: kevin.tian, Wei Liu, ian.campbell, andrew.cooper3, tim,
	xen-devel, stefano.stabellini, JBeulich, yang.z.zhang,
	Ian.Jackson

On Mon, May 11, 2015 at 01:35:06PM +0800, Chen, Tiejun wrote:
> On 2015/5/8 21:04, Wei Liu wrote:
> >Sorry for the late review.
> >
> 
> Really thanks for taking your time :)
> 
> >On Fri, Apr 10, 2015 at 05:21:52PM +0800, Tiejun Chen wrote:
> >>This patch introduces user configurable parameters to specify RDM
> >>resource and according policies,
> >>
> >>Global RDM parameter:
> >>     rdm = [ 'host, reserve=force/try' ]
> >>Per-device RDM parameter:
> >>     pci = [ 'sbdf, rdm_reserve=force/try' ]
> >>
> >>Global RDM parameter allows user to specify reserved regions explicitly,
> >>e.g. using 'host' to include all reserved regions reported on this platform
> >>which is good to handle hotplug scenario. In the future this parameter
> >>may be further extended to allow specifying random regions, e.g. even
> >>those belonging to another platform as a preparation for live migration
> >>with passthrough devices.
> >>
> >>'force/try' policy decides how to handle conflict when reserving RDM
> >>regions in pfn space. If conflict exists, 'force' means an immediate error
> >>so VM will be killed, while 'try' allows moving forward with a warning
> >>message thrown out.
> >>
> >>Default per-device RDM policy is 'force', while default global RDM policy
> >>is 'try'. When both policies are specified on a given region, 'force' is
> >>always preferred.
> >>
> >>Signed-off-by: Tiejun Chen <tiejun.chen@intel.com>
> >>---
> >>  docs/man/xl.cfg.pod.5       | 44 +++++++++++++++++++++++++
> >>  docs/misc/vtd.txt           | 34 ++++++++++++++++++++
> >>  tools/libxl/libxl_create.c  |  5 +++
> >>  tools/libxl/libxl_types.idl | 18 +++++++++++
> >>  tools/libxl/libxlu_pci.c    | 78 +++++++++++++++++++++++++++++++++++++++++++++
> >>  tools/libxl/libxlutil.h     |  4 +++
> >>  tools/libxl/xl_cmdimpl.c    | 21 +++++++++++-
> >>  7 files changed, 203 insertions(+), 1 deletion(-)
> >>
> >>diff --git a/docs/man/xl.cfg.pod.5 b/docs/man/xl.cfg.pod.5
> >>index 408653f..9ed3055 100644
> >>--- a/docs/man/xl.cfg.pod.5
> >>+++ b/docs/man/xl.cfg.pod.5
> >>@@ -583,6 +583,36 @@ assigned slave device.
> >>
> >>  =back
> >>
> >>+=item B<rdm=[ "TYPE", "RDM_RESERVE_STRING", ... ]>
> >>+
> >
> >Shouldn't this be "TYPE,RDM_RESERVE_STRIGN" according to your commit
> >message? If the only available config is just one string, you probably
> >don't need a list for this?
> 
> Yes, based on that design we don't need a list. So
> 
> =item B<rdm=[ "RDM_RESERVE_STRING" ]>
> 

Note that this is still a list (enclosed by "[]"). Maybe you mean

   rdm = "RDM_RESERVE_STRING"

?

> >
> >>+(HVM/x86 only) Specifies the information about Reserved Device Memory (RDM),
> >>+which is necessary to enable robust device passthrough usage. One example of
> >>+RDM is reported through ACPI Reserved Memory Region Reporting (RMRR)
> >>+structure on x86 platform.
> >>+Each B<RDM_CHECK_STRING> has the form C<["TYPE",KEY=VALUE,KEY=VALUE,...> where:
> >>+
> >
> >RDM_CHECK_STRING?
> 
> And here should be corrected like this,
> 
> B<RDM_RESERVE_STRING> has the form ...
> 
> >
> >>+=over 4
> >>+
> >>+=item B<"TYPE">
> >>+
> >>+Currently we just have one type. 'host' means all reserved device memory on
> >>+this platform should be reserved in this VM's pfn space.
> >>+
> >
> >What are other possible types? If there is only one type then we can
> 
> Currently we just have one type and looks that design doesn't make this
> clear.
> 
> >simply ignore the type?
> 
> I just think we may introduce something else specific to live migration in
> the future... But I'm really not sure right now.
> 

Fair enough. I was just wondering if there would be any other types. If
so we do need provisioning.

In any case, the "type" argument you proposed is a positional argument
(you require it to be the first element of the spec string").
I think you can just make it a key-value pair to make parsing easier.

> >
> >>+=item B<KEY=VALUE>
> >>+
> >>+Possible B<KEY>s are:
> >>+
> >>+=over 4
> >>+
> >>+=item B<reserve="STRING">
> >>+
> >>+Conflict may be detected when reserving reserved device memory in gfn space.
> >>+'force' means an unsolved conflict leads to immediate VM destroy, while
> >
> >Do you mean "immediate VM crash"?
> 
> Yes. So I guess I should replace this.
> 
> >
> >>+'try' allows VM moving forward with a warning message thrown out. 'try'
> >>+is default.
> >
> >Can you please your double quotes for "force", "try" etc.
> 
> Sure. Just note we'd like to use "strict"/"relaxed" to replace "force"/"try"
> from next revision according to Jan's suggestion.
> 

No problem.

> >
> >>+
> >>+Note this may be overrided by another sub item, rdm_reserve, in pci device.
> >>+
> >>  =item B<pci=[ "PCI_SPEC_STRING", "PCI_SPEC_STRING", ... ]>
> >>
> >>  Specifies the host PCI devices to passthrough to this guest. Each B<PCI_SPEC_STRING>
> >>@@ -645,6 +675,20 @@ dom0 without confirmation.  Please use with care.
> >>  D0-D3hot power management states for the PCI device. False (0) by
> >>  default.
> >>
> >>+=item B<rdm_check="STRING">
> >>+
> >>+(HVM/x86 only) Specifies the information about Reserved Device Memory (RDM),
> >>+which is necessary to enable robust device passthrough usage. One example of
> >>+RDM is reported through ACPI Reserved Memory Region Reporting (RMRR)
> >>+structure on x86 platform.
> >>+
> >>+Conflict may be detected when reserving reserved device memory in gfn space.
> >>+'force' means an unsolved conflict leads to immediate VM destroy, while
> >>+'try' allows VM moving forward with a warning message thrown out. 'force'
> >>+is default.
> >>+
> >>+Note this would override another global item, rdm = [''].
> >>+
> >
> >Note this would override global B<rdm> option.
> 
> Fixed.
> 
> >
> >>  =back
> >>
> >>  =back
> >>diff --git a/docs/misc/vtd.txt b/docs/misc/vtd.txt
> >>index 9af0e99..d7434d6 100644
> >>--- a/docs/misc/vtd.txt
> >>+++ b/docs/misc/vtd.txt
> >>@@ -111,6 +111,40 @@ in the config file:
> >>  To override for a specific device:
> >>  	pci = [ '01:00.0,msitranslate=0', '03:00.0' ]
> >>
> >>+RDM, 'reserved device memory', for PCI Device Passthrough
> >>+---------------------------------------------------------
> >>+
> >>+There are some devices the BIOS controls, for e.g. USB devices to perform
> >>+PS2 emulation. The regions of memory used for these devices are marked
> >>+reserved in the e820 map. When we turn on DMA translation, DMA to those
> >>+regions will fail. Hence BIOS uses RMRR to specify these regions along with
> >>+devices that need to access these regions. OS is expected to setup
> >>+identity mappings for these regions for these devices to access these regions.
> >>+
> >>+While creating a VM we should reserve them in advance, and avoid any conflicts.
> >>+So we introduce user configurable parameters to specify RDM resource and
> >>+according policies,
> >>+
> >>+To enable this globally, add "rdm" in the config file:
> >>+
> >>+    rdm = [ 'host, reserve=force/try' ]
> >>+
> >
> >The "force/try" should be called "policy". And then you explain what
> >policies we have.
> 
> Do you mean we should rename this?
> 
> rdm = [ 'host, policy=force/try' ]
> 

No, I didn't ask you to rename that.

The above line is an example which should reflect the correct syntax.
"force/try" is not the *actual syntax*, i.e. you won't write that in
your config file.

I meant to changes it to "reserve=POLICY". Then you explain the possible
values of POLICY.

> This is really a policy but 'reserve' may can reflect our action explicitly,
> right?
> 
> >
> >>+Or just for a specific device:
> >>+
> >>+	pci = [ '01:00.0,rdm_reserve=force/try', '03:00.0' ]
> 
> And you also can see this.
> 
> But anyway, if you're really stick to rename this, I'm going to be fine as
> well but we should ping every one to check this point since this name is
> from our previous discussion.
> 
> >>+
> >>+Global RDM parameter allows user to specify reserved regions explicitly.
> >>+Using 'host' to include all reserved regions reported on this platform
> >>+which is good to handle hotplug scenario. In the future this parameter
> >>+may be further extended to allow specifying random regions, e.g. even
> >>+those belonging to another platform as a preparation for live migration
> >>+with passthrough devices.
> >>+
> >>+'force/try' policy decides how to handle conflict when reserving RDM
> >>+regions in pfn space. If conflict exists, 'force' means an immediate error
> >>+so VM will be killed, while 'try' allows moving forward with a warning
> >
> >Be killed by whom? I think it's hvmloader crashes voluntarily, right?
> 
> s/VM will be kille/hvmloader crashes voluntarily
> 
> >
> >>+message thrown out.
> >>+
> >>
> >>  Caveat on Conventional PCI Device Passthrough
> >>  ---------------------------------------------
> >>diff --git a/tools/libxl/libxl_create.c b/tools/libxl/libxl_create.c
> >>index 98687bd..9ed40d4 100644
> >>--- a/tools/libxl/libxl_create.c
> >>+++ b/tools/libxl/libxl_create.c
> >>@@ -1407,6 +1407,11 @@ static void domcreate_attach_pci(libxl__egc *egc, libxl__multidev *multidev,
> >>      }
> >>
> >>      for (i = 0; i < d_config->num_pcidevs; i++) {
> >>+        /*
> >>+         * If the rdm global policy is 'force' we should override each device.
> >>+         */
> >>+        if (d_config->b_info.rdm.reserve == LIBXL_RDM_RESERVE_FLAG_FORCE)
> >>+            d_config->pcidevs[i].rdm_reserve = LIBXL_RDM_RESERVE_FLAG_FORCE;
> >>          ret = libxl__device_pci_add(gc, domid, &d_config->pcidevs[i], 1);
> >>          if (ret < 0) {
> >>              LIBXL__LOG(ctx, LIBXL__LOG_ERROR,
> >>diff --git a/tools/libxl/libxl_types.idl b/tools/libxl/libxl_types.idl
> >>index 47af340..5786455 100644
> >>--- a/tools/libxl/libxl_types.idl
> >>+++ b/tools/libxl/libxl_types.idl
> >>@@ -71,6 +71,17 @@ libxl_domain_type = Enumeration("domain_type", [
> >>      (2, "PV"),
> >>      ], init_val = "LIBXL_DOMAIN_TYPE_INVALID")
> >>
> >>+libxl_rdm_reserve_type = Enumeration("rdm_reserve_type", [
> >>+    (0, "none"),
> >>+    (1, "host"),
> >>+    ])
> >>+
> >>+libxl_rdm_reserve_flag = Enumeration("rdm_reserve_flag", [
> >>+    (-1, "invalid"),
> >>+    (0, "force"),
> >>+    (1, "try"),
> >>+    ])
> >
> >If you don't set init_val, the default value would be "force" (0), is this
> 
> Yes.
> 
> >want you want?
> 
> We have a little bit of complexity here,
> 
> "Default per-device RDM policy is 'force', while default global RDM policy
> is 'try'. When both policies are specified on a given region, 'force' is
> always preferred."
> 

This is going to be done in actual code anyway.

This type is used both in global and per-device setting, so I envisage
this to have an invalid value to start with. Appropriate default values
should be done in libxl_TYPE_setdefault functions. And the logic to
detect conflict and preferences done in your construct function.

What do you think?

> >
> >>+
> >>  libxl_channel_connection = Enumeration("channel_connection", [
> >>      (0, "UNKNOWN"),
> >>      (1, "PTY"),
> >>@@ -356,6 +367,11 @@ libxl_domain_sched_params = Struct("domain_sched_params",[
> >>      ("budget",       integer, {'init_val': 'LIBXL_DOMAIN_SCHED_PARAM_BUDGET_DEFAULT'}),
> >>      ])
> >>
> >>+libxl_rdm_reserve = Struct("rdm_reserve", [
> >>+    ("type",    libxl_rdm_reserve_type),
> >>+    ("reserve",   libxl_rdm_reserve_flag),
> >>+    ])
> >>+
> >>  libxl_domain_build_info = Struct("domain_build_info",[
> >>      ("max_vcpus",       integer),
> >>      ("avail_vcpus",     libxl_bitmap),
> >>@@ -401,6 +417,7 @@ libxl_domain_build_info = Struct("domain_build_info",[
> >>      ("kernel",           string),
> >>      ("cmdline",          string),
> >>      ("ramdisk",          string),
> >>+    ("rdm",     libxl_rdm_reserve),
> >>      ("u", KeyedUnion(None, libxl_domain_type, "type",
> >>                  [("hvm", Struct(None, [("firmware",         string),
> >>                                         ("bios",             libxl_bios_type),
> >>@@ -521,6 +538,7 @@ libxl_device_pci = Struct("device_pci", [
> >>      ("power_mgmt", bool),
> >>      ("permissive", bool),
> >>      ("seize", bool),
> >>+    ("rdm_reserve",   libxl_rdm_reserve_flag),
> >>      ])
> >>
> >>  libxl_device_vtpm = Struct("device_vtpm", [
> >>diff --git a/tools/libxl/libxlu_pci.c b/tools/libxl/libxlu_pci.c
> >>index 26fb143..45be0d9 100644
> >>--- a/tools/libxl/libxlu_pci.c
> >>+++ b/tools/libxl/libxlu_pci.c
> >>@@ -42,6 +42,8 @@ static int pcidev_struct_fill(libxl_device_pci *pcidev, unsigned int domain,
> >>  #define STATE_OPTIONS_K 6
> >>  #define STATE_OPTIONS_V 7
> >>  #define STATE_TERMINAL  8
> >>+#define STATE_TYPE      9
> >>+#define STATE_CHECK_FLAG      10
> >>  int xlu_pci_parse_bdf(XLU_Config *cfg, libxl_device_pci *pcidev, const char *str)
> >>  {
> >>      unsigned state = STATE_DOMAIN;
> >>@@ -143,6 +145,17 @@ int xlu_pci_parse_bdf(XLU_Config *cfg, libxl_device_pci *pcidev, const char *str
> >>                      pcidev->permissive = atoi(tok);
> >>                  }else if ( !strcmp(optkey, "seize") ) {
> >>                      pcidev->seize = atoi(tok);
> >>+                }else if ( !strcmp(optkey, "rdm_reserve") ) {
> >>+                    if ( !strcmp(tok, "force") ) {
> >>+                        pcidev->rdm_reserve = LIBXL_RDM_RESERVE_FLAG_FORCE;
> >>+                    } else if ( !strcmp(tok, "try") ) {
> >>+                        pcidev->rdm_reserve = LIBXL_RDM_RESERVE_FLAG_TRY;
> >>+                    } else {
> >>+                        pcidev->rdm_reserve = LIBXL_RDM_RESERVE_FLAG_FORCE;
> >>+                        XLU__PCI_ERR(cfg, "Unknown PCI RDM property flag value:"
> >>+                                          " %s, so goes 'force' by default.",
> >
> >If this is not an error, you don't need XLU__PCI_ERR.
> >
> >But I would say we should  just treat this as an error and
> >abort/exit/report (whatever the parser should do in this case).
> 
> In our case we just want to post a message to set a appropriate flag to
> recover this behavior like we write here,
> 
>                         XLU__PCI_ERR(cfg, "Unknown PCI RDM property flag
> value:"
>                                           " %s, so goes 'strict' by
> default.",
>                                      tok);

I suggest we just abort in this case and not second guess what the admin
wants.

> 
> This may just be a warning? But I don't we have this sort of definition,
> XLU__PCI_WARN, ...
> 
> So what LOG format can be adopted here?

Feel free to introduce XLU__PCI_WARN if it turns out to be necessary.

> 
> >
> >>+                                     tok);
> >>+                    }
> >>                  }else{
> >>                      XLU__PCI_ERR(cfg, "Unknown PCI BDF option: %s", optkey);
> >>                  }
> >>@@ -167,6 +180,71 @@ parse_error:
> >>      return ERROR_INVAL;
> >>  }
> >>
> >>+int xlu_rdm_parse(XLU_Config *cfg, libxl_rdm_reserve *rdm, const char *str)
> >>+{
> >>+    unsigned state = STATE_TYPE;
> >>+    char *buf2, *tok, *ptr, *end;
> >>+
> >>+    if ( NULL == (buf2 = ptr = strdup(str)) )
> >>+        return ERROR_NOMEM;
> >>+
> >>+    for(tok = ptr, end = ptr + strlen(ptr) + 1; ptr < end; ptr++) {
> >>+        switch(state) {

Coding style. I haven't checked what actual style this file uses, but
there is inconsistency in this function by itself.

> >>+        case STATE_TYPE:
> >>+            if ( *ptr == '\0' || *ptr == ',' ) {
> >>+                state = STATE_CHECK_FLAG;
> >>+                *ptr = '\0';
> >>+                if ( !strcmp(tok, "host") ) {
> >>+                    rdm->type = LIBXL_RDM_RESERVE_TYPE_HOST;
> >>+                } else {
> >>+                    XLU__PCI_ERR(cfg, "Unknown RDM state option: %s", tok);
> >>+                    goto parse_error;
> >>+                }
> >>+                tok = ptr + 1;
> >>+            }
> >>+            break;
> >>+        case STATE_CHECK_FLAG:
> >>+            if ( *ptr == '=' ) {
> >>+                state = STATE_OPTIONS_V;
> >>+                *ptr = '\0';
> >>+                if ( strcmp(tok, "reserve") ) {
> >>+                    XLU__PCI_ERR(cfg, "Unknown RDM property value: %s", tok);
> >>+                    goto parse_error;
> >>+                }
> >>+                tok = ptr + 1;
> >>+            }
> >>+            break;
> >>+        case STATE_OPTIONS_V:
> >>+            if ( *ptr == ',' || *ptr == '\0' ) {
> >>+                state = STATE_TERMINAL;
> >>+                *ptr = '\0';
> >>+                if ( !strcmp(tok, "force") ) {
> >>+                    rdm->reserve = LIBXL_RDM_RESERVE_FLAG_FORCE;
> >>+                } else if ( !strcmp(tok, "try") ) {
> >>+                    rdm->reserve = LIBXL_RDM_RESERVE_FLAG_TRY;
> >>+                } else {
> >>+                    XLU__PCI_ERR(cfg, "Unknown RDM property flag value: %s",
> >>+                                 tok);
> >>+                    goto parse_error;
> >>+                }
> >>+                tok = ptr + 1;
> >>+            }
> >>+        default:
> >>+            break;
> >>+        }
> >>+    }
> >>+
> >>+    free(buf2);
> >>+
> >>+    if ( tok != ptr || state != STATE_TERMINAL )
> >>+        goto parse_error;
> >>+
> >>+    return 0;
> >>+
> >>+parse_error:
> >>+    return ERROR_INVAL;
> >>+}
> >>+
> >>  /*
> >>   * Local variables:
> >>   * mode: C
> >>diff --git a/tools/libxl/libxlutil.h b/tools/libxl/libxlutil.h
> >>index 0333e55..80497f8 100644
> >>--- a/tools/libxl/libxlutil.h
> >>+++ b/tools/libxl/libxlutil.h
> >>@@ -93,6 +93,10 @@ int xlu_disk_parse(XLU_Config *cfg, int nspecs, const char *const *specs,
> >>   */
> >>  int xlu_pci_parse_bdf(XLU_Config *cfg, libxl_device_pci *pcidev, const char *str);
> >>
> >>+/*
> >>+ * RDM parsing
> >>+ */
> >>+int xlu_rdm_parse(XLU_Config *cfg, libxl_rdm_reserve *rdm, const char *str);
> >>
> >>  /*
> >>   * Vif rate parsing.
> >>diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c
> >>index 04faf98..9a58464 100644
> >>--- a/tools/libxl/xl_cmdimpl.c
> >>+++ b/tools/libxl/xl_cmdimpl.c
> >>@@ -988,7 +988,7 @@ static void parse_config_data(const char *config_source,
> >>      const char *buf;
> >>      long l;
> >>      XLU_Config *config;
> >>-    XLU_ConfigList *cpus, *vbds, *nics, *pcis, *cvfbs, *cpuids, *vtpms;
> >>+    XLU_ConfigList *cpus, *vbds, *nics, *pcis, *cvfbs, *cpuids, *vtpms, *rdms;
> >>      XLU_ConfigList *channels, *ioports, *irqs, *iomem, *viridian;
> >>      int num_ioports, num_irqs, num_iomem, num_cpus, num_viridian;
> >>      int pci_power_mgmt = 0;
> >>@@ -1727,6 +1727,23 @@ skip_vfb:
> >>          xlu_cfg_get_defbool(config, "e820_host", &b_info->u.pv.e820_host, 0);
> >>      }
> >>
> >>+    /*
> >>+     * By default our global policy is to query all rdm entries, and
> >>+     * force reserve them.
> >>+     */
> >>+    b_info->rdm.type = LIBXL_RDM_RESERVE_TYPE_HOST;
> >>+    b_info->rdm.reserve = LIBXL_RDM_RESERVE_FLAG_TRY;
> >
> >This should probably to into the _setdefault function of
> >libxl_domain_build_info.
> 
> Sorry, I just see this
> 
> libxl_domain_build_info_init()
>     |
>     + libxl_rdm_reserve_init(&p->rdm);
> 	|
> 	+ memset(p, '\0', sizeof(*p));
> 
> But this should be generated automatically, right? So how to implement your
> idea? Could you give me a show?
> 

Check libxl_domain_build_info_setdefault.

To use libxl types. You normally do:

  libxl_TYPE_init
  libxl_TYPE_setdefault

  DO STUFF

  libxl_TYPE_dispose

_init and _dispose are auto-generated. _setdefault is not.

> >
> >>+    if (!xlu_cfg_get_list (config, "rdm", &rdms, 0, 0)) {
> >>+        if ((buf = xlu_cfg_get_listitem (rdms, 0)) != NULL ) {
> >>+            libxl_rdm_reserve rdm;
> >>+            if (!xlu_rdm_parse(config, &rdm, buf))
> >>+            {
> >>+                b_info->rdm.type = rdm.type;
> >>+                b_info->rdm.reserve = rdm.reserve;
> >>+            }
> >
> >You only have one rdm in b_info, so there is no need to use a list for
> >it in config file.
> >
> 
> Is this fine?
> 
>     if (!xlu_cfg_get_string(config, "rdm", &buf, 0)) {
> 
>         libxl_rdm_reserve rdm;
> 
>         if (!xlu_rdm_parse(config, &rdm, buf)) {
>             b_info->rdm.type = rdm.type;
> 
>             b_info->rdm.reserve = rdm.reserve;
> 
>         }
>     }
> 

I think it is fine. But you'd better wait a little bit for other people
to voice their opinions.

Wei.

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC][PATCH 07/13] xen/passthrough: extend hypercall to support rdm reservation policy
  2015-05-11  9:51       ` Julien Grall
  2015-05-11 10:57         ` Jan Beulich
@ 2015-05-14  5:47         ` Chen, Tiejun
  2015-05-14 10:19           ` Julien Grall
  1 sibling, 1 reply; 125+ messages in thread
From: Chen, Tiejun @ 2015-05-14  5:47 UTC (permalink / raw)
  To: Julien Grall, JBeulich, tim, konrad.wilk, andrew.cooper3,
	kevin.tian, yang.z.zhang, ian.campbell, wei.liu2, Ian.Jackson,
	stefano.stabellini
  Cc: xen-devel

On 2015/5/11 17:51, Julien Grall wrote:
> Hi,
>
> On 11/05/15 09:42, Chen, Tiejun wrote:
>> diff --git a/xen/drivers/passthrough/arm/smmu.c
>> b/xen/drivers/passthrough/arm/smmu.c
>> index 8a9b58b..a3e6383 100644
>> --- a/xen/drivers/passthrough/arm/smmu.c
>> +++ b/xen/drivers/passthrough/arm/smmu.c
>> @@ -2599,7 +2599,7 @@ static void arm_smmu_destroy_iommu_domain(struct
>> iommu_domain *domain)
>>   }
>>
>>   static int arm_smmu_assign_dev(struct domain *d, u8 devfn,
>> -                              struct device *dev)
>> +                              struct device *dev, u32 flag)
>>   {
>>          struct iommu_domain *domain;
>>          struct arm_smmu_xen_domain *xen_domain;
>> diff --git a/xen/drivers/passthrough/device_tree.c
>> b/xen/drivers/passthrough/device_tree.c
>> index 377d41d..97e7fc5 100644
>> --- a/xen/drivers/passthrough/device_tree.c
>> +++ b/xen/drivers/passthrough/device_tree.c
>> @@ -41,7 +41,8 @@ int iommu_assign_dt_device(struct domain *d, struct
>> dt_device_node *dev)
>>       if ( !list_empty(&dev->domain_list) )
>>           goto fail;
>>
>> -    rc = hd->platform_ops->assign_device(d, 0, dt_to_dev(dev));
>> +    rc = hd->platform_ops->assign_device(d, 0, dt_to_dev(dev),
>> +                                         XEN_DOMCTL_PCIDEV_RDM_TRY);
>
> On ARM we can passthrough 2 different types of device: PCI device and
> platform device described in the device tree (it's a tree representation
> of the hardware).

Sure, dts is always popular in most embedded systems, and actually I 
really like that  ;-)

>
> This assign_device callback deals with the latter. So from the name the
> value doesn't look right.
>

What about XEN_DOMCTL_DEV_RDM_XXX?

Thanks
Tiejun

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC][PATCH 07/13] xen/passthrough: extend hypercall to support rdm reservation policy
  2015-05-11 10:57         ` Jan Beulich
@ 2015-05-14  5:48           ` Chen, Tiejun
  2015-05-14 20:13             ` Jan Beulich
  0 siblings, 1 reply; 125+ messages in thread
From: Chen, Tiejun @ 2015-05-14  5:48 UTC (permalink / raw)
  To: Jan Beulich, Julien Grall
  Cc: tim, kevin.tian, wei.liu2, ian.campbell, andrew.cooper3,
	Ian.Jackson, xen-devel, stefano.stabellini, yang.z.zhang

On 2015/5/11 18:57, Jan Beulich wrote:
>>>> On 11.05.15 at 11:51, <julien.grall@citrix.com> wrote:
>> Hi,
>>
>> On 11/05/15 09:42, Chen, Tiejun wrote:
>>> diff --git a/xen/drivers/passthrough/arm/smmu.c
>>> b/xen/drivers/passthrough/arm/smmu.c
>>> index 8a9b58b..a3e6383 100644
>>> --- a/xen/drivers/passthrough/arm/smmu.c
>>> +++ b/xen/drivers/passthrough/arm/smmu.c
>>> @@ -2599,7 +2599,7 @@ static void arm_smmu_destroy_iommu_domain(struct
>>> iommu_domain *domain)
>>>   }
>>>
>>>   static int arm_smmu_assign_dev(struct domain *d, u8 devfn,
>>> -                              struct device *dev)
>>> +                              struct device *dev, u32 flag)
>>>   {
>>>          struct iommu_domain *domain;
>>>          struct arm_smmu_xen_domain *xen_domain;
>>> diff --git a/xen/drivers/passthrough/device_tree.c
>>> b/xen/drivers/passthrough/device_tree.c
>>> index 377d41d..97e7fc5 100644
>>> --- a/xen/drivers/passthrough/device_tree.c
>>> +++ b/xen/drivers/passthrough/device_tree.c
>>> @@ -41,7 +41,8 @@ int iommu_assign_dt_device(struct domain *d, struct
>>> dt_device_node *dev)
>>>       if ( !list_empty(&dev->domain_list) )
>>>           goto fail;
>>>
>>> -    rc = hd->platform_ops->assign_device(d, 0, dt_to_dev(dev));
>>> +    rc = hd->platform_ops->assign_device(d, 0, dt_to_dev(dev),
>>> +                                         XEN_DOMCTL_PCIDEV_RDM_TRY);
>>
>> On ARM we can passthrough 2 different types of device: PCI device and
>> platform device described in the device tree (it's a tree representation
>> of the hardware).
>>
>> This assign_device callback deals with the latter. So from the name the
>> value doesn't look right.
>
> Yeah, the constant name probably shouldn't refer to PCI, but simply
> to pass-through.
>

What about XEN_DOMCTL_DEV_RDM_XXX? I mean this may be specific to 
device, right?

Thanks
Tiejun

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC][PATCH 08/13] tools: extend xc_assign_device() to support rdm reservation policy
  2015-05-11 10:53       ` Jan Beulich
@ 2015-05-14  7:04         ` Chen, Tiejun
  0 siblings, 0 replies; 125+ messages in thread
From: Chen, Tiejun @ 2015-05-14  7:04 UTC (permalink / raw)
  To: Jan Beulich
  Cc: tim, kevin.tian, wei.liu2, ian.campbell, andrew.cooper3,
	Ian.Jackson, xen-devel, stefano.stabellini, yang.z.zhang

On 2015/5/11 18:53, Jan Beulich wrote:
>>>> On 11.05.15 at 11:45, <tiejun.chen@intel.com> wrote:
>> On 2015/4/20 21:39, Jan Beulich wrote:
>>>>>> On 10.04.15 at 11:21, <tiejun.chen@intel.com> wrote:
>>>> --- a/tools/libxc/xc_domain.c
>>>> +++ b/tools/libxc/xc_domain.c
>>>> @@ -1654,13 +1654,15 @@ int xc_domain_setdebugging(xc_interface *xch,
>>>>    int xc_assign_device(
>>>>        xc_interface *xch,
>>>>        uint32_t domid,
>>>> -    uint32_t machine_sbdf)
>>>> +    uint32_t machine_sbdf,
>>>> +    uint32_t flag)
>>>>    {
>>>>        DECLARE_DOMCTL;
>>>>
>>>>        domctl.cmd = XEN_DOMCTL_assign_device;
>>>>        domctl.domain = domid;
>>>>        domctl.u.assign_device.machine_sbdf = machine_sbdf;
>>>> +    domctl.u.assign_device.sbdf_flag = flag;
>>>
>>> The previous patch needs to initialize this field, in order to not pass
>>> random input to the hypervisor. Using the ..._TRY value here
>>> intermediately (until this patch gets applied) would seem the right
>>> approach.
>>>
>>
>> If I'm correct, looks I should introduce a little of change in previous
>> patch,
>>
>> diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
>> index 250d1e4..0bcfd87 100644
>> --- a/xen/drivers/passthrough/pci.c
>> +++ b/xen/drivers/passthrough/pci.c
>> @@ -1513,7 +1513,7 @@ int iommu_do_pci_domctl(
>>    {
>>        u16 seg;
>>        u8 bus, devfn;
>> -    u32 flag;
>> +    u32 flag = XEN_DOMCTL_PCIDEV_RDM_TRY;
>
> Provided that constant is available already at that point (I didn't

Yes, both that definition and this usage are in the previous patch.

Thanks
Tiejun

> check); if it isn't, you'd probably want to go with plain zero.
>
> Jan
>
>

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC][PATCH 04/13] tools/libxl: detect and avoid conflicts with RDM
  2015-05-11 11:32       ` Wei Liu
@ 2015-05-14  8:27         ` Chen, Tiejun
  2015-05-18  1:06           ` Chen, Tiejun
  2015-05-18 20:00           ` Wei Liu
  0 siblings, 2 replies; 125+ messages in thread
From: Chen, Tiejun @ 2015-05-14  8:27 UTC (permalink / raw)
  To: Wei Liu
  Cc: kevin.tian, ian.campbell, andrew.cooper3, tim, xen-devel,
	stefano.stabellini, JBeulich, yang.z.zhang, Ian.Jackson

On 2015/5/11 19:32, Wei Liu wrote:
> On Mon, May 11, 2015 at 04:09:53PM +0800, Chen, Tiejun wrote:
>> On 2015/5/8 22:43, Wei Liu wrote:
>>> Sorry for the late review. This series fell through the crack.
>>>
>>
>> Thanks for your review.
>>
>>> On Fri, Apr 10, 2015 at 05:21:55PM +0800, Tiejun Chen wrote:
>>>> While building a VM, HVM domain builder provides struct hvm_info_table{}
>>>> to help hvmloader. Currently it includes two fields to construct guest
>>>> e820 table by hvmloader, low_mem_pgend and high_mem_pgend. So we should
>>>> check them to fix any conflict with RAM.
>>>>
>>>> RMRR can reside in address space beyond 4G theoretically, but we never
>>>> see this in real world. So in order to avoid breaking highmem layout
>>>
>>> How does this break highmem layout?
>>
>> In most cases highmen is always continuous like [4G, ...] but RMRR is
>> theoretically beyond 4G but very rarely. Especially we don't see this
>> happened in real world. So we don't want to such a case breaking the
>> highmem.
>>
>
> The problem is  we take this approach just because this rarely happens
> *now* is not future proof.  It needs to be clearly documented somewhere
> in the manual (or any other Intel docs) and be referenced in the code.
> Otherwise we might end up in a situation that no-one knows how it is
> supposed to work and no-one can fix it if it breaks in the future, that
> is, every single device on earth requires RMRR > 4G overnight (I'm
> exaggerating).
>
> Or you can just make it works with highmem. How much more work do you
> envisage?
>
> (If my above comment makes no sense do let me know. I only have very
> shallow understanding of RMRR)

Maybe I'm misleading you :)

I don't mean RMRR > 4G is not allowed to work in our implementation. 
What I'm saying is that our *policy* is just simple for this kind of 
rare highmem case...

>
>>>
>>>> we don't solve highmem conflict. Note this means highmem rmrr could still
>>>> be supported if no conflict.
>>>>

Like these two sentences above.

>>>
>>> Aren't you actively trying to avoid conflict in libxl?
>>
>> RMRR is fixed by BIOS so we can't aovid conflict. Here we want to adopt some
>> good policies to address RMRR. In the case of highmemt, that simple policy
>> should be enough?
>>
>
> Whatever policy you and HV maintainers agree on. Just clearly document
> it.

Do you mean I should brief this patch description into one separate 
document?

>
>>>
>>>> But in the case of lowmem, RMRR probably scatter the whole RAM space.
>>>> Especially multiple RMRR entries would worsen this to lead a complicated
>>>> memory layout. And then its hard to extend hvm_info_table{} to work
>>>> hvmloader out. So here we're trying to figure out a simple solution to
>>>> avoid breaking existing layout. So when a conflict occurs,
>>>>
>>>>      #1. Above a predefined boundary (default 2G)
>>>>          - move lowmem_end below reserved region to solve conflict;
>>>>
>>>
>>> I hope this "predefined boundary" is user tunable. I will check later in
>>> this patch if it is the case.
>>
>> Yes. As we clarified in that comments,
>>
>> * TODO: Its batter to provide a config parameter for this boundary value.
>>
>> This mean I would provide a patch address this since currently I just think
>> this is not a big deal?
>>
>
> Yes please provide a config option to override that. It's reasonable
> that user wants to change that.

Okay.

>
>>>
>>>>      #2 Below a predefined boundary (default 2G)
>>>>          - Check force/try policy.
>>>>          "force" policy leads to fail libxl. Note when both policies
>>>>          are specified on a given region, 'force' is always preferred.
>>>>          "try" policy issue a warning message and also mask this entry INVALID
>>>>          to indicate we shouldn't expose this entry to hvmloader.
>>>>
>>>> Signed-off-by: Tiejun Chen <tiejun.chen@intel.com>
>>>> ---
>>>>   tools/libxc/include/xenctrl.h  |   8 ++
>>>>   tools/libxc/include/xenguest.h |   3 +-
>>>>   tools/libxc/xc_domain.c        |  40 +++++++++
>>>>   tools/libxc/xc_hvm_build_x86.c |  28 +++---
>>>>   tools/libxl/libxl_create.c     |   2 +-
>>>>   tools/libxl/libxl_dm.c         | 195 +++++++++++++++++++++++++++++++++++++++++
>>>>   tools/libxl/libxl_dom.c        |  27 +++++-
>>>>   tools/libxl/libxl_internal.h   |  11 ++-
>>>>   tools/libxl/libxl_types.idl    |   7 ++
>>>>   9 files changed, 303 insertions(+), 18 deletions(-)
>>>>
>>>> diff --git a/tools/libxc/include/xenctrl.h b/tools/libxc/include/xenctrl.h
>>>> index 59bbe06..299b95f 100644
>>>> --- a/tools/libxc/include/xenctrl.h
>>>> +++ b/tools/libxc/include/xenctrl.h
>>>> @@ -2053,6 +2053,14 @@ int xc_get_device_group(xc_interface *xch,
>>>>                        uint32_t *num_sdevs,
>>>>                        uint32_t *sdev_array);
>>>>
>>>> +struct xen_reserved_device_memory
>>>> +*xc_device_get_rdm(xc_interface *xch,
>>>> +                   uint32_t flag,
>>>> +                   uint16_t seg,
>>>> +                   uint8_t bus,
>>>> +                   uint8_t devfn,
>>>> +                   unsigned int *nr_entries);
>>>> +
>>>>   int xc_test_assign_device(xc_interface *xch,
>>>>                             uint32_t domid,
>>>>                             uint32_t machine_sbdf);
>
> [...]
>
>>>
>>>>       uint64_t mem_target;         /* Memory target in bytes. */
>>>>       uint64_t mmio_size;          /* Size of the MMIO hole in bytes. */
>>>>       const char *image_file_name; /* File name of the image to load. */
>>>> diff --git a/tools/libxc/xc_domain.c b/tools/libxc/xc_domain.c
>>>> index 4f8383e..85b18ea 100644
>>>> --- a/tools/libxc/xc_domain.c
>>>> +++ b/tools/libxc/xc_domain.c
>>>> @@ -1665,6 +1665,46 @@ int xc_assign_device(
>>>>       return do_domctl(xch, &domctl);
>>>>   }
>>>>
>>>> +struct xen_reserved_device_memory
>>>> +*xc_device_get_rdm(xc_interface *xch,
>>>> +                   uint32_t flag,
>>>> +                   uint16_t seg,
>>>> +                   uint8_t bus,
>>>> +                   uint8_t devfn,
>>>> +                   unsigned int *nr_entries)
>>>> +{
>>>> +    struct xen_reserved_device_memory *xrdm = NULL;
>>>> +    int rc = xc_reserved_device_memory_map(xch, flag, seg, bus, devfn, xrdm,
>>>> +                                           nr_entries);
>>>> +
>>>> +    if ( rc < 0 )
>>>> +    {
>>>> +        if ( errno == ENOBUFS )
>>>> +        {
>>>> +            if ( (xrdm = malloc(*nr_entries *
>>>> +                                sizeof(xen_reserved_device_memory_t))) == NULL )
>>>> +            {
>>>> +                PERROR("Could not allocate memory.");
>>>> +                goto out;
>>>> +            }
>>>
>>> Don't you leak origin xrdm in this case?
>>
>> The caller to xc_device_get_rdm always frees this.
>>
>
> I think I misunderstood how this function works. I thought xrdm was
> passed in by caller, which is clearly not the case. Sorry!
>
>
> In that case, the `if ( rc < 0 )' is not needed because the call should
> always return rc < 0. An assert is good enough.

assert(rc < 0)? But we can't presume the user always pass '0' firstly, 
and additionally, we may have no any RMRR indeed.

So I guess what you want is,

assert( rc <=0 );
if ( !rc )
     goto xrdm;

if ( errno == ENOBUFS )
...

Right?

>
>>>
>>> And, this style is not very good. Shouldn't the caller allocate enough
>>> memory before hand?
>>
>> Are you saying the caller to xc_device_get_rdm()? If so, any caller don't
>> know this, too.
>>
>
> I see.
>
>> Actually this is just a wrapper of that fundamental hypercall,
>> xc_reserved_device_memory_map() in patch #2, and based on that, we always
>> have to first call this to inquire how much memory we really need. And this
>> is why we have this wrapper since we don't want to duplicate more codes.
>>
>> One error handler of this wrapper is just handling ENOBUFS since the caller
>> never know how much memory we should allocate. So oftentimes we always set
>> 'entries = 0' to inquire firstly.
>>
>> Here Jan suggested we may need to figure out a good way to consolidate
>> xc_reserved_device_memory_map() and its wrapper, xc_device_get_rdm().
>>
>> But in some ways, that wrapper likes a static function so we just need to
>> move this into that file associated to its caller, right?
>>
>
> Yes, if there is only one user at the moment, make a static function.

Thanks.

>
>>>
>>>> +            rc = xc_reserved_device_memory_map(xch, flag, seg, bus, devfn, xrdm,
>>>> +                                               nr_entries);
>>>> +            if ( rc )
>>>> +            {
>>>> +                PERROR("Could not get reserved device memory maps.");
>>>> +                free(xrdm);
>>>> +                xrdm = NULL;
>>>> +            }
>>>> +        }
>>>> +        else {
>>>> +            PERROR("Could not get reserved device memory maps.");
>>>> +        }
>>>> +    }
>>>> +
>>>> + out:
>>>> +    return xrdm;
>>>> +}
>>>> +
>>>>   int xc_get_device_group(
>>>>       xc_interface *xch,
>>>>       uint32_t domid,
>>>> diff --git a/tools/libxc/xc_hvm_build_x86.c b/tools/libxc/xc_hvm_build_x86.c
>>>> index c81a25b..3f87bb3 100644
>>>> --- a/tools/libxc/xc_hvm_build_x86.c
>>>> +++ b/tools/libxc/xc_hvm_build_x86.c
>>>> @@ -89,19 +89,16 @@ static int modules_init(struct xc_hvm_build_args *args,
>>>>   }
>>>>
>>>>   static void build_hvm_info(void *hvm_info_page, uint64_t mem_size,
>>>> -                           uint64_t mmio_start, uint64_t mmio_size)
>>>> +                           uint64_t lowmem_end)
>>>>   {
>>>>       struct hvm_info_table *hvm_info = (struct hvm_info_table *)
>>>>           (((unsigned char *)hvm_info_page) + HVM_INFO_OFFSET);
>>>> -    uint64_t lowmem_end = mem_size, highmem_end = 0;
>>>> +    uint64_t highmem_end = 0;
>>>>       uint8_t sum;
>>>>       int i;
>>>>
>>>> -    if ( lowmem_end > mmio_start )
>>>> -    {
>>>> -        highmem_end = (1ull<<32) + (lowmem_end - mmio_start);
>>>> -        lowmem_end = mmio_start;
>>>> -    }
>>>> +    if ( mem_size > lowmem_end )
>>>> +        highmem_end = (1ull<<32) + (mem_size - lowmem_end);
>>>>
>>>>       memset(hvm_info_page, 0, PAGE_SIZE);
>
> [...]
>
>>>
>>>> +                                   struct xc_hvm_build_args *args)
>>>
>>> This function does more than "checking", so a better name is needed.
>>>
>>> May be you should split this function to one "build" function and one
>>> "check" function? What do you think?
>>
>> We'd better keep this big one since this can make our policy understandable,
>> but I agree we need to rename this like,
>>
>> libxl__domain_device_construct_rdm()
>>
>> construct = check + build :)
>
> I'm fine with this.

Thanks.

>
>>
>>>
>>>> +{
>>>> +    int i, j, conflict;
>>>> +    libxl_ctx *ctx = libxl__gc_owner(gc);
>>>
>>> You can just use CTX macro.
>
> [...]
>
>>>
>>>> +    if ((type == LIBXL_RDM_RESERVE_TYPE_NONE) && !d_config->num_pcidevs)
>>>> +        return 0;
>>>> +
>>>> +    /* Collect all rdm info if exist. */
>>>> +    xrdm = xc_device_get_rdm(ctx->xch, LIBXL_RDM_RESERVE_TYPE_HOST,
>>>> +                             0, 0, 0, &nr_all_rdms);
>>>> +    if (!nr_all_rdms)
>>>> +        return 0;
>>>> +    d_config->rdms = libxl__calloc(gc, nr_all_rdms,
>>>> +                                   sizeof(libxl_device_rdm));
>>>
>>> Note that if you use "gc" here the allocated memory will be, well,
>>> garbage collected at some point. If you don't want them to be gc'ed you
>>> should use NOGC.
>>
>> Sorry, what does that mean by 'garbage collected'?
>>
>
> That means the memory allocated with gc will be freed at some point by
> GC_FREE, because those memory regions are meant to be temporary and used
> internally.
>
> When entering a libxl public API function (those start wtih
> libxl_), that function calls GC_INIT to initialise a garbage collector.
> When that function exits, it calls GC_FREE to free all memory allocated
> with that gc.

Thanks for sharing this good info to me!

>
> Since d_config is very likely to be used by libxl user (xl, libvirt
> etc), you probably don't want to fill it with gc allocated memory.
>

Yes.

>>>
>>>> +    memset(d_config->rdms, 0, sizeof(libxl_device_rdm));
>>>> +
>>>> +    /* Query all RDM entries in this platform */
>>>> +    if (type == LIBXL_RDM_RESERVE_TYPE_HOST) {
>>>> +        d_config->num_rdms = nr_all_rdms;
>>>> +        for (i = 0; i < d_config->num_rdms; i++) {
>>>> +            d_config->rdms[i].start =
>>>> +                                (uint64_t)xrdm[i].start_pfn << XC_PAGE_SHIFT;
>>>> +            d_config->rdms[i].size =
>>>> +                                (uint64_t)xrdm[i].nr_pages << XC_PAGE_SHIFT;
>>>> +            d_config->rdms[i].flag = d_config->b_info.rdm.reserve;
>>>> +        }
>>>> +    } else {
>>>> +        d_config->num_rdms = 0;
>>>> +    }
>>>
>>> And you should move d_config->rdms = libxl__calloc inside that `if'.
>>> That is, don't allocate memory if you don't need it.
>>
>> We can't since in all cases we need to preallocate this, and then we will
>> handle this according to our policy.
>>
>
> How would it ever be used again if you set d_config->num_rdms to 0? How
> do you know the exact size of your array again?

If we don't consider multiple devices shares one rdm entry, our workflow 
can be showed as follows:

#1. We always preallocate all rdms[] but with memset().
#2. Then we have two cases for that global rule,

#2.1 If we really need all according to our global rule, we would set 
all rdms[] with all real rdm info and set d_config->num_rdms.
#2.2 If we have no such a global rule to obtain all, we just clear 
d_config->num_rdms.

#3. Then for per device rule

#3.1 From #2.1, we just need to check if we should change one given rdm 
entry's policy if this given entry is just associated to this device.
#3.2 From 2.2, obviously we just need to fill rdms one by one. Of 
course, its very possible that we don't fill all rdms since all 
passthroued devices might not have no rdm at all or they just occupy 
some. But anyway, finally we sync d_config->num_rdms.

>
>>>
>>>> +    free(xrdm);
>>>> +
>>>> +    /* Query RDM entries per-device */
>>>> +    for (i = 0; i < d_config->num_pcidevs; i++) {
>>>> +        unsigned int nr_entries = 0;
>>
>> Maybe I should restate this,
>> 	unsigned int nr_entries;
>>
>>>> +        bool new = true;
>>>> +        seg = d_config->pcidevs[i].domain;
>>>> +        bus = d_config->pcidevs[i].bus;
>>>> +        devfn = PCI_DEVFN(d_config->pcidevs[i].dev, d_config->pcidevs[i].func);
>>>> +        nr_entries = 0;
>>>
>>> You've already initialised this variable.
>>
>> We need to set this as zero to start.
>>
>
> Either of the tow works for me. Just don't want redundant
> initialisation.

Right.

>
>>>
>>>> +        xrdm = xc_device_get_rdm(ctx->xch, LIBXL_RDM_RESERVE_TYPE_NONE,
>>>> +                                 seg, bus, devfn, &nr_entries);
>>>> +        /* No RDM to associated with this device. */
>>>> +        if (!nr_entries)
>>>> +            continue;
>>>> +
>>>> +        /* Need to check whether this entry is already saved in the array.
>>>> +         * This could come from two cases:
>>>> +         *
>>>> +         *   - user may configure to get all RMRRs in this platform, which
>>>> +         * is already queried before this point
>>>
>>> Formatting.
>>
>> Are you saying this?
>
> I mean you need to move "is already..." to the right go align with
> previous line.

Fixed.

>
>>
>> +        /* Need to check whether this entry is already saved in the array.
>>
>> =>
>
> The CODING_STYLE in libxl doesn't seem to enforce this, so you can just
> follow other examples.
>
>>          /*
>>
>>           * Need to check whether this entry is already saved in the array.
>>           * This could come from two cases:
>>
>>>
>>>> +         *   - or two assigned devices may share one RMRR entry
>>>> +         *
>>>> +         * different policies may be configured on the same RMRR due to above
>>>> +         * two cases. We choose a simple policy to always favor stricter policy
>>>> +         */
>>>> +        for (j = 0; j < d_config->num_rdms; j++) {
>>>> +            if (d_config->rdms[j].start ==
>>>> +                                (uint64_t)xrdm[0].start_pfn << XC_PAGE_SHIFT)
>>>> +             {
>>>> +                if (d_config->rdms[j].flag != LIBXL_RDM_RESERVE_FLAG_FORCE)
>>>> +                    d_config->rdms[j].flag = d_config->pcidevs[i].rdm_reserve;
>>>> +                new = false;
>>>> +                break;
>>>> +            }
>>>> +        }
>>>> +
>>>> +        if (new) {
>>>> +            if (d_config->num_rdms > nr_all_rdms - 1) {
>>>> +                LIBXL__LOG(CTX, LIBXL__LOG_ERROR, "Exceed rdm array boundary!\n");
>>>
>>> LOG(ERROR, ...)
>>
>> Fixed.
>>
>>>
>>>> +                free(xrdm);
>>>> +                return -1;
>>>
>>> Please use goto out idiom.
>>
>> We just have two 'return -1' differently so I'm not sure its worth doing
>> this.
>>
>
> Yes, please comply with libxl idiom.
>
>>>
>>>> +            }
>>>> +
>>>> +            /*
>>>> +             * This is a new entry.
>>>> +             */
>>>
>>> /* This is a new entry. */
>>
>> Fixed.
>>
>>>
>>>> +            d_config->rdms[d_config->num_rdms].start =
>>>> +                                (uint64_t)xrdm[0].start_pfn << XC_PAGE_SHIFT;
>>>> +            d_config->rdms[d_config->num_rdms].size =
>>>> +                                (uint64_t)xrdm[0].nr_pages << XC_PAGE_SHIFT;
>>>> +            d_config->rdms[d_config->num_rdms].flag = d_config->pcidevs[i].rdm_reserve;
>>>> +            d_config->num_rdms++;
>>>
>>> Does this work? I don't see you reallocate memory.
>>
>> Like I replied above we always preallocate this at the beginning.
>>
>
> Ah, OK.
>
> But please don't do this. It's hard to see you don't overrun the
> buffer. Please allocate memory only when you need it.

Sorry I don't understand this. As I mention above, we don't know how 
many rdm entries we really need to allocate. So this is why we'd like to 
preallocate all rdms at the beginning. Then this can cover all cases, 
global policy, (global policy & per device) and only per device. Even if 
multiple devices shares one rdm we also need to avoid duplicating a new...


>
>>>
>>>> +        }
>>>> +        free(xrdm);
>>>
>>> Bug: you free xrdm several times.
>>
>> No any conflict.
>>
>> What I did is that I would free once I finish to calling every
>> xc_device_get_rdm().
>>
>
> OK. I misread. Sorry.
>
>
> [...]
>
>>
>>>
>>>> +    /* Next step is to check and avoid potential conflict between RDM entries
>>>> +     * and guest RAM. To avoid intrusive impact to existing memory layout
>>>> +     * {lowmem, mmio, highmem} which is passed around various function blocks,
>>>> +     * below conflicts are not handled which are rare and handling them would
>>>> +     * lead to a more scattered layout:
>>>> +     *  - RMRR in highmem area (>4G)
>>>> +     *  - RMRR lower than a defined memory boundary (e.g. 2G)
>>>> +     * Otherwise for conflicts between boundary and 4G, we'll simply move lowmem
>>>> +     * end below reserved region to solve conflict.
>>>> +     *
>>>> +     * If a conflict is detected on a given RMRR entry, an error will be
>>>> +     * returned.
>>>> +     * If 'force' policy is specified. Or conflict is treated as a warning if
>>>> +     * 'try' policy is specified, and we also mark this as INVALID not to expose
>>>> +     * this entry to hvmloader.
>>>> +     *
>>>> +     * Firstly we should check the case of rdm < 4G because we may need to
>>>> +     * expand highmem_end.
>>>
>>> Is this strategy agreed in previous discussion? How future-proof is this
>>
>> Yes, this is based on that design.
>>
>
> OK.
>
> [...]
>
>>>>
>>>>   int libxl__build_hvm(libxl__gc *gc, uint32_t domid,
>>>> -              libxl_domain_build_info *info,
>>>> +              libxl_domain_config *d_config,
>>>>                 libxl__domain_build_state *state)
>>>>   {
>>>>       libxl_ctx *ctx = libxl__gc_owner(gc);
>>>>       struct xc_hvm_build_args args = {};
>>>>       int ret, rc = ERROR_FAIL;
>>>> +    libxl_domain_build_info *const info = &d_config->b_info;
>>>> +    uint64_t rdm_mem_boundary, mmio_start;
>
> I didn't mention this in the first pass. You seem to have inserted some
> tabs? We use space to indent.
>
>

Okay.

Thanks
Tiejun

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC][PATCH 07/13] xen/passthrough: extend hypercall to support rdm reservation policy
  2015-05-14  5:47         ` Chen, Tiejun
@ 2015-05-14 10:19           ` Julien Grall
  0 siblings, 0 replies; 125+ messages in thread
From: Julien Grall @ 2015-05-14 10:19 UTC (permalink / raw)
  To: Chen, Tiejun, Julien Grall, JBeulich, tim, konrad.wilk,
	andrew.cooper3, kevin.tian, yang.z.zhang, ian.campbell, wei.liu2,
	Ian.Jackson, stefano.stabellini
  Cc: xen-devel

Hi Chen,

On 14/05/15 06:47, Chen, Tiejun wrote:
>> This assign_device callback deals with the latter. So from the name the
>> value doesn't look right.
>>
> 
> What about XEN_DOMCTL_DEV_RDM_XXX?

That would be better naming.

I would also add a XEN_DOMCTL_DEV_NO_RDM that would be use for non-PCI
assignment

Regards,

-- 
Julien Grall

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC][PATCH 07/13] xen/passthrough: extend hypercall to support rdm reservation policy
  2015-05-14  5:48           ` Chen, Tiejun
@ 2015-05-14 20:13             ` Jan Beulich
  0 siblings, 0 replies; 125+ messages in thread
From: Jan Beulich @ 2015-05-14 20:13 UTC (permalink / raw)
  To: julien.grall, tiejun.chen
  Cc: tim, kevin.tian, wei.liu2, ian.campbell, andrew.cooper3,
	Ian.Jackson, xen-devel, stefano.stabellini, yang.z.zhang

>>> "Chen, Tiejun" <tiejun.chen@intel.com> 05/14/15 7:48 AM >>>
>On 2015/5/11 18:57, Jan Beulich wrote:
>> Yeah, the constant name probably shouldn't refer to PCI, but simply
>> to pass-through.
>
>What about XEN_DOMCTL_DEV_RDM_XXX? I mean this may be specific to 
>device, right?

Fine with me.

Jan

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC][PATCH 01/13] tools: introduce some new parameters to set rdm policy
  2015-05-11 14:54       ` Wei Liu
@ 2015-05-15  1:52         ` Chen, Tiejun
  2015-05-18  1:06           ` Chen, Tiejun
  2015-05-18 19:17           ` Wei Liu
  0 siblings, 2 replies; 125+ messages in thread
From: Chen, Tiejun @ 2015-05-15  1:52 UTC (permalink / raw)
  To: Wei Liu
  Cc: kevin.tian, ian.campbell, andrew.cooper3, tim, xen-devel,
	stefano.stabellini, JBeulich, yang.z.zhang, Ian.Jackson

On 2015/5/11 22:54, Wei Liu wrote:
> On Mon, May 11, 2015 at 01:35:06PM +0800, Chen, Tiejun wrote:
>> On 2015/5/8 21:04, Wei Liu wrote:
>>> Sorry for the late review.
>>>
>>
>> Really thanks for taking your time :)
>>
>>> On Fri, Apr 10, 2015 at 05:21:52PM +0800, Tiejun Chen wrote:
>>>> This patch introduces user configurable parameters to specify RDM
>>>> resource and according policies,
>>>>
>>>> Global RDM parameter:
>>>>      rdm = [ 'host, reserve=force/try' ]
>>>> Per-device RDM parameter:
>>>>      pci = [ 'sbdf, rdm_reserve=force/try' ]
>>>>
>>>> Global RDM parameter allows user to specify reserved regions explicitly,
>>>> e.g. using 'host' to include all reserved regions reported on this platform
>>>> which is good to handle hotplug scenario. In the future this parameter
>>>> may be further extended to allow specifying random regions, e.g. even
>>>> those belonging to another platform as a preparation for live migration
>>>> with passthrough devices.
>>>>
>>>> 'force/try' policy decides how to handle conflict when reserving RDM
>>>> regions in pfn space. If conflict exists, 'force' means an immediate error
>>>> so VM will be killed, while 'try' allows moving forward with a warning
>>>> message thrown out.
>>>>
>>>> Default per-device RDM policy is 'force', while default global RDM policy
>>>> is 'try'. When both policies are specified on a given region, 'force' is
>>>> always preferred.
>>>>
>>>> Signed-off-by: Tiejun Chen <tiejun.chen@intel.com>
>>>> ---
>>>>   docs/man/xl.cfg.pod.5       | 44 +++++++++++++++++++++++++
>>>>   docs/misc/vtd.txt           | 34 ++++++++++++++++++++
>>>>   tools/libxl/libxl_create.c  |  5 +++
>>>>   tools/libxl/libxl_types.idl | 18 +++++++++++
>>>>   tools/libxl/libxlu_pci.c    | 78 +++++++++++++++++++++++++++++++++++++++++++++
>>>>   tools/libxl/libxlutil.h     |  4 +++
>>>>   tools/libxl/xl_cmdimpl.c    | 21 +++++++++++-
>>>>   7 files changed, 203 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/docs/man/xl.cfg.pod.5 b/docs/man/xl.cfg.pod.5
>>>> index 408653f..9ed3055 100644
>>>> --- a/docs/man/xl.cfg.pod.5
>>>> +++ b/docs/man/xl.cfg.pod.5
>>>> @@ -583,6 +583,36 @@ assigned slave device.
>>>>
>>>>   =back
>>>>
>>>> +=item B<rdm=[ "TYPE", "RDM_RESERVE_STRING", ... ]>
>>>> +
>>>
>>> Shouldn't this be "TYPE,RDM_RESERVE_STRIGN" according to your commit
>>> message? If the only available config is just one string, you probably
>>> don't need a list for this?
>>
>> Yes, based on that design we don't need a list. So
>>
>> =item B<rdm=[ "RDM_RESERVE_STRING" ]>
>>
>
> Note that this is still a list (enclosed by "[]"). Maybe you mean
>
>     rdm = "RDM_RESERVE_STRING"
>
> ?

Yes, I'll do this.

>
>>>
>>>> +(HVM/x86 only) Specifies the information about Reserved Device Memory (RDM),
>>>> +which is necessary to enable robust device passthrough usage. One example of
>>>> +RDM is reported through ACPI Reserved Memory Region Reporting (RMRR)
>>>> +structure on x86 platform.
>>>> +Each B<RDM_CHECK_STRING> has the form C<["TYPE",KEY=VALUE,KEY=VALUE,...> where:
>>>> +
>>>
>>> RDM_CHECK_STRING?
>>
>> And here should be corrected like this,
>>
>> B<RDM_RESERVE_STRING> has the form ...
>>
>>>
>>>> +=over 4
>>>> +
>>>> +=item B<"TYPE">
>>>> +
>>>> +Currently we just have one type. 'host' means all reserved device memory on
>>>> +this platform should be reserved in this VM's pfn space.
>>>> +
>>>
>>> What are other possible types? If there is only one type then we can
>>
>> Currently we just have one type and looks that design doesn't make this
>> clear.
>>
>>> simply ignore the type?
>>
>> I just think we may introduce something else specific to live migration in
>> the future... But I'm really not sure right now.
>>
>
> Fair enough. I was just wondering if there would be any other types. If
> so we do need provisioning.
>
> In any case, the "type" argument you proposed is a positional argument
> (you require it to be the first element of the spec string").
> I think you can just make it a key-value pair to make parsing easier.

Do you mean this statement?

=item B<rdm= "RDM_RESERVE_STRING" >

...

B<RDM_RESERVE_STRING> has the form C<[KEY=VALUE,KEY=VALUE,...> where:

=over 4

=item B<KEY=VALUE>

Possible B<KEY>s are:

=over 4

=item B<type="STRING">

Currently we just have one type. "host" means all reserved device memory on
this platform should be reserved in this VM's pfn space.

=over 4

=item B<reserve="STRING">
...


>
>>>
>>>> +=item B<KEY=VALUE>
>>>> +
>>>> +Possible B<KEY>s are:
>>>> +
>>>> +=over 4
>>>> +
>>>> +=item B<reserve="STRING">
>>>> +
>>>> +Conflict may be detected when reserving reserved device memory in gfn space.
>>>> +'force' means an unsolved conflict leads to immediate VM destroy, while
>>>
>>> Do you mean "immediate VM crash"?
>>
>> Yes. So I guess I should replace this.
>>
>>>
>>>> +'try' allows VM moving forward with a warning message thrown out. 'try'
>>>> +is default.
>>>
>>> Can you please your double quotes for "force", "try" etc.
>>
>> Sure. Just note we'd like to use "strict"/"relaxed" to replace "force"/"try"
>> from next revision according to Jan's suggestion.
>>
>
> No problem.
>
>>>
>>>> +
>>>> +Note this may be overrided by another sub item, rdm_reserve, in pci device.
>>>> +
>>>>   =item B<pci=[ "PCI_SPEC_STRING", "PCI_SPEC_STRING", ... ]>
>>>>
>>>>   Specifies the host PCI devices to passthrough to this guest. Each B<PCI_SPEC_STRING>
>>>> @@ -645,6 +675,20 @@ dom0 without confirmation.  Please use with care.
>>>>   D0-D3hot power management states for the PCI device. False (0) by
>>>>   default.
>>>>
>>>> +=item B<rdm_check="STRING">
>>>> +
>>>> +(HVM/x86 only) Specifies the information about Reserved Device Memory (RDM),
>>>> +which is necessary to enable robust device passthrough usage. One example of
>>>> +RDM is reported through ACPI Reserved Memory Region Reporting (RMRR)
>>>> +structure on x86 platform.
>>>> +
>>>> +Conflict may be detected when reserving reserved device memory in gfn space.
>>>> +'force' means an unsolved conflict leads to immediate VM destroy, while
>>>> +'try' allows VM moving forward with a warning message thrown out. 'force'
>>>> +is default.
>>>> +
>>>> +Note this would override another global item, rdm = [''].
>>>> +
>>>
>>> Note this would override global B<rdm> option.
>>
>> Fixed.
>>
>>>
>>>>   =back
>>>>
>>>>   =back
>>>> diff --git a/docs/misc/vtd.txt b/docs/misc/vtd.txt
>>>> index 9af0e99..d7434d6 100644
>>>> --- a/docs/misc/vtd.txt
>>>> +++ b/docs/misc/vtd.txt
>>>> @@ -111,6 +111,40 @@ in the config file:
>>>>   To override for a specific device:
>>>>   	pci = [ '01:00.0,msitranslate=0', '03:00.0' ]
>>>>
>>>> +RDM, 'reserved device memory', for PCI Device Passthrough
>>>> +---------------------------------------------------------
>>>> +
>>>> +There are some devices the BIOS controls, for e.g. USB devices to perform
>>>> +PS2 emulation. The regions of memory used for these devices are marked
>>>> +reserved in the e820 map. When we turn on DMA translation, DMA to those
>>>> +regions will fail. Hence BIOS uses RMRR to specify these regions along with
>>>> +devices that need to access these regions. OS is expected to setup
>>>> +identity mappings for these regions for these devices to access these regions.
>>>> +
>>>> +While creating a VM we should reserve them in advance, and avoid any conflicts.
>>>> +So we introduce user configurable parameters to specify RDM resource and
>>>> +according policies,
>>>> +
>>>> +To enable this globally, add "rdm" in the config file:
>>>> +
>>>> +    rdm = [ 'host, reserve=force/try' ]
>>>> +
>>>
>>> The "force/try" should be called "policy". And then you explain what
>>> policies we have.
>>
>> Do you mean we should rename this?
>>
>> rdm = [ 'host, policy=force/try' ]
>>
>
> No, I didn't ask you to rename that.
>
> The above line is an example which should reflect the correct syntax.
> "force/try" is not the *actual syntax*, i.e. you won't write that in
> your config file.
>
> I meant to changes it to "reserve=POLICY". Then you explain the possible
> values of POLICY.
>

Understood so what about this,

To enable this globally, add "rdm" in the config file:

     rdm = [ 'host, reserve=<POLICY>' ]

Or just for a specific device:

     pci = [ '01:00.0,rdm_reserve=<POLICY>', '03:00.0' ]

Global RDM parameter allows user to specify reserved regions explicitly.
Using "host" to include all reserved regions reported on this platform
which is good to handle hotplug scenario. In the future this parameter
may be further extended to allow specifying random regions, e.g. even
those belonging to another platform as a preparation for live migration
with passthrough devices.

Currently "POLICY" includes two options, "strict" and "relaxed". It 
decides how to handle conflict when reserving RDM regions in pfn space. 
If conflict ...

>> This is really a policy but 'reserve' may can reflect our action explicitly,
>> right?
>>
>>>
>>>> +Or just for a specific device:
>>>> +
>>>> +	pci = [ '01:00.0,rdm_reserve=force/try', '03:00.0' ]
>>
>> And you also can see this.
>>
>> But anyway, if you're really stick to rename this, I'm going to be fine as
>> well but we should ping every one to check this point since this name is
>> from our previous discussion.
>>
>>>> +
>>>> +Global RDM parameter allows user to specify reserved regions explicitly.
>>>> +Using 'host' to include all reserved regions reported on this platform
>>>> +which is good to handle hotplug scenario. In the future this parameter
>>>> +may be further extended to allow specifying random regions, e.g. even
>>>> +those belonging to another platform as a preparation for live migration
>>>> +with passthrough devices.
>>>> +
>>>> +'force/try' policy decides how to handle conflict when reserving RDM
>>>> +regions in pfn space. If conflict exists, 'force' means an immediate error
>>>> +so VM will be killed, while 'try' allows moving forward with a warning
>>>
>>> Be killed by whom? I think it's hvmloader crashes voluntarily, right?
>>
>> s/VM will be kille/hvmloader crashes voluntarily
>>
>>>
>>>> +message thrown out.
>>>> +
>>>>
>>>>   Caveat on Conventional PCI Device Passthrough
>>>>   ---------------------------------------------
>>>> diff --git a/tools/libxl/libxl_create.c b/tools/libxl/libxl_create.c
>>>> index 98687bd..9ed40d4 100644
>>>> --- a/tools/libxl/libxl_create.c
>>>> +++ b/tools/libxl/libxl_create.c
>>>> @@ -1407,6 +1407,11 @@ static void domcreate_attach_pci(libxl__egc *egc, libxl__multidev *multidev,
>>>>       }
>>>>
>>>>       for (i = 0; i < d_config->num_pcidevs; i++) {
>>>> +        /*
>>>> +         * If the rdm global policy is 'force' we should override each device.
>>>> +         */
>>>> +        if (d_config->b_info.rdm.reserve == LIBXL_RDM_RESERVE_FLAG_FORCE)
>>>> +            d_config->pcidevs[i].rdm_reserve = LIBXL_RDM_RESERVE_FLAG_FORCE;
>>>>           ret = libxl__device_pci_add(gc, domid, &d_config->pcidevs[i], 1);
>>>>           if (ret < 0) {
>>>>               LIBXL__LOG(ctx, LIBXL__LOG_ERROR,
>>>> diff --git a/tools/libxl/libxl_types.idl b/tools/libxl/libxl_types.idl
>>>> index 47af340..5786455 100644
>>>> --- a/tools/libxl/libxl_types.idl
>>>> +++ b/tools/libxl/libxl_types.idl
>>>> @@ -71,6 +71,17 @@ libxl_domain_type = Enumeration("domain_type", [
>>>>       (2, "PV"),
>>>>       ], init_val = "LIBXL_DOMAIN_TYPE_INVALID")
>>>>
>>>> +libxl_rdm_reserve_type = Enumeration("rdm_reserve_type", [
>>>> +    (0, "none"),
>>>> +    (1, "host"),
>>>> +    ])
>>>> +
>>>> +libxl_rdm_reserve_flag = Enumeration("rdm_reserve_flag", [
>>>> +    (-1, "invalid"),
>>>> +    (0, "force"),
>>>> +    (1, "try"),
>>>> +    ])
>>>
>>> If you don't set init_val, the default value would be "force" (0), is this
>>
>> Yes.
>>
>>> want you want?
>>
>> We have a little bit of complexity here,
>>
>> "Default per-device RDM policy is 'force', while default global RDM policy
>> is 'try'. When both policies are specified on a given region, 'force' is
>> always preferred."
>>
>
> This is going to be done in actual code anyway.
>
> This type is used both in global and per-device setting, so I envisage

Yes.

> this to have an invalid value to start with. Appropriate default values

Sounds I should set this,

+libxl_rdm_reserve_flag = Enumeration("rdm_reserve_flag", [
+    (-1, "invalid"),
+    (0, "strict"),
+    (1, "relaxed"),
+    ], init_val = "LIBXL_RDM_RESERVE_FLAG_INVALID")
+


> should be done in libxl_TYPE_setdefault functions. And the logic to
> detect conflict and preferences done in your construct function.
>
> What do you think?
>
>>>
>>>> +
>>>>   libxl_channel_connection = Enumeration("channel_connection", [
>>>>       (0, "UNKNOWN"),
>>>>       (1, "PTY"),
>>>> @@ -356,6 +367,11 @@ libxl_domain_sched_params = Struct("domain_sched_params",[
>>>>       ("budget",       integer, {'init_val': 'LIBXL_DOMAIN_SCHED_PARAM_BUDGET_DEFAULT'}),
>>>>       ])
>>>>
>>>> +libxl_rdm_reserve = Struct("rdm_reserve", [
>>>> +    ("type",    libxl_rdm_reserve_type),
>>>> +    ("reserve",   libxl_rdm_reserve_flag),
>>>> +    ])
>>>> +
>>>>   libxl_domain_build_info = Struct("domain_build_info",[
>>>>       ("max_vcpus",       integer),
>>>>       ("avail_vcpus",     libxl_bitmap),
>>>> @@ -401,6 +417,7 @@ libxl_domain_build_info = Struct("domain_build_info",[
>>>>       ("kernel",           string),
>>>>       ("cmdline",          string),
>>>>       ("ramdisk",          string),
>>>> +    ("rdm",     libxl_rdm_reserve),
>>>>       ("u", KeyedUnion(None, libxl_domain_type, "type",
>>>>                   [("hvm", Struct(None, [("firmware",         string),
>>>>                                          ("bios",             libxl_bios_type),
>>>> @@ -521,6 +538,7 @@ libxl_device_pci = Struct("device_pci", [
>>>>       ("power_mgmt", bool),
>>>>       ("permissive", bool),
>>>>       ("seize", bool),
>>>> +    ("rdm_reserve",   libxl_rdm_reserve_flag),
>>>>       ])
>>>>
>>>>   libxl_device_vtpm = Struct("device_vtpm", [
>>>> diff --git a/tools/libxl/libxlu_pci.c b/tools/libxl/libxlu_pci.c
>>>> index 26fb143..45be0d9 100644
>>>> --- a/tools/libxl/libxlu_pci.c
>>>> +++ b/tools/libxl/libxlu_pci.c
>>>> @@ -42,6 +42,8 @@ static int pcidev_struct_fill(libxl_device_pci *pcidev, unsigned int domain,
>>>>   #define STATE_OPTIONS_K 6
>>>>   #define STATE_OPTIONS_V 7
>>>>   #define STATE_TERMINAL  8
>>>> +#define STATE_TYPE      9
>>>> +#define STATE_CHECK_FLAG      10
>>>>   int xlu_pci_parse_bdf(XLU_Config *cfg, libxl_device_pci *pcidev, const char *str)
>>>>   {
>>>>       unsigned state = STATE_DOMAIN;
>>>> @@ -143,6 +145,17 @@ int xlu_pci_parse_bdf(XLU_Config *cfg, libxl_device_pci *pcidev, const char *str
>>>>                       pcidev->permissive = atoi(tok);
>>>>                   }else if ( !strcmp(optkey, "seize") ) {
>>>>                       pcidev->seize = atoi(tok);
>>>> +                }else if ( !strcmp(optkey, "rdm_reserve") ) {
>>>> +                    if ( !strcmp(tok, "force") ) {
>>>> +                        pcidev->rdm_reserve = LIBXL_RDM_RESERVE_FLAG_FORCE;
>>>> +                    } else if ( !strcmp(tok, "try") ) {
>>>> +                        pcidev->rdm_reserve = LIBXL_RDM_RESERVE_FLAG_TRY;
>>>> +                    } else {
>>>> +                        pcidev->rdm_reserve = LIBXL_RDM_RESERVE_FLAG_FORCE;
>>>> +                        XLU__PCI_ERR(cfg, "Unknown PCI RDM property flag value:"
>>>> +                                          " %s, so goes 'force' by default.",
>>>
>>> If this is not an error, you don't need XLU__PCI_ERR.
>>>
>>> But I would say we should  just treat this as an error and
>>> abort/exit/report (whatever the parser should do in this case).
>>
>> In our case we just want to post a message to set a appropriate flag to
>> recover this behavior like we write here,
>>
>>                          XLU__PCI_ERR(cfg, "Unknown PCI RDM property flag
>> value:"
>>                                            " %s, so goes 'strict' by
>> default.",
>>                                       tok);
>
> I suggest we just abort in this case and not second guess what the admin
> wants.

Okay,
                     } else {
                         XLU__PCI_ERR(cfg, "%s is not an valid PCI RDM 
property"
                                           " flag: 'strict' or 'relaxed'.",
                                      tok);
                         abort();


>
>>
>> This may just be a warning? But I don't we have this sort of definition,
>> XLU__PCI_WARN, ...
>>
>> So what LOG format can be adopted here?
>
> Feel free to introduce XLU__PCI_WARN if it turns out to be necessary.

If it goes to abort(), I think XLU__PCI_ERR() should be good.

>
>>
>>>
>>>> +                                     tok);
>>>> +                    }
>>>>                   }else{
>>>>                       XLU__PCI_ERR(cfg, "Unknown PCI BDF option: %s", optkey);
>>>>                   }
>>>> @@ -167,6 +180,71 @@ parse_error:
>>>>       return ERROR_INVAL;
>>>>   }
>>>>
>>>> +int xlu_rdm_parse(XLU_Config *cfg, libxl_rdm_reserve *rdm, const char *str)
>>>> +{
>>>> +    unsigned state = STATE_TYPE;
>>>> +    char *buf2, *tok, *ptr, *end;
>>>> +
>>>> +    if ( NULL == (buf2 = ptr = strdup(str)) )
>>>> +        return ERROR_NOMEM;
>>>> +
>>>> +    for(tok = ptr, end = ptr + strlen(ptr) + 1; ptr < end; ptr++) {
>>>> +        switch(state) {
>
> Coding style. I haven't checked what actual style this file uses, but
> there is inconsistency in this function by itself.

I just refer to xlu_pci_parse_bdf() to generate xlu_rdm_parse(), and 
they are in the same file...

Anyway, I should change this line,

for ( tok = ptr, end = ptr + strlen(ptr) + 1; ptr < end; ptr++ ) {

>
>>>> +        case STATE_TYPE:
>>>> +            if ( *ptr == '\0' || *ptr == ',' ) {
>>>> +                state = STATE_CHECK_FLAG;
>>>> +                *ptr = '\0';
>>>> +                if ( !strcmp(tok, "host") ) {
>>>> +                    rdm->type = LIBXL_RDM_RESERVE_TYPE_HOST;
>>>> +                } else {
>>>> +                    XLU__PCI_ERR(cfg, "Unknown RDM state option: %s", tok);
>>>> +                    goto parse_error;
>>>> +                }
>>>> +                tok = ptr + 1;
>>>> +            }
>>>> +            break;
>>>> +        case STATE_CHECK_FLAG:
>>>> +            if ( *ptr == '=' ) {
>>>> +                state = STATE_OPTIONS_V;
>>>> +                *ptr = '\0';
>>>> +                if ( strcmp(tok, "reserve") ) {
>>>> +                    XLU__PCI_ERR(cfg, "Unknown RDM property value: %s", tok);
>>>> +                    goto parse_error;
>>>> +                }
>>>> +                tok = ptr + 1;
>>>> +            }
>>>> +            break;
>>>> +        case STATE_OPTIONS_V:
>>>> +            if ( *ptr == ',' || *ptr == '\0' ) {
>>>> +                state = STATE_TERMINAL;
>>>> +                *ptr = '\0';
>>>> +                if ( !strcmp(tok, "force") ) {
>>>> +                    rdm->reserve = LIBXL_RDM_RESERVE_FLAG_FORCE;
>>>> +                } else if ( !strcmp(tok, "try") ) {
>>>> +                    rdm->reserve = LIBXL_RDM_RESERVE_FLAG_TRY;
>>>> +                } else {
>>>> +                    XLU__PCI_ERR(cfg, "Unknown RDM property flag value: %s",
>>>> +                                 tok);
>>>> +                    goto parse_error;
>>>> +                }
>>>> +                tok = ptr + 1;
>>>> +            }
>>>> +        default:
>>>> +            break;
>>>> +        }
>>>> +    }
>>>> +
>>>> +    free(buf2);
>>>> +
>>>> +    if ( tok != ptr || state != STATE_TERMINAL )
>>>> +        goto parse_error;
>>>> +
>>>> +    return 0;
>>>> +
>>>> +parse_error:
>>>> +    return ERROR_INVAL;
>>>> +}
>>>> +
>>>>   /*
>>>>    * Local variables:
>>>>    * mode: C
>>>> diff --git a/tools/libxl/libxlutil.h b/tools/libxl/libxlutil.h
>>>> index 0333e55..80497f8 100644
>>>> --- a/tools/libxl/libxlutil.h
>>>> +++ b/tools/libxl/libxlutil.h
>>>> @@ -93,6 +93,10 @@ int xlu_disk_parse(XLU_Config *cfg, int nspecs, const char *const *specs,
>>>>    */
>>>>   int xlu_pci_parse_bdf(XLU_Config *cfg, libxl_device_pci *pcidev, const char *str);
>>>>
>>>> +/*
>>>> + * RDM parsing
>>>> + */
>>>> +int xlu_rdm_parse(XLU_Config *cfg, libxl_rdm_reserve *rdm, const char *str);
>>>>
>>>>   /*
>>>>    * Vif rate parsing.
>>>> diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c
>>>> index 04faf98..9a58464 100644
>>>> --- a/tools/libxl/xl_cmdimpl.c
>>>> +++ b/tools/libxl/xl_cmdimpl.c
>>>> @@ -988,7 +988,7 @@ static void parse_config_data(const char *config_source,
>>>>       const char *buf;
>>>>       long l;
>>>>       XLU_Config *config;
>>>> -    XLU_ConfigList *cpus, *vbds, *nics, *pcis, *cvfbs, *cpuids, *vtpms;
>>>> +    XLU_ConfigList *cpus, *vbds, *nics, *pcis, *cvfbs, *cpuids, *vtpms, *rdms;
>>>>       XLU_ConfigList *channels, *ioports, *irqs, *iomem, *viridian;
>>>>       int num_ioports, num_irqs, num_iomem, num_cpus, num_viridian;
>>>>       int pci_power_mgmt = 0;
>>>> @@ -1727,6 +1727,23 @@ skip_vfb:
>>>>           xlu_cfg_get_defbool(config, "e820_host", &b_info->u.pv.e820_host, 0);
>>>>       }
>>>>
>>>> +    /*
>>>> +     * By default our global policy is to query all rdm entries, and
>>>> +     * force reserve them.
>>>> +     */
>>>> +    b_info->rdm.type = LIBXL_RDM_RESERVE_TYPE_HOST;
>>>> +    b_info->rdm.reserve = LIBXL_RDM_RESERVE_FLAG_TRY;
>>>
>>> This should probably to into the _setdefault function of
>>> libxl_domain_build_info.
>>
>> Sorry, I just see this
>>
>> libxl_domain_build_info_init()
>>      |
>>      + libxl_rdm_reserve_init(&p->rdm);
>> 	|
>> 	+ memset(p, '\0', sizeof(*p));
>>
>> But this should be generated automatically, right? So how to implement your
>> idea? Could you give me a show?
>>
>
> Check libxl_domain_build_info_setdefault.
>
> To use libxl types. You normally do:
>
>    libxl_TYPE_init
>    libxl_TYPE_setdefault
>
>    DO STUFF
>
>    libxl_TYPE_dispose
>
> _init and _dispose are auto-generated. _setdefault is not.

So in our case, maybe we can do this,

diff --git a/tools/libxl/libxl_create.c b/tools/libxl/libxl_create.c
index f0da7dc..461606c 100644
--- a/tools/libxl/libxl_create.c
+++ b/tools/libxl/libxl_create.c
@@ -100,6 +100,17 @@ static int sched_params_valid(libxl__gc *gc,
      return 1;
  }

+void libxl__device_rdm_setdefault(libxl__gc *gc,
+                                  libxl_domain_build_info *b_info)
+{
+    /*
+     * By default our global policy is to query all rdm entries, and
+     * force reserve them.
+     */
+    b_info->rdm.type = LIBXL_RDM_RESERVE_TYPE_HOST;
+    b_info->rdm.reserve = LIBXL_RDM_RESERVE_FLAG_STRICT;
+}
+
  int libxl__domain_build_info_setdefault(libxl__gc *gc,
                                          libxl_domain_build_info *b_info)
  {
@@ -410,6 +421,8 @@ int libxl__domain_build_info_setdefault(libxl__gc *gc,
                     libxl_domain_type_to_string(b_info->type));
          return ERROR_INVAL;
      }
+
+    libxl__device_rdm_setdefault(gc, b_info);
      return 0;
  }


>
>>>
>>>> +    if (!xlu_cfg_get_list (config, "rdm", &rdms, 0, 0)) {
>>>> +        if ((buf = xlu_cfg_get_listitem (rdms, 0)) != NULL ) {
>>>> +            libxl_rdm_reserve rdm;
>>>> +            if (!xlu_rdm_parse(config, &rdm, buf))
>>>> +            {
>>>> +                b_info->rdm.type = rdm.type;
>>>> +                b_info->rdm.reserve = rdm.reserve;
>>>> +            }
>>>
>>> You only have one rdm in b_info, so there is no need to use a list for
>>> it in config file.
>>>
>>
>> Is this fine?
>>
>>      if (!xlu_cfg_get_string(config, "rdm", &buf, 0)) {
>>
>>          libxl_rdm_reserve rdm;
>>
>>          if (!xlu_rdm_parse(config, &rdm, buf)) {
>>              b_info->rdm.type = rdm.type;
>>
>>              b_info->rdm.reserve = rdm.reserve;
>>
>>          }
>>      }
>>
>
> I think it is fine. But you'd better wait a little bit for other people
> to voice their opinions.
>

Sure.

Thanks
Tiejun

^ permalink raw reply related	[flat|nested] 125+ messages in thread

* Re: [RFC][PATCH 09/13] xen: enable XENMEM_set_memory_map in hvm
  2015-04-20 13:46   ` Jan Beulich
@ 2015-05-15  2:33     ` Chen, Tiejun
  2015-05-15  6:12       ` Jan Beulich
  0 siblings, 1 reply; 125+ messages in thread
From: Chen, Tiejun @ 2015-05-15  2:33 UTC (permalink / raw)
  To: Jan Beulich
  Cc: tim, kevin.tian, wei.liu2, ian.campbell, andrew.cooper3,
	Ian.Jackson, xen-devel, stefano.stabellini, yang.z.zhang



On 2015/4/20 21:46, Jan Beulich wrote:
>>>> On 10.04.15 at 11:22, <tiejun.chen@intel.com> wrote:
>> --- a/xen/arch/x86/hvm/hvm.c
>> +++ b/xen/arch/x86/hvm/hvm.c
>> @@ -4729,7 +4729,6 @@ static long hvm_memory_op(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
>>
>>       switch ( cmd & MEMOP_CMD_MASK )
>>       {
>> -    case XENMEM_memory_map:
>
> Title and description talk about XENMEM_set_memory_map only. As
> I think the implementation is right, the former will need updating. Do
> you actually need a HVM domain to be able to XENMEM_set_memory_map

Yes. Actually we need to enable two hypercalls here,

#1. XENMEM_set_memory_map --> Set
#2. XENMEM_memory_map --> Get

> on itself? If not, it should probably replace XENMEM_memory_map here.
>

Just rephrase,

xen: enable XENMEM set/get memory_map in hvm

This patch enables XENMEM_set_memory_map in hvm and then we can use
it to setup the e820 mappings, and finally hvmloader can get
these mappings with XENMEM_memory_map.

Thanks
Tiejun

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC][PATCH 10/13] tools: extend XENMEM_set_memory_map
  2015-04-20 13:51   ` Jan Beulich
@ 2015-05-15  2:57     ` Chen, Tiejun
  2015-05-15  6:16       ` Jan Beulich
  0 siblings, 1 reply; 125+ messages in thread
From: Chen, Tiejun @ 2015-05-15  2:57 UTC (permalink / raw)
  To: Jan Beulich, wei.liu2
  Cc: tim, kevin.tian, ian.campbell, andrew.cooper3, Ian.Jackson,
	xen-devel, stefano.stabellini, yang.z.zhang

On 2015/4/20 21:51, Jan Beulich wrote:
>>>> On 10.04.15 at 11:22, <tiejun.chen@intel.com> wrote:
>> --- a/tools/libxl/libxl_dom.c
>> +++ b/tools/libxl/libxl_dom.c
>> @@ -787,6 +787,70 @@ out:
>>       return rc;
>>   }
>>
>> +static int libxl__domain_construct_memmap(libxl_ctx *ctx,
>> +                                          libxl_domain_config *d_config,
>> +                                          uint32_t domid,
>> +                                          struct xc_hvm_build_args *args,
>> +                                          int num_pcidevs,
>> +                                          libxl_device_pci *pcidevs)
>> +{
>> +    unsigned int nr = 0, i;
>> +    /* We always own at least one lowmem entry. */
>> +    unsigned int e820_entries = 1;
>> +    uint64_t highmem_end = 0, highmem_size = args->mem_size - args->lowmem_size;
>> +    struct e820entry *e820 = NULL;
>> +
>> +    /* Add all rdm entries. */
>> +    e820_entries += d_config->num_rdms;
>> +
>> +    /* If we should have a highmem range. */
>> +    if (highmem_size)
>> +    {
>> +        highmem_end = (1ull<<32) + highmem_size;
>> +        e820_entries++;
>> +    }
>> +
>> +    e820 = malloc(sizeof(struct e820entry) * e820_entries);
>> +    if (!e820) {
>> +        return -1;
>> +    }
>> +
>> +    /* Low memory */
>> +    e820[nr].addr = 0x100000;
>> +    e820[nr].size = args->lowmem_size - 0x100000;
>> +    e820[nr].type = E820_RAM;
>
> If you really mean it to be this lax (not covering the low 1Mb), then
> you need to explain why in a comment (and the consuming side
> should also have a similar explanation then).
>

Okay, here may need this,

/* 

  * Low RAM starts at least from 1M to make sure all standard regions 

  * of the PC memory map, like BIOS, VGA memory-mapped I/O and vgabios, 

  * have enough space.
  */
#define GUEST_LOW_MEM_START_DEFAULT 0x100000

On the consuming side, I should clarify that we always preserve 1M.

>> +    nr++;
>> +
>> +    /* RDM mapping */
>> +    for (i = 0; i < d_config->num_rdms; i++) {
>> +        /*
>> +         * We should drop this kind of rdm entry.
>> +         */
>> +        if (d_config->rdms[i].flag == LIBXL_RDM_RESERVE_FLAG_INVALID)
>> +            continue;
>> +
>> +        e820[nr].addr = d_config->rdms[i].start;
>> +        e820[nr].size = d_config->rdms[i].size;
>> +        e820[nr].type = E820_RESERVED;
>> +        nr++;
>> +    }
>
> Is this guaranteed not to produce overlapping entries?
>

Right, I would add this at the beginning,

     if (e820_entries >= E820MAX) {
         LOG(ERROR, "Ooops! Too many entries in the memory map!\n");
         return -1;
     }

Thanks
Tiejun

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC][PATCH 11/13] hvmloader: get guest memory map into memory_map[]
  2015-04-20 13:57   ` Jan Beulich
@ 2015-05-15  3:10     ` Chen, Tiejun
  0 siblings, 0 replies; 125+ messages in thread
From: Chen, Tiejun @ 2015-05-15  3:10 UTC (permalink / raw)
  To: Jan Beulich
  Cc: tim, kevin.tian, wei.liu2, ian.campbell, andrew.cooper3,
	Ian.Jackson, xen-devel, stefano.stabellini, yang.z.zhang

On 2015/4/20 21:57, Jan Beulich wrote:
>>>> On 10.04.15 at 11:22, <tiejun.chen@intel.com> wrote:
>> --- a/tools/firmware/hvmloader/util.c
>> +++ b/tools/firmware/hvmloader/util.c
>> @@ -27,6 +27,16 @@
>>   #include <xen/memory.h>
>>   #include <xen/sched.h>
>>
>> +int check_hole_conflict(uint64_t start, uint64_t size,
>> +                        uint64_t reserved_start, uint64_t reserved_size)
>> +{
>> +    if ( start + size <= reserved_start ||
>> +            start >= reserved_start + reserved_size )
>> +        return 0;
>> +    else
>> +        return 1;
>> +}
>
> See the comments on the similar tool stack function. Also please get
> indentation right.
>

Okay.

Thanks
Tiejun

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC][PATCH 12/13] hvmloader/pci: skip reserved ranges
  2015-04-20 14:21   ` Jan Beulich
@ 2015-05-15  3:18     ` Chen, Tiejun
  2015-05-15  6:19       ` Jan Beulich
  0 siblings, 1 reply; 125+ messages in thread
From: Chen, Tiejun @ 2015-05-15  3:18 UTC (permalink / raw)
  To: Jan Beulich
  Cc: tim, kevin.tian, wei.liu2, ian.campbell, andrew.cooper3,
	Ian.Jackson, xen-devel, stefano.stabellini, yang.z.zhang

On 2015/4/20 22:21, Jan Beulich wrote:
>>>> On 10.04.15 at 11:22, <tiejun.chen@intel.com> wrote:
>> --- a/tools/firmware/hvmloader/pci.c
>> +++ b/tools/firmware/hvmloader/pci.c
>> @@ -59,8 +59,8 @@ void pci_setup(void)
>>           uint32_t bar_reg;
>>           uint64_t bar_sz;
>>       } *bars = (struct bars *)scratch_start;
>> -    unsigned int i, nr_bars = 0;
>> -    uint64_t mmio_hole_size = 0;
>> +    unsigned int i, j, nr_bars = 0;
>> +    uint64_t mmio_hole_size = 0, reserved_end;
>>
>>       const char *s;
>>       /*
>> @@ -393,8 +393,23 @@ void pci_setup(void)
>>           }
>>
>>           base = (resource->base  + bar_sz - 1) & ~(uint64_t)(bar_sz - 1);
>> + reallocate_mmio:
>>           bar_data |= (uint32_t)base;
>>           bar_data_upper = (uint32_t)(base >> 32);
>> +        for ( j = 0; j < memory_map.nr_map ; j++ )
>> +        {
>> +            if ( memory_map.map[j].type != E820_RAM )
>> +            {
>> +                reserved_end = memory_map.map[j].addr + memory_map.map[j].size;
>> +                if ( check_hole_conflict(base, bar_sz,
>> +                                         memory_map.map[j].addr,
>> +                                         memory_map.map[j].size) )
>> +                {
>> +                    base = (reserved_end  + bar_sz - 1) & ~(uint64_t)(bar_sz - 1);
>> +                    goto reallocate_mmio;
>> +                }
>> +            }
>> +        }
>>           base += bar_sz;
>>
>>           if ( (base < resource->base) || (base > resource->max) )
>

Actually some original codes are missing here,

         if ( (base < resource->base) || (base > resource->max) )
         {
             printf("pci dev %02x:%x bar %02x size "PRIllx": no space for "
                    "resource!\n", devfn>>3, devfn&7, bar_reg,
                    PRIllx_arg(bar_sz));
             continue;
         }

I think this can guarantee the MMIO regions just fit in the available RAM.

Or am I wrong?

Thanks
Tiejun

> But you do nothing to make sure the MMIO regions all fit in the
> available window (see the code ahead of this relocating RAM if
> necessary).
>
> Jan
>
>

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC][PATCH 13/13] hvmloader/e820: construct guest e820 table
  2015-04-20 14:29   ` Jan Beulich
@ 2015-05-15  6:11     ` Chen, Tiejun
  2015-05-15  6:25       ` Jan Beulich
  0 siblings, 1 reply; 125+ messages in thread
From: Chen, Tiejun @ 2015-05-15  6:11 UTC (permalink / raw)
  To: Jan Beulich
  Cc: tim, kevin.tian, wei.liu2, ian.campbell, andrew.cooper3,
	Ian.Jackson, xen-devel, stefano.stabellini, yang.z.zhang

On 2015/4/20 22:29, Jan Beulich wrote:
>>>> On 10.04.15 at 11:22, <tiejun.chen@intel.com> wrote:
>> --- a/tools/firmware/hvmloader/e820.c
>> +++ b/tools/firmware/hvmloader/e820.c
>> @@ -73,7 +73,8 @@ int build_e820_table(struct e820entry *e820,
>>                        unsigned int lowmem_reserved_base,
>>                        unsigned int bios_image_base)
>>   {
>> -    unsigned int nr = 0;
>> +    unsigned int nr = 0, i, j;
>> +    struct e820entry tmp;
>
> The declaration of "tmp" belongs in the most narrow scope you need
> it in.

Right.

>
>> @@ -119,10 +120,6 @@ int build_e820_table(struct e820entry *e820,
>>
>>       /* Low RAM goes here. Reserve space for special pages. */
>>       BUG_ON((hvm_info->low_mem_pgend << PAGE_SHIFT) < (2u << 20));
>> -    e820[nr].addr = 0x100000;
>> -    e820[nr].size = (hvm_info->low_mem_pgend << PAGE_SHIFT) - e820[nr].addr;
>> -    e820[nr].type = E820_RAM;
>> -    nr++;
>
> I think the above comment needs adjustment with all this code
> removed. I also wonder how meaningful the BUG_ON() is with
> ->low_mem_pgend no longer used for E820 table construction.
> Perhaps this needs another BUG_ON() validating that the field
> matches some value from memory_map.map[]?

But I think hvm_info->low_mem_pgend is still correct, right? And 
additionally, there's no any obvious flag to indicate which 
memory_map.map[x] is that last low memory map. Even we may separate the 
low memory to construct memory_map.map[]...

>
>> @@ -159,16 +156,37 @@ int build_e820_table(struct e820entry *e820,
>>           nr++;
>>       }
>>
>> -
>> -    if ( hvm_info->high_mem_pgend )
>> +    /* Construct the remaining according memory_map[]. */
>> +    for ( i = 0; i < memory_map.nr_map; i++ )
>>       {
>> -        e820[nr].addr = ((uint64_t)1 << 32);
>> -        e820[nr].size =
>> -            ((uint64_t)hvm_info->high_mem_pgend << PAGE_SHIFT) - e820[nr].addr;
>> -        e820[nr].type = E820_RAM;
>> +        e820[nr].addr = memory_map.map[i].addr;
>> +        e820[nr].size = memory_map.map[i].size;
>> +        e820[nr].type = memory_map.map[i].type;
>
> Afaict you could use structure assignment here to make this
> more readable.

Sorry, are you saying this?

memcpy(&e820[nr], &memory_map.map[i], sizeof(struct e820entry));

>
>>           nr++;
>>       }
>>
>> +    /* May need to reorder all e820 entries. */
>> +    for ( j = 0; j < nr-1; j++ )
>> +    {
>> +        for ( i = j+1; i < nr; i++ )
>> +        {
>> +            if ( e820[j].addr > e820[i].addr )
>> +            {
>> +                tmp.addr = e820[j].addr;
>> +                tmp.size = e820[j].size;
>> +                tmp.type = e820[j].type;
>> +
>> +                e820[j].addr = e820[i].addr;
>> +                e820[j].size = e820[i].size;
>> +                e820[j].type = e820[i].type;
>> +
>> +                e820[i].addr = tmp.addr;
>> +                e820[i].size = tmp.size;
>> +                e820[i].type = tmp.type;
>
> Please again use structure assignments to make this more readable.
>

And here,

     for ( j = 0; j < nr-1; j++ )
     {
         for ( i = j+1; i < nr; i++ )
         {
             if ( e820[j].addr > e820[i].addr )
             {
                 struct e820entry tmp;

                 memcpy(&tmp, &e820[j], sizeof(struct e820entry));

                 memcpy(&e820[j], &e820[i], sizeof(struct e820entry));

                 memcpy(&e820[i], &tmp, sizeof(struct e820entry));
             }
         }
     }

If I'm wrong please correct me.

Thanks
Tiejun

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC][PATCH 09/13] xen: enable XENMEM_set_memory_map in hvm
  2015-05-15  2:33     ` Chen, Tiejun
@ 2015-05-15  6:12       ` Jan Beulich
  2015-05-15  6:24         ` Chen, Tiejun
  0 siblings, 1 reply; 125+ messages in thread
From: Jan Beulich @ 2015-05-15  6:12 UTC (permalink / raw)
  To: Tiejun Chen
  Cc: tim, kevin.tian, wei.liu2, ian.campbell, andrew.cooper3,
	Ian.Jackson, xen-devel, stefano.stabellini, yang.z.zhang

>>> On 15.05.15 at 04:33, <tiejun.chen@intel.com> wrote:

> 
> On 2015/4/20 21:46, Jan Beulich wrote:
>>>>> On 10.04.15 at 11:22, <tiejun.chen@intel.com> wrote:
>>> --- a/xen/arch/x86/hvm/hvm.c
>>> +++ b/xen/arch/x86/hvm/hvm.c
>>> @@ -4729,7 +4729,6 @@ static long hvm_memory_op(int cmd, 
> XEN_GUEST_HANDLE_PARAM(void) arg)
>>>
>>>       switch ( cmd & MEMOP_CMD_MASK )
>>>       {
>>> -    case XENMEM_memory_map:
>>
>> Title and description talk about XENMEM_set_memory_map only. As
>> I think the implementation is right, the former will need updating. Do
>> you actually need a HVM domain to be able to XENMEM_set_memory_map
> 
> Yes. Actually we need to enable two hypercalls here,
> 
> #1. XENMEM_set_memory_map --> Set
> #2. XENMEM_memory_map --> Get

You say "yes" without saying why, and ...

>> on itself? If not, it should probably replace XENMEM_memory_map here.
>>
> 
> Just rephrase,
> 
> xen: enable XENMEM set/get memory_map in hvm
> 
> This patch enables XENMEM_set_memory_map in hvm and then we can use
> it to setup the e820 mappings, and finally hvmloader can get
> these mappings with XENMEM_memory_map.

... according to this wording of yours it's not needed.

Jan

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC][PATCH 10/13] tools: extend XENMEM_set_memory_map
  2015-05-15  2:57     ` Chen, Tiejun
@ 2015-05-15  6:16       ` Jan Beulich
  2015-05-15  7:09         ` Chen, Tiejun
  0 siblings, 1 reply; 125+ messages in thread
From: Jan Beulich @ 2015-05-15  6:16 UTC (permalink / raw)
  To: Tiejun Chen
  Cc: tim, kevin.tian, wei.liu2, ian.campbell, andrew.cooper3,
	Ian.Jackson, xen-devel, stefano.stabellini, yang.z.zhang

>>> On 15.05.15 at 04:57, <tiejun.chen@intel.com> wrote:
> On 2015/4/20 21:51, Jan Beulich wrote:
>>>>> On 10.04.15 at 11:22, <tiejun.chen@intel.com> wrote:
>>> --- a/tools/libxl/libxl_dom.c
>>> +++ b/tools/libxl/libxl_dom.c
>>> @@ -787,6 +787,70 @@ out:
>>>       return rc;
>>>   }
>>>
>>> +static int libxl__domain_construct_memmap(libxl_ctx *ctx,
>>> +                                          libxl_domain_config *d_config,
>>> +                                          uint32_t domid,
>>> +                                          struct xc_hvm_build_args *args,
>>> +                                          int num_pcidevs,
>>> +                                          libxl_device_pci *pcidevs)
>>> +{
>>> +    unsigned int nr = 0, i;
>>> +    /* We always own at least one lowmem entry. */
>>> +    unsigned int e820_entries = 1;
>>> +    uint64_t highmem_end = 0, highmem_size = args->mem_size - args->lowmem_size;
>>> +    struct e820entry *e820 = NULL;
>>> +
>>> +    /* Add all rdm entries. */
>>> +    e820_entries += d_config->num_rdms;
>>> +
>>> +    /* If we should have a highmem range. */
>>> +    if (highmem_size)
>>> +    {
>>> +        highmem_end = (1ull<<32) + highmem_size;
>>> +        e820_entries++;
>>> +    }
>>> +
>>> +    e820 = malloc(sizeof(struct e820entry) * e820_entries);
>>> +    if (!e820) {
>>> +        return -1;
>>> +    }
>>> +
>>> +    /* Low memory */
>>> +    e820[nr].addr = 0x100000;
>>> +    e820[nr].size = args->lowmem_size - 0x100000;
>>> +    e820[nr].type = E820_RAM;
>>
>> If you really mean it to be this lax (not covering the low 1Mb), then
>> you need to explain why in a comment (and the consuming side
>> should also have a similar explanation then).
>>
> 
> Okay, here may need this,
> 
> /* 
> 
>   * Low RAM starts at least from 1M to make sure all standard regions 
> 
>   * of the PC memory map, like BIOS, VGA memory-mapped I/O and vgabios, 
> 
>   * have enough space.
>   */
> #define GUEST_LOW_MEM_START_DEFAULT 0x100000

But this only states a generic fact, but doesn't explain why you can
lump together all the different things below 1Mb into a single E820
entry.

>>> +    nr++;
>>> +
>>> +    /* RDM mapping */
>>> +    for (i = 0; i < d_config->num_rdms; i++) {
>>> +        /*
>>> +         * We should drop this kind of rdm entry.
>>> +         */
>>> +        if (d_config->rdms[i].flag == LIBXL_RDM_RESERVE_FLAG_INVALID)
>>> +            continue;
>>> +
>>> +        e820[nr].addr = d_config->rdms[i].start;
>>> +        e820[nr].size = d_config->rdms[i].size;
>>> +        e820[nr].type = E820_RESERVED;
>>> +        nr++;
>>> +    }
>>
>> Is this guaranteed not to produce overlapping entries?
>>
> 
> Right, I would add this at the beginning,
> 
>      if (e820_entries >= E820MAX) {
>          LOG(ERROR, "Ooops! Too many entries in the memory map!\n");
>          return -1;
>      }

That would be a protection against too many entries, but not against
overlapping ones.

Jan

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC][PATCH 12/13] hvmloader/pci: skip reserved ranges
  2015-05-15  3:18     ` Chen, Tiejun
@ 2015-05-15  6:19       ` Jan Beulich
  2015-05-15  7:34         ` Chen, Tiejun
  0 siblings, 1 reply; 125+ messages in thread
From: Jan Beulich @ 2015-05-15  6:19 UTC (permalink / raw)
  To: Tiejun Chen
  Cc: tim, kevin.tian, wei.liu2, ian.campbell, andrew.cooper3,
	Ian.Jackson, xen-devel, stefano.stabellini, yang.z.zhang

>>> On 15.05.15 at 05:18, <tiejun.chen@intel.com> wrote:
> On 2015/4/20 22:21, Jan Beulich wrote:
>>>>> On 10.04.15 at 11:22, <tiejun.chen@intel.com> wrote:
>>> --- a/tools/firmware/hvmloader/pci.c
>>> +++ b/tools/firmware/hvmloader/pci.c
>>> @@ -59,8 +59,8 @@ void pci_setup(void)
>>>           uint32_t bar_reg;
>>>           uint64_t bar_sz;
>>>       } *bars = (struct bars *)scratch_start;
>>> -    unsigned int i, nr_bars = 0;
>>> -    uint64_t mmio_hole_size = 0;
>>> +    unsigned int i, j, nr_bars = 0;
>>> +    uint64_t mmio_hole_size = 0, reserved_end;
>>>
>>>       const char *s;
>>>       /*
>>> @@ -393,8 +393,23 @@ void pci_setup(void)
>>>           }
>>>
>>>           base = (resource->base  + bar_sz - 1) & ~(uint64_t)(bar_sz - 1);
>>> + reallocate_mmio:
>>>           bar_data |= (uint32_t)base;
>>>           bar_data_upper = (uint32_t)(base >> 32);
>>> +        for ( j = 0; j < memory_map.nr_map ; j++ )
>>> +        {
>>> +            if ( memory_map.map[j].type != E820_RAM )
>>> +            {
>>> +                reserved_end = memory_map.map[j].addr + 
> memory_map.map[j].size;
>>> +                if ( check_hole_conflict(base, bar_sz,
>>> +                                         memory_map.map[j].addr,
>>> +                                         memory_map.map[j].size) )
>>> +                {
>>> +                    base = (reserved_end  + bar_sz - 1) & ~(uint64_t)(bar_sz - 
> 1);
>>> +                    goto reallocate_mmio;
>>> +                }
>>> +            }
>>> +        }
>>>           base += bar_sz;
>>>
>>>           if ( (base < resource->base) || (base > resource->max) )
>>
> 
> Actually some original codes are missing here,
> 
>          if ( (base < resource->base) || (base > resource->max) )
>          {
>              printf("pci dev %02x:%x bar %02x size "PRIllx": no space for "
>                     "resource!\n", devfn>>3, devfn&7, bar_reg,
>                     PRIllx_arg(bar_sz));
>              continue;
>          }
> 
> I think this can guarantee the MMIO regions just fit in the available RAM.
> 
> Or am I wrong?

The code you cite guarantees almost nothing, it simply skips assigning
resources. Your changes potentially growing the space needed to fit
all MMIO BARs therefore also needs to adjust the up front calculation,
such that if necessary more RAM can be relocated to make the hole
large enough.

Jan

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC][PATCH 09/13] xen: enable XENMEM_set_memory_map in hvm
  2015-05-15  6:12       ` Jan Beulich
@ 2015-05-15  6:24         ` Chen, Tiejun
  2015-05-15  6:35           ` Jan Beulich
  0 siblings, 1 reply; 125+ messages in thread
From: Chen, Tiejun @ 2015-05-15  6:24 UTC (permalink / raw)
  To: Jan Beulich
  Cc: tim, kevin.tian, wei.liu2, ian.campbell, andrew.cooper3,
	Ian.Jackson, xen-devel, stefano.stabellini, yang.z.zhang

On 2015/5/15 14:12, Jan Beulich wrote:
>>>> On 15.05.15 at 04:33, <tiejun.chen@intel.com> wrote:
>
>>
>> On 2015/4/20 21:46, Jan Beulich wrote:
>>>>>> On 10.04.15 at 11:22, <tiejun.chen@intel.com> wrote:
>>>> --- a/xen/arch/x86/hvm/hvm.c
>>>> +++ b/xen/arch/x86/hvm/hvm.c
>>>> @@ -4729,7 +4729,6 @@ static long hvm_memory_op(int cmd,
>> XEN_GUEST_HANDLE_PARAM(void) arg)
>>>>
>>>>        switch ( cmd & MEMOP_CMD_MASK )
>>>>        {
>>>> -    case XENMEM_memory_map:
>>>
>>> Title and description talk about XENMEM_set_memory_map only. As
>>> I think the implementation is right, the former will need updating. Do
>>> you actually need a HVM domain to be able to XENMEM_set_memory_map
>>
>> Yes. Actually we need to enable two hypercalls here,
>>
>> #1. XENMEM_set_memory_map --> Set
>> #2. XENMEM_memory_map --> Get
>
> You say "yes" without saying why, and ...

Instead of constructing e820 in the case of hvmloader, now we'd like to 
set up a basic e820 while building hvm, so we need to enable 
XENMEM_set_memory_map/XENMEM_memory_map to own this approach in hvm case.

>
>>> on itself? If not, it should probably replace XENMEM_memory_map here.
>>>
>>
>> Just rephrase,
>>
>> xen: enable XENMEM set/get memory_map in hvm
>>
>> This patch enables XENMEM_set_memory_map in hvm and then we can use
>> it to setup the e820 mappings, and finally hvmloader can get
>> these mappings with XENMEM_memory_map.
>
> ... according to this wording of yours it's not needed.
>

Sorry, anything confound you or me?

Thanks
Tiejun

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC][PATCH 13/13] hvmloader/e820: construct guest e820 table
  2015-05-15  6:11     ` Chen, Tiejun
@ 2015-05-15  6:25       ` Jan Beulich
  2015-05-15  6:39         ` Chen, Tiejun
  0 siblings, 1 reply; 125+ messages in thread
From: Jan Beulich @ 2015-05-15  6:25 UTC (permalink / raw)
  To: Tiejun Chen
  Cc: tim, kevin.tian, wei.liu2, ian.campbell, andrew.cooper3,
	Ian.Jackson, xen-devel, stefano.stabellini, yang.z.zhang

>>> On 15.05.15 at 08:11, <tiejun.chen@intel.com> wrote:
> On 2015/4/20 22:29, Jan Beulich wrote:
>>>>> On 10.04.15 at 11:22, <tiejun.chen@intel.com> wrote:
>>> @@ -119,10 +120,6 @@ int build_e820_table(struct e820entry *e820,
>>>
>>>       /* Low RAM goes here. Reserve space for special pages. */
>>>       BUG_ON((hvm_info->low_mem_pgend << PAGE_SHIFT) < (2u << 20));
>>> -    e820[nr].addr = 0x100000;
>>> -    e820[nr].size = (hvm_info->low_mem_pgend << PAGE_SHIFT) - e820[nr].addr;
>>> -    e820[nr].type = E820_RAM;
>>> -    nr++;
>>
>> I think the above comment needs adjustment with all this code
>> removed. I also wonder how meaningful the BUG_ON() is with
>> ->low_mem_pgend no longer used for E820 table construction.
>> Perhaps this needs another BUG_ON() validating that the field
>> matches some value from memory_map.map[]?
> 
> But I think hvm_info->low_mem_pgend is still correct, right?

I think so, but as said it's becoming less used and hence less
relevant here.

> And 
> additionally, there's no any obvious flag to indicate which 
> memory_map.map[x] is that last low memory map.

I didn't imply it would be immediately obvious _how_ to do this.
I'm merely wanting to avoid leaving meaningless BUG_ON()s in
the code, while meaningful ones are amiss.

> Even we may separate the 
> low memory to construct memory_map.map[]...

???

>>> @@ -159,16 +156,37 @@ int build_e820_table(struct e820entry *e820,
>>>           nr++;
>>>       }
>>>
>>> -
>>> -    if ( hvm_info->high_mem_pgend )
>>> +    /* Construct the remaining according memory_map[]. */
>>> +    for ( i = 0; i < memory_map.nr_map; i++ )
>>>       {
>>> -        e820[nr].addr = ((uint64_t)1 << 32);
>>> -        e820[nr].size =
>>> -            ((uint64_t)hvm_info->high_mem_pgend << PAGE_SHIFT) - e820[nr].addr;
>>> -        e820[nr].type = E820_RAM;
>>> +        e820[nr].addr = memory_map.map[i].addr;
>>> +        e820[nr].size = memory_map.map[i].size;
>>> +        e820[nr].type = memory_map.map[i].type;
>>
>> Afaict you could use structure assignment here to make this
>> more readable.
> 
> Sorry, are you saying this?
> 
> memcpy(&e820[nr], &memory_map.map[i], sizeof(struct e820entry));

No, structure assignment (which, other than memcpy(), is type safe):

    e820[nr] = memory_map.map[i];

Jan

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC][PATCH 09/13] xen: enable XENMEM_set_memory_map in hvm
  2015-05-15  6:24         ` Chen, Tiejun
@ 2015-05-15  6:35           ` Jan Beulich
  2015-05-15  6:59             ` Chen, Tiejun
  0 siblings, 1 reply; 125+ messages in thread
From: Jan Beulich @ 2015-05-15  6:35 UTC (permalink / raw)
  To: Tiejun Chen
  Cc: tim, kevin.tian, wei.liu2, ian.campbell, andrew.cooper3,
	Ian.Jackson, xen-devel, stefano.stabellini, yang.z.zhang

>>> On 15.05.15 at 08:24, <tiejun.chen@intel.com> wrote:
> On 2015/5/15 14:12, Jan Beulich wrote:
>>>>> On 15.05.15 at 04:33, <tiejun.chen@intel.com> wrote:
>>> On 2015/4/20 21:46, Jan Beulich wrote:
>>>>>>> On 10.04.15 at 11:22, <tiejun.chen@intel.com> wrote:
>>>>> --- a/xen/arch/x86/hvm/hvm.c
>>>>> +++ b/xen/arch/x86/hvm/hvm.c
>>>>> @@ -4729,7 +4729,6 @@ static long hvm_memory_op(int cmd,
>>> XEN_GUEST_HANDLE_PARAM(void) arg)
>>>>>
>>>>>        switch ( cmd & MEMOP_CMD_MASK )
>>>>>        {
>>>>> -    case XENMEM_memory_map:
>>>>
>>>> Title and description talk about XENMEM_set_memory_map only. As
>>>> I think the implementation is right, the former will need updating. Do
>>>> you actually need a HVM domain to be able to XENMEM_set_memory_map
>>>
>>> Yes. Actually we need to enable two hypercalls here,
>>>
>>> #1. XENMEM_set_memory_map --> Set
>>> #2. XENMEM_memory_map --> Get
>>
>> You say "yes" without saying why, and ...
> 
> Instead of constructing e820 in the case of hvmloader, now we'd like to 
> set up a basic e820 while building hvm, so we need to enable 
> XENMEM_set_memory_map/XENMEM_memory_map to own this approach in hvm case.

You continue to ignore ...

>>>> on itself? If not, it should probably replace XENMEM_memory_map here.

... the "on itself" here. Of course the tool stack needs to be able to
invoke this. But does the guest itself need to?

Jan

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC][PATCH 13/13] hvmloader/e820: construct guest e820 table
  2015-05-15  6:25       ` Jan Beulich
@ 2015-05-15  6:39         ` Chen, Tiejun
  2015-05-15  6:56           ` Jan Beulich
  0 siblings, 1 reply; 125+ messages in thread
From: Chen, Tiejun @ 2015-05-15  6:39 UTC (permalink / raw)
  To: Jan Beulich
  Cc: tim, kevin.tian, wei.liu2, ian.campbell, andrew.cooper3,
	Ian.Jackson, xen-devel, stefano.stabellini, yang.z.zhang

On 2015/5/15 14:25, Jan Beulich wrote:
>>>> On 15.05.15 at 08:11, <tiejun.chen@intel.com> wrote:
>> On 2015/4/20 22:29, Jan Beulich wrote:
>>>>>> On 10.04.15 at 11:22, <tiejun.chen@intel.com> wrote:
>>>> @@ -119,10 +120,6 @@ int build_e820_table(struct e820entry *e820,
>>>>
>>>>        /* Low RAM goes here. Reserve space for special pages. */
>>>>        BUG_ON((hvm_info->low_mem_pgend << PAGE_SHIFT) < (2u << 20));
>>>> -    e820[nr].addr = 0x100000;
>>>> -    e820[nr].size = (hvm_info->low_mem_pgend << PAGE_SHIFT) - e820[nr].addr;
>>>> -    e820[nr].type = E820_RAM;
>>>> -    nr++;
>>>
>>> I think the above comment needs adjustment with all this code
>>> removed. I also wonder how meaningful the BUG_ON() is with
>>> ->low_mem_pgend no longer used for E820 table construction.
>>> Perhaps this needs another BUG_ON() validating that the field
>>> matches some value from memory_map.map[]?
>>
>> But I think hvm_info->low_mem_pgend is still correct, right?
>
> I think so, but as said it's becoming less used and hence less
> relevant here.

Understood.

>
>> And
>> additionally, there's no any obvious flag to indicate which
>> memory_map.map[x] is that last low memory map.
>
> I didn't imply it would be immediately obvious _how_ to do this.
> I'm merely wanting to avoid leaving meaningless BUG_ON()s in
> the code, while meaningful ones are amiss.

Maybe we should lookup all .map[] to get the lowest memory map and then 
BUG_ON?

>
>> Even we may separate the
>> low memory to construct memory_map.map[]...
>
> ???

Sorry I just mean that the low memory is not represented with only one 
memory_map.map[] in some cases. Is it impossible? Even in the future? Or 
actually we always consider the lowest memory map?

>
>>>> @@ -159,16 +156,37 @@ int build_e820_table(struct e820entry *e820,
>>>>            nr++;
>>>>        }
>>>>
>>>> -
>>>> -    if ( hvm_info->high_mem_pgend )
>>>> +    /* Construct the remaining according memory_map[]. */
>>>> +    for ( i = 0; i < memory_map.nr_map; i++ )
>>>>        {
>>>> -        e820[nr].addr = ((uint64_t)1 << 32);
>>>> -        e820[nr].size =
>>>> -            ((uint64_t)hvm_info->high_mem_pgend << PAGE_SHIFT) - e820[nr].addr;
>>>> -        e820[nr].type = E820_RAM;
>>>> +        e820[nr].addr = memory_map.map[i].addr;
>>>> +        e820[nr].size = memory_map.map[i].size;
>>>> +        e820[nr].type = memory_map.map[i].type;
>>>
>>> Afaict you could use structure assignment here to make this
>>> more readable.
>>
>> Sorry, are you saying this?
>>
>> memcpy(&e820[nr], &memory_map.map[i], sizeof(struct e820entry));
>
> No, structure assignment (which, other than memcpy(), is type safe):
>
>      e820[nr] = memory_map.map[i];
>

Understood.

Thanks
Tiejun

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC][PATCH 13/13] hvmloader/e820: construct guest e820 table
  2015-05-15  6:39         ` Chen, Tiejun
@ 2015-05-15  6:56           ` Jan Beulich
  2015-05-15  7:11             ` Chen, Tiejun
  0 siblings, 1 reply; 125+ messages in thread
From: Jan Beulich @ 2015-05-15  6:56 UTC (permalink / raw)
  To: Tiejun Chen
  Cc: tim, kevin.tian, wei.liu2, ian.campbell, andrew.cooper3,
	Ian.Jackson, xen-devel, stefano.stabellini, yang.z.zhang

>>> On 15.05.15 at 08:39, <tiejun.chen@intel.com> wrote:
> On 2015/5/15 14:25, Jan Beulich wrote:
>>>>> On 15.05.15 at 08:11, <tiejun.chen@intel.com> wrote:
>>> Even we may separate the
>>> low memory to construct memory_map.map[]...
>>
>> ???
> 
> Sorry I just mean that the low memory is not represented with only one 
> memory_map.map[] in some cases.

That's correct.

Jan

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC][PATCH 09/13] xen: enable XENMEM_set_memory_map in hvm
  2015-05-15  6:35           ` Jan Beulich
@ 2015-05-15  6:59             ` Chen, Tiejun
  0 siblings, 0 replies; 125+ messages in thread
From: Chen, Tiejun @ 2015-05-15  6:59 UTC (permalink / raw)
  To: Jan Beulich
  Cc: tim, kevin.tian, wei.liu2, ian.campbell, andrew.cooper3,
	Ian.Jackson, xen-devel, stefano.stabellini, yang.z.zhang

On 2015/5/15 14:35, Jan Beulich wrote:
>>>> On 15.05.15 at 08:24, <tiejun.chen@intel.com> wrote:
>> On 2015/5/15 14:12, Jan Beulich wrote:
>>>>>> On 15.05.15 at 04:33, <tiejun.chen@intel.com> wrote:
>>>> On 2015/4/20 21:46, Jan Beulich wrote:
>>>>>>>> On 10.04.15 at 11:22, <tiejun.chen@intel.com> wrote:
>>>>>> --- a/xen/arch/x86/hvm/hvm.c
>>>>>> +++ b/xen/arch/x86/hvm/hvm.c
>>>>>> @@ -4729,7 +4729,6 @@ static long hvm_memory_op(int cmd,
>>>> XEN_GUEST_HANDLE_PARAM(void) arg)
>>>>>>
>>>>>>         switch ( cmd & MEMOP_CMD_MASK )
>>>>>>         {
>>>>>> -    case XENMEM_memory_map:
>>>>>
>>>>> Title and description talk about XENMEM_set_memory_map only. As
>>>>> I think the implementation is right, the former will need updating. Do
>>>>> you actually need a HVM domain to be able to XENMEM_set_memory_map
>>>>
>>>> Yes. Actually we need to enable two hypercalls here,
>>>>
>>>> #1. XENMEM_set_memory_map --> Set
>>>> #2. XENMEM_memory_map --> Get
>>>
>>> You say "yes" without saying why, and ...
>>
>> Instead of constructing e820 in the case of hvmloader, now we'd like to
>> set up a basic e820 while building hvm, so we need to enable
>> XENMEM_set_memory_map/XENMEM_memory_map to own this approach in hvm case.
>
> You continue to ignore ...
>
>>>>> on itself? If not, it should probably replace XENMEM_memory_map here.
>
> ... the "on itself" here. Of course the tool stack needs to be able to
> invoke this. But does the guest itself need to?

No, so you're advising I just update the patch head,

xen: enable XENMEM_memory_map in hvm

This patch enables XENMEM_memory_map in hvm. So we can use it to
setup the e820 mappings.

Thanks
Tiejun

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC][PATCH 10/13] tools: extend XENMEM_set_memory_map
  2015-05-15  6:16       ` Jan Beulich
@ 2015-05-15  7:09         ` Chen, Tiejun
  2015-05-15  7:32           ` Jan Beulich
  0 siblings, 1 reply; 125+ messages in thread
From: Chen, Tiejun @ 2015-05-15  7:09 UTC (permalink / raw)
  To: Jan Beulich
  Cc: tim, kevin.tian, wei.liu2, ian.campbell, andrew.cooper3,
	Ian.Jackson, xen-devel, stefano.stabellini, yang.z.zhang

On 2015/5/15 14:16, Jan Beulich wrote:
>>>> On 15.05.15 at 04:57, <tiejun.chen@intel.com> wrote:
>> On 2015/4/20 21:51, Jan Beulich wrote:
>>>>>> On 10.04.15 at 11:22, <tiejun.chen@intel.com> wrote:
>>>> --- a/tools/libxl/libxl_dom.c
>>>> +++ b/tools/libxl/libxl_dom.c
>>>> @@ -787,6 +787,70 @@ out:
>>>>        return rc;
>>>>    }
>>>>
>>>> +static int libxl__domain_construct_memmap(libxl_ctx *ctx,
>>>> +                                          libxl_domain_config *d_config,
>>>> +                                          uint32_t domid,
>>>> +                                          struct xc_hvm_build_args *args,
>>>> +                                          int num_pcidevs,
>>>> +                                          libxl_device_pci *pcidevs)
>>>> +{
>>>> +    unsigned int nr = 0, i;
>>>> +    /* We always own at least one lowmem entry. */
>>>> +    unsigned int e820_entries = 1;
>>>> +    uint64_t highmem_end = 0, highmem_size = args->mem_size - args->lowmem_size;
>>>> +    struct e820entry *e820 = NULL;
>>>> +
>>>> +    /* Add all rdm entries. */
>>>> +    e820_entries += d_config->num_rdms;
>>>> +
>>>> +    /* If we should have a highmem range. */
>>>> +    if (highmem_size)
>>>> +    {
>>>> +        highmem_end = (1ull<<32) + highmem_size;
>>>> +        e820_entries++;
>>>> +    }
>>>> +
>>>> +    e820 = malloc(sizeof(struct e820entry) * e820_entries);
>>>> +    if (!e820) {
>>>> +        return -1;
>>>> +    }
>>>> +
>>>> +    /* Low memory */
>>>> +    e820[nr].addr = 0x100000;
>>>> +    e820[nr].size = args->lowmem_size - 0x100000;
>>>> +    e820[nr].type = E820_RAM;
>>>
>>> If you really mean it to be this lax (not covering the low 1Mb), then
>>> you need to explain why in a comment (and the consuming side
>>> should also have a similar explanation then).
>>>
>>
>> Okay, here may need this,
>>
>> /*
>>
>>    * Low RAM starts at least from 1M to make sure all standard regions
>>
>>    * of the PC memory map, like BIOS, VGA memory-mapped I/O and vgabios,
>>
>>    * have enough space.
>>    */
>> #define GUEST_LOW_MEM_START_DEFAULT 0x100000
>
> But this only states a generic fact, but doesn't explain why you can
> lump together all the different things below 1Mb into a single E820
> entry.

I'm not sure if I'm misleading you. All different things below 1M is not 
in a single entry. Here we just construct these mappings:

#1. [1M, lowmem_end]
#2. [RDM]
#3. [4G, highmem_end]

Those stuffs below 1M are still constructed with multiple e820 entries 
by hvmloader. At this point I don't change anything.

>
>>>> +    nr++;
>>>> +
>>>> +    /* RDM mapping */
>>>> +    for (i = 0; i < d_config->num_rdms; i++) {
>>>> +        /*
>>>> +         * We should drop this kind of rdm entry.
>>>> +         */
>>>> +        if (d_config->rdms[i].flag == LIBXL_RDM_RESERVE_FLAG_INVALID)
>>>> +            continue;
>>>> +
>>>> +        e820[nr].addr = d_config->rdms[i].start;
>>>> +        e820[nr].size = d_config->rdms[i].size;
>>>> +        e820[nr].type = E820_RESERVED;
>>>> +        nr++;
>>>> +    }
>>>
>>> Is this guaranteed not to produce overlapping entries?
>>>
>>
>> Right, I would add this at the beginning,
>>
>>       if (e820_entries >= E820MAX) {
>>           LOG(ERROR, "Ooops! Too many entries in the memory map!\n");
>>           return -1;
>>       }
>
> That would be a protection against too many entries, but not against
> overlapping ones.
>

Are you saying these kinds of mapping?

#1. [1M, lowmem_end]
#2. [RDM]
#3. [4G, highmem_end]

Before we call this function we already finish handling RDM with our 
policy. This means there's no any overlapping here.

Thanks
TIejun

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC][PATCH 13/13] hvmloader/e820: construct guest e820 table
  2015-05-15  6:56           ` Jan Beulich
@ 2015-05-15  7:11             ` Chen, Tiejun
  2015-05-15  7:34               ` Jan Beulich
  0 siblings, 1 reply; 125+ messages in thread
From: Chen, Tiejun @ 2015-05-15  7:11 UTC (permalink / raw)
  To: Jan Beulich
  Cc: tim, kevin.tian, wei.liu2, ian.campbell, andrew.cooper3,
	Ian.Jackson, xen-devel, stefano.stabellini, yang.z.zhang

On 2015/5/15 14:56, Jan Beulich wrote:
>>>> On 15.05.15 at 08:39, <tiejun.chen@intel.com> wrote:
>> On 2015/5/15 14:25, Jan Beulich wrote:
>>>>>> On 15.05.15 at 08:11, <tiejun.chen@intel.com> wrote:
>>>> Even we may separate the
>>>> low memory to construct memory_map.map[]...
>>>
>>> ???
>>
>> Sorry I just mean that the low memory is not represented with only one
>> memory_map.map[] in some cases.
>
> That's correct.
>

So just lets keep that original BUG_ON()?

Thanks
Tiejun

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC][PATCH 10/13] tools: extend XENMEM_set_memory_map
  2015-05-15  7:09         ` Chen, Tiejun
@ 2015-05-15  7:32           ` Jan Beulich
  2015-05-15  7:51             ` Chen, Tiejun
  0 siblings, 1 reply; 125+ messages in thread
From: Jan Beulich @ 2015-05-15  7:32 UTC (permalink / raw)
  To: Tiejun Chen
  Cc: tim, kevin.tian, wei.liu2, ian.campbell, andrew.cooper3,
	Ian.Jackson, xen-devel, stefano.stabellini, yang.z.zhang

>>> On 15.05.15 at 09:09, <tiejun.chen@intel.com> wrote:
> On 2015/5/15 14:16, Jan Beulich wrote:
>>>>> On 15.05.15 at 04:57, <tiejun.chen@intel.com> wrote:
>>> On 2015/4/20 21:51, Jan Beulich wrote:
>>>>>>> On 10.04.15 at 11:22, <tiejun.chen@intel.com> wrote:
>>>>> --- a/tools/libxl/libxl_dom.c
>>>>> +++ b/tools/libxl/libxl_dom.c
>>>>> @@ -787,6 +787,70 @@ out:
>>>>>        return rc;
>>>>>    }
>>>>>
>>>>> +static int libxl__domain_construct_memmap(libxl_ctx *ctx,
>>>>> +                                          libxl_domain_config *d_config,
>>>>> +                                          uint32_t domid,
>>>>> +                                          struct xc_hvm_build_args *args,
>>>>> +                                          int num_pcidevs,
>>>>> +                                          libxl_device_pci *pcidevs)
>>>>> +{
>>>>> +    unsigned int nr = 0, i;
>>>>> +    /* We always own at least one lowmem entry. */
>>>>> +    unsigned int e820_entries = 1;
>>>>> +    uint64_t highmem_end = 0, highmem_size = args->mem_size - args->lowmem_size;
>>>>> +    struct e820entry *e820 = NULL;
>>>>> +
>>>>> +    /* Add all rdm entries. */
>>>>> +    e820_entries += d_config->num_rdms;
>>>>> +
>>>>> +    /* If we should have a highmem range. */
>>>>> +    if (highmem_size)
>>>>> +    {
>>>>> +        highmem_end = (1ull<<32) + highmem_size;
>>>>> +        e820_entries++;
>>>>> +    }
>>>>> +
>>>>> +    e820 = malloc(sizeof(struct e820entry) * e820_entries);
>>>>> +    if (!e820) {
>>>>> +        return -1;
>>>>> +    }
>>>>> +
>>>>> +    /* Low memory */
>>>>> +    e820[nr].addr = 0x100000;
>>>>> +    e820[nr].size = args->lowmem_size - 0x100000;
>>>>> +    e820[nr].type = E820_RAM;
>>>>
>>>> If you really mean it to be this lax (not covering the low 1Mb), then
>>>> you need to explain why in a comment (and the consuming side
>>>> should also have a similar explanation then).
>>>>
>>>
>>> Okay, here may need this,
>>>
>>> /*
>>>
>>>    * Low RAM starts at least from 1M to make sure all standard regions
>>>
>>>    * of the PC memory map, like BIOS, VGA memory-mapped I/O and vgabios,
>>>
>>>    * have enough space.
>>>    */
>>> #define GUEST_LOW_MEM_START_DEFAULT 0x100000
>>
>> But this only states a generic fact, but doesn't explain why you can
>> lump together all the different things below 1Mb into a single E820
>> entry.
> 
> I'm not sure if I'm misleading you. All different things below 1M is not 
> in a single entry. Here we just construct these mappings:
> 
> #1. [1M, lowmem_end]
> #2. [RDM]
> #3. [4G, highmem_end]
> 
> Those stuffs below 1M are still constructed with multiple e820 entries 
> by hvmloader. At this point I don't change anything.

Then _that_ is what the comment needs to say.

>>>>> +    nr++;
>>>>> +
>>>>> +    /* RDM mapping */
>>>>> +    for (i = 0; i < d_config->num_rdms; i++) {
>>>>> +        /*
>>>>> +         * We should drop this kind of rdm entry.
>>>>> +         */
>>>>> +        if (d_config->rdms[i].flag == LIBXL_RDM_RESERVE_FLAG_INVALID)
>>>>> +            continue;
>>>>> +
>>>>> +        e820[nr].addr = d_config->rdms[i].start;
>>>>> +        e820[nr].size = d_config->rdms[i].size;
>>>>> +        e820[nr].type = E820_RESERVED;
>>>>> +        nr++;
>>>>> +    }
>>>>
>>>> Is this guaranteed not to produce overlapping entries?
>>>>
>>>
>>> Right, I would add this at the beginning,
>>>
>>>       if (e820_entries >= E820MAX) {
>>>           LOG(ERROR, "Ooops! Too many entries in the memory map!\n");
>>>           return -1;
>>>       }
>>
>> That would be a protection against too many entries, but not against
>> overlapping ones.
>>
> 
> Are you saying these kinds of mapping?
> 
> #1. [1M, lowmem_end]
> #2. [RDM]
> #3. [4G, highmem_end]
> 
> Before we call this function we already finish handling RDM with our 
> policy. This means there's no any overlapping here.

That would be fine then. Note what I had asked: "Is this guaranteed
not to produce overlapping entries?" I.e. if it is guaranteed (which
afaict isn't obvious from the code itself), then please again say why
in a brief comment.

Jan

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC][PATCH 12/13] hvmloader/pci: skip reserved ranges
  2015-05-15  6:19       ` Jan Beulich
@ 2015-05-15  7:34         ` Chen, Tiejun
  2015-05-15  7:44           ` Jan Beulich
  0 siblings, 1 reply; 125+ messages in thread
From: Chen, Tiejun @ 2015-05-15  7:34 UTC (permalink / raw)
  To: Jan Beulich
  Cc: tim, kevin.tian, wei.liu2, ian.campbell, andrew.cooper3,
	Ian.Jackson, xen-devel, stefano.stabellini, yang.z.zhang

On 2015/5/15 14:19, Jan Beulich wrote:
>>>> On 15.05.15 at 05:18, <tiejun.chen@intel.com> wrote:
>> On 2015/4/20 22:21, Jan Beulich wrote:
>>>>>> On 10.04.15 at 11:22, <tiejun.chen@intel.com> wrote:
>>>> --- a/tools/firmware/hvmloader/pci.c
>>>> +++ b/tools/firmware/hvmloader/pci.c
>>>> @@ -59,8 +59,8 @@ void pci_setup(void)
>>>>            uint32_t bar_reg;
>>>>            uint64_t bar_sz;
>>>>        } *bars = (struct bars *)scratch_start;
>>>> -    unsigned int i, nr_bars = 0;
>>>> -    uint64_t mmio_hole_size = 0;
>>>> +    unsigned int i, j, nr_bars = 0;
>>>> +    uint64_t mmio_hole_size = 0, reserved_end;
>>>>
>>>>        const char *s;
>>>>        /*
>>>> @@ -393,8 +393,23 @@ void pci_setup(void)
>>>>            }
>>>>
>>>>            base = (resource->base  + bar_sz - 1) & ~(uint64_t)(bar_sz - 1);
>>>> + reallocate_mmio:
>>>>            bar_data |= (uint32_t)base;
>>>>            bar_data_upper = (uint32_t)(base >> 32);
>>>> +        for ( j = 0; j < memory_map.nr_map ; j++ )
>>>> +        {
>>>> +            if ( memory_map.map[j].type != E820_RAM )
>>>> +            {
>>>> +                reserved_end = memory_map.map[j].addr +
>> memory_map.map[j].size;
>>>> +                if ( check_hole_conflict(base, bar_sz,
>>>> +                                         memory_map.map[j].addr,
>>>> +                                         memory_map.map[j].size) )
>>>> +                {
>>>> +                    base = (reserved_end  + bar_sz - 1) & ~(uint64_t)(bar_sz -
>> 1);
>>>> +                    goto reallocate_mmio;
>>>> +                }
>>>> +            }
>>>> +        }
>>>>            base += bar_sz;
>>>>
>>>>            if ( (base < resource->base) || (base > resource->max) )
>>>
>>
>> Actually some original codes are missing here,
>>
>>           if ( (base < resource->base) || (base > resource->max) )
>>           {
>>               printf("pci dev %02x:%x bar %02x size "PRIllx": no space for "
>>                      "resource!\n", devfn>>3, devfn&7, bar_reg,
>>                      PRIllx_arg(bar_sz));
>>               continue;
>>           }
>>
>> I think this can guarantee the MMIO regions just fit in the available RAM.
>>
>> Or am I wrong?
>
> The code you cite guarantees almost nothing, it simply skips assigning
> resources. Your changes potentially growing the space needed to fit
> all MMIO BARs therefore also needs to adjust the up front calculation,
> such that if necessary more RAM can be relocated to make the hole
> large enough.
>

You're right.

Just think about we're always trying to check pci_mem_start to populate 
more RAM to obtain enough PCI mempry,

     /* Relocate RAM that overlaps PCI space (in 64k-page chunks). */
     while ( (pci_mem_start >> PAGE_SHIFT) < hvm_info->low_mem_pgend )
     {
         struct xen_add_to_physmap xatp;
         unsigned int nr_pages = min_t(
             unsigned int,
             hvm_info->low_mem_pgend - (pci_mem_start >> PAGE_SHIFT),
             (1u << 16) - 1);
         if ( hvm_info->high_mem_pgend == 0 )
             hvm_info->high_mem_pgend = 1ull << (32 - PAGE_SHIFT);
         hvm_info->low_mem_pgend -= nr_pages;
         printf("Relocating 0x%x pages from "PRIllx" to "PRIllx\
                " for lowmem MMIO hole\n",
                nr_pages,
                PRIllx_arg(((uint64_t)hvm_info->low_mem_pgend)<<PAGE_SHIFT),
 
PRIllx_arg(((uint64_t)hvm_info->high_mem_pgend)<<PAGE_SHIFT));
         xatp.domid = DOMID_SELF;
         xatp.space = XENMAPSPACE_gmfn_range;
         xatp.idx   = hvm_info->low_mem_pgend;
         xatp.gpfn  = hvm_info->high_mem_pgend;
         xatp.size  = nr_pages;
         if ( hypercall_memory_op(XENMEM_add_to_physmap, &xatp) != 0 )
             BUG();
         hvm_info->high_mem_pgend += nr_pages;
     }


So I think we may need to adjust pci_mem_start like this,

@@ -301,6 +301,19 @@ void pci_setup(void)
              pci_mem_start <<= 1;
      }

+    /* Relocate PCI memory that overlaps reserved space, like RDM. */
+    for ( j = 0; j < memory_map.nr_map ; j++ )
+    {
+        if ( memory_map.map[j].type != E820_RAM )
+        {
+            reserved_end = memory_map.map[j].addr + memory_map.map[j].size;
+            if ( check_overlap(pci_mem_start, pci_mem_end,
+                               memory_map.map[j].addr,
+                               memory_map.map[j].size) )
+                pci_mem_start -= memory_map.map[j].size >> PAGE_SHIFT;
+        }
+    }
+
      if ( mmio_total > (pci_mem_end - pci_mem_start) )
      {
          printf("Low MMIO hole not large enough for all devices,"

Right?

Thanks
Tiejun

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC][PATCH 13/13] hvmloader/e820: construct guest e820 table
  2015-05-15  7:11             ` Chen, Tiejun
@ 2015-05-15  7:34               ` Jan Beulich
  2015-05-15  8:00                 ` Chen, Tiejun
  0 siblings, 1 reply; 125+ messages in thread
From: Jan Beulich @ 2015-05-15  7:34 UTC (permalink / raw)
  To: Tiejun Chen
  Cc: tim, kevin.tian, wei.liu2, ian.campbell, andrew.cooper3,
	Ian.Jackson, xen-devel, stefano.stabellini, yang.z.zhang

>>> On 15.05.15 at 09:11, <tiejun.chen@intel.com> wrote:
> On 2015/5/15 14:56, Jan Beulich wrote:
>>>>> On 15.05.15 at 08:39, <tiejun.chen@intel.com> wrote:
>>> On 2015/5/15 14:25, Jan Beulich wrote:
>>>>>>> On 15.05.15 at 08:11, <tiejun.chen@intel.com> wrote:
>>>>> Even we may separate the
>>>>> low memory to construct memory_map.map[]...
>>>>
>>>> ???
>>>
>>> Sorry I just mean that the low memory is not represented with only one
>>> memory_map.map[] in some cases.
>>
>> That's correct.
>>
> 
> So just lets keep that original BUG_ON()?

In your previous reply you seemed to agree that the BUG_ON() is
becoming meaningless. Why do you now suggest to keep it then?

Jan

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC][PATCH 12/13] hvmloader/pci: skip reserved ranges
  2015-05-15  7:34         ` Chen, Tiejun
@ 2015-05-15  7:44           ` Jan Beulich
  2015-05-15  8:16             ` Chen, Tiejun
  0 siblings, 1 reply; 125+ messages in thread
From: Jan Beulich @ 2015-05-15  7:44 UTC (permalink / raw)
  To: Tiejun Chen
  Cc: tim, kevin.tian, wei.liu2, ian.campbell, andrew.cooper3,
	Ian.Jackson, xen-devel, stefano.stabellini, yang.z.zhang

>>> On 15.05.15 at 09:34, <tiejun.chen@intel.com> wrote:
> So I think we may need to adjust pci_mem_start like this,
> 
> @@ -301,6 +301,19 @@ void pci_setup(void)
>               pci_mem_start <<= 1;
>       }
> 
> +    /* Relocate PCI memory that overlaps reserved space, like RDM. */
> +    for ( j = 0; j < memory_map.nr_map ; j++ )
> +    {
> +        if ( memory_map.map[j].type != E820_RAM )
> +        {
> +            reserved_end = memory_map.map[j].addr + memory_map.map[j].size;
> +            if ( check_overlap(pci_mem_start, pci_mem_end,
> +                               memory_map.map[j].addr,
> +                               memory_map.map[j].size) )
> +                pci_mem_start -= memory_map.map[j].size >> PAGE_SHIFT;
> +        }
> +    }
> +
>       if ( mmio_total > (pci_mem_end - pci_mem_start) )
>       {
>           printf("Low MMIO hole not large enough for all devices,"
> 
> Right?

I think that gets you in the right direction, but isn't enough, as it
doesn't account for (unavoidable) gaps (BARs are always a power
of 2 in size and accordingly aligned).

Jan

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC][PATCH 10/13] tools: extend XENMEM_set_memory_map
  2015-05-15  7:32           ` Jan Beulich
@ 2015-05-15  7:51             ` Chen, Tiejun
  0 siblings, 0 replies; 125+ messages in thread
From: Chen, Tiejun @ 2015-05-15  7:51 UTC (permalink / raw)
  To: Jan Beulich
  Cc: tim, kevin.tian, wei.liu2, ian.campbell, andrew.cooper3,
	Ian.Jackson, xen-devel, stefano.stabellini, yang.z.zhang

On 2015/5/15 15:32, Jan Beulich wrote:
>>>> On 15.05.15 at 09:09, <tiejun.chen@intel.com> wrote:
>> On 2015/5/15 14:16, Jan Beulich wrote:
>>>>>> On 15.05.15 at 04:57, <tiejun.chen@intel.com> wrote:
>>>> On 2015/4/20 21:51, Jan Beulich wrote:
>>>>>>>> On 10.04.15 at 11:22, <tiejun.chen@intel.com> wrote:
>>>>>> --- a/tools/libxl/libxl_dom.c
>>>>>> +++ b/tools/libxl/libxl_dom.c
>>>>>> @@ -787,6 +787,70 @@ out:
>>>>>>         return rc;
>>>>>>     }
>>>>>>
>>>>>> +static int libxl__domain_construct_memmap(libxl_ctx *ctx,
>>>>>> +                                          libxl_domain_config *d_config,
>>>>>> +                                          uint32_t domid,
>>>>>> +                                          struct xc_hvm_build_args *args,
>>>>>> +                                          int num_pcidevs,
>>>>>> +                                          libxl_device_pci *pcidevs)
>>>>>> +{
>>>>>> +    unsigned int nr = 0, i;
>>>>>> +    /* We always own at least one lowmem entry. */
>>>>>> +    unsigned int e820_entries = 1;
>>>>>> +    uint64_t highmem_end = 0, highmem_size = args->mem_size - args->lowmem_size;
>>>>>> +    struct e820entry *e820 = NULL;
>>>>>> +
>>>>>> +    /* Add all rdm entries. */
>>>>>> +    e820_entries += d_config->num_rdms;
>>>>>> +
>>>>>> +    /* If we should have a highmem range. */
>>>>>> +    if (highmem_size)
>>>>>> +    {
>>>>>> +        highmem_end = (1ull<<32) + highmem_size;
>>>>>> +        e820_entries++;
>>>>>> +    }
>>>>>> +
>>>>>> +    e820 = malloc(sizeof(struct e820entry) * e820_entries);
>>>>>> +    if (!e820) {
>>>>>> +        return -1;
>>>>>> +    }
>>>>>> +
>>>>>> +    /* Low memory */
>>>>>> +    e820[nr].addr = 0x100000;
>>>>>> +    e820[nr].size = args->lowmem_size - 0x100000;
>>>>>> +    e820[nr].type = E820_RAM;
>>>>>
>>>>> If you really mean it to be this lax (not covering the low 1Mb), then
>>>>> you need to explain why in a comment (and the consuming side
>>>>> should also have a similar explanation then).
>>>>>
>>>>
>>>> Okay, here may need this,
>>>>
>>>> /*
>>>>
>>>>     * Low RAM starts at least from 1M to make sure all standard regions
>>>>
>>>>     * of the PC memory map, like BIOS, VGA memory-mapped I/O and vgabios,
>>>>
>>>>     * have enough space.
>>>>     */
>>>> #define GUEST_LOW_MEM_START_DEFAULT 0x100000
>>>
>>> But this only states a generic fact, but doesn't explain why you can
>>> lump together all the different things below 1Mb into a single E820
>>> entry.
>>
>> I'm not sure if I'm misleading you. All different things below 1M is not
>> in a single entry. Here we just construct these mappings:
>>
>> #1. [1M, lowmem_end]
>> #2. [RDM]
>> #3. [4G, highmem_end]
>>
>> Those stuffs below 1M are still constructed with multiple e820 entries
>> by hvmloader. At this point I don't change anything.
>
> Then _that_ is what the comment needs to say.
>
>>>>>> +    nr++;
>>>>>> +
>>>>>> +    /* RDM mapping */
>>>>>> +    for (i = 0; i < d_config->num_rdms; i++) {
>>>>>> +        /*
>>>>>> +         * We should drop this kind of rdm entry.
>>>>>> +         */
>>>>>> +        if (d_config->rdms[i].flag == LIBXL_RDM_RESERVE_FLAG_INVALID)
>>>>>> +            continue;
>>>>>> +
>>>>>> +        e820[nr].addr = d_config->rdms[i].start;
>>>>>> +        e820[nr].size = d_config->rdms[i].size;
>>>>>> +        e820[nr].type = E820_RESERVED;
>>>>>> +        nr++;
>>>>>> +    }
>>>>>
>>>>> Is this guaranteed not to produce overlapping entries?
>>>>>
>>>>
>>>> Right, I would add this at the beginning,
>>>>
>>>>        if (e820_entries >= E820MAX) {
>>>>            LOG(ERROR, "Ooops! Too many entries in the memory map!\n");
>>>>            return -1;
>>>>        }
>>>
>>> That would be a protection against too many entries, but not against
>>> overlapping ones.
>>>
>>
>> Are you saying these kinds of mapping?
>>
>> #1. [1M, lowmem_end]
>> #2. [RDM]
>> #3. [4G, highmem_end]
>>
>> Before we call this function we already finish handling RDM with our
>> policy. This means there's no any overlapping here.
>
> That would be fine then. Note what I had asked: "Is this guaranteed
> not to produce overlapping entries?" I.e. if it is guaranteed (which
> afaict isn't obvious from the code itself), then please again say why
> in a brief comment.
>

Sorry for this unclear reply previously.

I would summary current reply as a comment.

Thanks
Tiejun

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC][PATCH 13/13] hvmloader/e820: construct guest e820 table
  2015-05-15  7:34               ` Jan Beulich
@ 2015-05-15  8:00                 ` Chen, Tiejun
  2015-05-15  8:12                   ` Jan Beulich
  0 siblings, 1 reply; 125+ messages in thread
From: Chen, Tiejun @ 2015-05-15  8:00 UTC (permalink / raw)
  To: Jan Beulich
  Cc: tim, kevin.tian, wei.liu2, ian.campbell, andrew.cooper3,
	Ian.Jackson, xen-devel, stefano.stabellini, yang.z.zhang

On 2015/5/15 15:34, Jan Beulich wrote:
>>>> On 15.05.15 at 09:11, <tiejun.chen@intel.com> wrote:
>> On 2015/5/15 14:56, Jan Beulich wrote:
>>>>>> On 15.05.15 at 08:39, <tiejun.chen@intel.com> wrote:
>>>> On 2015/5/15 14:25, Jan Beulich wrote:
>>>>>>>> On 15.05.15 at 08:11, <tiejun.chen@intel.com> wrote:
>>>>>> Even we may separate the
>>>>>> low memory to construct memory_map.map[]...
>>>>>
>>>>> ???
>>>>
>>>> Sorry I just mean that the low memory is not represented with only one
>>>> memory_map.map[] in some cases.
>>>
>>> That's correct.
>>>
>>
>> So just lets keep that original BUG_ON()?
>
> In your previous reply you seemed to agree that the BUG_ON() is
> becoming meaningless. Why do you now suggest to keep it then?
>

Sorry just let me clear this.

We still need to check this,

(hvm_info->low_mem_pgend << PAGE_SHIFT) < (2u << 20)

Right? I agree the original is really less relevant as you said. But 
what is that meaningful BUG_ON() as you expect?

I have no better ideas and just show one draft thought in previous email.

Thanks
Tiejun

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC][PATCH 13/13] hvmloader/e820: construct guest e820 table
  2015-05-15  8:00                 ` Chen, Tiejun
@ 2015-05-15  8:12                   ` Jan Beulich
  2015-05-15  8:47                     ` Chen, Tiejun
  0 siblings, 1 reply; 125+ messages in thread
From: Jan Beulich @ 2015-05-15  8:12 UTC (permalink / raw)
  To: Tiejun Chen
  Cc: tim, kevin.tian, wei.liu2, ian.campbell, andrew.cooper3,
	Ian.Jackson, xen-devel, stefano.stabellini, yang.z.zhang

>>> On 15.05.15 at 10:00, <tiejun.chen@intel.com> wrote:
> On 2015/5/15 15:34, Jan Beulich wrote:
>>>>> On 15.05.15 at 09:11, <tiejun.chen@intel.com> wrote:
>>> On 2015/5/15 14:56, Jan Beulich wrote:
>>>>>>> On 15.05.15 at 08:39, <tiejun.chen@intel.com> wrote:
>>>>> On 2015/5/15 14:25, Jan Beulich wrote:
>>>>>>>>> On 15.05.15 at 08:11, <tiejun.chen@intel.com> wrote:
>>>>>>> Even we may separate the
>>>>>>> low memory to construct memory_map.map[]...
>>>>>>
>>>>>> ???
>>>>>
>>>>> Sorry I just mean that the low memory is not represented with only one
>>>>> memory_map.map[] in some cases.
>>>>
>>>> That's correct.
>>>>
>>>
>>> So just lets keep that original BUG_ON()?
>>
>> In your previous reply you seemed to agree that the BUG_ON() is
>> becoming meaningless. Why do you now suggest to keep it then?
>>
> 
> Sorry just let me clear this.
> 
> We still need to check this,
> 
> (hvm_info->low_mem_pgend << PAGE_SHIFT) < (2u << 20)
> 
> Right? I agree the original is really less relevant as you said.

And I didn't ask you to drop it. All I asked it to amend it with another
BUG_ON() checking what the one above won't cover anymore.

Jan

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC][PATCH 12/13] hvmloader/pci: skip reserved ranges
  2015-05-15  7:44           ` Jan Beulich
@ 2015-05-15  8:16             ` Chen, Tiejun
  2015-05-15  8:31               ` Jan Beulich
  0 siblings, 1 reply; 125+ messages in thread
From: Chen, Tiejun @ 2015-05-15  8:16 UTC (permalink / raw)
  To: Jan Beulich
  Cc: tim, kevin.tian, wei.liu2, ian.campbell, andrew.cooper3,
	Ian.Jackson, xen-devel, stefano.stabellini, yang.z.zhang

On 2015/5/15 15:44, Jan Beulich wrote:
>>>> On 15.05.15 at 09:34, <tiejun.chen@intel.com> wrote:
>> So I think we may need to adjust pci_mem_start like this,
>>
>> @@ -301,6 +301,19 @@ void pci_setup(void)
>>                pci_mem_start <<= 1;
>>        }
>>
>> +    /* Relocate PCI memory that overlaps reserved space, like RDM. */
>> +    for ( j = 0; j < memory_map.nr_map ; j++ )
>> +    {
>> +        if ( memory_map.map[j].type != E820_RAM )
>> +        {
>> +            reserved_end = memory_map.map[j].addr + memory_map.map[j].size;
>> +            if ( check_overlap(pci_mem_start, pci_mem_end,
>> +                               memory_map.map[j].addr,
>> +                               memory_map.map[j].size) )
>> +                pci_mem_start -= memory_map.map[j].size >> PAGE_SHIFT;
>> +        }
>> +    }
>> +
>>        if ( mmio_total > (pci_mem_end - pci_mem_start) )
>>        {
>>            printf("Low MMIO hole not large enough for all devices,"
>>
>> Right?
>
> I think that gets you in the right direction, but isn't enough, as it
> doesn't account for (unavoidable) gaps (BARs are always a power
> of 2 in size and accordingly aligned).

Right.

But as you see, we always take this action, >> PAGE_SHIFT, so this means 
its always a sort of power of 2.

Additionally, lets go back here,

         if ( (base < resource->base) || (base > resource->max) )
         {
             printf("pci dev %02x:%x bar %02x size "PRIllx": no space for "
                    "resource!\n", devfn>>3, devfn&7, bar_reg,
                    PRIllx_arg(bar_sz));
             continue;
         }

I mean even without rdm, the original codes don't consider handling this 
lack of space from a alignment in advance, right?


Thanks
Tiejun

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC][PATCH 12/13] hvmloader/pci: skip reserved ranges
  2015-05-15  8:16             ` Chen, Tiejun
@ 2015-05-15  8:31               ` Jan Beulich
  2015-05-15  9:21                 ` Chen, Tiejun
  0 siblings, 1 reply; 125+ messages in thread
From: Jan Beulich @ 2015-05-15  8:31 UTC (permalink / raw)
  To: Tiejun Chen
  Cc: tim, kevin.tian, wei.liu2, ian.campbell, andrew.cooper3,
	Ian.Jackson, xen-devel, stefano.stabellini, yang.z.zhang

>>> On 15.05.15 at 10:16, <tiejun.chen@intel.com> wrote:
> On 2015/5/15 15:44, Jan Beulich wrote:
>>>>> On 15.05.15 at 09:34, <tiejun.chen@intel.com> wrote:
>>> So I think we may need to adjust pci_mem_start like this,
>>>
>>> @@ -301,6 +301,19 @@ void pci_setup(void)
>>>                pci_mem_start <<= 1;
>>>        }
>>>
>>> +    /* Relocate PCI memory that overlaps reserved space, like RDM. */
>>> +    for ( j = 0; j < memory_map.nr_map ; j++ )
>>> +    {
>>> +        if ( memory_map.map[j].type != E820_RAM )
>>> +        {
>>> +            reserved_end = memory_map.map[j].addr + memory_map.map[j].size;
>>> +            if ( check_overlap(pci_mem_start, pci_mem_end,
>>> +                               memory_map.map[j].addr,
>>> +                               memory_map.map[j].size) )
>>> +                pci_mem_start -= memory_map.map[j].size >> PAGE_SHIFT;
>>> +        }
>>> +    }
>>> +
>>>        if ( mmio_total > (pci_mem_end - pci_mem_start) )
>>>        {
>>>            printf("Low MMIO hole not large enough for all devices,"
>>>
>>> Right?
>>
>> I think that gets you in the right direction, but isn't enough, as it
>> doesn't account for (unavoidable) gaps (BARs are always a power
>> of 2 in size and accordingly aligned).
> 
> Right.
> 
> But as you see, we always take this action, >> PAGE_SHIFT, so this means 
> its always a sort of power of 2.

No, certainly not.

> Additionally, lets go back here,
> 
>          if ( (base < resource->base) || (base > resource->max) )
>          {
>              printf("pci dev %02x:%x bar %02x size "PRIllx": no space for "
>                     "resource!\n", devfn>>3, devfn&7, bar_reg,
>                     PRIllx_arg(bar_sz));
>              continue;
>          }
> 
> I mean even without rdm, the original codes don't consider handling this 
> lack of space from a alignment in advance, right?

Correct. But your change increases the chances of this getting
used. Also I think you may want to carefully look at under what
conditions this path gets taken without and with your patches.

Jan

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC][PATCH 13/13] hvmloader/e820: construct guest e820 table
  2015-05-15  8:12                   ` Jan Beulich
@ 2015-05-15  8:47                     ` Chen, Tiejun
  2015-05-15  8:54                       ` Jan Beulich
  0 siblings, 1 reply; 125+ messages in thread
From: Chen, Tiejun @ 2015-05-15  8:47 UTC (permalink / raw)
  To: Jan Beulich
  Cc: tim, kevin.tian, wei.liu2, ian.campbell, andrew.cooper3,
	Ian.Jackson, xen-devel, stefano.stabellini, yang.z.zhang

On 2015/5/15 16:12, Jan Beulich wrote:
>>>> On 15.05.15 at 10:00, <tiejun.chen@intel.com> wrote:
>> On 2015/5/15 15:34, Jan Beulich wrote:
>>>>>> On 15.05.15 at 09:11, <tiejun.chen@intel.com> wrote:
>>>> On 2015/5/15 14:56, Jan Beulich wrote:
>>>>>>>> On 15.05.15 at 08:39, <tiejun.chen@intel.com> wrote:
>>>>>> On 2015/5/15 14:25, Jan Beulich wrote:
>>>>>>>>>> On 15.05.15 at 08:11, <tiejun.chen@intel.com> wrote:
>>>>>>>> Even we may separate the
>>>>>>>> low memory to construct memory_map.map[]...
>>>>>>>
>>>>>>> ???
>>>>>>
>>>>>> Sorry I just mean that the low memory is not represented with only one
>>>>>> memory_map.map[] in some cases.
>>>>>
>>>>> That's correct.
>>>>>
>>>>
>>>> So just lets keep that original BUG_ON()?
>>>
>>> In your previous reply you seemed to agree that the BUG_ON() is
>>> becoming meaningless. Why do you now suggest to keep it then?
>>>
>>
>> Sorry just let me clear this.
>>
>> We still need to check this,
>>
>> (hvm_info->low_mem_pgend << PAGE_SHIFT) < (2u << 20)
>>
>> Right? I agree the original is really less relevant as you said.
>
> And I didn't ask you to drop it. All I asked it to amend it with another
> BUG_ON() checking what the one above won't cover anymore.
>

Another point hits me in this case while we're discussing MMIO in 
another email.

We may populate RAM to get enough MMIO in pci_setup() so this means 
hvm_info->low_mem_pgend would be changed. Furthermore, low_mem_pgend 
isn't going to keep recording our original lowmem while building domain. 
So I think this original BUG_ON() is still good to cover this case. But 
obviously, we need to adjust its associated memory_map.map[x] right now.

So what about this?

@@ -74,6 +74,7 @@ int build_e820_table(struct e820entry *e820,
                       unsigned int bios_image_base)
  {
      unsigned int nr = 0, i, j;
+    uint64_t low_mem_pgend = hvm_info->low_mem_pgend << PAGE_SHIFT;

      if ( !lowmem_reserved_base )
              lowmem_reserved_base = 0xA0000;
@@ -117,9 +118,6 @@ int build_e820_table(struct e820entry *e820,
      e820[nr].type = E820_RESERVED;
      nr++;

-    /* Low RAM goes here. Reserve space for special pages. */
-    BUG_ON((hvm_info->low_mem_pgend << PAGE_SHIFT) < (2u << 20));
-
      /*
       * Explicitly reserve space for special pages.
       * This space starts at RESERVED_MEMBASE an extends to cover various
@@ -162,6 +160,17 @@ int build_e820_table(struct e820entry *e820,
          nr++;
      }

+    /* Low RAM goes here. Reserve space for special pages. */
+    BUG_ON(low_mem_pgend < (2u << 20));
+    /* We may need to adjust lowmem end. */
+    for ( i = 0; i < memory_map.nr_map; i++ )
+    {
+        uint64_t end = e820[i].addr + e820[i].size;
+        if ( e820[i].type == E820_RAM &&
+             low_mem_pgend > e820[i].addr && low_mem_pgend < end )
+            e820[i].size = low_mem_pgend - e820[i].addr;
+    }
+
      /* May need to reorder all e820 entries. */
      for ( j = 0; j < nr-1; j++ )
      {


Thanks
Tiejun

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC][PATCH 13/13] hvmloader/e820: construct guest e820 table
  2015-05-15  8:47                     ` Chen, Tiejun
@ 2015-05-15  8:54                       ` Jan Beulich
  2015-05-15  9:18                         ` Chen, Tiejun
  0 siblings, 1 reply; 125+ messages in thread
From: Jan Beulich @ 2015-05-15  8:54 UTC (permalink / raw)
  To: Tiejun Chen
  Cc: tim, kevin.tian, wei.liu2, ian.campbell, andrew.cooper3,
	Ian.Jackson, xen-devel, stefano.stabellini, yang.z.zhang

>>> On 15.05.15 at 10:47, <tiejun.chen@intel.com> wrote:
> On 2015/5/15 16:12, Jan Beulich wrote:
>>>>> On 15.05.15 at 10:00, <tiejun.chen@intel.com> wrote:
>>> On 2015/5/15 15:34, Jan Beulich wrote:
>>>>>>> On 15.05.15 at 09:11, <tiejun.chen@intel.com> wrote:
>>>>> On 2015/5/15 14:56, Jan Beulich wrote:
>>>>>>>>> On 15.05.15 at 08:39, <tiejun.chen@intel.com> wrote:
>>>>>>> On 2015/5/15 14:25, Jan Beulich wrote:
>>>>>>>>>>> On 15.05.15 at 08:11, <tiejun.chen@intel.com> wrote:
>>>>>>>>> Even we may separate the
>>>>>>>>> low memory to construct memory_map.map[]...
>>>>>>>>
>>>>>>>> ???
>>>>>>>
>>>>>>> Sorry I just mean that the low memory is not represented with only one
>>>>>>> memory_map.map[] in some cases.
>>>>>>
>>>>>> That's correct.
>>>>>>
>>>>>
>>>>> So just lets keep that original BUG_ON()?
>>>>
>>>> In your previous reply you seemed to agree that the BUG_ON() is
>>>> becoming meaningless. Why do you now suggest to keep it then?
>>>>
>>>
>>> Sorry just let me clear this.
>>>
>>> We still need to check this,
>>>
>>> (hvm_info->low_mem_pgend << PAGE_SHIFT) < (2u << 20)
>>>
>>> Right? I agree the original is really less relevant as you said.
>>
>> And I didn't ask you to drop it. All I asked it to amend it with another
>> BUG_ON() checking what the one above won't cover anymore.
>>
> 
> Another point hits me in this case while we're discussing MMIO in 
> another email.
> 
> We may populate RAM to get enough MMIO in pci_setup() so this means 
> hvm_info->low_mem_pgend would be changed. Furthermore, low_mem_pgend 
> isn't going to keep recording our original lowmem while building domain. 
> So I think this original BUG_ON() is still good to cover this case. But 
> obviously, we need to adjust its associated memory_map.map[x] right now.
> 
> So what about this?

I don't think I have enough context anymore of all the other changes
that you have pending to be able to reasonably judge on such code
fragments. This will need looking at in the context of the next patch
series revision.

Jan

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC][PATCH 13/13] hvmloader/e820: construct guest e820 table
  2015-05-15  8:54                       ` Jan Beulich
@ 2015-05-15  9:18                         ` Chen, Tiejun
  0 siblings, 0 replies; 125+ messages in thread
From: Chen, Tiejun @ 2015-05-15  9:18 UTC (permalink / raw)
  To: Jan Beulich
  Cc: tim, kevin.tian, wei.liu2, ian.campbell, andrew.cooper3,
	Ian.Jackson, xen-devel, stefano.stabellini, yang.z.zhang

On 2015/5/15 16:54, Jan Beulich wrote:
>>>> On 15.05.15 at 10:47, <tiejun.chen@intel.com> wrote:
>> On 2015/5/15 16:12, Jan Beulich wrote:
>>>>>> On 15.05.15 at 10:00, <tiejun.chen@intel.com> wrote:
>>>> On 2015/5/15 15:34, Jan Beulich wrote:
>>>>>>>> On 15.05.15 at 09:11, <tiejun.chen@intel.com> wrote:
>>>>>> On 2015/5/15 14:56, Jan Beulich wrote:
>>>>>>>>>> On 15.05.15 at 08:39, <tiejun.chen@intel.com> wrote:
>>>>>>>> On 2015/5/15 14:25, Jan Beulich wrote:
>>>>>>>>>>>> On 15.05.15 at 08:11, <tiejun.chen@intel.com> wrote:
>>>>>>>>>> Even we may separate the
>>>>>>>>>> low memory to construct memory_map.map[]...
>>>>>>>>>
>>>>>>>>> ???
>>>>>>>>
>>>>>>>> Sorry I just mean that the low memory is not represented with only one
>>>>>>>> memory_map.map[] in some cases.
>>>>>>>
>>>>>>> That's correct.
>>>>>>>
>>>>>>
>>>>>> So just lets keep that original BUG_ON()?
>>>>>
>>>>> In your previous reply you seemed to agree that the BUG_ON() is
>>>>> becoming meaningless. Why do you now suggest to keep it then?
>>>>>
>>>>
>>>> Sorry just let me clear this.
>>>>
>>>> We still need to check this,
>>>>
>>>> (hvm_info->low_mem_pgend << PAGE_SHIFT) < (2u << 20)
>>>>
>>>> Right? I agree the original is really less relevant as you said.
>>>
>>> And I didn't ask you to drop it. All I asked it to amend it with another
>>> BUG_ON() checking what the one above won't cover anymore.
>>>
>>
>> Another point hits me in this case while we're discussing MMIO in
>> another email.
>>
>> We may populate RAM to get enough MMIO in pci_setup() so this means
>> hvm_info->low_mem_pgend would be changed. Furthermore, low_mem_pgend
>> isn't going to keep recording our original lowmem while building domain.
>> So I think this original BUG_ON() is still good to cover this case. But
>> obviously, we need to adjust its associated memory_map.map[x] right now.
>>
>> So what about this?
>
> I don't think I have enough context anymore of all the other changes
> that you have pending to be able to reasonably judge on such code
> fragments. This will need looking at in the context of the next patch
> series revision.
>

Okay. Currently I'm still addressing some comments from tools 
maintainer. Once that is fine enough to step next, I'll send out next 
revision.

Thanks for your review.

Thanks
Tiejun

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC][PATCH 12/13] hvmloader/pci: skip reserved ranges
  2015-05-15  8:31               ` Jan Beulich
@ 2015-05-15  9:21                 ` Chen, Tiejun
  2015-05-15  9:32                   ` Jan Beulich
  0 siblings, 1 reply; 125+ messages in thread
From: Chen, Tiejun @ 2015-05-15  9:21 UTC (permalink / raw)
  To: Jan Beulich
  Cc: tim, kevin.tian, wei.liu2, ian.campbell, andrew.cooper3,
	Ian.Jackson, xen-devel, stefano.stabellini, yang.z.zhang

On 2015/5/15 16:31, Jan Beulich wrote:
>>>> On 15.05.15 at 10:16, <tiejun.chen@intel.com> wrote:
>> On 2015/5/15 15:44, Jan Beulich wrote:
>>>>>> On 15.05.15 at 09:34, <tiejun.chen@intel.com> wrote:
>>>> So I think we may need to adjust pci_mem_start like this,
>>>>
>>>> @@ -301,6 +301,19 @@ void pci_setup(void)
>>>>                 pci_mem_start <<= 1;
>>>>         }
>>>>
>>>> +    /* Relocate PCI memory that overlaps reserved space, like RDM. */
>>>> +    for ( j = 0; j < memory_map.nr_map ; j++ )
>>>> +    {
>>>> +        if ( memory_map.map[j].type != E820_RAM )
>>>> +        {
>>>> +            reserved_end = memory_map.map[j].addr + memory_map.map[j].size;
>>>> +            if ( check_overlap(pci_mem_start, pci_mem_end,
>>>> +                               memory_map.map[j].addr,
>>>> +                               memory_map.map[j].size) )
>>>> +                pci_mem_start -= memory_map.map[j].size >> PAGE_SHIFT;
>>>> +        }
>>>> +    }
>>>> +
>>>>         if ( mmio_total > (pci_mem_end - pci_mem_start) )
>>>>         {
>>>>             printf("Low MMIO hole not large enough for all devices,"
>>>>
>>>> Right?
>>>
>>> I think that gets you in the right direction, but isn't enough, as it
>>> doesn't account for (unavoidable) gaps (BARs are always a power
>>> of 2 in size and accordingly aligned).
>>
>> Right.
>>
>> But as you see, we always take this action, >> PAGE_SHIFT, so this means
>> its always a sort of power of 2.
>
> No, certainly not.

rdm start or size? Or anything else I'm missing?

>
>> Additionally, lets go back here,
>>
>>           if ( (base < resource->base) || (base > resource->max) )
>>           {
>>               printf("pci dev %02x:%x bar %02x size "PRIllx": no space for "
>>                      "resource!\n", devfn>>3, devfn&7, bar_reg,
>>                      PRIllx_arg(bar_sz));
>>               continue;
>>           }
>>
>> I mean even without rdm, the original codes don't consider handling this
>> lack of space from a alignment in advance, right?
>
> Correct. But your change increases the chances of this getting
> used. Also I think you may want to carefully look at under what
> conditions this path gets taken without and with your patches.
>

Sure.

I think I can record that max bar_sz to improve this like,

@@ -60,7 +60,7 @@ void pci_setup(void)
          uint64_t bar_sz;
      } *bars = (struct bars *)scratch_start;
      unsigned int i, j, nr_bars = 0;
-    uint64_t mmio_hole_size = 0, reserved_end;
+    uint64_t mmio_hole_size = 0, reserved_end, max_bar_sz = 0;

      const char *s;
      /*
@@ -226,6 +226,8 @@ void pci_setup(void)
              bars[i].devfn   = devfn;
              bars[i].bar_reg = bar_reg;
              bars[i].bar_sz  = bar_sz;
+            if ( bar_sz > max_bar_sz )
+                max_bar_sz = bar_sz;

              if ( ((bar_data & PCI_BASE_ADDRESS_SPACE) ==
                    PCI_BASE_ADDRESS_SPACE_MEMORY) ||
@@ -311,6 +313,8 @@ void pci_setup(void)
                                 memory_map.map[j].addr,
                                 memory_map.map[j].size) )
                  pci_mem_start -= memory_map.map[j].size >> PAGE_SHIFT;
+                pci_mem_start = (pci_mem_start + max_bar_sz - 1) &
+                                    ~(uint64_t)(max_bar_sz - 1);
          }
      }


Note you also can take close look at this change in next revision if 
this is not that bad with your glance :)

Thanks
Tiejun

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC][PATCH 12/13] hvmloader/pci: skip reserved ranges
  2015-05-15  9:21                 ` Chen, Tiejun
@ 2015-05-15  9:32                   ` Jan Beulich
  0 siblings, 0 replies; 125+ messages in thread
From: Jan Beulich @ 2015-05-15  9:32 UTC (permalink / raw)
  To: Tiejun Chen
  Cc: tim, kevin.tian, wei.liu2, ian.campbell, andrew.cooper3,
	Ian.Jackson, xen-devel, stefano.stabellini, yang.z.zhang

>>> On 15.05.15 at 11:21, <tiejun.chen@intel.com> wrote:
> On 2015/5/15 16:31, Jan Beulich wrote:
>>>>> On 15.05.15 at 10:16, <tiejun.chen@intel.com> wrote:
>>> On 2015/5/15 15:44, Jan Beulich wrote:
>>>>>>> On 15.05.15 at 09:34, <tiejun.chen@intel.com> wrote:
>>>>> So I think we may need to adjust pci_mem_start like this,
>>>>>
>>>>> @@ -301,6 +301,19 @@ void pci_setup(void)
>>>>>                 pci_mem_start <<= 1;
>>>>>         }
>>>>>
>>>>> +    /* Relocate PCI memory that overlaps reserved space, like RDM. */
>>>>> +    for ( j = 0; j < memory_map.nr_map ; j++ )
>>>>> +    {
>>>>> +        if ( memory_map.map[j].type != E820_RAM )
>>>>> +        {
>>>>> +            reserved_end = memory_map.map[j].addr + memory_map.map[j].size;
>>>>> +            if ( check_overlap(pci_mem_start, pci_mem_end,
>>>>> +                               memory_map.map[j].addr,
>>>>> +                               memory_map.map[j].size) )
>>>>> +                pci_mem_start -= memory_map.map[j].size >> PAGE_SHIFT;
>>>>> +        }
>>>>> +    }
>>>>> +
>>>>>         if ( mmio_total > (pci_mem_end - pci_mem_start) )
>>>>>         {
>>>>>             printf("Low MMIO hole not large enough for all devices,"
>>>>>
>>>>> Right?
>>>>
>>>> I think that gets you in the right direction, but isn't enough, as it
>>>> doesn't account for (unavoidable) gaps (BARs are always a power
>>>> of 2 in size and accordingly aligned).
>>>
>>> Right.
>>>
>>> But as you see, we always take this action, >> PAGE_SHIFT, so this means
>>> its always a sort of power of 2.
>>
>> No, certainly not.
> 
> rdm start or size? Or anything else I'm missing?

Both. How can an arbitrary value shifted right be PAGE_SHIFT be
guaranteed to be a power of two?

Jan

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC][PATCH 04/13] tools/libxl: detect and avoid conflicts with RDM
  2015-05-14  8:27         ` Chen, Tiejun
@ 2015-05-18  1:06           ` Chen, Tiejun
  2015-05-18 20:00           ` Wei Liu
  1 sibling, 0 replies; 125+ messages in thread
From: Chen, Tiejun @ 2015-05-18  1:06 UTC (permalink / raw)
  To: Wei Liu
  Cc: kevin.tian, ian.campbell, andrew.cooper3, tim, xen-devel,
	stefano.stabellini, JBeulich, yang.z.zhang, Ian.Jackson

Wei,

Any comments?

Thanks
Tiejun

On 2015/5/14 16:27, Chen, Tiejun wrote:
> On 2015/5/11 19:32, Wei Liu wrote:
>> On Mon, May 11, 2015 at 04:09:53PM +0800, Chen, Tiejun wrote:
>>> On 2015/5/8 22:43, Wei Liu wrote:
>>>> Sorry for the late review. This series fell through the crack.
>>>>
>>>
>>> Thanks for your review.
>>>
>>>> On Fri, Apr 10, 2015 at 05:21:55PM +0800, Tiejun Chen wrote:
>>>>> While building a VM, HVM domain builder provides struct
>>>>> hvm_info_table{}
>>>>> to help hvmloader. Currently it includes two fields to construct guest
>>>>> e820 table by hvmloader, low_mem_pgend and high_mem_pgend. So we
>>>>> should
>>>>> check them to fix any conflict with RAM.
>>>>>
>>>>> RMRR can reside in address space beyond 4G theoretically, but we never
>>>>> see this in real world. So in order to avoid breaking highmem layout
>>>>
>>>> How does this break highmem layout?
>>>
>>> In most cases highmen is always continuous like [4G, ...] but RMRR is
>>> theoretically beyond 4G but very rarely. Especially we don't see this
>>> happened in real world. So we don't want to such a case breaking the
>>> highmem.
>>>
>>
>> The problem is  we take this approach just because this rarely happens
>> *now* is not future proof.  It needs to be clearly documented somewhere
>> in the manual (or any other Intel docs) and be referenced in the code.
>> Otherwise we might end up in a situation that no-one knows how it is
>> supposed to work and no-one can fix it if it breaks in the future, that
>> is, every single device on earth requires RMRR > 4G overnight (I'm
>> exaggerating).
>>
>> Or you can just make it works with highmem. How much more work do you
>> envisage?
>>
>> (If my above comment makes no sense do let me know. I only have very
>> shallow understanding of RMRR)
>
> Maybe I'm misleading you :)
>
> I don't mean RMRR > 4G is not allowed to work in our implementation.
> What I'm saying is that our *policy* is just simple for this kind of
> rare highmem case...
>
>>
>>>>
>>>>> we don't solve highmem conflict. Note this means highmem rmrr could
>>>>> still
>>>>> be supported if no conflict.
>>>>>
>
> Like these two sentences above.
>
>>>>
>>>> Aren't you actively trying to avoid conflict in libxl?
>>>
>>> RMRR is fixed by BIOS so we can't aovid conflict. Here we want to
>>> adopt some
>>> good policies to address RMRR. In the case of highmemt, that simple
>>> policy
>>> should be enough?
>>>
>>
>> Whatever policy you and HV maintainers agree on. Just clearly document
>> it.
>
> Do you mean I should brief this patch description into one separate
> document?
>
>>
>>>>
>>>>> But in the case of lowmem, RMRR probably scatter the whole RAM space.
>>>>> Especially multiple RMRR entries would worsen this to lead a
>>>>> complicated
>>>>> memory layout. And then its hard to extend hvm_info_table{} to work
>>>>> hvmloader out. So here we're trying to figure out a simple solution to
>>>>> avoid breaking existing layout. So when a conflict occurs,
>>>>>
>>>>>      #1. Above a predefined boundary (default 2G)
>>>>>          - move lowmem_end below reserved region to solve conflict;
>>>>>
>>>>
>>>> I hope this "predefined boundary" is user tunable. I will check
>>>> later in
>>>> this patch if it is the case.
>>>
>>> Yes. As we clarified in that comments,
>>>
>>> * TODO: Its batter to provide a config parameter for this boundary
>>> value.
>>>
>>> This mean I would provide a patch address this since currently I just
>>> think
>>> this is not a big deal?
>>>
>>
>> Yes please provide a config option to override that. It's reasonable
>> that user wants to change that.
>
> Okay.
>
>>
>>>>
>>>>>      #2 Below a predefined boundary (default 2G)
>>>>>          - Check force/try policy.
>>>>>          "force" policy leads to fail libxl. Note when both policies
>>>>>          are specified on a given region, 'force' is always preferred.
>>>>>          "try" policy issue a warning message and also mask this
>>>>> entry INVALID
>>>>>          to indicate we shouldn't expose this entry to hvmloader.
>>>>>
>>>>> Signed-off-by: Tiejun Chen <tiejun.chen@intel.com>
>>>>> ---
>>>>>   tools/libxc/include/xenctrl.h  |   8 ++
>>>>>   tools/libxc/include/xenguest.h |   3 +-
>>>>>   tools/libxc/xc_domain.c        |  40 +++++++++
>>>>>   tools/libxc/xc_hvm_build_x86.c |  28 +++---
>>>>>   tools/libxl/libxl_create.c     |   2 +-
>>>>>   tools/libxl/libxl_dm.c         | 195
>>>>> +++++++++++++++++++++++++++++++++++++++++
>>>>>   tools/libxl/libxl_dom.c        |  27 +++++-
>>>>>   tools/libxl/libxl_internal.h   |  11 ++-
>>>>>   tools/libxl/libxl_types.idl    |   7 ++
>>>>>   9 files changed, 303 insertions(+), 18 deletions(-)
>>>>>
>>>>> diff --git a/tools/libxc/include/xenctrl.h
>>>>> b/tools/libxc/include/xenctrl.h
>>>>> index 59bbe06..299b95f 100644
>>>>> --- a/tools/libxc/include/xenctrl.h
>>>>> +++ b/tools/libxc/include/xenctrl.h
>>>>> @@ -2053,6 +2053,14 @@ int xc_get_device_group(xc_interface *xch,
>>>>>                        uint32_t *num_sdevs,
>>>>>                        uint32_t *sdev_array);
>>>>>
>>>>> +struct xen_reserved_device_memory
>>>>> +*xc_device_get_rdm(xc_interface *xch,
>>>>> +                   uint32_t flag,
>>>>> +                   uint16_t seg,
>>>>> +                   uint8_t bus,
>>>>> +                   uint8_t devfn,
>>>>> +                   unsigned int *nr_entries);
>>>>> +
>>>>>   int xc_test_assign_device(xc_interface *xch,
>>>>>                             uint32_t domid,
>>>>>                             uint32_t machine_sbdf);
>>
>> [...]
>>
>>>>
>>>>>       uint64_t mem_target;         /* Memory target in bytes. */
>>>>>       uint64_t mmio_size;          /* Size of the MMIO hole in
>>>>> bytes. */
>>>>>       const char *image_file_name; /* File name of the image to
>>>>> load. */
>>>>> diff --git a/tools/libxc/xc_domain.c b/tools/libxc/xc_domain.c
>>>>> index 4f8383e..85b18ea 100644
>>>>> --- a/tools/libxc/xc_domain.c
>>>>> +++ b/tools/libxc/xc_domain.c
>>>>> @@ -1665,6 +1665,46 @@ int xc_assign_device(
>>>>>       return do_domctl(xch, &domctl);
>>>>>   }
>>>>>
>>>>> +struct xen_reserved_device_memory
>>>>> +*xc_device_get_rdm(xc_interface *xch,
>>>>> +                   uint32_t flag,
>>>>> +                   uint16_t seg,
>>>>> +                   uint8_t bus,
>>>>> +                   uint8_t devfn,
>>>>> +                   unsigned int *nr_entries)
>>>>> +{
>>>>> +    struct xen_reserved_device_memory *xrdm = NULL;
>>>>> +    int rc = xc_reserved_device_memory_map(xch, flag, seg, bus,
>>>>> devfn, xrdm,
>>>>> +                                           nr_entries);
>>>>> +
>>>>> +    if ( rc < 0 )
>>>>> +    {
>>>>> +        if ( errno == ENOBUFS )
>>>>> +        {
>>>>> +            if ( (xrdm = malloc(*nr_entries *
>>>>> +
>>>>> sizeof(xen_reserved_device_memory_t))) == NULL )
>>>>> +            {
>>>>> +                PERROR("Could not allocate memory.");
>>>>> +                goto out;
>>>>> +            }
>>>>
>>>> Don't you leak origin xrdm in this case?
>>>
>>> The caller to xc_device_get_rdm always frees this.
>>>
>>
>> I think I misunderstood how this function works. I thought xrdm was
>> passed in by caller, which is clearly not the case. Sorry!
>>
>>
>> In that case, the `if ( rc < 0 )' is not needed because the call should
>> always return rc < 0. An assert is good enough.
>
> assert(rc < 0)? But we can't presume the user always pass '0' firstly,
> and additionally, we may have no any RMRR indeed.
>
> So I guess what you want is,
>
> assert( rc <=0 );
> if ( !rc )
>      goto xrdm;
>
> if ( errno == ENOBUFS )
> ...
>
> Right?
>
>>
>>>>
>>>> And, this style is not very good. Shouldn't the caller allocate enough
>>>> memory before hand?
>>>
>>> Are you saying the caller to xc_device_get_rdm()? If so, any caller
>>> don't
>>> know this, too.
>>>
>>
>> I see.
>>
>>> Actually this is just a wrapper of that fundamental hypercall,
>>> xc_reserved_device_memory_map() in patch #2, and based on that, we
>>> always
>>> have to first call this to inquire how much memory we really need.
>>> And this
>>> is why we have this wrapper since we don't want to duplicate more codes.
>>>
>>> One error handler of this wrapper is just handling ENOBUFS since the
>>> caller
>>> never know how much memory we should allocate. So oftentimes we
>>> always set
>>> 'entries = 0' to inquire firstly.
>>>
>>> Here Jan suggested we may need to figure out a good way to consolidate
>>> xc_reserved_device_memory_map() and its wrapper, xc_device_get_rdm().
>>>
>>> But in some ways, that wrapper likes a static function so we just
>>> need to
>>> move this into that file associated to its caller, right?
>>>
>>
>> Yes, if there is only one user at the moment, make a static function.
>
> Thanks.
>
>>
>>>>
>>>>> +            rc = xc_reserved_device_memory_map(xch, flag, seg,
>>>>> bus, devfn, xrdm,
>>>>> +                                               nr_entries);
>>>>> +            if ( rc )
>>>>> +            {
>>>>> +                PERROR("Could not get reserved device memory maps.");
>>>>> +                free(xrdm);
>>>>> +                xrdm = NULL;
>>>>> +            }
>>>>> +        }
>>>>> +        else {
>>>>> +            PERROR("Could not get reserved device memory maps.");
>>>>> +        }
>>>>> +    }
>>>>> +
>>>>> + out:
>>>>> +    return xrdm;
>>>>> +}
>>>>> +
>>>>>   int xc_get_device_group(
>>>>>       xc_interface *xch,
>>>>>       uint32_t domid,
>>>>> diff --git a/tools/libxc/xc_hvm_build_x86.c
>>>>> b/tools/libxc/xc_hvm_build_x86.c
>>>>> index c81a25b..3f87bb3 100644
>>>>> --- a/tools/libxc/xc_hvm_build_x86.c
>>>>> +++ b/tools/libxc/xc_hvm_build_x86.c
>>>>> @@ -89,19 +89,16 @@ static int modules_init(struct
>>>>> xc_hvm_build_args *args,
>>>>>   }
>>>>>
>>>>>   static void build_hvm_info(void *hvm_info_page, uint64_t mem_size,
>>>>> -                           uint64_t mmio_start, uint64_t mmio_size)
>>>>> +                           uint64_t lowmem_end)
>>>>>   {
>>>>>       struct hvm_info_table *hvm_info = (struct hvm_info_table *)
>>>>>           (((unsigned char *)hvm_info_page) + HVM_INFO_OFFSET);
>>>>> -    uint64_t lowmem_end = mem_size, highmem_end = 0;
>>>>> +    uint64_t highmem_end = 0;
>>>>>       uint8_t sum;
>>>>>       int i;
>>>>>
>>>>> -    if ( lowmem_end > mmio_start )
>>>>> -    {
>>>>> -        highmem_end = (1ull<<32) + (lowmem_end - mmio_start);
>>>>> -        lowmem_end = mmio_start;
>>>>> -    }
>>>>> +    if ( mem_size > lowmem_end )
>>>>> +        highmem_end = (1ull<<32) + (mem_size - lowmem_end);
>>>>>
>>>>>       memset(hvm_info_page, 0, PAGE_SIZE);
>>
>> [...]
>>
>>>>
>>>>> +                                   struct xc_hvm_build_args *args)
>>>>
>>>> This function does more than "checking", so a better name is needed.
>>>>
>>>> May be you should split this function to one "build" function and one
>>>> "check" function? What do you think?
>>>
>>> We'd better keep this big one since this can make our policy
>>> understandable,
>>> but I agree we need to rename this like,
>>>
>>> libxl__domain_device_construct_rdm()
>>>
>>> construct = check + build :)
>>
>> I'm fine with this.
>
> Thanks.
>
>>
>>>
>>>>
>>>>> +{
>>>>> +    int i, j, conflict;
>>>>> +    libxl_ctx *ctx = libxl__gc_owner(gc);
>>>>
>>>> You can just use CTX macro.
>>
>> [...]
>>
>>>>
>>>>> +    if ((type == LIBXL_RDM_RESERVE_TYPE_NONE) &&
>>>>> !d_config->num_pcidevs)
>>>>> +        return 0;
>>>>> +
>>>>> +    /* Collect all rdm info if exist. */
>>>>> +    xrdm = xc_device_get_rdm(ctx->xch, LIBXL_RDM_RESERVE_TYPE_HOST,
>>>>> +                             0, 0, 0, &nr_all_rdms);
>>>>> +    if (!nr_all_rdms)
>>>>> +        return 0;
>>>>> +    d_config->rdms = libxl__calloc(gc, nr_all_rdms,
>>>>> +                                   sizeof(libxl_device_rdm));
>>>>
>>>> Note that if you use "gc" here the allocated memory will be, well,
>>>> garbage collected at some point. If you don't want them to be gc'ed you
>>>> should use NOGC.
>>>
>>> Sorry, what does that mean by 'garbage collected'?
>>>
>>
>> That means the memory allocated with gc will be freed at some point by
>> GC_FREE, because those memory regions are meant to be temporary and used
>> internally.
>>
>> When entering a libxl public API function (those start wtih
>> libxl_), that function calls GC_INIT to initialise a garbage collector.
>> When that function exits, it calls GC_FREE to free all memory allocated
>> with that gc.
>
> Thanks for sharing this good info to me!
>
>>
>> Since d_config is very likely to be used by libxl user (xl, libvirt
>> etc), you probably don't want to fill it with gc allocated memory.
>>
>
> Yes.
>
>>>>
>>>>> +    memset(d_config->rdms, 0, sizeof(libxl_device_rdm));
>>>>> +
>>>>> +    /* Query all RDM entries in this platform */
>>>>> +    if (type == LIBXL_RDM_RESERVE_TYPE_HOST) {
>>>>> +        d_config->num_rdms = nr_all_rdms;
>>>>> +        for (i = 0; i < d_config->num_rdms; i++) {
>>>>> +            d_config->rdms[i].start =
>>>>> +                                (uint64_t)xrdm[i].start_pfn <<
>>>>> XC_PAGE_SHIFT;
>>>>> +            d_config->rdms[i].size =
>>>>> +                                (uint64_t)xrdm[i].nr_pages <<
>>>>> XC_PAGE_SHIFT;
>>>>> +            d_config->rdms[i].flag = d_config->b_info.rdm.reserve;
>>>>> +        }
>>>>> +    } else {
>>>>> +        d_config->num_rdms = 0;
>>>>> +    }
>>>>
>>>> And you should move d_config->rdms = libxl__calloc inside that `if'.
>>>> That is, don't allocate memory if you don't need it.
>>>
>>> We can't since in all cases we need to preallocate this, and then we
>>> will
>>> handle this according to our policy.
>>>
>>
>> How would it ever be used again if you set d_config->num_rdms to 0? How
>> do you know the exact size of your array again?
>
> If we don't consider multiple devices shares one rdm entry, our workflow
> can be showed as follows:
>
> #1. We always preallocate all rdms[] but with memset().
> #2. Then we have two cases for that global rule,
>
> #2.1 If we really need all according to our global rule, we would set
> all rdms[] with all real rdm info and set d_config->num_rdms.
> #2.2 If we have no such a global rule to obtain all, we just clear
> d_config->num_rdms.
>
> #3. Then for per device rule
>
> #3.1 From #2.1, we just need to check if we should change one given rdm
> entry's policy if this given entry is just associated to this device.
> #3.2 From 2.2, obviously we just need to fill rdms one by one. Of
> course, its very possible that we don't fill all rdms since all
> passthroued devices might not have no rdm at all or they just occupy
> some. But anyway, finally we sync d_config->num_rdms.
>
>>
>>>>
>>>>> +    free(xrdm);
>>>>> +
>>>>> +    /* Query RDM entries per-device */
>>>>> +    for (i = 0; i < d_config->num_pcidevs; i++) {
>>>>> +        unsigned int nr_entries = 0;
>>>
>>> Maybe I should restate this,
>>>     unsigned int nr_entries;
>>>
>>>>> +        bool new = true;
>>>>> +        seg = d_config->pcidevs[i].domain;
>>>>> +        bus = d_config->pcidevs[i].bus;
>>>>> +        devfn = PCI_DEVFN(d_config->pcidevs[i].dev,
>>>>> d_config->pcidevs[i].func);
>>>>> +        nr_entries = 0;
>>>>
>>>> You've already initialised this variable.
>>>
>>> We need to set this as zero to start.
>>>
>>
>> Either of the tow works for me. Just don't want redundant
>> initialisation.
>
> Right.
>
>>
>>>>
>>>>> +        xrdm = xc_device_get_rdm(ctx->xch,
>>>>> LIBXL_RDM_RESERVE_TYPE_NONE,
>>>>> +                                 seg, bus, devfn, &nr_entries);
>>>>> +        /* No RDM to associated with this device. */
>>>>> +        if (!nr_entries)
>>>>> +            continue;
>>>>> +
>>>>> +        /* Need to check whether this entry is already saved in
>>>>> the array.
>>>>> +         * This could come from two cases:
>>>>> +         *
>>>>> +         *   - user may configure to get all RMRRs in this
>>>>> platform, which
>>>>> +         * is already queried before this point
>>>>
>>>> Formatting.
>>>
>>> Are you saying this?
>>
>> I mean you need to move "is already..." to the right go align with
>> previous line.
>
> Fixed.
>
>>
>>>
>>> +        /* Need to check whether this entry is already saved in the
>>> array.
>>>
>>> =>
>>
>> The CODING_STYLE in libxl doesn't seem to enforce this, so you can just
>> follow other examples.
>>
>>>          /*
>>>
>>>           * Need to check whether this entry is already saved in the
>>> array.
>>>           * This could come from two cases:
>>>
>>>>
>>>>> +         *   - or two assigned devices may share one RMRR entry
>>>>> +         *
>>>>> +         * different policies may be configured on the same RMRR
>>>>> due to above
>>>>> +         * two cases. We choose a simple policy to always favor
>>>>> stricter policy
>>>>> +         */
>>>>> +        for (j = 0; j < d_config->num_rdms; j++) {
>>>>> +            if (d_config->rdms[j].start ==
>>>>> +                                (uint64_t)xrdm[0].start_pfn <<
>>>>> XC_PAGE_SHIFT)
>>>>> +             {
>>>>> +                if (d_config->rdms[j].flag !=
>>>>> LIBXL_RDM_RESERVE_FLAG_FORCE)
>>>>> +                    d_config->rdms[j].flag =
>>>>> d_config->pcidevs[i].rdm_reserve;
>>>>> +                new = false;
>>>>> +                break;
>>>>> +            }
>>>>> +        }
>>>>> +
>>>>> +        if (new) {
>>>>> +            if (d_config->num_rdms > nr_all_rdms - 1) {
>>>>> +                LIBXL__LOG(CTX, LIBXL__LOG_ERROR, "Exceed rdm
>>>>> array boundary!\n");
>>>>
>>>> LOG(ERROR, ...)
>>>
>>> Fixed.
>>>
>>>>
>>>>> +                free(xrdm);
>>>>> +                return -1;
>>>>
>>>> Please use goto out idiom.
>>>
>>> We just have two 'return -1' differently so I'm not sure its worth doing
>>> this.
>>>
>>
>> Yes, please comply with libxl idiom.
>>
>>>>
>>>>> +            }
>>>>> +
>>>>> +            /*
>>>>> +             * This is a new entry.
>>>>> +             */
>>>>
>>>> /* This is a new entry. */
>>>
>>> Fixed.
>>>
>>>>
>>>>> +            d_config->rdms[d_config->num_rdms].start =
>>>>> +                                (uint64_t)xrdm[0].start_pfn <<
>>>>> XC_PAGE_SHIFT;
>>>>> +            d_config->rdms[d_config->num_rdms].size =
>>>>> +                                (uint64_t)xrdm[0].nr_pages <<
>>>>> XC_PAGE_SHIFT;
>>>>> +            d_config->rdms[d_config->num_rdms].flag =
>>>>> d_config->pcidevs[i].rdm_reserve;
>>>>> +            d_config->num_rdms++;
>>>>
>>>> Does this work? I don't see you reallocate memory.
>>>
>>> Like I replied above we always preallocate this at the beginning.
>>>
>>
>> Ah, OK.
>>
>> But please don't do this. It's hard to see you don't overrun the
>> buffer. Please allocate memory only when you need it.
>
> Sorry I don't understand this. As I mention above, we don't know how
> many rdm entries we really need to allocate. So this is why we'd like to
> preallocate all rdms at the beginning. Then this can cover all cases,
> global policy, (global policy & per device) and only per device. Even if
> multiple devices shares one rdm we also need to avoid duplicating a new...
>
>
>>
>>>>
>>>>> +        }
>>>>> +        free(xrdm);
>>>>
>>>> Bug: you free xrdm several times.
>>>
>>> No any conflict.
>>>
>>> What I did is that I would free once I finish to calling every
>>> xc_device_get_rdm().
>>>
>>
>> OK. I misread. Sorry.
>>
>>
>> [...]
>>
>>>
>>>>
>>>>> +    /* Next step is to check and avoid potential conflict between
>>>>> RDM entries
>>>>> +     * and guest RAM. To avoid intrusive impact to existing memory
>>>>> layout
>>>>> +     * {lowmem, mmio, highmem} which is passed around various
>>>>> function blocks,
>>>>> +     * below conflicts are not handled which are rare and handling
>>>>> them would
>>>>> +     * lead to a more scattered layout:
>>>>> +     *  - RMRR in highmem area (>4G)
>>>>> +     *  - RMRR lower than a defined memory boundary (e.g. 2G)
>>>>> +     * Otherwise for conflicts between boundary and 4G, we'll
>>>>> simply move lowmem
>>>>> +     * end below reserved region to solve conflict.
>>>>> +     *
>>>>> +     * If a conflict is detected on a given RMRR entry, an error
>>>>> will be
>>>>> +     * returned.
>>>>> +     * If 'force' policy is specified. Or conflict is treated as a
>>>>> warning if
>>>>> +     * 'try' policy is specified, and we also mark this as INVALID
>>>>> not to expose
>>>>> +     * this entry to hvmloader.
>>>>> +     *
>>>>> +     * Firstly we should check the case of rdm < 4G because we may
>>>>> need to
>>>>> +     * expand highmem_end.
>>>>
>>>> Is this strategy agreed in previous discussion? How future-proof is
>>>> this
>>>
>>> Yes, this is based on that design.
>>>
>>
>> OK.
>>
>> [...]
>>
>>>>>
>>>>>   int libxl__build_hvm(libxl__gc *gc, uint32_t domid,
>>>>> -              libxl_domain_build_info *info,
>>>>> +              libxl_domain_config *d_config,
>>>>>                 libxl__domain_build_state *state)
>>>>>   {
>>>>>       libxl_ctx *ctx = libxl__gc_owner(gc);
>>>>>       struct xc_hvm_build_args args = {};
>>>>>       int ret, rc = ERROR_FAIL;
>>>>> +    libxl_domain_build_info *const info = &d_config->b_info;
>>>>> +    uint64_t rdm_mem_boundary, mmio_start;
>>
>> I didn't mention this in the first pass. You seem to have inserted some
>> tabs? We use space to indent.
>>
>>
>
> Okay.
>
> Thanks
> Tiejun
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel
>
>

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC][PATCH 01/13] tools: introduce some new parameters to set rdm policy
  2015-05-15  1:52         ` Chen, Tiejun
@ 2015-05-18  1:06           ` Chen, Tiejun
  2015-05-18 19:17           ` Wei Liu
  1 sibling, 0 replies; 125+ messages in thread
From: Chen, Tiejun @ 2015-05-18  1:06 UTC (permalink / raw)
  To: Wei Liu
  Cc: kevin.tian, ian.campbell, andrew.cooper3, tim, xen-devel,
	stefano.stabellini, JBeulich, yang.z.zhang, Ian.Jackson

Wei,

Ping... :)

Thanks
Tiejun

On 2015/5/15 9:52, Chen, Tiejun wrote:
> On 2015/5/11 22:54, Wei Liu wrote:
>> On Mon, May 11, 2015 at 01:35:06PM +0800, Chen, Tiejun wrote:
>>> On 2015/5/8 21:04, Wei Liu wrote:
>>>> Sorry for the late review.
>>>>
>>>
>>> Really thanks for taking your time :)
>>>
>>>> On Fri, Apr 10, 2015 at 05:21:52PM +0800, Tiejun Chen wrote:
>>>>> This patch introduces user configurable parameters to specify RDM
>>>>> resource and according policies,
>>>>>
>>>>> Global RDM parameter:
>>>>>      rdm = [ 'host, reserve=force/try' ]
>>>>> Per-device RDM parameter:
>>>>>      pci = [ 'sbdf, rdm_reserve=force/try' ]
>>>>>
>>>>> Global RDM parameter allows user to specify reserved regions
>>>>> explicitly,
>>>>> e.g. using 'host' to include all reserved regions reported on this
>>>>> platform
>>>>> which is good to handle hotplug scenario. In the future this parameter
>>>>> may be further extended to allow specifying random regions, e.g. even
>>>>> those belonging to another platform as a preparation for live
>>>>> migration
>>>>> with passthrough devices.
>>>>>
>>>>> 'force/try' policy decides how to handle conflict when reserving RDM
>>>>> regions in pfn space. If conflict exists, 'force' means an
>>>>> immediate error
>>>>> so VM will be killed, while 'try' allows moving forward with a warning
>>>>> message thrown out.
>>>>>
>>>>> Default per-device RDM policy is 'force', while default global RDM
>>>>> policy
>>>>> is 'try'. When both policies are specified on a given region,
>>>>> 'force' is
>>>>> always preferred.
>>>>>
>>>>> Signed-off-by: Tiejun Chen <tiejun.chen@intel.com>
>>>>> ---
>>>>>   docs/man/xl.cfg.pod.5       | 44 +++++++++++++++++++++++++
>>>>>   docs/misc/vtd.txt           | 34 ++++++++++++++++++++
>>>>>   tools/libxl/libxl_create.c  |  5 +++
>>>>>   tools/libxl/libxl_types.idl | 18 +++++++++++
>>>>>   tools/libxl/libxlu_pci.c    | 78
>>>>> +++++++++++++++++++++++++++++++++++++++++++++
>>>>>   tools/libxl/libxlutil.h     |  4 +++
>>>>>   tools/libxl/xl_cmdimpl.c    | 21 +++++++++++-
>>>>>   7 files changed, 203 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/docs/man/xl.cfg.pod.5 b/docs/man/xl.cfg.pod.5
>>>>> index 408653f..9ed3055 100644
>>>>> --- a/docs/man/xl.cfg.pod.5
>>>>> +++ b/docs/man/xl.cfg.pod.5
>>>>> @@ -583,6 +583,36 @@ assigned slave device.
>>>>>
>>>>>   =back
>>>>>
>>>>> +=item B<rdm=[ "TYPE", "RDM_RESERVE_STRING", ... ]>
>>>>> +
>>>>
>>>> Shouldn't this be "TYPE,RDM_RESERVE_STRIGN" according to your commit
>>>> message? If the only available config is just one string, you probably
>>>> don't need a list for this?
>>>
>>> Yes, based on that design we don't need a list. So
>>>
>>> =item B<rdm=[ "RDM_RESERVE_STRING" ]>
>>>
>>
>> Note that this is still a list (enclosed by "[]"). Maybe you mean
>>
>>     rdm = "RDM_RESERVE_STRING"
>>
>> ?
>
> Yes, I'll do this.
>
>>
>>>>
>>>>> +(HVM/x86 only) Specifies the information about Reserved Device
>>>>> Memory (RDM),
>>>>> +which is necessary to enable robust device passthrough usage. One
>>>>> example of
>>>>> +RDM is reported through ACPI Reserved Memory Region Reporting (RMRR)
>>>>> +structure on x86 platform.
>>>>> +Each B<RDM_CHECK_STRING> has the form
>>>>> C<["TYPE",KEY=VALUE,KEY=VALUE,...> where:
>>>>> +
>>>>
>>>> RDM_CHECK_STRING?
>>>
>>> And here should be corrected like this,
>>>
>>> B<RDM_RESERVE_STRING> has the form ...
>>>
>>>>
>>>>> +=over 4
>>>>> +
>>>>> +=item B<"TYPE">
>>>>> +
>>>>> +Currently we just have one type. 'host' means all reserved device
>>>>> memory on
>>>>> +this platform should be reserved in this VM's pfn space.
>>>>> +
>>>>
>>>> What are other possible types? If there is only one type then we can
>>>
>>> Currently we just have one type and looks that design doesn't make this
>>> clear.
>>>
>>>> simply ignore the type?
>>>
>>> I just think we may introduce something else specific to live
>>> migration in
>>> the future... But I'm really not sure right now.
>>>
>>
>> Fair enough. I was just wondering if there would be any other types. If
>> so we do need provisioning.
>>
>> In any case, the "type" argument you proposed is a positional argument
>> (you require it to be the first element of the spec string").
>> I think you can just make it a key-value pair to make parsing easier.
>
> Do you mean this statement?
>
> =item B<rdm= "RDM_RESERVE_STRING" >
>
> ...
>
> B<RDM_RESERVE_STRING> has the form C<[KEY=VALUE,KEY=VALUE,...> where:
>
> =over 4
>
> =item B<KEY=VALUE>
>
> Possible B<KEY>s are:
>
> =over 4
>
> =item B<type="STRING">
>
> Currently we just have one type. "host" means all reserved device memory on
> this platform should be reserved in this VM's pfn space.
>
> =over 4
>
> =item B<reserve="STRING">
> ...
>
>
>>
>>>>
>>>>> +=item B<KEY=VALUE>
>>>>> +
>>>>> +Possible B<KEY>s are:
>>>>> +
>>>>> +=over 4
>>>>> +
>>>>> +=item B<reserve="STRING">
>>>>> +
>>>>> +Conflict may be detected when reserving reserved device memory in
>>>>> gfn space.
>>>>> +'force' means an unsolved conflict leads to immediate VM destroy,
>>>>> while
>>>>
>>>> Do you mean "immediate VM crash"?
>>>
>>> Yes. So I guess I should replace this.
>>>
>>>>
>>>>> +'try' allows VM moving forward with a warning message thrown out.
>>>>> 'try'
>>>>> +is default.
>>>>
>>>> Can you please your double quotes for "force", "try" etc.
>>>
>>> Sure. Just note we'd like to use "strict"/"relaxed" to replace
>>> "force"/"try"
>>> from next revision according to Jan's suggestion.
>>>
>>
>> No problem.
>>
>>>>
>>>>> +
>>>>> +Note this may be overrided by another sub item, rdm_reserve, in
>>>>> pci device.
>>>>> +
>>>>>   =item B<pci=[ "PCI_SPEC_STRING", "PCI_SPEC_STRING", ... ]>
>>>>>
>>>>>   Specifies the host PCI devices to passthrough to this guest. Each
>>>>> B<PCI_SPEC_STRING>
>>>>> @@ -645,6 +675,20 @@ dom0 without confirmation.  Please use with care.
>>>>>   D0-D3hot power management states for the PCI device. False (0) by
>>>>>   default.
>>>>>
>>>>> +=item B<rdm_check="STRING">
>>>>> +
>>>>> +(HVM/x86 only) Specifies the information about Reserved Device
>>>>> Memory (RDM),
>>>>> +which is necessary to enable robust device passthrough usage. One
>>>>> example of
>>>>> +RDM is reported through ACPI Reserved Memory Region Reporting (RMRR)
>>>>> +structure on x86 platform.
>>>>> +
>>>>> +Conflict may be detected when reserving reserved device memory in
>>>>> gfn space.
>>>>> +'force' means an unsolved conflict leads to immediate VM destroy,
>>>>> while
>>>>> +'try' allows VM moving forward with a warning message thrown out.
>>>>> 'force'
>>>>> +is default.
>>>>> +
>>>>> +Note this would override another global item, rdm = [''].
>>>>> +
>>>>
>>>> Note this would override global B<rdm> option.
>>>
>>> Fixed.
>>>
>>>>
>>>>>   =back
>>>>>
>>>>>   =back
>>>>> diff --git a/docs/misc/vtd.txt b/docs/misc/vtd.txt
>>>>> index 9af0e99..d7434d6 100644
>>>>> --- a/docs/misc/vtd.txt
>>>>> +++ b/docs/misc/vtd.txt
>>>>> @@ -111,6 +111,40 @@ in the config file:
>>>>>   To override for a specific device:
>>>>>       pci = [ '01:00.0,msitranslate=0', '03:00.0' ]
>>>>>
>>>>> +RDM, 'reserved device memory', for PCI Device Passthrough
>>>>> +---------------------------------------------------------
>>>>> +
>>>>> +There are some devices the BIOS controls, for e.g. USB devices to
>>>>> perform
>>>>> +PS2 emulation. The regions of memory used for these devices are
>>>>> marked
>>>>> +reserved in the e820 map. When we turn on DMA translation, DMA to
>>>>> those
>>>>> +regions will fail. Hence BIOS uses RMRR to specify these regions
>>>>> along with
>>>>> +devices that need to access these regions. OS is expected to setup
>>>>> +identity mappings for these regions for these devices to access
>>>>> these regions.
>>>>> +
>>>>> +While creating a VM we should reserve them in advance, and avoid
>>>>> any conflicts.
>>>>> +So we introduce user configurable parameters to specify RDM
>>>>> resource and
>>>>> +according policies,
>>>>> +
>>>>> +To enable this globally, add "rdm" in the config file:
>>>>> +
>>>>> +    rdm = [ 'host, reserve=force/try' ]
>>>>> +
>>>>
>>>> The "force/try" should be called "policy". And then you explain what
>>>> policies we have.
>>>
>>> Do you mean we should rename this?
>>>
>>> rdm = [ 'host, policy=force/try' ]
>>>
>>
>> No, I didn't ask you to rename that.
>>
>> The above line is an example which should reflect the correct syntax.
>> "force/try" is not the *actual syntax*, i.e. you won't write that in
>> your config file.
>>
>> I meant to changes it to "reserve=POLICY". Then you explain the possible
>> values of POLICY.
>>
>
> Understood so what about this,
>
> To enable this globally, add "rdm" in the config file:
>
>      rdm = [ 'host, reserve=<POLICY>' ]
>
> Or just for a specific device:
>
>      pci = [ '01:00.0,rdm_reserve=<POLICY>', '03:00.0' ]
>
> Global RDM parameter allows user to specify reserved regions explicitly.
> Using "host" to include all reserved regions reported on this platform
> which is good to handle hotplug scenario. In the future this parameter
> may be further extended to allow specifying random regions, e.g. even
> those belonging to another platform as a preparation for live migration
> with passthrough devices.
>
> Currently "POLICY" includes two options, "strict" and "relaxed". It
> decides how to handle conflict when reserving RDM regions in pfn space.
> If conflict ...
>
>>> This is really a policy but 'reserve' may can reflect our action
>>> explicitly,
>>> right?
>>>
>>>>
>>>>> +Or just for a specific device:
>>>>> +
>>>>> +    pci = [ '01:00.0,rdm_reserve=force/try', '03:00.0' ]
>>>
>>> And you also can see this.
>>>
>>> But anyway, if you're really stick to rename this, I'm going to be
>>> fine as
>>> well but we should ping every one to check this point since this name is
>>> from our previous discussion.
>>>
>>>>> +
>>>>> +Global RDM parameter allows user to specify reserved regions
>>>>> explicitly.
>>>>> +Using 'host' to include all reserved regions reported on this
>>>>> platform
>>>>> +which is good to handle hotplug scenario. In the future this
>>>>> parameter
>>>>> +may be further extended to allow specifying random regions, e.g. even
>>>>> +those belonging to another platform as a preparation for live
>>>>> migration
>>>>> +with passthrough devices.
>>>>> +
>>>>> +'force/try' policy decides how to handle conflict when reserving RDM
>>>>> +regions in pfn space. If conflict exists, 'force' means an
>>>>> immediate error
>>>>> +so VM will be killed, while 'try' allows moving forward with a
>>>>> warning
>>>>
>>>> Be killed by whom? I think it's hvmloader crashes voluntarily, right?
>>>
>>> s/VM will be kille/hvmloader crashes voluntarily
>>>
>>>>
>>>>> +message thrown out.
>>>>> +
>>>>>
>>>>>   Caveat on Conventional PCI Device Passthrough
>>>>>   ---------------------------------------------
>>>>> diff --git a/tools/libxl/libxl_create.c b/tools/libxl/libxl_create.c
>>>>> index 98687bd..9ed40d4 100644
>>>>> --- a/tools/libxl/libxl_create.c
>>>>> +++ b/tools/libxl/libxl_create.c
>>>>> @@ -1407,6 +1407,11 @@ static void domcreate_attach_pci(libxl__egc
>>>>> *egc, libxl__multidev *multidev,
>>>>>       }
>>>>>
>>>>>       for (i = 0; i < d_config->num_pcidevs; i++) {
>>>>> +        /*
>>>>> +         * If the rdm global policy is 'force' we should override
>>>>> each device.
>>>>> +         */
>>>>> +        if (d_config->b_info.rdm.reserve ==
>>>>> LIBXL_RDM_RESERVE_FLAG_FORCE)
>>>>> +            d_config->pcidevs[i].rdm_reserve =
>>>>> LIBXL_RDM_RESERVE_FLAG_FORCE;
>>>>>           ret = libxl__device_pci_add(gc, domid,
>>>>> &d_config->pcidevs[i], 1);
>>>>>           if (ret < 0) {
>>>>>               LIBXL__LOG(ctx, LIBXL__LOG_ERROR,
>>>>> diff --git a/tools/libxl/libxl_types.idl b/tools/libxl/libxl_types.idl
>>>>> index 47af340..5786455 100644
>>>>> --- a/tools/libxl/libxl_types.idl
>>>>> +++ b/tools/libxl/libxl_types.idl
>>>>> @@ -71,6 +71,17 @@ libxl_domain_type = Enumeration("domain_type", [
>>>>>       (2, "PV"),
>>>>>       ], init_val = "LIBXL_DOMAIN_TYPE_INVALID")
>>>>>
>>>>> +libxl_rdm_reserve_type = Enumeration("rdm_reserve_type", [
>>>>> +    (0, "none"),
>>>>> +    (1, "host"),
>>>>> +    ])
>>>>> +
>>>>> +libxl_rdm_reserve_flag = Enumeration("rdm_reserve_flag", [
>>>>> +    (-1, "invalid"),
>>>>> +    (0, "force"),
>>>>> +    (1, "try"),
>>>>> +    ])
>>>>
>>>> If you don't set init_val, the default value would be "force" (0),
>>>> is this
>>>
>>> Yes.
>>>
>>>> want you want?
>>>
>>> We have a little bit of complexity here,
>>>
>>> "Default per-device RDM policy is 'force', while default global RDM
>>> policy
>>> is 'try'. When both policies are specified on a given region, 'force' is
>>> always preferred."
>>>
>>
>> This is going to be done in actual code anyway.
>>
>> This type is used both in global and per-device setting, so I envisage
>
> Yes.
>
>> this to have an invalid value to start with. Appropriate default values
>
> Sounds I should set this,
>
> +libxl_rdm_reserve_flag = Enumeration("rdm_reserve_flag", [
> +    (-1, "invalid"),
> +    (0, "strict"),
> +    (1, "relaxed"),
> +    ], init_val = "LIBXL_RDM_RESERVE_FLAG_INVALID")
> +
>
>
>> should be done in libxl_TYPE_setdefault functions. And the logic to
>> detect conflict and preferences done in your construct function.
>>
>> What do you think?
>>
>>>>
>>>>> +
>>>>>   libxl_channel_connection = Enumeration("channel_connection", [
>>>>>       (0, "UNKNOWN"),
>>>>>       (1, "PTY"),
>>>>> @@ -356,6 +367,11 @@ libxl_domain_sched_params =
>>>>> Struct("domain_sched_params",[
>>>>>       ("budget",       integer, {'init_val':
>>>>> 'LIBXL_DOMAIN_SCHED_PARAM_BUDGET_DEFAULT'}),
>>>>>       ])
>>>>>
>>>>> +libxl_rdm_reserve = Struct("rdm_reserve", [
>>>>> +    ("type",    libxl_rdm_reserve_type),
>>>>> +    ("reserve",   libxl_rdm_reserve_flag),
>>>>> +    ])
>>>>> +
>>>>>   libxl_domain_build_info = Struct("domain_build_info",[
>>>>>       ("max_vcpus",       integer),
>>>>>       ("avail_vcpus",     libxl_bitmap),
>>>>> @@ -401,6 +417,7 @@ libxl_domain_build_info =
>>>>> Struct("domain_build_info",[
>>>>>       ("kernel",           string),
>>>>>       ("cmdline",          string),
>>>>>       ("ramdisk",          string),
>>>>> +    ("rdm",     libxl_rdm_reserve),
>>>>>       ("u", KeyedUnion(None, libxl_domain_type, "type",
>>>>>                   [("hvm", Struct(None, [("firmware",         string),
>>>>>                                          ("bios",
>>>>> libxl_bios_type),
>>>>> @@ -521,6 +538,7 @@ libxl_device_pci = Struct("device_pci", [
>>>>>       ("power_mgmt", bool),
>>>>>       ("permissive", bool),
>>>>>       ("seize", bool),
>>>>> +    ("rdm_reserve",   libxl_rdm_reserve_flag),
>>>>>       ])
>>>>>
>>>>>   libxl_device_vtpm = Struct("device_vtpm", [
>>>>> diff --git a/tools/libxl/libxlu_pci.c b/tools/libxl/libxlu_pci.c
>>>>> index 26fb143..45be0d9 100644
>>>>> --- a/tools/libxl/libxlu_pci.c
>>>>> +++ b/tools/libxl/libxlu_pci.c
>>>>> @@ -42,6 +42,8 @@ static int pcidev_struct_fill(libxl_device_pci
>>>>> *pcidev, unsigned int domain,
>>>>>   #define STATE_OPTIONS_K 6
>>>>>   #define STATE_OPTIONS_V 7
>>>>>   #define STATE_TERMINAL  8
>>>>> +#define STATE_TYPE      9
>>>>> +#define STATE_CHECK_FLAG      10
>>>>>   int xlu_pci_parse_bdf(XLU_Config *cfg, libxl_device_pci *pcidev,
>>>>> const char *str)
>>>>>   {
>>>>>       unsigned state = STATE_DOMAIN;
>>>>> @@ -143,6 +145,17 @@ int xlu_pci_parse_bdf(XLU_Config *cfg,
>>>>> libxl_device_pci *pcidev, const char *str
>>>>>                       pcidev->permissive = atoi(tok);
>>>>>                   }else if ( !strcmp(optkey, "seize") ) {
>>>>>                       pcidev->seize = atoi(tok);
>>>>> +                }else if ( !strcmp(optkey, "rdm_reserve") ) {
>>>>> +                    if ( !strcmp(tok, "force") ) {
>>>>> +                        pcidev->rdm_reserve =
>>>>> LIBXL_RDM_RESERVE_FLAG_FORCE;
>>>>> +                    } else if ( !strcmp(tok, "try") ) {
>>>>> +                        pcidev->rdm_reserve =
>>>>> LIBXL_RDM_RESERVE_FLAG_TRY;
>>>>> +                    } else {
>>>>> +                        pcidev->rdm_reserve =
>>>>> LIBXL_RDM_RESERVE_FLAG_FORCE;
>>>>> +                        XLU__PCI_ERR(cfg, "Unknown PCI RDM
>>>>> property flag value:"
>>>>> +                                          " %s, so goes 'force' by
>>>>> default.",
>>>>
>>>> If this is not an error, you don't need XLU__PCI_ERR.
>>>>
>>>> But I would say we should  just treat this as an error and
>>>> abort/exit/report (whatever the parser should do in this case).
>>>
>>> In our case we just want to post a message to set a appropriate flag to
>>> recover this behavior like we write here,
>>>
>>>                          XLU__PCI_ERR(cfg, "Unknown PCI RDM property
>>> flag
>>> value:"
>>>                                            " %s, so goes 'strict' by
>>> default.",
>>>                                       tok);
>>
>> I suggest we just abort in this case and not second guess what the admin
>> wants.
>
> Okay,
>                      } else {
>                          XLU__PCI_ERR(cfg, "%s is not an valid PCI RDM
> property"
>                                            " flag: 'strict' or 'relaxed'.",
>                                       tok);
>                          abort();
>
>
>>
>>>
>>> This may just be a warning? But I don't we have this sort of definition,
>>> XLU__PCI_WARN, ...
>>>
>>> So what LOG format can be adopted here?
>>
>> Feel free to introduce XLU__PCI_WARN if it turns out to be necessary.
>
> If it goes to abort(), I think XLU__PCI_ERR() should be good.
>
>>
>>>
>>>>
>>>>> +                                     tok);
>>>>> +                    }
>>>>>                   }else{
>>>>>                       XLU__PCI_ERR(cfg, "Unknown PCI BDF option:
>>>>> %s", optkey);
>>>>>                   }
>>>>> @@ -167,6 +180,71 @@ parse_error:
>>>>>       return ERROR_INVAL;
>>>>>   }
>>>>>
>>>>> +int xlu_rdm_parse(XLU_Config *cfg, libxl_rdm_reserve *rdm, const
>>>>> char *str)
>>>>> +{
>>>>> +    unsigned state = STATE_TYPE;
>>>>> +    char *buf2, *tok, *ptr, *end;
>>>>> +
>>>>> +    if ( NULL == (buf2 = ptr = strdup(str)) )
>>>>> +        return ERROR_NOMEM;
>>>>> +
>>>>> +    for(tok = ptr, end = ptr + strlen(ptr) + 1; ptr < end; ptr++) {
>>>>> +        switch(state) {
>>
>> Coding style. I haven't checked what actual style this file uses, but
>> there is inconsistency in this function by itself.
>
> I just refer to xlu_pci_parse_bdf() to generate xlu_rdm_parse(), and
> they are in the same file...
>
> Anyway, I should change this line,
>
> for ( tok = ptr, end = ptr + strlen(ptr) + 1; ptr < end; ptr++ ) {
>
>>
>>>>> +        case STATE_TYPE:
>>>>> +            if ( *ptr == '\0' || *ptr == ',' ) {
>>>>> +                state = STATE_CHECK_FLAG;
>>>>> +                *ptr = '\0';
>>>>> +                if ( !strcmp(tok, "host") ) {
>>>>> +                    rdm->type = LIBXL_RDM_RESERVE_TYPE_HOST;
>>>>> +                } else {
>>>>> +                    XLU__PCI_ERR(cfg, "Unknown RDM state option:
>>>>> %s", tok);
>>>>> +                    goto parse_error;
>>>>> +                }
>>>>> +                tok = ptr + 1;
>>>>> +            }
>>>>> +            break;
>>>>> +        case STATE_CHECK_FLAG:
>>>>> +            if ( *ptr == '=' ) {
>>>>> +                state = STATE_OPTIONS_V;
>>>>> +                *ptr = '\0';
>>>>> +                if ( strcmp(tok, "reserve") ) {
>>>>> +                    XLU__PCI_ERR(cfg, "Unknown RDM property value:
>>>>> %s", tok);
>>>>> +                    goto parse_error;
>>>>> +                }
>>>>> +                tok = ptr + 1;
>>>>> +            }
>>>>> +            break;
>>>>> +        case STATE_OPTIONS_V:
>>>>> +            if ( *ptr == ',' || *ptr == '\0' ) {
>>>>> +                state = STATE_TERMINAL;
>>>>> +                *ptr = '\0';
>>>>> +                if ( !strcmp(tok, "force") ) {
>>>>> +                    rdm->reserve = LIBXL_RDM_RESERVE_FLAG_FORCE;
>>>>> +                } else if ( !strcmp(tok, "try") ) {
>>>>> +                    rdm->reserve = LIBXL_RDM_RESERVE_FLAG_TRY;
>>>>> +                } else {
>>>>> +                    XLU__PCI_ERR(cfg, "Unknown RDM property flag
>>>>> value: %s",
>>>>> +                                 tok);
>>>>> +                    goto parse_error;
>>>>> +                }
>>>>> +                tok = ptr + 1;
>>>>> +            }
>>>>> +        default:
>>>>> +            break;
>>>>> +        }
>>>>> +    }
>>>>> +
>>>>> +    free(buf2);
>>>>> +
>>>>> +    if ( tok != ptr || state != STATE_TERMINAL )
>>>>> +        goto parse_error;
>>>>> +
>>>>> +    return 0;
>>>>> +
>>>>> +parse_error:
>>>>> +    return ERROR_INVAL;
>>>>> +}
>>>>> +
>>>>>   /*
>>>>>    * Local variables:
>>>>>    * mode: C
>>>>> diff --git a/tools/libxl/libxlutil.h b/tools/libxl/libxlutil.h
>>>>> index 0333e55..80497f8 100644
>>>>> --- a/tools/libxl/libxlutil.h
>>>>> +++ b/tools/libxl/libxlutil.h
>>>>> @@ -93,6 +93,10 @@ int xlu_disk_parse(XLU_Config *cfg, int nspecs,
>>>>> const char *const *specs,
>>>>>    */
>>>>>   int xlu_pci_parse_bdf(XLU_Config *cfg, libxl_device_pci *pcidev,
>>>>> const char *str);
>>>>>
>>>>> +/*
>>>>> + * RDM parsing
>>>>> + */
>>>>> +int xlu_rdm_parse(XLU_Config *cfg, libxl_rdm_reserve *rdm, const
>>>>> char *str);
>>>>>
>>>>>   /*
>>>>>    * Vif rate parsing.
>>>>> diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c
>>>>> index 04faf98..9a58464 100644
>>>>> --- a/tools/libxl/xl_cmdimpl.c
>>>>> +++ b/tools/libxl/xl_cmdimpl.c
>>>>> @@ -988,7 +988,7 @@ static void parse_config_data(const char
>>>>> *config_source,
>>>>>       const char *buf;
>>>>>       long l;
>>>>>       XLU_Config *config;
>>>>> -    XLU_ConfigList *cpus, *vbds, *nics, *pcis, *cvfbs, *cpuids,
>>>>> *vtpms;
>>>>> +    XLU_ConfigList *cpus, *vbds, *nics, *pcis, *cvfbs, *cpuids,
>>>>> *vtpms, *rdms;
>>>>>       XLU_ConfigList *channels, *ioports, *irqs, *iomem, *viridian;
>>>>>       int num_ioports, num_irqs, num_iomem, num_cpus, num_viridian;
>>>>>       int pci_power_mgmt = 0;
>>>>> @@ -1727,6 +1727,23 @@ skip_vfb:
>>>>>           xlu_cfg_get_defbool(config, "e820_host",
>>>>> &b_info->u.pv.e820_host, 0);
>>>>>       }
>>>>>
>>>>> +    /*
>>>>> +     * By default our global policy is to query all rdm entries, and
>>>>> +     * force reserve them.
>>>>> +     */
>>>>> +    b_info->rdm.type = LIBXL_RDM_RESERVE_TYPE_HOST;
>>>>> +    b_info->rdm.reserve = LIBXL_RDM_RESERVE_FLAG_TRY;
>>>>
>>>> This should probably to into the _setdefault function of
>>>> libxl_domain_build_info.
>>>
>>> Sorry, I just see this
>>>
>>> libxl_domain_build_info_init()
>>>      |
>>>      + libxl_rdm_reserve_init(&p->rdm);
>>>     |
>>>     + memset(p, '\0', sizeof(*p));
>>>
>>> But this should be generated automatically, right? So how to
>>> implement your
>>> idea? Could you give me a show?
>>>
>>
>> Check libxl_domain_build_info_setdefault.
>>
>> To use libxl types. You normally do:
>>
>>    libxl_TYPE_init
>>    libxl_TYPE_setdefault
>>
>>    DO STUFF
>>
>>    libxl_TYPE_dispose
>>
>> _init and _dispose are auto-generated. _setdefault is not.
>
> So in our case, maybe we can do this,
>
> diff --git a/tools/libxl/libxl_create.c b/tools/libxl/libxl_create.c
> index f0da7dc..461606c 100644
> --- a/tools/libxl/libxl_create.c
> +++ b/tools/libxl/libxl_create.c
> @@ -100,6 +100,17 @@ static int sched_params_valid(libxl__gc *gc,
>       return 1;
>   }
>
> +void libxl__device_rdm_setdefault(libxl__gc *gc,
> +                                  libxl_domain_build_info *b_info)
> +{
> +    /*
> +     * By default our global policy is to query all rdm entries, and
> +     * force reserve them.
> +     */
> +    b_info->rdm.type = LIBXL_RDM_RESERVE_TYPE_HOST;
> +    b_info->rdm.reserve = LIBXL_RDM_RESERVE_FLAG_STRICT;
> +}
> +
>   int libxl__domain_build_info_setdefault(libxl__gc *gc,
>                                           libxl_domain_build_info *b_info)
>   {
> @@ -410,6 +421,8 @@ int libxl__domain_build_info_setdefault(libxl__gc *gc,
>                      libxl_domain_type_to_string(b_info->type));
>           return ERROR_INVAL;
>       }
> +
> +    libxl__device_rdm_setdefault(gc, b_info);
>       return 0;
>   }
>
>
>>
>>>>
>>>>> +    if (!xlu_cfg_get_list (config, "rdm", &rdms, 0, 0)) {
>>>>> +        if ((buf = xlu_cfg_get_listitem (rdms, 0)) != NULL ) {
>>>>> +            libxl_rdm_reserve rdm;
>>>>> +            if (!xlu_rdm_parse(config, &rdm, buf))
>>>>> +            {
>>>>> +                b_info->rdm.type = rdm.type;
>>>>> +                b_info->rdm.reserve = rdm.reserve;
>>>>> +            }
>>>>
>>>> You only have one rdm in b_info, so there is no need to use a list for
>>>> it in config file.
>>>>
>>>
>>> Is this fine?
>>>
>>>      if (!xlu_cfg_get_string(config, "rdm", &buf, 0)) {
>>>
>>>          libxl_rdm_reserve rdm;
>>>
>>>          if (!xlu_rdm_parse(config, &rdm, buf)) {
>>>              b_info->rdm.type = rdm.type;
>>>
>>>              b_info->rdm.reserve = rdm.reserve;
>>>
>>>          }
>>>      }
>>>
>>
>> I think it is fine. But you'd better wait a little bit for other people
>> to voice their opinions.
>>
>
> Sure.
>
> Thanks
> Tiejun
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel
>
>

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC][PATCH 01/13] tools: introduce some new parameters to set rdm policy
  2015-05-15  1:52         ` Chen, Tiejun
  2015-05-18  1:06           ` Chen, Tiejun
@ 2015-05-18 19:17           ` Wei Liu
  2015-05-19  3:16             ` Chen, Tiejun
  1 sibling, 1 reply; 125+ messages in thread
From: Wei Liu @ 2015-05-18 19:17 UTC (permalink / raw)
  To: Chen, Tiejun
  Cc: kevin.tian, Wei Liu, ian.campbell, andrew.cooper3, tim,
	xen-devel, stefano.stabellini, JBeulich, yang.z.zhang,
	Ian.Jackson

On Fri, May 15, 2015 at 09:52:23AM +0800, Chen, Tiejun wrote:
[...]
> 
> 
> ...
> 
> B<RDM_RESERVE_STRING> has the form C<[KEY=VALUE,KEY=VALUE,...> where:
> 
> =over 4
> 
> =item B<KEY=VALUE>
> 
> Possible B<KEY>s are:
> 
> =over 4
> 
> =item B<type="STRING">
> 
> Currently we just have one type. "host" means all reserved device memory on
> this platform should be reserved in this VM's pfn space.
> 
> =over 4
> 
> =item B<reserve="STRING">
> ...
> 

Yes, something like this.

> 
> >
> >>>
[...]
> >>>>index 9af0e99..d7434d6 100644
> >>>>--- a/docs/misc/vtd.txt
> >>>>+++ b/docs/misc/vtd.txt
> >>>>@@ -111,6 +111,40 @@ in the config file:
> >>>>  To override for a specific device:
> >>>>  	pci = [ '01:00.0,msitranslate=0', '03:00.0' ]
> >>>>
> >>>>+RDM, 'reserved device memory', for PCI Device Passthrough
> >>>>+---------------------------------------------------------
> >>>>+
> >>>>+There are some devices the BIOS controls, for e.g. USB devices to perform
> >>>>+PS2 emulation. The regions of memory used for these devices are marked
> >>>>+reserved in the e820 map. When we turn on DMA translation, DMA to those
> >>>>+regions will fail. Hence BIOS uses RMRR to specify these regions along with
> >>>>+devices that need to access these regions. OS is expected to setup
> >>>>+identity mappings for these regions for these devices to access these regions.
> >>>>+
> >>>>+While creating a VM we should reserve them in advance, and avoid any conflicts.
> >>>>+So we introduce user configurable parameters to specify RDM resource and
> >>>>+according policies,
> >>>>+
> >>>>+To enable this globally, add "rdm" in the config file:
> >>>>+
> >>>>+    rdm = [ 'host, reserve=force/try' ]
> >>>>+
> >>>
> >>>The "force/try" should be called "policy". And then you explain what
> >>>policies we have.
> >>
> >>Do you mean we should rename this?
> >>
> >>rdm = [ 'host, policy=force/try' ]
> >>
> >
> >No, I didn't ask you to rename that.
> >
> >The above line is an example which should reflect the correct syntax.
> >"force/try" is not the *actual syntax*, i.e. you won't write that in
> >your config file.
> >
> >I meant to changes it to "reserve=POLICY". Then you explain the possible
> >values of POLICY.
> >
> 
> Understood so what about this,
> 
> To enable this globally, add "rdm" in the config file:
> 
>     rdm = [ 'host, reserve=<POLICY>' ]
> 

OK, so this is a specific example in vtd.txt. Last time I misread it as
part of the manpage.

I think you meant in this specific example (with other suggestions
incorporated):

     rdm = "type=host, reserve=force"

Then you point user to xl.cfg manpage.

> Or just for a specific device:
> 
>     pci = [ '01:00.0,rdm_reserve=<POLICY>', '03:00.0' ]
> 

Same here.

Just don't write "force/try" or "strcit/relax" because that's not the
exact syntax you would use in a real config file.

> Global RDM parameter allows user to specify reserved regions explicitly.
> Using "host" to include all reserved regions reported on this platform
> which is good to handle hotplug scenario. In the future this parameter
> may be further extended to allow specifying random regions, e.g. even
> those belonging to another platform as a preparation for live migration
> with passthrough devices.
> 
> Currently "POLICY" includes two options, "strict" and "relaxed". It decides
> how to handle conflict when reserving RDM regions in pfn space. If conflict
> ...
> 
> >>This is really a policy but 'reserve' may can reflect our action explicitly,
> >>right?
> >>
> >>>
> >>>>+Or just for a specific device:
> >>>>+
> >>>>+	pci = [ '01:00.0,rdm_reserve=force/try', '03:00.0' ]
> >>
> >>And you also can see this.
> >>
> >>But anyway, if you're really stick to rename this, I'm going to be fine as
> >>well but we should ping every one to check this point since this name is
> >>from our previous discussion.
> >>
> >>>>+
> >>>>+Global RDM parameter allows user to specify reserved regions explicitly.
> >>>>+Using 'host' to include all reserved regions reported on this platform
> >>>>+which is good to handle hotplug scenario. In the future this parameter
> >>>>+may be further extended to allow specifying random regions, e.g. even
> >>>>+those belonging to another platform as a preparation for live migration
> >>>>+with passthrough devices.
> >>>>+
> >>>>+'force/try' policy decides how to handle conflict when reserving RDM
> >>>>+regions in pfn space. If conflict exists, 'force' means an immediate error
> >>>>+so VM will be killed, while 'try' allows moving forward with a warning
> >>>
> >>>Be killed by whom? I think it's hvmloader crashes voluntarily, right?
> >>
> >>s/VM will be kille/hvmloader crashes voluntarily
> >>
> >>>
> >>>>+message thrown out.
> >>>>+
> >>>>
> >>>>  Caveat on Conventional PCI Device Passthrough
> >>>>  ---------------------------------------------
> >>>>diff --git a/tools/libxl/libxl_create.c b/tools/libxl/libxl_create.c
> >>>>index 98687bd..9ed40d4 100644
> >>>>--- a/tools/libxl/libxl_create.c
> >>>>+++ b/tools/libxl/libxl_create.c
> >>>>@@ -1407,6 +1407,11 @@ static void domcreate_attach_pci(libxl__egc *egc, libxl__multidev *multidev,
> >>>>      }
> >>>>
> >>>>      for (i = 0; i < d_config->num_pcidevs; i++) {
> >>>>+        /*
> >>>>+         * If the rdm global policy is 'force' we should override each device.
> >>>>+         */
> >>>>+        if (d_config->b_info.rdm.reserve == LIBXL_RDM_RESERVE_FLAG_FORCE)
> >>>>+            d_config->pcidevs[i].rdm_reserve = LIBXL_RDM_RESERVE_FLAG_FORCE;
> >>>>          ret = libxl__device_pci_add(gc, domid, &d_config->pcidevs[i], 1);
> >>>>          if (ret < 0) {
> >>>>              LIBXL__LOG(ctx, LIBXL__LOG_ERROR,
> >>>>diff --git a/tools/libxl/libxl_types.idl b/tools/libxl/libxl_types.idl
> >>>>index 47af340..5786455 100644
> >>>>--- a/tools/libxl/libxl_types.idl
> >>>>+++ b/tools/libxl/libxl_types.idl
> >>>>@@ -71,6 +71,17 @@ libxl_domain_type = Enumeration("domain_type", [
> >>>>      (2, "PV"),
> >>>>      ], init_val = "LIBXL_DOMAIN_TYPE_INVALID")
> >>>>
> >>>>+libxl_rdm_reserve_type = Enumeration("rdm_reserve_type", [
> >>>>+    (0, "none"),
> >>>>+    (1, "host"),
> >>>>+    ])
> >>>>+
> >>>>+libxl_rdm_reserve_flag = Enumeration("rdm_reserve_flag", [
> >>>>+    (-1, "invalid"),
> >>>>+    (0, "force"),
> >>>>+    (1, "try"),
> >>>>+    ])
> >>>
> >>>If you don't set init_val, the default value would be "force" (0), is this
> >>
> >>Yes.
> >>
> >>>want you want?
> >>
> >>We have a little bit of complexity here,
> >>
> >>"Default per-device RDM policy is 'force', while default global RDM policy
> >>is 'try'. When both policies are specified on a given region, 'force' is
> >>always preferred."
> >>
> >
> >This is going to be done in actual code anyway.
> >
> >This type is used both in global and per-device setting, so I envisage
> 
> Yes.
> 
> >this to have an invalid value to start with. Appropriate default values
> 
> Sounds I should set this,
> 
> +libxl_rdm_reserve_flag = Enumeration("rdm_reserve_flag", [
> +    (-1, "invalid"),
> +    (0, "strict"),
> +    (1, "relaxed"),
> +    ], init_val = "LIBXL_RDM_RESERVE_FLAG_INVALID")
> +
> 

Yes, and then don't forget to set the value to appropriate value in the
_setdefault functions for different types.

> 
> >should be done in libxl_TYPE_setdefault functions. And the logic to
> >detect conflict and preferences done in your construct function.
> >
> >What do you think?
> >
> >>>
> >>>>+

[...]

> >>>>                      pcidev->permissive = atoi(tok);
> >>>>                  }else if ( !strcmp(optkey, "seize") ) {
> >>>>                      pcidev->seize = atoi(tok);
> >>>>+                }else if ( !strcmp(optkey, "rdm_reserve") ) {
> >>>>+                    if ( !strcmp(tok, "force") ) {
> >>>>+                        pcidev->rdm_reserve = LIBXL_RDM_RESERVE_FLAG_FORCE;
> >>>>+                    } else if ( !strcmp(tok, "try") ) {
> >>>>+                        pcidev->rdm_reserve = LIBXL_RDM_RESERVE_FLAG_TRY;
> >>>>+                    } else {
> >>>>+                        pcidev->rdm_reserve = LIBXL_RDM_RESERVE_FLAG_FORCE;
> >>>>+                        XLU__PCI_ERR(cfg, "Unknown PCI RDM property flag value:"
> >>>>+                                          " %s, so goes 'force' by default.",
> >>>
> >>>If this is not an error, you don't need XLU__PCI_ERR.
> >>>
> >>>But I would say we should  just treat this as an error and
> >>>abort/exit/report (whatever the parser should do in this case).
> >>
> >>In our case we just want to post a message to set a appropriate flag to
> >>recover this behavior like we write here,
> >>
> >>                         XLU__PCI_ERR(cfg, "Unknown PCI RDM property flag
> >>value:"
> >>                                           " %s, so goes 'strict' by
> >>default.",
> >>                                      tok);
> >
> >I suggest we just abort in this case and not second guess what the admin
> >wants.
> 
> Okay,
>                     } else {
>                         XLU__PCI_ERR(cfg, "%s is not an valid PCI RDM
> property"
>                                           " flag: 'strict' or 'relaxed'.",
>                                      tok);
>                         abort();
> 

No, not calling the "abort" function. I meant returning appropriate error
value and let the caller handles this situation.

> 
> >
> >>
> >>This may just be a warning? But I don't we have this sort of definition,
> >>XLU__PCI_WARN, ...
> >>
> >>So what LOG format can be adopted here?
> >
> >Feel free to introduce XLU__PCI_WARN if it turns out to be necessary.
> 
> If it goes to abort(), I think XLU__PCI_ERR() should be good.
> 
> >
> >>
> >>>
> >>>>+                                     tok);
> >>>>+                    }
> >>>>                  }else{
> >>>>                      XLU__PCI_ERR(cfg, "Unknown PCI BDF option: %s", optkey);
> >>>>                  }
> >>>>@@ -167,6 +180,71 @@ parse_error:
> >>>>      return ERROR_INVAL;
> >>>>  }
> >>>>
> >>>>+int xlu_rdm_parse(XLU_Config *cfg, libxl_rdm_reserve *rdm, const char *str)
> >>>>+{
> >>>>+    unsigned state = STATE_TYPE;
> >>>>+    char *buf2, *tok, *ptr, *end;
> >>>>+
> >>>>+    if ( NULL == (buf2 = ptr = strdup(str)) )
> >>>>+        return ERROR_NOMEM;
> >>>>+
> >>>>+    for(tok = ptr, end = ptr + strlen(ptr) + 1; ptr < end; ptr++) {
> >>>>+        switch(state) {
> >
> >Coding style. I haven't checked what actual style this file uses, but
> >there is inconsistency in this function by itself.
> 
> I just refer to xlu_pci_parse_bdf() to generate xlu_rdm_parse(), and they
> are in the same file...
> 
> Anyway, I should change this line,
> 
> for ( tok = ptr, end = ptr + strlen(ptr) + 1; ptr < end; ptr++ ) {
> 

  for (tok = ptr, end...)

  switch (state) {


> >
> >>>>+        case STATE_TYPE:
> >>>>+            if ( *ptr == '\0' || *ptr == ',' ) {
> >>>>+                state = STATE_CHECK_FLAG;
> >>>>+                *ptr = '\0';

[...]

> >>>>
> >>>>+    /*
> >>>>+     * By default our global policy is to query all rdm entries, and
> >>>>+     * force reserve them.
> >>>>+     */
> >>>>+    b_info->rdm.type = LIBXL_RDM_RESERVE_TYPE_HOST;
> >>>>+    b_info->rdm.reserve = LIBXL_RDM_RESERVE_FLAG_TRY;
> >>>
> >>>This should probably to into the _setdefault function of
> >>>libxl_domain_build_info.
> >>
> >>Sorry, I just see this
> >>
> >>libxl_domain_build_info_init()
> >>     |
> >>     + libxl_rdm_reserve_init(&p->rdm);
> >>	|
> >>	+ memset(p, '\0', sizeof(*p));
> >>
> >>But this should be generated automatically, right? So how to implement your
> >>idea? Could you give me a show?
> >>
> >
> >Check libxl_domain_build_info_setdefault.
> >
> >To use libxl types. You normally do:
> >
> >   libxl_TYPE_init
> >   libxl_TYPE_setdefault
> >
> >   DO STUFF
> >
> >   libxl_TYPE_dispose
> >
> >_init and _dispose are auto-generated. _setdefault is not.
> 
> So in our case, maybe we can do this,
> 
> diff --git a/tools/libxl/libxl_create.c b/tools/libxl/libxl_create.c
> index f0da7dc..461606c 100644
> --- a/tools/libxl/libxl_create.c
> +++ b/tools/libxl/libxl_create.c
> @@ -100,6 +100,17 @@ static int sched_params_valid(libxl__gc *gc,
>      return 1;
>  }
> 
> +void libxl__device_rdm_setdefault(libxl__gc *gc,
> +                                  libxl_domain_build_info *b_info)

It's not a device. Use libxl__rdm_setdefault.

> +{
> +    /*
> +     * By default our global policy is to query all rdm entries, and
> +     * force reserve them.
> +     */
> +    b_info->rdm.type = LIBXL_RDM_RESERVE_TYPE_HOST;
> +    b_info->rdm.reserve = LIBXL_RDM_RESERVE_FLAG_STRICT;
> +}
> +

Isn't the global policy "relaxed" (or "try")? At least that's what your
old code does. BTW your original code contradicts your original comment.

>  int libxl__domain_build_info_setdefault(libxl__gc *gc,
>                                          libxl_domain_build_info *b_info)
>  {
> @@ -410,6 +421,8 @@ int libxl__domain_build_info_setdefault(libxl__gc *gc,
>                     libxl_domain_type_to_string(b_info->type));
>          return ERROR_INVAL;
>      }
> +
> +    libxl__device_rdm_setdefault(gc, b_info);
>      return 0;
>  }
> 

And you also need to modify libxl__device_pci_setdefault.

I actually have another question on the interface design. To recap, in
your patch:

diff --git a/tools/libxl/libxl_types.idl b/tools/libxl/libxl_types.idl
index 47af340..5786455 100644
--- a/tools/libxl/libxl_types.idl
+++ b/tools/libxl/libxl_types.idl
@@ -71,6 +71,17 @@ libxl_domain_type = Enumeration("domain_type", [
     (2, "PV"),                                                    
     ], init_val = "LIBXL_DOMAIN_TYPE_INVALID")

+libxl_rdm_reserve_type = Enumeration("rdm_reserve_type", [
+    (0, "none"),
+    (1, "host"),
+    ])
+
+libxl_rdm_reserve_flag = Enumeration("rdm_reserve_flag", [
+    (-1, "invalid"),
+    (0, "force"),
+    (1, "try"),
+    ])
+
 libxl_channel_connection = Enumeration("channel_connection", [
     (0, "UNKNOWN"),
     (1, "PTY"),    
@@ -356,6 +367,11 @@ libxl_domain_sched_params = Struct("domain_sched_params",[
     ("budget",       integer, {'init_val': 'LIBXL_DOMAIN_SCHED_PARAM_BUDGET_DEFAULT'}),
     ])

+libxl_rdm_reserve = Struct("rdm_reserve", [
+    ("type",    libxl_rdm_reserve_type),
+    ("reserve",   libxl_rdm_reserve_flag),
+    ])
+
 libxl_domain_build_info = Struct("domain_build_info",[
     ("max_vcpus",       integer),                     
     ("avail_vcpus",     libxl_bitmap),
@@ -401,6 +417,7 @@ libxl_domain_build_info = Struct("domain_build_info",[
     ("kernel",           string),
     ("cmdline",          string),
     ("ramdisk",          string),
+    ("rdm",     libxl_rdm_reserve),
     ("u", KeyedUnion(None, libxl_domain_type, "type",
                 [("hvm", Struct(None, [("firmware",         string),
                                        ("bios",             libxl_bios_type),
@@ -521,6 +538,7 @@ libxl_device_pci = Struct("device_pci", [
     ("power_mgmt", bool),
     ("permissive", bool),
     ("seize", bool),
+    ("rdm_reserve",   libxl_rdm_reserve_flag),
     ])

Do you actually need libxl_rdm_reserve type? I.e. do you envisage that
structure to change a lot? Can you not just use libxl_rdm_reserve_type
and libxl_rdm_reserve_flag in build_info.

Wei.

^ permalink raw reply related	[flat|nested] 125+ messages in thread

* Re: [RFC][PATCH 04/13] tools/libxl: detect and avoid conflicts with RDM
  2015-05-14  8:27         ` Chen, Tiejun
  2015-05-18  1:06           ` Chen, Tiejun
@ 2015-05-18 20:00           ` Wei Liu
  2015-05-19  1:32             ` Tian, Kevin
  2015-05-19  6:47             ` Chen, Tiejun
  1 sibling, 2 replies; 125+ messages in thread
From: Wei Liu @ 2015-05-18 20:00 UTC (permalink / raw)
  To: Chen, Tiejun
  Cc: kevin.tian, Wei Liu, ian.campbell, andrew.cooper3, tim,
	xen-devel, stefano.stabellini, JBeulich, yang.z.zhang,
	Ian.Jackson

On Thu, May 14, 2015 at 04:27:45PM +0800, Chen, Tiejun wrote:
> On 2015/5/11 19:32, Wei Liu wrote:
> >On Mon, May 11, 2015 at 04:09:53PM +0800, Chen, Tiejun wrote:
> >>On 2015/5/8 22:43, Wei Liu wrote:
> >>>Sorry for the late review. This series fell through the crack.
> >>>
> >>
> >>Thanks for your review.
> >>
> >>>On Fri, Apr 10, 2015 at 05:21:55PM +0800, Tiejun Chen wrote:
> >>>>While building a VM, HVM domain builder provides struct hvm_info_table{}
> >>>>to help hvmloader. Currently it includes two fields to construct guest
> >>>>e820 table by hvmloader, low_mem_pgend and high_mem_pgend. So we should
> >>>>check them to fix any conflict with RAM.
> >>>>
> >>>>RMRR can reside in address space beyond 4G theoretically, but we never
> >>>>see this in real world. So in order to avoid breaking highmem layout
> >>>
> >>>How does this break highmem layout?
> >>
> >>In most cases highmen is always continuous like [4G, ...] but RMRR is
> >>theoretically beyond 4G but very rarely. Especially we don't see this
> >>happened in real world. So we don't want to such a case breaking the
> >>highmem.
> >>
> >
> >The problem is  we take this approach just because this rarely happens
> >*now* is not future proof.  It needs to be clearly documented somewhere
> >in the manual (or any other Intel docs) and be referenced in the code.
> >Otherwise we might end up in a situation that no-one knows how it is
> >supposed to work and no-one can fix it if it breaks in the future, that
> >is, every single device on earth requires RMRR > 4G overnight (I'm
> >exaggerating).
> >
> >Or you can just make it works with highmem. How much more work do you
> >envisage?
> >
> >(If my above comment makes no sense do let me know. I only have very
> >shallow understanding of RMRR)
> 
> Maybe I'm misleading you :)
> 
> I don't mean RMRR > 4G is not allowed to work in our implementation. What
> I'm saying is that our *policy* is just simple for this kind of rare highmem
> case...
> 

Yes, but this policy is hardcoded in code (as in, you bail when
detecting conflict in highmem region). I don't think we have the
expertise to un-hardcode it in the future (it might require x86 specific
insider information and specific hardware to test). So I would like to
make it as future proof as possible.

I know you're limited by hvm_info. I would accept this hardcoded policy
as short term solution, but I would like commitment from Intel to
maintain this piece of code and properly work out a flexible solution to
deal with >4G case.

> >
> >>>
> >>>>we don't solve highmem conflict. Note this means highmem rmrr could still
> >>>>be supported if no conflict.
> >>>>
> 
> Like these two sentences above.
> 
> >>>
> >>>Aren't you actively trying to avoid conflict in libxl?
> >>
> >>RMRR is fixed by BIOS so we can't aovid conflict. Here we want to adopt some
> >>good policies to address RMRR. In the case of highmemt, that simple policy
> >>should be enough?
> >>
> >
> >Whatever policy you and HV maintainers agree on. Just clearly document
> >it.
> 
> Do you mean I should brief this patch description into one separate
> document?
> 

Either in code comment or in a separate document.

> >
> >>>

[...]

> >>The caller to xc_device_get_rdm always frees this.
> >>
> >
> >I think I misunderstood how this function works. I thought xrdm was
> >passed in by caller, which is clearly not the case. Sorry!
> >
> >
> >In that case, the `if ( rc < 0 )' is not needed because the call should
> >always return rc < 0. An assert is good enough.
> 
> assert(rc < 0)? But we can't presume the user always pass '0' firstly, and
> additionally, we may have no any RMRR indeed.
> 
> So I guess what you want is,
> 
> assert( rc <=0 );
> if ( !rc )
>     goto xrdm;
> 

Yes I was thinking about something like this.

How in this case can rc equal to 0? Is it because you don't actually have any
regions? If so, please write a comment.

> if ( errno == ENOBUFS )
> ...
> 
> Right?
> 
> >

[...]

> >>>
> >>>>+    memset(d_config->rdms, 0, sizeof(libxl_device_rdm));
> >>>>+
> >>>>+    /* Query all RDM entries in this platform */
> >>>>+    if (type == LIBXL_RDM_RESERVE_TYPE_HOST) {
> >>>>+        d_config->num_rdms = nr_all_rdms;
> >>>>+        for (i = 0; i < d_config->num_rdms; i++) {
> >>>>+            d_config->rdms[i].start =
> >>>>+                                (uint64_t)xrdm[i].start_pfn << XC_PAGE_SHIFT;
> >>>>+            d_config->rdms[i].size =
> >>>>+                                (uint64_t)xrdm[i].nr_pages << XC_PAGE_SHIFT;
> >>>>+            d_config->rdms[i].flag = d_config->b_info.rdm.reserve;
> >>>>+        }
> >>>>+    } else {
> >>>>+        d_config->num_rdms = 0;
> >>>>+    }
> >>>
> >>>And you should move d_config->rdms = libxl__calloc inside that `if'.
> >>>That is, don't allocate memory if you don't need it.
> >>
> >>We can't since in all cases we need to preallocate this, and then we will
> >>handle this according to our policy.
> >>
> >
> >How would it ever be used again if you set d_config->num_rdms to 0? How
> >do you know the exact size of your array again?
> 
> If we don't consider multiple devices shares one rdm entry, our workflow can
> be showed as follows:
> 
> #1. We always preallocate all rdms[] but with memset().

Because the number of RDM entries used in a domain never exceeds the
number of global entries? I.e. we never risk overrunning the array?

> #2. Then we have two cases for that global rule,
> 
> #2.1 If we really need all according to our global rule, we would set all
> rdms[] with all real rdm info and set d_config->num_rdms.
> #2.2 If we have no such a global rule to obtain all, we just clear
> d_config->num_rdms.
> 

No, don't do this. In any failure path, if the free / dispose function
depends on num_rdms to iterate through the whole list to dispose memory
(if your rdm structure later contains pointers to allocated memory),
this method leaks memory.

The num_ field should always reflect the actual size of the array.

> #3. Then for per device rule
> 
> #3.1 From #2.1, we just need to check if we should change one given rdm
> entry's policy if this given entry is just associated to this device.
> #3.2 From 2.2, obviously we just need to fill rdms one by one. Of course,
> its very possible that we don't fill all rdms since all passthroued devices
> might not have no rdm at all or they just occupy some. But anyway, finally
> we sync d_config->num_rdms.
> 

Sorry I don't follow. But if your problem is you don't know how many
entries you actually need, just use libxl__realloc?

> >
> >>>
> >>>>+    free(xrdm);
> >>>>+
> >>>>+    /* Query RDM entries per-device */
> >>>>+    for (i = 0; i < d_config->num_pcidevs; i++) {
> >>>>+        unsigned int nr_entries = 0;
> >>
> >>Maybe I should restate this,
> >>	unsigned int nr_entries;
> >>

[...]

> >>
> >>Like I replied above we always preallocate this at the beginning.
> >>
> >
> >Ah, OK.
> >
> >But please don't do this. It's hard to see you don't overrun the
> >buffer. Please allocate memory only when you need it.
> 
> Sorry I don't understand this. As I mention above, we don't know how many
> rdm entries we really need to allocate. So this is why we'd like to
> preallocate all rdms at the beginning. Then this can cover all cases, global
> policy, (global policy & per device) and only per device. Even if multiple
> devices shares one rdm we also need to avoid duplicating a new...
> 

Can you use libxl__realloc?

Wei.

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC][PATCH 04/13] tools/libxl: detect and avoid conflicts with RDM
  2015-05-18 20:00           ` Wei Liu
@ 2015-05-19  1:32             ` Tian, Kevin
  2015-05-19 10:22               ` Wei Liu
  2015-05-19  6:47             ` Chen, Tiejun
  1 sibling, 1 reply; 125+ messages in thread
From: Tian, Kevin @ 2015-05-19  1:32 UTC (permalink / raw)
  To: Wei Liu, Chen, Tiejun
  Cc: ian.campbell, andrew.cooper3, tim, xen-devel, stefano.stabellini,
	JBeulich, Zhang, Yang Z, Ian.Jackson

> From: Wei Liu [mailto:wei.liu2@citrix.com]
> Sent: Tuesday, May 19, 2015 4:00 AM
> 
> On Thu, May 14, 2015 at 04:27:45PM +0800, Chen, Tiejun wrote:
> > On 2015/5/11 19:32, Wei Liu wrote:
> > >On Mon, May 11, 2015 at 04:09:53PM +0800, Chen, Tiejun wrote:
> > >>On 2015/5/8 22:43, Wei Liu wrote:
> > >>>Sorry for the late review. This series fell through the crack.
> > >>>
> > >>
> > >>Thanks for your review.
> > >>
> > >>>On Fri, Apr 10, 2015 at 05:21:55PM +0800, Tiejun Chen wrote:
> > >>>>While building a VM, HVM domain builder provides struct
> hvm_info_table{}
> > >>>>to help hvmloader. Currently it includes two fields to construct guest
> > >>>>e820 table by hvmloader, low_mem_pgend and high_mem_pgend. So
> we should
> > >>>>check them to fix any conflict with RAM.
> > >>>>
> > >>>>RMRR can reside in address space beyond 4G theoretically, but we
> never
> > >>>>see this in real world. So in order to avoid breaking highmem layout
> > >>>
> > >>>How does this break highmem layout?
> > >>
> > >>In most cases highmen is always continuous like [4G, ...] but RMRR is
> > >>theoretically beyond 4G but very rarely. Especially we don't see this
> > >>happened in real world. So we don't want to such a case breaking the
> > >>highmem.
> > >>
> > >
> > >The problem is  we take this approach just because this rarely happens
> > >*now* is not future proof.  It needs to be clearly documented somewhere
> > >in the manual (or any other Intel docs) and be referenced in the code.
> > >Otherwise we might end up in a situation that no-one knows how it is
> > >supposed to work and no-one can fix it if it breaks in the future, that
> > >is, every single device on earth requires RMRR > 4G overnight (I'm
> > >exaggerating).
> > >
> > >Or you can just make it works with highmem. How much more work do you
> > >envisage?
> > >
> > >(If my above comment makes no sense do let me know. I only have very
> > >shallow understanding of RMRR)
> >
> > Maybe I'm misleading you :)
> >
> > I don't mean RMRR > 4G is not allowed to work in our implementation. What
> > I'm saying is that our *policy* is just simple for this kind of rare highmem
> > case...
> >
> 
> Yes, but this policy is hardcoded in code (as in, you bail when
> detecting conflict in highmem region). I don't think we have the
> expertise to un-hardcode it in the future (it might require x86 specific
> insider information and specific hardware to test). So I would like to
> make it as future proof as possible.
> 
> I know you're limited by hvm_info. I would accept this hardcoded policy
> as short term solution, but I would like commitment from Intel to
> maintain this piece of code and properly work out a flexible solution to
> deal with >4G case.
> 

We made this simplification by balancing implementation complexity and
real-world possibility. Yes Intel will commit to maintain it if future platform
does break this assumption.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC][PATCH 01/13] tools: introduce some new parameters to set rdm policy
  2015-05-18 19:17           ` Wei Liu
@ 2015-05-19  3:16             ` Chen, Tiejun
  2015-05-19  9:42               ` Wei Liu
  0 siblings, 1 reply; 125+ messages in thread
From: Chen, Tiejun @ 2015-05-19  3:16 UTC (permalink / raw)
  To: Wei Liu
  Cc: kevin.tian, ian.campbell, andrew.cooper3, tim, xen-devel,
	stefano.stabellini, JBeulich, yang.z.zhang, Ian.Jackson


On 2015/5/19 3:17, Wei Liu wrote:
> On Fri, May 15, 2015 at 09:52:23AM +0800, Chen, Tiejun wrote:
> [...]
>>
>>
>> ...
>>
>> B<RDM_RESERVE_STRING> has the form C<[KEY=VALUE,KEY=VALUE,...> where:
>>
>> =over 4
>>
>> =item B<KEY=VALUE>
>>
>> Possible B<KEY>s are:
>>
>> =over 4
>>
>> =item B<type="STRING">
>>
>> Currently we just have one type. "host" means all reserved device memory on
>> this platform should be reserved in this VM's pfn space.
>>
>> =over 4
>>
>> =item B<reserve="STRING">
>> ...
>>
>
> Yes, something like this.
>
>>
>>>
>>>>>
> [...]
>>>>>> index 9af0e99..d7434d6 100644
>>>>>> --- a/docs/misc/vtd.txt
>>>>>> +++ b/docs/misc/vtd.txt
>>>>>> @@ -111,6 +111,40 @@ in the config file:
>>>>>>   To override for a specific device:
>>>>>>   	pci = [ '01:00.0,msitranslate=0', '03:00.0' ]
>>>>>>
>>>>>> +RDM, 'reserved device memory', for PCI Device Passthrough
>>>>>> +---------------------------------------------------------
>>>>>> +
>>>>>> +There are some devices the BIOS controls, for e.g. USB devices to perform
>>>>>> +PS2 emulation. The regions of memory used for these devices are marked
>>>>>> +reserved in the e820 map. When we turn on DMA translation, DMA to those
>>>>>> +regions will fail. Hence BIOS uses RMRR to specify these regions along with
>>>>>> +devices that need to access these regions. OS is expected to setup
>>>>>> +identity mappings for these regions for these devices to access these regions.
>>>>>> +
>>>>>> +While creating a VM we should reserve them in advance, and avoid any conflicts.
>>>>>> +So we introduce user configurable parameters to specify RDM resource and
>>>>>> +according policies,
>>>>>> +
>>>>>> +To enable this globally, add "rdm" in the config file:
>>>>>> +
>>>>>> +    rdm = [ 'host, reserve=force/try' ]
>>>>>> +
>>>>>
>>>>> The "force/try" should be called "policy". And then you explain what
>>>>> policies we have.
>>>>
>>>> Do you mean we should rename this?
>>>>
>>>> rdm = [ 'host, policy=force/try' ]
>>>>
>>>
>>> No, I didn't ask you to rename that.
>>>
>>> The above line is an example which should reflect the correct syntax.
>>> "force/try" is not the *actual syntax*, i.e. you won't write that in
>>> your config file.
>>>
>>> I meant to changes it to "reserve=POLICY". Then you explain the possible
>>> values of POLICY.
>>>
>>
>> Understood so what about this,
>>
>> To enable this globally, add "rdm" in the config file:
>>
>>      rdm = [ 'host, reserve=<POLICY>' ]
>>
>
> OK, so this is a specific example in vtd.txt. Last time I misread it as
> part of the manpage.
>
> I think you meant in this specific example (with other suggestions
> incorporated):
>
>       rdm = "type=host, reserve=force"
>
> Then you point user to xl.cfg manpage.
>
>> Or just for a specific device:
>>
>>      pci = [ '01:00.0,rdm_reserve=<POLICY>', '03:00.0' ]
>>
>
> Same here.
>

In order to be compatible with vtd.txt, could this work for you?

To enable this globally, add "rdm" in the config file:

     rdm = [ 'type=host, reserve=<policy>' ]     (default policy is 
"relaxed")

Or just for a specific device:

     pci = [ '01:00.0,rdm_reserve=<policy>' ]    (default policy is 
"strict")


> Just don't write "force/try" or "strcit/relax" because that's not the
> exact syntax you would use in a real config file.

Yeah.

>
>> Global RDM parameter allows user to specify reserved regions explicitly.
>> Using "host" to include all reserved regions reported on this platform
>> which is good to handle hotplug scenario. In the future this parameter
>> may be further extended to allow specifying random regions, e.g. even
>> those belonging to another platform as a preparation for live migration
>> with passthrough devices.
>>
>> Currently "POLICY" includes two options, "strict" and "relaxed". It decides
>> how to handle conflict when reserving RDM regions in pfn space. If conflict
>> ...
>>
>>>> This is really a policy but 'reserve' may can reflect our action explicitly,
>>>> right?
>>>>
>>>>>
>>>>>> +Or just for a specific device:
>>>>>> +
>>>>>> +	pci = [ '01:00.0,rdm_reserve=force/try', '03:00.0' ]
>>>>
>>>> And you also can see this.
>>>>
>>>> But anyway, if you're really stick to rename this, I'm going to be fine as
>>>> well but we should ping every one to check this point since this name is
>>> >from our previous discussion.
>>>>
>>>>>> +
>>>>>> +Global RDM parameter allows user to specify reserved regions explicitly.
>>>>>> +Using 'host' to include all reserved regions reported on this platform
>>>>>> +which is good to handle hotplug scenario. In the future this parameter
>>>>>> +may be further extended to allow specifying random regions, e.g. even
>>>>>> +those belonging to another platform as a preparation for live migration
>>>>>> +with passthrough devices.
>>>>>> +
>>>>>> +'force/try' policy decides how to handle conflict when reserving RDM
>>>>>> +regions in pfn space. If conflict exists, 'force' means an immediate error
>>>>>> +so VM will be killed, while 'try' allows moving forward with a warning
>>>>>
>>>>> Be killed by whom? I think it's hvmloader crashes voluntarily, right?
>>>>
>>>> s/VM will be kille/hvmloader crashes voluntarily
>>>>
>>>>>
>>>>>> +message thrown out.
>>>>>> +
>>>>>>
>>>>>>   Caveat on Conventional PCI Device Passthrough
>>>>>>   ---------------------------------------------
>>>>>> diff --git a/tools/libxl/libxl_create.c b/tools/libxl/libxl_create.c
>>>>>> index 98687bd..9ed40d4 100644
>>>>>> --- a/tools/libxl/libxl_create.c
>>>>>> +++ b/tools/libxl/libxl_create.c
>>>>>> @@ -1407,6 +1407,11 @@ static void domcreate_attach_pci(libxl__egc *egc, libxl__multidev *multidev,
>>>>>>       }
>>>>>>
>>>>>>       for (i = 0; i < d_config->num_pcidevs; i++) {
>>>>>> +        /*
>>>>>> +         * If the rdm global policy is 'force' we should override each device.
>>>>>> +         */
>>>>>> +        if (d_config->b_info.rdm.reserve == LIBXL_RDM_RESERVE_FLAG_FORCE)
>>>>>> +            d_config->pcidevs[i].rdm_reserve = LIBXL_RDM_RESERVE_FLAG_FORCE;
>>>>>>           ret = libxl__device_pci_add(gc, domid, &d_config->pcidevs[i], 1);
>>>>>>           if (ret < 0) {
>>>>>>               LIBXL__LOG(ctx, LIBXL__LOG_ERROR,
>>>>>> diff --git a/tools/libxl/libxl_types.idl b/tools/libxl/libxl_types.idl
>>>>>> index 47af340..5786455 100644
>>>>>> --- a/tools/libxl/libxl_types.idl
>>>>>> +++ b/tools/libxl/libxl_types.idl
>>>>>> @@ -71,6 +71,17 @@ libxl_domain_type = Enumeration("domain_type", [
>>>>>>       (2, "PV"),
>>>>>>       ], init_val = "LIBXL_DOMAIN_TYPE_INVALID")
>>>>>>
>>>>>> +libxl_rdm_reserve_type = Enumeration("rdm_reserve_type", [
>>>>>> +    (0, "none"),
>>>>>> +    (1, "host"),
>>>>>> +    ])
>>>>>> +
>>>>>> +libxl_rdm_reserve_flag = Enumeration("rdm_reserve_flag", [
>>>>>> +    (-1, "invalid"),
>>>>>> +    (0, "force"),
>>>>>> +    (1, "try"),
>>>>>> +    ])
>>>>>
>>>>> If you don't set init_val, the default value would be "force" (0), is this
>>>>
>>>> Yes.
>>>>
>>>>> want you want?
>>>>
>>>> We have a little bit of complexity here,
>>>>
>>>> "Default per-device RDM policy is 'force', while default global RDM policy
>>>> is 'try'. When both policies are specified on a given region, 'force' is
>>>> always preferred."
>>>>
>>>
>>> This is going to be done in actual code anyway.
>>>
>>> This type is used both in global and per-device setting, so I envisage
>>
>> Yes.
>>
>>> this to have an invalid value to start with. Appropriate default values
>>
>> Sounds I should set this,
>>
>> +libxl_rdm_reserve_flag = Enumeration("rdm_reserve_flag", [
>> +    (-1, "invalid"),
>> +    (0, "strict"),
>> +    (1, "relaxed"),
>> +    ], init_val = "LIBXL_RDM_RESERVE_FLAG_INVALID")
>> +
>>
>
> Yes, and then don't forget to set the value to appropriate value in the
> _setdefault functions for different types.

Currently "type" is not associated to "policy" so we can do this if 
necessary in the future.

>
>>
>>> should be done in libxl_TYPE_setdefault functions. And the logic to
>>> detect conflict and preferences done in your construct function.
>>>
>>> What do you think?
>>>
>>>>>
>>>>>> +
>
> [...]
>
>>>>>>                       pcidev->permissive = atoi(tok);
>>>>>>                   }else if ( !strcmp(optkey, "seize") ) {
>>>>>>                       pcidev->seize = atoi(tok);
>>>>>> +                }else if ( !strcmp(optkey, "rdm_reserve") ) {
>>>>>> +                    if ( !strcmp(tok, "force") ) {
>>>>>> +                        pcidev->rdm_reserve = LIBXL_RDM_RESERVE_FLAG_FORCE;
>>>>>> +                    } else if ( !strcmp(tok, "try") ) {
>>>>>> +                        pcidev->rdm_reserve = LIBXL_RDM_RESERVE_FLAG_TRY;
>>>>>> +                    } else {
>>>>>> +                        pcidev->rdm_reserve = LIBXL_RDM_RESERVE_FLAG_FORCE;
>>>>>> +                        XLU__PCI_ERR(cfg, "Unknown PCI RDM property flag value:"
>>>>>> +                                          " %s, so goes 'force' by default.",
>>>>>
>>>>> If this is not an error, you don't need XLU__PCI_ERR.
>>>>>
>>>>> But I would say we should  just treat this as an error and
>>>>> abort/exit/report (whatever the parser should do in this case).
>>>>
>>>> In our case we just want to post a message to set a appropriate flag to
>>>> recover this behavior like we write here,
>>>>
>>>>                          XLU__PCI_ERR(cfg, "Unknown PCI RDM property flag
>>>> value:"
>>>>                                            " %s, so goes 'strict' by
>>>> default.",
>>>>                                       tok);
>>>
>>> I suggest we just abort in this case and not second guess what the admin
>>> wants.
>>
>> Okay,
>>                      } else {
>>                          XLU__PCI_ERR(cfg, "%s is not an valid PCI RDM
>> property"
>>                                            " flag: 'strict' or 'relaxed'.",
>>                                       tok);
>>                          abort();
>>
>
> No, not calling the "abort" function. I meant returning appropriate error
> value and let the caller handles this situation.

Okay, just call "goto parse_error".

>
>>
>>>
>>>>
>>>> This may just be a warning? But I don't we have this sort of definition,
>>>> XLU__PCI_WARN, ...
>>>>
>>>> So what LOG format can be adopted here?
>>>
>>> Feel free to introduce XLU__PCI_WARN if it turns out to be necessary.
>>
>> If it goes to abort(), I think XLU__PCI_ERR() should be good.
>>
>>>
>>>>
>>>>>
>>>>>> +                                     tok);
>>>>>> +                    }
>>>>>>                   }else{
>>>>>>                       XLU__PCI_ERR(cfg, "Unknown PCI BDF option: %s", optkey);
>>>>>>                   }
>>>>>> @@ -167,6 +180,71 @@ parse_error:
>>>>>>       return ERROR_INVAL;
>>>>>>   }
>>>>>>
>>>>>> +int xlu_rdm_parse(XLU_Config *cfg, libxl_rdm_reserve *rdm, const char *str)
>>>>>> +{
>>>>>> +    unsigned state = STATE_TYPE;
>>>>>> +    char *buf2, *tok, *ptr, *end;
>>>>>> +
>>>>>> +    if ( NULL == (buf2 = ptr = strdup(str)) )
>>>>>> +        return ERROR_NOMEM;
>>>>>> +
>>>>>> +    for(tok = ptr, end = ptr + strlen(ptr) + 1; ptr < end; ptr++) {
>>>>>> +        switch(state) {
>>>
>>> Coding style. I haven't checked what actual style this file uses, but
>>> there is inconsistency in this function by itself.
>>
>> I just refer to xlu_pci_parse_bdf() to generate xlu_rdm_parse(), and they
>> are in the same file...
>>
>> Anyway, I should change this line,
>>
>> for ( tok = ptr, end = ptr + strlen(ptr) + 1; ptr < end; ptr++ ) {
>>
>
>    for (tok = ptr, end...)
>
>    switch (state) {

But what is the difference to compare the initial code?

 >>>>>> +    if ( NULL == (buf2 = ptr = strdup(str)) )
 >>>>>> +        return ERROR_NOMEM;
 >>>>>> +
 >>>>>> +    for(tok = ptr, end = ptr + strlen(ptr) + 1; ptr < end; ptr++) {
 >>>>>> +        switch(state) {

I thought initially you let me to follow that previous "if" :)

>
>
>>>
>>>>>> +        case STATE_TYPE:
>>>>>> +            if ( *ptr == '\0' || *ptr == ',' ) {
>>>>>> +                state = STATE_CHECK_FLAG;
>>>>>> +                *ptr = '\0';
>
> [...]
>
>>>>>>
>>>>>> +    /*
>>>>>> +     * By default our global policy is to query all rdm entries, and
>>>>>> +     * force reserve them.
>>>>>> +     */
>>>>>> +    b_info->rdm.type = LIBXL_RDM_RESERVE_TYPE_HOST;
>>>>>> +    b_info->rdm.reserve = LIBXL_RDM_RESERVE_FLAG_TRY;
>>>>>
>>>>> This should probably to into the _setdefault function of
>>>>> libxl_domain_build_info.
>>>>
>>>> Sorry, I just see this
>>>>
>>>> libxl_domain_build_info_init()
>>>>      |
>>>>      + libxl_rdm_reserve_init(&p->rdm);
>>>> 	|
>>>> 	+ memset(p, '\0', sizeof(*p));
>>>>
>>>> But this should be generated automatically, right? So how to implement your
>>>> idea? Could you give me a show?
>>>>
>>>
>>> Check libxl_domain_build_info_setdefault.
>>>
>>> To use libxl types. You normally do:
>>>
>>>    libxl_TYPE_init
>>>    libxl_TYPE_setdefault
>>>
>>>    DO STUFF
>>>
>>>    libxl_TYPE_dispose
>>>
>>> _init and _dispose are auto-generated. _setdefault is not.
>>
>> So in our case, maybe we can do this,
>>
>> diff --git a/tools/libxl/libxl_create.c b/tools/libxl/libxl_create.c
>> index f0da7dc..461606c 100644
>> --- a/tools/libxl/libxl_create.c
>> +++ b/tools/libxl/libxl_create.c
>> @@ -100,6 +100,17 @@ static int sched_params_valid(libxl__gc *gc,
>>       return 1;
>>   }
>>
>> +void libxl__device_rdm_setdefault(libxl__gc *gc,
>> +                                  libxl_domain_build_info *b_info)
>
> It's not a device. Use libxl__rdm_setdefault.

Okay.

>
>> +{
>> +    /*
>> +     * By default our global policy is to query all rdm entries, and
>> +     * force reserve them.
>> +     */
>> +    b_info->rdm.type = LIBXL_RDM_RESERVE_TYPE_HOST;
>> +    b_info->rdm.reserve = LIBXL_RDM_RESERVE_FLAG_STRICT;
>> +}
>> +
>
> Isn't the global policy "relaxed" (or "try")? At least that's what your
> old code does. BTW your original code contradicts your original comment.

Right, sorry for this typo.

>
>>   int libxl__domain_build_info_setdefault(libxl__gc *gc,
>>                                           libxl_domain_build_info *b_info)
>>   {
>> @@ -410,6 +421,8 @@ int libxl__domain_build_info_setdefault(libxl__gc *gc,
>>                      libxl_domain_type_to_string(b_info->type));
>>           return ERROR_INVAL;
>>       }
>> +
>> +    libxl__device_rdm_setdefault(gc, b_info);
>>       return 0;
>>   }
>>
>
> And you also need to modify libxl__device_pci_setdefault.
>

Okay.

> I actually have another question on the interface design. To recap, in
> your patch:
>
> diff --git a/tools/libxl/libxl_types.idl b/tools/libxl/libxl_types.idl
> index 47af340..5786455 100644
> --- a/tools/libxl/libxl_types.idl
> +++ b/tools/libxl/libxl_types.idl
> @@ -71,6 +71,17 @@ libxl_domain_type = Enumeration("domain_type", [
>       (2, "PV"),
>       ], init_val = "LIBXL_DOMAIN_TYPE_INVALID")
>
> +libxl_rdm_reserve_type = Enumeration("rdm_reserve_type", [
> +    (0, "none"),
> +    (1, "host"),
> +    ])
> +
> +libxl_rdm_reserve_flag = Enumeration("rdm_reserve_flag", [
> +    (-1, "invalid"),
> +    (0, "force"),
> +    (1, "try"),
> +    ])
> +
>   libxl_channel_connection = Enumeration("channel_connection", [
>       (0, "UNKNOWN"),
>       (1, "PTY"),
> @@ -356,6 +367,11 @@ libxl_domain_sched_params = Struct("domain_sched_params",[
>       ("budget",       integer, {'init_val': 'LIBXL_DOMAIN_SCHED_PARAM_BUDGET_DEFAULT'}),
>       ])
>
> +libxl_rdm_reserve = Struct("rdm_reserve", [
> +    ("type",    libxl_rdm_reserve_type),
> +    ("reserve",   libxl_rdm_reserve_flag),
> +    ])
> +
>   libxl_domain_build_info = Struct("domain_build_info",[
>       ("max_vcpus",       integer),
>       ("avail_vcpus",     libxl_bitmap),
> @@ -401,6 +417,7 @@ libxl_domain_build_info = Struct("domain_build_info",[
>       ("kernel",           string),
>       ("cmdline",          string),
>       ("ramdisk",          string),
> +    ("rdm",     libxl_rdm_reserve),
>       ("u", KeyedUnion(None, libxl_domain_type, "type",
>                   [("hvm", Struct(None, [("firmware",         string),
>                                          ("bios",             libxl_bios_type),
> @@ -521,6 +538,7 @@ libxl_device_pci = Struct("device_pci", [
>       ("power_mgmt", bool),
>       ("permissive", bool),
>       ("seize", bool),
> +    ("rdm_reserve",   libxl_rdm_reserve_flag),
>       ])
>
> Do you actually need libxl_rdm_reserve type? I.e. do you envisage that
> structure to change a lot? Can you not just use libxl_rdm_reserve_type
> and libxl_rdm_reserve_flag in build_info.
>

We'd like to introduce this type, libxl_rdm_reserve, to combine "type" 
and "flag". From my point of view, this sole structure can represent a 
holistic approach to rdm because,

#1. Obviously its easy to get all;
#2. It will probably be extended since like this name, rdm, reserved 
device memory, this should not be restricted to RMRR currently. So I 
just feel its flexible to support others in the future, or much more 
properties :)

Thanks
Tiejun

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC][PATCH 04/13] tools/libxl: detect and avoid conflicts with RDM
  2015-05-18 20:00           ` Wei Liu
  2015-05-19  1:32             ` Tian, Kevin
@ 2015-05-19  6:47             ` Chen, Tiejun
  1 sibling, 0 replies; 125+ messages in thread
From: Chen, Tiejun @ 2015-05-19  6:47 UTC (permalink / raw)
  To: Wei Liu
  Cc: kevin.tian, ian.campbell, andrew.cooper3, tim, xen-devel,
	stefano.stabellini, JBeulich, yang.z.zhang, Ian.Jackson

On 2015/5/19 4:00, Wei Liu wrote:
> On Thu, May 14, 2015 at 04:27:45PM +0800, Chen, Tiejun wrote:
>> On 2015/5/11 19:32, Wei Liu wrote:
>>> On Mon, May 11, 2015 at 04:09:53PM +0800, Chen, Tiejun wrote:
>>>> On 2015/5/8 22:43, Wei Liu wrote:
>>>>> Sorry for the late review. This series fell through the crack.
>>>>>
>>>>
>>>> Thanks for your review.
>>>>
>>>>> On Fri, Apr 10, 2015 at 05:21:55PM +0800, Tiejun Chen wrote:
>>>>>> While building a VM, HVM domain builder provides struct hvm_info_table{}
>>>>>> to help hvmloader. Currently it includes two fields to construct guest
>>>>>> e820 table by hvmloader, low_mem_pgend and high_mem_pgend. So we should
>>>>>> check them to fix any conflict with RAM.
>>>>>>
>>>>>> RMRR can reside in address space beyond 4G theoretically, but we never
>>>>>> see this in real world. So in order to avoid breaking highmem layout
>>>>>
>>>>> How does this break highmem layout?
>>>>
>>>> In most cases highmen is always continuous like [4G, ...] but RMRR is
>>>> theoretically beyond 4G but very rarely. Especially we don't see this
>>>> happened in real world. So we don't want to such a case breaking the
>>>> highmem.
>>>>
>>>
>>> The problem is  we take this approach just because this rarely happens
>>> *now* is not future proof.  It needs to be clearly documented somewhere
>>> in the manual (or any other Intel docs) and be referenced in the code.
>>> Otherwise we might end up in a situation that no-one knows how it is
>>> supposed to work and no-one can fix it if it breaks in the future, that
>>> is, every single device on earth requires RMRR > 4G overnight (I'm
>>> exaggerating).
>>>
>>> Or you can just make it works with highmem. How much more work do you
>>> envisage?
>>>
>>> (If my above comment makes no sense do let me know. I only have very
>>> shallow understanding of RMRR)
>>
>> Maybe I'm misleading you :)
>>
>> I don't mean RMRR > 4G is not allowed to work in our implementation. What
>> I'm saying is that our *policy* is just simple for this kind of rare highmem
>> case...
>>
>
> Yes, but this policy is hardcoded in code (as in, you bail when
> detecting conflict in highmem region). I don't think we have the
> expertise to un-hardcode it in the future (it might require x86 specific
> insider information and specific hardware to test). So I would like to
> make it as future proof as possible.
>
> I know you're limited by hvm_info. I would accept this hardcoded policy
> as short term solution, but I would like commitment from Intel to
> maintain this piece of code and properly work out a flexible solution to
> deal with >4G case.

Looks Kevin replied this.

>
>>>
>>>>>
>>>>>> we don't solve highmem conflict. Note this means highmem rmrr could still
>>>>>> be supported if no conflict.
>>>>>>
>>
>> Like these two sentences above.
>>
>>>>>
>>>>> Aren't you actively trying to avoid conflict in libxl?
>>>>
>>>> RMRR is fixed by BIOS so we can't aovid conflict. Here we want to adopt some
>>>> good policies to address RMRR. In the case of highmemt, that simple policy
>>>> should be enough?
>>>>
>>>
>>> Whatever policy you and HV maintainers agree on. Just clearly document
>>> it.
>>
>> Do you mean I should brief this patch description into one separate
>> document?
>>
>
> Either in code comment or in a separate document.

Okay.

>
>>>
>>>>>
>
> [...]
>
>>>> The caller to xc_device_get_rdm always frees this.
>>>>
>>>
>>> I think I misunderstood how this function works. I thought xrdm was
>>> passed in by caller, which is clearly not the case. Sorry!
>>>
>>>
>>> In that case, the `if ( rc < 0 )' is not needed because the call should
>>> always return rc < 0. An assert is good enough.
>>
>> assert(rc < 0)? But we can't presume the user always pass '0' firstly, and
>> additionally, we may have no any RMRR indeed.
>>
>> So I guess what you want is,
>>
>> assert( rc <=0 );
>> if ( !rc )
>>      goto xrdm;
>>
>
> Yes I was thinking about something like this.
>
> How in this case can rc equal to 0? Is it because you don't actually have any

Yes.

> regions? If so, please write a comment.
>

Sure.

>> if ( errno == ENOBUFS )
>> ...
>>
>> Right?
>>
>>>
>
> [...]
>
>>>>>
>>>>>> +    memset(d_config->rdms, 0, sizeof(libxl_device_rdm));
>>>>>> +
>>>>>> +    /* Query all RDM entries in this platform */
>>>>>> +    if (type == LIBXL_RDM_RESERVE_TYPE_HOST) {
>>>>>> +        d_config->num_rdms = nr_all_rdms;
>>>>>> +        for (i = 0; i < d_config->num_rdms; i++) {
>>>>>> +            d_config->rdms[i].start =
>>>>>> +                                (uint64_t)xrdm[i].start_pfn << XC_PAGE_SHIFT;
>>>>>> +            d_config->rdms[i].size =
>>>>>> +                                (uint64_t)xrdm[i].nr_pages << XC_PAGE_SHIFT;
>>>>>> +            d_config->rdms[i].flag = d_config->b_info.rdm.reserve;
>>>>>> +        }
>>>>>> +    } else {
>>>>>> +        d_config->num_rdms = 0;
>>>>>> +    }
>>>>>
>>>>> And you should move d_config->rdms = libxl__calloc inside that `if'.
>>>>> That is, don't allocate memory if you don't need it.
>>>>
>>>> We can't since in all cases we need to preallocate this, and then we will
>>>> handle this according to our policy.
>>>>
>>>
>>> How would it ever be used again if you set d_config->num_rdms to 0? How
>>> do you know the exact size of your array again?
>>
>> If we don't consider multiple devices shares one rdm entry, our workflow can
>> be showed as follows:
>>
>> #1. We always preallocate all rdms[] but with memset().
>
> Because the number of RDM entries used in a domain never exceeds the
> number of global entries? I.e. we never risk overrunning the array?

Yes.

"global" indicates the number would count all rdm entries. If w/o 
"global" flag, we just count these rdm entries associated to those pci 
devices assigned to this domain. So this means actually we have this 
equation,

The number without "global" <= The number with "global"

>
>> #2. Then we have two cases for that global rule,
>>
>> #2.1 If we really need all according to our global rule, we would set all
>> rdms[] with all real rdm info and set d_config->num_rdms.
>> #2.2 If we have no such a global rule to obtain all, we just clear
>> d_config->num_rdms.
>>
>
> No, don't do this. In any failure path, if the free / dispose function
> depends on num_rdms to iterate through the whole list to dispose memory
> (if your rdm structure later contains pointers to allocated memory),
> this method leaks memory.
>
> The num_ field should always reflect the actual size of the array.
>
>> #3. Then for per device rule
>>
>> #3.1 From #2.1, we just need to check if we should change one given rdm
>> entry's policy if this given entry is just associated to this device.
>> #3.2 From 2.2, obviously we just need to fill rdms one by one. Of course,
>> its very possible that we don't fill all rdms since all passthroued devices
>> might not have no rdm at all or they just occupy some. But anyway, finally
>> we sync d_config->num_rdms.
>>
>
> Sorry I don't follow. But if your problem is you don't know how many
> entries you actually need, just use libxl__realloc?
>
>>>
>>>>>
>>>>>> +    free(xrdm);
>>>>>> +
>>>>>> +    /* Query RDM entries per-device */
>>>>>> +    for (i = 0; i < d_config->num_pcidevs; i++) {
>>>>>> +        unsigned int nr_entries = 0;
>>>>
>>>> Maybe I should restate this,
>>>> 	unsigned int nr_entries;
>>>>
>
> [...]
>
>>>>
>>>> Like I replied above we always preallocate this at the beginning.
>>>>
>>>
>>> Ah, OK.
>>>
>>> But please don't do this. It's hard to see you don't overrun the
>>> buffer. Please allocate memory only when you need it.
>>
>> Sorry I don't understand this. As I mention above, we don't know how many
>> rdm entries we really need to allocate. So this is why we'd like to
>> preallocate all rdms at the beginning. Then this can cover all cases, global
>> policy, (global policy & per device) and only per device. Even if multiple
>> devices shares one rdm we also need to avoid duplicating a new...
>>
>
> Can you use libxl__realloc?
>

Looks this can expand size each time automatically, right? If so I think 
we can try this.

Thanks
Tiejun

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC][PATCH 01/13] tools: introduce some new parameters to set rdm policy
  2015-05-19  3:16             ` Chen, Tiejun
@ 2015-05-19  9:42               ` Wei Liu
  2015-05-19 10:50                 ` Chen, Tiejun
  0 siblings, 1 reply; 125+ messages in thread
From: Wei Liu @ 2015-05-19  9:42 UTC (permalink / raw)
  To: Chen, Tiejun
  Cc: kevin.tian, Wei Liu, ian.campbell, andrew.cooper3, tim,
	xen-devel, stefano.stabellini, JBeulich, yang.z.zhang,
	Ian.Jackson

On Tue, May 19, 2015 at 11:16:33AM +0800, Chen, Tiejun wrote:
> 
> On 2015/5/19 3:17, Wei Liu wrote:

[...]

> >>     rdm = [ 'host, reserve=<POLICY>' ]
> >>
> >
> >OK, so this is a specific example in vtd.txt. Last time I misread it as
> >part of the manpage.
> >
> >I think you meant in this specific example (with other suggestions
> >incorporated):
> >
> >      rdm = "type=host, reserve=force"
> >
> >Then you point user to xl.cfg manpage.
> >
> >>Or just for a specific device:
> >>
> >>     pci = [ '01:00.0,rdm_reserve=<POLICY>', '03:00.0' ]
> >>
> >
> >Same here.
> >
> 
> In order to be compatible with vtd.txt, could this work for you?
> 
> To enable this globally, add "rdm" in the config file:
> 
>     rdm = [ 'type=host, reserve=<policy>' ]     (default policy is
> "relaxed")
> 
> Or just for a specific device:
> 
>     pci = [ '01:00.0,rdm_reserve=<policy>' ]    (default policy is "strict")
> 

In my last reply I meant to ask you to have specific example in vtd.txt
and point user to xl.cfg manpage (see below). Previously I thought the
examples you proposed were for manpages, that's why I had asked you to
"list all values of that option". But now this is in vtd.txt, I would
just write something like:


    rdm = "type=host, reserve=relaxed"   (default policy is "relaxed")

    pci = [ '01:00.0,rdm_reserve=try', '03:00.0,reserve=strict' ]

    For all the options available to RDM, see xl.cfg(5).

And of course you would need to list all values of these options in
xl.cfg(5), as I suggested in my first reply to this patch.

> 
> >Just don't write "force/try" or "strcit/relax" because that's not the
> >exact syntax you would use in a real config file.
> 
> Yeah.
> 
> >

[...]

> >>>>>want you want?
> >>>>
> >>>>We have a little bit of complexity here,
> >>>>
> >>>>"Default per-device RDM policy is 'force', while default global RDM policy
> >>>>is 'try'. When both policies are specified on a given region, 'force' is
> >>>>always preferred."
> >>>>
> >>>
> >>>This is going to be done in actual code anyway.
> >>>
> >>>This type is used both in global and per-device setting, so I envisage
> >>
> >>Yes.
> >>
> >>>this to have an invalid value to start with. Appropriate default values
> >>
> >>Sounds I should set this,
> >>
> >>+libxl_rdm_reserve_flag = Enumeration("rdm_reserve_flag", [
> >>+    (-1, "invalid"),
> >>+    (0, "strict"),
> >>+    (1, "relaxed"),
> >>+    ], init_val = "LIBXL_RDM_RESERVE_FLAG_INVALID")
> >>+
> >>
> >

Yet another question about this feature. The current setup suggests that
we must choose a policy, either "strict" or "relaxed", i.e. there is no
way to disable this feature, as in there is no "none" policy to skip
checking rdm conflict.

AIUI this feature is more like a bug fix to existing problem, so we
always want to enable it. And the default relaxed policy makes sure we
don't break guest that was working before this feature. Do I understand
this correctly?

If we risk breaking existing guests, we might want to consider adding a
"none" (name subject to improvement) policy to just skip RDM all
together.

> >Yes, and then don't forget to set the value to appropriate value in the
> >_setdefault functions for different types.
> 
> Currently "type" is not associated to "policy" so we can do this if
> necessary in the future.
> 
> >
> >>
> >>>should be done in libxl_TYPE_setdefault functions. And the logic to
> >>>detect conflict and preferences done in your construct function.
> >>>
> >>>What do you think?
> >>>
> >>>>>
> >>>>>>+
> >
> >[...]
> >
> >>>>>>                      pcidev->permissive = atoi(tok);
> >>>>>>                  }else if ( !strcmp(optkey, "seize") ) {
> >>>>>>                      pcidev->seize = atoi(tok);
> >>>>>>+                }else if ( !strcmp(optkey, "rdm_reserve") ) {
> >>>>>>+                    if ( !strcmp(tok, "force") ) {
> >>>>>>+                        pcidev->rdm_reserve = LIBXL_RDM_RESERVE_FLAG_FORCE;
> >>>>>>+                    } else if ( !strcmp(tok, "try") ) {
> >>>>>>+                        pcidev->rdm_reserve = LIBXL_RDM_RESERVE_FLAG_TRY;
> >>>>>>+                    } else {
> >>>>>>+                        pcidev->rdm_reserve = LIBXL_RDM_RESERVE_FLAG_FORCE;
> >>>>>>+                        XLU__PCI_ERR(cfg, "Unknown PCI RDM property flag value:"
> >>>>>>+                                          " %s, so goes 'force' by default.",
> >>>>>
> >>>>>If this is not an error, you don't need XLU__PCI_ERR.
> >>>>>
> >>>>>But I would say we should  just treat this as an error and
> >>>>>abort/exit/report (whatever the parser should do in this case).
> >>>>
> >>>>In our case we just want to post a message to set a appropriate flag to
> >>>>recover this behavior like we write here,
> >>>>
> >>>>                         XLU__PCI_ERR(cfg, "Unknown PCI RDM property flag
> >>>>value:"
> >>>>                                           " %s, so goes 'strict' by
> >>>>default.",
> >>>>                                      tok);
> >>>
> >>>I suggest we just abort in this case and not second guess what the admin
> >>>wants.
> >>
> >>Okay,
> >>                     } else {
> >>                         XLU__PCI_ERR(cfg, "%s is not an valid PCI RDM
> >>property"
> >>                                           " flag: 'strict' or 'relaxed'.",
> >>                                      tok);
> >>                         abort();
> >>
> >
> >No, not calling the "abort" function. I meant returning appropriate error
> >value and let the caller handles this situation.
> 
> Okay, just call "goto parse_error".
> 

Yes, that would work.

> >
> >>
> >>>
> >>>>
> >>>>This may just be a warning? But I don't we have this sort of definition,
> >>>>XLU__PCI_WARN, ...
> >>>>
> >>>>So what LOG format can be adopted here?
> >>>
> >>>Feel free to introduce XLU__PCI_WARN if it turns out to be necessary.
> >>
> >>If it goes to abort(), I think XLU__PCI_ERR() should be good.
> >>
> >>>
> >>>>
> >>>>>
> >>>>>>+                                     tok);
> >>>>>>+                    }
> >>>>>>                  }else{
> >>>>>>                      XLU__PCI_ERR(cfg, "Unknown PCI BDF option: %s", optkey);
> >>>>>>                  }
> >>>>>>@@ -167,6 +180,71 @@ parse_error:
> >>>>>>      return ERROR_INVAL;
> >>>>>>  }
> >>>>>>
> >>>>>>+int xlu_rdm_parse(XLU_Config *cfg, libxl_rdm_reserve *rdm, const char *str)
> >>>>>>+{
> >>>>>>+    unsigned state = STATE_TYPE;
> >>>>>>+    char *buf2, *tok, *ptr, *end;
> >>>>>>+
> >>>>>>+    if ( NULL == (buf2 = ptr = strdup(str)) )
> >>>>>>+        return ERROR_NOMEM;
> >>>>>>+
> >>>>>>+    for(tok = ptr, end = ptr + strlen(ptr) + 1; ptr < end; ptr++) {
> >>>>>>+        switch(state) {
> >>>
> >>>Coding style. I haven't checked what actual style this file uses, but
> >>>there is inconsistency in this function by itself.
> >>
> >>I just refer to xlu_pci_parse_bdf() to generate xlu_rdm_parse(), and they
> >>are in the same file...
> >>
> >>Anyway, I should change this line,
> >>
> >>for ( tok = ptr, end = ptr + strlen(ptr) + 1; ptr < end; ptr++ ) {
> >>
> >
> >   for (tok = ptr, end...)
> >
> >   switch (state) {
> 
> But what is the difference to compare the initial code?
> 

Spaces.

> >>>>>> +    if ( NULL == (buf2 = ptr = strdup(str)) )
> >>>>>> +        return ERROR_NOMEM;
> >>>>>> +
> >>>>>> +    for(tok = ptr, end = ptr + strlen(ptr) + 1; ptr < end; ptr++) {
> >>>>>> +        switch(state) {
> 
> I thought initially you let me to follow that previous "if" :)
> 

Just be consistent with other part of the source code.

> >
> >
> >>>
> >>>>>>+        case STATE_TYPE:
> >>>>>>+            if ( *ptr == '\0' || *ptr == ',' ) {
> >>>>>>+                state = STATE_CHECK_FLAG;
> >>>>>>+                *ptr = '\0';
> >

[...]

> >                  [("hvm", Struct(None, [("firmware",         string),
> >                                         ("bios",             libxl_bios_type),
> >@@ -521,6 +538,7 @@ libxl_device_pci = Struct("device_pci", [
> >      ("power_mgmt", bool),
> >      ("permissive", bool),
> >      ("seize", bool),
> >+    ("rdm_reserve",   libxl_rdm_reserve_flag),
> >      ])
> >
> >Do you actually need libxl_rdm_reserve type? I.e. do you envisage that
> >structure to change a lot? Can you not just use libxl_rdm_reserve_type
> >and libxl_rdm_reserve_flag in build_info.
> >
> 
> We'd like to introduce this type, libxl_rdm_reserve, to combine "type" and
> "flag". From my point of view, this sole structure can represent a holistic
> approach to rdm because,
> 
> #1. Obviously its easy to get all;
> #2. It will probably be extended since like this name, rdm, reserved device
> memory, this should not be restricted to RMRR currently. So I just feel its
> flexible to support others in the future, or much more properties :)
> 

Fair enough.

Wei.

> Thanks
> Tiejun

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC][PATCH 04/13] tools/libxl: detect and avoid conflicts with RDM
  2015-05-19  1:32             ` Tian, Kevin
@ 2015-05-19 10:22               ` Wei Liu
  0 siblings, 0 replies; 125+ messages in thread
From: Wei Liu @ 2015-05-19 10:22 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Wei Liu, ian.campbell, andrew.cooper3, tim, xen-devel,
	stefano.stabellini, JBeulich, Zhang, Yang Z, Chen, Tiejun,
	Ian.Jackson

On Tue, May 19, 2015 at 01:32:27AM +0000, Tian, Kevin wrote:
> > From: Wei Liu [mailto:wei.liu2@citrix.com]
> > Sent: Tuesday, May 19, 2015 4:00 AM
> > 

[...]

> > 
> > Yes, but this policy is hardcoded in code (as in, you bail when
> > detecting conflict in highmem region). I don't think we have the
> > expertise to un-hardcode it in the future (it might require x86 specific
> > insider information and specific hardware to test). So I would like to
> > make it as future proof as possible.
> > 
> > I know you're limited by hvm_info. I would accept this hardcoded policy
> > as short term solution, but I would like commitment from Intel to
> > maintain this piece of code and properly work out a flexible solution to
> > deal with >4G case.
> > 
> 
> We made this simplification by balancing implementation complexity and
> real-world possibility. Yes Intel will commit to maintain it if future platform
> does break this assumption.
> 

Thanks. This is what I need at this stage. I'm fine with the way going
forward.

Wei.

> Thanks
> Kevin

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC][PATCH 01/13] tools: introduce some new parameters to set rdm policy
  2015-05-19  9:42               ` Wei Liu
@ 2015-05-19 10:50                 ` Chen, Tiejun
  2015-05-19 11:00                   ` Wei Liu
  0 siblings, 1 reply; 125+ messages in thread
From: Chen, Tiejun @ 2015-05-19 10:50 UTC (permalink / raw)
  To: Wei Liu
  Cc: kevin.tian, ian.campbell, andrew.cooper3, tim, xen-devel,
	stefano.stabellini, JBeulich, yang.z.zhang, Ian.Jackson

On 2015/5/19 17:42, Wei Liu wrote:
> On Tue, May 19, 2015 at 11:16:33AM +0800, Chen, Tiejun wrote:
>>
>> On 2015/5/19 3:17, Wei Liu wrote:
>
> [...]
>
>>>>      rdm = [ 'host, reserve=<POLICY>' ]
>>>>
>>>
>>> OK, so this is a specific example in vtd.txt. Last time I misread it as
>>> part of the manpage.
>>>
>>> I think you meant in this specific example (with other suggestions
>>> incorporated):
>>>
>>>       rdm = "type=host, reserve=force"
>>>
>>> Then you point user to xl.cfg manpage.
>>>
>>>> Or just for a specific device:
>>>>
>>>>      pci = [ '01:00.0,rdm_reserve=<POLICY>', '03:00.0' ]
>>>>
>>>
>>> Same here.
>>>
>>
>> In order to be compatible with vtd.txt, could this work for you?
>>
>> To enable this globally, add "rdm" in the config file:
>>
>>      rdm = [ 'type=host, reserve=<policy>' ]     (default policy is
>> "relaxed")
>>
>> Or just for a specific device:
>>
>>      pci = [ '01:00.0,rdm_reserve=<policy>' ]    (default policy is "strict")
>>
>
> In my last reply I meant to ask you to have specific example in vtd.txt
> and point user to xl.cfg manpage (see below). Previously I thought the
> examples you proposed were for manpages, that's why I had asked you to
> "list all values of that option". But now this is in vtd.txt, I would
> just write something like:
>
>
>      rdm = "type=host, reserve=relaxed"   (default policy is "relaxed")
>
>      pci = [ '01:00.0,rdm_reserve=try', '03:00.0,reserve=strict' ]
>
>      For all the options available to RDM, see xl.cfg(5).
>
> And of course you would need to list all values of these options in
> xl.cfg(5), as I suggested in my first reply to this patch.

This really make me clear. Sorry for my previous misunderstanding.

>
>>
>>> Just don't write "force/try" or "strcit/relax" because that's not the
>>> exact syntax you would use in a real config file.
>>
>> Yeah.
>>
>>>
>
> [...]
>
>>>>>>> want you want?
>>>>>>
>>>>>> We have a little bit of complexity here,
>>>>>>
>>>>>> "Default per-device RDM policy is 'force', while default global RDM policy
>>>>>> is 'try'. When both policies are specified on a given region, 'force' is
>>>>>> always preferred."
>>>>>>
>>>>>
>>>>> This is going to be done in actual code anyway.
>>>>>
>>>>> This type is used both in global and per-device setting, so I envisage
>>>>
>>>> Yes.
>>>>
>>>>> this to have an invalid value to start with. Appropriate default values
>>>>
>>>> Sounds I should set this,
>>>>
>>>> +libxl_rdm_reserve_flag = Enumeration("rdm_reserve_flag", [
>>>> +    (-1, "invalid"),
>>>> +    (0, "strict"),
>>>> +    (1, "relaxed"),
>>>> +    ], init_val = "LIBXL_RDM_RESERVE_FLAG_INVALID")
>>>> +
>>>>
>>>
>
> Yet another question about this feature. The current setup suggests that
> we must choose a policy, either "strict" or "relaxed", i.e. there is no
> way to disable this feature, as in there is no "none" policy to skip
> checking rdm conflict.
>
> AIUI this feature is more like a bug fix to existing problem, so we
> always want to enable it. And the default relaxed policy makes sure we
> don't break guest that was working before this feature. Do I understand
> this correctly?
>
> If we risk breaking existing guests, we might want to consider adding a
> "none" (name subject to improvement) policy to just skip RDM all
> together.

We have this definition,

+libxl_rdm_reserve_type = Enumeration("rdm_reserve_type", [
+    (0, "none"),
+    (1, "host"),
+    ])

If we set 'type=none', this means we would do nothing actually since we 
don't expose any rdms to guest. This behavior will ensue we go back the 
existing scenario without our patch.

>
>>> Yes, and then don't forget to set the value to appropriate value in the
>>> _setdefault functions for different types.
>>
>> Currently "type" is not associated to "policy" so we can do this if
>> necessary in the future.
>>
>>>
>>>>
>>>>> should be done in libxl_TYPE_setdefault functions. And the logic to
>>>>> detect conflict and preferences done in your construct function.
>>>>>
>>>>> What do you think?
>>>>>
>>>>>>>
>>>>>>>> +
>>>
>>> [...]
>>>
>>>>>>>>                       pcidev->permissive = atoi(tok);
>>>>>>>>                   }else if ( !strcmp(optkey, "seize") ) {
>>>>>>>>                       pcidev->seize = atoi(tok);
>>>>>>>> +                }else if ( !strcmp(optkey, "rdm_reserve") ) {
>>>>>>>> +                    if ( !strcmp(tok, "force") ) {
>>>>>>>> +                        pcidev->rdm_reserve = LIBXL_RDM_RESERVE_FLAG_FORCE;
>>>>>>>> +                    } else if ( !strcmp(tok, "try") ) {
>>>>>>>> +                        pcidev->rdm_reserve = LIBXL_RDM_RESERVE_FLAG_TRY;
>>>>>>>> +                    } else {
>>>>>>>> +                        pcidev->rdm_reserve = LIBXL_RDM_RESERVE_FLAG_FORCE;
>>>>>>>> +                        XLU__PCI_ERR(cfg, "Unknown PCI RDM property flag value:"
>>>>>>>> +                                          " %s, so goes 'force' by default.",
>>>>>>>
>>>>>>> If this is not an error, you don't need XLU__PCI_ERR.
>>>>>>>
>>>>>>> But I would say we should  just treat this as an error and
>>>>>>> abort/exit/report (whatever the parser should do in this case).
>>>>>>
>>>>>> In our case we just want to post a message to set a appropriate flag to
>>>>>> recover this behavior like we write here,
>>>>>>
>>>>>>                          XLU__PCI_ERR(cfg, "Unknown PCI RDM property flag
>>>>>> value:"
>>>>>>                                            " %s, so goes 'strict' by
>>>>>> default.",
>>>>>>                                       tok);
>>>>>
>>>>> I suggest we just abort in this case and not second guess what the admin
>>>>> wants.
>>>>
>>>> Okay,
>>>>                      } else {
>>>>                          XLU__PCI_ERR(cfg, "%s is not an valid PCI RDM
>>>> property"
>>>>                                            " flag: 'strict' or 'relaxed'.",
>>>>                                       tok);
>>>>                          abort();
>>>>
>>>
>>> No, not calling the "abort" function. I meant returning appropriate error
>>> value and let the caller handles this situation.
>>
>> Okay, just call "goto parse_error".
>>
>
> Yes, that would work.
>
>>>
>>>>
>>>>>
>>>>>>
>>>>>> This may just be a warning? But I don't we have this sort of definition,
>>>>>> XLU__PCI_WARN, ...
>>>>>>
>>>>>> So what LOG format can be adopted here?
>>>>>
>>>>> Feel free to introduce XLU__PCI_WARN if it turns out to be necessary.
>>>>
>>>> If it goes to abort(), I think XLU__PCI_ERR() should be good.
>>>>
>>>>>
>>>>>>
>>>>>>>
>>>>>>>> +                                     tok);
>>>>>>>> +                    }
>>>>>>>>                   }else{
>>>>>>>>                       XLU__PCI_ERR(cfg, "Unknown PCI BDF option: %s", optkey);
>>>>>>>>                   }
>>>>>>>> @@ -167,6 +180,71 @@ parse_error:
>>>>>>>>       return ERROR_INVAL;
>>>>>>>>   }
>>>>>>>>
>>>>>>>> +int xlu_rdm_parse(XLU_Config *cfg, libxl_rdm_reserve *rdm, const char *str)
>>>>>>>> +{
>>>>>>>> +    unsigned state = STATE_TYPE;
>>>>>>>> +    char *buf2, *tok, *ptr, *end;
>>>>>>>> +
>>>>>>>> +    if ( NULL == (buf2 = ptr = strdup(str)) )
>>>>>>>> +        return ERROR_NOMEM;
>>>>>>>> +
>>>>>>>> +    for(tok = ptr, end = ptr + strlen(ptr) + 1; ptr < end; ptr++) {
>>>>>>>> +        switch(state) {
>>>>>
>>>>> Coding style. I haven't checked what actual style this file uses, but
>>>>> there is inconsistency in this function by itself.
>>>>
>>>> I just refer to xlu_pci_parse_bdf() to generate xlu_rdm_parse(), and they
>>>> are in the same file...
>>>>
>>>> Anyway, I should change this line,
>>>>
>>>> for ( tok = ptr, end = ptr + strlen(ptr) + 1; ptr < end; ptr++ ) {
>>>>
>>>
>>>    for (tok = ptr, end...)
>>>
>>>    switch (state) {
>>
>> But what is the difference to compare the initial code?
>>
>
> Spaces.
>
>>>>>>>> +    if ( NULL == (buf2 = ptr = strdup(str)) )
>>>>>>>> +        return ERROR_NOMEM;
>>>>>>>> +
>>>>>>>> +    for(tok = ptr, end = ptr + strlen(ptr) + 1; ptr < end; ptr++) {
>>>>>>>> +        switch(state) {
>>
>> I thought initially you let me to follow that previous "if" :)
>>
>
> Just be consistent with other part of the source code.

I just refer to that existing xlu_pci_parse_bdf()...

Anyway I guess you mean I should do something like this,

     if (NULL == (buf2 = ptr = strdup(str)))
         return ERROR_NOMEM;

     for (tok = ptr, end = ptr + strlen(ptr) + 1; ptr < end; ptr++) {
         switch(state) {
         case STATE_TYPE:
             if (*ptr == '=') {
	    ...

Thanks
Tiejun

>
>>>
>>>
>>>>>
>>>>>>>> +        case STATE_TYPE:
>>>>>>>> +            if ( *ptr == '\0' || *ptr == ',' ) {
>>>>>>>> +                state = STATE_CHECK_FLAG;
>>>>>>>> +                *ptr = '\0';
>>>
>
> [...]
>
>>>                   [("hvm", Struct(None, [("firmware",         string),
>>>                                          ("bios",             libxl_bios_type),
>>> @@ -521,6 +538,7 @@ libxl_device_pci = Struct("device_pci", [
>>>       ("power_mgmt", bool),
>>>       ("permissive", bool),
>>>       ("seize", bool),
>>> +    ("rdm_reserve",   libxl_rdm_reserve_flag),
>>>       ])
>>>
>>> Do you actually need libxl_rdm_reserve type? I.e. do you envisage that
>>> structure to change a lot? Can you not just use libxl_rdm_reserve_type
>>> and libxl_rdm_reserve_flag in build_info.
>>>
>>
>> We'd like to introduce this type, libxl_rdm_reserve, to combine "type" and
>> "flag". From my point of view, this sole structure can represent a holistic
>> approach to rdm because,
>>
>> #1. Obviously its easy to get all;
>> #2. It will probably be extended since like this name, rdm, reserved device
>> memory, this should not be restricted to RMRR currently. So I just feel its
>> flexible to support others in the future, or much more properties :)
>>
>
> Fair enough.
>
> Wei.
>
>> Thanks
>> Tiejun
>
>

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC][PATCH 01/13] tools: introduce some new parameters to set rdm policy
  2015-05-19 10:50                 ` Chen, Tiejun
@ 2015-05-19 11:00                   ` Wei Liu
  2015-05-20  5:27                     ` Chen, Tiejun
  0 siblings, 1 reply; 125+ messages in thread
From: Wei Liu @ 2015-05-19 11:00 UTC (permalink / raw)
  To: Chen, Tiejun
  Cc: kevin.tian, Wei Liu, ian.campbell, andrew.cooper3, tim,
	xen-devel, stefano.stabellini, JBeulich, yang.z.zhang,
	Ian.Jackson

On Tue, May 19, 2015 at 06:50:11PM +0800, Chen, Tiejun wrote:
> On 2015/5/19 17:42, Wei Liu wrote:

[...]

> >>>>>>>want you want?
> >>>>>>
> >>>>>>We have a little bit of complexity here,
> >>>>>>
> >>>>>>"Default per-device RDM policy is 'force', while default global RDM policy
> >>>>>>is 'try'. When both policies are specified on a given region, 'force' is
> >>>>>>always preferred."
> >>>>>>
> >>>>>
> >>>>>This is going to be done in actual code anyway.
> >>>>>
> >>>>>This type is used both in global and per-device setting, so I envisage
> >>>>
> >>>>Yes.
> >>>>
> >>>>>this to have an invalid value to start with. Appropriate default values
> >>>>
> >>>>Sounds I should set this,
> >>>>
> >>>>+libxl_rdm_reserve_flag = Enumeration("rdm_reserve_flag", [
> >>>>+    (-1, "invalid"),
> >>>>+    (0, "strict"),
> >>>>+    (1, "relaxed"),
> >>>>+    ], init_val = "LIBXL_RDM_RESERVE_FLAG_INVALID")
> >>>>+
> >>>>
> >>>
> >
> >Yet another question about this feature. The current setup suggests that
> >we must choose a policy, either "strict" or "relaxed", i.e. there is no
> >way to disable this feature, as in there is no "none" policy to skip
> >checking rdm conflict.
> >
> >AIUI this feature is more like a bug fix to existing problem, so we
> >always want to enable it. And the default relaxed policy makes sure we
> >don't break guest that was working before this feature. Do I understand
> >this correctly?
> >
> >If we risk breaking existing guests, we might want to consider adding a
> >"none" (name subject to improvement) policy to just skip RDM all
> >together.
> 
> We have this definition,
> 
> +libxl_rdm_reserve_type = Enumeration("rdm_reserve_type", [
> +    (0, "none"),
> +    (1, "host"),
> +    ])
> 
> If we set 'type=none', this means we would do nothing actually since we
> don't expose any rdms to guest. This behavior will ensue we go back the
> existing scenario without our patch.
> 

But this only works with global configuration and individual
configuration in PCI spec trumps this, right?

Think about how an old configuration migrated to newer version of Xen
should work. For example, I don't have rdm= but have pci=['xxxx']. Do we
need to make sure this still work? I guess the answer is if it already
works before RDM it should continue to work as there is really no
conflict before. In this case whether  we enable RDM or not doesn't make
a difference to a guest that's already working before. Am I right?

> >
> >>>Yes, and then don't forget to set the value to appropriate value in the
> >>>_setdefault functions for different types.
> >>

[...]

> >Spaces.
> >
> >>>>>>>>+    if ( NULL == (buf2 = ptr = strdup(str)) )
> >>>>>>>>+        return ERROR_NOMEM;
> >>>>>>>>+
> >>>>>>>>+    for(tok = ptr, end = ptr + strlen(ptr) + 1; ptr < end; ptr++) {
> >>>>>>>>+        switch(state) {
> >>
> >>I thought initially you let me to follow that previous "if" :)
> >>
> >
> >Just be consistent with other part of the source code.
> 
> I just refer to that existing xlu_pci_parse_bdf()...
> 

Sorry I didn't mean to blame you for something that's not your fault.

> Anyway I guess you mean I should do something like this,
> 
>     if (NULL == (buf2 = ptr = strdup(str)))
>         return ERROR_NOMEM;
> 
>     for (tok = ptr, end = ptr + strlen(ptr) + 1; ptr < end; ptr++) {
>         switch(state) {
>         case STATE_TYPE:
>             if (*ptr == '=') {
> 	    ...
> 

Fair enough. I prefer consistency.

Wei.

> Thanks
> Tiejun
> 

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC][PATCH 01/13] tools: introduce some new parameters to set rdm policy
  2015-05-19 11:00                   ` Wei Liu
@ 2015-05-20  5:27                     ` Chen, Tiejun
  2015-05-20  8:36                       ` Wei Liu
  0 siblings, 1 reply; 125+ messages in thread
From: Chen, Tiejun @ 2015-05-20  5:27 UTC (permalink / raw)
  To: Wei Liu
  Cc: kevin.tian, ian.campbell, andrew.cooper3, tim, xen-devel,
	stefano.stabellini, JBeulich, yang.z.zhang, Ian.Jackson

On 2015/5/19 19:00, Wei Liu wrote:
> On Tue, May 19, 2015 at 06:50:11PM +0800, Chen, Tiejun wrote:
>> On 2015/5/19 17:42, Wei Liu wrote:
>
> [...]
>
>>>>>>>>> want you want?
>>>>>>>>
>>>>>>>> We have a little bit of complexity here,
>>>>>>>>
>>>>>>>> "Default per-device RDM policy is 'force', while default global RDM policy
>>>>>>>> is 'try'. When both policies are specified on a given region, 'force' is
>>>>>>>> always preferred."
>>>>>>>>
>>>>>>>
>>>>>>> This is going to be done in actual code anyway.
>>>>>>>
>>>>>>> This type is used both in global and per-device setting, so I envisage
>>>>>>
>>>>>> Yes.
>>>>>>
>>>>>>> this to have an invalid value to start with. Appropriate default values
>>>>>>
>>>>>> Sounds I should set this,
>>>>>>
>>>>>> +libxl_rdm_reserve_flag = Enumeration("rdm_reserve_flag", [
>>>>>> +    (-1, "invalid"),
>>>>>> +    (0, "strict"),
>>>>>> +    (1, "relaxed"),
>>>>>> +    ], init_val = "LIBXL_RDM_RESERVE_FLAG_INVALID")
>>>>>> +
>>>>>>
>>>>>
>>>
>>> Yet another question about this feature. The current setup suggests that
>>> we must choose a policy, either "strict" or "relaxed", i.e. there is no
>>> way to disable this feature, as in there is no "none" policy to skip
>>> checking rdm conflict.
>>>
>>> AIUI this feature is more like a bug fix to existing problem, so we
>>> always want to enable it. And the default relaxed policy makes sure we
>>> don't break guest that was working before this feature. Do I understand
>>> this correctly?
>>>
>>> If we risk breaking existing guests, we might want to consider adding a
>>> "none" (name subject to improvement) policy to just skip RDM all
>>> together.
>>
>> We have this definition,
>>
>> +libxl_rdm_reserve_type = Enumeration("rdm_reserve_type", [
>> +    (0, "none"),
>> +    (1, "host"),
>> +    ])
>>
>> If we set 'type=none', this means we would do nothing actually since we
>> don't expose any rdms to guest. This behavior will ensue we go back the
>> existing scenario without our patch.
>>
>
> But this only works with global configuration and individual
> configuration in PCI spec trumps this, right?

You're right.

>
> Think about how an old configuration migrated to newer version of Xen
> should work. For example, I don't have rdm= but have pci=['xxxx']. Do we
> need to make sure this still work? I guess the answer is if it already

Definitely.

> works before RDM it should continue to work as there is really no
> conflict before. In this case whether  we enable RDM or not doesn't make
> a difference to a guest that's already working before. Am I right?

I think we can set the default 'type' to NONE,

libxl__rdm_setdefault()
{
     b_info->rdm.type = LIBXL_RDM_RESERVE_TYPE_NONE;

and then,

libxl__domain_device_construct_rdm()
{
     ...
     /* Might not expose rdm. */
     if (type == LIBXL_RDM_RESERVE_TYPE_NONE)
	return 0;

This means we don't expose any rdm so we would never concern any policy 
anymore.


Thanks
Tiejun

>
>>>
>>>>> Yes, and then don't forget to set the value to appropriate value in the
>>>>> _setdefault functions for different types.
>>>>
>
> [...]
>
>>> Spaces.
>>>
>>>>>>>>>> +    if ( NULL == (buf2 = ptr = strdup(str)) )
>>>>>>>>>> +        return ERROR_NOMEM;
>>>>>>>>>> +
>>>>>>>>>> +    for(tok = ptr, end = ptr + strlen(ptr) + 1; ptr < end; ptr++) {
>>>>>>>>>> +        switch(state) {
>>>>
>>>> I thought initially you let me to follow that previous "if" :)
>>>>
>>>
>>> Just be consistent with other part of the source code.
>>
>> I just refer to that existing xlu_pci_parse_bdf()...
>>
>
> Sorry I didn't mean to blame you for something that's not your fault.
>
>> Anyway I guess you mean I should do something like this,
>>
>>      if (NULL == (buf2 = ptr = strdup(str)))
>>          return ERROR_NOMEM;
>>
>>      for (tok = ptr, end = ptr + strlen(ptr) + 1; ptr < end; ptr++) {
>>          switch(state) {
>>          case STATE_TYPE:
>>              if (*ptr == '=') {
>> 	    ...
>>
>
> Fair enough. I prefer consistency.
>
> Wei.
>
>> Thanks
>> Tiejun
>>
>
>

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC][PATCH 01/13] tools: introduce some new parameters to set rdm policy
  2015-05-20  5:27                     ` Chen, Tiejun
@ 2015-05-20  8:36                       ` Wei Liu
  2015-05-20  8:51                         ` Chen, Tiejun
  0 siblings, 1 reply; 125+ messages in thread
From: Wei Liu @ 2015-05-20  8:36 UTC (permalink / raw)
  To: Chen, Tiejun
  Cc: kevin.tian, Wei Liu, ian.campbell, andrew.cooper3, tim,
	xen-devel, stefano.stabellini, JBeulich, yang.z.zhang,
	Ian.Jackson

On Wed, May 20, 2015 at 01:27:56PM +0800, Chen, Tiejun wrote:
[...]
> >>We have this definition,
> >>
> >>+libxl_rdm_reserve_type = Enumeration("rdm_reserve_type", [
> >>+    (0, "none"),
> >>+    (1, "host"),
> >>+    ])
> >>
> >>If we set 'type=none', this means we would do nothing actually since we
> >>don't expose any rdms to guest. This behavior will ensue we go back the
> >>existing scenario without our patch.
> >>
> >
> >But this only works with global configuration and individual
> >configuration in PCI spec trumps this, right?
> 
> You're right.
> 
> >
> >Think about how an old configuration migrated to newer version of Xen
> >should work. For example, I don't have rdm= but have pci=['xxxx']. Do we
> >need to make sure this still work? I guess the answer is if it already
> 
> Definitely.
> 
> >works before RDM it should continue to work as there is really no
> >conflict before. In this case whether  we enable RDM or not doesn't make
> >a difference to a guest that's already working before. Am I right?
> 

You haven't answered this question...  I'm trying to determine what
should be a sensible default value.

If the answer to that question is "yes", then we should enable RDM by
default, because it does no harm to guests that are already working and
fix problem for the guests that are not working; if the answer is "no"
or "not sure", we should use "none". Don't worry, we can change the
default value later if necessary.

Using "none" as default leaves us on the safe side but it would make it
less nicer to use Xen. But well, safety comes first.

Wei.

> I think we can set the default 'type' to NONE,
> 
> libxl__rdm_setdefault()
> {
>     b_info->rdm.type = LIBXL_RDM_RESERVE_TYPE_NONE;
> 
> and then,
> 
> libxl__domain_device_construct_rdm()
> {
>     ...
>     /* Might not expose rdm. */
>     if (type == LIBXL_RDM_RESERVE_TYPE_NONE)
> 	return 0;
> 
> This means we don't expose any rdm so we would never concern any policy
> anymore.
> 
> 
> Thanks
> Tiejun
> 

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC][PATCH 01/13] tools: introduce some new parameters to set rdm policy
  2015-05-20  8:36                       ` Wei Liu
@ 2015-05-20  8:51                         ` Chen, Tiejun
  2015-05-20  9:07                           ` Wei Liu
  0 siblings, 1 reply; 125+ messages in thread
From: Chen, Tiejun @ 2015-05-20  8:51 UTC (permalink / raw)
  To: Wei Liu
  Cc: kevin.tian, ian.campbell, andrew.cooper3, tim, xen-devel,
	stefano.stabellini, JBeulich, yang.z.zhang, Ian.Jackson

On 2015/5/20 16:36, Wei Liu wrote:
> On Wed, May 20, 2015 at 01:27:56PM +0800, Chen, Tiejun wrote:
> [...]
>>>> We have this definition,
>>>>
>>>> +libxl_rdm_reserve_type = Enumeration("rdm_reserve_type", [
>>>> +    (0, "none"),
>>>> +    (1, "host"),
>>>> +    ])
>>>>
>>>> If we set 'type=none', this means we would do nothing actually since we
>>>> don't expose any rdms to guest. This behavior will ensue we go back the
>>>> existing scenario without our patch.
>>>>
>>>
>>> But this only works with global configuration and individual
>>> configuration in PCI spec trumps this, right?
>>
>> You're right.
>>
>>>
>>> Think about how an old configuration migrated to newer version of Xen
>>> should work. For example, I don't have rdm= but have pci=['xxxx']. Do we
>>> need to make sure this still work? I guess the answer is if it already
>>
>> Definitely.
>>
>>> works before RDM it should continue to work as there is really no
>>> conflict before. In this case whether  we enable RDM or not doesn't make
>>> a difference to a guest that's already working before. Am I right?
>>
>
> You haven't answered this question...  I'm trying to determine what
> should be a sensible default value.

I think I should say 'no' here. RDM (RMRR) already exists and its also 
being enabled before I'm trying to introduce this series of patch. But 
we have some legacy or potential problems to really work well with RMRR.

>
> If the answer to that question is "yes", then we should enable RDM by
> default, because it does no harm to guests that are already working and
> fix problem for the guests that are not working; if the answer is "no"
> or "not sure", we should use "none". Don't worry, we can change the
> default value later if necessary.
>

So we're going to the latter :)

> Using "none" as default leaves us on the safe side but it would make it
> less nicer to use Xen. But well, safety comes first.

Actually RMRR doesn't matter in most cases unless you're trying to do 
pass through with a device which owns RMRR and also really use RMRR 
indeed. So from my point of view, I agree "NONE" should be default since 
we really should make sure we have a way to approach our original 
behavior in accordance with your concern.

Thanks
Tiejun

>
> Wei.
>
>> I think we can set the default 'type' to NONE,
>>
>> libxl__rdm_setdefault()
>> {
>>      b_info->rdm.type = LIBXL_RDM_RESERVE_TYPE_NONE;
>>
>> and then,
>>
>> libxl__domain_device_construct_rdm()
>> {
>>      ...
>>      /* Might not expose rdm. */
>>      if (type == LIBXL_RDM_RESERVE_TYPE_NONE)
>> 	return 0;
>>
>> This means we don't expose any rdm so we would never concern any policy
>> anymore.
>>
>>
>> Thanks
>> Tiejun
>>
>
>

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC][PATCH 01/13] tools: introduce some new parameters to set rdm policy
  2015-05-20  8:51                         ` Chen, Tiejun
@ 2015-05-20  9:07                           ` Wei Liu
  0 siblings, 0 replies; 125+ messages in thread
From: Wei Liu @ 2015-05-20  9:07 UTC (permalink / raw)
  To: Chen, Tiejun
  Cc: kevin.tian, Wei Liu, ian.campbell, andrew.cooper3, tim,
	xen-devel, stefano.stabellini, JBeulich, yang.z.zhang,
	Ian.Jackson

On Wed, May 20, 2015 at 04:51:44PM +0800, Chen, Tiejun wrote:
> On 2015/5/20 16:36, Wei Liu wrote:
> >On Wed, May 20, 2015 at 01:27:56PM +0800, Chen, Tiejun wrote:
> >[...]
> >>>>We have this definition,
> >>>>
> >>>>+libxl_rdm_reserve_type = Enumeration("rdm_reserve_type", [
> >>>>+    (0, "none"),
> >>>>+    (1, "host"),
> >>>>+    ])
> >>>>
> >>>>If we set 'type=none', this means we would do nothing actually since we
> >>>>don't expose any rdms to guest. This behavior will ensue we go back the
> >>>>existing scenario without our patch.
> >>>>
> >>>
> >>>But this only works with global configuration and individual
> >>>configuration in PCI spec trumps this, right?
> >>
> >>You're right.
> >>
> >>>
> >>>Think about how an old configuration migrated to newer version of Xen
> >>>should work. For example, I don't have rdm= but have pci=['xxxx']. Do we
> >>>need to make sure this still work? I guess the answer is if it already
> >>
> >>Definitely.
> >>
> >>>works before RDM it should continue to work as there is really no
> >>>conflict before. In this case whether  we enable RDM or not doesn't make
> >>>a difference to a guest that's already working before. Am I right?
> >>
> >
> >You haven't answered this question...  I'm trying to determine what
> >should be a sensible default value.
> 
> I think I should say 'no' here. RDM (RMRR) already exists and its also being
> enabled before I'm trying to introduce this series of patch. But we have
> some legacy or potential problems to really work well with RMRR.
> 
> >
> >If the answer to that question is "yes", then we should enable RDM by
> >default, because it does no harm to guests that are already working and
> >fix problem for the guests that are not working; if the answer is "no"
> >or "not sure", we should use "none". Don't worry, we can change the
> >default value later if necessary.
> >
> 
> So we're going to the latter :)
> 
> >Using "none" as default leaves us on the safe side but it would make it
> >less nicer to use Xen. But well, safety comes first.
> 
> Actually RMRR doesn't matter in most cases unless you're trying to do pass
> through with a device which owns RMRR and also really use RMRR indeed. So
> from my point of view, I agree "NONE" should be default since we really
> should make sure we have a way to approach our original behavior in
> accordance with your concern.
> 

Makes sense. Thanks for your reply.

Wei.

^ permalink raw reply	[flat|nested] 125+ messages in thread

end of thread, other threads:[~2015-05-20  9:07 UTC | newest]

Thread overview: 125+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-04-10  9:21 [RFC][PATCH 00/13] Fix RMRR Tiejun Chen
2015-04-10  9:21 ` [RFC][PATCH 01/13] tools: introduce some new parameters to set rdm policy Tiejun Chen
2015-05-08 13:04   ` Wei Liu
2015-05-11  5:35     ` Chen, Tiejun
2015-05-11 14:54       ` Wei Liu
2015-05-15  1:52         ` Chen, Tiejun
2015-05-18  1:06           ` Chen, Tiejun
2015-05-18 19:17           ` Wei Liu
2015-05-19  3:16             ` Chen, Tiejun
2015-05-19  9:42               ` Wei Liu
2015-05-19 10:50                 ` Chen, Tiejun
2015-05-19 11:00                   ` Wei Liu
2015-05-20  5:27                     ` Chen, Tiejun
2015-05-20  8:36                       ` Wei Liu
2015-05-20  8:51                         ` Chen, Tiejun
2015-05-20  9:07                           ` Wei Liu
2015-04-10  9:21 ` [RFC][PATCH 02/13] introduce XENMEM_reserved_device_memory_map Tiejun Chen
2015-04-16 14:59   ` Tim Deegan
2015-04-16 15:10     ` Jan Beulich
2015-04-16 15:24       ` Tim Deegan
2015-04-16 15:40         ` Tian, Kevin
2015-04-23 12:32       ` Chen, Tiejun
2015-04-23 12:59         ` Jan Beulich
2015-04-24  1:17           ` Chen, Tiejun
2015-04-24  7:21             ` Jan Beulich
2015-04-10  9:21 ` [RFC][PATCH 03/13] tools/libxc: Expose new hypercall xc_reserved_device_memory_map Tiejun Chen
2015-05-08 13:07   ` Wei Liu
2015-05-11  5:36     ` Chen, Tiejun
2015-05-11  9:50       ` Wei Liu
2015-04-10  9:21 ` [RFC][PATCH 04/13] tools/libxl: detect and avoid conflicts with RDM Tiejun Chen
2015-04-15 13:10   ` Ian Jackson
2015-04-15 18:22     ` Tian, Kevin
2015-04-23 12:31     ` Chen, Tiejun
2015-04-20 11:13   ` Jan Beulich
2015-05-06 15:00     ` Chen, Tiejun
2015-05-06 15:34       ` Jan Beulich
2015-05-07  2:22         ` Chen, Tiejun
2015-05-07  6:04           ` Jan Beulich
2015-05-08  1:14             ` Chen, Tiejun
2015-05-08  1:24           ` Chen, Tiejun
2015-05-08 15:13             ` Wei Liu
2015-05-11  6:06               ` Chen, Tiejun
2015-05-08 14:43   ` Wei Liu
2015-05-11  8:09     ` Chen, Tiejun
2015-05-11 11:32       ` Wei Liu
2015-05-14  8:27         ` Chen, Tiejun
2015-05-18  1:06           ` Chen, Tiejun
2015-05-18 20:00           ` Wei Liu
2015-05-19  1:32             ` Tian, Kevin
2015-05-19 10:22               ` Wei Liu
2015-05-19  6:47             ` Chen, Tiejun
2015-04-10  9:21 ` [RFC][PATCH 05/13] xen/x86/p2m: introduce set_identity_p2m_entry Tiejun Chen
2015-04-16 15:05   ` Tim Deegan
2015-04-23 12:33     ` Chen, Tiejun
2015-04-10  9:21 ` [RFC][PATCH 06/13] xen:vtd: create RMRR mapping Tiejun Chen
2015-04-16 15:16   ` Tim Deegan
2015-04-16 15:50     ` Tian, Kevin
2015-04-10  9:21 ` [RFC][PATCH 07/13] xen/passthrough: extend hypercall to support rdm reservation policy Tiejun Chen
2015-04-16 15:40   ` Tim Deegan
2015-04-23 12:32     ` Chen, Tiejun
2015-04-23 13:05       ` Tim Deegan
2015-04-23 13:59       ` Jan Beulich
2015-04-23 14:26         ` Tim Deegan
2015-05-04  8:15         ` Tian, Kevin
2015-04-20 13:36   ` Jan Beulich
2015-05-11  8:37     ` Chen, Tiejun
2015-05-08 16:07   ` Julien Grall
2015-05-11  8:42     ` Chen, Tiejun
2015-05-11  9:51       ` Julien Grall
2015-05-11 10:57         ` Jan Beulich
2015-05-14  5:48           ` Chen, Tiejun
2015-05-14 20:13             ` Jan Beulich
2015-05-14  5:47         ` Chen, Tiejun
2015-05-14 10:19           ` Julien Grall
2015-04-10  9:21 ` [RFC][PATCH 08/13] tools: extend xc_assign_device() " Tiejun Chen
2015-04-20 13:39   ` Jan Beulich
2015-05-11  9:45     ` Chen, Tiejun
2015-05-11 10:53       ` Jan Beulich
2015-05-14  7:04         ` Chen, Tiejun
2015-04-10  9:22 ` [RFC][PATCH 09/13] xen: enable XENMEM_set_memory_map in hvm Tiejun Chen
2015-04-16 15:42   ` Tim Deegan
2015-04-20 13:46   ` Jan Beulich
2015-05-15  2:33     ` Chen, Tiejun
2015-05-15  6:12       ` Jan Beulich
2015-05-15  6:24         ` Chen, Tiejun
2015-05-15  6:35           ` Jan Beulich
2015-05-15  6:59             ` Chen, Tiejun
2015-04-10  9:22 ` [RFC][PATCH 10/13] tools: extend XENMEM_set_memory_map Tiejun Chen
2015-04-10 10:01   ` Wei Liu
2015-04-13  2:09     ` Chen, Tiejun
2015-04-13 11:02       ` Wei Liu
2015-04-14  0:42         ` Chen, Tiejun
2015-05-05  9:32           ` Wei Liu
2015-04-20 13:51   ` Jan Beulich
2015-05-15  2:57     ` Chen, Tiejun
2015-05-15  6:16       ` Jan Beulich
2015-05-15  7:09         ` Chen, Tiejun
2015-05-15  7:32           ` Jan Beulich
2015-05-15  7:51             ` Chen, Tiejun
2015-04-10  9:22 ` [RFC][PATCH 11/13] hvmloader: get guest memory map into memory_map[] Tiejun Chen
2015-04-20 13:57   ` Jan Beulich
2015-05-15  3:10     ` Chen, Tiejun
2015-04-10  9:22 ` [RFC][PATCH 12/13] hvmloader/pci: skip reserved ranges Tiejun Chen
2015-04-20 14:21   ` Jan Beulich
2015-05-15  3:18     ` Chen, Tiejun
2015-05-15  6:19       ` Jan Beulich
2015-05-15  7:34         ` Chen, Tiejun
2015-05-15  7:44           ` Jan Beulich
2015-05-15  8:16             ` Chen, Tiejun
2015-05-15  8:31               ` Jan Beulich
2015-05-15  9:21                 ` Chen, Tiejun
2015-05-15  9:32                   ` Jan Beulich
2015-04-10  9:22 ` [RFC][PATCH 13/13] hvmloader/e820: construct guest e820 table Tiejun Chen
2015-04-20 14:29   ` Jan Beulich
2015-05-15  6:11     ` Chen, Tiejun
2015-05-15  6:25       ` Jan Beulich
2015-05-15  6:39         ` Chen, Tiejun
2015-05-15  6:56           ` Jan Beulich
2015-05-15  7:11             ` Chen, Tiejun
2015-05-15  7:34               ` Jan Beulich
2015-05-15  8:00                 ` Chen, Tiejun
2015-05-15  8:12                   ` Jan Beulich
2015-05-15  8:47                     ` Chen, Tiejun
2015-05-15  8:54                       ` Jan Beulich
2015-05-15  9:18                         ` Chen, Tiejun

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.