* [RFC XEN PATCH v2 00/15] Add vNVDIMM support to HVM domains
@ 2017-03-20  0:09 Haozhong Zhang
  2017-03-20  0:09 ` [RFC XEN PATCH v2 01/15] xen/common: add Kconfig item for pmem support Haozhong Zhang
                   ` (16 more replies)
  0 siblings, 17 replies; 32+ messages in thread
From: Haozhong Zhang @ 2017-03-20  0:09 UTC (permalink / raw)
  To: xen-devel
  Cc: Haozhong Zhang, Wei Liu, Andrew Cooper, Ian Jackson, Jan Beulich,
	Konrad Rzeszutek Wilk, Dan Williams, Daniel De Graaf

This is v2 RFC patch series to add vNVDIMM support to HVM domains.
v1 can be found at https://lists.xenproject.org/archives/html/xen-devel/2016-10/msg00424.html.

This version supports no labels and no _DSM functions except function 0
("query implemented functions"); they will be added by future patches.

The corresponding Qemu patch series is sent in another thread
"[RFC QEMU PATCH v2 00/10] Implement vNVDIMM for Xen HVM guest".

All patch series can be found at
  Xen:  https://github.com/hzzhan9/xen.git nvdimm-rfc-v2
  Qemu: https://github.com/hzzhan9/qemu.git xen-nvdimm-rfc-v2

Changes in v2
==============

- One of the primary changes in v2 is dropping the Linux kernel
  patches, which were used to reserve an area on host pmem for placing
  its frametable and M2P table. In v2, we add a management tool
  xen-ndctl, which is used in Dom0 to notify the Xen hypervisor of
  which storage can be used to manage the host pmem.

  For example,
  1.   xen-ndctl setup 0x240000 0x380000 0x380000 0x3c0000
    tells Xen hypervisor to use host pmem pages at MFN 0x380000 ~
    0x3c0000 to manage host pmem pages at MFN 0x240000 ~ 0x380000.
    I.e. the former is used to place the frame table and M2P table of
    both ranges of pmem pages.

  2.   xen-ndctl setup 0x240000 0x380000
    tells Xen hypervisor to use the regular RAM to manage the host
    pmem pages at MFN 0x240000 ~ 0x380000, i.e. the regular RAM is used
    to place the frame table and M2P table.

- Another primary change in v2 is dropping the support for mapping
  files on the host pmem to HVM domains as virtual NVDIMMs, as I
  cannot find a stable way to fix the fiemap of host files. Instead,
  we can rely on the ability, added in Linux kernel v4.9, to create
  multiple pmem namespaces on a single nvdimm interleave set.

- Other changes are logged in each patch separately.

How to Test
==============

0. This patch series can be tested either on real hardware with
   NVDIMM, or in a nested virtualization environment on KVM. The
   latter requires QEMU 2.9 or newer, started with, for example, the
   following commands and options:
     # dd if=/dev/zero of=nvm-8G.img bs=1G count=8
     # rmmod kvm-intel; modprobe kvm-intel nested=1
     # qemu-system-x86_64 -enable-kvm -smp 4 -cpu host,+vmx \
                          -hda DISK_IMG_OF_XEN \
                          -machine pc,nvdimm \
                          -m 8G,slots=4,maxmem=128G \
                          -object memory-backend-file,id=mem1,mem-path=nvm-8G.img,size=8G \
                          -device nvdimm,id=nv1,memdev=mem1,label-size=2M \
                          ...
   The above creates a nested virtualization environment with an 8G
   pmem-mode NVDIMM device (whose last 2MB are used as the label
   storage area).

1. Check out Xen and QEMU from the above repositories and branches. Build
   and install Xen with qemu-xen replaced by the above QEMU.

2. Build and install Linux kernel 4.9 or later as Dom0 kernel with the
   following configs selected:
       CONFIG_ACPI_NFIT
       CONFIG_LIBNVDIMM
       CONFIG_BLK_DEV_PMEM
       CONFIG_NVDIMM_PFN
       CONFIG_FS_DAX

3. Check out ndctl from https://github.com/pmem/ndctl.git. Build and
   install ndctl in Dom0.

4. Boot to Xen Dom0.

5. Create pmem namespaces on the host pmem region.
     # ndctl disable-region region0
     # ndctl zero-labels nmem0                        // clear existing labels
     # ndctl init-labels nmem0                        // initialize the label area
     # ndctl enable-region region0     
     # ndctl create-namespace -r region0 -s 4G -m raw // create one 4G pmem namespace
     # ndctl create-namespace -r region0 -s 1G -m raw // create one 1G pmem namespace
     # ndctl list --namespaces
     [
       {
           "dev":"namespace0.0",
           "mode":"raw",
           "size":4294967296,
           "uuid":"bbfbedbd-3ada-4f55-9484-01f2722c651b",
           "blockdev":"pmem0"
       },
       {
           "dev":"namespace0.1",
           "mode":"raw",
           "size":1073741824,
           "uuid":"dd4d3949-6887-417b-b819-89a7854fcdbd",
           "blockdev":"pmem0.1"
       }
     ]

6. Ask Xen hypervisor to use namespace0.1 to manage namespace0.0.
     # grep namespace /proc/iomem
         240000000-33fffffff : namespace0.0
         340000000-37fffffff : namespace0.1
     # xen-ndctl setup 0x240000 0x340000 0x340000 0x380000
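
   The MFN arguments are the /proc/iomem addresses shifted right by 12
   (4K frames); the end MFNs are exclusive, i.e. (end address + 1) >> 12:
     0x240000000 >> 12 = 0x240000    (start of namespace0.0)
     0x340000000 >> 12 = 0x340000    (end of namespace0.0 / start of namespace0.1)
     0x380000000 >> 12 = 0x380000    (end of namespace0.1)
   As a rough sizing check (assuming a 32-byte struct page_info and an
   8-byte M2P entry; see check_mgmt_size in patch 04), the 0x140000
   managed pages need about 40 bytes each, i.e. roughly 50MB, which
   easily fits in the 1G management range 0x340000 ~ 0x380000.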

7. Start an HVM domain with "vnvdimms=[ '/dev/pmem0' ]" in its xl config.

   If ndctl is installed in the HVM domain, "ndctl list" should be able to
   list a 4G pmem namespace, e.g.
   {
     "dev":"namespace0.0",
     "mode":"raw",
     "size":4294967296,
     "blockdev":"pmem0"
   }
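
   For reference, a minimal sketch of such an xl config (only the
   vnvdimms line comes from this series; the remaining keys are the
   usual xl HVM options and all values here are hypothetical):
     builder  = "hvm"
     name     = "hvm-nvdimm"
     memory   = 2048
     vcpus    = 2
     disk     = [ '/path/to/guest.img,raw,xvda,rw' ]
     vnvdimms = [ '/dev/pmem0' ]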
   

Haozhong Zhang (15):
  xen/common: add Kconfig item for pmem support
  xen: probe pmem regions via ACPI NFIT
  xen/x86: allow customizing locations of extended frametable & M2P
  xen/x86: add XEN_SYSCTL_nvdimm_pmem_setup to setup host pmem
  xen/x86: add XENMEM_populate_pmem_map to map host pmem pages to HVM domain
  tools: reserve guest memory for ACPI from device model
  tools/libacpi: expose the minimum alignment used by mem_ops.alloc
  tools/libacpi: add callback acpi_ctxt.p2v to get a pointer from physical address
  tools/libacpi: add callbacks to access XenStore
  tools/libacpi: add a simple AML builder
  tools/libacpi: load ACPI built by the device model
  tools/libxl: build qemu options from xl vNVDIMM configs
  tools/libxl: add support to map host pmem device to guests
  tools/libxl: initiate pmem mapping via qmp callback
  tools/misc: add xen-ndctl

 .gitignore                              |   1 +
 docs/man/xl.cfg.pod.5.in                |   6 +
 tools/firmware/hvmloader/Makefile       |   3 +-
 tools/firmware/hvmloader/util.c         |  75 ++++++
 tools/firmware/hvmloader/util.h         |  10 +
 tools/firmware/hvmloader/xenbus.c       |  44 +++-
 tools/flask/policy/modules/dom0.te      |   2 +-
 tools/flask/policy/modules/xen.if       |   2 +-
 tools/libacpi/acpi2_0.h                 |   2 +
 tools/libacpi/aml_build.c               | 326 +++++++++++++++++++++++
 tools/libacpi/aml_build.h               | 116 +++++++++
 tools/libacpi/build.c                   | 311 ++++++++++++++++++++++
 tools/libacpi/libacpi.h                 |  21 ++
 tools/libxc/include/xc_dom.h            |   1 +
 tools/libxc/include/xenctrl.h           |  36 +++
 tools/libxc/xc_dom_x86.c                |   7 +
 tools/libxc/xc_domain.c                 |  15 ++
 tools/libxc/xc_misc.c                   |  17 ++
 tools/libxl/Makefile                    |   7 +-
 tools/libxl/libxl_create.c              |   4 +-
 tools/libxl/libxl_dm.c                  | 109 +++++++-
 tools/libxl/libxl_dom.c                 |  22 ++
 tools/libxl/libxl_nvdimm.c              | 182 +++++++++++++
 tools/libxl/libxl_nvdimm.h              |  42 +++
 tools/libxl/libxl_qmp.c                 | 116 ++++++++-
 tools/libxl/libxl_types.idl             |   8 +
 tools/libxl/libxl_x86_acpi.c            |  41 +++
 tools/misc/Makefile                     |   4 +
 tools/misc/xen-ndctl.c                  | 227 ++++++++++++++++
 tools/xl/xl_parse.c                     |  16 ++
 xen/arch/x86/acpi/boot.c                |   4 +
 xen/arch/x86/domain.c                   |   7 +
 xen/arch/x86/sysctl.c                   |  22 ++
 xen/arch/x86/x86_64/mm.c                | 191 ++++++++++++--
 xen/common/Kconfig                      |   9 +
 xen/common/Makefile                     |   1 +
 xen/common/compat/memory.c              |   1 +
 xen/common/domain.c                     |   3 +
 xen/common/memory.c                     |  43 +++
 xen/common/pmem.c                       | 448 ++++++++++++++++++++++++++++++++
 xen/drivers/acpi/Makefile               |   2 +
 xen/drivers/acpi/nfit.c                 | 116 +++++++++
 xen/include/acpi/actbl.h                |   1 +
 xen/include/acpi/actbl1.h               |  42 +++
 xen/include/public/hvm/hvm_xs_strings.h |  11 +
 xen/include/public/memory.h             |  14 +-
 xen/include/public/sysctl.h             |  29 ++-
 xen/include/xen/acpi.h                  |   4 +
 xen/include/xen/pmem.h                  |  66 +++++
 xen/include/xen/sched.h                 |   3 +
 xen/include/xsm/dummy.h                 |  11 +
 xen/include/xsm/xsm.h                   |  12 +
 xen/xsm/dummy.c                         |   4 +
 xen/xsm/flask/hooks.c                   |  17 ++
 xen/xsm/flask/policy/access_vectors     |   4 +
 55 files changed, 2795 insertions(+), 43 deletions(-)
 create mode 100644 tools/libacpi/aml_build.c
 create mode 100644 tools/libacpi/aml_build.h
 create mode 100644 tools/libxl/libxl_nvdimm.c
 create mode 100644 tools/libxl/libxl_nvdimm.h
 create mode 100644 tools/misc/xen-ndctl.c
 create mode 100644 xen/common/pmem.c
 create mode 100644 xen/drivers/acpi/nfit.c
 create mode 100644 xen/include/xen/pmem.h

-- 
2.12.0



* [RFC XEN PATCH v2 01/15] xen/common: add Kconfig item for pmem support
  2017-03-20  0:09 [RFC XEN PATCH v2 00/15] Add vNVDIMM support to HVM domains Haozhong Zhang
@ 2017-03-20  0:09 ` Haozhong Zhang
  2017-03-20  0:09 ` [RFC XEN PATCH v2 02/15] xen: probe pmem regions via ACPI NFIT Haozhong Zhang
                   ` (15 subsequent siblings)
  16 siblings, 0 replies; 32+ messages in thread
From: Haozhong Zhang @ 2017-03-20  0:09 UTC (permalink / raw)
  To: xen-devel
  Cc: Haozhong Zhang, Stefano Stabellini, Wei Liu, George Dunlap,
	Andrew Cooper, Ian Jackson, Tim Deegan, Jan Beulich,
	Konrad Rzeszutek Wilk, Dan Williams

Add CONFIG_PMEM to enable NVDIMM persistent memory support. By
default, it's N.
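
For example, a hypothetical way to enable it from the top-level xen.git
tree (menuconfig being the standard Kconfig target; the prompt text is
the one added below):

  # make -C xen menuconfig        # select "Pmem support"
  # grep CONFIG_PMEM xen/.config
  CONFIG_PMEM=y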

Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
---
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: George Dunlap <George.Dunlap@eu.citrix.com>
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Tim Deegan <tim@xen.org>
Cc: Wei Liu <wei.liu2@citrix.com>
---
 xen/common/Kconfig | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/xen/common/Kconfig b/xen/common/Kconfig
index f2ecbc43d6..f58a47d3a7 100644
--- a/xen/common/Kconfig
+++ b/xen/common/Kconfig
@@ -237,4 +237,13 @@ config FAST_SYMBOL_LOOKUP
 	  The only user of this is Live patching.
 
 	  If unsure, say Y.
+
+config PMEM
+	bool "Pmem support"
+	default n
+	---help---
+	  Enable support for the persistent memory mode NVDIMM.
+
+	  If unsure, say N.
+
 endmenu
-- 
2.12.0



* [RFC XEN PATCH v2 02/15] xen: probe pmem regions via ACPI NFIT
  2017-03-20  0:09 [RFC XEN PATCH v2 00/15] Add vNVDIMM support to HVM domains Haozhong Zhang
  2017-03-20  0:09 ` [RFC XEN PATCH v2 01/15] xen/common: add Kconfig item for pmem support Haozhong Zhang
@ 2017-03-20  0:09 ` Haozhong Zhang
  2017-03-20  0:09 ` [RFC XEN PATCH v2 03/15] xen/x86: allow customizing locations of extended frametable & M2P Haozhong Zhang
                   ` (14 subsequent siblings)
  16 siblings, 0 replies; 32+ messages in thread
From: Haozhong Zhang @ 2017-03-20  0:09 UTC (permalink / raw)
  To: xen-devel
  Cc: Konrad Rzeszutek Wilk, Dan Williams, Andrew Cooper, Jan Beulich,
	Haozhong Zhang

Probe the address ranges of pmem regions via ACPI NFIT and report them
to Xen hypervisor.
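
With this applied, booting Xen on a host whose NFIT describes, say, the
pmem region from the cover letter should log something like the
following in "xl dmesg" (the MFN values are illustrative; the message
format is the one added below):

  (XEN) NFIT: pmem mfn 0x240000 - 0x3c0000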

Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
---
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
---
 xen/arch/x86/acpi/boot.c  |   4 ++
 xen/common/Makefile       |   1 +
 xen/common/pmem.c         | 106 ++++++++++++++++++++++++++++++++++++++++++
 xen/drivers/acpi/Makefile |   2 +
 xen/drivers/acpi/nfit.c   | 116 ++++++++++++++++++++++++++++++++++++++++++++++
 xen/include/acpi/actbl.h  |   1 +
 xen/include/acpi/actbl1.h |  42 +++++++++++++++++
 xen/include/xen/acpi.h    |   4 ++
 xen/include/xen/pmem.h    |  28 +++++++++++
 9 files changed, 304 insertions(+)
 create mode 100644 xen/common/pmem.c
 create mode 100644 xen/drivers/acpi/nfit.c
 create mode 100644 xen/include/xen/pmem.h

diff --git a/xen/arch/x86/acpi/boot.c b/xen/arch/x86/acpi/boot.c
index 33c9133812..83d73868d6 100644
--- a/xen/arch/x86/acpi/boot.c
+++ b/xen/arch/x86/acpi/boot.c
@@ -754,5 +754,9 @@ int __init acpi_boot_init(void)
 
 	acpi_table_parse(ACPI_SIG_BGRT, acpi_invalidate_bgrt);
 
+#ifdef CONFIG_PMEM
+	acpi_nfit_init();
+#endif
+
 	return 0;
 }
diff --git a/xen/common/Makefile b/xen/common/Makefile
index 0fed30bcc6..16c273b6d4 100644
--- a/xen/common/Makefile
+++ b/xen/common/Makefile
@@ -29,6 +29,7 @@ obj-y += notifier.o
 obj-y += page_alloc.o
 obj-$(CONFIG_HAS_PDX) += pdx.o
 obj-$(CONFIG_PERF_COUNTERS) += perfc.o
+obj-$(CONFIG_PMEM) += pmem.o
 obj-y += preempt.o
 obj-y += random.o
 obj-y += rangeset.o
diff --git a/xen/common/pmem.c b/xen/common/pmem.c
new file mode 100644
index 0000000000..3c150cf1dd
--- /dev/null
+++ b/xen/common/pmem.c
@@ -0,0 +1,106 @@
+/*
+ * xen/common/pmem.c
+ *
+ * Copyright (C) 2017, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms and conditions of the GNU General Public
+ * License, version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include <xen/errno.h>
+#include <xen/list.h>
+#include <xen/pmem.h>
+#include <xen/spinlock.h>
+
+/*
+ * All pmem regions probed via SPA range structures of ACPI NFIT are
+ * linked in pmem_regions.
+ */
+static DEFINE_SPINLOCK(pmem_regions_lock);
+static LIST_HEAD(pmem_regions);
+
+struct pmem {
+    struct list_head link;      /* link to pmem_list */
+    unsigned long smfn;         /* start MFN of the whole pmem region */
+    unsigned long emfn;         /* end MFN of the whole pmem region */
+};
+
+static bool check_overlap(unsigned long smfn1, unsigned long emfn1,
+                          unsigned long smfn2, unsigned long emfn2)
+{
+    return smfn1 < emfn2 && smfn2 < emfn1;
+}
+
+static struct pmem *alloc_pmem_struct(unsigned long smfn, unsigned long emfn)
+{
+    struct pmem *pmem = xzalloc(struct pmem);
+
+    if ( !pmem )
+        return NULL;
+
+    pmem->smfn = smfn;
+    pmem->emfn = emfn;
+
+    return pmem;
+}
+
+static int pmem_list_add(struct list_head *list, struct pmem *entry)
+{
+    struct list_head *cur;
+    unsigned long smfn = entry->smfn, emfn = entry->emfn;
+
+    list_for_each_prev(cur, list)
+    {
+        struct pmem *cur_pmem = list_entry(cur, struct pmem, link);
+        unsigned long cur_smfn = cur_pmem->smfn;
+        unsigned long cur_emfn = cur_pmem->emfn;
+
+        if ( check_overlap(smfn, emfn, cur_smfn, cur_emfn) )
+            return -EINVAL;
+
+        if ( cur_smfn < smfn )
+            break;
+    }
+
+    list_add(&entry->link, cur);
+
+    return 0;
+}
+
+/**
+ * Register a pmem region to Xen. It's used by Xen hypervisor to collect
+ * all pmem regions that can be used later.
+ *
+ * Parameters:
+ *  smfn, emfn: start and end MFNs of the pmem region
+ *
+ * Return:
+ *  On success, return 0. Otherwise, an error number is returned.
+ */
+int pmem_register(unsigned long smfn, unsigned long emfn)
+{
+    int rc;
+    struct pmem *pmem;
+
+    if ( smfn >= emfn )
+        return -EINVAL;
+
+    pmem = alloc_pmem_struct(smfn, emfn);
+    if ( !pmem )
+        return -ENOMEM;
+
+    spin_lock(&pmem_regions_lock);
+    rc = pmem_list_add(&pmem_regions, pmem);
+    spin_unlock(&pmem_regions_lock);
+
+    return rc;
+}
diff --git a/xen/drivers/acpi/Makefile b/xen/drivers/acpi/Makefile
index 444b11d583..cef9d90222 100644
--- a/xen/drivers/acpi/Makefile
+++ b/xen/drivers/acpi/Makefile
@@ -9,3 +9,5 @@ obj-$(CONFIG_HAS_CPUFREQ) += pmstat.o
 
 obj-$(CONFIG_X86) += hwregs.o
 obj-$(CONFIG_X86) += reboot.o
+
+obj-$(CONFIG_PMEM) += nfit.o
diff --git a/xen/drivers/acpi/nfit.c b/xen/drivers/acpi/nfit.c
new file mode 100644
index 0000000000..ceac121dd2
--- /dev/null
+++ b/xen/drivers/acpi/nfit.c
@@ -0,0 +1,116 @@
+/*
+ * xen/drivers/acpi/nfit.c
+ *
+ * Copyright (C) 2017, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms and conditions of the GNU General Public
+ * License, version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include <xen/acpi.h>
+#include <xen/errno.h>
+#include <xen/init.h>
+#include <xen/mm.h>
+#include <xen/pfn.h>
+#include <xen/pmem.h>
+
+static struct acpi_table_nfit *nfit __read_mostly = NULL;
+
+/* ACPI 6.1: GUID of a byte addressable persistent memory region */
+static const uint8_t nfit_spa_pmem_uuid[] =
+{
+    0x79, 0xd3, 0xf0, 0x66, 0xf3, 0xb4, 0x74, 0x40,
+    0xac, 0x43, 0x0d, 0x33, 0x18, 0xb7, 0x8c, 0xdb,
+};
+
+/**
+ * Enumerate each sub-table of NFIT.
+ *
+ * For a sub-table of type @type, @parse_cb() (if not NULL) is called
+ * to parse the sub-table. @parse_cb() returns 0 on success, and
+ * returns non-zero error code on errors.
+ *
+ * Parameters:
+ *  nfit:      NFIT
+ *  type:      the type of sub-table that will be parsed
+ *  parse_cb:  the function used to parse each sub-table
+ *  arg:       the argument passed to @parse_cb()
+ *
+ * Return:
+ *  0 on success, non-zero on failure
+ */
+static int acpi_nfit_foreach_subtable(
+    struct acpi_table_nfit *nfit, enum acpi_nfit_type type,
+    int (*parse_cb)(const struct acpi_nfit_header *, void *arg), void *arg)
+{
+    struct acpi_table_header *table = (struct acpi_table_header *)nfit;
+    struct acpi_nfit_header *hdr;
+    uint32_t hdr_offset = sizeof(*nfit);
+    int ret = 0;
+
+    while ( hdr_offset < table->length )
+    {
+        hdr = (void *)nfit + hdr_offset;
+        hdr_offset += hdr->length;
+        if ( hdr->type == type && parse_cb )
+        {
+            ret = parse_cb(hdr, arg);
+            if ( ret )
+                break;
+        }
+    }
+
+    return ret;
+}
+
+static int __init acpi_nfit_spa_probe_pmem(const struct acpi_nfit_header *hdr,
+                                           void *opaque)
+{
+    struct acpi_nfit_system_address *spa =
+        (struct acpi_nfit_system_address *)hdr;
+    unsigned long smfn = paddr_to_pfn(spa->address);
+    unsigned long emfn = paddr_to_pfn(spa->address + spa->length);
+    int rc;
+
+    if ( memcmp(spa->range_guid, nfit_spa_pmem_uuid, 16) )
+        return 0;
+
+    rc = pmem_register(smfn, emfn);
+    if ( rc )
+        printk(XENLOG_ERR
+               "NFIT: failed to add pmem mfns: 0x%lx - 0x%lx, err %d\n",
+               smfn, emfn, rc);
+    else
+        printk(XENLOG_INFO "NFIT: pmem mfn 0x%lx - 0x%lx\n", smfn, emfn);
+
+    /* ignore the error and continue to add the next pmem range */
+    return 0;
+}
+
+void __init acpi_nfit_init(void)
+{
+    acpi_status status;
+    acpi_physical_address nfit_addr;
+    acpi_native_uint nfit_len;
+
+    status = acpi_get_table_phys(ACPI_SIG_NFIT, 0, &nfit_addr, &nfit_len);
+    if ( ACPI_FAILURE(status) )
+         return;
+
+    map_pages_to_xen((unsigned long)__va(nfit_addr), PFN_DOWN(nfit_addr),
+                     PFN_UP(nfit_addr + nfit_len) - PFN_DOWN(nfit_addr),
+                     PAGE_HYPERVISOR);
+    nfit = (struct acpi_table_nfit *)__va(nfit_addr);
+
+    acpi_nfit_foreach_subtable(nfit, ACPI_NFIT_TYPE_SYSTEM_ADDRESS,
+                               acpi_nfit_spa_probe_pmem, NULL);
+}
diff --git a/xen/include/acpi/actbl.h b/xen/include/acpi/actbl.h
index 3079176992..6e113b0873 100644
--- a/xen/include/acpi/actbl.h
+++ b/xen/include/acpi/actbl.h
@@ -71,6 +71,7 @@
 #define ACPI_SIG_XSDT           "XSDT"	/* Extended  System Description Table */
 #define ACPI_SIG_SSDT           "SSDT"	/* Secondary System Description Table */
 #define ACPI_RSDP_NAME          "RSDP"	/* Short name for RSDP, not signature */
+#define ACPI_SIG_NFIT           "NFIT"	/* NVDIMM Firmware Interface Table */
 
 /*
  * All tables and structures must be byte-packed to match the ACPI
diff --git a/xen/include/acpi/actbl1.h b/xen/include/acpi/actbl1.h
index e1991362dc..a59ac11325 100644
--- a/xen/include/acpi/actbl1.h
+++ b/xen/include/acpi/actbl1.h
@@ -905,6 +905,48 @@ struct acpi_msct_proximity {
 
 /*******************************************************************************
  *
+ * NFIT - NVDIMM Firmware Interface Table (ACPI 6.0+)
+ *        Version 1
+ *
+ ******************************************************************************/
+
+struct acpi_table_nfit {
+	struct acpi_table_header header;	/* Common ACPI table header */
+	u32 reserved;						/* Reserved, must be zero */
+};
+
+/* Subtable header for NFIT */
+
+struct acpi_nfit_header {
+	u16 type;
+	u16 length;
+};
+
+/* Values for subtable type in struct acpi_nfit_header */
+
+enum acpi_nfit_type {
+	ACPI_NFIT_TYPE_SYSTEM_ADDRESS = 0,
+};
+
+/*
+ * NFIT Subtables
+ */
+
+/* type 0: System Physical Address Range Structure */
+struct acpi_nfit_system_address {
+	struct acpi_nfit_header header;
+	u16 range_index;
+	u16 flags;
+	u32 reserved;		/* Reserved, must be zero */
+	u32 proximity_domain;
+	u8 range_guid[16];
+	u64 address;
+	u64 length;
+	u64 memory_mapping;
+};
+
+/*******************************************************************************
+ *
  * SBST - Smart Battery Specification Table
  *        Version 1
  *
diff --git a/xen/include/xen/acpi.h b/xen/include/xen/acpi.h
index 30ec0eec5f..8edb9a275e 100644
--- a/xen/include/xen/acpi.h
+++ b/xen/include/xen/acpi.h
@@ -180,4 +180,8 @@ void acpi_reboot(void);
 void acpi_dmar_zap(void);
 void acpi_dmar_reinstate(void);
 
+#ifdef CONFIG_PMEM
+void acpi_nfit_init(void);
+#endif /* CONFIG_PMEM */
+
 #endif /*_LINUX_ACPI_H*/
diff --git a/xen/include/xen/pmem.h b/xen/include/xen/pmem.h
new file mode 100644
index 0000000000..1144e86f98
--- /dev/null
+++ b/xen/include/xen/pmem.h
@@ -0,0 +1,28 @@
+/*
+ * xen/include/xen/pmem.h
+ *
+ * Copyright (C) 2017, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms and conditions of the GNU General Public
+ * License, version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#ifndef __XEN_PMEM_H__
+#define __XEN_PMEM_H__
+#ifdef CONFIG_PMEM
+
+#include <xen/types.h>
+
+int pmem_register(unsigned long smfn, unsigned long emfn);
+
+#endif /* CONFIG_PMEM */
+#endif /* __XEN_PMEM_H__ */
-- 
2.12.0



* [RFC XEN PATCH v2 03/15] xen/x86: allow customizing locations of extended frametable & M2P
  2017-03-20  0:09 [RFC XEN PATCH v2 00/15] Add vNVDIMM support to HVM domains Haozhong Zhang
  2017-03-20  0:09 ` [RFC XEN PATCH v2 01/15] xen/common: add Kconfig item for pmem support Haozhong Zhang
  2017-03-20  0:09 ` [RFC XEN PATCH v2 02/15] xen: probe pmem regions via ACPI NFIT Haozhong Zhang
@ 2017-03-20  0:09 ` Haozhong Zhang
  2017-03-20  0:09 ` [RFC XEN PATCH v2 04/15] xen/x86: add XEN_SYSCTL_nvdimm_pmem_setup to setup host pmem Haozhong Zhang
                   ` (13 subsequent siblings)
  16 siblings, 0 replies; 32+ messages in thread
From: Haozhong Zhang @ 2017-03-20  0:09 UTC (permalink / raw)
  To: xen-devel
  Cc: Konrad Rzeszutek Wilk, Dan Williams, Andrew Cooper, Jan Beulich,
	Haozhong Zhang

Xen is not aware of which portions of pmem can be used to store its
frametable and M2P table. Instead, it relies on users or system
admins in Dom0 to specify the location. For regular RAM, no
functional change is introduced.

Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
---
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>

Changes in v2:
 * Merge v1 patch 1 (for frametable) and v1 patch 2 (for M2P).
 * Add const to some parameters.
 * Explain new parameters of extend_frame_table() and setup_m2p_table().
---
 xen/arch/x86/x86_64/mm.c | 80 ++++++++++++++++++++++++++++++++++++------------
 1 file changed, 61 insertions(+), 19 deletions(-)

diff --git a/xen/arch/x86/x86_64/mm.c b/xen/arch/x86/x86_64/mm.c
index 34f3250fd7..0f1ceacc6a 100644
--- a/xen/arch/x86/x86_64/mm.c
+++ b/xen/arch/x86/x86_64/mm.c
@@ -111,15 +111,27 @@ int hotadd_mem_valid(unsigned long pfn, struct mem_hotadd_info *info)
     return (pfn < info->epfn && pfn >= info->spfn);
 }
 
+/*
+ * Allocate pages in the PFN range from info->spfn to info->epfn. The
+ * first free page is indicated by info->cur. The allocation unit is
+ * (1 << PAGETABLE_ORDER) pages.
+ *
+ * On success, return PFN of the first allocated page. Otherwise, return
+ * mfn_x(INVALID_MFN).
+ */
+typedef unsigned long (*mfns_alloc_fn_t)(struct mem_hotadd_info *info);
+
 static unsigned long alloc_hotadd_mfn(struct mem_hotadd_info *info)
 {
-    unsigned mfn;
+    unsigned long mfn;
 
-    ASSERT((info->cur + ( 1UL << PAGETABLE_ORDER) < info->epfn) &&
-            info->cur >= info->spfn);
+    if ( (info->cur + (1UL << PAGETABLE_ORDER) >= info->epfn) ||
+         info->cur < info->spfn )
+        return mfn_x(INVALID_MFN);
 
     mfn = info->cur;
     info->cur += (1UL << PAGETABLE_ORDER);
+
     return mfn;
 }
 
@@ -313,11 +325,13 @@ void destroy_m2p_mapping(struct mem_hotadd_info *info)
 }
 
 /*
- * Allocate and map the compatibility mode machine-to-phys table.
- * spfn/epfn: the pfn ranges to be setup
- * free_s/free_e: the pfn ranges that is free still
+ * Allocate and map the compatibility mode machine-to-phys table for
+ * pages info->spfn ~ info->epfn. M2P is placed in pages allocated
+ * by alloc_fn from the range alloc_info->cur ~ alloc_info->epfn.
  */
-static int setup_compat_m2p_table(struct mem_hotadd_info *info)
+static int setup_compat_m2p_table(const struct mem_hotadd_info *info,
+                                  mfns_alloc_fn_t alloc_fn,
+                                  struct mem_hotadd_info *alloc_info)
 {
     unsigned long i, va, smap, emap, rwva, epfn = info->epfn, mfn;
     unsigned int n;
@@ -371,7 +385,12 @@ static int setup_compat_m2p_table(struct mem_hotadd_info *info)
         if ( n == CNT )
             continue;
 
-        mfn = alloc_hotadd_mfn(info);
+        mfn = alloc_fn(alloc_info);
+        if ( mfn == mfn_x(INVALID_MFN) )
+        {
+            err = -ENOMEM;
+            break;
+        }
         err = map_pages_to_xen(rwva, mfn, 1UL << PAGETABLE_ORDER,
                                PAGE_HYPERVISOR);
         if ( err )
@@ -389,9 +408,14 @@ static int setup_compat_m2p_table(struct mem_hotadd_info *info)
 
 /*
  * Allocate and map the machine-to-phys table.
- * The L3 for RO/RWRW MPT and the L2 for compatible MPT should be setup already
+ * The L3 for RO/RWRW MPT and the L2 for compatible MPT should be setup already.
+ *
+ * M2P is placed in pages allocated by alloc_fn from the range
+ * alloc_info->cur ~ alloc_info->epfn.
  */
-static int setup_m2p_table(struct mem_hotadd_info *info)
+static int setup_m2p_table(const struct mem_hotadd_info *info,
+                           mfns_alloc_fn_t alloc_fn,
+                           struct mem_hotadd_info *alloc_info)
 {
     unsigned long i, va, smap, emap;
     unsigned int n;
@@ -440,7 +464,13 @@ static int setup_m2p_table(struct mem_hotadd_info *info)
                 break;
         if ( n < CNT )
         {
-            unsigned long mfn = alloc_hotadd_mfn(info);
+            unsigned long mfn = alloc_fn(alloc_info);
+
+            if ( mfn == mfn_x(INVALID_MFN) )
+            {
+                ret = -ENOMEM;
+                goto error;
+            }
 
             ret = map_pages_to_xen(
                         RDWR_MPT_VIRT_START + i * sizeof(unsigned long),
@@ -485,7 +515,7 @@ static int setup_m2p_table(struct mem_hotadd_info *info)
 #undef CNT
 #undef MFN
 
-    ret = setup_compat_m2p_table(info);
+    ret = setup_compat_m2p_table(info, alloc_fn, alloc_info);
 error:
     return ret;
 }
@@ -769,7 +799,8 @@ void cleanup_frame_table(struct mem_hotadd_info *info)
 }
 
 static int setup_frametable_chunk(void *start, void *end,
-                                  struct mem_hotadd_info *info)
+                                  mfns_alloc_fn_t alloc_fn,
+                                  struct mem_hotadd_info *alloc_info)
 {
     unsigned long s = (unsigned long)start;
     unsigned long e = (unsigned long)end;
@@ -781,7 +812,9 @@ static int setup_frametable_chunk(void *start, void *end,
 
     for ( ; s < e; s += (1UL << L2_PAGETABLE_SHIFT))
     {
-        mfn = alloc_hotadd_mfn(info);
+        mfn = alloc_fn(alloc_info);
+        if ( mfn == mfn_x(INVALID_MFN) )
+            return -ENOMEM;
         err = map_pages_to_xen(s, mfn, 1UL << PAGETABLE_ORDER,
                                PAGE_HYPERVISOR);
         if ( err )
@@ -792,7 +825,14 @@ static int setup_frametable_chunk(void *start, void *end,
     return 0;
 }
 
-static int extend_frame_table(struct mem_hotadd_info *info)
+/*
+ * Create and map the frame table for page ranges info->spfn ~
+ * info->epfn. The frame table is placed in pages allocated by
+ * alloc_fn from page range alloc_info->cur ~ alloc_info->epfn.
+ */
+static int extend_frame_table(const struct mem_hotadd_info *info,
+                              mfns_alloc_fn_t alloc_fn,
+                              struct mem_hotadd_info *alloc_info)
 {
     unsigned long cidx, nidx, eidx, spfn, epfn;
 
@@ -818,9 +858,9 @@ static int extend_frame_table(struct mem_hotadd_info *info)
         nidx = find_next_bit(pdx_group_valid, eidx, cidx);
         if ( nidx >= eidx )
             nidx = eidx;
-        err = setup_frametable_chunk(pdx_to_page(cidx * PDX_GROUP_COUNT ),
+        err = setup_frametable_chunk(pdx_to_page(cidx * PDX_GROUP_COUNT),
                                      pdx_to_page(nidx * PDX_GROUP_COUNT),
-                                     info);
+                                     alloc_fn, alloc_info);
         if ( err )
             return err;
 
@@ -1422,7 +1462,8 @@ int memory_add(unsigned long spfn, unsigned long epfn, unsigned int pxm)
     info.epfn = epfn;
     info.cur = spfn;
 
-    ret = extend_frame_table(&info);
+    /* Place the frame table at the beginning of hotplugged memory. */
+    ret = extend_frame_table(&info, alloc_hotadd_mfn, &info);
     if (ret)
         goto destroy_frametable;
 
@@ -1435,7 +1476,8 @@ int memory_add(unsigned long spfn, unsigned long epfn, unsigned int pxm)
     total_pages += epfn - spfn;
 
     set_pdx_range(spfn, epfn);
-    ret = setup_m2p_table(&info);
+    /* Place M2P in the hotplugged memory after the frame table. */
+    ret = setup_m2p_table(&info, alloc_hotadd_mfn, &info);
 
     if ( ret )
         goto destroy_m2p;
-- 
2.12.0



* [RFC XEN PATCH v2 04/15] xen/x86: add XEN_SYSCTL_nvdimm_pmem_setup to setup host pmem
  2017-03-20  0:09 [RFC XEN PATCH v2 00/15] Add vNVDIMM support to HVM domains Haozhong Zhang
                   ` (2 preceding siblings ...)
  2017-03-20  0:09 ` [RFC XEN PATCH v2 03/15] xen/x86: allow customizing locations of extended frametable & M2P Haozhong Zhang
@ 2017-03-20  0:09 ` Haozhong Zhang
  2017-03-20  0:09 ` [RFC XEN PATCH v2 05/15] xen/x86: add XENMEM_populate_pmem_map to map host pmem pages to HVM domain Haozhong Zhang
                   ` (12 subsequent siblings)
  16 siblings, 0 replies; 32+ messages in thread
From: Haozhong Zhang @ 2017-03-20  0:09 UTC (permalink / raw)
  To: xen-devel
  Cc: Haozhong Zhang, Wei Liu, Andrew Cooper, Ian Jackson, Jan Beulich,
	Konrad Rzeszutek Wilk, Dan Williams, Daniel De Graaf

The Xen hypervisor is not aware of which portions of pmem can be used
to store the frame table and M2P table of pmem. Instead, it provides
users or admins in Dom0 with a sysctl XEN_SYSCTL_nvdimm_pmem_setup to
specify the location.

XEN_SYSCTL_nvdimm_pmem_setup receives four arguments: data_smfn,
data_emfn, mgmt_smfn and mgmt_emfn.
 - data_smfn and data_emfn specify the start and end MFN of a host pmem
   region that can be used by guest.
 - If neither mgmt_smfn nor mgmt_emfn is INVALID_MFN, the host pmem
   pages from mgmt_smfn to mgmt_emfn will be used to store the
   frametable and M2P table of the pmem region data_smfn ~ data_emfn
   and itself. data_smfn ~ data_emfn and mgmt_smfn ~ mgmt_emfn should
   not overlap with each other.
 - If either mgmt_smfn or mgmt_emfn is INVALID_MFN, Xen hypervisor will
   store the frametable and M2P table of the pmem region data_smfn ~
   data_emfn in the regular RAM.

XEN_SYSCTL_nvdimm_pmem_setup currently only works on x86, and returns
-ENOSYS on other architectures.
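
For illustration, a minimal sketch (not part of this series) of how a
Dom0 tool, such as the xen-ndctl utility added later in the series,
might drive the new sysctl through the libxc wrapper below; the MFN
values are the ones from the cover letter and purely an example:

  #include <xenctrl.h>

  int setup_pmem_example(void)
  {
      xc_interface *xch = xc_interface_open(NULL, NULL, 0);
      int rc;

      if ( !xch )
          return -1;

      /* Use pmem MFNs 0x380000 ~ 0x3c0000 to hold the frame table and
       * M2P of the data pmem MFNs 0x240000 ~ 0x380000 (and of itself). */
      rc = xc_nvdimm_pmem_setup(xch, 0x240000, 0x380000,
                                0x380000, 0x3c0000);

      xc_interface_close(xch);
      return rc;
  }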

Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
---
Cc: Daniel De Graaf <dgdegra@tycho.nsa.gov>
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Wei Liu <wei.liu2@citrix.com>
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>

Changes in v2:
 * Convert XENPF_pmem_add in v1 to XEN_SYSCTL_nvdimm_pmem_setup in v2. v2 patch
   series relies on users/admins in Dom0 instead of Dom0 driver to indicate the
   location to store the frametable and M2P of pmem.
 * Separate the architecture-independent and -dependent code into pmem_setup and
   pmem_arch_setup. Currently, only pmem_arch_setup on x86 is implemented, while
   it returns -ENOSYS on other architectures.
 * Add XSM check for XEN_SYSCTL_nvdimm_pmem_setup.
---
 tools/flask/policy/modules/dom0.te  |   2 +-
 tools/libxc/include/xenctrl.h       |  19 ++++
 tools/libxc/xc_misc.c               |  17 ++++
 tools/misc/xen-ndctl.c              |   0
 xen/arch/x86/sysctl.c               |  22 ++++
 xen/arch/x86/x86_64/mm.c            | 111 ++++++++++++++++++++
 xen/common/pmem.c                   | 197 ++++++++++++++++++++++++++++++++++++
 xen/include/public/sysctl.h         |  29 +++++-
 xen/include/xen/pmem.h              |  14 +++
 xen/xsm/flask/hooks.c               |   4 +
 xen/xsm/flask/policy/access_vectors |   2 +
 11 files changed, 415 insertions(+), 2 deletions(-)
 create mode 100644 tools/misc/xen-ndctl.c

diff --git a/tools/flask/policy/modules/dom0.te b/tools/flask/policy/modules/dom0.te
index d0a4d91ac0..d31c514550 100644
--- a/tools/flask/policy/modules/dom0.te
+++ b/tools/flask/policy/modules/dom0.te
@@ -16,7 +16,7 @@ allow dom0_t xen_t:xen {
 allow dom0_t xen_t:xen2 {
 	resource_op psr_cmt_op psr_cat_op pmu_ctrl get_symbol
 	get_cpu_levelling_caps get_cpu_featureset livepatch_op
-	gcov_op
+	gcov_op nvdimm_op
 };
 
 # Allow dom0 to use all XENVER_ subops that have checks.
diff --git a/tools/libxc/include/xenctrl.h b/tools/libxc/include/xenctrl.h
index a48981abea..d4e3002c9e 100644
--- a/tools/libxc/include/xenctrl.h
+++ b/tools/libxc/include/xenctrl.h
@@ -2534,6 +2534,25 @@ int xc_livepatch_replace(xc_interface *xch, char *name, uint32_t timeout);
 int xc_domain_cacheflush(xc_interface *xch, uint32_t domid,
                          xen_pfn_t start_pfn, xen_pfn_t nr_pfns);
 
+/*
+ * Query Xen hypervisor to prepare for mapping host pmem pages.
+ *
+ * Parameters:
+ *  xch:       xc interface handler
+ *  smfn:      the start MFN of the host pmem pages to be mapped
+ *  emfn:      the end MFN of the host pmem pages to be mapped
+ *  mgmt_smfn: If not INVALID_MFN, the start MFN of host pmem pages for managing
+ *             above pmem pages
+ *  mgmt_emfn: If not INVALID_MFN, the end MFN of host pmem pages for managing
+ *             above pmem pages
+ *
+ * Return:
+ *  0 on success; non-zero error code on failures.
+ */
+int xc_nvdimm_pmem_setup(xc_interface *xch,
+                         unsigned long smfn, unsigned long emfn,
+                         unsigned long mgmt_smfn, unsigned long mgmt_emfn);
+
 /* Compat shims */
 #include "xenctrl_compat.h"
 
diff --git a/tools/libxc/xc_misc.c b/tools/libxc/xc_misc.c
index 88084fde30..0384f45b8e 100644
--- a/tools/libxc/xc_misc.c
+++ b/tools/libxc/xc_misc.c
@@ -817,6 +817,23 @@ int xc_livepatch_replace(xc_interface *xch, char *name, uint32_t timeout)
     return _xc_livepatch_action(xch, name, LIVEPATCH_ACTION_REPLACE, timeout);
 }
 
+int xc_nvdimm_pmem_setup(xc_interface *xch,
+                         unsigned long smfn, unsigned long emfn,
+                         unsigned long mgmt_smfn, unsigned long mgmt_emfn)
+{
+    DECLARE_SYSCTL;
+
+    sysctl.cmd = XEN_SYSCTL_nvdimm_op;
+    sysctl.u.nvdimm.cmd = XEN_SYSCTL_nvdimm_pmem_setup;
+    sysctl.u.nvdimm.pad = 0;
+    sysctl.u.nvdimm.u.setup.smfn = smfn;
+    sysctl.u.nvdimm.u.setup.emfn = emfn;
+    sysctl.u.nvdimm.u.setup.mgmt_smfn = mgmt_smfn;
+    sysctl.u.nvdimm.u.setup.mgmt_emfn = mgmt_emfn;
+
+    return do_sysctl(xch, &sysctl);
+}
+
 /*
  * Local variables:
  * mode: C
diff --git a/tools/misc/xen-ndctl.c b/tools/misc/xen-ndctl.c
new file mode 100644
index 0000000000..e69de29bb2
diff --git a/xen/arch/x86/sysctl.c b/xen/arch/x86/sysctl.c
index 2f7056e816..62f2980840 100644
--- a/xen/arch/x86/sysctl.c
+++ b/xen/arch/x86/sysctl.c
@@ -19,6 +19,7 @@
 #include <xen/trace.h>
 #include <xen/console.h>
 #include <xen/iocap.h>
+#include <xen/pmem.h>
 #include <asm/irq.h>
 #include <asm/hvm/hvm.h>
 #include <asm/hvm/support.h>
@@ -250,6 +251,27 @@ long arch_do_sysctl(
         break;
     }
 
+#ifdef CONFIG_PMEM
+    case XEN_SYSCTL_nvdimm_op:
+    {
+        xen_sysctl_nvdimm_pmem_setup_t *setup;
+
+        switch ( sysctl->u.nvdimm.cmd )
+        {
+        case XEN_SYSCTL_nvdimm_pmem_setup:
+            setup = &sysctl->u.nvdimm.u.setup;
+            ret = pmem_setup(setup->smfn, setup->emfn,
+                             setup->mgmt_smfn, setup->mgmt_emfn);
+            break;
+
+        default:
+            ret = -ENOSYS;
+        }
+
+        break;
+    }
+#endif /* CONFIG_PMEM */
+
     default:
         ret = -ENOSYS;
         break;
diff --git a/xen/arch/x86/x86_64/mm.c b/xen/arch/x86/x86_64/mm.c
index 0f1ceacc6a..77f51f399a 100644
--- a/xen/arch/x86/x86_64/mm.c
+++ b/xen/arch/x86/x86_64/mm.c
@@ -27,6 +27,7 @@ asm(".file \"" __FILE__ "\"");
 #include <xen/guest_access.h>
 #include <xen/hypercall.h>
 #include <xen/mem_access.h>
+#include <xen/pmem.h>
 #include <asm/current.h>
 #include <asm/asm_defns.h>
 #include <asm/page.h>
@@ -1522,6 +1523,116 @@ destroy_frametable:
     return ret;
 }
 
+#ifdef CONFIG_PMEM
+
+static unsigned long pmem_alloc_from_ram(struct mem_hotadd_info *unused)
+{
+    unsigned long mfn = mfn_x(INVALID_MFN);
+    struct page_info *page = alloc_domheap_pages(NULL, PAGETABLE_ORDER, 0);
+
+    if ( page )
+        mfn = page_to_mfn(page);
+
+    return mfn;
+}
+
+static int pmem_setup_m2p_table(const struct mem_hotadd_info *info,
+                                mfns_alloc_fn_t alloc_fn,
+                                struct mem_hotadd_info *alloc_info)
+{
+    unsigned long smfn = info->spfn;
+    unsigned long emfn = info->epfn;
+
+    if ( max_page < emfn )
+    {
+        max_page = emfn;
+        max_pdx = pfn_to_pdx(max_page - 1) + 1;
+    }
+    total_pages += emfn - smfn;
+
+    set_pdx_range(smfn, emfn);
+
+    return setup_m2p_table(info, alloc_fn, alloc_info);
+}
+
+int pmem_arch_setup(unsigned long data_smfn, unsigned long data_emfn,
+                    unsigned long mgmt_smfn, unsigned long mgmt_emfn)
+{
+    int ret;
+    unsigned long old_max_mgmt = max_page, old_total_mgmt = total_pages;
+    unsigned long old_max_data, old_total_data;
+    bool mgmt_in_pmem = (mgmt_smfn != mfn_x(INVALID_MFN) &&
+                         mgmt_emfn != mfn_x(INVALID_MFN));
+    mfns_alloc_fn_t alloc_fn = pmem_alloc_from_ram;
+    struct mem_hotadd_info *alloc_info = NULL;
+    struct mem_hotadd_info data_info =
+        { .spfn = data_smfn, .epfn = data_emfn, .cur = data_smfn };
+    struct mem_hotadd_info mgmt_info =
+        { .spfn = mgmt_smfn, .epfn = mgmt_emfn, .cur = mgmt_smfn };
+
+    if ( !mem_hotadd_check(data_smfn, data_emfn) )
+        return -EINVAL;
+
+    if ( mgmt_in_pmem )
+    {
+        if ( !mem_hotadd_check(mgmt_smfn, mgmt_emfn) )
+            return -EINVAL;
+
+        alloc_fn = alloc_hotadd_mfn;
+        alloc_info = &mgmt_info;
+
+        ret = extend_frame_table(&mgmt_info, alloc_fn, alloc_info);
+        if ( ret )
+            goto destroy_frametable_mgmt;
+
+        ret = pmem_setup_m2p_table(&mgmt_info, alloc_fn, alloc_info);
+        if ( ret )
+            goto destroy_m2p_mgmt;
+    }
+
+    ret = extend_frame_table(&data_info, alloc_fn, alloc_info);
+    if ( ret )
+        goto destroy_frametable_data;
+
+    old_max_data = max_page;
+    old_total_data = total_pages;
+    ret = pmem_setup_m2p_table(&data_info, alloc_fn, alloc_info);
+    if ( ret )
+        goto destroy_m2p_data;
+
+    share_hotadd_m2p_table(&data_info);
+    if ( mgmt_in_pmem )
+        share_hotadd_m2p_table(&mgmt_info);
+
+    return 0;
+
+destroy_m2p_data:
+    destroy_m2p_mapping(&data_info);
+    max_page = old_max_data;
+    total_pages = old_total_data;
+    max_pdx = pfn_to_pdx(max_page - 1) + 1;
+
+destroy_frametable_data:
+    cleanup_frame_table(&data_info);
+
+destroy_m2p_mgmt:
+    if ( mgmt_in_pmem )
+    {
+        destroy_m2p_mapping(&mgmt_info);
+        max_page = old_max_mgmt;
+        total_pages = old_total_mgmt;
+        max_pdx = pfn_to_pdx(max_page - 1) + 1;
+    }
+
+destroy_frametable_mgmt:
+    if ( mgmt_in_pmem )
+        cleanup_frame_table(&mgmt_info);
+
+    return ret;
+}
+
+#endif /* CONFIG_PMEM */
+
 #include "compat/mm.c"
 
 /*
diff --git a/xen/common/pmem.c b/xen/common/pmem.c
index 3c150cf1dd..0e2d66f94c 100644
--- a/xen/common/pmem.c
+++ b/xen/common/pmem.c
@@ -18,6 +18,7 @@
 
 #include <xen/errno.h>
 #include <xen/list.h>
+#include <xen/mm.h>
 #include <xen/pmem.h>
 #include <xen/spinlock.h>
 
@@ -28,10 +29,33 @@
 static DEFINE_SPINLOCK(pmem_regions_lock);
 static LIST_HEAD(pmem_regions);
 
+/*
+ * Two types of pmem regions are linked in this list and are
+ * distinguished by their ready flags.
+ * - Data pmem regions that can be mapped to guest, and their ready
+ *   flags are true.
+ * - Management pmem regions that are used to manage data regions
+ *   and never mapped to guest, and their ready flags are false.
+ *
+ * All regions linked in this list must be covered by one or multiple
+ * regions in list pmem_regions as well.
+ */
+static DEFINE_SPINLOCK(pmem_gregions_lock);
+static LIST_HEAD(pmem_gregions);
+
 struct pmem {
     struct list_head link;      /* link to pmem_list */
     unsigned long smfn;         /* start MFN of the whole pmem region */
     unsigned long emfn;         /* end MFN of the whole pmem region */
+
+    /*
+     * If frametable and M2P of this pmem region is stored in the
+     * regular RAM, mgmt will be NULL. Otherwise, it refers to another
+     * pmem region used for those management structures.
+     */
+    struct pmem *mgmt;
+
+    bool ready;                 /* indicate whether it can be mapped to guest */
 };
 
 static bool check_overlap(unsigned long smfn1, unsigned long emfn1,
@@ -76,6 +100,82 @@ static int pmem_list_add(struct list_head *list, struct pmem *entry)
     return 0;
 }
 
+static void pmem_list_remove(struct pmem *entry)
+{
+    list_del(&entry->link);
+}
+
+static struct pmem *get_first_overlap(const struct list_head *list,
+                                      unsigned long smfn, unsigned long emfn)
+{
+    struct list_head *cur;
+    struct pmem *overlap = NULL;
+
+    list_for_each(cur, list)
+    {
+        struct pmem *cur_pmem = list_entry(cur, struct pmem, link);
+        unsigned long cur_smfn = cur_pmem->smfn;
+        unsigned long cur_emfn = cur_pmem->emfn;
+
+        if ( emfn <= cur_smfn )
+            break;
+
+        if ( check_overlap(smfn, emfn, cur_smfn, cur_emfn) )
+        {
+            overlap = cur_pmem;
+            break;
+        }
+    }
+
+    return overlap;
+}
+
+static bool pmem_list_covered(const struct list_head *list,
+                              unsigned long smfn, unsigned long emfn)
+{
+    struct pmem *overlap;
+    bool covered = false;
+
+    do {
+        overlap = get_first_overlap(list, smfn, emfn);
+
+        if ( !overlap || smfn < overlap->smfn )
+            break;
+
+        if ( emfn <= overlap->emfn )
+        {
+            covered = true;
+            break;
+        }
+
+        smfn = overlap->emfn;
+        list = &overlap->link;
+    } while ( list );
+
+    return covered;
+}
+
+static bool check_mgmt_size(unsigned long mgmt_mfns, unsigned long total_mfns)
+{
+    return mgmt_mfns >=
+        ((sizeof(struct page_info) * total_mfns) >> PAGE_SHIFT) +
+        ((sizeof(*machine_to_phys_mapping) * total_mfns) >> PAGE_SHIFT);
+}
+
+static bool check_region(unsigned long smfn, unsigned long emfn)
+{
+    bool rc;
+
+    if ( smfn >= emfn )
+        return false;
+
+    spin_lock(&pmem_regions_lock);
+    rc = pmem_list_covered(&pmem_regions, smfn, emfn);
+    spin_unlock(&pmem_regions_lock);
+
+    return rc;
+}
+
 /**
  * Register a pmem region to Xen. It's used by Xen hypervisor to collect
  * all pmem regions that can be used later.
@@ -104,3 +204,100 @@ int pmem_register(unsigned long smfn, unsigned long emfn)
 
     return rc;
 }
+
+/**
+ * Setup a data pmem region that can be used by guest later. A
+ * separate pmem region, or the management region, can be specified to
+ * store the frametable and M2P tables of the data pmem region.
+ *
+ * Parameters:
+ *  data_smfn/_emfn: start and end MFNs of the data pmem region
+ *  mgmt_smfn/_emfn: If not mfn_x(INVALID_MFN), then the pmem region from
+ *                   mgmt_smfn to mgmt_emfn will be used for the frametable and
+ *                   M2P of itself and the data pmem region. Otherwise, the
+ *                   regular RAM will be used.
+ *
+ * Return:
+ *  On success, return 0. Otherwise, an error number will be returned.
+ */
+int pmem_setup(unsigned long data_smfn, unsigned long data_emfn,
+               unsigned long mgmt_smfn, unsigned long mgmt_emfn)
+{
+    int rc = 0;
+    bool mgmt_in_pmem = mgmt_smfn != mfn_x(INVALID_MFN) &&
+                        mgmt_emfn != mfn_x(INVALID_MFN);
+    struct pmem *pmem, *mgmt = NULL;
+    unsigned long mgmt_mfns = mgmt_emfn - mgmt_smfn;
+    unsigned long total_mfns = data_emfn - data_smfn + mgmt_mfns;
+    unsigned long i;
+    struct page_info *pg;
+
+    if ( !check_region(data_smfn, data_emfn) )
+        return -EINVAL;
+
+    if ( mgmt_in_pmem &&
+         (!check_region(mgmt_smfn, mgmt_emfn) ||
+          !check_mgmt_size(mgmt_mfns, total_mfns)) )
+        return -EINVAL;
+
+    pmem = alloc_pmem_struct(data_smfn, data_emfn);
+    if ( !pmem )
+        return -ENOMEM;
+    if ( mgmt_in_pmem )
+    {
+        mgmt = alloc_pmem_struct(mgmt_smfn, mgmt_emfn);
+        if ( !mgmt )
+            return -ENOMEM;
+    }
+
+    spin_lock(&pmem_gregions_lock);
+    rc = pmem_list_add(&pmem_gregions, pmem);
+    if ( rc )
+    {
+        spin_unlock(&pmem_gregions_lock);
+        goto out;
+    }
+    if ( mgmt_in_pmem )
+    {
+        rc = pmem_list_add(&pmem_gregions, mgmt);
+        if ( rc )
+        {
+            spin_unlock(&pmem_gregions_lock);
+            goto out_remove_pmem;
+        }
+    }
+    spin_unlock(&pmem_gregions_lock);
+
+    rc = pmem_arch_setup(data_smfn, data_emfn, mgmt_smfn, mgmt_emfn);
+    if ( rc )
+        goto out_remove_mgmt;
+
+    for ( i = data_smfn; i < data_emfn; i++ )
+    {
+        pg = mfn_to_page(i);
+        pg->count_info = PGC_state_free;
+    }
+
+    if ( mgmt_in_pmem )
+        pmem->mgmt = mgmt->mgmt = mgmt;
+    /* As mgmt is never mapped to guest, we do not set its ready flag. */
+    pmem->ready = true;
+
+    return 0;
+
+ out_remove_mgmt:
+    if ( mgmt )
+    {
+        spin_lock(&pmem_gregions_lock);
+        pmem_list_remove(mgmt);
+        spin_unlock(&pmem_gregions_lock);
+        xfree(mgmt);
+    }
+ out_remove_pmem:
+    spin_lock(&pmem_gregions_lock);
+    pmem_list_remove(pmem);
+    spin_unlock(&pmem_gregions_lock);
+    xfree(pmem);
+ out:
+    return rc;
+}
diff --git a/xen/include/public/sysctl.h b/xen/include/public/sysctl.h
index 00f5e77d91..27831a5e4f 100644
--- a/xen/include/public/sysctl.h
+++ b/xen/include/public/sysctl.h
@@ -36,7 +36,7 @@
 #include "physdev.h"
 #include "tmem.h"
 
-#define XEN_SYSCTL_INTERFACE_VERSION 0x0000000F
+#define XEN_SYSCTL_INTERFACE_VERSION 0x00000010
 
 /*
  * Read console content from Xen buffer ring.
@@ -1088,6 +1088,31 @@ struct xen_sysctl_livepatch_op {
 typedef struct xen_sysctl_livepatch_op xen_sysctl_livepatch_op_t;
 DEFINE_XEN_GUEST_HANDLE(xen_sysctl_livepatch_op_t);
 
+#define XEN_SYSCTL_nvdimm_pmem_setup 0
+struct xen_sysctl_nvdimm_pmem_setup {
+    /* IN variables */
+    uint64_t smfn;      /* start MFN of the pmem region                  */
+    uint64_t emfn;      /* end MFN of the pmem region                    */
+    uint64_t mgmt_smfn; /* If not INVALID_MFN, start MFN of another pmem */
+                        /* region that will be used to manage above pmem */
+                        /* region.                                       */
+    uint64_t mgmt_emfn; /* If not INVALID_MFN, end MFN of another pmem   */
+                        /* region that will be used to manage above pmem */
+                        /* region. */
+};
+typedef struct xen_sysctl_nvdimm_pmem_setup xen_sysctl_nvdimm_pmem_setup_t;
+DEFINE_XEN_GUEST_HANDLE(xen_sysctl_nvdimm_pmem_setup_t);
+
+struct xen_sysctl_nvdimm_op {
+    uint32_t cmd; /* IN: XEN_SYSCTL_NVDIMM_*. */
+    uint32_t pad; /* IN: Always zero. */
+    union {
+        xen_sysctl_nvdimm_pmem_setup_t setup;
+    } u;
+};
+typedef struct xen_sysctl_nvdimm_op xen_sysctl_nvdimm_op_t;
+DEFINE_XEN_GUEST_HANDLE(xen_sysctl_nvdimm_op_t);
+
 struct xen_sysctl {
     uint32_t cmd;
 #define XEN_SYSCTL_readconsole                    1
@@ -1116,6 +1141,7 @@ struct xen_sysctl {
 #define XEN_SYSCTL_get_cpu_levelling_caps        25
 #define XEN_SYSCTL_get_cpu_featureset            26
 #define XEN_SYSCTL_livepatch_op                  27
+#define XEN_SYSCTL_nvdimm_op                     28
     uint32_t interface_version; /* XEN_SYSCTL_INTERFACE_VERSION */
     union {
         struct xen_sysctl_readconsole       readconsole;
@@ -1144,6 +1170,7 @@ struct xen_sysctl {
         struct xen_sysctl_cpu_levelling_caps cpu_levelling_caps;
         struct xen_sysctl_cpu_featureset    cpu_featureset;
         struct xen_sysctl_livepatch_op      livepatch;
+        struct xen_sysctl_nvdimm_op         nvdimm;
         uint8_t                             pad[128];
     } u;
 };
diff --git a/xen/include/xen/pmem.h b/xen/include/xen/pmem.h
index 1144e86f98..95c8207ff6 100644
--- a/xen/include/xen/pmem.h
+++ b/xen/include/xen/pmem.h
@@ -23,6 +23,20 @@
 #include <xen/types.h>
 
 int pmem_register(unsigned long smfn, unsigned long emfn);
+int pmem_setup(unsigned long data_spfn, unsigned long data_emfn,
+               unsigned long mgmt_smfn, unsigned long mgmt_emfn);
+
+#ifdef CONFIG_X86
+int pmem_arch_setup(unsigned long data_smfn, unsigned long data_emfn,
+                    unsigned long mgmt_smfn, unsigned long mgmt_emfn);
+#else /* !CONFIG_X86 */
+static inline int
+pmem_arch_setup(unsigned long data_smfn, unsigned long data_emfn,
+                unsigned long mgmt_smfn, unsigned long mgmt_emfn)
+{
+    return -ENOSYS;
+}
+#endif /* CONFIG_X86 */
 
 #endif /* CONFIG_PMEM */
 #endif /* __XEN_PMEM_H__ */
diff --git a/xen/xsm/flask/hooks.c b/xen/xsm/flask/hooks.c
index 4baed39890..e3c77bbe3f 100644
--- a/xen/xsm/flask/hooks.c
+++ b/xen/xsm/flask/hooks.c
@@ -826,6 +826,10 @@ static int flask_sysctl(int cmd)
         return avc_current_has_perm(SECINITSID_XEN, SECCLASS_XEN2,
                                     XEN2__GCOV_OP, NULL);
 
+    case XEN_SYSCTL_nvdimm_op:
+        return avc_current_has_perm(SECINITSID_XEN, SECCLASS_XEN2,
+                                    XEN2__NVDIMM_OP, NULL);
+
     default:
         return avc_unknown_permission("sysctl", cmd);
     }
diff --git a/xen/xsm/flask/policy/access_vectors b/xen/xsm/flask/policy/access_vectors
index 1f7eb35fc8..a8ddd7ca84 100644
--- a/xen/xsm/flask/policy/access_vectors
+++ b/xen/xsm/flask/policy/access_vectors
@@ -101,6 +101,8 @@ class xen2
     livepatch_op
 # XEN_SYSCTL_gcov_op
     gcov_op
+# XEN_SYSCTL_nvdimm_op
+    nvdimm_op
 }
 
 # Classes domain and domain2 consist of operations that a domain performs on
-- 
2.12.0



* [RFC XEN PATCH v2 05/15] xen/x86: add XENMEM_populate_pmem_map to map host pmem pages to HVM domain
  2017-03-20  0:09 [RFC XEN PATCH v2 00/15] Add vNVDIMM support to HVM domains Haozhong Zhang
                   ` (3 preceding siblings ...)
  2017-03-20  0:09 ` [RFC XEN PATCH v2 04/15] xen/x86: add XEN_SYSCTL_nvdimm_pmem_setup to setup host pmem Haozhong Zhang
@ 2017-03-20  0:09 ` Haozhong Zhang
  2017-03-20  0:09 ` [RFC XEN PATCH v2 06/15] tools: reserve guest memory for ACPI from device model Haozhong Zhang
                   ` (11 subsequent siblings)
  16 siblings, 0 replies; 32+ messages in thread
From: Haozhong Zhang @ 2017-03-20  0:09 UTC (permalink / raw)
  To: xen-devel
  Cc: Haozhong Zhang, Wei Liu, Andrew Cooper, Ian Jackson, Jan Beulich,
	Konrad Rzeszutek Wilk, Dan Williams, Daniel De Graaf

XENMEM_populate_pmem_map is used by the toolstack to map the specified
host pmem pages to the specified guest physical address. Only pmem pages
that have been set up via XEN_SYSCTL_nvdimm_pmem_setup can be mapped
via XENMEM_populate_pmem_map. Because XEN_SYSCTL_nvdimm_pmem_setup only
works on x86, XENMEM_populate_pmem_map is made to work only on x86 as
well and returns -ENOSYS on other architectures.
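
For illustration, a minimal sketch (not part of this series) of how a
toolstack component might call the libxc wrapper added here (libxl adds
such a caller later in the series); the domid, MFNs, GFN and page count
below are hypothetical:

  #include <xenctrl.h>

  /* Map 4G of host pmem (MFNs 0x240000 ~ 0x340000) into the guest
   * starting at guest frame 0x100000. */
  int map_pmem_example(xc_interface *xch, uint32_t domid)
  {
      return xc_domain_populate_pmemmap(xch, domid,
                                        0x240000,   /* host start MFN  */
                                        0x100000,   /* guest start GFN */
                                        0x100000);  /* number of pages */
  }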

Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
---
Cc: Daniel De Graaf <dgdegra@tycho.nsa.gov>
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Wei Liu <wei.liu2@citrix.com>
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>

Changes in v2:
 * Rename *_pmemmap to *_pmem_map.
 * Add XSM check for XENMEM_populate_pmem_map.
 * Add compat code for XENMEM_populate_pmem_map.
 * Add stub for pmem_populate() on non-x86 architectures.
 * Add a check to avoid populating pmem pages to dom0.
 * Merge v1 patch 5 "xen/x86: release pmem pages at domain destroy".
---
 tools/flask/policy/modules/xen.if   |   2 +-
 tools/libxc/include/xenctrl.h       |  17 ++++
 tools/libxc/xc_domain.c             |  15 ++++
 xen/arch/x86/domain.c               |   7 ++
 xen/common/compat/memory.c          |   1 +
 xen/common/domain.c                 |   3 +
 xen/common/memory.c                 |  43 ++++++++++
 xen/common/pmem.c                   | 151 +++++++++++++++++++++++++++++++++++-
 xen/include/public/memory.h         |  14 +++-
 xen/include/xen/pmem.h              |  24 ++++++
 xen/include/xen/sched.h             |   3 +
 xen/include/xsm/dummy.h             |  11 +++
 xen/include/xsm/xsm.h               |  12 +++
 xen/xsm/dummy.c                     |   4 +
 xen/xsm/flask/hooks.c               |  13 ++++
 xen/xsm/flask/policy/access_vectors |   2 +
 16 files changed, 317 insertions(+), 5 deletions(-)

diff --git a/tools/flask/policy/modules/xen.if b/tools/flask/policy/modules/xen.if
index ed0df4f010..bc4176c089 100644
--- a/tools/flask/policy/modules/xen.if
+++ b/tools/flask/policy/modules/xen.if
@@ -55,7 +55,7 @@ define(`create_domain_common', `
 			psr_cmt_op psr_cat_op soft_reset };
 	allow $1 $2:security check_context;
 	allow $1 $2:shadow enable;
-	allow $1 $2:mmu { map_read map_write adjust memorymap physmap pinpage mmuext_op updatemp };
+	allow $1 $2:mmu { map_read map_write adjust memorymap physmap pinpage mmuext_op updatemp populate_pmem_map };
 	allow $1 $2:grant setup;
 	allow $1 $2:hvm { cacheattr getparam hvmctl sethvmc
 			setparam nested altp2mhvm altp2mhvm_op dm };
diff --git a/tools/libxc/include/xenctrl.h b/tools/libxc/include/xenctrl.h
index d4e3002c9e..f8a9581506 100644
--- a/tools/libxc/include/xenctrl.h
+++ b/tools/libxc/include/xenctrl.h
@@ -2553,6 +2553,23 @@ int xc_nvdimm_pmem_setup(xc_interface *xch,
                          unsigned long smfn, unsigned long emfn,
                          unsigned long mgmt_smfn, unsigned long mgmt_emfn);
 
+/*
+ * Map host pmem pages to a domain.
+ *
+ * Parameters:
+ *  xch:     xc interface handler
+ *  domid:   the target domain id
+ *  mfn:     start MFN of the host pmem pages to be mapped
+ *  gfn:     start GFN of the target guest physical pages
+ *  nr_mfns: the number of host pmem pages to be mapped
+ *
+ * Return:
+ *  0 on success; non-zero error code for failures.
+ */
+int xc_domain_populate_pmemmap(xc_interface *xch, uint32_t domid,
+                               unsigned long mfn, unsigned long gfn,
+                               unsigned long nr_mfns);
+
 /* Compat shims */
 #include "xenctrl_compat.h"
 
diff --git a/tools/libxc/xc_domain.c b/tools/libxc/xc_domain.c
index d862e537d9..9ccdda086d 100644
--- a/tools/libxc/xc_domain.c
+++ b/tools/libxc/xc_domain.c
@@ -2291,6 +2291,21 @@ int xc_domain_soft_reset(xc_interface *xch,
     domctl.domain = (domid_t)domid;
     return do_domctl(xch, &domctl);
 }
+
+int xc_domain_populate_pmemmap(xc_interface *xch, uint32_t domid,
+                               unsigned long mfn, unsigned long gfn,
+                               unsigned long nr_mfns)
+{
+    struct xen_pmem_map args = {
+        .domid   = domid,
+        .mfn     = mfn,
+        .gfn     = gfn,
+        .nr_mfns = nr_mfns,
+    };
+
+    return do_memory_op(xch, XENMEM_populate_pmem_map, &args, sizeof(args));
+}
+
 /*
  * Local variables:
  * mode: C
diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index 479aee641f..2333603f3e 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -36,6 +36,7 @@
 #include <xen/wait.h>
 #include <xen/guest_access.h>
 #include <xen/livepatch.h>
+#include <xen/pmem.h>
 #include <public/sysctl.h>
 #include <public/hvm/hvm_vcpu.h>
 #include <asm/regs.h>
@@ -2352,6 +2353,12 @@ int domain_relinquish_resources(struct domain *d)
         if ( ret )
             return ret;
 
+#ifdef CONFIG_PMEM
+        ret = pmem_teardown(d);
+        if ( ret )
+            return ret;
+#endif /* CONFIG_PMEM */
+
         /* Tear down paging-assistance stuff. */
         ret = paging_teardown(d);
         if ( ret )
diff --git a/xen/common/compat/memory.c b/xen/common/compat/memory.c
index a37a948331..19382f6dfc 100644
--- a/xen/common/compat/memory.c
+++ b/xen/common/compat/memory.c
@@ -523,6 +523,7 @@ int compat_memory_op(unsigned int cmd, XEN_GUEST_HANDLE_PARAM(void) compat)
         case XENMEM_add_to_physmap:
         case XENMEM_remove_from_physmap:
         case XENMEM_access_op:
+        case XENMEM_populate_pmem_map:
             break;
 
         case XENMEM_get_vnumainfo:
diff --git a/xen/common/domain.c b/xen/common/domain.c
index 4492c9c3d5..f8b4bd9c29 100644
--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -287,6 +287,9 @@ struct domain *domain_create(domid_t domid, unsigned int domcr_flags,
     INIT_PAGE_LIST_HEAD(&d->page_list);
     INIT_PAGE_LIST_HEAD(&d->xenpage_list);
 
+    spin_lock_init_prof(d, pmem_lock);
+    INIT_PAGE_LIST_HEAD(&d->pmem_page_list);
+
     spin_lock_init(&d->node_affinity_lock);
     d->node_affinity = NODE_MASK_ALL;
     d->auto_node_affinity = 1;
diff --git a/xen/common/memory.c b/xen/common/memory.c
index ad0b33ceb6..0883d2d9b8 100644
--- a/xen/common/memory.c
+++ b/xen/common/memory.c
@@ -23,6 +23,7 @@
 #include <xen/numa.h>
 #include <xen/mem_access.h>
 #include <xen/trace.h>
+#include <xen/pmem.h>
 #include <asm/current.h>
 #include <asm/hardirq.h>
 #include <asm/p2m.h>
@@ -1328,6 +1329,48 @@ long do_memory_op(unsigned long cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
     }
 #endif
 
+#ifdef CONFIG_PMEM
+    case XENMEM_populate_pmem_map:
+    {
+        struct xen_pmem_map map;
+        struct xen_pmem_map_args args;
+
+        if ( copy_from_guest(&map, arg, 1) )
+            return -EFAULT;
+
+        if ( map.domid == DOMID_SELF )
+            return -EINVAL;
+
+        d = rcu_lock_domain_by_any_id(map.domid);
+        if ( !d )
+            return -EINVAL;
+
+        rc = xsm_populate_pmem_map(XSM_TARGET, curr_d, d);
+        if ( rc )
+        {
+            rcu_unlock_domain(d);
+            return rc;
+        }
+
+        args.domain = d;
+        args.mfn = map.mfn;
+        args.gfn = map.gfn;
+        args.nr_mfns = map.nr_mfns;
+        args.nr_done = start_extent;
+        args.preempted = 0;
+
+        rc = pmem_populate(&args);
+        rcu_unlock_domain(d);
+
+        if ( rc == -ERESTART && args.preempted )
+            return hypercall_create_continuation(
+                __HYPERVISOR_memory_op, "lh",
+                op | (args.nr_done << MEMOP_EXTENT_SHIFT), arg);
+
+        break;
+    }
+#endif /* CONFIG_PMEM */
+
     default:
         rc = arch_memory_op(cmd, arg);
         break;
diff --git a/xen/common/pmem.c b/xen/common/pmem.c
index 0e2d66f94c..03f1c1b374 100644
--- a/xen/common/pmem.c
+++ b/xen/common/pmem.c
@@ -17,9 +17,12 @@
  */
 
 #include <xen/errno.h>
+#include <xen/event.h>
 #include <xen/list.h>
 #include <xen/mm.h>
+#include <xen/paging.h>
 #include <xen/pmem.h>
+#include <xen/sched.h>
 #include <xen/spinlock.h>
 
 /*
@@ -130,8 +133,9 @@ static struct pmem *get_first_overlap(const struct list_head *list,
     return overlap;
 }
 
-static bool pmem_list_covered(const struct list_head *list,
-                              unsigned long smfn, unsigned emfn)
+static bool pmem_list_covered_ready(const struct list_head *list,
+                                    unsigned long smfn, unsigned emfn,
+                                    bool check_ready)
 {
     struct pmem *overlap;
     bool covered = false;
@@ -139,7 +143,8 @@ static bool pmem_list_covered(const struct list_head *list,
     do {
         overlap = get_first_overlap(list, smfn, emfn);
 
-        if ( !overlap || smfn < overlap->smfn )
+        if ( !overlap || smfn < overlap->smfn ||
+             (check_ready && !overlap->ready) )
             break;
 
         if ( emfn <= overlap->emfn )
@@ -155,6 +160,12 @@ static bool pmem_list_covered(const struct list_head *list,
     return covered;
 }
 
+static bool pmem_list_covered(const struct list_head *list,
+                              unsigned long smfn, unsigned emfn)
+{
+    return pmem_list_covered_ready(list, smfn, emfn, false);
+}
+
 static bool check_mgmt_size(unsigned long mgmt_mfns, unsigned long total_mfns)
 {
     return mgmt_mfns >=
@@ -301,3 +312,137 @@ int pmem_setup(unsigned long data_smfn, unsigned long data_emfn,
  out:
     return rc;
 }
+
+#ifdef CONFIG_X86
+
+static void pmem_assign_page(struct domain *d, struct page_info *pg,
+                             unsigned long gfn)
+{
+    pg->u.inuse.type_info = 0;
+    page_set_owner(pg, d);
+    guest_physmap_add_page(d, _gfn(gfn), _mfn(page_to_mfn(pg)), 0);
+
+    spin_lock(&d->pmem_lock);
+    page_list_add_tail(pg, &d->pmem_page_list);
+    spin_unlock(&d->pmem_lock);
+}
+
+static void pmem_unassign_page(struct domain *d, struct page_info *pg,
+                               unsigned long gfn)
+{
+    spin_lock(&d->pmem_lock);
+    page_list_del(pg, &d->pmem_page_list);
+    spin_unlock(&d->pmem_lock);
+
+    guest_physmap_remove_page(d, _gfn(gfn), _mfn(page_to_mfn(pg)), 0);
+    page_set_owner(pg, NULL);
+    pg->count_info = (pg->count_info & ~PGC_count_mask) | PGC_state_free;
+}
+
+static void pmem_unassign_pages(struct domain *d, unsigned long mfn,
+                                unsigned long gfn, unsigned long nr_mfns)
+{
+    unsigned long emfn = mfn + nr_mfns;
+
+    for ( ; mfn < emfn; mfn++, gfn++ )
+        pmem_unassign_page(d, mfn_to_page(mfn), gfn);
+}
+
+/**
+ * Map host pmem pages to a domain. Currently only HVM domains are
+ * supported.
+ *
+ * Parameters:
+ *  args: refer to the comments of struct xen_pmem_map_args in xen/pmem.h
+ *
+ * Return:
+ *  0 on success; non-zero error code on failures.
+ */
+int pmem_populate(struct xen_pmem_map_args *args)
+{
+    struct domain *d = args->domain;
+    unsigned long i = args->nr_done;
+    unsigned long mfn = args->mfn + i;
+    unsigned long emfn = args->mfn + args->nr_mfns;
+    unsigned long gfn;
+    struct page_info *page;
+    int rc = 0;
+
+    if ( unlikely(d->is_dying) )
+        return -EINVAL;
+
+    if ( !has_hvm_container_domain(d) || !paging_mode_translate(d) )
+        return -EINVAL;
+
+    spin_lock(&pmem_gregions_lock);
+    if ( !pmem_list_covered_ready(&pmem_gregions, mfn, emfn, true) )
+    {
+        spin_unlock(&pmem_gregions_lock);
+        return -EINVAL;
+    }
+    spin_unlock(&pmem_gregions_lock);
+
+    for ( gfn = args->gfn + i; mfn < emfn; i++, mfn++, gfn++ )
+    {
+        if ( i != args->nr_done && hypercall_preempt_check() )
+        {
+            args->preempted = 1;
+            rc = -ERESTART;
+            break;
+        }
+
+        page = mfn_to_page(mfn);
+
+        spin_lock(&pmem_gregions_lock);
+        if ( !page_state_is(page, free) )
+        {
+            dprintk(XENLOG_DEBUG, "pmem: mfn 0x%lx not in free state\n", mfn);
+            spin_unlock(&pmem_gregions_lock);
+            rc = -EINVAL;
+            break;
+        }
+        page->count_info = PGC_state_inuse | 1;
+        spin_unlock(&pmem_gregions_lock);
+
+        pmem_assign_page(d, page, gfn);
+    }
+
+    if ( rc && rc != -ERESTART )
+        pmem_unassign_pages(d, args->mfn, args->gfn, i);
+
+    args->nr_done = i;
+    return rc;
+}
+
+int pmem_teardown(struct domain *d)
+{
+    struct page_info *pg, *next;
+    int rc = 0;
+
+    ASSERT(d->is_dying);
+    ASSERT(d != current->domain);
+
+    spin_lock(&d->pmem_lock);
+
+    page_list_for_each_safe (pg, next, &d->pmem_page_list )
+    {
+        BUG_ON(page_get_owner(pg) != d);
+        BUG_ON(page_state_is(pg, free));
+
+        page_list_del(pg, &d->pmem_page_list);
+        page_set_owner(pg, NULL);
+        pg->count_info = (pg->count_info & ~PGC_count_mask) | PGC_state_free;
+
+        if ( hypercall_preempt_check() )
+        {
+            rc = -ERESTART;
+            break;
+        }
+    }
+
+    spin_unlock(&d->pmem_lock);
+
+    return rc;
+}
+
+#endif /* CONFIG_X86 */
diff --git a/xen/include/public/memory.h b/xen/include/public/memory.h
index 6eee0c8a16..fa636b313a 100644
--- a/xen/include/public/memory.h
+++ b/xen/include/public/memory.h
@@ -648,7 +648,19 @@ struct xen_vnuma_topology_info {
 typedef struct xen_vnuma_topology_info xen_vnuma_topology_info_t;
 DEFINE_XEN_GUEST_HANDLE(xen_vnuma_topology_info_t);
 
-/* Next available subop number is 28 */
+#define XENMEM_populate_pmem_map 28
+
+struct xen_pmem_map {
+    /* IN */
+    domid_t domid;
+    unsigned long mfn;
+    unsigned long gfn;
+    unsigned int nr_mfns;
+};
+typedef struct xen_pmem_map xen_pmem_map_t;
+DEFINE_XEN_GUEST_HANDLE(xen_pmem_map_t);
+
+/* Next available subop number is 29 */
 
 #endif /* __XEN_PUBLIC_MEMORY_H__ */
 
diff --git a/xen/include/xen/pmem.h b/xen/include/xen/pmem.h
index 95c8207ff6..cbc621048b 100644
--- a/xen/include/xen/pmem.h
+++ b/xen/include/xen/pmem.h
@@ -26,9 +26,23 @@ int pmem_register(unsigned long smfn, unsigned long emfn);
 int pmem_setup(unsigned long data_spfn, unsigned long data_emfn,
                unsigned long mgmt_smfn, unsigned long mgmt_emfn);
 
+struct xen_pmem_map_args {
+    struct domain *domain;
+
+    unsigned long mfn;     /* start MFN of pmem pages to be mapped */
+    unsigned long gfn;     /* start GFN of target domain */
+    unsigned long nr_mfns; /* number of pmem pages to be mapped */
+
+    /* For preemption ... */
+    unsigned long nr_done; /* number of pmem pages processed so far */
+    int preempted;         /* Is the operation preempted? */
+};
+
 #ifdef CONFIG_X86
 int pmem_arch_setup(unsigned long data_smfn, unsigned long data_emfn,
                     unsigned long mgmt_smfn, unsigned long mgmt_emfn);
+int pmem_populate(struct xen_pmem_map_args *args);
+int pmem_teardown(struct domain *d);
 #else /* !CONFIG_X86 */
 static inline int
 pmem_arch_setup(unsigned long data_smfn, unsigned long data_emfn,
@@ -36,6 +50,16 @@ pmem_arch_setup(unsigned long data_smfn, unsigned long data_emfn,
 {
     return -ENOSYS;
 }
+
+static inline int pmem_populate(struct xen_pmem_map_args *args)
+{
+    return -ENOSYS;
+}
+
+static inline int pmem_teardown(struct domain *d)
+{
+    return -ENOSYS;
+}
 #endif /* CONFIG_X86 */
 
 #endif /* CONFIG_PMEM */
diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
index 0929c0b910..39057243d6 100644
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -336,6 +336,9 @@ struct domain
     atomic_t         shr_pages;       /* number of shared pages             */
     atomic_t         paged_pages;     /* number of paged-out pages          */
 
+    spinlock_t       pmem_lock;       /* protect all following pmem_ fields */
+    struct page_list_head pmem_page_list; /* linked list of pmem pages      */
+
     /* Scheduling. */
     void            *sched_priv;    /* scheduler-specific data */
     struct cpupool  *cpupool;
diff --git a/xen/include/xsm/dummy.h b/xen/include/xsm/dummy.h
index 4b27ae72de..aea0b9376f 100644
--- a/xen/include/xsm/dummy.h
+++ b/xen/include/xsm/dummy.h
@@ -728,3 +728,14 @@ static XSM_INLINE int xsm_xen_version (XSM_DEFAULT_ARG uint32_t op)
         return xsm_default_action(XSM_PRIV, current->domain, NULL);
     }
 }
+
+#ifdef CONFIG_PMEM
+
+static XSM_INLINE int xsm_populate_pmem_map(XSM_DEFAULT_ARG
+                                            struct domain *d1, struct domain *d2)
+{
+    XSM_ASSERT_ACTION(XSM_TARGET);
+    return xsm_default_action(action, d1, d2);
+}
+
+#endif /* CONFIG_PMEM */
diff --git a/xen/include/xsm/xsm.h b/xen/include/xsm/xsm.h
index 2cf7ac10db..8f62b21739 100644
--- a/xen/include/xsm/xsm.h
+++ b/xen/include/xsm/xsm.h
@@ -182,6 +182,10 @@ struct xsm_operations {
     int (*dm_op) (struct domain *d);
 #endif
     int (*xen_version) (uint32_t cmd);
+
+#ifdef CONFIG_PMEM
+    int (*populate_pmem_map) (struct domain *d1, struct domain *d2);
+#endif /* CONFIG_PMEM */
 };
 
 #ifdef CONFIG_XSM
@@ -705,6 +709,14 @@ static inline int xsm_xen_version (xsm_default_t def, uint32_t op)
     return xsm_ops->xen_version(op);
 }
 
+#ifdef CONFIG_PMEM
+static inline int xsm_populate_pmem_map(xsm_default_t def,
+                                        struct domain *d1, struct domain *d2)
+{
+    return xsm_ops->populate_pmem_map(d1, d2);
+}
+#endif /* CONFIG_PMEM */
+
 #endif /* XSM_NO_WRAPPERS */
 
 #ifdef CONFIG_MULTIBOOT
diff --git a/xen/xsm/dummy.c b/xen/xsm/dummy.c
index 3cb5492dd3..dde68ecf59 100644
--- a/xen/xsm/dummy.c
+++ b/xen/xsm/dummy.c
@@ -159,4 +159,8 @@ void __init xsm_fixup_ops (struct xsm_operations *ops)
     set_to_dummy_if_null(ops, dm_op);
 #endif
     set_to_dummy_if_null(ops, xen_version);
+
+#ifdef CONFIG_PMEM
+    set_to_dummy_if_null(ops, populate_pmem_map);
+#endif
 }
diff --git a/xen/xsm/flask/hooks.c b/xen/xsm/flask/hooks.c
index e3c77bbe3f..582ddf81d3 100644
--- a/xen/xsm/flask/hooks.c
+++ b/xen/xsm/flask/hooks.c
@@ -1659,6 +1659,15 @@ static int flask_xen_version (uint32_t op)
     }
 }
 
+#ifdef CONFIG_PMEM
+
+static int flask_populate_pmem_map(struct domain *d1, struct domain *d2)
+{
+    return domain_has_perm(d1, d2, SECCLASS_MMU, MMU__POPULATE_PMEM_MAP);
+}
+
+#endif /* CONFIG_PMEM */
+
 long do_flask_op(XEN_GUEST_HANDLE_PARAM(xsm_op_t) u_flask_op);
 int compat_flask_op(XEN_GUEST_HANDLE_PARAM(xsm_op_t) u_flask_op);
 
@@ -1794,6 +1803,10 @@ static struct xsm_operations flask_ops = {
     .dm_op = flask_dm_op,
 #endif
     .xen_version = flask_xen_version,
+
+#ifdef CONFIG_PMEM
+    .populate_pmem_map = flask_populate_pmem_map,
+#endif /* CONFIG_PMEM */
 };
 
 void __init flask_init(const void *policy_buffer, size_t policy_size)
diff --git a/xen/xsm/flask/policy/access_vectors b/xen/xsm/flask/policy/access_vectors
index a8ddd7ca84..44cbd66f4d 100644
--- a/xen/xsm/flask/policy/access_vectors
+++ b/xen/xsm/flask/policy/access_vectors
@@ -385,6 +385,8 @@ class mmu
 # Allow a privileged domain to install a map of a page it does not own.  Used
 # for stub domain device models with the PV framebuffer.
     target_hack
+# XENMEM_populate_pmem_map
+    populate_pmem_map
 }
 
 # control of the paging_domctl split by subop
-- 
2.12.0


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [RFC XEN PATCH v2 06/15] tools: reserve guest memory for ACPI from device model
  2017-03-20  0:09 [RFC XEN PATCH v2 00/15] Add vNVDIMM support to HVM domains Haozhong Zhang
                   ` (4 preceding siblings ...)
  2017-03-20  0:09 ` [RFC XEN PATCH v2 05/15] xen/x86: add XENMEM_populate_pmem_map to map host pmem pages to HVM domain Haozhong Zhang
@ 2017-03-20  0:09 ` Haozhong Zhang
  2017-03-20  0:09 ` [RFC XEN PATCH v2 07/15] tools/libacpi: expose the minimum alignment used by mem_ops.alloc Haozhong Zhang
                   ` (10 subsequent siblings)
  16 siblings, 0 replies; 32+ messages in thread
From: Haozhong Zhang @ 2017-03-20  0:09 UTC (permalink / raw)
  To: xen-devel
  Cc: Konrad Rzeszutek Wilk, Dan Williams, Ian Jackson, Wei Liu,
	Haozhong Zhang

Some virtual devices (e.g. NVDIMM) require complex ACPI tables and
definition blocks (in AML), which a device model (e.g. QEMU) is
already able to construct. Instead of introducing a similar
implementation in Xen, we would like to reuse the device model to
provide that ACPI content.

This commit reserves an area in the guest memory for the device model
to pass its ACPI tables and definition blocks to the guest, which will be
loaded by hvmloader. The base guest physical address and the size of
the reserved area are passed to the device model via XenStore keys
hvmloader/dm-acpi/{address, length}.
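
For illustration, a device model running in Dom0 could retrieve the
reserved area via libxenstore roughly as below; get_dm_acpi_area() and
its error handling are hypothetical, only the XenStore key names come
from this patch:

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <xenstore.h>

    /* Sketch: read the DM ACPI area reserved for guest @domid. */
    static int get_dm_acpi_area(int domid, uint64_t *addr, uint64_t *len)
    {
        struct xs_handle *xsh = xs_open(0);
        char path[64];
        char *val;
        unsigned int sz;

        if ( !xsh )
            return -1;

        snprintf(path, sizeof(path),
                 "/local/domain/%d/hvmloader/dm-acpi/address", domid);
        val = xs_read(xsh, XBT_NULL, path, &sz);
        if ( !val )
            goto err;
        *addr = strtoull(val, NULL, 0);
        free(val);

        snprintf(path, sizeof(path),
                 "/local/domain/%d/hvmloader/dm-acpi/length", domid);
        val = xs_read(xsh, XBT_NULL, path, &sz);
        if ( !val )
            goto err;
        *len = strtoull(val, NULL, 0);
        free(val);

        xs_close(xsh);
        return 0;

     err:
        xs_close(xsh);
        return -1;
    }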

Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
---
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Wei Liu <wei.liu2@citrix.com>
---
 tools/libxc/include/xc_dom.h            |  1 +
 tools/libxc/xc_dom_x86.c                |  7 +++++++
 tools/libxl/libxl_dom.c                 | 22 ++++++++++++++++++++++
 xen/include/public/hvm/hvm_xs_strings.h | 11 +++++++++++
 4 files changed, 41 insertions(+)

diff --git a/tools/libxc/include/xc_dom.h b/tools/libxc/include/xc_dom.h
index 608cbc2ad6..19d65cda1e 100644
--- a/tools/libxc/include/xc_dom.h
+++ b/tools/libxc/include/xc_dom.h
@@ -98,6 +98,7 @@ struct xc_dom_image {
     xen_pfn_t xenstore_pfn;
     xen_pfn_t shared_info_pfn;
     xen_pfn_t bootstack_pfn;
+    xen_pfn_t dm_acpi_pfn;
     xen_pfn_t pfn_alloc_end;
     xen_vaddr_t virt_alloc_end;
     xen_vaddr_t bsd_symtab_start;
diff --git a/tools/libxc/xc_dom_x86.c b/tools/libxc/xc_dom_x86.c
index 6495e7fc30..917fb51abf 100644
--- a/tools/libxc/xc_dom_x86.c
+++ b/tools/libxc/xc_dom_x86.c
@@ -674,6 +674,13 @@ static int alloc_magic_pages_hvm(struct xc_dom_image *dom)
                          ioreq_server_pfn(0));
         xc_hvm_param_set(xch, domid, HVM_PARAM_NR_IOREQ_SERVER_PAGES,
                          NR_IOREQ_SERVER_PAGES);
+
+        dom->dm_acpi_pfn = xc_dom_alloc_page(dom, "DM ACPI");
+        if ( dom->dm_acpi_pfn == INVALID_PFN )
+        {
+            DOMPRINTF("Could not allocate page for device model ACPI.");
+            goto error_out;
+        }
     }
 
     rc = xc_dom_alloc_segment(dom, &dom->start_info_seg,
diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c
index d519c8d440..5f621466d5 100644
--- a/tools/libxl/libxl_dom.c
+++ b/tools/libxl/libxl_dom.c
@@ -865,6 +865,28 @@ static int hvm_build_set_xs_values(libxl__gc *gc,
             goto err;
     }
 
+    if (dom->dm_acpi_pfn) {
+        uint64_t guest_addr_out = dom->dm_acpi_pfn * XC_DOM_PAGE_SIZE(dom);
+
+        if (guest_addr_out >= 0x100000000ULL) {
+            LOG(ERROR,
+                "Guest address of DM ACPI is 0x%"PRIx64", but expected below 4G",
+                guest_addr_out);
+            goto err;
+        }
+
+        path = GCSPRINTF("/local/domain/%d/"HVM_XS_DM_ACPI_ADDRESS, domid);
+        ret = libxl__xs_printf(gc, XBT_NULL, path, "0x%"PRIx64, guest_addr_out);
+        if (ret)
+            goto err;
+
+        path = GCSPRINTF("/local/domain/%d/"HVM_XS_DM_ACPI_LENGTH, domid);
+        ret = libxl__xs_printf(gc, XBT_NULL, path, "0x%"PRIx64,
+                               (uint64_t)XC_DOM_PAGE_SIZE(dom));
+        if (ret)
+            goto err;
+    }
+
     return 0;
 
 err:
diff --git a/xen/include/public/hvm/hvm_xs_strings.h b/xen/include/public/hvm/hvm_xs_strings.h
index 146b0b0582..f9b82dbbc0 100644
--- a/xen/include/public/hvm/hvm_xs_strings.h
+++ b/xen/include/public/hvm/hvm_xs_strings.h
@@ -79,4 +79,15 @@
  */
 #define HVM_XS_OEM_STRINGS             "bios-strings/oem-%d"
 
+/* The following are XenStore keys for DM ACPI (ACPI built by the
+ * device model, e.g. QEMU).
+ *
+ * A reserved area of guest physical memory is used to pass DM
+ * ACPI. The values of the following two keys specify the base
+ * physical address and length (in bytes) of the reserved area.
+ */
+#define HVM_XS_DM_ACPI_ROOT            "hvmloader/dm-acpi"
+#define HVM_XS_DM_ACPI_ADDRESS         HVM_XS_DM_ACPI_ROOT"/address"
+#define HVM_XS_DM_ACPI_LENGTH          HVM_XS_DM_ACPI_ROOT"/length"
+
 #endif /* __XEN_PUBLIC_HVM_HVM_XS_STRINGS_H__ */
-- 
2.12.0


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [RFC XEN PATCH v2 07/15] tools/libacpi: expose the minimum alignment used by mem_ops.alloc
  2017-03-20  0:09 [RFC XEN PATCH v2 00/15] Add vNVDIMM support to HVM domains Haozhong Zhang
                   ` (5 preceding siblings ...)
  2017-03-20  0:09 ` [RFC XEN PATCH v2 06/15] tools: reserve guest memory for ACPI from device model Haozhong Zhang
@ 2017-03-20  0:09 ` Haozhong Zhang
  2017-03-20  0:09 ` [RFC XEN PATCH v2 08/15] tools/libacpi: add callback acpi_ctxt.p2v to get a pointer from physical address Haozhong Zhang
                   ` (9 subsequent siblings)
  16 siblings, 0 replies; 32+ messages in thread
From: Haozhong Zhang @ 2017-03-20  0:09 UTC (permalink / raw)
  To: xen-devel
  Cc: Haozhong Zhang, Wei Liu, Andrew Cooper, Ian Jackson, Jan Beulich,
	Konrad Rzeszutek Wilk, Dan Williams

The AML builder added later requires this information to implement a
memory allocator that can allocate contiguous memory across multiple
calls to mem_ops.alloc().
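
As a rough sketch of why the minimum alignment matters (the concrete
allocator is added in a later patch of this series; buf, capacity and
size below are illustrative variables, not code from this patch): a
caller can round every growth request up to min_alloc_byte_align and
then check whether the new chunk starts exactly where the previous one
ended, in which case the buffer remains contiguous:

    /* Sketch: grow an AML buffer by @size bytes, keeping it contiguous. */
    uint32_t align = ctxt->min_alloc_byte_align;
    uint32_t chunk = (size + align - 1) & ~(align - 1);  /* round up */
    uint8_t *p = ctxt->mem_ops.alloc(ctxt, chunk, align);

    if ( p && p == buf + capacity )  /* new chunk continues the old buffer */
        capacity += chunk;           /* still one contiguous region */
    else
        p = NULL;                    /* cannot extend contiguously */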

Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
---
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Wei Liu <wei.liu2@citrix.com>

Changes in v2:
 * Only expose the minimal alignment.
 * Rename min_alloc_align to min_alloc_byte_align to clarify the unit.
---
 tools/firmware/hvmloader/util.c | 2 ++
 tools/libacpi/libacpi.h         | 2 ++
 tools/libxl/libxl_x86_acpi.c    | 2 ++
 3 files changed, 6 insertions(+)

diff --git a/tools/firmware/hvmloader/util.c b/tools/firmware/hvmloader/util.c
index 03cfb795d3..d289361317 100644
--- a/tools/firmware/hvmloader/util.c
+++ b/tools/firmware/hvmloader/util.c
@@ -971,6 +971,8 @@ void hvmloader_acpi_build_tables(struct acpi_config *config,
     ctxt.mem_ops.free = acpi_mem_free;
     ctxt.mem_ops.v2p = acpi_v2p;
 
+    ctxt.min_alloc_byte_align = 16;
+
     acpi_build_tables(&ctxt, config);
 
     hvm_param_set(HVM_PARAM_VM_GENERATION_ID_ADDR, config->vm_gid_addr);
diff --git a/tools/libacpi/libacpi.h b/tools/libacpi/libacpi.h
index 67bd67fa0a..2049a1b032 100644
--- a/tools/libacpi/libacpi.h
+++ b/tools/libacpi/libacpi.h
@@ -51,6 +51,8 @@ struct acpi_ctxt {
         void (*free)(struct acpi_ctxt *ctxt, void *v, uint32_t size);
         unsigned long (*v2p)(struct acpi_ctxt *ctxt, void *v);
     } mem_ops;
+
+    uint32_t min_alloc_byte_align; /* minimum alignment used by mem_ops.alloc */
 };
 
 struct acpi_config {
diff --git a/tools/libxl/libxl_x86_acpi.c b/tools/libxl/libxl_x86_acpi.c
index c0a6e321ec..f242450166 100644
--- a/tools/libxl/libxl_x86_acpi.c
+++ b/tools/libxl/libxl_x86_acpi.c
@@ -183,6 +183,8 @@ int libxl__dom_load_acpi(libxl__gc *gc,
     libxl_ctxt.c.mem_ops.v2p = virt_to_phys;
     libxl_ctxt.c.mem_ops.free = acpi_mem_free;
 
+    libxl_ctxt.c.min_alloc_byte_align = 16;
+
     rc = init_acpi_config(gc, dom, b_info, &config);
     if (rc) {
         LOG(ERROR, "init_acpi_config failed (rc=%d)", rc);
-- 
2.12.0


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [RFC XEN PATCH v2 08/15] tools/libacpi: add callback acpi_ctxt.p2v to get a pointer from physical address
  2017-03-20  0:09 [RFC XEN PATCH v2 00/15] Add vNVDIMM support to HVM domains Haozhong Zhang
                   ` (6 preceding siblings ...)
  2017-03-20  0:09 ` [RFC XEN PATCH v2 07/15] tools/libacpi: expose the minimum alignment used by mem_ops.alloc Haozhong Zhang
@ 2017-03-20  0:09 ` Haozhong Zhang
  2017-03-20  0:09 ` [RFC XEN PATCH v2 09/15] tools/libacpi: add callbacks to access XenStore Haozhong Zhang
                   ` (8 subsequent siblings)
  16 siblings, 0 replies; 32+ messages in thread
From: Haozhong Zhang @ 2017-03-20  0:09 UTC (permalink / raw)
  To: xen-devel
  Cc: Haozhong Zhang, Wei Liu, Andrew Cooper, Ian Jackson, Jan Beulich,
	Konrad Rzeszutek Wilk, Dan Williams

The address of the ACPI blobs passed from the device model is
provided via XenStore as a guest physical address. libacpi needs this
callback to access them.
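
For illustration, once libacpi has read the base address from
XenStore, the new callback turns it into a pointer it can dereference;
dm_addr below is hypothetical, and the struct acpi_header layout is
the one already defined in acpi2_0.h:

    /* Sketch: access a DM ACPI table at guest physical address @dm_addr. */
    struct acpi_header *h = ctxt->mem_ops.p2v(ctxt, dm_addr);
    uint32_t sig = h->signature;  /* e.g. ASCII32('S','S','D','T') */
    uint32_t len = h->length;     /* table length, including the header */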

Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
---
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Wei Liu <wei.liu2@citrix.com>
---
 tools/firmware/hvmloader/util.c |  6 ++++++
 tools/firmware/hvmloader/util.h |  1 +
 tools/libacpi/libacpi.h         |  1 +
 tools/libxl/libxl_x86_acpi.c    | 10 ++++++++++
 4 files changed, 18 insertions(+)

diff --git a/tools/firmware/hvmloader/util.c b/tools/firmware/hvmloader/util.c
index d289361317..b2372a75be 100644
--- a/tools/firmware/hvmloader/util.c
+++ b/tools/firmware/hvmloader/util.c
@@ -871,6 +871,11 @@ static unsigned long acpi_v2p(struct acpi_ctxt *ctxt, void *v)
     return virt_to_phys(v);
 }
 
+static void *acpi_p2v(struct acpi_ctxt *ctxt, unsigned long p)
+{
+    return phys_to_virt(p);
+}
+
 static void *acpi_mem_alloc(struct acpi_ctxt *ctxt,
                             uint32_t size, uint32_t align)
 {
@@ -970,6 +975,7 @@ void hvmloader_acpi_build_tables(struct acpi_config *config,
     ctxt.mem_ops.alloc = acpi_mem_alloc;
     ctxt.mem_ops.free = acpi_mem_free;
     ctxt.mem_ops.v2p = acpi_v2p;
+    ctxt.mem_ops.p2v = acpi_p2v;
 
     ctxt.min_alloc_byte_align = 16;
 
diff --git a/tools/firmware/hvmloader/util.h b/tools/firmware/hvmloader/util.h
index 6062f0b8cf..6a50dae1eb 100644
--- a/tools/firmware/hvmloader/util.h
+++ b/tools/firmware/hvmloader/util.h
@@ -200,6 +200,7 @@ xen_pfn_t mem_hole_alloc(uint32_t nr_mfns);
 /* Allocate memory in a reserved region below 4GB. */
 void *mem_alloc(uint32_t size, uint32_t align);
 #define virt_to_phys(v) ((unsigned long)(v))
+#define phys_to_virt(p) ((void *)(p))
 
 /* Allocate memory in a scratch region */
 void *scratch_alloc(uint32_t size, uint32_t align);
diff --git a/tools/libacpi/libacpi.h b/tools/libacpi/libacpi.h
index 2049a1b032..48acf9583c 100644
--- a/tools/libacpi/libacpi.h
+++ b/tools/libacpi/libacpi.h
@@ -50,6 +50,7 @@ struct acpi_ctxt {
         void *(*alloc)(struct acpi_ctxt *ctxt, uint32_t size, uint32_t align);
         void (*free)(struct acpi_ctxt *ctxt, void *v, uint32_t size);
         unsigned long (*v2p)(struct acpi_ctxt *ctxt, void *v);
+        void *(*p2v)(struct acpi_ctxt *ctxt, unsigned long p);
     } mem_ops;
 
     uint32_t min_alloc_byte_align; /* minimum alignment used by mem_ops.alloc */
diff --git a/tools/libxl/libxl_x86_acpi.c b/tools/libxl/libxl_x86_acpi.c
index f242450166..2622e03ce6 100644
--- a/tools/libxl/libxl_x86_acpi.c
+++ b/tools/libxl/libxl_x86_acpi.c
@@ -52,6 +52,15 @@ static unsigned long virt_to_phys(struct acpi_ctxt *ctxt, void *v)
             libxl_ctxt->alloc_base_paddr);
 }
 
+static void *phys_to_virt(struct acpi_ctxt *ctxt, unsigned long p)
+{
+    struct libxl_acpi_ctxt *libxl_ctxt =
+        CONTAINER_OF(ctxt, struct libxl_acpi_ctxt, c);
+
+    return (void *)((p - libxl_ctxt->alloc_base_paddr) +
+                    libxl_ctxt->alloc_base_vaddr);
+}
+
 static void *mem_alloc(struct acpi_ctxt *ctxt,
                        uint32_t size, uint32_t align)
 {
@@ -181,6 +190,7 @@ int libxl__dom_load_acpi(libxl__gc *gc,
 
     libxl_ctxt.c.mem_ops.alloc = mem_alloc;
     libxl_ctxt.c.mem_ops.v2p = virt_to_phys;
+    libxl_ctxt.c.mem_ops.p2v = phys_to_virt;
     libxl_ctxt.c.mem_ops.free = acpi_mem_free;
 
     libxl_ctxt.c.min_alloc_byte_align = 16;
-- 
2.12.0


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [RFC XEN PATCH v2 09/15] tools/libacpi: add callbacks to access XenStore
  2017-03-20  0:09 [RFC XEN PATCH v2 00/15] Add vNVDIMM support to HVM domains Haozhong Zhang
                   ` (7 preceding siblings ...)
  2017-03-20  0:09 ` [RFC XEN PATCH v2 08/15] tools/libacpi: add callback acpi_ctxt.p2v to get a pointer from physical address Haozhong Zhang
@ 2017-03-20  0:09 ` Haozhong Zhang
  2017-03-20  0:09 ` [RFC XEN PATCH v2 10/15] tools/libacpi: add a simple AML builder Haozhong Zhang
                   ` (7 subsequent siblings)
  16 siblings, 0 replies; 32+ messages in thread
From: Haozhong Zhang @ 2017-03-20  0:09 UTC (permalink / raw)
  To: xen-devel
  Cc: Haozhong Zhang, Wei Liu, Andrew Cooper, Ian Jackson, Jan Beulich,
	Konrad Rzeszutek Wilk, Dan Williams

libacpi needs to access information placed in XenStore in order to
load ACPI built by the device model.

Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
---
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Wei Liu <wei.liu2@citrix.com>

Changes in v2:
 * Extract the common part of the existing xenstore_read() and the
   new xenstore_directory() in xenbus.c.
---
 tools/firmware/hvmloader/util.c   | 52 +++++++++++++++++++++++++++++++++++++++
 tools/firmware/hvmloader/util.h   |  9 +++++++
 tools/firmware/hvmloader/xenbus.c | 44 +++++++++++++++++++++++----------
 tools/libacpi/libacpi.h           | 10 ++++++++
 tools/libxl/libxl_x86_acpi.c      | 24 ++++++++++++++++++++
 5 files changed, 126 insertions(+), 13 deletions(-)

diff --git a/tools/firmware/hvmloader/util.c b/tools/firmware/hvmloader/util.c
index b2372a75be..ec0de711c2 100644
--- a/tools/firmware/hvmloader/util.c
+++ b/tools/firmware/hvmloader/util.c
@@ -893,6 +893,53 @@ static uint8_t acpi_lapic_id(unsigned cpu)
     return LAPIC_ID(cpu);
 }
 
+static const char *acpi_xs_read(struct acpi_ctxt *ctxt, const char *path)
+{
+    return xenstore_read(path, NULL);
+}
+
+static int acpi_xs_write(struct acpi_ctxt *ctxt,
+                         const char *path, const char *value)
+{
+    return xenstore_write(path, value);
+}
+
+static unsigned int count_strings(const char *strings, unsigned int len)
+{
+    const char *p;
+    unsigned int n;
+
+    for ( p = strings, n = 0; p < strings + len; p++ )
+        if ( *p == '\0' )
+            n++;
+
+    return n;
+}
+
+static char **acpi_xs_directory(struct acpi_ctxt *ctxt,
+                                const char *path, unsigned int *num)
+{
+    const char *strings;
+    char *s, *p, **ret;
+    unsigned int len, n;
+
+    strings = xenstore_directory(path, &len, NULL);
+    if ( !strings )
+        return NULL;
+
+    n = count_strings(strings, len);
+    ret = ctxt->mem_ops.alloc(ctxt, n * sizeof(p) + len, 0);
+    if ( !ret )
+        return NULL;
+    memcpy(&ret[n], strings, len);
+
+    s = (char *)&ret[n];
+    for ( p = s, *num = 0; p < s + len; p += strlen(p) + 1 )
+        ret[(*num)++] = p;
+
+    return ret;
+}
+
 void hvmloader_acpi_build_tables(struct acpi_config *config,
                                  unsigned int physical)
 {
@@ -979,6 +1026,11 @@ void hvmloader_acpi_build_tables(struct acpi_config *config,
 
     ctxt.min_alloc_byte_align = 16;
 
+    ctxt.xs_ops.read = acpi_xs_read;
+    ctxt.xs_ops.write = acpi_xs_write;
+    ctxt.xs_ops.directory = acpi_xs_directory;
+    ctxt.xs_opaque = NULL;
+
     acpi_build_tables(&ctxt, config);
 
     hvm_param_set(HVM_PARAM_VM_GENERATION_ID_ADDR, config->vm_gid_addr);
diff --git a/tools/firmware/hvmloader/util.h b/tools/firmware/hvmloader/util.h
index 6a50dae1eb..ac10a7106a 100644
--- a/tools/firmware/hvmloader/util.h
+++ b/tools/firmware/hvmloader/util.h
@@ -225,6 +225,15 @@ const char *xenstore_read(const char *path, const char *default_resp);
  */
 int xenstore_write(const char *path, const char *value);
 
+/* Read a xenstore directory. Returns NULL, or a buffer holding the
+ * names of all directory entries, each terminated by '\0'; the total
+ * length of the buffer is stored in @len. The buffer is static, so
+ * it is only valid until the next xenstore/xenbus operation. If
+ * @default_resp is specified, it is returned in preference to a NULL
+ * or empty result received from xenstore.
+ */
+const char *xenstore_directory(const char *path, uint32_t *len,
+                               const char *default_resp);
 
 /* Get a HVM param.
  */
diff --git a/tools/firmware/hvmloader/xenbus.c b/tools/firmware/hvmloader/xenbus.c
index 448157dcb0..8cad7af019 100644
--- a/tools/firmware/hvmloader/xenbus.c
+++ b/tools/firmware/hvmloader/xenbus.c
@@ -245,24 +245,16 @@ static int xenbus_recv(uint32_t *reply_len, const char **reply_data,
     return 0;
 }
 
-
-/* Read a xenstore key.  Returns a nul-terminated string (even if the XS
- * data wasn't nul-terminated) or NULL.  The returned string is in a
- * static buffer, so only valid until the next xenstore/xenbus operation.
- * If @default_resp is specified, it is returned in preference to a NULL or
- * empty string received from xenstore.
- */
-const char *xenstore_read(const char *path, const char *default_resp)
+static const char *xenstore_read_common(const char *path, uint32_t *len,
+                                        const char *default_resp, bool is_dir)
 {
-    uint32_t len = 0, type = 0;
+    uint32_t type = 0, expected_type = is_dir ? XS_DIRECTORY : XS_READ;
     const char *answer = NULL;
 
-    xenbus_send(XS_READ,
-                path, strlen(path),
-                "", 1, /* nul separator */
+    xenbus_send(expected_type, path, strlen(path), "", 1, /* nul separator */
                 NULL, 0);
 
-    if ( xenbus_recv(&len, &answer, &type) || (type != XS_READ) )
+    if ( xenbus_recv(len, &answer, &type) || type != expected_type )
         answer = NULL;
 
     if ( (default_resp != NULL) && ((answer == NULL) || (*answer == '\0')) )
@@ -272,6 +264,32 @@ const char *xenstore_read(const char *path, const char *default_resp)
     return answer;
 }
 
+/* Read a xenstore key.  Returns a nul-terminated string (even if the XS
+ * data wasn't nul-terminated) or NULL.  The returned string is in a
+ * static buffer, so only valid until the next xenstore/xenbus operation.
+ * If @default_resp is specified, it is returned in preference to a NULL or
+ * empty string received from xenstore.
+ */
+const char *xenstore_read(const char *path, const char *default_resp)
+{
+    uint32_t len = 0;
+
+    return xenstore_read_common(path, &len, default_resp, false);
+}
+
+/* Read a xenstore directory. Returns NULL, or a buffer holding the
+ * names of all directory entries, each terminated by '\0'; the total
+ * length of the buffer is stored in @len. The buffer is static, so
+ * it is only valid until the next xenstore/xenbus operation. If
+ * @default_resp is specified, it is returned in preference to a NULL
+ * or empty result received from xenstore.
+ */
+const char *xenstore_directory(const char *path, uint32_t *len,
+                               const char *default_resp)
+{
+    return xenstore_read_common(path, len, default_resp, true);
+}
+
 /* Write a xenstore key.  @value must be a nul-terminated string. Returns
  * zero on success or a xenstore error code on failure.
  */
diff --git a/tools/libacpi/libacpi.h b/tools/libacpi/libacpi.h
index 48acf9583c..645e091f68 100644
--- a/tools/libacpi/libacpi.h
+++ b/tools/libacpi/libacpi.h
@@ -54,6 +54,16 @@ struct acpi_ctxt {
     } mem_ops;
 
     uint32_t min_alloc_byte_align; /* minimum alignment used by mem_ops.alloc */
+
+    struct acpi_xs_ops {
+        const char *(*read)(struct acpi_ctxt *ctxt, const char *path);
+        int (*write)(struct acpi_ctxt *ctxt,
+                     const char *path, const char *value);
+        char **(*directory)(struct acpi_ctxt *ctxt,
+                            const char *path, unsigned int *num);
+    } xs_ops;
+
+    void *xs_opaque;
 };
 
 struct acpi_config {
diff --git a/tools/libxl/libxl_x86_acpi.c b/tools/libxl/libxl_x86_acpi.c
index 2622e03ce6..591a4f44fa 100644
--- a/tools/libxl/libxl_x86_acpi.c
+++ b/tools/libxl/libxl_x86_acpi.c
@@ -93,6 +93,25 @@ static void acpi_mem_free(struct acpi_ctxt *ctxt,
 {
 }
 
+static const char *acpi_xs_read(struct acpi_ctxt *ctxt, const char *path)
+{
+    return libxl__xs_read((libxl__gc *)ctxt->xs_opaque, XBT_NULL, path);
+}
+
+static int acpi_xs_write(struct acpi_ctxt *ctxt,
+                         const char *path, const char *value)
+{
+    return libxl__xs_write_checked((libxl__gc *)ctxt->xs_opaque, XBT_NULL,
+                                   path, value);
+}
+
+static char **acpi_xs_directory(struct acpi_ctxt *ctxt,
+                                const char *path, unsigned int *num)
+{
+    return libxl__xs_directory((libxl__gc *)ctxt->xs_opaque, XBT_NULL,
+                               path, num);
+}
+
 static uint8_t acpi_lapic_id(unsigned cpu)
 {
     return cpu * 2;
@@ -195,6 +214,11 @@ int libxl__dom_load_acpi(libxl__gc *gc,
 
     libxl_ctxt.c.min_alloc_byte_align = 16;
 
+    libxl_ctxt.c.xs_ops.read = acpi_xs_read;
+    libxl_ctxt.c.xs_ops.write = acpi_xs_write;
+    libxl_ctxt.c.xs_ops.directory = acpi_xs_directory;
+    libxl_ctxt.c.xs_opaque = gc;
+
     rc = init_acpi_config(gc, dom, b_info, &config);
     if (rc) {
         LOG(ERROR, "init_acpi_config failed (rc=%d)", rc);
-- 
2.12.0


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [RFC XEN PATCH v2 10/15] tools/libacpi: add a simple AML builder
  2017-03-20  0:09 [RFC XEN PATCH v2 00/15] Add vNVDIMM support to HVM domains Haozhong Zhang
                   ` (8 preceding siblings ...)
  2017-03-20  0:09 ` [RFC XEN PATCH v2 09/15] tools/libacpi: add callbacks to access XenStore Haozhong Zhang
@ 2017-03-20  0:09 ` Haozhong Zhang
  2017-03-20  0:09 ` [RFC XEN PATCH v2 11/15] tools/libacpi: load ACPI built by the device model Haozhong Zhang
                   ` (6 subsequent siblings)
  16 siblings, 0 replies; 32+ messages in thread
From: Haozhong Zhang @ 2017-03-20  0:09 UTC (permalink / raw)
  To: xen-devel
  Cc: Haozhong Zhang, Wei Liu, Andrew Cooper, Ian Jackson,
	ross.philipson, Jan Beulich, Konrad Rzeszutek Wilk, Dan Williams

It is used by libacpi to generate SSDTs from ACPI namespace devices
built by the device model.
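
For illustration, wrapping the AML of one namespace device from the
device model into Scope (\_SB) { Device (NVDR) { ... } } would look
roughly as below; build_nvdimm_ssdt_body() and its parameters are
hypothetical, only the aml_* calls are the ones added by this patch:

    /* Sketch: wrap @dev_blob in Scope (\_SB) { Device (NVDR) { ... } }. */
    static int build_nvdimm_ssdt_body(struct acpi_ctxt *ctxt,
                                      const void *dev_blob, uint32_t blob_len,
                                      void **body, uint32_t *body_len)
    {
        void *buf = aml_build_begin(ctxt);

        if ( !buf )
            return -1;

        if ( aml_prepend_blob(buf, dev_blob, blob_len) ||
             aml_prepend_device(buf, "NVDR") ||
             aml_prepend_scope(buf, "\\_SB") )
            return -1;

        *body = buf;
        *body_len = aml_build_end();
        return 0;
    }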

Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
---
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Wei Liu <wei.liu2@citrix.com>
Cc: ross.philipson@ainfosec.com

Changes in v2:
 * Add code comment for what is built by each function.
 * Change the license to LGPL.
---
 tools/firmware/hvmloader/Makefile |   3 +-
 tools/libacpi/aml_build.c         | 326 ++++++++++++++++++++++++++++++++++++++
 tools/libacpi/aml_build.h         | 116 ++++++++++++++
 tools/libxl/Makefile              |   3 +-
 4 files changed, 446 insertions(+), 2 deletions(-)
 create mode 100644 tools/libacpi/aml_build.c
 create mode 100644 tools/libacpi/aml_build.h

diff --git a/tools/firmware/hvmloader/Makefile b/tools/firmware/hvmloader/Makefile
index 80d7b448a5..abbdaaeb40 100644
--- a/tools/firmware/hvmloader/Makefile
+++ b/tools/firmware/hvmloader/Makefile
@@ -76,11 +76,12 @@ smbios.o: CFLAGS += -D__SMBIOS_DATE__="\"$(SMBIOS_REL_DATE)\""
 
 ACPI_PATH = ../../libacpi
 DSDT_FILES = dsdt_anycpu.c dsdt_15cpu.c dsdt_anycpu_qemu_xen.c
-ACPI_OBJS = $(patsubst %.c,%.o,$(DSDT_FILES)) build.o static_tables.o
+ACPI_OBJS = $(patsubst %.c,%.o,$(DSDT_FILES)) build.o static_tables.o aml_build.o
 $(ACPI_OBJS): CFLAGS += -I. -DLIBACPI_STDUTILS=\"$(CURDIR)/util.h\"
 CFLAGS += -I$(ACPI_PATH)
 vpath build.c $(ACPI_PATH)
 vpath static_tables.c $(ACPI_PATH)
+vpath aml_build.c $(ACPI_PATH)
 OBJS += $(ACPI_OBJS)
 
 hvmloader: $(OBJS)
diff --git a/tools/libacpi/aml_build.c b/tools/libacpi/aml_build.c
new file mode 100644
index 0000000000..9b4e28ad95
--- /dev/null
+++ b/tools/libacpi/aml_build.c
@@ -0,0 +1,326 @@
+/*
+ * tools/libacpi/aml_build.c
+ *
+ * Copyright (C) 2017, Intel Corporation.
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License, version 2.1, as published by the Free Software Foundation.
+ *
+ * This library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this library; If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include LIBACPI_STDUTILS
+#include "libacpi.h"
+#include "aml_build.h"
+
+#define AML_OP_SCOPE     0x10
+#define AML_OP_EXT       0x5B
+#define AML_OP_DEVICE    0x82
+
+#define ACPI_NAMESEG_LEN 4
+
+struct aml_build_allocator {
+    struct acpi_ctxt *ctxt;
+    uint8_t *buf;
+    uint32_t capacity;
+    uint32_t used;
+};
+static struct aml_build_allocator alloc;
+
+static uint8_t *aml_buf_alloc(uint32_t size)
+{
+    uint8_t *buf = NULL;
+    struct acpi_ctxt *ctxt = alloc.ctxt;
+    uint32_t alloc_size, alloc_align = ctxt->min_alloc_byte_align;
+    uint32_t length = alloc.used + size;
+
+    /* Overflow ... */
+    if ( length < alloc.used )
+        return NULL;
+
+    if ( length <= alloc.capacity )
+    {
+        buf = alloc.buf + alloc.used;
+        alloc.used += size;
+    }
+    else
+    {
+        alloc_size = length - alloc.capacity;
+        alloc_size = (alloc_size + alloc_align) & ~(alloc_align - 1);
+        buf = ctxt->mem_ops.alloc(ctxt, alloc_size, alloc_align);
+
+        if ( buf &&
+             buf == alloc.buf + alloc.capacity /* contiguous with existing buf */ )
+        {
+            alloc.capacity += alloc_size;
+            buf = alloc.buf + alloc.used;
+            alloc.used += size;
+        }
+        else
+            buf = NULL;
+    }
+
+    return buf;
+}
+
+static uint32_t get_package_length(uint8_t *pkg)
+{
+    uint32_t len;
+
+    len = pkg - alloc.buf;
+    len = alloc.used - len;
+
+    return len;
+}
+
+/*
+ * On success, an object in the following form is stored at @buf.
+ *   @byte
+ *   the original content in @buf
+ */
+static int build_prepend_byte(uint8_t *buf, uint8_t byte)
+{
+    uint32_t len;
+
+    len = buf - alloc.buf;
+    len = alloc.used - len;
+
+    if ( !aml_buf_alloc(sizeof(uint8_t)) )
+        return -1;
+
+    if ( len )
+        memmove(buf + 1, buf, len);
+    buf[0] = byte;
+
+    return 0;
+}
+
+/*
+ * On success, an object in the following form is stored at @buf.
+ *   AML encoding of four-character @name
+ *   the original content in @buf
+ *
+ * Refer to  ACPI spec 6.1, Sec 20.2.2 "Name Objects Encoding".
+ *
+ * XXX: names of multiple segments (e.g. X.Y.Z) are not supported
+ */
+static int build_prepend_name(uint8_t *buf, const char *name)
+{
+    uint8_t *p = buf;
+    const char *s = name;
+    uint32_t len, name_len;
+
+    while ( *s == '\\' || *s == '^' )
+    {
+        if ( build_prepend_byte(p, (uint8_t) *s) )
+            return -1;
+        ++p;
+        ++s;
+    }
+
+    if ( !*s )
+        return build_prepend_byte(p, 0x00);
+
+    len = p - alloc.buf;
+    len = alloc.used - len;
+    name_len = strlen(s);
+    ASSERT(name_len <= ACPI_NAMESEG_LEN);
+
+    if ( !aml_buf_alloc(ACPI_NAMESEG_LEN) )
+        return -1;
+    if ( len )
+        memmove(p + ACPI_NAMESEG_LEN, p, len);
+    memcpy(p, s, name_len);
+    memcpy(p + name_len, "____", ACPI_NAMESEG_LEN - name_len);
+
+    return 0;
+}
+
+enum {
+    PACKAGE_LENGTH_1BYTE_SHIFT = 6, /* Up to 63 - use extra 2 bits. */
+    PACKAGE_LENGTH_2BYTE_SHIFT = 4,
+    PACKAGE_LENGTH_3BYTE_SHIFT = 12,
+    PACKAGE_LENGTH_4BYTE_SHIFT = 20,
+};
+
+/*
+ * On success, an object in the following form is stored at @pkg.
+ *   AML encoding of package length @length
+ *   the original content in @pkg
+ *
+ * Refer to ACPI spec 6.1, Sec 20.2.4 "Package Length Encoding".
+ */
+static int build_prepend_package_length(uint8_t *pkg, uint32_t length)
+{
+    int rc = 0;
+    uint8_t byte;
+    unsigned length_bytes;
+
+    if ( length + 1 < (1 << PACKAGE_LENGTH_1BYTE_SHIFT) )
+        length_bytes = 1;
+    else if ( length + 2 < (1 << PACKAGE_LENGTH_3BYTE_SHIFT) )
+        length_bytes = 2;
+    else if ( length + 3 < (1 << PACKAGE_LENGTH_4BYTE_SHIFT) )
+        length_bytes = 3;
+    else
+        length_bytes = 4;
+
+    length += length_bytes;
+
+    switch ( length_bytes )
+    {
+    case 1:
+        byte = length;
+        return build_prepend_byte(pkg, byte);
+
+    case 4:
+        byte = length >> PACKAGE_LENGTH_4BYTE_SHIFT;
+        if ( (rc = build_prepend_byte(pkg, byte)) )
+            break;
+        length &= (1 << PACKAGE_LENGTH_4BYTE_SHIFT) - 1;
+        /* fall through */
+    case 3:
+        byte = length >> PACKAGE_LENGTH_3BYTE_SHIFT;
+        if ( (rc = build_prepend_byte(pkg, byte)) )
+            break;
+        length &= (1 << PACKAGE_LENGTH_3BYTE_SHIFT) - 1;
+        /* fall through */
+    case 2:
+        byte = length >> PACKAGE_LENGTH_2BYTE_SHIFT;
+        if ( (rc = build_prepend_byte(pkg, byte)) )
+            break;
+        length &= (1 << PACKAGE_LENGTH_2BYTE_SHIFT) - 1;
+        /* fall through */
+    }
+
+    if ( !rc )
+    {
+        /*
+         * Most significant two bits of byte zero indicate how many
+         * following bytes are in PkgLength encoding.
+         */
+        byte = ((length_bytes - 1) << PACKAGE_LENGTH_1BYTE_SHIFT) | length;
+        rc = build_prepend_byte(pkg, byte);
+    }
+
+    return rc;
+}
+
+/*
+ * On success, an object in the following form is stored at @buf.
+ *   @op
+ *   AML encoding of package length of @buf
+ *   original content in @buf
+ *
+ * Refer to comments of callers for ACPI spec sections.
+ */
+static int build_prepend_package(uint8_t *buf, uint8_t op)
+{
+    uint32_t length = get_package_length(buf);
+
+    if ( !build_prepend_package_length(buf, length) )
+        return build_prepend_byte(buf, op);
+    else
+        return -1;
+}
+
+/*
+ * On success, an object in the following form is stored at @buf.
+ *   AML_OP_EXT
+ *   @op
+ *   AML encoding of package length of @buf
+ *   original content in @buf
+ *
+ * Refer to comments of callers for ACPI spec sections.
+ */
+static int build_prepend_ext_package(uint8_t *buf, uint8_t op)
+{
+    if ( !build_prepend_package(buf, op) )
+        return build_prepend_byte(buf, AML_OP_EXT);
+    else
+        return -1;
+}
+
+void *aml_build_begin(struct acpi_ctxt *ctxt)
+{
+    uint32_t align = ctxt->min_alloc_byte_align;
+
+    alloc.ctxt = ctxt;
+    alloc.buf = ctxt->mem_ops.alloc(ctxt, align, align);
+    alloc.capacity = align;
+    alloc.used = 0;
+
+    return alloc.buf;
+}
+
+uint32_t aml_build_end(void)
+{
+    return alloc.used;
+}
+
+/*
+ * On success, an object in the following form is stored at @buf.
+ *   the first @length bytes in @blob
+ *   the original content in @buf
+ */
+int aml_prepend_blob(uint8_t *buf, const void *blob, uint32_t blob_length)
+{
+    uint32_t len;
+
+    ASSERT(buf >= alloc.buf);
+    len = buf - alloc.buf;
+    ASSERT(alloc.used >= len);
+    len = alloc.used - len;
+
+    if ( !aml_buf_alloc(blob_length) )
+        return -1;
+    if ( len )
+        memmove(buf + blob_length, buf, len);
+
+    memcpy(buf, blob, blob_length);
+
+    return 0;
+}
+
+/*
+ * On success, an object decoded as below is stored at @buf.
+ *   Device (@name)
+ *   {
+ *     the original content in @buf
+ *   }
+ *
+ * Refer to ACPI spec 6.1, Sec 20.2.5.2 "Named Objects Encoding" -
+ * "DefDevice".
+ */
+int aml_prepend_device(uint8_t *buf, const char *name)
+{
+    if ( !build_prepend_name(buf, name) )
+        return build_prepend_ext_package(buf, AML_OP_DEVICE);
+    else
+        return -1;
+}
+
+/*
+ * On success, an object decoded as below is stored at @buf.
+ *   Scope (@name)
+ *   {
+ *     the original content in @buf
+ *   }
+ *
+ * Refer to ACPI spec 6.1, Sec 20.2.5.1 "Namespace Modifier Objects
+ * Encoding" - "DefScope".
+ */
+int aml_prepend_scope(uint8_t *buf, const char *name)
+{
+    if ( !build_prepend_name(buf, name) )
+        return build_prepend_package(buf, AML_OP_SCOPE);
+    else
+        return -1;
+}
diff --git a/tools/libacpi/aml_build.h b/tools/libacpi/aml_build.h
new file mode 100644
index 0000000000..30acc0f7a1
--- /dev/null
+++ b/tools/libacpi/aml_build.h
@@ -0,0 +1,116 @@
+/*
+ * tools/libacpi/aml_build.h
+ *
+ * Copyright (C) 2017, Intel Corporation.
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License, version 2.1, as published by the Free Software Foundation.
+ *
+ * This library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this library; If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#ifndef _AML_BUILD_H_
+#define _AML_BUILD_H_
+
+#include <stdint.h>
+#include "libacpi.h"
+
+/*
+ * NB: All aml_prepend_* calls, which build AML code in one ACPI
+ *     table, should be placed between a pair of calls to
+ *     aml_build_begin() and aml_build_end(). Nested aml_build_begin()
+ *     and aml_build_end() are not supported.
+ *
+ * NB: If a call to aml_prepend_*() fails, the AML builder buffer
+ *     will be in an inconsistent state, and any following calls to
+ *     aml_prepend_*() will result in undefined behavior.
+ */
+
+/**
+ * Reset the AML builder and begin a new round of building.
+ *
+ * Parameters:
+ *   ctxt: ACPI context used by the AML builder
+ *
+ * Returns:
+ *   a pointer to the builder buffer where the AML code will be stored
+ */
+void *aml_build_begin(struct acpi_ctxt *ctxt);
+
+/**
+ * Mark the end of a round of AML building.
+ *
+ * Returns:
+ *  the number of bytes in the builder buffer built in this round
+ */
+uint32_t aml_build_end(void);
+
+/**
+ * Prepend a blob, which can contain arbitrary content, to the builder buffer.
+ *
+ * On success, an object in the following form is stored at @buf.
+ *   the first @length bytes in @blob
+ *   the original content in @buf
+ *
+ * Parameters:
+ *   buf:    pointer to the builder buffer
+ *   blob:   pointer to the blob
+ *   length: the number of bytes in the blob
+ *
+ * Return:
+ *   0 on success, -1 on failure.
+ */
+int aml_prepend_blob(uint8_t *buf, const void *blob, uint32_t length);
+
+/**
+ * Prepend an AML device structure to the builder buffer. The existing
+ * data in the builder buffer is included in the AML device.
+ *
+ * On success, an object decoded as below is stored at @buf.
+ *   Device (@name)
+ *   {
+ *     the original content in @buf
+ *   }
+ *
+ * Refer to ACPI spec 6.1, Sec 20.2.5.2 "Named Objects Encoding" -
+ * "DefDevice".
+ *
+ * Parameters:
+ *   buf:  pointer to the builder buffer
+ *   name: the name of the device
+ *
+ * Return:
+ *   0 on success, -1 on failure.
+ */
+int aml_prepend_device(uint8_t *buf, const char *name);
+
+/**
+ * Prepend an AML scope structure to the builder buffer. The existing
+ * data in the builder buffer is included in the AML scope.
+ *
+ * On success, an object decoded as below is stored at @buf.
+ *   Scope (@name)
+ *   {
+ *     the original content in @buf
+ *   }
+ *
+ * Refer to ACPI spec 6.1, Sec 20.2.5.1 "Namespace Modifier Objects
+ * Encoding" - "DefScope".
+ *
+ * Parameters:
+ *   buf:  pointer to the builder buffer
+ *   name: the name of the scope
+ *
+ * Return:
+ *   0 on success, -1 on failure.
+ */
+int aml_prepend_scope(uint8_t *buf, const char *name);
+
+#endif /* _AML_BUILD_H_ */
diff --git a/tools/libxl/Makefile b/tools/libxl/Makefile
index f00d9ef355..a3e5e9909f 100644
--- a/tools/libxl/Makefile
+++ b/tools/libxl/Makefile
@@ -77,11 +77,12 @@ endif
 
 ACPI_PATH  = $(XEN_ROOT)/tools/libacpi
 DSDT_FILES-$(CONFIG_X86) = dsdt_pvh.c
-ACPI_OBJS  = $(patsubst %.c,%.o,$(DSDT_FILES-y)) build.o static_tables.o
+ACPI_OBJS  = $(patsubst %.c,%.o,$(DSDT_FILES-y)) build.o static_tables.o aml_build.o
 $(DSDT_FILES-y): acpi
 $(ACPI_OBJS): CFLAGS += -I. -DLIBACPI_STDUTILS=\"$(CURDIR)/libxl_x86_acpi.h\"
 vpath build.c $(ACPI_PATH)/
 vpath static_tables.c $(ACPI_PATH)/
+vpath aml_build.c $(ACPI_PATH)/
 LIBXL_OBJS-$(CONFIG_X86) += $(ACPI_OBJS)
 
 .PHONY: acpi
-- 
2.12.0


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [RFC XEN PATCH v2 11/15] tools/libacpi: load ACPI built by the device model
  2017-03-20  0:09 [RFC XEN PATCH v2 00/15] Add vNVDIMM support to HVM domains Haozhong Zhang
                   ` (9 preceding siblings ...)
  2017-03-20  0:09 ` [RFC XEN PATCH v2 10/15] tools/libacpi: add a simple AML builder Haozhong Zhang
@ 2017-03-20  0:09 ` Haozhong Zhang
  2017-03-20  0:09 ` [RFC XEN PATCH v2 12/15] tools/libxl: build qemu options from xl vNVDIMM configs Haozhong Zhang
                   ` (5 subsequent siblings)
  16 siblings, 0 replies; 32+ messages in thread
From: Haozhong Zhang @ 2017-03-20  0:09 UTC (permalink / raw)
  To: xen-devel
  Cc: Haozhong Zhang, Wei Liu, Andrew Cooper, Ian Jackson, Jan Beulich,
	Konrad Rzeszutek Wilk, Dan Williams

ACPI tables built by the device model, whose signatures do not
conflict with tables built by Xen (except SSDT), are loaded after ACPI
tables built by Xen.

ACPI namespace devices built by the device model, whose names do not
conflict with devices built by Xen, are assembled and placed in SSDTs
after ACPI tables built by Xen.
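
As a rough illustration of the intended flow (all helpers below are
hypothetical and only sketch the behaviour described above; the
xs_ops/mem_ops callbacks and HVM_XS_DM_ACPI_ROOT come from earlier
patches in this series):

    /* Sketch: enumerate DM ACPI blobs and load those that do not conflict. */
    unsigned int i, num;
    char **entries = ctxt->xs_ops.directory(ctxt, HVM_XS_DM_ACPI_ROOT, &num);

    for ( i = 0; entries && i < num; i++ )
    {
        /* dm_blob_header() and the checks below are hypothetical helpers */
        struct acpi_header *h = dm_blob_header(ctxt, entries[i]);

        if ( blob_is_table(entries[i]) && !signature_blacklisted(h->signature) )
            add_dm_table(h);              /* load the table as-is */
        else if ( blob_is_nsdev(entries[i]) && !devname_blacklisted(entries[i]) )
            wrap_nsdev_in_ssdt(ctxt, h);  /* wrap the AML via the AML builder */
    }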

Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
---
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Wei Liu <wei.liu2@citrix.com>

Changes in v2:
 * Add dm_acpi_{signature, devname}_blacklist_register() to let each acpi
   construct function register tables and namespace devices that should
   not be loaded from DM ACPI.
---
 tools/firmware/hvmloader/util.c |  15 ++
 tools/libacpi/acpi2_0.h         |   2 +
 tools/libacpi/build.c           | 311 ++++++++++++++++++++++++++++++++++++++++
 tools/libacpi/libacpi.h         |   8 ++
 4 files changed, 336 insertions(+)

diff --git a/tools/firmware/hvmloader/util.c b/tools/firmware/hvmloader/util.c
index ec0de711c2..295b748829 100644
--- a/tools/firmware/hvmloader/util.c
+++ b/tools/firmware/hvmloader/util.c
@@ -1000,6 +1000,21 @@ void hvmloader_acpi_build_tables(struct acpi_config *config,
     if ( !strncmp(xenstore_read("platform/acpi_s4", "1"), "1", 1)  )
         config->table_flags |= ACPI_HAS_SSDT_S4;
 
+    s = xenstore_read(HVM_XS_DM_ACPI_ADDRESS, NULL);
+    if ( s )
+    {
+        config->dm.addr = strtoll(s, NULL, 0);
+
+        s = xenstore_read(HVM_XS_DM_ACPI_LENGTH, NULL);
+        if ( s )
+        {
+            config->dm.length = strtoll(s, NULL, 0);
+            config->table_flags |= ACPI_HAS_DM;
+        }
+        else
+            config->dm.addr = 0;
+    }
+
     config->table_flags |= (ACPI_HAS_TCPA | ACPI_HAS_IOAPIC |
                             ACPI_HAS_WAET | ACPI_HAS_PMTIMER |
                             ACPI_HAS_BUTTONS | ACPI_HAS_VGA |
diff --git a/tools/libacpi/acpi2_0.h b/tools/libacpi/acpi2_0.h
index 2619ba32db..365825e6bc 100644
--- a/tools/libacpi/acpi2_0.h
+++ b/tools/libacpi/acpi2_0.h
@@ -435,6 +435,7 @@ struct acpi_20_slit {
 #define ACPI_2_0_WAET_SIGNATURE ASCII32('W','A','E','T')
 #define ACPI_2_0_SRAT_SIGNATURE ASCII32('S','R','A','T')
 #define ACPI_2_0_SLIT_SIGNATURE ASCII32('S','L','I','T')
+#define ACPI_2_0_SSDT_SIGNATURE ASCII32('S','S','D','T')
 
 /*
  * Table revision numbers.
@@ -449,6 +450,7 @@ struct acpi_20_slit {
 #define ACPI_1_0_FADT_REVISION 0x01
 #define ACPI_2_0_SRAT_REVISION 0x01
 #define ACPI_2_0_SLIT_REVISION 0x01
+#define ACPI_2_0_SSDT_REVISION 0x02
 
 #pragma pack ()
 
diff --git a/tools/libacpi/build.c b/tools/libacpi/build.c
index a02ffbf43c..62c2570b7d 100644
--- a/tools/libacpi/build.c
+++ b/tools/libacpi/build.c
@@ -15,6 +15,7 @@
 
 #include LIBACPI_STDUTILS
 #include "acpi2_0.h"
+#include "aml_build.h"
 #include "libacpi.h"
 #include "ssdt_s3.h"
 #include "ssdt_s4.h"
@@ -55,6 +56,14 @@ struct acpi_info {
     uint64_t pci_hi_min, pci_hi_len; /* 24, 32 - PCI I/O hole boundaries */
 };
 
+#define DM_ACPI_BLOB_TYPE_TABLE 0 /* ACPI table */
+#define DM_ACPI_BLOB_TYPE_NSDEV 1 /* AML of an ACPI namespace device */
+
+/* ACPI tables with the following signatures should not appear in DM ACPI */
+static uint64_t dm_acpi_signature_blacklist[64];
+/* ACPI namespace devices with the following names should not appear in DM ACPI */
+static const char *dm_acpi_devname_blacklist[64];
+
 static void set_checksum(
     void *table, uint32_t checksum_offset, uint32_t length)
 {
@@ -339,6 +348,281 @@ static int construct_passthrough_tables(struct acpi_ctxt *ctxt,
     return nr_added;
 }
 
+static bool has_dm_tables(struct acpi_ctxt *ctxt,
+                          const struct acpi_config *config)
+{
+    char **dir;
+    unsigned int num;
+
+    if ( !(config->table_flags & ACPI_HAS_DM) || !config->dm.addr )
+        return false;
+
+    dir = ctxt->xs_ops.directory(ctxt, HVM_XS_DM_ACPI_ROOT, &num);
+    if ( !dir || !num )
+        return false;
+
+    return true;
+}
+
+static int dm_acpi_signature_blacklist_register(const struct acpi_config *config,
+                                                uint64_t sig)
+{
+    unsigned int i, nr = ARRAY_SIZE(dm_acpi_signature_blacklist);
+
+    if ( !(config->table_flags & ACPI_HAS_DM) )
+        return 0;
+
+    for ( i = 0; i < nr; i++ )
+    {
+        uint64_t entry = dm_acpi_signature_blacklist[i];
+        if ( entry == sig )
+            return 0;
+        else if ( entry == 0 )
+            break;
+    }
+
+    if ( i >= nr )
+        return -ENOSPC;
+
+    dm_acpi_signature_blacklist[i] = sig;
+    return 0;
+}
+
+static int dm_acpi_devname_blacklist_register(const struct acpi_config *config,
+                                              const char *devname)
+{
+    unsigned int i, nr = ARRAY_SIZE(dm_acpi_devname_blacklist);
+
+    if ( !(config->table_flags & ACPI_HAS_DM) )
+        return 0;
+
+    for ( i = 0; i < nr; i++ )
+    {
+        const char *entry = dm_acpi_devname_blacklist[i];
+        if ( !entry )
+            break;
+        if ( !strncmp(entry, devname, 4) )
+            return 0;
+    }
+
+    if ( i >= nr )
+        return -ENOSPC;
+
+    dm_acpi_devname_blacklist[i] = devname;
+    return 0;
+}
+
+/* Return true if no collision is found. */
+static bool check_signature_collision(uint64_t sig)
+{
+    unsigned int i;
+    for ( i = 0; i < ARRAY_SIZE(dm_acpi_signature_blacklist); i++ )
+    {
+        if ( sig == dm_acpi_signature_blacklist[i] )
+            return false;
+    }
+    return true;
+}
+
+/* Return true if no collision is found. */
+static bool check_devname_collision(const char *name)
+{
+    unsigned int i;
+    for ( i = 0; i < ARRAY_SIZE(dm_acpi_devname_blacklist); i++ )
+    {
+        const char *entry = dm_acpi_devname_blacklist[i];
+
+        /* entries past the last registered name are NULL */
+        if ( !entry )
+            break;
+        if ( !strncmp(name, entry, 4) )
+            return false;
+    }
+    return true;
+}
+
+static const char *xs_read_dm_acpi_blob_key(struct acpi_ctxt *ctxt,
+                                            const char *name, const char *key)
+{
+/*
+ * name is supposed to be 4 characters at most, and the longest @key
+ * so far is 'address' (7), so 30 characters is enough to hold the
+ * longest path HVM_XS_DM_ACPI_ROOT/name/key.
+ */
+#define DM_ACPI_BLOB_PATH_MAX_LENGTH   30
+    char path[DM_ACPI_BLOB_PATH_MAX_LENGTH];
+    snprintf(path, DM_ACPI_BLOB_PATH_MAX_LENGTH, HVM_XS_DM_ACPI_ROOT"/%s/%s",
+             name, key);
+    return ctxt->xs_ops.read(ctxt, path);
+}
+
+static bool construct_dm_table(struct acpi_ctxt *ctxt,
+                               unsigned long *table_ptrs,
+                               unsigned int nr_tables,
+                               const void *blob, uint32_t length)
+{
+    const struct acpi_header *header = blob;
+    uint8_t *buffer;
+
+    if ( !check_signature_collision(header->signature) )
+        return false;
+
+    if ( header->length > length || header->length == 0 )
+        return false;
+
+    buffer = ctxt->mem_ops.alloc(ctxt, header->length, 16);
+    if ( !buffer )
+        return false;
+    memcpy(buffer, header, header->length);
+
+    /* some device models (e.g. QEMU) do not set the checksum */
+    set_checksum(buffer, offsetof(struct acpi_header, checksum),
+                 header->length);
+
+    table_ptrs[nr_tables++] = ctxt->mem_ops.v2p(ctxt, buffer);
+
+    return true;
+}
+
+static bool construct_dm_nsdev(struct acpi_ctxt *ctxt,
+                               unsigned long *table_ptrs,
+                               unsigned int nr_tables,
+                               const char *dev_name,
+                               const void *blob, uint32_t blob_length)
+{
+    struct acpi_header ssdt, *header;
+    uint8_t *buffer;
+    int rc;
+
+    if ( !check_devname_collision(dev_name) )
+        return false;
+
+#define AML_BUILD(STMT)           \
+    do {                          \
+        rc = STMT;                \
+        if ( rc )                 \
+            goto out;             \
+    } while (0)
+
+    /* build the ACPI namespace device from [name, blob] */
+    buffer = aml_build_begin(ctxt);
+    if ( !buffer )
+        return false;
+
+    AML_BUILD(aml_prepend_blob(buffer, blob, blob_length));
+    AML_BUILD(aml_prepend_device(buffer, dev_name));
+    AML_BUILD(aml_prepend_scope(buffer, "\\_SB"));
+
+    /* build SSDT header */
+    ssdt.signature = ACPI_2_0_SSDT_SIGNATURE;
+    ssdt.revision = ACPI_2_0_SSDT_REVISION;
+    fixed_strcpy(ssdt.oem_id, ACPI_OEM_ID);
+    fixed_strcpy(ssdt.oem_table_id, ACPI_OEM_TABLE_ID);
+    ssdt.oem_revision = ACPI_OEM_REVISION;
+    ssdt.creator_id = ACPI_CREATOR_ID;
+    ssdt.creator_revision = ACPI_CREATOR_REVISION;
+
+    /* prepend SSDT header to ACPI namespace device */
+    AML_BUILD(aml_prepend_blob(buffer, &ssdt, sizeof(ssdt)));
+
+out:
+    if ( rc )
+    {
+        /* finish the build before bailing out; @header is not valid here */
+        aml_build_end();
+        return false;
+    }
+
+    header = (struct acpi_header *) buffer;
+    header->length = aml_build_end();
+
+    /* calculate checksum of SSDT */
+    set_checksum(header, offsetof(struct acpi_header, checksum),
+                 header->length);
+
+    table_ptrs[nr_tables++] = ctxt->mem_ops.v2p(ctxt, buffer);
+
+    return true;
+}
+
+/*
+ * All ACPI content built by the device model is placed in a guest
+ * buffer whose address and size are specified by config->dm.{addr, length},
+ * i.e. the XenStore keys HVM_XS_DM_ACPI_{ADDRESS, LENGTH}.
+ *
+ * The data layout within the buffer is further specified by XenStore
+ * directories under HVM_XS_DM_ACPI_ROOT. Each directory describes one
+ * data blob and contains the following XenStore keys:
+ *
+ * - "type":
+ *   * DM_ACPI_BLOB_TYPE_TABLE
+ *     The data blob specified by this directory is an ACPI table.
+ *   * DM_ACPI_BLOB_TYPE_NSDEV
+ *     The data blob specified by this directory is an ACPI namespace device.
+ *     Its name is specified by the directory name, while the AML code of the
+ *     body of the AML device structure is in the data blob.
+ *
+ * - "length": the number of bytes in this data blob.
+ *
+ * - "offset": the offset in bytes of this data blob from the beginning of buffer
+ */
+static int construct_dm_tables(struct acpi_ctxt *ctxt,
+                               unsigned long *table_ptrs,
+                               unsigned int nr_tables,
+                               struct acpi_config *config)
+{
+    const char *s;
+    char **dir;
+    uint8_t type;
+    void *blob;
+    unsigned int num, length, offset, i, nr_added = 0;
+
+    if ( !config->dm.addr )
+        return 0;
+
+    dir = ctxt->xs_ops.directory(ctxt, HVM_XS_DM_ACPI_ROOT, &num);
+    if ( !dir || !num )
+        return 0;
+
+    if ( num > ACPI_MAX_SECONDARY_TABLES - nr_tables )
+        return 0;
+
+    for ( i = 0; i < num; i++, dir++ )
+    {
+        if ( *dir == NULL )
+            continue;
+
+        s = xs_read_dm_acpi_blob_key(ctxt, *dir, "type");
+        if ( !s )
+            continue;
+        type = (uint8_t)strtoll(s, NULL, 0);
+
+        s = xs_read_dm_acpi_blob_key(ctxt, *dir, "length");
+        if ( !s )
+            continue;
+        length = (uint32_t)strtoll(s, NULL, 0);
+
+        s = xs_read_dm_acpi_blob_key(ctxt, *dir, "offset");
+        if ( !s )
+            continue;
+        offset = (uint32_t)strtoll(s, NULL, 0);
+
+        blob = ctxt->mem_ops.p2v(ctxt, config->dm.addr + offset);
+
+        switch ( type )
+        {
+        case DM_ACPI_BLOB_TYPE_TABLE:
+            nr_added += construct_dm_table(ctxt,
+                                           table_ptrs, nr_tables + nr_added,
+                                           blob, length);
+            break;
+
+        case DM_ACPI_BLOB_TYPE_NSDEV:
+            nr_added += construct_dm_nsdev(ctxt,
+                                           table_ptrs, nr_tables + nr_added,
+                                           *dir, blob, length);
+            break;
+
+        default:
+            /* skip blobs of unknown types */
+            continue;
+        }
+    }
+
+    return nr_added;
+}
+
 static int construct_secondary_tables(struct acpi_ctxt *ctxt,
                                       unsigned long *table_ptrs,
                                       struct acpi_config *config,
@@ -359,6 +643,7 @@ static int construct_secondary_tables(struct acpi_ctxt *ctxt,
         madt = construct_madt(ctxt, config, info);
         if (!madt) return -1;
         table_ptrs[nr_tables++] = ctxt->mem_ops.v2p(ctxt, madt);
+        dm_acpi_signature_blacklist_register(config, madt->header.signature);
     }
 
     /* HPET. */
@@ -367,6 +652,7 @@ static int construct_secondary_tables(struct acpi_ctxt *ctxt,
         hpet = construct_hpet(ctxt, config);
         if (!hpet) return -1;
         table_ptrs[nr_tables++] = ctxt->mem_ops.v2p(ctxt, hpet);
+        dm_acpi_signature_blacklist_register(config, hpet->header.signature);
     }
 
     /* WAET. */
@@ -376,6 +662,7 @@ static int construct_secondary_tables(struct acpi_ctxt *ctxt,
         if ( !waet )
             return -1;
         table_ptrs[nr_tables++] = ctxt->mem_ops.v2p(ctxt, waet);
+        dm_acpi_signature_blacklist_register(config, waet->header.signature);
     }
 
     if ( config->table_flags & ACPI_HAS_SSDT_PM )
@@ -384,6 +671,9 @@ static int construct_secondary_tables(struct acpi_ctxt *ctxt,
         if (!ssdt) return -1;
         memcpy(ssdt, ssdt_pm, sizeof(ssdt_pm));
         table_ptrs[nr_tables++] = ctxt->mem_ops.v2p(ctxt, ssdt);
+        dm_acpi_devname_blacklist_register(config, "AC");
+        dm_acpi_devname_blacklist_register(config, "BAT0");
+        dm_acpi_devname_blacklist_register(config, "BAT1");
     }
 
     if ( config->table_flags & ACPI_HAS_SSDT_S3 )
@@ -439,6 +729,8 @@ static int construct_secondary_tables(struct acpi_ctxt *ctxt,
                          offsetof(struct acpi_header, checksum),
                          tcpa->header.length);
         }
+        dm_acpi_signature_blacklist_register(config, tcpa->header.signature);
+        dm_acpi_devname_blacklist_register(config, "TPM");
     }
 
     /* SRAT and SLIT */
@@ -448,11 +740,17 @@ static int construct_secondary_tables(struct acpi_ctxt *ctxt,
         struct acpi_20_slit *slit = construct_slit(ctxt, config);
 
         if ( srat )
+        {
             table_ptrs[nr_tables++] = ctxt->mem_ops.v2p(ctxt, srat);
+            dm_acpi_signature_blacklist_register(config, srat->header.signature);
+        }
         else
             printf("Failed to build SRAT, skipping...\n");
         if ( slit )
+        {
             table_ptrs[nr_tables++] = ctxt->mem_ops.v2p(ctxt, slit);
+            dm_acpi_signature_blacklist_register(config, slit->header.signature);
+        }
         else
             printf("Failed to build SLIT, skipping...\n");
     }
@@ -461,6 +759,9 @@ static int construct_secondary_tables(struct acpi_ctxt *ctxt,
     nr_tables += construct_passthrough_tables(ctxt, table_ptrs,
                                               nr_tables, config);
 
+    /* Load any additional tables passed from device model (e.g. QEMU). */
+    nr_tables += construct_dm_tables(ctxt, table_ptrs, nr_tables, config);
+
     table_ptrs[nr_tables] = 0;
     return nr_tables;
 }
@@ -525,6 +826,9 @@ int acpi_build_tables(struct acpi_ctxt *ctxt, struct acpi_config *config)
         acpi_info->pci_hi_len = config->pci_hi_len;
     }
 
+    if ( !has_dm_tables(ctxt, config) )
+        config->table_flags &= ~ACPI_HAS_DM;
+
     /*
      * Fill in high-memory data structures, starting at @buf.
      */
@@ -532,6 +836,7 @@ int acpi_build_tables(struct acpi_ctxt *ctxt, struct acpi_config *config)
     facs = ctxt->mem_ops.alloc(ctxt, sizeof(struct acpi_20_facs), 16);
     if (!facs) goto oom;
     memcpy(facs, &Facs, sizeof(struct acpi_20_facs));
+    dm_acpi_signature_blacklist_register(config, facs->signature);
 
     /*
      * Alternative DSDTs we get linked against. A cover-all DSDT for up to the
@@ -553,6 +858,8 @@ int acpi_build_tables(struct acpi_ctxt *ctxt, struct acpi_config *config)
         if (!dsdt) goto oom;
         memcpy(dsdt, config->dsdt_anycpu, config->dsdt_anycpu_len);
     }
+    dm_acpi_devname_blacklist_register(config, "MEM0");
+    dm_acpi_devname_blacklist_register(config, "PCI0");
 
     /*
      * N.B. ACPI 1.0 operating systems may not handle FADT with revision 2
@@ -623,6 +930,7 @@ int acpi_build_tables(struct acpi_ctxt *ctxt, struct acpi_config *config)
         fadt->iapc_boot_arch |= ACPI_FADT_NO_CMOS_RTC;
     }
     set_checksum(fadt, offsetof(struct acpi_header, checksum), fadt_size);
+    dm_acpi_signature_blacklist_register(config, fadt->header.signature);
 
     nr_secondaries = construct_secondary_tables(ctxt, secondary_tables,
                  config, acpi_info);
@@ -641,6 +949,7 @@ int acpi_build_tables(struct acpi_ctxt *ctxt, struct acpi_config *config)
     set_checksum(xsdt,
                  offsetof(struct acpi_header, checksum),
                  xsdt->header.length);
+    dm_acpi_signature_blacklist_register(config, xsdt->header.signature);
 
     rsdt = ctxt->mem_ops.alloc(ctxt, sizeof(struct acpi_20_rsdt) +
                                sizeof(uint32_t) * nr_secondaries,
@@ -654,6 +963,7 @@ int acpi_build_tables(struct acpi_ctxt *ctxt, struct acpi_config *config)
     set_checksum(rsdt,
                  offsetof(struct acpi_header, checksum),
                  rsdt->header.length);
+    dm_acpi_signature_blacklist_register(config, rsdt->header.signature);
 
     /*
      * Fill in low-memory data structures: acpi_info and RSDP.
@@ -669,6 +979,7 @@ int acpi_build_tables(struct acpi_ctxt *ctxt, struct acpi_config *config)
     set_checksum(rsdp,
                  offsetof(struct acpi_20_rsdp, extended_checksum),
                  sizeof(struct acpi_20_rsdp));
+    dm_acpi_signature_blacklist_register(config, rsdp->signature);
 
     if ( !new_vm_gid(ctxt, config, acpi_info) )
         goto oom;
diff --git a/tools/libacpi/libacpi.h b/tools/libacpi/libacpi.h
index 645e091f68..b41113ff3f 100644
--- a/tools/libacpi/libacpi.h
+++ b/tools/libacpi/libacpi.h
@@ -20,6 +20,8 @@
 #ifndef __LIBACPI_H__
 #define __LIBACPI_H__
 
+#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))
+
 #define ACPI_HAS_COM1        (1<<0)
 #define ACPI_HAS_COM2        (1<<1)
 #define ACPI_HAS_LPT1        (1<<2)
@@ -35,6 +37,7 @@
 #define ACPI_HAS_VGA         (1<<12)
 #define ACPI_HAS_8042        (1<<13)
 #define ACPI_HAS_CMOS_RTC    (1<<14)
+#define ACPI_HAS_DM          (1<<15)
 
 struct xen_vmemrange;
 struct acpi_numa {
@@ -87,6 +90,11 @@ struct acpi_config {
         uint32_t length;
     } pt;
 
+    struct {
+        uint32_t addr;
+        uint32_t length;
+    } dm;
+
     struct acpi_numa numa;
     const struct hvm_info_table *hvminfo;
 
-- 
2.12.0


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [RFC XEN PATCH v2 12/15] tools/libxl: build qemu options from xl vNVDIMM configs
  2017-03-20  0:09 [RFC XEN PATCH v2 00/15] Add vNVDIMM support to HVM domains Haozhong Zhang
                   ` (10 preceding siblings ...)
  2017-03-20  0:09 ` [RFC XEN PATCH v2 11/15] tools/libacpi: load ACPI built by the device model Haozhong Zhang
@ 2017-03-20  0:09 ` Haozhong Zhang
  2017-03-20  0:09 ` [RFC XEN PATCH v2 13/15] tools/libxl: add support to map host pmem device to guests Haozhong Zhang
                   ` (4 subsequent siblings)
  16 siblings, 0 replies; 32+ messages in thread
From: Haozhong Zhang @ 2017-03-20  0:09 UTC (permalink / raw)
  To: xen-devel
  Cc: Konrad Rzeszutek Wilk, Dan Williams, Ian Jackson, Wei Liu,
	Haozhong Zhang

For xl configs
  vnvdimms = [ '/path/to/pmem0', '/path/to/pmem1', ... ]

the following qemu options are built
  -machine <existing options>,nvdimm
  -m <existing options>,slots=$NR_SLOTS,maxmem=$MEM_SIZE
  -object memory-backend-xen,id=mem1,size=$PMEM0_SIZE,mem-path=/path/to/pmem0
  -device nvdimm,id=nvdimm1,memdev=mem1
  -object memory-backend-xen,id=mem2,size=$PMEM1_SIZE,mem-path=/path/to/pmem1
  -device nvdimm,id=nvdimm2,memdev=mem2
  ...
where
 - NR_SLOTS is the number of entries in vnvdimms + 1,
 - MEM_SIZE is the total size of all RAM and NVDIMM devices,
 - PMEM#_SIZE is the size of the host pmem device/file '/path/to/pmem#'.
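
As a worked example (illustrative numbers only): for a guest with 4096 MiB
of RAM and a single 8 GiB pmem backend, the patch would generate roughly

  -m 4096,slots=2,maxmem=12884901888
  -object memory-backend-xen,id=mem1,size=8589934592,mem-path=/path/to/pmem0
  -device nvdimm,id=nvdimm1,memdev=mem1

i.e. NR_SLOTS = 1 + 1 = 2, and maxmem is the byte total of RAM plus NVDIMM
(4 GiB + 8 GiB), rounded up to a 4 KiB boundary by the code.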

Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
---
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Wei Liu <wei.liu2@citrix.com>

The qemu option "-object memory-backend-xen" is added by the QEMU patch
"hostmem: add a host memory backend for Xen". Other qemu options have been
implemented since QEMU 2.6.0.

Changes in v2:
 * Update the manpage of xl.cfg for the new option "vnvdimms".
---
 docs/man/xl.cfg.pod.5.in    |   6 +++
 tools/libxl/libxl_dm.c      | 109 +++++++++++++++++++++++++++++++++++++++++++-
 tools/libxl/libxl_types.idl |   8 ++++
 tools/xl/xl_parse.c         |  16 +++++++
 4 files changed, 137 insertions(+), 2 deletions(-)

diff --git a/docs/man/xl.cfg.pod.5.in b/docs/man/xl.cfg.pod.5.in
index 505c11137f..53296595ff 100644
--- a/docs/man/xl.cfg.pod.5.in
+++ b/docs/man/xl.cfg.pod.5.in
@@ -1064,6 +1064,12 @@ FIFO-based event channel ABI support up to 131,071 event channels.
 Other guests are limited to 4095 (64-bit x86 and ARM) or 1023 (32-bit
 x86).
 
+=item B<vnvdimms=[ 'PATH0', 'PATH1', ... ]>
+
+Specify the virtual NVDIMM devices which are provided to the guest.
+B<PATH0>, B<PATH1>, ... specify the host pmem devices which are used as the
+backing storage of the corresponding virtual NVDIMM devices.
+
 =back
 
 =head2 Paravirtualised (PV) Guest Specific Options
diff --git a/tools/libxl/libxl_dm.c b/tools/libxl/libxl_dm.c
index 281058de45..695396da2d 100644
--- a/tools/libxl/libxl_dm.c
+++ b/tools/libxl/libxl_dm.c
@@ -24,6 +24,10 @@
 #include <sys/types.h>
 #include <pwd.h>
 
+#if defined(__linux__)
+#include <linux/fs.h> /* for ioctl(BLKGETSIZE64) */
+#endif
+
 static const char *libxl_tapif_script(libxl__gc *gc)
 {
 #if defined(__linux__) || defined(__FreeBSD__)
@@ -910,6 +914,82 @@ static char *qemu_disk_ide_drive_string(libxl__gc *gc, const char *target_path,
     return drive;
 }
 
+#if defined(__linux__)
+
+static uint64_t libxl__build_dm_vnvdimm_args(libxl__gc *gc, flexarray_t *dm_args,
+                                             struct libxl_device_vnvdimm *dev,
+                                             int dev_no)
+{
+    int fd, rc;
+    struct stat st;
+    uint64_t size = 0;
+    char *arg;
+
+    fd = open(dev->file, O_RDONLY);
+    if (fd < 0) {
+        LOG(ERROR, "failed to open file %s: %s",
+            dev->file, strerror(errno));
+        goto out;
+    }
+
+    if (stat(dev->file, &st)) {
+        LOG(ERROR, "failed to get status of file %s: %s",
+            dev->file, strerror(errno));
+        goto out_fclose;
+    }
+
+    switch (st.st_mode & S_IFMT) {
+    case S_IFBLK:
+        rc = ioctl(fd, BLKGETSIZE64, &size);
+        if (rc == -1) {
+            LOG(ERROR, "failed to get size of block device %s: %s",
+                dev->file, strerror(errno));
+            size = 0;
+        }
+        break;
+
+    default:
+        LOG(ERROR, "%s not block device", dev->file);
+        break;
+    }
+
+    if (!size)
+        goto out_fclose;
+
+    flexarray_append(dm_args, "-object");
+    arg = GCSPRINTF("memory-backend-xen,id=mem%d,size=%"PRIu64",mem-path=%s",
+                    dev_no + 1, size, dev->file);
+    flexarray_append(dm_args, arg);
+
+    flexarray_append(dm_args, "-device");
+    arg = GCSPRINTF("nvdimm,id=nvdimm%d,memdev=mem%d", dev_no + 1, dev_no + 1);
+    flexarray_append(dm_args, arg);
+
+ out_fclose:
+    close(fd);
+ out:
+    return size;
+}
+
+static uint64_t libxl__build_dm_vnvdimms_args(
+    libxl__gc *gc, flexarray_t *dm_args,
+    struct libxl_device_vnvdimm *vnvdimms, int num_vnvdimms)
+{
+    uint64_t total_size = 0, size;
+    unsigned int i;
+
+    for (i = 0; i < num_vnvdimms; i++) {
+        size = libxl__build_dm_vnvdimm_args(gc, dm_args, &vnvdimms[i], i);
+        if (!size)
+            break;
+        total_size += size;
+    }
+
+    return total_size;
+}
+
+#endif /* __linux__ */
+
 static int libxl__build_device_model_args_new(libxl__gc *gc,
                                         const char *dm, int guest_domid,
                                         const libxl_domain_config *guest_config,
@@ -923,13 +1003,18 @@ static int libxl__build_device_model_args_new(libxl__gc *gc,
     const libxl_device_nic *nics = guest_config->nics;
     const int num_disks = guest_config->num_disks;
     const int num_nics = guest_config->num_nics;
+#if defined(__linux__)
+    const int num_vnvdimms = guest_config->num_vnvdimms;
+#else
+    const int num_vnvdimms = 0;
+#endif
     const libxl_vnc_info *vnc = libxl__dm_vnc(guest_config);
     const libxl_sdl_info *sdl = dm_sdl(guest_config);
     const char *keymap = dm_keymap(guest_config);
     char *machinearg;
     flexarray_t *dm_args, *dm_envs;
     int i, connection, devid, ret;
-    uint64_t ram_size;
+    uint64_t ram_size, ram_size_in_byte, vnvdimms_size = 0;
     const char *path, *chardev;
     char *user = NULL;
 
@@ -1313,6 +1398,9 @@ static int libxl__build_device_model_args_new(libxl__gc *gc,
             }
         }
 
+        if (num_vnvdimms)
+            machinearg = libxl__sprintf(gc, "%s,nvdimm", machinearg);
+
         flexarray_append(dm_args, machinearg);
         for (i = 0; b_info->extra_hvm && b_info->extra_hvm[i] != NULL; i++)
             flexarray_append(dm_args, b_info->extra_hvm[i]);
@@ -1322,8 +1410,25 @@ static int libxl__build_device_model_args_new(libxl__gc *gc,
     }
 
     ram_size = libxl__sizekb_to_mb(b_info->max_memkb - b_info->video_memkb);
+    ram_size_in_byte = ram_size * 1024 * 1024;
+    if (num_vnvdimms) {
+        vnvdimms_size = libxl__build_dm_vnvdimms_args(gc, dm_args,
+                                                     guest_config->vnvdimms,
+                                                     num_vnvdimms);
+        if (ram_size_in_byte + vnvdimms_size < ram_size_in_byte) {
+            LOG(ERROR,
+                "total size of RAM (%"PRIu64") and NVDIMM (%"PRIu64") overflow",
+                ram_size_in_byte, vnvdimms_size);
+            return ERROR_INVAL;
+        }
+    }
     flexarray_append(dm_args, "-m");
-    flexarray_append(dm_args, GCSPRINTF("%"PRId64, ram_size));
+    flexarray_append(dm_args,
+                     vnvdimms_size ?
+                     GCSPRINTF("%"PRId64",slots=%d,maxmem=%"PRId64,
+                               ram_size, num_vnvdimms + 1,
+                               ROUNDUP(ram_size_in_byte + vnvdimms_size, 12)) :
+                     GCSPRINTF("%"PRId64, ram_size));
 
     if (b_info->type == LIBXL_DOMAIN_TYPE_HVM) {
         if (b_info->u.hvm.hdtype == LIBXL_HDTYPE_AHCI)
diff --git a/tools/libxl/libxl_types.idl b/tools/libxl/libxl_types.idl
index a612d1f4ff..e1a3fd9279 100644
--- a/tools/libxl/libxl_types.idl
+++ b/tools/libxl/libxl_types.idl
@@ -704,6 +704,13 @@ libxl_device_channel = Struct("device_channel", [
            ])),
 ])
 
+libxl_device_vnvdimm = Struct("device_vnvdimm", [
+    ("backend_domid",   libxl_domid),
+    ("backend_domname", string),
+    ("devid",           libxl_devid),
+    ("file",            string),
+])
+
 libxl_domain_config = Struct("domain_config", [
     ("c_info", libxl_domain_create_info),
     ("b_info", libxl_domain_build_info),
@@ -721,6 +728,7 @@ libxl_domain_config = Struct("domain_config", [
     ("channels", Array(libxl_device_channel, "num_channels")),
     ("usbctrls", Array(libxl_device_usbctrl, "num_usbctrls")),
     ("usbdevs", Array(libxl_device_usbdev, "num_usbdevs")),
+    ("vnvdimms", Array(libxl_device_vnvdimm, "num_vnvdimms")),
 
     ("on_poweroff", libxl_action_on_shutdown),
     ("on_reboot", libxl_action_on_shutdown),
diff --git a/tools/xl/xl_parse.c b/tools/xl/xl_parse.c
index 1ef0c272a8..2cf8c9756b 100644
--- a/tools/xl/xl_parse.c
+++ b/tools/xl/xl_parse.c
@@ -718,6 +718,7 @@ void parse_config_data(const char *config_source,
     XLU_ConfigList *cpus, *vbds, *nics, *pcis, *cvfbs, *cpuids, *vtpms,
                    *usbctrls, *usbdevs;
     XLU_ConfigList *channels, *ioports, *irqs, *iomem, *viridian, *dtdevs;
+    XLU_ConfigList *vnvdimms;
     int num_ioports, num_irqs, num_iomem, num_cpus, num_viridian;
     int pci_power_mgmt = 0;
     int pci_msitranslate = 0;
@@ -1902,6 +1903,21 @@ skip_usbdev:
         }
      }
 
+    if (!xlu_cfg_get_list (config, "vnvdimms", &vnvdimms, 0, 0)) {
+#if defined(__linux__)
+        while ((buf = xlu_cfg_get_listitem(vnvdimms,
+                                           d_config->num_vnvdimms)) != NULL) {
+            libxl_device_vnvdimm *vnvdimm =
+                ARRAY_EXTEND_INIT(d_config->vnvdimms, d_config->num_vnvdimms,
+                                  libxl_device_vnvdimm_init);
+            vnvdimm->file = strdup(buf);
+        }
+#else
+        fprintf(stderr, "ERROR: vnvdimms is only supported on Linux\n");
+        exit(-ERROR_FAIL);
+#endif /* __linux__ */
+    }
+
     xlu_cfg_destroy(config);
 }
 
-- 
2.12.0


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [RFC XEN PATCH v2 13/15] tools/libxl: add support to map host pmem device to guests
  2017-03-20  0:09 [RFC XEN PATCH v2 00/15] Add vNVDIMM support to HVM domains Haozhong Zhang
                   ` (11 preceding siblings ...)
  2017-03-20  0:09 ` [RFC XEN PATCH v2 12/15] tools/libxl: build qemu options from xl vNVDIMM configs Haozhong Zhang
@ 2017-03-20  0:09 ` Haozhong Zhang
  2017-03-20  0:09 ` [RFC XEN PATCH v2 14/15] tools/libxl: initiate pmem mapping via qmp callback Haozhong Zhang
                   ` (3 subsequent siblings)
  16 siblings, 0 replies; 32+ messages in thread
From: Haozhong Zhang @ 2017-03-20  0:09 UTC (permalink / raw)
  To: xen-devel
  Cc: Konrad Rzeszutek Wilk, Dan Williams, Ian Jackson, Wei Liu,
	Haozhong Zhang

Either whole host pmem devices or files on host pmem devices can be
mapped to guests. This patch adds support for mapping whole pmem
devices. The implementation relies on the Linux pmem driver
(CONFIG_ACPI_NFIT, CONFIG_LIBNVDIMM, CONFIG_BLK_DEV_PMEM), so it
currently works only when libxl is compiled for Linux.
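
As a side note (not part of the patch), the same sysfs attributes that
libxl_nvdimm_add_device() relies on can be inspected with a small
standalone program; /dev/pmem0 and the printed values are illustrative:

    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/stat.h>
    #include <sys/sysmacros.h>

    int main(int argc, char *argv[])
    {
        struct stat st;
        char path[64];
        unsigned long long spa = 0, size = 0;
        FILE *f;

        /* e.g. argv[1] = "/dev/pmem0" */
        if ( argc != 2 || stat(argv[1], &st) || !S_ISBLK(st.st_mode) )
            return 1;

        /* start SPA of the pmem region backing this block device */
        snprintf(path, sizeof(path), "/sys/dev/block/%u:%u/device/resource",
                 (unsigned)major(st.st_rdev), (unsigned)minor(st.st_rdev));
        if ( (f = fopen(path, "r")) ) { fscanf(f, "%llx", &spa); fclose(f); }

        /* size of the region in bytes */
        snprintf(path, sizeof(path), "/sys/dev/block/%u:%u/device/size",
                 (unsigned)major(st.st_rdev), (unsigned)minor(st.st_rdev));
        if ( (f = fopen(path, "r")) ) { fscanf(f, "%llu", &size); fclose(f); }

        printf("%s: SPA 0x%llx, size %llu bytes\n", argv[1], spa, size);
        return 0;
    }

libxl_nvdimm_add_device() additionally requires the host SPA and the guest
SPA/size to be page-aligned.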

Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
---
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Wei Liu <wei.liu2@citrix.com>
---
 tools/libxl/Makefile       |   4 +
 tools/libxl/libxl_nvdimm.c | 182 +++++++++++++++++++++++++++++++++++++++++++++
 tools/libxl/libxl_nvdimm.h |  42 +++++++++++
 3 files changed, 228 insertions(+)
 create mode 100644 tools/libxl/libxl_nvdimm.c
 create mode 100644 tools/libxl/libxl_nvdimm.h

diff --git a/tools/libxl/Makefile b/tools/libxl/Makefile
index a3e5e9909f..6bfc78972f 100644
--- a/tools/libxl/Makefile
+++ b/tools/libxl/Makefile
@@ -115,6 +115,10 @@ endif
 endif
 endif
 
+ifeq ($(CONFIG_Linux), y)
+LIBXL_OBJS-y += libxl_nvdimm.o
+endif
+
 ifeq ($(FLEX),)
 %.c %.h:: %.l
 	$(warning Flex is needed to rebuild some libxl parsers and \
diff --git a/tools/libxl/libxl_nvdimm.c b/tools/libxl/libxl_nvdimm.c
new file mode 100644
index 0000000000..1b3c83f2ca
--- /dev/null
+++ b/tools/libxl/libxl_nvdimm.c
@@ -0,0 +1,182 @@
+/*
+ * tools/libxl/libxl_nvdimm.c
+ *
+ * Copyright (C) 2017,  Intel Corporation
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License, version 2.1, as published by the Free Software Foundation.
+ *
+ * This library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this library; If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include <errno.h>
+#include <stdint.h>
+#include <stdlib.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include "libxl_internal.h"
+#include "libxl_arch.h"
+#include "libxl_nvdimm.h"
+
+#include <xenctrl.h>
+
+#define BLK_DEVICE_ROOT "/sys/dev/block"
+
+static int nvdimm_sysfs_read(libxl__gc *gc,
+                             unsigned int major, unsigned int minor,
+                             const char *name, void **data_r)
+{
+    char *path = libxl__sprintf(gc, BLK_DEVICE_ROOT"/%u:%u/device/%s",
+                                major, minor, name);
+    return libxl__read_sysfs_file_contents(gc, path, data_r, NULL);
+}
+
+static int nvdimm_get_spa(libxl__gc *gc, unsigned int major, unsigned int minor,
+                          uint64_t *spa_r)
+{
+    void *data;
+    int ret = nvdimm_sysfs_read(gc, major, minor, "resource", &data);
+
+    if ( ret )
+        return ret;
+
+    *spa_r = strtoll(data, NULL, 0);
+    return 0;
+}
+
+static int nvdimm_get_size(libxl__gc *gc, unsigned int major, unsigned int minor,
+                           uint64_t *size_r)
+{
+    void *data;
+    int ret = nvdimm_sysfs_read(gc, major, minor, "size", &data);
+
+    if ( ret )
+        return ret;
+
+    *size_r = strtoll(data, NULL, 0);
+
+    return 0;
+}
+
+static int add_pages(libxl__gc *gc, uint32_t domid,
+                     xen_pfn_t mfn, xen_pfn_t gpfn, unsigned long nr_mfns)
+{
+    unsigned int nr;
+    int ret = 0;
+
+    while ( nr_mfns )
+    {
+        nr = min(nr_mfns, (unsigned long) UINT_MAX);
+
+        ret = xc_domain_populate_pmemmap(CTX->xch, domid, mfn, gpfn, nr);
+        if ( ret )
+        {
+            LOG(ERROR, "failed to map pmem pages, "
+                "mfn 0x%" PRIx64", gpfn 0x%" PRIx64 ", nr_mfns %u, err %d",
+                mfn, gpfn, nr, ret);
+            break;
+        }
+
+        nr_mfns -= nr;
+        mfn += nr;
+        gpfn += nr;
+    }
+
+    return ret;
+}
+
+int libxl_nvdimm_add_device(libxl__gc *gc,
+                            uint32_t domid, const char *path,
+                            uint64_t guest_spa, uint64_t guest_size)
+{
+    int fd;
+    struct stat st;
+    unsigned int major, minor;
+    uint64_t host_spa, host_size;
+    xen_pfn_t mfn, gpfn;
+    unsigned long nr_gpfns;
+    int ret;
+
+    if ( (guest_spa & ~XC_PAGE_MASK) || (guest_size & ~XC_PAGE_MASK) )
+        return -EINVAL;
+
+    fd = open(path, O_RDONLY);
+    if ( fd < 0 )
+    {
+        LOG(ERROR, "failed to open file %s (err: %d)", path, errno);
+        return -EIO;
+    }
+
+    ret = fstat(fd, &st);
+    if ( ret )
+    {
+        LOG(ERROR, "failed to get status of file %s (err: %d)",
+            path, errno);
+        goto out;
+    }
+
+    switch ( st.st_mode & S_IFMT )
+    {
+    case S_IFBLK:
+        major = major(st.st_rdev);
+        minor = minor(st.st_rdev);
+        break;
+
+    default:
+        LOG(ERROR, "only support block device now");
+        ret = -EINVAL;
+        goto out;
+    }
+
+    ret = nvdimm_get_spa(gc, major, minor, &host_spa);
+    if ( ret )
+    {
+        LOG(ERROR, "failed to get SPA of device %u:%u", major, minor);
+        goto out;
+    }
+    else if ( host_spa & ~XC_PAGE_MASK )
+    {
+        ret = -EINVAL;
+        goto out;
+    }
+
+    ret = nvdimm_get_size(gc, major, minor, &host_size);
+    if ( ret )
+    {
+        LOG(ERROR, "failed to get size of device %u:%u", major, minor);
+        goto out;
+    }
+    else if ( guest_size > host_size )
+    {
+        LOG(ERROR, "vNVDIMM size %" PRIu64 " expires NVDIMM size %" PRIu64,
+            guest_size, host_size);
+        ret = -EINVAL;
+        goto out;
+    }
+
+    mfn = host_spa >> XC_PAGE_SHIFT;
+    gpfn = guest_spa >> XC_PAGE_SHIFT;
+    nr_gpfns = guest_size >> XC_PAGE_SHIFT;
+    ret = add_pages(gc, domid, mfn, gpfn, nr_gpfns);
+
+ out:
+    close(fd);
+    return ret;
+}
+
+/*
+ * Local variables:
+ * mode: C
+ * c-basic-offset: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
diff --git a/tools/libxl/libxl_nvdimm.h b/tools/libxl/libxl_nvdimm.h
new file mode 100644
index 0000000000..e95be99c67
--- /dev/null
+++ b/tools/libxl/libxl_nvdimm.h
@@ -0,0 +1,42 @@
+/*
+ * tools/libxl/libxl_nvdimm.h
+ *
+ * Copyright (C) 2017,  Intel Corporation
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License, version 2.1, as published by the Free Software Foundation.
+ *
+ * This library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this library; If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#ifndef LIBXL_NVDIMM_H
+#define LIBXL_NVDIMM_H
+
+#include <stdint.h>
+#include "libxl_internal.h"
+
+#if defined(__linux__)
+
+int libxl_nvdimm_add_device(libxl__gc *gc,
+                            uint32_t domid, const char *path,
+                            uint64_t spa, uint64_t length);
+
+#else
+
+static inline int libxl_nvdimm_add_device(libxl__gc *gc,
+                                          uint32_t domid, const char *path,
+                                          uint64_t spa, uint64_t length)
+{
+    return -ENOSYS;
+}
+
+#endif /* __linux__ */
+
+#endif /* !LIBXL_NVDIMM_H */
-- 
2.12.0


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [RFC XEN PATCH v2 14/15] tools/libxl: initiate pmem mapping via qmp callback
  2017-03-20  0:09 [RFC XEN PATCH v2 00/15] Add vNVDIMM support to HVM domains Haozhong Zhang
                   ` (12 preceding siblings ...)
  2017-03-20  0:09 ` [RFC XEN PATCH v2 13/15] tools/libxl: add support to map host pmem device to guests Haozhong Zhang
@ 2017-03-20  0:09 ` Haozhong Zhang
  2017-03-20  0:09 ` [RFC XEN PATCH v2 15/15] tools/misc: add xen-ndctl Haozhong Zhang
                   ` (2 subsequent siblings)
  16 siblings, 0 replies; 32+ messages in thread
From: Haozhong Zhang @ 2017-03-20  0:09 UTC (permalink / raw)
  To: xen-devel
  Cc: Konrad Rzeszutek Wilk, Dan Williams, Ian Jackson, Wei Liu,
	Haozhong Zhang

Get the backend device, the guest SPA and the size of each vNVDIMM
device via the QMP commands "query-memory-devices devtype=nvdimm" and
"qom-get", and pass them to libxl_nvdimm_add_device() to map each
backend device into the guest.

Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
---
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Wei Liu <wei.liu2@citrix.com>

Changes in v2:
 * Fail the domain creation if QMP initialization for NVDIMM fails.
   Other failures in QMP initialization do not fail the domain creation
   as before.
---
 tools/libxl/libxl_create.c |   4 +-
 tools/libxl/libxl_qmp.c    | 116 ++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 117 insertions(+), 3 deletions(-)

diff --git a/tools/libxl/libxl_create.c b/tools/libxl/libxl_create.c
index e741b9a39a..b8c867d0fa 100644
--- a/tools/libxl/libxl_create.c
+++ b/tools/libxl/libxl_create.c
@@ -1510,7 +1510,9 @@ static void domcreate_devmodel_started(libxl__egc *egc,
     if (dcs->sdss.dm.guest_domid) {
         if (d_config->b_info.device_model_version
             == LIBXL_DEVICE_MODEL_VERSION_QEMU_XEN) {
-            libxl__qmp_initializations(gc, domid, d_config);
+            ret = libxl__qmp_initializations(gc, domid, d_config);
+            if (ret == ERROR_BADFAIL)
+                goto error_out;
         }
     }
 
diff --git a/tools/libxl/libxl_qmp.c b/tools/libxl/libxl_qmp.c
index a91643a4f9..244d4bee5a 100644
--- a/tools/libxl/libxl_qmp.c
+++ b/tools/libxl/libxl_qmp.c
@@ -26,6 +26,7 @@
 
 #include "_libxl_list.h"
 #include "libxl_internal.h"
+#include "libxl_nvdimm.h"
 
 /* #define DEBUG_RECEIVED */
 
@@ -1146,6 +1147,111 @@ out:
     return rc;
 }
 
+static int qmp_nvdimm_get_mempath(libxl__qmp_handler *qmp,
+                                  const libxl__json_object *o,
+                                  void *opaque)
+{
+    const char **output = opaque;
+    const char *mem_path;
+    int rc = 0;
+    GC_INIT(qmp->ctx);
+
+    if (!o) {
+        rc = ERROR_FAIL;
+        goto out;
+    }
+
+    mem_path = libxl__json_object_get_string(o);
+    if (!mem_path) {
+        rc = ERROR_FAIL;
+        goto out;
+    }
+    *output = libxl__strdup(NOGC, mem_path);
+
+ out:
+    GC_FREE;
+    return rc;
+}
+
+static int qmp_register_nvdimm_callback(libxl__qmp_handler *qmp,
+                                        const libxl__json_object *o,
+                                        void *unused)
+{
+    GC_INIT(qmp->ctx);
+    const libxl__json_object *obj, *sub_obj, *sub_map;
+    libxl__json_object *args = NULL;
+    unsigned int i = 0;
+    const char *mem_path = NULL, *memdev;
+    uint64_t slot, spa, size;
+    int rc = 0;
+
+    for (i = 0; (obj = libxl__json_array_get(o, i)); i++) {
+        if (!libxl__json_object_is_map(obj))
+            continue;
+
+        sub_map = libxl__json_map_get("data", obj, JSON_MAP);
+        if (!sub_map)
+            continue;
+
+        sub_obj = libxl__json_map_get("slot", sub_map, JSON_INTEGER);
+        slot = libxl__json_object_get_integer(sub_obj);
+
+        sub_obj = libxl__json_map_get("memdev", sub_map, JSON_STRING);
+        memdev = libxl__json_object_get_string(sub_obj);
+        if (!memdev) {
+            LOG(ERROR, "Cannot get backend memdev of NVDIMM #%" PRId64, slot);
+            rc = ERROR_FAIL;
+            goto out;
+        }
+        qmp_parameters_add_string(gc, &args, "path", memdev);
+        qmp_parameters_add_string(gc, &args, "property", "mem-path");
+        rc = qmp_synchronous_send(qmp, "qom-get", args, qmp_nvdimm_get_mempath,
+                                  &mem_path, qmp->timeout);
+        if (rc) {
+            LOG(ERROR, "Cannot get the backend device of NVDIMM #%" PRId64, slot);
+            goto out;
+        }
+
+        sub_obj = libxl__json_map_get("addr", sub_map, JSON_INTEGER);
+        spa = libxl__json_object_get_integer(sub_obj);
+
+        sub_obj = libxl__json_map_get("size", sub_map, JSON_INTEGER);
+        size = libxl__json_object_get_integer(sub_obj);
+
+        LOG(DEBUG,
+            "vNVDIMM #%" PRId64 ": %s, spa 0x%" PRIx64 ", size 0x%" PRIx64,
+            slot, mem_path, spa, size);
+
+        rc = libxl_nvdimm_add_device(gc, qmp->domid, mem_path, spa, size);
+        if (rc) {
+            LOG(ERROR,
+                "Failed to add NVDIMM #%" PRId64
+                "(mem_path %s, spa 0x%" PRIx64 ", size 0x%" PRIx64 ") "
+                "to domain %d (err = %d)",
+                slot, mem_path, spa, size, qmp->domid, rc);
+            goto out;
+        }
+    }
+
+ out:
+    GC_FREE;
+    return rc;
+}
+
+static int libxl__qmp_query_nvdimms(libxl__qmp_handler *qmp)
+{
+    libxl__json_object *args = NULL;
+    int rc;
+    GC_INIT(qmp->ctx);
+
+    qmp_parameters_add_string(gc, &args, "devtype", "nvdimm");
+    rc = qmp_synchronous_send(qmp, "query-memory-devices", args,
+                              qmp_register_nvdimm_callback, NULL, qmp->timeout);
+
+    GC_FREE;
+    return rc;
+}
+
 int libxl__qmp_hmp(libxl__gc *gc, int domid, const char *command_line,
                    char **output)
 {
@@ -1174,11 +1280,12 @@ int libxl__qmp_initializations(libxl__gc *gc, uint32_t domid,
 {
     const libxl_vnc_info *vnc = libxl__dm_vnc(guest_config);
     libxl__qmp_handler *qmp = NULL;
+    bool ignore_error = true;
     int ret = 0;
 
     qmp = libxl__qmp_initialize(gc, domid);
     if (!qmp)
-        return -1;
+        return ERROR_FAIL;
     ret = libxl__qmp_query_serial(qmp);
     if (!ret && vnc && vnc->passwd) {
         ret = qmp_change(gc, qmp, "vnc", "password", vnc->passwd);
@@ -1187,8 +1294,13 @@ int libxl__qmp_initializations(libxl__gc *gc, uint32_t domid,
     if (!ret) {
         ret = qmp_query_vnc(qmp);
     }
+    if (!ret && guest_config->num_vnvdimms) {
+        ret = libxl__qmp_query_nvdimms(qmp);
+        ignore_error = false;
+    }
     libxl__qmp_close(qmp);
-    return ret;
+
+    return ret ? (ignore_error ? ERROR_FAIL : ERROR_BADFAIL) : 0;
 }
 
 /*
-- 
2.12.0


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [RFC XEN PATCH v2 15/15] tools/misc: add xen-ndctl
  2017-03-20  0:09 [RFC XEN PATCH v2 00/15] Add vNVDIMM support to HVM domains Haozhong Zhang
                   ` (13 preceding siblings ...)
  2017-03-20  0:09 ` [RFC XEN PATCH v2 14/15] tools/libxl: initiate pmem mapping via qmp callback Haozhong Zhang
@ 2017-03-20  0:09 ` Haozhong Zhang
  2017-03-30  4:11   ` Dan Williams
  2017-03-30  4:20 ` [RFC XEN PATCH v2 00/15] Add vNVDIMM support to HVM domains Dan Williams
  2017-04-01 12:24 ` Konrad Rzeszutek Wilk
  16 siblings, 1 reply; 32+ messages in thread
From: Haozhong Zhang @ 2017-03-20  0:09 UTC (permalink / raw)
  To: xen-devel
  Cc: Konrad Rzeszutek Wilk, Dan Williams, Ian Jackson, Wei Liu,
	Haozhong Zhang

xen-ndctl is a tool for users in Dom0 to set up host pmem with the Xen
hypervisor. It is used to specify the storage, either regular RAM or
another pmem range, that Xen uses to manage the specified pmem.

Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
---
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Wei Liu <wei.liu2@citrix.com>
---
 .gitignore             |   1 +
 tools/misc/Makefile    |   4 +
 tools/misc/xen-ndctl.c | 227 +++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 232 insertions(+)

diff --git a/.gitignore b/.gitignore
index 4567de7a59..1af7cda8bd 100644
--- a/.gitignore
+++ b/.gitignore
@@ -206,6 +206,7 @@ tools/misc/xen-hvmctx
 tools/misc/xenlockprof
 tools/misc/lowmemd
 tools/misc/xencov
+tools/misc/xen-ndctl
 tools/pkg-config/*
 tools/qemu-xen-build
 tools/xentrace/xenalyze
diff --git a/tools/misc/Makefile b/tools/misc/Makefile
index 8152f7bfb5..86d89f1577 100644
--- a/tools/misc/Makefile
+++ b/tools/misc/Makefile
@@ -31,6 +31,7 @@ INSTALL_SBIN                   += xenperf
 INSTALL_SBIN                   += xenpm
 INSTALL_SBIN                   += xenwatchdogd
 INSTALL_SBIN                   += xen-livepatch
+INSTALL_SBIN                   += xen-ndctl
 INSTALL_SBIN += $(INSTALL_SBIN-y)
 
 # Everything to be installed in a private bin/
@@ -108,4 +109,7 @@ xen-lowmemd: xen-lowmemd.o
 xencov: xencov.o
 	$(CC) $(LDFLAGS) -o $@ $< $(LDLIBS_libxenctrl) $(APPEND_LDFLAGS)
 
+xen-ndctl: xen-ndctl.o
+	$(CC) $(LDFLAGS) -o $@ $< $(LDLIBS_libxenctrl) $(APPEND_LDFLAGS)
+
 -include $(DEPS)
diff --git a/tools/misc/xen-ndctl.c b/tools/misc/xen-ndctl.c
index e69de29bb2..f556a31dcf 100644
--- a/tools/misc/xen-ndctl.c
+++ b/tools/misc/xen-ndctl.c
@@ -0,0 +1,227 @@
+/*
+ * xen-ndctl.c
+ *
+ * Xen NVDIMM management tool
+ *
+ * Copyright (C) 2017,  Intel Corporation
+ *
+ * Permission is hereby granted, free of charge, to any person
+ * obtaining a copy of this software and associated documentation
+ * files (the "Software"), to deal in the Software without restriction,
+ * including without limitation the rights to use, copy, modify, merge,
+ * publish, distribute, sublicense, and/or sell copies of the Software,
+ * and to permit persons to whom the Software is furnished to do so,
+ * subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be
+ * included in all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
+ * IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
+ * CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
+ * TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
+ * SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
+ */
+
+#include <errno.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <xenctrl.h>
+
+#define ARRAY_SIZE(a) (sizeof (a) / sizeof ((a)[0]))
+
+static xc_interface *xch;
+
+static int handle_help(int argc, char *argv[]);
+static int handle_list_cmds(int argc, char *argv[]);
+static int handle_setup(int argc, char *argv[]);
+
+static const struct
+{
+    const char *name;
+    const char *syntax;
+    const char *help;
+    int (*handler)(int argc, char *argv[]);
+    bool need_xc;
+} cmds[] =
+{
+    {
+        .name    = "help",
+        .syntax  = "[command]",
+        .help    = "Show this message or the help message of 'command'.\n"
+                   "Use command 'list-cmds' to list all supported commands.\n",
+        .handler = handle_help,
+    },
+
+    {
+        .name    = "list-cmds",
+        .syntax  = "",
+        .help    = "List all supported commands\n",
+        .handler = handle_list_cmds,
+    },
+
+    {
+        .name    = "setup",
+        .syntax  = "<smfn> <emfn> [<mgmt_smfn> <mgmt_emfn>]",
+        .help    = "Setup Xen hypervisor to prepare for mapping host pmem pages\n"
+                   "from MFN 'smfn' to 'emfn'. If 'mgmt_smfn' and 'mgmt_emfn' are\n"
+                   "specified, host pmem pages from MFN 'mgmt_smfn' and 'mgmt_emfn'\n"
+                   "will be used by Xen hypervisor to manage this mapping.\n",
+        .handler = handle_setup,
+        .need_xc = true,
+    },
+};
+
+static void show_help(const char *cmd)
+{
+    int nr = ARRAY_SIZE(cmds);
+    unsigned int i;
+
+    fprintf(stderr, "Usage: xen-ndctl ");
+
+    if ( !cmd )
+    {
+        fprintf(stderr,
+                "<command> [args]\n\n"
+                "List all supported commands by 'xen-ndctl list-cmds'\n"
+                "Get help of a command by 'xen-ndctl help <command>'\n");
+        return;
+    }
+
+    for ( i = 0; i < nr; i++ )
+        if ( !strcmp(cmd, cmds[i].name) )
+        {
+            fprintf(stderr, "%s %s\n\n%s",
+                    cmds[i].name, cmds[i].syntax, cmds[i].help);
+            break;
+        }
+
+    if ( i == nr )
+        fprintf(stderr, "Unsupported command '%s'. "
+                "List all supported commands by 'xen-ndctl list-cmds'.\n",
+                cmd);
+}
+
+static int handle_unrecognized_arguments(const char *cmd)
+{
+    fprintf(stderr, "Unrecognized arguments\n\n");
+    show_help(cmd);
+    return -EINVAL;
+}
+
+static int handle_help(int argc, char *argv[])
+{
+    if ( argc == 1 )
+        show_help(NULL);
+    else if ( argc == 2 )
+        show_help(argv[1]);
+    else
+        return handle_unrecognized_arguments(argv[0]);
+
+    return 0;
+}
+
+static int handle_list_cmds(int argc, char *argv[])
+{
+    unsigned int i;
+
+    if ( argc > 1 )
+        return handle_unrecognized_arguments(argv[0]);
+
+    for ( i = 0; i < ARRAY_SIZE(cmds); i++ )
+        fprintf(stderr, "%s\n", cmds[i].name);
+
+    return 0;
+}
+
+static int handle_setup(int argc, char *argv[])
+{
+    unsigned long smfn, emfn, mgmt_smfn = INVALID_MFN, mgmt_emfn = INVALID_MFN;
+
+    if ( argc != 3 /* xen-ndctl setup smfn emfn */ &&
+         argc != 5 /* xen-ndctl setup smfn emfn mgmt_smfn mgmt_emfn */ )
+        return handle_unrecognized_arguments(argv[0]);
+
+#define strtoul_check_overflow(str, ret)                    \
+    do {                                                    \
+        errno = 0;                                          \
+        (ret) = strtoul((str), NULL, 0);                    \
+        if ( errno == ERANGE )                              \
+        {                                                   \
+            fprintf(stderr, "MFN '%s' overflow\n", (str));  \
+            return -errno;                                  \
+        }                                                   \
+    } while (0)
+
+    strtoul_check_overflow(argv[1], smfn);
+    strtoul_check_overflow(argv[2], emfn);
+
+    if ( argc == 5 )
+    {
+        strtoul_check_overflow(argv[3], mgmt_smfn);
+        strtoul_check_overflow(argv[4], mgmt_emfn);
+    }
+
+#undef strtoul_check_overflow
+
+    return xc_nvdimm_pmem_setup(xch, smfn, emfn, mgmt_smfn, mgmt_emfn);
+}
+
+int main(int argc, char *argv[])
+{
+    unsigned int i;
+    int nr_cmds = ARRAY_SIZE(cmds), rc = 0;
+    const char *cmd;
+
+    if ( argc <= 1 )
+    {
+        show_help(NULL);
+        return 0;
+    }
+
+    cmd = argv[1];
+
+    for ( i = 0; i < nr_cmds; i++ )
+        if ( !strcmp(cmd, cmds[i].name) )
+        {
+            if ( cmds[i].need_xc )
+            {
+                xch = xc_interface_open(0, 0, 0);
+                if ( !xch )
+                {
+                    rc = -errno;
+                    fprintf(stderr, "failed to get xc handler\n");
+                    break;
+                }
+            }
+            rc = cmds[i].handler(argc - 1, &argv[1]);
+            if ( rc )
+                fprintf(stderr, "%s failed: %s\n", cmds[i].name, strerror(-rc));
+            break;
+        }
+
+    if ( i == nr_cmds )
+    {
+        fprintf(stderr, "Unsupported command '%s'. "
+                "List all supported commands by 'xen-ndctl list-cmds'.\n",
+                cmd);
+        rc = -ENOSYS;
+    }
+
+    if ( xch )
+        xc_interface_close(xch);
+
+    return rc;
+}
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
-- 
2.12.0


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [RFC XEN PATCH v2 15/15] tools/misc: add xen-ndctl
  2017-03-20  0:09 ` [RFC XEN PATCH v2 15/15] tools/misc: add xen-ndctl Haozhong Zhang
@ 2017-03-30  4:11   ` Dan Williams
  2017-03-30  7:58     ` Haozhong Zhang
  0 siblings, 1 reply; 32+ messages in thread
From: Dan Williams @ 2017-03-30  4:11 UTC (permalink / raw)
  To: Haozhong Zhang; +Cc: Konrad Rzeszutek Wilk, Ian Jackson, Wei Liu, xen-devel

On Sun, Mar 19, 2017 at 5:09 PM, Haozhong Zhang
<haozhong.zhang@intel.com> wrote:
> xen-ndctl is a tool for users in Dom0 to setup the host pmem with Xen
> hypervisor. It's used to specify the storage, which is either the
> regular RAM or a pmem range, to manage the specified pmem.
>
> Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
> ---
> Cc: Ian Jackson <ian.jackson@eu.citrix.com>
> Cc: Wei Liu <wei.liu2@citrix.com>
> ---

I would be open to moving this tooling into upstream ndctl [1].
Especially since you're reusing the same name, it would be confusing to
have 2 ndctl tools.

[1]: https://github.com/pmem/ndctl

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC XEN PATCH v2 00/15] Add vNVDIMM support to HVM domains
  2017-03-20  0:09 [RFC XEN PATCH v2 00/15] Add vNVDIMM support to HVM domains Haozhong Zhang
                   ` (14 preceding siblings ...)
  2017-03-20  0:09 ` [RFC XEN PATCH v2 15/15] tools/misc: add xen-ndctl Haozhong Zhang
@ 2017-03-30  4:20 ` Dan Williams
  2017-03-30  8:21   ` Haozhong Zhang
  2017-04-01 12:24 ` Konrad Rzeszutek Wilk
  16 siblings, 1 reply; 32+ messages in thread
From: Dan Williams @ 2017-03-30  4:20 UTC (permalink / raw)
  To: Haozhong Zhang
  Cc: Wei Liu, Andrew Cooper, Ian Jackson, xen-devel, Jan Beulich,
	Konrad Rzeszutek Wilk, Daniel De Graaf

On Sun, Mar 19, 2017 at 5:09 PM, Haozhong Zhang
<haozhong.zhang@intel.com> wrote:
> This is v2 RFC patch series to add vNVDIMM support to HVM domains.
> v1 can be found at https://lists.xenproject.org/archives/html/xen-devel/2016-10/msg00424.html.
>
> No label and no _DSM except function 0 "query implemented functions"
> is supported by this version, but they will be added by future patches.
>
> The corresponding Qemu patch series is sent in another thread
> "[RFC QEMU PATCH v2 00/10] Implement vNVDIMM for Xen HVM guest".
>
> All patch series can be found at
>   Xen:  https://github.com/hzzhan9/xen.git nvdimm-rfc-v2
>   Qemu: https://github.com/hzzhan9/qemu.git xen-nvdimm-rfc-v2
>
> Changes in v2
> ==============
>
> - One of the primary changes in v2 is dropping the linux kernel
>   patches, which were used to reserve on host pmem for placing its
>   frametable and M2P table. In v2, we add a management tool xen-ndctl
>   which is used in Dom0 to notify Xen hypervisor of which storage can
>   be used to manage the host pmem.
>
>   For example,
>   1.   xen-ndctl setup 0x240000 0x380000 0x380000 0x3c0000
>     tells Xen hypervisor to use host pmem pages at MFN 0x380000 ~
>     0x3c0000 to manage host pmem pages at MFN 0x240000 ~ 0x380000.
>     I.e. the former is used to place the frame table and M2P table of
>     both ranges of pmem pages.
>
>   2.   xen-ndctl setup 0x240000 0x380000
>     tells Xen hypervisor to use the regular RAM to manage the host
>     pmem pages at MFN 0x240000 ~ 0x380000. I.e the regular RMA is used
>     to place the frame table and M2P table.
>
> - Another primary change in v2 is dropping the support to map files on
>   the host pmem to HVM domains as virtual NVDIMMs, as I cannot find a
>   stable to fix the fiemap of host files. Instead, we can rely on the
>   ability added in Linux kernel v4.9 that enables creating multiple
>   pmem namespaces on a single nvdimm interleave set.

This restriction is unfortunate, and it seems to limit the future
architecture of the pmem driver. We may not always be able to
guarantee a contiguous physical address range to Xen for a given
namespace and may want to concatenate disjoint physical address ranges
into a logically contiguous namespace.

Is there a resource I can read more about why the hypervisor needs to
have this M2P mapping for nvdimm support?


* Re: [RFC XEN PATCH v2 15/15] tools/misc: add xen-ndctl
  2017-03-30  4:11   ` Dan Williams
@ 2017-03-30  7:58     ` Haozhong Zhang
  2017-04-01 11:55       ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 32+ messages in thread
From: Haozhong Zhang @ 2017-03-30  7:58 UTC (permalink / raw)
  To: Dan Williams; +Cc: Konrad Rzeszutek Wilk, Ian Jackson, Wei Liu, xen-devel

On 03/29/17 21:11 -0700, Dan Williams wrote:
> On Sun, Mar 19, 2017 at 5:09 PM, Haozhong Zhang
> <haozhong.zhang@intel.com> wrote:
> > xen-ndctl is a tool for users in Dom0 to setup the host pmem with Xen
> > hypervisor. It's used to specify the storage, which is either the
> > regular RAM or a pmem range, to manage the specified pmem.
> >
> > Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
> > ---
> > Cc: Ian Jackson <ian.jackson@eu.citrix.com>
> > Cc: Wei Liu <wei.liu2@citrix.com>
> > ---
> 
> I would be open to moving this tooling into upstream ndctl [1].
> Especially since you're reusing the same name it would be confusing to
> have 2 ndctl tools.
> 
> [1]: https://github.com/pmem/ndctl

Then it could leverage existing code in ndctl (e.g. getting parameters
from a device name rather than from raw addresses as in my current
implementation).

I'm not sure about Xen's policy on whether such a tool should be included in
Xen or can be left to a 3rd-party program. Let's wait for the Xen maintainers' reply.
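
Just to illustrate the device-name idea above (purely hypothetical command
syntax, not something implemented in this series), the setup step could then
look like

     # xen-ndctl setup --pmem namespace0.0 --reserve namespace0.1

instead of requiring the admin to look up raw MFN ranges in /proc/iomem.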

Thanks,
Haozhong


* Re: [RFC XEN PATCH v2 00/15] Add vNVDIMM support to HVM domains
  2017-03-30  4:20 ` [RFC XEN PATCH v2 00/15] Add vNVDIMM support to HVM domains Dan Williams
@ 2017-03-30  8:21   ` Haozhong Zhang
  2017-03-30 16:01     ` Dan Williams
  0 siblings, 1 reply; 32+ messages in thread
From: Haozhong Zhang @ 2017-03-30  8:21 UTC (permalink / raw)
  To: Dan Williams
  Cc: Wei Liu, Andrew Cooper, Ian Jackson, xen-devel, Jan Beulich,
	Konrad Rzeszutek Wilk, Daniel De Graaf

On 03/29/17 21:20 -0700, Dan Williams wrote:
> On Sun, Mar 19, 2017 at 5:09 PM, Haozhong Zhang
> <haozhong.zhang@intel.com> wrote:
> > This is v2 RFC patch series to add vNVDIMM support to HVM domains.
> > v1 can be found at https://lists.xenproject.org/archives/html/xen-devel/2016-10/msg00424.html.
> >
> > No label and no _DSM except function 0 "query implemented functions"
> > is supported by this version, but they will be added by future patches.
> >
> > The corresponding Qemu patch series is sent in another thread
> > "[RFC QEMU PATCH v2 00/10] Implement vNVDIMM for Xen HVM guest".
> >
> > All patch series can be found at
> >   Xen:  https://github.com/hzzhan9/xen.git nvdimm-rfc-v2
> >   Qemu: https://github.com/hzzhan9/qemu.git xen-nvdimm-rfc-v2
> >
> > Changes in v2
> > ==============
> >
> > - One of the primary changes in v2 is dropping the linux kernel
> >   patches, which were used to reserve on host pmem for placing its
> >   frametable and M2P table. In v2, we add a management tool xen-ndctl
> >   which is used in Dom0 to notify Xen hypervisor of which storage can
> >   be used to manage the host pmem.
> >
> >   For example,
> >   1.   xen-ndctl setup 0x240000 0x380000 0x380000 0x3c0000
> >     tells Xen hypervisor to use host pmem pages at MFN 0x380000 ~
> >     0x3c0000 to manage host pmem pages at MFN 0x240000 ~ 0x380000.
> >     I.e. the former is used to place the frame table and M2P table of
> >     both ranges of pmem pages.
> >
> >   2.   xen-ndctl setup 0x240000 0x380000
> >     tells Xen hypervisor to use the regular RAM to manage the host
> >     pmem pages at MFN 0x240000 ~ 0x380000. I.e the regular RMA is used
> >     to place the frame table and M2P table.
> >
> > - Another primary change in v2 is dropping the support to map files on
> >   the host pmem to HVM domains as virtual NVDIMMs, as I cannot find a
> >   stable to fix the fiemap of host files. Instead, we can rely on the
> >   ability added in Linux kernel v4.9 that enables creating multiple
> >   pmem namespaces on a single nvdimm interleave set.
> 
> This restriction is unfortunate, and it seems to limit the future
> architecture of the pmem driver. We may not always be able to
> guarantee a contiguous physical address range to Xen for a given
> namespace and may want to concatenate disjoint physical address ranges
> into a logically contiguous namespace.
>

The hypervisor code that actually maps host pmem addresses to a guest does
not require the host addresses to be contiguous. We can modify the
toolstack code that gets the address ranges from a namespace to support
passing multiple address ranges to the Xen hypervisor.
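
To illustrate what passing multiple ranges could look like (hypothetical
layout only, not the interface in this series), the toolstack request could
simply carry an array of ranges:

    #include <stdint.h>

    /* Hypothetical illustration only: a multi-range mapping request,
     * instead of a single (start MFN, end MFN) pair. */
    struct pmem_map_range {
        uint64_t smfn;    /* first host machine frame of the range  */
        uint64_t emfn;    /* end (exclusive) host machine frame     */
        uint64_t gpfn;    /* first guest frame to map this range at */
    };

    struct pmem_map_args {
        uint32_t domid;       /* target HVM domain                  */
        uint32_t nr_ranges;   /* number of entries in ranges[]      */
        struct pmem_map_range ranges[];
    };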

> Is there a resource I can read more about why the hypervisor needs to
> have this M2P mapping for nvdimm support?

M2P is basically an array of frame numbers. It's indexed by the host
page frame number, or the machine frame number (MFN) in Xen's
definition. The n'th entry records the guest page frame number that is
mapped to MFN n. M2P is one of the core data structures used in Xen
memory management, and is used to convert an MFN to a guest PFN. A
read-only version of M2P is also exposed as part of the ABI to guests. In
the previous design discussion, we decided to fit the management of
NVDIMM into the existing Xen memory management as much as possible, so
we need to build M2P for NVDIMM as well.
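
To make that concrete, here is a minimal sketch of the M2P idea (illustrative
only - the names mimic Xen conventions, but this is not the actual Xen
implementation):

    /* Illustrative sketch only, not the real Xen code. */
    #define INVALID_M2P_ENTRY (~0UL)

    /* One entry per host (machine) frame; indexed by MFN. */
    unsigned long *machine_to_phys_mapping;

    /* Record that guest frame 'gpfn' is currently backed by machine frame 'mfn'. */
    void set_gpfn_from_mfn(unsigned long mfn, unsigned long gpfn)
    {
        machine_to_phys_mapping[mfn] = gpfn;
    }

    /* Convert a machine frame back to the guest frame that owns it. */
    unsigned long get_gpfn_from_mfn(unsigned long mfn)
    {
        return machine_to_phys_mapping[mfn];
    }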

Haozhong


* Re: [RFC XEN PATCH v2 00/15] Add vNVDIMM support to HVM domains
  2017-03-30  8:21   ` Haozhong Zhang
@ 2017-03-30 16:01     ` Dan Williams
  2017-04-01 11:54       ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 32+ messages in thread
From: Dan Williams @ 2017-03-30 16:01 UTC (permalink / raw)
  To: Dan Williams, xen-devel, Konrad Rzeszutek Wilk, Ian Jackson,
	Wei Liu, Jan Beulich, Andrew Cooper, Daniel De Graaf

On Thu, Mar 30, 2017 at 1:21 AM, Haozhong Zhang
<haozhong.zhang@intel.com> wrote:
> On 03/29/17 21:20 -0700, Dan Williams wrote:
>> On Sun, Mar 19, 2017 at 5:09 PM, Haozhong Zhang
>> <haozhong.zhang@intel.com> wrote:
>> > This is v2 RFC patch series to add vNVDIMM support to HVM domains.
>> > v1 can be found at https://lists.xenproject.org/archives/html/xen-devel/2016-10/msg00424.html.
>> >
>> > No label and no _DSM except function 0 "query implemented functions"
>> > is supported by this version, but they will be added by future patches.
>> >
>> > The corresponding Qemu patch series is sent in another thread
>> > "[RFC QEMU PATCH v2 00/10] Implement vNVDIMM for Xen HVM guest".
>> >
>> > All patch series can be found at
>> >   Xen:  https://github.com/hzzhan9/xen.git nvdimm-rfc-v2
>> >   Qemu: https://github.com/hzzhan9/qemu.git xen-nvdimm-rfc-v2
>> >
>> > Changes in v2
>> > ==============
>> >
>> > - One of the primary changes in v2 is dropping the linux kernel
>> >   patches, which were used to reserve on host pmem for placing its
>> >   frametable and M2P table. In v2, we add a management tool xen-ndctl
>> >   which is used in Dom0 to notify Xen hypervisor of which storage can
>> >   be used to manage the host pmem.
>> >
>> >   For example,
>> >   1.   xen-ndctl setup 0x240000 0x380000 0x380000 0x3c0000
>> >     tells Xen hypervisor to use host pmem pages at MFN 0x380000 ~
>> >     0x3c0000 to manage host pmem pages at MFN 0x240000 ~ 0x380000.
>> >     I.e. the former is used to place the frame table and M2P table of
>> >     both ranges of pmem pages.
>> >
>> >   2.   xen-ndctl setup 0x240000 0x380000
>> >     tells Xen hypervisor to use the regular RAM to manage the host
>> >     pmem pages at MFN 0x240000 ~ 0x380000. I.e the regular RMA is used
>> >     to place the frame table and M2P table.
>> >
>> > - Another primary change in v2 is dropping the support to map files on
>> >   the host pmem to HVM domains as virtual NVDIMMs, as I cannot find a
>> >   stable to fix the fiemap of host files. Instead, we can rely on the
>> >   ability added in Linux kernel v4.9 that enables creating multiple
>> >   pmem namespaces on a single nvdimm interleave set.
>>
>> This restriction is unfortunate, and it seems to limit the future
>> architecture of the pmem driver. We may not always be able to
>> guarantee a contiguous physical address range to Xen for a given
>> namespace and may want to concatenate disjoint physical address ranges
>> into a logically contiguous namespace.
>>
>
> The hypervisor code that actual maps host pmem address to guest does
> not require the host address be contiguous. We can modify the
> toolstack code that get the address range from a namespace to support
> passing multiple address ranges to Xen hypervisor
>
>> Is there a resource I can read more about why the hypervisor needs to
>> have this M2P mapping for nvdimm support?
>
> M2P is basically an array of frame numbers. It's indexed by the host
> page frame number, or the machine frame number (MFN) in Xen's
> definition. The n'th entry records the guest page frame number that is
> mapped to MFN n. M2P is one of the core data structures used in Xen
> memory management, and is used to convert MFN to guest PFN. A
> read-only version of M2P is also exposed as part of ABI to guest. In
> the previous design discussion, we decided to put the management of
> NVDIMM in the existing Xen memory management as much as possible, so
> we need to build M2P for NVDIMM as well.
>

Thanks, but what I don't understand is why this M2P lookup is needed?
Does Xen establish this metadata for PCI mmio ranges as well? What Xen
memory management operations does this enable? Sorry if these are
basic Xen questions, I'm just looking to see if we can make the
mapping support more dynamic. For example, what if we wanted to change
the MFN to guest PFN relationship after every fault?


* Re: [RFC XEN PATCH v2 00/15] Add vNVDIMM support to HVM domains
  2017-03-30 16:01     ` Dan Williams
@ 2017-04-01 11:54       ` Konrad Rzeszutek Wilk
  2017-04-01 15:45         ` Dan Williams
  0 siblings, 1 reply; 32+ messages in thread
From: Konrad Rzeszutek Wilk @ 2017-04-01 11:54 UTC (permalink / raw)
  To: Dan Williams
  Cc: Wei Liu, Andrew Cooper, Ian Jackson, xen-devel, Jan Beulich,
	Daniel De Graaf

..snip..
> >> Is there a resource I can read more about why the hypervisor needs to
> >> have this M2P mapping for nvdimm support?
> >
> > M2P is basically an array of frame numbers. It's indexed by the host
> > page frame number, or the machine frame number (MFN) in Xen's
> > definition. The n'th entry records the guest page frame number that is
> > mapped to MFN n. M2P is one of the core data structures used in Xen
> > memory management, and is used to convert MFN to guest PFN. A
> > read-only version of M2P is also exposed as part of ABI to guest. In
> > the previous design discussion, we decided to put the management of
> > NVDIMM in the existing Xen memory management as much as possible, so
> > we need to build M2P for NVDIMM as well.
> >
> 
> Thanks, but what I don't understand is why this M2P lookup is needed?

Xen uses it to construct the EPT page tables for the guests.

> Does Xen establish this metadata for PCI mmio ranges as well? What Xen

It doesn't have that (M2P) for PCI MMIO ranges. For those it has a
ranges construct (since those are usually contiguous and given
in ranges to a guest).
> memory management operations does this enable? Sorry if these are
> basic Xen questions, I'm just looking to see if we can make the
> mapping support more dynamic. For example, what if we wanted to change
> the MFN to guest PFN relationship after every fault?

As in swap it out? (Like a hard drive swaps out faulty sectors?)
That is certainly done. We also have tools (xen-hptool) that can
mark certain pages as broken, etc., and inject MCEs in the guest
to reflect that. But all of that is driven by hypercalls.
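
For reference, the xen-hptool usage is roughly as follows (the MFN below is
made up for illustration):

     # xen-hptool mem-query 0x240000     // query the state of that machine frame
     # xen-hptool mem-offline 0x240000   // take the page offline
     # xen-hptool mem-online 0x240000    // bring it back later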


* Re: [RFC XEN PATCH v2 15/15] tools/misc: add xen-ndctl
  2017-03-30  7:58     ` Haozhong Zhang
@ 2017-04-01 11:55       ` Konrad Rzeszutek Wilk
  0 siblings, 0 replies; 32+ messages in thread
From: Konrad Rzeszutek Wilk @ 2017-04-01 11:55 UTC (permalink / raw)
  To: Dan Williams, xen-devel, Ian Jackson, Wei Liu

On Thu, Mar 30, 2017 at 03:58:25PM +0800, Haozhong Zhang wrote:
> On 03/29/17 21:11 -0700, Dan Williams wrote:
> > On Sun, Mar 19, 2017 at 5:09 PM, Haozhong Zhang
> > <haozhong.zhang@intel.com> wrote:
> > > xen-ndctl is a tool for users in Dom0 to setup the host pmem with Xen
> > > hypervisor. It's used to specify the storage, which is either the
> > > regular RAM or a pmem range, to manage the specified pmem.
> > >
> > > Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
> > > ---
> > > Cc: Ian Jackson <ian.jackson@eu.citrix.com>
> > > Cc: Wei Liu <wei.liu2@citrix.com>
> > > ---
> > 
> > I would be open to moving this tooling into upstream ndctl [1].
> > Especially since you're reusing the same name it would be confusing to
> > have 2 ndctl tools.
> > 
> > [1]: https://github.com/pmem/ndctl
> 
> Then it could leverage existing code in ndctl (e.g. getting parameters
> from a device name rather than using address in my current implementation).
> 
> I'm not sure about Xen's policy whether a tool should be included in
> Xen or can be left in 3rd party program. Let's wait for Xen maintainers' reply.

No problem. For example, QEMU has it (it has a configure check to detect the
headers and then uses xc_XYZ calls); kexec-tools does it as well.

The big thing that you have to keep in mind is to make the hypercall
as future-proof as possible - so you don't end up with #ifdefs all over the
code.
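
As a rough sketch of what such an out-of-tree tool looks like against
libxenctrl (the xc_nvdimm_pmem_setup() name and argument order below just
mirror this RFC series and are an assumption, not a settled interface):

    #include <stdio.h>
    #include <xenctrl.h>

    /* Tell Xen to manage pmem MFNs [0x240000, 0x380000) using management
     * pages [0x380000, 0x3c0000) for the frametable and M2P. */
    int main(void)
    {
        xc_interface *xch = xc_interface_open(NULL, NULL, 0);
        int rc;

        if (!xch) {
            perror("xc_interface_open");
            return 1;
        }

        rc = xc_nvdimm_pmem_setup(xch, 0x240000, 0x380000,  /* data pmem */
                                       0x380000, 0x3c0000); /* mgmt pmem */
        if (rc)
            fprintf(stderr, "setup failed: %d\n", rc);

        xc_interface_close(xch);
        return rc ? 1 : 0;
    }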

> 
> Thanks,
> Haozhong


* Re: [RFC XEN PATCH v2 00/15] Add vNVDIMM support to HVM domains
  2017-03-20  0:09 [RFC XEN PATCH v2 00/15] Add vNVDIMM support to HVM domains Haozhong Zhang
                   ` (15 preceding siblings ...)
  2017-03-30  4:20 ` [RFC XEN PATCH v2 00/15] Add vNVDIMM support to HVM domains Dan Williams
@ 2017-04-01 12:24 ` Konrad Rzeszutek Wilk
  16 siblings, 0 replies; 32+ messages in thread
From: Konrad Rzeszutek Wilk @ 2017-04-01 12:24 UTC (permalink / raw)
  To: Haozhong Zhang
  Cc: Wei Liu, Andrew Cooper, Ian Jackson, xen-devel, Jan Beulich,
	Dan Williams, Daniel De Graaf

On Mon, Mar 20, 2017 at 08:09:34AM +0800, Haozhong Zhang wrote:
> This is v2 RFC patch series to add vNVDIMM support to HVM domains.
> v1 can be found at https://lists.xenproject.org/archives/html/xen-devel/2016-10/msg00424.html.
> 
> No label and no _DSM except function 0 "query implemented functions"
> is supported by this version, but they will be added by future patches.
> 
> The corresponding Qemu patch series is sent in another thread
> "[RFC QEMU PATCH v2 00/10] Implement vNVDIMM for Xen HVM guest".
> 
> All patch series can be found at
>   Xen:  https://github.com/hzzhan9/xen.git nvdimm-rfc-v2
>   Qemu: https://github.com/hzzhan9/qemu.git xen-nvdimm-rfc-v2
> 

Hey!

Thank you for posting this. A quick question.
> Changes in v2
> ==============
> 
> - One of the primary changes in v2 is dropping the linux kernel
>   patches, which were used to reserve on host pmem for placing its
>   frametable and M2P table. In v2, we add a management tool xen-ndctl
>   which is used in Dom0 to notify Xen hypervisor of which storage can
>   be used to manage the host pmem.
> 
>   For example,
>   1.   xen-ndctl setup 0x240000 0x380000 0x380000 0x3c0000
>     tells Xen hypervisor to use host pmem pages at MFN 0x380000 ~
>     0x3c0000 to manage host pmem pages at MFN 0x240000 ~ 0x380000.
>     I.e. the former is used to place the frame table and M2P table of
>     both ranges of pmem pages.
> 
>   2.   xen-ndctl setup 0x240000 0x380000
>     tells Xen hypervisor to use the regular RAM to manage the host
>     pmem pages at MFN 0x240000 ~ 0x380000. I.e the regular RMA is used
>     to place the frame table and M2P table.

How were you thinking of 'gluing' this to the libvirt (xl) way of setting
up NVDIMM? Could you explain (even in broad terms) how that would be
done? I see the 'vnvdimms' option, but somehow I would have thought that
libxl would parse /proc/iomem (or perhaps call ndctl to
obtain this?)

> 
> - Another primary change in v2 is dropping the support to map files on
>   the host pmem to HVM domains as virtual NVDIMMs, as I cannot find a
>   stable to fix the fiemap of host files. Instead, we can rely on the
>   ability added in Linux kernel v4.9 that enables creating multiple
>   pmem namespaces on a single nvdimm interleave set.

Could you expand on this a bit, please? This is quite an important feature,
and I thought the mix of mlock + fiemap would have solved this?

> 
> - Other changes are logged in each patch separately.
> 
> How to Test

Thank you for the detailed way this is explained!
> ==============
> 
> 0. This patch series can be tested either on the real hardware with
>    NVDIMM, or in the nested virtualization environment on KVM. The

Real hardware, eh? Nice!

>    latter requires QEMU 2.9 or newer with, for example, following
>    commands and options,
>      # dd if=/dev/zero of=nvm-8G.img bs=1G count=8
>      # rmmod kvm-intel; modprobe kvm-intel nested=1
>      # qemu-system-x86_64 -enable-kvm -smp 4 -cpu host,+vmx \
>                           -hda DISK_IMG_OF_XEN \
>                           -machine pc,nvdimm \
>                           -m 8G,slots=4,maxmem=128G \
>                           -object memory-backend-file,id=mem1,mem-path=nvm-8G,size=8G \
>                           -device nvdimm,id=nv1,memdev=mem1,label-size=2M \
>                           ...
>    Above will create a nested virtualization environment with a 8G
>    pmem mode NVDIMM device (whose last 2MB is used as the label
>    storage area).
> 
> 1. Check out Xen and QEMU from above repositories and branches. Build
>    and install Xen with qemu-xen replaced by above QEMU.
> 
> 2. Build and install Linux kernel 4.9 or later as Dom0 kernel with the
>    following configs selected:
>        CONFIG_ACPI_NFIT
>        CONFIG_LIBNVDIMM
>        CONFIG_BLK_DEV_PMEM
>        CONFIG_NVDIMM_PFN
>        CONFIG_FS_DAX
> 
> 3. Check out ndctl from https://github.com/pmem/ndctl.git. Build and
>    install ndctl in Dom0.
> 
> 4. Boot to Xen Dom0.
> 
> 5. Create pmem namespaces on the host pmem region.
>      # ndctl disable-region region0
>      # ndctl zero-labels nmem0                        // clear existing labels
>      # ndctl init-labels nmem0                        // initialize the label area
>      # ndctl enable-region region0     
>      # ndctl create-namespace -r region0 -s 4G -m raw // create one 4G pmem namespace
>      # ndctl create-namespace -r region0 -s 1G -m raw // create one 1G pmem namespace
>      # ndctl list --namespaces
>      [
>        {
>            "dev":"namespace0.0",
>            "mode":"raw",
>            "size":4294967296,
>            "uuid":"bbfbedbd-3ada-4f55-9484-01f2722c651b",
>            "blockdev":"pmem0"
>        },
>        {
>            "dev":"namespace0.1",
>            "mode":"raw",
>            "size":1073741824,
>            "uuid":"dd4d3949-6887-417b-b819-89a7854fcdbd",
>            "blockdev":"pmem0.1"
>        }
>      ]
> 
> 6. Ask Xen hypervisor to use namespace0.1 to manage namespace0.0.
>      # grep namespace /proc/iomem
>          240000000-33fffffff : namespace0.0
>          340000000-37fffffff : namespace0.1
>      # xen-ndctl setup 0x240000 0x340000 0x340000 0x380000
> 
> 7. Start a HVM domain with "vnvdimms=[ '/dev/pmem0' ]" in its xl config.
> 
>    If ndctl is installed in HVM domain, "ndctl list" should be able to
>    list a 4G pmem namespace, e.g.
>    {
>      "dev":"namespace0.0",
>      "mode":"raw",
>      "size":4294967296,
>      "blockdev":"pmem0"
>    }
>    
> 
> Haozhong Zhang (15):
>   xen/common: add Kconfig item for pmem support
>   xen: probe pmem regions via ACPI NFIT
>   xen/x86: allow customizing locations of extended frametable & M2P
>   xen/x86: add XEN_SYSCTL_nvdimm_pmem_setup to setup host pmem
>   xen/x86: add XENMEM_populate_pmemmap to map host pmem pages to HVM domain
>   tools: reserve guest memory for ACPI from device model
>   tools/libacpi: expose the minimum alignment used by mem_ops.alloc
>   tools/libacpi: add callback acpi_ctxt.p2v to get a pointer from physical address
>   tools/libacpi: add callbacks to access XenStore
>   tools/libacpi: add a simple AML builder
>   tools/libacpi: load ACPI built by the device model
>   tools/libxl: build qemu options from xl vNVDIMM configs
>   tools/libxl: add support to map host pmem device to guests
>   tools/libxl: initiate pmem mapping via qmp callback
>   tools/misc: add xen-ndctl
> 
>  .gitignore                              |   1 +
>  docs/man/xl.cfg.pod.5.in                |   6 +
>  tools/firmware/hvmloader/Makefile       |   3 +-
>  tools/firmware/hvmloader/util.c         |  75 ++++++
>  tools/firmware/hvmloader/util.h         |  10 +
>  tools/firmware/hvmloader/xenbus.c       |  44 +++-
>  tools/flask/policy/modules/dom0.te      |   2 +-
>  tools/flask/policy/modules/xen.if       |   2 +-
>  tools/libacpi/acpi2_0.h                 |   2 +
>  tools/libacpi/aml_build.c               | 326 +++++++++++++++++++++++
>  tools/libacpi/aml_build.h               | 116 +++++++++
>  tools/libacpi/build.c                   | 311 ++++++++++++++++++++++
>  tools/libacpi/libacpi.h                 |  21 ++
>  tools/libxc/include/xc_dom.h            |   1 +
>  tools/libxc/include/xenctrl.h           |  36 +++
>  tools/libxc/xc_dom_x86.c                |   7 +
>  tools/libxc/xc_domain.c                 |  15 ++
>  tools/libxc/xc_misc.c                   |  17 ++
>  tools/libxl/Makefile                    |   7 +-
>  tools/libxl/libxl_create.c              |   4 +-
>  tools/libxl/libxl_dm.c                  | 109 +++++++-
>  tools/libxl/libxl_dom.c                 |  22 ++
>  tools/libxl/libxl_nvdimm.c              | 182 +++++++++++++
>  tools/libxl/libxl_nvdimm.h              |  42 +++
>  tools/libxl/libxl_qmp.c                 | 116 ++++++++-
>  tools/libxl/libxl_types.idl             |   8 +
>  tools/libxl/libxl_x86_acpi.c            |  41 +++
>  tools/misc/Makefile                     |   4 +
>  tools/misc/xen-ndctl.c                  | 227 ++++++++++++++++
>  tools/xl/xl_parse.c                     |  16 ++
>  xen/arch/x86/acpi/boot.c                |   4 +
>  xen/arch/x86/domain.c                   |   7 +
>  xen/arch/x86/sysctl.c                   |  22 ++
>  xen/arch/x86/x86_64/mm.c                | 191 ++++++++++++--
>  xen/common/Kconfig                      |   9 +
>  xen/common/Makefile                     |   1 +
>  xen/common/compat/memory.c              |   1 +
>  xen/common/domain.c                     |   3 +
>  xen/common/memory.c                     |  43 +++
>  xen/common/pmem.c                       | 448 ++++++++++++++++++++++++++++++++
>  xen/drivers/acpi/Makefile               |   2 +
>  xen/drivers/acpi/nfit.c                 | 116 +++++++++
>  xen/include/acpi/actbl.h                |   1 +
>  xen/include/acpi/actbl1.h               |  42 +++
>  xen/include/public/hvm/hvm_xs_strings.h |  11 +
>  xen/include/public/memory.h             |  14 +-
>  xen/include/public/sysctl.h             |  29 ++-
>  xen/include/xen/acpi.h                  |   4 +
>  xen/include/xen/pmem.h                  |  66 +++++
>  xen/include/xen/sched.h                 |   3 +
>  xen/include/xsm/dummy.h                 |  11 +
>  xen/include/xsm/xsm.h                   |  12 +
>  xen/xsm/dummy.c                         |   4 +
>  xen/xsm/flask/hooks.c                   |  17 ++
>  xen/xsm/flask/policy/access_vectors     |   4 +
>  55 files changed, 2795 insertions(+), 43 deletions(-)
>  create mode 100644 tools/libacpi/aml_build.c
>  create mode 100644 tools/libacpi/aml_build.h
>  create mode 100644 tools/libxl/libxl_nvdimm.c
>  create mode 100644 tools/libxl/libxl_nvdimm.h
>  create mode 100644 tools/misc/xen-ndctl.c
>  create mode 100644 xen/common/pmem.c
>  create mode 100644 xen/drivers/acpi/nfit.c
>  create mode 100644 xen/include/xen/pmem.h
> 
> -- 
> 2.12.0
> 


* Re: [RFC XEN PATCH v2 00/15] Add vNVDIMM support to HVM domains
  2017-04-01 11:54       ` Konrad Rzeszutek Wilk
@ 2017-04-01 15:45         ` Dan Williams
  2017-04-04 17:00           ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 32+ messages in thread
From: Dan Williams @ 2017-04-01 15:45 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Wei Liu, Andrew Cooper, Ian Jackson, xen-devel, Jan Beulich,
	Daniel De Graaf

On Sat, Apr 1, 2017 at 4:54 AM, Konrad Rzeszutek Wilk <konrad@darnok.org> wrote:
> ..snip..
>> >> Is there a resource I can read more about why the hypervisor needs to
>> >> have this M2P mapping for nvdimm support?
>> >
>> > M2P is basically an array of frame numbers. It's indexed by the host
>> > page frame number, or the machine frame number (MFN) in Xen's
>> > definition. The n'th entry records the guest page frame number that is
>> > mapped to MFN n. M2P is one of the core data structures used in Xen
>> > memory management, and is used to convert MFN to guest PFN. A
>> > read-only version of M2P is also exposed as part of ABI to guest. In
>> > the previous design discussion, we decided to put the management of
>> > NVDIMM in the existing Xen memory management as much as possible, so
>> > we need to build M2P for NVDIMM as well.
>> >
>>
>> Thanks, but what I don't understand is why this M2P lookup is needed?
>
> Xen uses it to construct the EPT page tables for the guests.
>
>> Does Xen establish this metadata for PCI mmio ranges as well? What Xen
>
> It doesn't have that (M2P) for PCI MMIO ranges. For those it has an
> ranges construct (since those are usually contingous and given
> in ranges to a guest).

So, I'm confused again. This patchset / enabling requires both M2P and
contiguous PMEM ranges. If the PMEM is contiguous it seems you don't
need M2P and can just reuse the MMIO enabling, or am I missing
something?


* Re: [RFC XEN PATCH v2 00/15] Add vNVDIMM support to HVM domains
  2017-04-01 15:45         ` Dan Williams
@ 2017-04-04 17:00           ` Konrad Rzeszutek Wilk
  2017-04-04 17:16             ` Dan Williams
  0 siblings, 1 reply; 32+ messages in thread
From: Konrad Rzeszutek Wilk @ 2017-04-04 17:00 UTC (permalink / raw)
  To: Dan Williams
  Cc: Wei Liu, Andrew Cooper, Ian Jackson, xen-devel, Jan Beulich,
	Konrad Rzeszutek Wilk, Daniel De Graaf

On Sat, Apr 01, 2017 at 08:45:45AM -0700, Dan Williams wrote:
> On Sat, Apr 1, 2017 at 4:54 AM, Konrad Rzeszutek Wilk <konrad@darnok.org> wrote:
> > ..snip..
> >> >> Is there a resource I can read more about why the hypervisor needs to
> >> >> have this M2P mapping for nvdimm support?
> >> >
> >> > M2P is basically an array of frame numbers. It's indexed by the host
> >> > page frame number, or the machine frame number (MFN) in Xen's
> >> > definition. The n'th entry records the guest page frame number that is
> >> > mapped to MFN n. M2P is one of the core data structures used in Xen
> >> > memory management, and is used to convert MFN to guest PFN. A
> >> > read-only version of M2P is also exposed as part of ABI to guest. In
> >> > the previous design discussion, we decided to put the management of
> >> > NVDIMM in the existing Xen memory management as much as possible, so
> >> > we need to build M2P for NVDIMM as well.
> >> >
> >>
> >> Thanks, but what I don't understand is why this M2P lookup is needed?
> >
> > Xen uses it to construct the EPT page tables for the guests.
> >
> >> Does Xen establish this metadata for PCI mmio ranges as well? What Xen
> >
> > It doesn't have that (M2P) for PCI MMIO ranges. For those it has an
> > ranges construct (since those are usually contingous and given
> > in ranges to a guest).
> 
> So, I'm confused again. This patchset / enabling requires both M2P and
> contiguous PMEM ranges. If the PMEM is contiguous it seems you don't
> need M2P and can just reuse the MMIO enabling, or am I missing
> something?

I think I am confusing you.

The patchset (specifically [04/15] xen/x86: add XEN_SYSCTL_nvdimm_pmem_setup to setup host pmem)
adds a hypercall to tell Xen where on the NVDIMM it can put
the M2P array as well as the frametables ('struct page').

There is no range support. The reason is that if you break an NVDIMM
up into various chunks (and then put a filesystem on top of it) - and
then figure out which of the SPAs belong to the file - and then
"expose" that file to a guest as NVDIMM - its SPAs won't
be contiguous. Hence the hypervisor would need to break the
'ranges' structure down into either a bitmap or an M2P
and also store it. This can get quite tricky, so you may
as well just start with an M2P and 'struct page'.

As for the placement of those data structures: "the v2 patch series relies
on users/admins in Dom0, instead of the Dom0 driver, to indicate the
location to store the frametable and M2P of pmem."

Hope this helps?

* Re: [RFC XEN PATCH v2 00/15] Add vNVDIMM support to HVM domains
  2017-04-04 17:00           ` Konrad Rzeszutek Wilk
@ 2017-04-04 17:16             ` Dan Williams
  2017-04-04 17:34               ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 32+ messages in thread
From: Dan Williams @ 2017-04-04 17:16 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Wei Liu, Andrew Cooper, Ian Jackson, xen-devel, Jan Beulich,
	Konrad Rzeszutek Wilk, Daniel De Graaf

On Tue, Apr 4, 2017 at 10:00 AM, Konrad Rzeszutek Wilk
<konrad.wilk@oracle.com> wrote:
> On Sat, Apr 01, 2017 at 08:45:45AM -0700, Dan Williams wrote:
>> On Sat, Apr 1, 2017 at 4:54 AM, Konrad Rzeszutek Wilk <konrad@darnok.org> wrote:
>> > ..snip..
>> >> >> Is there a resource I can read more about why the hypervisor needs to
>> >> >> have this M2P mapping for nvdimm support?
>> >> >
>> >> > M2P is basically an array of frame numbers. It's indexed by the host
>> >> > page frame number, or the machine frame number (MFN) in Xen's
>> >> > definition. The n'th entry records the guest page frame number that is
>> >> > mapped to MFN n. M2P is one of the core data structures used in Xen
>> >> > memory management, and is used to convert MFN to guest PFN. A
>> >> > read-only version of M2P is also exposed as part of ABI to guest. In
>> >> > the previous design discussion, we decided to put the management of
>> >> > NVDIMM in the existing Xen memory management as much as possible, so
>> >> > we need to build M2P for NVDIMM as well.
>> >> >
>> >>
>> >> Thanks, but what I don't understand is why this M2P lookup is needed?
>> >
>> > Xen uses it to construct the EPT page tables for the guests.
>> >
>> >> Does Xen establish this metadata for PCI mmio ranges as well? What Xen
>> >
>> > It doesn't have that (M2P) for PCI MMIO ranges. For those it has an
>> > ranges construct (since those are usually contingous and given
>> > in ranges to a guest).
>>
>> So, I'm confused again. This patchset / enabling requires both M2P and
>> contiguous PMEM ranges. If the PMEM is contiguous it seems you don't
>> need M2P and can just reuse the MMIO enabling, or am I missing
>> something?
>
> I think I am confusing you.
>
> The patchset (specifically 04/15] xen/x86: add XEN_SYSCTL_nvdimm_pmem_setup to setup host pmem )
> adds a hypercall to tell Xen where on the NVDIMM it can put
> the M2P array and as well the frametables ('struct page').
>
> There is no range support. The reason is that if break up
> an NVDIMM in various chunks (and then put a filesystem on top of it) - and
> then figure out which of the SPAs belong to the file - and then
> "expose" that file to a guest as NVDIMM - it's SPAs won't
> be contingous. Hence the hypervisor would need to break down
> the 'ranges' structure down in either a bitmap or an M2P
> and also store it. This can get quite tricky so you may
> as well just start with an M2P and 'struct page'.

Ok... but the problem then becomes that the filesystem is free to
change the file-offset to SPA mapping any time it wants. So the M2P
support is broken if it expects static relationships.

> The placement of those datastructures is "
> v2 patch
>    series relies on users/admins in Dom0 instead of Dom0 driver to indicate the
>    location to store the frametable and M2P of pmem.
> "
>
> Hope this helps?

It does, but it still seems we're stuck between either 1/ not needing
M2P if we can pass a whole pmem-namespace through to the guest or 2/
M2P being broken by non-static file-offset to physical address
mappings.


* Re: [RFC XEN PATCH v2 00/15] Add vNVDIMM support to HVM domains
  2017-04-04 17:16             ` Dan Williams
@ 2017-04-04 17:34               ` Konrad Rzeszutek Wilk
  2017-04-04 17:59                 ` Dan Williams
  0 siblings, 1 reply; 32+ messages in thread
From: Konrad Rzeszutek Wilk @ 2017-04-04 17:34 UTC (permalink / raw)
  To: Dan Williams
  Cc: Wei Liu, Andrew Cooper, Ian Jackson, xen-devel, Jan Beulich,
	Konrad Rzeszutek Wilk, Daniel De Graaf

On Tue, Apr 04, 2017 at 10:16:41AM -0700, Dan Williams wrote:
> On Tue, Apr 4, 2017 at 10:00 AM, Konrad Rzeszutek Wilk
> <konrad.wilk@oracle.com> wrote:
> > On Sat, Apr 01, 2017 at 08:45:45AM -0700, Dan Williams wrote:
> >> On Sat, Apr 1, 2017 at 4:54 AM, Konrad Rzeszutek Wilk <konrad@darnok.org> wrote:
> >> > ..snip..
> >> >> >> Is there a resource I can read more about why the hypervisor needs to
> >> >> >> have this M2P mapping for nvdimm support?
> >> >> >
> >> >> > M2P is basically an array of frame numbers. It's indexed by the host
> >> >> > page frame number, or the machine frame number (MFN) in Xen's
> >> >> > definition. The n'th entry records the guest page frame number that is
> >> >> > mapped to MFN n. M2P is one of the core data structures used in Xen
> >> >> > memory management, and is used to convert MFN to guest PFN. A
> >> >> > read-only version of M2P is also exposed as part of ABI to guest. In
> >> >> > the previous design discussion, we decided to put the management of
> >> >> > NVDIMM in the existing Xen memory management as much as possible, so
> >> >> > we need to build M2P for NVDIMM as well.
> >> >> >
> >> >>
> >> >> Thanks, but what I don't understand is why this M2P lookup is needed?
> >> >
> >> > Xen uses it to construct the EPT page tables for the guests.
> >> >
> >> >> Does Xen establish this metadata for PCI mmio ranges as well? What Xen
> >> >
> >> > It doesn't have that (M2P) for PCI MMIO ranges. For those it has an
> >> > ranges construct (since those are usually contingous and given
> >> > in ranges to a guest).
> >>
> >> So, I'm confused again. This patchset / enabling requires both M2P and
> >> contiguous PMEM ranges. If the PMEM is contiguous it seems you don't
> >> need M2P and can just reuse the MMIO enabling, or am I missing
> >> something?
> >
> > I think I am confusing you.
> >
> > The patchset (specifically 04/15] xen/x86: add XEN_SYSCTL_nvdimm_pmem_setup to setup host pmem )
> > adds a hypercall to tell Xen where on the NVDIMM it can put
> > the M2P array and as well the frametables ('struct page').
> >
> > There is no range support. The reason is that if break up
> > an NVDIMM in various chunks (and then put a filesystem on top of it) - and
> > then figure out which of the SPAs belong to the file - and then
> > "expose" that file to a guest as NVDIMM - it's SPAs won't
> > be contingous. Hence the hypervisor would need to break down
> > the 'ranges' structure down in either a bitmap or an M2P
> > and also store it. This can get quite tricky so you may
> > as well just start with an M2P and 'struct page'.
> 
> Ok... but the problem then becomes that the filesystem is free to
> change the file-offset to SPA mapping any time it wants. So the M2P
> support is broken if it expects static relationships.

Can't you flock a file and that will freeze it? Or mlock it, since
one is rather mmap-ing it?
> 
> > The placement of those datastructures is "
> > v2 patch
> >    series relies on users/admins in Dom0 instead of Dom0 driver to indicate the
> >    location to store the frametable and M2P of pmem.
> > "
> >
> > Hope this helps?
> 
> It does, but it still seems we're stuck between either 1/ not needing
> M2P if we can pass a whole pmem-namespace through to the guest or 2/
> M2P being broken by non-static file-offset to physical address
> mappings.

Aye. So how can 2/ be fixed? I am assuming you would have the same
issue with KVM - if the file is 'moving' underneath (and the file-offset
to SPA mapping has changed), won't that affect the EPT and other page entries?


* Re: [RFC XEN PATCH v2 00/15] Add vNVDIMM support to HVM domains
  2017-04-04 17:34               ` Konrad Rzeszutek Wilk
@ 2017-04-04 17:59                 ` Dan Williams
  2017-04-04 18:05                   ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 32+ messages in thread
From: Dan Williams @ 2017-04-04 17:59 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Wei Liu, Andrew Cooper, Ian Jackson, xen-devel, Jan Beulich,
	Konrad Rzeszutek Wilk, Daniel De Graaf

On Tue, Apr 4, 2017 at 10:34 AM, Konrad Rzeszutek Wilk
<konrad.wilk@oracle.com> wrote:
> On Tue, Apr 04, 2017 at 10:16:41AM -0700, Dan Williams wrote:
>> On Tue, Apr 4, 2017 at 10:00 AM, Konrad Rzeszutek Wilk
>> <konrad.wilk@oracle.com> wrote:
>> > On Sat, Apr 01, 2017 at 08:45:45AM -0700, Dan Williams wrote:
>> >> On Sat, Apr 1, 2017 at 4:54 AM, Konrad Rzeszutek Wilk <konrad@darnok.org> wrote:
>> >> > ..snip..
>> >> >> >> Is there a resource I can read more about why the hypervisor needs to
>> >> >> >> have this M2P mapping for nvdimm support?
>> >> >> >
>> >> >> > M2P is basically an array of frame numbers. It's indexed by the host
>> >> >> > page frame number, or the machine frame number (MFN) in Xen's
>> >> >> > definition. The n'th entry records the guest page frame number that is
>> >> >> > mapped to MFN n. M2P is one of the core data structures used in Xen
>> >> >> > memory management, and is used to convert MFN to guest PFN. A
>> >> >> > read-only version of M2P is also exposed as part of ABI to guest. In
>> >> >> > the previous design discussion, we decided to put the management of
>> >> >> > NVDIMM in the existing Xen memory management as much as possible, so
>> >> >> > we need to build M2P for NVDIMM as well.
>> >> >> >
>> >> >>
>> >> >> Thanks, but what I don't understand is why this M2P lookup is needed?
>> >> >
>> >> > Xen uses it to construct the EPT page tables for the guests.
>> >> >
>> >> >> Does Xen establish this metadata for PCI mmio ranges as well? What Xen
>> >> >
>> >> > It doesn't have that (M2P) for PCI MMIO ranges. For those it has an
>> >> > ranges construct (since those are usually contingous and given
>> >> > in ranges to a guest).
>> >>
>> >> So, I'm confused again. This patchset / enabling requires both M2P and
>> >> contiguous PMEM ranges. If the PMEM is contiguous it seems you don't
>> >> need M2P and can just reuse the MMIO enabling, or am I missing
>> >> something?
>> >
>> > I think I am confusing you.
>> >
>> > The patchset (specifically 04/15] xen/x86: add XEN_SYSCTL_nvdimm_pmem_setup to setup host pmem )
>> > adds a hypercall to tell Xen where on the NVDIMM it can put
>> > the M2P array and as well the frametables ('struct page').
>> >
>> > There is no range support. The reason is that if break up
>> > an NVDIMM in various chunks (and then put a filesystem on top of it) - and
>> > then figure out which of the SPAs belong to the file - and then
>> > "expose" that file to a guest as NVDIMM - it's SPAs won't
>> > be contingous. Hence the hypervisor would need to break down
>> > the 'ranges' structure down in either a bitmap or an M2P
>> > and also store it. This can get quite tricky so you may
>> > as well just start with an M2P and 'struct page'.
>>
>> Ok... but the problem then becomes that the filesystem is free to
>> change the file-offset to SPA mapping any time it wants. So the M2P
>> support is broken if it expects static relationships.
>
> Can't you flock an file and that will freeze it? Or mlock it since
> one is rather mmap-ing it?

Unfortunately no. This dovetails with the discussion we have been
having with filesystem folks about the need to call msync() after
every write. Whenever the filesystem sees a write fault it is free to
move blocks around in the file - think allocation or copy-on-write
operations like reflink. The filesystem depends on the application
calling msync/fsync to make the writes from those faults
durable against crash / power loss. Also, actions like online defrag
can change these offset-to-physical-address relationships without any
involvement from the application. There's currently no mechanism to
lock out this behavior, because the filesystem assumes that it can just
invalidate mappings to make the application re-fault.
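
To make the contract explicit, a minimal (hypothetical) DAX-file update looks
like this - only the msync() makes the store durable, and the filesystem may
still reshuffle the underlying blocks behind any mapping at a later point:

    #include <fcntl.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int update_record(const char *path, const char *buf, size_t len)
    {
        int fd = open(path, O_RDWR);
        if (fd < 0)
            return -1;

        /* MAP_SHARED mapping of a file on a DAX-mounted filesystem. */
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) {
            close(fd);
            return -1;
        }

        memcpy(p, buf, len);        /* may trigger a write fault ...           */
        msync(p, len, MS_SYNC);     /* ... whose result only becomes durable here */

        munmap(p, len);
        close(fd);
        return 0;
    }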

>>
>> > The placement of those datastructures is "
>> > v2 patch
>> >    series relies on users/admins in Dom0 instead of Dom0 driver to indicate the
>> >    location to store the frametable and M2P of pmem.
>> > "
>> >
>> > Hope this helps?
>>
>> It does, but it still seems we're stuck between either 1/ not needing
>> M2P if we can pass a whole pmem-namespace through to the guest or 2/
>> M2P being broken by non-static file-offset to physical address
>> mappings.
>
> Aye. So how can 2/ be fixed? I am assuming you would have the same
> issue with KVM - if the file is 'moving' underneath (and the file-offset
> to SPA has changed) won't that affect the EPT and other page entries?

I don't think KVM has the same issue, but honestly I don't have the
full mental model of how KVM supports mmap. I've at least been able to
run a guest where the "pmem" is just dynamic page cache on the host
side, so the physical memory mapping is changing all the time due to
swap. KVM does not have this third-party M2P mapping table to keep up
to date, so I assume it is just handled by the standard mmap support
for establishing a guest physical address range, and the standard
mapping-invalidate + remap mechanism just works.


* Re: [RFC XEN PATCH v2 00/15] Add vNVDIMM support to HVM domains
  2017-04-04 17:59                 ` Dan Williams
@ 2017-04-04 18:05                   ` Konrad Rzeszutek Wilk
  2017-04-04 18:59                     ` Dan Williams
  0 siblings, 1 reply; 32+ messages in thread
From: Konrad Rzeszutek Wilk @ 2017-04-04 18:05 UTC (permalink / raw)
  To: Dan Williams
  Cc: Wei Liu, Andrew Cooper, Ian Jackson, xen-devel, Jan Beulich,
	Konrad Rzeszutek Wilk, Daniel De Graaf

On Tue, Apr 04, 2017 at 10:59:01AM -0700, Dan Williams wrote:
> On Tue, Apr 4, 2017 at 10:34 AM, Konrad Rzeszutek Wilk
> <konrad.wilk@oracle.com> wrote:
> > On Tue, Apr 04, 2017 at 10:16:41AM -0700, Dan Williams wrote:
> >> On Tue, Apr 4, 2017 at 10:00 AM, Konrad Rzeszutek Wilk
> >> <konrad.wilk@oracle.com> wrote:
> >> > On Sat, Apr 01, 2017 at 08:45:45AM -0700, Dan Williams wrote:
> >> >> On Sat, Apr 1, 2017 at 4:54 AM, Konrad Rzeszutek Wilk <konrad@darnok.org> wrote:
> >> >> > ..snip..
> >> >> >> >> Is there a resource I can read more about why the hypervisor needs to
> >> >> >> >> have this M2P mapping for nvdimm support?
> >> >> >> >
> >> >> >> > M2P is basically an array of frame numbers. It's indexed by the host
> >> >> >> > page frame number, or the machine frame number (MFN) in Xen's
> >> >> >> > definition. The n'th entry records the guest page frame number that is
> >> >> >> > mapped to MFN n. M2P is one of the core data structures used in Xen
> >> >> >> > memory management, and is used to convert MFN to guest PFN. A
> >> >> >> > read-only version of M2P is also exposed as part of ABI to guest. In
> >> >> >> > the previous design discussion, we decided to put the management of
> >> >> >> > NVDIMM in the existing Xen memory management as much as possible, so
> >> >> >> > we need to build M2P for NVDIMM as well.
> >> >> >> >
> >> >> >>
> >> >> >> Thanks, but what I don't understand is why this M2P lookup is needed?
> >> >> >
> >> >> > Xen uses it to construct the EPT page tables for the guests.
> >> >> >
> >> >> >> Does Xen establish this metadata for PCI mmio ranges as well? What Xen
> >> >> >
> >> >> > It doesn't have that (M2P) for PCI MMIO ranges. For those it has an
> >> >> > ranges construct (since those are usually contingous and given
> >> >> > in ranges to a guest).
> >> >>
> >> >> So, I'm confused again. This patchset / enabling requires both M2P and
> >> >> contiguous PMEM ranges. If the PMEM is contiguous it seems you don't
> >> >> need M2P and can just reuse the MMIO enabling, or am I missing
> >> >> something?
> >> >
> >> > I think I am confusing you.
> >> >
> >> > The patchset (specifically 04/15] xen/x86: add XEN_SYSCTL_nvdimm_pmem_setup to setup host pmem )
> >> > adds a hypercall to tell Xen where on the NVDIMM it can put
> >> > the M2P array and as well the frametables ('struct page').
> >> >
> >> > There is no range support. The reason is that if break up
> >> > an NVDIMM in various chunks (and then put a filesystem on top of it) - and
> >> > then figure out which of the SPAs belong to the file - and then
> >> > "expose" that file to a guest as NVDIMM - it's SPAs won't
> >> > be contingous. Hence the hypervisor would need to break down
> >> > the 'ranges' structure down in either a bitmap or an M2P
> >> > and also store it. This can get quite tricky so you may
> >> > as well just start with an M2P and 'struct page'.
> >>
> >> Ok... but the problem then becomes that the filesystem is free to
> >> change the file-offset to SPA mapping any time it wants. So the M2P
> >> support is broken if it expects static relationships.
> >
> > Can't you flock an file and that will freeze it? Or mlock it since
> > one is rather mmap-ing it?
> 
> Unfortunately no. This dovetails with the discussion we have been
> having with filesystem folks about the need to call msync() after
> every write. Whenever the filesystem sees a write fault it is free to
> move blocks around in the file, think allocation or copy-on-write
> operations like reflink. The filesystem depends on the application
> calling msync/fsync before it makes the writes from those faults
> durable against crash / powerloss.  Also, actions like online defrag
> can change these offset to physical address relationships without any
> involvement from the application. There's currently no mechanism to
> lock out this behavior because the filesystem assumes that it can just
> invalidate mappings to make the application re-fault.
> 
> >>
> >> > The placement of those datastructures is "
> >> > v2 patch
> >> >    series relies on users/admins in Dom0 instead of Dom0 driver to indicate the
> >> >    location to store the frametable and M2P of pmem.
> >> > "
> >> >
> >> > Hope this helps?
> >>
> >> It does, but it still seems we're stuck between either 1/ not needing
> >> M2P if we can pass a whole pmem-namespace through to the guest or 2/
> >> M2P being broken by non-static file-offset to physical address
> >> mappings.
> >
> > Aye. So how can 2/ be fixed? I am assuming you would have the same
> > issue with KVM - if the file is 'moving' underneath (and the file-offset
> > to SPA has changed) won't that affect the EPT and other page entries?
> 
> I don't think KVM has the same issue, but honestly I don't have the
> full mental model of how KVM supports mmap. I've at least been able to
> run a guest where the "pmem" is just dynamic page cache on the host
> side so the physical memory mapping is changing all the time due to
> swap. KVM does not have this third-party M2P mapping table to keep up
> to date so I assume it is just handled by the standard mmap support
> for establishing a guest physical address range and the standard
> mapping-invalidate + remap mechanism just works.

Could it be possible to have a Xen driver that would listen for
these notifications and have those changes percolate to it? Then
this driver would make the appropriate hypercalls to update the M2P.

That would solve 2/, I think?


* Re: [RFC XEN PATCH v2 00/15] Add vNVDIMM support to HVM domains
  2017-04-04 18:05                   ` Konrad Rzeszutek Wilk
@ 2017-04-04 18:59                     ` Dan Williams
  2017-04-11 17:48                       ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 32+ messages in thread
From: Dan Williams @ 2017-04-04 18:59 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Wei Liu, Andrew Cooper, Ian Jackson, xen-devel, Jan Beulich,
	Konrad Rzeszutek Wilk, Daniel De Graaf

>> I don't think KVM has the same issue, but honestly I don't have the
>> full mental model of how KVM supports mmap. I've at least been able to
>> run a guest where the "pmem" is just dynamic page cache on the host
>> side so the physical memory mapping is changing all the time due to
>> swap. KVM does not have this third-party M2P mapping table to keep up
>> to date so I assume it is just handled by the standard mmap support
>> for establishing a guest physical address range and the standard
>> mapping-invalidate + remap mechanism just works.
>
> Could it be possible to have an Xen driver that would listen on
> these notifications and percolate those changes this driver. Then
> this driver would make the appropiate hypercalls to update the M2P ?
>
> That would solve the 2/ I think?

I think that could work. That sounds like userfaultfd support for DAX,
which is something I want to take a look at in the next couple of kernel
cycles for other reasons, like live migration of guest VMs with DAX
mappings.
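
For context, the existing userfaultfd registration flow looks roughly like
the sketch below (this is the anonymous-memory API as it stands; DAX support
of this kind is speculation at this point):

    #include <fcntl.h>
    #include <linux/userfaultfd.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    /* Register [addr, addr + len) so that missing-page faults are reported
     * to userspace instead of being resolved by the kernel directly. */
    int register_uffd(void *addr, unsigned long len)
    {
        int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
        if (uffd < 0)
            return -1;

        struct uffdio_api api = { .api = UFFD_API };
        if (ioctl(uffd, UFFDIO_API, &api) < 0) {
            close(uffd);
            return -1;
        }

        struct uffdio_register reg = {
            .range = { .start = (unsigned long)addr, .len = len },
            .mode  = UFFDIO_REGISTER_MODE_MISSING,
        };
        if (ioctl(uffd, UFFDIO_REGISTER, &reg) < 0) {
            close(uffd);
            return -1;
        }

        /* A monitor thread then read()s struct uffd_msg events from uffd
         * and resolves each fault (e.g. with UFFDIO_COPY). */
        return uffd;
    }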


* Re: [RFC XEN PATCH v2 00/15] Add vNVDIMM support to HVM domains
  2017-04-04 18:59                     ` Dan Williams
@ 2017-04-11 17:48                       ` Konrad Rzeszutek Wilk
  0 siblings, 0 replies; 32+ messages in thread
From: Konrad Rzeszutek Wilk @ 2017-04-11 17:48 UTC (permalink / raw)
  To: Dan Williams, haozhong.zhang
  Cc: Wei Liu, Andrew Cooper, Ian Jackson, xen-devel, Jan Beulich,
	Konrad Rzeszutek Wilk, Daniel De Graaf

On Tue, Apr 04, 2017 at 11:59:03AM -0700, Dan Williams wrote:
> >> I don't think KVM has the same issue, but honestly I don't have the
> >> full mental model of how KVM supports mmap. I've at least been able to
> >> run a guest where the "pmem" is just dynamic page cache on the host
> >> side so the physical memory mapping is changing all the time due to
> >> swap. KVM does not have this third-party M2P mapping table to keep up
> >> to date so I assume it is just handled by the standard mmap support
> >> for establishing a guest physical address range and the standard
> >> mapping-invalidate + remap mechanism just works.
> >
> > Could it be possible to have an Xen driver that would listen on
> > these notifications and percolate those changes this driver. Then
> > this driver would make the appropiate hypercalls to update the M2P ?
> >
> > That would solve the 2/ I think?
> 
> I think that could work. That sounds like userfaultfd support for DAX
> which is something I want to take a look at in the next couple kernel
> cycles for other reasons like live migration of guest-VMs with DAX
> mappings.

I would need to educate myself a bit more about this.
But I just realized we lost Haozhong on this thread. Adding him back in;
perhaps he has more experience with that.


end of thread (newest message: 2017-04-11 17:48 UTC)

Thread overview: 32+ messages
2017-03-20  0:09 [RFC XEN PATCH v2 00/15] Add vNVDIMM support to HVM domains Haozhong Zhang
2017-03-20  0:09 ` [RFC XEN PATCH v2 01/15] xen/common: add Kconfig item for pmem support Haozhong Zhang
2017-03-20  0:09 ` [RFC XEN PATCH v2 02/15] xen: probe pmem regions via ACPI NFIT Haozhong Zhang
2017-03-20  0:09 ` [RFC XEN PATCH v2 03/15] xen/x86: allow customizing locations of extended frametable & M2P Haozhong Zhang
2017-03-20  0:09 ` [RFC XEN PATCH v2 04/15] xen/x86: add XEN_SYSCTL_nvdimm_pmem_setup to setup host pmem Haozhong Zhang
2017-03-20  0:09 ` [RFC XEN PATCH v2 05/15] xen/x86: add XENMEM_populate_pmem_map to map host pmem pages to HVM domain Haozhong Zhang
2017-03-20  0:09 ` [RFC XEN PATCH v2 06/15] tools: reserve guest memory for ACPI from device model Haozhong Zhang
2017-03-20  0:09 ` [RFC XEN PATCH v2 07/15] tools/libacpi: expose the minimum alignment used by mem_ops.alloc Haozhong Zhang
2017-03-20  0:09 ` [RFC XEN PATCH v2 08/15] tools/libacpi: add callback acpi_ctxt.p2v to get a pointer from physical address Haozhong Zhang
2017-03-20  0:09 ` [RFC XEN PATCH v2 09/15] tools/libacpi: add callbacks to access XenStore Haozhong Zhang
2017-03-20  0:09 ` [RFC XEN PATCH v2 10/15] tools/libacpi: add a simple AML builder Haozhong Zhang
2017-03-20  0:09 ` [RFC XEN PATCH v2 11/15] tools/libacpi: load ACPI built by the device model Haozhong Zhang
2017-03-20  0:09 ` [RFC XEN PATCH v2 12/15] tools/libxl: build qemu options from xl vNVDIMM configs Haozhong Zhang
2017-03-20  0:09 ` [RFC XEN PATCH v2 13/15] tools/libxl: add support to map host pmem device to guests Haozhong Zhang
2017-03-20  0:09 ` [RFC XEN PATCH v2 14/15] tools/libxl: initiate pmem mapping via qmp callback Haozhong Zhang
2017-03-20  0:09 ` [RFC XEN PATCH v2 15/15] tools/misc: add xen-ndctl Haozhong Zhang
2017-03-30  4:11   ` Dan Williams
2017-03-30  7:58     ` Haozhong Zhang
2017-04-01 11:55       ` Konrad Rzeszutek Wilk
2017-03-30  4:20 ` [RFC XEN PATCH v2 00/15] Add vNVDIMM support to HVM domains Dan Williams
2017-03-30  8:21   ` Haozhong Zhang
2017-03-30 16:01     ` Dan Williams
2017-04-01 11:54       ` Konrad Rzeszutek Wilk
2017-04-01 15:45         ` Dan Williams
2017-04-04 17:00           ` Konrad Rzeszutek Wilk
2017-04-04 17:16             ` Dan Williams
2017-04-04 17:34               ` Konrad Rzeszutek Wilk
2017-04-04 17:59                 ` Dan Williams
2017-04-04 18:05                   ` Konrad Rzeszutek Wilk
2017-04-04 18:59                     ` Dan Williams
2017-04-11 17:48                       ` Konrad Rzeszutek Wilk
2017-04-01 12:24 ` Konrad Rzeszutek Wilk
