* [RFC Design Doc v2] Add vNVDIMM support for Xen
@ 2016-07-18  0:29 Haozhong Zhang
  2016-07-18  8:36 ` Tian, Kevin
                   ` (3 more replies)
  0 siblings, 4 replies; 24+ messages in thread
From: Haozhong Zhang @ 2016-07-18  0:29 UTC (permalink / raw)
  To: xen-devel
  Cc: Juergen Gross, Zhang, Haozhong, Tian, Kevin, Stefano Stabellini,
	Wei Liu, Nakajima, Jun, George Dunlap, Andrew Cooper,
	Ian Jackson, Jan Beulich, Xiao Guangrong

Hi,

Following is version 2 of the design doc for supporting vNVDIMM in
Xen. It's basically the summary of discussion on previous v1 design
(https://lists.xenproject.org/archives/html/xen-devel/2016-02/msg00006.html).
Any comments are welcome. The corresponding patches are WIP.

Thanks,
Haozhong



vNVDIMM Design v2

Changes in v2:
 - Rewrite the design details based on the previous discussion [7].
 - Add Section 3 Usage Example of vNVDIMM in Xen.
 - Remove content about pcommit instruction which has been deprecated [8].

Content
=======
1. Background
 1.1 Access Mechanisms: Persistent Memory and Block Window
 1.2 ACPI Support
  1.2.1 NFIT
  1.2.2 _DSM and _FIT
 1.3 Namespace
 1.4 clwb/clflushopt
2. NVDIMM/vNVDIMM Support in Linux Kernel/KVM/QEMU
 2.1 NVDIMM Driver in Linux Kernel
 2.2 vNVDIMM Implementation in KVM/QEMU
3. Usage Example of vNVDIMM in Xen
4. Design of vNVDIMM in Xen
 4.1 Guest clwb/clflushopt Enabling
 4.2 pmem Address Management
  4.2.1 Reserve Storage for Management Structures
  4.2.2 Detection of Host pmem Devices
  4.2.3 Get Host Machine Address (SPA) of Host pmem Files
  4.2.4 Map Host pmem to Guests
  4.2.5 Misc 1: RAS
  4.2.6 Misc 2: hotplug
 4.3 Guest ACPI Emulation
  4.3.1 Building Guest ACPI Tables
  4.3.2 Emulating Guest _DSM
References


Non-Volatile DIMM or NVDIMM is a type of RAM device that provides
persistent storage and retains data across reboots and even power
failures. This document describes the design for providing virtual
NVDIMM devices (vNVDIMM) in Xen.

The rest of this document is organized as below.
 - Section 1 introduces the background knowledge of NVDIMM hardware,
   which is used by other parts of this document.

 - Section 2 briefly introduces the current and future NVDIMM/vNVDIMM
   support in the Linux kernel/KVM/QEMU, which affects the vNVDIMM
   design in Xen.

 - Section 3 shows the basic usage example of vNVDIMM in Xen.

 - Section 4 proposes design details of vNVDIMM in Xen.



1. Background

1.1 Access Mechanisms: Persistent Memory and Block Window

 NVDIMM provides two access mechanisms: byte-addressable persistent
 memory (pmem) and block window (pblk). An NVDIMM can contain multiple
 ranges and each range can be accessed through either pmem or pblk
 (but not both).

 Byte-addressable persistent memory mechanism (pmem) maps NVDIMM or
 ranges of NVDIMM into the system physical address (SPA) space, so
 that software can access NVDIMM via normal memory loads and
 stores. If a virtual address is used, the MMU translates it to the
 physical address.

 In a virtualized environment, we can pass through a pmem range, or
 part of it, to a guest by mapping it in EPT (i.e. mapping guest
 vNVDIMM physical addresses to host NVDIMM physical addresses), so
 that guest accesses go directly to the host NVDIMM device without the
 hypervisor's interception.

 Block window mechanism (pblk) provides one or multiple block windows
 (BW). Each BW is composed of a command register, a status register
 and an 8 KB aperture register. Software fills in the direction of the
 transfer (read/write), the start address (LBA) and the size of the
 data to transfer on the NVDIMM. If nothing goes wrong, the data can
 then be read/written via the aperture register, and the status and
 errors of the transfer can be obtained from the status register.
 Other vendor-specific commands and status can be implemented for BW
 as well. Details of the block window access mechanism can be found
 in [3].

 In a virtualized environment, different pblk regions on a single
 NVDIMM device may be accessed by different guests, so the hypervisor
 would need to emulate BW, which would introduce a high overhead for
 I/O intensive workloads.

 Therefore, we are going to implement only pmem for vNVDIMM. The rest
 of this document will mostly concentrate on pmem.


1.2 ACPI Support

 ACPI provides two kinds of support for NVDIMM. First, NVDIMM
 devices are described by firmware (BIOS/EFI) to OS via ACPI-defined
 NVDIMM Firmware Interface Table (NFIT). Second, several functions of
 NVDIMM, including operations on namespace labels, S.M.A.R.T and
 hotplug, are provided by ACPI methods (_DSM and _FIT).

1.2.1 NFIT

 NFIT is a new system description table added in ACPI v6 with
 signature "NFIT". It contains a set of structures.

 - System Physical Address Range Structure
   (SPA Range Structure)

   The SPA range structure describes the system physical address
   ranges occupied by NVDIMMs and the types of those regions.

   If Address Range Type GUID field of a SPA range structure is "Byte
   Addressable Persistent Memory (PM) Region", then the structure
   describes an NVDIMM region that is accessed via pmem. The System
   Physical Address Range Base and Length fields describe the start
   system physical address and the length that is occupied by that
   NVDIMM region.

   A SPA range structure is identified by a non-zero SPA range
   structure index.

   Note: [1] reserves E820 type 7: OSPM must comprehend this memory as
         having non-volatile attributes and handle distinct from
         conventional volatile memory (in Table 15-312 of [1]). The
         memory region supports byte-addressable non-volatility. E820
         type 12 (OEM defined) may be also used for legacy NVDIMM
         prior to ACPI v6.

   Note: Besides OS, EFI firmware may also parse NFIT for booting
         drives (Section 9.3.6.9 of [5]).

 - Memory Device to System Physical Address Range Mapping Structure
   (Range Mapping Structure)

   An NVDIMM region described by a SPA range structure can be
   interleaved across multiple NVDIMM devices. A range mapping
   structure is used to describe the single mapping on each NVDIMM
   device. It describes the size and the offset in a SPA range that an
   NVDIMM device occupies. It may refer to an Interleave Structure
   that contains details of the entire interleave set. That
   information is used by the NVDIMM driver for address translation
   in pblk.

   The NVDIMM device described by the range mapping structure is
   identified by a unique NFIT Device Handle.

 Details of NFIT and other structures can be found in Section 5.25 in [1].
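
 For illustration, the layout of the SPA range structure can be
 sketched as the C struct below. The field names are ours rather than
 taken verbatim from [1], and the exact layout should be checked
 against the specification before use.

    /* Sketch of the NFIT SPA Range Structure described above; field
     * names are illustrative only. */
    #include <stdint.h>

    struct nfit_spa_range {
        uint16_t type;                 /* 0 = SPA Range Structure */
        uint16_t length;               /* length of this structure in bytes */
        uint16_t spa_range_index;      /* non-zero index identifying this range */
        uint16_t flags;
        uint32_t reserved;
        uint32_t proximity_domain;
        uint8_t  range_type_guid[16];  /* e.g. Byte Addressable PM Region */
        uint64_t spa_base;             /* SPA Range Base */
        uint64_t spa_length;           /* SPA Range Length */
        uint64_t mem_mapping_attr;     /* memory mapping attributes */
    } __attribute__((packed));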

1.2.2 _DSM and _FIT

 The ACPI namespace device with _HID ACPI0012 identifies the root
 NVDIMM interface device. For each NVDIMM device, an ACPI namespace
 device is also present under the root device. The above ACPI
 namespace devices are defined in SSDT.

 _DSM methods are present under the root device and each NVDIMM
 device. _DSM methods are used by drivers to access the label storage
 area, get health information, perform vendor-specific commands,
 etc. Details of all _DSM methods can be found in [4].

 The _FIT method is present under the root device and is evaluated by
 OSPM to get the NFIT of hot-plugged NVDIMMs. A hot-plugged NVDIMM is
 indicated to the OS via an ACPI namespace device with PNP ID PNP0C80
 and a device object notification value of 0x80. Details of NVDIMM
 hotplug can be found in Section 9.20 of [1].


1.3 Namespace

 [2] describes a mechanism to sub-divide NVDIMMs into namespaces,
 which are logical units of storage similar to SCSI LUNs and NVM Express
 namespaces.

 The namespace information is described by namespace labels stored in
 the persistent label storage area on each NVDIMM device. The label
 storage area is excluded from the range mapped by the SPA range
 structure and can only be accessed via _DSM methods.

 There are two types of namespaces defined in [2]: persistent memory
 namespaces and block namespaces. Persistent memory namespaces are
 built only on pmem NVDIMM regions, while block namespaces are built
 only on pblk regions. Only one persistent memory namespace is allowed
 for a pmem NVDIMM region.

 Namespace labels are accessed via _DSM, but namespaces themselves are
 managed and interpreted by software. OS vendors may decide not to
 follow [2] and store other types of information in the label storage
 area.


1.4 clwb/clflushopt

 Writes to NVDIMM may linger in CPU caches, so certain flush
 operations must be performed to make them persistent on NVDIMM. clwb
 is preferred over clflushopt and clflush for flushing writes from
 caches to memory.

 Details of clwb/clflushopt can be found in Chapter 10 of [6].
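
 As a minimal illustration, a buffer in pmem can be flushed to
 persistence as in the sketch below, assuming a compiler and CPU that
 provide the _mm_clwb() intrinsic (immintrin.h, built with -mclwb):

    #include <immintrin.h>
    #include <stdint.h>
    #include <stddef.h>

    #define CACHELINE 64

    /* Write back every cache line covering buf[0..len) and order the
     * write-backs before subsequent stores. */
    static void flush_to_pmem(const void *buf, size_t len)
    {
        uintptr_t p   = (uintptr_t)buf & ~(uintptr_t)(CACHELINE - 1);
        uintptr_t end = (uintptr_t)buf + len;

        for ( ; p < end; p += CACHELINE )
            _mm_clwb((void *)p);
        _mm_sfence();
    }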



2. NVDIMM/vNVDIMM Support in Linux Kernel/KVM/QEMU

2.1 NVDIMM Driver in Linux Kernel

 The Linux kernel has supported ACPI-defined NVDIMM devices since
 version 4.2.

 The NVDIMM driver in Linux probes NVDIMM devices through ACPI
 (i.e. NFIT and _FIT). It is also responsible for parsing the
 namespace labels on each NVDIMM device, recovering namespaces after
 power failure (as described in [2]) and handling NVDIMM hotplug.
 There are also some vendor drivers to perform vendor-specific
 operations on NVDIMMs (e.g. via _DSM).

 Compared to ordinary RAM, NVDIMM is used more like a persistent
 storage drive because of its persistence. For each persistent memory
 namespace, or each label-less pmem NVDIMM range, the NVDIMM driver
 implements a block device interface (/dev/pmemX).

 Userspace applications can mmap(2) the whole pmem device into their
 own virtual address space. The Linux kernel maps the system physical
 address range occupied by pmem into the virtual address space, so
 that normal memory loads/stores, followed by the proper flush
 instructions, are applied to the underlying pmem NVDIMM regions.

 Alternatively, a DAX file system can be made on /dev/pmemX. Files on
 that file system can be used in the same way as above. As the Linux
 kernel maps the system address range occupied by those files on
 NVDIMM to the virtual address space, reads/writes on those files are
 applied to the underlying NVDIMM regions as well.
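
 A minimal sketch of such usage is shown below; the device path and
 mapping size are examples only:

    #include <fcntl.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("/dev/pmem0", O_RDWR);
        if (fd < 0)
            return 1;

        /* Loads/stores through p go to the underlying pmem NVDIMM region. */
        char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED)
            return 1;

        memcpy(p, "persistent data", 16);

        /* Flush with clwb/clflushopt + sfence (Section 1.4) or msync(2). */
        msync(p, 4096, MS_SYNC);

        munmap(p, 4096);
        close(fd);
        return 0;
    }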

2.2 vNVDIMM Implementation in KVM/QEMU

 An overview of the vNVDIMM implementation in KVM (Linux kernel v4.2) /
 QEMU (commit 70d1fb9 and patches in review or planned) is shown in the
 following figure.


                                       +---------------------------------+
 Guest                             GPA |                    | /dev/pmem0 |
                                       +---------------------------------+
           parse        evaluate                            ^            ^
            ACPI          _DSM                              |            |
              |            |                                |            |
 -------------|------------|--------------------------------|------------|----
              V            V                                |            |
          +-------+    +-------+                            |            |
 QEMU     | vACPI |    | v_DSM |                            |            |
          +-------+    +-------+                            |            |
                           ^                                |            |
                           | Read/Write                     |            |
                           V                                |            |
          +...+--------------------+...+-----------+        |            |
    VA    |   | Label Storage Area |   |    buf    |  KVM_SET_USER_MEMORY_REGION(buf)
          +...+--------------------+...+-----------+        |            |
                                       ^  mmap(2)  ^        |            |
 --------------------------------------|-----------|--------|------------|----
                                       |           +--------~------------+
                                       |                    |            |
 Linux/KVM                             +--------------------+            |
                                                            |            |
                                                       +....+------------+
                                                SPA    |    | /dev/pmem0 |
                                                       +....+------------+
                                                                   ^
                                                                   |
                                                            Host NVDIMM Driver
-------------------------------------------------------------------|---------
                                                                   |
 HW                                                          +------------+
                                                             |   NVDIMM   |
                                                             +------------+


 One part not shown in the above figure is guest clwb/clflushopt
 enabling, which exposes those instructions to the guest via guest
 cpuid.

 Besides instruction enabling, there are two primary parts of vNVDIMM
 implementation in KVM/QEMU.

 (1) Address Mapping

  As described before, the host Linux NVDIMM driver provides a block
  device interface (/dev/pmem0 at the bottom) for a pmem NVDIMM
  region. QEMU can then mmap(2) that device into its virtual address
  space (buf). QEMU is responsible for finding a guest physical
  address range that is large enough to hold /dev/pmem0. QEMU then
  passes the virtual address of the mmapped buf to the KVM API
  KVM_SET_USER_MEMORY_REGION, which maps, in EPT, the guest physical
  address range where the virtual pmem device will be to the host
  physical address range of buf.

  In this way, all guest reads/writes on the virtual pmem device are
  applied directly to the host one.

  In addition, the above implementation also allows a virtual pmem
  device to be backed by an mmapped regular file or by a piece of
  ordinary RAM.
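
  The KVM side of this mapping can be sketched roughly as below; the
  slot number and guest physical address are made-up examples, and
  error handling is omitted:

    #include <linux/kvm.h>
    #include <sys/ioctl.h>

    /* Register the mmapped pmem buffer with KVM so that the guest
     * physical range is backed by the host pmem pages behind buf. */
    static int map_pmem_into_guest(int vm_fd, void *buf, __u64 size)
    {
        struct kvm_userspace_memory_region region = {
            .slot            = 5,                 /* example slot */
            .flags           = 0,
            .guest_phys_addr = 0x100000000ULL,    /* example guest PA */
            .memory_size     = size,
            .userspace_addr  = (__u64)(unsigned long)buf,
        };

        return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);
    }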

 (2) Guest ACPI Emulation

  As the guest system physical address and the size of the virtual
  pmem device are determined by QEMU, QEMU is responsible for
  emulating the guest NFIT and SSDT. Basically, it builds the guest
  NFIT and its sub-structures, which describe the virtual NVDIMM
  topology, and a guest SSDT that defines the ACPI namespace devices
  of the virtual NVDIMM.

  Because a portion of a host pmem device, a regular file or ordinary
  RAM can be used to back the guest pmem device, the label storage
  area on the host pmem cannot always be passed through to the guest.
  Therefore, guest reads/writes on the label storage area are emulated
  by QEMU. As described before, the _DSM method is used by OSPM to
  access the label storage area, and therefore it is emulated by
  QEMU. The _DSM buffer is registered as MMIO, and its guest physical
  address and size are described in the guest ACPI. Every
  command/status read/write from the guest is trapped and emulated by
  QEMU.

  Guest _FIT method will be implemented similarly in the future.



3. Usage Example of vNVDIMM in Xen

 Our design is to provide virtual pmem devices to HVM domains. The
 virtual pmem devices are backed by host pmem devices.

 Dom0 Linux kernel can detect the host pmem devices and create
 /dev/pmemXX for each detected device. Users in Dom0 can then create a
 DAX file system on /dev/pmemXX and create several pre-allocated files
 in that DAX file system.

 After setting up the file system on the host pmem, users can add the
 following lines to the xl configuration file to assign host pmem
 regions to domains:
     vnvdimm = [ 'file=/dev/pmem0' ]
 or
     vnvdimm = [ 'file=/mnt/dax/pre_allocated_file' ]

 The first type of configuration assigns the entire pmem device
 (/dev/pmem0) to the domain, while the second assigns the space
 allocated to /mnt/dax/pre_allocated_file on the host pmem device to
 the domain.

 When the domain starts, the guest can detect the (virtual) pmem
 devices via ACPI, and guest reads/writes on the virtual pmem devices
 are applied directly to their host backends.



4. Design of vNVDIMM in Xen

 As in KVM/QEMU, our design currently only provides pmem vNVDIMM.

 Similar to KVM/QEMU, enabling vNVDIMM in Xen is composed of three
 parts:
 (1) Guest clwb/clflushopt enabling,
 (2) pmem address management, and
 (3) Guest ACPI emulation.

 The rest of this section presents the design of each part in turn.
 The basic design principle is to reuse existing code in the Linux
 NVDIMM driver, QEMU and Xen as much as possible.


4.1 Guest clwb/clflushopt Enabling

 The instruction enabling is simple and we do the same work as in KVM/QEMU:
 - clwb/clflushopt are exposed to guest via guest cpuid.
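
 For reference, a guest can check whether these instructions were
 exposed by reading CPUID leaf 7 (sub-leaf 0): EBX bit 23 indicates
 clflushopt and bit 24 indicates clwb. A small sketch:

    #include <cpuid.h>
    #include <stdio.h>

    int main(void)
    {
        unsigned int eax, ebx, ecx, edx;

        /* CPUID.(EAX=7,ECX=0):EBX[23] = CLFLUSHOPT, EBX[24] = CLWB */
        __cpuid_count(7, 0, eax, ebx, ecx, edx);

        printf("clflushopt: %s\n", (ebx & (1u << 23)) ? "yes" : "no");
        printf("clwb      : %s\n", (ebx & (1u << 24)) ? "yes" : "no");
        return 0;
    }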


4.2 pmem Address Management

 pmem address management is primarily composed of three parts:
 (1) detection of pmem devices and their address ranges, which is
     accomplished by the Dom0 Linux pmem driver and the Xen hypervisor;
 (2) getting the SPA ranges of a pmem area that will be mapped to a
     domain, which is accomplished by xl;
 (3) mapping the pmem area to a domain, which is accomplished by QEMU
     and the Xen hypervisor.

 Our design intends to reuse the current memory management for normal
 RAM in Xen to manage the mapping of pmem. This raises a problem:
 where to store the memory management data structures for pmem.

 The rest of this section addresses the above aspects in turn.

4.2.1 Reserve Storage for Management Structures

 A core data structure in Xen memory management is 'struct page_info'.
 For normal RAM, Xen creates a page_info struct for each page. For
 pmem, we are going to do the same. However, for large capacity pmem
 devices (e.g. several terabytes or even larger), the page_info
 structs would occupy too much storage space to fit in normal RAM.

 Our solution, also used by the Linux kernel, is to reserve an area on
 pmem and place pmem's page_info structs in that reserved area. In
 this way we can always ensure there is enough space for pmem
 page_info structs, though accessing them is slower than accessing
 normal RAM.
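
 For example, assuming a 4 KB page size and a page_info of roughly 32
 bytes (the exact size depends on the build), a 1 TB pmem device has
 1 TB / 4 KB = 256M pages and thus needs about 256M * 32 B = 8 GB of
 reserved space on the pmem device for its page_info structs.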

 Such a pmem namespace (i.e. one with a reserved area) can be created
 via the userspace tool ndctl and is then recognized by the Linux
 NVDIMM driver. However, they currently only reserve space for the
 Linux kernel's page structs. Therefore, our design needs to extend
 both the Linux NVDIMM driver and ndctl to reserve an arbitrary size.

4.2.2 Detection of Host pmem Devices

 Detecting and initializing host pmem devices requires a non-trivial
 driver to interact with the corresponding ACPI namespace devices,
 parse namespace labels and take necessary recovery actions. Instead
 of duplicating the comprehensive Linux pmem driver in the Xen
 hypervisor, our design leaves this to Dom0 Linux and lets Dom0 Linux
 report detected host pmem devices to the Xen hypervisor.

 Our design takes the following steps to detect host pmem devices when
 Xen boots.
 (1) As when booting on bare metal, host pmem devices are detected by
     the Dom0 Linux NVDIMM driver.

 (2) Our design extends the Linux NVDIMM driver to report the SPAs and
     sizes of the pmem devices and their reserved areas to the Xen
     hypervisor via a new hypercall (a hypothetical sketch of such a
     hypercall argument follows after these steps).

 (3) The Xen hypervisor then checks
     - whether the SPA range of the newly reported pmem device overlaps
       with that of any previously reported pmem device;
     - whether the reserved area fits in the pmem device and is large
       enough to hold the page_info structs for the whole device.

     If any check fails, the reported pmem device will be ignored by
     the Xen hypervisor and hence will not be used by any
     guests. Otherwise, the Xen hypervisor will record the reported
     parameters and create page_info structs in the reserved area.

 (4) Because the reserved area is now used by the Xen hypervisor, it
     should no longer be accessible to Dom0. Therefore, if a host
     pmem device is recorded by the Xen hypervisor, Xen will unmap its
     reserved area from Dom0. Our design also needs to extend the
     Linux NVDIMM driver to "balloon out" the reserved area after it
     successfully reports a pmem device to the Xen hypervisor.
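
 A hypothetical sketch of the data Dom0 could pass in the new
 hypercall of step (2) is shown below; the op number, structure and
 field names are illustrative only and do not refer to an existing Xen
 interface:

    #include <stdint.h>

    #define XENPF_pmem_add 0x100   /* made-up op number */

    struct xenpf_pmem_add {
        uint64_t spa;        /* start SPA of the host pmem region */
        uint64_t size;       /* size of the pmem region in bytes */
        uint64_t rsv_spa;    /* start SPA of the reserved area within it */
        uint64_t rsv_size;   /* size of the reserved area (for page_info) */
    };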

4.2.3 Get Host Machine Address (SPA) of Host pmem Files

 Before a pmem file is assigned to a domain, we need to know the host
 SPA ranges that are allocated to this file. We do this work in xl.

 If a pmem device /dev/pmem0 is given, xl will read
 /sys/block/pmem0/device/{resource,size} respectively for the start
 SPA and size of the pmem device.

 If a pre-allocated file /mnt/dax/file is given,
 (1) xl first finds the host pmem device where /mnt/dax/file is. Then
     it uses the method above to get the start SPA of the host pmem
     device.
 (2) xl then uses the fiemap ioctl to get the extent mappings of
     /mnt/dax/file, and adds the physical offset and length of each
     mapping entry to the above start SPA to get the SPA ranges
     pre-allocated for this file.

 The resulting host SPA ranges will be passed to QEMU, which allocates
 guest address space for the vNVDIMM devices and calls the Xen
 hypervisor to map the guest addresses to the host SPA ranges.
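
 A simplified sketch of this calculation is shown below; the start SPA
 is assumed to have been read from /sys/block/pmemX/device/resource,
 error handling is reduced to a minimum, and at most 16 extents are
 retrieved:

    #include <fcntl.h>
    #include <linux/fiemap.h>
    #include <linux/fs.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/ioctl.h>
    #include <unistd.h>

    static void print_spa_ranges(const char *path, unsigned long long dev_spa)
    {
        struct fiemap *fm;
        int fd = open(path, O_RDONLY);

        if (fd < 0)
            return;

        fm = calloc(1, sizeof(*fm) + 16 * sizeof(struct fiemap_extent));
        fm->fm_length = ~0ULL;        /* map the whole file */
        fm->fm_extent_count = 16;

        if (ioctl(fd, FS_IOC_FIEMAP, fm) == 0) {
            for (unsigned int i = 0; i < fm->fm_mapped_extents; i++) {
                struct fiemap_extent *e = &fm->fm_extents[i];
                /* SPA range of this extent = device start SPA + physical offset */
                printf("spa %#llx len %#llx\n",
                       (unsigned long long)(dev_spa + e->fe_physical),
                       (unsigned long long)e->fe_length);
            }
        }

        free(fm);
        close(fd);
    }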

4.2.4 Map Host pmem to Guests

 Our design reuses the existing address mapping in Xen for normal RAM
 to map pmem. We still initiate the mapping for pmem from QEMU, and
 there is one difference from the mapping of normal RAM:
 - For normal RAM, QEMU only needs to provide the gpfn; the actual
   host memory to which the gpfn is mapped is allocated by the Xen
   hypervisor.
 - For pmem, QEMU provides both the gpfn and the mfn to which the gpfn
   is expected to be mapped. The mfn is provided by xl as described in
   Section 4.2.3.

 Our design introduces a new XENMEM op for the pmem mapping, which
 finally calls guest_physmap_add_page() to add the host pmem pages to
 a domain's address space.
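
 A hypothetical shape of the argument of this new XENMEM op is
 sketched below; the structure and field names are illustrative only.
 The handler in Xen would loop over the range and call
 guest_physmap_add_page() for each page, as when populating normal
 RAM, but with the mfn supplied by the caller:

    #include <stdint.h>

    struct xen_pmem_map {
        uint16_t domid;      /* target domain */
        uint64_t mfn;        /* first host pmem machine frame (from xl) */
        uint64_t gpfn;       /* first guest frame to map it at (from QEMU) */
        uint64_t nr_mfns;    /* number of frames to map */
    };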

4.2.5 Misc 1: RAS

 Machine checks can occur for NVDIMM just as for normal RAM, so we
 follow the current machine check handling in Xen for MC# from NVDIMM.

4.2.6 Misc 2: hotplug

 Hot-plugged host NVDIMM devices are detected via the _FIT method
 under the root ACPI namespace device for NVDIMM. We rely on the Dom0
 Linux kernel to discover hot-plugged NVDIMM devices and follow the
 steps in Section 4.2.2 to report them to the Xen hypervisor.


4.3 Guest ACPI Emulation

 Guest ACPI emulation is composed of two parts: building guest NFIT
 and SSDT that defines ACPI namespace devices for NVDIMM, and
 emulating guest _DSM. As QEMU has already implemented ACPI support
 for vNVDIMM on KVM, our design intends to reuse its implementation.

4.3.1 Building Guest ACPI Tables

 Two tables for vNVDIMM need to be built:
 - NFIT, which defines the basic parameters of vNVDIMM devices and
   does not contain any AML code.
 - SSDT, which defines ACPI namespace devices for vNVDIMM in AML code.

 The contents of both tables are affected by parameters (e.g. the
 address and size of vNVDIMM devices) that cannot be determined until
 a guest configuration is given. However, all AML code in the guest
 ACPI is currently generated at compile time from pre-crafted .asl
 files, which is not feasible for the ACPI namespace devices for
 vNVDIMM.

 We could either introduce an AML builder to generate AML code at
 runtime, like what QEMU currently does, or pass the vNVDIMM ACPI
 tables from QEMU to Xen. In order to avoid duplicating the AML
 builder already in QEMU, our design takes the latter approach.
 Basically, our design takes the following steps:
 1) The current QEMU does not build any ACPI content when it runs as
    the Xen device model, so we need to patch it to generate the NFIT
    and the AML code of ACPI namespace devices for vNVDIMM.

 2) QEMU then copies the above NFIT and ACPI namespace devices to an
    area at the end of guest memory below 4G. The guest physical
    address and size of this area are written to the xenstore keys
    (/local/domain/domid/hvmloader/dm-acpi/{address,length}). The
    detailed format of data in this area is explained later.

 3) hvmloader reads the above xenstore keys to probe the passed-in
    ACPI tables and ACPI namespace devices, and detects potential
    collisions as explained later.

 4) If no collisions are found, hvmloader will
    (1) append the passed-in ACPI tables to the end of the existing
        guest ACPI tables, like the current
        construct_passthrough_tables() does;
    (2) construct an SSDT for each passed-in ACPI namespace device and
        append it to the end of the existing guest ACPI tables.

 Passing arbitrary ACPI tables and AML code from QEMU could
 introduce at least two types of collisions:
 1) a passed-in table and a Xen-built table have the same signature
 2) a passed-in ACPI namespace device and a Xen-built ACPI namespace
    device have the same device name.

 Our design takes the following method to detect and avoid collisions.
 1) The data layout of the area to which QEMU copies its NFIT and ACPI
    namespace devices is organized as below:

     1 byte 4 bytes  length bytes
    +------+--------+-----------+------+--------+-----------+-----
    | type | length | data blob | type | length | data blob | ...
    +------+--------+-----------+------+--------+-----------+-----

    type: 0 - data blob contains a complete ACPI table
          1 - data blob contains AML code for an ACPI namespace device

    length: the number of bytes of data blob

    data blob: type 0 - a complete ACPI table
               type 1 - composed as below:

                         4 bytes   (length - 4) bytes
                        +---------+------------------+
                        | name[4] | AML code snippet |
                        +---------+------------------+

                        name[4]         : name of ACPI namespace device
                        AML code snippet: AML code inside "Device(name[4])"

               e.g. for an ACPI namespace device defined by
                        Device(NVDR)
                        {
                          Name (_HID, "ACPI0012")
                          ...
                        }
                    QEMU builds a data blob like
                        +--------------------+-----------------------------+
                        | 'N', 'V', 'D', 'R' | Name (_HID, "ACPI0012") ... |
                        +--------------------+-----------------------------+

 2) hvmloader stores signatures of its own guest ACPI tables in an
    array builtin_table_sigs[], and names of its own guest ACPI
    namespace devices in an array builtin_nd_names[]. Because there
    are only a few guest ACPI tables and namespace devices built by
    Xen, we can hardcode their signatures or names in hvmloader.

 3) When hvmloader loads a type 0 entry, it extracts the signature
    from the data blob and searches for it in builtin_table_sigs[]. If
    a match is found, hvmloader will report an error and stop.
    Otherwise, it will append the table to the end of the loaded guest
    ACPI.

 4) When hvmloader loads a type 1 entry, it extracts the device name
    from the data blob and searches for it in builtin_nd_names[]. If a
    match is found, hvmloader will report an error and stop. Otherwise,
    it will wrap the AML code snippet in "Device (name[4]) {...}" and
    include it in a new SSDT, which is then appended to the end of the
    loaded guest ACPI.
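
 The hvmloader-side loading described in steps 3) and 4) can be
 sketched as below; dm_acpi_addr/dm_acpi_len stand for the values read
 from the xenstore keys, and the names and checks are illustrative
 only:

    #include <stdint.h>
    #include <string.h>

    #define DM_ACPI_BLOB_TABLE 0   /* data blob is a complete ACPI table */
    #define DM_ACPI_BLOB_NSDEV 1   /* data blob is AML of a namespace device */

    static void load_dm_acpi(uint8_t *dm_acpi_addr, uint32_t dm_acpi_len)
    {
        uint8_t *p = dm_acpi_addr, *end = dm_acpi_addr + dm_acpi_len;

        while (p + 5 <= end) {
            uint8_t type = p[0];
            uint32_t length;
            uint8_t *blob = p + 5;

            memcpy(&length, p + 1, 4);

            if (type == DM_ACPI_BLOB_TABLE) {
                /* check the table's 4-byte signature against
                 * builtin_table_sigs[], then append the table to the
                 * loaded guest ACPI */
            } else if (type == DM_ACPI_BLOB_NSDEV) {
                /* blob[0..3] is the device name: check it against
                 * builtin_nd_names[], then wrap the remaining AML in
                 * Device(name) {...} inside a new SSDT */
            }

            p = blob + length;
        }
    }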

4.3.2 Emulating Guest _DSM

 Our design leaves the emulation of guest _DSM to QEMU. Just as it
 does with KVM, QEMU registers the _DSM buffer as an MMIO region with
 Xen, and then all guest evaluations of _DSM are trapped and emulated
 by QEMU.


References:
[1] ACPI Specification v6,
    http://www.uefi.org/sites/default/files/resources/ACPI_6.0.pdf
[2] NVDIMM Namespace Specification,
    http://pmem.io/documents/NVDIMM_Namespace_Spec.pdf
[3] NVDIMM Block Window Driver Writer's Guide,
    http://pmem.io/documents/NVDIMM_Driver_Writers_Guide.pdf
[4] NVDIMM DSM Interface Example,
    http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf
[5] UEFI Specification v2.6,
    http://www.uefi.org/sites/default/files/resources/UEFI%20Spec%202_6.pdf
[6] Intel Architecture Instruction Set Extensions Programming Reference,
    https://software.intel.com/sites/default/files/managed/07/b7/319433-023.pdf
[7] https://lists.xenproject.org/archives/html/xen-devel/2016-02/msg00006.html
[8] https://lists.xen.org/archives/html/xen-devel/2016-06/msg00606.html


* Re: [RFC Design Doc v2] Add vNVDIMM support for Xen
  2016-07-18  0:29 [RFC Design Doc v2] Add vNVDIMM support for Xen Haozhong Zhang
@ 2016-07-18  8:36 ` Tian, Kevin
  2016-07-18  9:01   ` Zhang, Haozhong
  2016-07-19  1:57 ` Bob Liu
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 24+ messages in thread
From: Tian, Kevin @ 2016-07-18  8:36 UTC (permalink / raw)
  To: Zhang, Haozhong, xen-devel
  Cc: Juergen Gross, Stefano Stabellini, Wei Liu, Nakajima, Jun,
	George Dunlap, Andrew Cooper, Ian Jackson, Jan Beulich,
	Xiao Guangrong

> From: Zhang, Haozhong
> Sent: Monday, July 18, 2016 8:29 AM
> 
> Hi,
> 
> Following is version 2 of the design doc for supporting vNVDIMM in
> Xen. It's basically the summary of discussion on previous v1 design
> (https://lists.xenproject.org/archives/html/xen-devel/2016-02/msg00006.html).
> Any comments are welcome. The corresponding patches are WIP.
> 
> Thanks,
> Haozhong

It's a very clear doc. Thanks a lot!

> 
> 4.2.2 Detection of Host pmem Devices
> 
>  The detection and initialize host pmem devices require a non-trivial
>  driver to interact with the corresponding ACPI namespace devices,
>  parse namespace labels and make necessary recovery actions. Instead
>  of duplicating the comprehensive Linux pmem driver in Xen hypervisor,
>  our designs leaves it to Dom0 Linux and let Dom0 Linux report
>  detected host pmem devices to Xen hypervisor.
> 
>  Our design takes following steps to detect host pmem devices when Xen
>  boots.
>  (1) As booting on bare metal, host pmem devices are detected by Dom0
>      Linux NVDIMM driver.
> 
>  (2) Our design extends Linux NVDIMM driver to reports SPA's and sizes
>      of the pmem devices and reserved areas to Xen hypervisor via a
>      new hypercall.

Does Linux need to provide reserved area to Xen? Why not leaving Xen
to decide reserved area within reported pmem regions and then return
reserved info to Dom0 NVDIMM driver to balloon out?

> 
>  (3) Xen hypervisor then checks
>      - whether SPA and size of the newly reported pmem device is overlap
>        with any previously reported pmem devices;
>      - whether the reserved area can fit in the pmem device and is
>        large enough to hold page_info structs for itself.
> 
>      If any checks fail, the reported pmem device will be ignored by
>      Xen hypervisor and hence will not be used by any
>      guests. Otherwise, Xen hypervisor will recorded the reported
>      parameters and create page_info structs in the reserved area.
> 
>  (4) Because the reserved area is now used by Xen hypervisor, it
>      should not be accessible by Dom0 any more. Therefore, if a host
>      pmem device is recorded by Xen hypervisor, Xen will unmap its
>      reserved area from Dom0. Our design also needs to extend Linux
>      NVDIMM driver to "balloon out" the reserved area after it
>      successfully reports a pmem device to Xen hypervisor.

Then both ndctl and Xen become source of requesting reserved area
to Linux NVDIMM driver. You don't need change ndctl as described in
4.2.1. User can still use ndctl to reserve for Dom0's own purpose.

Thanks
Kevin


* Re: [RFC Design Doc v2] Add vNVDIMM support for Xen
  2016-07-18  8:36 ` Tian, Kevin
@ 2016-07-18  9:01   ` Zhang, Haozhong
  2016-07-19  0:58     ` Tian, Kevin
  0 siblings, 1 reply; 24+ messages in thread
From: Zhang, Haozhong @ 2016-07-18  9:01 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Juergen Gross, Stefano Stabellini, Wei Liu, Nakajima, Jun,
	George Dunlap, Andrew Cooper, Ian Jackson, xen-devel,
	Jan Beulich, Xiao Guangrong

On 07/18/16 16:36, Tian, Kevin wrote:
> > From: Zhang, Haozhong
> > Sent: Monday, July 18, 2016 8:29 AM
> > 
> > Hi,
> > 
> > Following is version 2 of the design doc for supporting vNVDIMM in
> > Xen. It's basically the summary of discussion on previous v1 design
> > (https://lists.xenproject.org/archives/html/xen-devel/2016-02/msg00006.html).
> > Any comments are welcome. The corresponding patches are WIP.
> > 
> > Thanks,
> > Haozhong
> 
> It's a very clear doc. Thanks a lot!
> 
> > 
> > 4.2.2 Detection of Host pmem Devices
> > 
> >  The detection and initialize host pmem devices require a non-trivial
> >  driver to interact with the corresponding ACPI namespace devices,
> >  parse namespace labels and make necessary recovery actions. Instead
> >  of duplicating the comprehensive Linux pmem driver in Xen hypervisor,
> >  our designs leaves it to Dom0 Linux and let Dom0 Linux report
> >  detected host pmem devices to Xen hypervisor.
> > 
> >  Our design takes following steps to detect host pmem devices when Xen
> >  boots.
> >  (1) As booting on bare metal, host pmem devices are detected by Dom0
> >      Linux NVDIMM driver.
> > 
> >  (2) Our design extends Linux NVDIMM driver to reports SPA's and sizes
> >      of the pmem devices and reserved areas to Xen hypervisor via a
> >      new hypercall.
> 
> Does Linux need to provide reserved area to Xen? Why not leaving Xen
> to decide reserved area within reported pmem regions and then return
> reserved info to Dom0 NVDIMM driver to balloon out?
>

NVDIMM can be used as persistent storage like a disk drive, so the
reservation should be done outside of Xen and Dom0, for example, by an
administrator who is expected to make the necessary data backups in
advance.

Therefore, Dom0 Linux actually reports (instead of providing) the
reserved area to Xen, and the latter checks if the reserved area is
large enough and (if yes) asks Dom0 to balloon out the reserved area.

> > 
> >  (3) Xen hypervisor then checks
> >      - whether SPA and size of the newly reported pmem device is overlap
> >        with any previously reported pmem devices;
> >      - whether the reserved area can fit in the pmem device and is
> >        large enough to hold page_info structs for itself.
> > 
> >      If any checks fail, the reported pmem device will be ignored by
> >      Xen hypervisor and hence will not be used by any
> >      guests. Otherwise, Xen hypervisor will recorded the reported
> >      parameters and create page_info structs in the reserved area.
> > 
> >  (4) Because the reserved area is now used by Xen hypervisor, it
> >      should not be accessible by Dom0 any more. Therefore, if a host
> >      pmem device is recorded by Xen hypervisor, Xen will unmap its
> >      reserved area from Dom0. Our design also needs to extend Linux
> >      NVDIMM driver to "balloon out" the reserved area after it
> >      successfully reports a pmem device to Xen hypervisor.
> 
> Then both ndctl and Xen become source of requesting reserved area
> to Linux NVDIMM driver. You don't need change ndctl as described in
> 4.2.1. User can still use ndctl to reserve for Dom0's own purpose.
>

I missed something here: the Dom0 pmem driver should also prevent
further operations on the host namespace after it successfully reports
to Xen. In this way, we can prevent userspace tools like ndctl from
breaking the host pmem device.

Thanks.
Haozhong


* Re: [RFC Design Doc v2] Add vNVDIMM support for Xen
  2016-07-18  9:01   ` Zhang, Haozhong
@ 2016-07-19  0:58     ` Tian, Kevin
  2016-07-19  2:10       ` Zhang, Haozhong
  0 siblings, 1 reply; 24+ messages in thread
From: Tian, Kevin @ 2016-07-19  0:58 UTC (permalink / raw)
  To: Zhang, Haozhong
  Cc: Juergen Gross, Stefano Stabellini, Wei Liu, Nakajima, Jun,
	George Dunlap, Andrew Cooper, Ian Jackson, xen-devel,
	Jan Beulich, Xiao Guangrong

> From: Zhang, Haozhong
> Sent: Monday, July 18, 2016 5:02 PM
> 
> On 07/18/16 16:36, Tian, Kevin wrote:
> > > From: Zhang, Haozhong
> > > Sent: Monday, July 18, 2016 8:29 AM
> > >
> > > Hi,
> > >
> > > Following is version 2 of the design doc for supporting vNVDIMM in
> > > Xen. It's basically the summary of discussion on previous v1 design
> > > (https://lists.xenproject.org/archives/html/xen-devel/2016-02/msg00006.html).
> > > Any comments are welcome. The corresponding patches are WIP.
> > >
> > > Thanks,
> > > Haozhong
> >
> > It's a very clear doc. Thanks a lot!
> >
> > >
> > > 4.2.2 Detection of Host pmem Devices
> > >
> > >  The detection and initialize host pmem devices require a non-trivial
> > >  driver to interact with the corresponding ACPI namespace devices,
> > >  parse namespace labels and make necessary recovery actions. Instead
> > >  of duplicating the comprehensive Linux pmem driver in Xen hypervisor,
> > >  our designs leaves it to Dom0 Linux and let Dom0 Linux report
> > >  detected host pmem devices to Xen hypervisor.
> > >
> > >  Our design takes following steps to detect host pmem devices when Xen
> > >  boots.
> > >  (1) As booting on bare metal, host pmem devices are detected by Dom0
> > >      Linux NVDIMM driver.
> > >
> > >  (2) Our design extends Linux NVDIMM driver to reports SPA's and sizes
> > >      of the pmem devices and reserved areas to Xen hypervisor via a
> > >      new hypercall.
> >
> > Does Linux need to provide reserved area to Xen? Why not leaving Xen
> > to decide reserved area within reported pmem regions and then return
> > reserved info to Dom0 NVDIMM driver to balloon out?
> >
> 
> NVDIMM can be used as a persistent storage like a disk drive, so the
> reservation should be done out of Xen and Dom0, for example, by an
> administrator who is expected to make necessary data backup in
> advance.

What prevents NVDIMM driver from reserving some region itself before
reporting to user space?

> 
> Therefore, dom0 linux actually reports (instead of providing) the
> reserved area to Xen, and the latter checks if the reserved area is
> large enough and (if yes) asks dom0 to balloon out the reserved area.

It looks non-intuitive since the administrator doesn't know the actual
requirement of Xen. The administrator then has to guess and try. Even
if it finally works, the reserved size may not be optimal.

If the Dom0 NVDIMM driver does the reservation itself and notifies
Xen, at least there is a way for Xen to return a failure with the
required size, and then on the second try the NVDIMM driver can adjust
the reservation as desired.

Did I misunderstand the flow here?

> 
> > >
> > >  (3) Xen hypervisor then checks
> > >      - whether SPA and size of the newly reported pmem device is overlap
> > >        with any previously reported pmem devices;
> > >      - whether the reserved area can fit in the pmem device and is
> > >        large enough to hold page_info structs for itself.
> > >
> > >      If any checks fail, the reported pmem device will be ignored by
> > >      Xen hypervisor and hence will not be used by any
> > >      guests. Otherwise, Xen hypervisor will recorded the reported
> > >      parameters and create page_info structs in the reserved area.
> > >
> > >  (4) Because the reserved area is now used by Xen hypervisor, it
> > >      should not be accessible by Dom0 any more. Therefore, if a host
> > >      pmem device is recorded by Xen hypervisor, Xen will unmap its
> > >      reserved area from Dom0. Our design also needs to extend Linux
> > >      NVDIMM driver to "balloon out" the reserved area after it
> > >      successfully reports a pmem device to Xen hypervisor.
> >
> > Then both ndctl and Xen become source of requesting reserved area
> > to Linux NVDIMM driver. You don't need change ndctl as described in
> > 4.2.1. User can still use ndctl to reserve for Dom0's own purpose.
> >
> 
> I missed something here: Dom0 pmem driver should also prevent
> further operations on host namespace after it successfully reports to
> Xen. In this way, we can prevent uerspace tools like ndctl to break
> the host pmem device.
> 

yes, Dom0 driver is expected to reserve the region allocated for Xen.

Thanks
Kevin


* Re: [RFC Design Doc v2] Add vNVDIMM support for Xen
  2016-07-18  0:29 [RFC Design Doc v2] Add vNVDIMM support for Xen Haozhong Zhang
  2016-07-18  8:36 ` Tian, Kevin
@ 2016-07-19  1:57 ` Bob Liu
  2016-07-19  2:40   ` Haozhong Zhang
  2016-08-02 14:46 ` Jan Beulich
  2016-08-03 21:25 ` Konrad Rzeszutek Wilk
  3 siblings, 1 reply; 24+ messages in thread
From: Bob Liu @ 2016-07-19  1:57 UTC (permalink / raw)
  To: xen-devel, Jan Beulich, Konrad Rzeszutek Wilk, George Dunlap,
	Andrew Cooper, Ian Jackson, Stefano Stabellini, Juergen Gross,
	Wei Liu, Tian, Kevin, Xiao Guangrong, Nakajima, Jun

Hey Haozhong,

On 07/18/2016 08:29 AM, Haozhong Zhang wrote:
> Hi,
> 
> Following is version 2 of the design doc for supporting vNVDIMM in

This version is really good, very clear, and includes almost everything I'd like to know.

> Xen. It's basically the summary of discussion on previous v1 design
> (https://lists.xenproject.org/archives/html/xen-devel/2016-02/msg00006.html).
> Any comments are welcome. The corresponding patches are WIP.
> 

So are you (or Intel) going to write all the patches? Is there any task for the community to take part in?

[..snip..]
> 3. Usage Example of vNVDIMM in Xen
> 
>  Our design is to provide virtual pmem devices to HVM domains. The
>  virtual pmem devices are backed by host pmem devices.
> 
>  Dom0 Linux kernel can detect the host pmem devices and create
>  /dev/pmemXX for each detected devices. Users in Dom0 can then create
>  DAX file system on /dev/pmemXX and create several pre-allocate files
>  in the DAX file system.
> 
>  After setup the file system on the host pmem, users can add the
>  following lines in the xl configuration files to assign the host pmem
>  regions to domains:
>      vnvdimm = [ 'file=/dev/pmem0' ]
>  or
>      vnvdimm = [ 'file=/mnt/dax/pre_allocated_file' ]
> 

Could you please also consider the case when a driver domain gets involved?
E.g. vnvdimm = [ 'file=/dev/pmem0', backend='xxx' ]?

>   The first type of configuration assigns the entire pmem device
>   (/dev/pmem0) to the domain, while the second assigns the space
>   allocated to /mnt/dax/pre_allocated_file on the host pmem device to
>   the domain.
> 
..[snip..]
> 
> 4.2.2 Detection of Host pmem Devices
> 
>  The detection and initialize host pmem devices require a non-trivial
>  driver to interact with the corresponding ACPI namespace devices,
>  parse namespace labels and make necessary recovery actions. Instead
>  of duplicating the comprehensive Linux pmem driver in Xen hypervisor,
>  our designs leaves it to Dom0 Linux and let Dom0 Linux report
>  detected host pmem devices to Xen hypervisor.
> 
>  Our design takes following steps to detect host pmem devices when Xen
>  boots.
>  (1) As booting on bare metal, host pmem devices are detected by Dom0
>      Linux NVDIMM driver.
> 
>  (2) Our design extends Linux NVDIMM driver to reports SPA's and sizes
>      of the pmem devices and reserved areas to Xen hypervisor via a
>      new hypercall.
> 
>  (3) Xen hypervisor then checks
>      - whether SPA and size of the newly reported pmem device is overlap
>        with any previously reported pmem devices;
>      - whether the reserved area can fit in the pmem device and is
>        large enough to hold page_info structs for itself.
> 
>      If any checks fail, the reported pmem device will be ignored by
>      Xen hypervisor and hence will not be used by any
>      guests. Otherwise, Xen hypervisor will recorded the reported
>      parameters and create page_info structs in the reserved area.
> 
>  (4) Because the reserved area is now used by Xen hypervisor, it
>      should not be accessible by Dom0 any more. Therefore, if a host
>      pmem device is recorded by Xen hypervisor, Xen will unmap its
>      reserved area from Dom0. Our design also needs to extend Linux
>      NVDIMM driver to "balloon out" the reserved area after it
>      successfully reports a pmem device to Xen hypervisor.
> 
> 4.2.3 Get Host Machine Address (SPA) of Host pmem Files
> 
>  Before a pmem file is assigned to a domain, we need to know the host
>  SPA ranges that are allocated to this file. We do this work in xl.
> 
>  If a pmem device /dev/pmem0 is given, xl will read
>  /sys/block/pmem0/device/{resource,size} respectively for the start
>  SPA and size of the pmem device.
> 
>  If a pre-allocated file /mnt/dax/file is given,
>  (1) xl first finds the host pmem device where /mnt/dax/file is. Then
>      it uses the method above to get the start SPA of the host pmem
>      device.
>  (2) xl then uses fiemap ioctl to get the extend mappings of
>      /mnt/dax/file, and adds the corresponding physical offsets and
>      lengths in each mapping entries to above start SPA to get the SPA
>      ranges pre-allocated for this file.
> 

Looks like PMEM can't be passed through to a driver domain directly like, e.g., PCI devices.

So if a driver domain is created by: vnvdimm = [ 'file=/dev/pmem0' ], and a DAX file system is made in the driver domain.

Then new guests are created with vnvdimm = [ 'file=dax file in driver domain', backend = 'driver domain' ].
Is this going to work? In my understanding, fiemap can only get the GPFN instead of the real SPA of PMEM in this case.


>  The resulting host SPA ranges will be passed to QEMU which allocates
>  guest address space for vNVDIMM devices and calls Xen hypervisor to
>  map the guest address to the host SPA ranges.
> 

Can Dom0 still access the same SPA range when Xen decides to assign it to a new domU?
I assume the range will be unmapped automatically from Dom0 in the hypercall?

Thanks,
-Bob


* Re: [RFC Design Doc v2] Add vNVDIMM support for Xen
  2016-07-19  0:58     ` Tian, Kevin
@ 2016-07-19  2:10       ` Zhang, Haozhong
  0 siblings, 0 replies; 24+ messages in thread
From: Zhang, Haozhong @ 2016-07-19  2:10 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Juergen Gross, Stefano Stabellini, Wei Liu, Nakajima, Jun,
	George Dunlap, Andrew Cooper, Ian Jackson, xen-devel,
	Jan Beulich, Xiao Guangrong

On 07/19/16 08:58, Tian, Kevin wrote:
> > From: Zhang, Haozhong
> > Sent: Monday, July 18, 2016 5:02 PM
> > 
> > On 07/18/16 16:36, Tian, Kevin wrote:
> > > > From: Zhang, Haozhong
> > > > Sent: Monday, July 18, 2016 8:29 AM
> > > >
> > > > Hi,
> > > >
> > > > Following is version 2 of the design doc for supporting vNVDIMM in
> > > > Xen. It's basically the summary of discussion on previous v1 design
> > > > (https://lists.xenproject.org/archives/html/xen-devel/2016-02/msg00006.html).
> > > > Any comments are welcome. The corresponding patches are WIP.
> > > >
> > > > Thanks,
> > > > Haozhong
> > >
> > > It's a very clear doc. Thanks a lot!
> > >
> > > >
> > > > 4.2.2 Detection of Host pmem Devices
> > > >
> > > >  The detection and initialize host pmem devices require a non-trivial
> > > >  driver to interact with the corresponding ACPI namespace devices,
> > > >  parse namespace labels and make necessary recovery actions. Instead
> > > >  of duplicating the comprehensive Linux pmem driver in Xen hypervisor,
> > > >  our designs leaves it to Dom0 Linux and let Dom0 Linux report
> > > >  detected host pmem devices to Xen hypervisor.
> > > >
> > > >  Our design takes following steps to detect host pmem devices when Xen
> > > >  boots.
> > > >  (1) As booting on bare metal, host pmem devices are detected by Dom0
> > > >      Linux NVDIMM driver.
> > > >
> > > >  (2) Our design extends Linux NVDIMM driver to reports SPA's and sizes
> > > >      of the pmem devices and reserved areas to Xen hypervisor via a
> > > >      new hypercall.
> > >
> > > Does Linux need to provide reserved area to Xen? Why not leaving Xen
> > > to decide reserved area within reported pmem regions and then return
> > > reserved info to Dom0 NVDIMM driver to balloon out?
> > >
> > 
> > NVDIMM can be used as a persistent storage like a disk drive, so the
> > reservation should be done out of Xen and Dom0, for example, by an
> > administrator who is expected to make necessary data backup in
> > advance.
> 
> What prevents NVDIMM driver from reserving some region itself before
> reporting to user space?
>

Nothing in theory prevents the driver from doing the reservation
itself. I just mean the reservation should be initiated by someone who
can ensure, for example, that the current data on pmem is either
useless or properly backed up. The reservation is of course finally
done by the driver.

> > 
> > Therefore, dom0 linux actually reports (instead of providing) the
> > reserved area to Xen, and the latter checks if the reserved area is
> > large enough and (if yes) asks dom0 to balloon out the reserved area.
> 
> It looks non-intuitive since administrator doesn't know the actual requirement
> of Xen. Then administrator has to guess and try. Even it finally works, the 
> reserved size may not be optimal.
> 
> If Dom0 NVDIMM driver does reservation itself and notify Xen, at least there 
> is a way for Xen to return a failure with required size and then at the 2nd time 
> the NVDIMM driver can adjust the reservation as desired. 
>
> Did I misunderstand the flow here?
>

I designed it to let the administrator calculate the reserved size and
pass it to the driver. You are right; I now think it's better to let
Xen advise the reserved size to the NVDIMM driver in Dom0, so there is
no need for a manually calculated size.

Thanks,
Haozhong

> > 
> > > >
> > > >  (3) Xen hypervisor then checks
> > > >      - whether SPA and size of the newly reported pmem device is overlap
> > > >        with any previously reported pmem devices;
> > > >      - whether the reserved area can fit in the pmem device and is
> > > >        large enough to hold page_info structs for itself.
> > > >
> > > >      If any checks fail, the reported pmem device will be ignored by
> > > >      Xen hypervisor and hence will not be used by any
> > > >      guests. Otherwise, Xen hypervisor will recorded the reported
> > > >      parameters and create page_info structs in the reserved area.
> > > >
> > > >  (4) Because the reserved area is now used by Xen hypervisor, it
> > > >      should not be accessible by Dom0 any more. Therefore, if a host
> > > >      pmem device is recorded by Xen hypervisor, Xen will unmap its
> > > >      reserved area from Dom0. Our design also needs to extend Linux
> > > >      NVDIMM driver to "balloon out" the reserved area after it
> > > >      successfully reports a pmem device to Xen hypervisor.
> > >
> > > Then both ndctl and Xen become source of requesting reserved area
> > > to Linux NVDIMM driver. You don't need change ndctl as described in
> > > 4.2.1. User can still use ndctl to reserve for Dom0's own purpose.
> > >
> > 
> > I missed something here: Dom0 pmem driver should also prevent
> > further operations on host namespace after it successfully reports to
> > Xen. In this way, we can prevent uerspace tools like ndctl to break
> > the host pmem device.
> > 
> 
> yes, Dom0 driver is expected to reserve the region allocated for Xen.
> 
> Thanks
> Kevin


* Re: [RFC Design Doc v2] Add vNVDIMM support for Xen
  2016-07-19  1:57 ` Bob Liu
@ 2016-07-19  2:40   ` Haozhong Zhang
  0 siblings, 0 replies; 24+ messages in thread
From: Haozhong Zhang @ 2016-07-19  2:40 UTC (permalink / raw)
  To: Bob Liu
  Cc: Juergen Gross, Tian, Kevin, Stefano Stabellini, Wei Liu,
	Nakajima, Jun, George Dunlap, Andrew Cooper, Ian Jackson,
	xen-devel, Jan Beulich, Xiao Guangrong

On 07/19/16 09:57, Bob Liu wrote:
> Hey Haozhong,
> 
> On 07/18/2016 08:29 AM, Haozhong Zhang wrote:
> > Hi,
> > 
> > Following is version 2 of the design doc for supporting vNVDIMM in
> 
> This version is really good, very clear and included almost everything I'd like to know.
> 
> > Xen. It's basically the summary of discussion on previous v1 design
> > (https://lists.xenproject.org/archives/html/xen-devel/2016-02/msg00006.html).
> > Any comments are welcome. The corresponding patches are WIP.
> > 
> 
> So are you(or Intel) going to write all the patches? Is there any task the community to take part in?
>

For the first version I think so. Currently there are some
dependencies among the multiple parts of my patches (Xen/Linux/QEMU),
and I have to adjust them from time to time during development. Once I
can provide a working first version, I will be very glad to work with
the community on further development.

> [..snip..]
> > 3. Usage Example of vNVDIMM in Xen
> > 
> >  Our design is to provide virtual pmem devices to HVM domains. The
> >  virtual pmem devices are backed by host pmem devices.
> > 
> >  Dom0 Linux kernel can detect the host pmem devices and create
> >  /dev/pmemXX for each detected devices. Users in Dom0 can then create
> >  DAX file system on /dev/pmemXX and create several pre-allocate files
> >  in the DAX file system.
> > 
> >  After setup the file system on the host pmem, users can add the
> >  following lines in the xl configuration files to assign the host pmem
> >  regions to domains:
> >      vnvdimm = [ 'file=/dev/pmem0' ]
> >  or
> >      vnvdimm = [ 'file=/mnt/dax/pre_allocated_file' ]
> > 
> 
> Could you please also consider the case when driver domain gets involved?
> E.g vnvdimm = [ 'file=/dev/pmem0', backend='xxx' ]?
>

I will consider that design, but for the first version of the patches
I would like to keep things simple.

> >   The first type of configuration assigns the entire pmem device
> >   (/dev/pmem0) to the domain, while the second assigns the space
> >   allocated to /mnt/dax/pre_allocated_file on the host pmem device to
> >   the domain.
> > 
> ..[snip..]
> > 
> > 4.2.2 Detection of Host pmem Devices
> > 
> >  The detection and initialize host pmem devices require a non-trivial
> >  driver to interact with the corresponding ACPI namespace devices,
> >  parse namespace labels and make necessary recovery actions. Instead
> >  of duplicating the comprehensive Linux pmem driver in Xen hypervisor,
> >  our designs leaves it to Dom0 Linux and let Dom0 Linux report
> >  detected host pmem devices to Xen hypervisor.
> > 
> >  Our design takes following steps to detect host pmem devices when Xen
> >  boots.
> >  (1) As booting on bare metal, host pmem devices are detected by Dom0
> >      Linux NVDIMM driver.
> > 
> >  (2) Our design extends Linux NVDIMM driver to reports SPA's and sizes
> >      of the pmem devices and reserved areas to Xen hypervisor via a
> >      new hypercall.
> > 
> >  (3) Xen hypervisor then checks
> >      - whether SPA and size of the newly reported pmem device is overlap
> >        with any previously reported pmem devices;
> >      - whether the reserved area can fit in the pmem device and is
> >        large enough to hold page_info structs for itself.
> > 
> >      If any checks fail, the reported pmem device will be ignored by
> >      Xen hypervisor and hence will not be used by any
> >      guests. Otherwise, Xen hypervisor will recorded the reported
> >      parameters and create page_info structs in the reserved area.
> > 
> >  (4) Because the reserved area is now used by Xen hypervisor, it
> >      should not be accessible by Dom0 any more. Therefore, if a host
> >      pmem device is recorded by Xen hypervisor, Xen will unmap its
> >      reserved area from Dom0. Our design also needs to extend Linux
> >      NVDIMM driver to "balloon out" the reserved area after it
> >      successfully reports a pmem device to Xen hypervisor.
> > 
> > 4.2.3 Get Host Machine Address (SPA) of Host pmem Files
> > 
> >  Before a pmem file is assigned to a domain, we need to know the host
> >  SPA ranges that are allocated to this file. We do this work in xl.
> > 
> >  If a pmem device /dev/pmem0 is given, xl will read
> >  /sys/block/pmem0/device/{resource,size} respectively for the start
> >  SPA and size of the pmem device.
> > 
> >  If a pre-allocated file /mnt/dax/file is given,
> >  (1) xl first finds the host pmem device where /mnt/dax/file is. Then
> >      it uses the method above to get the start SPA of the host pmem
> >      device.
> >  (2) xl then uses fiemap ioctl to get the extend mappings of
> >      /mnt/dax/file, and adds the corresponding physical offsets and
> >      lengths in each mapping entries to above start SPA to get the SPA
> >      ranges pre-allocated for this file.
> > 
> 
> Looks like PMEM can't be passed through to driver domain directly like e.g PCI devices.
>

pmem is not a PCI device.

I'm not familiar with driver domains. If only PCI devices can be
passed through to a driver domain, then it may not be possible to pass
a pmem device through to one.

> So if created a driver domain by: vnvdimm = [ 'file=/dev/pmem0' ], and make a DAX file system on the driver domain.
> 
> Then creating new guests with vnvdimm = [ 'file=dax file in driver domain', backend = 'driver domain' ].
> Is this going to work? In my understanding, fiemap can only get the GPFN instead of the really SPA of PMEM in this case.
>

fiemap returns the physical offsets of the file's extents. They are
added to the start SPA of the corresponding /dev/pmem0 (obtained via
/sys/block/pmem0/device/resource). For Dom0, we can get the host
physical addresses in this way.
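
To make that concrete, a minimal sketch of the xl-side calculation
(the helper name and the fixed extent count are illustrative, and most
error handling is trimmed):

    /* Print the host SPA ranges backing a DAX file, given the start
     * SPA of the underlying /dev/pmemX device. */
    #include <inttypes.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/fs.h>
    #include <linux/fiemap.h>

    static int print_spa_ranges(const char *path, uint64_t pmem_start_spa)
    {
        int fd = open(path, O_RDONLY);
        /* Assume at most 64 extents for this sketch. */
        struct fiemap *fm = calloc(1, sizeof(*fm) +
                                   64 * sizeof(struct fiemap_extent));
        unsigned int i;

        if (fd < 0 || !fm)
            return -1;

        fm->fm_start = 0;
        fm->fm_length = ~0ULL;              /* whole file */
        fm->fm_flags = FIEMAP_FLAG_SYNC;
        fm->fm_extent_count = 64;

        if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0)
            return -1;

        for (i = 0; i < fm->fm_mapped_extents; i++) {
            struct fiemap_extent *e = &fm->fm_extents[i];
            /* fe_physical is the byte offset into /dev/pmemX; adding
             * the device's start SPA yields the host SPA range. */
            printf("SPA 0x%" PRIx64 " - 0x%" PRIx64 "\n",
                   pmem_start_spa + e->fe_physical,
                   pmem_start_spa + e->fe_physical + e->fe_length);
        }

        free(fm);
        close(fd);
        return 0;
    }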

I'm not sure whether a pmem device can be passed to a driver domain,
and (if it can) whether the host SPA would be visible to the driver
domain. If the answer to either question is no, pmem would not work
with a driver domain in the above way.

> 
> >  The resulting host SPA ranges will be passed to QEMU which allocates
> >  guest address space for vNVDIMM devices and calls Xen hypervisor to
> >  map the guest address to the host SPA ranges.
> > 
> 
> Can Dom0 still access the same SPA range when Xen decides to assign it to new domU?
> I assume the range will be unmapped automatically from dom0 in the hypercall?
>

Yes, they will be unmapped from Dom0.

Thanks,
Haozhong

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC Design Doc v2] Add vNVDIMM support for Xen
  2016-07-18  0:29 [RFC Design Doc v2] Add vNVDIMM support for Xen Haozhong Zhang
  2016-07-18  8:36 ` Tian, Kevin
  2016-07-19  1:57 ` Bob Liu
@ 2016-08-02 14:46 ` Jan Beulich
  2016-08-03  6:54   ` Haozhong Zhang
  2016-08-03 21:25 ` Konrad Rzeszutek Wilk
  3 siblings, 1 reply; 24+ messages in thread
From: Jan Beulich @ 2016-08-02 14:46 UTC (permalink / raw)
  To: Haozhong Zhang
  Cc: Juergen Gross, Kevin Tian, Stefano Stabellini, Wei Liu,
	George Dunlap, Andrew Cooper, Ian Jackson, xen-devel,
	Jun Nakajima, Xiao Guangrong

>>> On 18.07.16 at 02:29, <haozhong.zhang@intel.com> wrote:
> 4.2.2 Detection of Host pmem Devices
> 
>  The detection and initialize host pmem devices require a non-trivial
>  driver to interact with the corresponding ACPI namespace devices,
>  parse namespace labels and make necessary recovery actions. Instead
>  of duplicating the comprehensive Linux pmem driver in Xen hypervisor,
>  our designs leaves it to Dom0 Linux and let Dom0 Linux report
>  detected host pmem devices to Xen hypervisor.
> 
>  Our design takes following steps to detect host pmem devices when Xen
>  boots.
>  (1) As booting on bare metal, host pmem devices are detected by Dom0
>      Linux NVDIMM driver.
> 
>  (2) Our design extends Linux NVDIMM driver to reports SPA's and sizes
>      of the pmem devices and reserved areas to Xen hypervisor via a
>      new hypercall.
> 
>  (3) Xen hypervisor then checks
>      - whether SPA and size of the newly reported pmem device is overlap
>        with any previously reported pmem devices;

... or with system RAM.

>      - whether the reserved area can fit in the pmem device and is
>        large enough to hold page_info structs for itself.

So "reserved" here means available for Xen's use, but not for more
general purposes? How would the area Linux uses for its own
purposes get represented?

>  (4) Because the reserved area is now used by Xen hypervisor, it
>      should not be accessible by Dom0 any more. Therefore, if a host
>      pmem device is recorded by Xen hypervisor, Xen will unmap its
>      reserved area from Dom0. Our design also needs to extend Linux
>      NVDIMM driver to "balloon out" the reserved area after it
>      successfully reports a pmem device to Xen hypervisor.

... "balloon out" ... _after_? That'd be unsafe.

> 4.2.3 Get Host Machine Address (SPA) of Host pmem Files
> 
>  Before a pmem file is assigned to a domain, we need to know the host
>  SPA ranges that are allocated to this file. We do this work in xl.
> 
>  If a pmem device /dev/pmem0 is given, xl will read
>  /sys/block/pmem0/device/{resource,size} respectively for the start
>  SPA and size of the pmem device.
> 
>  If a pre-allocated file /mnt/dax/file is given,
>  (1) xl first finds the host pmem device where /mnt/dax/file is. Then
>      it uses the method above to get the start SPA of the host pmem
>      device.
>  (2) xl then uses fiemap ioctl to get the extend mappings of
>      /mnt/dax/file, and adds the corresponding physical offsets and
>      lengths in each mapping entries to above start SPA to get the SPA
>      ranges pre-allocated for this file.

Remind me again: These extents never change, not even across
reboot? I think this would be good to be written down here explicitly.
Hadn't there been talk of using labels to be able to allow a guest to
own the exact same physical range again after a reboot of guest or
host?

>  3) When hvmloader loads a type 0 entry, it extracts the signature
>     from the data blob and search for it in builtin_table_sigs[].  If
>     found anyone, hvmloader will report an error and stop. Otherwise,
>     it will append it to the end of loaded guest ACPI.

Duplicate table names aren't generally collisions: There can, for
example, be many tables named "SSDT".

>  4) When hvmloader loads a type 1 entry, it extracts the device name
>     from the data blob and search for it in builtin_nd_names[]. If
>     found anyone, hvmloader will report and error and stop. Otherwise,
>     it will wrap the AML code snippet by "Device (name[4]) {...}" and
>     include it in a new SSDT which is then appended to the end of
>     loaded guest ACPI.

But all of these could go into a single SSDT, instead of (as it sounds)
each into its own one?

Jan


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC Design Doc v2] Add vNVDIMM support for Xen
  2016-08-02 14:46 ` Jan Beulich
@ 2016-08-03  6:54   ` Haozhong Zhang
  2016-08-03  8:45     ` Jan Beulich
  0 siblings, 1 reply; 24+ messages in thread
From: Haozhong Zhang @ 2016-08-03  6:54 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Juergen Gross, Kevin Tian, Stefano Stabellini, Wei Liu,
	George Dunlap, Andrew Cooper, Ian Jackson, xen-devel,
	Jun Nakajima, Xiao Guangrong

On 08/02/16 08:46, Jan Beulich wrote:
> >>> On 18.07.16 at 02:29, <haozhong.zhang@intel.com> wrote:
> > 4.2.2 Detection of Host pmem Devices
> > 
> >  The detection and initialize host pmem devices require a non-trivial
> >  driver to interact with the corresponding ACPI namespace devices,
> >  parse namespace labels and make necessary recovery actions. Instead
> >  of duplicating the comprehensive Linux pmem driver in Xen hypervisor,
> >  our designs leaves it to Dom0 Linux and let Dom0 Linux report
> >  detected host pmem devices to Xen hypervisor.
> > 
> >  Our design takes following steps to detect host pmem devices when Xen
> >  boots.
> >  (1) As booting on bare metal, host pmem devices are detected by Dom0
> >      Linux NVDIMM driver.
> > 
> >  (2) Our design extends Linux NVDIMM driver to reports SPA's and sizes
> >      of the pmem devices and reserved areas to Xen hypervisor via a
> >      new hypercall.
> > 
> >  (3) Xen hypervisor then checks
> >      - whether SPA and size of the newly reported pmem device is overlap
> >        with any previously reported pmem devices;
> 
> ... or with system RAM.
> 
> >      - whether the reserved area can fit in the pmem device and is
> >        large enough to hold page_info structs for itself.
> 
> So "reserved" here means available for Xen's use, but not for more
> general purposes? How would the area Linux uses for its own
> purposes get represented?
>

Reserved for Xen only. I was going to reuse the existing reservation
mechanism in the Linux pmem driver to allow reserving two areas - one
for Xen and another for Linux itself. However, I later realized that
the existing mechanism depends on huge page support, so it does not
work in Dom0. For the first implementation, I'm reserving an area only
for Xen and letting Dom0 Linux put the page structs for pmem in normal
RAM. Afterwards, I'll look for a way to allow both.
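
For reference, the reporting interface I have in mind is roughly the
following (structure and field names are only illustrative, not the
final hypercall ABI):

    #include <stdint.h>

    /* One entry per host pmem region reported by the Dom0 driver. */
    struct xen_pmem_add {
        uint64_t spa_start;  /* start SPA of the pmem region             */
        uint64_t spa_size;   /* size of the pmem region in bytes         */
        uint64_t rsv_start;  /* start SPA of the Xen-reserved area       */
        uint64_t rsv_size;   /* size of the reserved area, which must be
                              * large enough for the page_info structs   */
    };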

> >  (4) Because the reserved area is now used by Xen hypervisor, it
> >      should not be accessible by Dom0 any more. Therefore, if a host
> >      pmem device is recorded by Xen hypervisor, Xen will unmap its
> >      reserved area from Dom0. Our design also needs to extend Linux
> >      NVDIMM driver to "balloon out" the reserved area after it
> >      successfully reports a pmem device to Xen hypervisor.
> 
> ... "balloon out" ... _after_? That'd be unsafe.
>

Before ballooning is accomplished, the pmem driver does not create any
device node under /dev/, and hence no one except the pmem driver can
access the reserved area on pmem, so I think it's okay to balloon
after reporting.

> > 4.2.3 Get Host Machine Address (SPA) of Host pmem Files
> > 
> >  Before a pmem file is assigned to a domain, we need to know the host
> >  SPA ranges that are allocated to this file. We do this work in xl.
> > 
> >  If a pmem device /dev/pmem0 is given, xl will read
> >  /sys/block/pmem0/device/{resource,size} respectively for the start
> >  SPA and size of the pmem device.
> > 
> >  If a pre-allocated file /mnt/dax/file is given,
> >  (1) xl first finds the host pmem device where /mnt/dax/file is. Then
> >      it uses the method above to get the start SPA of the host pmem
> >      device.
> >  (2) xl then uses fiemap ioctl to get the extend mappings of
> >      /mnt/dax/file, and adds the corresponding physical offsets and
> >      lengths in each mapping entries to above start SPA to get the SPA
> >      ranges pre-allocated for this file.
> 
> Remind me again: These extents never change, not even across
> reboot? I think this would be good to be written down here explicitly.

Yes

> Hadn't there been talk of using labels to be able to allow a guest to
> own the exact same physical range again after reboot or guest or
> host?
>

You mean labels in the NVDIMM label storage area? As defined in the
Intel NVDIMM Namespace Specification, labels are used to specify
namespaces. For a pmem interleave set (which may cross several DIMMs),
at most one pmem namespace (and hence at most one label) is
allowed. Therefore, labels cannot be used to partition pmem.

> >  3) When hvmloader loads a type 0 entry, it extracts the signature
> >     from the data blob and search for it in builtin_table_sigs[].  If
> >     found anyone, hvmloader will report an error and stop. Otherwise,
> >     it will append it to the end of loaded guest ACPI.
> 
> Duplicate table names aren't generally collisions: There can, for
> example, be many tables named "SSDT".
>

I'll exclude SSDT from the duplication check.

> >  4) When hvmloader loads a type 1 entry, it extracts the device name
> >     from the data blob and search for it in builtin_nd_names[]. If
> >     found anyone, hvmloader will report and error and stop. Otherwise,
> >     it will wrap the AML code snippet by "Device (name[4]) {...}" and
> >     include it in a new SSDT which is then appended to the end of
> >     loaded guest ACPI.
> 
> But all of these could go into a single SSDT, instead of (as it sounds)
> each into its own one?
>

Yes, I meant to put them in one SSDT.
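
For illustration, a purely hypothetical layout of one passed-in entry
as hvmloader would see it (nothing here is the final format; it just
reflects the type/signature/device-name fields mentioned above):

    #include <stdint.h>

    struct dm_acpi_entry {
        uint8_t  type;      /* 0: complete ACPI table, 1: AML code of an
                             * ACPI namespace device                     */
        char     name[4];   /* table signature (type 0) or device name
                             * (type 1)                                  */
        uint32_t length;    /* length in bytes of the data blob below    */
        /* followed by 'length' bytes of table data or AML code */
    };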

Thanks,
Haozhong

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC Design Doc v2] Add vNVDIMM support for Xen
  2016-08-03  6:54   ` Haozhong Zhang
@ 2016-08-03  8:45     ` Jan Beulich
  2016-08-03  9:37       ` Haozhong Zhang
  0 siblings, 1 reply; 24+ messages in thread
From: Jan Beulich @ 2016-08-03  8:45 UTC (permalink / raw)
  To: Haozhong Zhang
  Cc: Juergen Gross, Kevin Tian, Stefano Stabellini, Wei Liu,
	George Dunlap, Andrew Cooper, Ian Jackson, xen-devel,
	Jun Nakajima, Xiao Guangrong

>>> On 03.08.16 at 08:54, <haozhong.zhang@intel.com> wrote:
> On 08/02/16 08:46, Jan Beulich wrote:
>> >>> On 18.07.16 at 02:29, <haozhong.zhang@intel.com> wrote:
>> >  (4) Because the reserved area is now used by Xen hypervisor, it
>> >      should not be accessible by Dom0 any more. Therefore, if a host
>> >      pmem device is recorded by Xen hypervisor, Xen will unmap its
>> >      reserved area from Dom0. Our design also needs to extend Linux
>> >      NVDIMM driver to "balloon out" the reserved area after it
>> >      successfully reports a pmem device to Xen hypervisor.
>> 
>> ... "balloon out" ... _after_? That'd be unsafe.
>>
> 
> Before ballooning is accomplished, the pmem driver does not create any
> device node under /dev/ and hence no one except the pmem drive can
> access the reserved area on pmem, so I think it's okey to balloon
> after reporting.

Right now Dom0 isn't allowed to access any memory in use by Xen
(and not explicitly shared), and I don't think we should deviate
from that model for pmem.

>> > 4.2.3 Get Host Machine Address (SPA) of Host pmem Files
>> > 
>> >  Before a pmem file is assigned to a domain, we need to know the host
>> >  SPA ranges that are allocated to this file. We do this work in xl.
>> > 
>> >  If a pmem device /dev/pmem0 is given, xl will read
>> >  /sys/block/pmem0/device/{resource,size} respectively for the start
>> >  SPA and size of the pmem device.
>> > 
>> >  If a pre-allocated file /mnt/dax/file is given,
>> >  (1) xl first finds the host pmem device where /mnt/dax/file is. Then
>> >      it uses the method above to get the start SPA of the host pmem
>> >      device.
>> >  (2) xl then uses fiemap ioctl to get the extend mappings of
>> >      /mnt/dax/file, and adds the corresponding physical offsets and
>> >      lengths in each mapping entries to above start SPA to get the SPA
>> >      ranges pre-allocated for this file.
>> 
>> Remind me again: These extents never change, not even across
>> reboot? I think this would be good to be written down here explicitly.
> 
> Yes
> 
>> Hadn't there been talk of using labels to be able to allow a guest to
>> own the exact same physical range again after reboot or guest or
>> host?
> 
> You mean labels in NVDIMM label storage area? As defined in Intel
> NVDIMM Namespace Specification, labels are used to specify
> namespaces. For a pmem interleave set (possible cross several dimms),
> at most one pmem namespace (and hence at most one label) is
> allowed. Therefore, labels can not be used to partition pmem.

Okay. But then how do particular ranges get associated with the
owning guest(s)? Merely by SPA would seem rather fragile to me.

Jan


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC Design Doc v2] Add vNVDIMM support for Xen
  2016-08-03  8:45     ` Jan Beulich
@ 2016-08-03  9:37       ` Haozhong Zhang
  2016-08-03  9:47         ` Jan Beulich
  0 siblings, 1 reply; 24+ messages in thread
From: Haozhong Zhang @ 2016-08-03  9:37 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Juergen Gross, Kevin Tian, Stefano Stabellini, Wei Liu,
	George Dunlap, Andrew Cooper, Ian Jackson, xen-devel,
	Jun Nakajima, Xiao Guangrong

On 08/03/16 02:45, Jan Beulich wrote:
> >>> On 03.08.16 at 08:54, <haozhong.zhang@intel.com> wrote:
> > On 08/02/16 08:46, Jan Beulich wrote:
> >> >>> On 18.07.16 at 02:29, <haozhong.zhang@intel.com> wrote:
> >> >  (4) Because the reserved area is now used by Xen hypervisor, it
> >> >      should not be accessible by Dom0 any more. Therefore, if a host
> >> >      pmem device is recorded by Xen hypervisor, Xen will unmap its
> >> >      reserved area from Dom0. Our design also needs to extend Linux
> >> >      NVDIMM driver to "balloon out" the reserved area after it
> >> >      successfully reports a pmem device to Xen hypervisor.
> >> 
> >> ... "balloon out" ... _after_? That'd be unsafe.
> >>
> > 
> > Before ballooning is accomplished, the pmem driver does not create any
> > device node under /dev/ and hence no one except the pmem drive can
> > access the reserved area on pmem, so I think it's okey to balloon
> > after reporting.
> 
> Right now Dom0 isn't allowed to access any memory in use by Xen
> (and not explicitly shared), and I don't think we should deviate
> from that model for pmem.
>

In this design, Xen hypervisor unmaps the reserved area from Dom0 so
that Dom0 cannot access the reserved area afterwards. And "balloon" is
in fact not memory ballooning, because the Linux kernel never
allocates from pmem as it does from normal RAM. In my current
implementation, it just removes the reserved area from the resource
struct covering pmem.
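
A minimal sketch of the kind of driver change I mean, assuming the
reserved area sits at the start of the pmem resource (the function
name is made up and the real patch may look different):

    #include <linux/ioport.h>

    /* Shrink the pmem resource so it no longer covers the area that
     * has been handed to Xen; Dom0 then never maps that range. */
    static int pmem_hide_xen_reserved(struct resource *res,
                                      resource_size_t rsv_size)
    {
        return adjust_resource(res, res->start + rsv_size,
                               resource_size(res) - rsv_size);
    }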

> >> > 4.2.3 Get Host Machine Address (SPA) of Host pmem Files
> >> > 
> >> >  Before a pmem file is assigned to a domain, we need to know the host
> >> >  SPA ranges that are allocated to this file. We do this work in xl.
> >> > 
> >> >  If a pmem device /dev/pmem0 is given, xl will read
> >> >  /sys/block/pmem0/device/{resource,size} respectively for the start
> >> >  SPA and size of the pmem device.
> >> > 
> >> >  If a pre-allocated file /mnt/dax/file is given,
> >> >  (1) xl first finds the host pmem device where /mnt/dax/file is. Then
> >> >      it uses the method above to get the start SPA of the host pmem
> >> >      device.
> >> >  (2) xl then uses fiemap ioctl to get the extend mappings of
> >> >      /mnt/dax/file, and adds the corresponding physical offsets and
> >> >      lengths in each mapping entries to above start SPA to get the SPA
> >> >      ranges pre-allocated for this file.
> >> 
> >> Remind me again: These extents never change, not even across
> >> reboot? I think this would be good to be written down here explicitly.
> > 
> > Yes
> > 
> >> Hadn't there been talk of using labels to be able to allow a guest to
> >> own the exact same physical range again after reboot or guest or
> >> host?
> > 
> > You mean labels in NVDIMM label storage area? As defined in Intel
> > NVDIMM Namespace Specification, labels are used to specify
> > namespaces. For a pmem interleave set (possible cross several dimms),
> > at most one pmem namespace (and hence at most one label) is
> > allowed. Therefore, labels can not be used to partition pmem.
> 
> Okay. But then how do particular ranges get associated with the
> owning guest(s)? Merely by SPA would seem rather fragile to me.
> 

By using the file name, e.g. if I specify vnvdimm = [ 'file=/mnt/dax/foo' ]
in a domain config file, the SPA ranges occupied by /mnt/dax/foo are
mapped to the domain. If the same file is used every time the domain
is created, the same virtual device will be seen by that domain.

Thanks,
Haozhong

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC Design Doc v2] Add vNVDIMM support for Xen
  2016-08-03  9:37       ` Haozhong Zhang
@ 2016-08-03  9:47         ` Jan Beulich
  2016-08-03 10:08           ` Haozhong Zhang
  0 siblings, 1 reply; 24+ messages in thread
From: Jan Beulich @ 2016-08-03  9:47 UTC (permalink / raw)
  To: Haozhong Zhang
  Cc: Juergen Gross, Kevin Tian, Stefano Stabellini, Wei Liu,
	George Dunlap, Andrew Cooper, Ian Jackson, xen-devel,
	Jun Nakajima, Xiao Guangrong

>>> On 03.08.16 at 11:37, <haozhong.zhang@intel.com> wrote:
> On 08/03/16 02:45, Jan Beulich wrote:
>> >>> On 03.08.16 at 08:54, <haozhong.zhang@intel.com> wrote:
>> > On 08/02/16 08:46, Jan Beulich wrote:
>> >> >>> On 18.07.16 at 02:29, <haozhong.zhang@intel.com> wrote:
>> >> >  (4) Because the reserved area is now used by Xen hypervisor, it
>> >> >      should not be accessible by Dom0 any more. Therefore, if a host
>> >> >      pmem device is recorded by Xen hypervisor, Xen will unmap its
>> >> >      reserved area from Dom0. Our design also needs to extend Linux
>> >> >      NVDIMM driver to "balloon out" the reserved area after it
>> >> >      successfully reports a pmem device to Xen hypervisor.
>> >> 
>> >> ... "balloon out" ... _after_? That'd be unsafe.
>> >>
>> > 
>> > Before ballooning is accomplished, the pmem driver does not create any
>> > device node under /dev/ and hence no one except the pmem drive can
>> > access the reserved area on pmem, so I think it's okey to balloon
>> > after reporting.
>> 
>> Right now Dom0 isn't allowed to access any memory in use by Xen
>> (and not explicitly shared), and I don't think we should deviate
>> from that model for pmem.
> 
> In this design, Xen hypervisor unmaps the reserved area from Dom0 so
> that Dom0 cannot access the reserved area afterwards. And "balloon" is
> in fact not a memory ballooning, because Linux kernel never allocates
> from pmem like normal ram. In my current implementation, it's just to
> remove the reserved area from a resource struct covering pmem.

Ah, in that case please either use a different term, or explain what
"balloon out" is meant to mean in this context.

>> >> > 4.2.3 Get Host Machine Address (SPA) of Host pmem Files
>> >> > 
>> >> >  Before a pmem file is assigned to a domain, we need to know the host
>> >> >  SPA ranges that are allocated to this file. We do this work in xl.
>> >> > 
>> >> >  If a pmem device /dev/pmem0 is given, xl will read
>> >> >  /sys/block/pmem0/device/{resource,size} respectively for the start
>> >> >  SPA and size of the pmem device.
>> >> > 
>> >> >  If a pre-allocated file /mnt/dax/file is given,
>> >> >  (1) xl first finds the host pmem device where /mnt/dax/file is. Then
>> >> >      it uses the method above to get the start SPA of the host pmem
>> >> >      device.
>> >> >  (2) xl then uses fiemap ioctl to get the extend mappings of
>> >> >      /mnt/dax/file, and adds the corresponding physical offsets and
>> >> >      lengths in each mapping entries to above start SPA to get the SPA
>> >> >      ranges pre-allocated for this file.
>> >> 
>> >> Remind me again: These extents never change, not even across
>> >> reboot? I think this would be good to be written down here explicitly.
>> > 
>> > Yes
>> > 
>> >> Hadn't there been talk of using labels to be able to allow a guest to
>> >> own the exact same physical range again after reboot or guest or
>> >> host?
>> > 
>> > You mean labels in NVDIMM label storage area? As defined in Intel
>> > NVDIMM Namespace Specification, labels are used to specify
>> > namespaces. For a pmem interleave set (possible cross several dimms),
>> > at most one pmem namespace (and hence at most one label) is
>> > allowed. Therefore, labels can not be used to partition pmem.
>> 
>> Okay. But then how do particular ranges get associated with the
>> owning guest(s)? Merely by SPA would seem rather fragile to me.
>> 
> 
> By using the file name, e.g. if I specify vnvdimm = [ 'file=/mnt/dax/foo' ]
> in a domain config file, SPA occupied by /mnt/dax/foo are mapped to
> the domain.  If the same file is used every time the domain is created,
> the same virtual device will be seen by that domain.

So what if the file got deleted and re-created in between? Since
I don't think you can specify the SPAs to use when creating such
a file, such an operation would be quite different from removing
and re-adding e.g. a specific PCI device (to be used by a guest)
on a host (while the guest is not running).

Jan


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC Design Doc v2] Add vNVDIMM support for Xen
  2016-08-03  9:47         ` Jan Beulich
@ 2016-08-03 10:08           ` Haozhong Zhang
  2016-08-03 10:18             ` Jan Beulich
  0 siblings, 1 reply; 24+ messages in thread
From: Haozhong Zhang @ 2016-08-03 10:08 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Juergen Gross, Kevin Tian, Stefano Stabellini, Wei Liu,
	George Dunlap, Andrew Cooper, Ian Jackson, xen-devel,
	Jun Nakajima, Xiao Guangrong

On 08/03/16 03:47, Jan Beulich wrote:
> >>> On 03.08.16 at 11:37, <haozhong.zhang@intel.com> wrote:
> > On 08/03/16 02:45, Jan Beulich wrote:
> >> >>> On 03.08.16 at 08:54, <haozhong.zhang@intel.com> wrote:
> >> > On 08/02/16 08:46, Jan Beulich wrote:
> >> >> >>> On 18.07.16 at 02:29, <haozhong.zhang@intel.com> wrote:
> >> >> >  (4) Because the reserved area is now used by Xen hypervisor, it
> >> >> >      should not be accessible by Dom0 any more. Therefore, if a host
> >> >> >      pmem device is recorded by Xen hypervisor, Xen will unmap its
> >> >> >      reserved area from Dom0. Our design also needs to extend Linux
> >> >> >      NVDIMM driver to "balloon out" the reserved area after it
> >> >> >      successfully reports a pmem device to Xen hypervisor.
> >> >> 
> >> >> ... "balloon out" ... _after_? That'd be unsafe.
> >> >>
> >> > 
> >> > Before ballooning is accomplished, the pmem driver does not create any
> >> > device node under /dev/ and hence no one except the pmem drive can
> >> > access the reserved area on pmem, so I think it's okey to balloon
> >> > after reporting.
> >> 
> >> Right now Dom0 isn't allowed to access any memory in use by Xen
> >> (and not explicitly shared), and I don't think we should deviate
> >> from that model for pmem.
> > 
> > In this design, Xen hypervisor unmaps the reserved area from Dom0 so
> > that Dom0 cannot access the reserved area afterwards. And "balloon" is
> > in fact not a memory ballooning, because Linux kernel never allocates
> > from pmem like normal ram. In my current implementation, it's just to
> > remove the reserved area from a resource struct covering pmem.
> 
> Ah, in that case please either use a different term, or explain what
> "balloon out" is meant to mean in this context.
> 
> >> >> > 4.2.3 Get Host Machine Address (SPA) of Host pmem Files
> >> >> > 
> >> >> >  Before a pmem file is assigned to a domain, we need to know the host
> >> >> >  SPA ranges that are allocated to this file. We do this work in xl.
> >> >> > 
> >> >> >  If a pmem device /dev/pmem0 is given, xl will read
> >> >> >  /sys/block/pmem0/device/{resource,size} respectively for the start
> >> >> >  SPA and size of the pmem device.
> >> >> > 
> >> >> >  If a pre-allocated file /mnt/dax/file is given,
> >> >> >  (1) xl first finds the host pmem device where /mnt/dax/file is. Then
> >> >> >      it uses the method above to get the start SPA of the host pmem
> >> >> >      device.
> >> >> >  (2) xl then uses fiemap ioctl to get the extend mappings of
> >> >> >      /mnt/dax/file, and adds the corresponding physical offsets and
> >> >> >      lengths in each mapping entries to above start SPA to get the SPA
> >> >> >      ranges pre-allocated for this file.
> >> >> 
> >> >> Remind me again: These extents never change, not even across
> >> >> reboot? I think this would be good to be written down here explicitly.
> >> > 
> >> > Yes
> >> > 
> >> >> Hadn't there been talk of using labels to be able to allow a guest to
> >> >> own the exact same physical range again after reboot or guest or
> >> >> host?
> >> > 
> >> > You mean labels in NVDIMM label storage area? As defined in Intel
> >> > NVDIMM Namespace Specification, labels are used to specify
> >> > namespaces. For a pmem interleave set (possible cross several dimms),
> >> > at most one pmem namespace (and hence at most one label) is
> >> > allowed. Therefore, labels can not be used to partition pmem.
> >> 
> >> Okay. But then how do particular ranges get associated with the
> >> owning guest(s)? Merely by SPA would seem rather fragile to me.
> >> 
> > 
> > By using the file name, e.g. if I specify vnvdimm = [ 'file=/mnt/dax/foo' ]
> > in a domain config file, SPA occupied by /mnt/dax/foo are mapped to
> > the domain.  If the same file is used every time the domain is created,
> > the same virtual device will be seen by that domain.
> 
> So what if the file got deleted and re-created in between? Since
> I don't think you can specify the SPAs to use when creating such
> a file, such an operation would be quite different from removing
> and re-adding e.g. a specific PCI device (to be used by a guest)
> on a host (while the guest is not running).
> 

If it is modified in between, the guest will see a virtual pmem device
with different data. But the usage of pmem is similar to a disk: if a
file with the same content is given every time, the guest gets a
virtual pmem/disk with the same data as at the last reboot/shutdown;
keeping the data unchanged across multiple boots is out of the scope
of Xen.

Haozhong

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC Design Doc v2] Add vNVDIMM support for Xen
  2016-08-03 10:08           ` Haozhong Zhang
@ 2016-08-03 10:18             ` Jan Beulich
  0 siblings, 0 replies; 24+ messages in thread
From: Jan Beulich @ 2016-08-03 10:18 UTC (permalink / raw)
  To: Haozhong Zhang
  Cc: Juergen Gross, Kevin Tian, Stefano Stabellini, Wei Liu,
	George Dunlap, Andrew Cooper, Ian Jackson, xen-devel,
	Jun Nakajima, Xiao Guangrong

>>> On 03.08.16 at 12:08, <haozhong.zhang@intel.com> wrote:
> On 08/03/16 03:47, Jan Beulich wrote:
>> >>> On 03.08.16 at 11:37, <haozhong.zhang@intel.com> wrote:
>> > By using the file name, e.g. if I specify vnvdimm = [ 'file=/mnt/dax/foo' ]
>> > in a domain config file, SPA occupied by /mnt/dax/foo are mapped to
>> > the domain.  If the same file is used every time the domain is created,
>> > the same virtual device will be seen by that domain.
>> 
>> So what if the file got deleted and re-created in between? Since
>> I don't think you can specify the SPAs to use when creating such
>> a file, such an operation would be quite different from removing
>> and re-adding e.g. a specific PCI device (to be used by a guest)
>> on a host (while the guest is not running).
> 
> If modified in between, guest will see a virtual pmem device of
> different data. But the usage of pmem is similar to disk: if a file of
> the same content is given every time, the guest can get a virtual
> pmem/disk of the same data as last reboot/shutdown; keeping the data
> unchanged between multiple boots is out of the scope of Xen.

Except that here we're talking of handing a piece of hardware to a
guest, which to me is more like a PCI device than a (virtual) disk. But
anyway ...

Jan


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC Design Doc v2] Add vNVDIMM support for Xen
  2016-07-18  0:29 [RFC Design Doc v2] Add vNVDIMM support for Xen Haozhong Zhang
                   ` (2 preceding siblings ...)
  2016-08-02 14:46 ` Jan Beulich
@ 2016-08-03 21:25 ` Konrad Rzeszutek Wilk
  2016-08-03 23:16   ` Konrad Rzeszutek Wilk
  2016-08-04  8:52   ` Haozhong Zhang
  3 siblings, 2 replies; 24+ messages in thread
From: Konrad Rzeszutek Wilk @ 2016-08-03 21:25 UTC (permalink / raw)
  To: xen-devel, Jan Beulich, George Dunlap, Andrew Cooper,
	Ian Jackson, Stefano Stabellini, Juergen Gross, Wei Liu, Tian,
	Kevin, Xiao Guangrong, Nakajima, Jun

On Mon, Jul 18, 2016 at 08:29:12AM +0800, Haozhong Zhang wrote:
> Hi,
> 

Hey!

Thanks for posting! Sorry for the late review. Below are some of my
comments.

> Following is version 2 of the design doc for supporting vNVDIMM in
> Xen. It's basically the summary of discussion on previous v1 design
> (https://lists.xenproject.org/archives/html/xen-devel/2016-02/msg00006.html).
> Any comments are welcome. The corresponding patches are WIP.
> 
> Thanks,
> Haozhong
> 
> 
> 
> vNVDIMM Design v2
> 
> Changes in v2:
>  - Rewrite the the design details based on previous discussion [7].
>  - Add Section 3 Usage Example of vNVDIMM in Xen.
>  - Remove content about pcommit instruction which has been deprecated [8].
> 
> Content
> =======
> 1. Background
>  1.1 Access Mechanisms: Persistent Memory and Block Window
>  1.2 ACPI Support
>   1.2.1 NFIT
>   1.2.2 _DSM and _FIT
>  1.3 Namespace
>  1.4 clwb/clflushopt
> 2. NVDIMM/vNVDIMM Support in Linux Kernel/KVM/QEMU
>  2.1 NVDIMM Driver in Linux Kernel
>  2.2 vNVDIMM Implementation in KVM/QEMU
> 3. Usage Example of vNVDIMM in Xen
> 4. Design of vNVDIMM in Xen
>  4.1 Guest clwb/clflushopt Enabling
>  4.2 pmem Address Management
>   4.2.1 Reserve Storage for Management Structures
>   4.2.2 Detection of Host pmem Devices
>   4.2.3 Get Host Machine Address (SPA) of Host pmem Files
>   4.2.4 Map Host pmem to Guests
>   4.2.5 Misc 1: RAS
>   4.2.6 Misc 2: hotplug
>  4.3 Guest ACPI Emulation
>   4.3.1 Building Guest ACPI Tables
>   4.3.2 Emulating Guest _DSM
> References
> 
> 
> Non-Volatile DIMM or NVDIMM is a type of RAM device that provides
> persistent storage and retains data across reboot and even power
> failures. This document describes the design to provide virtual NVDIMM
> devices or vNVDIMM in Xen.
> 
> The rest of this document is organized as below.
>  - Section 1 introduces the background knowledge of NVDIMM hardware,
>    which is used by other parts of this document.
> 
>  - Section 2 briefly introduces the current/future NVDIMM/vNVDIMM
>    support in Linux kernel/KVM/QEMU. They will affect the vNVDIMM
>    design in Xen.
> 
>  - Section 3 shows the basic usage example of vNVDIMM in Xen.
> 
>  - Section 4 proposes design details of vNVDIMM in Xen.
> 
> 
> 
> 1. Background
> 
> 1.1 Access Mechanisms: Persistent Memory and Block Window
> 
>  NVDIMM provides two access mechanisms: byte-addressable persistent
>  memory (pmem) and block window (pblk). An NVDIMM can contain multiple
>  ranges and each range can be accessed through either pmem or pblk
>  (but not both).
> 
>  Byte-addressable persistent memory mechanism (pmem) maps NVDIMM or
>  ranges of NVDIMM into the system physical address (SPA) space, so
>  that software can access NVDIMM via normal memory loads and
>  stores. If the virtual address is used, then MMU will translate it to
>  the physical address.
> 
>  In the virtualization circumstance, we can pass through a pmem range
>  or partial of it to a guest by mapping it in EPT (i.e. mapping guest
>  vNVDIMM physical address to host NVDIMM physical address), so that
>  guest accesses are applied directly to the host NVDIMM device without
>  hypervisor's interceptions.
> 
>  Block window mechanism (pblk) provides one or multiple block windows
>  (BW).  Each BW is composed of a command register, a status register
>  and a 8 Kbytes aperture register. Software fills the direction of the
>  transfer (read/write), the start address (LBA) and size on NVDIMM it
>  is going to transfer. If nothing goes wrong, the transferred data can
>  be read/write via the aperture register. The status and errors of the
>  transfer can be got from the status register. Other vendor-specific
>  commands and status can be implemented for BW as well. Details of the
>  block window access mechanism can be found in [3].
> 
>  In the virtualization circumstance, different pblk regions on a
>  single NVDIMM device may be accessed by different guests, so the
>  hypervisor needs to emulate BW, which would introduce a high overhead
>  for I/O intensive workload.
> 
>  Therefore, we are going to only implement pmem for vNVDIMM. The rest
>  of this document will mostly concentrate on pmem.
> 
> 
> 1.2 ACPI Support
> 
>  ACPI provides two factors of support for NVDIMM. First, NVDIMM
>  devices are described by firmware (BIOS/EFI) to OS via ACPI-defined
>  NVDIMM Firmware Interface Table (NFIT). Second, several functions of
>  NVDIMM, including operations on namespace labels, S.M.A.R.T and
>  hotplug, are provided by ACPI methods (_DSM and _FIT).
> 
> 1.2.1 NFIT
> 
>  NFIT is a new system description table added in ACPI v6 with
>  signature "NFIT". It contains a set of structures.
> 
>  - System Physical Address Range Structure
>    (SPA Range Structure)
> 
>    SPA range structure describes system physical address ranges
>    occupied by NVDIMMs and types of regions.
> 
>    If Address Range Type GUID field of a SPA range structure is "Byte
>    Addressable Persistent Memory (PM) Region", then the structure
>    describes a NVDIMM region that is accessed via pmem. The System
>    Physical Address Range Base and Length fields describe the start
>    system physical address and the length that is occupied by that
>    NVDIMM region.
> 
>    A SPA range structure is identified by a non-zero SPA range
>    structure index.
> 
>    Note: [1] reserves E820 type 7: OSPM must comprehend this memory as
>          having non-volatile attributes and handle distinct from
>          conventional volatile memory (in Table 15-312 of [1]). The
>          memory region supports byte-addressable non-volatility. E820
>          type 12 (OEM defined) may be also used for legacy NVDIMM
>          prior to ACPI v6.
> 
>    Note: Besides OS, EFI firmware may also parse NFIT for booting
>          drives (Section 9.3.6.9 of [5]).
> 
>  - Memory Device to System Physical Address Range Mapping Structure
>    (Range Mapping Structure)
> 
>    An NVDIMM region described by a SPA range structure can be
>    interleaved across multiple NVDIMM devices. A range mapping
>    structure is used to describe the single mapping on each NVDIMM
>    device. It describes the size and the offset in a SPA range that an
>    NVDIMM device occupies. It may refer to an Interleave Structure
>    that contains details of the entire interleave set. Those
>    information is used in pblk by the NVDIMM driver for address
>    translation.
> 
>    The NVDIMM device described by the range mapping structure is
>    identified by an unique NFIT Device Handle.
> 
>  Details of NFIT and other structures can be found in Section 5.25 in [1].
> 
> 1.2.2 _DSM and _FIT
> 
>  The ACPI namespace device uses _HID of ACPI0012 to identify the root
>  NVDIMM interface device. An ACPI namespace device is also present
>  under the root device For each NVDIMM device. Above ACPI namespace

s/For/for/

>  devices are defined in SSDT.
> 
>  _DSM methods are present under the root device and each NVDIMM
>  device. _DSM methods are used by drivers to access the label storage
>  area, get health information, perform vendor-specific commands,
>  etc. Details of all _DSM methods can be found in [4].
> 
>  _FIT method is under the root device and evaluated by OSPM to get
>  NFIT of hotplugged NVDIMM. The hotplugged NVDIMM is indicated to OS
>  using ACPI Namespace device with PNPID of PNP0C80 and the device
>  object notification value is 0x80. Details of NVDIMM hotplug can be
>  found in Section 9.20 of [1].
> 
> 
> 1.3 Namespace
> 
>  [2] describes a mechanism to sub-divide NVDIMMs into namespaces,
>  which are logic units of storage similar to SCSI LUNs and NVM Express
>  namespaces.
> 
>  The namespace information is describes by namespace labels stored in
>  the persistent label storage area on each NVDIMM device. The label
>  storage area is excluded from the the range mapped by the SPA range

s/the the/the

>  structure and can only be accessed via _DSM methods.
> 
>  There are two types of namespaces defined in [2]: the persistent
>  memory namespace and the block namespaces. Persistent memory
>  namespaces is built for only pmem NVDIMM regions, while block
>  namespaces only for pblk. Only one persistent memory namespace is
>  allowed for a pmem NVDIMM region.
> 
>  Besides being accessed via _DSM, namespaces are managed and
>  interpreted by software. OS vendors may decide to not follow [2] and
>  store other types of information in the label storage area.
> 
> 
> 1.4 clwb/clflushopt
> 
>  Writes to NVDIMM may be cached by caches, so certain flushing
>  operations should be performed to make them persistent on
>  NVDIMM. clwb is used in favor of clflushopt and clflush to flush
>  writes from caches to memory.
> 
>  Details of clwb/clflushopt can be found in Chapter 10 of [6].

Didn't that opcode get dropped in favour of poking in some register?
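
(For reference, pcommit is what was dropped; clwb/clflushopt remain.
A minimal user-space sketch of the flushing described in 1.4, as a
guest would do it once clwb is exposed via CPUID; illustrative only,
build with -mclwb:)

    #include <immintrin.h>
    #include <stdint.h>

    static void persist_u64(uint64_t *p, uint64_t v)
    {
        *p = v;
        _mm_clwb(p);    /* write the cache line holding *p back to pmem */
        _mm_sfence();   /* order the write-back before later stores     */
    }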
> 
> 
> 
> 2. NVDIMM/vNVDIMM Support in Linux Kernel/KVM/QEMU
> 
> 2.1 NVDIMM Driver in Linux Kernel
> 
>  Linux kernel since 4.2 has added support for ACPI-defined NVDIMM
>  devices.
> 
>  NVDIMM driver in Linux probes NVDIMM devices through ACPI (i.e. NFIT
>  and _FIT). It is also responsible to parse the namsepace labels on

s/namsepace/namespace/

>  each NVDIMM devices, recover namespace after power failure (as
>  described in [2]) and handle NVDIMM hotplug. There are also some
>  vendor drivers to perform vendor-specific operations on NVDIMMs
>  (e.g. via _DSM).
> 
>  Compared to the ordinary ram, NVDIMM is used more like a persistent

s/ram/RAM/
>  storage drive for its persistent aspect. For each persistent memory
>  namespace, or a label-less pmem NVDIMM range, NVDIMM driver
>  implements a block device interface (/dev/pmemX) for it.
> 
>  Userspace applications can mmap(2) the whole pmem into its own
>  virtual address space. Linux kernel maps the system physical address
>  space range occupied by pmem into the virtual address space, so that every
>  normal memory loads/writes with proper flushing instructions are
>  applied to the underlying pmem NVDIMM regions.
> 
>  Alternatively, a DAX file system can be made on /dev/pmemX. Files on
>  that file system can be used in the same way as above. As Linux
>  kernel maps the system address space range occupied by those files on
>  NVDIMM to the virtual address space, reads/writes on those files are
>  applied to the underlying NVDIMM regions as well.
> 
> 2.2 vNVDIMM Implementation in KVM/QEMU
> 
>  An overview of vNVDIMM implementation in KVM (Linux kernel v4.2) / QEMU (commit
>  70d1fb9 and patches in-review/future) is showed by the following figure.
> 
> 
>                                        +---------------------------------+
>  Guest                             GPA |                    | /dev/pmem0 |
>                                        +---------------------------------+
>            parse        evaluate                            ^            ^
>             ACPI          _DSM                              |            |
>               |            |                                |            |
>  -------------|------------|--------------------------------|------------|----
>               V            V                                |            |
>           +-------+    +-------+                            |            |
>  QEMU     | vACPI |    | v_DSM |                            |            |
>           +-------+    +-------+                            |            |
>                            ^                                |            |
>                            | Read/Write                     |            |
>                            V                                |            |
>           +...+--------------------+...+-----------+        |            |
>     VA    |   | Label Storage Area |   |    buf    |  KVM_SET_USER_MEMORY_REGION(buf)
>           +...+--------------------+...+-----------+        |            |
>                                        ^  mmap(2)  ^        |            |
>  --------------------------------------|-----------|--------|------------|----
>                                        |           +--------~------------+
>                                        |                    |            |
>  Linux/KVM                             +--------------------+            |
>                                                             |            |
>                                                        +....+------------+
>                                                 SPA    |    | /dev/pmem0 |
>                                                        +....+------------+
>                                                                    ^
>                                                                    |
>                                                             Host NVDIMM Driver
> -------------------------------------------------------------------|---------
>                                                                    |
>  HW                                                          +------------+
>                                                              |   NVDIMM   |
>                                                              +------------+
> 
> 

Nice picture!

>  A part not put in above figure is enabling guest clwb/clflushopt
>  which exposes those instructions to guest via guest cpuid.

And aren't those deprecated?
> 
>  Besides instruction enabling, there are two primary parts of vNVDIMM
>  implementation in KVM/QEMU.
> 
>  (1) Address Mapping
> 
>   As described before, the host Linux NVDIMM driver provides a block
>   device interface (/dev/pmem0 at the bottom) for a pmem NVDIMM
>   region. QEMU can than mmap(2) that device into its virtual address
>   space (buf). QEMU is responsible to find a proper guest physical
>   address space range that is large enough to hold /dev/pmem0. Then
>   QEMU passes the virtual address of mmapped buf to a KVM API
>   KVM_SET_USER_MEMORY_REGION that maps in EPT the host physical
>   address range of buf to the guest physical address space range where
>   the virtual pmem device will be.
> 
>   In this way, all guest writes/reads on the virtual pmem device is
>   applied directly to the host one.
> 
>   Besides, above implementation also allows to back a virtual pmem
>   device by a mmapped regular file or a piece of ordinary ram.
> 
>  (2) Guest ACPI Emulation
> 
>   As guest system physical address and the size of the virtual pmem
>   device are determined by QEMU, QEMU is responsible to emulate the
>   guest NFIT and SSDT. Basically, it builds the guest NFIT and its
>   sub-structures that describes the virtual NVDIMM topology, and a
>   guest SSDT that defines ACPI namespace devices of virtual NVDIMM in
>   guest SSDT.
> 
>   As a portion of host pmem device or a regular file/ordinary file can
>   be used to back the guest pmem device, the label storage area on
>   host pmem cannot always be passed through to guest. Therefore, the
>   guest reads/writes on the label storage area is emulated by QEMU. As
>   described before, _DSM method is utilized by OSPM to access the
>   label storage area, and therefore it is emulated by QEMU. The _DSM
>   buffer is registered as MMIO, and its guest physical address and
>   size are described in the guest ACPI. Every command/status
>   read/write from guest is trapped and emulated by QEMU.
>


And is there any need for the E820 type 7 to be exposed? I presume
not as the ACPI NFIT is sufficient?
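
(To illustrate the address-mapping path in (1) above, a minimal
sketch; the slot number, GPA and helper name are arbitrary and error
handling is omitted:)

    #include <stddef.h>
    #include <stdint.h>
    #include <fcntl.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>
    #include <linux/kvm.h>

    /* mmap the host pmem device and expose it to the guest at guest_gpa. */
    static int map_pmem_into_guest(int vm_fd, uint64_t guest_gpa, size_t size)
    {
        int fd = open("/dev/pmem0", O_RDWR);
        void *buf = mmap(NULL, size, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, 0);
        struct kvm_userspace_memory_region region = {
            .slot            = 1,          /* example memslot id          */
            .guest_phys_addr = guest_gpa,  /* where the vNVDIMM will live */
            .memory_size     = size,
            .userspace_addr  = (uintptr_t)buf,
        };

        return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);
    }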


>   Guest _FIT method will be implemented similarly in the future.
> 
> 
> 
> 3. Usage Example of vNVDIMM in Xen
> 
>  Our design is to provide virtual pmem devices to HVM domains. The
>  virtual pmem devices are backed by host pmem devices.
> 
>  Dom0 Linux kernel can detect the host pmem devices and create
>  /dev/pmemXX for each detected devices. Users in Dom0 can then create
>  DAX file system on /dev/pmemXX and create several pre-allocate files
>  in the DAX file system.
> 
>  After setup the file system on the host pmem, users can add the
>  following lines in the xl configuration files to assign the host pmem
>  regions to domains:
>      vnvdimm = [ 'file=/dev/pmem0' ]
>  or
>      vnvdimm = [ 'file=/mnt/dax/pre_allocated_file' ]
> 
>   The first type of configuration assigns the entire pmem device
>   (/dev/pmem0) to the domain, while the second assigns the space
>   allocated to /mnt/dax/pre_allocated_file on the host pmem device to
>   the domain.
> 
>   When the domain starts, guest can detect the (virtual) pmem devices
>   via ACPI and guest read/write on the virtual pmem devices are
>   directly applied on their host backends.

Would guest namespace (128kb) be written at offset 0 of said file (or block)?
And of course the guest can only manipulate this using ACPI _DSM methods?

> 
> 
> 
> 4. Design of vNVDIMM in Xen
> 
>  As KVM/QEMU, our design currently only provides pmem vNVDIMM.
> 
>  Similarly to that in KVM/QEMU, enabling vNVDIMM in Xen is composed of
>  three parts:
>  (1) Guest clwb/clflushopt enabling,
>  (2) pmem address management, and
>  (3) Guest ACPI emulation.
> 
>  The rest of this section present the design of each part
>  respectively. The basic design principle to reuse existing code in
>  Linux NVDIMM driver, QEMU and Xen as much as possible.
> 
> 
> 4.1 Guest clwb/clflushopt Enabling
> 
>  The instruction enabling is simple and we do the same work as in KVM/QEMU:
>  - clwb/clflushopt are exposed to guest via guest cpuid.
> 

Again, isn't that deprecated and the new mechanism (poking at some register)
has to be used?
> 
> 4.2 pmem Address Management
> 
>  pmem address management is primarily composed of three parts:
>  (1) detection of pmem devices and their address ranges, which is
>      accomplished by Dom0 Linux pmem driver and Xen hypervisor;
>  (2) get SPA ranges of an pmem area that will be mapped to domain
>      which is accomplished by xl;
>  (3) map the pmem area to a domain, which is accomplished by qemu and
s/qemu/QEMU/
>      Xen hypervisor.
> 
>  Our design intends to reuse the current memory management for normal
>  RAM in Xen to manage the mapping of pmem. Then we will come across a
>  problem: where we store the memory management data structs for pmem.

s/where we store/where to store/
> 
>  The rest of this section addresses above aspects respectively.

Wait. What about alternatives? Why treat it as a RAM region instead of
as an MMIO region?

> 
> 4.2.1 Reserve Storage for Management Structures
> 
>  A core data struct in Xen memory management is 'struct page_info'.
>  For normal ram, Xen creates a page_info struct for each page. For
>  pmem, we are going to do the same. However, for large capacity pmem
>  devices (e.g. several terabytes or even larger), a large amount of
>  page_info structs will occupy too much storage space that cannot
>  fit in the normal ram.
> 
>  Our solution, as used by Linux kernel, is to reserve an area on pmem
>  and place pmem's page_info structs in that reserved area. Therefore,
>  we can always ensure there is enough space for pmem page_info
>  structs, though the access to them is slower than directly from the
>  normal ram.
> 
>  Such a pmem namespace can be created via a userspace tool ndctl and
>  then recognized by Linux NVDIMM driver. However, they currently only
>  reserve space for Linux kernel's page structs. Therefore, our design
>  need to extend both Linux NVDIMM driver and ndctl to reserve
>  arbitrary size.

That seems .. fragile? What if Windows or FreeBSD want to use it
too? Would this 'struct page' on NVDIMM be generalized enough
to work with Linux, Xen, FreeBSD and what not?

And this ndctl is https://github.com/pmem/ndctl I presume?

And how is this structure reserved? Is it a separate namespace entry?
And QEMU knows not to access it? Or Xen needs to make sure _nobody_
except it can access it? Which means Xen may need to know the format
of the ndctl structures that are laid out in the NVDIMM region?

> 
> 4.2.2 Detection of Host pmem Devices
> 
>  The detection and initialize host pmem devices require a non-trivial
>  driver to interact with the corresponding ACPI namespace devices,
>  parse namespace labels and make necessary recovery actions. Instead
>  of duplicating the comprehensive Linux pmem driver in Xen hypervisor,
>  our designs leaves it to Dom0 Linux and let Dom0 Linux report
>  detected host pmem devices to Xen hypervisor.

So Xen would ignore the ACPI NFIT structures at boot?
> 
>  Our design takes following steps to detect host pmem devices when Xen
>  boots.
>  (1) As booting on bare metal, host pmem devices are detected by Dom0
>      Linux NVDIMM driver.
> 
>  (2) Our design extends Linux NVDIMM driver to reports SPA's and sizes
>      of the pmem devices and reserved areas to Xen hypervisor via a
>      new hypercall.

reserved areas? That is the namespace region and the SPA <start,end>
for the ndctl areas? Are the ndctl areas guaranteed to be contiguous?

Is there some spec on the ndctl and how/where they are stuck in the NVDIMM?

> 
>  (3) Xen hypervisor then checks
>      - whether SPA and size of the newly reported pmem device is overlap
>        with any previously reported pmem devices;

Or normal RAM?

>      - whether the reserved area can fit in the pmem device and is
>        large enough to hold page_info structs for itself.

I think I know what you mean but it sounds odd.

Perhaps:

 large enough to hold page_info structs for its entire range?

A native speaker, like Ian, would know how to say this right, I think.

Anyhow, wouldn't this 'sizeof(struct page_info)' depend on the ndctl
tool and what version was used to create this? What if one version
used 32-bytes for a PAGE, while another used 64-bytes for a PAGE too?
It would be a bit of catching up .. wait, this same problem MUST
be with Linux? How does it deal with this?
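
For a rough sense of the sizes involved, a back-of-the-envelope helper
(assuming a 32-byte struct page_info, as on x86-64 Xen today; as I
read step (3), the check is against Xen's own sizeof, not ndctl's):

    /* Bytes of reserved space needed so that every 4K pmem page,
     * including the reserved pages themselves, has a page_info. */
    #define PMEM_PAGE_SIZE      4096ULL
    #define PAGE_INFO_SIZE      32ULL   /* assumed sizeof(struct page_info) */

    static inline unsigned long long pmem_reserve_bytes(unsigned long long bytes)
    {
        return (bytes / PMEM_PAGE_SIZE) * PAGE_INFO_SIZE;
    }

    /* e.g. a 1 TB pmem region needs roughly 8 GB of reserved space. */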

> 
>      If any checks fail, the reported pmem device will be ignored by
>      Xen hypervisor and hence will not be used by any

I hope this hypercall returns an error code too?

>      guests. Otherwise, Xen hypervisor will recorded the reported
s/recorded/record/
>      parameters and create page_info structs in the reserved area.

Ohh. You just blast it away? I guess it makes sense. Then what is the
purpose of the ndctl? Just to carve out a namespace region for this?

And what if there is something there from a previous OS (say Linux)?
Just blast it away? But could Linux depend on this containing some
persistent information? Or does it also blast it away?

But those regions may be non-contiguous (or maybe not? I need to check
the spec to double-check), so how do you figure out this 'reserved area'
as it may be N SPA ranges of the <start>,<end> type?

> 
>  (4) Because the reserved area is now used by Xen hypervisor, it
>      should not be accessible by Dom0 any more. Therefore, if a host
>      pmem device is recorded by Xen hypervisor, Xen will unmap its

s/recorded/usurped/? Maybe monopolized? Owned? Ah, possessed!

s/its/this/
>      reserved area from Dom0. Our design also needs to extend Linux
>      NVDIMM driver to "balloon out" the reserved area after it
>      successfully reports a pmem device to Xen hypervisor.

This "balloon out" is interesting. You are effectively telling Linux
to ignore a certain range of 'struct page_info', so that if somebody
uses /sys/debug/kernel/page_walk it won't blow up? (As the kernel
can't read the struct page_info anymore).

How would you do this? Simulate an NVDIMM unplug?

But if you do that how will SMART tools work anymore? And
who would do the _DSM checks on the health of the NVDIMM?

/me scratches his head. Perhaps the answers are later in this
design..

> 
> 4.2.3 Get Host Machine Address (SPA) of Host pmem Files
> 
>  Before a pmem file is assigned to a domain, we need to know the host
>  SPA ranges that are allocated to this file. We do this work in xl.
> 
>  If a pmem device /dev/pmem0 is given, xl will read
>  /sys/block/pmem0/device/{resource,size} respectively for the start
>  SPA and size of the pmem device.

Oh! How convenient!
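
(A minimal sketch of that read; the helper name and error handling are
illustrative, and the sysfs file is assumed to contain a single hex
value:)

    #include <inttypes.h>
    #include <stdio.h>

    /* Read the start SPA of /dev/<dev> from sysfs, e.g. dev = "pmem0". */
    static int read_pmem_start_spa(const char *dev, uint64_t *start_spa)
    {
        char path[128];
        FILE *f;
        int ok;

        snprintf(path, sizeof(path), "/sys/block/%s/device/resource", dev);
        f = fopen(path, "r");
        if (!f)
            return -1;
        ok = (fscanf(f, "%" SCNx64, start_spa) == 1);
        fclose(f);
        return ok ? 0 : -1;
    }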
> 
>  If a pre-allocated file /mnt/dax/file is given,
>  (1) xl first finds the host pmem device where /mnt/dax/file is. Then
>      it uses the method above to get the start SPA of the host pmem
>      device.
>  (2) xl then uses fiemap ioctl to get the extend mappings of
>      /mnt/dax/file, and adds the corresponding physical offsets and
>      lengths in each mapping entries to above start SPA to get the SPA
>      ranges pre-allocated for this file.

Nice !
> 
>  The resulting host SPA ranges will be passed to QEMU which allocates
>  guest address space for vNVDIMM devices and calls Xen hypervisor to
>  map the guest address to the host SPA ranges.
> 
> 4.2.4 Map Host pmem to Guests
> 
>  Our design reuses the existing address mapping in Xen for the normal
>  ram to map pmem. We will still initiate the mapping for pmem from
>  QEMU, and there is one difference from the mapping of normal ram:
>  - For the normal ram, QEMU only needs to provide gpfn, and the actual
>    host memory where gpfn is mapped is allocated by Xen hypervisor.
>  - For pmem, QEMU provides both gpfn and mfn where gpfn is expected to
>    be mapped to. mfn is provided by xl as described in Section 4.2.3.
> 
>  Our design introduce a new XENMEM op for the pmem mapping, which
>  finally calls guest_physmap_add_page() to add the host pmem page to a
>  domain's address space.
> 
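(Purely as an illustration of the shape such an op could take; the
structure and names below are invented for this sketch and are not the
actual interface being proposed:)

    /* Hypothetical argument structure for the new XENMEM op; field
     * names are made up for illustration only. */
    struct xen_pmem_map {
        domid_t  domid;      /* domain to map into                 */
        uint64_t mfn;        /* start host pmem frame (from xl)    */
        uint64_t gpfn;       /* start guest frame chosen by QEMU   */
        uint64_t nr_mfns;    /* number of pages to map             */
    };
    /* The handler would loop over the range and call
     * guest_physmap_add_page(d, gpfn + i, mfn + i, 0) for each page. */
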
> 4.2.5 Misc 1: RAS
> 
>  Machine check can occur from NVDIMM as normal ram, so that we follow
>  the current machine check handling in Xen for MC# from NVDIMM.

OK, so that is mc_memerr_dhandler. OK,

Is there enough telemetry information for the guest to know
it is NVDIMM and handle it via the NVDIMM #MCE error handling, which
is different from normal #MCE handling?

I presume this means a certain Linux guest dependency as well
for this to work?

> 
> 4.2.6 Misc 2: hotplug
> 
>  The hotplugged host NVDIMM devices is detected via _FIT method under
>  the root ACPI namespace device for NVDIMM. We rely on Dom0 Linux
>  kernel to discover the hotplugged NVDIMM devices and follow steps in
>  Section 4.2.2 to report the hotplugged devices to Xen hypervisor.
> 
> 
> 4.3 Guest ACPI Emulation
> 
>  Guest ACPI emulation is composed of two parts: building guest NFIT
>  and SSDT that defines ACPI namespace devices for NVDIMM, and
>  emulating guest _DSM. As QEMU has already implemented ACPI support
>  for vNVDIMM on KVM, our design intends to reuse its implementation.
> 
> 4.3.1 Building Guest ACPI Tables
> 
>  Two tables for vNVDIMM need to be built:
>  - NFIT, which defines the basic parameters of vNVDIMM devices and
>    does not contain any AML code.
>  - SSDT, which defines ACPI namespace devices for vNVDIMM in AML code.
> 
>  The contents of both tables are affected by some parameters
>  (e.g. address and size of vNVDIMM devices) that cannot be determined
>  until a guest configuration is given. However, all AML code in guest
>  ACPI are currently generated at compile time fro pre-crafted .asl

s/fro/for/

>  files, which is not feasible for ACPI namespace devices for vNVDIMM.
> 
>  We could either introduce an AML builder to generate AML code at
>  runtime like what QEMU is currently doing, or pass vNVDIMM ACPI
>  tables from QEMU to Xen. In order to reduce the duplicated code (to

s/to Xen/to hvmloader/ I think?

>  AML builder in QEMU), our design takes the latter approach. Basically,
>  our design takes the following steps:
>  1) The current QEMU does not build any ACPI stuffs when it runs as
>     the Xen device model, so we need to patch it to generate NFIT and
>     AML code of ACPI namespace devices for vNVDIMM.
> 
>  2) QEMU then copies above NFIT and ACPI namespace devices to an area
>     at the end of guest memory below 4G. The guest physical address
>     and size of this area are written to xenstore keys
>     (/local/domain/domid/hvmloader/dm-acpi/{address,length}) The
>     detailed format of data in this area is explained later.
> 
>  3) hvmloader reads above xenstore keys to probe the passed-in ACPI
>     tables and ACPI namespace devices, and detects the potential
>     collisions as explained later.
> 
>  4) If no collisions are found, hvmloader will
>     (1) append the passed-in ACPI tables to the end of existing guest
>         ACPI tables, like what current construct_passthrough_tables()
>         does.
>     (2) construct a SSDT for each passed-in ACPI namespace devices and
>         append to the end of existing guest ACPI tables.
> 
>  Passing arbitrary ACPI tables and AML code from QEMU could
>  introduce at least two types of collisions:
>  1) a passed-in table and a Xen-built table have the same signature
>  2) a passed-in ACPI namespace device and a Xen-built ACPI namespace
>     device have the same device name.
> 
>  Our design takes the following method to avoid and detect collisions.
>  1) The data layout of area where QEMU copies its NFIT and ACPI
>     namespace devices is organized as below:

Why can't this be expressed in XenStore?

You could have /local/domain/domid/hvmloader/dm-acpi/<name>/{address,length, type}
?

> 
>      1 byte 4 bytes  length bytes
>     +------+--------+-----------+------+--------+-----------+-----
>     | type | length | data blob | type | length | data blob | ...
>     +------+--------+-----------+------+--------+-----------+-----
> 
>     type: 0 - data blob contains a complete ACPI table
>           1 - data blob contains AML code for an ACPI namespace device
> 
>     length: the number of bytes of data blob
> 
>     data blob: type 0 - a complete ACPI table
>                type 1 - composed as below:
> 
>                          4 bytes   (length - 4) bytes
> 	                +---------+------------------+
> 			| name[4] | AML code snippet |
> 			+---------+------------------+
> 
>                         name[4]         : name of ACPI namespace device
> 			AML code snippet: AML code inside "Device(name[4])"
> 
>                e.g. for an ACPI namespace device defined by
> 	             Device(NVDR)
> 		     {
> 		       Name (_HID, "ACPI0012")
> 		       ...
> 		     }
> 		    QEMU builds a data blob like
> 		        +--------------------+-----------------------------+
> 			| 'N', 'V', 'D', 'R' | Name (_HID, "ACPI0012") ... |
> 			+--------------------+-----------------------------+
> 
>  2) hvmloader stores signatures of its own guest ACPI tables in an
>     array builtin_table_sigs[], and names of its own guest ACPI
>     namespace devices in an array builtin_nd_names[]. Because there
>     are only a few guest ACPI tables and namespace devices built by
>     Xen, we can hardcode their signatures or names in hvmloader.
> 
>  3) When hvmloader loads a type 0 entry, it extracts the signature

s/type 0/data blob->type 0/ ?

>     from the data blob and search for it in builtin_table_sigs[].  If
>     found anyone, hvmloader will report an error and stop. Otherwise,
>     it will append it to the end of loaded guest ACPI.
> 
>  4) When hvmloader loads a type 1 entry, it extracts the device name
>     from the data blob and search for it in builtin_nd_names[]. If
>     found anyone, hvmloader will report and error and stop. Otherwise,
>     it will wrap the AML code snippet by "Device (name[4]) {...}" and
>     include it in a new SSDT which is then appended to the end of
>     loaded guest ACPI.
> 
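(A minimal sketch of walking the {type, length, data blob} entries laid
out above; the names and loop are illustrative, not the proposed
hvmloader code:)

    #include <stdint.h>
    #include <string.h>

    /* Walk the passed-in area; p/total_len come from the
     * hvmloader/dm-acpi/{address,length} xenstore keys. */
    static void walk_dm_acpi(const uint8_t *p, uint32_t total_len)
    {
        const uint8_t *end = p + total_len;

        while (p + 5 <= end) {
            uint8_t type = p[0];
            uint32_t len;
            memcpy(&len, p + 1, sizeof(len));
            const uint8_t *blob = p + 5;

            if (len > (uint32_t)(end - blob))
                break;                  /* malformed entry */

            if (type == 0) {
                /* blob is a complete ACPI table: check its signature
                 * (first 4 bytes) against builtin_table_sigs[], then
                 * append it to the loaded guest ACPI. */
            } else if (type == 1) {
                /* blob[0..3] is the device name, blob[4..] the AML body:
                 * check the name against builtin_nd_names[], then wrap
                 * the body in "Device (name) { ... }" in a new SSDT. */
            }

            p = blob + len;
        }
    }
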
> 4.3.2 Emulating Guest _DSM
> 
>  Our design leaves the emulation of guest _DSM to QEMU. Just as what
>  it does with KVM, QEMU registers the _DSM buffer as MMIO region with
>  Xen and then all guest evaluations of _DSM are trapped and emulated
>  by QEMU.

Sweet!

So one question that I am not sure has been answered: with the
'struct page_info' being removed from dom0, how will OEM _DSM methods
operate? For example some of the AML code may be asking to poke
at specific SPAs, but how will Linux do this properly without
'struct page_info' being available?

Thanks!
> 
> 
> References:
> [1] ACPI Specification v6,
>     http://www.uefi.org/sites/default/files/resources/ACPI_6.0.pdf
> [2] NVDIMM Namespace Specification,
>     http://pmem.io/documents/NVDIMM_Namespace_Spec.pdf
> [3] NVDIMM Block Window Driver Writer's Guide,
>     http://pmem.io/documents/NVDIMM_Driver_Writers_Guide.pdf
> [4] NVDIMM DSM Interface Example,
>     http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf
> [5] UEFI Specification v2.6,
>     http://www.uefi.org/sites/default/files/resources/UEFI%20Spec%202_6.pdf
> [6] Intel Architecture Instruction Set Extensions Programming Reference,
>     https://software.intel.com/sites/default/files/managed/07/b7/319433-023.pdf
> [7] https://lists.xenproject.org/archives/html/xen-devel/2016-02/msg00006.html
> [8] https://lists.xen.org/archives/html/xen-devel/2016-06/msg00606.html


* Re: [RFC Design Doc v2] Add vNVDIMM support for Xen
  2016-08-03 21:25 ` Konrad Rzeszutek Wilk
@ 2016-08-03 23:16   ` Konrad Rzeszutek Wilk
  2016-08-04  1:51     ` Haozhong Zhang
  2016-08-04  8:52   ` Haozhong Zhang
  1 sibling, 1 reply; 24+ messages in thread
From: Konrad Rzeszutek Wilk @ 2016-08-03 23:16 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Juergen Gross, Tian, Kevin, Stefano Stabellini, Wei Liu,
	Nakajima, Jun, George Dunlap, Andrew Cooper, Ian Jackson,
	xen-devel, Jan Beulich, Xiao Guangrong

> > 1.4 clwb/clflushopt
> > 
> >  Writes to NVDIMM may be cached by caches, so certain flushing
> >  operations should be performed to make them persistent on
> >  NVDIMM. clwb is used in favor of clflushopt and clflush to flush
> >  writes from caches to memory.
> > 
> >  Details of clwb/clflushopt can be found in Chapter 10 of [6].
> 
> Didn't that opcode get dropped in favour of poking in some register?

Nevermind. I got this confused with pcommit which was deprecated.

But looking at chapter 10.2.2 it mentions that to commit to
persistent memory you need to use pcommit. So what is the story here?
.. snip...
> >  A part not put in above figure is enabling guest clwb/clflushopt
> >  which exposes those instructions to guest via guest cpuid.
> 
> And aren't those deprecated?

And again. Ignore that comment.
.. snip..
> > 4. Design of vNVDIMM in Xen
> > 
> >  As KVM/QEMU, our design currently only provides pmem vNVDIMM.
> > 
> >  Similarly to that in KVM/QEMU, enabling vNVDIMM in Xen is composed of
> >  three parts:
> >  (1) Guest clwb/clflushopt enabling,
> >  (2) pmem address management, and
> >  (3) Guest ACPI emulation.
> > 
> >  The rest of this section present the design of each part
> >  respectively. The basic design principle to reuse existing code in
> >  Linux NVDIMM driver, QEMU and Xen as much as possible.
> > 
> > 
> > 4.1 Guest clwb/clflushopt Enabling
> > 
> >  The instruction enabling is simple and we do the same work as in KVM/QEMU:
> >  - clwb/clflushopt are exposed to guest via guest cpuid.
> > 
> 
> Again, isn't that deprecated and the new mechanism (pokng at some register)
> has to be used?

So clflushopt can be used for flushing out a cacheline. But what
to do about stores to the non-volatile memory? I recall that you could
do an sfence and then pcommit, which would be akin to a SCSI SYNC
command.

But with pcommit being deprecated (albeit the URL you pointed to
still lists pcommit) - at least in Xen and Linux - how do you
enforce this wholesale flush?

Thanks!


* Re: [RFC Design Doc v2] Add vNVDIMM support for Xen
  2016-08-03 23:16   ` Konrad Rzeszutek Wilk
@ 2016-08-04  1:51     ` Haozhong Zhang
  0 siblings, 0 replies; 24+ messages in thread
From: Haozhong Zhang @ 2016-08-04  1:51 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Juergen Gross, Tian, Kevin, Stefano Stabellini, Wei Liu,
	Jan Beulich, George Dunlap, Andrew Cooper, Ian Jackson,
	xen-devel, Nakajima, Jun, Xiao Guangrong

On 08/03/16 19:16, Konrad Rzeszutek Wilk wrote:
> > > 1.4 clwb/clflushopt
> > > 
> > >  Writes to NVDIMM may be cached by caches, so certain flushing
> > >  operations should be performed to make them persistent on
> > >  NVDIMM. clwb is used in favor of clflushopt and clflush to flush
> > >  writes from caches to memory.
> > > 
> > >  Details of clwb/clflushopt can be found in Chapter 10 of [6].
> > 
> > Didn't that opcode get dropped in favour of poking in some register?
> 
> Nevermind. I got this confused with pcommit which was deprecated.
> 
> But looking at chapter 10.2.2 it mentions that to commit to
> persistent memory you need to use pcommit. So what is the story here?

The document has not been updated yet, though patches to revert
pcommit support for both Linux and Xen have been merged.

> .. snip...
> > >  A part not put in above figure is enabling guest clwb/clflushopt
> > >  which exposes those instructions to guest via guest cpuid.
> > 
> > And aren't those deprecated?
> 
> And again. Ignore that comment.
> .. snip..
> > > 4. Design of vNVDIMM in Xen
> > > 
> > >  As KVM/QEMU, our design currently only provides pmem vNVDIMM.
> > > 
> > >  Similarly to that in KVM/QEMU, enabling vNVDIMM in Xen is composed of
> > >  three parts:
> > >  (1) Guest clwb/clflushopt enabling,
> > >  (2) pmem address management, and
> > >  (3) Guest ACPI emulation.
> > > 
> > >  The rest of this section present the design of each part
> > >  respectively. The basic design principle to reuse existing code in
> > >  Linux NVDIMM driver, QEMU and Xen as much as possible.
> > > 
> > > 
> > > 4.1 Guest clwb/clflushopt Enabling
> > > 
> > >  The instruction enabling is simple and we do the same work as in KVM/QEMU:
> > >  - clwb/clflushopt are exposed to guest via guest cpuid.
> > > 
> > 
> > Again, isn't that deprecated and the new mechanism (pokng at some register)
> > has to be used?
> 
> So clflushopt can be used for flushing out a cacheline. But what
> to do about store in the non-volatile memory? I recall that you could
> do an sfence and then pcommit, which would be aking to an SCSI SYNC
> command.
> 
> But with pcommit being deprecated (albeit the URL you pointed too
> still lists pcommit) - at least in Xen and Linux - how do you
> enforce this wholesale flush?
>

After deprecating pcommit, at least one of the following two approaches
should be provided by hardware to guarantee persistence:

1) Asynchronous DRAM refresh (ADR)
   If the platform supports ADR, flushing CPU cache lines (e.g. by
   clwb/clflushopt/clflush) will result in the write pending queues
   in the memory controller being flushed to NVDIMM.
   
2) ACPI Flush Hint Address Structure
   If an ACPI flush hint address structure is available for an NVDIMM
   region, software can write to that structure to flush any preceding
   stores to that NVDIMM region. (Section 5.2.25.8 of ACPI Spec 6.1)

ADR is preferred, as guest clwb/clflushopt/clflush do not introduce
VMEXITs.

However, I'm also going to emulate the ACPI flush hint address structure
in case ADR is absent. Basically, guest writes to the ACPI flush hint
address structure will be trapped to QEMU, which will submit them to
the host ACPI flush hint address structure via the Dom0 NVDIMM driver.

If neither ADR nor an ACPI flush hint address structure is available,
persistence cannot be guaranteed.
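
(For the ADR case the flush a guest issues is then just cache-line
write-backs plus a fence; a minimal sketch using the compiler
intrinsics, assuming a 64-byte cache line and a toolchain with clwb
support, e.g. gcc -mclwb:)

    #include <immintrin.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Write back every cache line covering [addr, addr + len) and order
     * the write-backs; with ADR that is enough for the data to reach
     * the NVDIMM. */
    static void flush_to_pmem(const void *addr, size_t len)
    {
        const uintptr_t line = 64;          /* assumed cache line size */
        uintptr_t p = (uintptr_t)addr & ~(line - 1);

        for (; p < (uintptr_t)addr + len; p += line)
            _mm_clwb((void *)p);
        _mm_sfence();
    }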

Thanks,
Haozhong


* Re: [RFC Design Doc v2] Add vNVDIMM support for Xen
  2016-08-03 21:25 ` Konrad Rzeszutek Wilk
  2016-08-03 23:16   ` Konrad Rzeszutek Wilk
@ 2016-08-04  8:52   ` Haozhong Zhang
  2016-08-04  9:25     ` Jan Beulich
  2016-08-04 14:51     ` Konrad Rzeszutek Wilk
  1 sibling, 2 replies; 24+ messages in thread
From: Haozhong Zhang @ 2016-08-04  8:52 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Juergen Gross, Tian, Kevin, Stefano Stabellini, Wei Liu,
	Nakajima, Jun, George Dunlap, Andrew Cooper, Ian Jackson,
	xen-devel, Jan Beulich, Xiao Guangrong

Hi Konrad,

On 08/03/16 17:25, Konrad Rzeszutek Wilk wrote:
> On Mon, Jul 18, 2016 at 08:29:12AM +0800, Haozhong Zhang wrote:
> > Hi,
> > 
> 
> Hey!
> 
> Thanks for posting! Sorry for the late review. Below are some of my
> comment.
>

Thank you for the review!

[..]
> And is there any need for the E820 type 7 to be exposed? I presume
> not as the ACPI NFIT is sufficient?
>

No, NFIT is sufficient and provides more information than E820.

> 
> >   Guest _FIT method will be implemented similarly in the future.
> > 
> > 
> > 
> > 3. Usage Example of vNVDIMM in Xen
> > 
> >  Our design is to provide virtual pmem devices to HVM domains. The
> >  virtual pmem devices are backed by host pmem devices.
> > 
> >  Dom0 Linux kernel can detect the host pmem devices and create
> >  /dev/pmemXX for each detected devices. Users in Dom0 can then create
> >  DAX file system on /dev/pmemXX and create several pre-allocate files
> >  in the DAX file system.
> > 
> >  After setup the file system on the host pmem, users can add the
> >  following lines in the xl configuration files to assign the host pmem
> >  regions to domains:
> >      vnvdimm = [ 'file=/dev/pmem0' ]
> >  or
> >      vnvdimm = [ 'file=/mnt/dax/pre_allocated_file' ]
> > 
> >   The first type of configuration assigns the entire pmem device
> >   (/dev/pmem0) to the domain, while the second assigns the space
> >   allocated to /mnt/dax/pre_allocated_file on the host pmem device to
> >   the domain.
> > 
> >   When the domain starts, guest can detect the (virtual) pmem devices
> >   via ACPI and guest read/write on the virtual pmem devices are
> >   directly applied on their host backends.
> 
> Would guest namespace (128kb) be written at offset 0 of said file (or block)?
> And of course the guest can only manipulate this using ACPI _DSM methods?
>

I guess you mean the label storage area which stores labels of
namespaces. In the current QEMU implementation, the guest label
storage area is at the end of the file or the block device. It's not
mapped to the guest address space (which I neglected to state here) and
can be accessed only via guest _DSM.

[..] 
> > 4.2 pmem Address Management
> > 
> >  pmem address management is primarily composed of three parts:
> >  (1) detection of pmem devices and their address ranges, which is
> >      accomplished by Dom0 Linux pmem driver and Xen hypervisor;
> >  (2) get SPA ranges of an pmem area that will be mapped to domain
> >      which is accomplished by xl;
> >  (3) map the pmem area to a domain, which is accomplished by qemu and
> s/qemu/QEMU/
> >      Xen hypervisor.
> > 
> >  Our design intends to reuse the current memory management for normal
> >  RAM in Xen to manage the mapping of pmem. Then we will come across a
> >  problem: where we store the memory management data structs for pmem.
> 
> s/we store/where to/
> > 
> >  The rest of this section addresses above aspects respectively.
> 
> Wait. What about alternatives? Why treat it as a RAM region instead of
> as an MMIO region?
>

The part used as the label storage area of vNVDIMM is treated as MMIO,
as described in a later section of this design. Other parts of vNVDIMM
are directly accessed by the guest, so I think we can treat them as
normal RAM regions and map them to the guest, though we definitely need
to mark them as pmem regions via the virtual NFIT.

> > 
> > 4.2.1 Reserve Storage for Management Structures
> > 
> >  A core data struct in Xen memory management is 'struct page_info'.
> >  For normal ram, Xen creates a page_info struct for each page. For
> >  pmem, we are going to do the same. However, for large capacity pmem
> >  devices (e.g. several terabytes or even larger), a large amount of
> >  page_info structs will occupy too much storage space that cannot
> >  fit in the normal ram.
> > 
> >  Our solution, as used by Linux kernel, is to reserve an area on pmem
> >  and place pmem's page_info structs in that reserved area. Therefore,
> >  we can always ensure there is enough space for pmem page_info
> >  structs, though the access to them is slower than directly from the
> >  normal ram.
> > 
> >  Such a pmem namespace can be created via a userspace tool ndctl and
> >  then recognized by Linux NVDIMM driver. However, they currently only
> >  reserve space for Linux kernel's page structs. Therefore, our design
> >  need to extend both Linux NVDIMM driver and ndctl to reserve
> >  arbitrary size.
> 
> That seems .. fragile? What if Windows or FreeBSD want to use it
> too?

AFAIK, the way used by current Linux NVDIMM driver for reservation has
not been documented in any public specifications yet. I'll consult
driver developers for more information.

> Would this 'struct page' on on NVDIMM be generalized enough
> to work with Linux,Xen, FreeBSD and what not?
>

No. Different operating systems may choose different data structures
to manage NVDIMM according to their own requirements and
considerations, so it would be hard to reach an agreement on what to
put in a generic data structure (and make it part of the ABI?).

> And this ndctl is https://github.com/pmem/ndctl I presume?

Yes. Sorry that I forgot to attach the URL.

>
> And how is this structure reserved? Is it a seperate namespace entry?

No, it does not introduce any extra namespace entry. The current
NVDIMM driver in Linux does the reservation in the way shown by the
following diagram (I omit details about alignment and padding for
simplicity):

 SPA  SPA+4K
  |      |
  V      V
  +------+-----------+-- ... ---+-----...-----+
  |      | nd_pfn_sb | reserved | free to use |
  +------+-----------+-- ... ---+-----...-----+
  |<--   nd_pfn_sb.dataoff   -->|             |
  |    (+ necessary padding)                  |
  |                                           |
  |<------------- pmem namespace ------------>|

Given a pmem namespace which starts from SPA,
 1) the driver stores a struct nd_pfn_sb at SPA+4K
 2) the reserved area is after nd_pfn_sb
 3) the free-to-use area is after the reserved area, and its location
    relative to SPA can be derived from nd_pfn_sb.dataoff
 4) only the free-to-use area is exposed to a block device /dev/pmemX.
    Access to sector N of /dev/pmemX actually goes to (SPA +
    nd_pfn_sb.dataoff + N * SECT_SIZE)
 5) nd_pfn_sb also contains a signature "NVDIMM_PFN_INFO" and a
    checksum. If the driver finds such a signature and the checksum
    matches, then it knows this device contains a reserved area.
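
(A simplified sketch of how that layout would be interpreted; the struct
below keeps only the fields mentioned above, while the real struct
nd_pfn_sb in Linux has more fields and a different layout, so treat this
as an illustration rather than the on-media format:)

    #include <stdint.h>
    #include <string.h>

    /* Simplified view of the superblock stored at SPA + 4K. */
    struct pfn_sb_view {
        char     signature[16];   /* "NVDIMM_PFN_INFO"                   */
        uint64_t dataoff;         /* offset of free-to-use area from SPA */
        /* ... uuid, checksum, etc. omitted ...                          */
    };

    /* Return the SPA of the free-to-use area, or 0 if there is no
     * reservation (i.e. treat the namespace as a raw device). */
    static uint64_t free_area_start(uint64_t spa, const struct pfn_sb_view *sb)
    {
        if (memcmp(sb->signature, "NVDIMM_PFN_INFO", 15) != 0)
            return 0;         /* a real driver also verifies the checksum */
        return spa + sb->dataoff;
    }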

> And QEMU knows not to access it?

QEMU as a userspace program can only access /dev/pmemX and hence has
no way to touch the reserved area.

> Or Xen needs to make sure _nobody_
> except it can access it? Which means Xen may need to know the format
> of the ndctl structures that are laid out in the NVDIMM region?
>

Xen hypervisor relies on the Dom0 driver to parse the layout. At Dom0
boot, the Dom0 NVDIMM driver reports the address/size of the area reserved
for Xen to the Xen hypervisor, which then unmaps the reserved area from Dom0.

> > 
> > 4.2.2 Detection of Host pmem Devices
> > 
> >  The detection and initialize host pmem devices require a non-trivial
> >  driver to interact with the corresponding ACPI namespace devices,
> >  parse namespace labels and make necessary recovery actions. Instead
> >  of duplicating the comprehensive Linux pmem driver in Xen hypervisor,
> >  our designs leaves it to Dom0 Linux and let Dom0 Linux report
> >  detected host pmem devices to Xen hypervisor.
> 
> So Xen would ignore at bootup ACPI NFIT structures?

Yes, parsing NFIT is left to Dom0 which has the correct driver.

> > 
> >  Our design takes following steps to detect host pmem devices when Xen
> >  boots.
> >  (1) As booting on bare metal, host pmem devices are detected by Dom0
> >      Linux NVDIMM driver.
> > 
> >  (2) Our design extends Linux NVDIMM driver to reports SPA's and sizes
> >      of the pmem devices and reserved areas to Xen hypervisor via a
> >      new hypercall.
> 
> reserved areas? That is the namespace region and the SPA <start,end>
> for the ndctl areas? Are the ndctl areas guarnateed to be contingous?
>

explained above. The reserved area on an individual pmem namespace is
contiguous.

> Is there some spec on the ndctl and how/where they are stuck in the NVDIMM?
>

No public spec so far, as mentioned above.

> > 
> >  (3) Xen hypervisor then checks
> >      - whether SPA and size of the newly reported pmem device is overlap
> >        with any previously reported pmem devices;
> 
> Or normal RAM?
>

Yes, I missed normal RAM here.

> >      - whether the reserved area can fit in the pmem device and is
> >        large enough to hold page_info structs for itself.
> 
> I think I know what you mean but it sounds odd.
> 
> Perhaps:
> 
>  large enough to hold page_info struct's for it's entire range?
>

Yes, that is what I mean
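
(For that second check the arithmetic is simply that the reservation
must sit inside the device and hold one page_info for every page of it;
a tiny sketch with assumed sizes:)

    #include <stdbool.h>
    #include <stdint.h>

    #define PAGE_SHIFT     12
    #define PAGE_INFO_SIZE 32   /* assumed per-page bookkeeping size */

    /* Does the reserved area [rsv_off, rsv_off + rsv_size) fit inside
     * the device and cover bookkeeping for the device's entire range? */
    static bool reservation_ok(uint64_t dev_size, uint64_t rsv_off,
                               uint64_t rsv_size)
    {
        uint64_t needed = (dev_size >> PAGE_SHIFT) * PAGE_INFO_SIZE;
        return rsv_off + rsv_size <= dev_size && rsv_size >= needed;
    }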

> Native speaker, like Ian, would know how to say this right I think.
> 
> Anyhow, wouldn't this 'sizeof(struct page_info)' depend on the ndctl
> tool and what version was used to create this? What if one version
> used 32-bytes for a PAGE, while another used 64-bytes for a PAGE too?
> It would be a bit of catching up .. wait, this same problem MUST
> be with Linux? How does it deal with this?
>

Good question. Linux chooses a size (64 bytes) larger than its current
sizeof(struct page) (40 bytes). We may do the same, e.g. reserve 64 bytes
per page even though Xen's struct page_info is currently 32 bytes?
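
(For scale: a 4 TB pmem namespace has 2^30 4K pages, so a 64-byte-per-page
reservation is 64 GB and a 32-byte one 32 GB, i.e. roughly 1.6% and 0.8%
of the capacity, which is exactly why the bookkeeping goes on the pmem
itself rather than in normal RAM.)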

> > 
> >      If any checks fail, the reported pmem device will be ignored by
> >      Xen hypervisor and hence will not be used by any
> 
> I hope this hypercall returns an error code too?
>

Definitely yes

> >      guests. Otherwise, Xen hypervisor will recorded the reported
> s/recorded/record/
> >      parameters and create page_info structs in the reserved area.
> 
> Ohh. You just blast it away? I guess it makes sense. Then what is the
> purpose of the ndctl? Just to carve out an namespace region for this?
>

ndctl is used by, for example, a system admin to reserve space on a
host pmem namespace. If there is already data in the namespace, ndctl
will give a warning message and exit as long as the --force option is
not given. However, if --force is present, ndctl will destroy the
existing data.

> And what if there is something there from previous OS (say Linux)?
> Just blast it away? But could Linux depend on this containing some
> persistent information? Or does it also blast it away?
>

As above, if the Linux driver detects the signature "NVDIMM_PFN_INFO" and
a matching checksum, it will know it's safe to write to the reserved
area. Otherwise, it will treat the pmem namespace as a raw device and
store page structs in normal RAM.

> But those regions may be non-contingous (or maybe not? I need to check
> the spec to double-check) so how do you figure out this 'reserved area'
> as it may be an N SPA's of the <start>,<end> type?
>

the reserved area is per pmem namespace.

> > 
> >  (4) Because the reserved area is now used by Xen hypervisor, it
> >      should not be accessible by Dom0 any more. Therefore, if a host
> >      pmem device is recorded by Xen hypervisor, Xen will unmap its
> 
> s/recorded/usurped/? Maybe monopolized? Owned? Ah, possesed!
> 
> s/its/this/
> >      reserved area from Dom0. Our design also needs to extend Linux
> >      NVDIMM driver to "balloon out" the reserved area after it
> >      successfully reports a pmem device to Xen hypervisor.
> 
> This "balloon out" is interesting. You are effectively telling Linux
> to ignore a certain range of 'struct page_info', so that if somebody
> uses /sys/debug/kernel/page_walk it won't blow up? (As the kerne
> can't read the struct page_info anymore).
>
> How would you do this? Simulate an NVDIMM unplug?

s/page_info/page/ (struct page for linux, struct page_info for xen)

As in Jan's comment, "balloon out" is a confusing name here.
Basically, it's to remove the reserved area from some resource struct
in the nvdimm driver, to avoid it being accessed outside of the driver
via the resource struct. And the nvdimm driver does not map the reserved
area, so I think it cannot be touched via page_walk.

> 
> But if you do that how will SMART tools work anymore? And
> who would do the _DSM checks on the health of the NVDIMM?
>

A userspace SMART tool cannot access the reserved area, so I think it
can still work. I haven't looked at the implementation of any SMART
tools for NVDIMM, but I guess they would ultimately call the driver to
evaluate the ARS _DSM, which reports the bad blocks. As long as the
driver does not return the bad blocks in the reserved area to SMART
tools (which I suppose would be handled by the driver itself), SMART
tools should work fine.

> /me scratches his head. Perhaps the answers are later in this
> design..
>
> > 
> > 4.2.3 Get Host Machine Address (SPA) of Host pmem Files
> > 
> >  Before a pmem file is assigned to a domain, we need to know the host
> >  SPA ranges that are allocated to this file. We do this work in xl.
> > 
> >  If a pmem device /dev/pmem0 is given, xl will read
> >  /sys/block/pmem0/device/{resource,size} respectively for the start
> >  SPA and size of the pmem device.
> 
> Oh! How convient!
> > 
> >  If a pre-allocated file /mnt/dax/file is given,
> >  (1) xl first finds the host pmem device where /mnt/dax/file is. Then
> >      it uses the method above to get the start SPA of the host pmem
> >      device.
> >  (2) xl then uses fiemap ioctl to get the extend mappings of
> >      /mnt/dax/file, and adds the corresponding physical offsets and
> >      lengths in each mapping entries to above start SPA to get the SPA
> >      ranges pre-allocated for this file.
> 
> Nice !
> > 
> >  The resulting host SPA ranges will be passed to QEMU which allocates
> >  guest address space for vNVDIMM devices and calls Xen hypervisor to
> >  map the guest address to the host SPA ranges.
> > 
> > 4.2.4 Map Host pmem to Guests
> > 
> >  Our design reuses the existing address mapping in Xen for the normal
> >  ram to map pmem. We will still initiate the mapping for pmem from
> >  QEMU, and there is one difference from the mapping of normal ram:
> >  - For the normal ram, QEMU only needs to provide gpfn, and the actual
> >    host memory where gpfn is mapped is allocated by Xen hypervisor.
> >  - For pmem, QEMU provides both gpfn and mfn where gpfn is expected to
> >    be mapped to. mfn is provided by xl as described in Section 4.2.3.
> > 
> >  Our design introduce a new XENMEM op for the pmem mapping, which
> >  finally calls guest_physmap_add_page() to add the host pmem page to a
> >  domain's address space.
> > 
> > 4.2.5 Misc 1: RAS
> > 
> >  Machine check can occur from NVDIMM as normal ram, so that we follow
> >  the current machine check handling in Xen for MC# from NVDIMM.
> 
> OK, so that is mc_memerr_dhandler. OK,
> 
> Is there enought telemtry information for the guest to know
> it is NVDIMM and handle it via the NVDIMM #MCE error handling which
> is different than normal #MCE?
> 
> I presume this means a certain Linux guest dependency as well
> for this to work?
>

Yes, the guest should at least know which addresses belong to
vNVDIMM. Then it can tell from the address of a virtual #MC where the
error comes from. Otherwise, the guest will see an #MC for an address
it doesn't know.

> > 
> > 4.2.6 Misc 2: hotplug
> > 
> >  The hotplugged host NVDIMM devices is detected via _FIT method under
> >  the root ACPI namespace device for NVDIMM. We rely on Dom0 Linux
> >  kernel to discover the hotplugged NVDIMM devices and follow steps in
> >  Section 4.2.2 to report the hotplugged devices to Xen hypervisor.
> > 
> > 
> > 4.3 Guest ACPI Emulation
> > 
> >  Guest ACPI emulation is composed of two parts: building guest NFIT
> >  and SSDT that defines ACPI namespace devices for NVDIMM, and
> >  emulating guest _DSM. As QEMU has already implemented ACPI support
> >  for vNVDIMM on KVM, our design intends to reuse its implementation.
> > 
> > 4.3.1 Building Guest ACPI Tables
> > 
> >  Two tables for vNVDIMM need to be built:
> >  - NFIT, which defines the basic parameters of vNVDIMM devices and
> >    does not contain any AML code.
> >  - SSDT, which defines ACPI namespace devices for vNVDIMM in AML code.
> > 
> >  The contents of both tables are affected by some parameters
> >  (e.g. address and size of vNVDIMM devices) that cannot be determined
> >  until a guest configuration is given. However, all AML code in guest
> >  ACPI are currently generated at compile time fro pre-crafted .asl
> 
> s/fro/for/
> 
> >  files, which is not feasible for ACPI namespace devices for vNVDIMM.
> > 
> >  We could either introduce an AML builder to generate AML code at
> >  runtime like what QEMU is currently doing, or pass vNVDIMM ACPI
> >  tables from QEMU to Xen. In order to reduce the duplicated code (to
> 
> s/to Xen/to hvmloader/ I think?
>

yes

> >  AML builder in QEMU), our design takes the latter approach. Basically,
> >  our design takes the following steps:
> >  1) The current QEMU does not build any ACPI stuffs when it runs as
> >     the Xen device model, so we need to patch it to generate NFIT and
> >     AML code of ACPI namespace devices for vNVDIMM.
> > 
> >  2) QEMU then copies above NFIT and ACPI namespace devices to an area
> >     at the end of guest memory below 4G. The guest physical address
> >     and size of this area are written to xenstore keys
> >     (/local/domain/domid/hvmloader/dm-acpi/{address,length}) The
> >     detailed format of data in this area is explained later.
> > 
> >  3) hvmloader reads above xenstore keys to probe the passed-in ACPI
> >     tables and ACPI namespace devices, and detects the potential
> >     collisions as explained later.
> > 
> >  4) If no collisions are found, hvmloader will
> >     (1) append the passed-in ACPI tables to the end of existing guest
> >         ACPI tables, like what current construct_passthrough_tables()
> >         does.
> >     (2) construct a SSDT for each passed-in ACPI namespace devices and
> >         append to the end of existing guest ACPI tables.
> > 
> >  Passing arbitrary ACPI tables and AML code from QEMU could
> >  introduce at least two types of collisions:
> >  1) a passed-in table and a Xen-built table have the same signature
> >  2) a passed-in ACPI namespace device and a Xen-built ACPI namespace
> >     device have the same device name.
> > 
> >  Our design takes the following method to avoid and detect collisions.
> >  1) The data layout of area where QEMU copies its NFIT and ACPI
> >     namespace devices is organized as below:
> 
> Why can't this be expressed in XenStore?
> 
> You could have /local/domain/domid/hvmloader/dm-acpi/<name>/{address,length, type}
> ?
>

If XenStore can be used, then it could save some guest memory.

This is a general mechanism to pass ACPI content and is not limited to
NVDIMM, so QEMU may pass a lot of entries. I'm not sure if
XenStore is still a proper place when the number is large. Maybe we
should put an upper limit on the number of entries.

> > 
> >      1 byte 4 bytes  length bytes
> >     +------+--------+-----------+------+--------+-----------+-----
> >     | type | length | data blob | type | length | data blob | ...
> >     +------+--------+-----------+------+--------+-----------+-----
> > 
> >     type: 0 - data blob contains a complete ACPI table
> >           1 - data blob contains AML code for an ACPI namespace device
> > 
> >     length: the number of bytes of data blob
> > 
> >     data blob: type 0 - a complete ACPI table
> >                type 1 - composed as below:
> > 
> >                          4 bytes   (length - 4) bytes
> > 	                +---------+------------------+
> > 			| name[4] | AML code snippet |
> > 			+---------+------------------+
> > 
> >                         name[4]         : name of ACPI namespace device
> > 			AML code snippet: AML code inside "Device(name[4])"
> > 
> >                e.g. for an ACPI namespace device defined by
> > 	             Device(NVDR)
> > 		     {
> > 		       Name (_HID, "ACPI0012")
> > 		       ...
> > 		     }
> > 		    QEMU builds a data blob like
> > 		        +--------------------+-----------------------------+
> > 			| 'N', 'V', 'D', 'R' | Name (_HID, "ACPI0012") ... |
> > 			+--------------------+-----------------------------+
> > 
> >  2) hvmloader stores signatures of its own guest ACPI tables in an
> >     array builtin_table_sigs[], and names of its own guest ACPI
> >     namespace devices in an array builtin_nd_names[]. Because there
> >     are only a few guest ACPI tables and namespace devices built by
> >     Xen, we can hardcode their signatures or names in hvmloader.
> > 
> >  3) When hvmloader loads a type 0 entry, it extracts the signature
> 
> s/type 0/data blob->type 0/ ?
>

No, the type information is outside the data blob ({type, length, data blob}).

> >     from the data blob and search for it in builtin_table_sigs[].  If
> >     found anyone, hvmloader will report an error and stop. Otherwise,
> >     it will append it to the end of loaded guest ACPI.
> > 
> >  4) When hvmloader loads a type 1 entry, it extracts the device name
> >     from the data blob and search for it in builtin_nd_names[]. If
> >     found anyone, hvmloader will report and error and stop. Otherwise,
> >     it will wrap the AML code snippet by "Device (name[4]) {...}" and
> >     include it in a new SSDT which is then appended to the end of
> >     loaded guest ACPI.
> > 
> > 4.3.2 Emulating Guest _DSM
> > 
> >  Our design leaves the emulation of guest _DSM to QEMU. Just as what
> >  it does with KVM, QEMU registers the _DSM buffer as MMIO region with
> >  Xen and then all guest evaluations of _DSM are trapped and emulated
> >  by QEMU.
> 
> Sweet!
> 
> So one question that I am not if it has been answered, with the
> 'struct page_info' being removed from the dom0 how will OEM _DSM method
> operation? For example some of the AML code may asking to poke
> at specific SPAs, but how will Linux do this properly without
> 'struct page_info' be available?
>

(s/page_info/page/)

The current Intel NVDIMM driver in Linux does not evaluate any OEM
_DSM method, so I'm not sure whether the kernel has to access an NVDIMM
page while evaluating _DSM.

The closest one in my mind, though not an OEM _DSM, is function 1
of the ARS _DSM, which requires inputs of a start SPA and a length in
bytes. After the kernel gives the inputs, the scrubbing of the specified
area is done by the hardware and does not require any mappings in the OS.

Any example of such OEM _DSM methods?

Thanks,
Haozhong


* Re: [RFC Design Doc v2] Add vNVDIMM support for Xen
  2016-08-04  8:52   ` Haozhong Zhang
@ 2016-08-04  9:25     ` Jan Beulich
  2016-08-04  9:35       ` Haozhong Zhang
  2016-08-04 14:51     ` Konrad Rzeszutek Wilk
  1 sibling, 1 reply; 24+ messages in thread
From: Jan Beulich @ 2016-08-04  9:25 UTC (permalink / raw)
  To: Haozhong Zhang, Konrad Rzeszutek Wilk
  Cc: Juergen Gross, Kevin Tian, Stefano Stabellini, Wei Liu,
	George Dunlap, Andrew Cooper, Ian Jackson, xen-devel,
	Jun Nakajima, Xiao Guangrong

>>> On 04.08.16 at 10:52, <haozhong.zhang@intel.com> wrote:
> On 08/03/16 17:25, Konrad Rzeszutek Wilk wrote:
>> Anyhow, wouldn't this 'sizeof(struct page_info)' depend on the ndctl
>> tool and what version was used to create this? What if one version
>> used 32-bytes for a PAGE, while another used 64-bytes for a PAGE too?
>> It would be a bit of catching up .. wait, this same problem MUST
>> be with Linux? How does it deal with this?
> 
> Good question. Linux chooses a size (64 bytes) larger than its current
> sizeof(struct page) (40 bytes). We may also do in the same way,
> e.g. 32 bytes vs. 64 bytes?

I don't understand this: These structures aren't meant to be
persistent, so their size shouldn't matter?

Jan



* Re: [RFC Design Doc v2] Add vNVDIMM support for Xen
  2016-08-04  9:25     ` Jan Beulich
@ 2016-08-04  9:35       ` Haozhong Zhang
  2016-08-04 14:51         ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 24+ messages in thread
From: Haozhong Zhang @ 2016-08-04  9:35 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Juergen Gross, Kevin Tian, Stefano Stabellini, Wei Liu,
	George Dunlap, Andrew Cooper, Ian Jackson, xen-devel,
	Jun Nakajima, Xiao Guangrong

On 08/04/16 03:25, Jan Beulich wrote:
> >>> On 04.08.16 at 10:52, <haozhong.zhang@intel.com> wrote:
> > On 08/03/16 17:25, Konrad Rzeszutek Wilk wrote:
> >> Anyhow, wouldn't this 'sizeof(struct page_info)' depend on the ndctl
> >> tool and what version was used to create this? What if one version
> >> used 32-bytes for a PAGE, while another used 64-bytes for a PAGE too?
> >> It would be a bit of catching up .. wait, this same problem MUST
> >> be with Linux? How does it deal with this?
> > 
> > Good question. Linux chooses a size (64 bytes) larger than its current
> > sizeof(struct page) (40 bytes). We may also do in the same way,
> > e.g. 32 bytes vs. 64 bytes?
> 
> I don't understand this: These structures aren't meant to be
> persistent, so their size shouldn't matter?
> 

But the size of the reserved area is persistent. If the size of struct
page_info increases in future versions, a reserved area which is
just enough for the old version of struct page_info would not be enough
for the new version.

Haozhong


* Re: [RFC Design Doc v2] Add vNVDIMM support for Xen
  2016-08-04  8:52   ` Haozhong Zhang
  2016-08-04  9:25     ` Jan Beulich
@ 2016-08-04 14:51     ` Konrad Rzeszutek Wilk
  2016-08-05  6:25       ` Haozhong Zhang
  1 sibling, 1 reply; 24+ messages in thread
From: Konrad Rzeszutek Wilk @ 2016-08-04 14:51 UTC (permalink / raw)
  To: xen-devel, Jan Beulich, George Dunlap, Andrew Cooper,
	Ian Jackson, Stefano Stabellini, Juergen Gross, Wei Liu, Tian,
	Kevin, Xiao Guangrong, Nakajima, Jun

> > >  Such a pmem namespace can be created via a userspace tool ndctl and
> > >  then recognized by Linux NVDIMM driver. However, they currently only
> > >  reserve space for Linux kernel's page structs. Therefore, our design
> > >  need to extend both Linux NVDIMM driver and ndctl to reserve
> > >  arbitrary size.
> > 
> > That seems .. fragile? What if Windows or FreeBSD want to use it
> > too?
> 
> AFAIK, the way used by current Linux NVDIMM driver for reservation has
> not been documented in any public specifications yet. I'll consult
> driver developers for more information.
> 
> > Would this 'struct page' on on NVDIMM be generalized enough
> > to work with Linux,Xen, FreeBSD and what not?
> >
> 
> No. Different operating systems may choose different data structures
> to manage NVDIMM according to their own requirements and
> consideration, so it would be hard to reach an agreement on what to
> put in a generic data structure (and make it as part of ABI?).

Yes. I can see different OSes having different sizes. And then
this size of the 'reserved region' ends up being too small and only
some part of the NVDIMM can be used.

> 
> > And this ndctl is https://github.com/pmem/ndctl I presume?
> 
> Yes. Sorry that I forgot to attach the URL.
> 
> >
> > And how is this structure reserved? Is it a seperate namespace entry?
> 
> No, it does not introduce any extra namespace entry. The current
> NVDIMM driver in Linux does the reservation in the way shown by the
> following diagram (I omit details about alignment and padding for
> simplicity):
> 
>  SPA  SPA+4K
>   |      |
>   V      V
>   +------+-----------+-- ... ---+-----...-----+
>   |      | nd_pfn_sb | reserved | free to use |
>   +------+-----------+-- ... ---+-----...-----+
>   |<--   nd_pfn_sb.dataoff   -->|             |
>   |    (+ necessary padding)                  |
>   |                                           |
>   |<------------- pmem namespace ------------>|
> 
> Given a pmem namespace which starts from SPA,

AAAAh, so it is at the start of the namespace! Thanks

>  1) the driver stores a struct nd_pfn_sb at SPA+4K
>  2) the reserved area is after nd_pfn_sb
>  3) the free-to-use area is after the reserved area, and its location
>     relative to SPA can be derived from nd_pfn_sb.dataoff
>  4) only the free-to-use area is exposed to a block device /dev/pmemX.
>     Access to sector N of /dev/pmemX actually goes to (SPA +
>     nd_pfn_sb.dataoff + N * SECT_SIZE)
>  5) nd_pfn_sb also contains a signature "NVDIMM_PFN_INFO" and a
>     checksum. If the driver finds such signature and the checksum
>     matches, then it knows this device contains reserved area.

/me nods.

And of course this nice diagram and such is going to be in
a public ABI document :-)
> 
> > And QEMU knows not to access it?
> 
> QEMU as a userspace program can only access /dev/pmemX and hence has
> no way to touch the reserved area.

Rightto.
> 
> > Or Xen needs to make sure _nobody_
> > except it can access it? Which means Xen may need to know the format
> > of the ndctl structures that are laid out in the NVDIMM region?
> >
> 
> Xen hypervisor relies on dom0 driver to parse the layout.  At Dom0
> boot, Dom0 NVDIMM driver reports address/size of area reserved for Xen
> to Xen hypervisor, which then unmaps the reserved area from Dom0.

OK, so the /dev/pmem driver would consult this when somebody is mmapping
the area. But since this would be removed from the driver (unregistered)
it would report a zero size?

Or would it "Otherwise, it will treat the pmem namespace as a raw device and
store page struct's in the normal RAM." - which means dom0 can still
access the SPA (except obviously the area that is for this reserved region)?

..snip..
> > >      guests. Otherwise, Xen hypervisor will recorded the reported
> > s/recorded/record/
> > >      parameters and create page_info structs in the reserved area.
> > 
> > Ohh. You just blast it away? I guess it makes sense. Then what is the
> > purpose of the ndctl? Just to carve out an namespace region for this?
> >
> 
> ndctl is used by, for example, a system admin to reserve space on a
> host pmem namespace. If there is already data in the namespace, ndctl
> will give a warning message and exit as long as --force option is not
> given. However, if --force is present, ndctl will break the existing
> data.
> 
> > And what if there is something there from previous OS (say Linux)?
> > Just blast it away? But could Linux depend on this containing some
> > persistent information? Or does it also blast it away?
> >
> 
> As above, if linux driver detects the signature "NVDIMM_PFN_INFO" and
> a matched checksum, it will know it's safe to write to the reserved
> area. Otherwise, it will treat the pmem namespace as a raw device and
> store page struct's in the normal RAM.

OK, so my worry is that we will have a divergence. Which is that
the system admin creates this under ndctl v0, boots Xen, and uses it.
Then moves the NVDIMM to another machine which has ndctl v1 and
he/she boots Linux.

Linux gets all confused b/c the region has something it can't understand
and the user is very angry.

So it sounds like the size that ndctl reserves MUST be baked into an ABI
and be able to expand if needed.

..snip..
> > This "balloon out" is interesting. You are effectively telling Linux
> > to ignore a certain range of 'struct page_info', so that if somebody
> > uses /sys/debug/kernel/page_walk it won't blow up? (As the kerne
> > can't read the struct page_info anymore).
> >
> > How would you do this? Simulate an NVDIMM unplug?
> 
> s/page_info/page/ (struct page for linux, struct page_info for xen)
> 
> As in Jan's comment, "balloon out" is a confusing name here.
> Basically, it's to remove the reserved area from some resource struct
> in nvdimm driver to avoid it's accessed out of the driver via the
> resource struct. And the nvdimm driver does not map the reserved area,
> so I think it cannot be touched via page_walk.

OK, I need to read the Linux code more to make sure I am
not missing something.

Basically the question that keeps revolving in my head is:

Why is this even necessary?

Let me expand - it feels like (and I think I am missing something
here) that we are crippling the Linux driver so that it won't
break - b/c if it tried to access the 'struct page_info' in this
reserved region it would crash. So we eliminate that, and make
the driver believe the region exists (is reserved), but it can't
use it. And instead use normal RAM pages to keep track
of the NVDIMM SPAs.

Or perhaps not keep track at all and just treat the whole
NVDIMM as opaque MMIO that is inaccessible?

But how will that work if there is a DAX filesystem on it?
The ext4 needs some mechanism to access the files that are there.
(Otherwise you couldn't use the fiemap ioctl to find the SPAs).

[see below]
> 
> > 
> > But if you do that how will SMART tools work anymore? And
> > who would do the _DSM checks on the health of the NVDIMM?
> >
> 
> A userspace SMART tool cannot access the reserved area, so I think it
> can still work. I haven't look at the implementation of any SMART
> tools for NVDIMM, but I guess they would finally call the driver to
> evaluate the ARS _DSM which reports the bad blocks. As long as the
> driver does not return the bad blocks in the reserved area to SMART
> tools (which I suppose to be handled by driver itself), SMART tools
> should work fine.
> 
> > /me scratches his head. Perhaps the answers are later in this
> > design..

So I think I figured out the issue here!!

You just want to have the Linux kernel driver use normal RAM
pages to keep track of the NVDIMM SPA ranges. As in treat
the NVDIMM as if it is normal RAM?

[Or is Linux treating this area as an MMIO region (in which case it does not
need struct page_info)??]

And then Xen can use this reserved region for its own
purpose!

Perhaps then the section that explains this 'reserved region' could
say something along:

"We need to keep track of the SPAs. The guest NVDIMM 'file'
on the NVDIMM may be in the worst case be randomly and in descending
discontingous order (say from the end of the NVDIMM), we need
to keep track of each of the SPAs. The reason is that we need
the SPAs when we populate the guest EPT.

As such we can store the guest SPAs in memory (a linear array?)
or a red-black tree, or any other structure - but all of them will
consume "normal RAM". And with a sufficiently large NVDIMM we may
not have enough 'normal RAM' to store this.

Also we only need to know these SPAs during guest creation,
destruction, ballooning, etc - hence we may store them on the
NVDIMM itself. Fortunately for us ndctl and Linux are
available, which carve out, right after the namespace region (128kb),
a 'reserved region' which the OS can use to store its
struct page_info to cover the full range of the NVDIMM.

The complexity in this is that:
 - We MUST make sure Linux does not try to use it while
   we use it.
 - The size of this 'reserved region' must be sufficiently
   large for our 'struct page_info' structures.
 - The layout has an ABI baked in.
 - Linux fs'es with DAX support MUST be able to mlock these SPA
   regions (so that nobody tries to remove the 'file' while
   a guest is using it).
 - Linux fs'es with DAX support MUST be able to resize the
   'file', thereby using more of the SPAs and rewriting the
   properties of the file on DAX (which should then cause a
   memory hotplug ACPI event in the guest, treating the new size of
   the file as a new NFIT region?)

"

I think that covers it?
..snip..
> > >  Our design takes the following method to avoid and detect collisions.
> > >  1) The data layout of area where QEMU copies its NFIT and ACPI
> > >     namespace devices is organized as below:
> > 
> > Why can't this be expressed in XenStore?
> > 
> > You could have /local/domain/domid/hvmloader/dm-acpi/<name>/{address,length, type}
> > ?
> >
> 
> If XenStore can be used, then it could save some guest memory.

It is also easier than relying on the format of a blob in memory.
> 
> This is a general mechanism to pass ACPI which and is not limited to
> NVDIMM, so it means QEMU may pass a lot of entries. I'm not sure if
> XenStore is still a proper place when the number is large. Maybe we
> should put an upper limit for the number of entries.

Why put a limit on it? It should easily handle thousands of <name>.
And the only attributes you have under <name> are just address,
length and type.
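
(i.e. something along the lines of the layout below; the paths and
values are made up purely to illustrate the idea:

    /local/domain/<domid>/hvmloader/dm-acpi/NVDR/address = "0xfc000000"
    /local/domain/<domid>/hvmloader/dm-acpi/NVDR/length  = "0x1000"
    /local/domain/<domid>/hvmloader/dm-acpi/NVDR/type    = "1"

with one <name> directory per passed-in table or namespace device.)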

.. snip..
> > > 4.3.2 Emulating Guest _DSM
> > > 
> > >  Our design leaves the emulation of guest _DSM to QEMU. Just as what
> > >  it does with KVM, QEMU registers the _DSM buffer as MMIO region with
> > >  Xen and then all guest evaluations of _DSM are trapped and emulated
> > >  by QEMU.
> > 
> > Sweet!
> > 
> > So one question that I am not if it has been answered, with the
> > 'struct page_info' being removed from the dom0 how will OEM _DSM method
> > operation? For example some of the AML code may asking to poke
> > at specific SPAs, but how will Linux do this properly without
> > 'struct page_info' be available?
> >
> 
> (s/page_info/page/)
> 
> The current Intel NVDIMM driver in Linux does not evaluate any OEM
> _DSM method, so I'm not sure whether the kernel has to access a NVDIMM
> page during evaluating _DSM.
> 
> The most close one in my mind, though not an OEM _DSM, is function 1
> of ARS _DSM, which requires inputs of a start SPA and a length in
> bytes. After kernel gives the inputs, the scrubbing of the specified
> area is done by the hardware and does not requires any mappings in OS.

<nods>
> 
> Any example of such OEM _DSM methods?

I can't think of any right now - but that is the danger of OEMs - they
may decide to do something .. ill-advised. Hence having it work
the same way as Linux is what we should strive for.


> 
> Thanks,
> Haozhong


* Re: [RFC Design Doc v2] Add vNVDIMM support for Xen
  2016-08-04  9:35       ` Haozhong Zhang
@ 2016-08-04 14:51         ` Konrad Rzeszutek Wilk
  0 siblings, 0 replies; 24+ messages in thread
From: Konrad Rzeszutek Wilk @ 2016-08-04 14:51 UTC (permalink / raw)
  To: Jan Beulich, Andrew Cooper, Wei Liu, George Dunlap, Ian Jackson,
	Jun Nakajima, Kevin Tian, Stefano Stabellini, Xiao Guangrong,
	xen-devel, Juergen Gross

On Thu, Aug 04, 2016 at 05:35:12PM +0800, Haozhong Zhang wrote:
> On 08/04/16 03:25, Jan Beulich wrote:
> > >>> On 04.08.16 at 10:52, <haozhong.zhang@intel.com> wrote:
> > > On 08/03/16 17:25, Konrad Rzeszutek Wilk wrote:
> > >> Anyhow, wouldn't this 'sizeof(struct page_info)' depend on the ndctl
> > >> tool and what version was used to create this? What if one version
> > >> used 32-bytes for a PAGE, while another used 64-bytes for a PAGE too?
> > >> It would be a bit of catching up .. wait, this same problem MUST
> > >> be with Linux? How does it deal with this?
> > > 
> > > Good question. Linux chooses a size (64 bytes) larger than its current
> > > sizeof(struct page) (40 bytes). We may also do in the same way,
> > > e.g. 32 bytes vs. 64 bytes?
> > 
> > I don't understand this: These structures aren't meant to be
> > persistent, so their size shouldn't matter?
> > 
> 
> But the size of the reserved area is persistent. If the size of struct
> page_info increases in future versions, the reserved area which is
> just enough for old version struct page_info would not be enough for
> the new version.

Exactly!!

Thanks!


* Re: [RFC Design Doc v2] Add vNVDIMM support for Xen
  2016-08-04 14:51     ` Konrad Rzeszutek Wilk
@ 2016-08-05  6:25       ` Haozhong Zhang
  2016-08-05 13:29         ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 24+ messages in thread
From: Haozhong Zhang @ 2016-08-05  6:25 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Juergen Gross, Tian, Kevin, Stefano Stabellini, Wei Liu,
	Nakajima, Jun, George Dunlap, Andrew Cooper, Ian Jackson,
	xen-devel, Jan Beulich, Xiao Guangrong

On 08/04/16 10:51, Konrad Rzeszutek Wilk wrote:
> > > >  Such a pmem namespace can be created via a userspace tool ndctl and
> > > >  then recognized by Linux NVDIMM driver. However, they currently only
> > > >  reserve space for Linux kernel's page structs. Therefore, our design
> > > >  need to extend both Linux NVDIMM driver and ndctl to reserve
> > > >  arbitrary size.
> > > 
> > > That seems .. fragile? What if Windows or FreeBSD want to use it
> > > too?
> > 
> > AFAIK, the way used by current Linux NVDIMM driver for reservation has
> > not been documented in any public specifications yet. I'll consult
> > driver developers for more information.
> > 
> > > Would this 'struct page' on on NVDIMM be generalized enough
> > > to work with Linux,Xen, FreeBSD and what not?
> > >
> > 
> > No. Different operating systems may choose different data structures
> > to manage NVDIMM according to their own requirements and
> > consideration, so it would be hard to reach an agreement on what to
> > put in a generic data structure (and make it as part of ABI?).
> 
> Yes. As I can see different OSes having different sizes. And then
> this size of 'reserved region' ends up being too small and only
> some part of the NVDIMM can be used.
>

If the reserved area is too small for some OSes, they may choose to
put their management data structures in normal RAM in order to map the
whole NVDIMM.

Possibly, a tool could be developed to adjust the reserved size without
breaking existing data (e.g. by moving data towards the end of the
namespace to leave room for a larger reserved area).
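
A very rough sketch of that idea, purely hypothetical: shift the data
region towards the end of the namespace and grow the data offset
recorded in the superblock. The superblock view, the checksum
placeholder and the assumption that the tail of the namespace is unused
are illustrative only; the real nd_pfn_sb layout and validation are
more involved.

/*
 * Purely hypothetical sketch of a "grow the reserved area without
 * losing data" tool.
 */
#include <stdint.h>
#include <string.h>

struct pfn_sb_view {            /* simplified view, not the real nd_pfn_sb */
    char     signature[16];     /* "NVDIMM_PFN_INFO"                       */
    uint64_t dataoff;           /* offset of the free-to-use area          */
    uint64_t checksum;
};

static uint64_t recompute_checksum(const struct pfn_sb_view *sb)
{
    (void)sb;
    return 0;                   /* placeholder for the real checksum */
}

/*
 * base/len: mapping of the whole namespace; new_dataoff > sb->dataoff.
 * Assumes the last (new_dataoff - old dataoff) bytes of the data region
 * are unused, so nothing is lost when the data shifts up.
 */
void grow_reserved_area(uint8_t *base, uint64_t len, uint64_t new_dataoff)
{
    struct pfn_sb_view *sb = (struct pfn_sb_view *)(base + 4096);
    uint64_t old_dataoff   = sb->dataoff;
    uint64_t keep          = len - new_dataoff;

    memmove(base + new_dataoff, base + old_dataoff, keep);  /* overlap-safe */

    sb->dataoff  = new_dataoff;
    sb->checksum = recompute_checksum(sb);
}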

> > 
> > > And this ndctl is https://github.com/pmem/ndctl I presume?
> > 
> > Yes. Sorry that I forgot to attach the URL.
> > 
> > >
> > > And how is this structure reserved? Is it a seperate namespace entry?
> > 
> > No, it does not introduce any extra namespace entry. The current
> > NVDIMM driver in Linux does the reservation in the way shown by the
> > following diagram (I omit details about alignment and padding for
> > simplicity):
> > 
> >  SPA  SPA+4K
> >   |      |
> >   V      V
> >   +------+-----------+-- ... ---+-----...-----+
> >   |      | nd_pfn_sb | reserved | free to use |
> >   +------+-----------+-- ... ---+-----...-----+
> >   |<--   nd_pfn_sb.dataoff   -->|             |
> >   |    (+ necessary padding)                  |
> >   |                                           |
> >   |<------------- pmem namespace ------------>|
> > 
> > Given a pmem namespace which starts from SPA,
> 
> AAAAh, so it is at start of the namespace! Thanks
> 
> >  1) the driver stores a struct nd_pfn_sb at SPA+4K
> >  2) the reserved area is after nd_pfn_sb
> >  3) the free-to-use area is after the reserved area, and its location
> >     relative to SPA can be derived from nd_pfn_sb.dataoff
> >  4) only the free-to-use area is exposed to a block device /dev/pmemX.
> >     Access to sector N of /dev/pmemX actually goes to (SPA +
> >     nd_pfn_sb.dataoff + N * SECT_SIZE)
> >  5) nd_pfn_sb also contains a signature "NVDIMM_PFN_INFO" and a
> >     checksum. If the driver finds such signature and the checksum
> >     matches, then it knows this device contains reserved area.
> 
> /me nods.
> 
> And of course this nice diagram and such is going to be in
> a public ABI document :-)
> > 
> > > And QEMU knows not to access it?
> > 
> > QEMU as a userspace program can only access /dev/pmemX and hence has
> > no way to touch the reserved area.
> 
> Rightto.
> > 
> > > Or Xen needs to make sure _nobody_
> > > except it can access it? Which means Xen may need to know the format
> > > of the ndctl structures that are laid out in the NVDIMM region?
> > >
> > 
> > Xen hypervisor relies on dom0 driver to parse the layout.  At Dom0
> > boot, Dom0 NVDIMM driver reports address/size of area reserved for Xen
> > to Xen hypervisor, which then unmaps the reserved area from Dom0.
> 
> OK, so the /dev/pmem driver would consult this when somebody is mmaping
> the area. But since this would be removed from the driver (unregistered)
> it would report an zero size?
>

The current pmem driver in Linux needs to be modified (which I'm doing)
to understand that the reserved area is unmapped and must never be
accessed.

> Or would it "Otherwise, it will treat the pmem namespace as a raw device and
> store page struct's in the normal RAM." - which means dom0 can still
> access the SPA (except obviously the area that is for this reserved region)?
>

Yes, a raw device means there is no reserved area and the entire pmem
namespace is free to use (i.e. the free-to-use area in the diagram
above covers the entire pmem namespace).
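
For illustration, a minimal userspace sketch of that distinction,
assuming a device node that exposes the namespace from byte 0 (i.e. the
namespace in raw mode): it only checks the "NVDIMM_PFN_INFO" signature
at namespace start + 4K, whereas the real driver also validates the
checksum, UUID and version fields; the struct layout below is
simplified and illustrative.

/*
 * Minimal sketch: decide whether a pmem namespace carries an info block
 * at start + 4K or is a raw device whose whole capacity is free to use.
 */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define SB_OFFSET 4096
#define SIG       "NVDIMM_PFN_INFO"

struct pfn_sb_probe {            /* simplified, illustrative layout */
    char     signature[16];
    uint8_t  uuid[16];
    uint8_t  parent_uuid[16];
    uint32_t flags;
    uint16_t version_major, version_minor;
    uint64_t dataoff;            /* start of the free-to-use area   */
};

int main(int argc, char **argv)
{
    struct pfn_sb_probe sb;
    int fd;

    if (argc < 2) {
        fprintf(stderr, "usage: %s <raw-namespace-device>\n", argv[0]);
        return 1;
    }

    fd = open(argv[1], O_RDONLY);
    if (fd < 0 || pread(fd, &sb, sizeof(sb), SB_OFFSET) != (ssize_t)sizeof(sb))
        return 1;

    if (!memcmp(sb.signature, SIG, sizeof(SIG) - 1))
        printf("info block found, data starts at offset %llu\n",
               (unsigned long long)sb.dataoff);
    else
        printf("raw namespace: entire capacity is free to use\n");

    close(fd);
    return 0;
}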

> ..snip..
> > > >      guests. Otherwise, Xen hypervisor will recorded the reported
> > > s/recorded/record/
> > > >      parameters and create page_info structs in the reserved area.
> > > 
> > > Ohh. You just blast it away? I guess it makes sense. Then what is the
> > > purpose of the ndctl? Just to carve out an namespace region for this?
> > >
> > 
> > ndctl is used by, for example, a system admin to reserve space on a
> > host pmem namespace. If there is already data in the namespace, ndctl
> > will give a warning message and exit as long as --force option is not
> > given. However, if --force is present, ndctl will break the existing
> > data.
> > 
> > > And what if there is something there from previous OS (say Linux)?
> > > Just blast it away? But could Linux depend on this containing some
> > > persistent information? Or does it also blast it away?
> > >
> > 
> > As above, if linux driver detects the signature "NVDIMM_PFN_INFO" and
> > a matched checksum, it will know it's safe to write to the reserved
> > area. Otherwise, it will treat the pmem namespace as a raw device and
> > store page struct's in the normal RAM.
> 
> OK, so my worry is that we will have a divergence. Which is that
> the system admin creates this under ndctl v0, boots Xen uses it.
> Then moves the NVDIMM to another machine which has ndctl v1 and
> he/she boots in Linux.
> 
> Linux gets all confused b/c the region has something it can't understand
> and the user is very angry.
> 
> So it sounds like the size the ndctl reserves MUST be baked in an ABI
> and made sure to expand if needed.
>

ndctl is a management tool which passes all its requests to the driver
via sysfs, so compatibility across different versions of Linux is
actually determined by the different versions of the driver.

Newer versions of the driver should provide backwards compatibility
with previous versions (which is the current drivers'
behavior). However, forwards compatibility is hard to preserve,
e.g.
 - an old driver without reserved area support (e.g. the one in Linux
   kernel 4.2) recognizes a pmem namespace with a reserved area as a
   raw device and may write to the reserved area. If it's a Xen
   reserved area and the driver is in dom0, the dom0 kernel will crash.

 - the same crash would happen if an old driver with reserved area
   support but without Xen reserved area support (e.g. the one in Linux
   kernel 4.7) is used for a pmem namespace with a Xen reserved area.

For cross-OS compatibility, there is an effort to standardize the
reservation. In the meantime, only Linux is capable of handling pmem
namespaces with a reserved area.

> ..snip..
> > > This "balloon out" is interesting. You are effectively telling Linux
> > > to ignore a certain range of 'struct page_info', so that if somebody
> > > uses /sys/debug/kernel/page_walk it won't blow up? (As the kerne
> > > can't read the struct page_info anymore).
> > >
> > > How would you do this? Simulate an NVDIMM unplug?
> > 
> > s/page_info/page/ (struct page for linux, struct page_info for xen)
> > 
> > As in Jan's comment, "balloon out" is a confusing name here.
> > Basically, it's to remove the reserved area from some resource struct
> > in nvdimm driver to avoid it's accessed out of the driver via the
> > resource struct. And the nvdimm driver does not map the reserved area,
> > so I think it cannot be touched via page_walk.
> 
> OK, I need to read the Linux code more to make sure I am
> not missing something.
> 
> Basically the question that keeps revolving in my head is:
> 
> Why is this even neccessary?
> 
> Let me expand - it feels like (and I think I am missing something
> here) that we are crippling the Linux driver so that it won't
> break - b/c if it tried to access the 'strut page_info' in this
> reserved region it would crash. So we eliminate that, and make
> the driver believe the region exists (is reserved), but it can't
> use it. And instead use the normal RAM pages to keep track
> of the NVDIMM SPAs.
> 
> Or perhaps not keep track at all and just treat the whole
> NVDIMM as opaque MMIO that is inaccessible?
>

If we trust the driver in the dom0 kernel to always do the correct
thing (and we can trust it, right?), no crash will happen. However, as
Jan commented (https://lists.xenproject.org/archives/html/xen-devel/2016-08/msg00433.html):

| Right now Dom0 isn't allowed to access any memory in use by Xen
| (and not explicitly shared), and I don't think we should deviate
| from that model for pmem.

The Xen hypervisor must explicitly disallow dom0 from accessing the
reserved area.

> But how will that work if there is a DAX filesystem on it?
> The ext4 needs some mechanism to access the files that are there.
> (Otherwise you couldn't use the fiemap ioctl to find the SPAs).
>

No, the file system does not touch the reserved area. If a reserved
area exists, the start SPA of /dev/pmem0 reported via sysfs is the
start SPA of the reserved area, so fiemap can still work.
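
As a side note, here is a sketch of how a toolstack might turn a file
on a DAX filesystem into SPA ranges with the FIEMAP ioctl, assuming
fe_physical is the extent's offset within /dev/pmemX and that the
device's start SPA is known from sysfs (passed on the command line here
for simplicity). These are the offsets a toolstack would later need
when populating the guest EPT.

/*
 * Sketch: translate a file on a DAX filesystem into SPA ranges via the
 * FIEMAP ioctl.  fe_physical is the extent's offset within the pmem
 * block device; adding the device's start SPA yields the SPA.
 */
#include <fcntl.h>
#include <inttypes.h>
#include <linux/fiemap.h>
#include <linux/fs.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>

#define MAX_EXTENTS 32

int main(int argc, char **argv)
{
    if (argc < 3) {
        fprintf(stderr, "usage: %s <file-on-dax-fs> <pmem-device-start-spa>\n",
                argv[0]);
        return 1;
    }

    uint64_t dev_spa = strtoull(argv[2], NULL, 0);
    int fd = open(argv[1], O_RDONLY);
    struct fiemap *fm = calloc(1, sizeof(*fm) +
                               MAX_EXTENTS * sizeof(struct fiemap_extent));

    if (fd < 0 || !fm)
        return 1;

    fm->fm_length = FIEMAP_MAX_OFFSET;     /* map the whole file   */
    fm->fm_flags = FIEMAP_FLAG_SYNC;       /* flush before mapping */
    fm->fm_extent_count = MAX_EXTENTS;

    if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0)
        return 1;

    for (uint32_t i = 0; i < fm->fm_mapped_extents; i++) {
        struct fiemap_extent *fe = &fm->fm_extents[i];

        printf("file offset 0x%" PRIx64 " -> SPA 0x%" PRIx64 " (len 0x%" PRIx64 ")\n",
               (uint64_t)fe->fe_logical, dev_spa + fe->fe_physical,
               (uint64_t)fe->fe_length);
    }
    return 0;
}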

> [see below]
> > 
> > > 
> > > But if you do that how will SMART tools work anymore? And
> > > who would do the _DSM checks on the health of the NVDIMM?
> > >
> > 
> > A userspace SMART tool cannot access the reserved area, so I think it
> > can still work. I haven't look at the implementation of any SMART
> > tools for NVDIMM, but I guess they would finally call the driver to
> > evaluate the ARS _DSM which reports the bad blocks. As long as the
> > driver does not return the bad blocks in the reserved area to SMART
> > tools (which I suppose to be handled by driver itself), SMART tools
> > should work fine.
> > 
> > > /me scratches his head. Perhaps the answers are later in this
> > > design..
> 
> So I think I figured out the issue here!!
> 
> You just want to have the Linux kernel driver to use normal RAM
> pages to keep track of the NVDIMM SPA ranges.

Yes, this is what the current driver does for a raw device.

> As in treat the NVDIMM as if it is normal RAM?

If you are talking about the location of the page structs, then yes.
The page structs for NVDIMM are put in normal RAM, just like the page
structs for normal RAM itself. But NVDIMM can never, for example, be
allocated via the kernel memory allocators (buddy/slab/etc.).

> 
> [Or is Linux treating this area as MMIO region (in wihch case it does not
> need struct page_info)??]
>
> And then Xen can use this reserved region for its own
> purpose!
> 
> Perhaps then the section that explains this 'reserved region' could
> say something along:
> 
> "We need to keep track of the SPAs. The guest NVDIMM 'file'
> on the NVDIMM may be in the worst case be randomly and in descending
> discontingous order (say from the end of the NVDIMM), we need
> to keep track of each of the SPAs. The reason is that we need
> the SPAs when we populate the guest EPT.
> 
> As such we can store the guest SPA in memory (linear array?)
> or red-black tree, or any other - but all of them will consume
> "normal RAM". And with sufficient large enough NVDIMM we may
> not have enough 'normal RAM' to store this.
> 
> Also we only need to know these SPAs during guest creation,
> destruction, ballooning, etc - hence we may store them on the
> NVDIMM itself. Fortunatly for us the ndctl and Linux are
> available which carve out right after the namespace region (128kb)
> and 'reserved region' which the OS can use to store its
> struct page_info to cover the full range of the NVDIMM.
> 
> The complexity in this is that:
>  - We MUST make sure Linux does not try to use it while
>    we use it.
>  - That the size of this 'reserved region' is sufficiently
>    large for our 'struct page_info' structure.
>  - The layout has an ABI baked.
>  - Linux fs'es with DAX support MUST be able mlock these SPA
>    regions (so that nobody tries to remove the 'file' while
>    a guest is using it).

I need to check whether linux currently does this.

>  - Linus fs'es with DAX support MUST be able to resize the
>    'file', hereby using more of the SPAs and rewritting the
>    properties of the file on DAX (which should then cause an
>    memory hotplug ACPI in the guest treating the new size of
>    the file as new NFIT region?)
>

My current plan for the first implementation is to disallow such
resizing, and possibly other out-of-guest changes, while the file is
being used by a guest (akin to a disk). This is mostly for simplicity,
and we can add it in the future. For hotplug, we can pass another file
to the guest as a new pmem namespace.

> "
> 
> I think that covers it?
> ..snip..
> > > >  Our design takes the following method to avoid and detect collisions.
> > > >  1) The data layout of area where QEMU copies its NFIT and ACPI
> > > >     namespace devices is organized as below:
> > > 
> > > Why can't this be expressed in XenStore?
> > > 
> > > You could have /local/domain/domid/hvmloader/dm-acpi/<name>/{address,length, type}
> > > ?
> > >
> > 
> > If XenStore can be used, then it could save some guest memory.
> 
> It is also easier than relaying on the format of a blob in memory.
> > 
> > This is a general mechanism to pass ACPI which and is not limited to
> > NVDIMM, so it means QEMU may pass a lot of entries. I'm not sure if
> > XenStore is still a proper place when the number is large. Maybe we
> > should put an upper limit for the number of entries.
> 
> Why put a limit on it? It should easily handle thousands of <name>.
> And the only attributes you have under <name> are just address,
> length and type.
>

OK, if that's not a problem, I will use XenStore to pass that
information.
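
For reference, a sketch of what writing one such entry with libxenstore
might look like, assuming the
/local/domain/<domid>/hvmloader/dm-acpi/<name>/{address,length,type}
layout suggested above; the path layout, key names and values are
illustrative and not settled. Reading the keys back from hvmloader
would use xs_read() on the same paths.

/*
 * Sketch: publish one guest ACPI blob descriptor via XenStore using the
 * path layout suggested above.  Paths, key names and values are
 * illustrative only.
 */
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <xenstore.h>

static bool write_key(struct xs_handle *xs, const char *base,
                      const char *key, const char *val)
{
    char path[256];

    snprintf(path, sizeof(path), "%s/%s", base, key);
    return xs_write(xs, XBT_NULL, path, val, strlen(val));
}

int main(void)
{
    unsigned int domid = 1;                  /* example guest domid */
    char base[256], val[32];
    struct xs_handle *xs = xs_open(0);

    if (!xs)
        return 1;

    snprintf(base, sizeof(base),
             "/local/domain/%u/hvmloader/dm-acpi/nfit", domid);

    snprintf(val, sizeof(val), "0x%lx", 0xfc000000UL);   /* example address */
    write_key(xs, base, "address", val);
    snprintf(val, sizeof(val), "%u", 4096);              /* example length  */
    write_key(xs, base, "length", val);
    write_key(xs, base, "type", "table");                /* example type    */

    xs_close(xs);
    return 0;
}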

> .. snip..
> > > > 4.3.2 Emulating Guest _DSM
> > > > 
> > > >  Our design leaves the emulation of guest _DSM to QEMU. Just as what
> > > >  it does with KVM, QEMU registers the _DSM buffer as MMIO region with
> > > >  Xen and then all guest evaluations of _DSM are trapped and emulated
> > > >  by QEMU.
> > > 
> > > Sweet!
> > > 
> > > So one question that I am not if it has been answered, with the
> > > 'struct page_info' being removed from the dom0 how will OEM _DSM method
> > > operation? For example some of the AML code may asking to poke
> > > at specific SPAs, but how will Linux do this properly without
> > > 'struct page_info' be available?
> > >
> > 
> > (s/page_info/page/)
> > 
> > The current Intel NVDIMM driver in Linux does not evaluate any OEM
> > _DSM method, so I'm not sure whether the kernel has to access a NVDIMM
> > page during evaluating _DSM.
> > 
> > The most close one in my mind, though not an OEM _DSM, is function 1
> > of ARS _DSM, which requires inputs of a start SPA and a length in
> > bytes. After kernel gives the inputs, the scrubbing of the specified
> > area is done by the hardware and does not requires any mappings in OS.
> 
> <nods>
> > 
> > Any example of such OEM _DSM methods?
> 
> I can't think of any right now - but that is the danger of OEMs - they
> may decide to do something .. ill advisable. Hence having it work
> the same way as Linux is what we should strive for.
> 

I see: though the evaluation itself does not use any software
maintained mappings, the driver may use them when handling the result
of the evaluation, e.g. if ARS _DSM reports bad blocks in the reserved
area, the driver may then have to access the reserved area (though
this could never happen in the current kernel because the driver does
ARS before the reservation).

Currently there is no OEM _DSM support in the Linux kernel, so I cannot
think of a concrete solution. However, if such an OEM _DSM appears, we
may add Xen-specific handling to the driver or introduce a way in the
nvdimm driver framework to avoid accessing the reserved area in certain
circumstances (e.g. when used in Xen dom0).

Thanks,
Haozhong


* Re: [RFC Design Doc v2] Add vNVDIMM support for Xen
  2016-08-05  6:25       ` Haozhong Zhang
@ 2016-08-05 13:29         ` Konrad Rzeszutek Wilk
  0 siblings, 0 replies; 24+ messages in thread
From: Konrad Rzeszutek Wilk @ 2016-08-05 13:29 UTC (permalink / raw)
  To: xen-devel, Jan Beulich, George Dunlap, Andrew Cooper,
	Ian Jackson, Stefano Stabellini, Juergen Gross, Wei Liu, Tian,
	Kevin, Xiao Guangrong, Nakajima, Jun

> > > As above, if linux driver detects the signature "NVDIMM_PFN_INFO" and
> > > a matched checksum, it will know it's safe to write to the reserved
> > > area. Otherwise, it will treat the pmem namespace as a raw device and
> > > store page struct's in the normal RAM.
> > 
> > OK, so my worry is that we will have a divergence. Which is that
> > the system admin creates this under ndctl v0, boots Xen uses it.
> > Then moves the NVDIMM to another machine which has ndctl v1 and
> > he/she boots in Linux.
> > 
> > Linux gets all confused b/c the region has something it can't understand
> > and the user is very angry.
> > 
> > So it sounds like the size the ndctl reserves MUST be baked in an ABI
> > and made sure to expand if needed.
> >
> 
> ndctl is a management tool which passes all its requests to the driver
> via sysfs, so compatibility across different versions of Linux is
> actually determined by the different versions of the driver.
> 
> Newer versions of the driver should provide backwards compatibility
> with previous versions (which is the current drivers'
> behavior). However, forwards compatibility is hard to preserve,
> e.g.
>  - an old driver without reserved area support (e.g. the one in Linux
>    kernel 4.2) recognizes a pmem namespace with a reserved area as a
>    raw device and may write to the reserved area. If it's a Xen
>    reserved area and the driver is in dom0, the dom0 kernel will crash.

Yikes!
>    
>  - the same crash would happen if an old driver with reserved area
>    support but without Xen reserved area support (e.g. the one in Linux
>    kernel 4.7) is used for a pmem namespace with a Xen reserved area.
> 
> For cross-OS compatibility, there is an effort to standardize the
> reservation. In the meantime, only Linux is capable of handling pmem
> namespaces with a reserved area.

It may be good to mention the difficulties you enumerated in the design
doc, so that if somebody does end up in this position and searches for
it, they can find a reference.

> 
> > ..snip..
> > > > This "balloon out" is interesting. You are effectively telling Linux
> > > > to ignore a certain range of 'struct page_info', so that if somebody
> > > > uses /sys/debug/kernel/page_walk it won't blow up? (As the kerne
> > > > can't read the struct page_info anymore).
> > > >
> > > > How would you do this? Simulate an NVDIMM unplug?
> > > 
> > > s/page_info/page/ (struct page for linux, struct page_info for xen)
> > > 
> > > As in Jan's comment, "balloon out" is a confusing name here.
> > > Basically, it's to remove the reserved area from some resource struct
> > > in nvdimm driver to avoid it's accessed out of the driver via the
> > > resource struct. And the nvdimm driver does not map the reserved area,
> > > so I think it cannot be touched via page_walk.
> > 
> > OK, I need to read the Linux code more to make sure I am
> > not missing something.
> > 
> > Basically the question that keeps revolving in my head is:
> > 
> > Why is this even neccessary?
> > 
> > Let me expand - it feels like (and I think I am missing something
> > here) that we are crippling the Linux driver so that it won't
> > break - b/c if it tried to access the 'strut page_info' in this
> > reserved region it would crash. So we eliminate that, and make
> > the driver believe the region exists (is reserved), but it can't
> > use it. And instead use the normal RAM pages to keep track
> > of the NVDIMM SPAs.
> > 
> > Or perhaps not keep track at all and just treat the whole
> > NVDIMM as opaque MMIO that is inaccessible?
> >
> 
> If we trust the driver in the dom0 kernel to always do the correct
> thing (and we can trust it, right?), no crash will happen. However, as
> Jan commented (https://lists.xenproject.org/archives/html/xen-devel/2016-08/msg00433.html):
> 
> | Right now Dom0 isn't allowed to access any memory in use by Xen
> | (and not explicitly shared), and I don't think we should deviate
> | from that model for pmem.
> 
> The Xen hypervisor must explicitly disallow dom0 from accessing the
> reserved area.

Right.
> 
> > But how will that work if there is a DAX filesystem on it?
> > The ext4 needs some mechanism to access the files that are there.
> > (Otherwise you couldn't use the fiemap ioctl to find the SPAs).
> >
> 
> No, the file system does not touch the reserved area. If a reserved

Ah, OK!
> area exists, the start SPA of /dev/pmem0 reported via sysfs is the
> start SPA of the reserved area, so fiemap can still work.
> 
> > [see below]
> > > 
> > > > 
> > > > But if you do that how will SMART tools work anymore? And
> > > > who would do the _DSM checks on the health of the NVDIMM?
> > > >
> > > 
> > > A userspace SMART tool cannot access the reserved area, so I think it
> > > can still work. I haven't look at the implementation of any SMART
> > > tools for NVDIMM, but I guess they would finally call the driver to
> > > evaluate the ARS _DSM which reports the bad blocks. As long as the
> > > driver does not return the bad blocks in the reserved area to SMART
> > > tools (which I suppose to be handled by driver itself), SMART tools
> > > should work fine.
> > > 
> > > > /me scratches his head. Perhaps the answers are later in this
> > > > design..
> > 
> > So I think I figured out the issue here!!
> > 
> > You just want to have the Linux kernel driver to use normal RAM
> > pages to keep track of the NVDIMM SPA ranges.
> 
> Yes, this is what the current driver does for a raw device.
> 
> > As in treat the NVDIMM as if it is normal RAM?
> 
> If you are talking about the location of the page structs, then yes.
> The page structs for NVDIMM are put in normal RAM, just like the page
> structs for normal RAM itself. But NVDIMM can never, for example, be
> allocated via the kernel memory allocators (buddy/slab/etc.).

Right. I was thinking of page struct location.
> 
> > 
> > [Or is Linux treating this area as MMIO region (in wihch case it does not
> > need struct page_info)??]
> >
> > And then Xen can use this reserved region for its own
> > purpose!
> > 
> > Perhaps then the section that explains this 'reserved region' could
> > say something along:
> > 
> > "We need to keep track of the SPAs. The guest NVDIMM 'file'
> > on the NVDIMM may be in the worst case be randomly and in descending
> > discontingous order (say from the end of the NVDIMM), we need
> > to keep track of each of the SPAs. The reason is that we need
> > the SPAs when we populate the guest EPT.
> > 
> > As such we can store the guest SPA in memory (linear array?)
> > or red-black tree, or any other - but all of them will consume
> > "normal RAM". And with sufficient large enough NVDIMM we may
> > not have enough 'normal RAM' to store this.
> > 
> > Also we only need to know these SPAs during guest creation,
> > destruction, ballooning, etc - hence we may store them on the
> > NVDIMM itself. Fortunatly for us the ndctl and Linux are
> > available which carve out right after the namespace region (128kb)
> > and 'reserved region' which the OS can use to store its
> > struct page_info to cover the full range of the NVDIMM.
> > 
> > The complexity in this is that:
> >  - We MUST make sure Linux does not try to use it while
> >    we use it.
> >  - That the size of this 'reserved region' is sufficiently
> >    large for our 'struct page_info' structure.
> >  - The layout has an ABI baked.
> >  - Linux fs'es with DAX support MUST be able mlock these SPA
> >    regions (so that nobody tries to remove the 'file' while
> >    a guest is using it).
> 
> I need to check whether linux currently does this.
> 
> >  - Linus fs'es with DAX support MUST be able to resize the
> >    'file', hereby using more of the SPAs and rewritting the
> >    properties of the file on DAX (which should then cause an
> >    memory hotplug ACPI in the guest treating the new size of
> >    the file as new NFIT region?)
> >
> 
> My current plan for the first implementation is to disallow such
> resizing, and possibly other out-of-guest changes, while the file is
> being used by a guest (akin to a disk). This is mostly for simplicity,
> and we can add it in the future. For hotplug, we can pass another file
> to the guest as a new pmem namespace.
> 
> > "
> > 
> > I think that covers it?
> > ..snip..
> > > > >  Our design takes the following method to avoid and detect collisions.
> > > > >  1) The data layout of area where QEMU copies its NFIT and ACPI
> > > > >     namespace devices is organized as below:
> > > > 
> > > > Why can't this be expressed in XenStore?
> > > > 
> > > > You could have /local/domain/domid/hvmloader/dm-acpi/<name>/{address,length, type}
> > > > ?
> > > >
> > > 
> > > If XenStore can be used, then it could save some guest memory.
> > 
> > It is also easier than relaying on the format of a blob in memory.
> > > 
> > > This is a general mechanism to pass ACPI which and is not limited to
> > > NVDIMM, so it means QEMU may pass a lot of entries. I'm not sure if
> > > XenStore is still a proper place when the number is large. Maybe we
> > > should put an upper limit for the number of entries.
> > 
> > Why put a limit on it? It should easily handle thousands of <name>.
> > And the only attributes you have under <name> are just address,
> > length and type.
> >
> 
> OK, if that's not a problem, I will use XenStore to pass that
> information.
> 
> > .. snip..
> > > > > 4.3.2 Emulating Guest _DSM
> > > > > 
> > > > >  Our design leaves the emulation of guest _DSM to QEMU. Just as what
> > > > >  it does with KVM, QEMU registers the _DSM buffer as MMIO region with
> > > > >  Xen and then all guest evaluations of _DSM are trapped and emulated
> > > > >  by QEMU.
> > > > 
> > > > Sweet!
> > > > 
> > > > So one question that I am not if it has been answered, with the
> > > > 'struct page_info' being removed from the dom0 how will OEM _DSM method
> > > > operation? For example some of the AML code may asking to poke
> > > > at specific SPAs, but how will Linux do this properly without
> > > > 'struct page_info' be available?
> > > >
> > > 
> > > (s/page_info/page/)
> > > 
> > > The current Intel NVDIMM driver in Linux does not evaluate any OEM
> > > _DSM method, so I'm not sure whether the kernel has to access a NVDIMM
> > > page during evaluating _DSM.
> > > 
> > > The most close one in my mind, though not an OEM _DSM, is function 1
> > > of ARS _DSM, which requires inputs of a start SPA and a length in
> > > bytes. After kernel gives the inputs, the scrubbing of the specified
> > > area is done by the hardware and does not requires any mappings in OS.
> > 
> > <nods>
> > > 
> > > Any example of such OEM _DSM methods?
> > 
> > I can't think of any right now - but that is the danger of OEMs - they
> > may decide to do something .. ill advisable. Hence having it work
> > the same way as Linux is what we should strive for.
> > 
> 
> I see: though the evaluation itself does not use any software
> maintained mappings, the driver may use them when handling the result
> of the evaluation, e.g. if ARS _DSM reports bad blocks in the reserved
> area, the driver may then have to access the reserved area (though
> this could never happen in the current kernel because the driver does
> ARS before the reservation).
> 
> Currently there is no OEM _DSM support in the Linux kernel, so I cannot
> think of a concrete solution. However, if such an OEM _DSM appears, we
> may add Xen-specific handling to the driver or introduce a way in the
> nvdimm driver framework to avoid accessing the reserved area in certain
> circumstances (e.g. when used in Xen dom0).

Thanks!
> 
> Thanks,
> Haozhong


end of thread (newest: 2016-08-05 13:29 UTC)

Thread overview: 24+ messages
2016-07-18  0:29 [RFC Design Doc v2] Add vNVDIMM support for Xen Haozhong Zhang
2016-07-18  8:36 ` Tian, Kevin
2016-07-18  9:01   ` Zhang, Haozhong
2016-07-19  0:58     ` Tian, Kevin
2016-07-19  2:10       ` Zhang, Haozhong
2016-07-19  1:57 ` Bob Liu
2016-07-19  2:40   ` Haozhong Zhang
2016-08-02 14:46 ` Jan Beulich
2016-08-03  6:54   ` Haozhong Zhang
2016-08-03  8:45     ` Jan Beulich
2016-08-03  9:37       ` Haozhong Zhang
2016-08-03  9:47         ` Jan Beulich
2016-08-03 10:08           ` Haozhong Zhang
2016-08-03 10:18             ` Jan Beulich
2016-08-03 21:25 ` Konrad Rzeszutek Wilk
2016-08-03 23:16   ` Konrad Rzeszutek Wilk
2016-08-04  1:51     ` Haozhong Zhang
2016-08-04  8:52   ` Haozhong Zhang
2016-08-04  9:25     ` Jan Beulich
2016-08-04  9:35       ` Haozhong Zhang
2016-08-04 14:51         ` Konrad Rzeszutek Wilk
2016-08-04 14:51     ` Konrad Rzeszutek Wilk
2016-08-05  6:25       ` Haozhong Zhang
2016-08-05 13:29         ` Konrad Rzeszutek Wilk
